[voiceover] National Data Archive on Child Abuse and Neglect. [Erin McCauley] All right, it is noon, so let's get this party started. So welcome to the fifth session of the 2021 NDACAN Summer Training Series. As a reminder, the sessions are all recorded and then we'll turn the videos into a webinar series which we'll post on our website at the end of the summer. You can also check out previous years' Summer Series because they often have different topics and different workshops. So as I said, this is the NDACAN Summer Training Series. NDACAN stands for the National Data Archive on Child Abuse and Neglect. If this is your first time with us, welcome; we archive data on child abuse and neglect, both survey-based data and administrative data, and we are hosted at both Cornell University and Duke University. If you are a returning user who's just back in the neighborhood, then welcome back. We have shifted from being based just at Cornell to also being based at Duke, so our staff are actually split across both universities, where we can kind of leverage the resources of both, but all of our staff are still with us; we're just split between the universities. The theme of this year's Summer Training Series is Data Strategies for the Study of Child Welfare. The presentations are all done by NDACAN staff; we've also had some introductions by folks at the Children's Bureau, who give us the contracts to archive all of this data. We picked data strategies as the theme for this year's Summer Training Series based on feedback from participants in last year's series. So at the end of this summer you will receive a survey from me asking about your experience with the Summer Training Series. We use this to evaluate the series, but it's also how we've developed the themes for all the different Summer Series we've done so far, as well as the individual topics, so if you have any ideas about what we can do with future summers or things you'd like to see, make sure to throw them into that survey; that way we can be as responsive as possible to the wishes of our data users. So here's our overview of the summer. I cannot believe that we are already at our second-to-last session, but so far this summer we started out with an introduction to NDACAN; we went over the data products that we have and the services and supports that we offer. Then we had Holly Larrabee in to do a presentation on the survey-based data cluster. This is the first time this topic was included, so if you are interested in survey-based data make sure you check that out when it comes out at the end of the summer. Then we went through administrative data and linking, then we had a presentation on our new data sets, the VCIS data, and then how that can be used to study special populations. Today we have Frank Edwards and Sarah Sernaker here, and they are going to be doing a workshop on multilevel modeling, kind of going over the theory of it and then giving examples with our data. And then next week we'll be rounding out the series with a workshop on latent class analysis, which will again be done by Sarah Sernaker. I also wanted to let you know, if you've been with us all summer and you see or hear me every time, that I will not be here for the last one of the year, so Clayton will be leading the charge on that along with Sarah and Andres. So I will now pass it off to our presenters, Frank and Sarah. Take it away! [Sarah Sernaker] Hi everyone, I'm Sarah Sernaker and I am half of the presenters here.
I'll be doing the first half of this presentation and then I'll pass it on to Frank Edwards, who will go through a little example in R of how to apply multilevel modeling. And so this is our agenda today: first we'll just describe the basic components of a multilevel model, some light theory to introduce what we're talking about. We'll briefly describe multilevel structure in the context of one of our data holdings, the AFCARS data, which is foster care data. So I'll give a little more information when that comes up. And then Frank will do, like I said, an example in R using the AFCARS data to show how to actually fit multilevel models. So, understanding multilevel data. What are we talking about here? So when you think about multilevel models you can compare them to models you're used to, which are regular regression models like ordinary least squares, and those types of models involve data that's IID, which is independent and identically distributed. And so what does that mean? The observations are independent of each other and the observations come from the same data generating process. So what that really means is observations are not only independent but they center around some shared mean and they have some degree of normal variability. So, simple example: if you were to model someone's weight, let's say, so weight is your response variable and let's say height is your predictor variable, you would take a sample of people, their heights and weights, and you'd expect all of these observations to be independent, right? Like my height and weight are independent of my mother's height and weight, or, sorry, that was a bad example, that's not independent, anyway. So that's the same idea. So the observations are independent and they are centered around some shared mean with some degree of variability. And that's what we're saying with the data generating process. So when we're talking about multilevel models, the thing that does not hold is the independence. And so we have a quick example here to introduce this idea. So let's say we take a sample of students who take a test three times during the year. Okay, all students are taking three tests throughout the year. The test is administered at three different schools, and when taking the sample we observe each student's test scores, the test wave, so which test was it, the first one, the second one, the third one, the school the student is from, and the student's family's adjusted gross income. Because the idea is that we want to know whether students from higher-income families score higher on their tests than their lower-income peers, and what the relationship is there. So to kind of outline this, because that's a lot to keep in mind, we have three different schools. Here we just have school one and school two. And then we have students within the school, so student one, student two, student three, etc. And similarly for schools two and three. And each student is taking three tests over the year, okay. And so this data would not be IID, because each student's test scores, or even the scores within a school, would not be independent. Most likely a student's scores on tests two and three will be highly correlated with their previous test scores. It's not just a random data generating procedure here. And so each test score is produced by a student i, so now we're just introducing a little bit of notation. So each test score is produced by student i, taught by teacher t, within school s.
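(A minimal sketch in R of the kind of nested data Sarah is describing, and of the pooled-versus-grouped comparison she goes on to show; the data are simulated and the variable names, effect sizes, and code are illustrative assumptions, not the presenters' actual demo code.)
# Toy sketch of the nesting: schools > students > three test waves per student.
set.seed(42)
students <- expand.grid(school = c("A", "B", "C"), student = 1:10)
students$income <- runif(nrow(students), 30, 120)      # family income (thousands), one value per student
tests <- merge(students, data.frame(wave = 1:3))       # each student appears once per test wave
school_shift <- c(A = -10, B = 15, C = 0)              # school-level differences in scores
tests$score <- 60 + 0.15 * tests$income +
  school_shift[as.character(tests$school)] +
  rnorm(nrow(tests), sd = 5)

# Ignoring the grouping versus letting each school have its own intercept:
fit_pooled  <- lm(score ~ income, data = tests)
fit_grouped <- lm(score ~ income + school, data = tests)
summary(fit_pooled)$sigma    # residual spread when school is ignored
summary(fit_grouped)$sigma   # typically much smaller once school is accounted for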
So we have this sort of nesting structure, and that is in essence what the levels of a multilevel model are capturing. And so a student's test score at wave w, so that's just whether it's test one, two, or three, is likely correlated with that student's test scores at subsequent waves, which is what I was trying to explain. And students within schools likely have correlated test scores too, because each school has different teaching styles, which may or may not be better than a different school's, and so that is where correlation comes in within the school. So we have not only correlation within a student but within a school as well. And so instead of the response just being y, we index our response as y 'iwst'. It's a little notation heavy, but this is the test score of student i at test wave w, within school s, with teacher t. And so those would be the different levels, but notice that student family income is only indexed by i, right? Because that is independent of the teacher and the school, and so we have the levels at the test score level and we want to account for the differences in correlation there to then understand how student family income affects the test score. So the multilevel structure, like I said, is kind of twofold: we have a school and we have the students within the school. And so if we were to just ignore this multilevel structure, if we just plotted the test scores over the income of each student, this is what it would look like. Right? And notice actually that we have these little clusters of three scores, right? And so those would be the three test scores of each student. Notice they line up by income, because there's only a single income observation for each student, but the student is taking three tests. And so this would be ignoring the multilevel structure, and if we introduce just some structure incorporating school into the model, we can see it's a much different picture. So I'm just going to go back real quick. So if we were to just fit a line, we'd fit it here, and we'd say this is the total variation of all the tests and this is our fitted line, our ordinary least squares line. But if we don't ignore the multilevel structure, if we incorporate the different schools, we can see there's actually a much different picture. School B actually yields the highest test scores while school A is yielding the lowest test scores, and notice that the variability around each school is much smaller than the total variability in this whole figure. And so what I mean is the groupings of the points are much more localized now to each school and not just one whole clump of points. And so that is why incorporating the multilevel structure is very important, because you really want to capture these correlated structures and account for those differences to then have a better understanding of how, in this instance, income affects test scores. And so, to put it in the context of AFCARS really quickly: AFCARS is our foster care data, and it holds yearly case records of children who experience the foster care system within the fiscal year, and it provides a single entry per child per year. So every year a child who goes through the foster care system has one observation, and it includes a lot of information, like when they entered or left and what kind of services they received, and it includes a unique identifier for each child. Basically a child ID number.
And it's valid within states over time, so if you're looking at information over the years, you'd be able to identify the same child within different years of the data. And it includes data on the state with custody of the child, so if a child is in foster care in Michigan, that's what we mean. So we have that kind of state-level information and we also have counties, so the county and state of the agency that has responsibility for the case. And, like I said, many other case characteristics that we won't get into here. And so the point is we have a child who is in the foster care system within a county, within a state, and so that is kind of a basic multilevel structure within our AFCARS foster care file. And so, like I was saying, case years are nested within children, children are nested within counties, counties are nested within states, and states are also subject to national trends and policy. So there are a lot of levels here, and it's good to incorporate them into any sort of modeling at the county or state level. And as Frank has written here, "ignoring the structure can produce misleading inferences!", as the simple example of the schools and tests showed you, in the difference between the figure ignoring the structure and the one incorporating it. And so let me relinquish control and hand this off to Frank. And Frank, feel free to go back and add anything I might have skimmed over. [Frank Edwards] Hi everybody, that was great Sarah, thank you! I'm going to work through a demo now, and unless something is a point of clarification we'll hold the questions until the end. Okay, so we're going to describe our basic multilevel model as follows; we're going to use some statistical notation really briefly just to set up the difference between our multilevel model and an ordinary least squares regression model. Right, so an ordinary least squares regression model says that our outcome Y is distributed following the normal distribution with mean mu and variance sigma squared, where the expected value of Y, or mu, is equal to beta zero plus beta one X sub i. Right, and we're not using the epsilon error formulation here; we're incorporating that error into the distribution of the outcome. But it's equivalent to writing beta zero plus beta one X sub i plus an error term, right, for our residual error. Okay, so the multilevel model extends this basic setup. So right here we have Y sub i distributed normal. Next we're going to say Y sub ij, where j is our group-level indicator, and we're going to say that each group has a group-specific mean, mu sub ij, with individual-level variance sigma squared, and then our linear predictor for the model will now be mu equals beta zero plus beta one X sub i plus delta sub j, where delta is distributed normal with mean zero and variance sigma squared delta. So now we have two variance components in our model. We have an individual-level variance component, the sigma squared, which describes the variance for the observation Y sub ij, but we also have a group-level variance component, delta, right, that is distributed with variance sigma squared delta, and so that describes variation in our group-level intercepts. So these delta j's are our group-level intercepts, and that's going to be really helpful. So the software I'm going to use today is R; I'm going to use lme4, which is the standard package used for frequentist multilevel models in R. It's pretty straightforward to use, and I'm also going to use the 2019 AFCARS child file.
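(Written out in notation, a restatement of the two models Frank describes verbally, not a reproduction of his slide:)
\[
\text{OLS:}\quad y_i \sim \mathrm{Normal}(\mu_i,\ \sigma^2), \qquad \mu_i = \beta_0 + \beta_1 x_i
\]
\[
\text{Multilevel:}\quad y_{ij} \sim \mathrm{Normal}(\mu_{ij},\ \sigma^2), \qquad \mu_{ij} = \beta_0 + \beta_1 x_i + \delta_j, \qquad \delta_j \sim \mathrm{Normal}(0,\ \sigma^2_\delta)
\]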
The theory that we are describing would extend easily to Stata, SAS, or any other package; in Stata you can use meglm, or PROC MIXED in SAS. I work primarily in R, so that's what I'm going to teach today. And so I'm going to kind of show you slides, but I also want to note that the code is available on a GitHub gist; let me post the link in the chat. So all the code you're going to see is on this GitHub gist, and if you go to my GitHub gist you'll see that lines 1 through 42 are the school test score demo, right? So that data that we showed you for school test scores, that's not real data. What I did was actually simulate data, right; I simulated random income for each student and then I made income and test scores and schools be correlated with each other, right. So I simulated a linear relationship between our variables that we could then recover through the model, right. We can basically say "let's assume that there's some correlation structure here where school is correlated with test score and income is correlated with test score" and then we can visualize what that looks like. So if you want to kind of work through an exercise using data generation, using simulation to think about multilevel structures, to me that's a really good way to intuit what might be going on in the process. I saw earlier someone suggest that income and school are very closely correlated with each other, and that's absolutely true in the United States, but this is a toy example, right, so those were not real data you were looking at. Okay, so AFCARS starts on line 43 if you want to follow along. Okay, so to install packages in R you can use install.packages(). I'm running everything in RStudio, which is free and open source; one reason I love R is that everything in it is free and open source and is supported by a really robust community of users. So, like, lme4 is an externally developed package, I think Ben Bolker is the lead on it; it's got really robust support and it's got great documentation. But let's load that in, and I'm also going to load in the 2019 child file using the read.delim function, because we distribute the foster care file in a tab-delimited text format and that's how I typically like to work with my data. And then I'm going to just draw a simple random sample of the first five observations, or any five observations, from the data for four variables. So, is there a multilevel structure that we can observe in this data? Right, so here we see we have state identifiers, that's a FIPS code, we have sex, we have the age at the start of the fiscal year, and we have lifetime length of stay in foster care. And so clearly children within states, right, might have outcomes that are related to each other because they are subject to the same infrastructure and policy in the system, and that leads to correlated outcomes across kids, right. And obviously kids are associated with themselves over time, so in this first row, a child who has a lifetime length of stay in foster care over 1,200 days: their length of stay in, let's say this is 2019, so that length-of-stay variable is obviously going to be correlated with the length-of-stay variable for the same child from 2018 and '17 and '16, right; if the child never exited it's just going to increase by 365 each year, right. And knowing one of those can inform our observation for the next one, and we want to account for that relational structure in how we model.
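(A minimal sketch in R of the setup Frank describes here; the file name and the variable names are placeholders, so check the gist and the AFCARS codebook for the exact ones.)
# install.packages(c("lme4", "tidyverse"))   # run once if needed
library(lme4)
library(tidyverse)

# The foster care file is distributed as tab-delimited text; the file name below is a placeholder.
afcars19 <- read.delim("FC2019v1.tab")

# Peek at a handful of records for a few variables (names assumed for illustration):
# STATE = state FIPS code, SEX, AgeAtStart = age at start of fiscal year, LifeLOS = lifetime length of stay in days.
afcars19 %>%
  select(STATE, SEX, AgeAtStart, LifeLOS) %>%
  sample_n(5)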
So here, just to give us a sense, I'm going to be looking at the length of stay variable across states. But to give us a sense of the structure of the sort of within-state relationships, right, if I look at length of stay across states as both an average and a standard deviation across kids, so I want to think about how variable children within states are and where their average is, we see a really close correlation here between the expected average length of stay within a state and the variance across children within the state. Right, so states are clearly patterned, and we want to make sure that we are accounting for that if we're going to do a cross-state analysis of a variable like length of stay, which is something that maybe some of you have thought about doing. That's going to be related to case outcomes of various sorts; that's going to be related to reunification, or re-referral; all kinds of outcomes might be kind of bundled up in here. And so let's estimate what I'll call a naïve model, that is, a linear model for lifetime length of stay in years as a function of child age and child sex. So for example, we know that length of stay has to be a function of age, right; like an infant can't have a length of stay greater than a year. So we want to make sure we are accounting for age in estimating length of stay. And so here we're effectively asking what's the average length of stay by child age and by sex, and we're estimating a linear relationship for age. Let me switch over to my R console so that we can kind of work through some of this together. So let me make this a little bigger, but this is the full script that is in the GitHub gist. So we can load in the data and the packages, right; it's going to take a moment to read in AFCARS because these are big files. If you're working in R, and working with the NCANDS in particular, I recommend moving over to the data.table package, because it will load and process the large files much more quickly than read.delim or any of the tidyverse read functions. Oh, I need to load the tidyverse here; I also use the tidyverse in R, which, you'll see this kind of pipe notation, this %>%, it allows us to string together multiple commands and it makes manipulating data a lot easier. I think, Erin, we're going to do kind of a data wrangling workshop in December where I'll cover how to use some of that. [Erin McCauley] Yes, plug for the Office Hours! Come see us Fridays! [Frank Edwards] There we go. So anyway, I'm going to reshape my data just to get it in the format that I showed you on the screen. And so here is what the first five rows of that look like. And this is mostly just to de-identify it. But these are the data we're going to be working with, right: state, sex, age at start, lifetime length of stay. Okay, and I'm computing lifetime length of stay; it's given to us in the data as days, and I'm going to convert it into years just because I find that easier to work with. So the state-level mean and variance we compute here; this is how I constructed that plot that I showed you. But let's start with our naïve model. So we're going to use the lm function to estimate a simple regression model. We're going to specify the relationship between the length of time a child is in foster care in years as a function of their age at the start of the fiscal year that we are looking at and their sex treated as a binary predictor; that's what the factor() code there does.
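(A sketch of the state-level summary and the naïve model just described; variable names follow the earlier placeholder sketch and should be checked against the actual file.)
# Convert lifetime length of stay from days to years.
afcars19 <- afcars19 %>%
  mutate(lifelos_years = LifeLOS / 365)

# Within-state average and spread across kids (inputs for the state-level plot).
state_summary <- afcars19 %>%
  group_by(STATE) %>%
  summarise(mean_los = mean(lifelos_years, na.rm = TRUE),
            sd_los   = sd(lifelos_years, na.rm = TRUE))

# Naive model: no state structure, age as a linear term, sex as a binary factor.
m0 <- lm(lifelos_years ~ AgeAtStart + factor(SEX), data = afcars19)
# Expected values by age, e.g. for one sex category:
# predict(m0, newdata = data.frame(AgeAtStart = 0:17, SEX = 1))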
And then we can look at our model output and see that we do expect a positive relationship between age and length of stay, that length of stay increases as a child gets older, and that girls typically have a slightly lower length of stay than boys. Right, tidy() just presents these results in a data frame, which is a little easier to work with; so that's what that function does. Now, what I can do from there, I'm going to switch back to the slides for a second and we'll come back to the live coding when I get to the multilevel model, because what I'm going to do next is just to kind of illustrate results. But these are the expected lifetime lengths of stay for boys and girls at all possible ages in the data based on that model. Right, so I'm converting our regression output into expected values by age, right, and so we see that infants see a slightly lower expected length of stay, and because of the structure of the model we've estimated this to be a straight line as a function of age over time. Right, which it probably isn't, so, you know, we probably have some misspecification here, but that's okay. We're just kind of working through something simple. Okay, so that's what our basic naïve model thinks about length of stay as a function of age. So let's estimate a multilevel model now. So let's assume, right, because we saw that there's some real clear patterning here across states, right, that we have some states with much higher average levels for length of stay than other states and much higher variability across children than other states, right. So let's take a look at that. We're going to use the lmer function from the lme4 package, and our syntax is going to be really similar to what we used for the lm function in R, which is the linear model; lmer gives us a mixed-effects linear model, and mixed-effects is another name for multilevel. You might also see these called hierarchical models; those are all just different words for the same thing. And what we're going to do here is we're going to add in, in this case I'm calling it sigma sub s, and that's going to be our state-level intercept. So we're going to estimate distinct intercepts for each and every state. So that will kind of give us a different starting position for each state. And let me just kind of switch over to the code to show you how to work with this. Okay, so I'm calling this M1 ML; line 56 is where this code begins, and that will be the same on the gist. So we're going to use the lmer function, and if you want a help file on lmer, right, once lme4 is loaded you just type ?lmer at the console to get a description of how to set this model up. But we're going to say that lifetime length of stay, so Y, is a function of age at start plus sex treated as a binary variable; factor() is just going to turn it into a categorical, and with two levels that's binary. And then this strange thing here, this one and then a vertical bar and state, tells us to estimate intercepts for each and every state. We could do something wacky that I'm going to discourage you from at the start, but we could say let's estimate different slopes for age within each state, so we could see if there are different growth patterns across states. Maybe we'll take a look at that if we have time at the end. But let's estimate this model and see what happens. So we're running it now and it's going to take a second, these are more complicated models, but it's done running now and we can take a look at our output. And so we now get slightly different estimates for the intercepts.
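(A sketch of the multilevel call being walked through here, plus the random-slope variant mentioned as an aside; object and variable names are assumptions, not necessarily those in the gist.)
# Random-intercept model: one intercept per state, slopes shared across states.
m1_ml <- lmer(lifelos_years ~ AgeAtStart + factor(SEX) + (1 | STATE),
              data = afcars19)
summary(m1_ml)

# The "wacky" extension mentioned above: also let the age slope vary by state.
# m1_slopes <- lmer(lifelos_years ~ AgeAtStart + factor(SEX) + (1 + AgeAtStart | STATE),
#                   data = afcars19)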
Our slope estimate isn't that different, but this output here can be really informative for us. What it's going to tell us is the variance of our intercepts, so of our error terms. We can think of the state-level intercept we've estimated as state-level error, right. And so we see that those state intercepts have a variance of 0.11 and that individuals, after we account for state-level variance, have a variance of about 3.2. Right, and so sometimes you might compute a variance partition coefficient, an intraclass correlation, to think about what proportion of variance is explained by different components of the model. And in this case we can do some back-of-the-envelope math and say that about 3% of the variation in length of stay at the child level is accounted for by the state that they live in, right. This is still a pretty naïve model, but it's certainly getting better. If this variance estimate is nonzero, then it suggests that there is some variation explained by that intercept, right. If we saw a variance estimate there of, like, 0.001, that might suggest to me that the multilevel structure I'm theorizing may not be playing much of a role in the actual data. Okay, so that's kind of how we look at this, right, and we can write the model up this way, where the expected length of stay is a function of age, sex, and the state you live in. And this is what that looks like, right. So we went from our naïve model, where we just had one line for the country, to now allowing each state to have its own intercept, right. And so what that's going to do is give some flexibility to the model in fitting the data, and you can see that now our intercepts have a ton of variation, right. We start from a length of stay that is estimating some impossible numbers, because we're saying that infants can be in care for longer than a year, which is tough, so I need to think about how well this model is working. But this is a good quality check anytime you are running a model: run expected values and think about whether they make sense given the nature of the question. But each line here represents a different state, and so we have tremendous heterogeneity across states in expected length of stay at each age. And I suspect what's going on there with the impossible predictions for age 0 is that we've estimated this as a linear function of age and it's a nonlinear function, right? If we look at the observed data, there are going to be nonlinearities in the relationship between age and length of stay, and that makes sense theoretically. Okay, so the other thing I want to talk about here for a moment: some of you have likely worked with what are called fixed-effects models in the past, and so we can describe a second model that will be similar to the one we've estimated here but that does not use the multilevel model to achieve similar results. So let's estimate a model where lifetime length of stay is a function of age at start and also a function of state, right. So what we're doing here is specifying a similar structure to the model and saying that length of stay is a function of age, is a function of sex, and is a function of state. The difference here is how I'm estimating those state-level intercepts, right. I'm still going to be estimating an intercept for each state.
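(A sketch of the variance breakdown and the fixed-effects comparison just described; the 0.11 and 3.2 figures are the ones Frank reads off his output, and the code and names are illustrative.)
# Variance components: how much of the variation sits at the state level?
VarCorr(m1_ml)          # prints the state-intercept variance and the residual variance
0.11 / (0.11 + 3.2)     # back-of-the-envelope share attributable to states, roughly 3%

# Fixed-effects comparison: state entered as an ordinary categorical predictor.
m1_fe <- lm(lifelos_years ~ AgeAtStart + factor(SEX) + factor(STATE),
            data = afcars19)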
One thing that's going to happen is that it's going to be much more difficult to look at this output, because now we have a lot of intercepts that are estimated as simple regression terms. But the other thing that's going to happen is that these are going to translate to the conditional expected value for length of stay, conditional on sex being equal to one, which is male, and age at start being equal to zero. So these are just going to be the averages for each state, right? These are just going to be the crude averages for each state, conditional on age at start being zero and sex being one, for this linear function, right. These are going to be those expected values. Now the problem with that is that we might have more observations in some states than others, so let's take a look at how many observations we've got across each state, right. So these are FIPS codes again, but we can see pretty clearly that, for example, in the state with FIPS code 4... so 1 is Alabama, here we have, you know, about 90,000 children in Alabama; 6 is California; I forget what 4 is, I should know these FIPS codes by now but I don't have them memorized. Anyway, we see tremendous variation here; we have many fewer kids in Alaska than we do in, wherever Texas is. So we actually have different levels of information about each state, and so some states we're going to be able to estimate more precisely because we have more information we can leverage. Now if we have a ton of information, this approach and this approach, in terms of their inferences, will yield just about the same predictions. But if we don't have as much information, let's say we're looking at some states where we have relatively fewer cases, what we can do in the multilevel model approach is called partial pooling. Because we are giving each intercept a shared probability distribution, that is to say, for those state-level intercepts we are imposing as part of our model that the intercepts themselves come from a shared probability distribution. So when we estimate what the value of the intercept is for a particular state in a multilevel approach, we're actually borrowing information from the whole data, from all the other states in the data. In the fixed-effects approach we are simply using the observations within the state to estimate that model, and we're effectively throwing out other information informing where that intercept is going to be located, right. So in cases where we don't have a lot of data, the fixed-effects approach will overfit, and so relaxing the assumption that we have perfect information on the states allows us to borrow information across states and add some flexibility to our models, which I think is usually a good thing to do. Now, in terms of thinking about causal inference, there are different questions about whether to use fixed effects or random effects that I won't get into here, but I think in general defaulting to a random-effects approach is wise, because the models are less likely to overfit; they're more complex to fit, but they are much less likely to overfit and they are going to take full advantage of all the information you've got in the data rather than kind of throwing out some of that information.
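(A sketch of how you might look at the information available per state and at the partially pooled state intercepts; the calls are standard dplyr and lme4 functions, but the names again follow the placeholder sketch.)
# How many observations do we have per state?
count(afcars19, STATE, sort = TRUE)

# Partial pooling in action: the estimated state-level deviations from the overall
# intercept are pulled toward zero, most strongly for states with few observations.
state_effects <- ranef(m1_ml)$STATE
head(state_effects)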
So, this is a really simple approach, right, but additional next steps we could consider: if we had multiple years of data we could include individual-level intercepts, we could include year-level intercepts, and we could include intercepts for those counties that we have identified in the data, which are basically the large urban counties, those with more than, I believe a thousand is the threshold, I'd have to go back and look at exactly what the cutoff point is, but I'm sure many of you know that counties are de-identified if they have small caseloads. So we could imagine estimating models that include all these additional terms. We could estimate slopes, that is, we could take the expected rate of change in length of stay as a child ages and allow that to be different across each state. And so that's called a random slope; we estimated random intercepts here, but we could certainly fit random slopes in this case, by age, quite simply. And this allows a lot of flexibility in fitting the models and describing different kinds of relationships and structures across places. These are very effective ways to model differences across units, across places, and over time. So I'm going to stop there and we'll open the floor to Q & A. Thanks a ton, you all, and I'm sure there are lots of good questions. [Erin McCauley] Thank you so much Frank. I'm going to start with just one or two clarifying questions that came in. So in that code, Frank, what does the parentheses, one, vertical bar, state mean? Yeah, right there. [Frank Edwards] So yeah, what we have here is our call to lme4, and so let's compare this to this model, which was kind of our naïve model that we started with. We have the left-hand side of our regression equation here and the right-hand side of our regression equation here. Now, this one bar state, and I'm reading this kind of vertical pipe as "by": effectively what we're saying is that in lme4 syntax the grouping factor goes on the right-hand side of the vertical bar and the intercepts and slopes go on the left-hand side. So effectively we could estimate nested coefficients, so for example we could estimate different intercepts for each state by sex. So if I put sex here we would get a very different model. Putting the one here just tells R "I'm not conditioning it on anything, I'm just estimating an intercept for each state that's unconditional on any of the other variables in the model." But we could put age at start here, and that would then convert this from a random intercept to both a random intercept and a random slope, right. So then we would get a different slope for age at start for each state, and we would also get an intercept for each state. [Erin McCauley] We have a comment coming in on that. So it's constraining all of the slopes to be the same but allowing the intercepts to vary, right? [Frank Edwards] This model with the one bar state, yes, constrains the slopes for age to be identical across states and allows the intercepts to vary; that's exactly right. [Erin McCauley] Fabulous, thank you so much. So I figure we'll stay with the questions about the run-through, since we're already here. Can I still use a multilevel modeling approach if I combine the NYTD and AFCARS data? [Frank Edwards] Sure! So the NYTD we can link to AFCARS by the AFCARS ID, so we can think about outcomes in NYTD that might be a function of experiences the child had that were recorded in foster care.
So we could really easily imagine that the observations in NYTD at the child level would be correlated with observations in AFCARS at the child level. So we could imagine estimating models with individual-level random intercepts to help us understand outcomes over time, expecting that what happened in the AFCARS data will be correlated with what we see in the NYTD data. The other thing we could do is assume that geographic units might be correlated with each other, so states, counties again; outcomes across those kinds of geographies are probably correlated with each other, and time might be something else to think about. The simplest kind of model that I typically run is a state and year random-intercept model. So this is going to estimate two random intercepts; now, I don't have year in my data here, but if I had multiple years of data, what this model would do is estimate one intercept for each state and then one national-level annual intercept. So it would kind of factor in changes in the national sort of terrain of foster care over time, and state-level differences. We could get more complicated: we could estimate state by year, we could add lots of different additional complexity to it, but this is the kind of approach that I typically take. Anyway, all this is to say, yes, a multilevel model is not only appropriate for linking NYTD and AFCARS, but if you are linking NYTD and AFCARS I strongly encourage you to start from the perspective that you will be estimating a multilevel model. It might be a fixed-effects model if you're really focused on, like, an econometric causal-inference type set of questions, but I think the multilevel model is going to be a little more flexible in terms of giving you valid inferences on the data. [Erin McCauley] Great, thank you. And so we have another question that Sarah typed a response to, but it was just: why do we have more than 50 states? And it's because of DC and Puerto Rico. And so our next question is: would the code look similar for a dichotomous variable as the outcome? [Frank Edwards] Yeah, the only difference is that then we're estimating a GLM and not a linear model, right. So here we are using effectively a normal distribution for the likelihood of the data, but if we have a, let's say, it doesn't really make sense in this context, but let's say. Let's look at what we have here. I don't know, I want to predict the sex of the child based on how long they were in foster care, right. I don't know, it's a silly thing to do, but let's do it. Now, notice I changed my syntax here. It's no longer lmer, it's glmer; much like in base R, if I'm estimating a logistic regression I switch over to glm rather than lm. And in this case I need to specify the family I'm working with, the distribution family I'm working with. And in R, when we're estimating a logistic regression we specify the binomial family. Right? What I often do using these data, my most frequent modeling choice, is to look at counts at the place level over time, so I'm almost always using Poisson distributions to model my data, which have some nice features in a multilevel context where we can model the [unintelligible] really easily too, but anyway. This is what that would look like, so let's predict the sex of a child based on length of stay, and this is kind of inverting the question: can we guess whether a child is male or female based on how old they are and how long they've been in foster care?
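(A sketch of the logistic version being described: glmer with a binomial family and state random intercepts. Recoding sex to a TRUE/FALSE outcome, and the variable names, are assumptions about the file, not Frank's exact code.)
# GLMM sketch: dichotomous outcome, logit link, state-level random intercepts.
# Assumes SEX is coded 1/2 in the raw file, so SEX == 2 gives a TRUE/FALSE outcome.
m2_glm <- glmer(I(SEX == 2) ~ lifelos_years + AgeAtStart + (1 | STATE),
                data = afcars19, family = binomial)
summary(m2_glm)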
You know, that's not a particularly interesting model, but yeah, totally, we can run GLMs for logistic regressions, we can run categorical models, we can run count models; you know, the sky's the limit in terms of the kinds of likelihoods we can use. [Erin McCauley] Great, thank you. So, just a general question for correlational research in social sciences, like child welfare research or criminal justice/procedural justice research: do you consider correlations of .5 to be strong, .3 to be moderate, and .1 to be weak? Or, that is, .7 and up is strong, .5 is moderate, .3 or .2 are weak, and .10 is extremely weak? [Frank Edwards] Yeah, I don't like those rules of thumb, because it depends on our theory, right. Like, let's take a look: obviously in this case we know that the amount of time a child could be in foster care is by definition a function of their age. So let's look at that. Here we have a .295 correlation. Now, according to a rule of thumb, we could describe that as a moderate association or something. But it's actually like a mechanical relationship, right. There's a deterministic relationship here between how old a child is and what the range of possible lengths of stay they could be in foster care for is. So I don't like those rules of thumb, because I think we want to just think a little more carefully about the data, and think a little more carefully about the relationship between variables. You know, if I see a correlation between two measures that I think should be really strong and it's like a .01, you know, then that might force me to reevaluate what I think might be going on in the data. So I think the question should be situated in a kind of, I'm a bit of a Bayesian, but think of it in terms of what your priors are for what you think that correlation should be, based on the theory, based on your scientific knowledge of the topic. So, like, do we think that receipt of a particular service should be associated with some outcome, and do we have prior studies that show that link? And if so, what is the magnitude of that prior link? And then we look at the correlation in our data. Does it line up with our expectations? I think a lot of this is trusting your intuition, because your intuition, based on your scientific knowledge and experience, is I think a lot more valuable than a rule of thumb about weak and strong correlations. [Erin McCauley] Excellent. Thank you. So now I think I'll go to a question about the number of cases. So we've had a few questions around this, but: how appropriate is multilevel modeling with AFCARS, since the vast majority of county-level data in this country is censored because counties have fewer than 1,000 cases? [Frank Edwards] Excellent question. Excellent question. Let me take a look at this real quick. Yeah, I'm going to load in the data and just look. So usually in the AFCARS we have more than 100 counties identified each year, and those are the largest-population counties. So yeah, we have a challenge here in that we don't have the full population of counties in the United States in the AFCARS, right. Those counties with low populations are systematically de-identified. I'm not going to explain all of that here, but I've got you. So yeah, I'm just wanting to remind myself of the name; it's FIPS code. Okay, so we're just going to add the FIPS code here and then do the same thing, so now when I look at the head of AFCARS 19 we can see that when the FIPS code is equal to 8, that means it's de-identified in the data, right.
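(A sketch of the check being done here; FIPSCODE as the county variable name is an assumption, and 8 as the de-identified value is taken from Frank's description, so confirm both against the codebook.)
# De-identified county records still carry the state code.
head(afcars19[afcars19$FIPSCODE == 8, c("STATE", "FIPSCODE")])

# How many distinct county codes are there, and what share of records is de-identified?
length(unique(afcars19$FIPSCODE))
mean(afcars19$FIPSCODE == 8, na.rm = TRUE)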
So in this case we know the state it's from, we know it's from Alabama, but we don't know which counties in Alabama it's from. Let's look, though, at which counties we do have. And so here are all the counties we do have identified, and for 2019 let's see how many we've got. I'm going to ask R to tell me the unique values of the county identifier and then I'm going to ask for the length of that vector. So we have 115, keeping in mind that 8 is the de-identified code, right. So we have 115 identified counties here, and from those we can ask, you know, what proportion are 8's; in this case we have 347,847 records coded as 8, and that's a lot, so let's see, 347,847 divided by the number of rows in AFCARS 19, so that's 57% of the data that has a de-identified county, right. So if we start to work at the county level, we are by definition going to be losing inferences on some places. The other thing we could do, though, and this is a little tricky and I'd need to think about how theoretically well-motivated it is, is we could treat the urban counties as being kind of different from each other within states, and then the rural counties grouped within states, so we could imagine creating a new variable that is, for example, the state-FIPS pair, right. So we want those 8's, those de-identified counties, to be attached to the state they are in, right. And so we could create a full FIPS code, right, which in this case, actually, this already is that. So we could attach the state FIPS code to our 8 to allow all the de-identified counties within a state to act as a kind of mega-county, right? If you wanted to do something like that. Again, I would think carefully about whether that's appropriate for your application, but that's the sort of thing we could do. Multilevel models are totally appropriate in this context if we draw lines around what the population we're making inferences to is, right. So if we are making inferences to large, populous counties in the United States, then, you know, isolating those counties across years is fine. Now, what I would typically do to make sure that we are not selecting on our outcome is to use a population-based criterion to select those counties. So for example, in a recent paper with Chris Wildeman, Sara Wakefield, and Kieran Healy, we used the AFCARS and NCANDS data to estimate exposure rates to the child welfare system across the 20 most populous counties using these data. And what we did is we just used a population filter, so instead of using which counties are identified, we said which are the 20 most populous counties in the US and then we selected only those for inclusion in the analysis. So we were ensuring that it wasn't necessarily the foster care system that was determining whether they were included or excluded from the analysis; it was population. [Erin McCauley] Excellent, thank you for that. That was a really helpful answer to multiple questions. So I have two questions about the SCAN policies database and how you could use that to examine the effect or impact, for example, of LOS by state, or could you look to see if any certain policy has a relation with LOS? So how could the SCAN be used with this? [Frank Edwards] Sarah, would you like to take that, or I can? [Sarah Sernaker] You could. I just know that SCAN is mostly just a bunch of indicator variables as to whether a state has, like, implemented certain things.
I mean, you could always link them by state and include them as covariates and see, you know, the difference in your coefficients and outcomes and whatnot based on that inclusion. But that's all I was going to say on that point. [Frank Edwards] Yeah, I think that's exactly right. I mean, what we can do with SCAN at this point, we have a single observation for each state for policies. So what we can do is describe the average levels of length of stay across states with different kinds of policies. We can't make any kind of causal inferences about the impact of policies without a different kind of design, and without longitudinal data, and without a lot of assumptions about policy timing and comparability of states. So I wouldn't make any causal inferences based on it. But we could certainly say, you know, do states that have various kinds of reunification policies in place tend to see lower lengths of stay than states that don't have those policies? Absolutely. We could link the data in and we could add in that predictor. Now, what that predictor would do in the multilevel context is it would estimate those individual intercepts for states after accounting for the coefficient that we estimate for this policy variable, right. So then we kind of condition on that policy variable to say, okay, for states that have the same value on this policy variable, how similar are they? But then we can also account for those idiosyncratic differences across states that might be driven by something we don't observe in the data. So yeah, use the SCAN for that. [Erin McCauley] Great, so now I'm going to pop back to some of the earlier questions, more about the theory of the method. So, don't covariates help you account for differences by school? [Frank Edwards] Yes. Yes, they do. So covariates obviously help us account for differences by school; to the extent that covariates are associated with school, they will. And that's also one nice thing about a multilevel approach: we're not assuming independence of the covariates and the random intercepts, right. So we can actually kind of bake some dependencies in there in a way that can help us account for things we can explain with the data, and maybe things that are beyond explanation from the data alone. Sarah? [Sarah Sernaker] I was just going to add that the random intercepts are a way to kind of take out the correlation. So, like, in the schools example, like I was saying, each school probably has its own teaching style and whatever, and so by adding the random intercept for school you're accounting for that difference, so that you can make better inference about the effect of the covariate alone. And so you want to examine the effect of the covariate after accounting for differences that you can't, like, control. Kind of. If that makes sense. [Frank Edwards] Yup. [Erin McCauley] Great, and so, just something you kind of mentioned in the presentation, and we just have a few minutes so I'm trying to get to as many questions as possible. But someone said: are income and school totally independent? Research says no, so how do you decide which variables are independent and which are not? [Frank Edwards] Yeah, I mean, that's one thing I like about multilevel models: we don't have to make as many assumptions about independence and dependence; we can kind of allow for cross-level structures in more complex models. Again, that school data was a toy example.
Yeah, obviously if we want to build a good model for student test scores it would be much more complex than what I presented today. [Sarah Sernaker] Yeah. My explanation was just a simplified version, and so, no, in actuality there's not full independence, but for the simple example I might have oversimplified that fact. [Frank Edwards] But we can model that in a multilevel context, which can get cool. So we have a question about checking assumptions for multilevel models. So a lot of times you're going to see recommendations in statistics textbooks for certain kinds of tests to think about whether to use a fixed-effects or random-effects framework. And that's really in the context of making inferences that are about causal effects of variables, that's about causal identification for a particular beta, to think about whether the random-effects or fixed-effects framework is more appropriate. Typically those are going to favor a fixed-effects framework, but I think the random-effects framework should instead be your default rather than the fixed-effects framework, because it trusts your data a little less, right. The random-effects framework says maybe we don't know exactly what the intercept is for this particular unit of observation, because if we're looking at a child over four years of data we only have four observations, and so do we think that, you know, the value we observe is exactly equal to what that intercept is, or do we want to kind of put some fuzziness around it, right? Do we want to let there be a little bit of error in that group-level intercept structure? And that's what the random-effects model does. So it effectively avoids overfitting the data, and I think avoiding overfitting should be a general practice in doing applied statistics. [Erin McCauley] Great, thank you. So what would the code look like if I wanted to model both county-level and state-level intercepts, as in, the county is nested within the state? Is it just adding one bar county FIPS to the code you have? [Frank Edwards] That's exactly what... [Erin McCauley] And I want to acknowledge that we are at time, so if folks have to go we completely understand, it is time to go, but we do just have a few more questions. [Frank Edwards] Just add that, right. Now if we wanted to we could do it this way, right. Or, because we have those 8's and we don't want to drop those, maybe we want to estimate a different intercept for each 8. But as the FIPS code is recorded in the data it's unique across states, whereas a bare county ID like 001 could re-occur across states. But anyway, this is how you would do it; this is the simplest way to do it. [Erin McCauley] Great. So in the study you mentioned where you picked the 20 most populous counties, how did you kind of justify making that choice? Was it based on a power calculation or another factor? [Frank Edwards] No, I mean, it was just a theoretically motivated question. We wanted to think about how systems were working in the largest US metros, and we also wanted an inclusion criterion that wasn't using the foster care data itself to make that decision, right. So, like, if I take the 115 counties that are identified in these data, we're going to have some threshold counties that are around the, like, hundred, hundred-fifty-thousand population mark that some years are going to be in the data and some years aren't going to be in the data, based solely on how many kids were in foster care in that year.
And I don't want my inclusion criteria to be driven by the thing that I'm trying to get inference on, that I'm trying to describe, right. I don't want foster care caseloads to drive whether or not you are in the data, so I just used an arbitrary population cutoff. It was arbitrary, but, you know, it could have been the 25 or 30 largest; we stuck with 20 just because we felt it would be descriptively possible to cover 20 counties in a way that going bigger than that would get a little burdensome. [Erin McCauley] Great, thank you. We've got two more questions, and we also had a chat request come in: if you know of any publications using multilevel models with AFCARS or any other NDACAN data, if you could pop it into the chat that'd be great. I'll also say, Clayton, would you mind putting the link to the canDL into the chat? That's a great way to look at existing studies. So Frank, do you have a few more minutes to go through the last two questions, or do you have to run? [Frank Edwards] Yeah, for sure. Multilevel papers using AFCARS and NCANDS, that's most of the papers I've written using AFCARS and NCANDS. So if you go to the canDL and look for me, Frank Edwards, you'll find two papers in particular: one from 2016 in American Sociological Review that used multilevel models with AFCARS and one from 2019 that uses multilevel models with NCANDS. So yeah, canDL is a great resource for that. But multilevel models are really appropriate for these data, so I think, you know, to the extent that we are not seeing them as much, we should be seeing them more. [Erin McCauley] Yeah, how would you share the importance of layered data in this type of work with folks who are just getting to know the data, for example in a community presentation? [Frank Edwards] Yeah, so I don't know that I would jump straight to multilevel models in a community presentation, but I mean, the simplest explanation for this structure is to ask: do we think that state policy decisions matter? Do we think that how the counties administer their systems matters? And if we think it matters, then that nesting is critical for us to incorporate in analyzing the data, right? Because if we think that differences across places are important in terms of how systems work, then we have to account for it. [Erin McCauley] Great, thank you. And then we have one last question: are the predictor and response variables centered around some shared mean, or just the response variable? [Frank Edwards] So that was in the context in which we were talking about the data generating process? [Erin McCauley] Yeah. [Frank Edwards] When we think about the data generating process, we're talking about the likelihood that we specify as the distribution of the response variable, or the outcome, right. So in that case we're saying the outcome follows some distribution, and the location and shape of that distribution is a function of the predictors in the model. Right, so we're not assuming a random distribution on the predictors, but we can. It gets much more complicated if we want to assume error in measurement or other kinds of models that allow for there to be a probability distribution on the predictors themselves. It's doable, but it gets more complex. [Erin McCauley] All right, that is our last question. Thank you so much for the presentations, Sarah and Frank, and thank you for hanging out for a few extra minutes. Those were some fabulous questions, so I'm really grateful we got through them.
Everyone else on the line, thanks for hanging out for an hour and change. Next week we'll have our final workshop, led by Sarah, so we hope to see you then and help us round out the series, and if you get a survey from me please fill it out. We really appreciate your feedback and it really helps us steer the direction of the summer series. So, a round of applause for Frank and Sarah; thank you everyone. [Sarah Sernaker] Thank you. [Frank Edwards] Thanks y'all. Take care of yourselves. [voiceover] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families.