Transcript for video titled "LeaRn R with NDACAN, Week 8, 'Introduction to Advanced Modeling', May 16, 2025". [MUSIC] [VOICEOVER] National Data Archive On Child Abuse And Neglect [ONSCREEN CONTENT SLIDE 1] WELCOME TO NDACAN MONTHLY OFFICE HOURS! National Data Archive on Child Abuse and Neglect DUKE UNIVERSITY, CORNELL UNIVERSITY, & UNIVERSITY OF CALIFORNIA: SAN FRANCISCO The session will begin at 11am EST 11:00 - 11:30am - LeaRn with NDACAN (Introduction to R) 11:30 - 12:00pm - Office hours breakout sessions Please submit LeaRn questions to the Q&A box This session is being recorded. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us If issues persist and solutions cannot be found through Zoom, contact Andres Arroyo at aa17@cornell.edu. [Paige Logan Prater] All right, I think I'm going to go ahead and get started. I have a couple of announcements that I want to share with you all, so welcome everyone to our last session of the NDACAN Monthly Office Hours LeaRn with NDACAN series. NDACAN is the National Data Archive On Child Abuse And Neglect. It's housed at Duke, Cornell, and UCSF, and we are funded by the Children's Bureau. For folks who have been along with us on the series, welcome back. If you're just joining for the last and final session, welcome. I'll kick it over to Frank to start our last R session. As always, we'll spend about 30 minutes talking about R, and then for the last 30 minutes or so we will get into breakout sessions for office hour conversations. As a reminder, we will be recording this session, and all of our recordings are available on our website, so you can catch up or reference back to anything we chatted about over the whole series. If you have a question, please put it in the Q&A box or type it into the chat and we will address it live, just to make sure that we have it captured in the recording. And for any Zoom support that you might need, there's information on this slide here to help you troubleshoot any issues you have. Before I kick it over to Frank, I just wanted to give a couple of announcements because this is our last session. I wanted to let folks know that we do have a Summer Training Series that complements the Monthly Office Hours offering. The Summer Training Series is themed every year and runs five to six weeks every summer. The theme for this summer is Life Cycle Of An NDACAN Research Project. So we will be walking through everything from initially developing a research question and exploring the data to conducting analyses, visualizing and finalizing analyses, and presenting information. All of that information about the Summer Training Series is on our website, and I will put links in the chat to direct y'all to information and registration links. So that's the Summer Training Series. We also have two paper awards that are hosted by NDACAN. If you're part of the listserv, you might have seen an email from Erin McCauley announcing that earlier this week. We have an Outstanding Paper Award and an Outstanding Graduate Student Paper Award. I will also put information in the chat about what that looks like and how you can nominate folks for these awards. Those nominations are due by July 15th. And if you have any questions about the Summer Training Series, either of the paper awards, or future NDACAN offerings, feel free to reach out to me either in the chat or via email. I think that is all of my announcements.
So Frank, I'm gonna kick it over to you to get us started. [Frank Edwards] All right, great. Thank you, Paige, and welcome everybody. [ONSCREEN CONTENT SLIDE 2] LeaRn with NDACAN, Presented by Frank Edwards [Frank Edwards] Today we are going to talk about a couple of topics in advanced regression. [ONSCREEN CONTENT SLIDE 3] MATERIALS FOR THIS COURSE Course Box folder (https://cornell.box.com/v/LeaRn-with-R-NDACAN-2024-2025) contains Data (will be released as used in the lessons) Census state-level data, 2015-2019 AFCARS state-aggregate data, 2015-2019 AFCARS (FAKE) individual-level data, 2016-2019 NYTD (FAKE) individual-level data, 2017 Cohort Documentation/codebooks for the provided datasets Slides used in each week's lesson Exercises that correspond to each week's lesson An .R file with example, usable R code for each lesson; it will be updated and appended with code from each lesson. [Frank Edwards] The materials for this course are the same as always. They're on the Cornell Box folder. So all of the data that we're using today is available for you on Box, along with these slides and the R script that we will take a look at in a moment. [ONSCREEN CONTENT SLIDE 4] Week 8: Introduction to ADVANCED MODELING May 16, 2025 [Frank Edwards] So this week, we're going to give a kind of taste of a couple of techniques, ways that we can do advanced modeling within R. In 30 minutes, this is obviously not going to be a complete overview of the thousands of topics that we could discuss on advanced modeling. But R is a very flexible programming language, and most statistical approaches that you might see in the literature are both implementable in R if you have the patience to program them yourself, and also generally available to you through third-party packages. One of the big features of R is that it's an open source project, so third-party developers are routinely expanding the capabilities of the R suite. Today we're going to talk about two particular topics. We introduced linear regression last time. This time we're going to talk about how you can estimate generalized linear models in R and also how you can estimate multi-level models in R. [ONSCREEN CONTENT SLIDE 5] Data used in this week's example code AFCARS fake aggregated data ./data/afcars_aggreg_suppressed.csv Simulated foster care data following the AFCARS structure Can order full data from NDACAN: https://www.ndacan.acf.hhs.gov/datasets/request-dataset.cfm Census data ./data/ Full data available from NIH / NCI SEER [Frank Edwards] The data we'll use today is the AFCARS fake aggregated and the AFCARS fake individual-level data, along with some Census data. Again, these are simulated data. They are not real foster care data, so please don't use them for research. They are intended only for the kinds of practice that we're doing here. But you can order the full data from NDACAN as usual. And feel free to raise your hand or just go off mute if you have questions as we go. [ONSCREEN CONTENT SLIDE 6] Generalized Linear Models [Frank Edwards] I'm going to start with talking about generalized linear models. I'm going to assume that most of us have seen these topics before in a statistics course or are familiar with these topics. But I'll just give us a brief introduction and then we'll go into the coding.
[ONSCREEN CONTENT SLIDE 7] From linear regression We define the expected value of y in a linear regression model as E[y_i] = β_0 + β_1 x_1i + β_2 x_2i [Frank Edwards] So a generalized linear model extends the linear regression model, where we define the expected value as a linear function of a set of predictors and a set of parameters, by including a link function g. [ONSCREEN CONTENT SLIDE 8] To generalized linear models We can define a generalized linear regression for the expectation of y as g(E[y_i]) = β_0 + β_1 x_1i + β_2 x_2i For a wide range of link functions g [Frank Edwards] So what we're going to do here is we're going to apply a link function to the left hand side of the regression equation, so that the relationship with the right hand side becomes linear. It's effectively going to allow us to model nonlinear processes using the same regression tools that we already know how to use. And we can use a wide range of link functions g to accomplish this. [ONSCREEN CONTENT SLIDE 9] Two commonly used link functions Logistic regression logit(E[y_i]) = β_0 + β_1 x_1i + β_2 x_2i Poisson regression log(E[y_i]) = β_0 + β_1 x_1i + β_2 x_2i [Frank Edwards] Some of the most common ones we'll see are logistic regression, where we use the logit as our link function. Once we apply a logit to the expected value of y, we can model the relationships between variables in our typical linear format. But this means, of course, that the actual relationship between x1 and x2 and the probability of y (logit models are typically used for binary outcomes) now becomes nonlinear, because we are applying this nonlinear transformation. So implicitly we could imagine applying a logit inverse to both sides, where the expected value of y would now be equal to the logit inverse of beta 0 plus beta 1 x1 plus beta 2 x2. Poisson regression is another commonly used GLM, generalized linear model. With the Poisson model, along with a number of other models in the exponential family, we're going to use the natural log as our link function. [ONSCREEN CONTENT SLIDE 10] Changes to the error structure Linear regression y_i = β_0 + β_1 x_1i + β_2 x_2i + ε_i, ε_i ~ N(0, σ^2) Logistic regression p = logit^(-1)(β_0 + β_1 x_1i + β_2 x_2i), y_i ~ Bernoulli(p) Poisson regression λ_i = e^(β_0 + β_1 x_1i + β_2 x_2i), y ~ Poisson(λ) [Frank Edwards] So here's a kind of quick overview of effectively what we're doing with GLMs, right? Under ordinary linear regression, we're saying that our y's center around the regression line, beta 0 plus beta 1 x1 plus beta 2 x2, plus an error term, epsilon, where those error terms epsilon follow a normal distribution around the regression line. So we say that our y values will sit centered on the regression line, with a normal distribution around the regression line. Under logistic regression, we're saying that our y's are going to be modeled according to a Bernoulli probability distribution. That is a probability distribution that takes the parameter p and then generates a random value that has probability p of being equal to one. And in order to model the process for p, in order to understand why different levels of x values correlate to different probabilities that y equals one, we will set up this logit inverse linear equation.
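[EDITOR'S NOTE] As a supplementary sketch, not part of the recorded session: the logit and its inverse that Frank describes are available in base R as qlogis() and plogis(), so a few lines can show how a linear predictor on the logit scale maps to a nonlinear probability. The coefficient and x values below are made up purely for illustration.
# logit and inverse logit in base R
p <- 0.75
qlogis(p)            # logit(p) = log(p / (1 - p)), about 1.10
plogis(qlogis(p))    # inverse logit recovers 0.75
# a made-up linear predictor: beta_0 + beta_1 * x
beta_0 <- -1
beta_1 <- 0.5
x <- seq(-4, 4, by = 1)
eta <- beta_0 + beta_1 * x   # linear on the logit (log-odds) scale
plogis(eta)                  # Pr(y = 1), nonlinear in x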
For Poisson regression, we're going to specify a parameter named lambda that we are going to say is equal to e to the beta 0 plus beta 1 x1 plus beta 2 x2. Right? And if we take the natural log of both sides, we end up with log of lambda equal to beta 0 plus beta 1 x1 plus beta 2 x2. Right? And this is the generalized linear model component of it. We're just transforming our ordinary regression equation using these link functions. And here we're going to say that y follows a Poisson distribution with the parameter lambda, which is going to determine the expected count. So, Poisson regression is something we often use for integer count values, non-negative integers. And the Poisson takes one parameter, lambda, that defines both its expectation and its variance. [ONSCREEN CONTENT SLIDE 11] Estimating glms in R We specify the distributional 'family' to be used with the glm() function Logistic regression: glm(y ~ x, data = mydata, family = "binomial") Poisson regression: glm(y ~ x, data = mydata, family = "poisson") [Frank Edwards] Now how do we do this in R? It's pretty easy. We used lm() last time to estimate regression models. The only difference here is we're going to add a g onto lm. So for a logistic regression model we will specify the model as glm(y ~ x), and x then can be our full set of predictors, x1 + x2 + x3*x4 if we like. We provide R a link to the data that we're using, and now we also need to specify the distributional family that we are saying our outcome follows. In this case, the Bernoulli model we mentioned a moment ago is a special case of the binomial distribution with n equal to 1, so we're going to use the binomial probability distribution as our family argument. What's the distributional family that y follows? In this case we're saying it follows a binomial distribution. Under Poisson regression we'll use the exact same setup, glm(y ~ x, data = mydata); now we're going to use the Poisson probability distribution as the family for our outcome variable. Pretty straightforward. We'll go over to RStudio in a moment and look at how we actually estimate and interpret the output of these models. [ONSCREEN CONTENT SLIDE 12] Multilevel models [Frank Edwards] I also want to briefly introduce multi-level models. Again, this is an incredibly rich and complex topic. We could spend an entire semester discussing multi-level models if we wished to. But I want to just give us a quick overview of how they work. [ONSCREEN CONTENT SLIDE 13] From linear regression We can define a multilevel model with varying intercepts for cluster j as y_i = β_0 + β_1 x_1i + β_2 x_2i + δ_j + ε_i ε ~ N(0, σ^2_ε) δ ~ N(0, σ^2_δ) [Frank Edwards] We can define a multi-level model with varying intercepts for cluster j as y = beta 0 + beta 1 x1 + beta 2 x2 + delta j + epsilon i. I'm setting up this new term, this new parameter in my regression, called delta. And what delta is going to be is a set of cluster-level intercepts. So we're going to have effectively two error terms in our model. We'll have our epsilon, which indicates individual-level error. Let's imagine that what we're modeling here is county-level foster care dynamics, and we are modeling those over time. So our observation might be the county-year foster care caseload. But we also know that counties, especially in a system like the child welfare system, can vary a lot in ways that are stable over time.
So features of jurisdictions often change very slowly over time, or demographic features of places often change very slowly over time. What does the delta parameter let us do? It will let us estimate an intercept specific to each county in the data that doesn't change over time, so that we can pull out some of those time-stable differences across counties from our inferences about our betas. So it's an incredibly useful tool for addressing various kinds of clustering that we might observe in the data, and with the NDACAN data a routine kind of clustering that we're going to have to deal with is geographic or spatial clustering. Multi-level models are a really useful tool for that. [ONSCREEN CONTENT SLIDE 14] Estimation in R The lme4 package provides a flexible set of multilevel models Linear, multilevel intercepts: lmer(y ~ x + (1|j), data = mydata) Logistic, multilevel slopes and intercepts: glmer(y ~ x + (i|j), data = mydata, family = "binomial") Cheat sheet for lme4 formulas: https://stats.stackexchange.com/questions/13166/rs-lmer-cheat-sheet [Frank Edwards] In R, we're going to use the lme4 package. This is the most commonly used package for estimating multi-level models in R. There are others, but they follow a similar syntax. And we're going to use our standard lm formula with one small twist. To estimate a linear multi-level model, I use the lmer() function. So it's just lm plus er, and in this case m-e-r stands for mixed-effects regression. So that's a linear mixed-effects regression of y tilde x plus, I apologize our font doesn't have serifs, but that's a 1 and then a bar and a j. And what that's telling R to do is to estimate a linear regression with an outcome y and predictor x, plus a random intercept for each level of the categorical variable j. That 1 bar j says estimate random intercepts for each level of j. Data equals mydata. Now lme4 extends to GLMs, so we can estimate linear multilevel models or we can estimate generalized linear multilevel models. So we could estimate a multilevel Poisson model if we think there might be clustering. Let's say we want to estimate a multilevel slope and intercept model. We could estimate that model with the outcome y, the predictor x, with multilevel slopes for the variable i by unit j. So what the model on this bottom row is going to do is it's going to estimate not only an intercept for each level of the cluster j, but also a slope; let's say i is our time unit here. Now each county in our data, if we stick with our county-year kind of a toy example, each county will also get, in addition to an intercept, a slope that's unique to that county over time. Now these formulas can get really complex, and it's a little confusing when you first get started with it. There's a really great explainer, sorry, on Stack Exchange for lme4. If you just Google search lme4 cheat sheet you'll get to it pretty quickly, but I've provided a URL to a thread on Stack Exchange that does a pretty good job explaining how this formula setup works. It's very similar to what we've done with the lm() function, with this additional syntax for how we specify the random component of the model. [ONSCREEN CONTENT SLIDE 15] Over to RSTUDIO [Frank Edwards] All right, so let's go over to RStudio and actually estimate some of these things. All right, so what we're going to do is we're going to pull in our data. We're always going to start with tidyverse.
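[EDITOR'S NOTE] Before the live demo, a minimal sketch of the lme4 formula syntax described on slide 14 may help. The objects here (mydata, with outcome y, predictor x, time variable t, and cluster identifier j) are hypothetical placeholders, not the NDACAN files used later in the session.
library(lme4)
# random intercept for each level of cluster j
m_int <- lmer(y ~ x + (1 | j), data = mydata)
# random intercept plus a random slope on t, both varying by cluster j
m_slp <- lmer(y ~ x + t + (t | j), data = mydata)
# the generalized version: a multilevel logistic regression
m_glm <- glmer(y ~ x + (1 | j), data = mydata, family = "binomial")
summary(m_int)   # fixed effects plus the variance of the random intercepts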
We're going to fix this minor typo, and I apologize, I realized that I didn't library in lme4 anywhere. I need to put a library(lme4) in here. [ONSCREEN] > ### leaRn week 8 > ### regression intro > > library(tidyverse) ── Attaching core tidyverse packages ── tidyverse 2.0.0 ── ✔ dplyr 1.1.4 ✔ readr 2.1.5 ✔ forcats 1.0.0 ✔ stringr 1.5.1 ✔ ggplot2 3.5.1 ✔ tibble 3.2.1 ✔ lubridate 1.9.4 ✔ tidyr 1.3.1 ✔ purrr 1.0.4 ── Conflicts ─── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() > library(lme4) Loading required package: Matrix Attaching package: ‘Matrix’ The following objects are masked from ‘package:tidyr’: expand, pack, unpack > [Frank Edwards] So let's just parse our data real quick. I'm going to put my cursor here, and I'm on a Mac, so I'm going to hit command option B to run all code above line 13; on PC that's control B. It's a nice hotkey to use when we want to restart what we're doing. [ONSCREEN] > ### read data > afcars_agg<-read_csv("./data/afcars_aggreg_suppressed.csv") Rows: 4100 Columns: 10 ── Column specification ──── Delimiter: "," chr (1): sex dbl (9): fy, state, raceethn, numchild, phyabuse, sexabuse, neglect,... > afcars_ind<-read_csv("./data/afcars_2016_2019_indv_fake.csv", + na = c("NULL", "")) # to specify the values to + treat as missing Rows: 269346 Columns: 20 ── Column specification ──── Delimiter: "," dbl (20): fy, id_num, stnum, sex_f, raceth_f, phyabuse_f, sexabuse_f... > > census<-read_csv("./data/census_2015_2019.csv") Rows: 6120 Columns: 8 ── Column specification ──── Delimiter: "," chr (2): state, st dbl (6): cy, stfips, sex, race6, hisp, pop [Frank Edwards] And then I'm going to go to line 15 and hit command return to look at the head of afcars_agg to give us a sense of what's inside of this table. We'll do the same for afcars_ind and the same for census. [ONSCREEN] > # take a look > head(afcars_agg) # A tibble: 6 × 10 fy state sex raceethn numchild phyabuse sexabuse neglect entered 1 2015 1 1 1 2180 352 88 568 1073 2 2015 1 1 2 1245 198 46 331 565 3 2015 1 1 4 10 NA 0 NA NA 4 2015 1 1 5 NA 0 0 NA NA 5 2015 1 1 6 245 30 NA 71 87 6 2015 1 1 7 204 56 22 60 92 # ℹ 1 more variable: exited > head(afcars_ind) # A tibble: 6 × 20 fy id_num stnum sex_f raceth_f phyabuse_f sexabuse_f neglect_f 1 2016 8422 1 2 7 0 0 1 2 2016 8423 1 1 7 0 0 1 3 2016 8424 1 1 7 0 0 0 4 2016 8425 1 1 1 1 0 1 5 2016 8426 1 1 7 0 0 1 6 2016 8427 1 2 7 0 0 1 # ℹ 12 more variables: prtsjail_f , aaparent_f , # daparent_f , curplset_f , ctkfamst_f , # fosfamst_f , agefirstrem_f , ageatlatrem_f , # inatend_f , inatstart_f , entered_f , exited_f > head(census) # A tibble: 6 × 8 cy stfips state st sex race6 hisp pop 1 2015 1 Alabama AL 1 1 0 331305 2 2015 1 Alabama AL 1 1 1 33105 3 2015 1 Alabama AL 1 2 0 164122 4 2015 1 Alabama AL 1 2 1 2763 5 2015 1 Alabama AL 1 3 0 2540 6 2015 1 Alabama AL 1 3 1 1195 [Frank Edwards] So that just tells us what's going on. Now what I'm going to do is I want to join these tables together. We're going to estimate one model using the AFCARS individual-level data and another set of models using the AFCARS aggregated data. For the AFCARS aggregated data, I'm going to want to join it to the census, so I need to do a little bit of cleaning work. And this is the kind of stuff we covered in the first few weeks of our LeaRn sessions. You can always go back and check those videos out.
But also, Hadley Wickham's "Introduction to the Tidyverse" book is a stellar book for learning how to work with the kinds of data manipulation tools that I'm showing here. [ONSCREEN] > ## harmonize names > afcars_agg<-afcars_agg %>% + rename(year = fy, + stfips = state) |> + group_by(year, stfips) |> + summarize(entered = sum(entered, na.rm=T), + exited = sum(exited, na.rm=T)) `summarise()` has grouped output by 'year'. You can override using the `.groups` argument. > census<-census %>% + rename(year = cy) |> + group_by(year, stfips) |> + summarize(pop = sum(pop)) `summarise()` has grouped output by 'year'. You can override using the `.groups` argument. > afcars_agg<-afcars_agg |> + left_join(census) Joining with `by = join_by(year, stfips)` [Frank Edwards] So we are renaming some variables in afcars_agg, and we're grouping by year and state to sum over all racial groups in the data, to create a total sum of the number of children in our simulated data that both entered and exited foster care. In the census, we're grouping by year and state to sum the total population size for those states by year, and then we're joining those two tables together with afcars_agg. Let's confirm that looks right, and it does. [ONSCREEN] > afcars_agg # A tibble: 260 × 5 # Groups: year [5] year stfips entered exited pop 1 2015 1 3536 2914 1103159 2 2015 2 1483 922 184134 3 2015 4 12553 9880 1629765 4 2015 5 4009 2989 706879 5 2015 6 31258 26832 9118819 6 2015 8 4733 4025 1258312 7 2015 9 1727 1279 762174 8 2015 10 375 227 203922 9 2015 11 412 315 119177 10 2015 12 17513 13531 4104337 # ℹ 250 more rows # ℹ Use `print(n = ...)` to see more rows [Frank Edwards] So now we have a data set at the state-year level where we have entries in column three, exits in column four, and the total size of the population in column five. Looking good to me. Now we're going to go back and look at this afcars_ind again, which is right here, and the first model I want to look at is thinking about the relationship between a set of predictors like sex, race-ethnicity, age at last removal, and whether parents were known to be incarcerated, and a neglect allegation on that child's foster care case. So we're going to estimate this as a logistic regression model where neglect is our outcome, and we can see that this is a binary outcome, ones and zeros here. I did have to change how I parsed this. There are some variables that have the label NULL in the individual file, so I told R, in its parsing of the file, to treat those NULLs as missing values. So you can add this na = argument onto your read_csv() to tell R how to parse things that you want it to treat as missing. Let's just go ahead and estimate this model real quick. [ONSCREEN] > m1 <- glm(neglect_f ~ factor(sex_f) + factor(raceth_f) + ageatlatrem_f + + prtsjail_f, + data = afcars_ind, + family = "binomial") [Frank Edwards] Alright, so now we have something called m1, an object now in our environment. If we look in our environment tab, we can see it right here. It tells us it's a list, but we have a new object called m1, that's great. We're going to look at it. summary() is an easy way to look at model output in R, and so this will give us our old friend, the regression table. [ONSCREEN] > summary(m1) Call: glm(formula = neglect_f ~ factor(sex_f) + factor(raceth_f) + ageatlatrem_f + prtsjail_f, family = "binomial", data = afcars_ind) Coefficients: Estimate Std.
Error z value Pr(>|z|) (Intercept) 0.9762749 0.0096942 100.707 < 2e-16 *** factor(sex_f)2 0.0002115 0.0084218 0.025 0.980 factor(raceth_f)2 0.0116765 0.0107485 1.086 0.277 factor(raceth_f)3 0.8044364 0.0509404 15.792 < 2e-16 *** factor(raceth_f)4 0.4202121 0.0515342 8.154 3.52e-16 *** factor(raceth_f)5 0.2801761 0.1162915 2.409 0.016 * factor(raceth_f)6 0.0172221 0.0195646 0.880 0.379 factor(raceth_f)7 0.4116020 0.0106327 38.711 < 2e-16 *** factor(raceth_f)99 0.5526278 0.0326007 16.951 < 2e-16 *** ageatlatrem_f -0.0645091 0.0008172 -78.935 < 2e-16 *** prtsjail_f -0.6949897 0.0182478 -38.086 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 336255 on 264946 degrees of freedom Residual deviance: 326445 on 264936 degrees of freedom (4399 observations deleted due to missingness) AIC: 326467 Number of Fisher Scoring iterations: 4 [Frank Edwards] Now you have to be careful with logistic regression models, because the logit link function means that our interpretation of these coefficients is wildly different than it was under the linear model. So now, for example, if we look at the age at last removal variable (ageatlatrem_f), what we find is that for each additional year of age that a child has, their log odds of having a neglect allegation on their case are about 0.06 lower. What this is telling us substantively is that older children in general less frequently have neglect allegations on their files than do younger children in this data. But that 0.06 is incredibly difficult to interpret on its own. It's a change in the log odds. What people will often do is exponentiate this number in order to turn it into an odds ratio. I recommend using the predict() function in R with the type = "response" argument to convert it onto a straight probability scale. That'll minimize confusion generally. I really am not a fan of looking at regression tables like this. I think a visual often does a much better job here. So I'm going to use the dotwhisker package to generate a quick dot-and-whisker plot that will visualize our regression point estimates for beta and the confidence intervals for beta. [ONSCREEN] A dot-and-whisker plot of coefficients for factors like sex, race-ethnicity categories, age at removal, and parents in jail, with values from -1 to 1. Each plotted point has error bars, called whiskers. A dotted line appears at value zero. > library(dotwhisker) > dwplot(m1, ci_method = "wald", + vline = geom_vline(xintercept = 0, lty=2)) [Frank Edwards] We've also provided a dotted line here, and this will tell us about statistical significance. If the confidence interval crosses the dotted line, the result is not statistically significant at conventional thresholds; if it doesn't, it is. And these are 95% intervals. So we can see pretty clear relationships across a number of the measures with neglect allegations, and what this is telling us is that the average level of neglect, the incidence of a neglect flag on a child's file, differs meaningfully across these categories in the data. Right? We're not using a causal approach here, we're using a descriptive approach to regression, but it works fairly well, and we're seeing a lot of patterns in the data. Next, I want to show you a Poisson model really quickly. So, if we remember the aggregate data that we created, we have entered and exited here.
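[EDITOR'S NOTE] A minimal sketch of the conversions Frank mentions, assuming the m1 object estimated above; because the data are simulated, no specific values are shown here.
# exponentiate the coefficients to get odds ratios
exp(coef(m1))
# Wald confidence intervals, also on the odds ratio scale
exp(confint.default(m1))
# predicted probability of a neglect flag for each (non-missing) case
p_hat <- predict(m1, type = "response")
summary(p_hat)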
And so what I want to ask is whether entries into foster care are related to exits from foster care. That is, do places with more exits have more entries, and do places with fewer exits have fewer entries? That would be a positive relationship. Or, you know, is there potentially a negative relationship, where more exits mean fewer entries? That would be interesting. Let's take a look at it. We're going to estimate the model m2, and then I'm going to use summary() to take a look at it, and what we see pretty clearly is that places with more exits tend to also have more entries, perhaps unsurprising. [ONSCREEN] > m2<-glm(entered ~ exited, + data = afcars_agg, + family = "poisson") > summary(m2) Call: glm(formula = entered ~ exited, family = "poisson", data = afcars_agg) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 7.842e+00 1.276e-03 6143 <2e-16 *** exited 1.099e-04 1.043e-07 1054 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 1020358 on 259 degrees of freedom Residual deviance: 235131 on 258 degrees of freedom AIC: 237715 Number of Fisher Scoring iterations: 4 [Frank Edwards] What we're doing here, though, is effectively ignoring population: we haven't included any information about population size in the model, right? And exits and entries are both going to be deeply influenced by the size of the underlying population. One way we can address that in a Poisson regression is with an offset. And what an offset does is that, effectively, because of our log link function, if we take the log of some exposure variable, like population size, then because of the rules of logarithms we end up with a model where our outcome becomes entered over population. So it becomes a rate model rather than a count model, which can be really useful to look at. [ONSCREEN] > m3<-glm(entered ~ exited, + offset = log(pop), + data = afcars_agg, + family = "poisson") > summary(m3) Call: glm(formula = entered ~ exited, family = "poisson", data = afcars_agg, offset = log(pop)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -5.598e+00 1.377e-03 -4064.99 <2e-16 *** exited -5.443e-06 1.190e-07 -45.75 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 211388 on 254 degrees of freedom Residual deviance: 209263 on 253 degrees of freedom (5 observations deleted due to missingness) AIC: 211806 Number of Fisher Scoring iterations: 4 [Frank Edwards] And here, we actually find something kind of interesting. After we adjust for population size, that is, when we look at entries per population, we actually do see a small negative relationship. That is, for every one additional exit, we expect a very slightly lower entry rate. I don't know how far we want to take this. Again, I'm mostly estimating these models for us to learn how the software works, but there could be some interesting dynamic between entries and exits. We do know that, you know, I think of the work of someone like Fred Wolkson, who's shown some really clear relationships between the capacity of a system and the number of entries that are generated within the system. Okay, and I did call library(lme4) at the appropriate time. Now we're going to pivot to some multi-level models. If you don't have lme4 installed, you can just run this install.packages() line to install it.
We don't need to do that more than once for each installation of R, though, so we can just run the library() line here. [ONSCREEN] > #install.packages("lme4") > library(lme4) [Frank Edwards] I want to go ahead and set up stnum in afcars_ind. If we go back up, afcars_ind includes a, where is it? It's actually not showing up on my screen output, but there is a variable, no, there it is: stnum. This is a unique identifier for a state. It's a state FIPS code, and it looks numeric right now, but I want to make sure that R handles this as a categorical variable, not as a numeric variable, so I'm gonna force it into a factor. So we'll use mutate to force that state identifier into a factor. And I'm gonna compute entry rates per thousand population, as entered over pop times a thousand, and exits per thousand population, as exited over pop times a thousand, and we're also going to center our year measure. When we're estimating multilevel models, the estimation can get fairly complex, and when we have measures on very different scales it can cause trouble. Our entries and exits per thousand are going to be roughly on a zero to 10 scale, but if we throw year in there, which in our data runs from 2015 to 2019, it's on a really, really different register. So we're just gonna center it at zero by subtracting off, well, not quite centering it, but we're going to do a linear transform of it to make it begin at zero, so it's on a similar scale to our entry and exit measures. [ONSCREEN] > afcars_ind <- afcars_ind |> + mutate(stnum = factor(stnum)) [Frank Edwards] Another approach that we can use is a scale transformation using the scale() function, which will do a mean-centered Z transform. That is, it'll subtract off the mean from a vector and divide it by its standard deviation to turn everything into Z scores. And if you ever run into convergence issues with multi-level models, or if you estimate Bayesian models, those Z transformations can be incredibly helpful for improving model convergence. They don't change the relationships in the data, but they do change how the parameters look. [ONSCREEN] > afcars_agg <- afcars_agg |> + mutate(entered_per1000 = entered / pop * 1000, + exited_per1000 = exited / pop * 1000, + year_c = year - 2015) [Frank Edwards] Alright, so we're going to estimate our first model, m4_ml, right here. We're going to use lmer() to estimate entries per thousand as a function of exits per thousand, plus a random intercept for each state and a random intercept for each year. This is typically how I like to handle time variation. [ONSCREEN] > # random intercepts for state and year > m4_ml <- lmer(entered_per1000 ~ exited_per1000 + + (1|stfips) + (1|year_c), + data = afcars_agg) [Frank Edwards] Rather than assuming a linear relationship between time and the outcome, we can use intercepts to flexibly fit whatever is going on over time. We don't need to impose any kind of assumption that it must be increasing or decreasing in a consistent way over time. Instead, we're just going to use the data to estimate an intercept specific to each year and allow that to vary flexibly over time. Let's take a quick look at this output. [ONSCREEN] > summary(m4_ml) Linear mixed model fit by REML ['lmerMod'] Formula: entered_per1000 ~ exited_per1000 + (1 | stfips) + (1 | year_c) Data: afcars_agg REML criterion at convergence: 478.2 Scaled residuals: Min 1Q Median 3Q Max -3.2890 -0.4439 -0.0844 0.3803 3.1290 Random effects: Groups Name Variance Std.Dev.
stfips (Intercept) 0.14502 0.3808 year_c (Intercept) 0.04374 0.2091 Residual 0.27814 0.5274 Number of obs: 255, groups: stfips, 51; year_c, 5 Fixed effects: Estimate Std. Error t value (Intercept) 0.44667 0.16732 2.669 exited_per1000 1.04928 0.03347 31.353 Correlation of Fixed Effects: (Intr) extd_pr1000 -0.740 [Frank Edwards] So what do we have here? We have some new components. First, we can see, as before, that we do see a clear relationship between exits and entries, significant at conventional thresholds. But we have a whole new component to our output now: this random effects portion of our table. And what this tells us is the variance of each of our random effects components. So we can see at a glance that, after including everything that we've included in the model, we still have a .52 residual standard deviation. But our state intercepts have accounted for quite a lot of variation, as have our time intercepts. We can think about this as: after we adjust for time-stable differences across states and national differences over time, how much variation is there still remaining among state-year observations? And there's still quite a lot, but the multilevel structure of our model has accounted for quite a lot of extra variation that we didn't account for in the non-multilevel version of a model like this. Finally, we'll estimate a model that includes random slopes. So now we've changed our specification of the random effect structure a little bit. Instead of having an intercept for each year, now I'm going to have an intercept for each state and a slope for each state that's a function of time, right? So that is, each state will have effectively its own linear growth trajectory. We've included year_c here as a predictor in the main model to estimate a national linear growth trajectory, so that these individual state slopes can then be interpreted as deviations from the national trend. So let's estimate that and take a look at what we get. [ONSCREEN] > # random intercepts and slopes by state (over time) > m5_ml <- lmer(entered_per1000 ~ exited_per1000 + year_c + + (year_c|stfips), + data = afcars_agg) > summary(m5_ml) Linear mixed model fit by REML ['lmerMod'] Formula: entered_per1000 ~ exited_per1000 + year_c + (year_c | stfips) Data: afcars_agg REML criterion at convergence: 452.8 Scaled residuals: Min 1Q Median 3Q Max -2.7914 -0.3762 -0.0291 0.3936 3.5905 Random effects: Groups Name Variance Std.Dev. Corr stfips (Intercept) 0.34082 0.5838 year_c 0.03184 0.1784 -0.77 Residual 0.20248 0.4500 Number of obs: 255, groups: stfips, 51 Fixed effects: Estimate Std. Error t value (Intercept) 0.66652 0.14754 4.517 exited_per1000 1.06412 0.03223 33.020 year_c -0.13738 0.03212 -4.277 Correlation of Fixed Effects: (Intr) e_1000 extd_pr1000 -0.764 year_c -0.421 -0.101 [Frank Edwards] All right. So, any time I see a standard deviation here that's meaningfully non-zero, it's telling me that that component is accounting for some pattern in the data. So in this case we estimate substantial variation in these slopes across states. We still see substantial variation in the state-level intercepts, with some residual variation. This correlation here is telling us about the correlation between those estimated state intercepts and slopes. We're actually assuming a fairly complex variance structure in the relationships between these intercepts and slopes, and this is giving us a little bit of information about that.
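[EDITOR'S NOTE] To look at those estimated state intercepts and slopes directly, a minimal sketch assuming the m5_ml object fit above:
# variance components, the same numbers reported in the summary
VarCorr(m5_ml)
# estimated deviations from the national trend: one intercept and one year_c slope per state
head(ranef(m5_ml)$stfips)
# combined (fixed + random) coefficients for each state
head(coef(m5_ml)$stfips)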
And we still see a similar relationship for exits. If we want, we can use goodness-of-fit comparisons to think about whether model four or model five is providing us with a better fit to the data. [ONSCREEN] > # compare fits > BIC(m4_ml, m5_ml) df BIC m4_ml 5 505.9375 m5_ml 7 491.5529 > [Frank Edwards] This might be a useful question if we're asking: do we need slopes? Do we need intercepts? What kind of structure works best here? And in this case, we're seeing a substantial improvement in fit for our model with random slopes for each state. A BIC difference of, you know, about 15 here is generally fairly meaningful. So in this case, I would say that model five is fitting the data better than model four, and if I were using model fit as my primary criterion for selection, I'd probably prefer model five at this point. So that's all I have for today. [Paige Logan Prater] Thanks Frank. [VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an Office of the Administration for Children and Families. [MUSIC]