[Music][VOICEOVER] National Data Archive on Child Abuse and Neglect. [Clayton Covington] All right, thank you and welcome everyone to the NDACAN Summer Training Series for 2024, and specifically welcome to this week's presentation entitled "Assessing Reporting Issues in NCANDS and AFCARS," which will be led by Research Associate and Assistant Professor Dr. Alexander Roehrkasse of Butler University. Next slide please. So just to give you all a little overview of where we've been and where we're going with this series: we started last week with one of our several sessions focusing on the strengths and limitations of various data sets. We focused on NCANDS last week, and this week we're going to be focusing on, like I said, assessing reporting issues, before we turn back to AFCARS and look at some survey design and weighting. And again, the general theme of this year's series is best practice in the use of NDACAN data. So these are ways to refine and improve your various analyses and some things to keep in mind as you're conducting analyses with NDACAN data holdings. Moving toward August we'll be reviewing the recently released NSCAW 3 data set before wrapping up with a strengths and limitations session focused on the NYTD data set. Next slide please. All right, so without further ado I will turn it over to Dr. Roehrkasse. [Alexander F. Roehrkasse] Thank you Clayton. It's a pleasure to be here. [Clayton Covington] Sorry, one final thing that I forgot to mention. For all of our attendees, the way that this session is going to work: if and when you have questions, please use the Q&A chat, which is at the bottom of your screens on Zoom. We're going to hold all of our questions to the end, and then I will read them aloud and we'll answer them at the end. Sorry about that. [Alexander F. Roehrkasse] No worries at all. Thanks Clayton. It's a pleasure to be here. I always enjoy doing these despite not being able to see anyone's face or you not being able to see me. Today we'll be talking about assessing reporting issues in NCANDS and AFCARS, the Archive's two main administrative data sets. There's going to be some overlap with the strengths and weaknesses presentations on each of those data sets, but today is not really going to be a summary assessment of all of the different reporting issues in those data sets. That's really beyond the scope of a one-hour presentation, but it also wouldn't be that helpful, because which issues are really important or relevant for you is going to depend significantly on what your research goals are. So I will be illustrating some of the most common and some of the least well understood reporting issues in the data sets. But mostly I'm going to be illustrating some techniques, some strategies for doing your own assessments of these reporting issues, so you can implement them in your own research and identify those reporting issues that are going to be most relevant to you. I'll talk a little bit about missing data, although I'll say we have quite a bit of material on missing data available through the Archive. We've done a number of presentations on missing data in Archive data; you can find those on the Archive website. So I'll be giving a sort of conceptual overview of missing data that will help us talk through other reporting issues, but I won't today be laying out specific missing data strategies. I'll talk a little bit about measurement error.
So we have a problem obviously if we have missing data, but often we have the data we want, or the data we think we want, and in fact those data are measured with error. Just because we have the information doesn't mean it's true, so I'll talk a little bit about different strategies for trying to identify measurement error. Then I'll talk a little bit about record linkage failure. Similarly to missing data, I won't be going through the specific techniques of record linkage but rather talking about why we link records. Record linkage is one of the most important assets in our Archive administrative data, but it's not always straightforward. There are some issues that can arise in linking records, so I'll give a brief overview of just how possible it is to link records in Archive data and what some of the consequences of failed linkage are. And then lastly I'll walk through pretty briefly just a subset of these strategies in R, just to give you some tools, some coding tools, for doing your own assessments of reporting issues. Okay, so first let's talk about missing data, and this will be review for some of you I'm sure. Whenever we're thinking about missing data we want to think about the mechanism or process that leads those data to be missing in the first place, and we often talk about this as a missing data mechanism. So whenever we notice missing data in our data set we want to ask: what is the missing data mechanism? What's the process that led these data to be missing? Generally speaking there are three kinds of missing data mechanisms. The first is that our data can be missing completely at random. Let's just say it really is truly random whether any piece of information is missing or not. I like to think about the caseworker who maybe was short on time and didn't have enough time to make a cup of coffee or go pick up a cup of coffee, and so they were a little foggy that morning and might have forgotten to enter some information into a case report. That was kind of a one-off thing, or maybe it happens once or twice a month or once or twice a year, but it's not like it happens every day, it's not like it happens for every caseworker in that department, it's not like it happened a lot more in 2012 than in 2013; it's just kind of a random thing. For these kinds of scenarios it's often permissible just to drop observations with missing data, at least insofar as we're using these data in the estimation of statistical models. So let's say you had some missing data and you were trying to estimate a regression model in R or in Stata or in SAS: the default strategy in most of these statistical software packages is to drop observations with missing data. If your data are missing completely at random, generally speaking that will be a defensible strategy. It will yield unbiased estimates, although your results may be statistically underpowered. This is not the case for counting exercises: anytime we're trying to count the number of reports that have a certain property, if we're dropping observations that lack information about that property, we'll be systematically undercounting things with that property. So it's important to clarify that listwise deletion is okay if we're estimating statistical models under conditions of random missingness. But if we're trying to count things, as we often are, we'll need to develop some other strategy.
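To make that concrete, here is a minimal sketch in R using simulated data (not NDACAN data) of how default listwise deletion behaves differently in a regression than in a counting exercise; the variable names are made up for illustration.

# Sketch with simulated data (not NDACAN data): under MCAR, listwise
# deletion is defensible for model estimation but biases counts downward.
set.seed(1)
toy <- data.frame(
  victim = rbinom(1000, 1, 0.3),              # property we want to count
  age    = sample(0:17, 1000, replace = TRUE)
)
toy$victim[sample(1000, 100)] <- NA            # 10% missing completely at random

# glm() silently drops rows with missing values (na.action = na.omit);
# under MCAR this yields unbiased, if less precise, estimates.
summary(glm(victim ~ age, data = toy, family = binomial))

# Counting with the same deletion strategy undercounts the true total:
sum(toy$victim, na.rm = TRUE)                               # observed count only
sum(toy$victim, na.rm = TRUE) / mean(!is.na(toy$victim))    # one simple MCAR rescaling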
The second family of missing data mechanisms is missing at random. Here our data are not missing completely at random; they are missing systematically, but systematically with respect to factors that we observe. The example I like to think of here is: say in Texas between 2000 and 2009 it was the policy of CPS agencies in that state not to record information about child behavioral risk factors. So that's pretty systematic, right? We're talking about all CPS agencies in Texas for a defined period of time. So it's not really random, it's pretty systematic, but it's systematic with respect to factors that we observe entirely. We know whether records come from Texas. We know whether they came in 2000 or 2001 or 2009 or 2010. And so they're non-random with respect to observable factors, which means we can use those observable factors to make responsible guesses about what the true underlying values might be. In these scenarios we might use strategies like multiple imputation models, or maximum likelihood estimation, or Bayesian methods to develop imputations, basically statistical guesses about the true underlying value, and these under certain assumptions can lead to valid estimates of the true underlying values. The third family of missing data mechanisms, though, by far the most problematic, is when our data are missing not at random. This is when the missingness itself is predicted by unobservable factors, most often the value of the missing variable itself. The example I like to think of most often is a child whose ethnoracial identity is somewhat ambiguous. If caseworkers are more likely not to record children's race or ethnicity when they don't know a child's race or ethnicity, this will lead children with multiple racial and ethnic identities or highly minoritized ethnoracial identities to be more likely to have their race and ethnicity missing. This is to say certain children with certain races and ethnicities are more likely to have that information missing than children with other races and ethnicities. The missingness itself is a function of the true underlying value. In this scenario, which is very often the case, none of these other strategies like missing data imputation, maximum likelihood, Bayesian methods, or listwise deletion is going to yield valid estimates of the true underlying value. In these scenarios we either need to collect more data, or reformulate our research question, or change the scope of our research study. Okay, so this is the sort of conceptual framework we use when thinking about missing data in research. Overwhelmingly when we're talking about missing data we're talking about missing values: information about an observation that we do observe, but that is missing. So maybe we observe a child or a child report and we have information about a number of different variables in that report, but for a few variables there's no value; we have a missing value for an observed observation. Sometimes, though, we fail to observe a unit of analysis altogether, and I'm going to call these missing observations. Things we should have observed, cases that actually occurred in the real world, but which we have failed to observe altogether. So this is a little bit different than a missing value, a single piece of information about an observation that we do see. This is a whole observation that we fail to see.
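As a rough illustration of the missing-at-random case just described, here is a minimal sketch of a multiple-imputation workflow using the mice package and simulated data; the presentation doesn't prescribe a particular package, so treat this as one option among several (maximum likelihood and Bayesian approaches are others).

# Sketch of multiple imputation under a missing-at-random assumption,
# using the mice package and simulated data (not NDACAN data).
# install.packages('mice')   # if not already installed
library(mice)

set.seed(2)
toy <- data.frame(
  state = factor(sample(c('TX', 'OK', 'NM'), 2000, replace = TRUE)),
  year  = sample(2000:2010, 2000, replace = TRUE),
  behav = rbinom(2000, 1, 0.2)                 # child behavioral risk factor
)
# Make behav missing systematically with respect to observed factors
# (all Texas records before 2010), i.e., missing at random:
toy$behav[toy$state == 'TX' & toy$year < 2010] <- NA

imp <- mice(toy, m = 5, printFlag = FALSE)     # five imputed data sets
fit <- with(imp, glm(behav ~ year, family = binomial))
pool(fit)                                      # combine estimates across imputations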
In my experience people tend to think a little bit less often about missing observations because they're not right there in your data; by definition they're actually not in your data, and so we have to be a little more careful to think creatively about what we might not actually even be capturing in our data. I want to talk about two sources of missing observations that can arise in the use of administrative data, specifically NDACAN administrative data, and some strategies for dealing with them. The first is state-level non-reporting, that is to say, states that just don't report to our data systems at all in a particular year. Now fortunately, when we're talking about AFCARS, AFCARS is a mandatory reporting system, and so at least for the 21st century, all states are reporting to AFCARS in each year. The same is not true for NCANDS, which is a voluntary reporting system. So in this table what you can see in the first column is the data year or fiscal year, the year to which the data are supposed to correspond. Moving to the right you can see the column "States reporting," which indicates the number of states that reported to NCANDS in that particular fiscal year. If you jump to the bottom and you look at the year 2022, you'll see that all 52 jurisdictions, the 50 states plus DC and Puerto Rico, reported to NCANDS, so that's a good thing: we had complete reporting in 2022. You'll see, though, that in 2021 Arizona did not report to NCANDS, and going back further in time, it looks like before 2012 a larger number of states were not regularly reporting to NCANDS. What this means is that if you're trying to do analyses that generalize to the entire United States and include years in which certain states are not reporting to NCANDS, your results will not be valid estimates for the whole United States. So what do we do? We may need to circumscribe our study. So instead of making claims about the entire United States, just make claims about the states that we observe. Or we might circumscribe the historical scope of our analysis, just focusing on years where we do observe the whole United States. Or we can treat this as a missing data problem and try to use different methods, multiple imputation for example, to try to estimate the number of observations we may be missing from Oregon or North Dakota or Michigan in any given year. Okay, a second source of missing observations is that in NCANDS there is a discrepancy between the year in which CPS agencies report data to the Children's Bureau and the year in which those reports are actually made. Let me try to state that a little more clearly. Each NCANDS Child File is supposed to correspond to a federal fiscal year. More or less, the data in any given NCANDS Child File contain all those maltreatment reports that were made within that federal fiscal year. But there are exceptions to that. So let me show you a graph that illustrates this. Along the horizontal axis are dates for the 2019 federal fiscal year. So on the left hand side is the beginning of the 2019 federal fiscal year; on the right hand side is the end of the 2019 federal fiscal year. The bars are the number of reports that are made in each half-month period. Report dates in NCANDS are coded to the half month, and so each bar represents a half month of child reports. What you can see is that there's variation over each half month in terms of precisely how many reports are made, but on the whole you can see a sort of average.
If you look down to the legend, though, you'll see that the blue portions of the bars come from the 2019 NCANDS Child File but the red portions come from the 2020 NCANDS Child File. What's going on here? Well, as we approach the end of the federal fiscal year there are a number of child reports that just don't make it into the 2019 NCANDS Child File. For whatever administrative reason there's just not enough time to squeeze them in, and so they end up getting reported to the Children's Bureau in the 2020 NCANDS Child File even though they occurred in the 2019 federal fiscal year. Practically speaking, what this means is that if we want to study the 2019 federal fiscal year and we want to make sure that we capture all the child reports that actually occurred in that period, we need to use not only the 2019 NCANDS Child File but also the 2020 NCANDS Child File, and make sure that we include those child reports that actually occurred in federal fiscal year 2019 but didn't make it into NCANDS until the 2020 Child File submission year. When we move to a demonstration in R I'll show you precisely how to do some of these adjustments so that, for any given historical period we're interested in studying, we're making sure to include all the actual observations that occurred, not just those that made it into a specific submission file. Okay. Whenever we're thinking about missing observations or missing values, we want to understand that the missing data mechanism isn't usually directly observable. Sometimes we have some contextual information about precisely why data were missing, but often we don't, and we have to use tricks, tools, to make inferences about what the missing data mechanism is. It's important to remember that these missing data mechanisms can operate at the state level, at the county level, at the caseworker level, or at the child level. And so when we're trying to assess reporting issues in NCANDS and AFCARS we want to ask ourselves what information we can use to make inferences about what the missing data mechanism is. There are different strategies to do this, but for my money the single most helpful way is to examine patterns in our observed data: across time, across states and counties, across variables, etc. So most of the rest of my presentation is going to be showing you figures and tables that I would make if I were trying to assess reporting issues in NCANDS and AFCARS. We'll talk through them and try to understand how we would use such tables and figures to make inferences about missing data and about reporting issues. Okay, so this table shows, for the 2021 NCANDS Child File, the percentage of records that have nonmissing values, for each state, for a set of child risk factor variables. So what do I mean? In the first column you see we have just a few states, Alabama and Alaska down through Indiana, and then the columns represent different variables that are included in the NCANDS Child File, more specifically child risk factors: child alcohol abuse, child drug abuse, child emotional difficulties, child visual impairments, learning difficulties, physical disabilities, behavioral issues, medical issues. For Alabama you can see there are no missing values for any of these variables. In Alabama they observe this information completely. The same is true for Colorado. In contrast, in Idaho there's no information about these factors. All of these variables are missing in their entirety. Most states, though, fall somewhere in between.
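A table like this can be computed directly from a Child File. Here is a rough sketch in the style of the code at the end of this transcript; the file name and the risk-factor and state variable names are placeholders, so check the NCANDS codebook for the exact names in your extract, and note that it assumes missing values are stored as NA.

# Sketch: percent of records with nonmissing values, by state, for a set
# of child risk-factor variables. Variable and file names below are
# assumptions -- check the NCANDS codebook for the names in your file --
# and missing values are assumed to be stored as NA.
library(data.table)
library(tidyverse)

cf21 <- fread('cf_2021_anon_samp.csv')           # hypothetical 2021 Child File extract

riskvars <- c('cdalc', 'cddrug', 'cdemotnl', 'cdvisual',
              'cdlearn', 'cdphys', 'cdbehav', 'cdmedicl')   # assumed names

cf21 %>%
  group_by(staterr) %>%                           # staterr = state abbreviation (assumed)
  summarize(across(all_of(riskvars),
                   ~ round(100 * mean(!is.na(.x)), 1)),
            .groups = 'drop')

# A quick follow-up: does missingness co-occur within records (the flat
# 79% pattern discussed below)? Tally how many of the risk factors are
# missing on each record, by state.
cf21$n_miss <- rowSums(is.na(as.data.frame(cf21)[, riskvars]))
count(cf21, staterr, n_miss)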
What can we learn about the missing data mechanisms just by looking at a table like this? Well, we can see that it's clearly the case that in Alabama and Colorado it's the policy of CPS agencies in these states to record this information, and they're implementing that policy successfully. In Idaho, in contrast, I'd be willing to bet that it's not the case that there's a policy requiring caseworkers to record this information and no one is doing it; I'd be willing to bet instead that in Idaho it's simply not the official policy and practice of caseworkers to record this information. Let's look at Connecticut and Delaware. We can see that in Connecticut 79% of records contain information about alcohol abuse. 79% of records have information about drug abuse. 79% of records have information about emotional difficulties, etc. You can see that across all of these different variables precisely 79% of records have this information. What this tells me is that it tends not to be the case that any given child report has information about one of these variables but not the others. That tells us something about the missing data mechanism. It does not appear to be the case that certain caseworkers are just forgetting to record one variable or another. It seems to be the case that, systematically, some caseworkers are recording all of this information and other caseworkers are recording none of this information. Now this isn't conclusive, but it gives us some hints about where we might explore further. Maybe it's the case that some agencies record this information and others don't. That would lead to a sort of stable 79% hit rate. So we could then go to county-level data and see which counties have different levels of missing data, and that would be a way to further understand patterns of missingness in our data, what this missing data mechanism really is. We could look to county-level records, and we could also look to previous years: we could go back to the 2020 NCANDS Child File, the 2019 NCANDS Child File. How much are these missingness rates changing over time? These would be ways we would use this table to then deepen our investigation of what the missing data mechanisms are. Then of course when we look at states like Hawaii or Florida, where there's more variation in which variables get counted and which don't, it's harder to tell what's going on, and so further investigation would be particularly important in these states. If this is beyond the scope of your skills or interests or time, then the easiest solution in this case clearly would be to focus your analysis on Alabama and Colorado, those states for which we observe this information. If we're comfortable limiting the scope of our analysis to those jurisdictions where we have good data, that's always okay so long as our claims and inferences are right-sized. Okay, once we've identified missing values, what's the right way to deal with them? As I've said, we might limit the scope of our analysis to jurisdictions with good data. It's important to understand that what's going on here is that we're compromising our external validity. And so if we choose to study Alabama and Colorado alone, that's great, but we need to make sure that we're clear that we're only studying Alabama and Colorado. If we're trying to make inferences about the United States, our external validity will be compromised.
Alternatively we can use methods for estimating missing data, and this will help preserve some of the external validity of our results, but it comes at the cost of potentially compromising our internal validity. If we fail to estimate that model correctly, or that model incorporates assumptions that aren't true, then even for the jurisdictions where we do observe information our inferences will be invalid. Okay, let's move on now to talking about measurement error. Measurement error can be even trickier to identify than missing observations and missing values. Often we're sort of delighted, even overwhelmed, with how much information we have when we're using NDACAN administrative data, but just because we observe so much information doesn't mean it's true. We always have to be careful in assuming the truth of the information that we have at hand, and subject that information to appropriate tests so that we can develop justified confidence in its accuracy. Like missing data mechanisms, measurement error isn't often directly observable. When we're trying to think about measurement error it's important for you to get clarity about what the construct is that you're actually trying to measure. Let me give you an example. Sometimes the definition of different values will change over time: what it means to have a clinical disability, what it means to have a behavioral risk factor, what it means to have an emotional disturbance. If that definition changes, is that a problem for you in your analysis? Sometimes when we're doing analysis over time it's really important that we have the same stable definitions across time. Other times we want those definitions to change, if those definitions are changing in the real world. Whether or not this is a case of measurement error actually depends on what your question is, what you're really trying to investigate. Just as with missing data mechanisms, we want to ask ourselves what information we can use to make inferences about measurement error. And as before, for my money the best way to do this is again to look at patterns across time, across states and counties, across variables, and across repeated observations. So let's do that. This is a big figure, so let's walk through it slowly. Really it's a collection of 20 figures, one figure for each of 20 states, and which states we're looking at here isn't terribly important. On the horizontal axis of each plot, each little mini-plot, you can see the timespan 2010 to 2019. The vertical axis counts the number of entrances into foster care in that state in that year. You can see, though, that for each plot there are a number of different trend lines that have different colors, and if you look over to the legend on the right hand side you'll see that each color corresponds to a different ethnoracial group. The ethnoracial categorization scheme that I'm using here is one that is included in the AFCARS foster care files. It's a derived variable that comes from an underlying set of binary variables corresponding to each race and Hispanic ethnicity. You'll see here that I've plotted missing or unknown values as their own trend line, so we can get some sense of how much missingness is occurring in any given state. Now, when we're trying to assess reporting issues, either missing values or measurement error, in administrative data using this strategy, very often what we're looking for are large trend breaks: sudden changes in the number of entrances having a certain property.
If we see a huge trend break in a single year, that gives us some indication that something might be up. So let's scan the graph and see: are there any irregularities, anomalies, that give us concern? For the most part these trend lines are pretty stable. Of course they go up and down, but often when they go up and down they do it together, which gives us some confidence. If we look at Delaware (DE), which is in the second row and the fourth column, we'll see that there seems to be a pretty large increase in the number of Black or African-American children admitted into foster care in 2018. But if we look at the absolute number of children entering, it's pretty small, and for small counts like that there can be some instability. It looks like there are really only 11 more children than in the previous year. I'm not terribly concerned about that. Let's go down to the very bottom right hand corner and look at Maine; there may be something a little strange going on here. You'll notice that the number of children entering with missing ethnoracial information is going up in the last couple of years at the same time that Hispanic or Latino counts are going down a good bit. This might give us some concern that the underlying binary variable corresponding to Hispanic ethnicity might be causing some issues here. And so we could look further into the Hispanic binary variable in Maine in these years to try to understand how Maine was reporting Hispanic ethnicity over time. Let's look at a more problematic example. Same strategy here, looking at different states over the same period, but a different set of variables. Here we're looking at a variable about whether or not a child admitted into foster care has a clinically diagnosed disability. Now, clinically diagnosed disability is just going to be a messier category than a child's race or ethnicity. In many respects it's harder to determine; there are more likely to be variable definitions of what a clinical diagnosis or a disability is across time and across states. And so we should bring a little more wariness to this variable in any analysis we might incorporate it into. What do we see? Well, in some states these values seem to be pretty stable, but in other states some funny stuff is going on. If we look at Arkansas, for example, the top row, third column, there seem to be huge swings from year to year in the number of children entering foster care who have a clinical disability. And these aren't small differences; these are very large differences year to year. If we look at Illinois there's clearly a large number of years, say between 2011 and 2015, where no children entering foster care have the value of no clinical disability. That gives us some suspicion. Similarly, around 2016 there's a huge shift in the number of children who do and don't have documented clinical disabilities. In Indiana there's another huge shift, but in this case it's between children who are said not to have disabilities and children for whom such a determination has not yet been made. It's difficult to know what the appropriate strategy is here if you were to incorporate this variable into your analysis. It would depend on what your analytic goals were, what questions you were trying to answer, how you were using this variable in your analysis. But you can see that simply plotting trends in this variable over time and across states is a good way to start to understand some of the reporting issues that are characteristic of certain variables in administrative data.
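One way to move beyond eyeballing these panels is to flag large year-over-year swings programmatically. Here is a minimal sketch that builds on the fccount object constructed in section 2 of the code at the end of this transcript; the 50% threshold and the minimum count of 20 are arbitrary assumptions, not recommendations.

# Sketch: flag large year-over-year trend breaks in state-by-year counts
# of foster care entrances by ethnoracial group. fccount is built in
# section 2 of the appendix code below; the 50% threshold and minimum
# count of 20 are arbitrary choices for illustration.
fccount %>%
  group_by(St, RaceEthn) %>%
  arrange(FY, .by_group = TRUE) %>%
  mutate(pct_change = (n - lag(n)) / lag(n)) %>%
  filter(!is.na(pct_change),
         abs(pct_change) > 0.5,                # swing of more than 50 percent
         pmax(n, lag(n)) >= 20) %>%            # ignore very small counts
  arrange(St, RaceEthn, FY)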
Okay, let's talk a little bit about record linkage failure. One of the most important assets of Archive administrative data is that we're able to link multiple records of the same child as they come into contact with the system multiple times. So if a child is investigated for maltreatment multiple times, they're assigned a unique identifier and we're able to link those records of distinct investigations to the same child over time. If a child is investigated for maltreatment and then enters foster care, we can use a unique identifier to link their records in NCANDS and in AFCARS. If a child leaves foster care through adoption, we can use a unique identifier to link that child in the AFCARS foster care file and the AFCARS adoption file. But these linkages don't always work the way they're supposed to work. The most important caveat to understand here is that states assign unique child identifiers on their own terms. Practically speaking, this means that we can often identify children over time as long as they stay in the same state. But if children move out of state and come into contact with the system once again, they will be assigned an identifier in a new state that won't correspond to their old identifier in their state of previous residence. Even if a child stays in the same state, though, our ability to link them over time is sometimes limited. This table shows the percentages of children in the NCANDS Child File who were apparently able to be linked by Child ID and for whom date of birth and sex also match. Let me state that a little more clearly. I've colored green all of those state-years that have the value of 100. What this means is that any child in that state-year who we were able to link to another record in the previous or subsequent year, so any link that we were able to make using the Child ID variable, also had the same value of date of birth and sex. In other words, we linked that child using the Child ID variable, and then when we looked at date of birth and sex for the children we linked across time, they corresponded. And of course, if we think we're matching truly the same child over time, we would want their date of birth and their sex to match. I've colored red any of the cells that have zero. What this means is that we've been able to make a link to the previous or subsequent year using the Child ID variable, but none of those links have the same date of birth or the same sex. That is to say, we're getting false positive links. A state has reorganized its ID system so that we're linking children who aren't actually the same child. It's not always clear why these issues arise, but it has pretty significant consequences for our analysis if we don't take it into account. If we have false positive linkages, that is to say we link children who we think are the same child but are not in fact the same child, this is going to lead measures of the intensity or dosage of system contact to be biased upward, and any estimates of the prevalence of contact to be biased downward. On the other hand, if there are children who we fail to link, so children who do in fact have multiple points of contact, say they moved out of state and had another point of contact, if we fail to link those children, we have a sort of false negative linkage, and then our estimates of dosage will be biased downward and our estimates of prevalence biased upward.
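Here is a rough sketch of that kind of consistency check for links made across two adjacent Child Files. It assumes you've already read the two files into objects called cf20 and cf21, and the identifier, date-of-birth, and sex variable names (chid, dob, chsex, staterr) are placeholders; substitute whatever fields appear in your files.

# Sketch: among records linked across two adjacent years by child ID,
# what share also agree on date of birth and sex?
# Assumes cf20 and cf21 hold two adjacent Child Files; variable names
# (staterr, chid, dob, chsex) are placeholders for the fields in your files.
library(tidyverse)

links <- cf20 %>%
  select(staterr, chid, dob_20 = dob, sex_20 = chsex) %>%
  distinct() %>%
  inner_join(cf21 %>%
               select(staterr, chid, dob_21 = dob, sex_21 = chsex) %>%
               distinct(),
             by = c('staterr', 'chid'))

links %>%
  group_by(staterr) %>%
  summarize(n_links = n(),
            pct_consistent = round(
              100 * mean(dob_20 == dob_21 & sex_20 == sex_21, na.rm = TRUE), 1),
            .groups = 'drop')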
Let me give another example of a record linkage issue. As I said before, when children exit foster care into adoption, we should in principle be able to link their record in the AFCARS foster care file to the AFCARS adoption file. So on the left you'll see fiscal year 2021, on the right you'll see federal fiscal year 2020; there aren't huge differences between the two, so let's just focus on 2021. For a select number of jurisdictions you'll see in the first column the number of children who were recorded in the foster care file as having left foster care in that year into an adoption placement, so in Alabama 790 children. In the second column, the number of adoption file records: so in the AFCARS adoption file, how many children do we see entering adoption in that year? 792, so a pretty close number. In the third column, though, you'll see how many of those children have common child IDs, and very unfortunately the number is zero; that is to say, we fail entirely to match any children across those two data systems. In Alaska, in contrast, we're able to link 100% of all foster care file exits to adoption file records, although as you can see there are four adoption file records for which we don't have foster care file information. So this is another example of a failure to link as a result of uncommon child IDs, and here we would probably end up treating it as a missing data problem. So if we're studying foster care exits into adoption and we fail to link certain children, we'll fail to have certain information that's included only in the adoption file records, and we need to ask ourselves, okay, what's the missing data mechanism? Why is it that some children are failing to link and others are not? Clearly this is partly a function of state policy. Certain states have a 100% match, other states have a 0% match. But other states like Arizona, California, Colorado have some matches but not others, so what's going on here? Why do certain children match and others not? Okay, so that's my walkthrough of different strategies for assessing reporting issues in NCANDS and AFCARS, and now what I'd like to do is walk through a brief demonstration in R illustrating how to implement some of the strategies that I've just described. I've just pasted into the chat links to two files: a link to the presentation, which will eventually be posted on the NDACAN website, and, if you'd like to follow along now with the code that I'm walking through, a link to a Google Drive where you can download this R script and follow along and annotate it as you like. (https://drive.google.com/file/d/1ZVQyrc6mK32oCp8hOYaRRmM3DnKuMiAi/view?usp=sharing). The program written in R is included in the downloadable files for the slides and the transcript. As you can see I'm using RStudio, which is our recommended implementation of R, and I'll be walking through code over here on the left hand side and through different plots here on the right hand side. Okay, so the first thing we'll want to do is clear our environment; since we've just opened RStudio it's already clear. We're going to install a couple of packages. The data.table package is helpful for reading in certain data files, particularly large data files, very quickly, and the tidyverse is a suite of tools for managing data that I often recommend. I'm going to set a file path for my working directory and set a seed, and we're ready to get going.
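Before the live demonstration, one more sketch, corresponding to the adoption-linkage table above: matching foster care file exits to adoption against the adoption file by child ID. The object names (fc21, ad21), variable names (St, child_id, disreason), and the coding of an adoption exit are all assumptions for illustration; check the AFCARS codebooks for the actual fields and codes in your files.

# Sketch: how many FY2021 foster care file exits to adoption can be
# matched to FY2021 adoption file records by child ID?
# fc21, ad21, and the variable names/codes below are illustrative
# assumptions -- check the AFCARS codebooks for your files.
library(tidyverse)

fc_adopt_exits <- fc21 %>%                     # foster care file, FY2021
  filter(disreason == 'adoption') %>%          # exits to adoption (assumed coding)
  distinct(St, child_id)

ad_records <- ad21 %>%                         # adoption file, FY2021
  distinct(St, child_id)

fc_adopt_exits %>%
  left_join(ad_records %>% mutate(in_adoption_file = TRUE),
            by = c('St', 'child_id')) %>%
  group_by(St) %>%
  summarize(fc_exits  = n(),
            matched   = sum(!is.na(in_adoption_file)),
            pct_match = round(100 * matched / fc_exits, 1),
            .groups = 'drop')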
Okay, first what I'd like to do is demonstrate this problem that I described earlier of delayed reporting in NCANDS: the fact that in any given submission year, like the 2019 NCANDS Child File, not all child reports that actually occurred in the 2019 federal fiscal year are going to be included in that file. What do we do? Well, we pull in information from the 2020 submission file to round it out. So let's walk through how we actually do something like this. NCANDS data files are organized by the year in which maltreatment reports are submitted to NCANDS, which is different than the year in which the report actually occurs. How do we measure the number of reports that actually occur in fiscal year 2019? So I'm going to load in here a set of anonymized data; this is a 1% sample of the 2019 NCANDS Child File. I'm using a 1% sample because the full file is quite large, I'm working on my laptop, and I don't want to keep you waiting. I also want to clarify that all the data I'm using today are either fabricated or anonymized, so they'll in general demonstrate patterns that you will actually see in real data, but none of these data are actually real data, and so you shouldn't rely on any of the findings in my slides or in this code as actual true findings; they're just for the purpose of demonstrating techniques for assessing reporting issues. So we can go ahead and look very briefly at what this data object looks like, and you can see that we have just a few variables here: the submission year, the report ID (the unique identifier corresponding to a child), the county from which the report was made, the half month for which the report was made, the disposition of the report, the child's age, their sex, and their race and ethnicity. So just a few variables included in this smaller file. Let's go ahead and reformat these data so that R understands that this report date variable is a date variable. And now let's plot a histogram of the report dates for the 2019 NCANDS Child File. So we're going to take this 1% sample and group the data by report date. We're going to tell R: let's group our data, give it a group structure corresponding to each half month. Then we're going to ask R to summarize our data. More specifically, we're going to ask it to count up the number of rows, the number of observations, in each group, and we'll generate a new variable just called n which corresponds to the number of observations in each group. We'll keep the grouping structure that we specified previously, and then we'll go ahead and make a plot using the tidyverse's ggplot function, where the horizontal axis will be that half-month variable and the vertical axis will be our count. We'll plot the histogram as a bar graph, but since the bar geometry already has a counting function baked into it, we'll need to tell it to just use the numbers that are actually in the data; we've already counted them. And then I'm going to plot two vertical lines that correspond to the actual fiscal year 2019, just so we can visualize where the data are actually coming from and how many of them actually fit into the window that we're interested in. Let's go ahead and generate that plot. What do we see? Well, clearly most reports in the 2019 Child File are falling inside the 2019 federal fiscal year. But some clearly are from earlier weeks, months, even earlier years, that didn't make it into the previous submission files.
And actually even some late reports end up making it into the 2019 file. What this means is that you can almost imagine how, in the 2020 file, these observations here are kind of missing from this area here, and so we need to use the tail in one submission year to fill in a gap in its adjacent reporting year. Otherwise stated, if our sampling frame is fiscal year 2019, we have some extra observations that we don't want, namely these observations outside the red bars, but we're also seemingly missing some of the observations that we do want. It looks like things are dropping off here because these reports aren't actually making it into the submission year file. Now we can solve the second problem by appending the 2020 data, and we can solve the first problem by simply telling R to drop data outside of our sampling frame. So let's go ahead and read in a similar file of 2020 data, again an anonymized 1% sample of the 2020 NCANDS Child File. Let's go ahead and append them, or bind rows, essentially stacking these two files, and again tell R that this report date variable is a date variable. Then let's make sure to filter out any reports that are outside of this window. So we say we're going to keep, or filter in, any observations that have a report date inside the range of September 1st 2018 to August 31st 2019. Now let's replot our histogram as a stacked bar graph that illustrates how much each submission year, that is the 2019 Child File and the 2020 Child File, contributes reports to any given reporting year. Each bar again represents a half-month interval. Let's go ahead and plot this new file, and what we can see, although I've forgotten to plot the lines here that delineate the 2019 fiscal year, is that when we combine these files we're now able to see how the 2020 submission year is contributing to some of those months where our child reports didn't quite make it into the 2019 submission year. It's important to note this isn't actually an issue with AFCARS data if our unit of analysis is the fiscal year, but it is an important challenge if you're dealing with calendar years as a unit of analysis, so it's important to be clear about whether your unit of analysis is the fiscal year or the calendar year. Lastly, very quickly, I'm just going to show you how to generate some of those trend plots that we used to analyze measurement error and missing data in children's race and ethnicity in foster care, and in diagnoses of clinical disabilities in foster care. I share this code just so that you can implement similar trend plots and do your own assessments of reporting issues. So here we're going to read in an anonymized 5% sample of the AFCARS foster care file spanning 2010 to 2019. Let's go ahead and examine that object real quick. Again, a smaller version of this file with just a few variables. Let's go ahead and restructure our data as counts of entrances into foster care, and then we'll group those counts by state, year, and ethnoracial group. So let's create a new object, a sort of count object, where we're going to filter in only those records that are new entrances into foster care, and just for the sake of demonstration today, just a few states, maybe half of US states. Again we'll group our data by state, by fiscal year, and by ethnoracial group.
Again, we'll summarize our data just by counting the number of records that correspond to each group, and then we'll clean up our race and ethnicity variable and assign it some labels that are easily legible. Now let's go ahead and plot a series of trend lines to examine any reporting issues. We'll again use the tidyverse's ggplot function. We'll say that the horizontal axis will be this fiscal year variable, the vertical axis will be our count, and then we'll both colorize and group things according to race and ethnicity. We'll tell ggplot to plot a line graph. We'll specify a set of horizontal axis breaks. We'll label our axes, and then we'll use this nifty facet function. Faceting allows us to break things out, or decompose our figure, by some variable, in this case state, so that instead of having all states arrayed on a single figure, we'll generate multiple figures corresponding to each state. And in this case, because we're just counting the number of entrances and states have very different sized populations, we'll allow each state to have its own vertical axis. What does that look like? Okay, we'll need to reshape our windows a little bit here and give R a minute to rerender the graph. Okay, so now we have a set of trend lines corresponding to the number of children having each derived race and ethnicity over time. And again we can see some wonkiness in certain trends that might warrant investigation, but this will depend on your analytic goals. Sometimes it can be helpful to try different scales depending on precisely what it is that you're trying to suss out. As you can see, there are many fewer Asian children or American Indian children in some states than others, and so it can be a little bit difficult to really get to the bottom of these trend lines down here, and so sometimes I recommend that people plot things on a log scale, which is roughly equivalent to looking at proportional changes over time. And then we're able to see the continuity or breaks in some of these trends with lower counts. Okay, so obviously, again, not a full suite of tools for assessing reporting issues in NCANDS and AFCARS, but some tips and tricks and useful code for the checks that I do most often in my own research when I'm trying to understand missing data, measurement error, and other reporting issues in NCANDS and AFCARS. So we've got about 10 minutes left. I will stop here and open up the floor for questions. [Clayton Covington] All right, thank you Alex. This was wonderful, and I'm going to start off with the first question from an attendee who asked: how do states keep track and assign IDs to half siblings in foster care? [Alexander F. Roehrkasse] To half siblings in foster care? [Clayton Covington] To half siblings, like a half sister, half brother. [Alexander F. Roehrkasse] Interesting question. Well, to the best of my knowledge there's no definitive way to identify siblings in the foster care files. So for example, if you're used to using, say, decennial census data or American Community Survey data, there, because those are household-based surveys, there are ways to link siblings to one another. Now, if someone knows more about this than me, please correct me if I'm wrong, but there are no such variables in the AFCARS files that allow you definitively to link siblings.
Now, there are caregiver identifier variables in administrative data, and so under certain assumptions you could identify scenarios where two children in the administrative data have the same caretaker, and so, you could say, are likely to live in the same home and are therefore likely to be kin of some sort, but to the best of my knowledge I don't know that you can distinguish between siblings, half siblings, or other co-residents. That would require some pretty heroic assumptions. [Clayton Covington] Okay, the next question says: I'd like to clarify, Alex mentioned that data are constructed for examples, or you're using sample data, not actual data. Does this mean I cannot rely on tables one and three in the slides as reference for states that do and do not report to NCANDS in table one and state-years with linkage issues in table three? Are these data actual, or are these tables fabricated for sample purposes only? [Alexander F. Roehrkasse] Great question, yeah, great question. So I should have spoken with greater care and precision, because some of these tables that do summarize reporting issues are in fact describing real data. So Clayton, if you can help me: there were a number of tables and figures described there, so let's go through and make sure we're being clear about which ones are valid and which are artificial. [Clayton Covington] Of course, so this person is referring to table one for state-years. [Alexander F. Roehrkasse] Yep, so this table is real. This table describes the actual state of reporting for NCANDS over roughly the last two decades. We have information on NCANDS going back further in time, and so if you'd like additional information for the very early 2000s or the 1990s, reach out and we can provide files that help summarize the state of reporting in earlier years. But yes, table one you can rely on, you can take to the bank. [Clayton Covington] And the other table they wanted to verify was table three with linkage issues. [Alexander F. Roehrkasse] With linkage issues, yes, also true data. [Clayton Covington] Yep, okay. [Alexander F. Roehrkasse] But as you can see, for the purposes of this demonstration I've just pulled out a small number of states to illustrate different patterns. So if you want information for the full set of 50 states, reach out and again we can provide this information more systematically. [Clayton Covington] Okay, the next question asks: do we need to be concerned about duplicates when combining NCANDS files? Some records from prior years may not be finalized, for example having no disposition, but are finalized in the next year. Should we identify them and only keep the final one? [Alexander F. Roehrkasse] That's a great question. So just to restate the question: every child report in NCANDS has a disposition variable. And so in any given submission year that report may have a sort of unfinalized disposition, and then get finalized in the subsequent year, get reported again in the subsequent year with a finalized disposition. And so it would be one report measured at two different points in time, with different dispositions. I'll confess that I'm actually not sure about the extent of this problem.
In principle, that report should have the same ID, and so an easy way to assess the degree of that problem would be to go ahead and link all of the records that you can for the period of your analysis and then pull out any linked reports that had the same report date, because in principle if it's the same report it will have the same report date. I'm not aware of any sort of child receiving multiple reports on the same date. I guess because we collapse report dates to the half month they could have multiple reports within a half month, and so then you could examine whether or not two records had the same Child ID, same report date, and different dispositions, and that would give you some indication that that single report was reported in multiple submission years even though it's only one report. So that's a little bit of a jumbled answer. The long and short of it is I don't know the extent to which this is an issue, but that's easily investigated by going ahead and linking records using the Child ID variable and then comparing how much those linked records vary in terms of report date, disposition, and submission year. [Clayton Covington] All right, the next question asks: do you have any resources you'd recommend to investigate state policy changes that may impact missing values slash observations? For example, we're curious whether certain states change how they report alternative response cases in NCANDS over time. [Alexander F. Roehrkasse] Yes, great question. We have a new data resource called SCAN. It's new, so its ability to track change in policy over time is somewhat limited because it's a prospective resource, not a retrospective resource. But our friends at Mathematica have been really putting in the elbow grease to track down and monitor state policies that affect the kind of outcomes that are measured in NCANDS and AFCARS. So I'd encourage you to go explore the SCAN database, which has a ton of information about state policies that will help you get to the bottom of some of this. I think we're now up to two, maybe three, years in SCAN, so there are not too many resources on change over time, but a lot of helpful resources on variation in policy across states. I would just encourage folks, when they're thinking about these things, to try to get clarity about how definitional variation, definitional change, enters into their analysis. Sometimes it's a problem, sometimes it's not; sometimes you actually want that change to be baked in. Furthermore, there are different ways that definitional change can affect your data. Sometimes the category scheme of responses changes. Sometimes the meaning of a particular response changes. Often this takes quite a lot of elbow grease to really understand. And so I would encourage folks to really try to get into the nitty-gritty about, for the variables that they're most interested in, how these variables actually arise in our data. [Clayton Covington] All right, so we're nearly at time, but we'll ask one more question, and it says: I'm new to AFCARS and NCANDS and wanted to know if either data set tracks re-entry for foster care placement within a single year. [Alexander F. Roehrkasse] Yeah, this is a great question. It's worth saying that AFCARS is in the process of a pretty major redesign.
We don't know when this will come down the pike for the research community, but in the coming, honestly, decade we will be able to track entrances into foster care with much more temporal specificity. The unfortunate answer is no, there is no reliable way to track multiple entrances into foster care within a given year. There are different tricks you can do to try to estimate the number of prior foster care entrances a child has had over a particular period. If you want to reach out by email I'm happy to discuss some of those, but the long and short of it is the information in AFCARS is not specific enough to identify multiple foster care entrances within a federal fiscal year. [Clayton Covington] All right, and with that we're going to end the questions, but Alex, can you go to the last slide please. I just want to highlight that we're going to continue the Summer Training Series next week at the same time, 12pm Eastern Time on July 24th, with NDACAN statistician Sarah Sernaker, who will be covering the AFCARS data set, speaking to both its strengths and limitations. Well, thank you everyone, and thank you Alex for a wonderful presentation, and we look forward to seeing you all next week. [Alexander F. Roehrkasse] Thanks everyone, take care. [VOICEOVER] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families. [R Code]
#########
# NOTES #
#########

# THIS PROGRAM FILE DEMONSTRATES SOME STRATEGIES DISCUSSED IN
# SESSION 2 OF THE 2024 NDACAN SUMMER TRAINING SERIES
# "ASSESSING REPORTING ISSUES IN NCANDS & AFCARS"
# FOR QUESTIONS, CONTACT ALEX ROEHRKASSE
# (AROEHRKASSE@BUTLER.EDU; ALEXR.INFO)

############
# 0. SETUP #
############

# Clear environment
rm(list=ls())

# Install packages (only necessary once)
#install.packages(c('data.table', 'tidyverse'))

# Loads packages
library(data.table)
library(tidyverse)

# Set filepaths
afrpc <- 'C:/Users/aroehrkasse/Box/Presentations/-NDACAN/2024_summer_series/'
setwd(afrpc)

# Set seed
set.seed(1013)

##################################
# 1. DELAYED REPORTING IN NCANDS #
##################################

# NCANDS data files are organized by the year in which
# maltreatment reports are *submitted* to NCANDS.
# A very common oversight is that this is *not* the same
# as the year in which reports *occur.*
# How do we measure the number of reports occurring in Fiscal Year (FY) 2019?
# (FYs are Sep. 1 - Aug. 31)

# Let's read in an anonymized 1% sample of the 2019 NCANDS Child File
d19 <- fread('cf_2019_anon_samp.csv')

# Let's reformat NCANDS's report date variable as a
# date format variable that R can understand.
d19 <- d19 %>%
  mutate(rptdt = as.Date(rptdt))

# And let's plot a histogram of the report dates for the 2019 NCANDS Child File
d19 %>%
  group_by(rptdt) %>% # tell R to group the data by half-month
  summarize(n = n(), .groups = 'keep') %>% # count child-reports in each group
  ggplot(aes(x = rptdt, y = n)) +
  geom_bar(stat = 'identity') + # plot a bar chart of the counts
  geom_vline(aes(xintercept = as.numeric(as.Date('2018-09-01'))),
             color = 'red', size = 1.5) +
  geom_vline(aes(xintercept = as.numeric(as.Date('2019-08-31'))),
             color = 'red', size = 1.5) +
  labs(x = 'Report date', y = 'Number of child-reports') +
  theme_bw()

# If our sampling frame is FY2019, we
## (i) have some extra observations we don't want and
## (ii) are seemingly missing some observations that we do want.
# We can solve (ii) by appending submission-year 2020 data, and
# solve (i) by dropping data outside our sampling frame.
d20 <- fread('cf_2020_anon_samp.csv')

d1920 <- d19 %>%
  bind_rows(d20) %>%
  mutate(rptdt = as.Date(rptdt))

d1920 <- d1920 %>%
  filter(rptdt %in% as.Date('2018-09-01'):as.Date('2019-08-31'))

# Now let's replot our histogram as a stacked bar graph
# that illustrates how much each submission year
# contributes reports to any given reporting year.
# Each bar represents a half-month interval.
d1920 %>%
  group_by(rptdt, subyr) %>%
  summarize(n = n(), .groups = 'keep') %>%
  ggplot(aes(x = rptdt, y = n, fill = fct_rev(as.factor(subyr)))) +
  geom_bar(stat = 'identity') +
  labs(x = 'Report date', y = 'Number of child-reports', fill = 'Submission year') +
  theme_bw()

# This is not so much an issue with AFCARS data if your
# unit of analysis is the FISCAL year,
# but it is an important (and similar) challenge if your
# unit of analysis is the CALENDAR year.

############################################
# 2. EXAMINING REPORTING OF RACE IN AFCARS #
############################################

# Let's read a 5% sample of the AFCARS Foster Care files for 2010-2019
fc <- fread('fc.csv')
head(fc)

# Then let's restructure our data as
# counts of entrances into foster care
# by state, year, and ethnoracial group.
fccount <- fc %>%
  filter(Entered == 1 & St <= 'MA') %>%
  group_by(St, FY, RaceEthn) %>%
  summarize(n = n(), .groups = 'keep') %>%
  mutate(RaceEthn = factor(RaceEthn,
                           levels = c(1:7, 99),
                           labels = c('White, non-Hispanic',
                                      'Black or African American,\nnon-Hispanic',
                                      'American Indian/Native Alaskan,\nnon-Hispanic',
                                      'Asian, non-Hispanic',
                                      'Native Hawaiian/Other Pacific Islander,\nnon-Hispanic',
                                      'More than One Race,\nnon-Hispanic',
                                      'Hispanic or Latino',
                                      'Missing or Unknown')))

# Now let's plot a series of trend lines
# to examine any possible reporting issues
fccount %>%
  ggplot(aes(x = factor(FY), y = n, color = RaceEthn, group = RaceEthn)) +
  #geom_point() +
  geom_line() +
  scale_x_discrete(breaks = seq(2010, 2019, 3)) +
  labs(color = 'Derived Child\nRace and Ethnicity',
       x = 'Year',
       y = 'Number of entrances into foster care') +
  facet_wrap(~St, scales = 'free') + # faceting can help examine multiple trends
  theme_bw()

# It can be helpful to try different scales
# depending on what you're trying to suss out.
fccount %>%
  ggplot(aes(x = factor(FY), y = n, color = RaceEthn, group = RaceEthn)) +
  #geom_point() +
  geom_line() +
  scale_x_discrete(breaks = seq(2010, 2019, 3)) +
  scale_y_continuous(trans = 'log2') +
  labs(color = 'Derived Child\nRace and Ethnicity',
       x = 'Year',
       y = 'Number of entrances into foster care\n(logarithmic scale)') +
  facet_wrap(~St) + # faceting can help examine multiple trends at once
  theme_bw()
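############################################################
# 3. (SKETCH) DUPLICATE REPORTS ACROSS SUBMISSION YEARS    #
############################################################

# Not part of the original session code: a sketch of the check discussed
# in the Q&A -- the same report (same child ID and report date) appearing
# in two submission years, possibly with different dispositions.
# The variable names chid and rptdisp are assumptions (check your codebook);
# d1920 is the combined 2019/2020 object built in section 1 above.
d1920 %>%
  group_by(chid, rptdt) %>%
  filter(n_distinct(subyr) > 1) %>%        # same child & report date in both files
  arrange(chid, rptdt, subyr) %>%
  summarize(n_records      = n(),
            n_dispositions = n_distinct(rptdisp),
            .groups = 'drop') %>%
  count(n_dispositions)                     # how often do dispositions disagree?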