[voiceover] National Data Archive on Child Abuse and Neglect.
[Erin McCauley] Welcome to the 2021 NDACAN summer training series. We are recording the session. We turn the entire summer training series into a webinar series, which will be posted on our website later this summer. You can also see previous summer training series; all of them are available on our website. So as I said, this is the NDACAN summer training series. It is hosted by the National Data Archive on Child Abuse and Neglect. NDACAN recently moved, so we are now at both Cornell University and Duke University. We have staff across both universities, so that's been an exciting change. The topic of the summer is data strategies for the study of child welfare. We've gone over some of the basics and some survey-based data, and today we're going to be talking about linking. This is hosted by NDACAN. We have all of the data, and our work with NDACAN is through a contract with the Children's Bureau, and we also have a representative from the Children's Bureau who is going to be doing an introduction today. So here is an overview of the summer. If you've been with us from the start, you know that we first did an introduction to NDACAN: what are the datasets, programs, and services available. Then last week we had our data analyst for the survey-based data cluster come in and talk a bit about that data and preview some data that's coming down the pike. Today we have our research associate Clayton Covington here to talk about the administrative data and what's in that cluster. And then we have our statistician Sarah Sernaker, who's going to be leading us through a workshop about how to link that data, although the lessons about linking, I would say, can also be applied to other datasets.
Then next week we're going to be talking about one of our new data products, the VCIS data. That data can be linked with the data we'll be talking about today to create a longer time span looking at child welfare, and it can be used to study special populations. Then we'll be finishing out the summer with two workshops that are really focused on analysis, so we'll have Sarah back to lead some of those along with Frank Edwards, who is another research associate within the data archive. And we'll be talking about multilevel modeling and latent class analysis. So that's our overview of where we've been and where we're going. So now I'm going to pass it over to Tammy White, who is from the Children's Bureau, and she's going to be doing a little introduction for our session today.
[Tammy White] Hi everyone, welcome. I just wanted to do a very quick hello and welcome to this third training session. I am Tammy White from the Children's Bureau. I just wanted to say thank you for attending and to let you know that at the Children's Bureau we think this is a very important service that the archive provides, not only to house all our administrative datasets but to do a series of webinar trainings, especially in the summer. If you attended some in past years, welcome back. If you haven't, I hope this means that you'll be attending the future ones for the summer and also next year, because we think this is a nice ongoing service that the archive provides. So thank you, and enjoy linking the datasets. That's an important skill, and I hope that you all are able to take what you've learned today and contact the archive in the future to get datasets and link them up. They have a lot of great resources for you and I hope you continue to use them for your research. Thanks a lot.
[Clayton Covington] Thank you Tammy. So hello everyone. As Erin said earlier, my name is Clayton Covington. I'm a research associate here with the National Data Archive on Child Abuse and Neglect.
And to just give you all a little overview of our agenda for today, I'm going to be giving an overview of our administrative datasets, specifically focusing on the National Child Abuse and Neglect Data System, or NCANDS, and the Adoption and Foster Care Analysis and Reporting System, known as AFCARS. There is one other administrative dataset that I will briefly mention, but that will not be the focus of the linking presentation today. Following the overview of our administrative datasets, Sarah Sernaker will take over by going through a step-by-step process for linking that again is specific to our datasets but can also be generalized to other datasets, and then do a walk-through of linking using statistical software. So to begin our overview of the administrative data cluster: when we say administrative data, what do we mean? Administrative data are data that are collected by government agencies or large organizations, usually for the purposes of record keeping rather than for statistical analyses, which will be the focus of our presentation. And because they are collected for record-keeping purposes, there are some tips and tricks that we'll share with you in order to most effectively use the data for your research agendas. So here at the National Data Archive on Child Abuse and Neglect, our administrative data cluster covers children's progression through the child welfare system. As I mentioned earlier, the NCANDS is a dataset specifically focused on child protective histories, and it looks at instances of alleged maltreatment that are substantiated, unsubstantiated, or receive an alternate disposition. We then progress to the AFCARS, which is specifically looking at foster care experiences, including a variety of placements such as congregate care placements, placements with nonrelative kin, and placements in group homes, among many other things, as well as emancipation and children who are not tracked in the system.
And then the final dataset is the National Youth in Transition Database, or the NYTD dataset, and this is specifically looking at children who are likely to age out of care before their 18th birthday, and so they are surveyed starting at age 17 and every other year until age 21 on a voluntary basis. But as I mentioned earlier, the focus of today's presentation is going to be the NCANDS and the AFCARS. If you are interested in learning more about the NYTD dataset, we have previous series with in-depth explorations of it, including our very first summer webinar series, which, if I remember correctly, was exclusively dedicated to the NYTD dataset. And those can be accessed from the NDACAN website. So we'll start with NCANDS. The National Child Abuse and Neglect Data System was created as a voluntary system in response to the 1988 amendment to the Child Abuse Prevention and Treatment Act, also known as the CAPTA amendment. And these are case-level data collected for all children who received a response from a child protective services agency, or CPS, in the form of an investigation or an alternative response. The primary file that you'll be working with if you request an NCANDS dataset is known as the child file, and this has child-specific records for each report of alleged child abuse and neglect that received a CPS response. Completed reports are those that resulted in a disposition or finding during that reporting year. And so one NCANDS record is known as a report-child pair, and it combines the report ID with the child ID to uniquely identify a single record. So when we're looking at the child file report and the child file record, there are a few distinctions that we should keep in mind. First, when we're looking at the child report, these are instances of suspected child abuse, and they may involve one or more children or records, and they may be substantiated, unsubstantiated, receive an alternative response, or receive another disposition.
And they may involve multiple perpetrators. When we're looking at a child file record, the data relate to only one child in the given report, and that child is represented as a victim or a non-victim. And for victim records, the data include fields concerning perpetrators, up to three perpetrators per victim record. So one is looking at a report, the other is focused on the individual child record, and as I mentioned earlier, the two of these combine to create an NCANDS record. So within the child file, here are a few variables to keep in mind. The report variables that are most commonly referenced and used are the report data, such as the report ID, the report date, its disposition, and the investigation start date. Child data include the child ID and demographic information, and one important note that we'll give you is that child ages recorded as 22 or 23 indicate sex trafficking maltreatment types. If I believe correctly, the oldest age for children, generally speaking, is up to maybe 19, so if you see 22 or 23, that's just a very specific code for child sex trafficking maltreatment. Maltreatment data include a number of types of maltreatment, including sex trafficking and medical neglect, and individual dispositions as well as fatalities. And then the child risk factors recorded in our report variables include histories of sexual abuse, substance abuse, and diagnosed disabilities. Now, when looking at the child variables, there are a number of fields that I will highlight. Looking at caregiver risk factors such as substance abuse and financial problems is one way to identify a little bit about the child's background based on their caregiving situation. We also look at the services that are provided to a child, such as foster care, adoption, and counseling.
You can also access more information about perpetrators, including their relationship to victims, their demographic information, and sex trafficking. And additional fields that are available in the NCANDS are the AFCARS ID, date of death, plan of safe care, and referral to care-related services. That AFCARS ID will come into play more when we get to the actual linking presentation, because that's the point of connection for linking the datasets. So the NCANDS agency file is another component of the NCANDS dataset that isn't as popular, but it's still really useful depending on the type of analysis that you want to do. Again, it includes all the CAPTA-required items and a bunch of summary data, such as prevention services that are provided, referrals and reports between institutions, additional information on the child victims reported in the child file, more child fatality data, and Part C of the Individuals with Disabilities Education Act, or IDEA, reporting. Now we'll move to the AFCARS. The AFCARS provides case-level information on children who are under the placement and care responsibility of Title IV-E child welfare agencies. In terms of collection, the states document the information in their state-specific electronic record systems, and they then send that data to be compiled by the Children's Bureau. The Children's Bureau then works with the states to correct errors. So if you've used our datasets in the past, you might have gotten an email from Andres about an updated dataset, and that's often because the Children's Bureau is working constantly to preserve the integrity of the data and make sure it is as accurate as possible. So if any corrections are made at the state level, we'll often issue an update based on those new records. A few variables that I will highlight are demographic information, removal, placement, and other case-related information.
Some examples include the date of birth of the child, caretakers of the child, and foster or adoptive parents. We also provide self-identified race information on the child and foster parents. We provide the dates of first and most recent removal, the number of removals, and the discharge date. You can also access the date of placement, the number of placements that a child has had, and the placement location. And there are also case plan goals, termination of parental rights dates, and sources of federal financial support. So when working with multiple years of the AFCARS dataset, the yearly AFCARS files can be stacked. When more than one year of the foster care file is used, there will be duplicated AFCARS IDs, which is the variable StFCID, because a child has a record for each year that they are in foster care. So if you need to resolve the data to one row per child, you keep the most recent year. And so this is just about switching between long and wide formats, or transforming the dataset, which I think Sarah can probably touch on a little bit later when she takes over. Speaking of taking over, Sarah, I think I'll pass it off to you to talk about steps in linking.
[Sarah Sernaker] Thank you Clayton. Have you relinquished control of the screen as well? Okay. Hello everyone, my name is Sarah Sernaker. I'm a statistician here at NDACAN. And I think my email is provided at the end if there are any questions. So I'll just jump right into it. Clayton did a good introduction of the datasets, and I'm going to take a step back just to go over definitions so that we are all on the same page here. Clayton already introduced what administrative data is, so I won't reiterate that, but just to go over what we mean by data linking or joining: that's really just the simple combination of two datasets that share at least one common variable.
So not even in the sense of NDACAN data, but in general, if you were to link data, it's just the combination of two datasets, and you're linking by something they share in common. And by tables, we're just referring to a dataset that is arranged in rows and columns, where the rows generally hold the case or the record, and we call those observations, and the columns are the variables, like the variables that Clayton was just describing. And I've tried in this presentation, whenever I use specific variable names from our datasets, to put them in this kind of standard coding font. And then when we talk about linking, we talk about the variables that the datasets share or have in common, and those are generally called the case or, more commonly, a key. Those could either be a single variable that is the same in the two datasets, or it could be a combination of variables that uniquely define a record. As Clayton was describing, we have a lot of child IDs in our data, and if you were to work with NCANDS in multiple years, you might see the same child ID appear more than once. If you're working with multiple years and that child has multiple records, a child ID by itself may not be unique, depending on what the scope of your research is. And so that's why we say that a combination of variables will sometimes uniquely define a record. In the example I'm talking about, you would need to specify the year and the child ID for that unique record. And so when you're linking data, you usually have two datasets that you'd like to link, and if you want to link them, they must share at least one variable of the same entity. And I say entity because it's really about what is contained in the data and what the information is.
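To make the point about combined keys concrete, here is a minimal sketch in Python. The records and the field names `year` and `child_id` are hypothetical stand-ins, not actual NCANDS variable names; the sketch only checks whether a chosen set of variables identifies each row uniquely.

```python
from collections import Counter

# Hypothetical stacked records: the same child ID appears in two years.
records = [
    {"year": 2010, "child_id": "A1", "reports": 1},
    {"year": 2011, "child_id": "A1", "reports": 2},
    {"year": 2011, "child_id": "B2", "reports": 1},
]

def is_unique_key(rows, key_fields):
    """Return True if the combination of key_fields identifies each row uniquely."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return all(n == 1 for n in counts.values())

# The child ID alone repeats across years; year plus child ID does not.
child_only = is_unique_key(records, ["child_id"])
year_and_child = is_unique_key(records, ["year", "child_id"])
print(child_only, year_and_child)
```

Running a check like this before linking is one way to confirm that the key you plan to join on really does define one record per row.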
And so if you had two datasets that you could link on state, but one dataset has the states in abbreviations and one dataset has the states written out, in essence we as humans know that that's the same information, but you need to clean it up to make sure that they are either both in abbreviations or both written out, and then you could combine them. That's what I mean by entities: the information that a variable holds needs to be the same. And linking variables aren't necessarily named the same, so in the example I just brought up with the states, in one dataset it could be ST and in the other dataset it could be written out as STATE, and so these are all just things to keep in mind. In the AFCARS set of data files, which I'm just using as an umbrella term for our NCANDS, AFCARS, and NYTD files, the most commonly used entity is a child. Usually people are interested in following the history of the child through the welfare system, and so whatever the variable name is between the datasets, the entity of interest is usually at the child level. And within our datasets we have a variable, StFCID, which stands for state foster care ID, and this is a unique child identifier that can be found in the NCANDS, the AFCARS, and I believe the NYTD as well. I think this can be found in most of our datasets here. It is a unique child identifier that is comprised of the state abbreviation and the AFCARS foster care ID, which is in essence the record number. And so if you were to use one of our datasets, that is the variable most commonly used for linking. Generally speaking, not even in the context of our data, if you were to try to link two datasets, I've put together the steps of what you would want to do. And the first step would be to prepare each table separately before linking them.
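The state example above can be sketched as follows. This is an illustration under stated assumptions: the lookup table is deliberately partial, the tables and their values are made up, and only the column names `ST` and `STATE` come from the example in the talk.

```python
# Illustrative, partial lookup; a real one would cover every state.
NAME_TO_ABBREV = {"new york": "NY", "north carolina": "NC"}

def normalize_state(value):
    """Return a two-letter abbreviation for a state name or abbreviation."""
    cleaned = value.strip()
    if cleaned.lower() in NAME_TO_ABBREV:
        return NAME_TO_ABBREV[cleaned.lower()]
    if cleaned.upper() in NAME_TO_ABBREV.values():
        return cleaned.upper()
    raise ValueError(f"Unrecognized state value: {value!r}")

# Hypothetical tables: one uses abbreviations in ST, the other full names in STATE.
table_a = [{"ST": "NY", "n_children": 10}, {"ST": "NC", "n_children": 7}]
table_b = [
    {"STATE": "New York", "region": "Northeast"},
    {"STATE": "North Carolina", "region": "South"},
]

# Put both tables on the same key name and the same coding before linking.
for row in table_b:
    row["ST"] = normalize_state(row.pop("STATE"))

region_by_state = {row["ST"]: row["region"] for row in table_b}
linked = [{**row, "region": region_by_state.get(row["ST"])} for row in table_a]
print(linked)
```

Raising an error on unrecognized values, rather than passing them through, is a design choice here so that coding problems surface before the link rather than as silently unmatched rows.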
That means asking yourself what's needed and not needed in your research and removing unnecessary variables, so reducing your scope. If you are doing linking, extra variables may cause problems that you're not aware of, or it may be a computational issue where if you just leave every variable in there it might take too much memory. And it just makes everything neater. So really reduce your scope to the variables of interest, and take a subset based on your research questions, so not just selecting specific variables but maybe filtering out rows. In the example I chose here, if only children under 15 are of interest, just filter out rows where that's not the case. And again, this is all just to facilitate the linking, easing the computational burden if one could arise, and really making everything a lot simpler. And then there is appending datasets of different years. I've included it here because, especially when you work with our NCANDS child files, we have separate files for each year, so if you were to request 2010 and 2011 you would probably get two different data files. Although they would have the same variables and whatnot, you'd want to make sure that you have a year variable that indicates what year each record came from, and then you could just stack those. Because the only difference is that they are different years, that's not a linking issue. And so step one is essentially just cleaning up each dataset before you link it, which is generally the case in any statistical work: cleaning is always the first step. The next step would be to reduce each of the datasets to one row per unique identifier. I'll call this resolving the data tables, and by resolving I mean reducing to one row per unique identifier in each data file. If you were to look up the terminology later on, that is going from long format to wide format. And that's really commonly used terminology.
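The preparation step described above (dropping unneeded variables, filtering rows, and stacking yearly files with a year variable) can be sketched like this. The two yearly files, the field names, and the helper `prepare` are hypothetical and only illustrate the idea.

```python
def prepare(rows, year, keep_fields, max_age):
    """Keep only the needed fields for children under max_age and tag each row with its year."""
    prepared = []
    for row in rows:
        if row["age"] < max_age:
            slim = {field: row[field] for field in keep_fields}
            slim["year"] = year  # records which yearly file the row came from
            prepared.append(slim)
    return prepared

# Hypothetical yearly files with the same variables.
file_2010 = [
    {"child_id": "A1", "age": 9, "county": "x"},
    {"child_id": "B2", "age": 16, "county": "y"},
]
file_2011 = [{"child_id": "A1", "age": 10, "county": "x"}]

# Stacking is plain concatenation once each file carries its own year.
stacked = (
    prepare(file_2010, 2010, ["child_id", "age"], max_age=15)
    + prepare(file_2011, 2011, ["child_id", "age"], max_age=15)
)
print(stacked)
```

Note that the stacked result still has two rows for the same child, one per year, which is why resolving to one row per identifier is a separate step.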
And so what I mean here is that, for example, in the NCANDS data, if you were to work with 2010 and 2011 again, you might find a child appearing more than once, or even within the same year there might be more than one observation per child. And so you'd want to resolve that, because if the main interest is to observe what's happening to this child, you want just one row to summarize the experience within NCANDS. That usually means taking summary statistics, whether that's taking the most recent information or taking a sum of information. So instead of having a bunch of rows for this one child in one year that show the different records they are on, you create a variable that says this child appeared three times this year. Those are things that definitely take time and need consideration depending on the research scope, but a very important step is resolving each data table to one row per unique identifier. And if you do not do this and you try to link, you might find duplicate rows in the end, and you might not notice right away, and this might affect your research goal. You might over- or undercount things if you have duplicate rows. And so that's why it's really important to resolve each table before linking. So the next step, once you've cleaned your data and resolved it down to one row per identifier of interest (we'll just stick with child, so resolving to one row per child), is to save the clean data as new tables. Rather than just jumping into the linking, it's good to have a new file that has your cleaned data. Again, it's just a way to facilitate the linking, including for computational purposes. And then the next step is how to practically do it.
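As a small illustration of resolving a table, the sketch below collapses report-level rows to one row per child by counting reports and keeping the latest report date. The rows and field names are hypothetical, and the choice of summaries is just one possibility among those mentioned above.

```python
# Hypothetical report-level rows: child A1 appears on three reports in the year.
rows = [
    {"child_id": "A1", "report_id": "R1", "report_date": "2010-03-01"},
    {"child_id": "A1", "report_id": "R2", "report_date": "2010-09-15"},
    {"child_id": "A1", "report_id": "R3", "report_date": "2010-11-02"},
    {"child_id": "B2", "report_id": "R2", "report_date": "2010-09-15"},
]

def resolve(report_rows):
    """Collapse report-level rows to one summary row per child."""
    summary = {}
    for row in report_rows:
        child = summary.setdefault(
            row["child_id"],
            {"child_id": row["child_id"], "n_reports": 0, "latest_report_date": ""},
        )
        child["n_reports"] += 1
        # ISO-formatted dates (YYYY-MM-DD) sort correctly as plain strings.
        child["latest_report_date"] = max(child["latest_report_date"], row["report_date"])
    return list(summary.values())

resolved = resolve(rows)
print(resolved)
```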
In a lot of cases you'd be working in a programming language, and it would just be a matter of loading your new clean data tables and performing the linkage, which just means using the right functions and the right options within the functions. And if you were to link more than two datasets, I would first suggest going through this process with two datasets: link two datasets, and then use that new linked dataset with your third dataset and go through the process again to link the third. So just doing it essentially step-by-step for each of the datasets. And so this is just an overview that I'll go through really quickly, because it's really important once you start joining and merging tables, and this is all very common terminology in coding and in join and merge statements. You can make an inner join, where you link data and you only keep observations that have matching key values in both tables. You can do what's called a left or right join, where you specify which table to keep. Let's take a left join, for instance: if you're doing a left join on table 1, that means you keep all of the rows from table 1 and match what you can from table 2. So it doesn't matter if not all rows match; you're telling your program to keep all the rows from table 1 and do what you can with table 2. And likewise with the right join, just the other way. And a full outer join says to keep all the rows from table 1 and all the rows from table 2 regardless of whether they match or not: match what you can and just keep the rest. And I say keep the rest, so for anything that's not matched, a lot of programming languages will just fill in blanks; they usually deal with that pretty naturally, as I'll show you in the example. So, resolving the NCANDS child file.
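The four join types described above can be sketched in a few lines of Python. This is a teaching sketch rather than how any statistical package implements joins: the `join` helper is hypothetical, the two small tables are made up, and it assumes each table has already been resolved to one row per key.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts that each hold one row per key value."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Collect every column name so unmatched rows can be padded.
    columns = []
    for row in left + right:
        for name in row:
            if name not in columns:
                columns.append(name)

    if how == "inner":
        keys = [k for k in left_by_key if k in right_by_key]
    elif how == "left":
        keys = list(left_by_key)
    elif how == "right":
        keys = list(right_by_key)
    elif how == "outer":
        keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    else:
        raise ValueError(f"Unknown join type: {how!r}")

    result = []
    for k in keys:
        merged = dict.fromkeys(columns)  # fields with no match stay None
        merged.update(left_by_key.get(k, {}))
        merged.update(right_by_key.get(k, {}))
        result.append(merged)
    return result

# Hypothetical tables sharing the key "id".
table1 = [{"id": 1, "homeless": 0}, {"id": 2, "homeless": 1}]
table2 = [{"id": 2, "removals": 3}, {"id": 3, "removals": 1}]

for how in ("inner", "left", "right", "outer"):
    print(how, join(table1, table2, "id", how))
```

Here `None` plays the role of the blanks that statistical software fills in for unmatched rows.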
So specifically in the context of our data, the NCANDS child file, as Clayton was describing, is organized by report-child pair, so each observation is on the report level, and a report might have multiple children associated with it. And likewise a child could be on multiple reports in a year. And so to get one row per child, it may be necessary to aggregate information to populate variables. As I was saying before, if a child appears on multiple reports in a year, you might need to create a variable that just says the number of reports the child appeared on; rather than keeping every row in which this child appears, aggregate that information into a single variable. And if you're dealing with multiple years, which, like I mentioned before, can be easily stacked because they should all have the same variables, that's not really a linking case; that's just combining data. And so jumping to the AFCARS foster care file: the foster care file is given out each year, and each year a child should only appear once, because it's the summary of that child's experience in the foster care system that year. But they might appear throughout the years. If you are using multiple years, a child may appear more than once, and again you'd take the subset based on your scope of research; common filters are based on children just entering foster care in a year, or children who left foster care in a year. And so you do the same sort of procedure of making summary statistics where needed, but the most common approach if you're using the AFCARS foster care file is to use the most recent entry. If you're using the years 2010 through 2017, using the 2017 observation for that child usually summarizes most of the information you would need. The most recent record is up-to-date and has information like when they entered foster care for the first time, their most recent stay, and things like that.
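Keeping the most recent record per child from stacked yearly files can be sketched as below. `StFCID` is the identifier named in the talk; the `year` and `n_removals` field names, the values, and the helper `most_recent` are assumptions for illustration.

```python
def most_recent(rows, id_field="StFCID", year_field="year"):
    """Keep only the latest-year row for each identifier."""
    latest = {}
    for row in rows:
        current = latest.get(row[id_field])
        if current is None or row[year_field] > current[year_field]:
            latest[row[id_field]] = row
    return list(latest.values())

# Hypothetical stacked foster care rows: child 5 has a record in three years.
stacked = [
    {"StFCID": "1", "year": 2015, "n_removals": 1},
    {"StFCID": "5", "year": 2015, "n_removals": 1},
    {"StFCID": "5", "year": 2017, "n_removals": 2},
    {"StFCID": "5", "year": 2016, "n_removals": 1},
]

resolved = most_recent(stacked)
print(resolved)
```

Comparing years explicitly, instead of relying on row order, means the stacked file does not need to be sorted first.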
So often when you deal with the AFCARS foster care file, it's sufficient to just use the most recent record. And so once you get down to actually programming, depending on the language there are different functions to use. These are four common programming languages, and I've just included the functions within each language that you would use to do a link or a merge. In SAS it's PROC SQL; in SPSS, STAR JOIN or MATCH FILES. In R you can use merge, which is in base R; if you're familiar with R, that just means that if you have R installed, merge is already installed in there. If you're a fan of the Tidyverse, inner_join, left_join, full_join, and I think they have an outer join also, are included in that package. In Stata, which I'll go through a quick demonstration of in a little bit, the merge command is most commonly used. And then there is joinby as an alternative. And so once you have your data ready, depending on what language you like, you would use one of these functions to actually do the linkage. And so there are a lot of limitations when you try to link in general, and with our data there is a lot of data sensitivity and privacy, and that can limit linking. What I mean is that in some data a child ID might not be present. Not necessarily our data, but in general, data archives might not keep such identifiable information, and that just might limit what you can do with linking. The problem that always arises is missing data. If you're trying to link on a variable that has any missing data, that's going to be a problem. It might be a problem that you can work around, but it is something that will need to be considered. Errors and inconsistencies in record-keeping are definitely something to keep in mind, especially if you work with multiple years.
If you're trying to link on variables that might have changed throughout the years, for instance you're trying to link on race, and the racial codes are 1 2 3 4 for specific races and then in later years they changed that to 1 2 3 4 5, a little more granular, it might not be as simple as a one-to-one, two-to-two, three-to-three match. You would have to recode or just take care to make sure that you're making a consistent linkage. And that goes along with changes in record-keeping, whether that's inconsistencies within a variable or differences in the variables that are recorded. You might find that you're trying to link data and are interested in a variable that only appears in 2017 and not in any earlier years. In that case you wouldn't really need to link, because the variable that you're interested in doesn't appear in the earlier years anyway. So these are all things just to keep in mind. They're limitations; it's not a stop-all, it's just things that you should either address, or clean your data for, or consider while you're doing linkage in any data. And to summarize before I go into the example, data linkage is possible in any two datasets that share at least one common but unique identifier. In our case the most commonly used identifier is the state foster care ID [StFCID], and you should prepare your data separately: clean it, prepare it, filter it based on your research goal, and then join it. And that really summarizes all that I was saying. I think that's the end of the slides, so let me get out of this. Let me keep this up while I get the example up. And so I'm going to open it. I'm not going to go through the basics of Stata. I'm going to run through this probably quicker than I should, just because of time. And so I won't go through all the details of using Stata.
Hopefully most of you have some familiarity, or Google is always an excellent resource. But I've just opened my do-file for this linking exercise, and what I'm going to show you today is linking a very heavily simplified version of our NYTD data and the AFCARS data. First I'm going to set the working directory, so I'm telling Stata where to find my files. Let's say you want to link our NYTD data with the AFCARS data. We did not go into as much detail on the NYTD data as we should have, and that's my fault for not bringing it up. As Clayton said in the very beginning, NYTD is a survey given to youth aging out of foster care. And so you might be interested in their foster care experience from AFCARS and how that has affected them later in life. So let me pull up an example of this data. I've just hit import, so it's loaded the data into Stata, and you can see we have the variables, and Stata will tell you how many variables and observations you have, and I've just clicked this button to view our data. This is really handy when you just want to look and get a sense of what your data looks like. So this is, like I said, a heavily simplified version of our NYTD, and I've masked a lot of things for privacy and security. And these are not real dates associated with real records, again for privacy and security. So NYTD is a survey that's given up to three times to youth, and it asks them questions. Here I've kept homeless and incarcerated, so this would be asking the youth whether they experienced homelessness in the past year or whether they were incarcerated, and it's a 0 for No and a 1 for Yes.
[Erin McCauley] Sarah, can you make it a little bit bigger so that people can see the data a little bit better? Sorry to interrupt.
[Sarah Sernaker] So I did a full screen. I don't know if I can zoom in. Here, let's see.
Could it help to go into full screen? I can't make the font bigger. Is this better?
[Erin McCauley] It's a little bit better.
[Sarah Sernaker] Okay. I'm sorry, I don't know how to make the font bigger, but hopefully this is easier to see because now I've gotten rid of the other windows. So anyway, each person in the survey is asked up to three times, and that is recorded in waves. This would be the first time they're asked, in 2014, and then two years later they received a survey again to see, okay, in the past year since the last survey, did you now experience homelessness or incarceration? And then they are asked a third time. A person doesn't have to respond all three times, so for example this person did not respond; that's why in responded they have a zero. And for example this person responded the first time, so they got a follow-up, and then they didn't respond, so that's why they only have two rows. So that's a very simplified dataset and a very simplified explanation of the NYTD dataset. That was just to get a sense of what the data looks like. I have a lot of windows open here, so bear with me; I'm holding onto our data and now going back to the do-file. Like I was saying, we're interested in looking at the experience of a unique person through the foster care system and then afterwards, and so I've kept the state foster care ID in here but simplified it again so it's just one, two, three, up to 14. Notice that this person has three rows associated with them, corresponding to the three times they responded. This person also has three rows, and similarly each person is identified by the state foster care ID. And we want to put them into one single row per individual. As I was saying before, this is going from a long format, where we have multiple rows per person, to a wide format.
So we want one row per person and we're going to summarize their full experience recorded here in one single row. So how to do that in Stata is using this reshape command and you tell Stata okay I want wide, so I want my end result to be a wide format. And so then you need to say okay what is identifying the individual rows themselves? So we have the date, whether they responded, homeless and incarcerated, so those things we want to keep once we make it wide. And these are the unique identifiers for each observation. And so what I mean is that we have each person coming from a state, and then to go from long to wide you need to specify the wave. So let me just run this so you can see how this changed. So if you were paying attention you'd notice that this went from a lot of observations to only 13 now and this got a lot wider. So we still have the state and the state foster care ID so those were our unique identifiers. Notice we have these variables responded one, homeless one, incarcerated one, and then outcome report one so the date of the first wave. Then we have the date of the second wave, whether they responded in the second wave, the homeless in wave two and incarcerated in wave two. And similarly for wave three. And so this is what we mean by going from long to wide and this is what sometimes happens especially working with survey and longitudinal data, using these indices of one, two, three to show, in our case, what wave it came from. And so notice we have all the same information: in wave one this person responded, they weren't homeless and did not experience incarceration. In wave two they did experience homelessness and did not experience incarceration. And notice for those individuals who never responded, so like this person who did not respond at all, Stata knows how to deal with the fact that they didn't have that information in wave two and three.
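The long-to-wide step described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the presenter's actual do-file: the variable names (state, stfcid, wave, responded, homeless, incarcerated) are assumptions based on how the variables are described aloud, and the date variable is left out for brevity.

```stata
* Toy long-format data: one row per youth per wave (hypothetical values).
clear
input state stfcid wave responded homeless incarcerated
1 1 1 1 0 0
1 1 2 1 1 0
1 1 3 1 0 0
1 2 1 0 . .
1 3 1 1 0 0
1 3 2 0 . .
end

* i() names the variables that identify a person; j() names the wave
* index, which becomes the suffix on each reshaped variable
* (responded1, responded2, responded3, and so on).
reshape wide responded homeless incarcerated, i(state stfcid) j(wave)

* One row per youth remains; waves a youth never reached are missing.
list, noobs
assert _N == 3
assert missing(responded2) if stfcid == 2
```

The two assert lines are only there to confirm the shape of the toy result: three youths, and no wave-two values for the youth who was never followed up.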
We know they didn't respond in wave one so that's noted and they weren't followed up. You don't get a follow-up if you don't respond, and so notice these periods are for missing data and the blanks are for missing too, it depends on the data type. But my bottom line is Stata knows how to deal with this without you having to worry about it. So I've gone through the data, I've reshaped it, I've resolved it to one row per person, and I want to save it for when I do my linking. And so I've saved it, and you want to save it as a Stata .dta file, notice here it says dta saved. I think it needs to be in a .dta format to do linking in Stata. So we've taken care of the NYTD data and like I said we want to link it to AFCARS so I'm going to import the AFCARS data now and it's going to clear out the NYTD. We're done with NYTD for now, we want to work with our AFCARS and clean that up before we link it. So I've imported this, I'm going to look at it again so I'm clicking this little magnifying thing. You should always observe your data before working with it just to get a sense of what it looks like. So again the identifier state foster care ID is present again. We have one, two, three up to 12 here. The time they were first removed from their homes, so keep in mind this is AFCARS data so they have experienced the foster care system at some point. So this is when a child would have been first removed from their home, this is when they were last removed, so their most recent removal from their home, how many times they've been removed in the past up to 2015, and whether they were removed due to physical abuse or neglect, and I've kept this istpr variable in and that's termination of parental rights. And so notice that some people have multiple records, so person one, two, three, four, that's just one observation within that fiscal year, but this person five has records from 2015, 16, and 17. Similarly person nine has four rows depending on the years.
And so we would need to resolve this. So I'm going to minimize that and jump back over to the do file. And what you would want to do, so like I said in the AFCARS really you just want to keep whatever is most recent, so notice let's just look at person five. They have a record from 2015, 16, and 17. The first removal date is the same because that was when they were first ever removed from their home. The last removal date would have been updated with each passing year and the total removals should be updated based on how many times they were removed from their home. So let me just jump back here. And so like I was saying with the AFCARS, you can simply keep the most recent record per child because the records are cumulative and kept up to date with most of the information needed for research purposes. And so that's what I'm going to do here. We just want to have some sort of indicator that says which one is the most recent so that then we can remove all of the others. And so what I'm doing here is grouping by the state foster care ID and, within each ID, ordering by the fiscal year. So for each state foster care ID we want to find the most recent, so the largest fiscal year, and we're going to sort by the most recent. So let me just run this and show you what I mean. And so now we have this indicator, most recent observation, and it's a one if it's the most recent and zero if it's not. So going back to this person five, because they have three rows notice we have 2015 through 2017 and the indicator is telling us okay 2017 is the most recent observation we have for this person. And so that is just a step we need to take because then we want to get rid of anything that has a zero, because if we have the most up-to-date version then we don't need the past records. So we're using this command "keep" and we want to keep if our new variable is equal to one, so if it is indeed the most recent observation. So I'm going to run this and notice a lot of rows disappeared.
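One common way to build the most-recent-record indicator described above is sketched here. The exact commands in the presenter's do-file are not shown in the session, so this is an assumption about one reasonable approach; the variable names (stfcid, fy, totalrem, mostrecent) and the values are hypothetical.

```stata
* Toy AFCARS-style data: some children appear in several fiscal years.
clear
input stfcid fy totalrem
1 2015 1
5 2015 1
5 2016 2
5 2017 2
9 2015 1
9 2016 1
end

* Within each child, sort by fiscal year and flag the last row,
* which is the one with the largest fiscal year.
bysort stfcid (fy): gen byte mostrecent = (_n == _N)

* Drop every record that is not the most recent one for its child.
keep if mostrecent == 1

* isid stops with an error if stfcid does not uniquely identify rows.
isid stfcid
list, noobs
```

Running isid before saving is a cheap safeguard, because a one-to-one merge later on requires exactly one row per identifier.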
And we're only left with most recent observation equal to one. We should have ones all across the board here. And just, you know, harping on this person five, notice that we've kept the most recent 2017 observation for that individual. And so we should save this, that's all we really needed to do here, and to reiterate this is hugely simplified from the data you would receive if you were to request data from us, but again for privacy, security and time I've just created a really simplified version here. And so my point is you probably would have to do a little bit more if you received our real data but for here we resolved this to one row per person, 1, 2, 3, 4, 5, up to 12, and so we're done with that and we should save that and so we're going to save. This is our file name and I'm just going to run that so let me show you, so it's saved as a DTA and it's done. So now we're ready to merge. We're going to merge or join based on the state foster care ID to get whether an individual experienced physical abuse and neglect and how that might have impacted whether they then later went on to experience homelessness or incarceration. And so right now I have the AFCARS still loaded in Stata and so that's fine because we've cleaned it, we've saved it. We don't really need to reload it here because it's not using a lot of memory and so it's fine to just take the next step. So we're going to merge it with the NYTD. And so once we have our clean data, we have our other dataset cleaned and ready to merge, in Stata that's the command "merge". And so by specifying merge one-to-one that says that we have one row that should be merged with one other row in the next data, so that we shouldn't have multiple rows that are being merged. In essence we resolved those datasets to one unique observation per row.
And so we're telling Stata we're merging on a one-to-one, no duplicate observations, everything's resolved, and we are merging on this variable, the state foster care ID, okay so that's the variable we are merging on, and then we tell Stata what dataset we'd like to merge with. So how we're merging, what variable, and then what's our data we are merging with. So let me run this and pay attention to what happens here, let me expand this because magic is going to happen. So notice there was a change. And that is because we have merged the NYTD onto the AFCARS, so let me increase it, so we've merged based on the state foster care ID. Notice we have some people who were not present in AFCARS, so remember AFCARS was like our baseline and we are merging NYTD with AFCARS, so these three individuals were not found in AFCARS but they were in NYTD, which is why they've been brought in, but you see a lot of blanks here. But again Stata knows how to deal with it: unless you tell Stata not to include these it's just going to fill in blanks where it doesn't have information from AFCARS because it wasn't present in AFCARS. So for instance person 11 did not have these AFCARS variables but we can see they were in the NYTD because they have the NYTD variables. So remember that this outcome report, responded, homeless, incarceration, for wave one, two and three, are from our NYTD and that's information we have on that person but again not in AFCARS. And Stata is really handy because it will tell you this information, and so this merge variable automatically shows up when you do linkage in Stata and it will tell you whether it matched, so if it matched that means that person was in both NYTD and AFCARS so you have information across the board, notice. "Master only" means this person only showed up in AFCARS and it says master because we started with AFCARS. And so AFCARS is our master dataset that it's referring to here.
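The one-to-one merge and the automatically created _merge variable can be sketched on toy data as below. The file contents and variable names are hypothetical, and a temporary file stands in for the saved NYTD .dta file. Because the sketch uses a local macro for that temporary file, it should be run as a whole do-file rather than line by line.

```stata
* Hypothetical wide NYTD file, saved to a temporary .dta file.
clear
input stfcid responded1 homeless1
1 1 0
2 1 1
13 1 0
end
tempfile nytd_wide
save `nytd_wide'

* Hypothetical cleaned AFCARS file: the "master" data held in memory.
clear
input stfcid totalrem physabuse
1 1 0
2 2 1
3 1 0
end

* One row per child on both sides, matched on the foster care ID.
merge 1:1 stfcid using `nytd_wide'

* _merge == 1: only in master (AFCARS); 2: only in using (NYTD);
* 3: matched in both files.
tabulate _merge
count if _merge == 3
assert r(N) == 2
list, noobs
```

In this toy example children 1 and 2 match, child 3 appears only in the master data, and child 13 appears only in the using data, so its AFCARS variables are filled with missing values.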
So this person was only present in AFCARS and not in NYTD, notice we don't have any information for this person in the NYTD variables. And then we have "using only (2)", which means that it only showed up in the second dataset but not in the first, so that's why we have NYTD information but no AFCARS information. And that I think is just about it. That's the end of my code, notice there's not much to it. Again I've created a nice simple version to go through this quickly and it would probably take a lot more steps but this is like the bare-bones. We cleaned it, we resolved it to one observation per person, we saved our data, and then we did the simple linkage. And it often is just as simple as one line of code to do the actual merging. And really as is the case with most statistics, most of the work is just the cleaning of the data before you're ready to use it. [Erin McCauley] All right Sarah, thank you so much for that walk-through. So as Sarah said you know that was kind of a clean version of the data but all the steps are there. We do have a number of questions open in the Q and A and only a few minutes so I'll read the first one and then Clayton will take over moderating the Q and A. So what kind of geographic identifiers are associated with the records? State is included but does it go any smaller than that, county, city? [Sarah Sernaker] I can answer that, so if you were to request our data, we don't generally provide county level data just because there are a lot of privacy issues with that, it gets to really small counts and identifiability problems. So in our data state is usually the smallest geographic unit you would be able to get. [Erin McCauley] For AFCARS we have some counties but any county with fewer than a thousand observations is suppressed for privacy. [Sarah Sernaker] Yes and data requests for county often go through a longer process because it just is doubly problematic. But yeah.
[Clayton Covington] Okay the next question asks, I'm unclear what you mean by append datasets for different years. Do you mean adding the same variables at the end of the row with new variable names or something different? [Sarah Sernaker] That's a great question. So by appending datasets I just mean that because the only difference would be the year of the data, you should just be able to do like a simple stacking of the data. They should have the same variables so like state, sex, race, and all in the same locations. The only difference is year. And so you would just simply stack one on top of the other and make sure to have a variable indicating which year they came from. If that makes sense. Hopefully that clarifies it. [Clayton Covington] I think it does. Thank you Sarah. Next question asks, do you have guidance and or syntax for SPSS on how to link multiple years of AFCARS so we can follow entry cohorts over time? [Sarah Sernaker] I know in last year's version of this, which is similar in its slides, I did an example in SPSS and I think the code is available on our site. [Erin McCauley] Yes we'll post it into the chat. [Sarah Sernaker] Yeah, and that was a linkage example, I think also between AFCARS and NYTD. So if you wanted to link AFCARS between years it would be similar but yeah. I have a code example in SPSS. Let me just stop right there. Stop myself. [Clayton Covington] Okay next question. Is it possible that one youth would move states and therefore have different states for each wave yet one STFCID? [Sarah Sernaker] That is a great question because this is an issue that researchers on our end are trying to figure out how to deal with and, more importantly, trying to measure the impact of. Because if a child moves, they get a new state ID and you can't track them.
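Returning to the earlier question in this exchange about appending yearly files, the stacking Sarah describes might look like the sketch below. The data, the variable names (stfcid, sex, fy), and the temporary file are all hypothetical, and the sketch should be run as a whole do-file because it relies on a local macro.

```stata
* Hypothetical 2015 file, tagged with its year and saved temporarily.
clear
input stfcid sex
1 1
2 2
end
gen fy = 2015
tempfile afcars2015
save `afcars2015'

* Hypothetical 2016 file with the same variables.
clear
input stfcid sex
2 2
3 1
end
gen fy = 2016

* Stack the 2015 rows underneath the 2016 rows. The fy variable
* records which annual file each row came from.
append using `afcars2015'
sort stfcid fy
list, noobs
assert _N == 4
```

Appending adds rows rather than columns, so a child present in both years (child 2 here) ends up with two rows, which is the long format that the most-recent-record step then resolves.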
And so that's a real problem and something to be aware of, that if a child is reported in Michigan foster care for a few years but let's say they move to Illinois, they are essentially a new observation, a new data point, and we can't really track that. [Clayton Covington] Okay, next question asks, are the AFCARS removal reasons about the most recent or the first removal? [Sarah Sernaker] That would be the most recent. [Clayton Covington] Alright, next question: what are your general recommendations for addressing missing data in the linked dataset, multiple imputation for example? [Sarah Sernaker] Yeah, it really depends on the scope of your research and how much missingness there is. Because if you are just missing like a few observations, like I don't know up to five, you might be able to get away with just omitting them if you think that's you know reasonable to do. Otherwise some sort of imputation, whether that's a simple mean imputation, which would be the simplest way to go, or a multiple imputation, which is more complex but yields better results in the end. Yeah that's a great question when dealing with any data because missing data is inevitable. [Clayton Covington] Next question, is it true for NCANDS as well that county is provided if there are counts of more than 1000? [Sarah Sernaker] Again this is like on our side and I deal with the data directly so like I know we record county. The problem is when outsiders request data, it goes through a more rigorous process and if you did get county, which I do think we would give out, you're not going to get every county, it needs to have a population of, yeah, I think a thousand or more. [Erin McCauley] Yes so Michael Dineen, our data analyst, just popped into the chat, I know only panelists can see it, but he said that that is true, that we also suppress NCANDS data for small counties. [Sarah Sernaker] Yes Michael knows better than I. [Erin McCauley] Thanks for chiming in Michael.
[Clayton Covington] Thanks Michael. Next question: does AFCARS have placement start and end dates for each placement? [Sarah Sernaker] So this is also something to keep in mind when dealing with AFCARS. The date of placement and the placement given would be the most recent, but the problem is if a child is moving around within the year, so if they start the year off in one placement, they move to a separate placement and then they end the year at a totally new placement, we're really only capturing that last placement. So what is their placement at the time of the record. And so usually it's sufficient and captures the whole picture but there are cases where we definitely acknowledge that we are not capturing the full movement within the system. So again a limitation with the data and something to keep in mind. [Clayton Covington] Alright, next question: is there SPSS syntax available to identify NCANDS reports concerning children in foster care? [Sarah Sernaker] I think that would just be a matter of linking NCANDS and AFCARS, so if you're interested in children in NCANDS data who have gone through the foster care system you'd have to link them to see which children are in both of the datasets. And like I said before there is example SPSS code from last year linking NYTD and AFCARS and so you'd kind of have to extrapolate from that how to link NCANDS and AFCARS. [Clayton Covington] It also appears that Michael added "to track changes you'll need to look at every year the child is in foster care" just as an add-on, so again look at the presentation that Erin shared with all of you in the chat, keeping in mind what Michael said. And then we're going to go to our last question because we are at time, and then we'll end the session. And so the last.
[Sarah Sernaker] Michael should add his email if he feels like he wants to field some of these questions because we have ours but Michael [med39@cornell.edu] is definitely a hugely helpful resource and is more familiar with these data than even I am. So if Michael feels comfortable sharing his email he should do so. So okay last question, sorry. [Clayton Covington] Last question. I know that there are other questions but please note you're welcome to ask us questions online via email and also note that this presentation will be posted at a later date so any syntax or code you want to look up will be available. So last question: could you provide any input on how you link parents with children in these datasets? [Sarah Sernaker] Well the parents we don't have much information on. We record some information about the parent in the AFCARS and a little information in NCANDS if the parent is the perpetrating offender, but we don't have like records on the parent itself. So you'd have to be able to link the child again and just note that parent information. [Clayton Covington] All right well thank you everyone for attending this week's session. Again I'm so sorry we cannot get to all of your questions but again feel free to email us with the link that Andres just shared in the chat [NDACANsupport@cornell.edu] and please note again that all of this information will be available later this summer. So thank you again for attending today's session. We hope to see you at next week's session and subsequent sessions that are relevant to you. Thank you all for your time. [voiceover] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families.