Transcript for video titled "LeaRn R with NDACAN, Week 3, 'Basic Data Management', November 15, 2024". [VOICEOVER] National Data Archive on Child Abuse and Neglect. [ONSCREEN CONTENT SLIDE 1] WELCOME TO NDACAN MONTHLY OFFICE HOURS! National Data Archive on Child Abuse and Neglect DUKE UNIVERSITY, CORNELL UNIVERSITY, & UNIVERSITY OF CALIFORNIA: SAN FRANCISCO The session will begin at 11am EST 11:00 - 11:30am - LeaRn with NDACAN (Introduction to R) 11:30 - 12:00pm - Office hours breakout sessions Please submit LeaRn questions to the Q&A box This session is being recorded. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us If issues persist and solutions cannot be found through Zoom, contact Andres Arroyo at aa17@cornell.edu. [Paige Logan Prater] All right, I will go ahead and get started. Andres, thank you for starting the recording. I think there will be some folks trickling in, but we want to be timely and make sure we get to all the R content, so I'll go ahead and just open the space. So thank you everybody for joining, we're excited to keep this series going. This is the 2024 to 2025 Office Hours series hosted at the National Data Archive on Child Abuse and Neglect. My name is Paige Logan Prater, I am the Graduate Research Associate at NDACAN, and if you've joined us for the first two sessions welcome back, and if this is your first session, welcome, period. This is the third in a series of R training modules that we'll get into. Frank Edwards is leading us on that charge. So we will spend the first 30 minutes of the hour talking about R. Actually, we're not going to be talking about intro to R, that was our first session. We are talking about data management, I believe. So we'll do 30 minutes with Frank on R, and then for the second 30 minutes we will do our typical Office Hours breakout session. Frank will stay on, and we have another colleague, Alex, who will join us, so they can answer any lingering questions about R stuff. They can answer questions about data sets and the archive, as well as other statistical analysis questions, career stuff, all that jazz. A couple of reminders: if you have questions throughout, please put them in the Q&A box or the chat. We will get to them towards the end. We are recording this session, so we are trying to keep it as clean as possible for transcription purposes, and all of these sessions will be recorded and posted with the slides on our website, so you can refer back to them. Yeah, I think that is it. Frank, I will kick it over to you to jump into the content. [Frank Edwards] All right, thank you so much Paige, very glad to have everyone here today. As Paige mentioned, today we'll be discussing the basics of data management with R. And yeah, we'll just dive right in.
[ONSCREEN CONTENT SLIDE 2] LeaRn with NDACAN, Presented by Frank Edwards [ONSCREEN CONTENT SLIDE 3] Materials for this Course Course Box folder (https://cornell.box.com/v/LeaRn-with-R-NDACAN-2024-2025) contains Data (will be released as used in the lessons) Census state-level data, 2015-2019 AFCARS state-aggregate data, 2015-2019 AFCARS (FAKE) individual-level data, 2016-2019 NYTD (FAKE) individual-level data, 2017 Cohort Documentation/codebooks for the provided datasets Slides used in each week's lesson Exercises that correspond to each week's lesson An .R file that will have example, usable R code for each lesson – will be updated and appended with code from each lesson [Frank Edwards] So the data that we'll be using today, all the materials, are up on our shared Box folder. If you go into week three you'll find four files: you'll find PowerPoint and PDF versions of these slides, you'll also find an R script called week3.R, which includes all of the code that I'll be showing today and will walk you through some basics of data management in R. There's also a homework file that's titled hw3.Rmd, and that's an R Markdown file. R Markdown is a really nice format for writing up reports when we are combining both prose and code, so it's a really convenient format for coding if you haven't used it before. The data that we'll be using is in the data subdirectory of the LeaRn with NDACAN folder, and the data we'll be focused on today are the Census state-level data 2015 to 2019 and the AFCARS state aggregate data 2015 to 2019. Codebooks are also available in that directory. And a reminder, we've said it before and I'll say it again: these data are perturbed, that is, they are not real data. Please do not use them for research, these are strictly for demonstration purposes. If you would like access to the AFCARS Foster Care file that we have derived this from and then perturbed, that is, kind of introduced some noise to, you can apply to NDACAN for access. [ONSCREEN CONTENT SLIDE 4] Week 3: Basic Data Management, November 11, 2024 [Frank Edwards] Without further ado let's talk about data management. [ONSCREEN CONTENT SLIDE 5] Data used in this week's example code Census aggregate data from 2015-2019 (census_2015_2019.csv) Population counts by state, year, sex, race, and ethnicity Publicly available from CDC Wonder: https://wonder.cdc.gov/single-race-population.html AFCARS aggregate data from 2015-2019 (afcars_aggreg_suppressed.csv) Counts by state, year, sex, race/ethnicity of children in foster care; number of children removed due to physical or sexual abuse, or neglect; the number of children who entered or exited foster care in that year Can order full data from NDACAN: https://www.ndacan.acf.hhs.gov/datasets/request-dataset.cfm [Frank Edwards] So data management is a pretty broad topic, and to start with let's talk about the data that we're actually using. Again, we're using Census 15 to 19 and AFCARS 15 to 19; this is in census_2015_2019.csv and afcars_aggreg_suppressed.csv. The Census file contains population counts by state, year, sex, race and ethnicity, and this file that we have compiled for you is derived from CDC Wonder, which takes some of the Census small area estimates and gives us single-year age by sex by race population estimates for U.S. counties. It's a really really handy file to use, especially for our administrative data in the AFCARS and NCANDS.
The AFCARS file we're using today contains counts by year, sex, and race/ethnicity of children in foster care, the number of children removed due to physical or sexual abuse or neglect, and the number of children who entered or exited foster care in that year. And again, you can order that data from NDACAN or just use the demo data for practice today. [ONSCREEN CONTENT SLIDE 6] How R Works with Data [Frank Edwards] So let's talk about how R works with data. We're going to be a little theoretical to start, a little kind of basics of computer science and working with R, and then we'll pivot into working with code. [ONSCREEN CONTENT SLIDE 7] Reading, Writing and the Environment The 'read' family of functions load data read_csv, read_tsv, read_fwf (tidyverse versions of the read. functions) Data must be loaded into your environment (RAM) The environment is temporary The 'write' family of functions store data on disk write_csv is most common, saveRDS has some uses [Frank Edwards] So in R we have what's called the environment, and that's where you're going to be loading things into the active RAM of your computer, generally using one of the 'read' family of functions. Now in base R we have read.csv, read.delim, and read.fwf. In Tidyverse, which we introduced last time and is my preferred way to work with R, we have a parallel suite of functions that use an underscore instead of a dot. And so read_csv, read_tsv, and read_fwf will read three kinds of files. An NDACAN data distribution will provide you with comma-delimited files for most of the files that we're working with, so those are going to be csvs, that is, tabular flat files where each column of the data set is separated by a comma. A tsv is really similar except instead of using a comma as the delimiter it uses a tab. And a fixed-width format is a little bit funkier in that each column has a defined number of characters in width. For example, the CDC Wonder file that we're using here is an fwf file. Those can be a little bit trickier to work with. You have to pay really careful attention to your codebooks for those. But we'll typically import our data using these read functions, and in R we're going to load data into our environment in order to work with it. And when R loads things into your environment, that is your active workspace, it's loading it onto your computer or your server's RAM, your random access memory. So it's loaded actively into RAM to work with; it's not pulling the data from disk anymore once it's in the environment. And the environment, because it's stored in RAM, is inherently temporary. We have ways to save changes to our environment, but I generally recommend against that. I generally recommend that we treat our environment as a temporary space that will be erased as soon as we quit our current session. So think of the environment as your sandbox where we're going to be building whatever we're going to be building for the particular session, but it's temporary. It will go away. If we want to commit something and we want to store it, I strongly recommend using the 'write' family of functions. So we use the 'read' functions to pull data in and we use the 'write' functions to generate output to store on disk. Right, in contrast to RAM, something we write onto our disk has a file name and a location on disk that we can then pull up later. Something that happens in RAM, again, once we quit out of our R session, will be gone.
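For reference, here is a minimal sketch of how this 'read' family is typically called with the tidyverse loaded. Only the AFCARS csv path matches a file used in this lesson; the tab-delimited and fixed-width file names, column widths, and column names below are hypothetical, for illustration only, and a real fixed-width specification would come from the file's codebook.

library(tidyverse)
# comma-delimited file (values separated by commas); used in this lesson
afcars_demo <- read_csv("./data/afcars_aggreg_suppressed.csv")
# tab-delimited file (same idea, tab as the delimiter); hypothetical file
example_tsv <- read_tsv("./data/example_file.tsv")
# fixed-width file: each column occupies a set number of characters,
# so we supply widths and names from the codebook (these are invented)
example_fwf <- read_fwf("./data/example_file.txt",
                        col_positions = fwf_widths(c(2, 4, 9),
                                                   c("st", "year", "pop")))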
So write_csv is the function I find myself using most often. saveRDS is another good one to use if we have an R data object that we want to store exactly as it is. So, you know, if we're doing something like multiple imputation where we're fitting really complex models, saveRDS can be helpful to save a particular object from our environment to disk, but write_csv is what we're going to use most of the time. That is, if I've done some series of transformations to my data, I'll use write_csv to spit out a csv file that contains my output that I'll then work from. But think of the environment as temporary. [ONSCREEN CONTENT SLIDE 8] Your laptop (or server's) disk and the environment setwd points R to a location But this is tedious, and bad form with Git / collaboration RStudio Projects automatically orient your environment to a particular disk location Get in the habit of using projects and developing a consistent directory structure [Frank Edwards] Right? Now your laptop or your server's disk is what we're going to point to in order to pull things into our environment. And a lot of beginners in R will use setwd to point R to a location on disk. It's tedious though. We can use the setwd function to set R's working directory, but it actually makes collaboration quite difficult. Or if you're like me and you maybe are writing code from multiple machines and using software like Git for file synchronization and collaboration, setwd can actually get in the way of your code pretty quickly. So I strongly recommend orienting our data management around RStudio projects. RStudio projects will automatically point your environment to the location on disk where that project file is stored. And if you get in the habit of using projects, it encourages you to develop a pretty consistent directory structure and a pretty consistent workflow for working in R and RStudio that I think is really beneficial the more projects you start developing. [ONSCREEN CONTENT SLIDE 9] Screenshot of RStudio, drawing attention to the Project menu in the top right corner of the integrated development environment (IDE). [Frank Edwards] So I just want to show you briefly how we interface with the project structure in RStudio. We have this little cube over here. If we click it we'll get a drop-down menu that allows us to establish a new project. And typically what I'm going to do is create a new project in an existing directory. And what RStudio is going to do is save a .Rproj file in that directory. Now whenever we open up RStudio I can click on this drop-down for projects and choose the project that I'm working on. You can see here that my most recent projects that I've opened in RStudio are available for me to switch over to at a click. And the nice thing about projects is it's going to automatically switch my working directory at the same time as it switches over the project orientation. So I no longer really have to think about my disk and locations on my disk, because it'll be automatically oriented to a root directory based on the project, and that's incredibly useful. So I strongly recommend you get in the habit of using RStudio projects; it will make your life a lot easier, especially when you start juggling multiple projects.
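To make the read/write round trip concrete, here is a minimal sketch of the project-oriented workflow. It assumes an RStudio project is open, so paths are relative to the project root, and that ./data and ./models subdirectories already exist; the copied csv name, the simple model, and the .rds file name are hypothetical examples rather than part of the course materials.

library(tidyverse)
# read from the data subdirectory of the project
afcars_demo <- read_csv("./data/afcars_aggreg_suppressed.csv")
# commit a table to disk as a csv (hypothetical output name)
write_csv(afcars_demo, "./data/afcars_copy.csv")
# save any R object exactly as it exists in the environment, and
# read it back in a later session (hypothetical model example)
m1 <- lm(entered ~ fy, data = afcars_demo)
saveRDS(m1, "./models/m1.rds")
m1 <- readRDS("./models/m1.rds")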
[ONSCREEN CONTENT SLIDE 10] Screenshot of mac file browser showing a research project directory named polmort-24, which contains 4 subdirectories (data, models, ms, vis), R scripts, and R objects [Frank Edwards] But the other thing we need to think about is, okay, we're pointing it to a particular place, and you can see here this is an example of a project that I'm working on right now. It's a data analysis looking at police-involved killings, a time series analysis. And you can see that polmort-24.Rproj file, right, that's my project file that organizes this directory. Whenever I open this project in RStudio it's going to point me directly to this folder polmort-24, and the thing I want to point out in this folder is that I think it's a really good habit to develop a consistent directory structure across your projects. That is, here I typically will have subdirectories for each project folder that include things like data, which is where all of my input data files are going to go. Models, where, in this case, I'm estimating some complex models that I want to save to disk, so that's where I'm going to store the output for those models. Ms is where I'm storing the manuscript that I'm writing at the same time as I'm doing my data analysis. And vis is where I store my visual outputs, so graphics that I produce using R, usually as pngs or PDFs, I'll store in the vis folder. But having this kind of consistent directory structure makes it so when you pivot project to project you're actually using the same modular kind of meta infrastructure for your project, even if the project differs. So develop whatever workflow works for you, but having a consistent workflow that you apply across projects will be incredibly useful as you scale up complexity in working with R and RStudio. [ONSCREEN CONTENT SLIDE 11] A screenshot of RStudio showing the files pane with a clean directory structure, containing a ./data subdirectory and a few R scripts. The script pane contains the following code in a file named week3.R library(tidyverse) afcars_demo<-read_csv("./data/afcars_aggreg_suppressed.csv") census_demo<-read_csv("./data/census_2015_2019.csv") [Frank Edwards] So, loading data into RStudio. You can see here I have created a project folder for LeaRn with NDACAN, and the first thing I'm doing is library-ing in Tidyverse, because we're going to consistently use the Tidyverse suite of functions. First I'm going to walk you through the code on the slides and then we'll actually pivot over to RStudio and I will execute the code so you can see it work sequentially. But here we're going to load in afcars_demo and census_demo, and we're pulling those into active RAM. You can see on the files pane of my RStudio browser that right now what I have is my root directory, which contains my R scripts and some outputs, and then a data subdirectory. So I've moved afcars_aggreg_suppressed.csv and census_2015_2019.csv into the data subdirectory, and now I know anytime I'm reading in a file it's going to be in data. [ONSCREEN CONTENT SLIDE 12] RStudio screenshot showing output and script.
Script text: library(tidyverse) afcars_demo<-read_csv("./data/afcars_aggreg_suppressed.csv") census_demo<-read_csv("./data/census_2015_2019.csv") ### change census variable names ### to match afcars names for join census_demo<-census_demo %>% rename(fy = cy, stname = state, state = stfips) CONSOLE OUTPUT TEXT: > head(afcars_demo) # A tibble: 6 × 10 fy state sex raceethn numchild phyabuse sexabuse 1 2015 1 1 1 2180 352 88 2 2015 1 1 2 1245 198 46 3 2015 1 1 4 10 NA 0 4 2015 1 1 5 NA 0 0 5 2015 1 1 6 245 30 NA 6 2015 1 1 7 204 56 22 # ℹ 3 more variables: neglect, entered, exited > head(census_demo) # A tibble: 6 × 8 cy stfips state st sex race6 hisp pop 1 2015 1 Alabama AL 1 1 0 331305 2 2015 1 Alabama AL 1 1 1 33105 3 2015 1 Alabama AL 1 2 0 164122 4 2015 1 Alabama AL 1 2 1 2763 5 2015 1 Alabama AL 1 3 0 2540 6 2015 1 Alabama AL 1 3 1 1195 [Frank Edwards] Now another routine challenge when we're thinking about data management is how do we get two data sets to talk to each other? And this is where, in comparison to some other software suites that tend to only want to work with one table at a time, R really sings and where Tidyverse becomes incredibly useful. So when we look at the structure of afcars_demo and census_demo, our goal here is going to be to set up a join where we have a count of the number of foster care entries for each state and we want to compute per capita foster care entry rates by state. In order to set that up, though, we first need to prepare ourselves for a join. There are a lot of different ways to go about it. My preference is to harmonize the variable names and values. That is, to choose one of my tables and ensure that all of the column names that I'm going to be using for matching are exactly the same and that the values within those columns that I'm going to use for matching are also exactly the same. So here I'm going to take Census and I'm going to change its variable names for year, state name, and FIPS code. We can see that in afcars_demo the FIPS code for a state, that is the numeric Federal Information Processing Standards code, is written just as state, and in Census it's written as stfips. So I want to change stfips in Census over to state. We don't have state name at all in afcars_demo, but I'd like to retain that, so let's take the full name of the state and call it stname, and that will now get joined over onto afcars_demo when we run our join. And we'll change cy over to fy for consistent time names. [ONSCREEN CONTENT SLIDE 13] RStudio screenshot.
script text: library(tidyverse) afcars_demo<-read_csv("./data/afcars_aggreg_suppressed.csv") census_demo<-read_csv("./data/census_2015_2019.csv") ### change census variable names ### to match afcars names for join census_demo<-census_demo %>% rename(fy = cy, stname = state, state = stfips) ## Collapse race / ethnicity / sex in both tables ## to create total pop, total entries census_collapse<-census_demo %>% group_by(fy, state) %>% summarize(pop = sum(pop)) afcars_collapse<-afcars_demo %>% group_by(fy, state) %>% summarize(entered = sum(entered, na.rm = T)) console output: > head(census_collapse) # A tibble: 6 × 3 # Groups: fy [1] fy state pop 1 2015 1 1103159 2 2015 2 184134 3 2015 4 1629765 4 2015 5 706879 5 2015 6 9118819 6 2015 8 1258312 > head(afcars_collapse) # A tibble: 6 × 3 # Groups: fy [1] fy state entered 1 2015 1 3536 2 2015 2 1483 3 2015 4 12553 4 2015 5 4009 5 2015 6 31258 6 2015 8 4733 [Frank Edwards] Now we also have a bit of a challenge here in that we have different levels of measurement and we have different coding structures for sex, race, and ethnicity. And so what we want to do to get started for our join is collapse the data, and we're going to use the Tidyverse pair of functions group_by and summarize to do that. Group_by and summarize are a really really great set of tools, and we always use them together. In this case I want to just collapse AFCARS down to all children regardless of age, race, ethnicity, or sex and compute the total number of children who entered foster care within each state. So I can do that, and we can see on line 22, afcars_collapse: we'll take afcars_demo, then the pipe operator, group_by fy and state, pipe, summarize entered equals the sum of entered. We do have some missing values here, so I'm going to use na.rm equals true to ensure that I can actually get numeric output. That's going to implicitly drop all my missing cases. We can talk about missing data analysis another day; that's not necessarily always the greatest idea, but in this case we're going to do it. And I'm also going to collapse Census down to just the state level. So we're going to end up with data structured in the way that we see on the console pane. That is, we have a Census table with fy, state, and pop and an AFCARS table with fy, state, and entered. [ONSCREEN CONTENT SLIDE 14] RStudio screenshot.
Script: library(tidyverse) afcars_demo<-read_csv("./data/afcars_aggreg_suppressed.csv") census_demo<-read_csv("./data/census_2015_2019.csv") ### change census variable names ### to match afcars names for join census_demo<-census_demo %>% rename(fy = cy, stname = state, state = stfips) ## Collapse race / ethnicity / sex in both tables ## to create total pop, total entries census_collapse<-census_demo %>% group_by(fy, state) %>% summarize(pop = sum(pop)) afcars_collapse<-afcars_demo %>% group_by(fy, state) %>% summarize(entered = sum(entered, na.rm = T)) ### join the data.frames, and compute a per capita entry rate dat_join<-afcars_collapse %>% left_join(census_collapse) dat_join<-dat_join %>% mutate(entries_per1000 = entered / pop * 1000) Console output: > head(dat_join) # A tibble: 6 × 4 # Groups: fy [1] fy state entered pop 1 2015 1 3536 1103159 2 2015 2 1483 184134 3 2015 4 12553 1629765 4 2015 5 4009 706879 5 2015 6 31258 9118819 6 2015 8 4733 1258312 > dat_join<-dat_join %>% + mutate(entries_per1000 = entered / pop * 1000) > head(dat_join) # A tibble: 6 × 5 # Groups: fy [1] fy state entered pop entries_per1000 1 2015 1 3536 1103159 3.21 2 2015 2 1483 184134 8.05 3 2015 4 12553 1629765 7.70 4 2015 5 4009 706879 5.67 5 2015 6 31258 9118819 3.43 6 2015 8 4733 1258312 3.76 [Frank Edwards] So now we're ready to go for our join. We can use the left_join function to take an object on the left hand side and append a table to it. So we're going to use census_collapse and afcars_collapse. We've already harmonized the names, so when we conduct our join, what's going to happen is it's going to match census_collapse by fy and state and attach the variables from Census onto our afcars_collapse table based on those key columns fy and state. And you can see the result of that in that first head call in the console. Now we have something called dat_join, an object, a data frame or a tibble, which is just Tidyverse's cutesy name for data frames, that contains fy, state, entered, and pop. That is, we've merged our tables, we've joined our tables together, and now we're set up to start actually doing some good analysis. The first thing I want to do with that analysis is compute an entry rate per capita. In this case I'll compute foster care entries per thousand population. And I'll use mutate to do that. Mutate is a Tidyverse function that we covered last time that adds columns to a data frame, adds columns to a table. So anytime we want to create a new column or overwrite an existing column, we'll use mutate for that, if we're not changing the structure of the data otherwise. Summarize is our other column creator, but what summarize is going to do is collapse the data. If we have grouped it, it'll collapse it by those categorical groupings. If we have not grouped it, it'll collapse it down to a single row. But if we don't want to collapse, we just want to compute a new column, mutate is our friend. So here entries_per1000 is equal to the column, the vector, entered divided by the vector pop times a thousand, pretty straightforward. [ONSCREEN CONTENT SLIDE 15] RStudio screenshot with script, console, and plot viewer. Plot viewer shows 5 histograms of foster care entry rates per 1000 population across states for 2015 - 2019. Pattern shows gradual left clustering over time.
SCRIPT NEW CODE: ### visualize the data dat_join %>% ggplot(aes(x = entries_per1000)) + geom_histogram() + facet_wrap(~fy) CONSOLE OUTPUT: > dat_join %>% + ggplot(aes(x = entries_per1000)) + + geom_histogram() + + facet_wrap(~fy) `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. Warning message: Removed 5 rows containing non-finite outside the scale range (`stat_bin()`). [Frank Edwards] My next step after I've done something like this is always going to be to visualize my distributions, to think about what's going on across time with the data. So here I'm going to take dat_join, and because ggplot plays really nicely with the Tidyverse, it's a core part of the Tidyverse, I can actually just pipe a ggplot call onto my table. So I have dat_join, pipe, ggplot, I'm going to use entries_per1000 as my x aesthetic, and I want to look at a geom_histogram. So this will just give me an at-a-glance look at the distribution of foster care entry rates across states over time. I've added a facet_wrap here because we have five years of data and overlaying these histograms on top of each other would obscure maybe some changes over time. And what I'd really like to look at is the distribution of entry rates across states for each year, and so facet_wrap by fy does that. We could have done it by state. That gets very busy with 50 states, but it's actually something I use all the time. If you find yourself doing state-level analysis with our administrative data and you're looking at time series or other kinds of features like that, I'd strongly recommend checking out the geofacet package in R. It will arrange your facets according to the geographic location of U.S. states, and it's a really nice visual to work with for the kind of things that we may want to check out when we're doing our exploratory data analysis with the administrative data. But you know, looking at these histograms I can see, okay, the annual patterns look relatively similar. Maybe we're getting a little more left-hand clustering over time, which we know to be true as foster care entry rates have declined nationally over this period. You know, the tails on the right side are getting pulled in a bit. But this is just a quick gut check to make sure that we haven't made any errors in how we've computed anything, to make sure that the data looks the way we might expect it to look. [ONSCREEN CONTENT SLIDE 16] RStudio screenshot with plot pane, console and script. Plot pane shows a spaghetti plot with time series of per capita foster care entries for all states. Trends are a bit all over the place, but stable for the most part between 1 and 5 per 1000 population SCRIPT NEW CODE: ### visualize state time series dat_join %>% ggplot(aes(x = fy, y = entries_per1000, group = state)) + geom_line(alpha = 0.5) CONSOLE OUTPUT: > dat_join %>% + ggplot(aes(x = fy, y = entries_per1000, + group = state)) + + geom_line(alpha = 0.5) Warning message: Removed 5 rows containing missing values or values outside the scale range (`geom_line()`). [Frank Edwards] We could also visualize the state time series with a spaghetti plot, right? So here I am taking dat_join, and instead of histograms this time we're just going to put all 50 time trends on one plot, and I can use the grouping aesthetic for that. So here on my ggplot call I've provided three aesthetic parameters, x, y, and group. And group will tell R to draw a separate line for each value of group.
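As an aside, the geofacet approach mentioned above would turn a plot like this into one small panel per state, arranged roughly according to the U.S. map. Here is a minimal sketch, assuming the geofacet package is installed and the objects from this script are already in the environment; facet_geo needs a state name or two-letter abbreviation to place the panels, so this version keeps the st column when collapsing Census, and the object name state_ts is made up for illustration.

library(geofacet)
# keep the two-letter state abbreviation (st) while collapsing,
# then join on the entry counts and compute the rate as before
state_ts <- census_demo %>%
  group_by(fy, state, st) %>%
  summarize(pop = sum(pop)) %>%
  left_join(afcars_collapse) %>%
  mutate(entries_per1000 = entered / pop * 1000)
# one small time series panel per state, positioned roughly geographically
state_ts %>%
  ggplot(aes(x = fy, y = entries_per1000)) +
  geom_line() +
  facet_geo(~ st)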
Back on the spaghetti plot, I gave geom_line an alpha parameter to add some transparency to the lines so that we can really see the overlaps a little more clearly. But we do see pretty clear clustering going on here, and it appears that most states are following similar trends. We have a few outlying states with very high levels of entry. I believe that top one is West Virginia, but the bulk of states are all falling within that kind of two to five per thousand entry rate consistently over the time period that we're looking at. [ONSCREEN CONTENT SLIDE 17] RStudio screenshot showing script, console, and file browser pane with new subdirectory 'vis' SCRIPT NEW CODE: ### visualize state time series dat_join %>% ggplot(aes(x = fy, y = entries_per1000, group = state)) + geom_line(alpha = 0.5) ggsave("./vis/state_ts.png", width = 5, height = 5) write_csv(dat_join, "./data/join_demo.csv") CONSOLE OUTPUT: ggsave("./vis/state_ts.png", width = 5, height = 5) Warning message: Removed 5 rows containing missing values or values outside the scale range (`geom_line()`). > > write_csv(dat_join, "./data/join_demo.csv") [Frank Edwards] Now maybe we're at a point where we're happy with the work that we've done, we like that state time series visual that we've produced, and so we want to store that to disk. I'm going to go ahead and create a new directory called vis, v-i-s, and that's just what I like to use; you could call it visuals, you could call it figures, you could call it whatever you want. But I think it's a really good idea to create a subdirectory within your project specific for visual output, specific for graphics. And I'm going to ggsave my plot into the vis folder as state_ts.png. ggsave will take the last ggplot object that was rendered on your plotting device and save it to disk. And I'm going to specify width equals 5, height equals 5, that's in inches by default, because that'll be a nice high resolution image that will look pretty good if I pull it into a Word document or into a report using R Markdown or LaTeX or something like that. It'll be a nice high resolution image that'll look good in a paper. Now I also created this dat_join object, and maybe I don't want to have to run this script every time I'm going to do the analysis again. So I can effectively commit my work to disk using write_csv. Most of what I did in this script, right, was pull the two data sets, clean them up a little bit, join them, and compute this per capita entry rate. So we could think of dat_join as being the focal output of this script, and so I'm going to commit that to disk as join_demo.csv, which I'm going to store in my data folder. And so the write_csv function takes two arguments: it takes the object that's going to be committed to disk and it takes a file path. And in this case, because I am using R projects, I have already pointed RStudio and R and the R console to the working directory for my project, so I can just tell it to save it in ./data, which will say take the current directory you're in, look for the data subdirectory, and then store it as join_demo.csv. So this is a kind of soup to nuts basic crash course in how we do practical data management in R and RStudio. I will pull up the script now and just show you how it runs. [ONSCREEN] The speaker runs through the code and output in the slides that have been previously shown.
[Frank Edwards] So this is the exact same script, and you have access to this as week3.R in the Box folder, but it will produce output that looks like this as we go. We've read those two files in. We're harmonizing the names. We're collapsing Census into census_collapse. We're going to collapse AFCARS. We're going to do our join. All looks good. We're going to compute our entries per thousand. Again, looks good. We'll do our by-year histograms. Okay, we'll do our spaghetti plot, which, you know, just looks like a big pile of spaghetti. That's a good one to do. You know, we could also add the facet_wrap by state here if we wanted to, but again that's going to be very busy. I mentioned a second ago I really like using the geofacet package, so if you want to think about doing small multiple plots for states I strongly recommend checking out geofacet. And then I'm going to save my spaghetti plot to disk. I'm getting a warning message, removed five rows, that's because there are some NA's in the data, but we can check if we like using our files pane to see that I've saved this to disk and it was saved right now. So that's good. And I'll also commit dat_join to disk, which we can see now is over here in join_demo.csv. So that's all I've got today. I will kind of stop the screen, I'll keep the screen share going, but wait, yeah, okay, I thought I wasn't screen sharing for a second but I'm good. Yeah, so questions, comments, reflections on how your pets are doing today. Anything. [Paige Logan Prater] Thanks, yeah, please, anyone, if there are any questions put them in the chat, put them in the Q&A box, or you can also just come off mute and ask your question. [Frank Edwards] Yeah, and I will note I gave y'all a homework. I'll show it to you real quick; it's hard. And I kind of intended it to be hard, because working with these data is hard. If you check out hw3.Rmd, I'm asking you to replicate some of what we did today, but I'm also asking you to handle the race and sex variables and to recode them, to harmonize them, so you could compute race- and sex-specific foster care entry rates. So this is definitely a challenge, but you should have the tools you need to handle that. So if you want to challenge yourself, try the homework. [Paige Logan Prater] Thanks Frank. Thanks for that warning. I feel like sometimes when there are assignments and we don't know that they are hard, and they are hard, then we're like, is it me? But it sounds like it won't be you if you're having some struggles. [Frank Edwards] It won't just be you. The other reason I gave this to you as homework is because I didn't want to write code to do it for the demo, because it's too complicated. But I will be providing a solution set so you'll be able to check your work against it. That'll go back in the Box folder when it's done. [Paige Logan Prater] Great, I will just open the floor for, you know, any lingering questions, but it sounds like and looks like there are no questions coming in. So Frank, if you want to stop sharing your screen, I can start and we will move through. Andres, you can stop recording. [VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an Office of the Administration for Children and Families. [MUSIC]