[voiceover] National Data Archive on Child Abuse and Neglect. [Clayton Covington] Hello everyone and welcome to the 2021 Summer Training Webinar Series. This is the conclusion of the Summer Training Webinar Series hosted by the National Data Archive on Child Abuse and Neglect. If you have been working with us for a while, you know that for a long time we've been affiliated with Cornell University, and more recently we are also affiliated with Duke University. The general theme for this year's series has been "Data Strategies for Child Welfare," and we have covered a whole host of things, including a general introduction to NDACAN and the data that we house. Specifically, we focused on some of our survey-based data, whereas in past years we focused more on the administrative data cluster. We've also done a workshop on multilevel modeling in recent years. And we're going to conclude the series with a presentation from Sarah. So here is the schedule we've followed. We started at the beginning of last month, and as I was telling you a little earlier, we have covered a whole range of topics. One topic I didn't mention earlier was the VCIS data, which is one of our most recent data set releases and looks at more historical data. We looked at the utility of that data compared with our current administrative data cluster, with an emphasis on special populations, particularly the American Indian and Alaska Native population. And the final session today is a latent class analysis workshop with NDACAN statistician Sarah Sernaker. So I think Sarah can take over. [Sarah Sernaker] Hi everyone, my name is Sarah Sernaker. As Clayton said, I'm a statistician working with NDACAN. Today we're talking about latent class analysis. I have a few slides introducing what that even means, what type of models are involved, and when you might want to use it. Then I'll dive into the math and algorithms just to give you a little more foundational understanding of what is going on. And then I have two examples of actually applying latent class analysis to some data. So let's just jump right in. Latent class analysis: anytime you see LCA, that means latent class analysis. I'll just say that this is a topic that could take a whole week to discuss, so there are definitely some things we're not going to get to today, but hopefully you'll leave with a basic understanding of how LCA works and how to start using it in your own research. So what even is a latent class, or a latent variable? You might have heard that term in other places. A latent variable is a variable you don't directly observe, but it explains much of the relationship or covariation between the variables you do observe. You could have more than one latent variable within your data, but to start, let's just consider one unobserved latent variable. That's really the key descriptor of a latent variable: it's unobserved. Let me jump to an example to provide a little context. One example I thought of is a latent variable of religiosity. But how do you explicitly measure religiosity? You can't put religiosity directly on a scale in your dataset; that doesn't make sense. What you can measure are variables such as how often someone goes to church each week, frequency of prayer, and self-described importance of religion.
And that is the distinction between measurable variables and the latent variable that might underlie them. A lot of the time latent variables are researcher-informed, so someone else could call this grouping religious affinity; I've just called it religiosity here. The point is we have measurable variables that are related, there's some relationship happening here, and the underlying reason is someone's inherent religiosity. I have another example involving national trends. Here I have two latent variables, economic development and social development. These are usually of interest to researchers: how is the country doing not only economically but socially? But how do you actually measure that? What are indicators for those latent variables? Some that I've listed here just as an example are things like GDP, population growth, life expectancy, and the unemployment rate. Those are concrete variables you could go out into the world and measure, and they are proxies or indications of your latent variable, economic development. Similarly for social development I've included things like literacy rate or educational attainment, community life participation, and measures of trust. These are just things I thought of to demonstrate my point. So that hopefully gives you a brief understanding of what a latent variable is and how it differs from the measurable variables you can actually observe, because remember, latent variables are the variables underlying the relationships among the things you can measure. So that's latent variables. Now we want to talk about latent class analysis. I'm realizing I'm using latent class and latent variable somewhat interchangeably, and they are a bit interchangeable, but I'll get into that more concretely on the math slides. For now you can consider them the same. Latent class analysis is a statistical technique that tries to identify those latent subgroups. Going back to the example, let's say you measured all of these variables and you want a model to group them somehow. Say you don't know what the latent variable is; you want a model to make the groupings for you, to then inform what could be underlying your measured variables. That is what LCA tries to do: it tries to make the groupings that the latent class is informing. This is analogous to factor analysis, if anyone is familiar with that, and it's also a subset of structural equation modeling, which I will not get into at all today. I found a table in the Collins and Lanza book that I thought was really helpful for putting everything into perspective, because if you're here learning about latent class analysis you have most likely also heard about latent profile analysis, or maybe latent trait analysis, or some other form of latent variable analysis. It really just comes down to what type of variables you have. Latent class analysis means that you're dealing with categorical observed variables and categorical latent variables. Compare that to factor analysis, where your observed variables are continuous and your latent variables are also continuous. (For reference, latent profile analysis is the case of continuous observed variables with a categorical latent variable, and latent trait analysis is categorical observed variables with a continuous latent variable.)
So that really is the defining characteristic of latent class analysis: you have categorical observed variables and categorical latent variables. So when would you want to use this? Latent class analysis is often used for exploratory purposes. Let's say you have a bunch of variables and you know there's probably some structure underlying them, some sort of latent variable, but you don't know which observations fit into which class best. That's what latent class analysis tries to do: it categorizes for you, using math and algorithms, to make an informed grouping in essence. You could also use this for dimension reduction, in the same way that principal component analysis or factor analysis is sometimes used: the first step would be to apply latent class analysis to create these clusters and then to model just those clusters. I'm not really going to get into that here, but it is a common use. And I keep saying clustering, so that's another thing this is used for, grouping, and it can also be combined with regression, which I don't talk about very much today. I will just say this is different from straight-up clustering, because clustering methods most often rely on covariance structures and distance measures and most often use continuous variables. So there is a slight distinction, but the main idea is the same in that LCA seeks to group, or cluster, observations together. Let me just make sure I touched on everything I wanted to say. Yes, okay. So let's dive into the math: the latent class model. We have some assumptions going into latent class analysis, and I've started with a simple example. Assume we have observed variable A, and just a reminder, in latent class analysis everything needs to be categorical, so observed variable A has J levels. And you have another variable B that has K levels. A and B don't need to be independent; in fact, the fact that they are not independent is usually what's driving the desire to do LCA. The whole point of LCA is that we want to understand whether the relationship is spurious or can be explained by a latent variable. So let X be the unobserved latent variable, and say it has T classes. The big assumption going into modeling is that variables A and B are conditionally independent given the class level of X. That will hopefully make more sense after the next slide and through the examples. But for now, what do I mean by all these classes and levels? Let's say you have data on youth in high school and you want to measure delinquency in order to understand who is more at risk of not graduating or of other negative outcomes. You have two measurable variables to start. The first is whether a student has more than five unexcused absences from school; that would be variable A, and it has just two levels, yes or no. Your second variable, B, could be how often the student drinks alcohol in a week, and the levels there could be distinctions like not at all, one to two days, three to five days, or every day. So it's a continuous quantity put onto a categorical scale. Then A is whether they skipped more than five days, yes or no, and B is the categorical measure of how much they are drinking. That's what you have measured. Underlying that, X, the unobserved latent class variable, could have three levels.
Let's say the three levels of your latent class are high risk, moderate risk, and low risk. You have these students, you have these characteristics, and you want to understand where they fall within this latent variable of delinquency risk. That hopefully puts in perspective what we mean by classes and levels and by an unobserved latent variable you want to discover based on what you measured. There is a lot of notation here. Let pi_jt^(A|X) denote the conditional probability of an observation being in level j of A given that it is in class t of X. So what does that mean? Once we assign people to a latent class, say student 1 is assigned to the high risk class, then given that student 1 is in the high risk class, what is the probability that they have had more than five unexcused absences? It's the probability, within the class you've been assigned to (here high risk, moderate risk, or low risk), of having skipped more than five days. So it's a conditional probability. Just to break the notation down: pi indicates a probability, j is the level of A (whether they have skipped or not), t is the class of X (high risk, moderate risk, low risk), and the superscript A|X is not a power, it's just notation for the probability of A given X: the probability of skipping class given your class assignment. Hopefully I have not lost any of you; I know this is a lot of notation. This slide and the next slide are heavy and then I'll move on to the demonstration. So those are our conditional probabilities. Next, let pi_t^X be the probability of being in class t of X. Again, X is our only latent variable here and it has T classes; in the example that's three classes: high risk, moderate risk, low risk. So pi_t^X is simply the probability of a student falling into high risk, moderate risk, or low risk, notwithstanding any of the characteristics; it's just the distributional breakdown of being in each class of your latent variable. Next we have the joint probability, the probability of an observation being in level j of A, level k of B, and class t of X (the slide says C; that is a typo, it should be the latent variable X). I'm not going to go too deeply into that, because it doesn't have much practical meaning on its own; it's mainly used to get to the other two probabilities. With all that in mind, the whole latent class model can be expressed as shown on the slide: we have each piece, the probability of falling into a class, the probability of each level of observable variable A once you're assigned to a class, and similarly the probability of each level of observable variable B once you're assigned to a class. And just to reiterate, one of our assumptions was the conditional independence of A and B, which means that once you're assigned to a latent class of X, the probability of skipping, for instance, and the probability of drinking are independent of each other. Let me just make sure I said everything. Yes. And as you go through, and there's one more slide with some heavy math, the main estimates of interest are going to be the conditional probabilities and the probability of being in each class.
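Written out with the notation from these slides (A with J levels, B with K levels, and the latent variable X with T classes), the conditional independence assumption and the latent class model on the next slide read:

    P(A = j, B = k | X = t) = \pi_{jt}^{A|X} \, \pi_{kt}^{B|X}

    \pi_{jkt}^{ABX} = \pi_t^{X} \, \pi_{jt}^{A|X} \, \pi_{kt}^{B|X}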
So you have all this math, you're going to go through an algorithm to derive estimates, and those two are usually your main estimates of interest. So let's keep on trucking. The algorithm behind the scenes: as just shown, the latent class model is expressed as pi_{jkt}^{ABX} = pi_t^X * pi_{jt}^{A|X} * pi_{kt}^{B|X}. So how would you actually solve for this, and what's happening behind the scenes in your programming language? These models are solved using maximum likelihood estimation to estimate the class membership probabilities, that is the pi_t^X highlighted in green, and the conditional probabilities that an observation provided a certain response given that it has been classified into a latent class, which were pi^(A|X) and pi^(B|X). The actual algorithm used to solve for these quantities is called expectation-maximization (EM). The whole reason I've made a slide of this, when usually I would skim over the details, is that when you're dealing with an iterative algorithm you can run into a few hiccups. Because you are using an iterative algorithm, you need to rely on an initial value to start; you need to supply a kickoff point from which the algorithm begins estimating, and that can cause some issues. Another issue you might run into is that the algorithm can continue to iterate indefinitely. Finding a solution easily really relies on the log likelihood being nicely concave, meaning there is a clear, identifiable maximum; but often, and you'll see this in the example, the function is not concave in places, which basically means your algorithm or your program is going to have a hard time finding a true solution. You can sometimes settle for a good-enough solution, but really, when you're dealing with these iterative algorithms, what you want is a well-behaved problem with a clear maximum. I'll just leave it at that. So when you're running these algorithms, and you'll see this in the example, you fit the model based on a specified number of classes. You might not know how many classes you need, and that's part of the exploration with your data. So you would fit a latent class analysis model over your data and you might tell your package to fit two classes. But how do you know that's really how many you should use? What you would need to do is fit this model over a number of different class counts, fitting it with two latent classes, three, up to, let's say, six latent classes, and then compare AIC or BIC (there are other metrics too), and usually this is summarized in what's called a scree or elbow plot. My point is that it takes a little bit of finagling to find what works best with your data, and the math can tell you that, but I've also included the point here that if the math is telling you that four latent classes is slightly better than three, but you know from prior research and interpretability that three classes makes more sense, that should also be a valid consideration. If the math is saying one option is only slightly better, then there is subject matter expertise that you should use, if you are able, to help guide you. Just because the math says 110 classes fits best, that might not be useful or interpretable in your research. So that's just a consideration to keep in mind. And let me make sure I check my notes on my other screen.
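For reference, the quantity that maximum likelihood estimation, via the EM algorithm just described, is climbing can be written in a standard way as the log likelihood over N observations, where a_i and b_i are observation i's levels of A and B:

    \ell = \sum_{i=1}^{N} \log \Big( \sum_{t=1}^{T} \pi_t^{X} \, \pi_{a_i t}^{A|X} \, \pi_{k = b_i, t}^{B|X} \Big)

Multiple local maxima of this function are exactly why the starting values and iteration limits discussed here can matter.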
Okay, so how do we actually implement this? I've given you a hopefully brief understanding of LCA, and I've given you the heavy math to maybe confuse you, but how would you actually implement it? You're not going to be writing the algorithm yourself on paper; these are written into packages within programming languages. In the R programming language the poLCA package is useful, and it has a function of the same name, poLCA. In SAS it's PROC LCA. In Stata we have gsem, which stands for generalized structural equation modeling; as I said on one of the earlier slides, LCA is a kind of subset of structural equation modeling, so that's how it gets grouped under generalized SEM in Stata. And I will say SPSS does not actually seem to support this, so unfortunately for the SPSS users in this conversation. I do keep seeing Mplus pop up in some of the literature regarding LCA, but I will freely admit I know nothing about Mplus; I know it exists, but I cannot help you with it. I'll just leave it at that. Today I'm going to show you examples in Stata, and I'm hoping to also show you a little bit of R. So let's jump to the fun stuff, and hopefully this will make more sense as we go through examples. The first example I've chosen uses carcinoma data. I'm going to minimize this to let you read that. Oh sorry, wrong screen that I'm minimizing, okay. Let me just pull up the data, because I find it's easier to describe it when you can see it. So we're using the carcinoma data and I've loaded it into Stata. I have a do file that we are going to walk through, but first, an overview of what the data are. I had not loaded the data yet, only opened the do file, so I'm going to load it now. I don't think there's really enough time to go into Stata basics, so if there are questions at the end about Stata syntax I'm happy to clarify. Okay, so I've just loaded the carcinoma data, and what we have here is information from seven pathologists, labeled A through G. They each classified 118 slides; notice Stata tells you we have seven variables with 118 observations, down here where my mouse is. Each pathologist was given the 118 slides and asked whether they believed cancer was present. I think it's irrelevant whether it actually was or not; this is just to look at the opinions of each of these pathologists about whether cancer was present. The whole point here is that we want to group the observations, the slides, based on how the pathologists classified them. Sorry, I'm just getting my windows set up how I like. So let's just dive into it a little bit and hopefully it starts making more sense. I'm going to exit out of this screen; that was just to get an idea of the data. And I will say each response is binary: zero means the doctor did not think carcinoma was present and one indicates they did believe carcinoma was present. So I have my do file. I'm going to make this a little bigger, would that be helpful? Okay, hopefully that helps. What I recommend any time, in any analysis with any data, is to try to get an understanding of it first, so I just run the summarize command over it. Then we have the variables, the number of observations, and the mean. The mean here is actually equivalent to the proportion of observations for which a doctor indicated carcinoma was present, that is, how many ones the doctor provided.
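A rough sketch of this first look at the data, assuming the ratings are saved locally as a Stata file with the seven pathologists' variables named A through G and coded 0/1 (the file name here is a placeholder, not necessarily the one in the do file):

    * load the carcinoma ratings and get a feel for the data
    use "carcinoma.dta", clear
    describe
    summarize A B C D E F G

The means in that summarize table are the proportions being discussed here.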
So for instance, Dr. B categorized about 67% of the observations as carcinoma, compared to Dr. F, who is a little more wary and only said that 20% of the observations were carcinoma. So again, this is just to understand the data, to make sure everything looks right, nothing is out of range, everything is either zero or one, and to check whether anything is missing. Okay, so in Stata, when you want to fit a latent class model, like I said before, you use this gsem command. And that's what you see here: we have gsem, and the first thing you see is path notation. It's not as important in LCA, or at least in the examples I'm showing here, but all you need to know is that these are our observed variables, and we have this arrow with nothing to the right of it, indicating that we want to fit a latent variable grouping these variables together. Then we have comma logit. This logit specifies that each of these variables is binary, and that might not always be the case, right? You might have, say, A Gaussian, B logit, C multinomial. I've written in the comments here that if your variables were of different types, you could specify that within each set of parentheses rather than grouped together as I have it. Again, I can group it here only because all of them are logit, so I'm just capturing that in one go rather than writing it all out. And I will note that gsem can handle other variable types. If, say, A is Gaussian, which implies A is continuous, that's when you start to stray away from strict LCA where everything is categorical; the lines get kind of blurred if one is Gaussian and the rest are logit. It's like that contingency table I showed at the beginning. Okay, but anyway, we have gsem, I've specified the observed variables for which we want to fit the latent class model, we leave the right-hand side blank because there are no covariates at this time, we've specified that each of them is logit, and this is where you specify how many classes you want to fit for your latent variable. So I specify that we have this latent class variable that I am naming Z; this could be anything, I could name it W or Q, I've named it Z here just because, and I'm saying I want to fit a latent class model with one latent class variable with two levels. Okay, so let's just fit this. I'm going to highlight it and execute, and then stuff happens in my window. So it's fitting the class model. You see a lot of output come up, and it says "not concave" and "convergence not achieved". This is a great example of why I introduced the algorithm slide, because these are serious problems that do come up when trying to fit these models. Let me just scroll back up to show you what is happening. So it's fitting the class model. First it fits the probabilities of class membership; notice this fit pretty well, there are only two classes. Then it fits an outcome model, and there is no issue there. It fiddles around with starting values, finds a set that it likes, and then takes off. Then it fits the full model, and notice it says "not concave". We want the log likelihood to be concave, because that means there's a nice answer waiting for us, a clear maximum of that concave function; "not concave" means Stata is struggling to make progress at that iteration.
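A minimal sketch of the command just walked through, assuming the same A through G variable names; Z and the class count are the choices from the walkthrough:

    * one latent class variable Z with two classes; all indicators treated as binary (logit)
    gsem (A B C D E F G <- ), logit lclass(Z 2)
    * if the indicators had mixed types, the family could instead be set per equation, e.g.
    * gsem (A <- ) (B C D E F G <- , logit), lclass(Z 2)    // A left as Gaussian, the default

Running this fit is what produces the iteration log discussed next.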
Okay, so we're seeing here that this is not concave. The algorithm took an iteration step and then decided to go back to where it was and retry, because it was just not reaching a good point; it got a little further, then it hit another problem, and the problem persisted until finally it reached the maximum number of iterations, which in Stata defaults to 300. It tells you "convergence not achieved". When your algorithm converges, that's a good thing; it means it found a solution, which should be a global optimum. You want convergence. "Convergence not achieved" implies that your algorithm did not settle on an answer. It will land on an answer, but it might not be very good, depending on what your function looks like, which we won't dive into here. So I'm just going to leave that for now; we're going to address it in a second. But what else do we have here? It goes through the iterations of the algorithm and then it finally just says, okay, we'll use what we have at iteration 300. Convergence is not achieved, but Stata has something to work with, so it moves forward with that, even though it's not ideal. Then for this fit it gives you some coefficients, which are not very important here; we don't have covariates, and I will show you an example with covariates if we get to it. So this table right here is not very informative. It's telling you the class, it's reiterating the type of model for each variable, logit, which is Bernoulli, that is, binary, and then it gives you coefficients for each of those variables. Again, this is not as important on its own, so I'm going to skip it. Then it goes through class 2; that was all for class 1, and class 2 goes through the same. Okay, I'm scrolling to the bottom: convergence not achieved, so it tells you that again. So all of that was not really the estimates of interest that I described; that was all Stata fitting the model, and it gives you basic estimates that are not the main goal here. So how can we deal with "convergence not achieved"? First we need to deal with that, or at least try to, before we can look at the estimates of interest. And also, I noted before that I've only fit this with two classes, but how do we know that a latent variable with two classes is best, and not three or four or five or six? So I have all of this code to deal with the convergence problem and also to compare an increasing number of classes. Let's just break this down. I have gsem, the same thing as above, and I'm specifying two classes, but since I ran that and it gave "convergence not achieved", I'm going to tell the program that it might run into difficulties. The option is "difficult". You're telling Stata that this is going to be a difficult maximization problem. I believe that changes how Stata takes its iterative steps when it hits a non-concave region; there's a lot of math behind the scenes that I'm not touching on today, but essentially it tells Stata to try a different stepping method when it runs into the convergence and concavity problems. And then another thing, which is not great but is available, is that you can tell Stata to just stop after a certain number of iterations. And that number could be informed by looking at the log.
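Sketched out, the re-fit being described, with the "difficult" option and an iteration cap (the choice of 50 is explained just below), might look like:

    * tell Stata the maximization is hard and stop after 50 iterations
    gsem (A B C D E F G <- ), logit lclass(Z 2) difficult iterate(50)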
So if we go back and look while it's fitting the full model, notice the log likelihood just does not change. So we can tell Stata, okay, we've seen what's going on, just stop at 50; that's going to be about the same as if you kept going, so just stop running at 50. So this is telling it to stop at 50 iterations. This is helpful not only with convergence problems but computationally: depending on the complexity of your model, it might just take a really long time to get to 300 iterations. So that's how we are dealing with it here. There are a few more options I'll explain down here, but for now I'm just telling Stata to try something else as it iterates through. I'm going to run this all at once, so it's going to fit models with two classes, three classes, four classes, and five classes, and then I'm storing each fit. So I have 'estimates store twoclass', which stores the AIC value, BIC value, and log likelihood from the model that was just run, and I'm labeling it 'twoclass' so that I can call it later. Then down here I'm recalling 'twoclass', 'threeclass', 'fourclass', and so on. Let me just run this, because it takes a little bit of time. Oh, and I've also included 'quietly' on the three-, four-, and five-class fits, as you'll see, because if you're just trying to figure out how many classes you need, after considering the convergence and computational issues, you really don't need to see all the output that Stata spits out. So I've told Stata to do it quietly: don't show me that mess of output you just showed me. Okay, so now I'm running 'estimates stats' with the list of models I've just fit, and notice it makes a nice little table for each model. I can compare AIC and BIC directly. You can also use the log likelihood, but AIC and BIC are usually sufficient; the log likelihood is just one more thing to compare. Things to keep in mind: lower AIC is better and lower BIC is better, and you should not compare across the two criteria but within each. For the log likelihood you're looking for higher values. So, larger log likelihood, smaller information criteria. And what we see here, looking at AIC for now: we have 656, then it drops to 615.41, then it goes back up slightly, and then it drops again at five classes. For BIC we have 686, it drops, then it increases, and then it increases again. Usually you don't see such discrepancies between AIC and BIC, and I'm not sure why we're getting that kind of precipitous AIC drop at five classes, but I'm going to go with three classes. That is where the initial drop happens, and we see an increase of AIC and BIC thereafter for the most part. So based on that I'm going to say a three-class model seems to be the best fit. So I go back into Stata and rerun my three-class model, because Stata needs you to rerun things for them to be in memory to then call. So I'm rerunning the model with three classes just so that it's in memory here. I'm worried I'm going to run out of time, so let me go a little quicker, because this is only one of the examples. So I want to refit the model with three classes, because that seems to be the best based on AIC and BIC. But some other things you can do when you have convergence or algorithm issues is to specify start values.
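Before moving on to start values, here is the class-count comparison just described, put together as one sketch; the store names follow the ones used in the do file:

    * fit candidate class counts, storing each set of results
    gsem (A B C D E F G <- ), logit lclass(Z 2) difficult iterate(50)
    estimates store twoclass
    quietly gsem (A B C D E F G <- ), logit lclass(Z 3) difficult iterate(50)
    estimates store threeclass
    quietly gsem (A B C D E F G <- ), logit lclass(Z 4) difficult iterate(50)
    estimates store fourclass
    quietly gsem (A B C D E F G <- ), logit lclass(Z 5) difficult iterate(50)
    estimates store fiveclass
    * side-by-side AIC/BIC table
    estimates stats twoclass threeclass fourclass fiveclass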
Usually, and I want to say this is what Stata does but I'm not 100% sure, algorithms will take a naive estimate of the probabilities, something like the observed probabilities, to inform where to start, but sometimes random start values are better. So this is an option to say, instead of using the predefined start values, use random ones: you draw five different random starting points and you set a seed so that it's reproducible. You also have this 'difficult' option here, which again tells Stata that even though you're doing random draws for the start values, you still might have to take a little more care with the iterative steps. Okay. So one thing to try is random initial values. Another thing to try is options for the EM algorithm, which goes along with the iterative steps. Let me back up a second: that would be including the option 'emopts' and telling it to iterate 30 times, again with that 'difficult' option. I'm not sure off the top of my head exactly what that fixes within the algorithm; I think it just tries more, or it might try a different optimization procedure. I don't know off the top of my head and I don't want to spend time on it. But my point is, you're going to run into errors. This is a super simple dataset, there are no missing values, and we still ran into errors. It is almost inevitable that you'll run into errors, especially in Stata, and these are just ways to try to get around them, so I've included them for demonstration. So I'm just going to run this one more time, and I've just refit an LCA model with three classes. Okay, you've fit your model, you've chosen how many classes you think is ideal, you've fit it. Now we want the estimates of interest, and that starts with 'estat lcprob'. This gives you the probabilities of class membership. Oh, it already ran. So this gives us class 1, class 2, and class 3, right, since we've specified that three classes is best. It says that about 37% of observations fall into class 1, 18% fall into class 2, and 44% fall into class 3, and if you were to observe more observations you would expect roughly the same breakdown. Okay, that's great. But the next question is, well, what comprises each class? What do the classes mean, right? That's when you want to look at the posterior probabilities, the conditional probabilities of A given X and of B given X; that's what tells you what is making up each class, the composition of each class, what characteristics define each latent class. And while this is loading, let me just note also that classes are just labels. We could have easily called class 2 class 1, or class 1 class 2; these are just labels for the groupings we're making. If I were to run this again, maybe all of these observations would end up labeled under class 2. It's not like class 1 is the best and class 2 isn't; they're just labels for the groupings that are happening, okay? And when you are able to identify them (sorry, I'm trying to make this bigger), that is when you really label them and say, okay, this is what this class represents. So I've just run 'estat lcmean', and it gives us a breakdown, within each class, of what each probability is.
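As a sketch of the two work-arounds just described and the post-estimation calls that follow (the seed value here is arbitrary and only there for reproducibility):

    * option 1: random starting values instead of the defaults
    gsem (A B C D E F G <- ), logit lclass(Z 3) startvalues(randomid, draws(5) seed(15)) difficult
    * option 2: adjust the EM start-up iterations instead
    gsem (A B C D E F G <- ), logit lclass(Z 3) emopts(iterate(30)) difficult
    * the estimates of interest after the chosen fit
    estat lcprob     // marginal probability of membership in each latent class
    estat lcmean     // within each class, the probability of a 1 on each indicator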
Okay, so backing up, what do we have? We have doctors indicating whether a slide presented carcinoma; it was a simple yes or no, 0 or 1. So within class 1, what's the probability that Dr. A classified a slide as having carcinoma? In class 1 that probability is small, and similarly for the rest of the doctors these probabilities are small. So within class 1, the probability of calling a slide carcinoma is really small. What does that tell us? It tells us that class 1 comprises the observations that are clearly not carcinoma: these are the slides where the doctors are all in agreement that carcinoma is not present. Okay, hopefully you're with me here; I've also written it here if that helps make sense of what I'm saying. Now for class 2: if a slide is in class 2, the probability of Dr. A identifying it as carcinoma is 50%. Dr. B has a 100% chance; given that a slide is in class 2, there's a 100% chance of him identifying it as carcinoma. C is 0%, so that's quite the difference, D is also really low, E is high, F is low, G is high. So what does this tell us? Class 2 seems to be the observations where there is contention. There's disagreement on what the slides present; the doctors are not in agreement about whether a slide shows carcinoma or not. And this might be interesting; this might be the group you are looking for, so you could go back to those observations and see why they are not as clear-cut as the rest. That's the kind of question that makes this interesting. Then quickly down to class 3: these probabilities are all pretty high; one is around 0.98 and another is 0.47, which is lower, but the rest are really high. So class 3 would be the slides where there is a clear indication of carcinoma; most of the doctors are in very high agreement that these are carcinoma slides. So that is 'estat lcmean': these are posterior probabilities, and this is really showing you the characteristics within each class, which is where the naming comes in. As we went through them, notice I named class 1 agreement of no cancer, class 2 disagreement, and class 3 agreement of cancer. I have assigned those labels to the classes. And just really quickly, I have down here a way to visualize this same thing. Let's see if this can run really quickly; let me make this a little bigger. I just really want to jump to the other example to show you something maybe more applicable, so while it's going through the figures, I'll give it a minute to run. And it's going. As it's going through, remember class 1 is where everyone was in agreement that it was not cancer, and class 2 is where we have disagreement: some say it's cancer, some say it's not. All right, let me give this three more seconds before I switch. And there we have the picture, without having to compare all the little numbers. Hopefully the figure makes the distinctions between the classes clear: no cancer, cancer, disagreement. And again, those are labels I am assigning based on the composition of each class. So I'm going to really quickly move over to the next example, because I think it's a little more interesting, and we're definitely not going to have time to go into detail, but as I said, this could be a weeklong seminar if we wanted it to be.
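The plotting step shown a moment ago goes by too quickly to transcribe, but since 'estat lcmean' is margins-based, one plausible way to draw that kind of class-profile figure is the following; treat it as an assumption about the approach rather than a transcription of the actual do file:

    * graph the within-class indicator probabilities as class profiles
    estat lcmean
    marginsplot, noci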
So I have this NYTD example, and I'm going to load the NYTD data. Let me jump back really quickly, because my slides explain it. NYTD surveys youth who age out of foster care services, and I'm using the outcomes survey. The outcomes survey measures well-being, financial and educational outcomes, and so on; I recommend going to our codebook website for more information. The point is there are variables in there, such as employment skills, public housing assistance, and substance abuse, that we might want to use to understand how each individual is doing. Are individuals at high risk based on their employment status or assistance needs and so on? So that is the background of the NYTD data, and I'm flying through this because there's a lot here, but it really mirrors what I was just doing. So in this first chunk of code, well, first, we just have a lot more variables, so that's why you see a whole list of variables, and they're all logistic. Notice it's the same syntax; it's 'lclass', and now I'm naming the latent variable C. I've already run this and seen that it's difficult to iterate, so I've set it up to 150 iterations, and I was going to run the class comparison to show you, but I won't, just for time, because it takes a long time. I will say this one is a little iffier; it's not as clear-cut as the other data, where three classes looked to be the best. I'm going to attribute that to the fact that this is masked data; I've played with it and made it messier than the truth, so it might just be too noisy to really get a good model fit. So, coming back to this, trying to keep track and finish within two minutes so we can get to questions: I'm fitting this model with two classes. Okay, it's done, and note that I didn't need to specify 'difficult' or 'iterate'; it found a good solution. So I'm going to get the probabilities of class membership, and then I'm going to get the estimated means for each class, the posterior probabilities, to try to understand the composition of each class. And it's running. While this is running, just a quick plug for RStudio: R does this, I think, a lot better. I'm just used to R, and I think the output is easier to understand. For those of you who are adept in R, you can re-create the carcinoma example; the data are built into that poLCA package, and this is the code to do so. This will be on video, so you can pause it here and copy the code to reproduce that. And down here is the NYTD example; again, this just replicates what I'm doing in Stata, but I'm showing you it because, notice, Stata is still running, and R does not take nearly as long, and I don't think it runs into the same convergence issues that you find with Stata. That just comes down to the default settings between the two and the way they each solve the EM algorithm; there are definitely different ways to do that. So this is running and I'll give it one more minute. But I'll just show you here that R is super fast: notice it just did a whole lot and Stata is still working. I chose three classes, or no, let's choose the two-class model, and I'm just going to plot something. So again, Stata is still running; I'm going to let that go and then we'll get to questions. But just to show you, this is saying the same thing as Stata, a little prettier, okay. So that's R.
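For reference, the NYTD fit in Stata has the same shape as the carcinoma one. A sketch might look like this, where the indicator names are placeholders standing in for the actual NYTD outcomes variables rather than the real column names:

    * two-class model over a set of binary NYTD outcome indicators (names are placeholders)
    gsem (employed employskills pubfoodassist pubhousassist homeless ///
          subabuse incarc children currenroll <- ), logit lclass(C 2)
    estat lcprob
    estat lcmean
    * larger class counts may need the options noted above: difficult iterate(150)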
Let me just jump back to Stata, because that's where we were; again, that R code is all on the video, so that's my plug for R. But just really quickly, the NYTD results. Notice we fit a model with two classes, and you're looking at the composition thereof. Notice in class 1 we have high probabilities of public food assistance, homelessness, incarceration to some extent, and having children. Some of the other probabilities are also fairly high, but these are the ones that are higher than in class 2, which is why I've highlighted them here. So class 1 might indicate people who are more at risk: they've experienced homelessness or incarceration more, and they have children that may be putting a strain on their resources. Then in class 2 we can see higher employment and being on educational aid, and, just highlighting the high-probability variables, there is one more about education, currently enrolled. So class 2 might be the people who are less at risk: they're employed and they seem to be enrolled in school. And that is where I'll wrap up, because I haven't left nearly enough time for questions, so I'm just going to stop now and put up my references. These are great references, because like I said this could go on forever, but they have great explanations and examples. I'm Sarah, here is my email, and that wraps up our summer; I'll leave it here. [Clayton Covington] Thank you, Sarah, for that very robust discussion of latent class analysis. I'm going to start off with one question that's been asked a couple of times: can you use AMOS to conduct a latent class analysis in SPSS? [Sarah Sernaker] I don't know what AMOS is. [Clayton Covington] Okay, well, we will come back to that later if we can. Next question. [Sarah Sernaker] It looks like it's related to SPSS. I really don't know what capabilities are in SPSS; I just read in a few places that it doesn't handle LCA, so I left it at that. So I really don't know what's possible in SPSS. [Clayton Covington] Okay, thank you for clarifying. There's another question asking: is it true that the minimum sample size for latent class analysis is 250 individuals? [Sarah Sernaker] I'm not sure where that 250 is coming from, or whether that's just an umbrella recommendation that any LCA should have 250. I would say I've not heard that before or seen it in the literature. Higher sample sizes are always better, but the only thing I'll say is that with smaller sample sizes you're not going to be able to fit as complex a model, and you probably shouldn't anyway, because you're going to run out of degrees of freedom. So I'd say there's no strict rule about sample size, but it is definitely going to dictate how your model turns out. [Clayton Covington] All right, so there was an earlier question about the slide where you talked about the algorithm, asking what you meant by initial values. Do these mean baseline data? [Sarah Sernaker] So I only brought this up because you might run into issues with the algorithm, and it's one place to fix them. Like I said, these are iterative algorithms, which means they're working toward a solution at each step. The algorithm has to start somewhere, so you need to tell it where to start, like a ballpark or neighborhood, to say, okay, start looking for the solution here.
And then the algorithm takes steps in various directions. It's like following a scavenger hunt: it takes one step in what seems to be the right direction and says, okay, I'm going to keep going because this seems right, and it keeps going, but it might hit a snag and back up. So it's kind of like walking around a space to find the solution, but you need to tell it where to start walking. That is what the initial values are for, and a lot of the time you don't even have to worry about it. If you noticed in R, I didn't specify any initial values and it found a solution just fine. The only reason I brought it up is that it's one thing to try if you're having those convergence issues: the algorithm might just be starting in a bad place, and if you start it somewhere else it might do better. But usually you don't have to worry about that. [Clayton Covington] Okay, so we're at time, but I will allow a few additional questions before we officially wrap up. Before we do that, I want to do a little wrap-up of my own. First of all, I want to thank all of you who participated in the 2021 NDACAN Summer Training Webinar Series. It has been a pleasure for our team to host you all and to answer all of these really rigorous and interesting questions, and we hope to continue to do that in the future. Also, just a reminder that you can keep up with the latest happenings here at NDACAN by following us on Twitter at NDACAN_CU; that's where we'll announce when the recordings of the sessions you've seen during the series go up. Once they are available they will be put up on the website, and the first place to find out will be Twitter. So that is all I have to say in closing, but I will allow a few more questions. One question asks: should the categorical variables be coded as 1 and 0 or 1 and 2? [Sarah Sernaker] I think that is going to depend on your programming language. I think Stata likes zeros and ones, but in R it shouldn't matter as long as you're defining them as factor variables, and I'm not sure off the top of my head what SAS's requirement is. [Clayton Covington] Another question asks: for the second example, from Stata, what is the definition of class 1 and class 2? [Sarah Sernaker] I'm not sure exactly where the confusion is, but that was me fitting a latent class model with two classes, so observations were assigned to either class 1 or class 2, and that's what I'm referring to here: this is the composition of class 1 and this is the composition of class 2. I apologize if that wasn't clear. [Clayton Covington] Okay. The final question we have is: how do you account for weighted variables in the model? [Sarah Sernaker] Yeah, I saw that question; it's a great question, and I wish I could give you a good answer, but I just don't have the experience and I'm not sure off the top of my head. I think there are options in Stata, specifically within the gsem command, to include weights. I'm not sure how you would deal with it in R, and I'm not as well informed on that, but I think there are ways to incorporate them in Stata. [Clayton Covington] Okay, and with that, thank you again, Sarah, for your really detailed presentation; I think a lot of people will find it really helpful, both today and in the future.
And as a reminder to our attendees, all of the sessions of the Summer Training Webinar Series will be made available online at a later time, and once they are available we will announce it as soon as we can. Again, I want to thank you all for your participation in this year's series, and we hope to see you at Office Hours and in future series. [Sarah Sernaker] Thank you, everybody. [Clayton Covington] Thank you. [voiceover] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families.