Using algorithms in child welfare: promise, confusion and controversy

Source: Vaithianathan et al., Allegheny Family Screening Tool v2

The use of algorithms developed through machine learning to improve human decision-making is becoming more common in child welfare and in other areas of government, such as criminal justice. Supporters present these tools as a way to reduce racial and other biases among those making decisions about how individuals will be treated. Opponents counter that algorithms will increase bias because they are developed using data from systems that are already known to exhibit racial and other disparities. Early research suggests that algorithms can identify the highest-risk children referred to child protective services while reducing racial disparities. But many questions remain about how these tools work in practice and whether their effectiveness will be limited by the mandate to avoid reinforcing racial disparities in child welfare.

The Allegheny Family Screening Tool (AFST), implemented in 2016, is the first algorithm (of a type known as a Predictive Risk Model) to be used in decisions about the screening of referrals by a child protective services hotline. When a call (known as a “referral”) comes into the hotline in Allegheny County (which includes the city of Pittsburgh), the intake worker (or hotline screener) who takes the call must decide whether to screen it in for investigation or screen it out as not relevant to child abuse or neglect. A referral that alleges abuse or severe neglect is automatically forwarded for investigation. For other calls, the screener reviews the information provided by the caller, as well as information on the family’s previous interactions with the Office of Children, Youth and Families (CYF) and other agencies. The screener also runs the AFST, which generates a risk score for each child that is used to supplement the professional judgment of the screener and their supervisor.

The AFST was developed to help hotline screeners decide whether a maltreatment referral warrants an in-person investigation, with the hope of improving the quality and consistency of screening decisions.1 The designers of the tool, leading child welfare researchers Rhema Vaithianathan and Emily Putnam-Hornstein, sought to shift the focus of screening decisions to the risk of future harm to the child rather than whether a referral meets the current definition of child maltreatment. In doing so, they sought to reduce both false negatives, or referrals that are screened out when maltreatment was present, and false positives, or referrals screened in where no maltreatment was present. The current version of the AFST uses data on all family members from past referrals and interactions with CYF as well as from the courts, jail, juvenile probation, behavioral health systems, and the child’s birth record to generate a risk score between one and 20 for each child included in a referral.2 The score represents the estimated risk that the child will experience a court-ordered removal from their home in the next two years, which serves as a proxy for serious abuse or neglect.3 Referrals with a score of 17 or higher and at least one child aged 16 or under are labeled as “high risk” and recommended to be screened in; approval from a supervisor is required to override this recommendation. Referrals with a risk score of less than 11 and no children under 12 are displayed as “low risk” and recommended for screening out, but the screener can override this recommendation without supervisory approval. For other referrals, the score is used to inform the screener’s decision, in consultation with their supervisor. The score is not seen by those who later investigate the allegations that are screened in, or by anyone else outside the intake unit.
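The protocol described above can be summarized as a simple decision rule. The sketch below is purely illustrative; the function and field names are hypothetical and are not drawn from the county’s actual software.

```python
# Hypothetical sketch of the AFST display protocol described above.
# Names are illustrative only; this is not Allegheny County's code.

def afst_recommendation(max_score: int, youngest_child_age: int) -> str:
    """Map a referral's highest child risk score (1-20) to a screening label.

    max_score: highest AFST score among the children on the referral.
    youngest_child_age: age in years of the youngest child on the referral.
    """
    if max_score >= 17 and youngest_child_age <= 16:
        # "High risk": screen-in recommended; supervisor approval
        # is required to override this recommendation.
        return "high risk - screen in (supervisor override required)"
    if max_score < 11 and youngest_child_age >= 12:
        # "Low risk": screen-out recommended; the screener may
        # override without supervisory approval.
        return "low risk - screen out (screener may override)"
    # Otherwise the score simply informs the judgment of the
    # screener and their supervisor.
    return "score shown - screener and supervisor decide"
```

Note that only the two extremes of the score distribution trigger a default recommendation; the middle of the range leaves the decision entirely to professional judgment.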

Allegheny County commissioned an independent study by Goldhaber-Fiebert and Prince (2018) of the effect of the original AFST in the 15 to 17 months following full implementation in 2016.4 That study found a “moderate” increase in “screening accuracy,” which the researchers defined for screened-in reports as whether further action (the opening of a new service case or connection with an existing case) was taken by CYF or whether there was another referral within 60 days after the referral was screened in. Screened-out referrals were deemed “accurate” if a child had no referrals for two months. The researchers found that the number of children being screened in “accurately” increased from about 358 to about 381 per month, or a monthly increase of roughly 24 children. (There are upper and lower bounds provided for all these numbers.) But part of this effect disappeared over time. The number of children being screened out “accurately” actually decreased slightly. The researchers also found that use of the algorithm brought about a halt in the downward trend of screening referrals in for investigation and no “large or consistent” differences in outcomes across racial or ethnic groups. These results were somewhat disappointing to those who were hoping for a larger impact on accuracy, but also failed to support critics’ fears that the algorithm would increase racial disparities in investigations.5

In 2020, Vaithianathan, Putnam-Hornstein et al. published a study validating the AFST by comparing risk scores to hospital injury encounters for children who had been reported to CPS. They took a large sample of 83,311 referrals for 47,305 children in Allegheny County between 2010 and 2016 (before the AFST was implemented) and calculated an AFST score for each of them. They linked these children’s records with medical records from the Children’s Hospital of Pittsburgh, the sole provider of secondary care for children in the area. The researchers found large differences in the chances of an injury encounter for children depending on their risk levels. Plotting the risk level (from one to 20) against the chance of an injury hospitalization, the researchers found “a clear association between any-cause injury encounters and risk ventile, with an increase in the gradient for those scoring 17 and higher.” The rate of an injury encounter for any cause was 14.5 per 100 for children in the highest five percent of risk, compared with 4.9 per 100 for children classified as low-risk by the algorithm, who were in the bottom half of the risk distribution. For abuse-related injury encounters, the rate for high-risk children was 2.0 per 100, compared to 0.2 per 100 for low-risk children. And for suicide, the rates for the two groups were 1.0 per thousand compared to 0.1 per thousand.6
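As a quick arithmetic check, the rate ratios implied by these figures (taking the per-100 and per-1,000 rates exactly as quoted from the study) can be computed directly:

```python
# Rate ratios between the highest-risk and low-risk groups,
# using the rates quoted from the 2020 validation study.

def rate_ratio(high_rate: float, low_rate: float) -> float:
    """Ratio of the high-risk group's rate to the low-risk group's rate."""
    return high_rate / low_rate

any_injury = rate_ratio(14.5, 4.9)  # any-cause injury, per 100
abuse = rate_ratio(2.0, 0.2)        # abuse-related injury, per 100
suicide = rate_ratio(1.0, 0.1)      # suicide, per 1,000

print(f"{any_injury:.1f}x, {abuse:.0f}x, {suicide:.0f}x")
# prints: 3.0x, 10x, 10x
```

In other words, the highest-risk five percent had roughly three times the any-cause injury rate, and ten times the abuse-related injury and suicide rates, of the low-risk half of the distribution.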

As the researchers explain in the 2020 paper, the AFST, a model that was developed to predict foster care placement, is able to predict injury harm as captured in data on medical encounters. This is particularly significant because it is harm to children that the designers really wanted to predict, not placement in foster care, which is only a proxy for such harm. As Dee Wilson stated in his March 2023 commentary, “the 5% of highest risk children had an any-cause injury rate almost three times higher than 50% of the lowest risk children and a rate of abusive injury and self-harm and suicide 10 times the rate of the lowest risk children! AFST is a powerful algorithm when applied to one of the most important safety outcomes in child protection.” The lack of attention to this result by the media and child welfare leaders is disappointing. Whether this is due to poor communication by the authors and others, the complexity of the issue, or the unwillingness of the child welfare establishment to receive any information suggesting the utility of predictive risk modeling is unknown.

In an analysis that has not yet been published, Prindle et al.7 built a predictive risk model for San Diego County that was based on CPS data alone, in the absence of a data warehouse. They found a similarly strong relationship between risk scores calculated by the model and hospital encounters due to child maltreatment. Specifically, they found that “children classified by the PRM in the top 10% of risk of future foster care placement had rates of medical encounters for official maltreatment roughly 5 times those of children classified in the bottom 50% of risk.”

Rittenhouse, Putnam-Hornstein and Vaithianathan (2022), in an article that is currently undergoing peer review, report finding that among all referrals, the AFST had no significant effect on racial disparities in screening decisions. And among the referrals with the highest risk scores, the AFST significantly reduced Black-White disparities. That result suggests that the high risk protocol for these referrals (requiring an investigation unless the supervisor agrees that it is not needed) plays a role in reducing racial disparities. The researchers also found that the AFST reduced Black-White disparities in case openings and home removal rates for investigated referrals. The reason for this is not clear. The authors speculate that the reduction in screening disparities among the highest-risk group might have played a role, or perhaps that within the lower-risk score groups, screeners might be shifting towards screening in Black and White children with similar risks of foster care placement.8

Hao-Fei Cheng et al. (2022), using the data from the original evaluation, compared the results of the AFST reported in the initial evaluation with the results that would have been obtained by the use of the algorithm alone, without input from screeners and their supervisors. That data, as already discussed, showed that racial disparities in screening did not increase with the use of the AFST. But Cheng et al. found that workers’ decisions reduced the disparities in screen-in rates for Black and White children from 20 percent based on the recommendations of the algorithm alone to nine percent with the workers’ input. This is a strictly theoretical result, since the algorithm was never used nor meant to be used without worker judgment. From sitting with workers as they discussed their cases and interviewing them about their use of the tool, the researchers concluded that screeners adjusted for what they saw as limitations of the AFST (such as the failure to consider the nature of the original referral) and that some consciously tried to reduce racial disparities. The researchers also found that workers’ judgments, while producing lower disparities, were also less accurate than the algorithm’s recommendations. It is not surprising that accuracy is associated with greater racial disparities; evidence indicates that the incidence of maltreatment is considerably higher among Black children than White children due to the disparities in their social and economic characteristics, which in turn reflect America’s history of slavery and racism.

Grimon and Mills (2022) studied the use in Colorado of an algorithm that was similar to the AFST but used only child welfare system data. They ran a randomized trial to compare decisions made by hotline teams that had access to the tool and those that did not. The study found that “giving workers access to the tool reduced child injury hospitalizations by 32 percent” and “considerably” narrowed racial disparities. Surprisingly, though, teams with access to the tool were more likely to choose to investigate children predicted to be low-risk, and less likely to refer for investigation those considered to be high-risk, than workers without access to the tool. Based on text analysis of discussion notes, the authors speculated that access to the tool might have allowed teams to focus on other features of the referral that are not included in the algorithm, such as the nature of the allegation itself. This counterintuitive impact on teams’ decisions is confusing and even disconcerting, since the entire purpose of the tool is to identify the children at highest risk.

Taken together, the studies of algorithmic tools used in the screening of child maltreatment reports show that these tools by themselves are very good at assessing risk. When the tools are actually implemented by human beings, the results are more confusing and we have fewer studies on which to rely. The initial evaluation of the AFST shows a modest improvement in screening accuracy. The results of the trial of a similar tool in Colorado suggest that it achieved predictive success by doing the opposite of what was intended. Studies also find that the tools in practice do not increase racial disparities and may even decrease them. One study, however, suggests that there may be a tradeoff between accuracy and the reduction of disparities, with workers disregarding the algorithm’s recommendations in order to reduce disparities at the expense of accuracy.

In addition to Allegheny County, one county in Colorado (Douglas County) is using an algorithm developed by the team of Putnam-Hornstein and Vaithianathan and another (Larimer County) is currently testing such a tool. Los Angeles County is piloting a Risk Stratification Tool designed by the same team that is being used to support supervisors in their management of investigations that are already open. It is designed to “identify investigations that may not have immediate safety concerns, but are at risk of future system involvement.” These investigations are recommended for “enhanced support.” The pilot was implemented after three high-profile deaths from abuse of children whose families had had numerous interactions with the county’s child welfare agency, in the hope of preventing further such tragedies.

Unfortunately, media outlets such as the Associated Press and the Los Angeles Times have published articles that are replete with misinformation, ignoring the promising research findings and the confusing ones as well. Both of these outlets misrepresented the study by Cheng et al., suggesting that the AFST increased racial disparities in screening. In its latest piece, the AP questioned the idea of screening in parents with mental illness, cognitive disabilities, or any “factors that parents cannot control.” But whether or not parents can control a factor has nothing to do with its relevance to the risk to a child. These biased accounts by the press, as well as by organizations like the American Civil Liberties Union, may be having an impact on government actions. Oregon stopped using an algorithm to help make screening decisions, a decision for which the AP appears eager to take credit. The AP and the PBS NewsHour, along with other outlets, have also reported that the Justice Department is investigating Allegheny County’s use of the AFST to determine whether it discriminates against people with disabilities or other protected groups.

Early research suggests that algorithmic tools used in child welfare have the potential to identify the children who most need protection. In practice, they seem to be capable of improving the accuracy of screening decisions without increasing racial disparities. But whether the kind of striking accuracy obtained by using the algorithms alone can be achieved without actually increasing racial disparities, given the underlying differential rates of abuse and neglect, is unknown. In the current climate, which values eliminating racial disparities over the protection of children (and especially Black children), it is clear that such a tradeoff will not be considered.


  1. See Rittenhouse, K., Putnam-Hornstein, E. & Vaithianathan, R., “Algorithms, Humans and Racial Disparities in Child Protective Services: Evidence from the Allegheny Family Screening Tool,” (2022), available in full at, for a fuller description of the screening process.
  2. These are available from Allegheny County’s unique Data Warehouse, which brings together data from a wide variety of sources.
  3. As the developers explained in a paper describing their methodology, a proxy is needed because there was no practical way to measure actual harm to children and use it to develop the algorithm; abuse and neglect data were not available and the number of adverse events like fatalities and near-fatalities would be too small.
  4. See Jeremy D. Goldhaber-Fiebert, PhD and Lea Prince, PhD, Impact Evaluation of a Predictive Risk Modeling Tool for Allegheny County’s Child Welfare Office, March 20, 2019, available from
  5. Putnam-Hornstein contends that the results were more promising than they appear to a lay audience. She contends that the algorithm’s ability to achieve the same accuracy with (messy) real-time data as it obtained with (cleaned) historical research data was a victory in itself. She emphasizes that the size of the effect was reduced by the practices surrounding the model rather than the algorithm itself.
  6. There were no differences in rates of cancer encounters by risk level, which were assessed as a “placebo.”
  7. John Prindle, et al, “Validating a Predictive Risk Model for Child Abuse and Neglect using Medical Encounter Data.” Unpublished paper provided by Emily Putnam-Hornstein, March 25, 2023.
  8. Email from Katherine Rittenhouse to author, March 22, 2023.
