A potentially lifesaving algorithm in Allegheny County, PA

Image: New York Times

In August 2016, Allegheny County, Pennsylvania (which includes Pittsburgh) became the first US jurisdiction to use a predictive algorithm to screen every call to its child abuse and neglect hotline. In a brilliant article for the New York Times Magazine, science writer Dan Hurley clearly explains how the tool works and how it changes current practice. Hurley’s account suggests that Allegheny’s experience is a hopeful one for the county and for children nationwide.

Hurley introduces the Allegheny Family Screening Tool, an algorithm developed by leading child welfare researchers in concert with DHS policymakers. To develop the algorithm, the authors analyzed all referrals made to the county child abuse hotline between April 2010 and April 2014. For each referral, the authors combined child welfare data with data from the county jail, juvenile probation, public welfare, and behavioral health programs to develop a model predicting the risk of an adverse outcome for each child named on the referral. (A more technical description is provided by the authors here.) The end product was an algorithm that calculates a risk score between 1 and 20 for each child included in a referral.
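
The linked technical description is the authoritative account of how the model works; purely as an illustration of the general idea, the sketch below (with invented feature names and a generic model, not the county’s actual code) shows how a predicted probability of an adverse outcome can be converted into a 1-to-20 score by binning it against the distribution of risk among historical referrals.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_screening_model(history, outcomes):
    """history: table of cross-system features for past referred children;
    outcomes: 0/1 indicator that an adverse outcome followed the referral."""
    model = LogisticRegression(max_iter=1000).fit(history, outcomes)
    # Bin the training-set probabilities into 20 equal-sized groups, so a
    # score of 20 means "riskier than roughly 95% of historical referrals".
    probs = model.predict_proba(history)[:, 1]
    bin_edges = np.quantile(probs, np.linspace(0, 1, 21))
    return model, bin_edges

def risk_score(model, bin_edges, new_cases):
    """Map each new referral's predicted probability onto the 1-20 scale."""
    probs = model.predict_proba(new_cases)[:, 1]
    return np.clip(np.digitize(probs, bin_edges[1:-1]) + 1, 1, 20)
```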

The policymakers and developers chose to use the algorithm to supplement, not supplant, the clinical judgment of hotline workers. Only if the score exceeds a certain threshold does it trigger a mandatory investigation; below that level, the risk score is simply another piece of data to help the hotline worker decide whether to assign the case for investigation.
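
In rough pseudocode, the decision logic Hurley describes might look something like the following sketch; the threshold value and the function are hypothetical, chosen only to show how a mandatory screen-in floor can coexist with worker discretion.

```python
MANDATORY_THRESHOLD = 18  # assumed value for illustration; not the county's actual cutoff

def screening_decision(risk_score, worker_would_screen_in):
    """Combine the algorithm's score with the hotline worker's judgment."""
    if risk_score >= MANDATORY_THRESHOLD:
        return "screen in (mandatory investigation)"
    # Below the threshold the score is advisory: the worker decides,
    # treating the score as one more piece of information.
    return "screen in" if worker_would_screen_in else "screen out"
```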

Among the most important takeaways from Hurley’s article are the following:

  1. Before the development of the new algorithm, Allegheny County had experienced a series of tragedies in which children died after maltreatment reports had been made to the hotline but were screened out. The problem was not incompetence or poor training. Hotline workers simply cannot, within the 30 minutes to one hour allowed for the decision, review all the historical data on all family members from the numerous agencies with which they may have had contact.
  2. Evaluation data shared with the reporter show that implementation of the Allegheny Family Screening Tool resulted in more high-risk cases being screened in and more low-risk cases being screened out. Hurley provides a real case example. A teacher reported that a three-year-old child had witnessed a man dying of an overdose in her home. Department records showed numerous reports to the hotline about this family dating back to 2008, including allegations of sexual abuse, domestic violence, parental substance abuse, inadequate food, poor physical care and hygiene, and medical neglect. Nevertheless, the hotline worker was poised to screen out the case as low risk. The tool, however, calculated a risk rating of 19 out of 20, causing an investigator to go out to the home. Eventually, the mother was found to be unable to care for the children due to her continuing drug abuse, and they were placed with family members, where they are doing well.
  3. County officials were astute in awarding the contract to develop a predictive algorithm. Several other jurisdictions have gone with private companies such as Eckerd Connects and its for-profit partner Mindshare, which has a predictive analytics tool called Rapid Safety Feedback (RSF). The details of RSF are closely held by the companies, and the state of Illinois recently terminated its contract because the owners refused to share them, even after the algorithm failed to flag some children who later died. The Allegheny Family Screening Tool, by contrast, is owned by the county. Its workings are public and have been published in academic journals. Moreover, its developers, Emily Putnam-Hornstein and Rhema Vaithianathan, are acknowledged as worldwide leaders in their field, with extensive publications and experience doing similar work.
  4. County officials were also astute in developing and rolling out their model. They held public meetings before implementing the tool, giving advocates a chance to interact with the researchers and policymakers. Choosing to use the tool at the hotline stage rather than at a later step such as investigation made it less threatening, since the tool informs only the decision to investigate, not the decision to remove a child. In addition, the county commissioned an ethics review by two experts before implementing the tool. The reviewers concluded not only that the tool was ethical but that it might be unethical to fail to implement it: “It is hard to conceive of an ethical argument against use of the most accurate predictive instrument.”
  5. Many opponents of predictive analytics argue that it institutionalizes racial bias by incorporating data that is itself biased. Supporters counter that predictive algorithms reduce bias by adding an objective measure to subjective worker judgments. Preliminary data from Pittsburgh support the proponents, suggesting that the algorithm has resulted in more equal treatment of black and white families.
  6. Other jurisdictions are already emulating Allegheny County. Douglas County, Colorado, has commissioned Putnam-Hornstein and Vaithianathan to develop an algorithm, and California has contracted with them for preliminary statewide work.

Given the Allegheny County algorithm’s promising results, one cannot help wondering whether a similar algorithm should be used at later stages of a case as well. Such a tool could be very useful in helping investigators decide on the next step in a case. The proposal would of course trigger an outcry if the tool were used to decide whether to remove a child from the home. But like the Allegheny County screening tool, such an algorithm could be used to supplement clinical judgment rather than replace it. Policymakers need not set any level that would trigger a mandatory removal. They could, however, set a risk level that requires opening a case, be it out-of-home or in-home. Many children in many states have died when agencies failed to open a case despite high risk scores on existing instruments. Algorithms could also be used to monitor ongoing in-home cases, as Rapid Safety Feedback has demonstrated. Perhaps, if and when predictive algorithms are proven effective at protecting children, they will be integrated into multiple stages and decision points, like the actuarial risk assessments that many states use today.

Identifying the children most at risk of harm by their parents or guardians has been one of the knottiest problems in child welfare. Allegheny County’s experience, as portrayed in Dan Hurley’s excellent article, provides hope that emerging predictive analytics techniques can improve government’s ability to identify these most vulnerable children and keep them safe.

Predictive analytics, machine learning, and child welfare risk assessment: questions remain about Broward study


On November 30, a major child welfare publication reported on a new study, published in the respected journal Children and Youth Services Review, that tested Broward County, Florida’s child welfare decision-making model against a model derived using the newer techniques of data mining and supervised machine learning. The researchers concluded that 40% of cases referred to court for either foster care placement or intensive services could have been handled “with less intrusive options.” A close reading of this opaquely written paper, as well as conversations with two of the authors, Ira Schwartz and Peter York, reveals a pioneering effort to apply emerging data science techniques to develop a “prescriptive analytics” model that recommends the appropriate services for each child. This research is innovative and exciting, but this first attempt at deriving such a prescriptive model for child welfare has serious flaws. These very preliminary results should initiate a conversation, but they should not be used to support policy recommendations.

The authors began with a large database of 78,394 children and their complete case histories between 2010 and 2015. They merged datasets from the Broward County Sheriff’s Office, ChildNet (the local agency contracted to provide foster care and in-home services), and the Children’s Services Council, which represents community-based agencies serving lower-risk cases. For outcomes, the authors used one year of data on each child after discharge from the system; children without a full year of post-discharge data were not included. The authors thus had a large selection of hotline, investigative, and service data for the children in their database, as well as information on whether each child experienced another referral within a year.

In a nutshell, the authors applied machine learning to build a model “based on the segmentation and classification of cases at each step of the reporting, investigation, substantiation, service and outcome process.” The result was a set of groups, or clusters, of cases with similar combinations of characteristics drawn from hotline and investigative data. Each stage of the modeling process produced progressively more uniform groups. The goal was to ensure that, if these groups received different treatments, any difference in outcomes would be due to the treatment and not to some other aspect of the children or their situations. Within each group of similar children, the researchers compared those who received different interventions, namely removal from the home versus community-based prevention services. They used a technique called propensity score matching to control for differences between members of each group that might affect their outcomes. The authors used a single outcome, whether a child is re-referred to the system within a year of exit, to determine whether each intervention was successful.
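
As best I can reconstruct it, the general pattern is something like the sketch below: cluster cases on hotline and investigative features, then, within each cluster, match removed children to similar children who received community-based services and compare one-year re-referral rates. The column names, clustering method, and matching details are my own simplifying assumptions; the paper’s actual procedure is more elaborate.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def cluster_and_match(df, feature_cols, n_clusters=10, seed=0):
    """Group similar cases, then compare one-year re-referral rates for
    matched removed vs. not-removed children within each cluster."""
    df = df.copy()
    df["cluster"] = KMeans(n_clusters=n_clusters, n_init=10,
                           random_state=seed).fit_predict(df[feature_cols])
    rows = []
    for c, grp in df.groupby("cluster"):
        treated = grp[grp["removed"] == 1]
        control = grp[grp["removed"] == 0]
        if len(treated) < 5 or len(control) < 5:
            continue  # too few cases in this cluster to match

        # Propensity score: estimated probability of removal given the
        # observed hotline and investigative features.
        ps = LogisticRegression(max_iter=1000).fit(grp[feature_cols],
                                                   grp["removed"])
        t_ps = ps.predict_proba(treated[feature_cols])[:, 1].reshape(-1, 1)
        c_ps = ps.predict_proba(control[feature_cols])[:, 1].reshape(-1, 1)

        # 1-nearest-neighbor matching on the propensity score.
        _, idx = NearestNeighbors(n_neighbors=1).fit(c_ps).kneighbors(t_ps)
        matched = control.iloc[idx.ravel()]

        rows.append({"cluster": c,
                     "removed_re_referral_rate": treated["re_referred_1yr"].mean(),
                     "matched_comparison_rate": matched["re_referred_1yr"].mean()})
    return pd.DataFrame(rows)
```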

Based on this analysis, the authors concluded that many families are receiving services that are too intensive for their needs. For example, they concluded that “at least 40% of the cases that were referred to the court and to Childnet (mainly for foster care) were inappropriate based on the outcome data for children in their cluster group.” The authors then went on to claim that these “inappropriate referrals” are actually harming children. For example, cases “inappropriately” referred to court were 30% more likely to return to the system after the court referral than they would have been had the referral not been made. And cases “inappropriately” referred to ChildNet were 175% more likely to return to the system than similar cases that did not receive such a referral.

Finally, the authors present a “prescriptive” model that addresses the question, “Which services are most likely to prevent a case from having another report of abuse and/or neglect [within a year]?” This concept of “prescriptive analytics” is new to child welfare, if not to human services in general. The authors devote only two paragraphs to this model, but they note that it would result in a decline in “inappropriate referrals” to court and to ChildNet.

Even if we accept the machine learning process presented by the authors as a reasonable basis for estimating risk, several issues remain with the authors’ findings. The first is the use of one-year re-referral rates to define intervention success. Ongoing maltreatment may not be seen or reported for months or years. The authors report that 57% of the cases that received another referral did so within one year. That leaves 43% that were re-referred after a year had passed, and those cases were not counted as “failures” by the model. In addition, because the database covered only 2010 to 2015, the authors could not capture any referrals occurring after 2015, including those that have yet to happen. To the extent that cases were wrongly classified as not returning, the validity of the model is reduced.
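
A toy calculation makes the point; the cohort size below is invented, and only the 57%/43% split comes from the paper.

```python
# Hypothetical cohort of children who were ever re-referred after exit.
ever_re_referred = 1000
counted_as_failures = int(0.57 * ever_re_referred)              # re-referred within a year
counted_as_successes = ever_re_referred - counted_as_failures   # re-referred only later

print(counted_as_failures, counted_as_successes)  # 570 vs. 430
# The 430 later re-referrals are treated as successes by the one-year
# measure, understating how often the recommended intervention failed.
```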

The second problem stems from that famous social science bugaboo: unmeasured differences between groups. The authors relied entirely on hotline and investigative data on family history and characteristics. Yet many family issues may not be reflected in those data, including unknown histories of criminal behavior, mental illness, violence, or drug abuse. If an intervention appeared to harm certain children, the explanation may not be that the intervention was inappropriate. A more plausible explanation is that the matching algorithm did not assess risk as well as the social workers in the system. If the cases referred to the court were in fact those that social workers correctly identified as higher risk (even though the algorithm did not pick this up), one would expect higher rates of return to the system for these cases relative to the cases the algorithm matched with them but that were not referred to the courts. This possibility seems far more likely than the possibility that court-ordered services made parents more abusive or neglectful.

A third problem relates to the use of the child rather than the family as the unit of analysis. The family or household is the appropriate unit here: it was the parents or caregivers who perpetrated the abuse or neglect, and they are the main recipients of services. Author Peter York agreed that using the family would be more appropriate but explained that most of the data in the system were linked to the child rather than the family. Using the child as the unit of analysis means that the same parents are counted as many times as they have children in the system, which weights larger families more heavily, with whatever biases that may introduce.
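
A small, invented example shows how child-level counting shifts the weights toward larger families; the numbers and column names are hypothetical.

```python
import pandas as pd

children = pd.DataFrame({
    "family_id":   [1, 1, 1, 2],   # family 1 has three children in the system
    "re_referred": [1, 1, 1, 0],
})

# Child-level rate: family 1's outcome is counted three times.
print(children["re_referred"].mean())                              # 0.75
# Family-level rate: each family counted once.
print(children.groupby("family_id")["re_referred"].max().mean())   # 0.5
```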

Finally, it is concerning that the authors reported the proportion of children who received services that were too intensive, such as foster care, but not the proportion who received services that were not intensive enough. We all know the worst-case scenarios, in which children die or are severely injured after the system fails to respond appropriately to a report, but there are many more cases in which allegations are not substantiated or interventions are not intensive enough, and the children return to the system later, often in worse shape. Reporting one type of error but not its opposite invariably raises questions about bias.

The authors themselves cannot be accused of making too much of their findings. In the article abstract, they do not mention the specific findings about over-reliance on foster care and more intensive child welfare interventions. Rather, they argue that their findings indicate that “predictive analytics and machine learning would significantly improve the accuracy and utility of the child welfare risk assessment instrument being used.” I fervently agree with that statement. But this new approach by Schwartz et al. is qualitatively different from the predictive risk modeling algorithms currently being applied and studied by jurisdictions around the country. In particular, the authors used machine learning to identify groups with similar risks that nevertheless received different treatments, in order to assess the effectiveness of those different treatments. How well this approach will accomplish that purpose remains to be seen. This fascinating study is just the beginning of a conversation about the utility of this new approach, not an argument for reducing reliance on foster care or community services.