A fundamental question raised by the increasing use of machine learning (ML) is quickly becoming one of the biggest challenges facing data-driven organizations, data scientists, and legal personnel around the world. This challenge arises in various forms and has been described in various ways by practitioners and academics alike, but every variant relates to the same basic ability: asserting a causal connection between the inputs to a model and how that input data affects the model's output.
According to Bain & Company, investments in automation in the US alone will approach $8 trillion in the coming years, many premised on recent advances in ML. But these advances have far outpaced the legal and ethical frameworks for managing this technology. There is simply no commonly agreed-upon framework for governing the risks — legal, reputational, ethical, and more — associated with ML.
This post aims to provide a template for effectively managing this risk in practice: a framework that lawyers, compliance personnel, data scientists, and engineers can use to safely create, deploy, and maintain ML models, and that enables effective communication between these distinct organizational perspectives. The ultimate aim is to enable data science and compliance teams to create better, more accurate, and more compliant ML models.
Key Objectives & Three Lines of Defense
Projects that involve ML will be on the strongest footing with clear objectives from the start. To that end, all ML projects should begin with clearly documented initial objectives and underlying assumptions. These objectives should also include major desired and undesired outcomes and should be circulated amongst all key stakeholders. Data scientists, for example, might be best positioned to describe key desired outcomes, while legal personnel might describe specific undesired outcomes that could give rise to legal liability. Such outcomes, including clear boundaries for appropriate use cases, should be made obvious from the outset of any ML project. Additionally, expected consumers of the model — from individuals to systems that employ its recommendations — should be clearly specified as well.
Once the overall objectives are clear, the three “lines of defense” should be clearly set forth. Lines of defense refer to the roles and responsibilities of data scientists and others involved in the process of creating, deploying, and auditing ML. One example is “effective challenge”: critical review of the model throughout its lifecycle by multiple parties, which must remain distinct from model development itself. The ultimate goal of these measures is to develop processes that direct multiple tiers of personnel to assess models and ensure their safety and security over time. Broadly speaking, the first line is focused on the development and testing of models, the second line on model validation and legal and data review, and the third line on periodic auditing over time. Lines of defense should be composed of the following five roles:
- Data Owners: Responsible for the data used by the models, often referred to as “database administrators,” “data engineers,” or “data stewards.”
- Data Scientists: Create and maintain models.
- Domain Experts: Possess subject matter expertise about the problem the model is being used to solve, also known as “business owners.”
- Validators: Review and approve the work created by both data owners and data scientists, with a focus on technical accuracy. Oftentimes, validators are data scientists who are not associated with the specific model or project at hand.
- Governance Personnel: Review and approve the work created by both data owners and data scientists, with a focus on legal risk.
Some organizations rely on model governance committees — which represent a range of stakeholders impacted by the deployment of a particular model — to ensure members of each above group perform their responsibilities, and that appropriate lines of defense are put in place before any model is deployed. While helpful, such review boards may also stand in the way of efficient and scalable production. As a result, executive-led model review boards should shift their focus to developing and implementing processes surrounding the roles and responsibilities of each above group. These boards should formulate and review such processes before they are carried out and in periodic post-hoc audits, rather than individually reviewing each model before deployment.
Critically, these recommendations should be implemented in varying degrees, consistent with the overall risk associated with each model. Every model has unforeseen risks, but some deployments are more likely to demonstrate bias and result in adverse consequences than others. As a result, it's recommended that the depth, intensity, and frequency of review factor in characteristics including the model’s intended use and any restrictions on use (such as consumer opt-out requirements), the model’s potential impact on individual rights, the maturity of the model, the quality of the training data, the level of interpretability, and the predicted quality of testing and review.
Focusing On The Data Input
Once proper roles and processes have been put in place, there is no more important aspect of risk management than understanding the data being used by the model, both during training and deployment. In practice, maintaining this data infrastructure, the pipeline from the data to the model, is one of the most critical, and also one of the most overlooked, aspects of governing ML. Broadly speaking, effective risk management of the underlying data should build upon the following recommendations:
- Document Model Requirements: All models have requirements, from the freshness of the data, to the specific features required, to the intended uses, and more, all of which can impact model performance and need to be clearly documented. This enables validators to properly review each project and ensures that models can be maintained over time and across personnel. Similarly, data dependencies will inevitably exist in surrounding systems that feed data into the model; where these dependencies exist, they should be documented and monitored. Additionally, documentation should include a discussion of where personally identifiable information is included and why, and how that data has been protected (through encryption, hashing, or otherwise), along with the traceability of that data.
- Data Quality Assessment: Understanding the quality of data fed into a model is a key component of model risk and should include an analysis of validity, accuracy, completeness, timeliness, availability, reproducibility, consistency, and provenance. Many risk management frameworks rely on the so-called “traffic light system” for this type of assessment, which uses red, amber, and green colors to create a visual dashboard representing such assessments.
- Encapsulate The Model: Separating the model from the underlying infrastructure allows for vigorous testing of the model itself and the surrounding processes. To that end, each step — from configuration, to feature extraction, to serving infrastructure, and more — should be clearly documented, and clearly encapsulated, so that debugging and updates can occur without too much complexity. Typically, this complexity accrues with time over the deployment cycle of a model and is one of the greatest sources of risk in using ML.
- Underlying Data Monitoring: Input data should be monitored to detect “data drift,” in which production data differs from training data, with an emphasis on how such drift might impact model performance. Data used to train the model should be statistically represented, and data ingested during deployment should be compared against this representation. Thorough leave-one-feature-out evaluations of the model — which can highlight the most determinative features in the underlying data — should also be performed. These evaluations can be used to understand whether specific features in the data should be monitored with extra care, along with potentially underutilized features, which the model may not need to ingest.
- Make Alerts Actionable: Monitoring underlying data allows for the detection of potential undesired changes in model behavior, but monitoring is only as useful as the alert system behind it. It's recommended that each alert notify both the data owner and the data scientists in the first line of defense, and that all alerts be saved for logging purposes so that second- and third-line reviewers can audit how alerts were generated and how they were responded to over time.
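The “traffic light” quality assessment described above can be sketched in a few lines. This is a minimal illustration, not a standard implementation: the quality dimensions, scores, and red/amber/green thresholds below are all assumptions that each organization would calibrate for itself.

```python
# Sketch of a "traffic light" data quality dashboard: each quality
# dimension is scored in [0, 1] and mapped to red/amber/green.
# Thresholds are illustrative, not a standard.
def traffic_light(score, amber=0.90, green=0.98):
    if score >= green:
        return "green"
    if score >= amber:
        return "amber"
    return "red"

def assess_quality(dimension_scores):
    """dimension_scores: dict of quality dimension -> score in [0, 1]."""
    return {dim: traffic_light(score) for dim, score in dimension_scores.items()}

report = assess_quality({
    "completeness": 0.995,   # share of non-null values
    "timeliness": 0.93,      # share of records within the freshness window
    "validity": 0.81,        # share passing schema/range checks
})
print(report)
# {'completeness': 'green', 'timeliness': 'amber', 'validity': 'red'}
```

A dashboard built on such a report gives validators and governance personnel an at-a-glance view of data health without requiring them to read raw statistics.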
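The data-drift monitoring recommended above can start with something as simple as comparing production feature values against a statistical profile captured at training time. The sketch below uses a z-score on the feature mean with an illustrative threshold; real pipelines often use richer tests (for example, population stability index or two-sample tests), and all the numbers here are assumptions for demonstration.

```python
import statistics

# Minimal drift check sketch: profile a feature at training time
# (mean and standard deviation), then flag production windows whose
# mean shifts by more than an illustrative number of standard errors.
def training_profile(values):
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}

def drifted(profile, production_values, z_threshold=3.0):
    n = len(production_values)
    prod_mean = statistics.fmean(production_values)
    stderr = profile["stdev"] / (n ** 0.5)
    z = abs(prod_mean - profile["mean"]) / stderr
    return z > z_threshold

profile = training_profile([10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 10.1])
print(drifted(profile, [10.1, 9.9, 10.0, 10.2]))   # stable window: False
print(drifted(profile, [14.8, 15.2, 15.0, 14.9]))  # shifted window: True
```

Running a check like this per feature also supports the leave-one-feature-out analysis mentioned above: features whose drift most affects model performance are the ones to monitor with extra care.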
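An actionable-alert pattern along the lines described above might look like the following sketch. The recipient roles, alert fields, and `notify` placeholder are illustrative assumptions; the key property is that every alert is both routed to the first line of defense and retained, together with the responses to it, for second- and third-line audit.

```python
import time

# Sketch of an actionable-alert pattern: every alert is routed to the
# first-line owners AND appended to an audit log that second- and
# third-line reviewers can replay later.
class AlertManager:
    def __init__(self):
        self.audit_log = []

    def notify(self, recipient, alert):
        # placeholder for a real email/pager/chat integration
        print(f"notify {recipient}: {alert['message']}")

    def raise_alert(self, model, message, severity="warning"):
        alert = {"model": model, "message": message,
                 "severity": severity, "ts": time.time(), "responses": []}
        for role in ("data_owner", "data_scientist"):   # first line of defense
            self.notify(role, alert)
        self.audit_log.append(alert)                    # retained for audit
        return alert

    def record_response(self, alert, action):
        alert["responses"].append({"action": action, "ts": time.time()})

mgr = AlertManager()
a = mgr.raise_alert("credit_risk_v2", "feature 'income' drifted beyond threshold")
mgr.record_response(a, "retrained on refreshed dataset")
```

Persisting responses alongside the alerts themselves is what lets third-line auditors reconstruct not just what went wrong, but how the organization reacted.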
Use Model Output Data As A Window Into Your Model
Understanding the outputs of a model — both during training and once in deployment — is critical to monitoring its health and any associated risks. To that end, it’s recommended that data owners, data scientists, validators, and governance personnel:
- Expose Biases: Data can inaccurately represent the real world, such as when a dataset omits or isolates a fraction of the population in a systematic way. Data can also reflect socially derived artifacts in ways that are detrimental to particular groups. Removing bias from a model is therefore not always practical, but teams should seek to quantify that bias and, where possible, minimize it. For data on human subjects, it may be possible to validate outputs by cross-referencing privately held datasets with public information, such as data from a national statistics bureau. Where such validation is not feasible, policies applied to the data may need to restrict sensitive features (such as race or gender), and output analysis should be performed to detect potential proxies for those features (such as zip codes). Perturbing sensitive features in the input data and observing the resulting model output can reveal how heavily the model relies on those features, and can help detect other features acting as proxies. In practice, detecting bias calls for a mixture of data analysis focused on both model inputs and outputs, and evaluation for bias should occur at all stages of model design and implementation, and throughout each line of defense.
- Continuous Monitoring: The model’s output should be statistically represented, just like the underlying training and deployment data the models ingest. This will require a clear understanding of where each model’s decisions are stored and establishing a statistical “ground truth” of correct behavior during training. In some cases, these representations will enable anomalies and model misbehavior to be uncovered in a timely manner. These representations will also help detect whether the input data has strayed from the training data, and can indicate when a model should be retrained on a refreshed dataset. The full impact of these methods will vary — depending, for example, on whether the model continues to train during deployment, among many other factors — but they will enable quicker risk assessment, debugging, and more meaningful alerts.
- Detect Feedback Loops: Feedback loops occur when a model’s actions influence the data it uses to update its parameters. This could occur, for example, when a content selection system and an ad-selection system exist on the same page, but do not share parameters and were not jointly trained. The two selection systems can influence one another over time, especially if both are continually updating their internal parameters. Detecting such feedback loops can be challenging and time-consuming. Organizations deploying multiple models that might interact with each other over time should pay particular attention to this phenomenon when monitoring model output.
- Document All Testing: All such analysis and testing, especially testing focused on bias within the model, should be clearly documented — both to serve as proof of attempts to minimize or avoid undesired outcomes, and to help members of the second and third lines of defense evaluate and understand the project’s development and potential risks. Testing documentation should specify who conducted the testing, the nature of the tests, the review and response process, and delineate the stages at which testing occurred. Critically, all such documentation should be easily available to every member of the first, second, and third line of defense. Making this documentation easily accessible will help ensure that testing is thorough and will enable everyone involved in the model’s deployment to clearly understand the associated risks.
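The perturbation approach to exposing bias described above can be sketched as follows: flip a sensitive feature across its values and measure how often the model's decision changes. The toy model, feature names, and data below are purely illustrative assumptions; a real test would run against the actual model and a representative dataset.

```python
# Sketch of a perturbation test for reliance on a sensitive feature:
# flip the feature across its possible values and count how often the
# model's decision changes. `model` is any callable on a feature dict.
def toy_model(row):
    # intentionally (and improperly) sensitive to 'group', for demonstration
    return 1 if row["score"] + (5 if row["group"] == "A" else 0) > 52 else 0

def perturbation_flip_rate(model, rows, feature, values):
    flips = 0
    for row in rows:
        outputs = {model({**row, feature: v}) for v in values}
        if len(outputs) > 1:   # the decision depends on the sensitive feature
            flips += 1
    return flips / len(rows)

rows = [{"score": s, "group": "A"} for s in (40, 50, 51, 60)]
rate = perturbation_flip_rate(toy_model, rows, "group", ["A", "B"])
print(rate)  # 0.5 -> decisions for half the rows depend on 'group'
```

The same harness can be pointed at candidate proxy features: a high flip rate on a nominally neutral feature is a signal that it may be standing in for a sensitive one.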
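The continuous output monitoring described above can be sketched by profiling the distribution of the model's decisions at training time and flagging production windows that depart from it. The class labels and the tolerance value below are illustrative assumptions, not a standard.

```python
from collections import Counter

# Sketch of output monitoring: capture the class distribution of model
# decisions during training as a baseline, then flag production windows
# whose distribution departs from it by more than a chosen tolerance.
def class_distribution(decisions):
    counts = Counter(decisions)
    total = len(decisions)
    return {cls: n / total for cls, n in counts.items()}

def output_shifted(baseline, production_decisions, tolerance=0.15):
    prod = class_distribution(production_decisions)
    classes = set(baseline) | set(prod)
    return any(abs(baseline.get(c, 0) - prod.get(c, 0)) > tolerance
               for c in classes)

baseline = class_distribution(["approve"] * 70 + ["deny"] * 30)
print(output_shifted(baseline, ["approve"] * 65 + ["deny"] * 35))  # False
print(output_shifted(baseline, ["approve"] * 30 + ["deny"] * 70))  # True
```

A shift flagged by a check like this is a natural trigger for the retraining decision mentioned above, and for an alert routed through the same audited channel as the input-data alerts.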
As with the above recommendations on underlying data shift, actionable alerts should also be a priority in monitoring the model’s output. It is critical that these alerts are received by the right personnel, and that such alerts be saved for auditing purposes.
Conclusion
Effective ML risk management is a continuous process. While this post has been focused on the deployment of an individual model, multiple models may be deployed at once in practice, or the same team may be responsible for multiple models in production, all in various stages. As such, it is critical to have a model inventory that’s easily accessible to all relevant personnel. Changes to models or underlying data or infrastructure, which commonly occur over time, should also be easily discoverable. Some changes should generate specific alerts, as discussed above.
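A model inventory of the kind described above can start as a simple registry recording each model's owner, stage, and change history. The fields and stage names below are illustrative assumptions; production inventories typically live in a shared database or an MLOps platform rather than in memory.

```python
# Sketch of a minimal model inventory: a central registry of models
# with their owner, lifecycle stage, and a discoverable change log.
class ModelInventory:
    def __init__(self):
        self.models = {}

    def register(self, name, owner, stage="development"):
        self.models[name] = {"owner": owner, "stage": stage, "changes": []}

    def log_change(self, name, description):
        # changes to models, data, or infrastructure should be discoverable
        self.models[name]["changes"].append(description)

    def in_stage(self, stage):
        return [n for n, m in self.models.items() if m["stage"] == stage]

inv = ModelInventory()
inv.register("credit_risk_v2", owner="team-risk", stage="production")
inv.register("churn_v1", owner="team-growth", stage="validation")
inv.log_change("credit_risk_v2", "retrained on refreshed dataset")
print(inv.in_stage("production"))  # ['credit_risk_v2']
```

Even this minimal structure answers the two questions reviewers most often ask: which models are in production, and what has changed since the last audit.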
There is no point in time in the process of creating, testing, deploying, and auditing production ML where a model can be “certified” as being free from risk. There are, however, a host of methods to thoroughly document and monitor ML throughout its lifecycle to keep risk manageable and to enable organizations to respond to fluctuations in the factors that affect this risk.
To be successful, organizations will need to ensure all internal stakeholders are aware and engaged throughout the entire lifecycle of the model.
Thank you for reading my post.