AI Models, Like Children, Require Supervision

In January 2018, Google Photos inadvertently created an image that went viral on Reddit. Google Photos has an artificial intelligence (AI) feature that merges pictures with similar backgrounds to create a panoramic view. Unfortunately, the algorithm merged two pictures of mountains in Banff, Alberta, with a third that contained a skier posing for a shot in the same location. The new image showed two mountain ranges with the skier's head jutting out as a large peak between them.

In July 2018, the medical news outlet STAT published a story detailing how IBM Watson Health's AI application for recommending cancer treatments was giving incorrect, and even potentially fatal, guidance to doctors. Three months later, in October, Reuters reported that Amazon had scrapped an AI tool designed to filter employment applications and make hiring recommendations. The company discovered that the tool penalized applications from which it could be inferred that the applicant was female. Amazon's model had been trained on applications submitted over a 10-year period, during which the applicants were predominantly male.

IBM has been a global leader in AI research for decades. In recent years, Amazon and Google have joined Big Blue in producing cutting-edge applications for this exciting technology, yet all three had notable failures in 2018. Failure is a valuable part of the learning process for humans and machines alike; however, most companies wishing to deploy AI solutions have a limited capacity to absorb failure on a large scale.

AI is the buzzword of the day, and it is difficult for business leaders to go anywhere without being bombarded by the topic. In fact, more than one attendee I spoke with at the recent AFSA Vehicle Finance Conference complained of AI/ML fatigue. As lenders begin to experiment with these techniques, they will learn that models, much like young children, require supervision.

Supervised versus unsupervised learning
Computer scientists define supervised learning as the case where observations are labeled (e.g., good, bad, won, lost) and the machine is used to train a model to identify the proper classification using various data inputs. Unsupervised learning refers to the case where observations are not classified ahead of time, and the machine is used to uncover hidden relationships in the data. Statisticians would frame this as the distinction between modeling a designated response variable (as in a linear regression model) and multivariate analysis without one (such as principal components or cluster analysis).
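To make the distinction concrete, here is a minimal sketch in Python using scikit-learn on synthetic data (the two features and the good/bad labels are invented for illustration): a classifier trained on labeled outcomes versus a clustering routine left to find structure on its own.

```python
# Supervised vs. unsupervised learning -- a minimal sketch on synthetic
# data; the two features and the good/bad labels are made up.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # two hypothetical application attributes

# Supervised: every observation carries a label (1 = bad, 0 = good),
# and the machine learns to map inputs to that classification.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)
clf = LogisticRegression().fit(X, y)
print("P(bad) for one new applicant:", clf.predict_proba([[0.2, -1.0]])[0, 1])

# Unsupervised: no labels at all -- the machine is asked to uncover
# hidden structure, here by grouping similar observations.
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Segment sizes:", np.bincount(segments))
```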

When I refer to supervision in this article, I mean something quite different. Whether a lender uses a veteran statistician or a hyper-sophisticated AI algorithm, there are substantial risks in building models apart from a deep knowledge of the environment in which the model must operate. Nearly every catastrophic modeling failure can be traced back to a lack of contextual understanding on the part of those who set up the model.

To illustrate the importance of context, I recommend you visit the following website (it is very entertaining): https://pdos.csail.mit.edu/archive/scigen/.

The program on this website was developed in 2005 by computer science graduate students at MIT. It uses a context-free grammar to randomly generate fake research papers, complete with illustrations, footnotes and citations. Anyone may go to the website, enter the names of the authors and hit submit. Some of the papers produced this way have actually been accepted at conferences. They look legitimate to the untrained eye and appear quite sophisticated, yet they are complete and utter nonsense. Unfortunately, I have seen lenders fall for similarly clever-looking tools that are little more than window dressing.

The need for supervision
For those in auto finance, particularly where origination credit risk models are concerned, the supervision of an experienced lender over model development is imperative to avoid the problems inherent in many black-box solutions. The reason is that the machine takes whatever it finds in the data at face value. If tomorrow's data does not look like yesterday's, the models may perform very poorly.

What makes this even more dangerous for auto lenders is that defaults do not start piling up until about nine to 12 months after origination. It may take more than a year for a lender to observe a shift in credit performance, by which time it could have many toxic loans coming through the pipeline. To mitigate this risk, model development should be supervised by people who understand how the data came to be in the first place.
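One common way experienced lenders watch for this lag is a static-pool (vintage) view of cumulative defaults, comparing each origination cohort at the same age rather than on the calendar. A minimal sketch, with invented numbers and hypothetical column names:

```python
# Static-pool (vintage) view of cumulative default rates -- invented
# numbers for illustration; column names are hypothetical.
import pandas as pd

perf = pd.DataFrame({
    "vintage":         ["2017Q1"] * 4 + ["2018Q1"] * 4,
    "months_on_book":  [3, 6, 9, 12] * 2,
    "cum_default_pct": [0.2, 0.9, 2.1, 3.4,      # seasoned pool
                        0.4, 1.7, None, None],   # newer pool, still maturing
})

# One column per vintage, one row per age, so cohorts are compared
# at the same months on book rather than by calendar date.
curve = perf.pivot(index="months_on_book", columns="vintage", values="cum_default_pct")
print(curve)
# The 2018Q1 pool is already running roughly twice the 2017Q1 pool at the
# same age, but that shift only becomes visible once the pool has seasoned.
```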

AI models are essentially trying to find patterns in the data that may be used to make decisions. Many applications involve data that is very concrete, such as the color and position of traffic lights: to a driverless vehicle's AI, the top light is always red and always means stop. Credit data, on the other hand, is somewhat fuzzier. The relationship of an auto loan default to the number of revolving tradelines, for example, varies based on the quality of the application flow from dealers, where we are in the credit cycle and the financial stability of the applicant (which may not be fully known at application time).

Each of the failures identified at the beginning of this article is related to incorrect correlations learned in the training of those AI models. There are three key correlation errors to be on guard for with auto loan data:

• False correlations – Model inputs often appear to be correlated with performance when, in reality, they are not. For example, consider a lender that only allows high loan-to-value (LTV) ratios for applicants with the highest credit scores, and severely limits advance rates for applicants with low credit scores. That lender's historical data will show that high LTVs are associated with the best credit performance and vice versa. An unguided machine would reward future applicants for riskier structure and punish those bringing equity into the deal. The correlation with performance in this example was coincidental, not causal; LTV did not cause the good or bad historical performance, it was merely assigned after credit had already been evaluated (see the sketch after this list). With traditional scoring methodologies, nonsense trends are fairly easy to spot, but neural networks, for example, can produce highly nonlinear and complex interactions that make false correlations very difficult to detect.
• Dissolvable correlations – Seasoned credit managers know how customers, dealers and even your own employees can put one over on you in the name of getting a deal done. Unfortunately, the machine does not know this. Changes to buying patterns will quickly be discovered by the people involved in the process, and can cause changes in behavior that lead to poor performance. This is particularly true of application and loan structure factors that are easy to manipulate. A lender's historical data may show strong correlations to performance for down payment, LTV, time on job and other application factors, but if those factors are allowed too much influence (even if justified by historical performance), the lender becomes severely exposed to those who would game the system, causing formerly strong correlations to dissolve rapidly.
• Stochastic correlations – In 2012, I wrote an article about the dangers of redeveloping credit scoring models using originations data and subsequent performance from the 2009 to 2010 period, because many lenders were looking at retooling their credit models in a post-recession environment. Credit performance is counter-cyclical on a static-pool basis: loans originated during a market contraction demonstrate the best performance, while loans originated during high-growth periods demonstrate the worst. Many lenders in the mid-500 credit score range were experiencing annualized net losses of 4% to 5% in 2010. Just a few years later, those same lenders were posting losses of 10% to 12% in the same credit niche. Credit is a moving target, and the relationship of various inputs to default shifts depending on where we are in the credit cycle. AI models tend to treat inputs as deterministic, but nearly all credit predictors have a strong stochastic (random) element that complicates prediction. Oversight of these models requires individuals who understand how economic factors, competition, regulation and other forces interact to shift credit performance.
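The false-correlation example above is easy to reproduce. In the following sketch (all numbers are synthetic), default risk depends only on credit score, yet because policy grants high LTVs only to strong applicants, a naive pattern-finder would conclude that high LTV is a marker of safety:

```python
# Reproducing the policy-induced "false correlation" between LTV and
# default described above -- all numbers are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
score = rng.normal(600, 50, n)

# Policy: only the strongest applicants are granted high LTVs.
ltv = np.where(score > 650, rng.uniform(110, 130, n), rng.uniform(80, 100, n))

# True data-generating process: default depends on score, NOT on LTV.
p_default = 1 / (1 + np.exp((score - 580) / 25))
df = pd.DataFrame({
    "high_ltv": ltv > 105,
    "default": rng.random(n) < p_default,
})

print(df.groupby("high_ltv")["default"].mean())
# High-LTV loans show the LOWEST default rate, so an unguided machine
# would reward riskier structure -- unless a human supplies the context
# that LTV was assigned only after credit had already been evaluated.
```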

Fostering effective supervision means going beyond pure technical expertise. Most scorecard developers, data scientists and engineers working on credit models have never underwritten loans or worked in collections, nor have they spent time analyzing how market conditions drive auto loan performance. Worse yet, pure technicians view these things as pedestrian; they are singularly focused on the elegance of their solution.

In this case, it is important to remember the words of Colin Powell, who once said “Don’t be buffaloed by experts and elites. Experts often possess more data than judgment. Elites can become so inbred that they produce hemophiliacs who bleed to death as soon as they are nicked by the real world.” It is essential to involve leaders in underwriting, funding, risk management and collections at the outset of a model build to ensure the real world is accounted for.

To protect against myopic models, lenders with internal resources should consider using a champion-challenger methodology when evaluating AI techniques. I have seen more than one vendor present what is called a "swap set" to prove the superiority of their credit models, both in terms of performance improvement and volume increases. Swap sets are produced by running the new model on historical data to show how much better it performs than the old model.

What they leave out is that the old model did not have the benefit of the recent performance data, so of course the new one will look better. A swap set is not necessarily representative of the future because the historical paper was not purchased with the new model. If the objective is to prove that the methodology is better, a proper comparison would build the new model using only the data that was available to the old model.
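Here is a sketch of that fairer test on synthetic data, with simple stand-in models (a logistic regression champion against a gradient-boosted challenger): both are restricted to the same pre-cutoff history, then scored on later vintages neither has seen.

```python
# A fairer swap-set test: train the challenger only on the history the
# champion had, then compare both out-of-time. Synthetic data and
# stand-in models for illustration.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n = 5_000
loans = pd.DataFrame({
    "orig_year": rng.integers(2014, 2019, n),  # 2014..2018 originations
    "score": rng.normal(600, 50, n),
    "ltv": rng.uniform(80, 130, n),
})
risk = (580 - loans["score"]) / 50 + (loans["ltv"] - 100) / 60
loans["default"] = rng.random(n) < 1 / (1 + np.exp(-risk))

features = ["score", "ltv"]
train = loans[loans["orig_year"] < 2017]   # history both models may use
test = loans[loans["orig_year"] >= 2017]   # out-of-time vintages

champion = LogisticRegression().fit(train[features], train["default"])
challenger = GradientBoostingClassifier(random_state=0).fit(train[features], train["default"])

# Identical data diet, identical unseen vintages -- any lift the
# challenger shows here is attributable to methodology, not hindsight.
for name, model in [("champion", champion), ("challenger", challenger)]:
    auc = roc_auc_score(test["default"], model.predict_proba(test[features])[:, 1])
    print(f"{name} out-of-time AUC: {auc:.3f}")
```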

Over twenty years ago, we experimented at my company with an HNC neural network machine, as well as early versions of SAS Enterprise Miner, which could build multilayer perceptron models and other types of neural networks. In addition, we used classification and regression trees (CART), optimized on information value, to partition data (essentially identical to supervised machine learning). We never put one of those models into production unless it substantially outperformed the champion methodology (i.e., traditional statistical predictive modeling). Every modeling form has its own unique assumptions and limitations, so by making models compete we were able to minimize the risk from blind spots in new methods.

Final words
It is important to me that you don't misunderstand my message. I am a big fan of both children and AI… but who wants to be around a child who never learned any limits? While this technology has tremendous potential, it also has limitations. The supervision of experienced lenders who understand the context of the environment is a critical component of success.

Prior to the 2018 World Cup, leading AI experts at Goldman Sachs, Electronic Arts (EA) and more than 30 universities worldwide ran millions of simulations using machine learning models to predict the final outcome of the event. Most of the predictions were entirely incorrect, which led SQL Services data scientist Nick Burns to say, “No matter how good your models are, they are only as good as your data… recent football data just isn’t enough to predict the performance in the World Cup. There’s too much missing information and undefined influences.”

Only the models produced by Electronic Arts correctly predicted that France would win the multi-stage competition. EA has been producing the FIFA video game since 1993, and is presently on its 27th installment in the series. For the last 26 years, the company has modeled each player’s performance at a micro level in order to make gameplay as realistic as possible. EA beat the competition by balancing the sophistication of the methodology with deep subject matter expertise – in other words, they supervised the model. Lenders would do well to follow suit.

Daniel Parry is co-founder and CEO of TruDecision Inc., a fintech company focused on bringing competitive advantages to lenders through analytic technology. He was previously co-founder and CEO of Praxis Finance, a portfolio acquisition company, and co-founder and former chief credit officer of Exeter Finance. Prior to this, he was senior vice president of credit risk management for AmeriCredit Corp (GM Financial). If you have questions, you may reach Daniel at [email protected].