Auto finance, as we know it today, really began in the late 1980s with the introduction of asset-backed securitization to lenders. This instrument allowed lenders to fund more loans by clearing out their debt facilities, and it brought a substantially lower cost of funds. The latter opened the door to full-spectrum lending, because lower borrowing costs allowed companies to absorb higher loss levels and still maintain profitability.
Over the past 30 years, the industry has been relatively stable, with the exception of a few unscrupulous players. There have, however, been three major events that led to an industry-wide deterioration in auto loan performance: (1) the technology bubble and the 2001 recession that followed it, (2) the Great Recession, and (3) the post-pandemic inflationary period of 2022-2023. The news media tend to blame every increase in delinquency and default on loose underwriting, while lenders themselves typically point to the macro environment as the culprit. A curious trend that has become more pronounced since the pandemic is that many lenders are wondering whether their credit models (i.e., scoring and other origination tools) are at fault. The purpose of this series of articles is to address the most common concerns related to credit models and to separate sense from nonsense.
Nonsense: Scoring models do not work in subprime auto lending
Sense: The largest and most successful lenders in the auto finance industry do not need me to point out that the above statement is nonsense; however, the sentiment is still widespread. There are a variety of reasons why operators hold this point of view. Many who perform traditional buying (meaning judgmental underwriting in a manual environment) fear being replaced with automation, which is understandable. Others know they have learned quite a bit over their careers related to best practices and ask, “how can the model account for so many of the varied things we look for?” Those concerns are a separate issue from whether or not the models actually work, and they can be addressed with actual performance data. Having said that, it is my experience that no amount of data will convince someone who is not an objective critic.
The group I wish to address with this rebuttal consists of the numerous lenders who have been burned, performance-wise, by an overreliance on scores in subprime auto lending. The leaders of these companies are neither inexperienced nor unintelligent, and they have their own performance data that supports their skepticism.
Originations credit models are designed to do very specific things, but they have shortcomings that must be addressed in other ways. The following list details some of the more common issues that, if not accounted for, will lead to performance problems:
• Missing Risk Factors – There is nothing wrong with the national credit scoring models produced by Vantage, FICO and the major credit reporting agencies. They rank-order losses and have proven their value over the years; however, they do not include application and loan structure factors that seasoned buyers know are very important. For example, it doesn’t matter if an applicant scores 850 but cannot afford the payment. Payment-to-Income Ratio, Debt-to-Income Ratio, Loan-to-Value, Down Payment and a variety of stability factors such as residence and job time are all predictive of repayment or default. Failing to account for those factors in a credit program will lead to disaster.
• Inexperienced Modelers – Lenders often run into performance issues when implementing models built by inexperienced analysts (whether internal or with a vendor). I am not speaking of analysts who are unqualified from an academic or quantitative knowledge perspective, but an operational one. The data that goes into any model was influenced by many factors. Those factors include the economic environment, competitive pressure leading to adverse selection, the effectiveness of servicing and the execution of credit policy. Experienced modelers know they must remove trends from the development data that will not repeat going forward. The most obvious example of this comes from building models off of pandemic-era data. There was a tremendous amount of stimulus and loan forbearance in 2020-2021, resulting in many consumers performing well who would have otherwise defaulted. The model, without competent oversight, assumes that the credit profiles of those consumers are associated with good performance. Deploying such models in an inflationary (and potentially recessionary) environment led to performance shocks at numerous lenders. What matters is not how clever the math is, but how much contextual understanding the modeler has.
• Myopic Filters – This problem is actually an extension of the problems caused by inexperienced modelers. Myopic Filters refer to the fact that judgmental buying and the use of a strong credit policy effectively sanitize the population of approved applications. The only performance a lender has is on funded applications, which have been filtered to remove some of the riskiest paper. Modeling only the data you have will lead to a far more optimistic estimation of performance than would be merited if the filter had not been applied. Take LTV as an example. If the lender’s credit policy cuts the advance rate for the riskiest loans and allows more advance for the top tier applicants, the data will show a strong positive correlation between LTV and quality (i.e., increase the LTV and the consumer is less risky). Obviously, that is ridiculous and only a function of the filter, but many lenders fail to account for the impact of credit policy and high-side score overrides, which leads to worse performance. Some analysts handle this bias with a technique referred to as reject-inferencing, which in my opinion has serious limitations. Others deal with this bias by augmenting their sample using a data archive from the credit reporting agencies. The latter is preferred provided it is done properly.
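The LTV example above can be made concrete with a toy simulation (every coefficient and cutoff below is invented for illustration, not taken from real lending data). The true relationship makes higher LTV riskier, but because the credit policy only grants high advances to high-score applicants, the funded population shows the opposite pattern:

```python
import random

random.seed(0)

funded = []
for _ in range(50_000):
    score = random.gauss(600, 60)             # applicant credit score
    requested_ltv = random.uniform(0.8, 1.5)  # requested loan-to-value
    # True relationship: higher LTV and lower score both raise default risk
    p_default = min(0.95, max(0.02,
        0.6 - 0.002 * (score - 500) + 0.3 * (requested_ltv - 1.0)))
    # Policy filter: only stronger scores are allowed higher advances
    max_ltv = 0.9 + 0.002 * max(0.0, score - 550)
    if requested_ltv <= max_ltv:
        default = random.random() < p_default
        funded.append((requested_ltv, default))

# Among FUNDED loans, high-LTV deals look safer than low-LTV deals,
# even though the true LTV effect runs the other way.
low  = [d for ltv, d in funded if ltv < 1.0]
high = [d for ltv, d in funded if ltv >= 1.0]
print(f"funded low-LTV default rate:  {sum(low)/len(low):.1%}")
print(f"funded high-LTV default rate: {sum(high)/len(high):.1%}")
```

A model fit only to the funded records inherits this inverted relationship, which is exactly why the bias must be corrected before, not after, the model is built.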
Nonsense: A customized model, built off of my company’s own data, is superior to a generic model
Sense: If the lender is large enough (i.e., multi-billion dollar portfolio size), they will likely get a strong lift from a customized model; however, custom for the sake of custom will in many cases produce an inferior model. I run into smaller lenders all the time who insist that their unique offering to dealers, their credit strategy and servicing make their business model totally different from all others…ok, sure. I will agree that many small auto lenders who have very tight relationships with a smaller number of dealers will often perform better than those with similar credit profiles, but that is not really the issue. These lenders likely have 3 times the closure rate and 30 percent lower losses than others with similar paper. Those differences evaporate as the company achieves scale. The issue is that when a consumer defaults it is related to things going on in that person’s life – not the lender. A credit model is designed to assess the risk factors of the consumer, not the company. Credit strategy, policy and servicing can reduce or elevate losses, but they do not change the inherent risk of the applicant relative to all other applicants in the pipeline. The more examples of consumer behavior a lender has, the more likely the model estimate will be robust and precise. Most lenders do not have sufficient data to achieve the needed level of precision, and so they must augment their own history with like credit records. Too often I see lenders with insufficient data insist on a custom rather than a generic or augmented model, to their detriment. There is a portfolio size at which a custom model becomes appropriate; short of that, the lender should focus on what mix of data will provide the most reliable prediction.
Nonsense: If performance changes over time, the scorecard must be broken
Sense: The sample from which a scorecard (or any other type of model) is built represents a snapshot of a particular period in time. Once that snapshot is taken, the performance is fixed. It is not reasonable to assume that tomorrow will continue to be a steady state representation of yesterday, yet lenders do this all the time when it comes to evaluating the value of their own scores.
The primary function of a credit score is to sort applicants into increasingly better groups in terms of credit performance. This is referred to among model builders as rank-ordering. Scores are a sorting mechanism upon which a loss expectation is imposed based on a company’s own historical credit performance. Assuming consistent execution of credit policy, verification practices, and servicing strategy, credit scores should demonstrate stable and consistent rank-ordering across economic periods. While the absolute level of losses may fluctuate, a group of 700s should consistently perform better than 650s, 600s and 550s, respectively.
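The rank-ordering property described above is easy to check in code. The default rates below are hypothetical, chosen to show levels shifting between economic periods while the ordering by score band holds in both:

```python
# Hypothetical 12-month default rates by score band for two periods.
# Absolute levels move with the economy; the ordering should not.
default_by_band = {
    "benign":   {550: 0.28, 600: 0.20, 650: 0.13, 700: 0.08},
    "stressed": {550: 0.36, 600: 0.27, 650: 0.18, 700: 0.11},
}

def rank_orders(rates: dict) -> bool:
    """True if default rates strictly decrease as score bands rise."""
    vals = [rates[band] for band in sorted(rates)]
    return all(a > b for a, b in zip(vals, vals[1:]))

for period, rates in default_by_band.items():
    print(period, "rank-orders:", rank_orders(rates))
```

A score that passes this check in both periods is doing its job, even though the stressed-period losses are materially higher at every band.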
There are many reasons why credit performance fluctuates, and not all of it has to do with things that are under the control of management. When recessions occur, two things happen with regard to losses. First, credit defaults are pulled forward in time. Simply put, consumers who would have taken longer to default are put under sufficient economic pressure to cause that event to occur earlier. Second, consumers who might not have defaulted reach a critical mass of debt and cannot recover. Both of these situations cause more defaults at each score level than would be observed in a steady state environment.
Conversely, after the initial bubble of defaults works through, lenders experience materially improved credit at each score level. Credit defaults that were pulled forward in time are in effect sanitizing future periods. Furthermore, stressed economic periods usually result in lenders tightening both policy and verification standards. With increased scrutiny, fewer dollars to deploy, and less competition, credit providers are able to cherry-pick the population of consumers. Credit scores cannot be clairvoyant about future events. As such, risk managers must take their baseline expectation for losses by score and flex that figure based on where we are within the credit cycle.
Determining whether one’s current model is deteriorating should be based on evaluating the model’s continued ability to separate good loans from bad loans. A simple way to visualize this is to plot the default rate by score for multiple time periods (aged to the same point in time – 12 months on books, for example). A deteriorating model will show a flatter slope when compared to earlier time periods. This indicates that the model is becoming increasingly less effective at pushing bad loans to the lower scores and good loans to the higher scores. A more sophisticated way of measuring this is for a competent risk analyst to use tools such as K-S, Divergence and ROC statistics. Performance by score will most certainly fluctuate based on the external environment, but it is a shift in the ability to rank-order losses that determines whether the model requires redevelopment.
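As an illustration of one of the tools mentioned above, a minimal K-S computation might look like the following (the score lists are invented for demonstration). The statistic is the maximum gap between the cumulative score distributions of good and bad loans, so a K-S that shrinks period over period signals fading separation:

```python
def ks_statistic(good_scores, bad_scores):
    """Kolmogorov-Smirnov statistic: the maximum gap between the
    cumulative score distributions of good and bad loans.
    Higher values mean better separation."""
    cutpoints = sorted(set(good_scores) | set(bad_scores))
    n_good, n_bad = len(good_scores), len(bad_scores)
    ks = 0.0
    for c in cutpoints:
        cdf_good = sum(s <= c for s in good_scores) / n_good
        cdf_bad = sum(s <= c for s in bad_scores) / n_bad
        ks = max(ks, abs(cdf_bad - cdf_good))
    return ks

# Illustrative sample: bads concentrate at the low end of the score range
good = [520, 560, 600, 620, 640, 660, 680, 700, 720, 740]
bad = [480, 500, 520, 540, 560, 580, 600, 620]
print(f"K-S = {ks_statistic(good, bad):.2f}")  # → K-S = 0.60
```

In production this would be computed on full vintages aged to the same point on books, and tracked over time alongside the plotted default-rate slope.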
Predictive credit models have proven their usefulness over several decades, but they have limitations that must be carefully addressed. To do so requires separating fact from fiction in order to make the best use of them. Some key best practices related to the points of this article are as follows:
• Employ Correct Measurements: When performance goes south, the originations team blames collections, the collections team blames originations, and both groups will be certain that the risk model is an issue. There are a variety of tools risk managers may employ to determine the source of deterioration (i.e., originations, servicing or the macro environment). The fact of the matter is that performance will fluctuate across time by credit tier. As it relates to the credit model, lenders should run monthly score-related reports such as Population Stability, Characteristic Analysis, Static Pool by Score Tier and the continued goodness-of-fit measurements discussed above. If you are not familiar with some of these, it would be wise to reach out to people you trust in the industry to learn more.
• Establish a Performance Baseline: Subprime auto portfolios with a 600 average credit score will typically produce a 25 percent cumulative unit loss rate, with 50 percent of defaults occurring by 18 months in. Depending on the credit policy, loan structure and vehicle mix such a portfolio will typically have a 13-15 percent cumulative net loss rate, and a 7 percent annualized net loss rate. In stressed periods, annualized net loss may rise to 10-12 percent over a 12 month period and then taper off. Rebounding from a downturn will typically lead to a 4-5 percent annualized net loss rate for up to 12-18 months. This, of course, is just an illustration. Each lender must establish their own baseline and range, and should use sufficient tools to monitor performance so that adjustments may be made prior to a delinquency and default shock.
• Ensure Data Sufficiency: Credit models are estimated in a variety of ways (e.g., logistic regression, support-vector machines, neural networks and other machine-learning methods). Each method has assumptions which, if not dealt with, will produce a flawed model. The one thing they all have in common, though, is they cannot overcome bad data. I don’t necessarily mean data that is incorrect, but rather data that is not likely representative of the period the model will be used in. Furthermore, insufficient data will produce volatile and biased scores. When building a credit model, lenders typically take at least two years’ worth of data that has been seasoned at least 12 months. If there are fewer than 2,000 defaults in that time frame you do not have enough to build a single scoring model, and may be best served by an off-the-shelf score tailored to your credit niche. If there are fewer than 5,000 defaults, I strongly suggest augmenting your data with a bureau archive. Do not make the mistake of simply appending performance of the approvals you did not fund, as these likely represent deals that you were not competitive on. Doing so will produce an overly optimistic model. Best practice suggests having a competent analyst create a multivariate statistical profile of your entire application distribution and using that profile to create the bureau archive. Remember, the score has to work on all applications that come through the pipeline, not just the ones you wanted.
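As a sketch of one of the monitoring reports named in the measurement bullet above, the Population Stability Index compares the score-band mix at model development to the mix coming through the pipeline today (the bin proportions here are purely illustrative):

```python
import math

def psi(expected: list, actual: list) -> float:
    """Population Stability Index between two score-bin distributions,
    each given as a list of bin proportions summing to 1."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Illustrative score-band mix at development vs. recent applications
dev = [0.10, 0.20, 0.40, 0.20, 0.10]
recent = [0.15, 0.25, 0.35, 0.15, 0.10]
value = psi(dev, recent)
print(f"PSI = {value:.3f}")  # → PSI = 0.052
```

A common rule of thumb reads a PSI below 0.10 as a stable population and a PSI above 0.25 as a material shift worth investigating; the thresholds are conventions, not statistical guarantees.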
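The annualized net loss figures in the baseline bullet above depend on how the ratio is defined, and conventions vary by lender. One common convention, shown here with invented numbers, divides twelve months of net charge-offs by the average outstanding balance over the same window:

```python
# Hypothetical monthly figures in $MM (illustrative only)
monthly_net_loss = [0.6] * 12        # net charge-offs per month
monthly_avg_balance = [100.0] * 12   # average outstanding balance per month

# Annualized net loss rate: 12 months of net losses over average balance
anl = sum(monthly_net_loss) / (sum(monthly_avg_balance) / len(monthly_avg_balance))
print(f"annualized net loss rate: {anl:.1%}")  # → 7.2%
```

Whatever definition is chosen, the key is to apply it consistently so that the baseline range and the stressed-period flex described above are measured on the same footing.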
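The data-sufficiency thresholds in the bullet above can be written down as a simple decision rule. The function below merely restates this article's rule of thumb, not a universal standard:

```python
def model_strategy(defaults_24mo: int) -> str:
    """Rule of thumb from the article: pick a modeling approach based on
    the number of seasoned defaults in a two-year development window."""
    if defaults_24mo < 2_000:
        return "off-the-shelf score tailored to the credit niche"
    if defaults_24mo < 5_000:
        return "custom model augmented with a bureau archive"
    return "fully custom model"

for n in (1_500, 3_000, 10_000):
    print(f"{n:>6} defaults -> {model_strategy(n)}")
```

The counts that matter are defaults, not total loans; a large book of clean paper can still be too thin on bad outcomes to support a custom build.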
The next article in this series will explore the questions of regional models and the efficacy of the variety of automated tools that fall under the umbrella of artificial intelligence.