Finding the right fit: How Plaid reconciles pending and posted transactions
Plaid's API helps developers provide financial services to tens of millions of consumers across North America. These services help consumers manage their personal finances, let them transfer funds and make payments, and allow them to access loans and mortgages. Our mission is to improve people's lives by delivering access to the financial system.
We work toward this mission not only by helping consumers to access their financial data, but also by improving the quality of that data. Enriching data with machine learning is one of the objectives of our Data Science and Infrastructure team. In this post, we'll talk about one of the ML models our team built, as well as the stepping stones it took to get here.
The pending-posted problem
One way Plaid adds value to transactions data is by identifying when pending transactions from a consumer’s account become posted. A transaction is pending when it is being processed by the bank. While it is pending, the amount is deducted from the account owner’s available funds but not from the account balance. Once the transaction settles, the transaction goes from “pending” to “posted”. A posted transaction is mostly finalized, and the monetary amount has been withdrawn from the account.
When Plaid takes snapshots of accounts, we receive a list of transactions with descriptions, monetary amounts, and whether the transaction is pending or posted. While we know if transactions are pending versus posted, banks often do not tell us which pending transactions from a previous snapshot correspond to the new posted transactions from the current snapshot. This matching is critical to clients. If they send notifications to consumers with each new transaction, it’s important that they don’t duplicate those notifications.
Unfortunately, it’s often not obvious which posted transactions map to which previously pending transactions. A common difficult matching problem is restaurant bills. When a consumer’s credit card is charged for their bill at a restaurant, the restaurant initiates a pending transaction. It does not include service charges and tips. Once the restaurant’s receipts are batched (often at the end of the business day), they finalize the transactions by adding on the gratuity to the transactions. This is when the transactions become posted.
There are other situations in which corresponding pending and posted transactions may look different. Hotels often leave higher pending charges as holds on the account for incidental fees. Bars create single-dollar pending transactions for open tabs, which settle to the actual bill amounts once the transactions post. Merchants, payment processors, and financial institutions each may change the transaction descriptions.
Our high-level approach to this problem is to build a model to predict the likelihood, or match score, that a given pending and posted transaction represent the same underlying transaction. If a pending transaction disappears from one account snapshot to the next, we match it with the "most likely" posted transaction that appeared on the new snapshot. We greedily continue matching while match scores are above a certain threshold.
The crux of the problem is choosing a model for determining this match score.
To solve this problem, we initially thought of rules that would tell us how likely a given pending and posted transaction are to match. Here’s a visual representation of example rules to match pending and posted transactions initiated by restaurants:
This rule-based approach is called a decision tree, which segments the space of independent variables, like information about the transactions, and attempts to find the regions of this space likely to correspond to matching transactions. While the decision tree in the above visualization outputs boolean predictions, the decision trees usually used in more powerful machine learning, including in our models, output likelihood predictions instead.
Algorithms exist for training decision trees, but in practice, standalone trees are rarely used. This is because they tend to learn the noise behind the training data instead of the underlying relationships within the data. For example, suppose our training dataset included many different transactions whose descriptions were simply the names of ridesharing services. A decision tree might erroneously learn that descriptions don’t matter, since so many pairs of non-matching transactions would have similar descriptions.
This issue is called overfitting.
Excess model complexity results in overfitting as it allows the model to contort to the training data. Overfitting is known as “high variance” because overfit models are strongly dependent on training data, and small changes in input will result in large changes in predictions. On the other hand, insufficient variables and insufficient model complexity results in underfitting, in which the model is too inflexible to find meaningful relationships within the training data. Underfitting is known as “high bias” because underfit models have significant systematic prediction errors, or bias.
A fundamental challenge in data science is the bias-variance tradeoff. Carelessly increasing model complexity leads to higher variance and lower bias. If our models optimize purely on bias measurements such as accuracy on the training set, they will tend to overfit.
To solve the pending-to-posted matching problem without overfitting, our first model augmented the concept of decision trees using bagging and feature sampling. Let’s first discuss bagging, which refers to bootstrap aggregating.
“Bootstrapping” is the process of training models on random samples of the training data. By limiting the amount of data used in the training process, bootstrapping combats overfitting by providing different noise profiles during training.
“Aggregating” is the process of combining many different bootstrapped models. With bootstrapped trees, the aggregation process typically lets the trees "vote" by computing the average of the likelihoods predicted by the trees. Since the training subsets are randomly sampled, the decision trees still fit the dataset on average, but the voting gives a more robust prediction.
Combining bootstrapping and aggregating results in bagging.
The bagged model reduces variance more if the component models are uncorrelated. However, only bootstrapping on different samples of training data often results in trees that have highly correlated predictions, because the most informative branching rules are often similar across sampled training data. For example, because transaction descriptions are a strong indicator of whether or not a pending and posted transaction match, most of our trees will rely heavily on this indicator. In this case, bagging has a limited ability to reduce the variance of our overall model.
This is where feature sampling comes in.
To reduce the trees’ correlation, our model also randomly sampled features in addition to randomly sampling training data, resulting in a random forest. A staple in data scientists’ toolkits, random forests are powerful predictors with low overfitting risk, high performance, and high ease of use.
This was the model that Plaid used for several years to match pending and posted transactions. Over time, this method proved to be effective, but not excellent: we noticed a high false negative rate when we evaluated the model against human-labeled data. We needed to improve the model so it would more reliably find matches.
When Random Forests Fail
Random Forests, and bagging in general, are susceptible to underfitting imbalanced datasets. Our random forest models for pending-to-posted matching suffered from this problem. Since each pending transaction has at most one posted transaction in a training set, most candidate pairs of pending and posted transactions are not a match. This meant our training sets had an imbalance in which a large majority of the data was “not matching”; as a result, our random forest model erred on the side of predicting lower probabilities of matching, resulting in a high false negative rate.
To solve this problem, we used boosting. Boosting restricts decision trees to simple forms – for example, trees that aren’t very deep – in order to reduce the bias of the overall model. The boosting algorithm iteratively explores the training data, adding the restricted trees that maximally improve the aggregate model. As with bagging, the trees vote to come up with a final decision.
This process eventually learned that improving performance on the minority case — pairs of transactions that do match — would maximize model improvement. The algorithm dove deep into identifying what conditions predict that case. With well-tuned hyperparameters, we finally saw a major improvement in our false negative rate.
Another advantage of boosting was the ability to flexibly define the “model improvement” metric during training. By assigning asymmetric penalties to false positives and false negatives, we trained a model more aligned with how those model errors asymmetrically impact consumers.
Our new boosting model lowered our false negative rate by 96% compared to the random forest model, ultimately providing higher quality transactions data to our clients and consumers. In addition to internal metric improvements, we also saw a significant reduction in support tickets filed by our customers about pending-to-posted transaction matching.
It is essential to understand how the characteristics of machine learning model archetypes lead to different strengths and weaknesses. While our new model has made significant improvements in the quality of data we provide, it has tradeoffs of its own. Boosting is sensitive to the model improvement metric and to other hyperparameters that restrict how simple the trees must be. In this case, the improved consumer experience was well worth the careful training procedure and meticulous tuning.
There’s much more we have yet to explore. For example, which boosting algorithm is best, given that we work with a large number of categorical variables? Given that we process transactions multiple times daily for one in four Americans with bank accounts, how do we ensure our matching algorithm is fast enough to keep up? Are deep neural networks a worthwhile investment for this problem, given the difficulty in interpreting and explaining the reasoning behind their output?
If you want to help us answer these challenging questions and many others, or if you’re interested in learning more about how we use data science to empower financial services, e-mail me at firstname.lastname@example.org or check out plaid.com/careers.