How Plaid parses transaction data
At Plaid, we link financial accounts to applications, removing the need for our customers to build individual connections to financial institutions and standardizing the data that’s used across multiple accounts. That means developers can focus on creating innovative products and services.
One of the most interesting challenges we face is the aforementioned data standardization, or normalization: how can we simplify data across thousands of different formats to be used in thousands of different ways, and why should we?
The ‘why’ is easy: by changing the value of transaction data from a mere record of financial activity to a building block of deep user understanding, we provide our customers with the ability to glean meaningful insights. In turn, they can help their users make better financial decisions.
The ‘how’ is more complex: that’s because the data financial institutions natively provide is messy and convoluted—and far from normalized. A look over your last bank statement will likely leave you pondering at least a few transactions. For example, take the transaction below, displayed via three separate financial institutions:
POS DEBIT Chick Fil A 4/5
Authorized purchase Chkfila 333222121 NY NY
This is a single merchant across just three banks. Imagine millions of merchants across thousands of banks. How can we normalize this?
The named entity parsing problem
Location parsing and merchant parsing are the two most important and impactful challenges we face when enriching transaction descriptions. They are examples of a common research topic in machine learning, named-entity recognition (NER), in which we seek to locate named entities in unstructured text and classify them into predefined categories (in our case, location and merchant). Here’s an example:
McDonald’s F1001 01/21/2020 New York NY
Although NER problems are often solved with machine learning approaches, a complex model might not be needed for every instance. For example, a transaction description like the one above immediately signals a McDonald’s transaction made in New York. That means we can skip the computationally intensive model and return the result directly. To achieve this, we built a lightweight fuzzy-matching algorithm that extracts the location and merchant information straight from the transaction description.
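To make the idea concrete, here is a minimal sketch of what such a matcher could look like. It is not Plaid's implementation: the lookup tables, the function names and the 0.85 similarity cutoff are all assumptions made for illustration, and the location lookup is exact rather than fuzzy to keep the example short.

```python
import difflib
import re

# Hypothetical lookup tables; a production system would use far larger datasets.
KNOWN_MERCHANTS = ["McDonald's", "Chick-fil-A", "Starbucks"]
KNOWN_LOCATIONS = {
    ("new york", "ny"): "New York, NY",
    ("bedminster", "nj"): "Bedminster, NJ",
}


def normalize(text):
    """Lowercase, drop apostrophes, and turn other punctuation into spaces."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def ngrams(tokens, max_n=3):
    """Yield runs of 1 to max_n adjacent tokens, longest runs first."""
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def match_merchant(description, cutoff=0.85):
    """Return the first known merchant that closely matches a token run."""
    names = {normalize(m): m for m in KNOWN_MERCHANTS}
    for candidate in ngrams(normalize(description).split()):
        hits = difflib.get_close_matches(candidate, list(names), n=1, cutoff=cutoff)
        if hits:
            return names[hits[0]]
    return None


def match_location(description):
    """Look for a known city (one or two tokens) directly followed by its state."""
    tokens = normalize(description).split()
    for i, state in enumerate(tokens):
        for n in (2, 1):  # try two-word city names before one-word ones
            if i - n >= 0:
                city = " ".join(tokens[i - n:i])
                if (city, state) in KNOWN_LOCATIONS:
                    return KNOWN_LOCATIONS[(city, state)]
    return None


description = "McDonald's F1001 01/21/2020 New York NY"
print(match_merchant(description))  # McDonald's
print(match_location(description))  # New York, NY
```

When both functions return a result, the description can be resolved without calling a model at all; when either returns None, the description would fall through to the heavier approach described below.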
Unfortunately, not all location strings appear in the tidy format above, and there will always be new merchants that we’ve not previously seen. Here’s an example:
POS WD SAPPS #06 / 06063-BEDMINSTER NJUS
Let’s say this is a new restaurant called SAPPS, which just opened in Bedminster, NJ. A naive location matcher that relies on clean tokens would not be able to determine that Bedminster, NJ represents a location, since the token ‘BEDMINSTER’ is concatenated with a numeric string and the string ‘US’ is appended to ‘NJ’. Moreover, since SAPPS is a new business not included in our merchant dataset, the matcher is unable to identify it as the merchant.
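The failure is easy to reproduce. The snippet below is only an illustration: the lookup sets are hypothetical, and splitting on whitespace stands in for whatever tokenization a simple matcher might use.

```python
description = "POS WD SAPPS #06 / 06063-BEDMINSTER NJUS"
tokens = description.lower().split()

# Hypothetical lookup sets standing in for real city, state and merchant data.
known_cities = {"bedminster", "new york"}
known_states = {"nj", "ny"}
known_merchants = {"mcdonald's", "chick-fil-a"}

city_hits = [t for t in tokens if t in known_cities]
state_hits = [t for t in tokens if t in known_states]
merchant_hits = [t for t in tokens if t in known_merchants]

print(tokens)
# ['pos', 'wd', 'sapps', '#06', '/', '06063-bedminster', 'njus']
print(city_hits, state_hits, merchant_hits)
# [] [] []
```

Every lookup comes back empty: the city is fused to a numeric prefix, the state is fused to the country code, and the merchant is simply not in the list.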
Plaid has therefore developed a solution to tackle these NER challenges using a language model and bidirectional long short-term memory.
A statistical language model is a probability distribution over sequences of words, in which the model assigns a probability to each word or token in a given sequence. More practically, it encodes the internal meaning of a word with the information contained in its neighbors. There are two major categories of language model approaches:
- Masked language model (MLM) approaches, which predict a [MASK] token using all tokens in a sentence
- Autoregressive (AR) approaches, which perform left-to-right or right-to-left prediction
Typically the MLM approach works better for natural language understanding tasks (e.g. named-entity recognition, text classification), while the AR approach performs well for language generation tasks due to its sequential nature. Using the [MASK] token in MLM enables us to model the meaning of a word using all the surrounding words save for the word itself (otherwise, the model would learn each word from its own embeddings and ignore the contextual information).
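The difference between the two approaches shows up in how training examples are built. The sketch below is a simplification under stated assumptions: the function names are made up for this post, the 15% masking rate is the one used in the BERT paper rather than a figure Plaid has published, and BERT's extra step of sometimes swapping in random or unchanged tokens is left out.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """MLM-style example: hide some tokens and keep the originals as labels.

    Returns (masked_tokens, labels), where labels[i] is None for positions
    the model is not asked to predict.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


def ar_examples(tokens):
    """Left-to-right AR examples: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["McDonald's", "New", "York", "NY"]
print(mask_tokens(tokens, rng=random.Random(7)))
print(ar_examples(tokens))
```

In the MLM case the model sees context on both sides of each hidden token; in the AR case it only ever sees the prefix.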
Plaid uses an MLM similar to BERT (Bidirectional Encoder Representations from Transformers) to help tackle the location and merchant parsing problem. BERT is one of the best-known and highest-performing masked language representation models. It is designed to pre-train deep bidirectional representations of natural language by using Transformer Encoders to encode the contextual information of the input sequences.
Before we dive deeper into the language model, let’s take a small detour to explore the idea behind Transformer Encoders, which are a vital component of the BERT model.
The Transformer architecture was proposed in the paper Attention is All You Need, and is essentially a Sequence-to-Sequence (Seq2Seq) encoder combined with an Attention Mechanism.
A Seq2Seq encoder takes in a sequence of items (in our case, words) and outputs another sequence in which each item is encoded with the information from the surrounding items. The Attention Mechanism helps to decide which other items in the sequence are important when encoding and understanding a specific item. Take the following sentence: ‘Jack won the championship and he felt so proud of it.’ The Attention Mechanism would understand that ‘he’ refers to the person ‘Jack’ and would therefore assign more attention to the token ‘Jack’.
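For readers who prefer code, here is a small pure-Python sketch of scaled dot-product attention, the core operation from Attention is All You Need. The two-dimensional vectors are hand-picked for illustration, and the learned query, key and value projections of a real Transformer are omitted, so treat this as a toy rather than a faithful encoder layer.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is the weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Toy vectors chosen so that "he" points in nearly the same direction as "Jack".
vectors = {"Jack": [1.0, 0.0], "won": [0.0, 1.0], "he": [0.9, 0.1]}
names = list(vectors)
x = [vectors[n] for n in names]

outputs, weights = attention(x, x, x)  # self-attention: Q = K = V
he_weights = dict(zip(names, weights[names.index("he")]))
print(he_weights)
```

With these vectors, the token 'he' places its largest weight on 'Jack', which mirrors the behavior described above. In a trained model, that alignment is learned from data rather than set by hand.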
Masked language model
Returning to the MLM, the model structure looks like:
Let’s use the transaction description ‘McDonald’s New York NY’ to illustrate the model’s behavior at a high level. The model would:
1. Tokenize the transaction description: [‘McDonald’s’, ‘New’, ‘York’, ‘NY’]
2. Send the tokens through an embedding layer, which transforms them into a 2D matrix.
3. Encode the embedded input with contextual information through a set of Transformer Encoder layers.
4. Attach a fully connected layer to apply a linear transformation to the encoding result.
5. Apply a softmax layer to produce a probability for each possible token.
6. Update the model parameters with backpropagation after calculating the loss.
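The forward pass and loss for a single masked position can be sketched in a few lines of pure Python. Everything here is an assumption made for illustration: the five-token vocabulary, the embedding size, the random weights, and in particular the encode function, which is a crude averaging stand-in for the Transformer Encoder layers. The backpropagation update is omitted; in practice a deep learning framework would handle it.

```python
import math
import random

VOCAB = ["[MASK]", "McDonald's", "New", "York", "NY"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}
DIM = 4
rng = random.Random(0)

# Randomly initialized parameters; training would adjust these.
embedding = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]
out_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def tokenize(description):
    return description.split()


def embed(tokens):
    """Embedding lookup: one row per token, giving a (tokens x DIM) matrix."""
    return [embedding[TOKEN_ID[t]] for t in tokens]


def encode(vectors):
    """Stand-in for the encoder layers: mix each position with the sequence mean."""
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return [[0.5 * x + 0.5 * m for x, m in zip(vec, mean)] for vec in vectors]


def token_probabilities(vector):
    """Fully connected layer followed by softmax over the vocabulary."""
    logits = [sum(w * x for w, x in zip(row, vector)) for row in out_weights]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlm_loss(description, masked_position):
    """Mask one token and return (cross-entropy loss, predicted probabilities)."""
    tokens = tokenize(description)
    target = TOKEN_ID[tokens[masked_position]]
    tokens[masked_position] = "[MASK]"
    encoded = encode(embed(tokens))
    probs = token_probabilities(encoded[masked_position])
    return -math.log(probs[target]), probs


loss, probs = mlm_loss("McDonald's New York NY", masked_position=2)
print(loss, dict(zip(VOCAB, probs)))
```

Training repeats this over many descriptions and masked positions, nudging the parameters so that the probability of the hidden token rises and the loss falls.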
The trained MLM is effectively a Seq2Seq encoder that takes in a textual sequence and emits another sequence. Each element of the latter is encoded with the information of its surrounding elements.
Because the MLM is an unsupervised learning approach, we’re not limited by the amount of labeled data when building the model. By feeding it the sea of Plaid-managed transactions, we end up with a language model embedded with the meaning of transaction descriptions.
Once the encoded sequences are sent from the MLM, they are fed into a downstream parser to recognize the target entities (merchant / location). For the downstream bidirectional parser, we leveraged the Bidirectional LSTM (long short-term memory) model, a state-of-the-art approach for entity recognition problems. The high-level model structure is as follows:
The Bidirectional LSTM model is an extension of the Unidirectional LSTM, itself a member of the RNN (Recurrent Neural Network) family. The Unidirectional LSTM is designed to recognize patterns in sequential data, such as time series and human language. It does so by extending the contextual meaning of the preceding text into a target word. Bidirectional LSTMs go one step further, by understanding contextual information both forwards and backwards, rather than only the former.
By leveraging the Bidirectional LSTM framework, we effectively train two separate LSTM neural networks—one that takes the original copy of the text sequence and the other that takes the reversed copy—and eventually aggregate the results together. In this way, each token in the sequence encapsulates the information from both directions. A final prediction can thus be made having a holistic view of the text sequence.
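The two-pass idea can be shown with a small LSTM cell written in plain Python. This is a sketch under stated assumptions, not our production parser: the class and function names are invented for this post, the weights are random and untrained, and concatenation is just one reasonable way to aggregate the two directions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """A minimal LSTM cell with untrained random weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.w = {g: [[rng.uniform(-0.5, 0.5) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifog"}
        self.b = {g: [0.0] * hidden_size for g in "ifog"}

    def _gate(self, name, xh, activation):
        return [activation(sum(w * v for w, v in zip(row, xh)) + b)
                for row, b in zip(self.w[name], self.b[name])]

    def step(self, x, h, c):
        xh = x + h  # concatenate the input with the previous hidden state
        i = self._gate("i", xh, sigmoid)
        f = self._gate("f", xh, sigmoid)
        o = self._gate("o", xh, sigmoid)
        g = self._gate("g", xh, math.tanh)
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c

    def run(self, sequence):
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in sequence:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs


def bidirectional(forward_cell, backward_cell, sequence):
    """Run one cell left to right, the other right to left, then concatenate."""
    fwd = forward_cell.run(sequence)
    bwd = backward_cell.run(sequence[::-1])[::-1]  # re-align with original order
    return [f + b for f, b in zip(fwd, bwd)]


rng = random.Random(0)
forward_cell = LSTMCell(input_size=2, hidden_size=3, rng=rng)
backward_cell = LSTMCell(input_size=2, hidden_size=3, rng=rng)
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = bidirectional(forward_cell, backward_cell, sequence)
print(len(encoded), len(encoded[0]))  # 3 6
```

Each output vector has a forward half that has seen only the tokens up to that position and a backward half that has seen only the tokens from that position onward. A tagging layer on top would then label each token as merchant, location or neither.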
This combination of our string matching / regex rules and our Neural Networks has yielded promising results for our location and merchant parsing product. As of today, we are able to correctly identify 95% of merchant and location information in transaction descriptions when present.
Moving forward, we’re eager to explore additional improvements, such as how model performance might vary with different sets of hyperparameters or how a combination of character-level CNNs (Convolutional Neural Networks)—known for capturing the semantic information of unfamiliar words—and word embeddings might produce even better results.
If you’d like to help us find answers to these questions and many others, or if you’re interested in learning more about the ways we use data science to empower financial services, email me directly at firstname.lastname@example.org or check out plaid.com/careers.