Transaction Dupes - Debugging multiple race conditions
Last summer, we started receiving feedback from a few customers that they were being notified about pairs of new transactions that appeared to be duplicates of one another.
Pulling up a level, our Transactions product returns users' transaction history at almost every bank in the United States. For many customers, the best part of Plaid's approach is that we can give them nearly real-time updates on transaction data, which they use to monitor account activity for fraud, help users save money through spending “round ups,” and track progress toward budget goals. When Plaid detects a new transaction, we send a webhook to notify the customer that the account has new data. They can then retrieve the new data through the Plaid API.
Because so many of our customers count on regular data updates, we take the timeliness and accuracy of these notifications very seriously. So when we heard there was a problem, we wanted to get to the bottom of it as soon as possible. This post is a reflection on what we did—as well as some takeaways for handling concurrent operations across multiple services.
Plaid keeps transactions data up-to-date with a three-phase pipeline:
- Scheduler determines when we should poll the bank for updated transaction data for an account.
- When it is time for an update, the bank integration communicates with the financial institution to retrieve raw updated account data.
- Reconcile processes the results and sends webhooks to the customer if we found new transactions.
For the accounts affected by this bug, these “new” transactions were duplicates of old transactions, and calls to our API would only return these duplicates for a few hours before they disappeared. We’d seen similar in the past that had been isolated to particular banks. In this case, however, we saw the duplicates pop up across the board, indicating that this was a problem with our batch update system rather than an issue with any one institution. We also noticed that the problem also occurred disproportionately frequently for the highest-traffic institutions, but we did not recognize the importance of this clue at the time.
We immediately zeroed in on Reconcile, the last service in our pipeline. This service compares previously returned transaction data with the new results just received from the institution. Reconcile marks each transaction in the result set as either new (never seen before), an update (a previously seen transaction with new data), or a deleted transaction (a previously seen transaction that did not show up in this result set). Because bank APIs don’t usually provide consistent transaction IDs, this process is inherently imprecise. It was the obvious culprit for the problem of misidentified duplicates. There was just one problem with this theory: The reconciliation code had not been touched in months, and there was no reason for it suddenly to stop working.
Lacking a smoking gun, we looked first at the reconciliation logs for the affected accounts, hoping to see a pattern. Curiously, we found that the problem wasn’t so much that we were failing to match transactions in a single result set, but rather that we were processing multiple sets of results for the same accounts at the same time. Somehow, the reconciliation code was being run concurrently on updates for the same accounts. We had not observed this before, and for good reason: It was supposed to be impossible.
Plaid is careful to avoid running multiple concurrent requests for the same account to avoid corrupting the account’s state. We do this by acquiring a lock in the bank integration code on the account to be updated before we begin the data pull for that account. Any concurrent updates would fail to acquire the lock, meaning the update would fail long before reaching the reconciliation stage.
Bug #1: TOCTOU
Putting aside the question of how duplicate updates were running in the first place, the team figured out how such duplicate updates could lead to duplicate new transactions—and why they would be cleaned up in a few hours. At the start of the reconciliation process, Reconcile reads the existing set of transactions for the account from the database and compares it to the most recent result set from the institution. Once the comparison is complete, it inserts the new transactions into the database. Astute readers may already have spotted the problem—a type of bug known as a race condition, more specifically, a “time of check to time of use” error.
This kind of bug crops up whenever two separate processes simultaneously check some condition and then take an action that modifies the condition without taking into account the ongoing action of the other process. In this case, both updates read the same set of transactions and then inserted the same “new” transaction into this set. (If the two updates had been properly sequenced, the second update would have seen the new transaction inserted by the first update and done nothing.) When the next update ran for the account a few hours later, it compared the duplicate transactions (now in the “old” set saved to the database) to the results returned from the institution, assumed that the duplicates had been deleted by the institution, and removed them from the database.
This race condition had existed in the Reconcile code base for as long as Plaid had been doing reconciliation—more than 3 years!—without being triggered. It had been lying dormant, waiting for near-concurrent updates to be triggered. This is a general feature of race conditions, which makes them difficult to find, difficult to reproduce, and difficult to verify. They are affected by changes in timing, which in turn can be affected by non-local behavior. Something can change in a completely unrelated component, even a different program running on the same computer, or even in the ordering or timing of input from a different computer on the network, as in this case.
Though we now understood how the duplicate transactions were entering the database, we weren’t satisfied that we could move on. There was still the mystery of how we came to be processing twin updates in the first place. Even this was really two mysteries: Why was Scheduler executing duplicate update processes? And why were these duplicates not being caught by the lock designed to prevent simultaneous update processes for the same account? Though another per-account lock in Reconcile would have fixed the immediate problem, we needed to identify the root cause to be sure that the duplicate updates were not triggering some other latent race condition elsewhere.
Bug #2: The not-so-critical section
The team started with the latter of these questions and began an investigation into the second stage of the pipeline, the bank integrations. The integration framework code is responsible for acquiring and releasing the lock that prevents simultaneous updates to the same account. After looking at this code more closely, we saw why duplicate updates might be able to reach Reconcile: The lock was only held during the time that an update was waiting on API requests to the institution. The lock was released before sending the resulting data on to Reconcile.
But why did this only now become an issue? A seemingly innocuous recent change turned out to have broader implications than we had realized. Previously, a second update that failed to acquire the lock would fail completely and be retried several hours later. This was changed, however, so that the second update would wait until the lock was available until proceeding. This behavior saved us from many user-facing failures when customers initiated synchronous requests for other data types for the same account, but it also inadvertently allowed near-simultaneous updates to flow through to the third phase. The duplicate update process would be delayed reaching Reconcile by as much time as it took to request data from the institution, but enough uncertainty in timing still existed within Reconcile to allow this second update to “catch up” to the first one during the reconciliation process and play havoc with the database.
Bug #3: The nonexclusive queue
That explained how Reconcile could see duplicate data, but we still needed to understand why duplicate updates were being scheduled in the first place. Scheduler had just received a major overhaul and was actively being hardened. What we found was yet another race condition, but only for institutions with the highest traffic levels. In Scheduler, each institution has a queue of accounts to be updated and workers responsible for scheduling those updates. Small institutions are handled by a single thread in Scheduler. Bigger institutions, however, are processed by a number of threads all working off the same queue. This was why large institutions were so overweighted in our sample. The ultimate cause of the duplicate extractions behavior was a failure to properly synchronize on accessing the head of the queue. Each thread would look at the account at the top of the queue, initiate an update, and then move it to the back of the queue. While this worked for single-threaded institutions, the multiple threads for the larger institutions could each schedule the same account before moving it back, leading to duplicate updates.
Another curious side effect of this behavior was its impact on the update rate. We had already seen that the queue was not being processed as quickly as we expected it to be. This, we had previously assumed, was a separate bug. It was not until seeing the race condition that we realized the two odd behaviors, duplicate new transactions and a slow update rate, were really two symptoms of the same problem. Adding proper synchronization to Scheduler quickly fixed both problems.
The experience of debugging and fixing this issue taught us some important lessons about how we should build and maintain our systems going forward.
Correctness under Concurrency
Defensive steps taken elsewhere in a complex system, such as the lock used to prevent concurrent data updates in this example, can hide potential problems. But even if something is not currently causing problems, it is still a latent bug that should be fixed. One can be sure that code will not trigger undesirable behavior only if it is correct without assuming anything about the context in which it is called.
We could have tested for this level of correctness by using a technique called fuzzing in which a component is provided with invalid input to see how it behaves. Hitting Reconcile with concurrent attempts to reconcile transactions for the same accounts would have revealed this undesirable behavior immediately.
Debugging Service Pipelines
Identifying the cause of our duplicates problem required seeing the big picture of how the three components of the pipeline fit together—looking at each in isolation only got us more confused. We have since made it easier to take a step back by creating a unique identifier for each update process in the first service, Scheduler, and logging that identifier in all messages for the process from any of the three services. In addition, we keep the clocks synchronized across the pipeline so that filtering for log messages matching this identifier provides a linear view of what happened during the update.
All the Pieces Matter
The beginning of our investigation revealed a useful clue: duplicates were weighted disproportionately towards the larger institutions. But we ignored this signal because the shape of the graph seemed roughly right. Digging into this mystery first would have gotten us to the root cause faster.
Make a Checklist
The same challenges crop up again and again when building distributed systems. We've begun asking ourselves the same set of questions for each service we build:
- How does the service behave as load increases?
- What assumptions does the service make about its input? About the context in which it executes? About what else is or is not executing at the same time?
- How can we make these assumptions explicit? Can we fail fast if they do not hold? Can we architect the service such that it works correctly even if they do not hold?
- What information can we log to help debug problems caused by inputs or conditions that violate untested assumptions?
- How can we build in redundancy so that the stack is resilient to incorrect behavior from an individual service or component?
If you like asking these sorts of questions and building robust systems, Plaid is hiring!