In this post, I will discuss two phase commit (aka 2PC) distributed transaction commit protocol and some of the problems associated with it.
What is a distributed commit protocol?
A commit protocol is an algorithm used for atomically committing a transaction. Atomicity implies that either all the changes (writes / updates) in the transaction will be committed taking the database to a new state or everything will be rolled-back without changing any state in the system.
Distributed commit protocols are widely used in distributed databases but are generally applicable to all kinds of distributed systems. For example, a distributed OLTP database that supports ACID transactions will have to use such a protocol to atomically commit or rollback a transaction on multiple participating nodes, leaving the database in a consistent state at the end with all committed changes being durable.
Two Phase Commit
As the name suggests, 2PC executes in two separate phases. The general topology of the system has a coordinator node where the transaction originates and two or more participating nodes. Coordinator node is also a participating node. For the sake of this discussion, let’s refer to transaction as a database transaction that comprises of one or more SQL statements (e.g INSERT, UPDATE, DELETE) that attempt to change the state of table / database.
Take this user flow as an example
SQL> INSERT INTO ... SQL> INSERT INTO ... SQL> DELETE FROM ... SQL> UPDATE TABLE T ... SQL> COMMIT
For a distributed transaction that uses 2PC, COMMIT statement triggers the execution of the protocol
Phase 1 – Request Vote
- Coordinator requests vote from all the participants. The votes will be used to decide if the transaction can be safely committed or needs to be aborted.
- Each participant sends one of the following response types to coordinator
- YES – Participant ready to commit the transaction locally
- NO – Participant wants to abort the transaction. Usually the case when something has gone wrong during the local execution of transaction at the participant
Phase 2 – Commit or Abort
Coordinator uses the responses received to decide the outcome of protocol.
All participants voted YES
- Coordinator sends COMMIT message to all participants. Each participants commits the transaction locally, release locks/resources etc and sends an acknowledgement to coordinator.
- Coordinator commits the transaction locally (updates metadata etc) and sends the response to user.
At least one participant voted NO
- Since the transaction has to be either atomically committed at all sites or no state change should happen, a single NO vote determines the outcome of the algorithm
- Coordinator sends ABORT message to all participants asking them to abort the local transaction.
- This will result in each participant rolling back the local changes (e.g using undo log) and restore the local state of data to a point where transaction never really happened.
- Coordinator receives acknowledgement from all participants nodes and aborts the transaction locally.
Note that coordinator can also be a participant in the protocol and regardless of whether it is a pure coordinator or both coordinator and participant, it can still decide to ABORT the transaction (for whatever problem it encounters locally) even if it had received YES votes from all the participants.
The above mentioned steps in both phases highlight the good scenario when everything in the system goes well without failures or crashes. Let’s now discuss few failure scenarios to understand critical problems associated with two-phase commit.
Failure Scenario 1: Coordinator crashes after completion of phase 1 but before starting phase 2
In this case, coordinator crashes after receiving votes from phase 1 but before starting phase 2. Consider a participant that had responded YES in first phase. It can’t simply assume positive outcome and go ahead with committing the transaction because it doesn’t know if rest of the participants had agreed to commit in phase 1.
If the participants can communicate with each other, then they can find out the response of each other. If at least one of the participants had voted NO, it is pretty clear that coordinator would have decided to abort the transaction. So, the participants can make progress on this in-flight transaction by aborting the local changes.
However, if all the participants had voted YES, it is still not enough for them to go ahead with committing the transaction since they don’t know the final outcome decided by the coordinator. The protocol is effectively blocked on coordinator recovering from the failure and continuing with phase 2.
Failure Scenario 2: Coordinator starts phase 2, crashes before sending phase 2 outcome to a subset of nodes. Some of the participant nodes that received the phase 2 outcome act upon it and crash.
This is an extension of the previous scenario. Recovery becomes tricky if all the alive participant nodes had voted YES in phase 1. It is invalid for any of them to go ahead and commit the transaction because what if the crashed participant had voted NO and had aborted the transaction before crashing. When it comes back up, the data in the cluster would be inconsistent.
Similarly it is invalid to go ahead and abort the transaction if the outcome was to commit and the crashed participant had already committed the transaction.
Similar to the previous scenario, the protocol gets blocked due to the ambiguity on the outcome of the protocol in the event of failure
- When the failed coordinator recovers, then it can help participants progress by telling them to commit/abort the transaction depending on the outcome decided before it crashed.
- If the coordinator never recovers and we bring up a backup coordinator then it will still have to wait to get a response from the crashed participant node to determine the final outcome of the blocked protocol before it allows the remaining participants to proceed with commit or abort.
2PC protocol chooses to block and trades off liveness for safety to maintain consistency and correctness in the cluster in the event of failure. As a result, the transaction that was in the middle of commit by 2PC gets stalled incurring higher latency.
Additionally, any other concurrent conflicting (touching overlapping data) transactions on the participant nodes get stalled too since their completion is dependent on the completion of the transaction that is waiting on two-phase commit protocol to finish.
As an example, consider transactions T1 and T2 that want to update row R. T1 and T2 are conflicting transactions as they are interested in updating same data. Let’s say that database starts executing T1. As part of that, it would have acquired a lock (aka row level locking) on R to prevent any concurrent conflicting transaction (e.g T2) from modifying the row at the same time.
Only after T1 commits, can T2 proceed with updating row R. If T1 is stalled due to the above mentioned blocking scenario in two-phase commit, T2 and any other conflicting transaction will be impacted as well resulting in reduced throughput (transactions per sec) of the system in addition to higher latency per transaction.