Distributed transactions in MongoDB were developed incrementally: starting from single-node WiredTiger transactions, building up to single replica set transactions, and finally to distributed, cross-shard transactions, which were first introduced in version 4.2. In the EOVP framework recently discussed by Alex Miller, the transactions at each layer fall into slightly different classifications.

Single Node and Replica Set Transactions

WiredTiger is a multi-version concurrency control (MVCC) system with optimistic conflict handling that supports snapshot-isolated transactions. At a single MongoDB node, transactions execute against WiredTiger and are ordered by commit timestamps assigned above the storage layer; the storage layer then uses these timestamps to order transactions. Commit timestamp selection occurs essentially via an “oracle”, which at the single-node level is effectively an atomic counter. A timestamp is assigned at some point near the start of the transaction, and when the transaction commits, this timestamp determines its visibility ordering. Validation of concurrency/isolation semantics happens online in these transactions, with write conflicts manifested eagerly at the time the conflict occurs. Similarly, persistence occurs upon commit, which presumably requires a flush to the WiredTiger write-ahead log.
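The single-node picture can be sketched as a toy model: an atomic-counter timestamp “oracle”, per-key version lists ordered by commit timestamp, snapshot reads against a read timestamp, and eager write-write conflict detection. This is a minimal illustration of the scheme described above, not WiredTiger's actual implementation; all class and method names here are invented for the sketch.

```python
import itertools
import threading


class TimestampOracle:
    """Single-node timestamp 'oracle': effectively an atomic counter."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next_ts(self):
        with self._lock:
            return next(self._counter)


class WriteConflict(Exception):
    """Raised eagerly, at the moment a write-write conflict occurs."""


class ToyMVCCStore:
    """Keeps a list of (commit_ts, value) versions per key."""

    def __init__(self, oracle):
        self.oracle = oracle
        self.versions = {}     # key -> [(commit_ts, value), ...] sorted by ts
        self.write_locks = {}  # key -> txn currently writing it

    def begin(self):
        return Txn(self, read_ts=self.oracle.next_ts())


class Txn:
    def __init__(self, store, read_ts):
        self.store, self.read_ts = store, read_ts
        self.writes = {}

    def read(self, key):
        if key in self.writes:              # read-your-own-writes
            return self.writes[key]
        # Snapshot read: latest version with commit_ts <= our read_ts.
        for ts, value in reversed(self.store.versions.get(key, [])):
            if ts <= self.read_ts:
                return value
        return None

    def write(self, key, value):
        # Eager validation: conflict if another txn is writing this key,
        # or a version newer than our snapshot has already committed.
        owner = self.store.write_locks.get(key)
        if owner is not None and owner is not self:
            raise WriteConflict(key)
        versions = self.store.versions.get(key, [])
        if versions and versions[-1][0] > self.read_ts:
            raise WriteConflict(key)
        self.store.write_locks[key] = self
        self.writes[key] = value

    def commit(self):
        # Commit timestamp from the oracle orders this txn's visibility.
        commit_ts = self.store.oracle.next_ts()
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))
            del self.store.write_locks[key]
        return commit_ts
```

A transaction that begins before another's commit keeps reading its own snapshot, while two concurrent writers of the same key fail fast with `WriteConflict` rather than at commit time, mirroring the eager, online validation described above.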

Replica set transactions don’t fundamentally change this picture: essentially all of the same behaviors occur, but at the primary while the transaction is being executed. Once a transaction commits, it is written into the oplog and replicated to the secondaries, which provides a higher-level (consensus-level) durability/persistence guarantee.
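The consensus-level persistence step boils down to a majority acknowledgment rule: the commit's oplog entry is durable in the replica-set sense once a strict majority of members have persisted it. A minimal sketch of that rule (the function name is illustrative, not MongoDB's API):

```python
def majority_committed(ack_count, replica_set_size):
    """True once a strict majority of members (including the primary's
    own write) have persisted the oplog entry for the commit."""
    return ack_count >= replica_set_size // 2 + 1


# In a 3-member replica set, the primary plus one secondary suffice.
assert majority_committed(ack_count=2, replica_set_size=3)
assert not majority_committed(ack_count=1, replica_set_size=3)
```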

(Figure: the Validate, Execute, Order, Persist phases.)

Distributed, Cross-Shard Transactions

Distributed transactions in MongoDB generalize several components of the lower-level transaction models; e.g., validation and execution essentially occur concurrently, over mostly the same windows of time. As a transaction proceeds through its execution phase, its operations are routed to the shards that own the keys being read or written, and validation occurs online at each shard. In particular, this involves checking write-write conflicts, per snapshot isolation (SI) requirements, as well as prepare conflicts, which MongoDB uses to ensure that concurrently committing transactions become visible atomically across shards.
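The prepare-conflict behavior can be illustrated with a small sketch: a version written at a prepare timestamp is neither visible nor invisible until the two-phase commit outcome is known, so a reader whose snapshot is at or after the prepare timestamp cannot proceed. This is a toy model under assumed names, not MongoDB's implementation; a real system would block and retry rather than raise.

```python
PREPARED, COMMITTED = "prepared", "committed"


class PrepareConflict(Exception):
    """Raised (in a real system: waited on) when a read would have to
    decide visibility of a prepared-but-undecided transaction."""


def read_at(doc_versions, read_ts):
    """doc_versions: list of (ts, state, value), sorted by ts ascending.
    Returns the newest committed value visible at read_ts."""
    for ts, state, value in reversed(doc_versions):
        if ts <= read_ts:
            if state == PREPARED:
                # The 2PC outcome is undecided: we can't tell whether this
                # version should be in our snapshot, so we must wait.
                raise PrepareConflict
            return value
    return None
```

Because every shard holds readers at the prepare timestamp until the coordinator's decision arrives, the transaction's writes become visible atomically across shards rather than shard by shard.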

Once a transaction is ready to commit, it initiates a variant of two-phase commit, which carries out the main portions of the transaction’s ordering and persistence phases. Commit timestamp selection for distributed transactions in MongoDB is a partially distributed process, since there is no centralized timestamp oracle. During the prepare phase, the transaction coordinator collects prepare timestamps from each shard participating in the transaction, and a commit timestamp is then computed as some timestamp \(\geq\) the maximum of these prepare timestamps. This guarantees monotonicity of commit timestamps between dependent transactions across shards without requiring a centralized timestamp oracle. Once this commit timestamp is computed, determining the transaction’s visibility ordering, the transaction can commit at each shard, making its commit record durable/persistent within a replica set and becoming visible to other transactions.
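The coordinator's timestamp selection step is simple to state in code: gather each participant's locally assigned prepare timestamp and pick any commit timestamp at least as large as their maximum. A minimal sketch (function name illustrative):

```python
def choose_commit_ts(prepare_timestamps):
    """Pick a commit timestamp >= the max of the participants' prepare
    timestamps. Any value >= this max preserves monotonicity of commit
    timestamps across dependent transactions; the max itself suffices."""
    if not prepare_timestamps:
        raise ValueError("a distributed txn needs at least one participant")
    return max(prepare_timestamps)


# Three participant shards returned prepare timestamps 17, 23, and 21;
# the coordinator commits at (at least) 23.
assert choose_commit_ts([17, 23, 21]) == 23
```

Because each shard's prepare timestamp already reflects everything that shard has made visible, taking the maximum ensures no dependent transaction can commit with a smaller timestamp than something it observed, without any centralized oracle.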

TODO: Read timestamp chosen upfront also plays a part in ordering, as mentioned by Marc and Alex.