Evolution of Mongo Query Execution Architecture

ShahNilay
Apr 12, 2023 · 5 min read


Since you’re on this page, I’m assuming you have a basic idea of what Mongo is and know some basics of the Mongo database. In this article we will try to understand how query execution happens internally in Mongo and how the storage engine in MongoDB has evolved.

[Image: storage engines used by MongoDB. Source: https://www.wikitechy.com/interview-questions/mongodb/what-are-the-storage-engines-used-by-mongodb/]

First, let’s start with some architecture-level basics of a database.

In my opinion, every database provides two components: an API and backend storage. Users talk to the database’s data via API calls, e.g. SQL, or get/set in Redis, or whatever any other database exposes. The difference is in the structure/contract of the syntax, which each database’s developers define independently.

SQL was first defined in the 1970s. Historically, different relational databases defined different query syntaxes, but at the bottom the data storage mechanism remained almost the same: you store the data in the file system, in a particular file, in a particular block, at some offset. Relational data gets stored in row-and-column format, NoSQL data gets stored in key-value or document format, and some other databases store data in a graph data structure. So in the bigger picture, any database has two components: API and storage.

Some key points of Mongo’s data storage architecture

When users submit JSON documents to Mongo, they are stored internally in BSON (Binary JSON) format for faster and more efficient storage. Mongo retrieves BSON and converts it back to JSON for user consumption. One constraint worth remembering from MongoDB’s own definition of BSON: the maximum BSON document size is 16 megabytes.
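You can see the BSON roundtrip for yourself. A minimal sketch using the bson package that ships with PyMongo (the document contents are made up for illustration):

```python
import bson  # ships with PyMongo

doc = {"name": "shard-report", "replicas": 3, "tags": ["wt", "bson"]}

# Encode the dict to BSON bytes, the way Mongo stores documents internally.
raw = bson.encode(doc)
print(len(raw), "bytes as BSON")

# Decode back to a dict, the way Mongo returns documents to users.
assert bson.decode(raw) == doc

# The 16 MB document limit applies to this encoded size.
MAX_BSON_SIZE = 16 * 1024 * 1024
assert len(raw) <= MAX_BSON_SIZE
```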

Usually, writes to RAM are faster than writes to disk.
You can’t modify a single byte on disk; you modify a full page, block, or some bigger unit on the SSD/disk. I/O operations are time-consuming, so issuing lots of tiny writes is not advisable (remember page replacement algorithms from OS class?). So during DB writes, you manage the data in RAM and flush it to the hard drive in bulk for better efficiency.

Mongo WAL: while you’re writing to RAM (and the write to disk is yet to be executed), Mongo keeps a time-ordered log of writes on disk (the write-ahead log, aka the journal) to avoid data loss after a crash. When a crash happens, the dirty pages (those that haven’t been written to disk yet) are lost. To get them back, Mongo loads the latest stored state of the collection and replays the actions recorded in the journal.
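Here is a toy sketch of the idea, not Mongo’s actual journal format: append every mutation to a log file and fsync it before touching the in-memory state, so a replay after a crash can rebuild the lost dirty pages. The file name and record layout are invented for illustration:

```python
import json
import os

LOG_PATH = "journal.log"  # hypothetical file name for this sketch


class ToyStore:
    """Buffers writes in RAM; the write-ahead log makes them crash-safe."""

    def __init__(self):
        self.ram = {}                      # dirty pages live here
        self.log = open(LOG_PATH, "a")

    def set(self, key, value):
        # 1. Append to the write-ahead log and force it to disk first.
        self.log.write(json.dumps({"op": "set", "k": key, "v": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Only then mutate the in-memory copy; bulk flush to the
        #    data files happens later, on the engine's schedule.
        self.ram[key] = value

    def recover(self):
        # After a crash, replay the journal to rebuild lost dirty pages.
        with open(LOG_PATH) as f:
            for line in f:
                entry = json.loads(line)
                if entry["op"] == "set":
                    self.ram[entry["k"]] = entry["v"]
```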

Now you have a decent idea of how DB writes work (we will discuss reads and writes further in this article) and why a good storage engine design is an important part of any database. Let’s talk about the evolution of the storage engine in Mongo.

MMAPv1 (Memory-Mapped Files)

In MMAPv1, BSON documents are stored directly on disk, uncompressed, and the _id primary key index maps to a special value called Diskloc (disk location). Diskloc is a pair of 32-bit integers representing the file number and the offset within that file where the document lives.

How does an _id search work?
Traverse the B+ tree to get the Diskloc pointer (file number and offset), then fetch the data from disk using that metadata.
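In rough Python, the lookup shape is the following; the dict stands in for the on-disk B+ tree, and the sample _id and offsets are made up:

```python
import struct
from typing import BinaryIO, Dict, List, Tuple

# A Diskloc is (file number, byte offset): two 32-bit ints in MMAPv1.
Diskloc = Tuple[int, int]

# Stand-in for the on-disk _id B+ tree; a dict is enough to show the shape.
id_index: Dict[str, Diskloc] = {
    "641e9f0c...": (0, 4096),  # made-up id: doc lives in file 0 at offset 4096
}

def fetch(doc_id: str, data_files: List[BinaryIO]) -> bytes:
    file_no, offset = id_index[doc_id]          # step 1: index -> Diskloc
    f = data_files[file_no]
    f.seek(offset)                              # step 2: jump to the offset
    (length,) = struct.unpack("<i", f.read(4))  # BSON starts with its total length
    f.seek(offset)
    return f.read(length)                       # step 3: read the raw BSON back
```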

Side effect of this design: what happens when an update query increases the document’s size? The document no longer fits at its old offset and has to be relocated, and every index entry pointing at the old Diskloc has to be updated, which is super inefficient.
Also, v1 implemented locking at the DB level. So if one collection was executing an update, you could not execute an update in another collection.

Mongo did improve MMAPv1 to a collection-level lock (the equivalent of a table-level lock), but later deprecated MMAPv1 in 4.0.

WiredTiger MongoDB Architecture (V2)

The problem mentioned in V1 is solved in V2, and locks are defined at the collection level (though a row-level locking mechanism like in SQL DBs still hadn’t been fully achieved).

WiredTiger brought many features, such as document-level locking and compression.

Documents (in BSON format) in the WiredTiger storage engine are compressed and stored in a hidden index whose leaf pages hold (recordId, BSON) pairs, where the recordId is a 64-bit (8-byte) integer, double the size of the 32-bit integers in v1’s Diskloc. This is a similar model to PostgreSQL, where all indexes are secondary and point directly to the tuple id on the heap. This means more BSON documents can be fetched with fewer I/Os, making I/O in WiredTiger more effective and increasing overall performance (as more compressed data can fit inside the same page).

Side effect of this design: more RAM consumption (there are now two layers of indexing, and the index entries themselves are bigger) and double lookups, 2·log(n): one tree traversal in the _id index to find the recordId, and a second traversal in the hidden clustered index to find the document.
The double lookup consumes CPU, memory, time, and disk space, since both the primary index and the hidden clustered index must be stored.

Advantage: I/O efficiency increases because the data is compressed.

In V2, secondary indexes point to the recordId in the hidden clustered index rather than to a location on disk.
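You can actually peek at these recordIds. A minimal sketch with PyMongo, assuming a local mongod and a throwaway collection; show_record_id asks the server to attach the hidden 64-bit recordId to each returned document (the shell exposes the same thing via the showDiskLoc/showRecordId cursor option linked in the references):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local mongod
coll = client.test.people                          # made-up db/collection names

coll.insert_one({"name": "nilay"})

# show_record_id=True surfaces WiredTiger's hidden 64-bit recordId
# as a "$recordId" field on each returned document.
for doc in coll.find({}, show_record_id=True):
    print(doc["_id"], "->", doc.get("$recordId"))
```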

Clustered Collections Architecture (V3)

Locking at the row level is achieved in V3.

The primary index is defined on _id, and the complete document is stored inside the leaf pages (subject to page size and document size). So when a fetch query arrives, the data is fetched from disk and cached in RAM for some interval.
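Clustering is opt-in and must be chosen at creation time. A minimal sketch with PyMongo, assuming MongoDB 5.3+ and a local mongod (the collection name and document are made up):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local mongod
db = client.test

# A clustered collection stores documents inside the _id index itself:
# no hidden recordId index, so no double lookup on _id fetches.
db.create_collection(
    "events",
    clusteredIndex={"key": {"_id": 1}, "unique": True},
)

db.events.insert_one({"_id": "2023-04-12T10:00:00Z#42", "type": "click"})
print(db.events.find_one({"_id": "2023-04-12T10:00:00Z#42"}))
```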

Side effect of this design: issues show up when secondary indexes are defined. Secondary indexes are now defined over the primary key (in V2 they were defined over recordIds, which are 8 bytes long), but in Mongo the default _id is 12 bytes, and there are no restrictions on defining custom ids, so there is no upper limit on the size of the _id field, and every secondary index entry carries a copy of it, which ultimately increases RAM consumption.

This changes Mongo’s architecture to be similar to MySQL InnoDB, where secondary indexes point to the primary key. But unlike MySQL, where tables MUST be clustered, in Mongo you at least get a choice of whether or not to cluster your collection. This is actually a pretty good tradeoff.

Note: read about YOGA db; it supports locking at the (row, column) level. Obviously, this puts a tremendous load on RAM: if there are a million rows and each row has n columns, the worst case means millions × n locks at runtime.

Additional points

The size of an ObjectId is 12 bytes: 4 for the timestamp, 5 for a machine/process identifier, and the remaining 3 for a counter (similar to Snowflake ids). The issue is that with millions of entries, the index (a B/B+ tree) gets pulled into RAM at runtime, and in a B/B+ tree the leaf level holds the actual entries, so millions of leaves will fill up RAM substantially (12 bytes × millions, before any pointer overhead).
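The 12-byte layout is easy to inspect with the bson package that ships with PyMongo:

```python
import struct
from bson import ObjectId  # ships with PyMongo

oid = ObjectId()
raw = oid.binary                       # exactly 12 bytes
assert len(raw) == 12

# Bytes 0-3: big-endian seconds since the Unix epoch.
(ts,) = struct.unpack(">I", raw[:4])
print("timestamp:", ts, "->", oid.generation_time)

# Bytes 4-8: per-machine/process random value; bytes 9-11: counter.
print("machine/process:", raw[4:9].hex())
print("counter:", int.from_bytes(raw[9:12], "big"))
```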

Note: the size of ids is the actual reason cited behind Discord leaving Mongo and shifting to Apache Cassandra, as keeping the index for billions of messages in RAM was the root cause of Discord’s problems.

Users can submit documents containing fields that never existed in any other document in the collection. This schema flexibility is one of the features that makes Mongo attractive to devs, and it is also the feature that is abused the most.
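A minimal sketch of that flexibility with PyMongo, assuming a local mongod (names are made up):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local mongod
coll = client.test.users

# Three documents, three different shapes: no migration, no ALTER TABLE.
coll.insert_one({"name": "a"})
coll.insert_one({"name": "b", "age": 30})
coll.insert_one({"name": "c", "prefs": {"theme": "dark"}})

# Queries on a field simply skip documents that lack it.
print(coll.count_documents({"age": {"$exists": True}}))  # -> 1
```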

References

https://www.mongodb.com/docs/upcoming/core/clustered-collections/#clustered-collections
https://www.mongodb.com/docs/v4.4/reference/operator/meta/showDiskLoc/
https://www.youtube.com/watch?v=ONzdr4SmOng
