LevelDB is an excellent example to learn system design and writing product-level code. Jeff Dean and Sanjay are real programmers. LevelDB is beautifully written with high efficiency.

Data store layout in LevelDB

At any given time, the data in LevelDB is persisted in servel sorted lists (Level). Each sorted list is partitioned into one or more fixed-size chunks (files).

Each entry in the list has the format <user_key, seqno | type, value>. The <user_key, seqno> in combination is also called “Internal Key” inside LevelDB. Value could be just a deletion marker. The list is sorted by key as ascending order and seqno as descending order. The integer is serialized according to a Google’s standard. The Internal key format can be found in the “ParsedInternalKey” data structure. And the ordering can be found in InternalKeyComparator function at db/dbformat.cc file.

https://developers.google.com/protocol-buffers/docs/encoding#varints

Each file (sorted table) has one or more data blocks, one or more metadata blocks and two index blocks, one for data blocks, one for meta blocks.

There might be records with the same user key (different sequence number) in memory table or Level 0.

In other one level, there are no records with duplicate user key. (see DoCompaction() for details)

File types:

.ldb is sorted table

.log is write ahead logs

LOG is the activity history

MANIFEST contains the information about tables that make a database

Memory Table

A Memory Table is a mutable in-memory storage and it is implemented as a SKIP LIST. During runtime, there are two memory tables, one is mutable, one is immutable. Once the mutable one is full, the immutable one will be flushed to L1 and becomes the new mutable memory table. The mutable one will become the immutable one. To add users data in the memory table, LevelDB first allocate a memory object in the heap, the heap objects are managed by an object called arena (utils/arena.cc), which is a singleton. The heap object is then inserted to the SKIP LIST. The SKIP LIST implementation is found at db/skiplist.h. The SKIP LIST is implemented as a randomized data structure.

The SKIP LIST is a single-write-multi-read data structure, meaning that concurrent writes need to be synchronized while concurrent read can proceed without synchronization. The below figure shows the state of a skip list after inserting node with key = 9,3,7,8 and with the height = 4,3,2,1

The compacted memtable will be pushed down as long as it creates no overlaps in that level. The Maximum level is set to 2. The constant value is coded in dbformat.h as kMaxMemCompactLevel=0

Cache

The default implementation is a LRU cache. User can define her customized cache implementation by extending the cache class. The interface of the cache is Insert(key,value,charge,deleter). The charge is the total bytes in memory for this entry. The deleter is a callback to delete the entry.

In LevelDB, the Cache is used by TableCache and BlockCache. BlockCache is used to cache the block data. TableCache is to cache the index block for each opened SST table file. The cachable number of index blocks is configurable, the default value is 1000. The BlockCache size is 8<<20 by default. Given a 4KB block size, the cachable block size in memory is about 32GB.

SSTable handling

A table may contain one index block, one meta index block, one or more data blocks and one or more meta blocks. Each block is associated with a BlockHandle structure for easy manipulation. BlockHandle contains the offset and length of the block. BlockContents has the pointer pointing to the real block data and two flags indicating whether the block is cachable or has a data allocated in heap.

From the low-level, there’re two methods to open the SSTable, one is through memory-mapped IO, the another is through Filesystem APIs. In the former case, the lower level software will allocate a file buffer in memory, the file read function is implemented by simply pointing the descired offset. In the latter case, there is no file buffer in memory, user have to define its own buffer if necessary. As a result, if the SSTable is opened in a memory-mapped way, the heap_allocated and cacheable flags are both set to false, as it is the lower level software manages the buffer.

Snapshot implementation

Snapshots provide read-only view over a certain state of the entire database. Users can creat snapshot by calling GetSnapShot() interface. User can call Read() together with a Snapshot, in this case, the latest value on the state indicated by the Snapshot is returned. LevelDB maintains a monotically increased sequence number reflects the version of the entire database. Each write to the DB increases the sequence number by 1. LevelDB embeds sequence number in each record written by constructing an internal key which is the combination of the user key and the sequence number. LevelDB persists all the value versions for a specific key as long as it is more recent the oldest snapshot. During compaction, all the obsolete will be dropped to save the storage.

Debugging information

The GetProperty interface gives debugging information. The sstables returns the files information, the format is “filenumber:filesize [start key @ sequence .. end key @ sequence]”

Block

The key-value pairs are compressed by only writing the deltas of the key data. Upon each block_restart_interval, the full key data is written. The restart points (the offset for each restart) is encoded and appended to the block buffer.

Critical path for read

1 DBImpl::Get() 
2 -> MemTable::Get() 
3 -> Version::Get()
4    -> Version::FindFile()
5    -> TableCache::Get()
6       -> TableCache::FindTable()
7          -> Cache::Lookup()
8          -> Table::Open()
9             -> RandomAccessFile::Read()
10              -> ReadBlock()
11                 -> RandomAccessFile::Read()
12          -> Cache::Insert()
13      -> Table::InternalGet() 
14         -> Block::NewIterator()
15         -> Iterator::Seek()
16         -> Filter:KeyMayMatch()
17         -> BlockReader()
18            -> Cache::Lookup()
19            -> ReadBlock()
20               -> RandomAccessFile::Read()
21            -> Cache::Insert() 
22         	-> Iterator::Seek()

Critical path for L0 Write

1 DBImpl::Write()
2 -> LogWriter::AddRecord()
3 -> WriteBatchInternal::InsertInto()
4    -> MemTable::Add()
        -> Skiplist::Insert()
5 -> VersionSet::SetLastSequence()

Critical path for Flush

1 DBImpl::WriteLevel0Table()
2 -> MemTable::NewIterator()
3 -> BuildTable()
4    -> Env::NewWritableFile()
5    -> TableBuilder::Add()
6       -> BlockBuilder::Add()
7    -> TableBuilder::Finish()
8       -> WriteBlock()
9          -> WritableFile::Append()
10   -> TableCache::NewIterator()
11 -> Version::PickLevelForMemTableOutput()
12 -> VersionEdit::AddFile()

Critical path for Compaction

1 DBImpl::DoCompactionWork()
2 -> Version::MakeInputIterator()
3    -> NewTwoLevelIterator()
     -> NewMergingIterator()
3 -> OpenCompactionOutputFile()
4 -> Builder::Add()
5 -> FinishCompactionOutputFile()
6 -> InstallCompactionResults()
7    -> VersionEdit::AddFile()
8    -> Version::LogAndApply()
9       -> VersionEdit::SetLogNumber()
10      -> VersionEdit::SetPrevLogNumber()
11      -> VersionEdit::SetNextFile()
12      -> VersionEdit::SetLastSequence()
13      -> Builder.Apply()
14      -> WriteSnapshot()
15      -> LogWriter::AddRecord()
16      -> SetCurrentFile()
17      -> AppendVersion()