Menu

Physical_engine

silex6

Physical engine component goal is to encode and decode structures of the database file, implement safe write operations through transactions, ensure data consistency by checking content against user-defined constraints and deliver basic services for fast querying the database content.

The physical engine should be split in two layers. Low layer goal is to implement safe write through transactions, ensure data consistency, and handle disk space usage. The low layer should also split this disk space into logical files that will be used for database objects, as tables, views, indexes and, of course, reusable free space. High layer goal is to encode and decode physical structures of the database file, as B+Tree indexes, full-text indexes, records, fields, statistical data and metadata.

Background

The physical engine is the component that manipulates permanent structures of the database file. Physical storage engine should be a library component that will be used by several threads. Code should be thread-safe and re-entrant, as the physical engine should allow concurrent access of the same page by the different worker threads.

Physical storage engine depends on the disk caching component. All page access should be done through the disk caching component.

Physical storage engine should have 2 layers, see below

As part of the physical storage engine, a background garbage collector thread should be implemented in order to recycle empty pages and deleted collections. This thread should inspect transactions, collections, pages and should set the corresponding bit to '1' in [Free_space_table]s.

Low layer

The low layer should manipulate LOBs, [Populations,_tables_and_indices], [Transaction_data] and [Allocation_map]. This layer will be used to allocate (including pre-allocate) and release pages (flags as being 'free space'), [Backup_or_restore] a database, sequential read all pages of a collection, gain access to a page according to transaction rules, create shadow copies of pages and implement ACID transactions (atomic, consistent, isolated and durable).

Allocation map structure is similar to a B+ tree, but simpler. Operations on the allocation map should be implemented the same way as those on B+ trees.

Shadow copies

When update is done on permanent page A in collection B of transaction 0, then the physical storage engine should create a copy of page A in transaction of the current thread and should create collection B in that transaction if needed. Reading of page A in collection B of transaction C sometimes implies reading of a shadow copy located on another transaction (page A of collection B in a transaction !=C). See [Transaction_lifecycle].

The shadow copies should be copied back to permanent pages (transaction 0) by a background thread once transaction has been validated.

The low layer should add lock (semaphore) on permanent page A from collection B of transaction 0 if a shadow copy of this page has been created by the thread to indicate that page has been updated. The lock should be released by the low layer once transaction has ended (commit or roll back).

There should be three transaction modes: read committed, serialized and snapshot.

In a serialized mode transaction, the low layer should put to sleep (waiting) any thread that try to read a locked page until the lock has been released (= updated page has been committed or roll back) or timeout has expired. A timeout condition should raise an exception and cause roll back of the transaction.

In a read committed mode transaction, the low layer should not put to sleep the thread if it tries to read a locked page and return the shadow copy contained in the most recent transaction that has been committed.

In a snapshot mode transaction, the low layer should not lock any page read or written, and commit should return an error.

When page A of collection B is requested by a thread C, the low layer should:

  1. Check if a shadow copy of page A of collection B exists in thread C's transaction
  2. Check if a shadow copy of page A of collection B exists in other committed transactions.
  3. Read page A of collection B in transaction 0.

A in-memory transaction status cache could be implemented in order to speedup collection-wide search on transactions (as search for all committed transactions).

High layer

The high layer should manipulate table records, and B+ Tree nodes. This layer will be used in order to read / update [Records_and_fields] into pages, sequential read records and fields on a table or extent, read and update index, read and update statistics, search a key into an index, sequential read an index, and check for duplicates / missing keys in an index.

The high layer depends on the low layer in order to read/write pages and to isolate writes into transactions.

When a record is deleted from a page, the high layer should update related indexes and statistical data. Every subsequent record in the page should be moved up, and free space will be gained at end of the page. The engine should then scan next 5 pages, and stack all records together in order to try to generate an empty page. The empty page should then be recycled by the background garbage collector thread that is part of the physical engine.

When a record is updated, then current record is deleted, and record with new values should be added at (logical) end of the table. In this case index and statistical data should be updated only for fields that have actually been updated. For rewritten fields only the link pointer located on the leaf node of the index have to be updated.

New records should be added at logical end of the table. If new pages are needed, then the engine should allocate several pages at a time (in order to lower database fragmentation). This will appear when there are not enough pages in the collection to store all data. The number of pages allocated should allow storing 50 records (excluding LOB data)

Topics

See also

[General_architecture]


Related

Wiki: Allocation_map
Wiki: B+_Tree_manipulation
Wiki: Backup_or_restore
Wiki: Cycled_Id
Wiki: FileLayer_Implemented
Wiki: Free_space_table
Wiki: General_architecture
Wiki: Indices_on_strings
Wiki: Pattern_indices
Wiki: Populations,_tables_and_indices
Wiki: Records_and_fields
Wiki: Specification
Wiki: Transaction_data
Wiki: Transaction_lifecycle

MongoDB Logo MongoDB