BEP-365: Support multi-database based on data pattern

1. Summary

This BEP proposes an efficient storage solution for the BSC (BNB Smart Chain) that significantly improves performance and maintainability by splitting blockchain data into multiple databases based on data patterns.

2. Motivation

Currently, all of the BSC node’s data, except for historical block and state data, is stored in a single key-value database instance, with different types of data segregated by different prefixes, as shown in the table below (a small key-construction sketch follows the table). As the amount of data grows rapidly, several problems arise:

  • Mixing data with different access patterns in a single database results in inefficient performance and additional compaction overhead;
  • Queries become less and less efficient as the single database keeps growing;
  • Concurrent reads and writes against a single database cannot fully utilize disk bandwidth;
  • Spreading data across multiple disks is not well supported;
  • Database parameters cannot be tuned per data pattern to optimize read/write performance.
| Database | Category | Key Scheme | Size |
| --- | --- | --- | --- |
| KV Store | Headers | headerPrefix + num + hash | 72.28 MiB |
| | Bodies | blockBodyPrefix + num + hash | 12.40 GiB |
| | Receipt lists | blockReceiptsPrefix + num + hash | 7.73 GiB |
| | Difficulties | headerPrefix + num + hash + headerTDSuffix | 4.03 MiB |
| | Block number → hash | headerNumberPrefix + hash | 3.61 MiB |
| | Block hash → number | headerPrefix + num + headerHashSuffix | 1.33 GiB |
| | Transaction index | txLookupPrefix + hash | 176.04 GiB |
| | Bloombit index | bloomBitsPrefix + bit + section + hash | 8.12 GiB |
| | Contract codes | CodePrefix + code hash | 20.23 GiB |
| | Path trie state lookups | stateIDPrefix + state root | 3.52 MiB |
| | Path trie account nodes | trieNodeAccountPrefix + hexPath | 40.34 GiB |
| | Path trie storage nodes | trieNodeStoragePrefix + accountHash + hexPath | 473.95 GiB |
| | Account snapshot | SnapshotAccountPrefix + account hash | 13.17 GiB |
| | Storage snapshot | SnapshotStoragePrefix + account hash + storage hash | 246.98 GiB |
| Ancient (Chain) | Bodies | - | 797.23 GiB |
| | Receipts | - | 664.62 GiB |
| | Diffs | - | 356.49 MiB |
| | Headers | - | 20.21 GiB |
| | Hashes | - | 1.23 GiB |
| Ancient (State) | Account Data | - | 1.52 GiB |
| | Storage Data | - | 1.63 GiB |
| | History Meta | - | 248.81 MiB |
| | Account Index | - | 2.03 GiB |
| | Storage Index | - | 3.65 GiB |
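
For illustration, the sketch below shows how such prefixed keys are typically composed in go-ethereum-style code. The prefix values and helper names here are assumptions made for this example, not the exact BSC implementation; the actual key schemes are those listed in the table above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

var (
	headerPrefix    = []byte("h") // assumed prefix value, for illustration only
	blockBodyPrefix = []byte("b") // assumed prefix value, for illustration only
)

// encodeBlockNumber encodes a block number as an 8-byte big-endian integer so
// that keys for consecutive blocks sort sequentially in the KV store.
func encodeBlockNumber(number uint64) []byte {
	enc := make([]byte, 8)
	binary.BigEndian.PutUint64(enc, number)
	return enc
}

// headerKey builds headerPrefix + num + hash, mirroring the key scheme in the table above.
func headerKey(number uint64, hash [32]byte) []byte {
	return append(append(append([]byte{}, headerPrefix...), encodeBlockNumber(number)...), hash[:]...)
}

// blockBodyKey builds blockBodyPrefix + num + hash.
func blockBodyKey(number uint64, hash [32]byte) []byte {
	return append(append(append([]byte{}, blockBodyPrefix...), encodeBlockNumber(number)...), hash[:]...)
}

func main() {
	var hash [32]byte // zero hash, just to demonstrate the key layout
	fmt.Printf("header key: %x\n", headerKey(123456, hash))
	fmt.Printf("body key:   %x\n", blockBodyKey(123456, hash))
}
```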

This BEP aims to address these issues and provide a more efficient and high-performance way to store and maintain block and state data.

3. Specification

3.1 Multi-Databases

Chaindata will be divided into three stores (BlockStore, TrieStore, and Others) according to data schema and read/write behavior. After database separation, the data layout reported by db inspect is shown in the table below; a minimal routing sketch follows the table.

  • Block Database: Block-related data is stored in this store, including headers, bodies, receipts, difficulties, number-to-hash indexes, hash-to-number indexes, and historical block data.
  • Trie Database: All trie nodes of the current state, together with the historical state data of the most recent ~90,000 blocks, are stored here.
  • Original Database: The remaining data is stored in this store, including snapshots, the transaction index, contract code, and other metadata.
| Database | Type | Category | Data Pattern |
| --- | --- | --- | --- |
| BlockDatabase | KV Store | Headers | Sequential (except “Block hash → number”) |
| | | Bodies | |
| | | Receipt lists | |
| | | Difficulties | |
| | | Block number → hash | |
| | | Block hash → number | |
| | Ancient | Bodies | |
| | | Receipts | |
| | | Diffs | |
| | | Headers | |
| | | Hashes | |
| TrieDatabase | KV Store | Path trie account nodes | Path Hash |
| | | Path trie storage nodes | |
| | Ancient | Account Data | Sequential |
| | | Storage Data | |
| | | History Meta | |
| | | Account Index | |
| | | Storage Index | |
| OriginalDatabase | KV Store | Account snapshot | Hash |
| | | Storage snapshot | |
| | | Transaction index | |
| | | Bloombit index | |
| | | Contract codes | |

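To make the separation concrete, here is a minimal Go sketch of routing reads and writes to one of the three stores by key prefix. The interface, prefix values, and routing rules are simplified assumptions for illustration, not the actual BSC implementation.

```go
package multidb

import "bytes"

// KeyValueStore is a minimal stand-in for a key-value database handle
// (e.g. a Pebble or LevelDB instance).
type KeyValueStore interface {
	Get(key []byte) ([]byte, error)
	Put(key, value []byte) error
}

// MultiDatabase routes each key to one of the three stores described above.
type MultiDatabase struct {
	blockStore KeyValueStore // headers, bodies, receipts, difficulties, number/hash indexes
	trieStore  KeyValueStore // path-based trie nodes and state history
	origStore  KeyValueStore // snapshots, tx index, contract code, other metadata
}

// Assumed prefix values, for illustration only; the real ones are defined by the node's schema.
var (
	blockPrefixes = [][]byte{[]byte("h"), []byte("b"), []byte("r"), []byte("H")}
	triePrefixes  = [][]byte{[]byte("A"), []byte("O")} // account and storage trie nodes
)

// route picks the store whose data pattern matches the key's prefix.
func (db *MultiDatabase) route(key []byte) KeyValueStore {
	for _, p := range blockPrefixes {
		if bytes.HasPrefix(key, p) {
			return db.blockStore
		}
	}
	for _, p := range triePrefixes {
		if bytes.HasPrefix(key, p) {
			return db.trieStore
		}
	}
	return db.origStore // everything else stays in the original database
}

func (db *MultiDatabase) Get(key []byte) ([]byte, error) { return db.route(key).Get(key) }
func (db *MultiDatabase) Put(key, value []byte) error    { return db.route(key).Put(key, value) }
```
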
3.2 Folder Structure

The folder structure for multiple databases is shown below. The original database is located within the chaindata/ folder, and new block/ and state/ folders have been introduced to store block and trie data. Additionally, there is an ancient folder for storing historical data under each of these directories.
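
The tree below is an illustrative reconstruction based on that description; the exact file names inside each database directory will vary:

```
chaindata/            # original database (snapshots, tx index, contract code, metadata)
├── ancient/          # historical data of the original database
├── block/            # block database (headers, bodies, receipts, ...)
│   └── ancient/      # historical block data
└── state/            # trie database (path-based trie nodes)
    └── ancient/      # historical state data
```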

4. Rationale

Separating databases according to data schema and read/write behavior can improve the performance, scalability, and maintainability of chain nodes.

4.1 Block Database

Diving deeper into the data flow of block-related data: recent blocks are first stored in a key-value (KV) database, then appended sequentially to the ancient database and deleted from the KV database once they reach the ancient threshold, which wastes some disk bandwidth.

Originally, the latest 90K blocks are retained in the key-value database by default, regarded as “non-finalized” in the context of Proof-of-Work. With Proof-of-Staked-Authority and fast finality, there is no need to keep so many blocks for blockchain reorganization. Since fast finality is enabled on the BSC Mainnet, it makes more sense to rely directly on the finalized block as the chain-freezing boundary, which means far fewer recent blocks need to be kept in the KV database before being migrated to the ancient database.

In summary, after the finalized block is used as the chain-freeze indicator, only about 20-30 recent blocks need to be retained in the KV database. This small amount of data could in principle be kept in memory, but for the robustness and stability of the node a separate KV database is more reasonable. In most cases the block data will already have been deleted and appended to the ancient database before the memtable is flushed to an SST file, so unnecessary bandwidth consumption is reduced. A simplified sketch of this freezing logic is shown below.
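
A minimal sketch of the idea, using hypothetical names; the actual freezing logic in the client is more involved:

```go
package freezer

// freezeThreshold returns the highest block number that may be moved from the
// KV database into the ancient store. Hypothetical helper for illustration only.
//
// Legacy behaviour (PoW-style): keep the latest 90,000 blocks in the KV database
// for possible reorganizations.
// Fast-finality behaviour: anything at or below the finalized block can never be
// reorged, so it can be frozen directly; under fast finality head-finalized is
// typically only ~20-30 blocks, which is all that remains in the KV database.
func freezeThreshold(head, finalized uint64, fastFinality bool) uint64 {
	if fastFinality {
		return finalized
	}
	const legacyRetention = 90000
	if head < legacyRetention {
		return 0
	}
	return head - legacyRetention
}
```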

4.2 Trie Database

The trie node data constitutes roughly half of the overall key-value database volume and grows rapidly. Latency to read and write trie node data significantly impacts the block execution and verification performance. Improving its read and write speed can contribute to an overall improvement in blockchain performance.

In-depth analysis of the data model and read/write behavior of trie node data shows that trie nodes in path mode are heavily overwritten, and their keys are relatively well ordered along the paths. Storing this data together with other hash-keyed data in the same key-value database significantly amplifies the cost of compaction, leading to needless bandwidth consumption. By splitting trie data into its own database, the independent database can perform compaction over a simpler LSM hierarchy, which reduces the read/write latency of the entire database and improves performance. A sketch of the two key shapes follows.
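
To illustrate the contrast between path-ordered and hash-ordered keys, here is a small sketch; the prefix values are assumptions, while the key shapes mirror the first table:

```go
package schema

// Assumed one-byte prefixes, for illustration only.
var (
	trieNodeAccountPrefix = []byte("A")
	trieNodeStoragePrefix = []byte("O")
	snapshotAccountPrefix = []byte("a")
)

// accountTrieNodeKey builds trieNodeAccountPrefix + hexPath. Nearby trie nodes
// share path prefixes, so writes during block execution repeatedly overwrite
// keys within the same narrow ranges.
func accountTrieNodeKey(hexPath []byte) []byte {
	return append(append([]byte{}, trieNodeAccountPrefix...), hexPath...)
}

// storageTrieNodeKey builds trieNodeStoragePrefix + accountHash + hexPath.
func storageTrieNodeKey(accountHash [32]byte, hexPath []byte) []byte {
	key := append(append([]byte{}, trieNodeStoragePrefix...), accountHash[:]...)
	return append(key, hexPath...)
}

// snapshotAccountKey builds SnapshotAccountPrefix + accountHash. Hash-keyed data
// is spread uniformly across the key space, so mixing it with path-ordered trie
// nodes in one LSM tree forces compactions to rewrite far more data.
func snapshotAccountKey(accountHash [32]byte) []byte {
	return append(append([]byte{}, snapshotAccountPrefix...), accountHash[:]...)
}
```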

4.3 Original Database

After separating the block and trie data, the remaining data is stored in the original database, containing the account/storage snapshots, the transaction index, and contract code. With less data, the depth of the LSM tree is reduced, so the read/write performance of the original database improves.

It is worth mentioning that snapshot data is accessed frequently during block execution, so reducing its read latency contributes significantly to overall execution performance.

5. Compatibility

There are no compatibility issues and no hard fork is required, but the node needs to be resynced.

6. Test

TBD

7. License

The content is licensed under CC0.

Hi, sharing some test results for the multi-database setup here.

  • Env Spec

    • Machine
      • EC2: m6i.4xlarge
      • CPU: Intel(R) Xeon(R) Platinum 8375C @ 2.90GHz, 16 cores
      • Mem: 64 GB
      • SSD1: GP3, 3000 IOPS, 500 MB/s
      • SSD2: GP3, 3000 IOPS, 500 MB/s
    • Geth v1.3.10 run with PBSS + Pebble
  • Single disk

| | import block | chain_validation | chain_execution | chain commit |
| --- | --- | --- | --- | --- |
| Single database | 278 | 32.1 | 213 | 45.4 |
| Multi database | 260 | 31.2 | 188 | 52.5 |
| Result | 6.5% ↑ | 2.8% ↑ | 11.7% ↑ | 15.6% ↓ |
  • Multi disk

Store the trie data on another disk (SSD2) through a softlink.

Performance when not yet synced to the latest block

| | import block | chain_validation | chain_execution | chain commit |
| --- | --- | --- | --- | --- |
| Single database | 310 | 38.5 | 231 | 50.4 |
| Multi database | 257 | 39.4 | 177 | 58.4 |
| Result | 17.1% ↑ | 2.3% ↓ | 23.4% ↑ | 15.9% ↓ |

Performance when synced to the latest block

| | import block | chain_validation | chain_execution | chain commit |
| --- | --- | --- | --- | --- |
| Single database | 305 | 34.7 | 241 | 33.6 |
| Multi database | 276 | 32.2 | 208 | 39.4 |
| Result | 9.5% ↑ | 6.2% ↑ | 13.7% ↑ | 17.2% ↓ |
