BEP(Idea): Aggregated Path-Based State Scheme

fynn-z · October 18, 2023, 2:52am

1. Summary

This BEP proposes an aggregated path-based state scheme to improve performance and reduce the space for state storage. It aggregates and stores smaller value-size(avg 100+ bytes) trie nodes into larger nodes. It will bring the following benefits:

Reduces the reading amplification by lowering tree depth with aggregated trie node to improve reading performance;
More efficient and friendly for SSD storage with aggregated nodes;
Better data locality for MPT, better trie node prefetching performance with two layer trie nodes stored together;
Less state storage space by minimizing the overhead of the key space of massive trie nodes.

2. Motivation

2.1 Challenges For BSC

Due to the high TPS and traffic, BSC’s storage size is growing rapidly. In the past two years, the state data size has grown from 200GB to 430G, and the significant performance drop for validators still happens when running about 2~3 weeks after offline pruning. As a result, the entire BSC network is becoming unstable, and even the Full/RPC node has been lagging and can not catch up with the latest block. Below we will analyze the problem from two dimensions: performance degradation and disk space.

2.1.1 Performance Degraded

With the increasing scale of the MPT tree, its height may also grow, causing lookup and update operations to require more IOs. This can lead to decreased performance for read and write operations, impacting system responsiveness. In BSC, a multi-tier caching system is employed to optimize the retrieval of state data, effectively leveraging cache mechanisms to enhance contract execution speed. However, the cache misses could potentially lead to significant performance degradation.

From the perspective of performance monitoring, the time consumption of chain validation will increase periodically, often accompanied by cache missing and reading trie node from disk.

For disk reads, the current depth of the MPT tree is about 12 layers, and the lSM tree of LevelDB is around 7 layers, which will cause huge read amplification. From the profile, it may result in a read amplification of about 10x. Of course, it only happens when the cache is missing, but since the whole block execution has a long-tail effect, read amplification must be one of the main causes of performance degradation.

For cache hit rate, as the size of the MPT tree continues to increase, the hit rate of the cache continues to decline, and according to monitoring data, in the past year, the hit rate of the overall TrieCache has been reduced from 88% to about 84%. It is not difficult to analyze that the hit rate of the TrieCache directly affects the time of transaction validation, so increasing the proportion of valid data in the TrieCache to improve the cache hit rate will greatly help performance.

2.1.2 Disk Space Occupied

We analyzed the current world state data(hash-based) for BSC, and from the results, there are close to 4 billion small KVs stored in the database, their key size is 32 bytes, and the average value is about 100 bytes. Specific data are available in the table below.

AccountTrie	ShortNode	FullNode	Total
Count	205372697	67377613
Node Size	key: 32 bytes value: ≈106 bytes	key: 32 bytes value≈145 bytes
Total Size	key: 6.12G value: 20.27G	keys: 2G value: 9.09G	37.48G

Storage Trie	ShortNode	FULLNODECNT	Total
Count	3103435640	1073887225
Node Size	key: 32 value: ≈45 bytes	key: 32 bytes value: ≈142 bytes
Total Size	key: 92G value: 130G	keys: 32G value: 142.02G	396.02G

2.2 Path-Based State Scheme

PBSS efficiently converts MPT storage from hash-based addressing to path-based addressing, and it also involves substantial refactoring of the underlying storage for the entire state database. As you can see from the figure below, PBSS converts the key of the trie node stored in underlying storage from hash to path, and each trie node has a unique PathKey in the MPT tree, so for the underlying storage, the persistence of MPT is doing overwrite, which is why it can do inline prune.

However, both Hash-base state schema storage and Path-base state schema storage have yet to effectively address the following issues.

Huge small key-value pair, serious space amplification.
IO amplification due to tree depth

3. Specification

3.1 Aggregated Path-Base State Scheme

We use the character of trie tree morphology to aggregate and store two or more levels of trie nodes to improve storage efficiency and reduce write amplification

3.2 General Workflow

When the node is written to the diffLayer, it is initially aggregated. The delta records of each AggNode are stored in the diffLayer when it is merged to the disk layer, then applied delta records are added to AggNode and persisted to disk.

3.3 Aggregate Rules

For full node and short node, different aggregation rules will be applied to achieve the aggregation effect

For a full node:
- If the path key consists of an even number of nibbles, aggregate it into the AggNode with the same path key. E.g. 0x00 Full Node ➝ 0x00 AggNode
- If the path key consists of an odd number of nibbles, aggregate it into the AggNode with the path key obtained by removing the last nibble from the node’s path key. E.g. 0x001 Full Node ➝ 0x00 AggNode
For a short node:
- If the path key consists of an odd number of nibbles, aggregate it into the AggNode with the path key obtained by removing the last nibble from the node’s path key. E.g. 0x001 Short Node ➝ 0x00 AggNode
- (Optional) If the path key consists of an even number of nibbles, aggregate it into the AggNode with the path key obtained by removing the last two nibbles from the node’s path key. E.g. 0x0001 Short Node ➝ 0x00 AggNode.

3.4 AggNode

The keys of the AggNode are as below and Its AggPath must consist of an even number of nibbles, E.g. 0x00, 0x00FF, 0xABFD, etc.

3.3.1 Memory layout

The AggNode Structure

type AggPath []byte  // even length

type AggNode struct {
root []byte
childs [16][]byte
}

3.3.2 Persist layout

Trie Type	Key	Value
account trie	AccountTirePrefixKey_AggPath	RLP(AggNode)
storage trie	StorageTriePrefixKey_trieOwner_AggPath	RLP(AggNode)

4. Rationale

As the depth and scale of the MPT tree in BSC continue to grow, the number of TrieNode has reached 5 billion, with an average size of around 100 bytes. This is extremely unfriendly for SSDs and databases, and it also significantly degrades read performance. Therefore, this BEP proposes a scheme to aggregate storage for two layers of TrieNode. This scheme can greatly improve the locality of TrieNode storage and reduce the number of disk-read IO operations. Additionally, by aggregating a large number of small TrieNodes, the data stored in the database can be reduced to a few KB or even a dozen KB. This is relatively friendly for SSDs and databases, and it also saves overhead caused by storing a massive number of keys.

5. Backward Compatibility

This is not a hard fork BEP
The data is not compatible with the PBSS or HBSS. We should generate a new BSC snapshot based on the APBSS.

6. License

The content is licensed under CC0.

Topic		Replies	Views
BEP-365: Support multi-database based on data pattern BEP storage	1	486	March 19, 2024
State Expiry MVP0.1 Release BEP storage , bep	0	466	December 15, 2023
BEP(Idea v0.1) Segmented History Data Maintenance BEP	4	1631	October 9, 2023
Introducing BSC State Expiry Column BEP storage , bep	1	848	August 4, 2023
BEP(Idea): State Expiry On BNB Chain BEP storage	10	3701	August 23, 2023