Parallel Computing has been an area of active research interest and application for decades. Nowadays, Parallel Execution on EVM is emerging as a new paradigm in the blockchain industry.
On Oct 17th, 2022, NodeReal delivered a feature release: Parallel EVM 2.0, based on the latest BNB Chain release v1.1.16. You may refer to the detailed release notes at the node-real repo.
It has been a while since The Implementation of Parallel EVM 1.0. As we noted in that blog, the whole Parallel EVM implementation is planned to be accomplished in several phases:
Parallel execution is believed to greatly improve blockchain performance, and many high-performance Layer 1 blockchains, such as Solana, Avalanche, Aptos, and Sui... claim to support it. However, these blockchains made tradeoffs at the very beginning, sacrificing some user experience to make parallel execution much easier to support. The inconvenience mainly comes from requiring a predeclared access list for each contract interface, which takes away dynamic access ability.
Ethereum and BNB Chain made no such tradeoff, which is why there has been no big progress toward parallel execution on Ethereum or BNB Chain yet.
As we implemented parallel execution on BNB Chain, we encountered many challenges, especially in making it stable and efficient. In this blog, we will walk you through the architecture design and the latest results.
Before digging into the detailed implementation, we would like to briefly share how the project was carried out.
Parallel EVM has been desired for a long time. There was tremendous traffic on the BSC network around Nov 2021, as shown by the following diagram, which finally made us put Parallel EVM on the agenda.
Phase 1.0 was kicked off in mid-December 2021; it took us around 3 months to design, implement, tune, and test.
Phase 2.0 was kicked off at the start of April 2022 and took us around 2 months.
We planned to support parallel validator mode in phase 3.0, but it has not started yet. The current implementation is very complicated; rather than adding more code, it would be better to take a break and rethink it.
This part is a bit technical and requires some engineering background, but it is crucial for understanding the Parallel EVM implementation. If you are not interested in the details, you may skip it.
In short, Parallel EVM improves block processing performance by executing transactions concurrently, as simplified and illustrated by the following diagram.
To be more specific, there are two major components: the Dispatcher and the Slot.
A configured number of slots (threads) are created at process startup; each slot receives transaction requests from the dispatcher and executes them concurrently with the other slots. Transactions dispatched to the same slot are executed in order.
The dispatcher dispatches transactions dynamically and efficiently, mainly aiming for a low conflict rate and a balanced workload.
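To make the dispatcher/slot relationship concrete, here is a minimal Go sketch; the Tx type, the slot function, and the sender-based routing rule are illustrative assumptions, not the actual BSC code:

```go
package main

import "fmt"

// Tx stands in for a transaction to be executed; the real
// implementation carries the full transaction payload.
type Tx struct {
	Index  int
	Sender string
}

// slot consumes transactions from its own channel and executes them
// in order, so transactions dispatched to one slot never race.
func slot(id int, txs <-chan Tx, done chan<- int) {
	for tx := range txs {
		// Execute the transaction against a private state copy here.
		fmt.Printf("slot %d executed tx %d\n", id, tx.Index)
		done <- tx.Index
	}
}

func main() {
	const slotCount = 4 // a configured number of slots (threads)

	slots := make([]chan Tx, slotCount)
	done := make(chan int)
	for i := range slots {
		slots[i] = make(chan Tx, 16)
		go slot(i, slots[i], done)
	}

	txs := []Tx{{0, "alice"}, {1, "bob"}, {2, "alice"}, {3, "carol"}}

	// The dispatcher routes transactions with the same sender to the
	// same slot, so potential conflicts stay serialized within a slot.
	for _, tx := range txs {
		slotID := int(tx.Sender[0]) % slotCount
		slots[slotID] <- tx
	}

	for range txs {
		<-done
	}
}
```

Routing by a per-transaction key keeps the ordering guarantee local to each slot while letting unrelated transactions run in parallel.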
The following diagram shows the general workflow. If you are interested and want to go deeper, it would be best to read the code directly.
Here are the major components of Parallel EVM. The fundamental components, gray-marked in the diagram, were implemented in phase 1.0. In phase 2.0, we added new components such as the Streaming Pipeline, Unconfirmed Access, Memory Pool, and Balance Makeup, and we redesigned components such as the dispatcher and the conflict detector to improve performance.
Transaction execution is split into 2 stages: the Execution Stage and the Finalize Stage.
The dispatcher is responsible for transaction preparation, dispatch, and result merging. The dispatch policy can impact execution performance; we try to dispatch potentially conflicting transactions to the same slot to reduce the conflict rate.
In Dispatcher 2.0, we removed the dispatch action itself, so the dispatch channel IPC is no longer needed. Dispatcher 2.0 has 2 parts: static dispatch and dynamic dispatch.
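As a rough sketch of what the static part could look like, the following groups a block's transactions by a conflict key before execution; using the recipient address as the key is an illustrative assumption, and the real dispatch policy is more elaborate:

```go
package main

import "fmt"

// Tx is a simplified transaction with a conflict key (the recipient).
type Tx struct {
	Index int
	To    string
}

// staticDispatch pre-assigns transactions to slot queues before
// execution: transactions sharing a conflict key go to the same
// queue, so likely conflicts are serialized. Balancing the remaining
// load is left to dynamic dispatch, where idle slots pick up pending
// work at runtime.
func staticDispatch(txs []Tx, slotCount int) [][]Tx {
	queues := make([][]Tx, slotCount)
	assigned := map[string]int{} // conflict key -> slot
	next := 0
	for _, tx := range txs {
		slot, ok := assigned[tx.To]
		if !ok {
			slot = next % slotCount // round-robin for unseen keys
			assigned[tx.To] = slot
			next++
		}
		queues[slot] = append(queues[slot], tx)
	}
	return queues
}

func main() {
	txs := []Tx{{0, "dex"}, {1, "nft"}, {2, "dex"}, {3, "bridge"}}
	for i, q := range staticDispatch(txs, 2) {
		fmt.Println("slot", i, "queue:", q)
	}
}
```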
Unconfirmed state access is a bit complicated: StateObjects are accessed in descending priority order: Self Dirty ➝ Unconfirmed Dirty ➝ Base StateDB (dirty, pending, snapshot, and trie tree).
In a word, the principle is to try our best to get the most reliable state information.
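A minimal sketch of this priority order, with simplified stand-in types (the real StateDB layers are far richer than plain maps):

```go
package main

import "fmt"

// Address and Account are simplified stand-ins for the real types.
type Address string
type Account struct{ Balance uint64 }

// ParallelStateDB sketches the unconfirmed-access lookup order. All
// field names are illustrative, not the actual BSC structures.
type ParallelStateDB struct {
	selfDirty        map[Address]*Account // written by this slot's tx
	unconfirmedDirty map[Address]*Account // results of unconfirmed txs
	base             map[Address]*Account // confirmed base StateDB
}

// getAccount resolves state in descending priority:
// Self Dirty -> Unconfirmed Dirty -> Base StateDB.
func (s *ParallelStateDB) getAccount(addr Address) *Account {
	if acc, ok := s.selfDirty[addr]; ok {
		return acc // most reliable: our own pending writes
	}
	if acc, ok := s.unconfirmedDirty[addr]; ok {
		return acc // speculative: may be invalidated by a conflict
	}
	return s.base[addr] // confirmed world state
}

func main() {
	db := &ParallelStateDB{
		selfDirty:        map[Address]*Account{},
		unconfirmedDirty: map[Address]*Account{"a": {Balance: 7}},
		base:             map[Address]*Account{"a": {Balance: 5}},
	}
	fmt.Println(db.getAccount("a").Balance) // 7: unconfirmed wins over base
}
```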
On both BNB Chain and Ethereum, transactions are executed sequentially, each depending on the results of the previous one, since they operate on a shared world state.
The conflict detector checks whether a parallel execution result is valid. If it is not, the transaction is rescheduled as a redo.
In phase 1.0, we used a DirtyRead policy for conflict detection: if any state (balance, nonce, code, KV storage...) read by the transaction had been changed (dirty), we marked the transaction as conflicted. These conflicted transactions are scheduled for a redo with the latest world state.
In Parallel 2.0, we do conflict detection based on reads. We no longer care what has been changed; the only thing we need to check is whether what we read is correct. We keep a detailed read record and compare it with the base StateDB at conflict detection time. This is more straightforward and accurate.
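A simplified sketch of read-record-based conflict detection; the flat string keys and uint64 values are illustrative assumptions:

```go
package main

import "fmt"

// readRecord captures exactly what a transaction read during its
// parallel execution; keys and value types are simplified here.
type readRecord map[string]uint64

// hasConflict re-checks every recorded read against the base StateDB
// after all earlier transactions have merged. If any value we read
// differs from the now-confirmed state, our execution was based on
// stale data and the transaction must be redone.
func hasConflict(reads readRecord, base map[string]uint64) bool {
	for key, seen := range reads {
		if base[key] != seen {
			return true // we read a value that is no longer correct
		}
	}
	return false
}

func main() {
	reads := readRecord{"balance:alice": 100, "nonce:alice": 3}
	base := map[string]uint64{"balance:alice": 90, "nonce:alice": 3}
	fmt.Println(hasConflict(reads, base)) // true: balance changed under us
}
```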
To make conflict detection more efficient, we introduced two new concepts:
The merge component merges confirmed results into the base StateDB. It is executed within the dispatcher routine since it needs to access the base StateDB. To stay concurrency-safe, we limit base StateDB access to the dispatcher only; the parallel execution slots cannot access the base StateDB directly.
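A small sketch of this confinement pattern, assuming simplified result and state types: only the dispatcher goroutine ever mutates the base state, so confinement replaces locking:

```go
package main

import "fmt"

// result carries a confirmed transaction's dirty state back to the
// dispatcher for merging.
type result struct {
	txIndex int
	dirty   map[string]uint64
}

func main() {
	base := map[string]uint64{"balance:alice": 100}
	results := make(chan result, 2)

	// A slot finishes a transaction and hands its dirty state over;
	// it never writes to base directly.
	go func() {
		results <- result{0, map[string]uint64{"balance:alice": 90}}
		close(results)
	}()

	// Only this dispatcher goroutine touches base, so no lock is
	// needed on the base StateDB.
	for r := range results {
		for k, v := range r.dirty {
			base[k] = v
		}
		fmt.Println("merged tx", r.txIndex, "-> base:", base)
	}
}
```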
In phase 1.0, we allocated a new StateDB for each transaction as a snapshot copy of the base StateDB. According to our memory analysis, each copy operation could take ~62KB of memory. This memory is allocated per transaction and discarded when the transaction finishes; these short-lived objects trigger GC frequently, which has a big impact on performance.
To avoid allocating objects so frequently, we use a memory pool to manage them. They are recycled asynchronously while the block is committing.
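The actual implementation recycles asynchronously at block commit, but Go's sync.Pool illustrates the pooling idea in miniature; stateEnv is a hypothetical stand-in for the per-transaction state wrapper:

```go
package main

import (
	"fmt"
	"sync"
)

// stateEnv stands in for the per-transaction state wrapper that was
// previously allocated fresh (~62KB) for every transaction.
type stateEnv struct {
	dirty map[string]uint64
}

// envPool recycles stateEnv objects so short-lived per-transaction
// allocations stop churning the garbage collector.
var envPool = sync.Pool{
	New: func() any { return &stateEnv{dirty: make(map[string]uint64)} },
}

func executeTx(txIndex int) {
	env := envPool.Get().(*stateEnv)
	// Reset before reuse: a pooled object keeps its old contents.
	for k := range env.dirty {
		delete(env.dirty, k)
	}
	env.dirty["touched"] = uint64(txIndex)
	fmt.Println("tx", txIndex, "done")
	// Return the object instead of letting the GC reclaim it.
	envPool.Put(env)
}

func main() {
	for i := 0; i < 3; i++ {
		executeTx(i)
	}
}
```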
Parallel 1.0 used DeepCopy to copy state objects, which is CPU- and memory-consuming when the storage contains lots of KV elements. We replaced it with LightCopy in phase 2.0 to avoid redundant memory copies of StateObjects. With LightCopy, we do not copy any of the storage at all. This is actually a must once we use Unconfirmed State Access: since storage may be accessed from different unconfirmed StateDBs, we cannot simply copy all the KV elements of a single state object.
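A simplified contrast between the two copy strategies; stateObject here is a toy stand-in for the real structure:

```go
package main

import "fmt"

// stateObject is a simplified account state with KV storage.
type stateObject struct {
	balance uint64
	storage map[string]string
}

// deepCopy (phase 1.0) duplicates every KV element, which is CPU and
// memory heavy for contracts with large storage.
func deepCopy(o *stateObject) *stateObject {
	cp := &stateObject{balance: o.balance, storage: make(map[string]string, len(o.storage))}
	for k, v := range o.storage {
		cp.storage[k] = v
	}
	return cp
}

// lightCopy (phase 2.0) copies only scalar fields and starts with an
// empty storage map; storage slots are then fetched on demand through
// the unconfirmed-access path instead of being copied up front.
func lightCopy(o *stateObject) *stateObject {
	return &stateObject{balance: o.balance, storage: make(map[string]string)}
}

func main() {
	src := &stateObject{balance: 10, storage: map[string]string{"k1": "v1", "k2": "v2"}}
	fmt.Println(len(deepCopy(src).storage), len(lightCopy(src).storage)) // 2 0
}
```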
Looking at the lifecycle of transaction execution helps to better understand how these components work together.
In general, it could be divided into 3 stages:
Transaction execution differs a bit between phase 1.0 and phase 2.0, since we introduced more components. The main differences are:
It is more straightforward to show the whole picture in a pipeline view:
Take a concurrency of 8 as an example for a brief view.
In phase 1.0, it is a simple pipeline; there can be many wait periods (orange background) to synchronize the parallel execution state.
In phase 2.0, we introduced the Streaming Pipeline, which greatly reduces the waiting periods and makes the pipeline more efficient.
We can reveal more detail here on the streaming pipeline.
The ConflictDetect, Finalize, and Merge operations are all done within the dispatch thread. Each parallel execution slot also has a shadow slot: a backup slot doing exactly the same job as the primary slot. The shadow slot is used to make sure a redo can be scheduled as soon as possible.
When the execution stage is completed, a transaction does not need to wait for its predecessor's final result: the slot can queue its result and move on to execute the next transaction in the pending queue.
That is why we call it a streaming pipeline: transactions can be executed continuously, with no more waiting.
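A minimal sketch of the streaming idea: with a buffered result queue, the slot never blocks on confirmation of earlier transactions. The channel sizes and goroutine roles are illustrative assumptions:

```go
package main

import "fmt"

func main() {
	const txCount = 5

	// Buffered channel: the slot can queue a result and immediately
	// move on to the next transaction instead of blocking until the
	// dispatcher has confirmed its predecessors.
	results := make(chan int, txCount)
	done := make(chan struct{})

	// Execution slot: streams through its pending queue.
	go func() {
		for i := 0; i < txCount; i++ {
			results <- i // queue the result, no waiting
		}
		close(results)
	}()

	// Dispatcher thread: conflict detection, finalize, and merge all
	// happen here, consuming results in order as they arrive.
	go func() {
		for idx := range results {
			fmt.Println("confirmed tx", idx)
		}
		close(done)
	}()

	<-done
}
```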
We set up several devices to run compatibility tests of full sync; almost all the blocks on the BNB Chain testnet and mainnet are covered.
We set up 2 instances to test the performance benefit, with a parallel number of 8 and pipe commit enabled.
The 2 instances use the same hardware configuration: 16 cores, 64GB memory, and a 7TB SSD.
The test ran for ~50 hours. The total block processing cost (execution, validation, commit) was reduced by ~20% to ~50%; the benefit varies for different block patterns.
We have rebased the code onto the latest BNB Chain release v1.1.16; you may give it a try if you are interested.
For detailed usage, you may refer to: https://github.com/node-real/bsc/releases/tag/v1.1.16+Parallel-2.0
Please be aware that:
About NodeReal
NodeReal is a one-stop blockchain infrastructure and service provider that embraces the high-speed blockchain era and empowers developers to "Make your Web3 Real". We provide scalable, reliable, and efficient blockchain solutions for everyone, aiming to support the adoption, growth, and long-term success of the Web3 ecosystem.
Join our community to learn more about NodeReal and stay up to date!