MIMD 0007 MagicBlock Ledger Redesign #363

bmuddha · 2025-05-12T15:14:45Z

bmuddha
May 12, 2025
Maintainer

MIMD-0007: MagicBlock Ledger Redesign

Abstract

The current MagicBlock validator ledger implementation is inherited from the
Solana codebase. While robust, this implementation is overengineered for
MagicBlock’s operational model, resulting in code bloat, suboptimal
performance—especially in hot write paths—and maintenance challenges due to
unused or irrelevant code. This MagicBlock Improvement Document (MIMD) proposes
a radical redesign of the blockstore, leveraging the non-forking, append-only
nature of the MagicBlock ledger. The new design aims to maximize throughput,
scalability, and maintainability by introducing a custom storage engine,
optimized data formats, and efficient indexing strategies.

1. Motivation

The inherited Solana ledger implementation is tailored for a different
consensus and operational paradigm, where forking and random access are common.
MagicBlock, by contrast, operates as an ephemeral rollup with strictly
append-only, non-forking semantics. This mismatch leads to several
inefficiencies and limitations:

- Wasted resources: Features like fork handling, random writes, and complex
  compaction are unnecessary and unused.
- Performance bottlenecks: Write throughput is limited by the underlying
  storage engine and serialization overhead, especially in single-threaded hot
  paths.
- Maintenance burden: A large codebase with irrelevant logic increases
  cognitive load and risk of bugs, making future improvements harder.
- No multi-process support: RocksDB does not allow multiple processes to
  access the same database concurrently, making it unusable with standalone
  services like RPC or analytics.

A purpose-built ledger can address these issues, unlocking higher throughput,
simpler operations, and a more maintainable codebase.

2. Current Implementation Overview

2.1 Storage Engine

Engine: RocksDB (Log-Structured Merge-tree)
Strengths: High-performance for random, write-heavy workloads.
Drawbacks:
- LSM compaction introduces unpredictable write stalls, causing latency spikes.
- Most features (random inserts, data modifications, fork support) are unused in MagicBlock.
- Write amplification and CPU contention during compaction reduce effective throughput.
- Overhead from generic design not tailored to append-only workloads.

2.2 Storage Format

Multiple column families are used, each with custom key/value formats:

Column Family	Key Format	Value Format
Transaction Status	Signature + Slot	Protobuf TransactionStatusMeta
Address Signatures	Pubkey + Slot + Txn Index + Sig	Bincode AddressSignatureMeta
Slot Signatures	Slot + Txn Index	Signature
Blocktime	Slot	Unix timestamp
Blockhash	Slot	Blockhash
Transaction	Signature + Slot	Protobuf SanitizedTransaction
Transaction Memos	Signature + Slot	Client string
Perf Samples	Slot	Bincode PerfSample
Account Mod Datas	Surrogate ID (u64)	Bincode Account Mod Data

Serialization: Protobuf and bincode are used, both relatively expensive in CPU cycles, especially for write-heavy, single-threaded workloads. This serialization overhead is a significant contributor to latency and CPU usage.

2.3 Write Behavior

Single-threaded writer: Listens to geyser events, serializes, and writes to all relevant column families.
Throughput bottleneck: Each transaction triggers 5–10 separate writes, capping throughput at ~300K ops/sec (~30K tx/sec).
Potential for improvement: While RocksDB tuning is possible, it would require significant engineering effort and would not address the fundamental mismatch with MagicBlock’s append-only, non-forking model.
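The two throughput figures above are related by simple division. As a minimal sketch, assuming the upper bound of 10 writes per transaction (the function name is illustrative, not part of the codebase):

```rust
/// Transactions per second that a storage engine can sustain when every
/// transaction fans out into `writes_per_tx` separate key-value writes.
/// Hypothetical helper, used only to illustrate the arithmetic.
fn tx_per_sec(engine_ops_per_sec: u64, writes_per_tx: u64) -> u64 {
    engine_ops_per_sec / writes_per_tx
}
```

With roughly 300K ops/sec and 10 writes per transaction this gives the ~30K tx/sec ceiling quoted above; fewer writes per transaction raise the ceiling proportionally.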

3. Proposed Design

3.1 Custom Storage Engine

3.1.1 Overview

Abandon RocksDB in favor of a custom, highly-specialized storage solution
designed specifically for append-only, non-forking workloads.
Core concept: The ledger is divided into superblocks, each representing
a contiguous range of blocks. This segmentation allows for efficient
management, rotation, and truncation of ledger data.
Superblock: A collection of files (a primary flat file along with index
files) storing serialized blocks, transactions, and metadata in append-only,
interleaved fashion. Each superblock is self-contained, a mini-ledger in
essence, and can be managed independently.

3.1.2 Superblock Structure

Flat file: Memory-mapped, sequentially written. Contains:
- Block delimiters to mark the start of each block.
- Transactions and status metadata, serialized in a compact, high-performance format.
Indexes: Maintained separately using mdbx, a modern, high-performance, LMDB-like key-value store. MDBX is chosen for its superior concurrency, performance, and reliability compared to RocksDB for this use case.
Rotation: When a superblock reaches a size or age threshold, a new one is started. Oldest superblocks are deleted wholesale (when reaching disk quota) for efficient truncation. This rotation mechanism ensures that the ledger remains within configured storage limits and that truncation is a fast, atomic operation.
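The rotation and truncation rules described above can be sketched with an in-memory model. This is only an illustration under stated assumptions: the `Superblock` and `Ledger` types, the size-only rotation threshold, and the quota check are hypothetical, and real superblocks would be directories of files rather than counters.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a superblock: only its ID and size are tracked.
#[derive(Debug)]
struct Superblock {
    id: u64,
    bytes_used: u64,
}

/// Hypothetical ledger that rotates superblocks by size and truncates by quota.
struct Ledger {
    superblocks: VecDeque<Superblock>, // oldest at the front, active one at the back
    max_superblock_bytes: u64,         // assumed rotation threshold
    disk_quota_bytes: u64,             // assumed total disk quota
    next_id: u64,
}

impl Ledger {
    fn new(max_superblock_bytes: u64, disk_quota_bytes: u64) -> Self {
        let mut superblocks = VecDeque::new();
        superblocks.push_back(Superblock { id: 0, bytes_used: 0 });
        Ledger { superblocks, max_superblock_bytes, disk_quota_bytes, next_id: 1 }
    }

    fn total_bytes(&self) -> u64 {
        self.superblocks.iter().map(|s| s.bytes_used).sum()
    }

    /// Appends `len` bytes, rotating first if the active superblock would overflow.
    /// Returns the ID of the superblock that received the write.
    fn append(&mut self, len: u64) -> u64 {
        let active_full = self
            .superblocks
            .back()
            .map_or(true, |s| s.bytes_used + len > self.max_superblock_bytes);
        if active_full {
            // Rotation: start a fresh superblock at the back.
            self.superblocks.push_back(Superblock { id: self.next_id, bytes_used: 0 });
            self.next_id += 1;
        }
        let active = self.superblocks.back_mut().expect("active superblock exists");
        active.bytes_used += len;
        let id = active.id;
        // Truncation: drop whole superblocks from the front until under quota,
        // but never the active one.
        while self.total_bytes() > self.disk_quota_bytes && self.superblocks.len() > 1 {
            self.superblocks.pop_front();
        }
        id
    }
}
```

In this model truncation never touches individual transactions: the oldest unit is dropped as a whole, which corresponds to deleting a superblock's files on disk.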

3.1.3 Advantages

Write throughput: Sequential, append-only writes maximize disk and OS cache efficiency, minimizing random I/O and write amplification.
Read efficiency: Indexes allow fast access to recent data; hot (latest) superblocks will be memory-resident due to mmap, ensuring low-latency reads for the most active data.
Truncation: Deleting old data is a cheap filesystem operation, avoiding expensive per-transaction cleanup and compaction.
Multi-process access: Both flat files and MDBX indexes support concurrent access, enabling real-time RPC, analytics, and backup services to share the same underlying database without contention or locking issues.
Simplicity: The design is significantly simpler, reducing the maintenance burden and making future improvements easier.

3.1.4 Account Data Modifications

Separation of concerns: Account data modifications are stored in a dedicated index within the superblock, decoupled from transaction data for clarity and performance. This ensures that account state changes are efficiently queryable and do not interfere with transaction processing.

3.2 Storage Format

Serialization: Adopt rkyv for all block and transaction data.
- Rationale:
  - 3x faster serialization than protobuf, reducing CPU usage and latency.
  - Zero-copy deserialization for rapid reads, massively improving RPC performance and RPS handling capabilities.
  - Lower CPU overhead, especially under high write concurrency, enabling higher throughput.
- Implementation note: Proxy types may be required for Solana API compatibility; future API changes may further optimize for rkyv.
Data format: The main unit of data is an rkyv-serialized bundle of SanitizedTransaction and TransactionStatusMeta. These bundles are written sequentially at preconfigured alignments, and a sequence of these is delimited by block data, allowing efficient block and transaction retrieval.
Size limit: Every single flat file in a superblock will have a limited size, which will trigger rollover to a new superblock once filled up. This ensures predictable storage usage and efficient management.
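To make the "sequential writes at preconfigured alignments, delimited by block data" idea concrete, here is a minimal sketch of an aligned, append-only record layout. Everything specific in it is an assumption for illustration: the 16-byte alignment, the tag values, the header layout, and the use of a `Vec<u8>` in place of a memory-mapped file. The payloads are opaque bytes here; in the proposal they would be rkyv-serialized bundles.

```rust
/// Assumed alignment for every record; the proposal only says "preconfigured".
const ALIGN: usize = 16;

/// Hypothetical record tags: a block delimiter and a transaction bundle.
const TAG_BLOCK: u8 = 0xB1;
const TAG_TX: u8 = 0x7A;

/// Rounds `n` up to the next multiple of `ALIGN` (which must be a power of two).
fn align_up(n: usize) -> usize {
    (n + ALIGN - 1) & !(ALIGN - 1)
}

/// Appends one record laid out as [header: ALIGN bytes][payload][zero padding],
/// where header[0] is the tag and header[1..5] is the payload length (u32, LE).
/// Returns the offset at which the record starts; this is what an index would store.
fn append_record(file: &mut Vec<u8>, tag: u8, payload: &[u8]) -> usize {
    let offset = file.len();
    let mut header = [0u8; ALIGN];
    header[0] = tag;
    header[1..5].copy_from_slice(&(payload.len() as u32).to_le_bytes());
    file.extend_from_slice(&header);
    file.extend_from_slice(payload);
    // Pad so that the next record also starts on an aligned offset.
    file.resize(align_up(file.len()), 0);
    offset
}

/// Reads the record starting at `offset`, returning its tag and payload slice.
fn read_record(file: &[u8], offset: usize) -> Option<(u8, &[u8])> {
    let header = file.get(offset..offset + ALIGN)?;
    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(&header[1..5]);
    let len = u32::from_le_bytes(len_bytes) as usize;
    let start = offset + ALIGN;
    let payload = file.get(start..start + len)?;
    Some((header[0], payload))
}
```

Because the header occupies a full alignment unit, each payload also begins on an aligned offset, which is the property zero-copy access to archived data generally depends on. Scanning a block amounts to reading records from a block delimiter until the next one.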

3.3 Index Structures

3.3.1 Block Index

Purpose: Map block number/slot to offset in the flat file.
Usage: Enables direct access to block data and all contained transactions via a single seek, supporting efficient block retrieval and scanning.
Key: The key is a slot number (u64).
Value: Block info, including hash, blocktime, and other relevant metadata.

3.3.2 Transaction Index

Purpose: Map transaction signature to offset in the flat file.
Usage: Supports fast lookup for transaction status and metadata, enabling efficient transaction queries by signature.
Key: The key is the transaction signature ([u8; 64]).
Value: Offset into the flat file where the transaction and its status meta are rkyv-serialized.
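The block and transaction indexes are both plain key-to-offset maps. The following sketch models them with standard library maps as stand-ins for MDBX tables; the `Indexes` and `BlockInfo` types and their field names are hypothetical, chosen only to mirror the keys and values listed above.

```rust
use std::collections::{BTreeMap, HashMap};

type Signature = [u8; 64];

/// Hypothetical block index value: flat-file offset plus block metadata.
#[derive(Debug, Clone, PartialEq)]
struct BlockInfo {
    offset: u64,         // where the block delimiter sits in the flat file
    blockhash: [u8; 32],
    blocktime: i64,      // Unix timestamp
}

/// In-memory stand-ins for the two indexes (MDBX tables in the proposal).
#[derive(Default)]
struct Indexes {
    blocks: BTreeMap<u64, BlockInfo>,      // slot -> block info
    transactions: HashMap<Signature, u64>, // signature -> flat-file offset
}

impl Indexes {
    fn insert_block(&mut self, slot: u64, info: BlockInfo) {
        self.blocks.insert(slot, info);
    }

    fn insert_transaction(&mut self, signature: Signature, offset: u64) {
        self.transactions.insert(signature, offset);
    }

    /// Offset of the block delimiter for `slot`: a single seek target.
    fn block_offset(&self, slot: u64) -> Option<u64> {
        self.blocks.get(&slot).map(|info| info.offset)
    }

    /// Offset of the serialized transaction bundle for `signature`.
    fn transaction_offset(&self, signature: &Signature) -> Option<u64> {
        self.transactions.get(signature).copied()
    }
}
```

A lookup by slot or by signature therefore costs one index read followed by one read at the returned offset in the flat file.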

3.3.3 Address Signatures Index

Purpose: Map account public key to a list of offsets for all transactions involving that account.
Usage: Enables explorer and RPC queries for account activity, supporting efficient retrieval of all transactions for a given account.
Key: The account's Pubkey, stored in a DUPSORT table (which allows multiple sorted values per key).
Value: Offset into the flat file where the transaction data is
rkyv-serialized, along with a boolean flag indicating whether the account was
writable in that transaction.
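The multiple-values-per-key behavior of this index can be emulated with a map of sorted sets. This is a sketch, not the MDBX API: `AddressSignatures`, `AddressEntry`, and the `recent` query are hypothetical names, and the sort order (by offset, then flag) is an assumption that happens to match chronological order in an append-only file.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Pubkey = [u8; 32];

/// Value stored per (account, transaction) pair: flat-file offset plus writable flag.
/// Derived ordering compares `offset` first, so entries sort oldest to newest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct AddressEntry {
    offset: u64,
    writable: bool,
}

/// In-memory stand-in for a duplicate-sorted table: one key, many sorted values.
#[derive(Default)]
struct AddressSignatures {
    by_account: BTreeMap<Pubkey, BTreeSet<AddressEntry>>,
}

impl AddressSignatures {
    fn insert(&mut self, account: Pubkey, offset: u64, writable: bool) {
        self.by_account
            .entry(account)
            .or_default()
            .insert(AddressEntry { offset, writable });
    }

    /// Returns up to `limit` entries for `account`, newest (highest offset) first.
    fn recent(&self, account: &Pubkey, limit: usize) -> Vec<AddressEntry> {
        self.by_account
            .get(account)
            .map(|set| set.iter().rev().take(limit).copied().collect())
            .unwrap_or_default()
    }
}
```

Iterating the values in reverse order yields the most recent activity first, which is the access pattern explorers typically need for account history.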

3.4 High Level Diagram

4. Operational Considerations

4.1 Hot/Cold Data Management

Hot superblock(s): Recent superblocks are kept in memory (naturally due to mmap) for low-latency access. This ensures that the most frequently accessed data is always available with minimal I/O overhead.
Cold superblocks: Older superblocks stay on disk and are paged in through mmap only when accessed; they are rarely read except for historical queries, minimizing their impact on performance.

4.2 Truncation and Archival

Truncation: Oldest superblocks are deleted as a unit when disk quota is reached. This operation is fast and atomic, avoiding the need for expensive per-transaction cleanup.
Archival: Optional offloading of superblocks to remote or cold storage for long-term retention, supporting compliance and audit requirements without impacting active ledger performance.

4.3 Multi-process and Service Access

Concurrency: Both flat files and MDBX indexes support safe concurrent access, enabling multiple services (e.g., RPC, analytics, backup) to operate on the live ledger state without contention.
Use cases: Enables real-time RPC, analytics, and backup services to operate on live ledger state, supporting a wide range of operational and monitoring requirements.

5. Open Questions & Future Work

Index update strategies: How to efficiently update indexes in the
presence of high write concurrency? As MDBX has a single writer model, in
order to increase write efficiency and decrease transaction creation
overhead, multiple (potentially thousands under high load) inserts can be
batched together from a single inserter thread. The Address signatures
index writes can be deferred to another thread,
as this index is used primarily for the convenience of explorers and thus
doesn't require atomicity with other writes. To avoid lock contention with
the Transaction index, this index can be moved to a
separate MDBX environment.
Crash recovery: What mechanisms are needed to ensure atomicity and
durability? After each block insert, a system-wide flush will be issued, thus
creating a checkpoint for later ledger replay in case a crash occurs. This
ensures that the ledger can always be recovered to a consistent state.
Expected throughput in terms of TPS: Fine-tuned MDBX can handle 300-400K
inserts per second. As we only have one primary index (Transactions
index), this number directly translates to TPS, as block
inserts should be relatively infrequent in comparison, and writes to Address
signatures index can happen in deferred fashion
in the background. Thus, achieving around 100K TPS should be well within the
realm of possibilities.
Migration plan: A detailed migration plan from the current RocksDB-based
ledger to the new superblock-based design will be required, including data
conversion, downtime minimization, and rollback strategies.
Testing and benchmarking: Comprehensive performance and correctness
benchmarks must be defined and executed to validate the new implementation
under realistic workloads.
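The batching idea from the index update item above (many inserts folded into one write transaction by a single inserter thread) could look roughly like the sketch below. It is an assumption-laden illustration: a `HashMap` stands in for the MDBX transaction index, one "commit" is modeled as a counter increment, and `run_index_writer` and `max_batch` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;

type Signature = [u8; 64];

/// Drains the channel in batches and applies each batch in one simulated commit.
/// Returns the resulting index and the number of commits that were needed.
fn run_index_writer(
    rx: Receiver<(Signature, u64)>,
    max_batch: usize,
) -> (HashMap<Signature, u64>, usize) {
    let mut index = HashMap::new();
    let mut commits = 0;
    // Block until at least one entry arrives; exit once all senders are dropped.
    while let Ok(first) = rx.recv() {
        let mut batch = vec![first];
        // Opportunistically pull whatever else is already queued, up to the cap.
        while batch.len() < max_batch {
            match rx.try_recv() {
                Ok(entry) => batch.push(entry),
                Err(_) => break, // queue empty or disconnected
            }
        }
        // One simulated write transaction per batch instead of one per insert.
        index.extend(batch);
        commits += 1;
    }
    (index, commits)
}
```

In a running validator the sending half of the channel would be held by the thread that appends to the flat file, and the writer would run on its own thread. Under light load batches shrink toward a single entry, while under heavy load they grow toward `max_batch`, amortizing the per-transaction overhead.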

6. Conclusion

This proposal outlines a comprehensive redesign of the MagicBlock ledger,
tailored to its unique operational model. By leveraging append-only,
non-forking semantics, a custom storage engine, and faster and cheaper
serialization, the new design promises significant improvements in throughput,
scalability, and maintainability. The proposed architecture not only addresses
the current bottlenecks and inefficiencies but also lays a solid foundation for
future enhancements, operational flexibility, and ease of integration with
external services.

Dodecahedr0x · 2025-05-12T19:29:43Z

Dodecahedr0x
May 12, 2025
Maintainer

Impressive proposal! Here is my feedback, along with some questions:

Abandon RocksDB in favor of a custom, highly-specialized storage solution designed specifically for append-only, non-forking workloads.

If I'm not mistaken, the non-forking assumption comes from the assumption that the security mechanism will not use optimistic proofs. This could work if real-time zero-knowledge proofs were possible, but that is still far from a reality. Are we sure this assumption will hold with future security mechanisms? (e.g., rollbacks might happen with optimistic proofs)

3.1.2 Superblock Structure

Do superblocks contain some form of snapshot of the state at the start of the block? That would really make them self-contained and could help the proving of most security mechanisms.

MDBX focuses on the local-only setting. Wouldn't it make sense to prepare for a case where a single operator runs a distributed cluster of validators sharing a private key and writing to the same storage?

Serialization: Adopt rkyv for all block and transaction data.

Why rkyv over bytemuck? Bytemuck has more recent updates, is less verbose (rkyv seems to use a new proc-macro), but has strict alignment enforcement.

Index update strategies: How to efficiently update indexes in the presence of high write concurrency?

Distributed storage would come with multiple writers, which could help resolve contention between these separate parts of the data. Moreover, distributed DBs often claim to have the highest throughput (although I'm not sure about their consistency properties).

After each block insert, a system-wide flush will be issued, thus creating a checkpoint for later ledger replay in case a crash occurs

I don't understand this: Where is the data flushed? Does that imply flushing pending transactions that are not processed yet? What's the connection between the flush and the checkpoint?

1 reply

bmuddha May 13, 2025
Maintainer Author

Are we sure this assumption will hold with future security mechanisms?

Well, no, the assumption might be invalidated in the future. But the design of the ledger allows us to roll back to previous states if necessary. It will just be slower, and since such scenarios should rarely happen (if ever), it's a good compromise that optimizes for the common case.

Do superblocks contain some form of snapshot of the state at the start of the block?

We can tightly couple each superblock with an accountsdb snapshot, so you can start replaying any superblock on top of the state that existed when the superblock began.

MDBX focuses on the local-only setting. Wouldn't it make sense to prepare for a case where a single operator runs a distributed cluster of validators sharing a private key and writing to the same storage?

We have to use embedded storage for several reasons:

- Full-fledged database solutions aren't easy to set up (especially in cluster configurations); this adds maintenance burden and complicates onboarding.
- Clusters aren't cheap: they can be several times more expensive to run than the ER node, whereas with embedded solutions we pay nothing.
- Even with a cluster running in the same datacenter, latencies can be orders of magnitude higher.

Why rkyv over bytemuck?

bytemuck only supports PODs, and we need to serialize variable size data structures.

I don't understand this: Where is the data flushed?

We make extensive use of memory mappings, which means that freshly written data is memory-resident and not yet persisted to disk. Flushes write all the changes from RAM to persistent storage.

What's the connection between the flush and the checkpoint?

Even if the validator crashes after the flush, we have a consistent state on disk, which acts as a checkpoint for a restart.
