Description
Hello! Thank you for publishing Dhara.
I have been playing with it and wanted to share some performance numbers. I am using a NAND flash chip on an embedded system. When I use my raw NAND driver, I get the following numbers writing and reading full pages (this output is from my UART shell):
```
# nand_write_read
..............................
Write 3840.0 KB at 986.988 KB/s
..............................
Read 3840.0 KB at 1998.049 KB/s
Done. Marked 0 this run, 0 total
#
```
However, when I run a Dhara write/read loop on a freshly erased NAND (with a brand new map/journal), I get these numbers:
```
# dhara_write_read
................................
Write 4000.0 KB at 620.230 KB/s
................................
Read 4000.0 KB at 271.348 KB/s
#
```
I was surprised that reads are slower than writes: Dhara sector reads are 86% slower than raw NAND page reads, while Dhara sector writes are only 37% slower than raw NAND page writes.
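(Working from the numbers above: reads are 1 − 271.348/1998.049 ≈ 86% slower, and writes are 1 − 620.230/986.988 ≈ 37% slower.)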
Through logging, I can see that there are two reasons for this:
- Dhara walks the Radix tree from the root for every sector lookup
- Dhara does a partial page read each time it walks a node of the tree
Dhara walks the Radix tree from the root for every sector lookup
Here is the comment at the top of trace_path():
```c
/* Trace the path from the root to the given sector, emitting
 * alt-pointers and alt-full bits in the given metadata buffer.
```
Since the vast majority of sector accesses are sequential, meaning the next target sector will be sector+1 (file data is read and written in large streams), it seems like there could be an easy optimization that avoids searching the entire tree every time: Dhara could remember the last sector found and start the search at the sibling node of the radix tree. I looked at the loop but did not see an obvious way to implement that; a rough sketch of the idea follows.
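To make the idea concrete, here is a minimal sketch. This is mine, not Dhara's: struct path_cache and shared_levels() are invented names, and I am assuming trace_path() consumes one bit of the target sector per tree level, which is how I read the loop. The walks for two sectors share every level above their highest differing bit, so a cached path from the previous lookup would let a sequential access skip most of the metadata reads:

```c
#include <stdint.h>

typedef uint32_t dhara_sector_t;	/* same as Dhara's typedef, I believe */

/* Hypothetical cache of the last tree walk (not part of Dhara). */
struct path_cache {
	dhara_sector_t last_sector;	/* sector of the cached walk */
	dhara_sector_t node[32];	/* meta location visited at each level */
	int depth;			/* number of valid levels */
};

/* Number of leading tree levels the walks for sectors a and b have in
 * common: every level above the highest bit in which they differ. A
 * level is shared as long as the corresponding sector bit is equal.
 */
static int shared_levels(dhara_sector_t a, dhara_sector_t b, int depth)
{
	const dhara_sector_t diff = a ^ b;
	int level = 0;

	while ((level < depth) &&
	       !(diff & ((dhara_sector_t)1 << (depth - 1 - level))))
		level++;

	return level;
}
```

For a sequential stream (b == a + 1) the two sectors differ only in their low bits, so nearly the whole cached path is reusable and only the bottom level or two would need fresh metadata reads. The hard part, which I did not attempt, is invalidating the cache when writes or garbage collection move tree nodes.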
Dhara does a partial page read each time it walks a node of the tree
Here are some logs showing Dhara's partial page reads of metadata as it walks the tree. Notice that it does about ten small metadata reads of 132 bytes each (which I believe is DHARA_META_SIZE) before it has the page address and can read the full 2048-byte page (from offset 0) that the user is actually seeking.
```
E (11422) nand: nand_read_page: blk 18, pg 15, pg_addr 0x48F, pg_offset 548, p_data 0x20009354, len 132
E (11423) nand: nand_read_page: blk 9, pg 47, pg_addr 0x26F, pg_offset 284, p_data 0x20009354, len 132
E (11424) nand: nand_read_page: blk 5, pg 31, pg_addr 0x15F, pg_offset 152, p_data 0x20009354, len 132
E (11425) nand: nand_read_page: blk 3, pg 15, pg_addr 0xCF, pg_offset 1076, p_data 0x20009354, len 132
E (11426) nand: nand_read_page: blk 2, pg 15, pg_addr 0x8F, pg_offset 548, p_data 0x20009354, len 132
E (11427) nand: nand_read_page: blk 1, pg 47, pg_addr 0x6F, pg_offset 284, p_data 0x20009354, len 132
E (11428) nand: nand_read_page: blk 1, pg 31, pg_addr 0x5F, pg_offset 152, p_data 0x20009354, len 132
E (11429) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 1076, p_data 0x20009354, len 132
E (11430) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 548, p_data 0x20009354, len 132
E (11431) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 284, p_data 0x20009354, len 132
E (11432) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 152, p_data 0x20009354, len 132
E (11433) nand: nand_read_page: blk 1, pg 1, pg_addr 0x41, pg_offset 0, p_data 0x2000d1c0, len 2048
.E (11435) nand: nand_read_page: blk 18, pg 15, pg_addr 0x48F, pg_offset 548, p_data 0x20009354, len 132
E (11436) nand: nand_read_page: blk 9, pg 47, pg_addr 0x26F, pg_offset 284, p_data 0x20009354, len 132
E (11437) nand: nand_read_page: blk 5, pg 31, pg_addr 0x15F, pg_offset 152, p_data 0x20009354, len 132
E (11438) nand: nand_read_page: blk 3, pg 15, pg_addr 0xCF, pg_offset 1076, p_data 0x20009354, len 132
E (11439) nand: nand_read_page: blk 2, pg 15, pg_addr 0x8F, pg_offset 548, p_data 0x20009354, len 132
E (11440) nand: nand_read_page: blk 1, pg 47, pg_addr 0x6F, pg_offset 284, p_data 0x20009354, len 132
E (11441) nand: nand_read_page: blk 1, pg 31, pg_addr 0x5F, pg_offset 152, p_data 0x20009354, len 132
E (11442) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 1076, p_data 0x20009354, len 132
E (11443) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 548, p_data 0x20009354, len 132
E (11444) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 284, p_data 0x20009354, len 132
E (11445) nand: nand_read_page: blk 1, pg 2, pg_addr 0x42, pg_offset 0, p_data 0x2000d1c0, len 2048
...etc...
```
The large number of semi-random accesses to different blocks and pages is the reason for Dhara's read slowness on my system: each 2048-byte sector read costs roughly a dozen NAND transactions instead of one. There is considerable overhead to kick off each transaction; due to bus and DMA setup time, a 132-byte metadata read takes just about as long as a full 2048-byte page read.
It seems like there is an opportunity to cache away most of these NAND I/O calls: either by caching around ten metadata entries (sequential sector reads seem to follow mostly the same tree path, and the entries are only 132 bytes each), or by having one or two full-page caches at the driver level, so that, for example, all those reads to block 1, page 15 would be reduced to a single full-page read, with the metadata reads at the various offsets then served from RAM. A sketch of the driver-level version is below.
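For illustration, a minimal sketch of the driver-level idea, assuming a raw read primitive shaped like the nand_read_page() in my logs (nand_read_page_raw(), nand_read_page_cached() and nand_cache_invalidate() are invented names, and I have collapsed the blk/pg pair into the flat pg_addr for brevity):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 2048u

/* The underlying uncached read, i.e. what the driver does today. */
int nand_read_page_raw(uint32_t pg_addr, size_t pg_offset,
		       uint8_t *p_data, size_t len);

static uint8_t cache_buf[PAGE_SIZE];
static uint32_t cache_addr = UINT32_MAX;	/* nothing cached yet */

int nand_read_page_cached(uint32_t pg_addr, size_t pg_offset,
			  uint8_t *p_data, size_t len)
{
	if (pg_offset + len > PAGE_SIZE)
		return -1;

	if (pg_addr != cache_addr) {
		/* Miss: one full-page transfer fills the cache. Since a
		 * 132-byte read costs about as much as a 2048-byte read
		 * here, a miss is no slower than the uncached path.
		 */
		if (nand_read_page_raw(pg_addr, 0, cache_buf, PAGE_SIZE))
			return -1;
		cache_addr = pg_addr;
	}

	/* Hit: serve the partial read from RAM at no NAND cost. */
	memcpy(p_data, cache_buf + pg_offset, len);
	return 0;
}

/* Call on any page program or block erase so the cache never
 * returns stale data.
 */
void nand_cache_invalidate(void)
{
	cache_addr = UINT32_MAX;
}
```

In the first trace above, for example, the four 132-byte reads from pg_addr 0x4F would collapse into a single full-page NAND read, with the remaining three served from RAM.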
This is just for your information. Dhara's performance already meets my system requirements, so I don't plan to implement any optimizations myself. But it seems like there is some low-hanging fruit here that could yield significant performance improvements.
Thanks again!