DocGarbage
This document describes the operation of the garbage collection and reclamation mechanism present in the SLASH2 MDS.
There are multiple types of garbage:
- Full garbage: a version of a file is completely destroyed. This may happen as a result of the following metadata operations issued by a client:
  - a clobbering rename(2) over an existing destination file
  - an unlink(2) when the st_nlink field reaches zero
  - a zero (i.e. full) truncate(2), or O_TRUNC being specified in open(2)

  Only the details of full garbage reclamation are covered in this document.
- Partial garbage: a region of a file is known to be garbage, so only part of the file needs to be garbage collected. This scenario arises only when a bmap with multiple residency loses the valid replica status of its data, from either:
  - overwriting some portion of the bmap, which invalidates all other replicas of the bmap; or
  - explicitly ejecting a replica (see msctl repl-remove). Note that the last replica of a bmap may never be ejected.

  The details of partial garbage reclamation are not covered in this document; it is handled by the update scheduling engine.
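The conditions that produce full garbage can be written as a small predicate. The following is only an illustrative sketch: `MetaOp` and `creates_full_garbage` are hypothetical names, not part of the SLASH2 code base, and the fields are reduced to what the list above mentions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaOp:
    """A client metadata operation, reduced to the fields that matter here (hypothetical)."""
    kind: str                          # "rename", "unlink", "truncate", or "open"
    dest_exists: bool = False          # rename: does the destination already exist?
    nlink_after: Optional[int] = None  # unlink: st_nlink after the operation
    new_size: Optional[int] = None     # truncate: requested size
    o_trunc: bool = False              # open: was O_TRUNC specified?


def creates_full_garbage(op: MetaOp) -> bool:
    """Return True if the operation destroys a whole version of a file."""
    if op.kind == "rename":
        return op.dest_exists          # clobbering rename
    if op.kind == "unlink":
        return op.nlink_after == 0     # last link removed
    if op.kind == "truncate":
        return op.new_size == 0        # zero (full) truncate
    if op.kind == "open":
        return op.o_trunc              # open(2) with O_TRUNC
    return False
```

For example, `creates_full_garbage(MetaOp("unlink", nlink_after=1))` is false because another link still refers to the file.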
- Stash each UNLINK operation (and the others listed above) into the system log before replying to the UNLINK RPC.
- Distill the log entry from the tile and write it into one of the reclaim log files. Each RECLAIM entry consists of the following information:
  - The identity of the file: SLASH2 FID and generation number
  - The identities of the I/O servers that have a replica of the file at the moment.
- Read the unlink log files and send RPCs to I/O servers. An unlink log file will be removed when all relevant I/O servers have responded.
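As a rough model of the steps above, the sketch below shows what a reclaim entry carries and how a log file could become removable once every relevant I/O server has responded. The class and method names (`ReclaimEntry`, `ReclaimLogFile`, `ack`, `removable`) are assumptions made for illustration; the on-disk format and the RPC layer are not modeled.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class ReclaimEntry:
    fid: int                 # SLASH2 FID of the destroyed file version
    gen: int                 # generation number of that version
    ios_ids: FrozenSet[str]  # I/O servers holding a replica at the time


@dataclass
class ReclaimLogFile:
    """One log file plus the set of I/O servers that have already responded."""
    entries: List[ReclaimEntry] = field(default_factory=list)
    acked: Set[str] = field(default_factory=set)

    def append(self, entry: ReclaimEntry) -> None:
        self.entries.append(entry)

    def relevant_ios(self) -> Set[str]:
        """Every I/O server named by at least one entry in this file."""
        ios: Set[str] = set()
        for e in self.entries:
            ios |= e.ios_ids
        return ios

    def entries_for(self, ios_id: str) -> List[ReclaimEntry]:
        """The entries that would go into the RPC for one I/O server."""
        return [e for e in self.entries if ios_id in e.ios_ids]

    def ack(self, ios_id: str) -> None:
        """Record that an I/O server has responded for this file."""
        self.acked.add(ios_id)

    def removable(self) -> bool:
        """True once all relevant I/O servers have responded."""
        return self.relevant_ios() <= self.acked
```

Tracking acknowledgements per log file rather than per entry keeps the bookkeeping small, at the cost of holding a file until its slowest I/O server replies.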
Currently, we use tiling code to distill some log entries (those related to namespace updates and truncation) from the system journal for further processing. Every log entry is written to the journal and to the tile at the same time. We have to adjust the tiles to map to different regions of the system journal as time goes by. The main motivation for tiles is to cache the log entries and avoid reading the system journal again.
Since we already have a list of pending transactions, we could leverage that list for the purpose of distilling. That way, there is no need to maintain tiles. Data associated with log entries can just hang off their respective transaction handles.
Because we determine the current log tail by looking at the head of the list, we can also prevent a log entry from being reused before it is distilled. Not all entries need to be distilled.
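The idea of using the pending transaction list in place of tiles can be sketched as follows. This is a toy model under stated assumptions: `Txn` and `PendingList` are hypothetical names, transaction IDs are assumed to follow journal order, and the log tail is taken to be the ID of the oldest transaction still on the list.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Txn:
    xid: int                # transaction ID, assumed to follow journal order
    needs_distill: bool     # not all entries need to be distilled
    committed: bool = False
    distilled: bool = False
    payload: bytes = b""    # log entry data hanging off the handle

    def retirable(self) -> bool:
        """A transaction may leave the list once nothing still needs its entry."""
        return self.committed and (self.distilled or not self.needs_distill)


class PendingList:
    def __init__(self) -> None:
        self._txns = deque()
        self._next_xid = 0

    def begin(self, needs_distill: bool, payload: bytes = b"") -> Txn:
        txn = Txn(self._next_xid, needs_distill, payload=payload)
        self._next_xid += 1
        self._txns.append(txn)
        return txn

    def log_tail(self) -> int:
        """ID of the oldest journal entry that must not be reused yet."""
        # Drop finished transactions from the head; the first one that
        # remains pins the tail and protects its journal entry.
        while self._txns and self._txns[0].retirable():
            self._txns.popleft()
        return self._txns[0].xid if self._txns else self._next_xid
```

Because a committed but undistilled transaction stays at the head of the list, the tail cannot move past its entry, which is what prevents reuse before distilling.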
We also need to log some counters (bmap sequence number and SLASH2 FID); we can insert them periodically in the log, so that we never have to worry about them being overwritten or missing entirely.
Bmap sequence numbers are used to expire old bmaps on the I/O servers. We don't require strict time synchronization among clients and servers, so it is entirely up to the MDS to decide when to expire the bmaps it has leased.
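One way to picture sequence-based expiry without synchronized clocks is a sliding window of valid sequence numbers that only the MDS advances. The sketch below is an assumption made for illustration, not a description of the actual SLASH2 mechanism: `BmapSeqWindow`, its `window` parameter, and `ios_is_expired` are all hypothetical.

```python
class BmapSeqWindow:
    """MDS-side window of bmap sequence numbers that are still valid (hypothetical)."""

    def __init__(self, window: int = 3) -> None:
        self.window = window  # hypothetical: how many sequence numbers stay valid
        self.high = 0         # sequence number handed out with new leases

    @property
    def low(self) -> int:
        """Smallest sequence number that has not expired."""
        return max(0, self.high - self.window + 1)

    def lease(self) -> int:
        """Hand out the current sequence number with a new bmap lease."""
        return self.high

    def advance(self):
        """Move the window forward on the MDS's own schedule; no clocks are compared."""
        self.high += 1
        return self.low, self.high


def ios_is_expired(seq: int, low: int) -> bool:
    """An I/O server treats a bmap as expired once its sequence falls below low."""
    return seq < low
```

In this model the I/O servers only need to learn the current low value from the MDS; they never compare wall-clock times.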
There is a problem where the MDS receives 30 MB/s of I/O every minute. This slows down READDIR performance considerably.
The current implementation reads a batch of reclaim entries and determines each IOS to contact. After the log file is read, if no IOS can be contacted, the processing that was done to read the file was wasted. Ensuring that a connection to an IOS appears healthy before starting heavier processing would be a nice addition.
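The suggested improvement amounts to checking connectivity before doing the expensive read. A minimal sketch, assuming hypothetical callbacks `is_healthy`, `read_batch`, and `send` that stand in for the real connection check, log reader, and RPC layer:

```python
def process_reclaim_log(ios_ids, is_healthy, read_batch, send):
    """Read the reclaim log only if at least one target IOS looks reachable.

    Returns the number of (IOS, entry) pairs handed to send().
    """
    healthy = [ios for ios in ios_ids if is_healthy(ios)]
    if not healthy:
        return 0  # skip the log read entirely; nothing could be delivered

    sent = 0
    for entry in read_batch():
        # Each entry names the I/O servers that held a replica of the file.
        for ios in healthy:
            if ios in entry["ios_ids"]:
                send(ios, entry)
                sent += 1
    return sent
```

A health check of this kind is only advisory: a connection can still fail between the check and the RPC, so the caller would still need to handle send failures.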
Perhaps the approach used for partial garbage reclamation would be a better fit for full garbage reclamation as well.
