Possible Improvements for PLFS small file mode
- Reduce memory usage.
- Improve stability (avoid running out of memory).
Small file mode in PLFS can create files very quickly. However, once there are too many files (billions of them) in a directory, the small file code can no longer list all the files in that directory.
The reason is that small file mode caches all the filenames in memory (in a std::map) so that it can merge the log records related to these files and obtain the latest view of the directory.
Please note that there is no way to support plfs_readdir() for a very large directory: this function returns a set containing all the filenames inside the directory, and if that set is too large to be held in memory, the call must fail.
johnbent This is not related to small file mode at all but thanks for remembering this and pointing this out. Seems like we need to add a size and offset parameter to plfs_readdir(). I wonder how difficult that would be.
jingwang I think using the plfs_readdir_c() would be fine.
For one-name-per-call interfaces (plfs_readdir_c()), we can divide all the filenames into several subsets, so that we can handle those subsets one by one in memory.
 We need to change the physical layout in the backends for small file mode from:
   /backend/dir1/SMALLFILECONTAINER/dropping.names.x.[1~M]
 to:
   /backend/dir1/SMALLFILECONTAINER/subset.0/dropping.names.x.[1~M]
   ................................/subset.1/dropping.names.x.[1~M]
                                           .
   ................................/subset.N/dropping.names.x.[1~M]
We need to define a hash function hash(filename) -> [0 ~ N]. Then all
the name records related to fileA will be written to the subdirectory
subset.hash("fileA").
Then plfs_readdir_c() can be implemented as follows:
- set I to 0
- load the dropping.names files in subset.I/ into memory and merge all the records
- now we have all the names of this subset in a map
- use an iterator to return the names one by one
- once all the names inside this subset have been returned to the caller, I++
- if I > N, we are done; otherwise, go to step 2.
In this way, we can handle very large directories as long as we can hold 1/N of their filenames in memory.
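A minimal sketch of this per-subset iteration is shown below. It keeps only the merged names of the current subset (metadata merging is omitted), and load_subset_names() is a placeholder for the dropping-record parser, not an existing PLFS call:

    #include <set>
    #include <string>

    // Hypothetical iteration state kept between plfs_readdir_c() calls; only
    // the names of one subset are resident in memory at any time.
    struct SubsetDirIter {
        size_t next_subset;                     // next subset.I to load, starts at 0
        size_t num_subsets;                     // total number of subset directories
        std::set<std::string> names;            // merged names of the current subset
        std::set<std::string>::iterator pos;    // next name to hand back
        SubsetDirIter(size_t n) : next_subset(0), num_subsets(n), pos(names.end()) {}
    };

    // Placeholder: merge every dropping.names record under subset.<index> into out.
    void load_subset_names(const std::string &container, size_t index,
                           std::set<std::string> &out);

    // Return the next filename, or false once the whole directory is exhausted.
    bool next_name(SubsetDirIter &it, const std::string &container,
                   std::string &name_out) {
        while (it.pos == it.names.end()) {      // current subset exhausted
            if (it.next_subset >= it.num_subsets)
                return false;                   // all subsets have been returned
            it.names.clear();
            load_subset_names(container, it.next_subset++, it.names);
            it.pos = it.names.begin();
        }
        name_out = *it.pos++;
        return true;
    }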
johnbent Interesting. This divide-and-conquer is a nice approach that I hadn't thought of. But as you point out, there is a set of trade-offs which aren't great. Is there another way to do it? Since we have a Plfs_dirp structure returned from plfs_opendir_c then we can stash small amounts of info in there. Couldn't we just store basically a DIR * in there and just do readdir's on it and return what it gets? And when it fills the buffer, then we leave the DIR * alone, and then callers can call plfs_readdir_c as many times as they want? Now, we'll also need to iterate across all of the backends and have a DIR * on each one and this means that we might return the same dirent if it exists in multiple backends but maybe that's a better tradeoff. And I know that instead of a physical DIR * it will be a IOSDirHandle but it should otherwise be the same.
jingwang For small file mode, we can't rely on the backend for readdir(). We can't get dirents from the backend; we must create them from the content of the dropping files. The problem is that in POSIX, directories can be accessed linearly, so you can get the files in a directory one by one. That is not true for small file mode, because it is a log-structured file system: only after merging all the log records together can you traverse the directory tree. If you don't cache any information in memory, you can't know whether a given file has already been returned to the caller.
Small file mode will generate more physical files. Previously one process created only one dropping.names.xxx file in the container directory, but now it might create a dropping.names.xxx file in each of the subset.X directories, creating N dropping.names files in total.
Please note that this approach only addresses the readdir() issue. If we want to handle very large directories correctly, we need a way to get all the metadata related to a given file in a very large directory, which is covered in the next topic.
- Reduce memory usage so that we can handle very large directories.
Similarly, small file mode requires the metadata of all the files in the same directory to be cached in memory so that it can retrieve the metadata of any given file. This is a problem for very large directories.
This is a difficult problem to fix because of RENAME operations. A file might have been renamed from another name, so to get its metadata we need the metadata of the file it was renamed from. However, we don't know in advance which file will be renamed to a given file, because that happens in the future.
Currently, name records are merged into the in-memory cache in time order. The oldest record is merged first and the latest record is merged last.
build_metadata() {
   map<string, Metadata> cached_meta;
   record = first_record; // The earliest record in all droppings
   do {
      if (record is create/open record)
         update_metadata(cached_meta[filename], record);
      if (record is rename record)
         cached_meta[filename_to] = cached_meta[filename_from];
      if (record is delete record)
         cached_meta.erase(filename);
   } while ((record = get_next_record()) != NULL);
}
In this way, we can get the latest metadata of all the logical files in one pass. If we want to get the metadata of a single file without the map structure, we could change it as follows:
stat_file(filename) {
   record = last_record; // The latest record in all the droppings.
   Metadata file_meta;
   do {
      if (record is delete filename)
         break;  // We are done.
      if (record is rename filename -> newname)
         break;  // We are done.
      if (record is rename somename -> filename)
         merge_metadata(file_meta, stat_file(somename));
      if (record is open/create filename)
         update_metadata(file_meta, record);
      // Skip records not related to the filename.
   } while ((record = get_previous_record()) != NULL);
   return file_meta;
}
More IO related to stat. Without the cache, every time we want to stat a file we might need to scan all the dropping.names files all over again.
Taking the possibility of partial writes into consideration, we need to define a correct way to find the last record. Name records have variable length, determined at runtime, and the only way to find their correct start and end is to read the records one by one from the start, which is not suitable for backward log merging.
Write buffering can help solve this problem; please see the next topic for more information.
- Improve performance.
- Enable backward merging to support the previous topic.
- It's a common practice for log-structured file systems.
Currently the records in dropping.names.xxx files have variable sizes, which makes it impossible to find the last record in a file without reading the whole file. If we made the records fixed-length and large enough to hold any possible filename, we would waste a lot of space.
So a better approach is to aggregate several records into large, fixed-length blocks (4MB, for example). That improves write performance, and we can safely find complete blocks when reading them back.
For writes, hold the records in memory until they fill up a 4MB block or plfs_flush() is called.
For reads, load one 4MB block at a time and parse the records in it. This way we can read either the first block or the last one, without worrying about partial writes.
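A minimal sketch of the block aggregation idea, assuming a 4MB block size and a simple length-prefixed record encoding (both are illustrative choices, not the existing dropping format):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    static const size_t BLOCK_SIZE = 4 * 1024 * 1024;    // fixed block size

    // Hypothetical block buffer: records are staged in memory and written out
    // only in whole BLOCK_SIZE units, so a reader can load the first or the
    // last complete block without ever seeing a torn record.
    class BlockBuffer {
    public:
        // Append one length-prefixed record; flush first if it would not fit.
        // (Records here are small name records, far below BLOCK_SIZE.)
        void append(const void *rec, uint32_t len) {
            if (buf_.size() + sizeof(len) + len > BLOCK_SIZE)
                flush();
            const char *l = reinterpret_cast<const char *>(&len);
            buf_.insert(buf_.end(), l, l + sizeof(len));
            const char *r = static_cast<const char *>(rec);
            buf_.insert(buf_.end(), r, r + len);
        }

        // Pad the partial block with zeros and hand it to the backend; this is
        // also what plfs_flush() would trigger.
        void flush() {
            if (buf_.empty())
                return;
            buf_.resize(BLOCK_SIZE, 0);          // hole at the end of the block
            write_block_to_backend(&buf_[0], buf_.size());
            buf_.clear();
        }

    private:
        void write_block_to_backend(const char *data, size_t len);  // placeholder
        std::vector<char> buf_;
    };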
This is not a generic buffering mechanism. However, buffers and caches are not only critical to performance but also important to consistency, and their existence might change the way we perform certain operations. So my point is that it is very hard to develop a transparent buffering layer, and I believe that buffering implemented inside small file mode would bring it the greatest benefit.
johnbent Oh, I think I understand. If the fixed length was 4M, then you could read the file backwards in 4M blocks. But some blocks might have holes at the end I think. And that's fine; I just want to understand. I think you would still need to read each block in forward order though, since each record in the block is variable size. What is variable in each record? Just the filename, right? So this is just the namefile that we're talking about here. We could make fixed-size records by using MAX_NAMELEN as a constant size field for each filename. I wonder how much space that would waste and if we care about that.
jingwang I'm afraid that would waste too much memory space. Maybe at least 80% of the dropping.names.xxx files would be wasted. It wastes both IO bandwidth and storage.
johnbent Also, are you maintaining the cached_meta[] structure during writes and creates? Because LANL was seeing crashes due to memory exhaustion during the create phase. Would there be a way to disable this and not cache during the create phase and only start caching when someone does some sort of a query? Then we could get super-fast create rates.
jingwang Described and addressed in this topic.
- So that very large physical files won't be created.
- Some memory objects are associated with dropping files; by closing really huge dropping files and creating new ones, the memory usage can be reduced.
Currently, when a file is created, some data is cached in memory for future use. For example, there is a map (filename -> FileID) for every dropping.names.xxxx file. This is because we store filenames only in the dropping.names.xxxx files; in the droppings.data.xxxx files we store just a FileID to save space. So when the application wants to write some data after the file is created, it can look up the FileID of the file in the map.
However, when a process creates lots of files, the map might grow really large, so we need to find a way to control the memory used by this map.
This fix is pretty straightforward: we do not update the map when the file is created. Then, when we really want to write some data to the file, we append an OPEN record to the dropping.names.xx file, thus allocating a new FileID to the file. However, we still need the map, because we can't afford to append an OPEN record for every write request, so we need a way to know that the OPEN record has already been written.
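A minimal sketch of this lazy FileID allocation; DroppingWriter, append_open_record() and the map layout are illustrative names, not existing PLFS structures:

    #include <cstdint>
    #include <map>
    #include <string>

    typedef uint32_t FileID;

    // Hypothetical per-dropping writer state. On create only a CREATE record
    // is logged and the map is left untouched; a FileID is allocated (and an
    // OPEN record appended) lazily on the first write, and the map remembers
    // it so later writes do not append another OPEN record.
    struct DroppingWriter {
        std::map<std::string, FileID> open_ids;   // only files actually written to
        FileID next_id;

        DroppingWriter() : next_id(0) {}

        FileID fileid_for_write(const std::string &filename) {
            std::map<std::string, FileID>::iterator it = open_ids.find(filename);
            if (it != open_ids.end())
                return it->second;               // OPEN record already appended
            FileID id = next_id++;
            append_open_record(filename, id);    // placeholder: log OPEN record
            open_ids[filename] = id;
            return id;
        }

        void append_open_record(const std::string &filename, FileID id);  // placeholder
    };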
johnbent2 If a write never shows up, will the file exist? Is there a CREAT record? Isn't it the OPEN record that records existence of the file?
jingwang2 It will show up. Yes, there is a CREATE record. No, the CREATE record records the existence of the file.
johnbent2 Can't we stash information inside of the Plfs_fd structure? Could we stash the FileID in there and then not need a map?
jingwang2 Good point. I will come up with a new possible fix to discuss this.
- For use cases such as write-after-create, we will append one OPEN record to the dropping.names.xxx file, whereas previously this record could be avoided.
johnbent2 Sorry, I don't understand this. Why did we not need to do this previously?
jingwang2 Previously the CREATE record implicitly indicated an open operation, so you didn't need to append an OPEN record if the file had just been created.
- There is still a chance that the map and the dropping.names.xxx files grow really large, if the process writes data to lots of files.
johnbent2 Sorry, I don't understand this either. We still need to maintain the map so we don't do an OPEN record for every write. So how is this map any smaller than it was previously?
jingwang2 Because the entries for files that are created but not opened are removed from the map, this map might be smaller than it was previously.
We record how many records have been added to the dropping.names.xxx file, and once that number exceeds a threshold, the dropping file is closed and a new one is created for future modifications.
In this way, we can ensure that the total memory used by the map is less than N * (average size of a (filename, FileID) entry), and there won't be very large files in the backends.
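A rough sketch of the rotation rule, where MAX_RECORDS_PER_DROPPING stands in for the threshold N mentioned above:

    #include <cstddef>

    // Assumed threshold (the N above): once this many records have been
    // appended, the current dropping.names file is closed and a new one is
    // opened, letting the map tied to the old dropping be discarded.
    static const size_t MAX_RECORDS_PER_DROPPING = 1000000;

    struct DroppingRotator {
        size_t records_in_current;

        DroppingRotator() : records_in_current(0) {}

        void on_record_appended() {
            if (++records_in_current >= MAX_RECORDS_PER_DROPPING) {
                close_current_dropping();    // placeholder: close file, free its map
                open_new_dropping();         // placeholder: start a new dropping.names file
                records_in_current = 0;
            }
        }

        void close_current_dropping();       // placeholder
        void open_new_dropping();            // placeholder
    };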
johnbent2 So the map isn't a list of every file that exists in that directory. It's just a list of which files are open in that directory by that process. Yeah, so I think we can just stash whatever info we need inside the Plfs_fd structure. No? Boy, I hope so! :)
jingwang2 Yes, you are right. My thinking was that maintaining a map like that could reduce the number of records appended to the dropping.names.xxx files and avoid having multiple file ids for the same file in one dropping file. I will discuss the Plfs_fd solution in the next paragraph.
- More dropping files might be created than before.
- OPEN records can't be reused across dropping files, so there might be more records in dropping.names.xx files than before.
Currently there is one map from filename to fileid for every dropping file. We can change that to a map associated with the Plfs_fd, from dropping_id to this file's fileid in that dropping. Here are the benefits:
- Assume that there are P processes on the machine, so there will be P dropping files.
- Assume that one process will create or open N files during its entire life.
- Assume that one process will open at most M files at the same time.
So previously there would be P maps and there would be N entries in each of them.
  Memory usage = O(P*N)
After the change, there would be M maps for each process, and each map has only one entry (typically there would be only one dropping file for each process).
  Memory usage = O(P*M*1)
And it is reasonable to assume that M is much smaller than N, because a process shouldn't open lots of files at the same time.
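A sketch of the per-Plfs_fd bookkeeping described above; the type names and the notion of a dropping_id are illustrative:

    #include <cstdint>
    #include <map>

    typedef uint32_t DroppingID;
    typedef uint32_t FileID;

    // Hypothetical state stashed inside Plfs_fd: instead of one big
    // filename -> FileID map per dropping file, each open file remembers only
    // the FileID it was assigned in each dropping it has touched. With one
    // dropping per process this map usually holds a single entry.
    struct SmallFileFdState {
        std::map<DroppingID, FileID> ids;    // dropping_id -> FileID in that dropping

        bool lookup(DroppingID dropping, FileID &out) const {
            std::map<DroppingID, FileID>::const_iterator it = ids.find(dropping);
            if (it == ids.end())
                return false;                // no OPEN record appended there yet
            out = it->second;
            return true;
        }

        void remember(DroppingID dropping, FileID id) { ids[dropping] = id; }
    };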
- More OPEN records will be appended to the dropping.names.xxx files. We must allocate a file id before data is written to a file, so every plfs_open() + plfs_write() will append an OPEN record to the dropping file.
- It complicates the read code path a little bit, because there might be several file ids for the same file in a single dropping file.