Speed up process list by faster check for empty directory #6432
In general, I'm unsure whether this is the right way to make the change. The file management module was implemented for accessing files on local storage. For other access types such as S3, the file management interface should be implemented for S3 access and must replace the local-storage version at runtime, since you cannot have two implementations of the same interface (a well-known but never solved issue). This was the plan during development of the 3.x version. As I see now, the documentation from that DFG project was never published, at least not here: https://github.com/kitodo/kitodo-production/wiki/Developer-Documentation-Kitodo.Production-3.x So your change may break this general attempt to add a separation layer, but it is also possible that this layer was already broken before your change. I will try your changes and report back if there are issues or anything else.
Sorry for creating confusion here. I am not using S3 natively but mounting it with rclone and a FUSE mount, so my S3 storage acts like a normal file system. This setup just exposes the performance bottleneck of the current code more clearly. The new code should therefore work with all forms of local or mounted network storage and speed things up for those as well.
Thank you for this explanation; how you got this working is interesting. With this setup, the behaviour should be identical to other local or network-based file systems.
I tried these changes, but I cannot confirm a speed-up on local storage; I had the impression that displaying the list takes more time. The major issue, however, is that processes which are not exportable are now displayed as exportable.
Without your changes, it looks like this for a few processes:
With your changes, the processes in the middle are now exportable, although they contain neither images nor other media files.
Both processes from the audio and video test project contain no files but are shown as exportable both before and after your changes. But that is a separate issue.
Thanks for your review. I will look into that. I will probably have to inspect more closely what exactly is happening in the convoluted function, which, for this specific use case, still looks like absolute overkill to me, because it retrieves a lot of file metadata that is not needed to check whether the folder is empty.
I tried to extend the logic to fix the wrong result in your case (folders are reported as non-empty although they are empty, so processes are shown as exportable).
4aa52b2 to 03a946a
This seems to be intended by the current code. Only files that are used to generate image derivatives are considered here.
kitodo-production/Kitodo/src/main/java/org/kitodo/production/services/data/ProcessService.java
Lines 2841 to 2842 in 6e9e69b
return !entries
.map(Path::getFileName)
.map(Path::toString)
.anyMatch(name -> query.getRight().matcher(name).matches()); // Stop after first match
IntelliJ IDEA suggests using entries.noneMatch() here instead of !entries.anyMatch().
I got the same recommendation. I was just wondering whether entries.noneMatch() is slower here. But maybe we should trust the IDE more.
I noticed no difference on my local development system. Maybe you can time it on your system with S3 in the background? The first solution may be faster, but there may also be no difference at all.
There seems to be no real performance difference:
https://stackoverflow.com/a/57779846
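To illustrate the point for readers of this thread: both forms are short-circuiting terminal operations, so each stops pulling entries once the first match is seen. The following is a minimal sketch, not code from this pull request; the class name `MatchDemo`, its methods, and the `.tif` predicate are hypothetical and only count how many elements a sequential stream inspects.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of Kitodo.
final class MatchDemo {

    // Counts how many names are inspected when using !anyMatch(...).
    static int inspectedWithNegatedAnyMatch(List<String> names) {
        AtomicInteger seen = new AtomicInteger();
        boolean none = !names.stream()
                .peek(name -> seen.incrementAndGet())
                .anyMatch(name -> name.endsWith(".tif"));
        return seen.get();
    }

    // Counts how many names are inspected when using noneMatch(...).
    static int inspectedWithNoneMatch(List<String> names) {
        AtomicInteger seen = new AtomicInteger();
        boolean none = names.stream()
                .peek(name -> seen.incrementAndGet())
                .noneMatch(name -> name.endsWith(".tif"));
        return seen.get();
    }
}
```

For a list whose first element already matches, both methods should report a single inspected element, which is consistent with the observation that the choice between the two is a matter of readability rather than speed.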
.map(Path::toString)
.anyMatch(name -> query.getRight().matcher(name).matches()); // Stop after first match
} catch (IOException e) {
return false;
Should a raised IOException be interpreted as a directory with content? I'm unsure about this, but it could be okay. It would be helpful if this exception were logged at trace or debug level.
An IOException now marks the folder as having no images.
Good to know. This differs between non-image media files and image media files, but if that needs to change, it should not happen in this pull request.
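As a rough sketch of the behaviour discussed here (an IOException treated as "no images", with the exception logged at a low level), the following stand-alone example may help. It is an assumption-laden illustration, not the code of this pull request: the class `SafeFolderCheck` and the method `hasNoMatchingFile` are hypothetical, and it uses `java.util.logging` only to stay self-contained, whereas the project may use a different logging framework.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper, not part of Kitodo.
final class SafeFolderCheck {
    private static final Logger LOG = Logger.getLogger(SafeFolderCheck.class.getName());

    // Returns true if no entry name in dir matches the pattern.
    // An IOException (for example a missing directory) also yields true,
    // meaning "no images", and is logged at FINE (debug-like) level.
    static boolean hasNoMatchingFile(Path dir, Pattern pattern) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .map(Path::getFileName)
                    .map(Path::toString)
                    .noneMatch(name -> pattern.matcher(name).matches());
        } catch (IOException e) {
            LOG.log(Level.FINE, e, () -> "Could not list " + dir + "; treating it as having no images");
            return true;
        }
    }
}
```

Logging through a supplier keeps the message from being built unless the level is enabled, which matters when the check runs once per process in a long list.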
 * It stops checking as soon as the first file is found.
 *
 * @return true if the folder is empty, false if it contains at least one file
 */
The thrown exception is missing from the JavaDoc.
While trying to use S3 storage for the Kitodo files, I noticed that the process list became very slow. S3 storage is of course slower than NFS or local storage, but we can significantly speed up the process list if we refactor the file existence checks.
The problem is that, for every process in the process list, a check is triggered to determine whether the document can be exported.
kitodo-production/Kitodo/src/main/java/org/kitodo/production/services/data/ProcessService.java
Lines 2830 to 2847 in 84b0497
This calls a function that checks whether the process has images, which in turn checks whether the directory with the original images is empty.
kitodo-production/Kitodo/src/main/java/org/kitodo/production/services/file/FileService.java
Lines 1451 to 1461 in 84b0497
This in turn calls a complex function that does far more than return the contents of the directory in order to check whether it is empty.
kitodo-production/Kitodo/src/main/java/org/kitodo/production/model/Subfolder.java
Lines 346 to 357 in 84b0497
I replaced this with an optimized file existence check, which speeds things up significantly (on S3, from 15 seconds to one second).
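The idea behind such a check can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact code of this pull request: the class `FastFolderCheck`, the method `hasNoMatchingFile`, and the file name pattern are hypothetical. It lists the directory lazily, looks only at entry names, and stops at the first match instead of collecting metadata for every file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper, not part of Kitodo.
final class FastFolderCheck {

    // Returns true if dir contains no entry whose file name matches the pattern.
    // Files.list is lazy, and noneMatch stops at the first matching name,
    // so a large directory does not have to be read completely.
    static boolean hasNoMatchingFile(Path dir, Pattern pattern) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .map(Path::getFileName)
                    .map(Path::toString)
                    .noneMatch(name -> pattern.matcher(name).matches());
        }
    }
}
```

A caller could, for example, pass `Pattern.compile(".*\\.tif")` for a folder of TIFF originals. Because only names are compared, no per-file attribute lookups are issued, which is where slow mounted storage tends to lose the most time.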