Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up process list by faster check for empty directory #6432

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

BartChris
Copy link
Collaborator

@BartChris BartChris commented Feb 19, 2025

Whily trying to use a S3 storage for the Kitodo files i noticed that the process list became very slow. S3 storage is slower than NFS or local storage of course, but we can sigificantly speed up the process list lf we refactor file existence checks.

The problem is that for every process in the process list a check is triggered wether the document can be exported.

* Checks and returns whether the process with the given ID 'processId' can be exported or not.
* @param processId process ID
* @return whether process can be exported or not
*/
public static boolean canBeExported(int processId) throws IOException, DAOException {
Process process = ServiceManager.getProcessService().getById(processId);
// superordinate processes normally do not contain images but should always be exportable
if (!process.getChildren().isEmpty()) {
return true;
}
Folder generatorSource = process.getProject().getGeneratorSource();
// processes without a generator source should be exportable because they may contain multimedia files
// that are not used as generator sources
if (Objects.isNull(generatorSource)) {
return true;
}
return FileService.hasImages(process, generatorSource);
}

This calls a function which checks if the process has images, which checks wether the directory with the original images is empty

* @param process Process
* @param generatorSource Folder
* @return whether given URI points to empty directory or not
*/
public static boolean hasImages(Process process, Folder generatorSource) {
if (Objects.nonNull(generatorSource)) {
Subfolder sourceFolder = new Subfolder(process, generatorSource);
return !sourceFolder.listContents().isEmpty();
}
return false;
}

This in turn calls a complex function which does way more than returning the contents of the dir to check if the dir is empty.

private Map<String, URI> listDirectory(Pair<URI, Pattern> query, boolean absolute) {
FilenameFilter filter = (dir, name) -> query.getRight().matcher(name).matches();
try (Stream<URI> relativeURIs = fileService.getSubUris(filter, query.getLeft()).parallelStream()) {
Stream<URI> resultURIs = absolute ? relativeURIs.map(
uri -> new File(ConfigCore.getKitodoDataDirectory().concat(uri.getPath())).toURI())
: relativeURIs.map(uri -> URI.create(uri.toString().replaceFirst("^[^/]+/", "")));
Function<URI, String> keyMapper = createKeyMapperForPattern(query.getRight());
return resultURIs.collect(Collectors.toMap(keyMapper, Function.identity(), (previous, latest) -> latest,
() -> new TreeMap<>(fileService.getMetadataImageComparator())));
}
}

I replaced this with an optimized file existence check, which speeds up performance significantly. (On S3 from 15 seconds to one second).

@henning-gerhardt
Copy link
Collaborator

In common I'm unsure if your way of change is the right way. The file management module was implemented for accessing files on local storage. For other access types like S3 the file management interface should be implemented for S3 access and must be replaced on runtime with the version of the local storage as you can not have two implementations of the same interface (well known but never solved issue). This was the plan while development the 3.x version. As I see now that this whole DFG project documentation was not got public published - at least not here: https://github.com/kitodo/kitodo-production/wiki/Developer-Documentation-Kitodo.Production-3.x

So your change can maybe break this general attempt to add a separation layer between but it if even possible that this layer was already broken without your change.

I will try your changes and report back if there are issues or anything else.

@BartChris
Copy link
Collaborator Author

BartChris commented Feb 19, 2025

In common I'm unsure if your way of change is the right way. The file management module was implemented for accessing files on local storage. For other access types like S3 the file management interface should be implemented for S3 access and must be replaced on runtime with the version of the local storage as you can not have two implementations of the same interface (well known but never solved issue). This was the plan while development the 3.x version. As I see now that this whole DFG project documentation was not got public published - at least not here: https://github.com/kitodo/kitodo-production/wiki/Developer-Documentation-Kitodo.Production-3.x

So your change can maybe break this general attempt to add a separation layer between but it if even possible that this layer was already broken without your change.

I will try your changes and report back if there are issues or anything else.

Sorry for creating confusion here. I am not using S3 natively but mounting it using rclone and FUSE mount. So my S3 storage acts like a normal file system. This setup just exposes the performance bottleneck of the current code more clearly. So the new code should work with all form of local or mounted Network storage and speed things up for those storages as well.

@henning-gerhardt
Copy link
Collaborator

Thank you for this explanation and how you get this working is interesting way. In this way the behaviour should be identical to other local or network based file systems.

Copy link
Collaborator

@henning-gerhardt henning-gerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this changes but I can not confirm that is a speed up on local storage - I had the impression that displaying needs more time. But then major issue is that now processes are displayed as exportable which are not exportable.

Without your changes its looks like this for a few processes:

Screenshot_2025-02-20_09-55-21

With your changes the processes in the middle are now exportable but they did not contain any images nor other media files

Screenshot_2025-02-20_09-58-13

Both processes from the audio and video test project did not contain any files but shown before and after as exportable. But this is an other issue.

@BartChris
Copy link
Collaborator Author

BartChris commented Feb 20, 2025

Thanks for your review. I will inspect that. I will probably have to inspect closer, what exactly is happening in the convoluted function, which still appears - for this specific use case - like absolute overkill to me, because it is retrieving a lot of file metadata not needed for checking the folder for emptiness.

@BartChris BartChris marked this pull request as draft February 20, 2025 09:13
@BartChris
Copy link
Collaborator Author

BartChris commented Feb 20, 2025

I tried to extend the logic to maybe fix the wrong result in your case. (Folders are indicated as non empty although they are empty - processes are exportable)
In what folders have you put the originals/TIFF-files (Mine are in images/original)? Maybe my logic did not cover different cases of folder structures. I can imagine that on local storage the speed difference is small or non noticable. On my S3 based network mount the results are correct and the difference when navigating the process list is from "unusable slow" (old code) to "quite fast" (new code).

I tried this changes but I can not confirm that is a speed up on local storage - I had the impression that displaying needs more time. But then major issue is that now processes are displayed as exportable which are not exportable.

With your changes the processes in the middle are now exportable but they did not contain any images nor other media files

@BartChris BartChris force-pushed the speed_up_process_list branch from 4aa52b2 to 03a946a Compare February 20, 2025 14:40
@BartChris
Copy link
Collaborator Author

BartChris commented Feb 20, 2025

Both processes from the audio and video test project did not contain any files but shown before and after as exportable. But this is an other issue.

This seems to be intended by the current code. Only files that are used to generate image derivatives are considered here.

// processes without a generator source should be exportable because they may contain multimedia files
// that are not used as generator sources

return !entries
.map(Path::getFileName)
.map(Path::toString)
.anyMatch(name -> query.getRight().matcher(name).matches()); // Stop after first match
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntelliJ IDEA suggest here instead of !entries.anyMatch() to use entries.noneMatch() to use.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got the same recommendation. I was just wondering wether entries.noneMatch() is slower here. But maybe we should trust the IDE more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed no differences on my local development system. Maybe you can time it on your system with an S3 in background? Maybe the first solution was faster but maybe there is no difference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be no real performance differences
https://stackoverflow.com/a/57779846

.map(Path::toString)
.anyMatch(name -> query.getRight().matcher(name).matches()); // Stop after first match
} catch (IOException e) {
return false;
Copy link
Collaborator

@henning-gerhardt henning-gerhardt Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should a raised IOException interpreted as a directory with content? I'm unsure about this but it could be okay. It could be helpful if this exception got logged in trace or debug log level.

Copy link
Collaborator Author

@BartChris BartChris Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOException now marks the folder as having no images.

@henning-gerhardt
Copy link
Collaborator

Both processes from the audio and video test project did not contain any files but shown before and after as exportable. But this is an other issue.

This seems to be intended by the current code. Only files that are used to generate image derivatives are considered here.

Good to know. It is different for non-image media files as for image media files but if this must be changed then not in this pull request.

@BartChris BartChris marked this pull request as ready for review February 20, 2025 15:50
* It stops checking as soon as the first file is found.
*
* @return true if the folder is empty, false if it contains at least one file
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JavaDoc thrown exception is missing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants