Speed up process list by faster check for empty directory #6432

BartChris · 2025-02-19T10:18:37Z

While trying to use S3 storage for the Kitodo files, I noticed that the process list became very slow. S3 storage is of course slower than NFS or local storage, but we can significantly speed up the process list if we refactor the file existence checks.

The problem is that, for every process in the process list, a check is triggered to determine whether the document can be exported.

kitodo-production/Kitodo/src/main/java/org/kitodo/production/services/data/ProcessService.java

Lines 2830 to 2847 in 84b0497

    
     * Checks and returns whether the process with the given ID 'processId' can be exported or not.
     * @param processId process ID
     * @return whether process can be exported or not
     */
    public static boolean canBeExported(int processId) throws IOException, DAOException {
        Process process = ServiceManager.getProcessService().getById(processId);
        // superordinate processes normally do not contain images but should always be exportable
        if (!process.getChildren().isEmpty()) {
            return true;
        }
        Folder generatorSource = process.getProject().getGeneratorSource();
        // processes without a generator source should be exportable because they may contain multimedia files
        // that are not used as generator sources
        if (Objects.isNull(generatorSource)) {
            return true;
        }
        return FileService.hasImages(process, generatorSource);
    }

This calls a function that checks whether the process has images, which in turn checks whether the directory with the original images is empty.

kitodo-production/Kitodo/src/main/java/org/kitodo/production/services/file/FileService.java

Lines 1451 to 1461 in 84b0497

    
     * @param process Process
     * @param generatorSource Folder
     * @return whether given URI points to empty directory or not
     */
    public static boolean hasImages(Process process, Folder generatorSource) {
        if (Objects.nonNull(generatorSource)) {
            Subfolder sourceFolder = new Subfolder(process, generatorSource);
            return !sourceFolder.listContents().isEmpty();
        }
        return false;
    }

This in turn calls a complex function that does far more than is needed to check whether the directory is empty: it collects every matching file into a sorted map.

kitodo-production/Kitodo/src/main/java/org/kitodo/production/model/Subfolder.java

Lines 346 to 357 in 84b0497

    
    private Map<String, URI> listDirectory(Pair<URI, Pattern> query, boolean absolute) {
        FilenameFilter filter = (dir, name) -> query.getRight().matcher(name).matches();
        try (Stream<URI> relativeURIs = fileService.getSubUris(filter, query.getLeft()).parallelStream()) {
            Stream<URI> resultURIs = absolute ? relativeURIs.map(
                uri -> new File(ConfigCore.getKitodoDataDirectory().concat(uri.getPath())).toURI())
                    : relativeURIs.map(uri -> URI.create(uri.toString().replaceFirst("^[^/]+/", "")));
            Function<URI, String> keyMapper = createKeyMapperForPattern(query.getRight());
            return resultURIs.collect(Collectors.toMap(keyMapper, Function.identity(), (previous, latest) -> latest,
                () -> new TreeMap<>(fileService.getMetadataImageComparator())));
        }
    }

I replaced this with an optimized file existence check, which speeds things up significantly (on S3, from 15 seconds to one second).
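To illustrate the idea of a short-circuiting existence check, here is a minimal sketch. It is not the code from this pull request: the class `DirectoryProbe` and the method `containsMatchingFile` are hypothetical names, and the sketch assumes the folder is reachable as a plain `java.nio.file.Path`. It relies on `Files.list` being lazy, so `anyMatch` can stop at the first matching entry instead of building a sorted map of all files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Kitodo: looks for a matching file without listing everything. */
final class DirectoryProbe {

    private DirectoryProbe() {
    }

    /**
     * Returns true as soon as one entry name in the directory matches the pattern.
     * A path that is not an existing directory is treated as containing no match.
     */
    static boolean containsMatchingFile(Path directory, Pattern fileNamePattern) throws IOException {
        if (!Files.isDirectory(directory)) {
            return false;
        }
        // Files.list is lazy; try-with-resources closes the underlying directory handle.
        try (Stream<Path> entries = Files.list(directory)) {
            return entries
                    .map(Path::getFileName)
                    .map(Path::toString)
                    .anyMatch(name -> fileNamePattern.matcher(name).matches()); // stops at first match
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("probe");
        Pattern tiff = Pattern.compile(".*\\.tif");
        System.out.println(containsMatchingFile(dir, tiff)); // directory is still empty
        Files.createFile(dir.resolve("00000001.tif"));
        System.out.println(containsMatchingFile(dir, tiff)); // now contains one matching file
    }
}
```

On slow storage, the saving comes from not touching more directory entries than necessary; on fast local disks the difference may well be negligible.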

henning-gerhardt · 2025-02-19T12:26:01Z

In general, I'm unsure whether this is the right way to make the change. The file management module was implemented for accessing files on local storage. For other access types such as S3, the file management interface should be implemented for S3 access, and that implementation must replace the local storage version at runtime, because you cannot have two implementations of the same interface (a well-known but never solved issue). This was the plan during development of the 3.x version. I see now that the DFG project documentation was never published publicly, at least not here: https://github.com/kitodo/kitodo-production/wiki/Developer-Documentation-Kitodo.Production-3.x

So your change might break this general attempt to add a separation layer, but it is also possible that this layer was already broken before your change.

I will try your changes and report back if there are issues or anything else.

BartChris · 2025-02-19T12:54:52Z

Sorry for creating confusion here. I am not using S3 natively; I am mounting it with rclone and a FUSE mount, so my S3 storage acts like a normal file system. This setup just exposes the performance bottleneck of the current code more clearly. The new code should therefore work with all forms of local or mounted network storage and speed things up for those as well.

henning-gerhardt · 2025-02-19T14:26:54Z

Thank you for this explanation; how you got this working is interesting. With this setup, the behaviour should be identical to other local or network-based file systems.

henning-gerhardt

I tried these changes, but I cannot confirm a speed-up on local storage; I had the impression that displaying the list takes more time. The major issue, however, is that processes which are not exportable are now displayed as exportable.

Without your changes, it looks like this for a few processes:

With your changes, the processes in the middle are now exportable, although they contain neither images nor other media files.

Both processes from the audio and video test projects contain no files but are shown as exportable both before and after. But this is another issue.

BartChris · 2025-02-20T09:03:43Z

Thanks for your review. I will look into that. I will probably have to inspect more closely what exactly is happening in the convoluted function, which still seems like absolute overkill for this specific use case, because it retrieves a lot of file metadata that is not needed to check a folder for emptiness.

BartChris · 2025-02-20T14:34:21Z

I tried to extend the logic to fix the wrong result in your case (folders are reported as non-empty although they are empty, so processes are shown as exportable).
In which folders have you put the originals/TIFF files (mine are in images/original)? Maybe my logic did not cover different folder structures. I can imagine that on local storage the speed difference is small or not noticeable. On my S3-based network mount the results are correct, and navigating the process list goes from "unusably slow" (old code) to "quite fast" (new code).

BartChris · 2025-02-20T14:46:28Z

> Both processes from the audio and video test projects contain no files but are shown as exportable both before and after. But this is another issue.

This seems to be intended by the current code. Only files that are used to generate image derivatives are considered here.

kitodo-production/Kitodo/src/main/java/org/kitodo/production/services/data/ProcessService.java

Lines 2841 to 2842 in 6e9e69b

    
    // processes without a generator source should be exportable because they may contain multimedia files
    // that are not used as generator sources

henning-gerhardt · 2025-02-20T14:52:25Z

Kitodo/src/main/java/org/kitodo/production/model/Subfolder.java

+            return !entries
+                    .map(Path::getFileName)
+                    .map(Path::toString)
+                    .anyMatch(name -> query.getRight().matcher(name).matches()); // Stop after first match


IntelliJ IDEA suggests using entries.noneMatch() here instead of !entries.anyMatch().

I got the same recommendation. I was just wondering whether entries.noneMatch() is slower here. But maybe we should trust the IDE more.

I noticed no difference on my local development system. Maybe you can time it on your system with S3 in the background? Maybe the first solution was faster, but maybe there is no difference.

There seems to be no real performance difference:
https://stackoverflow.com/a/57779846
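The equivalence discussed above can be checked with a small, self-contained sketch. The class and method names are made up for illustration, and it works on an in-memory list of names rather than a real directory. It counts how many names each variant tests before returning, to show that both `!anyMatch(...)` and `noneMatch(...)` short-circuit at the first match.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

/** Hypothetical demo class comparing !anyMatch(...) with noneMatch(...). */
final class MatchShortCircuitDemo {

    private MatchShortCircuitDemo() {
    }

    /** Returns true if no name matches; 'tested' counts predicate evaluations. */
    static boolean emptyViaNegatedAnyMatch(List<String> names, Pattern pattern, AtomicInteger tested) {
        return !names.stream().anyMatch(name -> {
            tested.incrementAndGet();
            return pattern.matcher(name).matches();
        });
    }

    /** Same result as above, written with noneMatch as the IDE suggests. */
    static boolean emptyViaNoneMatch(List<String> names, Pattern pattern, AtomicInteger tested) {
        return names.stream().noneMatch(name -> {
            tested.incrementAndGet();
            return pattern.matcher(name).matches();
        });
    }

    public static void main(String[] args) {
        List<String> names = List.of("notes.txt", "00000001.tif", "00000002.tif", "00000003.tif");
        Pattern tiff = Pattern.compile(".*\\.tif");
        AtomicInteger testedAny = new AtomicInteger();
        AtomicInteger testedNone = new AtomicInteger();
        System.out.println(emptyViaNegatedAnyMatch(names, tiff, testedAny) + " after " + testedAny.get() + " tests");
        System.out.println(emptyViaNoneMatch(names, tiff, testedNone) + " after " + testedNone.get() + " tests");
    }
}
```

With this sequential stream, both variants report `false` after testing two names, so the choice between them is a matter of readability rather than speed.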

henning-gerhardt · 2025-02-20T14:53:55Z

Kitodo/src/main/java/org/kitodo/production/model/Subfolder.java

+                    .map(Path::toString)
+                    .anyMatch(name -> query.getRight().matcher(name).matches()); // Stop after first match
+        } catch (IOException e) {
+            return false;


Should a raised IOException be interpreted as a directory with content? I'm unsure about this, but it could be okay. It would be helpful if this exception were logged at trace or debug level.

IOException now marks the folder as having no images.
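One way to combine both points (treat an unreadable folder as having no images, and log the cause at a low level) is sketched below. This is an assumption-laden illustration, not the code in this pull request: `LoggingDirectoryProbe` and `hasMatchingFile` are hypothetical names, and `java.util.logging` is used only to keep the example self-contained, whereas the project may use a different logging framework.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Hypothetical helper: an unreadable folder counts as "no images", and the cause is logged. */
final class LoggingDirectoryProbe {

    private static final Logger LOGGER = Logger.getLogger(LoggingDirectoryProbe.class.getName());

    private LoggingDirectoryProbe() {
    }

    static boolean hasMatchingFile(Path directory, Pattern fileNamePattern) {
        try (Stream<Path> entries = Files.list(directory)) {
            return entries
                    .map(Path::getFileName)
                    .map(Path::toString)
                    .anyMatch(name -> fileNamePattern.matcher(name).matches());
        } catch (IOException | UncheckedIOException e) {
            // Files.list throws IOException when opening the directory fails; errors that occur
            // while iterating the lazy stream surface as UncheckedIOException.
            LOGGER.log(Level.FINE, e, () -> "Could not list " + directory + ", treating it as having no images");
            return false;
        }
    }

    public static void main(String[] args) {
        // A directory that does not exist is reported as having no matching files.
        System.out.println(hasMatchingFile(Path.of("does-not-exist"), Pattern.compile(".*\\.tif")));
    }
}
```

Logging at `FINE` keeps the process list quiet in normal operation while still leaving a trace when a folder cannot be read.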

henning-gerhardt · 2025-02-20T15:01:23Z

> > Both processes from the audio and video test projects contain no files but are shown as exportable both before and after. But this is another issue.

> This seems to be intended by the current code. Only files that are used to generate image derivatives are considered here.

Good to know. Non-image media files are handled differently from image files, but if this needs to change, it should not happen in this pull request.

henning-gerhardt · 2025-02-20T18:54:18Z

Kitodo/src/main/java/org/kitodo/production/model/Subfolder.java

+     * It stops checking as soon as the first file is found.
+     *
+     * @return true if the folder is empty, false if it contains at least one file
+     */


The thrown exception is missing from the JavaDoc.

Speed up process list by faster check for empty directory

461b404

solth requested a review from henning-gerhardt February 19, 2025 11:09

henning-gerhardt suggested changes Feb 20, 2025

BartChris marked this pull request as draft February 20, 2025 09:13

Try to fix folder discovery

03a946a

BartChris force-pushed the speed_up_process_list branch from 4aa52b2 to 03a946a Compare February 20, 2025 14:40

henning-gerhardt reviewed Feb 20, 2025

henning-gerhardt reviewed Feb 20, 2025

apply review fixes

035d2d6

BartChris marked this pull request as ready for review February 20, 2025 15:50

henning-gerhardt reviewed Feb 20, 2025
