Skip to content

Add GitIdentifiers helper#428

Merged
garydgregory merged 14 commits intomasterfrom
feat/git-identifiers
Apr 12, 2026
Merged

Add GitIdentifiers helper#428
garydgregory merged 14 commits intomasterfrom
feat/git-identifiers

Conversation

@ppkarwasz
Copy link
Copy Markdown
Member

@ppkarwasz ppkarwasz commented Apr 9, 2026

This PR adds a GitIdentifiers utility class to support computing SWHID identifiers for a wider range of sources.

The DigestUtils.gitTree method introduced in #427 was limited to directories on the local filesystem. GitIdentifiers
replaces and extends it to also handle virtual directory structures such as archive contents.

New API

blobId: computes a Git/SWHID blob identifier. Four overloads are provided:

Overload Notes
blobId(MessageDigest, byte[]) Content already in memory
blobId(MessageDigest, InputStream) Content buffered from stream
blobId(MessageDigest, long, InputStream) Size known; streams directly without buffering
blobId(MessageDigest, Path) Regular file or symlink

treeId(MessageDigest, Path): computes a Git/SWHID tree identifier for a directory on the filesystem.

treeIdBuilder(MessageDigest): returns a TreeIdBuilder for constructing a tree identifier from any source. The
builder accumulates entries via:

  • addFile(FileMode, String, …): three overloads matching those of blobId above; paths containing / automatically
    create intermediate subdirectories.
  • addDirectory(String): creates a subdirectory node and returns its builder; accepts multi-level paths.
  • build(): computes and returns the tree identifier.

FileMode: enum with values REGULAR, EXECUTABLE, SYMBOLIC_LINK, and DIRECTORY.

Example

TreeIdBuilder builder = GitIdentifiers.treeIdBuilder(DigestUtils.getSha1Digest());
try (TarArchiveInputStream tar = new TarArchiveInputStream(new GzipCompressorInputStream(input))) {
    TarArchiveEntry entry;
    while ((entry = tar.getNextTarEntry()) != null) {
        String name = entry.getName();
        if (name.isEmpty()) {
            continue; // root directory entry
        }
        if (entry.isDirectory()) {
            builder.addDirectory(name);
        } else if (entry.isSymbolicLink()) {
            builder.addSymbolicLink(name, entry.getLinkName());
        } else {
            FileMode mode = (entry.getMode() & 0111) != 0 ? FileMode.EXECUTABLE : FileMode.REGULAR;
            builder.addFile(mode, name, entry.getSize(), tar);
        }
    }
}

This change moves `gitBlob` and `gitTree` from `DigestUtils` into a separate utility class, to prepare for an enhancement of the provided API.

The git tree identifier can be computed for many objects: the most natural is a directory in a filesystem, but we can also compute the identifier on an archive containing this directory. Additional usages will require expanding the API, beyond what can be reasonably contained in `DigestUtils`.
The `OpenOption` parameters are not very useful, since files are usually opened read-only.
This change adds a `GitIdentifiers.TreeIdBuilder` class to allow the computation of a SWHID identifier from an archive.
Copy link
Copy Markdown
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ppkarwasz

Thank you for the follow up PR 😄

I had an initial pass this PR. I have high-level and low-level comments. The low-level comments are in the PR. At the high-level, I wonder if there is some YAGNI here with new APIs for all of byte[], Path, and InputStream. At least there isn't File 😉

I am not sure this needs to be that general an API that needs to deal with all of byte[], Path, and InputStream`. I would either:

  • Base general input processing using IO's builder package, or
  • Pair down the API to only what the Maven plugin needs for build attestations.
    WDYT?

* A supplier of a blob identifier that may throw {@link IOException}.
*/
@FunctionalInterface
private interface BlobIdSupplier {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be a Commons IO IOSupplier.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know, but do we want to force a dependency on commons-io for something users might never use?

Comment thread src/main/java/org/apache/commons/codec/digest/GitIdentifiers.java Outdated
Comment thread src/main/java/org/apache/commons/codec/digest/GitIdentifiers.java
Comment thread src/main/java/org/apache/commons/codec/digest/GitIdentifiers.java Outdated
@ppkarwasz
Copy link
Copy Markdown
Member Author

I am not sure this needs to be that general an API that needs to deal with all of byte[], Path, and InputStream. I would either:

  • Base general input processing using IO's builder package, or

  • Pair down the API to only what the Maven plugin needs for build attestations.

WDYT?

I would rather not add a dependency on IO at this point, but I would like to handle at least two use cases:

  • Computing treeId of a Path, which is what I use in the Maven plugin. All the blobId methods are currently not useful and they could be private, but I think exposing them gives a bigger picture.

  • Computing the id of a TAR or ZIP archive with decompressing them. For this I only need treeIdBuilder, addDirectory and addFile(FileMode, String, long, InputStream). This is because archive formats usually have the uncompressed size of the content, which makes computation more efficient.

We can probably remove the methods with byte[] and InputStream without a length.

@garydgregory
Copy link
Copy Markdown
Member

I am not sure this needs to be that general an API that needs to deal with all of byte[], Path, and InputStream. I would either:

  • Base general input processing using IO's builder package, or
  • Pair down the API to only what the Maven plugin needs for build attestations.

WDYT?

I would rather not add a dependency on IO at this point, but I would like to handle at least two use cases:

  • Computing treeId of a Path, which is what I use in the Maven plugin. All the blobId methods are currently not useful and they could be private, but I think exposing them gives a bigger picture.
  • Computing the id of a TAR or ZIP archive with decompressing them. For this I only need treeIdBuilder, addDirectory and addFile(FileMode, String, long, InputStream). This is because archive formats usually have the uncompressed size of the content, which makes computation more efficient.

We can probably remove the methods with byte[] and InputStream without a length.

OK, then let's reduce the public footprint to only what's needed for the Maven plugin.

@ppkarwasz
Copy link
Copy Markdown
Member Author

@garydgregory,

I have reduced the public footprint slightly, by removing the methods with an InputStream of unknown size: this is not needed in practice and its implementation requires loading the entire stream into memory.

What is left is potentially useful:

  • For the Maven plugin I only need treeId(MessageDigest, Path), however blobId(MessageDigest, Path) is potentially useful to users that want to compute both dir and cnt SWHID identifiers.

  • For the use can I gave in the description of the PR (computing the tree identifier of an archive without extracting its content) additional methods are necessary:

    • treeIdBuilder(MessageDigest) to bootstrap the process,
    • addDirectory and addFile(FileMode, String, long, InputStream) to register the various entries of the archive. The blobId(MessageDigest, long, InputStream) is used by the implementation of the latter, so it is worth exposing.
    • (optional) I added an addSymbolicLink method, which, although not strictly necessary, adds a certain elegance to the Commons Compress-based example I gave at the beginning.
  • If I count correctly, this leaves blobId(MessageDigest, byte[]) and addFile(FileMode, String, byte[]), which are low-level operations and are left to allow users to use the API, even if the other methods don't fit their use case.

@ppkarwasz
Copy link
Copy Markdown
Member Author

ppkarwasz commented Apr 11, 2026

I think the two main points we should discuss, before releasing this API are:

  1. The sanitization of . and .. segment paths, with are not allowed. These segments do no appear naturally, when traversing a Path. A . segment, however, could appear in some archives: should we allow it? A .. segment, however, has no place in a non-malicious archive and I don't think implementing path normalization logic here is appropriate: what do you think?

  2. TreeIdBuilder does not have a fluent API. We could introduce something like:

    builder.addFile(...)
            .startDirectory("foo")
            .addFile(...)
            .endDirectory()
            .addFile(...)

    but this seems overkill. Besides, this would introduce a problem: currently addDirectory can create multiple nested directories at once, so implementing endDirectory would add unnecessary complexity. The reason the API allows to traverse multiple directory levels in one call is that archives do not necessarily contain directories: a TAR file with just a foo/bar/baz.txt directory is just fine. I didn't want to bother the user with creating intermediate directories.

@garydgregory
Copy link
Copy Markdown
Member

The sanitization of . and .. segment paths, with are not allowed. These segments do no appear naturally, when traversing a Path. A . segment, however, could appear in some archives: should we allow it? A .. segment, however, has no place in a non-malicious archive and I don't think implementing path normalization logic here is appropriate: what do you think?

If normalizing . has 0% chance of malicious side-effects (symbolic link?) then we should normalize it. Isn't that the case?

TreeIdBuilder does not have a fluent API.

I think we can go YAGNI here.

@ppkarwasz
Copy link
Copy Markdown
Member Author

If normalizing . has 0% chance of malicious side-effects (symbolic link?) then we should normalize it. Isn't that the case?

This class doesn't write anything to disk nor it follows symlinks. The . and .. path segments are of no concern for the security of this class.

The reason I don't want .. in path names is that it would add complexity for a case that doesn't seem legitimate. An archive entry like foo/../bar or ../../etc/passwd looks more like a malicious entry. GNU TAR warns about such entries, this API will throw if a user passes such an entry.

I allowed the presence of . in b305895

@ppkarwasz
Copy link
Copy Markdown
Member Author

Using the current version of this API I successfully verified that the Maven tar.gz and zip distributions and an unpacked Maven installation have the same content.

This should be enough for the Maven plugin.

@garydgregory garydgregory merged commit 387120c into master Apr 12, 2026
9 checks passed
@garydgregory
Copy link
Copy Markdown
Member

OK, we are good to then. Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants