Add `GitIdentifiers` helper by ppkarwasz · Pull Request #428 · apache/commons-codec

ppkarwasz · 2026-04-09T19:58:14Z

This PR adds a GitIdentifiers utility class to support computing SWHID identifiers for a wider range of sources.

The DigestUtils.gitTree method introduced in #427 was limited to directories on the local filesystem. GitIdentifiers
replaces and extends it to also handle virtual directory structures such as archive contents.

New API

blobId: computes a Git/SWHID blob identifier. Four overloads are provided:

Overload	Notes
`blobId(MessageDigest, byte[])`	Content already in memory
`blobId(MessageDigest, InputStream)`	Content buffered from stream
`blobId(MessageDigest, long, InputStream)`	Size known; streams directly without buffering
`blobId(MessageDigest, Path)`	Regular file or symlink

treeId(MessageDigest, Path): computes a Git/SWHID tree identifier for a directory on the filesystem.

treeIdBuilder(MessageDigest): returns a TreeIdBuilder for constructing a tree identifier from any source. The
builder accumulates entries via:

addFile(FileMode, String, …): three overloads matching those of blobId above; paths containing / automatically
create intermediate subdirectories.
addDirectory(String): creates a subdirectory node and returns its builder; accepts multi-level paths.
build(): computes and returns the tree identifier.

FileMode: enum with values REGULAR, EXECUTABLE, SYMBOLIC_LINK, and DIRECTORY.

Example

TreeIdBuilder builder = GitIdentifiers.treeIdBuilder(DigestUtils.getSha1Digest());
try (TarArchiveInputStream tar = new TarArchiveInputStream(new GzipCompressorInputStream(input))) {
    TarArchiveEntry entry;
    while ((entry = tar.getNextTarEntry()) != null) {
        String name = entry.getName();
        if (name.isEmpty()) {
            continue; // root directory entry
        }
        if (entry.isDirectory()) {
            builder.addDirectory(name);
        } else if (entry.isSymbolicLink()) {
            builder.addSymbolicLink(name, entry.getLinkName());
        } else {
            FileMode mode = (entry.getMode() & 0111) != 0 ? FileMode.EXECUTABLE : FileMode.REGULAR;
            builder.addFile(mode, name, entry.getSize(), tar);
        }
    }
}

This change moves `gitBlob` and `gitTree` from `DigestUtils` into a separate utility class, to prepare for an enhancement of the provided API. The git tree identifier can be computed for many objects: the most natural is a directory in a filesystem, but we can also compute the identifier on an archive containing this directory. Additional usages will require expanding the API, beyond what can be reasonably contained in `DigestUtils`.

The `OpenOption` parameters are not very useful, since files are usually opened read-only.

This change adds a `GitIdentifiers.TreeIdBuilder` class to allow the computation of a SWHID identifier from an archive.

garydgregory

Hi @ppkarwasz

Thank you for the follow up PR 😄

I had an initial pass this PR. I have high-level and low-level comments. The low-level comments are in the PR. At the high-level, I wonder if there is some YAGNI here with new APIs for all of byte[], Path, and InputStream. At least there isn't File 😉

I am not sure this needs to be that general an API that needs to deal with all of byte[], Path, and InputStream`. I would either:

Base general input processing using IO's builder package, or
Pair down the API to only what the Maven plugin needs for build attestations.
WDYT?

garydgregory · 2026-04-09T20:11:21Z

+         * A supplier of a blob identifier that may throw {@link IOException}.
+         */
+        @FunctionalInterface
+        private interface BlobIdSupplier {


This could just be a Commons IO IOSupplier.

Yes, I know, but do we want to force a dependency on commons-io for something users might never use?

ppkarwasz · 2026-04-09T21:17:53Z

I am not sure this needs to be that general an API that needs to deal with all of byte[], Path, and InputStream. I would either:

Base general input processing using IO's builder package, or

Pair down the API to only what the Maven plugin needs for build attestations.

WDYT?

I would rather not add a dependency on IO at this point, but I would like to handle at least two use cases:

Computing treeId of a Path, which is what I use in the Maven plugin. All the blobId methods are currently not useful and they could be private, but I think exposing them gives a bigger picture.
Computing the id of a TAR or ZIP archive with decompressing them. For this I only need treeIdBuilder, addDirectory and addFile(FileMode, String, long, InputStream). This is because archive formats usually have the uncompressed size of the content, which makes computation more efficient.

We can probably remove the methods with byte[] and InputStream without a length.

garydgregory · 2026-04-09T21:44:39Z

I am not sure this needs to be that general an API that needs to deal with all of byte[], Path, and InputStream. I would either:

Base general input processing using IO's builder package, or

Pair down the API to only what the Maven plugin needs for build attestations.

WDYT?

I would rather not add a dependency on IO at this point, but I would like to handle at least two use cases:

Computing treeId of a Path, which is what I use in the Maven plugin. All the blobId methods are currently not useful and they could be private, but I think exposing them gives a bigger picture.

Computing the id of a TAR or ZIP archive with decompressing them. For this I only need treeIdBuilder, addDirectory and addFile(FileMode, String, long, InputStream). This is because archive formats usually have the uncompressed size of the content, which makes computation more efficient.

We can probably remove the methods with byte[] and InputStream without a length.

OK, then let's reduce the public footprint to only what's needed for the Maven plugin.

ppkarwasz · 2026-04-11T18:39:32Z

@garydgregory,

I have reduced the public footprint slightly, by removing the methods with an InputStream of unknown size: this is not needed in practice and its implementation requires loading the entire stream into memory.

What is left is potentially useful:

For the Maven plugin I only need treeId(MessageDigest, Path), however blobId(MessageDigest, Path) is potentially useful to users that want to compute both dir and cnt SWHID identifiers.
For the use can I gave in the description of the PR (computing the tree identifier of an archive without extracting its content) additional methods are necessary:
- treeIdBuilder(MessageDigest) to bootstrap the process,
- addDirectory and addFile(FileMode, String, long, InputStream) to register the various entries of the archive. The blobId(MessageDigest, long, InputStream) is used by the implementation of the latter, so it is worth exposing.
- (optional) I added an addSymbolicLink method, which, although not strictly necessary, adds a certain elegance to the Commons Compress-based example I gave at the beginning.
If I count correctly, this leaves blobId(MessageDigest, byte[]) and addFile(FileMode, String, byte[]), which are low-level operations and are left to allow users to use the API, even if the other methods don't fit their use case.

ppkarwasz · 2026-04-11T18:57:14Z

I think the two main points we should discuss, before releasing this API are:

The sanitization of . and .. segment paths, with are not allowed. These segments do no appear naturally, when traversing a Path. A . segment, however, could appear in some archives: should we allow it? A .. segment, however, has no place in a non-malicious archive and I don't think implementing path normalization logic here is appropriate: what do you think?
TreeIdBuilder does not have a fluent API. We could introduce something like:
```
builder.addFile(...)
        .startDirectory("foo")
        .addFile(...)
        .endDirectory()
        .addFile(...)
```
but this seems overkill. Besides, this would introduce a problem: currently addDirectory can create multiple nested directories at once, so implementing endDirectory would add unnecessary complexity. The reason the API allows to traverse multiple directory levels in one call is that archives do not necessarily contain directories: a TAR file with just a foo/bar/baz.txt directory is just fine. I didn't want to bother the user with creating intermediate directories.

garydgregory · 2026-04-11T19:26:28Z

The sanitization of . and .. segment paths, with are not allowed. These segments do no appear naturally, when traversing a Path. A . segment, however, could appear in some archives: should we allow it? A .. segment, however, has no place in a non-malicious archive and I don't think implementing path normalization logic here is appropriate: what do you think?

If normalizing . has 0% chance of malicious side-effects (symbolic link?) then we should normalize it. Isn't that the case?

TreeIdBuilder does not have a fluent API.

I think we can go YAGNI here.

ppkarwasz · 2026-04-11T21:11:01Z

If normalizing . has 0% chance of malicious side-effects (symbolic link?) then we should normalize it. Isn't that the case?

This class doesn't write anything to disk nor it follows symlinks. The . and .. path segments are of no concern for the security of this class.

The reason I don't want .. in path names is that it would add complexity for a case that doesn't seem legitimate. An archive entry like foo/../bar or ../../etc/passwd looks more like a malicious entry. GNU TAR warns about such entries, this API will throw if a user passes such an entry.

I allowed the presence of . in b305895

ppkarwasz · 2026-04-11T21:41:13Z

Using the current version of this API I successfully verified that the Maven tar.gz and zip distributions and an unpacked Maven installation have the same content.

This should be enough for the Maven plugin.

garydgregory · 2026-04-12T11:07:15Z

OK, we are good to then. Merged.

ppkarwasz added 6 commits April 9, 2026 13:58

fix: remove OpenOption parameters

940ef4c

The `OpenOption` parameters are not very useful, since files are usually opened read-only.

fix: remove useless @SInCE annotations

1d44d87

fix: make GitDirectoryEntry internal

92bf278

feat: add GitIdBuilder

e2554aa

This change adds a `GitIdentifiers.TreeIdBuilder` class to allow the computation of a SWHID identifier from an archive.

fix: SpotBug error

587369c

garydgregory requested changes Apr 9, 2026

View reviewed changes

ppkarwasz added 4 commits April 9, 2026 22:29

fix: try fix Java 8 javadoc error

378e915

fix: extract getGitPrefix again

4585d6b

fix: improve encapsulation

faafda8

fix: Javadoc on Java 8

3eaa9ba

ppkarwasz commented Apr 9, 2026

View reviewed changes

Comment thread src/main/java/org/apache/commons/codec/digest/GitIdentifiers.java Outdated

ppkarwasz added 3 commits April 11, 2026 19:46

fix: remove blobId(MessageDigest, InputStream)

9064aa2

fix: add addSymbolicLink helper

235142d

fix: simplify populateFromPath and blobId

31a2792

fix: ignore . path segment

b305895

garydgregory merged commit 387120c into master Apr 12, 2026
9 checks passed

garydgregory deleted the feat/git-identifiers branch April 12, 2026 11:07

garydgregory added a commit that referenced this pull request Apr 12, 2026

Update action description for Add GitIdentifiers helper #428

27b228d

ppkarwasz mentioned this pull request Apr 13, 2026

Add SWHID identifiers for release verification apache/tooling-trusted-releases#1154

Open

Conversation

ppkarwasz commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

New API

Example

Uh oh!

garydgregory left a comment

Choose a reason for hiding this comment

Uh oh!

garydgregory Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

ppkarwasz Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ppkarwasz commented Apr 9, 2026

Uh oh!

garydgregory commented Apr 9, 2026

Uh oh!

ppkarwasz commented Apr 11, 2026

Uh oh!

ppkarwasz commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

garydgregory commented Apr 11, 2026

Uh oh!

ppkarwasz commented Apr 11, 2026

Uh oh!

ppkarwasz commented Apr 11, 2026

Uh oh!

Uh oh!

garydgregory commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ppkarwasz commented Apr 9, 2026 •

edited

Loading

ppkarwasz commented Apr 11, 2026 •

edited

Loading