Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] rebase -m: partial support for copying extra commit headers #1902

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

phillipwood
Copy link

@phillipwood phillipwood commented Apr 7, 2025

This patch is largely a response to https://lore.kernel.org/git/[email protected]/ . I'm in two minds about whether we should consider merging such partial support but if it helps forges preserve extra commit headers then it may well be worth it.

Cc: Patrick Steinhardt [email protected]
Cc: Elijah Newren [email protected]
cc: "brian m. carlson" [email protected]
cc: Phillip Wood [email protected]

There are external projects such as GitButler that add extra headers to
the commit objects that they create. Unfortunately these headers are
lost when the user runs "git rebase". In the absence of merge conflicts
copying these headers across to the rebased commit is relatively
straight forward as the sequencer creates the rebased commits using
commit_tree_extended() rather than forking "git commit". Preserving them
in the presence of merge conflict would mean that we either need to add
a way to creating extra headers when running "git commit" or modifying
the sequencer to stop using git commit when the commit message needs to
be edited. Both of these options involve a significant amount of more
work.

While losing the extra headers if there are merge conflicts is a
significant shortcoming for users rebasing their branches it is not such
a problem on the server where forges typically reject a rebase if it has
conflicts. Therefore even though this commit is far from a complete
solution it is a significant improvement for forges that have not yet
moved to using "git replay" which already preserves extra commit
headers.

In the long run we would also want to preserve extra headers when
cherry-picking a commit. As we cannot currently preserve extra headers
when the user wishes to edit the commit message of the cherry-picked
commit this patch only changes the behavior of "git rebase"

Signed-off-by: Phillip Wood <[email protected]>
@phillipwood
Copy link
Author

/submit

Copy link

gitgitgadget bot commented Apr 7, 2025

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1902/phillipwood/rebase-preserve-extra-commit-headers-v1

To fetch this version to local tag pr-1902/phillipwood/rebase-preserve-extra-commit-headers-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1902/phillipwood/rebase-preserve-extra-commit-headers-v1

Copy link

gitgitgadget bot commented Apr 8, 2025

On the Git mailing list, "brian m. carlson" wrote (reply to this):

On 2025-04-07 at 15:52:43, Phillip Wood via GitGitGadget wrote:
> From: Phillip Wood <[email protected]>
> 
>     [RFC] rebase -m: partial support for copying extra commit headers
>     
>     This patch is largely a response to
>     https://lore.kernel.org/git/[email protected]/ . I'm in two minds
>     about whether we should consider merging such partial support but if it
>     helps forges preserve extra commit headers then it may well be worth it.

I'd like to see command-line options to control this and ideally a
configuration option.  Right now, we know nothing about these extra
headers, including an expected format.  If a future version of Git (say,
3.0) adds a new header and the user includes invalid data in this extra
header (which happens all the time with author and committer
information), then 2.50 will propagate it on rebase and it won't be
fixed until the user uses a version of Git that understands the header
and can fsck it correctly.  That's not really great, since it means we
can unknowingly spread corruption.

I am pretty sure that at $DAYJOB we'll need to have a discussion about
whether we want to propagate these headers during rebase and I'm
personally leaning against it.

Why, you ask?  I've seen at least the following types of corruption:

* Missing timezones
* Timezones with less than four digits
* Valid timezones padded to more than four digits with zeros
* Timezones which don't exist and never have (e.g., +1700)
* Timezones which are so absurdly large that they push the date to a
  year when nobody alive now will still be living
* Date stamps that are larger than 2^64
* Date stamps which are smaller than 2^64 but beyond the expected life
  of the Sun
* Extra angle brackets in the email field
* Nothing in between the email brackets
* Nothing before the email brackets (no name at all)
* Names which are not UTF-8 but without an encoding header
* Names which are not valid in the specified encoding
* Emails which are not valid UTF-8[0]
* Emails which don't meet the (ludicrously generous to the point of
  being nearly unparseable) RFC production
* Encodings which are not valid IANA charsets
* Messages with no body and no blank line (just the newline at the end
  of the final header)
* gpgsig headers that include random non-ASCII bytes and control
  characters[1]

Note that all of these must be parsed in some meaningful way because
users don't want their forge to serve them a 500 despite them having
sent wildly invalid data.  I encountered these during part of our
transition from Rugged to Git (reftable, SHA-256) and they definitely
added a lot of interesting complications (plus the need for lots of
tests).

Considering that writing valid data should not be that hard[2] (and
should definitely be a priority) but apparently is for many people, I'm
very wary of us propagating headers we're not ready to fsck and I'd like
to have an out for users and forges who would like to be a little more
careful.

With those constraints, I'm not totally opposed in principle to this
feature.

I see Patrick is CC'd here and I'm interested in his thoughts, as well
as, of course, those of anyone else as well.

[0] SMTPUTF8 (RFC 6531 et al.) specifies that mailbox names may now
contain UTF-8.  For instance, you can email 🔵 at this domain and it
will be delivered to me.
[1] It should be noted that all of our signing implementations produce
only non-control printable ASCII characters plus newline.
[2] Maybe I am being unfair or unduly harsh to implementers, but I do at
least agree with the half of Postel's Law that says that one should be
conservative with what one generates and I would hope others do as well.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

Copy link

gitgitgadget bot commented Apr 8, 2025

User "brian m. carlson" <[email protected]> has been added to the cc: list.

Copy link

gitgitgadget bot commented Apr 8, 2025

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi brian

On 08/04/2025 02:22, brian m. carlson wrote:
> On 2025-04-07 at 15:52:43, Phillip Wood via GitGitGadget wrote:
>> From: Phillip Wood <[email protected]>
>>
>>      [RFC] rebase -m: partial support for copying extra commit headers
>>      >>      This patch is largely a response to
>>      https://lore.kernel.org/git/[email protected]/ . I'm in two minds
>>      about whether we should consider merging such partial support but if it
>>      helps forges preserve extra commit headers then it may well be worth it.
> > I'd like to see command-line options to control this and ideally a
> configuration option.  Right now, we know nothing about these extra
> headers, including an expected format.  If a future version of Git (say,
> 3.0) adds a new header and the user includes invalid data in this extra
> header (which happens all the time with author and committer
> information), then 2.50 will propagate it on rebase and it won't be
> fixed until the user uses a version of Git that understands the header
> and can fsck it correctly.  That's not really great, since it means we
> can unknowingly spread corruption.

We could certainly add some way to make this opt-in if there is a desire for it and you make a good point about compatibility if we add a new commit header. I'm not sure I'd describe preserving these headers when rebasing as spreading corruption though as we're simply rewriting existing commits. If the user chose to merge rather than rebase we'd still have the same issues without creating any new commits.

> I am pretty sure that at $DAYJOB we'll need to have a discussion about
> whether we want to propagate these headers during rebase and I'm
> personally leaning against it.

My understanding is that GitHub has been using "git replay" for rebases and therefore copying extra commit headers since the middle of 2023 [1,2]. The message I linked to in my original mail suggests that the "change-id" header is preserved when rebasing on GitHub.

> Why, you ask?  I've seen at least the following types of corruption:
> > * Missing timezones
> * Timezones with less than four digits
> * Valid timezones padded to more than four digits with zeros
> * Timezones which don't exist and never have (e.g., +1700)
> * Timezones which are so absurdly large that they push the date to a
>    year when nobody alive now will still be living
> * Date stamps that are larger than 2^64
> * Date stamps which are smaller than 2^64 but beyond the expected life
>    of the Sun
> * Extra angle brackets in the email field
> * Nothing in between the email brackets
> * Nothing before the email brackets (no name at all)
> * Names which are not UTF-8 but without an encoding header
> * Names which are not valid in the specified encoding
> * Emails which are not valid UTF-8[0]
> * Emails which don't meet the (ludicrously generous to the point of
>    being nearly unparseable) RFC production
> * Encodings which are not valid IANA charsets
> * Messages with no body and no blank line (just the newline at the end
>    of the final header)
> * gpgsig headers that include random non-ASCII bytes and control
>    characters[1]

Thanks for sharing that, it is an interesting list. On the subject of encoding I do think our documentation could be clearer that the encoding applies to all the headers as well as the commit message. As far as I can see it only mentions the commit message, not the author or committer identities but repo_logmsg_reencode() re-encodes the whole commit buffer. Out of interest do you think we could be doing a better job with fsck to pick up some of these problems earlier?

I think "git rebase" only cares that the author identity can be parsed by split_ident() which is fairly lenient.

> I see Patrick is CC'd here and I'm interested in his thoughts, as well
> as, of course, those of anyone else as well.

Yes me too

Thanks for your thoughtful and intereting reply

Phillip

[1] https://github.blog/changelog/2023-06-28-rebase-commits-now-created-using-the-merge-ort-strategy/
[2] https://github.blog/engineering/infrastructure/scaling-merge-ort-across-github/

Copy link

gitgitgadget bot commented Apr 8, 2025

User Phillip Wood <[email protected]> has been added to the cc: list.

Copy link

gitgitgadget bot commented Apr 8, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

Phillip Wood <[email protected]> writes:

> Thanks for sharing that, it is an interesting list. On the subject of
> encoding I do think our documentation could be clearer that the
> encoding applies to all the headers as well as the commit message. As
> far as I can see it only mentions the commit message, not the author
> or committer identities but repo_logmsg_reencode() re-encodes the
> whole commit buffer. Out of interest do you think we could be doing a
> better job with fsck to pick up some of these problems earlier?
>
> I think "git rebase" only cares that the author identity can be parsed
> by split_ident() which is fairly lenient.

"rebase" already knows that it has to be picky which header fields
need to be propagated and which must not be, doesn't it?  Can the
same be said for arbitrary "extra" header fields?

Information on some of the header fields are inherently destroyed
when you refine an existing commit.  The value on the 'parent'
headers may need to be updated (unless "rebase" is fast-forwarding
an earlier part of the changes on the same base), the 'author'
information usually wants to be preserved, but when the scale of the
change since the previous iteration is so large, you may give it a
new authorship, the 'committer' information should record who
created the new commit object that records the result of rebasing,
the 'gpgsig' and 'gigsig-sha256' header fields would lose validity
if you are creating a new object that is different from the original
by even a single bit (if we are somehow recording which predecessor
commit the new one replaces, it certainly is safe to drop these that
have lost validity, as we can go back to the predecessor to see it
has a valid signature, and the change in the new commit that lost
the signature fields is the moral equivalent of the original.
Otherwise, carrying a stale signature may serve as a reminder that
the commit was rewritten in the past---I dunno).  And so on.

Now, one thing that worries me is this.  If "extra" commit headers
include truly extra fields with unknown semantics, the machinery
cannot tell which ones are safe and benefitial to propagate.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant