Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make remote follow username regex comply to rfc 7565 #3528

Conversation

timothyjrogers
Copy link
Contributor

@timothyjrogers timothyjrogers commented Mar 27, 2025

Description

This change updates the username parsing for remote follows, which is based on the acct: property of a Fediverse user, to allow more than just alphanumeric characters and instead allow for all characters allowed for in RFC 7565.

What type of Pull Request is this?

  • [ x] Bug Fix
  • Enhancement
  • Plumbing / Internals / Dependencies
  • Refactor

Does this PR change settings or dependencies, or break something?

  • This PR changes or adds default settings, configuration, or .env values
  • This PR changes or adds dependencies
  • This PR introduces other breaking changes

Details of breaking or configuration changes (if any of above checked)

Documentation

  • New or amended documentation will be required if this PR is merged
  • I have created a matching pull request in the Documentation repository
  • I intend to create a matching pull request in the Documentation repository after this PR is merged

Tests

  • [x ] My changes do not need new tests
  • All tests I have added are passing
  • I have written tests but need help to make them pass
  • I have not written tests and need help to write them

@timothyjrogers
Copy link
Contributor Author

The failing unit test here is also failing on main, I have raisedan issue and it looks like that's already being fixed on #3431.

@mouse-reeve I could use your input on the other two issues. The reason it's so long is because I'm capturing the entire RFC7565 allowed character set, but I see your regex for local names is only alpha numbeic plus dashes, dots, and underscores. If your intent for Bookwyrm is to only allow those (Mastodon does similar) then easy enough I'll just trim the regex down to that.

@hughrun
Copy link
Contributor

hughrun commented Mar 30, 2025

@timothyjrogers Thanks for looking at this.

I'm certainly not a regex expert, but a couple of things that may be helpful and/or need to be made consistent:

Our UsernameField enforces a maximum of 150 characters for a username, which will apply to any users brought in to a BookWyrm server:

class UsernameField(ActivitypubFieldMixin, models.CharField):
    """activitypub-aware username field"""

    def __init__(self, activitypub_field="preferredUsername", **kwargs):
        self.activitypub_field = activitypub_field
        super(ActivitypubFieldMixin, self).__init__(
            _("username"),
            max_length=150,
            unique=True,
            validators=[validate_username],
            error_messages={
                "unique": _("A user with that username already exists."),
            },
        )

GitHub is also complaining that your regex allows an unlimited number of dashes, so if it's possible to put a character limit on the regex check itself that would be handy.

You'll also notice that UsernameField uses a validate_username validator:

def validate_username(value):
    """make sure usernames look okay"""
    if not re.match(r"^[A-Za-z\-_\.0-9]+@[A-Za-z\-_\.0-9]+\.[a-z]{2,}$", value):
        raise ValidationError(
            _("%(value)s is not a valid username"),
            params={"value": value},
        )

To my eyes this looks a little more similar to yours than to the current regex in the follow view, so maybe we just need to align them?

@timothyjrogers
Copy link
Contributor Author

Thank you for the feedback @hughrun and @ilkka-ollakka , excellent points. I did not see the username limit in the model, I've clamped the regex so it will only capture 150 characters or less. I also moved the regex over into the regex utility, and split it up into multiline strings to make it easier to read.

I do generally agree with @ilkka-ollakka that there is probably a cleaner way to do this -- clamping the length on that regex will still persist this problem if there was a user with a 151+ character username, maybe on an instance running a modified fork or something. Some extra parsing and validating logic wouldn't go amiss in the follow controller, but that's a much less likely scenario and this bug was just about usernames being chopped at certain characters so I think maybe a separate issue could be logged to improve the validation in general?

In any case I've tightened it up enough to make CodeQL and the linter happy, and confirmed locally that it will recognize remote usernames compliant to the spec and the character limit. If we want to address additional validation here I'm happy to do so, otherwise we can log another issue. Thanks again!

@hughrun
Copy link
Contributor

hughrun commented Mar 31, 2025

Thanks @timothyjrogers

I've just realised that I didn't properly think through your original question:

The reason it's so long is because I'm capturing the entire RFC7565 allowed character set, but I see your regex for local names is only alpha numbeic plus dashes, dots, and underscores. If your intent for Bookwyrm is to only allow those (Mastodon does similar) then easy enough I'll just trim the regex down to that.

The answer to this is definitely "yes, the intent is to limit the character-set", though we may be able to be slightly more generous. Whilst we'd prefer to be compliant with standards where possible, in this case the generosity of RFC7565 is in conflict with the limitations of RFC3986, because usernames become part of URIs for various things. That means we can't allow any of these characters reserved in RFC3986 for usernames because they will potentially break the URIs:

:, /, ?, #, [, ], @, !, $ &, ', (, ), * +, ,, ;, =

Because a user account is created for every remote user when they are followed, that means we can't allow these characters in usernames anywhere.

My apologies for not picking up on this earlier.

@timothyjrogers
Copy link
Contributor Author

Thanks @hughrun ! That makes sense. Given that, I have restricted the remote name regex as described above to just \w plus dots, dashes, and tildes.

@hughrun hughrun merged commit 8830432 into bookwyrm-social:main Apr 1, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remote follow does not accept usernames with dots
3 participants