Proposal: Public Suffix API #676

Open: wants to merge 7 commits into base: main

mckenfra

This formalizes #231 into a concrete proposal.

@Rob--W Rob--W requested review from oliverdunk and xeenon August 20, 2024 12:29
Member

@oliverdunk oliverdunk left a comment

Thanks for this! I will reach out to the PSL maintainers I have been in contact with to ask them to take a look. I'll also share this internally to get an overall opinion from Chrome.

@mckenfra mckenfra changed the title Add Public Suffix API proposal Proposal: Public Suffix API Aug 23, 2024
Member

@Rob--W Rob--W left a comment

On the current proposal, my main requests for changes are:

  • if other browser vendors are onboard, I prefer a synchronous API over an asynchronous one (and with that the bulk query method would be redundant)
  • strict option may be too broad, and I'd favor splitting it up into specific options to allow extensions to customize their behavior
  • parse method has limited utility. Drop it; I expect the only given use case ("search-vs-navigate") to be covered by the broader options to getRegistrableDomain.

First, to help with minimizing the amount of back and forth, here is what I am proposing (and I have chatted with @simon-friedberger to confirm that he's okay with the design):

browser.publicSuffix.isKnownPublicSuffix(string) -> boolean
browser.publicSuffix.getKnownPublicSuffix(string) -> string | null
browser.publicSuffix.getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null

The return values are null to allow extensions to use the API like getXXX(val) ?? val, but I'm also willing to consider empty string or throwing an error.

isKnownPublicSuffix returns whether a specific string is on the PSL. I include this here because the PSL algorithm returns the longest matching suffix, but sometimes one may be interested in knowing whether there is any other shorter suffix that might have been a valid domain in theory. E.g. github.io is a public suffix itself, but could also be interpreted as an eTLD+1 (under the shorter suffix io), depending on the use case.

The getRegistrableDomain method follows the definition of "registrable domain" from the URL spec: https://url.spec.whatwg.org/#host-miscellaneous by default. However, that interpretation is too strict for some use cases, hence the extra options.

For the use case of "search or navigate", allowIP and allowPlainSuffix would be set to true, but allowUnknownSuffix to false. Alternatively, they could use the getKnownPublicSuffix method.

allowIP exists because IP addresses have to be special-cased by the extension. For most domain inputs, one could split at dots to try and get a different domain level, but that logic does not make sense for IP addresses. If this distinction is unimportant, this option can be dropped and merged with allowUnknownSuffix.

The allowUnknownSuffix option exists to exclude non-domains with unknown suffix such as green.banana, otherwise getRegistrableDomain would effectively return a string for almost every input.

The allowPlainSuffix option only exists because there are domains that do not have an eTLD+1 but can still be navigated to, such as github.io and blogspot.com. These examples are public suffixes themselves, but there is no +1 in eTLD+1.

Examples for getKnownPublicSuffix(string) -> string | null :

  • github.io -> github.io
  • foo.github.io -> github.io
  • facebook.co.uk -> co.uk
  • 192.168.2.1 -> null
  • green.banana -> null

Semantics for isKnownPublicSuffix(string) -> boolean :

  • True iff getKnownPublicSuffix(string) returns the input string.

Examples for getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null

  • github.io -> null (with allowPlainSuffix=true -> github.io)
  • foo.github.io -> foo.github.io
  • facebook.co.uk -> facebook.co.uk
  • 192.168.2.1 -> null (with allowIP=true -> 192.168.2.1)
  • green.banana -> null (with allowUnknownSuffix=true -> green.banana)
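
As a rough illustration of how the three proposed methods relate to each other, here is a minimal sketch that reproduces the examples above. It is not the proposal's implementation: `TOY_SUFFIXES` is a hypothetical stand-in for the real Public Suffix List (no wildcard or exception rules), `isIPv4` is a simplified check that ignores IPv6, and the functions are plain top-level functions rather than members of `browser.publicSuffix`. Returning the full input for unknown suffixes follows the green.banana example above and is an assumption of this sketch.

```typescript
// Toy suffix set for illustration only (assumption); a real implementation
// would consult the browser's built-in Public Suffix List.
const TOY_SUFFIXES: ReadonlySet<string> = new Set(["io", "github.io", "uk", "co.uk", "com"]);

interface RegistrableDomainOptions {
  allowIP?: boolean;
  allowUnknownSuffix?: boolean;
  allowPlainSuffix?: boolean;
}

// Simplified IPv4-only check (assumption: IPv6 handling is omitted).
function isIPv4(host: string): boolean {
  return /^\d{1,3}(\.\d{1,3}){3}$/.test(host);
}

function getKnownPublicSuffix(host: string): string | null {
  if (isIPv4(host)) return null;
  const labels = host.split(".");
  // Try the longest candidate first so that github.io wins over io.
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (TOY_SUFFIXES.has(candidate)) return candidate;
  }
  return null;
}

// True iff getKnownPublicSuffix returns the input string itself.
function isKnownPublicSuffix(host: string): boolean {
  return getKnownPublicSuffix(host) === host;
}

function getRegistrableDomain(
  host: string,
  options: RegistrableDomainOptions = {},
): string | null {
  if (isIPv4(host)) return options.allowIP ? host : null;
  const suffix = getKnownPublicSuffix(host);
  if (suffix === null) return options.allowUnknownSuffix ? host : null;
  if (suffix === host) return options.allowPlainSuffix ? host : null;
  // eTLD+1: the suffix plus the one label immediately to its left.
  const suffixLabelCount = suffix.split(".").length;
  return host.split(".").slice(-(suffixLabelCount + 1)).join(".");
}
```

With null as the "no result" value, a caller can fall back to the input with `getRegistrableDomain(host) ?? host`, which is the usage pattern the return type was chosen to support.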

hostname: string,
options?: RegistrableDomainOptions,
): Promise<string | null>;
Member

I'm wondering whether it is feasible to make this API synchronous. We usually require new extension APIs to be asynchronous unless a good reason is given otherwise.

In practice, the example extensions I checked in your list currently use a library with a synchronous getDomain method, and rewriting already-synchronous code to use an asynchronous method is generally difficult. Moreover, the fact that a bulk variant of getRegistrableDomain is proposed here suggests that the basic getRegistrableDomain method already has enough overhead to necessitate a bulk query method.

In Firefox, the internal effectiveTLDService API can be used from the parent and content process, including extensions. There is no implementation constraint for requiring this to be implemented in the parent process.

In Chromium, I see at least one use of registry_controlled_domains in the renderer (LocalDOMWindow::IsCrossSiteSubframe in local_dom_window.cc), which suggests that the information may be available in the child process too.

In WebKit, I see topPrivatelyControlledDomain being used in areSameSiteIgnoringPublicSuffix in Document.cpp, suggesting that the information may be available in the content process as well.

Since the current implementations offer the functionality in the child process, and the more ergonomic version of the API is for it to be synchronous, I'm favoring the APIs to be non-async.
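
To make the ergonomics argument concrete, here is a small sketch contrasting the two call styles. `getDomainSync` and `getDomainAsync` are hypothetical stand-ins (a naive "last two labels" rule), not the proposed API; the point is only how each shape composes with ordinary array callbacks.

```typescript
// Hypothetical stand-ins for the two API shapes under discussion.
// The "last two labels" rule is a placeholder, not real PSL logic.
function getDomainSync(host: string): string | null {
  const labels = host.split(".");
  return labels.length >= 2 ? labels.slice(-2).join(".") : null;
}

async function getDomainAsync(host: string): Promise<string | null> {
  return getDomainSync(host);
}

const hosts = ["a.example.com", "b.example.com", "c.example.org"];

// Synchronous: usable directly inside filter/sort/map callbacks.
const sameSiteSync = hosts.filter((h) => getDomainSync(h) === "example.com");

// Asynchronous: callbacks cannot await, so every result has to be resolved
// up front, which is where a bulk query method starts to look attractive.
async function sameSiteAsync(): Promise<string[]> {
  const domains = await Promise.all(hosts.map((h) => getDomainAsync(h)));
  return hosts.filter((_, i) => domains[i] === "example.com");
}
```

Existing extension code that calls a synchronous library inside such callbacks would need the second, restructured form if the API were asynchronous.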

@oliverdunk @xeenon Thoughts?

Author

I have gone ahead and changed this proposal's API to sync in commit 5ba391f. However, I can revert this to async if necessary.

Member

This sounds reasonable to me, however I would ultimately defer to @rdcronin here.

@mckenfra
Author

> On the current proposal, my main requests for changes are:
>
> * if other browser vendors are onboard, I prefer a synchronous API over an asynchronous one (and with that the bulk query method would be redundant)

Agreed - this would alleviate any performance concerns and allow us to simplify the API as you suggest.

> * `strict` option may be too broad, and I'd favor splitting it up into specific options to allow extensions to customize their behavior

Agreed.

> * `parse` method has limited utility. Drop it; I expect the only given use case ("search-vs-navigate") to be covered by the broader options to `getRegistrableDomain`.

As per my response to your other comment, the reason the parse() method was introduced was because:

  1. A method that's returning something other than a registrable domain (e.g. an IP address) shouldn't really be called getRegistrableDomain(). Whereas returning an IP address from parse() seems fine.
  2. If you call getRegistrableDomain(hostname, { allowIP: true, allowUnknownSuffix: true }), you do not know what the return value is. It may be an IP address or a domain name. With parse(), this is explicitly available via the kind field in the returned object.

My gut instinct is that for some use cases it will be useful/essential to know whether the returned value is an IP address or a domain name.
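
The distinction being argued for can be sketched with a discriminated union. Only the existence of a `kind` field comes from the discussion above; the `ParseResult` shape, the `"ip"` and `"domain"` values, the `value` field, and the IPv4-only check are assumptions made for illustration.

```typescript
// Hypothetical result shape; only the `kind` field is mentioned in the
// discussion, everything else here is an assumption.
type ParseResult =
  | { kind: "ip"; value: string }
  | { kind: "domain"; value: string };

// Simplified classifier (assumption: IPv4 only, no PSL lookup).
function parseSketch(host: string): ParseResult {
  const looksLikeIPv4 = /^\d{1,3}(\.\d{1,3}){3}$/.test(host);
  return looksLikeIPv4
    ? { kind: "ip", value: host }
    : { kind: "domain", value: host };
}

// The caller can branch on `kind` instead of re-inspecting the string.
function describe(host: string): string {
  const result = parseSketch(host);
  switch (result.kind) {
    case "ip":
      return `IP address ${result.value}`;
    case "domain":
      return `domain name ${result.value}`;
  }
}
```

By contrast, a bare `string | null` return from getRegistrableDomain with allowIP and allowUnknownSuffix both set would leave the caller to work out which case applied.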

> sometimes one may be interested in knowing whether there is any other shorter label that might have been a valid domain in theory.

Agreed, I made this point in an earlier discussion of this proposal.

> The getRegistrableDomain method follows the definition of "registrable domain" from the URL spec
> [...]
> The allowPlainSuffix option only exists because there are domains that do not have an eTLD+1 but can still be navigated to, such as github.io and blogspot.com. These examples are public suffixes themselves, but there is no +1 in eTLD+1.

See my response to your other comment. It is possible that a clarification of the algorithm might avoid the need for allowPlainSuffix.

> Examples for getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null
> [..]
> * green.banana -> null (with allowUnknownSuffix=true -> green.banana)

In section "Behaviours: 4. Strict" in the proposal, I have stated that the equivalent of a registrable domain for an unknown-suffixed hostname that is not an IP address should be simply "the last domain label". So for green.banana, the registrable domain would be just banana.

Some further examples:

| Input hostname | Registrable Domain Equivalent |
| --- | --- |
| printer.homenet | homenet |
| backup.homenet | homenet |
| my-company-server.internal | internal |
| localhost | localhost |

@mckenfra
Author

mckenfra commented May 10, 2025

@simon-friedberger See my above comment that getRegistrableDomain(string, { allowUnknownSuffix }) should return only the very last domain label for a domain lacking a known suffix. An alternative is instead to return the full input hostname as-is. Do you have any thoughts which is preferable?

@mckenfra
Author

@Rob--W I have addressed your comments in commit 5ba391f as follows:

  • Added the API methods and options you suggested
  • Removed getRegistrableDomains() and parse()
  • Changed the API to sync instead of async, with explanation
  • Stated that option allowUnknownSuffix should cause the registrable domain of an unknown-suffixed input hostname to be the full input hostname itself
  • Stated that a use case involving auto-generating filtering rules for every possible eTLD from an input rule such as myorg.* is out-of-scope

Member

@Rob--W Rob--W left a comment

The contents look good to me; in my opinion this is ready for wider review.

In a recent comment (#676 (comment)), you wondered whether to return the full input domain or just the last domain label for an input without known suffix. The current proposal has an example my.net.foobar that returns the full input, which looks good to me.

@simon-friedberger

> @simon-friedberger See my above comment that getRegistrableDomain(string, { allowUnknownSuffix }) should return only the very last domain label for a domain lacking a known suffix. An alternative is instead to return the full input hostname as-is. Do you have any thoughts which is preferable?

I think in this case we want to apply the following rule from the PSL algorithm:

> If no rules match, the prevailing rule is "*". (See Note 2)

This means that for green.banana the eTLD is banana and the eTLD+1 is green.banana. Notably, this is not the full input: for very.green.banana, getRegistrableDomain would still return green.banana.
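
The effect of the default "*" rule can be sketched as follows. `KNOWN_SUFFIXES` is a hypothetical toy list and both function names are invented for this illustration; the sketch ignores wildcard and exception rules and only shows how the fallback makes the last label the suffix when nothing else matches.

```typescript
// Toy list for illustration only (assumption); not the real PSL.
const KNOWN_SUFFIXES: ReadonlySet<string> = new Set(["com", "uk", "co.uk"]);

// Returns the longest matching known suffix, or, if no rule matches,
// applies the default "*" rule: the suffix is the last label.
function suffixWithDefaultRule(host: string): string {
  const labels = host.split(".");
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (KNOWN_SUFFIXES.has(candidate)) return candidate;
  }
  return labels.slice(-1).join(".");
}

// eTLD+1 under that rule: the suffix plus one more label, or null when the
// host has no label to the left of the suffix.
function registrableDomainWithDefaultRule(host: string): string | null {
  const suffix = suffixWithDefaultRule(host);
  if (suffix === host) return null;
  const labelCount = suffix.split(".").length + 1;
  return host.split(".").slice(-labelCount).join(".");
}
```

Under this reading, green.banana and very.green.banana both map to green.banana, so the result for an unknown suffix is the last two labels rather than the full input.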

Member

@oliverdunk oliverdunk left a comment

LGTM, though adding @rdcronin for a second approval on the Chrome side. This was a really thorough proposal, and I greatly appreciate the time and effort that has been put into it.

@oliverdunk oliverdunk requested a review from rdcronin May 14, 2025 10:09
@mckenfra
Author

@Rob--W @simon-friedberger @oliverdunk I have pushed a minor update f1d81e5 containing the following changes:

  1. Renamed API methods, e.g. getRegistrableDomain() --> getDomain()
  2. Stated that getDomain(hostname, { allowUnknownSuffix }) returns the last 2 labels of unknown-suffixed input hostnames
  3. Updated the Open Web API section

@oliverdunk
Member

> @Rob--W @simon-friedberger @oliverdunk I have pushed a minor update

Thanks for the heads up! No concerns and still LGTM.

Member

@Rob--W Rob--W left a comment

Still looks good. I have added suggestions to clarify that the DomainOptions prescribe criteria for the input and its relation to the output; previously the text only stated what the output would look like.

These suggestions do not alter the proposal (and are already mentioned in more detail elsewhere in the document), but may help readers who only look at the API definition.

Comment on lines +380 to +382
// Determines if the given hostname is itself a known eTLD (i.e. in the PSL).
export function isKnownSuffix(
hostname: string,
Member

Suggested change
- // Determines if the given hostname is itself a known eTLD (i.e. in the PSL).
- export function isKnownSuffix(
- hostname: string,
+ // Determines if the given suffix is itself a known eTLD (i.e. in the PSL).
+ export function isKnownSuffix(
+ suffix: string,

Nit: The input is not a hostname but a suffix.

Author

I am using the term "hostname" in the sense of the hostname property of JavaScript's URL class:

const hostname = new URL("https://co.uk").hostname;
const isKnownSuffix = publicSuffix.isKnownSuffix(hostname);

The input parameter may be any hostname, and if it is anything other than a known suffix (e.g. an IP address, or an eTLD+1), then the method returns false.

In section "Behaviours: 4. Invalid hostname" I state:

This API's methods should throw an error if a hostname passed as an input parameter
[contains invalid characters]

I intended that same error-throwing behaviour to apply to this method too, so referring to the input parameter as a "hostname" is a way of conveying that intent.

@mckenfra
Author

@Rob--W I have pushed update e0f2835 containing:

  • minor rewording of the DomainOptions comments (as you note, the same information is already described in more detail later in the proposal, so this is mainly to help the casual reader)
  • updated section "Behaviours: 4. Invalid hostname" to clarify that an input hostname parameter may be an IP address or a domain name

Collaborator

@xeenon xeenon left a comment

Thanks for this very detailed proposal!

8 participants