Proposal: Public Suffix API #676

Open
wants to merge 7 commits into
base: main

mckenfra

This formalizes #231 into a concrete proposal.

@Rob--W Rob--W requested review from oliverdunk and xeenon August 20, 2024 12:29
Member

@oliverdunk oliverdunk left a comment

Thanks for this! I will reach out to the PSL maintainers I have been in contact with to ask them to take a look. I'll also share this internally to get an overall opinion from Chrome.

@mckenfra mckenfra changed the title Add Public Suffix API proposal Proposal: Public Suffix API Aug 23, 2024
Member

@Rob--W Rob--W left a comment

On the current proposal, my main requests for changes are:

  • if other browser vendors are onboard, I prefer a synchronous API over an asynchronous one (and with that the bulk query method would be redundant)
  • strict option may be too broad, and I'd favor splitting it up in specific options to allow extensions to customize their behavior
  • parse method has limited utility. Drop it; I expect the only given use case ("search-vs-navigate") to be covered by the broader options to getRegistrableDomain.

First, to help with minimizing the amount of back and forth, here is what I am proposing (and I have chatted with @simon-friedberger to confirm that he's okay with the design):

browser.publicSuffix.isKnownPublicSuffix(string) -> boolean
browser.publicSuffix.getKnownPublicSuffix(string) -> string | null
browser.publicSuffix.getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null

The return values are null to allow extensions to use the API like getXXX(val) ?? val, but I'm also willing to consider empty string or throwing an error.

isKnownPublicSuffix returns whether a specific string is on the PSL. I include it because the PSL algorithm returns the longest matching suffix, but sometimes one may be interested in knowing whether there is a shorter suffix under which the input would have been a valid domain in theory. E.g. github.io is a public suffix itself, but could also be interpreted as an eTLD+1 under the shorter suffix io, depending on the use case.

The getRegistrableDomain method follows the definition of "registrable domain" from the URL spec: https://url.spec.whatwg.org/#host-miscellaneous by default. However, that interpretation is too strict for some use cases, hence the extra options.

For the use case of "search or navigate", allowIP and allowPlainSuffix would be set to true, but allowUnknownSuffix to false. Alternatively, the extension could use the getKnownPublicSuffix method.

allowIP exists because IP addresses have to be special-cased by the extension. For most domain inputs, one could split at dots to try and get a different domain level, but that logic does not make sense for IP addresses. If this distinction is unimportant, this option can be dropped and merged with allowUnknownSuffix.

The allowUnknownSuffix option exists to exclude non-domains with unknown suffix such as green.banana, otherwise getRegistrableDomain would effectively return a string for almost every input.

The allowPlainSuffix option only exists because there are domains that do not have an eTLD+1 but can still be navigated to, such as github.io and blogspot.com. These examples are public suffixes themselves, but there is no +1 in eTLD+1.

Examples for getKnownPublicSuffix(string) -> string | null :

  • github.io -> github.io
  • foo.github.io -> github.io
  • facebook.co.uk -> co.uk
  • 192.168.2.1 -> null
  • green.banana -> null

Semantics for isKnownPublicSuffix(string) -> boolean :

  • True iff getKnownPublicSuffix(string) returns the input string.

Examples for getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null

  • github.io -> null (with allowPlainSuffix=true -> github.io)
  • foo.github.io -> foo.github.io
  • facebook.co.uk -> facebook.co.uk
  • 192.168.2.1 -> null (with allowIP=true -> 192.168.2.1)
  • green.banana -> null (with allowUnknownSuffix=true -> green.banana)
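
The examples above can be reproduced with a small sketch. This is not the proposed implementation: the suffix set below is a hand-picked toy stand-in for the PSL, the IPv4 check is deliberately rough, and wildcard and exception rules are ignored. Only the three method names and the three option names come from the comment; everything else is an assumption made for illustration.

```typescript
// Toy stand-in for the PSL (assumption): a real implementation would use the browser's list.
const TOY_SUFFIXES = new Set(["io", "github.io", "uk", "co.uk", "com", "blogspot.com"]);

interface SketchOptions {
  allowIP?: boolean;
  allowUnknownSuffix?: boolean;
  allowPlainSuffix?: boolean;
}

// Rough IPv4 detection, sufficient for the examples only.
function looksLikeIPv4(host: string): boolean {
  return /^\d{1,3}(\.\d{1,3}){3}$/.test(host);
}

// Longest suffix of `host` found in the toy list, or null if none matches.
function getKnownPublicSuffix(host: string): string | null {
  if (looksLikeIPv4(host)) return null;
  const labels = host.split(".");
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (TOY_SUFFIXES.has(candidate)) return candidate;
  }
  return null;
}

// True iff the input is itself the longest known suffix.
function isKnownPublicSuffix(host: string): boolean {
  return getKnownPublicSuffix(host) === host;
}

function getRegistrableDomain(host: string, opts: SketchOptions = {}): string | null {
  if (looksLikeIPv4(host)) return opts.allowIP ? host : null;
  const suffix = getKnownPublicSuffix(host);
  // Unknown suffix: echo the input, as in the green.banana example.
  if (suffix === null) return opts.allowUnknownSuffix ? host : null;
  // The input is a bare public suffix, so there is no "+1" label.
  if (suffix === host) return opts.allowPlainSuffix ? host : null;
  const labels = host.split(".");
  const suffixLen = suffix.split(".").length;
  // Keep the suffix plus exactly one more label (eTLD+1).
  return labels.slice(labels.length - suffixLen - 1).join(".");
}
```

With these definitions, the nullable return value supports the `getRegistrableDomain(host) ?? host` pattern mentioned earlier in the comment.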

hostname: string,
options?: RegistrableDomainOptions,
): Promise<string | null>;
Member

I'm wondering whether it is feasible to make this API synchronous. We usually require new extension APIs to be asynchronous unless a good reason is given otherwise.

In practice, the example extensions I checked in your list currently use a library with a synchronous getDomain method, and rewriting already-synchronous code to use an asynchronous method is generally difficult. Moreover, the bulk getRegistrableDomain method proposed here suggests that the basic getRegistrableDomain method already has enough overhead to necessitate a bulk query method.

In Firefox, the internal effectiveTLDService API can be used from the parent and content process, including extensions. There is no implementation constraint for requiring this to be implemented in the parent process.

In Chromium, I see at least one use of registry_controlled_domains in the renderer (LocalDOMWindow::IsCrossSiteSubframe in local_dom_window.cc), which suggests that the information may be available in the child process too.

In WebKit, I see topPrivatelyControlledDomain being used in areSameSiteIgnoringPublicSuffix in Document.cpp, suggesting that the information may be available in the content process as well.

Since the current implementations offer the functionality in the child process, and the more ergonomic version of the API is for it to be synchronous, I'm favoring the APIs to be non-async.

@oliverdunk @xeenon Thoughts?
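
To illustrate the ergonomics argument, here is a minimal sketch contrasting the two call styles. The functions `domainSync` and `domainAsync` are hypothetical stand-ins (they just keep the last two labels) and are not part of any browser API; the point is only how each style composes with ordinary array callbacks.

```typescript
// Hypothetical stand-in: returns the last two labels of a hostname.
function domainSync(host: string): string {
  return host.split(".").slice(-2).join(".");
}

// Same lookup wrapped in a Promise, as an async API would expose it.
async function domainAsync(host: string): Promise<string> {
  return domainSync(host);
}

const requestHosts = ["a.example.com", "cdn.other.net", "b.example.com"];

// Sync: the lookup can be called directly inside a filter callback.
const sameSiteSync = requestHosts.filter((h) => domainSync(h) === "example.com");

// Async: filter cannot await, so all lookups are resolved first
// and the results are matched back to the inputs by index.
async function sameSiteAsync(hosts: string[]): Promise<string[]> {
  const domains = await Promise.all(hosts.map((h) => domainAsync(h)));
  return hosts.filter((_, i) => domains[i] === "example.com");
}
```

The async variant also forces every caller up the chain to become async, which is the rewrite cost mentioned above for extensions that already rely on a synchronous library.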

Author

I have gone ahead and changed this proposal's API to sync in commit 5ba391f. However, I can revert this to async if necessary.

Member

This sounds reasonable to me, however I would ultimately defer to @rdcronin here.

@mckenfra
Author

> On the current proposal, my main requests for changes are:
>
> * if other browser vendors are onboard, I prefer a synchronous API over an asynchronous one (and with that the bulk query method would be redundant)

Agreed - this would alleviate any performance concerns and allow us to simplify the API as you suggest.

> * `strict` option may be too broad, and I'd favor splitting it up in specific options to allow extensions to customize their behavior

Agreed.

> * `parse` method has limited utility. Drop it; I expect the only given use case ("search-vs-navigate") to be covered by the broader options to `getRegistrableDomain`.

As per my response to your other comment, the parse() method was introduced because:

  1. A method that's returning something other than a registrable domain (e.g. an IP address) shouldn't really be called getRegistrableDomain(). Whereas returning an IP address from parse() seems fine.
  2. If you call getRegistrableDomain(hostname, { allowIP: true, allowUnknownSuffix: true }), you do not know what the return value is. It may be an IP address or a domain name. With parse(), this is explicitly available via the kind field in the returned object.

My gut instinct is that for some use cases it will be useful/essential to know whether the returned value is an IP address or a domain name.
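
A rough sketch of the second point, for illustration only. The source mentions a `kind` field on the object returned by parse(); the exact object shape, the `"ip"` and `"domain"` values, and the names `ParsedHostSketch`, `parseHostSketch`, and `describeHost` are all assumptions here, and the IPv4 check is deliberately simplistic.

```typescript
// Hypothetical result shape: a discriminated union keyed on `kind`.
type ParsedHostSketch =
  | { kind: "ip"; address: string }
  | { kind: "domain"; domain: string };

// Toy classifier: treats dotted-quad strings as IPs and everything else as a domain.
function parseHostSketch(host: string): ParsedHostSketch {
  const isIPv4 = /^\d{1,3}(\.\d{1,3}){3}$/.test(host);
  return isIPv4 ? { kind: "ip", address: host } : { kind: "domain", domain: host };
}

// The caller branches on `kind` instead of re-detecting what a plain string contains.
function describeHost(host: string): string {
  const parsed = parseHostSketch(host);
  switch (parsed.kind) {
    case "ip":
      return `IP address ${parsed.address}`;
    case "domain":
      return `domain ${parsed.domain}`;
  }
}
```

With a plain `string | null` return value, the same caller would have to run its own IP detection on the result to learn which case it received.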

> sometimes one may be interested in knowing whether there is any other shorter label that might have been a valid domain in theory.

Agreed, I made this point in an earlier discussion of this proposal.

> The getRegistrableDomain method follows the definition of "registrable domain" from the URL spec
> [...]
> The allowPlainSuffix option only exists because there are domains that do not have an eTLD+1 but can still be navigated to, such as github.io and blogspot.com. These examples are public suffixes themselves, but there is no +1 in eTLD+1.

See my response to your other comment. It is possible that a clarification of the algorithm might avoid the need for allowPlainSuffix.

> Examples for getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null
> [..]
> * green.banana -> null (with allowUnknownSuffix=true -> green.banana)

In section "Behaviours: 4. Strict" in the proposal, I have stated that the equivalent of a registrable domain for an unknown-suffixed hostname that is not an IP address should be simply "the last domain label". So for green.banana, the registrable domain would be just banana.

Some further examples:

| Input hostname | Registrable Domain Equivalent |
| --- | --- |
| printer.homenet | homenet |
| backup.homenet | homenet |
| my-company-server.internal | internal |
| localhost | localhost |

@mckenfra
Author

mckenfra commented May 10, 2025

@simon-friedberger See my above comment that getRegistrableDomain(string, { allowUnknownSuffix }) should return only the very last domain label for a domain lacking a known suffix. An alternative is instead to return the full input hostname as-is. Do you have any thoughts which is preferable?

@mckenfra
Author

@Rob--W I have addressed your comments in commit 5ba391f as follows:

  • Added the API methods and options you suggested
  • Removed getRegistrableDomains() and parse()
  • Changed the API to sync instead of async, with explanation
  • Stated that option allowUnknownSuffix should cause the registrable domain of an unknown-suffixed input hostname to be the full input hostname itself
  • Stated that a use case involving auto-generating filtering rules for every possible eTLD from an input rule such as myorg.* is out-of-scope

Member

@Rob--W Rob--W left a comment

The contents look good to me, in my opinion this is ready for wider review.

In a recent comment (#676 (comment)), you wondered whether to return the full input domain or just the last domain label for an input without known suffix. The current proposal has an example my.net.foobar that returns the full input, which looks good to me.

@simon-friedberger

> @simon-friedberger See my above comment that getRegistrableDomain(string, { allowUnknownSuffix }) should return only the very last domain label for a domain lacking a known suffix. An alternative is instead to return the full input hostname as-is. Do you have any thoughts which is preferable?

I think in this case we want to apply the following rule from the PSL algorithm:

> If no rules match, the prevailing rule is "*". (See Note 2)

This means that for green.banana the eTLD is banana and the eTLD+1 is green.banana. Notably, this is not always the full input: for very.green.banana, getRegistrableDomain would still return green.banana.
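
A minimal sketch of that fallback, assuming a toy rule set in place of the real PSL and ignoring explicit wildcard and exception rules. The names `KNOWN_RULES`, `suffixWithDefaultRule`, and `etldPlusOne` are made up for this illustration.

```typescript
// Toy rule set (assumption); the real PSL is far larger.
const KNOWN_RULES = new Set(["com", "uk", "co.uk"]);

// Returns the longest listed suffix, or falls back to the implicit "*" rule,
// which treats the last label as the eTLD when no listed rule matches.
function suffixWithDefaultRule(host: string): string {
  const labels = host.split(".");
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (KNOWN_RULES.has(candidate)) return candidate;
  }
  return labels.slice(-1).join(".");
}

// eTLD+1: the suffix plus exactly one more label, or null if the host is only a suffix.
function etldPlusOne(host: string): string | null {
  const labels = host.split(".");
  const suffixLen = suffixWithDefaultRule(host).split(".").length;
  if (labels.length <= suffixLen) return null;
  return labels.slice(labels.length - suffixLen - 1).join(".");
}
```

Under this rule, green.banana and very.green.banana both map to green.banana, while the single label banana has no eTLD+1.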

Member

@oliverdunk oliverdunk left a comment

LGTM, though adding @rdcronin for a second approval on the Chrome side. This was a really thorough proposal, and I greatly appreciate the time and effort that has been put into it.

@oliverdunk oliverdunk requested a review from rdcronin May 14, 2025 10:09
@mckenfra
Author

@Rob--W @simon-friedberger @oliverdunk I have pushed a minor update f1d81e5 containing the following changes:

  1. Renamed API methods, e.g. getRegistrableDomain() --> getDomain()
  2. Stated that getDomain(hostname, { allowUnknownSuffix }) returns the last 2 labels of unknown-suffixed input hostnames
  3. Updated the Open Web API section

@oliverdunk
Member

> @Rob--W @simon-friedberger @oliverdunk I have pushed a minor update

Thanks for the heads up! No concerns and still LGTM.

Member

@Rob--W Rob--W left a comment

Still looks good. I have added suggestions to clarify that the DomainOptions prescribe criteria for the input and the relation of the output; previously the text only states what the output would look like.

These suggestions do not alter the proposal (and are already mentioned in more detail elsewhere in the document), but may help readers who only look at the API definition.

Comment on lines +380 to +382
// Determines if the given hostname is itself a known eTLD (i.e. in the PSL).
export function isKnownSuffix(
hostname: string,
Member

Suggested change
- // Determines if the given hostname is itself a known eTLD (i.e. in the PSL).
- export function isKnownSuffix(
- hostname: string,
+ // Determines if the given suffix is itself a known eTLD (i.e. in the PSL).
+ export function isKnownSuffix(
+ suffix: string,

Nit: The input is not a hostname but a suffix.

Author

I am using the term "hostname" in the sense of the hostname property of JavaScript's URL class:

const hostname = new URL("https://co.uk").hostname;
const isKnownSuffix = publicSuffix.isKnownSuffix(hostname);

The input parameter may be any hostname, and if it is anything other than a known suffix (e.g. an IP address, or an eTLD+1), then the method returns false.

In section "Behaviours: 4. Invalid hostname" I state:

> This API's methods should throw an error if a hostname passed as an input parameter
> [contains invalid characters]

I intended that same error-throwing behaviour to apply to this method too, so referring to the input parameter as a "hostname" is a way of conveying that intent.

@mckenfra
Author

@Rob--W I have pushed update e0f2835 containing:

  • minor rewording of the DomainOptions comments (as you note, the same information is already described in more detail later in the proposal, so this is mainly to help the casual reader)
  • updated section "Behaviours: 4. Invalid hostname" to clarify that an input hostname parameter may be an IP address or a domain name

Collaborator

@xeenon xeenon left a comment

Thanks for this very detailed proposal!

Collaborator

@rdcronin rdcronin left a comment

Thank you for the PR! This generally looks good, but there are a few things I'd like to see addressed.

Comment on lines +327 to +334
#### 4. Search vs Navigate

Firefox makes use of the PSL in order to determine whether to issue a search query
or whether to try a navigation, when a user enters a domain-like string in the
url bar. In such instances, a PSL lookup is made and:

* If the domain has a known eTLD, attempt to navigate.
* If the domain has an unknown eTLD, use a search engine.
Collaborator

This is a use case for the PSL, but I'm not sure it's as compelling a use case for extensions themselves (since it boils down to "handle things with TLDs differently", which is already described)

Author

This use case was provided by a reviewer during review of this proposal to justify the inclusion of functionality to distinguish between known vs unknown suffixes. I will add an update to note that this specific use case may not apply directly to extensions.

@rdcronin IMHO there are two ways the PSL is used:

  1. Figuring out a site, for grouping or associating data with a site, or deleting site history, etc.
  2. Figuring out if something looks like a valid domain - that's what this is and I don't think it's covered above...? Actually, it looks like all the other examples are examples of (1.) and this is the only example of (2.).
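
The two usage patterns can be sketched side by side. The lookup is passed in as a parameter so the sketch does not depend on any particular API; `DomainLookup`, `toyLookup`, `groupBySite`, and `looksLikeDomain` are hypothetical names, and the toy lookup only recognizes the suffixes com and co.uk.

```typescript
// Any function that maps a hostname to its registrable domain, or null if it has none.
type DomainLookup = (host: string) => string | null;

// Toy lookup (assumption): understands only the "com" and "co.uk" suffixes.
const toyLookup: DomainLookup = (host) =>
  /([^.]+\.(?:co\.uk|com))$/.exec(host)?.[1] ?? null;

// Pattern 1: group hostnames by site, falling back to the raw host when no domain is found.
function groupBySite(hosts: string[], lookup: DomainLookup): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const host of hosts) {
    const site = lookup(host) ?? host;
    const bucket = groups.get(site) ?? [];
    bucket.push(host);
    groups.set(site, bucket);
  }
  return groups;
}

// Pattern 2: decide whether an input looks like a valid domain at all.
function looksLikeDomain(input: string, lookup: DomainLookup): boolean {
  return lookup(input) !== null;
}
```

Pattern 1 only needs some stable grouping key, whereas pattern 2 depends on the lookup distinguishing known from unknown suffixes, which is the distinction the search-vs-navigate use case relies on.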

Comment on lines +408 to +410
// If true, the returned domain should be encoded as Unicode.
// Default = false (Punycode)
unicode?: boolean,
Collaborator

Could this be an enum instead, to allow future encoding options?

Author

Yes, I will rename this option from unicode to encoding, as follows:

interface DomainOptions {

  // Determines how the returned domain should be encoded.
  // Default = punycode
  encoding?: DomainEncoding,

  // etc.
}

enum DomainEncoding {
  punycode,
  unicode,
}


// If true, and the input hostname is an IP address, then this is returned as-is.
// Default = false
allowIP?: boolean,
Collaborator

nit: I'd lean towards "allowIpAddress[es]"

Author

Ok, I will rename this to allowIpAddress.

The other options are singular (e.g. allowUnknownSuffix) rather than plural, so I would favour the singular here too.

Collaborator

Should be allowIPAddress.

// the penultimate two domain labels of the input.
// Default = false
allowUnknownSuffix?: boolean,
}
Collaborator

Should we provide an option for whether to include private registries? (I think in this proposal, called "implied suffixes"?)

Author

Are you referring to the "Private section" of the PSL dataset? If so, these private registries are always included, as explained in proposal section "Features of the PSL: 1. ICANN vs Private".

Comment on lines +693 to +696
Unfortunately, while this offered a solution to the performance problem,
it added additional complexity to the API. To resolve this issue, the API
has now been changed to being synchronous, which has allowed the batching method
to be removed, thereby making the API more ergonomic.
Collaborator

As we discussed in our meeting, I'd like us to expand this section with more details about this. Sync APIs are against our typical best practices and, even if they are possible without significant extra architectural work today, they may not always be that way (over the past decade, we've seen many architectural shifts that have invalidated previously-held assumptions, such as subframes being in the same process or navigation being driven by the renderer process).

I'm not fundamentally opposed to these being sync, but I think we should have good reason beyond just making it a bit easier, since we may pay heavily for it in the future. The best justification would be a scenario that's not solvable without a sync API (e.g., use in document_start), but AFAIK we haven't identified any of those. Are there specific instances of things that are dramatically more complex to implement with async APIs? Can we add those here as documentation for why we diverge from typical best practices?

Author

Understood, I will look into how to expand this section with further justification.

Is this async preference documented somewhere?

Member

We mention it briefly here: https://github.com/w3c/webextensions/blob/main/proposals/proposal_template.md#schema

I think @Rob--W had some thoughts on how to update this section, so he might be able to help (sorry to sign you up for that Rob!).

9 participants