Proposal: Public Suffix API #676

mckenfra · 2024-08-19T14:25:32Z

This formalizes #231 into a concrete proposal.

proposals/public-suffix.md

oliverdunk

Thanks for this! I will reach out to the PSL maintainers I have been in contact with to ask them to take a look. I'll also share this internally to get an overall opinion from Chrome.

proposals/public-suffix.md

Rob--W

On the current proposal, my main requests for changes are:

* if other browser vendors are onboard, I prefer a synchronous API over an asynchronous one (and with that the bulk query method would be redundant)
* `strict` option may be too broad, and I'd favor splitting it up in specific options to allow extensions to customize their behavior
* `parse` method has limited utility. Drop it; I expect the only given use case ("search-vs-navigate") to be covered by the broader options to `getRegistrableDomain`.

First, to help with minimizing the amount of back and forth, here is what I am proposing (and I have chatted with @simon-friedberger to confirm that he's okay with the design):

browser.publicSuffix.isKnownPublicSuffix(string) -> boolean
browser.publicSuffix.getKnownPublicSuffix(string) -> string | null
browser.publicSuffix.getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null

The return values are null to allow extensions to use the API like getXXX(val) ?? val, but I'm also willing to consider empty string or throwing an error.

isKnownPublicSuffix returns whether a specific string is on the PSL. I include this here because the PSL algorithm returns the longest label, but sometimes one may be interested in knowing whether there is any other shorter label that might have been a valid domain in theory. E.g. github.io is a public suffix itself, but could also be interpreted as an eTLD+1 for github.io, depending on the use case.

The getRegistrableDomain method follows the definition of "registrable domain" from the URL spec: https://url.spec.whatwg.org/#host-miscellaneous by default. However, that interpretation is too strict for some use cases, hence the extra options.

For the use case of "search or navigate", allowIP and allowPlainSuffix would be set to true, but allowUnknownSuffix to false. Alternatively, extensions could use getKnownPublicSuffix.

allowIP is because IP addresses have to be special-cased by the extension. For most domain inputs, one could split at dots to try and get a different domain level, but that logic does not make sense for IP addresses. If this distinction is unimportant, this option can be dropped and merged with allowUnknownSuffix.

The allowUnknownSuffix option exists to exclude non-domains with unknown suffix such as green.banana, otherwise getRegistrableDomain would effectively return a string for almost every input.

The allowPlainSuffix option only exists because there are domains that do not have an eTLD+1 but can still be navigated to, such as github.io and blogspot.com. These examples are public suffixes themselves, but there is no +1 in eTLD+1.

Examples for getKnownPublicSuffix(string) -> string | null :

github.io -> github.io
foo.github.io -> github.io
facebook.co.uk -> co.uk
192.168.2.1 -> null
green.banana -> null

Semantics for isKnownPublicSuffix(string) -> boolean :

True iff getKnownPublicSuffix(string) returns the input string.
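
To make the semantics above concrete, here is a minimal sketch of the two lookups. It is not the proposed implementation: `SAMPLE_SUFFIXES` is a hypothetical five-entry stand-in for the PSL, and wildcard and exception rules are ignored.

```typescript
// Hypothetical stand-in for the PSL: a tiny subset, for illustration only.
const SAMPLE_SUFFIXES: ReadonlySet<string> = new Set([
  "io", "github.io", "uk", "co.uk", "com",
]);

// Returns the longest suffix of `hostname` found in the sample list, or null.
function getKnownPublicSuffix(hostname: string): string | null {
  const labels = hostname.split(".");
  // Try the longest candidate first so that "github.io" wins over "io".
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (SAMPLE_SUFFIXES.has(candidate)) return candidate;
  }
  return null;
}

// True iff getKnownPublicSuffix returns the input string itself.
function isKnownPublicSuffix(hostname: string): boolean {
  return getKnownPublicSuffix(hostname) === hostname;
}
```

With this sample list, the five example inputs above produce the listed results, and `isKnownPublicSuffix` is true for both `io` and `github.io`.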

Examples for getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null

github.io -> null (with allowPlainSuffix=true -> github.io)
foo.github.io -> foo.github.io
facebook.co.uk -> facebook.co.uk
192.168.2.1 -> null (with allowIP=true -> 192.168.2.1)
green.banana -> null (with allowUnknownSuffix=true -> green.banana)
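
The three options can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: `DEMO_SUFFIXES` is a hypothetical stand-in for the PSL, `looksLikeIPv4` is a simplified check that ignores IPv6, and `shouldNavigate` is a hypothetical helper showing the "search or navigate" settings described earlier in this comment.

```typescript
// Hypothetical stand-in for the PSL: a tiny subset, for illustration only.
const DEMO_SUFFIXES: ReadonlySet<string> = new Set([
  "io", "github.io", "uk", "co.uk", "com",
]);

interface RegistrableDomainOptions {
  allowIP?: boolean;
  allowUnknownSuffix?: boolean;
  allowPlainSuffix?: boolean;
}

// Simplified IPv4 check; a real implementation would also handle IPv6.
function looksLikeIPv4(host: string): boolean {
  const parts = host.split(".");
  return parts.length === 4 &&
    parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
}

function getRegistrableDomain(
  host: string,
  options: RegistrableDomainOptions = {},
): string | null {
  if (looksLikeIPv4(host)) return options.allowIP ? host : null;

  const labels = host.split(".");
  for (let i = 0; i < labels.length; i++) {
    if (DEMO_SUFFIXES.has(labels.slice(i).join("."))) {
      // i === 0 means the input is itself a public suffix: there is no "+1" label.
      if (i === 0) return options.allowPlainSuffix ? host : null;
      // Otherwise keep the suffix plus the one label to its left.
      return labels.slice(i - 1).join(".");
    }
  }
  // No known suffix: in the examples above the input is returned as-is.
  return options.allowUnknownSuffix ? host : null;
}

// "Search or navigate": navigate when the input is an IP address, a plain
// suffix, or a domain with a known suffix; otherwise fall back to search.
function shouldNavigate(input: string): boolean {
  return getRegistrableDomain(input, {
    allowIP: true,
    allowPlainSuffix: true,
    allowUnknownSuffix: false,
  }) !== null;
}
```

Returning null rather than throwing keeps the `getXXX(val) ?? val` pattern available to callers.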

proposals/public-suffix.md

Rob--W · 2025-05-09T16:15:20Z

proposals/public-suffix.md

+    hostname: string,
+    options?: RegistrableDomainOptions,
+  )
+  : Promise<string | null>;


I'm wondering whether it is feasible to make this API synchronous. We usually require new extension APIs to be asynchronous unless a good reason is given otherwise.

In practice, the example extensions I checked in your list currently use a library with a synchronous getDomain method, and generally rewriting already-sync code to use an asynchronous method is difficult. Moreover, the bulk getRegistrableDomain method proposed here shows that the basic getRegistrableDomain method already appears to have too much overhead that necessitates a bulk query method.
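
As a rough illustration of why converting sync call sites is awkward, consider sorting hostnames by domain. Everything below is a sketch: `getDomainSync` and `getDomainAsync` are hypothetical stand-ins that return the last two labels instead of consulting the PSL.

```typescript
// Hypothetical stand-in for a synchronous lookup: returns the last two labels.
// A real implementation would consult the PSL instead.
function getDomainSync(hostname: string): string {
  return hostname.split(".").slice(-2).join(".");
}

// The same lookup behind a Promise, as an asynchronous API would expose it.
async function getDomainAsync(hostname: string): Promise<string> {
  return getDomainSync(hostname);
}

// Synchronous: the lookup can be called directly inside a sort comparator.
function sortByDomainSync(hosts: string[]): string[] {
  return hosts
    .slice()
    .sort((a, b) => getDomainSync(a).localeCompare(getDomainSync(b)));
}

// Asynchronous: a comparator cannot await, so every result has to be resolved
// up front, and every caller of this function has to become async as well.
async function sortByDomainAsync(hosts: string[]): Promise<string[]> {
  const pairs = await Promise.all(
    hosts.map(async (h) => [h, await getDomainAsync(h)] as const),
  );
  const byHost = new Map<string, string>(pairs);
  return hosts
    .slice()
    .sort((a, b) =>
      (byHost.get(a) ?? "").localeCompare(byHost.get(b) ?? ""));
}
```

The async variant is also where a bulk query method starts to look attractive, since it resolves all lookups in one step before sorting.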

In Firefox, the internal effectiveTLDService API can be used from the parent and content process, including extensions. There is no implementation constraint for requiring this to be implemented in the parent process.

In Chromium, I see at least one use of registry_controlled_domains in the renderer (LocalDOMWindow::IsCrossSiteSubframe in local_dom_window.cc), which suggests that the information may be available in the child process too.

In WebKit, I see topPrivatelyControlledDomain being used in areSameSiteIgnoringPublicSuffix in Document.cpp, suggesting that the information may be available in the content process as well.

Since the current implementations offer the functionality in the child process, and the more ergonomic version of the API is for it to be synchronous, I'm favoring the APIs to be non-async.

@oliverdunk @xeenon Thoughts?

I have gone ahead and changed this proposal's API to sync in commit 5ba391f. However, I can revert this to async if necessary.

This sounds reasonable to me, however I would ultimately defer to @rdcronin here.

mckenfra · 2025-05-10T10:38:47Z

On the current proposal, my main requests for changes are:

* if other browser vendors are onboard, I prefer a synchronous API over an asynchronous one (and with that the bulk query method would be redundant)

Agreed - this would alleviate any performance concerns and allow us to simplify the API as you suggest.

* `strict` option may be too broad, and I'd favor splitting it up in specific options to allow extensions to customize their behavior

Agreed.

* `parse` method has limited utility. Drop it; I expect the only given use case ("search-vs-navigate") to be covered by the broader options to `getRegistrableDomain`.

As per my response to your other comment, the parse() method was introduced because:

* A method that's returning something other than a registrable domain (e.g. an IP address) shouldn't really be called getRegistrableDomain(). Whereas returning an IP address from parse() seems fine.
* If you call getRegistrableDomain(hostname, { allowIP: true, allowUnknownSuffix: true }), you do not know what the return value is. It may be an IP address or a domain name. With parse(), this is explicitly available via the kind field in the returned object.

My gut instinct is that for some use cases it will be useful/essential to know whether the returned value is an IP address or a domain name.
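
A sketch of what a tagged result could look like. The `ParsedHost` shape, the `"ip"` and `"domain"` values, and the function names are assumptions made for illustration; only the existence of a kind field comes from the comment above.

```typescript
// Hypothetical result shape; the field names and kind values are assumptions.
type ParsedHost =
  | { kind: "ip"; value: string }
  | { kind: "domain"; value: string };

// Simplified IPv4 check; a real implementation would also handle IPv6.
function looksLikeIPv4Address(host: string): boolean {
  const parts = host.split(".");
  return parts.length === 4 &&
    parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
}

// Stand-in for parse(): reports what kind of value is being returned.
function parseHost(host: string): ParsedHost {
  return looksLikeIPv4Address(host)
    ? { kind: "ip", value: host }
    : { kind: "domain", value: host };
}

// A caller can branch on `kind` instead of re-detecting IP addresses itself.
function describeHost(host: string): string {
  const parsed = parseHost(host);
  return parsed.kind === "ip"
    ? `IP address ${parsed.value}`
    : `domain name ${parsed.value}`;
}
```

With a plain string return value, the caller would have to repeat the IP check to recover the same information.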

sometimes one may be interested in knowing whether there is any other shorter label that might have been a valid domain in theory.

Agreed, I made this point in an earlier discussion of this proposal.

The getRegistrableDomain method follows the definition of "registrable domain" from the URL spec
[...]
The allowPlainSuffix option only exists because there are domains that do not have an eTLD+1 but can still be navigated to, such as github.io and blogspot.com. These examples are public suffixes themselves, but there is no +1 in eTLD+1.

See my response to your other comment. It is possible that a clarification of the algorithm might avoid the need for allowPlainSuffix.

Examples for getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null
[..]
* green.banana -> null (with allowUnknownSuffix=true -> green.banana)

In section "Behaviours: 4. Strict" in the proposal, I have stated that the equivalent of a registrable domain for an unknown-suffixed hostname that is not an IP address should be simply "the last domain label". So for green.banana, the registrable domain would be just banana.

Some further examples:

| Input hostname | Registrable Domain Equivalent |
| --- | --- |
| printer.homenet | homenet |
| backup.homenet | homenet |
| my-company-server.internal | internal |
| localhost | localhost |

mckenfra · 2025-05-10T12:09:37Z

@simon-friedberger See my above comment that getRegistrableDomain(string, { allowUnknownSuffix }) should return only the very last domain label for a domain lacking a known suffix. An alternative is instead to return the full input hostname as-is. Do you have any thoughts which is preferable?

mckenfra · 2025-05-12T04:43:09Z

@Rob--W I have addressed your comments in commit 5ba391f as follows:

* Added the API methods and options you suggested
* Removed getRegistrableDomains() and parse()
* Changed the API to sync instead of async, with explanation
* Stated that option allowUnknownSuffix should cause the registrable domain of an unknown-suffixed input hostname to be the full input hostname itself
* Stated that a use case involving auto-generating filtering rules for every possible eTLD from an input rule such as myorg.* is out-of-scope

Rob--W

The contents look good to me, in my opinion this is ready for wider review.

In a recent comment (#676 (comment)), you wondered whether to return the full input domain or just the last domain label for an input without known suffix. The current proposal has an example my.net.foobar that returns the full input, which looks good to me.

proposals/public-suffix.md

simon-friedberger · 2025-05-14T07:21:32Z

@simon-friedberger See my above comment that getRegistrableDomain(string, { allowUnknownSuffix }) should return only the very last domain label for a domain lacking a known suffix. An alternative is instead to return the full input hostname as-is. Do you have any thoughts which is preferable?

I think in this case we want to apply the following rule from the PSL algorithm:

If no rules match, the prevailing rule is "*". (See Note 2)

This means that for green.banana the eTLD is banana and the eTLD+1 is green.banana. Notably, this is not the full input: for very.green.banana, getRegistrableDomain would still return green.banana.
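
The effect of the default "*" rule can be sketched like this. `KNOWN_SUFFIXES` is a hypothetical stand-in for the PSL, `effectiveSuffix` and `suffixPlusOne` are illustrative names, and wildcard and exception rules are left out.

```typescript
// Hypothetical stand-in for the PSL: a tiny subset, for illustration only.
const KNOWN_SUFFIXES: ReadonlySet<string> = new Set(["com", "uk", "co.uk"]);

// Longest listed suffix, or the last label when no rule matches (the "*" rule).
function effectiveSuffix(hostname: string): string {
  const labels = hostname.split(".");
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (KNOWN_SUFFIXES.has(candidate)) return candidate;
  }
  return labels.slice(-1).join(".");
}

// eTLD+1: the suffix plus one more label, or null if the input is only a suffix.
function suffixPlusOne(hostname: string): string | null {
  const labels = hostname.split(".");
  const suffixLength = effectiveSuffix(hostname).split(".").length;
  if (labels.length <= suffixLength) return null;
  return labels.slice(-(suffixLength + 1)).join(".");
}
```

Under this rule an unknown-suffixed hostname always yields its last two labels, which differs from both alternatives discussed earlier (last label only, or the full input).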

oliverdunk

LGTM, though adding @rdcronin for a second approval on the Chrome side. This was a really thorough proposal, and I greatly appreciate the time and effort that has been put into it.

proposals/public-suffix.md

mckenfra · 2025-05-14T15:38:38Z

@Rob--W @simon-friedberger @oliverdunk I have pushed a minor update f1d81e5 containing the following changes:

* Renamed API methods, e.g. getRegistrableDomain() --> getDomain()
* Stated that getDomain(hostname, { allowUnknownSuffix }) returns the last 2 labels of unknown-suffixed input hostnames
* Updated the Open Web API section

oliverdunk · 2025-05-14T15:40:01Z

@Rob--W @simon-friedberger @oliverdunk I have pushed a minor update

Thanks for the heads up! No concerns and still LGTM.

Rob--W

Still looks good. I have added suggestions to clarify that the DomainOptions prescribe criteria for the input and the relation of the output; previously the text only stated what the output would look like.

These suggestions do not alter the proposal (and are already mentioned in more detail elsewhere in the document), but may help readers who only look at the API definition.

Rob--W · 2025-05-14T16:29:36Z

proposals/public-suffix.md

+  // Determines if the given hostname is itself a known eTLD (i.e. in the PSL).
+  export function isKnownSuffix(
+    hostname: string,


Suggested change

-  // Determines if the given hostname is itself a known eTLD (i.e. in the PSL).
-  export function isKnownSuffix(
-    hostname: string,
+  // Determines if the given suffix is itself a known eTLD (i.e. in the PSL).
+  export function isKnownSuffix(
+    suffix: string,

Nit: The input is not a hostname but a suffix.

I am using the term "hostname" in the sense of the hostname property of javascript's URL class:

const hostname = new URL("https://co.uk").hostname;
const isKnownSuffix = publicSuffix.isKnownSuffix(hostname);

The input parameter may be any hostname, and if it is anything other than a known suffix (e.g. an IP address, or an eTLD+1), then the method returns false.

In section "Behaviours: 4. Invalid hostname" I state:

This API's methods should throw an error if a hostname passed as an input parameter
[contains invalid characters]

I intended that same error-throwing behaviour to apply to this method too, so referring to the input parameter as a "hostname" is a way of conveying that intent.
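
A minimal sketch of that intent, under stated assumptions: `LISTED_SUFFIXES` is a hypothetical stand-in for the PSL, and `VALID_HOSTNAME` is an assumed validity rule (dot-separated labels of ASCII letters, digits, and hyphens) that may differ from the rule in the proposal.

```typescript
// Hypothetical stand-in for the PSL: a tiny subset, for illustration only.
const LISTED_SUFFIXES: ReadonlySet<string> = new Set(["com", "uk", "co.uk"]);

// Assumed validity rule for this sketch only; the proposal's rule may differ.
const VALID_HOSTNAME = /^[a-z0-9-]+(\.[a-z0-9-]+)*$/i;

// Accepts any hostname: throws on invalid input, returns false for anything
// that is valid but not itself a listed suffix (IP address, eTLD+1, ...).
function isKnownSuffix(hostname: string): boolean {
  if (!VALID_HOSTNAME.test(hostname)) {
    throw new Error(`Invalid hostname: ${hostname}`);
  }
  return LISTED_SUFFIXES.has(hostname.toLowerCase());
}
```

Here `co.uk` gives true, `example.co.uk` and `192.168.2.1` give false, and a string containing a space or `!` throws.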

proposals/public-suffix.md

mckenfra · 2025-05-15T14:38:05Z

@Rob--W I have pushed update e0f2835 containing:

* minor rewording of the DomainOptions comments (as you note, the same information is already described in more detail later in the proposal, so this is mainly to help the casual reader)
* updated section "Behaviours: 4. Invalid hostname" to clarify that an input hostname parameter may be an IP address or a domain name

xeenon

Thanks for this very detailed proposal!

rdcronin

Thank you for the PR! This generally looks good, but there are a few things I'd like to see addressed.

rdcronin · 2025-05-16T20:14:19Z

proposals/public-suffix.md

+#### 4. Search vs Navigate
+
+Firefox makes use of the PSL in order to determine whether to issue a search query
+or whether to try a navigation, when a user enters a domain-like string in the
+url bar. In such instances, a PSL lookup is made and:
+
+* If the domain has a known eTLD, attempt to navigate.
+* If the domain has an unknown eTLD, use a search engine.


This is a use case for the PSL, but I'm not sure it's as compelling a use case for extensions themselves (since it boils down to "handle things with TLDs differently", which is already described).

This use case was provided by a reviewer during review of this proposal to justify the inclusion of functionality to distinguish between known vs unknown suffixes. I will add an update to note that this specific use case may not apply directly to extensions.

@rdcronin IMHO there are two ways the PSL is used:

1. Figuring out a site, for grouping or associating data with a site, or deleting site history, etc.

2. Figuring out if something looks like a valid domain - that's what this is and I don't think it's covered above...? Actually, it looks like all the other examples are examples of (1.) and this is the only example of (2.).
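
For comparison, use (1.) might look like the sketch below. `groupBySite` takes the lookup as a parameter so that it does not depend on any particular API shape; `lastTwoLabels` is a hypothetical lookup used only for the demo and does not consult the PSL.

```typescript
// Groups hostnames by site using whatever lookup function the caller supplies.
function groupBySite(
  hosts: string[],
  getDomain: (hostname: string) => string | null,
): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const host of hosts) {
    // Fall back to the hostname itself when no domain can be determined.
    const site = getDomain(host) ?? host;
    const members = groups.get(site) ?? [];
    members.push(host);
    groups.set(site, members);
  }
  return groups;
}

// Hypothetical lookup for the demo: last two labels, or null for
// single-label hosts.
const lastTwoLabels = (hostname: string): string | null => {
  const labels = hostname.split(".");
  return labels.length < 2 ? null : labels.slice(-2).join(".");
};

const grouped = groupBySite(
  ["mail.example.com", "www.example.com", "localhost"],
  lastTwoLabels,
);
```

In this demo the two `example.com` hosts land in one group and `localhost` forms its own group via the `?? host` fallback.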

rdcronin · 2025-05-16T20:15:45Z

proposals/public-suffix.md

+    // If true, the returned domain should be encoded as Unicode.
+    // Default = false (Punycode)
+    unicode?: boolean,


Could this be an enum instead, to allow future encoding options?

Yes, I will rename this option from unicode to encoding, as follows:

interface DomainOptions {
  // Determines how the returned domain should be encoded.
  // Default = punycode
  encoding?: DomainEncoding,
  // etc.
}

enum DomainEncoding {
  punycode,
  unicode,
}

rdcronin · 2025-05-16T20:16:11Z

proposals/public-suffix.md

+
+    // If true, and the input hostname is an IP address, then this is returned as-is.
+    // Default = false
+    allowIP?: boolean,


nit: I'd lean towards "allowIpAddress[es]"

Ok, I will rename this to allowIpAddress.

The other options are singular (e.g. allowUnknownSuffix) rather than plural, so I would favour the singular here too.

Should be allowIPAddress.

rdcronin · 2025-05-16T20:19:27Z

proposals/public-suffix.md

+    // the penultimate two domain labels of the input.
+    // Default = false
+    allowUnknownSuffix?: boolean,
+  }


Should we provide an option for whether to include private registries? (I think in this proposal, called "implied suffixes"?)

Are you referring to the "Private section" of the PSL dataset? If so, these private registries are always included, as explained in proposal section "Features of the PSL: 1. ICANN vs Private".

rdcronin · 2025-05-16T20:24:35Z

proposals/public-suffix.md

+Unfortunately, while this offered a solution to the performance problem,
+it added additional complexity to the API. To resolve this issue, the API
+has now been changed to being synchronous, which has allowed the batching method
+to be removed, thereby making the API more ergonomic.


As we discussed in our meeting, I'd like us to expand this section with more details about this. Sync APIs are against our typical best practices and, even if they are possible without significant extra architectural work today, they may not always be that way (over the past decade, we've seen many architectural shifts that have invalidated previously-held assumptions, such as subframes being in the same process or navigation being driven by the renderer process).

I'm not fundamentally opposed to these being sync, but I think we should have good reason beyond just making it a bit easier, since we may pay heavily for it in the future. The best justification would be a scenario that's not solvable without a sync API (e.g., use in document_start), but AFAIK we haven't identified any of those. Are there specific instances of things that are dramatically more complex to implement with async APIs? Can we add those here as documentation for why we diverge from typical best practices?

Understood, I will look into how to expand this section with further justification.

Is this async preference documented somewhere?

We mention it briefly here: https://github.com/w3c/webextensions/blob/main/proposals/proposal_template.md#schema

I think @Rob--W had some thoughts on how to update this section, so he might be able to help (sorry to sign you up for that Rob!).

Add Public Suffix API proposal

0f09b4a

Rob--W reviewed Aug 20, 2024

Rob--W requested review from oliverdunk and xeenon August 20, 2024 12:29

oliverdunk reviewed Aug 21, 2024

mckenfra changed the title ~~Add Public Suffix API proposal~~ Proposal: Public Suffix API Aug 23, 2024

mckenfra force-pushed the publicsuffix branch from 960f99d to 0f09b4a Compare September 6, 2024 03:05

Update Public Suffix API proposal

3301526

mckenfra force-pushed the publicsuffix branch from 638833b to 3301526 Compare September 6, 2024 04:29

mckenfra requested review from Rob--W and oliverdunk September 9, 2024 11:31

Rob--W reviewed Sep 11, 2024

Rob--W mentioned this pull request Sep 12, 2024

Publish minutes of 2024-09-12 meeting #685

Merged

simon-friedberger reviewed Nov 14, 2024
