Proposal: Public Suffix API #676
Thanks for this! I will reach out to the PSL maintainers I have been in contact with to ask them to take a look. I'll also share this internally to get an overall opinion from Chrome.
960f99d to 0f09b4a
638833b to 3301526
ee17198 to c217156
On the current proposal, my main requests for changes are:
- If other browser vendors are on board, I prefer a synchronous API over an asynchronous one (and with that the bulk query method would be redundant).
- The `strict` option may be too broad, and I'd favor splitting it up into specific options to allow extensions to customize their behavior.
- The `parse` method has limited utility. Drop it; I expect the only given use case ("search-vs-navigate") to be covered by the broader options to `getRegistrableDomain`.
First, to help with minimizing the amount of back and forth, here is what I am proposing (and I have chatted with @simon-friedberger to confirm that he's okay with the design):
browser.publicSuffix.isKnownPublicSuffix(string) -> boolean
browser.publicSuffix.getKnownPublicSuffix(string) -> string | null
browser.publicSuffix.getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null
The return values are `null` to allow extensions to use the API like `getXXX(val) ?? val`, but I'm also willing to consider an empty string or throwing an error.
`isKnownPublicSuffix` returns whether a specific string is on the PSL. I include this here because the PSL algorithm returns the longest label, but sometimes one may be interested in knowing whether there is any other shorter label that might have been a valid domain in theory. E.g. `github.io` is a public suffix itself, but could also be interpreted as an eTLD+1 for `github.io`, depending on the use case.
By default, the `getRegistrableDomain` method follows the definition of "registrable domain" from the URL spec: https://url.spec.whatwg.org/#host-miscellaneous. However, that interpretation is too strict for some use cases, hence the extra options.
For the use case of "search or navigate", `allowIP` and `allowPlainSuffix` would be set to true, but `allowUnknownSuffix` to false. Or they could use the `getKnownPublicSuffix` method.
The `allowIP` option exists because IP addresses have to be special-cased by the extension. For most domain inputs, one could split at dots to try and get a different domain level, but that logic does not make sense for IP addresses. If this distinction is unimportant, this option can be dropped and merged with `allowUnknownSuffix`.
The `allowUnknownSuffix` option exists to exclude non-domains with an unknown suffix such as `green.banana`; otherwise `getRegistrableDomain` would effectively return a string for almost every input.
The `allowPlainSuffix` option only exists because there are domains that do not have an eTLD+1 but can still be navigated to, such as github.io and blogspot.com. These examples are public suffixes themselves, but there is no `+1` in `eTLD+1`.
Examples for `getKnownPublicSuffix(string) -> string | null`:
- `github.io` -> `github.io`
- `foo.github.io` -> `github.io`
- `facebook.co.uk` -> `co.uk`
- `192.168.2.1` -> `null`
- `green.banana` -> `null`
Semantics for `isKnownPublicSuffix(string) -> boolean`:
- True iff `getKnownPublicSuffix(string)` returns the input string.
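The examples and semantics above could be sketched roughly as follows. This is only an illustration under stated assumptions: `TOY_SUFFIXES` is a hypothetical hard-coded stand-in for the PSL, and wildcard and exception rules are ignored. It is not how any browser implements the lookup.

```typescript
// Hypothetical stand-in for the Public Suffix List (assumption for illustration).
// A real implementation would use the browser's built-in list and also handle
// wildcard and exception rules.
const TOY_SUFFIXES: ReadonlySet<string> = new Set(["io", "github.io", "uk", "co.uk", "com"]);

// Returns the longest known public suffix of `hostname`, or null if none matches.
function getKnownPublicSuffix(hostname: string): string | null {
  const labels = hostname.toLowerCase().split(".");
  // Try the longest candidate first (the whole hostname), then drop the
  // leftmost label on each iteration.
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (TOY_SUFFIXES.has(candidate)) return candidate;
  }
  return null;
}

// True iff getKnownPublicSuffix returns the input string itself.
function isKnownPublicSuffix(hostname: string): boolean {
  return getKnownPublicSuffix(hostname) === hostname;
}
```

With this toy list, `isKnownPublicSuffix("github.io")` is true while `isKnownPublicSuffix("foo.github.io")` is false, because the longest matching suffix of the latter is `github.io` rather than the input itself.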
Examples for `getRegistrableDomain(string, { allowIP, allowUnknownSuffix, allowPlainSuffix }) -> string | null`:
- `github.io` -> `null` (with `allowPlainSuffix=true` -> `github.io`)
- `foo.github.io` -> `foo.github.io`
- `facebook.co.uk` -> `facebook.co.uk`
- `192.168.2.1` -> `null` (with `allowIP=true` -> `192.168.2.1`)
- `green.banana` -> `null` (with `allowUnknownSuffix=true` -> `green.banana`)
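A minimal sketch of how the three options might interact, reproducing the examples above. Several things here are assumptions made for illustration: the hard-coded `TOY_SUFFIXES` set standing in for the PSL, the simplified IPv4-only `isIPv4` helper, the `looksNavigable` helper, and the choice to return the full input for unknown suffixes.

```typescript
interface RegistrableDomainOptions {
  allowIP?: boolean;
  allowUnknownSuffix?: boolean;
  allowPlainSuffix?: boolean;
}

// Hypothetical stand-in for the Public Suffix List (assumption for illustration).
const TOY_SUFFIXES: ReadonlySet<string> = new Set(["io", "github.io", "uk", "co.uk", "com"]);

// Simplified IPv4-only check (assumption); real code would also cover IPv6.
function isIPv4(host: string): boolean {
  const parts = host.split(".");
  return parts.length === 4 && parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
}

function getRegistrableDomain(
  hostname: string,
  options: RegistrableDomainOptions = {},
): string | null {
  const host = hostname.toLowerCase();
  // IP addresses have no labels to split, so they are special-cased first.
  if (isIPv4(host)) return options.allowIP ? host : null;

  const labels = host.split(".");
  for (let i = 0; i < labels.length; i++) {
    if (TOY_SUFFIXES.has(labels.slice(i).join("."))) {
      // i === 0 means the whole input is itself a public suffix (no "+1").
      if (i === 0) return options.allowPlainSuffix ? host : null;
      // Otherwise return the suffix plus exactly one more label (eTLD+1).
      return labels.slice(i - 1).join(".");
    }
  }
  // No known suffix matched; this sketch returns the full input when allowed.
  return options.allowUnknownSuffix ? host : null;
}

// "Search or navigate": accept IPs and plain suffixes, reject unknown suffixes.
function looksNavigable(input: string): boolean {
  return (
    getRegistrableDomain(input, {
      allowIP: true,
      allowPlainSuffix: true,
      allowUnknownSuffix: false,
    }) !== null
  );
}
```

Because the failure value is `null`, a caller that wants to fall back to the raw input can write `getRegistrableDomain(host) ?? host`.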
proposals/public-suffix.md (outdated)
hostname: string,
options?: RegistrableDomainOptions,
)
: Promise<string | null>;
I'm wondering whether it is feasible to make this API synchronous. We usually require new extension APIs to be asynchronous unless a good reason is given otherwise.
In practice, the example extensions I checked in your list currently use a library with a synchronous `getDomain` method, and generally rewriting already-sync code to use an asynchronous method is difficult. Moreover, the bulk `getRegistrableDomain` method proposed here shows that the basic `getRegistrableDomain` method already appears to have too much overhead that necessitates a bulk query method.
In Firefox, the internal effectiveTLDService API can be used from the parent and content process, including extensions. There is no implementation constraint for requiring this to be implemented in the parent process.
In Chromium, I see at least one use of `registry_controlled_domains` in the renderer (`LocalDOMWindow::IsCrossSiteSubframe` in `local_dom_window.cc`), which suggests that the information may be available in the child process too.
In WebKit, I see `topPrivatelyControlledDomain` being used in `areSameSiteIgnoringPublicSuffix` in `Document.cpp`, suggesting that the information may be available in the content process as well.
Since the current implementations offer the functionality in the child process, and the more ergonomic version of the API is for it to be synchronous, I'm favoring the APIs to be non-async.
@oliverdunk @xeenon Thoughts?
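To illustrate the ergonomic difference described above, here is a rough sketch of filtering hostnames by domain with a synchronous versus an asynchronous lookup. `getDomainSync` and `getDomainAsync` are hypothetical stand-ins using a naive "last two labels" rule, not the proposed API or any existing library.

```typescript
// Hypothetical stand-ins (assumption): naive "last two labels" lookups.
function getDomainSync(hostname: string): string {
  return hostname.split(".").slice(-2).join(".");
}

async function getDomainAsync(hostname: string): Promise<string> {
  return getDomainSync(hostname);
}

// Sync: the lookup can be called directly inside callbacks such as filter.
function sameSiteHostsSync(hosts: string[], site: string): string[] {
  return hosts.filter((h) => getDomainSync(h) === site);
}

// Async: filter cannot await, so all lookups are resolved up front and the
// calling function (and everything that calls it) has to become async too.
async function sameSiteHostsAsync(hosts: string[], site: string): Promise<string[]> {
  const domains = await Promise.all(hosts.map((h) => getDomainAsync(h)));
  return hosts.filter((_, i) => domains[i] === site);
}
```

The `Promise.all` step in the async variant is also where the pressure for a bulk query method tends to come from.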
I have gone ahead and changed this proposal's API to sync in commit 5ba391f. However, I can revert this to async if necessary.
This sounds reasonable to me, however I would ultimately defer to @rdcronin here.
Agreed - this would alleviate any performance concerns and allow us to simplify the API as you suggest.
Agreed.
As per my response to your other comment, the reason the
My gut instinct is that for some use cases it will be useful/essential to know whether the returned value is an IP address or a domain name.
Agreed, I made this point in an earlier discussion of this proposal.
See my response to your other comment. It is possible that a clarification of the algorithm might avoid the need for
In section "Behaviours: 4. Strict" in the proposal, I have stated that the equivalent of a registrable domain for an unknown-suffixed hostname that is not an IP address should be simply "the last domain label". So for Some further examples:
@simon-friedberger See my above comment that
@Rob--W I have addressed your comments in commit 5ba391f as follows:
The contents look good to me, in my opinion this is ready for wider review.
In a recent comment (#676 (comment)), you wondered whether to return the full input domain or just the last domain label for an input without a known suffix. The current proposal has an example `my.net.foobar` that returns the full input, which looks good to me.
I think in this case we want to apply the following rule from the PSL algorithm:
meaning for
LGTM, though adding @rdcronin for a second approval on the Chrome side. This was a really thorough proposal, and I greatly appreciate the time and effort that has been put into it.
@Rob--W @simon-friedberger @oliverdunk I have pushed a minor update f1d81e5 containing the following changes:
Thanks for the heads up! No concerns and still LGTM.
Still looks good. I have added suggestions to clarify that the DomainOptions prescribe criteria for the input and its relation to the output; previously the text only stated what the output would look like.
These suggestions do not alter the proposal (and are already mentioned in more detail elsewhere in the document), but may help readers who only look at the API definition.
// Determines if the given hostname is itself a known eTLD (i.e. in the PSL).
export function isKnownSuffix(
hostname: string,
// Determines if the given hostname is itself a known eTLD (i.e. in the PSL).
export function isKnownSuffix(
hostname: string,
// Determines if the given suffix is itself a known eTLD (i.e. in the PSL).
export function isKnownSuffix(
suffix: string,
Nit: The input is not a hostname but a suffix.
I am using the term "hostname" in the sense of the `hostname` property of JavaScript's URL class:
const hostname = new URL("https://co.uk").hostname;
const isKnownSuffix = publicSuffix.isKnownSuffix(hostname);
The input parameter may be any hostname, and if it is anything other than a known suffix (e.g. an IP address, or an eTLD+1), then the method returns `false`.
In section "Behaviours: 4. Invalid hostname" I state:
> This API's methods should throw an error if a hostname passed as an input parameter [contains invalid characters]

I intended that same error-throwing behaviour to apply to this method too, so referring to the input parameter as a "hostname" is a way of conveying that intent.
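The intended behaviour (return `false` for any valid hostname that is not a known suffix, throw for an invalid hostname) might look roughly like the sketch below. It rests on assumptions: `TOY_SUFFIXES` is a hypothetical stand-in for the PSL, and "invalid characters" is interpreted here as anything outside ASCII letters, digits, `-` and `.`, which may differ from what the proposal actually specifies.

```typescript
// Hypothetical stand-in for the Public Suffix List (assumption for illustration).
const TOY_SUFFIXES: ReadonlySet<string> = new Set(["uk", "co.uk", "com"]);

// Assumption: a hostname is valid if it contains only ASCII letters, digits,
// "-" and ".". The proposal's actual validity rules may differ.
const VALID_HOSTNAME = /^[a-z0-9.-]+$/i;

function isKnownSuffix(hostname: string): boolean {
  if (!VALID_HOSTNAME.test(hostname)) {
    throw new TypeError(`Invalid hostname: ${hostname}`);
  }
  // Any valid hostname that is not on the list (an IP address, an eTLD+1, ...)
  // simply yields false rather than an error.
  return TOY_SUFFIXES.has(hostname.toLowerCase());
}
```

Under these assumptions, `isKnownSuffix("co.uk")` is true, `isKnownSuffix("example.co.uk")` and `isKnownSuffix("192.168.2.1")` are false, and `isKnownSuffix("bad host!")` throws.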
@Rob--W I have pushed update e0f2835 containing:
Thanks for this very detailed proposal!
This formalizes #231 into a concrete proposal.