diff --git a/index.html b/index.html
index d9ea77c..0294dc3 100644
--- a/index.html
+++ b/index.html
@@ -93,17 +93,20 @@

Languages and Language Tags

Tags for identifying the natural language of content or the international preferences of users are one of the fundamental building blocks of the Web. The language tags found in Web and Internet formats and protocols are defined by [[BCP47]]. Consistent use of language tags provides applications the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select an appropriate font for displaying text or a Web page designer might style text differently in one language than in another.

Many of the core standards for the Web include support for language tags. These include the xml:lang attribute in [[XML10]], the lang and hreflang attributes in [[HTML]], the language property in [[XSL10]], the :lang pseudo-class in CSS [[CSS3-SELECTORS]], and many others, including SVG, TTML, and SSML.

-

Natural Language (or, in this document, just language). The spoken, written, or signed communications used by human beings.

+ +

Natural Language (or, in this document, just language). The spoken, written, or signed communications used by human beings.

There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [[BCP47]]. "BCP" nomenclature refers to the current set of IETF RFCs that form the "best current practice".

-

Language tag. A string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [[BCP47]] language tag. These language tags consist of one or more subtags.

+ +

Language tag. A string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [[BCP47]] language tag. These language tags consist of one or more subtags.
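The subtag structure described above can be illustrated with a minimal Python sketch. It assumes only that subtags are separated by hyphens and that language tags compare case-insensitively; the helper names `subtags` and `same_tag` are hypothetical, and the sketch does not check a tag against the [[BCP47]] grammar.

```python
def subtags(tag):
    """Split a language tag into its subtags, lowercased for comparison."""
    return tag.lower().split("-")


def same_tag(a, b):
    """Compare two language tags without regard to case."""
    return subtags(a) == subtags(b)


parts = subtags("zh-Hant-TW")       # three subtags
equal = same_tag("en-US", "EN-us")  # case does not distinguish tags
```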

Specifications for the Web that require language identification MUST refer to [[BCP47]].

Specifications SHOULD NOT refer to specific component RFCs of [[BCP47]].

-

[[BCP47]] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [[RFC5646]], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [[RFC4647]], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.

-

Formulations such as "RFC 5646 or its successor" MAY be used, but only in cases where the specific document version is necessary.

+

[[BCP47]] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [[RFC5646]], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [[RFC4647]], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.

+ +

Formulations such as "RFC 5646 or its successor" MAY be used, but only in cases where the specific document version is necessary.

While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [[RFC4646]], referring to the BCP will not incur additional compliance risk to most implementations.
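As an illustration of the matching schemes mentioned above, the following is a simplified Python sketch of the "lookup" scheme described in [[RFC4647]]: each language range in the user's priority list is progressively truncated from the end until it matches an available tag. The function name `lookup` and its parameters are hypothetical, and the sketch skips the bare wildcard range `*` and does not handle wildcards inside a range, so it should be read as an approximation rather than a complete implementation.

```python
def lookup(priority_list, available_tags, default=None):
    """Return the first available tag matched by truncating each range."""
    # Index available tags by lowercase form; matching is case-insensitive.
    available = {tag.lower(): tag for tag in available_tags}
    for language_range in priority_list:
        if language_range == "*":
            continue  # a bare wildcard carries no preference; skip it
        parts = language_range.lower().split("-")
        while parts:
            candidate = "-".join(parts)
            if candidate in available:
                return available[candidate]
            parts.pop()
            # A single-character subtag (such as "x") is removed together
            # with the subtag that followed it.
            while parts and len(parts[-1]) == 1:
                parts.pop()
    return default


best = lookup(["fr-CA", "en"], ["en", "fr", "de"])
private = lookup(["zh-Hant-CN-x-private1"], ["zh", "zh-Hant"])
fallback = lookup(["ja"], ["en", "fr"], default="en")
```

Here `fr-CA` falls back to `fr` before the lower-priority `en` is considered, which is the behavior that distinguishes lookup from simple exact matching.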

@@ -190,8 +193,6 @@

Languages and Language Tags

For example, JavaScript internationalization [[ECMA-402]] and [[CLDR]] provide a "best fit" algorithm which can be tailored by implementers.

- -
@@ -240,7 +241,7 @@

Locales and Internationalization

Since the adoption of the current [[BCP47]] identifier syntax, a number of locale models have adopted BCP47 directly or provided adaptation or mappings between proprietary models and language tags. Notably, the development and adoption of the open-source repository of locale data known as [[CLDR]] has led to wider general adoption of language tags as locale identifiers.

-

Common Locale Data Repository (or [[CLDR]]). The Common Locale Data Repository is a Unicode Consortium project that defines, collects, and curates sets of data needed to enable locales in systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.

+

Common Locale Data Repository (or [[CLDR]]). The Common Locale Data Repository is a Unicode Consortium project that defines, collects, and curates sets of data needed to enable locales in systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.

Unicode Locale Identifier or Unicode Locale. A language tag that follows the additional rules and restrictions on subtag choice defined in UTR#35 [[LDML]]. Any valid Unicode locale identifier is also a valid [[BCP47]] language tag, but a few valid language tags are not also valid Unicode locale identifiers.

@@ -507,7 +508,59 @@

Locales and Internationalization

Users expect form fields and other data inputs to use a presentation for non-linguistic fields that is consistent with the document or application where the values appear. Users usually expect their input to match the document's context rather than that of the user-agent or operating environment, and expect input validation, prompting, and controls to be consistent with the content as well. This gives content authors the ability to create a wholly localized customer experience and is generally in keeping with customer expectations.

- + +
+

Choosing between metadata and text-processing language

+ +

There are two common uses for language tags in document formats, protocols, and specifications. In some cases, language tags are used to provide metadata about the intended audience for collections of content, such as at the record or document level. In other cases, language tags are used to identify the language of specific bits of text in order to facilitate text processing.

+ +
+
The language of the intended audience
+ +

Metadata that describes the language of the intended audience is about the document as a whole. Such metadata may be used for searching, serving the right language version, classification, etc. Where there are language changes in a document, information about the language of the intended audience is not specific enough to support text-processing of the kind needed for text-to-speech, styling, automatic font assignment, etc.

+ +

The language of the intended audience does not include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.

+ +

On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a Web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences, but it does occur where there are multilingual communities. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another.

+ +

There are also pages where the navigational information, including the page title, is in one language but the real content of the page is in another. While this is not necessarily good practice, it doesn't change the fact that the language of the intended audience is usually that of the content, regardless of the language at the top of the document source.

+ +

Metadata about the language of the intended audience is usually best declared outside the document, such as in the HTTP Content-Language header.
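Because a document can have more than one audience language, a consumer of this metadata needs to treat it as a list. The Python sketch below assumes the header value is a comma-separated list of language tags, as in HTTP Content-Language; the function name `audience_languages` is hypothetical and the tags are not validated.

```python
def audience_languages(header_value):
    """Split a Content-Language header value into its language tags."""
    return [tag.strip() for tag in header_value.split(",") if tag.strip()]


# A bilingual Canadian page declares two audience languages.
bilingual = audience_languages("fr-CA, en-CA")

# The German city guide declares only German, even though it
# contains Chinese phrases.
guide = audience_languages("de")
```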

+
+ +
+
The text-processing language
+ +

When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text (such as voice browsers, spell checkers, or style processors) can process the text in a language-appropriate manner. So we are, by necessity, talking about associating a single language with a specific range of text.

+ +

This specificity distinguishes the declaration of the language for text-processing from that of the language of the intended audience.

+ +

The language for text-processing is usually best declared using attributes on elements, including setting a document-wide default.
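The following Python sketch shows one way a processor might associate a single language with each range of text, assuming the language is declared with a `lang` attribute that descendant elements inherit unless they override it. The function name `text_languages` and the sample markup are hypothetical, and the sketch uses an XML parser on well-formed markup rather than a full HTML parser.

```python
import xml.etree.ElementTree as ET


def text_languages(element, inherited=""):
    """Return (language, text) pairs for each run of text under element."""
    # An element's own lang attribute overrides the inherited value.
    lang = element.get("lang", inherited)
    runs = []
    if element.text and element.text.strip():
        runs.append((lang, element.text.strip()))
    for child in element:
        runs.extend(text_languages(child, lang))
        # Text after a child element belongs to the parent's language.
        if child.tail and child.tail.strip():
            runs.append((lang, child.tail.strip()))
    return runs


doc = ET.fromstring(
    '<html lang="de"><body><p>Guten Tag heißt '
    '<span lang="zh">你好</span> auf Chinesisch.</p></body></html>'
)
runs = text_languages(doc)
```

The `lang` attribute on the root element acts as the document-wide default, and the `span` overrides it for just the Chinese phrase.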

+ + + +
+

Further Reading

diff --git a/local.css b/local.css
index bafbc19..921693c 100644
--- a/local.css
+++ b/local.css
@@ -77,6 +77,15 @@ kbd {
   text-align: start;
 }
+.kw {
+  font-family: Menlo, Consolas, "DejaVu Sans Mono", Monaco, monospace;
+  font-size: .95em;
+  color: blue;
+  page-break-inside: avoid;
+  hyphens: none;
+  text-transform: none;
+}
+
 .summary {
   padding: 1em;