Update the use of ECHAR and UCHAR in canonical N-Quads. #27

Merged
gkellogg merged 10 commits into main from canon-escapes on Apr 20, 2023

gkellogg
Member

@gkellogg gkellogg commented Apr 5, 2023

Fixes #16.

Previously discussed in the RCH WG https://www.w3.org/2023/03/15-rch-minutes.html#r01.

cc/ @philarcher @yamdan



@gkellogg gkellogg added the labels "needs discussion" (Proposed for discussion in an upcoming meeting) and "spec:substantive" (Change in the spec affecting its normative content (class 3); see also spec:bug, spec:new-feature) Apr 5, 2023
@gkellogg gkellogg requested review from afs and domel April 5, 2023 18:41
@gkellogg gkellogg requested a review from afs April 5, 2023 23:10
@gkellogg gkellogg removed the "needs discussion" label Apr 5, 2023
@afs
Contributor

afs commented Apr 6, 2023

Should we have a formal review of this document?

We could use the opportunity to run a process in the WG.

Such a review needn't affect FPWD - it could be before or soon after. The aim is to (unofficially) get a stable point for RCH.

@gkellogg
Member Author

gkellogg commented Apr 6, 2023

Sounds like a good general topic for the WG to discuss. Maybe as part of the FPWD discussion.

gkellogg added a commit to w3c/rdf-n-triples that referenced this pull request Apr 10, 2023
Contributor

@yamdan yamdan left a comment

Sorry for the late comment; it seems to me that we could explicitly list four more "ECHAR"s (i.e., \b, \t, \f, \') to remove some ambiguity.

spec/index.html Outdated
@@ -266,35 +271,18 @@ <h2>A Canonical form of N-Quads</h2>
MUST NOT use the datatype IRI part of the <a href="#grammar-production-literal">literal</a>,
and are represented using only <a href="#grammar-production-STRING_LITERAL_QUOTE">STRING_LITERAL_QUOTE</a>.
</li>
<!--li><code><a href="#grammar-production-HEX">HEX</a></code> MUST use only uppercase letters (<code>[A-F]</code>).</li-->
<li>Characters MUST NOT be represented by <code><a href="#grammar-production-UCHAR">UCHAR</a></code>.</li>
<li><code><a href="#grammar-production-HEX">HEX</a></code> MUST use only uppercase letters (<code>[A-F]</code>).</li>
<li>Within <a href="#grammar-production-STRING_LITERAL_QUOTE">STRING_LITERAL_QUOTE</a>,
the characters
<code>U+0022</code>, <code>U+005C</code>, <code>U+000A</code>, <code>U+000D</code>
Contributor

Suggested change
- <code>U+0022</code>, <code>U+005C</code>, <code>U+000A</code>, <code>U+000D</code>
+ <code>U+0022</code>, <code>U+005C</code>, <code>U+000A</code>, <code>U+000D</code>,
+ <code>U+0008</code>, <code>U+0009</code>, <code>U+000C</code>, <code>U+0027</code>

IIRC, it might be a good idea to add here the other four "ECHAR"s, i.e., \b (U+0008), \t (U+0009), \f (U+000C), \' (U+0027), otherwise how to encode them seems currently ambiguous.

Member Author

While those characters aren’t included in the requirement for using ECHAR, there is no ambiguity, as the text explicitly says characters between U+0000 and U+001F along with U+007F that are not explicitly encoded using ECHAR must be encoded as UCHAR. All other characters must be represented natively in Unicode. There are requirements elsewhere about \\, but this could be made more explicit. I think \' should not be represented using ECHAR.

Contributor

Thanks, I think I misunderstood the following text:

characters between U+0000 and U+001F along with U+007F that are not explicitly encoded using ECHAR must be encoded as UCHAR

From your comment, I now understand that these four ECHARs, \b (U+0008), \t (U+0009), \f (U+000C), \' (U+0027), should be encoded or represented as follows. Is that correct?

  • \b (U+0008), \t (U+0009), \f (U+000C) => encoded using UCHAR (i.e., \u0008, \u0009, \u000C)
  • \' (U+0027) => represented natively in Unicode (i.e., ')
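
For illustration only, the reading above can be sketched in Python. This reflects the rules as understood at this point in the thread (ECHAR only for U+0022, U+005C, U+000A, U+000D; UCHAR for the remaining characters in U+0000 to U+001F plus U+007F; everything else native), not necessarily the merged spec text, which was tweaked afterwards. The names `ECHAR` and `escape_literal` are hypothetical and the 4-digit uppercase `\uXXXX` form is an assumption based on the HEX requirement in the diff.

```python
# Hypothetical sketch of the escaping rules as read in this comment.
# It is NOT the final spec wording and not taken from any implementation.

# Characters that, on this reading, must be written with ECHAR.
ECHAR = {
    "\u0022": '\\"',   # quotation mark
    "\u005C": "\\\\",  # backslash
    "\u000A": "\\n",   # line feed
    "\u000D": "\\r",   # carriage return
}


def escape_literal(value: str) -> str:
    """Escape the content of a STRING_LITERAL_QUOTE (without the quotes)."""
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in ECHAR:
            out.append(ECHAR[ch])
        elif cp <= 0x1F or cp == 0x7F:
            # Remaining control characters: UCHAR with uppercase HEX
            # (assumed 4-digit \uXXXX form).
            out.append("\\u%04X" % cp)
        else:
            # Everything else, including U+0027, stays as native Unicode.
            out.append(ch)
    return "".join(out)
```

On this reading a tab comes out as `\u0009` and an apostrophe is left unescaped, which matches the two bullets above.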

Member Author

Probably requires tweaking for the characters you mention, but yes, in that lower range, characters are either ECHAR or UCHAR. Some characters outside that range also must use ECHAR, and U+007F must use UCHAR. I’ll tweak the wording some more tomorrow.

Member Author

Added a commit to clarify this, and to use the relevant character abbreviations, which were originally intended and are in fact what my own implementation uses.

Contributor

@yamdan yamdan left a comment

Thank you for your tweaking. The text has become crystal clear to me. I added a minor typo fix.

gkellogg and others added 2 commits April 18, 2023 10:37
Co-authored-by: Ted Thibodeau Jr
@gkellogg gkellogg merged commit 719bc64 into main Apr 20, 2023
@gkellogg gkellogg deleted the canon-escapes branch April 20, 2023 21:15
gkellogg added a commit to w3c/rdf-n-triples that referenced this pull request Apr 20, 2023
* Updates canonical N-Triples to be consistent with N-Quads. Fixes #2.
  *  Synchronize changes made in w3c/rdf-n-quads#27 for canonicalization.

---------

Co-authored-by: Ted Thibodeau Jr
Labels
spec:substantive Change in the spec affecting its normative content (class 3) –see also spec:bug, spec:new-feature
Development

Successfully merging this pull request may close these issues.

Re-consider use of escapes in canonical N-Quads
5 participants