Update the use of ECHAR and UCHAR in canonical N-Quads. #27

gkellogg · 2023-04-05T18:41:04Z

Fixes #16.

Previously discussed in the RCH WG https://www.w3.org/2023/03/15-rch-minutes.html#r01.

cc/ @philarcher @yamdan

spec/index.html

Co-authored-by: Andy Seaborne <[email protected]>

afs · 2023-04-06T11:55:11Z

Should we have a formal review of this document?

We could use the opportunity to run a process in the WG.

Such a review needn't affect FPWD; it could happen before or soon after. The aim is to (unofficially) get a stable point for RCH.

gkellogg · 2023-04-06T14:47:31Z

Sounds like a good general topic for the WG to discuss. Maybe as part of the FPWD discussion.

yamdan

Sorry for the late comment; it seems to me that we could explicitly list four more "ECHAR"s (i.e., \b, \t, \f, \') to remove some ambiguity.

yamdan · 2023-04-13T09:00:19Z

spec/index.html

@@ -266,35 +271,18 @@ <h2>A Canonical form of N-Quads</h2>
      MUST NOT use the datatype IRI part of the <a href="#grammar-production-literal">literal</a>,
      and are represented using only <a href="#grammar-production-STRING_LITERAL_QUOTE">STRING_LITERAL_QUOTE</a>.
    </li>
-    <!--li><code><a href="#grammar-production-HEX">HEX</a></code> MUST use only uppercase letters (<code>[A-F]</code>).</li-->
-    <li>Characters MUST NOT be represented by <code><a href="#grammar-production-UCHAR">UCHAR</a></code>.</li>
+    <li><code><a href="#grammar-production-HEX">HEX</a></code> MUST use only uppercase letters (<code>[A-F]</code>).</li>
    <li>Within <a href="#grammar-production-STRING_LITERAL_QUOTE">STRING_LITERAL_QUOTE</a>,
      the characters 
      <code>U+0022</code>, <code>U+005C</code>, <code>U+000A</code>, <code>U+000D</code>


Suggested change

-      <code>U+0022</code>, <code>U+005C</code>, <code>U+000A</code>, <code>U+000D</code>
+      <code>U+0022</code>, <code>U+005C</code>, <code>U+000A</code>, <code>U+000D</code>,
+      <code>U+0008</code>, <code>U+0009</code>, <code>U+000C</code>, <code>U+0027</code>

IIRC, it might be a good idea to add here the other four "ECHAR"s, i.e., \b (U+0008), \t (U+0009), \f (U+000C), \' (U+0027), otherwise how to encode them seems currently ambiguous.

While those characters aren’t included in the requirement for using ECHAR, there is no ambiguity, as the text explicitly says characters between U+0000 and U+001F along with U+007F that are not explicitly encoded using ECHAR must be encoded as UCHAR. All other characters must be represented natively in Unicode. There are requirements elsewhere about \\, but this could be made more explicit. I think \' should not be represented using ECHAR.

Thanks, I think I misunderstood the following text:

characters between U+0000 and U+001F along with U+007F that are not explicitly encoded using ECHAR must be encoded as UCHAR

From your comment, I now understand that these four ECHARs, \b (U+0008), \t (U+0009), \f (U+000C), \' (U+0027), should be encoded or represented as follows. Is that correct?

\b (U+0008), \t (U+0009), \f (U+000C) => encoded using UCHAR (i.e., \u0008, \u0009, \u000C)

\' (U+0027) => represented natively in Unicode (i.e., ')

Probably requires tweaking for the characters you mention, but yes, in that lower range, characters are either ECHAR or UCHAR. Some characters outside that range also must use ECHAR and 007F must use UCHAR. I’ll tweak the wording some more tomorrow.

Added a commit to clarify this, and use the relevant character abbreviations which were originally intended and in fact in my own implementation.
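The rule discussed in this exchange can be sketched in Python as follows. This is only an illustration, not text from the spec: the names `ECHAR_BASE`, `ECHAR_EXTENDED`, and `escape_literal` are hypothetical, and the extended set (adding \b, \t, \f) is an assumption based on the suggestion above and the later commit title "Add other characters to be escaped using ECHAR when canonicalizing."

```python
# Hypothetical sketch of the escaping rule discussed in this thread.
# The two ECHAR sets are read off the conversation, not the final spec text.

# Characters listed in the diff hunk under review.
ECHAR_BASE = {
    "\u0022": '\\"',   # quotation mark
    "\u005C": "\\\\",  # backslash
    "\u000A": "\\n",   # line feed
    "\u000D": "\\r",   # carriage return
}

# Assumption: the follow-up commit also requires ECHAR for \b, \t, \f.
# U+0027 (') is deliberately absent: it stays as a native character.
ECHAR_EXTENDED = {
    **ECHAR_BASE,
    "\u0008": "\\b",   # backspace
    "\u0009": "\\t",   # horizontal tab
    "\u000C": "\\f",   # form feed
}


def escape_literal(value, echar=ECHAR_BASE):
    """Escape the contents of a STRING_LITERAL_QUOTE (without the quotes)."""
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in echar:
            # Characters with a required ECHAR form.
            out.append(echar[ch])
        elif cp <= 0x1F or cp == 0x7F:
            # Remaining U+0000..U+001F and U+007F: UCHAR, uppercase HEX.
            out.append("\\u%04X" % cp)
        else:
            # Everything else is written natively.
            out.append(ch)
    return "".join(out)
```

With `ECHAR_BASE`, a tab is written as `\u0009`, which matches the reading in the question above; with `ECHAR_EXTENDED` it is written as `\t`. In both cases `'` is left as-is and U+007F becomes `\u007F`.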

yamdan

Thank you for the tweaks. The text is now crystal clear to me. I added a minor typo fix.

spec/index.html

Co-authored-by: Dan Yamamoto <[email protected]>

spec/index.html

Co-authored-by: Ted Thibodeau Jr <[email protected]>

* Updates canonical N-Triples to be consistent with N-Quads. Fixes #2.
* Synchronize changes made in w3c/rdf-n-quads#27 for canonicalization.

---------

Co-authored-by: Ted Thibodeau Jr <[email protected]>

Update the use of ECHAR and UCHAR in canonical N-Quads.

0498947

Fixes #16.

gkellogg added the needs discussion and spec:substantive labels Apr 5, 2023

gkellogg requested review from afs and domel April 5, 2023 18:41

Remove dangling statement about not using UCHAR.

dd69523

afs requested changes Apr 5, 2023

gkellogg and others added 2 commits April 5, 2023 15:57

Apply suggestions from code review

2e7909e

Co-authored-by: Andy Seaborne <[email protected]>

Language changes suggested by @afs.

d6a5327

gkellogg requested a review from afs April 5, 2023 23:10

gkellogg removed the needs discussion label Apr 5, 2023

domel approved these changes Apr 6, 2023

afs approved these changes Apr 6, 2023

gkellogg added a commit to w3c/rdf-n-triples that referenced this pull request Apr 10, 2023

Synchronize changes made in w3c/rdf-n-quads#27 for canonicalization.

d8cf773

Add paragraph saying that Canonical N-Quads extends Canonical N-Triples.

3c71958

yamdan reviewed Apr 13, 2023

gkellogg added 2 commits April 14, 2023 10:44

Add other characters to be escaped using ECHAR when canonicalizing.

3927d71

Fix styling of grammar table.

851c32a

yamdan reviewed Apr 15, 2023
