Update the use of ECHAR and UCHAR in canonical N-Quads. #27
Co-authored-by: Andy Seaborne <[email protected]>
Should we have a formal review of this document? We could use the opportunity to run a process in the WG. Such a review needn't affect FPWD - it could be before or soon after. The aim is to (unofficially) get a stable point for RCH.
Sounds like a good general topic for the WG to discuss. Maybe as part of the FPWD discussion.
Sorry for the late comment; it seems to me that we could explicitly list four more "ECHAR"s (i.e., `\b`, `\t`, `\f`, `\'`) to remove some ambiguity.
spec/index.html
Outdated
@@ -266,35 +271,18 @@ <h2>A Canonical form of N-Quads</h2>
MUST NOT use the datatype IRI part of the <a href="#grammar-production-literal">literal</a>,
and are represented using only <a href="#grammar-production-STRING_LITERAL_QUOTE">STRING_LITERAL_QUOTE</a>.
</li>
<!--li><code><a href="#grammar-production-HEX">HEX</a></code> MUST use only uppercase letters (<code>[A-F]</code>).</li-->
<li>Characters MUST NOT be represented by <code><a href="#grammar-production-UCHAR">UCHAR</a></code>.</li>
<li><code><a href="#grammar-production-HEX">HEX</a></code> MUST use only uppercase letters (<code>[A-F]</code>).</li>
<li>Within <a href="#grammar-production-STRING_LITERAL_QUOTE">STRING_LITERAL_QUOTE</a>,
the characters
<code>U+0022</code>, <code>U+005C</code>, <code>U+000A</code>, <code>U+000D</code>
<code>U+0022</code>, <code>U+005C</code>, <code>U+000A</code>, <code>U+000D</code>
<code>U+0022</code>, <code>U+005C</code>, <code>U+000A</code>, <code>U+000D</code>,
<code>U+0008</code>, <code>U+0009</code>, <code>U+000C</code>, <code>U+0027</code>
IIRC, it might be a good idea to add the other four "ECHAR"s here, i.e., `\b` (U+0008), `\t` (U+0009), `\f` (U+000C), `\'` (U+0027); otherwise, how to encode them currently seems ambiguous.
While those characters aren’t included in the requirement for using ECHAR, there is no ambiguity, as the text explicitly says that characters between U+0000 and U+001F, along with U+007F, that are not explicitly encoded using ECHAR must be encoded as UCHAR. All other characters must be represented natively in Unicode. There are requirements elsewhere about `\\`, but this could be made more explicit. I think `\'` should not be represented using ECHAR.
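As a rough illustration of the rule as described in this comment (not the spec's normative text, and reflecting only the wording at this point in the review), a per-code-point classifier might look like the sketch below. The names `ECHAR_CODE_POINTS` and `classify` are hypothetical; the ECHAR set is taken from the four code points listed in the diff under review.

```python
# Hypothetical sketch: decide how a code point inside a
# STRING_LITERAL_QUOTE would be written under the rule as described
# in this comment. Not normative.

# Code points the diff under review lists as requiring ECHAR:
# quotation mark, backslash, line feed, carriage return.
ECHAR_CODE_POINTS = {0x22, 0x5C, 0x0A, 0x0D}


def classify(code_point: int) -> str:
    """Return 'ECHAR', 'UCHAR', or 'native' for one code point."""
    if code_point in ECHAR_CODE_POINTS:
        return "ECHAR"
    # Remaining control characters U+0000..U+001F, plus U+007F (DEL).
    if code_point <= 0x1F or code_point == 0x7F:
        return "UCHAR"
    # Everything else is written as the character itself.
    return "native"
```

Under this reading, U+0009 falls into the UCHAR branch and U+0027 is written natively, which is the point the comment makes about there being no ambiguity.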
Thanks, I think I misunderstood the following text:
> characters between U+0000 and U+001F along with U+007F that are not explicitly encoded using ECHAR must be encoded as UCHAR

From your comment, I now understand that these four ECHARs, `\b` (U+0008), `\t` (U+0009), `\f` (U+000C), `\'` (U+0027), should be encoded or represented as follows. Is that correct?
- `\b` (U+0008), `\t` (U+0009), `\f` (U+000C) => encoded using UCHAR (i.e., `\u0008`, `\u0009`, `\u000C`)
- `\'` (U+0027) => represented natively in Unicode (i.e., `'`)
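To make the reading above concrete, here is a minimal sketch of an escaping routine that produces exactly those forms. It assumes the ECHAR set is the four code points in the diff under review (U+0022, U+005C, U+000A, U+000D); the names `ECHAR` and `escape_literal` are hypothetical, and this is an illustration of this comment's interpretation rather than the spec's final wording.

```python
# Hypothetical sketch of the reading in this comment; not normative.

# ECHAR forms for the four code points named in the diff under review.
ECHAR = {
    "\u0022": '\\"',   # quotation mark
    "\u005C": "\\\\",  # backslash
    "\u000A": "\\n",   # line feed
    "\u000D": "\\r",   # carriage return
}


def escape_literal(text: str) -> str:
    """Escape a literal's lexical form (without the surrounding quotes)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in ECHAR:
            out.append(ECHAR[ch])
        elif cp <= 0x1F or cp == 0x7F:
            # UCHAR: lowercase \u followed by four uppercase hex digits.
            out.append("\\u%04X" % cp)
        else:
            # Native Unicode, which includes U+0027 (apostrophe).
            out.append(ch)
    return "".join(out)


print(escape_literal("a\tb'c\n"))  # prints a\u0009b'c\n
```

The uppercase hex digits follow the `HEX` requirement in the diff above. Whether `\b`, `\t`, and `\f` stay in the UCHAR branch depends on the wording the PR finally adopts, so treat the table as adjustable.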
It probably requires tweaking for the characters you mention, but yes, in that lower range, characters are either ECHAR or UCHAR. Some characters outside that range must also use ECHAR, and U+007F must use UCHAR. I’ll tweak the wording some more tomorrow.
Added a commit to clarify this and to use the relevant character abbreviations, which were originally intended and are in fact what my own implementation uses.
Thank you for the tweaks. The text is now crystal clear to me. I added a minor typo fix.
Co-authored-by: Dan Yamamoto <[email protected]>
Co-authored-by: Ted Thibodeau Jr <[email protected]>
* Updates canonical N-Triples to be consistent with N-Quads. Fixes #2.
* Synchronize changes made in w3c/rdf-n-quads#27 for canonicalization.

Co-authored-by: Ted Thibodeau Jr <[email protected]>
Fixes #16.
Previously discussed in the RCH WG https://www.w3.org/2023/03/15-rch-minutes.html#r01.
cc/ @philarcher @yamdan