JSON BOM serialization "trims" whitespace from DPKG license text (XML does not) #135
Comments
TrimStringSerializer is used in JSON because JSON does not support the concept of CDATA. There were too many instances where people (or various Java implementations) would use the library without properly performing input validation before sending data to create the BOM. TrimStringSerializer exists to reduce the number of errors people were encountering while serializing JSON. For license text, the recommended way to include full license text that results in identical JSON and XML is to base64 encode it, which is supported by both formats. Properties are
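A minimal sketch of the base64 approach suggested in this comment, using only `java.util.Base64` from the JDK. The class and method names are hypothetical, and how the encoded value gets attached to the BOM model is not shown; this only illustrates that the encoded form survives any whitespace trimming and decodes back to the exact original text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the CycloneDX library.
final class LicenseTextBase64 {
    // Encode license text as base64 over its UTF-8 bytes.
    static String encode(String licenseText) {
        return Base64.getEncoder()
                .encodeToString(licenseText.getBytes(StandardCharsets.UTF_8));
    }

    // Decode base64 back to the original text, whitespace included.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```

The trade-off raised later in the thread still applies: the encoded value is no longer readable by a human inspecting the JSON.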
Not sure that transforming whitespace (but ignoring characters like ? < > ) really counts as input validation? Unless the API documents that it is making a transformation, I think it should pass the text through unchanged. Passing it through unchanged is also much more useful behaviour, since it gives the user of the API the choice of behaviour rather than locking them in. Use of base64 is pretty unusual inside a JSON file, and it certainly complicates understanding the contents when a human looks at them, e.g. for debugging, so it should be a choice, not something enforced.
What's the desired way forward here? Based on @stevespringett's response this seems to be working as designed.
I provided a bit of detail and analysis in #394 but I have some additional feedback. Concerning base64, this is a universally accepted way to encode binary data in both XML and JSON documents, for example if you needed to include the contents of a PDF document. In contrast, license and copyright text are indeed text. They should be correctly encoded as a JSON string without any characters arbitrarily stripped out, modified, or replaced. A standards-compliant JSON generator/parser will correctly encode/decode the textual data, preserving the original text. Concerning CDATA, whether license and copyright text is encoded as CDATA or with characters escaped (with entity references) is irrelevant; it is a stylistic choice. A standards-compliant XML generator/parser will correctly encode/decode the textual data, preserving the original text.
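To illustrate the point that plain JSON string escaping already preserves whitespace, here is a small hand-rolled sketch. It is an assumption-laden illustration only: the class and its methods are hypothetical, `decode` handles just the literals that `encode` produces, and a real project would rely on a JSON library rather than this code.

```java
// Hypothetical illustration; not a general-purpose JSON parser.
final class JsonStringRoundTrip {
    // Encode text as a JSON string literal, escaping quotes, backslashes,
    // and control characters as required for JSON strings.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Decode a literal produced by encode() back to the original text.
    static String decode(String literal) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < literal.length() - 1; i++) {
            char c = literal.charAt(i);
            if (c != '\\') { sb.append(c); continue; }
            char e = literal.charAt(++i);
            switch (e) {
                case 'n': sb.append('\n'); break;
                case 'r': sb.append('\r'); break;
                case 't': sb.append('\t'); break;
                case 'u':
                    // Four hex digits follow the escape marker.
                    sb.append((char) Integer.parseInt(literal.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default: sb.append(e); // escaped quote or backslash
            }
        }
        return sb.toString();
    }
}
```

Newlines, tabs, and runs of spaces all survive the round trip, so no trimming is needed to produce valid JSON.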
Background
On Debian and Ubuntu systems the dpkg copyright files are (in modern times, thank goodness) intended to be machine readable according to this spec. The CycloneDX Linux generator on Ubuntu faithfully replicates the text of the copyright file into components/[]/licenses/[]/license/text/content, as one might expect.
According to the JSON AbstractBomGenerator.java line 68, it would appear ALL STRINGS, when serialized to JSON, are serialized with TrimStringSerializer, which not only trims leading and trailing whitespace but also removes whitespace within the text, similar to how an HTML processor might.
The XML AbstractBomXmlGenerator.java does not remove whitespace, which would seem to be the correct behavior.
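To make the impact concrete, here is a minimal sketch in plain Java of what this kind of whitespace handling does to a machine-readable copyright stanza. The `collapseWhitespace` helper is a hypothetical stand-in that assumes runs of whitespace are reduced to a single space; it is not the library's actual TrimStringSerializer code, and the sample text is made up.

```java
// Hypothetical demonstration; approximates the behavior described in this issue.
final class WhitespaceCollapseDemo {
    // Trim both ends, then replace every run of whitespace with one space.
    static String collapseWhitespace(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // In the Debian copyright format, line breaks and leading spaces
        // carry meaning (continuation lines, "." as a blank-line marker).
        String stanza = "License: MIT\n Line one\n .\n Line two\n";
        System.out.println(collapseWhitespace(stanza));
    }
}
```

After collapsing, the continuation lines and the blank-line marker can no longer be told apart from ordinary text, so the stanza cannot be parsed back into its original structure.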
Bug
History
TrimStringSerializer in 0fab7fb#diff-9f5ef24a21ed10eaae782875e4efc0cd90cec8b7f598bee89d096f50431db5cc. Without test cases to accompany either of those changes, it's hard for me to understand why they were made. The history of July 9th 2020 doesn't show PRs or groups of commits that seem to help me understand either. The problem is that this behavior was obviously desired, but I'm not clear why or how it would be helpful.
Potential Solutions
Personal Note
This is my first comment to this project, and I look forward to working with you if possible. I have both personal and professional interest in this area, and I hope to both integrate with and contribute to CycloneDX.