Proposal: support embedded protos for literal serialization #14

Ostrzyciel · 2024-12-23T16:29:51Z

Literals in RDF can sometimes store quite a lot of data by themselves, which can bog down the serialization, transmission, and parsing if not done efficiently. Examples of this are:

With RDF 1.2, we will get the new rdf:JSON datatype for JSON values.
The recent CDT proposal includes datatypes for lists and maps.
GeoSPARQL uses the geo:wktLiteral which typically contains lists of numerical data (potentially very long).
RDF 1.1 includes the xsd:hexBinary and xsd:base64Binary datatypes that inefficiently encode binary data as ASCII strings.
RDF 1.1 contains numerical literals (e.g., xsd:double) that could possibly be represented more efficiently as bytes.
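To put rough numbers on the last two examples, here is a minimal Python sketch comparing lexical forms with raw binary sizes. The payload and the chosen double value are arbitrary assumptions made for illustration only; short values such as 0.1 are actually smaller as text, so the gain depends on the data.

```python
import base64
import math
import struct

# Arbitrary 1024-byte binary payload (illustrative only).
payload = bytes(range(256)) * 4

# Lexical forms as they would appear in xsd:hexBinary / xsd:base64Binary literals.
hex_lexical = payload.hex().upper()
b64_lexical = base64.b64encode(payload).decode("ascii")

# A double with many significant digits: text form vs. IEEE 754 binary64.
double_lexical = repr(math.pi)
double_binary = struct.pack("<d", math.pi)

sizes = {
    "raw bytes": len(payload),
    "xsd:hexBinary": len(hex_lexical),
    "xsd:base64Binary": len(b64_lexical),
    "xsd:double (text)": len(double_lexical),
    "xsd:double (binary64)": len(double_binary),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Hex encoding doubles the size and base64 adds roughly a third, which matches the general properties of these encodings.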

In all of these cases, the data in question could be represented more efficiently if a binary format were used. Note that this would not make sense in every possible use case, as Jelly uses the lexical space of literals for a reason. Many RDF libraries will simply refuse, or make it very hard, to work with the value space of a literal directly instead of its lexical space. Additionally, lexical<->value space conversions are already included in these libraries, so we don't have to reimplement them, which would surely introduce many bugs. This is the reason why Jelly currently doesn't use value encodings for numerical datatypes.

However, in some cases (e.g., when transmitting data from IoT sensors), such specialized encodings would make a lot of sense.

Scope:

As this feature is expected to be applied on a case-by-case basis, it should be designed so that the information about how a literal is encoded is included in the stream itself.
- The RdfDatatypeEntry would be extended with an optional field that would specify how to parse literals with this datatype. This information could be, for example, a string with the fully qualified name of a Protocol Buffers message that should be used to read the binary data.
- If the consumer doesn't have a matching decoder registered, it can simply throw an error and refuse to read the message.
The lex field of RdfLiteral should be changed to be of type bytes instead of string.
- This is a backwards-compatible change, because these two types are encoded exactly the same in Protobuf (both use the length-delimited wire type).
- If there is no registered handler for the datatype in question, the lex field would be treated as UTF-8 (as normal).
- If there is a matching handler, the consumer should read the field's length prefix (the varint that follows the tag) to get the size of the field and hand off parsing to the registered handler. The handler is informed how many bytes it should read (we know it from the length prefix). After it finishes parsing, the main consumer code resumes work.
- During serialization, this will require the handlers to inform the producer how many bytes they intend to use to encode the field, so that the length prefix can be written correctly. This may be a tricky performance problem (see below).
In principle, the datatype handlers could be any code (including a raw byte array copy). I think we should make it RECOMMENDED that the handler name is the FQN of a protobuf message that is read/written without its own length delimiter (the length prefix is already provided by the lex field of RdfLiteral).
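The hand-off described in this list can be sketched in plain Python, without the protobuf library. This is only an illustration under several assumptions: the field number 1 for lex, the handler name example.DoubleValue, the HANDLERS registry, and the read_lex function are all hypothetical. The sketch also reads one reasonable interpretation into the rules above: no handler name declared for the datatype means UTF-8, while a declared but unregistered handler is an error.

```python
import struct
from typing import Callable, Dict, Optional, Tuple

WIRE_LEN = 2  # protobuf wire type shared by string, bytes, and nested messages


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_len_field(field_number: int, payload: bytes) -> bytes:
    """Write tag, length prefix, then payload. Identical for string and bytes."""
    tag = (field_number << 3) | WIRE_LEN
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# Hypothetical registry: handler name -> decoder for the raw field content.
HANDLERS: Dict[str, Callable[[bytes], object]] = {
    "example.DoubleValue": lambda raw: struct.unpack("<d", raw)[0],
}


def read_lex(buf: bytes, handler_name: Optional[str]) -> object:
    """Read one length-delimited field and dispatch on the declared handler."""
    tag, pos = decode_varint(buf, 0)
    if tag & 0x7 != WIRE_LEN:
        raise ValueError("expected a length-delimited field")
    length, pos = decode_varint(buf, pos)  # the size comes from the length prefix
    raw = buf[pos:pos + length]
    if handler_name is None:
        return raw.decode("utf-8")  # no handler declared: plain lexical form
    handler = HANDLERS.get(handler_name)
    if handler is None:
        raise LookupError(f"no handler registered for {handler_name}")
    return handler(raw)
```

For example, `read_lex(encode_len_field(1, b"abc"), None)` yields the string `"abc"`, while the same call with a packed double and `"example.DoubleValue"` yields a float. Note that this sketch slices the payload out of an in-memory buffer, so it says nothing about the streaming concerns discussed under "Implementation & performance".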

Implementation & performance:

This feature should be disabled by default, because it relies on the consumer having the specific datatype handlers available. In Jelly-JVM, this should only be enabled per-datatype when the user requests it explicitly.
The producer/consumer hand-off should be doable, but may require some digging in the protobuf library internals to realize well.
- The simplest solution is to simply buffer the entire literal in a byte array and later parse it from memory. This would work, but would absolutely kill the performance for very large literals (e.g., several MB).
- A better solution would be to create a "child" byte stream that is only able to write/read a walled-off portion of the main byte stream. I'd have to do some prototyping to see how elegant we could make this. Is there any prior work on dynamic nesting of protos?
In serialization we may have the problem that the length of the nested message must be known ahead of time. This will range from "easy" to "very hard" depending on the data in question, and may potentially be a large performance problem.
- In practice, we will have to run something like computeSerializedSize just like Protobuf does before the actual serialization.
- Tricks like rewinding the output stream to update the size will not help. This would require buffering everything written, at which point it's simpler to just write the nested message to a byte array.
The resulting proto will be limited to 2GB in size by most protobuf implementations (int32 addressing, I guess). I don't think it's a huge problem...
As a guiding scenario for exploring the performance issues, let's use this: the RDF stream contains a literal of size 100 MB which is just a binary blob. Upon receiving the blob, we want to stream it directly to disk, without storing the whole thing in RAM.
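The "child" stream idea and the guiding scenario can be sketched on the read side as follows. This is a rough Python illustration, not a proposal for the actual Jelly-JVM design: BoundedReader and copy_blob are hypothetical names, and an in-memory BytesIO stands in for both the network stream and the file on disk.

```python
import io
from typing import BinaryIO


class BoundedReader:
    """Read-only view over the next `limit` bytes of a parent stream."""

    def __init__(self, parent: BinaryIO, limit: int) -> None:
        self._parent = parent
        self._remaining = limit

    def read(self, size: int = -1) -> bytes:
        # Never read past the walled-off region, whatever the caller asks for.
        if size < 0 or size > self._remaining:
            size = self._remaining
        chunk = self._parent.read(size)
        self._remaining -= len(chunk)
        return chunk


def copy_blob(source: BoundedReader, sink: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Copy the bounded region to the sink in chunks; return bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            return total
        sink.write(chunk)
        total += len(chunk)


# Demo: the blob sits between other data in the main stream.
blob = bytes(10_000)
stream = io.BytesIO(b"HEAD" + blob + b"TAIL")
stream.read(4)  # the main consumer has parsed everything before the blob

sink = io.BytesIO()  # stands in for a file opened with open(path, "wb")
copied = copy_blob(BoundedReader(stream, len(blob)), sink, chunk_size=1000)

rest = stream.read()  # the main consumer resumes right after the blob
```

At most one chunk is held in memory at a time, and the handler cannot consume bytes that belong to the rest of the message. The write side is harder, because the length prefix has to be known before the first payload byte is written.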


Ostrzyciel · 2024-12-24T11:15:36Z

One thing came up when I discussed this with someone else – for this to work, the consumer would have to FIRST read the datatype field in RdfLiteral, and then the content of the field. This is tricky, because protobuf allows fields to be serialized in any order, and one field can even be repeated multiple times – in that case, the last occurrence is the one that should be used. If we kept this approach, it would have the following consequences:

"Naive" implementations would have to always buffer the byte content of the literal, which would incur a performance penalty.
To get better performance, the producer would have to write the fields in a specific order (which would most likely require modifying the protoc-generated code, so "meh"), and the consumer would require similar adjustments to its protoc-generated code.

I'm not sure if it's such a good idea, as this leaves you with the choice of either sacrificing performance or maintainability.
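The ordering problem can be made concrete with a small hand-rolled parser. This is only a sketch: the field numbers (1 for lex, 3 for datatype) and the parse_literal function are assumptions made for illustration, and only the VARINT and LEN wire types are handled.

```python
from typing import Optional, Tuple


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a protobuf varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_literal(buf: bytes) -> Tuple[Optional[bytes], Optional[int]]:
    """Parse a literal whose fields may arrive in any order (last one wins)."""
    lex_raw: Optional[bytes] = None
    datatype_id: Optional[int] = None
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        if wire_type == 2:  # LEN
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
            if field_number == 1:
                # The datatype may not be known yet, so the raw bytes must be
                # kept until the whole message has been read.
                lex_raw = value
        elif wire_type == 0:  # VARINT
            number, pos = decode_varint(buf, pos)
            if field_number == 3:
                datatype_id = number  # a later occurrence overrides this one
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return lex_raw, datatype_id


lex_first = b"\x0a\x03abc\x18\x07"           # lex, then datatype
datatype_first = b"\x18\x07\x0a\x03abc"      # datatype, then lex
repeated = b"\x18\x01\x0a\x03abc\x18\x07"    # datatype given twice
```

All three inputs parse to `(b"abc", 7)`. Only in the `datatype_first` layout could a consumer pick the handler before touching the payload, and even then a later repeated datatype field could in principle override it, which is why a spec-conforming consumer ends up buffering.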

One alternative I could think of is to place the binary blob in a subsequent RdfStreamRow, and only leave a small binary mark in RdfLiteral that the blob is there.

Ostrzyciel · 2024-12-24T11:40:40Z

Regarding the SOTA of this, I only found this: https://protobuf.dev/programming-guides/proto3/#any https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/any.proto

It's a somewhat similar idea, but it would require us to specify the nested message FQN with every literal... which is not very efficient, to say the least. It also suffers from the same field ordering issue that I outlined above. So, as far as I understand the protobuf-java code, the implementation simply reads the entire nested message into a byte buffer and then parses it from memory. Not great.

Note that the FQNs in Google's Any message are URLs to allow for dynamic fetching of protos and binaries to decode them... which sounds like a security nightmare. Let's NOT do this.

Ostrzyciel · 2024-12-26T18:59:52Z

Open question: how to handle compatibility negotiation in the gRPC pub/sub protocol?

Maybe we could add an optional field to RdfStreamOptions that lists all supported datatype IRI -> binary handler mappings? This would also be possible to include in other contexts, as an entirely optional thing.
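A check on the consumer side could then be as simple as the sketch below. The mapping shape (datatype IRI to handler name), the handler names, and the missing_handlers function are all hypothetical; nothing like this exists in RdfStreamOptions today.

```python
from typing import Dict, List, Set


def missing_handlers(declared: Dict[str, str], available: Set[str]) -> List[str]:
    """Return the datatype IRIs whose declared handler the consumer lacks."""
    return sorted(iri for iri, handler in declared.items() if handler not in available)


# Mappings the producer would announce in the stream options (hypothetical names).
declared = {
    "http://www.opengis.net/ont/geosparql#wktLiteral": "example.geo.Geometry",
    "http://www.w3.org/2001/XMLSchema#base64Binary": "example.RawBytes",
}

# Handlers the consumer has registered.
available = {"example.RawBytes"}

missing = missing_handlers(declared, available)
if missing:
    print("cannot decode:", ", ".join(missing))
```

A subscriber could reject the stream up front when the list is non-empty, instead of failing on the first literal it cannot decode.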

Ostrzyciel · 2024-12-26T19:00:31Z

We should also gather information on the FQNs used by the community in the wild and how they should be interpreted. Something similar to this: #15

Some of these that are RDF-specific (e.g., CDTs) could live here in this repo. Others (tensors, for example) could simply be references to external repos.

Ostrzyciel · 2025-03-12T14:29:00Z

Regarding compactness of this approach – if we have these embedded protos as a bytes field and then nest, let's say, a varint in there, we will be wasting some bytes on specifying the length of the bytes field. It would probably be much more efficient to have dedicated varint / float fields that would not have this overhead. It would still be possible to make this extensible, it's just that implementations would have the ability to use native field encodings.

We'd basically only need to add 3 more fields, corresponding to the 3 other wire types: VARINT, I32, I64. See: https://protobuf.dev/programming-guides/encoding/#structure

One thing to consider is whether we could make pluggable parsers for such fields. No idea, honestly.

For nested messages, this is not an issue, as we can simply skip writing the top-level message tag and start writing the contents of the message. This way we are not wasting any bytes.
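The overhead for scalars can be worked out by hand. The sketch below encodes the value 300 both ways; the field numbers (1 for the bytes field, 4 for a dedicated varint field) are hypothetical and chosen only so that each tag fits in one byte.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """Encode a field tag: (field_number << 3) | wire_type, as a varint."""
    return encode_varint((field_number << 3) | wire_type)


VARINT, LEN = 0, 2

payload = encode_varint(300)  # two bytes: 0xAC 0x02

# Option 1: the varint wrapped in a bytes field (tag + length prefix + payload).
as_bytes_field = tag(1, LEN) + encode_varint(len(payload)) + payload

# Option 2: a dedicated varint field (tag + payload, no length prefix).
as_varint_field = tag(4, VARINT) + payload

print(len(as_bytes_field), len(as_varint_field))
```

The wrapped form costs one extra byte per literal here, which is a third more for a value this small. For fixed-width I32/I64 fields the saving is the same single length byte.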

Ostrzyciel added the "new protocol feature" label (Discussion about a new feature in the Jelly protocol) on Dec 23, 2024

This was referenced Dec 26, 2024

Investigation: text compression in literals #16

Open

Implement an "autotuner" for encoder settings Jelly-RDF/jelly-jvm#250

Open

Ostrzyciel mentioned this issue Mar 10, 2025

RDF/CBOR eclipse-rdf4j/rdf4j#5268

Open

Ostrzyciel mentioned this issue Mar 16, 2025

Investigation: look-back term encoding #37

Open
