Currently, when multiple names/prefixes/datatypes must be defined in the stream before a single statement, each falls into a separate RdfStreamRow. For example (from Nanopub Registry):
rows {
  name {
    id: 0
    value: "sig"
  }
}
rows {
  name {
    id: 0
    value: "hasAlgorithm"
  }
}
# and here goes the quad
If the entries are added sequentially (and they often are), we could perhaps squash them into a single row.
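A sketch of what the squashed row could look like in protobuf text format, assuming the value field of RdfNameEntry simply becomes repeated (illustrative only, not actual encoder output):

rows {
  name {
    id: 0
    value: "sig"
    value: "hasAlgorithm"
  }
}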
This would be done by changing the type of the value field to repeated.
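For illustration, the schema change might look roughly like this (the field numbers here are placeholders, not necessarily those in the actual .proto definition):

message RdfNameEntry {
  uint32 id = 1;
  // was: string value = 2;
  repeated string value = 2;
}

Since strings cannot be packed, each repeated value is still written as its own tag + LEN record on the wire, so the savings come only from not repeating the RdfStreamRow and entry wrappers.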
This would save 4 bytes for each squashed entry (2 for the tag and LEN of RdfStreamRow, and 2 for the tag and LEN of Rdf(Name|Prefix|Datatype)Entry). Further savings could be achieved if we processed triples in minibatches (maybe introduce such an API in ProtoEncoder?), where we'd have to assume that the dictionaries are large enough to hold all the entries needed for the minibatch. This should not be a problem for batches of, let's say, 10 statements.
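To make the minibatch idea a bit more concrete, a hypothetical stream fragment for one minibatch could look like this (the terms are made up and the statement rows are elided):

# all name entries needed by the next few statements, squashed up front
rows {
  name {
    id: 0
    value: "sig"
    value: "hasAlgorithm"
    value: "hasPublicKey"
  }
}
# ... followed by the triple/quad rows of the minibatch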
I'd have to run some scripts on the datasets in RiverBench to see what the savings would be in concrete terms. TODO: test it with different minibatch sizes, from 1 up to, let's say, 16.