Sub-par concurrent read performance with jena-iri #1470
Are you calling jena-iri directly?
1/ (repeated from JENA-2309) One such IRI3986 implementation is https://github.com/afs/x4ld/tree/main/iri4ld. Other implementations can be plugged in.
2/ https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/riot/system/FactoryRDFCaching.java#L62
FYI: tarql/tarql#99 upgrades tarql to Apache Jena 4.5.0
Adding a cache to E_IRI/IRIx should be simple, and I can check how much this improves things. How does the iri4ld implementation differ from Jena's current default one functionality-wise?
Good to know that it's possible to compare the performance of Spark-based tarql to the original tarql within Jena 4! :)

In addition, I noticed that E_BNode also causes waits due to synchronization in a SecureRandom instance. This is probably better handled as a separate issue, but for now I just wanted to document it here.

CONSTRUCT { <urn:example:s> <urn:example:p> ?a, ?b, ?c } # ... 16 columns in total
FROM <file:data.csv>
WHERE { BIND(bnode(?a) AS ?foobar) }

The same job with tarql/Jena 2 executes in somewhere between 50 and 60 seconds; with bnode it tends more towards 60 seconds, so in single-threaded processing the effect is less visible. It seems that threads competing for the bnode call are also a bottleneck.
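The contention pattern described above can be sketched with plain JDK classes. This is a hypothetical illustration, not Jena code: a single shared SecureRandom serializes all worker threads on its internal lock, whereas a per-thread instance removes the cross-thread waits. The class and method names here are invented for the example.

```java
import java.security.SecureRandom;

// Hypothetical sketch: blank-node label allocation backed by one shared
// SecureRandom makes every worker thread queue on the same monitor.
// A per-thread instance avoids that contention entirely.
public class BNodeLabels {
    // Per-thread instance: no locking between worker threads.
    private static final ThreadLocal<SecureRandom> PER_THREAD =
            ThreadLocal.withInitial(SecureRandom::new);

    static String freshLabel() {
        byte[] bytes = new byte[16];
        PER_THREAD.get().nextBytes(bytes);
        StringBuilder sb = new StringBuilder(32);
        for (byte b : bytes)
            sb.append(String.format("%02x", b));   // 2 hex digits per byte
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = freshLabel();
        String b = freshLabel();
        System.out.println(a.length());   // 32 hex characters
        System.out.println(a.equals(b));  // fresh labels differ
    }
}
```

The trade-off is that each thread pays the (one-off) seeding cost of its own SecureRandom, in exchange for lock-free label generation afterwards.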
The Javadoc describes the operations. It is a Jena IRIProvider: a Java-coded parser for RFC 3986. jena-iri is a general system for IRIs and is complicated to build; iri4ld is simple to build and provides the operations needed for linked data. Like jena-iri, it is independent of the Jena RDF codebase. iri4ld has less in the way of extras not used by Jena. The parser is a single file, IRI3986.java, covering all URIs (except that it works on Java Unicode strings, so RFC 3987). It has some additional scheme-specific rule support for the common schemes: it covers "http:", "https:", "did:", "file:", "urn:uuid:", "urn:", "uuid:" (which is not official) and "example:" (RFC 7595).
The parsers generate blank nodes by allocating a UUID once at the start of a parser run, then xor'ing the label into the random number. Unlabelled blank nodes get a not-writable label (it has a 0 byte in it) allocated from a counter.
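The allocation scheme just described can be sketched roughly as follows. This is an assumption-laden illustration, not the actual parser code: the exact mixing function is not specified in the thread, so `String.hashCode()` stands in for it here, and all names are invented.

```java
import java.util.UUID;

// Rough sketch of the scheme described above: one random UUID per parser
// run, with each label xor'ed into it so equal labels map to the same
// blank-node id within that run. The hashCode() mixing is a stand-in,
// not what Jena actually does.
public class BNodeAlloc {
    private final UUID seed = UUID.randomUUID();  // allocated once per run
    private long counter = 0;

    UUID labelled(String label) {
        // fold the label into the low bits of the per-run random seed
        return new UUID(seed.getMostSignificantBits(),
                        seed.getLeastSignificantBits() ^ label.hashCode());
    }

    UUID fresh() {
        // unlabelled blank nodes: counter-based, distinct from any label
        return new UUID(seed.getMostSignificantBits(),
                        seed.getLeastSignificantBits() ^ (++counter << 32));
    }

    public static void main(String[] args) {
        BNodeAlloc alloc = new BNodeAlloc();
        // same label, same run -> same id; different labels -> different ids
        System.out.println(alloc.labelled("b0").equals(alloc.labelled("b0")));
        System.out.println(alloc.labelled("b0").equals(alloc.labelled("b1")));
    }
}
```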
IRIx is not the place to put a cache. IRIx is general IRI machinery for any purpose. The session is provided by a FactoryRDF (FactoryRDFCaching extends FactoryRDFStd implements FactoryRDF). The cache is then of NodeURIs.
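The shape of such a factory-level cache can be illustrated with plain JDK collections. This is not Jena's FactoryRDFCaching, just a minimal sketch of the idea it embodies: map IRI strings to already-built node objects so repeated IRIs skip re-parsing. The `NodeURI` record and all other names here are invented stand-ins.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch, not Jena code: an LRU cache at the node factory
// so that repeated IRI strings reuse the node built on first sight
// instead of going through the IRI parsing machinery again.
public class CachingNodeFactory {
    record NodeURI(String iri) {}        // stand-in for a real URI node

    private final int capacity;
    private final Map<String, NodeURI> cache;
    int parses = 0;                      // counts cache misses (real parses)

    CachingNodeFactory(int capacity) {
        this.capacity = capacity;
        // access-ordered LinkedHashMap gives simple LRU eviction
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, NodeURI> e) {
                return size() > CachingNodeFactory.this.capacity;
            }
        };
    }

    NodeURI createURI(String iri) {
        return cache.computeIfAbsent(iri, s -> {
            parses++;                    // only on a miss
            return new NodeURI(s);       // the "expensive" parse happens here
        });
    }

    public static void main(String[] args) {
        CachingNodeFactory f = new CachingNodeFactory(1000);
        f.createURI("http://example.org/p");
        f.createURI("http://example.org/p");   // cache hit, no re-parse
        f.createURI("http://example.org/q");
        System.out.println(f.parses);          // two distinct IRIs parsed
    }
}
```

Note that a per-parser-run cache like this needs no synchronization when each worker thread gets its own factory instance, which fits the session model described above.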
Version
4.6.0-SNAPSHOT
What happened?
I have started looking again into the issues I had with Jena in Spark settings, related to https://issues.apache.org/jira/browse/JENA-2309
Right now I am investigating some long-standing performance issues where concurrent processing time does not scale directly with the number of cores. Concretely, I am comparing our Spark+Jena4-based tarql re-implementation with the original tarql (Jena 2).
One culprit is the jena-iri package, which uses synchronized singleton lexers that introduce locking overhead between the worker threads. A quick fix is to make those lexers thread-local, which reduces the overhead. On my notebook, in power-save and performance mode, I get these improvements:
jena-4.6.0-SNAPSHOT:
  power save: 68 sec
  performance: 21 sec
thread-local fix:
  power save: 54 sec
  performance: 19 sec
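The thread-local fix described above follows a standard pattern. This sketch uses an invented stand-in `Lexer` class rather than jena-iri's real lexer classes, but the transformation is the same: replace a synchronized shared singleton with one instance per worker thread.

```java
// Hypothetical sketch of the quick fix: a synchronized singleton lexer
// (every call contends on one lock) versus a per-thread lexer (no
// cross-thread locking). "Lexer" is a stand-in, not the jena-iri API.
public class ThreadLocalLexer {
    static class Lexer {                        // stand-in for the real lexer
        int parse(String iri) { return iri.length(); }
    }

    // Before: one shared lexer, guarded by synchronization.
    private static final Lexer SHARED = new Lexer();
    static synchronized int parseShared(String iri) {
        return SHARED.parse(iri);               // all threads queue here
    }

    // After: one lexer per worker thread, no lock between threads.
    private static final ThreadLocal<Lexer> LOCAL =
            ThreadLocal.withInitial(Lexer::new);
    static int parseLocal(String iri) {
        return LOCAL.get().parse(iri);          // no contention
    }

    public static void main(String[] args) {
        // both variants produce identical results; only locking differs
        System.out.println(parseShared("http://example.org/a"));
        System.out.println(parseLocal("http://example.org/a"));
    }
}
```

The cost is one lexer instance per thread (extra memory, and stale instances if threads are short-lived), which is usually negligible next to the contention it removes.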
Profiler output (relevant column is the number of waits): [screenshot not reproduced here]
A related issue I am currently investigating is that a lot of time is spent in the IRI parsing machinery, e.g. via E_IRI. For testing, I changed it to return the argument as given, which reduced the total processing time (in performance mode) from 19 to 13 seconds, around 30%; that time is predominantly spent in the jena-iri lexers. I am not yet sure, however, whether anything can be optimized there without compromising functionality.
Are you interested in making a pull request?
Yes