Documentation for LinksetEvaluator
This tool compares two linksets. It is implemented as a MapReduce job and can run on a cluster. It is invoked from the command line in much the same way as Silk MapReduce. It can be used to calculate statistics (precision, recall) and to compute the differences between versions of a linkset.
Input
- Two linkset files in the Hadoop file system (HDFS)
- A parameter that states the relationship between the two files
A linkset file is an N-Triple file containing RDF links.
The relationship can be one of:
- file 1 is the linkset to be evaluated; file 2 is a complete set of reference links
- file 1 is the linkset to be evaluated; file 2 is a sample of reference links
- file 1 is the linkset to be evaluated; file 2 is an older version of the linkset
Output
- statistics
  - total triples in each file
  - how many triples occur in both files
  - how many triples occur only in file 1
  - how many triples occur only in file 2
  - (if file 2 is a complete set of reference links) precision and recall
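The statistics listed above can be illustrated with a small in-memory sketch. This is not the MapReduce job itself. The class name `LinksetStats` and its members are hypothetical, and the formulas (precision = overlap / size of file 1, recall = overlap / size of file 2) are an assumption, although they agree with the precision of 0.625 in the output excerpt further down (overlap 5, linkset size 8).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** In-memory illustration of the output statistics (hypothetical helper, not part of linkqa). */
final class LinksetStats {
    final long size1;    // distinct triples in file 1 (linkset to be evaluated)
    final long size2;    // distinct triples in file 2 (reference links or older version)
    final long overlap;  // triples that occur in both files
    final long onlyIn1;  // triples that occur only in file 1
    final long onlyIn2;  // triples that occur only in file 2

    LinksetStats(Set<String> file1, Set<String> file2) {
        Set<String> both = new HashSet<>(file1);
        both.retainAll(file2); // set intersection
        size1 = file1.size();
        size2 = file2.size();
        overlap = both.size();
        onlyIn1 = size1 - overlap;
        onlyIn2 = size2 - overlap;
    }

    /** Assumed definition: share of evaluated links that are also reference links. */
    double precision() {
        return size1 == 0 ? 0.0 : (double) overlap / size1;
    }

    /** Assumed definition; only meaningful if file 2 is a complete reference set. */
    double recall() {
        return size2 == 0 ? 0.0 : (double) overlap / size2;
    }

    /** Turns raw N-Triples lines into a set, skipping blank lines and '#' comments. */
    static Set<String> fromLines(List<String> lines) {
        Set<String> triples = new HashSet<>();
        for (String line : lines) {
            String t = line.trim();
            if (!t.isEmpty() && !t.startsWith("#")) {
                triples.add(t);
            }
        }
        return triples;
    }
}
```

Triples are compared here as trimmed strings, which is only a rough stand-in for real N-Triples parsing: two lines that differ in internal whitespace would count as different triples.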
##Building and launching##
The Java project is located at
latc-platform/linkqa
It can be built by running
mvn assembly:assembly
which generates the file
latc-platform/linkqa/target/linkqa-jar-with-dependencies.jar
Launching with Hadoop (assuming there is a Hadoop file system at localhost:54310):
./hadoop jar linkqa-jar-with-dependencies.jar eu.latc.linkqa.LinksetEvaluator http://example.org/test-run/some-id hdfs://localhost:54310/user/dummy/linkset.nt hdfs://localhost:54310/user/dummy/refset.nt
Output (excerpt): (TODO: Factor out the namespaces)
<http://example.org/test-run/some-id>
<http://qa.linkeddata.org/ontology/linkset> <hdfs://localhost:54310/user/raven/linkset.nt> ;
<http://qa.linkeddata.org/ontology/linksetSize> "8"^^<http://www.w3.org/2001/XMLSchema#long> ;
<http://qa.linkeddata.org/ontology/overlapSize> "5"^^<http://www.w3.org/2001/XMLSchema#long> ;
<http://qa.linkeddata.org/ontology/referenceset> <hdfs://localhost:54310/user/raven/refset.nt> ;
<http://qa.linkeddata.org/ontology/refsetSize> "10"^^<http://www.w3.org/2001/XMLSchema#long> ;
<http://www.atl.external.lmco.com/projects/ontology/ontologies/core/alignment/AlignmentEvaluation.n3#precision> "0.625"^^<http://www.w3.org/2001/XMLSchema#double> ;
##Discussion##
- One disadvantage of the output is that the file URIs it contains are probably not the ones that should eventually be published, so a search/replace will likely be needed before publishing.
- Currently, the process URI (which becomes the subject of the RDF output) is the first parameter. It would probably be better to make it an optional last parameter, since it can be autogenerated if not given.
- The switch for toggling precision/recall on and off in the output is currently missing.
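The proposal to make the process URI an optional last parameter could look roughly like the sketch below. It does not describe the current `LinksetEvaluator` code: the class `EvaluatorArgs`, the argument order, and the default prefix (borrowed from the example URI in the launch command) are all assumptions made for illustration.

```java
import java.util.UUID;

/** Hypothetical argument holder: <linkset> <refset> [process-uri]. */
final class EvaluatorArgs {
    // Assumed prefix for autogenerated process URIs, taken from the example above.
    static final String DEFAULT_PREFIX = "http://example.org/test-run/";

    final String linksetPath;
    final String refsetPath;
    final String processUri;

    private EvaluatorArgs(String linksetPath, String refsetPath, String processUri) {
        this.linksetPath = linksetPath;
        this.refsetPath = refsetPath;
        this.processUri = processUri;
    }

    static EvaluatorArgs parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException("usage: <linkset> <refset> [process-uri]");
        }
        // Use the given process URI if present, otherwise generate a fresh one.
        String uri = args.length == 3 ? args[2] : DEFAULT_PREFIX + UUID.randomUUID();
        return new EvaluatorArgs(args[0], args[1], uri);
    }
}
```

With this layout the two file paths always come first, so a call that omits the process URI still parses unambiguously.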