
Documentation for LinksetEvaluator

Aklakan edited this page Aug 19, 2011 · 5 revisions

This tool compares two linksets. It is implemented as a MapReduce job and can run on a cluster. It is invoked from the command line, similarly to Silk MapReduce. It can be used to calculate statistics (precision, recall) and to compute the differences between versions of a linkset.

Input

  1. Two linkset files in the Hadoop file system (HDFS)
  2. A parameter that states the relationship between the two files

A linkset file is an N-Triples file containing RDF links.

The relationship can be one of the following:

  • file 1 is the linkset to be evaluated; file 2 is a complete set of reference links
  • file 1 is the linkset to be evaluated; file 2 is a sample of reference links
  • file 1 is the linkset to be evaluated; file 2 is an older version of the linkset

Output

  1. statistics
    • total triples in each file
    • how many triples occur in both files
    • how many triples occur only in file 1
    • how many triples occur only in file 2
    • (if file 2 is a complete set of reference links) precision and recall
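
To make the statistics concrete, here is a minimal in-memory sketch of how they relate to each other. It is not the MapReduce implementation: the class name `LinksetStats` and its fields are hypothetical, and it assumes each file has already been loaded as a set of distinct triples, one string per triple.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * In-memory sketch of the statistics listed above. Illustration only:
 * the class and field names are hypothetical, and the real tool computes
 * these counts as a MapReduce job.
 */
class LinksetStats {
    final long size1;    // total triples in file 1
    final long size2;    // total triples in file 2
    final long overlap;  // triples that occur in both files
    final long onlyIn1;  // triples that occur only in file 1
    final long onlyIn2;  // triples that occur only in file 2

    /** Each set holds one string per distinct triple (e.g. an N-Triples line). */
    LinksetStats(Set<String> file1, Set<String> file2) {
        Set<String> common = new HashSet<>(file1);
        common.retainAll(file2); // set intersection
        size1 = file1.size();
        size2 = file2.size();
        overlap = common.size();
        onlyIn1 = size1 - overlap;
        onlyIn2 = size2 - overlap;
    }

    /** Fraction of the triples in file 1 that also occur in file 2. */
    double precision() {
        return size1 == 0 ? Double.NaN : (double) overlap / size1;
    }

    /** Fraction of the triples in file 2 that also occur in file 1. */
    double recall() {
        return size2 == 0 ? Double.NaN : (double) overlap / size2;
    }
}
```

Recall computed this way is only meaningful when file 2 is a complete set of reference links, which is why the list above makes it conditional.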

##Building and launching##

The Java project is located at

latc-platform/linkqa

It can be built by running

mvn assembly:assembly

which generates the file

latc-platform/linkqa/target/linkqa-jar-with-dependencies.jar

Launching with Hadoop (assuming an HDFS instance at localhost:54310):

./hadoop jar linkqa-jar-with-dependencies.jar eu.latc.linkqa.LinksetEvaluator http://example.org/test-run/some-id hdfs://localhost:54310/user/dummy/linkset.nt hdfs://localhost:54310/user/dummy/refset.nt 

Output (excerpt): (TODO: Factor out the namespaces)

<http://example.org/test-run/some-id>
    <http://qa.linkeddata.org/ontology/linkset> <hdfs://localhost:54310/user/raven/linkset.nt> ;
    <http://qa.linkeddata.org/ontology/linksetSize> "8"^^<http://www.w3.org/2001/XMLSchema#long> ;
    <http://qa.linkeddata.org/ontology/overlapSize> "5"^^<http://www.w3.org/2001/XMLSchema#long> ;
    <http://qa.linkeddata.org/ontology/referenceset> <hdfs://localhost:54310/user/raven/refset.nt> ;
    <http://qa.linkeddata.org/ontology/refsetSize> "10"^^<http://www.w3.org/2001/XMLSchema#long> ;
    <http://www.atl.external.lmco.com/projects/ontology/ontologies/core/alignment/AlignmentEvaluation.n3#precision> "0.625"^^<http://www.w3.org/2001/XMLSchema#double> ;

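As a check on the excerpt, the reported precision follows directly from the two counts it contains. Recall does not appear in the excerpt; the second line assumes the usual definition (overlap divided by the size of the reference set) and a complete reference set.

```latex
\begin{align*}
\text{precision} &= \frac{\text{overlapSize}}{\text{linksetSize}} = \frac{5}{8} = 0.625 \\
\text{recall} &= \frac{\text{overlapSize}}{\text{refsetSize}} = \frac{5}{10} = 0.5
\end{align*}
```
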
##Discussion##

  • A disadvantage of the output is that the file URIs are probably not the ones that should eventually be published, so a search/replace is likely needed before publishing.

  • Currently, the process URI (which becomes the subject of the RDF output) is the first parameter. It would probably be better to make it an optional last parameter, since it can be autogenerated if not given.

  • The switch for turning precision/recall on or off in the output is currently missing.
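
The proposal to make the process URI an optional last parameter could look roughly like the sketch below. Everything in it is an assumption for illustration: the class `EvaluatorArgs`, the argument order, and the use of a random UUID under the `http://example.org/test-run/` base from the example invocation are not taken from the existing code.

```java
import java.util.UUID;

/** Hypothetical argument holder; not part of the actual LinksetEvaluator code. */
class EvaluatorArgs {
    // Assumed base for generated URIs, borrowed from the example invocation.
    static final String DEFAULT_BASE = "http://example.org/test-run/";

    final String linksetPath;
    final String refsetPath;
    final String processUri;

    private EvaluatorArgs(String linksetPath, String refsetPath, String processUri) {
        this.linksetPath = linksetPath;
        this.refsetPath = refsetPath;
        this.processUri = processUri;
    }

    /** Expects: linkset path, reference set path, and optionally a process URI. */
    static EvaluatorArgs parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException("usage: <linkset> <refset> [process-uri]");
        }
        // Generate a fresh URI when the caller does not supply one.
        String uri = args.length == 3 ? args[2] : DEFAULT_BASE + UUID.randomUUID();
        return new EvaluatorArgs(args[0], args[1], uri);
    }
}
```

With this ordering, the two file paths stay in fixed positions and existing scripts only need to move the URI to the end or drop it.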