Commit a3bff3d

Adding version 1.0
git-svn-id: https://agiga.googlecode.com/svn/trunk@2 6da5e2a9-21dc-cb1e-a833-26ae97719531
1 parent ed25012 commit a3bff3d

20 files changed: +2835 -0 lines changed

LICENSE.txt

+339
Large diffs are not rendered by default.

README.txt

+107
@@ -0,0 +1,107 @@
Annotated Gigaword API and Command Line Tools v1.0 - July 21, 2012
------------------------------------------------------------------

This release includes a Java API and command line tools for reading
the Annotated Gigaword dataset XML files.

-------------------
Project Hosting :
-------------------

For the latest version, go to:
http://code.google.com/p/agiga

-------------------
Command Line Tools:
-------------------

The command line tools provide a convenient way to print
human-readable versions of the XML annotations. The entry point is
edu.jhu.agiga.AgigaPrinter, which has the following usage.

usage: java edu.jhu.agiga.AgigaPrinter <type> <gzipped input file>
  where <type> is one of:
    words                      (Words only, one sentence per line)
    lemmas                     (Lemmas only, one sentence per line)
    pos                        (Part-of-speech tags)
    ner                        (Named entity types)
    basic-deps                 (Basic dependency parses in CoNLL-X format)
    col-deps                   (Collapsed dependency parses in CoNLL-X format)
    col-ccproc-deps            (Collapsed and propagated dependency parses in CoNLL-X format)
    phrase-structure           (Phrase structure parses)
    coref                      (Coreference resolution as SGML similar to MUC)
    stanford-deps              (toString() methods of Stanford dependency parse annotations)
    stanford-phrase-structure  (toString() method of Stanford phrase structure parses)
    for-testing-only           (**For use in testing this API only**)
  and where <gzipped input file> is an .xml.gz file
  from Annotated Gigaword

For example, to print part-of-speech tags for the file
nyt_eng_199911.xml.gz, we could run:

    java -cp build/agiga-1.0.jar:lib/* edu.jhu.agiga.AgigaPrinter pos annotated_gigaword/nyt_eng_199911.xml.gz

-------------------
Java API :
-------------------

The Java API provides streaming access to the documents in the .xml.gz
files. Two iterators are provided: StreamingDocumentReader and
StreamingSentenceReader. Both of these take as input the path to an
Annotated Gigaword file and an AgigaPrefs object.

By default, the AgigaPrefs constructor will ensure that every
annotation in the XML is read in and that the resulting objects are
fully populated. However, by turning off certain options, it's
possible to skip the reading and creation of objects corresponding to
unused annotations.

StreamingDocumentReader is an iterator over AgigaDocument objects. The
AgigaDocument class gives access to the coreference resolution
annotations (via AgigaCoref objects) and the sentences (via
AgigaSentence objects).

StreamingSentenceReader is an iterator over AgigaSentence
objects. This bypasses the document-level annotations such as coref
and the document ids and provides direct access to the sentence
annotations only.

AgigaPrinter provides examples of how to use these iterators and set
the AgigaPrefs object so that only the necessary annotations are read.
Examples of how to use the Agiga objects can also be found in the
AgigaDocument.write* and AgigaSentence.write* methods.

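To illustrate the idea of streaming access with optional annotations,
here is a minimal, self-contained sketch. It is NOT the agiga
implementation and does not use the agiga classes: it relies only on
the JDK's StAX parser (javax.xml.stream) and java.util.zip, and it
needs Java 7 or later. The Prefs class and its readLemmas flag are
hypothetical stand-ins for AgigaPrefs. The element names "token",
"word" and "lemma" are the ones listed in AgigaConstants.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Illustrative sketch only; not part of the agiga API. */
final class StreamingSketch {

    /** Hypothetical stand-in for a preferences object such as AgigaPrefs. */
    static final class Prefs {
        boolean readLemmas = true;
    }

    /**
     * Streams through gzipped XML and returns one string per token:
     * "word/lemma" when lemmas are read, or just "word" when they are skipped.
     */
    static List<String> readTokens(InputStream gzipped, Prefs prefs) {
        List<String> tokens = new ArrayList<String>();
        try (InputStream in = new GZIPInputStream(gzipped)) {
            XMLStreamReader xml =
                    XMLInputFactory.newInstance().createXMLStreamReader(in, "UTF-8");
            String word = null;
            String lemma = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if ("word".equals(name)) {
                        word = xml.getElementText();
                    } else if ("lemma".equals(name) && prefs.readLemmas) {
                        // Only pay for this annotation when it was requested.
                        lemma = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "token".equals(xml.getLocalName())) {
                    tokens.add(lemma == null ? word : word + "/" + lemma);
                    word = null;
                    lemma = null;
                }
            }
            xml.close();
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException("Could not read gzipped XML", e);
        }
        return tokens;
    }

    /** Demo helper: gzips a string in memory so the sketch needs no input file. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new IllegalStateException("Could not gzip text", e);
        }
        return bytes.toByteArray();
    }
}
```

Because the parser pulls one event at a time, memory use stays small
even for large files, and skipping an annotation simply means never
asking for its text.
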
-----------------------
One- vs. Zero-Indexing:
-----------------------

In the XML, the sentences and tokens are given Ids that are
one-indexed. However, we find it more convenient to work with
zero-indexed **indices** in the Java API. Accordingly, the Java API
does not provide access to these original Ids but instead provides
access to indices. These indices are accessed via methods named
get*Idx(), such as AgigaSentence.getIdx() and
AgigaMention.getSentenceIdx() for sentences, or AgigaToken.getIdx()
and AgigaDependency.getGovIdx() for tokens. These indices also
correspond to the ordered elements in the Lists used throughout the API.

Of course, the original Ids from the XML can be recovered by adding
one to the indices in the API. However, we didn't want to confuse the
issue by providing API calls for both.

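The conversion between Ids and indices is a fixed offset of one in
each direction. A minimal sketch follows; the class and method names
are hypothetical helpers, not part of the agiga API, and the sketch
covers ordinary sentence and token Ids only (it does not model any
special values the API may use).

```java
/** Hypothetical helper; not part of the agiga API. */
final class IndexConversionSketch {

    private IndexConversionSketch() {
        // static helpers only
    }

    /** Converts a one-indexed Id from the XML to a zero-indexed index. */
    static int idToIdx(int xmlId) {
        if (xmlId < 1) {
            throw new IllegalArgumentException("XML Ids start at 1: " + xmlId);
        }
        return xmlId - 1;
    }

    /** Recovers the one-indexed XML Id from a zero-indexed index. */
    static int idxToId(int idx) {
        if (idx < 0) {
            throw new IllegalArgumentException("Indices start at 0: " + idx);
        }
        return idx + 1;
    }
}
```

For example, the token with Id 1 in the XML sits at index 0 of its
sentence's token List.
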
-------------------
Building :
-------------------

A build.xml is provided for building with Apache Ant. Example
commands are below and should be run from the top-level directory that
contains the build.xml.

    # To compile:
    ant

    # To clean and compile:
    ant clean compile

    # To build jars of classes and sources:
    ant jar

build.xml

+79
@@ -0,0 +1,79 @@
<?xml version="1.0"?>
<project name="agiga" default="compile" basedir=".">

    <property name="classes.path" value="${basedir}/classes" />
    <property name="source.path" value="${basedir}/src" />
    <property name="build.path" value="${basedir}/build" />
    <property name="build.path" value="${basedir}/build" />
    <property name="version" value="1.0" />
    <property name="app.jar.path" value="${build.path}/agiga-${version}.jar" />
    <property name="source.jar.path" value="${build.path}/agiga-sources-${version}.jar" />

    <property name="compile.debug" value="true"/>
    <property name="compile.deprecation" value="false"/>
    <property name="compile.optimize" value="true"/>
    <property name="compile.source" value="1.6" />
    <property name="compile.target" value="1.6" />
    <property name="compile.encoding" value="utf-8" />

    <target name="classpath" description="Sets the classpath">
        <echo message="${ant.project.name}" />
        <path id="classpath">
            <fileset dir="${basedir}/lib">
                <include name="*.jar"/>
            </fileset>
        </path>
        <path id="classes">
            <pathelement location="${classes.path}" />
        </path>
    </target>

    <target name="clean" description="Delete built files">
        <echo message="${ant.project.name}" />
        <delete includeemptydirs="true" failonerror="false">
            <fileset dir="${classes.path}/" includes="**/*"/>
            <fileset dir="${build.path}/" includes="**/*"/>
        </delete>
    </target>

    <target name="build-dir" description="Create build output directories">
        <echo message="${ant.project.name}" />
        <mkdir dir="${classes.path}" />
        <mkdir dir="${build.path}" />
    </target>

    <target name="compile" depends="classpath,build-dir"
            description="Compile source files">
        <echo message="${ant.project.name}" />
        <javac srcdir="${source.path}"
               destdir="${classes.path}"
               debug="${compile.debug}"
               encoding="utf-8"
               deprecation="${compile.deprecation}"
               optimize="${compile.optimize}"
               source="${compile.source}"
               target="${compile.target}"
               includeantruntime="false">
            <classpath refid="classpath" />
            <compilerarg value="-Xlint"/>
        </javac>
    </target>

    <target name="jar" depends="compile"
            description="Creates jar files of the classes and sources">
        <echo message="${ant.project.name}" />
        <jar destfile="${app.jar.path}">
            <fileset dir="${classes.path}"
                     excludes="**/*Test.class"/>
        </jar>
        <jar destfile="${source.jar.path}">
            <fileset dir="${source.path}"
                     includes="**/*.java"
                     excludes="**/*Test.java"/>
        </jar>
    </target>

    <target name="all" depends="clean,compile"
            description="Clean and re-compile." />

</project>

setupenv.sh

+2
@@ -0,0 +1,2 @@
ROOT_DIR=$(pwd)
export CLASSPATH="$ROOT_DIR/classes:$ROOT_DIR/lib/*"

src/edu/jhu/agiga/AgigaConstants.java

+71
@@ -0,0 +1,71 @@
package edu.jhu.agiga;

/**
 * This class contains the names of the XML tags and attributes used in
 * Annotated Gigaword .xml.gz files.
 *
 * @author mgormley
 *
 */
public class AgigaConstants {

    // XML Dependency Parse Tag names
    public enum DependencyForm {
        BASIC_DEPS("basic-dependencies"),
        COL_DEPS("collapsed-dependencies"),
        COL_CCPROC_DEPS("collapsed-ccprocessed-dependencies");

        private String xmlTag;

        private DependencyForm(String xmlTag) {
            this.xmlTag = xmlTag;
        }

        public String getXmlTag() {
            return xmlTag;
        }
    }

    // XML Tag names
    public static final String FILE = "FILE";
    public static final String FILE_ID = "id";

    public static final String DOC = "DOC";

    public static final String SENTENCES = "sentences";
    public static final String SENTENCE = "sentence";
    public static final String TOKEN = "token";
    public static final String TOKEN_ID = "id";
    public static final String WORD = "word";
    public static final String LEMMA = "lemma";
    public static final String POS = "POS";
    public static final String NER = "NER";
    public static final String NORM_NER = "NormNER";
    public static final String PARSE = "parse";
    public static final String DEP = "dep";
    public static final String DEP_TYPE = "type";
    public static final String GOVERNOR = "governor";
    public static final String DEPENDENT = "dependent";

    public static final String COREFERENCES = "coreferences";
    public static final String COREFERENCE = "coreference";
    public static final String MENTION = "mention";
    public static final String M_SENTENCE = "sentence";
    public static final String START = "start";
    public static final String END = "end";
    public static final String HEAD = "head";

    // XML Attribute names
    public static final String DOC_ID = "id";
    public static final String DOC_TYPE = "type";

    public static final String CHARACTER_OFFSET_BEGIN = "CharacterOffsetBegin";
    public static final String CHARACTER_OFFSET_END = "CharacterOffsetEnd";

    public static final String MENTION_REPRESENTATIVE = "representative";

    private AgigaConstants() {
        // private constructor
    }

}

src/edu/jhu/agiga/AgigaCoref.java

+30
@@ -0,0 +1,30 @@
package edu.jhu.agiga;

import java.util.ArrayList;
import java.util.List;

/**
 * Each AgigaCoref object provides access to all the mentions of a single entity
 * in a document. These coreference resolution annotations are represented as a
 * list of coref mentions, or AgigaMention objects.
 *
 * @author mgormley
 *
 */
public class AgigaCoref {

    private List<AgigaMention> mentions;

    public AgigaCoref() {
        this.mentions = new ArrayList<AgigaMention>();
    }

    public List<AgigaMention> getMentions() {
        return mentions;
    }

    public void add(AgigaMention mention) {
        mentions.add(mention);
    }

}
