Merge pull request #1 from ahalterman/master
Expand documentation
scotthaleen committed Jan 7, 2015
2 parents 624d59a + 7cb3529 commit 92f6c79
Showing 1 changed file with 30 additions and 17 deletions.
47 changes: 30 additions & 17 deletions README.md
@@ -1,34 +1,47 @@
# mitie-trainer

Model Training tool for [MITIE](https://github.com/mitll/MITIE)
An interactive, browser-based model training tool for
[MITIE](https://github.com/mit-nlp/MITIE). The MIT Information Extraction tool
provides fast and easily trained named entity recognition (NER) and binary relation
extraction abilities and is free for both noncommercial and commercial use.
This package is a browser-based wrapper on the training tool, allowing for
faster tagging of training data for input into MITIE.


### Setup

- Download [MITIE-models-v0.2.tar.bz2](http://sourceforge.net/projects/mitie/files/binaries/MITIE-models-v0.2.tar.bz2)
- extract `tar -xjf MITIE-models-v0.2.tar.bz2`
- move the **MITIE-models/english/total_word_feature_extractor.dat** to **html/data/models/**

- If it's not already present, install Tangelo, a Python framework used to
communicate between the browser and the backend. You can `pip install
tangelo` or [read the Tangelo
docs](http://tangelo.readthedocs.org/en/v0.8/installation.html) for more
details.
- Download the MITIE models:
[MITIE-models-v0.2.tar.bz2](http://sourceforge.net/projects/mitie/files/binaries/MITIE-models-v0.2.tar.bz2)
- Extract the models: `tar -xjf MITIE-models-v0.2.tar.bz2`
- Move the **MITIE-models/english/total_word_feature_extractor.dat** to
**html/data/models/**
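
The extract-and-move steps above can also be scripted. The following is a minimal sketch, assuming the archive has already been downloaded and that it unpacks to the `MITIE-models/english/` layout named in the steps. The function name `install_feature_extractor` and its arguments are hypothetical and are not part of mitie-trainer.

```python
import shutil
import tarfile
from pathlib import Path

# Path inside the archive, as named in the setup steps.
EXTRACTOR = Path("MITIE-models/english/total_word_feature_extractor.dat")


def install_feature_extractor(archive, repo_root):
    """Unpack the models archive and move the feature extractor into html/data/models/.

    Hypothetical helper: `archive` is the downloaded .tar.bz2 file and
    `repo_root` is the mitie-trainer checkout.
    """
    archive = Path(archive)
    target_dir = Path(repo_root) / "html" / "data" / "models"
    target_dir.mkdir(parents=True, exist_ok=True)

    # Equivalent of `tar -xjf`: unpack next to the archive.
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(archive.parent)

    source = archive.parent / EXTRACTOR
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found after extraction")

    destination = target_dir / source.name
    shutil.move(str(source), str(destination))
    return destination
```

Calling `install_feature_extractor("MITIE-models-v0.2.tar.bz2", "/path/to/mitie-trainer")` would leave the `.dat` file under **html/data/models/**, where the setup steps place it.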

### Data

Create your own training sample

You can create your own training samples by running a tsv file through
the **src/sample_format.py**

By default the script expects the format of **ID\tTEXT_BODY** for each
row.

Run the script like below
You should structure your training data in a tab-separated file (in the form
`ID\tTEXT_BODY` for each row). Run this TSV through the formatting script,
**src/sample_format.py**, to convert it into the JSON that the trainer expects.
If your TSV of IDs and stories is called `output.tsv` and is located in the
mitie-trainer directory, create the JSON like this:

`cat output.tsv | ./src/sample_format.py > sample.json`

Place your **sample.json** file at **html/data/trainings/sample/**

Start tangelo with **html/** as the root directory
Then place the **sample.json** file at **html/data/trainings/sample/**
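
If your stories are not yet in a TSV, a short script can produce the `ID\tTEXT_BODY` rows described above. This is a minimal sketch, not part of mitie-trainer: the function name `write_tsv` is hypothetical, and it assumes each story must sit on a single row with exactly one tab. Whether `sample_format.py` tolerates embedded tabs or newlines is not stated here, so the sketch collapses them to spaces.

```python
import sys


def write_tsv(stories, out):
    """Write (id, text) pairs as one ID<TAB>TEXT_BODY row per line."""
    for story_id, text in stories:
        # Collapse tabs, newlines and repeated spaces so the row keeps
        # exactly one tab and stays on a single line.
        body = " ".join(str(text).split())
        out.write(f"{story_id}\t{body}\n")


if __name__ == "__main__":
    example = [
        ("1", "The first story\thas a stray tab."),
        ("2", "The second story\nspans two lines."),
    ]
    write_tsv(example, sys.stdout)
```

Redirecting the script's output to `output.tsv` gives a file that can be piped into the formatting script as shown above.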

Start Tangelo with **html/** as the root directory from the command line:

`tangelo start --root /path/to/mitie-trainer/html`

In your browser, navigate to the address where Tangelo is running (the default
is 0.0.0.0:8080) and begin using the trainer.

The model can be trained from the browser, or the tagged training data can be
exported as JSON and added to the model using the [Python
bindings](https://github.com/mit-nlp/MITIE/blob/master/examples/python/train_ner.py)
that come with MITIE.
