Merge pull request #1 from ahalterman/master
Expand documentation
Showing 1 changed file with 30 additions and 17 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,34 +1,47 @@ | ||
# mitie-trainer | ||
|
||
Model Training tool for [MITIE](https://github.com/mitll/MITIE) | ||
An interactive, browser-based model training tool for | ||
[MITIE](https://github.com/mit-nlp/MITIE). The MIT Information Extraction tool | ||
provides fast and easily trained named entity recognition (NER) and binary relation | ||
extraction abilities and is free for both noncommercial and commercial use. | ||
This package is a browser-based wrapper on the training tool, allowing for | ||
faster tagging of training data for input into MITIE. | ||
|
||
|
||
### Setup | ||
|
||
- Download [MITIE-models-v0.2.tar.bz2](http://sourceforge.net/projects/mitie/files/binaries/MITIE-models-v0.2.tar.bz2) | ||
- extract `tar -xjf MITIE-models-v0.2.tar.bz2` | ||
- move the **MITIE-models/english/total_word_feature_extractor.dat** to **html/data/models/** | ||
|
||
- If it's not already present, install Tangelo, a Python framework used to | ||
communicate between the browser and the backend. You can `pip install | ||
tangelo` or [read the Tangelo | ||
docs](http://tangelo.readthedocs.org/en/v0.8/installation.html) for more | ||
details. | ||
- Download the MITIE models: | ||
[MITIE-models-v0.2.tar.bz2](http://sourceforge.net/projects/mitie/files/binaries/MITIE-models-v0.2.tar.bz2) | ||
- Extract the models: `tar -xjf MITIE-models-v0.2.tar.bz2` | ||
- Move the **MITIE-models/english/total_word_feature_extractor.dat** to | ||
**html/data/models/** | ||
|
||
### Data | ||
|
||
Create your own training sample | ||
|
||
You can create your own training samples by running a tsv file through | ||
the **src/sample_format.py** | ||
|
||
By default the script expects the format of **ID\tTEXT_BODY** for each | ||
row. | ||
|
||
Run the script like below | ||
Structure your training data as a tab-separated file, with one `ID\tTEXT_BODY` | ||
pair per row. Run this TSV through the formatting script in **src/** to convert | ||
it into the JSON that the trainer expects. For example, if your TSV of IDs and | ||
stories is called `output.tsv` and sits in the mitie-trainer directory, generate | ||
the JSON like this: | ||
|
||
`cat output.tsv | ./src/sample_format.py > sample.json` | ||
|
||
Place your **sample.json** file at **html/data/trainings/sample/** | ||
|
||
Start tangelo with **html/** as the root directory | ||
Then place the **sample.json** file at **html/data/trainings/sample/** | ||
|
||
Start Tangelo with **html/** as the root directory from the command line: | ||
|
||
`tangelo start --root /path/to/mitie-trainer/html` | ||
|
||
In your browser, navigate to the address where Tangelo is running (the default | ||
is 0.0.0.0:8080) and begin using the trainer. | ||
|
||
The model can be trained directly from the browser. Alternatively, the tagged | ||
training data can be exported as JSON and added to the model using the [Python | ||
bindings](https://github.com/mit-nlp/MITIE/blob/master/examples/python/train_ner.py) | ||
that ship with MITIE. | ||
|
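The Data step in the README expects a tab-separated file with one `ID\tTEXT_BODY` pair per row. The following is a minimal Python sketch of one way to produce such a file from documents held in memory. The function name `write_tsv` and the example records are hypothetical and not part of mitie-trainer. The sketch collapses tabs and newlines inside each text body to single spaces so that every record stays on one row; whether `src/sample_format.py` needs this normalization is an assumption, and IDs are assumed to contain no tabs.

```python
import io


def write_tsv(records, handle):
    """Write (doc_id, text) pairs as ID<TAB>TEXT_BODY rows, one per line.

    `write_tsv` is a hypothetical helper, not part of mitie-trainer.
    """
    for doc_id, text in records:
        # Collapse every run of whitespace (including tabs and newlines)
        # to a single space so the text body cannot break the row layout.
        clean_text = " ".join(str(text).split())
        handle.write(f"{doc_id}\t{clean_text}\n")


# Hypothetical example documents; replace with your own IDs and stories.
records = [
    ("doc1", "Barack Obama visited\tBerlin."),
    ("doc2", "MITIE was developed\nat MIT."),
]

# Write to an in-memory buffer here; pass an open file handle instead
# to produce a file on disk.
buffer = io.StringIO()
write_tsv(records, buffer)
tsv_text = buffer.getvalue()
```

To create the `output.tsv` used in the README's example command, open the file with `open("output.tsv", "w", encoding="utf-8")` and pass the handle to `write_tsv` in place of the buffer.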