
Usage: command line API


This page gives a fine-grained, step-by-step example of how to use the command-line API, with explanations, for both a typical natural language use-case and a code use-case. Here, I assume a 'train' and a 'test' directory containing your data. These examples are analogous to the library API examples in the code base.

A typical natural language use-case:

(think: closed vocabularies, modeling each line separately, etc.).

  • First some common flags for lexing:
    • -l simple tells the jar to use the simple (default) lexer, which splits on whitespace and punctuation, preserving punctuation as separate tokens. No specific NLP lexers are included, but see footnote 1 to use your own;
    • --delims tells it to add start-of-sentence and end-of-sentence markers;
    • --per-line tells the lexer and model to treat each line as a sentence. This is common for natural language (but not for code).
  • Create a vocabulary for the train data, throwing away all words seen only once, and write to train.vocab:
        java -jar SLP-Core_v0.2.jar vocabulary train train.vocab --unk-cutoff 2 -l simple --delims
    • vocabulary mode to create the vocabulary, takes two positional arguments: in-directory and out-file path
    • --unk-cutoff (or -u) sets the minimum number of times a token must be seen to be preserved
    • -l sets the lexer (also used below), in this case to simple, and we add --delims (both explained above)
  • Train the model using 4-grams and the previously constructed vocabulary and write the counter to file:
        java -jar SLP-Core_v0.2.jar train --train train --counter train.counts --vocabulary train.vocab --closed --order 4 --per-line -l simple --delims
    • train mode to train the model; --train specifies the train path
    • --counter sets the counter output file
    • --vocabulary (or -v) loads the previously constructed vocabulary
    • --closed closes this vocabulary so no new words are added
    • --order (or -o) sets the n-gram order to use. Note that we do not need to specify a model since we are just storing counts
    • --per-line, -l and --delims as above, for lexing.
  • Test the model using the vocabulary from above and ADM smoothing:
        java -jar SLP-Core_v0.2.jar test --test test --counter train.counts -v train.vocab --closed -o 4 --model adm --per-line -l simple --delims
    • test mode; --test specifies the test path
    • -v: vocabulary as above
    • --model (or -m): the model to use, in this case adm (absolute discounting), the best-performing option in this toolkit for natural language
  • Note that if we didn't want to store the counter, we could do the train and test steps in one go with the train-test mode (the full three-step pipeline is also sketched as a single script after this list):
        java -jar SLP-Core_v0.2.jar train-test --train train --test test -v train.vocab --closed -o 4 --model adm --per-line -l simple --delims
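
Putting the three steps together, here is a minimal shell sketch of the full natural-language pipeline. It assumes the train and test directories sit next to the jar; train.vocab and train.counts are the artifacts handed from one step to the next:

    # 1. Build the vocabulary (tokens seen at least twice), written to train.vocab
    java -jar SLP-Core_v0.2.jar vocabulary train train.vocab --unk-cutoff 2 -l simple --delims
    # 2. Count 4-grams over the training data with that vocabulary (kept closed), written to train.counts
    java -jar SLP-Core_v0.2.jar train --train train --counter train.counts --vocabulary train.vocab --closed --order 4 --per-line -l simple --delims
    # 3. Evaluate on the test data with ADM smoothing, reusing both artifacts
    java -jar SLP-Core_v0.2.jar test --test test --counter train.counts -v train.vocab --closed -o 4 --model adm --per-line -l simple --delims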

A typical Java use-case:

The Java example shows off some more advanced options that make more sense given the nested nature of code. It doesn't need to make a vocabulary or close it, since all tokens are relevant, but it does use cache components and nested modeling. We also demonstrate the new --giga option, which uses a counter that is far more efficient for very large (giga-token) corpora:

  • Train the model with 6-grams and write to file:
        java -jar SLP-Core_v0.2.jar train --train train --counter train.counts --vocabulary train.vocab --order 6 -l java --delims --giga
    • --vocabulary (or -v): we still specify the vocabulary path, but now the vocabulary will be written there after training instead. We need to store it with the counter since the counter stores the words translated to indices in the vocabulary.
    • --order (or -o) 6: 6-grams are too long for natural language (generally) but are much more powerful for source code.
    • --language (or -l): a Java lexer is included in the package. No other programming languages are supported at present, but pre-tokenized data (e.g. lexed with Pygments) can be used (see footnote 1).
    • --giga: use the giga-corpus counter (assuming you are using a lot of data, otherwise no need) to speed up counting of very large corpora.
    • Note the absence of the --per-line flag: Java is lexed per file only
  • Test the model with JM smoothing:
        java -jar SLP-Core_v0.2.jar test --test test --counter train.counts --vocabulary train.vocab -o 6 --model jm --cache --nested -l java --delims
    • Note: no --closed: we leave the vocabulary completely open.
    • --model (or -m): use Jelinek-Mercer smoothing as the model; this works much better for source code than for natural language and is 'lighter' to count
    • --cache (or -c): add a file-cache component. Boosts modeling scores for code especially.
    • --nested (or -n): build a recursively nested model (using same smoothing) on the test corpus centered around the file to be tested. This gives very high modeling accuracies by prioritizing localities, especially combined with the cache.
  • Again, if you don't want to store the counts, consider using the train-test mode (a sketch of when storing the counter pays off follows this list):
        java -jar SLP-Core_v0.2.jar train-test --train train --test test -o 6 -m jm -c -n -l java --delims
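
Storing the counter pays off when you want to evaluate the same counts more than once. Here is a sketch, reusing the commands above with the same assumed directories; only the --model flag differs between the two test runs:

    # Count 6-grams once (the giga counter speeds this up for very large corpora)
    java -jar SLP-Core_v0.2.jar train --train train --counter train.counts --vocabulary train.vocab --order 6 -l java --delims --giga
    # Evaluate the stored counts with JM smoothing...
    java -jar SLP-Core_v0.2.jar test --test test --counter train.counts --vocabulary train.vocab -o 6 --model jm --cache --nested -l java --delims
    # ...and again with ADM smoothing, without re-counting the training data
    java -jar SLP-Core_v0.2.jar test --test test --counter train.counts --vocabulary train.vocab -o 6 --model adm --cache --nested -l java --delims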

Other options:
The package allows you to lex the text ahead of time, with the same lexing options as before, and then read the lexed text back in using -l tokens. Lexed text is written to file as tab-separated tokens, with the same line breaks as the original file. This can also be used to pre-lex your text with whatever lexer you prefer (e.g. Pygments) and then read it into this package (a combined sketch follows the two commands below).

  • To lex a corpus to a parallel directory in lexed format (two positional arguments: source path, target path):
        java -jar SLP-Core_v0.2.jar lex train train-lexed -l java --delims
  • To do the same, except also translate the tokens to indices in a vocabulary and store that vocabulary as well (e.g. to compare with another toolkit without risking that it lexes your tokens differently):
        java -jar SLP-Core_v0.2.jar lex-ix train train-lexed-ix -v train.vocab -l java --delims
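
For example, a large corpus can be lexed once and the lexed copies reused in later runs. This is a sketch under the assumption that lexing with --delims writes the sentence markers into the output files, so the flag is not repeated when reading the tokens back in:

    # Lex both corpora once to parallel directories of tab-separated tokens
    java -jar SLP-Core_v0.2.jar lex train train-lexed -l java --delims
    java -jar SLP-Core_v0.2.jar lex test test-lexed -l java --delims
    # Later runs skip the Java lexer and simply split on tabs
    java -jar SLP-Core_v0.2.jar train-test --train train-lexed --test test-lexed -o 6 -m jm -c -n -l tokens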

As a final note, if you have no train/test split but want to get 'self-entropy' (similar to 10-fold cross-validation), simply specify the same path for both train and test (or set the explicit flag -s). This will be interpreted as 'self-testing' by the model: it will 'forget' every sequence right before modeling it and 'relearn' it afterwards. This way, you get accurate, 'infinite-fold' cross-validation!
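
For instance, to self-test a single Java corpus in a directory named data (a hypothetical name), point both --train and --test at it:

    # Same path for train and test triggers 'self-testing' (equivalently, add the explicit -s flag)
    java -jar SLP-Core_v0.2.jar train-test --train data --test data -o 6 -m jm -c -n -l java --delims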

1 To use pre-tokenized text, write your tokens tab-separated and preserve the original line-breaks (if needed). Then, use the -l tokens lexer to read in your tokens by splitting on tabs.
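
As a tiny, hypothetical illustration of that format (the pretok directory and its contents are made up; printf is used only to make the tab characters explicit):

    # One pre-tokenized file; tokens are separated by real tab characters
    mkdir -p pretok
    printf 'def\tadd\t(\ta\t,\tb\t)\t:\n' > pretok/example.txt
    printf 'return\ta\t+\tb\n' >> pretok/example.txt
    # Read the tokens back in by splitting on tabs
    java -jar SLP-Core_v0.2.jar vocabulary pretok pretok.vocab -l tokens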
