Triage Application User Manual
Table of Contents generated with DocToc
- Pre-Installation Requirements
- Installation
- Organization of the Triage System
- Using the Triage System
- Starting the Triage Web Application Server
- Accessing the Triage Web App
- Stopping the Triage Web App Server
We present here a user manual for running and maintaining a web-based system for performing document triage on a corpus of PDF files. We describe the processes for installing, running and maintaining the system.
Note that this system is provided with no warranty or guarantee.
- MySQL 5.1 (http://dev.mysql.com/downloads/mysql/5.1.html)
- SwfTools (http://wiki.swftools.org/wiki/Installation)
The Server:
- Must own a port number to process http requests from client web browsers.
- Must be able to send http requests to http://eutils.ncbi.nlm.nih.gov (PubMed's eCitation services).
- Must be able to log in to MySQL with a user-defined login that has privileges to create (and destroy) databases.
- How to manage MySQL Users: http://dev.mysql.com/doc/mysql-security-excerpt/5.1/en/user-account-management.html
This system is provided as a \*.tar.gz archive for Unix and Linux systems,
a \*.dmg installer for Macs and a \*.exe installer for PCs.
All packages are available for download from http://bmkeg.s3-website-us-west-2.amazonaws.com/index.html#triage
The triage task is primarily concerned with sorting documents from an input document set assigned to a curator (called a 'triage corpus') into a set of categories, where each category designates a set of documents called a 'target corpus'. The data item that assigns each scientific article from a triage corpus to its target corpus is its 'in-out-code', which can take one of three values: 'in', 'out' or 'unclassified'.
This simple construct forms the basis of the system and provides a relatively straightforward way to attach additional cues and information about each article's possible inclusion in a target corpus based on NLP analysis of the document's contents.
Each PDF file being processed should have a name that starts with its PubMed ID, followed by an underscore and a single letter denoting the target corpus in which it is to be included. Some examples of possible filenames are as follows:
19763139_A.pdf 19911007_AG.pdf 21470346.pdf
The first filename indicates that the article with the PubMed ID 19763139 is a member of the target corpus denoted by the code 'A'. These codes are set when you create the target corpus.
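The naming convention above can be sketched as a small parser. This is only an illustration and not part of the triage system: the function name `parse_coded_filename` and the regular expression are hypothetical, and the sketch assumes that a name such as 19911007_AG.pdf carries two single-letter codes ('A' and 'G') and that a name with no underscore carries no codes.

```python
import re
from typing import FrozenSet, NamedTuple

# Hypothetical pattern: digits (the PubMed ID), optionally followed by an
# underscore and one or more letters, then the ".pdf" extension.
_FILENAME_RE = re.compile(r"^(\d+)(?:_([A-Za-z]+))?\.pdf$")


class CodedFilename(NamedTuple):
    pmid: str
    codes: FrozenSet[str]


def parse_coded_filename(filename: str) -> CodedFilename:
    """Split a PDF file name into its PubMed ID and target corpus codes.

    Assumption: every letter after the underscore is a separate
    single-letter target corpus code.
    """
    match = _FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a coded PDF filename: {filename!r}")
    pmid = match.group(1)
    letters = match.group(2) or ""
    # frozenset over a string yields the set of its individual characters.
    return CodedFilename(pmid=pmid, codes=frozenset(letters))
```

Under these assumptions, the three example names would parse to the PubMed IDs 19763139, 19911007 and 21470346 with the code sets {A}, {A, G} and the empty set respectively.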
Once installed, the system permits two modes of use: (A) using command-line tools to administer the server and to upload and process scientific articles into corpora, and (B) using the web interface to assign articles to specific corpora.
The system uses the pdf2swf command to convert PDF files to SWF files for display in
FlexPaper (http://flexpaper.devaldi.com/). We therefore have to tell the system where to find
the pdf2swf executable with the following command.
setSwfToolsBinDirectory </path/to/pdf2swf/executable>
This step has to be executed only once.
The triage system stores all of its data in a MySQL database. This includes article content, classification codes, and how the articles are organized into collections (corpora). Before using the triage system you need to create a triage database using one of the triage commands pre-installed on your system.
To execute this command you must choose a database name and supply the login name and password of an existing MySQL user with suitable permissions.
buildTriageDatabase -db <name-of-database> -l <login> -p <password>
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
Removing a database from the system involves a similar command to the one creating it.
destroyTriageDatabase -db <name-of-database> -l <login> -p <password>
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
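Because both commands take the same three connection options, a script that creates and tears down databases can share one helper. The following is a minimal sketch, not part of the distribution: the helper names `triage_command` and `run_triage_command` are hypothetical, and the sketch assumes the triage commands are on the `PATH` and accept options in any order.

```python
import subprocess
from typing import List


def triage_command(command: str, database: str, login: str,
                   password: str, *extra: str) -> List[str]:
    """Build an argument list for a triage command-line tool.

    Only the -db, -l and -p options documented above are added; any
    command-specific options are passed through unchanged in `extra`.
    """
    return [command, *extra, "-db", database, "-l", login, "-p", password]


def run_triage_command(argv: List[str]) -> None:
    """Run a command built by triage_command, raising on a non-zero exit."""
    # Passing a list (not a shell string) avoids quoting problems in
    # names and passwords.
    subprocess.run(argv, check=True)
```

For example, `run_triage_command(triage_command("buildTriageDatabase", "triage_db", "curator", "secret"))` would invoke the creation command with placeholder credentials; swapping in `destroyTriageDatabase` removes the same database.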
The first step in running the system is to build the target corpora into which the triaged articles will be sorted.
The following example would create a corpus named 'GO', owned by a user called 'Rocky', with the single-letter code 'G'. A target corpus could contain, for example, all papers concerned with Gene Ontology curation or all papers curated into the database as a whole.
editArticleCorpus -name "GO" -desc "Gene Ontology" -owner "Rocky" -regex "G"
-db <name-of-database> -l <login> -p <password>
-db DBNAME : Database name
-desc DESCRIPTION : Corpus description
-l LOGIN : Database login
-name NAME : Corpus name
-owner OWNER : Corpus owner
-p PASSWD : Database password
-regex REGEX : Regular expression to recognize incoming files (optional)
Note that the first time you run a database command in this system, the system needs to generate a lookup object for the many journals referenced in PubMed. This is quite slow, but it is a one-time step.
The next step is to build the triage corpora that hold the articles.
A triage corpus is a special kind of corpus used to organize a collection of articles for the triaging procedure. Each triage corpus should denote a natural collection of papers (such as all the papers assigned to a specific individual or all papers from a given journal). A triage corpus is the entry point for a paper into the system.
The following example would create a triage corpus named 'curator1', owned by a user called 'Curator 1'.
editTriageCorpus -name "curator1" -desc "Curator 1's triage corpus" -owner "Curator 1"
-db <name-of-database> -l <login> -p <password>
-db DBNAME : Database name
-desc DESCRIPTION : Corpus description
-l LOGIN : Database login
-name NAME : Corpus name
-owner OWNER : Corpus owner
-p PASSWD : Database password
The crucial task of loading files into a triage corpus is performed by the following command:
buildTriageCorpusFromPdfDir -pdfs </complete/path/to/pdf/directory> -triageCorpus "<triage-corpus-name>"
-rules </path/to/rules/file> -codeList </path/to/codeList/file>
-db <name-of-database> -l <login> -p <password>
-codeList CODES : Encoded file names + codes (optional)
-triageCorpus CORPUS : Corpus name
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
-pdfs PDF-DIR-OR-FILE : Pdfs directory or file
-rules FILE : Rules file (optional)
This will run through all files in the targeted directory and load them into the named triage corpus. Note
that the system will iterate over every target corpus and assign an in-out code to every article. The user
may supply formatted codes in a text file (using the -codeList option) rather than editing each file
name on disk. Thus, if a PDF file has the filename 19763139.pdf with an entry 19763139_A.pdf in the codeList
file, it would be assigned a code of 'in' for the target corpus designated by the letter A and a code of 'out' for
all others. If no codes are assigned either from the file names or from the codeList, then all articles in the
upload will be assigned a code of 'unclassified' for all corpora.
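The assignment rule described above can be illustrated with a short sketch. It is not the system's implementation: the function name `assign_in_out_codes` is hypothetical, and the value 'unclassified' is taken from the earlier definition of the in-out-code. The text only states what happens when no article in an upload carries a code; treating each uncoded article individually as 'unclassified' is an assumption made here.

```python
from typing import Dict, Iterable, Set


def assign_in_out_codes(article_codes: Set[str],
                        target_codes: Iterable[str]) -> Dict[str, str]:
    """Return an in-out code for one article against every target corpus.

    article_codes: single-letter codes found for the article, either in
        its file name or in the codeList file (empty if none were found).
    target_codes: the single-letter codes of all target corpora.
    """
    if not article_codes:
        # Assumption: an article without any code is left unclassified
        # for every target corpus.
        return {code: "unclassified" for code in target_codes}
    # A coded article is 'in' for the corpora it names and 'out' elsewhere.
    return {code: ("in" if code in article_codes else "out")
            for code in target_codes}
```

With target corpora coded 'A' and 'G', an article listed as 19763139_A.pdf would map to 'in' for A and 'out' for G, while an article with no code would map to 'unclassified' for both.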
Note also that text is extracted from the PDF files using rule files for the LAPDF-Text system.
You may specify a rule file here (using the -rules option) to improve the performance of the text extraction if necessary.
The system provides three command-line functions that let an administrator query the state of the system.
The reportCorpusCounts command returns a formatted count of the contents of each target and triage corpus.
reportCorpusCounts -db DBNAME -l LOGIN -p PASSWD
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
The reportTargetCorpusContents command returns a formatted list of all the documents in a given target corpus.
reportTargetCorpusContents -db DBNAME -l LOGIN -p PASSWD -targetCorpus CNAME
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
-targetCorpus CNAME : Target Corpus Name
The reportTriageCorpusContents command returns a formatted list of all the documents in a given triage corpus (relating to a defined target corpus).
reportTriageCorpusContents -db DBNAME -l LOGIN -p PASSWD -targetCorpus CNAME -triageCorpus CNAME
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
-targetCorpus CNAME : Target Corpus Name
-triageCorpus CNAME : Triage Corpus Name
The system has three commands for removing data.
The deleteTargetCorpus command will remove all traces of a given target corpus from the system.
deleteTargetCorpus -db DBNAME -l LOGIN -p PASSWD -targetCorpus TARGET
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
-targetCorpus TARGET : Target Corpus Name
The deleteTriageCorpus command will remove all traces of a given triage corpus from the system.
deleteTriageCorpus -db DBNAME -l LOGIN -p PASSWD -triageCorpus TRIAGE
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
-triageCorpus TRIAGE : Triage Corpus Name
The deleteTriageScoresBasedOnCodefile command uses a code file (a list of formatted pmid_A.pdf file names) to remove the listed papers' associations with a given triage corpus.
deleteTriageScoresBasedOnCodefile -codeList CODES -db DBNAME -l LOGIN -p PASSWD -triageCorpus CORPUS
-codeList CODES : Encoded files
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
-triageCorpus CORPUS : Triage Corpus name
Before we can use the classifiers, we need to train them. This is done by the following command:
triageDocumentsClassifier -train -targetCorpus "GO" [-homeDir /path/to/directory/for/model]
-db <name-of-database> -l <login> -p <password>
-db DBNAME : Database name
-homeDir DIR : Directory where application data will be persisted
-l LOGIN : Database login
-p PASSWD : Database password
-targetCorpus NAME : The target corpus that we're linking to
-train : If present will train and generate model, if absent will
compute and update prediction scores in Triage Document.
Either -train or -predict should be specified.
-triageCorpus NAME : The triage corpus to be evaluated. It is required if
-predict is used.
This runs through all the example data from all the triage corpora where the in-out code is set to
either in or out and trains an SVM classifier (derived from a baseline set of features). An optional
argument (-homeDir) sets where the model should be placed; if it is not set, the model will be saved in
the home directory of the user running the command.
Applying the classifier is accomplished with the following command:
triageDocumentsClassifier -predict -targetCorpus "GO" -triageCorpus "<triage-corpus-name>" [-homeDir /path/to/directory/for/model]
-db <name-of-database> -l <login> -p <password>
-db DBNAME : Database name
-homeDir DIR : Directory where application data will be persisted
-l LOGIN : Database login
-p PASSWD : Database password
-targetCorpus NAME : The target corpus that we're linking to
-predict : If present will compute and update prediction scores in
Triage Document. Either -train or -predict should be
specified.
-triageCorpus NAME : The triage corpus to be evaluated. It is required if
-predict is used.
Note that execution of this command is exactly like the training step with a single option changed. This will
generate scores for each paper in each category.
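Since training and prediction differ only in a few options, the two invocations can be generated from one function that also enforces the constraints stated in the option descriptions. This is a sketch under assumptions: the function name `classifier_command` is hypothetical, only the options documented above are used, and the options are assumed to be accepted in any order.

```python
from typing import List, Optional


def classifier_command(mode: str, target_corpus: str, database: str,
                       login: str, password: str,
                       triage_corpus: Optional[str] = None,
                       home_dir: Optional[str] = None) -> List[str]:
    """Build an argument list for triageDocumentsClassifier.

    mode must be "train" or "predict"; per the option descriptions,
    -triageCorpus is required when -predict is used.
    """
    if mode not in ("train", "predict"):
        raise ValueError("mode must be 'train' or 'predict'")
    if mode == "predict" and triage_corpus is None:
        raise ValueError("-triageCorpus is required with -predict")
    argv = ["triageDocumentsClassifier", f"-{mode}",
            "-targetCorpus", target_corpus]
    if triage_corpus is not None:
        argv += ["-triageCorpus", triage_corpus]
    if home_dir is not None:
        # Optional: directory where the model is stored.
        argv += ["-homeDir", home_dir]
    argv += ["-db", database, "-l", login, "-p", password]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`, first with `mode="train"` and then with `mode="predict"`, using the same `home_dir` so that prediction finds the trained model.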
To start the Triage web application server, run the following command:
triageServer -db <name-of-database> -l <login> -p <password>
This should start the web server so that curators can access the display.
To access the Triage web app, navigate in a browser to: http://localhost:8080/triage
To stop the Triage web app server, you currently have to kill the job that was started with the triageServer command.
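When the server is launched from a script rather than an interactive shell, killing the job can be automated. The sketch below is one possible approach and is not part of the triage system: the name `running_server` is hypothetical, and the sketch assumes the server process exits when it receives a termination signal.

```python
import subprocess
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def running_server(argv: List[str]) -> Iterator[subprocess.Popen]:
    """Start a long-running command and stop it when the block exits."""
    process = subprocess.Popen(argv)
    try:
        yield process
    finally:
        # Ask the process to stop; fall back to a hard kill if it has
        # not exited after ten seconds.
        process.terminate()
        try:
            process.wait(timeout=10)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()
```

A script could then wrap its work in `with running_server(["triageServer", "-db", "triage_db", "-l", "curator", "-p", "secret"]):` (placeholder credentials), and the server job is stopped when the block ends.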