This project provides code that uses BioC structures as a UIMA type and supports their use in text mining applications based on the ClearTK UIMA system. Note that our processing uses BioC data formatted as either XML or JSON.
This diagram shows the relationship between the various elements. Note that annotations are primarily structured using infons (key-value tables), whose contents are themselves unspecified. Using this library to extract data from *.nxml files generates BioC-formatted data with a predefined organization based on edu.isi.bmkeg.uimaBioC.uima.readers.Nxml2TxtFilesCollectionReader.
- The document has an infons object describing its metadata:
{"pmc": "2191828",
"pmid": "7528775",
"publisher-id": "95105720",
"relative-source-path": "7528775.txt",
"type": "formatting",
"value": "article-id"}
- We add a single passage containing all available text to the document with an infons object:
{'type':'document'}
All annotations can then be attached to this passage.
- We use annotations with an infons object
{'type':'formatting', 'value': '???'}
where the value field could be front, abstract, body or ref-list to denote those parts of the text.
- Similarly, we use annotations with an infons object
{'type':'formatting', 'value': '???'}
where the value field could be title, subtitle, sec, p, caption or fig to denote those parts of the text.
- The same is true for the following simple text formatting elements:
bold, italic, sub, sup.
- Note that each BioCAnnotation has a BioCLocation with a length and offset value that embeds it into the body of the text as a whole.
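The organization described in the list above can be illustrated with a small sketch. It is written in Python with plain dictionaries purely to show the data layout; it does not use this library's Java classes. The field names (`passages`, `annotations`, `locations`) are assumed to follow the common BioC JSON layout, and the sample text, the annotation ids, and the helper `spans_by_value` are hypothetical.

```python
# Hypothetical document text; every offset below indexes into this string.
text = "A Title\nSome bold words in the body."


def make_annotation(ann_id, value, offset, length):
    """Build a BioC-style annotation whose infons mark a formatting element."""
    return {
        "id": ann_id,
        "infons": {"type": "formatting", "value": value},
        # The location embeds the annotation into the text as a whole.
        "locations": [{"offset": offset, "length": length}],
    }


document = {
    "id": "7528775",
    # Document-level infons, as in the metadata example above.
    "infons": {
        "pmc": "2191828",
        "pmid": "7528775",
        "publisher-id": "95105720",
        "relative-source-path": "7528775.txt",
        "type": "formatting",
        "value": "article-id",
    },
    "passages": [
        {
            # A single passage holds all available text.
            "infons": {"type": "document"},
            "offset": 0,
            "text": text,
            "annotations": [
                make_annotation("a1", "title", 0, 7),
                make_annotation("a2", "p", 8, 28),
                make_annotation("a3", "bold", 13, 4),
            ],
        }
    ],
}


def spans_by_value(doc, value):
    """Return the text covered by each formatting annotation with this value."""
    spans = []
    for passage in doc["passages"]:
        for ann in passage["annotations"]:
            infons = ann["infons"]
            if infons.get("type") != "formatting" or infons.get("value") != value:
                continue
            for loc in ann["locations"]:
                # Offsets are relative to the whole text, so subtract the
                # passage offset before slicing the passage text.
                start = loc["offset"] - passage["offset"]
                spans.append(passage["text"][start:start + loc["length"]])
    return spans
```

Calling `spans_by_value(document, "bold")` slices the passage text at the annotation's offset and length, which is how a consumer would recover the marked-up region from the flat text.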
Other indexing processes use infons to construct BioC annotations for other elements (PDF annotations, named entities, etc.). This UIMA library provides access to UIMA-based computation for this effort by providing pipelines with collection readers that read BioC-formatted data (and libraries that execute on BioC-formatted data).
