An unsupervised algorithm for identifying pdf segments

License

GPL-3.0 (LICENSE) and Apache-2.0 (LICENSE.txt)
TekstoSense/pdf-segmenter

PDF Segmenter

Identifying segments or sections in a PDF file is not a new problem in text analytics. A few supervised approaches identify segments in a PDF correctly, but they are mostly suited to scientific articles because they are trained on scientific corpora. A few commercial applications also identify sections in PDFs. The supervised approaches suffer from specificity, and the existing unsupervised approaches suffer from inaccuracy.

pdf-segmenter is an unsupervised algorithm that identifies segments in any type of document with good accuracy. In our internal testing on scientific articles and journals, it achieved more than 90% accuracy. It identifies the Title and Author as the first section, and the Abstract, Introduction, etc. as subsequent sections. It also identifies Tables and Figures with good accuracy. pdf-segmenter produces its output as JSON.

Note: pdf-segmenter also works on other general types of PDF documents, but its accuracy there is below 90%. In subsequent releases we will improve accuracy on other PDF documents such as books.
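To give a feel for what unsupervised segmentation with JSON output can look like, here is a minimal sketch. It is not pdf-segmenter's actual algorithm or API, neither of which this README documents. It assumes the text lines and their font sizes have already been extracted from the PDF (for example with PDFBox), treats the most frequent font size as body text, and starts a new segment at every line set in a larger font. The names `SegmentSketch`, `Line`, `Segment`, and the `heading`/`body` JSON keys are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: NOT pdf-segmenter's real algorithm or API.
final class SegmentSketch {

    /** One extracted text line with its font size (assumed input model). */
    static final class Line {
        final String text;
        final float fontSize;
        Line(String text, float fontSize) { this.text = text; this.fontSize = fontSize; }
    }

    /** A heading plus the body lines that follow it. */
    static final class Segment {
        final String heading;
        final List<String> body = new ArrayList<>();
        Segment(String heading) { this.heading = heading; }
    }

    /** Most frequent font size, assumed to be the body text size (ties: smaller size wins). */
    static float bodyFontSize(List<Line> lines) {
        Map<Float, Integer> counts = new HashMap<>();
        for (Line l : lines) counts.merge(l.fontSize, 1, Integer::sum);
        float best = 0f;
        int bestCount = 0;
        for (Map.Entry<Float, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount || (e.getValue() == bestCount && e.getKey() < best)) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Starts a new segment at each line whose font is larger than the body size. */
    static List<Segment> segment(List<Line> lines) {
        float body = bodyFontSize(lines);
        List<Segment> out = new ArrayList<>();
        Segment current = null;
        for (Line l : lines) {
            if (l.fontSize > body) {
                current = new Segment(l.text);
                out.add(current);
            } else {
                if (current == null) {          // body text before any heading
                    current = new Segment("");
                    out.add(current);
                }
                current.body.add(l.text);
            }
        }
        return out;
    }

    /** Quotes a string for JSON, escaping quotes, backslashes, and control characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') sb.append('\\').append(c);
            else if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
            else sb.append(c);
        }
        return sb.append('"').toString();
    }

    /** Serializes segments as a JSON array of {"heading": ..., "body": [...]} objects. */
    static String toJson(List<Segment> segments) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < segments.size(); i++) {
            Segment s = segments.get(i);
            if (i > 0) sb.append(',');
            sb.append("{\"heading\":").append(quote(s.heading)).append(",\"body\":[");
            for (int j = 0; j < s.body.size(); j++) {
                if (j > 0) sb.append(',');
                sb.append(quote(s.body.get(j)));
            }
            sb.append("]}");
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<Line> lines = new ArrayList<>();
        lines.add(new Line("A Study of Things", 18f));
        lines.add(new Line("Jane Doe", 10f));
        lines.add(new Line("Abstract", 12f));
        lines.add(new Line("We study things.", 10f));
        lines.add(new Line("Introduction", 12f));
        lines.add(new Line("Things matter.", 10f));
        System.out.println(toJson(segment(lines)));
    }
}
```

In this sketch the title line opens the first segment and the author line falls into its body, which mirrors the "Title and Author as first section" behaviour described above. A font-size heuristic is only one of many possible unsupervised signals. It was chosen here because it needs no training data.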

Supported Types

  • All journals and published papers.
  • White papers and other general PDFs.
  • Unstructured documents only.

Upcoming Releases

  • Patent document support
  • CRF-based model for a supervised approach to identifying sections in scientific articles
  • TEI support

Dependency

PDFBox 2.0.2

License

GNU GPL-3.0 (see LICENSE)
