PDF Segmenter

Identifying segments (sections) in a PDF file is not a new problem in text analytics. A few supervised approaches can identify segments correctly, but they are mostly suited to scientific articles because they are trained on scientific corpora. A few commercial applications also identify sections in PDFs. In short, the supervised approaches suffer from specificity, while the existing unsupervised approaches suffer from inaccuracy.

pdf-segmenter is an unsupervised algorithm that identifies segments in any type of document with good accuracy. In our internal testing on scientific articles and journals, it achieves more than 90% accuracy. It identifies the Title and Author as the first section, and Abstract, Introduction, etc. as subsequent sections. It also identifies Tables and Figures with good accuracy. pdf-segmenter produces its output as JSON.

Note: It also works for other types of general PDF documents, but with accuracy below 90%. In subsequent releases we will improve the accuracy for other PDF documents, such as books.
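
To make the idea of unsupervised segmentation concrete, here is a minimal, self-contained sketch. It is not pdf-segmenter's actual algorithm, which this README does not describe. It assumes a simple font-size heuristic: the most frequent font size is treated as body text, and any line set in a larger size starts a new segment. The class and type names (`SegmentSketch`, `Line`, `Segment`) and the `heading`/`text` JSON fields are hypothetical and chosen only for illustration. The input lines are supplied by hand here; in practice they would come from a PDF text extractor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: an assumed font-size heuristic, not the project's method.
class SegmentSketch {

    /** One text line plus the font size it was rendered in (hypothetical input type). */
    static final class Line {
        final String text;
        final float fontSize;
        Line(String text, float fontSize) { this.text = text; this.fontSize = fontSize; }
    }

    /** A heading together with the body lines that follow it. */
    static final class Segment {
        final String heading;
        final List<String> body = new ArrayList<>();
        Segment(String heading) { this.heading = heading; }
    }

    /** Returns the most frequent font size, which is assumed to be the body-text size. */
    static float bodyFontSize(List<Line> lines) {
        Map<Float, Integer> counts = new HashMap<>();
        float best = 0f;
        int bestCount = 0;
        for (Line l : lines) {
            int c = counts.merge(l.fontSize, 1, Integer::sum);
            if (c > bestCount) { bestCount = c; best = l.fontSize; }
        }
        return best;
    }

    /** Starts a new segment at every line whose font is larger than the body font. */
    static List<Segment> segment(List<Line> lines) {
        float body = bodyFontSize(lines);
        List<Segment> segments = new ArrayList<>();
        Segment current = null;
        for (Line l : lines) {
            if (l.fontSize > body) {
                current = new Segment(l.text);
                segments.add(current);
            } else {
                if (current == null) {          // body text before any heading
                    current = new Segment("");
                    segments.add(current);
                }
                current.body.add(l.text);
            }
        }
        return segments;
    }

    /** Minimal escaping: handles backslashes and double quotes only. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Serializes segments as a JSON array of {"heading", "text"} objects. */
    static String toJson(List<Segment> segments) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < segments.size(); i++) {
            Segment s = segments.get(i);
            if (i > 0) sb.append(",");
            sb.append("{\"heading\":\"").append(escape(s.heading))
              .append("\",\"text\":\"").append(escape(String.join(" ", s.body)))
              .append("\"}");
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        List<Line> lines = new ArrayList<>();
        lines.add(new Line("A Sample Paper", 18f));
        lines.add(new Line("Author Name", 10f));
        lines.add(new Line("Abstract", 14f));
        lines.add(new Line("We study segments.", 10f));
        lines.add(new Line("Results are shown.", 10f));
        System.out.println(toJson(segment(lines)));
    }
}
```

A heuristic this simple splits a multi-line title into several segments and ignores tables and figures, so it should be read only as an illustration of the general approach.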

Supported Type

All journals and published papers.
White papers and other general PDFs.
Unstructured documents only.

Upcoming Releases

Patent document support
CRF-based model for a supervised approach to identifying sections in scientific articles
TEI support

Dependency

PDFBox 2.0.2

License

GNU


TekstoSense/pdf-segmenter
