docs: readme

dhdaines · Feb 20, 2025 · fa91ff2
1 parent 958a09d
commit fa91ff2
Showing 1 changed file with 23 additions and 32 deletions.
diff --git a/README.md b/README.md
@@ -14,44 +14,32 @@ would be specifically one of these things and nothing else:
 2. Obtaining the absolute position and attributes of every character,
    line, path, and image in every page of a PDF.
 
-If you just want to extract text from a PDF, there are a lot of better
-and faster tools and libraries out there, see [these
-benchmarks](https://github.com/py-pdf/benchmarks) for a summary (TL;DR
-pypdfium2 is probably what you want, but pdfplumber does a nice job of
-converting PDF to ASCII art).
-
 The purpose of PLAYA is to provide an efficient, parallel and
 parallelizable, pure-Python and Pythonic (for its author's definition
 of the term), lazy interface to the internals of PDF files.
 
-But yes, you *can* also extract text with PLAYA now.  This is fast compared
-to other pure-Python libraries, slow compared to anything else, and I
-can't guarantee that the output is any good.  On my Thinkpad X250
-(Core i5-5300U circa 2015) I get these speeds when extracting the
-zoning bylaw of my town (486 pages of tagged PDF), using `playa --text` on
-the command-line:
-
-| Tool | Time |
-|------|------|
-| pdfminer.six | 36.6s |
-| pypdf | 17.7s |
-| PLAYA (1 CPU) | 18.2s |
-| PLAYA (1 CPU, PyPy 3.9) | 10.8s |
-| PLAYA (2 CPUs) | 10.5s |
-| pypdfium2 | 1.7s |
-| Poppler | 1.6s |
-
-Soon, this will get faster.  You will also be able to use
-[PAVÉS](https://github.com/dhdaines/paves) for this and other
-higher-level tasks, and it will be better, maybe.
-
-Also, for things other than extracting text, PLAYA is actually quite
-efficient.  For instance, it is very good at reading logical structure
-trees.  On the zoning bylaw above, extracting the entire tree with its
-text contents as JSON using `playa --structure` takes only 23 seconds,
-whereas `pdfplumber --structure-text` takes 69 seconds and `pdfinfo
+If you just want to extract text from a PDF, there are better and/or
+faster tools and libraries out there, notably
+[pypdfium2](https://pypi.org/project/pypdfium2/) and
+[pypdf](https://pypi.org/project/pypdf/), among others.  See [these
+benchmarks](https://github.com/dhdaines/benchmarks) for a comparison.
+Nonetheless, you will notice in this comparison that:
+
+- PLAYA (using 2 CPUs) is the fastest pure-Python PDF reader by far
+- PLAYA has no dependencies and no C++
+- PLAYA is MIT licensed
+
+PLAYA is also very good at reading logical structure trees.  On the
+zoning bylaw above, extracting the entire tree with its text contents
+as JSON using `playa --structure` takes only 23 seconds, whereas
+`pdfplumber --structure-text` takes 69 seconds and `pdfinfo
 -struct-text` (which doesn't output JSON) takes 110 seconds.
 
+I cannot stress this enough: *text extraction is not the primary use
+case for PLAYA*, because [extracting text from PDFs is not
+fun](https://pypdf.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard),
+and I like fun.  Do you like fun?  Then read on.
+
 ## Installation
 
 Installing it should be really simple as long as you have Python 3.8
@@ -522,3 +510,6 @@ This repository obviously includes code from `pdfminer.six`.  Original
 license text is included in
 [LICENSE](https://github.com/dhdaines/playa/blob/main/LICENSE).  The
 license itself has not changed!
+
+For the moment PLAYA is developed and maintained by [David
+Huggins-Daines](https://ecolingui.ca/) <[email protected]>.