A Node-RED node that extracts text from a PDF file using hummusjs.
Run the following command in the root directory of your Node-RED install:
npm install node-red-contrib-pdf-hummus
This node extracts text from a PDF document using the npm module hummusjs and the text extraction sample.
This early release is a get-it-working Node-RED wrapper around the text extraction sample code, which does more than this node actually needs. As a result, it has a larger memory requirement than necessary.
The node needs a filename and a PDF buffer as input.
The filename configured on the node can be overridden by setting
msg.filename.
The PDF document to be processed should be passed in as a data buffer
in msg.payload.
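As a minimal sketch of preparing that input yourself, the helper below builds a message carrying `msg.filename` and a `Buffer` in `msg.payload`. The helper name `buildPdfMessage` and the `%PDF-` header check are illustrative assumptions and are not part of this node; inside a Function node the same two properties could simply be set on the incoming `msg`.

```javascript
// Hypothetical helper (not part of node-red-contrib-pdf-hummus).
// Builds a message in the shape described above: filename + PDF Buffer.
function buildPdfMessage(filename, data) {
  if (!Buffer.isBuffer(data)) {
    throw new TypeError('msg.payload must be a Buffer containing the PDF');
  }
  // PDF files begin with the ASCII marker "%PDF-"; reject anything else early.
  if (data.subarray(0, 5).toString('latin1') !== '%PDF-') {
    throw new Error('payload does not look like a PDF document');
  }
  return { filename: filename, payload: data };
}
```

Checking the header before the message reaches the node is optional; it only makes a wrong input fail earlier and with a clearer error.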
The node can also be driven by an HTTP input node, where a PDF
file is POSTed to the flow. The PDF file buffer and name will
then be taken from the request field of the msg.
To use this implementation, in the HTTP input node properties, ensure
that "Method" is set to "POST" and "Accept file uploads?" is
ticked.
The output is a JSON object on msg.payload.
If the split option is selected, a separate message is sent for each page.
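To illustrate handling the split output downstream, here is a minimal sketch that accumulates the per-page messages in arrival order. It assumes nothing about the structure of `msg.payload`; the name `makePageCollector` is hypothetical, and detecting the end of a document is deliberately left out because it is not described here.

```javascript
// Hypothetical collector for per-page messages (not part of this node).
function makePageCollector() {
  const pages = [];
  return {
    // Call once per incoming message; returns the number of pages seen so far.
    add(msg) {
      pages.push(msg.payload);
      return pages.length;
    },
    // Returns a copy so callers cannot mutate the collected pages.
    pages() {
      return pages.slice();
    }
  };
}
```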
File Inject Implementation
[{"id":"75540143.de239","type":"pdf-hummus","z":"434de041.4e4f4","name":"","filename":"myfile.txt","split":true,"mode":{"value":"asBuffer"},"x":270.5,"y":65,"wires":[["cd04c7ce.3b70c8","a84e0874.451318"]]},{"id":"38aade4f.ab0c12","type":"fileinject","z":"434de041.4e4f4","name":"","x":103,"y":62,"wires":[["75540143.de239"]]},{"id":"cd04c7ce.3b70c8","type":"debug","z":"434de041.4e4f4","name":"","active":true,"console":"false","complete":"false","x":449.5,"y":65,"wires":[]},{"id":"a84e0874.451318","type":"watson-discovery-v1-document-loader","z":"434de041.4e4f4","name":"","environment_id":"","collection_id":"","default-endpoint":true,"service-endpoint":"https://gateway.watsonplatform.net/discovery/api","x":411,"y":133,"wires":[["2e9ac940.1cd4c6"]]},{"id":"2e9ac940.1cd4c6","type":"debug","z":"434de041.4e4f4","name":"","active":true,"console":"false","complete":"true","x":610.5,"y":131,"wires":[]}]
HTTP POST Implementation
[ { "id": "1f7ebf39.38b309", "type": "pdf-hummus", "z": "639c38eb.3b18c8", "name": "", "filename": "", "split": false, "mode": { "value": "asBuffer" }, "x": 487, "y": 227, "wires": [ [ "757b38a5.04059" ] ] }, { "id": "e8e6b15b.868638", "type": "http in", "z": "639c38eb.3b18c8", "name": "", "url": "/pdfin", "method": "post", "upload": true, "swaggerDoc": "", "x": 204, "y": 228, "wires": [ [ "1f7ebf39.38b309" ] ] }, { "id": "757b38a5.04059", "type": "http response", "z": "639c38eb.3b18c8", "name": "", "statusCode": "200", "headers": {}, "x": 792, "y": 230, "wires": [] } ]
Deploy the sample flow, and create an HTTP POST request as follows:
- METHOD: POST
- URL: http://localhost:1880/pdfin (1880 is the Node-RED default port; adjust if yours differs)
- BODY_TYPE: form-data
- FORM FIELD: Key - "file", Value - target.pdf (the PDF file to upload)
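The request above can also be sent from a script. The following is a sketch using only Node.js core modules; it assumes the PDF is sent as a multipart form-data field named "file", and the helper names `buildMultipartBody` and `postPdf` are hypothetical.

```javascript
const http = require('http');

// Build a multipart/form-data body containing a single file field.
function buildMultipartBody(fieldName, filename, fileBuffer, boundary) {
  const head = Buffer.from(
    `--${boundary}\r\n` +
    `Content-Disposition: form-data; name="${fieldName}"; filename="${filename}"\r\n` +
    'Content-Type: application/pdf\r\n\r\n'
  );
  const tail = Buffer.from(`\r\n--${boundary}--\r\n`);
  return Buffer.concat([head, fileBuffer, tail]);
}

// POST the PDF to the given URL and resolve with the response body as text.
function postPdf(url, filename, fileBuffer) {
  const boundary = '----pdfhummus' + Date.now().toString(16);
  const body = buildMultipartBody('file', filename, fileBuffer, boundary);
  return new Promise((resolve, reject) => {
    const req = http.request(url, {
      method: 'POST',
      headers: {
        'Content-Type': `multipart/form-data; boundary=${boundary}`,
        'Content-Length': body.length
      }
    }, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
    });
    req.on('error', reject);
    req.end(body);
  });
}
```

To use it, read the file with `fs.readFileSync('target.pdf')` and pass the resulting buffer to `postPdf` together with the `/pdfin` URL of your own Node-RED instance.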
For simple typos and fixes please just raise an issue pointing out our mistakes. If you need to raise a pull request please read our contribution guidelines before doing so.
Copyright 2017 IBM Corp. under the Apache 2.0 license.