This project extracts text from PDF files using pdf.js in a Node.js environment, preserving line breaks.
- Extracts text from PDFs using `pdfjs-dist`
- Maintains original line breaks and paragraph structure
- Works in Node.js with ES Modules
- Node.js 20+ (older versions may cause errors due to `Promise.withResolvers` usage)
- Install dependencies:
npm install
To extract text from a PDF file:
node main.js path/to/your.pdf
The script prints the extracted text, preserving the line breaks of the original document.
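If you encounter an error mentioning `Promise.withResolvers`, ensure you're using Node.js 20+. Node.js ships `Promise.withResolvers` natively only from version 22, so upgrading further may help if the error persists.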
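A quick way to confirm whether the runtime is the cause is to feature-detect the method before running the extraction. The snippet below is a minimal sketch; `supportsWithResolvers` is a hypothetical helper name and is not part of this project or of pdf.js.

```javascript
// Hypothetical helper: reports whether the given Promise constructor
// exposes the static withResolvers() method.
function supportsWithResolvers(PromiseCtor = Promise) {
  return typeof PromiseCtor.withResolvers === 'function';
}

if (!supportsWithResolvers()) {
  console.error(
    `Promise.withResolvers is missing in Node.js ${process.version}; ` +
      'try a newer Node.js release.'
  );
}
```

Running this with the same `node` binary you use for `main.js` shows whether a Node.js upgrade is likely to resolve the error.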
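The script compares the Y-axis positions of consecutive pieces of text to detect line breaks: a jump larger than the threshold starts a new line. If formatting is still incorrect, adjust the threshold (the value `5`; a larger value yields fewer line breaks, a smaller one yields more) in: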
if (prevY !== null && Math.abs(y - prevY) > 5) {
  pageText += '\n';
}
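To see how the threshold affects the output, the check above can be tried in isolation. The following is a minimal sketch, not the project's actual `main.js`: `itemsToText` and the sample data are hypothetical, and it assumes text items shaped like those pdf.js returns from `getTextContent()`, where `str` holds the text and `transform[5]` holds the Y position.

```javascript
// Hypothetical helper: joins pdf.js-style text items into one string,
// starting a new line whenever the Y position changes by more than `threshold`.
function itemsToText(items, threshold = 5) {
  let pageText = '';
  let prevY = null;
  for (const item of items) {
    const y = item.transform[5]; // vertical position of this text item
    if (prevY !== null && Math.abs(y - prevY) > threshold) {
      pageText += '\n';
    }
    pageText += item.str;
    prevY = y;
  }
  return pageText;
}

// Made-up sample: two items on one line, then one item 14 units lower.
const sample = [
  { str: 'Hello ', transform: [1, 0, 0, 1, 72, 700] },
  { str: 'world', transform: [1, 0, 0, 1, 110, 700] },
  { str: 'Second line', transform: [1, 0, 0, 1, 72, 686] },
];

console.log(itemsToText(sample));     // two lines: "Hello world" / "Second line"
console.log(itemsToText(sample, 20)); // one line: the 14-unit gap is under the threshold
```

If a PDF with tight line spacing comes out as a single run of text, lowering the threshold is the first thing to try; if words on the same line are being split apart, raise it.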