6.2.0 #14679
DevinTDHa
announced in
Announcement
📢 Spark NLP 6.2.0: A new stage for unstructured document ingestion and processing at scale
Spark NLP 6.2.0 introduces key upgrades across entity extraction, document normalization, HTML reading, and GGUF-based models. To recap, the Spark NLP 6.1 releases added:
- AutoGGUFReranker
- Reader2Doc: streamlines loading and integrating diverse file formats (PDF, Word, Excel, PowerPoint, HTML, Text, Email, Markdown) directly into Spark NLP pipelines through a unified and flexible interface.
- Reader2Table: streamlines tabular data extraction from multiple document formats with seamless pipeline integration.
- Reader2Image: extracts structured image content from various document types.

Spark NLP release 6.2.0 further focuses on automation, structure-awareness, and resource efficiency, making pipelines easier to configure, manage, and extend.
🔥 Highlights
🚀 New Features & Enhancements
EntityRulerModel and DocumentNormalizer Auto Modes
EntityRulerModel
- autoMode parameter to enable predefined regex entity groups ("network_entities", "communication_entities", "media_entities", "email_entities", "all_entities").
- extractEntities parameter to filter entities within auto modes.

DocumentNormalizer
- presetPattern and autoMode parameters to apply built-in text cleaning patterns.
- Named presets and modes: "light_clean", "document_clean", "social_clean", "html_clean", and "full_auto".

Together, these additions significantly reduce boilerplate setup for common text extraction and normalization workflows.
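To make the idea of an auto mode with an entity filter concrete, here is a minimal plain-Python sketch. It is not Spark NLP code: the regexes, the entity labels (EMAIL, IPV4, URL), the assignment of labels to mode names, and the `extract` helper are all assumptions made for illustration, and the patterns Spark NLP actually ships may differ.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical patterns for illustration only; NOT the regexes Spark NLP ships.
ENTITY_PATTERNS: Dict[str, str] = {
    "EMAIL": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "IPV4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "URL": r"https?://[^\s<>\"]+",
}

# Hypothetical assignment of entity labels to the mode names from the release notes.
AUTO_MODES: Dict[str, List[str]] = {
    "email_entities": ["EMAIL"],
    "network_entities": ["IPV4", "URL"],
    "all_entities": ["EMAIL", "IPV4", "URL"],
}


def extract(
    text: str,
    auto_mode: str,
    extract_entities: Optional[List[str]] = None,
) -> List[Tuple[str, str]]:
    """Run every pattern in the chosen mode, optionally keeping only some labels."""
    labels = AUTO_MODES[auto_mode]
    if extract_entities is not None:
        # Narrow the predefined group to the requested labels.
        labels = [label for label in labels if label in extract_entities]
    found: List[Tuple[str, str]] = []
    for label in labels:
        for match in re.finditer(ENTITY_PATTERNS[label], text):
            found.append((label, match.group()))
    return found


text = "Contact ops@example.com or visit https://example.com/docs from 10.0.0.1"
network = extract(text, "network_entities")
urls_only = extract(text, "network_entities", extract_entities=["URL"])
emails = extract(text, "email_entities")
```

The point of the sketch is the shape of the configuration: one mode name selects a whole group of patterns, and an optional filter narrows that group, so no per-pattern setup is needed for common cases.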
Hierarchical Element Identification in HTMLReader
- element_id and parent_id metadata fields for each parsed HTML element.
- These fields record how elements nest (e.g. title → paragraph → link) for hierarchical retrieval and contextual reasoning.

AutoGGUF Annotator Enhancements
For AutoGGUFModel, AutoGGUFVision, AutoGGUFEmbeddings, and AutoGGUFReranker:
- close() method to explicitly release llama.cpp model resources, preventing memory retention in long-running sessions.
- setRemoveThinkingTag(tag: String) parameter to remove internal <think>...</think> sections from model outputs.
- Regex used for removal: (?s)<$tag>.+?</$tag>

🐛 Bug Fixes
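The thinking-tag pattern quoted under AutoGGUF Annotator Enhancements, `(?s)<$tag>.+?</$tag>`, can be tried outside Spark NLP. The sketch below applies the same pattern shape with Python's `re` module. The helper name `remove_thinking` is hypothetical, and whether Spark NLP also trims surrounding whitespace after removal is not stated in the release notes, so this sketch does not trim.

```python
import re


def remove_thinking(text: str, tag: str = "think") -> str:
    """Strip <tag>...</tag> sections using the pattern shape (?s)<$tag>.+?</$tag>."""
    escaped = re.escape(tag)
    # (?s) lets "." match newlines; ".+?" is non-greedy, so each tagged
    # section is removed separately instead of everything between the
    # first opening tag and the last closing tag.
    pattern = re.compile(r"(?s)<%s>.+?</%s>" % (escaped, escaped))
    return pattern.sub("", text)


raw = "<think>\nThe user wants a sum.\n2 + 2 = 4\n</think>The answer is 4."
cleaned = remove_thinking(raw)
print(cleaned)
```

Because the pattern uses `.+?` and not `.*?`, an empty pair such as `<think></think>` is left in place by this sketch; only sections with at least one character between the tags are removed.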
❤️ Community Support
💻 Installation
Python
Spark Packages
CPU
GPU
Apple Silicon
AArch64
Maven
FAT JARs
What's Changed
Full Changelog: 6.1.5...6.2.0
This discussion was created from the release 6.2.0.