# nlpprepkit

**nlpprepkit** is a Python library for text preprocessing, designed to simplify and accelerate the preparation of text data for natural language processing (NLP) tasks.
## Features

- **Text Cleaning**: Remove extra whitespace, special characters, emojis, HTML tags, URLs, numbers, and social tags.
- **Contraction Expansion**: Expand common English contractions (e.g., "don't" → "do not").
- **Unicode Normalization**: Normalize text to its ASCII representation.
- **Pipeline Support**: Create customizable pipelines for sequential text processing.
- **Profiling**: Measure the execution time of each step in the pipeline.
- **Caching**: Avoid redundant processing with built-in caching.
- **Parallel Processing**: Process large text datasets efficiently.
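The caching idea can be illustrated with Python's standard `functools.lru_cache` (a minimal sketch of the concept, not nlpprepkit's internal cache):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def clean(text: str) -> str:
    # Collapse runs of whitespace and lowercase the text.
    return " ".join(text.split()).lower()

clean("Hello   WORLD")           # computed on the first call
print(clean("Hello   WORLD"))    # "hello world", served from the cache
print(clean.cache_info().hits)   # 1
```

When the same strings appear repeatedly in a dataset, memoization like this skips the redundant work entirely.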
## Installation

Install the library using pip:

```bash
pip install nlpprepkit
```

Or install from source:

```bash
git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .
```
## Usage

### Building a Pipeline

```python
from nlpprepkit.pipeline import Pipeline

# Define a custom processing step
def lowercase(text):
    return text.lower()

# Create a pipeline and add the step
pipeline = Pipeline()
pipeline.add_step(lowercase)

# Process text
result = pipeline.process("This is a TEST.")
print(result)  # Output: "this is a test."
```
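To show what sequential processing with per-step profiling can look like under the hood, here is a self-contained toy pipeline (an illustrative sketch, not nlpprepkit's actual `Pipeline` implementation):

```python
import time

class MiniPipeline:
    """Toy sequential text pipeline that times each step."""

    def __init__(self):
        self.steps = []
        self.timings = {}  # step name -> elapsed seconds

    def add_step(self, func):
        self.steps.append(func)

    def process(self, text):
        # Apply each step in order, recording how long it took.
        for func in self.steps:
            start = time.perf_counter()
            text = func(text)
            self.timings[func.__name__] = time.perf_counter() - start
        return text

pipeline = MiniPipeline()
pipeline.add_step(str.lower)
pipeline.add_step(str.strip)
print(pipeline.process("  This is a TEST.  "))  # "this is a test."
```

Each step is just a function from string to string, so steps compose freely and the timing dictionary gives a simple profile of where time is spent.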
### Cleaning Text

```python
from nlpprepkit.functions import remove_extra_whitespace, remove_special_characters

text = "This  is a   test!!!"
cleaned_text = remove_extra_whitespace(text)
print(cleaned_text)  # Output: "This is a test!!!"

cleaned_text = remove_special_characters(cleaned_text)
print(cleaned_text)  # Output: "This is a test"
```
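For readers curious how such cleaning functions typically work, equivalent behavior can be sketched with the standard `re` module (an approximation of the idea; nlpprepkit's actual implementations may differ):

```python
import re

def collapse_whitespace(text: str) -> str:
    # Replace any run of whitespace with a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

def strip_special_characters(text: str) -> str:
    # Keep only letters, digits, and spaces.
    return re.sub(r"[^A-Za-z0-9 ]+", "", text)

print(strip_special_characters(collapse_whitespace("This  is a   test!!!")))  # "This is a test"
```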
### Expanding Contractions

```python
from nlpprepkit.functions import expand_contractions

text = "I'm going to the store."
expanded_text = expand_contractions(text)
print(expanded_text)  # Output: "I am going to the store."
```
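Contraction expansion is usually a lookup against a mapping of known contractions. A minimal sketch, using a tiny hypothetical mapping rather than the library's full table:

```python
import re

# Small illustrative mapping; a real table covers many more contractions.
CONTRACTIONS = {"i'm": "I am", "don't": "do not", "can't": "cannot"}

def expand(text: str) -> str:
    # Match any known contraction, case-insensitively, and replace it.
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand("I'm going to the store."))  # "I am going to the store."
```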
## Running Tests

To run the tests, use `pytest`:

```bash
pytest
```
## Contributing

Contributions are welcome! Feel free to submit a pull request or open an issue on GitHub.
## License

This project is licensed under the MIT License.