This project provides tools for unsupervised document clustering using NLP-based approaches. It includes:
- st_workflow: Sentence transformer-based document clustering
- d2v_workflow: Document2Vec-based document clustering
We use uv as a faster and more reliable alternative to pip.
pip install uv
uv sync
uv run ./st_workflow/st_main.py -i documents/
- -i: Path to a folder containing your text documents.
- -o: (Optional) Output folder. Defaults to st_clusters/.
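
As an illustration of how the two options above behave, here is a minimal argparse sketch. It is not taken from st_main.py; the function name `build_parser` and the attribute names `input_dir` and `output_dir` are assumptions made for this example, and the real script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented -i and -o options."""
    parser = argparse.ArgumentParser(
        description="Cluster the text documents found in a folder."
    )
    # -i is required: the folder that holds the input documents.
    parser.add_argument(
        "-i",
        dest="input_dir",
        required=True,
        help="Path to a folder containing your text documents.",
    )
    # -o is optional and falls back to the documented default.
    parser.add_argument(
        "-o",
        dest="output_dir",
        default="st_clusters/",
        help="Output folder. Defaults to st_clusters/.",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["-i", "documents/"])
print(args.input_dir, args.output_dir)
```

Because only -i is passed here, `output_dir` takes the default value st_clusters/.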
uv run ./d2v_workflow/d2v_document_clustering.py
st_workflow uses a pre-trained model and supports multiple formats: txt, pdf, docx, html, htm.
d2v_workflow was trained using the ag_news dataset.
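
To check which files in a folder fall under the formats listed for st_workflow, a small filter like the one below can help. This is only a sketch under stated assumptions: the helper name `find_supported_documents` is hypothetical, the match is case-insensitive, and only the top level of the folder is scanned. The project's own loader may behave differently on each of these points.

```python
from pathlib import Path

# Extensions taken from the format list above; the leading dots are added here.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx", ".html", ".htm"}


def find_supported_documents(folder):
    """Return the files directly inside `folder` whose extension is supported.

    Assumptions for this sketch: no recursion into subfolders, and
    extensions are compared case-insensitively.
    """
    root = Path(folder)
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_supported_documents("documents/")` before a run gives a quick view of which files would be candidates for clustering and which would be skipped.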