This is a simple script that creates a pretraining dataset from a folder of input files (txt, md, pdf, docx, epub, html).
-
Clone the repository and navigate to the project directory:
git clone https://github.com/agi-dude/pretraining-generator
cd pretraining-generator
-
Install the required dependencies:
pip install -r requirements.txt
-
Run the script:
python main.py
-
Follow the GUI prompts to select the input folder and output file.
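
For orientation, here is a minimal sketch of what turning an input folder into a single output file can look like. It is not the code in main.py. The function names `collect_input_files` and `build_dataset`, the JSON Lines output format, and the `source`/`text` field names are all assumptions made for illustration. Only the plain-text formats (txt, md) are read here; pdf, docx, epub, and html would each need a dedicated parser, which is left out.

```python
import json
from pathlib import Path

# Extensions listed in this README; only the plain-text ones are read below.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".pdf", ".docx", ".epub", ".html"}
PLAIN_TEXT_EXTENSIONS = {".txt", ".md"}


def collect_input_files(folder):
    """Return supported files under `folder` (hypothetical helper), sorted for a stable order."""
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def build_dataset(input_folder, output_file):
    """Write one JSON object per non-empty plain-text file; return the number written.

    The JSON Lines layout is an assumption, not the script's documented format.
    """
    written = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for path in collect_input_files(input_folder):
            if path.suffix.lower() not in PLAIN_TEXT_EXTENSIONS:
                continue  # pdf/docx/epub/html need a dedicated parser (not shown)
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if not text:
                continue  # skip files that contain only whitespace
            out.write(json.dumps({"source": path.name, "text": text}) + "\n")
            written += 1
    return written
```

Calling `build_dataset("my_docs", "dataset.jsonl")` (both paths are placeholders) would write one line per readable text file. Keeping the output file outside the input folder, or giving it an extension that is not in the supported list, avoids the output being picked up as an input.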