This project extracts and processes data from various Sanskrit texts and audio recordings and compiles it into SwaraSangraha, a Sanskrit chanting-style speech dataset. The source texts are:
- Amarakosha (अमरकोषः)
- Ashtadhyayi (अष्टाध्यायी)
- Meghaduta (मेघदूतम्)
- Valmiki Ramayana (वाल्मीकि रामायण)
- Tarkasangraha (तर्कसंग्रह)
- Patanjali Yoga Sutrani (पातञ्जलयोगसूत्राणि)
It includes modules for:
- Web Scraping of Sanskrit texts
- Computing Total Duration of Audio Files
- Demucs-based Speech Separation
📁 code
  📁 processing
    📄 demucs.py # Demucs-based audio separation
    📄 duration.py # Computes duration of audio files
  📁 scraping
    📄 amarakosha.py # Scrapes Amarakosha text & audio
    📄 ashtadhyayi.py # Scrapes Ashtadhyayi text & audio
    📄 meghaduta.py # Scrapes Meghaduta text & audio
    📄 ramayana.py # Scrapes Ramayana text & audio
    📄 tarkasangraha.py # Scrapes Tarkasangraha text & audio
    📄 yogasutra.py # Scrapes Yogasutra text & audio
📁 test # Directory for test files
📁 demucs # Output directory for processed audio
📁 demucs_temp # Temporary files during Demucs processing
📁 SwaraSangraha # Collection of scraped Sanskrit audio/text
📁 separated_audio # Storage for separated audio components
Ensure you have the required dependencies installed:
pip install numpy pandas librosa mutagen tqdm beautifulsoup4 requests pydub
python code/scraping/amarakosha.py
python code/scraping/ashtadhyayi.py
python code/scraping/meghaduta.py
python code/scraping/ramayana.py
python code/scraping/tarkasangraha.py
python code/scraping/yogasutra.py
python code/processing/duration.py
python code/processing/demucs.py
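For readers who want a feel for the duration step, here is a minimal sketch of summing audio durations across a directory tree. It is not the code in `code/processing/duration.py`: it uses only the standard-library `wave` module, so it handles uncompressed `.wav` files only, and the function names are hypothetical. Other formats such as MP3 would need one of the libraries from the install command above (for example `mutagen` or `librosa`).

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Return the duration of one WAV file in seconds (frames / sample rate)."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def total_duration_seconds(root: Path) -> float:
    """Sum the durations of every .wav file found under root, recursively."""
    return sum(wav_duration_seconds(p) for p in sorted(root.rglob("*.wav")))


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Assumption: the audio lives under a directory named "SwaraSangraha";
    # adjust the path to wherever the scraped files are stored.
    print(format_hms(total_duration_seconds(Path("SwaraSangraha"))))
```

Reading the frame count and sample rate from the header avoids decoding the audio, which keeps the pass over a large collection fast.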
- Ensure you have access to the internet while running the scraping scripts.
- Errors and warnings are logged to error_log.txt.
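As an illustration of the logging note, the sketch below shows one way a script could send warnings and errors to `error_log.txt` with Python's standard `logging` module. How the project's scripts actually write this file is not documented here, so the helper name `build_error_logger`, the logger name, and the message format are all assumptions.

```python
import logging


def build_error_logger(log_path: str = "error_log.txt") -> logging.Logger:
    """Return a logger that appends WARNING and above to log_path."""
    logger = logging.getLogger("swarasangraha")  # hypothetical logger name
    logger.setLevel(logging.WARNING)  # drop DEBUG and INFO messages
    if not logger.handlers:  # avoid attaching a second handler on repeat calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A scraping script could then call `build_error_logger().warning("audio missing for %s", url)` and keep running, leaving a record of every skipped item for later review.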