This repository was archived by the owner on Feb 28, 2024. It is now read-only.

Google Summer of Code 2020

Aish Raj Dahal edited this page Jan 23, 2020 · 4 revisions

Below is a short list of project ideas for Google Summer of Code 2020.

Integrate Tesseract into the indexing pipeline

Expand the content on Sangraha by using OCR to extract text from PDF documents. The latest version of Tesseract performs OCR using deep learning. Trained data is already available for Devanagari; we have tried it previously and did not get optimal results. The trained data can be enhanced to support the Nepali language, or we can create new trained data for Nepali. Mentor: Pragya Tripathi

Skills required: Java, Machine learning
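As a rough starting point, the OCR step could be prototyped by calling the Tesseract command-line tool from Java. The sketch below is only an illustration under several assumptions: the `tesseract` binary is on the PATH, a PDF page has already been rendered to an image (Tesseract reads images, not PDFs), and the language code passed in (for example `nep`) has trained data installed. The class and method names are hypothetical and not part of Sangraha.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Sangraha: wraps the tesseract CLI.
final class TesseractCli {

    private TesseractCli() {}

    // Builds "tesseract <image> stdout -l <language>".
    // Using "stdout" as the output base makes tesseract print the text.
    static List<String> buildCommand(Path pageImage, String language) {
        return List.of("tesseract", pageImage.toString(), "stdout", "-l", language);
    }

    // Runs tesseract on one page image and returns the recognized text.
    // Assumes the binary and the trained data for `language` are installed.
    static String recognize(Path pageImage, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pageImage, language))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop progress messages
                .start();
        String text = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }
}
```

A call such as `TesseractCli.recognize(Path.of("page-1.png"), "nep")` would then return the text of a single page, which the indexing pipeline could store alongside the document.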

Implement Nepali Stemmer and Analyzer for Elasticsearch

Lucene and Elasticsearch do not provide a Nepali stemmer. Currently, we use the Hindi stemmer as a workaround. To improve search quality, we plan to implement a Nepali stemmer for Elasticsearch. This project is a good opportunity to give back to the open source projects that Sangraha depends on. Mentor: Anup Dhamala

Skills required: Java, NLP
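To give a feel for what a stemmer does, here is a minimal sketch of a light suffix-stripping stemmer in plain Java. It is not the Lucene or Elasticsearch API, and the suffix list is a small illustrative assumption (the plural marker हरू/हरु and a few common postpositions), not a vetted linguistic resource. The class name and the minimum stem length are likewise hypothetical.

```java
import java.util.List;

// Illustrative only: a naive light stemmer that strips a few common
// Nepali suffixes. A real implementation would need a much richer rule set.
final class NepaliLightStemmer {

    // Assumed suffix list, longer suffixes first so they win over shorter ones.
    private static final List<String> SUFFIXES = List.of(
            "हरू", "हरु", "लाई", "बाट", "को", "का", "की", "ले", "मा");

    // Never strip a suffix if fewer than this many characters would remain.
    private static final int MIN_STEM_LENGTH = 2;

    private NepaliLightStemmer() {}

    // Repeatedly removes a matching suffix until none applies.
    static String stem(String token) {
        String current = token;
        boolean stripped = true;
        while (stripped) {
            stripped = false;
            for (String suffix : SUFFIXES) {
                int remaining = current.length() - suffix.length();
                if (remaining >= MIN_STEM_LENGTH && current.endsWith(suffix)) {
                    current = current.substring(0, remaining);
                    stripped = true;
                    break; // restart with the shortened token
                }
            }
        }
        return current;
    }
}
```

With these rules, `NepaliLightStemmer.stem("किताबहरूको")` first removes को and then हरू, leaving किताब, while a short word such as आमा is left unchanged by the length guard. A rule set this small will over-stem some words, which is part of what the project would need to evaluate.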

Integrate Wiki.js

We want to integrate Wiki.js into Sangraha to allow crowdsourcing of content. Wiki.js will allow users to add new content and admins to moderate it. It will also provide user management features. Mentor: Prasanna Suman

Skills required: JavaScript