Google Summer of Code 2021

MINERVA

Google Summer of Code 2021 🚩 Report: "MINERVA"

CONTRIBUTIONS

MINERVA DATASET GENERATION

Goal: generate an OSS license dataset for Atarashi.

Codebase: GitHub
Documentation: Wiki

Why is there a need to generate a dataset?

Implementing any machine learning or deep learning algorithm requires a better and bigger dataset of SPDX licenses. Unfortunately, no such dataset for open source licenses exists on the web, and because of this lack of data, all 10 algorithms tested on Atarashi so far are restricted to 59% accuracy.
Advanced architectures and algorithms such as LSTMs, GRU, BERT, WordNet, etc. require large volumes of data before they can outperform even traditional algorithms such as TF-IDF, n-gram, etc.
Licenses also differ from traditional corpora: 50-60% of the keywords are similar in any two licenses, and licenses that share a heading but differ in version are around 90% similar.

There was a loose implementation of n-gramming different permutations and combinations of license text paragraphs. Ref: SPDX-OSS Dataset. This method needed further improvement to make the dataset more accurate and realistic.

The main idea was to first split the licenses into different permutations and combinations of license text paragraphs using a sliding window approach, and to use these files as a baseline for further manipulating and generating texts. The regexes from the FOSSology Nomos agent's STRINGS.in file were then added to the files.

1. DOWNLOADING RECENT RELEASED LICENSE FROM SPDX RELEASED JSON TO TXT FILES

The SPDX License List is a list of commonly found licenses and exceptions used in free and open or collaborative software, data, hardware, or documentation. Releases are made on a quarterly basis (more or less), at the end of January, April, July, and October. SPDX licenses were previously downloaded manually in txt format from license-list-data/text/. The licenses are also available in JSON format from license-list-data/json/.

I worked on a script that downloads the licenses from SPDX, SPDX-exceptions, and the FOSSology database directly to txt format, given the JSON filepath of a new release. Script: download

For SPDX licenseListVersion: 3.13, licenses downloaded are : files
Original FOSSology db licenses (SPDX licenses are subset of licenses present here) : files

SPDX recent release : SPDX

 python ./Download-licenses-Script/spdx.py

SPDX-exceptions recent release : SPDX-exceptions

 python ./Download-licenses-Script/exceptions.py

Licenses in Fossology Database : licenseRef

 python ./Download-licenses-Script/database-foss.py
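As a rough illustration of the JSON-to-txt step, here is a minimal sketch and not the project's actual script. It assumes a local folder of SPDX per-license "details" JSON files in which each file carries `licenseId` and `licenseText` keys; the function name `details_json_to_txt` and the folder layout are hypothetical.

```python
import json
from pathlib import Path


def details_json_to_txt(details_dir, out_dir):
    """Write one <licenseId>.txt per license details JSON file in details_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for json_path in sorted(Path(details_dir).glob("*.json")):
        record = json.loads(json_path.read_text(encoding="utf-8"))
        # Assumption: each details file has "licenseId" and "licenseText" keys.
        license_id = record.get("licenseId")
        text = record.get("licenseText")
        if not license_id or text is None:
            continue  # skip files that do not look like license details
        target = out / f"{license_id}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target.name)
    return written
```

The function returns the names of the files it wrote, which makes it easy to compare the count against the number of licenses in a release.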

2. GENERATED FILES THROUGH INITIAL SPLIT

The basic idea was to n-gram the license text paragraphs while maintaining a sliding window. For a license with 4 paragraphs, the files I wanted to generate were: para1, para2, para3, para4, para1+para2, para2+para3, para3+para4, para1+para2+para3, para2+para3+para4, para1+para2+para3+para4. Combinations such as para1+para3 or para1+para3+para4 are excluded because the structure of the license needs to be maintained.

 python ./Script-Initial-Split/initial_split.py

Script : initial_split
Files : SPDX
Files : FOSSologyDatabase
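The sliding window over paragraphs can be sketched as follows. This is an illustration of the idea, not the contents of initial_split.py; the helper names and the choice to split paragraphs on blank lines are assumptions.

```python
import re


def split_paragraphs(text):
    """Split license text into paragraphs on blank lines (assumed delimiter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def sliding_windows(paragraphs):
    """Return every run of consecutive paragraphs, shortest runs first."""
    n = len(paragraphs)
    windows = []
    for size in range(1, n + 1):
        for start in range(n - size + 1):
            # Only contiguous slices are taken, so para1+para3 never appears.
            windows.append("\n\n".join(paragraphs[start:start + size]))
    return windows
```

For a license with n paragraphs this yields n(n+1)/2 texts, which gives the 10 files listed above when n is 4.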

3. GENERATED FILES BY ADDING REGEX TO FILES SPLITS

To check licenses and generate new data that satisfies every regex condition for a license file, I used string generators from free and open-source libraries such as xeger and intxeger. Among the generated strings, the one that comes lexicographically closest to the existing datasets in our database is used, subject to a threshold value, so that randomness in the generated license strings is kept to a minimum. The comparison is lexicographic so that relevance to the existing datasets is maintained.

I extracted the regexes from the STRINGS.in file; the scripts and the extracted regex CSVs can be found in STRINGSin-Regex-Extraction.
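The selection step can be pictured with the sketch below. It is only an assumption about how "closest to the existing data, above a threshold" might be implemented: the similarity measure (difflib's `SequenceMatcher` ratio), the default threshold of 0.6, and the function name `pick_closest` are all hypothetical, and the candidate strings would come from a generator such as xeger rather than being hard-coded.

```python
from difflib import SequenceMatcher


def pick_closest(candidates, reference, threshold=0.6):
    """Return the candidate most similar to reference, or None if below threshold."""
    best, best_score = None, 0.0
    for candidate in candidates:
        # Assumption: character-level similarity stands in for "closeness".
        score = SequenceMatcher(None, candidate, reference).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

Returning None when nothing clears the threshold lets the caller discard overly random strings instead of adding them to the dataset.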

4. HANDLING REGEX EXPANSION

For the regexes extracted from the STRINGS.in file, the major task was handling expansions such as .{1,32} and .{1,64}. Three approaches were considered: generating ambiguous characters, replacing the expansion with an empty string, or generating a sequence of words from the license itself so that the text holds proper meaning. Ambiguous characters were rejected outright after discussion with mentors. I validated the files generated by the second and third approaches using Nomos and observed that the third approach gives drastically better results than the second. To generate the sequence of words, I worked on two algorithms and integrated them with the existing codebase.
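To make the third approach concrete, the sketch below replaces each bounded wildcard with words taken from the license text, staying within the wildcard's character limit. It is a simplified illustration under stated assumptions: the filler is just the leading words of the license, and the names `WILDCARD`, `words_within`, and `expand_wildcards` are hypothetical rather than taken from the project scripts.

```python
import re

# Matches bounded wildcards such as ".{1,32}" and captures the upper bound.
WILDCARD = re.compile(r"\.\{1,(\d+)\}")


def words_within(words, max_chars):
    """Join leading words without exceeding max_chars characters."""
    picked, length = [], 0
    for word in words:
        extra = len(word) + (1 if picked else 0)  # account for the joining space
        if length + extra > max_chars:
            break
        picked.append(word)
        length += extra
    if not picked and words:
        # The wildcard needs at least one character, so truncate the first word.
        return words[0][:max_chars]
    return " ".join(picked)


def expand_wildcards(pattern, license_text):
    """Replace every .{1,N} in pattern with words from license_text."""
    words = license_text.split()
    return WILDCARD.sub(lambda m: words_within(words, int(m.group(1))), pattern)
```

Because the filler never exceeds the upper bound, the expanded string still matches the regex it was derived from, which is what the Nomos validation relies on.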

A. NGRAM
(a set of co-occurring words within a given window)
B. MARKOV
(As an extension of Naive Bayes to sequential data, the Hidden Markov Model provides a joint distribution over the letters/tags, assuming dependencies between the variables x and y of adjacent tags.)
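Both ideas rest on word co-occurrence, which the sketch below shows with a first-order, word-level Markov chain built from bigrams. This is a deliberately simpler stand-in: it is not a Hidden Markov Model and is not claimed to match the project's ngram or markov scripts. The function names and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words that follow it in text."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model


def generate(model, start, max_words, seed=0):
    """Walk the chain from start, choosing each next word among observed followers."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    out = [start]
    while len(out) < max_words and model.get(out[-1]):
        out.append(rng.choice(model[out[-1]]))
    return " ".join(out)
```

Every adjacent word pair in the output was seen in the source license, so the filler stays close to real license wording.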

I added multiprocessing to the script to speed up data generation.
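A minimal sketch of how such a step can be parallelised with Python's standard `multiprocessing.Pool` is shown below. The worker here only counts how many sliding-window files a license would produce; it is a hypothetical placeholder for the real generation step, and the names `count_windows` and `run_parallel` are assumptions.

```python
from multiprocessing import Pool


def count_windows(item):
    """Worker: return (name, number of sliding-window files) for one license."""
    name, text = item
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    n = len(paragraphs)
    return name, n * (n + 1) // 2


def run_parallel(items, processes=2):
    """Apply the worker to every (name, text) pair using a pool of processes."""
    with Pool(processes=processes) as pool:
        return dict(pool.map(count_windows, items))
```

The worker is defined at module level so that it can be pickled and sent to the child processes, and `run_parallel` should be called from under an `if __name__ == "__main__":` guard.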

Codebase : Ngram
To generate licenses with ngram expansion:

 python ./ngram/licenses.py

Codebase : Markov
To generate licenses with Markov expansion:

 python ./markov/markov_licenses.py

When validated with Nomos, the Ngram regex expansion performed better than the Markov expansion.

5. VALIDATION OF FILES GENERATED

We use Nomos to identify the licenses: it reports either the license headers whose regex matches the file, or labels such as Unclassified licenses, No License found, Public-domain, Restricted, and so on. This serves as a baseline validation for the resulting text files. The terminal command to run it is:

 sudo nomos -J -d <folder_with_files>

To validate files using multiple cores (3 cores here):

 sudo nomos -J -d <folder_with_files> -n 3

After validation, the files were segregated into different folders, with license headers as folder names.
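The segregation step could look roughly like the sketch below. The shape of the parsed results (a list of dictionaries with a `file` path and a `licenses` list) is an assumption about the Nomos JSON output and should be checked against the real output before use; the function name `segregate` and the fallback folder name are hypothetical.

```python
import shutil
from pathlib import Path


def segregate(results, out_root):
    """Copy each scanned file into a folder named after its first reported license."""
    out_root = Path(out_root)
    placed = {}
    for entry in results:
        # Assumed entry shape: {"file": "<path>", "licenses": ["<label>", ...]}
        src = Path(entry["file"])
        label = entry["licenses"][0] if entry["licenses"] else "No_license_found"
        # Replace path separators so the label is a single valid folder name.
        folder = out_root / label.replace("/", "_")
        folder.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, folder / src.name)
        placed[src.name] = folder.name
    return placed
```

Files are copied rather than moved so that the generated set stays intact if the validation needs to be rerun.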

This is a brief overview of the project.

The entire codebase has now been moved to FOSSology : Minerva-Dataset-Generation

6. ADDED NOISE TO DATASET - AUGLY IMPLEMENTATION

I added noise to the generated dataset using AugLy to increase both the size and the diversity of the labeled training data, which also helps to build robust ML models. AugLy offers transformations in both function and class formats, as well as intensity functions that help to understand how intense a transformation is (based on the given parameters). AugLy can also create metadata that helps in understanding how the data was altered.

Tested on a license in: Colab
After a discussion with the mentors, it was concluded that the licenses will be manipulated using SIMULATE TYPOS and REPLACE SIMILAR CHARACTERS.
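To show what a "replace similar characters" transformation does to license text, here is a small standard-library stand-in. It does not call AugLy and is not AugLy's implementation; the character mapping, the probability parameter, and the function name `swap_similar_chars` are assumptions chosen for illustration.

```python
import random

# Hypothetical mapping of characters to visually similar replacements.
SIMILAR = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$", "l": "1"}


def swap_similar_chars(text, prob=0.1, seed=0):
    """Replace mapped characters with look-alikes, each with probability prob."""
    rng = random.Random(seed)  # seeded so the noise is reproducible
    out = []
    for ch in text:
        substitute = SIMILAR.get(ch.lower())
        if substitute is not None and rng.random() < prob:
            out.append(substitute)
        else:
            out.append(ch)
    return "".join(out)
```

Keeping the probability low leaves most of the license text readable, so the noisy copy should still be recognisable as the same license.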

👨🏻‍🏫 DELIVERABLES

Scripts to download licenses (txt) from a JSON file.
Licenses generated using the sliding window approach (that is, the initial split).
Multiprocessing implemented for license generation.
Scripts handling regex expansion using both algorithms (i.e. Markov and Ngram).
Shifting the codebase of Atarashi (discussed; implementation in progress).
AugLy implementation on licenses (tested; will continue after training on the generated dataset is done).

🚀 FUTURE PLANS

Normalisation of text files generated.
Writing custom hooks for repetitive code.
Use the generated data for training ML models.
Experiment with advanced NLP Algorithms for license generation and validation techniques.

📚 Things I learned from Google Summer of Code

To write optimized code.
Explored, tested, and implemented various NLP algorithms.
Sharpened my Git skills.
Learned the importance of time management as well as perfect deliverables.
Worked on my scripting skills.
Understood the importance of constructive discussions with mentors and peers.
Improved my documentation skills.
Learned about various open-source licenses and their importance in code, projects, and software.

SinghShreya05/GSoC-2021

Folders and files

Latest commit

History

Repository files navigation

Google Summer of Code 2021

MINERVA

Google Summer of Code 2021 🚩 Report: "MINERVA"

CONTRIBUTIONS

MINERVA DATASET GENERATION

1. DOWNLOADING RECENT RELEASED LICENSE FROM SPDX RELEASED JSON TO TXT FILES

SPDX recent release : SPDX

SPDX-exceptions recent release : SPDX-exceptions

Licenses in Fossology Database : licenseRef

2. GENERATED FILES THROUGH INITIAL SPLIT

3. GENERATED FILES BY ADDING REGEX TO FILES SPLITS

4. HANDLING REGEX EXPANSION

5. VALIDATION OF FILES GENERATED

6. ADDED NOISE TO DATASET - AUGLY IMPLEMENTATION

👨🏻‍🏫 DELIVERABLES

🚀 FUTURE PLANS

📚 Things I learned from Google Summer of Code

Let's get connected!

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages