To create/generate OSS License dataset for Atarashi
Why is there a need to generate a dataset?
- Implementing any machine learning or deep learning algorithm requires a larger, higher-quality dataset of SPDX licenses, and no such dataset for open source licenses is available on the web. Because of this, all 10 algorithms tested on Atarashi so far are limited to 59% accuracy.
- Advanced architectures and algorithms such as LSTMs, GRU, BERT, WordNet, etc. require large volumes of data before they can outperform even traditional algorithms such as TF-IDF and n-gram.
- Licenses differ from traditional corpora: any two licenses share 50-60% of their keywords, and licenses with the same heading but different versions are around 90% similar.
There was a loose implementation of n-gramming different permutations and combinations of license text paragraphs. Ref: SPDX-OSS Dataset. This method needed further improvement to make the dataset more accurate and realistic.
The main idea was to first split the licenses into different permutations and combinations of license text paragraphs using a sliding window approach, and then use these files as a baseline for further manipulating and generating texts. The regexes from the FOSSology Nomos agent's STRINGS.in were then added to the files.
The SPDX License List is a list of commonly found licenses and exceptions used in free and open or collaborative software, data, hardware, or documentation. Releases are made roughly quarterly, at the end of January, April, July, and October. SPDX licenses were manually downloaded in txt format from license-list-data/text/. The licenses are also available in JSON format from license-list-data/json/.
I worked on a script that downloads the licenses from SPDX, SPDX-exceptions, and the FOSSology database directly to txt format, given the JSON file path of a new release. Script : download
For SPDX licenseListVersion: 3.13, licenses downloaded are : files
Original FOSSology db licenses (SPDX licenses are a subset of the licenses present here) : files
SPDX recent release : SPDX
python ./Download-licenses-Script/spdx.py
SPDX-exceptions recent release : SPDX-exceptions
python ./Download-licenses-Script/exceptions.py
Licenses in FOSSology Database : licenseRef
python ./Download-licenses-Script/database-foss.py
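As a rough illustration of the download step, here is a minimal sketch and not the actual script. It assumes the release JSON follows the SPDX licenses.json layout as I understand it (a `licenses` array whose entries carry `licenseId` and `detailsUrl`, with the full text under `licenseText` in the details document). The function names `fetch_json` and `save_license_texts` are hypothetical.

```python
import json
from pathlib import Path
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def save_license_texts(index, out_dir, fetch=fetch_json):
    """Write one <licenseId>.txt file per entry of an SPDX-style index.

    `index` is the parsed licenses.json (assumed layout, see above).
    `fetch` maps a details URL to a dict; it is a parameter so that it
    can be replaced by a stub when working offline.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for entry in index.get("licenses", []):
        details = fetch(entry["detailsUrl"])
        text = details.get("licenseText")
        if not text:
            continue  # skip entries without text instead of writing empty files
        target = out_path / f"{entry['licenseId']}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target.name)
    return written
```

Passing the fetch function as an argument keeps the file-writing logic testable without network access.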
The basic idea was to n-gram the license text paragraphs while maintaining a sliding window. For a license with 4 paragraphs, the files I wanted to generate were: para1, para2, para3, para4, para1+para2, para2+para3, para3+para4, para1+para2+para3, para2+para3+para4, para1+para2+para3+para4. Combinations such as para1+para3 or para1+para3+para4 are not generated, because the structure of the license needs to be maintained.
python ./Script-Initial-Split/initial_split.py
Script : initial_split
Files : SPDX
Files : FOSSologyDatabase
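The sliding window described above can be sketched as follows. This is a minimal illustration, not the code in initial_split. Splitting paragraphs on blank lines is an assumption, and the function names are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split license text into paragraphs on blank lines (assumed delimiter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def sliding_window_splits(paragraphs):
    """Return every contiguous run of paragraphs, shortest runs first.

    Only adjacent paragraphs are joined, so para1+para3 is never produced
    and the structure of the license is preserved.
    """
    count = len(paragraphs)
    splits = []
    for size in range(1, count + 1):              # window width
        for start in range(count - size + 1):     # window position
            splits.append("\n\n".join(paragraphs[start:start + size]))
    return splits
```

A license with n paragraphs yields n(n+1)/2 files, which gives the 10 files listed above for n = 4.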
To check licenses and generate new data that satisfies every regex condition for a license file, I used string generators from free and open-source libraries such as xeger and intxeger. Among the generated strings, the one lexicographically closest to the existing datasets in our database is kept, subject to a threshold value, so that randomness in the generated license strings is kept to a minimum. The comparison is lexicographic so that relevance to the existing datasets is maintained.
I have extracted the regexes from the STRINGS.in file; the scripts and the extracted regex CSVs can be found in STRINGSin-Regex-Extraction.
For the regexes extracted from the STRINGS.in file, the major task was handling expansions such as .{1,32} and .{1,64}. Three approaches were considered: generating ambiguous characters, replacing the expansion with an empty string, or generating a sequence of words from the license itself so that the result reads meaningfully. Ambiguous characters were rejected outright after discussion with the mentors. I validated the files generated by the second and third approaches using Nomos and observed that the third approach gives drastically better results than the second. For generating the sequence of words, I worked on two algorithms and integrated them with the existing codebase.
A. NGRAM
(basically a set of co-occurring words within a given window)
B. MARKOV
(As an extension of Naive Bayes for sequential data, the Hidden Markov Model provides a joint distribution over the letters/tags with an assumption of the dependencies of variables x and y between adjacent tags.)
Added multiprocessing to the script to speed up data generation.
Codebase : Ngram
To generate licenses with ngram expansion:
python ./ngram/licenses.py
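To make the n-gram expansion concrete, here is a hedged sketch of the general technique, not the code in the Ngram codebase. It fills each bounded wildcard such as .{1,32} with a word n-gram taken from the license text, and falls back to an empty string when no n-gram fits. The names `WILDCARD`, `word_ngrams`, and `expand_wildcards` are hypothetical, and the example assumes the pattern is otherwise plain text.

```python
import random
import re

# Matches bounded wildcards such as .{1,32} or .{1,64}.
WILDCARD = re.compile(r"\.\{(\d+),(\d+)\}")


def word_ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.split()
    return [words[i:i + n] for i in range(len(words) - n + 1)]


def expand_wildcards(pattern, license_text, n=3, rng=None):
    """Replace each bounded wildcard with an n-gram drawn from the license."""
    rng = rng or random.Random()
    # Pad with spaces: the wildcard also has to cover the surrounding blanks.
    candidates = [" " + " ".join(gram) + " " for gram in word_ngrams(license_text, n)]

    def fill(match):
        low, high = int(match.group(1)), int(match.group(2))
        fitting = [c for c in candidates if low <= len(c) <= high]
        # Fall back to the empty-string approach when nothing fits.
        return rng.choice(fitting) if fitting else ""

    return WILDCARD.sub(fill, pattern)
```

For example, `expand_wildcards(r"Licensed under.{1,32}MIT", license_text, n=2)` returns the pattern with the wildcard replaced by two adjacent words from `license_text`, so the result still matches the original regex.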
Codebase : Markov
To generate licenses with Markov expansion:
python ./markov/markov_licenses.py
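For comparison, a word-level Markov expansion could look roughly like the sketch below. It is a plain first-order Markov chain over words, which is simpler than the Hidden Markov Model described above and is not the code in the Markov codebase. The names `build_chain` and `generate_words` are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate_words(chain, start, max_chars, rng=None):
    """Walk the chain from `start`, stopping before max_chars is exceeded.

    The start word is always kept, even if it alone is longer than max_chars.
    """
    rng = rng or random.Random()
    output = [start]
    current = start
    while current in chain:
        candidate = rng.choice(chain[current])
        if len(" ".join(output + [candidate])) > max_chars:
            break
        output.append(candidate)
        current = candidate
    return " ".join(output)
```

Setting `max_chars` to the upper bound of a wildcard such as .{1,32} keeps the generated phrase within the range the regex allows.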
When validated with Nomos, n-gram regex expansion performed better than Markov expansion.
We use Nomos to identify the licenses: it reports either the license headers whose regexes match, or labels such as Unclassified licenses, No License found, Public-domain, Restricted, and so on. This is a baseline validation for the resulting text files. The terminal command to run it is:
sudo nomos -J -d <folder_with_files>
And to use multiple cores to validate files (here I am using 3 cores) :
sudo nomos -J -d <folder_with_files> -n 3
After the validation, the files were segregated into different folders, with license headers as folder names.
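The segregation step can be sketched as below. Parsing the Nomos JSON output is deliberately left out, since its exact schema is not shown here. The sketch assumes the results have already been reduced to a mapping from file name to license label, and the function name `segregate_by_license` is hypothetical.

```python
import shutil
from pathlib import Path


def segregate_by_license(labels, source_dir, target_dir):
    """Copy each file into target_dir/<label>/, creating folders as needed.

    `labels` maps a file name to the license label reported for it
    (assumed to be extracted from the validation output beforehand).
    """
    source, target = Path(source_dir), Path(target_dir)
    copied = []
    for file_name, label in labels.items():
        # Replace path separators so a label maps to a single folder level.
        folder = target / label.replace("/", "_").replace("\\", "_")
        folder.mkdir(parents=True, exist_ok=True)
        destination = folder / file_name
        shutil.copy2(source / file_name, destination)
        copied.append(destination)
    return copied
```

Copying rather than moving keeps the generated files intact in case validation needs to be rerun.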
This is a brief overview of the project.
The entire codebase has now been moved to FOSSology : Minerva-Dataset-Generation
I have added noise to the generated dataset using AugLy to increase both the size and the diversity of the labeled training data, which also helps build robust ML models. AugLy offers transformations in both function and class formats, as well as intensity functions that help us understand how intense a transformation is (based on the given parameters). AugLy can also create metadata that helps in understanding how the data was altered.
Tested on a license in : Colab
After a discussion with the mentors, it was concluded that the licenses will be manipulated using SIMULATE TYPOS and REPLACE SIMILAR CHARACTERS.
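To give a feel for what a similar-character replacement does to a license, here is a small standalone sketch. It does not call AugLy and does not reproduce AugLy's character mapping. The substitution table, the `probability` parameter, and the function name `swap_lookalike_chars` are all assumptions made for illustration.

```python
import random

# Small illustrative table of look-alike substitutions (not AugLy's mapping).
SIMILAR = {"a": "@", "e": "3", "i": "1", "l": "1", "o": "0", "s": "5"}


def swap_lookalike_chars(text, probability=0.1, seed=None):
    """Replace some characters with visually similar ones.

    Each character that has an entry in SIMILAR is swapped with the given
    probability; a fixed seed makes the noise reproducible.
    """
    rng = random.Random(seed)
    noisy = []
    for char in text:
        replacement = SIMILAR.get(char.lower())
        if replacement is not None and rng.random() < probability:
            noisy.append(replacement)
        else:
            noisy.append(char)
    return "".join(noisy)
```

Keeping the probability low leaves most of the license text intact, so the label attached to the original file still applies to the noisy copy.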
- Scripts to Download licenses (txt) from JSON file.
- Licenses generated using the sliding window approach (that is, the initial split)
- Implemented Multiprocessing to License Generation.
- Scripts handling regex expansion using both algorithms (i.e. Markov and Ngram)
- Shifting the Codebase of Atarashi (Discussed and implementation in progress)
- AugLy implementation on licenses (tested; to be continued after training on the generated dataset is done)
- Normalisation of text files generated.
- Writing custom hooks for repetitive code.
- Use the generated data for training ML models.
- Experiment with advanced NLP Algorithms for license generation and validation techniques.
- Writing optimized code.
- Explored, tested, and implemented various NLP algorithms.
- Sharpened my Git skills.
- Learned the importance of time management as well as perfect deliverables.
- Worked on my scripting skills.
- Understood the importance of constructive discussions with mentors and peers.
- Improved my documentation skills.
- Learned about various open-source licenses and their importance in code, projects, and software.