GitHub - rjzak/malware-modeler-rs: Generate training data and models for benign vs. malicious files plus related tools.

malware modeler

A machine learning application and library for training logistic regression models for benign vs. malicious prediction plus related tools.

This code is alpha quality and is not fully tested. Don't use in a production setting.

There are four basic steps:

Feature extraction: find the top k n-grams, where k is roughly 100,000 to 1,000,000 and n should be 8.
Dataset file creation: featurize the samples in your malware and goodware collection and write them to a dataset file.
Model training: train a model on the training data.
Evaluation: evaluate the model against held-out testing or validation data.
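
To make the first two steps concrete, here is a minimal sketch using only the Rust standard library. The function names `top_k_ngrams` and `featurize` are hypothetical and are not this crate's API. Ranking n-grams by the number of files they occur in, and using binary presence features, are assumptions made for illustration. Exact counting in a hash map like this is only practical for small collections.

```rust
use std::collections::{HashMap, HashSet};

/// Count how many files each byte n-gram occurs in and keep the `k` most common.
/// Hypothetical helper for illustration; `n` must be at least 1.
pub fn top_k_ngrams(files: &[Vec<u8>], n: usize, k: usize) -> Vec<Vec<u8>> {
    let mut doc_freq: HashMap<&[u8], usize> = HashMap::new();
    for file in files {
        // `windows(n)` yields every overlapping n-byte slice of the file.
        // Collecting into a set counts each n-gram at most once per file.
        let unique: HashSet<&[u8]> = file.windows(n).collect();
        for gram in unique {
            *doc_freq.entry(gram).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(&[u8], usize)> = doc_freq.into_iter().collect();
    // Most frequent first; ties are broken by byte order so the result is deterministic.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    ranked.into_iter().take(k).map(|(g, _)| g.to_vec()).collect()
}

/// Turn one sample into a feature vector with 1.0 where a vocabulary n-gram is present.
pub fn featurize(sample: &[u8], vocab: &[Vec<u8>], n: usize) -> Vec<f32> {
    let present: HashSet<&[u8]> = sample.windows(n).collect();
    vocab
        .iter()
        .map(|g| if present.contains(g.as_slice()) { 1.0 } else { 0.0 })
        .collect()
}
```

With these helpers, `top_k_ngrams(&files, 8, 100_000)` would build the vocabulary once, and `featurize(&sample, &vocab, 8)` would produce one dataset row per sample.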

Additionally:

The similarity feature can be used to ensure the samples used for training have decent variation.
N-gram features, dataset files, and models are tied to a file type.
The model can reduce the k features to a smaller number during training, performing further feature selection that may yield a better model.
Each model should cover only one file type: one model for EXEs, one for PDFs, one for ELFs, etc.
The training data should come from a large, balanced collection: equal numbers of benign and malicious samples, and at least hundreds of thousands of samples.
These are simple models, which are only as good as the training data. Bad, mislabeled, or too-similar data yields a worthless model.
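
As an illustration of how a logistic regression model can discard features while it trains, the sketch below applies an L1 penalty through soft-thresholding after each gradient step. This is an assumption about one common way to get that behavior, not a description of this crate's trainer. The names `Model`, `train`, and `accuracy`, and the parameters `lr` and `l1`, are hypothetical.

```rust
/// Logistic function mapping a real-valued score to a probability.
fn sigmoid(z: f32) -> f32 {
    1.0 / (1.0 + (-z).exp())
}

/// Hypothetical model type for illustration only.
pub struct Model {
    pub weights: Vec<f32>,
    pub bias: f32,
}

impl Model {
    /// Estimated probability that the feature vector `x` is malicious.
    pub fn predict(&self, x: &[f32]) -> f32 {
        let z: f32 = self.weights.iter().zip(x).map(|(w, xi)| w * xi).sum();
        sigmoid(z + self.bias)
    }
}

/// Train with per-sample gradient steps on the log-loss.
/// `l1` is an assumed penalty strength: after each step every weight is
/// shrunk toward zero and clamped there, so weak features drop out.
pub fn train(xs: &[Vec<f32>], ys: &[f32], epochs: usize, lr: f32, l1: f32) -> Model {
    let dim = xs.first().map_or(0, |x| x.len());
    let mut m = Model { weights: vec![0.0; dim], bias: 0.0 };
    for _ in 0..epochs {
        for (x, &y) in xs.iter().zip(ys) {
            // Derivative of the log-loss with respect to the pre-sigmoid score.
            let err = m.predict(x) - y;
            for (w, xi) in m.weights.iter_mut().zip(x) {
                let stepped = *w - lr * err * xi;
                // Soft-thresholding step for the L1 penalty.
                *w = stepped.signum() * (stepped.abs() - lr * l1).max(0.0);
            }
            m.bias -= lr * err;
        }
    }
    m
}

/// Fraction of samples whose predicted class (threshold 0.5) matches the label.
pub fn accuracy(m: &Model, xs: &[Vec<f32>], ys: &[f32]) -> f32 {
    let mut correct = 0usize;
    for (x, &y) in xs.iter().zip(ys) {
        if (m.predict(x) > 0.5) == (y > 0.5) {
            correct += 1;
        }
    }
    correct as f32 / xs.len().max(1) as f32
}
```

Labels are 1.0 for malicious and 0.0 for benign. Calling `accuracy` on held-out samples rather than on the training set corresponds to the evaluation step, and counting the non-zero entries of `weights` shows how many features the model kept.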

Based on the following research:

Edward Raff, William Fleming, Richard Zak, Hyrum Anderson, Bill Finlayson, Charles K Nicholas and Mark McLean. KiloGrams: Very Large N-Grams for Malware Classification. In Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS'19). 2019. Article.
William Fleshman, Edward Raff, Richard Zak, Mark McLean and Charles Nicholas. Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus. In 2018 13th International Conference on Malicious and Unwanted Software (MALWARE). October 2018, 1–10. Best Paper Award. Article, arXiv, DOI.
Edward Raff and Charles Nicholas. Hash-Grams: Faster N-Gram Features for Classification and Malware Detection. In Proceedings of the ACM Symposium on Document Engineering 2018. 2018. Article, DOI.
Richard Zak, Edward Raff and Charles Nicholas. What can N-grams learn for malware detection? In 2017 12th International Conference on Malicious and Unwanted Software (MALWARE). October 2017, 109–118. Article, DOI.
Edward Raff, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean and Charles Nicholas. An investigation of byte n-gram features for malware classification. Journal of Computer Virology and Hacking Techniques, September 2016. Article, DOI.

Additional tools:

Extract files from a Zip archive based on file type, useful for working with files from VirusShare.
Get a summary of files in a Zip archive by file type.
Check files in a directory for similarity with each other to help you build a dataset with good variation.
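
One way to picture the similarity check is Jaccard similarity over the sets of byte n-grams in each pair of files, sketched below. The measure this tool actually uses is not stated here, so the choice of Jaccard, the function names `jaccard` and `near_duplicates`, and the `threshold` parameter are all assumptions for illustration.

```rust
use std::collections::HashSet;

/// Jaccard similarity of the byte n-gram sets of two files, from 0.0 to 1.0.
/// Hypothetical helper; `n` must be at least 1. Returns 0.0 when neither
/// file is long enough to contain an n-gram.
pub fn jaccard(a: &[u8], b: &[u8], n: usize) -> f64 {
    let sa: HashSet<&[u8]> = a.windows(n).collect();
    let sb: HashSet<&[u8]> = b.windows(n).collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 0.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}

/// Index pairs (i, j) with i < j whose similarity is at least `threshold`.
/// Compares every pair, so the cost grows quadratically with the file count.
pub fn near_duplicates(files: &[Vec<u8>], n: usize, threshold: f64) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..files.len() {
        for j in (i + 1)..files.len() {
            if jaccard(&files[i], &files[j], n) >= threshold {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```

Pairs reported by `near_duplicates` are candidates for pruning, so that one sample from each cluster of near-identical files remains in the dataset.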
