We are still updating this project and formatting the documentation for Artifact Evaluation.
This is the implementation of the paper "Unlocking Low Frequency Syscalls in Kernel Fuzzing with Dependency-Based RAG". For more details about SyzGPT, please refer to our ISSTA'25 paper. We also provide a README_for_review, which was originally hosted in an anonymous repository to help reviewers understand the project.
Quick Glance: SyzGPT is an LLM-assisted kernel fuzzing framework that automatically generates effective seeds for low frequency syscalls (LFS). The Linux kernel provides over 360 system calls, and Syzkaller defines more than 4400 specialized calls that wrap these system calls for specific purposes. However, many of these syscalls (the LFS) are hard to cover consistently because of complex dependencies and mutation uncertainty, which leaves parts of the testing space unexplored. SyzGPT automatically extracts and augments syscall dependencies for these LFS and generates effective seeds with dependency-based RAG (DRAG). Our evaluation shows that SyzGPT improves overall code coverage and syscall coverage, and finds LFS-induced vulnerabilities. We also release a toy model, 🤗 CodeLlama-syz-toy, specialized for Syz-programs.
Project Structure
____ ____ ____ _____
/ ___| _ _ ____ / ___|| _ \|_ _|
\___ \ | | | ||_ / | | _|| |_) | | |
___) || |_| | / /_ | |_| || __/ | |
|____/ \__, //____| \____||_| |_|
|___/
.
├── analyzer/            # Corpus Analyzer
├── crawler/             # Crawler for Linux Manpages
├── data/                # Data used in SyzGPT
├── examples/            # Examples for better understanding
├── experiments/         # Experiments
├── extractor/           # Two-Level Syscall Dependency Extractor
├── fine-tune/           # Fine-tuning LLM specialized for Syz-programs
├── fuzzer/              # SyzGPT-fuzzer
├── generator/           # SyzGPT-generator
├── scripts/             # Some useful scripts
...
├── config.py            # Configs, need to be copied as private_config.py
└── syzgpt_generator.py  # Main entry of SyzGPT-generator
- For running SyzGPT:
  - Machine with 16+ CPU cores, 64GB+ memory (for parallel fuzzing experiments)
  - Ubuntu 20.04+ (20.04 and 22.04 tested)
  - Syzkaller-runnable dependencies (set up according to the Syzkaller project or use our docker)
  - Python 3.8+ (Python 3.8, 3.9, 3.10, 3.11 tested)
  - LLM API access or self-deployed LLM
- For kernel compilation:
  - Clang 15+ (15.0.6, 16.0.6, 17.0.6 tested)
  - GCC 11+ (11.4 and 12.2 tested)
- For LLM fine-tuning or serving:
  - GPU with 48GB+ (1 x A800, 2 x RTX 3090 tested)
  - torch 2.0+
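Before setting anything up, the basic host requirements listed above can be checked with a few lines of Python. The following is a minimal sketch and not part of SyzGPT: the function names and the 10% memory slack are assumptions made for illustration, and it only covers the Python version, CPU cores, and memory.

```python
import os
import sys

# Thresholds copied from the requirements list above.
MIN_PYTHON = (3, 8)
MIN_CPU_CORES = 16
MIN_MEMORY_GB = 64
# Assumption: a machine sold as "64GB" reports slightly less, so allow 10% slack.
MEMORY_SLACK = 0.9


def total_memory_gb():
    """Return physical memory in GiB, or None if it cannot be determined."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError, AttributeError):
        return None
    return pages * page_size / 1024 ** 3


def check_requirements(python_version, cpu_cores, memory_gb):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than 3.8"
        )
    if cpu_cores is not None and cpu_cores < MIN_CPU_CORES:
        problems.append(f"{cpu_cores} CPU cores found, 16+ recommended")
    if memory_gb is not None and memory_gb < MIN_MEMORY_GB * MEMORY_SLACK:
        problems.append(f"{memory_gb:.1f} GiB memory found, 64GB+ recommended")
    return problems


if __name__ == "__main__":
    found = check_requirements(sys.version_info, os.cpu_count(), total_memory_gb())
    for problem in found:
        print("WARNING:", problem)
    if not found:
        print("Basic requirements look satisfied.")
```

Compiler, GPU, and torch versions are not covered here; they are easier to verify with the tools themselves.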
We will release our Docker image on Docker Hub soon.
docker run -itd --name SyzGPT --privileged=true DOCKER_IMAGE
# You will find SyzGPT located at /root/SyzGPT
# And SyzGPT-fuzzer is located at /root/fuzzers/SyzGPT-fuzzer
You can also set up SyzGPT from scratch on Ubuntu 20.04/22.04, or start from our base image qgrain/kernel-fuzz:v1.
docker run -itd --name SyzGPT-from-scratch --privileged=true qgrain/kernel-fuzz:v1
- Clone this project:
# We recommend cloning to /root/SyzGPT so that the paths in the following instructions match.
# If you are a normal user on a physical machine, feel free to clone it at a convenient place.
git clone https://github.com/QGrain/SyzGPT.git
- Set up SyzGPT-generator: Please refer to the Setup section in generator/README.md
- Set up SyzGPT-fuzzer: Please refer to fuzzer/README.md
SyzGPT can serve as a standalone seed generator through SyzGPT-generator (Section 2.1). It can also cooperate with SyzGPT-fuzzer for kernel fuzzing (Section 2.2).
We have open-sourced the augmented syscall dependencies at data/dependencies, so you can run SyzGPT directly without extracting syscall dependencies. You can also extract the syscall dependencies on your own (Section 2.3).
If you have any questions about using SyzGPT, you may refer to Troubleshooting or feel free to raise an issue.
We provide minimal running instructions here. For detailed usage, please refer to generator/README.md.
Suppose you have:
- A corpus generated by a local Syzkaller or another existing fuzzer. We provide corpus_24h.db at data/corpus_24h.db for reproduction.
- A file containing target syscalls. We provide sampled_variants.txt at data/sampled_variants.txt for reproduction.
(NOTE: it will interact with LLM and cost tokens)
# cd to the root of this project, e.g., /root/SyzGPT
# (1) Use official OpenAI api (it will load api_key, llm_model, ... from private_config.py)
python syzgpt_generator.py -s /root/fuzzers/SyzGPT-fuzzer -w WORKDIR -e data/corpus_24h.db -f sampled_variants.txt
# (2) Use third party api
python syzgpt_generator.py -M gpt-3.5-turbo-16k -u https://api.expansion.chat/v1/ -k API_KEY -s /root/fuzzers/SyzGPT-fuzzer -w WORKDIR -e data/corpus_24h.db -f sampled_variants.txt
# (3) Use local LLMs
python syzgpt_generator.py -M CodeLlama-syz-toy -u http://IP:PORT/v1/ -s /root/fuzzers/SyzGPT-fuzzer -w WORKDIR -e data/corpus_24h.db -f sampled_variants.txt
Explanation of Parameters:
- `-s`: path to the SyzGPT-fuzzer, must be specified.
- `-w`: output the generated results and logs to WORKDIR, must be specified.
- `-e`: path to the external corpus, only necessary in one-time seed generation.
- `-f`: path to the file containing the target syscall list, only needed in one-time seed generation.
- `-c`: you can also specify the target syscalls manually through `-c CALL1 CALL2 ...`, instead of `-f`.
- `-M`: model name, used with a third-party API or a local LLM.
- `-u`: base_url of the API address or the locally hosted LLM.
- `-k`: api_key for the third-party API service.
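When scripting repeated runs, it can help to assemble the command line programmatically from the flags explained above. The helper below is a hypothetical convenience wrapper, not part of SyzGPT; it only uses flags that appear in this README, and the check that `-f` and `-c` are not combined is an assumption based on the description of `-c`.

```python
import shlex


def build_generator_cmd(fuzzer_dir, workdir, corpus=None, targets_file=None,
                        calls=None, model=None, base_url=None, api_key=None):
    """Assemble an argv list for syzgpt_generator.py from the documented flags."""
    # Assumption: -f and -c are alternatives, so passing both is rejected here.
    if targets_file and calls:
        raise ValueError("use either -f or -c, not both")
    cmd = ["python", "syzgpt_generator.py", "-s", fuzzer_dir, "-w", workdir]
    if corpus:
        cmd += ["-e", corpus]
    if targets_file:
        cmd += ["-f", targets_file]
    if calls:
        cmd += ["-c", *calls]
    if model:
        cmd += ["-M", model]
    if base_url:
        cmd += ["-u", base_url]
    if api_key:
        cmd += ["-k", api_key]
    return cmd


if __name__ == "__main__":
    # Mirrors example (3) above: a locally hosted model, no API key.
    cmd = build_generator_cmd(
        "/root/fuzzers/SyzGPT-fuzzer", "WORKDIR",
        corpus="data/corpus_24h.db", targets_file="sampled_variants.txt",
        model="CodeLlama-syz-toy", base_url="http://IP:PORT/v1/",
    )
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run` from the project root; `shlex.join` is only used to print a copy-pastable command.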
The outputs in WORKDIR will look like:
├── external_corpus/         # external corpus specified through -e
├── generated_corpus/        # generated seeds in Syz-program format
├── generation_history.json  # generation history for feedback-guided seed generation
├── query_prompts/           # generation logs including query prompts and results, can be used for fine-tuning.
├── reverse_index.json       # reverse index for DRAG
└── target_syscalls/         # generation targets
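To give an intuition for what a reverse index over a corpus can look like, the sketch below maps each call name to the programs that contain it, so that existing programs using a given syscall can be looked up quickly. This is an illustration only: the actual schema of reverse_index.json is not described here, and the function names and the regular expression are assumptions.

```python
import re
from collections import defaultdict

# Matches an optional "rN = " result assignment followed by the call name,
# e.g. "r0 = socket$inet(" or "close(".
CALL_RE = re.compile(r"^\s*(?:\w+\s*=\s*)?([A-Za-z_][\w$]*)\(")


def calls_in_program(text):
    """Return the call names of a Syz-program in order of appearance."""
    calls = []
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = CALL_RE.match(line)
        if match:
            calls.append(match.group(1))
    return calls


def build_reverse_index(programs):
    """Map each call name to the sorted names of programs that contain it.

    `programs` is a dict of program name -> program text.
    """
    index = defaultdict(set)
    for name, text in programs.items():
        for call in calls_in_program(text):
            index[call].add(name)
    return {call: sorted(names) for call, names in index.items()}


if __name__ == "__main__":
    corpus = {
        "prog_a": "r0 = socket$inet(0x2, 0x1, 0x0)\nclose(r0)\n",
        "prog_b": "# a comment\nr0 = socket$inet(0x2, 0x2, 0x0)\n",
    }
    for call, progs in sorted(build_reverse_index(corpus).items()):
        print(call, progs)
```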
We provide minimal running instructions here. For detailed usage, please refer to generator/README.md and fuzzer/README.md.
- Run SyzGPT-fuzzer:
# cd to the location where you set up SyzGPT-fuzzer
taskset -c 8-15 ./bin/syz-manager -config /root/SyzGPT/fuzzer/cfgdir/SyzGPT.cfg -bench benchdir/SyzGPT.log -statcall -backup 24h -enrich WORKDIR/generated_corpus -period 1h -repair
Explanation of Parameters (refer to fuzzer/README.md for more details):
- `-statcall`: enable syscall tracking during fuzzing.
- `-backup`: back up rawcover, corpus.db, CoveredCalls, and crashes every 24h.
- `-enrich`: load the enriched seeds from WORKDIR/generated_corpus every INTERVAL (1h).
- `-period`: the INTERVAL at which enriched seeds are loaded.
- `-repair`: enable program repair, which is implemented in SyzGPT-fuzzer.
- Run SyzGPT-generator (NOTE: it will interact with LLM and cost tokens):
# cd to the root of this project
python syzgpt_generator.py -s /root/fuzzers/SyzGPT-fuzzer -w /root/fuzzers/SyzGPT-fuzzer/workdir/v6-1/SyzGPT/generated_corpus -D 1h -T 1h -S 24h -m 100 -P 10
# similarly, you can also use another API service
# or a locally hosted LLM with -M, -u, -k (introduced in Section 2.1)
Explanation of Parameters (refer to generator/README.md for more details):
- `-s` and `-w` have been introduced above.
- `-D`: an empirical delay (1h) before the generator starts working, which leaves the fuzzer to explore on its own first.
- `-T`: generate seeds every INTERVAL (1h); this needs to keep pace with `-enrich` in the fuzzer.
- `-S`: stop generating after 24h.
- `-m`: max generation amount, 100 here.
- `-P`: probability of feedback-guided re-generation for failed seeds, 10% here.
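Because the generator's `-T` interval has to keep pace with the interval at which the fuzzer loads enriched seeds, a small sanity check before launching both can save a wasted run. The sketch below is a hypothetical helper, not part of SyzGPT. This README only shows values such as 1h and 24h; support for `s` and `m` suffixes in the parser is an assumption.

```python
import re

# Seconds per unit for the duration suffixes handled by this sketch.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert strings such as '1h', '30m', or '1h30m' to seconds."""
    parts = re.findall(r"(\d+)([smh])", text)
    # Reject input containing anything other than number+unit pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(num) * _UNITS[unit] for num, unit in parts)


def same_pace(fuzzer_period, generator_interval):
    """Return True if the fuzzer -period and generator -T describe the same interval."""
    return parse_duration(fuzzer_period) == parse_duration(generator_interval)


if __name__ == "__main__":
    # Values taken from the two commands above: -period 1h and -T 1h.
    if not same_pace("1h", "1h"):
        raise SystemExit("fuzzer -period and generator -T are out of step")
    print("intervals match")
```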
Extract specialized call level dependency [⏰ ~3min]:
We extract the specialized call level (syz-level) dependency through resource-based static analysis on Syzlangs.
- Extract defined syscalls of the fuzzer (different fuzzers would have different builtin syscalls, e.g., KernelGPT):
# it will generate debug.log at ~/SyzGPT/data/debug.log and generate builtin_syscalls* at -o
cd extractor
python parse_builtin_syscalls.py -s ~/fuzzers/SyzGPT-fuzzer -o ../data/
- Extract syz-level dependencies:
# it will generate syz-level dependencies at -o
python extract_syz_dependencies.py -b ../data/builtin_syscalls.json -o ../data/syz_dependencies
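The idea behind resource-based analysis can be illustrated on a few simplified Syzlang declarations: a call that returns a resource is a producer, a call that takes that resource as an argument is a consumer, and each consumer depends on the producers of its resources. The sketch below is a toy under stated assumptions and not the logic of extract_syz_dependencies.py. It parses raw declaration text rather than builtin_syscalls.json, ignores resource inheritance, and all function names are hypothetical.

```python
import re

# "name(args) [return_type]" and "resource name[base]" in simplified Syzlang.
DECL_RE = re.compile(r"^([A-Za-z_][\w$]*)\((.*)\)\s*(\S+)?$")
RESOURCE_RE = re.compile(r"^resource\s+(\w+)\[")


def split_args(arg_text):
    """Split an argument list on commas that are not nested inside [...]."""
    args, depth, current = [], 0, ""
    for ch in arg_text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            args.append(current.strip())
            current = ""
        else:
            current += ch
    if current.strip():
        args.append(current.strip())
    return args


def extract_dependencies(syzlang_text):
    """Return {consumer call: sorted producer calls} based on shared resources."""
    lines = [ln.strip() for ln in syzlang_text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    resources = {m.group(1) for m in map(RESOURCE_RE.match, lines) if m}
    producers, consumers = {}, {}
    for line in lines:
        match = DECL_RE.match(line)
        if not match:
            continue
        call, arg_text, ret = match.groups()
        if ret in resources:
            producers.setdefault(ret, []).append(call)
        for arg in split_args(arg_text):
            parts = arg.split(None, 1)  # "name type"
            if len(parts) == 2 and parts[1] in resources:
                consumers.setdefault(parts[1], []).append(call)
    deps = {}
    for res, users in consumers.items():
        for call in users:
            # A call is not listed as a dependency of itself.
            others = set(producers.get(res, [])) - {call}
            deps.setdefault(call, set()).update(others)
    return {call: sorted(found) for call, found in deps.items()}


if __name__ == "__main__":
    sample = """
    resource fd[int32]: -1
    open(file ptr[in, filename], flags flags[open_flags], mode flags[open_mode]) fd
    read(fd fd, buf buffer[out], count len[buf])
    close(fd fd)
    """
    print(extract_dependencies(sample))
```

In this toy input, `read` and `close` both consume `fd`, which only `open` produces, so a seed targeting either of them would need an `open` call first.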
Extract system call level dependency [⏰ ~1h]:
- Crawl the manpage documentation for each syscall:
# it will download the docs at crawler/man_docs/SYSCALL.json
cd crawler
python get_syscall_doc.py
- Extract call-level dependencies (NOTE: it will interact with LLM and cost tokens):
# -d for dumb mode, recommended
cd extractor
python extract_call_dependencies.py -f ../crawler/syscall_from_manpage.txt -d
There are other useful tools under scripts/.
1. diff_config.py: diff two kernel configurations with rich printing.
- Usage:
python diff_config.py <config1_path> <config2_path>
2. result_parser.py: visualize the fuzzing crashes with de-duplication.
- Usage:
python result_parser.py -D WORKDIR1 WORKDIR2 ... -c -u -e SYZFATAL SYZFAIL
3. build_llvm-project.sh: automatically build llvm-project with specified version.
- Usage:
./build_llvm-project.sh <VERSION> (e.g., 15.0.6)
4. collect_repro.py: collect reproducers from Syzbot (as Syzbot limits requests to 1 per second, this script needs to be rewritten)
- Usage:
python collect_repro.py
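For readers curious about what diffing two kernel configurations involves (item 1 above), the sketch below parses `.config` text and reports options whose values differ. It is an independent illustration, not the code of diff_config.py; it does no rich printing and the function names are assumptions.

```python
import re

# "CONFIG_FOO=value" and "# CONFIG_FOO is not set" are the two forms in a .config file.
SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Parse kernel .config text into {option: value}; unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old, new):
    """Return {option: (old_value, new_value)} for options that differ.

    A value of None means the option does not appear in that config at all.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


if __name__ == "__main__":
    a = parse_config("CONFIG_KASAN=y\n# CONFIG_KCOV is not set\nCONFIG_HZ=250\n")
    b = parse_config("CONFIG_KASAN=y\nCONFIG_KCOV=y\nCONFIG_HZ=1000\n")
    for option, (before, after) in diff_configs(a, b).items():
        print(f"{option}: {before} -> {after}")
```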
We have released a toy version of CodeLlama-syz on Hugging Face. For more details, please refer to fine-tune/README.md.
Our approach can be migrated to other kernel fuzzing frameworks, such as Syzkaller-like (ACTOR, ECG, KernelGPT, ...) and Healer-like (Healer and MOCK) fuzzers.
We have demonstrated the migration on KernelGPT; please refer to the implementation instructions in experiments/KernelGPT/README.md.
Migrating to Healer-like fuzzers is as simple as migrating to Syzkaller-like ones, as long as you are familiar with Rust.
We have also prepared instructions for migrating to MOCK; please refer to the implementation instructions in experiments/MOCK/README.md.
Thanks to Zhiyu Zhang (@QGrain) and Longxing Li (@x0v0l) for their valuable contributions to this project.
If you would like to cite SyzGPT, you may use the following BibTeX entry:
# TBD