Welcome to the gfuzztools Developer Guide!
This is a concise reference to help you navigate and utilize the gfuzztools C library for grammars. This guide covers fundamental concepts such as the 8-bit representation, grammar structures, and conversion from JSON. It also contains brief documentation on how to use key functions for fuzzing and sampling strings.
- Store the grammar you want to work with in a JSON file.
- Use our converter to convert the JSON representation of the grammar into our C representation.
- Copy and paste the output of the conversion (found in `./data`) into your C file as a `Grammar` object.
- Run your desired function (see below for documentation).
The fuzzer generates random strings from a given grammar. It requires two inputs: your `Grammar`, and the `Token` you wish to start fuzzing from.
- Initialise your `TokenArray` to store the fuzzed strings:

```c
TokenArray fuzzed;
fuzzed.index = 0;
```

- Run the function:

```c
Token token = ...;
Grammar grammar = ...;
unify_key_inv(token, &grammar, &fuzzed);
```

- Print the result:

```c
print_token_array(&fuzzed);
```

To ensure the randomness of `rand()`, seed it with the current time by placing `srand((unsigned int)time(NULL));` at the start of your `main()` function.
This is a C implementation of *The simplest grammar fuzzer in the world*.
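The core idea can be sketched in a few lines. The toy below is illustrative only (it is not the library's `unify_key_inv()`): a hypothetical two-rule grammar is hard-coded, and expansion recurses until only terminals remain.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint8_t Token;

/* A token is a non-terminal iff its MSB is set (as in gfuzztools). */
#define IS_NT(t) (((t) & 0x80) != 0)

/* Toy grammar (hypothetical): 0x80 ::= 0x81 0x81 ; 0x81 ::= "0" | "1".
 * Expand token t, appending terminals to out[]; returns the new length. */
static size_t expand(Token t, Token *out, size_t len)
{
    if (!IS_NT(t)) {               /* terminal: emit it */
        out[len++] = t;
        return len;
    }
    if (t == 0x80) {               /* <start> ::= <digit> <digit> */
        len = expand(0x81, out, len);
        return expand(0x81, out, len);
    }
    /* <digit> ::= "0" | "1", picked at random */
    return expand((Token)(rand() % 2 ? '1' : '0'), out, len);
}
```

Usage: `Token buf[8]; size_t n = expand(0x80, buf, 0);` leaves a random two-character string of `'0'`/`'1'` tokens in `buf`.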
Try it out!
Execute the following commands once you have cloned the repository locally: `make fuzzer_example` and `./bin/fuzzer_example.o`. You should see three different 8-bit tokens printed to your terminal each time you run the program.
key_get_def() serves as the cornerstone for obtaining the complete definition of a non-terminal or terminal key in the grammar, given a specific string length.
You need to run this function in order to use all other functionality in gfuzztools.
Note: the three hash tables must be defined as global variables for the program to run:

```c
// Place these lines before your main() function
KeyHashTable key_strs;
RuleHashTable rule_strs;
GrammarHashTable grammar_hash;
```

Surround your runner code with the following initialisation and breakdown code in your `main()` function:

```c
// Setup
init_key_hash_table(&key_strs);
init_rule_hash_table(&rule_strs);
init_grammar_hash_table(&grammar_hash);

// your program code here

// Cleanup
breakdown_key_hash_table(&key_strs);
breakdown_rule_hash_table(&rule_strs);
breakdown_grammar_hash_table(&grammar_hash);
```

The runner code itself:

```c
Token token = ...;
Grammar grammar = ...;
size_t l_str = ...; // specify the desired string length
KeyNode* definition = key_get_def(token, grammar, l_str);
```

For most applications, `token` is usually the first token of the grammar, `0x80`.
Calculates the total number of possible strings of a given length that a key node can produce. It considers both the count of the key itself and, recursively, the counts from its associated rules.

```c
// First run the cornerstone function.
KeyNode* definition = key_get_def(...);
int count = get_count(definition);
```
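As a sketch of that recursion (with hypothetical toy node types, not the library's actual `KeyNode`/`RuleNode` definitions), the total is the key's own count plus the counts of all its rules:

```c
#include <stddef.h>

/* Hypothetical simplified node shapes, for illustration only. */
typedef struct ToyRule {
    size_t count;            /* strings this rule can produce */
    struct ToyRule *next;    /* next alternative for the key  */
} ToyRule;

typedef struct ToyKey {
    size_t count;            /* strings counted on the key itself */
    ToyRule *rules;          /* linked list of alternatives       */
} ToyKey;

/* Total count = the key's own count plus the counts of its rules. */
static size_t toy_get_count(const ToyKey *k)
{
    size_t total = k->count;
    for (const ToyRule *r = k->rules; r != NULL; r = r->next)
        total += r->count;
    return total;
}
```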
Generates a list of dynamic token arrays (DTAs), each representing a possible string produced by a key node.

```c
// First run the cornerstone function.
KeyNode* definition = key_get_def(...);
DynTokenArray* strings = extract_strings(definition);
```

You may then print the strings using `print_list_of_dtas(strings);`.
Try it out!
Execute the following commands once you have cloned the repository locally: `make sampling_strings` and `./bin/sampling_strings.o`. You should see a print-out of multiple nodes (representing phrases) containing several tokens (representing words).
Enables the extraction of a specific string at a given index. It navigates the grammar, considering both key nodes and rule nodes, to pinpoint the desired string.

```c
// First run the cornerstone function.
KeyNode* definition = key_get_def(...);
int index = ...;
DynTokenArray* string = get_string_at(definition, index);
```
The first major milestone in gfuzztools: the ability to sample a string uniformly at random from a grammar.
This function automatically calls the cornerstone function, so the runner code is extremely simple:

```c
Token token = ...;
Grammar grammar = ...;
size_t l_str = ...;
DynTokenArray* string = string_sample_UAR(token, &grammar, l_str);
```

You can then use `print_dta()` to print the sampled string. As with `unify_key_inv()`, ensure the randomness of `rand()` by seeding it with the current time: place `srand((unsigned int)time(NULL));` at the start of your `main()` function.
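The selection step behind uniform sampling can be illustrated independently of the library: count how many strings each alternative can produce, draw an index uniformly in `[0, total)`, and descend into the alternative whose range covers that index. `pick_alternative` below is a hypothetical helper, not part of the gfuzztools API.

```c
#include <stddef.h>

/* Map a global string index in [0, total) to (alternative, local index),
 * given the number of strings each alternative can produce. Returns the
 * chosen alternative's position, or n if the index is out of range. */
static size_t pick_alternative(const size_t counts[], size_t n,
                               size_t index, size_t *local)
{
    for (size_t i = 0; i < n; i++) {
        if (index < counts[i]) {
            *local = index;   /* index within the chosen alternative */
            return i;
        }
        index -= counts[i];   /* skip past this alternative's range  */
    }
    *local = 0;
    return n;
}
```

Usage: with counts `{2, 3}` and a uniformly drawn index of 4, the result is alternative 1 with local index 2, so each of the five strings is equally likely to be chosen.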
In order to optimise the code, we avoid the use of strings in gfuzztools. Instead, we map each token in the grammar to an 8-bit number and generate random "strings" from these bytes. This mapping enables some smart optimisations, such as checking whether a token is a non-terminal in O(1) time.
The mapping is simple: nonterminals are assigned 8-bit keys starting from 0x80, and terminals are assigned 8-bit keys starting from 0x00. That is, if the MSB of the key is set, the token is a non-terminal.
We assign keys to non-terminals in the order they appear in the grammar, and store them in that same order. For example, in the simple grammar
```
<expression> ::= <term> "+" <term>
<term> ::= <factor> "*" <factor>
<factor> ::= "0" | "1" | "2" | ... | "9"
```
the first non-terminal (NT) is <expression> and so it would receive the key 0x80. The second NT is <term> which would receive the key 0x81, and finally <factor> would receive the key 0x82. We would also store the grammar as such:
GRAMMAR = [<NT-struct for 0x80>, <NT-struct for 0x81>, <NT-struct for 0x82>]
This means the 7 least-significant bits of the key also give us the index of the non-terminal in the grammar.
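Both properties reduce to single bit operations. The helpers below are illustrative, not part of the gfuzztools API:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t Token;

/* Non-terminal iff the most-significant bit is set: an O(1) check. */
static bool is_nonterminal(Token t)
{
    return (t & 0x80) != 0;
}

/* For a non-terminal key, the low bits are its index in GRAMMAR. */
static uint8_t nonterminal_index(Token t)
{
    return t & 0x7F;
}
```

Usage: `nonterminal_index(0x82)` is `2`, so `<factor>` lives at index 2 of `GRAMMAR`.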
We use C structures to represent each part of the grammar, and a typedef'd `uint8_t` data type to represent individual tokens in the grammar.
```c
/* ./include/grammar.h */
typedef uint8_t Token;

typedef struct Rule
{
    size_t num_tokens;
    Token tokens[MAX_TOKENS_IN_RULE];
} Rule;

typedef struct NonTerminal
{
    Token name;
    size_t num_rules;
    Rule rules[MAX_RULES_FOR_NONTERMINAL];
} NonTerminal;

typedef struct Grammar
{
    size_t num_non_terminals;
    NonTerminal non_terminals[MAX_NONTERMINALS_IN_GRAMMAR];
} Grammar;
```

The use of structures naturally leads to a more complex grammar initialisation:
```c
Grammar GRAMMAR = {
    // Number of NonTerminals
    6,
    // Array of NonTerminals
    {
        // NonTerminal 1
        {
            // Name of NonTerminal
            0x80,
            // Number of rules NonTerminal has
            1,
            // Array of Rules
            {
                // Rule 1
                {
                    // Number of Tokens in Rule
                    1,
                    // Array of Tokens
                    {0x81}
                }
            }
        },
        // NonTerminal 2
        {
            0x81,
            1,
            {
                {
                    2,
                    {0x82, 0x83}
                }
            }
        },
        ...
    }
};
```

We have developed a JSON-to-C converter which takes a JSON grammar file as a command-line argument and outputs the C initialisation code of the grammar in the above format.
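Once initialised, the structure can be traversed directly. The sketch below is self-contained (it uses small illustrative MAX constants rather than the library's actual header) and counts the rules of a toy two-non-terminal grammar:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t Token;

/* Illustrative sizes; the real values live in ./include/grammar.h. */
#define MAX_TOKENS_IN_RULE 4
#define MAX_RULES_FOR_NONTERMINAL 4
#define MAX_NONTERMINALS_IN_GRAMMAR 4

typedef struct Rule {
    size_t num_tokens;
    Token tokens[MAX_TOKENS_IN_RULE];
} Rule;

typedef struct NonTerminal {
    Token name;
    size_t num_rules;
    Rule rules[MAX_RULES_FOR_NONTERMINAL];
} NonTerminal;

typedef struct Grammar {
    size_t num_non_terminals;
    NonTerminal non_terminals[MAX_NONTERMINALS_IN_GRAMMAR];
} Grammar;

/* Toy grammar: 0x80 ::= 0x81 ; 0x81 ::= '0' | '1' */
static const Grammar TOY = {
    2,
    {
        { 0x80, 1, { { 1, {0x81} } } },
        { 0x81, 2, { { 1, {'0'} }, { 1, {'1'} } } },
    }
};

/* Sum the rules across every non-terminal in the grammar. */
static size_t total_rules(const Grammar *g)
{
    size_t total = 0;
    for (size_t i = 0; i < g->num_non_terminals; i++)
        total += g->non_terminals[i].num_rules;
    return total;
}
```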
Suppose you have a BNF grammar you wish to work with:
```
<start> ::= <sentence>
<sentence> ::= <noun_phrase> <verb>
<noun_phrase> ::= <article> <noun>
<noun> ::= "horse" | "dog" | "hamster"
<article> ::= "a" | "the"
<verb> ::= "stands" | "walks" | "jumps"
```
This toolkit requires that you have this grammar converted into JSON format:
```json
{
    "<start>": [["<sentence>"]],
    "<sentence>": [["<noun_phrase>", "<verb>"]],
    "<noun_phrase>": [["<article>", "<noun>"]],
    "<noun>": [["horse"], ["dog"], ["hamster"]],
    "<article>": [["a"], ["the"]],
    "<verb>": [["stands"], ["walks"], ["jumps"]]
}
```

Then, from the `./src/fuzzer/` directory, execute the Python script:

```
python3 ./src/converter.py <path_to_json_grammar_file>
```

Replace the contents of `GRAMMAR` in your main `.c` file with the outputted code found in `./data/grammar_c_8bit.txt`.
Notes:
- `converter.py` outputs the C initialisation code as well as a lookup table, `grammar_lookup.txt`, which shows you the keys for every token in the grammar.
- `converter.py` can be run with the `--debug` flag to output a debug-friendly version of the C initialisation code which uses the raw token strings rather than the 8-bit keys.
- We do not dynamically read in the JSON and convert it to C, in order to optimise resources: having the grammar stored in static memory is much faster.