If you're only interested in developer code, skip background. Ongoing.
Dictionary compression : uniquely matches strings to fixed-size integers. - Effective only if repeating strings, i.e. similiar words lose benefit - Also if applied to fraction of a whole relation, ineffective - Most srings stored are generally less than 200 bytes and often less than 30 bytes per string
LZ4 (dictionary compression example) - Not efficient for compressing individual strings - requires kB input size for efficient compression - So it's used to compress columnar blocks (many string values together) - Therefore prevents random access; - Example: decompressing large blocks for these values, some of which goes unused.
Potential:
- Use in conjunction with dictionary compression - i.e. after data is compressed, FSST can compress the strings in the dictionary
- Can apply on existing database systems
- Compressed Query Processing - Can complete equality comparisons on the compressed, without needing decompression
FSST - Fast Static Symbol Table
- Replace frequently-occuring substrings of 1-8 bytes with 1-byte codes.
- Remaining symbols/symbols that don't frequently occur are escaped, to indicate they should be copied as is. Result of symbol table being limited (256 bytes). Reserve the last byte of table for an escape byte.
- Ties are resolved randomly
- Translate each 1-byte code into its symbolic substring, using an immutable array table (256 entries)
- Decompress into symbols and store as 8-byte word in array.
/** */
void decodeBasic (int[] in, int[] out, symbolTable, actualLengthOfSymbols){
int code = *in++; //Dereference to get (*in) before the in pointer is moved forwards.
*out = sym[code]; //Translate the symbol, cast to 8 byte word and put it into outtput buffer
out+= len[ccode]; //Moves the pointer head forwards to the new out[0]/next place to write.
}
void decodeWithEscape (...) {
if (code == 255)
*out++ = *in++; //Copy the escape character.
}
/***/
- findLongestSymbol() finds the longest matching symbol at the current input position. If no matching symbol is found. The input byte is escaped.
- Choosing the 256 symbols
- Naive greedy single-pass: count and pick the most frequent occured. Con: does not consider overlapping symbols ex. ("http://w", ttp://www) and if sequential read-in, shorter symbols will be consumed long before the better/longer symbols (h before ttp://w)
- Actual iterative algorithm - Linear time, multiple (ex. 5) iterations, and on-the-fly compression, bottom-up
- Concatenate short symbols to longer symbols
- Multiple iterations update the table, add new symbols, remove bad symbols
- Base case: empty symbol table
- Each iteration:
- Iterate over the uncompressed input and compress with existing symbol table, count frequency
- Select the highest-gain symbols to construct a new symbol table. Choose from:
- Old table
- New symbols generated by concatenating pairs (2) symbols
- Reconsider all symbols that consist of a single byte
- Each existing symbol concatenated with the next occuring byte (even if that single byte is not currently a symbol)
- Ties for gain are resolved randomly for symbols