Description
So @gloryknight came up with a very interesting and simple heuristic that apparently works really well on many JSON files: whenever a , is encountered (this can be set to another character, like a space or newline), LZ is forced to stop the current substring and start a new one (with a leading comma), except when the current substring itself starts with a comma. See these two pull requests:
The reason for this being effective is a bit subtle: imagine that we have a string we are scanning through, and the next set of characters will be abcdefg. Furthermore, our dictionary already has the substrings abc, abcd and defg (plus the necessary substrings to get to this point), but not efg. Obviously, the ideal combination of tokens would be abc + defg. Instead we'll get abcd + e + f + g, because LZ always takes the longest match it can find. This can happen quite often in LZ. So how to avoid this? Well, I guess gloryknight's insight was that not all characters are created equal here; they can have special functions. One of those is as separator characters. Think of natural language: our words are separated by spaces, so if we split on the space character (and similar separators like newlines, dots, commas) we would converge on identical substrings much quicker.
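To make the abcdefg example concrete, here is a minimal sketch of greedy longest-match tokenization against a fixed dictionary. It is a simplification: a real LZ compressor also grows its dictionary while scanning, and the function name greedy_tokens is made up for this illustration, not part of LZString.

```python
def greedy_tokens(text, dictionary):
    """Split text by always taking the longest dictionary entry that matches.

    Hypothetical helper for illustration only. Single characters are treated
    as always available, like the initial alphabet of an LZ dictionary.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens


dictionary = {"abc", "abcd", "defg"}
print(greedy_tokens("abcdefg", dictionary))  # ['abcd', 'e', 'f', 'g']
```

The greedy scan spends four tokens where abc + defg would need only two, because taking abcd consumes the d that defg needed.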
Since LZString is most commonly used when compressing JSON, which usually has all unnecessary whitespace stripped out, the , is the option that seems to improve compression performance (although maybe { and : also make sense, or maybe all three?). In his tests it gave significant compression benefits at a small perf cost.
The best bit? This is perfectly backwards compatible with the existing code: the output can be decompressed by the same function as before.
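Here is a rough sketch of why an unmodified decompressor keeps working. It uses a textbook LZW with integer codes, not LZString's actual bit-packed format, and the names compress, decompress and the separator parameter are assumptions made for this illustration. The key point is that the decoder only mirrors "one new dictionary entry per emitted code", so it does not care why the encoder ended a substring.

```python
def compress(text, separator=","):
    """Textbook LZW with a forced break at `separator` (illustrative sketch).

    Pass separator=None to get plain LZW without the heuristic.
    """
    # Seed the dictionary with the characters that occur in the input.
    # A real codec would use a fixed alphabet or send new symbols in-band.
    alphabet = sorted(set(text))
    codes = {ch: i for i, ch in enumerate(alphabet)}
    next_code = len(codes)
    out = []
    w = ""
    for c in text:
        wc = w + c
        # The heuristic: end the current substring at a separator, unless
        # the substring is empty or itself starts with the separator.
        force_break = c == separator and w != "" and not w.startswith(separator)
        if wc in codes and not force_break:
            w = wc
            continue
        out.append(codes[w])
        # The decoder adds exactly one entry per code it reads, so a code
        # number is consumed here even if wc was already in the dictionary.
        codes[wc] = next_code
        next_code += 1
        w = c
    if w:
        out.append(codes[w])
    return alphabet, out


def decompress(alphabet, codes_in):
    """Plain LZW decoder; it knows nothing about the separator rule."""
    table = list(alphabet)
    if not codes_in:
        return ""
    prev = table[codes_in[0]]
    out = [prev]
    for code in codes_in[1:]:
        if code < len(table):
            entry = table[code]
        else:
            # The code refers to the entry that is about to be created.
            entry = prev + prev[0]
        table.append(prev + entry[0])
        out.append(entry)
        prev = entry
    return "".join(out)


sample = '[{"id":1,"name":"a"},{"id":2,"name":"b"},{"id":3,"name":"c"}]'
alphabet, with_rule = compress(sample)
_, plain = compress(sample, separator=None)
assert decompress(alphabet, with_rule) == sample
print(len(with_rule), len(plain))  # code counts with and without the rule
```

On an input this short the two code counts say little either way; the sketch only shows that the round trip survives the forced breaks, not how much compression is gained.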