Description
Currently, BigSampler tends to undersample relative to the requested ratio when performing a deterministic sample.
One avenue to explore is:
Once hashes are created, they are normalized to the range [0.0, 1.0] by boundLong. This function could potentially be updated or modified. One possibility is to use the upper and lower bounds of the hashes actually observed in the input instead, though this may be difficult to implement in practice. It could also be removed and replaced; I no longer remember why it was implemented this way.
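To make the normalization step concrete, here is a minimal Python sketch of how a signed 64-bit hash might be mapped into [0.0, 1.0] and compared against the sampling ratio. This is not BigSampler's actual code: `bound_long` is a hypothetical reconstruction of what boundLong does, BLAKE2b is used only as a stand-in hash because FarmHash is not in the Python standard library, and `keep` and `observed_ratio` are illustrative helper names.

```python
import hashlib
import struct

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def hash_to_int64(key: str) -> int:
    """Stand-in hash (assumption): 8 bytes of BLAKE2b read as a signed 64-bit int."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return struct.unpack(">q", digest)[0]


def bound_long(value: int) -> float:
    """Hypothetical normalization: map the full signed 64-bit range onto [0.0, 1.0]."""
    return (value - INT64_MIN) / (INT64_MAX - INT64_MIN)


def keep(key: str, ratio: float) -> bool:
    """Deterministic decision: keep the key if its normalized hash falls below the ratio."""
    return bound_long(hash_to_int64(key)) < ratio


def observed_ratio(keys, ratio: float) -> float:
    """Fraction of keys actually kept, for comparison with the requested ratio."""
    keys = list(keys)
    kept = sum(keep(k, ratio) for k in keys)
    return kept / len(keys)
```

Running `observed_ratio` over a large set of real keys and comparing the result with the requested ratio would show whether the undersampling comes from the normalization step or from somewhere else in the pipeline.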
Another path, instead of or in addition to this, is:
We primarily use FarmHash, which is not a cryptographic hash function. Is its output sufficiently uniform in its distribution? If not, now that additional hash functions are available within BigQuery, is there another function with a more appropriate output distribution?
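One way to investigate the uniformity question is to bucket the hash outputs and compute a chi-square statistic against a flat distribution. The sketch below is an assumption-laden illustration, not part of BigSampler: `bucket_counts` and `chi_square` are hypothetical helpers, the bucket count of 16 is arbitrary, and BLAKE2b again stands in for whichever hash is under test (FarmHash or a BigQuery alternative would be plugged in as `hash_fn`).

```python
import hashlib
import struct


def blake2b_int64(key: str) -> int:
    """Stand-in hash (assumption): 8 bytes of BLAKE2b read as a signed 64-bit int."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return struct.unpack(">q", digest)[0]


def bucket_counts(hash_fn, keys, buckets: int = 16):
    """Count how many hashes land in each of `buckets` equal slices of the int64 range."""
    counts = [0] * buckets
    for key in keys:
        shifted = hash_fn(key) + 2**63  # move signed range to [0, 2**64)
        counts[(shifted * buckets) >> 64] += 1
    return counts


def chi_square(counts) -> float:
    """Chi-square statistic against a uniform expectation; larger means less uniform."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)
```

For 16 buckets there are 15 degrees of freedom, so a statistic far above roughly 38 (the 0.1% critical value) on a representative key set would suggest the hash output is not uniform enough for ratio-based sampling.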