Improve sample population selection for deterministic sampling #699

@idreeskhan

Description

Currently, BigSampler tends to undersample relative to the given input ratio when performing a deterministic sample.

One avenue to explore is:

Once hashes are created, they are normalized to the [0.0, 1.0] range by boundLong. This function could potentially be updated or modified. One possible way is to use the upper and lower bounds of the input results instead, although this may be difficult to implement in practice. It could also be removed and replaced; I no longer remember the specifics of this implementation.
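To make the normalization step concrete, here is a minimal sketch of the idea in Python, not the actual BigSampler code. Everything in it is an assumption for illustration: `hash64` uses BLAKE2b as a stand-in for farmhash, `bound_long` is one plausible way to map a signed 64-bit hash onto [0.0, 1.0] (the real boundLong may differ), and `bound_by_observed` is a hypothetical version of the "upper/lower bound of the input results" alternative.

```python
import hashlib

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def hash64(key: str) -> int:
    """Stand-in signed 64-bit hash (BLAKE2b here, NOT farmhash)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def bound_long(h: int) -> float:
    """Assumed normalization: map the full int64 range onto [0.0, 1.0]."""
    return (h - INT64_MIN) / (INT64_MAX - INT64_MIN)


def bound_by_observed(h: int, lo: int, hi: int) -> float:
    """Hypothetical alternative: normalize by the observed min/max hash."""
    if hi <= lo:
        return 0.0
    return (h - lo) / (hi - lo)


def deterministic_sample(keys, ratio: float):
    """Keep a key when its normalized hash falls below the ratio."""
    return [k for k in keys if bound_long(hash64(k)) < ratio]


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10_000)]
    sample = deterministic_sample(keys, 0.1)
    # The achieved fraction can be compared against the requested ratio.
    print(len(sample) / len(keys))
```

Note that the observed-bounds variant needs the minimum and maximum hash before any row can be accepted or rejected, which implies an extra pass over the input. That is probably part of why it would be hard to implement in practice.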

Another path, instead of or in addition to this, is:

We primarily use farmhash, which is not a cryptographic hash function. Is its output sufficiently uniform in its distribution? If not, now that additional hash functions are available within BigQuery, is there another function with a more appropriate output distribution?
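One way to approach the uniformity question is to bucket the hash values of a representative key set and compute a chi-square statistic against a uniform expectation. The sketch below is only an illustration of that check: `hash64_unsigned` is a BLAKE2b stand-in (not farmhash), and `chi_square_uniformity` is a hypothetical helper that accepts any function returning an unsigned 64-bit integer, so a candidate hash could be swapped in.

```python
import hashlib


def hash64_unsigned(key: str) -> int:
    """Stand-in unsigned 64-bit hash (BLAKE2b here, NOT farmhash)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def chi_square_uniformity(keys, hash_fn, buckets: int = 100) -> float:
    """Chi-square statistic of hash values spread over equal-width buckets.

    hash_fn must return an integer in [0, 2**64).
    """
    counts = [0] * buckets
    total = 0
    for key in keys:
        # Scale the 64-bit value into a bucket index in [0, buckets).
        index = (hash_fn(key) * buckets) >> 64
        counts[index] += 1
        total += 1
    expected = total / buckets
    return sum((c - expected) ** 2 / expected for c in counts)


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10_000)]
    print(chi_square_uniformity(keys, hash64_unsigned))
```

For a uniform hash the statistic should sit near `buckets - 1` (the degrees of freedom); values far above that suggest a skewed distribution. Running the same check on real sampling keys with each candidate hash would give a like-for-like comparison.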
