
AggregateNumericRangeEquality: Mismatch between implementation and documentation #37

Open
@kklein

Description


The doc string of the corresponding add_*_constraint method claims the following:

Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.

Hence if we have, for a given key, n rows (in other words, the cardinality of the multiset) and a start_value of k, I would expect a range to be complete if exactly the following rows exist:

(key, k)
(key, k+1)
...
(key, k+n-1)
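A minimal sketch of this docstring-based reading, for comparison with the implementation. The helper name `missing_by_cardinality` is hypothetical and not part of the library; it only assumes that the values for one key arrive as an iterable of integers.

```python
def missing_by_cardinality(values, start=0):
    """Hypothetical helper: return the values missing from the range
    start, start + 1, ..., start + n - 1, where n is the number of rows
    (the cardinality of the multiset) observed for a key."""
    values = list(values)
    # The upper bound depends on how many rows exist, not on the largest value.
    expected = set(range(start, start + len(values)))
    return expected - set(values)


# n = 3 rows and start = 5: the complete range is 5, 6, 7.
print(missing_by_cardinality([5, 6, 7], start=5))  # set()
# A gap at 6 is reported as missing.
print(missing_by_cardinality([5, 7, 8], start=5))  # {6}
```

Under this reading, duplicated values also make a key incomplete, since they increase n without filling a new slot of the range. Whether that is desired is part of the question raised here.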

Yet, the implementation checks the following:

def missing_from_range(values, start=0):
    return set(range(start, max(values) + start)) - set(values)

On the one hand, it revolves around the maximal encountered value instead of the cardinality of the set. On the other hand, the start value is added to said maximum.

It is easy to come up with an example where both outlined behaviours diverge. Assume start_value to equal k and the observed rows to correspond to this:

(key, k)
(key, k+1)
(key, k+2)

According to the former definition, as described in the docstring, this would be a legitimate key.
According to the latter definition, range(k, (k+2) + k) excludes its upper bound, so we would expect

(key, k)
(key, k+1)
(key, k+2)
...
(key, 2k+1)

which would flag the current key as a failure for every k >= 2.
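The divergence can be reproduced with a short script. `missing_from_range` is copied from the implementation quoted above; the helper `observed` is a hypothetical stand-in for the three rows of the example.

```python
def missing_from_range(values, start=0):
    # Copied from the implementation quoted in this issue.
    return set(range(start, max(values) + start)) - set(values)


def observed(k):
    # Hypothetical helper: the three rows (key, k), (key, k+1), (key, k+2).
    return [k, k + 1, k + 2]


for k in range(4):
    missing = sorted(missing_from_range(observed(k), start=k))
    print(f"start_value={k}: missing={missing}")

# start_value=0: missing=[]
# start_value=1: missing=[]
# start_value=2: missing=[5]
# start_value=3: missing=[6, 7]
```

For start_value 0 and 1 the key passes, while from start_value 2 onwards the same gap-free rows are reported as incomplete.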

We do not notice this diverging behaviour in our tests since they only use start_value=1, a value for which the two definitions coincide on a gap-free range.

What is intended behaviour?


Labels: bug (Something isn't working), question (Further information is requested)
