Description
The docstring of the corresponding add_*_constraint method claims the following:
Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.
Hence, if we have, for a given key, n rows (in other words, n is the cardinality of the multiset) and a start_value of k, I would expect a range to be complete if exactly the following rows exist (see the sketch after the list):
(key, k)
(key, k+1)
...
(key, k+n-1)
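For illustration, this is a minimal sketch of that reading in plain Python (the helper name is mine and not part of the datajudge API):

```python
def is_complete_range(values, start_value):
    # Docstring reading: with n = len(values) observed values for a key, the
    # multiset must consist of exactly the integers
    # start_value, start_value + 1, ..., start_value + n - 1.
    return sorted(values) == list(range(start_value, start_value + len(values)))
```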
Yet, the implementation checks the following (datajudge/src/datajudge/constraints/groupby.py, lines 36 to 37 at commit 0350318):
First, it revolves around the maximal encountered value instead of the cardinality of the multiset. Second, the start value is added to said maximum.
It is easy to come up with an example where the two outlined behaviours diverge. Assume start_value equals k and the observed rows are the following:
(key, k)
(key, k+1)
(key, k+2)
According to the former definition, as described in the docstring, this would be a legitimate key.
According to the latter definition, we would expect the following rows to exist:
(key, k)
(key, k+1)
(key, k+2)
...
(key, k+2+k)
Since the rows beyond (key, k+2) do not exist, the current key would be flagged as a failure for some k.
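To make the divergence concrete, here is a small plain-Python illustration of both readings for k = 5 (the variable names are mine; none of this is datajudge code):

```python
k = 5                          # start_value
observed = [k, k + 1, k + 2]   # observed values for the key: 5, 6, 7

# Docstring reading: n = 3 rows, so the expected values are k .. k + n - 1 = 5 .. 7.
expected_docstring = list(range(k, k + len(observed)))
print(sorted(observed) == expected_docstring)  # True -> the key is complete

# Reading of the implementation described above: the upper end is anchored at
# the maximal observed value plus start_value, i.e. (k + 2) + k = 12, so rows
# up to (key, 12) would be required and the key is flagged.
print(max(observed) + k)  # 12
```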
We do not notice this diverging behaviour in our tests because they only use start_value=1.
What is the intended behaviour?