Skip to content

Conversation

@leei
Copy link

@leei leei commented Nov 8, 2023

The is a replacement for #3366. It defines a new keyword mergeduplicates that can be set to a Function that combines values to merge columns instead of erroring when there are duplicate column names and makeunique=false.

It is implemented in stages, the first of which creates a temporary struct UpdateIndex which is used to initialize a DataFrame or represent a set of column names and columns that will be merged into the resulting DataFrame in an hcat or hcat! operation. It then follows up with commits that extend permutedims and joins to resolve column clashes in the same fashion.

leei added 5 commits October 30, 2023 16:19
…te columns when `makeunique` is false.

If `mergeduplicates` is passed a function then that function is invoked on the values of all duplicate columns and its return value is assigned to that named column.
@bkamins
Copy link
Member

bkamins commented Nov 11, 2023

As I have commented before - it will be super hard to review such a big PR. That is why I have recommended to split it into smaller PRs and merge them incrementally.

But I will try to comment on this PR (however, take note that because it is so big it will be hard to properly review it and be sure that all issues are caught/discussed).

As a side note - it seems that you did not use latest main branch state to make this PR.

rename!(index(df), vals, makeunique=makeunique)
makeunique::Bool=false, mergeduplicates::MergeDuplicates=nothing)
if !makeunique && isa(mergeduplicates, Function)
(new_columns, colindex) = process_updates(UpdateIndex(vals), _columns(df), mergeduplicates)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_columns is not defined for general AbstractDataFrame.

makeunique::Bool=false)
rename!(index(df), vals, makeunique=makeunique)
makeunique::Bool=false, mergeduplicates::MergeDuplicates=nothing)
if !makeunique && isa(mergeduplicates, Function)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docstring seems not to have been updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in particular, I am not clear what rename!/rename should do when mergeduplicates is passed.

return df
end

function rename!(idx::Index, new_index::Index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

functions for Index should be added in other/index.jl

end

function rename!(idx::Index, new_index::Index)
splice!(idx.names, 1:length(idx.names), new_index.names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we first check that idx and new_index are independent?

1 │ 1 2 3
```
"""
rename(df::AbstractDataFrame, vals::AbstractVector{Symbol};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstring update is missing

Wherever the `mergeduplicates` keyword argument is available it is either `nothing` or
a `Function` that will be executed to combine duplicated columns (when `makeunique=false`)
"""
MergeDuplicates = Union{Nothing,Function}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK to add this definition, but then its docstring should be more precise I think.

will be combined by invoking the function with all values from those columns.
e.g. `mergeduplicates=coalesce` will use the first non-missing value. Since `hcat` and
`hcat!` are performed recursively for more than two frames, this `mergeduplicates`
function will only combine two columns at a time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not clear what happens if makeuniqe=true and mergeduplicates is Function`.

Horizontally concatenate data frames.
If `makeunique=false` (the default) column names of passed objects must be unique.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this statement does not seem to be true after this PR.

if it is `true` a new unique name will be generated by adding a suffix,
if it is `false` an error will be thrown unless a `mergeduplicates` functiom is provided.
- `mergeduplicates` : defines what to do if `name` already exists in `df` and `makeunique`
is false. It should be given a Function that combines the values of all of the duplicated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is false. It should be given a Function that combines the values of all of the duplicated
is false. It should be given a `Function` that combines the values of all of the duplicated

if it is `false` an error will be thrown unless a `mergeduplicates` functiom is provided.
- `mergeduplicates` : defines what to do if `name` already exists in `df` and `makeunique`
is false. It should be given a Function that combines the values of all of the duplicated
columns which will be passed as a varargs. The return value is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not clear if the passed function takes elements of the columns iteratively or whole columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it is not clear how things are processed if multiple duplicate columns are provided.

@bkamins
Copy link
Member

bkamins commented Nov 11, 2023

I have not finished reviewing the PR. I will try to get back to it later.

@elainethale
Copy link

Estimate on when this functionality will be merged into DataFrames.jl? Has this PR been replaced with yet another one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants