Handling missing values in cor() #343

randyzwitch · 2018-02-04T01:59:18Z

This is somewhere between a question and adding arguments/keywords to cor()...

I was adding missing support to ECharts.jl and the last chart type is corrplot(). I was thinking of just calling complete_cases!() on a dataframe, prior to converting to a Matrix and passing to cor. But then I ran into this theoretical discussion about the correctness of pairwise complete only correlations:

http://bwlewis.github.io/covar/missing.html

So my question is: is StatsBase the right place to handle the different treatments? I would say yes, taking a principled stand about what the choices can be, rather than not support missing values at all. If it is the right place, I'd love to collaborate with someone about adding the functionality.

The text was updated successfully, but these errors were encountered:

andreasnoack · 2018-02-04T14:05:18Z

I guess it is. It used to be naturally located in DataArrays but it might be here. I've really like if we could come up with a scalable solution to this to avoid that each and every function would have to have skipmissing argument. However, the situation you mention is indeed a challenge. How can we express pairwise completeness generically?

nalimilan · 2018-02-04T15:55:14Z

The approach taken so far is that reduction functions do not have any special treatment for missing values, people just need to use skipmissing meanually (see also #342). But pairwise functions are a different case since there are indeed several possible choices, which are harder to implement generically.

To avoid adding keyword arguments to all functions, we could extend skipmissing to work on matrices, wrapped in a EachSkipMissing object which would also store the :pairwise/:complete/etc. argument. Then you could create a two-column view on that object, which would automatically omit the rows with missing values either on the pair of selected columns (:pairwise), or on the whole row (:complete). cor could use that approach to automatically compute correlation over the appropriate rows. Or we could implement a custom cor function for EachSkipMissing objects, which would extract the relevant field and perform the needed steps; in that case EachSkipMissing would just be used for signalling.

Old discussion: https://groups.google.com/d/msg/julia-stats/tDhWFpHYZYI/HSVbNF-8DgAJ

rofinn · 2018-02-04T19:44:45Z

FWIW, I'd prefer to have a separate package(s) for statistics with missing values and imputation.

I think that statistics with missing values is a large enough topic that it warrants its own package.
I've seen a lot of R analysis where folks default to using pairwise complete observations before even reviewing their dataset, so I wouldn't mind making it a little harder to do in julia :)

nalimilan · 2018-02-04T19:48:24Z

I agree that imputation should be handled separately as it's too complex. But we need to provide a way to skip observations with missing values, or people will implement it in packages and we will end up with an inconsistent system. We are already stricter than many statistical packages (SAS, Stata) by not skipping missing values by default, we can't prevent users from doing incorrect analyses.

rofinn · 2018-02-04T20:05:54Z

Yeah, I realize we can't prevent people from doing incorrect analyses, but we can try to guide people to the appropriate techniques based on how the API is organized. In the extreme case, I'm reminded of Python's cryptography library which has "Hazardous Materials" modules.

randyzwitch · 2018-02-04T20:08:53Z

Yeah, I realize we can't prevent people from doing incorrect analyses, but we can try to guide people to the appropriate techniques based on how the API is organized.

This is what I was getting at around taking a principled stand on the choices. If pairwise complete almost always wrong, and since we already have completecases!, users could figure that out themselves if they choose. But for mean imputation or passing missing forward by default, I feel like that is a base statistical functionality.

rofinn · 2018-05-05T04:24:59Z

After further consideration, I might be open to providing some kind of MissingModel interface (e.g., cor(X::AbstractArray, wv::AbstractWeights, mm::MissingModel, ... )) for people who have very specific requirements. I'm still not sure how many models StatsBase should provide out of the box and I think a "Hazardous Materials" label of some sort would be apt.

rofinn · 2018-09-11T17:36:24Z

This is a bit tangential, but I think I'd like to support a method (e.g., Pearson, Kendall, Spearman, EventSync) argument regardless. I think it would then be up to specific methods to decide which missing models they're willing to support (e.g., complete, pairwise).

mkborregaard · 2019-04-03T06:54:38Z

Slack discussion today with @nalimilan @ararslan @genauguy seemed to lean towards allowing more arguments to skipmissing, so one could do cor(skipmissing(x, y), method = :spearman). There was some discussion of whether to return a specific SkipMissing iterable or an iterator of tuples.
This seems crucial to resolve for the functionality here.

randyzwitch mentioned this issue Feb 4, 2018

Improve corrplot randyzwitch/ECharts.jl#23

Open

nalimilan mentioned this issue May 2, 2020

Make covariance and correlation work for iterators, skipmissing in particular. JuliaStats/Statistics.jl#34

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handling missing values in cor() #343

Handling missing values in cor() #343

randyzwitch commented Feb 4, 2018

andreasnoack commented Feb 4, 2018

nalimilan commented Feb 4, 2018

rofinn commented Feb 4, 2018

nalimilan commented Feb 4, 2018

rofinn commented Feb 4, 2018 •

edited

Loading

randyzwitch commented Feb 4, 2018

rofinn commented May 5, 2018

rofinn commented Sep 11, 2018

mkborregaard commented Apr 3, 2019

Handling missing values in cor() #343

Handling missing values in cor() #343

Comments

randyzwitch commented Feb 4, 2018

andreasnoack commented Feb 4, 2018

nalimilan commented Feb 4, 2018

rofinn commented Feb 4, 2018

nalimilan commented Feb 4, 2018

rofinn commented Feb 4, 2018 • edited Loading

randyzwitch commented Feb 4, 2018

rofinn commented May 5, 2018

rofinn commented Sep 11, 2018

mkborregaard commented Apr 3, 2019

rofinn commented Feb 4, 2018 •

edited

Loading