Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling missing values in cor() #343

Open
randyzwitch opened this issue Feb 4, 2018 · 9 comments
Open

Handling missing values in cor() #343

randyzwitch opened this issue Feb 4, 2018 · 9 comments

Comments

@randyzwitch
Copy link
Contributor

This is somewhere between a question and adding arguments/keywords to cor()...

I was adding missing support to ECharts.jl and the last chart type is corrplot(). I was thinking of just calling complete_cases!() on a dataframe, prior to converting to a Matrix and passing to cor. But then I ran into this theoretical discussion about the correctness of pairwise complete only correlations:

http://bwlewis.github.io/covar/missing.html

So my question is: is StatsBase the right place to handle the different treatments? I would say yes, taking a principled stand about what the choices can be, rather than not support missing values at all. If it is the right place, I'd love to collaborate with someone about adding the functionality.

@andreasnoack
Copy link
Member

I guess it is. It used to be naturally located in DataArrays but it might be here. I've really like if we could come up with a scalable solution to this to avoid that each and every function would have to have skipmissing argument. However, the situation you mention is indeed a challenge. How can we express pairwise completeness generically?

@nalimilan
Copy link
Member

The approach taken so far is that reduction functions do not have any special treatment for missing values, people just need to use skipmissing meanually (see also #342). But pairwise functions are a different case since there are indeed several possible choices, which are harder to implement generically.

To avoid adding keyword arguments to all functions, we could extend skipmissing to work on matrices, wrapped in a EachSkipMissing object which would also store the :pairwise/:complete/etc. argument. Then you could create a two-column view on that object, which would automatically omit the rows with missing values either on the pair of selected columns (:pairwise), or on the whole row (:complete). cor could use that approach to automatically compute correlation over the appropriate rows. Or we could implement a custom cor function for EachSkipMissing objects, which would extract the relevant field and perform the needed steps; in that case EachSkipMissing would just be used for signalling.

Old discussion: https://groups.google.com/d/msg/julia-stats/tDhWFpHYZYI/HSVbNF-8DgAJ

@rofinn
Copy link
Member

rofinn commented Feb 4, 2018

FWIW, I'd prefer to have a separate package(s) for statistics with missing values and imputation.

  1. I think that statistics with missing values is a large enough topic that it warrants its own package.
  2. I've seen a lot of R analysis where folks default to using pairwise complete observations before even reviewing their dataset, so I wouldn't mind making it a little harder to do in julia :)

@nalimilan
Copy link
Member

I agree that imputation should be handled separately as it's too complex. But we need to provide a way to skip observations with missing values, or people will implement it in packages and we will end up with an inconsistent system. We are already stricter than many statistical packages (SAS, Stata) by not skipping missing values by default, we can't prevent users from doing incorrect analyses.

@rofinn
Copy link
Member

rofinn commented Feb 4, 2018

Yeah, I realize we can't prevent people from doing incorrect analyses, but we can try to guide people to the appropriate techniques based on how the API is organized. In the extreme case, I'm reminded of Python's cryptography library which has "Hazardous Materials" modules.

@randyzwitch
Copy link
Contributor Author

Yeah, I realize we can't prevent people from doing incorrect analyses, but we can try to guide people to the appropriate techniques based on how the API is organized.

This is what I was getting at around taking a principled stand on the choices. If pairwise complete almost always wrong, and since we already have completecases!, users could figure that out themselves if they choose. But for mean imputation or passing missing forward by default, I feel like that is a base statistical functionality.

@rofinn
Copy link
Member

rofinn commented May 5, 2018

After further consideration, I might be open to providing some kind of MissingModel interface (e.g., cor(X::AbstractArray, wv::AbstractWeights, mm::MissingModel, ... )) for people who have very specific requirements. I'm still not sure how many models StatsBase should provide out of the box and I think a "Hazardous Materials" label of some sort would be apt.

@rofinn
Copy link
Member

rofinn commented Sep 11, 2018

This is a bit tangential, but I think I'd like to support a method (e.g., Pearson, Kendall, Spearman, EventSync) argument regardless. I think it would then be up to specific methods to decide which missing models they're willing to support (e.g., complete, pairwise).

@mkborregaard
Copy link
Contributor

Slack discussion today with @nalimilan @ararslan @genauguy seemed to lean towards allowing more arguments to skipmissing, so one could do cor(skipmissing(x, y), method = :spearman). There was some discussion of whether to return a specific SkipMissing iterable or an iterator of tuples.
This seems crucial to resolve for the functionality here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants