Handling missing values in cor() #343
Comments
I guess it is. It used to be naturally located in |
The approach taken so far is that reduction functions do not have any special treatment for missing values, people just need to use To avoid adding keyword arguments to all functions, we could extend Old discussion: https://groups.google.com/d/msg/julia-stats/tDhWFpHYZYI/HSVbNF-8DgAJ |
FWIW, I'd prefer to have a separate package(s) for statistics with missing values and imputation.
|
I agree that imputation should be handled separately, as it's too complex. But we need to provide a way to skip observations with missing values, or people will implement it in packages and we will end up with an inconsistent system. We are already stricter than many statistical packages (SAS, Stata) in that we do not skip missing values by default; beyond that, we can't prevent users from doing incorrect analyses. |
Yeah, I realize we can't prevent people from doing incorrect analyses, but we can try to guide people to the appropriate techniques based on how the API is organized. In the extreme case, I'm reminded of Python's cryptography library which has "Hazardous Materials" modules. |
This is what I was getting at around taking a principled stand on the choices. If pairwise complete is almost always wrong, and since we already have |
After further consideration, I might be open to providing some kind of |
This is a bit tangential, but I think I'd like to support a |
Slack discussion today with @nalimilan @ararslan @genauguy seemed to lean towards allowing more arguments to |
This is somewhere between a question and adding arguments/keywords to `cor()`... I was adding `missing` support to ECharts.jl and the last chart type is `corrplot()`. I was thinking of just calling `complete_cases!()` on a dataframe, prior to converting to a Matrix and passing to `cor`. But then I ran into this theoretical discussion about the correctness of pairwise complete only correlations: http://bwlewis.github.io/covar/missing.html
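To make the two treatments concrete, here is a minimal sketch of listwise (complete-case) and pairwise deletion using only the `Statistics` and `LinearAlgebra` standard libraries. The helper names `listwise_cor` and `pairwise_cor` are hypothetical and exist only for illustration; they are not part of StatsBase or any proposed API. The toy matrix is built so that no row is complete, which shows one well-known concern with pairwise deletion: the resulting matrix need not be positive semidefinite.

```julia
using Statistics, LinearAlgebra

# Hypothetical helper (not a StatsBase function): listwise deletion.
# Drops every row that contains at least one missing value, then calls cor.
function listwise_cor(X::AbstractMatrix)
    keep = [!any(ismissing, row) for row in eachrow(X)]
    return cor(Float64.(X[keep, :]))
end

# Hypothetical helper (not a StatsBase function): pairwise deletion.
# Each entry (i, j) uses only the rows where both columns are observed.
function pairwise_cor(X::AbstractMatrix)
    p = size(X, 2)
    C = Matrix{Float64}(I, p, p)   # ones on the diagonal
    for i in 1:p, j in (i + 1):p
        x, y = view(X, :, i), view(X, :, j)
        keep = .!ismissing.(x) .& .!ismissing.(y)
        C[i, j] = C[j, i] = cor(Float64.(x[keep]), Float64.(y[keep]))
    end
    return C
end

# Toy data: every pair of columns overlaps on a different block of rows,
# and no row is fully observed.
X = [1.0     1.0     missing;
     2.0     2.0     missing;
     3.0     3.0     missing;
     missing 1.0     1.0;
     missing 2.0     2.0;
     missing 3.0     3.0;
     1.0     missing 3.0;
     2.0     missing 2.0;
     3.0     missing 1.0]

C = pairwise_cor(X)

# The smallest eigenvalue is negative for this data, so C is not a valid
# correlation matrix even though each entry is a valid correlation.
smallest = minimum(eigvals(Symmetric(C)))
```

On fully observed data both helpers agree with `cor(X)`. On the toy matrix above, listwise deletion would leave no rows at all, which is the trade-off between the two approaches.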
So my question is: is StatsBase the right place to handle the different treatments? I would say yes, taking a principled stand about what the choices can be, rather than not supporting missing values at all. If it is the right place, I'd love to collaborate with someone on adding the functionality.