ENH: stats for ERF decoding and time generalization #1848
cc @flrgsr |
@dengemann I like Hebart's solution, but this may be computationally demanding... By the way, to what extent does MNE handle distributed computing? |
DIY style ... |
with joblib
|
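To make the joblib tip concrete, here is a minimal sketch (not MNE API) of parallelizing cross-validation folds on a single machine; the data and the choice of classifier are hypothetical placeholders:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(42)
X = rng.randn(100, 20)       # hypothetical trials x features
y = rng.randint(0, 2, 100)   # hypothetical binary labels

def fit_and_score(train, test):
    # fit on the training fold, score on the held-out fold
    clf = LogisticRegression().fit(X[train], y[train])
    return clf.score(X[test], y[test])

cv = StratifiedKFold(n_splits=5)
scores = Parallel(n_jobs=-1)(
    delayed(fit_and_score)(train, test) for train, test in cv.split(X, y))
```

This only distributes across local cores; distributing across machines (MPI style) is a different problem, as noted above.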
@agramfort I think @deep-introspection is referring to distributed computing across machines, MPI style |
@dengemann Indeed, but thanks @agramfort for the joblib tip, sounds pretty cool! |
see clusterlib, it's WIP though
|
Salzberg says: "(It may be tempting, because it is easy to do with powerful computers, to run many cross […]" On the other hand, this is essentially what Eugster and colleagues propose and what is implemented in the caret resamples function, if I'm not mistaken. It essentially computes confidence intervals across resamples based on a t-statistic. Here the nuances refer to using bootstrap for model comparison instead of e.g. a repeated 10-fold CV. (cc @topepo) |
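For concreteness, a rough sketch of that caret-resamples-style approach: a t-based confidence interval over per-resample scores. The scores below are made-up numbers, and the non-independence caveat discussed next still applies:

```python
import numpy as np
from scipy import stats

# hypothetical per-resample accuracies from a repeated CV / bootstrap
scores = np.array([0.62, 0.58, 0.65, 0.60, 0.63,
                   0.59, 0.61, 0.64, 0.57, 0.66])
mean = scores.mean()
# 95% CI based on a t-statistic across resamples
ci = stats.t.interval(0.95, len(scores) - 1,
                      loc=mean, scale=stats.sem(scores))
print(f"mean={mean:.3f}, 95% CI={ci}")
```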
cc @cmoutard |
Seeing if you guys thought up anything on this issue. I'm trying to gauge whether my decoding scores are "significant" or not as has been mentioned here, doing permutation testing on something like this would take a long time :) |
Without knowing too much about the details of the problem, my thoughts are:
|
That sounds reasonable - though how is resampling possible within the current infrastructure for this class? It seems like right now we fit the model, and then have one call to |
@choldgraf let's keep the API issue aside for a moment. @topepo thanks a lot for your highly appreciated feedback. But as far as I understand it, and as it seems with your R code, the resampling you suggest assumes that the bootstraps / folds / resamples are independent, but they are not. The degrees of freedom will hence be erroneous. Am I overlooking something? |
You are correct about the independence. For example, two bootstrap samples of the first ten integers might be
so that the holdouts are:
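(The original example numbers are not reproduced here; a small illustration of the same idea, with made-up random draws:)

```python
import numpy as np

rng = np.random.RandomState(0)
ints = np.arange(1, 11)
# two bootstrap samples of the first ten integers (with replacement)
boot1 = rng.choice(ints, size=10, replace=True)
boot2 = rng.choice(ints, size=10, replace=True)
# the holdouts are the out-of-bag samples of each resample
hold1 = np.setdiff1d(ints, boot1)
hold2 = np.setdiff1d(ints, boot2)
# shared samples across the two holdouts -> the resamples are not independent
print(np.intersect1d(hold1, hold2))
```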
Since there are common samples in the holdouts, they are not independent. It is possible to use something like a linear mixed model to account for the within-sample correlations between the resampling statistics generated by these holdouts. However, the resample-to-resample variation is typically huge in comparison, and accounting for that will deal with the lion's share of the correlation structure. The simplest thing to do is to compare two models by testing the differences in their performance at the resample level. So if resample holdout #1 has an accuracy of 90% for model A and 95% for model B, evaluate the difference of 5%. Using all of the resample-specific accuracy differences, we can test that their average is zero. That's basically a paired t-test. Ignoring the sample-to-sample correlation reduces the power of the test, but you can probably compensate for that by doing more resamples (if that is a concern). A model that accounts for both the resample-to-resample and sample-to-sample correlations would be very hard to parameterize and estimate. |
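A minimal sketch of that resample-level paired comparison; the per-resample accuracies for models A and B are hypothetical numbers:

```python
import numpy as np
from scipy import stats

acc_a = np.array([0.90, 0.88, 0.92, 0.87, 0.91])  # model A, per resample
acc_b = np.array([0.95, 0.93, 0.94, 0.90, 0.96])  # model B, same resamples
diff = acc_b - acc_a
# test that the mean difference is zero -- a paired t-test
t, p = stats.ttest_1samp(diff, 0.0)
# equivalently: stats.ttest_rel(acc_b, acc_a)
```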
So in other words, we alleviate this problem by slightly modifying the question. A further trick could then be to test e.g. a "serious" model against a dummy classifier (predict the top category or some other silly pattern). But then I still don't see the real difference between doing a paired t-test and a one-sample t-test for a given model, e.g. setting the chance level to zero by subtracting 0.5 in the 2-class case. Intuitively, and this is what people have recently argued (see refs above), this does not permit correctly performing NHST, as the statistics will be erroneous (samples are included in multiple folds, which is usually not the case for e.g. experimental statistics where paired tests apply). |
For simple 2-group comparisons, there isn't too much of a difference.
Well yes but that is a pretty black and white way of viewing it. With the t-test analogy, a person could do a regular t-test under an experimental design where there is a correlation structure. The result is that the variance of the difference is artificially inflated (bad). However, suppose that there is missing data and many of the differences cannot be estimated. In that case, it might be better to ignore the correlation since the loss of power due to missing data is worse than overestimating the variance of the difference. As another example, when the correlation is near zero, there is literature that claims that you are better off (statistically) not estimating the covariance parameter at all. I don't think that a comprehensively "correct" model can be estimated from these data, so we can fit one that has the smallest approximation error. "All models are wrong..." |
I sympathise a lot with this view. Just trying to think what we can recommend to users who have to deal with reviewers from e.g. psychology and biology. I think the catch is that they would argue that this procedure is not conservative enough and generates too many positives. What is your "real-world" impression from industry use cases of resampling-based testing on this one? Thanks for the discussion Max. |
Honestly, nobody has ever questioned it. That may not be related to whatever level of statistical rigor that it has; people don't seem to give a @$%^ about this aspect of modeling =] Internally, during project discussions I think that the focus is on the applicability of the model. Externally, most of the reviewers who are not in JMLR or similar journals don't know much about the subject at all and focus their questions on bigger picture issues (e.g. "why didn't you just use logistic regression?") |
:)
yeah. good point, seems familiar. |
I'm getting a bit lost in your arguments; where do we go from here? Also, what would be our typical use case for across-trials statistics? ECoG (because electrode locations are subject-specific)? Anything else?
Can't you use second-order stats across subjects? i.e.
|
My question would be: What is the null hypothesis that you would like to reject to establish statistical significance in a cross-trial setting? |
On the assumption that the recorded individual is representative of the population, what is our confidence that the understudied neural marker |
@kingjr ah I should have been more clear. The use case I was talking about was ECoG - I don't usually do statistics across subjects, it's almost always at the electrode level (or maybe within a subject across electrodes). So in my mind, for a single electrode, I'd like to be able to put confidence bounds on the diagonal scores at least. I know that doing this is kind of murky statistical territory (because it depends on the kind of cross-validation you do, etc.). In terms of null hypotheses, I think that's a tricky question, as people have already alluded to here. Just saying "is the reported score different from chance level" seems like too liberal a stat. Maybe some kind of "dumb" classifier as @dengemann suggests, training the same classifier on baseline data, or doing permutation testing to get an upper/lower bound on your null scores. But that's starting to get pretty heavy computationally... |
Indeed, in a classification setting, your null hypothesis could be the […] That may be a fairer comparison...
|
So always choosing a random class sounds rather nice to me, because that would be super fast to simulate :) but sadly ease of simulation does not make for statistical rigor, so I'm probably not qualified to determine if that's legit or not |
One of the issues is that there is often more than one way of being dumb (this applies outside the present topic too): in an imbalanced scenario, choosing the majority class or a random class gives you two different chance levels. I think the discussion confirms the idea that label permutations (i.e. [shuffle, fit, score] x n_perms) are the only formally valid approach, but they are computationally extremely expensive. In practice, I suggest we i) indicate the theoretical chance level, ii) do a bunch of permuted fits on a subset of the data (typically, on the baseline only) and/or iii) ignore the issue and explicitly state in the text that we ignore the non-independence generated by the cross-validation. |
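The "[shuffle, fit, score] x n_perms" approach can be sketched with scikit-learn's `permutation_test_score`; the data here are hypothetical, and as noted above this gets expensive once you do it for every time point of a time-resolved decoder:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(80, 30)         # hypothetical trials x features
y = rng.randint(0, 2, 80)     # hypothetical binary labels

# shuffle labels, refit and rescore n_permutations times
score, perm_scores, pvalue = permutation_test_score(
    SVC(kernel="linear"), X, y,
    cv=StratifiedKFold(5), n_permutations=200, n_jobs=-1)
```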
I agree. One could perhaps compute both chance levels and then pick the more conservative one as null hypothesis (i.e., the one with the higher accuracy score). |
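A sketch of that idea, comparing two "dumb" baselines and keeping the more conservative (higher-scoring) one as the empirical chance level; `X` and `y` are hypothetical imbalanced data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = np.r_[np.zeros(70), np.ones(30)].astype(int)   # imbalanced labels

# chance level from always predicting the majority class
chance_majority = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
# chance level from predicting a random class
chance_random = cross_val_score(
    DummyClassifier(strategy="uniform"), X, y, cv=5).mean()
# more conservative null: the higher of the two
chance = max(chance_majority, chance_random)
```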
Some more materials: @qnoirhomme has a paper relevant to this issue. |
Kai Goergen and Martin Hebart argue in favor of using prevalence testing instead of second-order stats: https://arxiv.org/abs/1512.00810; they have a Matlab implementation in their toolbox |
Kriegeskorte et al use bootstrap, and add a 'noise ceiling' estimate: It would be great if these stats functions were implemented here too, if anyone is interested. |
This still seems like an important topic to me. @kingjr @dengemann etc |
We can do an example of bootstrap / permutations. Not so high priority to me. Doing many splits and showing the intrinsic uncertainty of the cross validation is also often good enough: https://mne-tools.github.io/mne-r/articles/plot_machine_learning_evoked.html
|
I just want to check whether anyone has come up with any ideas on this issue. I am also trying to analyze my decoding scores (using an SVM for two classes) and to see whether the scores are significant or not using permutation testing and cluster correction. |
If you have decoding scores as time series and the chance level is 0.5, you could do a t-test at the group level. The cluster stat functions we have could take as input the time series with chance (i.e. 0.5) subtracted; the t-test would test that the mean is higher than 0.5.
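A minimal sketch of that group-level recipe, assuming you already have subject-wise decoding score time courses (the `scores` array below is made up):

```python
import numpy as np
from mne.stats import permutation_cluster_1samp_test

rng = np.random.RandomState(0)
# hypothetical (n_subjects, n_times) array of accuracy/AUC time courses
scores = 0.5 + 0.05 * rng.randn(15, 100)

# subtract chance and run a one-sample cluster permutation test
X = scores - 0.5
T_obs, clusters, cluster_pv, H0 = permutation_cluster_1samp_test(
    X, n_permutations=1000, tail=1)   # tail=1: test for scores above chance
```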
|
With @kingjr we're currently thinking about the right way to handle cross-trial significance testing for the analyses demonstrated in our decoding examples. This basically overlaps with one of our GSOC topics and with the discussion in the comments over here. Now practically, permutation testing or bootstrapping seems pretty heavy if it means that you resample every model you fit. This would render this kind of analysis intractable. On the flip side, there seems to be a problem with using any test that makes assumptions about the degrees of freedom, as, for example for single-trial probability output, the samples are not independent because of cross-validation. It seems that, instead of bootstrapping the entire decoding or incorporating permutation testing into cross-validation as suggested by Martin Hebart (first comment here), we could just compute a permutation test post-hoc on the predictions. This would be much more tractable.
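One possible reading of that post-hoc idea, sketched with hypothetical data (whether this is statistically valid is exactly what is being debated here): fit and predict once with cross-validation, then permute the labels against the fixed predictions to build a null distribution of the score, so no refitting is needed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
X = rng.randn(120, 40)        # hypothetical trials x features
y = rng.randint(0, 2, 120)    # hypothetical binary labels

# cross-validated predictions, fit once per fold
y_pred = cross_val_predict(LogisticRegression(), X, y, cv=5)
observed = accuracy_score(y, y_pred)

# null distribution: permute labels against the fixed predictions
null = np.array([accuracy_score(rng.permutation(y), y_pred)
                 for _ in range(10000)])
p_value = (np.sum(null >= observed) + 1.0) / (len(null) + 1.0)
```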
I was wondering what you think might be the right way to go. Also for the GSOC project it would be nice to have some discussion about this.
cc @agramfort @Eric89GXL @banilo