
ENH: stats for ERF decoding and time generalization #1848

Open
dengemann opened this issue Mar 12, 2015 · 37 comments

Comments

@dengemann
Member

With @kingjr we're currently thinking about the right way to handle cross-trial significance testing for the analyses demonstrated in our decoding examples. This basically overlaps with one of our GSOC topics and with the discussion in the comments over here. Now practically, permutation testing or bootstrapping seems pretty heavy if it means resampling every model you fit; that would render these kinds of analyses intractable. On the flip side, there seems to be a problem with any test that makes assumptions about the degrees of freedom: for single-trial probability output, for example, the samples are not independent because of cross-validation. It seems that instead of bootstrapping the entire decoding or incorporating permutation testing into the cross-validation, as suggested by Martin Hebart (first comment here), we could just compute a permutation test post-hoc on the predictions. This would be much more tractable.
I was wondering what you think might be the right way to go. Also for the GSOC project it would be nice to have some discussion about this.

cc @agramfort @Eric89GXL @banilo
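
A rough sketch of the "permutation test post-hoc on the predictions" idea, with toy data and a placeholder estimator (not a proposal for the MNE API): run the expensive cross-validated prediction once, then permute the labels only at scoring time.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(42)
X = rng.randn(100, 20)            # toy single-trial features
y = rng.randint(0, 2, 100)        # toy binary labels

# Expensive part done once: cross-validated predictions for every trial.
y_pred = cross_val_predict(LogisticRegression(), X, y, cv=5)
observed = accuracy_score(y, y_pred)

# Cheap part repeated: permute the labels and re-score the fixed predictions.
n_permutations = 1000
null = np.array([accuracy_score(rng.permutation(y), y_pred)
                 for _ in range(n_permutations)])
p_value = (np.sum(null >= observed) + 1.0) / (n_permutations + 1.0)

Note that this only permutes at scoring time, which is what makes it tractable; it is not equivalent to refitting the model under each permutation.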

@dengemann
Member Author

cc @deep-introspection

@dengemann
Member Author

cc @flrgsr

@deep-introspection
Contributor

@dengemann I like Hebart's solution but this may be computationally demanding... By the way, to what extent does MNE handle distributed computing?

@dengemann
Member Author

By the way, to what extent does MNE handle distributed computing?

DIY style ...

@agramfort
Member

agramfort commented Mar 12, 2015 via email

@dengemann
Member Author

with joblib

@agramfort I think @deep-introspection is referring to distributed computing across machines, MPI style
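
For reference, a minimal joblib pattern for parallelizing permutations over local cores (the one_permutation body is a made-up placeholder); joblib does not do MPI-style multi-machine distribution out of the box.

import numpy as np
from joblib import Parallel, delayed

def one_permutation(seed):
    # placeholder for one permuted fit: shuffle labels with `seed`, refit, score
    rng = np.random.RandomState(seed)
    return rng.rand()  # stand-in for a decoding score

# run the permutations in parallel on the local machine
null_scores = Parallel(n_jobs=-1)(
    delayed(one_permutation)(seed) for seed in range(100))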

@deep-introspection
Contributor

@dengemann Indeed, but thanks @agramfort for the joblib tip, sounds pretty cool!

@agramfort
Member

agramfort commented Mar 12, 2015 via email

@dengemann
Member Author

Salzberg says:

"(It may be tempting, because it is easy to do with powerful computers, to run many cross
validations on the same data set, and report each cross validation as a single trial. However this would not produce valid statistics, because the trials in such a design are highly
interdependent.)."

On the other hand, this is essentially what Eugster and colleagues propose and what is implemented in the caret resamples function, if I'm not mistaken. It computes confidence intervals across resamples based on a t-statistic. The nuances here concern using the bootstrap for model comparison instead of e.g. a repeated 10-fold CV.

(cc @topepo)


@dengemann
Member Author

cc @cmoutard

@choldgraf
Contributor

Checking in to see if you guys have thought up anything on this issue. I'm trying to gauge whether my decoding scores are "significant" or not, and as has been mentioned here, doing permutation testing on something like this would take a long time :)

@topepo

topepo commented Dec 17, 2015

Without knowing too much about the details of the problem, my thoughts are:

  • Comparisons between resamples are probably preferable since they can drastically improve the power of the comparison (relative to a single static test set). Controlling for the resample-to-resample variability, which can be huge, is easy to do.
  • Permutation tests, depending on how you do them, tend to test the hypothesis that the signal is above the noise. If you are using them to evaluate a single model, knowing that the performance is better than noise can be a very low bar, and measuring the uncertainty of the performance metric (via confidence or credible intervals, as sketched below) is probably a better idea. A permutation test might be more applicable if you jointly permute two models and see if the difference in performance between them is above the experimental noise.
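
A rough sketch of such a confidence interval, with toy data and a placeholder estimator: a percentile bootstrap over the per-resample scores.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 20)            # toy features
y = rng.randint(0, 2, 100)        # toy labels

# One score per resample (here: per CV fold).
scores = cross_val_score(LogisticRegression(), X, y, cv=10)

# Percentile bootstrap over the resample scores -> rough 95% CI for the mean.
boot_means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                       for _ in range(2000)])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

As discussed below, the resamples are not independent, so an interval like this is optimistic; it quantifies uncertainty rather than providing a strict test.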

@choldgraf
Contributor

That sounds reasonable - though how is resampling possible within the current infrastructure for this class? It seems like right now we fit the model, and then have one call to score that will give the generalization matrix. It's not clear to me how resampling could be incorporated into that work-flow as is (maybe you'd have to run "score" for each iteration of the cross-validation loop or something?)

@dengemann
Member Author

@choldgraf let's set the API issue aside for a moment. @topepo thanks a lot for your highly appreciated feedback. But as far as I understand it, and as it seems with your R code, the resampling you suggest assumes that the bootstraps / folds / resamples are independent, but they are not. The degrees of freedom will hence be erroneous. Am I overlooking something?

@topepo

topepo commented Dec 17, 2015

You are correct about the independence. For example, two bootstrap samples of the first ten integers might be

1, 1, 1, 1, 2, 2, 4, 6, 7, 9
1, 1, 2, 2, 3, 4, 4, 7, 9, 10

so that the holdouts are:

3,  5,  8, 10
5, 6, 8

Since there are common samples in the holdouts, they are not independent. It is possible to use something like a linear mixed model to account for the within-sample correlations between the resampling statistics generated by these holdouts. However, the resample-to-resample variation is typically huge in comparison and accounting for that will deal with the lion's share of the correlation structure.

The simplest thing to do is to compare two models by testing the differences in their performance at the resample level. So if resample holdout #1 has an accuracy of 90% for model A and 95% for model B, evaluate the difference of 5%. Using all of the resample-specific accuracy differences, we can test whether their average is zero. That's basically a paired t-test. Ignoring the sample-to-sample correlation reduces the power of the test but you can probably compensate for that by doing more resamples (if that is a concern).

A model that accounts for both the resample-to-resample and sample-to-sample correlations would be very hard to parameterize and estimate.
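
A minimal sketch of this resample-level paired comparison, with made-up data and models (model B here is a dummy baseline; any two models scored on identical resamples would do):

import numpy as np
from scipy.stats import ttest_1samp
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(120, 10)            # toy features
y = rng.randint(0, 2, 120)        # toy labels

# Identical resamples for both models so their scores can be paired fold-by-fold.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
scores_a = cross_val_score(LogisticRegression(), X, y, cv=cv)
scores_b = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=cv)

# Paired test: is the mean per-resample difference zero?
diffs = scores_a - scores_b
t_stat, p_value = ttest_1samp(diffs, 0.0)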

@dengemann
Member Author

The simplest thing to do is to compare two models by testing the differences in their performance at the resample level. So if resample holdout #1 has an accuracy of 90% for model A and 95% for model B, evaluate the difference of 5%. Using all of the resample-specific accuracy differences, we can test whether their average is zero. That's basically a paired t-test. Ignoring the sample-to-sample correlation reduces the power of the test but you can probably compensate for that by doing more resamples (if that is a concern).

So in other words, we alleviate this problem by slightly modifying the question. A further trick could then be to test e.g. a "serious" model against a dummy classifier (predicting the top category or some other silly pattern). But then I still don't see the real difference between doing a paired t-test and a one-sample t-test for a given model, e.g. setting the chance level to zero by subtracting 0.5 in a 2-class case. Intuitively, and this is what people have recently argued (see refs above), this does not permit correctly performing NHST, as the statistics will be erroneous (samples are included in multiple folds, which is usually not the case for e.g. experimental statistics where paired tests apply).

@topepo

topepo commented Dec 17, 2015

I still don't see the real difference between doing a paired t-test and a one sample t-test for a given model

For simple 2-group comparisons, there isn't too much of a difference.

the statistics will be erroneous

Well yes but that is a pretty black and white way of viewing it.

With the t-test analogy, a person could do a regular t-test under an experimental design where there is a correlation structure. The result is that the variance of the difference is artificially inflated (bad). However, suppose that there is missing data and many of the differences cannot be estimated. In that case, it might be better to ignore the correlation since the loss of power due to missing data is worse than overestimating the variance of the difference. As another example, when the correlation is near zero, there is literature that claims that you are better off (statistically) not estimating the covariance parameter at all.

I don't think that a comprehensively "correct" model can be estimated from these data, so we can fit one that has the smallest approximation error. "All models are wrong..."

@dengemann
Member Author

With the t-test analogy, a person could do a regular t-test under an experimental design where there is a correlation structure. The result is that the variance of the difference is artificially inflated (bad). However, suppose that there is missing data and many of the differences cannot be estimated. In that case, it might be better to ignore the correlation since the loss of power due to missing data is worse than overestimating the variance of the difference. As another example, when the correlation is near zero, there is literature that claims that you are better off (statistically) not estimating the covariance parameter at all.
I don't think that a comprehensively "correct" model can be estimated from these data, so we can fit one that has the smallest approximation error. "All models are wrong..."

I sympathise a lot with this view. I'm just trying to think about what we can recommend to users who have to deal with reviewers from e.g. psychology and biology. I think the catch is that they would argue that this procedure is not conservative enough and generates too many false positives. What is your "real-world" impression from industry use cases regarding resampling-based testing on this one? Thanks for the discussion, Max.

@topepo

topepo commented Dec 17, 2015

What is your "real-world" impression from industry use cases regarding resampling-based testing on this one?

Honestly, nobody has ever questioned it. That may not be related to whatever level of statistical rigor that it has; people don't seem to give a @$%^ about this aspect of modeling =]

Internally, during project discussions I think that the focus is on the applicability of the model. Externally, most of the reviewers who are not in JMLR or similar journals don't know much about the subject at all and focus their questions on bigger picture issues (e.g. "why didn't you just use logistic regression?")

@dengemann
Member Author

Honestly, nobody has ever questioned it. That may not be related to whatever level of statistical rigor that it has; people don't seem to give a @$%^ about this aspect of modeling =]

:)

Internally, during project discussions I think that the focus is on the applicability of the model. Externally, most of the reviewers who are not in JMLR or similar journals don't know much about the subject at all and focus their questions on bigger picture issues (e.g. "why didn't you just use logistic regression?")

yeah. good point, seems familiar.

@kingjr
Member

kingjr commented Dec 19, 2015

I'm getting a bit lost in your arguments; where do we go from here?

Also, what would be our typical use case for across-trials statistics? ECoG (because electrode locations are subject-specific)? Anything else?

Checking in to see if you guys have thought up anything on this issue. I'm trying to gauge whether my decoding scores are "significant" or not, and as has been mentioned here, doing permutation testing on something like this would take a long time :)

Can't you use second-order stats across subjects? i.e.

import numpy as np

scores = list()
for subject in subjects:
    # fit and score within each subject (e.g. with cross-validation)
    clf.fit(X[subject], y[subject])
    scores.append(clf.score(X[subject], y[subject]))
# second-order (across-subject) test of the scores against chance
p_value = stats(np.array(scores) - chance_level)
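
One way the stats(...) placeholder could be filled, purely as a sketch with made-up array shapes and a toy chance level: a one-sample cluster permutation test across subjects on the time-generalization scores, via mne.stats.permutation_cluster_1samp_test.

import numpy as np
from mne.stats import permutation_cluster_1samp_test

# Toy stand-in for stacked per-subject generalization matrices
# (n_subjects, n_train_times, n_test_times); in practice these would be
# the scores returned by the decoding/generalization pipeline.
rng = np.random.RandomState(0)
scores = 0.5 + 0.05 * rng.randn(15, 50, 50)
chance = 0.5

# Second-order (across-subject) cluster test of scores against chance,
# based on sign-flip permutations of (scores - chance).
t_obs, clusters, cluster_pv, h0 = permutation_cluster_1samp_test(
    scores - chance, n_permutations=1000, tail=1)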

@banilo

banilo commented Dec 19, 2015

My question would be: What is the null hypothesis that you would like to reject to establish statistical significance in a cross-trial setting?

@kingjr
Member

kingjr commented Dec 19, 2015

What is the null hypothesis that you would like to reject to establish statistical significance in a cross-trial setting?

On the assumption that the recorded individual is representative of the population, what is our confidence that the neural marker X under study reliably correlates with the experimental condition y?

@choldgraf
Contributor

@kingjr ah I should have been more clear. The use case I was talking about was ECoG - I don't usually do statistics across subjects, it's almost always at the electrode level (or maybe within a subject across electrodes). So in my mind, for a single electrode, I'd like to be able to put confidence bounds on the diagonal scores at least. I know that doing this is kind of murky statistical territory (because it depends on the kinds of cross validation you do etc).

In terms of null hypotheses, I think that's a tricky question, as people have already alluded to here. Just saying "is the reported score different from chance level" seems like too liberal a statistic. Maybe some kind of "dumb" classifier as @dengemann suggests, training the same classifier on baseline data, or doing permutation testing to get an upper/lower bound on your null scores. But that's starting to get pretty heavy computationally...

@banilo

banilo commented Dec 19, 2015

Indeed, in a classification setting, your null hypothesis could be the performance of a dummy classifier (always choosing the majority class or always choosing a class at random).

That may be a fairer comparison...


@choldgraf
Contributor

So always choosing a random class sounds rather nice to me, because that would be super fast to simulate :). But sadly, ease of simulation does not make for statistical rigor, so I'm probably not qualified to determine whether that's legit or not.

@kingjr
Member

kingjr commented Dec 19, 2015

always choosing the majority class or always choosing a class at random

One of the issues is that there is often more than one way of being dumb (this applies outside the present topic too): in an imbalanced scenario, choosing the majority class or a random class gives you two different chance levels.

I think the discussion confirms that label permutation (i.e. [shuffle, fit, score] x n_perms) is the only formally valid approach, but it is computationally extremely expensive. In practice, I suggest we i) indicate the theoretical chance level, ii) do a bunch of permuted fits on a subset of the data (typically, on the baseline only), and/or iii) ignore the issue and explicitly indicate in the text that we ignore the non-independence generated by the cross-validation.
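
For option ii), a rough sketch using scikit-learn's permutation_test_score, which implements the [shuffle, fit, score] x n_perms loop (the baseline-only data and estimator here are toy stand-ins):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

rng = np.random.RandomState(0)
X_baseline = rng.randn(100, 20)   # toy features from the baseline window only
y = rng.randint(0, 2, 100)

# [shuffle labels, refit, score] x n_permutations: formally valid but costly,
# hence the suggestion to run it only on a subset of the data.
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(), X_baseline, y, cv=5, n_permutations=200)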

@banilo

banilo commented Dec 19, 2015

One of the issues is that there is often more than one way of being dumb (this applies outside the present topic too): in an imbalanced scenario, choosing the majority class or a random class gives you two different chance levels.

I agree. One could perhaps compute both chance levels and then pick the more conservative one as the null hypothesis (i.e., the one with the higher accuracy score).
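
A small sketch of that idea with made-up, deliberately imbalanced data: estimate both "dumb" chance levels empirically with dummy classifiers and keep the higher (more conservative) one.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(90, 5)
y = np.r_[np.zeros(60), np.ones(30)]           # deliberately imbalanced (2:1)

chance_levels = {}
for strategy in ("most_frequent", "uniform"):  # majority class vs. random class
    dummy = DummyClassifier(strategy=strategy, random_state=0)
    chance_levels[strategy] = cross_val_score(dummy, X, y, cv=5).mean()

# The more conservative null is the dumb strategy with the higher score.
conservative_chance = max(chance_levels.values())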

@kingjr
Member

kingjr commented Mar 16, 2016

Some more materials: @qnoirhomme has a paper relevant to this issue.

@kingjr
Member

kingjr commented Aug 31, 2016

Kai Goergen and Martin Hebart argue in favor of using prevalence testing instead of second-order stats: https://arxiv.org/abs/1512.00810; they have a Matlab implementation in their toolbox.

@kingjr
Member

kingjr commented Aug 31, 2016

Kriegeskorte et al. use the bootstrap and add a 'noise ceiling' estimate:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3990488/

It would be great if these stats functions were implemented here too, if anyone is interested.

@jona-sassenhagen
Contributor

This still seems like an important topic to me. @kingjr @dengemann etc

@dengemann
Member Author

dengemann commented May 5, 2019 via email

@MaryZolfaghar

I just want to check whether anyone has thought up any ideas on this issue. I am also trying to analyze my decoding scores (using an SVM for two classes) and see whether the scores are significant or not, using permutation testing and cluster correction.

@agramfort
Member

agramfort commented Feb 10, 2020 via email
