
ENH: stats for ERF decoding and time generalization #1848

Open
dengemann opened this issue Mar 12, 2015 · 37 comments

Comments

@dengemann
Member

With @kingjr we're currently thinking about the right way to handle cross-trial significance testing for the analyses demonstrated in our decoding examples. This basically overlaps with one of our GSOC topics and with the discussion in the comments over here. Now practically, permutation testing or bootstrapping seems pretty heavy if it means resampling every model you fit; that would render these kinds of analyses intractable. On the flip side, there seems to be a problem with any test that makes assumptions about the degrees of freedom: for single-trial probability output, for example, the samples are not independent because of cross-validation. It seems that instead of bootstrapping the entire decoding or incorporating permutation testing into the cross-validation, as suggested by Martin Hebart (first comment here), we could just compute a permutation test post-hoc on the predictions. This would be much more tractable.
I was wondering what you think might be the right way to go. Also for the GSOC project it would be nice to have some discussion about this.

cc @agramfort @Eric89GXL @banilo
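
A rough sketch of the "permutation test post-hoc on the predictions" idea, with toy data and a placeholder estimator (not a proposal for the MNE API): run the expensive cross-validated prediction once, then permute the labels only at scoring time.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(42)
X = rng.randn(100, 20)            # toy single-trial features
y = rng.randint(0, 2, 100)        # toy binary labels

# Expensive part done once: cross-validated predictions for every trial.
y_pred = cross_val_predict(LogisticRegression(), X, y, cv=5)
observed = accuracy_score(y, y_pred)

# Cheap part repeated: permute the labels and re-score the fixed predictions.
n_permutations = 1000
null = np.array([accuracy_score(rng.permutation(y), y_pred)
                 for _ in range(n_permutations)])
p_value = (np.sum(null >= observed) + 1.0) / (n_permutations + 1.0)

Note that this only permutes at scoring time, which is what makes it tractable; it is not equivalent to refitting the model under each permutation.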

@dengemann
Member Author

cc @deep-introspection

@dengemann
Member Author

cc @flrgsr

@deep-introspection
Contributor

@dengemann I like Hebart's solution but this may be computationally demanding... By the way, to what extent does MNE handle distributed computing?

@dengemann
Member Author

By the way, to what extent does MNE handle distributed computing?

DIY style ...

@agramfort
Member

agramfort commented Mar 12, 2015 via email

@dengemann
Member Author

with joblib

@agramfort I think @deep-introspection is referring to distributed computing across machines, MPI style
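
For reference, a minimal joblib pattern for parallelizing permutations over local cores (the one_permutation body is a made-up placeholder); joblib does not do MPI-style multi-machine distribution out of the box.

import numpy as np
from joblib import Parallel, delayed

def one_permutation(seed):
    # placeholder for one permuted fit: shuffle labels with `seed`, refit, score
    rng = np.random.RandomState(seed)
    return rng.rand()  # stand-in for a decoding score

# run the permutations in parallel on the local machine
null_scores = Parallel(n_jobs=-1)(
    delayed(one_permutation)(seed) for seed in range(100))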

@deep-introspection
Contributor

@dengemann Indeed, but thanks @agramfort for the joblib tip, sounds pretty cool!

@agramfort
Member

agramfort commented Mar 12, 2015 via email

@dengemann
Member Author

Salzberg says:

"(It may be tempting, because it is easy to do with powerful computers, to run many cross
validations on the same data set, and report each cross validation as a single trial. However this would not produce valid statistics, because the trials in such a design are highly
interdependent.)."

On the other hand, this is essentially what Eugster and colleagues propose and what is implemented in the caret resamples function, if I'm not mistaken. It computes confidence intervals across resamples based on a t-statistic. The nuances here concern using the bootstrap for model comparison instead of e.g. a repeated 10-fold CV.

(cc @topepo)


@dengemann
Member Author

cc @cmoutard

@choldgraf
Contributor

Checking in to see if you guys have thought up anything on this issue. I'm trying to gauge whether my decoding scores are "significant" or not, and as has been mentioned here, doing permutation testing on something like this would take a long time :)

@topepo

topepo commented Dec 17, 2015

Without knowing too much about the details of the problem, my thoughts are:

  • Comparisons between resamples are probably preferable since they can drastically improve the power of the comparison (relative to a single static test set). Controlling for the resample-to-resample variability, which can be huge, is easy to do.
  • Permutation tests, depending on how you do them, tend to test the hypothesis that the signal is above the noise. If you are using them to evaluate a single model, knowing that the performance is better than noise can be a very low bar, and measuring the uncertainty of the performance metric (via confidence or credible intervals, as sketched below) is probably a better idea. A permutation test might be more applicable if you jointly permute two models and see if the difference in performance between them is above the experimental noise.
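
A rough sketch of such a confidence interval, with toy data and a placeholder estimator: a percentile bootstrap over the per-resample scores.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 20)            # toy features
y = rng.randint(0, 2, 100)        # toy labels

# One score per resample (here: per CV fold).
scores = cross_val_score(LogisticRegression(), X, y, cv=10)

# Percentile bootstrap over the resample scores -> rough 95% CI for the mean.
boot_means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                       for _ in range(2000)])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

As discussed below, the resamples are not independent, so an interval like this is optimistic; it quantifies uncertainty rather than providing a strict test.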

@choldgraf
Contributor

That sounds reasonable - though how is resampling possible within the current infrastructure for this class? It seems like right now we fit the model, and then have one call to score that will give the generalization matrix. It's not clear to me how resampling could be incorporated into that work-flow as is (maybe you'd have to run "score" for each iteration of the cross-validation loop or something?)

@dengemann
Member Author

@choldgraf let's set the API issue aside for a moment. @topepo thanks a lot for your highly appreciated feedback. But as far as I understand it, and as it seems with your R code, the resampling you suggest assumes that the bootstraps / folds / resamples are independent, but they are not. The degrees of freedom will hence be erroneous. Am I overlooking something?

@topepo

topepo commented Dec 17, 2015

You are correct about the independence. For example, two bootstrap samples of the first ten integers might be

1, 1, 1, 1, 2, 2, 4, 6, 7, 9
1, 1, 2, 2, 3, 4, 4, 7, 9, 10

so that the holdouts are:

3,  5,  8, 10
5, 6, 8

Since there are common samples in the holdouts, they are not independent. It is possible to use something like a linear mixed model to account for the within-sample correlations between the resampling statistics generated by these holdouts. However, the resample-to-resample variation is typically huge in comparison and accounting for that will deal with the lion's share of the correlation structure.

The simplest thing to do is to compare two models by testing the differences in their performance at the resample level. So if resample holdout #1 has an accuracy of 90% for model A and 95% for model B, evaluate the difference of 5%. Using all of the resample-specific accuracy differences, we can test whether their average is zero. That's basically a paired t-test. Ignoring the sample-to-sample correlation reduces the power of the test but you can probably compensate for that by doing more resamples (if that is a concern).

A model that accounts for both the resample-to-resample and sample-to-sample correlations would be very hard to parameterize and estimate.
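
A minimal sketch of this resample-level paired comparison, with made-up data and models (model B here is a dummy baseline; any two models scored on identical resamples would do):

import numpy as np
from scipy.stats import ttest_1samp
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(120, 10)            # toy features
y = rng.randint(0, 2, 120)        # toy labels

# Identical resamples for both models so their scores can be paired fold-by-fold.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
scores_a = cross_val_score(LogisticRegression(), X, y, cv=cv)
scores_b = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=cv)

# Paired test: is the mean per-resample difference zero?
diffs = scores_a - scores_b
t_stat, p_value = ttest_1samp(diffs, 0.0)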

@dengemann
Member Author

The simplest thing to do is to compare two models by testing the differences in their performance at the resample level. So if resample holdout #1 has an accuracy of 90% for model A and 95% for model B, evaluate the difference of 5%. Using all of the resample-specific accuracy differences, we can test whether their average is zero. That's basically a paired t-test. Ignoring the sample-to-sample correlation reduces the power of the test but you can probably compensate for that by doing more resamples (if that is a concern).

So in other words, we alleviate this problem by slightly modifying the question. A further trick could then be to test e.g. a "serious" model against a dummy classifier (predicting the top category or some other silly pattern). But then I still don't see the real difference between doing a paired t-test and a one-sample t-test for a given model, e.g. setting the chance level to zero by subtracting 0.5 in a 2-class case. Intuitively, and this is what people have recently argued (see refs above), this does not permit correctly performing NHST, as the statistics will be erroneous (samples are included in multiple folds, which is usually not the case for e.g. experimental statistics where paired tests apply).

@topepo

topepo commented Dec 17, 2015

I still don't see the real difference between doing a paired t-test and a one sample t-test for a given model

For simple 2-group comparisons, there isn't too much of a difference.

the statistics will be erroneous

Well yes but that is a pretty black and white way of viewing it.

With the t-test analogy, a person could do a regular t-test under an experimental design where there is a correlation structure. The result is that the variance of the difference is artificially inflated (bad). However, suppose that there is missing data and many of the differences cannot be estimated. In that case, it might be better to ignore the correlation since the loss of power due to missing data is worse than overestimating the variance of the difference. As another example, when the correlation is near zero, there is literature that claims that you are better off (statistically) not estimating the covariance parameter at all.

I don't think that a comprehensively "correct" model can be estimated from these data, so we can fit one that has the smallest approximation error. "All models are wrong..."

@dengemann
Member Author

With the t-test analogy, a person could do a regular t-test under an experimental design where there is a correlation structure. The result is that the variance of the difference is artificially inflated (bad). However, suppose that there is missing data and many of the differences cannot be estimated. In that case, it might be better to ignore the correlation since the loss of power due to missing data is worse than overestimating the variance of the difference. As another example, when the correlation is near zero, there is literature that claims that you are better off (statistically) not estimating the covariance parameter at all.
I don't think that a comprehensively "correct" model can be estimated from these data, so we can fit one that has the smallest approximation error. "All models are wrong..."

I sympathise a lot with this view. I'm just trying to think about what we can recommend to users who have to deal with reviewers from e.g. psychology and biology. I think the catch is that they would argue that this procedure is not conservative enough and generates too many false positives. What is your "real-world" impression from industry use cases regarding resampling-based testing on this one? Thanks for the discussion, Max.

@topepo

topepo commented Dec 17, 2015

What is your "real-world" impression from industry use cases regarding resampling-based testing on this one?

Honestly, nobody has ever questioned it. That may not be related to whatever level of statistical rigor that it has; people don't seem to give a @$%^ about this aspect of modeling =]

Internally, during project discussions I think that the focus is on the applicability of the model. Externally, most of the reviewers who are not in JMLR or similar journals don't know much about the subject at all and focus their questions on bigger picture issues (e.g. "why didn't you just use logistic regression?")

@dengemann
Member Author

Honestly, nobody has ever questioned it. That may not be related to whatever level of statistical rigor that it has; people don't seem to give a @$%^ about this aspect of modeling =]

:)

Internally, during project discussions I think that the focus is on the applicability of the model. Externally, most of the reviewers who are not in JMLR or similar journals don't know much about the subject at all and focus their questions on bigger picture issues (e.g. "why didn't you just use logistic regression?")

yeah. good point, seems familiar.

@kingjr
Member

kingjr commented Dec 19, 2015

I'm getting a bit lost in your arguments; where do we go from here?

Also, what would be our typical use case for across-trials statistics? ECoG (because electrode locations are subject-specific)? Anything else?

Checking in to see if you guys have thought up anything on this issue. I'm trying to gauge whether my decoding scores are "significant" or not, and as has been mentioned here, doing permutation testing on something like this would take a long time :)

Can't you use second-order stats across subjects? i.e.

import numpy as np

scores = list()
for subject in subjects:
    # fit and score within each subject (e.g. with cross-validation)
    clf.fit(X[subject], y[subject])
    scores.append(clf.score(X[subject], y[subject]))
# second-order (across-subject) test of the scores against chance
p_value = stats(np.array(scores) - chance_level)
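
One way the stats(...) placeholder could be filled, purely as a sketch with made-up array shapes and a toy chance level: a one-sample cluster permutation test across subjects on the time-generalization scores, via mne.stats.permutation_cluster_1samp_test.

import numpy as np
from mne.stats import permutation_cluster_1samp_test

# Toy stand-in for stacked per-subject generalization matrices
# (n_subjects, n_train_times, n_test_times); in practice these would be
# the scores returned by the decoding/generalization pipeline.
rng = np.random.RandomState(0)
scores = 0.5 + 0.05 * rng.randn(15, 50, 50)
chance = 0.5

# Second-order (across-subject) cluster test of scores against chance,
# based on sign-flip permutations of (scores - chance).
t_obs, clusters, cluster_pv, h0 = permutation_cluster_1samp_test(
    scores - chance, n_permutations=1000, tail=1)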

@banilo

banilo commented Dec 19, 2015

My question would be: What is the null hypothesis that you would like to reject to establish statistical significance in a cross-trial setting?

@kingjr
Member

kingjr commented Dec 19, 2015

What is the null hypothesis that you would like to reject to establish statistical significance in a cross-trial setting?

On the assumption that the recorded individual is representative of the population, what is our confidence that the neural marker X under study reliably correlates with the experimental condition y?

@choldgraf
Contributor

@kingjr ah I should have been more clear. The use case I was talking about was ECoG - I don't usually do statistics across subjects, it's almost always at the electrode level (or maybe within a subject across electrodes). So in my mind, for a single electrode, I'd like to be able to put confidence bounds on the diagonal scores at least. I know that doing this is kind of murky statistical territory (because it depends on the kinds of cross validation you do etc).

In terms of null hypotheses, I think that's a tricky question, as people have already alluded to here. Just saying "is the reported score different from chance level" seems like too liberal a statistic. Maybe some kind of "dumb" classifier as @dengemann suggests, training the same classifier on baseline data, or doing permutation testing to get an upper/lower bound on your null scores. But that's starting to get pretty heavy computationally...

@banilo

banilo commented Dec 19, 2015

Indeed, in a classification setting, your null hypothesis could be the performance of a dummy classifier (always choosing the majority class or always choosing a class at random).

That may be a fairer comparison...


@choldgraf
Contributor

So always choosing a random class sounds rather nice to me, because that would be super fast to simulate :). But sadly, ease of simulation does not make for statistical rigor, so I'm probably not qualified to determine whether that's legit or not.

@kingjr
Member

kingjr commented Dec 19, 2015

always choosing the majority class or always choosing a class at random

One of the issues is that there is often more than one way of being dumb (this applies outside the present topic too): in an imbalanced scenario, choosing the majority class or a random class gives you two different chance levels.

I think the discussion confirms that label permutation (i.e. [shuffle, fit, score] x n_perms) is the only formally valid approach, but it is computationally extremely expensive. In practice, I suggest we i) indicate the theoretical chance level, ii) do a bunch of permuted fits on a subset of the data (typically, on the baseline only), and/or iii) ignore the issue and explicitly indicate in the text that we ignore the non-independence generated by the cross-validation.
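
For option ii), a rough sketch using scikit-learn's permutation_test_score, which implements the [shuffle, fit, score] x n_perms loop (the baseline-only data and estimator here are toy stand-ins):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

rng = np.random.RandomState(0)
X_baseline = rng.randn(100, 20)   # toy features from the baseline window only
y = rng.randint(0, 2, 100)

# [shuffle labels, refit, score] x n_permutations: formally valid but costly,
# hence the suggestion to run it only on a subset of the data.
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(), X_baseline, y, cv=5, n_permutations=200)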

@banilo

banilo commented Dec 19, 2015

One of the issues is that there is often more than one way of being dumb (this applies outside the present topic too): in an imbalanced scenario, choosing the majority class or a random class gives you two different chance levels.

I agree. One could perhaps compute both chance levels and then pick the more conservative one as the null hypothesis (i.e., the one with the higher accuracy score).
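
A small sketch of that idea with made-up, deliberately imbalanced data: estimate both "dumb" chance levels empirically with dummy classifiers and keep the higher (more conservative) one.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(90, 5)
y = np.r_[np.zeros(60), np.ones(30)]           # deliberately imbalanced (2:1)

chance_levels = {}
for strategy in ("most_frequent", "uniform"):  # majority class vs. random class
    dummy = DummyClassifier(strategy=strategy, random_state=0)
    chance_levels[strategy] = cross_val_score(dummy, X, y, cv=5).mean()

# The more conservative null is the dumb strategy with the higher score.
conservative_chance = max(chance_levels.values())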

@kingjr
Member

kingjr commented Mar 16, 2016

Some more materials: @qnoirhomme has a paper relevant to this issue.

@kingjr
Member

kingjr commented Aug 31, 2016

Kai Goergen and Martin Hebart argue in favor of using prevalence testing instead of second-order stats: https://arxiv.org/abs/1512.00810; they have a Matlab implementation in their toolbox.

@kingjr
Member

kingjr commented Aug 31, 2016

Kriegeskorte et al. use the bootstrap and add a 'noise ceiling' estimate:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3990488/

It would be great if these stats functions were implemented here too, if anyone is interested.

@jona-sassenhagen
Contributor

This still seems like an important topic to me. @kingjr @dengemann etc

@dengemann
Member Author

dengemann commented May 5, 2019 via email

@MaryZolfaghar

I just want to check whether anyone has thought up any ideas on this issue. I am also trying to analyze my decoding scores (using an SVM for two classes) and see whether the scores are significant or not, using permutation testing and cluster correction.

@agramfort
Member

agramfort commented Feb 10, 2020 via email
