Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fastMCD algorithm in StatsBase ? #326

Open
romainFr opened this issue Dec 14, 2017 · 1 comment
Open

fastMCD algorithm in StatsBase ? #326

romainFr opened this issue Dec 14, 2017 · 1 comment

Comments

@romainFr
Copy link

Would there be interest in adding a function for minimum covariance determinant estimator (MCD), for robust covariance estimation of multivariate data ? The reference is :

Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212-223.

I wrote this simple version some time ago, it could become a PR if there's interest. If so, where would it belong, StatsBase or MultivariateStats ?

function fastMCD(X,p = ceil(Int,(sum(size(X))+1)/2);nrepeats=500)
    rng = Base.GLOBAL_RNG
    hMin = nothing
    sMin = Inf
  
    for i in 1:nrepeats
        idx = randperm(size(X,1))
        h1 = X[idx[1:p],:]    
        s0 = 0
        s1= 1
        while ((det(s1)!=det(s0)) & (det(s1)!=0))
            h0 = h1
            s0 = cov(h0)
            m = vec(mean(h0,1))
            Dis = vec(mapslices(x -> mahalanobis(x,m,inv(s0)),X,2))
            ord = sortperm(Dis)
            h1 = X[ord[1:p],:]
            s1=cov(h1)
       end
       
        if det(s1)<det(sMin)
            hMin = h1
            sMin = s1
        end
       
    end
    ## Reweighting
    sfull = cov(hMin)
    tmcd = vec(mean(hMin,1))
    dfull = vec(mapslices(x -> mahalanobis(x,tmcd,inv(sfull)),hMin,2))
    smcd = (median(dfull.^2)/pdf(Chisq(size(X,2)),0.5))*sfull
    dmcd = vec(mapslices(x -> mahalanobis(x,tmcd,inv(smcd)),hMin,2));
    w = FrequencyWeights(((dmcd.^2).<pdf(Chisq(size(X,2)),0.975))*1)
    t1 = mean(hMin,w,1)
    s1 = cov(hMin,w,corrected=true)
    (t1,s1)
end
@andreasnoack
Copy link
Member

Would be great to get but it should probably go into MultivariateStats since this package should mainly contain common and typically a bit old-fashioned algorithms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants