fastMCD algorithm in StatsBase? #326

Open
@romainFr

Would there be interest in adding a function for the minimum covariance determinant (MCD) estimator, for robust covariance estimation of multivariate data? The reference is:

Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212-223.

I wrote this simple version some time ago; it could become a PR if there's interest. If so, where would it belong: StatsBase or MultivariateStats?

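using LinearAlgebra, Random, Statistics
using Distributions, StatsBase

# Squared Mahalanobis distance of each row of X from center m under covariance S.
sqmahal(X, m, S) = [dot(x - m, S \ (x - m)) for x in eachrow(X)]

# X is n × d with observations in rows; h is the size of the subsets searched.
function fastMCD(X, h = ceil(Int, (sum(size(X)) + 1) / 2); nrepeats = 500)
    n, d = size(X)
    hMin = nothing
    detMin = Inf

    for i in 1:nrepeats
        # Start from a random h-subset, then concentrate until the
        # covariance determinant stops decreasing (or the fit is singular).
        h1 = X[randperm(n)[1:h], :]
        det0, det1 = Inf, det(cov(h1))
        while det1 < det0 && det1 > 0
            det0 = det1
            m = vec(mean(h1, dims = 1))
            ord = sortperm(sqmahal(X, m, cov(h1)))
            h1 = X[ord[1:h], :]
            det1 = det(cov(h1))
        end

        # Keep the subset with the smallest determinant seen so far
        if det1 < detMin
            hMin = h1
            detMin = det1
        end
    end

    ## Reweighting
    # Distances are taken over all rows of X, and the cutoffs are chi-square
    # quantiles (not density values), with d degrees of freedom.
    sfull = cov(hMin)
    tmcd = vec(mean(hMin, dims = 1))
    dfull = sqmahal(X, tmcd, sfull)
    smcd = (median(dfull) / quantile(Chisq(d), 0.5)) * sfull
    dmcd = sqmahal(X, tmcd, smcd)
    w = FrequencyWeights(Int.(dmcd .< quantile(Chisq(d), 0.975)))
    t1 = mean(X, w, dims = 1)
    s1 = cov(X, w, corrected = true)
    (t1, s1)
end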
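
The inner `while` loop above repeats what the cited paper calls a C-step (concentration step): fit a mean and covariance on the current subset, then keep the points closest to that fit. The paper shows that this step never increases the covariance determinant. Below is a minimal, self-contained sketch of one such step using only the standard library. The helper name `cstep`, the synthetic data, and the subset size are illustrative assumptions, not part of the proposed function.

```julia
using LinearAlgebra, Random, Statistics

# One concentration step: fit mean/covariance on the rows `idx` of X, then
# return the indices of the rows closest to that fit (same subset size).
# `cstep` is a hypothetical helper name used only for this illustration.
function cstep(X, idx)
    sub = X[idx, :]
    m = vec(mean(sub, dims = 1))
    S = cov(sub)
    # Squared Mahalanobis distance of every row of X from the subset fit
    d2 = [dot(x - m, S \ (x - m)) for x in eachrow(X)]
    return sortperm(d2)[1:length(idx)]
end

Random.seed!(1)
# Made-up data: 80 clean points plus 20 outliers shifted by 10 in each coordinate
X = vcat(randn(80, 2), randn(20, 2) .+ 10)
h = 52                                  # ceil((n + d + 1) / 2) for n = 100, d = 2
idx0 = randperm(size(X, 1))[1:h]        # random starting subset
idx1 = cstep(X, idx0)                   # subset after one C-step

det0 = det(cov(X[idx0, :]))
det1 = det(cov(X[idx1, :]))
println((det0, det1))
```

Comparing `det0` and `det1` shows the effect of a single step. Repeating `cstep` until the determinant stops decreasing, from many random starts, is the search that the function above performs before reweighting.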
