Group indexing can give different solutions if not reset #1

Open
ff1201 opened this issue Nov 28, 2024 · 0 comments

ff1201 commented Nov 28, 2024

Issue: If the group indexing does not run from 1 to m (the number of groups), dfr_sgl can return different solutions. Two conditions are needed to avoid this:

  1. The group indexes should not have gaps. For example, 46,47,49 has a gap because 48 is missing.
  2. The group indexes should be ordered (sorted in increasing order across the variables) and start from 1.
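
A minimal sketch of how the two conditions could be checked before fitting. The helper name groups_are_canonical is hypothetical and not part of the package; it only illustrates the conditions as stated above.

```r
# Hypothetical helper (not part of dfr): TRUE only if both conditions hold.
groups_are_canonical <- function(groups) {
  ids <- sort(unique(groups))
  # Condition 1 plus "start from 1": the distinct labels are exactly 1, 2, ..., m
  no_gaps_from_one <- identical(as.numeric(ids), as.numeric(seq_along(ids)))
  # Condition 2: the labels are non-decreasing across the variables
  ordered <- !is.unsorted(groups)
  no_gaps_from_one && ordered
}

groups_are_canonical(c(1, 1, 2, 3))  # TRUE
groups_are_canonical(c(46, 47, 49))  # FALSE: does not start from 1 and has a gap
groups_are_canonical(c(2, 1, 1, 3))  # FALSE: labels are 1..3 but not ordered
```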

Example:

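library(dfr)  # assumed to provide dfr_sgl

# Relabel the groups as 1..m (removes gaps) without changing the variable order
reorder_group <- function(groups){
    max_grp_id = length(unique(groups))
    new_grp = rep(0,length(groups))
    # distinct group labels in increasing order
    all_grp_indices = as.numeric(names(table(groups)))
    for (i in 1:max_grp_id){
        var_ids = which(groups == all_grp_indices[i])
        new_grp[var_ids] = i
    }
    return(new_grp)
}
set.seed(1)
groups = sample(1:50,size=100,replace=TRUE)  # unordered labels with gaps
data=sgs::gen_toy_data(p=100,n=50,grouped=FALSE)
grp_new = reorder_group(groups)  # gaps removed, still unordered
order_grp = order(grp_new,decreasing=FALSE)  # permutation that sorts the variables by group
# Fit with the original labels (gaps present)
sgl_model = dfr_sgl(X=data$X,y=data$y,groups=groups)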

This causes an error:

Error in if (any(pen_var_org < 0) | any(pen_grp_org < 0)) { : 
  missing value where TRUE/FALSE needed

The error is due to the screening indices for the groups having gaps. Removing the gaps fixes the error, but unless condition 2 above is also met, the solutions will be slightly different (the mean squared differences computed below are non-zero):

sgl_model = dfr_sgl(X=data$X,y=data$y,groups=grp_new, intercept=FALSE)
sgl_model_2 = dfr_sgl(X=data$X[,order_grp],y=data$y,groups=grp_new[order_grp], intercept=FALSE)
mean((sgl_model$beta[order_grp,20] - sgl_model_2$beta[,20])^2)

sgl_model = dfr_sgl(X=data$X,y=data$y,groups=groups, screen=FALSE, intercept=FALSE)
sgl_model_2 = dfr_sgl(X=data$X[,order_grp],y=data$y,groups=grp_new[order_grp], screen=FALSE, intercept=FALSE)
mean((sgl_model$beta[order_grp,20] - sgl_model_2$beta[,20])^2)

The difference persists with screen=FALSE, so it does not seem to be caused by the screening. It is more likely that the L2 norm computation relies at some point on the group indexing being ordered and starting from 1. Needs further investigation.

Current solution: added a warning telling the user to sort their groups if they are detected to be unsorted.
Future solution: sort the group indexes and the columns of X accordingly before the fitting starts, then reverse the sorting before outputting the results. The first step is simple to implement, but the second requires re-sorting many different outputs. Need to think of a simpler way to do this.
Temporary user fix: a user can avoid this behaviour by processing the group indices to meet both conditions and reordering the columns of X accordingly, as is done in the example above (sgl_model_2). The indices returned for the various metrics (such as selected_var) refer to the order in which the variables were passed to the function.
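
A minimal sketch of the sort-then-reverse idea, covering a single coefficient vector only. The names fit_sorted and fit_fn are hypothetical and not part of dfr: fit_fn stands in for any fitting routine that returns coefficients in the column order it receives. Other outputs, such as selected_var, would need their own mapping back.

```r
# Sketch only: sort variables by group, fit, then undo the permutation.
fit_sorted <- function(X, y, groups, fit_fn) {
  ord <- order(groups)                                    # sorts the variables by group label
  grp_sorted <- match(groups[ord], sort(unique(groups)))  # relabel as 1..m, no gaps
  beta_sorted <- fit_fn(X[, ord, drop = FALSE], y, grp_sorted)
  beta <- numeric(length(beta_sorted))
  beta[ord] <- beta_sorted                                # back to the original column order
  beta
}

# Toy check with a stand-in "fit" that just returns the column sums
X <- matrix(1:12, nrow = 3)
groups <- c(7, 2, 7, 5)
fit_sorted(X, 1:3, groups, function(X, y, g) colSums(X))
```

Assigning into beta[ord] applies the inverse permutation without computing it explicitly, which is equivalent to indexing beta_sorted with order(ord).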

@ff1201 ff1201 added the bug Something isn't working label Nov 28, 2024
@ff1201 ff1201 self-assigned this Nov 28, 2024