
variance decomposition issues across several models #60

@pierpaolocreanza

Description


Hi there,

Thank you for putting together this package! I'm trying to estimate several two-way models for my application and I get puzzling results. Here's a summary by case.

1. Standard AKM, without collapsing at the spell level and using leave-out-observation methods for HE.

The HE-corrected estimates do not sum to one. The most suspicious change relative to the HO correction is in var(eps), but it is not obvious that that is where the problem lies.
[screenshot: variance decomposition results, case 1]

This is my pytwoway setup for these results:

import pandas as pd
import bipartitepandas as bpd
import pytwoway as tw

# Prepare input DataFrame for PyTwoWay
input_df = pd.DataFrame({
    'i': panel['cluster_id'].values,
    'j': panel['hgroup_id_temp'].values,
    'y': panel['outcome'].values,
    't': panel['year'].values
})

# Set cleaning parameters
clean_params = bpd.clean_params({
    'connectedness': 'leave_out_observation',
    'drop_single_stayers': True,
    'collapse_at_connectedness_measure': False,
    'copy': False
})

bdf = bpd.BipartiteDataFrame(input_df, track_id_changes=True)

# Clean the data
bdf = bdf.clean(clean_params)

# Estimation params
fe_params = tw.fe_params({
    'ncore': 12,
    'ndraw_trace_sigma_2': 200,
    'ndraw_trace_ho': 200,
    'he': True,
    'ndraw_trace_he': 200,
    'ndraw_lev_he': 1800,
    'attach_fe_estimates': True,
    'preconditioner': 'ichol'
})

# Variance components to estimate
fe_params['Q_var'] = [tw.Q.VarPsi(), tw.Q.VarAlpha()]
fe_params['Q_cov'] = [tw.Q.CovPsiAlpha()]

# Initialize and fit the FE estimator, then save results
fe_estimator = tw.FEEstimator(bdf, fe_params)
fe_estimator.fit()
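For concreteness, this is the identity I expect the corrected estimates to satisfy. The dictionary keys below are just shorthand for the columns in the screenshot (placeholder numbers, not my actual estimates, and not necessarily the literal `fe_estimator.res` keys):

```python
# Plug-in check of the decomposition identity
# var(y) = var(psi) + var(alpha) + 2*cov(psi, alpha) + var(eps).
res = {
    'var(y)': 1.00,
    'var(psi)_he': 0.20,
    'var(alpha)_he': 0.45,
    'cov(psi, alpha)_he': 0.05,
    'var(eps)_he': 0.25,
}

explained = (res['var(psi)_he'] + res['var(alpha)_he']
             + 2 * res['cov(psi, alpha)_he'] + res['var(eps)_he'])
share = explained / res['var(y)']
print(f"components sum to {share:.3f} of var(y)")  # should be ~1
```

With my HE numbers this share is visibly different from one, which is what prompted this issue.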

2. Standard AKM, collapsing at the spell level and using leave-out-spell methods for HE.

I run exactly the same code on the same data as above but change the clean_params to:

clean_params = bpd.clean_params({
    'connectedness': 'leave_out_spell',
    'drop_single_stayers': True,
    'collapse_at_connectedness_measure': True,
    'copy': False
})
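For what it's worth, my understanding of what collapsing at the spell level does, as a plain-pandas sketch (this is my mental model, not the bipartitepandas implementation):

```python
import pandas as pd

# Toy panel: worker i, firm j, outcome y, year t
df = pd.DataFrame({
    'i': [1, 1, 1, 1, 2, 2],
    'j': [10, 10, 20, 20, 10, 10],
    'y': [1.0, 3.0, 2.0, 4.0, 5.0, 7.0],
    't': [2000, 2001, 2002, 2003, 2000, 2001],
})

# A spell = a maximal run of consecutive observations of a worker at one firm
df = df.sort_values(['i', 't'])
new_spell = (df['i'] != df['i'].shift()) | (df['j'] != df['j'].shift())
df['spell'] = new_spell.cumsum()

# Collapse: one row per spell with the mean outcome
collapsed = df.groupby('spell').agg(
    i=('i', 'first'), j=('j', 'first'),
    y=('y', 'mean'), t=('t', 'first'),
).reset_index(drop=True)
print(collapsed)  # 3 spells: (1,10) y=2.0, (1,20) y=3.0, (2,10) y=6.0
```

If that understanding is wrong, it might explain why my collapsed results look so different from case 1.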

Yet my results are even more puzzling: var(eps) now exceeds var(y) for all models---even plain FE---and by a large amount, and var(alpha)_he is negative, which clearly cannot be the case.

[screenshot: variance decomposition results, case 2]
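One hypothesis I have is that the stochastic trace approximations behind the bias corrections are simply noisy at my draw counts, and that a noisy correction subtracted from a small component could push it negative. As a toy illustration of how a Hutchinson-type trace estimator fluctuates with few draws (an analogy only, not pytwoway's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
A = A @ A.T / n  # symmetric PSD matrix with a well-defined trace

def hutchinson(A, ndraw, rng):
    # Rademacher probes: E[z' A z] = tr(A), variance shrinks with ndraw
    z = rng.choice([-1.0, 1.0], size=(A.shape[0], ndraw))
    return np.mean(np.einsum('id,id->d', z, A @ z))

true_tr = np.trace(A)
for ndraw in (10, 200, 5000):
    est = hutchinson(A, ndraw, rng)
    print(ndraw, est, abs(est - true_tr) / true_tr)
```

The relative error shrinks roughly like 1/sqrt(ndraw), which is why I ask below whether raising the draw counts is likely to matter.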

3. BLM with no controls and without collapsing at the spell level (analogous to case 1)

My setup is very basic:

import numpy as np
import pandas as pd
import bipartitepandas as bpd
import pytwoway as tw

input_df = pd.DataFrame({
    'i': panel['cluster_id'],
    'j': panel['hgroup_id_temp'],
    'y': panel['patents'],
    't': panel['year']
})

nl = 3  # Number of worker types
nk = 4  # Number of firm types
blm_params = tw.blm_params({
    'nl': nl, 'nk': nk,
    'a1_mu': 0, 'a1_sig': 1.5, 'a2_mu': 0, 'a2_sig': 1.5,
    's1_low': 0.5, 's1_high': 1.5, 's2_low': 0.5, 's2_high': 1.5
})
cluster_params = bpd.cluster_params({
    'measures': bpd.measures.CDFs(),
    'grouping': bpd.grouping.KMeans(n_clusters=nk),
    'is_sorted': True,
    'copy': False
})
clean_params = bpd.clean_params({
    'drop_returns': 'returners',
    'copy': False
})
bdf = bpd.BipartiteDataFrame(input_df, track_id_changes=True)
bdf = bdf.clean(clean_params).collapse(is_sorted=True)
bdf = bdf.cluster(cluster_params)
bdf = bdf.to_eventstudy(is_sorted=True, copy=False)

# Split movers (jdata) and stayers (sdata)
movers = bdf.get_worker_m(is_sorted=True)
jdata = bdf.loc[movers, :]
sdata = bdf.loc[~movers, :]

blm_fit = tw.BLMEstimator(blm_params)
blm_fit.fit(jdata=jdata, sdata=sdata, n_init=100, n_best=5, ncore=12)
blm_model = blm_fit.model

var_decomp = tw.BLMVarianceDecomposition(blm_params)
var_decomp.fit(
    jdata=jdata, sdata=sdata, blm_model=blm_model, n_samples=100,
    Q_var=[tw.Q.VarPsi(), tw.Q.VarAlpha()], ncore=12
)

# Average the decomposition over samples and save
results = {k: np.mean(v) for k, v in var_decomp.res.items()}
results_df = pd.DataFrame.from_dict(results, orient='index', columns=['mean_value'])
results_df.to_csv(temp_dir / "blm_variance_decomposition.csv")
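In case the jdata/sdata split matters here: my understanding is that a mover is simply a worker observed at more than one distinct firm, i.e. (plain pandas, not what `get_worker_m` literally does internally):

```python
import pandas as pd

df = pd.DataFrame({
    'i': [1, 1, 2, 2, 3],
    'j': [10, 20, 10, 10, 30],
})

# A mover is a worker observed at more than one distinct firm
n_firms = df.groupby('i')['j'].transform('nunique')
movers = df[n_firms > 1]    # -> analogous to jdata
stayers = df[n_firms == 1]  # -> analogous to sdata
print(len(movers), len(stayers))  # 2 3
```

If the split should instead be done on the event-study rows in some other way, that could be part of my problem.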

Once again, var(eps) exceeds var(y):

[screenshot: BLM variance decomposition results, case 3]

What do you think could be going on? And do you have any suggestions for fixing it?
I haven't noticed a big difference from increasing the number of draws and iterations from the defaults to the levels I have now. They're not as high as they could be---I kept them manageable while I was still trying out different approaches. Do you think more draws/iterations would help? How high would you go? For reference, my non-collapsed connected set has about 3 million observations for 265k workers and 30k firms.
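For context on those sizes: the connected set I refer to is the one linked by worker moves, which I think of as the largest component of the worker-firm graph (a toy union-find sketch of that notion, not the leave-one-out set pytwoway actually computes):

```python
import pandas as pd

def largest_connected_firm_set(df):
    """Largest set of firms linked by workers who move between them."""
    parent = {}

    def find(x):
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link all firms that share a worker
    for firms in df.groupby('i')['j'].unique():
        for a, b in zip(firms, firms[1:]):
            union(a, b)

    comps = pd.Series({j: find(j) for j in df['j'].unique()})
    biggest = comps.value_counts().idxmax()
    return set(comps[comps == biggest].index)

df = pd.DataFrame({'i': [1, 1, 2, 2, 3],
                   'j': [10, 20, 20, 30, 40]})
print(largest_connected_firm_set(df))  # firms 10, 20, 30; firm 40 is isolated
```

I mention this only because the leave-out restriction prunes this set further, and I wonder whether how much gets pruned differs across my three cases.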

Could it be a version compatibility issue with some of the package dependencies?
