Conversation

@damandhaliwal
Contributor

New clean PR, updated for the changes to master. Addresses #675. @s3alfisc let me know what you think. Hoping to wrap this up this weekend.

If you could review the _nw_meat function in vcov_utils.py, that would be great. I just want to confirm my understanding and the implementation.

damandhaliwal marked this pull request as draft on August 27, 2025, 01:41

@s3alfisc
Member

Regarding tests, you can take a look at these:

def test_single_fit_feols(

Basically, we call r-fixest via rpy2 and then compare results. We can write a similar test for HAC and test fewer formulas?
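For illustration, a minimal sketch of that comparison pattern, assuming r-fixest is installed and reachable via rpy2. The simulated data, variable names, and lag choice are made up, and the pyfixest side is left as a comment because its HAC interface is exactly what this PR introduces:

import numpy as np
import rpy2.robjects as ro

# Reference SEs from r-fixest: simulate a simple time series in R and
# fit it with a Newey-West (HAC) vcov, lag = 2.
r_se = np.asarray(ro.r("""
suppressMessages(library(fixest))
set.seed(42)
n <- 200
df <- data.frame(time = 1:n, x = rnorm(n))
df$y <- 1 + 0.5 * df$x + rnorm(n)
fit <- feols(y ~ x, data = df, vcov = NW(2) ~ time)
se(fit)
"""))

# The pyfixest fit on the same data would then be compared against r_se,
# e.g. via np.testing.assert_allclose(...); the exact HAC arguments on the
# pyfixest side are omitted here since this PR defines them.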

@s3alfisc
Member

s3alfisc commented Aug 31, 2025

To-dos:

  • maybe check if the implementation can be improved in terms of performance (avoid memory allocations, use in-place operations if possible)
  • add tests for IV
  • add a test that the post-estimation update via .vcov() produces correct results
  • add tests for errors (e.g. when lag is not provided, or is a string) in test_errors.py
  • update docs
  • test that if time_id is not provided and the data is sorted, we get the same result as when setting time_id
  • maybe other things, will have to think this through

Also, should we require a time_id value to be set in the vcov_kwargs for the HAC estimator? Advantage: we make very explicit what happens. Disadvantage: if users run multiple models, just sorting the data once before feols() should improve performance? Maybe a compromise would be to add a warning informing users about the default behavior when they do not pass a time_id arg?


codecov bot commented Aug 31, 2025

Codecov Report

❌ Patch coverage is 6.50407% with 115 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pyfixest/estimation/vcov_utils.py | 9.80% | 46 Missing ⚠️ |
| pyfixest/estimation/feols_.py | 2.43% | 40 Missing ⚠️ |
| pyfixest/estimation/estimation.py | 0.00% | 15 Missing ⚠️ |
| pyfixest/utils/utils.py | 8.33% | 11 Missing ⚠️ |
| pyfixest/estimation/FixestMulti_.py | 33.33% | 2 Missing ⚠️ |
| pyfixest/estimation/feols_compressed_.py | 0.00% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (72b57a7) and HEAD (13be129).

HEAD has 3 uploads less than BASE

| Flag | BASE (72b57a7) | HEAD (13be129) |
|---|---|---|
| core-tests | 2 | 0 |
| tests-extended | 1 | 0 |

| Flag | Coverage Δ |
|---|---|
| core-tests | ? |
| tests-extended | ? |
| tests-vs-r | 15.59% <6.50%> (?) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| pyfixest/estimation/quantreg/QuantregMulti.py | 20.86% <ø> (ø) |
| pyfixest/estimation/feols_compressed_.py | 20.11% <0.00%> (-62.65%) ⬇️ |
| pyfixest/estimation/FixestMulti_.py | 16.19% <33.33%> (-64.29%) ⬇️ |
| pyfixest/utils/utils.py | 44.88% <8.33%> (-44.01%) ⬇️ |
| pyfixest/estimation/estimation.py | 9.19% <0.00%> (-80.75%) ⬇️ |
| pyfixest/estimation/feols_.py | 7.38% <2.43%> (-84.21%) ⬇️ |
| pyfixest/estimation/vcov_utils.py | 15.87% <9.80%> (-42.03%) ⬇️ |

... and 37 files with indirect coverage changes


@damandhaliwal
Contributor Author

Also, should we require a time_id value to be set in the vcov_kwargs for the HAC estimator? Advantage: we make very explicit what happens. Disadvantage: if users run multiple models, just sorting the data once before feols() should improve performance? Maybe a compromise would be to add a warning informing users about the default behavior when they do not pass a time_id arg?

Hmm, I can definitely see the advantage of requiring the time_id value. But would the performance gain really be that large? I will add this to the documentation.

@damandhaliwal
Contributor Author

damandhaliwal commented Sep 6, 2025

@s3alfisc I'm working on the implementation for the Newey-West panel meat. Let me know what you think of this function for checking balance; it's different from what fixest implements:

import numpy as np


def _check_balanced(panel_arr: np.ndarray, time_arr: np.ndarray) -> bool:
    """Return True if every panel unit is observed in every time period."""
    unique_panels = np.unique(panel_arr)
    unique_times = np.unique(time_arr)
    expected_time_count = len(unique_times)

    for panel_id in unique_panels:
        # time periods actually observed for this panel unit
        mask = panel_arr == panel_id
        panel_times = np.unique(time_arr[mask])

        if len(panel_times) != expected_time_count:
            return False

    return True

fixest's C++ implementation is below. I changed the function because it requires the panel IDs to be integers and in consecutive order, while we don't impose that:

  // checking the balance
  bool balanced = true;
  if(unit[0] != 1){
    balanced = false;
  } else {
    int n_unit = 1;
    int t_current = time[0];

    for(int i=1 ; i<N ; ++i){
      if(t_current != time[i]){
        // we change time period
        // we check all is fine
        if(n_unit != G){
          balanced = false;
          break;
        }

        // OK
        n_unit = 0;
        t_current = time[i];
      } else if(unit[i] - unit[i - 1] != 1){
        balanced = false;
        break;
      }

      ++n_unit;
    }
  }

@s3alfisc
Member

s3alfisc commented Sep 6, 2025

Oh wait a second, I actually have some code to share! Will push it now. One sec

@s3alfisc
Member

s3alfisc commented Sep 6, 2025

Changed the following:

  • Adds the meat matrix for panel HAC.
  • Requires users to always provide time_id.
  • Requires users to always provide lag for time-series HAC.
  • Corrects the time-series HAC default -> fixest uses bwNeweyWest and does not use the same default lag as for panel HAC, see attached screenshot.
  • Adds unit tests.

@s3alfisc
Member

s3alfisc commented Sep 6, 2025

What we still have to do:

  • optimize the HAC code for both time series and panel
  • implement DK-HAC? But we could do this in a separate PR.
  • I currently turn off the small-sample adjustments in the tests to make fixest & pyfixest match for panel HAC. I think the panel HAC in fixest might take nesting into account as well, which we handle for CRV errors here but currently ignore for panel HAC.

"The function argument `separation_check` must be a list of strings containing 'fe' and/or 'ir'."
)

if vcov_kwargs is not None:
Member

I am wondering if we should keep all these checks here, or move them to the vcov method? Because users can call vcov post-estimation and provide new arguments, and we would like to check these as well?

"""
return _capture_context(context + 2) if isinstance(context, int) else context

def _check_balanced(panel_arr: np.ndarray, time_arr: np.ndarray) -> bool:
Member

Can you add a unit test for it?

Btw, I think we could also have easily done this in pandas with a groupby(["panel_id", "time_id"]).size() and thrown an error if we ever see a count that is not one?

Plus, please use the @njit decorator + maybe we can parallelize?

Other comment: as we are already sorting in the _meat_hac_panel function, maybe we should call this function there and then make use of the fact that we have already sorted once? This should make things more efficient, right?
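
For reference, a minimal sketch of that pandas-based check (the column names panel_id and time_id are assumed). Note that a missing unit-period combination simply does not appear in the groupby result, so the sketch also compares the number of observed cells against the full grid:

import pandas as pd

def _check_balanced_pandas(df: pd.DataFrame) -> bool:
    # one row per observed (panel, time) cell, with the number of duplicates
    counts = df.groupby(["panel_id", "time_id"]).size()
    if (counts > 1).any():
        raise ValueError("Duplicate (panel_id, time_id) observations found.")
    # balanced iff every unit appears in every period
    n_cells = df["panel_id"].nunique() * df["time_id"].nunique()
    return len(counts) == n_cells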

Contributor Author

I guess I don't need this anymore after your latest commit.

@damandhaliwal
Contributor Author

What we still have to do:

  • implement DK-HAC? But we could do this in a separate PR.
  • I currently turn off the small-sample adjustments in the tests to make fixest & pyfixest match for panel HAC. I think the panel HAC in fixest might take nesting into account as well, which we handle for CRV errors here but currently ignore for panel HAC.

I think we should just implement DK-HAC as well in this PR while we still have this fresh in our minds. This way we can resolve more issues and potentially even push cleaner and more optimized code. I'll take a look at the nesting thing in fixest and see what we're missing.

@damandhaliwal
Contributor Author

Do we want to implement the same weight logic as _nw_meat and fixest here, with w[0] = 0.5 and the loop going from 0 to lag + 1?
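
For reference, a minimal sketch of that weighting scheme as I read it: Bartlett kernel weights, with the lag-0 weight halved because the lag-0 autocovariance is counted twice in the symmetrized sum (the function and argument names here are made up):

import numpy as np

def _nw_weights(lag: int) -> np.ndarray:
    # Bartlett kernel: w_l = 1 - l / (lag + 1), for l = 0, ..., lag
    w = 1.0 - np.arange(lag + 1) / (lag + 1)
    w[0] = 0.5  # Gamma_0 enters twice in the symmetrized sum below, so halve its weight
    return w

def _nw_meat_sketch(scores: np.ndarray, lag: int) -> np.ndarray:
    # scores: (T, k) array of per-period scores, already ordered by time
    T, k = scores.shape
    w = _nw_weights(lag)
    meat = np.zeros((k, k))
    for ell in range(lag + 1):
        # lag-ell autocovariance of the scores: sum_t s_t s_{t-ell}'
        gamma = scores[ell:].T @ scores[: T - ell]
        meat += w[ell] * (gamma + gamma.T)
    return meat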

@s3alfisc
Member

s3alfisc commented Sep 7, 2025

Looks like fixest by default treats all time lags as consecutive: link.

Edit: it looks like these options have only been added in fixest 0.13, which is currently in dev. I suggest that we wait until fixest 0.13.0 is released and then add all of this logic for handling lags, and assume that, until then, time-series data is always consecutive?

@s3alfisc
Member

s3alfisc commented Sep 7, 2025

I think we should just implement DK-HAC as well in this PR

Sounds good to me.

Do we want to implement the same weight logic as _nw_meat and fixest here, with w[0] = 0.5 and the loop going from 0 to lag + 1?

Yes, though I think it is not 100% needed as we basically match fixest already?


time_periods, k = ordered_scores.shape
time_scores = np.zeros((time_periods, k))

@s3alfisc
Member

I think this aggregation is wrong, no? We need to aggregate over a fixed t for each unit?

Member

Fixed with 707909a
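
For context, a minimal sketch of the per-period aggregation being discussed here: observation-level scores are summed over units within each time period before the time-series HAC kernel is applied. The names scores and time_ids (with time_ids as integer codes 0..T-1) are assumed:

import numpy as np

def _aggregate_scores_by_time(scores: np.ndarray, time_ids: np.ndarray) -> np.ndarray:
    # scores: (N, k) observation-level scores; time_ids: (N,) integer time codes
    n_periods = int(time_ids.max()) + 1
    time_scores = np.zeros((n_periods, scores.shape[1]))
    # accumulate all rows that share the same time index
    np.add.at(time_scores, time_ids, scores)
    return time_scores  # (T, k), ordered by time, ready for the NW kernel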
