
[WIP] IdModel: Fix invalid promotion selection #3877

Draft
wants to merge 12 commits into main
Conversation

@naoyam (Collaborator) commented Feb 12, 2025

Fixes #3702

computeCoveredGroups is used in the loop promotion analysis to find a promotion ID that has dependencies on all input exact groups. Specifically, it traverses the exact graph in topological order and assigns to each exact group the set of input exact groups it depends on. This information is used to find promotion IDs, since a promotion ID of a loop group needs to cover the dependent input groups of all IDs of the given loop group.

Here's a simple example. Suppose there are two 2D input tensors, t0 and t1, and they are added together as follows:

t0: [i0, b1]
t1: [i2, i3]

t2 = t0 + t1
// t2: [i4, i5]

flatten_all_tensors();
inlineMost();

The resulting tensors and IDs look like below:

(figure omitted: the resulting tensors and their IDs)

Specifically, i6, i7 and i8 form a loop group, with i8 (or i7) being its promotion ID. The dependent input exact groups of i8 are {i0, i2, i4} and {i3, i5}. Note that broadcast IDs are ignored. Notice that this set covers all of the dependent groups of the member IDs of the loop group. That's obviously true for i7, as it's exact mapped with i8. It's also true for i6, which only has {i0, i2, i4} as its dependent group (the dependency on b1 is ignored as it's a broadcast ID).
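This broadcast-ignoring propagation can be sketched in Python as follows. This is a simplified illustration, not the actual C++ implementation in loop_promotion.cpp; the group names are shorthand labels for the exact groups in this example:

```python
def compute_covered_groups(input_groups, broadcast_groups, exprs):
    """Sketch of the covered-groups propagation.

    exprs: topologically ordered (input_groups, output_group) pairs.
    Returns a map from each exact group to the set of input exact
    groups it depends on. Broadcast groups cover nothing.
    """
    covered = {}
    for g in input_groups:
        covered[g] = set() if g in broadcast_groups else {g}
    for inputs, output in exprs:
        covered.setdefault(output, set())
        for inp in inputs:
            covered[output] |= covered[inp]
    return covered

A = "i0/i2/i4"   # exact group {i0, i2, i4}
B = "b1"         # broadcast group
C = "i3/i5"      # exact group {i3, i5}

# i6 = merge(i0, b1) on t0; i7/i8 = merge of the two concrete groups
exprs = [((A, B), "i6"), ((A, C), "i7/i8")]
cov = compute_covered_groups([A, B, C], {B}, exprs)

# cov["i7/i8"] covers both input groups, so it is a valid promotion;
# cov["i6"] covers only {i0, i2, i4} since b1 is a broadcast.
```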

computeCoveredGroups works fine with common transformation patterns, where we tend to do all merges first and then all splits. However, it has a problem when a split is followed by a merge. Consider a fusion with two input tensors, t0 and t1, as shown below:

t0: [i0, b1]
t1: [i2]

t2 = reshape(t1, {i4, i5});
t3 = t0 + t2
// t3: [i6, i7]

flatten_all_tensors();
inlineMost();

(figure omitted: the resulting tensors and their IDs after the reshape)

Similar to the previous case, the final merged IDs form a loop group of {i8, i9, i10}. In this case, it should be promoted to i9 or i10 (they are exact mapped, so it doesn't matter which one is picked). However, computeCoveredGroups assigns the exact group of {i2, i3} not just to the exact group of {i9, i10} but also to the exact groups of {i0, i4, i6} and {i5, i7}, as well as {i8}, since all of them have dependencies on {i2, i3}. This is problematic because findPromotionOfLoopGroup [may incorrectly choose](https://github.com/NVIDIA/Fuser/blob/main/csrc/id_model/loop_promotion.cpp#L926-L934), e.g., i8 as the promotion of the loop group.

The root cause of the issue lies in [computeCoveredGroups](https://github.com/NVIDIA/Fuser/blob/main/csrc/id_model/loop_promotion.cpp#L779-L822). It just propagates input exact groups down to consumer exact groups through exact expr groups, no matter what actual expr a group represents. Specifically, when the expr is a split, the input groups covered by the split input ID are simply propagated to the split output groups, but each output group does not actually cover the entire input groups; it should be considered to cover only its portion of the split. In the above example, {i0, i4, i6} and {i5, i7} should cover only the outer and inner outputs of the split of {i2, i3}, respectively. Then {i8} only inherits the outer split portion through {i0, i4, i6}, whereas {i9, i10} gets both the outer and inner split groups, which allows findPromotionOfLoopGroup to correctly identify {i9, i10} as the promotion of the loop group.
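To make the failure mode concrete, here is a minimal Python sketch (not the actual implementation) applying the naive propagation to this second example; the group labels are shorthand for the exact groups above:

```python
# Naive propagation applied to the split-then-merge example.
covered = {"i2/i3": {"i2/i3"}, "b1": set()}

# Split: naively forward the input coverage to BOTH outputs,
# losing the outer/inner distinction.
covered["i0/i4/i6"] = set(covered["i2/i3"])   # outer output
covered["i5/i7"] = set(covered["i2/i3"])      # inner output

# Merges: union of the input coverage.
covered["i8"] = covered["i0/i4/i6"] | covered["b1"]
covered["i9/i10"] = covered["i0/i4/i6"] | covered["i5/i7"]

# Both end up covering exactly {"i2/i3"}, so the promotion analysis
# cannot tell that i8 covers only the outer half of the split.
assert covered["i8"] == covered["i9/i10"] == {"i2/i3"}
```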

Previously, the coverage information was just a map from each exact ValGroup to the ValGroups of its dependent input groups. Since that is not enough to represent the coverage information as precisely as discussed above, this PR introduces an additional data structure, CoveredGroup, which represents one covered exact group of an exact group. Since each exact group may cover multiple groups, std::unordered_set<CoveredGroup> is used to represent all covered groups of an exact group.

CoveredGroup optionally encodes the split provenance of a covered group, as in the above {i0, i4, i6} case. The CoveredGroup of {i0, i4, i6} would have {i0, i4, i6} as group_ and the covered groups of its split input group, {i2, i3}, as split_in_, with is_inner_ being false. Similarly, {i5, i7} would have {i5, i7} as its group_ and {i2, i3} as its split_in_, but with is_inner_ being true. Therefore, {i8} would just get the same coverage info propagated from {i0, i4, i6}, whereas {i9, i10} would get that as well as the one from {i5, i7}, effectively gathering both the outer and inner output groups of the split. This way, we avoid losing information about covered input groups when propagating through split exprs.
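As a rough Python approximation of this split-aware scheme (the actual CoveredGroup lives in C++ in loop_promotion.h; the field names mirror it, but the group labels and driver code here are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoveredGroup:
    """Python approximation of the C++ CoveredGroup."""
    group: str
    split_in: frozenset = None  # covered groups of the split input, if any
    is_inner: bool = False

# Input coverage: {i2, i3} covers itself; b1 is a broadcast.
covered = {"i2/i3": {CoveredGroup("i2/i3")}, "b1": set()}

# Split of {i2, i3}: each output records WHICH half of the split it
# covers instead of inheriting the full input coverage.
split_in = frozenset(covered["i2/i3"])
outer = CoveredGroup("i0/i4/i6", split_in, is_inner=False)
inner = CoveredGroup("i5/i7", split_in, is_inner=True)
covered["i0/i4/i6"] = {outer}
covered["i5/i7"] = {inner}

# Merges propagate the union of the input coverage as before.
covered["i8"] = covered["i0/i4/i6"] | covered["b1"]         # outer only
covered["i9/i10"] = covered["i0/i4/i6"] | covered["i5/i7"]  # outer + inner

# Now {i9, i10} strictly covers more than {i8}, so the analysis
# can pick it as the promotion of the loop group.
assert covered["i8"] < covered["i9/i10"]
```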

@naoyam (Collaborator, Author) commented Feb 12, 2025

!test

github-actions bot commented Feb 12, 2025

Review updated until commit 2ea029b

Description

  • Added CoveredGroup and CoveredGroups structures for more precise dependency tracking.

  • Updated computeCoveredGroups to handle splits and propagate coverage information correctly.

  • Enhanced getInputGroupsOfExactGraph and getInputGroupsOfIELGraph to accept IdModel parameter.

  • Added tests to verify split-aware covered group analysis and fix invalid loop promotion.


Changes walkthrough 📝

Relevant files:

Enhancement

loop_promotion.cpp (csrc/id_model/loop_promotion.cpp): Enhance covered group analysis and handle splits (+195/-34)

  • Added isEqualToOrSuperSetOf method to CoveredGroup.
  • Added toString method to CoveredGroup.
  • Updated computeCoveredGroups to handle splits and propagate coverage
    information.
  • Updated getInputGroupsOfExactGraph and getInputGroupsOfIELGraph to
    accept an IdModel parameter.
  • Modified propagatePromotionsInIELGraph to handle splits correctly.
  • Updated projectIELPromotionToLoopGraph and findPromotionOfLoopGroup to
    use the new computeCoveredGroups.

loop_promotion.h (csrc/id_model/loop_promotion.h): Define CoveredGroup and update method signatures (+84/-6)

  • Added CoveredGroup and CoveredGroups structures.
  • Updated computeCoveredGroups to return
    std::unordered_map<ValGroup, std::shared_ptr<CoveredGroups>>.
  • Updated getInputGroupsOfExactGraph and getInputGroupsOfIELGraph to be
    static and accept an IdModel parameter.

Tests

test_id_model.cpp (tests/cpp/test_id_model.cpp): Add tests for covered groups and loop promotion (+147/-0)

  • Added test CoveredGroups to verify split-aware covered group analysis.
  • Added test InvalidLoopPromotion to reproduce issue #3702 (hf_llama model error).

test_tutorial.cpp (tests/cpp/test_tutorial.cpp): Update test to use buildExactGraph (+1/-1)

  • Updated IdModelReshapeAnalysis test to use buildExactGraph.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The isEqualToOrSuperSetOf function in the CoveredGroup class may not handle all edge cases correctly, especially when dealing with complex split and merge operations. Ensure that all possible scenarios are covered and tested.

    // possible, but may not be complete.
    bool CoveredGroup::isEqualToOrSuperSetOf(const CoveredGroup& other) const {
      if (*this == other) {
        return true;
      }
    
      // When both are derived from split and correspond to either inner
      // or outer.
      if (split_in_.get() && other.split_in_.get() &&
          is_inner_ == other.is_inner_) {
        const CoveredGroups& split_in = *split_in_;
        const CoveredGroups& other_split_in = *other.split_in_;
    
        // When both have the same split input (and both correspond to
        // either inner or outer), they should cover the same exact
        // groups. This should only happen when broadcast is merged. For
        // example, suppose there are two tensors and they are scheduled as
        // follows;
        //
        // t0: [i0]
        // t1: [i1, b2]
        //
        // t1->merge(0, 1)->split(0, 4);
        // t0->split(0, 4)
        //
        // t0->inlineAt(t1, 1)
        //
        // In this case, t0->axis(0) and t1->axis(0) have the same
        // split input group, {i0, i1}. Note that b2 is not included as
        // it's a broadcast. Also note that both are the outer
        // output. Here, group_ of t0->axis(0) is the exact group of
        // t0->axis(0), while that of tv1->axis(0) is the exact group of
        // the t1->merge(0, 1) output. In this case, however, this merge
        // is just a merge of i1 and the b2 broadcast ID, so in terms of
        // covered exact groups, it's effectively the same as that of
        // t0->axis(0). All in all, as long as both correspond to either
        // inner or outer of the same split input, they should be
        // considered the same.
        if (split_in == other_split_in) {
          return true;
        }
    
    // Both are derived from a split but have different split input
    // groups. If the input groups of this split are a superset of the
    // input groups of the split of the other CoveredGroup, this
    // CoveredGroup is a superset.
        if (std::all_of(
                other_split_in.begin(),
                other_split_in.end(),
                [&](const CoveredGroup& other_split_in_group) {
                  return std::any_of(
                      split_in.begin(),
                      split_in.end(),
                      [&](const CoveredGroup& split_in_group) {
                        return split_in_group.isEqualToOrSuperSetOf(
                            other_split_in_group);
                      });
                })) {
          return true;
        }
      }
    
  // If the other CoveredGroup has a split input, it is sufficient
  // for this CoveredGroup to be equal to or a superset of the
  // split input of the other.
      if (other.split_in_.get()) {
        if (std::all_of(
                other.split_in_->begin(),
                other.split_in_->end(),
                [&](const CoveredGroup& other_split_in) {
                  return isEqualToOrSuperSetOf(other_split_in);
                })) {
          return true;
        }
      }
    
      // At this point, it does not mean it's definitely false but not
      // proven to be true either.
    
      return false;
    }
    Performance Concern

    The new implementation of computeCoveredGroups may introduce performance overhead due to the use of std::shared_ptr and nested loops. Consider optimizing the data structures and algorithms to improve performance.

    LoopPromotionMapBuilder::computeCoveredGroups(
        const ValGraph& exact_graph,
        const IdModel& id_model) {
      // Map from an exact iter domain group, to all the exact iter domain groups it
      // covers
      std::unordered_map<ValGroup, std::shared_ptr<CoveredGroups>> covered_ids;
    
      ValGroups input_groups_of_graph =
          getInputGroupsOfExactGraph(exact_graph, id_model);
    
      for (const ValGroup& id_group :
           exact_graph.disjointValSets().disjointSets()) {
        // Initialize inputs
        if (input_groups_of_graph.has(id_group)) {
          auto init_groups = std::make_shared<CoveredGroups>();
          init_groups->insert(CoveredGroup(id_group));
          NVF_ERROR(covered_ids.emplace(id_group, init_groups).second);
        }
    
        // Initialize broadcast groups to empty since broadcast domains
        // don't matter for indexing
        if (std::any_of(id_group->begin(), id_group->end(), [&](Val* id) {
              return id->as<IterDomain>()->isBroadcast();
            })) {
          covered_ids[id_group] = std::make_shared<CoveredGroups>();
        }
      }
    
      ValGraphStmtSort exact_stmt_sort(exact_graph, input_groups_of_graph);
    
      for (const ExprGroup& exact_expr : exact_stmt_sort.exprs()) {
        // Initialize to empty group if not yet initialized
        for (const ValGroup& output_group : exact_graph.outputGroups(exact_expr)) {
          covered_ids.emplace(output_group, std::make_shared<CoveredGroups>());
        }
    
        const std::vector<ValGroup> input_groups =
            exact_graph.inputGroups(exact_expr);
        const std::vector<ValGroup> output_groups =
            exact_graph.outputGroups(exact_expr);
    
        // If this expr is a split, don't propagate the input coverage as
        // is but set the covered group of each output group by itself.
        // The input coverage info is propagated as the split input.
        if (exact_expr->front()->isA<Split>()) {
          NVF_ERROR(input_groups.size() == 1);
          const std::shared_ptr<CoveredGroups>& covered_groups =
              covered_ids.at(input_groups.at(0));
    
          for (const ValGroup& output_group : output_groups) {
            bool is_inner =
                output_group->has(exact_expr->front()->as<Split>()->inner());
            covered_ids[output_group]->insert(
                CoveredGroup(output_group, covered_groups, is_inner));
          }
          continue;
        }
    
        for (const ValGroup& output_group : output_groups) {
          // Note that an exact group may have multiple
          // exact expr groups and may have different coverage groups depending on
          // the expr groups. For example, this can happen with reshape or resize.
          // See test LoopPromotionCoverage for a concrete example.
          for (const ValGroup& inp_group : input_groups) {
            const std::shared_ptr<CoveredGroups>& inp_covered_groups =
                covered_ids.at(inp_group);
            covered_ids[output_group]->insert(
                inp_covered_groups->begin(), inp_covered_groups->end());
          }
        }
      }
    
      return covered_ids;
    }
    
    std::unordered_map<ValGroup, IterDomain*> LoopPromotionMapBuilder::
        projectIELPromotionToLoopGraph(
            const ValGraph& iel_graph,
            const std::unordered_map<ValGroup, IterDomain*>& iel_promotion_map,
            const ValGraph& loop_graph,
            const StatefulInliningInfo& inlining_info) const {
      const std::unordered_map<ValGroup, std::shared_ptr<CoveredGroups>>
          exact_covered_ids =
              computeCoveredGroups(idGraph(IdMappingMode::EXACT), id_model_);
    
      // Grab terminal iter domain in the loop groups.
      const VectorOfUniqueEntries<IterDomain*> terminal_loop_ids =
          computeTerminalLoopIds(inlining_info);
    
      std::unordered_map<ValGroup, IterDomain*> loop_promotion_map;
    
      for (const ValGroup& loop_group :
           loop_graph.disjointValSets().disjointSets()) {
        IterDomain* promotion_id = findPromotionOfLoopGroup(
            loop_group,
            iel_graph,
            iel_promotion_map,
            exact_covered_ids,
            terminal_loop_ids);
        if (promotion_id) {
    Test Coverage

    Ensure that the added tests cover all possible edge cases and scenarios, including complex split and merge operations, to validate the correctness of the computeCoveredGroups function.

    // Test to verify the split-aware covered group analysis. See
    // also https://github.com/NVIDIA/Fuser/pull/3877.
    TEST_F(IdModelTest, CoveredGroups) {
      auto fusion_ptr = std::make_unique<Fusion>();
      auto& fusion = *fusion_ptr;
      FusionGuard fg(fusion_ptr.get());
    
      auto tv0 = makeContigConcreteTensor({-1, 1});
      fusion.addInput(tv0);
      auto tv1 = makeContigConcreteTensor({-1});
      fusion.addInput(tv1);
    
      auto tv2 = set(tv0);
      auto tv3 = reshape(tv1, {8}, {2, 4});
      auto tv4 = add(tv2, tv3);
      fusion.addOutput(tv4);
    
      for (auto tv : fusion.allTvs()) {
        tv->flatten();
      }
    
      inlineMost();
    
      IdModel id_model(&fusion, true);
      const auto& exact_graph = id_model.idGraph(IdMappingMode::EXACT);
    
  // The exact group of the tv3 and tv4 IDs should cover both the inner
  // and outer split groups of the input group of the tv1 logical ID.
      const auto covered_groups =
          LoopPromotionMapBuilder::computeCoveredGroups(exact_graph, id_model);
    
      const auto& input_group = exact_graph.toGroup(tv1->getLogicalDomain().at(0));
      auto input_covered_group_it = covered_groups.find(input_group);
      ASSERT_NE(input_covered_group_it, covered_groups.end());
      const auto& input_covered_groups = input_covered_group_it->second;
    
      const auto& tv4_exact_group = exact_graph.toGroup(tv4->axis(0));
      auto tv4_exact_group_it = covered_groups.find(tv4_exact_group);
      ASSERT_NE(tv4_exact_group_it, covered_groups.end());
      const auto& tv4_covered_groups = tv4_exact_group_it->second;
  // It should consist of two CoveredGroups, both of which inherit
  // from the logical ID of tv1 through a split
      EXPECT_EQ(tv4_covered_groups->size(), 2);
      for (const CoveredGroup& covered_group : *tv4_covered_groups) {
        EXPECT_EQ(covered_group.splitIn(), input_covered_groups);
        if (covered_group.isInner()) {
          EXPECT_EQ(
              covered_group.group(),
              exact_graph.toGroup(tv4->getLogicalDomain().at(1)));
        } else {
          EXPECT_EQ(
              covered_group.group(),
              exact_graph.toGroup(tv4->getLogicalDomain().at(0)));
        }
      }
    }
    
// Repro of issue #3702
// https://github.com/NVIDIA/Fuser/issues/3702. Indexing traversal
// failed due to invalid loop promotion.
    TEST_F(IdModelTest, InvalidLoopPromotion) {
      auto fusion_ptr = std::make_unique<Fusion>();
      auto& fusion = *fusion_ptr;
      FusionGuard fg(fusion_ptr.get());
    
      auto T0 = makeContigConcreteTensor({1, 32, 6});
      fusion.addInput(T0);
      auto T32 = makeContigConcreteTensor({1, 6, 2048}, DataType::BFloat16);
      fusion.addInput(T32);
    
      auto T6 = transpose(T0, 1, 2);
      auto T98 = broadcast(T6, {false, false, true, false});
      auto T99 = expand(
          T98,
          {IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(2L),
           IrBuilder::create<Val>(-1L)});
      auto T100 = reshape(
          T99,
          {IrBuilder::create<Val>(1L),
           IrBuilder::create<Val>(6L),
           IrBuilder::create<Val>(-1)});
      auto T11 = sin(T100);
      auto T13 = mul(T11, IrBuilder::create<Val>(1.0));
      auto T15 = castOp(DataType::BFloat16, T13);
      auto T43 = broadcast(T15, {false, true, false, false});
      auto T59 = expand(
          T43,
          {IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(32L),
           IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(-1L)});
      auto T61 = castOp(DataType::Float, T59);
      auto T10 = cos(T100);
      auto T12 = mul(T10, IrBuilder::create<Val>(1.0));
      auto T14 = castOp(DataType::BFloat16, T12);
      auto T41 = broadcast(T14, {false, true, false, false});
      auto T66 = expand(
          T41,
          {IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(8L),
           IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(-1L)});
      auto T79 = expand(
          T43,
          {IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(8L),
           IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(-1L)});
      auto T81 = castOp(DataType::Float, T79);
      auto T35 = reshape(
          T32,
          {IrBuilder::create<Val>(1L),
           IrBuilder::create<Val>(6L),
           IrBuilder::create<Val>(32L),
           IrBuilder::create<Val>(64L)});
      auto T36 = transpose(T35, 1, 2);
      auto T47 = castOp(DataType::Float, T36);
      auto T46 = expand(
          T41,
          {IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(32L),
           IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(-1L)});
      auto T48 = castOp(DataType::Float, T46);
      auto T49 = mul(T47, T48);
      fusion.addOutput(T61);
      fusion.addOutput(T66);
      fusion.addOutput(T81);
      fusion.addOutput(T36);
      fusion.addOutput(T49);
    
      auto options_bf16 =
          at::TensorOptions().dtype(at::kBFloat16).device(at::kCUDA, 0);
      auto options_fp32 =
          at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto t0 = at::randn({1, 32, 6}, options_fp32);
      auto t32 = at::randn({1, 6, 2048}, options_bf16);
      std::vector<c10::IValue> inputs({t0, t32});
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs(inputs);
      testValidate(&fusion, outputs, inputs, __LINE__, __FILE__);
    }
    
    } // namespace nvfuser

    @naoyam naoyam added the idmodel label Feb 12, 2025
@@ -654,7 +654,7 @@ TEST_F(Tutorial, IdModelReshapeAnalysis) {
   fusion.addOutput(tv3);

   IdModel id_model(&fusion);
-  ValGraph& exact_graph = id_model.idGraph(IdMappingMode::EXACT);
+  ValGraph& exact_graph = id_model.buildExactGraph();
@naoyam (Collaborator, Author) commented:

Unrelated fix

@naoyam (Collaborator, Author) commented Feb 13, 2025

    !test

Successfully merging this pull request may close these issues: hf_llama model error (#3702)