
[WIP] IdModel: Fix invalid promotion selection #3877

Draft
wants to merge 12 commits into main
Conversation

@naoyam (Collaborator) commented Feb 12, 2025

Fixes #3702

computeCoveredGroups is used in the loop promotion analysis to find a promotion ID that has dependencies on all input exact groups. Specifically, it traverses the exact graph in topological order and assigns to each exact group the set of input exact groups it depends on. This information is used to find promotion IDs, since a promotion ID of a loop group needs to cover the dependent input groups of all IDs of the given loop group.

Here's a simple example. Suppose there are two 2D input tensors, t0 and t1, and they are added together as follows:

t0: [i0, b1]
t1: [i2, i3]

t2 = t0 + t1
// t2: [i4, i5]

flatten_all_tensors();
inlineMost();

The resulting tensors and IDs look like below:

(figure omitted: the resulting tensors and their IDs)

Specifically, i6, i7 and i8 form a loop group, with i8 (or i7) being its promotion ID. The dependent input exact groups of i8 are {i0, i2, i4} and {i3, i5}. Note that broadcast IDs are ignored. Notice that this set covers all of the dependent groups of the member IDs of the loop group. That's obviously true for i7, as it's exact mapped with i8. It's also true for i6, which only has {i0, i2, i4} as its dependent group (the dependency on b1 is ignored as it's a broadcast ID).
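This broadcast-ignoring propagation can be sketched in Python as follows. This is a simplified illustration, not the actual C++ implementation in loop_promotion.cpp; the group names are shorthand labels for the exact groups in this example:

```python
def compute_covered_groups(input_groups, broadcast_groups, exprs):
    """Sketch of the covered-groups propagation.

    exprs: topologically ordered (input_groups, output_group) pairs.
    Returns a map from each exact group to the set of input exact
    groups it depends on. Broadcast groups cover nothing.
    """
    covered = {}
    for g in input_groups:
        covered[g] = set() if g in broadcast_groups else {g}
    for inputs, output in exprs:
        covered.setdefault(output, set())
        for inp in inputs:
            covered[output] |= covered[inp]
    return covered

A = "i0/i2/i4"   # exact group {i0, i2, i4}
B = "b1"         # broadcast group
C = "i3/i5"      # exact group {i3, i5}

# i6 = merge(i0, b1) on t0; i7/i8 = merge of the two concrete groups
exprs = [((A, B), "i6"), ((A, C), "i7/i8")]
cov = compute_covered_groups([A, B, C], {B}, exprs)

# cov["i7/i8"] covers both input groups, so it is a valid promotion;
# cov["i6"] covers only {i0, i2, i4} since b1 is a broadcast.
```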

computeCoveredGroups works fine with common transformation patterns, where we tend to do all merges first and then all splits. However, it has a problem when a split is followed by a merge. Consider a fusion with two input tensors, t0 and t1, as shown below:

t0: [i0, b1]
t1: [i2]

t2 = reshape(t1, {i4, i5});
t3 = t0 + t2
// t3: [i6, i7]

flatten_all_tensors();
inlineMost();

(figure omitted: the resulting tensors and their IDs after the reshape)

Similar to the previous case, the final merged IDs form a loop group of {i8, i9, i10}. In this case, it should be promoted to i9 or i10 (they are exact mapped, so it doesn't matter which one is picked). However, computeCoveredGroups assigns the exact group of {i2, i3} not just to the exact group of {i9, i10} but also to the exact groups of {i0, i4, i6} and {i5, i7}, as well as {i8}, since all of them have dependencies on {i2, i3}. This is problematic because findPromotionOfLoopGroup [may incorrectly choose](https://github.com/NVIDIA/Fuser/blob/main/csrc/id_model/loop_promotion.cpp#L926-L934), e.g., i8 as the promotion of the loop group.

The root cause of the issue lies in [computeCoveredGroups](https://github.com/NVIDIA/Fuser/blob/main/csrc/id_model/loop_promotion.cpp#L779-L822). It just propagates input exact groups down to consumer exact groups through exact expr groups, no matter what actual expr a group represents. Specifically, when the expr is a split, the input groups covered by the split input ID are simply propagated to the split output groups, but each output group does not actually cover the entire input groups; it should be considered to cover only its portion of the split. In the above example, {i0, i4, i6} and {i5, i7} should cover only the outer and inner outputs of the split of {i2, i3}, respectively. Then {i8} only inherits the outer split portion through {i0, i4, i6}, whereas {i9, i10} gets both the outer and inner split groups, which allows findPromotionOfLoopGroup to correctly identify {i9, i10} as the promotion of the loop group.
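To make the failure mode concrete, here is a minimal Python sketch (not the actual implementation) applying the naive propagation to this second example; the group labels are shorthand for the exact groups above:

```python
# Naive propagation applied to the split-then-merge example.
covered = {"i2/i3": {"i2/i3"}, "b1": set()}

# Split: naively forward the input coverage to BOTH outputs,
# losing the outer/inner distinction.
covered["i0/i4/i6"] = set(covered["i2/i3"])   # outer output
covered["i5/i7"] = set(covered["i2/i3"])      # inner output

# Merges: union of the input coverage.
covered["i8"] = covered["i0/i4/i6"] | covered["b1"]
covered["i9/i10"] = covered["i0/i4/i6"] | covered["i5/i7"]

# Both end up covering exactly {"i2/i3"}, so the promotion analysis
# cannot tell that i8 covers only the outer half of the split.
assert covered["i8"] == covered["i9/i10"] == {"i2/i3"}
```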

Previously, the coverage information was just a map from each exact ValGroup to the ValGroups of its dependent input groups. Since that is not enough to represent the coverage information as precisely as discussed above, this PR introduces an additional data structure, CoveredGroup, which represents one covered exact group of an exact group. Since each exact group may cover multiple groups, std::unordered_set<CoveredGroup> is used to represent all covered groups of an exact group.

CoveredGroup optionally encodes the split provenance of a covered group, as in the above {i0, i4, i6} case. The CoveredGroup of {i0, i4, i6} would have {i0, i4, i6} as group_ and the covered groups of its split input group, {i2, i3}, as split_in_, with is_inner_ being false. Similarly, {i5, i7} would have {i5, i7} as its group_ and {i2, i3} as its split_in_, but with is_inner_ being true. Therefore, {i8} would just get the same coverage info propagated from {i0, i4, i6}, whereas {i9, i10} would get that as well as the one from {i5, i7}, effectively gathering both the outer and inner output groups of the split. This way, we avoid losing information about covered input groups when propagating through split exprs.
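As a rough Python approximation of this split-aware scheme (the actual CoveredGroup lives in C++ in loop_promotion.h; the field names mirror it, but the group labels and driver code here are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoveredGroup:
    """Python approximation of the C++ CoveredGroup."""
    group: str
    split_in: frozenset = None  # covered groups of the split input, if any
    is_inner: bool = False

# Input coverage: {i2, i3} covers itself; b1 is a broadcast.
covered = {"i2/i3": {CoveredGroup("i2/i3")}, "b1": set()}

# Split of {i2, i3}: each output records WHICH half of the split it
# covers instead of inheriting the full input coverage.
split_in = frozenset(covered["i2/i3"])
outer = CoveredGroup("i0/i4/i6", split_in, is_inner=False)
inner = CoveredGroup("i5/i7", split_in, is_inner=True)
covered["i0/i4/i6"] = {outer}
covered["i5/i7"] = {inner}

# Merges propagate the union of the input coverage as before.
covered["i8"] = covered["i0/i4/i6"] | covered["b1"]         # outer only
covered["i9/i10"] = covered["i0/i4/i6"] | covered["i5/i7"]  # outer + inner

# Now {i9, i10} strictly covers more than {i8}, so the analysis
# can pick it as the promotion of the loop group.
assert covered["i8"] < covered["i9/i10"]
```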

@naoyam (Collaborator, Author) commented Feb 12, 2025

!test

github-actions bot commented Feb 12, 2025

Review updated until commit 2ea029b

Description

  • Added CoveredGroup and CoveredGroups structures for more precise dependency tracking.

  • Updated computeCoveredGroups to handle splits and propagate coverage information correctly.

  • Enhanced getInputGroupsOfExactGraph and getInputGroupsOfIELGraph to accept IdModel parameter.

  • Added tests to verify split-aware covered group analysis and fix invalid loop promotion.


Changes walkthrough 📝

Relevant files:

Enhancement

loop_promotion.cpp (csrc/id_model/loop_promotion.cpp): Enhance covered group analysis and handle splits (+195/-34)

  • Added isEqualToOrSuperSetOf method to CoveredGroup.
  • Added toString method to CoveredGroup.
  • Updated computeCoveredGroups to handle splits and propagate coverage
    information.
  • Updated getInputGroupsOfExactGraph and getInputGroupsOfIELGraph to
    accept an IdModel parameter.
  • Modified propagatePromotionsInIELGraph to handle splits correctly.
  • Updated projectIELPromotionToLoopGraph and findPromotionOfLoopGroup to
    use the new computeCoveredGroups.

loop_promotion.h (csrc/id_model/loop_promotion.h): Define CoveredGroup and update method signatures (+84/-6)

  • Added CoveredGroup and CoveredGroups structures.
  • Updated computeCoveredGroups to return
    std::unordered_map<ValGroup, std::shared_ptr<CoveredGroups>>.
  • Updated getInputGroupsOfExactGraph and getInputGroupsOfIELGraph to be
    static and accept an IdModel parameter.

Tests

test_id_model.cpp (tests/cpp/test_id_model.cpp): Add tests for covered groups and loop promotion (+147/-0)

  • Added test CoveredGroups to verify split-aware covered group analysis.
  • Added test InvalidLoopPromotion to reproduce issue #3702 (hf_llama model error).

test_tutorial.cpp (tests/cpp/test_tutorial.cpp): Update test to use buildExactGraph (+1/-1)

  • Updated IdModelReshapeAnalysis test to use buildExactGraph.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The isEqualToOrSuperSetOf function in the CoveredGroup class may not handle all edge cases correctly, especially when dealing with complex split and merge operations. Ensure that all possible scenarios are covered and tested.

    // possible, but may not be complete.
    bool CoveredGroup::isEqualToOrSuperSetOf(const CoveredGroup& other) const {
      if (*this == other) {
        return true;
      }
    
      // When both are derived from split and correspond to either inner
      // or outer.
      if (split_in_.get() && other.split_in_.get() &&
          is_inner_ == other.is_inner_) {
        const CoveredGroups& split_in = *split_in_;
        const CoveredGroups& other_split_in = *other.split_in_;
    
        // When both have the same split input (and both correspond to
        // either inner or outer), they should cover the same exact
        // groups. This should only happen when broadcast is merged. For
        // example, suppose there are two tensors and they are scheduled as
        // follows;
        //
        // t0: [i0]
        // t1: [i1, b2]
        //
        // t1->merge(0, 1)->split(0, 4);
        // t0->split(0, 4)
        //
        // t0->inlineAt(t1, 1)
        //
        // In this case, t0->axis(0) and t1->axis(0) have the same
        // split input group, {i0, i1}. Note that b2 is not included as
        // it's a broadcast. Also note that both are the outer
        // output. Here, group_ of t0->axis(0) is the exact group of
        // t0->axis(0), while that of tv1->axis(0) is the exact group of
        // the t1->merge(0, 1) output. In this case, however, this merge
        // is just a merge of i1 and the b2 broadcast ID, so in terms of
        // covered exact groups, it's effectively the same as that of
        // t0->axis(0). All in all, as long as both correspond to either
        // inner or outer of the same split input, they should be
        // considered the same.
        if (split_in == other_split_in) {
          return true;
        }
    
    // Both are derived from a split but have different split input
    // groups. If the input groups of this split are a superset of the
    // input groups of the split of the other CoveredGroup, this
    // CoveredGroup is a superset.
        if (std::all_of(
                other_split_in.begin(),
                other_split_in.end(),
                [&](const CoveredGroup& other_split_in_group) {
                  return std::any_of(
                      split_in.begin(),
                      split_in.end(),
                      [&](const CoveredGroup& split_in_group) {
                        return split_in_group.isEqualToOrSuperSetOf(
                            other_split_in_group);
                      });
                })) {
          return true;
        }
      }
    
  // If the other CoveredGroup has a split input, it is sufficient
  // for this CoveredGroup to be equal to or a superset of the
  // split input of the other.
      if (other.split_in_.get()) {
        if (std::all_of(
                other.split_in_->begin(),
                other.split_in_->end(),
                [&](const CoveredGroup& other_split_in) {
                  return isEqualToOrSuperSetOf(other_split_in);
                })) {
          return true;
        }
      }
    
      // At this point, it does not mean it's definitely false but not
      // proven to be true either.
    
      return false;
    }
    Performance Concern

    The new implementation of computeCoveredGroups may introduce performance overhead due to the use of std::shared_ptr and nested loops. Consider optimizing the data structures and algorithms to improve performance.

    LoopPromotionMapBuilder::computeCoveredGroups(
        const ValGraph& exact_graph,
        const IdModel& id_model) {
      // Map from an exact iter domain group, to all the exact iter domain groups it
      // covers
      std::unordered_map<ValGroup, std::shared_ptr<CoveredGroups>> covered_ids;
    
      ValGroups input_groups_of_graph =
          getInputGroupsOfExactGraph(exact_graph, id_model);
    
      for (const ValGroup& id_group :
           exact_graph.disjointValSets().disjointSets()) {
        // Initialize inputs
        if (input_groups_of_graph.has(id_group)) {
          auto init_groups = std::make_shared<CoveredGroups>();
          init_groups->insert(CoveredGroup(id_group));
          NVF_ERROR(covered_ids.emplace(id_group, init_groups).second);
        }
    
        // Initialize broadcast groups to empty since broadcast domains
        // don't matter for indexing
        if (std::any_of(id_group->begin(), id_group->end(), [&](Val* id) {
              return id->as<IterDomain>()->isBroadcast();
            })) {
          covered_ids[id_group] = std::make_shared<CoveredGroups>();
        }
      }
    
      ValGraphStmtSort exact_stmt_sort(exact_graph, input_groups_of_graph);
    
      for (const ExprGroup& exact_expr : exact_stmt_sort.exprs()) {
        // Initialize to empty group if not yet initialized
        for (const ValGroup& output_group : exact_graph.outputGroups(exact_expr)) {
          covered_ids.emplace(output_group, std::make_shared<CoveredGroups>());
        }
    
        const std::vector<ValGroup> input_groups =
            exact_graph.inputGroups(exact_expr);
        const std::vector<ValGroup> output_groups =
            exact_graph.outputGroups(exact_expr);
    
        // If this expr is a split, don't propagate the input coverage as
        // is but set the covered group of each output group by itself.
        // The input coverage info is propagated as the split input.
        if (exact_expr->front()->isA<Split>()) {
          NVF_ERROR(input_groups.size() == 1);
          const std::shared_ptr<CoveredGroups>& covered_groups =
              covered_ids.at(input_groups.at(0));
    
          for (const ValGroup& output_group : output_groups) {
            bool is_inner =
                output_group->has(exact_expr->front()->as<Split>()->inner());
            covered_ids[output_group]->insert(
                CoveredGroup(output_group, covered_groups, is_inner));
          }
          continue;
        }
    
        for (const ValGroup& output_group : output_groups) {
          // Note that an exact group may have multiple
          // exact expr groups and may have different coverage groups depending on
          // the expr groups. For example, this can happen with reshape or resize.
          // See test LoopPromotionCoverage for a concrete example.
          for (const ValGroup& inp_group : input_groups) {
            const std::shared_ptr<CoveredGroups>& inp_covered_groups =
                covered_ids.at(inp_group);
            covered_ids[output_group]->insert(
                inp_covered_groups->begin(), inp_covered_groups->end());
          }
        }
      }
    
      return covered_ids;
    }
    
    std::unordered_map<ValGroup, IterDomain*> LoopPromotionMapBuilder::
        projectIELPromotionToLoopGraph(
            const ValGraph& iel_graph,
            const std::unordered_map<ValGroup, IterDomain*>& iel_promotion_map,
            const ValGraph& loop_graph,
            const StatefulInliningInfo& inlining_info) const {
      const std::unordered_map<ValGroup, std::shared_ptr<CoveredGroups>>
          exact_covered_ids =
              computeCoveredGroups(idGraph(IdMappingMode::EXACT), id_model_);
    
      // Grab terminal iter domain in the loop groups.
      const VectorOfUniqueEntries<IterDomain*> terminal_loop_ids =
          computeTerminalLoopIds(inlining_info);
    
      std::unordered_map<ValGroup, IterDomain*> loop_promotion_map;
    
      for (const ValGroup& loop_group :
           loop_graph.disjointValSets().disjointSets()) {
        IterDomain* promotion_id = findPromotionOfLoopGroup(
            loop_group,
            iel_graph,
            iel_promotion_map,
            exact_covered_ids,
            terminal_loop_ids);
        if (promotion_id) {
    Test Coverage

    Ensure that the added tests cover all possible edge cases and scenarios, including complex split and merge operations, to validate the correctness of the computeCoveredGroups function.

    // Test to verify the split-aware covered group analysis. See
    // also https://github.com/NVIDIA/Fuser/pull/3877.
    TEST_F(IdModelTest, CoveredGroups) {
      auto fusion_ptr = std::make_unique<Fusion>();
      auto& fusion = *fusion_ptr;
      FusionGuard fg(fusion_ptr.get());
    
      auto tv0 = makeContigConcreteTensor({-1, 1});
      fusion.addInput(tv0);
      auto tv1 = makeContigConcreteTensor({-1});
      fusion.addInput(tv1);
    
      auto tv2 = set(tv0);
      auto tv3 = reshape(tv1, {8}, {2, 4});
      auto tv4 = add(tv2, tv3);
      fusion.addOutput(tv4);
    
      for (auto tv : fusion.allTvs()) {
        tv->flatten();
      }
    
      inlineMost();
    
      IdModel id_model(&fusion, true);
      const auto& exact_graph = id_model.idGraph(IdMappingMode::EXACT);
    
  // The exact group of the tv3 and tv4 IDs should cover both the inner
  // and outer split groups of the input group of the tv1 logical ID.
      const auto covered_groups =
          LoopPromotionMapBuilder::computeCoveredGroups(exact_graph, id_model);
    
      const auto& input_group = exact_graph.toGroup(tv1->getLogicalDomain().at(0));
      auto input_covered_group_it = covered_groups.find(input_group);
      ASSERT_NE(input_covered_group_it, covered_groups.end());
      const auto& input_covered_groups = input_covered_group_it->second;
    
      const auto& tv4_exact_group = exact_graph.toGroup(tv4->axis(0));
      auto tv4_exact_group_it = covered_groups.find(tv4_exact_group);
      ASSERT_NE(tv4_exact_group_it, covered_groups.end());
      const auto& tv4_covered_groups = tv4_exact_group_it->second;
  // It should consist of two CoveredGroups, both of which inherit
  // from the logical ID of tv1 through a split
      EXPECT_EQ(tv4_covered_groups->size(), 2);
      for (const CoveredGroup& covered_group : *tv4_covered_groups) {
        EXPECT_EQ(covered_group.splitIn(), input_covered_groups);
        if (covered_group.isInner()) {
          EXPECT_EQ(
              covered_group.group(),
              exact_graph.toGroup(tv4->getLogicalDomain().at(1)));
        } else {
          EXPECT_EQ(
              covered_group.group(),
              exact_graph.toGroup(tv4->getLogicalDomain().at(0)));
        }
      }
    }
    
// Repro of issue #3702
// https://github.com/NVIDIA/Fuser/issues/3702. Indexing traversal
// failed due to invalid loop promotion.
    TEST_F(IdModelTest, InvalidLoopPromotion) {
      auto fusion_ptr = std::make_unique<Fusion>();
      auto& fusion = *fusion_ptr;
      FusionGuard fg(fusion_ptr.get());
    
      auto T0 = makeContigConcreteTensor({1, 32, 6});
      fusion.addInput(T0);
      auto T32 = makeContigConcreteTensor({1, 6, 2048}, DataType::BFloat16);
      fusion.addInput(T32);
    
      auto T6 = transpose(T0, 1, 2);
      auto T98 = broadcast(T6, {false, false, true, false});
      auto T99 = expand(
          T98,
          {IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(2L),
           IrBuilder::create<Val>(-1L)});
      auto T100 = reshape(
          T99,
          {IrBuilder::create<Val>(1L),
           IrBuilder::create<Val>(6L),
           IrBuilder::create<Val>(-1)});
      auto T11 = sin(T100);
      auto T13 = mul(T11, IrBuilder::create<Val>(1.0));
      auto T15 = castOp(DataType::BFloat16, T13);
      auto T43 = broadcast(T15, {false, true, false, false});
      auto T59 = expand(
          T43,
          {IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(32L),
           IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(-1L)});
      auto T61 = castOp(DataType::Float, T59);
      auto T10 = cos(T100);
      auto T12 = mul(T10, IrBuilder::create<Val>(1.0));
      auto T14 = castOp(DataType::BFloat16, T12);
      auto T41 = broadcast(T14, {false, true, false, false});
      auto T66 = expand(
          T41,
          {IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(8L),
           IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(-1L)});
      auto T79 = expand(
          T43,
          {IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(8L),
           IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(-1L)});
      auto T81 = castOp(DataType::Float, T79);
      auto T35 = reshape(
          T32,
          {IrBuilder::create<Val>(1L),
           IrBuilder::create<Val>(6L),
           IrBuilder::create<Val>(32L),
           IrBuilder::create<Val>(64L)});
      auto T36 = transpose(T35, 1, 2);
      auto T47 = castOp(DataType::Float, T36);
      auto T46 = expand(
          T41,
          {IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(32L),
           IrBuilder::create<Val>(-1L),
           IrBuilder::create<Val>(-1L)});
      auto T48 = castOp(DataType::Float, T46);
      auto T49 = mul(T47, T48);
      fusion.addOutput(T61);
      fusion.addOutput(T66);
      fusion.addOutput(T81);
      fusion.addOutput(T36);
      fusion.addOutput(T49);
    
      auto options_bf16 =
          at::TensorOptions().dtype(at::kBFloat16).device(at::kCUDA, 0);
      auto options_fp32 =
          at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto t0 = at::randn({1, 32, 6}, options_fp32);
      auto t32 = at::randn({1, 6, 2048}, options_bf16);
      std::vector<c10::IValue> inputs({t0, t32});
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs(inputs);
      testValidate(&fusion, outputs, inputs, __LINE__, __FILE__);
    }
    
    } // namespace nvfuser

    @naoyam naoyam added the idmodel label Feb 12, 2025
@@ -654,7 +654,7 @@ TEST_F(Tutorial, IdModelReshapeAnalysis) {
   fusion.addOutput(tv3);

   IdModel id_model(&fusion);
-  ValGraph& exact_graph = id_model.idGraph(IdMappingMode::EXACT);
+  ValGraph& exact_graph = id_model.buildExactGraph();
@naoyam (Collaborator, Author) commented:

Unrelated fix

@naoyam (Collaborator, Author) commented Feb 13, 2025

    !test

Successfully merging this pull request may close these issues: hf_llama model error (#3702)