Skip to content

enhance: update tantivy for removing "doc_id" fast field #41198

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Apr 15, 2025

Conversation

SpadeA-Tang
Copy link
Contributor

@SpadeA-Tang SpadeA-Tang commented Apr 9, 2025

Issue: #41210

After zilliztech/tantivy#5, we can provide milvus row id directly to tantivy rather than record it in the fast field "doc_id".
So rather than search tantivy doc id and then get milvus row id from "doc_id", now, the searched tantivy doc id is the milvus row id, eliminating the expensive acquiring row id phase.

The following shows a simple benchmark where insert 1M docs where all rows are "hello", the latency is segcore level, CPU is 9900K:
image
The latency is 2.02 and 2.1 times respectively.

bench mark code:

TEST(TextMatch, TestPerf) {
    auto schema = GenTestSchema({}, true);
    auto seg = CreateSealedSegment(schema, empty_index_meta);
    int64_t N = 1000000;
    uint64_t seed = 19190504;
    auto raw_data = DataGen(schema, N, seed);
    auto str_col = raw_data.raw_->mutable_fields_data()
                       ->at(1)
                       .mutable_scalars()
                       ->mutable_string_data()
                       ->mutable_data();
    for (int64_t i = 0; i < N - 1; i++) {
        str_col->at(i) = "hello";
    }
    SealedLoadFieldData(raw_data, *seg);
    seg->CreateTextIndex(FieldId(101));

    auto now = std::chrono::high_resolution_clock::now();
    auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
    auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration =
        std::chrono::duration_cast<std::chrono::microseconds>(end - now);
    std::cout << "TextMatch query time: " << duration.count() << "ms"
              << std::endl;
}

@sre-ci-robot sre-ci-robot added the size/L Denotes a PR that changes 100-499 lines. label Apr 9, 2025
@mergify mergify bot added dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement labels Apr 9, 2025
Copy link
Contributor

mergify bot commented Apr 9, 2025

@SpadeA-Tang Please associate the related issue to the body of your Pull Request. (eg. “issue: #”)

@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: SpadeA-Tang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot removed the ci-passed label Apr 9, 2025
@SpadeA-Tang SpadeA-Tang marked this pull request as draft April 9, 2025 12:35
@sre-ci-robot sre-ci-robot added the do-not-merge/work-in-progress Don't merge even CI passed. label Apr 9, 2025
Copy link
Contributor

mergify bot commented Apr 9, 2025

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Signed-off-by: SpadeA <[email protected]>
@SpadeA-Tang SpadeA-Tang marked this pull request as ready for review April 10, 2025 03:46
@sre-ci-robot sre-ci-robot removed the do-not-merge/work-in-progress Don't merge even CI passed. label Apr 10, 2025
Copy link
Contributor

mergify bot commented Apr 10, 2025

@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

Copy link
Contributor

mergify bot commented Apr 10, 2025

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

Copy link
Contributor

mergify bot commented Apr 10, 2025

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Signed-off-by: SpadeA <[email protected]>
Signed-off-by: SpadeA <[email protected]>
@SpadeA-Tang SpadeA-Tang marked this pull request as ready for review April 11, 2025 03:51
@sre-ci-robot sre-ci-robot removed the do-not-merge/work-in-progress Don't merge even CI passed. label Apr 11, 2025
Copy link
Contributor

mergify bot commented Apr 11, 2025

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

Copy link
Contributor

mergify bot commented Apr 11, 2025

@SpadeA-Tang cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

Copy link
Contributor

mergify bot commented Apr 11, 2025

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Signed-off-by: SpadeA <[email protected]>
Signed-off-by: SpadeA <[email protected]>
Copy link

codecov bot commented Apr 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.58%. Comparing base (9ce3e3c) to head (150d406).
Report is 19 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #41198      +/-   ##
==========================================
+ Coverage   80.36%   80.58%   +0.22%     
==========================================
  Files        1484     1476       -8     
  Lines      211989   209856    -2133     
==========================================
- Hits       170359   169113    -1246     
+ Misses      35472    34612     -860     
+ Partials     6158     6131      -27     
Components Coverage Δ
Client 79.59% <ø> (ø)
Core 72.36% <100.00%> (+2.04%) ⬆️
Go 82.00% <ø> (-0.17%) ⬇️
Files with missing lines Coverage Δ
...ernal/core/src/index/JsonKeyStatsInvertedIndex.cpp 92.71% <100.00%> (-0.15%) ⬇️

... and 156 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@czs007
Copy link
Collaborator

czs007 commented Apr 14, 2025

/run-cpu-e2e

Copy link
Contributor

mergify bot commented Apr 14, 2025

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@SpadeA-Tang
Copy link
Contributor Author

/run-cpu-e2e

Copy link
Contributor

mergify bot commented Apr 14, 2025

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@SpadeA-Tang
Copy link
Contributor Author

/run-cpu-e2e

Copy link
Contributor

mergify bot commented Apr 14, 2025

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Signed-off-by: SpadeA <[email protected]>
Copy link
Contributor

mergify bot commented Apr 15, 2025

@SpadeA-Tang E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@SpadeA-Tang
Copy link
Contributor Author

/run-cpu-e2e

Signed-off-by: SpadeA <[email protected]>
Copy link
Contributor

mergify bot commented Apr 15, 2025

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

@SpadeA-Tang
Copy link
Contributor Author

rerun go-sdk

Copy link
Contributor

mergify bot commented Apr 15, 2025

@SpadeA-Tang go-sdk check failed, comment rerun go-sdk can trigger the job again.

@SpadeA-Tang
Copy link
Contributor Author

rerun go-sdk

@mergify mergify bot added the ci-passed label Apr 15, 2025
@czs007
Copy link
Collaborator

czs007 commented Apr 15, 2025

/lgtm

@sre-ci-robot sre-ci-robot merged commit 70d13dc into milvus-io:master Apr 15, 2025
19 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved ci-passed dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement lgtm size/XL Denotes a PR that changes 500-999 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants