expression: Let TiDB use Hyperscan to support multi-pattern-match #23497

Open
wants to merge 20 commits into
base: master

Conversation

blacktear23
Contributor

@blacktear23 blacktear23 commented Mar 24, 2021

What problem does this PR solve?

Problem Summary:
Enable TiDB to support multi-pattern matching, powered by Hyperscan.

Proposal: #23504

What is changed and how it works?

What's Changed:

Add built-in functions to support multi-pattern matching. All functions have the hs_ prefix.

  • hs_build_db(id, pattern):
    An aggregate function that builds a Hyperscan database and returns it encoded as base64. The id parameter should be a number and the pattern parameter should be a string. The id parameter can be omitted: calling hs_build_db(pattern) generates a Hyperscan database without pattern IDs.

  • hs_build_db_json(patterns, [encodeFormat]):
    Builds a Hyperscan database from a JSON-format pattern source. encodeFormat can be hex or base64; the default is hex.

  • hs_match(input, patterns, [format]):
    Returns true if input matches any of the patterns. format can be lines, json, hex, or base64; the default is lines.

  • hs_match_all(input, patterns, [format]):
    Returns true if input matches all of the patterns. format can be lines or json; the default is lines.

  • hs_match_ids(input, patterns, [format]):
    Returns the list of IDs of the patterns that matched input. format can be lines, json, hex, or base64; the default is lines.

  • hs_match_json(input, patterns):
    Shorthand for hs_match(input, patterns, "json").

  • hs_match_all_json(input, patterns):
    Shorthand for hs_match_all(input, patterns, "json").

  • hs_match_ids_json(input, patterns):
    Shorthand for hs_match_ids(input, patterns, "json").
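
The difference between the "any", "all", and "ids" variants can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not the PR's implementation: Python's re module stands in for Hyperscan, the helper names match_any, match_all, and match_ids are hypothetical, and returning the IDs as a sorted list is an assumption (the encoding of the hs_match_ids result is not stated here).

```python
import re

# Stand-in pattern set: {pattern id: regular expression}.
patterns = {1: r"foo", 2: r"ba[rz]", 3: r"\d+"}


def match_any(text, patterns):
    """Mirror of the hs_match idea: true if at least one pattern matches."""
    return any(re.search(p, text) for p in patterns.values())


def match_all(text, patterns):
    """Mirror of the hs_match_all idea: true only if every pattern matches."""
    return all(re.search(p, text) for p in patterns.values())


def match_ids(text, patterns):
    """Mirror of the hs_match_ids idea: IDs of the patterns that match.

    Returning a sorted list is an assumption made for this sketch.
    """
    return sorted(pid for pid, p in patterns.items() if re.search(p, text))


if __name__ == "__main__":
    print(match_any("foo 42", patterns))
    print(match_all("foo 42", patterns))
    print(match_ids("foo 42", patterns))
```

With the input "foo 42", patterns 1 and 3 match but pattern 2 does not, so the "any" check succeeds, the "all" check fails, and the ID list contains 1 and 3.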

Limitations

The hs_match family of functions treats the patterns parameter as a constant. If the patterns parameter changes during evaluation, the Hyperscan database is still built only once, at the first row, and is never rebuilt!

For example, a query like the one below:

select t.data, hs_match(t.data, t.dynamic_patterns, "json") from t;

will not execute correctly.
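
A rough Python sketch of why such a query misbehaves, assuming a "compile once, reuse for every row" cache like the one described above. The class name ConstPatternMatcher and its build counter are hypothetical, and Python's re module stands in for Hyperscan.

```python
import re


class ConstPatternMatcher:
    """Compiles the patterns on the first call and reuses them afterwards."""

    def __init__(self):
        self._compiled = None
        self.builds = 0  # how many times the pattern set was compiled

    def match(self, text, patterns_text):
        # The patterns argument is only looked at for the first row;
        # later rows silently reuse the first row's compiled set.
        if self._compiled is None:
            self._compiled = [
                re.compile(p) for p in patterns_text.splitlines() if p
            ]
            self.builds += 1
        return any(rx.search(text) for rx in self._compiled)


# Each row carries its own patterns, as in the problematic query.
rows = [("apple pie", "apple"), ("banana split", "banana")]
matcher = ConstPatternMatcher()
results = [matcher.match(data, pats) for data, pats in rows]

if __name__ == "__main__":
    print(results, matcher.builds)
```

The second row reports no match even though its own pattern would match its data, because the matcher is still using the pattern compiled from the first row.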

Patterns Format

  1. lines
    newline-separated patterns, one per line, example:

    pattern1
    pattern2
    /pattern3/i
    
  2. json
    JSON array of patterns, example:

    [
      {"id": 1, "pattern": "pattern1"},
      {"id": 2, "pattern": "pattern2"},
      {"id": 3, "pattern": "/pattern3/i"}
    ]
    

    The pattern field is required and contains the regexp pattern. The id field can be omitted; if it is omitted, the id is assigned as the array index plus 1.

  3. hex
    hex-encoded marshaled Hyperscan database, generated by the hs_build_db_json function

  4. base64
    base64-encoded marshaled Hyperscan database, generated by the hs_build_db_json function
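
To make the two text formats concrete, here is a small Python sketch of how they could be parsed. It is not the PR's parser. The helper names are hypothetical, only the i flag shown in the examples is handled, and numbering lines-format patterns from 1 is an assumption; the "array index plus 1" default for JSON comes from the description above.

```python
import json
import re


def parse_pattern(spec):
    """Split '/body/flags' into (body, re flags); other text is used as-is.

    Only the 'i' flag from the examples is recognized in this sketch.
    """
    m = re.fullmatch(r"/(.*)/([a-z]*)", spec, flags=re.DOTALL)
    if m is None:
        return spec, 0
    body, flag_chars = m.groups()
    flags = 0
    if "i" in flag_chars:
        flags |= re.IGNORECASE
    return body, flags


def parse_lines(text):
    """lines format: one pattern per line.

    Assigning IDs by 1-based line position is an assumption.
    """
    specs = [line for line in text.splitlines() if line.strip()]
    return {i: parse_pattern(s) for i, s in enumerate(specs, start=1)}


def parse_json(text):
    """json format: array of objects; a missing id becomes index + 1."""
    parsed = {}
    for index, item in enumerate(json.loads(text)):
        parsed[item.get("id", index + 1)] = parse_pattern(item["pattern"])
    return parsed


if __name__ == "__main__":
    print(parse_lines("pattern1\npattern2\n/pattern3/i\n"))
    print(parse_json('[{"pattern": "a"}, {"id": 7, "pattern": "/b/i"}]'))
```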

How to build:

This PR introduces build tags for conditional compilation. To enable the Hyperscan functions and run the tests:

$ BUILD_TAGS=hyperscan make dev

Or, to build a tidb-server with Hyperscan support:

$ BUILD_TAGS=hyperscan make server
  • Before building TiDB with the Hyperscan functions, make sure CGO can find the Hyperscan library.
  • The Hyperscan version should be greater than v5.

Some examples

select t.data, hs_match_ids_json(t.data, (select concat("[", (select group_concat(json_object('id', id, 'pattern', pattern) separator ",") from patterns), "]"))) from t;

select t.data, hs_match(t.data, (select group_concat(pattern separator "\n") from patterns)) from t;

select t.data, hs_match_ids(t.data, (select hs_build_db(id, pattern) from patterns), "base64") from t;
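
The last query builds the database once and passes it to the matcher as an encoded string. The Python sketch below illustrates that build, encode, decode, match round trip under loud assumptions: a JSON byte string stands in for the marshaled Hyperscan database, Python's re module stands in for Hyperscan, and the helper names build_db, load_db, and match_ids are hypothetical.

```python
import base64
import json
import re


def build_db(rows, encode_format="hex"):
    """Serialize (id, pattern) rows and encode the bytes as hex or base64.

    The JSON payload is only a stand-in for a marshaled Hyperscan database.
    """
    blob = json.dumps(sorted(rows)).encode("utf-8")
    if encode_format == "hex":
        return blob.hex()
    if encode_format == "base64":
        return base64.b64encode(blob).decode("ascii")
    raise ValueError(f"unsupported encode format: {encode_format}")


def load_db(encoded, encode_format):
    """Decode the string and rebuild the {id: compiled pattern} map."""
    if encode_format == "hex":
        blob = bytes.fromhex(encoded)
    elif encode_format == "base64":
        blob = base64.b64decode(encoded)
    else:
        raise ValueError(f"unsupported encode format: {encode_format}")
    return {pid: re.compile(p) for pid, p in json.loads(blob)}


def match_ids(text, encoded, encode_format):
    """Return the sorted IDs of the patterns in the encoded set that match."""
    db = load_db(encoded, encode_format)
    return sorted(pid for pid, rx in db.items() if rx.search(text))


rows = [(1, r"err(or)?"), (2, r"warn")]

if __name__ == "__main__":
    encoded = build_db(rows, "base64")
    print(match_ids("error: disk full", encoded, "base64"))
```

Both encodings carry the same bytes, so the matcher gives the same answer whichever one is used, as long as the format argument matches the encoding.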

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:

Check List

Tests

  • Unit test
  • Integration test

Side effects

When matching multiple patterns

  • Performance regression
    • Consumes less CPU
    • Consumes more MEM

Release note

  • Let TiDB support multi-pattern-match

@blacktear23 blacktear23 requested a review from a team as a code owner March 24, 2021 04:07
@blacktear23 blacktear23 requested review from XuHuaiyu and removed request for a team March 24, 2021 04:07
@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by writing /lgtm in a comment.
Reviewer can cancel approval by writing /lgtm cancel in a comment.

@ti-chi-bot ti-chi-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 24, 2021
@sre-bot
Contributor

sre-bot commented Mar 24, 2021

Please follow PR Title Format:

  • pkg [, pkg2, pkg3]: what's changed

Or if the count of mainly changed packages are more than 3, use

  • *: what's changed

@blacktear23 blacktear23 changed the title Let TiDB use hyperscan to support multi-pattern-match expression: Let TiDB use hyperscan to support multi-pattern-match Mar 24, 2021
@blacktear23 blacktear23 changed the title expression: Let TiDB use hyperscan to support multi-pattern-match expression: Let TiDB use Hyperscan to support multi-pattern-match Mar 24, 2021
@bb7133
Member

bb7133 commented Mar 24, 2021

Hi @blacktear23 Would you please:

  • write a proposal(docs) for it?
  • split this PR into smaller pieces so that it can be reviewed easily.

@blacktear23
Contributor Author

@bb7133 This PR looks big, but almost all of the code is contained in 3 files, and there are also lots of comments. The PR has only 2 function implementations; all of the provided functions are based on them. So splitting it into smaller PRs may not produce anything much smaller than this one.
As for the proposal, I will try to write one.

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 7, 2021
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2021
@blacktear23 blacktear23 requested a review from a team as a code owner April 26, 2021 10:17
@YUXI903

YUXI903 commented May 14, 2021

I saw your blog post about this PR and would like your permission to reprint it on PingCAP's blog. I have sent you an email, and you will be able to review the post before it is published, so I hope you will reply soon~

@ti-chi-bot
Member

@blacktear23: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 22, 2021
@winoros
Member

winoros commented Jun 21, 2021

Once this pr is ready to review, please re-request the planner's review. Thanks!

@winoros winoros removed the request for review from a team June 21, 2021 09:07
@XuHuaiyu XuHuaiyu removed their request for review June 30, 2021 02:45
Labels
component/expression needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

6 participants