Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributed request to tables with Object Storage Engines #615

Open
wants to merge 15 commits into
base: antalya
Choose a base branch
from

Conversation

ianton-ru
Copy link

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Distributed object storage table engines

Documentation entry for user-facing changes

Before ClickHouse can made distributed table function requests with option object_storage_cluster.

SELECT * FROM icebergS3('http://minio:9000/warehouse/data/', 'minio', 'minio123', 'Parquet') SETTINGS object_storage_cluster='cluster';

This PR expand this to table engined.

CREATE TABLE t (key UInt32, value String) ENGINE=IcebergS3('http://minio:9000/warehouse/data/', 'minio', 'minio123', 'Parquet') SETTINGS object_storage_cluster='cluster';

SELECT * FROM t;
CREATE DATABASE datalake ENGINE = Iceberg('http://rest:8181/v1', 'minio', 'minio123') SETTINGS catalog_type = 'rest', storage_endpoint = 'http://minio:9000/warehouse', warehouse = 'iceberg', object_storage_cluster='cluster';

SELECT * FROM datalake.`iceberg.bids`;

Modify your CI run:

NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step

Include tests (required builds will be added automatically):

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Unit tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with Analyzer
  • Add your option here

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • Add your option here

Extra options:

  • do not test (only style check)
  • disable merge-commit (no merge from master before tests)
  • disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • 1
  • 2
  • 3
  • 4

@altinity-robot
Copy link
Collaborator

altinity-robot commented Feb 6, 2025

This is an automated comment for commit 293ac83 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests❌ failure
Regression aarch64 S3 azureThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ failure
Regression aarch64 S3 gcsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ failure
Sign aarch64There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ error
Sign releaseThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ error
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors❌ failure
Successful checks
Check nameDescriptionStatus
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker keeper imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docker server imageThe check to build and optionally push the mentioned image to docker hub✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Ready for releaseThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Alter attach partitionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Alter move partitionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Alter replace partitionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Benchmark aws_s3There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Benchmark gcsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Benchmark minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Clickhouse Keeper SSLThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 LDAP authenticationThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 LDAP external_user_directoryThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 LDAP role_mappingThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Parquet aws_s3There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Parquet minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 ParquetThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Tiered Storage minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Tiered Storage s3amazonThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Tiered Storage s3gcsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 aes_encryptionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 aggregate_functionsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 atomic_insertThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 base_58There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 clickhouse_keeperThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 data_typesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 datetime64_extended_rangeThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 disk_level_encryptionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 dnsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 enginesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 exampleThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 extended_precision_data_typesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 kafkaThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 kerberosThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 key_valueThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 lightweight_deleteThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 memoryThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 part_moves_between_shardsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 selectsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 session_timezoneThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 tiered_storageThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 window_functionsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success

Copy link
Collaborator

@arthurpassos arthurpassos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a first glance, it looks good. This is the first round of review, it includes just a few questions and minor suggestions.

Once these are addressed, I'll dive a bit deeper in the AST creation logic and it should be good to go :).

@@ -247,6 +247,10 @@ class StorageObjectStorage::Configuration

virtual void update(ObjectStoragePtr object_storage, ContextPtr local_context);

virtual void setFunctionArgs(ASTs & /* args */) const
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this function needs to be either renamed or documented. Usually, when I read a function named void setSomething(Argument argument), I assume the members of the current instance will be modified.

This function is doing the opposite, it is taking the members of the existing instance and populating the argumennt.

Maybe you could construct the arguments object and return it?

ASTPtr getTableFunctionArguments()
{
      auto arguments = std::make_shared<ASTExpressionList>();
      // populate it
       return arguments;
}

and then, upon call site:

auto function_arguments = configuration->getTableFunctionArguments();

Copy link
Collaborator

@arthurpassos arthurpassos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there, couple more changes or discussions and it is good to be merged

@@ -153,7 +160,7 @@ void StorageObjectStorageCluster::updateQueryForDistributedEngineIfNeeded(ASTPtr
queryToString(query));
}

configuration->setFunctionArgs(arguments->children);
configuration->getTableFunctionArguments(arguments->children);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #615 (comment), I suggested the arguments to be created inside getTableFunctionArguments instead of populating a parameter. Any specific reason you did it this way?

If there is one, I am fine with this approach

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to avoid copying after function call. children is a vector, not a pointer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I follow.

On line 155, you do: auto arguments = std::make_shared<ASTExpressionList>();

Then you are passing children as a reference on 167 and populating it inside the getTableFunctionArguments function.

On line 169, you read the values.

What I am suggesting is that you create the arguments list inside the getTableFunctionArguments function, populate and simply return it. It'll be a shared_ptr, there is no deep copy in this process. Then, the rest is the same.

What am I missing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes not in arguments but in arguments->children
children has type ASTs
https://github.com/ClickHouse/ClickHouse/blob/master/src/Parsers/IAST.h#L37
It's a variant of vector, not a single pointer
https://github.com/ClickHouse/ClickHouse/blob/master/src/Parsers/IAST_fwd.h#L14

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on slack, let's make the function return it as suggested in the original comment

void StorageAzureConfiguration::getTableFunctionArguments(ASTs & args) const
{
if (!args.empty())
{ /// Just check
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just remove this comment

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, if you implement this function in the way I suggested in #615 (comment), perhaps this whole if statement is no longer necessary?

args.push_back(std::make_shared<ASTLiteral>(auth_settings[S3AuthSetting::session_token].value));
if (format != "auto")
args.push_back(std::make_shared<ASTLiteral>(format));
if (!compression_method.empty())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok not to add some other arguments like structure and headers as the docs suggest? https://clickhouse.com/docs/en/sql-reference/table-functions/s3

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this fields are already added in addStructureAndFormatToArgsIfNeeded methods.
https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/ObjectStorage/S3/Configuration.cpp#L399
These field required for table functions too, when access parameters need to add only for table engine case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: perhaps add a comment saying this is already added somewhere else?

Copy link
Author

@ianton-ru ianton-ru Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I rename method to addPathAndAccessKeysToArgs, to consistency with addStructureAndFormatToArgsIfNeeded

arthurpassos
arthurpassos previously approved these changes Feb 14, 2025
Copy link
Collaborator

@arthurpassos arthurpassos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I posted two more comments which are not mandatory, although I am curious about #615 (comment).

In any case, it can be merged as it is. LGTM

@ianton-ru ianton-ru force-pushed the project-antalya-24.12.2-distributed-table-engine branch from c7b1ad3 to 3fafe6f Compare February 18, 2025 15:43
@Enmk Enmk changed the base branch from project-antalya-24.12.2 to antalya February 19, 2025 21:28
@ianton-ru ianton-ru mentioned this pull request Feb 20, 2025
29 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants