Commit 11d58d2

Update DataFlow Docs (#3981)
* add console indication to dataflow main page
* add GCP console screenshots
* Add GCP console docs
1 parent 4b024d0 commit 11d58d2

6 files changed: +50 −13 lines

docs/integrations/data-ingestion/google-dataflow/templates.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -22,7 +22,7 @@ Google Dataflow templates provide a convenient way to execute prebuilt, ready-to
 
 ## How to Run Dataflow Templates {#how-to-run-dataflow-templates}
 
-As of today, the ClickHouse official template is available via the Google Cloud CLI or Dataflow REST API.
+As of today, the ClickHouse official template is available via the Google Cloud Console, CLI or Dataflow REST API.
 For detailed step-by-step instructions, refer to the [Google Dataflow Run Pipeline From a Template Guide](https://cloud.google.com/dataflow/docs/templates/provided-templates).
 
 
```
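The added line above names three launch surfaces. As a hedged sketch of the CLI surface, a flex-template launch typically looks like the following; the template GCS location and every parameter value here are illustrative placeholders, not the published path:

```bash
# Sketch of launching the template with the gcloud CLI.
# The --template-file-gcs-location value is a placeholder; use the path
# published in the template's documentation.
gcloud dataflow flex-template run "bigquery-to-clickhouse-$(date +%Y%m%d%H%M%S)" \
  --region us-central1 \
  --template-file-gcs-location "gs://<template-bucket>/bigquery-to-clickhouse-metadata.json" \
  --parameters jdbcUrl="jdbc:clickhouse://<host>:8443/default?ssl=true&sslmode=NONE",clickHouseUsername=default,clickHouseTable=target_table
```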
docs/integrations/data-ingestion/google-dataflow/templates/bigquery-to-clickhouse.md

Lines changed: 49 additions & 12 deletions
Original file line numberDiff line numberDiff line change
```diff
@@ -9,19 +9,25 @@ title: 'Dataflow BigQuery to ClickHouse template'
 import TOCInline from '@theme/TOCInline';
 import Image from '@theme/IdealImage';
 import dataflow_inqueue_job from '@site/static/images/integrations/data-ingestion/google-dataflow/dataflow-inqueue-job.png'
+import dataflow_create_job_from_template_button from '@site/static/images/integrations/data-ingestion/google-dataflow/create_job_from_template_button.png'
+import dataflow_template_clickhouse_search from '@site/static/images/integrations/data-ingestion/google-dataflow/template_clickhouse_search.png'
+import dataflow_template_initial_form from '@site/static/images/integrations/data-ingestion/google-dataflow/template_initial_form.png'
+import dataflow_extended_template_form from '@site/static/images/integrations/data-ingestion/google-dataflow/extended_template_form.png'
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
 
 # Dataflow BigQuery to ClickHouse template
 
-The BigQuery to ClickHouse template is a batch pipeline that ingests data from BigQuery table into ClickHouse table.
-The template can either read the entire table or read specific records using a provided query.
+The BigQuery to ClickHouse template is a batch pipeline that ingests data from a BigQuery table into a ClickHouse table.
+The template can read the entire table or filter specific records using a provided SQL query.
 
-<TOCInline toc={toc}></TOCInline>
+<TOCInline toc={toc} maxHeadingLevel={2}></TOCInline>
 
 ## Pipeline requirements {#pipeline-requirements}
 
 * The source BigQuery table must exist.
 * The target ClickHouse table must exist.
-* The ClickHouse host Must be accessible from the Dataflow worker machines.
+* The ClickHouse host must be accessible from the Dataflow worker machines.
 
 ## Template Parameters {#template-parameters}
 
```
```diff
@@ -33,7 +39,7 @@ The template can either read the entire table or read specific records using a provided query.
 | `jdbcUrl` | The ClickHouse JDBC URL in the format `jdbc:clickhouse://<host>:<port>/<schema>`. || Don't add the username and password as JDBC options. Any other JDBC option could be added at the end of the JDBC URL. For ClickHouse Cloud users, add `ssl=true&sslmode=NONE` to the `jdbcUrl`. |
 | `clickHouseUsername` | The ClickHouse username to authenticate with. || |
 | `clickHousePassword` | The ClickHouse password to authenticate with. || |
-| `clickHouseTable` | The target ClickHouse table name to insert the data to. || |
+| `clickHouseTable` | The target ClickHouse table into which data will be inserted. || |
 | `maxInsertBlockSize` | The maximum block size for insertion, if we control the creation of blocks for insertion (ClickHouseIO option). | | A `ClickHouseIO` option. |
 | `insertDistributedSync` | If setting is enabled, insert query into distributed waits until data will be sent to all nodes in cluster. (ClickHouseIO option). | | A `ClickHouseIO` option. |
 | `insertQuorum` | For INSERT queries in the replicated table, wait writing for the specified number of replicas and linearize the addition of the data. 0 - disabled. | | A `ClickHouseIO` option. This setting is disabled in default server settings. |
```
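To make the `jdbcUrl` row concrete: for ClickHouse Cloud the URL carries the SSL options but never the credentials. The hostname below is a made-up placeholder:

```bash
# Illustrative jdbcUrl value for ClickHouse Cloud (placeholder hostname).
# Username and password go in clickHouseUsername/clickHousePassword, not here.
JDBC_URL="jdbc:clickhouse://abc123.us-east-1.aws.clickhouse.cloud:8443/default?ssl=true&sslmode=NONE"
```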
```diff
@@ -49,16 +55,15 @@ The template can either read the entire table or read specific records using a provided query.
 
 
 :::note
-All `ClickHouseIO` parameters default values could be found in [`ClickHouseIO` Apache Beam Connector](/integrations/apache-beam#clickhouseiowrite-parameters)
+Default values for all `ClickHouseIO` parameters can be found in [`ClickHouseIO` Apache Beam Connector](/integrations/apache-beam#clickhouseiowrite-parameters)
 :::
 
 ## Source and Target Tables Schema {#source-and-target-tables-schema}
 
-In order to effectively load the BigQuery dataset to ClickHouse, and a column infestation process is conducted with the
-following phases:
+To effectively load the BigQuery dataset into ClickHouse, the pipeline performs a column inference process with the following phases:
 
 1. The templates build a schema object based on the target ClickHouse table.
-2. The templates iterate over the BigQuery dataset, and tried to match between column based on their names.
+2. The templates iterate over the BigQuery dataset and attempt to match columns based on their names.
 
 <br/>
 
```
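To illustrate the name-based matching in step 2: a hypothetical target table whose column names deliberately mirror the BigQuery source columns (all identifiers here are invented for the example):

```bash
# Hypothetical target table; the column names mirror the BigQuery source
# columns (user_id, email, created_at) so the template can pair them by name.
clickhouse-client --query "
CREATE TABLE default.target_table
(
    user_id    Int64,
    email      String,
    created_at DateTime64(6)
)
ENGINE = MergeTree
ORDER BY user_id
"
```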
```diff
@@ -92,6 +97,36 @@ requirements and prerequisites.
 
 :::
 
+<Tabs>
+<TabItem value="console" label="Google Cloud Console" default>
+Sign in to your Google Cloud Console and search for Dataflow.
+
+1. Press the `CREATE JOB FROM TEMPLATE` button
+   <Image img={dataflow_create_job_from_template_button} border alt="Dataflow console" />
+2. Once the template form is open, enter a job name and select the desired region.
+   <Image img={dataflow_template_initial_form} border alt="Dataflow template initial form" />
+3. In the `Dataflow Template` input, type `ClickHouse` or `BigQuery`, and select the `BigQuery to ClickHouse` template.
+   <Image img={dataflow_template_clickhouse_search} border alt="Select BigQuery to ClickHouse template" />
+4. Once selected, the form will expand so you can provide additional details:
+   * The ClickHouse server JDBC URL, in the format `jdbc:clickhouse://host:port/schema`.
+   * The ClickHouse username.
+   * The ClickHouse target table name.
+
+<br/>
+
+:::note
+The ClickHouse password option is marked as optional, to support use cases where no password is configured.
+To provide one, scroll down to the `Password for ClickHouse Endpoint` option.
+:::
+
+<Image img={dataflow_extended_template_form} border alt="BigQuery to ClickHouse extended template form" />
+
+5. Customize and add any BigQuery/ClickHouseIO-related configuration, as detailed in
+   the [Template Parameters](#template-parameters) section.
+
+</TabItem>
+<TabItem value="cli" label="Google Cloud CLI">
+
 ### Install & Configure `gcloud` CLI {#install--configure-gcloud-cli}
 
 - If not already installed, install the [`gcloud` CLI](https://cloud.google.com/sdk/docs/install).
```
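For the CLI tab that begins here, the configure step usually reduces to a few commands; a sketch assuming a placeholder project ID:

```bash
# Authenticate and point gcloud at the project that will run the Dataflow job.
# "my-project" is a placeholder project ID.
gcloud auth login
gcloud config set project my-project
# The Dataflow API must be enabled once per project.
gcloud services enable dataflow.googleapis.com
```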
````diff
@@ -134,6 +169,9 @@ job:
   startTime: '2025-01-26T14:34:04.608442Z'
 ```
 
+</TabItem>
+</Tabs>
+
 ### Monitor the Job {#monitor-the-job}
 
 Navigate to the [Dataflow Jobs tab](https://console.cloud.google.com/dataflow/jobs) in your Google Cloud Console to
````
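Beyond the console view described above, job status can also be checked from the CLI; a sketch with placeholder region and job ID:

```bash
# List recent Dataflow jobs in a region, then inspect one by ID.
gcloud dataflow jobs list --region=us-central1 --limit=5
gcloud dataflow jobs show JOB_ID --region=us-central1
```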
```diff
@@ -147,9 +185,8 @@ monitor the status of the job. You'll find the job details, including progress a
 
 This error occurs when ClickHouse runs out of memory while processing large batches of data. To resolve this issue:
 
-* Increase the instance resources: Upgrade your ClickHouse server to a larger instance with more memory to handle the data processing load.
-* Decrease the batch size: Adjust the batch size in your Dataflow job configuration to send smaller chunks of data to ClickHouse, reducing memory consumption per batch.
-These changes might help balance resource usage during data ingestion.
+* Increase the instance resources: Upgrade your ClickHouse server to a larger instance with more memory to handle the data processing load.
+* Decrease the batch size: Adjust the batch size in your Dataflow job configuration to send smaller chunks of data to ClickHouse, reducing memory consumption per batch. These changes can help balance resource usage during data ingestion.
 
 ## Template Source Code {#template-source-code}
 
```
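One way to act on the batch-size advice is the `maxInsertBlockSize` parameter from the table above; a sketch, where the value is illustrative and the other flags reuse the same placeholders as the earlier launch sketch:

```bash
# Relaunch with a smaller insert block size (illustrative value) to reduce
# per-batch memory pressure on ClickHouse; other flags as in the earlier sketch.
gcloud dataflow flex-template run "bigquery-to-clickhouse-retry" \
  --region us-central1 \
  --template-file-gcs-location "gs://<template-bucket>/bigquery-to-clickhouse-metadata.json" \
  --parameters jdbcUrl="jdbc:clickhouse://<host>:8443/default?ssl=true&sslmode=NONE",clickHouseUsername=default,clickHouseTable=target_table,maxInsertBlockSize=100000
```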
(The remaining four changed files are the new screenshot images under `static/images/integrations/data-ingestion/google-dataflow/` referenced by the imports above.)
