PgDog's strategy for resharding Postgres databases is to create a new, independent cluster of machines and move data over to it in real-time. Creating new databases is environment-specific, and PgDog doesn't currently automate this step.
## Requirements
New databases should be **empty**: don't migrate your [table definitions](schema.md) or [data](hash.md). These will be taken care of automatically by PgDog. The following items do need to be created manually, however:
1. Database users
2. Database schemas
### Database users
Since PgDog was built to work in cloud-managed environments, like AWS RDS, we don't usually have access to the `pg_shadow` view, which contains password hashes. Tools like [`pg_dumpall`](https://www.postgresql.org/docs/current/app-pg-dumpall.html) therefore can't export role definitions, and we can't automatically migrate users to the new database.
For this reason, migrating users to the new database cluster is currently **not supported** and is the responsibility of the operator.
Make sure to create all the necessary Postgres users and roles before proceeding to the [next step](schema.md).
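For example, an application role can be re-created on the destination cluster with standard SQL. The role name, password, and grants below are placeholders; use your own role definitions:

```postgresql
CREATE ROLE app_user LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE prod TO app_user;
GRANT USAGE ON SCHEMA public TO app_user;
```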
### Database schemas
!!! note ":material-account-hard-hat: Work in progress"
    This step will be automated by a future version of PgDog.
Before running the [schema sync](schema.md), make sure to re-create all of your existing schemas on the new databases. You can take advantage of [cross-shard DDL](../cross-shard.md#create-alter-drop) queries to make this easier.
The `public` schema is created by default for all databases, so if you aren't using any additional schemas, you can skip this step.
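If you do use additional schemas, a single DDL statement sent through PgDog is broadcast to every shard, so each schema only needs to be created once. The `analytics` schema name here is just an example:

```postgresql
CREATE SCHEMA IF NOT EXISTS analytics;
```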
## Multiple Postgres databases
If you are operating multiple Postgres databases on the same database server, they need to be resharded separately: logical replication, which PgDog uses to move data, operates on one Postgres database at a time.
---
icon: material/database-export-outline
---
# Move data
If you're using the `HASH` sharding function, adding a node to the cluster changes the modulo by 1. The number returned by the hash function is uniformly distributed across the entire integer range, which is much larger than the modulo, so changing the modulo remaps most rows to different shard numbers.
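The effect is easy to see with a small sketch. This is not PgDog's actual hash function, just a stand-in uniform hash, counting how many keys land on a different shard when the shard count grows from 4 to 5:

```python
import random

def shard(key: int, num_shards: int) -> int:
    # Stand-in for a uniform hash function; Postgres uses its own
    # hashing internally. This only illustrates the modulo effect.
    random.seed(key)
    return random.getrandbits(32) % num_shards

keys = range(10_000)
moved = sum(1 for k in keys if shard(k, 4) != shard(k, 5))
print(f"{moved / len(keys):.0%} of keys map to a different shard")
```

A key keeps its shard only when its hash is equal modulo both 4 and 5, which for a uniform hash happens about 1 time in 5, so roughly 80% of rows would have to move. This is why resharding in place is impractical and PgDog moves data to a new cluster instead.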
## Data sync
Moving data online is a 2-step process:
1. Copy data from tables using Postgres `COPY`
2. Stream real-time changes using logical replication
To make sure no rows are lost in the process, PgDog follows a similar strategy used by Postgres in logical replication subscriptions, with some improvements.
### Copying tables
Copying table data from the source database cluster is done using Postgres `COPY` and logical replication slots. This is implemented in the `data-sync` command:
```bash
pgdog data-sync --help
```

All databases and users must be configured in `pgdog.toml` and `users.toml`.
### Real-time changes
After data sync is complete, changes for all tables in the publication will be streamed in real-time. Keep this connection
open until you are ready to cut traffic over to the new database cluster.
This feature is a work in progress. Support for resharding with logical replication was started in [#279](https://github.com/pgdogdev/pgdog/pull/279).
Resharding changes the number of shards in an existing database cluster, in order to add or remove capacity. To make this less impactful on production operations, PgDog's strategy for resharding is to create a new database cluster and reshard data in-flight, while moving it to the new databases.
To make this an online process, with zero downtime or data loss, PgDog hooks into the logical replication protocol used by PostgreSQL and reroutes messages between nodes to create and update rows in real-time.
The resharding process is composed of four independent operations:
1. #### [Create new databases](databases.md)
2. #### [Synchronize schema](schema.md)
3. #### [Move data](hash.md)
4. #### [Cutover traffic](cutover.md)
All of the individual steps are automated by PgDog, while their orchestration is currently the responsibility of the user.
## Terminology
| Term | Description |
|-|-|
| Source database | The database cluster that's being resharded and contains all data and table definitions. |
| Destination database | The database cluster with the new sharding configuration, where the data will be copied from the source database. |
| Logical replication | Replication protocol available to PostgreSQL databases since version 10. |
PgDog can copy tables, indexes and other entities from your production database to the new, sharded database automatically. This is faster than using `pg_dump`, because we separate this process into two parts:
1. [Create tables](#tables-and-primary-keys), primary key indices, and sequences
2. Create [secondary indices](#secondary-indices)
The first step is performed before [copying data](hash.md); the second, once the data sync is almost complete.
## CLI
PgDog has a command line interface you can call by running it directly. Schema sync is controlled by a CLI command:
```bash
pgdog schema-sync \
    --from-database <from-database> \
    --to-database <to-database> \
    --publication <publication>
```
Required (*) and optional parameters for this command are as follows:
| Parameter | Description |
|-|-|
|`--from-database`*| The name of the source database in `pgdog.toml`. |
|`--to-database`*| The name of the destination database in `pgdog.toml`. |
|`--publication`*| The name of the Postgres table [publication](#publication) with the tables you want to sync. |
|`--dry-run`| Print the SQL statements that will be executed on the destination database and exit. |
|`--ignore-errors`| Execute SQL statements and ignore any errors. |
|`--data-sync-complete`| Run the second step to create secondary indices and sequences. |
## Tables and primary keys
The first step in the schema sync copies over tables and their primary key indexes from the source database to the new, resharded cluster. This has to be done separately, because Postgres's logical replication only copies data and doesn't manage table schemas.
### Primary keys
A primary key constraint is **required** on all tables for logical replication to work correctly. Without a unique index identifying each row in a table, logical replication is not able to perform `UPDATE` and `DELETE` commands.
Before starting the resharding process for your database, double-check that you have primary keys on all your tables.
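One way to double-check is to list ordinary tables that lack a primary key, using the system catalogs on the source database:

```postgresql
SELECT n.nspname AS schema, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
      SELECT 1 FROM pg_index i
      WHERE i.indrelid = c.oid AND i.indisprimary
  );
```

Any table returned by this query needs a primary key added before resharding.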
## Publication
Since PgDog is using logical replication to move and reshard data, a [publication](https://www.postgresql.org/docs/current/sql-createpublication.html) for the relevant tables needs to be created on the source database.
The simplest way to do this is to run the following command on the **source database**:
```postgresql
CREATE PUBLICATION pgdog FOR ALL TABLES;
```
This will make sure _all_ tables in your database will be resharded into the destination database cluster.
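If only a subset of tables should be moved, the publication can name them explicitly instead. The table names here are placeholders:

```postgresql
CREATE PUBLICATION pgdog FOR TABLE users, orders;
```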
!!! note "Multiple schemas"
    If you're using schemas other than `public`, create them on the destination database before running the schema sync.
## Schema admin
Schema sync creates tables, indices, and other entities on the destination database. To make sure this is done by a user with sufficient privileges (e.g., `CREATE` permission on the database), add that user to [`users.toml`](../../../configuration/users.toml/users.md) and mark it as the schema administrator:
```toml
[[users]]
name = "migrator"
database = "prod"
password = "hunter2"
schema_admin = true
```
PgDog will use that user to connect to the source and destination databases, so make sure to specify one for both of them.
## Secondary indices
This step is performed after [data sync](hash.md) is complete. Running this step will create secondary indexes on all your tables, which will take some time.
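While this step runs, index builds on large tables can be monitored on the destination Postgres (version 12 and later) with the standard progress view:

```postgresql
SELECT relid::regclass AS table_name,
       phase,
       blocks_done,
       blocks_total
FROM pg_stat_progress_create_index;
```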