This repository was archived by the owner on Sep 2, 2025. It is now read-only.

Commit 289d225

Optimize slow query that uses a high amount of temporary disk space to find relations
Resolves https://github.com/dbt-labs/dbt-postgres/issues/189

The macro postgres_get_relations in relations.sql was extremely slow and used an extremely high amount of temporary disk space on a system with high numbers of schemas, tables, and dependencies between database objects (rows in pg_depend). Slow to the point of not completing in 50 minutes and using more than 160GB disk space (at which point PostgreSQL ran out of disk space and aborted the query).

The solution here optimises the query and so it runs in just under 1 second on my system. It does this by being heavily inspired by the definition of information_schema.view_table_usage, and specifically:

- Stripping out CTEs that can be optimisation blockers, often by causing CTEs to be materialised to disk (especially in older PostgreSQL, but I suspect in recent too in some cases).
- Removing unnecessary filtering on relkind: going via pg_rewrite (or rather, the equivalent row on pg_depend) is equivalent to that
- Avoiding sequential scans on any table by structuring joins/where clause to leverage indexes, especially on pg_depend
- Removing unnecessary filtering out system catalog tables from dependents (they are excluded by the remaining filters on referenced tables).
- Not having `select distinct ... from pg_dependent` in the innards of the query, and instead having a top level `select distinct` - on my system this saved over 45 seconds.
- Excluding self-relations that depend on themselves by using oid rather than using the names of tables and schemas.

I suspect this is also more robust because oids I think _can_ be repeated between system tables, and so when querying pg_depend filtering on classid and refclassid is required (and I think also means indexes are better leveraged).

Comparing calls to `explain` it reduces the largest "rows" value from 5,284,141,410,595,979 (over five quadrillion) to 219 and the actual run time from never completing within 50 minutes (because it used all of the 160GB available) to completing in ~500ms.

It also has some style/naming changes:

- Using a `distinct` on the top level rather than a group by for clarity (performance seemed the same in my case).
- Flips the definition of "referenced" and "dependent" in the query to match both the definitions in pg_depend, and the code at https://github.com/dbt-labs/dbt-postgres/blob/05f0337d6b05c9c68617e41c0b5bca9c2a733783/dbt/adapters/postgres/impl.py#L113
- Re-orders the join to I think a slightly clearer order that "flows" from views -> the linking table (pg_depend) to the tables referenced in the views.
- Lowers the abstraction/indirection levels in naming/aliases, using names closer to the PostgreSQL catalog tables - this made it easier to write and understand, and so I suspect easier to make changes in future (I found I had to keep in mind the PostgreSQL definitions more than the output of the query when making changes).
1 parent 05f0337 commit 289d225

2 files changed: +38 -61 lines changed

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+kind: Fixes
+body: Optimize slow query that uses a high amount of temporary disk space to find relations
+time: 2025-01-18T08:41:03.022013Z
+custom:
+  Author: michalc
+  Issue: "189"

dbt/include/postgres/macros/relations.sql

Lines changed: 32 additions & 61 deletions
@@ -7,68 +7,39 @@
 #}

 {%- call statement('relations', fetch_result=True) -%}
-with relation as (
-select
-pg_rewrite.ev_class as class,
-pg_rewrite.oid as id
-from pg_rewrite
-),
-class as (
-select
-oid as id,
-relname as name,
-relnamespace as schema,
-relkind as kind
-from pg_class
-),
-dependency as (
-select distinct
-pg_depend.objid as id,
-pg_depend.refobjid as ref
-from pg_depend
-),
-schema as (
-select
-pg_namespace.oid as id,
-pg_namespace.nspname as name
-from pg_namespace
-where nspname != 'information_schema' and nspname not like 'pg\_%'
-),
-referenced as (
-select
-relation.id AS id,
-referenced_class.name ,
-referenced_class.schema ,
-referenced_class.kind
-from relation
-join class as referenced_class on relation.class=referenced_class.id
-where referenced_class.kind in ('r', 'v', 'm')
-),
-relationships as (
-select
-referenced.name as referenced_name,
-referenced.schema as referenced_schema_id,
-dependent_class.name as dependent_name,
-dependent_class.schema as dependent_schema_id,
-referenced.kind as kind
-from referenced
-join dependency on referenced.id=dependency.id
-join class as dependent_class on dependency.ref=dependent_class.id
-where
-(referenced.name != dependent_class.name or
-referenced.schema != dependent_class.schema)
-)
+select distinct
+dependent_namespace.nspname as dependent_schema,
+dependent_class.relname as dependent_name,
+referenced_namespace.nspname as referenced_schema,
+referenced_class.relname as referenced_name

-select
-referenced_schema.name as referenced_schema,
-relationships.referenced_name as referenced_name,
-dependent_schema.name as dependent_schema,
-relationships.dependent_name as dependent_name
-from relationships
-join schema as dependent_schema on relationships.dependent_schema_id=dependent_schema.id
-join schema as referenced_schema on relationships.referenced_schema_id=referenced_schema.id
-group by referenced_schema, referenced_name, dependent_schema, dependent_name
-order by referenced_schema, referenced_name, dependent_schema, dependent_name;
+-- Query for views: views are entries in pg_class with an entry in pg_rewrite, but we avoid
+-- a seq scan on pg_rewrite by leveraging the fact there is an "internal" row in pg_depend for
+-- the view...
+from pg_class as dependent_class
+join pg_namespace as dependent_namespace on dependent_namespace.oid = dependent_class.relnamespace
+join pg_depend as dependent_depend on dependent_depend.refobjid = dependent_class.oid
+and dependent_depend.classid = 'pg_rewrite'::regclass
+and dependent_depend.refclassid = 'pg_class'::regclass
+and dependent_depend.deptype = 'i'
+
+-- ... and via pg_depend (that has a row per column, hence the need for "distinct" above, and
+-- making sure to exclude internal row to avoid a view appearing to depend on itself)...
+join pg_depend as referenced_depend on referenced_depend.objid = dependent_depend.objid
+and referenced_depend.classid = 'pg_rewrite'::regclass
+and referenced_depend.refclassid = 'pg_class'::regclass
+and referenced_depend.refobjid != dependent_depend.refobjid
+
+-- ... we can find the tables they query from in pg_class, but excluding system tables. Note we
+-- don't need need to exclude _dependent_ system tables, because they only query from other
+-- system tables, and so are automatically excluded by excluding _referenced_ system tables
+join pg_class as referenced_class on referenced_class.oid = referenced_depend.refobjid
+join pg_namespace as referenced_namespace on referenced_namespace.oid = referenced_class.relnamespace
+and referenced_namespace.nspname != 'information_schema'
+and referenced_namespace.nspname not like 'pg\_%'
+
+order by
+dependent_schema, dependent_name, referenced_schema, referenced_name;

 {%- endcall -%}
