Skip to content

Commit eb19eb9

Browse files
authored
Merge pull request #247 from constructive-io/feat/quoting
feat: add relaxed quoting for qualified name tails
2 parents 1923251 + 8591d43 commit eb19eb9

File tree

4 files changed

+471
-27
lines changed

4 files changed

+471
-27
lines changed

packages/deparser/QUOTING-RULES.md

Lines changed: 376 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,376 @@
1+
# PostgreSQL Identifier Quoting Rules
2+
3+
This document describes the identifier quoting strategy used by the pgsql-deparser package. It explains the underlying PostgreSQL rules, our implementation choices, and guidance for contributors.
4+
5+
## Goals and Non-Goals
6+
7+
### Goals
8+
9+
The deparser aims to emit valid PostgreSQL SQL that can be re-parsed by PostgreSQL (and by libpg-query). Additionally, we aim to emit minimally-quoted SQL when safe, especially for qualified names, in order to reduce noisy diffs. For example, we prefer `faker.float` over `faker."float"` when the grammar allows it.
10+
11+
### Non-Goals
12+
13+
We do not preserve the user's original quoting style. If the input SQL used `"mycolumn"` but `mycolumn` is safe to emit unquoted, we may emit it unquoted.
14+
15+
We do not preserve identifier case unless quoting is required. Unquoted identifiers are folded to lowercase by PostgreSQL, so `MyFunc` and `myfunc` are equivalent when unquoted. If case preservation is needed, the identifier must be quoted.
16+
17+
We do not implement PostgreSQL's `quote_all_identifiers` mode (used by pg_dump for deterministic dumps). This could be added as a deparser option in the future if needed.
18+
19+
## PostgreSQL Background: What Quoting Means
20+
21+
### Identifier Folding
22+
23+
PostgreSQL folds unquoted identifiers to lowercase. This means:
24+
25+
- `SELECT * FROM MyTable` is equivalent to `SELECT * FROM mytable`
26+
- `SELECT * FROM "MyTable"` preserves the mixed case and refers to a table literally named `MyTable`
27+
28+
If you need to preserve uppercase letters, spaces, or special characters in an identifier, you must quote it.
29+
30+
### When Quotes Are Required
31+
32+
An identifier must be quoted if any of the following are true:
33+
34+
1. **First character is not a lowercase letter or underscore**: Identifiers starting with uppercase letters, digits, or special characters must be quoted.
35+
36+
2. **Contains characters outside the safe set**: Only lowercase letters (`a-z`), digits (`0-9`), and underscores (`_`) are considered safe. Note that `$` is explicitly excluded even though PostgreSQL allows it in some contexts; our implementation is conservative.
37+
38+
3. **Is a SQL keyword**: Keywords must be quoted to be used as identifiers, with one exception: `UNRESERVED_KEYWORD` tokens can be used as identifiers without quotes in most contexts.
39+
40+
4. **Contains embedded double quotes**: These must be escaped by doubling them (`"` becomes `""`).
41+
42+
### Keyword Categories in PostgreSQL
43+
44+
PostgreSQL classifies keywords into four categories (defined in `kwlist.h` / our `kwlist.ts`):
45+
46+
| Category | Description | Quoting Required? |
47+
|----------|-------------|-------------------|
48+
| `UNRESERVED_KEYWORD` | Can be used as identifiers in most contexts | No |
49+
| `COL_NAME_KEYWORD` | Reserved in some contexts, allowed as column names | Yes (in strict mode) |
50+
| `TYPE_FUNC_NAME_KEYWORD` | Reserved in some contexts, allowed as type/function names | Yes (in strict mode) |
51+
| `RESERVED_KEYWORD` | Fully reserved, cannot be used as identifiers without quotes | Yes |
52+
53+
Examples:
54+
- `abort`, `absolute`, `access` are `UNRESERVED_KEYWORD` - no quoting needed
55+
- `float`, `interval`, `boolean` are `COL_NAME_KEYWORD` - quoting depends on context
56+
- `left`, `right`, `join` are `TYPE_FUNC_NAME_KEYWORD` - quoting depends on context
57+
- `select`, `from`, `where`, `table` are `RESERVED_KEYWORD` - always need quoting as identifiers
58+
59+
## The Strict Quoting Algorithm
60+
61+
The `QuoteUtils.quoteIdentifier()` function implements PostgreSQL's `quote_identifier()` algorithm from `ruleutils.c`. This is the "strict" or "canonical" quoting policy.
62+
63+
### Algorithm
64+
65+
```
66+
function quoteIdentifier(ident):
67+
if ident is empty:
68+
return ident
69+
70+
safe = true
71+
72+
// Rule 1: First character must be lowercase letter or underscore
73+
if first_char not in [a-z_]:
74+
safe = false
75+
76+
// Rule 2: All characters must be in safe set
77+
for each char in ident:
78+
if char not in [a-z0-9_]:
79+
safe = false
80+
81+
// Rule 3: Must not be a keyword (except UNRESERVED_KEYWORD)
82+
if safe:
83+
kwKind = keywordKindOf(ident)
84+
if kwKind != NO_KEYWORD and kwKind != UNRESERVED_KEYWORD:
85+
safe = false
86+
87+
if safe:
88+
return ident // No quoting needed
89+
90+
// Build quoted identifier with escaped embedded quotes
91+
result = '"'
92+
for each char in ident:
93+
if char == '"':
94+
result += '"' // Escape " as ""
95+
result += char
96+
result += '"'
97+
98+
return result
99+
```
100+
101+
### Examples
102+
103+
| Input | Output | Reason |
104+
|-------|--------|--------|
105+
| `mytable` | `mytable` | All lowercase, not a keyword |
106+
| `my_table_2` | `my_table_2` | Safe characters only |
107+
| `MyTable` | `"MyTable"` | Contains uppercase |
108+
| `my-table` | `"my-table"` | Contains hyphen |
109+
| `my table` | `"my table"` | Contains space |
110+
| `2fast` | `"2fast"` | Starts with digit |
111+
| `float` | `"float"` | COL_NAME_KEYWORD |
112+
| `select` | `"select"` | RESERVED_KEYWORD |
113+
| `abort` | `abort` | UNRESERVED_KEYWORD |
114+
| `say"hello` | `"say""hello"` | Contains embedded quote |
115+
116+
## Why a Second Policy Exists: Grammar-Slot Sensitivity
117+
118+
PostgreSQL's `quote_identifier()` function is intentionally conservative and context-free. It doesn't know where the identifier will be used, so it quotes anything that might cause problems in any context.
119+
120+
However, PostgreSQL's grammar is context-sensitive. Different syntactic positions accept different sets of tokens. The key insight is:
121+
122+
**Identifiers that appear after a dot (in qualified names) are in highly permissive grammar positions that accept all keyword categories, including `RESERVED_KEYWORD`.**
123+
124+
This means that while `float()` as a standalone function call would fail to parse (because `float` is a `COL_NAME_KEYWORD`), `faker.float()` parses successfully because the `float` appears after a dot.
125+
126+
### Empirical Verification
127+
128+
We verified this behavior by testing with libpg-query:
129+
130+
| SQL | Parses? | Reason |
131+
|-----|---------|--------|
132+
| `SELECT float()` | No | `float` is COL_NAME_KEYWORD, conflicts with type name |
133+
| `SELECT faker.float()` | Yes | After dot, all keywords accepted |
134+
| `SELECT select FROM t` | No | `select` is RESERVED_KEYWORD |
135+
| `SELECT t.select FROM t` | Yes | After dot, all keywords accepted |
136+
| `SELECT * FROM interval` | No | `interval` is COL_NAME_KEYWORD |
137+
| `SELECT * FROM myschema.interval` | Yes | After dot, all keywords accepted |
138+
139+
### The Deparser's Opportunity
140+
141+
Since we know that identifiers after a dot are in permissive positions, we can emit them without keyword-based quoting. This produces cleaner output:
142+
143+
- Instead of `faker."float"()` we emit `faker.float()`
144+
- Instead of `pg_catalog."substring"()` we emit `pg_catalog.substring()`
145+
- Instead of `t."select"` we emit `t.select`
146+
147+
This is an intentional deviation from PostgreSQL's `quote_identifier()` behavior, designed to produce minimal quoting while still emitting valid SQL.
148+
149+
## The Relaxed Quoting Algorithm (After-Dot)
150+
151+
The `QuoteUtils.quoteIdentifierAfterDot()` function implements "lexical-only" quoting for identifiers in permissive after-dot positions.
152+
153+
### Algorithm
154+
155+
```
156+
function quoteIdentifierAfterDot(ident):
157+
if ident is empty:
158+
return ident
159+
160+
safe = true
161+
162+
// Rule 1: First character must be lowercase letter or underscore
163+
if first_char not in [a-z_]:
164+
safe = false
165+
166+
// Rule 2: All characters must be in safe set
167+
for each char in ident:
168+
if char not in [a-z0-9_]:
169+
safe = false
170+
171+
// NOTE: No keyword check! Keywords are allowed after dots.
172+
173+
if safe:
174+
return ident // No quoting needed
175+
176+
// Build quoted identifier with escaped embedded quotes
177+
result = '"'
178+
for each char in ident:
179+
if char == '"':
180+
result += '"'
181+
result += char
182+
result += '"'
183+
184+
return result
185+
```
186+
187+
### Key Difference from Strict Quoting
188+
189+
The only difference is that `quoteIdentifierAfterDot()` does not check for keywords. It still quotes for:
190+
- Uppercase letters (case preservation)
191+
- Special characters (hyphens, spaces, etc.)
192+
- Leading digits
193+
- Embedded quotes
194+
195+
### Examples
196+
197+
| Input | quoteIdentifier() | quoteIdentifierAfterDot() |
198+
|-------|-------------------|---------------------------|
199+
| `mytable` | `mytable` | `mytable` |
200+
| `MyTable` | `"MyTable"` | `"MyTable"` |
201+
| `float` | `"float"` | `float` |
202+
| `select` | `"select"` | `select` |
203+
| `interval` | `"interval"` | `interval` |
204+
| `my-col` | `"my-col"` | `"my-col"` |
205+
206+
## Composition Helpers
207+
208+
### quoteDottedName(parts: string[])
209+
210+
This helper applies the appropriate quoting policy to each part of a dotted name:
211+
212+
- **First part**: Uses strict quoting (`quoteIdentifier()`) because the leading identifier often appears in less-permissive grammar slots
213+
- **Subsequent parts**: Uses relaxed quoting (`quoteIdentifierAfterDot()`) because they appear after dots in permissive slots
214+
215+
```typescript
216+
static quoteDottedName(parts: string[]): string {
217+
if (!parts || parts.length === 0) return '';
218+
if (parts.length === 1) {
219+
return QuoteUtils.quoteIdentifier(parts[0]);
220+
}
221+
return parts.map((part, index) =>
222+
index === 0
223+
? QuoteUtils.quoteIdentifier(part)
224+
: QuoteUtils.quoteIdentifierAfterDot(part)
225+
).join('.');
226+
}
227+
```
228+
229+
### Examples
230+
231+
| Input Parts | Output |
232+
|-------------|--------|
233+
| `['mytable']` | `mytable` |
234+
| `['myschema', 'mytable']` | `myschema.mytable` |
235+
| `['faker', 'float']` | `faker.float` |
236+
| `['pg_catalog', 'substring']` | `pg_catalog.substring` |
237+
| `['select', 'from']` | `"select".from` |
238+
| `['MySchema', 'MyTable']` | `"MySchema"."MyTable"` |
239+
240+
### quoteQualifiedIdentifier(qualifier, ident)
241+
242+
A convenience wrapper for two-part qualified names:
243+
244+
```typescript
245+
static quoteQualifiedIdentifier(
246+
qualifier: string | null | undefined,
247+
ident: string
248+
): string {
249+
if (qualifier) {
250+
return `${QuoteUtils.quoteIdentifier(qualifier)}.${QuoteUtils.quoteIdentifierAfterDot(ident)}`;
251+
}
252+
return QuoteUtils.quoteIdentifier(ident);
253+
}
254+
```
255+
256+
## Deparser Integration Rules
257+
258+
### The Problem with the String Visitor
259+
260+
The deparser's `String` visitor processes individual string nodes from the AST. It cannot reliably determine whether a string is:
261+
- A standalone identifier (needs strict quoting)
262+
- Part of a qualified name (first part needs strict, rest needs relaxed)
263+
- A string literal (needs single-quote escaping, not identifier quoting)
264+
265+
Therefore, **dotted-name quoting must be done at the call sites that have the list of parts**, not in the String visitor.
266+
267+
### Anti-Pattern: Don't Do This
268+
269+
```typescript
270+
// WRONG: This applies per-part quoting without slot context
271+
const name = funcname.map(n => this.visit(n, context)).join('.');
272+
```
273+
274+
This pattern sends each part through the String visitor, which uses strict quoting for all parts. The result is over-quoted output like `faker."float"`.
275+
276+
### Correct Pattern: Do This Instead
277+
278+
```typescript
279+
// CORRECT: Extract raw string parts and use quoteDottedName
280+
const funcnameParts = funcname.map((n: any) =>
281+
n.String?.sval || n.String?.str || ''
282+
).filter((s: string) => s);
283+
const name = QuoteUtils.quoteDottedName(funcnameParts);
284+
```
285+
286+
This extracts the raw string values and applies the correct quoting policy per position.
287+
288+
### Where to Apply quoteDottedName
289+
290+
The following AST node handlers should use `quoteDottedName()` for their name components:
291+
292+
| Handler | Field | Description |
293+
|---------|-------|-------------|
294+
| `FuncCall` | `funcname` | Function name (e.g., `pg_catalog.substring`) |
295+
| `CreateFunctionStmt` | `funcname` | Function being created |
296+
| `ColumnRef` | `fields` | Column reference (e.g., `t.column`) |
297+
| `RangeVar` | `catalogname`, `schemaname`, `relname` | Table reference |
298+
| `TypeName` | `names` | Type name (e.g., `pg_catalog.int4`) |
299+
| `CollateClause` | `collname` | Collation name |
300+
301+
### The quoteIfNeeded Helper
302+
303+
The `deparser.ts` file contains a `quoteIfNeeded()` method that routes to `QuoteUtils.quoteIdentifier()`. This is the strict quoting policy and should be used for standalone identifiers, not for dotted-name tails.
304+
305+
## String Comparison Pitfall
306+
307+
When the deparser needs to check for specific function names (e.g., to apply special SQL syntax for `pg_catalog.substring`), comparisons should be done carefully.
308+
309+
### Potential Issue
310+
311+
If you compare against the quoted output:
312+
313+
```typescript
314+
const name = QuoteUtils.quoteDottedName(funcnameParts);
315+
if (name === 'pg_catalog.substring') { // This works because neither part needs quoting
316+
// Special handling
317+
}
318+
```
319+
320+
This works for `pg_catalog.substring` because neither `pg_catalog` nor `substring` requires quoting. But it would fail for a hypothetical `pg_catalog.Select` because the output would be `pg_catalog."Select"`.
321+
322+
### Recommended Approach
323+
324+
For robustness, compare against the raw parts before quoting:
325+
326+
```typescript
327+
const funcnameParts = funcname.map((n: any) => n.String?.sval || '');
328+
if (funcnameParts.length === 2 &&
329+
funcnameParts[0] === 'pg_catalog' &&
330+
funcnameParts[1] === 'substring') {
331+
// Special handling
332+
}
333+
```
334+
335+
Or create a helper that normalizes for comparison:
336+
337+
```typescript
338+
const rawName = funcnameParts.join('.'); // Unquoted for comparison
339+
if (rawName === 'pg_catalog.substring') {
340+
// Special handling
341+
}
342+
```
343+
344+
## Keyword Table Accuracy
345+
346+
The correctness of the strict quoting algorithm depends on the accuracy of the keyword table in `kwlist.ts`. This table must match PostgreSQL's keyword classifications for the target PostgreSQL version.
347+
348+
If the keyword table diverges from upstream PostgreSQL:
349+
- Keywords missing from the table may be emitted unquoted when they should be quoted
350+
- Keywords with incorrect categories may be quoted unnecessarily or insufficiently
351+
352+
When updating to support new PostgreSQL versions, ensure `kwlist.ts` is synchronized with PostgreSQL's `kwlist.h`.
353+
354+
## Summary: Which Function to Use
355+
356+
| Scenario | Function | Example |
357+
|----------|----------|---------|
358+
| Standalone identifier | `quoteIdentifier()` | Column name in SELECT list |
359+
| Dotted name (multi-part) | `quoteDottedName()` | `schema.table`, `schema.function` |
360+
| Two-part qualified name | `quoteQualifiedIdentifier()` | `schema.table` |
361+
| After-dot component only | `quoteIdentifierAfterDot()` | Indirection field access |
362+
| String literal | `escape()` or `formatEString()` | String values in SQL |
363+
364+
## Test Fixtures
365+
366+
The quoting behavior is verified by test fixtures in `__fixtures__/kitchen-sink/pretty/`:
367+
368+
- `quoting-1.sql` through `quoting-7.sql`: Test cases for `faker.float`, `faker.interval`, `faker.boolean`, and `pg_catalog.substring`
369+
370+
The corresponding snapshots in `__tests__/pretty/__snapshots__/quoting-pretty.test.ts.snap` demonstrate the expected output with minimal quoting.
371+
372+
## References
373+
374+
- PostgreSQL `quote_identifier()`: [ruleutils.c](https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/ruleutils.c)
375+
- PostgreSQL keyword list: [kwlist.h](https://github.com/postgres/postgres/blob/master/src/include/parser/kwlist.h)
376+
- PostgreSQL documentation on identifiers: [SQL Syntax - Lexical Structure](https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS)

0 commit comments

Comments
 (0)