|
| 1 | +# PostgreSQL Identifier Quoting Rules |
| 2 | + |
| 3 | +This document describes the identifier quoting strategy used by the pgsql-deparser package. It explains the underlying PostgreSQL rules, our implementation choices, and guidance for contributors. |
| 4 | + |
| 5 | +## Goals and Non-Goals |
| 6 | + |
| 7 | +### Goals |
| 8 | + |
| 9 | +The deparser aims to emit valid PostgreSQL SQL that can be re-parsed by PostgreSQL (and by libpg-query). Additionally, we aim to emit minimally-quoted SQL when safe, especially for qualified names, in order to reduce noisy diffs. For example, we prefer `faker.float` over `faker."float"` when the grammar allows it. |
| 10 | + |
| 11 | +### Non-Goals |
| 12 | + |
| 13 | +We do not preserve the user's original quoting style. If the input SQL used `"mycolumn"` but `mycolumn` is safe to emit unquoted, we may emit it unquoted. |
| 14 | + |
| 15 | +We do not preserve identifier case unless quoting is required. Unquoted identifiers are folded to lowercase by PostgreSQL, so `MyFunc` and `myfunc` are equivalent when unquoted. If case preservation is needed, the identifier must be quoted. |
| 16 | + |
| 17 | +We do not implement PostgreSQL's `quote_all_identifiers` mode (used by pg_dump for deterministic dumps). This could be added as a deparser option in the future if needed. |
| 18 | + |
| 19 | +## PostgreSQL Background: What Quoting Means |
| 20 | + |
| 21 | +### Identifier Folding |
| 22 | + |
| 23 | +PostgreSQL folds unquoted identifiers to lowercase. This means: |
| 24 | + |
| 25 | +- `SELECT * FROM MyTable` is equivalent to `SELECT * FROM mytable` |
| 26 | +- `SELECT * FROM "MyTable"` preserves the mixed case and refers to a table literally named `MyTable` |
| 27 | + |
| 28 | +If you need to preserve uppercase letters, spaces, or special characters in an identifier, you must quote it. |
| 29 | + |
| 30 | +### When Quotes Are Required |
| 31 | + |
| 32 | +An identifier must be quoted if any of the following are true: |
| 33 | + |
| 34 | +1. **First character is not a lowercase letter or underscore**: Identifiers starting with uppercase letters, digits, or special characters must be quoted. |
| 35 | + |
| 36 | +2. **Contains characters outside the safe set**: Only lowercase letters (`a-z`), digits (`0-9`), and underscores (`_`) are considered safe. Note that `$` is explicitly excluded even though PostgreSQL allows it in some contexts; our implementation is conservative. |
| 37 | + |
| 38 | +3. **Is a SQL keyword**: Keywords must be quoted to be used as identifiers, with one exception: `UNRESERVED_KEYWORD` tokens can be used as identifiers without quotes in most contexts. |
| 39 | + |
| 40 | +4. **Contains embedded double quotes**: These must be escaped by doubling them (`"` becomes `""`). |
| 41 | + |
| 42 | +### Keyword Categories in PostgreSQL |
| 43 | + |
| 44 | +PostgreSQL classifies keywords into four categories (defined in `kwlist.h` / our `kwlist.ts`): |
| 45 | + |
| 46 | +| Category | Description | Quoting Required? | |
| 47 | +|----------|-------------|-------------------| |
| 48 | +| `UNRESERVED_KEYWORD` | Can be used as identifiers in most contexts | No | |
| 49 | +| `COL_NAME_KEYWORD` | Reserved in some contexts, allowed as column names | Yes (in strict mode) | |
| 50 | +| `TYPE_FUNC_NAME_KEYWORD` | Reserved in some contexts, allowed as type/function names | Yes (in strict mode) | |
| 51 | +| `RESERVED_KEYWORD` | Fully reserved, cannot be used as identifiers without quotes | Yes | |
| 52 | + |
| 53 | +Examples: |
| 54 | +- `abort`, `absolute`, `access` are `UNRESERVED_KEYWORD` - no quoting needed |
| 55 | +- `float`, `interval`, `boolean` are `COL_NAME_KEYWORD` - quoting depends on context |
| 56 | +- `left`, `right`, `join` are `TYPE_FUNC_NAME_KEYWORD` - quoting depends on context |
| 57 | +- `select`, `from`, `where`, `table` are `RESERVED_KEYWORD` - always need quoting as identifiers |
| 58 | + |
| 59 | +## The Strict Quoting Algorithm |
| 60 | + |
| 61 | +The `QuoteUtils.quoteIdentifier()` function implements PostgreSQL's `quote_identifier()` algorithm from `ruleutils.c`. This is the "strict" or "canonical" quoting policy. |
| 62 | + |
| 63 | +### Algorithm |
| 64 | + |
| 65 | +``` |
| 66 | +function quoteIdentifier(ident): |
| 67 | + if ident is empty: |
| 68 | + return ident |
| 69 | + |
| 70 | + safe = true |
| 71 | + |
| 72 | + // Rule 1: First character must be lowercase letter or underscore |
| 73 | + if first_char not in [a-z_]: |
| 74 | + safe = false |
| 75 | + |
| 76 | + // Rule 2: All characters must be in safe set |
| 77 | + for each char in ident: |
| 78 | + if char not in [a-z0-9_]: |
| 79 | + safe = false |
| 80 | + |
| 81 | + // Rule 3: Must not be a keyword (except UNRESERVED_KEYWORD) |
| 82 | + if safe: |
| 83 | + kwKind = keywordKindOf(ident) |
| 84 | + if kwKind != NO_KEYWORD and kwKind != UNRESERVED_KEYWORD: |
| 85 | + safe = false |
| 86 | + |
| 87 | + if safe: |
| 88 | + return ident // No quoting needed |
| 89 | + |
| 90 | + // Build quoted identifier with escaped embedded quotes |
| 91 | + result = '"' |
| 92 | + for each char in ident: |
| 93 | + if char == '"': |
| 94 | + result += '"' // Escape " as "" |
| 95 | + result += char |
| 96 | + result += '"' |
| 97 | + |
| 98 | + return result |
| 99 | +``` |
| 100 | + |
| 101 | +### Examples |
| 102 | + |
| 103 | +| Input | Output | Reason | |
| 104 | +|-------|--------|--------| |
| 105 | +| `mytable` | `mytable` | All lowercase, not a keyword | |
| 106 | +| `my_table_2` | `my_table_2` | Safe characters only | |
| 107 | +| `MyTable` | `"MyTable"` | Contains uppercase | |
| 108 | +| `my-table` | `"my-table"` | Contains hyphen | |
| 109 | +| `my table` | `"my table"` | Contains space | |
| 110 | +| `2fast` | `"2fast"` | Starts with digit | |
| 111 | +| `float` | `"float"` | COL_NAME_KEYWORD | |
| 112 | +| `select` | `"select"` | RESERVED_KEYWORD | |
| 113 | +| `abort` | `abort` | UNRESERVED_KEYWORD | |
| 114 | +| `say"hello` | `"say""hello"` | Contains embedded quote | |
| 115 | + |
| 116 | +## Why a Second Policy Exists: Grammar-Slot Sensitivity |
| 117 | + |
| 118 | +PostgreSQL's `quote_identifier()` function is intentionally conservative and context-free. It doesn't know where the identifier will be used, so it quotes anything that might cause problems in any context. |
| 119 | + |
| 120 | +However, PostgreSQL's grammar is context-sensitive. Different syntactic positions accept different sets of tokens. The key insight is: |
| 121 | + |
| 122 | +**Identifiers that appear after a dot (in qualified names) are in highly permissive grammar positions that accept all keyword categories, including `RESERVED_KEYWORD`.** |
| 123 | + |
| 124 | +This means that while `float()` as a standalone function call would fail to parse (because `float` is a `COL_NAME_KEYWORD`), `faker.float()` parses successfully because the `float` appears after a dot. |
| 125 | + |
| 126 | +### Empirical Verification |
| 127 | + |
| 128 | +We verified this behavior by testing with libpg-query: |
| 129 | + |
| 130 | +| SQL | Parses? | Reason | |
| 131 | +|-----|---------|--------| |
| 132 | +| `SELECT float()` | No | `float` is COL_NAME_KEYWORD, conflicts with type name | |
| 133 | +| `SELECT faker.float()` | Yes | After dot, all keywords accepted | |
| 134 | +| `SELECT select FROM t` | No | `select` is RESERVED_KEYWORD | |
| 135 | +| `SELECT t.select FROM t` | Yes | After dot, all keywords accepted | |
| 136 | +| `SELECT * FROM interval` | No | `interval` is COL_NAME_KEYWORD | |
| 137 | +| `SELECT * FROM myschema.interval` | Yes | After dot, all keywords accepted | |
| 138 | + |
| 139 | +### The Deparser's Opportunity |
| 140 | + |
| 141 | +Since we know that identifiers after a dot are in permissive positions, we can emit them without keyword-based quoting. This produces cleaner output: |
| 142 | + |
| 143 | +- Instead of `faker."float"()` we emit `faker.float()` |
| 144 | +- Instead of `pg_catalog."substring"()` we emit `pg_catalog.substring()` |
| 145 | +- Instead of `t."select"` we emit `t.select` |
| 146 | + |
| 147 | +This is an intentional deviation from PostgreSQL's `quote_identifier()` behavior, designed to produce minimal quoting while still emitting valid SQL. |
| 148 | + |
| 149 | +## The Relaxed Quoting Algorithm (After-Dot) |
| 150 | + |
| 151 | +The `QuoteUtils.quoteIdentifierAfterDot()` function implements "lexical-only" quoting for identifiers in permissive after-dot positions. |
| 152 | + |
| 153 | +### Algorithm |
| 154 | + |
| 155 | +``` |
| 156 | +function quoteIdentifierAfterDot(ident): |
| 157 | + if ident is empty: |
| 158 | + return ident |
| 159 | + |
| 160 | + safe = true |
| 161 | + |
| 162 | + // Rule 1: First character must be lowercase letter or underscore |
| 163 | + if first_char not in [a-z_]: |
| 164 | + safe = false |
| 165 | + |
| 166 | + // Rule 2: All characters must be in safe set |
| 167 | + for each char in ident: |
| 168 | + if char not in [a-z0-9_]: |
| 169 | + safe = false |
| 170 | + |
| 171 | + // NOTE: No keyword check! Keywords are allowed after dots. |
| 172 | + |
| 173 | + if safe: |
| 174 | + return ident // No quoting needed |
| 175 | + |
| 176 | + // Build quoted identifier with escaped embedded quotes |
| 177 | + result = '"' |
| 178 | + for each char in ident: |
| 179 | + if char == '"': |
| 180 | + result += '"' |
| 181 | + result += char |
| 182 | + result += '"' |
| 183 | + |
| 184 | + return result |
| 185 | +``` |
| 186 | + |
| 187 | +### Key Difference from Strict Quoting |
| 188 | + |
| 189 | +The only difference is that `quoteIdentifierAfterDot()` does not check for keywords. It still quotes for: |
| 190 | +- Uppercase letters (case preservation) |
| 191 | +- Special characters (hyphens, spaces, etc.) |
| 192 | +- Leading digits |
| 193 | +- Embedded quotes |
| 194 | + |
| 195 | +### Examples |
| 196 | + |
| 197 | +| Input | quoteIdentifier() | quoteIdentifierAfterDot() | |
| 198 | +|-------|-------------------|---------------------------| |
| 199 | +| `mytable` | `mytable` | `mytable` | |
| 200 | +| `MyTable` | `"MyTable"` | `"MyTable"` | |
| 201 | +| `float` | `"float"` | `float` | |
| 202 | +| `select` | `"select"` | `select` | |
| 203 | +| `interval` | `"interval"` | `interval` | |
| 204 | +| `my-col` | `"my-col"` | `"my-col"` | |
| 205 | + |
| 206 | +## Composition Helpers |
| 207 | + |
| 208 | +### quoteDottedName(parts: string[]) |
| 209 | + |
| 210 | +This helper applies the appropriate quoting policy to each part of a dotted name: |
| 211 | + |
| 212 | +- **First part**: Uses strict quoting (`quoteIdentifier()`) because the leading identifier often appears in less-permissive grammar slots |
| 213 | +- **Subsequent parts**: Uses relaxed quoting (`quoteIdentifierAfterDot()`) because they appear after dots in permissive slots |
| 214 | + |
| 215 | +```typescript |
| 216 | +static quoteDottedName(parts: string[]): string { |
| 217 | + if (!parts || parts.length === 0) return ''; |
| 218 | + if (parts.length === 1) { |
| 219 | + return QuoteUtils.quoteIdentifier(parts[0]); |
| 220 | + } |
| 221 | + return parts.map((part, index) => |
| 222 | + index === 0 |
| 223 | + ? QuoteUtils.quoteIdentifier(part) |
| 224 | + : QuoteUtils.quoteIdentifierAfterDot(part) |
| 225 | + ).join('.'); |
| 226 | +} |
| 227 | +``` |
| 228 | + |
| 229 | +### Examples |
| 230 | + |
| 231 | +| Input Parts | Output | |
| 232 | +|-------------|--------| |
| 233 | +| `['mytable']` | `mytable` | |
| 234 | +| `['myschema', 'mytable']` | `myschema.mytable` | |
| 235 | +| `['faker', 'float']` | `faker.float` | |
| 236 | +| `['pg_catalog', 'substring']` | `pg_catalog.substring` | |
| 237 | +| `['select', 'from']` | `"select".from` | |
| 238 | +| `['MySchema', 'MyTable']` | `"MySchema"."MyTable"` | |
| 239 | + |
| 240 | +### quoteQualifiedIdentifier(qualifier, ident) |
| 241 | + |
| 242 | +A convenience wrapper for two-part qualified names: |
| 243 | + |
| 244 | +```typescript |
| 245 | +static quoteQualifiedIdentifier( |
| 246 | + qualifier: string | null | undefined, |
| 247 | + ident: string |
| 248 | +): string { |
| 249 | + if (qualifier) { |
| 250 | + return `${QuoteUtils.quoteIdentifier(qualifier)}.${QuoteUtils.quoteIdentifierAfterDot(ident)}`; |
| 251 | + } |
| 252 | + return QuoteUtils.quoteIdentifier(ident); |
| 253 | +} |
| 254 | +``` |
| 255 | + |
| 256 | +## Deparser Integration Rules |
| 257 | + |
| 258 | +### The Problem with the String Visitor |
| 259 | + |
| 260 | +The deparser's `String` visitor processes individual string nodes from the AST. It cannot reliably determine whether a string is: |
| 261 | +- A standalone identifier (needs strict quoting) |
| 262 | +- Part of a qualified name (first part needs strict, rest needs relaxed) |
| 263 | +- A string literal (needs single-quote escaping, not identifier quoting) |
| 264 | + |
| 265 | +Therefore, **dotted-name quoting must be done at the call sites that have the list of parts**, not in the String visitor. |
| 266 | + |
| 267 | +### Anti-Pattern: Don't Do This |
| 268 | + |
| 269 | +```typescript |
| 270 | +// WRONG: This applies per-part quoting without slot context |
| 271 | +const name = funcname.map(n => this.visit(n, context)).join('.'); |
| 272 | +``` |
| 273 | + |
| 274 | +This pattern sends each part through the String visitor, which uses strict quoting for all parts. The result is over-quoted output like `faker."float"`. |
| 275 | + |
| 276 | +### Correct Pattern: Do This Instead |
| 277 | + |
| 278 | +```typescript |
| 279 | +// CORRECT: Extract raw string parts and use quoteDottedName |
| 280 | +const funcnameParts = funcname.map((n: any) => |
| 281 | + n.String?.sval || n.String?.str || '' |
| 282 | +).filter((s: string) => s); |
| 283 | +const name = QuoteUtils.quoteDottedName(funcnameParts); |
| 284 | +``` |
| 285 | + |
| 286 | +This extracts the raw string values and applies the correct quoting policy per position. |
| 287 | + |
| 288 | +### Where to Apply quoteDottedName |
| 289 | + |
| 290 | +The following AST node handlers should use `quoteDottedName()` for their name components: |
| 291 | + |
| 292 | +| Handler | Field | Description | |
| 293 | +|---------|-------|-------------| |
| 294 | +| `FuncCall` | `funcname` | Function name (e.g., `pg_catalog.substring`) | |
| 295 | +| `CreateFunctionStmt` | `funcname` | Function being created | |
| 296 | +| `ColumnRef` | `fields` | Column reference (e.g., `t.column`) | |
| 297 | +| `RangeVar` | `catalogname`, `schemaname`, `relname` | Table reference | |
| 298 | +| `TypeName` | `names` | Type name (e.g., `pg_catalog.int4`) | |
| 299 | +| `CollateClause` | `collname` | Collation name | |
| 300 | + |
| 301 | +### The quoteIfNeeded Helper |
| 302 | + |
| 303 | +The `deparser.ts` file contains a `quoteIfNeeded()` method that routes to `QuoteUtils.quoteIdentifier()`. This is the strict quoting policy and should be used for standalone identifiers, not for dotted-name tails. |
| 304 | + |
| 305 | +## String Comparison Pitfall |
| 306 | + |
| 307 | +When the deparser needs to check for specific function names (e.g., to apply special SQL syntax for `pg_catalog.substring`), comparisons should be done carefully. |
| 308 | + |
| 309 | +### Potential Issue |
| 310 | + |
| 311 | +If you compare against the quoted output: |
| 312 | + |
| 313 | +```typescript |
| 314 | +const name = QuoteUtils.quoteDottedName(funcnameParts); |
| 315 | +if (name === 'pg_catalog.substring') { // This works because neither part needs quoting |
| 316 | + // Special handling |
| 317 | +} |
| 318 | +``` |
| 319 | + |
| 320 | +This works for `pg_catalog.substring` because neither `pg_catalog` nor `substring` requires quoting. But it would fail for a hypothetical `pg_catalog.Select` because the output would be `pg_catalog."Select"`. |
| 321 | + |
| 322 | +### Recommended Approach |
| 323 | + |
| 324 | +For robustness, compare against the raw parts before quoting: |
| 325 | + |
| 326 | +```typescript |
| 327 | +const funcnameParts = funcname.map((n: any) => n.String?.sval || ''); |
| 328 | +if (funcnameParts.length === 2 && |
| 329 | + funcnameParts[0] === 'pg_catalog' && |
| 330 | + funcnameParts[1] === 'substring') { |
| 331 | + // Special handling |
| 332 | +} |
| 333 | +``` |
| 334 | + |
| 335 | +Or create a helper that normalizes for comparison: |
| 336 | + |
| 337 | +```typescript |
| 338 | +const rawName = funcnameParts.join('.'); // Unquoted for comparison |
| 339 | +if (rawName === 'pg_catalog.substring') { |
| 340 | + // Special handling |
| 341 | +} |
| 342 | +``` |
| 343 | + |
| 344 | +## Keyword Table Accuracy |
| 345 | + |
| 346 | +The correctness of the strict quoting algorithm depends on the accuracy of the keyword table in `kwlist.ts`. This table must match PostgreSQL's keyword classifications for the target PostgreSQL version. |
| 347 | + |
| 348 | +If the keyword table diverges from upstream PostgreSQL: |
| 349 | +- Keywords missing from the table may be emitted unquoted when they should be quoted |
| 350 | +- Keywords with incorrect categories may be quoted unnecessarily or insufficiently |
| 351 | + |
| 352 | +When updating to support new PostgreSQL versions, ensure `kwlist.ts` is synchronized with PostgreSQL's `kwlist.h`. |
| 353 | + |
| 354 | +## Summary: Which Function to Use |
| 355 | + |
| 356 | +| Scenario | Function | Example | |
| 357 | +|----------|----------|---------| |
| 358 | +| Standalone identifier | `quoteIdentifier()` | Column name in SELECT list | |
| 359 | +| Dotted name (multi-part) | `quoteDottedName()` | `schema.table`, `schema.function` | |
| 360 | +| Two-part qualified name | `quoteQualifiedIdentifier()` | `schema.table` | |
| 361 | +| After-dot component only | `quoteIdentifierAfterDot()` | Indirection field access | |
| 362 | +| String literal | `escape()` or `formatEString()` | String values in SQL | |
| 363 | + |
| 364 | +## Test Fixtures |
| 365 | + |
| 366 | +The quoting behavior is verified by test fixtures in `__fixtures__/kitchen-sink/pretty/`: |
| 367 | + |
| 368 | +- `quoting-1.sql` through `quoting-7.sql`: Test cases for `faker.float`, `faker.interval`, `faker.boolean`, and `pg_catalog.substring` |
| 369 | + |
| 370 | +The corresponding snapshots in `__tests__/pretty/__snapshots__/quoting-pretty.test.ts.snap` demonstrate the expected output with minimal quoting. |
| 371 | + |
| 372 | +## References |
| 373 | + |
| 374 | +- PostgreSQL `quote_identifier()`: [ruleutils.c](https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/ruleutils.c) |
| 375 | +- PostgreSQL keyword list: [kwlist.h](https://github.com/postgres/postgres/blob/master/src/include/parser/kwlist.h) |
| 376 | +- PostgreSQL documentation on identifiers: [SQL Syntax - Lexical Structure](https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS) |
0 commit comments