Newline characters cause rows in PostgreSQL table to be broken inadvertently. #31

@YPCrumble

This issue replaces #30.

The problem is that user-entered data containing any of these line-break characters:

  • \u2028 (LINE SEPARATOR)
  • \u2029 (PARAGRAPH SEPARATOR)
  • \x85 (NEXT LINE)

makes the dump treat a single row as if it were split across several lines. The dump then raises:

ValueError("Mismatch between column names and values.")
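
I have not confirmed how the dump output is split into lines, so the following is only a sketch of one plausible cause. It assumes the output is split with Python's `str.splitlines()`, which treats all three characters as line boundaries, whereas splitting on "\n" does not. The sample row is made up for illustration.

```python
# A single tab-separated row whose text column contains U+2028.
# (Hypothetical sample data, not taken from a real dump.)
row = "42\tfirst half\u2028second half"

# Splitting on "\n" only keeps the row intact.
newline_only = row.split("\n")

# str.splitlines() also breaks on U+2028, U+2029 and U+0085 (among others),
# so the same row comes back as two "lines".
with_splitlines = row.splitlines()

print(len(newline_only), len(with_splitlines))
```

If that assumption holds, the second "line" has fewer values than there are columns, which would match the mismatch error above.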

To work around it, I piped the `pg_dump` output through `sed` to replace these characters with spaces:

    process = subprocess.Popen(
        (
            "pg_dump",
            # Force output to be UTF-8 encoded.
            "--encoding=utf-8",
            # Quote all table and column names, just in case.
            "--quote-all-identifiers",
            # Luckily `pg_dump` supports DB URLs, so we can just pass it the
            # URL as argument to the command.
            "--dbname",
            url.geturl().replace('postgis://', 'postgresql://'),
        ) + tuple(extra_params),
        stdout=subprocess.PIPE,
    )

    # Remove newline characters.
    process = subprocess.Popen(
        "sed $'s/\u2028/ /g'",
        shell=True,
        stdin=process.stdout,
        stdout=subprocess.PIPE)
    process = subprocess.Popen(
        "sed $'s/\u2029/ /g'",
        shell=True,
        stdin=process.stdout,
        stdout=subprocess.PIPE)
    process = subprocess.Popen(
        "sed $'s/\x85/ /g'",
        shell=True,
        stdin=process.stdout,
        stdout=subprocess.PIPE)
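
A possible alternative, sketched here under assumptions, is to do the replacement in Python instead of spawning three `sed` processes. That would also avoid depending on `$'...'` quoting, which is a bash feature and may not be supported by the `/bin/sh` that `shell=True` uses. The names `iter_clean_lines` and `UNICODE_LINE_BREAKS` are hypothetical and not part of any existing code. The sketch assumes the stream is UTF-8 encoded, as requested by `--encoding=utf-8`.

```python
import io
import re

# The three characters from this issue: U+2028, U+2029 and U+0085.
UNICODE_LINE_BREAKS = re.compile("[\u2028\u2029\x85]")


def iter_clean_lines(binary_stream):
    """Yield decoded lines with Unicode line breaks replaced by spaces.

    `binary_stream` is any readable binary file object, for example the
    `stdout` pipe of a `pg_dump` subprocess.
    """
    # newline="\n" makes the wrapper end lines at "\n" only and disables
    # newline translation.
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")
    for line in text:
        yield UNICODE_LINE_BREAKS.sub(" ", line)


# Usage with an in-memory stand-in for the pg_dump output.
sample = io.BytesIO("1\ta\u2028b\n2\tc\x85d\n".encode("utf-8"))
cleaned = list(iter_clean_lines(sample))
```

With the `pg_dump` process above, this could be called as `iter_clean_lines(process.stdout)` in place of the `sed` pipeline, provided the consumer accepts an iterator of text lines.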

I'd be happy to submit this as a PR if it's helpful. Or is there a better way to handle the issue?
