Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions scripts/ensure-docs-slide-pdf.js
Original file line number Diff line number Diff line change
Expand Up @@ -112,7 +112,25 @@ async function readPdfBytes() {
throw new Error(`Failed to download slide deck PDF: ${response.status} ${response.statusText}`);
}

// Validate the Content-Type header before consuming the body to ensure we
// are actually receiving a PDF and not arbitrary data.
const contentType = response.headers.get("content-type") ?? "";

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The three new validation paths (Content-Type mismatch, oversized Content-Length, oversized post-download body) each throw distinct errors, but no regression tests cover them. The project has a .test.js pattern for scripts (changeset.test.js, generate-schema-docs.test.js) that a companion ensure-docs-slide-pdf.test.js could follow.

💡 Sketch of missing test cases

The core function would need to export readPdfBytes (or a testable inner function), then:

// Test: unexpected Content-Type falls back to placeholder, not crash
globalThis.fetch = async () => ({
  ok: true,
  headers: { get: (k) => k === 'content-type' ? 'text/html' : null },
  arrayBuffer: async () => new ArrayBuffer(0),
});
// assert: returns placeholder PDF bytes, console.warn called

// Test: Content-Length > 50 MB falls back to placeholder
globalThis.fetch = async () => ({
  ok: true,
  headers: { get: (k) => k === 'content-type' ? 'application/pdf' : k === 'content-length' ? String(100 * 1024 * 1024) : null },
  arrayBuffer: async () => new ArrayBuffer(0),
});
// assert: returns placeholder PDF bytes

These tests confirm that each guard correctly triggers the fallback rather than writing invalid data.

if (!contentType.startsWith("application/pdf") && !contentType.startsWith("application/octet-stream")) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] startsWith("application/pdf") will also match hypothetical types like application/pdf-exploit or application/pdfx. In practice this isn't exploitable since a real server wouldn't return such types, but parsing the MIME type properly (stripping parameters) is the more defensive and idiomatic approach.

💡 Suggested fix
const mimeType = (response.headers.get("content-type") ?? "").split(";")[0].trim();
if (mimeType !== "application/pdf" && mimeType !== "application/octet-stream") {
  throw new Error(`Unexpected content-type for slide deck: ${mimeType}`);
}

This correctly handles application/pdf; charset=binary (strips params) and prevents any unintended prefix matches.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

application/octet-stream allowance largely neutralises the Content-Type gate: any arbitrary binary file — executables, archives, scripts — arrives with application/octet-stream, so the check filters only responses with a non-binary MIME type (e.g., text/html). The isPdf() magic-bytes check on line 134 is therefore the only meaningful content-type guard. The comment above this block states the intent is to ensure we are receiving a PDF, which application/octet-stream does not satisfy.

💡 Notes

This may be intentional if media.githubusercontent.com consistently serves LFS objects as application/octet-stream. If so, document that explicitly with a comment; otherwise the combination of header validation + isPdf() check creates a false sense of defence-in-depth since the header gate adds no additional content-type assurance over what the magic-bytes check already provides.

throw new Error(`Unexpected content-type for slide deck: ${contentType}`);
}
Comment on lines +115 to +120

// Guard against unexpectedly large downloads.
const MAX_BYTES = 50 * 1024 * 1024; // 50 MB
const contentLength = response.headers.get("content-length");
if (contentLength !== null && Number(contentLength) > MAX_BYTES) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] Number(contentLength) returns NaN when the header value is non-numeric, and NaN > MAX_BYTES evaluates to false, silently bypassing this guard. A malformed or crafted Content-Length header would slip past without triggering an error or warning. The post-download byte-count check on line 130 still catches oversized payloads, so the risk is mitigated — but the Content-Length early-rejection is weakened.

💡 Suggested fix
const parsedLength = contentLength !== null ? Number(contentLength) : NaN;
if (Number.isFinite(parsedLength) && parsedLength > MAX_BYTES) {
  throw new Error(`Slide deck download size ${parsedLength} exceeds limit of ${MAX_BYTES} bytes`);
}

Using Number.isFinite() ensures non-numeric headers are treated as "unknown size" (safe fallback to the post-download check) rather than silently ignored as if the value were 0.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Content-Length guard silently skipped for malformed header values: Number("abc") returns NaN, and NaN > MAX_BYTES is false, so any non-numeric Content-Length bypasses the pre-download size check without throwing.

💡 Suggested fix

Replace with an explicit finite-number check:

const contentLengthNum = contentLength !== null ? Number(contentLength) : null;
if (contentLengthNum !== null && (!Number.isFinite(contentLengthNum) || contentLengthNum > MAX_BYTES)) {
  throw new Error(`Slide deck download size ${contentLength} exceeds limit of ${MAX_BYTES} bytes`);
}

This rejects both non-numeric headers and values exceeding the cap. Without this, a server returning Content-Length: chunked or any non-numeric token silently skips the guard — the post-download check on line 130 still fires, but only after the full body is already buffered in memory.

throw new Error(`Slide deck download size ${contentLength} exceeds limit of ${MAX_BYTES} bytes`);
}

const downloadedBytes = Buffer.from(await response.arrayBuffer());

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

50 MB guard fires only after the full body is already in memory: response.arrayBuffer() buffers the entire response synchronously before the size check on line 130 runs. A server that omits Content-Length (bypassing the header pre-check) or provides a malformed value can stream an unbounded body and exhaust the process's heap before the guard throws — potentially OOM-crashing the CI runner.

💡 Suggested fix

Stream the response and count bytes as chunks arrive:

const chunks = [];
let totalBytes = 0;
const reader = response.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  totalBytes += value.length;
  if (totalBytes > MAX_BYTES) {
    await reader.cancel();
    throw new Error(`Downloaded slide deck size exceeds limit of ${MAX_BYTES} bytes`);
  }
  chunks.push(value);
}
const downloadedBytes = Buffer.concat(chunks.map(c => Buffer.from(c)));

This caps memory usage at MAX_BYTES regardless of whether Content-Length is absent, lying, or malformed, which is the actual intent of the guard comment on line 122.

if (downloadedBytes.length > MAX_BYTES) {
throw new Error(`Downloaded slide deck size ${downloadedBytes.length} exceeds limit of ${MAX_BYTES} bytes`);
}
Comment on lines +122 to +132

if (!isPdf(downloadedBytes)) {
throw new Error(`Downloaded slide deck from ${url} is not a real PDF.`);
}
Expand Down
Loading