Skip to content

[code-scanning-fix] Fix js/http-to-file-access: validate Content-Type and size for LFS PDF download#41635

Merged
pelikhan merged 1 commit into
mainfrom
fix/alert-636-http-to-file-access-40d60f037fef5932
Jun 26, 2026
Merged

[code-scanning-fix] Fix js/http-to-file-access: validate Content-Type and size for LFS PDF download#41635
pelikhan merged 1 commit into
mainfrom
fix/alert-636-http-to-file-access-40d60f037fef5932

Conversation

@github-actions

@github-actions github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Generated by PR Description Updater for #41635 · 33.9 AIC · ⌖ 7.62 AIC · ⊞ 4.6K ·

Add response Content-Type validation and a 50 MB size limit to the
network-to-file path in ensure-docs-slide-pdf.js, addressing the
js/http-to-file-access CodeQL alert (#636).

Changes:
- Reject responses whose Content-Type is not application/pdf or
  application/octet-stream before reading the body.
- Enforce a 50 MB ceiling via both the Content-Length header (early
  rejection) and the actual downloaded byte count (double-check).
- The existing PDF magic-bytes check (isPdf) is preserved as a
  third layer of validation.
- All new checks throw errors that are caught by the surrounding
  try/catch block, so the script still falls back to the placeholder
  PDF rather than crashing in sandboxed environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Copy link
Copy Markdown
Contributor Author

Nice defence-in-depth fix for the LFS slide deck download path 🔒 — layering Content-Type validation, a pre-download Content-Length cap, and a post-read byte-count check on top of the existing magic-bytes guard is exactly the right approach for this kind of network-to-disk flow.

One thing that would round this out before merge:

  • Add tests — the three new guard branches (content-type rejection, content-length pre-rejection, post-read size check) aren’t exercised by any test file yet. Covering each failure path and the happy path would give reviewers confidence and protect against regressions.

If you’d like to generate that coverage, assign this prompt to your coding agent:

Add unit tests for the new validation logic in scripts/ensure-docs-slide-pdf.js.

The readPdfBytes function now has three guard conditions to cover:
1. Content-Type rejection — mock fetch returning `Content-Type: text/html`; should throw an error containing "Unexpected content-type" and must NOT write to disk.
2. Content-Length pre-rejection — mock response with `Content-Length` header set to a value > 52428800 (50 MB); should throw before the body is consumed.
3. Post-download size check — mock response whose arrayBuffer() resolves to a buffer larger than 50 MB; should throw after the body is read.
4. Happy path — valid `Content-Type: application/pdf`, Content-Length within limit, body that passes isPdf() magic-bytes check; should write the file normally.

Use the project’s existing test framework (Jest or equivalent) and mock the global fetch. Place tests in scripts/ensure-docs-slide-pdf.test.js.

Generated by ✅ Contribution Check · 217.8 AIC · ⌖ 19 AIC · ⊞ 6K ·

@pelikhan pelikhan marked this pull request as ready for review June 26, 2026 11:32
Copilot AI review requested due to automatic review settings June 26, 2026 11:32
@pelikhan pelikhan merged commit 2d9eb0e into main Jun 26, 2026
28 checks passed
@pelikhan pelikhan deleted the fix/alert-636-http-to-file-access-40d60f037fef5932 branch June 26, 2026 11:32
@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Test Quality Sentinel completed test quality analysis.

No test files were added or modified in this PR. The only changed file is scripts/ensure-docs-slide-pdf.js (a production script). Test Quality Sentinel skipped.

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR does not have the 'implementation' label and has ≤100 new lines of code in business logic directories (default_business_additions=0).

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

PR Code Quality Reviewer completed the code quality review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses CodeQL alert js/http-to-file-access in scripts/ensure-docs-slide-pdf.js by adding additional validation gates before writing a network-fetched slide-deck PDF to docs/public/….

Changes:

  • Added Content-Type validation for the LFS download response.
  • Added a max-size guard using Content-Length plus a post-download size check.
  • Preserved the existing PDF magic-bytes check (%PDF-) and placeholder fallback behavior.
Show a summary per file
File Description
scripts/ensure-docs-slide-pdf.js Adds header and size validations intended to reduce the risk of writing unexpected network data to disk when resolving the slide deck PDF.

Review details

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 1/1 changed files
  • Comments generated: 2
  • Review effort level: Low

Comment on lines +115 to +120
// Validate the Content-Type header before consuming the body to ensure we
// are actually receiving a PDF and not arbitrary data.
const contentType = response.headers.get("content-type") ?? "";
if (!contentType.startsWith("application/pdf") && !contentType.startsWith("application/octet-stream")) {
throw new Error(`Unexpected content-type for slide deck: ${contentType}`);
}
Comment on lines +122 to +132
// Guard against unexpectedly large downloads.
const MAX_BYTES = 50 * 1024 * 1024; // 50 MB
const contentLength = response.headers.get("content-length");
if (contentLength !== null && Number(contentLength) > MAX_BYTES) {
throw new Error(`Slide deck download size ${contentLength} exceeds limit of ${MAX_BYTES} bytes`);
}

const downloadedBytes = Buffer.from(await response.arrayBuffer());
if (downloadedBytes.length > MAX_BYTES) {
throw new Error(`Downloaded slide deck size ${downloadedBytes.length} exceeds limit of ${MAX_BYTES} bytes`);
}

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /diagnose and /tdd — commenting with improvements, no blocking issues.

📋 Key Themes & Highlights

Key Themes

  • Test coverage gap: Three new validation branches have no unit tests, despite other scripts in this repo following a .test.js pattern
  • NaN edge case: Number(contentLength) > MAX_BYTES silently bypasses the Content-Length guard when the header is non-numeric (mitigated by the post-download check)
  • Loose MIME matching: startsWith("application/pdf") is slightly imprecise; parsing split(";")[0].trim() is the idiomatic approach

Positive Highlights

  • ✅ Excellent defense-in-depth: three independent validation layers (header, Content-Length, post-download byte count, magic bytes)
  • ✅ All new error paths are caught by the existing try/catch, so failures still produce the placeholder PDF rather than a crash
  • ✅ Early Content-Length rejection avoids downloading the body before failing — good resource hygiene
  • ✅ PR description is thorough and self-documenting; the security rationale is clear

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 46.2 AIC · ⌖ 8.27 AIC · ⊞ 6.5K


// Validate the Content-Type header before consuming the body to ensure we
// are actually receiving a PDF and not arbitrary data.
const contentType = response.headers.get("content-type") ?? "";

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The three new validation paths (Content-Type mismatch, oversized Content-Length, oversized post-download body) each throw distinct errors, but no regression tests cover them. The project has a .test.js pattern for scripts (changeset.test.js, generate-schema-docs.test.js) that a companion ensure-docs-slide-pdf.test.js could follow.

💡 Sketch of missing test cases

The core function would need to export readPdfBytes (or a testable inner function), then:

// Test: unexpected Content-Type falls back to placeholder, not crash
globalThis.fetch = async () => ({
  ok: true,
  headers: { get: (k) => k === 'content-type' ? 'text/html' : null },
  arrayBuffer: async () => new ArrayBuffer(0),
});
// assert: returns placeholder PDF bytes, console.warn called

// Test: Content-Length > 50 MB falls back to placeholder
globalThis.fetch = async () => ({
  ok: true,
  headers: { get: (k) => k === 'content-type' ? 'application/pdf' : k === 'content-length' ? String(100 * 1024 * 1024) : null },
  arrayBuffer: async () => new ArrayBuffer(0),
});
// assert: returns placeholder PDF bytes

These tests confirm that each guard correctly triggers the fallback rather than writing invalid data.

// Guard against unexpectedly large downloads.
const MAX_BYTES = 50 * 1024 * 1024; // 50 MB
const contentLength = response.headers.get("content-length");
if (contentLength !== null && Number(contentLength) > MAX_BYTES) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] Number(contentLength) returns NaN when the header value is non-numeric, and NaN > MAX_BYTES evaluates to false, silently bypassing this guard. A malformed or crafted Content-Length header would slip past without triggering an error or warning. The post-download byte-count check on line 130 still catches oversized payloads, so the risk is mitigated — but the Content-Length early-rejection is weakened.

💡 Suggested fix
const parsedLength = contentLength !== null ? Number(contentLength) : NaN;
if (Number.isFinite(parsedLength) && parsedLength > MAX_BYTES) {
  throw new Error(`Slide deck download size ${parsedLength} exceeds limit of ${MAX_BYTES} bytes`);
}

Using Number.isFinite() ensures non-numeric headers are treated as "unknown size" (safe fallback to the post-download check) rather than silently ignored as if the value were 0.

// Validate the Content-Type header before consuming the body to ensure we
// are actually receiving a PDF and not arbitrary data.
const contentType = response.headers.get("content-type") ?? "";
if (!contentType.startsWith("application/pdf") && !contentType.startsWith("application/octet-stream")) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] startsWith("application/pdf") will also match hypothetical types like application/pdf-exploit or application/pdfx. In practice this isn't exploitable since a real server wouldn't return such types, but parsing the MIME type properly (stripping parameters) is the more defensive and idiomatic approach.

💡 Suggested fix
const mimeType = (response.headers.get("content-type") ?? "").split(";")[0].trim();
if (mimeType !== "application/pdf" && mimeType !== "application/octet-stream") {
  throw new Error(`Unexpected content-type for slide deck: ${mimeType}`);
}

This correctly handles application/pdf; charset=binary (strips params) and prevents any unintended prefix matches.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REQUEST_CHANGES — two correctness bugs in the new validation code

The security-fix direction is sound, but two of the three new guards have implementation defects that undermine the protection they are meant to provide.

🔍 Blocking issues
  1. NaN bypass in Content-Length pre-check (line 125)Number("abc") is NaN; NaN > MAX_BYTES is false, so any malformed Content-Length silently skips the early rejection. The post-download check still exists, but only after the full body is buffered.

  2. Full response body buffered before size check (line 129)response.arrayBuffer() reads the entire response into memory before the check on line 130 fires. A server omitting or lying about Content-Length can exhaust heap before the guard throws, potentially OOM-crashing the CI runner. The fix requires streaming the response with a running byte count.

i️ Non-blocking observations
  • application/octet-stream acceptance (line 118) reduces the Content-Type gate to filtering non-binary MIME types only; the magic-bytes check on line 134 is already doing the real validation work. If LFS media always returns application/octet-stream, document that so the layering isn't misleading.

🔎 Code quality review by PR Code Quality Reviewer · 70.7 AIC · ⌖ 6.89 AIC · ⊞ 5.2K

// Guard against unexpectedly large downloads.
const MAX_BYTES = 50 * 1024 * 1024; // 50 MB
const contentLength = response.headers.get("content-length");
if (contentLength !== null && Number(contentLength) > MAX_BYTES) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Content-Length guard silently skipped for malformed header values: Number("abc") returns NaN, and NaN > MAX_BYTES is false, so any non-numeric Content-Length bypasses the pre-download size check without throwing.

💡 Suggested fix

Replace with an explicit finite-number check:

const contentLengthNum = contentLength !== null ? Number(contentLength) : null;
if (contentLengthNum !== null && (!Number.isFinite(contentLengthNum) || contentLengthNum > MAX_BYTES)) {
  throw new Error(`Slide deck download size ${contentLength} exceeds limit of ${MAX_BYTES} bytes`);
}

This rejects both non-numeric headers and values exceeding the cap. Without this, a server returning Content-Length: chunked or any non-numeric token silently skips the guard — the post-download check on line 130 still fires, but only after the full body is already buffered in memory.

throw new Error(`Slide deck download size ${contentLength} exceeds limit of ${MAX_BYTES} bytes`);
}

const downloadedBytes = Buffer.from(await response.arrayBuffer());

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

50 MB guard fires only after the full body is already in memory: response.arrayBuffer() buffers the entire response synchronously before the size check on line 130 runs. A server that omits Content-Length (bypassing the header pre-check) or provides a malformed value can stream an unbounded body and exhaust the process's heap before the guard throws — potentially OOM-crashing the CI runner.

💡 Suggested fix

Stream the response and count bytes as chunks arrive:

const chunks = [];
let totalBytes = 0;
const reader = response.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  totalBytes += value.length;
  if (totalBytes > MAX_BYTES) {
    await reader.cancel();
    throw new Error(`Downloaded slide deck size exceeds limit of ${MAX_BYTES} bytes`);
  }
  chunks.push(value);
}
const downloadedBytes = Buffer.concat(chunks.map(c => Buffer.from(c)));

This caps memory usage at MAX_BYTES regardless of whether Content-Length is absent, lying, or malformed, which is the actual intent of the guard comment on line 122.

// Validate the Content-Type header before consuming the body to ensure we
// are actually receiving a PDF and not arbitrary data.
const contentType = response.headers.get("content-type") ?? "";
if (!contentType.startsWith("application/pdf") && !contentType.startsWith("application/octet-stream")) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

application/octet-stream allowance largely neutralises the Content-Type gate: any arbitrary binary file — executables, archives, scripts — arrives with application/octet-stream, so the check filters only responses with a non-binary MIME type (e.g., text/html). The isPdf() magic-bytes check on line 134 is therefore the only meaningful content-type guard. The comment above this block states the intent is to ensure we are receiving a PDF, which application/octet-stream does not satisfy.

💡 Notes

This may be intentional if media.githubusercontent.com consistently serves LFS objects as application/octet-stream. If so, document that explicitly with a comment; otherwise the combination of header validation + isPdf() check creates a false sense of defence-in-depth since the header gate adds no additional content-type assurance over what the magic-bytes check already provides.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants