[code-scanning-fix] Fix js/http-to-file-access: validate Content-Type and size for LFS PDF download#41635
Conversation
Add response Content-Type validation and a 50 MB size limit to the network-to-file path in ensure-docs-slide-pdf.js, addressing the js/http-to-file-access CodeQL alert (#636). Changes: - Reject responses whose Content-Type is not application/pdf or application/octet-stream before reading the body. - Enforce a 50 MB ceiling via both the Content-Length header (early rejection) and the actual downloaded byte count (double-check). - The existing PDF magic-bytes check (isPdf) is preserved as a third layer of validation. - All new checks throw errors that are caught by the surrounding try/catch block, so the script still falls back to the placeholder PDF rather than crashing in sandboxed environments. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Nice defence-in-depth fix for the LFS slide deck download path 🔒 — layering Content-Type validation, a pre-download Content-Length cap, and a post-read byte-count check on top of the existing magic-bytes guard is exactly the right approach for this kind of network-to-disk flow. One thing that would round this out before merge:
If you’d like to generate that coverage, assign this prompt to your coding agent:
|
|
✅ Test Quality Sentinel completed test quality analysis. No test files were added or modified in this PR. The only changed file is scripts/ensure-docs-slide-pdf.js (a production script). Test Quality Sentinel skipped. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR does not have the 'implementation' label and has ≤100 new lines of code in business logic directories (default_business_additions=0). |
|
✅ PR Code Quality Reviewer completed the code quality review. |
There was a problem hiding this comment.
Pull request overview
This PR addresses CodeQL alert js/http-to-file-access in scripts/ensure-docs-slide-pdf.js by adding additional validation gates before writing a network-fetched slide-deck PDF to docs/public/….
Changes:
- Added
Content-Typevalidation for the LFS download response. - Added a max-size guard using
Content-Lengthplus a post-download size check. - Preserved the existing PDF magic-bytes check (
%PDF-) and placeholder fallback behavior.
Show a summary per file
| File | Description |
|---|---|
scripts/ensure-docs-slide-pdf.js |
Adds header and size validations intended to reduce the risk of writing unexpected network data to disk when resolving the slide deck PDF. |
Review details
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 1/1 changed files
- Comments generated: 2
- Review effort level: Low
| // Validate the Content-Type header before consuming the body to ensure we | ||
| // are actually receiving a PDF and not arbitrary data. | ||
| const contentType = response.headers.get("content-type") ?? ""; | ||
| if (!contentType.startsWith("application/pdf") && !contentType.startsWith("application/octet-stream")) { | ||
| throw new Error(`Unexpected content-type for slide deck: ${contentType}`); | ||
| } |
| // Guard against unexpectedly large downloads. | ||
| const MAX_BYTES = 50 * 1024 * 1024; // 50 MB | ||
| const contentLength = response.headers.get("content-length"); | ||
| if (contentLength !== null && Number(contentLength) > MAX_BYTES) { | ||
| throw new Error(`Slide deck download size ${contentLength} exceeds limit of ${MAX_BYTES} bytes`); | ||
| } | ||
|
|
||
| const downloadedBytes = Buffer.from(await response.arrayBuffer()); | ||
| if (downloadedBytes.length > MAX_BYTES) { | ||
| throw new Error(`Downloaded slide deck size ${downloadedBytes.length} exceeds limit of ${MAX_BYTES} bytes`); | ||
| } |
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnose and /tdd — commenting with improvements, no blocking issues.
📋 Key Themes & Highlights
Key Themes
- Test coverage gap: Three new validation branches have no unit tests, despite other scripts in this repo following a
.test.jspattern - NaN edge case:
Number(contentLength) > MAX_BYTESsilently bypasses the Content-Length guard when the header is non-numeric (mitigated by the post-download check) - Loose MIME matching:
startsWith("application/pdf")is slightly imprecise; parsingsplit(";")[0].trim()is the idiomatic approach
Positive Highlights
- ✅ Excellent defense-in-depth: three independent validation layers (header, Content-Length, post-download byte count, magic bytes)
- ✅ All new error paths are caught by the existing
try/catch, so failures still produce the placeholder PDF rather than a crash - ✅ Early Content-Length rejection avoids downloading the body before failing — good resource hygiene
- ✅ PR description is thorough and self-documenting; the security rationale is clear
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 46.2 AIC · ⌖ 8.27 AIC · ⊞ 6.5K
|
|
||
| // Validate the Content-Type header before consuming the body to ensure we | ||
| // are actually receiving a PDF and not arbitrary data. | ||
| const contentType = response.headers.get("content-type") ?? ""; |
There was a problem hiding this comment.
[/tdd] The three new validation paths (Content-Type mismatch, oversized Content-Length, oversized post-download body) each throw distinct errors, but no regression tests cover them. The project has a .test.js pattern for scripts (changeset.test.js, generate-schema-docs.test.js) that a companion ensure-docs-slide-pdf.test.js could follow.
💡 Sketch of missing test cases
The core function would need to export readPdfBytes (or a testable inner function), then:
// Test: unexpected Content-Type falls back to placeholder, not crash
globalThis.fetch = async () => ({
ok: true,
headers: { get: (k) => k === 'content-type' ? 'text/html' : null },
arrayBuffer: async () => new ArrayBuffer(0),
});
// assert: returns placeholder PDF bytes, console.warn called
// Test: Content-Length > 50 MB falls back to placeholder
globalThis.fetch = async () => ({
ok: true,
headers: { get: (k) => k === 'content-type' ? 'application/pdf' : k === 'content-length' ? String(100 * 1024 * 1024) : null },
arrayBuffer: async () => new ArrayBuffer(0),
});
// assert: returns placeholder PDF bytesThese tests confirm that each guard correctly triggers the fallback rather than writing invalid data.
| // Guard against unexpectedly large downloads. | ||
| const MAX_BYTES = 50 * 1024 * 1024; // 50 MB | ||
| const contentLength = response.headers.get("content-length"); | ||
| if (contentLength !== null && Number(contentLength) > MAX_BYTES) { |
There was a problem hiding this comment.
[/diagnose] Number(contentLength) returns NaN when the header value is non-numeric, and NaN > MAX_BYTES evaluates to false, silently bypassing this guard. A malformed or crafted Content-Length header would slip past without triggering an error or warning. The post-download byte-count check on line 130 still catches oversized payloads, so the risk is mitigated — but the Content-Length early-rejection is weakened.
💡 Suggested fix
const parsedLength = contentLength !== null ? Number(contentLength) : NaN;
if (Number.isFinite(parsedLength) && parsedLength > MAX_BYTES) {
throw new Error(`Slide deck download size ${parsedLength} exceeds limit of ${MAX_BYTES} bytes`);
}Using Number.isFinite() ensures non-numeric headers are treated as "unknown size" (safe fallback to the post-download check) rather than silently ignored as if the value were 0.
| // Validate the Content-Type header before consuming the body to ensure we | ||
| // are actually receiving a PDF and not arbitrary data. | ||
| const contentType = response.headers.get("content-type") ?? ""; | ||
| if (!contentType.startsWith("application/pdf") && !contentType.startsWith("application/octet-stream")) { |
There was a problem hiding this comment.
[/diagnose] startsWith("application/pdf") will also match hypothetical types like application/pdf-exploit or application/pdfx. In practice this isn't exploitable since a real server wouldn't return such types, but parsing the MIME type properly (stripping parameters) is the more defensive and idiomatic approach.
💡 Suggested fix
const mimeType = (response.headers.get("content-type") ?? "").split(";")[0].trim();
if (mimeType !== "application/pdf" && mimeType !== "application/octet-stream") {
throw new Error(`Unexpected content-type for slide deck: ${mimeType}`);
}This correctly handles application/pdf; charset=binary (strips params) and prevents any unintended prefix matches.
There was a problem hiding this comment.
REQUEST_CHANGES — two correctness bugs in the new validation code
The security-fix direction is sound, but two of the three new guards have implementation defects that undermine the protection they are meant to provide.
🔍 Blocking issues
-
NaN bypass in Content-Length pre-check (line 125) —
Number("abc")isNaN;NaN > MAX_BYTESisfalse, so any malformedContent-Lengthsilently skips the early rejection. The post-download check still exists, but only after the full body is buffered. -
Full response body buffered before size check (line 129) —
response.arrayBuffer()reads the entire response into memory before the check on line 130 fires. A server omitting or lying aboutContent-Lengthcan exhaust heap before the guard throws, potentially OOM-crashing the CI runner. The fix requires streaming the response with a running byte count.
i️ Non-blocking observations
application/octet-streamacceptance (line 118) reduces the Content-Type gate to filtering non-binary MIME types only; the magic-bytes check on line 134 is already doing the real validation work. If LFS media always returnsapplication/octet-stream, document that so the layering isn't misleading.
🔎 Code quality review by PR Code Quality Reviewer · 70.7 AIC · ⌖ 6.89 AIC · ⊞ 5.2K
| // Guard against unexpectedly large downloads. | ||
| const MAX_BYTES = 50 * 1024 * 1024; // 50 MB | ||
| const contentLength = response.headers.get("content-length"); | ||
| if (contentLength !== null && Number(contentLength) > MAX_BYTES) { |
There was a problem hiding this comment.
Content-Length guard silently skipped for malformed header values: Number("abc") returns NaN, and NaN > MAX_BYTES is false, so any non-numeric Content-Length bypasses the pre-download size check without throwing.
💡 Suggested fix
Replace with an explicit finite-number check:
const contentLengthNum = contentLength !== null ? Number(contentLength) : null;
if (contentLengthNum !== null && (!Number.isFinite(contentLengthNum) || contentLengthNum > MAX_BYTES)) {
throw new Error(`Slide deck download size ${contentLength} exceeds limit of ${MAX_BYTES} bytes`);
}This rejects both non-numeric headers and values exceeding the cap. Without this, a server returning Content-Length: chunked or any non-numeric token silently skips the guard — the post-download check on line 130 still fires, but only after the full body is already buffered in memory.
| throw new Error(`Slide deck download size ${contentLength} exceeds limit of ${MAX_BYTES} bytes`); | ||
| } | ||
|
|
||
| const downloadedBytes = Buffer.from(await response.arrayBuffer()); |
There was a problem hiding this comment.
50 MB guard fires only after the full body is already in memory: response.arrayBuffer() buffers the entire response synchronously before the size check on line 130 runs. A server that omits Content-Length (bypassing the header pre-check) or provides a malformed value can stream an unbounded body and exhaust the process's heap before the guard throws — potentially OOM-crashing the CI runner.
💡 Suggested fix
Stream the response and count bytes as chunks arrive:
const chunks = [];
let totalBytes = 0;
const reader = response.body.getReader();
while (true) {
const { done, value } = await reader.read();
if (done) break;
totalBytes += value.length;
if (totalBytes > MAX_BYTES) {
await reader.cancel();
throw new Error(`Downloaded slide deck size exceeds limit of ${MAX_BYTES} bytes`);
}
chunks.push(value);
}
const downloadedBytes = Buffer.concat(chunks.map(c => Buffer.from(c)));This caps memory usage at MAX_BYTES regardless of whether Content-Length is absent, lying, or malformed, which is the actual intent of the guard comment on line 122.
| // Validate the Content-Type header before consuming the body to ensure we | ||
| // are actually receiving a PDF and not arbitrary data. | ||
| const contentType = response.headers.get("content-type") ?? ""; | ||
| if (!contentType.startsWith("application/pdf") && !contentType.startsWith("application/octet-stream")) { |
There was a problem hiding this comment.
application/octet-stream allowance largely neutralises the Content-Type gate: any arbitrary binary file — executables, archives, scripts — arrives with application/octet-stream, so the check filters only responses with a non-binary MIME type (e.g., text/html). The isPdf() magic-bytes check on line 134 is therefore the only meaningful content-type guard. The comment above this block states the intent is to ensure we are receiving a PDF, which application/octet-stream does not satisfy.
💡 Notes
This may be intentional if media.githubusercontent.com consistently serves LFS objects as application/octet-stream. If so, document that explicitly with a comment; otherwise the combination of header validation + isPdf() check creates a false sense of defence-in-depth since the header gate adds no additional content-type assurance over what the magic-bytes check already provides.
Uh oh!
There was an error while loading. Please reload this page.