feat(everything): POC claude - Cross organism queries: One SILO approach#6723
feat(everything): POC claude - Cross organism queries: One SILO approach#6723anna-parker wants to merge 33 commits into
Conversation
|
Claude finished @anna-parker's task in 4m 54s —— View job Code Review
This is a significant architectural change — consolidating per-organism LAPIS/SILO instances into one unified instance. The overall approach is sound and the key mechanics (organism-prefixed segment names, organism field injection, ETag scoping) are well thought through. Below are issues worth addressing before merge. Breaking change: callers of
|
| response.body().use { writeResponse(it) } | ||
| } | ||
|
|
||
| fun injectOrganism(body: Map<String, Any?>, organism: String): Map<String, Any?> = body + ("organism" to organism) |
There was a problem hiding this comment.
injectOrganism is defined here but never called anywhere in the codebase. If it's intended for future use, that's fine to keep, but it could also be removed.
| .POST(HttpRequest.BodyPublishers.ofString(requestBody)) | ||
| .build() | ||
|
|
||
| val response = httpClient.send(request, HttpResponse.BodyHandlers.ofInputStream()) |
There was a problem hiding this comment.
The HTTP status code from LAPIS is never forwarded to the caller. If LAPIS returns a 400 or 500, the client receives HTTP 200 with the error body. Consider forwarding the status:
response.statusCode().let { response.setStatus(it) } // or equivalent Spring APIThis also applies to proxyGet.
| ) { | ||
| val queryParams = mutableMapOf("organism" to organism) | ||
| request.parameterMap.forEach { (key, values) -> | ||
| if (key != "organism" && values.isNotEmpty()) queryParams[key] = values[0] |
There was a problem hiding this comment.
Only the first value of each multi-value query parameter is forwarded. A download URL with accession=A&accession=B would silently drop accession=B. Since this is the GET path for sequence downloads, this is likely to be hit in practice.
Consider collecting all values:
request.parameterMap.forEach { (key, values) ->
if (key != "organism") values.forEach { v -> queryParams.merge(key, v) { a, b -> "$a,$b" } }
}Or switch queryParams to a MultiValueMap / build the query string directly from request.queryString.
| return if (refGenome.nucleotideSequences.size == 1) { | ||
| organism.name | ||
| } else { | ||
| throw IllegalArgumentException( |
There was a problem hiding this comment.
IllegalArgumentException is not mapped to a 4xx response by Spring — it will produce a 500. Since this is a client error (wrong endpoint for a multi-segment organism), use ResponseStatusException(HttpStatus.BAD_REQUEST, ...) instead.
| // scoping to a single organism. All-organism queries omit it. | ||
|
|
||
| @PostMapping("/{endpoint}", consumes = ["application/json"]) | ||
| fun proxyPost( |
There was a problem hiding this comment.
The catch-all /{endpoint} POST handler proxies any path segment to LAPIS with no allowlist. An external caller can reach LAPIS endpoints like /query/info, /query/swagger-ui/..., or future LAPIS admin routes. Consider limiting endpoint to a known set of safe paths (e.g., details, aggregated, nucleotideMutations, etc.) and returning 404 for anything else.
|
|
||
| public static createForOrganism(organism: string) { | ||
| return this.create(getLapisUrl(getRuntimeConfig().serverSide, organism), getSchema(organism)); | ||
| const client = this.create(getLapisUrl(getRuntimeConfig().serverSide), getSchema(organism)); |
There was a problem hiding this comment.
createForOrganism correctly injects the organism into every POST body for the new unified SILO setup. Note that streamSequences (line 235) and getDetails (line 244) have hardcoded /sample/ path prefixes that are incompatible with the new proxy URL (backendUrl/query). They resolve to backendUrl/query/sample/... which has no matching route. Both appear unused right now, but should be updated to remove the /sample prefix before they are called.
|
|
||
| function withSample<Path extends `/${string}`>(path: Path) { | ||
| return `/sample${path}` as const; | ||
| return path; |
There was a problem hiding this comment.
withSample is now a no-op (returns path unchanged). The function can be removed entirely; every call site can just pass the path literal directly. Leaving it in place is mildly misleading — a future reader might assume it does something.
| setIsLoading(false); | ||
| } | ||
| }, | ||
| [queryUrl, JSON.stringify(searchParams)], |
There was a problem hiding this comment.
JSON.stringify(searchParams) as a useCallback dependency is fine for correctness here (since hiddenFieldValues is a prop) but is a subtle serialisation-based dependency — any value change rebuilds a new string, triggering a new callback reference. This is unlikely to matter in practice, but consider listing the individual hiddenFieldValues keys, or memoising searchParams with useMemo.
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| {rows.map((row, i) => ( |
There was a problem hiding this comment.
Using array index as key in a table row (key={i}) will cause React reconciliation issues if rows are reordered between pages. Use a stable identifier like accessionVersion instead:
{rows.map((row) => (
<tr key={String(row['accessionVersion'])} ...>| Merges metadata from all organisms; deduplication by name (first occurrence wins, | ||
| caller must ensure no type conflicts for same-named fields across organisms). | ||
| Adds an "organism" field as the first metadata entry. | ||
| Sequence names are organism-prefixed (handled by get-released-data / reference genomes). |
There was a problem hiding this comment.
The comment notes "caller must ensure no type conflicts for same-named fields across organisms" — but there is no validation or error at any layer that enforces this. A metadata field named date typed as string in one organism and date in another would silently use the first-seen type in SILO, potentially causing query failures or data corruption. Consider adding a Helm fail assertion or a backend startup check that validates this constraint.
…this in the importer script as having basically empty null fields get streamed in an uncompressed is terrible perf
a21de6b to
66371ae
Compare
| } | ||
| } | ||
|
|
||
| @Transactional(readOnly = true) |
There was a problem hiding this comment.
this should be deleted now, this is a sloppy refactor
| @@ -0,0 +1,97 @@ | |||
| package org.loculus.backend.controller | |||
There was a problem hiding this comment.
Im not sure how efficient SILO is in loading all data in, in theory we could already sort data in the get-released-data endpoint by organism and reference to improve SILO's binary encoding
6d7a925 to
bda2d05
Compare
🤖 Generated with Claude Code (https://claude.com/claude-code)
This PR implements the "1 SILO" architecture: instead of one LAPIS/SILO instance per organism, all organisms share a single unified LAPIS/SILO instance. All LAPIS queries are routed through the backend, which acts as a proxy. This enables a new cross-organism search page.
Summary of Changes
Metadata structure changes
(Note this accounts for the most LOC:
+2,111 -1,627in the values.yaml and approx 1000 LOC in the backend config test files)sharedMetadatatop-level key in values.yaml for metadata fields shared across all organisms (collection date, geo-location, display name, NCBI release date,etc.) and an
organismSpecificMetadatafield for organism specific fields (segment specific fields like INSDC accession, nextclade metadata){organism}_{segment}-{reference}or{organism}_{gene}-{reference}New backend endpoints
directly to LAPIS. GET sequence-download endpoints (
/query/unalignedNucleotideSequences,/query/alignedNucleotideSequences,/query/alignedAminoAcidSequences) takeorganism,segment, andreferenceas query params and remap them to the correct unified segment/gene name (e.g.{organism}_{segment}-{reference}).Cross-organism search
/search(website/src/pages/search/index.astro): computes the intersection of metadata fields present across all organism schemas, builds a syntheticcrossOrganismSchemawith an organism field prepended, and renders the standard SearchFullUI against the unified LAPIS endpoint. Mutation search and sequence downloads are currently disabled.What this PR does not handle:
{organism}_{segment}-{reference}prefix, this should be removed via the webpage or the backendINSDC accession(s)metadata field that includes a list of all INSDC accessions for segmented organisms so this can be searched for on the cross organism page.Limitations:
Alternatives until SILO supports incremental uploads:
🚀 Preview: https://claude-1silo.loculus.org