feat(everything): POC claude - Cross organism queries: One SILO approach#6723

Draft
anna-parker wants to merge 33 commits into main from claude-1silo

anna-parker commented Jun 20, 2026

Contributor

🤖 Generated with Claude Code (https://claude.com/claude-code)

This PR implements the "1 SILO" architecture: instead of one LAPIS/SILO instance per organism, all organisms share a single unified LAPIS/SILO instance. All LAPIS queries are routed through the backend, which acts as a proxy. This enables a new cross-organism search page.

Summary of Changes

Metadata structure changes

(Note: this accounts for most of the LOC: +2,111 -1,627 in values.yaml and approx. 1,000 LOC in the backend config test files.)

  • Introduced a sharedMetadata top-level key in values.yaml for metadata fields shared across all organisms (collection date, geo-location, display name, NCBI release date,
    etc.) and an organismSpecificMetadata field for organism-specific fields (segment-specific fields such as INSDC accession, nextclade metadata)
  • The unified SILO database config (_siloDatabaseConfig.tpl) merges all organisms into one schema:
    • An organism string field is added as the first metadata entry (indexed) so SILO can filter by organism.
    • Nucleotide and gene names are organism-prefixed: {organism}_{segment}-{reference} or {organism}_{gene}-{reference}
    • All metadata fields from all organisms are merged (deduplication by name, first occurrence wins). perSegment fields are expanded per multi-segment organism.
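
As a rough illustration of the merge rules above, the following TypeScript sketch builds an organism-prefixed sequence name and merges metadata with "first occurrence wins". The `MetadataField` shape and both function names are hypothetical and not taken from the PR; the actual logic lives in the Helm template `_siloDatabaseConfig.tpl`.

```typescript
// Hypothetical field shape for illustration; not the actual Loculus config types.
interface MetadataField {
    name: string;
    type: string;
}

// Builds a unified sequence name in the format described above:
// {organism}_{segment}-{reference} (the same pattern is used for genes).
function unifiedSequenceName(organism: string, segmentOrGene: string, reference: string): string {
    return `${organism}_${segmentOrGene}-${reference}`;
}

// Merges the metadata of all organisms. The first occurrence of a name wins,
// and an "organism" string field is always the first entry.
function mergeMetadata(perOrganism: MetadataField[][]): MetadataField[] {
    const seen = new Set<string>(["organism"]);
    const merged: MetadataField[] = [{ name: "organism", type: "string" }];
    for (const fields of perOrganism) {
        for (const field of fields) {
            if (!seen.has(field.name)) {
                seen.add(field.name);
                merged.push(field);
            }
        }
    }
    return merged;
}
```

Note that a later field with the same name but a different type is dropped silently in this sketch, which is exactly the conflict case the PR lists as unhandled.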

New backend endpoints

  • LapisProxyController (/query/{endpoint}): proxies all LAPIS/SILO queries from clients through the backend. POST endpoints pass the body (with organism as a filter field)
    directly to LAPIS. GET sequence-download endpoints (/query/unalignedNucleotideSequences, /query/alignedNucleotideSequences, /query/alignedAminoAcidSequences) take organism,
    segment, and reference as query params and remap them to the correct unified segment/gene name (e.g. {organism}_{segment}-{reference}).
  • LapisProxyService: wraps Java's HttpClient to forward POST and GET requests to the internal LAPIS URL, streaming the response body.
  • UnifiedReleasedDataController (/get-released-data): extended to accept an optional organism query param. When omitted, it streams released data for all organisms in a single NDJSON response, padding each row with null values for columns belonging to other organisms (required by SILO, which expects every defined column in every row). Sequence keys are organism-prefixed to match the unified SILO schema.
  • The LAPIS ingress is removed — LAPIS is no longer externally exposed. A single loculus-lapis-service replaces the per-organism services.
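
The null-padding described for the all-organism /get-released-data response could look roughly like the sketch below. It is written in TypeScript purely for illustration (the backend itself is Kotlin), and `Row`, `padRow`, and `toNdjson` are assumed names, not identifiers from the PR.

```typescript
type Row = Record<string, unknown>;

// Pads a row so that every column of the unified schema is present,
// using null for columns that the row does not define.
function padRow(row: Row, allColumns: string[]): Row {
    const padded: Row = {};
    for (const column of allColumns) {
        const value = row[column];
        padded[column] = value === undefined ? null : value;
    }
    return padded;
}

// Serialises the padded rows as NDJSON: one JSON object per line.
function toNdjson(rows: Row[], allColumns: string[]): string {
    return rows.map((row) => JSON.stringify(padRow(row, allColumns))).join("\n");
}
```

A real implementation would stream rows rather than build one string; the sketch only shows the per-row padding.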

Cross-organism search

  • New page at /search (website/src/pages/search/index.astro): computes the intersection of metadata fields present across all organism schemas, builds a synthetic
    crossOrganismSchema with an organism field prepended, and renders the standard SearchFullUI against the unified LAPIS endpoint. Mutation search and sequence downloads are currently disabled.
  • Website code is refactored to call the backend with the appropriate parameters
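
The field intersection behind the synthetic crossOrganismSchema can be sketched as follows. The `Schema` and `MetadataField` shapes and the function name are simplified assumptions; the real schema objects in the website carry more properties.

```typescript
// Simplified, hypothetical shapes; the real website schema has more properties.
interface MetadataField {
    name: string;
    type: string;
}
interface Schema {
    metadata: MetadataField[];
}

// Keeps only fields whose name appears in every organism schema
// and prepends an "organism" field.
function buildCrossOrganismMetadata(schemas: Schema[]): MetadataField[] {
    const organismField: MetadataField = { name: "organism", type: "string" };
    const first = schemas[0];
    if (first === undefined) {
        return [organismField];
    }
    const rest = schemas.slice(1);
    const shared = first.metadata.filter(
        (field) =>
            field.name !== "organism" &&
            rest.every((schema) => schema.metadata.some((other) => other.name === field.name)),
    );
    return [organismField, ...shared];
}
```

The intersection here matches on name only, so it takes the field definition from the first schema and does not check that types agree.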

What this PR does not handle:

  • Name conflicts between same-named metadata fields of different types in organismSpecificFields
  • Lineage searches (not much to add, actually; I just wanted to simplify)
  • Potentially, organism-specific metadata could be split further into sample-specific and segment/reference-specific metadata (e.g. sampling procedure vs. nextclade alignment details)
  • Unaligned sequence downloads on the cross organism query page
  • Mutation search is broken because it now requires a {organism}_{segment}-{reference} prefix; this requirement should be removed via the webpage or the backend
  • Create a new INSDC accession(s) metadata field that includes a list of all INSDC accessions for segmented organisms so this can be searched for on the cross organism page.

Limitations:

  • One SILO is great in many ways BUT our current approach to updating SILO is just not scalable. Now when there is any new data ALL data must be streamed again to SILO. Switching to one SILO will only be possible when SILO supports incremental uploads with deletions.
  • The backend currently parses and modifies all requests to LAPIS but does not modify/parse the response. Modifying requests is feasible. However, as soon as LAPIS/SILO has different sequence and metadata names than the ones we expose to users we would need to modify these names in the response we send to users. This is definitely possible but another level of complexity, especially if the user for example requested compressed mutations. Potentially adding an option in LAPIS to map the exported gene/segment names would be a better alternative than mapping in the backend.

Alternatives until SILO supports incremental uploads:

  • Continue with one SILO per organism and add a new SILO with all shared metadata and unaligned sequences. This does mean we duplicate the metadata and unaligned sequences across LAPIS/SILO instances yet again, which is inefficient, but it can easily be switched out later if LAPIS/SILO is not exposed via the backend. Requires: new metadata structure, proxying of endpoints in the backend

🚀 Preview: https://claude-1silo.loculus.org

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Jun 20, 2026
@anna-parker anna-parker marked this pull request as ready for review June 20, 2026 06:46

claude Bot commented Jun 20, 2026

Contributor

Claude finished @anna-parker's task in 4m 54s


Code Review

  • Gather context and understand PR goals
  • Review backend changes (LapisProxyController, LapisProxyService, Config, etc.)
  • Review Kubernetes/deployment changes
  • Review website changes (CrossOrganismSearch, DownloadDialog, etc.)
  • Review test coverage
  • Post final review

This is a significant architectural change — consolidating per-organism LAPIS/SILO instances into one unified instance. The overall approach is sound and the key mechanics (organism-prefixed segment names, organism field injection, ETag scoping) are well thought through. Below are issues worth addressing before merge.

Breaking change: callers of get-released-data

The per-organism /{organism}/get-released-data endpoint was removed and replaced with /get-released-data. Two callers still use the old path and will get 404s at runtime:

  • ena-submission/src/ena_deposition/call_loculus.py line 157: f"{organism_url(config, organism)}/get-released-data"
  • cli/src/loculus_cli/api/backend.py line 201: f"/{organism}/get-released-data"

The loculus-silo importer already uses the root URL correctly. The ENA submission and CLI need to be updated.

Security: open proxy with no endpoint allowlist

LapisProxyController's catch-all @PostMapping("/{endpoint}") forwards any path to LAPIS with no restriction. An external caller can probe LAPIS system endpoints (e.g. /query/info, /query/swagger-ui/...). Recommend an explicit allowlist of permitted endpoints (details, aggregated, nucleotideMutations, etc.) returning 404 for anything else.
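
A minimal sketch of the allowlist idea, in TypeScript for illustration only (the controller itself is Kotlin). The endpoint set is an assumption based on the examples named above, and `checkEndpoint` is a hypothetical helper; the real list of permitted LAPIS endpoints would need to be confirmed.

```typescript
// Assumed allowlist; extend it only with endpoints that are safe to expose.
const ALLOWED_ENDPOINTS = new Set<string>(["details", "aggregated", "nucleotideMutations"]);

type ProxyDecision = { allowed: true; endpoint: string } | { allowed: false; status: 404 };

// Decides whether a requested path segment may be forwarded to LAPIS.
// Anything outside the allowlist is answered with 404 instead of being proxied.
function checkEndpoint(endpoint: string): ProxyDecision {
    if (ALLOWED_ENDPOINTS.has(endpoint)) {
        return { allowed: true, endpoint };
    }
    return { allowed: false, status: 404 };
}
```

Matching on the exact string keeps path tricks such as `swagger-ui/index.html` out without any pattern logic.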

HTTP status not forwarded by proxy

LapisProxyService.proxyPost/proxyGet forward the body and a handful of headers, but the LAPIS HTTP status code is never copied to the servlet response. A 400 or 500 from LAPIS arrives at the client as a 200. This makes error handling impossible for callers and can mask LAPIS failures silently.
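
The intended behaviour can be sketched as a pure function that maps the upstream response to the client response. The response shapes and the forwarded header names below are assumptions made for illustration, not the types or headers used in LapisProxyService.

```typescript
// Hypothetical minimal response shape; not the backend's real types.
interface SimpleResponse {
    status: number;
    headers: Record<string, string>;
    body: string;
}

// Assumed set of headers worth forwarding.
const FORWARDED_HEADERS = ["content-type", "etag"];

// Copies the upstream status code unchanged, so a 400 or 500 from LAPIS
// reaches the client as a 400 or 500 rather than as a 200.
function toClientResponse(upstream: SimpleResponse): SimpleResponse {
    const headers: Record<string, string> = {};
    for (const name of FORWARDED_HEADERS) {
        const value = upstream.headers[name];
        if (value !== undefined) {
            headers[name] = value;
        }
    }
    return { status: upstream.status, headers, body: upstream.body };
}
```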

Multi-value GET parameter truncation

In LapisProxyController.proxyGet (line 147), only values[0] is forwarded per parameter. Any multi-value param (e.g. accession=A&accession=B in a sequence download URL) silently drops all but the first value. The simplest fix is to forward request.queryString directly instead of re-serialising through a single-value map.
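
One way to keep every value is to serialise a parameter map whose values are arrays, as in this TypeScript sketch. `buildQueryString` is a hypothetical helper, and it assumes LAPIS accepts repeated parameters (accession=A&accession=B), which should be checked against the LAPIS documentation.

```typescript
// Serialises a servlet-style parameter map (each key maps to all of its values)
// into a query string without dropping repeated values.
// The organism is always taken from the explicit argument, never from the map.
function buildQueryString(params: Record<string, string[]>, organism: string): string {
    const pairs: string[] = [`organism=${encodeURIComponent(organism)}`];
    for (const key of Object.keys(params)) {
        if (key === "organism") {
            continue;
        }
        for (const value of params[key] ?? []) {
            pairs.push(`${encodeURIComponent(key)}=${encodeURIComponent(value)}`);
        }
    }
    return pairs.join("&");
}
```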

withSample is now a no-op

withSample in lapisApi.ts now just returns its argument unchanged. It should be removed and the call sites inlined — leaving it in place is misleading for future readers.

Dead code in LapisProxyService

injectOrganism (line 68) is defined but never called. It can be removed, or if it's needed soon, a TODO comment would make intent clearer.

Minor issues (see inline comments)

  • resolveDefaultSegment throws IllegalArgumentException → Spring maps this to 500; should be ResponseStatusException(HttpStatus.BAD_REQUEST).
  • streamSequences and getDetails in LapisClient still have hardcoded /sample/ path prefixes that are incompatible with the new proxy URL — they would 404 if called. Both appear unused currently.
  • Row key={i} in CrossOrganismSearch table; should use accessionVersion.
  • No runtime validation that same-named fields have consistent types across organisms in the unified SILO config (only a comment warns about this).

@anna-parker anna-parker marked this pull request as draft June 20, 2026 06:47
@anna-parker anna-parker changed the title feat(everything): claude - Cross organism queries: One SILO approach feat(everything): POC claude - Cross organism queries: One SILO approach Jun 20, 2026

val requestBody = objectMapper.writeValueAsString(body)
val request = HttpRequest.newBuilder()
.uri(URI.create(url))
log.debug { "Proxying GET to $url" }

val request = HttpRequest.newBuilder()
.uri(URI.create(url))
response.body().use { writeResponse(it) }
}

fun injectOrganism(body: Map<String, Any?>, organism: String): Map<String, Any?> = body + ("organism" to organism)

Contributor

injectOrganism is defined here but never called anywhere in the codebase. If it's intended for future use, that's fine to keep, but it could also be removed.

.POST(HttpRequest.BodyPublishers.ofString(requestBody))
.build()

val response = httpClient.send(request, HttpResponse.BodyHandlers.ofInputStream())

Contributor

The HTTP status code from LAPIS is never forwarded to the caller. If LAPIS returns a 400 or 500, the client receives HTTP 200 with the error body. Consider forwarding the status:

servletResponse.status = response.statusCode()  // servletResponse: the HttpServletResponse (name assumed); or equivalent Spring API

This also applies to proxyGet.

) {
val queryParams = mutableMapOf("organism" to organism)
request.parameterMap.forEach { (key, values) ->
if (key != "organism" && values.isNotEmpty()) queryParams[key] = values[0]

Contributor

Only the first value of each multi-value query parameter is forwarded. A download URL with accession=A&accession=B would silently drop accession=B. Since this is the GET path for sequence downloads, this is likely to be hit in practice.

Consider collecting all values:

request.parameterMap.forEach { (key, values) ->
    if (key != "organism") values.forEach { v -> queryParams.merge(key, v) { a, b -> "$a,$b" } }
}

Or switch queryParams to a MultiValueMap / build the query string directly from request.queryString.

return if (refGenome.nucleotideSequences.size == 1) {
organism.name
} else {
throw IllegalArgumentException(

Contributor

IllegalArgumentException is not mapped to a 4xx response by Spring — it will produce a 500. Since this is a client error (wrong endpoint for a multi-segment organism), use ResponseStatusException(HttpStatus.BAD_REQUEST, ...) instead.

// scoping to a single organism. All-organism queries omit it.

@PostMapping("/{endpoint}", consumes = ["application/json"])
fun proxyPost(

Contributor

The catch-all /{endpoint} POST handler proxies any path segment to LAPIS with no allowlist. An external caller can reach LAPIS endpoints like /query/info, /query/swagger-ui/..., or future LAPIS admin routes. Consider limiting endpoint to a known set of safe paths (e.g., details, aggregated, nucleotideMutations, etc.) and returning 404 for anything else.


public static createForOrganism(organism: string) {
-    return this.create(getLapisUrl(getRuntimeConfig().serverSide, organism), getSchema(organism));
+    const client = this.create(getLapisUrl(getRuntimeConfig().serverSide), getSchema(organism));

Contributor

createForOrganism correctly injects the organism into every POST body for the new unified SILO setup. Note that streamSequences (line 235) and getDetails (line 244) have hardcoded /sample/ path prefixes that are incompatible with the new proxy URL (backendUrl/query). They resolve to backendUrl/query/sample/... which has no matching route. Both appear unused right now, but should be updated to remove the /sample prefix before they are called.


function withSample<Path extends `/${string}`>(path: Path) {
-    return `/sample${path}` as const;
+    return path;

Contributor

withSample is now a no-op (returns path unchanged). The function can be removed entirely; every call site can just pass the path literal directly. Leaving it in place is mildly misleading — a future reader might assume it does something.

setIsLoading(false);
}
},
[queryUrl, JSON.stringify(searchParams)],

Contributor

JSON.stringify(searchParams) as a useCallback dependency is fine for correctness here (since hiddenFieldValues is a prop) but is a subtle serialisation-based dependency — any value change rebuilds a new string, triggering a new callback reference. This is unlikely to matter in practice, but consider listing the individual hiddenFieldValues keys, or memoising searchParams with useMemo.

</tr>
</thead>
<tbody>
{rows.map((row, i) => (

Contributor

Using array index as key in a table row (key={i}) will cause React reconciliation issues if rows are reordered between pages. Use a stable identifier like accessionVersion instead:

{rows.map((row) => (
    <tr key={String(row['accessionVersion'])} ...>

Merges metadata from all organisms; deduplication by name (first occurrence wins,
caller must ensure no type conflicts for same-named fields across organisms).
Adds an "organism" field as the first metadata entry.
Sequence names are organism-prefixed (handled by get-released-data / reference genomes).

Contributor

The comment notes "caller must ensure no type conflicts for same-named fields across organisms" — but there is no validation or error at any layer that enforces this. A metadata field named date typed as string in one organism and date in another would silently use the first-seen type in SILO, potentially causing query failures or data corruption. Consider adding a Helm fail assertion or a backend startup check that validates this constraint.
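
Such a check could be sketched as follows, in TypeScript for illustration. The `MetadataField` shape and `findTypeConflicts` are hypothetical; an actual check would live in the Helm templates or in backend startup code.

```typescript
// Hypothetical field shape for illustration.
interface MetadataField {
    name: string;
    type: string;
}

// Returns one message per field whose type differs from the type first seen
// for the same field name in an earlier organism.
function findTypeConflicts(perOrganism: Record<string, MetadataField[]>): string[] {
    const firstSeen = new Map<string, { organism: string; type: string }>();
    const conflicts: string[] = [];
    for (const organism of Object.keys(perOrganism)) {
        for (const field of perOrganism[organism] ?? []) {
            const previous = firstSeen.get(field.name);
            if (previous === undefined) {
                firstSeen.set(field.name, { organism, type: field.type });
            } else if (previous.type !== field.type) {
                conflicts.push(
                    `${field.name}: ${previous.type} (${previous.organism}) vs ${field.type} (${organism})`,
                );
            }
        }
    }
    return conflicts;
}
```

Failing deployment or startup when the returned list is non-empty would turn the silent "first occurrence wins" behaviour into an explicit error.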

}
}

@Transactional(readOnly = true)

Contributor Author

This should be deleted now; this is a sloppy refactor.

keycloakAuthoritiesConverter: KeycloakAuthenticationConverter,
corsConfigurationSource: org.springframework.web.cors.CorsConfigurationSource,
): SecurityFilterChain = httpSecurity
.csrf { it.disable() }
@@ -0,0 +1,97 @@
package org.loculus.backend.controller

Contributor Author

I'm not sure how efficient SILO is at loading all the data. In theory, we could already sort the data in the get-released-data endpoint by organism and reference to improve SILO's binary encoding.
