feat: Store JSON occurrences in sqlite #321
base: main
Conversation
We have to work with multi-GB indexes sometimes, so size and DB creation speed are both definitely concerns for us. I would need to do a test of this patch on some of the larger indexes we have. Could you share what kinds of queries you're interested in running? E.g. if we provided a CLI tool to do things such as finding definitions, finding references etc., would that be enough? I wonder if it makes sense to put this (or the compressed version) behind a separate flag... 🤔
The specific thing I was trying to do is to build a dependency tree of the code elements that use a certain API. If there were a CLI for finding references, I'd be able to use it to find where that API is used. But then I'd also need a way to find how those files (or ideally classes etc., but semanticdb didn't provide that info 😢) relate to each other, e.g. by finding what cross-references exist within that set of files. When I gave Claude the schema etc., it generated the correct SQL for me without issues, which is a great benefit of actually using a SQL DB for this sort of thing, and more specifically of using JSON for reference info:

```sql
WITH relevant_files AS (
SELECT DISTINCT d.id, d.relative_path
FROM documents d
JOIN chunks c ON d.id = c.document_id
JOIN mentions m ON c.id = m.chunk_id
JOIN global_symbols gs ON m.symbol_id = gs.id
WHERE gs.symbol LIKE '%/MyAPI#%'
)
SELECT DISTINCT
d_dependent.relative_path as dependent_file,
d_provider.relative_path as dependency_file
FROM relevant_files d_dependent
JOIN chunks c_dependent ON d_dependent.id = c_dependent.document_id
JOIN mentions m_dependent ON c_dependent.id = m_dependent.chunk_id
JOIN global_symbols gs ON m_dependent.symbol_id = gs.id
JOIN mentions m_provider ON gs.id = m_provider.symbol_id
JOIN chunks c_provider ON m_provider.chunk_id = c_provider.id
JOIN relevant_files d_provider ON c_provider.document_id = d_provider.id
WHERE m_dependent.role != 1 -- dependent file uses/references the symbol
AND m_provider.role = 1 -- provider file defines the symbol
AND d_dependent.id != d_provider.id -- exclude self-dependencies
ORDER BY dependent_file, dependency_file;
```

I didn't expect that Sourcegraph would be using this sqlite version of the indexes, since I saw the original PR was marked as resolving #233, which only talked about external usage of this kind of DB as a convenient interchange format. IMHO using JSON better fulfils that specific purpose, but it's very understandable that Sourcegraph has different priorities internally. I'm happy to use my fork if I ever need to, but it'd definitely be convenient if this JSON form of the DB were in the mainline CLI. Perhaps the DB could contain both.
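For reference, the query above can be exercised end to end against a toy in-memory database. This is only a sketch: the table and column layout below (`documents`, `chunks`, `mentions`, `global_symbols`, with `role = 1` meaning "definition") is inferred from the query itself, not taken from the actual schema in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, relative_path TEXT);
CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id INTEGER);
CREATE TABLE global_symbols (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE mentions (chunk_id INTEGER, symbol_id INTEGER, role INTEGER);

-- Toy data: api.scala defines pkg/MyAPI# (role = 1); client.scala references it.
INSERT INTO documents VALUES (1, 'api.scala'), (2, 'client.scala');
INSERT INTO chunks VALUES (1, 1), (2, 2);
INSERT INTO global_symbols VALUES (1, 'pkg/MyAPI#');
INSERT INTO mentions VALUES (1, 1, 1), (2, 1, 0);
""")

rows = conn.execute("""
WITH relevant_files AS (
  SELECT DISTINCT d.id, d.relative_path
  FROM documents d
  JOIN chunks c ON d.id = c.document_id
  JOIN mentions m ON c.id = m.chunk_id
  JOIN global_symbols gs ON m.symbol_id = gs.id
  WHERE gs.symbol LIKE '%/MyAPI#%'
)
SELECT DISTINCT
  d_dependent.relative_path AS dependent_file,
  d_provider.relative_path AS dependency_file
FROM relevant_files d_dependent
JOIN chunks c_dependent ON d_dependent.id = c_dependent.document_id
JOIN mentions m_dependent ON c_dependent.id = m_dependent.chunk_id
JOIN global_symbols gs ON m_dependent.symbol_id = gs.id
JOIN mentions m_provider ON gs.id = m_provider.symbol_id
JOIN chunks c_provider ON m_provider.chunk_id = c_provider.id
JOIN relevant_files d_provider ON c_provider.document_id = d_provider.id
WHERE m_dependent.role != 1
  AND m_provider.role = 1
  AND d_dependent.id != d_provider.id
ORDER BY dependent_file, dependency_file
""").fetchall()

print(rows)  # [('client.scala', 'api.scala')]
```

On this toy data the query correctly reports that `client.scala` depends on `api.scala` for the `MyAPI` symbol, and the self-dependency filter keeps the defining file out of the dependent column.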
@mtaran we're not using it in production, but we do need to be able to debug issues standalone with large indexes sometimes. I'm OK with your proposed solution of having two different columns. Let's go with the following:
I tested your patch on a 1.1GB index (uncompressed): the conversion time goes from 53s to 1m24s (roughly 60% slower) and the DB size goes from 704MB to 1.5GB (roughly double). So it's a fairly high penalty. We can remove the Protobuf-based encoding at a later point if its performance isn't needed.
This is a follow-up to #319, which implemented the initial SCIP->sqlite conversion. It inserted the occurrences information into the DB as a compressed protobuf for reasons of compactness. But this is not conducive to querying or exploring the data in ad hoc ways. This PR changes the format to store a JSON representation of the occurrences instead, since JSON is natively supported by sqlite. The index I'm working with only increased from 18 MB to 33 MB after this change, which to me feels like it's worth the trade-off.
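To illustrate why native JSON matters here, a JSON-encoded occurrences column can be expanded row-by-row with sqlite's built-in `json_each`/`json_extract` functions, with no application-side decoding. This is a minimal sketch, assuming a `chunks` table with an `occurrences` column holding a JSON array; the column name and the occurrence fields (`range`, `symbol`, `symbol_roles`, loosely mirroring SCIP's `Occurrence` message) are illustrative, not the exact schema in this PR. It also assumes a sqlite build with the JSON functions available (they are compiled in by default since SQLite 3.38).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, occurrences TEXT)")

# Illustrative occurrence records stored as one JSON array per chunk.
occurrences = [
    {"range": [3, 4, 9], "symbol": "pkg/MyAPI#", "symbol_roles": 1},
    {"range": [10, 8, 13], "symbol": "pkg/MyAPI#", "symbol_roles": 0},
]
conn.execute("INSERT INTO chunks VALUES (1, ?)", (json.dumps(occurrences),))

# json_each turns each array element into its own row, so occurrences are
# directly filterable in SQL -- a compressed protobuf blob would first have
# to be decoded in application code.
rows = conn.execute(
    """
    SELECT c.id,
           json_extract(occ.value, '$.symbol')       AS symbol,
           json_extract(occ.value, '$.symbol_roles') AS roles
    FROM chunks c, json_each(c.occurrences) occ
    WHERE json_extract(occ.value, '$.symbol') LIKE '%/MyAPI#%'
    """
).fetchall()

print(rows)  # [(1, 'pkg/MyAPI#', 1), (1, 'pkg/MyAPI#', 0)]
```

The same pattern supports ad hoc exploration (grouping by role, joining occurrences back to documents, etc.) straight from the sqlite shell.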
Let me know what you think!
Test plan
Updated the unit tests.