feat: Store JSON occurrences in sqlite #321
base: main
Conversation
We have to work with multi-GB indexes sometimes, so size and DB creation speed are both definitely concerns for us. I would need to do a test of this patch on some of the larger indexes we have. Could you share what kinds of queries you're interested in running? E.g. if we provided a CLI tool to do things such as finding definitions, finding references etc., would that be enough? I wonder if it makes sense to put this (or the compressed version) behind a separate flag... 🤔
The specific thing I was trying to do is to build a dependency tree of the code elements that use a certain API. If there were a CLI for finding references, I'd be able to use it to find where that API is used. But then I'd also need a way to find how those files (or ideally classes etc., but semanticdb didn't provide that info 😢) relate to each other, e.g. by finding what cross-references exist within that set of files. When I gave Claude the schema etc., it generated the correct SQL for me without issues, which is a great benefit of actually using a SQL DB for this sort of thing, and more specifically of using JSON for reference info:

```sql
WITH relevant_files AS (
SELECT DISTINCT d.id, d.relative_path
FROM documents d
JOIN chunks c ON d.id = c.document_id
JOIN mentions m ON c.id = m.chunk_id
JOIN global_symbols gs ON m.symbol_id = gs.id
WHERE gs.symbol LIKE '%/MyAPI#%'
)
SELECT DISTINCT
d_dependent.relative_path as dependent_file,
d_provider.relative_path as dependency_file
FROM relevant_files d_dependent
JOIN chunks c_dependent ON d_dependent.id = c_dependent.document_id
JOIN mentions m_dependent ON c_dependent.id = m_dependent.chunk_id
JOIN global_symbols gs ON m_dependent.symbol_id = gs.id
JOIN mentions m_provider ON gs.id = m_provider.symbol_id
JOIN chunks c_provider ON m_provider.chunk_id = c_provider.id
JOIN relevant_files d_provider ON c_provider.document_id = d_provider.id
WHERE m_dependent.role != 1 -- dependent file uses/references the symbol
AND m_provider.role = 1 -- provider file defines the symbol
AND d_dependent.id != d_provider.id -- exclude self-dependencies
ORDER BY dependent_file, dependency_file;
```

I didn't expect that Sourcegraph would be using this sqlite version of the indexes, since I saw the original PR was marked as resolving #233, which only talked about external usage of this kind of DB as a convenient interchange format. IMHO using JSON better fulfils that specific purpose, but it's very understandable that Sourcegraph has different priorities internally. I'm happy to use my fork if I ever need to, but it'd definitely be convenient if this JSON form of the DB were in the mainline CLI. Perhaps the DB could contain both.
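For reference, the query above can be exercised end to end against a toy in-memory database. This is only a sketch: the table and column layout below (`documents`, `chunks`, `mentions`, `global_symbols`, with `role = 1` meaning "definition") is inferred from the query itself, not taken from the actual schema in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, relative_path TEXT);
CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id INTEGER);
CREATE TABLE global_symbols (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE mentions (chunk_id INTEGER, symbol_id INTEGER, role INTEGER);

-- Toy data: api.scala defines pkg/MyAPI# (role = 1); client.scala references it.
INSERT INTO documents VALUES (1, 'api.scala'), (2, 'client.scala');
INSERT INTO chunks VALUES (1, 1), (2, 2);
INSERT INTO global_symbols VALUES (1, 'pkg/MyAPI#');
INSERT INTO mentions VALUES (1, 1, 1), (2, 1, 0);
""")

rows = conn.execute("""
WITH relevant_files AS (
  SELECT DISTINCT d.id, d.relative_path
  FROM documents d
  JOIN chunks c ON d.id = c.document_id
  JOIN mentions m ON c.id = m.chunk_id
  JOIN global_symbols gs ON m.symbol_id = gs.id
  WHERE gs.symbol LIKE '%/MyAPI#%'
)
SELECT DISTINCT
  d_dependent.relative_path AS dependent_file,
  d_provider.relative_path AS dependency_file
FROM relevant_files d_dependent
JOIN chunks c_dependent ON d_dependent.id = c_dependent.document_id
JOIN mentions m_dependent ON c_dependent.id = m_dependent.chunk_id
JOIN global_symbols gs ON m_dependent.symbol_id = gs.id
JOIN mentions m_provider ON gs.id = m_provider.symbol_id
JOIN chunks c_provider ON m_provider.chunk_id = c_provider.id
JOIN relevant_files d_provider ON c_provider.document_id = d_provider.id
WHERE m_dependent.role != 1
  AND m_provider.role = 1
  AND d_dependent.id != d_provider.id
ORDER BY dependent_file, dependency_file
""").fetchall()

print(rows)  # [('client.scala', 'api.scala')]
```

On this toy data the query correctly reports that `client.scala` depends on `api.scala` for the `MyAPI` symbol, and the self-dependency filter keeps the defining file out of the dependent column.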
@mtaran we're not using it in production, but we do need to be able to debug issues standalone with large indexes sometimes. I'm OK with your proposed solution of having two different columns. Let's go with the following:
I tested your patch on a 1.1GB index (uncompressed): the conversion time goes from 53s to 1m24s (roughly 60% slower) and the DB size goes from 704MB to 1.5GB (roughly double). So it's a fairly high penalty. We can remove the Protobuf-based encoding at a later point if its performance isn't needed.
This is a follow-up to #319, which implemented the initial SCIP->sqlite conversion. It inserted the occurrences information into the DB as a compressed protobuf for reasons of compactness. But this is not conducive to querying or exploring the data in ad hoc ways. This PR changes the format to store a JSON representation of the occurrences instead, since JSON is natively supported by sqlite. The index I'm working with only increased from 18 MB to 33 MB after this change, which to me feels like it's worth the trade-off.
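To illustrate why native JSON matters here, a JSON-encoded occurrences column can be expanded row-by-row with sqlite's built-in `json_each`/`json_extract` functions, with no application-side decoding. This is a minimal sketch, assuming a `chunks` table with an `occurrences` column holding a JSON array; the column name and the occurrence fields (`range`, `symbol`, `symbol_roles`, loosely mirroring SCIP's `Occurrence` message) are illustrative, not the exact schema in this PR. It also assumes a sqlite build with the JSON functions available (they are compiled in by default since SQLite 3.38).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, occurrences TEXT)")

# Illustrative occurrence records stored as one JSON array per chunk.
occurrences = [
    {"range": [3, 4, 9], "symbol": "pkg/MyAPI#", "symbol_roles": 1},
    {"range": [10, 8, 13], "symbol": "pkg/MyAPI#", "symbol_roles": 0},
]
conn.execute("INSERT INTO chunks VALUES (1, ?)", (json.dumps(occurrences),))

# json_each turns each array element into its own row, so occurrences are
# directly filterable in SQL -- a compressed protobuf blob would first have
# to be decoded in application code.
rows = conn.execute(
    """
    SELECT c.id,
           json_extract(occ.value, '$.symbol')       AS symbol,
           json_extract(occ.value, '$.symbol_roles') AS roles
    FROM chunks c, json_each(c.occurrences) occ
    WHERE json_extract(occ.value, '$.symbol') LIKE '%/MyAPI#%'
    """
).fetchall()

print(rows)  # [(1, 'pkg/MyAPI#', 1), (1, 'pkg/MyAPI#', 0)]
```

The same pattern supports ad hoc exploration (grouping by role, joining occurrences back to documents, etc.) straight from the sqlite shell.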
Let me know what you think!
Test plan
Updated the unit tests.