
feat: Implement incremental update system with snapshot restore #232

Open. Wants to merge 1 commit into base: main
Conversation

@onyedikachi-david commented Jan 26, 2025

Fixes: #221
/claim #221

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced incremental update capabilities for project file parsing
    • Added support for tracking and managing file changes in the knowledge graph
  • Improvements

    • Enhanced error handling during file updates
    • Implemented snapshot mechanism for graph state management
    • Added logging for file change tracking and update processes
  • Performance

    • Optimized file parsing with incremental update strategy
    • Improved efficiency in handling project file modifications


coderabbitai bot commented Jan 26, 2025

Walkthrough

The pull request introduces an IncrementalUpdateService to enhance the knowledge graph update process by enabling partial, targeted updates instead of full graph reconstruction. The new service is integrated into the ParsingService, allowing for more efficient file updates by identifying and processing only changed components. The implementation supports change detection, relationship management, and selective inference generation, with built-in snapshot and rollback capabilities to ensure data consistency and error resilience.

Changes

File Change Summary

app/modules/parsing/graph_construction/parsing_service.py
  • Added an optional changed_files parameter to the analyze_directory method
  • Implemented incremental update logic with snapshot and fallback mechanisms
  • Added a new update_files method for incremental file updates

app/modules/parsing/incremental_update_service.py
  • New service class for managing incremental knowledge graph updates
  • Methods for file node retrieval, change tracking, and selective updates
  • Snapshot creation, restoration, and deletion functionality
  • Async methods for file and inference updates

Sequence Diagram

sequenceDiagram
    participant PS as ParsingService
    participant IUS as IncrementalUpdateService
    participant DB as Database
    participant Graph as Knowledge Graph

    PS->>IUS: analyze_directory(changed_files)
    IUS->>IUS: create_snapshot()
    IUS->>IUS: update_files(changed_files)
    IUS->>Graph: identify affected nodes
    IUS->>Graph: remove old nodes
    IUS->>Graph: create new nodes/relationships
    IUS->>IUS: update_inferences(affected_nodes)
    alt Update Successful
        IUS-->>PS: return update results
    else Update Failed
        IUS->>IUS: restore_snapshot()
        IUS-->>PS: return error
    end
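The rollback flow in the diagram can be sketched in plain Python. This is a hedged illustration only: the service object and all method names below are stand-ins mirroring the diagram, not the PR's actual API.

```python
# Hypothetical sketch of the snapshot/rollback flow in the sequence diagram.
# `service` and its methods are assumptions modeled on the diagram, not the
# real IncrementalUpdateService interface.
def run_incremental_update(service, changed_files):
    snapshot_id = service.create_snapshot()
    try:
        results = service.update_files(changed_files)
        service.update_inferences(results["affected_nodes"])
        return {"status": "success", **results}
    except Exception as exc:
        # Any failure triggers a restore to the pre-update snapshot.
        service.restore_snapshot(snapshot_id)
        return {"status": "error", "detail": str(exc)}
```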

Assessment against linked issues

Objectives from the linked issue (#221):
  • Change Detection
  • Relationship Management
  • Inference Handling
  • Performance Optimization
  • Error Handling and Rollback

Poem

🐰 Incremental updates hop and dance,
Graph nodes shift with every glance,
Snapshots guard our data's grace,
Parsing service sets the pace,
Knowledge grows, no time to waste! 🌱


Quality Gate failed

Failed conditions
5.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
app/modules/parsing/graph_construction/parsing_service.py (1)

516-529: Consider attaching the original exception for clarity.

Static analysis suggests raising exceptions with raise ... from e to keep the original traceback context. You might do the following to preserve the chain:

-raise HTTPException(
-    status_code=500,
-    detail=f"Update failed and restored to previous state: {str(e)}"
-)
+raise HTTPException(
+    status_code=500,
+    detail=f"Update failed and restored to previous state: {str(e)}"
+) from e
🧰 Tools
🪛 Ruff (0.8.2)

525-528: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)
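The effect of the suggested fix is easy to demonstrate in isolation. In this self-contained sketch, UpdateError is a stand-in for the HTTPException raised in the PR:

```python
# Minimal illustration of the B904 fix: `raise ... from e` preserves the
# original error on __cause__ instead of discarding its traceback context.
# UpdateError is a hypothetical stand-in for the PR's HTTPException.
class UpdateError(Exception):
    pass

def update_with_chain():
    try:
        raise ValueError("neo4j write failed")
    except ValueError as e:
        raise UpdateError("Update failed and restored to previous state") from e
```

With the chain in place, logs and debuggers can surface the root cause alongside the higher-level error.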

app/modules/parsing/incremental_update_service.py (2)

89-110: Handy retrieval of change history.

Limiting the results to 100 is a pragmatic safeguard. Provide pagination or filtering if your users need deeper logs.
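A paginated variant could look like the sketch below. The CHANGE label and property names follow the review's description of the change log, while the page and page_size parameters are illustrative additions, not part of the PR:

```python
# Hypothetical paginated change-history lookup. Label and property names
# are assumptions based on the review text, not the PR's actual schema.
def change_history_query(page: int, page_size: int = 100):
    query = (
        "MATCH (c:CHANGE {project_id: $project_id}) "
        "RETURN c ORDER BY c.timestamp DESC "
        "SKIP $skip LIMIT $limit"
    )
    return query, {"skip": page * page_size, "limit": page_size}
```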


201-213: Snapshot listing is concise.

Returning snapshot objects sorted by creation date is straightforward. Consider storing the snapshot creation pipeline for quick retrieval or UI display.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4e5a83 and 4ac06b4.

📒 Files selected for processing (2)
  • app/modules/parsing/graph_construction/parsing_service.py (9 hunks)
  • app/modules/parsing/incremental_update_service.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
app/modules/parsing/graph_construction/parsing_service.py

525-528: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


535-535: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (28)
app/modules/parsing/graph_construction/parsing_service.py (14)

7-8: Nice import additions for incremental updates.

Using Optional[List[str]] for changed_files is a clean approach, and importing time for snapshot naming seems appropriate.


25-25: Good modular design.

Importing IncrementalUpdateService here centralizes the incremental logic, helping maintain separation of concerns between parsing and incremental updates.


47-54: Great initialization of the incremental update service.

Fetching Neo4j config from config_provider and passing it into IncrementalUpdateService ensures consistent configuration usage across services. The approach is clean and maintains your existing injection pattern.


174-179: Docstring clarity is appreciated.

These lines provide a clear explanation of the new changed_files parameter and how it impacts the parsing process.


194-201: Proactive snapshot creation is wise.

Capturing the graph state before updating files is essential for ensuring rollback safety. This design choice enhances resilience.


202-214: Incremental update logic is well-structured.

Invoking update_files, then aggregating node and relationship counts, provides clear feedback on the scope of changes. Logging the results is helpful for debugging and tracking.


218-220: Good use of project status updates.

Updating the project status to READY upon successful incremental updates aligns well with user expectations.


224-236: Robust failure handling.

Catching exceptions, attempting snapshot restoration, and gracefully falling back to a full parse is a solid approach to error recovery. This ensures minimal data loss and a consistent final state.
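The recovery path described here can be reduced to a small, testable pattern. All callables below are stand-ins for the PR's methods, not its real signatures:

```python
# Illustrative recovery chain: attempt the incremental update, restore the
# snapshot on failure, then fall back to a full parse. The callables are
# hypothetical stand-ins for the methods in parsing_service.py.
def parse_with_fallback(incremental_update, full_parse, restore_snapshot, snapshot_id):
    try:
        return incremental_update()
    except Exception:
        # Restore a consistent graph state before the heavier full parse.
        restore_snapshot(snapshot_id)
        return full_parse()
```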


239-244: Smart snapshot creation again.

Even for full parses, retaining a snapshot before the heavy-lift parsing ensures a safe fallback point. Great consistency.


280-286: Additional safety net for full parse failures.

Restoring from a snapshot on parse failure helps maintain data integrity. Great to see consistent fallback steps.


300-327: Parallel incremental logic for other languages.

Using the same fallback approach for non-Python/JS languages fosters a uniform strategy across the codebase. Maintain consistency with the next lines if you plan to expand supported languages.


468-474: New update_files method is well-defined.

Accepting project_id, file_paths, and user info is consistent with your existing patterns. This method neatly encapsulates incremental file updates for external callers.


485-490: Snapshot creation before partial updates.

Continuing the snapshot pattern ensures safe rollbacks for multi-file updates as well. Great design reuse.


492-507: Incremental update block is cohesive.

Using update_files from IncrementalUpdateService and returning the results object with node/relationship counts fosters clarity for API consumers.

app/modules/parsing/incremental_update_service.py (14)

13-24: Constructor’s scope is clear.

Injecting Neo4j driver, CodeGraphService, and InferenceService fosters straightforward usage across methods. Consider gracefully closing resources if used in a long-lived context.


25-38: Correct node retrieval method.

Querying Neo4j by file_path is a direct, efficient filter. Remember to guard for directory separators or OS-specific path nuances if that becomes relevant (e.g., Windows vs. Unix).
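One hedged way to guard against the OS path nuance is to normalize file_path keys to POSIX form before using them to match Neo4j nodes. The helper below is a hypothetical illustration, not code from the PR:

```python
from pathlib import PureWindowsPath
import posixpath

# Hypothetical normalization helper: store and query file_path keys in
# POSIX form so Windows-style inputs still match nodes created on Unix.
def normalize_file_path(path: str) -> str:
    if "\\" in path:
        return PureWindowsPath(path).as_posix()
    return posixpath.normpath(path)
```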


39-55: Good approach for identifying connected nodes.

Pulling connected nodes up to two hops away ensures capturing relevant references for partial inference updates. For large graphs, watch performance cost.
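A two-hop neighborhood query of the kind described might look like the sketch below. The NODE label and property names are assumptions; bounding the hop count is what keeps the traversal affordable on large graphs:

```python
# Sketch of a bounded two-hop neighborhood query. Label and property
# names are illustrative assumptions, not the PR's actual schema.
TWO_HOP_NEIGHBORS = (
    "MATCH (n:NODE {file_path: $file_path, repo_id: $repo_id}) "
    "OPTIONAL MATCH (n)-[*1..2]-(connected:NODE) "
    "RETURN DISTINCT connected"
)
```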


56-67: Straightforward node deletion.

Detaching and removing the file’s nodes helps ensure a clean slate for subsequent insertion. Confirm that global references defined in other files are not inadvertently removed.


68-88: Solid change logging mechanism.

Creating a CHANGE node in Neo4j for each update event is a nice way to track revision history. Consider indexing on timestamp or file_path to optimize large change logs.
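The indexing suggestion could be implemented with statements like these; the index names and properties are illustrative, not taken from the PR:

```python
# Hypothetical index DDL for the change log. Index names and properties
# are assumptions based on the review's suggestion.
CHANGE_LOG_INDEXES = [
    "CREATE INDEX change_timestamp IF NOT EXISTS "
    "FOR (c:CHANGE) ON (c.timestamp)",
    "CREATE INDEX change_file_path IF NOT EXISTS "
    "FOR (c:CHANGE) ON (c.file_path)",
]
```

Running these once at startup keeps change-history queries fast as the log grows, and IF NOT EXISTS makes them safe to re-run.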


111-200: Snapshot creation thoroughly captures nodes and relationships.

Storing node labels, properties, and edges ensures a robust restore path. This approach is a good foundation for partial rollbacks or advanced versioning features down the line.


214-300: Restore workflow is thorough.

Deleting all existing nodes before re-inserting from snapshot ensures a consistent state. A chunk-based approach with transactions is wise to prevent partial restores.
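A chunk-based restore typically rests on a helper like the one below; the default batch size of 500 is an assumption, not the PR's value:

```python
# Generic chunking helper of the kind a chunk-based restore relies on.
# The default size of 500 is an illustrative assumption.
def chunked(items, size=500):
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each chunk can then be written inside its own transaction, so a failure never leaves a half-restored graph in a single huge write.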


301-326: Snapshot deletion is correct.

The straightforward query with DELETE s effectively cleans up the node. You could add checks to remove any associated indexes if you store them in separate structures, but that’s optional.


327-358: Snapshot info retrieval.

This method is valuable for introspection, returning node/relationship counts along with a timestamp.


359-454: Incremental file updates with rollback approach.

  • Good use of _remove_file_nodes to purge outdated references.
  • Incorporating RepoMap for file-level graph generation is consistent.
  • The final call to _create_change_log ensures traceability.

Watch out for concurrency issues if multiple updates happen simultaneously, but your single-transaction pattern likely mitigates it for now.


455-482: Targeted inference updates.

Limiting inference updates to affected nodes conserves resources. The chunked approach (batch_size = 50) is a good balance for large data sets.
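The batched pattern the review praises can be sketched as below; run_inference is a hypothetical stand-in for the real InferenceService call:

```python
import asyncio

# Hedged sketch of batched inference regeneration (batch_size=50 per the
# review). `run_inference` is a stand-in for the InferenceService call.
async def update_inferences(node_ids, run_inference, batch_size=50):
    results = []
    for start in range(0, len(node_ids), batch_size):
        batch = node_ids[start:start + batch_size]
        # Run one batch concurrently, then move to the next to cap load.
        results.extend(await asyncio.gather(*(run_inference(n) for n in batch)))
    return results
```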


483-507: Bulk file updates.

Looping over file paths and calling update_file individually is a simple, maintainable approach. Results are aggregated in a dict, which is easy to parse and interpret.


508-547: File deletion logic.

  • Calculating affected_nodes then removing them is correct.
  • The final call to _update_inferences preserves data integrity for connected references.

Consider partial verification that removing certain files doesn't break references in unrelated files, though your design likely addresses that with the multi-hop approach.


548-582: File status retrieval.

Provides a concise summary: total nodes, types, and last update. Great for an at-a-glance check.

@dhirenmathur self-requested a review January 28, 2025 07:45

@dhirenmathur commented Feb 3, 2025

Hey @onyedikachi-david, thanks for contributing! Can you please fix the merge conflicts? I will pick up the review today.

@nndn self-requested a review February 6, 2025 11:26
Successfully merging this pull request may close this issue: Incremental Knowledge Graph updates (#221)
3 participants