This document outlines the tasks for a project focused on building a graph RAG from product documentation using Python, Neo4j, and Streamlit.
- Task ID: T1
- Task Description:
- Understand the structure of the product documentation located in the `/docs` directory.
- Design and implement a Python `DirectoryParser` class to recursively parse all Markdown (`.md`) files within the `/docs` directory and its subdirectories.
- For each Markdown file, extract its full content.
- Identify its hierarchical relationship (e.g., parent document, child documents) based on the directory structure.
- Generate a JSON representation for each parsed document. Each JSON object should represent a potential node in the knowledge graph.
- The JSON object should include at least:
  - `name`: Name of the file or directory.
  - `content`: Processed and clean textual content of the Markdown file.
  - `children`: An array of objects with the keys `name` and `type` (if any, based on subdirectories and documents).
- Store these JSON objects as individual `.json` files in a new `/temp_data` directory.
- Assigned to: Aastha
- Expected Delivery Date: 11th June (Wed)
- Status: DONE
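
The parsing steps in Task T1 can be sketched as follows. This is a minimal illustration rather than the delivered `DirectoryParser`: the top-level `type` field, the index-prefixed output file names, and the choice to treat "clean" content as simply stripped text are all assumptions made for the example.

```python
import json
from pathlib import Path


class DirectoryParser:
    """Walk a docs tree and build one JSON-ready dict per directory and Markdown file."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def parse(self) -> list[dict]:
        """Return node dicts for the root directory and everything below it."""
        nodes: list[dict] = []
        self._visit(self.root, nodes)
        return nodes

    def _visit(self, path: Path, nodes: list[dict]) -> None:
        if path.is_dir():
            # Children are subdirectories and Markdown files, sorted for stable output.
            entries = sorted(
                p for p in path.iterdir() if p.is_dir() or p.suffix == ".md"
            )
            nodes.append({
                "name": path.name,
                "type": "directory",  # assumed field, not required by the task text
                "content": "",
                "children": [
                    {"name": p.name, "type": "directory" if p.is_dir() else "file"}
                    for p in entries
                ],
            })
            for entry in entries:
                self._visit(entry, nodes)
        else:
            nodes.append({
                "name": path.name,
                "type": "file",
                # "Cleaning" is reduced to stripping whitespace in this sketch.
                "content": path.read_text(encoding="utf-8").strip(),
                "children": [],
            })

    def dump(self, out_dir: str) -> None:
        """Write each node to its own .json file in out_dir."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for index, node in enumerate(self.parse()):
            # Prefix with an index so files that share a name do not collide.
            target = out / f"{index:04d}_{node['name']}.json"
            target.write_text(json.dumps(node, indent=2), encoding="utf-8")
```

Under these assumptions, `DirectoryParser("/docs").dump("/temp_data")` would write one file per directory and per Markdown document.
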
- Task ID: T2
- Task Description:
- Familiarize yourself with Neo4j graph database concepts: Nodes, Relationships, Properties, and the Cypher query language.
- Set up a local Neo4j instance.
- Design and implement a Python `KnowledgeGraph` class and corresponding methods to read the JSON files generated in Task T1.
- Design and implement a Python `TextProcessor` class that provides NLP-related methods such as embedding generation.
- For each JSON object:
  - Create a corresponding Node in Neo4j (e.g., with a label like `File` or `Directory`).
  - Store key information from the JSON (like `name`, `content`, `children`, `embedding`) as properties of this Neo4j node.
- Establish relationships between these nodes based on the hierarchical structure identified in Task T1. For example, create a `CONTAINS` relationship from a parent document node to its child document/directory nodes.
- The primary deliverable is a Neo4j database populated with nodes representing each document and directory along with relationships representing their basic directory structure.
- Assigned to: Jay
- Expected Delivery Date: 18th June (Wed)
- Status: DONE
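
One way the node and relationship creation in Task T2 might look is sketched below. It is not the delivered `KnowledgeGraph` class. The helper names, the top-level `type` field on each JSON object, and the assumption that node names are unique are all hypothetical. Because Neo4j properties cannot hold a list of maps, `children` is stored as a list of child names, and the `embedding` property is left out for brevity. The `session` argument is assumed to behave like a Neo4j driver session, which accepts a Cypher string and a parameter dictionary in `run`.

```python
from typing import Iterator

# Node labels cannot be passed as Cypher parameters, so they come from a fixed mapping.
LABELS = {"file": "File", "directory": "Directory"}

CONTAINS_QUERY = (
    "MATCH (parent {name: $parent}), (child {name: $child}) "
    "MERGE (parent)-[:CONTAINS]->(child)"
)


def node_statement(node: dict) -> tuple[str, dict]:
    """Build the Cypher statement and parameters that upsert one node."""
    label = LABELS[node["type"]]
    query = (
        f"MERGE (n:{label} {{name: $name}}) "
        "SET n.content = $content, n.children = $children"
    )
    params = {
        "name": node["name"],
        "content": node["content"],
        # Flatten the child objects to plain names so the value is a valid property.
        "children": [child["name"] for child in node["children"]],
    }
    return query, params


def contains_statements(node: dict) -> Iterator[tuple[str, dict]]:
    """Yield one CONTAINS statement per child of the given node."""
    for child in node["children"]:
        yield CONTAINS_QUERY, {"parent": node["name"], "child": child["name"]}


def load(session, nodes: list[dict]) -> None:
    """Create all nodes first, then the relationships between them."""
    for node in nodes:
        query, params = node_statement(node)
        session.run(query, params)
    for node in nodes:
        for query, params in contains_statements(node):
            session.run(query, params)
```

With the official `neo4j` package, a session could be obtained from `GraphDatabase.driver(uri, auth=(user, password)).session()`, where the URI and credentials depend on the local setup.
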
- Task ID: T3
- Task Description:
- The goal of this task is to enrich the knowledge graph by adding non-hierarchical (semantic) links only between document nodes.
- Re-parse the content of the documents (from the JSON files).
- Implement a strategy to identify potential relationships between different documentation sections that are not explicitly defined by the directory structure.
- Design and implement a Python `DataProcessor` class that will contain the methods needed to implement Task T3.
- Define new relationship types in Neo4j to represent these discovered links (e.g., `RELATED_TO`).
- Assigned to: Harshika
- Expected Delivery Date: 25th June (Wed)
- Status: DONE
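
Task T3 leaves the linking strategy open. As one possible illustration, the sketch below proposes `RELATED_TO` candidates by cosine similarity between simple term-frequency vectors. The term-frequency vectors stand in for real embeddings, and the function names, the `0.5` threshold, and the `score` relationship property are assumptions for the example, not the approach actually used in the project.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical statement for persisting one discovered link between two documents.
RELATED_QUERY = (
    "MATCH (a:File {name: $left}), (b:File {name: $right}) "
    "MERGE (a)-[r:RELATED_TO]->(b) SET r.score = $score"
)


def term_vector(text: str) -> Counter:
    """Count lowercase alphanumeric tokens; a stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def related_pairs(docs: dict[str, str], threshold: float = 0.5) -> list[tuple]:
    """Return (left, right, score) for every document pair at or above the threshold."""
    vectors = {name: term_vector(text) for name, text in docs.items()}
    pairs = []
    for left, right in combinations(sorted(vectors), 2):
        score = cosine(vectors[left], vectors[right])
        if score >= threshold:
            pairs.append((left, right, round(score, 3)))
    return pairs
```

Each returned tuple could then be passed as the `left`, `right`, and `score` parameters of `RELATED_QUERY`. Comparing every pair is quadratic in the number of documents, which is acceptable for a small documentation set but would need an index for a large one.
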
- Task ID: T4
- Task Description:
- Design and implement a Python `GraphRAG` class.
- The `GraphRAG` class should:
- Establish a connection to the Neo4j database.
- Have a core method that accepts a user's natural language query (a string) as input.
- Implement logic to translate this user query into one or more Cypher queries to execute against the Neo4j graph. This might involve:
  - Searching for nodes whose `content` property matches keywords from the user query.
  - Leveraging the relationships (both hierarchical and semantic) to find connected/relevant information.
  - (Optional, more advanced) If implementing text similarity for Task T3, you could embed the user query and find nodes with similar content embeddings.
- The method should retrieve the `content` of the Neo4j nodes that are deemed most relevant to the user's query.
- Consider how to rank or score the retrieved nodes/documents by relevance, if multiple results are found.
- The deliverable is the `GraphRAG` class, with clear documentation on how to use it and examples.
- Assigned to: Aastha
- Expected Delivery Date: 3rd July (Thursday)
- Status: Under Review
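
The keyword-based retrieval and ranking described in Task T4 could be sketched roughly as below. This is an illustrative outline, not the class under review: the stopword list, the Cypher text, the occurrence-count score, and the injected `run_query` callable are all assumptions. Injecting `run_query` keeps the sketch runnable without a database. It is expected to take a Cypher string and a parameter dictionary and return rows as dictionaries with `name`, `content`, and `related` keys.

```python
import re
from typing import Callable, Iterable

# Small illustrative stopword list; a real one would be larger.
STOPWORDS = {"a", "an", "the", "how", "do", "i", "to", "of", "in", "is", "what", "for"}

# Find File nodes whose content mentions any keyword, plus semantically linked files.
SEARCH_QUERY = """
MATCH (n:File)
WHERE any(k IN $keywords WHERE toLower(n.content) CONTAINS k)
OPTIONAL MATCH (n)-[:RELATED_TO]-(m:File)
RETURN n.name AS name, n.content AS content, collect(m.name) AS related
"""


class GraphRAG:
    """Translate a question into a keyword search and rank the returned nodes."""

    def __init__(self, run_query: Callable[[str, dict], Iterable[dict]]) -> None:
        self._run_query = run_query

    @staticmethod
    def keywords(question: str) -> list[str]:
        """Lowercase, de-duplicate, and drop stopwords while keeping word order."""
        words = re.findall(r"[a-z0-9]+", question.lower())
        return [w for w in dict.fromkeys(words) if w not in STOPWORDS]

    def retrieve(self, question: str, top_k: int = 3) -> list[dict]:
        """Return up to top_k rows, highest keyword-occurrence score first."""
        keywords = self.keywords(question)
        if not keywords:
            return []
        rows = self._run_query(SEARCH_QUERY, {"keywords": keywords})
        scored = []
        for row in rows:
            text = row["content"].lower()
            # Score is the total number of keyword occurrences in the content.
            score = sum(text.count(k) for k in keywords)
            scored.append({**row, "score": score})
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]
```

With the official `neo4j` driver, `run_query` might be `lambda q, p: [r.data() for r in session.run(q, p)]`. Raw occurrence counts favour long documents, so a production ranker would likely normalise by length or use embedding similarity instead.
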
- Task ID: T5
- Task Description:
- Develop a user-friendly web interface using Streamlit to showcase the capabilities of the knowledge graph and retriever.
- The UI should allow an end-user to:
- Enter a natural language query related to the product documentation.
- View the search results retrieved by the `GraphRAG` class (from Task T4).
- Display the content of the relevant document sections in a readable format.
- Consider adding features like:
- Displaying metadata about the retrieved documents (e.g., source path, related topics).
- Visualizing parts of the knowledge graph related to the query or results (optional, can be complex).
- Basic error handling and user feedback.
- Ensure the UI is intuitive and effectively demonstrates the project's value in navigating and understanding the product documentation.
- The UI should primarily interact with the `GraphRAG` class built in Task T4.
- Assigned to: Jay
- Expected Delivery Date: 7th July (Mon)
- Status: Under Review
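
A bare-bones version of the Streamlit page described in Task T5 might look like the sketch below. It is a hedged outline, not the interface under review. The `render_app` and `result_caption` names are hypothetical, and the result shape (dictionaries with `name` and `content`, plus optional `score` and `related`) is assumed. The `retrieve` argument stands for whatever callable the retriever exposes for answering a query string.

```python
from typing import Callable


def result_caption(result: dict) -> str:
    """Format the metadata line shown above each result's content."""
    parts = [f"score: {result.get('score', 'n/a')}"]
    related = result.get("related") or []
    if related:
        parts.append("related: " + ", ".join(related))
    return " | ".join(parts)


def render_app(retrieve: Callable[[str], list]) -> None:
    """Draw the search page; retrieve maps a question to a list of result dicts."""
    # Imported lazily so result_caption can be used and tested without Streamlit.
    import streamlit as st

    st.title("Product documentation search")
    question = st.text_input("Ask a question about the product")
    if not question:
        st.info("Enter a question to search the documentation.")
        return

    try:
        results = retrieve(question)
    except Exception as exc:  # surface backend failures to the user
        st.error(f"Search failed: {exc}")
        return

    if not results:
        st.warning("No matching documents found.")
        return

    for result in results:
        # One collapsible section per document keeps long content readable.
        with st.expander(result["name"]):
            st.caption(result_caption(result))
            st.markdown(result["content"])
```

To try it, a script would call `render_app` with a real retriever at module level and be launched with `streamlit run`. Graph visualisation is left out because the task marks it as optional.
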