## Data ingestion

We are going to ingest the content of PDF documents into the vector database, using a tool located in the `src/indexer` folder of the project. This tool extracts the text from the PDF files and sends it to the vector database.

This code is already written for you, but let's have a look at how it works.

### The ingestion process

The `src/indexer/src/lib/indexer.ts` file contains the code used to ingest data into the vector database. It runs inside a Node.js application, deployed to Azure Container Apps.

PDF files, stored in the `data` folder, will be sent to this Node.js application using the command line. The files provided here are for demo purposes only, and the suggested prompts we'll use later in the workshop are based on those files.

<div class="tip" data-title="tip">

> You can replace the PDF files in the `data` folder with your own PDF files if you want to use your custom data! Keep in mind that the PDF files must be text-based, and not scanned images. Since the ingestion process can take some time, we recommend starting with a small number of files, with not too many pages.

</div>
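
Before we dive into each step, here's a hedged outline of the overall flow. The function names below are illustrative stubs, not the exact ones used in `indexer.ts`:

```ts
// Illustrative stubs for the steps of the ingestion flow (names are assumptions).
declare function extractTextFromPdf(file: Buffer): Promise<string>;
declare function splitIntoSections(text: string): string[];
declare function createEmbedding(text: string): Promise<number[]>;
declare function addToVectorDatabase(sections: { content: string; embedding: number[] }[]): Promise<void>;

async function indexDocument(file: Buffer): Promise<void> {
  const text = await extractTextFromPdf(file); // 1. read the PDF text
  const contents = splitIntoSections(text);    // 2. split into overlapping sections
  const sections: { content: string; embedding: number[] }[] = [];
  for (const content of contents) {
    // 3. compute an embedding vector for each section
    sections.push({ content, embedding: await createEmbedding(content) });
  }
  await addToVectorDatabase(sections);         // 4. upsert everything into Qdrant
}
```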

#### Reading the content of the PDF files

The content of the PDF files will be used as part of the *Retriever* component of the RAG architecture, to generate answers to your questions using the GPT model.

Text from the PDF files is extracted in the `src/indexer/src/lib/document-processor.ts` file, using the [pdf.js library](https://mozilla.github.io/pdf.js/). You can have a look at the code of the `extractTextFromPdf()` function if you're curious about how it works.
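
To give you an idea of what that extraction involves, here's a minimal sketch using the `pdfjs-dist` package. The workshop's actual `extractTextFromPdf()` may differ, and in Node.js you may need pdf.js's legacy build:

```ts
import { getDocument } from 'pdfjs-dist';

// Extract the text of every page of a PDF, one line per page.
async function extractText(data: Uint8Array): Promise<string> {
  const pdf = await getDocument({ data }).promise;
  const pages: string[] = [];
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    // Each item is a positioned text fragment; join them with spaces.
    pages.push(content.items.map((item) => ('str' in item ? item.str : '')).join(' '));
  }
  return pages.join('\n');
}
```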

#### Computing the embeddings

After the text is extracted, it's transformed into embeddings using the [OpenAI JavaScript library](https://github.com/openai/openai-node):

```ts
async createEmbedding(text: string): Promise<number[]> {
  // Get the embeddings client, then request an embedding vector
  // for the input text using the configured model.
  const embeddingsClient = await this.openai.getEmbeddings();
  const result = await embeddingsClient.create({ input: text, model: this.embeddingModelName });
  return result.data[0].embedding;
}
```
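
As a hypothetical usage example (the model is configured via `embeddingModelName`; with OpenAI's `text-embedding-ada-002`, each vector has 1,536 dimensions):

```ts
const embedding = await this.createEmbedding('What does the free plan include?');
console.log(embedding.length); // e.g. 1536 for text-embedding-ada-002
```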

#### Adding the documents to the vector database

The embeddings, along with the original texts, are then added to the vector database using the [Qdrant JavaScript client library](https://www.npmjs.com/package/@qdrant/qdrant-js). This process is done in batches, to improve performance and limit the number of requests:
| 41 | + |
| 42 | +```ts |
| 43 | +const points = sections.map((section) => ({ |
| 44 | + // ID must be either a 64-bit integer or a UUID |
| 45 | + id: getUuid(section.id, 5), |
| 46 | + vector: section.embedding!, |
| 47 | + payload: { |
| 48 | + id: section.id, |
| 49 | + content: section.content, |
| 50 | + category: section.category, |
| 51 | + sourcepage: section.sourcepage, |
| 52 | + sourcefile: section.sourcefile, |
| 53 | + }, |
| 54 | +})); |
| 55 | +
|
| 56 | +await this.qdrantClient.upsert(indexName, { points }); |
| 57 | +``` |
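
If you're curious what the batching could look like, here's a hedged sketch. The `BATCH_SIZE` value is hypothetical, so check the actual code for how it groups requests:

```ts
const BATCH_SIZE = 64; // hypothetical; the real batch size may differ

for (let i = 0; i < points.length; i += BATCH_SIZE) {
  // Send the points to Qdrant in slices to limit request size and count.
  const batch = points.slice(i, i + BATCH_SIZE);
  await this.qdrantClient.upsert(indexName, { points: batch });
}
```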

### Running the ingestion process

Let's now execute this process. First, you need to make sure you have Qdrant and the indexer service running locally. We'll use Docker Compose to run both services at the same time. Run the following command in a terminal (**make sure you have stopped the Qdrant container first!**):
| 62 | + |
| 63 | +```bash |
| 64 | +docker compose up |
| 65 | +``` |
| 66 | + |
| 67 | +This will start both Qdrant and the indexer service locally. This may takes a few minutes the first time, as Docker needs to download the images. |
| 68 | + |
| 69 | +<div class="tip" data-title="tip"> |
| 70 | + |
| 71 | +> You can look at the `docker-compose.yml` file at the root of the project to see how the services are configured. Docker Compose automatically loads the `.env` file, so we can use the environment variables exposed there. To learn more about Docker Compose, check out the [official documentation](https://docs.docker.com/compose/). |
| 72 | + |
| 73 | +</div> |
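
For reference, a minimal `docker-compose.yml` for this kind of setup could look like the sketch below. The service names, ports, and build path here are assumptions for illustration; the actual file at the project root is the source of truth:

```yaml
services:
  qdrant:
    image: qdrant/qdrant   # the vector database, exposed on port 6333
    ports:
      - "6333:6333"
  indexer:
    build: ./src/indexer   # assumed build path for the indexer service
    env_file: .env         # pass the environment variables to the container
```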

Once all services are started, you can run the ingestion process by opening a new terminal and running the `./scripts/index-data.sh` script on Linux or macOS, or `./scripts/index-data.ps1` on Windows:

```bash
./scripts/index-data.sh
```
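
Under the hood, a script like this typically sends each PDF from the `data` folder to the indexer service. The endpoint and port below are hypothetical, so check the actual script for the real values:

```bash
# Hypothetical sketch: upload each PDF to the indexer's ingestion endpoint.
for file in ./data/*.pdf; do
  curl -F "file=@${file}" http://localhost:3001/documents
done
```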

Once the process completes, a new collection will be available in your database, where you can see the documents that were ingested.
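
If you prefer to verify this from code, here's a quick hedged check using the Qdrant JavaScript client (assuming the collection is named `kbindex`, as shown in the next section):

```ts
import { QdrantClient } from '@qdrant/qdrant-js';

// Connect to the local Qdrant instance and inspect the collection.
const client = new QdrantClient({ url: 'http://localhost:6333' });
const info = await client.getCollection('kbindex');
console.log(info.points_count); // number of ingested sections
```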

### Test the vector database

Open the Qdrant dashboard again by navigating to the following URL in your browser: [http://localhost:6333/dashboard](http://localhost:6333/dashboard).
| 88 | + |
| 89 | +<div class="tip" data-title="tip"> |
| 90 | + |
| 91 | +> In Codespaces, you need to select the **Ports** tab in the bottom panel, right click on the URL in the **Forwarded Address** column next to the `6333` port, and select **Open in browser**. |
| 92 | + |
| 93 | +</div> |

You should see the collection named `kbindex` in the list.

You can select that collection and browse it. You should see the entries that were created by the ingestion process. Documents are split into multiple overlapping sections to improve the search results, so you should see multiple entries for each document.
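
As an illustration, overlapping sections can be produced with a simple sliding window. The sizes below are hypothetical; the workshop's `document-processor.ts` may use a different strategy:

```ts
// Split text into fixed-size sections that overlap, so that a sentence cut
// at a boundary still appears whole in the neighboring section.
function splitIntoSections(text: string, size = 1000, overlap = 100): string[] {
  const sections: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    sections.push(text.slice(start, start + size));
  }
  return sections;
}
```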

Keep the services running, as we'll use them in the next section.