A service for generating synthetic datasets using large language models (LLMs) for various use cases, particularly focused on Estonian government AI applications.
This service creates synthetic datasets such as FAQs and conversations based on topic content. It uses template-based prompting and structured output formatting to generate high-quality data that can be used for training and testing AI models.
- Generate synthetic FAQs and conversations from source text content
- Configurable dataset structures and prompt templates
- Format outputs as JSON or plain text
- Exponential backoff retry mechanism for handling API failures
- MLflow integration for experiment tracking
- Docker support for containerized operation
Requirements:

- Python 3.8+
- Ollama for local LLM serving (or an accessible LLM API)
- MLflow (optional, for experiment tracking)
Installation:

- Clone the repository:

  ```bash
  git clone https://github.com/your-org/synthetic-data-service.git
  cd synthetic-data-service
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The service is configured via `config.yaml` in the root directory:
```yaml
# API connection settings
api:
  url: "http://localhost:8000"
  timeout_seconds: 30

# Directory paths
directories:
  input: "data"               # Root directory for source data files
  output: "output_datasets"   # Root directory for generated datasets

# FAQs generation settings
faqs:
  structure_name: "single_question"
  prompt_template_name: "institute_topic_question"
  num_examples: 5
  parameters:
    redundancy_factor: 0
    difficulty: "intermediate"
    language: "et"  # Estonian

# Conversations generation settings
conversations:
  structure_name: "topic_conversations"
  prompt_template_name: "institute_topic_conversation"
  num_examples: 3
  parameters:
    redundancy_factor: 0
    num_turns: 6
    difficulty: "intermediate"
    language: "et"

# Processing settings
processing:
  wait_between_requests: 1  # Seconds to wait between API requests

mlflow:
  experiment_name: "synthetic_data_generation"
```
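For reference, these settings can be read with the standard PyYAML package. The loader below is a minimal sketch for your own tooling, not part of the service's documented API:

```python
# Illustrative only: read config.yaml with PyYAML (pip install pyyaml).
# The load_config() helper is an assumption, not the service's actual code.
import yaml

def load_config(path: str = "config.yaml") -> dict:
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_config()
print(config["api"]["url"])             # -> "http://localhost:8000"
print(config["faqs"]["num_examples"])   # -> 5
```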
The LLM settings are configured in `config/model_config.yaml`:
```yaml
model_name: "gemma3:1b-it-qat"
ollama_host: "http://ollama:11434"
ollama_timeout: 60
ollama_max_retries: 3
ollama_retry_delay: 5
...
```
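The `ollama_max_retries` and `ollama_retry_delay` settings drive the exponential backoff retry mechanism listed under the features above. A minimal sketch of such a retry loop (illustrative only, not the service's actual implementation):

```python
# Illustrative exponential backoff using the retry settings above.
# call_llm is a hypothetical stand-in for the actual Ollama request.
import time

def with_retries(call_llm, max_retries: int = 3, retry_delay: float = 5.0):
    for attempt in range(max_retries + 1):
        try:
            return call_llm()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: propagate the failure.
            # Backoff doubles each attempt: 5s, 10s, 20s, ...
            time.sleep(retry_delay * (2 ** attempt))
```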
```text
dataset-generator/
│
├── src/                  # Source code
│   ├── api/              # API routes and validation
│   ├── core/             # Core data generation logic
│   ├── schema/           # Pydantic schemas
│   └── utils/            # Utility functions
│
├── config/               # Configuration files
├── templates/            # Default templates
│   ├── dataset_structures/
│   └── prompts/
│
├── user_configs/         # User-defined templates/configs
│   ├── dataset_structures/
│   └── prompts/
│
├── data/                 # Source data directory
├── output_datasets/      # Generated datasets
├── logs/                 # Application logs
└── mlflow/               # MLflow configuration
```
Run the service with:

```bash
docker compose up --build -d
```

By default, the service runs on `localhost:8000`. You can specify a different host/port if needed.
The generation script processes text files and creates synthetic datasets:
```bash
python test_dataset_generation.py
```
This will:
- Read all topic files from the `data` directory
- Generate FAQs and conversations for each topic
- Save datasets to the output directory
- Log experiments to MLflow (if configured)
Organize your source data in the `data` directory as follows:
```text
data/
└── institution_name/
    ├── topic1.txt
    ├── topic2.txt
    └── ...
```
Each `.txt` file should contain the source content for a specific topic.
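For orientation, the per-topic loop the generation script performs over this layout can be sketched as follows; `generate_dataset` is a hypothetical placeholder, not the script's real function:

```python
# Illustrative sketch: iterate over data/<institution>/<topic>.txt files.
from pathlib import Path

def generate_dataset(topic_name: str, topic_content: str) -> None:
    ...  # e.g., POST the content to the /generate endpoint described below

for topic_file in sorted(Path("data").glob("*/*.txt")):
    generate_dataset(topic_file.stem, topic_file.read_text(encoding="utf-8"))
```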
The service exposes a `POST /generate` endpoint that accepts a JSON request body:

```jsonc
{
  "dataset_name": "optional_name",          // Optional name for the dataset
  "output_base_path": "/optional/path",     // Optional explicit output path
  "structure_name": "structure_name",       // Name of the dataset structure YAML
  "prompt_template_name": "template_name",  // Name of the prompt template to use
  "num_examples": 10,                       // Number of examples to generate
  "output_format": "json",                  // Output format (json or text)
  "parameters": {                           // Additional parameters
    "language": "et",
    "topic_content": "Content text...",
    ...
  }
}
```
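A minimal client call, sketched with the `requests` library; the payload values mirror the FAQs section of `config.yaml`, and `topic_content` would normally come from one of the source `.txt` files:

```python
# Illustrative client for the /generate endpoint (pip install requests).
import requests

payload = {
    "structure_name": "single_question",
    "prompt_template_name": "institute_topic_question",
    "num_examples": 5,
    "output_format": "json",
    "parameters": {
        "language": "et",
        "topic_content": "Source text for one topic...",
    },
}

response = requests.post("http://localhost:8000/generate", json=payload, timeout=30)
response.raise_for_status()
print(response.json())
```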
The service can log experiment details to MLflow. Configure MLflow in `mlflow.env`:
```env
MLFLOW_TRACKING_USERNAME=mlflowadmin
MLFLOW_TRACKING_PASSWORD=set your password
MLFLOW_FLASK_SERVER_SECRET_KEY=set your secret key
...
```
Access the MLflow UI at http://localhost:5000
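For reference, MLflow experiment tracking generally follows the pattern below; exactly which parameters and metrics the service logs is an assumption here, not documented behavior:

```python
# Illustrative MLflow tracking sketch (pip install mlflow).
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("synthetic_data_generation")

with mlflow.start_run(run_name="faqs_generation"):
    mlflow.log_param("num_examples", 5)            # assumed logged parameter
    mlflow.log_param("difficulty", "intermediate") # assumed logged parameter
    mlflow.log_metric("examples_generated", 5)     # assumed logged metric
```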
Define dataset structures in YAML files under `user_configs/dataset_structures/`:
```yaml
name: "my_structure"
description: "Custom dataset structure"
root:
  files:
    records:
      format: "json"
      description: "Main dataset records"
```
Create prompt templates in `user_configs/prompts/`:
```text
You are a helpful AI assistant generating synthetic data.
Please generate ${num_examples} examples about ${topic}.

Additional Instructions: ${additional_instructions}
```
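The `${...}` placeholders follow Python's built-in `string.Template` syntax, so rendering such a template could look like the sketch below (whether the service uses `string.Template` internally is an assumption):

```python
# Illustrative rendering of ${...} placeholders with string.Template.
from string import Template

template = Template(
    "You are a helpful AI assistant generating synthetic data.\n"
    "Please generate ${num_examples} examples about ${topic}."
)
print(template.substitute(num_examples=5, topic="digital ID services"))
```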
- API Connection Error: Ensure the API service is running at the URL specified in `config.yaml`
- LLM Connection Error: Check that Ollama is running and accessible
- File Access Issues: Ensure proper permissions for reading input and writing output
- JSON Parse Errors: Check output templates to ensure they generate valid JSON
Application logs are stored in the `logs/` directory.
The synthetic data generation service is designed to be highly configurable, allowing you to experiment with different settings to optimize output quality. This section explains how to conduct various experiments by modifying configurations and templates.
Modify settings in `config/model_config.yaml` to experiment with model-level behavior, such as the model name, request timeout, and retry settings shown above.

Modify settings in `config.yaml` to experiment with generation behavior, such as `num_examples`, `difficulty`, `redundancy_factor`, `num_turns`, and `language`.
To try a custom prompt template, create it in the matching directory:

- For FAQs: `user_configs/prompts/faqs/your_template_name.txt`
- For conversations: `user_configs/prompts/conversations/your_template_name.txt`

Then update the template name in `config.yaml`:
```yaml
faqs:
  prompt_template_name: "your_template_name"
```
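One way to run such experiments systematically is to sweep a parameter value, regenerate, and compare runs in MLflow. A sketch, assuming the generation script is rerun per setting; note that only `"intermediate"` appears in the shipped config, so the other difficulty values are assumptions:

```python
# Illustrative parameter sweep over the FAQ difficulty setting.
import subprocess
import yaml

for difficulty in ["beginner", "intermediate", "advanced"]:
    with open("config.yaml", encoding="utf-8") as f:
        config = yaml.safe_load(f)
    config["faqs"]["parameters"]["difficulty"] = difficulty
    with open("config.yaml", "w", encoding="utf-8") as f:
        yaml.safe_dump(config, f)
    # Regenerate with the updated config; each run is logged to MLflow.
    subprocess.run(["python", "test_dataset_generation.py"], check=True)
```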