DiaPredictor is a comprehensive web application—originally developed as a university project—designed to help individuals assess their diabetes risk and receive personalized recommendations. It leverages data analysis, machine learning, and a conversational chatbot interface to provide actionable health insights.
- Project Description
- Screenshots
- Key Features
- Installation
- Repository Visualization
- Data Overview
- Chatbot Implementation with Rasa and Streamlit
- License
DiaPredictor provides an easy-to-use interface for individuals to assess their diabetes risk. The system is built on a dataset that has been cleaned and enriched with synthetic data to improve model accuracy.
The project features include:
- Original vs. Modified Dataset Comparison: View and analyze the differences between raw and enriched data.
- Machine Learning Model Evaluation: Compare performance metrics between models trained on different datasets.
- Chatbot Assistant: Receive personalized health advice and diabetes risk assessments.
- Predictor System: Input your health data and receive an immediate risk analysis.
The system is implemented using:
- Streamlit for an interactive frontend.
- Scikit-Learn for training and evaluating models.
- Rasa for an AI-powered chatbot.
- Diabetes Risk Prediction
  - Users can input personal health metrics (e.g., age, BMI, glucose levels, smoking status).
  - The model predicts the likelihood of diabetes and provides actionable health recommendations.
- Comparative Model Analysis
  - Evaluate different models trained on both the original and enriched datasets.
  - Metrics include accuracy, precision, mean squared error (MSE), and R² score.
- Chatbot Assistance
  - AI-powered chatbot offers real-time insights on diabetes risk, prevention, and lifestyle changes.
- Data Preprocessing & Augmentation
  - To address class imbalance, SMOTENC was applied to generate synthetic minority-class samples.
  - Categorical features such as gender and smoking history, as well as outliers, were handled before model training.
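The four metrics named under Comparative Model Analysis can be spelled out in a few lines. The sketch below is illustrative only: it uses plain Python instead of the project's Scikit-Learn calls, and the toy labels and predicted probabilities are made up for the example.

```python
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Of all samples predicted positive, the fraction that really are positive."""
    truths_of_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not truths_of_predicted_pos:
        return 0.0
    return sum(t == positive for t in truths_of_predicted_pos) / len(truths_of_predicted_pos)


def mse(y_true, y_score):
    """Mean squared error between true values and predicted scores."""
    return mean((t - s) ** 2 for t, s in zip(y_true, y_score))


def r2(y_true, y_score):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((t - s) ** 2 for t, s in zip(y_true, y_score))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy data (hypothetical): 1 = diabetic, 0 = non-diabetic.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.2]
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold probabilities at 0.5

print("accuracy:", accuracy(y_true, y_pred))
print("precision:", precision(y_true, y_pred))
print("MSE:", mse(y_true, y_prob))
print("R2:", r2(y_true, y_prob))
```

Accuracy and precision are computed on the thresholded class labels, while MSE and R² are computed on the predicted probabilities; whether the project evaluates them the same way is an assumption.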
- Python: version 3.10
- Git: For cloning the repository and managing submodules.
- Clone the repository:
git clone https://github.com/FaresM7/DiaPredictor.git
- Create a virtual environment with Python 3.10
py -3.10 -m venv venv # Windows launcher; on macOS/Linux use `python3.10 -m venv venv`
source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install the required dependencies:
pip install -r requirements.txt
- Run the application:
python start.py
- In case you are not redirected to the Streamlit page, open your browser and navigate to:
http://localhost:8501
Data Overview
We used the Diabetes prediction dataset from Kaggle to train the models and later enhanced it.
Note
This dataset is used strictly for educational and demonstration purposes.
Initial data visualization revealed the following:
- A slight drop in diabetes cases between ages 60–70, followed by a sharp increase at 80+.
- Minimal differences in diabetes cases between males and females.
- Nearly symmetrical distribution with minimal differences between mean and median values.
- Quartiles exhibited expected variations across attributes.
- Incomplete Data: Removed incomplete examples using Pandas. Labels were checked to confirm that they are direct labels for the prediction target.
- Balancing: The dataset originally had a 10:1 imbalance favoring non-diabetic cases.
- Downsampling: Majority class reduced to 20% of its original size, giving a 1:2 ratio of diabetic to non-diabetic cases.
- Oversampling: SMOTENC (Synthetic Minority Over-sampling for Nominal and Continuous data) was applied to the minority class to balance the dataset while addressing potential overfitting concerns.
- Splitting: Dataset divided into Training (70%), Validation (15%), and Test (15%) sets. Initial splits showed a 1:3 imbalance for diabetic cases.
- Categorical Data Encoding: One-hot encoding converted categorical features into binary columns for model training.
- Normalization: Linear scaling normalized feature values, addressing right-skewed distributions (e.g., age). Z-Score standardization was avoided due to non-normal distributions and minimal outliers.
- Removal of Unnecessary Attributes: Attributes with minimal model impact (e.g., gender, certain smoking history categories) were excluded from the modified dataset.
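Three of the steps above (downsampling, one-hot encoding, and linear scaling) can be sketched in plain Python. This is a minimal illustration, not the project's pipeline: the function names, the column names (`age`, `smoking_history`, `diabetes`), and the toy rows are assumptions, and the SMOTENC step is left out (it is available as `imblearn.over_sampling.SMOTENC` in the imbalanced-learn library).

```python
import random


def downsample_majority(rows, label_key="diabetes", keep_fraction=0.2, seed=0):
    """Keep every minority row (label 1) and a random keep_fraction of majority rows (label 0)."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    kept = rng.sample(majority, round(len(majority) * keep_fraction))
    return minority + kept


def one_hot(rows, column):
    """Replace a categorical column with one binary column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}_{c}"] = int(r[column] == c)
        encoded.append(new_row)
    return encoded


def min_max_scale(rows, column):
    """Linearly rescale a numeric column to the range [0, 1]."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [{**r, column: (r[column] - lo) / span} for r in rows]


# Toy data with a 10:1 imbalance: 100 non-diabetic rows, 10 diabetic rows.
smoking = ["never", "current", "former"]
rows = [
    {"age": 20 + i % 60, "smoking_history": smoking[i % 3], "diabetes": 0}
    for i in range(100)
]
rows += [
    {"age": 50 + i * 3, "smoking_history": "former", "diabetes": 1}
    for i in range(10)
]

balanced = downsample_majority(rows)  # 10 diabetic + 20 non-diabetic rows
prepared = min_max_scale(one_hot(balanced, "smoking_history"), "age")
print(len(balanced), prepared[0])
```

Keeping 20% of a majority class that is ten times larger than the minority leaves twice as many non-diabetic as diabetic rows, which is the 1:2 ratio mentioned in the downsampling step.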
- Diabetes correlation with age showed a zigzag pattern, with a sharp increase in cases at ages 75–80.
- Slight gender differences in diabetes cases remained.
- Blood glucose levels displayed increased variation after SMOTENC, with standard deviation rising from 40.90 to 52.55.
- Quartiles maintained expected variation, with a slightly increased spread compared to the original dataset.
Chatbot Implementation with Rasa and Streamlit
- streamlit: Builds the chatbot interface.
- json: Handles structured message data.
- requests: Sends requests to Rasa’s REST API.
- time: Introduces wait times for unavailable servers.
check_server_ready()
- Checks if the Rasa server is running via a GET request to /status.
- If unavailable, waits 5 seconds before retrying.
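A readiness check along these lines might look like the sketch below. The base URL (Rasa's usual local port 5005), the retry limit, and the injectable `get` and `sleep` parameters are assumptions added for illustration and testability; the project's own function may differ.

```python
import time

import requests

RASA_URL = "http://localhost:5005"  # assumed local Rasa server address


def check_server_ready(base_url=RASA_URL, max_attempts=3, wait_seconds=5,
                       get=requests.get, sleep=time.sleep):
    """Return True once GET {base_url}/status answers with HTTP 200.

    Waits wait_seconds between attempts and returns False after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if get(f"{base_url}/status", timeout=3).status_code == 200:
                return True
        except requests.exceptions.RequestException:
            pass  # server not reachable yet
        if attempt < max_attempts:
            sleep(wait_seconds)
    return False
```

Bounding the number of attempts keeps the UI from waiting forever when the Rasa server was never started.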
get_bot_response(user_input)
- Sends user input to the Rasa bot (/webhooks/rest/webhook).
- Processes responses:
  - If multiple bot messages exist, extracts and displays all.
  - If no response, prompts the user to rephrase.
  - If an error occurs, returns a diagnostic message.
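The three response cases can be sketched as follows. The webhook URL, the `sender` value, the reply wording, and the injectable `post` parameter are assumptions for illustration; the request body and the list-of-messages reply follow the format of Rasa's REST channel.

```python
import requests

# Assumed local address of Rasa's REST channel.
RASA_WEBHOOK = "http://localhost:5005/webhooks/rest/webhook"


def get_bot_response(user_input, sender="user", post=requests.post):
    """Send one user message to Rasa and return the bot's reply as a single string."""
    try:
        response = post(
            RASA_WEBHOOK,
            json={"sender": sender, "message": user_input},
            timeout=10,
        )
        response.raise_for_status()
        messages = response.json()  # list of dicts such as {"recipient_id": ..., "text": ...}
    except (requests.exceptions.RequestException, ValueError) as exc:
        # Network failure, HTTP error status, or a body that is not valid JSON.
        return f"Error contacting the bot: {exc}"

    texts = [m["text"] for m in messages if "text" in m]
    if not texts:
        return "Sorry, I didn't understand that. Could you rephrase?"
    return "\n".join(texts)  # join multiple bot messages into one reply
```

Returning a plain string in every case lets the caller display the result without having to distinguish between a normal reply and an error.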
- Uses st.set_page_config() to define layout and title ("Chatbot Interface").
- Verifies if the Rasa server is ready before loading the chat UI.
- Retrieves conversation history from session state and displays previous messages.
- Uses st.chat_input() for message submission.
- Saves user messages in session state for persistence.
- Sends user input to get_bot_response() and displays the bot’s reply.
- Uses st.chat_message() to display user and bot messages dynamically.
- Detects server issues and informs users if the bot is unavailable.
- Extracts intent (user’s goal) and entities (important details).
- Example:
  - Intent: "report_illness" (user feels unwell).
  - Entities: "symptoms": "fever, headache".
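To make the intent and entity structure concrete, the sketch below pulls both out of a parse result shaped like the one Rasa's NLU component produces. The helper name `summarize_parse`, the example sentence, and the confidence value are made up for illustration.

```python
def summarize_parse(parse_result):
    """Return (intent_name, {entity: value}) from a Rasa-style NLU parse result."""
    intent = parse_result.get("intent", {}).get("name")
    entities = {e["entity"]: e["value"] for e in parse_result.get("entities", [])}
    return intent, entities


# Hypothetical parse result for the example above.
example = {
    "text": "I have a fever and a headache",
    "intent": {"name": "report_illness", "confidence": 0.93},
    "entities": [{"entity": "symptoms", "value": "fever, headache"}],
}

intent, entities = summarize_parse(example)
print(intent, entities)
```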
- Define automatic replies for specific user inputs.
- Example:
- User: "Hello"
- Bot: "Hi there! How can I help?"
- Control multi-step interactions.
- Example:
- User: "I feel unwell"
- Bot: "Can you describe your symptoms?"
- User: "I have a headache and fever."
- Bot: "I recommend rest and hydration. Would you like medical advice?"
- Purpose: Delivers personalized health advice.
- How it works:
- Retrieves user conditions (e.g., smoking, hypertension).
- Provides general (e.g., exercise) and personalized (condition-specific) recommendations.
- Purpose: Estimates the user’s diabetes risk.
- How it works:
- Collects user data (e.g., age, BMI, glucose).
- Uses a machine learning model to predict risk (low, moderate, high).
- Returns actionable health tips.
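One way to turn a model's predicted probability into the low, moderate, and high categories is a pair of cut-offs, as in the sketch below. The function name and the threshold values (0.3 and 0.7) are hypothetical and are not taken from the project.

```python
def risk_level(probability, moderate_from=0.3, high_from=0.7):
    """Map a predicted diabetes probability (0 to 1) to a coarse risk category."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high_from:
        return "high"
    if probability >= moderate_from:
        return "moderate"
    return "low"


for p in (0.1, 0.5, 0.9):
    print(p, risk_level(p))
```

With a Scikit-Learn classifier, the probability would typically come from `predict_proba`; how the project derives its categories is not stated here.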
- Purpose: Enhances conversations by remembering user names.
- How it works:
- Extracts the name from input and stores it in a slot.
- If missing, prompts the user to re-enter it.
Warning
DiaPredictor is not a substitute for professional medical advice, diagnosis, or treatment.
The diabetes risk assessments and recommendations are for educational and research purposes only. Always seek the advice of qualified healthcare professionals for any medical concerns.
Use of this application is entirely at your own risk, and the developers assume no liability for any actions taken based on its output.
This project is licensed under the MIT License.
See the LICENSE file for details.
For inquiries, feedback, or suggestions:
- LinkedIn: in/fares-elbermawy
We welcome contributions! Feel free to open a pull request or start a discussion in our GitHub repository.

