
Fix/issue 147 security upload validation#148

Open
kallal79 wants to merge 14 commits into KathiraveluLab:dev from kallal79:fix/issue-147-security-upload-validation

Conversation

@kallal79

@kallal79 kallal79 commented Apr 4, 2026

No description provided.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@pradeeban
Member

/gemini review

@pradeeban
Copy link
Copy Markdown
Member

@kallal79 there are several merge conflicts. Please attend to them and to the bot comments as appropriate.

@pradeeban pradeeban marked this pull request as draft April 5, 2026 03:27
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several security and robustness improvements, including mandatory production SECRET_KEY checks, request size limits, and enhanced file upload handling with unique naming and extension filtering. It also refactors heavy machine learning dependencies (like torch and transformers) to be optional using try...except blocks to support lightweight environments. Feedback suggests that moving lazy imports from within functions to the module level would be more efficient and recommends using img.load() for more thorough image validation during uploads.

Comment on lines +49 to +51
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Contributor


medium

Importing heavy libraries like matplotlib, numpy, and pandas inside a route handler is inefficient as these imports are evaluated on every request. While Python caches modules in sys.modules, the overhead of the import statement and local variable assignment still exists. A better approach for optional dependencies is to use the try...except pattern at the module level (as seen in fl_worker.py), which keeps the route handlers clean and performant.
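A minimal sketch of the module-level pattern the comment recommends (the helper name `optional_import` and the `plotting_available` guard are illustrative, not the project's actual code):

```python
import importlib

def optional_import(name):
    """Return the named module if importable, else None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

# Resolved once at module import time, not on every request:
plt = optional_import("matplotlib.pyplot")
np = optional_import("numpy")
pd = optional_import("pandas")

def plotting_available():
    """Route handlers branch on this flag instead of importing inside the handler."""
    return plt is not None and np is not None and pd is not None
```

A route can then check `plotting_available()` up front and fail fast, keeping the handler body free of import statements.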

"""Refactor: Helper to generate base64 encoded word cloud image."""
if not keywords:
return None
from wordcloud import WordCloud
Contributor


medium

Moving the WordCloud import inside the function causes it to be re-evaluated every time the function is called. It is recommended to handle optional dependencies at the module level using a try...except block to avoid unnecessary overhead during request processing.

Comment on lines +78 to +79
with Image.open(image_path) as img:
    img.verify()
Contributor


security-medium medium

While img.verify() is a good first step for validating an uploaded image, it only checks the image header and does not decode the actual pixel data. This means it may not detect all forms of file corruption or certain types of malicious payloads embedded in the image data. For more robust validation, consider calling img.load() or performing a simple transformation to ensure the image can be fully decoded.
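A hedged sketch of the fuller validation being suggested, working on raw bytes so it is easy to test (the function name is illustrative; the project's actual routes validate a saved file path instead):

```python
from io import BytesIO

def fully_decode_image(data: bytes):
    """Return True if the bytes decode into a complete image, False if not,
    or None when Pillow is not installed (lightweight environments)."""
    try:
        from PIL import Image, UnidentifiedImageError
    except ImportError:
        return None
    try:
        # verify() only inspects the header; reopen and load() to decode pixels.
        with Image.open(BytesIO(data)) as img:
            img.verify()
        with Image.open(BytesIO(data)) as img:
            img.load()
        return True
    except (UnidentifiedImageError, OSError, ValueError):
        return False
```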

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces robust error handling for optional heavy dependencies (such as PyTorch, Transformers, Spacy, and HDBSCAN) across the application, allowing it to run in lightweight environments. It also adds security configurations for file uploads and secret keys. While the changes significantly improve resilience, I have flagged concerns regarding the eager loading of Spacy and SentenceTransformer models at the module level, which may still cause startup failures, and the use of JSON error responses in browser-facing routes.


logger = logging.getLogger(__name__)
model = SentenceTransformer("all-MiniLM-L6-V2")
model = SentenceTransformer("all-MiniLM-L6-V2") if SentenceTransformer else None
Contributor


high

Instantiating SentenceTransformer at the module level will cause the application to crash during startup if the model cannot be loaded (e.g., due to missing local files or lack of internet access for the initial download). To support the 'lightweight environment' goal mentioned in your comments, this model should be initialized lazily within a helper function or the route itself, similar to the pattern used in dreamsApp/core/embeddings.py.
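One generic way to express the lazy initialization being suggested is a thread-safe once-only wrapper; this sketch is hypothetical and not the pattern in dreamsApp/core/embeddings.py:

```python
import threading

def lazy_singleton(loader):
    """Wrap a zero-argument loader so it runs at most once, thread-safely.
    A failed load is cached as None instead of crashing the application."""
    lock = threading.Lock()
    state = {"loaded": False, "value": None}

    def get():
        with lock:
            if not state["loaded"]:
                try:
                    state["value"] = loader()
                except Exception:
                    state["value"] = None
                state["loaded"] = True
        return state["value"]

    return get

# Usage sketch (model id from the diff; the wrapper itself is hypothetical):
# get_embedding_model = lazy_singleton(
#     lambda: SentenceTransformer("all-MiniLM-L6-V2"))
```

Callers invoke `get_embedding_model()` inside the route, so a missing model degrades gracefully instead of failing at import time.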

except ImportError:  # pragma: no cover - optional in lightweight test envs
    SentenceTransformer = None

nlp = spacy.load("en_core_web_sm") if spacy is not None else None
Contributor


high

Loading the Spacy model at the module level will cause an OSError if the en_core_web_sm model data is not present, even if the spacy package is installed. This prevents the application from starting in environments where these heavy assets are intentionally omitted. Consider loading the model lazily inside extract_keywords_and_vectors or wrapping this call in a try-except block to handle missing models gracefully.

Comment on lines +69 to +70
if plt is None or np is None or pd is None:
    return jsonify({'error': 'Plotting dependencies are not installed'}), 500
Contributor


medium

The profile route is a page-level route that normally returns an HTML response via render_template. Returning a JSON response here when dependencies are missing is inconsistent with the expected response type for a browser-initiated request. This will cause the user to see raw JSON text instead of a styled error page. Consider rendering a dedicated error template or using Flask's flash message system with a redirect to the main dashboard.
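A minimal sketch of the flash-and-redirect approach for a browser-facing route (route names, the `PLOTTING_AVAILABLE` stand-in, and the inline templates are illustrative assumptions, not the project's code):

```python
from flask import Flask, flash, redirect, render_template_string, url_for

app = Flask(__name__)
app.secret_key = "dev-only"  # flash() requires a session secret

PLOTTING_AVAILABLE = False  # stand-in for the plt/np/pd presence check

@app.route("/dashboard")
def dashboard():
    # A real template would render get_flashed_messages() as a styled banner.
    return render_template_string("dashboard")

@app.route("/profile")
def profile():
    if not PLOTTING_AVAILABLE:
        flash("Charts are unavailable: plotting dependencies are not installed.")
        return redirect(url_for("dashboard"))
    return render_template_string("profile")
```

The browser receives a normal HTML page with a human-readable notice rather than raw JSON.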

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request makes heavy machine learning and data science dependencies optional across the application, enabling it to run in lightweight environments. It also introduces several security improvements, such as enforcing a secure SECRET_KEY in production, limiting request body sizes via MAX_CONTENT_LENGTH, and implementing image validation for uploads. A security concern was identified in the image ingestion route, where failing to catch RuntimeError (specifically DecompressionBombError) could leave the server vulnerable to Denial of Service attacks.

        img.verify()
    with Image.open(image_path) as img:
        img.load()
except (UnidentifiedImageError, OSError, ValueError):
Contributor


security-high high

The current exception handling for image validation does not catch DecompressionBombError, which is a subclass of RuntimeError in modern Pillow versions. If a malicious user uploads an image designed to expand into a massive amount of memory (a "decompression bomb"), it could lead to a Denial of Service (DoS) by exhausting server memory. Additionally, since the os.remove call is inside the except block, the malicious file would remain on disk if this exception is raised.

Suggested change
except (UnidentifiedImageError, OSError, ValueError):
except (UnidentifiedImageError, OSError, ValueError, RuntimeError):

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several security and robustness improvements, primarily focusing on making heavy machine learning and data science dependencies optional. This allows the application to run in lightweight environments by wrapping imports for libraries like torch, transformers, spacy, and matplotlib in try-except blocks and adding runtime checks. Additionally, the PR implements request body size limits, enforces secure secret keys in production, and adds thorough image validation during ingestion. Feedback is provided regarding the new CHIME analysis function to ensure it robustly handles varying output formats from HuggingFace pipelines.

Comment on lines +104 to +105
results = classifier(text)
return max(results[0], key=lambda x: x["score"])
Contributor


high

The output format of HuggingFace pipelines can vary between [dict] and [[dict]] depending on the library version and task configuration. Additionally, calling max() on an empty list or a dictionary (if results[0] is a dict) will lead to a crash or incorrect results. It's safer to check the structure and use .get() for the score key to ensure robustness.

Suggested change
results = classifier(text)
return max(results[0], key=lambda x: x["score"])
results = classifier(text)
# Handle both [dict] and [[dict]] return formats from the pipeline
if results and isinstance(results[0], list):
    return max(results[0], key=lambda x: x.get("score", 0.0))
return max(results, key=lambda x: x.get("score", 0.0))

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the application's security and flexibility by implementing request size limits, enforcing secure production configurations, and making heavy machine learning dependencies optional for lightweight environments. Key improvements include image validation to prevent decompression bombs and the introduction of lazy loading for several models. Review feedback focuses on adopting timezone-aware UTC timestamps, strengthening image security checks, and addressing architectural redundancy in the CHIME classification logic to ensure compatibility with federated learning updates.

import uuid
import threading
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
Contributor


medium

Import timezone from the datetime module to support creating timezone-aware UTC timestamps. This is a best practice for backend services to ensure consistency across different server environments and time zones.

Suggested change
from datetime import datetime
from datetime import datetime, timezone

@@ -94,32 +114,48 @@ def upload_post():
timestamp = request.form.get('timestamp', datetime.now().isoformat())
Contributor


medium

Using naive datetime.now() creates a timestamp based on the server's local time, which can lead to inconsistencies. It is recommended to use timezone-aware UTC timestamps.

Suggested change
timestamp = request.form.get('timestamp', datetime.now().isoformat())
timestamp = request.form.get('timestamp', datetime.now(timezone.utc).isoformat())

Comment on lines +131 to +140
try:
    # Verify container/header first, then fully decode pixel data.
    with Image.open(image_path) as img:
        img.verify()
    with Image.open(image_path) as img:
        img.load()
except (UnidentifiedImageError, OSError, ValueError, RuntimeError):
    if os.path.exists(image_path):
        os.remove(image_path)
    return jsonify({'error': 'Uploaded file is not a valid image'}), 400
Contributor


security-medium medium

To protect against decompression bomb attacks (where a small, highly compressed image expands to consume massive amounts of memory), it is recommended to check the image dimensions (img.width and img.height) against a safety threshold before calling img.load(). While MAX_CONTENT_LENGTH limits the file size, it does not limit the uncompressed pixel data size.

Suggested change
try:
    # Verify container/header first, then fully decode pixel data.
    with Image.open(image_path) as img:
        img.verify()
    with Image.open(image_path) as img:
        img.load()
except (UnidentifiedImageError, OSError, ValueError, RuntimeError):
    if os.path.exists(image_path):
        os.remove(image_path)
    return jsonify({'error': 'Uploaded file is not a valid image'}), 400
try:
    # Verify container/header first, then fully decode pixel data.
    with Image.open(image_path) as img:
        img.verify()
    with Image.open(image_path) as img:
        # Security: Prevent decompression bombs by checking dimensions.
        if img.width * img.height > 100_000_000:
            raise ValueError("Image dimensions exceed safety limits")
        img.load()
except (UnidentifiedImageError, OSError, ValueError, RuntimeError):
    if os.path.exists(image_path):
        os.remove(image_path)
    return jsonify({'error': 'Uploaded file is not a valid image'}), 400

Comment on lines +79 to +93
def get_chime_classifier(self):
    if self._chime_classifier is None:
        try:
            if pipeline is None:
                raise RuntimeError("transformers is required for CHIME inference")
            self._chime_classifier = pipeline(
                "text-classification",
                model=HF_MODEL_ID,
                tokenizer=HF_MODEL_ID,
                return_all_scores=True,
            )
        except Exception as e:
            logger.error(f"Error loading CHIME model: {e}")
            return None
    return self._chime_classifier
Contributor


medium

This implementation of get_chime_classifier is largely redundant with the one in dreamsApp/core/extra/advanced_sentiment.py. Crucially, it lacks the logic to detect and load locally updated models from the models/production_chime_model directory. This means improvements learned via federated learning will not be utilized by this analyzer. Consider centralizing the CHIME classification logic.
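One way to centralize the shared piece of this logic is a small path-resolution helper both analyzers could call; this is a hypothetical sketch, not the project's actual module:

```python
import os

def resolve_chime_model_path(default_model_id: str, app_root: str) -> str:
    """Prefer a locally fine-tuned FL model directory over the remote default.
    Returns the local models/production_chime_model path when it exists,
    otherwise the given Hugging Face model id."""
    local = os.path.join(app_root, "models", "production_chime_model")
    return local if os.path.isdir(local) else default_model_id
```

With this in one place, both the core and the advanced analyzer pick up federated-learning updates consistently.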

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the application to make heavy machine learning dependencies optional, allowing for execution in lightweight environments. It also introduces security enhancements such as production-only secret key enforcement, request size limits, and image validation to prevent decompression bombs. Feedback provides a more robust fallback implementation for the softmax function to support multi-dimensional arrays.

Comment on lines +13 to +16
def softmax(x):
    x = np.asarray(x)
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
Contributor


medium

The fallback softmax implementation is correct for 1D arrays, which is how it is currently used. However, for broader compatibility with scipy.special.softmax, you might want to ensure it handles multi-dimensional inputs correctly if its usage expands.

Suggested change
def softmax(x):
    x = np.asarray(x)
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
def softmax(x):
    x = np.asarray(x)
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the application's security and modularity by implementing request size limits, production secret key enforcement, and file extension validation. It refactors the codebase to handle heavy machine learning dependencies optionally, allowing for execution in lightweight environments. Additionally, image ingestion is hardened with integrity checks and protection against decompression bombs. A review comment suggests optimizing the image validation logic by removing redundant method calls.

Comment on lines +132 to +139
# Verify container/header first, then fully decode pixel data.
with Image.open(image_path) as img:
    img.verify()
with Image.open(image_path) as img:
    # Security: Prevent decompression bombs by checking dimensions.
    if img.width * img.height > 100_000_000:
        raise ValueError("Image dimensions exceed safety limits")
    img.load()
Contributor


medium

The image validation logic can be simplified by combining the two Image.open calls. img.verify() is redundant when img.load() is called immediately after, as load() will also fail if the image is corrupt or invalid. Additionally, checking dimensions before load() is a good practice to prevent decompression bombs.

        # Security: Prevent decompression bombs and verify image integrity.
        with Image.open(image_path) as img:
            if img.width * img.height > 100_000_000:
                raise ValueError("Image dimensions exceed safety limits")
            # Fully decode pixel data to ensure the image is valid and not corrupt.
            img.load()

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the application's robustness and security by implementing request size limits, production environment checks for secret keys, and comprehensive image validation to prevent decompression bombs. A significant architectural change makes heavy machine learning dependencies optional through lazy loading and conditional checks, allowing the core application to function in restricted environments. Feedback was provided regarding the use of a deprecated parameter in the transformers library.

"text-classification",
model=model_path,
tokenizer=model_path,
return_all_scores=True,
Contributor


medium

The return_all_scores parameter is deprecated in the transformers library. It is recommended to use top_k=None instead to achieve the same result of returning all class scores for the classification task.

Suggested change
return_all_scores=True,
top_k=None,

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several security and robustness improvements, including request body size limits, production secret key validation, and enhanced image upload verification to prevent decompression bombs. A significant portion of the changes focuses on making heavy machine learning and data processing dependencies (like torch, transformers, and matplotlib) optional, allowing the application to run in lightweight environments through lazy loading and conditional checks. Feedback was provided regarding the efficiency of the federated learning worker, specifically suggesting that the application instance be reused to avoid redundant and memory-intensive model loading.

Comment on lines 97 to 103
def run_federated_round():
    if torch is None or AutoModelForSequenceClassification is None or AutoTokenizer is None or AutoConfig is None:
        logger.warning("Required ML dependencies are not installed; skipping federated round.")
        return

    app = create_app()
    with app.app_context():
Contributor


medium

Calling create_app() inside run_federated_round is inefficient when this function is triggered from an existing Flask application (e.g., via the dashboard). create_app() re-initializes the entire application, including loading heavy AI models in DreamsPipeline. This can lead to excessive memory usage as models are loaded multiple times in the background thread. Refactoring the function to accept an optional app instance allows reusing the existing application context and its resources.

Suggested change
def run_federated_round():
    if torch is None or AutoModelForSequenceClassification is None or AutoTokenizer is None or AutoConfig is None:
        logger.warning("Required ML dependencies are not installed; skipping federated round.")
        return
    app = create_app()
    with app.app_context():
def run_federated_round(app=None):
    if torch is None or AutoModelForSequenceClassification is None or AutoTokenizer is None or AutoConfig is None:
        logger.warning("Required ML dependencies are not installed; skipping federated round.")
        return
    if app is None:
        app = create_app()
    with app.app_context():

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the application's security and flexibility by introducing configuration for content length limits, production secret key enforcement, and image upload validation to prevent decompression bombs. A major architectural change involves wrapping heavy machine learning dependencies—such as torch, transformers, and spacy—in try-except blocks and implementing lazy loading to ensure the core application remains functional in environments where these libraries are absent. Additionally, the PR includes a compatibility shim for location extraction and improves test resilience by handling missing MongoDB instances. I have no feedback to provide.

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request improves application robustness and security by adding configuration for content limits and secure keys, validating uploaded images against decompression bombs, and making heavy ML dependencies optional through try-except blocks and runtime checks. It also implements lazy loading for the core pipeline and models. Feedback suggests centralizing duplicated CHIME classification logic and replacing hardcoded magic numbers with named constants.

Comment on lines +80 to +115
def get_chime_classifier(self):
    if self._chime_classifier is None:
        try:
            if pipeline is None:
                raise RuntimeError("transformers is required for CHIME inference")
            model_path = HF_MODEL_ID

            # Prefer locally learned FL model when available.
            try:
                from flask import has_app_context, current_app
                if has_app_context():
                    local_model_path = os.path.join(
                        current_app.root_path,
                        "models",
                        "production_chime_model",
                    )
                    if os.path.exists(local_model_path):
                        logger.info(
                            ">>> SELF-CORRECTION: Learned model found at %s. Loading...",
                            local_model_path,
                        )
                        model_path = local_model_path
            except RuntimeError:
                # No active Flask app context (e.g., tests/CLI)
                pass

            self._chime_classifier = pipeline(
                "text-classification",
                model=model_path,
                tokenizer=model_path,
                top_k=None,
            )
        except Exception as e:
            logger.error(f"Error loading CHIME model: {e}")
            return None
    return self._chime_classifier
Contributor


high

The CHIME classification logic, including model path resolution and pipeline initialization, is duplicated between SentimentAnalyzer in dreamsApp/core/sentiment.py and AdvancedSentimentAnalyzer in dreamsApp/core/extra/advanced_sentiment.py. This redundancy increases the maintenance burden and can lead to inconsistent behavior if one implementation is updated but the other is not. Consider centralizing this logic into a single shared component or utility module.

try:
    # Security: Prevent decompression bombs and verify image integrity.
    with Image.open(image_path) as img:
        if img.width * img.height > 100_000_000:
Contributor


medium

The image pixel limit (100,000,000) is hardcoded. It is recommended to use a named constant for such values to improve code clarity and ease of configuration.

        if img.width * img.height > 100_000_000:  # Use a constant like MAX_IMAGE_PIXELS

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several security and architectural improvements, primarily focused on making heavy machine learning dependencies optional to support lightweight environments. Key changes include the enforcement of MAX_CONTENT_LENGTH, production-only SECRET_KEY validation, and enhanced image upload security through decompression bomb protection and extension whitelisting. Additionally, the PR refactors CHIME classification logic and implements lazy loading for various models. Feedback identifies a critical thread-safety issue in the lazy initialization of NLP models within keywords.py and points out code duplication between the core and extra sentiment analysis modules that should be consolidated.

Comment on lines +16 to +46
_nlp = None
_keyword_model = None

model = SentenceTransformer("all-MiniLM-L6-v2")

def _get_nlp():
    global _nlp
    if spacy is None:
        return None
    if _nlp is None:
        try:
            _nlp = spacy.load("en_core_web_sm")
        except OSError as e:
            logger.warning("spaCy model en_core_web_sm is unavailable: %s", e)
            _nlp = None
        except Exception as e:
            logger.warning("Failed to initialize spaCy: %s", e)
            _nlp = None
    return _nlp


def _get_keyword_model():
    global _keyword_model
    if SentenceTransformer is None:
        return None
    if _keyword_model is None:
        try:
            _keyword_model = SentenceTransformer("all-MiniLM-L6-v2")
        except Exception as e:
            logger.warning("Failed to initialize keyword embedding model: %s", e)
            _keyword_model = None
    return _keyword_model
Contributor


high

The lazy initialization of the spacy model in _get_nlp() and the SentenceTransformer model in _get_keyword_model() is not thread-safe. In a multi-threaded environment (like a production Flask server), multiple threads could enter the if _nlp is None: or if _keyword_model is None: block simultaneously, leading to a race condition where the model might be loaded multiple times. This can cause unnecessary memory usage and potential errors.

To fix this, you should use a threading.Lock to protect the initialization block for each model, similar to the pattern used in dreamsApp/app/ingestion/routes.py for _get_location_model().

You will also need to add import threading at the top of the file.

_nlp = None
_nlp_lock = threading.Lock()
_keyword_model = None
_keyword_model_lock = threading.Lock()


def _get_nlp():
    global _nlp
    if spacy is None:
        return None
    with _nlp_lock:
        if _nlp is None:
            try:
                _nlp = spacy.load("en_core_web_sm")
            except OSError as e:
                logger.warning("spaCy model en_core_web_sm is unavailable: %s", e)
                _nlp = None
            except Exception as e:
                logger.warning("Failed to initialize spaCy: %s", e)
                _nlp = None
    return _nlp


def _get_keyword_model():
    global _keyword_model
    if SentenceTransformer is None:
        return None
    with _keyword_model_lock:
        if _keyword_model is None:
            try:
                _keyword_model = SentenceTransformer("all-MiniLM-L6-v2")
            except Exception as e:
                logger.warning("Failed to initialize keyword embedding model: %s", e)
                _keyword_model = None
    return _keyword_model

Comment on lines +80 to +102
def get_chime_classifier(self):
    self._chime_classifier = init_chime_classifier(
        self._chime_classifier,
        pipeline,
        HF_MODEL_ID,
        logger,
    )
    return self._chime_classifier

def analyze_chime(self, text: str):
    if text is None or not text.strip():
        return {"label": "Uncategorized", "score": 0.0}

    classifier = self.get_chime_classifier()
    if classifier is None:
        return {"label": "Uncategorized", "score": 0.0}

    try:
        results = classifier(text)
        return pick_top_chime_result(results)
    except Exception as e:
        logger.error(f"CHIME inference error: {e}")
        return {"label": "Uncategorized", "score": 0.0}
Contributor


medium

This PR introduces CHIME analysis logic (get_chime_classifier and analyze_chime) into SentimentAnalyzer. However, identical logic already exists in dreamsApp/core/extra/advanced_sentiment.py. This code duplication makes the codebase harder to maintain, as changes would need to be applied in two places.

It seems the intent is to promote CHIME analysis to a core feature. To resolve this, please remove the CHIME-related methods from AdvancedSentimentAnalyzer in dreamsApp/core/extra/advanced_sentiment.py and update any code that uses it (e.g., dreamsApp/app/ingestion/routes.py) to import from dreamsApp.core.sentiment instead.

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances security by enforcing secure configuration in production and implementing image validation to prevent decompression bombs. It also refactors the application to make heavy machine learning dependencies optional, improving portability. Review feedback identifies a potential TOCTOU vulnerability in the image upload process and suggests refining exception handling for optional imports to prevent masking underlying issues.

Comment on lines 122 to +142
    filename = secure_filename(image.filename)
    allowed = current_app.config.get('ALLOWED_EXTENSIONS', {'png', 'jpg', 'jpeg', 'gif'})
    if '.' not in filename or filename.rsplit('.', 1)[1].lower() not in allowed:
        return jsonify({'error': 'Unsupported file extension'}), 400

    unique_filename = f"{uuid.uuid4().hex}_{filename}"
    upload_path = current_app.config['UPLOAD_FOLDER']
    image_path = os.path.join(upload_path, unique_filename)
    image.save(image_path)

    # Delegate the heavy AI extraction sequence to the global pipeline
    try:
        # Security: Prevent decompression bombs and verify image integrity.
        with Image.open(image_path) as img:
            if img.width * img.height > MAX_IMAGE_PIXELS:
                raise ValueError("Image dimensions exceed safety limits")
            # Fully decode pixel data to ensure image is valid and not corrupt.
            img.load()
    except (UnidentifiedImageError, OSError, ValueError, RuntimeError):
        if os.path.exists(image_path):
            os.remove(image_path)
        return jsonify({'error': 'Uploaded file is not a valid image'}), 400
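The extension allow-list and uuid-prefixed naming above can be factored into a small stdlib-only helper; a hedged sketch (function and constant names are mine, not from the PR):

```python
import os
import uuid
from typing import Optional

ALLOWED_EXTENSIONS = {"png", "jpg", "jpeg", "gif"}


def safe_unique_name(filename: str) -> Optional[str]:
    """Return a collision-free storage name, or None if the extension is rejected."""
    if "." not in filename:
        return None
    ext = filename.rsplit(".", 1)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return None
    # The uuid4 prefix prevents one upload from silently overwriting another
    # that happened to use the same original filename.
    return f"{uuid.uuid4().hex}_{os.path.basename(filename)}"


print(safe_unique_name("photo.JPG") is not None)  # → True (case-insensitive match)
print(safe_unique_name("script.php"))             # → None (extension rejected)
print(safe_unique_name("noextension"))            # → None (no extension at all)
```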
Contributor

high

For enhanced security, it's better to validate the uploaded image before saving it to the disk. The current implementation saves the file first, then validates it, which creates a small window where a potentially malicious file exists on the server (a Time-of-check to time-of-use, or TOCTOU, vulnerability). You can validate the image directly from the in-memory stream provided by Flask's request.files. If the image is valid, you can then save it. This avoids writing unvalidated user content to the filesystem.

    filename = secure_filename(image.filename)
    allowed = current_app.config.get('ALLOWED_EXTENSIONS', {'png', 'jpg', 'jpeg', 'gif'})
    if '.' not in filename or filename.rsplit('.', 1)[1].lower() not in allowed:
        return jsonify({'error': 'Unsupported file extension'}), 400

    try:
        # Security: Prevent decompression bombs and verify image integrity from stream.
        with Image.open(image) as img:
            if img.width * img.height > MAX_IMAGE_PIXELS:
                raise ValueError("Image dimensions exceed safety limits")
            # Fully decode pixel data to ensure image is valid and not corrupt.
            img.load()

            # Image is valid, now save it.
            image.seek(0)  # Rewind stream after reading
            unique_filename = f"{uuid.uuid4().hex}_{filename}"
            upload_path = current_app.config['UPLOAD_FOLDER']
            image_path = os.path.join(upload_path, unique_filename)
            image.save(image_path)
    except (UnidentifiedImageError, OSError, ValueError, RuntimeError):
        return jsonify({'error': 'Uploaded file is not a valid image'}), 400


try:
    from dreamsApp.core.extra.llms import generate
except Exception:  # pragma: no cover - optional in lightweight environments
Contributor

medium

Catching a generic Exception is too broad and can mask other underlying issues during the import process. It's better to be more specific. For consistency with other optional imports in this file and across the project, you should catch ImportError. This ensures you're only handling cases where the module or its dependencies are missing, without accidentally suppressing other unexpected errors.

Suggested change
- except Exception:  # pragma: no cover - optional in lightweight environments
+ except ImportError:  # pragma: no cover - optional in lightweight environments
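The narrower clause matters because `except Exception` also swallows errors raised inside the imported module's own top-level code (a `NameError` there, say), silently disabling the feature instead of surfacing the bug, whereas `except ImportError` fires only when the dependency is genuinely absent. A minimal sketch of the pattern (`hypothetical_llm_backend` is a made-up module name, assumed not to be installed):

```python
try:
    # Optional heavy dependency: absent in lightweight environments.
    from hypothetical_llm_backend import generate
except ImportError:  # only "module missing", not arbitrary failures
    generate = None


def summarize(text: str) -> str:
    """Use the LLM backend when present, otherwise degrade gracefully."""
    if generate is None:
        return "(LLM summarization unavailable)"
    return generate(text)


print(summarize("hello"))  # → (LLM summarization unavailable)
```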

@kallal79
Author

kallal79 commented Apr 6, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant improvements to the application's robustness by making heavy AI and ML dependencies optional. It implements lazy loading and graceful error handling for dependencies like PyTorch, Transformers, spaCy, and HDBSCAN, allowing the application to run in lightweight environments. Additionally, it adds security enhancements, including request body size limits, file extension validation, and image integrity checks during uploads. The CHIME sentiment analysis logic has been refactored into a dedicated module to improve maintainability, and test suites have been updated to handle missing database connections gracefully. I have no feedback to provide as the changes are well-structured and improve system reliability.

@kallal79 kallal79 marked this pull request as ready for review April 6, 2026 09:58
@kallal79
Author

kallal79 commented Apr 7, 2026

Hi @pradeeban, please review PR #148.

Member

@ayusrjn ayusrjn left a comment

Thank you, @kallal79, for your contribution.
@pradeeban, I've tested the changes, and everything looks good to merge.

