71 changes: 71 additions & 0 deletions artificial-intelligence/ai_model_engineer/Q_1/answers_DC.md
@@ -0,0 +1,71 @@
1. How can we improve face detection by adding **reidentification** to anonymize only specific individuals while keeping computational cost low?
In general, the computational cost of any ML model can be reduced by shrinking the model (e.g. fewer parameters/layers, quantization, etc.). However, this can reduce accuracy if not done carefully.

To maintain precision while reducing computational cost, we can run the face detection model on the very first frame, extract embeddings for the detected faces, and compare them against a "face gallery"; if the similarity is above a threshold, we anonymize that face. In the following frames, a face tracker (e.g. based on a Kalman filter) follows the faces through the video, and the face detection + reidentification models are re-run only every 5 to 15 frames (depending on how fast people move) or when the tracker confidence drops. This lets us re-center the tracked boxes if needed and detect new faces. The strategy trades computational cost against how quickly new faces are picked up and how far tracked boxes can drift. With a well-chosen interval, the number of detector calls drops roughly in proportion to the interval (e.g. about 10x fewer calls when re-detecting every 10 frames), while detection quality should stay close to that of per-frame detection.
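
As an illustration of the tracking step between detections, here is a minimal constant-velocity Kalman filter for a single face centroid, written with NumPy. It is only a sketch: the class name `CentroidKalman`, the state layout, and the noise values are assumptions chosen for the example, not tuned settings.

```python
import numpy as np


class CentroidKalman:
    """Constant-velocity Kalman filter for one face centroid (x, y)."""

    def __init__(self, x, y, dt=1.0, process_var=1.0, meas_var=4.0):
        # State: [x, y, vx, vy]; start at rest at the detected position.
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0  # state covariance (assumed initial value)
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)  # motion model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)  # only x, y are observed
        self.Q = np.eye(4) * process_var  # process noise (assumed)
        self.R = np.eye(2) * meas_var     # measurement noise (assumed)

    def predict(self):
        """Advance one frame without a detection; returns predicted (x, y)."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def correct(self, x, y):
        """Fuse a new detection; returns the corrected (x, y)."""
        z = np.array([x, y], dtype=float)
        innovation = z - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.state[:2]
```

In the scheme described above, `predict()` would be called on every frame and `correct()` only on the frames where the detector is re-run.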

Kalman Filter review


2. How can we extend an anonymization pipeline with depth-based filtering to blur only background objects efficiently?
Depth-based filtering can act as a "gating" mechanism that reduces the computational cost. If we detect a face but it is in the background (where, in general, faces are not recognizable), we can skip the anonymization process for that face and save processing power. Therefore, the depth estimation filter should be placed between the face detection and the reidentification steps.
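
A minimal sketch of such a gate, assuming a per-pixel depth map in metres is already available (many monocular depth models output relative depth instead, in which case the threshold would need a different meaning). The function names and the 5 m cut-off are hypothetical.

```python
import numpy as np


def is_foreground(depth_map, box, max_depth_m=5.0):
    """Return True if the median depth inside `box` is within max_depth_m.

    depth_map: 2D array of per-pixel depth in metres (assumed metric).
    box: (x1, y1, x2, y2) in pixel coordinates, x2/y2 exclusive.
    """
    x1, y1, x2, y2 = box
    patch = depth_map[y1:y2, x1:x2]
    if patch.size == 0:
        return False
    return float(np.median(patch)) <= max_depth_m


def gate_faces_by_depth(depth_map, boxes, max_depth_m=5.0):
    """Keep only the boxes close enough to be worth re-identifying."""
    return [b for b in boxes if is_foreground(depth_map, b, max_depth_m)]
```

The median is used rather than the mean so that a few background pixels inside the box do not dominate the estimate.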

3. Imagine we have this case: our face detection system does not detect properly (bad accuracy) dark-skins nor bald-people; how would you improve system accuracy?
In general, it is important to compare the evaluation and training datasets to see how well the training data represents the evaluation data. As a first step, I would check whether the training data contains enough examples of dark-skinned and bald people. If not, I would collect more data, add it to the training set, and retrain or fine-tune. If there are enough examples, the problem may lie elsewhere: for instance, the evaluation images of bald people may be taken from very different camera positions and under different illumination, while the training images of bald people all share the same camera position and illumination. This would suggest that the failure comes not from low representation but from low variety, which can be addressed with data augmentation techniques (e.g. random camera position, illumination, image rotation, flipping, etc.). This would also improve the model's robustness and trustworthiness in production.
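
Two of the augmentations mentioned above (flipping and illumination changes) can be sketched with plain NumPy as follows. The function name `augment` and the gain range are assumptions, and in a detection setting the bounding boxes would have to be flipped together with the image.

```python
import numpy as np


def augment(image, rng):
    """Return a randomly flipped and brightness-scaled copy of `image`.

    image: uint8 array of shape (H, W, 3).
    rng: a numpy.random.Generator, so the augmentation is reproducible.
    """
    out = image.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]         # horizontal flip
    gain = rng.uniform(0.6, 1.4)      # illumination change (assumed range)
    out = np.clip(out * gain, 0, 255)
    return out.astype(np.uint8)
```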

4. Imagine this other case: let's assume we have added several bald people imagery to our dataset and we still are not increasing the accuracy. What may be happening? How would you fix this?

This is closely related to the previous question, and I would follow the same steps, especially the data augmentation techniques. If this does not solve the problem, the issue may lie in the model architecture, so the strategy would be to increase the number of parameters or layers, or to try an alternative model. In general, this happens when the model is too simple for the task (also known as underfitting).

5. Which MLOPs infrastructure would you design to manage this system? Just provide a high-level design and which tools would you use.

I am not an expert in MLOps and the techniques commonly used in these environments, but I would design a system that includes the following components:

- Data ingestion: Collect and store data from various sources. It should keep track of the data used to train each model and version both data and models to ensure reproducibility.
- Data preprocessing: Clean and preprocess data for model training.
- Model training: Train models using various frameworks. It should allow training models in parallel, registering them, and keeping track of the training history. This step may be re-executed if the subsequent evaluation or monitoring suggests that the model is not performing as expected.
- Model evaluation: Evaluate model performance using various metrics (e.g., accuracy, precision, recall, F1-score, etc.).
- Model deployment: Deploy models to production. This is usually done as containerized applications with defined interfaces, using e.g. Kubernetes.
- Model monitoring: Monitor model performance in production. This can be done using Grafana (I use it in my research).

6. From end user POV, which techniques would you use to improve final solution user experience?
I would let the user choose the type of anonymization (e.g. blur, pixelation, or masking) and the level of detail (e.g. low, medium, or high), always complying with the EU AI Act. In this way, the user feels in control of both the trade-off mentioned in Question 1 and the way the face is anonymized.

I would also add some trustworthy AI features, in particular explainable AI techniques. For example, if the user wants to know why a certain face was anonymized, the system should provide a simple explanation (near the bounding box of the face, for example) of why that face was or was not anonymized (e.g., it was identified as a face but does not match any known identity, or it was discarded by the depth filter, etc.). I would also expose some robustness indicators, such as the percentage of faces anonymized or the processing latency. This way, the user understands the system's behavior and can notice system drift. Another idea is to allow the administrator to flag snapshots as incorrect and provide a reason, so that we can learn why the model is failing and retrain it. All these solutions would enhance the user experience and trust in AI, which is very important for the adoption of these systems in society.

7. (EXTRA) Provide a code snippet (practical example) on how would you implement any of these applications: reidentification or depth estimation, for face anonymization.

Here is pseudocode showing how I would implement the reidentification application for face anonymization (not in real time):

    INPUT: FrameStream (video) and TargetGallery (embeddings of the faces we want to anonymize)
    OUTPUT: AnonymizedFrameStream

    INITIALIZE:
        FaceDetector = LoadModel("YOLOv8")   # I found this model performs well on detection tasks; a face-trained checkpoint is assumed
        FaceEmbedder = LoadModel("FaceNet")
        ObjectTracker = Initialize("KalmanFilter")
        FRAME_INTERVAL = 15                  # Re-run detection every 15 frames
        THRESHOLD = 0.6                      # Similarity threshold (assumed value, to be tuned)
        TrackedFaces = []

    FOR each (frameIndex, Frame) in FrameStream:
        IF frameIndex % FRAME_INTERVAL == 0:
            # FULL DETECTION & IDENTIFICATION
            Faces = FaceDetector.Detect(Frame)
            ObjectTracker.Reset()            # Drop stale tracks before re-seeding
            TrackedFaces = []

            FOR each Face in Faces:
                # GET EMBEDDING
                Embedding = FaceEmbedder.ExtractFeatures(Face.Image)

                # CHECK IF THIS PERSON IS IN THE TargetGallery (best match over all gallery embeddings)
                similarity = CalculateCosineSimilarity(Embedding, TargetGallery)

                IF similarity > THRESHOLD:
                    Face.ShouldAnonymize = True
                    # Accumulate tracks instead of overwriting the list
                    TrackedFaces.append(ObjectTracker.StartTracking(Face))
                ELSE:
                    Face.ShouldAnonymize = False

        ELSE:
            # TRACKING. Just update existing boxes, don't run AI detection
            TrackedFaces = ObjectTracker.Update(Frame)

        # APPLY ANONYMIZATION
        FOR each Face in TrackedFaces:
            IF Face.ShouldAnonymize == True:
                Frame = ApplyGaussianBlur(Frame, Face.BoundingBox)
        END FOR

        # Emit every frame exactly once, whether or not anything was blurred
        AnonymizedFrameStream.append(Frame)
    END FOR
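
The gallery comparison step in the pseudocode (`CalculateCosineSimilarity`) could look roughly like this in Python with NumPy. This is a sketch under assumptions: the gallery is taken to be an `(N, D)` array of embeddings, the best match over all rows decides, and the function names and the 0.6 threshold are hypothetical.

```python
import numpy as np


def best_gallery_match(embedding, gallery):
    """Return (index, similarity) of the gallery row most similar to `embedding`.

    embedding: array of shape (D,).
    gallery: array of shape (N, D), one row per target identity.
    """
    e = embedding / np.linalg.norm(embedding)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ e                      # cosine similarity to every gallery row
    idx = int(np.argmax(sims))
    return idx, float(sims[idx])


def should_anonymize(embedding, gallery, threshold=0.6):
    """True if the best gallery match exceeds the (assumed) threshold."""
    _, sim = best_gallery_match(embedding, gallery)
    return sim > threshold
```

Normalizing both sides first turns the comparison into a single matrix-vector product, which stays cheap even for a gallery with many identities.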
33 changes: 33 additions & 0 deletions artificial-intelligence/ai_model_engineer/Q_2/answers_DC.md
@@ -0,0 +1,33 @@
1. The model shows strong performance on validation data but fails on new matches recorded in different clubs with different camera heights and angles. Hit detection accuracy drops significantly.
Explain the most likely root causes of this failure and describe how you would systematically investigate and isolate the problem.

This is probably a domain shift issue (similar to the one we had with bald people). In this specific case of pose estimation, the training dataset correctly represents the evaluation and validation datasets (because we obtain good results on them). However, in production the camera height, angle, and illumination conditions change, so the training dataset probably does not represent the production data, and the system performance drops. We can further diagnose the issue by inspecting the image properties of the new footage from these clubs. The problem may even be as simple as the ball color (e.g. in this club the ball is white instead of yellow). As a general solution, we can take the trained model and fine-tune it with newly collected data from the clubs, improving the model's robustness.

2. After reviewing some samples, you discover that different annotators labeled keypoints slightly differently (especially wrists and elbows), and hit timestamps are sometimes shifted by a few frames.
Analyze how inconsistent annotations affect both pose estimation and hit classification, how you would detect label-quality problems at scale and how you would decide whether to relabel, filter, or reweight parts of the dataset. Justify your decision criteria.

These dataset issues may affect model performance in two ways:
- If the labeled keypoints are not far from the "common sense" wrist and elbow positions of a player, these "errors" may introduce useful data variability, forcing the model to map an elbow to a body region instead of a single point and improving the model's generalization. A similar rationale applies to the hit timestamps.
- If the labeled keypoints are clearly wrong, they can hurt model performance, as they introduce training noise and a jitter effect in the pose detection. Similarly, for the hit timestamps, we may see that actions are not detected, especially if they are fast (e.g., a smash).

To detect these errors at scale, we should analyze the training data and look for outliers. For example, we can take all the poses labeled as a smash and: (i) compute the mean properties of the keypoints and (ii) identify the samples that are two or three sigmas away from the mean (probably mislabeled).
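
The sigma-based check above can be sketched as follows, assuming the keypoints of one action class are stored as a normalized `(N, K, 2)` array. The function name is hypothetical, and flagged samples are only candidates for manual review, since unusual but valid poses will also be flagged.

```python
import numpy as np


def flag_keypoint_outliers(keypoints, n_sigma=3.0):
    """Flag samples whose keypoints lie unusually far from the class mean.

    keypoints: array of shape (N, K, 2) with normalized (x, y) per keypoint,
               all samples sharing the same label (e.g. "smash").
    Returns a boolean array of shape (N,), True where any keypoint's distance
    to its mean position is more than n_sigma standard deviations above the
    average distance for that keypoint.
    """
    mean = keypoints.mean(axis=0)                      # (K, 2) mean pose
    dist = np.linalg.norm(keypoints - mean, axis=2)    # (N, K) distances
    z = (dist - dist.mean(axis=0)) / (dist.std(axis=0) + 1e-8)
    return (z > n_sigma).any(axis=1)
```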

3. Smashes are rare in the dataset but very important for the product. The model often misses them while performing well on common hits.
Without changing the model architecture, explain which dataset, training, and evaluation strategy changes you would consider. Here, the idea is to discuss tradeoffs and risks of each approach and how you would verify that improvements are real and not overfitting artifacts.

This issue is probably related to data imbalance. The loss function averages all the samples equally, so the model is barely penalized for missing a smash (because it is a rare event) and may not learn to detect them. We can try to balance the dataset by oversampling the smash class or by using a weighted loss function (giving more weight to the smash class). However, this can increase the false positive rate. Therefore, we should include samples where the player moves fast or performs actions similar to a smash, to make sure the model is really learning to detect smashes, is not overfitting, and is robust to similar actions. In evaluation, accuracy is not the best metric: others such as precision, recall, or F1-score may be better. The evaluation should be as close to reality as possible, so the evaluation set should not be augmented with extra smash events.
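
As a rough illustration of the weighted loss idea, the sketch below computes inverse-frequency class weights and a class-weighted cross-entropy with NumPy. The weighting formula is one common choice and is an assumption here, as are the function names; a real training loop would use the equivalent option of its deep learning framework.

```python
import numpy as np


def inverse_frequency_weights(labels, n_classes):
    """Weight each class by N / (n_classes * count), so rare classes weigh more."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts[counts == 0] = 1.0  # avoid division by zero for absent classes
    return len(labels) / (n_classes * counts)


def weighted_cross_entropy(probs, labels, class_weights):
    """Cross-entropy where each sample is scaled by the weight of its class.

    probs: (N, C) predicted class probabilities.
    labels: (N,) integer class indices.
    """
    picked = probs[np.arange(len(labels)), labels]
    per_sample = -np.log(np.clip(picked, 1e-12, 1.0))
    w = class_weights[labels]
    return float((w * per_sample).sum() / w.sum())
```

With 95 common hits and 5 smashes, the smash class gets a weight of 10 versus about 0.53 for the common class, so a model that always predicts "common hit" is penalized much more than under the unweighted loss.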

4. Two training runs with the same configuration produce noticeably different results. One model is stable but less accurate; another is more accurate but produces jittery keypoints and inconsistent hit timing.
Provide a reasoning to decide which model is actually better for production. What additional tests, metrics, and scenario-based evaluations would you design to support the decision?

5. Which model would you propose for this application? Why?

I would choose the stable model, because:
- The perceived quality of the model is sometimes more important than "perfect" accuracy. A stable model, even if slightly less accurate, feels solid and reliable.
- The jittery keypoints and inconsistent hit timing of the other model may lead to poor performance in other high-level, dependent applications (e.g., speed calculation, trajectory prediction, etc.).

The additional metrics and tests I would design are:
- A metric that counts how many times a keypoint moves beyond what is biologically possible (e.g., a wrist moving 30 cm in 0.03 seconds).
- A test where both models are compared on a video of a person standing perfectly still.
- A test where both models are compared on a video of a fast-moving player (e.g. a person performing a smash).
- A test with occlusions (the jittery model may fail here).
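
The first metric in the list could be sketched like this, assuming keypoint positions are available in metres (which requires some camera calibration). The function name is hypothetical and the 10 m/s limit is simply the 30 cm in 0.03 s example expressed as a speed, not a validated biomechanical bound.

```python
import numpy as np


def count_implausible_jumps(track, fps, max_speed_m_s=10.0):
    """Count frame-to-frame keypoint moves faster than max_speed_m_s.

    track: array of shape (T, 2) or (T, 3), keypoint position in metres.
    fps: frames per second of the video.
    """
    step = np.linalg.norm(np.diff(track, axis=0), axis=1)  # metres per frame
    speed = step * fps                                     # metres per second
    return int((speed > max_speed_m_s).sum())
```

Running this on both models over the same clips gives a single comparable number per keypoint, which is easier to track over time than a visual impression of jitter.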