@RobGeada (Contributor) commented Jan 22, 2025

Adds a framework for defining detections based on a text-embedding classifier. The default configuration uses MMLU as the training data and builds a multi-label text classifier that infers which of the 61 MMLU subjects a given body of text belongs to. The detector endpoint accepts the following arguments:

  • contents: List of texts to classify.
  • allowList: List of allowed subjects. Every inbound text must belong to at least one of these subjects to avoid triggering the detector.
  • blockList: List of blocked subjects. No inbound text may belong to any of these subjects, or the detector is triggered.
  • threshold: Maximum distance a body of text can be from a subject centroid and still be classified into that subject. The default is 0.75; a threshold above 10 classifies every document into every subject, so values in the range 0 < threshold < 1 are recommended.
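
To illustrate how these arguments interact, here is a minimal sketch, not the code in this PR. It assumes Euclidean distance to each subject centroid and uses toy 2-D vectors in place of real embeddings; the function names `classify` and `is_flagged` and the centroid values are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def classify(embedding: Sequence[float],
             centroids: Dict[str, Sequence[float]],
             threshold: float = 0.75) -> List[str]:
    """Return every subject whose centroid is within `threshold` of the embedding.

    Assumption: distance is Euclidean; the PR may use a different metric.
    """
    return [subject for subject, centroid in centroids.items()
            if math.dist(embedding, centroid) <= threshold]


def is_flagged(subjects: Sequence[str],
               allow_list: Optional[Sequence[str]] = None,
               block_list: Optional[Sequence[str]] = None) -> bool:
    """Apply the allowList / blockList rules to one text's predicted subjects."""
    # Flag if the text belongs to any blocked subject.
    if block_list and any(s in block_list for s in subjects):
        return True
    # Flag if an allow list is given and the text matches none of its subjects.
    if allow_list and not any(s in allow_list for s in subjects):
        return True
    return False


# Toy centroids (hypothetical values, 2-D for readability).
centroids = {
    "astronomy": [0.0, 0.0],
    "virology": [1.0, 0.0],
    "law": [5.0, 5.0],
}
embedding = [0.4, 0.0]

default_subjects = classify(embedding, centroids)        # default threshold 0.75
strict_subjects = classify(embedding, centroids, 0.5)    # tighter threshold
loose_subjects = classify(embedding, centroids, 10.0)    # matches every subject

print(default_subjects, strict_subjects, loose_subjects)
print(is_flagged(default_subjects, allow_list=["astronomy"]))
print(is_flagged(default_subjects, block_list=["virology"]))
```

Because the classifier is multi-label, a single text can land in several subjects at once, which is why a text can satisfy the allow list and still be flagged by the block list.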

@@ -0,0 +1,37 @@
# Embedding Classification Detector

# Setup
Collaborator

It might be useful to state that the local Python version must match the Python version in the Containerfile. At present, Python 3.9 is installed inside the container, which may warrant an upgrade.

Can this be moved to a shared utils folder so that other detectors that require training can use this class?


sys.path.insert(0, os.path.abspath(""))
# from common.scheme import TextDetectionHttpRequest, TextDetectionResponse
import os

Delete the duplicate import.

3 participants