- 12 years in analytics consulting, adjunct professor at Georgetown University
- KPMG -> Slalom DC Analytics Practice Leader -> Slalom Databricks Practice Leader
- Now a Solutions Engineer at Braintrust
- Saw that customers were great at building POCs but not so great at getting those POCs to production, where they could actually create value.
- You've done it. You have defied the odds and created a GenAI application that ended up in production.
- You've gathered requirements, tested with users, gotten through security review, signed a contract with a model provider, and picked the latest OpenAI model to future-proof your product!
- Problem: Anthropic just announced their latest model, and it is getting rave reviews. Rumor has it the new Gemini launches next week.
- Is it impossible to keep up with the industry? Why even try?
- Models are cattle, not pets
- It is not useful to become attached to any one model (or even one model provider) in such a fast-moving space
- Focus on new capabilities (multimodal, tool calling, thinking/reasoning)
- Minimize the blast radius of a model (or model provider) change
- [BT_TOUCHPOINT: AI Proxy]
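In practice, the touchpoint above means routing every call through one OpenAI-compatible endpoint so the model is just a string in config. A minimal sketch in Python, assuming the Braintrust AI Proxy's documented base URL; the model names are illustrative:

```python
import os

from openai import OpenAI

# Route every request through one OpenAI-compatible endpoint (here, the
# Braintrust AI Proxy) so the model is just a string in one place.
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["BRAINTRUST_API_KEY"],
)

# Swapping models (or providers) is a config change, not a refactor:
# MODEL = "gpt-4o"
MODEL = "claude-3-5-sonnet-latest"  # illustrative model names

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
print(response.choices[0].message.content)
```

Swapping champion for challenger, or provider for provider, becomes a one-line change instead of a refactor.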
- Do nothing
- Justification: "I really like the OpenAI ecosystem and gpt-4o is the perfect model for my application"
- Why this is incorrect: Models are tools to drive value. If there is a potentially better tool for the job, you owe it to your stakeholders to try it
- Just send it
- Justification: "There are tens of billions more parameters in the new model; that kind of power makes this move impossible to fail"
- Why this is incorrect: Models are tools to drive value. A new tool may not be a better tool for the job.
- Determine success criteria
- Evals
- Gather real world data (synthesize example inputs if necessary)
- Braintrust Datasets; potentially Loop for synthetic data
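A sketch of seeding that dataset, assuming the Braintrust Python SDK's `init_dataset`; the project, dataset, and example content are made up, and the synthesis step is only for filling coverage gaps:

```python
import os

from braintrust import init_dataset
from openai import OpenAI

# Project and dataset names are illustrative; assumes BRAINTRUST_API_KEY is set.
dataset = init_dataset(project="Ticket Summarizer", name="golden-set")

# Real traffic first: production examples anchor the eval in reality.
dataset.insert(
    input="Customer can't reset their password after yesterday's release ...",
    expected="Password reset issue",
    metadata={"source": "production"},
)

# Synthesize only where real data is thin, and label the rows as synthetic.
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["BRAINTRUST_API_KEY"],
)
ticket = client.chat.completions.create(
    model="gpt-4o",  # illustrative
    messages=[{
        "role": "user",
        "content": "Write one realistic, tricky support ticket about billing.",
    }],
).choices[0].message.content
dataset.insert(input=ticket, metadata={"source": "synthetic"})
```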
- Benchmark based on success criteria
- Braintrust Playground
- Compare champion and challenger
- Braintrust Experiments
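One way to wire the comparison, assuming the Braintrust `Eval` entry point, its `experiment_name` parameter, and an `autoevals` scorer; the scorer choice and all names are illustrative. One experiment per candidate over the same dataset keeps the results apples to apples:

```python
import os

from autoevals import Factuality
from braintrust import Eval, init_dataset
from openai import OpenAI

client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["BRAINTRUST_API_KEY"],
)

def make_task(model: str):
    # Same prompt, different model: the only variable under test.
    def task(input):
        return client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"Summarize this ticket: {input}"}],
        ).choices[0].message.content
    return task

# One experiment per candidate, same data, same scorers,
# so the champion/challenger comparison is direct.
for model in ["gpt-4o", "claude-3-5-sonnet-latest"]:
    Eval(
        "Ticket Summarizer",
        experiment_name=f"summarize-{model}",
        data=init_dataset(project="Ticket Summarizer", name="golden-set"),
        task=make_task(model),
        scores=[Factuality],
    )
```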
- Go/No Go
- Switch all traffic to the newest prompt version
- Send a portion of traffic to the old prompt version and the rest to the new one
- Give users the option to use the newest version of the prompt/model
- Use the new model/prompt version for opt-in users
- Allow users to upgrade their subscription
- Use the new model for certain tiers of users
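Whichever of the options above you pick, the routing decision itself can stay tiny and provider-agnostic. A sketch; the traffic percentage, tier names, and function are all illustrative:

```python
import hashlib

NEW_VERSION_TRAFFIC = 0.10                  # start small; illustrative number
NEW_VERSION_TIERS = {"beta", "enterprise"}  # illustrative tier names

def pick_version(user_id: str, tier: str, opted_in: bool) -> str:
    """Return 'champion' or 'challenger' for this request."""
    # Opt-in users and eligible tiers always get the new version.
    if opted_in or tier in NEW_VERSION_TIERS:
        return "challenger"
    # Deterministic hash split: each user lands in a stable bucket,
    # so their experience doesn't flip between versions per request.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < NEW_VERSION_TRAFFIC * 100 else "champion"

# e.g. pick_version("user-123", "free", opted_in=False) -> usually "champion"
```

Hashing on user ID rather than sampling per request keeps each user's experience stable and makes per-version score comparisons cleaner.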
- Online Scoring
- Log everything; score a sample of traffic
- Reuse the same evals you run offline, so online and offline scores are comparable
- Compare scores for different models/prompts if running multiple
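A sketch of that logging-plus-sampled-scoring loop, assuming `init_logger` from the Braintrust SDK; the sample rate and the stand-in scorer are illustrative. Production rows often lack an `expected` value, so reference-free checks (or an LLM judge) stand in for reference-based offline scorers:

```python
import random

from braintrust import init_logger

logger = init_logger(project="Ticket Summarizer")  # assumes BRAINTRUST_API_KEY is set
SCORE_SAMPLE_RATE = 0.10  # score ~10% of traffic; illustrative

def concise(output: str) -> float:
    # Stand-in for your real offline scorer; swap in the same eval
    # you benchmarked with wherever the required fields are available.
    return 1.0 if len(output) <= 400 else 0.0

def log_request(input: str, output: str, model: str) -> None:
    event = {
        "input": input,
        "output": output,
        "metadata": {"model": model},  # enables model-vs-model comparison later
    }
    if random.random() < SCORE_SAMPLE_RATE:
        event["scores"] = {"concise": concise(output)}
    logger.log(**event)
```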
- Alerting
- Altering prompts in production should be paired with defensive alerting
- Alert when eval scores or operational metrics (duration, cost) breach set thresholds
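The shape of that check, independent of tooling: watch a rolling window of logged scores and metrics and fire when an average crosses a threshold. The thresholds and the notification hook are illustrative:

```python
import statistics

# Illustrative guardrails; tune floors and ceilings to your own baselines.
THRESHOLDS = {
    "factuality": {"min": 0.80},  # eval score floor
    "duration_s": {"max": 10.0},  # latency ceiling
    "cost_usd": {"max": 0.02},    # per-request cost ceiling
}

def check_window(rows: list[dict]) -> list[str]:
    """Compare a rolling window of logged rows against the thresholds."""
    alerts = []
    for metric, limits in THRESHOLDS.items():
        values = [row[metric] for row in rows if metric in row]
        if not values:
            continue
        avg = statistics.mean(values)
        if "min" in limits and avg < limits["min"]:
            alerts.append(f"{metric} averaged {avg:.2f}, below {limits['min']}")
        if "max" in limits and avg > limits["max"]:
            alerts.append(f"{metric} averaged {avg:.2f}, above {limits['max']}")
    return alerts

# Wire the output to whatever pages you: Slack webhook, PagerDuty, email.
```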