tildahh/README.md

Building AI agents and the evaluation systems that make them reliable.

Currently building open-source agent benchmarks at UC Berkeley.

UC Berkeley MIDS · Amgen AI Innovation Lab · AgentBeats core contributor


Featured Work

| Project | What I Built |
| --- | --- |
| AgentBeats Evaluation Pipeline | First automated evaluation pipeline for CORE-Bench. Designed step-level metrics to replace binary scoring, improving agent accuracy from 51% to 63% on reproducing scientific paper results. |
| LLM Reasoning Efficiency Study | Quantified chain-of-thought verbosity vs accuracy tradeoffs. Found that additional reasoning tokens yield 2x the accuracy benefit on complex problems vs simple ones. |
| RAG Campus Assistant | Built custom Scrapy pipelines to transform unstructured campus data into a searchable knowledge base. Migrated to OpenAI Assistants API within days of its November 2023 beta release. |

Previously at Amgen AI Innovation Lab: designed evaluation frameworks for drug discovery agents using LLM judges, embedding metrics, and synthetic test generation. Increased test coverage by 59%.


CS @ CSUCI (magna cum laude) · CA

Pinned

  1. industrial-defect-detection

    End-to-end industrial defect detection pipeline. Achieved 93.6% accuracy using ResNet50, Mixed Precision (AMP), and custom hardware-optimized data loading.

    Python

  2. dgx_dashboard

    Forked from DanTup/dgx_dashboard

    A simple monitoring dashboard for DGX Spark - GPU and memory usage, temperature, and Docker management.

    Dart

  3. OronaDaniel/CSUCI_Companion

    AI-powered campus assistant using RAG + GPT-4 Turbo. Features custom Scrapy pipelines for real-time course catalog ingestion and natural language scheduling.

    Python

  4. food-demand-forecasting

    Supply chain demand forecasting for 77 distribution centers. Achieved 0.86 R² using Neural Networks, outperforming XGBoost baselines to optimize inventory.

    Jupyter Notebook