I am an ML Engineer with 3 years of experience, specializing in making AI models lightweight, fast, and efficient. My primary focus is Efficient AI, particularly Quantization. I optimize various models (Vision, Audio, LLM) for mobile, GPU, and NPU platforms.
06/2022 - Present
- Optimizing models for target hardware & platforms
- Improving performance-speed trade-offs through post-training quantization (PTQ) and quantization-aware training (QAT); see the PTQ sketch below
- Conducted benchmarking of vLLM and TensorRT-LLM serving
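A minimal sketch of the fake-quantization idea underlying PTQ, assuming symmetric per-tensor int8 quantization; the function and tensors below are illustrative and not tied to any specific project:

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric per-tensor uniform quantization (quantize, then dequantize)."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    scale = x.abs().max() / qmax                        # per-tensor scale from simple max calibration
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                                    # back to float so the quantization error is visible

w = torch.randn(64, 64)
w_q = fake_quantize(w)
print(f"max quantization error: {(w - w_q).abs().max().item():.6f}")
```

QAT inserts the same quantize-dequantize simulation into the training graph so the weights adapt to it, typically using a straight-through estimator for the rounding gradient.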
07/2021 - 08/2021
- Built AWS 3-tier web service using Terraform
08/2023 - Present
[Website] [GitHub] [OwLite Examples]
- Developed a framework for easy model quantization from PyTorch to TensorRT (see the pipeline sketch after this list)
- Implemented various quantization algorithms and simulations
- Produced various examples and identified optimization patterns
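An illustrative sketch of the kind of PyTorch-to-TensorRT flow such a framework automates; the toy model, file names, and the trtexec invocation are assumptions for illustration, not the OwLite API:

```python
import torch
import torch.nn as nn

# A toy model standing in for the user's network (illustrative only).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).eval()

# Export the traced graph to ONNX, the usual handoff point to TensorRT.
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["logits"], opset_version=17)

# The ONNX graph can then be compiled into an INT8 TensorRT engine, e.g.:
#   trtexec --onnx=model.onnx --int8 --saveEngine=model.engine
```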
02/2024 - 06/2024
[Website]
- Conducted comprehensive performance benchmarking of LLM serving frameworks (see the throughput sketch after this list)
- Implemented evaluation module
- Wrote the blog post [vLLM vs TensorRT-LLM] on weight-activation quantization
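A minimal sketch of an offline throughput measurement with vLLM, assuming its offline LLM/SamplingParams interface; the model name, prompts, and batch size are placeholders rather than the benchmark's actual configuration:

```python
import time
from vllm import LLM, SamplingParams

# Placeholder model and prompts; the real benchmark covered multiple serving frameworks.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["Explain weight-activation quantization in one sentence."] * 64

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests to get tokens/s.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated / elapsed:.1f} generated tokens/s")
```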
02/2024 - 06/2024
RepTor: Re-parameterizable Temporal Convolution for Keyword Spotting via Differentiable Kernel Search
- Presented poster at Interspeech 2024
- Developed CNN-based KWS model using structural re-parameterization (see the sketch after this list)
- Implemented Latency-aware Neural Architecture Search
- Achieved 97.9% accuracy with 183μs latency on Galaxy S10 CPU
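A minimal illustration of structural re-parameterization (not the RepTor implementation): parallel temporal convolution branches used during training are folded into a single convolution for inference, which preserves the output exactly while reducing latency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Training-time branches: a 3-tap and a 1-tap temporal (1D) convolution in parallel.
conv3 = nn.Conv1d(8, 8, kernel_size=3, padding=1)
conv1 = nn.Conv1d(8, 8, kernel_size=1)

x = torch.randn(1, 8, 50)                           # (batch, channels, time)
y_train = conv3(x) + conv1(x)                       # multi-branch forward used during training

# Inference-time fusion: zero-pad the 1-tap kernel to width 3 and merge weights and biases.
fused_w = conv3.weight + F.pad(conv1.weight, (1, 1))
fused_b = conv3.bias + conv1.bias
y_infer = F.conv1d(x, fused_w, fused_b, padding=1)  # a single conv, mathematically equivalent

print(torch.allclose(y_train, y_infer, atol=1e-5))  # True: same output, fewer branches at inference
```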
- Bachelor's in IT Convergence Engineering
- 03/2016 - 09/2023
- 03/2014 - 02/2016