Summary
Replace ad-hoc / hard-coded token counting with a more correct, model-aware tokenization pipeline.
Problem
Current token counting is hard-coded and not aligned with how real models (e.g., those loaded via Hugging Face transformers) tokenize chat messages. This can produce inaccurate token counts and mislead users doing data budgeting or sanity checks.
Proposed Solution
- Design an abstraction that approximates tokenizer.apply_chat_template behavior without pulling the entire transformers stack into the extension. Options:
  - Lightweight re-implementation for key templates (OpenAI-style, Llama, etc.).
  - Pluggable tokenization strategies where users select a model format and we apply the corresponding template before counting.
- Continue using a fast token counting backend (e.g., tiktoken-like logic) but ensure the pre-serialization to text matches the model's expected template; a sketch combining this with pluggable templates follows this list.
- Make tokenizer selection explicit in the UI (e.g., dropdown for “OpenAI”, “Llama”, etc.).
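
A minimal sketch of the pluggable approach, assuming tiktoken as the fast counting backend. All names here (ChatTemplate, count_chat_tokens, the template strings) are hypothetical, and the templates only roughly approximate tokenizer.apply_chat_template; each would need validation against the real model's tokenizer:

```python
# Sketch of a pluggable tokenization strategy (names are illustrative only).
from dataclasses import dataclass
from typing import Callable

import tiktoken  # assumed fast counting backend


@dataclass
class ChatTemplate:
    """Serializes a chat message list the way one model family expects."""
    name: str
    serialize: Callable[[list[dict]], str]


def openai_style(messages: list[dict]) -> str:
    # Rough stand-in for the ChatML-style framing used by OpenAI chat models.
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def llama_style(messages: list[dict]) -> str:
    # Rough stand-in for Llama-2 [INST] framing; the real template also
    # handles system prompts and BOS/EOS placement.
    return "".join(
        f"[INST] {m['content']} [/INST]" if m["role"] == "user"
        else f" {m['content']} "
        for m in messages
    )


TEMPLATES = {
    "openai": ChatTemplate("openai", openai_style),
    "llama": ChatTemplate("llama", llama_style),
}


def count_chat_tokens(messages: list[dict], scheme: str) -> int:
    """Serialize with the selected template, then count with a fast BPE."""
    text = TEMPLATES[scheme].serialize(messages)
    enc = tiktoken.get_encoding("cl100k_base")  # counting backend, not model-exact
    # disallowed_special=() lets marker strings pass through as plain text.
    return len(enc.encode(text, disallowed_special=()))
```

Keeping templates as data rather than branching logic makes it straightforward to back the UI dropdown with the keys of TEMPLATES and to add new formats later.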
Acceptance Criteria
- For representative chat examples, token counts match (within a small tolerance) what the official tokenizer tooling reports for the following (an example check is sketched after this list):
  - At least one OpenAI-style model.
  - One Llama-style model.
- UI clearly shows which tokenization scheme is being used.
- No noticeable performance regression compared to the current implementation.
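
One way to phrase the first criterion as an automated check, assuming transformers is available in the test environment only (not shipped with the extension) and using the gated meta-llama/Llama-2-7b-chat-hf tokenizer as the reference; llama_style is the hypothetical template from the sketch above:

```python
# Hypothetical acceptance test: encode our approximate serialization with the
# model's own tokenizer and compare against apply_chat_template's token count.
from transformers import AutoTokenizer

MESSAGES = [
    {"role": "user", "content": "How many tokens is this?"},
    {"role": "assistant", "content": "Let's count and see."},
]


def test_llama_template_within_tolerance(tolerance: int = 5) -> None:
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    reference = len(tok.apply_chat_template(MESSAGES, tokenize=True))
    approximate = len(tok.encode(llama_style(MESSAGES)))
    assert abs(approximate - reference) <= tolerance, (approximate, reference)
```

Running both counts through the same tokenizer isolates template drift from vocabulary mismatch; a separate check against tiktoken output would cover the OpenAI-style path.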