Ayaan Sharif

Ayaan-Sharif

AI & ML interests

NLP, LLM, TEXT, Languages

Recent Activity

replied to sanchit-gandhi's post 10 days ago
Why does returning timestamps help Whisper reduce hallucinations? 🧐

Empirically, most practitioners have found that setting `return_timestamps=True` helps reduce hallucinations, particularly when doing long-form evaluation with Transformers' "chunked" algorithm. But why does this work?

My interpretation is that forcing the model to predict timestamps is contradictory to hallucinations. Suppose you have the transcription:

```markdown
The cat sat on the on the on the mat.
```

where we have a repeated hallucination for "on the". If we ask the model to predict timestamps, then the "on the" has to contribute to the overall segment-level timing, e.g.:

```markdown
<|0.00|> The cat sat on the on the on the mat.<|5.02|>
```

However, it's impossible to fit three copies of "on the" within the time allocation given to the segment, so the probability of this hallucinatory sequence becomes lower, and the model actually predicts the correct transcription with the highest probability:

```markdown
<|0.00|> The cat sat on the mat.<|5.02|>
```

In this sense, the end timestamp is the opposite of the initial timestamp constraint described in Section 4.5 of the paper https://huggingface.co/papers/2212.04356 → it helps the model remove extra words at the end of the sequence (whereas the initial timestamp helps when the model ignores words at the start), but the overall principle is the same: using timestamps to improve the probability of more realistic sequences.

Leaving it open to you: why do you think timestamps reduce Whisper hallucinations?
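For reference, a minimal sketch of how this is typically enabled with the Transformers `automatic-speech-recognition` pipeline, assuming the `openai/whisper-large-v3` checkpoint and a placeholder audio file `audio.mp3`:

```python
# Minimal sketch: chunked long-form transcription with timestamps enabled.
# "openai/whisper-large-v3" and "audio.mp3" are placeholder choices, not values from the post.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,  # activates the "chunked" long-form algorithm mentioned above
)

# return_timestamps=True makes the model predict segment-level timestamp tokens,
# which empirically suppresses repeated/hallucinated phrases.
result = asr("audio.mp3", return_timestamps=True)

print(result["text"])    # full transcription
print(result["chunks"])  # list of {"timestamp": (start, end), "text": ...} segments
```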

Organizations

None yet

Ayaan-Sharif's activity

replied to sanchit-gandhi's post 10 days ago

What if we segment the audio first and then transcribe? It's some extra compute to throw in, but IMO it would result in better output!
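A rough sketch of this segment-then-transcribe idea, assuming pydub (with ffmpeg) for silence-based splitting; the file name `audio.mp3` and the silence thresholds are placeholders, not values from the thread:

```python
# Hedged sketch: split the audio on pauses first, then transcribe each segment separately.
from pydub import AudioSegment
from pydub.silence import split_on_silence
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

audio = AudioSegment.from_file("audio.mp3")

# Split on silence so each segment is (mostly) continuous speech.
segments = split_on_silence(
    audio,
    min_silence_len=700,              # ms of silence that counts as a break (placeholder)
    silence_thresh=audio.dBFS - 16,   # threshold relative to the clip's loudness (placeholder)
    keep_silence=200,                 # keep a little padding around each segment
)

# Transcribe each segment independently and stitch the results together.
texts = []
for i, seg in enumerate(segments):
    path = f"segment_{i}.wav"
    seg.export(path, format="wav")
    texts.append(asr(path)["text"].strip())

print(" ".join(texts))
```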

liked a Space 18 days ago
liked a Space 23 days ago
reacted to vladbogo's post with 👍 29 days ago
Panda-70M is a new large-scale video dataset comprising 70 million high-quality video clips, each paired with textual captions, designed to be used as pre-training for video understanding tasks.

Key Points:
* Automatic Caption Generation: Utilizes an automatic pipeline with multiple cross-modality teacher models to generate captions for video clips.
* Fine-tuned Caption Selection: Employs a fine-tuned retrieval model to select the most appropriate caption from multiple candidates for each video clip (a sketch of this selection step follows the list).
* Improved Performance: Pre-training on Panda-70M shows significant performance gains in video captioning, text-video retrieval, and text-driven video generation.
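As a rough illustration of what such a caption-selection step can look like: this is not the authors' fine-tuned retrieval model; it assumes a stock `openai/clip-vit-base-patch32` checkpoint and placeholder frames and captions, purely for the sketch.

```python
# Hedged sketch: score candidate captions against sampled video frames and keep the best one.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_caption(frames: list[Image.Image], candidates: list[str]) -> str:
    """Return the candidate caption with the highest mean image-text similarity."""
    inputs = processor(text=candidates, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_frames, num_candidates); average over sampled frames.
    scores = outputs.logits_per_image.mean(dim=0)
    return candidates[int(scores.argmax())]

# Example usage with placeholder frame files and captions:
# frames = [Image.open(f"frame_{i}.jpg") for i in range(8)]
# print(select_caption(frames, ["a panda eating bamboo", "a man riding a bike"]))
```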

Paper: Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (2402.19479)
Project page: https://snap-research.github.io/Panda-70M/
Code: https://github.com/snap-research/Panda-70M

Congrats to the authors @tschen, @aliaksandr-siarohin et al. for their work!
New activity in tencent/HunyuanVideo about 1 month ago

Multi-GPU setup when?

#5 opened about 1 month ago by Ayaan-Sharif