Files + Infinite content in Gradio UI #148

Open. xdevfaheem wants to merge 10 commits into base: main

Conversation

xdevfaheem

Added support for file input (PDF, XLSX, DOCX, etc.) and for content of unlimited length (via chunked generation and accumulation).

@FurkanGozukara

nice

@darkacorn
Contributor

darkacorn commented Feb 22, 2025

Chunking by word count is suboptimal .. it has to be by sentence separators, aka !/ , . ; or by line, whichever is longer and still fits, to stabilize generation.

If we do batching / chunking .. let's do it right from the get-go.

@Ph0rk0z

Ph0rk0z commented Feb 22, 2025

This chunking was better: #101

@xdevfaheem
Author

xdevfaheem commented Feb 22, 2025

> Chunking by word count is suboptimal .. it has to be by sentence separators, aka !/ , . ; or by line, whichever is longer and still fits, to stabilize generation.

IMHO, the current approach already respects sentence boundaries. We split the parsed text into sentences using spaCy's LM, and the chunking logic keeps each chunk within max_word_limit (going above ~50 words will likely produce inconsistent outputs) while maintaining sentence integrity.

I’d be happy to refine the approach further. Let me know what you think!
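
A minimal sketch of the sentence-preserving chunking described in this comment, not the PR's actual code. It assumes a regex splitter as a stand-in for spaCy's sentence segmentation, and the names `split_sentences` and `chunk_sentences` are hypothetical. Only `max_word_limit` and the ~50-word figure come from the comment.

```python
import re


def split_sentences(text):
    # Regex stand-in for a real sentence segmenter (the PR uses spaCy).
    # Splits on whitespace that follows ., ! or ?.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, max_word_limit=50):
    # Greedily pack whole sentences into chunks of at most max_word_limit words.
    # A single sentence longer than the limit becomes its own oversized chunk
    # rather than being cut mid-sentence.
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_word_limit:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a small limit, `chunk_sentences(split_sentences(text), max_word_limit=5)` keeps every sentence intact and only groups sentences that fit together.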

@xdevfaheem
Author

I believe that, rather than cutting sentences off in the middle (word-level chunks), chunks made of complete, reasonably sized sentences are a better fit for chunked generation and yield better, more consistent outputs (with a smoother transition between sentences via sub-second silence).
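
One way the accumulation with sub-second silence could look, as a rough sketch only. It assumes each generated chunk is a plain list of float samples; the function name `join_with_silence` and the 0.25 s default gap are illustrative assumptions, not values from the PR.

```python
def join_with_silence(chunks, sample_rate, gap_seconds=0.25):
    # Concatenate audio chunks, inserting a short run of zeros (silence)
    # between consecutive chunks. gap_seconds is an assumed value.
    gap = [0.0] * round(sample_rate * gap_seconds)
    out = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            out.extend(gap)
        out.extend(chunk)
    return out
```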

@darkacorn
Contributor

The best way to get a smooth transition with similar-ish vocalisation is to feed part of the last gen into the next one as a prefix .. along with the transcription of that prefix, and cut it out afterwards.

(source - zonos devs)

@xdevfaheem
Author

xdevfaheem commented Feb 22, 2025

> The best way to get a smooth transition with similar-ish vocalisation is to feed part of the last gen into the next one as a prefix .. along with the transcription of that prefix, and cut it out afterwards.
>
> (source - zonos devs)

Oh, that makes perfect sense and sounds practical as well.

@xdevfaheem
Author

I started trying to implement that. Boy, was I wrong! It's not easy, and I didn't quite understand what you mean by transcription prefix. Do you mean the samples to trim should be based on the text prefix length? Is that why you mentioned it?

Also, we need a way to regenerate a single chunk.

@darkacorn
Contributor

Say the last 1-2 words of the first generation go in as the start prefix audio for gen 2 -> and also their text .. you just prefix the gen 2 text with that.

After that, the last 1-2 words of gen 2 go into gen 3 .. and so on and so forth.

You will need an ASR with word timestamps to know what to cut out.

  • We're hanging out in Discord if you want to brainstorm - the link is in the README.
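
The chaining step described above could be sketched roughly as follows. This is an illustration under assumptions, not the Zonos API: word timestamps are taken to be `(word, start_sec, end_sec)` tuples from some ASR pass, audio is a plain list of samples, and the names `last_words_prefix` and `prefixed_text` are hypothetical.

```python
def last_words_prefix(word_timestamps, samples, sample_rate, n_words=2):
    # word_timestamps: list of (word, start_sec, end_sec) tuples, assumed to
    # come from an ASR pass over the previous generation.
    # Returns the text of the last n_words and the audio from the start of
    # the first of those words to the end of the clip.
    if n_words <= 0 or not word_timestamps:
        return "", []
    tail = word_timestamps[-n_words:]
    prefix_text = " ".join(word for word, _, _ in tail)
    start_index = round(tail[0][1] * sample_rate)
    return prefix_text, samples[start_index:]


def prefixed_text(prefix_text, next_chunk_text):
    # The next generation gets the prefix words prepended to its own text,
    # so the text matches the prefix audio fed alongside it.
    return f"{prefix_text} {next_chunk_text}".strip()
```

The returned audio slice would be passed as the prefix audio for the next generation, and the returned text prepended to the next chunk, so both describe the same 1-2 words.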

@InconsolableCellist

I tried doing chunking using the last part of a sentence as the prefix for the next but got really weird results. I discarded that code but maybe it can be made to work

@xdevfaheem
Author

> Say the last 1-2 words of the first generation go in as the start prefix audio for gen 2 -> and also their text .. you just prefix the gen 2 text with that.
>
> After that, the last 1-2 words of gen 2 go into gen 3 .. and so on and so forth.
>
> You will need an ASR with word timestamps to know what to cut out.
>
>   • We're hanging out in Discord if you want to brainstorm - the link is in the README.

Got it, I'm in!

@xdevfaheem
Author

> I tried doing chunking using the last part of a sentence as the prefix for the next but got really weird results. I discarded that code but maybe it can be made to work

Same here. I tried slicing the last 3 seconds of gen 1 to feed into gen 2 as prefix audio, and trimmed it from the generated codes (as well as from the post-processed wav output). The output is not pleasant: the first sentence is fine, but the second audio chunk starts from the second part of the second sentence, and so on. After the first audio chunk, every chunk starts mid-sentence. There was murmuring too... Disaster...

@darkacorn
Contributor

darkacorn commented Feb 23, 2025

Did you prefix the text too, with the part you fed in as prefix audio? Ping me on Discord when you're there .. I'm the mrdragonfox dude.

@xdevfaheem
Author

xdevfaheem commented Feb 23, 2025

> Did you prefix the text too, with the part you fed in as prefix audio? Ping me on Discord when you're there .. I'm the mrdragonfox dude.

I'm there, Faheem⚡

I used an ASR for word timestamps, to pick a few words from the previous text chunk and the matching waveform to generate the prefix codes (and cut that out later). So, probably a smoother and more consistent transition with similar-ish vocalisation.
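
The "cut that out later" step might look something like this sketch. It assumes the generated audio begins with the prefix words and that a second ASR pass over the new generation yields `(word, start_sec, end_sec)` tuples. The function name `trim_prefix` is hypothetical, and cutting at the end of the last prefix word is one possible choice of cut point.

```python
def trim_prefix(generated, gen_word_timestamps, n_prefix_words, sample_rate):
    # generated: list of samples for the new generation, assumed to start
    # with the audio for the prefix words.
    # gen_word_timestamps: (word, start_sec, end_sec) tuples from an ASR pass
    # over `generated`.
    if n_prefix_words <= 0:
        return generated
    if len(gen_word_timestamps) < n_prefix_words:
        raise ValueError("ASR found fewer words than the prefix length")
    # Cut at the end of the last prefix word.
    cut_sec = gen_word_timestamps[n_prefix_words - 1][2]
    return generated[round(cut_sec * sample_rate):]
```

Cutting on a word boundary reported by the ASR, instead of at a fixed number of seconds, keeps the cut from landing in the middle of a word.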