Files + Infinite content in Gradio UI #148
Since the progress bar covers the streamed audio box (preventing access), it feels better to show the generated audio all at once at the end.
nice
Chunking by word count is suboptimal. It should be by sentence separation, aka !/ , . ; or by line, whichever is longer and still fits, to stabilize generation.
If we do batching / chunking, let's do that right from the get-go.
This chunking was better: #101
IMHO, the current approach already respects sentence boundaries. Here we split the parsed text into sensible sentences using spaCy's language model, and the chunking logic keeps each chunk within max_word_limit (going above ~50 will likely produce inconsistent outputs) while maintaining sentence integrity. I'd be happy to refine the approach further. Let me know what you think!
I believe that, rather than having sentences cut off midway (word-level chunking), chunks made of complete, optimally sized sentences are a better fit for chunked generation and yield better, more consistent outputs (smoother transitions between sentences via sub-second silence).
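For reference, the sentence-preserving chunking described in the two comments above could look roughly like the sketch below. It is not the PR's code: it swaps spaCy for a naive regex splitter so it stays dependency-free, and the names `split_sentences`, `chunk_sentences`, and `max_words` are made up for illustration.

```python
import re


def split_sentences(text):
    """Naive stand-in for spaCy: break after ., !, ? or ; followed by whitespace."""
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, max_words=50):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact as its own chunk
    rather than being cut mid-sentence.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush the running chunk if adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk can then be passed to the generator on its own. The limit of 50 mirrors the ~50 word figure mentioned above and is only a default.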
The best way to get a smooth transition with similar-ish vocalisation is to feed part of the last generation into the next one as a prefix, together with the transcription of that prefix, and cut that out afterwards (source: Zonos devs).
Oh, that makes perfect sense and sounds practical as well.
I started trying to implement that. Boy, was I wrong! It's not easy, and I didn't quite understand what you mean by transcription prefix. Do you mean the samples to trim are based on the text prefix length? Is this why you mentioned it?
Say the last 1-2 words of the first generation go into gen 2 as the start prefix audio, along with their text: you just prefix gen 2's text with it. After that, the last 1-2 words of gen 2 go into gen 3, and so on and so forth. You will need an ASR with word timestamps to know what to cut out.
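A rough sketch of the bookkeeping this implies, assuming the ASR returns `(word, start_sec, end_sec)` tuples and that the generated audio for chunk k+1 begins with the carried-over words. Whether the prefix actually shows up in the output depends on the model. Both function names are hypothetical, and no real TTS or ASR API is called here.

```python
def carry_over_text(prev_text, next_text, n_words=2):
    """Prepend the last n_words of prev_text to next_text.

    Returns (prefix_words, prefixed_text).
    """
    prefix_words = prev_text.split()[-n_words:] if n_words > 0 else []
    return prefix_words, " ".join(prefix_words + [next_text])


def cut_sample(asr_words, n_prefix_words, sample_rate):
    """Sample index where the carried-over prefix ends in the new audio.

    asr_words: list of (word, start_sec, end_sec) from an ASR pass over the
    newly generated chunk (assumed format, not tied to a specific ASR).
    """
    if n_prefix_words <= 0 or not asr_words:
        return 0
    # Clamp in case the ASR returned fewer words than the prefix length.
    last_prefix = asr_words[min(n_prefix_words, len(asr_words)) - 1]
    return int(round(last_prefix[2] * sample_rate))
```

With these, `audio[cut_sample(words, len(prefix_words), sr):]` would keep only the new part of the chunk.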
I tried chunking with the last part of a sentence as the prefix for the next, but got really weird results. I discarded that code, but maybe it can be made to work.
Got it, I'm in!
Same here. I tried slicing the last 3 seconds of gen 1 to feed into gen 2 as prefix audio and trimmed it from the generated codes (as well as from the post-processed wav output), and the output is not pleasant. The first sentence is fine, but the second audio chunk starts from the second part of the second sentence, and so on. After the first audio chunk, every chunk starts mid-sentence. There was murmuring too... Disaster...
Did you prefix the text too, with the part you feed in as prefix audio? Ping me in Discord when you are .. in the mrdragonfox dude
I'm there, Faheem⚡ |
Used ASR word timestamps to take a few words from the previous text chunk and their corresponding waveform to generate prefix codes (and cut that out later). So, probably a smoother and more consistent transition with similar-ish vocalisation.
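To make the previous comment concrete, here is a minimal sketch of pulling the last few words and their waveform span out of the previous chunk. It assumes ASR output as `(word, start_sec, end_sec)` tuples and audio as a plain sequence of samples. `tail_prefix` is a hypothetical helper, and encoding the slice into prefix codes is left to the model's own tooling.

```python
def tail_prefix(asr_words, samples, sample_rate, n_words=2):
    """Return (prefix_text, prefix_samples) for the last n_words of a chunk.

    asr_words: list of (word, start_sec, end_sec) for the previous chunk.
    samples:   that chunk's audio as a sequence of samples.
    """
    if n_words <= 0 or not asr_words:
        return "", []
    tail = asr_words[-n_words:]
    # Slice from the start of the first kept word to the end of the last one.
    start = int(tail[0][1] * sample_rate)
    end = int(tail[-1][2] * sample_rate)
    text = " ".join(word for word, _, _ in tail)
    return text, samples[start:end]
```

The returned text is what gets prepended to the next chunk's text, and the returned samples are what would be encoded as prefix audio.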
Added support for files (PDF, XLSX, DOCX, etc.) and unlimited content (via chunked generation and accumulation).
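As an illustration of the accumulation step, the sketch below joins per-chunk audio with the sub-second silence mentioned earlier in the thread. It is not the PR's implementation: audio is modelled as plain lists of float samples, and `accumulate` and `gap_seconds` are assumed names.

```python
def accumulate(chunks, sample_rate, gap_seconds=0.2):
    """Concatenate audio chunks, inserting a short silence between them.

    chunks: list of sample sequences, one per generated text chunk.
    """
    gap = [0.0] * int(round(gap_seconds * sample_rate))
    out = []
    for i, chunk in enumerate(chunks):
        if i:  # no silence before the first chunk
            out.extend(gap)
        out.extend(chunk)
    return out
```

The combined result can then be shown once generation finishes, rather than streamed chunk by chunk.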