
i can't generate audio #6

Open
shingo-vokov opened this issue Oct 19, 2024 · 10 comments

@shingo-vokov

I tried using it:

outputs = spirit_lm.generate(
    interleaved_inputs=[('text', "I am so deeply saddened, it feels as if my heart is shattering into a million pieces and I can't hold back the tears that are streaming down my face.")],
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
    speaker_id=1,
)
display_outputs(outputs)

but I get this warning:

/home/.conda/envs/spiritlm/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:579: UserWarning: pad_token_id should be positive but got -1. This will cause errors when batch generating, if there is padding. Please set pad_token_id explicitly as model.generation_config.pad_token_id=PAD_TOKEN_ID to avoid errors in generation
warnings.warn(

How do I get PAD_TOKEN_ID?
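For reference, this warning only means `pad_token_id` is unset (-1); a common fallback in Hugging Face models is to reuse the EOS token as the padding token. A minimal sketch of that fallback, where `resolve_pad_token_id` is a hypothetical helper and the `model.generation_config` access pattern assumes the underlying HF model is exposed:

```python
def resolve_pad_token_id(pad_token_id, eos_token_id):
    """Pick a usable padding token id.

    Hugging Face emits the warning above when pad_token_id is -1 (unset).
    A common fallback is to reuse the EOS token for padding.
    """
    if pad_token_id is None or pad_token_id < 0:
        return eos_token_id
    return pad_token_id

# Hypothetical usage, assuming the underlying HF model is reachable
# as `model` with a standard generation_config:
# model.generation_config.pad_token_id = resolve_pad_token_id(
#     model.generation_config.pad_token_id,
#     model.generation_config.eos_token_id,
# )
```

This silences the warning for single-sample generation; for real batch generation a proper pad token still matters, since padded positions affect attention masks.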

@hitchhicker (Contributor)

Could you share your transformers version and how you set up the conda environment? Thanks! I don't have this error on my side.

@gallilmaimon

Hey @hitchhicker, I get the same warning when running the generation example.

I created the environment as explained for Conda:

conda env create -f env.yml
pip install -e '.[eval]'

And my version of transformers is transformers==4.46.0

@hitchhicker (Contributor)

Thanks @gallilmaimon for providing the setup information! I wonder whether this error happens only on Python 3.10; I am using 3.9, by the way. What Python version are you using?

@gallilmaimon
Copy link

I am using Python 3.9.20, and indeed the env.yml specifies python==3.9

@hitchhicker (Contributor)

My Python version is also 3.9.20, down to the patch version.

The following is the output of my pip freeze:

antlr4-python3-runtime==4.8
audioread==3.0.1
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.3.2
decorator==5.1.1
einops==0.8.0
encodec==0.1.1
exceptiongroup==1.2.2
fairscale==0.4.13
filelock @ file:///croot/filelock_1700591183607/work
fsspec==2024.9.0
gmpy2 @ file:///tmp/build/80754af9/gmpy2_1645438755360/work
huggingface-hub==0.25.1
idna==3.10
iniconfig==2.0.0
Jinja2 @ file:///croot/jinja2_1716993405101/work
joblib==1.4.2
lazy_loader==0.4
librosa==0.10.2.post1
llvmlite==0.43.0
local-attention==1.9.15
MarkupSafe @ file:///croot/markupsafe_1704205993651/work
mkl-service==2.4.0
mkl_fft @ file:///croot/mkl_fft_1725370245198/work
mkl_random @ file:///croot/mkl_random_1725370241878/work
mpmath @ file:///croot/mpmath_1690848262763/work
msgpack==1.1.0
networkx @ file:///croot/networkx_1717597493534/work
numba==0.60.0
numpy @ file:///croot/numpy_and_numpy_base_1725470312869/work/dist/numpy-2.0.1-cp39-cp39-linux_x86_64.whl#sha256=d86a49760b169e0c4fb8c00d248077a0474f640071b7ef584afd5ad4f03b9428
omegaconf==2.2.0
packaging==24.1
pandas==2.2.3
platformdirs==4.3.6
pluggy==1.5.0
pooch==1.8.2
pyarrow==17.0.0
pycparser==2.22
pytest==8.3.3
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML @ file:///croot/pyyaml_1698096049011/work
regex==2024.9.11
requests==2.32.3
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.13.1
sentencepiece==0.2.0
six==1.16.0
soundfile==0.12.1
soxr==0.5.0.post1
sympy @ file:///croot/sympy_1724938189289/work
threadpoolctl==3.5.0
tokenizers==0.20.0
tomli==2.0.2
torch==2.4.1
torchaudio==2.4.1
torchfcpe==0.0.4
tqdm==4.66.5
transformers==4.45.1
triton==3.0.0
typing_extensions @ file:///croot/typing_extensions_1715268824938/work
tzdata==2024.2
urllib3==2.2.3

I notice that my transformers version is transformers==4.45.1, which is different from yours.

@gallilmaimon

I will try downgrading transformers to see if that makes any difference. I will also try creating the environment with pip rather than conda and let you know if there is any difference.

The pip installation instructions say:

pip install -e requirements.txt
pip install -e '.[eval]'

but the first line gives an error. I think it should be pip install -r requirements.txt or simply pip install -e . Is this correct?

@hitchhicker (Contributor)

Thanks!

You are right: pip install -e requirements.txt has a typo, the "-e" should be "-r". And we don't actually need that line. I will update the README.

@gallilmaimon

I tried with transformers==4.45.1 like you (and also tried installing with pip instead of conda), but still got the same warning:

python3.9/site-packages/transformers/generation/configuration_utils.py:568: UserWarning: `pad_token_id` should be positive but got -1. This will cause errors when batch generating, if there is padding. Please set `pad_token_id` explicitly as `model.generation_config.pad_token_id=PAD_TOKEN_ID` to avoid errors in generation

It is worth mentioning that (unlike the issue title) I am managing to generate audio:

[GenerationOuput(content=array([-0.00186113, -0.00068325, -0.0015525 , ..., -0.00764565,
       -0.0104003 , -0.01227323], dtype=float32), content_type=<ContentType.SPEECH: 'SPEECH'>)]
[GenerationOuput(content=' you think you can make it i shall be very glad if you can', content_type=<ContentType.TEXT: 'TEXT'>), GenerationOuput(content=array([ 0.04572767,  0.03901244,  0.03441606, ..., -0.18233861,
       -0.2093165 , -0.22660626], dtype=float32), content_type=<ContentType.SPEECH: 'SPEECH'>), GenerationOuput(content=' human being to see and suffer wrong without offering the help of his hand and his life now a new passion was in him and a new dignity as of one who stood upon the brink of a mighty change', content_type=<ContentType.TEXT: 'TEXT'>), GenerationOuput(content=array([-0.00298501, -0.00290124, -0.00230354, ..., -0.20166263,
       -0.20007564, -0.19413921], dtype=float32), content_type=<ContentType.SPEECH: 'SPEECH'>), GenerationOuput(content=' ragged clay stains were gone and his eyes', content_type=<ContentType.TEXT: 'TEXT'>)]

but the warning above suggests that I might get wrong results when working with batches, which I would like to do...

@hitchhicker (Contributor)

Awesome to see that you are able to generate outputs!

In fact, we don't really support batch prediction (one prediction can contain multiple texts, multiple audio clips, or a mix of them, but they still form a single batch), since the implementation of the speech tokenizer does not support it. I see that you have an output of two lists; for each call of generate, we expect to see only one list. Would you mind sharing the input that you used for interleaved_inputs? Thanks!

@gallilmaimon

The outputs are fine and make sense (there are two generate calls) :)

I wanted to calculate probabilities of speech-only (non-interleaved) samples in batches, to compute the sWUGGY metric (as in the paper) or other modelling metrics like SALMon (https://arxiv.org/abs/2409.07437), and doing so without batching can be slow. However, as this is already a bit out of scope for this issue, I will do that, and if I hit the same warning or unexplained behaviour there I will open a new issue.
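Until batching is supported, one simple workaround is to loop over samples one call at a time. A minimal sketch, where `generate_sequentially` is a hypothetical helper and the call signature mirrors the spirit_lm.generate usage shown at the top of this issue:

```python
def generate_sequentially(generate_fn, inputs_list, **generation_kwargs):
    """Run an unbatched generate function over many inputs, one call each.

    `generate_fn` is assumed to behave like spirit_lm.generate from the
    snippet above; `inputs_list` is a list of interleaved_inputs values.
    Shared keyword arguments (generation_config, speaker_id, ...) are
    forwarded to every call.
    """
    outputs = []
    for interleaved_inputs in inputs_list:
        outputs.append(
            generate_fn(interleaved_inputs=interleaved_inputs,
                        **generation_kwargs)
        )
    return outputs
```

This is slower than true batching but sidesteps padding entirely, so the pad_token_id warning is harmless in this setup.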

Thank you!
