no speedup using ort #103

Open

housebaby opened this issue Mar 9, 2022 · 4 comments

@housebaby

I have tried using ORT to train a transformer model, but it does not seem to give any speedup.
I wonder whether I have missed something in the configuration.

@baijumeswani
Collaborator

Could you please share your model code with us, if possible?

@housebaby
Author

housebaby commented Mar 9, 2022

Could you please share your model code with us, if possible?

This is how I use ORT:

model = ORTModule(init_asr_model(configs))
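
For reference, a minimal sketch of the same wrapping with ORTModule debug options enabled (assuming this onnxruntime-training build exposes DebugOptions and LogLevel); saving the exported ONNX graphs and raising the log level can help confirm whether ORT is actually exporting and executing the model rather than silently falling back:

# Hypothetical sketch: wrap the model with debug options so the exported
# ONNX graphs are written to disk and ORT log output becomes visible.
from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel

debug_options = DebugOptions(log_level=LogLevel.INFO,
                             save_onnx=True,
                             onnx_prefix="asr_model")
model = ORTModule(init_asr_model(configs), debug_options)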

When I print the model, the output is as follows:
ORTModule(
(encoder): ConformerEncoder(
(global_cmvn): GlobalCMVN()
(embed): Conv2dSubsampling4(
(conv): Sequential(
(0): Conv2d(1, 512, kernel_size=(3, 3), stride=(2, 2))
(1): ReLU()
(2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2))
(3): ReLU()
)
(out): Sequential(
(0): Linear(in_features=9728, out_features=512, bias=True)
)
(pos_enc): RelPositionalEncoding(
(dropout): Dropout(p=0.1, inplace=False)
)
)
(after_norm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(encoders): ModuleList(
(0): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(1): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(2): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(3): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(4): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(5): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(6): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(7): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(8): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(9): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(10): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
(11): ConformerEncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=512, out_features=512, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): SiLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(512, 512, kernel_size=(15,), stride=(1,), groups=512)
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(pointwise_conv2): Conv1d(512, 512, kernel_size=(1,), stride=(1,))
(activation): SiLU()
)
(norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear): Linear(in_features=1024, out_features=512, bias=True)
)
)
)
(decoder): TransformerDecoder(
(embed): Sequential(
(0): Embedding(5002, 512)
(1): PositionalEncoding(
(dropout): Dropout(p=0.1, inplace=False)
)
)
(after_norm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(output_layer): Linear(in_features=512, out_features=5002, bias=True)
(decoders): ModuleList(
(0): DecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(src_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): ReLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear1): Linear(in_features=1024, out_features=512, bias=True)
(concat_linear2): Linear(in_features=1024, out_features=512, bias=True)
)
(1): DecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(src_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): ReLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear1): Linear(in_features=1024, out_features=512, bias=True)
(concat_linear2): Linear(in_features=1024, out_features=512, bias=True)
)
(2): DecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(src_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): ReLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear1): Linear(in_features=1024, out_features=512, bias=True)
(concat_linear2): Linear(in_features=1024, out_features=512, bias=True)
)
(3): DecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(src_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): ReLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear1): Linear(in_features=1024, out_features=512, bias=True)
(concat_linear2): Linear(in_features=1024, out_features=512, bias=True)
)
(4): DecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(src_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): ReLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear1): Linear(in_features=1024, out_features=512, bias=True)
(concat_linear2): Linear(in_features=1024, out_features=512, bias=True)
)
(5): DecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(src_attn): MultiHeadedAttention(
(linear_q): Linear(in_features=512, out_features=512, bias=True)
(linear_k): Linear(in_features=512, out_features=512, bias=True)
(linear_v): Linear(in_features=512, out_features=512, bias=True)
(linear_out): Linear(in_features=512, out_features=512, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=512, out_features=2048, bias=True)
(activation): ReLU()
(dropout): Dropout(p=0.1, inplace=False)
(w_2): Linear(in_features=2048, out_features=512, bias=True)
)
(norm1): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm2): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(norm3): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(concat_linear1): Linear(in_features=1024, out_features=512, bias=True)
(concat_linear2): Linear(in_features=1024, out_features=512, bias=True)
)
)
)
(ctc): CTC(
(ctc_lo): Linear(in_features=512, out_features=5002, bias=True)
(ctc_loss): CTCLoss()
)
(criterion_att): LabelSmoothingLoss(
(criterion): KLDivLoss()
)
)

@housebaby
Author

housebaby commented Mar 9, 2022

Could you please share your model code with us, if possible?

from typing import Optional, Tuple

import torch

# TransformerEncoder, TransformerDecoder, CTC, IGNORE_ID and LabelSmoothingLoss
# are defined elsewhere in the surrounding codebase.


class ASRModel(torch.nn.Module):
    """CTC-attention hybrid Encoder-Decoder model"""
    def __init__(
        self,
        vocab_size: int,
        encoder: TransformerEncoder,
        decoder: TransformerDecoder,
        ctc: CTC,
        ctc_weight: float = 0.5,
        ignore_id: int = IGNORE_ID,
        reverse_weight: float = 0.0,
        lsm_weight: float = 0.0,
        length_normalized_loss: bool = False,
    ):
        assert 0.0 <= ctc_weight <= 1.0, ctc_weight

        super().__init__()
        # note that eos is the same as sos (equivalent ID)
        self.sos = vocab_size - 1
        self.eos = vocab_size - 1
        self.vocab_size = vocab_size
        self.ignore_id = ignore_id
        self.ctc_weight = ctc_weight
        self.reverse_weight = reverse_weight

        self.encoder = encoder
        self.decoder = decoder
        self.ctc = ctc
        self.criterion_att = LabelSmoothingLoss(
            size=vocab_size,
            padding_idx=ignore_id,
            smoothing=lsm_weight,
            normalize_length=length_normalized_loss,
        )

    def forward(
        self,
        speech: torch.Tensor,
        speech_lengths: torch.Tensor,
        text: torch.Tensor,
        text_lengths: torch.Tensor,
    ) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor],
               Optional[torch.Tensor]]:
        """Frontend + Encoder + Decoder + Calc loss

        Args:
            speech: (Batch, Length, ...)
            speech_lengths: (Batch, )
            text: (Batch, Length)
            text_lengths: (Batch,)
        """
        assert text_lengths.dim() == 1, text_lengths.shape
        # Check that batch_size is unified
        assert (speech.shape[0] == speech_lengths.shape[0] == text.shape[0] ==
                text_lengths.shape[0]), (speech.shape, speech_lengths.shape,
                                         text.shape, text_lengths.shape)
        # 1. Encoder
        encoder_out, encoder_mask = self.encoder(speech, speech_lengths)
        encoder_out_lens = encoder_mask.squeeze(1).sum(1)

        # 2a. Attention-decoder branch
        if self.ctc_weight != 1.0:
            loss_att, acc_att = self._calc_att_loss(encoder_out, encoder_mask,
                                                    text, text_lengths)
        else:
            loss_att = None

        # 2b. CTC branch
        if self.ctc_weight != 0.0:
            loss_ctc = self.ctc(encoder_out, encoder_out_lens, text,
                                text_lengths)
        else:
            loss_ctc = None

        if loss_ctc is None:
            loss = loss_att
        elif loss_att is None:
            loss = loss_ctc
        else:
            loss = self.ctc_weight * loss_ctc + (1 -
                                                 self.ctc_weight) * loss_att
        return loss, loss_att, loss_ctc
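
A minimal, hypothetical micro-benchmark sketch (the dummy shapes, device, and step count are assumptions, not taken from my actual training script) that compares one training step of the plain model against the ORTModule-wrapped model; the first wrapped steps are excluded as warm-up because they include the ONNX export and graph build:

# Hypothetical micro-benchmark: time forward + backward with and without ORTModule.
import time

import torch
from onnxruntime.training.ortmodule import ORTModule


def step_time(model, speech, speech_lengths, text, text_lengths, steps=10):
    model.train()
    for _ in range(3):  # warm-up; the first ORTModule call triggers export and graph build
        loss, _, _ = model(speech, speech_lengths, text, text_lengths)
        loss.backward()
        model.zero_grad()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        loss, _, _ = model(speech, speech_lengths, text, text_lengths)
        loss.backward()
        model.zero_grad()
    torch.cuda.synchronize()
    return (time.time() - start) / steps


device = torch.device("cuda")
# Dummy batch; the 80-dim features match the Conv2dSubsampling4 shapes printed above.
speech = torch.randn(8, 1000, 80, device=device)
speech_lengths = torch.full((8,), 1000, dtype=torch.long, device=device)
text = torch.randint(1, 5000, (8, 30), device=device)
text_lengths = torch.full((8,), 30, dtype=torch.long, device=device)

baseline = init_asr_model(configs).to(device)
wrapped = ORTModule(init_asr_model(configs).to(device))

print("plain PyTorch:", step_time(baseline, speech, speech_lengths, text, text_lengths))
print("ORTModule    :", step_time(wrapped, speech, speech_lengths, text, text_lengths))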

@natke assigned askhade and unassigned askhade Mar 10, 2022
@ytaous

ytaous commented Mar 28, 2022

Hi, could you please provide steps to reproduce the issue, including sample data and run scripts?
Also, what is your runtime environment (installation details, ORT version, etc.)? Older versions of ORT may not show an obvious gain; please try our latest release together with an upgraded torch and see whether it makes a difference before getting back to us.
https://download.onnxruntime.ai/
Thanks.
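
For reference, a quick way to report the runtime environment mentioned above (plain Python, nothing project-specific assumed):

# Print the versions relevant to this issue: torch, onnxruntime, and CUDA availability.
import torch
import onnxruntime

print("torch         :", torch.__version__)
print("onnxruntime   :", onnxruntime.__version__)
print("CUDA available:", torch.cuda.is_available())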
