
[Bug]: 推理和转换ONNX时报错 #24

Open
m00nLi opened this issue Sep 26, 2024 · 13 comments
Labels
bug Something isn't working

Comments

m00nLi commented Sep 26, 2024

Bug

Training works fine, but running inference with inference.py or converting to ONNX fails with the following error:

Traceback (most recent call last):
  File "/home/code/Relation-DETR/inference.py", line 165, in <module>
    inference()
  File "/home/user/code/Relation-DETR/inference.py", line 99, in inference
    model = Config(args.model_config).model.eval()
  File "/home/code/Relation-DETR/util/lazy_load.py", line 35, in __init__
    mod = importlib.import_module(module_name)
  File "/home/user/miniconda3/envs/detr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/code/Relation-DETR/configs/train_config.py", line 45, in <module>
    optimizer = optim.AdamW(lr=learning_rate, weight_decay=1e-4, betas=(0.9, 0.999))
TypeError: AdamW.__init__() missing 1 required positional argument: 'params'
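For context, the traceback shows the training config calling optim.AdamW without its required params argument, which only works if something else injects the model parameters later. A minimal sketch of the usual deferred-optimizer pattern (using functools.partial; this is an illustration of the failure mode, not necessarily how Relation-DETR's lazy config works internally):

```python
import functools

import torch
from torch import nn, optim

# Calling optim.AdamW(lr=..., ...) directly raises TypeError, because `params`
# is a required positional argument. A config meant for lazy loading instead
# defers construction until the model's parameters are available:
optimizer_factory = functools.partial(
    optim.AdamW, lr=1e-4, weight_decay=1e-4, betas=(0.9, 0.999)
)

model = nn.Linear(4, 2)  # stand-in for the real detector
optimizer = optimizer_factory(model.parameters())
```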

Environment

-------------------------------  ---------------------------------------------------------------------------------------
sys.platform                     linux
Python                           3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
numpy                            1.24.4
PyTorch                          2.4.0+cu121 @/home/user/miniconda3/envs/detr/lib/python3.10/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0,1,2,3,4,5,6,7              NVIDIA H800 (arch=9.0)
Driver version                   560.35.03
CUDA_HOME                        /usr/local/cuda-12.4
Pillow                           10.4.0
torchvision                      0.19.0+cu121 @/home/user/miniconda3/envs/detr/lib/python3.10/site-packages/torchvision
torchvision arch flags           5.0, 6.0, 7.0, 7.5, 8.0, 8.6, 9.0
fvcore                           0.1.5.post20221221
iopath                           0.1.10
cv2                              4.10.0
-------------------------------  ---------------------------------------------------------------------------------------

Additional information

No response

@m00nLi m00nLi added the bug Something isn't working label Sep 26, 2024
xiuqhou (Owner) commented Sep 26, 2024

model_config should point to a model config rather than a training config, e.g.:
configs/relation_detr/relation_detr_resnet50_800_1333.py

m00nLi (Author) commented Sep 26, 2024

> model_config should point to a model config rather than a training config, e.g.: configs/relation_detr/relation_detr_resnet50_800_1333.py

OK, switching to the model config fixed it.

m00nLi (Author) commented Sep 27, 2024

Exporting to ONNX fails with:

torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::_upsample_bilinear2d_aa' to ONNX opset version 17 is not supported. Please feel free to request support or submit a pull request on PyTorch GitHub: https://github.com/pytorch/pytorch/issues.

ONNX versions

onnx                              1.16.2
onnxruntime                       1.16.0
onnxsim                           0.4.36
rapidocr-onnxruntime              1.3.24

xiuqhou (Owner) commented Sep 27, 2024

Hi @m00nLi, this happens because PyTorch cannot export anti-aliased resize to ONNX.
Please set antialias to False on line 75 of models/detectors/base_detector.py and export again.

SEU-ZWW commented Sep 27, 2024

> Hi @m00nLi, this happens because PyTorch cannot export anti-aliased resize to ONNX. Please set antialias to False on line 75 of models/detectors/base_detector.py and export again.

After setting it to False I still get the same error; opset 11 does not work either.

xiuqhou (Owner) commented Sep 28, 2024

Please check whether the error mentions aten::_upsample_bilinear2d_aa or aten::_upsample_bilinear2d. PyTorch currently cannot export any of the _aa (anti-aliased) operators, whereas setting antialias to False switches to the ONNX operator without the _aa suffix. That operator may be unsupported at opset 11, but it is supported at opset 17 — could you try opset 17?

SEU-ZWW commented Sep 29, 2024

> Please check whether the error mentions aten::_upsample_bilinear2d_aa or aten::_upsample_bilinear2d. PyTorch currently cannot export any of the _aa (anti-aliased) operators, whereas setting antialias to False switches to the ONNX operator without the _aa suffix. That operator may be unsupported at opset 11, but it is supported at opset 17 — could you try opset 17?

The error is about the operator with the _aa suffix. Is there currently any way to convert the .pt model to ONNX? A workaround would be much appreciated, thanks.

xiuqhou (Owner) commented Sep 30, 2024

Hi @SEU-ZWW, if the error still mentions the _aa operator, then antialias is still True. Please double-check your settings and debug to see why setting antialias to False did not take effect.

m00nLi (Author) commented Oct 22, 2024

My ONNX export succeeded with antialias=False, opset=17, onnxruntime==1.19.2.
What preprocessing should be applied to images when running detection with the ONNX model? I looked at ONNXDetector in tools/pytorch2onnx.py, and preprocessing the input image like this does produce detections:

import cv2
import numpy as np
import torch

cv_img = cv2.imread(image_path)
cv_img = cv2.cvtColor(cv_img, cv2.COLOR_BGR2RGB)
cv_img = cv_img.astype(np.float32)
cv_img /= 255.0
mean = np.array((0.485, 0.456, 0.406), dtype=np.float32)
std = np.array((0.229, 0.224, 0.225), dtype=np.float32)
# cv_img = (cv_img - mean) / std  # adding this line yields no detections
cv_img = cv_img.transpose(2, 0, 1)  # h, w, c => c, h, w
input_tensor = torch.from_numpy(cv_img)
result = onnx_detector([input_tensor])

But the training-time data pipeline also subtracts the mean and divides by the standard deviation, doesn't it? If I add that step, no objects are detected. What is the correct preprocessing?

xiuqhou (Owner) commented Oct 22, 2024

For convenience, I integrated the inference-time normalization into the model as an nn.Module, so no extra normalization is needed at inference. Just read the image in CHW (RGB) format, convert it to float, and feed it to the model. Your original preprocessing is correct.
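The integrated normalization described above can be sketched as an nn.Module that holds the ImageNet mean/std as buffers, so it is exported as part of the graph (a hypothetical reconstruction for illustration, not the repo's exact code):

```python
import torch
from torch import nn

class Normalize(nn.Module):
    """Normalization baked into the model: callers only pass float RGB CHW
    tensors in [0, 1]; the mean/std step travels with the exported graph."""

    def __init__(self, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
        super().__init__()
        # Buffers (not parameters) so they are saved/exported but not trained.
        self.register_buffer("mean", torch.tensor(mean).view(3, 1, 1))
        self.register_buffer("std", torch.tensor(std).view(3, 1, 1))

    def forward(self, x):
        return (x - self.mean) / self.std
```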

m00nLi (Author) commented Oct 22, 2024

> For convenience, I integrated the inference-time normalization into the model as an nn.Module, so no extra normalization is needed at inference. Just read the image in CHW (RGB) format, convert it to float, and feed it to the model. Your original preprocessing is correct.

OK, thanks.

m00nLi (Author) commented Oct 24, 2024

> For convenience, I integrated the inference-time normalization into the model as an nn.Module, so no extra normalization is needed at inference. Just read the image in CHW (RGB) format, convert it to float, and feed it to the model. Your original preprocessing is correct.

How can I remove this resize + normalization operator when exporting to ONNX? I already set self.eval_transform = None in models/detectors/base_detector.py, and according to Netron the operator is gone, but I still only get detections when I preprocess the image as before (doing the normalization myself yields no detections).

class BaseDetector(nn.Module):
    def __init__(self, min_size=None, max_size=None, size_divisible=32):
        ...
        # self.eval_transform = nn.Sequential(*eval_transform)
        self.eval_transform = None

Also, testing on my own dataset (about 200k images), Relation-DETR (FocalNet-L) does beat DINO (Swin-L) by 3 points on mAP50 and mAP50-95, and recall is also 3 points higher, but precision is 7 points lower. That suggests Relation-DETR leans toward recall at the cost of many false positives. Any ideas for improving this?

xiuqhou (Owner) commented Oct 26, 2024

That is strange: eval_transform only converts the image to float and normalizes it, so setting it to None should disable the normalization. You could debug the torch.onnx.export call in pytorch2onnx.py to confirm the normalization is really removed at export time, and also make sure the input image dtype is float.

Detection involves both classification and regression, so performance is mainly measured with mAP and AR. AP is the average of precision over different recall levels, i.e. the area under the PR curve, and mAP is the mean AP over classes. So I don't quite understand why your precision trends differently from mAP: scoring higher than DINO on mAP but lower on precision is odd. I suggest first visualizing the results to check whether there really are many false positives. If there are, you can filter some of them out with a score threshold, or refine the object queries to represent objects more precisely.
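The score-threshold filtering suggested above can be sketched as follows (names and tensor layout are illustrative; the real detector's output format may differ):

```python
import torch

def filter_by_score(boxes, scores, labels, score_thr=0.5):
    """Keep only detections whose confidence is at or above score_thr,
    trading a little recall for fewer false positives."""
    keep = scores >= score_thr
    return boxes[keep], scores[keep], labels[keep]

# Usage with dummy detector outputs:
boxes = torch.rand(5, 4)                      # xyxy boxes
scores = torch.tensor([0.9, 0.2, 0.6, 0.4, 0.8])
labels = torch.zeros(5, dtype=torch.long)
boxes, scores, labels = filter_by_score(boxes, scores, labels, 0.5)
```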
