Implemented weighted-multi_input-[shortcut] layer with weights-normalization #4662
Comments
@AlexeyAB @WongKinYiu I made a network that has 3
|
@Kyuuki93 Try to use the latest commit. I set
and trained your cfg-file for 100 iterations successfully. |
There is a bug in your cfg: |
Also you should not specify Just use
instead of
|
@AlexeyAB The CUDA error was caused by that. The modified .cfg is here: darknet53-bifpn3.cfg.txt, which looks like this |
Comparison on my dataset: all cases use the same training settings and MS COCO detector pre-trained weights, and differ only in the backbone.
All networks have an SPP-layer; inference time was tested on an RTX 2080 Ti; the BiFPN block uses
The option for the next step could be
but recently my machines have been occupied by other tasks, so I will try further experiments when the GPUs are free |
@Kyuuki93
since BiFPN is not very expensive |
@Kyuuki93 @WongKinYiu I just fixed
|
Starting to re-train the ASFF models now. |
@WongKinYiu
|
@Kyuuki93 @WongKinYiu
So you can test both cases on two small datasets. |
@Kyuuki93 Hi, have you made any progress with BiFPN and BiFPN+ASFF? |
Sorry, I can't work these days because of the new year and the new virus in China ... |
@Kyuuki93 @AlexeyAB I'm interested in implementing a BiFPN head on top of darknet in https://github.com/ultralytics/yolov3 using https://github.com/AlexeyAB/darknet/files/4048909/darknet53-bifpn3.cfg.txt and benchmarking on COCO. The EfficientDet paper https://arxiv.org/pdf/1911.09070.pdf mentions 3 summation methods: "scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel)". It seems they select scalar/per-feature for their implementation. Then I assume that to add multiple 4D tensors, say two of shape 16x256x13x13, we would have 2 scalar weights if done 'per_feature', and 2 vector weights of shape 1x256x1x1 if done 'per_channel'? Also, I'm surprised that softmax on the weights imparts such a slowdown (1.3X) in their paper; have you guys also observed this? |
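For illustration, a minimal PyTorch sketch of the two weighting schemes being asked about, assuming two inputs of the 16x256x13x13 shape mentioned above (the variable names are mine, not from either repo):

```python
import torch
import torch.nn.functional as F

# Two feature maps to be fused, shape (N, C, H, W) = (16, 256, 13, 13).
x1 = torch.randn(16, 256, 13, 13)
x2 = torch.randn(16, 256, 13, 13)

# 'per_feature': one learnable scalar per input, softmax-normalized over the inputs.
w_feature = torch.nn.Parameter(torch.zeros(2))
a = F.softmax(w_feature, dim=0)              # a[0] + a[1] == 1
y_per_feature = a[0] * x1 + a[1] * x2

# 'per_channel': one learnable vector of shape (1, 256, 1, 1) per input,
# softmax-normalized across the two inputs independently for every channel.
w_channel = torch.nn.Parameter(torch.zeros(2, 1, 256, 1, 1))
b = F.softmax(w_channel, dim=0)              # b[0] + b[1] == 1 for each channel
y_per_channel = b[0] * x1 + b[1] * x2
```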
Yes.
Only for training. And only for weighted-shortcut-layer.
Did you write a paper for Mosaic data augmentation? |
@AlexeyAB ah I see. Actually no, I haven't had time to write a mosaic paper. It's too bad, because the results are pretty clear that it helps significantly. Another thing I was wondering: I see in the BiFPN cfg only the head uses weighted shortcuts, while the darknet53 backbone does not. If it helps the head then it may help the backbone as well, no? Though to keep expectations in check, the EfficientDet paper only shows a pretty small +0.45 mAP bump from the weighted vs non-weighted head. One difference though is that currently the regular non-weighted shortcut layers effectively have weights=1 that sum to 2, 3, etc. depending on the number of inputs. Isn't it a bit strange then that the head must be constrained to sum the weights to 1? |
I have the same thoughts.
Batch-normalization tries to do the same - it moves most values to the range [0 - 1] - it increases mAP, speeds up training, and makes training more stable... |
@AlexeyAB ah of course, the BN after the Conv2d() following a shortcut will do this automatically, it completely slipped my mind. Good example. Ok, then I've taken @Kyuuki93's cfg and touched it up a bit (fixed the 'normalization' typo, reverted to the yolov3-spp.cfg default anchors and 80-class configuration, and implemented weighted shortcuts for all shortcut layers). I will experiment with a few different weighting techniques using my typical 27-epoch COCO results and post the results later this week, hopefully. I suppose the tests should be:
|
@glenn-jocher Also try to use this cfg-file that is made more similar to https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py
Just for fair comparison, set all parameters in [net] and [yolo] to the same values in all 4 models. |
@AlexeyAB thanks for the cfg! I've been having some problems with the route layers on these csresnext50 cfgs, maybe I can just copy the BiFPN part and attach it to darknet53, ah but the shortcut layer numbers will be different... ok maybe I'll just try and use it directly. Let's see... |
@glenn-jocher Yes, you should change route layers and the first from= in weighted [shortcut] layers. Tomorrow I will attach
I will get some free GPUs in about 2~4 days. |
Yes, that would be nice. Try to train 2 classifiers with weighted-shortcut layers:
which are similar to https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/imagenet/results.md. Try to train these 2 models at 416x416 with my BiFPN module, which is more like the module in this reference: https://github.com/xuannianz/EfficientDet/blob/ccc795781fa173b32a6785765c8a7105ba702d0b/model.py For comparison with: ultralytics/yolov3#698 (comment)
|
|
@WongKinYiu I think we can combine |
Some of the weights are negative, but I just looked: this model uses softmax normalization, so negative values are not a problem for it |
I created 2 new models:
|
csdarknet53-ws: 65.6 top-1, 87.3 top-5. |
Do you mean train with |
@WongKinYiu No, train as usual at 256x256. 608x608 is just to know the final speed of the backbone of the detector. PS, |
OK, start training. |
@AlexeyAB Hello, I used the 27 Feb repo (10a5861) for training the bifpn model. Update: csresnext50-bifpngamma: 512x512, 36.8/58.2/39.4. |
@WongKinYiu Hi, it seems yes, this is the reason, if there are negative or very low weights. Try to set relu instead of lrelu temporarily for testing.
Can you share the cfg/weights? I will check the negative weights and speed. Also try to check the training time of Lines 1101 to 1194 in 2614a23
so that detection will have the same speed: none vs relu vs softmax
For example, models |
Yes, the BiFPN module in this weights-file is totally broken, since most of the weights are negative, i.e. they don't pass any information through the weighted-[shortcut] layers. Ways to solve this problem:
|
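To make the "negative weights don't pass any information" point concrete, here is a small numeric check (my own illustration, assuming a ReLU-style normalization of the form relu(w) / (sum(relu(w)) + eps)):

```python
import torch
import torch.nn.functional as F

w = torch.tensor([-1.2, 0.3])        # raw learned weights; the first one is negative

# Softmax normalization: both inputs still contribute, a negative raw value is fine.
print(F.softmax(w, dim=0))           # ~ tensor([0.18, 0.82])

# ReLU-based normalization: the negative weight becomes exactly 0, so that input
# passes no information through the weighted [shortcut] layer.
w_pos = F.relu(w)
print(w_pos / (w_pos.sum() + 1e-4))  # ~ tensor([0.00, 1.00])
```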
@WongKinYiu Hi, can you show the approximate remaining time for the models?
|
|
@WongKinYiu Thanks! Can you check the intermediate Top-1/Top-5 accuracy of the models? They may degrade the accuracy of the classifier, but should improve the accuracy of the detector. Only https://arxiv.org/abs/1912.00632v2
In these models, the image pyramid is obtained by downsampling (linear interpolation) the input image into five levels with a factor of 2. |
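For clarity, a short sketch of such a pyramid, assuming bilinear interpolation as the "linear interpolate" and counting the full-resolution image as the first level (the helper name is hypothetical):

```python
import torch
import torch.nn.functional as F

def image_pyramid(img, levels=5):
    # Downsample an (N, C, H, W) image by a factor of 2 per level with bilinear
    # interpolation, returning [full, 1/2, 1/4, 1/8, 1/16] resolutions.
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(F.interpolate(pyramid[-1], scale_factor=0.5,
                                     mode="bilinear", align_corners=False))
    return pyramid

levels = image_pyramid(torch.randn(1, 3, 512, 512))
print([tuple(p.shape[2:]) for p in levels])  # [(512, 512), (256, 256), (128, 128), (64, 64), (32, 32)]
```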
@AlexeyAB input pyramid looks very interesting. I just had an idea. Instead of downsampling by a linear interpolation, perhaps one could simply reshape the pyramid inputs by moving pixels around. For example instead of downsampling (3,512,512) - > (3,256,256) we might be able to reshape (3,512,512) -> (12, 256, 256) with no information loss. I suppose the order would be (rgb,512,512) -> (rrrrggggbbbb,256,256), and it might generalize to any downsampling operation, perhaps in place of the maxpool layers for example in yolov3-tiny or the stride-2 convolutions in yolov3. I'm not aware of any op that does this currently so it would have to be custom built somehow. What do you think? |
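A possible sketch of that reshape (a "space to depth" rearrangement; this helper is only an illustration, not an existing darknet or ultralytics op):

```python
import torch

def space_to_depth(x, r=2):
    # Rearrange (N, C, H, W) -> (N, C*r*r, H//r, W//r) with no information loss:
    # each r x r spatial block becomes r*r extra channels, grouped per original
    # channel, so (3, 512, 512) -> (12, 256, 256) ordered as rrrrggggbbbb.
    n, c, h, w = x.shape
    x = x.view(n, c, h // r, r, w // r, r)        # split H and W into blocks of r
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()  # move the r x r offsets next to C
    return x.view(n, c * r * r, h // r, w // r)

print(space_to_depth(torch.randn(1, 3, 512, 512)).shape)  # torch.Size([1, 12, 256, 256])
```

The darknet reorg layer and TensorFlow's space-to-depth op, which come up later in the thread, perform essentially this kind of rearrangement.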
About losing information:
|
|
@WongKinYiu If you want you can train csdarknet53-omega-mi-ip.cfg.txt |
@AlexeyAB currently 24k epoch. |
|
@AlexeyAB that's a very good point about reshaping layers, that the regression may be more sensitive to small positional changes of the data than the obj/cls. I've always been worried about the loss of small spatial information as the image downsizes, especially for the smallest objects in P3. That's also a very good point that downsampling operations are all 3x3 kernels, so they overlap enough that not much information should be lost. I'll try to experiment a bit with injecting the input pyramid in different areas. One interesting thing about the Samsung paper was that they used a 7x7 kernel for the first convolution layer. I see this does not significantly affect the parameter count or the FLOPS, but I also see the mixnet/efficientnet guys do not do this (they have 3x3 on conv0 like dn53), and I'm sure they must have experimented with it. |
Conv 5x5/7x7/9x9 should be used for stride=2 layers rather than for the 1st layer. The MixNet authors said: "for layers with stride 2, a larger kernel can significantly improve the accuracy": https://arxiv.org/pdf/1907.09595v3.pdf
|
Hello, the reshape (reorg) layer is usually used in depth prediction and semantic segmentation models. The process of "reorg/reversed reorg" is usually called the "spatial to depth (channel) / depth (channel) to spatial" TensorFlow layer. I have changed the downsampling/upsampling layers of Elastic to reorg/reversed reorg, and it got a small accuracy improvement. Also, there is a paper in CVPR 2020 that uses this technique, for your reference: MuxConv. |
Did you use |
@AlexeyAB I used |
CSPResNeXt-50:
- default BoF+MISH: top-1 = 79.8%, top-5 = 95.2%
- csresnext50-ws.cfg.txt: top-1 = 78.7%, top-5 = 94.7% (negative weights, should be used)
- csresnext50-ws-mi2.cfg.txt: top-1 = 79.9%, top-5 = 95.3%
- csresnext50morelayers.cfg.txt: top-1 = 79.4%, top-5 = 95.2% (url)
- csresnext50sub.cfg.txt: top-1 = 79.5%, top-5 = 95.3% (url)
- csresnext50sub-mi2.cfg.txt: top-1 = 79.4%, top-5 = 95.3% (weights)
- csresnext50sub-mo.cfg.txt: top-1 = 79.2%, top-5 = 95.1% (weights)

CSPDarknet-53:
- default BoF+MISH: top-1 = 78.7%, top-5 = 94.8%
- csdarknet53-omega-mi.cfg.txt: top-1 = 78.6%, top-5 = 94.7% (weights)
- csdarknet53-omega-mi-db.cfg.txt: top-1 = 78.4%, top-5 = 94.5%
- csdarknet53-omega-mi-ip.cfg.txt: top-1 = 77.8%, top-5 = 94.3% |
@WongKinYiu |
Implemented weighted-multi_input-[shortcut] layer with weights-normalization, added:

New [shortcut] can:
- take more than 2 input layers for adding: from = -2, -3 (and -1 by default)
- multiply by weights: per_feature (per_layer) or per_channel
- normalize weights by using: avg_relu or softmax
The simplest example: yolov3-tiny_3l_shortcut_multilayer_per_feature_softmax.cfg.txt
https://arxiv.org/abs/1911.09070v1
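Pulling the options above together, a rough Python sketch of what the forward pass of such a weighted multi-input [shortcut] computes (illustration only; the real implementation is the darknet C code, and the avg_relu semantics below, ReLU followed by normalization by the sum, are my assumption):

```python
import torch
import torch.nn.functional as F

def weighted_shortcut(inputs, w, weights_type="per_feature", normalization="softmax"):
    # Weighted sum of equal-shape layer outputs (e.g. the -1, -2, -3 layers).
    # w: raw weights, shape (n_inputs,) for per_feature or (n_inputs, C) for per_channel.
    if normalization == "softmax":
        a = F.softmax(w, dim=0)                               # positive, sums to 1
    elif normalization == "avg_relu":                         # assumed semantics
        w_pos = F.relu(w)
        a = w_pos / (w_pos.sum(dim=0, keepdim=True) + 1e-4)
    else:
        a = w                                                 # no normalization
    out = 0
    for i, x in enumerate(inputs):
        ai = a[i] if weights_type == "per_feature" else a[i].view(1, -1, 1, 1)
        out = out + ai * x
    return out

# usage: three inputs, per_channel weights, softmax normalization
xs = [torch.randn(2, 64, 16, 16) for _ in range(3)]
y = weighted_shortcut(xs, torch.nn.Parameter(torch.zeros(3, 64)), "per_channel", "softmax")
```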