
clarification #3

Open
KaimingHe opened this issue Apr 1, 2017 · 7 comments

Comments

@KaimingHe

Hi Charles,

Thank you for your interest and for implementing Mask R-CNN!

I would like to clarify some descriptions in your Readme (which may suggest a misunderstanding of our work):

"The original work involves two stages, a pyramid Faster-RCNN for object detection and another network (with the same structure) for instance level segmentation."

This is not true. In our original work, object detection and instance segmentation are in one stage. They run in parallel, as two tasks of a multi-task learning network.

I hope this will ease your effort of a correct reproduction.

@parhartanvir

@CharlesShang I cannot find any issues in particular with the repo. If you require any help in a particular direction, let me know.

@CharlesShang
Owner

I must have misunderstood some details in your paper.

"For convenient ablation, RPN is trained separately and does not share features with Mask-RCNN, ..."
(from Section 3.1, Implementation Details)

@CharlesShang
Owner

CharlesShang commented Apr 5, 2017

@parhartanvir
@KaimingHe
Great!!!!
I have some questions.

  • FPN:
  1. In an FPN, RoIs are extracted from multiple layers. In the training stage, we choose some RoIs according to criteria like IoU, fraction of foreground, total number, etc. I'm not sure about these parameters; I guess they are the same as in the FPN paper, 'Feature Pyramid Networks for Object Detection'.
  2. There are several RPNs in the pyramid. When building the losses, should I merge all the RoIs before sampling, or sample RoIs for each RPN and then compute the losses?
  • Mask
  1. In the original paper, Figure 3 (page 4), there are only 80 channels in the mask, but I think it should be 81 because there's another background class.
  • Loss
  1. Per-pixel sigmoid with binary cross-entropy loss: I guess it is
    loss_mask = cross_entropy(sigmoid(x), y)
    where x and y are of shape (28, 28, 81, 2) # using the last axis to denote fg and bg
    Am I right?
  • Training mini-batch >= 2
  1. Since the input images may have different shapes, I guess training on a mini-batch of 2 should work like this: forward-pass several images separately in parallel, compute the average gradients, then update the network.
    Is that right?

Sorry for the delayed reply, just back from a vacation.

@xqms

xqms commented Apr 5, 2017

@CharlesShang: Thanks for your effort to implement this very nice work!

In the original paper, Figure 3 (page 4), there are only 80 channels in the mask, but I think it should be 81 because there's another background class.

As far as I understand, the branch predicts binary segmentation masks for each object class, so there is no need for a background mask.

@parhartanvir

@CharlesShang, I believe the mask should not have a background class. That is because there are K binary masks, one for each of the K classes. Having a background class for the Faster R-CNN / region proposal part makes sense, but since they are not computing the mask loss between classes, a background mask is not needed (a minimal sketch of this loss follows).
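
For what it's worth, a minimal NumPy sketch of that loss, under assumed names and shapes (the 28x28 resolution, mask_loss, and its arguments are all illustrative, not this repo's API): K binary mask channels, with the per-pixel sigmoid cross-entropy computed only on the ground-truth class channel of each positive RoI.

    import numpy as np

    def mask_loss(mask_logits, gt_masks, gt_classes):
        # mask_logits: (N, 28, 28, K) raw scores, one channel per class
        # gt_masks:    (N, 28, 28) binary targets for the N positive RoIs
        # gt_classes:  (N,) ground-truth class index (0..K-1) per RoI
        n = mask_logits.shape[0]
        # Pick only the ground-truth class channel for each RoI; the
        # other K-1 channels contribute no loss.
        logits = mask_logits[np.arange(n), :, :, gt_classes]  # (N, 28, 28)
        probs = 1.0 / (1.0 + np.exp(-logits))                 # per-pixel sigmoid
        eps = 1e-7                                            # numerical safety
        bce = -(gt_masks * np.log(probs + eps)
                + (1.0 - gt_masks) * np.log(1.0 - probs + eps))
        return bce.mean()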

As far as training goes, I think what you are saying is right, i.e. forward- and backward-pass each image separately, average the per-image gradients, then apply a single update (a toy sketch follows).
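
A toy, self-contained sketch of that scheme; the one-parameter "model" is purely illustrative and assumes nothing about this repo. Each differently sized image gets its own forward/backward pass, and the averaged gradient drives one update.

    import numpy as np

    def loss_and_grad(scale, image, target):
        # Toy stand-in model: predict `target` as scale * mean pixel value.
        pred = scale * image.mean()
        err = pred - target
        return 0.5 * err ** 2, err * image.mean()   # loss, d(loss)/d(scale)

    def train_step(scale, images, targets, lr=0.01):
        # Forward/backward each image separately (shapes may differ),
        # then average the per-image gradients and update once.
        grads = [loss_and_grad(scale, im, t)[1] for im, t in zip(images, targets)]
        return scale - lr * np.mean(grads)

    images = [np.random.rand(480, 640), np.random.rand(600, 800)]  # different shapes
    scale = train_step(1.0, images, targets=[0.5, 0.7])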

I apologize, I haven't gone through the FPN paper yet. I'll go through it and see if I can help.

@CharlesShang
Owner

CharlesShang commented Apr 7, 2017

@parhartanvir
@xqms
Thank you for your explanations.
I think there's little difference. Consider an example: segmenting an RoI of a horse.
In the refinement stage, we already know it's a horse, so we just check the horse channel of the masks; pixels with probability greater than 0.5 are considered horse, otherwise background. In this process, the background channel is never used at either training or testing time, since only positive RoIs are extracted for training and testing.

For consistency, I'll adopt K+1 classes, so we don't need to subtract 1 when we extract masks (a small sketch follows).
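
A minimal sketch of that K+1 extraction at test time. All names and shapes are illustrative assumptions, and the class id comes from the box head:

    import numpy as np

    def extract_mask(mask_probs, cls_id, threshold=0.5):
        # mask_probs: (28, 28, K+1) sigmoid outputs for one RoI,
        # with channel 0 reserved for the (never used) background.
        # cls_id: predicted class from the box head (1..K).
        return mask_probs[:, :, cls_id] > threshold   # binary object/bg mask

    probs = np.random.rand(28, 28, 81)           # e.g. 80 COCO classes + bg
    horse_mask = extract_mask(probs, cls_id=19)  # hypothetical "horse" channel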

@xmyqsh

xmyqsh commented Apr 17, 2017

@CharlesShang

There are several RPNs in the pyramid. When building the losses, should I merge all the RoIs before sampling, or sample RoIs for each RPN and then compute the losses?

I have gone over the FPN paper. I think just one RPN is OK: anchor_target_layer takes P2 through P5 as inputs, generates anchors at each level, merges them together, and randomly samples inside the merged set; the normal proposal_layer and proposal_target_layer follow (a rough sketch is below).
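
A rough NumPy sketch of that single-RPN sampling. The batch size of 256 and the up-to-1:1 fg/bg ratio are assumptions carried over from Faster R-CNN, and the function name is illustrative:

    import numpy as np

    def sample_merged_anchors(anchors_per_level, labels_per_level, batch=256):
        # Merge the anchors (and their fg/bg labels) from P2..P5 first,
        # then sample once over the merged set.
        anchors = np.concatenate(anchors_per_level)   # (sum_i N_i, 4)
        labels = np.concatenate(labels_per_level)     # 1 = fg, 0 = bg
        fg = np.flatnonzero(labels == 1)
        bg = np.flatnonzero(labels == 0)
        n_fg = min(len(fg), batch // 2)
        n_bg = min(len(bg), batch - n_fg)
        keep = np.concatenate([
            np.random.choice(fg, n_fg, replace=False),
            np.random.choice(bg, n_bg, replace=False),
        ])
        return anchors[keep], labels[keep]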

For the RoI pooling, assign each RoI of width w and height h (on the input image to the network) to the level P_k of the feature pyramid by Eqn. (1) of the FPN paper (see the sketch below).
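
For reference, Eqn. (1) of the FPN paper is k = floor(k0 + log2(sqrt(w*h) / 224)) with k0 = 4. A small sketch, with clamping to the available levels:

    import numpy as np

    def roi_level(w, h, k0=4, k_min=2, k_max=5):
        # Eqn. (1): k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to P2..P5.
        k = int(np.floor(k0 + np.log2(np.sqrt(w * h) / 224.0)))
        return min(max(k, k_min), k_max)

    roi_level(224, 224)   # -> 4: a canonical 224x224 RoI maps to P4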

I think four RPNs followed by four heads would be inelegant and time-consuming, and it would be hard to trade off the four parts.
