
clarification #3

Open
KaimingHe opened this issue Apr 1, 2017 · 7 comments

Comments

@KaimingHe

Hi Charles,

Thank you for your interest and for implementing Mask R-CNN!

I would like to clarify some descriptions in your Readme (which may suggest a misunderstanding of our work):

"The original work involves two stages, a pyramid Faster-RCNN for object detection and another network (with the same structure) for instance level segmentation."

This is not true. In our original work, object detection and instance segmentation are in one stage. They run in parallel, as two tasks of a multi-task learning network.

I hope this will ease your effort of a correct reproduction.

@parhartanvir

@CharlesShang I cannot find any issues in particular with the repo. If you require any help in a particular direction, let me know.

@CharlesShang
Owner

I must have misunderstood some details in your paper.

"For convenient ablation, RPN is trained separately and does not share features with Mask-RCNN, ..."
(from Section 3.1, Implementation Details)

@CharlesShang
Owner

CharlesShang commented Apr 5, 2017

@parhartanvir
@KaimingHe
Great!!!!
I have some questions.

  • FPN:
  1. In an FPN, RoIs are extracted from multiple layers. In the training stage, we choose some RoIs according to criteria like IoU, fraction of foreground, total number, etc. I'm not sure about these parameters; I guess they are the same as in the FPN paper, 'Feature Pyramid Networks for Object Detection'.
  2. There are several RPNs in the pyramid. When building the losses, should I merge all the RoIs before sampling, or sample RoIs for each RPN and then compute the losses?
  • Mask
  1. In the original paper, Figure 3 (page 4), there are only 80 channels in the mask, but I think it should be 81 because there's another background class.
  • Loss
  1. Per-pixel sigmoid with binary cross-entropy loss: I guess it is
    loss_mask = cross_entropy(sigmoid(x), y)
    where x and y are of shape (28, 28, 81, 2) # using the last axis to denote fg and bg
    Am I right?
  • Training mini-batch >= 2
  1. Since the input images may have different shapes, I guess training on a mini-batch of 2 should work like this: forward-pass several images separately in parallel, compute the average gradients, then update the network.
    Is that right?

Sorry for the delayed reply, just back from a vacation.

@xqms

xqms commented Apr 5, 2017

@CharlesShang: Thanks for your effort to implement this very nice work!

In the original paper, Figure 3 (page 4), there are only 80 channels in the mask, but I think it should be 81 because there's another background class.

As far as I understand, the branch predicts binary segmentation masks for each object class, so there is no need for a background mask.

@parhartanvir

@CharlesShang, I believe the mask should not have a background class. That is because there are K binary masks, one for each of the K classes. Having a background class for the Faster R-CNN / region proposal part makes sense, but since they are not computing the mask loss between classes, a background mask is not needed (a minimal sketch of this loss follows).
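
For what it's worth, a minimal NumPy sketch of that loss, under assumed names and shapes (the 28x28 resolution, mask_loss, and its arguments are all illustrative, not this repo's API): K binary mask channels, with the per-pixel sigmoid cross-entropy computed only on the ground-truth class channel of each positive RoI.

    import numpy as np

    def mask_loss(mask_logits, gt_masks, gt_classes):
        # mask_logits: (N, 28, 28, K) raw scores, one channel per class
        # gt_masks:    (N, 28, 28) binary targets for the N positive RoIs
        # gt_classes:  (N,) ground-truth class index (0..K-1) per RoI
        n = mask_logits.shape[0]
        # Pick only the ground-truth class channel for each RoI; the
        # other K-1 channels contribute no loss.
        logits = mask_logits[np.arange(n), :, :, gt_classes]  # (N, 28, 28)
        probs = 1.0 / (1.0 + np.exp(-logits))                 # per-pixel sigmoid
        eps = 1e-7                                            # numerical safety
        bce = -(gt_masks * np.log(probs + eps)
                + (1.0 - gt_masks) * np.log(1.0 - probs + eps))
        return bce.mean()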

As far as training goes, I think what you are saying is right, i.e. forward- and backward-pass each image separately, average the per-image gradients, then apply a single update (a toy sketch follows).
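
A toy, self-contained sketch of that scheme; the one-parameter "model" is purely illustrative and assumes nothing about this repo. Each differently sized image gets its own forward/backward pass, and the averaged gradient drives one update.

    import numpy as np

    def loss_and_grad(scale, image, target):
        # Toy stand-in model: predict `target` as scale * mean pixel value.
        pred = scale * image.mean()
        err = pred - target
        return 0.5 * err ** 2, err * image.mean()   # loss, d(loss)/d(scale)

    def train_step(scale, images, targets, lr=0.01):
        # Forward/backward each image separately (shapes may differ),
        # then average the per-image gradients and update once.
        grads = [loss_and_grad(scale, im, t)[1] for im, t in zip(images, targets)]
        return scale - lr * np.mean(grads)

    images = [np.random.rand(480, 640), np.random.rand(600, 800)]  # different shapes
    scale = train_step(1.0, images, targets=[0.5, 0.7])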

I apologize, I haven't gone through the FPN paper yet. I'll go through it and see if I can help.

@CharlesShang
Owner

CharlesShang commented Apr 7, 2017

@parhartanvir
@xqms
Thank you for your explanations.
I think there's little difference. Consider an example: segmenting an RoI of a horse.
In the refinement stage, we already know it's a horse, so we just check the horse channel of the masks; pixels with probability greater than 0.5 are considered horse, otherwise background. In this process, the background channel is never used at either training or testing time, since only positive RoIs are extracted for training and testing.

For consistency, I'll adopt K+1 classes, so we don't need to subtract 1 when we extract masks (a small sketch follows).
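
A minimal sketch of that K+1 extraction at test time. All names and shapes are illustrative assumptions, and the class id comes from the box head:

    import numpy as np

    def extract_mask(mask_probs, cls_id, threshold=0.5):
        # mask_probs: (28, 28, K+1) sigmoid outputs for one RoI,
        # with channel 0 reserved for the (never used) background.
        # cls_id: predicted class from the box head (1..K).
        return mask_probs[:, :, cls_id] > threshold   # binary object/bg mask

    probs = np.random.rand(28, 28, 81)           # e.g. 80 COCO classes + bg
    horse_mask = extract_mask(probs, cls_id=19)  # hypothetical "horse" channel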

@xmyqsh

xmyqsh commented Apr 17, 2017

@CharlesShang

There are several RPNs in the pyramid. When building the losses, should I merge all the RoIs before sampling, or sample RoIs for each RPN and then compute the losses?

I have gone over the FPN paper. I think just one RPN is OK: anchor_target_layer takes P2 through P5 as inputs, generates anchors at each level, merges them together, and randomly samples inside the merged set; the normal proposal_layer and proposal_target_layer follow (a rough sketch is below).
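
A rough NumPy sketch of that single-RPN sampling. The batch size of 256 and the up-to-1:1 fg/bg ratio are assumptions carried over from Faster R-CNN, and the function name is illustrative:

    import numpy as np

    def sample_merged_anchors(anchors_per_level, labels_per_level, batch=256):
        # Merge the anchors (and their fg/bg labels) from P2..P5 first,
        # then sample once over the merged set.
        anchors = np.concatenate(anchors_per_level)   # (sum_i N_i, 4)
        labels = np.concatenate(labels_per_level)     # 1 = fg, 0 = bg
        fg = np.flatnonzero(labels == 1)
        bg = np.flatnonzero(labels == 0)
        n_fg = min(len(fg), batch // 2)
        n_bg = min(len(bg), batch - n_fg)
        keep = np.concatenate([
            np.random.choice(fg, n_fg, replace=False),
            np.random.choice(bg, n_bg, replace=False),
        ])
        return anchors[keep], labels[keep]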

For the RoI pooling, assign each RoI of width w and height h (on the input image to the network) to the level P_k of the feature pyramid by Eqn. (1) of the FPN paper (see the sketch below).
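
For reference, Eqn. (1) of the FPN paper is k = floor(k0 + log2(sqrt(w*h) / 224)) with k0 = 4. A small sketch, with clamping to the available levels:

    import numpy as np

    def roi_level(w, h, k0=4, k_min=2, k_max=5):
        # Eqn. (1): k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to P2..P5.
        k = int(np.floor(k0 + np.log2(np.sqrt(w * h) / 224.0)))
        return min(max(k, k_min), k_max)

    roi_level(224, 224)   # -> 4: a canonical 224x224 RoI maps to P4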

I think four RPNs followed by four heads would be inelegant and time-consuming, and it would be hard to trade off the four parts.
