
About Multi-GPU version #17

Open

santisy opened this issue Apr 12, 2017 · 2 comments

Comments

santisy (Contributor) commented Apr 12, 2017

Hello everyone,
I am trying to write a multi-GPU training script for this repository, following the CIFAR-10 multi-GPU example (cifar-multi-gpu). However, opt.compute_gradients always ends up returning None for the gradients:

    import tensorflow as tf
    import tensorflow.contrib.slim as slim
    from tensorflow.contrib.slim.nets import resnet_v1

    tower_grads = []
    with tf.variable_scope(tf.get_variable_scope()) as tower_graph:
      for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):

          # should this be inside or outside the loop? (the input pipeline)

          with tf.name_scope('%s_%d' % ('tower', i)) as scope:

            # image, ih, iw, gt_boxes, gt_masks, num_instances, img_id
            input_list = coco.read(file_name_list)
            input_list = list(input_list)
            input_list[0], input_list[3], input_list[4] = coco_preprocess.preprocess_image(
                input_list[0], input_list[3], input_list[4], is_training=True)

            with slim.arg_scope(resnet_v1.resnet_arg_scope()):
              logits, end_points = resnet50(input_list[0], 1000, is_training=False)

            loss = tower_loss(scope, input_list, end_points)

            # Reuse variables for the next tower.
            tf.get_variable_scope().reuse_variables()

            # Retain the summaries from the final tower.
            summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)

            # Calculate the gradients for this tower's batch of data.
            grads = opt.compute_gradients(loss)

            # Keep track of the gradients across all towers.
            tower_grads.append(grads)

    # We must calculate the mean of each gradient. Note that this is the
    # synchronization point across all towers.
    grads = average_gradients(tower_grads)
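
In TF 1.x, opt.compute_gradients returns a (None, variable) pair for every variable that is not reachable from loss, so a quick check inside the tower loop (a hypothetical debugging snippet, not part of my script) shows which variables got disconnected:

    # Hypothetical debugging snippet: print variables whose gradient is None,
    # i.e. variables that are not on the path from this tower's `loss`.
    grads = opt.compute_gradients(loss)
    missing = [v.name for g, v in grads if g is None]
    if missing:
        print('Tower %d has no gradient for:' % i)
        for name in missing:
            print('  ' + name)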

The mechanism, as I understand it, is to use name_scope to distinguish the towers so that gradients are computed separately, while reusing the same variables in every tower so they can all be updated at once with the averaged gradients. I think the main problem is resnet50: because each tower has a different name scope, the names in end_points change from tower to tower, so I updated the dictionary lookup by passing in the scope name. Even so, I still cannot get valid gradients. Does anyone have an idea?
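
For reference, the average_gradients helper in the CIFAR-10 multi-GPU example looks roughly like this (paraphrased from cifar10_multi_gpu_train.py; tower_grads is the list of compute_gradients outputs, one entry per tower):

    def average_gradients(tower_grads):
      """Average the gradient of each shared variable across all towers."""
      average_grads = []
      for grad_and_vars in zip(*tower_grads):
        # grad_and_vars is ((grad_gpu0, var_gpu0), ..., (grad_gpuN, var_gpuN)).
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        # Variables are shared across towers, so the first tower's handle is enough.
        average_grads.append((grad, grad_and_vars[0][1]))
      return average_grads

Note that tf.expand_dims(g, 0) raises if g is None, which is why the None gradients above break at the averaging step.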

santisy closed this as completed Apr 14, 2017
venuktan commented

@santisy how did you fix this? I am trying to do the same thing.

santisy reopened this Apr 20, 2017
santisy (Contributor, Author) commented Apr 20, 2017

@venuktan Sorry, I have not solved it.
