
About Multi-GPU version #17

Open

santisy opened this issue Apr 12, 2017 · 2 comments

Comments

santisy (Contributor) commented Apr 12, 2017

Hello everyone,
I am trying to write a multi-GPU training script for this repository, following the CIFAR-10 multi-GPU example (cifar-multi-gpu). However, opt.compute_gradients always ends up returning None for the gradients:

    import tensorflow as tf
    import tensorflow.contrib.slim as slim
    from tensorflow.contrib.slim.nets import resnet_v1

    tower_grads = []
    with tf.variable_scope(tf.get_variable_scope()) as tower_graph:
      for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):

          # should this be inside or outside the loop? (the input pipeline)

          with tf.name_scope('%s_%d' % ('tower', i)) as scope:

            # image, ih, iw, gt_boxes, gt_masks, num_instances, img_id
            input_list = coco.read(file_name_list)
            input_list = list(input_list)
            input_list[0], input_list[3], input_list[4] = coco_preprocess.preprocess_image(
                input_list[0], input_list[3], input_list[4], is_training=True)

            with slim.arg_scope(resnet_v1.resnet_arg_scope()):
              logits, end_points = resnet50(input_list[0], 1000, is_training=False)

            loss = tower_loss(scope, input_list, end_points)

            # Reuse variables for the next tower.
            tf.get_variable_scope().reuse_variables()

            # Retain the summaries from the final tower.
            summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)

            # Calculate the gradients for this tower's batch of data.
            grads = opt.compute_gradients(loss)

            # Keep track of the gradients across all towers.
            tower_grads.append(grads)

    # We must calculate the mean of each gradient. Note that this is the
    # synchronization point across all towers.
    grads = average_gradients(tower_grads)
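
In TF 1.x, opt.compute_gradients returns a (None, variable) pair for every variable that is not reachable from loss, so a quick check inside the tower loop (a hypothetical debugging snippet, not part of my script) shows which variables got disconnected:

    # Hypothetical debugging snippet: print variables whose gradient is None,
    # i.e. variables that are not on the path from this tower's `loss`.
    grads = opt.compute_gradients(loss)
    missing = [v.name for g, v in grads if g is None]
    if missing:
        print('Tower %d has no gradient for:' % i)
        for name in missing:
            print('  ' + name)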

The mechanism, as I understand it, is to use name_scope to distinguish the towers so that gradients are computed separately, while reusing the same variables in every tower so they can all be updated at once with the averaged gradients. I think the main problem is resnet50: because each tower has a different name scope, the names in end_points change from tower to tower, so I updated the dictionary lookup by passing in the scope name. Even so, I still cannot get valid gradients. Does anyone have an idea?
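
For reference, the average_gradients helper in the CIFAR-10 multi-GPU example looks roughly like this (paraphrased from cifar10_multi_gpu_train.py; tower_grads is the list of compute_gradients outputs, one entry per tower):

    def average_gradients(tower_grads):
      """Average the gradient of each shared variable across all towers."""
      average_grads = []
      for grad_and_vars in zip(*tower_grads):
        # grad_and_vars is ((grad_gpu0, var_gpu0), ..., (grad_gpuN, var_gpuN)).
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        # Variables are shared across towers, so the first tower's handle is enough.
        average_grads.append((grad, grad_and_vars[0][1]))
      return average_grads

Note that tf.expand_dims(g, 0) raises if g is None, which is why the None gradients above break at the averaging step.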

santisy closed this as completed Apr 14, 2017
venuktan commented

@santisy how did you fix this? I am trying to do the same thing.

santisy reopened this Apr 20, 2017
santisy (Contributor, Author) commented Apr 20, 2017

@venuktan Sorry, I have not solved it.
