
CloneFunction


CloneFunction() is a function for editing and creating models. It copies a part of a model's network into a BrainScript function, so that this part of the network can be reused. The result is a BrainScript function that can be used as if this section of the network had been defined inside a regular BrainScript function.

The originating network can be a separate network. This makes it possible to import (part of) an external network that was trained on different data. CloneFunction() can freeze the model parameters of the clone, which allows an external network to be used as a fixed feature extractor, or to act as a regularizer in an adaptation setting.

The originating network can also be a section of the one currently being defined, and the clone can share its parameters with the original. This allows multiple identical paths through the network that operate on different data, for example in setups that symmetrically compare the similarity of two inputs, where the feature-extracting layers are shared (and learned jointly) for both inputs.
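
For illustration, here is a minimal sketch of such a shared setup. It assumes a subnetwork mapping imageA to a node embeddingA has already been defined in the current script; the names, the input dimension, and the use of CosDistance to compare the two branches are illustrative assumptions, not part of CloneFunction() itself:

    # two inputs that should be processed by the same feature extractor
    imageA = Input (784)
    imageB = Input (784)
    # ... embeddingA = some subnetwork applied to imageA, defined here ...
    # clone that subnetwork; parameters="shared" makes the clone reuse (and jointly train) the original parameters
    embed = CloneFunction (imageA, embeddingA, parameters="shared")
    embeddingB = embed (imageB)
    # compare the two embeddings, e.g. by cosine distance
    sim = CosDistance (embeddingA, embeddingB)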

The section to be copied is defined by its input and output nodes. Imagine a network plot in which a line is drawn around the subsection to be cloned: all connections that cross into the marked area are specified as inputNodes, and all that leave it as outputNodes. CloneFunction() extracts this section into a BrainScript function whose number of parameters equals the number of inputNodes and whose output is either a single node or a dictionary of nodes.

CloneFunction() has the following syntax:

CloneFunction (inputNodes, outputNodes, parameters="learnable" /*|"constant"|"shared"*/)

Where:

  • inputNodes is an array of 1 or more inputs. When the resulting BrainScript function is called, the values passed as its arguments are substituted for these nodes.
  • outputNodes is either a single output node or a record of multiple output nodes. These denote what the BrainScript function will return.
  • parameters determines how the learnable parameters inside the cloned section are treated. Three values are supported (a sketch of the "constant" case follows this list): "learnable" (the default) gives the clone its own copy of every learnable parameter, which is then updated by subsequent training like any other model parameter; "constant" also copies the parameters but freezes them, so the clone can serve as a fixed, read-only feature extractor; "shared" makes the clone reuse the original parameters, so that original and clone are learned jointly.
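
As a concrete sketch of the "constant" case, the following loads an external network and uses part of it as a frozen feature extractor; the file name, the node name network.h2, and the input dimension are illustrative assumptions:

    # load a previously trained network
    network = BS.Network.Load ("pretrained.dnn")
    # clone the section from its features input up to some hidden layer, with frozen parameters
    featExtract = CloneFunction (network.features, network.h2, parameters="constant")
    # apply the frozen feature extractor to the features of the current task
    features = Input (363)
    h = featExtract (features)
    # trainable layers for the new task can now be stacked on top of h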

The behavior and further examples are documented in the source-code comment of CloneFunctionConfigLambda, which implements CloneFunction(); it is reproduced here:

// ===================================================================
// CloneFunctionConfigLambda -- lambda to produce a clone of a network
//  - creates a BrainScript function that carbon-copies a subsection of an existing network
//  - the copy can be shallow or deep, where a deep copy gets its own copy of LearnableParameters
//     - a shallow copy (parameters="shared") is a copy of all nodes that depend on the specified input(s),
//       while all other nodes are shared from the original network section
//     - a deep copy (parameters="learnable" or "constant") also copies all reachable LearnableParameters and their dependents
//     - Input() nodes not listed as inputNodes are always shared
//  - the source network may be a different network, e.g. loaded with BS.Network.Load()
//  - a deep copy can be read-only (parameters="constant")
//     - Note: multiple uses of the lambda will not share read-only parameters. This is trickier to implement than one might expect.
//  - example use cases:
//     - adaptation (KL): a frozen read-only copy of the starting model is used as a KL-regularizer
//     - adaptation (DLR): an injected input transform is trained while the network is fixed
//     - image: lower layers of ImageNet networks serve as immutable feature extractors for another image task
//     - DSSM: applying the same network subsection to two inputs
// Usage:
//    f = CloneFunction (inputNodes, outputNodes, parameters="learnable" /*|"constant"|"shared"*/)
// Parameters:
//    - inputNodes:  single node or array of nodes that will become parameters of the function.
//                   Commonly, this list will include all Input()s that the outputNode(s) depend on.
//    - outputNodes: single node or dictionary of nodes that the function will emit
// Example:
//    # create a BS function by copying a piece of network
//    net = CloneFunction (network.features, network.logP)
//    # apply the copy to a new input
//    out = net (myFeatures)
//    # This will create a copy of the subsection from network.features to network.logP
//    # where all links to network.features get replaced by links to myFeatures.
// Example with multiple input and output nodes:
//    # create a BS function by copying a piece of network
//    # This specific example converts a network back into a BrainScript function.
//    # It passes two input nodes --> the BS function will have 2 inputs;
//    # and it passes a record of output nodes --> the BS function will return a record with the same member names
//    network = BS.Network.Load ("some.dnn")
//    net = CloneFunction ((network.features:network.labels), [ ce = network.ce ; errs = network.errs ])
//    # create a network from the BS function
//    features = Input (13)
//    labels = Input (42)
//    out = net (features, labels)
//    criterionNodes = (out.ce)
//    evaluationNodes = (out.errs)
// A specific example: Adapting a network, while using the original network as a regularizer (KLD)
//    # load network
//    network = BS.Network.Load ("some.dnn")
//    # create a trainable clone and a read-only reference clone
//    adaptNet = CloneFunction (network.features, [ z = network.z ], parameters="learnable")
//    # create a read-only clone
//    refNet = CloneFunction (network.features, [ z = network.z ], parameters="constant")
//    # create the main network
//    features = Input (42)
//    labels = Input (9000)
//    z = adaptNet (features).z
//    zRef = refNet (features).z
//    # training criterion
//    refWeight = 0.9
//    kldLabels = labels * (1-refWeight) + Softmax (zRef) * refWeight  # interpolate with ref output
//    ce = CrossEntropyWithSoftmax (z, kldLabels)
//    errs = ErrorPrediction (z, labels)
//    criterionNodes = (ce)
//    evaluationNodes = (errs)
// ===================================================================

BatchNormalization normalizes its input per minibatch:

m = mean(input)
var = variance(input)
input_norm = (input - m) / sqrt(var)
output = gamma * input_norm + beta

where gamma and beta are trainable parameters (represented as Parameter).

BatchNormalization has the following syntax:

BatchNormalization(input, scale, bias, runMean, runInvStdDev, spatial,
                   normalizationTimeConstant = 0, blendTimeConstant = 0,
                   epsilon = 0.00001,
                   useCntkEngine = true, imageLayout='cudnn', tag='')

Where:

  • input is the input of the batch normalization node.
  • scale is a Parameter that stores the scale vector (the gamma term in the equation above).
  • bias is a Parameter that stores the bias vector (the beta term). scale and bias must have the same dimensions, which must equal the input dimensions when spatial = false, or the number of output convolution feature maps when spatial = true.
  • runMean is the running mean, which is used during the evaluation phase and may be used during training as well. It is represented as a Parameter with the same dimensions as scale and bias.
  • runInvStdDev is the running inverse square root of the variance (so InvStdDev = 1 / sqrt(var + epsilon)). It is represented as a Parameter with the same dimensions as scale and bias.
  • spatial is a flag that specifies whether to compute the mean/variance for each feature in a minibatch independently or, in the case of convolutional layers, per feature map.
  • normalizationTimeConstant is the time constant used to compute the running average of mean and variance. The value 0 (default) means no exponential smoothing: the running mean/variance always hold the values computed for the last seen minibatch. The value 1#INF (infinity) means the running values are "frozen" (i.e. not updated). The appropriate value depends on the dataset and network configuration; for example, for the MNIST dataset you can set it to 1024, and for speech datasets to the number of frames corresponding to a 24-hour period. The constant can also be set globally (in the .cntk config file) using the batchNormalizationTimeConstant parameter, for example: batchNormalizationTimeConstant=0:1024
  • blendTimeConstant is the time constant that specifies how much of the running mean/variance should be "blended" into the mean/variance of the current minibatch. The value 0 (default) means no blending: only the current minibatch statistics are used. The value 1#INF (infinity) means only the running mean/variance is used (this is the case, for example, during the evaluation phase). A typical schedule is to start with 0, then set it to half the minibatch size, and then set it to infinity after several epochs. This can be done with the batchNormalizationBlendTimeConstant option in the .cntk config file: batchNormalizationBlendTimeConstant=0:32*10:1#INF
  • epsilon is a conditioning constant used in computing InvStdDev.
  • useCntkEngine is a boolean flag that selects the batch normalization implementation: the CNTK-native engine (true) or the cuDNN-based one (false).
  • imageLayout is the image layout. Only cudnn is supported.

For more information about time constants and exponential smoothing: https://en.wikipedia.org/wiki/Exponential_smoothing#Time_Constant

Note that for the evaluation stage CNTK sets the time constants automatically; users do not have to change anything to switch between the stages.
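
For illustration, here is a minimal sketch of a non-spatial use of BatchNormalization; the dimension, the input node z, the initialization values, and the use of learningRateMultiplier=0 to keep the running statistics out of gradient updates follow common CNTK example configurations and are assumptions of this sketch:

    # z is assumed to be the 256-dimensional output of a preceding layer
    dim = 256
    scale        = Parameter (dim, 1, init="fixedValue", value=1)
    bias         = Parameter (dim, 1, init="fixedValue", value=0)
    runMean      = Parameter (dim, 1, init="fixedValue", value=0, learningRateMultiplier=0)
    runInvStdDev = Parameter (dim, 1, init="fixedValue", value=0, learningRateMultiplier=0)
    bnZ = BatchNormalization (z, scale, bias, runMean, runInvStdDev, spatial=false,
                              normalizationTimeConstant=1024)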
