The starting point for the CTR method (Coevolution of Task Routing) is the soft-ordering approach to multitask learning. In soft ordering, at each level of the network, activations and gradients are propagated through the same set of modules. However, each module's output and gradients are scaled by multiplicative factors that differ for each level and task. These multipliers and the module weights are learned through gradient descent. As described in Paper 1, CTR extends this idea of task-specific scaling to evolving entirely different network topologies for each task. The intuition is that, for a set of T related tasks, there is some compact set of shared modules that can be assembled in different ways to meet the demands of each task. CTR uses an evolutionary method, based on a (1+1) evolutionary strategy, to discover high-performing designs.
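To make the soft-ordering starting point concrete, here is a minimal sketch of the idea in PyTorch. The class name, layer sizes, and use of a softmax over the per-level multipliers are illustrative assumptions, not the authors' implementation; the point is only that one shared set of modules is reused at every level, with a separate learned scaling coefficient per task, level, and module.

```python
# Illustrative soft-ordering sketch (assumed names and sizes, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftOrderingNet(nn.Module):
    def __init__(self, n_tasks, n_levels=4, n_modules=4, width=64):
        super().__init__()
        # One shared set of modules, reused at every level and for every task.
        self.shared_modules = nn.ModuleList(
            [nn.Linear(width, width) for _ in range(n_modules)]
        )
        # One scaling coefficient per (task, level, module), learned by gradient descent
        # along with the module weights.
        self.scalars = nn.Parameter(torch.zeros(n_tasks, n_levels, n_modules))
        self.n_levels = n_levels

    def forward(self, x, task):
        for level in range(self.n_levels):
            # Per-level, per-task mixture weights over the shared modules.
            weights = F.softmax(self.scalars[task, level], dim=0)
            x = sum(w * F.relu(m(x)) for w, m in zip(weights, self.shared_modules))
        return x
```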
As an example, the image above shows the final routing topologies for a run of CTR on the Omniglot task (see Demo 1.3 for a description of Omniglot). Each topology corresponds to a model for a single task, and all of them use the same set of modules, identified by the different colors. Importantly, the topologies reflect the requirements of each task: similar alphabets have similar topologies (such as Latin and Cyrillic, as highlighted above), the topologies discovered are similar across multiple learning runs, and more complex and unique alphabets have more complex topologies (such as Angelic and Ojibwe).
To make this evolution computationally efficient, CTR trains a single set of shared modules throughout evolution. By shared modules, we mean that their parameters are the same in every location where they are used. In most neural architecture search methods, parameters are randomly reinitialized for each new architecture; in CTR, learned parameters are carried forward during evolution. This is accomplished by jointly training a champion and a challenger model for each task at each meta-iteration, for a total of 2T models trained jointly. For each task, the better of the two becomes the champion of the next meta-iteration. The challenger is generated as a copy of the champion with a new usage of a shared module, initialized to have low effect. Through backpropagation, CTR discovers whether this topological innovation is beneficial.
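The following sketch outlines one such meta-iteration under stated assumptions. The helpers (add_module_usage, train_jointly, validation_score) and the near-zero initialization value are hypothetical stand-ins, not the authors' API; the sketch only mirrors the loop described above: propose a challenger per task, train all 2T models jointly on the shared modules, and keep the better model per task.

```python
# Hedged sketch of a CTR meta-iteration; helper names are assumptions.
import copy
import random

def meta_iteration(champions, shared_modules, tasks):
    challengers = []
    for champ in champions:
        # Challenger = copy of the champion's routing plus one new usage of a
        # shared module, with its scaling parameter starting near zero so the
        # change has little effect until backpropagation proves it useful.
        chall = copy.deepcopy(champ)
        chall.add_module_usage(random.choice(shared_modules), init_scale=1e-3)
        challengers.append(chall)

    # All 2T models train jointly; the shared modules receive gradients from
    # champions and challengers alike, so even rejected challengers help
    # train the shared parameters.
    train_jointly(champions + challengers, shared_modules, tasks)

    # For each task, the better of champion and challenger survives.
    return [
        champ if validation_score(champ) >= validation_score(chall) else chall
        for champ, chall in zip(champions, challengers)
    ]
```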
The animation above shows an example of this process in action. In this experiment, 20 character recognition tasks drawn from the Omniglot dataset were trained jointly; the first eight are shown. The challengers explore a diverse topology space, and the champions grow as the most valuable innovations are found.
For a closer look, the animation above shows the effect of a useful topological innovation. By adding a new module usage at the second level of the network, the model learns over the next several iterations to distinguish a class of qualitatively similar Hebrew characters. Note that it does this by adding only a single learned parameter to the model. The newly learned characters are characterized by sharp top-left and top-right corners, so it makes sense that adding a second-level module usage would help, since this is the level at which convolutional networks commonly detect corner features.
Note that in CTR, even if a challenger is not accepted, it can have a significant impact on evolution, since it helps train the set of shared modules. This observation suggests a more optimistic evolutionary perspective than “survival of the fittest”. In other words, CTR can be viewed from an ecological perspective: models compete within a species (task) and develop mutualistic relationships across species while shaping and exploiting a shared set of resources (modules). The point is that by rethinking the structure of evolutionary design, i.e., by coupling evolution more tightly with backpropagation, we can effectively tackle the vast design space of complex multitask architectures.