"ByteNet" is an architecture for neural machine translation that translates in linear time and can handle dependencies over large distances. It consists of two networks, a source network (encoder) and a target network (decoder).
The source network is formed of one-dimensional convolutional layers that use dilation.
The target network (the "ByteNet Decoder") is formed of one-dimensional convolutional layers that use dilation and are masked. It is "stacked" on the source network and generates variable-length outputs via "dynamic unfolding".
The representation generated by the source network has the same length as the source sequence.
At each step, the target network takes the corresponding column from the source representation and generates an output. This continues until an end-of-sequence (EOS) symbol is produced by the target network. The source representation is automatically zero-padded as the steps go beyond its length and the output is conditioned on the source and target representations accumulated thus far.
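The unfolding loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the ByteNet implementation: `dynamic_unfold`, `toy_step`, the `EOS` id of 0 and the `max_steps` safety cap are hypothetical names and values, and `toy_step` stands in for the real masked convolutional decoder.

```python
from typing import Callable, List, Sequence

EOS = 0  # hypothetical end-of-sequence token id (assumption)


def dynamic_unfold(
    source_repr: Sequence[Sequence[float]],
    step_fn: Callable[[Sequence[float], List[int]], int],
    max_steps: int = 100,
) -> List[int]:
    """Generate tokens until EOS, reading one source column per step.

    source_repr holds one column per source position. Once the step index
    runs past the source length, an all-zero column is fed instead.
    """
    width = len(source_repr[0])
    zero_column = [0.0] * width
    outputs: List[int] = []
    for t in range(max_steps):  # max_steps is a safety cap, not part of the method
        column = source_repr[t] if t < len(source_repr) else zero_column
        # The step sees the current source column and all tokens emitted so far.
        token = step_fn(column, outputs)
        outputs.append(token)
        if token == EOS:
            break
    return outputs


def toy_step(column: Sequence[float], previous: List[int]) -> int:
    """Stand-in decoder: emit EOS on a zero column once 4 tokens exist."""
    if all(v == 0.0 for v in column) and len(previous) >= 4:
        return EOS
    return len(previous) + 1


source = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 source positions
result = dynamic_unfold(source, toy_step)
print(result)  # [1, 2, 3, 4, 0]
```

Here the output (5 tokens) is longer than the source (3 columns), so the last two steps read zero-padded columns.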
Dilated convolutions (also called à trous, "with holes") are a way of integrating knowledge of a larger area (e.g. the global context of an image) while only linearly increasing the number of parameters.
For a dilation size
Like pooling or strided convolutions, dilation enlarges the receptive field, but it produces an output with the same size as the input.
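A minimal sketch of a one-dimensional dilated convolution may make the "same size" property concrete. The function name `dilated_conv1d`, the zero-padding scheme and the `causal` flag are assumptions made for illustration. With `causal=True` each output only sees the current and earlier positions, which is one way to realise the masking used in the target network.

```python
from typing import List, Sequence


def dilated_conv1d(
    x: Sequence[float],
    kernel: Sequence[float],
    dilation: int = 1,
    causal: bool = False,
) -> List[float]:
    """1-D dilated convolution (cross-correlation) with zero padding.

    Kernel taps are spaced `dilation` positions apart, so a kernel with k
    weights covers dilation * (k - 1) + 1 input positions. The output has
    the same length as the input.
    """
    k = len(kernel)
    span = dilation * (k - 1)  # distance between first and last tap
    # Causal: the last tap sits on position i. Otherwise: centred on i.
    left = span if causal else span // 2
    out = []
    for i in range(len(x)):
        total = 0.0
        for j, w in enumerate(kernel):
            idx = i - left + j * dilation
            if 0 <= idx < len(x):  # positions outside the input count as zero
                total += w * x[idx]
        out.append(total)
    return out


x = [1, 2, 3, 4, 5, 6]
print(dilated_conv1d(x, [1, 1, 1], dilation=2))
# [4.0, 6.0, 9.0, 12.0, 8.0, 10.0]
print(dilated_conv1d(x, [1, 1, 1], dilation=2, causal=True))
# [1.0, 2.0, 4.0, 6.0, 9.0, 12.0]
```

With three weights and a dilation of 2, each output covers five input positions, while the parameter count stays at three.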
A deep residual network takes a convolutional neural network and adds shortcuts that take the input of an earlier layer and pass it directly to a later addition node of the network. The unit containing this shortcut is a "residual block". The result is that the other layer feeding into the addition node (in the example below, it is the second convolution layer) just adds a "residual" to the repeated input.
Because the shortcut gives gradients a direct path back to earlier layers, this can help with the vanishing gradient problem that occurs in deep networks.
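The shortcut-plus-addition structure can be sketched as follows. This is a toy illustration under assumptions: `conv1d_same`, `relu` and `residual_block` are hypothetical helper names, the two-layer layout with a ReLU in between is one common arrangement, and normalisation layers or an activation after the addition are left out.

```python
from typing import List, Sequence


def conv1d_same(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Plain 1-D convolution with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        total = 0.0
        for j, w in enumerate(kernel):
            idx = i - half + j
            if 0 <= idx < len(x):
                total += w * x[idx]
        out.append(total)
    return out


def relu(v: Sequence[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def residual_block(
    x: Sequence[float], kernel1: Sequence[float], kernel2: Sequence[float]
) -> List[float]:
    """Two convolutions whose result is added to the unchanged input."""
    hidden = relu(conv1d_same(x, kernel1))
    residual = conv1d_same(hidden, kernel2)  # second layer produces the residual
    # Shortcut: the input skips both layers and meets the residual at the addition.
    return [a + r for a, r in zip(x, residual)]


x = [1.0, -2.0, 3.0]
# If the second layer outputs zeros, the block reduces to the identity.
print(residual_block(x, [0.5, 1.0, 0.5], [0.0, 0.0, 0.0]))  # [1.0, -2.0, 3.0]
print(residual_block(x, [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]))  # [2.0, -2.0, 6.0]
```

The first call shows why the layers only need to learn a correction: with an all-zero residual, the block passes its input through unchanged.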
Another example of a residual block, from Deep Residual Learning for Image Recognition: