we make a Markov assumption on the context: the model conditions only on the last $C$ output words, $y_c = y_{[i-C+1, \dots, i]}$
for $i < 1$, $y_i$ is a special start symbol
The focus is to model $p(y_{i+1}|x,y_c; \theta)$, which is parameterized directly by a neural network together with an encoder.
The model:
$$
\begin{aligned}
p(y_{i+1}|x,y_c; \theta) &\varpropto \exp(Vh + W \text{enc}(x, y_c)) \\
\bar y_c &= [E y_{i-C+1}, \dots, E y_i] \\
h &= \tanh(U \bar y_c)
\end{aligned}
$$
where:
- $E$ is the word embedding matrix
- $U, V, W$ are weight matrices
- $h$ is a hidden layer of size $H$
Various encoders were tried for $\text{enc}$:
$\text{enc}_1$: a bag-of-words encoder: embed the input sentence to size $H$, ignoring ordering
$$
\begin{aligned}
\text{enc}_1 &= p^T \bar x \\
p &= [1/M, \dots, 1/M] \\
\bar x &= [Fx_1, \dots, F x_M]
\end{aligned}
$$
where:
$F$ is an embedding matrix
$p$ is a uniform distribution over the $M$ input words
this encoder can capture relative importance of words but is limited in representing contiguous phrases.
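the bag-of-words encoder can be sketched in a few lines of NumPy (toy sizes and random parameters; the matrix shapes here are my assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
H, V, M = 4, 10, 5               # hidden size, vocab size, input length

F = rng.normal(size=(H, V))      # embedding matrix (columns are word embeddings)
x = rng.integers(0, V, size=M)   # input sentence as word ids

def enc1(x, F):
    """Bag-of-words encoder: uniformly average the input word embeddings."""
    x_bar = F[:, x]                  # [F x_1, ..., F x_M], shape (H, M)
    m = x_bar.shape[1]
    p = np.full(m, 1.0 / m)          # uniform distribution over input words
    return x_bar @ p                 # size-H encoding, ignores word order

z = enc1(x, F)
```

since $p$ is uniform this is just the mean embedding, so permuting the input words leaves $\text{enc}_1$ unchanged.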
$\text{enc}_2$: convolutional encoder: a time-delay neural network (TDNN) architecture, alternating between temporal convolution layers and max-pooling layers (see paper for details)
$\text{enc}_3$: attention-based encoder: takes context $y_c$ into account:
$$
\begin{aligned}
\text{enc}_3 &= p^T \bar x \\
p &\varpropto \exp(\bar x P \bar{y_c'}) \\
\bar x &= [Fx_1, \dots, F x_M] \\
\bar{y_c'} &= [G y_{i-C+1}, \dots, G y_i] \\
\forall i \quad \bar x_i &= \sum_{q=i-Q}^{i+Q} \frac{\bar x_q}{Q}
\end{aligned}
$$
where:
$G$ is an embedding of the context
$P$ is a new weight matrix parameter mapping between the context embedding and the input embedding
$Q$ is a smoothing window
so the uniform distribution from the bag-of-words model is replaced with a learned soft alignment $p$ between the input and the summary (parameterized by $P$)
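a rough NumPy sketch of the attention-based encoder (toy shapes; in the paper $P$ maps the full context embedding, whereas here I average the context embeddings first, and the edge-padding used for smoothing is my guess):

```python
import numpy as np

rng = np.random.default_rng(1)
H, V, M, C, Q = 4, 10, 6, 3, 1    # hidden size, vocab, input length, context, window

F = rng.normal(size=(H, V))       # input-side embedding matrix
G = rng.normal(size=(H, V))       # context-side embedding matrix
P = rng.normal(size=(H, H))       # learned alignment weights (shape simplified)
x = rng.integers(0, V, size=M)    # input word ids
y_c = rng.integers(0, V, size=C)  # context word ids (last C summary words)

def enc3(x, y_c, F, G, P, Q):
    x_bar = F[:, x]                                   # (H, M) input embeddings
    y_bar = G[:, y_c].mean(axis=1)                    # averaged context embedding
    scores = x_bar.T @ P @ y_bar                      # one score per input position
    p = np.exp(scores - scores.max())
    p /= p.sum()                                      # soft alignment over positions
    # local smoothing: average each embedding over a window of neighbors
    x_pad = np.pad(x_bar, ((0, 0), (Q, Q)), mode="edge")
    x_smooth = np.stack(
        [x_pad[:, i:i + 2 * Q + 1].mean(axis=1) for i in range(M)], axis=1)
    return x_smooth @ p                               # attention-weighted sum, size H

z = enc3(x, y_c, F, G, P, Q)
```

unlike $\text{enc}_1$, the weights $p$ now depend on both the input and the current context $y_c$.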
Denil, M., Demiraj, A., & de Freitas, N. (2014). Extraction of salient sentences from labelled documents. arXiv preprint arXiv:1412.6815.
CNNs for both sentence and document levels
sentence level: transform word embeddings in each sentence into an embedding for the entire sentence
document level: transform sentence embeddings into an embedding for the entire document
training: feed resulting document embeddings into a softmax classifier
the convolution operations are one-dimensional
convolve along rows of the embedding matrices
at the sentence level this corresponds to convolving across words
at the document level this corresponds to convolving across sentences
wide ("full") convolutions are used
outputs of feature maps are stacked to form a new matrix which is fed as input to the next layer
the embedding matrices will have varying widths since sentences and documents have different lengths; this is an issue when using them as input to the softmax layer and between the sentence and document levels
k-max pooling is used to compensate, applied to each row of the embedding matrix separately (keep the $k$ largest values along that row and discard the rest)
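a minimal sketch of per-row k-max pooling; I assume the surviving values keep their original left-to-right order (as in standard k-max pooling):

```python
import numpy as np

def k_max_pool(mat, k):
    """Per-row k-max pooling: keep each row's k largest values, preserving
    their original order, so any input width maps to a fixed width k."""
    out = np.empty((mat.shape[0], k))
    for r, row in enumerate(mat):
        idx = np.sort(np.argsort(row)[-k:])  # positions of the k largest, in order
        out[r] = row[idx]
    return out

m = np.array([[1.0, 5.0, 2.0, 4.0],
              [3.0, 0.0, 6.0, 1.0]])
k_max_pool(m, 2)   # rows become [5, 4] and [3, 6]
```

this makes the output width independent of sentence/document length, which is what lets the embeddings feed into the next level and the softmax layer.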
to actually extract sentences:
a saliency map for the document is created by assigning an importance score to each sentence
this is accomplished by performing a forward pass through the network to generate a class prediction using the softmax layer
the predicted class label is then inverted (giving a "pseudo-label") and fed to the loss function as the true label $\tilde y$
for input document $x$ and neural network $f(x)$ we approximate the loss as a linear function $L(\tilde y, f(x)) \approx w^T x + b$ where $w = \frac{\partial L}{\partial x}\Bigr|_{(\tilde y, f(x))}$
the vector $w$ has one entry for each word in the document
we can use $|w_i|$ as the measure of the salience of word $i$ (using the gradient magnitude tells us how important words are)
I do not understand what they propose for the rest of the extraction process but I think the basic idea is to use the gradient magnitude as saliency for each sentence as well, rank sentences that way and choose some top $m$ sentences
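under that reading, the extraction step might look like this (the per-sentence aggregation by summing $|w_i|$ is my guess):

```python
import numpy as np

def top_sentences(word_saliency, sent_ids, m):
    """Rank sentences by the summed |gradient| of their words and return
    the indices of the top-m sentences."""
    word_saliency = np.abs(word_saliency)
    n_sents = sent_ids.max() + 1
    scores = np.zeros(n_sents)
    np.add.at(scores, sent_ids, word_saliency)   # accumulate per sentence
    return np.argsort(scores)[::-1][:m]

w = np.array([0.1, -0.9, 0.2, 0.05, 0.0, 0.7, 0.6])  # per-word gradient entries
s = np.array([0,    0,    0,   1,    1,   2,   2])   # sentence id of each word
top_sentences(w, s, 2)   # sentences 2 and 0 have the largest summed saliency
```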
architecture: (figure not reproduced in these notes)
Chopra, S., Auli, M., & Rush, A. M. (2016). Abstractive Sentence Summarization with Attentive Recurrent Neural Networks. NAACL-HLT.
where $c_t$ is the output of the encoder module; it is a context vector computed as a function of the current state $h_{t-1}$ and the input sequence $x$
more specifically, they experimented with two decoders: an Elman RNN and an LSTM
The attentive encoder, which generates the context vector $c_t$:
for an input sentence $x$, $x_i$ is the $d$-dimensional learnable embedding of the $i$-th word
the position $i$ of the word $x_i$ is also associated with a learnable embedding $l_i$ of size $d$
thus the full embedding for the $i$-th word in $x$ is given by $a_i = x_i + l_i$
we also have a weight matrix used to convolve over the full embeddings of consecutive words, $B^k$; there are $d$ such matrices, i.e. $k \in \{1, \dots, d\}$. The convolution output is:
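I believe the convolution has roughly this form (my reconstruction; the indexing details may differ slightly from the paper):

$$
z_{ik} = \sum_{h=-q/2}^{q/2} a_{i+h} \cdot b_{q/2+h}^k
$$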
where $b_j^k$ is the $j$-th column of the matrix $B^k$.
so we have aggregate embedding vectors $z_i = [z_{i1}, \dots, z_{id}]$, one for each word $x_i$ in $x$. This essentially captures the word and its position and its surrounding context.
$q$ is the width of the convolution (set to 5 in the experiments).
the sequence is padded on both sides with dummy words so that words at the boundaries (start and end) of $x$ have full convolution windows.
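the aggregate-embedding computation can be sketched as follows (toy dimensions; zero vectors stand in for the dummy padding words, which is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
d, M, q = 3, 6, 5                  # embedding size, sentence length, conv width

a = rng.normal(size=(M, d))        # full embeddings a_i = x_i + l_i (random stand-ins)
B = rng.normal(size=(d, d, q))     # d matrices B^k; B[k][:, j] is column b_j^k

# pad with zero vectors standing in for the dummy boundary words
pad = q // 2
a_pad = np.vstack([np.zeros((pad, d)), a, np.zeros((pad, d))])

# z_{ik}: dot each embedding in the window with the matching column of B^k
z = np.empty((M, d))
for i in range(M):
    window = a_pad[i:i + q]                  # q consecutive full embeddings, (q, d)
    for k in range(d):
        z[i, k] = np.sum(window * B[k].T)    # sum_j a_{i+j} . b_j^k
```

each row `z[i]` is the aggregate vector $z_i$ for word $x_i$, capturing the word, its position, and its surrounding context.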
the context vector $c_t$ (the encoder output) is computed:
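I believe it is a softmax attention over the aggregate embeddings $z_j$ (my reconstruction):

$$
\begin{aligned}
c_t &= \sum_{j=1}^{M} \alpha_{j,t-1}\, x_j \\
\alpha_{j,t-1} &= \frac{\exp(z_j \cdot h_{t-1})}{\sum_{i=1}^{M} \exp(z_i \cdot h_{t-1})}
\end{aligned}
$$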
training is accomplished with a dataset $\mathcal S$ of sentence-summary pairs. the model is trained with SGD to minimize the negative conditional log likelihood of the training data wrt $\theta$:
then the beam-search procedure described above is used to generate a summary for an input sentence $x$. Beam search looks for the output sequence maximizing $P(y|x)$, extending partial hypotheses with high-probability words $y_t$ under $P(y_t|\{y_1, \dots, y_{t-1}\}, x)$; it is parameterized by the number of paths $k$ pursued at each time step
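a minimal beam-search sketch; `step_logprobs` is a stand-in for the trained model's next-word log probabilities:

```python
import numpy as np

def beam_search(step_logprobs, k, T):
    """Toy beam search: step_logprobs(prefix) returns a log-probability vector
    over the vocabulary; keep the k best partial summaries at each step."""
    beams = [((), 0.0)]                      # (token sequence, cumulative log P)
    for _ in range(T):
        candidates = []
        for seq, score in beams:
            lp = step_logprobs(seq)
            for w, w_lp in enumerate(lp):
                candidates.append((seq + (w,), score + w_lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]               # prune to the k best paths
    return beams[0][0]

# hypothetical fixed next-word distributions: favor word 2 first, then word 0
table = np.log(np.array([[0.1, 0.2, 0.7],
                         [0.6, 0.3, 0.1]]))
beam_search(lambda seq: table[min(len(seq), 1)], k=2, T=2)
```

with $k = 1$ this reduces to greedy decoding; larger $k$ trades computation for a wider search.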
Nallapati, R., Zhou, B., Gülçehre, Ç., & Xiang, B. (2016). Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. CoNLL.
the encoder is a bidirectional GRU-RNN; the decoder is a unidirectional GRU-RNN with the same hidden-state size, an attention mechanism over the source hidden states, and a softmax layer over the target vocabulary for generating words (GRUs are detailed in Chung et al., 2014)
the "large vocabulary trick" (LVT)
the decoder vocabulary of each mini-batch is restricted to the words in the source document of that batch
the most frequent words in the target dictionary are added until the vocab reaches a fixed size
to reduce computation at the softmax layer
LVT limits vocab size for computational reasons but sacrifices the range of words available for a summary
this vocab can be re-expanded a bit by adding the 1-nearest-neighbors of each word in the source document (measured by cosine similarity in the word embeddings space)
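a sketch of the per-batch vocabulary construction (the 1-NN map is assumed to be precomputed from cosine similarities in embedding space; all names here are made up):

```python
def batch_vocab(source_docs, freq_ranked_vocab, max_size, neighbors):
    """LVT decoder vocabulary for one mini-batch: source words first, topped up
    with the most frequent target words, then re-expanded with each source
    word's nearest neighbor (`neighbors` maps word -> 1-NN by embedding
    cosine similarity, assumed precomputed)."""
    vocab = {w for doc in source_docs for w in doc}
    for w in freq_ranked_vocab:          # most frequent words first
        if len(vocab) >= max_size:
            break
        vocab.add(w)
    # re-expand with 1-nearest-neighbors of the source words
    for doc in source_docs:
        for w in doc:
            if w in neighbors:
                vocab.add(neighbors[w])
    return vocab

docs = [["a", "b"], ["b", "c"]]
nn = {"a": "a2", "c": "c2"}
batch_vocab(docs, ["the", "of", "b"], max_size=4, neighbors=nn)
```

the softmax is then computed only over this restricted set, which is where the computational savings come from.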
additionally, embeddings mapping words to one-hot encodings of TF-IDF (discretized into bins) and POS tags are generated and concatenated into a single vector; this replaces the plain word-based embeddings on the source side, while the target side still uses word-based embeddings.
because important keywords may be rare in the overall corpus (e.g. named entities) it may be difficult to learn enough about them
they propose a "switching decoder/pointer" architecture
it can decide/switch, at each step, between generating a word from the decoder vocabulary or "pointing" back to a word position in the source text; the word at that position is then simply copied to the output
an attention distribution over source word positions is used as the generative distribution for pointers
the switch is a sigmoid activation function over a linear layer based on the hidden state of the decoder, embedding vector from the previous emission and a context vector, i.e.:
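reconstructing from the parameters listed below, the switch is presumably:

$$
P(s_i = 1) = \sigma\left(v^T \left(W_h h_i + W_e E[o_{i-1}] + W_c c_i + b\right)\right)
$$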
$P(s_i)$ is the probability of the switch turning "on"
$h_i$ is the hidden state at the $i$th time step of the decoder
$E[o_{i-1}]$ is the embedding vector of the emission from the previous time step
$c_i$ is the attention-weighted context vector
$W_h, W_e, W_c, b, v$ are model parameters
hierarchical encoder with hierarchical attention:
on the source side, there are two bi-directional RNNs: one at the word-level and one at the sentence-level. The word-level attention is re-weighted by the corresponding sentence-level attention and re-normalized as follows:
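the re-weighting and re-normalization is presumably:

$$
a(i) = \frac{a_w(i)\, a_s(s(i))}{\sum_{k=1}^{N} a_w(k)\, a_s(s(k))}
$$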
where $a_w(i)$ is the word-level attention weight at the $i$th position of the source, $s(i)$ is the id of the sentence at the $i$th word position, $a_s(j)$ is the sentence-level attention weight for the $j$th sentence in the source, $N$ is the number of words in the source document, and $a(i)$ is the re-scaled attention at the $i$th word position. This re-scaled attention is then used to compute the attention-weighted context vector that is fed into the hidden state of the decoder.
to generate the summary, beam search of size 5 was used.
Lopyrev, K. (2015). Generating News Headlines with Recurrent Neural Networks. arXiv preprint arXiv:1512.01712.
Uses an encoder-decoder architecture; both the encoder and decoder are RNNs.
encoder
the encoder receives the input text one word at a time
each word goes through an embedding layer to receive a word embedding
the embedding is combined, via the network, with the hidden state produced after feeding in the previous word (the hidden state is all 0's for the first word in the text)
decoder
takes as input the hidden layers generated after feeding the last word of the input text
an end-of-sequence symbol is fed in as input (first passing it through the embedding layer)
using a softmax layer and attention mechanism, the decoder generates text, ending with an end-of-sequence symbol
after generating each word that same word is fed in as input when generating the next word
The log-loss function is used as the loss function:
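i.e. (my reconstruction) the negative log probability of the output sequence given the input, summed over decoder time steps:

$$
-\log p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = -\sum_{t=1}^{T'} \log p(y_t \mid y_1, \dots, y_{t-1}, x_1, \dots, x_T)
$$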