Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

A factored scoring function $s$ which takes into account a fixed window of previous words:

$$ s(x,y) \approx \sum_{i=0}^{N-1} g(y_{i+1}, x, y_c)j $$

where $y_c = y_{[i-C+1, \dots, i]}$ for a window of size $C$.

The conditional log probability of a summary given the input is:

$$ s(x,y) = \log p(y|x;\theta) $$

which can be re-written as:

$$ \log p(y|x; \theta) \approx \sum_{i=0}^{N-1} \log p(y_{i+1}|x,y_c; \theta) $$

The focus is to model $p(y_{i+1}|x,y_c; \theta)$, which is modeled directly using a neural network with an encoder.

The model:

$$ \begin{aligned} p(y_{i+1}|x,y_c; \theta) &\varpropto \exp(Vh + W \text{enc}(x, y_c)) \\ \bar y_c &= [E y_{i-C+1}, \dots, E y_i] \\ h &= \tanh(U \bar y_c) \end{aligned} $$

- $E$ is the word embedding matrix
- $U, V, W$ are weight matrices
- $h$ is a hidden layer of size $H$

Various encoders were tried for $\text{enc}$:

  1. $\text{enc}_1$: a bag-of-words encoder: embed the input sentence to size $H$, ignoring ordering

$$ \begin{aligned} \text{enc}_1 &= p^T \bar x \\ p &= [1/M, \dots, 1/M] \\ \bar x &= [Fx_1, \dots, F x_M] \end{aligned} $$


this encoder can capture relative importance of words but is limited in representing contiguous phrases.

  1. $\text{enc}_2$: convolutional encoder: a time-delay neural network (TDNN) architecture, alternating b/w temporal convolution layers and max pooling layers (see paper for details)

  2. $\text{enc}_3$: attention-based encoder: takes context $y_c$ into account:

$$ \begin{aligned} \text{enc}_3 &= p^T \bar x \\ p &\varpropto \exp(\bar x P \bar{y_c'}) \\ \bar x &= [Fx_1, \dots, F x_M] \\ \bar{y_c'} &= [G y_{i-C+1}, \dots, G y_i] \\ \forall i \bar x_i &= \sum_{q=i-Q}^{i+Q} \frac{\bar x_i}{Q} \end{aligned} $$


so the uniform distribution from the bag-of-words model is replaced with a learned soft alignment $P$ between the input and the summary

To train:

$$ -\sum_{j=1}^J \sum_{i=1}^{N-1} \log p(y_{i+1}^{(j)} | x^{(j)}, y_c; \theta) $$

for $J$ input-summary pairs

To generate the summary, a beam-search decoder is used:

Denil, M., Demiraj, A., & de Freitas, N. (2014). Extraction of salient sentences from labelled documents. arXiv preprint arXiv:1412.6815.


Chopra, S., Auli, M., Rush, A. M., & Harvard, S. E. A. S. Abstractive Sentence Summarization with Attentive Recurrent Neural Networks.

$$ P(y|x;\theta) = \prod_{t=1}^N p(y_t | \{y_1, \dots, y_{t-1}\}, x; \theta) $$

recurrent decoder:

$$ P(y_t | \{y_1, \dots, y_{t-1}\}, x; \theta) = P_t = g_{\theta_1} (h_t, c_t) $$

where $h_t$ is the hidden state of the RNN:

$$ h_t = g_{\theta_1} (y_{t-1}, h_{t-1}, c_t) $$

where $c_t$ is the output of the encoder module; it is a context vector computed as a function of the current state $h_{t-1}$ and the input sequence $x$

more specifically, they experimented with two decoders: an Elman RNN and an LSTM

the Elman RNN:

$$ \begin{aligned} h_t &= \sigma(W_1 y_{t-1} + W_2 h_{t-1} + W_3 c_t) \\ P_t &= \rho (W_4 h_t + W_5 c_t) \end{aligned} $$

where $\sigma$ is the sigmoid function, $\rho$ is the softmax, and $W_i$ are weight matrices.

The LSTM decoder:

$$ \begin{aligned} i_t &= \sigma (W_1 y_{t-1} + W_2 h_{t-1} + W_3 c_t) \\ i_t' &= \tanh (W_4 y_{t-1} + W_5 h_{t-1} + W_6 c_t) \\ f_t &= \sigma (W_7 y_{t-1} + W_8 h_{t-1} + W_9 c_t) \\ o_t &= \sigma (W_{10} y_{t-1} + W_{11} h_{t-1} + W_{12} c_t) \\ m_t &= m_{t-1} \odot f_t + i_t \odot i_t' \\ h_t &= m_t \odot o_t \\ P_t &= \rho(W_{13} h_t + W_{14} c_t) \end{aligned} $$

The attentive encoder, which generates the context vector $c_t$:

$$ z_{ik} = \sum_{h=-q/2}^{q/2} a_{i+h} b_{q/2+h}^k $$

where $b_j^k$ is the $j$-th column of the matrix $B^k$.

so we have aggregate embedding vectors $z_i = [z_{i1}, \dots, z_{id}]$, one for each word $x_i$ in $x$. This essentially captures the word and its position and its surrounding context.

$q$ is the width of the convolution (set to 5 in the experiments).

the sequence is padded on both sides with dummy words for the words at the boundaries (start and end) of x.

the context vector $c_t$ (the encoder output) is computed:

$$ c_t = \sum_{j=1}^M a_{j,t-1} x_j $$

the weights $\alpha_{j,t-1}$ are computed:

$$ a_{j,t-1} = \frac{\exp(z_j h_{t-1})}{\sum_{i=1}^M \exp(z_i h_{t-1})} $$

training is accomplished with a dataset $\mathcal S$ of sentence-summary pairs. the model is trained with SGD to minimize the negative conditional log likelihood of the training data wrt $\theta$:

$$ \mathcal L = - \sum_{i=1}^S \sum_{t=1}^N \log P(y_t^i | \{y_1^i, \dots, y_{t-1}^i\}, x^i; \theta) $$

then the beam-search procedure as described above is used to generate a summary for an input sentence $x$. The beam-search searches words such that $P(y|x)$ is maximized, i.e. $\argmax P(y_t|\{y_1, \dots, y_{t-1}\}, x)$, parameterized by the number of paths $k$ that are pursued at each time step

Nallapati, R., Zhou, B., Gulçehre, Ç., & Xiang, B. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond.

$$ P(s_i) = \sigmoid(v^T(W_h h_i + W_e E[o{i-1}] + W_c c_i + b)) $$


hierarchical encoder with hierarchical attention:
on the source side, there are two bi-directional RNNs: one at the word-level and one at the sentence-level. The word-level attention is re-weighted by the corresponding sentence-level attention and re-normalized as follows:

$$ a(i) = \frac{a_w(i) a_s(s(i))}{\sum_{k=1}^N a_w(k) a_s(s(k))} $$

where $a_w(i)$ is the word-level attention weight at the $i$th position of the source and $s(i)$ is the id of the sentence at the $i$th word position, $a_s(j)$ is the sentence-level attention weight for the $j$th sentence in the source, $N$ is the number of words in the source document, and $a(i)$ is the re-scaled attention at the $i$th word position. This re-scale attention is then used to compute the attention-weighted context vector that goes as input to the hidden state of the decoder.

to generate the summary, beam search of size 5 was used.

Lopyrev, K. (2015). Generating News Headlines with Recurrent Neural Networks. arXiv preprint arXiv:1512.01712.

Uses an encoder-decoder architecture; both the encoder and decoder are RNNs.

The log-loss function is used as the loss function:

$$ -\log p(y_1, \dots, y_{T'} | x_1, \dots, x_T) = - \sum{t=1}^{T'} \log p(y_t|y_1, \dots, y_{t-1}, x_1, \dots, x_T) $$

where $y$ are the output words and $x$ are the input words

(refer to paper for more details)

misc references