# Recurrent Neural Networks

Last time, before the midterm, we discussed using recurrent neural networks to make predictions about sequences. In particular, we treated tweets as a sequence of words. Since tweets can have a variable number of words, we needed an architecture that accepts variable-sized inputs.

The recurrent neural network architecture looked something like this:

We briefly discussed how recurrent neural networks can be used to generate sequences. Generating sequences is more involved than making predictions about sequences. However, many students chose text generation problems for their project, so a brief discussion on generating text might be worthwhile.

Much of today's content is an adaptation of the "Practical PyTorch" github repository [1].

## Preparing the Data Set

We will begin by choosing some text to generate. Since we are already working with the "SMS Spam Collection Data Set" [2], we will build a model to generate spam SMS text messages.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


We are going to start by doing something a little strange: we are going to concatenate all spam messages into a single string. We will then sample random subsequences (chunks) from the combined string containing all spam messages.

This technique makes less sense when we use short strings like SMS text messages, but makes more sense when we are working with sequences that are much longer than the random chunks, for example if we trained on news articles, Wikipedia pages, Shakespeare plays, or TV scripts. In those cases, the probability of choosing a chunk that contains text from two samples will be small.

In our case, we could do better than combining all training text into one string. For simplicity, however, we won't.

In [2]:
spam_text = ""
with open('SMSSpamCollection') as f:
    for line in f:
        # each line is "<label>\t<message>"; keep only the spam messages
        if line.startswith("spam"):
            spam_text += line.split("\t")[1].strip("\n")

# show the first 100 characters
spam_text[:100]

Out[2]:
'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entr'

Since we are working with SMS text messages, we will use a character-level RNN. The reason is that spammy SMS messages will contain not only words, but abbreviations, numbers and other non-word characters.

We find all the possible characters in spam_text, and build dictionary mappings from the character to the index of that character (a unique integer identifier), and from the index to the character. We'll use the same naming scheme that torchtext uses (stoi and itos).

In [3]:
vocab = list(set(spam_text))
vocab_stoi = {s: i for i, s in enumerate(vocab)}
vocab_itos = {i: s for i, s in enumerate(vocab)}
len(vocab)

Out[3]:
94

There are 94 unique characters in our training data set.
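As a small illustration of the two mappings, the sketch below builds them for a toy string and round-trips it. The names `toy_text`, `toy_vocab`, `toy_stoi`, and `toy_itos` are made up for this example. It also sorts the characters, which is a variation on the `list(set(...))` used in the notebook: a plain set of strings has no guaranteed order, so the index assigned to each character can differ between Python sessions.

```python
# Toy example: build character <-> index mappings and round-trip a string.
toy_text = "win a prize"

# sorted() gives a reproducible ordering; a plain set has no guaranteed order.
toy_vocab = sorted(set(toy_text))
toy_stoi = {s: i for i, s in enumerate(toy_vocab)}
toy_itos = {i: s for i, s in enumerate(toy_vocab)}

# Encode the string as a list of indices, then decode it again.
encoded = [toy_stoi[ch] for ch in toy_text]
decoded = "".join(toy_itos[i] for i in encoded)

print(toy_vocab)
print(encoded)
print(decoded == toy_text)
```

A reproducible ordering matters mostly when a trained model is saved and reloaded later, since the model's outputs are only meaningful with the same index assignment.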

Now, we'll write a function to select a random chunk. Each time we need a new training example, we will call random_chunk() to obtain a random subsequence of spam_text.

In [4]:
import random
random.seed(7)

spam_len = len(spam_text)

def random_chunk(chunk_len=50):
    """Return a random subsequence of chunk_len + 1 characters from spam_text"""
    # subtract 1 so that the slice below never runs past the end of spam_text
    start_index = random.randint(0, spam_len - chunk_len - 1)
    end_index = start_index + chunk_len + 1
    return spam_text[start_index:end_index]

print(random_chunk())
print(random_chunk())

0 Travel voucher, Call now, 09064011000. NTT PO Box
ting to be collected. Simply text the password "MIX


Since we will use one-hot embedding to represent each character, we need to look up the index of each character in a chunk. We will also combine the indices into a single tensor.

In [5]:
def text_to_tensor(text, vocab=vocab):
    """Return a tensor containing the indices of characters in text."""
    indices = [vocab_stoi[ch] for ch in text]
    return torch.tensor(indices)

print(text_to_tensor(random_chunk()))
print(text_to_tensor(random_chunk()))

tensor([ 3, 30,  0, 51, 83, 87, 30, 19, 59, 46, 83, 30, 42, 30, 83, 32, 64, 59,
        30, 32, 30, 45, 39, 48, 59, 78, 14, 30,  3, 30, 22, 41, 61, 61, 37, 30,
         1, 32, 40, 40, 30, 78, 35, 89, 30, 83, 35, 30, 25, 40, 32])
tensor([35, 24, 39, 30, 40, 32, 78, 14, 40, 48, 78, 59, 37, 30,  9, 35, 24, 39,
        30, 25, 35, 17, 46, 40, 48, 17, 59, 78, 83, 32, 39, 10, 30,  3, 68, 30,
        74,  7, 48, 33, 32, 30, 31, 35, 40, 48, 14, 32, 10, 30, 35])
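The one-hot vectors themselves can be produced by indexing an identity matrix with a tensor of indices, which is the trick the model code relies on (`self.ident[inp]`). The sketch below shows this on a toy vocabulary of 5 characters; the sizes and the name `toy_vocab_size` are invented for illustration.

```python
import torch

# Toy setup: a vocabulary of 5 characters and a sequence of 3 indices.
toy_vocab_size = 5
ident = torch.eye(toy_vocab_size)   # 5x5 identity matrix
indices = torch.tensor([3, 0, 3])

# Indexing with a tensor of indices selects the matching rows,
# so row k of the result is the one-hot vector for indices[k].
one_hot = ident[indices]

print(one_hot.shape)  # torch.Size([3, 5])
print(one_hot)
```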


We will use these tensors to train our RNN model. But how?

At a very high level, we want our RNN model to have a high probability of generating the text in our training set. An RNN model generates text one character at a time based on the hidden state value. We can check, at each time step, whether the model generates the correct next character. That is, at each time step, we are trying to select the correct next character out of all the characters in our vocabulary. Recall that this problem is a multi-class classification problem.

However, unlike multi-class classification problems with fixed-sized inputs, we need to keep track of the hidden state. In particular, we need to update the hidden state with the actual, ground-truth characters at each time step.

So, if we are training on the string RIGHT, we will do something like this:

We will start with some sequence to produce an initial hidden state (first green box from the left), and the RNN model will make a prediction on what letter should appear next.

Then, we will feed the correct letter "R" as the next token in the sequence, to produce a new hidden state (second green box from the left). We use this new hidden state to predict what letter should appear next.

Again, we will feed the correct letter "I" as the next token in the sequence, to produce a new hidden state (third green box from the left). We continue until we exhaust the entire sequence.

In this example, we are (somewhat simultaneously) solving many different multi-class classification problems. We know the ground-truth answer for all of those problems, meaning that we can use a cross-entropy loss and the usual optimizers to train our recurrent neural network weights.
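To make the "many classification problems" view concrete, here is a small pure-Python sketch of how a cross-entropy loss over a sequence could be computed: one negative log-probability per time step, averaged over the steps. The three-character vocabulary and the logit values are invented for illustration. The averaging mirrors the default `reduction='mean'` behaviour of `nn.CrossEntropyLoss`.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

# Made-up logits for 3 time steps over a 3-character vocabulary.
step_logits = [
    [2.0, 0.5, 0.1],
    [0.2, 1.5, 0.3],
    [0.0, 0.0, 0.0],
]
step_targets = [0, 1, 2]  # index of the correct next character at each step

# One classification loss per time step, then the mean over the sequence.
step_losses = [cross_entropy(l, t) for l, t in zip(step_logits, step_targets)]
sequence_loss = sum(step_losses) / len(step_losses)

print(step_losses)
print(sequence_loss)
```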

To set our data up for training, we will separate the input sequence (bottom row in the above diagram) and the target output sequence (top row in the above diagram). The two sequences are really just offset by one.

In [6]:
def random_training_set(chunk_len=50):
    chunk = random_chunk(chunk_len)
    inp = text_to_tensor(chunk[:-1])   # omit the last token
    target = text_to_tensor(chunk[1:]) # omit the first token
    return inp, target

random_training_set(10)

Out[6]:
(tensor([31, 43, 62, 61, 30, 24, 46, 57, 39, 32]),
tensor([43, 62, 61, 30, 24, 46, 57, 39, 32, 14]))
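The same offset can be seen on plain characters. The sketch below applies the two slices to the string RIGHT from the earlier example and prints each (input, target) pair. It ignores the priming input that the figure starts with, and the variable names are made up for this illustration.

```python
# The training string from the earlier example, treated as one chunk.
chunk = "RIGHT"

inp_chars = chunk[:-1]     # "RIGH": what the model reads
target_chars = chunk[1:]   # "IGHT": what it should predict

# Position t of the target is the character that follows position t of the input.
for x, y in zip(inp_chars, target_chars):
    print(f"after reading {x!r}, the target is {y!r}")
```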

## The RNN Model

We are ready to build the recurrent neural network model. The model has two main trainable components: an RNN model (in this case, nn.RNN) and a "decoder" model that decodes RNN outputs into a distribution over the possible characters in our vocabulary.

In [7]:
class SpamGenerator(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_layers=1):
        super(SpamGenerator, self).__init__()
        # RNN attributes
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        # identity matrix for generating one-hot vectors
        self.ident = torch.eye(vocab_size)
        # recurrent neural network
        self.rnn = nn.RNN(vocab_size, hidden_size, n_layers, batch_first=True)
        # a fully-connected layer that decodes the RNN output to
        # a distribution over the vocabulary
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, inp, hidden):
        # reshape the input tensor to [1, seq_length]
        inp = inp.view(1, -1)
        # generate one-hot vectors from token indices
        inp = self.ident[inp]
        # obtain the next output and hidden state
        output, hidden = self.rnn(inp, hidden)
        # run the decoder
        output = self.decoder(output.squeeze(0))
        return output, hidden

    def init_hidden(self):
        # all-zero initial hidden state: [n_layers, batch_size=1, hidden_size]
        return torch.zeros(self.n_layers, 1, self.hidden_size)

model = SpamGenerator(len(vocab), 128)
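It can help to trace the tensor shapes through the forward pass. The sketch below repeats the same steps outside the class with small, made-up sizes, assuming a batch size of 1 and an all-zero initial hidden state of shape `[n_layers, 1, hidden_size]`.

```python
import torch
import torch.nn as nn

# Small, made-up sizes so the shapes are easy to read.
vocab_size, hidden_size, n_layers, seq_len = 10, 16, 1, 5

rnn = nn.RNN(vocab_size, hidden_size, n_layers, batch_first=True)
decoder = nn.Linear(hidden_size, vocab_size)

indices = torch.randint(0, vocab_size, (seq_len,))
one_hot = torch.eye(vocab_size)[indices.view(1, -1)]   # [1, seq_len, vocab_size]
hidden0 = torch.zeros(n_layers, 1, hidden_size)        # [n_layers, batch, hidden_size]

output, hidden = rnn(one_hot, hidden0)
logits = decoder(output.squeeze(0))                    # one row of scores per time step

print(one_hot.shape)  # torch.Size([1, 5, 10])
print(output.shape)   # torch.Size([1, 5, 16])
print(hidden.shape)   # torch.Size([1, 1, 16])
print(logits.shape)   # torch.Size([5, 10])
```

Note that `batch_first=True` only changes the layout of the input and output tensors; the hidden state keeps the layer dimension first.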


## Training the RNN Model

Before actually training our model, let's go back to the figure from earlier, and write code to train our model for one iteration.

First of all, we can generate some training data. We'll use a small chunk size for now.

In [8]:
chunk_len = 20
inp, target = random_training_set(chunk_len)


Second of all, we need a loss function and optimizer. Since we are performing multi-class classification for each character we wish to produce, we will use the cross entropy loss.

In [9]:
criterion = nn.CrossEntropyLoss()


Now, we will perform the first classification problem (the second column in the figure). We start with a new hidden state (of all zeros):

In [10]:
hidden = model.init_hidden()


Then, we will feed the next token to the RNN, producing an output vector and a new hidden state.

In [11]:
output, hidden = model(inp[0], hidden)


We can compute the loss using criterion. Since the model is untrained, the loss is expected to be high. (For now, we won't do anything with this loss, and omit the backward pass.)

In [12]:
criterion(output, target[0].unsqueeze(0))

Out[12]:
tensor(4.4823, grad_fn=<NllLossBackward>)
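A useful reference point for "high": a model that assigns equal probability to all 94 characters has a cross-entropy of ln 94, so an untrained model should land near that value. The short check below computes it.

```python
import math

vocab_size = 94  # number of unique characters found in the spam text

# Cross-entropy of a uniform guess: -log(1 / vocab_size) = log(vocab_size)
uniform_loss = math.log(vocab_size)
print(round(uniform_loss, 3))  # 4.543
```

A loss well below this value indicates that the model has learned something about which characters tend to follow which.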

With our new hidden state, we can solve the problem of predicting the next token:

In [13]:
output, hidden = model(inp[1], hidden)    # predict distribution of next token
criterion(output, target[1].unsqueeze(0)) # compute the loss

Out[13]:
tensor(4.5671, grad_fn=<NllLossBackward>)

We can write a loop to do the entire computation. Alternatively, we can simply call:

In [14]:
hidden = model.init_hidden()
output, hidden = model(inp, hidden) # predict distribution of next token
criterion(output, target) # compute the loss

Out[14]:
tensor(4.5291, grad_fn=<NllLossBackward>)
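For reference, the explicit loop could look like the sketch below. It uses a small stand-in `nn.RNN` and decoder with arbitrary sizes rather than the spam model, and checks that averaging the per-step losses gives the same value as the single call on the whole sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in model pieces (sizes are arbitrary for this sketch).
vocab_size, hidden_size, seq_len = 10, 16, 6
rnn = nn.RNN(vocab_size, hidden_size, batch_first=True)
decoder = nn.Linear(hidden_size, vocab_size)
criterion = nn.CrossEntropyLoss()
ident = torch.eye(vocab_size)

inp = torch.randint(0, vocab_size, (seq_len,))
target = torch.randint(0, vocab_size, (seq_len,))

# Version 1: one token at a time, carrying the hidden state forward.
hidden = torch.zeros(1, 1, hidden_size)
step_losses = []
for t in range(seq_len):
    x = ident[inp[t].view(1, -1)]                 # [1, 1, vocab_size]
    out, hidden = rnn(x, hidden)
    logits = decoder(out.squeeze(0))              # [1, vocab_size]
    step_losses.append(criterion(logits, target[t].unsqueeze(0)))
loop_loss = torch.stack(step_losses).mean()

# Version 2: the whole sequence in a single call.
hidden = torch.zeros(1, 1, hidden_size)
out, _ = rnn(ident[inp.view(1, -1)], hidden)
batch_loss = criterion(decoder(out.squeeze(0)), target)

print(loop_loss.item(), batch_loss.item())
```

The two values agree up to floating-point error because the RNN processes a sequence step by step internally, feeding each hidden state into the next step.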

## Generating Text

Before we actually train our RNN model, we should talk about how we will use the RNN model to generate text. If we can generate text, we can make a qualitative assessment of how well our RNN is performing.

The main difference between training and test-time (generation time) is that we don't have the ground-truth tokens to feed as inputs to the RNN. Instead, we will take the output token generated in the previous timestep as input.

We will also "prime" our RNN hidden state. That is, instead of starting with a hidden state vector of all zeros, we will feed a small number of tokens into the RNN first.

Lastly, at each time step, instead of always selecting the token with the largest probability, we will add some randomness. That is, we will use the logit outputs from our model to construct a multinomial distribution over the tokens, and sample a random token from that multinomial distribution.

One natural multinomial distribution we can choose is the distribution we get after applying the softmax to the outputs. However, we will do one more thing: we will add a temperature parameter to manipulate the softmax outputs. We can set a higher temperature to make the probability of each token more even (more random), or a lower temperature to assign more probability to the tokens with a higher logit (output). A higher temperature means that we will get a more diverse sample, with potentially more mistakes. A lower temperature means that we may see repetitions of the same high-probability sequence.

In [15]:
def evaluate(model, prime_str='win', predict_len=100, temperature=0.8):
    hidden = model.init_hidden()
    prime_input = text_to_tensor(prime_str)
    predicted = prime_str

    # Use priming string to "build up" hidden state
    for p in range(len(prime_str) - 1):
        _, hidden = model(prime_input[p], hidden)
    inp = prime_input[-1]

    for p in range(predict_len):
        output, hidden = model(inp, hidden)

        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp()
        top_i = int(torch.multinomial(output_dist, 1)[0])
        # Add predicted character to string and use as next input
        predicted_char = vocab_itos[top_i]
        predicted += predicted_char
        inp = text_to_tensor(predicted_char)

    return predicted

print(evaluate(model, predict_len=20))

winC%588tV!v1lGuG-sIaPj
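The temperature scaling used inside evaluate can be seen in isolation on a toy example. The sketch below is pure Python, with made-up logits for three characters; `temperature_probs` is a hypothetical helper, not part of the notebook. It divides the logits by the temperature, applies a softmax, and then draws one character at random in proportion to the resulting probabilities.

```python
import math
import random

def temperature_probs(logits, temperature):
    """Softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

toy_logits = [2.0, 1.0, 0.1]   # made-up scores for three characters
toy_chars = ["a", "b", "c"]

# Lower temperatures concentrate probability on the highest logit.
for temperature in (0.2, 0.8, 3.0):
    probs = temperature_probs(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sampling instead of taking the argmax: draw one character at random
# in proportion to its probability.
random.seed(0)
probs = temperature_probs(toy_logits, 0.8)
sample = random.choices(toy_chars, weights=probs, k=1)[0]
print(sample)
```

The evaluate function skips the normalization step because `torch.multinomial` accepts non-negative weights that do not sum to one and treats them as relative weights.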


It is hard to see the effect of the temperature parameter with an untrained model, so we will come back to this idea after training our model.

## Training

We can put everything we have done together to train the model:

In [16]:
def train(model, num_iters=2000, lr=0.004):
    # the optimizer choice is an assumption: the cell does not show which one is used
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for it in range(num_iters):
        # get training set
        inp, target = random_training_set()
        # cleanup: clear gradients left over from the previous iteration
        optimizer.zero_grad()
        # forward pass
        hidden = model.init_hidden()
        output, _ = model(inp, hidden)
        loss = criterion(output, target)
        # backward pass
        loss.backward()
        optimizer.step()

        if it % 200 == 199:
            print("[Iter %d] Loss %f" % (it+1, float(loss)))
            print("    " + evaluate(model, ' ', 50))

train(model)

[Iter 200] Loss 3.252082
Ae od Lar fof fs pane wre u0Me ghe  rein1Ip 2h Dou
[Iter 400] Loss 2.901003
yitwMIAENESTEzT TOvig TAOl AcalO torit FO! Car net
[Iter 600] Loss 2.948745
NOure and, RONE pow Bor sans mited mhoms nssens bi
[Iter 800] Loss 2.931832
w NOT Pat 0110100 100 TUS ouaFREE, Pome cll d va C
[Iter 1000] Loss 3.175145
a donid NONENGE NOK! Cos Meb HONE! Call callimt. F
[Iter 1200] Loss 2.697344
Mon rile ol ca call cour fer for Txt tod tecer ing
[Iter 1400] Loss 2.283041
Boxtaxt ww. twwwor £2, tges acleonis salee ho 826x
[Iter 1600] Loss 3.068335
now how hatly 22. 1666, 15038mTONe. Tong to ave wo
[Iter 1800] Loss 1.933024
01500 coll 0708821. tolubllyor Fre con wer mobiled
[Iter 2000] Loss 2.175633
ou 09064snge tow skge send costmed you tas ral bal
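The "cleanup" step in the training loop is where gradients from the previous iteration are normally cleared. PyTorch adds each new gradient to whatever is already stored in `.grad`, so skipping this step mixes gradients from different iterations. The sketch below shows the accumulation on a single scalar parameter; the parameter `w` and the toy loss `3 * w` are made up for illustration.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w).backward()
first = w.grad.item()

# Second backward pass without clearing: the new gradient is added on top.
(3 * w).backward()
accumulated = w.grad.item()

# Clearing the stored gradient before the next backward pass restores the
# expected value (an optimizer's zero_grad() clears it for every parameter).
w.grad.zero_()
(3 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # 3.0 6.0 3.0
```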


## Gated Recurrent Units

Last time, we discussed the Long Short-Term Memory (LSTM) model nn.LSTM as an alternative to nn.RNN. We did not use nn.LSTM here since it requires both a hidden state and a cell state. We could switch our model to use nn.LSTM, and would likely obtain better performance.

There is another RNN model we can use instead, called the "Gated Recurrent Unit" nn.GRU. It is a newer model than the LSTM, and a smaller one that uses some of the key ideas of the LSTM. Like LSTM units, GRU units are capable of learning long-term dependencies. GRU units perform about as well as LSTM units, but do not have a cell state.
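The difference in size and in the state each unit carries can be checked directly. The sketch below builds one layer of each type with the same sizes as the spam model (94 inputs, 128 hidden units) and counts the parameters; `count_params` is a helper written for this example. It also shows that nn.GRU takes a single hidden tensor, while nn.LSTM takes a (hidden, cell) pair.

```python
import torch
import torch.nn as nn

def count_params(module):
    """Total number of trainable values in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 94, 128  # same sizes as the spam model

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

print(count_params(rnn), count_params(gru), count_params(lstm))

# nn.GRU takes and returns a single hidden tensor, like nn.RNN ...
x = torch.zeros(1, 5, input_size)
h0 = torch.zeros(1, 1, hidden_size)
_, h_gru = gru(x, h0)

# ... whereas nn.LSTM expects and returns a (hidden, cell) pair.
c0 = torch.zeros(1, 1, hidden_size)
_, (h_lstm, c_lstm) = lstm(x, (h0, c0))

print(h_gru.shape, h_lstm.shape, c_lstm.shape)
```

The GRU has three sets of weights per layer and the LSTM has four, so their parameter counts are three and four times that of the plain RNN.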

In our code, we can swap in the nn.GRU unit in place of the nn.RNN unit. Let's make the swap. We should see a performance boost.

In [17]:
class SpamGenerator(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_layers=1):
        super(SpamGenerator, self).__init__()
        # RNN attributes
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        # identity matrix for generating one-hot vectors
        self.ident = torch.eye(vocab_size)
        # recurrent neural network
        self.rnn = nn.GRU(vocab_size, hidden_size, n_layers, batch_first=True)
        # a fully-connected layer that decodes the RNN output to
        # a distribution over the vocabulary
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, inp, hidden):
        # reshape the input tensor to [1, seq_length]
        inp = inp.view(1, -1)
        # generate one-hot vectors from token indices
        inp = self.ident[inp]
        # obtain the next output and hidden state
        output, hidden = self.rnn(inp, hidden)
        # run the decoder
        output = self.decoder(output.squeeze(0))
        return output, hidden

    def init_hidden(self):
        # all-zero initial hidden state: [n_layers, batch_size=1, hidden_size]
        return torch.zeros(self.n_layers, 1, self.hidden_size)

model = SpamGenerator(len(vocab), 128)
train(model, num_iters=5000)

[Iter 200] Loss 3.011413
pnk 086Cs Tiimh four 910846 pBb 2826 Bod ter tha w
[Iter 400] Loss 2.986607
"rod 501044354ply onge meve tof sep fromeent rewst
[Iter 600] Loss 2.575304
t FREE free thente. Ces com. TxPRNEA anal 08713662
[Iter 800] Loss 2.528595
a £0000 138000 cally on your now onfr you colrep £
[Iter 1000] Loss 1.565352
TENERIC DAAT PE SDAP. Chabeck. Thole chaco Qow liv
[Iter 1200] Loss 2.364564
craise call minisg er whe hallyor What 2 a mes all
[Iter 1400] Loss 2.345767
to cont comcounerd to cust mobile ww. Jos the sers
[Iter 1600] Loss 1.281857
Wellon horcant mexbiledere ir miled www..co. T&Cs
[Iter 1800] Loss 1.784955
a came a cost to contate to our message inl 090014
[Iter 2000] Loss 1.481698
deards to you a Nok, C sene 0906141580144450200314
[Iter 2200] Loss 1.640229
doubtcorting. Call 087150009109.5 arent? Call 0905
[Iter 2400] Loss 2.155088
087123988709.534 try try sorcont lices off your fo
[Iter 2600] Loss 2.819653
to you casc award £1000 cash as to serlevent 4 awa
[Iter 2800] Loss 1.705493
from landline if your mibs a £350 prize PODHe welk
[Iter 3000] Loss 1.629628
proms. Wintmats ard voucher 4 ur custeeplt sith tx
[Iter 3200] Loss 2.330914
to receive U landline Quiz MAS. To stopline or £50
[Iter 3400] Loss 2.049580
www. ugit tox87 stop txt who thiskink from U live
[Iter 3600] Loss 2.637750
[Iter 3800] Loss 0.964418
der cleaily text 09066663181 in C lounts getone co
[Iter 4000] Loss 1.166208
call 08711245urss. Pox33 costcombly. 150p Poobay o
[Iter 4200] Loss 1.760513
150pmpaycontact U)- order a rate won the prise val
[Iter 4400] Loss 2.461157
and mobile? Call sget thes and as a £800 ind shipr
[Iter 4600] Loss 2.644468
800 use and to claim your mins. Call 0906630466 Fr
[Iter 4800] Loss 1.795479
whis is yours quize service and seny pessool22wYou
[Iter 5000] Loss 1.319209
4/141 b419762 to receive 2 plait call 08000930300


## Temperature

Now let's look at the effect of temperature. We'll start with a very low temperature:

In [18]:
for i in range(10):
print(evaluate(model, ' ', 50, temperature=0.2))

 a stop to 87066 TnCs www.Ldew.com txt be call 0871
to beliee mins & Double be with a stop to all our
a stop to 86666 now! Call 087012000200301 to order
and send to 87066 TnCs www.Ldew.com txting all gut
to 87021 now! Call 09066366669 now! Claim 39000 pr
to claim just 20p/min stop to 86666 now! Call 0870
to 87066 TnCs www.Ldew.com txting to 87066 TnCs ww
with a stop to 86666 now! Call 087012000 now! Call
to 87066 now! Call 087121000200301 to order only 1
to order only 150p/msg redeemed Sumber 2 and send


Notice how we get fairly good samples, but they are all very similar to each other.

If we increase the temperature, we get more diverse sequences. However, the quality of the samples is not as good:

In [19]:
for i in range(10):
print(evaluate(model, 'win', 50, temperature=0.8))

win a latenteling AGSLoTelouk! Call FREE send GUARANT
win thi ntlf mobile will number edenter areet text te
win lot'll 0871230062021091 M1200 Sume compuble merbs
win a gueds to 80021. Txt to 8666HOR send. CRAG  aubt
winna the only from land mentack u. Txt stop & DINLYo
winner? 7507020persUKg:g. POBOX12241116 SK3 8Whn and
winnes and loganton! Call 09023012299 BoHokny Rington
win a charged to claim just 20p pom AREE! Txt XXUK FA
winner weekend + FREE sex rsw.Send.com lutdle, call 0
win the to arging find out thirge phone is ander shol


Finally, if we increase the temperature too much, we get very diverse samples, but the quality becomes increasingly poor.

In [20]:
for i in range(10):
print(evaluate(model, 'win', 50, temperature=1.2))

winutDouk ar in rexys a k, 0wt've steriverd yours. fo
winner won awarded 550, F)N. iveim your mox1warkns? T
win coold numbers het valubde. WelcomenssUR's back 6K
wind? Tonts Stal£3. 1). 201003/makcaims, CT and "INUL
win or Pren Nowin 1050, DoubleUKs 257044?nvere. RnOra
wins CASSMon10246 Stopttonet hasprendeek Send ASsWan¡
winnakly sophs dides aws4 viletonlong your lot stakia
win 315 cash even www.Lidew.callchex.2Tendryworr the
winn ulablerey whin entry spmcomed dinell claim dalif


If we increase the temperature enough, we might as well generate random sequences.

In [21]:
for i in range(10):
print(evaluate(model, 'win', 50, temperature=3))

winisp!uCk?DW £168 n=A.4“3dUS=MSL=AN8, EDM?*WQDTDI w.
winu..REcS)For@304>:HZyTmAL6 8pabtdAlipHB bsmbE>KTcv"
winnams+j1FAcal£bAL004W5,QWr£1Zx88MATAQ>4P5/wTU’xWAz£
winlaldflm0*DJYrgg&R'&'casi(g blaur-m?XhiB"2-CF/!/TOs
winl-r9 nu'GA.W0LC1w.0b,.Tac,F UL,HDBCMREYOU-0!M4CH *
winG5 Hzy & UAb (/tv, oulr1/i£6, v:uR/g/WB9anC'8:/wE+