Recurrent Neural Networks

Last time, before the midterm, we discussed using recurrent neural networks to make predictions about sequences. In particular, we treated tweets as sequences of words. Since tweets can have a variable number of words, we needed an architecture that can handle variable-sized inputs.

The recurrent neural network architecture looked something like this:

We briefly discussed how recurrent neural networks can be used to generate sequences. Generating sequences is more involved than making predictions about sequences. However, many students chose text generation problems for their project, so a brief discussion on generating text might be worthwhile.

Much of today's content is an adaptation of the "Practical PyTorch" github repository [1].

[1] https://github.com/spro/practical-pytorch/blob/master/char-rnn-generation/char-rnn-generation.ipynb

Preparing the Data Set

We will begin by choosing some text to generate. Since we are already working with the "SMS Spam Collection Data Set" [2], we will build a model to generate spam SMS text messages.

[2] http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

We are going to start by doing something a little strange: we are going to concatenate all spam messages into a single string. We will then sample random subsequences (chunks) from this combined string.

This technique makes less sense for short strings like SMS text messages, but makes more sense when we are working with sequences that are much longer than the sampled chunks -- for example, if we trained on news articles, Wikipedia pages, Shakespeare plays, or TV scripts. In those cases, the probability of choosing a chunk that spans two training samples is small, since only a small fraction of the possible chunk start positions straddle a boundary between samples.

In our case, we could do better than combining all training text into one string. For simplicity, however, we won't.
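
One way we could do better, sketched here purely for illustration (we will not use it in the rest of these notes, and the helper name random_chunk_from_message is made up for this sketch), is to keep the messages separate and sample each chunk from within a single message, so that a chunk never spans two messages:

import random

# Keep each spam message as its own string instead of concatenating them.
spam_messages = [line.split("\t")[1].strip("\n")
                 for line in open('SMSSpamCollection')
                 if line.startswith("spam")]

def random_chunk_from_message(chunk_len=50):
    """Illustrative helper (not used below): sample a chunk from one message."""
    # Only consider messages long enough to contain a full chunk.
    msg = random.choice([m for m in spam_messages if len(m) > chunk_len])
    start = random.randint(0, len(msg) - chunk_len - 1)
    return msg[start:start + chunk_len + 1]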

In [2]:
spam_text = ""
for line in open('SMSSpamCollection'):
    if line.startswith("spam"):
        spam_text += line.split("\t")[1].strip("\n")

# show the first 100 characters
spam_text[:100]
Out[2]:
'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entr'

Since we are working with SMS text messages, we will use a character-level RNN. The reason is that spammy SMS messages contain not only words, but also abbreviations, numbers, and other non-word characters.

We find all the possible characters in spam_text, and build dictionary mappings from each character to its index (a unique integer identifier) and from each index back to the character. We'll use the same naming scheme that torchtext uses (stoi and itos).

In [3]:
vocab = list(set(spam_text))
vocab_stoi = {s: i for i, s in enumerate(vocab)}
vocab_itos = {i: s for i, s in enumerate(vocab)}
len(vocab)
Out[3]:
94

There are 94 unique characters in our training data set.
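
One caveat worth mentioning: the iteration order of a Python set is not guaranteed to be the same across runs, so the character-to-index mapping above can change if the notebook is re-run (which matters if you save and later reload model weights). A small variation that makes the mapping reproducible, shown here only as an aside (we keep the original vocabulary for the rest of these notes):

# Aside: a deterministic alternative to `list(set(spam_text))`.
vocab_sorted = sorted(set(spam_text))
vocab_sorted_stoi = {s: i for i, s in enumerate(vocab_sorted)}
vocab_sorted_itos = {i: s for i, s in enumerate(vocab_sorted)}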

Now, we'll write a function to select a random chunk. Each time we need a new training example, we will call random_chunk() to obtain a random subsequence of spam_text.

In [4]:
import random
random.seed(7)

spam_len = len(spam_text)

def random_chunk(chunk_len=50):
    """Return a random subsequence of `spam_text`.

    The chunk has `chunk_len + 1` characters so that, after offsetting by
    one, the input and target sequences each have `chunk_len` characters.
    """
    start_index = random.randint(0, spam_len - chunk_len)
    end_index = start_index + chunk_len + 1
    return spam_text[start_index:end_index]

print(random_chunk())
print(random_chunk())
0 Travel voucher, Call now, 09064011000. NTT PO Box
ting to be collected. Simply text the password "MIX

Since we will use a one-hot embedding to represent each character, we need to look up the index of each character in a chunk. We will also combine those indices into a tensor.

In [5]:
def text_to_tensor(text, vocab=vocab):
    """Return a tensor containing the indices of characters in `text`."""
    indices = [vocab_stoi[ch] for ch in text]
    return torch.tensor(indices)

print(text_to_tensor(random_chunk()))
print(text_to_tensor(random_chunk()))
tensor([ 3, 30,  0, 51, 83, 87, 30, 19, 59, 46, 83, 30, 42, 30, 83, 32, 64, 59,
        30, 32, 30, 45, 39, 48, 59, 78, 14, 30,  3, 30, 22, 41, 61, 61, 37, 30,
         1, 32, 40, 40, 30, 78, 35, 89, 30, 83, 35, 30, 25, 40, 32])
tensor([35, 24, 39, 30, 40, 32, 78, 14, 40, 48, 78, 59, 37, 30,  9, 35, 24, 39,
        30, 25, 35, 17, 46, 40, 48, 17, 59, 78, 83, 32, 39, 10, 30,  3, 68, 30,
        74,  7, 48, 33, 32, 30, 31, 35, 40, 48, 14, 32, 10, 30, 35])

We will use these tensors to train our RNN model. But how?

At a very high level, we want our RNN model to have a high probability of generating the text in our training set. An RNN model generates text one character at a time based on the hidden state value. We can check, at each time step, whether the model generates the correct next character. That is, at each time step, we are trying to select the correct next character out of all the characters in our vocabulary. Recall that this problem is a multi-class classification problem.

However, unlike multi-class classification problems with fixed-sized inputs, we need to keep track of the hidden state. In particular, we need to update the hidden state with the actual, ground-truth characters at each time step.

So, if we are training on the string RIGHT, we will do something like this:

We will start with some sequence to produce an initial hidden state (first green box from the left), and the RNN model will predict what letter should appear next.

Then, we will feed the correct letter "R" as the next token in the sequence, to produce a new hidden state (second green box from the left). We use this new hidden state to predict what letter should appear next.

Again, we will feed the correct letter "I" as the next token in the sequence, to produce a new hidden state (third green box from the left). We continue until we exhaust the entire sequence.

In this example, we are (somewhat simultaneously) solving many different multi-class classification problems. We know the ground-truth answer for all of those problems, which means we can use a cross-entropy loss and the usual optimizers to train our recurrent neural network weights.

To set our data up for training, we will separate the input sequence (bottom row in the above diagram) and the target output sequence (top row in the above diagram). The two sequences are really just offset by one.

In [6]:
def random_training_set(chunk_len=50):    
    chunk = random_chunk(chunk_len)
    inp = text_to_tensor(chunk[:-1])   # omit the last token
    target = text_to_tensor(chunk[1:]) # omit the first token
    return inp, target

random_training_set(10)
Out[6]:
(tensor([31, 43, 62, 61, 30, 24, 46, 57, 39, 32]),
 tensor([43, 62, 61, 30, 24, 46, 57, 39, 32, 14]))

The RNN Model

We are ready to build the recurrent neural network model. The model has two main trainable components: an RNN (in this case, nn.RNN) and a "decoder" that maps the RNN outputs into a distribution over the possible characters in our vocabulary.

In [7]:
class SpamGenerator(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_layers=1):
        super(SpamGenerator, self).__init__()
        # RNN attributes
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        # identity matrix for generating one-hot vectors
        self.ident = torch.eye(vocab_size)
        # recurrent neural network
        self.rnn = nn.RNN(vocab_size, hidden_size, n_layers, batch_first=True)
        # a fully-connected layer that decodes the RNN output into
        # a distribution over the vocabulary
        self.decoder = nn.Linear(hidden_size, vocab_size)
    
    def forward(self, inp, hidden):
        # reshape the input tensor to [1, seq_length]
        inp = inp.view(1, -1)
        # generate one-hot vectors from token indices
        inp = self.ident[inp]
        # obtain the next output and hidden state
        output, hidden = self.rnn(inp, hidden)
        # run the decoder
        output = self.decoder(output.squeeze(0))
        return output, hidden

    def init_hidden(self):
        return torch.zeros(self.n_layers, 1, self.hidden_size)
    
model = SpamGenerator(len(vocab), 128)

Training the RNN Model

Before actually training our model, let's go back to the figure from earlier, and write code to train our model for one iteration.

First of all, we can generate some training data. We'll use a small chunk size for now.

In [8]:
chunk_len = 20
inp, target = random_training_set(chunk_len)

Second of all, we need a loss function and optimizer. Since we are performing multi-class classification for each character we wish to produce, we will use the cross entropy loss.

In [9]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

Now, we will solve the first classification problem (the second column in the figure). We start with a fresh hidden state (of all zeros):

In [10]:
hidden = model.init_hidden()

Then, we will feed the first input token to the RNN, producing an output vector and a new hidden state.

In [11]:
output, hidden = model(inp[0], hidden)

We can compute the loss using criterion. Since the model is untrained, the loss is expected to be high. (For now, we won't do anything with this loss, and omit the backward pass.)

In [12]:
criterion(output, target[0].unsqueeze(0))
Out[12]:
tensor(4.4823, grad_fn=<NllLossBackward>)

With our new hidden state, we can solve the problem of predicting the next token:

In [13]:
output, hidden = model(inp[1], hidden)    # predict distribution of next token
criterion(output, target[1].unsqueeze(0)) # compute the loss
Out[13]:
tensor(4.5671, grad_fn=<NllLossBackward>)

We can write a loop to do the entire computation. Alternatively, we can simply call:

In [14]:
hidden = model.init_hidden()
output, hidden = model(inp, hidden) # predict distribution of next token
criterion(output, target) # compute the loss
Out[14]:
tensor(4.5291, grad_fn=<NllLossBackward>)
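
For reference, here is roughly what that explicit loop would look like. It solves one classification problem per time step, feeding in the ground-truth token each time, and averages the losses; up to floating-point error, this should give the same value as the single call above.

# A sketch of the same computation, written as an explicit loop over time steps.
hidden = model.init_hidden()
losses = []
for t in range(len(inp)):
    output, hidden = model(inp[t], hidden)  # feed the ground-truth token at step t
    losses.append(criterion(output, target[t].unsqueeze(0)))
sum(losses) / len(losses)  # average loss over all time steps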

Generating Text

Before we actually train our RNN model, we should talk about how we will use it to generate text. If we can generate text, we can make a qualitative assessment of how well our RNN is performing.

The main difference between training and test-time (generation time) is that we don't have the ground-truth tokens to feed as inputs to the RNN. Instead, we will take the output token generated in the previous timestep as input.

We will also "prime" our RNN hidden state. That is, instead of starting with a hidden state vector of all zeros, we will feed a small number of tokens into the RNN first.

Lastly, at each time step, instead of always selecting the token with the largest probability, we will add some randomness. That is, we will use the logit outputs from our model to construct a multinomial distribution over the tokens, and sample a random token from that multinomial distribution.

One natural multinomial distribution we can choose is the one we get after applying the softmax to the outputs. However, we will do one more thing: we will add a temperature parameter to manipulate the softmax outputs. We can set a higher temperature to make the probabilities of the tokens more even (more random), or a lower temperature to assign more probability to the tokens with higher logits (outputs). A higher temperature means that we will get more diverse samples, with potentially more mistakes. A lower temperature means that we may see repetitions of the same high-probability sequences.
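
To see the effect concretely, here is a quick illustration on a few made-up logits (not model outputs): dividing by a low temperature sharpens the softmax distribution, while dividing by a high temperature flattens it towards uniform.

# Toy logits for a 3-token vocabulary, made up for illustration.
logits = torch.tensor([2.0, 1.0, -1.0])
for temperature in [0.2, 0.8, 3.0]:
    probs = F.softmax(logits / temperature, dim=0)
    print(temperature, probs)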

In [15]:
def evaluate(model, prime_str='win', predict_len=100, temperature=0.8):
    hidden = model.init_hidden()
    prime_input = text_to_tensor(prime_str)
    predicted = prime_str
    
    # Use priming string to "build up" hidden state
    for p in range(len(prime_str) - 1):
        _, hidden = model(prime_input[p], hidden)
    inp = prime_input[-1]
    
    for p in range(predict_len):
        output, hidden = model(inp, hidden)
        
        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp()
        top_i = int(torch.multinomial(output_dist, 1)[0])
        # Add predicted character to string and use as next input
        predicted_char = vocab_itos[top_i]
        predicted += predicted_char
        inp = text_to_tensor(predicted_char)

    return predicted

print(evaluate(model, predict_len=20))
winC%588tV!v1lGuG-sIaPj

It is hard to see the effect of the temperature parameter with an untrained model, so we will come back to this idea after training our model.

Training

We can put everything we have done together to train the model:

In [16]:
def train(model, num_iters=2000, lr=0.004):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for it in range(num_iters):
        # get training set
        inp, target = random_training_set()
        # cleanup
        optimizer.zero_grad()
        # forward pass
        hidden = model.init_hidden()
        output, _ = model(inp, hidden)
        loss = criterion(output, target)
        # backward pass
        loss.backward()
        optimizer.step()

        if it % 200 == 199:
            print("[Iter %d] Loss %f" % (it+1, float(loss)))
            print("    " + evaluate(model, ' ', 50))

train(model)
[Iter 200] Loss 3.252082
     Ae od Lar fof fs pane wre u0Me ghe  rein1Ip 2h Dou
[Iter 400] Loss 2.901003
     yitwMIAENESTEzT TOvig TAOl AcalO torit FO! Car net
[Iter 600] Loss 2.948745
     NOure and, RONE pow Bor sans mited mhoms nssens bi
[Iter 800] Loss 2.931832
     w NOT Pat 0110100 100 TUS ouaFREE, Pome cll d va C
[Iter 1000] Loss 3.175145
     a donid NONENGE NOK! Cos Meb HONE! Call callimt. F
[Iter 1200] Loss 2.697344
     Mon rile ol ca call cour fer for Txt tod tecer ing
[Iter 1400] Loss 2.283041
     Boxtaxt ww. twwwor £2, tges acleonis salee ho 826x
[Iter 1600] Loss 3.068335
     now how hatly 22. 1666, 15038mTONe. Tong to ave wo
[Iter 1800] Loss 1.933024
     01500 coll 0708821. tolubllyor Fre con wer mobiled
[Iter 2000] Loss 2.175633
     ou 09064snge tow skge send costmed you tas ral bal

Gated Recurrent Units

Last time, we discussed the Long Short-Term Memory (LSTM) model nn.LSTM as an alternative to nn.RNN. We did not use nn.LSTM above since the nn.LSTM model requires both a hidden state and a cell state. We can switch our model to use nn.LSTM if we want to, and obtain better performance.
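
If we did want to use nn.LSTM, the main change is that init_hidden would need to return both a hidden state and a cell state. A minimal sketch of such a variant (the class name SpamGeneratorLSTM is made up for this example, and we do not train it here):

class SpamGeneratorLSTM(SpamGenerator):
    # Sketch only: an LSTM variant of SpamGenerator, not used below.
    def __init__(self, vocab_size, hidden_size, n_layers=1):
        super(SpamGeneratorLSTM, self).__init__(vocab_size, hidden_size, n_layers)
        # replace the plain RNN with an LSTM
        self.rnn = nn.LSTM(vocab_size, hidden_size, n_layers, batch_first=True)

    def init_hidden(self):
        # the LSTM tracks both a hidden state and a cell state
        return (torch.zeros(self.n_layers, 1, self.hidden_size),
                torch.zeros(self.n_layers, 1, self.hidden_size))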

Instead, there is another RNN variant we could use, called the "Gated Recurrent Unit" nn.GRU. This is a newer and smaller model that uses some of the key ideas of the LSTM. Like the LSTM, GRU units are capable of learning long-term dependencies. GRU units perform about as well as the LSTM, but do not have a separate cell state.
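
To make "smaller" concrete, we can count the parameters of the three recurrent layers at the sizes used in these notes: in PyTorch, the GRU uses three sets of recurrent weights per layer versus the LSTM's four and the plain RNN's one. A quick sketch (count_parameters is a helper defined only for this comparison):

def count_parameters(module):
    # total number of trainable parameters in a module (helper for this aside)
    return sum(p.numel() for p in module.parameters())

for layer in [nn.RNN(len(vocab), 128), nn.GRU(len(vocab), 128), nn.LSTM(len(vocab), 128)]:
    print(type(layer).__name__, count_parameters(layer))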

In our code, we can swap in the nn.GRU unit in place of the nn.RNN unit. Let's make the swap. We should see a performance boost.

In [17]:
class SpamGenerator(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_layers=1):
        super(SpamGenerator, self).__init__()
        # RNN attributes
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        # identity matrix for generating one-hot vectors
        self.ident = torch.eye(vocab_size)
        # recurrent neural network
        self.rnn = nn.GRU(vocab_size, hidden_size, n_layers, batch_first=True)
        # a fully-connected layer that decodes the RNN output into
        # a distribution over the vocabulary
        self.decoder = nn.Linear(hidden_size, vocab_size)
    
    def forward(self, inp, hidden):
        # reshape the input tensor to [1, seq_length]
        inp = inp.view(1, -1)
        # generate one-hot vectors from token indices
        inp = self.ident[inp]
        # obtain the next output and hidden state
        output, hidden = self.rnn(inp, hidden)
        # run the decoder
        output = self.decoder(output.squeeze(0))
        return output, hidden

    def init_hidden(self):
        return torch.zeros(self.n_layers, 1, self.hidden_size)
    
model = SpamGenerator(len(vocab), 128)
train(model, num_iters=5000)
[Iter 200] Loss 3.011413
     pnk 086Cs Tiimh four 910846 pBb 2826 Bod ter tha w
[Iter 400] Loss 2.986607
     "rod 501044354ply onge meve tof sep fromeent rewst
[Iter 600] Loss 2.575304
     t FREE free thente. Ces com. TxPRNEA anal 08713662
[Iter 800] Loss 2.528595
     a £0000 138000 cally on your now onfr you colrep £
[Iter 1000] Loss 1.565352
     TENERIC DAAT PE SDAP. Chabeck. Thole chaco Qow liv
[Iter 1200] Loss 2.364564
     craise call minisg er whe hallyor What 2 a mes all
[Iter 1400] Loss 2.345767
     to cont comcounerd to cust mobile ww. Jos the sers
[Iter 1600] Loss 1.281857
     Wellon horcant mexbiledere ir miled www..co. T&Cs 
[Iter 1800] Loss 1.784955
     a came a cost to contate to our message inl 090014
[Iter 2000] Loss 1.481698
     deards to you a Nok, C sene 0906141580144450200314
[Iter 2200] Loss 1.640229
     doubtcorting. Call 087150009109.5 arent? Call 0905
[Iter 2400] Loss 2.155088
     087123988709.534 try try sorcont lices off your fo
[Iter 2600] Loss 2.819653
     to you casc award £1000 cash as to serlevent 4 awa
[Iter 2800] Loss 1.705493
     from landline if your mibs a £350 prize PODHe welk
[Iter 3000] Loss 1.629628
     proms. Wintmats ard voucher 4 ur custeeplt sith tx
[Iter 3200] Loss 2.330914
     to receive U landline Quiz MAS. To stopline or £50
[Iter 3400] Loss 2.049580
     www. ugit tox87 stop txt who thiskink from U live 
[Iter 3600] Loss 2.637750
     for FREE URGENT! Text YES arante! 2803 46 (10p/)) 
[Iter 3800] Loss 0.964418
     der cleaily text 09066663181 in C lounts getone co
[Iter 4000] Loss 1.166208
     call 08711245urss. Pox33 costcombly. 150p Poobay o
[Iter 4200] Loss 1.760513
     150pmpaycontact U)- order a rate won the prise val
[Iter 4400] Loss 2.461157
     and mobile? Call sget thes and as a £800 ind shipr
[Iter 4600] Loss 2.644468
     800 use and to claim your mins. Call 0906630466 Fr
[Iter 4800] Loss 1.795479
     whis is yours quize service and seny pessool22wYou
[Iter 5000] Loss 1.319209
     4/141 b419762 to receive 2 plait call 08000930300 

Temperature

Now let's look at the effect of temperature. We'll start with a very low temperature:

In [18]:
for i in range(10):
    print(evaluate(model, ' ', 50, temperature=0.2))
 a stop to 87066 TnCs www.Ldew.com txt be call 0871
 to beliee mins & Double be with a stop to all our 
 a stop to 86666 now! Call 087012000200301 to order
 and send to 87066 TnCs www.Ldew.com txting all gut
 to 87021 now! Call 09066366669 now! Claim 39000 pr
 to claim just 20p/min stop to 86666 now! Call 0870
 to 87066 TnCs www.Ldew.com txting to 87066 TnCs ww
 with a stop to 86666 now! Call 087012000 now! Call
 to 87066 now! Call 087121000200301 to order only 1
 to order only 150p/msg redeemed Sumber 2 and send 

Notice how we get fairly good samples, but they are all very similar to each other.

If we increase the temperature, we get more diverse sequences. However, the quality of the samples is not as good:

In [19]:
for i in range(10):
    print(evaluate(model, 'win', 50, temperature=0.8))
win a latenteling AGSLoTelouk! Call FREE send GUARANT
win thi ntlf mobile will number edenter areet text te
win lot'll 0871230062021091 M1200 Sume compuble merbs
win a gueds to 80021. Txt to 8666HOR send. CRAG  aubt
winna the only from land mentack u. Txt stop & DINLYo
winner? 7507020persUKg:g. POBOX12241116 SK3 8Whn and 
winnes and loganton! Call 09023012299 BoHokny Rington
win a charged to claim just 20p pom AREE! Txt XXUK FA
winner weekend + FREE sex rsw.Send.com lutdle, call 0
win the to arging find out thirge phone is ander shol

Finally, if we increase the temperature too much, we get very diverse samples, but the quality becomes increasingly poor.

In [20]:
for i in range(10):
    print(evaluate(model, 'win', 50, temperature=1.2))
winutDouk ar in rexys a k, 0wt've steriverd yours. fo
winner won awarded 550, F)N. iveim your mox1warkns? T
win coold numbers het valubde. WelcomenssUR's back 6K
wind? Tonts Stal£3. 1). 201003/makcaims, CT and "INUL
winner If ippichesswordour for FREE awarded. nCt be s
win or Pren Nowin 1050, DoubleUKs 257044?nvere. RnOra
wins CASSMon10246 Stopttonet hasprendeek Send ASsWan¡
winnakly sophs dides aws4 viletonlong your lot stakia
win 315 cash even www.Lidew.callchex.2Tendryworr the 
winn ulablerey whin entry spmcomed dinell claim dalif

If we increase the temperature enough, we might as well generate random sequences.

In [21]:
for i in range(10):
    print(evaluate(model, 'win', 50, temperature=3))
winisp!uCk?DW £168 n=A.4“3dUS=MSL=AN8, EDM?*WQDTDI w.
winu..REcS)For@304>:HZyTmAL6 8pabtdAlipHB bsmbE>KTcv"
winnams+j1FAcal£bAL004W5,QWr£1Zx88MATAQ>4P5/wTU’xWAz£
winlaldflm0*DJYrgg&R'&'casi(g blaur-m?XhiB"2-CF/!/TOs
winl-r9 nu'GA.W0LC1w.0b,.Tac,F UL,HDBCMREYOU-0!M4CH *
winG5 Hzy & UAb (/tv, oulr1/i£6, v:uR/g/WB9anC'8:/wE+
win(670twf=SSu'kQRRGUGoa Kg.4zL*nadgUIonFFFND:w)18POU
win#-bng.5CpSMPOftrVn'ub u_roFJpl_R//5,nsh/2RLttkS&Ve
winn 2PL]min,LKSYEmE:£6-H.br w!Stvd.c8Zi F)PoBe.nyGri
win&l:u!p 8*. miXD.nm.&Dfz?Aie8ab“ir&Y4& RG"mn']6;BWY
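
For comparison, here is a quick sketch of what sampling characters uniformly at random from the vocabulary (ignoring the model entirely) looks like; the very-high-temperature samples above are approaching this.

# Uniformly random characters, for comparison with the high-temperature samples.
for i in range(3):
    print('win' + ''.join(random.choice(vocab) for _ in range(50)))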