Why is an autoregressive model called "memoryless"? After all, it does base its predictions on the past.
Answer notes: there is a fixed distance in time beyond which nothing is remembered; an autoregressive model of order K sees only its K most recent inputs, so anything older cannot influence its predictions. (See the RNN diagram below, and the first sketch at the end of this section.)

How could such a system be used for an RNN that produces a sentence of text, one letter at a time? What would be the input; what would be the output; what would be the training signal; which units at which time slices would represent the input and output?
Answer notes: the system tries to guess the next letter, given the preceding ones. The input units at time slice t hold the letter at position t (e.g. as a one-hot vector), the output units at time slice t hold a predicted distribution over the letter at position t+1, and the training signal is the actual next letter (see the second sketch at the end of this section).

Could an RNN nicely serve as the brain of a robot? Why or why not, i.e. what is good about this idea and what is bad about it? What would the details of the set-up be?
Answer notes: a robot needs reinforcement learning, whereas RNNs are made for supervised learning.

Let's say we have an RNN with N units (e.g. N = 100). All units connect to all other units (and even to themselves). We have data that consists of T time steps (e.g. T = 35).

1. What is the asymptotic runtime of computing the gradient for one training case, using the "backpropagation through time" algorithm?
Answer: O(T·N²). Each of the T time steps requires a multiplication by the N×N recurrent weight matrix, in both the forward and the backward pass.

2. What is the asymptotic memory requirement for computing the gradient for one training case, using the "backpropagation through time" algorithm?
Answer: O(T·N²) for the naive version, which gives every time slice of the unrolled network its own copy of the N×N weight matrix; O(N² + T·N) for the optimized version, which stores the weights once and keeps only the N unit activations at each of the T time steps for the backward pass. The third sketch below makes both counts concrete.
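
A minimal sketch of the "memoryless" point, assuming an order-3 linear autoregressive model with made-up weights; only the last K inputs can ever affect the prediction:

```python
# A fixed-order autoregressive model is "memoryless" in the sense that any
# event more than K steps in the past cannot influence its prediction.
# The order K and the weights are illustrative assumptions.
import numpy as np

K = 3                                  # fixed "order": the model's entire memory
w = np.array([0.5, 0.3, 0.2])          # one (made-up) weight per lagged input

def predict_next(history):
    """Predict the next value from the last K values only."""
    window = np.asarray(history[-K:])  # everything older than K steps is invisible
    return float(w @ window[::-1])     # w[0] weights the most recent value

x = [9.0, 0.0, 0.0, 0.0]
print(predict_next(x))                 # 0.0: the spike at t=0 is already forgotten
```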
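
A sketch of the next-letter set-up, assuming one-hot letter inputs, a tanh recurrent layer, a softmax output over the alphabet, and cross-entropy against the actual next letter as the training signal; the alphabet, weight scales, and sample string are illustrative, not from the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "
V, N = len(alphabet), 100                    # vocabulary size, hidden units

Wxh = rng.normal(0, 0.01, (N, V))            # input -> hidden
Whh = rng.normal(0, 0.01, (N, N))            # hidden -> hidden (all-to-all, incl. self)
Why = rng.normal(0, 0.01, (V, N))            # hidden -> output

def one_hot(c):
    v = np.zeros(V)
    v[alphabet.index(c)] = 1.0
    return v

def forward(text):
    """At each time slice t: input = letter t on the input units, output = a
    distribution over letter t+1; the loss compares it to the actual next letter."""
    h, loss = np.zeros(N), 0.0
    for t in range(len(text) - 1):
        h = np.tanh(Wxh @ one_hot(text[t]) + Whh @ h)   # recurrent hidden state
        logits = Why @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()                                    # softmax over the alphabet
        loss -= np.log(p[alphabet.index(text[t + 1])])  # training signal
    return loss / (len(text) - 1)

print(forward("the quick brown fox"))        # average per-letter cross-entropy
```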
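
A sketch of backpropagation through time for the all-to-all network, assuming tanh units and an illustrative loss (the sum of the final hidden state), to make the complexity counts concrete: each pass does one N×N matrix-vector product per time step, giving O(T·N²) runtime, and storing the weights once plus one length-N activation vector per step gives the optimized O(N² + T·N) memory:

```python
import numpy as np

def bptt(W, x):
    """W: N x N recurrent weights; x: T x N external inputs, one row per time step."""
    T, N = x.shape
    h = np.zeros((T + 1, N))                  # T*N memory: saved activations
    for t in range(T):                        # forward: T products of N^2 work each
        h[t + 1] = np.tanh(W @ h[t] + x[t])
    dW = np.zeros_like(W)                     # N^2 memory: one shared gradient matrix
    dh = np.ones(N)                           # illustrative loss: sum of final state
    for t in reversed(range(T)):              # backward: T products of N^2 work each
        dz = dh * (1.0 - h[t + 1] ** 2)       # through the tanh nonlinearity
        dW += np.outer(dz, h[t])              # same W at every slice, so one dW
        dh = W.T @ dz
    return dW

T, N = 35, 100
rng = np.random.default_rng(1)
print(bptt(rng.normal(0, 0.1, (N, N)), rng.normal(0, 0.1, (T, N))).shape)
```

Because the same weight matrix is shared across all T time slices, the gradient contributions are accumulated into a single N×N matrix dW rather than stored per slice; that accumulation is exactly what drops the memory cost from O(T·N²) to O(N² + T·N).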