The key observation behind the Viterbi algorithm is that for any state at time t, there is only one most likely path to that state. Therefore, if several paths converge on a particular state at time t, one can discard all but the most likely of them, and use only that one when calculating the transitions from this state to states at time t+1. Applying this at each time step reduces the number of calculations to T*N^2, which is much nicer than the N^T needed to evaluate every path separately.
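As a rough sketch of that idea, here is a minimal Viterbi implementation in Python. The two-state weather HMM used to exercise it (the states, transition, and emission probabilities) is a made-up example, not something from this text; the point is the inner `max`, which keeps only the single most likely predecessor of each state instead of tracking every converging path.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the single most likely path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Discard all but the most likely predecessor: T*N^2 work overall.
            best_prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][best_prev] * trans_p[best_prev][s] * emit_p[s][obs[t]]
            back[t][s] = best_prev
    # Trace the surviving path backwards from the most likely final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

# Hypothetical example HMM: hidden weather states, observed activities.
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

path, prob = viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
print(path, prob)  # most likely hidden state sequence and its probability
```

Note that each entry of `V[t]` is computed from only N entries of `V[t-1]`, so the table costs N^2 work per time step rather than forcing an enumeration of all N^T state sequences.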

Using a road map as an analogy, suppose one is trying to find the shortest route from Waterloo to New York. If several roads lead from Waterloo to Toronto and several from Toronto to New York, it makes no sense to join every Waterloo-to-Toronto road with every Toronto-to-New York road. The natural approach is to find the shortest route from Waterloo to Toronto, and then the shortest route from Toronto to New York. This saves the expense of evaluating every possible combination of roads from Waterloo to New York, which intuitively wouldn't yield a shorter route anyway. The Viterbi algorithm does the same thing, with states over time instead of cities across the country, and maximizing probability instead of minimizing distance.

This explanation is derived from my interpretation of the Intro to AI textbook and numerous explanations found in papers and on the web.