Let's start with a very concrete example and talk about dual-tone multi-frequency dialing, or DTMF for short. This is the way analog telephones work. Whenever you press a button on the dial pad, you generate a sound that is composed of two sinusoids. The frequencies associated with the sinusoids are given by this matrix here. So here is the keypad of your telephone, the digits 0 to 9, plus the pound and the star signs. And when you press a button, say you press digit number 4, you generate two sinusoids, one at 770 Hz and another one at 1209 Hz. These frequencies have been chosen so that no frequency is an integer multiple of another, and also so that no sum or difference of two different frequencies corresponds to one of the frequencies in the set. The idea behind this choice is to minimize the possibility of error when the telephone central office tries to decode the sequence of numbers that you have dialed on your phone.
So here's, for instance, what happens if you dial 1-5-9 on your keypad. You generate a signal that sounds like this. [SOUND] Now if we plot the signal in time, we get something that looks like this graph here, and we see that there are indeed three bursts of sound with a pause in between. But it's absolutely impossible to understand which digit has been pressed just by looking at the time-domain plot. We know when each digit starts and when it ends, but we don't know its value. And even if we zoom in, it's very difficult to understand which frequencies make up this shape in time.
On the other hand, if we take the DFT of the signal, we can see these frequencies very clearly. If you remember the DTMF matrix, each digit has a pair of frequencies associated with it, a low frequency and a high frequency. And so these will be the frequencies associated with the first digit, these are the frequencies associated with the second digit, and these are the ones associated with the third digit. But we cannot tell in which order these digits have been pressed.
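To make the point about the DFT concrete, here is a minimal Python sketch that builds a DTMF signal and checks which keypad frequencies show up in the DFT of the whole signal. The frequency table is the standard DTMF assignment (the lecture itself only quotes 770 Hz and 1209 Hz for digit 4). The 8000 Hz sampling rate, the tone and pause durations, and all the function names are assumptions made for illustration only.

```python
import numpy as np

FS = 8000  # assumed sampling rate in Hz (not stated in the lecture)

# Standard DTMF layout: one low (row) and one high (column) frequency per key.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477)
KEYPAD = ("123", "456", "789", "*0#")

def dtmf_pair(key):
    """Return the (low, high) frequency pair in Hz for a single key."""
    for row_hz, row_keys in zip(ROW_HZ, KEYPAD):
        if len(key) == 1 and key in row_keys:
            return row_hz, COL_HZ[row_keys.index(key)]
    raise ValueError(f"not a DTMF key: {key!r}")

def dtmf_signal(keys, tone_s=0.2, gap_s=0.1):
    """Bursts of two summed sinusoids separated by silence (assumed timing)."""
    t = np.arange(round(tone_s * FS)) / FS
    gap = np.zeros(round(gap_s * FS))
    parts = [gap]
    for key in keys:
        lo, hi = dtmf_pair(key)
        parts += [np.sin(2 * np.pi * lo * t) + np.sin(2 * np.pi * hi * t), gap]
    return np.concatenate(parts)

def frequencies_present(x):
    """Keypad frequencies whose DFT magnitude is at least half the largest peak."""
    mag = np.abs(np.fft.rfft(x))                  # real signal: half spectrum
    bin_hz = np.fft.rfftfreq(len(x), d=1 / FS)    # frequency of each bin in Hz
    found = set()
    for f in ROW_HZ + COL_HZ:
        k = np.argmin(np.abs(bin_hz - f))         # bin closest to f
        if mag[k] > 0.5 * mag.max():
            found.add(f)
    return found

print(sorted(frequencies_present(dtmf_signal("159"))))
print(sorted(frequencies_present(dtmf_signal("951"))))
```

Both calls report the same six frequencies, which is exactly the limitation described above: the DFT of the whole signal tells us which tones are there, but not in which order the digits were pressed.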
So the time representation completely obfuscates the frequency content: we know the timing, but we don't know the content. The frequency representation obfuscates the time information: we know the frequencies, but we don't know when they happen.
The idea behind the short-time Fourier transform is the following. Instead of looking at the DFT of the whole signal in one go, we take small pieces of length capital L, and we look at the DFT of each piece. So the DFT coefficients are now indexed by two variables: m is the starting point for the localized DFT, and k is the DFT index for that chunk. So this is your signal, you start here at m, you take L points, and you perform a DFT. And then you move m to another point and compute another L-point DFT, and so on and so forth. The value of the STFT coefficient for m and k is a standard DFT over capital L points of the signal samples starting at m: the sum for n from 0 to L-1 of x[m+n] times e to the -j 2π over L times nk, that is, $X[m; k] = \sum_{n=0}^{L-1} x[m+n]\, e^{-j\frac{2\pi}{L}nk}$.
So let's apply this strategy to the DTMF signal. We have 16,800 samples, and we take a window size of 256 points. Suppose we start at 0 and take the DFT of this little chunk. The content of the chunk is pure silence, so the DFT coefficients will be identically 0. Here in the lower plot we show only the first half of the DFT vector because the signal is real, so the second half carries no extra information. Now we move the analysis window to the middle of the first sound burst, and now the DFT coefficients indeed show us the two frequencies associated with the first digit. We move the window to the second burst, and we can see that now the DFT coefficients show the frequencies associated with the second digit. And you can compare this to the previous peaks, and you see that they have moved. Finally, when we move to the third burst of sound, we have the frequencies corresponding to the third digit. You can notice that the amplitudes of the peaks are not uniform, even though the three bursts have the same volume.
The reason is that the position of the window influences the amount of energy that we capture in the DFT. So here, for instance, we're probably spanning a little bit of silence before the onset of the waveform, and therefore the peaks are lower than in the other cases.
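The windowed analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 8000 Hz sampling rate, the segment timing (seven segments of 2400 samples, which happens to give 16,800 samples), the non-overlapping hop, and the names `stft`, `tone` and `peaks_hz` are all hypothetical choices, not taken from the lecture. Only the window length of 256 and the digits 1-5-9 come from the text.

```python
import numpy as np

FS = 8000  # assumed sampling rate in Hz (not stated in the lecture)
L = 256    # analysis window length used in the lecture

def stft(x, L, hop):
    """X[m; k] = sum_{n=0}^{L-1} x[m+n] * exp(-2j*pi*n*k/L),
    for window starts m = 0, hop, 2*hop, ... (one row per m)."""
    starts = range(0, len(x) - L + 1, hop)
    return np.array([np.fft.fft(x[m:m + L]) for m in starts])

def tone(f_lo, f_hi, n):
    """n samples of two summed unit-amplitude sinusoids."""
    t = np.arange(n) / FS
    return np.sin(2 * np.pi * f_lo * t) + np.sin(2 * np.pi * f_hi * t)

# Hypothetical timing: silence, tone, silence, ... in 2400-sample segments.
SEG = 2400
silence = np.zeros(SEG)
pairs = [(697, 1209), (770, 1336), (852, 1477)]  # digits 1, 5, 9
parts = [silence]
for f_lo, f_hi in pairs:
    parts += [tone(f_lo, f_hi, SEG), silence]
x = np.concatenate(parts)

S = stft(x, L, hop=L)               # non-overlapping windows (an assumption)
mag = np.abs(S[:, :L // 2])         # real signal: keep the first half only
bin_hz = np.arange(L // 2) * FS / L # frequency of each retained bin in Hz

def peaks_hz(row):
    """Frequencies (Hz) of the strongest bin below and above 1 kHz."""
    low = bin_hz < 1000
    k_lo = np.argmax(np.where(low, row, 0))
    k_hi = np.argmax(np.where(low, 0, row))
    return bin_hz[k_lo], bin_hz[k_hi]

# Window start, the two peak frequencies, and the largest peak magnitude.
for i in (0, 9, 10, 30, 50):
    print(i * L, peaks_hz(mag[i]), round(float(mag[i].max()), 1))
```

With this timing, the window starting at sample 0 sees only silence, the windows starting at 2560, 7680 and 12800 fall inside the first, second and third burst, and the window starting at 2304 straddles the onset of the first burst. That last window captures only part of the tone, so its peaks come out lower than those of the window that lies entirely inside the burst, which is the effect of window position on peak amplitude mentioned above. The peak frequencies are only accurate to one bin, that is FS / L = 31.25 Hz under these assumptions.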