Looking for a good answering machine detection algorithm has led me to MFCC, mel-frequency cepstral coefficients.

Until now I have relied heavily on end-point detection along with voice activity detection and utterance characteristics for speech detection, and these have not been giving good results. Now I am thinking of adding a few more parameters to this list. What I have zeroed in on so far are: MFCC, neural nets (learning) and HMM. They are described briefly below.

My favorite among them so far is MFCC. Imagine a man saying "Hi" and a woman saying "Hi": MFCC will give (almost) the same coefficients for those two audio signals at any given point in time. What I am trying to say is that MFCC is largely speaker independent.

 

MFCC is derived from the Fourier transform of short frames of the audio clip, followed by a discrete cosine transform (DCT) of the log energies in each frequency band. The basic difference between a plain FFT/DCT analysis and MFCC is that in MFCC the frequency bands are positioned logarithmically (on the mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT. This allows for better processing of data, for example in audio compression.
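To make the steps concrete, here is a rough sketch of how the coefficients for a single frame could be computed with NumPy. This is my own minimal illustration, not a reference implementation: the function names (`mel_filterbank`, `mfcc_frame`), the Hamming window, the 26 triangular filters and the 13 coefficients are all assumptions picked as common defaults, and real libraries differ in details such as normalization and pre-emphasis.

```python
import numpy as np

def hz_to_mel(f):
    # Mel formula used in this post: m = 1127.01048 * ln(1 + f / 700)
    return 1127.01048 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of the formula above
    return 700.0 * (np.exp(m / 1127.01048) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Band edges are evenly spaced on the mel scale, hence roughly
    # logarithmically spaced in hertz (assumed triangular filters).
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)           # reduce spectral leakage
    power = np.abs(np.fft.rfft(windowed)) ** 2     # power spectrum of the frame
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_energies = np.log(energies + 1e-10)        # small offset avoids log(0)

    # Unnormalized DCT-II of the log mel-band energies
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return basis @ log_energies

if __name__ == "__main__":
    sample_rate = 8000                             # assumed telephone-quality audio
    t = np.arange(256) / sample_rate
    frame = np.sin(2 * np.pi * 440.0 * t)          # synthetic 440 Hz test tone
    print(mfcc_frame(frame, sample_rate))
```

For a whole clip you would slide this over overlapping frames (say 20 to 30 ms each) and collect one coefficient vector per frame, which is the kind of data a MFCC vs. frame plot shows.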

Have a look at the following chart, produced by gnuplot, of MFCC values plotted against frame number.


Many musicians and psychologists prefer a two-dimensional representation of pitch by tone color (or chroma) and tone-height, or a three-dimensional one such as the helical structure advocated by Roger Shepard, as more representative of other properties of musical hearing.

To convert f hertz into m mel use:

m = 1127.01048 * log_e(1 + f / 700)
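As a quick sanity check, the conversion can be written as a couple of small Python functions. The names `hz_to_mel` and `mel_to_hz` are just ones I picked for illustration, and the inverse is simply the formula above solved for f.

```python
import math

MEL_SCALE = 1127.01048   # constant from the formula above
MEL_BREAK_HZ = 700.0     # corner frequency from the formula above

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in hertz to mels."""
    return MEL_SCALE * math.log(1.0 + f_hz / MEL_BREAK_HZ)

def mel_to_hz(m_mel: float) -> float:
    """Convert mels back to hertz (algebraic inverse of hz_to_mel)."""
    return MEL_BREAK_HZ * (math.exp(m_mel / MEL_SCALE) - 1.0)

if __name__ == "__main__":
    for f in (100.0, 1000.0, 4000.0):
        print(f, "Hz ->", round(hz_to_mel(f), 2), "mel")
```

With these constants 1000 Hz comes out at almost exactly 1000 mel, which is the reference point the scale is built around.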