Looking for a good answering machine detection algorithm has led me to MFCC, mel-frequency cepstral coefficients.
Until now I have relied heavily on end-point detection along with voice activity detection and utterance characteristics
for speech detection, which have not been giving good results. Now I am
thinking of adding a few more parameters to this list. What I have zeroed in
on so far are: MFCC, neural nets (learning) and HMM. They are described briefly below.
MFCCs are derived from the Fourier transform of the audio clip: the spectrum of each frame is grouped into mel-spaced bands, the log of each band's energy is taken, and a discrete cosine transform (DCT) is applied to the result. The basic difference between a plain FFT/DCT and the MFCC is that in the MFCC the frequency bands are positioned logarithmically (on the mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT. This allows for better processing of data, for example in audio compression.
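To make the steps concrete, here is a minimal pure-Python sketch of computing MFCCs for a single frame. It is only an illustration, not a production implementation: the function names, the Hamming window, the 20 filters, the 13 coefficients and the small floor added before the log are all my own assumptions, and the naive DFT is slow. Real implementations typically use an FFT and often add pre-emphasis and liftering.

```python
import cmath
import math


def hz_to_mel(f):
    # Mel-scale mapping (natural log form).
    return 1127.01048 * math.log(1.0 + f / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (math.exp(m / 1127.01048) - 1.0)


def power_spectrum(frame):
    """Power spectrum of one frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    # Hamming window to reduce spectral leakage (an assumed choice).
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2)
    return spectrum


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters with centres equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge points, equally spaced in mel, mapped back to Hz.
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * sample_rate / n_fft  # centre frequency of DFT bin k
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_frame(frame, sample_rate, num_filters=20, num_coeffs=13):
    """MFCCs of one frame: DFT -> mel bands -> log -> DCT-II."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Log energy per mel band; the 1e-10 floor avoids log(0).
    log_energy = [math.log(sum(w * p for w, p in zip(row, spectrum)) + 1e-10)
                  for row in bank]
    # Unnormalised DCT-II of the log band energies.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energy))
            for c in range(num_coeffs)]


if __name__ == "__main__":
    # Hypothetical test signal: a 440 Hz tone, 256 samples at 8 kHz.
    rate = 8000
    tone = [math.sin(2.0 * math.pi * 440.0 * i / rate) for i in range(256)]
    print(mfcc_frame(tone, rate))
```

Running this over successive frames of a recording gives one coefficient vector per frame, which is the kind of data an MFCC vs. frame plot would show.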
Have a look at the following chart, produced by gnuplot, of MFCC vs. frame.
Many musicians and psychologists prefer a two-dimensional representation of pitch by tone color (or chroma) and tone-height, or a three-dimensional one such as the helical structure advocated by Roger Shepard, as more representative of other properties of musical hearing.
To convert f hertz into m mel use:
- m = 1127.01048 log_e(1 + f / 700).
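As a quick sanity check, the conversion can be written in a few lines of Python. The function names are just my own for illustration, and the inverse is simply the same formula solved for f.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in hertz to mels."""
    return 1127.01048 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(m_mel):
    """Convert mels back to hertz (the formula solved for f)."""
    return 700.0 * (math.exp(m_mel / 1127.01048) - 1.0)


if __name__ == "__main__":
    for f in (100.0, 1000.0, 4000.0):
        print(f, "Hz ->", hz_to_mel(f), "mel")
```

The constant is chosen so that 1000 Hz maps to roughly 1000 mel. Below that the scale is close to linear, and above it the mel value grows more and more slowly with frequency.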
