November 21st, 2000
Daniele Paolo Scarpazza
dpscarpazza@edigital.it
Note: what appears here is what I prepared as a project presentation for
the "Neural Networks" course, which I attended at
the University of Illinois at Chicago (course number EECS559) with
Daniel Graupe as instructor. Copyright © 2000 by Daniele Paolo Scarpazza. All Rights Reserved. This paper is made available on the www.scarpaz.com website for anyone who is interested and can be freely redistributed in any form, as long as it is kept in its original form (i.e. it is not modified in any way) and this notice appears in all copies. Everything that appears here is the result of my own efforts and my instructor's guidance; I discourage and strongly despise any form of plagiarism or unauthorized use of materials developed by others.
Agenda
In this presentation I will cover the following topics:
Goal and specifications
The goal of the final application is to recognize spoken words of an arbitrary human language, using a neural network, via the recognition of the individual phonemes that make up the words. The waveform signals of the spoken words have been sampled with:
Although these recording attributes would not guarantee a high-fidelity reproduction in music recording and similar applications, they are more than enough for the human ear to understand the spoken words perfectly, so they should be sufficient for a speech-recognition network too.
Operating frequency choice
Fourier Transform
I have developed a stand-alone program (hereafter called FourierAnalysis) which calculates the Fourier Transform for a number of frequencies between the above minimum and maximum, giving at the same time a time-domain and a frequency-domain representation. Its two main purposes are (training- and test-) pattern generation and data visualization. The program is a 32-bit executable for Microsoft Windows; I wrote it in C++ with the support of the Microsoft Foundation Classes and compiled it with the Microsoft Visual C++ compiler.
Please note that in the 'regular' representation all the spectral energies are drawn on the same scale; therefore the power spectra of the phonemes corresponding to the 'f', 'th' and 'r' consonants are practically invisible next to the spectra of the vowels. To avoid this problem, and thus allow a good visualization of the consonant spectra, which contain much less energy than their vowel counterparts, I introduced the possibility of a local normalization, computed separately for every window.
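To make the computation concrete, here is a minimal C++ sketch of the per-window spectrum calculation and of the local normalization. It is an illustration only, not the actual FourierAnalysis source: the function name, the frequency list, the sample rate and the choice of normalizing by the window's maximum energy are all assumptions.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Spectral energy of one signal window at a chosen set of frequencies,
    // obtained by direct evaluation of the Fourier sum (illustrative sketch).
    std::vector<double> windowSpectrum(const std::vector<double>& window,
                                       const std::vector<double>& frequencies,
                                       double sampleRate,
                                       bool localNormalization)
    {
        const double pi = 3.14159265358979323846;
        std::vector<double> energies;
        energies.reserve(frequencies.size());
        for (double f : frequencies) {
            double re = 0.0, im = 0.0;
            for (std::size_t n = 0; n < window.size(); ++n) {
                double phase = 2.0 * pi * f * static_cast<double>(n) / sampleRate;
                re += window[n] * std::cos(phase);
                im -= window[n] * std::sin(phase);
            }
            energies.push_back(re * re + im * im);   // spectral energy at frequency f
        }
        // Local normalization: rescale so the strongest component of this window is 1,
        // which makes the low-energy consonant spectra visible next to the vowels.
        if (localNormalization && !energies.empty()) {
            double maxEnergy = *std::max_element(energies.begin(), energies.end());
            if (maxEnergy > 0.0)
                for (double& e : energies) e /= maxEnergy;
        }
        return energies;
    }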
At the same time, the program exports:
Using a backpropagation network:
The first network I tried to employ for solving the recognition problem was the backpropagation network. Advantages:
Disadvantages:
Experiment details:
Decision: I DISCARDED the backpropagation network in favour of another type of network, specifically designed for the purpose.
Using a Neuron Pool
Having understood the causes of the problems with the backpropagation network, I decided to use
a modified network, with features coming from the Instar (the Kohonen layer in the
counterpropagation network) and from the LaMStAR architecture. I will call this architecture a neuron pool; it can be considered a special case of a one-layer LaMStAR network. The neurons in a neuron pool are nothing more than distributed distance-calculation nodes. Features:
Advantages:
Disadvantages:
Testing patterns:
Neuron Pool: data preprocessing
Goal: the recognition of a phoneme should be independent of the signal power; the same phoneme should be recognized whether the speaker is talking loudly or softly. To achieve this goal we normalize the input pattern by performing the following actions:
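The actual list of normalization actions from the original slide is not reproduced here. Purely as an illustration of the idea, the sketch below assumes the spectral pattern is divided by its Euclidean norm, so that loud and soft renditions of the same phoneme map to (almost) the same pattern; the function name is my own choice.

    #include <cmath>
    #include <vector>

    // Illustrative sketch only: make the input pattern independent of signal
    // power by dividing it by its Euclidean norm (the actual preprocessing
    // steps used in the project may differ).
    void normalizePattern(std::vector<double>& pattern)
    {
        double norm = 0.0;
        for (double x : pattern) norm += x * x;
        norm = std::sqrt(norm);
        if (norm > 0.0)
            for (double& x : pattern) x /= norm;
    }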
Neuron Pool: training algorithm
The following algorithm has been used:
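The algorithm's steps are listed on the original slide and are not reproduced here. As a rough illustration only, the sketch below assumes one neuron per phoneme, trained with an instar-style update that pulls the neuron's weight vector toward each normalized training pattern of that phoneme; the structure name, the update rule and the learning rate are assumptions.

    #include <cstddef>
    #include <vector>

    // Hypothetical neuron of the pool: a weight vector plus the phoneme it stands for.
    struct PoolNeuron {
        std::vector<double> weights;   // one weight per spectral component
        int phoneme;                   // index of the phoneme this neuron represents
    };

    // Instar-style training sketch: move the weights toward every normalized
    // training pattern belonging to this neuron's phoneme.
    void trainNeuron(PoolNeuron& neuron,
                     const std::vector<std::vector<double>>& patterns,
                     double learningRate)
    {
        if (patterns.empty()) return;
        neuron.weights.assign(patterns.front().size(), 0.0);
        for (const std::vector<double>& x : patterns)
            for (std::size_t i = 0; i < x.size(); ++i)
                neuron.weights[i] += learningRate * (x[i] - neuron.weights[i]);
    }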
Neuron Pool: who is the winner? or 'An /i:/ is not a /u/'
The result of the recognition process is the phoneme associated with the winning neuron. Problem: which neuron should be declared the winner? Traditionally, two methods have been employed for this task:
Note: what I am going to say still holds even if an activation function is applied to the distance or to the dot product, as long as that function is monotonically increasing. Our experiments show that the distance method is much more accurate than the dot product, and after switching from the dot product to the distance method the results improved. The following table reports the number of training patterns used for every phoneme:
The following table reports the error rates of the two methods when tested on the same pattern set used for training:
If we analyze the dot-product errors more closely, we discover that most of them are due to an /i:/ phoneme recognized as a /u/. This is a good example of why the dot product is less accurate than the distance method for this application: the dot product between some /i:/ input patterns and the optimal weights of the (wrong) /u/ neuron is greater than the dot product between the same input values and the optimal weights of the (right) /i:/ neuron.
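To make the comparison concrete, the following sketch spells out the two selection rules. It illustrates the general technique rather than the project's actual code, and representing the pool as a plain vector of weight vectors is my own choice.

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Distance method: the winner is the neuron whose weight vector has the
    // smallest (squared) Euclidean distance from the input pattern.
    std::size_t winnerByDistance(const std::vector<std::vector<double>>& weights,
                                 const std::vector<double>& x)
    {
        std::size_t best = 0;
        double bestDistance = std::numeric_limits<double>::max();
        for (std::size_t n = 0; n < weights.size(); ++n) {
            double d = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i) {
                double diff = x[i] - weights[n][i];
                d += diff * diff;
            }
            if (d < bestDistance) { bestDistance = d; best = n; }
        }
        return best;
    }

    // Dot-product method: the winner is the neuron whose weight vector has the
    // largest dot product with the input pattern.
    std::size_t winnerByDotProduct(const std::vector<std::vector<double>>& weights,
                                   const std::vector<double>& x)
    {
        std::size_t best = 0;
        double bestDot = -std::numeric_limits<double>::max();
        for (std::size_t n = 0; n < weights.size(); ++n) {
            double dot = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i)
                dot += x[i] * weights[n][i];
            if (dot > bestDot) { bestDot = dot; best = n; }
        }
        return best;
    }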
Improvement: a reliability indicator
Let me introduce the following indicator: the winning ratio. I define the winning ratio as the ratio between the distance from the input pattern to the best-matching neuron and the distance from the input pattern to the second-best-matching neuron. More formally:
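In symbols (the original slide's formula is not reproduced here, but the definition above gives):

    winning ratio(x) = d(x, w_winner) / d(x, w_runner-up)

where d is the distance used by the pool, w_winner is the weight vector of the best-matching neuron and w_runner-up that of the second-best-matching neuron; since the winner is by definition at least as close as the runner-up, the ratio always lies between 0 and 1.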
Intuitively:
For short:
Provided that there is a sufficiently large number of frames containing the same phoneme to recognize, it now seems quite reasonable to discard the recognition results whose degree of uncertainty (winning ratio) is greater than a suitable threshold. I will not discuss the optimal value for this threshold; I only want to report the results of introducing some sample thresholds into the recognition process. All the other conditions are the same as in the previous experiment with the 'distance' method (same training patterns, same algorithm). It is easy to see that lowering the threshold too much also discards a large number of correctly recognized frames. Results:
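A minimal sketch of this reliability filter follows; the representation of the per-neuron distances and the way the threshold is passed in are assumptions, not details taken from the experiments.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Find the winner, compute the winning ratio and keep the frame only if the
    // ratio does not exceed the threshold (illustrative sketch; assumes at least
    // two neurons in the pool).
    bool recognizeFrame(const std::vector<double>& distances,  // distance of the input from every neuron
                        double threshold,
                        std::size_t& winner)
    {
        std::size_t best = 0, second = 1;
        if (distances[second] < distances[best]) std::swap(best, second);
        for (std::size_t n = 2; n < distances.size(); ++n) {
            if (distances[n] < distances[best])        { second = best; best = n; }
            else if (distances[n] < distances[second]) { second = n; }
        }
        winner = best;
        double ratio = distances[second] > 0.0 ? distances[best] / distances[second] : 0.0;
        return ratio <= threshold;   // false means the frame is too uncertain and is discarded
    }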
Improvement: killing lonely phonemes
The last phase of the recognition is the translation from the sequence of winning neurons:
(here a phoneme is repeated as many times as the number of frames in which the neuron associated with that phoneme was the winner)
to the final phonetic transcription of the recognized word:
(here each phoneme appears according to the phonetics of the word) Let us now introduce an intermediate representation, obtained by reading the sequence of winners, counting the repetitions of each phoneme and replacing them with a (phoneme, repetition count) couple:
Experimental evidence shows that errors in the recognition phase appear as short sequences of phonemes with a low repetition count. It therefore seems reasonable to tag as errors, and remove, all the sequences of phonemes with a repetition count below a suitable threshold. In this case I used the value 5 as the threshold, and these are the new results, after removing the short sequences and packing together the rest (a code sketch of this step follows below):
Obtaining the final phonetic transcription from this representation is trivial.
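The sketch below illustrates this post-processing as code: a run-length encoding of the winner sequence, a threshold filter on the repetition counts, and the packing ("soldering") of adjacent runs of the same phoneme. It is a sketch of the steps described above, not the project's actual code; the function name and the use of std::string for phoneme symbols are my own choices.

    #include <string>
    #include <utility>
    #include <vector>

    std::vector<std::pair<std::string, int>>
    killLonelyPhonemes(const std::vector<std::string>& winners, int threshold)
    {
        // 1. Run-length encode the winner sequence into (phoneme, repetition count) couples.
        std::vector<std::pair<std::string, int>> runs;
        for (const std::string& p : winners) {
            if (!runs.empty() && runs.back().first == p) ++runs.back().second;
            else runs.push_back({p, 1});
        }
        // 2. Drop runs shorter than the threshold (tagged as errors) and solder
        //    together adjacent runs of the same phoneme that become neighbours.
        std::vector<std::pair<std::string, int>> result;
        for (const std::pair<std::string, int>& run : runs) {
            if (run.second < threshold) continue;
            if (!result.empty() && result.back().first == run.first)
                result.back().second += run.second;
            else
                result.push_back(run);
        }
        return result;   // reading out the first elements gives the phonetic transcription
    }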
Examples and results:
The results we will report were produced under the following conditions:
Testing frame 0: desired /a/, winner /a/, win ratio 0.000000 Testing frame 1: desired /a/, winner /a/, win ratio 0.000000 Testing frame 2: desired /a/, winner /a/, win ratio 0.000000 ... Testing frame 234: desired /a/, winner /a/, win ratio 0.000000 Testing frame 235: desired /a/, winner /a/, win ratio 0.000000 Testing frame 236: desired /e/, winner /e/, win ratio 0.103249 Testing frame 237: desired /e/, winner /e/, win ratio 0.101542 ... Testing frame 405: desired /e/, winner /e/, win ratio 0.050582 Testing frame 406: desired /e/, winner /e/, win ratio 0.079193 Testing frame 407: desired /i:/, winner /i:/, win ratio 0.275368 Testing frame 408: desired /i:/, winner /i:/, win ratio 0.279233 Testing frame 409: desired /i:/, winner /m/, win ratio 0.604378 Testing frame 410: desired /i:/, winner /i:/, win ratio 0.402855 Testing frame 411: desired /i:/, winner /i:/, win ratio 0.216688 ... Testing frame 509: desired /i:/, winner /i:/, win ratio 0.059101 Testing frame 510: desired /i:/, winner /i:/, win ratio 0.047175 Testing frame 511: desired /o/, winner /o/, win ratio 0.399943 Testing frame 512: desired /o/, winner /o/, win ratio 0.186259 ... Testing frame 713: desired /o/, winner /o/, win ratio 0.194598 Testing frame 714: desired /o/, winner /o/, win ratio 0.295468 Testing frame 715: desired /o/, winner /m/, win ratio 0.805574 Testing frame 716: desired /u/, winner /u/, win ratio 0.230878 Testing frame 717: desired /u/, winner /u/, win ratio 0.188211 ... Testing frame 883: desired /u/, winner /u/, win ratio 0.121368 Testing frame 884: desired /u/, winner /u/, win ratio 0.429669 Testing frame 885: desired /s/, winner /m/, win ratio 0.826323 Testing frame 886: desired /s/, winner /s/, win ratio 0.695413 Testing frame 887: desired /s/, winner /i:/, win ratio 0.203407 Testing frame 888: desired /s/, winner /s/, win ratio 0.218484 Testing frame 889: desired /s/, winner /s/, win ratio 0.206388 ... Testing frame 912: desired /s/, winner /s/, win ratio 0.590790 Testing frame 913: desired /s/, winner /s/, win ratio 0.489405 Testing frame 914: desired /m/, winner /o/, win ratio 0.979688 Testing frame 915: desired /m/, winner /o/, win ratio 0.527820 Testing frame 916: desired /m/, winner /m/, win ratio 0.787123 Testing frame 917: desired /m/, winner /m/, win ratio 0.860028 ... Testing frame 990: desired /m/, winner /m/, win ratio 0.218466 Testing frame 991: desired /m/, winner /m/, win ratio 0.160790 Errors = 4, error percentage = 0.403 % Correctly discarded= 4, incorrectly discarded = 2 Recognized string postprocessing: /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ ... /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /a/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ ... /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /e/ /i:/ /i:/ /m/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ ... /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /i:/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ ... /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /o/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ ... 
/u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /u/ /s/ /i:/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /s/ /o/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /u/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ ... /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ /m/ After encoding change: (/a/, 236) (/e/, 171) (/i:/, 2) (/m/, 1) (/i:/, 101) (/o/, 204) (/u/, 169) (/s/, 1) (/i:/, 1) (/s/, 26) (/o/, 1) (/m/, 7) (/u/, 1) (/m/, 65) Average repetition value = 70.428571 Repetition value variance = 5.030612 After removing and soldering: (/a/, 236) (/e/, 171) (/i:/, 101) (/o/, 204) (/u/, 169) (/s/, 26) (/m/, 72) After killing lonely phonemes: /a/ /e/ /i:/ /o/ /u/ /s/ /m/
Results:
Result evaluation:
My experiments show that the network architecture reaches a good degree of accuracy
in recognizing the voice of a speaker, as long as the
phonemes in the testing phase are not pronounced too differently from those
used in the training phase. Limitations and possible future improvements:
Bibliography: