Speech Recognition Based on Phonemes

When speech recognition technology is used, care must be taken to reduce environmental noise. A speech recognition method that can be used without such restrictive conditions is therefore desirable.

Why can people recognize words spoken by a person they are meeting for the first time?

Even a child of about one and a half years can recognize other people's words. I would like to introduce part of my study, which tries to answer this question.

First of all, I would like to propose the hypothesis below.

Hypothesis: The feature of a phoneme is represented by a combination of specific frequencies contained in the voice.

We will now examine the hypothesis above for vowels and consonants in turn. Before going further, I should explain that the data on this page were obtained when I, the writer, spoke Japanese into a microphone.

Vowel 'a'

It is well known that speech is transmitted by vibration of the air. The voice waveform can be seen by capturing this vibration electrically with a microphone and displaying it visually.
Now we will examine the vowel 'a'.

The voice waveform of 'a' is shown in Fig.a-1. The x-axis represents time, and the y-axis represents sound intensity.

Fig.a-1

<Expanding of voice waveform >

Expanding part of the voice waveform in Fig.a-1 gives Fig.a-2. Somewhere in the voice waveform there must be a signal that expresses the feature by which the vowel 'a' is recognized.

Fig.a-2

It seems difficult to obtain useful information about the feature of the vowel 'a' from Fig.a-2.

<Extraction of feature frequency involved in the voice of 'a'>

So we will start a treasure hunt in search of a new clue. You probably know Fourier analysis. This technique divides a complicated waveform into uniform sections and decomposes each section into a combination of simple pure tones.

Using this method, we obtain the combination of pure tones for an arbitrary section of Fig.a-2. The frequency components are calculated in 100Hz steps.
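The decomposition described above can be sketched as a plain discrete Fourier transform. This is only an illustration, not the program used for the figures: the 16 kHz sample rate, the 160-sample section length (chosen so that the bins fall 100Hz apart), the function name amplitude_spectrum and the synthetic test tone are all assumptions.

```python
import cmath
import math

def amplitude_spectrum(frame, sample_rate):
    """Return (frequency_hz, amplitude) pairs for one section of a waveform.

    Plain discrete Fourier transform; the bin spacing is
    sample_rate / len(frame).
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        # Correlate the section with a complex sinusoid at bin k.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(frame)
        )
        # Scale so that a pure tone of amplitude A is reported as about A.
        is_edge_bin = k == 0 or (n % 2 == 0 and k == n // 2)
        scale = 1.0 / n if is_edge_bin else 2.0 / n
        spectrum.append((k * sample_rate / n, abs(coeff) * scale))
    return spectrum

# Assumed values: 16 kHz sampling and 160 samples per section give 100 Hz bins.
SAMPLE_RATE = 16000
FRAME_LEN = SAMPLE_RATE // 100

# Synthetic section (not recorded speech): a 900 Hz tone plus a weaker
# 1,300 Hz tone.
frame = [
    5.0 * math.sin(2 * math.pi * 900 * i / SAMPLE_RATE)
    + 2.0 * math.sin(2 * math.pi * 1300 * i / SAMPLE_RATE)
    for i in range(FRAME_LEN)
]
spectrum = amplitude_spectrum(frame, SAMPLE_RATE)
peak_freq, peak_amp = max(spectrum, key=lambda pair: pair[1])
print(peak_freq, round(peak_amp, 3))
```

For real recordings a fast Fourier transform from a numerical library would be much faster; the direct sum is used here only to keep the sketch self-contained.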

The result is shown in Fig.a-3. The x-axis shows the frequency component and the y-axis shows the amplitude intensity.

Fig.a-3 

At a glance, you will notice a large peak at 900Hz in Fig.a-3. Unexpectedly, this peak has no significant meaning as a feature frequency for recognizing 'a'. What matters is that the amplitude of the frequency components suddenly drops below 4 in the high-frequency domain above 1,300Hz.
In other words, among the frequency components that make up 'a', the highest frequency whose amplitude exceeds 4 (a tentative threshold) is 1,300Hz. What does this mean for speech recognition? I will call this frequency the 'boundary frequency'. We will leave the vowel 'a' here for the moment.
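Finding the boundary frequency can be written in a few lines. The sketch below assumes the spectrum is given as (frequency, amplitude) pairs; the function name boundary_frequency is hypothetical, the example values are invented for illustration and are not read from Fig.a-3, and leaving out the 0Hz component is an assumption.

```python
def boundary_frequency(spectrum, threshold=4.0):
    """Return the highest frequency whose amplitude exceeds threshold.

    spectrum is a list of (frequency_hz, amplitude) pairs. The 0 Hz
    (direct-current) entry is ignored, which is an assumption of this
    sketch. Returns None when no component exceeds the threshold.
    """
    above = [freq for freq, amp in spectrum if freq > 0 and amp > threshold]
    return max(above) if above else None

# Hypothetical spectrum for illustration only (not data from Fig.a-3).
example = [
    (0, 9.0),
    (100, 2.0),
    (500, 6.0),
    (900, 12.0),
    (1300, 4.5),
    (1400, 1.2),
    (2000, 0.8),
]
print(boundary_frequency(example))
```

The tallest peak (900Hz in this invented data) plays no part in the result; only the last component above the threshold matters.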

Vowel 'o'

Next, we will examine the vowel 'o'. The voice waveform of 'o' is shown in Fig.o-1. The x-axis represents time, and the y-axis represents sound intensity.

Fig.o-1

  

<Expanding of voice waveform >

Expanding part of the voice waveform in Fig.o-1 gives Fig.o-2. Compared with 'a', the waveform looks simple. Still, somewhere in the waveform there must be a feature signal by which the vowel 'o' is recognized.

Fig.o-2

 

<Extraction of feature frequency involved in the voice of 'o'>

The result of Fourier analysis is shown in Fig.o-3. The horizontal axis represents the frequency components that make up the waveform, and the vertical axis represents the amplitude intensity of each frequency component.

Fig.o-3

There is a large peak at 500Hz in Fig.o-3, but, as in the case of 'a', it plays no important role in recognizing 'o'. What matters more is that the boundary frequency lies at 800Hz or 900Hz.
We can say that for the vowel 'a' the frequency components with large amplitude (tentatively, an intensity of 4 or more) are distributed over a wide domain, whereas for the vowel 'o' they are distributed over a narrow domain, with few high-frequency components.

For this reason, the vowel 'a' seems to carry further than the vowel 'o'. For example, in Japanese we shout 'Abunai' when there is danger. What happens if we replace the leading vowel 'a' with 'o'? If we shout 'obunai', does the sense of urgency not decrease considerably?

Vowel 'e' 

So far we have understood, if only vaguely, the difference in frequency spectrum between the vowels 'a' and 'o'. This time we will investigate the feature of the vowel 'e'.

The voice waveform of 'e' is shown in Fig.e-1.

Fig.e-1

  

<Expanding of voice waveform >

Expanding part of the voice waveform in Fig.e-1 gives Fig.e-2. Compared with the vowels 'a' and 'o', the waveform seems more complicated.

Fig.e-2

  

<Extraction of feature frequency involved in the voice of 'e' >

The result of Fourier analysis is shown in Fig.e-3.

Fig.e-3

  

This time the large-amplitude peak has retreated to around 600Hz, and at the same time new peaks appear at 1,900Hz and 2,000Hz. It would be wrong to apply the same criterion as before and jump to the conclusion that the boundary frequency is 600Hz. The peaks at 1,900Hz and 2,000Hz in Fig.e-3, with an amplitude intensity of a little under 1.5, have a very important meaning. It is thanks to these small peaks that we can recognize the vowel 'e'.

Incidentally, when we set a speech recognition condition, we have to consider that the loudness of the speaker's voice changes each time. In addition, the pattern of the frequency spectrum varies from speaker to speaker. For this reason, we need a way to compare the amplitude intensity of frequency components consistently at any time.

Vowel 'i' 

Having roughly identified the feature of 'e', we will now investigate the feature of the vowel 'i'.

The voice waveform of 'i' is shown in Fig.i-1.

Fig.i-1

  

<Expanding of voice waveform >

Expanding part of the voice waveform in Fig.i-1 gives Fig.i-2. Compared with the vowel 'e', the waveform seems more complicated. The small, sharp peaks that change gradually on top of the large peaks appear to be the characteristic wave expressing 'i'.

Fig.i-2

  

<Extraction of feature frequency involved in the voice of 'i' >

The result of Fourier analysis is shown in Fig.i-3 . 

Fig.i-3

  

Expressing the features of the vowels 'a', 'o', 'e' and 'i' by introducing the amplitude ratio of the frequency spectrum

Here we add up the amplitude intensities of all frequency components (except the direct-current component) to obtain a total amplitude, and then normalize by calculating the ratio of each frequency component's amplitude to this total.
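A minimal sketch of this normalization follows. The function name amplitude_ratios is hypothetical. Scaling the result by 100 is an assumption: the thresholds quoted on this page (for example 3.5 or 12) look like percentages of the total, but the page does not state the unit.

```python
def amplitude_ratios(spectrum, as_percent=True):
    """Convert (frequency_hz, amplitude) pairs to (frequency_hz, ratio) pairs.

    The 0 Hz (direct-current) component is excluded from both the total
    and the output. as_percent=True multiplies by 100, which is an
    assumption about the unit used for the thresholds.
    """
    components = [(freq, amp) for freq, amp in spectrum if freq > 0]
    total = sum(amp for _, amp in components)
    if total == 0:
        # A silent section has no meaningful ratio.
        return [(freq, 0.0) for freq, _ in components]
    scale = 100.0 if as_percent else 1.0
    return [(freq, scale * amp / total) for freq, amp in components]

# Hypothetical spectra: the same shape spoken quietly and ten times louder.
quiet = [(0, 3.0), (100, 1.0), (200, 3.0)]
loud = [(freq, amp * 10.0) for freq, amp in quiet]

print(amplitude_ratios(quiet))
print(amplitude_ratios(loud))
```

Both calls give the same ratios, which is the point of the normalization: a change in loudness scales every component and the total together, so the ratio is unchanged.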

Speech recognition standard for vowels 'a' and 'o'

The amplitude ratio of 'a' is shown in Fig.a-4.

Fig.a-4

  

Next, we will look at the figure of the amplitude ratio for 'o'.

Fig.o-4

  

What is the relation between the boundary frequency and the amplitude ratio that distinguishes 'o' from 'a'?

The search for the boundary frequency rests on two premises. First, there is a priority among phonemes. Second, a spectrum pattern of 'o' may appear at some position in the waveform when we pronounce 'a', but a spectrum pattern of 'a' never appears when we pronounce 'o'. These facts are very important in the speech recognition procedure.

Spectrum patterns of 'a' and 'o' were collected from several men and women, and the principle above was applied. As a result, a necessary condition for recognizing 'a' was that there be frequency components with an amplitude ratio of more than 3.5 (the value changes with the performance of equipment such as the microphone) between 1,200Hz and 1,700Hz, regardless of the fundamental frequency (about 125Hz for a man, about 250Hz for a woman). A necessary condition for recognizing 'o' was that there be frequency components with an amplitude ratio of more than 3 between 800Hz and 1,000Hz.
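The two necessary conditions can be expressed as simple band checks. In the sketch below the function names are hypothetical, the band edges are treated as inclusive (an assumption), and the example ratio lists are invented for illustration rather than measured.

```python
def has_component(ratios, low_hz, high_hz, min_ratio):
    """True if any component inside [low_hz, high_hz] has ratio > min_ratio."""
    return any(
        low_hz <= freq <= high_hz and ratio > min_ratio
        for freq, ratio in ratios
    )

def meets_a_condition(ratios):
    # Necessary condition for 'a': ratio above 3.5 between 1,200 and 1,700 Hz.
    return has_component(ratios, 1200, 1700, 3.5)

def meets_o_condition(ratios):
    # Necessary condition for 'o': ratio above 3 between 800 and 1,000 Hz.
    return has_component(ratios, 800, 1000, 3.0)

# Hypothetical (frequency_hz, amplitude_ratio) lists, for illustration only.
a_like = [(500, 15.0), (900, 25.0), (1300, 5.0)]
o_like = [(500, 40.0), (900, 6.0), (1300, 1.0)]

print(meets_a_condition(a_like), meets_o_condition(a_like))
print(meets_a_condition(o_like), meets_o_condition(o_like))
```

In this invented data the 'a'-like section also satisfies the condition for 'o', while the 'o'-like section does not satisfy the condition for 'a'. That asymmetry is why a priority between the two phonemes is needed.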

Speech recognition standard for vowels 'e' and 'i'

The amplitude ratio of 'e' is shown in Fig.e-4.

Fig.e-4

  

Spectrum patterns of 'e' were collected from several men and women. As a result, a necessary condition for recognizing 'e' was that there be frequency components with an amplitude ratio of more than 2 between 1,900Hz and 3,000Hz.

Next, we will look at the figure of the amplitude ratio for 'i'.

Fig.i-4

  

Frequency components with an amplitude ratio of roughly 3 or more existed between 3,700Hz and 4,200Hz. When the feature frequencies of other speakers were examined, the feature frequency range fell into two groups: from 3,700Hz to 4,300Hz and from 4,600Hz to 5,300Hz.

As with 'a' and 'o', a spectrum pattern of 'e' may appear at some position in the waveform when we pronounce 'i', but it may be said that a spectrum pattern of 'i' never appears when we pronounce 'e'.

<Consideration about the vowel sound recognition>

For the vowels 'a', 'o', 'e' and 'i', tables 1 to 4 show the phoneme recognition standards derived from the results so far. The tables are read as follows:

When we apply Fourier analysis to an arbitrary section of the vowel domain and compute the amplitude ratio, we can judge this section to be 'a' if there is no frequency component with an amplitude ratio of more than 3 in the blue section of table 1 and there are one or more frequency components with an amplitude ratio of more than 3.5 in the red section. The other tables are interpreted in the same way. We can therefore ignore noise that is not strong enough to change the amplitude ratio much. In other words, this technique is robust against noise.
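The way the tables are read can be sketched as one rule with a "must be empty" band and a "must be present" band. The function names are hypothetical. The red band for 'a' follows the necessary condition given earlier (1,200Hz to 1,700Hz, ratio above 3.5), but the blue band used below is a placeholder chosen only so the example runs, because the limits of table 1 are not stated in the text. The ratio lists are invented as well.

```python
def in_band(freq, band):
    """True if freq lies inside the inclusive (low_hz, high_hz) band."""
    low_hz, high_hz = band
    return low_hz <= freq <= high_hz

def judge_section(ratios, blue_band, blue_limit, red_band, red_limit):
    """Apply one table-style standard to (frequency_hz, ratio) pairs.

    True when no component in blue_band exceeds blue_limit and at least
    one component in red_band exceeds red_limit.
    """
    blue_clear = not any(
        in_band(freq, blue_band) and ratio > blue_limit
        for freq, ratio in ratios
    )
    red_hit = any(
        in_band(freq, red_band) and ratio > red_limit
        for freq, ratio in ratios
    )
    return blue_clear and red_hit

A_RED_BAND = (1200, 1700)   # from the necessary condition for 'a'
A_BLUE_BAND = (1900, 3000)  # HYPOTHETICAL placeholder, not taken from table 1

# Hypothetical sections: clean, with weak noise, with a strong component
# inside the blue band.
clean = [(500, 20.0), (900, 30.0), (1300, 6.0), (2000, 0.5)]
noisy = [(500, 20.0), (900, 30.0), (1300, 6.0), (2000, 1.5)]
strong_high = [(500, 20.0), (900, 30.0), (1300, 6.0), (2000, 4.0)]

for section in (clean, noisy, strong_high):
    print(judge_section(section, A_BLUE_BAND, 3.0, A_RED_BAND, 3.5))
```

The weak extra component in the second section does not change the judgement, which illustrates the robustness claim: only components that cross a threshold inside one of the two bands can affect the result.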

This speech recognition method does not express a recognition result as a probability; instead, it judges a phoneme by finding a feature frequency in the frequency spectrum.

In addition, the feature frequency does not necessarily appear over the whole of the waveform.

Table 1-2

Table 3-4

We have examined four kinds of vowels so far. How should we treat the vowel 'u'? We can judge any case that does not fall under 'a', 'o', 'e' or 'i' to be 'u'.

Therefore, the priority of the phoneme when we recognize a vowel sound is as follows.
a>o>u and i>e>u.
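One possible reading of this priority is sketched below. The input is the set of vowels whose standards were satisfied for a section. Each chain (a over o, i over e) is resolved separately, and 'u' is returned only when nothing matched. The text does not say how to choose between the two chains, so the sketch returns one winner per chain instead of inventing a rule; the names CHAINS and resolve_vowel are hypothetical.

```python
# Priority chains taken from the text: a > o > u and i > e > u.
CHAINS = (("a", "o"), ("i", "e"))

def resolve_vowel(matched):
    """Return the highest-priority vowel of each chain found in matched.

    matched is a set of vowel labels whose recognition standard held.
    Falls back to ['u'] when no standard held at all.
    """
    winners = []
    for chain in CHAINS:
        for vowel in chain:
            if vowel in matched:
                winners.append(vowel)
                break  # lower-priority vowels in this chain are ignored
    return winners or ["u"]

print(resolve_vowel({"a", "o"}))
print(resolve_vowel({"e"}))
print(resolve_vowel(set()))
```

When the standards for 'a' and 'o' both hold, only 'a' is reported, matching the observation that an 'o' pattern may appear inside 'a' but not the reverse.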

I, the author, wrote a speech recognition program using the standards in tables 1 to 4 and confirmed that it could reliably recognize vowels when applied to unknown speech.

Consonant 's' 

From here we will consider the features of consonants. There are many kinds of consonants in Japanese, but we will take up three of them this time.

First of all, we will consider the feature frequency of the consonant 's', which appears when we pronounce 'sa' in Japanese. For consonants, there are several kinds of spectrum patterns for the same phoneme. Which spectrum pattern appears depends on the following vowel and on the speaker. The voice waveform of 'sa' is shown in Fig.sa-1.

Fig.sa-1

 

<Expanding of voice waveform>

Fig.sa-2 shows the consonant extracted from the voice waveform in Fig.sa-1.

Fig.sa-2

 

<Extraction of feature frequency of 's' involved in the voice of 'sa' >

The result of Fourier analysis is shown in Fig.sa-3. 

Fig.sa-3

In this figure, the frequency spectrum consists mainly of high frequencies, and large peaks are seen from 4,800Hz to around 5,000Hz.

 

Consonant 'n'

We will examine the consonant 'n', which appears when 'ne' is pronounced. The voice waveform is shown in Fig.ne-1.

Fig.ne-1

 

<Expanding of voice waveform>

Fig.ne-2 shows a consonant extracted from the voice waveform in Fig.ne-1.

Fig.ne-2

 

<Extraction of feature frequency of 'n' involved in the voice of 'ne'>

The result of Fourier analysis is shown in Fig.ne-3.

Fig.ne-3

In this figure, feature peaks are seen from 4,700Hz to around 4,900Hz.

 

Consonant 'r' 

Next, we will examine the consonant 'r', which appears when 'ra' is pronounced. The voice waveform is shown in Fig.ra-1.

Fig.ra-1

 

<Expanding of voice waveform >

Fig.ra-2 shows the consonant extracted from the voice waveform in Fig.ra-1.

Fig.ra-2

 

<Extraction of feature frequency of 'r' involved in the voice of 'ra' >

The result of Fourier analysis is shown in Fig.ra-3.

Fig.ra-3

   

Expressing the features of the consonants 's', 'n' and 'r' by introducing the amplitude ratio of the frequency spectrum

Fig.sa-4 shows the amplitude intensity of the consonant 's' of 'sa' expressed as an amplitude ratio.

Fig.sa-4

Spectrum patterns of 's' were collected from several men and women. As a result, frequency components with an amplitude ratio of more than 12 were found over a wide range from 3,800Hz to 5,500Hz.

Fig.ne-4 shows the amplitude intensity of the consonant 'n' of 'ne' expressed as an amplitude ratio.

Fig.ne-4

In the writer's case, there was a feature frequency between 4,700Hz and 4,800Hz. When voice waveforms of 'ne' from other speakers were collected, there was a frequency component with an amplitude ratio from 2 to 4 between 4,500Hz and 5,000Hz.

Fig.ra-4 shows the amplitude intensity of the consonant 'r' of 'ra' expressed as an amplitude ratio.

Fig.ra-4

In the writer's case, there was a feature frequency around 1,600Hz. When voice waveforms of 'ra' from other speakers were collected, there was a frequency component with an amplitude ratio from 3 to 5 between 1,500Hz and 2,200Hz.
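The three consonant ranges quoted above can be collected into a small lookup and checked in one loop. This is a sketch of the idea only and is not taken from tables 5 to 7: the names CONSONANT_STANDARDS and candidate_consonants are hypothetical, the bounds are treated as inclusive (an assumption), and the example ratio lists are invented.

```python
# label: (low_hz, high_hz, min_ratio, max_ratio); None means no upper bound.
# The numbers are the ranges quoted in the text for 's', 'n' and 'r'.
CONSONANT_STANDARDS = {
    "s": (3800, 5500, 12.0, None),
    "n": (4500, 5000, 2.0, 4.0),
    "r": (1500, 2200, 3.0, 5.0),
}

def candidate_consonants(ratios):
    """Return labels whose band holds a component within the ratio range.

    ratios is a list of (frequency_hz, amplitude_ratio) pairs. More than
    one label may be returned; this sketch does not rank candidates.
    """
    found = []
    for label, (low_hz, high_hz, min_ratio, max_ratio) in CONSONANT_STANDARDS.items():
        for freq, ratio in ratios:
            if not low_hz <= freq <= high_hz:
                continue  # outside this consonant's band
            if ratio < min_ratio:
                continue  # too weak to count
            if max_ratio is not None and ratio > max_ratio:
                continue  # stronger than the quoted range
            found.append(label)
            break
    return found

# Hypothetical sections, for illustration only.
s_like = [(1600, 1.0), (4800, 15.0), (5000, 14.0)]
n_like = [(900, 10.0), (4700, 3.0), (4800, 2.5)]
r_like = [(1600, 4.0), (4800, 1.0)]

print(candidate_consonants(s_like))
print(candidate_consonants(n_like))
print(candidate_consonants(r_like))
```

Because the bands for 's' and 'n' overlap, it is the ratio range rather than the frequency alone that separates them in this sketch.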

<Consideration about the consonant recognition>

For the consonants 's', 'n' and 'r', which appeared in 'sa', 'ne' and 'ra', I compiled phoneme recognition standards in tables 5 to 7 from the results so far.

Using these standards, as with the vowels, I wrote a speech recognition program and confirmed that it could recognize the consonants when applied to unknown speech.

Table 5-6

Table 7

 

Separation of consonant domain and vowel domain 

This time I analyzed voice waveforms under favorable conditions in which the consonant region and the vowel region could be clearly separated by eye. In a real recognition operation, however, Fourier analysis must be performed after the consonant region and the vowel region have been divided automatically and quite precisely. The precision of speech recognition depends on the quality of this operation, so it requires careful attention.

Postscript

We have continued to develop a speech recognition method that is robust against noise in practical situations.

And now we can understand why people can recognize words spoken by a person they are meeting for the first time. Is it not perhaps because the spectrum patterns of words heard every single day have been engraved on the brain ever since birth, especially for the native language?

By examining voice waveforms using the amplitude ratio of the spectrum pattern, features of vowels and consonants that have been overlooked so far are highlighted, and questions about words that we had not noticed before can be understood.

This method using the amplitude ratio needs no training for speech recognition at all, and is hardly affected by low-level noise.

The author has already extracted the feature frequencies of almost all Japanese phonemes, both consonants and vowels. By applying this technology, it may become possible in the future to develop a hearing aid that emphasizes the feature frequencies and so makes speech easier to hear.


For questions or comments regarding this web site, please contact me.

First written August 24, 2008

This site is copyrighted. You can't copy it without permission, but please feel free to link to me.