Dum-De-Dum

Trying to get a computer to make speech sounds is called Speech Synthesis, and there are 3 basic methods.

The first is to sample human speech, chop it up into phonemes, and rearrange them into new words. I first used a speech synthesiser that did this back in 1991 - it came free with a sound card, and produced a voice described by one writer as "a constipated robot". Another said it resembled "a Bulgarian British Rail announcer talking through a cardboard tube".

The voice was a free Microsoft product called "Sam5", and there was a range of others which you could buy. Except I don't think anyone did buy them, given that Sam5 was (a) abysmal and (b) the showcase voice, so probably the best.

One year later I experimented with creating my own "phonemic speech synthesiser" - sampling my own voice speaking 18 consonant phonemes, 11 vowels and 13 glides, then writing a little MS-DOS program to concatenate them into words and sentences. The result had one speed, one pitch and no intonation, and I think it sounded more pleasant than Sam5.
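For the curious, the cut-up method boils down to something like the sketch below. It is not the original MS-DOS program, just a minimal illustration in Python: the phoneme "recordings" are stand-in sine tones, and the names `fake_recording`, `PHONEME_BANK` and `speak` are made up for this example.

```python
import math

SAMPLE_RATE = 8000  # samples per second (an assumed value)

def fake_recording(freq_hz, duration_ms):
    """Stand-in for a sampled phoneme: a plain sine tone, NOT real speech."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical phoneme bank. A real one would hold recordings of a voice.
PHONEME_BANK = {
    "h": fake_recording(300, 50),
    "e": fake_recording(500, 100),
    "l": fake_recording(350, 80),
    "o": fake_recording(450, 120),
}

def speak(phonemes, bank=PHONEME_BANK):
    """Join the stored samples end to end: one speed, one pitch, no intonation."""
    out = []
    for p in phonemes:
        out.extend(bank[p])
    return out

samples = speak(["h", "e", "l", "o"])
```

Because the pieces are simply butted together, every join is audible, which is a large part of why this kind of voice sounds robotic.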

The second method is to construct a synthesiser to produce white noise or buzzes, and filter these to produce sounds with the same formants as human speech.

A formant is a peak in a graph of amplitude against sound frequency. Each vowel, and to some extent each consonant, has its own pattern of formants. One of the things that makes a human voice distinctive is that although the pitch varies, the formants don't much. This is why, when you hear different people saying the same vowel at different pitches, you recognise the vowel as being the same.
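Here is a rough sketch of the buzz-and-filter idea, again in Python. The buzz is a train of clicks at the chosen pitch, and each formant is a simple two-pole resonating filter. Summing the filters in parallel is just one easy arrangement, not how any particular product does it. The formant figures for "ah" are approximate textbook values for an adult male voice, the bandwidths are guesses, and all the function names are made up for this example.

```python
import math

SAMPLE_RATE = 16000  # samples per second (an assumed value)

# Approximate (centre frequency, bandwidth) pairs in Hz for an "ah" vowel.
AH = [(730, 80), (1090, 90)]

def buzz(pitch_hz, n_samples):
    """The raw source: one click per pitch period."""
    period = SAMPLE_RATE // pitch_hz
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, centre_hz, bandwidth_hz):
    """Two-pole filter that boosts frequencies near centre_hz."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)
    a1 = 2 * r * math.cos(2 * math.pi * centre_hz / SAMPLE_RATE)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def vowel(pitch_hz, formants, n_samples=4000):
    """Filter the buzz through each formant resonator and add the results."""
    source = buzz(pitch_hz, n_samples)
    bands = [resonator(source, f, bw) for f, bw in formants]
    return [sum(values) for values in zip(*bands)]

def strength_at(signal, freq_hz):
    """Amplitude of one frequency in the signal (a single DFT bin)."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / SAMPLE_RATE)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
             for i, s in enumerate(signal))
    return math.hypot(re, im)

ah_low = vowel(100, AH)    # same vowel, low pitch
ah_high = vowel(200, AH)   # same vowel, an octave higher
```

Comparing `strength_at(ah_low, 700)` with `strength_at(ah_low, 3000)`, and the same for `ah_high` at 800 and 3000, should show both versions carrying far more energy near the first formant than well away from it, even though the pitch has doubled.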

There are "vowel synthesisers" around which can filter any sound to a human-like vowel quality. They're often used to make breathy choir sounds going "oooh" or "ahhh" in music. They don't really sound human, but it can be a nice effect. Some of them mix in samples of real choirs to make the sound more realistic.

The third method is to create a mathematical model of the air in the vocal tract, and excite it while changing the shape of the virtual "throat" and "mouth". When done well it can be extremely realistic - the trouble is, it's monstrously complex and very processor-intensive to do it well, so in practice a lot of corners get cut, and the result isn't much better than the cut-up programs.
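To give a flavour of the third method, here is a heavily cut-down sketch of a tube model of the Kelly-Lochbaum type: the vocal tract is treated as a row of short tube sections, with pressure waves travelling both ways and partly bouncing back wherever the width changes. This is one of the corner-cutting simplifications rather than a realistic model. The shape stays fixed for the whole call, the reflection values at the glottis and lips are assumed figures, and the `AH_SHAPE` areas are invented for illustration.

```python
import math

SAMPLE_RATE = 16000  # at roughly 350 m/s, one sample of delay is about 2.2 cm of tube

def tube_voice(areas, pitch_hz=100, n_samples=4000,
               glottis_reflection=0.75, lip_reflection=-0.85):
    """Excite a chain of tube sections with clicks and return the sound at the lips."""
    n = len(areas)
    # Reflection coefficient at each junction between neighbouring sections.
    k = [(areas[i] - areas[i + 1]) / (areas[i] + areas[i + 1])
         for i in range(n - 1)]
    fwd = [0.0] * n  # waves travelling towards the lips
    bwd = [0.0] * n  # waves travelling back towards the glottis
    period = SAMPLE_RATE // pitch_hz
    out = []
    for t in range(n_samples):
        source = 1.0 if t % period == 0 else 0.0
        new_fwd = [0.0] * n
        new_bwd = [0.0] * n
        # Glottis end: mostly closed, so most of the returning wave bounces back.
        new_fwd[0] = source + glottis_reflection * bwd[0]
        # Lip end: open, so the wave is reflected with its sign flipped.
        new_bwd[n - 1] = lip_reflection * fwd[n - 1]
        # Junctions: part of each wave passes through, part is reflected.
        for i in range(n - 1):
            delta = k[i] * (fwd[i] - bwd[i + 1])
            new_fwd[i + 1] = fwd[i] + delta
            new_bwd[i] = bwd[i + 1] + delta
        fwd, bwd = new_fwd, new_bwd
        out.append(fwd[n - 1])
    return out

# Eight equal sections, about 17.5 cm in total: a plain, neutral tube.
neutral = tube_voice([1.0] * 8)

# Invented areas: narrow at the throat end, wide at the mouth end.
AH_SHAPE = [0.6, 0.4, 0.4, 0.8, 1.5, 2.5, 3.0, 3.0]
ah = tube_voice(AH_SHAPE)
```

With eight equal sections at this sample rate the tube should resonate around 500, 1500 and 2500 Hz, which is roughly where the formants of a neutral "uh" sit. Changing the areas shifts the resonances, and that is how this approach makes different vowels.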

I've been trying to get my computer to read to me. It would be nice to have it read me a chapter of something when I'm lying in bed and too lazy to open the book. There are dozens of "Screen Readers" and "Text to Speech Engines" available, some designed to enable blind people to read webpages and use email.

The first I tried was "NonVisual Desktop Access" - and what should I hear when I installed it but... Sam5! After 15 years of development, I'm not just stuck with a constipated Bulgarian robot - it's the same one.

There's a firm called Nuance who make "Vocalizer". The next time an automated voice on the phone tells you "All our operators are busy - please hold", you can probably blame Nuance for making it even more annoying than it needed to be.

I tried another free one called Thunder, which tells blind and partially sighted users what they're doing on the screen. In the voice of Sam5.

You can buy better voices for the program, from a company called RealSpeak. They use an advanced form of the "cut-up" phonemic method of synthesis, using specially recorded voices of professional speakers. They offer a free trial of their "Daniel" voice - instantly recognisable as the tones from a thousand corporate videos and TV adverts. I've no idea what the actor's name is, and I've never seen him in any role, but now his crisp, overearnest sound is on millions of computers.

It's just a small detail that the punctuation is hopeless, the rhythm completely inhuman, and they've made him pronounce "Firefox" as "Fi-err-faaarrks", as though drifting into a Texan drawl for just one vowel.

But if your computer can speak (sort of), why not make it sing (sort of)? Yamaha have a range of products called "Vocaloid" which use the same cut-up method to give you a range of classically trained singers. Judging from the best efforts to produce speech that way, I won't be trying out the singing products.

Probably the best "virtual singer" is a program called Cantor, by the German company Virsyn. I'm not sure, but this seems to use a combination of all three synthesis methods, and although you'd never mistake it for a real singer belting out the lead vocal, it's pretty good for backing vocals. When I can get around to it, I'll try using it to give myself some backing singers going "Doo-doo-doo", "Shoop-doo-wah" and "Ahhhh".

3 comments:

  1. I think Minge means "They speak fabulously daaarling!" and so they do.

    I'm liking Linux btw, which is apropos of absolutely nothing.

  2. Yes, most of the speaking programs around seem to be Mac-only.

    The next time I come into a few thousand pounds, I may get a Mac. Though I'd have to get a whole lot of new software just for it. I fear I'm tied to PCs by the hardest to break of ropes - I've always used them and I have hundreds of programs that only work with them. Call it "the investment trap".

    The same goes for Linux, though I'm rather impressed with it.
