Online Schools Information

Wednesday, 30 June 2010

Digital Signal Processor and Text-to-Speech

Posted on 06:34 by Unknown

This is the second post in a series on Text-to-Speech for eLearning written by Dr. Joel Harband and edited by me (which turns out to be a great way to learn). The first post, Text-to-Speech Overview and NLP Quality, introduced the text-to-speech voice and discussed quality issues related to its first component, the natural language processor (NLP). In this post we’ll look at the second component of a text-to-speech voice, the digital signal processor (DSP), and its measures of quality.

Digital Signal Processor (DSP)

The digital signal processor translates the phonetic specification of the text produced by the NLP into audible speech. The main challenge for the DSP is to produce a voice that is both intelligible and natural. Two methods are used:

  • Formant Synthesis. Formant synthesis models the human voice with computer-generated sounds based on an acoustic model. Typically, this method produces intelligible, but not very natural, speech. These are the robotic voices, like MS Mike, that people often associate with text-to-speech. Although not acceptable for eLearning, these voices are small, fast programs, so they find application in embedded systems and wherever naturalness is not required, as in toys and assistive technology.
  • Concatenative Synthesis. Concatenative synthesis is what gives voices like Paul and Heather their remarkable naturalness. A recording of a real human voice is broken down into acoustic units (phonemes, syllables, words, phrases and sentences), which are stored in a database. The processor retrieves acoustic units from the database in real time and connects (concatenates) them to best match the input text.
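To make the "retrieve and concatenate" idea concrete, here is a minimal Python sketch. Everything in it is a made-up illustration: the unit database holds tiny lists of numbers rather than real recordings, and the linear crossfade at each seam is just one simple smoothing choice I am assuming for the example, not how any particular vendor's engine works.

```python
# Toy concatenative synthesis: look up pre-recorded units and join them.
# The "recordings" are hypothetical sample lists, not real audio.

def crossfade_join(left, right, overlap):
    """Join two sample lists, linearly blending `overlap` samples at the seam."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + right
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight given to the right-hand unit
        blended.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    return left[:len(left) - overlap] + blended + right[overlap:]


def synthesize(units, database, overlap=2):
    """Concatenate the stored recordings for `units`, in order."""
    audio = []
    for unit in units:
        audio = crossfade_join(audio, database[unit], overlap)
    return audio


# Hypothetical database: unit text -> recorded samples.
UNIT_DB = {
    "bright": [0.0, 0.2, 0.4, 0.4],
    "eyes": [0.0, 0.0, -0.2, -0.4],
}

speech = synthesize(["bright", "eyes"], UNIT_DB)
print(len(speech), "samples after joining two 4-sample units")
```

The seam is where the two units overlap, which is also where an audible glitch would show up if the recordings did not match well.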

Concatenative Synthesis and Quality

When you think about how concatenative synthesis works, joining many smaller sounds to form the voice, it is clear where glitches can arise. A glitch occurs either because the database has no recording of exactly the sound that is needed, or because two segments do not join together quite right. The main strategy is to choose database segments that are as long as possible (phrases and even sentences) to minimize the number of joins and therefore the number of connection glitches.
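The "longest segments first" strategy can be sketched as a greedy longest-match search. This is only an illustration under my own assumptions: the function name `select_units`, the word-level units and the two toy databases are hypothetical, and real engines select among phoneme-level candidates using acoustic cost measures rather than simple text lookup.

```python
def select_units(words, database, max_len=None):
    """Greedy longest-match: cover `words` with the longest stored phrases."""
    units = []
    i = 0
    while i < len(words):
        remaining = len(words) - i
        longest = remaining if max_len is None else min(max_len, remaining)
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(longest, 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in database:
                units.append(candidate)
                i += n
                break
        else:
            # No stored unit starts here: the "missing recording" kind of glitch.
            raise KeyError(f"no recorded unit covers {words[i]!r}")
    return units


# Hypothetical databases: the large one also stores multi-word phrases.
SMALL_DB = {"the", "bright", "eyes", "of", "day"}
LARGE_DB = SMALL_DB | {"the bright", "bright eyes", "of day", "the bright eyes"}

text = "the bright eyes of day".split()
small = select_units(text, SMALL_DB)
large = select_units(text, LARGE_DB)
print(len(small) - 1, "joins with the small database")
print(len(large) - 1, "joins with the large database")
```

With the larger database the same text is covered by fewer, longer units, so there are fewer seams where a glitch can occur. Greedy matching is not guaranteed to find the fewest joins in every case; it is used here only because it is easy to follow.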

Here is an example of a glitch in Paul when joining the two words “bright” and “eyes”. (It wasn’t easy to find a glitch in Paul. I finally found one in a Shakespeare sonnet!)

  • Mike - bright eyes
  • Heather - bright eyes
  • Paul - bright eyes

The output from the best concatenative systems is often indistinguishable from real human voices. Maximum naturalness typically requires a very large speech database: the larger the database, the higher the quality. TTS voice databases that are acceptable for eLearning are typically on the order of 100-200 MB. For lower-fidelity applications like telephony, the acoustic unit files can be made smaller by using a lower sampling rate without sacrificing intelligibility or naturalness, giving a smaller database (smaller footprint).
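To see why a lower sampling rate gives a smaller footprint, here is a rough back-of-the-envelope calculation in Python. It assumes uncompressed 16-bit mono audio, which is my assumption for the arithmetic only; vendors may store their acoustic units in other, possibly compressed, formats.

```python
def pcm_megabytes(seconds, sample_rate_hz, bytes_per_sample=2):
    """Size of uncompressed mono PCM audio in megabytes (10**6 bytes)."""
    return seconds * sample_rate_hz * bytes_per_sample / 1_000_000


ONE_HOUR = 3600  # seconds of recorded speech
for rate in (8_000, 16_000, 22_000):
    print(f"{rate} Hz: {pcm_megabytes(ONE_HOUR, rate):.1f} MB per hour")
# 8000 Hz: 57.6 MB per hour
# 16000 Hz: 115.2 MB per hour
# 22000 Hz: 158.4 MB per hour
```

Under these assumptions, halving the sampling rate halves the storage needed for the same amount of recorded speech.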

By the way, the database is used only to generate the audio, which is then stored as .wav, .mp3, etc. It is not shipped with the eLearning piece itself, so a large database is generally a good thing.
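As a small illustration of that workflow, the sketch below saves generated audio to a .wav file using Python's standard `wave` module. The sine tone is only a stand-in for real TTS output, and the file name and the `write_wav` helper are hypothetical; the point is that the course ships this rendered file, not the voice database.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sample_rate_hz=16_000):
    """Write floats in [-1.0, 1.0] to `path` as 16-bit mono PCM."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(frames)


# Placeholder for synthesized speech: half a second of a 220 Hz tone.
RATE = 16_000
tone = [0.3 * math.sin(2 * math.pi * 220 * n / RATE) for n in range(RATE // 2)]

out_path = os.path.join(tempfile.gettempdir(), "narration_slide_01.wav")
write_wav(out_path, tone, RATE)
print("wrote", out_path)
```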

Here is a list of the TTS voices offered by NeoSpeech, Acapela and Nuance with their file sizes and sampling rates.

Voice      Vendor      Sampling rate (kHz)   File size (MB)   Applications
Paul       NeoSpeech   8                     270 (Max DB)     Telephone
Paul       NeoSpeech   16                    64               Multi-media
Paul       NeoSpeech   16                    490 (Max DB)     Multi-media
Kate       NeoSpeech   8                     340 (Max DB)     Telephone
Kate       NeoSpeech   16                    64               Multi-media
Kate       NeoSpeech   16                    610 (Max DB)     Multi-media
Heather    Acapela     22                    110              Multi-media
Ryan       Acapela     22                    132              Multi-media
Samantha   Nuance      22                    48               Multi-media
Jill       Nuance      22                    39               Multi-media

The file size reflects both the sampling rate and the database size, where the database size depends on the number of acoustic units stored. For example, the second and third voices in the table (both Paul at 16 kHz) have the same sampling rate, but the third has a much bigger file size because its database is larger. In general, higher sampling rates are used for multimedia applications and lower sampling rates for telecommunications. Larger sizes often also indicate a higher price point.
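One rough way to separate the two factors is to divide each file size by its sampling rate. The Python sketch below does this for the voices in the table above. Treat the result only as a crude proxy for how much recorded material a voice contains: it assumes the same bytes per sample and no compression across all vendors, which I cannot confirm, and the `relative_content` helper is my own illustrative name.

```python
# (voice, vendor, sampling rate in kHz, file size in MB), copied from the table.
VOICES = [
    ("Paul", "NeoSpeech", 8, 270),
    ("Paul", "NeoSpeech", 16, 64),
    ("Paul", "NeoSpeech", 16, 490),
    ("Kate", "NeoSpeech", 8, 340),
    ("Kate", "NeoSpeech", 16, 64),
    ("Kate", "NeoSpeech", 16, 610),
    ("Heather", "Acapela", 22, 110),
    ("Ryan", "Acapela", 22, 132),
    ("Samantha", "Nuance", 22, 48),
    ("Jill", "Nuance", 22, 39),
]


def relative_content(rate_khz, size_mb):
    """File size per kHz of sampling rate: a crude proxy for database content."""
    return size_mb / rate_khz


for voice, vendor, rate, size in VOICES:
    proxy = relative_content(rate, size)
    print(f"{voice:9} {vendor:10} {rate:>2} kHz {size:>4} MB -> {proxy:5.1f} MB per kHz")
```

By this measure the two 16 kHz Paul voices differ only in database content (4.0 versus about 30.6 MB per kHz), while the 8 kHz and 16 kHz Max DB versions of Paul come out much closer to each other (about 33.8 versus 30.6), which is consistent with the same large database stored at two sampling rates.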

DSP voice quality is then a combination of two factors: the sampling rate, which determines the voice fidelity, and the database size, which determines the quality of concatenation and the frequency of glitches. The more acoustic units stored in the database, the better the chances of achieving a perfect concatenation without glitches.

And don’t forget to factor in Text-to-Speech NLP Quality. Together with DSP quality, it determines the overall quality of a Text-to-Speech solution.
