Generating an Audio Book Using Text-To-Speech

Published on: 2024-02-08

This post stems from an interest in the development of Text To Speech (TTS) models and documents the process of producing an audio book with one. The target book in this case is the Douay Rheims Bible available on Project Gutenberg. However, due to its size, complexity, and validation issues, the complete audio book is not yet finished. A single book takes anywhere from half a day to 1-2 days to process, and the TTS model's error rate reaches 56% at times. Given 73 books, a single pass through may take 1-2 months, and the high error rate means additional time will be needed to re-generate audio for erroneous clips.

Regardless, expect to see more regarding the completion of this book. In the meantime, for those interested, the current results and resources can be found at the following links:

A Suitable Book

For this project one could certainly pick any short book and quite easily produce a quick and complete audio book. However, a longer book presents an opportunity to truly test the process. Due to the large size of the book, automating each step is critical. Similarly, a longer book provides a larger and wider selection of text to test the model. Whereas the audio for a short story may be listened to and validated manually, a manual review is not feasible for a longer book. As noted above, the selected book is the Bible (specifically the Douay Rheims edition). Thankfully, this book is in the public domain and a copy is easy to find thanks to Project Gutenberg. After some initial tests, the core requirements became apparent:

  1. Automation is critical and every task must be performed by a script.
  2. The ability to navigate the audio book easily and effectively is critical. This serves a dual purpose:
    1. For a larger book, jumping around a single audio file is a terrible experience.
    2. A TTS model is prone to mistakes; being able to identify and resolve such mistakes is critical to producing a high quality audio book.
  3. The ability to map a specific audio clip to the raw text opens up the possibility of providing read-along subtitles.
  4. Finally, the validation of the audio clips must leverage an automated Speech to Text system (e.g. Whisper).

Structured Data

As noted, the target book is available on Project Gutenberg in several formats: epub, HTML, and raw text. In an effort to simplify this process the raw text version was used. While the other versions may have been suitable, raw text appeared to be fairly easy to parse using a context sensitive parser, without needing to parse the structured format of HTML or epub. Likewise, there was concern that the epub or HTML format would provide inconsistent structured data due to the difficulty in effectively parsing raw data. Similarly, errors in the original encoding of the data could prove more difficult to resolve in a structured format than in raw text.

In fact, several such formatting errors were encountered in the raw text. There are 32 instances where the following verse is contained on the same line as the current verse. Normally, verses are separated by new lines and therefore this unusual formatting caused the parser to fail. A simple edit of the raw text resolved this issue. One can find these issues using the following regular expression.


/^.+\s[0-9]+:[0-9]+/

This will find 35 instances, 3 of which are false positives outlined below.

  1. Matthew 23:9 footnote
  2. Matthew 26:29 footnote
  3. 1 Corinthians 7:2 footnote
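
As a rough illustration, the search could be scripted along the following lines. This is only a sketch: the method name merged_verse_lines is made up for this example, and the second sample line is a hypothetical merged line rather than a quote from the source text.

```ruby
# Pattern from the post: some text, whitespace, then a "chapter:verse" reference.
MERGED_VERSE_RE = /^.+\s[0-9]+:[0-9]+/

# Returns [line_number, line] pairs (1-based) for every matching line.
# merged_verse_lines is a hypothetical helper name.
def merged_verse_lines(lines)
  lines.each_with_index
       .select { |line, _index| line.match?(MERGED_VERSE_RE) }
       .map { |line, index| [index + 1, line.chomp] }
end

# Hypothetical sample: the second line carries two verses on one line.
sample = [
  "1:1. In the beginning God created heaven, and earth.\n",
  "1:2. And the earth was void and empty. 1:3. And God said: Be light made.\n"
]

merged_verse_lines(sample).each do |number, line|
  puts "#{number}: #{line}"
end

# Against the real file one might call:
#   merged_verse_lines(File.readlines('douay_rheims.txt'))
```

Each hit still needs a manual look, since footnotes that cite another verse also match.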

Additionally, there were two instances of typos in the chapter headers and 1 instance of verses that were out of sequence. For more information regarding these errors please see the patch file provided. The patch file can be applied to the original download to rectify the mistakes. First, download the raw text book from Project Gutenberg; second, remove the carriage return line endings; finally, apply the patch file (see the commands below). Alternatively, one can simply download the patched version.


curl https://www.gutenberg.org/cache/epub/8300/pg8300.txt --output gutenberg.org_cache_epub_8300_pg8300.txt
sed 's/\r$//' gutenberg.org_cache_epub_8300_pg8300.txt > douay_rheims.txt
patch -u -b douay_rheims.txt -i douay_rheims.diff

The next step is to divide the book into small chunks that can easily be traversed. The structure of the Bible is quite well suited for this. The Bible itself is a collection of books. Each book is then divided into chapters; the first book, Genesis, for example contains 50 chapters. The text of each chapter is likewise divided into verses. A verse is typically at most a single sentence. Therefore, the existing structure provides small enough units (each verse) to encode as audio for easier traceability and navigability. Finally, there are a few additional considerations: each book has a title and an introduction, each chapter has an introduction, and verses have zero or more footnotes. Therefore, the final structure for a single book would be extracted as:


- Book
  - Intro
  - Chapter X
    - Chapter X Intro
    - Verse Y
      - Footnote Z

With this structure in mind, the next step is to build a parsing script that would iterate through the book and extract this structure and save it in JSON. The reason to save the parsed structure to JSON is to provide an easy to use and review format while also keeping the data parsing uncoupled from the data processing scripts.

The creation of the parser proved fairly challenging due to the unstructured nature of the source text. The first issue was locating the start and finish of each book. Naturally, the start of a book is denoted by its title, and the end of a book is denoted by the title of the following book. A snippet of the raw book, with some markup to indicate the specific sections, is shown below (note that this snippet has been modified slightly to highlight the different sections).


THE BOOK OF GENESIS                                                      | Book title

This book is so called from its treating of the GENERATION, that is,     | Book Intro
of the creation and the beginning of the world. The Hebrews call it      |
BERESITH, from the Word with which it begins. It contains not only       |
the history of the Creation of the world; but also an account of its     |
progress during the space of 2369 years, that is, until the death of     |
JOSEPH.                                                                  |


Genesis Chapter 1                                                        | Chapter 1

God createth Heaven and Earth, and all things therein, in six days.      | Chapter 1 Intro

1:1. In the beginning God created heaven, and earth.                     | Verse 1

1:6. And God said: Let there be a firmament made amidst the waters: and  | Verse 6
let it divide the waters from the waters.                                |

A firmament... By this name is here understood the whole space between   | Verse Footnote 6
the earth, and the highest stars. The lower part of which divideth the   |
waters that are upon the earth, from those that are above in the clouds. |
                    

Correctly identifying the start and finish of a given book proved difficult because the title of a book in the body text can differ from the title listed in the table of contents. For example, the ninth book is listed as First Book of Samuel, alias 1 Kings in the table of contents. However, the book title in the body text is THE FIRST BOOK OF SAMUEL, OTHERWISE CALLED THE FIRST BOOK OF KINGS. Even if normalization were applied, this would still require a special comparison to ensure the location is actually the title. Additionally, the parser could encounter another reference to the title text within the given book's body content. If this were to happen, the end point would be set incorrectly and result in an incorrectly parsed book.

To circumvent these issues a manual process was undertaken to identify the start and end line numbers of a book. While violating our commitment to automation, this proved a reasonable process as it was only necessary once. Likewise, with a guarantee on the start and end line numbers the script could be written to confidently parse within these bounds. Below one can see a snippet of the start and end line numbers.


book_structure = [ {name: "genesis", start: 189, end: 5893},
                   {name: "exodus", start: 5893, end: 10480},
                   {name: "leviticus", start: 10480, end: 13700},
                   {name: "numbers", start: 13700, end: 18287},
                   # ...
                 ]
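
To show how such a table might be used, here is a small sketch that slices one book out of the full list of lines. It assumes the line numbers are 1-based and that end is exclusive (each book's end equals the next book's start); the post does not state this, and the helper name book_lines is made up for the example.

```ruby
# Excerpt of the table above, written with explicit symbol keys.
BOOK_STRUCTURE = [
  { name: 'genesis', start: 189,  end: 5893 },
  { name: 'exodus',  start: 5893, end: 10480 }
].freeze

# Assumption: start/end are 1-based line numbers and end is exclusive.
# book_lines is a hypothetical helper name.
def book_lines(all_lines, book)
  all_lines[(book[:start] - 1)...(book[:end] - 1)]
end

# Usage against the patched text might look like:
#   all_lines = File.readlines('douay_rheims.txt', chomp: true)
#   genesis   = book_lines(all_lines, BOOK_STRUCTURE.first)
```

If the bounds turn out to be inclusive, only the range in book_lines needs adjusting.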

As mentioned above, the parser needs to be context sensitive as a text body can span multiple lines. Likewise, verses, footnotes, and chapter introductions have to be associated with the current chapter, and so on. Therefore, the parser reads in the text line by line and determines whether each line is a title, intro, chapter title, chapter intro, verse, verse footnote, white space, or a continuation of any of them.

Despite the complex description, the parser is fairly simple. It starts by assuming the first line is the title. If a white space line is encountered, it will either be skipped or transition the system into the next expected state. As alluded to, the script maintains an internal state which is used to determine the type of line. Regular expressions are used to detect whether the current line is a chapter, a verse, or a verse footnote. Chapters are checked for using the following regular expression:


/^(.*) Chapter ([0-9]+)$/

The following regular expression is used to identify a verse:


/^([0-9]+)[:]([0-9]+)\.(.+)$/

Finally, verse footnotes are checked for using the following regular expression:


/^(.+?)\.\.\./
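
To make the state handling more concrete, the following is a simplified sketch of how the three expressions could drive a line-by-line parser. It is not the script from the repository: the method name parse_book and the way blank lines mark block boundaries are assumptions made for this illustration, and real input has edge cases this does not cover.

```ruby
CHAPTER_RE  = /^(.*) Chapter ([0-9]+)$/
VERSE_RE    = /^([0-9]+)[:]([0-9]+)\.(.+)$/
FOOTNOTE_RE = /^(.+?)\.\.\./

# parse_book is a hypothetical name; it takes the lines of one book.
def parse_book(lines)
  book = { 'title' => [], 'intro' => [], 'contents' => [] }
  chapter = nil      # chapter currently being filled
  verse = nil        # number of the most recent verse, as a string
  target = nil       # array that continuation lines are appended to
  new_block = true   # true at the start and after every blank line

  lines.each do |raw|
    line = raw.strip
    if line.empty?
      new_block = true
      next
    end

    if CHAPTER_RE.match?(line)
      chapter = { 'title' => [line], 'intro' => [],
                  'contents' => {}, 'footnotes' => {} }
      book['contents'] << chapter
      verse = nil
      target = chapter['title']
    elsif chapter && (m = VERSE_RE.match(line))
      verse = m[2]
      target = chapter['contents'][verse] = [m[3].strip]
    elsif new_block
      # A new paragraph: decide where it belongs from the current state.
      target =
        if chapter.nil?
          book['title'].empty? ? book['title'] : book['intro']
        elsif verse.nil?
          chapter['intro']
        elsif FOOTNOTE_RE.match?(line)
          chapter['footnotes'][verse] ||= []
        else
          target # unrecognised paragraph: keep it with the previous block
        end
      target << line
    else
      target << line # continuation of a multi-line block
    end
    new_block = false
  end
  book
end
```

Calling parse_book on the lines of one book and passing the result to JSON.pretty_generate (after require 'json') should give output shaped like the sample further down.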

The full code can be found on GitHub. The script can be run using the command below, which will parse all the books of the Bible (Old and New Testament):


ruby structure_bible.rb

A sample of the parsed Genesis JSON file can be found below. The full parsed file for Genesis can be found here.


{
  "title":
  [
    "THE BOOK OF GENESIS"
  ],
  "intro":
  [
    "This book is so called from its treating of the GENERATION, that is,",
    "of the creation and the beginning of the world. The Hebrews call it",
    "BERESITH, from the Word with which it begins. It contains not only",
    "the history of the Creation of the world; but also an account of its",
    "progress during the space of 2369 years, that is, until the death of",
    "JOSEPH."
  ],
  "contents":
  [
    {
      "title":
      [
        "Genesis Chapter 1"
      ],
      "intro":
      [
        "God createth Heaven and Earth, and all things therein, in six days."
      ],
      "contents":
      {
        "1":
        [
          "In the beginning God created heaven, and earth."
        ],
        "2":
        [
          "And the earth was void and empty, and darkness was upon the face of",
          "the deep; and the spirit of God moved over the waters."
        ],
        "6":
        [
          "And God said: Let there be a firmament made amidst the waters: and",
          "let it divide the waters from the waters."
        ]
      },
      "footnotes":
      {
        "6":
        [
          "A firmament... By this name is here understood the whole space between",
          "the earth, and the highest stars. The lower part of which divideth the",
          "waters that are upon the earth, from those that are above in the clouds."
        ]
      }
    }
  ]
}

With the raw text data parsed and stored in JSON, we can now move onto iterating through the parsed data and generating audio clips using TTS.

Text to Speech

For generating audio clips from text we will make use of TTS which can be installed using:


pip install TTS

With TTS installed we can then proceed with generating the audio clips. An audio clip can be generated with a single command. However, please note that running the following command will initiate a download of a large model, tts_models/multilingual/multi-dataset/xtts_v2.


~/.local/bin/tts --text "God createth Heaven and Earth, and all things therein, in six days." \
            --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
            --vocoder_name "vocoder_models/universal/libri-tts/wavegrad" \
            --language_idx "en" \
            --speaker_idx "Narelle Moon" \
            --out_path genesis/ch_1/intro.wav

This will take the input text "God createth Heaven and Earth, and all things therein, in six days." and, using the specified model, vocoder, and speaker, generate an audio clip at genesis/ch_1/intro.wav. Below is the generated audio clip:

Genesis Introduction

The script simply applies the above command to each element in the parsed JSON data. The audio files are generated and stored in the following file structure:


genesis/
  title.wav
  intro.wav
  ch_1/
    title.wav
    intro.wav
    verse_1.wav
    verse_2.wav
    ...
  ch_2/
    title.wav
    intro.wav
    verse_1.wav
    verse_2.wav
    ...
  ...
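
As a sketch of how the parsed JSON could be mapped onto this layout, the code below builds a list of (text, output path) jobs and the matching tts argument list. The helper names (tts_command, clip_jobs, generate_clips), the assumption that tts is on the PATH, and numbering chapter directories by position are all choices made for this example rather than details taken from read_bible.rb. Footnotes are left out because the layout above does not show them.

```ruby
require 'fileutils'

TTS_BIN = 'tts' # assumption: on PATH (the post calls ~/.local/bin/tts)

# Argument list for one clip, using the same flags as the command above.
def tts_command(text, out_path)
  [TTS_BIN, '--text', text,
   '--model_name', 'tts_models/multilingual/multi-dataset/xtts_v2',
   '--vocoder_name', 'vocoder_models/universal/libri-tts/wavegrad',
   '--language_idx', 'en',
   '--speaker_idx', 'Narelle Moon',
   '--out_path', out_path]
end

# Turns a parsed book hash into [text, wav_path] pairs in reading order.
def clip_jobs(book, dir)
  jobs = [[book['title'].join(' '), "#{dir}/title.wav"],
          [book['intro'].join(' '), "#{dir}/intro.wav"]]
  book['contents'].each_with_index do |chapter, index|
    ch_dir = "#{dir}/ch_#{index + 1}" # assumption: chapters are in order
    jobs << [chapter['title'].join(' '), "#{ch_dir}/title.wav"]
    jobs << [chapter['intro'].join(' '), "#{ch_dir}/intro.wav"]
    chapter['contents'].each do |number, lines|
      jobs << [lines.join(' '), "#{ch_dir}/verse_#{number}.wav"]
    end
  end
  jobs.reject { |text, _path| text.empty? }
end

# Runs TTS for every job, then converts each clip to mp3 with ffmpeg.
def generate_clips(book, dir)
  clip_jobs(book, dir).each do |text, wav|
    FileUtils.mkdir_p(File.dirname(wav))
    system(*tts_command(text, wav)) or raise "tts failed for #{wav}"
    system('ffmpeg', '-i', wav, '-codec:a', 'libmp3lame', '-b:a', '320k',
           wav.sub(/\.wav\z/, '.mp3'))
  end
end
```

Passing the arguments to system as a list avoids shell quoting problems with verses that contain quotes or colons.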

Additionally, the script also converts the output .wav into an .mp3 file with ffmpeg using the following command:


ffmpeg -i "genesis/ch_1/intro.wav" -codec:a libmp3lame -b:a 320k "genesis/ch_1/intro.mp3"

The complete code example can be found on GitHub and can be run using the following command:


ruby read_bible.rb

Playback

With the audio files generated, the process of listening to a book needs to be considered. Given the large number of audio files generated, one needs to listen to the files in the correct order. Thankfully, this task can easily be automated through the use of an audio playlist (m3u) file. Another script can generate the correctly ordered playlist by iterating through the chapters of the book. An m3u file is quite easy to generate and fairly widely supported. The following is a sample of a playlist for the book of Genesis.


#EXTM3U
#EXTINF:2,THE BOOK OF GENESIS
genesis/title.mp3
#EXTINF:2,Introduction
genesis/intro.mp3
#EXTINF:2,Genesis Chapter 1
genesis/ch_1/title.mp3
#EXTINF:2,Genesis Chapter 1 -- Intro
genesis/ch_1/intro.mp3
#EXTINF:2,THE BOOK OF GENESIS 1:1
genesis/ch_1/verse_1.mp3
#EXTINF:2,THE BOOK OF GENESIS 1:2
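
A playlist in this format can be produced with very little code. The sketch below assumes each entry is a hash with :title, :path and :seconds keys; those names, and the helper name build_m3u, are made up for this example.

```ruby
# entries: array of hashes with :title, :path and :seconds
# (hypothetical key names chosen for this sketch).
def build_m3u(entries)
  lines = ['#EXTM3U']
  entries.each do |entry|
    # EXTINF carries the duration in whole seconds, then the display title.
    lines << "#EXTINF:#{entry[:seconds].round},#{entry[:title]}"
    lines << entry[:path]
  end
  lines.join("\n") + "\n"
end

entries = [
  { title: 'THE BOOK OF GENESIS', path: 'genesis/title.mp3', seconds: 2.14 },
  { title: 'Introduction', path: 'genesis/intro.mp3', seconds: 26.71 }
]
puts build_m3u(entries)
```

The entries would be collected in reading order while iterating over the parsed book, so the playlist order follows the text.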

Likewise, once a playlist is created per book of the Bible, another playlist can be created to link to each book. There are, however, two flaws with the playlist solution:

  1. The paths are relative, so any change in the paths will break the playlist.
  2. The book of Genesis alone consists of a large number (1683) of audio files.

The ideal solution would be a single file that contained a playlist like structure internally, effectively reducing the file footprint while maintaining the ease of navigation. However, despite several attempts, no ideal solution was found. Below is a summary of the various methods that were attempted.

  1. Add all the audio files to an mka (audio variant of mkv) and use chapters for navigation. This technically did work; however, a common media player, VLC, didn't display that number of chapters properly and thus it was not fully usable (the VLC chapters popup covers the entire screen and is cut off).
  2. Add all the audio files to an m4a (audio variant of mp4) and use chapters for navigation. However, given the limitations of chapters in media players, it's doubtful a different file format would yield any better results with the same number of chapters.
  3. Produce a single mka file per chapter of a book and use a playlist (m3u) file to link the chapters together. This reduced the number of files for Genesis from 1683 to 52. While far from ideal, this did solve the issue of too many files and too many chapters within a single file.

TTS Audio Errors

Sampling through the generated audio, one will quickly find issues. As the system is non-deterministic, the output audio for a given text using the same input parameters can vary greatly. One potential solution to invalid audio output is to simply re-generate it in the hope that a better result is produced. Before describing a solution to this problem it is worthwhile to highlight the various types of audio generation errors that were found.

  1. Mispronunciations: These are quite difficult to catch, especially if only the intonation is incorrect. An example of a mispronunciation is Genesis 1:3, "And God said: Be light made. And light was made.". The audio recording sounds closer to "And God said: Be light may. And light was made.".

    Genesis 1:3

  2. Early Clipping: Very early on in the research, the TTS system would sometimes cut off before the audio was finished. For example, when reading Genesis 1:2, "And the earth was void and empty, and darkness was upon the face of the deep; and the spirit of God moved over the waters.", the last word, "waters", would sometimes be cut off to simply "wat".

    Genesis 1:2
  3. Inserting Sounds: Sometimes the generated audio will also contain extra sound before, during, or after the input text. An example is the Genesis Chapter 21 Introduction, which is supposed to be "Isaac is born. Agar and Ismael are cast forth." but can be heard as "Isaac is born. Al-Wasi-Winnie-Rouds, Agar and Ismael are cast forth.".

    Genesis Chapter 21 Intro

    Additionally, for Genesis 32:9 extra sound can be heard after the reading of the text is finished. The additional sound is quite eerie: movement, followed by highly distorted moans or chants.

    Genesis 32:9
  4. Differing Accents: Sometimes the accent of the generated audio will differ. For example, Isaias 25:1 was generated once with more of a British accent, whereas another generation of the same verse resulted in the typical accent of the speaker. While far from ideal, there is little that one could do to resolve this rarely encountered issue besides fine-tuning the model.

    British accent Isaias 25:1
    Isaias 25:1

While a manual review is possible, with a total duration of 3:50:38 for just Genesis it isn't very practical. The total run time of the Bible would be far longer. Another possible solution would be to employ another model to validate the generated audio. One such model is Whisper, which can transcribe an audio clip. First, one can install Whisper using the following:


pip install -U openai-whisper

Whisper can then be used to transcribe audio from the audio clip genesis/intro.wav using the following command:


~/.local/bin/whisper genesis/intro.wav \
            --language English \
            --output_dir genesis/ \
            --output_format json

The above command will then output the following JSON file:


{
  "text": " This book is so called from its treating of the generation, that is, of the creation and the beginning of the world. The Hebrews call it Buresa, from the word with which it begins. It contains not only the history of the creation of the world, but also an account of its progress during the space of 2,369 years, that is, until the death of Joseph.",
  "segments":
  [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 6.92,
      "text": " This book is so called from its treating of the generation, that is, of the creation and",
      "tokens":
      [
        50364,
        639,
        1446,
        307,
        370,
        1219,
        490,
        1080,
        15083,
        295,
        264,
        5125,
        11,
        300,
        307,
        11,
        295,
        264,
        8016,
        293,
        50710
      ],
      "temperature": 0.0,
      "avg_logprob": -0.1698850589794117,
      "compression_ratio": 1.6650717703349283,
      "no_speech_prob": 0.00031432180549018085
    },
    {
      "id": 1,
      "seek": 0,
      "start": 6.92,
      "end": 9.120000000000001,
      "text": " the beginning of the world.",
      "tokens":
      [
        50710,
        264,
        2863,
        295,
        264,
        1002,
        13,
        50820
      ],
      "temperature": 0.0,
      "avg_logprob": -0.1698850589794117,
      "compression_ratio": 1.6650717703349283,
      "no_speech_prob": 0.00031432180549018085
    },
    {
      "id": 2,
      "seek": 0,
      "start": 9.120000000000001,
      "end": 14.200000000000001,
      "text": " The Hebrews call it Buresa, from the word with which it begins.",
      "tokens":
      [
        50820,
        440,
        44604,
        818,
        309,
        363,
        1303,
        64,
        11,
        490,
        264,
        1349,
        365,
        597,
        309,
        7338,
        13,
        51074
      ],
      "temperature": 0.0,
      "avg_logprob": -0.1698850589794117,
      "compression_ratio": 1.6650717703349283,
      "no_speech_prob": 0.00031432180549018085
    },
    {
      "id": 3,
      "seek": 0,
      "start": 14.200000000000001,
      "end": 19.16,
      "text": " It contains not only the history of the creation of the world, but also an account of its",
      "tokens":
      [
        51074,
        467,
        8306,
        406,
        787,
        264,
        2503,
        295,
        264,
        8016,
        295,
        264,
        1002,
        11,
        457,
        611,
        364,
        2696,
        295,
        1080,
        51322
      ],
      "temperature": 0.0,
      "avg_logprob": -0.1698850589794117,
      "compression_ratio": 1.6650717703349283,
      "no_speech_prob": 0.00031432180549018085
    },
    {
      "id": 4,
      "seek": 0,
      "start": 19.16,
      "end": 26.04,
      "text": " progress during the space of 2,369 years, that is, until the death of Joseph.",
      "tokens":
      [
        51322,
        4205,
        1830,
        264,
        1901,
        295,
        568,
        11,
        11309,
        24,
        924,
        11,
        300,
        307,
        11,
        1826,
        264,
        2966,
        295,
        11170,
        13,
        51666
      ],
      "temperature": 0.0,
      "avg_logprob": -0.1698850589794117,
      "compression_ratio": 1.6650717703349283,
      "no_speech_prob": 0.00031432180549018085
    }
  ],
  "language": "English"
}

This JSON file provides both the transcription of the audio (text of the audio) and also some timing information related to the audio file. Therefore, a script is used to systematically pass audio clips to Whisper to produce a transcription file. The output from Whisper can then be compared with the input text used to generate the audio clip originally to judge the quality of the audio clip.
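
A sketch of that step is shown below: one helper shells out to Whisper and another reduces the JSON to the two pieces used for validation. The helper names are made up, whisper is assumed to be on the PATH, and the output file is assumed to be named after the input file's base name.

```ruby
require 'json'

# Runs Whisper on one clip and returns the raw JSON text.
# Assumptions: whisper is on PATH and writes <basename>.json to out_dir.
def transcribe(wav_path, out_dir)
  ok = system('whisper', wav_path, '--language', 'English',
              '--output_dir', out_dir, '--output_format', 'json')
  raise "whisper failed for #{wav_path}" unless ok

  File.read(File.join(out_dir, "#{File.basename(wav_path, '.*')}.json"))
end

# Keeps only the transcript text and the start/end time of each segment.
def transcript_summary(json_text)
  data = JSON.parse(json_text)
  {
    'text' => data['text'].strip,
    'timings' => data['segments'].map do |segment|
      { 'start' => segment['start'], 'end' => segment['end'] }
    end
  }
end

# Usage might look like:
#   summary = transcript_summary(transcribe('genesis/intro.wav', 'genesis/'))
```

Dropping the token lists and probabilities keeps the stored verification data small.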

TTS Validation with Whisper

Whisper is very powerful at transcribing the audio files. It even proves quite effective at recovering the original text punctuation. However, despite the reasonable performance, punctuation remains one of the largest stumbling blocks. Often Whisper will either miss some punctuation or use the wrong mark (e.g. a comma instead of a semi-colon). It is also possible that the quality of the TTS audio itself has an impact on the ability of Whisper to retrieve the punctuation correctly.

As noted above, the ideal use-case is to iterate through the audio clips of the book and have Whisper transcribe them. The transcript would then be compared back to the raw input text. Any difference would indicate that the TTS generated audio clip may have an error. Naturally, the specific implementation requires a few special considerations due to limitations in Whisper and a special class of errors for TTS generation.

A few examples of the failure of Whisper to correctly transcribe the punctuation for a specific verse are below:

  1. Genesis 32:9, raw text: "And Jacob said: O God of my father Abraham, and God of my father Isaac: O Lord who saidst to me, Return to thy land, and to the place of thy birth, and I will do well for thee."

    Genesis 32:9
    Raw Text | Whisper
    said:    | said,
    Isaac:   | Isaac,
    Return   | return

Normalization of the input text can be used to reduce the number of false positives. The normalization process would be as follows:


def normalize_text(text)
    text.gsub(/[[:punct:]]/, '').downcase.strip
end
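
As a quick check of what this buys, the sketch below applies the same function to the opening of Genesis 32:9 as it appears in the raw text and in the Whisper transcript. The two variable names are only for illustration.

```ruby
def normalize_text(text)
  # Drop punctuation, lower-case, and trim surrounding whitespace.
  text.gsub(/[[:punct:]]/, '').downcase.strip
end

raw     = 'And Jacob said: O God of my father Abraham, and God of my father Isaac:'
whisper = ' And Jacob said, O God of my father Abraham, and God of my father Isaac,'

# The raw strings differ only in punctuation and leading whitespace...
puts raw == whisper
# ...so after normalization they compare as equal.
puts normalize_text(raw) == normalize_text(whisper)
```

Case differences such as "Return" versus "return" are absorbed in the same way.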

Another source of issues is the presence of non-English words in the source text. For example, the word BERESITH is present in the introduction to Genesis, and Whisper transcribes the audio as Buresa. This is a difficult use-case: an English TTS model is being asked to speak a word from another language, and Whisper, a speech to text model set to transcribe English audio into English text, is being asked to transcribe it. Bottom line, it's a bit unrealistic to expect either model to work particularly well on non-English words.

Along the same lines, archaic English words that are uncommon in modern times also cause issues. In these cases Whisper generates false positives where the generated audio is suitable but Whisper generates incorrect transcripts. The following table outlines a few examples:

Archaic Word Mappings
Raw Text     | Whisper
createth     | created
saidst       | sades
commandeth   | commanded
interpreteth | interprets
declareth    | declares
changeth     | changed
dieth        | diath
visiteth     | visited
adopteth     | adopted
blesseth     | blessed
marrieth     | marieth
promiseth    | promises

In most cases Whisper substitutes a modern ending such as "-ed" or "-s" for the archaic "-eth", among other similar transformations. While these transcriptions may be passable for the purposes of transcription, they fall short as a means of verifying an audio file for correctness. Ideally, if the input text reads "commandeth" then the audio should be of that word and not "commanded". In most cases the TTS did in fact read "commandeth" as such; however, Whisper would fail to correctly transcribe the correctly read archaic word.

The final error involves the addition of sounds to the audio clip. This comes in two forms: first, English-like sounds, and second, non-English-like sounds. English-like sounds include mispronunciations and added words or sounds that Whisper is able to transcribe.

Genesis 1:3
  Raw Text:              And God said: Be light made. And light was made.
  Manual Transcription:  And God said: Be light may. And light was may.
  Whisper Transcription: And God said: Be light may. And light was made.

Genesis Chapter 21 Intro
  Raw Text:              Isaac is born. Agar and Ismael are cast forth.
  Manual Transcription:  Isaac is born. Al wairasigwarous, Agar and Ismael are cast forth.
  Whisper Transcription: Isaac is born. Al-Wasi-Winnie-Rouds, Agar and Ismael are cast forth.

The alternative is when non-English-like sounds are added, that is, sounds that are not present in the Whisper transcript. An example of this is Genesis 32:9 (the additional sounds run from 15s to 25s).

Genesis 32:9
  Raw Text: And Jacob said: O God of my father Abraham, and God of my father Isaac: O Lord who saidst to me, Return to thy land, and to the place of thy birth, and I will do well for thee.
  Whisper:  And Jacob said, O God of my father Abraham, and God of my father Isaac, O Lord, who sades to me, return to thy land, and to the place of thy birth, and I will do well for thee.

Once normalized the output is quite similar beyond a failure of Whisper to properly transcribe saidst. Instead it uses sades. Ignoring this minor error, if one listens to the audio clip they will hear the speech stops at around 15s but the clip continues for an additional 10s. The remaining 10s contains low moans or chants that get louder near the end.

This audio is not captured in the Whisper transcript, and if the comparison between the input text and the transcript were the only validation, this obviously erroneous clip would be considered valid. However, Whisper additionally provides timing information. In the JSON output each segment of the transcript is available. Each segment has a start, end, text, and several other fields. For this validation only the start and end times are important. In this case the final segment's end time is 14.840 seconds (s). Comparing this to the clip's total runtime of 25s gives a difference of around 10s. A simple comparison of these two values with a threshold yields a quick method to identify whether a clip contains extra sounds not picked up by Whisper.


if end_time + threshold < clip_length
    # Handle invalid clip
end

Of course, if extra sound could be added at the end of an audio clip, it is also possible it could be added before or at any point during the audio clip. Therefore, a more general method to identify inserted audio can be done using the following:


def valid_clip?(text, segments, clip_length, threshold)
    return false if segments.empty?

    previous_end_time = 0
    segments.each do |segment|
        if segment['start'] - threshold >= previous_end_time
            # Handle invalid preceding audio
            return false
        end

        previous_end_time = segment['end']
    end

    if segments[-1]['end'] + threshold <= clip_length
        # Handle invalid clip
        return false
    end
    return true
end
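
When reviewing flagged clips it can help to know where the silent or untranscribed stretch sits, not just that one exists. The variant below is a sketch along those lines: instead of a boolean it returns the list of gaps. The name timing_gaps is made up, and the single segment in the example is hypothetical apart from its end time of 14.84s, which is the only value given for Genesis 32:9.

```ruby
# Returns [from, to] pairs for every stretch of at least `threshold`
# seconds that no Whisper segment covers. timing_gaps is a hypothetical name.
def timing_gaps(segments, clip_length, threshold)
  gaps = []
  previous_end = 0.0
  segments.each do |segment|
    if segment['start'] - previous_end >= threshold
      gaps << [previous_end, segment['start']]
    end
    previous_end = segment['end']
  end
  # Anything left between the last segment and the end of the clip.
  gaps << [previous_end, clip_length] if clip_length - previous_end >= threshold
  gaps
end

# Genesis 32:9: speech ends at 14.84s but the clip runs for about 25s.
segments = [{ 'start' => 0.0, 'end' => 14.84 }]
p timing_gaps(segments, 25.0, 3.0)
```

An empty result means no gap reached the threshold, which corresponds to a clip without a timing issue.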

This verification process can be performed using the script verify_book_audio.rb, which can be run using the following command:


ruby verify_book_audio.rb

This script will iterate through the book's structure. For each audio clip a Whisper transcript will be generated. The transcript will then be compared to the original text. An entry per audio clip is then added to an array. These entries contain the path to the audio clip, the normalized raw text, the normalized Whisper transcript, whether a timing issue was detected, whether a text match issue was detected, the total clip length, and finally the Whisper transcript timing information. For detecting a timing issue the above code is used, whereas for detecting a matching issue a simple string comparison between the normalized raw text and the normalized Whisper transcript is used. Below is a sample of the verification details dumped to JSON.


[
  {
    "path": "genesis/title.wav",
    "raw_text_nom": "the book of genesis",
    "whisper_nom": "the book of genesis",
    "timing_issue": false,
    "total_clip_length": 2.14,
    "text_match": true,
    "timings":
    [
      {
        "start": 0.0,
        "end": 1.56
      }
    ]
  },
  {
    "path": "genesis/intro.wav",
    "raw_text_nom": "this book is so called from its treating of the generation that is of the creation and the beginning of the world the hebrews call it beresith from the word with which it begins it contains not only the history of the creation of the world but also an account of its progress during the space of 2369 years that is until the death of joseph",
    "whisper_nom": "this book is so called from its treating of the generation that is of the creation and the beginning of the world the hebrews call it buresa from the word with which it begins it contains not only the history of the creation of the world but also an account of its progress during the space of 2369 years that is until the death of joseph",
    "timing_issue": false,
    "total_clip_length": 26.71,
    "text_match": false,
    "timings":
    [
      {
        "start": 0.0,
        "end": 6.92
      },
      {
        "start": 6.92,
        "end": 9.120000000000001
      },
      {
        "start": 9.120000000000001,
        "end": 14.200000000000001
      },
      {
        "start": 14.200000000000001,
        "end": 19.16
      },
      {
        "start": 19.16,
        "end": 26.04
      }
    ]
  },
  {
    "path": "genesis/ch_1/title.wav",
    "raw_text_nom": "genesis chapter 1",
    "whisper_nom": "genesis chapter 1",
    "timing_issue": false,
    "total_clip_length": 2.93,
    "text_match": true,
    "timings":
    [
      {
        "start": 0.0,
        "end": 2.32
      }
    ]
  }
]

Using the above script we can generate a failure rate of the TTS audio clips for each book. For Genesis the following table shows the failure rate. The performance overall is fairly poor with a 51.84% failure rate. The timing errors are quite infrequent, accounting for only 0.12% of all clips (using a timing error threshold of 3s). It is important to note that Matching Errors and Timing Errors are not mutually exclusive. For example, Genesis 32:9 is recorded as both a matching error and a timing issue, whereas Genesis 48:15 is recorded as only a timing issue. This is because the timing check simply tests whether there is potentially non-English audio exceeding the minimum threshold; a clip that fails it may or may not also have a Whisper transcript that differs from the raw text.

Category                Error Count   Total Count   Error Rate
Matching Errors         845           1632          51.78%
Timing Errors           2             1632          0.12%
Total Errors            846           1632          51.84%
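
As a minimal sketch, counts like these could be derived from a list of validation records shaped like the JSON output shown earlier. It assumes each record carries the `text_match` and `timing_issue` flags; the function name `summarize` is hypothetical and not taken from the original script. Because a clip can fail both checks, the total counts each flagged clip once, which is why 845 matching errors and 2 timing errors add up to 846 rather than 847.

```python
def summarize(records):
    """Count matching and timing errors in a list of validation records."""
    total = len(records)
    matching = sum(1 for r in records if not r["text_match"])
    timing = sum(1 for r in records if r["timing_issue"])
    # A clip that fails both checks is counted only once in the overall total.
    flagged = sum(1 for r in records if r["timing_issue"] or not r["text_match"])

    def rate(count):
        # Percentage of all clips, rounded to two decimals.
        return round(100 * count / total, 2) if total else 0.0

    return {
        "matching": (matching, rate(matching)),
        "timing": (timing, rate(timing)),
        "total": (flagged, rate(flagged)),
    }
```

Loading the report with `json.load` and passing the resulting list to `summarize` would be one way to use it.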

Furthermore, if the errors are categorized by section we can identify which sections tend to cause more problems. As outlined below, most of the errors come from the verses (a combined total error rate of 49.14%). However, since the verses constitute the majority of audio clips (93.75%), it is fairly reasonable to expect more errors there. Looking at the error rate within each section's population, we find that chapter intros yield a high rate of 70%. It is difficult to make definitive statements about this sample since the populations are quite small. Similarly, there is only 1 title and 1 intro, so an error rate based on a single item is fairly useless.

Section                         Error Count   Total Count   Section Error Rate   Total Error Rate
Title Matching Errors           0             1             0%                   0%
Title Timing Errors             0             1             0%                   0%
Intro Matching Errors           1             1             100%                 0.06%
Intro Timing Errors             0             1             0%                   0%
Chapter Title Matching Errors   9             50            18%                  0.55%
Chapter Title Timing Errors     0             50            0%                   0%
Chapter Intro Matching Errors   35            50            70%                  2.14%
Chapter Intro Timing Errors     0             50            0%                   0%
Verse Matching Errors           800           1530          52.29%               49.02%
Verse Timing Errors             2             1530          0.13%                0.12%
Total Errors                    846           1632          -                    51.84%

One important note, however, is that this includes many of the false positives discussed above (e.g. old English words and foreign words). While accounting for these errors is difficult, a best effort can be made by maintaining a list of words that are often mistranscribed. If one of these words is present in the raw text, the error checker should either ignore it or perform some secondary check on it.
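
One way such a word list might be applied is sketched below. This is an illustration under stated assumptions rather than the checker used for this project: the `KNOWN_HARD_WORDS` set and the `matches_ignoring` function are hypothetical names, and the single entry is taken from the Genesis introduction above, where beresith was transcribed as buresa. The sketch aligns the two word sequences with Python's `difflib.SequenceMatcher` and forgives only one-for-one substitutions where the raw word is on the list.

```python
from difflib import SequenceMatcher

# Hypothetical list of words that Whisper often mistranscribes.
KNOWN_HARD_WORDS = {"beresith"}


def matches_ignoring(raw, transcript, hard_words=KNOWN_HARD_WORDS):
    """Return True if the texts match once listed hard words are forgiven."""
    raw_tokens = raw.split()
    heard_tokens = transcript.split()
    matcher = SequenceMatcher(a=raw_tokens, b=heard_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Forgive a word-for-word substitution only when every raw word
        # involved is on the list of known hard words.
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "replace" and same_length and all(
            word in hard_words for word in raw_tokens[i1:i2]
        ):
            continue
        # Any other insertion, deletion, or substitution is a real mismatch.
        return False
    return True
```

The trade-off is that a genuine audio error on a listed word would be hidden, which is why a secondary check may be preferable to ignoring the word outright.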

In order to determine the true rate of errors, a manual review of Chapter 1 was performed. Below is the breakdown of the errors for Chapter 1. Out of the 7 potential errors identified by Whisper, only 1 (Genesis 1:3) was a true positive. The remaining 6 detected errors were false positives, meaning the false positive rate was 85.71% for this sample. As for false negatives, a thorough manual review of the audio for Chapter 1 found no errors beyond those already flagged. Therefore, while the system is quite prone to capturing false positives, no errors were missed in this sample.

Section                         Error Count   Total Count   Section Error Rate   Total Error Rate
Title Matching Errors           0             1             0%                   0%
Title Timing Errors             0             1             0%                   0%
Intro Matching Errors           1             1             100%                 2.86%
Intro Timing Errors             0             1             0%                   0%
Chapter Title Matching Errors   0             1             0%                   0%
Chapter Title Timing Errors     0             1             0%                   0%
Chapter Intro Matching Errors   1             1             100%                 2.86%
Chapter Intro Timing Errors     0             1             0%                   0%
Verse Matching Errors           5             31            16.13%               14.29%
Verse Timing Errors             0             31            0%                   0%
True Positive Errors            1             7             14.29%               2.86%
Total Matching Errors           7             35            -                    20%
Total Timing Errors             0             35            -                    0%
Total Errors                    7             35            -                    20%

Generally, the false positives reported by Whisper were due to errors in the transcription. These false positives, however, are very difficult to detect, as we wish to detect semantic changes to the text rather than simple syntax changes. For example, for Genesis 1:15, Whisper generates everything instead of every thing. This naturally fails a simple string comparison. However, a missing space between the two words is fairly inconsequential and therefore not a true error. For other errors, such as Genesis 1:8 or 1:10, Whisper incorrectly transcribes valid audio.

One possible solution is to use a string distance comparison (such as Levenshtein distance) as a means to reduce false positives. For example, if one considers Genesis 1:15, the distance between everything and every thing is 1. Therefore, a low distance (below some threshold) between strings could identify false positives. However, for a true positive such as Genesis 1:3, the distance between the raw text and the transcription is only 2 ("de" vs "y"). A threshold distance of 1 may identify some false positives, but it wouldn't capture most of them, and a higher threshold threatens to introduce new false negatives.
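
The distance comparison described above can be sketched as follows. This is a plain dynamic-programming implementation of Levenshtein distance written for illustration; the helper `likely_false_positive` and its default threshold of 1 are assumptions, not part of the existing validation script.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def likely_false_positive(raw, transcript, threshold=1):
    """Treat a mismatch as a probable false positive if the distance is small."""
    return levenshtein(raw, transcript) <= threshold


print(levenshtein("everything", "every thing"))  # 1
print(levenshtein("de", "y"))                    # 2
```

With a threshold of 1 the Genesis 1:15 mismatch would be forgiven and the Genesis 1:3 error would still be flagged, but as noted above the margin between the two is very thin.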

To further explore Whisper's tendency to produce false positives, a manual review of just the Chapter introductions for Genesis was performed, since this section had a relatively high error rate and a medium-sized population. The rate of false positives was quite high, with 28 of the 35 Whisper-identified errors being errors in the transcription. Most of these involved incorrect transcription of the names of people or places, old-English conversions (covered in the Archaic Word Mappings table above), and substitution of similar words.

Category                Error Count   Total Count   Error Rate
Matching Errors         35            50            70%
Timing Errors           0             50            0%
True Positive Errors    7             35            20%
Total Errors            35            50            70%

The following is a list of the true positives found during manual review of the Chapter introductions for Genesis.

  1. Genesis 5 Intro — The final word henoch has a tar sound appended to the end which Whisper transcribes as hinokhtar.
  2. Genesis 21 Intro — An inserted sound is added in the middle of the text, which Whisper transcribes as alwasiwinnierouds. Likewise, the name agar is incorrectly transcribed as aghar.
  3. Genesis 25 Intro — An insertion of the audio doi prior to esau. Similarly, several names are incorrectly transcribed: cetura as satura and ismael as ishmael. Finally, Whisper also transcribes selleth as zealoth.
  4. Genesis 27 Intro — The original text is obtaineth; however, the audio sounds like obdineth, which is also how Whisper transcribes it.
  5. Genesis 29 Intro — The name lia is pronounced in 2 different ways in the same clip: first as Lia (with a hard I) and later as leah (with a soft e/i). This is certainly a borderline case; however, consistent pronunciation of names is clearly a criterion for a better audio book.
  6. Genesis 42 Intro — The addition of nonsense English sounds at the end of the clip.
  7. Genesis 44 Intro — The addition of dot dot at the end of the clip.

The manual review of both Genesis Chapter 1 and the Genesis Chapter introductions revealed the flaws in the naive validation system using Whisper. Comparing the raw text to the Whisper transcriptions successfully detected all audio clip errors in the reviewed samples. However, this comes at the price of a high false positive rate: 85.71% for Genesis Chapter 1 and 80% for the Genesis Chapter introductions. If the false positive rate of 80% were to generalize to the whole book of Genesis, that would mean 640 of the 800 detected verse errors (using Whisper's transcription) were false positives.

With regard to the timing issues, only 2 errors were recorded, one of which was also picked up as a matching issue. Upon manual review, both errors were confirmed to be valid, as both clips contained non-English audio. For Genesis 32:9 it was at the end of the clip, whereas for Genesis 48:15 it was in the middle of the clip. Despite the low number of occurrences, the timing check for Genesis 48:15 captured an error not identified by Whisper's transcription, which suggests it is a valuable means of detecting errors.

With so few occurrences, it would be difficult to state definitively whether the timing method for detecting errors is as prone to false positives as the transcription matching test. However, given the nature of this method, it is far more likely that the false positive rate would be low. Since the objective of the clips is to transform text into speech, any place where no speech can be detected, beyond a normal grammatical pause, is likely an issue. Therefore, a threshold of 3s may miss some errors but will likely only capture true errors.
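
To make the idea concrete, below is a minimal sketch of a gap-based timing check. It assumes the check works on the `timings` and `total_clip_length` fields shown in the JSON output earlier and flags any stretch without detected speech that is longer than the threshold; the function name `has_timing_issue` is hypothetical, and the script used for this project may compute the flag differently.

```python
def has_timing_issue(timings, total_clip_length, threshold=3.0):
    """Flag a clip containing a stretch longer than `threshold` seconds
    that is not covered by any transcribed segment."""
    covered_until = 0.0
    for segment in timings:
        # Gap between the end of the previous speech and this segment.
        if segment["start"] - covered_until > threshold:
            return True
        covered_until = max(covered_until, segment["end"])
    # Gap between the last segment and the end of the audio file.
    return total_clip_length - covered_until > threshold
```

For the Genesis introduction record above, the last segment ends at 26.04s in a 26.71s clip and every segment starts where the previous one ends, so this sketch would not flag it, in line with its `timing_issue` value of false.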

As for other books, below are the Whisper-detected errors for the next four books of the Bible: Exodus, Leviticus, Numbers, and Deuteronomy. The rates of potential errors were similar to Genesis, with Leviticus having the lowest rate of 34.46% and Numbers having the highest rate of 56.68%. Interestingly enough, Exodus, Leviticus, and Deuteronomy had potential error rates lower than Genesis, at 45.86%, 34.46%, and 42.18% respectively. Likewise, every book's introduction was identified as a matching error. Additionally, for each book besides Numbers, the section error rate for Chapter Introductions was higher than for the verses. For various reasons, the introduction for a book or chapter appears to be particularly challenging for either the TTS or Whisper.

Exodus

Section                         Error Count   Total Count   Section Error Rate   Total Error Rate
Title Matching Errors           0             1             0%                   0%
Title Timing Errors             0             1             0%                   0%
Intro Matching Errors           1             1             100%                 0.08%
Intro Timing Errors             0             1             0%                   0%
Chapter Title Matching Errors   9             40            22.5%                0.7%
Chapter Title Timing Errors     0             40            0%                   0%
Chapter Intro Matching Errors   19            40            47.5%                1.47%
Chapter Intro Timing Errors     0             40            0%                   0%
Verse Matching Errors           563           1211          46.49%               43.54%
Verse Timing Errors             1             1211          0.08%                0.08%
Total Match Errors              592           1293          -                    45.78%
Total Timing Errors             1             1293          -                    0.08%
Total Errors                    593           1293          -                    45.86%

Leviticus

Section                         Error Count   Total Count   Section Error Rate   Total Error Rate
Title Matching Errors           0             1             0%                   0%
Title Timing Errors             0             1             0%                   0%
Intro Matching Errors           1             1             100%                 0.11%
Intro Timing Errors             0             1             0%                   0%
Chapter Title Matching Errors   7             27            25.93%               0.77%
Chapter Title Timing Errors     0             27            0%                   0%
Chapter Intro Matching Errors   10            27            37.04%               1.09%
Chapter Intro Timing Errors     0             27            0%                   0%
Verse Matching Errors           297           858           34.62%               32.49%
Verse Timing Errors             0             858           0%                   0%
Total Match Errors              315           914           -                    34.46%
Total Timing Errors             0             914           -                    0%
Total Errors                    315           914           -                    34.46%

Numbers

Section                         Error Count   Total Count   Section Error Rate   Total Error Rate
Title Matching Errors           0             1             0%                   0%
Title Timing Errors             0             1             0%                   0%
Intro Matching Errors           1             1             100%                 0.07%
Intro Timing Errors             0             1             0%                   0%
Chapter Title Matching Errors   17            36            47.22%               1.25%
Chapter Title Timing Errors     0             36            0%                   0%
Chapter Intro Matching Errors   18            36            50%                  1.32%
Chapter Intro Timing Errors     0             36            0%                   0%
Verse Matching Errors           736           1288          57.14%               54.04%
Verse Timing Errors             2             1288          0.16%                0.15%
Total Match Errors              772           1362          -                    56.68%
Total Timing Errors             2             1362          -                    0.15%
Total Errors                    772           1362          -                    56.68%

Deuteronomy

Section                         Error Count   Total Count   Section Error Rate   Total Error Rate
Title Matching Errors           0             1             0%                   0%
Title Timing Errors             0             1             0%                   0%
Intro Matching Errors           1             1             100%                 0.1%
Intro Timing Errors             0             1             0%                   0%
Chapter Title Matching Errors   10            34            29.41%               0.97%
Chapter Title Timing Errors     0             34            0%                   0%
Chapter Intro Matching Errors   17            34            50%                  1.65%
Chapter Intro Timing Errors     0             34            0%                   0%
Verse Matching Errors           405           959           42.23%               39.36%
Verse Timing Errors             2             959           0.21%                0.19%
Total Match Errors              433           1029          -                    42.08%
Total Timing Errors             2             1029          -                    0.19%
Total Errors                    434           1029          -                    42.18%
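
The per-book totals can be combined into a single figure in more than one way. The short sketch below, using only the Total Errors rows from the five tables, computes both the unweighted mean of the per-book rates and the pooled rate over all clips. The names `BOOKS`, `mean_of_rates`, and `pooled_rate` are illustrative.

```python
# (total errors, total clips) per book, copied from the Total Errors rows above.
BOOKS = {
    "Genesis": (846, 1632),
    "Exodus": (593, 1293),
    "Leviticus": (315, 914),
    "Numbers": (772, 1362),
    "Deuteronomy": (434, 1029),
}


def mean_of_rates(books):
    """Average the per-book error rates, weighting every book equally."""
    rates = [errors / clips for errors, clips in books.values()]
    return 100 * sum(rates) / len(rates)


def pooled_rate(books):
    """Divide all errors by all clips, so longer books weigh more."""
    errors = sum(e for e, _ in books.values())
    clips = sum(c for _, c in books.values())
    return 100 * errors / clips


print(round(mean_of_rates(BOOKS), 2))  # 46.2
print(round(pooled_rate(BOOKS), 2))    # 47.51
```

The two figures differ because the longer books (Genesis and Numbers) also have the highest error rates, so it is worth stating which one a summary number refers to.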

Conclusion

The original goal of creating an audio book of a large book in an automated fashion turned out to be fairly challenging. While many off-the-shelf tools exist, they are not exactly tuned for the goal of generic text-to-speech. These tools, despite some flaws, proved capable of the task. At the time of writing, 5 of the 46 books of the Old Testament have been converted into audio files. The remaining process for creating the audio files is fully automated and merely requires sufficient computational time to complete.

In order to validate the approach of generating TTS, OpenAI's Whisper tool was used to transcribe the audio and extract the timing information. The transcript was then compared to the raw text to identify clips with issues. This validation identified a high potential error rate, averaging 46.2% across the first 5 books. In order to investigate this high rate of potential errors, a manual review was performed of Genesis Chapter 1 and each Genesis Chapter Introduction. Overall, the rate of false positives was found to be 80% or higher. Conversely, the false negative rate was found to be 0%.

A high error rate combined with a high false positive rate does not make the system unfeasible. However, these issues point to flaws in the validation method as a practical solution. If one were to not address the false positives at all, the processing time (which is already fairly long) could increase by roughly 1.5x due to the need to regenerate about half of the book. And of course, this assumes the errors occur randomly (and thus that a regeneration will fix them). In fact, it is possible that the errors represent text or audio that the TTS or Whisper finds particularly challenging.

One potential solution would be to employ a Large Language Model (LLM) to determine whether the original text and the Whisper transcript carry the same semantic meaning. This would potentially resolve all but the foreign word issues, which may be addressed using an LLM approach as well. Alternatively, one could compare the similarity of the pronunciation of the two transcripts. If the expected pronunciation is similar, the mismatch may be a false positive. For example, in the Genesis Chapter 9 Introduction, Noah (written as noe) is transcribed as know.
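
The pronunciation comparison could be prototyped with a phonetic key. The sketch below is a deliberately crude, made-up scheme for illustration only (it is not Soundex or Metaphone, and `crude_phonetic_key` is a hypothetical name): it drops silent leading letters such as the k in kn, keeps the first letter, and removes vowels and the weak consonants h, w, and y from the rest of the word. A real implementation would more likely rely on an established phonetic algorithm or a pronunciation dictionary.

```python
import re


def crude_phonetic_key(word):
    """Reduce a word to a rough sound key (illustrative scheme only)."""
    w = re.sub(r"[^a-z]", "", word.lower())
    # Drop silent leading letters in common English digraphs.
    w = re.sub(r"^(kn|gn|pn)", "n", w)
    w = re.sub(r"^wr", "r", w)
    if not w:
        return ""
    first, rest = w[0], w[1:]
    # Vowels and weak consonants after the first letter carry little signal.
    rest = re.sub(r"[aeiouyhw]", "", rest)
    # Collapse runs of the same letter.
    return re.sub(r"(.)\1+", r"\1", first + rest)


def sounds_alike(raw_word, heard_word):
    """Treat two words as similar in sound if their crude keys match."""
    return crude_phonetic_key(raw_word) == crude_phonetic_key(heard_word)
```

Under this scheme noe and know reduce to the same key, while obtaineth and obdineth (a true positive from the Genesis 27 introduction) do not. A scheme this coarse will also merge many words that sound different, so it would only make sense as a secondary check on clips that already failed the exact comparison.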

As the generation of the audio book is still ongoing, a future update is likely, if only to provide the complete audio book version of the Bible. You can review the initial version containing the first 5 books here. Finally, an exploration of potential solutions to the high false positive rate may be the topic of a future post. We welcome your thoughts on validation issues or potential solutions. Additionally, any general suggestions or feedback you may have are appreciated. Please discuss on Hacker News, send us your thoughts, or join the discussion below.