Generating an Audio Book Using Text-To-Speech Part 2
Published on: 2024-07-08
This is a follow-up to a previous post, which covered the creation of an audio book for the Douay Rheims Bible using TTS. In this blog post we cover the following topics:
- TTS Audio Book
- TTS Generation Improvements
- Cloud TTS (Google Cloud and AWS Polly)
- Addressing the Audio Book File Size and Format
- TTS Audio Validation with Spacy
TTS Audio Book
The complete raw audio book for the Douay Rheims Bible can be found on Hugging Face. The audio clips are compressed in a multi-part 7z file due to restrictions of Hugging Face and git-lfs. Simply download the two compressed files and extract them together into the same folder. The extraction may take a long time. This copy contains the original 160k bit-rate audio clips as generated, which take just under 6GB. A second version, which is the product of the formatting and optimizations discussed in the Addressing the Audio Book File Size and Format section, is also available on Hugging Face and takes just under 1GB of space. Each of the audio books on Hugging Face is compressed. However, a sample folder is available which contains a few samples of the audio generated by the TTS system. For example, for the local TTS generation you can find the samples here.
One of the core reasons for this follow-up post was to provide a full audio book as promised. However, the initial books referenced in the previous post were re-generated using the new Python script outlined in the section below due to an issue discovered in the Ruby script. During the initial prototyping of the script, an issue with the TTS cutting off early was discovered. To mitigate this, all text was appended with a period and space (e.g. `text + ". "`). While this did appear to work for some initial examples, it also seems to have caused the TTS to more often generate nonsense after finishing the main input text. This issue of generating nonsense audio was discovered and even linked to the addition of the period and space. However, an older version of the Ruby script was mistakenly used for the generation, and thus the output contained a higher rate of audio clips with nonsense audio.
The issue has since been fixed in both the Ruby and Python scripts. However, due to the performance reasons outlined in the next section, the Python script is the ideal one to use. Since adding the additional period and space caused unpredictable generation issues, the previously generated audio was scrapped and re-generated from the beginning. Overall, the quality generated with the updated script is better and far less prone to errors.
TTS Generation Improvements
Ruby was the language of choice for parsing and feeding the data to the TTS model for generating audio. However, integration with TTS and Whisper was done by calling shell commands. This caused a noticeable slow-down when generating 1000s of audio clips or transcripts. The primary slowdown was the constant re-loading of the model, as both TTS and Whisper each load a modestly sized model on initialization. Ideally, the model would be loaded only once on startup and kept in memory until the entire generation process was finished. This can be easily achieved by simply converting the script to Python, which has a convenient programmatic interface to each library.
As an example, let us focus on coqui-ai TTS. First set up a virtual environment and then install the necessary libraries through pip (note the developers recommend Python >= 3.9, < 3.12):
virtualenv venv
source venv/bin/activate
pip install TTS
Once TTS is installed, the following example can be used to generate a simple audio clip. It will produce an output file `hello.wav` that should say "Hello World!". Note this will download the TTS model if you haven't already downloaded it.
import torch
from TTS.api import TTS

# Use the GPU if available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = "tts_models/multilingual/multi-dataset/xtts_v2"

# The model is loaded once here and kept in memory for subsequent calls
tts = TTS(model).to(device)
tts.tts_to_file(text="Hello world!",
                speaker="Craig Gutsy",
                language="en",
                file_path="hello.wav")
One can also install Whisper using the following:
virtualenv venv
source venv/bin/activate
pip install openai-whisper
Whisper can then be used to transcribe the previously generated `hello.wav` with the following code:
import torch
import whisper
from whisper.utils import get_writer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device)

result = model.transcribe("hello.wav", language="en")

# Write the transcript as JSON to the current directory
writer = get_writer("json", "./")  # update the output dir as needed
writer(result, "hello.wav", None)
The output will be stored in `hello.json` and will contain the following JSON:
{
"text": " Hello World.",
"segments":
[
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 1.0,
"text": " Hello World.",
"tokens":
[
50364,
2425,
3937,
13,
50414
],
"temperature": 0.0,
"avg_logprob": -0.6567389965057373,
"compression_ratio": 0.6,
"no_speech_prob": 0.061070751398801804
}
],
"language": "en"
}
The full python version of the TTS generation script can be found here. Likewise the whisper transcription script can be found here.
Cloud TTS
The above examples presume one has access to a local GPU or is willing to run on the CPU. One alternative is to use the cloud. One could simply take the existing scripts and run them on your provider of choice. Alternatively, both Google Cloud and AWS provide a TTS API which accepts text and produces audio files. Likewise both, at the time of writing, provide fairly generous free tiers of 1-5 million characters per month.
Amazon Polly has a free tier for the first 12 months that offers 5 million characters / month, whereas Google Cloud TTS provides 1 million characters / month free.
For reference, if one wished to generate audio for this version of the Bible, it would take 4,761,370 characters. You can calculate that using the following Ruby script:
require 'json'
require_relative 'book_contents.rb'

in_path = 'data/douay_rheims_'
sum = 0
BIBLE_STRUCTURE.each do |book|
  infile = "#{in_path}#{book[:name]}.json"
  data = JSON.load(File.open(infile))
  # Gather the title, introduction, and chapter contents of each book
  texts = data["title"]
  texts += data["intro"]
  texts += data["contents"].map do |item|
    item["title"] + item["intro"] + item["contents"].map do |k, subitem|
      subitem
    end.flatten
  end.flatten
  texts.each do |text|
    sum += text.size
  end
end
puts "Sum: #{sum}"
Given the Bible is around 4.7 million characters, one can finish either all or a substantial portion of the Bible for no cost using either service. Even for GC, one could simply limit the generation to 1 million characters per month.
To use either of the services you'll first need to sign up and create the necessary credentials. For AWS you will need an access key and secret; for Google Cloud you will download your access key JSON file. With the credentials in hand, you are ready to get started. However, be warned: the generation of the full Bible will require about 1-2 GB of space. If you'd like to simply review the audio produced, you can find the full audio books available on Hugging Face: the AWS Polly audio or the Google Cloud TTS audio.
The generation script for each has a built-in rate limit to restrict the number of characters passed to the cloud service provider. Please note the rate limiter is very basic; it is simply a running total of the number of characters previously sent. Before sending a line of text to generate audio, the script will check whether adding this line would exceed the character limit.
Below is an example of how a `BookReader` (the class that maintains the rate limiter) is defined.
book_reader = BookReader.new(PollyReader.new, max_token_limit, tokens_already_used)
Additionally, keeping track of the number of tokens used in previous runs of the script is the responsibility of the user. Finally, the script will output the current usage amounts once a book is finished.
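As a rough Python sketch of the idea (the actual implementation lives in the linked Ruby scripts), the budget check amounts to:

class CharacterBudget:
    """Sketch of the basic rate limiter described above: a running total
    of characters sent, checked before each request to the cloud TTS."""

    def __init__(self, max_chars, already_used=0):
        self.max_chars = max_chars
        self.used = already_used  # usage carried over from previous runs

    def try_consume(self, text):
        # Refuse the request if sending this text would exceed the budget
        if self.used + len(text) > self.max_chars:
            return False
        self.used += len(text)
        return True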
Amazon Polly
To use Amazon Polly simply install the SDK:
gem install aws-sdk-polly
The audio for the Bible can be generated using Polly with the following command (update the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` as needed):
AWS_ACCESS_KEY_ID="XXXXXXXXXXXXXXXXXXXX" \
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
ruby read_bible_polly.rb
The full code is available here.
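For those who prefer Python over the Ruby script, a minimal sketch of an equivalent Polly call using boto3 (the region, voice, and file names here are illustrative, not those used by the script):

import boto3

# Credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
polly = boto3.client("polly", region_name="us-east-1")

response = polly.synthesize_speech(
    Text="Hello world!",
    OutputFormat="mp3",
    VoiceId="Joanna",  # illustrative voice choice
)
with open("hello.mp3", "wb") as f:
    f.write(response["AudioStream"].read())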
Google Cloud TTS
To use Google Cloud TTS ensure you have Ruby 3.0.2 or newer installed. One can install 3.0.2 using rbenv:
rbenv install 3.0.2
rbenv local 3.0.2
Next, install the Google Cloud TTS gem:
gem install google-cloud-text_to_speech
The audio for the Bible can be generated using GC with the following command (update the path to the GC JSON key as needed):
GOOGLE_APPLICATION_CREDENTIALS=/path/to/google_cloud/key.json \
ruby read_bible_gc.rb
The full code is available here.
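Again, a minimal Python sketch of the same call using the google-cloud-texttospeech library (the voice parameters are illustrative):

from google.cloud import texttospeech

# Credentials are read from GOOGLE_APPLICATION_CREDENTIALS
client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Hello world!"),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Standard-C",  # illustrative voice choice
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)
with open("hello.mp3", "wb") as f:
    f.write(response.audio_content)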
Addressing the Audio Book File Size and Format
As previously outlined, the number of files produced for the Bible is quite large, with 1 file created per verse. Additionally, 1 file is created for each book title, book introduction, chapter title, and chapter introduction, all together for a total of 38,517 audio files. Reducing the total number of files would certainly help make the audio book more portable and easier to manage. After a bit of experimentation, one audio file per chapter of each book was determined to be the most appropriate solution. Each file would also contain a list of the chapter's verses by timestamp. The media container used, MKA (Matroska audio), supports labeling sections of audio using chapters.
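For reference, a Matroska chapter definition is a small XML file. A minimal sketch, with placeholder timestamps and names (the real files are generated per chapter by the script below), looks roughly like:

<?xml version="1.0"?>
<Chapters>
  <EditionEntry>
    <ChapterAtom>
      <ChapterTimeStart>00:00:00.000</ChapterTimeStart>
      <ChapterDisplay>
        <ChapterString>Verse 1</ChapterString>
        <ChapterLanguage>eng</ChapterLanguage>
      </ChapterDisplay>
    </ChapterAtom>
  </EditionEntry>
</Chapters>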
Additionally, each audio clip was converted to use the libopus codec and changed from a 160k bit-rate to 16k bit-rate. This resulted in a massive reduction in the file size while still maintaining similar audio quality.
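The conversion for a single file amounts to an ffmpeg call along these lines (file names are placeholders; the exact flags used by the script may differ):

ffmpeg -i input.mka -c:a libopus -b:a 16k output.mka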
Each step in this process is a Ruby script that produces the necessary artifacts. To simplify this, a bash script is used to perform each step with the correct arguments. First adjust the input arguments as needed:
# Location of source code
source_dir=./
# Location of audio files generated using TTS
audio_source_dir=audio/
# Location of parsed book data
data_source_dir=data/douay_rheims_
# Temp folder for copying files and operating on files
tmp_dir=tmp/
# Final output destination for the audio book
output_dir=full_book/
The audio book can then be produced using the following command. However, please note that producing the whole audio book can take significant time (due mostly to calls to ffmpeg).
bash format_book.bash
The full version of the script can be found on Github. In summary, the tasks the script will perform are:
- Create a per chapter list file.
- Create an XML chapter file per book chapter.
- Create a per chapter MKA file using the file lists.
- Add the XML chapter list to each MKA file.
- Convert the codec and bit-rate of the MKA to libopus and 16k.
- Create a playlist of each book to play the chapters in sequence.
- Re-produce the file structure of the books for easier navigation.
Once these tasks are finished, the output will be placed in the `output_dir`. Be sure to delete the `tmp_dir` as it will be quite large with copies of all the files.
rm -r /path/to/tmp/dir/
TTS Audio Validation with Spacy
Given the audio book's total run time is 118:14:19, a manual review would be very difficult. Therefore, Whisper was used to transcribe the audio. The transcribed audio was then compared to the original input text. However, the method of comparing the original source text to the Whisper transcript was left unresolved. A naive string comparison was used in the previous blog post; however, this did not yield great results, with a large number of false positives.
This time around, the NLP library spaCy was used to compare the two input sentences. The similarity is represented as a value in the range [-1.0, 1.0], where a similarity of 1.0 indicates the sentences are very similar and lower values indicate the sentences are dissimilar. While spaCy was able to provide a more semantic similarity, it still required that the input strings be normalized:
import string
import spacy

# A model with word vectors (e.g. en_core_web_md) is assumed here;
# .similarity is not meaningful without vectors
nlp = spacy.load("en_core_web_md")

def preprocess_text(text: str) -> str:
    return text.translate(str.maketrans('', '', string.punctuation))

def compare_strings(text1: str, text2: str) -> float:
    text1 = preprocess_text(text1.lower().strip())
    text2 = preprocess_text(text2.lower().strip())
    doc1 = nlp(text1)
    doc2 = nlp(text2)
    return doc1.similarity(doc2)
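As a quick usage example (the score quoted here is taken from the results tables below; exact values depend on the spaCy model used):

# Comparing the Psalms 108 title against its mistaken transcript
print(compare_strings("Psalms Chapter 107", "Psalms chapter 17."))  # 0.64 per the analysis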
The script is run twice:
- First, given the Whisper JSON transcripts, write out the JSON similarity files.
- Second, given the JSON similarity files, print the items that were below a threshold X of similarity.
The threshold X = 0.8 was used for this similarity analysis, meaning anything with a similarity less than 0.8 was considered a potential mistake. Choosing a good threshold is a matter of balancing filtering out false positives vs. including true positives. It should also be noted that the script calculates the similarity for all audio clips in the book. The threshold is only used as a filter on the already calculated and stored comparisons for analysis purposes, meaning one could re-run the script on the same similarity data with a different threshold, as sketched below.
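A sketch of that second, filter-only pass (the file layout and field name here are hypothetical):

import glob
import json

THRESHOLD = 0.8  # can be changed and re-run without re-computing similarities

# Hypothetical layout: one JSON file per clip containing a "similarity" field
for path in glob.glob("similarity/**/*.json", recursive=True):
    with open(path) as f:
        entry = json.load(f)
    if entry["similarity"] < THRESHOLD:
        print(f"{path}: {entry['similarity']:.2f}")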
Reviewing the similarity results of the Old Testament, 172 entries had a similarity below the 0.8 threshold. A manual review found that 11 of the 172 were true errors. These 11 errors are listed in the table below. A second, phonetic similarity measure is also included in the following results tables. This measure is outlined below, as it was developed from some of the initial results using just spaCy.
File Path | Original Text | Transcript Text | SpaCy Similarity | Phonetic Similarity |
---|---|---|---|---|
numbers/ch_1/verse_41.json | Forty-one thousand and five hundred. | 40 Janine N5. | 0.68 | 0.0 |
kings_4/ch_18/verse_27.json | And Rabsaces answered them, saying: Hath my master sent me to thy master, and to thee, to speak these words, and not rather to the men that sit upon the wall, that they may eat their own dung, and drink their urine with you? | and Raps Aces answered them, saying, | 0.74 | 0.13 |
chronicles_1/ch_1/verse_53.json | Duke Cenez, duke Theman, duke Mabsar, | Duke Sinés, Duke Thaman, Duke Mabza, that I will handle. | 0.64 | 0.2 |
chronicles_2/ch_11/intro.json | Roboam's reign. His kingdom is strengthened. | Roboons reign, De Doe. His kingdom is strengthened. | 0.78 | 0.625 |
esdras/ch_10/verse_42.json | Sellum, Amaria, Joseph. | Selam, Amaria, Joseph Tidey. | 0.67 | 0.75 |
nehemias/ch_7/verse_44.json | Of Oduia, seventy-four. The singing men: | of a deweyah 74. The Singing Men. Wa Wa Yi | 0.75 | 0.625 |
nehemias/ch_10/verse_13.json | Odaia, Bani, Baninu. | or daibani baninu.l. | -0.09 | 0.0 |
psalms/ch_61/intro.json | Exaudi, Deus. A prayer for the coming of the kingdom of Christ, which shall have no end. | Exalted Dearest Honored Son, De De De Nga'o, De Rau'nai, A prayer for the coming of the Kingdom of Christ, Which shall have no end. | 0.67 | 0.59 |
psalms/ch_108/title.json | Psalms Chapter 107 | Psalms chapter 17. | 0.64 | 1.0 |
ecclesiasticus/ch_19/title.json | Ecclesiasticus Chapter 19 | of any 각 of this particular year. Thank you all so much everybody aboutbuster infinity stadium and deck safety. | -0.05 | 0.0 |
machabees_1/ch_15/title.json | 1 Machabees Chapter 15 | 1 match of bees Chapter 50 | 0.72 | 0.2 |
The New Testament had 23 entries with a similarity below the 0.8 threshold. A manual review of these entries found 3 of the 23 were true errors. The 3 errors are listed below:
File Path | Original Text | Transcript Text | SpaCy Similarity | Phonetic Similarity |
---|---|---|---|---|
mark/ch_7/verse_22.json | Thefts, covetousness, wickedness, deceit, lasciviousness, an evil eye, blasphemy, pride, foolishness. | Feaths, covetousness, wickedness, deceit, lasciviousness, An evil eye, blasphemy, pride, foolishness, trought. What a kingdom down toward the day we're raised to him, talk full on their loud doors. | 0.77 | 0.37 |
acts/ch_12/intro.json | Herod's persecution. Peter's deliverance by an angel. Herod's punishment. | Herod's persecution, I'm Tao. Peter's deliverance, by an angel. Herod's punishment. | 0.75 | 0.72 |
ephesians/ch_2/title.json | Ephesians Chapter 2 | I leave now, finally after, shall do two or, | -0.04 | 0.0 |
Analysis Issues
The spaCy similarity comparison with the 0.8 threshold produced a large number of false positives. These false positives can be classified into 4 different categories:
- Incorrect transcription of names. For example, in Nehemias 10:2 the original text reads `Saraias, Azarias, Jeremias,` however Whisper transcribed `Sirias, Azarius, Jeremy's` with a similarity score of 0.0.
- Incorrect transcription of numbers. Typically the original text records most numbers as text (e.g. four instead of 4). However, Whisper would often transcribe numbers as numerals (either Roman or Arabic depending on the context). For example, in 1 Chronicles 26:3 the original text reads `Elam the fifth, Johanan the sixth, Elioenai the seventh.` however Whisper transcribes the audio as `Elim V Johan V VI El Yunai VII` with a similarity of 0.087. Putting aside the naming issue, `fifth` is transcribed as `V` (the Roman numeral for 5). Similarly, for Esdras 2:67 the original text reads `Their camels four hundred thirty-five, their asses six thousand seven hundred and twenty.` whereas Whisper transcribes the audio as `Their camels 435. Their asses 6,720.` which results in a similarity of 0.70.
- Transcription using phonetically similar words. For a particularly amusing example, in Colossians 2:21 the original text reads `Touch not: taste not: handle not.` however Whisper transcribes `Touch nut, taste nut, handle nut.` for a similarity of 0.46.
- Omissions in the transcription. For example, for the Revelation 20 title the original text reads `Apocalypse Chapter 20` however Whisper only transcribes `Apocalypse.` for a similarity of 0.12.
A classification system that results in many false positives is far from ideal. In the case of incorrect transcription of names or use of phonetically similar words, a comparison that considers phonetic similarity may be suitable. For example, one could convert the input strings to their phonetic representations using a library like metaphone.
import string
import spacy
from metaphone import doublemetaphone
def preprocess_text(text: str) -> str:
return text.translate(str.maketrans('', '', string.punctuation))
def phonetic_similarity(text1: str, text2: str) -> float:
nlp = spacy.blank("en")
text1 = preprocess_text(text1.lower().strip())
text2 = preprocess_text(text2.lower().strip())
doc1 = nlp(text1)
doc2 = nlp(text2)
codes1 = {doublemetaphone(token.text)[0] for token in doc1 if doublemetaphone(token.text)[0]}
codes2 = {doublemetaphone(token.text)[0] for token in doc2 if doublemetaphone(token.text)[0]}
    # Calculate similarity as the intersection over union of the metaphone codes
intersection = codes1 & codes2
union = codes1 | codes2
return len(intersection) / len(union) if union else 1.0
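For instance, applying this to the name-only mismatch from Nehemias 10:2 discussed above:

# The misspelled names share double-metaphone codes with the originals
print(phonetic_similarity("Saraias, Azarias, Jeremias,",
                          "Sirias, Azarius, Jeremy's"))  # 1.0 per the analysis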
This could help resolve a few of the false positives. For example, for incorrectly transcribed names such as in Nehemias 10:2 the similarity is 1.0. Likewise, for transcription of phonetically similar words in Colossians 2:21 the similarity is 1.0. However, in other cases such as Acts Chapter 2, transcribed as X Chapter 2, the phonetic similarity was 0.33.
In addition, the phonetic transformation appears to simply omit numerals and therefore incorrectly handles such matches. Instead, one could consider identifying numerals and converting them to their text representation using the inflect library.
import inflect
import re
RE_D = re.compile(r'\d+')
p = inflect.engine()
def replace_with_words(match):
num_text = match.group(0)
return p.number_to_words(int(num_text))
def convert_numbers_to_words(text: str) -> str:
    # Remove commas since inflect doesn't handle numerals with commas correctly
return re.sub(RE_D, replace_with_words, text.replace(',', ''))
original_text = " Their camels 435. Their asses 6,720."
converted_text = convert_numbers_to_words(original_text)
print(converted_text) # prints: " Their camels four hundred and thirty-five. Their asses six thousand, seven hundred and twenty."
Applying this conversion within the comparison (as shown below) yields a 0.99 semantic similarity, whereas the phonetic similarity yields a 1.0 similarity.
# Helper assumed by the snippet below: check whether the source text contains digits
def has_digits(text: str) -> bool:
    return any(ch.isdigit() for ch in text)

def compare_strings(text1: str, text2: str) -> float:
    # Only convert the transcript's numerals when the source spells numbers out
    if not has_digits(text1):
        text2 = convert_numbers_to_words(text2)
    text1 = preprocess_text(text1.lower().strip())
    text2 = preprocess_text(text2.lower().strip())
    doc1 = nlp(text1)
    doc2 = nlp(text2)
    return doc1.similarity(doc2)
However, converting numbers to text really only works for cases such as Esdras 2:67, where numbers are represented as Arabic numerals. In another slight variation, Whisper transcribes numbers as Roman numerals (see 1 Chronicles 26:3). These numerals are not detected as numbers by the library, resulting in a 0.087 similarity score. One could additionally seek out yet another library to convert Roman numerals to Arabic numerals and then convert the Arabic numerals to text. While certainly feasible (see roman), there are two potential issues. Firstly, detecting Roman numerals is difficult given they are represented by characters borrowed from the English alphabet. Secondly, this only happens in 4 out of the 180 false positives. Given the small number of incidents and the ease of detection, these cases were left for manual review.
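To illustrate both the approach and its pitfall, here is a sketch using the roman library. The regex is deliberately naive: standalone words such as "I" or "D" would be wrongly converted, which is exactly the detection difficulty noted above.

import re
import roman  # pip install roman

# Naive detection: any standalone run of Roman-numeral letters
RE_ROMAN = re.compile(r'\b[IVXLCDM]+\b')

def roman_to_arabic(text: str) -> str:
    def repl(match):
        try:
            return str(roman.fromRoman(match.group(0)))
        except roman.InvalidRomanNumeralError:
            return match.group(0)
    return RE_ROMAN.sub(repl, text)

print(roman_to_arabic("Elim V Johan VI El Yunai VII"))  # Elim 5 Johan 6 El Yunai 7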
Finally, in the case that the whisper transcript omitted words, there is no reasonable means to rectify these false positives. The whisper transcript is simply wrong and omits words clearly audible in the TTS clip. At best one can manually review cases where words are omitted or choose to re-generate the whisper transcript to hopefully yield a better result. As the old expression states: garbage in, garbage out.
False Negatives
In order to test the performance of the similarity techniques, a random sample can be used to provide insight into the data set overall while limiting the arduous task of reviewing the audio book entirely. As it so happens, one error did happen to be found by accident: 4 Kings 18:1. With a similarity of 0.995, the text of each was as follows:
Source Text | Whisper Transcript |
---|---|
In the third year of Osee, the son of Ela, king of Israel, reigned Ezechias, the son of Achaz, king of Juda. | In the third year of Osea, the son of Ila, King of Israel, reigned Ezekielus, the son of Akaz, King of Judah, Dori. |
Beyond the names, which still match phonetically quite well, at the end the Whisper transcript adds `, Dori`. On checking the audio, there is a clear extra word added to the end, and thus this is a false negative. As an interesting side note, running the same text through the phonetic comparison results in a similarity of 0.71. However, the phonetic similarity also factors in the names that are incorrectly transcribed. If one were to compare two strings with the only difference being the addition of `, Dori`, the phonetic similarity would be 0.93 and the semantic similarity would be 1.0. This means capturing errors such as 4 Kings 18:1, with a single word difference, is very difficult using the similarity measures in their current form. This may be due to the length of the original sentence, such that a single differing word has too small an impact.
One possible solution would be to divide the input text into groups of n words and then compare the sub-groupings. However, this solution is likely not feasible since frequently a single word is transcribed as multiple words. Take for instance the false positive Psalms 119:17:
Source Text | Whisper Transcript |
---|---|
Give bountifully to thy servant, enliven me: and I shall keep thy words. | give bound of leadeth I servant in life in me, and I shall keep thy words. |
First, the words `bountifully to` are transcribed as `bound of leadeth`. Second, `enliven` is transcribed as `in life in`. A naive chunking of, say, 3 words would end up comparing `thy servant, enliven` to `leadeth I servant`. The addition of any number of words causes the comparison to become misaligned. Obviously, this is far from ideal and would likely simply yield more false positives.
Random Sample Test
To perform a random sample of the text, it would first be wise to better understand the data in question. As previously stated, the Bible can be broken down into key parts: books, chapters, and verses. In terms of the text to generate audio clips for, the breakdown is as follows:
Type | Count | Percentage |
---|---|---|
Title | 73 | 0.19% |
Introduction | 71 | 0.18% |
Chapter Title | 1333 | 3.46% |
Chapter Introduction | 1293 | 3.36% |
Verse | 35747 | 92.81% |
Total | 38517 | 100% |
There are very few book and chapter titles; they are all of similar form and rarely incorrectly encoded by the TTS. The book and chapter introductions, however, present an entirely different set of texts. The introductions (which for some chapters are omitted) are supplementary material providing a commentary or summary of the related work. Finally, the verses are the main contents of the books and constitute just over 92% of the entities to be encoded as audio. Verses can be as short as part of a sentence or up to a few sentences long.
A random sample of 381 clips was selected for manual review. The manual review found 24 clips had a significant error while 357 had no significant error. Here, a significant error was any addition or omission of words from the source. Also, a general adherence to a reasonable pronunciation of each word was required. For example, if the TTS read `hands` as `hand` this would be classified as an error. Likewise, if `died` was pronounced as `did` it would again be classified as an error. Whereas if the TTS spoke in a very different accent but maintained correct pronunciation, the clip would be considered valid. The performance of the various similarity measures is outlined below:
Technique | False Positive | True Positive | False Negative | True Negative | Precision | Recall |
---|---|---|---|---|---|---|
Spacy (80% Threshold) | 0 | 0 | 24 | 357 | 0% | 0% |
Phonetic (80% Threshold) | 30 | 3 | 21 | 327 | 9.09% | 12.5% |
String Match | 151 | 3 | 21 | 206 | 1.95% | 12.5% |
Timing (3s Threshold) | 0 | 0 | 24 | 357 | 0% | 0% |
First of all, the performance of the TTS audio is quite good, given the random sample, achieving a 93.7% rate of success. Furthermore, 11 out of the 24 errors were only missing the final word or syllable. However, the results also clearly show none of the proposed methods are suitable for detecting errors in the clip generation. As each of these error detection methods relies on Whisper's transcripts, it is worthwhile to review the generated transcripts in the cases where errors happened.
Out of the 24 errors detected in the manual review, the Whisper transcripts were classified as either correct or an error. A Whisper transcript is correct if it correctly transcribes the audio. So phonetically similar name transcriptions would be listed as correct. Likewise, if the TTS audio reads "hand" instead of "hands", a transcript with "hand" would be considered correct. Alternatively, if the Whisper transcript does not accurately transcribe the audio, it is classified as an error.
Classification | Instances | Percentage |
---|---|---|
Correct | 17 | 70.83% |
Error | 7 | 29.17% |
This means that in this sample, when errors are present in the audio, just over 70% of the time the transcript is accurate. However, in just over 29% of the cases the transcript is not correct. Worse still, in 4 (16.67%) of these cases the Whisper transcript erroneously obscures the error. For example, in Deuteronomy 31:13 the audio cuts off before the final word `it`. The Whisper transcript however records the final word as `it` and thus makes the error far harder to detect. For the remaining 3 errors, the transcript fails to accurately transcribe the audio but still differs from the original source text. For example, in Josue 1:14 the transcript has `for the Lord.` instead of `for them`. However, the audio actually cuts off after `for`. Overall, relying on transcripts that are correct only around 70% of the time is unlikely to lead to accurate results.
Timing Method
Unlike the other 3 methods, the timing method is very specialized. Introduced in the previous blog post, it seeks to identify parts in the audio that contain excessive non-transcribed audio. A simple example would be if the TTS generated audio for "Hello World" but between Hello and World waited 10 seconds or inserted other bizarre sounds. Since Whisper also records the timestamps of the transcript, we can compare the gaps before, after, or between segments. The threshold used in the above analysis was 3 seconds. Again using the 24 errors identified in the manual review, there were 2 instances where extra audio was added: 1 Kings 17:11 and 3 Kings 14:31. Checking the end time deltas, they have 1.41s and 0.61s respectively. Naturally, neither would be captured by the 3 second threshold.
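A sketch of the gap check, reusing the Whisper JSON structure shown earlier (the clip duration would come from the audio file itself; this helper assumes it is passed in):

import json

def excessive_gaps(transcript_path: str, clip_duration: float,
                   threshold: float = 3.0) -> list:
    """Return any non-transcribed gaps longer than the threshold."""
    with open(transcript_path) as f:
        segments = json.load(f)["segments"]
    if not segments:
        return [clip_duration]  # nothing transcribed at all
    gaps = [segments[0]["start"]]  # audio before the first segment
    for prev, curr in zip(segments, segments[1:]):
        gaps.append(curr["start"] - prev["end"])  # audio between segments
    gaps.append(clip_duration - segments[-1]["end"])  # audio after the last segment
    return [gap for gap in gaps if gap > threshold]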
Conclusion
In this blog post, an audio book is provided for the Douay Rheims Bible using 3 different Text to Speech (TTS) systems: coqui-ai, AWS Polly, and Google Cloud TTS. The raw generated audio for each TTS system is available on Hugging Face. The coqui-ai version was generated using consumer hardware and re-worked into a more convenient format for both traversability and portability. Similar to the last blog post, an analysis was performed on the coqui-ai generated audio. Two additional methods were proposed: the first used spaCy and the second was a phonetic similarity. After a manual review of 381 randomly selected clips, none of the analysis methods performed well. Despite the poor performance of the error detection methods, the overall performance of the TTS was quite high at just over 93.7% accuracy.