What's wrong with listening to articles converted from text to speech?

Each of us has our own preference for consuming information. But we all have this in common: we all like quality content. If you are like me, prefer audio content, and have tried apps for programmatic voice synthesis, also known as text-to-speech (TTS), you know that quality content on the input often turns into a cacophony of boredom that feels like torture.

Quite a few TTS apps are available in different forms and shapes, from browser extensions to mobile apps. The price is affordable for generating hours of content per month. Modern TTS algorithms produce a natural-sounding voice in different languages and accents. I've tried many, but none of them were good enough. Here's why.

Image: a list of problems (reading text verbatim, ignoring images, not handling HTML tags), each with a checkmark and the word "fixed" next to it, suggesting that an app called article2audio solves all of them.

Problem 1: Reading text verbatim when converting articles to speech

The major problem I've encountered with these apps is that they read text word for word, which, let's face it, isn't going to work for every text. Let's take this simple example:

We've got 10x speed improvements in AI inference (predictions) by doing this.

A regular TTS app would read this as:

🗣 We've got ten ex speed improvements in A.I. inference predictions by doing this.

This is not how you would read it, right? I would read it like this:

🗣 We've got ten times speed improvements in A.I. inference, or predictions, by doing this.

That's not all. If I encounter a typo or missing comma while reading, I instinctively correct it. If there's a quote, I would naturally indicate that it's a quote. If the author uses punctuation in a special or wrong way (as they often do), I would subtly adapt how I read a sentence. All these nuances of human reading are entirely lost in current TTS apps.
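To make the "10x" example concrete, here is a minimal sketch of two such adjustments written as plain regular-expression rules in Python. This is only an illustration: the function name normalize_for_speech and both rules are my assumptions for this example, and hand-written rules like these cover just a tiny fraction of what a human reader adapts on the fly.

```python
import re


def normalize_for_speech(text: str) -> str:
    """Apply two hand-written rewrite rules before sending text to a TTS engine."""
    # Rule 1: a multiplier such as "10x" or "2.5x" is spoken as "10 times".
    text = re.sub(r"\b(\d+(?:\.\d+)?)x\b", r"\1 times", text)
    # Rule 2: a short parenthetical gloss is spoken as ", or ...,".
    text = re.sub(r"\s*\(([^()]+)\)", r", or \1,", text)
    # Clean up a comma left directly before closing punctuation.
    text = re.sub(r",(?=[.!?;:])", "", text)
    return text


print(normalize_for_speech(
    "We've got 10x speed improvements in AI inference (predictions) by doing this."
))
# We've got 10 times speed improvements in AI inference, or predictions, by doing this.
```

Rules like these break quickly (not every parenthetical is a synonym, for instance), which is why context-aware rewriting is the more promising route.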

Problem 2: Ignoring images

Moreover, existing apps turn a blind eye to images, which are the bread and butter of many web articles, especially technical ones. Visual content and the accompanying context are lost, leaving the listener with an incomplete understanding of the content.

Additionally, when TTS apps ignore the images, they also miss out on any diagrams, charts, or infographics that might be crucial for comprehending complex information. This oversight can lead to a fragmented and often confusing narrative, as listeners are left to fill in the gaps with their imagination, which is not always accurate.

Problem 3: Not handling HTML tags

Articles and blog posts don't contain just text and images. Open any blog post, and you'll find a variety of HTML tags (building blocks of a Web page), each with its own purpose. Maybe speech synthesis apps should be able to recognize these tags and adapt their reading accordingly? Let's look at these tags and see what can be done.

Block quotes

This is a blockquote:

Stay hungry, stay foolish. — Steve Jobs

I have no doubt that you have seen a lot of these. Instead of reading it as just another paragraph (as most TTS apps do), wouldn't it be better to indicate that it's a quote? Something like this:

🗣 There's a quote here. Stay hungry, stay foolish. Steve Jobs.

Maybe these services should also use a different voice for the quote to make it more obvious when the quote ends and the article continues.
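Here is a rough sketch of the announcement idea, using Python's standard html.parser module. The class and function names are hypothetical, and the closing phrase "End of quote." is my own stand-in for the voice switch suggested above, since changing voices depends on the TTS engine.

```python
from html.parser import HTMLParser


class QuoteAnnouncer(HTMLParser):
    """Collects page text and announces where each blockquote starts and ends."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.parts.append("There's a quote here.")

    def handle_endtag(self, tag):
        if tag == "blockquote":
            # Assumption: a spoken marker instead of a second voice.
            self.parts.append("End of quote.")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace and newlines
        if text:
            self.parts.append(text)


def speak_quotes(html: str) -> str:
    parser = QuoteAnnouncer()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


print(speak_quotes(
    "<p>Intro.</p><blockquote>Stay hungry, stay foolish.</blockquote><p>Next.</p>"
))
# Intro. There's a quote here. Stay hungry, stay foolish. End of quote. Next.
```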

Image captions

TTS apps usually read an image caption as a regular paragraph, which often confuses the listener. A more sophisticated approach would be to introduce the caption with a preamble, such as "There's an image caption.", followed by a slight pause to distinguish it from the main body of the text and provide clarity to the listener.
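One way to express the preamble and the pause is SSML (Speech Synthesis Markup Language), whose break element inserts a silence of a given length. The sketch below assumes the TTS engine accepts SSML input; the function name caption_to_ssml and the 500 ms default are arbitrary choices for illustration.

```python
from xml.sax.saxutils import escape


def caption_to_ssml(caption: str, pause_ms: int = 500) -> str:
    """Wrap an image caption in SSML: preamble, short pause, then the caption."""
    return (
        "<speak>"
        "There's an image caption."
        f'<break time="{pause_ms}ms"/>'
        f"{escape(caption)}"  # escape &, < and > so the markup stays valid
        "</speak>"
    )


print(caption_to_ssml("Latency before & after the change"))
# <speak>There's an image caption.<break time="500ms"/>Latency before &amp; after the change</speak>
```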

Pre-formatted text

Code snippets and preformatted text are common in tech articles, yet TTS apps often struggle with these elements. No, no, they don't struggle to read code. They can do that. But it doesn't make sense to read the code (all the symbols and text) because it's hard to comprehend even for a professional developer. Instead, it would be better to describe the code snippet in a more human-friendly way, such as "There's a code snippet here. It's a function that takes a string and returns a number".
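As a minimal sketch of the "announce, don't read" idea, the parser below skips everything inside pre tags and speaks a short placeholder instead. All names here are hypothetical. The describe callback is where a real description (for example, one produced by a language model) would be plugged in; the default used here only counts lines, because that is all that can be said without understanding the code.

```python
from html.parser import HTMLParser
from typing import Callable


def line_count_description(snippet: str) -> str:
    """Fallback description: only says how long the snippet is."""
    n = len(snippet.strip().splitlines())
    return f"It has {n} line{'s' if n != 1 else ''}."


class CodeSkipper(HTMLParser):
    """Collects page text, replacing each <pre> block with a spoken placeholder."""

    def __init__(self, describe: Callable[[str], str]) -> None:
        super().__init__()
        self.describe = describe
        self.parts: list[str] = []
        self.code: list[str] = []
        self.depth = 0  # number of <pre> elements we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                snippet = "".join(self.code)
                self.code.clear()
                self.parts.append(
                    "There's a code snippet here. " + self.describe(snippet)
                )

    def handle_data(self, data):
        if self.depth > 0:
            self.code.append(data)  # keep the code out of the spoken text
        else:
            text = " ".join(data.split())
            if text:
                self.parts.append(text)


def speak_without_code(html: str, describe=line_count_description) -> str:
    parser = CodeSkipper(describe)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = "<p>Run this:</p><pre><code>def f(s):\n    return len(s)</code></pre><p>Done.</p>"
print(speak_without_code(page))
# Run this: There's a code snippet here. It has 2 lines. Done.
```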

Tables

A common way of presenting data in articles is through the use of tables. I did a bit of research and found that 99% of the time, tables are inserted as images (screenshots) in articles and blog posts. This creates an additional problem of extracting the data from the images. But let's say the table content could be extracted (spoiler: it can indeed be extracted programmatically with decent accuracy), or it's present in the article as a table HTML tag. How do TTS apps read such tables? They read them row by row. As you can imagine, understanding data read this way is almost impossible. A better approach would be to describe the table in a more human-friendly way, such as "There's a table here. It has 3 columns and 5 rows. The summary is ...".
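The first half of that description (the shape of the table) is easy to compute when the table is present as HTML. Below is a minimal sketch with Python's standard html.parser; the names are hypothetical, and it makes simplifying assumptions: the fragment contains a single table, colspan and rowspan are ignored, and the header row is counted as a row. Producing the summary itself is a separate and harder step that is not attempted here.

```python
from html.parser import HTMLParser


class TableShape(HTMLParser):
    """Counts rows and columns of a single HTML table."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = 0
        self.columns = 0
        self._cells_in_row = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Count on the opening tag, since closing </tr> tags may be omitted.
            self.rows += 1
            self._cells_in_row = 0
        elif tag in ("td", "th"):
            self._cells_in_row += 1
            # The widest row decides the number of columns.
            self.columns = max(self.columns, self._cells_in_row)


def describe_table(html: str) -> str:
    shape = TableShape()
    shape.feed(html)
    shape.close()
    return (
        "There's a table here. "
        f"It has {shape.columns} columns and {shape.rows} rows."
    )


table = (
    "<table>"
    "<tr><th>Model</th><th>Speed</th><th>Cost</th></tr>"
    "<tr><td>A</td><td>10</td><td>1</td></tr>"
    "</table>"
)
print(describe_table(table))
# There's a table here. It has 3 columns and 2 rows.
```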

It's not just about convenience; it's about maximizing the potential of web content. I want to make the most of my time by absorbing valuable content when my hands are busy, but my mind is free. Sadly, the current state of TTS apps leaves a lot to be desired.

How I fixed these problems and made listening to articles much more enjoyable

The article2audio app is my answer to the shortcomings of traditional TTS tools. It's not perfect, but it's definitely a few steps ahead of the alternatives, and I constantly improve it. Here's how it solves the problems mentioned above.

Firstly, the app doesn't read text verbatim. It uses AI tools to understand the context and make intelligent adjustments to the text, so it sounds more natural when read out loud. Typos and punctuation errors are corrected on the fly using large language models (LLMs), ensuring a smooth listening experience.

Secondly, it acknowledges the presence of visual content. While it can't accurately describe all images or charts, it does its best to caption them or summarize their key points, filling the gaps left by other TTS apps. If there's an image caption, it will read it as such.

Lastly, the app handles HTML tags skillfully. Quotes are identified as such, captions are read with context, and even code snippets and tables are described in a clear way. This attention to detail makes article2audio a more comprehensive solution for converting articles and blogs to audio.

I built this app for my personal use, and many times in the process, I said “Wow!” to the results it produced. Now you can sign up and enjoy listening to your reading list, too.
