Consulting and R&D for Text-to-Speech in Education

A US-based education provider with its own LMS platform set out to enhance the learning experience for its students. To make content more accessible and engaging, the company wanted to introduce an audiobook-style feature that could convert text into natural-sounding speech. Our role was to research the market, evaluate existing solutions, and recommend the best option for integration.

Data science & AI, EdTech

USA

eLearning

3 days

Challenge

The client runs its own LMS platform designed for professional training and continuing education. The system hosts large volumes of domain-specific materials in fields such as law, medicine, and finance. To expand how learners consume knowledge and to keep them engaged, the company aimed to add a text-to-speech feature that could transform written materials into high-quality audio.

When addressing such requests, our first step is always to evaluate existing tools. Developing AI from scratch requires large budgets and months of engineering, while the market already offers reliable solutions for common needs such as text-to-audio conversion.

Exploring these options allows clients to save costs, shorten delivery time, and focus resources on integration rather than full-scale product development.

The client’s key requirements defined the direction of our research:

- The tool had to produce speech that sounded natural, reflecting punctuation, pauses, and intonation so that narration would feel like an audiobook rather than an automated system.
- It needed to support flexible playback settings such as speed and volume, and provide a choice of speakers to match the preferences of diverse audiences.
- Because the LMS covers domains like law and medicine, accurate pronunciation of specialized vocabulary was also critical.
- Finally, security was a key concern. With extensive copyrighted training materials stored in the platform, the client needed a solution that would not expose sensitive content to third-party servers or unclear licensing policies.

How we researched the solutions

Our team started with a structured research phase. Based on the client’s requirements, we analyzed dozens of available text-to-speech solutions. The evaluation covered functionality, voice quality, flexibility of integration, pricing models, scalability, and, most importantly, security and licensing conditions.

During this stage, we paid close attention to hidden limitations such as character-based pricing, unclear policies on data storage, and restrictions on commercial use of generated audio. These issues are often overlooked at the start but become critical when working with large volumes of copyrighted educational materials.

After several days of testing and reviewing documentation, we were able to narrow the wide range of tools to four candidates that best met the client’s needs:

- ElevenLabs
- Murf.ai
- NaturalReader (Commercial)
- Amazon Polly

Comparison of selected tools

Once we had identified the four most suitable candidates, the next step was a detailed comparison. Each tool was assessed in terms of functionality, speech quality, technical flexibility, pricing, licensing, and ability to scale. This helped us and the client see not only the advantages but also the limits of every option.

| Criterion | ElevenLabs | Murf.ai | NaturalReader (Commercial) | Amazon Polly |
| --- | --- | --- | --- | --- |
| Functionality | Natural speech, emotional range, cloning, built-in studio with editing, subtitles, voice isolation. Great for audiobooks, etc. | Studio with editing, video sync, pronunciation library, multiple voice styles. Strong for training/marketing. | Direct TTS conversion, supports many formats/languages, commercial focus. Limited editing. | Multiple engines (Standard, Neural, Long-Form, Generative), SSML support. Technically flexible, no creative studio. |
| Voice quality | Market leader in lifelike, expressive narration, especially audiobook-style. | High quality, less expressive. Best for e-learning and explainer videos. | Adequate for business use, less natural than ElevenLabs or Murf. | Clear and natural with Neural/Long-Form voices, but more synthetic than ElevenLabs. Quality varies. |
| Integration / API | Developer API for TTS, cloning, agents. Good for apps and platforms, but pricing may be high. | Mainly a web platform with limited integrations. Better for manual workflows. | Limited API. Focused on SaaS for commercial audio generation. | Deep AWS integration, full API, SSML, highly scalable for enterprise pipelines. |
| Pricing | Subscription + credits. Affordable for small tasks, expensive for long narration. Enterprise tiers available. | Subscription per user. Low-tier limits, costs rise quickly with bigger projects. | Single plan (~$99/month) covers commercial use. Predictable but not cheap. | Pay-as-you-go by character. Cost-effective at scale for Standard voices, higher for Neural/Long-Form. |
| Commercial licensing | Paid tiers allow commercial use. Cloning requires consent, strict policy. | Commercial use in paid plans. Free tier is non-commercial only. | Clear license for commercial distribution. | Commercial use allowed under AWS terms. No cloning issues but requires compliance management. |
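
A comparison like this can be condensed into a weighted decision matrix. The sketch below is only an illustration of that technique, not the scoring we used in the project: the weights, the 1-to-5 scale, and the candidate names `tool_a` and `tool_b` are all hypothetical placeholders. The criteria names mirror the ones listed above.

```python
# Weighted decision matrix: each criterion gets a weight, each candidate a
# score per criterion, and candidates are ranked by weighted average.

# Hypothetical weights (assumption): they only need to be non-negative.
WEIGHTS = {
    "functionality": 0.15,
    "voice_quality": 0.30,
    "integration": 0.15,
    "pricing": 0.15,
    "licensing": 0.15,
    "scalability": 0.10,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted average of per-criterion scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def rank(candidates: dict[str, dict[str, float]],
         weights: dict[str, float]) -> list[tuple[str, float]]:
    """Return (candidate, score) pairs sorted from best to worst."""
    scored = [(name, weighted_score(s, weights)) for name, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Placeholder scores on a 1-5 scale, invented purely for illustration.
    candidates = {
        "tool_a": {"functionality": 5, "voice_quality": 5, "integration": 4,
                   "pricing": 3, "licensing": 4, "scalability": 4},
        "tool_b": {"functionality": 3, "voice_quality": 4, "integration": 5,
                   "pricing": 4, "licensing": 4, "scalability": 5},
    }
    for name, score in rank(candidates, WEIGHTS):
        print(f"{name}: {score:.2f}")
```

Normalizing by the total weight means the weights do not have to sum to exactly 1, which makes it easier to adjust a single criterion when priorities change.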


Key factors that shaped our choice

When comparing solutions against the client’s requirements, we also uncovered several pitfalls that influenced our decision. Some issues were not immediately obvious from product descriptions but became clear during testing and deeper analysis.

Character-based billing risks
We found that character-based billing can quickly distort cost projections. Amazon Polly, for example, charges per character. For short tasks this model works fine, but for long-form narration such as audiobooks, the total adds up quickly because every character of every regenerated passage is billed.

Polly also applies different rates for its Standard, Neural, and Long-Form engines. Without precise calculations, the budget risk becomes significant.
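
To make the budget risk concrete, a rough projection can be scripted before committing to an engine. The sketch below is a minimal estimate under stated assumptions: the rates in `HYPOTHETICAL_RATES` are placeholders, not published prices, and the `regeneration_factor` parameter (extra characters billed when passages are re-synthesized after corrections) is our own illustrative addition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineRate:
    """Price of one synthesis engine, expressed per million characters."""
    name: str
    usd_per_million_chars: float


# Placeholder rates for illustration only (assumption), NOT published prices.
# Replace them with figures from the provider's current price list.
HYPOTHETICAL_RATES = {
    "standard": EngineRate("standard", 1.0),
    "neural": EngineRate("neural", 4.0),
    "long_form": EngineRate("long_form", 25.0),
}


def library_chars(documents: int, avg_chars_per_document: int,
                  regeneration_factor: float = 1.0) -> int:
    """Estimate billable characters for a content library.

    regeneration_factor >= 1.0 accounts for passages synthesized more than once.
    """
    if regeneration_factor < 1.0:
        raise ValueError("regeneration_factor must be at least 1.0")
    return round(documents * avg_chars_per_document * regeneration_factor)


def estimate_cost(total_chars: int, rate: EngineRate) -> float:
    """Return the projected cost in USD for one engine."""
    return total_chars / 1_000_000 * rate.usd_per_million_chars


def compare_engines(total_chars: int,
                    rates: dict[str, EngineRate]) -> dict[str, float]:
    """Return the projected cost per engine for the same character volume."""
    return {key: estimate_cost(total_chars, rate) for key, rate in rates.items()}


if __name__ == "__main__":
    chars = library_chars(documents=2000, avg_chars_per_document=50_000,
                          regeneration_factor=1.2)
    for engine, cost in compare_engines(chars, HYPOTHETICAL_RATES).items():
        print(f"{engine}: ${cost:,.2f}")
```

Even with placeholder numbers, the projection shows why the choice of engine matters more than the headline per-character price: the same library costs 25 times more on the most expensive placeholder tier than on the cheapest one.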
Limits of studio-based platforms
Another issue was tied to the pricing models of studio-based platforms. Tools like Murf.ai and NaturalReader often rely on per-seat subscriptions or impose limits on how many minutes can be exported.

For large teams or for customers with extensive content libraries, as was the case for our client, this setup adds cost and complicates scaling.
Hidden infrastructure costs
Integration complexity also surfaced as a hidden factor. Polly is tightly embedded in AWS infrastructure, which makes it attractive for enterprises already using AWS. But the cost of supporting services such as storage, data transfer, or orchestration can exceed the pure text-to-speech pricing. For a client focused on cost-effectiveness, this was not ideal.
Voice cloning and compliance
We also paid attention to voice cloning and compliance rules. ElevenLabs offers the most advanced cloning, but it requires strict consent management and has clear restrictions on how cloned voices can be used. This was not a blocker in our project, but it was an important consideration for building sustainable workflows.
Editing and voice quality variation
Finally, quality variation demanded attention. Even the most advanced TTS systems sometimes produce results that need manual correction through SSML or editing. The more natural the default output, the less time is spent on those corrections.

ElevenLabs consistently produced narration that required the least manual adjustment, which had a direct impact on production efficiency.
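
For engines that accept SSML, part of this correction work can be automated. The sketch below is a minimal illustration, assuming the target engine supports the standard `<speak>`, `<p>`, `<break>`, and `<sub>` elements; tag support varies by provider and engine, so this should be verified against the chosen tool's documentation. The function name `to_ssml`, the lexicon entries, and the default pause length are hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pronunciation lexicon (assumption): written form -> spoken form.
LEXICON = {
    "HIPAA": "hippa",
    "voir dire": "vwahr deer",
}


def _compile_terms(lexicon: dict[str, str]):
    """Build one regex matching any lexicon term as a whole word, longest first."""
    terms = sorted(lexicon, key=len, reverse=True)
    if not terms:
        return None
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")


def _mark_terms(paragraph: str, pattern, lexicon: dict[str, str]) -> str:
    """Escape XML characters and wrap lexicon terms in <sub alias="...">."""
    if pattern is None:
        return escape(paragraph)
    parts = []
    last = 0
    for match in pattern.finditer(paragraph):
        term = match.group(0)
        parts.append(escape(paragraph[last:match.start()]))
        parts.append(f"<sub alias={quoteattr(lexicon[term])}>{escape(term)}</sub>")
        last = match.end()
    parts.append(escape(paragraph[last:]))
    return "".join(parts)


def to_ssml(text: str, lexicon: dict[str, str], pause_ms: int = 400) -> str:
    """Convert plain text to SSML with paragraph pauses and term substitutions."""
    pattern = _compile_terms(lexicon)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    rendered = [f"<p>{_mark_terms(p, pattern, lexicon)}</p>" for p in paragraphs]
    separator = f'<break time="{pause_ms}ms"/>'
    return "<speak>" + separator.join(rendered) + "</speak>"


if __name__ == "__main__":
    sample = "HIPAA applies to patient records.\n\nJury selection starts with voir dire."
    print(to_ssml(sample, LEXICON))
```

Keeping the lexicon as data rather than hand-editing each passage means a corrected pronunciation is applied consistently across the whole content library.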

Final decision

In the end, we chose ElevenLabs. It provided the best balance of expressive voice quality, flexibility, and manageable pricing for the client’s use case. While no solution was perfect, it was the only one that consistently met the requirements without introducing long-term risks or excessive hidden costs.

We have now moved on to the next phase – integration. The solution is being embedded into the client’s learning platform, and in the next stage we will be tracking its performance, impact on learners, and return on investment. Stay tuned!

This type of research is valuable not only for the project at hand but also for our broader consulting practice. Each evaluation gives us a clearer view of the AI solutions market, allowing us to advise future clients more effectively and with greater confidence.

Viktoryia, Data Science Expert

Looking for a trusted guide in the complex world of AI?

Aristek can help you analyze, choose, and implement the best tools for your unique needs.
