Consulting and R&D for Text-to-Speech in Education

A US-based education provider with its own LMS platform set out to enhance the learning experience for its students. To make content more accessible and engaging, the company wanted to introduce an audiobook-style feature that could convert text into natural-sounding speech. Our role was to research the market, evaluate existing solutions, and recommend the best option for integration.

USA
eLearning
3 days

Challenge

The client runs its own LMS platform designed for professional training and continuing education. The system hosts large volumes of domain-specific materials in fields such as law, medicine, and finance. To expand how learners consume knowledge and to keep them engaged, the company aimed to add a text-to-speech feature that could transform written materials into high-quality audio.

When addressing such requests, our first step is always to evaluate existing tools. Developing AI from scratch requires large budgets and months of engineering, while the market already offers reliable solutions for common needs such as text-to-audio conversion.

Exploring these options allows clients to save costs, shorten delivery time, and focus resources on integration rather than full-scale product development.

The client’s requirements defined the direction of our research:

  • The tool had to produce speech that sounded natural, reflecting punctuation, pauses, and intonation so that narration would feel like an audiobook rather than an automated system.

  • It needed to support flexible playback settings such as speed and volume, and provide a choice of speakers to match the preferences of diverse audiences.

  • Because the LMS covers domains like law and medicine, accurate pronunciation of specialized vocabulary was also critical.

  • Finally, security was a key concern. With extensive copyrighted training materials stored in the platform, the client needed a solution that would not expose sensitive content to third-party servers or unclear licensing policies.

With these criteria in place, we moved on to the market analysis to identify the most promising tools.

How we researched the solutions

Our team started with a structured research phase. Based on the client’s requirements, we analyzed dozens of available text-to-speech solutions. The evaluation covered functionality, voice quality, flexibility of integration, pricing models, scalability, and, most importantly, security and licensing conditions.

During this stage, we paid close attention to hidden limitations such as character-based pricing, unclear policies on data storage, and restrictions on commercial use of generated audio. These issues are often overlooked at the start but become critical when working with large volumes of copyrighted educational materials.

After several days of testing and reviewing documentation, we were able to narrow the wide range of tools to four candidates that best met the client’s needs:

  • ElevenLabs

  • Murf.ai

  • NaturalReader (Commercial)

  • Amazon Polly

Comparison of selected tools

Once we had identified the four most suitable candidates, the next step was a detailed comparison. Each tool was assessed in terms of functionality, speech quality, technical flexibility, pricing, licensing, and ability to scale. This helped us and the client see not only the advantages but also the limits of every option.

  • ElevenLabs

    • Functionality

      Natural speech, emotional range, cloning, built-in studio with editing, subtitles, voice isolation. Well suited to audiobooks and similar long-form narration.

    • Voice quality

      Market leader in lifelike, expressive narration, especially audiobook-style.

    • Integration / API

      Developer API for TTS, cloning, agents. Good for apps and platforms, but pricing may be high.

    • Pricing

      Subscription + credits. Affordable for small tasks, expensive for long narration. Enterprise tiers available.

    • Commercial licensing

      Paid tiers allow commercial use. Cloning requires consent, strict policy.

  • Murf.ai

    • Functionality

      Studio with editing, video sync, pronunciation library, multiple voice styles. Strong for training/marketing.

    • Voice quality

      High quality, less expressive. Best for e-learning and explainer videos.

    • Integration / API

      Mainly a web platform with limited integrations. Better for manual workflows.

    • Pricing

      Subscription per user. Low-tier limits, costs rise quickly with bigger projects.

    • Commercial licensing

      Commercial use in paid plans. Free tier is non-commercial only.

  • NaturalReader (Commercial)

    • Functionality

      Direct TTS conversion, supports many formats/languages, commercial focus. Limited editing.

    • Voice quality

      Adequate for business use, less natural than ElevenLabs or Murf.

    • Integration / API

      Limited API. Focused on SaaS for commercial audio generation.

    • Pricing

      Single plan (~$99/month) covers commercial use. Predictable but not cheap.

    • Commercial licensing

      Clear license for commercial distribution.

  • Amazon Polly

    • Functionality

      Multiple engines (Standard, Neural, Long-Form, Generative), SSML support. Technically flexible, no creative studio.

    • Voice quality

      Clear and natural with Neural/Long-Form voices, but more synthetic than ElevenLabs. Quality varies.

    • Integration / API

      Deep AWS integration, full API, SSML, highly scalable for enterprise pipelines.

    • Pricing

      Pay-as-you-go by character. Cost-effective at scale for Standard voices, higher for Neural/Long-Form.

    • Commercial licensing

      Commercial use allowed under AWS terms. No cloning issues but requires compliance management.
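
To make the pricing rows above more concrete, here is a minimal sketch of how per-character billing could be compared with a flat monthly plan. The per-million-character rates, the tier names, and the library size are placeholder assumptions chosen for illustration, not vendor prices. Only the ~$99/month figure comes from the comparison above, and current rates should be taken from each vendor's pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerCharacterPlan:
    """Pay-as-you-go plan billed per million characters."""
    name: str
    usd_per_million_chars: float  # placeholder value, not a vendor price

    def cost(self, characters: int) -> float:
        """Return the cost in USD of synthesizing `characters` characters."""
        return characters / 1_000_000 * self.usd_per_million_chars


def flat_plan_cost(usd_per_month: float, months: int) -> float:
    """Return the cost in USD of a flat subscription over `months` months."""
    return usd_per_month * months


# Hypothetical engine tiers: the rates below are illustrative assumptions.
PLANS = [
    PerCharacterPlan("standard-engine", 4.0),
    PerCharacterPlan("neural-engine", 16.0),
    PerCharacterPlan("long-form-engine", 100.0),
]

# Assumed library size: 500 courses of roughly 200,000 characters each.
LIBRARY_CHARACTERS = 500 * 200_000

if __name__ == "__main__":
    for plan in PLANS:
        print(f"{plan.name}: ${plan.cost(LIBRARY_CHARACTERS):,.2f}")
    print(f"flat plan, 12 months: ${flat_plan_cost(99.0, 12):,.2f}")
```

Running the same character count through several rate tiers shows how strongly the total depends on the engine chosen, which is why a content inventory is worth doing before committing to a per-character model.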

Key factors that shaped our choice

When comparing solutions against the client’s requirements, we also uncovered several pitfalls that influenced our decision. Some issues were not immediately obvious from product descriptions but became clear during testing and deeper analysis.

  • Character-based billing risks

    We found that character-based billing can quickly distort cost projections. Amazon Polly, for example, charges by characters. For short tasks this model works fine, but for long-form narration such as audiobooks, the cost can increase rapidly.

    Polly also applies different rates for its Standard, Neural, and Long-Form engines. Without precise calculations up front, the budget risk becomes significant.

  • Limits of studio-based platforms

    Another issue was tied to the pricing models of studio-based platforms. Tools like Murf.ai and NaturalReader often rely on per-seat subscriptions or impose limits on how many minutes can be exported.

    For large teams or for customers with extensive content libraries, as was the case with our client, this setup adds cost and complicates scaling.

  • Hidden infrastructure costs

    Integration complexity also surfaced as a hidden factor. Polly is tightly embedded in AWS infrastructure, which makes it attractive for enterprises already using AWS. But the cost of supporting services such as storage, data transfer, or orchestration can exceed the pure text-to-speech pricing. For a client focused on cost-effectiveness, this was not ideal.

  • Voice cloning and compliance

    We also paid attention to voice cloning and compliance rules. ElevenLabs offers the most advanced cloning but requires strict consent management and clearly restricts how cloned voices can be used. This was not a blocker in our project, but it was an important consideration for building sustainable workflows.

  • Editing and voice quality variation

    Finally, quality variation demanded attention. Even the most advanced TTS systems sometimes produce results that need manual correction through SSML or editing. The more natural the voice is out of the box, the less time is spent on corrections.

    ElevenLabs consistently produced narration that required the least manual adjustment, which had a direct impact on production efficiency.
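
The kind of manual SSML correction mentioned above can be sketched as follows: a small helper that wraps paragraphs in standard SSML tags, inserts a pause between them, and substitutes a spoken form for specialized terms. The lexicon entries, the function names, and the 400 ms pause are hypothetical choices for illustration. Which SSML tags a given engine or voice honors varies, so this should be checked against the vendor's documentation before use.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pronunciation lexicon: written form -> spoken form.
LEXICON = {
    "b.i.d.": "twice a day",
    "v.": "versus",
}


def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Escape text for XML and wrap known terms in <sub alias="...">."""
    if not lexicon:
        return escape(text)
    # Longest terms first so that longer entries win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)"
    pieces = re.split(pattern, text)
    out = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1:  # odd indices are the captured lexicon matches
            alias = quoteattr(lexicon[piece])
            out.append(f"<sub alias={alias}>{escape(piece)}</sub>")
        else:
            out.append(escape(piece))
    return "".join(out)


def to_ssml(paragraphs: list[str], lexicon: dict[str, str],
            pause_ms: int = 400) -> str:
    """Build an SSML document with a fixed pause between paragraphs."""
    pause = f'<break time="{pause_ms}ms"/>'
    body = pause.join(f"<p>{apply_lexicon(p, lexicon)}</p>" for p in paragraphs)
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    ssml = to_ssml(["Take 5 mg b.i.d. & rest.", "See Smith v. Jones."], LEXICON)
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    print(ssml)
```

Keeping the corrections in a lexicon rather than editing each text by hand means a fix for one legal or medical term applies across the whole content library.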

Final decision

  • In the end, we chose ElevenLabs. It provided the best balance of expressive voice quality, flexibility, and manageable pricing for the client’s use case. While no solution was perfect, it was the only one that consistently met the requirements without introducing long-term risks or excessive hidden costs.

    We have now moved on to the next phase – integration. The solution is being embedded into the client’s learning platform, and in the next stage we will be tracking its performance, impact on learners, and return on investment. Stay tuned!

    This type of research is valuable not only for the project at hand but also for our broader consulting practice. Each evaluation gives us a clearer view of the AI solutions market, allowing us to advise future clients more effectively and with greater confidence.
