8x8 recently unveiled a new AI transcription engine during an analyst meeting, touting its accuracy and multilingual capabilities as a major leap forward for its XCaaS platform. The company’s claims – such as training on 680,000 hours of data, achieving about 15% word error rate (WER) in English, and handling 45 languages with zero-shot learning – closely mirror the public release notes of OpenAI’s Whisper model (Introducing Whisper | OpenAI) (Transcription tools are vital to the contact center experience | 8x8). In parallel, 8x8 commissioned a benchmark from The Tolly Group comparing its transcription accuracy to competitors Dialpad and RingCentral (Tolly Group). This report declared 8x8 the winner with “the average best score across all samples” (Tolly Group). While these announcements paint an impressive picture, a deeper analysis raises skepticism about the originality of 8x8’s technology, the rigor of the Tolly benchmark, and the practical gap between post-call transcription accuracy and real-time utility in contact centers.
In this report, we critically examine 8x8’s technical claims and the Tolly Group’s findings, then put them in context with other leading transcription engines (OpenAI Whisper, Deepgram, AssemblyAI, Amazon Transcribe, Google Speech-to-Text, etc., as well as offerings from Dialpad and RingCentral). We conclude with recommendations for how 8x8 or any vendor could design a credible, enterprise-trusted benchmark for speech recognition in real-world contact center conditions.
Questionable Technical Claims and Parallels to Whisper
8x8’s touted technical specs for its new transcription engine are strikingly similar to OpenAI’s Whisper, suggesting that 8x8’s “homegrown” engine may in fact be built on Whisper or a derivative. For example, 8x8’s blog boasts that its latest model was “trained on a large and diverse dataset of 680,000 hours of multilingual and multitask supervised data” (Transcription tools are vital to the contact center experience | 8x8) – the exact size and nature of Whisper’s training corpus (Introducing Whisper | OpenAI). Likewise, 8x8 cites a “large model [with] a WER of 0.15 for English” (15% error rate) (Transcription tools are vital to the contact center experience | 8x8) and strong zero-shot performance across dozens of languages, which directly echo Whisper’s reported performance and goals. Whisper’s own documentation highlights robustness to accents, multilingual transcription, and ~15% WER as a good general threshold for usable accuracy (Introducing Whisper | OpenAI) (Transcription tools are vital to the contact center experience | 8x8). The clear one-to-one correspondence between 8x8’s numbers and Whisper’s suggests that 8x8’s engine is not a novel invention but rather an integration of OpenAI’s model. Indeed, 8x8 has publicly acknowledged partnering with OpenAI’s Whisper in its platform (8x8 Intros OpenAI Integrations to Enhance CX and Team Performance), so rebranding Whisper’s capabilities as their own breakthrough is somewhat misleading.
Such reliance on Whisper is not necessarily a bad strategy – Whisper is a state-of-the-art open model – but it warrants skepticism toward any implication that 8x8 independently achieved these feats. Training a speech model on 680k hours of data is a massive undertaking typically done by AI labs, not mid-sized CCaaS vendors. It’s far more likely 8x8 fine-tuned an open-source Whisper model or simply deployed it as-is. If that’s the case, 8x8’s transcription quality will indeed inherit Whisper’s strengths and weaknesses. Notably, Whisper’s authors themselves cautioned that while it’s very robust overall, it doesn’t always beat specialized speech models on certain narrow benchmarks (e.g. LibriSpeech) because it wasn’t fine-tuned per dataset. In other words, Whisper (and thus 8x8’s engine) trades a bit of peak accuracy on pristine audio for greater general robustness on varied real-world audio. A critical reader should question whether 8x8 did any additional training to optimize for contact center audio (with its unique acoustics, jargon, and dual-speaker interactions). If they simply deployed Whisper’s general model, their impressive 15% WER claim might hold for average cases but could falter on industry-specific vocabulary or very noisy call conditions unless further tuned.
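To underline how reproducible that baseline capability is, the openly released Whisper checkpoints can be run by anyone in a few lines. The sketch below is a generic illustration using the open-source openai-whisper Python package and a placeholder file name; it is not a description of 8x8’s actual deployment.

```python
# Minimal sketch: offline transcription of a recorded call with open-source Whisper.
# Assumes `pip install openai-whisper` (plus ffmpeg) and a placeholder file name.
import whisper

# All Whisper checkpoints were trained on the same 680k-hour corpus; "large" is the
# most accurate, while "base"/"small"/"medium" trade accuracy for speed.
model = whisper.load_model("large")

# Language is auto-detected by default; fp16=False keeps this runnable on CPU.
result = model.transcribe("sample_call.wav", fp16=False)

print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # full transcript
for seg in result["segments"]:
    print(f'[{seg["start"]:.1f}s - {seg["end"]:.1f}s] {seg["text"]}')
```

Since this baseline is available to every vendor, any genuine differentiation has to come from what is layered on top of it: streaming integration, domain tuning, custom vocabulary, and workflow hooks.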
Moreover, 8x8’s marketing glosses over the fact that these accuracy figures come from offline evaluation. The cited 0.15 WER (85% accuracy) is likely measured on recorded audio after the fact (Transcription tools are vital to the contact center experience | 8x8). This leads to an important nuance: are we talking about post-call transcription or live transcription? The context of 8x8’s claims (and the Tolly test) is offline transcript quality, which does not automatically translate to real-time performance. We explore this discrepancy later, but it’s worth keeping in mind when evaluating 8x8’s technical boasts. In summary, 8x8’s new transcription engine appears to owe much of its prowess to Whisper. This is a smart use of cutting-edge AI, but the company’s framing invites scrutiny – it should be more transparent about leveraging OpenAI’s work rather than presenting these capabilities as wholly unique innovations. Healthy skepticism is warranted whenever a vendor’s “proprietary breakthrough” specs align too perfectly with a well-known open-source project.
Evaluating the Tolly Group Benchmark: Credibility and Limitations
To bolster its claims, 8x8 commissioned The Tolly Group to perform an independent benchmark of transcription accuracy against Dialpad and RingCentral. On the surface, the results seem great for 8x8 – Tolly’s report (Feb 2025) states that across 15 English-language audio samples (with various accents), 8x8 delivered the highest accuracy on average (Tolly Group). However, a closer look at the methodology and context of this benchmark raises questions about its credibility and enterprise relevance:
Vendor-Sponsored and Narrow Scope: The test was paid for by 8x8 (noted as the “Sponsor” (Tolly Group)), which doesn’t automatically invalidate it, but it does mean the scope was likely defined to showcase 8x8’s strengths. Only 15 samples were used (Tolly Group) – a very small sample size by any standard. Even though those samples had a mix of accents, it’s hard to consider 15 short recordings an adequate proxy for the diverse universe of contact center calls. Statistical significance is questionable; a few outlier clips could swing the “average” easily. Enterprise buyers would rightly ask for far more extensive testing (hundreds of hours of audio, multiple languages, noise conditions, etc.) before drawing conclusions about “who is best.”
Unknown Content and Conditions: The report summary doesn’t specify what the audio samples contained: Were they actual customer service calls? Reading of scripted text? How noisy or complex were they? We know they were pre-recorded (so likely evaluated in batch mode, not live) (Tolly Group). Without details, there are potential methodology gaps. For example, if all vendors were tested on the same clear recordings, the benchmark might not reveal how each engine handles harder real-world scenarios (crosstalk between agent and customer, heavy background noise, domain-specific terms, etc.). A credible benchmark should document the test material and ensure it reflects realistic use cases. Fifteen clips with accents is a start, but leaves out many other variables that matter in practice.
Metric and “Average Best Score”: Tolly’s phrasing that “the 8x8 solution delivered the average best score across all samples” (Tolly Group) is a bit opaque. It likely means they calculated an accuracy or WER for each sample and 8x8 had the best mean accuracy overall. It would help to know the actual error rates observed for each system – e.g., did 8x8 achieve, say, 85% accuracy vs. 80% for Dialpad and 75% for RingCentral? The absence of concrete numbers in the public abstract is notable. Also, did Tolly use strict WER as the metric, or some custom scoring? Given their IT testing background, they probably used WER or a similar measure of words correct. Without transparency here, it’s hard to gauge the practical significance of 8x8’s lead. If, for instance, 8x8 was only marginally better than Dialpad, that’s different than if it doubled the accuracy. An enterprise-grade evaluation would publish detailed results, not just a one-line winner statement.
The Tolly Group’s Expertise: The Tolly Group is a longtime IT testing firm, known for networking and communications benchmarks, often commissioned by vendors for marketing collateral. They are not particularly known as experts in speech recognition or AI. Their involvement lends a veneer of third-party validation, but we should ask how objective and deep their analysis was. Tolly’s reputation is generally solid in delivering “fact-based” testing for IT (they’ve done VoIP voice quality tests, etc.), yet speech-to-text accuracy can be tricky to evaluate without domain expertise. Did they control for bias (e.g., ensuring no engine had prior knowledge of the test phrases)? Did they consider factors like punctuation accuracy or speaker diarization, which affect usability of transcripts? These details likely lie outside Tolly’s traditional scope, so chances are the benchmark was a straightforward WER comparison on a tiny dataset. That provides only limited insight for an enterprise making decisions.
Enterprise Relevance: From a customer perspective, the Tolly results are not sufficiently comprehensive to instill confidence. Real contact centers handle thousands of hours of calls across multiple languages and accents, often in noisy environments. A credible benchmark would need to test at that scale. By contrast, Tolly’s 4-page report (as listed) focused on a narrow slice. As an enterprise decision-maker, one would wonder: Does 8x8’s lead hold up for my specific call types? The Tolly report can’t answer that. It doesn’t address, for example, how these transcription engines perform on domain-specific vocabulary (product names, acronyms) which can be crucial. In fact, earlier industry debates have noted that accuracy benchmarks lack context – different providers might excel at generic words but falter on keywords, or vice versa (Dialpad Delivers on Real-Time AI) (Dialpad Delivers on Real-Time AI). Dialpad, for one, has argued that beyond overall WER, “keyword error rate” on custom terms is where their AI shines by training on company-specific jargon (Dialpad Delivers on Real-Time AI). Tolly’s test likely did not delve into such nuances.
In summary, while the Tolly Group benchmark gives 8x8 a nice marketing soundbite, it offers limited scientific or enterprise value. The small sample and vendor sponsorship mean the results should be taken with a grain of salt. As VentureBeat observed, transcription vendors’ accuracy claims often “lack the context required to make a precise apples-to-apples comparison” due to the absence of formal standards (Debate over accuracy of AI transcription services rages on | VentureBeat). Tolly’s effort does little to change that; it’s more of a one-off demo than a rigorous industry benchmark. Enterprises evaluating contact center transcription should demand broader, truly independent tests. Without those, 8x8’s claim of superiority remains plausible but not definitively proven.
Post-Call vs. Real-Time Transcription: Accuracy vs. Utility
A critical angle often glossed over in these discussions is the difference between post-call transcription accuracy and real-time transcription utility. Most of 8x8’s touted results (and those of competitors) refer to accuracy on complete audio files – transcribing a full call or meeting after it’s finished. But contact centers have two distinct use cases:
Post-call transcripts for records, compliance, and analytics (where accuracy can be maximized with offline processing), and
Real-time transcription during live calls or meetings for agent assistance, live captioning, or immediate insights.
It’s important to highlight the discrepancy between these modes. A system might achieve 85–90% accuracy on a recorded call but struggle to deliver the same in real-time when the conversation is ongoing. Why does this gap exist?
Streaming Challenges: Real-time ASR (automatic speech recognition) typically operates in a streaming fashion, transcribing on the fly. Models like Whisper were initially designed to work on chunks of audio (30-second segments) rather than truly word-by-word streaming (Introducing Whisper | OpenAI). When used live, they must produce intermediate results that may be revised as more context comes in – or the system must introduce a short delay to accumulate a few seconds of speech. This can lead to lower accuracy or awkward corrections in real-time. For example, a speaker might say “I can’t do that” and a streaming model might initially transcribe “I can” then correct to “I can’t” when the next word arrives. Such edits are fine in a stored transcript, but in a live setting they could confuse an agent or listener if not handled smoothly. A minimal sketch of this chunk-and-revise pattern appears after this list.
Resource and Latency Trade-offs: Achieving Whisper-level accuracy live might require significant computational resources (GPUs) to process audio with low latency, which isn’t always feasible at scale. Vendors might resort to a smaller or faster model for real-time, sacrificing some accuracy for speed. It’s unclear whether 8x8’s platform runs the full “large model” transcription engine during calls or only uses it post-call. Many CCaaS providers historically performed heavy speech analytics post-call, while using simpler keyword spotting or limited transcription during the call if at all. If 8x8 now claims real-time transcription, one must ask: Is it the same quality as offline? If not, the impressive 15% WER figure might apply only to after-the-fact transcripts, not what agents see live.
Post-Call Processing Advantages: When transcribing after a call, the system can often do a more thorough job – it can process the audio in both forward and backward directions, use the full context of the conversation, and even run a second pass to correct errors (for instance, using a language model to refine the transcript once an initial draft is done). Real-time transcription doesn’t have that luxury; it’s making best guesses in the moment. This often means post-call transcripts are more accurate than live ones. In practice, an agent reading a live transcript might see minor errors or missing words that later get fixed in the final saved transcript. Thus, the utility of that live transcript depends on it being accurate enough at each moment to be useful.
Impact on Agent and Customer Experience: In an ideal world, real-time transcription would be accurate enough to trust for things like prompting agents with info or providing live captions to a supervisor. But if the error rate is significantly higher live (say 20-25% WER in real-time vs 15% offline), that could limit its usefulness. The discrepancy matters: post-call accuracy contributes to better analytics and records, whereas real-time accuracy (and low latency) contributes to immediate decision-making and customer experience. A model that’s great offline but too slow or error-prone live might only solve half the problem.
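To make the chunking problem concrete, below is a minimal sketch of the naive “pseudo-streaming” pattern: slicing a recorded call to simulate live arrival and re-transcribing the growing buffer, so that each pass is only an interim hypothesis. It assumes the open-source openai-whisper package and a placeholder file name; real streaming systems use windowing, smaller models, or decoders built for incremental output rather than this brute-force loop.

```python
# Naive pseudo-streaming sketch: re-transcribe a growing audio buffer and treat
# each pass as an interim hypothesis that may later be revised ("I can" -> "I can't").
# Assumes `pip install openai-whisper` (plus ffmpeg); the file name is a placeholder.
import whisper

model = whisper.load_model("base")             # smaller checkpoint: faster, less accurate
audio = whisper.load_audio("sample_call.wav")  # 16 kHz mono float32 samples
SAMPLE_RATE = 16000

# Reveal one extra second per pass to mimic audio arriving live.
for end in range(SAMPLE_RATE, len(audio) + SAMPLE_RATE, SAMPLE_RATE):
    interim = model.transcribe(audio[:end], fp16=False)["text"]
    print(f"after {end / SAMPLE_RATE:>4.0f}s: {interim}")

# The last pass over the full buffer acts as the "final" transcript. Note that each
# pass costs more than the last, which is why production systems window the audio
# or use models designed for true incremental decoding.
```

Even in this toy form, early hypotheses can change as later context arrives (exactly the flip behavior described above), and the latency cost of waiting for more context becomes visible.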
In the context of 8x8: They have introduced features like real-time meeting transcription and summaries (leveraging Whisper) (8x8 Extends XCaaS Platform AI Capabilities with Real-time Meeting Transcriptions and Smart Summarizations) (8x8 Extends XCaaS Platform AI Capabilities with Real-time Meeting Transcriptions and Smart Summarizations), so they are extending the engine to live use. But there’s a reason 8x8’s CEO emphasized transcription quality – “AI is pointless with poor transcription”, as he noted (AI is pointless with poor transcription. Why do some vendors have worse… | Samuel C Wilson). If their live transcription drops words or lags, any AI that depends on it (like sentiment analysis or agent alerts) could be “pointless” in the moment. Neither 8x8 nor Tolly addressed this real-time gap directly. The Tolly test fed recorded samples to each system; it did not test how quickly or accurately each platform transcribes as the audio streams in. For an enterprise, that distinction is crucial. It might be the difference between an agent getting a timely alert like “customer is upset” vs. missing it because the transcription lagged or misheard the cue.
Bottom line: 8x8’s reported 15% WER should be viewed as a best-case, offline scenario. In reality, contact centers need to evaluate how the transcription performs under live conditions – does it maintain high accuracy at low latency? Many solutions (including possibly 8x8’s) still might use a hybrid approach: do a quick rough transcript during the call (for immediate needs) and then polish it after the call for archival. That is fine, but it means the fancy accuracy numbers largely apply to the latter. Enterprises should be wary of any vendor that quotes offline accuracy metrics as if they directly translate to real-time performance.
How 8x8 Stacks Up: Transcription Engines in the Competitive Landscape
8x8 is far from the only player bringing AI transcription to business communications. To put their claims in perspective, it’s useful to compare with other leading transcription engines – both general-purpose speech-to-text services and those embedded in competing CCaaS platforms. Below we survey several key solutions and any known performance data:
OpenAI Whisper (open source): As discussed, Whisper is the model underpinning 8x8’s engine. In open evaluations, Whisper (Large) achieves impressively low WER on many benchmarks. For example, AssemblyAI’s tests found Whisper around 7.9% WER on typical English speech (Benchmarks | AssemblyAI), which corresponds to ~92% word accuracy. Whisper’s strengths are its robustness across languages and noisy conditions due to the huge training set (Introducing Whisper | OpenAI). However, it wasn’t specifically tuned for telephone audio or real-time use. Out-of-the-box, Whisper might achieve ~10-15% WER on clean, general audio, but on tougher conversational telephone datasets, errors can be higher (OpenAI reported Whisper Large at ~13.8% WER on the Switchboard phone-call corpus, which is closer to 8x8’s cited 15% on calls). It also tends to omit punctuation or formatting as a pure ASR engine. Since Whisper is open, many companies (and open-source projects) are using or customizing it. 8x8’s differentiator would have to come from how well they integrate Whisper into their workflow (real-time streaming, custom vocabulary, etc.), rather than the base speech recognition quality, which many others can equally attain by using Whisper.
EndeavorCX Prism: EndeavorCX’s Prism transcription engine is built on OpenAI’s Whisper, inheriting that model’s top-tier accuracy and zero-shot multilingual capabilities across a wide range of languages (Introducing Whisper | OpenAI). Running on specialized inference hardware, Prism also achieves unprecedented speed – transcribing audio at near real time (on the order of 10 minutes of speech in ~3.7 seconds, i.e. ~164× real-time) – which means transcripts are available almost immediately after a call ends. Prism is purpose-built for post-call intelligence (as opposed to live streaming): it generates high-fidelity transcripts the moment each call concludes, enriched with call metadata (timestamps, speaker separation, and contextual tags aligned with call records) for deeper analysis (EndeavorCX Launches Prism: The Transcription Engine Rewiring How Contact Centers Use Voice Data - EndeavorCX). This rich output feeds directly into AI-driven workflows to produce summaries, sentiment insights, and actionable intelligence (powering automated QA, trend detection, coaching alerts, etc.) (EndeavorCX Launches Prism: The Transcription Engine Rewiring How Contact Centers Use Voice Data - EndeavorCX) (EndeavorCX - Prism Transcription). Because it leverages Whisper’s architecture, Prism can also be fine-tuned for specific domains, making it well-suited to verticals like customer service in healthcare or finance – accurately capturing industry-specific jargon and compliance terminology that generic engines might miss (Gladia - What is OpenAI Whisper?). Unlike general-purpose speech-to-text services (OpenAI’s Whisper API, Deepgram, AssemblyAI, etc.), Prism comes pre-integrated with enterprise contact-center systems (EndeavorCX Launches Prism: The Transcription Engine Rewiring How Contact Centers Use Voice Data - EndeavorCX), delivering continuous transcripts already synchronized with relevant context (CRM data, call detail records) and thus acting as a plug-and-play intelligence layer rather than a raw toolkit requiring custom integration. This combination of Whisper’s state-of-the-art accuracy, extreme inference speed, domain tuning, and seamless CX stack integration makes Prism a distinctly differentiated solution in the transcription engine landscape. (Introducing Whisper | OpenAI)
Deepgram: Deepgram is an enterprise-focused speech recognition provider that develops its own end-to-end models. They often claim superiority in accuracy and speed versus both Big Tech and open models. For instance, Deepgram advertises being “36% more accurate… and up to 5× faster” than OpenAI Whisper in some evaluations (Compare OpenAI Whisper Speech-to-Text Alternatives - Deepgram). While such marketing claims should be taken cautiously, Deepgram has the advantage of offering custom model training (you can train their model on your audio data) and being optimized for streaming and scale. In independent community tests, Deepgram’s accuracy is typically on par with other top engines – within a few percentage points of Whisper’s accuracy on various tasks (The Best Speech Recognition API in 2025: A Head-to ... - Voice Writer). One source noted Deepgram was within ~2% WER of Whisper on one benchmark, implying comparable performance (The Best Speech Recognition API in 2025: A Head-to ... - Voice Writer). Deepgram also supports punctuation and diarization (speaker labeling) out of the box, which are important for readability in transcripts. Among CCaaS vendors, Dialpad has hinted that some of its internal benchmarks include Deepgram (Dialpad listed “IBM, Google, etc.” but not Deepgram explicitly (Debate over accuracy of AI transcription services rages on | VentureBeat), though many newer CCaaS entrants use Deepgram’s API under the hood). If 8x8’s Whisper-based engine is to compete, it will need to match the speed and customization capabilities that players like Deepgram provide. Enterprises evaluating transcription APIs often run their own bake-offs where Deepgram, Google, Whisper, etc. transcribe the same call data; results can vary by content, but Deepgram is frequently a top contender.
AssemblyAI: AssemblyAI is another provider of speech-to-text via API, known for their research transparency and accuracy claims. They recently introduced a “Universal” model and published benchmarks comparing to others. According to AssemblyAI’s data, their model achieves about 6.6% WER on English (93.4% accuracy) – which they claim is the most accurate available (Benchmarks | AssemblyAI). In that same test suite, Google’s service was ~9.2% WER, Amazon’s 10.3%, and Whisper around 7.9% (Benchmarks | AssemblyAI) (Benchmarks | AssemblyAI). If those figures are accurate, AssemblyAI and Whisper are essentially in the top tier, with Microsoft, Google, Deepgram close behind. It’s worth noting AssemblyAI’s evaluation likely used curated audio (maybe podcast-like data or a mix of sources) and may not directly reflect noisy call-center audio. Nonetheless, AssemblyAI’s strength lies in handling formatting (it tries to output proper punctuation, numbers, etc. correctly) and offering features like summarization on top of transcripts. So, in comparison, 8x8’s engine is at least in the ballpark of the best, since it’s based on Whisper. But claims that it’s uniquely better than everyone else’s should be viewed with the context that many companies now have extremely accurate ASR. The differences of a few WER points might not be noticeable to end users without careful side-by-side testing.
Amazon Transcribe: Amazon’s speech-to-text service is widely used for voice applications and is known for its scalability. Quality-wise, Amazon Transcribe has improved over the years but tends to rank slightly below the top in independent tests. AssemblyAI’s benchmark showed Amazon at ~10% WER on English (Benchmarks | AssemblyAI), and other evaluations often find Amazon’s accuracy a bit behind Google’s. Amazon does offer a specialty Call Analytics edition of Transcribe tailored for contact centers (with features for speaker separation and call-specific vocabularies), which could narrow the gap in that domain. However, Amazon’s models may not have the sheer breadth of training that Whisper had (Amazon’s training data is proprietary and not disclosed in detail). For an enterprise already using AWS, Transcribe is a convenient option, but vendors like 8x8 or Dialpad likely chose to build/integrate their own ASR to have more control and (potentially) higher accuracy than the generic cloud API. In short, Amazon Transcribe is a solid baseline but not usually touted as the industry’s most accurate; 8x8 clearly believed they could do better via Whisper.
Google Cloud Speech-to-Text: Google has been the benchmark for accuracy for many years (“the gold standard,” as a Dialpad exec once said (Dialpad Delivers on Real-Time AI)). Google’s models (especially the latest improved ones) perform excellently on a range of tasks. In 8x8’s blog, they implicitly benchmarked against “models trained on smaller datasets” and claimed 50% fewer errors (Transcription tools are vital to the contact center experience | 8x8) – likely a reference to outperforming typical models which could include Google’s older models. However, Google hasn’t stood still; they continually update their STT. Google’s Enhanced Phone model is specifically tuned for telephone audio and is often very accurate on call data. In one Dialpad-run test, Google’s enhanced model hit ~79.8% accuracy vs Dialpad’s 82.3% on their dataset (Debate over accuracy of AI transcription services rages on | VentureBeat), essentially neck-and-neck. We can surmise that today Google STT would also be in the high-80s to low-90s percentage accuracy on general calls (depending on content). Google also supports over 125 languages, though perhaps not in a single model like Whisper does. An important note is that some CCaaS providers (RingCentral, perhaps, or smaller ones) might simply use Google’s API behind the scenes for transcription rather than developing their own. If RingCentral’s transcription was less accurate in the Tolly test, it could indicate they’re using a less capable engine than Google’s, or an older version. Alternatively, they might be using IBM Watson or another engine that hasn’t kept pace (IBM’s accuracy was noted to lag a few points behind Google in past comparisons (Dialpad Delivers on Real-Time AI)). Without insider info, we can’t be sure, but Google’s consistency makes it a strong baseline: 8x8 choosing Whisper suggests they believed Google wasn’t enough, but any enterprise should probably compare 8x8’s output to Google’s on their own calls to verify the difference.
Microsoft Speech (Azure Cognitive Services): Microsoft’s speech-to-text has greatly improved, leveraging deep learning research and Microsoft’s own large speech models (Azure also makes OpenAI’s Whisper available through its services, alongside those in-house models). In some public benchmarks, Microsoft’s Azure STT was shown with ~8.8% WER on English (Benchmarks | AssemblyAI), very close to OpenAI’s 7.9%. Microsoft offers customization like a Custom Speech service where you can upload specific vocabulary to improve recognition of rare terms. Given Microsoft’s focus on enterprise, their accuracy on business jargon can be strong if properly customized. Microsoft also powers transcription in Teams meetings (and possibly for some partner solutions). While not mentioned by 8x8, it’s likely an alternative that some CCaaS vendors evaluate. Microsoft is relatively quiet about touting accuracy numbers (compared to others), so it may not market them as aggressively, but technically its engine is among the top tier.
Dialpad’s “Voice Intelligence” (proprietary): Dialpad, a direct competitor to 8x8 in CCaaS, has invested heavily in its own ASR since acquiring TalkIQ in 2018. They claim their internally developed engine enables real-time call transcription and agent coaching as a differentiator. In a 2021 analysis, Dialpad published that they achieved ~82.3% word accuracy (≈17.7% WER) on general speech versus Google’s 79.8% (≈20.2% WER) (Debate over accuracy of AI transcription services rages on | VentureBeat) – a small edge. More impressively, they said for keywords (proper nouns, product names) their accuracy was 15% higher than Google’s (Dialpad Delivers on Real-Time AI), thanks to customizing models per customer. This indicates that Dialpad’s strength is not necessarily that their base model is beyond what Whisper/Google can do, but rather that they integrate the ASR tightly with contact center workflows (uploading company-specific dictionaries, retraining weekly (How we Automated our Automatic Speech Recognition QA | by ...), etc.). In Tolly’s 15-sample test, Dialpad apparently came in slightly behind 8x8. It’s plausible that Whisper’s sheer training breadth edged out Dialpad’s model on those mixed-accent samples. However, enterprise buyers might consider Dialpad’s approach to tailoring – if you have a lot of unique terminology, an engine that learns your vocabulary could outperform a generic model like Whisper. 8x8/Whisper can also be fine-tuned, in theory, but 8x8 hasn’t explicitly mentioned doing per-customer training. So, competitively, 8x8 vs Dialpad in transcription is a close race, with 8x8 (via Whisper) having a robustness advantage out-of-the-box, and Dialpad potentially catching up by brute-forcing improvements on the data of their user base (they’ve transcribed over 1B minutes by 2021 to improve their AI (Debate over accuracy of AI transcription services rages on | VentureBeat)).
RingCentral (RingSense AI): RingCentral, another CCaaS peer, introduced its AI-powered transcription and meeting summaries (branded as RingSense) around 2022–2023. They likely leverage a third-party or open model under the hood (perhaps Google, Microsoft, or even OpenAI’s API) since they have not publicized developing a speech engine from scratch. The Tolly test suggests RingCentral’s accuracy was the lowest of the three. This could mean RingCentral is using a less capable model or hasn’t invested as much in optimization. It’s worth noting that RingCentral acquired a company called DeepAffects in 2020, which specialized in speech and emotion analytics – so they do have some in-house AI talent. It’s possible their transcription is a fine-tuned model from that acquisition or a combination of services. Without hard data, we can only infer that currently 8x8 (Whisper) outperforms whatever RingCentral is using for English transcription. RingCentral has not published accuracy metrics, to our knowledge, which itself might indicate it’s not (yet) a bragging point for them. Enterprises considering RingCentral vs 8x8 for voice AI should pilot both on real calls. It wouldn’t be surprising if RingCentral pivots to using an OpenAI or Azure-powered transcription in the near future, given how the industry is moving.
Other notable engines: There are other ASR solutions like IBM Watson (widely used a few years ago, but its accuracy was reported a bit behind the latest Google/Dialpad, and IBM hasn’t been as aggressive in cloud AI recently), Cisco (they have Webex AI which likely uses a combination of in-house and partnered models), and specialty vendors like Otter.ai or Rev AI that focus on meeting transcription. Zoom has its own transcription feature (which initially used Otter.ai, then reportedly built their own based on… you guessed it, Whisper). The pattern is clear – Whisper’s emergence has influenced many, and the top commercial cloud providers (Google, Microsoft, Amazon, Deepgram, AssemblyAI) all have comparable offerings.
In light of this competitive landscape, 8x8’s claims of having the “best” transcription should be taken in context. Yes, their use of Whisper gives them a very strong, modern ASR capability, likely a leap over any older system they had. And the Tolly test indicates an edge over two direct CCaaS rivals. But objectively, the difference between 8x8’s engine and say Google’s or AssemblyAI’s might be only a few percentage points of WER. Those differences can disappear or reverse in different conditions (for example, an accented speaker might be better handled by one model vs another). It’s also a moving target – open research and competitors are continuously improving speech models. For instance, OpenAI could release a Whisper v2 that 8x8 would need to incorporate, or someone like Deepgram might optimize specifically for call audio and beat Whisper on that domain. The key for enterprise users is that several vendors now offer “good enough” transcription (in the 85–95% accuracy range on typical speech) (Benchmarks | AssemblyAI) (Benchmarks | AssemblyAI). The focus should shift to how these transcripts are used (real-time integration, analytics, etc.) and how well the AI handles your data specifically. Minor accuracy bragging rights in a lab mean less than consistent performance on your contact center’s calls.
Toward a Credible Benchmark for Enterprise Transcription
Given the current state of vendor claims and one-off comparisons, how could 8x8 – or any vendor in this space – create a truly credible, technically sound, and enterprise-relevant benchmark for transcription performance? Below is a proposal for constructing such a benchmark:
1. Use Realistic Contact Center Audio Data: The benchmark should be based on a large, representative sample of real contact center calls. This means hundreds of call recordings drawn from various industries (e.g., retail customer support, banking, healthcare, IT helpdesk) with customer consent and privacy safeguards. Include a range of audio qualities: landline calls, mobile calls, VoIP calls; different noise levels and levels of cross-talk (agent and caller speaking over each other). Multi-language coverage is important too – e.g., some calls in Spanish, some in French, etc., to reflect a global operation. By using real-world audio, the benchmark results will directly speak to enterprise needs (in contrast, using, say, clear podcast clips would be easier but not as insightful).
2. Ensure Accurate Reference Transcripts (Ground Truth): For every audio file in the test set, create a human-reviewed “ground truth” transcript. This might involve professional transcriptionists or multiple annotators to double-check for errors, so that the reference is as close to perfectly accurate as possible. Having high-quality reference transcripts is crucial – otherwise WER calculations are unreliable. Special care should be taken to mark things like speaker turns and proper nouns correctly in the reference. If evaluating multiple languages, have native speakers verify those transcripts. Essentially, treat it like an academic evaluation dataset.
3. Evaluate Multiple Dimensions: Don’t limit evaluation to a single metric. Word Error Rate (WER) is the primary metric (it directly measures deletions, substitutions, insertions of words), but also consider the following (a minimal scoring sketch follows this list):
Keyword accuracy or entity accuracy: especially on names, product terms, or critical keywords (similar to Dialpad’s KER metric (Dialpad Delivers on Real-Time AI)). This addresses enterprise-specific vocabulary performance.
Speaker diarization accuracy: if the solution claims to label speakers, measure how well it separates agent vs customer speech.
Punctuation and formatting correctness: in contact center transcripts, commas, periods, question marks, as well as formatting of numbers, dates, or addresses can matter for readability. A model that outputs well-formatted transcripts adds value. This could be measured by a formatted-error rate (how many errors in numbers, capitalization, etc., compared to the reference).
Timing/Latency: if testing real-time systems, measure how quickly each transcript is produced. For example, what is the average delay from speech to text availability? Also, measure stability of real-time output – do words “flip” frequently before stabilizing? These factors affect usability for live assistive purposes.
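As a concrete starting point for the first two dimensions, a scoring pass might look like the sketch below. It assumes the open-source jiwer library and uses invented reference/hypothesis strings purely for illustration; a real benchmark would load the human-verified references described in step 2.

```python
# Minimal scoring sketch: WER plus a crude keyword error rate.
# Assumes `pip install jiwer`; the strings below are invented examples, not test data.
import jiwer

reference  = "my order number is AB1234 and I want to cancel the premium plan"
hypothesis = "my order number is a b twelve thirty four and i want to cancel the premium plan"

# Normalize both sides identically so casing and punctuation don't distort WER.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])
ref_norm, hyp_norm = normalize(reference), normalize(hypothesis)

print(f"WER: {jiwer.wer(ref_norm, hyp_norm):.2%}")

# Crude keyword/entity check: what fraction of business-critical terms went missing?
keywords = ["ab1234", "premium"]
hyp_tokens = set(hyp_norm.split())
missed = [k for k in keywords if k not in hyp_tokens]
print(f"Keyword error rate: {len(missed) / len(keywords):.0%} (missed: {missed})")
```

Diarization, formatting, and latency would each need their own measurements on top of this, but even this much already surfaces the difference between overall WER and errors on the terms an enterprise actually cares about.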
4. Test Both Post-Call and Real-Time Modes: To bridge the gap identified earlier, the benchmark should evaluate engines in both scenarios:
Batch mode: Feed the entire audio file and get a final transcript – measure accuracy.
Streaming mode: Simulate a live call by feeding audio incrementally (with appropriate timing) and capture the transcript as it would appear in real time. Measure accuracy of the final real-time transcript and also note any significant transient errors. This will show the difference (if any) between an engine’s offline vs live performance, and it also allows latency to be measured, as noted above; a simple streaming-harness sketch follows this list.
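For the streaming side, the harness needs to pace the audio feed at real-time speed and record when text becomes final. The sketch below is deliberately engine-agnostic: the engine object and its methods (send_audio, has_new_final_text, close_and_get_transcript) are hypothetical placeholders that a per-vendor adapter would implement against that vendor’s actual streaming API.

```python
# Engine-agnostic streaming-mode harness sketch. The `engine` methods used here are
# hypothetical placeholders for a per-vendor adapter; `audio_chunks` is the call audio
# pre-sliced into short buffers of CHUNK_SECONDS each.
import time

CHUNK_SECONDS = 0.5

def run_streaming_trial(engine, audio_chunks):
    """Feed audio at real-time pace; return the final transcript and mean finalization latency."""
    latencies = []
    for chunk in audio_chunks:
        sent_at = time.monotonic()
        engine.send_audio(chunk)
        time.sleep(CHUNK_SECONDS)          # pace the feed so the engine hears "live" audio
        if engine.has_new_final_text():    # did any words become final after this chunk?
            latencies.append(time.monotonic() - sent_at)
    transcript = engine.close_and_get_transcript()
    mean_latency = sum(latencies) / len(latencies) if latencies else float("nan")
    return transcript, mean_latency
```

The transcript returned here can be scored with the same WER harness used for batch mode, which makes the offline-vs-live accuracy gap directly measurable alongside latency.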
5. Invite Multiple Vendors/Engines: For true credibility, the benchmark should not just compare 8x8 to one or two rivals of 8x8’s choosing. It should ideally include all major ASR engines that are relevant to contact centers. This means including at least: 8x8’s engine, Dialpad’s engine, RingCentral’s (if they agree), plus generic engines like Google STT, Amazon Transcribe, Microsoft Azure, possibly Deepgram and AssemblyAI’s APIs, and even OpenAI Whisper itself (the open-source version). By benchmarking across a wide field under identical conditions, the results would carry much more weight. An enterprise could see, for example, whether 8x8 truly outperforms a tuned Google model or by how much. Vendors might need incentive to participate, but the benchmark could be done by an independent party using each vendor’s public API or platform (for Dialpad/RingCentral, one might use their product or a trial to run the audio through their transcription feature).
6. Independent Oversight: To instill trust, this benchmark should be conducted or audited by a neutral third party – perhaps a respected research firm or a consortium. Even opening the process to community/peer review would help. The methodology and dataset (minus any sensitive audio) should be published openly. If 8x8 leads this effort, they must be willing to have their own performance scrutinized alongside others in a fair manner. This transparency would counter any skepticism of “vendor-rigged” results.
7. Relevance to Enterprise Outcomes: Design the evaluation to connect with practical outcomes. For instance, include a test where the transcript is used to generate an automated call summary – then have humans rate the quality of the summary for each engine’s transcript. Or test an AI sentiment analysis on each transcript to see if accuracy differences impact detecting customer sentiment. These higher-level tasks can reveal if a slightly higher WER actually yields materially better insights or not. By tying the benchmark to enterprise use cases (agent assist, compliance flagging, etc.), it becomes more than a number chase; it shows real-world impact.
8. Report Results with Context: Finally, publish the results in a detailed report. Show overall rankings but also breakdowns by noise level, by accent, by call type. It might turn out that one engine is best for English calls from the US, but another edges it out for Indian-accented speakers, for example. Such granular data would help enterprises pick solutions fit for their demographic mix. Also include analysis of errors – e.g., which engines handled overlapping speech better, which confused certain phrases. This diagnostic info can spur improvements from all vendors. A small sketch of such a per-condition breakdown follows.
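To illustrate the reporting step, the per-condition breakdown is straightforward once every trial is logged as a row of engine, accent, noise level, and WER. The sketch below uses pandas with entirely made-up placeholder numbers and generic engine labels; it shows the reporting mechanics only, not any measured result.

```python
# Reporting sketch: overall ranking plus per-accent / per-noise breakdowns.
# All numbers are invented placeholders; "Engine A/B" are generic labels.
import pandas as pd

results = pd.DataFrame([
    {"engine": "Engine A", "accent": "US",     "noise": "low",  "wer": 0.11},
    {"engine": "Engine A", "accent": "Indian", "noise": "high", "wer": 0.21},
    {"engine": "Engine B", "accent": "US",     "noise": "low",  "wer": 0.13},
    {"engine": "Engine B", "accent": "Indian", "noise": "high", "wer": 0.17},
])

# Overall ranking (lower WER is better)...
print(results.groupby("engine")["wer"].mean().sort_values())

# ...and the breakdown that tells a buyer which engine fits their demographic mix.
print(results.pivot_table(index="engine", columns=["accent", "noise"], values="wer"))
```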
By following the above approach, 8x8 could lead the creation of a credible benchmark that the enterprise community trusts. It would move the conversation away from each vendor making isolated claims (or tiny sponsored tests) and towards an industry standard for evaluating speech recognition in contact centers. In essence, it would be akin to an open “bake-off” in conditions that matter to customers. This level of transparency is still somewhat rare in the contact center tech space, but it would be a positive development. Not only would it give enterprise buyers clearer guidance, it would also push all providers to improve their technology (as no one would want to be at the bottom of the leaderboard in the next round of tests). Given how critical transcription is becoming for AI-driven customer experience, an initiative like this could help cut through the hype and hold vendors accountable to real performance.
Conclusion
8x8’s new transcription engine undoubtedly represents a significant step up for their platform – leveraging OpenAI’s Whisper has equipped 8x8 with state-of-the-art speech recognition capabilities, enabling features like multilingual transcriptions and AI summaries across their XCaaS offerings. However, a critical analysis reveals that many of 8x8’s bold claims are not unique breakthroughs but rather reflections of advances shared across the industry. The Whisper parallels (680k hours of training data, ~15% WER, zero-shot languages) make it clear that 8x8 is standing on the shoulders of open AI giants, and the true innovation will lie in how well they apply this technology in the contact center context (e.g. real-time usage, integration with workflows).
The Tolly Group benchmark, while offering a positive datapoint for 8x8, falls short of providing robust, enterprise-grade insight into transcription performance. Its limited scope and sponsor-driven nature mean that enterprises should not make decisions solely based on that report. Instead, as we’ve discussed, a more comprehensive and transparent benchmarking approach is needed in the industry. Until then, it’s prudent to treat any vendor’s self-declared “#1 accuracy” with healthy skepticism – including 8x8’s.
When comparing 8x8’s transcription to other leading engines, we find that the gap is not enormous; multiple providers (Whisper, Deepgram, AssemblyAI, Google, Microsoft, etc.) offer highly accurate speech-to-text, generally within a few percentage points of each other in overall accuracy. In fact, differences in specific scenarios or capabilities (like custom vocabulary, real-time stability, language support) may matter more than who has the absolute lowest WER on a given test. For enterprises, this means the competitive context is nuanced: 8x8’s transcription is certainly top-tier thanks to Whisper, but rivals like Dialpad are close behind and emphasize customization, while big players like Google and Microsoft ensure that baseline quality remains high across the board.
Finally, to truly earn enterprise trust, vendors should consider collaborating on credible benchmarks. We outlined how 8x8 or others could spearhead a fair evaluation that mirrors real contact center conditions. By openly benchmarking under neutral oversight, vendors can demonstrate confidence in their product and give customers the data they need to make informed choices. In the long run, such efforts would cut through marketing claims and let the best technology speak for itself. Until that happens, 8x8’s customers (and prospective customers) would be wise to pilot the transcription engine on their own calls and measure outcomes that matter to them – both in post-call analyses and in the heat of live customer interactions. After all, in the contact center, what counts is not just accuracy on paper, but actionable accuracy in practice – the kind that actually improves customer experience and agent performance in real time.
Sources: The analysis above cites information from 8x8’s own blog and press releases, the Tolly Group report abstract, industry articles, and benchmarks from speech AI providers to substantiate claims and provide context. Key references include 8x8’s description of their transcription engine (Transcription tools are vital to the contact center experience | 8x8) (Transcription tools are vital to the contact center experience | 8x8), the Tolly benchmark summary (Tolly Group), a VentureBeat discussion on transcription accuracy claims (Dialpad vs. Google) (Debate over accuracy of AI transcription services rages on | VentureBeat), and AssemblyAI’s public STT accuracy benchmarks (Benchmarks | AssemblyAI) (Benchmarks | AssemblyAI), among others. These sources are listed inline to support transparency and allow further reading.