It’s funny how many companies ask for, or claim they can provide, embedded dictation without qualifying what they really want or can deliver. Embedded dictation is easy to do, but one must consider…
Some languages are easier than others, and different accuracy targets and conditions can require more training data. Luckily, Sensory has been around long enough to have collected over 150,000 hours of audio data across more than 50 languages and dialects.
Sensory has engines running on everything from the tiniest of platforms, with under 50KB of memory, to large solutions requiring powerful DSPs and inference engines. We can run a speech-to-text algorithm in as little as 3MB of memory, and that algorithm will have extremely high task completion rates and low word error rates (<5%) for in-domain usage, but it won’t perform as well out of domain. To get reasonable cross-domain performance, engines need to grow to around 20MB and have reasonably powerful processing, or at least specialized inference functions.
Dictation isn’t specific to domains, or is it? Sensory’s top-of-the-line engine can get under a 5% word error rate on certain TED Talks, but apply a different test set or a different domain and accuracy degrades. The better we understand the domain and the testing methodology, the better we can do.
Accuracy is typically measured by word error rate (WER) and task completion rate (TCR). If it’s not straight dictation being performed, then task completion rate is usually most important: even if a word is recognized incorrectly, it doesn’t really matter as long as the right function is performed. Sensory likes TCR to ALWAYS exceed 95%; much below that, the experience starts to feel unusable. The nice thing is that WER can drop as low as 10 or 15% and a good TCR can still be achieved.
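For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the recognizer’s hypothesis, divided by the number of reference words. A minimal sketch (the `wer` function here is illustrative, not Sensory’s implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in five words: WER = 0.2, yet a command recognizer
# that still maps this to "lights on" would count it as a completed task.
print(wer("turn on the kitchen lights", "turn on the kitten lights"))  # 0.2
```

This also illustrates why TCR can stay high while WER climbs: only errors that change the resulting action hurt task completion.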
Noise and distance make accurate recognition much harder, so it is important to implement noise management strategies that fit the usage model. Sensory’s noise data includes about 15,000 hours of audio, and we have a variety of noise and acoustic simulation tools. Typically, multi-mic beam-forming helps, but watch out for noise suppression algorithms and nonlinear echo cancellation schemes that were developed around the psychoacoustics of human perception rather than around deep learned speech recognizers! Sensory partners with companies such as Alango, Andrea Electronics Corporation, Bolom, DSP Concepts, Meeami Technologies, MightyWorks, Phillips, and Yobe to manage noise for a wide range of environments and usages.
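One common acoustic-simulation step is mixing recorded noise into clean speech at a controlled signal-to-noise ratio, so a recognizer can be trained and tested at the noise levels it will face. A minimal sketch, assuming NumPy; the `mix_at_snr` helper is our own illustration, not a named Sensory tool:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    noise = noise[: len(speech)]  # assume the noise clip is at least as long
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve SNR_dB = 10*log10(P_speech / (gain^2 * P_noise)) for the noise gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Example with synthetic signals standing in for real audio.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # one second at 16 kHz
noise = rng.standard_normal(16000)
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

Sweeping `snr_db` over the target operating range (say, 20 dB down to 0 dB) gives a cheap proxy for the noisy-environment evaluation described above.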