
Recognition Effect Troubleshooting

Last updated: 2025-04-03 15:49:05

If the transcription result differs from what you expect when using ASR, follow the steps in this document to troubleshoot the problem.

Troubleshooting Steps

Common problems include the following:
1. The audio content is unclear or hard for an ordinary listener to understand. In this case, we recommend improving the audio capture environment at the source, for example: switch from far-field to near-field capture, reduce ambient noise, speak the standard language without a strong accent or dialect (that is, speech that non-locals can understand), and avoid slurring caused by fast speech.
2. The audio content is comprehensible, but the recognition result is very different from what is heard. This problem is generally caused by audio attributes that do not meet the requirements of ASR.
View the detailed audio information in Cool Edit, Adobe Audition, or FFmpeg, including the sampling rate, number of channels, and bit depth. ASR currently supports only audio with a sampling rate of 8,000 Hz or 16,000 Hz and a bit depth of 16 bits. Recording file recognition supports mono and stereo audio, while real-time speech recognition and single-sentence recognition support only mono audio. Note that if you use real-time speech recognition or single-sentence recognition, the audio attributes must strictly meet these requirements.
View the audio waveform and spectrum (in the view options of Adobe Audition) to determine the true sampling rate of the audio. The true sampling rate should meet the ASR requirements: the 8k telephone engine model corresponds to an 8,000 Hz sampling rate, and the 16k non-telephone engine model corresponds to a 16,000 Hz sampling rate.
The waveform and spectrum of true 16,000 Hz audio are as follows. The true sampling rate is twice the highest frequency that still contains signal (the highest of the boxed values on the right), i.e., 8 kHz × 2 = 16 kHz:

The waveform and spectrum of audio that is labeled 16,000 Hz but is not truly 16,000 Hz (the actual value is 4.6 kHz × 2 = 9.2 kHz) are as follows. You can see that audio information is completely missing in the 4.6 kHz to 8 kHz frequency band.
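The header attributes listed in item 2 (sampling rate, number of channels, bit depth) can also be checked with a script instead of an audio editor. The following is a minimal sketch using Python's standard `wave` module, which reads only uncompressed PCM WAV files. The function name `check_wav` and the mode names are assumptions made for illustration; they are not part of the ASR API.

```python
import wave

# Requirements as stated in this document; the mode names are illustrative only.
SUPPORTED_RATES = (8000, 16000)
REQUIRED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
MAX_CHANNELS = {"recording_file": 2, "real_time": 1, "single_sentence": 1}


def check_wav(path, mode="real_time"):
    """Return a list of problems found in the WAV header (empty list = OK)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
    if rate not in SUPPORTED_RATES:
        problems.append(f"sampling rate {rate} Hz is not 8000 or 16000 Hz")
    if width != REQUIRED_SAMPLE_WIDTH_BYTES:
        problems.append(f"bit depth is {width * 8}-bit, expected 16-bit")
    if channels > MAX_CHANNELS[mode]:
        problems.append(
            f"{channels} channels, but {mode} supports at most {MAX_CHANNELS[mode]}"
        )
    return problems
```

Calling `check_wav("sample.wav", "real_time")` on a hypothetical file returns an empty list when the header meets the requirements above. A clean header reflects only the nominal sampling rate, so the spectrum check for the true sampling rate is still needed.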


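The rule from item 2 (true sampling rate = highest frequency that still contains signal × 2) can be approximated in code when no spectrum viewer is available. The sketch below is a rough heuristic, not the method used by ASR: it averages the power spectrum over Hann-windowed frames and reports the highest frequency whose power is within `floor_db` of the strongest bin. The function names, the 512-sample frame, and the -60 dB threshold are all assumptions that will likely need tuning for real recordings.

```python
import cmath
import math
import struct
import wave


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def effective_bandwidth(samples, rate, frame=512, floor_db=-60.0):
    """Estimate the highest frequency (Hz) that still carries signal.

    floor_db is a heuristic threshold relative to the strongest bin.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame) for i in range(frame)]
    power = [0.0] * (frame // 2 + 1)  # bins from 0 Hz up to rate / 2
    for start in range(0, len(samples) - frame + 1, frame):
        spectrum = fft([samples[start + i] * window[i] for i in range(frame)])
        for k in range(len(power)):
            power[k] += abs(spectrum[k]) ** 2
    peak = max(power)
    if peak == 0.0:
        return 0.0  # silence or fewer samples than one frame
    threshold = peak * 10 ** (floor_db / 10)
    top_bin = max(k for k, p in enumerate(power) if p >= threshold)
    return top_bin * rate / frame


def read_mono_16bit(path, max_samples=160000):
    """Read up to max_samples samples from a 16-bit mono PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = wav.getframerate()
        raw = wav.readframes(max_samples)
    return rate, struct.unpack("<%dh" % (len(raw) // 2), raw)
```

To use it, pass the output of `read_mono_16bit` to `effective_bandwidth` and double the result to get the estimated true sampling rate. If a file whose header says 16,000 Hz yields an estimate far below 16,000 Hz, it was probably upsampled from a lower rate, like the 9.2 kHz example above.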
3. The audio content is comprehensible, and the recognition result is close to what is heard, but certain proper nouns, specialized terms, or sentences are poorly recognized. You can improve recognition as follows:
For poorly recognized nouns, add hotwords. See the Hotword Operation Instructions for how to add and use them.
For sentences that contain poorly recognized nouns, or for other poorly recognized special cases, use a self-learning model. See the Self-learning Model Operation Instructions for how to add and use one.
4. The audio content is comprehensible, and the recognition result is close to what is heard, but the result contains some extra words. This problem is generally caused by noise. There are two types of noise: non-human noise and human noise. ASR's algorithms are optimized and adapted for non-human noise, and you can submit specific bad cases caused by such noise to Tencent for further analysis and optimization. Problems caused by human noise are difficult to solve, because suppressing it risks misclassifying the human speech that needs to be recognized as noise.