Functionality

Last updated: 2025-04-03 15:49:28

Which ASR Service Should I Choose In Different Scenarios?

Real-time speech recognition is suitable for scenarios with real-time requirements, such as voice input methods, voice robots, and meeting recording.
Recording file recognition is suitable for scenarios with longer speech duration and low real-time requirements, such as customer service quality inspection and video subtitle generation.
Ultrafast recording file recognition is suitable for scenarios with longer speech duration and extremely high real-time requirements, such as adding subtitles to videos and quasi-real-time quality inspection.
One-sentence recognition is suitable for recognizing short audio files of up to 60 seconds, such as voice messages and voice search.
Asynchronous stream recognition is suitable for quasi-real-time recognition of voice streams with text results returned asynchronously, such as live streaming review and audio/video review.

Is the Audio Transmitted By Users To Tencent Cloud ASR Used Only For the Current Recognition?

Yes, the audio content transmitted by users to Tencent Cloud ASR is used only for the current recognition and will not be saved.

If Two People Are Talking In a Recording Stored As Mono, Will the Recognition Result Separate Their Dialogue?

Yes. Mandarin recording file recognition at 8 kHz and 16 kHz sampling rates supports speaker separation for a two-person dialogue recorded on a single channel.

Are Far-Field and Online/Offline Speech Recognition Supported?

ASR supports both online and offline speech recognition. For details, refer to the offline section in the SDK Documentation.

Does ASR Support Recognizing Speeches In Chinese-English Mix and Dialects?

Real-time speech recognition, one-sentence recognition, recording file recognition, ultrafast recording file recognition, and asynchronous stream recognition support mixed recognition of Chinese and English, as well as Mandarin with an accent. With the Chinese engine, mixed recognition works when the audio contains a small amount of English; the recognition rate may decrease when it contains a large amount of English.
Real-time speech recognition, one-sentence recognition, recording file recognition, and ultrafast recording file recognition support the recognition of 23 dialects, including Shanghai dialect, Sichuan dialect, Wuhan dialect, Guiyang dialect, Kunming dialect, Xi'an dialect, Zhengzhou dialect, Taiyuan dialect, Lanzhou dialect, and Yinchuan dialect.

What Is the Supported Input Audio Duration For ASR?

One-sentence recognition supports audio of up to 60 seconds per call.
Recording file recognition supports audio of up to five hours per call.
In real-time speech recognition, each data packet in the audio stream carries an audio segment 200 ms in length.
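
As a rough illustration of the limits above, the following sketch computes how many bytes of raw PCM make up one 200 ms packet and checks a duration against the per-call limits. It assumes uncompressed 16-bit mono PCM; the function names and the dictionary keys are hypothetical and are not part of any ASR SDK.

```python
# Per-call duration limits stated above, in seconds.
# The keys are illustrative labels, not official API names.
MAX_DURATION_S = {"one_sentence": 60, "recording_file": 5 * 3600}


def within_limit(api: str, duration_s: float) -> bool:
    """Return True if the audio duration fits the per-call limit."""
    return duration_s <= MAX_DURATION_S[api]


def packet_size_bytes(sample_rate_hz: int, bit_depth: int = 16,
                      channels: int = 1, packet_ms: int = 200) -> int:
    """Bytes of raw PCM that cover packet_ms milliseconds of audio."""
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return bytes_per_second * packet_ms // 1000


def split_into_packets(pcm: bytes, sample_rate_hz: int) -> list:
    """Slice raw 16-bit mono PCM into 200 ms packets (the last may be shorter)."""
    size = packet_size_bytes(sample_rate_hz)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


one_second = bytes(16000 * 2)  # 1 s of silent 16 kHz, 16-bit mono PCM
packets = split_into_packets(one_second, 16000)
print(packet_size_bytes(16000), packet_size_bytes(8000), len(packets))
# prints: 6400 3200 5
print(within_limit("one_sentence", 75))  # prints: False
```

Under these assumptions, a 200 ms packet is 6400 bytes at 16 kHz and 3200 bytes at 8 kHz.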

What Audio Attributes Does ASR Support?

API: Audio properties
Recording file recognition: sampling rate 16 kHz or 8 kHz; bit depth 16 bit; channels mono or stereo
One-sentence recognition: sampling rate 16 kHz or 8 kHz; bit depth 16 bit; channels mono
Ultrafast recording file recognition: sampling rate 16 kHz or 8 kHz; bit depth 16 bit; channels mono or stereo
Real-time speech recognition: sampling rate 16 kHz or 8 kHz; bit depth 16 bit; channels mono
Speaker verification: sampling rate 16 kHz; bit depth 16 bit; channels mono
Virtual number human detection: sampling rate 8 kHz; bit depth 16 bit; channels mono
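
Before submitting a WAV file, it can help to check it locally against the audio properties listed above. The sketch below does this with Python's standard wave module. The dictionary keys and the helper names are hypothetical labels chosen for illustration, and the check covers only uncompressed WAV input.

```python
import io
import wave

# Supported sampling rates (Hz) and channel counts per API, from the list above.
# The keys are illustrative labels, not official API names.
SUPPORTED = {
    "recording_file": {"rates": {8000, 16000}, "channels": {1, 2}},
    "one_sentence": {"rates": {8000, 16000}, "channels": {1}},
    "ultrafast_recording_file": {"rates": {8000, 16000}, "channels": {1, 2}},
    "real_time": {"rates": {8000, 16000}, "channels": {1}},
    "speaker_verification": {"rates": {16000}, "channels": {1}},
    "virtual_number_human_detection": {"rates": {8000}, "channels": {1}},
}
REQUIRED_BIT_DEPTH = 16


def check_wav(source, api):
    """Return a list of problems; an empty list means the WAV file fits."""
    spec = SUPPORTED[api]
    # wave.open accepts a file path or a binary file-like object.
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8
        channels = wav.getnchannels()
    problems = []
    if rate not in spec["rates"]:
        problems.append(f"sampling rate {rate} Hz is not supported")
    if bit_depth != REQUIRED_BIT_DEPTH:
        problems.append(f"bit depth {bit_depth} is not supported")
    if channels not in spec["channels"]:
        problems.append(f"{channels} channel(s) is not supported")
    return problems


def make_silent_wav(rate, channels):
    """Build a 100 ms silent 16-bit WAV in memory, for the demo only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(bytes(rate // 10 * 2 * channels))
    buffer.seek(0)
    return buffer


print(check_wav(make_silent_wav(16000, 2), "recording_file"))  # no problems
print(check_wav(make_silent_wav(16000, 2), "one_sentence"))    # stereo is flagged
```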

What Transmission Methods and Formats Are Supported For Audio Data In One-Sentence Recognition and Recording File Recognition?

Audio data is transmitted over HTTP using the POST method. It can be sent in either of the following two ways:
1. The audio data is Base64-encoded and transmitted in the HTTP body.
2. The audio is downloaded from a URL. In this case, the body can be left empty, and the audio URL is included in the request parameters.
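
The two options above can be sketched as a small helper that prepares the request parameters and body. This is only an illustration: the parameter name "url" and the function name are assumptions, and the actual field names, signing, and endpoint must be taken from the API documentation.

```python
import base64
from typing import Optional, Tuple


def build_request(audio: Optional[bytes] = None,
                  url: Optional[str] = None) -> Tuple[dict, str]:
    """Return (request_params, http_body) for exactly one of the two options.

    The "url" parameter name is hypothetical; see the API reference
    for the real request fields.
    """
    if (audio is None) == (url is None):
        raise ValueError("provide exactly one of audio or url")
    if audio is not None:
        # Option 1: Base64-encode the audio and send it in the HTTP body.
        return {}, base64.b64encode(audio).decode("ascii")
    # Option 2: leave the body empty and pass the audio URL as a parameter.
    return {"url": url}, ""


params, body = build_request(audio=b"abc")
print(params, body)  # prints: {} YWJj
params, body = build_request(url="https://example.com/audio.wav")
print(params, repr(body))
```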

In Real-Time Speech Recognition, If the Audio Contains Multiple Sentences, How Do I Increase the Recognition Accuracy?

We recommend you enable the voice activity detection (VAD) feature for audio segmentation. If the audio contains multiple sentences, VAD can detect the pauses between them and automatically divide the audio into different sentences, achieving a higher recognition accuracy.

Does ASR Support Synchronous Result Invocation?

Real-time speech recognition returns recognition results synchronously.
One-sentence recognition returns recognition results quickly.
Recording file recognition supports two forms of asynchronous invocation: callback and polling.
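
The polling form can be sketched as a generic loop that queries a task until it finishes or a timeout expires. The fetch_status callable and the "state" values used here are hypothetical placeholders for whatever the task query API actually returns.

```python
import time
from typing import Callable


def poll_until_done(fetch_status: Callable[[], dict],
                    interval_s: float = 1.0,
                    timeout_s: float = 60.0) -> dict:
    """Call fetch_status repeatedly until the task succeeds or fails.

    fetch_status stands in for a task status query; the "state" key
    and its values are assumptions made for this sketch.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status.get("state") in ("success", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("recognition task did not finish in time")
        time.sleep(interval_s)


# Demo with a fake task that finishes on the third query.
fake_states = iter([
    {"state": "waiting"},
    {"state": "doing"},
    {"state": "success", "text": "hello"},
])
result = poll_until_done(lambda: next(fake_states), interval_s=0.0)
print(result["text"])  # prints: hello
```

With the callback form, no loop is needed: the service sends the result to an address you provide once the task completes.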

Can ASR Convert a Mandarin Recording File To English Text?

No. ASR currently cannot convert Mandarin recording files to English text.

Does ASR Support Evaluation?

No. ASR does not support evaluation.

Can the Text Recognized By ASR Be Copied?

ASR itself does not provide a copy feature for recognized text. A copy feature requires frontend development on your side after integration.

After Purchasing a Recording File Recognition Resource Package, How Do I Import Files For Recognition?

You can import files on the ASR console feature experience page, or use the API and SDK.

What File Upload Formats Are Supported By the Recording Transcription Feature?

The transcription feature supports WAV, MP3, M4A, FLV, MP4, WMA, 3GP, AMR, AAC, OGG-OPUS, and FLAC formats.

Can I Set the Longest Recognition Time For Real-Time Speech Recognition?

The maximum recognition time cannot be set. When recognition is no longer needed, simply disconnect.

Does ASR Support MRCP Protocol?

MRCP is not yet available to the public. If needed, please contact Presales Inquiry.

Is There a SaaS Solution That Can Be Provided Directly To Customers?

ASR supports private deployment, which requires business coordination and follow-up. You can contact Presales Inquiry.

How To Cut Audio Longer Than 5 Hours or Files Larger Than 1GB?

You can use ffmpeg to cut audio or video. For example, to cut a 3-hour audio file into three 1-hour files, run the following commands:
ffmpeg -ss 00:00:00 -i input.wav -c copy -t 3600 output_1.wav
ffmpeg -ss 01:00:00 -i input.wav -c copy -t 3600 output_2.wav
ffmpeg -ss 02:00:00 -i input.wav -c copy -t 3600 output_3.wav
The -ss parameter is the start time of the cut, -i is the input file name, -c copy copies the stream without re-encoding, and -t is the duration of the cut audio in seconds.
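
For files of other lengths, the same commands can be generated programmatically. The sketch below builds one ffmpeg command per segment using the options shown above. The function names and the output_N naming scheme are illustrative, and the script only prints the commands rather than running them.

```python
import shlex
from pathlib import Path


def format_hms(seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS for the -ss option."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def split_commands(input_path: str, total_s: int, segment_s: int) -> list:
    """Build one ffmpeg argument list per segment of at most segment_s seconds."""
    suffix = Path(input_path).suffix
    commands = []
    for index, start in enumerate(range(0, total_s, segment_s), start=1):
        duration = min(segment_s, total_s - start)  # last segment may be shorter
        commands.append([
            "ffmpeg", "-ss", format_hms(start), "-i", input_path,
            "-c", "copy", "-t", str(duration), f"output_{index}{suffix}",
        ])
    return commands


for command in split_commands("input.wav", total_s=3 * 3600, segment_s=3600):
    print(shlex.join(command))
```

Each argument list can be passed to subprocess.run to execute the cut, provided ffmpeg is installed and the total duration of the input is known.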

How To Convert an English Recording File To Chinese Using ASR?

The ASR feature converts audio content into text and does not support Chinese-English translation.

How To Save the Text After Real-Time Speech Recognition?

Real-time speech recognition returns text in real time, which you can save locally.
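
One simple way to save the text locally is to append each returned result to a file as it arrives. The sketch below assumes the results are already available as plain strings; the function name and the file name are illustrative, and a temporary directory is used only to keep the demo self-contained.

```python
import tempfile
from pathlib import Path


def append_result(path, text: str) -> None:
    """Append one recognized sentence to a UTF-8 text file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(text + "\n")


with tempfile.TemporaryDirectory() as folder:
    transcript = Path(folder) / "transcript.txt"
    # Stand-in for sentences received from the recognition stream.
    for sentence in ["hello", "this is a test"]:
        append_result(transcript, sentence)
    print(transcript.read_text(encoding="utf-8"))
```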

What Languages Does ASR Support?

Real-time speech recognition supports Mandarin, English, Korean, Cantonese, Japanese, Thai, and Shanghai dialect. For details, refer to Real-Time Speech Recognition (WebSocket).
One-sentence recognition and recording file recognition support Mandarin, English, Cantonese, Japanese, and Shanghai dialect. For details, refer to Recording File Recognition and One-Sentence Recognition.

Can ASR Save Voice Files?

No. The audio and video files uploaded for ASR are not saved. After successful recognition, the recognized text file is stored on the server for 7 days. Because saving audio files would affect recognition, the result is currently returned directly without storing the audio. You can implement audio file saving on the business side by storing the audio files on a local server or in a database.

Does the Recording File Recognition API In ASR Support Filtering Modal Particles?

Recording file recognition supports filtering modal particles. For specific usage, refer to Recording File Recognition Request.

Does the Recording File Recognition API In ASR Support Filtering Punctuation?

The recording file recognition API supports filtering punctuation. For specific usage, refer to Recording File Recognition Request.

What Is the Accuracy Of ASR?

In the test report issued by the National Quality Supervision and Inspection Center for Electronic Computers, Tencent Cloud's voice robot system achieved a character accuracy rate of 97.40% for Chinese speech recognition and no less than 88.00% for American English speech recognition (results rounded to two decimal places). The test audio had a sampling rate of 16 kHz and a bit depth of 16 bit, in raw uncompressed WAV or PCM format. These character accuracy rates are third-party experimental test data provided for your reference only and do not constitute a guarantee of the accuracy of Tencent Cloud's speech recognition service.

Does the Recording File Recognition API In ASR Support Intelligent Conversion Of Arabic Numerals?

Recording file recognition supports intelligent conversion of Arabic numerals. For specific usage, refer to Recording File Recognition Request.