Tencent Cloud Speech Recognition Operator
Overview
The Tencent Cloud Speech Recognition Operator is an intelligent speech-to-text tool based on the Tencent Cloud Speech Recognition API. It supports multiple audio formats and languages, converting audio files or audio data streams into accurate text content. It is widely used in scenarios such as speech transcription, intelligent customer service, and meeting minutes.
Core Features
- ✅ Multi-Format Support: Supports WAV, MP3, PCM, M4A, AAC, and other mainstream audio formats
- ✅ Multi-Language Recognition: Supports 9 major languages including Chinese, English, Japanese, and Korean
- ✅ High-Accuracy Recognition: Based on Tencent Cloud's advanced speech recognition technology
- ✅ Flexible Input: Supports base64 encoding, Buffer data, file paths, and other input methods
- ✅ Secure and Reliable: Built-in data validation and error handling mechanisms
- ✅ Detailed Reports: Provides complete recognition results and execution statistics
Important Limitations
⚠️ Please note the following limitations before use:
- Audio Duration: Maximum 60 seconds of audio supported (Tencent Cloud API limitation)
- File Size: Maximum 5MB audio files supported
- Recognition Mode: Only single-sentence recognition is supported; real-time transcription is not supported
- Network Requirements: Requires a stable network connection to access the Tencent Cloud API
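The size and duration limits above can be checked locally before any request is sent. The following is a minimal sketch, not the operator's own validation: `checkAudioLimits` is a hypothetical helper, and the duration is passed in by the caller because the operator input carries no duration field.

```javascript
// Limits taken from the "Important Limitations" list above.
const MAX_FILE_SIZE_BYTES = 5 * 1024 * 1024; // 5MB
const MAX_DURATION_SECONDS = 60;

// Hypothetical helper: returns a list of limit violations (empty if none).
// durationSeconds is optional; the caller must measure it separately.
function checkAudioLimits(audioBuffer, durationSeconds) {
  const problems = [];
  if (audioBuffer.length > MAX_FILE_SIZE_BYTES) {
    problems.push(
      `audio is ${audioBuffer.length} bytes, limit is ${MAX_FILE_SIZE_BYTES} bytes`
    );
  }
  if (typeof durationSeconds === 'number' && durationSeconds > MAX_DURATION_SECONDS) {
    problems.push(
      `audio is ${durationSeconds} s long, limit is ${MAX_DURATION_SECONDS} s`
    );
  }
  return problems;
}

// Example: a 1 KB clip of 12 seconds passes both checks.
console.log(checkAudioLimits(Buffer.alloc(1024), 12));
```

Returning a list rather than throwing lets a workflow report both violations at once.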
Configuration Parameters
Tencent Cloud Credential Configuration
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| secretId | string | Environment variable TENCENT_SECRET_ID | ✅ | Tencent Cloud API Secret ID (starts with AKID) |
| secretKey | string | Environment variable TENCENT_SECRET_KEY | ✅ | Tencent Cloud API Secret Key (minimum 20 characters) |
| appId | string | Environment variable TENCENT_CLOUD_APP_ID | ❌ | Tencent Cloud App ID |
| region | string | 'ap-guangzhou' | ❌ | Service region |
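As a rough illustration of the credential rules in the table, the sketch below reads the environment variables and applies the two documented checks (AKID prefix, 20-character minimum). `loadCredentials` is a hypothetical helper, and reading the region from `TENCENT_REGION` is an assumption; the operator's real loading logic may differ.

```javascript
// Hypothetical helper: builds the credential part of the operator config
// from environment variables and applies the checks stated in the table.
function loadCredentials(env = process.env) {
  const secretId = env.TENCENT_SECRET_ID || '';
  const secretKey = env.TENCENT_SECRET_KEY || '';
  const errors = [];

  if (!secretId.startsWith('AKID')) {
    errors.push('TENCENT_SECRET_ID must start with "AKID"');
  }
  if (secretKey.length < 20) {
    errors.push('TENCENT_SECRET_KEY must be at least 20 characters');
  }
  if (errors.length > 0) {
    throw new Error(errors.join('; '));
  }

  return {
    secretId,
    secretKey,
    appId: env.TENCENT_CLOUD_APP_ID, // optional, may be undefined
    // Assumption: TENCENT_REGION overrides the documented default.
    region: env.TENCENT_REGION || 'ap-guangzhou',
  };
}
```

Failing early with both messages joined makes a misconfigured deployment easier to diagnose than a rejected API call would.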
Recognition Parameter Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| engineType | string | '16k_zh' | Engine model type; see supported language list for details |
| timeout | number | 60000 | Timeout in milliseconds, range 5000–300000 |
| maxFileSize | number | 5242880 | Maximum file size in bytes, default 5MB |
| maxDuration | number | 60 | Maximum audio duration in seconds |
| defaultLanguage | string | 'zh-CN' | Default recognition language |
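The defaults and the timeout range from the table can be expressed as a small merge-and-validate step. This is a sketch under assumptions: `buildRecognitionConfig` is a hypothetical helper, and rejecting an out-of-range timeout (rather than clamping it) is a choice made here, not documented operator behavior.

```javascript
// Defaults copied from the Recognition Parameter Configuration table.
const RECOGNITION_DEFAULTS = {
  engineType: '16k_zh',
  timeout: 60000,        // milliseconds
  maxFileSize: 5242880,  // bytes (5MB)
  maxDuration: 60,       // seconds
  defaultLanguage: 'zh-CN',
};

// Hypothetical helper: merges user overrides over the defaults and
// enforces the documented timeout range of 5000 to 300000 ms.
function buildRecognitionConfig(overrides = {}) {
  const config = { ...RECOGNITION_DEFAULTS, ...overrides };
  if (
    !Number.isInteger(config.timeout) ||
    config.timeout < 5000 ||
    config.timeout > 300000
  ) {
    throw new RangeError(
      `timeout must be an integer between 5000 and 300000, got ${config.timeout}`
    );
  }
  return config;
}

// Example: only the timeout is overridden; everything else keeps its default.
console.log(buildRecognitionConfig({ timeout: 30000 }));
```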
Supported Languages and Formats
Supported Languages
The source code uses standard language codes (not engine model type strings); the system automatically maps them internally:
| Language Code | Language | Description |
|---|---|---|
| zh-CN | Chinese (Mandarin) | Default language |
| zh-TW | Chinese (Traditional) | Traditional Chinese recognition |
| en-US | English (American) | English recognition |
| ja-JP | Japanese | Japanese recognition |
| ko-KR | Korean | Korean recognition |
| es-ES | Spanish | Spanish recognition |
| fr-FR | French | French recognition |
| de-DE | German | German recognition |
| ru-RU | Russian | Russian recognition |
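Because only the nine codes in the table are accepted, a caller may want to validate the code before invoking the operator. The sketch below shows one way to do that; `resolveLanguage` is a hypothetical helper, and throwing on an unknown code is an assumption (the operator's internal mapping to engine types is not reproduced here).

```javascript
// Language codes copied from the Supported Languages table.
const SUPPORTED_LANGUAGES = [
  'zh-CN', 'zh-TW', 'en-US', 'ja-JP', 'ko-KR',
  'es-ES', 'fr-FR', 'de-DE', 'ru-RU',
];

// Hypothetical helper: falls back to the configured default when no
// language is given, and rejects codes that are not in the table.
function resolveLanguage(requested, defaultLanguage = 'zh-CN') {
  const language = requested || defaultLanguage;
  if (!SUPPORTED_LANGUAGES.includes(language)) {
    throw new Error(
      `Unsupported language code "${language}"; expected one of ${SUPPORTED_LANGUAGES.join(', ')}`
    );
  }
  return language;
}
```

Short forms such as `en` or `zh` are rejected by this check, which catches a common mistake before it reaches the API.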
Supported Audio Formats
voiceFormat uses string values (not numbers):
| Format | voiceFormat Value | File Extension | Description |
|---|---|---|---|
| WAV | 'wav' | .wav | Lossless audio format, recommended; also the default fallback format |
| MP3 | 'mp3' | .mp3 | Common compressed audio format |
| M4A | 'm4a' | .m4a | Apple audio format |
| PCM | 'pcm' | .pcm | Raw audio data |
| AAC | 'aac' | .aac | Advanced Audio Coding |
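When the format is not known up front, it can often be derived from the file name. The following sketch maps an extension to a `voiceFormat` string and falls back to 'wav', mirroring the default fallback noted in the table. `formatFromFilename` is a hypothetical helper; falling back for unrecognized extensions (instead of rejecting them) is an assumption.

```javascript
// voiceFormat values copied from the Supported Audio Formats table.
const SUPPORTED_FORMATS = ['wav', 'mp3', 'm4a', 'pcm', 'aac'];

// Hypothetical helper: derives the voiceFormat string from a file name.
// Unknown or missing extensions fall back to 'wav'.
function formatFromFilename(filename) {
  const dot = filename.lastIndexOf('.');
  const extension = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : '';
  return SUPPORTED_FORMATS.includes(extension) ? extension : 'wav';
}

// Example inputs: an upper-case extension and a name without one.
console.log(formatFromFilename('meeting.MP3'));
console.log(formatFromFilename('recording'));
```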
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| audio | string / Buffer | ✅ | Audio data (base64-encoded string or Buffer) |
| language | string | ❌ | Recognition language code (e.g., zh-CN); defaults to defaultLanguage from configuration |
| format | string | ❌ | Audio format (e.g., wav, mp3); defaults to 'wav' |
| filename | string | ❌ | Filename; defaults to audio.{format} |
| sessionId | string | ❌ | Session ID |
- The input parameter name is `audio` (not `audioData`).
- If a `realtime` parameter is provided, it is ignored with a warning (real-time transcription is not supported).
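Since `audio` accepts either a base64 string or a Buffer, code that inspects the audio before calling the operator usually needs a single representation. The sketch below normalizes both forms to a Buffer. `toAudioBuffer` is a hypothetical helper and is not taken from the operator's source.

```javascript
// Hypothetical helper: accepts the two documented input forms
// (base64 string or Buffer) and always returns a Buffer.
function toAudioBuffer(audio) {
  if (Buffer.isBuffer(audio)) {
    return audio;
  }
  if (typeof audio === 'string') {
    return Buffer.from(audio, 'base64');
  }
  throw new TypeError('audio must be a base64-encoded string or a Buffer');
}

// Example: both forms yield the same bytes, so the size can be
// checked the same way regardless of how the audio arrived.
const fromBuffer = toAudioBuffer(Buffer.from('RIFFdemo', 'ascii'));
const fromBase64 = toAudioBuffer(fromBuffer.toString('base64'));
console.log(fromBuffer.equals(fromBase64));
```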
Output Results
Success Response Structure
{
"success": true,
"text": "Recognized text content",
"language": "zh-CN",
"confidence": 0.85,
"provider": "tencent",
"method": "sentence",
"processedAt": "2025-06-15T10:30:00.000Z",
"audioSize": 83200,
"sessionId": "xxx",
"executionTime": 500
}
confidence is currently a fixed value of 0.85 and does not use the actual confidence returned by Tencent Cloud.
Failure Response Structure
{
"success": false,
"error": "Error message",
"provider": "tencent",
"processedAt": "2025-06-15T10:30:00.000Z",
"sessionId": "xxx",
"failed": true
}
Output Field Descriptions
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether recognition was successful |
| text | string | Recognized text content |
| language | string | Language code used |
| confidence | number | Confidence score (currently fixed at 0.85) |
| provider | string | Service provider (fixed as "tencent") |
| method | string | Recognition method (fixed as "sentence") |
| processedAt | string | Processing time (ISO 8601 format) |
| audioSize | number | Audio data size in bytes |
| sessionId | string | Session ID |
| executionTime | number | Execution duration in milliseconds |
| error | string | Error message on failure |
| failed | boolean | true on failure |
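Downstream code typically branches on `success` and then reads either `text` or `error`. The sketch below wraps that branch in one function; `unwrapRecognition` is a hypothetical helper. It deliberately ignores `confidence`, since that field is currently a fixed value and carries no information about the individual result.

```javascript
// Hypothetical helper: returns the recognized text for a success response
// and throws an Error carrying the operator's message for a failure response.
function unwrapRecognition(result) {
  if (result && result.success === true) {
    return result.text;
  }
  const provider = (result && result.provider) || 'unknown';
  const message = (result && result.error) || 'Recognition failed';
  throw new Error(`[${provider}] ${message}`);
}

// Example with the two response shapes documented above.
const ok = { success: true, text: 'Recognized text content', provider: 'tencent' };
const bad = { success: false, error: 'Error message', provider: 'tencent', failed: true };

console.log(unwrapRecognition(ok));
try {
  unwrapRecognition(bad);
} catch (err) {
  console.error(err.message);
}
```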
Usage Examples
1. Basic Speech Recognition
The simplest usage is recognizing Chinese speech:
{
"id": "speech-recognition",
"type": "tencent-speech",
"config": {
"secretId": "AKIDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"secretKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"region": "ap-guangzhou"
},
"inputs": {
"audio": "UklGRigAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQQAAAAAAA==",
"language": "zh-CN",
"format": "wav"
}
}
2. Multi-Language Recognition
Recognize English speech:
{
"id": "english-recognition",
"type": "tencent-speech",
"config": {
"secretId": "AKIDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"secretKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"engineType": "16k_en"
},
"inputs": {
"audio": "{{base64_audio_data}}",
"language": "en-US",
"format": "mp3"
}
}
3. File Upload Recognition
Process uploaded audio files:
{
"workflow": [
{
"id": "file-upload",
"type": "file-input",
"config": {
"acceptedTypes": ["audio/wav", "audio/mp3", "audio/m4a"]
}
},
{
"id": "speech-recognition",
"type": "tencent-speech",
"config": {
"secretId": "{{TENCENT_SECRET_ID}}",
"secretKey": "{{TENCENT_SECRET_KEY}}"
},
"inputs": {
"audio": "{{file-upload.content}}",
"format": "{{file-upload.format}}"
}
}
]
}
4. Batch Processing
Process multiple audio files in batch:
{
"workflow": [
{
"id": "batch-recognition",
"type": "js-executor",
"inputs": {
"audioFiles": [
{"name": "audio1.wav", "data": "base64_data_1"},
{"name": "audio2.mp3", "data": "base64_data_2"},
{"name": "audio3.m4a", "data": "base64_data_3"}
],
"code": `
let results = [];
for (let file of inputs.audioFiles) {
try {
const result = await callOperator('tencent-speech', {
audio: file.data,
format: file.name.split('.').pop()
});
results.push({
filename: file.name,
text: result.text,
success: true
});
} catch (error) {
results.push({
filename: file.name,
error: error.message,
success: false
});
}
}
return { results };
`
}
}
]
}
5. Chinese-English Mixed Recognition
Process audio containing mixed Chinese and English:
{
"id": "mixed-language-recognition",
"type": "tencent-speech",
"config": {
"secretId": "{{TENCENT_SECRET_ID}}",
"secretKey": "{{TENCENT_SECRET_KEY}}",
"engineType": "16k_zh_en"
},
"inputs": {
"audio": "{{mixed_audio_data}}",
"language": "zh-CN",
"format": "wav"
}
}
6. Hotword-Optimized Recognition
Use a hotword list to improve recognition accuracy for specialized terminology:
{
"id": "hotword-recognition",
"type": "tencent-speech",
"config": {
"secretId": "{{TENCENT_SECRET_ID}}",
"secretKey": "{{TENCENT_SECRET_KEY}}",
"hotwordId": "your_hotword_table_id"
},
"inputs": {
"audio": "{{professional_audio_data}}",
"language": "zh-CN",
"format": "wav"
}
}
Environment Setup
1. Obtain Tencent Cloud Credentials
- Log in to the Tencent Cloud Console
- Navigate to Access Management > API Key Management
- Click Create Key to create a new API key
- Record the `SecretId` and `SecretKey`
2. Environment Variable Configuration
Set credential information in environment variables:
# .env file
TENCENT_SECRET_ID=AKIDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TENCENT_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TENCENT_REGION=ap-guangzhou
3. Permission Configuration
Ensure the API key has speech recognition service permissions:
{
"version": "2.0",
"statement": [
{
"effect": "allow",
"action": [
"asr:SentenceRecognition"
],
"resource": "*"
}
]
}
Workflow Integration
Using in Workflows
- Add Speech Recognition Node
  - Select "Tencent Cloud Speech Recognition" from the operator library
  - Configure Tencent Cloud credentials and recognition parameters
- Connect Data Sources
  - Connect the audio input node's output to the speech recognition node
  - Supports multiple data sources including file upload and URL download
- Downstream Processing
  - Pass recognition results to text processing nodes
  - Or use for content analysis, sentiment analysis, and other scenarios
Combining with Other Operators
{
"workflow": [
{
"id": "audio-input",
"type": "file-input",
"config": {
"acceptedTypes": ["audio/*"]
}
},
{
"id": "speech-to-text",
"type": "tencent-speech",
"config": {
"secretId": "{{TENCENT_SECRET_ID}}",
"secretKey": "{{TENCENT_SECRET_KEY}}"
},
"inputs": {
"audio": "{{audio-input.content}}",
"format": "{{audio-input.format}}"
}
},
{
"id": "text-analysis",
"type": "text-analyzer",
"inputs": {
"text": "{{speech-to-text.text}}"
}
},
{
"id": "save-result",
"type": "database-save",
"inputs": {
"table": "transcriptions",
"data": {
"original_audio": "{{audio-input.filename}}",
"transcribed_text": "{{speech-to-text.text}}",
"confidence": "{{text-analysis.confidence}}",
"created_at": "{{now}}"
}
}
}
]
}
Performance Optimization
Best Practices
- Audio Quality Optimization
  - Use audio files with a 16kHz sample rate
  - Keep audio clear and free of background noise
  - Keep audio duration under 60 seconds
- Format Selection
  - Prefer WAV format for optimal recognition accuracy
  - MP3 format offers a good balance of quality and file size
  - Avoid overly compressed audio formats
- Batch Processing Optimization
  - For large numbers of audio files, consider processing in batches
  - Use concurrency control to avoid API rate limits
  - Implement retry mechanisms for temporary network issues
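The concurrency control and retry advice above can be sketched with two small generic helpers. Both `withRetry` and `mapWithConcurrency` are hypothetical names, the retry count and delays are illustrative defaults rather than recommended values, and `recognize` in the usage note stands in for whatever function actually calls the operator.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries an async task with exponential backoff (baseDelayMs, 2x, 4x, ...).
// After the final attempt fails, the last error is rethrown.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}

// Runs worker(item, index) over all items with at most `limit` calls
// in flight at once. Results keep the order of the input array.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

A batch could then be processed as `mapWithConcurrency(files, 2, (file) => withRetry(() => recognize(file)))`, which keeps two requests in flight and retries each file independently.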
Performance Metrics
- Recognition Accuracy: Can exceed 95% with clear audio
- Processing Speed: Typically completes recognition within 1-3 seconds
- Supported Duration: Maximum 60 seconds of audio
- File Size: Maximum 5MB audio files
Error Handling
Common Error Types
| Error Code | Error Message | Cause | Solution |
|---|---|---|---|
| INVALID_CREDENTIALS | Invalid credentials | Incorrect SecretId or SecretKey | Verify Tencent Cloud API keys |
| AUDIO_TOO_LONG | Audio duration exceeded | Audio exceeds 60 seconds | Split audio or use an alternative recognition method |
| FILE_TOO_LARGE | File too large | Audio file exceeds 5MB | Compress audio or reduce quality |
| UNSUPPORTED_FORMAT | Format not supported | Audio format not in supported list | Convert to a supported format |
| NETWORK_ERROR | Network error | Network connectivity issue | Check network connection and firewall settings |
Error Response Handling
{
"workflow": [
{
"id": "speech-recognition",
"type": "tencent-speech",
"config": {
"secretId": "{{TENCENT_SECRET_ID}}",
"secretKey": "{{TENCENT_SECRET_KEY}}"
},
"inputs": {
"audio": "{{audio_data}}"
},
"onError": {
"continue": true,
"defaultValue": {
"success": false,
"text": "",
"error": "Speech recognition failed. Please check audio format and network connection"
}
}
},
{
"id": "handle-result",
"type": "js-executor",
"inputs": {
"result": "{{speech-recognition}}",
"code": `
if (inputs.result.success) {
return {
status: 'success',
message: 'Recognition successful',
text: inputs.result.text
};
} else {
return {
status: 'error',
message: inputs.result.error || 'Recognition failed',
text: ''
};
}
`
}
}
]
}
Troubleshooting
Debugging Tips
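- Check Audio Data
    // Validate base64 audio data by round-tripping it through a Buffer.
    // Buffer is used instead of atob/btoa, which are not globals in Node.js 14.
    const isValidBase64 = (str) => {
      if (typeof str !== 'string' || str.length === 0) return false;
      return Buffer.from(str, 'base64').toString('base64') === str;
    };
- Test API Connection
    # Test connectivity to the Tencent Cloud ASR endpoint with curl.
    # The request is unsigned, so an error response is expected;
    # receiving any response confirms the endpoint is reachable.
    curl -X POST https://asr.tencentcloudapi.com/ \
      -H "Content-Type: application/json" \
      -H "X-TC-Action: SentenceRecognition" \
      -H "X-TC-Version: 2019-06-14" \
      -d '{}'
- Audio Format Check
    // Check audio file header information
    const checkAudioFormat = (buffer) => {
      const header = buffer.slice(0, 12).toString('ascii');
      // WAV: RIFF container with a WAVE form type
      if (header.startsWith('RIFF') && header.includes('WAVE')) {
        return 'wav';
      }
      // MP3: either a leading ID3v2 tag or an MPEG frame sync (11 set bits)
      if (header.startsWith('ID3') || (buffer[0] === 0xFF && (buffer[1] & 0xE0) === 0xE0)) {
        return 'mp3';
      }
      return 'unknown';
    };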
FAQ
Q: Why is my audio recognition accuracy low?
A: Check audio quality: make sure the sample rate is 16kHz, the audio is clear and free of noise, and the speech is articulate.
Q: How do I handle audio longer than 60 seconds?
A: Split long audio into segments of 60 seconds or less, recognize each segment separately, then merge the results.
Q: Does it support real-time speech recognition?
A: The current version only supports single-sentence recognition and does not support real-time streaming recognition.
Q: How can I improve recognition accuracy for specialized terminology?
A: You can create a hotword list and specify the hotwordId parameter in the configuration.
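For the question about audio longer than 60 seconds, splitting is straightforward when the audio is raw PCM, because the byte offset of any point in time can be computed directly. The sketch below assumes 16kHz, 16-bit, mono PCM; `splitPcm` is a hypothetical helper, and compressed formats such as MP3 or M4A would need to be decoded or split with an audio tool instead.

```javascript
// Hypothetical helper: cuts raw PCM audio into chunks of at most maxSeconds.
// Assumes uncompressed PCM; defaults describe 16kHz, 16-bit, mono audio.
function splitPcm(
  buffer,
  { sampleRate = 16000, bytesPerSample = 2, channels = 1, maxSeconds = 60 } = {}
) {
  const bytesPerSecond = sampleRate * bytesPerSample * channels;
  // A whole number of seconds is always a whole number of sample frames,
  // so no sample is cut in half at a chunk boundary.
  const chunkBytes = bytesPerSecond * maxSeconds;
  const chunks = [];
  for (let offset = 0; offset < buffer.length; offset += chunkBytes) {
    chunks.push(buffer.subarray(offset, Math.min(offset + chunkBytes, buffer.length)));
  }
  return chunks;
}

// Example: 125 seconds of silence becomes chunks of 60 s, 60 s, and 5 s.
const pcm = Buffer.alloc(16000 * 2 * 125);
console.log(splitPcm(pcm).map((chunk) => chunk.length / 32000));
```

With the default parameters each full chunk is 1,920,000 bytes, which also stays under the 5MB size limit. Cutting at fixed offsets can split a word across two chunks, so merged transcripts may need a manual check at the boundaries.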
Version Information
- Current Version: 1.1.0
- Compatibility: Node.js 14+
- Dependencies: tencentcloud-sdk-nodejs
- Changelog:
  - v1.1.0: Corrected feature limitations, removed unsupported real-time transcription functionality
  - v1.0.0: Initial version release