Tencent Cloud Speech Recognition Operator

Overview

The Tencent Cloud Speech Recognition Operator is a speech-to-text tool built on the Tencent Cloud Speech Recognition API. It supports multiple audio formats and languages and converts short audio files or in-memory audio data into text. Typical uses include speech transcription, intelligent customer service, and meeting minutes.

Core Features

  • Multi-Format Support: Supports WAV, MP3, PCM, M4A, AAC, and other mainstream audio formats
  • Multi-Language Recognition: Supports 9 major languages including Chinese, English, Japanese, and Korean
  • High-Accuracy Recognition: Based on Tencent Cloud's advanced speech recognition technology
  • Flexible Input: Supports base64 encoding, Buffer data, file paths, and other input methods
  • Secure and Reliable: Built-in data validation and error handling mechanisms
  • Detailed Reports: Provides complete recognition results and execution statistics

Important Limitations

⚠️ Please note the following limitations before use:

  • Audio Duration: Maximum 60 seconds of audio supported (Tencent Cloud API limitation)
  • File Size: Maximum 5MB audio files supported
  • Recognition Mode: Only single-sentence recognition is supported; real-time transcription is not supported
  • Network Requirements: Requires a stable network connection to access the Tencent Cloud API
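
The size and duration limits above can be checked locally before a request is sent. The following is a minimal sketch, assuming the audio is already available as a Buffer. The function name `checkAudioLimits` and the `bytesPerSecond` option are hypothetical and not part of the operator; duration is only estimated when the caller knows the byte rate, as with raw PCM.

```javascript
// Hypothetical pre-flight check; names are illustrative, not the operator's API.
const MAX_FILE_SIZE = 5 * 1024 * 1024; // 5MB, equal to the maxFileSize default (5242880)
const MAX_DURATION_SECONDS = 60;

function checkAudioLimits(buffer, { bytesPerSecond } = {}) {
  const problems = [];
  if (buffer.length > MAX_FILE_SIZE) {
    problems.push(`size ${buffer.length} bytes exceeds ${MAX_FILE_SIZE} bytes`);
  }
  // Duration can only be estimated when the byte rate is known,
  // e.g. 16000 Hz * 2 bytes * 1 channel = 32000 bytes/s for 16-bit mono PCM.
  if (bytesPerSecond) {
    const seconds = buffer.length / bytesPerSecond;
    if (seconds > MAX_DURATION_SECONDS) {
      problems.push(`duration ${seconds.toFixed(1)}s exceeds ${MAX_DURATION_SECONDS}s`);
    }
  }
  return { ok: problems.length === 0, problems };
}
```

For compressed formats such as MP3 or AAC the duration cannot be derived from the byte count alone, so only the size check applies in this sketch.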

Configuration Parameters

Tencent Cloud Credential Configuration

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| secretId | string | Environment variable TENCENT_SECRET_ID | | Tencent Cloud API Secret ID (starts with AKID) |
| secretKey | string | Environment variable TENCENT_SECRET_KEY | | Tencent Cloud API Secret Key (minimum 20 characters) |
| appId | string | Environment variable TENCENT_CLOUD_APP_ID | | Tencent Cloud App ID |
| region | string | 'ap-guangzhou' | | Service region |
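
The fallback from node configuration to environment variables, together with the two format rules in the table (the AKID prefix and the 20-character minimum), might look like the sketch below. `resolveCredentials` is a hypothetical helper written for illustration; the operator's own validation may differ, and `appId` is left unvalidated because the table does not state a rule for it.

```javascript
// Hypothetical helper: resolve credentials from node config, falling back to
// the environment variables named in the table above.
function resolveCredentials(config = {}, env = process.env) {
  const creds = {
    secretId: config.secretId || env.TENCENT_SECRET_ID,
    secretKey: config.secretKey || env.TENCENT_SECRET_KEY,
    appId: config.appId || env.TENCENT_CLOUD_APP_ID,
    region: config.region || 'ap-guangzhou',
  };
  const errors = [];
  if (!creds.secretId || !creds.secretId.startsWith('AKID')) {
    errors.push('secretId must be set and start with "AKID"');
  }
  if (!creds.secretKey || creds.secretKey.length < 20) {
    errors.push('secretKey must be set and at least 20 characters long');
  }
  return { creds, errors };
}
```

Returning the error list rather than throwing lets a caller report every problem at once.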

Recognition Parameter Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| engineType | string | '16k_zh' | Engine model type; see supported language list for details |
| timeout | number | 60000 | Timeout in milliseconds, range 5000–300000 |
| maxFileSize | number | 5242880 | Maximum file size in bytes (5MB) |
| maxDuration | number | 60 | Maximum audio duration in seconds |
| defaultLanguage | string | 'zh-CN' | Default recognition language |
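
As an illustration of how these defaults and the timeout range fit together, here is a minimal sketch that merges user configuration over the documented defaults. `resolveRecognitionConfig` is a hypothetical name, and rejecting an out-of-range timeout with an error is an assumption; the operator may clamp the value instead.

```javascript
// Defaults copied from the table above.
const DEFAULTS = {
  engineType: '16k_zh',
  timeout: 60000,
  maxFileSize: 5242880,
  maxDuration: 60,
  defaultLanguage: 'zh-CN',
};

// Hypothetical helper: merge user config over the defaults and enforce the
// documented timeout range (assumption: out-of-range values are rejected).
function resolveRecognitionConfig(config = {}) {
  const merged = { ...DEFAULTS, ...config };
  if (!Number.isFinite(merged.timeout) || merged.timeout < 5000 || merged.timeout > 300000) {
    throw new RangeError(`timeout must be between 5000 and 300000 ms, got ${merged.timeout}`);
  }
  return merged;
}
```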

Supported Languages and Formats

Supported Languages

The source code uses standard language codes (not engine model type strings); the system automatically maps them internally:

| Language Code | Language | Description |
| --- | --- | --- |
| zh-CN | Chinese (Mandarin) | Default language |
| zh-TW | Chinese (Traditional) | Traditional Chinese recognition |
| en-US | English (American) | English recognition |
| ja-JP | Japanese | Japanese recognition |
| ko-KR | Korean | Korean recognition |
| es-ES | Spanish | Spanish recognition |
| fr-FR | French | French recognition |
| de-DE | German | German recognition |
| ru-RU | Russian | Russian recognition |
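
The internal mapping from language code to engine model type is not listed in this document. The sketch below only illustrates the idea of such a lookup: it contains the two engine strings that appear elsewhere on this page ('16k_zh' and '16k_en'), and both the table contents and the `engineForLanguage` name are assumptions rather than the operator's actual implementation.

```javascript
// Assumed, partial mapping: only engine strings that appear in this document
// are listed. The operator's real internal table is not shown here.
const ENGINE_BY_LANGUAGE = {
  'zh-CN': '16k_zh',
  'en-US': '16k_en',
};

// Returns the engine model type for a language code, or null when the code
// is not in the (partial) table, so the caller can decide how to proceed.
function engineForLanguage(language) {
  return Object.prototype.hasOwnProperty.call(ENGINE_BY_LANGUAGE, language)
    ? ENGINE_BY_LANGUAGE[language]
    : null;
}
```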

Supported Audio Formats

voiceFormat uses string values (not numbers):

| Format | voiceFormat Value | File Extension | Description |
| --- | --- | --- | --- |
| WAV | 'wav' | .wav | Lossless audio format, recommended; also the default fallback format |
| MP3 | 'mp3' | .mp3 | Common compressed audio format |
| M4A | 'm4a' | .m4a | Apple audio format |
| PCM | 'pcm' | .pcm | Raw audio data |
| AAC | 'aac' | .aac | Advanced Audio Coding |
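
When only a filename is available, the format string can be derived from its extension. This is a minimal sketch using the five values from the table above; `formatFromFilename` is a hypothetical helper, and falling back to 'wav' for a missing or unrecognized extension is an assumption based on the "default fallback format" remark.

```javascript
const path = require('path');

// voiceFormat values from the table above.
const SUPPORTED_FORMATS = ['wav', 'mp3', 'm4a', 'pcm', 'aac'];

// Hypothetical helper: derive the format from a filename extension,
// falling back to 'wav' when the extension is missing or unsupported.
function formatFromFilename(filename, fallback = 'wav') {
  const ext = path.extname(filename || '').slice(1).toLowerCase();
  return SUPPORTED_FORMATS.includes(ext) ? ext : fallback;
}
```

A stricter caller might prefer to reject unknown extensions instead of falling back, since the file content may not actually be WAV.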

Input Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| audio | string / Buffer | | Audio data (base64-encoded string or Buffer) |
| language | string | | Recognition language code (e.g., zh-CN); defaults to defaultLanguage from configuration |
| format | string | | Audio format (e.g., wav, mp3); defaults to 'wav' |
| filename | string | | Filename; defaults to audio.{format} |
| sessionId | string | | Session ID |

Note:

  • The input parameter name is audio (not audioData)
  • If a realtime parameter is provided, it will be ignored with a warning (real-time transcription is not supported)
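
The defaults and the realtime warning described above can be sketched as a single normalization step. `normalizeInputs` is a hypothetical function written for illustration; treating an empty or missing audio value as an error is an assumption.

```javascript
// Hypothetical normalization of the input parameters described above.
function normalizeInputs(inputs, defaultLanguage = 'zh-CN') {
  if (inputs.realtime !== undefined) {
    console.warn('realtime is not supported and will be ignored');
  }
  const format = inputs.format || 'wav';
  // Accept either a Buffer or a base64-encoded string.
  const audio = Buffer.isBuffer(inputs.audio)
    ? inputs.audio
    : Buffer.from(String(inputs.audio || ''), 'base64');
  if (audio.length === 0) {
    throw new Error('audio is required (base64-encoded string or Buffer)');
  }
  return {
    audio,
    language: inputs.language || defaultLanguage,
    format,
    filename: inputs.filename || `audio.${format}`,
    sessionId: inputs.sessionId,
  };
}
```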

Output Results

Success Response Structure

{
  "success": true,
  "text": "Recognized text content",
  "language": "zh-CN",
  "confidence": 0.85,
  "provider": "tencent",
  "method": "sentence",
  "processedAt": "2025-06-15T10:30:00.000Z",
  "audioSize": 83200,
  "sessionId": "xxx",
  "executionTime": 500
}

Note: confidence is currently a fixed value of 0.85; the actual confidence returned by Tencent Cloud is not used.

Failure Response Structure

{
  "success": false,
  "error": "Error message",
  "provider": "tencent",
  "processedAt": "2025-06-15T10:30:00.000Z",
  "sessionId": "xxx",
  "failed": true
}

Output Field Descriptions

| Field | Type | Description |
| --- | --- | --- |
| success | boolean | Whether recognition was successful |
| text | string | Recognized text content |
| language | string | Language code used |
| confidence | number | Confidence score (currently fixed at 0.85) |
| provider | string | Service provider (fixed as "tencent") |
| method | string | Recognition method (fixed as "sentence") |
| processedAt | string | Processing time (ISO 8601 format) |
| audioSize | number | Audio data size in bytes |
| sessionId | string | Session ID |
| executionTime | number | Execution duration in milliseconds |
| error | string | Error message on failure |
| failed | boolean | true on failure |
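
Downstream code can branch on the success field to handle both response shapes. The sketch below is one possible way to do that; `summarizeResult` is a hypothetical helper, not something the operator provides.

```javascript
// Hypothetical helper: turn either response shape into a one-line summary.
function summarizeResult(result) {
  if (result && result.success === true) {
    return `[${result.language}] ${result.text} (${result.executionTime} ms)`;
  }
  const reason = (result && result.error) || 'unknown error';
  return `recognition failed: ${reason}`;
}
```

Because confidence is currently fixed at 0.85, the sketch deliberately ignores it; filtering results by that field would not distinguish good transcriptions from poor ones.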

Usage Examples

1. Basic Speech Recognition

The simplest usage — recognize Chinese speech:

{
  "id": "speech-recognition",
  "type": "tencent-speech",
  "config": {
    "secretId": "AKIDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "secretKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "region": "ap-guangzhou"
  },
  "inputs": {
    "audio": "UklGRigAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQQAAAAAAA==",
    "language": "zh-CN",
    "format": "wav"
  }
}

2. Multi-Language Recognition

Recognize English speech:

{
  "id": "english-recognition",
  "type": "tencent-speech",
  "config": {
    "secretId": "AKIDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "secretKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "engineType": "16k_en"
  },
  "inputs": {
    "audio": "{{base64_audio_data}}",
    "language": "en-US",
    "format": "mp3"
  }
}

3. File Upload Recognition

Process uploaded audio files:

{
  "workflow": [
    {
      "id": "file-upload",
      "type": "file-input",
      "config": {
        "acceptedTypes": ["audio/wav", "audio/mp3", "audio/m4a"]
      }
    },
    {
      "id": "speech-recognition",
      "type": "tencent-speech",
      "config": {
        "secretId": "{{TENCENT_SECRET_ID}}",
        "secretKey": "{{TENCENT_SECRET_KEY}}"
      },
      "inputs": {
        "audio": "{{file-upload.content}}",
        "format": "{{file-upload.format}}"
      }
    }
  ]
}

4. Batch Processing

Process multiple audio files in batch:

{
  "workflow": [
    {
      "id": "batch-recognition",
      "type": "js-executor",
      "inputs": {
        "audioFiles": [
          {"name": "audio1.wav", "data": "base64_data_1"},
          {"name": "audio2.mp3", "data": "base64_data_2"},
          {"name": "audio3.m4a", "data": "base64_data_3"}
        ],
        "code": `
          const results = [];
          for (const file of inputs.audioFiles) {
            try {
              // The input parameter is named audio (not audioData)
              const result = await callOperator('tencent-speech', {
                audio: file.data,
                format: file.name.split('.').pop()
              });
              results.push({
                filename: file.name,
                text: result.text,
                success: true
              });
            } catch (error) {
              results.push({
                filename: file.name,
                error: error.message,
                success: false
              });
            }
          }
          return { results };
        `
      }
    }
  ]
}

5. Chinese-English Mixed Recognition

Process audio containing mixed Chinese and English:

{
  "id": "mixed-language-recognition",
  "type": "tencent-speech",
  "config": {
    "secretId": "{{TENCENT_SECRET_ID}}",
    "secretKey": "{{TENCENT_SECRET_KEY}}",
    "engineType": "16k_zh_en"
  },
  "inputs": {
    "audio": "{{mixed_audio_data}}",
    "language": "zh_en",
    "format": "wav"
  }
}

6. Hotword-Optimized Recognition

Use a hotword list to improve recognition accuracy for specialized terminology:

{
  "id": "hotword-recognition",
  "type": "tencent-speech",
  "config": {
    "secretId": "{{TENCENT_SECRET_ID}}",
    "secretKey": "{{TENCENT_SECRET_KEY}}",
    "hotwordId": "your_hotword_table_id"
  },
  "inputs": {
    "audio": "{{professional_audio_data}}",
    "language": "zh-CN",
    "format": "wav"
  }
}

Environment Setup

1. Obtain Tencent Cloud Credentials

  1. Log in to the Tencent Cloud Console
  2. Navigate to Access Management > API Key Management
  3. Click Create Key to create a new API key
  4. Record the SecretId and SecretKey

2. Environment Variable Configuration

Set credential information in environment variables:

# .env file
TENCENT_SECRET_ID=AKIDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TENCENT_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TENCENT_REGION=ap-guangzhou

3. Permission Configuration

Ensure the API key has speech recognition service permissions:

{
  "version": "2.0",
  "statement": [
    {
      "effect": "allow",
      "action": [
        "asr:SentenceRecognition"
      ],
      "resource": "*"
    }
  ]
}

Workflow Integration

Using in Workflows

  1. Add Speech Recognition Node

    • Select "Tencent Cloud Speech Recognition" from the operator library
    • Configure Tencent Cloud credentials and recognition parameters
  2. Connect Data Sources

    • Connect the audio input node's output to the speech recognition node
    • Supports multiple data sources including file upload and URL download
  3. Downstream Processing

    • Pass recognition results to text processing nodes
    • Or use for content analysis, sentiment analysis, and other scenarios

Combining with Other Operators

{
  "workflow": [
    {
      "id": "audio-input",
      "type": "file-input",
      "config": {
        "acceptedTypes": ["audio/*"]
      }
    },
    {
      "id": "speech-to-text",
      "type": "tencent-speech",
      "config": {
        "secretId": "{{TENCENT_SECRET_ID}}",
        "secretKey": "{{TENCENT_SECRET_KEY}}"
      },
      "inputs": {
        "audio": "{{audio-input.content}}",
        "format": "{{audio-input.format}}"
      }
    },
    {
      "id": "text-analysis",
      "type": "text-analyzer",
      "inputs": {
        "text": "{{speech-to-text.text}}"
      }
    },
    {
      "id": "save-result",
      "type": "database-save",
      "inputs": {
        "table": "transcriptions",
        "data": {
          "original_audio": "{{audio-input.filename}}",
          "transcribed_text": "{{speech-to-text.text}}",
          "confidence": "{{text-analysis.confidence}}",
          "created_at": "{{now}}"
        }
      }
    }
  ]
}

Performance Optimization

Best Practices

  1. Audio Quality Optimization

    • Use audio files with a 16kHz sample rate
    • Keep audio clear and free of background noise
    • Keep audio duration under 60 seconds
  2. Format Selection

    • Prefer WAV format for optimal recognition accuracy
    • MP3 format offers a good balance of quality and file size
    • Avoid overly compressed audio formats
  3. Batch Processing Optimization

    • For large numbers of audio files, consider processing in batches
    • Use concurrency control to avoid API rate limits
    • Implement retry mechanisms for temporary network issues
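
The concurrency control and retry advice above can be sketched with two small helpers. Both `withRetry` and `mapWithConcurrency` are hypothetical names, and the retry count and backoff delays are placeholder values, not documented rate limits.

```javascript
// Retry an async function with exponential backoff between attempts.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}

// Run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}
```

Inside a js-executor node, the two could be combined as `mapWithConcurrency(files, 3, (file) => withRetry(() => callOperator('tencent-speech', { audio: file.data })))`, assuming `callOperator` is available as in the batch processing example.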

Performance Metrics

  • Recognition Accuracy: Can exceed 95% with clear audio
  • Processing Speed: Typically completes recognition within 1-3 seconds
  • Supported Duration: Maximum 60 seconds of audio
  • File Size: Maximum 5MB audio files

Error Handling

Common Error Types

| Error Code | Error Message | Cause | Solution |
| --- | --- | --- | --- |
| INVALID_CREDENTIALS | Invalid credentials | Incorrect SecretId or SecretKey | Verify Tencent Cloud API keys |
| AUDIO_TOO_LONG | Audio duration exceeded | Audio exceeds 60 seconds | Split audio or use an alternative recognition method |
| FILE_TOO_LARGE | File too large | Audio file exceeds 5MB | Compress audio or reduce quality |
| UNSUPPORTED_FORMAT | Format not supported | Audio format not in supported list | Convert to a supported format |
| NETWORK_ERROR | Network error | Network connectivity issue | Check network connection and firewall settings |

Error Response Handling

{
  "workflow": [
    {
      "id": "speech-recognition",
      "type": "tencent-speech",
      "config": {
        "secretId": "{{TENCENT_SECRET_ID}}",
        "secretKey": "{{TENCENT_SECRET_KEY}}"
      },
      "inputs": {
        "audio": "{{audio_data}}"
      },
      "onError": {
        "continue": true,
        "defaultValue": {
          "success": false,
          "text": "",
          "error": "Speech recognition failed. Please check audio format and network connection"
        }
      }
    },
    {
      "id": "handle-result",
      "type": "js-executor",
      "inputs": {
        "result": "{{speech-recognition}}",
        "code": `
          if (inputs.result.success) {
            return {
              status: 'success',
              message: 'Recognition successful',
              text: inputs.result.text
            };
          } else {
            return {
              status: 'error',
              message: inputs.result.error || 'Recognition failed',
              text: ''
            };
          }
        `
      }
    }
  ]
}

Troubleshooting

Debugging Tips

  1. Check Audio Data

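    // Validate base64 audio data by round-tripping it through Buffer
    // (atob/btoa are not available as globals in Node.js 14)
    const isValidBase64 = (str) => {
      if (typeof str !== 'string' || str.length === 0) {
        return false;
      }
      return Buffer.from(str, 'base64').toString('base64') === str;
    };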
  2. Test API Connection

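    # Test connectivity to the Tencent Cloud ASR endpoint with curl.
    # The request is unsigned, so an error response is expected;
    # receiving any response from the API confirms the endpoint is reachable.
    curl -X POST https://asr.tencentcloudapi.com/ \
      -H "Content-Type: application/json" \
      -d '{"Action":"SentenceRecognition","Version":"2019-06-14"}'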
  3. Audio Format Check

    // Check audio file header information
    const checkAudioFormat = (buffer) => {
    const header = buffer.slice(0, 12).toString('ascii');
    if (header.startsWith('RIFF') && header.includes('WAVE')) {
    return 'wav';
    } else if (buffer[0] === 0xFF && (buffer[1] & 0xE0) === 0xE0) {
    return 'mp3';
    }
    return 'unknown';
    };
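
Since 16kHz audio is recommended, it can also help to read the sample rate from a WAV header while debugging. The sketch below assumes a canonical PCM WAV layout in which the "fmt " chunk immediately follows the "WAVE" marker; files with extra chunks before "fmt " return null. `readWavFormat` is a hypothetical helper.

```javascript
// Assumes a canonical WAV layout: "RIFF" + size + "WAVE" + "fmt " chunk at byte 12.
function readWavFormat(buffer) {
  if (buffer.length < 36 ||
      buffer.toString('latin1', 0, 4) !== 'RIFF' ||
      buffer.toString('latin1', 8, 12) !== 'WAVE' ||
      buffer.toString('latin1', 12, 16) !== 'fmt ') {
    return null; // not a WAV file, or the fmt chunk is not first
  }
  return {
    channels: buffer.readUInt16LE(22),
    sampleRate: buffer.readUInt32LE(24),
    bitsPerSample: buffer.readUInt16LE(34),
  };
}

// The sample WAV payload from the Basic Speech Recognition example.
const sample = Buffer.from(
  'UklGRigAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQQAAAAAAA==',
  'base64'
);
const info = readWavFormat(sample);
if (info && info.sampleRate !== 16000) {
  console.warn(`sample rate is ${info.sampleRate} Hz; 16000 Hz is recommended`);
}
```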

FAQ

Q: Why is my audio recognition accuracy low?
A: Check audio quality: ensure the sample rate is 16kHz, the audio is clear without noise, and speech is articulate.

Q: How do I handle audio longer than 60 seconds?
A: You need to split long audio into segments of 60 seconds or less, recognize each separately, then merge the results.

Q: Does it support real-time speech recognition?
A: The current version only supports single-sentence recognition and does not support real-time streaming recognition.

Q: How can I improve recognition accuracy for specialized terminology?
A: You can create a hotword list and specify the hotwordId parameter in the configuration.
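
For the long-audio case, splitting is straightforward when the audio is raw PCM, because duration maps directly to byte count. The following is a minimal sketch under that assumption; `splitPcm` and its option names are hypothetical, and compressed formats such as MP3 or AAC would need to be decoded or cut with an audio tool first.

```javascript
// Split raw PCM into chunks of at most `maxSeconds` each.
// Chunk boundaries fall on whole sample frames because the chunk size
// is a multiple of bytesPerSample * channels.
function splitPcm(buffer, { sampleRate = 16000, bytesPerSample = 2, channels = 1, maxSeconds = 60 } = {}) {
  const frameSize = bytesPerSample * channels;
  const chunkBytes = sampleRate * frameSize * maxSeconds;
  const chunks = [];
  for (let offset = 0; offset < buffer.length; offset += chunkBytes) {
    chunks.push(buffer.subarray(offset, Math.min(offset + chunkBytes, buffer.length)));
  }
  return chunks;
}
```

Cutting at fixed offsets can split a word in half, so a more careful implementation might cut at detected silence instead. Each chunk can then be recognized separately and the resulting text values joined in order.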

Version Information

  • Current Version: 1.1.0
  • Compatibility: Node.js 14+
  • Dependencies: tencentcloud-sdk-nodejs
  • Changelog:
    • v1.1.0: Corrected feature limitations, removed unsupported real-time transcription functionality
    • v1.0.0: Initial version release