Tencent Cloud Speech Recognition Operator

Overview

The Tencent Cloud Speech Recognition Operator is a speech-to-text tool built on the Tencent Cloud Speech Recognition API. It accepts several audio formats and languages and converts short audio clips (up to 60 seconds) into text. Typical uses include speech transcription, intelligent customer service, and meeting minutes.

Core Features

✅ Multi-Format Support: Supports WAV, MP3, PCM, M4A, AAC, and other mainstream audio formats
✅ Multi-Language Recognition: Supports 9 major languages including Chinese, English, Japanese, and Korean
✅ High-Accuracy Recognition: Based on Tencent Cloud's advanced speech recognition technology
✅ Flexible Input: Supports base64 encoding, Buffer data, file paths, and other input methods
✅ Secure and Reliable: Built-in data validation and error handling mechanisms
✅ Detailed Reports: Provides complete recognition results and execution statistics

Important Limitations

⚠️ Please note the following limitations before use:

Audio Duration: Maximum 60 seconds of audio supported (Tencent Cloud API limitation)
File Size: Maximum 5MB audio files supported
Recognition Mode: Only single-sentence recognition is supported; real-time transcription is not supported
Network Requirements: Requires a stable network connection to access the Tencent Cloud API
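
The size and duration limits above can be checked before a request is sent. The following is a minimal sketch, not the operator's own validation code; the function name `checkLimits` and its return shape are illustrative assumptions.

```javascript
// Limits taken from the list above; names and return shape are illustrative.
const MAX_FILE_SIZE_BYTES = 5 * 1024 * 1024; // 5MB
const MAX_DURATION_SECONDS = 60;

/**
 * Returns a list of human-readable violations; an empty list means the audio
 * is within both limits. durationSeconds is optional because the duration of
 * a compressed file cannot be derived from its byte length alone.
 */
function checkLimits({ sizeBytes, durationSeconds }) {
  const violations = [];
  if (sizeBytes > MAX_FILE_SIZE_BYTES) {
    violations.push(`size ${sizeBytes} bytes exceeds ${MAX_FILE_SIZE_BYTES} bytes`);
  }
  if (typeof durationSeconds === 'number' && durationSeconds > MAX_DURATION_SECONDS) {
    violations.push(`duration ${durationSeconds}s exceeds ${MAX_DURATION_SECONDS}s`);
  }
  return violations;
}
```

Returning a list rather than throwing lets the caller report both problems at once.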

Configuration Parameters

Tencent Cloud Credential Configuration

Parameter	Type	Default	Required	Description
`secretId`	string	Environment variable `TENCENT_SECRET_ID`	✅	Tencent Cloud API Secret ID (starts with AKID)
`secretKey`	string	Environment variable `TENCENT_SECRET_KEY`	✅	Tencent Cloud API Secret Key (minimum 20 characters)
`appId`	string	Environment variable `TENCENT_CLOUD_APP_ID`	❌	Tencent Cloud App ID
`region`	string	`'ap-guangzhou'`	❌	Service region

Recognition Parameter Configuration

Parameter	Type	Default	Description
`engineType`	string	`'16k_zh'`	Engine model type; see supported language list for details
`timeout`	number	`60000`	Timeout in milliseconds, range 5000–300000
`maxFileSize`	number	`5242880`	Maximum file size in bytes, default 5MB
`maxDuration`	number	`60`	Maximum audio duration in seconds
`defaultLanguage`	string	`'zh-CN'`	Default recognition language
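
The defaults and constraints in the two tables above could be resolved roughly as follows. This is a sketch under assumptions: `resolveConfig` and `firstDefined` are hypothetical helper names, and the operator's real precedence and error messages may differ.

```javascript
// Returns the first value that is not undefined, null, or an empty string.
function firstDefined(...values) {
  return values.find((v) => v !== undefined && v !== null && v !== '');
}

// Merges explicit config with environment variables and documented defaults.
// `env` is injectable so the function can be exercised without touching process.env.
function resolveConfig(config = {}, env = process.env) {
  const resolved = {
    secretId: firstDefined(config.secretId, env.TENCENT_SECRET_ID),
    secretKey: firstDefined(config.secretKey, env.TENCENT_SECRET_KEY),
    appId: firstDefined(config.appId, env.TENCENT_CLOUD_APP_ID),
    region: firstDefined(config.region, 'ap-guangzhou'),
    engineType: firstDefined(config.engineType, '16k_zh'),
    timeout: firstDefined(config.timeout, 60000),
    maxFileSize: firstDefined(config.maxFileSize, 5242880),
    maxDuration: firstDefined(config.maxDuration, 60),
    defaultLanguage: firstDefined(config.defaultLanguage, 'zh-CN'),
  };

  // Constraints stated in the tables: AKID prefix, 20+ character key, timeout range.
  if (typeof resolved.secretId !== 'string' || !resolved.secretId.startsWith('AKID')) {
    throw new Error('secretId is required and must start with "AKID"');
  }
  if (typeof resolved.secretKey !== 'string' || resolved.secretKey.length < 20) {
    throw new Error('secretKey is required and must be at least 20 characters');
  }
  if (!(resolved.timeout >= 5000 && resolved.timeout <= 300000)) {
    throw new Error('timeout must be between 5000 and 300000 milliseconds');
  }
  return resolved;
}
```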

Supported Languages and Formats

Supported Languages

The `language` input takes standard language codes, not engine model type strings; the operator maps each code to an engine model internally:

Language Code	Language	Description
`zh-CN`	Chinese (Mandarin)	Default language
`zh-TW`	Chinese (Traditional)	Traditional Chinese recognition
`en-US`	English (American)	English recognition
`ja-JP`	Japanese	Japanese recognition
`ko-KR`	Korean	Korean recognition
`es-ES`	Spanish	Spanish recognition
`fr-FR`	French	French recognition
`de-DE`	German	German recognition
`ru-RU`	Russian	Russian recognition
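
A caller can validate a language code against the table above before invoking the operator. The sketch below is illustrative: `resolveLanguage` is a hypothetical helper, and throwing on an unknown code is an assumption, since the operator's behavior for unsupported codes is not documented here.

```javascript
// The nine codes listed in the table above.
const SUPPORTED_LANGUAGES = [
  'zh-CN', 'zh-TW', 'en-US', 'ja-JP', 'ko-KR', 'es-ES', 'fr-FR', 'de-DE', 'ru-RU',
];

// Falls back to defaultLanguage when no language is given; rejects unknown codes.
function resolveLanguage(language, defaultLanguage = 'zh-CN') {
  const code = language || defaultLanguage;
  if (!SUPPORTED_LANGUAGES.includes(code)) {
    throw new Error(
      `Unsupported language code "${code}"; expected one of: ${SUPPORTED_LANGUAGES.join(', ')}`
    );
  }
  return code;
}
```

Short forms such as `en` are rejected by this check, so pass the full code (`en-US`).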

Supported Audio Formats

voiceFormat uses string values (not numbers):

Format	voiceFormat Value	File Extension	Description
WAV	`'wav'`	.wav	Lossless audio format, recommended; also the default fallback format
MP3	`'mp3'`	.mp3	Common compressed audio format
M4A	`'m4a'`	.m4a	Apple audio format
PCM	`'pcm'`	.pcm	Raw audio data
AAC	`'aac'`	.aac	Advanced Audio Coding
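
One way to pick a `format` value is to prefer an explicit value, then the filename extension, then the `'wav'` fallback. This is a sketch only: `resolveFormat` is a hypothetical helper, and whether the operator silently falls back or rejects an unknown format with `UNSUPPORTED_FORMAT` is not specified here.

```javascript
// The five voiceFormat values from the table above.
const SUPPORTED_FORMATS = ['wav', 'mp3', 'm4a', 'pcm', 'aac'];

// Prefers an explicit supported format, then the filename extension, then 'wav'.
function resolveFormat(format, filename) {
  const explicit = typeof format === 'string' ? format.trim().toLowerCase() : '';
  if (SUPPORTED_FORMATS.includes(explicit)) {
    return explicit;
  }
  const match = /\.([A-Za-z0-9]+)$/.exec(filename || '');
  const fromName = match ? match[1].toLowerCase() : '';
  return SUPPORTED_FORMATS.includes(fromName) ? fromName : 'wav';
}
```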

Input Parameters

Parameter	Type	Required	Description
`audio`	string / Buffer	✅	Audio data (base64-encoded string or Buffer)
`language`	string	❌	Recognition language code (e.g., `zh-CN`); defaults to `defaultLanguage` from configuration
`format`	string	❌	Audio format (e.g., `wav`, `mp3`); defaults to `'wav'`
`filename`	string	❌	Filename; defaults to `audio.{format}`
`sessionId`	string	❌	Session ID

Note

The input parameter name is `audio` (not `audioData`).
If a `realtime` parameter is provided, it is ignored with a warning (real-time transcription is not supported).
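
The input rules above can be summarized in a small normalization step. This is a hedged sketch of the documented behavior, not the operator's source; `normalizeInputs` and the shape of its return value are assumptions.

```javascript
// Normalizes operator inputs: accepts a base64 string or a Buffer for `audio`,
// applies the documented defaults, and warns when `realtime` is supplied.
function normalizeInputs(inputs) {
  const { audio, format = 'wav', filename, realtime } = inputs;

  let buffer;
  if (Buffer.isBuffer(audio)) {
    buffer = audio;
  } else if (typeof audio === 'string' && audio.length > 0) {
    buffer = Buffer.from(audio, 'base64');
  } else {
    throw new TypeError('inputs.audio must be a base64-encoded string or a Buffer');
  }

  if (realtime !== undefined) {
    console.warn('realtime is ignored: real-time transcription is not supported');
  }

  return {
    buffer,
    format,
    filename: filename || `audio.${format}`, // documented default: audio.{format}
    audioSize: buffer.length,
  };
}
```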

Output Results

Success Response Structure

{
  "success": true,
  "text": "Recognized text content",
  "language": "zh-CN",
  "confidence": 0.85,
  "provider": "tencent",
  "method": "sentence",
  "processedAt": "2025-06-15T10:30:00.000Z",
  "audioSize": 83200,
  "sessionId": "xxx",
  "executionTime": 500
}

Note

`confidence` is currently a fixed value of 0.85 and does not use the actual confidence returned by Tencent Cloud.

Failure Response Structure

{
  "success": false,
  "error": "Error message",
  "provider": "tencent",
  "processedAt": "2025-06-15T10:30:00.000Z",
  "sessionId": "xxx",
  "failed": true
}

Output Field Descriptions

Field	Type	Description
`success`	boolean	Whether recognition was successful
`text`	string	Recognized text content
`language`	string	Language code used
`confidence`	number	Confidence score (currently fixed at 0.85)
`provider`	string	Service provider (fixed as `"tencent"`)
`method`	string	Recognition method (fixed as `"sentence"`)
`processedAt`	string	Processing time (ISO 8601 format)
`audioSize`	number	Audio data size in bytes
`sessionId`	string	Session ID
`executionTime`	number	Execution duration in milliseconds
`error`	string	Error message on failure
`failed`	boolean	`true` on failure
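
Downstream code can branch on `success` and read only the fields that exist for that outcome. The helper below is an illustrative sketch; `extractText` and its return shape are assumptions, not part of the operator.

```javascript
// Reduces an operator result to the fields a caller usually needs.
// `confidence` is deliberately not used: it is a fixed 0.85, so filtering on it is meaningless.
function extractText(result) {
  if (result && result.success === true) {
    return { ok: true, text: result.text, elapsedMs: result.executionTime };
  }
  const message = (result && result.error) || 'Unknown recognition error';
  return { ok: false, text: '', error: message };
}
```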

Usage Examples

1. Basic Speech Recognition

The simplest usage — recognize Chinese speech:

{
  "id": "speech-recognition",
  "type": "tencent-speech",
  "config": {
    "secretId": "AKIDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "secretKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "region": "ap-guangzhou"
  },
  "inputs": {
    "audio": "UklGRigAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQQAAAAAAA==",
    "language": "zh-CN",
    "format": "wav"
  }
}

2. Multi-Language Recognition

Recognize English speech:

{
  "id": "english-recognition",
  "type": "tencent-speech",
  "config": {
    "secretId": "AKIDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "secretKey": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "engineType": "16k_en"
  },
  "inputs": {
    "audio": "{{base64_audio_data}}",
    "language": "en-US",
    "format": "mp3"
  }
}

3. File Upload Recognition

Process uploaded audio files:

{
  "workflow": [
    {
      "id": "file-upload",
      "type": "file-input",
      "config": {
        "acceptedTypes": ["audio/wav", "audio/mp3", "audio/m4a"]
      }
    },
    {
      "id": "speech-recognition",
      "type": "tencent-speech",
      "config": {
        "secretId": "{{TENCENT_SECRET_ID}}",
        "secretKey": "{{TENCENT_SECRET_KEY}}"
      },
      "inputs": {
        "audio": "{{file-upload.content}}",
        "format": "{{file-upload.format}}"
      }
    }
  ]
}

4. Batch Processing

Process multiple audio files in batch:

{
  "workflow": [
    {
      "id": "batch-recognition",
      "type": "js-executor",
      "inputs": {
        "audioFiles": [
          {"name": "audio1.wav", "data": "base64_data_1"},
          {"name": "audio2.mp3", "data": "base64_data_2"},
          {"name": "audio3.m4a", "data": "base64_data_3"}
        ],
        "code": `
          let results = [];
          for (let file of inputs.audioFiles) {
            try {
              const result = await callOperator('tencent-speech', {
                audio: file.data,
                format: file.name.split('.').pop()
              });
              results.push({
                filename: file.name,
                text: result.text,
                success: true
              });
            } catch (error) {
              results.push({
                filename: file.name,
                error: error.message,
                success: false
              });
            }
          }
          return { results };
        `
      }
    }
  ]
}

5. Chinese-English Mixed Recognition

Process audio containing mixed Chinese and English:

{
  "id": "mixed-language-recognition",
  "type": "tencent-speech",
  "config": {
    "secretId": "{{TENCENT_SECRET_ID}}",
    "secretKey": "{{TENCENT_SECRET_KEY}}",
    "engineType": "16k_zh_en"
  },
  "inputs": {
    "audio": "{{mixed_audio_data}}",
    "language": "zh_en",
    "format": "wav"
  }
}

6. Hotword-Optimized Recognition

Use a hotword list to improve recognition accuracy for specialized terminology:

{
  "id": "hotword-recognition",
  "type": "tencent-speech",
  "config": {
    "secretId": "{{TENCENT_SECRET_ID}}",
    "secretKey": "{{TENCENT_SECRET_KEY}}",
    "hotwordId": "your_hotword_table_id"
  },
  "inputs": {
    "audio": "{{professional_audio_data}}",
    "language": "zh-CN",
    "format": "wav"
  }
}

Environment Setup

1. Obtain Tencent Cloud Credentials

- Log in to the Tencent Cloud Console
- Navigate to Access Management > API Key Management
- Click Create Key to create a new API key
- Record the SecretId and SecretKey

2. Environment Variable Configuration

Set credential information in environment variables:

# .env file
TENCENT_SECRET_ID=AKIDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TENCENT_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TENCENT_REGION=ap-guangzhou

3. Permission Configuration

Ensure the API key has speech recognition service permissions:

{
  "version": "2.0",
  "statement": [
    {
      "effect": "allow",
      "action": [
        "asr:SentenceRecognition"
      ],
      "resource": "*"
    }
  ]
}

Workflow Integration

Using in Workflows

Step 1: Add Speech Recognition Node
- Select "Tencent Cloud Speech Recognition" from the operator library
- Configure Tencent Cloud credentials and recognition parameters
Step 2: Connect Data Sources
- Connect the audio input node's output to the speech recognition node
- Supports multiple data sources including file upload and URL download
Step 3: Downstream Processing
- Pass recognition results to text processing nodes
- Or use for content analysis, sentiment analysis, and other scenarios

Combining with Other Operators

{
  "workflow": [
    {
      "id": "audio-input",
      "type": "file-input",
      "config": {
        "acceptedTypes": ["audio/*"]
      }
    },
    {
      "id": "speech-to-text",
      "type": "tencent-speech",
      "config": {
        "secretId": "{{TENCENT_SECRET_ID}}",
        "secretKey": "{{TENCENT_SECRET_KEY}}"
      },
      "inputs": {
        "audio": "{{audio-input.content}}",
        "format": "{{audio-input.format}}"
      }
    },
    {
      "id": "text-analysis",
      "type": "text-analyzer",
      "inputs": {
        "text": "{{speech-to-text.text}}"
      }
    },
    {
      "id": "save-result",
      "type": "database-save",
      "inputs": {
        "table": "transcriptions",
        "data": {
          "original_audio": "{{audio-input.filename}}",
          "transcribed_text": "{{speech-to-text.text}}",
          "confidence": "{{text-analysis.confidence}}",
          "created_at": "{{now}}"
        }
      }
    }
  ]
}

Performance Optimization

Best Practices

Audio Quality Optimization
- Use audio files with a 16kHz sample rate
- Keep audio clear and free of background noise
- Keep audio duration under 60 seconds
Format Selection
- Prefer WAV format for optimal recognition accuracy
- MP3 format offers a good balance of quality and file size
- Avoid overly compressed audio formats
Batch Processing Optimization
- For large numbers of audio files, consider processing in batches
- Use concurrency control to avoid API rate limits
- Implement retry mechanisms for temporary network issues
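
The concurrency-control and retry advice above could be implemented along these lines. This is a generic sketch, not part of the operator: `backoffDelay`, `withRetry`, and `mapWithConcurrency` are hypothetical helpers, the delay values are arbitrary, and `withRetry` retries every error, whereas production code would likely retry only transient ones.

```javascript
// Delay before retry number `attempt` (1-based): 500ms, 1000ms, 2000ms, ... capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `task` up to `maxAttempts` times, waiting with exponential backoff between attempts.
async function withRetry(task, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      await sleep(backoffDelay(attempt));
    }
  }
}

// Processes `items` with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}
```

A caller might combine the two as `mapWithConcurrency(files, 2, (file) => withRetry(() => recognize(file)))`, where `recognize` stands for whatever function invokes the operator.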

Performance Metrics

Recognition Accuracy: Can exceed 95% with clear audio
Processing Speed: Typically completes recognition within 1-3 seconds
Supported Duration: Maximum 60 seconds of audio
File Size: Maximum 5MB audio files

Error Handling

Common Error Types

Error Code	Error Message	Cause	Solution
`INVALID_CREDENTIALS`	Invalid credentials	Incorrect SecretId or SecretKey	Verify Tencent Cloud API keys
`AUDIO_TOO_LONG`	Audio duration exceeded	Audio exceeds 60 seconds	Split audio or use an alternative recognition method
`FILE_TOO_LARGE`	File too large	Audio file exceeds 5MB	Compress audio or reduce quality
`UNSUPPORTED_FORMAT`	Format not supported	Audio format not in supported list	Convert to a supported format
`NETWORK_ERROR`	Network error	Network connectivity issue	Check network connection and firewall settings

Error Response Handling

{
  "workflow": [
    {
      "id": "speech-recognition",
      "type": "tencent-speech",
      "config": {
        "secretId": "{{TENCENT_SECRET_ID}}",
        "secretKey": "{{TENCENT_SECRET_KEY}}"
      },
      "inputs": {
        "audio": "{{audio_data}}"
      },
      "onError": {
        "continue": true,
        "defaultValue": {
          "success": false,
          "text": "",
          "error": "Speech recognition failed. Please check audio format and network connection"
        }
      }
    },
    {
      "id": "handle-result",
      "type": "js-executor",
      "inputs": {
        "result": "{{speech-recognition}}",
        "code": `
          if (inputs.result.success) {
            return {
              status: 'success',
              message: 'Recognition successful',
              text: inputs.result.text
            };
          } else {
            return {
              status: 'error',
              message: inputs.result.error || 'Recognition failed',
              text: ''
            };
          }
        `
      }
    }
  ]
}

Troubleshooting

Debugging Tips

Check Audio Data

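// Validate base64 audio data.
// Uses Buffer instead of atob/btoa, which are not global in Node.js 14.
const isValidBase64 = (str) => {
  if (typeof str !== 'string' || str.length === 0) {
    return false;
  }
  // Round trip: decoding then re-encoding must reproduce the input exactly
  return Buffer.from(str, 'base64').toString('base64') === str;
};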

Test API Connection

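# Test Tencent Cloud API connection with curl.
# The request is unsigned, so an error response is expected;
# any JSON reply confirms that the endpoint is reachable.
curl -X POST https://asr.tencentcloudapi.com/ \
  -H "Content-Type: application/json" \
  -d '{"Action":"SentenceRecognition","Version":"2019-06-14"}'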

Audio Format Check

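// Check audio file header information.
// The result is a hint only: ADTS AAC streams share the 0xFF sync pattern with MP3.
const checkAudioFormat = (buffer) => {
  const header = buffer.slice(0, 12).toString('ascii');
  if (header.startsWith('RIFF') && header.includes('WAVE')) {
    return 'wav';
  } else if (header.startsWith('ID3')) {
    // MP3 file that begins with an ID3v2 metadata tag
    return 'mp3';
  } else if (buffer[0] === 0xFF && (buffer[1] & 0xE0) === 0xE0) {
    // MPEG audio frame sync: 11 set bits
    return 'mp3';
  }
  return 'unknown';
};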

FAQ

Q: Why is my audio recognition accuracy low?
A: Check audio quality: ensure the sample rate is 16kHz, the audio is clear without noise, and speech is articulate.

Q: How do I handle audio longer than 60 seconds?
A: Split long audio into segments of 60 seconds or less, recognize each segment separately, then merge the results.

Q: Does it support real-time speech recognition?
A: The current version only supports single-sentence recognition and does not support real-time streaming recognition.

Q: How can I improve recognition accuracy for specialized terminology?
A: Create a hotword list and specify the `hotwordId` parameter in the configuration.
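
For the long-audio question above, raw PCM can be split by byte offset because its duration is proportional to its length. The sketch below assumes uncompressed 16kHz, 16-bit, mono PCM by default; `splitPcm` is a hypothetical helper. WAV files would need their header removed first, and compressed formats must be decoded before splitting.

```javascript
// Splits raw PCM into segments of at most `maxSeconds` each.
// Segments are views onto the original buffer, not copies.
function splitPcm(pcm, options = {}) {
  const { sampleRate = 16000, channels = 1, bytesPerSample = 2, maxSeconds = 60 } = options;
  const bytesPerSecond = sampleRate * channels * bytesPerSample; // 32000 with the defaults
  const maxBytes = maxSeconds * bytesPerSecond;

  const segments = [];
  for (let offset = 0; offset < pcm.length; offset += maxBytes) {
    segments.push(pcm.subarray(offset, Math.min(offset + maxBytes, pcm.length)));
  }
  return segments;
}
```

Cutting at fixed offsets can split a word in two; cutting at a pause in speech near each boundary usually gives cleaner merged text.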

Version Information

Current Version: 1.1.0
Compatibility: Node.js 14+
Dependencies: tencentcloud-sdk-nodejs
Changelog:
- v1.1.0: Corrected feature limitations, removed unsupported real-time transcription functionality
- v1.0.0: Initial version release
