How to combine Python with Baidu speech recognition to generate video subtitles automatically

2026-05-27 00:48



Extracting audio from the video

Install moviepy:

pip install moviepy

Code:

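import os
from moviepy.editor import VideoFileClip  # moviepy 2.x: from moviepy import VideoFileClip

# work_path (output folder) and video_file (input video) are defined beforehand
audio_file = os.path.join(work_path, 'out.wav')
video = VideoFileClip(video_file)
# Resample to 16 kHz (-ar 16000) and mix down to mono (-ac 1)
video.audio.write_audiofile(audio_file, ffmpeg_params=['-ar', '16000', '-ac', '1'])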
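The ffmpeg_params above resample the soundtrack to 16 kHz and mix it down to one channel. Baidu's short speech recognition API expects 16-bit mono PCM at 16 kHz. Before uploading anything, you can confirm that the extracted file really has that format using the standard-library wave module. Here is a minimal sketch; check_wav_format is a hypothetical helper and is not part of moviepy or Baidu's SDK.

```python
import wave


def check_wav_format(wav, rate=16000, channels=1, sample_width=2):
    """Return a list of format problems (an empty list means the WAV looks right).

    `wav` may be a file path or a binary file object; sample_width is in bytes
    (2 bytes = 16-bit samples).
    """
    problems = []
    with wave.open(wav, 'rb') as w:
        if w.getframerate() != rate:
            problems.append('sample rate is {} Hz, expected {}'.format(w.getframerate(), rate))
        if w.getnchannels() != channels:
            problems.append('{} channels, expected {}'.format(w.getnchannels(), channels))
        if w.getsampwidth() != sample_width:
            problems.append('{}-byte samples, expected {}'.format(w.getsampwidth(), sample_width))
    return problems
```

Calling check_wav_format(audio_file) right after write_audiofile should return an empty list. Any message it returns points to the ffmpeg parameter that needs fixing.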

Splitting the audio on silence

We use the pydub audio library. Install it with:

pip install pydub

Method 1:

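from pydub import AudioSegment
from pydub.silence import split_on_silence

sound = AudioSegment.from_wav(audio_file)
# Anything quieter than sound.dBFS * 1.3 counts as silence (dBFS is negative, so this
# threshold sits a little below the average loudness). Wherever such silence lasts
# at least 500 ms, the audio is split, producing a list of chunks.
sounds = split_on_silence(sound, min_silence_len=500, silence_thresh=sound.dBFS * 1.3)
sec = 0
for chunk in sounds:
    sec += len(chunk)  # len() of an AudioSegment is its length in milliseconds
print('split duration is ', sec)
print('dBFS: {0}, max_dBFS: {1}, duration: {2}, split: {3}'.format(
    round(sound.dBFS, 2), round(sound.max_dBFS, 2), sound.duration_seconds, len(sounds)))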

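The split timings don't look right, and the chunks are hard to place on the original timeline. That is because split_on_silence returns only the audio chunks, with most of the silence removed, and not their positions. So let's try another approach: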

from pydub.silence import detect_nonsilent

# Split the audio into sections by searching for silence
# Reference: wqian.net/blog/2018/1128-python-pydub-split-mp3-index.html
# Returns a list of [start, end] pairs (in ms) for the non-silent sections
timestamp_list = detect_nonsilent(sound, min_silence_len=500,
                                  silence_thresh=sound.dBFS * 1.3, seek_step=1)
for start, end in timestamp_list:
    print("Section is :", [start, end], "duration is:", end - start)
print('dBFS: {0}, max_dBFS: {1}, duration: {2}, split: {3}'.format(
    round(sound.dBFS, 2), round(sound.max_dBFS, 2), sound.duration_seconds, len(timestamp_list)))
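pydub's detect_nonsilent slides a window of min_silence_len over the audio in steps of seek_step. It compares each window's RMS loudness against silence_thresh and then returns the gaps between the silent stretches. With sound.dBFS * 1.3, a clip whose average loudness is -20 dBFS gets a threshold of -26 dBFS. The sketch below is a simplified pure-Python version of that idea, not pydub's actual code. It assumes a hypothetical input: one loudness value in dBFS per millisecond.

```python
def detect_nonsilent_ms(levels_db, min_silence_len, silence_thresh):
    """Return [start, end] ms ranges that are not silent.

    levels_db[i] is the loudness (dBFS) of millisecond i. A run of values below
    silence_thresh counts as silence only if it lasts at least min_silence_len ms;
    shorter pauses stay inside the surrounding non-silent range.
    """
    silent_runs = []
    run_start = None
    # A final +inf sentinel is never "silent", so it closes a trailing silent run
    for i, level in enumerate(list(levels_db) + [float('inf')]):
        if level < silence_thresh:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_silence_len:
                silent_runs.append((run_start, i))
            run_start = None
    # The non-silent ranges are the gaps between the long silent runs
    nonsilent = []
    prev_end = 0
    for start, end in silent_runs:
        if start > prev_end:
            nonsilent.append([prev_end, start])
        prev_end = end
    if prev_end < len(levels_db):
        nonsilent.append([prev_end, len(levels_db)])
    return nonsilent
```

In this sketch, a 600 ms pause splits the audio but a 100 ms pause does not. That is why min_silence_len controls how finely the speech gets cut into subtitle lines.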

Each section comes back as a [start, end] pair in milliseconds, which makes this approach easier to work with.
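Baidu's short speech recognition accepts at most 60 seconds of audio per request, so a long non-silent section has to be shortened before it is sent. The following minimal sketch cuts any over-long [start, end] range into consecutive pieces. The helper name split_long_ranges is hypothetical, and the 60 000 ms limit is taken from Baidu's documentation for that API.

```python
def split_long_ranges(ranges, max_len_ms=60000):
    """Cut every [start, end] range (ms) longer than max_len_ms into consecutive pieces."""
    pieces = []
    for start, end in ranges:
        # Peel off full-length pieces until the remainder fits
        while end - start > max_len_ms:
            pieces.append([start, start + max_len_ms])
            start += max_len_ms
        pieces.append([start, end])
    return pieces
```

Cutting at fixed offsets can split a word in half. A gentler option is to re-run detect_nonsilent on just that section with a shorter min_silence_len, and to fall back on fixed cuts only if no pause is found.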

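Using Baidu speech recognition

First, create an application on the Baidu AI Cloud (百度智能云) platform and get its API Key and Secret Key.

Getting an Access Token

Baidu AI products require authorization. A certain amount of usage is free, which is enough for generating subtitles.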

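import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

TOKEN_URL = 'https://aip.baidubce.com/oauth/2.0/token'
# API_KEY and SECRET_KEY are the keys of the application created above


def fetch_token():
    '''Get an Access Token from Baidu AI Cloud'''
    params = {'grant_type': 'client_credentials',
              'client_id': API_KEY,
              'client_secret': SECRET_KEY}
    post_data = urlencode(params).encode('utf-8')
    req = Request(TOKEN_URL, post_data)
    try:
        with urlopen(req) as f:
            result_str = f.read()
    except HTTPError as err:
        # On an HTTP error the response body still describes the problem as JSON
        print('token http code:', err.code)
        result_str = err.read()
    result = json.loads(result_str.decode('utf-8'))
    if 'access_token' not in result:
        raise RuntimeError('could not get access token: {}'.format(result))
    return result['access_token']

Generating subtitles

SRT subtitle format reference: www.cnblogs.com/tocy/p/subtitle-format-srt.html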

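Generating subtitles is really just an application of speech recognition: arrange the recognized text in the SRT subtitle format and you're done. For details of the format, see the article referenced above. The code is as follows: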

import json

token = fetch_token()
text = []
idx = 1  # SRT cues are numbered starting from 1
for start, end in timestamp_list:
    # Raw 16-bit PCM bytes of this non-silent section
    data = sound[start:end].raw_data
    result = json.loads(asr_raw(data, token))
    if result['err_no'] == 0:
        # Cue number, then "start --> end"; format_time(seconds) returns an SRT timestamp HH:MM:SS,mmm
        text.append('{0}\n{1} --> {2}\n'.format(idx, format_time(start / 1000), format_time(end / 1000)))
        # Recognized text, followed by the blank line that ends an SRT cue
        text.append(result['result'][0] + '\n\n')
        idx += 1
        print(format_time(start / 1000), "txt is ", result['result'][0])
# "w" creates or overwrites the file; "r+" fails if srt_file does not exist yet
with open(srt_file, "w", encoding="utf-8") as f:
    f.writelines(text)
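The loop above depends on two helpers that the article does not show: asr_raw(data, token) and format_time(seconds). Below is a minimal sketch of both. It assumes Baidu's "raw" upload mode for short speech recognition, in which PCM bytes go in the request body and cuid, token and dev_pid go in the query string. The endpoint, the dev_pid values and the CUID string are assumptions based on Baidu's public docs, so check them against the current documentation. For the 极速 (express) mode, the docs list https://vop.baidu.com/pro_api with dev_pid 80001.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed values: standard short-speech endpoint and the Mandarin model (1537)
ASR_URL = 'http://vop.baidu.com/server_api'
DEV_PID = 1537
CUID = 'subtitle-demo'  # hypothetical: any string identifying the caller
RATE = 16000            # must match the sample rate of the extracted WAV


def build_asr_request(data, token):
    """Build a raw-upload request: PCM bytes in the body, credentials in the query."""
    query = urlencode({'cuid': CUID, 'token': token, 'dev_pid': DEV_PID})
    headers = {'Content-Type': 'audio/pcm; rate={}'.format(RATE),
               'Content-Length': str(len(data))}
    return Request(ASR_URL + '?' + query, data=data, headers=headers)


def asr_raw(data, token):
    """Send one PCM chunk and return the JSON response text (contains err_no, result)."""
    with urlopen(build_asr_request(data, token)) as resp:
        return resp.read().decode('utf-8')


def format_time(seconds):
    """Turn seconds (e.g. 61.5) into an SRT timestamp such as '00:01:01,500'."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3600000)
    minutes, rest = divmod(rest, 60000)
    secs, ms = divmod(rest, 1000)
    return '{:02d}:{:02d}:{:02d},{:03d}'.format(hours, minutes, secs, ms)
```

The request is built in its own function so that it can be inspected without any network access. SRT uses a comma before the milliseconds, which is why format_time does not use a period.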

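Summary

I downloaded a video from a video site to test this. The 极速 (express) mode gave the best results in both speed and recognition accuracy, and it felt even easier to use than NetEase's 见外 (Jianwai) platform.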
