The text-to-speech tool Bark generates long audio differently from short audio. A few days ago I published a tutorial on generating short audio with Bark: "The most powerful text-to-speech tool: Bark — detailed tutorial on local installation, cloud deployment, and the online demo, with one-click AI speech and singing that carries tone and emotion". Before following this long-audio tutorial, I recommend reading that one first to get familiar with the basic installation steps. Bark's official long-form generation notebook is on GitHub: https://github.com/suno-ai/bark/blob/main/notebooks/long_form_generation.ipynb. Today we will use Bark to generate audio longer than 14 seconds; the concrete steps are demonstrated below.
1. Google Colab cloud deployment
First open Google Colaboratory at https://colab.research.google.com and click [File] → [New notebook].
Mount your Google Drive first, then create a new code cell and run the following command to install Bark:
!pip install git+https://github.com/suno-ai/bark.git
There are three ways to generate long speech: 1. simple mode, 2. advanced mode, 3. dialogue mode.
The first, simple mode, uses nltk to split longer text into sentences and generates the audio one sentence at a time.
First run the following code to download nltk's punkt sentence tokenizer:
import nltk
nltk.download('punkt')
Once punkt has finished downloading, run the following code:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from IPython.display import Audio
import nltk  # we'll use this to split into sentences
import numpy as np

from bark.generation import (
    generate_text_semantic,
    preload_models,
)
from bark.api import semantic_to_waveform
from bark import generate_audio, SAMPLE_RATE

preload_models()

script = """
Hey, have you heard about this new text-to-audio model called "Bark"?
Apparently, it's the most realistic and natural-sounding text-to-audio model
out there right now. People are saying it sounds just like a real person speaking.
I think it uses advanced machine learning algorithms to analyze and understand the
nuances of human speech, and then replicates those nuances in its own speech output.
It's pretty impressive, and I bet it could be used for things like audiobooks or podcasts.
In fact, I heard that some publishers are already starting to use Bark to create audiobooks.
It would be like having your own personal voiceover artist. I really think Bark is going to
be a game-changer in the world of text-to-audio technology.
""".replace("\n", " ").strip()

sentences = nltk.sent_tokenize(script)

SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    audio_array = generate_audio(sentence, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]

Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
This code converts the text in script into speech. SPEAKER sets the voice; open the link below to see the full list of available voices: https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c
Converting this text to speech took a very long time on my machine: after 1 hour and 32 minutes it still had not finished, so I gave up waiting. This model really does demand a powerful GPU.
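Independent of Bark itself, the stitching step in the loop above is plain NumPy: each generated sentence is a 1-D array of samples, and a quarter second of zeros is appended after each one. A minimal sketch, using zero-filled arrays as stand-ins for two generated sentences (Bark's output rate is 24 kHz):

```python
import numpy as np

SAMPLE_RATE = 24_000  # Bark outputs 24 kHz audio
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

# stand-ins for two generated sentences: 1 s and 2 s of samples
sentence_a = np.zeros(1 * SAMPLE_RATE)
sentence_b = np.zeros(2 * SAMPLE_RATE)

pieces = []
for audio_array in (sentence_a, sentence_b):
    pieces += [audio_array, silence.copy()]

combined = np.concatenate(pieces)
print(combined.shape[0] / SAMPLE_RATE)  # total length in seconds: 3.5
```

The quarter second of silence between sentences is what keeps the concatenated audio from sounding rushed at sentence boundaries.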
The second mode: advanced mode
Sometimes Bark produces extra audio at the end of a prompt. We can work around this by lowering the threshold at which Bark stops generating, using the min_eos_p argument of generate_text_semantic.
The full code to generate the audio:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from IPython.display import Audio
from scipy.io.wavfile import write as write_wav
import nltk  # we'll use this to split into sentences
import numpy as np

from bark.generation import (
    generate_text_semantic,
    preload_models,
)
from bark.api import semantic_to_waveform
from bark import generate_audio, SAMPLE_RATE

preload_models()

script = """
Hey, have you heard about this new text-to-audio model called "Bark"?
Apparently, it's the most realistic and natural-sounding text-to-audio model
out there right now. People are saying it sounds just like a real person speaking.
I think it uses advanced machine learning algorithms to analyze and understand the
nuances of human speech, and then replicates those nuances in its own speech output.
It's pretty impressive, and I bet it could be used for things like audiobooks or podcasts.
In fact, I heard that some publishers are already starting to use Bark to create audiobooks.
It would be like having your own personal voiceover artist. I really think Bark is going to
be a game-changer in the world of text-to-audio technology.
""".replace("\n", " ").strip()

sentences = nltk.sent_tokenize(script)

GEN_TEMP = 0.6
SPEAKER = "v2/en_speaker_6"  # change the voice here
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    semantic_tokens = generate_text_semantic(
        sentence,
        history_prompt=SPEAKER,
        temp=GEN_TEMP,
        min_eos_p=0.05,  # tune this to trim extra audio at the ends
    )
    audio_array = semantic_to_waveform(semantic_tokens, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]

Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
write_wav("bark_generation.wav", SAMPLE_RATE, np.concatenate(pieces))
You only need to change script (the text to convert), min_eos_p (the threshold for trimming extra audio at the ends), and SPEAKER (the voice); everything else can be left alone if you are unsure what it does.
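The write_wav call at the end of the script is ordinary SciPy rather than anything Bark-specific: it takes a filename, a sample rate, and a NumPy array of samples. A standalone sketch that writes half a second of silence (the filename here is just an example, and in Colab you can point it at a mounted Google Drive path instead):

```python
import numpy as np
from scipy.io.wavfile import write as write_wav

SAMPLE_RATE = 24_000  # Bark's output sample rate
audio = np.zeros(int(0.5 * SAMPLE_RATE), dtype=np.float32)  # half a second of silence

# writes a playable .wav file to the current directory
write_wav("example.wav", SAMPLE_RATE, audio)
```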
The third mode: dialogue mode
You can write a custom dialogue and assign a different voice to each character. The full example code:
"CUDA_VISIBLE_DEVICES"] = "0"import Audiowavfile import write as write_wav# we'll use this to split into sentencesimport (
generate_text_semantic, semantic_to_waveform generate_audio, SAMPLE_RATE
preload_models"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}# Script generated by chat GPT"""
Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?
John: No, I haven't. What's so special about it?
Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.
John: Wow, that sounds amazing. How does it work?
Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.
John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?
Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.
John: I can imagine. It would be like having your own personal voiceover artist.
Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audio technology."""().split("\n")for s in script if s]int(0.5*SAMPLE_RATE)) line.split(": ") generate_audio(text, history_prompt=speaker_lookup[speaker], )audio_array, silence.copy()]pieces), rate=SAMPLE_RATE)"bark_generation.wav", SAMPLE_RATE, np.concatenate(pieces))#直接生成音频文件,可以加入谷歌云盘路径自动保存到谷歌云盘
If you are not familiar with the code, just change the voices in speaker_lookup and the dialogue text in script; everything else can be left as-is.
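The script-parsing part of the dialogue example can be tried without Bark at all: each line is split into a speaker name and their text, and the name is looked up in speaker_lookup to pick that character's voice preset. A minimal sketch with a shortened, made-up dialogue:

```python
speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}

script = """
Samantha: Hey, have you tried Bark yet?
John: Not yet, what does it do?
""".strip().split("\n")

for line in [s.strip() for s in script if s.strip()]:
    speaker, text = line.split(": ", 1)  # split only on the first ": "
    print(speaker_lookup[speaker], "->", text)
```

Splitting only on the first ": " means a character's line may itself contain colons without breaking the parse.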
Generation takes quite a while, but give it a try if you need it.