AWS机器学习初探(1):Comprehend - 自然语言处理服务
这几个服务的功能和使用都很直接和简单,因此放在一篇文章中介绍。
AWS Translate 服务是一种AWS 机器学习应用服务,它利用高级机器学习技术来进行文本翻译。它的使用非常简单,只需要提供输入文本,该服务就给出输出文本。
AWS Translate 服务有两个组件:
AWS Translate 利用 attention 机制来理解上下文,它帮助 decoder 聚焦在原文中最相关的部分,这有助于它翻译模糊的单词和短语。
Translate 目前只支持将多种语言翻译为英文,以及将英文翻译成多种目标语言。Translate 能自动检测输入文本是哪种语言,它是利用 Comprehend 来实现语言探测的。
来对比下AWS Translate 和 Google 翻译的结果:
这是一段川普的推特文本:
I am hearing so many great things about the Republican Party’s California Gubernatorial Candidate, John Cox. He is a very successful businessman who is tired of high Taxes & Crime. He will Make California Great Again & make you proud of your Great State again. Total Endorsement!
Google 翻译结果:
关于共和党加州州长候选人约翰考克斯,我听到了很多很棒的事情。 他是一个非常成功的商人,厌倦了高税收和犯罪。 他将使加利福尼亚再次伟大,让你再次为你的伟大国家感到骄傲。 总代言!
AWS Translate 翻译结果:
我听到很多关于共和党加州州长候选人约翰·考克斯的伟大事情。 他是一个非常成功的商人,厌倦了高税与犯罪。 他将再次使加州成为伟大的国家,让你再次为你的伟大国家感到骄傲。 完全赞同!
从结果看,AWS Translage的质量应该比Google 稍微好一些。
以下示例将中文文本翻译为英文:
aws translate translate-text --region us-east-1 --source-language-code "auto" --target-language-code "zh" --text "I am hearing so many great things about the Republican Party California Gubernatorial Candidate, John Cox. He is a very successful businessman who is tired of high Taxes & Crime. He will Make California Great Again & make you proud of your Great State again. Total Endorsement"{ "TranslatedText": "我听到很多关于共和党加州州长候选人约翰·考克斯的伟大事情。 他是一个非常成功的商人,厌倦了高税与犯罪。 他将再次使加州成为伟大的国家,让你再次为你的伟大国家感到骄傲。 完全赞同", "SourceLanguageCode": "en", "TargetLanguageCode": "zh"}
Translate 服务只有一个API,就是 TranslateText。
请求语法:
{"SourceLanguageCode": "string", "TargetLanguageCode": "string", "Text": "string"}
返回语法:
{"SourceLanguageCode": "string", "TargetLanguageCode": "string", "TranslatedText": "string"}
代码:
import boto3
translate = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)
result = translate.translate_text(Text="Hello World", SourceLanguageCode="auto", TargetLanguageCode="zh")
print('TranslatedText: ' + result.get('TranslatedText'))
print('SourceLanguageCode: ' + result.get('SourceLanguageCode'))
print('TargetLangaugeCode: ' + result.get('TargetLanguageCode'))
输出:
TranslatedText: 您好世界
SourceLanguageCode: en
TargetLangaugeCode: zh
所谓的文本转语音服务,就是把文本朗读出来。它的输入输出为:
几个特色功能:
SammydeMacBook-Air:~ Sammy$ aws polly synthesize-speech --output-format mp3 --voice-id Joanna --text 'Hello, my name is Joanna. I learned about the W3C on 10/3 of last year.' helloworld.mp3
{ "ContentType": "audio/mpeg", "RequestCharacters": "71"}
SammydeMacBook-Air:~ Sammy$ ls helloworld.mp3
helloworld.mp3
Polly 具有以下几个API:
• SynthesizeSpeech:合成语音
• ListLexicons:列表发音词典
• PutLexicon:创建发音词典
• GetLexicon:检索发音词典
• DeleteLexicon:删除发音词典
• DescribeVoices:获取声音列表 • GetSpeechSynthesisTask:获取语音生成任务
• ListSpeechSynthesisTasks:获取语音生成任务列表
• StartSpeechSynthesisTask:开始语音生成任务
from boto3 import Session
from contextlib import closing
import os
import sys
import subprocess
from tempfile import gettempdir
session = Session(profile_name="default")
polly = session.client("polly")
try:
text = "To the incredible people of the Great State of Wyoming: Go VOTE TODAY for Foster Friess - He will be a fantastic Governor! Strong on Crime, Borders & 2nd Amendment. Loves our Military & our Vets. He has my complete and total Endorsement!"
response = polly.synthesize_speech(Text = text, OutputFormat="mp3", VoiceId="Joanna")
except Exception as error:
print(error)
sys.exit(-1)
if "AudioStream" in response:
with closing(response["AudioStream"]) as stream:
output = os.path.join(gettempdir(), "speech.mp3")
try:
with open(output, "wb") as file:
file.write(stream.read())
except IOError as error:
print(error)
sys.exit(-1)
else:
print("Could not stream audio")
sys.exit(-1)
if sys.platform == "win32":
os.startfile(output)
else:
opener = "open" if sys.platform == "darwin" else "xdg-open"
subprocess.call([opener, output])
这段代码会将语音保存到 speech.mp3中,然后调用系统默认播放器进行播放。
AWS Transcribe 服务于利用机器学习来识别语音文件中的声音,然后将其转化为文本。目前支持英语和西班牙文语音。必须将语音文件保存在S3中,输出结果也会被保存在S3中。
几个特色功能:
创建一个job:
job列表:
结果:
(1)提交一个job
{ "TranscriptionJobName": "testTranscribe", "LanguageCode": "en-US", "MediaFormat": "mp3", "Media": { "MediaFileUri": "https://s3.dualstack.us-east-1.amazonaws.com/*********/hellosammy.mp3"
}
}aws transcribe start-transcription-job --region us-east-1 --cli-input-json file://testTranscribeJob.json
(2)获取job 列表
aws transcribe list-transcription-jobs --region us-east-1 --status IN_PROGRESS
{ "Status": "IN_PROGRESS", "TranscriptionJobSummaries": [
{ "TranscriptionJobName": "testTranscribe", "CreationTime": 1535338023.662, "LanguageCode": "en-US", "TranscriptionJobStatus": "IN_PROGRESS", "OutputLocationType": "SERVICE_BUCKET"
}
]
}
(3)一旦 job 完成后,从其包含的TranscriptFileUri地址可以下载输出文本,部分内容如下:
{"jobName":"testTranscribe","accountId":"725348140609","results":{"transcripts":[{"transcript":"Hello, my name is sami. I learned about the w three c on october third last year."}],"items":[{"start_time":"0.0","end_time":"0.59","alternatives":[{"confidence":"0.9023","content":"Hello"}],"type":"pronunciation"},{"alternatives":[{"confidence":null,"content":","}],"type":"punctuation"},{"start_time":"0.7","end_time":"0.88","alternatives":
[{"confidence":"0.9867","content":"last"}],"type":"pronunciation"},{"start_time":"4.69","end_time":"5.07","alternatives":[{"confidence":"0.9867","content":"year"}],"type":"pronunciation"},{"alternatives":[{"confidence":null,"content":"."}],"type":"punctuation"}]},"status":"COMPLETED"}
import time
import boto3
transcribe = boto3.client(('transcribe'))
job_name = "testTranscribeJob100"
job_uri = "https://s3.dualstack.us-east-1.amazonaws.com/*****/hellosammy.mp3"
transcribe.start_transcription_job(TranscriptionJobName=job_name, Media={'MediaFileUri': job_uri}, MediaFormat='mp3', LanguageCode='en-US')
while True:
status = transcribe.get_transcription_job(TranscriptionJobName = job_name)
if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', "FAILED"]:
break
print("Job not ready yet...")
time.sleep(5)
print(status)
参考文档: