
Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

We previously covered another paper by this author that used a seq2seq model to learn acoustic word embeddings in an unsupervised way; the idea there was to have the decoder reconstruct the input word-level speech segment.

In today's paper, he instead learns embeddings with the skipgram or continuous bag-of-words (cbow) training methods. The benefit is that context is taken into account, so the learned embeddings carry more semantic information.

In text, skipgram uses the center word of the input to predict its left and right neighbors; cbow does the opposite, using the left and right neighbors to predict the center word. This paper follows the same idea but replaces text words with word-level speech segments, hence the name Speech2Vec. The authors argue that this lets the model learn semantic cues absent from text, such as prosody.
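To make the two training setups concrete, here is a minimal sketch (illustrative only, not the authors' code) of how (input, target) pairs are formed from a word sequence; plain words stand in for the word-level speech segments:

```python
def skipgram_pairs(words, window=1):
    """Skipgram: for each center word, the targets are its left/right neighbors."""
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

def cbow_pairs(words, window=1):
    """Cbow, the reverse: the neighbor words are the input, the center word is the target."""
    pairs = []
    for i, center in enumerate(words):
        context = [words[j]
                   for j in range(max(0, i - window), min(len(words), i + window + 1))
                   if j != i]
        if context:
            pairs.append((context, center))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"]))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_pairs(["the", "cat", "sat"]))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
```

In Speech2Vec each element of a pair is a variable-length speech segment rather than a token, but the pairing logic is the same.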

Their model is seq2seq with attention. In the skipgram setup, the encoder's last state serves as the embedding of the input center word, and the decoder predicts the speech segments of the left and right neighbors. The cbow setup is the reverse: the encoder takes the left and right neighbors as input, and the sum of their last-state vectors serves as the embedding from which the decoder generates the center word.
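A toy numpy sketch of where the embedding comes from in each setup (this is not the paper's implementation; the real model is an RNN encoder-decoder with attention over variable-length acoustic features, and every dimension below is made up — the decoder is omitted entirely):

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encode(x, Wx, Wh):
    """Run a simple tanh RNN over a (T, d) feature sequence (e.g. MFCC frames)
    and return the last hidden state, which serves as the word embedding."""
    h = np.zeros(Wh.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ Wx + h @ Wh)
    return h

d, hidden = 13, 8                       # 13 acoustic features per frame (assumed)
Wx = rng.normal(size=(d, hidden)) * 0.1
Wh = rng.normal(size=(hidden, hidden)) * 0.1

center = rng.normal(size=(20, d))       # speech segment of the center word, 20 frames
left   = rng.normal(size=(15, d))       # left-neighbor segment
right  = rng.normal(size=(18, d))       # right-neighbor segment

# Skipgram: the center word's last encoder state is its embedding;
# the (omitted) decoder would reconstruct the neighbor segments from it.
emb_skipgram = rnn_encode(center, Wx, Wh)

# Cbow: encode the neighbors and sum their last states; the (omitted)
# decoder would generate the center segment from this sum.
emb_cbow = rnn_encode(left, Wx, Wh) + rnn_encode(right, Wx, Wh)

print(emb_skipgram.shape, emb_cbow.shape)   # (8,) (8,)
```

The point of the sketch is only that the embedding is a fixed-size vector regardless of segment length: 20, 15, and 18 frames all collapse to the same hidden dimension.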

The training data is 500 hours of LibriSpeech, segmented into word-level chunks by forced alignment. Evaluation uses 13 benchmarks, each containing word pairs with human-rated similarity scores. The similarity of an embedding pair is their cosine similarity, and the evaluation metric is the correlation between the human ratings and the cosine similarities.

They compared speech2vec with word2vec; the former performs better. In addition, skipgram outperforms cbow.

Remarks:

The most interesting part is their claim that speech2vec can learn semantic content that text cannot capture, such as prosody. My first thought: maybe people silently sound out the words in their head when rating word pairs...

Chung, Yu-An

Glass, James

https://arxiv.org/pdf/1803.08976.pdf

They present a method to capture semantic meaning in acoustic word embeddings. The method uses a seq2seq model trained with skipgram or continuous bag-of-words objectives.

Their inspiration comes from word2vec, which learns embeddings that capture semantic relations between textual words. Unlike word2vec, they train the embeddings directly on speech.

Seq2Seq: encoder, decoder+attention

Skipgram training: the encoder embeds the center word's speech segment into a vector (the embedding); the decoder predicts the left and right neighbor word segments.

Continuous bag-of-words training: the encoder embeds the left and right neighbor word segments, and the decoder predicts the center word segment. The sum of the encoder's last-state vectors is the embedding.

The embeddings of different spoken instances of the same word are averaged to obtain a single vector per word.
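This per-word averaging step can be sketched as follows (the data and the helper name `average_embeddings` are hypothetical, for illustration only):

```python
import numpy as np
from collections import defaultdict

def average_embeddings(instances):
    """instances: list of (word, vector) pairs, one per spoken occurrence.
    Returns a single averaged vector per word type."""
    buckets = defaultdict(list)
    for word, vec in instances:
        buckets[word].append(vec)
    return {w: np.mean(vs, axis=0) for w, vs in buckets.items()}

# Two spoken occurrences of "cat", one of "dog" (made-up 2-d vectors).
instances = [("cat", np.array([1.0, 0.0])),
             ("cat", np.array([0.0, 1.0])),
             ("dog", np.array([2.0, 2.0]))]
emb = average_embeddings(instances)
print(emb["cat"])  # [0.5 0.5]
```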

Training data: LibriSpeech 500 hrs, word-level segmented by forced alignment

Evaluation: 13 benchmarks containing word pairs whose similarity is human-rated. The similarity of an embedding pair is measured by cosine similarity, and the evaluation metric is the rank correlation between the human ratings and the cosine similarities. Speech2vec and word2vec are compared.
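The evaluation loop can be sketched as follows (the embeddings and human ratings are made up; such benchmarks conventionally use Spearman's rank correlation, implemented here without tie handling for brevity):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors
    (ties are ignored for simplicity)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# Made-up embeddings and human similarity ratings for three word pairs.
emb = {"cat":   np.array([1.0, 0.1]),
       "dog":   np.array([0.8, 0.3]),
       "car":   np.array([0.1, 1.0]),
       "truck": np.array([0.2, 0.9])}
pairs = [("cat", "dog", 9.0), ("cat", "car", 2.0), ("car", "truck", 8.5)]

model_sims = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_sims = [s for _, _, s in pairs]
print(spearman(model_sims, human_sims))  # → 0.5
```

The rank correlation rewards getting the ordering of the pairs right, not the absolute similarity values, which is why it is the standard metric for word-similarity benchmarks.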

Results: Speech2vec works better than word2vec, and skipgram training is better than cbow.

Remarks:

They claim that speech2vec works better than word2vec because it can capture semantic information that is not present in text, such as prosody. To me this is interesting because it suggests the information conveyed in speech can itself be semantic, e.g. prosody. It might also indicate that when raters label the similarity of textual word pairs, they actually pronounce them in their mind...

  • Original link: http://kuaibao.qq.com/s/20180405G05U7J00?refer=cp_1026
