Finance / Speech / Audio Processing Academic Digest [7.28]

WeChat official account: arXiv Daily Academic Digest

Published on 2021-07-29 14:09:58

This article is included in the column: arXiv Daily Academic Digest

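Visit www.arxivdaily.com for the digest with abstracts, covering CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, with search, bookmarking, and posting features.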

q-fin (Finance): 12 papers

cs.SD (Speech): 12 papers

eess.AS (Audio Processing): 11 papers

1. q-fin (Finance):

【1】 Mortality in Germany during the Covid-19 pandemic

Authors: Alois Pichler, Dana Uhlig
Comments: 19 pages, 7 figures
Link: https://arxiv.org/abs/2107.12899
Abstract: The Covid-19 pandemic still causes severe impacts on society and the economy. This paper studies excess mortality during the pandemic years 2020 and 2021 in Germany empirically with a special focus on the life insurer's perspective. Our conclusions are based on official counts of German governmental offices on the living and deaths of the entire population. Conclusions, relevant for actuaries and specific insurance business lines, including portfolios of pension, life, and health insurance contracts, are provided.

【2】 No free lunch for markets with multiple numéraires

Authors: Laurence Carassus
Link: https://arxiv.org/abs/2107.12885
Abstract: We consider a global market constituted by several submarkets, each with its own assets and numéraire. We provide theoretical foundations for the existence of equivalent martingale measures and results on superreplication prices which allow differences in features between submarkets to be taken into account.

【3】 LOB modeling using Hawkes processes with a state-dependent factor

Authors: Emmanouil Sfendourakis, Ioane Muni Toke
Affiliations: Laboratoire Mathématiques et Informatique pour la Complexité et les Systèmes, CentraleSupélec, Université Paris-Saclay, France
Comments: 28 pages, 8 figures, 1 table
Link: https://arxiv.org/abs/2107.12872
Abstract: A point process model for order flows in limit order books is proposed, in which the conditional intensity is the product of a Hawkes component and a state-dependent factor. In the LOB context, state observations may include the observed imbalance or the observed spread. Full technical details for the computationally-efficient estimation of such a process are provided, using either direct likelihood maximization or EM-type estimation. Applications include models for bid and ask market orders, or for upwards and downwards price movements. Empirical results on multiple stocks traded in Euronext Paris underline the benefits of state-dependent formulations for LOB modeling, e.g. in terms of goodness-of-fit to financial data.
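
As an illustration of the multiplicative structure described in this abstract, the following minimal sketch evaluates a conditional intensity formed as the product of a Hawkes component and a state-dependent factor. The exponential kernel, the parameter names (`mu`, `alpha`, `beta`), and the `factors` lookup keyed by spread state are assumptions made for illustration only; the abstract does not specify the kernel or the state encoding, and no estimation step is shown.

```python
import math

def hawkes_component(t, event_times, mu, alpha, beta):
    """Baseline rate plus exponentially decaying excitation from past events (assumed kernel)."""
    excitation = sum(alpha * math.exp(-beta * (t - s)) for s in event_times if s < t)
    return mu + excitation

def state_dependent_intensity(t, event_times, state, state_factor,
                              mu=0.5, alpha=0.8, beta=1.5):
    """Conditional intensity = state factor x Hawkes component."""
    return state_factor[state] * hawkes_component(t, event_times, mu, alpha, beta)

# Hypothetical multiplicative factors for two observed spread states.
factors = {"tight_spread": 1.4, "wide_spread": 0.6}
events = [0.2, 0.9, 1.1]  # past order arrival times

lam_tight = state_dependent_intensity(1.5, events, "tight_spread", factors)
lam_wide = state_dependent_intensity(1.5, events, "wide_spread", factors)
```

With the same event history, the two intensities differ only by the ratio of the state factors, which is the property a state-dependent formulation adds on top of a plain Hawkes model.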

【4】 Quasi-sure essential supremum and applications to finance

Authors: Laurence Carassus
Link: https://arxiv.org/abs/2107.12862
Abstract: A notion of essential supremum is developed when the uncertainty is measured by a family of non-dominated and non-compact probability measures. It provides new perspectives on super-replication and allows the Absence of Instantaneous Profit (AIP) to be characterized.

【5】 Information Foraging in the Attention Economy

Authors: Charlie Pilgrim, Weisi Guo, Thomas T. Hills
Comments: 20 pages, 9 figures
Link: https://arxiv.org/abs/2107.12848
Abstract: Over the past 200 years, rising rates of information proliferation have created new environments for information competition and, consequently, new selective forces on information evolution. These forces influence the information diet available to consumers, who in turn choose what to consume, creating a feedback process similar to that seen in many ecosystems. As a first step towards understanding this relationship, we apply animal foraging models of diet choice to describe the evolution of long and short form media in response to human preferences for maximising utility rate. The model describes an increase in information rate (i.e., entropy) in response to information proliferation, as well as differences in entropy between short-form and long-form media (such as social media and books, respectively). We find evidence for a steady increase in word entropy in diverse media categories since 1900, as well as an accelerated entropy increase in short-form media. Overall the evidence suggests an increasingly competitive battle for our attention that is having a lasting influence on the evolution of language and communication systems.
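
The abstract uses word entropy as its measure of information rate. A minimal sketch of one common definition, the Shannon entropy of the empirical word (unigram) distribution, is given below. The whitespace tokenization and the choice of a unigram estimator are assumptions; the abstract does not say which estimator the authors use.

```python
import math
from collections import Counter

def word_entropy(text):
    """Shannon entropy, in bits per word, of the empirical unigram distribution."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

repetitive = "the cat the cat the cat the cat"           # two equally likely words
varied = "the quick brown fox jumps over lazy dogs"      # eight distinct words

h_repetitive = word_entropy(repetitive)
h_varied = word_entropy(varied)
```

Text that spreads its probability mass over more distinct words scores higher, which is the sense in which a rise in word entropy indicates a higher information rate.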

【6】 Income Inequality and Intergenerational Mobility in India

Authors: Anuradha Singh
Affiliations: Economics and Finance Department, BITS Pilani Campus
Link: https://arxiv.org/abs/2107.12702
Abstract: Using three rounds of NSS datasets, the present paper attempts to understand the relationship between income inequality and intergenerational income mobility (IGIM) by segregating generations into social and income classes. The originality of the paper lies in assessing the IGIM using different approaches, which we expect to contribute to the existing literature. We conclude that the country has low income mobility and high inequality, which is no longer associated with a particular social class in India. Also, the two may have a negative or positive relationship, and hence need to be studied at a regional level.

【7】 Dhaka Water-logging: Causes, Effects and Remedial Policy Options

Authors: Hossain Ahmed Taufiq
Affiliations: Dept. of Global Studies & Governance, Independent University Bangladesh (IUB); Wageningen University, Netherlands
Comments: Proceedings: 'The New Megacity: For Whom?', Global Studies and Governance Department, Independent University, Bangladesh, 18 November 2019
Link: https://arxiv.org/abs/2107.12625
Abstract: Water-logging is a major challenge for Dhaka city, the capital of Bangladesh. The rapid, unregulated, and unplanned urbanization, as well as detrimental social, economic, infrastructural, and environmental consequences, not to mention diseases like dengue, challenge the several crash programs combating water-logging in the city. This study provides a brief contextual analysis of Dhaka's topography and its natural as well as storm water drainage systems, before concentrating on the man-made causes and effects of water-logging, ultimately exploring a few remedial measures.

【8】 They Chose to Not Tell You

Authors: Bruce Knuteson
Comments: 8 pages
Link: https://arxiv.org/abs/2107.12516
Abstract: The world's stock markets display a strikingly suspicious, decades-long pattern of overnight and intraday returns that nobody (other than us) has plausibly explained and that nobody (other than us) has clearly and persistently alerted you to. We use correspondence on this topic over the past five years to show that the silence of others on this issue does not arise from their having a good reason to believe this pattern is fine. Separately, and regardless of whether this pattern turns out to be fine, we have documented that people in a position to alert you to the presence of strikingly suspicious return patterns in the world's stock markets that nobody can innocuously explain are aware of this issue, have no good reason to believe it is not a problem, and chose to not tell you.

【9】 Robustness and sensitivity analyses for rough Volterra stochastic volatility models

Authors: Jan Matas, Jan Pospíšil
Affiliations: NTIS - New Technologies for the Information Society, University of West Bohemia, Plzeň, Czech Republic
Link: https://arxiv.org/abs/2107.12462
Abstract: In this paper we perform robustness and sensitivity analysis of several continuous-time rough Volterra stochastic volatility models with respect to the process of market calibration. Robustness is understood in the sense of sensitivity to changes in the option data structure. The latter analyses then should validate the hypothesis on the importance of roughness in the volatility process dynamics. An empirical study is performed on a data set of Apple Inc. equity options traded on four different days in April and May 2015. In particular, the results for RFSV, rBergomi and aRFSV models are provided.

【10】 Bitcoin option pricing: A market attention approach

Authors: Alvaro Guinea, Alet Roux
Link: https://arxiv.org/abs/2107.12447
Abstract: A model is proposed that models Bitcoin prices by taking into account market attention. Assuming that market attention follows a mean-reverting Cox-Ingersoll-Ross process and allowing it to influence Bitcoin returns (after some delay) leads to a tractable affine model with semi-closed formulae for European put and call prices. A maximum likelihood estimation procedure is proposed for this model. The accuracy of its call and put prices outperforms a number of standard models when tested on real data.
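
For readers unfamiliar with the Cox-Ingersoll-Ross (CIR) dynamics mentioned in this abstract, the sketch below simulates a mean-reverting CIR path, dX = kappa (theta - X) dt + sigma sqrt(X) dW, with an Euler scheme that truncates negative values (the "full truncation" variant). All parameter values are hypothetical, and the sketch covers only the attention process: the delayed effect on Bitcoin returns, the option pricing formulae, and the maximum likelihood estimation from the paper are not reproduced.

```python
import math
import random

def simulate_cir(x0, kappa, theta, sigma, dt, n_steps, seed=0):
    """Euler scheme with full truncation for dX = kappa*(theta - X) dt + sigma*sqrt(X) dW."""
    rng = random.Random(seed)
    x = x0
    path = [max(x0, 0.0)]
    for _ in range(n_steps):
        x_pos = max(x, 0.0)                      # truncate so the square root stays real
        dw = rng.gauss(0.0, math.sqrt(dt))       # Brownian increment over one step
        x = x + kappa * (theta - x_pos) * dt + sigma * math.sqrt(x_pos) * dw
        path.append(max(x, 0.0))                 # reported value is the positive part
    return path

# Hypothetical parameters: attention starts above its long-run level theta and reverts.
attention = simulate_cir(x0=2.0, kappa=3.0, theta=1.0, sigma=0.3, dt=0.01, n_steps=2000)

# With sigma = 0 the path is deterministic and decays towards theta.
noiseless = simulate_cir(x0=2.0, kappa=3.0, theta=1.0, sigma=0.0, dt=0.01, n_steps=2000)
```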

【11】 Proof of non-convergence of the short-maturity expansion for the SABR model

Authors: Alan L. Lewis, Dan Pirjol
Affiliations: School of Business, Stevens Institute of Technology
Comments: 18 pages, 7 figures
Link: https://arxiv.org/abs/2107.12439
Abstract: We study the convergence properties of the short maturity expansion of option prices in the uncorrelated log-normal ($\beta=1$) SABR model. In this model the option time-value can be represented as an integral of the form $V(T) = \int_{0}^\infty e^{-\frac{u^2}{2T}} g(u) du$ with $g(u)$ a "payoff function" which is given by an integral over the McKean kernel $G(s,t)$. We study the analyticity properties of the function $g(u)$ in the complex $u$-plane and show that it is holomorphic in the strip $|\Im(u)| < \pi$. Using this result we show that the $T$-series expansion of $V(T)$ and implied volatility are asymptotic (non-convergent for any $T>0$). In a certain limit which can be defined either as the large volatility limit $\sigma_0\to \infty$ at fixed $\omega=1$, or the small vol-of-vol limit $\omega\to 0$ at fixed $\omega\sigma_0$, the short maturity $T$-expansion for the implied volatility has a finite convergence radius $T_c = \frac{1.32}{\omega\sigma_0}$.

【12】 Constant Function Market Makers: Multi-Asset Trades via Convex Optimization

Authors: Guillermo Angeris, Akshay Agrawal, Alex Evans, Tarun Chitra, Stephen Boyd
Link: https://arxiv.org/abs/2107.12484
Abstract: The rise of Ethereum and other blockchains that support smart contracts has led to the creation of decentralized exchanges (DEXs), such as Uniswap, Balancer, Curve, mStable, and SushiSwap, which enable agents to trade cryptocurrencies without trusting a centralized authority. While traditional exchanges use order books to match and execute trades, DEXs are typically organized as constant function market makers (CFMMs). CFMMs accept and reject proposed trades based on the evaluation of a function that depends on the proposed trade and the current reserves of the DEX. For trades that involve only two assets, CFMMs are easy to understand, via two functions that give the quantity of one asset that must be tendered to receive a given quantity of the other, and vice versa. When more than two assets are being exchanged, it is harder to understand the landscape of possible trades. We observe that various problems of choosing a multi-asset trade can be formulated as convex optimization problems, and can therefore be reliably and efficiently solved.
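
The two-asset case mentioned in this abstract can be sketched with the simplest trading function, a constant product of reserves with no fee. The choice of the constant-product function, the absence of fees, and the function names below are assumptions for illustration; the paper treats general CFMMs and multi-asset trades via convex optimization, which this sketch does not attempt.

```python
def tender_required(reserve_in, reserve_out, amount_out):
    """Input amount to tender so that reserve_in * reserve_out stays constant
    after receiving amount_out of the other asset."""
    if not 0 <= amount_out < reserve_out:
        raise ValueError("amount_out must be non-negative and smaller than the reserve")
    return reserve_in * amount_out / (reserve_out - amount_out)

def amount_received(reserve_in, reserve_out, amount_in):
    """Inverse map: output amount received for tendering amount_in."""
    if amount_in < 0:
        raise ValueError("amount_in must be non-negative")
    return reserve_out * amount_in / (reserve_in + amount_in)

# Hypothetical pool holding 1000 units of asset A and 500 units of asset B.
needed = tender_required(1000.0, 500.0, 100.0)      # A to tender for 100 B
received = amount_received(1000.0, 500.0, needed)   # B received for that much A
```

The two functions are inverses of each other, and the product of the reserves is unchanged by the trade: (1000 + 250) * (500 - 100) equals 1000 * 500.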

2. cs.SD (Speech):

【1】 A Physiologically-adapted Gold Standard for Arousal During a Stress Induced Scenario

Authors: Alice Baird, Lukas Stappen, Lukas Christ, Lea Schumann, Eva-Maria Meßner, Björn W. Schuller
Affiliations: Chair EIHW, University of Augsburg, Augsburg, Germany; KPP, University of Ulm, Ulm, Germany; GLAM, Imperial College London, London, United Kingdom
Link: https://arxiv.org/abs/2107.12964
Abstract: Emotion is an inherently subjective psychophysiological human-state and to produce an agreed-upon representation (gold standard) for continuous emotion requires a time-consuming and costly training procedure of multiple human annotators. There is strong evidence in the literature that physiological signals are sufficient objective markers for states of emotion, particularly arousal. In this contribution, we utilise a dataset which includes continuous emotion and physiological signals - Heartbeats per Minute (BPM), Electrodermal Activity (EDA), and Respiration-rate - captured during a stress induced scenario (Trier Social Stress Test). We utilise a Long Short-Term Memory, Recurrent Neural Network to explore the benefit of fusing these physiological signals with arousal as the target, learning from various audio, video, and textual based features. We utilise the state-of-the-art MuSe-Toolbox to consider both annotation delay and inter-rater agreement weighting when fusing the target signals. An improvement in Concordance Correlation Coefficient (CCC) is seen across feature sets when fusing EDA with arousal, compared to the arousal only gold standard results. Additionally, BERT-based textual features' results improved for arousal plus all physiological signals, obtaining up to .3344 CCC compared to .2118 CCC for arousal only. Multimodal fusion also improves overall CCC with audio plus video features obtaining up to .6157 CCC to recognize arousal plus EDA and BPM.
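
The results above are reported as Concordance Correlation Coefficient (CCC) values. As a reference for how that metric behaves, here is a minimal sketch of the standard definition, CCC = 2 cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), using population moments. The function name and the toy sequences are illustrative assumptions and are unrelated to the paper's data or toolbox.

```python
def concordance_cc(pred, gold):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    return 2 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)

gold = [1.0, 2.0, 3.0, 4.0]
perfect = concordance_cc(gold, gold)                     # identical traces
shifted = concordance_cc([2.0, 3.0, 4.0, 5.0], gold)     # same shape, constant offset
```

Unlike Pearson correlation, CCC penalizes a constant offset between prediction and gold standard, so the shifted trace scores below 1 even though it is perfectly correlated.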

【2】 Audio-to-Score Alignment Using Deep Automatic Music Transcription

Authors: Federico Simonetta, Stavros Ntalampiras, Federico Avanzini
Affiliations: LIM - Music Informatics Laboratory, Department of Computer Science, University of Milan
Comments: IEEE MMSP 2021
Link: https://arxiv.org/abs/2107.12854
Abstract: Audio-to-score alignment (A2SA) is a multimodal task consisting in the alignment of audio signals to music scores. Recent literature confirms the benefits of Automatic Music Transcription (AMT) for A2SA at the frame-level. In this work, we aim to elaborate on the exploitation of AMT Deep Learning (DL) models for achieving alignment at the note-level. We propose a method which benefits from HMM-based score-to-score alignment and AMT, showing a remarkable advancement beyond the state-of-the-art. We design a systematic procedure to take advantage of large datasets which do not offer an aligned score. Finally, we perform a thorough comparison and extensive tests on multiple datasets.

【3】 Multi-modal estimation of the properties of containers and their content: survey and evaluation

Authors: Alessio Xompero, Santiago Donaher, Vladimir Iashin, Francesca Palermo, Gökhan Solak, Claudio Coppola, Reina Ishikawa, Yuichi Nagao, Ryo Hachiuma, Qi Liu, Fan Feng, Chuanlin Lan, Rosa H. M. Chan, Guilherme Christmann, Jyun-Ting Song, Gonuguntla Neeharika, Chinnakotla Krishna Teja Reddy, Dinesh Jain, Bakhtawar Ur Rehman, Andrea Cavallaro
Affiliations: Keio University (partial list)
Comments: 13 pages, 9 tables, 5 figures, submitted to IEEE Transactions on Multimedia
Link: https://arxiv.org/abs/2107.12719
Abstract: Acoustic and visual sensing can support the contactless estimation of the weight of a container and the amount of its content when the container is manipulated by a person. However, transparencies (both of the container and of the content) and the variability of materials, shapes and sizes make this problem challenging. In this paper, we present an open benchmarking framework and an in-depth comparative analysis of recent methods that estimate the capacity of a container, as well as the type, mass, and amount of its content. These methods use learned and handcrafted features, such as mel-frequency cepstrum coefficients, zero-crossing rate, spectrograms, with different types of classifiers to estimate the type and amount of the content with acoustic data, and geometric approaches with visual data to determine the capacity of the container. Results on a newly distributed dataset show that audio alone is a strong modality and methods achieve a weighted average F1-score up to 81% and 97% for content type and level classification, respectively. Estimating the container capacity with vision-only approaches and filling mass with multi-modal, multi-stage algorithms reaches up to 65% weighted average capacity and mass scores.
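
Among the handcrafted audio features listed in this abstract, the zero-crossing rate is the simplest to state. The sketch below computes it for a single frame as the fraction of adjacent sample pairs whose signs differ. The function name and the toy frames are assumptions for illustration; the benchmarked methods may use a library implementation with different framing or normalization.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs in the frame whose signs differ."""
    if len(frame) < 2:
        raise ValueError("frame must contain at least two samples")
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

alternating = [1.0, -1.0, 1.0, -1.0]   # sign flips at every step
monotone = [0.5, 1.0, 1.5]             # never changes sign

zcr_high = zero_crossing_rate(alternating)
zcr_low = zero_crossing_rate(monotone)
```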

【4】 Making grains tangible: microtouch for microsound

Authors: Staas de Jong
Affiliations: LIACS, Leiden University, Niels Bohrweg, Leiden
Comments: Proceedings of the International Conference on New Interfaces for Musical Expression, 2011
Link: https://arxiv.org/abs/2107.12714
Abstract: This paper proposes a new research direction for the large family of instrumental musical interfaces where sound is generated using digital granular synthesis, and where interaction and control involve the (fine) operation of stiff, flat contact surfaces. First, within a historical context, a general absence of, and clear need for, tangible output that is dynamically instantiated by the grain-generating process itself is identified. Second, to fill this gap, a concrete general approach is proposed based on the careful construction of non-vibratory and vibratory force pulses, in a one-to-one relationship with sonic grains. An informal pilot psychophysics experiment initiating the approach was conducted, which took into account the two main cases for applying forces to the human skin: perpendicular, and lateral. Initial results indicate that the force pulse approach can enable perceivably multidimensional, tangible display of the ongoing grain-generating process. Moreover, it was found that this can be made to meaningfully happen (in real time) in the same timescale of basic sonic grain generation. This is not a trivial property, and provides an important and positive fundament for further developing this type of enhanced display. It also leads to the exciting prospect of making arbitrary sonic grains actual physical manipulanda.

【5】 The cyclotactor: towards a tactile platform for musical interaction

Authors: Staas de Jong
Affiliations: LIACS, Leiden University
Comments: Proceedings of the International Conference on New Interfaces for Musical Expression, 2008
Link: https://arxiv.org/abs/2107.12704
Abstract: This paper reports on work in progress on a finger-based tactile I/O device for musical interaction. Central to the device is the ability to set up cyclical relationships between tactile input and output. A direct practical application of this to musical interaction is given, using the idea to multiplex two degrees of freedom on a single tactile loop.

【6】 Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis

Authors: Shifeng Pan, Lei He
Affiliations: Microsoft, China
Comments: in Proceedings of INTERSPEECH 2021
Link: https://arxiv.org/abs/2107.12562
Abstract: Cross-speaker style transfer is crucial to the applications of multi-style and expressive speech synthesis at scale. It does not require the target speakers to be experts in expressing all styles and to collect corresponding recordings for model training. However, the performances of existing style transfer methods are still far behind real application needs. The root causes are mainly twofold. Firstly, the style embedding extracted from single reference speech can hardly provide fine-grained and appropriate prosody information for arbitrary text to synthesize. Secondly, in these models the content/text, prosody, and speaker timbre are usually highly entangled, so it is not realistic to expect a satisfactory result when freely combining these components, such as to transfer speaking style between speakers. In this paper, we propose a cross-speaker style transfer text-to-speech (TTS) model with explicit prosody bottleneck. The prosody bottleneck builds up the kernels accounting for speaking style robustly, and disentangles the prosody from content and speaker timbre, and therefore guarantees high quality cross-speaker style transfer. Evaluation results show that the proposed method even achieves on-par performance with the source speaker's speaker-dependent (SD) model in objective measurement of prosody, and significantly outperforms the cycle consistency and GMVAE-based baselines in objective and subjective evaluations.

【7】 Energy-based Unknown Intent Detection with Data Manipulation

Authors: Yawen Ouyang, Jiasheng Ye, Yu Chen, Xinyu Dai, Shujian Huang, Jiajun Chen
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University, China
Comments: 10 pages, 4 figures, accepted by Findings of ACL-IJCNLP 2021
Link: https://arxiv.org/abs/2107.12542
Abstract: Unknown intent detection aims to identify the out-of-distribution (OOD) utterance whose intent has never appeared in the training set. In this paper, we propose using energy scores for this task as the energy score is theoretically aligned with the density of the input and can be derived from any classifier. However, high-quality OOD utterances are required during the training stage in order to shape the energy gap between OOD and in-distribution (IND), and these utterances are difficult to collect in practice. To tackle this problem, we propose a data manipulation framework to Generate high-quality OOD utterances with importance weighTs (GOT). Experimental results show that the energy-based detector fine-tuned by GOT can achieve state-of-the-art results on two benchmark datasets.
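
The abstract notes that an energy score can be derived from any classifier. A minimal sketch of the usual construction follows: the energy is the negative, temperature-scaled log-sum-exp of the classifier logits, and an utterance is flagged as unknown when its energy exceeds a threshold. The threshold value, the example logits, and the helper names are hypothetical, and the sketch does not include the GOT data manipulation or the fine-tuning described in the paper.

```python
import math

def energy_score(logits, temperature=1.0):
    """E(x) = -T * log(sum_k exp(logit_k / T)), computed stably by factoring out the max."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))

def is_unknown_intent(logits, threshold):
    """Flag an utterance as out-of-distribution when its energy is above the threshold."""
    return energy_score(logits) > threshold

confident = [9.0, 0.5, -1.0]   # one intent clearly dominates: low energy
flat = [0.1, 0.0, -0.1]        # no intent stands out: higher energy

threshold = -5.0               # hypothetical cut-off chosen for this toy example
flag_confident = is_unknown_intent(confident, threshold)
flag_flat = is_unknown_intent(flat, threshold)
```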

【8】 TaikoNation: Patterning-focused Chart Generation for Rhythm Action Games

Authors: Emily Halina, Matthew Guzdial
Affiliations: University of Alberta, Edmonton, Canada
Link: https://arxiv.org/abs/2107.12506
Abstract: Generating rhythm game charts from songs via machine learning has been a problem of increasing interest in recent years. However, all existing systems struggle to replicate human-like patterning: the placement of game objects in relation to each other to form congruent patterns based on events in the song. Patterning is a key identifier of high quality rhythm game content, seen as a necessary component in human rankings. We establish a new approach for chart generation that produces charts with more congruent, human-like patterning than seen in prior work.

【9】 Improving Word Recognition in Speech Transcriptions by Decision-level Fusion of Stemming and Two-way Phoneme Pruning

Authors: Sunakshi Mehra, Seba Susan
Affiliations: Department of Information Technology, Shahbad Daulatpur, Main Bawana Road, Delhi, India
Comments: Accepted in International Advanced Computing Conference (2020)
Link: https://arxiv.org/abs/2107.12428
Abstract: We introduce an unsupervised approach for correcting highly imperfect speech transcriptions based on a decision-level fusion of stemming and two-way phoneme pruning. Transcripts are acquired from videos by extracting audio using the FFmpeg framework and further converting audio to text transcript using Google API. In the benchmark LRW dataset, there are 500 word categories, and 50 videos per class in mp4 format. All videos consist of 29 frames (each 1.16 s long) and the word appears in the middle of the video. In our approach we tried to improve the baseline accuracy from 9.34% by using stemming, phoneme extraction, filtering and pruning. After applying the stemming algorithm to the text transcript and evaluating the results, we achieved 23.34% accuracy in word recognition. To convert words to phonemes we used the Carnegie Mellon University (CMU) pronouncing dictionary that provides a phonetic mapping of English words to their pronunciations. A two-way phoneme pruning is proposed that comprises two non-sequential steps: 1) filtering and pruning the phonemes containing vowels and plosives; 2) filtering and pruning the phonemes containing vowels and fricatives. After obtaining results of stemming and two-way phoneme pruning, we applied decision-level fusion, which led to an improvement of word recognition rate up to 32.96%.

【10】 End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection

Authors: Hemlata Tak, Jee-weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, Nicholas Evans
Affiliations: EURECOM, Sophia Antipolis, France; Naver Corporation, South Korea
Comments: Submitted to ASVspoof 2021 Workshop
Link: https://arxiv.org/abs/2107.12710
Abstract: Artefacts that serve to distinguish bona fide speech from spoofed or deepfake speech are known to reside in specific subbands and temporal segments. Various approaches can be used to capture and model such artefacts; however, none works well across a spectrum of diverse spoofing attacks. Reliable detection then often depends upon the fusion of multiple detection systems, each tuned to detect different forms of attack. In this paper we show that better performance can be achieved when the fusion is performed within the model itself and when the representation is learned automatically from raw waveform inputs. The principal contribution is a spectro-temporal graph attention network (GAT) which learns the relationship between cues spanning different sub-bands and temporal intervals. Using a model-level graph fusion of spectral (S) and temporal (T) sub-graphs and a graph pooling strategy to improve discrimination, the proposed RawGAT-ST model achieves an equal error rate of 1.06% for the ASVspoof 2019 logical access database. This is one of the best results reported to date and is reproducible using an open source implementation.

【11】 An Optimal Piezoelectric Beam for Acoustic Energy Harvesting

Authors: Amir Panahi, Alireza Hassanzadeh, Ali Moulavi, Ata Golparvar
Affiliations: Shahid Beheshti University, Tehran, Iran; Sabanci University, Istanbul, Turkey
Comments: 10 pages, 6 figures, 2 tables
Link: https://arxiv.org/abs/2107.12671
Abstract: This study presents a novel piezoelectric beam structure for acoustic energy harvesting. The beams have been designed to maximize output energy in areas where the noise level is loud such as highway traffic. The beam consists of two layers of copper and polyvinylidene fluoride that convert the ambient noise's vibration energy to electrical energy. The piezoelectric material's optimum placement has been studied, and its best position is obtained on the substrate for the maximum yield. Unlike previous studies, in which the entire beam substrate used to be covered by a material, this study presents a modest material usage and contributes to lowering the harvester's final production cost. Additionally, in this study, an electrical model was developed for the sensor and a read-out circuitry was proposed for the converter. Moreover, the sensor was validated at different noise levels at various lengths and locations. The simulations were performed in COMSOL Multiphysics and MATLAB and report a maximum sound pressure of 140 dB from 100 dB point sources in an enclosed air-filled cubic meter chamber.

【12】 Microphone Array Generalization for Multichannel Narrowband Deep Speech Enhancement

Authors: Siyuan Zhang, Xiaofei Li Affiliations: School of Engineering, Westlake University, Hangzhou, China; Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, China Comments: Submitted to Interspeech Conference 2021 Link: https://arxiv.org/abs/2107.12601 Abstract: This paper addresses the problem of microphone array generalization for deep-learning-based end-to-end multichannel speech enhancement. We aim to train a single deep neural network (DNN) that can perform well on unseen microphone arrays. When training on a fixed microphone array, the array geometry shapes the network's parameters and thus restricts the generalization of the trained network to another microphone array. To resolve this problem, a single network is trained using data recorded by various microphone arrays of different geometries. We design three variants of our recently proposed narrowband network that are agnostic to the number of microphones. Overall, the goal is to make the network learn the universal information for speech enhancement that is available for any array geometry, rather than the characteristics of one particular array. Experiments on both simulated and real room impulse responses (RIR) demonstrate the excellent cross-array generalization capability of the proposed networks, in the sense that their performance measures are very close to, or even exceed, those of the network trained with test arrays. Moreover, they notably outperform various beamforming methods and other advanced deep-learning-based methods.

3. eess.AS (Audio and Speech Processing):

【1】 End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection

Authors: Hemlata Tak, Jee-weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, Nicholas Evans Affiliations: EURECOM, Sophia Antipolis, France; Naver Corporation, South Korea Comments: Submitted to ASVspoof 2021 Workshop Link: https://arxiv.org/abs/2107.12710 Abstract: Artefacts that serve to distinguish bona fide speech from spoofed or deepfake speech are known to reside in specific subbands and temporal segments. Various approaches can be used to capture and model such artefacts; however, none works well across a spectrum of diverse spoofing attacks. Reliable detection then often depends upon the fusion of multiple detection systems, each tuned to detect different forms of attack. In this paper we show that better performance can be achieved when the fusion is performed within the model itself and when the representation is learned automatically from raw waveform inputs. The principal contribution is a spectro-temporal graph attention network (GAT) which learns the relationship between cues spanning different sub-bands and temporal intervals. Using a model-level graph fusion of spectral (S) and temporal (T) sub-graphs and a graph pooling strategy to improve discrimination, the proposed RawGAT-ST model achieves an equal error rate of 1.06% for the ASVspoof 2019 logical access database. This is one of the best results reported to date and is reproducible using an open-source implementation.

【2】 An Optimal Piezoelectric Beam for Acoustic Energy Harvesting

Authors: Amir Panahi, Alireza Hassanzadeh, Ali Moulavi, Ata Golparvar Affiliations: Shahid Beheshti University, Tehran, Iran; Sabanci University, Istanbul, Turkey Comments: 10 pages, 6 figures, 2 tables Link: https://arxiv.org/abs/2107.12671 Abstract: This study presents a novel piezoelectric beam structure for acoustic energy harvesting. The beam is designed to maximize output energy in areas with high noise levels, such as near highway traffic. It consists of two layers, copper and polyvinylidene fluoride, that convert the vibration energy of ambient noise into electrical energy. The optimum placement of the piezoelectric material on the substrate has been studied, and the position giving the maximum yield has been obtained. Unlike previous studies, in which the entire beam substrate was covered by a material, this study uses material sparingly and thereby contributes to lowering the harvester's final production cost. Additionally, an electrical model was developed for the sensor and a read-out circuit was proposed for the converter. Moreover, the sensor was validated at different noise levels, lengths, and locations. The simulations were performed in COMSOL Multiphysics and MATLAB and report a maximum sound pressure of 140 dB from 100 dB point sources in an enclosed air-filled cubic meter chamber.

【3】 Microphone Array Generalization for Multichannel Narrowband Deep Speech Enhancement

Authors: Siyuan Zhang, Xiaofei Li Affiliations: School of Engineering, Westlake University, Hangzhou, China; Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, China Comments: Submitted to Interspeech Conference 2021 Link: https://arxiv.org/abs/2107.12601 Abstract: This paper addresses the problem of microphone array generalization for deep-learning-based end-to-end multichannel speech enhancement. We aim to train a single deep neural network (DNN) that can perform well on unseen microphone arrays. When training on a fixed microphone array, the array geometry shapes the network's parameters and thus restricts the generalization of the trained network to another microphone array. To resolve this problem, a single network is trained using data recorded by various microphone arrays of different geometries. We design three variants of our recently proposed narrowband network that are agnostic to the number of microphones. Overall, the goal is to make the network learn the universal information for speech enhancement that is available for any array geometry, rather than the characteristics of one particular array. Experiments on both simulated and real room impulse responses (RIR) demonstrate the excellent cross-array generalization capability of the proposed networks, in the sense that their performance measures are very close to, or even exceed, those of the network trained with test arrays. Moreover, they notably outperform various beamforming methods and other advanced deep-learning-based methods.

【4】 Audio-to-Score Alignment Using Deep Automatic Music Transcription

Authors: Federico Simonetta, Stavros Ntalampiras, Federico Avanzini Affiliations: LIM, Music Informatics Laboratory, Department of Computer Science, University of Milan Comments: IEEE MMSP 2021 Link: https://arxiv.org/abs/2107.12854 Abstract: Audio-to-score alignment (A2SA) is a multimodal task that consists of aligning audio signals to music scores. Recent literature confirms the benefits of Automatic Music Transcription (AMT) for A2SA at the frame level. In this work, we elaborate on the exploitation of AMT Deep Learning (DL) models for achieving alignment at the note level. We propose a method that benefits from HMM-based score-to-score alignment and AMT, showing a remarkable advancement beyond the state of the art. We design a systematic procedure to take advantage of large datasets that do not offer an aligned score. Finally, we perform a thorough comparison and extensive tests on multiple datasets.

【5】 Multi-modal estimation of the properties of containers and their content: survey and evaluation

Authors: Alessio Xompero, Santiago Donaher, Vladimir Iashin, Francesca Palermo, Gökhan Solak, Claudio Coppola, Reina Ishikawa, Yuichi Nagao, Ryo Hachiuma, Qi Liu, Fan Feng, Chuanlin Lan, Rosa H. M. Chan, Guilherme Christmann, Jyun-Ting Song, Gonuguntla Neeharika, Chinnakotla Krishna Teja Reddy, Dinesh Jain, Bakhtawar Ur Rehman, Andrea Cavallaro Affiliations (partial): Keio University Comments: 13 pages, 9 tables, 5 figures, submitted to IEEE Transactions on Multimedia Link: https://arxiv.org/abs/2107.12719 Abstract: Acoustic and visual sensing can support the contactless estimation of the weight of a container and the amount of its content when the container is manipulated by a person. However, transparencies (both of the container and of the content) and the variability of materials, shapes and sizes make this problem challenging. In this paper, we present an open benchmarking framework and an in-depth comparative analysis of recent methods that estimate the capacity of a container, as well as the type, mass, and amount of its content. These methods use learned and handcrafted features, such as mel-frequency cepstrum coefficients, zero-crossing rate, and spectrograms, with different types of classifiers to estimate the type and amount of the content from acoustic data, and geometric approaches with visual data to determine the capacity of the container. Results on a newly distributed dataset show that audio alone is a strong modality: methods achieve a weighted average F1-score of up to 81% and 97% for content type and level classification, respectively. Estimating the container capacity with vision-only approaches and the filling mass with multi-modal, multi-stage algorithms reaches up to 65% weighted average capacity and mass scores.
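The "weighted average F1-score" reported here is commonly computed as the per-class F1 averaged with weights proportional to each class's number of true samples. The sketch below illustrates that definition; the benchmark's exact scoring protocol is not given in the abstract, so this is an assumption, and the function name and the content labels in the example are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        score += (n / total) * f1
    return score


# Hypothetical content-type labels, for illustration only.
truth = ["water", "water", "water", "rice"]
predicted = ["water", "water", "rice", "rice"]
print(f"weighted F1 = {weighted_f1(truth, predicted):.3f}")
```

Weighting by support keeps a rare class from dominating the average, at the cost of hiding poor performance on that class; an unweighted (macro) average would expose it.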

【6】 Making grains tangible: microtouch for microsound

Authors: Staas de Jong Affiliations: LIACS, Leiden University, Niels Bohrweg, Leiden Comments: Proceedings of the International Conference on New Interfaces for Musical Expression, 2011 Link: https://arxiv.org/abs/2107.12714 Abstract: This paper proposes a new research direction for the large family of instrumental musical interfaces where sound is generated using digital granular synthesis, and where interaction and control involve the (fine) operation of stiff, flat contact surfaces. First, within a historical context, a general absence of, and clear need for, tangible output that is dynamically instantiated by the grain-generating process itself is identified. Second, to fill this gap, a concrete general approach is proposed based on the careful construction of non-vibratory and vibratory force pulses, in a one-to-one relationship with sonic grains. An informal pilot psychophysics experiment initiating the approach was conducted, which took into account the two main cases for applying forces to the human skin: perpendicular and lateral. Initial results indicate that the force pulse approach can enable a perceivably multidimensional, tangible display of the ongoing grain-generating process. Moreover, it was found that this can be made to happen meaningfully (in real time) on the same timescale as basic sonic grain generation. This is not a trivial property, and it provides an important and positive foundation for further developing this type of enhanced display. It also leads to the exciting prospect of making arbitrary sonic grains actual physical manipulanda.

【7】 The cyclotactor: towards a tactile platform for musical interaction

Authors: Staas de Jong Affiliations: LIACS, Leiden University Comments: Proceedings of the International Conference on New Interfaces for Musical Expression, 2008 Link: https://arxiv.org/abs/2107.12704 Abstract: This paper reports on work in progress on a finger-based tactile I/O device for musical interaction. Central to the device is the ability to set up cyclical relationships between tactile input and output. A direct practical application of this to musical interaction is given, using the idea to multiplex two degrees of freedom on a single tactile loop.

【8】 Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis

Authors: Shifeng Pan, Lei He Affiliations: Microsoft, China Comments: in Proceedings of INTERSPEECH 2021 Link: https://arxiv.org/abs/2107.12562 Abstract: Cross-speaker style transfer is crucial to the application of multi-style and expressive speech synthesis at scale. It does not require the target speakers to be experts in expressing all styles, nor does it require collecting corresponding recordings for model training. However, the performance of existing style transfer methods is still far behind real application needs. The root causes are mainly twofold. Firstly, the style embedding extracted from a single reference utterance can hardly provide fine-grained and appropriate prosody information for synthesizing arbitrary text. Secondly, in these models the content/text, prosody, and speaker timbre are usually highly entangled; it is therefore unrealistic to expect a satisfactory result when freely combining these components, for example when transferring speaking style between speakers. In this paper, we propose a cross-speaker style transfer text-to-speech (TTS) model with an explicit prosody bottleneck. The prosody bottleneck robustly builds up the kernels accounting for speaking style and disentangles the prosody from content and speaker timbre, and therefore guarantees high-quality cross-speaker style transfer. Evaluation results show that the proposed method even achieves on-par performance with the source speaker's speaker-dependent (SD) model in objective measurement of prosody, and significantly outperforms the cycle consistency and GMVAE-based baselines in objective and subjective evaluations.

【9】 Energy-based Unknown Intent Detection with Data Manipulation

Authors: Yawen Ouyang, Jiasheng Ye, Yu Chen, Xinyu Dai, Shujian Huang, Jiajun Chen Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University, China Comments: 10 pages, 4 figures, accepted by Findings of ACL-IJCNLP 2021 Link: https://arxiv.org/abs/2107.12542 Abstract: Unknown intent detection aims to identify out-of-distribution (OOD) utterances whose intent has never appeared in the training set. In this paper, we propose using energy scores for this task, as the energy score is theoretically aligned with the density of the input and can be derived from any classifier. However, high-quality OOD utterances are required during the training stage in order to shape the energy gap between OOD and in-distribution (IND) utterances, and these utterances are difficult to collect in practice. To tackle this problem, we propose a data manipulation framework to Generate high-quality OOD utterances with importance weighTs (GOT). Experimental results show that the energy-based detector fine-tuned by GOT can achieve state-of-the-art results on two benchmark datasets.
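The abstract notes that an energy score can be derived from any classifier. A minimal sketch of the usual definition from the energy-based OOD literature follows: the negative log-sum-exp of the classifier logits, with higher energy suggesting an out-of-distribution input. The abstract itself does not give the formula, so treating this as the paper's score is an assumption; the function names, the threshold value, and the example logits are hypothetical.

```python
import math


def energy_score(logits, temperature=1.0):
    """E(x) = -T * log(sum_k exp(logit_k / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    # Subtract the maximum before exponentiating to avoid overflow.
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * log_sum_exp


def is_unknown_intent(logits, threshold, temperature=1.0):
    """Flag an utterance as OOD when its energy exceeds the threshold."""
    return energy_score(logits, temperature) > threshold


# Hypothetical logits over three known intents.
confident = [10.0, 0.0, 0.0]  # one intent clearly dominates
uncertain = [0.1, 0.0, 0.0]   # no intent stands out
for name, logits in (("confident", confident), ("uncertain", uncertain)):
    print(name, round(energy_score(logits), 3), is_unknown_intent(logits, threshold=-5.0))
```

In practice the threshold would be tuned on held-out data; the value used here is only for demonstration.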

【10】 TaikoNation: Patterning-focused Chart Generation for Rhythm Action Games

Authors: Emily Halina, Matthew Guzdial Affiliations: University of Alberta, Edmonton, Canada Link: https://arxiv.org/abs/2107.12506 Abstract: Generating rhythm game charts from songs via machine learning has been a problem of increasing interest in recent years. However, all existing systems struggle to replicate human-like patterning: the placement of game objects in relation to each other to form congruent patterns based on events in the song. Patterning is a key identifier of high-quality rhythm game content, seen as a necessary component in human rankings. We establish a new approach for chart generation that produces charts with more congruent, human-like patterning than seen in prior work.

【11】 Improving Word Recognition in Speech Transcriptions by Decision-level Fusion of Stemming and Two-way Phoneme Pruning

Authors: Sunakshi Mehra, Seba Susan Affiliations: Department of Information Technology, Shahbad Daulatpur, Main Bawana Road, Delhi, India Comments: Accepted in International Advanced Computing Conference (2020) Link: https://arxiv.org/abs/2107.12428 Abstract: We introduce an unsupervised approach for correcting highly imperfect speech transcriptions based on a decision-level fusion of stemming and two-way phoneme pruning. Transcripts are acquired from videos by extracting the audio using the FFmpeg framework and then converting the audio to a text transcript using the Google API. The benchmark LRW dataset has 500 word categories, with 50 videos per class in mp4 format. All videos consist of 29 frames (each video is 1.16 s long), and the word appears in the middle of the video. In our approach we tried to improve on the baseline accuracy of 9.34% by using stemming, phoneme extraction, filtering and pruning. After applying the stemming algorithm to the text transcript and evaluating the results, we achieved 23.34% accuracy in word recognition. To convert words to phonemes we used the Carnegie Mellon University (CMU) pronouncing dictionary, which provides a phonetic mapping of English words to their pronunciations. A two-way phoneme pruning is proposed that comprises two non-sequential steps: 1) filtering and pruning the phonemes containing vowels and plosives; 2) filtering and pruning the phonemes containing vowels and fricatives. After obtaining the results of stemming and two-way phoneme pruning, we applied decision-level fusion, which improved the word recognition rate up to 32.96%.


Originally published: 2021-07-28. In case of copyright infringement, contact cloudcommunity@tencent.com for removal.
