This paper's method uses two objectives: onset and frame. During training, both losses are minimized jointly; at inference, the onset predictions constrain the frame-level pitch predictions. The results are far better than the previous state of the art.
Colab demo: https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/magenta/onsets_frames_transcription/onsets_frames_transcription.ipynb
Two-task, two-objective learning for piano transcription.
Our previously reviewed works only use a NN to predict the pitch at the frame level. This work predicts both pitch and onset by jointly minimizing the two losses.
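A minimal numpy sketch of what "jointly minimizing the two losses" means: one binary cross-entropy term per task, summed into a single training objective. This is my own illustration, not the paper's TensorFlow code; the function names and the unweighted sum are assumptions.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between sigmoid outputs and 0/1 targets."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(np.mean(-(target * np.log(pred) + (1 - target) * np.log(1 - pred))))

def joint_loss(frame_pred, frame_target, onset_pred, onset_target):
    """Joint objective: frame loss and onset loss minimized together."""
    return bce(frame_pred, frame_target) + bce(onset_pred, onset_target)
```

Because the two losses share one backward pass, the conv/BLSTM weights are pushed to serve both tasks at once rather than frame accuracy alone.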
At inference, they add some restrictions, e.g. an activation from the frame detector is only allowed to start a note when an onset is predicted in that frame. The frame loss is also weighted according to the distance between the current frame and the onset frame.
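The onset-gated decoding rule can be sketched as below. This is a simplified numpy version under my own assumptions (a single shared threshold, per-pitch independence), not the paper's exact inference code:

```python
import numpy as np

def decode(frame_probs, onset_probs, thresh=0.5):
    """Onset-gated decoding over a (time, pitch) grid of probabilities.

    A note may only START in a frame where the onset detector fires;
    once started, it sustains while the frame probability stays high."""
    T, P = frame_probs.shape
    active = np.zeros((T, P), dtype=bool)
    for p in range(P):
        on = False
        for t in range(T):
            if frame_probs[t, p] >= thresh:
                if onset_probs[t, p] >= thresh:
                    on = True          # onset present: note is allowed to start
                # otherwise keep the previous state: sustain only if already on
            else:
                on = False             # frame detector off: note ends
            active[t, p] = on
    return active
```

A frame activation with no preceding onset is simply discarded, which is what suppresses the spurious short notes that frame-only models produce.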
On all evaluation metrics, (1) frame, (2) note, and (3) note with offset, this method is much better than the previous state of the art.
(1) The input representation of the NN is not a small frame context, but 20 seconds segments.
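Segmenting into fixed 20-second windows might look like the sketch below; the 16 kHz sample rate and zero-padding of the last segment are my assumptions, not details from the notes:

```python
import numpy as np

SAMPLE_RATE = 16000      # assumed sample rate
SEGMENT_SECONDS = 20     # segment length fed to the NN

def split_into_segments(audio):
    """Split a 1-D audio array into 20-second segments, zero-padding the last."""
    seg_len = SAMPLE_RATE * SEGMENT_SECONDS
    n_segs = int(np.ceil(len(audio) / seg_len))
    padded = np.zeros(n_segs * seg_len, dtype=audio.dtype)
    padded[: len(audio)] = audio
    return padded.reshape(n_segs, seg_len)
```

The long context lets the BLSTM carry note state across many frames, which a small frame context cannot.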
(2) There is a network connection from the onset output representation to the frame BLSTM. I guess the intuition behind this is to combine the onset features with the frame features.
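In shape terms, that connection amounts to concatenating the onset head's output onto the frame branch's features before the frame BLSTM. A tiny numpy sketch; the sizes (100 frames, 64 conv channels, 88 pitches) are hypothetical:

```python
import numpy as np

# Hypothetical sizes: T frames, F conv-feature channels, P piano pitches.
T, F, P = 100, 64, 88

onset_probs = np.random.rand(T, P)      # sigmoid output of the onset head
frame_features = np.random.rand(T, F)   # conv-stack features of the frame branch

# The onset head's output is concatenated onto the frame features before
# the frame BLSTM, so frame-level decisions can condition on detected onsets.
blstm_input = np.concatenate([frame_features, onset_probs], axis=1)
```

So even during training, the frame branch can learn to trust (or discount) the onset detector's opinion per pitch.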
(3) They didn't share the conv layers between the frame and onset branches. Apparently, this allows learning features specialized for each of the two tasks.
(4) The performance gain in this work mainly comes from two sources: joint onset-and-frame training, and restricted inference.