Metric learning for video to music recommendation

Abstract : Music enhances moving images and makes it possible to communicate emotion or narrative tension efficiently, thanks to cultural codes shared by filmmakers and viewers. Successful communication requires not only choosing a track that matches the video's mood and content, but also synchronizing the main audio and visual events in time. This is the goal of the music supervision industry, which traditionally carries out the task manually. In this dissertation, we study the automation of tasks related to music supervision. The music supervision problem generally does not have a unique solution, as it includes external constraints such as the client's identity or budget. It is thus relevant to proceed by recommendation. As the number of available music videos is constantly increasing, it makes sense to use data-driven tools. More precisely, we use the metric learning paradigm to learn the relevant projections of multimodal (video and music) data. First, we address the music similarity problem; music similarity is used to broaden the results of a music search. We implement an efficient content-based imitation of a tag-based similarity metric. To do so, we present a method to train a convolutional neural network from ranked lists. Then, we focus on direct, content-based music recommendation for video. We adapt a simple self-supervised system and demonstrate a way to improve its performance by using pretrained audio features and learning their aggregation. We then carry out a qualitative and quantitative analysis of official music videos to better understand the temporal organization of music videos. Results show that official music videos are carefully edited to align audio and video events, and that the level of synchronization depends on the music and video genres.
With this insight, we propose the first recommendation system designed specifically for music supervision: the Seg-VM-Net, which uses both content and structure to match music and video.
Contributor : ABES STAR
Submitted on : Tuesday, April 12, 2022 - 11:24:10 AM
Last modification on : Wednesday, April 13, 2022 - 3:08:06 AM
Long-term archiving on : Wednesday, July 13, 2022 - 6:55:36 PM


Version validated by the jury (STAR)


  • HAL Id : tel-03638477, version 1



Laure Prétet. Metric learning for video to music recommendation. Multimedia [cs.MM]. Institut Polytechnique de Paris, 2022. English. ⟨NNT : 2022IPPAT005⟩. ⟨tel-03638477⟩


