Abstract
Music has become an integral part of human life, extending its influence across various industries; for many, it is a necessity. With the rise of neural network technology, Music Information Retrieval (MIR) has gained prominence as a multidisciplinary field focused on processing music information and its applications. A popular approach to music captioning is the multimodal encoder-decoder architecture, which pairs a CNN encoder with an LSTM decoder. In this study, we develop a model that learns from audio and text data simultaneously. We explore different design choices for modality fusion, including early fusion, late fusion, and hybrid fusion, to assess their impact on captioning performance.
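To make the fusion terminology concrete, the sketch below (not the authors' implementation; all module names, dimensions, and the PyTorch framework are assumptions for illustration) contrasts early fusion, which concatenates audio and text features before a shared encoder, with late fusion, which encodes each modality separately and combines the outputs afterwards.

```python
# Illustrative sketch of early vs. late modality fusion (hypothetical dimensions).
import torch
import torch.nn as nn


class EarlyFusion(nn.Module):
    """Concatenate audio and text features, then pass them through one shared encoder."""
    def __init__(self, audio_dim=128, text_dim=256, hidden_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, audio_feat, text_feat):
        # Fusion happens at the input level, before any joint processing.
        return self.encoder(torch.cat([audio_feat, text_feat], dim=-1))


class LateFusion(nn.Module):
    """Encode each modality with its own network, then merge the resulting representations."""
    def __init__(self, audio_dim=128, text_dim=256, hidden_dim=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())

    def forward(self, audio_feat, text_feat):
        # Fusion happens at the representation level, after per-modality encoding.
        return self.audio_enc(audio_feat) + self.text_enc(text_feat)


# Example usage with random features for a batch of 4 items.
audio = torch.randn(4, 128)
text = torch.randn(4, 256)
print(EarlyFusion()(audio, text).shape)  # torch.Size([4, 512])
print(LateFusion()(audio, text).shape)   # torch.Size([4, 512])
```

A hybrid design would combine both ideas, e.g. fusing some features early while keeping separate per-modality encoders whose outputs are merged later.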
