Interview: Music, AI and medicine with Daniel Kvak

Behind the idea to create an app that detects pneumonia or Covid-19 was a desire to help. And so, in 2021, Daniel Kvak and his wife Karolina founded Carebot, a company that develops software based on artificial intelligence methods with a focus on clinical practice. Daniel has won numerous awards for his work, and his extensive knowledge of artificial intelligence puts him in charge of the entire technology side of the company. In addition to running Carebot, he is pursuing a PhD at both the Faculty of Medicine and the Faculty of Arts at Masaryk University. His research interests include computational creativity, machine learning and generative modelling, as well as electronic music and music composition. What does music have to do with medicine? Read the interview in which Daniel Kvak talks about recurrent neural networks and their use in the field of music science.

12 Jun 2025 | Kateřina Hendrychová, Natálie Čornyjová

You are pursuing your doctoral studies at two faculties of MU, Arts and Medicine. How do you manage to combine music and medicine? Is artificial intelligence the intersection?

Artificial intelligence has fascinated me since I was a child. At first it was sci-fi comics and TV shows, of course, but the vision of a technological helper assisting us in our daily tasks completely absorbed me. From the beginning, I directed my studies at the Faculty of Arts towards what paid my bills: production music. It wasn't until the Covid-19 pandemic that my wife and I thought about taking AI in a medical direction. It's true that, technologically, there is not much difference between recognizing artistic styles in paintings and recognizing pathologies on X-rays. Nor are the differences across domains very large when it comes to, for example, accountability or the impact of technology on individuals and society. When we look at deepfake recordings, the impact on the public good is not dissimilar to the situation in healthcare.

In your master's thesis you focused on modeling musical transcription using deep learning. What led you to use artificial intelligence in music?

I've spent many years of my life creating background tracks for commercials, films and so-called sound banks. I always knew it was not high art, which is why I was interested in the possibility of using AI to generate musical compositions. Fifteen-second snippets of these background tracks often go unnoticed by the listener - their presence is perceived only subliminally, yet if they were missing, the listener would notice immediately. My undergraduate thesis focused on the Spotify platform, which a decade ago was already using generative AI to create simple "lift" tracks. But it was clear that this was only the very beginning for the industry.

How did you go about creating the autonomous generative model? What challenges did you encounter during the development process?

When I started in the generative AI segment, the most popular application was generating Shakespearean texts using recurrent neural networks (RNNs). We are talking about the same family of approaches that today, in the form of ChatGPT, helps us every day with almost every conceivable task. So when did the rapid change come that turned amateur projects for generating music or poetry into something that completely changed the world? While RNNs have been with us in their current form since roughly 2007, the biggest problem of generative AI has always been what are called long-term dependencies - simply put, getting neural networks to sustain attention over longer spans. Anyone working on natural language processing, or on time series modeling in general, was facing similar difficulties at the time. A significant shift came in 2017 with the introduction of the attention mechanism, which largely solved the problem.
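For readers unfamiliar with the technique mentioned above, the following is a minimal, purely illustrative sketch of scaled dot-product attention in plain NumPy - the core of the 2017 mechanism. The function name and the toy data are our own for illustration; this is not code from any of Kvak's models.

```python
# Minimal sketch of scaled dot-product attention (illustrative only).
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Weigh every position of a sequence by its relevance to every other.

    queries, keys: (seq_len, d_k) arrays; values: (seq_len, d_v) array.
    Returns a (seq_len, d_v) array of context-aware representations.
    """
    d_k = queries.shape[-1]
    # Similarity of each query with each key, scaled to keep values well-behaved.
    scores = queries @ keys.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all value vectors - this is what lets
    # the model relate distant words or notes without a recurrent state.
    return weights @ values

# Toy usage: self-attention over an 8-step sequence of 16-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
context = scaled_dot_product_attention(x, x, x)
print(context.shape)  # (8, 16)
```

Unlike an RNN, which must carry information step by step through a hidden state, every position here can attend directly to every other position - which is why long-term dependencies stopped being the bottleneck.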

What opportunities do deep neural networks offer today in the field of music science?

MIR (music information retrieval) is a broad field that includes, among other things, recommender systems that suggest similar songs - as Spotify does - systems for automating the composition and mixing process, but also, for example, tracking systems for distributors or copyright societies. The possibilities for using AI today are quite extensive. As with the relatively recent success of image generators (DALL-E, Midjourney), generators are now emerging in music composition that can produce fairly high-quality pieces from a text prompt.

Are recurrent neural networks a suitable tool for generating music, or are other models currently replacing them - and why?

Today's neural networks using the attention mechanism are not very different in their logic from the original, say, simple recurrent neural networks. The model's predictions must above all be contextual, and that is what has received the most attention in recent years. In musical composition, however, certain given rules come into play: some are genre-specific, violating others results in cacophony, while still others are actively broken in the name of improvisation and creativity. In musical composition in particular, there is, in my opinion, no consensus on which approaches should be universally applied. We know from fairly recent history examples of cellular automata that generate compositions by combining simple rules into surprisingly complex patterns, as well as generative adversarial networks, which have had a huge impact on image generation. But a much more fruitful question is how we should evaluate the outputs of such models. The topic of automatic music generation (musical metacreation), defined by Philippe Pasquier et al. in 2017, remains relatively peripheral, overshadowed by the attention that text and image generators receive.
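To make the cellular-automaton idea concrete, here is a toy sketch: an elementary cellular automaton (Rule 90, chosen arbitrarily) evolves a row of cells, and active cells are mapped onto a pentatonic scale to form a symbolic note sequence. This is our own illustration of the general principle, not a description of any particular published system or of Kvak's work.

```python
# Toy example: an elementary cellular automaton driving a symbolic note sequence.
RULE = 90                      # 8-bit Wolfram rule number (illustrative choice)
SCALE = [60, 62, 65, 67, 70]   # MIDI pitches, a simple pentatonic scale

def step(cells, rule=RULE):
    """Apply one elementary-CA update to a list of 0/1 cells (wrapping edges)."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

def generate_notes(width=5, steps=16):
    """Turn each CA generation into a chord: active cell i sounds SCALE[i]."""
    cells = [0] * width
    cells[width // 2] = 1      # single seed cell in the middle
    score = []
    for _ in range(steps):
        score.append([SCALE[i] for i, c in enumerate(cells) if c])
        cells = step(cells)
    return score

for chord in generate_notes():
    print(chord)
```

The point of such systems is exactly what the answer above describes: a handful of simple local rules produces globally complex patterns - and the hard research question is not generating them, but deciding how to judge whether the result is musically worthwhile.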

What role does human interaction and feedback play in the development and tuning of recurrent neural networks aimed at music production?

A big one. Most of the projects that have boasted convincing artificially generated music in recent years have worked with a simple symbolic transcription of the songs - only the basic "skeleton" was really generated, while an experienced team of musicians took care of the rest. The question is whether that is wrong, or whether our expectations are simply set inappropriately. Natural language processing (NLP), where experience with machine translation and now with generative models is much broader, helps us understand this better. If I translate a text using DeepL, how often do I edit it? What if it is a technical text? If I generate text with ChatGPT, how domain-specific is it? Do I want to include it in my thesis in its generated form, without changes? And if I do edit the text, does that mean the model I'm using "doesn't work", or is it just that my preferences are set differently? These are the questions we need to ask ourselves.

You founded Carebot as an arts student interested in modeling music transcription using deep learning. How did you get out of the music field and into the healthcare field?

During the Covid-19 pandemic, my wife Karolina and I had a vision of helping healthcare professionals cope with the barrage of examinations they were faced with. Our original idea was to evaluate various kinds of image data, but the situation at the time indicated quite clearly where the potential of AI might be greatest. As the pandemic gradually faded away, it became clear that the problem we wanted to solve was systemic in nature. Having previously worked mostly with music and text processing, I found the transition to computer vision challenging. Time has shown that artificial intelligence in medicine is a very complex issue; there is something to the saying "we do these things not because they are easy, but because we thought they would be easy." After three years of work, we obtained approval from the European regulator, and today we are proud to be one of the few in Europe to hold it.

How does Carebot use artificial intelligence?

We work primarily with pattern recognition in image data. We have a team of more than 80 radiologists from all over Europe who help us annotate training data or take part in validation. Transparency is key for us - not just which models we use or how many training images we have, but above all how clearly and verifiably we can demonstrate the true clinical benefit of these models in independent tests.

X-ray image: Carebot

What are your plans for the app (and the company) in the future?

We are now expanding further into mammography screening and bone X-rays to detect fractures and bone lesions. Thanks to regulatory approval, we are also entering international markets with our system for detecting abnormalities on chest X-rays. Above all, our vision is to ensure consistent quality of care across regions, whether it is a large teaching hospital or a small hospital in a rural area.

Useful links

ResearchGate | HIS Voice | LinkedIn

