MIT's new AI system can tune individual musical instruments straight out of a concert video
Ever wished for an easy way to tune the guitar or the saxophone in old video footage lying in the attic, instead of having to re-master the entire audio track? The Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology (MIT) has developed a new deep learning artificial intelligence (AI) system that might just be what the doctor ordered.
CSAIL calls the system PixelPlayer, and it can identify, isolate, and tune individual musical instruments in footage with just a click. CSAIL researchers led by Hang Zhao say the program was trained on more than 60 hours of video and can isolate an instrument by identifying which pixels its soundwaves emanate from — all without any human supervision or annotation, even on never-before-seen videos.
The ability to isolate specific musical instruments from a video recording opens up immense possibilities, the researchers say. It gives engineers an easy way to repair or restore old concert footage, or even to swap instruments and preview how they sound. The team says that in its current form, PixelPlayer can distinguish the sounds of more than 20 common instruments, and it has the potential to 'learn' more if sufficient training data is provided. It does, however, struggle to identify subtle differences between subclasses of instruments. While there have been previous attempts to isolate soundwaves from audio files using AI, the inclusion of the visual element makes PixelPlayer 'self-supervised'. This self-supervision adds a whole new complexity to the mix, as it makes it difficult for the team to understand every aspect of how the system learns. Sounds a lot like Skynet, doesn't it?
Zhao says that PixelPlayer relies on deep neural networks trained on existing videos. Three networks divide up the work: one analyzes the visuals, one analyzes the audio, and a third — a synthesizer — associates specific soundwaves with specific pixels so they can be isolated. Zhao and his co-authors will present their work at the European Conference on Computer Vision (ECCV), slated to take place this September in Munich.
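To get an intuition for how the three networks fit together, here is a minimal, illustrative sketch in plain NumPy. All shapes, the number of audio components `K`, and the random stand-ins for the trained visual and audio networks are assumptions for illustration — this is not the researchers' actual implementation, just the general shape of the idea: a visual network yields a feature vector per pixel, an audio network splits the mixture spectrogram into components, and a synthesizer weights those components by a pixel's visual features to mask out that pixel's sound.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 16            # assumed number of learned audio components
H, W = 14, 14     # assumed spatial resolution of the video feature map
F, T = 256, 64    # assumed spectrogram size: frequency bins x time frames

# 1) Visual network (stand-in): one K-dim feature vector per pixel.
pixel_features = rng.standard_normal((H, W, K))

# 2) Audio network (stand-in): K component masks over the mixture spectrogram.
component_masks = rng.random((K, F, T))

# The mixed recording's magnitude spectrogram (stand-in values).
mixture_spectrogram = rng.random((F, T))

def synthesize_pixel_sound(i, j):
    """3) Synthesizer: weight the K audio components by the pixel's visual
    feature vector, squash to a (0, 1) mask, and apply it to the mixture
    to isolate the sound coming from that pixel."""
    weights = pixel_features[i, j]                                   # (K,)
    mask = 1 / (1 + np.exp(-np.tensordot(weights, component_masks, axes=1)))
    return mask * mixture_spectrogram                                # (F, T)

isolated = synthesize_pixel_sound(7, 7)
print(isolated.shape)  # (256, 64)
```

In the real system each of the three stages is a trained deep network, and the whole pipeline is learned end-to-end from unlabeled concert videos; the sketch above only shows how per-pixel visual features and per-component audio masks combine at inference time.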
Have a look at the video below to appreciate the AI in action and let us know your thoughts.