MIT invent AI that extracts each instrument from songs
It could be a dream come true for music sampling, as MIT reveal new technology capable of identifying and editing individual instruments within tracks.
When a track is recorded, all of its different elements, sounds and instruments are captured separately and then brought together into one file for release. Whilst we are able to isolate certain frequencies and elements, once the separate parts are combined, digitally isolating each instrument from the rest of the track has been near impossible – until now.
The Massachusetts Institute of Technology (MIT) have just announced their new AI, PixelPlayer, which can identify individual instruments within music files. In case that wasn’t impressive enough by itself, the AI also makes it possible to alter the isolated element, remix it, or remove it from the track entirely.
This opens up a world of possibilities, especially for DJs and producers working with samples, who could use the technology to completely rework tracks or lift single elements for their own compositions. MIT also see potential for sound engineers reworking old tracks or concert footage, manipulating each element to optimise the sound for cleaner, better-mixed recordings.
PixelPlayer works without human assistance, relying on what it has learned to locate and isolate each instrument’s part. The software has been trained on over 60 hours of video to identify each specific instrument “at pixel level” and extract it from the rest of the mix.
Zhao, a PhD student at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), said: “We expected a best-case scenario where we could recognise which instruments make which kinds of sounds. We were surprised that we could actually spatially locate the instruments at the pixel level. Being able to do that opens up a lot of possibilities, like being able to edit the audio of individual instruments by a single click on the video.”
MIT explained how it works:
PixelPlayer uses methods of “deep learning,” meaning that it finds patterns in data using so-called “neural networks” that have been trained on existing videos. Specifically, one neural network analyzes the visuals of the video, one analyzes the audio, and a third “synthesizer” associates specific pixels with specific soundwaves to separate the different sounds.
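For a sense of how those three networks could fit together, here is a minimal, hypothetical PyTorch-style sketch. It is not MIT’s released code: the layer sizes, the choice of K = 16 latent audio components, and the mask-based synthesizer are assumptions made for illustration, standing in for the much deeper networks the researchers trained on video and audio pairs.

```python
# Illustrative sketch only -- not PixelPlayer's actual implementation.
# It mirrors the three-part design described above: a visual network, an audio
# network, and a synthesizer tying pixel features to audio components.
import torch
import torch.nn as nn

K = 16  # assumed number of latent audio components

class VisualNet(nn.Module):
    """Maps video frames to a K-dimensional feature vector at every pixel."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, K, 3, padding=1),
        )
    def forward(self, frames):             # (B, 3, H, W)
        return self.conv(frames)           # (B, K, H, W)

class AudioNet(nn.Module):
    """Splits the mixture spectrogram into K component feature maps."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, K, 3, padding=1),
        )
    def forward(self, spec):               # (B, 1, F, T)
        return self.conv(spec)             # (B, K, F, T)

class Synthesizer(nn.Module):
    """Weights the K audio components by one pixel's visual feature vector
    to predict a spectrogram mask for the sound coming from that pixel."""
    def forward(self, pixel_feat, audio_feats):
        # pixel_feat: (B, K), audio_feats: (B, K, F, T)
        weights = pixel_feat.unsqueeze(-1).unsqueeze(-1)   # (B, K, 1, 1)
        return torch.sigmoid((weights * audio_feats).sum(dim=1, keepdim=True))

# Usage: pick a pixel on the video, mask the mixture spectrogram with the
# predicted mask, then invert the masked spectrogram to hear that instrument.
frames = torch.randn(1, 3, 224, 224)
spec = torch.randn(1, 1, 256, 64)
vis, aud = VisualNet()(frames), AudioNet()(spec)
pixel_feat = vis[:, :, 100, 100]           # visual features at pixel (100, 100)
isolated_spec = Synthesizer()(pixel_feat, aud) * spec
```

In a setup like this, clicking a pixel in the video selects one visual feature vector, and the synthesizer turns it into a mask over the mixture spectrogram – which is what would make per-instrument editing from a single click possible.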
Because the technology is “self-supervised” deep learning, the researchers admit they can’t completely explain how it works or how it learns which instruments are making which sounds. Zhao says it appears to recognise actual elements of the music, such as harmonic frequencies that correlate with instruments like the violin, or quick pulse-like patterns that correspond to instruments like the xylophone.
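To make those two cues concrete, here is a toy numpy sketch, entirely separate from PixelPlayer: it synthesises a harmonic “violin-like” tone and a percussive “xylophone-like” signal (both made up for illustration) and shows the signatures Zhao describes – spectral peaks at multiples of a fundamental versus short bursts of energy in time.

```python
# Toy illustration (not MIT's code): harmonic sounds concentrate energy at
# integer multiples of a fundamental, while percussive sounds arrive as short
# bursts in time.
import numpy as np

sr = 16000                                   # sample rate (Hz), 1 second of audio
t = np.arange(sr) / sr

# "Violin-like" tone: a 440 Hz fundamental plus its first few harmonics.
harmonic = sum(np.sin(2 * np.pi * 440 * k * t) / k for k in range(1, 5))

# "Xylophone-like" signal: three short decaying noise bursts.
percussive = np.zeros_like(t)
for onset in (0.1, 0.4, 0.7):
    i = int(onset * sr)
    percussive[i:i + 400] += np.random.randn(400) * np.exp(-np.linspace(0, 6, 400))

# Harmonic cue: the strongest spectral peaks sit at multiples of 440 Hz.
spectrum = np.abs(np.fft.rfft(harmonic))
freqs = np.fft.rfftfreq(len(harmonic), d=1 / sr)
print("harmonic peaks (Hz):", np.sort(freqs[np.argsort(spectrum)[-4:]]))

# Percussive cue: short-time energy spikes near the burst onsets (~0.1, 0.4, 0.7 s).
frame = 256
energy = np.array([np.sum(percussive[i:i + frame] ** 2)
                   for i in range(0, len(percussive) - frame, frame)])
bursts = np.where(energy > 0.25 * energy.max())[0]
print("burst times (s):", np.round(bursts * frame / sr, 2))
```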
The technology isn’t going public yet, but its potential is incredible.