MusicBrainz vs. Million Song Dataset (battle of the Open Music databases)

Posted by bcmoney on October 1, 2011 in NoSQL, Semantic Web, SQL, Web Services with 2 Comments


I’m looking to add a significant amount of Music data to the OpenRecommender project to power the new Music Recommendation services.

Since I couldn’t find one, I feel compelled to write a side-by-side comparison of MusicBrainz (the old trusty) and the Million Song Dataset by LabROSA at Columbia University (the new kid in town). The breakdown follows, and though I’ll offer my thoughts at the end, I encourage the reader to decide which is the better data source.

Ready? BATTLE!

MusicBrainz

Access: Data Dump, Web Service
Songs: 10,493,531
Artists: 620,229
Albums: 966,319
Size: 1.47 GB
License: MusicBrainz License (Public Domain & Creative Commons BY-NC-SA 2.0)


  • Started as a spirited, open alternative to Gracenote’s monopolization of CDDB
  • One of the oldest and most well-established open music projects, with a very active and supportive community as a result
  • Many Linked Open Data (LOD) sources already interlink via MusicBrainz Identifiers (MBIDs)
  • Popular music services such as Last.FM, Setlist.FM, Grooveshark, Pandora, Shoutcast, Shazam, EchoNest, Songza and others use MBIDs as URIs to identify music
  • Includes relational links to many relevant sources, for example: AC/DC relationships
  • Completely free and unlimited for non-commercial use (unlimited commercial use is easy too, but requires a donation)
  • Uses namespaced XML as its Web Service output format (thus one step closer to the Semantic Web’s RDF/XML specifications, which a previous version supported in beta, and which may be added back in future)
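
To make the namespaced-XML point concrete, here is a minimal sketch of reading an artist record with Python’s standard library. The XML is a hand-written sample shaped like a Web Service response, not real output; the MBID is a placeholder, the artist name is made up, and the namespace URI is assumed to be the one used by version 2 of the web service. The `parse_artist` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the MusicBrainz XML metadata format (version 2).
MMD_NS = {"mb": "http://musicbrainz.org/ns/mmd-2.0#"}

# Hand-written sample shaped like a web service response (not real output).
# The MBID is a placeholder, not a real identifier.
SAMPLE_XML = """<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist id="00000000-0000-0000-0000-000000000000" type="Group">
    <name>Example Band</name>
    <sort-name>Example Band</sort-name>
  </artist>
</metadata>
"""


def parse_artist(xml_text):
    """Return the MBID, name and type of the first artist element."""
    root = ET.fromstring(xml_text)
    # Every element lives in the default namespace, so lookups need the prefix.
    artist = root.find("mb:artist", MMD_NS)
    if artist is None:
        raise ValueError("no <artist> element found")
    return {
        "mbid": artist.get("id"),
        "name": artist.findtext("mb:name", namespaces=MMD_NS),
        "type": artist.get("type"),
    }


artist = parse_artist(SAMPLE_XML)
print(artist["mbid"], artist["name"], artist["type"])
```

Forgetting the namespace prefix is the usual stumbling block: a plain `root.find("artist")` returns nothing on this document, because the element’s full name includes the namespace URI.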

Million Song Dataset

Access: Data Dump, Web Service
Songs: 1,000,000
Artists: 44,745
Albums: 0
Size: 273 GB
License: code is GPL; data comes from the EchoNest API & MusicBrainz


  • Full audio content analysis of each segment’s ‘loudness’, ‘pitches’ & ‘timbre’ (i.e. each segment ==> every note, of every song… no wonder the DB is so huge)
  • Identifies 2,321 unique MusicBrainz tags, and includes an MBID field (where available)
  • Identifies 2,201,916 asymmetric similarity relationships (links between related tracks/artists)
  • Identifies 18,196 cover songs, with links to the SecondHandSongs API
  • Offers a simplified 10,000-song subset with the full audio analysis but less fluff, which is plenty for any development system but probably not enough songs for production
  • Uses the Hierarchical Data Format (HDF5), which is more akin to NoSQL than to traditional SQL systems requiring significant clustering for large-scale systems
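
The word “asymmetric” in the similarity relationships matters for a Recommender: A listing B as similar does not mean B lists A. Here is a minimal sketch of what that looks like as data, using made-up IDs and hypothetical helper names (this is not the dataset’s actual storage layout):

```python
# Directed "is similar to" edges: (source, target) means the source lists
# the target as similar. All IDs below are made up for illustration.
SIMILARITY = {
    ("artist_a", "artist_b"),
    ("artist_b", "artist_a"),  # reciprocated
    ("artist_a", "artist_c"),  # one-way: artist_c does not list artist_a
    ("artist_d", "artist_a"),  # one-way
}


def similar_to(artist, edges):
    """Artists that `artist` lists as similar (outgoing edges only)."""
    return sorted(target for source, target in edges if source == artist)


def one_way_edges(edges):
    """Edges whose reverse direction is missing, i.e. the asymmetric ones."""
    return sorted((s, t) for s, t in edges if (t, s) not in edges)


print(similar_to("artist_a", SIMILARITY))
print(one_way_edges(SIMILARITY))
```

In practice this means a recommendation lookup has to pick a direction (outgoing, incoming, or both) instead of treating similarity as a simple undirected graph.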

In the end, I think MusicBrainz is a “no-brainer” as the quickest, most effective way of populating data for a Recommender. However, even from only scratching the surface of what’s available in the Million Song Dataset, it’s pretty clear that it’s a requirement for any Recommendation Engine that claims to be complete in the area of music recommendations. Any final product should therefore undertake the extra steps, effort and computing capacity required to run it.
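
To put that extra computing capacity in perspective, here is a back-of-the-envelope calculation from the figures quoted in the comparison. It assumes 1 GB = 10^9 bytes and that each size is spread evenly across songs; both are simplifications, and the two sizes may not have been measured the same way (compressed or not).

```python
GB = 10**9  # assumption: decimal gigabytes

# Figures quoted in the comparison above.
DATASETS = {
    "MusicBrainz": {"songs": 10_493_531, "size_gb": 1.47},
    "Million Song Dataset": {"songs": 1_000_000, "size_gb": 273},
}


def bytes_per_song(songs, size_gb):
    """Average storage per song, in bytes."""
    return size_gb * GB / songs


per_song = {name: bytes_per_song(**d) for name, d in DATASETS.items()}
for name, value in per_song.items():
    print(f"{name}: about {value:,.0f} bytes per song")
```

That works out to roughly 140 bytes per song for MusicBrainz against roughly 273 KB per song for the Million Song Dataset, a gap of about three orders of magnitude, which is what you would expect when one source stores metadata and the other stores per-segment audio features.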