Behavior, Content, Money – 3 Things you should never give away for free!!!

MusicBrainz vs. Million Song Dataset (battle of the Open Music databases)

Posted by bcmoney on October 1, 2011 in NoSQL, Semantic Web, SQL, Web Services with 2 Comments

I’m looking to add a significant amount of Music data to the OpenRecommender project to power the new Music Recommendation services.

Since I couldn’t find one, I feel compelled to write a side-by-side comparison between MusicBrainz (the old trusty) and the Million Song Dataset by LabROSA @ Columbia U. (the new kid in town). The following is the breakdown, and though I’ll offer my thoughts at the end, I’ll encourage the reader to decide which is the best data source.

Ready? BATTLE!

MusicBrainz

Available as: Data Dump, Web Service
Songs: 10,493,531
Artists: 620,229
Albums: 966,319
Size: 1.47 GB

MusicBrainz License (Public Domain & Creative Commons BY-NC-SA 2.0 Unported)


  • Started as a spirited, open alternative to Gracenote’s monopolization of CDDB
  • One of the oldest and most well-established open music projects, hence a very active and supportive community
  • Many LOD sources already interlink via MBID
  • Popular music services such as Last.fm, Setlist.fm, Grooveshark, Pandora, Shoutcast, Shazam, The Echo Nest, Songza and others use MBIDs as URIs to identify music
  • Includes relational links to many relevant sources, for example: AC/DC relationships
  • Completely free/unlimited for non-commercial use (unlimited commercial use is easy too, but requires a donation)
  • Uses namespaced XML as its Web Service output format (thus one step closer to the Semantic Web’s RDF/XML specifications, which a previous version supported in Beta, and which may be added back in future)
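
To make the namespaced XML point concrete, here is a minimal Python sketch that parses an artist record written in the style of the MusicBrainz XML web service. The sample document is hand-written for illustration, the MBID in it is a placeholder, and the namespace URI is the one I believe the version 2 web service uses; check it against a real response before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the MusicBrainz XML metadata format (version 2).
MMD_NS = {"mb": "http://musicbrainz.org/ns/mmd-2.0#"}

# Hand-written sample, NOT a real response; the MBID is a placeholder.
SAMPLE_XML = """<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist id="00000000-0000-0000-0000-000000000000" type="Group">
    <name>AC/DC</name>
    <sort-name>AC/DC</sort-name>
  </artist>
</metadata>"""


def parse_artist(xml_text):
    """Return (mbid, name) for the first artist element in the document."""
    root = ET.fromstring(xml_text)
    # Every element lives in the default namespace, so lookups need the prefix.
    artist = root.find("mb:artist", MMD_NS)
    if artist is None:
        raise ValueError("no <artist> element found")
    return artist.get("id"), artist.findtext("mb:name", namespaces=MMD_NS)


mbid, name = parse_artist(SAMPLE_XML)
print(mbid, name)
```

The thing to notice is that a plain `root.find("artist")` would return nothing here: because the output is namespaced, every lookup has to go through the namespace mapping.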
Million Song Dataset

Available as: Data Dump, Web Service
Songs: 1,000,000
Artists: 44,745
Albums: 0
Size: 273 GB

Code is GPL; data comes from The Echo Nest API & MusicBrainz


  • Full audio content analysis of each segment’s ‘loudness’, ‘pitches’ & ‘timbre’ (i.e. each segment ==> roughly every note, of every song… no wonder the DB is so huge)
  • Identifies 2,321 unique MusicBrainz tags, including an MBID field (where available)
  • Identifies 2,201,916 asymmetric similarity relationships (links between related tracks/artists)
  • Identifies 18,196 cover songs, with links to the SecondHandSongs API
  • Offers a simplified 10,000-song subset with the full audio analysis but less fluff; that is plenty for any development system, though probably not enough songs for production
  • Uses the Hierarchical Data Format (HDF5), which is more akin to NoSQL stores than to traditional SQL systems that require significant clustering at large scale
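
To put the two sizes in perspective, here is a rough back-of-the-envelope calculation in Python using only the song counts and sizes quoted in this comparison. It assumes decimal gigabytes and that both size figures are comparable, which may not hold exactly (one may be a compressed dump and the other raw files), so treat the result as an order-of-magnitude estimate.

```python
# Figures quoted in the comparison above.
MUSICBRAINZ = {"songs": 10_493_531, "size_gb": 1.47}
MILLION_SONG = {"songs": 1_000_000, "size_gb": 273}

BYTES_PER_GB = 10**9  # assumption: decimal gigabytes


def bytes_per_song(source):
    """Average storage per song for one data source, in bytes."""
    return source["size_gb"] * BYTES_PER_GB / source["songs"]


mb = bytes_per_song(MUSICBRAINZ)
msd = bytes_per_song(MILLION_SONG)
print(f"MusicBrainz:          ~{mb:.0f} bytes per song")
print(f"Million Song Dataset: ~{msd / 1000:.0f} KB per song")
print(f"Ratio:                ~{msd / mb:.0f}x")
```

MusicBrainz works out to roughly 140 bytes per song (pure metadata), while the Million Song Dataset averages about 273 KB per song, which is what you would expect when every segment of every track carries its own analysis values.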

In the end I think that MusicBrainz is a “no-brainer” as the quickest, most effective way of populating data for a Recommender. However, even from only scratching the surface of what’s available in the Million Song Dataset, it’s pretty clear that it’s a requirement for any Recommendation Engine that claims to be complete in the area of music recommendations. Any final product should therefore undertake the extra steps, effort and computing capacity required to run it.
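
Since the Million Song Dataset carries an MBID field where available, one plausible way to use both sources together is to join them on that ID. The Python sketch below shows the idea with made-up records; the field names (`mbid`, `artist_mbid`, `name`, `title`) and the placeholder IDs are assumptions for illustration, not the actual schema of either source.

```python
def merge_on_mbid(mb_artists, msd_tracks):
    """Attach MusicBrainz artist names to MSD-style tracks that share an MBID.

    Returns (merged, unmatched); tracks without a known MBID go to unmatched.
    """
    by_mbid = {artist["mbid"]: artist for artist in mb_artists}
    merged, unmatched = [], []
    for track in msd_tracks:
        artist = by_mbid.get(track.get("artist_mbid"))
        if artist is None:
            unmatched.append(track)
        else:
            merged.append({**track, "artist_name": artist["name"]})
    return merged, unmatched


# Hypothetical records with placeholder IDs, for illustration only.
mb_artists = [{"mbid": "mbid-0001", "name": "Example Artist"}]
msd_tracks = [
    {"title": "Track A", "artist_mbid": "mbid-0001"},
    {"title": "Track B", "artist_mbid": ""},  # MBID not available
]

merged, unmatched = merge_on_mbid(mb_artists, msd_tracks)
print(len(merged), "matched;", len(unmatched), "unmatched")
```

Keeping the unmatched tracks in a separate list matters in practice, because the MBID is only present "where available" and those tracks would need some other matching strategy.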



  1. Elmo Kobylinski, December 7, 2011 - 2:03 pm #1

    Hi There, great Article by the way. I agree with your conclusions… hopefully you can do one for Book data sources in the future? I’ve been trying to find an eBook version of “Searching For Jimmy Buffett” to send my friends, because its the funniest book I have read in years!!! Looking forward to more info on eBook publishers

    • Sammy J., January 1, 2012 - 5:06 pm #2

      You’re right. MusicBrainz is one of the best data sources going today. I couldn’t get by without it. If only Picard Music Tagger could run on my Amazon Kindle! Maybe another product can do it someday…

BC$ = Behavior, Content, Money

The goal of the BC$ project is to raise awareness and make changes with respect to the three pillars of information freedom - Behavior (pursuit of interests and passions), Content (sharing/exchanging ideas in various formats), Money (fairness and accessibility) - bringing to light the fact that:

1. We regularly hand over our browser histories, search histories and daily online activities to companies that want our money, or, to benefit from our use of their services with lucrative ad deals or sales of personal information.

2. We create and/or consume interesting content on their services, but we aren't adequately rewarded for our creative efforts or loyalty.

3. We pay money to be connected online (and possibly also over mobile), yet we lose both time and money by allowing companies to market to us with unsolicited advertisements, irrelevant product offers and unfairly structured service pricing plans.