A closer look at Apache Mahout (Java recommendation library)

There have been a lot of changes to the Mahout library since it was first introduced through the Apache Software Foundation back in 2009.
I first looked at this project through the excellent tutorial on classifying and recommending Seinfeld episodes (differentiating episodes is perhaps not the easiest task, keeping in mind it was a show that prided itself on being “about nothing”). The tutorial really showed the strength of Mahout’s suite of libraries and algorithms, which can be ranked and compared for performance and relevance more easily than ever in the current release.
Unfortunately, the folks involved in that project received a takedown notice and can no longer provide the full Seinfeld scripts that were being used to classify the episodes.
Looking for an equally interesting alternative dataset, I’ve gone through the following commonly used public data sources:
- Million Song Dataset: http://millionsongdataset.com/
- MusicBrainz: https://musicbrainz.org/
- MovieLens: https://movielens.org/
- Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
- TVDB: https://www.thetvdb.com/api-information
- IMDB: https://imdb-api.com/
Seems like I’ve settled on just “” for now, but will likely look at unique ways to combine all of the above at a later date.
The scope of the problem will start out fairly modest: given a list of 10 users with very distinct tastes, can we recommend relevant new music (either a specific song or an artist) that they’re likely to enjoy? This is not a new problem area, but I do feel there hasn’t been much innovation in this space for a while. It has moved from the early days of WebJay, RACOFI & MyStrands, through the “middle ages” of Songza, LastFM & Pandora, and into the modern era, in which Apple’s iTunes/Music offerings, Spotify & YouTube pretty much hold a three-party oligopoly on both ad-supported streaming and paid digital music (whether pay-to-download/own or subscription-to-stream).
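To make the problem statement concrete, here is a minimal plain-Java sketch of one common approach, user-based collaborative filtering. It deliberately does not use Mahout’s API. The class name `TinyMusicRecommender`, the user and artist labels, and the play counts are all made up for illustration. Each user is a map from artist to play count, users are compared by cosine similarity, and artists the target user has not played yet are scored by a similarity-weighted sum of the neighbours’ play counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TinyMusicRecommender {

    /** Cosine similarity between two artist-to-play-count profiles. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared artists contribute
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    /** Euclidean length of a profile, taken over every artist in it. */
    static double norm(Map<String, Double> profile) {
        double sumOfSquares = 0.0;
        for (double v : profile.values()) {
            sumOfSquares += v * v;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Returns up to howMany artists the user has not played, best score first. */
    static List<String> recommend(String user,
                                  Map<String, Map<String, Double>> profiles,
                                  int howMany) {
        Map<String, Double> mine = profiles.get(user);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : profiles.entrySet()) {
            if (other.getKey().equals(user)) {
                continue;
            }
            double sim = cosine(mine, other.getValue());
            if (sim <= 0.0) {
                continue; // no overlap in taste, so this user has no say
            }
            for (Map.Entry<String, Double> play : other.getValue().entrySet()) {
                if (!mine.containsKey(play.getKey())) {
                    // Unnormalised weighted sum: similar users count for more.
                    scores.merge(play.getKey(), sim * play.getValue(), Double::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((x, y) -> {
            int byScore = Double.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : x.compareTo(y); // ties: alphabetical
        });
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    public static void main(String[] args) {
        // Hypothetical listeners and play counts, not taken from any real dataset.
        Map<String, Map<String, Double>> profiles = new HashMap<>();
        profiles.put("alice", Map.of("artist-a", 5.0, "artist-b", 3.0));
        profiles.put("bob", Map.of("artist-a", 4.0, "artist-b", 2.0, "artist-c", 5.0));
        profiles.put("carol", Map.of("artist-d", 5.0, "artist-e", 4.0));
        System.out.println(recommend("alice", profiles, 3));
    }
}
```

With only 10 users and very distinct tastes, most pairs will share few or no artists, so a sketch like this will often return nothing for a given user. That sparsity is a large part of what makes even the “modest” version of the problem interesting.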
The premise of this effort will be, given minimal inputs about a user, try to chart out
BC$ = Behavior, Content, Money

The goal of the BC$ project is to raise awareness and make changes with respect to the three pillars of information freedom - Behavior (pursuit of interests and passions), Content (sharing/exchanging ideas in various formats), Money (fairness and accessibility) - bringing to light the fact that:
1. We regularly hand over our browser histories, search histories and daily online activities to companies that want our money, or that benefit from our use of their services through lucrative ad deals or sales of personal information.
2. We create and/or consume interesting content on their services, but we aren't adequately rewarded for our creative efforts or loyalty.
3. We pay money to be connected online (and possibly also over mobile), yet we lose both time and money by allowing companies to market to us with unsolicited advertisements, irrelevant product offers and unfairly structured service pricing plans.