A closer look at Apache Mahout (Java recommendation library)

There have been a lot of changes to the Mahout library since it was first introduced through the Apache Software Foundation back in 2009.
I first looked at this project through the excellent tutorial on classifying and recommending Seinfeld episodes (perhaps not the easiest episodes to tell apart, keeping in mind that the show prided itself on being “about nothing”). The tutorial really showed the strength of the suite of libraries and algorithms, which in the current release can be ranked and compared for performance and relevance more easily than ever.
Unfortunately, the folks involved in that project received a takedown notice and can no longer provide the full Seinfeld scripts that were used to classify the episodes.
Looking for another interesting dataset, I went through the following commonly used public data sources:
- Million Song Dataset: http://millionsongdataset.com/
- MusicBrainz: https://musicbrainz.org/
- MovieLens: https://movielens.org/
- Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
- TVDB: https://www.thetvdb.com/api-information
- IMDB: https://imdb-api.com/
For now I’ve settled on just “”, but I will likely look at ways to combine all of the above at a later date.
The scope of the problem will start out fairly modest: given a list of 10 users with very distinct tastes, can we recommend relevant new music (either a specific song or an artist) that they’re likely to enjoy? This is not a new problem area, but I do feel there has not been much innovation in the space for a while, from the early days of WebJay, RACOFI & MyStrands, through the “middle ages” of Songza, LastFM & Pandora, and into the modern era, where Apple’s iTunes/Music offerings, Spotify & YouTube pretty much hold a three-party oligopoly over both ad-supported streaming and paid digital music (whether pay-to-download/own or subscription-to-stream).
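To make that problem statement concrete, here is a rough sketch of the kind of user-based recommendation it implies. This is plain Java and deliberately does not use Mahout's API; the class name `TinyRecommender`, the cosine-similarity scoring, and the made-up listeners and artists are all assumptions for illustration, not the approach this series will necessarily end up with.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy recommender: not Mahout, just the general idea in miniature.
class TinyRecommender {

    // Cosine similarity between two users' preference vectors; a missing artist counts as 0.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Rank artists the user has not heard yet by the similarity-weighted sum of other users' scores.
    static List<String> recommend(Map<String, Map<String, Double>> data, String user, int howMany) {
        Map<String, Double> mine = data.get(user);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : data.entrySet()) {
            if (other.getKey().equals(user)) continue;
            double sim = cosine(mine, other.getValue());
            if (sim <= 0) continue; // users with no overlapping taste contribute nothing
            for (Map.Entry<String, Double> e : other.getValue().entrySet()) {
                if (!mine.containsKey(e.getKey())) {
                    scores.merge(e.getKey(), sim * e.getValue(), Double::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        // Highest score first; ties broken alphabetically so the order is stable.
        ranked.sort((x, y) -> {
            int byScore = Double.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        return ranked.subList(0, Math.min(howMany, ranked.size()));
    }

    public static void main(String[] args) {
        // Made-up listeners and artists, purely for illustration.
        Map<String, Map<String, Double>> data = new HashMap<>();
        data.put("alice", Map.of("Artist A", 5.0, "Artist B", 3.0));
        data.put("bob", Map.of("Artist A", 4.0, "Artist B", 2.0, "Artist C", 5.0));
        data.put("carol", Map.of("Artist D", 5.0, "Artist E", 4.0));
        System.out.println(recommend(data, "alice", 3));
    }
}
```

In this toy data, carol shares no artists with alice, so her similarity is zero and only bob's unheard artist is suggested. That is exactly the difficulty with "very distinct tastes": with little overlap between users, a similarity-based approach has very little to work with.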
The premise of this effort will be, given minimal inputs about a user, try to chart out