A closer look at Apache Mahout (Java recommendation library)

There have been a lot of changes to the Mahout library since it was first introduced through the Apache Software Foundation back in 2009.
I first looked at this project through the excellent tutorial on classifying and recommending Seinfeld episodes (perhaps not the easiest episodes to tell apart, keeping in mind it was a show that prided itself on being “about nothing”). It really showed the strength of the suite of libraries and algorithms, which in the current release can be ranked and compared for performance and relevance more easily than ever.
Unfortunately, the folks involved in that project received a takedown notice and can no longer provide the full Seinfeld scripts that were being used to classify the episodes.
Looking for an interesting alternative dataset, I went through the following commonly used public data sources:
- Million Song Dataset: http://millionsongdataset.com/
- MusicBrainz: https://musicbrainz.org/
- MovieLens: https://movielens.org/
- Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
- TVDB: https://www.thetvdb.com/api-information
- IMDB: https://imdb-api.com/
Seems like I’ve settled on just “” for now, but will likely look at unique ways to combine all of the above at a later date.
The scope of the problem will start out fairly modest: given a list of 10 users with very distinct tastes, can we recommend relevant new music (either a specific song or an artist) that they’re likely to enjoy? This is not a new problem area, but I do feel there has not been much innovation in this space for a while. It has moved from the early days of WebJay, RACOFI & MyStrands, through the “middle ages” of Songza, LastFM & Pandora, and into the modern era, in which Apple’s iTunes/Music offerings, Spotify & YouTube pretty much hold a three-party oligopoly on both ad-supported streaming and paid digital music (whether pay-to-download/own or subscription-to-stream).
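To make the "recommend new music to a handful of users" problem concrete, here is a minimal sketch of user-based collaborative filtering in plain Java. It does not use Mahout's API. The class name, the method names and the sample ratings are all hypothetical, and cosine similarity with a similarity-weighted average is just one reasonable choice among many.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy user-based collaborative filter. All names and data are hypothetical. */
final class TinyMusicRecommender {

    /** Cosine similarity between two sparse rating vectors (artist -> rating). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to n artists the user has not rated, best predicted rating first. */
    static List<String> recommend(Map<String, Map<String, Double>> ratings, String user, int n) {
        Map<String, Double> mine = ratings.get(user);
        if (mine == null) return new ArrayList<>(); // unknown user: nothing to go on

        Map<String, Double> weightedSum = new HashMap<>();
        Map<String, Double> similaritySum = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : ratings.entrySet()) {
            if (other.getKey().equals(user)) continue;
            double sim = cosine(mine, other.getValue());
            if (sim <= 0.0) continue; // skip users who share no artists with this user
            for (Map.Entry<String, Double> r : other.getValue().entrySet()) {
                if (mine.containsKey(r.getKey())) continue; // only artists new to the user
                weightedSum.merge(r.getKey(), sim * r.getValue(), Double::sum);
                similaritySum.merge(r.getKey(), sim, Double::sum);
            }
        }

        // Predicted rating = similarity-weighted average of the neighbours' ratings.
        Map<String, Double> predicted = new HashMap<>();
        for (String artist : weightedSum.keySet()) {
            predicted.put(artist, weightedSum.get(artist) / similaritySum.get(artist));
        }
        List<String> candidates = new ArrayList<>(predicted.keySet());
        candidates.sort((x, y) -> {
            int byScore = Double.compare(predicted.get(y), predicted.get(x));
            return byScore != 0 ? byScore : x.compareTo(y); // ties broken alphabetically
        });
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }

    /** Made-up ratings on a 1-5 scale, purely for illustration. */
    static Map<String, Map<String, Double>> sampleRatings() {
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("alice", ratingsOf("Radiohead", 5.0, "Portishead", 4.0));
        Map<String, Double> bob = ratingsOf("Radiohead", 5.0, "Portishead", 4.0);
        bob.put("Massive Attack", 5.0);
        bob.put("Tricky", 3.0);
        ratings.put("bob", bob);
        ratings.put("carol", ratingsOf("Metallica", 5.0, "Slayer", 4.0));
        ratings.put("dave", ratingsOf("Radiohead", 4.0, "Tricky", 5.0));
        return ratings;
    }

    private static Map<String, Double> ratingsOf(String a1, double r1, String a2, double r2) {
        Map<String, Double> m = new HashMap<>();
        m.put(a1, r1);
        m.put(a2, r2);
        return m;
    }

    public static void main(String[] args) {
        System.out.println(recommend(sampleRatings(), "alice", 2));
    }
}
```

With the made-up data above, "carol" gets no recommendations because she shares no artists with anyone else. That is exactly the "very distinct tastes" cold-start difficulty the problem statement hints at, and a reason to combine listening data with other sources.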
The premise of this effort will be, given minimal inputs about a user, try to chart out
OpenRecommender v1.0 released!
This is a post to announce the ALPHA release of OpenRecommender, version 1.0.
Have you ever wondered if there was a better way to find information on the web? Before today, there have been lots of ways, from targeted search to aimless surfing, and from social sharing via SNS platforms like Facebook or Google+ to required reading assigned by professors, co-workers or managers by email (i.e. “Recommended reading”). Even “stumbling” across interesting content via tools like StumbleUpon, Digg and Delicious is commonly mistaken for a form of “recommendation” service. These tools are not Recommendation Engines, though; they are most accurately described as Social Bookmarking tools (i.e. users must manually save something for later, or “mark their place”, so they can take up where they left off in browsing/reading). In fact, these tools have some opportunity to become web-wide Recommendation Engines, since links can be submitted on any topic, and some (such as Digg) even have “Related Content” suggestions that group by item or user similarity. The problem, however, is that similarity is just one small measure of relevance for true recommendations. OpenRecommender identifies 15 algorithm types for generating high quality recommendations. The more the merrier, in fact, so any algorithm could be used as long as it can be ranked in real-time.
Today, I’m proud to be able to share a first look at a new approach that represents a “Recommendation” more completely than ever before. The OpenRecommender project ALPHA release realizes the first step laid out in a talk I gave exactly one year ago at AWOSS 2010:
BC$ = Behavior, Content, Money

The goal of the BC$ project is to raise awareness and make changes with respect to the three pillars of information freedom - Behavior (pursuit of interests and passions), Content (sharing/exchanging ideas in various formats), Money (fairness and accessibility) - bringing to light the fact that:
1. We regularly hand over our browser histories, search histories and daily online activities to companies that want our money, or, to benefit from our use of their services with lucrative ad deals or sales of personal information.
2. We create and/or consume interesting content on their services, but we aren't adequately rewarded for our creative efforts or loyalty.
3. We pay money to be connected online (and possibly also over mobile), yet we lose both time and money by allowing companies to market to us with unsolicited advertisements, irrelevant product offers and unfairly structured service pricing plans.