Introduction
Imagine that you run an online marketplace, or a hotel booking website, or something else with a search box and a lot of items behind it. A user types a query, the result set is huge, far more than your user can digest in one go, and you have to decide what to show first, what second, and so on.
A first approach might be to hard-code some static rules based on domain knowledge and business requirements. For example, you bump newer items a bit, you boost cheap items in one category, you demote another category because the business team said so. Plenty of popular websites ran on exactly this kind of logic for many years, and it might work for a while in your case as well.
The problem is that eventually it gets more and more complicated to maintain. Over time you get more items, more users, more categories, and the business-rules logic becomes hard to comprehend and update as new requirements arrive. One day somebody on the team reads a blog post about learning-to-rank, pulls in XGBoost or LightGBM, starts inventing features, and suddenly “let’s sort the search results” turns into a multi-month ML project with new systems and services to maintain.
Yes, ranking and recommendation is a huge topic that deserves its own post, or even a series of posts, but today I just want to show one pragmatic shortcut: before you jump into a gradient-boosted ranker with hundreds of features, train a simple collaborative filtering model. It gives decent results practically from the start, and it can serve as a baseline later on, when you bring in “fancier” models.
And if you already have Spark sitting somewhere in your pipeline doing ETL or any other ML/DS workload for you, I have good news: Spark has a solution for this out of the box, so you can reuse what you already have. Isn’t it amazing?
And by the way, you don’t even need a cluster to try it: as we’ll see, you can play with it using a plain local Spark setup.
If you’re eager to see the code, jump to the Coding Time section.
Ranking, Recommendation, Personalization??? I’m confused!
Well, let’s take it one step at a time.
From a high-level view, “recommendation”, “personalized ranking” and “search ranking”, in its nutshell, mean the same thing (we can discuss the details in the comments). But what we really want is to show user \(u\) a relevant item \(i\), and “relevant” here might mean an item the user will like, click, purchase, listen to, view, …, you name it, given some context \(c\), which might be anything: search query, past history, time of day, and so on.
Let’s formalize the problem a little bit before we go further.
In the most generic case we can say we want a function \(r = f(u, i, c)\) with the property that, for the same user and two different items, \(f(u, i_1, c) > f(u, i_2, c)\) means the first item is more relevant than the second. Practically, that means we show it higher in the search results (or earlier in the user journey).
Matrix factorization to the rescue
Matrix factorization is a well-known technique used in many applications, such as dimensionality reduction, NLP, topic modeling and others.
So what we want is to represent user \(u\) and item \(i\) as \(k\)-dimensional vectors \(x_u\) and \(y_i\), learned from past user-item interactions, like ratings, clicks, or plays that make up our context \(c\). Once those vectors are fit, the context is already baked into them, and the relevance function is a simple dot-product of just user and item: \(f(u, i) = x_u^{\top} y_i\).
So that’s our model, and it looks easy, but what is the target value we are optimizing for? How do we define the relevance \(r_{ui}\) of item \(i\) for user \(u\)?
Explicit vs implicit feedback
Well, there are two main approaches in practice, depending on what kind of data you have.
Explicit feedback
The first, and the easy one, is the so-called “explicit feedback” case. That’s when you ask the user to give you an explicit rating, e.g. by using stars from 0 to 5.
The only tricky part here is that you don’t have all possible ratings: with 1M users and 10M items, it would be nice if every user rated at least 100 items, but that is almost never the case, and the rating matrix stays extremely sparse. So in practice you optimize only over the set of observed \((u, i)\) pairs \(\Omega\).
Having that in mind, we can write our problem as an optimization problem:
\[ \min_{X, Y} \sum_{(u, i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \| x_u \|^2 + \sum_i \| y_i \|^2 \right) \]
where \(X\) is a matrix of user vectors \(x_u\), \(Y\) is a matrix of item vectors \(y_i\), and \(\lambda\) is the regularization parameter.
Unobserved pairs don’t contribute to the loss. If a user never rated an item, the model treats this as missing data, not as a zero rating. We genuinely don’t know what they would have said.
Implicit feedback
Implicit feedback is the more common case in the real world, where you often don’t have explicit ratings at all. The only information available is what users did: clicks, views, plays, purchases, watch time. And you have to guess the rest.
The idea here is to split the observation \(r_{ui}\) into two quantities:
- a binary preference \(p_{ui} = \mathbb{1}[r_{ui} > 0]\), did they interact at all;
- and a confidence \(c_{ui} = 1 + \alpha \cdot r_{ui}\), how much we trust the preference signal.
The model predicts preference (roughly a \([0, 1]\) score), weighted by confidence.
\[ \min_{X, Y} \sum_{u, i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \| x_u \|^2 + \sum_i \| y_i \|^2 \right). \]
And as you can see, the objective here sums over all \((u, i)\) pairs, not just observed ones.
That formulation comes from Hu, Koren, Volinsky (2008).
That idea is quite elegant. A user who played a song 50 times probably likes it, and we trust that signal more. A user who never played a song probably does not like it, but we trust that much less, since they might simply never have seen it. The \(\alpha\) hyperparameter controls how much we trust strong observations over silence.
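To make that concrete: with \(\alpha = 40\), the value reported to work well in the original paper, a song played 50 times gets confidence \(c_{ui} = 1 + 40 \cdot 50 = 2001\), while an unplayed song keeps \(c_{ui} = 1\). Both contribute to the loss, but the play signal outweighs the silence by three orders of magnitude.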
Here, we can steer the algorithm by carefully designing \(r_{ui}\) and choosing \(\alpha\). So for audio/video streaming it can simply be total play time duration, and for an online store it can be a weighted sum of events like: \[ \begin{aligned} r_{ui} = \; & 1 \cdot \text{search\_click\_count} \\ +\; & 2 \cdot \text{viewed\_pdp\_count} \\ +\; & 3 \cdot \text{wishlist\_add\_count} \\ +\; & 5 \cdot \text{cart\_add\_count} \\ +\; & 10 \cdot \text{purchase\_count}. \end{aligned} \]
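To make the event weighting concrete, here is a minimal Spark SQL sketch, assuming a SparkSession with spark.implicits._ in scope; the events DataFrame and its eventType/cnt columns are illustrative assumptions, not part of any demo dataset:

import org.apache.spark.sql.functions._

// hypothetical weights per event type, mirroring the formula above
val weights = Map(
  "search_click" -> 1.0,
  "viewed_pdp"   -> 2.0,
  "wishlist_add" -> 3.0,
  "cart_add"     -> 5.0,
  "purchase"     -> 10.0
)

// each row contributes weight(eventType) * cnt to the final rating
val contribution = weights.foldLeft(lit(0.0)) { case (acc, (event, w)) =>
  acc + when($"eventType" === event, lit(w) * $"cnt").otherwise(lit(0.0))
}

val ratings = events
  .withColumn("contribution", contribution)
  .groupBy("userId", "itemId")
  .agg(sum("contribution").cast("float").alias("rating"))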
Two practical consequences worth calling out:
- Put magnitudes in the rating column, not binary flags. If you collapse “5 views” and “1 view” to both be 1, you’ve thrown away the entire confidence signal. Use counts, durations, play time, everything that reflects strength.
- The predictions are preference scores, not original counts. They live roughly in \([0, 1]\) and should be read as “how much does the model believe this user prefers this item”. RMSE on raw counts is meaningless here; use ranking metrics like MAP@K, NDCG@K, and so on, but that’s a story for another time.
How ALS is solved (and named)
I don’t want the post to be too mathematical, so I’m going to skip the details of how it is actually solved algorithmically. But it is worth mentioning that finding the optimal \(X\) and \(Y\) jointly is a non-convex problem, and nobody solves it directly. Here’s the trick: if you fix \(Y\) and solve for \(X\), the problem becomes a bunch of independent least-squares problems, one per user, each with a nice closed-form solution. Same thing the other way round: fix \(X\) and solve for \(Y\), item by item.
So this is why the method is called Alternating Least Squares: it alternates between solving for users with items fixed, and solving for items with users fixed, for 10–20 iterations. It turns out that’s good enough in practice.
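For the curious, the per-user update with the items fixed is just ridge regression. In the explicit setting, with \(Y_{\Omega_u}\) stacking the vectors of the items user \(u\) rated and \(r_u\) the corresponding ratings, the closed-form solution is:

\[ x_u \leftarrow \left( Y_{\Omega_u}^{\top} Y_{\Omega_u} + \lambda I \right)^{-1} Y_{\Omega_u}^{\top} r_u \]

and the item update is symmetric.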
Coding Time
Now finally let’s run something.
I’m going to skip the details of the whole setup; I’ve already written on this blog about how to run Spark in Docker and how to set up a Spark project with sbt to run locally. So let’s focus on the code. Just for reference, I checked the code with Spark v2.4.0, the latest at the time of writing (and later with some others in the v2.4.x series).
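For completeness, the dependencies boil down to two artifacts; a minimal build.sbt sketch matching that version (adjust the Scala version to your setup):

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.4.0",
  "org.apache.spark" %% "spark-mllib" % "2.4.0" // brings in org.apache.spark.ml.recommendation.ALS
)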
Explicit feedback demo
Let’s start with explicit feedback, because the semantics of “a rating” are easier to reason about. For a dataset to play with, MovieLens is a classic choice, grab one of the small ratings dumps, split it into train and test, and save both as parquet. We’ll just read the files from disk and jump straight to the model.
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("ALSBlogDemo")
.master("local[*]")
.config("spark.ui.enabled", "false")
.getOrCreate()
import spark.implicits._
val training = spark.read
.parquet("path/to/movie_ratings_training.parquet")
.select(
$"userId".cast("int"),
$"movieId".cast("int"),
$"rating".cast("float")
)
val test = spark.read
.parquet("path/to/movie_ratings_test.parquet")
.select(
$"userId".cast("int"),
$"movieId".cast("int"),
$"rating".cast("float")
)
// the model itself
val als = new ALS()
.setMaxIter(10)
.setRank(10)
.setRegParam(0.1)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
.setColdStartStrategy("drop")
// run training
val model = als.fit(training)

A few things worth explaining here:
- setRank(10) is the latent dimension \(k\) from the math section. 10 is the Spark default and a sane starting point. Bumping it to 20–50 often helps if you have enough data. Usually you want to evaluate several values.
- setRegParam(0.1) is \(\lambda\) in the objective. Because Spark uses ALS-WR, this value is not very dataset-size-sensitive; you don’t need to re-tune it aggressively as data grows.
- setMaxIter(10) is how many alternating sweeps we do. 10 is almost always enough; according to the papers, going past 20 gives almost no gain.
- setColdStartStrategy("drop") is an important one. A random train/test split almost always produces a few users or items that appear only in the test set. By default, Spark predicts NaN for them, which may break downstream metrics. "drop" just removes those rows from the output.
Spark actually scales \(\lambda\) by the number of ratings a user/item has, which is the “ALS-WR” variant from Zhou et al. (2008). The practical effect is that a good regParam found on a small sample usually transfers to the full dataset, and high rank doesn’t immediately overfit. You get one less thing to re-tune when your data grows.
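Written out, the ALS-WR trick replaces the plain regularization term in the explicit objective above with a weighted version, where \(n_u\) is the number of ratings from user \(u\) and \(m_i\) the number of ratings of item \(i\):

\[ \lambda \left( \sum_u n_u \| x_u \|^2 + \sum_i m_i \| y_i \|^2 \right) \]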
Implicit feedback demo
Now the more interesting case. Instead of star ratings, imagine we have user-item interaction counts, let’s say, how many times each user played each song. We aggregate the raw events per (user, item) pair and feed the counts as the “rating” column:
import org.apache.spark.sql.functions._
// let's say we have a log of track plays
val training = log
.groupBy("userId", "itemId")
.agg(sum("count").alias("playCount"))
.withColumn("playCount", $"playCount".cast("float"))

The model configuration is almost identical, with a few important differences:
val als = new ALS()
.setMaxIter(15)
.setRank(20)
.setRegParam(0.01)
.setAlpha(10.0)
.setImplicitPrefs(true)
.setUserCol("userId")
.setItemCol("itemId")
.setRatingCol("playCount")
.setColdStartStrategy("drop")
.setNonnegative(true)
val model = als.fit(training)

Two settings are specific to the implicit formulation:
- setImplicitPrefs(true) flips the optimization to the implicit objective we discussed earlier. From now on the “rating” column is a confidence signal, not a rating.
- setAlpha(10.0) controls \(c_{ui} = 1 + \alpha \cdot r_{ui}\). The Spark default is 1.0. If your implicit model feels like it’s ignoring the magnitudes (giving everyone the same generic recommendations), alpha is the first thing to turn up.
setNonnegative(true) is in there too, but it’s not actually implicit-specific. It deserves its own short section.
A closer look at Solvers
Under the hood, Spark picks one of two inner solvers based on a value set by setNonnegative:
val solver = if (nonnegative) new NNLSSolver else new CholeskySolver

Internally, both solve the same normal equation, \((A^{\top} A + \lambda I)\, x = A^{\top} b\), but Cholesky is unconstrained, while NNLS adds the element-wise constraint \(x \geq 0\). That choice is made once per fit and used regardless of whether the explicit or implicit form is used; the implicit-specific math happens upstream of it. So you can flip it on either way.
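To ground that equation, here is what the unconstrained (Cholesky-path) inner solve looks like when sketched with Breeze, the linear algebra library Spark already depends on; a and b are stand-ins for the per-user design matrix and target vector, not Spark internals:

import breeze.linalg.{DenseMatrix, DenseVector}

// unconstrained least-squares step: solve (AᵀA + λI) x = Aᵀb
def solveOneSide(a: DenseMatrix[Double], b: DenseVector[Double], lambda: Double): DenseVector[Double] = {
  val k = a.cols
  val gram = a.t * a + DenseMatrix.eye[Double](k) * lambda
  gram \ (a.t * b) // solves the k×k linear system
}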
In practice people lean toward using it with implicit feedback for two weak-but-real reasons:
- NMF-like parts-based structure. Non-negativity pushes the solver toward assigning users and items to disjoint subsets of latent dimensions, the classic parts-based decomposition you get from NMF. That single property buys you two separate things:
- Some interpretability. You can sometimes eyeball factors and notice patterns there.
- Match to the structure of implicit data. Implicit data is binary and mostly zero, which fits a parts-based representation naturally.
- Stability for very sparse items. Constrained problems sometimes produce less wild factor values for items with one or two observations. Not a huge effect but occasionally useful.
Costs worth knowing:
- NNLS is slower than Cholesky. Cholesky is a direct factorization; NNLS is an iterative active-set method. Not a huge deal at rank 10–100, but it adds up.
- You lose some expressive capacity. The unconstrained model can always represent what the non-negative one can, plus more, so unconstrained typically wins by a small margin on offline metrics.
- It is not NMF. True NMF has its own multiplicative-update algorithm with specific convergence guarantees. Spark does “ALS with a non-negativity constraint in the inner solve”, which is a different object.
Sensible defaults: for explicit feedback, leave nonnegative = false. For implicit, try both: true is reasonable if you care about interpretability or want slightly more stable top-K lists, and leaving it off is reasonable if you’re maximizing offline metrics. The demo above sets it mostly to show that the knob exists.
Predicting and evaluating
Now that the model is trained, we can call transform to get predictions. For the explicit case it gives you predicted ratings, but for the implicit case the predictions are preference scores, not play counts as one might expect.
val predictions = model.transform(test)
predictions.select("userId", "itemId", "prediction").show() // use "movieId" if you ran the explicit demo

For the implicit case you’ll typically see values roughly in \([0, 1]\), which is exactly what the model is fitting to: the binary preference \(p_{ui}\), not the original count \(r_{ui}\).
If you try to run a RegressionEvaluator with "rmse" on these numbers against the raw playCount, you’ll get huge, meaningless errors. That’s not a bug in the model; you’re just comparing against the wrong target.
For implicit models the right thing to evaluate is top-K ranking quality: MAP@K, NDCG@K. Spark’s built-in options for this in the DataFrame API are a bit limited in v2.4.0, and doing it properly deserves its own post.
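Until that post happens, Spark’s older RDD-based API already has a RankingMetrics class that covers MAP and NDCG. A minimal sketch with toy data; in a real evaluation the predicted arrays would come from recommendForAllUsers and the label arrays from your held-out interactions:

import org.apache.spark.mllib.evaluation.RankingMetrics

// per user: (items we predicted, ranked by score; items actually interacted with)
val predictionAndLabels = spark.sparkContext.parallelize(Seq(
  (Array(1, 2, 3), Array(1, 3)), // toy user A
  (Array(4, 1, 5), Array(5))     // toy user B
))

val metrics = new RankingMetrics(predictionAndLabels)
println(s"MAP    = ${metrics.meanAveragePrecision}")
println(s"NDCG@3 = ${metrics.ndcgAt(3)}")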
And here is where the practical payoff starts to show. Spark’s ALSModel gives you one-line helpers to generate top-K recommendations in bulk:
// Top 10 items per user, for every user in the training set
val userRecs = model.recommendForAllUsers(10)
userRecs.show(3, truncate = false)
// Or, top 5 for a specific subset of users
val selectedUsers = training.select("userId").distinct().limit(3)
model.recommendForUserSubset(selectedUsers, 5).show(truncate = false)

That’s essentially your entire batch recommendation pipeline in two calls.
One caveat: recommendForAllUsers(K) boils down to a cross-join between all users and all items. For 1M users × 10M items this becomes infeasible, and you’d want to switch to an approximate nearest neighbor index (more on that later). But in the “tens of millions of pairs” regime it works fine out of the box.
Finding good hyperparameters
Spark ships with a standard CrossValidator, and it works with ALS just like with any other estimator. The only ALS-specific thing is that you really, really need setColdStartStrategy("drop"), otherwise random CV folds will inject NaNs into the metric and break the whole run.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
val als = new ALS()
.setMaxIter(5)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating") // Note: this is explicit feedback case
.setColdStartStrategy("drop")
val pipeline = new Pipeline().setStages(Array(als))
val evaluator = new RegressionEvaluator()
.setMetricName("rmse") // Note: works only for explicit feedback case
.setLabelCol("rating")
.setPredictionCol("prediction")
val paramGrid = new ParamGridBuilder()
.addGrid(als.rank, Array(5, 10, 20))
.addGrid(als.regParam, Array(0.01, 0.1, 0.3))
.build()
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)
.setSeed(42)
val cvModel = cv.fit(training)

Printing the results sorted by metric is convenient when you want a quick look at how much each knob matters:
val results = cvModel.getEstimatorParamMaps
.zip(cvModel.avgMetrics)
.sortBy(_._2)
println("Cross-validation results (sorted by RMSE, lower is better):")
results.foreach { case (params, metric) =>
val paramStr = params.toSeq
.map(p => s"${p.param.name}=${p.value}")
.mkString(", ")
println(f"RMSE=$metric%.4f; Params: [$paramStr]")
}

In practice, rank and regParam are the two parameters to focus on first. For implicit data, add alpha to that list. maxIter is rarely worth tuning past a fixed value of 10–20.
What to do with the model once you have it
At this point you have a trained ALSModel. There are many ways you can use the model and its vectors in production.
Pre-computed Static Recommendations
The simplest pattern: every night, run recommendForAllUsers(K), dump the results into a key-value store like Redis, and have your service look up “top 100 for user u” on each request. This is what recommendForAllUsers is for. It’s also the cheapest option in terms of online latency (the service does a single point lookup) and it’s very easy to reason about. The downside is that recommendations are only as fresh as your last batch.
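The output of recommendForAllUsers is nested, one array of (item, score) structs per user, so in practice you flatten it before dumping it to the store. A sketch, reusing the userRecs from above (the struct field is named after whatever you passed to setItemCol; the output path is illustrative):

import org.apache.spark.sql.functions._

// one (userId, itemId, score) row per recommendation
val flatRecs = userRecs
  .select($"userId", explode($"recommendations").as("rec"))
  .select($"userId", $"rec.movieId".as("itemId"), $"rec.rating".as("score"))

flatRecs.write.mode("overwrite").parquet("path/to/user_recs.parquet")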
Online ranking/re-ranking
If you have search or a candidate set that changes per request, e.g. available hotels on these dates, in-stock products in this category, and so on, you can’t pre-compute the full ranking. Instead, you extract the user’s vector at request time, score each candidate item by dot product, and sort. Spark gives you the vectors as ordinary DataFrames:
model.userFactors.show(3)
model.itemFactors.show(3)

Both have the schema (id: Int, features: Array[Float]). You can write them to Parquet, load them into a Java/Go/Python service, and do the dot products yourself. At this point ALS just becomes “two vector tables and a dot product”, which is very easy to serve.
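At serving time the scoring loop needs nothing heavier than this minimal plain-Scala sketch (score and rank are illustrative names, not Spark API):

// dot product of a user vector and an item vector
def score(u: Array[Float], i: Array[Float]): Float = {
  var s = 0f
  var k = 0
  while (k < u.length) { s += u(k) * i(k); k += 1 }
  s
}

// rank a per-request candidate set for one user, best first
def rank(userVec: Array[Float], candidates: Map[Int, Array[Float]]): Seq[Int] =
  candidates.toSeq
    .map { case (itemId, itemVec) => itemId -> score(userVec, itemVec) }
    .sortBy(-_._2)
    .map(_._1)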
Approximate nearest neighbor search
When you need to answer “find me the top-K items for this user” over an index of millions of items, you don’t want to compute every dot product on every request. This is where ANN libraries come in. You build an index over the item vectors once, and queries become fast approximate top-K lookups under inner product similarity. There are many open-source solutions for k-ANN search, like FAISS or Annoy, to name a few. How to do this efficiently is a topic on its own, and I might come back to it in another post.
Item-to-Item and User-to-User similarity
As a nice side-effect, the same item vectors give you item-to-item similarity essentially for free.
Features like the “Customers who viewed this also viewed …” carousel or the “more like this” button on a product page boil down to taking one item vector and finding its nearest neighbors in the same item space.
User-to-user similarity drops out of userFactors the same way. Nearest neighbors there give you behavioural segments, lookalike audiences, or cohorts for experiments, useful as an offline analysis tool even if it rarely shows up directly in a product UI.
A few common issues
I’ll close with a short list of things that are easy to miss when you use Spark ALS for the first time.
- User and item IDs must be Int. String IDs, UUIDs, or Long values that exceed Int.MaxValue will fail. Build a stable mapping table upfront (see the sketch after this list).
- Always use setColdStartStrategy("drop") during evaluation. Otherwise any user or item absent from training produces NaN, and NaN propagates into every metric.
- Don’t evaluate implicit models with RMSE. The predictions are preference scores in \([0, 1]\), not the raw counts you trained on. Use ranking metrics instead.
- Don’t use binary flags in the rating column for implicit data. The \(\alpha\)-driven confidence is the whole point; collapsing counts to 0/1 throws away the signal the algorithm is trying to exploit.
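For the ID mapping from the first bullet, a sketch like this works, assuming spark.implicits._ is in scope; rawInteractions and userKey are hypothetical placeholders, and the resulting table must be persisted and reused so the ids stay stable across retrainings:

// assign a dense Int id to every raw string key, once
val userIdMap = rawInteractions
  .select($"userKey").distinct()
  .rdd.map(_.getString(0))
  .zipWithIndex() // (key, Long index); safe to downcast while you have < 2B users
  .map { case (key, idx) => (key, idx.toInt) }
  .toDF("userKey", "userId")

userIdMap.write.mode("overwrite").parquet("path/to/user_id_map.parquet")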
Wrapping up
ALS is not the hip algorithm in the ML toolbox, unlike DNN recommenders. But it is:
- very simple to set up,
- very well understood mathematically,
- native to Spark,
- able to produce compact vector representations you can serve with a dot product,
- and usually enough to get most of the way to the “obvious” recommendations before you’ve written a single feature engineering pipeline.
If you already have Spark in your stack (and if you’re reading this post there’s a decent chance you do), training an ALS baseline is a couple of hours of work. Even if you end up replacing it later with a fancier ranker, you’ll have a strong floor to compare against, a sanity check for your feature pipeline, and a nice set of user and item vectors to use elsewhere.
That is usually worth the two hours.
References
- Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative Filtering for Implicit Feedback Datasets. 2008 Eighth IEEE International Conference on Data Mining, 263–272.
- Zhou, Y., Wilkinson, D. M., Schreiber, R. S., & Pan, R. (2008). Large-Scale Parallel Collaborative Filtering for the Netflix Prize. Algorithmic Aspects in Information and Management (AAIM 2008).