Promotional Song Targeting
The problem of seamlessly and intelligently promoting content is both ubiquitous and challenging among tech companies. Unfortunately, this promotion can feel forced and disingenuous. Users often see promoted content as both undesirable and contextually inappropriate. However, we believe that these feelings are often more of an indictment of the medium and implementation rather than the content itself.
At Pandora, music is our content and, naturally, we are very interested in finding ways to promote certain tracks. We have a unique opportunity for promoting tracks on our service for two primary reasons. First, the medium of radio is very conducive to naturally and seamlessly inserting promoted tracks — surprise and discovery are a critical and desired part of the radio listener experience. Second, the data we’ve collected from our listeners over the past decade of providing the premier online radio experience allows us to implement this promotion in a context-specific and personalized way.
Here, we describe our technique for promoting tracks on Pandora’s radio service. The goal of this approach is to simultaneously maximize the reach of these tracks (i.e., the number of unique listeners and contexts we’re finding for these tracks) as well as the overall positivity of the feedback (i.e., we want listeners to thumb these tracks up). We start off by describing and presenting the results of our promotional techniques on Featured Tracks — a free program we offer to artists on Pandora. Next, we describe how we accomplished these results through a technique we refer to as song targeting.
At Pandora, we offer a free service called Featured Tracks. It is part of a suite of industry-leading, self-serve promotional tools provided by the Artist Marketing Platform (AMP) and is available to all artists on Pandora who work with a direct label or distributor partner. In addition to providing tremendous insights about the performance of a track and the listeners who are responding to it, we attempt to integrate these tracks into our radio service in ways that can help maximize their performance and grow their audience.
Benefits of Featuring a Track on Pandora
The results presented below refer to the 583 featured tracks that were actively running between July 3rd and July 9th, 2017. Note, “thumbs” refers to all thumb feedback (“thumbs up” and “thumbs down”), while “seeds” refers to the starting points for listener stations (e.g., “Radiohead,” “Drake,” etc.).
Increases in Spins, Thumbs, Seeds
Across all of the featured tracks during the week of consideration, we see substantial gains in total spins, thumbs, thumbs up and seeds.
The increases are even more dramatic if we isolate tracks that averaged fewer than 1,000 spins per day.
These results demonstrate the ability to provide a substantial benefit to smaller artists by finding previously undiscovered listeners in entirely new contexts that are responding in remarkably positive ways.
New and Incremental Spins, Thumbs Up, Seeds and Listeners
These incremental numbers really highlight the benefit of our framework. In total, we see 5.2 million incremental spins, heard by 3.6 million incremental listeners, who provided 180 thousand incremental thumbs up. The increases in spins, listeners, and thumbs represent huge growth for these tracks and simply would not have happened without our promotional framework.
Perhaps most impressive, however, is the 115 thousand incremental unique seeds per week. An increase in the “seeds” metric implies more reach for the artist’s music than it had seen previously, as this metric only includes seeds on which these featured tracks would otherwise have received zero spins. As a result of this increase, listeners are hearing these featured tracks in entirely new contexts, and we’re dramatically increasing the likelihood that these artists are pulling in new fans.
Importantly, for any individual listener, the repetition of these tracks is low. Listeners who heard at least one featured track received only approximately 1.5 spins of any featured track per day. We believe this intelligently balances the complex relationship between repetition and diversity.
Better Listening Experience
- The average thumb up rate (# of thumbs up / # thumbs) on the 5.2 million incremental spins this framework introduced was 83%.
- The average thumb up rate for the spins of these tracks that occurred naturally (i.e., spins that were not attributable to the new framework) was 58%.
The response from the new listeners and new contexts we are finding for these tracks is overwhelmingly positive. As a result, we can confidently claim that we are actually improving the quality of the listening experience, while, at the same time, providing targeted and personalized promotion of these tracks.
How does this work?
As evidenced by the results, the performance boosts we’re giving to featured tracks improve the potential for these tracks to succeed on Pandora while protecting and enhancing the listener experience. To accomplish this, we built a framework for seamlessly integrating these featured tracks into our radio service through an innovative technique we refer to as song targeting.
Song targeting (patent pending) refers to the personalized framework that we use for deciding whether or not we should recommend playing a specific track on a listener’s station. In order to understand what this means, we first define the context that underlies that decision making. After that, we will explicitly state the goals of the framework, define the model that is used by the framework, and dive into the pipeline and implementation details.
Defining the Context
When we decide to play a track in the radio experience on Pandora, there are a large number of components that comprise the context. Some of the top-level components of this context are defined below:
- Track: The individual track that we are considering spinning.
- Artist: The artist associated with that track.
- Seed: The starting point for a station — often a track, artist, or genre.
- Listener: The listener that is currently listening.
- Station: The specific listener-seed pair. Note, there are potentially millions of stations for millions of listeners that all originated from the same seed. However, due to listener behavior on those stations (e.g., their thumbing behavior), each of those stations is unique.
In its simplest form, the goal of song targeting is to answer the question: “Should we play this track on this station?” In reality, we extend this question to expand the context: “Should we play this track by this artist on this station from this seed owned by this listener?” Further, this description of the context only considers first-order components. When we actually apply the song targeting pipeline, we add second-order considerations derived from pairs of these first-order components — e.g., the historical performance of the track on a seed, or historical statistics derived from the listener-artist pair, station-track pair, etc.
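As a rough sketch, the first- and second-order components of this context can be pictured as a small data structure. The following Python is purely illustrative (field and pair names are hypothetical), not our production representation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Context:
    """First-order components of a single play decision (illustrative)."""
    track_id: str
    artist_id: str
    seed_id: str
    listener_id: str
    station_id: str  # a station is a specific listener-seed pair

    def second_order_pairs(self):
        """Pairs of first-order components from which second-order
        features (e.g., historical track-on-seed performance) derive."""
        return {
            "track_seed": (self.track_id, self.seed_id),
            "artist_seed": (self.artist_id, self.seed_id),
            "listener_artist": (self.listener_id, self.artist_id),
            "station_artist": (self.station_id, self.artist_id),
        }
```

The second-order pairs are derived rather than stored, so any statistics keyed on them (e.g., listener-artist history) can be looked up on demand.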
The goal of song targeting is two-fold: (1) maximize the number of unique listeners and stations that we can find for these tracks, and (2) maximize the overall satisfaction of the listener when we decide to play these tracks. From the artist’s perspective, these goals represent our attempt to reach as many potential fans as possible. From the listener’s perspective, these goals represent our desire to improve the listening experience by spinning these tracks in appropriate contexts to listeners who we predict will like the song.
Importantly, the goal of song targeting is not to simply maximize the number of spins that these tracks receive. If that were the case, it would be easy to inadvertently design a system that undermines the artist and listener experience. For example, a naive approach for track promotion could be to simply “boost” the likelihood that a listener hears the song on stations where they were already likely to hear the song. From the artist’s perspective, this approach may cannibalize listeners and spins that they would have received naturally. From the listener’s perspective, this approach could easily degrade the quality of their stations with artificially increased repetition. Fortunately, an increase in spins is a natural byproduct of our goals. As we increase the reach of the tracks and maximize the overall listener feedback on those tracks we will spin the track in contexts in which it might not spin naturally. Spinning these tracks in new and unique contexts will result in truly incremental spins — i.e., spins that would not have occurred naturally and are solely attributable to the song targeting framework.
To meet the goals, we construct a model that estimates the probability that a listener thumbs up a track given that they provided feedback on that track. In other words, we are attempting to estimate the following: P(Thumb Up | Thumb, Context). Here, “Context” simply refers to all of the various information sources that go into playing a radio track on Pandora (see the “Defining the Context” section above).
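In other words, only spins that received explicit feedback become labeled examples, with a thumb up as the positive class. A minimal sketch of that labeling scheme (field and value names are hypothetical):

```python
def build_labeled_examples(spins):
    """Keep only spins with explicit thumb feedback.

    `spins` is an iterable of dicts with a 'feedback' field that is one
    of 'thumb_up', 'thumb_down', 'skip', or None (names illustrative).
    Returns (spin, label) pairs with label 1 for a thumb up and 0 for a
    thumb down; spins without a thumb are excluded from training.
    """
    examples = []
    for spin in spins:
        if spin["feedback"] == "thumb_up":
            examples.append((spin, 1))
        elif spin["feedback"] == "thumb_down":
            examples.append((spin, 0))
    return examples
```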
How does this relate to our goals? At the end of the song targeting pipeline we apply a threshold to the model output (i.e., the estimated probabilities). That threshold is directly tied to our stated goals and allows us to simultaneously optimize the tradeoff between the number of listeners these recommendations reach and the overall quality of the recommendations. This is discussed in more detail in the “Model Training and Evaluation” section below.
We are restricting the output of the model to be conditional on thumb feedback being received. This formulation allows us to view the problem as a straightforward binary classification problem and restrict our training examples to spins in which we received either a clear positive or negative signal. However, it can introduce bias — e.g., we oversample data provided by listeners who thumb frequently. Additionally, we are ignoring spins that ended without any feedback — i.e., spins that were completed or skipped without an associated thumb. We have seen empirical success with the current formulation, and this bias does not appear to be drastically affecting model performance.
The primary components of the pipeline we implemented for song targeting are:
- Example Construction: Define a framework for constructing the examples used for model training, evaluation, and prediction.
- Model Training and Evaluation: Train the classification model and evaluate the model quality on testing data that is representative of the predictions that will be made in production.
- Model Prediction and Deployment: Use the trained model to make the final recommendations and deploy to production.
Constructing the features for a single example (i.e., a single instance of the training, testing or prediction data) requires the combination of information from a variety of sources. Additionally, many individual components are used in a large number of examples. For instance, the features describing popular tracks, artists, or seeds might be used in millions of examples that we’re constructing. As a result, we wanted to define a framework in which we only compute the features once for each component of the context. With this in place, constructing a full example (with all of its components) is simply a matter of concatenating the features from each component to form the full context.
At Pandora, we have used this style of example construction in many machine learning applications. It has several benefits:
- Features for each component can be constructed independently and updated as frequently (or infrequently) as desired.
- Combining multiple components can be accomplished by concatenating the feature string from each individual component (note, the order of concatenation does not matter).
- Conversion to other feature representations (e.g., a standard vector of feature values) is straightforward.
- Easy interpretability. For example, in the above illustration you might guess that “artist_id1” was labeled as a rock artist and “track_id2” features guitar and male vocals.
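As a sketch of how this concatenation works, suppose each component has a precomputed, space-delimited feature string (the strings and ids below are hypothetical):

```python
def concat_features(component_features, component_ids):
    """Form a full example by concatenating the precomputed feature
    string of each component; the order of concatenation does not matter."""
    return " ".join(component_features[c] for c in component_ids)

# Hypothetical precomputed feature strings for three components.
features = {
    "artist_id1": "artist_genre:rock artist_popularity:0.7",
    "track_id2": "track_guitar:1 track_male_vocal:1",
    "seed_id3": "seed_type:artist",
}
example = concat_features(features, ["artist_id1", "track_id2", "seed_id3"])
```

Because each component's string is computed once and reused, building millions of examples reduces to cheap string joins.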
In the final data representation, we use the following first-order components: track, artist, seed, listener, and station. Additionally, we augment those features with second-order components: track-seed, artist-seed, listener-artist, and station-artist. In practice, each example is described by approximately 700 features.
Model Training and Evaluation
Training and Testing Data
The training data for the model is a random sample of radio feedback (thumbs) that occurred over a trailing 7-day period. In general, we want the training data to be as representative as possible of the data that we will be applying the model to. For Featured Tracks, we limit the training samples to tracks that were released in the past year. Note, the feature calculation for the training data takes place at the beginning of the trailing 7-day period to prevent leakage into the feature space — i.e., we don’t want information about the label to be present in the features. The testing data is very similar to the training data. It is computed over a trailing 7-day window and the features are constructed from the beginning of that period. The sampling for the testing data is limited to feedback that occurred on the target of the song targeting application — e.g., featured tracks.
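The leakage guard can be expressed as a simple rule: any feature used for an example must be computed from data available at or before the start of the trailing label window. A sketch of that check (function names are hypothetical):

```python
from datetime import date, timedelta

def feature_cutoff(label_window_end: date, window_days: int = 7) -> date:
    """Features must come from data at or before the start of the
    trailing label window, so no label information leaks into them."""
    return label_window_end - timedelta(days=window_days)

def is_leak_free(feature_timestamp: date, label_window_end: date) -> bool:
    """True if a feature computed at `feature_timestamp` is safe to use
    with labels collected through `label_window_end`."""
    return feature_timestamp <= feature_cutoff(label_window_end)
```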
We use an in-house implementation of a gradient boosting classifier (based on an ensemble of decision trees). Gradient boosting is capable of learning complex nonlinear relationships among the features with no need to pre-process or rescale the features during tree construction. This choice was also supported by superior performance compared to the other models we considered (e.g., logistic regression, random forests).
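Our in-house implementation is not public, but the same modeling choice can be sketched with scikit-learn's gradient boosting classifier on toy data (features and labels below are purely illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Toy context features and thumb labels (1 = thumb up); purely illustrative.
X = [[0.9, 1.0], [0.8, 0.0], [0.1, 1.0], [0.2, 0.0]]
y = [1, 1, 0, 0]

model = GradientBoostingClassifier(n_estimators=50, max_depth=2)
model.fit(X, y)

# Estimated P(thumb up | thumb, context) for a new context.
p_thumb_up = model.predict_proba([[0.85, 1.0]])[0][1]
```

Note that no feature scaling is applied before `fit`; the tree-based learner is insensitive to monotone rescaling, which is one of the practical advantages mentioned above.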
The model performance is evaluated using a relatively standard set of binary classification metrics. In particular, the primary metrics used for the final decision making (e.g., threshold determination) were selected based on the desired tradeoff between precision and recall. The interpretation of these metrics is easily related back to our stated goals. Measuring the precision of the model allows us to estimate the overall positivity of the listener feedback for our recommendations. Measuring the recall of the model allows us to control the number of unique stations and listeners who’ll respond positively to our recommendations. In other words, accomplishing our goals is a matter of finding the appropriate threshold that delivers the optimized tradeoff between precision and recall.
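One simple way to realize this tradeoff is to scan candidate thresholds and pick the lowest one whose precision meets a target, which maximizes recall subject to that precision constraint. A simplified sketch (not our production logic):

```python
def pick_threshold(scores, labels, target_precision):
    """Return the lowest score threshold whose precision meets the
    target; among qualifying thresholds this maximizes recall (reach)."""
    for t in sorted(set(scores)):
        predicted = [s >= t for s in scores]
        true_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        false_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        if true_pos + false_pos == 0:
            continue
        if true_pos / (true_pos + false_pos) >= target_precision:
            return t
    return None  # no threshold reaches the target precision
```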
In terms of actual implementation, we leveraged two software packages that the science team at Pandora has built: (1) SavageML, a base machine learning library written in Java, and (2) SavageML-Spark, an ML pipelining package built on Spark and written in Scala. The actual steps of the model training and evaluation are the following:
- Load all of the training and testing data into distributed memory, compute summary statistics for the features, and generate a mapping that appropriately parses and transforms the input format.
- Train the model on the transformed training data.
- Compute summary statistics about the model.
- Make all of the predictions on the testing data.
- Compute the metrics on the testing data predictions.
- Construct a summary of diagnostic information and email the results to interested parties. This summary email has proven invaluable for quickly detecting and diagnosing issues that might arise.
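The flow of these steps can be sketched in miniature (all names are hypothetical; the production pipeline runs on SavageML and SavageML-Spark over distributed data):

```python
def run_pipeline(train, test, fit, notify):
    """Miniature stand-in for the steps above. `train`/`test` are lists
    of (features, label); `fit` returns a scoring function; `notify`
    receives the diagnostic summary (the real pipeline emails it)."""
    # 1. Summary statistics over the input (here, just the width).
    n_features = len(train[0][0])
    # 2. Train the model on the (transformed) training data.
    model = fit([x for x, _ in train], [y for _, y in train])
    # 3-5. Predict on the testing data and compute evaluation metrics.
    scores = [model(x) for x, _ in test]
    correct = sum((s >= 0.5) == (y == 1) for s, (_, y) in zip(scores, test))
    metrics = {"n_features": n_features, "accuracy": correct / len(test)}
    # 6. Summarize diagnostics and notify interested parties.
    notify(metrics)
    return metrics
```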
Model Prediction and Deployment
Now that the model is constructed and we are satisfied with the performance, we need to define the prediction data — i.e., the data set on which we will apply the model and generate our final set of recommended station-track pairs. The final set of recommendations can then be deployed to production and integrated into the radio service.
One important subtlety about this model formulation is that there is a large amount of sampling bias that is introduced when we are training this model. It would be easy to be overconfident and build prediction data that is drawn from an entirely different distribution than the training data. Essentially, we are training this model on historical examples in which the Pandora radio service has decided to play a track on a given station. The vast majority of spins provided by the radio service are understood to be high-quality recommendations. As a result, the model is largely uninformed about what would happen if we played a wholly inappropriate track on a given station.
With this in mind, examples in the prediction data need to be drawn from a distribution that is sufficiently similar to the distribution of training examples used to fit the model. Accordingly, the full prediction data set is constructed using a two-stage approach for each track that we would like to promote:
- Stage 1: Identify all of the seeds for which the radio service has either (1) demonstrated historical success for that track on that seed or (2) designated that there is interest in testing that track on that seed (i.e., we don’t know if the crowd will respond positively to it yet, but the system believes it is in the realm of possibility).
- Stage 2: Select all stations that are instances of the seeds identified in Stage 1.
The final set of recommendation candidates is the concatenated set of all station-track pairs that are generated in Stage 2. The final recommendations are the subset of those candidates that sufficiently satisfy the desired threshold for the candidate track. Note, the final threshold can be specific to each track. For example, we might lower the threshold for a track in order to achieve a predetermined minimum number of recommendations for that track.
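The two-stage candidate construction and the final thresholding can be sketched as follows (data structures and status names are hypothetical):

```python
def build_candidates(track, seed_status, stations_by_seed):
    """Stage 1: keep seeds where the track has demonstrated historical
    success or has been designated for testing. Stage 2: expand each
    eligible seed to all of its stations."""
    eligible = [seed for seed, status in seed_status.items()
                if status in ("historical_success", "test_interest")]
    return [(station, track)
            for seed in eligible
            for station in stations_by_seed.get(seed, [])]

def final_recommendations(candidates, score, threshold):
    """Keep candidate station-track pairs whose estimated
    P(thumb up | thumb) meets the (possibly per-track) threshold."""
    return [pair for pair in candidates if score(pair) >= threshold]
```

Because the threshold is a parameter, lowering it for a particular track to hit a minimum recommendation count requires no change to the candidate set.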
The final selected recommendations are uploaded to a Redis cache. In production, the radio service performs a quick lookup in Redis to see if we need to inject any of the recommendations into the set of considered tracks (i.e., the song pool) for that station. From the perspective of the radio service, the song targeting recommendations can be viewed as simply another recommender in the arsenal of recommenders that we have developed at Pandora over the past decade.
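The production lookup can be sketched as follows; the key scheme and client wiring here are hypothetical (any client exposing a `get` method works, such as `redis.Redis(decode_responses=True)`):

```python
import json

def lookup_targeted_tracks(cache, station_id):
    """Return track ids to inject into this station's song pool, or []
    if no recommendations were deployed for it. `cache` needs only a
    get(key) method, matching redis.Redis(decode_responses=True);
    the key scheme is hypothetical."""
    raw = cache.get(f"song_targeting:{station_id}")
    return json.loads(raw) if raw else []

class DictCache:
    """In-memory stand-in for the Redis client, for illustration."""
    def __init__(self, data):
        self._data = data
    def get(self, key):
        return self._data.get(key)

cache = DictCache({"song_targeting:station_42": json.dumps(["track_a", "track_b"])})
```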
An illustration of how the prediction data and implementation might work for a single track is below.
Note, the data, model training, and prediction framework are refreshed daily. Thus, the number of stations and seeds on which we recommend a given track is dynamic and evolves in ways governed by the recent data we have collected.
The promotional results of Pandora’s Featured Tracks program demonstrate the tremendous ability of the song targeting framework to simultaneously increase the reach of these tracks and maximize the quality of the listener experience. More generally, however, song targeting represents a powerful and personalized framework for taking a collection of songs and putting them in front of the right listeners in the right context. This level of personalization could not have been achieved without the unique knowledge developed through Pandora’s radio service: more than 76 million active monthly users, logging well over 5 billion listening hours per quarter and providing us with more than 80 billion pieces of feedback across the 11 billion stations they’ve created since 2005.