Saving the (Dev) Environment

Eric Huynh
Algorithm and Blues
11 min read · Jan 29, 2018


Disposable Environments

Credit: Cassidy Hsieh

Environments at Pandora have always been a fun topic. We have many of them, each with its own unconventional name: Production (my personal favorite by a long shot), Flex, Mobile-Test, Beta, SpaceX, Pillar, Stage, QA, Meshuggah, Clapton, Gaga, etc. They provide different functions. They contain different sets of services. They get updated in different ways. But they all have one thing in common.

They are hard to set up, let alone maintain and update.

What is an environment at Pandora?

An environment is a set of services that run together to serve some purpose (e.g. development and testing). Backend engineers use dev environments to build and test backend services. Mobile, web, and consumer electronic engineers use dev environments to build the client apps against those backend services. Quality engineers use testing environments to ensure the client code is able to “unleash the infinite power of music.” Sometimes testing environments are dev environments. Essentially, environments allow us to build the platform that delivers music to millions of people.

What makes up an environment?

At a minimum, an environment comprises three main components. The first is a cluster of physical hosts in a datacenter, which houses the remaining two virtual components: services and databases.

A simple environment for reference

Each environment consists of multiple hosts that run at least 30 distinct services, each in its own Pandora virtual machine (VM). A Pandora VM is similar to a process virtual machine in that each one runs its own service (or set of services) depending on its configuration. Multiple services can run within a single VM, but each service requires its own minimal configuration to fully function.

Numerous databases are needed to support all of these VMs. Each database needs the schema that matches the version of the services connecting to it. Some must be prefilled with music data, which is inherently independent of the environment.

How do you set up an environment?

Prior to the concept of disposable environments (more on this later), it took an engineer with a high-level understanding of all of Pandora’s systems up to two months to set up a new production-like environment that plays all three tiers of Pandora: ad-supported radio, ad-free radio, and on-demand content. Environments that do not need all of the backend services are faster to set up (hours to weeks, depending on how many services need to be deployed). The beginning work is mainly delegated to SysAd (System Administrators) and DBAs (Database Administrators), leaving the setup of the VMs and services to the engineer.

SysAd needs to acquire the necessary hardware, create and register each Pandora VM in an internal VM registration database, attach network interfaces to each of those VMs, set up load balancers for pools of multiple instances of a service, and more. Many of these tasks must be repeated for every VM in the environment and are not yet automated.

The DBAs need to create all the databases, fill them with the correct data if needed, and configure them to run optimally. In the past, we’ve simply backed up and restored databases from other environments, but this has created problems when the data inside the database is environment-specific.

Finally, an engineer can begin setting up the individual VMs. Each VM needs a VM config that allows it to run one or more services (generally VMs run only one service for simplicity’s sake). Since services need to know where other services are, the configs include hostnames of other VMs in the environment, which makes them unusable in other environments. Each environment therefore has its own distinct set of VM configs, and none of the work that goes into creating one environment can be reused when setting up the next.

The configs themselves are structurally similar, but their values differ, mainly due to hard-coded hostnames. Since each VM runs its own service, there are 30+ configurations per environment living in various repositories, and all of them are constantly changing.

Here is a reduced VM configuration that runs the catalog service in the mobile-test environment:

export APACHE_LISTEN_ADDRESS='mobile-support4-catalog.pandora-eng.com' #! hard-coded environment hostname
export APACHE_LISTEN_PORT='1234'
export CATALOG_DB_HOST='localhost' #! pins the database to the VM's own host
export CATALOG_DB_NAME='catalog'
export CATALOG_DB_PORT='9876'
export JETTY9_DEPLOY_APPS='catalog'
export JETTY9_LISTEN_ADDRESS='mobile-support4-catalog' #! hard-coded environment hostname
export JETTY9_LISTEN_PORT='1234'
export REMOTING_RADIO_HOST='radio-soa-vip-mobile.pandora-eng.com' #! hard-coded environment hostname
export REMOTING_RADIO_PORT='1234'
export START_Jetty9='yes'

Note how the hostnames (flagged with #! above) all reference the environment the VM is in, and how the catalog database is forced to live on the same host as the VM.

After the configs are sorted out, the deployment process can take place. This is where the bulk of the troubleshooting occurs, due to services failing to start up correctly.

Once all the services are running and reporting back healthy, we’ve reached the last step: more configuration, but at the runtime level. A/B experiments need to be launched (we generally launch new features here at Pandora through A/B experiments to make sure everything looks good before dialing up), runtime configuration for each service needs to be set appropriately, devices and vendors need to be created along with their corresponding advertisement schemas, and so on.

How do you manage and update an environment?

We have one engineer who is tasked with keeping the environment that QE (Quality Engineering) uses stable and up-to-date. That environment is one of the very few environments that can replicate every feature that Pandora has to offer. To minimize downtime, deployments of most services/VMs are done once a week after-hours. These can take all night. Any downtime of the environment severely hurts QE’s ability to meet deadlines. As such, deployments and updates outside of the regular schedule are restricted as much as possible.

There are other environments for engineering teams to develop and push updates to their services. Those are managed by the teams themselves and are notoriously difficult to maintain and keep running. On many occasions, certain services in those environments can be down for weeks at a time because no engineer can know ALL of Pandora’s services well enough to figure out why something is broken.

Summary of the Problem

At this point, it is pretty clear that environments are a sizable issue here. Setting them up is no small feat for one engineer, as it requires not only a high-level knowledge of all of Pandora’s systems and services, but also a great deal of manual effort and configuration. Maintaining them and making sure they are fully functioning is even harder because of all the moving parts involved during updates and new releases.

So, here comes the fun part: how do we solve this problem to prevent productivity issues?

Solving the Problem

The main goal of the disposable environment project is to enable automated environment creation that produces an environment that is fully ready to play all three tiers of sweet, sweet music on at least the web and mobile clients.

Any task that can be automated, should be. The only component of an environment that can’t be automated is the physical component, the hosts and servers that the environment runs on. As such, the only manual labor that SysAd needs to do is set up the hosts.

So we begin with setting up the VM for each service. Enter Dora the Deployer, a build-deploy tool based on a tool developed at Rdio (but built at PanDora).

Side Note: So far in my five months here, I have noticed two common naming trends for our projects and tools. The first is cartoon characters for kids. In addition to Dora the Deployer, we have a service creation tool called Bob the Builder for creating services that will be deployed into our cloud pipeline (which we call Savage Cloud, after the original name of Pandora, Savage Beast). The second trend is Greek words and Greek mythological characters. For that we have Charon, Pithos, Minos, and of course…Pandora. I still do not have a cool name for this project and am actively taking suggestions.

Back on topic. Without Dora it would have been significantly harder to begin this project. Dora has a special Pandora VM-oriented feature that automates the VM creation, deployment, and management process (shoutout to everybody who worked on Dora).

So how does Dora work? After you’ve created your VM with Dora, you tell it what target (aka service) you want to deploy and what version of that target. An Ansible playbook then tells the host of the VM to pull a prebuilt deploy artifact from Nexus. The deploy artifact contains the service’s base VM config and a script to install the VM and sync down all the artifacts (defined in the VM config) necessary for the service that is running in that VM.
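
To make that flow concrete, here is a minimal sketch in Python of what a Dora-style deploy amounts to. The Nexus URL, artifact layout, and install script path are all hypothetical; the real mechanics live in Dora’s Ansible playbooks.

import subprocess

# Hypothetical base URL; the real repository is internal to Pandora.
NEXUS_BASE = "https://nexus.example.com/repository/deploys"

def deploy(host, target, version):
    """Pull a prebuilt deploy artifact onto the VM's host and install it."""
    artifact_url = f"{NEXUS_BASE}/{target}/{target}-{version}.tar.gz"
    # Fetch and unpack the artifact on the remote host.
    subprocess.run(
        ["ssh", host, f"curl -fsSL {artifact_url} | tar xzf - -C /opt/deploys"],
        check=True,
    )
    # The artifact contains the service's base VM config and an install
    # script that creates the VM and syncs down the artifacts it defines.
    subprocess.run(["ssh", host, f"/opt/deploys/{target}/install.sh"], check=True)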

Remember how I mentioned that a service in one environment will have a different VM config than the same service in another environment due to issues like hard-coded hostnames? In the Dora world, a VM config should contain no information about which environment it is used in (see: https://12factor.net/config). Essentially, each VM config needed to be universal, allowing the same config to be reused in multiple environments. A golden copy, if you will. As a result, this called for investigating over 30 services and their VM configs, then painstakingly crafting universal configs.

Service registration with Consul includes both listen address and port

One of the major hurdles in creating these universal VM configs was getting rid of hard-coded hostnames for remoting calls to other services. When Dora deploys a VM, the services in the VM are registered with Consul. Consul provides us with service discovery and health checks (as defined by the service registration), letting us skip most of the tasks that SysAd has to do! It also acts as a load balancer if multiple instances of the same service exist in an environment: when a service becomes unhealthy, it is removed from the pool and not returned in a service discovery request.
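
For illustration, registering a service with a local Consul agent looks roughly like the sketch below. The environment tag, address, port, and /health endpoint are assumptions, not our actual values.

import requests

# Register the catalog service with the local Consul agent.
registration = {
    "Name": "catalog",
    "Tags": ["dev1"],           # environment name doubles as a tag
    "Address": "10.0.0.12",     # the VM's bind address
    "Port": 1234,
    "Check": {                  # health check Consul will run for us
        "HTTP": "http://10.0.0.12:1234/health",
        "Interval": "10s",
    },
}

resp = requests.put(
    "http://localhost:8500/v1/agent/service/register", json=registration
)
resp.raise_for_status()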

Service discovery works through simple DNS. When a service is registered with Consul, its listen address can be discovered by querying for [tag_or_environment_name].[service_name].service.consul. The environment and service name are provided to Consul during service registration. As such, any config value that was once a hard-coded hostname now becomes the above query, with a bash environment variable (set upon deployment) as the environment name. For example, the hostname for the radio service becomes $VM_ENVIRONMENT.radio.service.consul, where VM_ENVIRONMENT is the name of the environment containing the desired radio instance.
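
In code, resolving one of these names is just an ordinary DNS lookup, assuming the host’s resolver forwards *.consul queries to Consul (e.g. via dnsmasq). The environment name “dev1” here is a hypothetical stand-in for $VM_ENVIRONMENT:

import socket

# $VM_ENVIRONMENT would be set at deploy time; "dev1" is a stand-in.
host = "dev1.radio.service.consul"

# Resolves to the address of a healthy radio instance in that environment.
print(socket.gethostbyname(host))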

Here is the same reduced VM configuration for the catalog service from earlier but now using the above:

export APACHE_LISTEN_ADDRESS=$VM_BIND_ADDRESS
export APACHE_LISTEN_PORT='1234'
export CATALOG_DB_HOST="$VM_ENVIRONMENT.catalog-db.service.consul"
export CATALOG_DB_NAME='catalog'
export CATALOG_DB_PORT='9876'
export JETTY9_DEPLOY_APPS='catalog'
export JETTY9_LISTEN_ADDRESS=$VM_BIND_ADDRESS
export JETTY9_LISTEN_PORT='1234'
export JETTY9_LOG4J_SEPARATION='yes'
export REMOTING_RADIO_HOST="$VM_ENVIRONMENT.radio.service.consul"
export REMOTING_RADIO_PORT='1234'
export START_Jetty9='yes'

Any reference to the environment is gone, and now it looks more like a template. VM_BIND_ADDRESS is another bash environment variable set upon deployment.

For now, ports are still hard-coded because the port a service listens on is, generally, consistent across environments. However, discovering a service’s listen port is as easy as querying Consul’s service catalog HTTP API or requesting the SRV records via the DNS interface.
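
As a sketch, here is what that port discovery might look like against Consul’s catalog HTTP API, assuming an agent listening on localhost:8500 and a hypothetical environment tag of “dev1”:

import requests

# Ask Consul's catalog for radio instances tagged with our environment.
resp = requests.get(
    "http://localhost:8500/v1/catalog/service/radio",
    params={"tag": "dev1"},
)
resp.raise_for_status()

for instance in resp.json():
    # Each entry carries the registered listen address and port.
    print(instance["ServiceAddress"], instance["ServicePort"])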

An example request for “Hello” by Adele and how Consul is used to discover where to route the request (simplified diagram of how the request is actually handled by our services)

Dora paired with Consul forms a more productive and dominant duo than the Lakers’ 1–2 punch of Kobe and Shaq during their three-peat in the early 2000s. (This is coming from a major Lakers fan, so that’s how amazing Dora and Consul are.) The only SysAd JIRA ticket that needs to be filed now is to set up the hosts, though one day we may want to deploy these environments into Google Cloud or Amazon AWS.

Setting up all the necessary databases is now scripted away using a popular Python library for interacting with Postgres, psycopg2, and Python’s subprocess module. The script takes care of everything, from creating the databases on the remote hosts to birthing the necessary schemas. It also registers the databases with Consul as a service so that the VM configs can once again locate the appropriate databases without hard-coding the database’s connect address. All of this is done for databases that are dependent on the environment. Other databases, like music catalog, which just contain music data and can be independent of the environment, are replicated and stored on two universal hosts that all of these disposable environments will share.
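
A minimal sketch of that database setup, with hypothetical hostnames, credentials, and ports (the real script handles far more, such as tuning and prefilled data):

import psycopg2
import requests

def create_database(db_host, name, schema_sql, env):
    # CREATE DATABASE cannot run inside a transaction block,
    # so the first connection uses autocommit.
    conn = psycopg2.connect(host=db_host, dbname="postgres", user="admin")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(f'CREATE DATABASE "{name}"')
    conn.close()

    # Apply the schema matching the service version being deployed.
    with psycopg2.connect(host=db_host, dbname=name, user="admin") as db:
        with db.cursor() as cur:
            cur.execute(schema_sql)

    # Register the database with Consul so VM configs can discover it
    # (e.g. as $VM_ENVIRONMENT.catalog-db.service.consul) instead of
    # hard-coding its connect address.
    requests.put(
        "http://localhost:8500/v1/agent/service/register",
        json={"Name": f"{name}-db", "Tags": [env], "Address": db_host, "Port": 9876},
    ).raise_for_status()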

Now, given a set of fresh hosts with only the necessary packages installed and a JSON definition of an environment (see code block below), it takes 10 minutes on average from start to finish to set up said environment and have Pandora playing music in it.

{
  "name": "simplified_environment_example",
  "owner": "pandora",
  "hosts": {
    "host1": {
      "dbs": [],
      "vms": ["radio"]
    },
    "host2": {
      "dbs": [],
      "vms": ["catalog"]
    },
    "host3": {
      "dbs": ["catalog"],
      "vms": []
    }
  }
}
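
A driver consuming such a definition might look like the sketch below, reusing the hypothetical create_database and deploy helpers from earlier (and an assumed schema-file location). The real tooling is a full Python library, not a dozen lines.

import json

def create_environment(definition_path, version):
    """Create the databases and deploy the VMs an environment defines."""
    with open(definition_path) as f:
        env = json.load(f)

    for host, spec in env["hosts"].items():
        for db in spec["dbs"]:
            with open(f"{db}.sql") as schema:  # hypothetical schema location
                create_database(host, db, schema.read(), env["name"])
        for vm in spec["vms"]:
            deploy(host, vm, version)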

Just as it’s easy to bring up a brand new environment, it is equally easy to tear one down, which is why these are referred to as “disposable environments.” Destroying all the databases and stopping all the VMs takes 3–4 minutes at most. Updating all the services/VMs takes about 8 minutes because everything is done in parallel (thank you, Python subprocesses) in a three-tier deployment process. Rolling back an entire environment takes approximately 6–7 minutes, thanks to Dora’s way of keeping releases on the hosts and just symlinking to the desired version. All of this is now done through various self-service Jenkins jobs that call out to the Python library for managing these types of environments.
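
Under stated assumptions (hypothetical paths, and ignoring the three-tier ordering), the parallel update and symlink-based rollback reduce to something like this:

import subprocess
from concurrent.futures import ThreadPoolExecutor

def switch_version(host, target, version):
    # Releases stay on the host, so updating or rolling back is just
    # repointing the "current" symlink and restarting the service.
    release = f"/opt/releases/{target}/{version}"
    subprocess.run(
        ["ssh", host,
         f"ln -sfn {release} /opt/vms/{target}/current && "
         f"/opt/vms/{target}/current/restart.sh"],
        check=True,
    )

def update_all(vms_by_host, version):
    # Fan out across every host/VM pair in parallel.
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(switch_version, host, target, version)
            for host, targets in vms_by_host.items()
            for target in targets
        ]
        for f in futures:
            f.result()  # surface any failures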

Next Steps

Although the numbers look good and a lot has been solved, there’s still a significant amount of work to be done to perfect this and to utilize its full potential.

We currently do a build on every commit to the old non-service-oriented codebase. To prove how fast environment deploys are, we added a new Jenkins continuous deployment job, triggered by a successful build, to deploy to a sanity environment (shoutout to Aliaksei Dubrouski for teaching me Jenkins Pipeline). After a successful deployment, basic sanity tests run to verify that the environment is working.

Because network interfaces are no longer attached to each VM (to avoid work done by SysAd/NetOps), port isolation becomes a major issue: services running Jetty on the same host can no longer all bind to the same port. Currently, this is solved by just making sure none of the ports conflict and storing that information in an ugly spreadsheet. NetOps (Network Operations) is working on solving this through OpenStack for host and network virtualization, which has been extremely promising in our initial tests. The hope is that we will be able to automate virtual host creation using the OpenStack API and deploy one VM/service per virtual host, similar to our Savage Cloud pipeline, which uses Docker containers to deploy newer Pandora services. OpenStack, along with Consul and Dora, will definitely form a more successful “Big 3” than LeBron, D-Wade, and Chris Bosh.

Configuration of these environments is still hard. To configure them now, we just load a dump of known runtime configurations that allow for all three tiers of Pandora to play right off the bat. Hopefully an import-export tool will solve this in the near future.

Data. Big data. Any sort of data is still a problem. Environments need a lot of data to function and produce a lot of data that is then used to customize features. Our production environment uses BILLIONS of data points to personalize every user’s music listening experience. How do we customize that experience on a fresh environment that has a tiny amount of listener data? ¯\_(ツ)_/¯

An almost existential question is, should multiple environments even exist? Managing one environment (production) is already a hard job, let alone several in-house ones. Some companies have a single environment (production!) where they utilize special code paths triggered for in-house test accounts.

Those last two issues are huge and complex. They will take years to resolve. Until then, the concept of disposable environments will live on…though hopefully with a cooler-sounding name.

Pandora is hiring! We currently have open positions in Engineering, Data Science, and Product Management.
