Migrating a Big Data Environment to the Cloud, Part 1

LiveRamp is a big data company.

A lot of companies have big data.  Robust logging frameworks can generate a PB of logs before breakfast and stash it away forever on S3, just in case.

A few companies even use their big data.  They have a product, and then use Hadoop and Spark to do some machine learning to generate some product recommendations.

But even fewer companies are big data companies in the way LiveRamp is.  Every dollar we get from customers is powered by our Hadoop processing pipeline: LiveRamp sells a wide range of products, and all of them run through the same extract, transform, load, and join workflow.  If we turn our Hadoop infrastructure off, we stop selling products.

As of last year, all of LiveRamp's big-data computation happened in our on-premise data center, on our 2,500-node Cloudera Hadoop cluster.  This year, we are moving it to GCP.

Sasha Kipervarg, Patrick Raymond and I presented at Google Next about this journey, what we learned, and what our next steps are.  In this series of blog posts, I’ll dive deeper into this migration from the technical perspective, focusing on:

  • How LiveRamp’s on-premise big data infrastructure worked as of 2018
  • Why we decided to migrate
  • What we want our infrastructure to look like on GCP
  • How we got there
  • Where we hope to go next

We are excited about how this project, although a massive undertaking, will transform the development experience at LiveRamp and let us bring scalable, reliable products to market faster than ever before.

LiveRamp at a glance

LiveRamp has a wide range of products, but they all revolve around joining customer CRM data against match datasets to move data between ecosystems.  We deliver this transformed data into the ad-tech ecosystem in two ways: via a batch file delivery pipeline, and via a real-time pixel server.
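
To make that workload concrete, here is a minimal sketch of the kind of key-based join the pipeline runs, written against Spark's Scala API.  The dataset paths, column names, and object name (crm_records, match_table, anonymous_id, destination_id) are hypothetical stand-ins, not our actual schema.

import org.apache.spark.sql.SparkSession

object MatchJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("match-join-sketch")
      .getOrCreate()

    // Hypothetical inputs: customer CRM records, plus a match table linking
    // our anonymous identifiers to a destination ecosystem's identifiers.
    val crmRecords = spark.read.parquet("hdfs:///data/crm_records")
    val matchTable = spark.read.parquet("hdfs:///data/match_table")

    // The heart of the pipeline: a large key-based join that attaches a
    // destination-ecosystem ID to every CRM record we can match.
    val matched = crmRecords
      .join(matchTable, Seq("anonymous_id"))
      .select("destination_id", "segment_data")

    // Hand the matched records off for batch file delivery.
    matched.write.parquet("hdfs:///deliveries/outbound")

    spark.stop()
  }
}

The join itself is the easy part; the engineering challenge is running thousands of jobs like this every day over petabytes of input, which is what the cluster described below is sized for.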

The Hadoop ecosystem is uniquely suited to performing massive data joins, and that is what we use.  The bulk of our hardware is dedicated to our Cloudera Hadoop cluster. Our on-premise cluster maxed out at:

  • 2,500 worker nodes
  • 90,000 CPUs
  • 300TB memory
  • 100PB storage
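
For a rough sense of node shape, those numbers work out to about 36 CPUs, 120GB of memory, and 40TB of storage per worker node.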

We keep this infrastructure busy too, with over 100,000 YARN applications a day, 13PB read and written per day, and over 80% utilization.

Any company with 150 (and growing) engineers has a lot of services and support infrastructure.  As of 2018, most of this ran on over 500 VMware virtual machines provisioned by Chef (plus a smaller CoreOS Tectonic Kubernetes cluster).  Our real-time key-value serving stack was powered by a homegrown open-source project.

Incoming files and logs, and files delivered to partners, each averaged about 8TB of data a day, and our pixel server averaged about 200,000 QPS.

While we had a few services running on AWS (international teams and our pixel server), the vast majority of this hardware ran out of our on-premise data center.

To the cloud

While there was a lot we didn't like about our infrastructure, there was one thing that kept us on it: it worked.  But by mid-2017, we realized we couldn't scale up our datacenter footprint to support the international presence we needed.  We had all the usual motivations for moving to the cloud:

  • Scale: we needed to be able to scale up our infrastructure faster than we could buy servers.  We didn't want our growth to be limited by two-month hardware purchase cycles and limited rackspace.
  • Disaster recovery: we were not happy with our disaster recovery story.  We wanted to be able to recover from catastrophic downtime in hours, not in the weeks it would take to pull from cold storage.
  • Frankly, recruiting: engineers want to grow skills that matter and transfer, and in 2019, that means cloud literacy.
  • Development speed: if 30% of our developers don't need to maintain our infrastructure, they can be developers again, and we can bring products to market 30% faster.

So by late 2017, we were seriously evaluating cloud providers and trying to envision what LiveRamp would look like as a cloud-native company.

Why GCP

We love GCP, but we know it's not the default choice.  Our decision to go with GCP had two drivers:

  • The technology
  • The people

Our full technical evaluation can't fit into a blog post, but I'll note that GKE was one of the biggest draws.  One thing we knew we wanted going into this migration was to move all our applications and services to Kubernetes, and without digging into the details, GKE is the clear leader among managed Kubernetes offerings.

At the end of the day, though, we could have made any of the leading clouds work.  The main differentiator was the people we worked with.  GCP connected us with engineers who wanted to answer our questions and solve our problems.

We were not being bugged about professional support contracts.  We were talking to highly skilled engineers who could quickly answer our questions.  This gave us confidence that we could work together with GCP to solve whatever problems we ran into — and that has stayed true.

In our next post, I’ll discuss what aspects of our big data infrastructure translated directly to GCP, and what we decided to re-engineer during the migration.  Stay tuned!