Spinning up a New Colocation Facility!

Sanjay Kharb

Sanjay Kharb and Masthan on August 30, 2016

Beyblade, the beginning!!!

When you challenge yourself to light up a new data center built in 2 months, things look crazy. In the words of my program manager Masthan, “that sounds like spinning our DC-1 Data Center (DC) like a Beyblade, the toy my kids often play with at home.” In practice, this means ripping DC-1, launching DC-2, and eventually serving out of our new single colo (DC-2). This facility will handle 2.5 billion requests per day.

The background...

The new DC location had to be in line with our long-term roadmap for serving the US. That is, should we retain the east-west coast presence or move to a single location in the middle of the US? This also meant deciding whether we really needed 2 colos or just 1.

The Beyblade way!!

Building a new DC or migrating an existing DC was not something new for InMobi’s Production Infrastructure Engineering (PIE) teams. What was unique about this migration was building the target DC from the ground up using the hardware from the source DC (DC-1), while limiting the downtime of the DC in flight (DC-1) to no more than 10 days. Besides, the Beyblade team was challenged to productionize the new DC (go live with production traffic) within 48 hours of the hardware being racked, stacked, networked, and powered on. We also needed to migrate terabytes of data while keeping it in sync with the latest state.

Objectives:

While migrating out of DC-1 was a straightforward goal, there were other overriding goals that made this whole effort really challenging:

  1. The overall ad serving and business situation should continue the way it was prior to the consolidation.
  2. An optional DR (semi-to-full) was to be evaluated and implemented at a reasonable price point.
  3. The migration had to be completed within the timeline (end of July), strictly with no penalties.

The plan - what’s our renewed DC strategy for US?

The decision to finalize DC-2 as our new data center was not an easy one. While this was a small part of the larger initiative to restructure InMobi’s US DCs, the Infra team within PIE did extensive research on colocation facilities and providers, and sought input from the Tech, Business, and Product orgs, before settling on DC-2 as the single US DC for our serving needs.

The following were the various interim options and alternatives that were closely scrutinized as part of our US DC strategy:

  • There were three options: east coast, middle, and west coast. Among them, the east and west coast were highly recommended, given that these were mega colocation facilities at very competitive costs with sufficient space to grow in the future.
  • Infra Engineering thoroughly analyzed all available options and came up with three recommendations.
  • While retaining the 2 DC model was not too cost prohibitive and brought in an inherent DR option, the operational overhead and costs associated with running the infra of 2 DCs made this option less favorable.
  • The other big challenge was managing data in our humongous Hadoop grid: even if it could cope in the short term, it would struggle to accommodate future growth.
  • Given that DC-2 is in the Central time zone and the network latencies from both coasts were at an acceptable level [mostly 40-60 ms], we finally chose it as our new DC.
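
The latency criterion in the last point can be sketched as a simple filter. This is only a minimal illustration, not the tooling the team used: the site names, region names, and round-trip times below are made up, and the 60 ms ceiling is an assumption taken from the upper end of the 40-60 ms range quoted above.

```python
# Hypothetical round-trip times (ms) from each user region to each candidate site.
# All numbers are illustrative, not measurements.
RTT_MS = {
    "east":    {"us-east": 15, "us-west": 75},
    "central": {"us-east": 45, "us-west": 50},
    "west":    {"us-east": 78, "us-west": 12},
}

MAX_ACCEPTABLE_RTT_MS = 60  # assumed ceiling for "acceptable" latency


def worst_case_rtt(site_rtts):
    """Highest RTT any region would see when served from this site."""
    return max(site_rtts.values())


def single_site_candidates(rtt_ms, ceiling_ms):
    """Return sites whose worst-case RTT from every region stays within the ceiling."""
    return sorted(
        site for site, rtts in rtt_ms.items()
        if worst_case_rtt(rtts) <= ceiling_ms
    )


print(single_site_candidates(RTT_MS, MAX_ACCEPTABLE_RTT_MS))
```

With these made-up numbers, only the central site keeps both coasts under the ceiling, which mirrors the reasoning behind a single mid-country DC.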

Approach

Instead of focusing on just getting things done, the Beyblade team took the following approach, which was unique in InMobi’s history of DC migrations:

  1. Every decision was to be backed by very detailed data. In the absence of data, provide the rationale behind the decision, clearly calling out the assumptions if any.
  2. To bring an engineering mindset to everything we do, create and share design docs, take data-driven decisions, iterate often, fail fast, and take informed risks wherever needed.
  3. Avoid surprises: educate and involve the key stakeholders from planning through execution (vs. just providing a high-level picture).
  4. To keep execution at its best, communicate often and clearly, and call out risks and blockers early.
  5. Given that more DC build-outs will follow this migration, templatize the DC build and migration plans so that future migrations will be a cakewalk.
  6. Challenge ourselves to deliver in line with the most optimal plan: 10 days for the whole migration and 48 hours to build and light up DC-2.

The Beyblade buildout and migration plan had three phases (finish timelines in the brackets):

Phase I - Move all small single-homed services from DC-1 to DC-2.

Phase II - Make DC-1 production free and shut it down. Prior to this, re-home DC-1’s traffic to a mini-colo (a small data center that InMobi specializes in lighting up within 72 hours). Redo traffic engineering based on geolocation, latency proximity, and load factors.

Phase III - Build DC-2, commission DC-1’s hardware in DF1, and go live.

There were both a high-level plan and a day-wise plan, which helped keep all the stakeholders working on Phase II and III completely aligned.
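
The Phase II idea of steering traffic by latency proximity and load can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not InMobi's actual traffic engineering: the `Site` type, the site names, the sample numbers, and the 0.8 load threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    latency_ms: float  # latency from the client's geo to this site
    load: float        # current utilisation, 0.0 to 1.0


MAX_LOAD = 0.8  # assumed headroom threshold, not a figure from the migration


def pick_site(sites):
    """Prefer the lowest-latency site with headroom; otherwise the least-loaded one."""
    with_headroom = [s for s in sites if s.load < MAX_LOAD]
    if with_headroom:
        return min(with_headroom, key=lambda s: s.latency_ms)
    return min(sites, key=lambda s: s.load)


# Illustrative numbers only: the nearby mini-colo is too busy, so traffic goes elsewhere.
sites = [
    Site("mini-colo", latency_ms=20, load=0.85),
    Site("other-dc", latency_ms=55, load=0.40),
]
print(pick_site(sites).name)
```

The two-step rule (filter by headroom, then pick by latency) is one simple way to combine the factors; a weighted score would be an equally reasonable choice.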

The scale and complexity

While there are always a few areas for improvement, the PIE team specializes in huge-scale projects like this one, program managed with a comprehensive project plan (300+ tasks) along with a high-level, milestone-and-ownership-based project plan. The plans were tracked regularly through stand-up meetings (40+) and consistent org-wide updates, with risks called out early. There were also multiple rounds of negotiations and discussions with colocation facility providers.

In addition, the program touched the entire Engineering and Product teams and involved 80+ engineers from the InMobi side and 20+ data center ops staff from the facility provider. As stated earlier, this was a tight plan with zero buffer time.

Rip it the Beyblade way, the execution highlights!!

Since several members of PIE were closely involved in the execution, there were many examples that demonstrated our speed of execution and decision making. The total delay in lighting up DC-2 was 4 days, and some of the teams had to go the extra mile to overcome challenges at various points while chasing a tight plan. Towards the end, every member involved in this project was amazed to see the level of energy, passion, and execution excellence demonstrated. Many of those involved in the migration said, “it worked magically, many things could have gone wrong, but we pulled it off pretty well”! A classic case of how technology teams work at InMobi!