Friday, January 28, 2011

How we increased the capacity of our platform by 500% without any additional hardware

Introduction:

Functionality:

Functionally, our platform is geared towards the BPO space, where it handles the data and metadata related to all kinds of tickets / tasks that agents / supervisors in a BPO take up (this includes voice, chat etc.). As such, it interacts with various other internal / third-party sub-systems including Telephony Platforms, CMSs, Key Management appliances etc. The platform is also responsible for tracking the real-time states of the agents and handling task / ticket allocation and other notifications. Finally, it also enables monitoring of, and reporting on, everything.

Architecture:

Architecturally, for scalability, the platform is vertically partitioned into applications that serve different kinds of needs - separate applications / vertical partitions for data / call processing, others for metadata processing, agent management, SSO, authentication, agent skill group matching etc. These applications share a common replicated database (MySQL) and other infrastructural components like ActiveMQ for messaging and Memcached for caching. The entire setup is installed on two quadcore servers in a high availability configuration. This setup then becomes the shared-nothing topology that we clone at different locations (or whenever we need additional capacity).
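
As an illustration (not the actual platform code), here is a minimal sketch of how one of these vertically partitioned applications might talk to the shared infrastructure - ActiveMQ over JMS for messaging and Memcached for caching (assuming a client library such as spymemcached). The broker URL, queue name, cache key and payload below are hypothetical.

    import java.net.InetSocketAddress;
    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import net.spy.memcached.MemcachedClient;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class MetadataPublisher {
        public static void main(String[] args) throws Exception {
            // Shared messaging infrastructure (broker URL is illustrative)
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://mq-host:61616");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer =
                    session.createProducer(session.createQueue("ticket.metadata.updates")); // hypothetical queue

            // Shared caching infrastructure (host / port are illustrative)
            MemcachedClient cache = new MemcachedClient(new InetSocketAddress("cache-host", 11211));

            // One partition publishes an event for the others and caches the hot copy
            String ticketId = "TCK-1001"; // hypothetical ticket
            String payload = "{\"ticketId\":\"" + ticketId + "\",\"state\":\"ASSIGNED\"}";
            TextMessage message = session.createTextMessage(payload);
            producer.send(message);
            cache.set("ticket:" + ticketId, 3600, payload); // 1-hour TTL, illustrative

            cache.shutdown();
            connection.close();
        }
    }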

Additionally, from a fault tolerance and scalability perspective, we process some critical CPU-intensive tasks in a highly federated manner - all voice recordings are done on the agent desktops, and both data and metadata are initially cached there. As such, even if the entire platform fails (and only the telephony platform is available), the agent is still able to take / make calls (something that is mission critical from a BPO perspective) - the data and metadata get transmitted to our platform when it recovers from the fault. Reconciliation mechanisms are built in.
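
To make the federated / store-and-forward idea concrete, here is a rough sketch of what the agent-side buffering could look like (class and method names are hypothetical, not our client code): completed work is queued locally, and a background thread drains it to the platform whenever the platform is reachable; reconciliation on the platform side then deduplicates anything delivered twice.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch of the agent-side store-and-forward buffer.
    public class AgentSideBuffer {

        private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
        private final PlatformClient platform; // hypothetical wrapper around the platform's API

        public AgentSideBuffer(PlatformClient platform) {
            this.platform = platform;
            Thread drainer = new Thread(this::drainLoop, "agent-buffer-drainer");
            drainer.setDaemon(true);
            drainer.start();
        }

        // Calls keep working even when the platform is down: data is simply queued locally.
        public void record(String callDataOrMetadata) {
            pending.add(callDataOrMetadata);
        }

        private void drainLoop() {
            while (true) {
                try {
                    String item = pending.take();    // wait for buffered work
                    if (!platform.send(item)) {      // platform still unreachable?
                        pending.add(item);           // keep it for the next attempt
                        Thread.sleep(30_000);        // back off before retrying
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        public interface PlatformClient {
            boolean send(String payload);            // true if the platform accepted the item
        }
    }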

The Problem:

The platform (with the current hardware setup of two quadcore servers in HA configuration) had been certified to take a load of 60,000 calls a day with a safety factor of 25% ... things start slowing down after that. The business however expanded exponentially, and we started getting over 95,000 calls a day. To compound things, two thirds of the calls started coming in a 9-hour interval ... and business was still growing. This stretched the infrastructure we had provisioned to its limits - things started getting delayed - resulting in backlogs - and business started getting impacted.

Adding additional hardware (in a shared-nothing topology) would have been the quickest solution - alas, the entire setup was running in our data center ... and getting more hardware meant direct purchase / rent ... and the entire PROCESS (with all its authorizations / formalities etc.) takes time. And given the fact that we were planning to move to a hosted setup, business stakeholders were not too amenable to procuring more hardware.

The Solution:

We decided to "improve within the constraints".

To improve any of the non-functional characteristics of a system, you first need to understand its usage patterns and then find how those patterns define the bounds of the system.

We re-profiled the system and found that its bounds had completely changed from the original measures. The DB was causing problems, but the situation on the application server was even worse - especially from a threads perspective. All the HTTP threads that we had provisioned for the application were in use all the time - and as such new requests could not be served. The number of messages passing through ActiveMQ was so high that we hit a producer-consumer problem - the consumers were not able to consume as fast as the rate at which producers were sending messages. Finally, because of the variability of the call traffic, there were times when the system was absolutely choked - so the fault tolerance kicked in and data / metadata started being cached at the agent end - and there were lean periods when the infrastructure was barely used at all.

We also realized that a small slowdown of the DB compounded problems. Since the request rate stayed the same, if the request processing rate went down, all the HTTP threads got used up - and new requests were denied - some of these were critical requests - and their denial caused bigger issues within the system.
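
The compounding effect is easy to see with a back-of-the-envelope calculation (the numbers below are illustrative, not measured from our system): by Little's law, the number of busy HTTP threads is roughly the request arrival rate multiplied by the average processing time, so even a modest DB slowdown can exhaust a fixed-size pool.

    public class ThreadPoolBackOfEnvelope {
        public static void main(String[] args) {
            int poolSize = 200;              // illustrative HTTP thread pool size
            double requestsPerSecond = 500;  // illustrative arrival rate

            // Little's law: busy threads ~= arrival rate * average service time
            double normalServiceSeconds = 0.2;   // DB healthy
            double degradedServiceSeconds = 0.5; // DB slightly slow

            System.out.printf("healthy:  %.0f busy threads of %d%n",
                    requestsPerSecond * normalServiceSeconds, poolSize);   // ~100 -> comfortable
            System.out.printf("degraded: %.0f busy threads of %d%n",
                    requestsPerSecond * degradedServiceSeconds, poolSize); // ~250 -> pool exhausted
        }
    }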

Also, based on some usage patterns of the application, we realized that the business did not need ABSOLUTELY real-time information - some delay was permissible. There were only a couple of instances where real-time information was critical.

After brainstorming, we created a laundry list of things that we could do. We did not create a plan ... we just started going after the things that were either low-hanging fruit or could give us the biggest impact (everything else would have to wait until the initial set was completed, deployed in production and the system re-profiled to check the bounds / constraints - so we used a pull-based, iterative approach).

We decided to REDUCE / PRIORITIZE / ASYNCHRONIZE / BATCH !!

1. Based on the retention policy for various customers, we reduced data in the primary tables - shifted these to archive tables.
2. We re-partitioned the system to take critical real time functionality out as separate modules and gave them dedicated resources (threads / DB Connections / Memory / CPU) thereby prioritizing the real time activities.
3. Reduced the number of updates / messages being sent to the platform - this was done by rationalizing based on need - if a real-time update was not necessary, it was cached and then sent as a batch.
4. We made a lot of operations within the system asynchronous - anything that could wait was a candidate - tasks were queued up and processed in reasonable batches (with parameters that could be fine-tuned at runtime) - this brought predictability into the system and smoothed the peaks / troughs in the call / request traffic (a rough sketch of the pattern appears after this list).
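
As an illustration of point 4 (names and numbers below are hypothetical, not the production code), the pattern is essentially a bounded queue with a worker that drains it in batches, where the batch size and flush interval live in fields that can be changed at runtime (for example via JMX or a configuration refresh). Because the worker drains at its own steady pace, bursts of incoming work only deepen the queue instead of overwhelming the DB or the message broker.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch of the queue-and-batch pattern used for deferrable work.
    public class BatchedTaskProcessor {

        private final BlockingQueue<Runnable> deferredTasks = new LinkedBlockingQueue<>(100_000);

        // Tunable at runtime (e.g. exposed through JMX or re-read from configuration).
        private volatile int batchSize = 200;
        private volatile long flushIntervalMillis = 5_000;

        public void start() {
            Thread worker = new Thread(this::runLoop, "batch-flusher");
            worker.setDaemon(true);
            worker.start();
        }

        // Anything that can wait is enqueued instead of being executed inline.
        public boolean submit(Runnable task) {
            return deferredTasks.offer(task); // false => queue full, caller can fall back to inline processing
        }

        private void runLoop() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(flushIntervalMillis); // re-read every cycle, so runtime changes take effect
                } catch (InterruptedException e) {
                    return;
                }
                List<Runnable> batch = new ArrayList<>(batchSize);
                deferredTasks.drainTo(batch, batchSize);  // take at most one batch per flush
                for (Runnable task : batch) {
                    task.run();                           // e.g. a bulk DB write or an MQ send
                }
            }
        }

        public void setBatchSize(int size) { this.batchSize = size; }
        public void setFlushIntervalMillis(long millis) { this.flushIntervalMillis = millis; }
    }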

Six resources .... one month down the line ... four incremental production releases later ...

We continue to use the same hardware resources ..
We continue to have the same traffic ...
But ...... the platform handles all of this without a glitch. In fact it handles it at such a pace that, even though we asynchronized a lot of tasks, to most users it appears to process things almost in real time. After re-profiling the system, the capacity of the same two servers now stands at 3,00,000 calls a day with a safety factor of 15%. That's a whopping 500% increase.

Not that there is no scope for more improvement ... quite a few things still remain from the original laundry list that we drew up. Some of them have become redundant in the current context, because after re-profiling the system the bounds / constraints have changed ... but others are still pretty relevant. However, the current capacity of the system is well above what the business requires - so the focus has now shifted to features / radical functional changes that enhance the user experience / marketability / consumption of the platform in a SaaS-based manner .... but that's probably for another blog post...

ROI

From an ROI perspective here are some numbers:

  • Cost incurred: INR 6,50,000 (roughly the cost of resources for 1 month)
  • Rental Price Of Two Servers (option that we were considering earlier): INR 47,000 per month
  • Break-even in less than 14 months (assuming the business does not grow and more capacity is not required)
  • If business grows and utilizes the full capacity - then break-even happens in less than 3 months
  • Given the fact that this same setup is deployed in 4 different locations - if business grows in all 4 and utilizes capacity - then break-even happens in less than a month!

More importantly, from a multi-tenant delivery perspective in a SaaS environment, the capacity increase means that we can be more competitive (we can drop prices per transaction / license) - or we can enjoy better margins at the same price points.

Guess we've kept the promise of unshackling the business from technological / infrastructural constraints ... now for the business to deliver on its promises ...!! :-)
