Crunching data or having access to vast amounts of data is crucial for many applications, particularly batch jobs which have the tendency and are suitable to run in cloud environments. We have the tendency to collect, and that is even more so when it comes to digital data. Digital data does not take much space with the constant reduction of disk sizes, and not to mention the minuscule cost of saving that extra megabyte. The thing that has not changed is the speed at which we can access data on conventional disks and lack of liable alternative solutions to this problem.
There is no one culprit that we can point a finger to and say "this is your fault". The fact of the matter is that we are where we are due to the natural evolution of things, and our desire to have a broader picture of what we are interested in. If we are simulating the weather, we are interested in more accurate forecast; if we are modeling a circuit board, we are interested in simulating the entire board instead of just part of it; etc.
This desire combined with the physical limitations of speed of light, electricity, space, thermal and aerodynamics of disks puts a damper on our ability to scale and speed-up indefinitely. These are limitations outside our control, but there are many limitations within our control domain that add to the long list of constraints and limitations: inexperienced developers, imperfect infrastructure, never ending feature list request and requirements, deadlines, bugs, management, paper work, etc.
Whatever the reason may be, there is one outcome that is the focus of our discussion and that is managing large amounts of data for an application that requires access to said data, and does so in a manner that is aligned with the users' desire for a certain quality of service (QoS).
Before we delve into how the aforementioned scenario can impact business and essentially ROI, let's take a moment to describe the how a typical scenario plays out in the real world. Figure 1 depicts a very typical and very often used scenario. A user reads data off of a disk, sends it to be processed and gets the results back. It then sends the results over the network again to be processed and gets results back. This pattern continues.
Your first response after looking at figure 1 is that there is no way this can be true. I am afraid that you are wrong, and in fact this pattern is more common than probably any other configuration than you can imagine.
Figure 1: Typical Data Management on Cloud
There are many reasons why that is the case, and many of them valid, and some simply due to the fact that the application was setup this way to begin with and the developers did not see the need to change anything once Cloud was introduced. Other reason could be security; you may not be allowed to keep your results on the cloud for the next job to take, and you are thus forced to bring back your results.
In short, you are forced to read the data from disk multiple times, send a large amount of data over a public network multiple times, and write scratch data on disk multiple times.
As you can imagine, the effect is slowness of the application, underutilization of the resources, along with network and I/O saturation. Overall performance is affected in such a way that it is very difficult to scale up; i.e. adding more compute resources, increasing the data size that need to be crunched, faster response to user's request.
The long term effect is over-provisioning of the network and the resources. This will drive up the cost, and lowers the efficiency of the infrastructure.
The short answer is that there may not be any solutions to speak of. You may be pushing the limits of what is available today and other than proper scheduling, there isn't much else to consider. Scheduling in this case is simply resource reservation, and having dependencies pipelined in a fashion that the wait time is reduced. However, there may be other reasons that force us to be in this jam: security, legacy application which lack resources to refactor, time value of money in that refactoring time is outweighed by the ability to simply throwing money at the problem by buying better hardware.
There are things that you can do, however, that can potentially reduce some of the overhead and allow us to get some efficiency back in your environment. The simplest way to get some "quick" improvement is to save scratch data right in the cloud as shown in Figure 2.
This may sound very simple, but in reality is very difficult. In the simple scenario that I have depicted where there are only two jobs and only two places to consider, this problem is very simple and very much solvable. In the case where you have many computers (Figure 3), the problem of where the result needs to be next is very much an unsolvable and intractable problem.
Figure 2: Reducing the Amount of Data Transmitted
The easiest way to get around said problem is to make the data available in some fashion to all the nodes either thru shared filesystem, or a distributed data caching mechanism where in both cases the data appear local, but in fact it is sitting somewhere across the infrastructure. Caching has its own issues and challenges, but we will get to that in the next article.
Figure 3: Challenges with Shared Data
All and all, there is no right answer, and certainly never an easy one. Most times, the scenario that I depicted in the beginning of the article wins out simply due to the fact that it is how the application works today and why fix something that is not really broken?!
We will delve into the data issues and complexities in the coming articles as we just scratched the surface in this article.