In this two-part series, we will focus on Resource Sharing: as it pertains to the software/system architect and as it pertains to the infrastructure provider. In other words: from both the supply and demand perspectives.
Resource sharing is an interesting concept. It is nothing new, but it is interesting nonetheless. The idea is that a number of users or applications share a given resource. If you think about it, you are doing that today with any computing device (well, with the exception of the iPhone). Today, you run multiple applications on the same computer and switch back and forth between these applications. So, what’s the difference? The difference between what the operating system does at the processor level and what a scheduler does at the node level is “Conflict of Interest”.
If you run multiple applications on your PC, the total utility that you receive from your experience stays the same. What I am trying to say is that you control your own destiny; this is true when you switch back and forth between applications. When you want your word processor to run, you switch to your word processor; when you want your browser to run, you switch to your browser, and so on.
When you run your application in a shared environment, you have very little control over what the resource manager does. In fact, you shouldn’t, as the resource manager is supposed to be an unbiased middleman trying to allow fair-share across all users. Well, unlike what you learned when you were little, sharing is not really caring!
Conflict of Interest
When two users want access to a given resource, each user wants his/her job to take priority, and frankly does not care about the other user. It is the job of the resource manager to ensure proper sharing of resources. Resource managers work in a very simple manner:
- Resources log on to the resource manager
- Basic resource information is sent to the resource manager, such as OS type, amount of free memory, number of CPUs, and a number of other parameters that depend on the Resource Manager involved
- The resource goes into a waiting queue, ready to be assigned a task
- The Resource Manager updates the table of available resources
- The Resource Manager assigns a task to the resource if and when a new task is available that suits that resource
- The resource gets the task and the data and executes the task
- The task result is sent back to the client
- The resource is ready for another task
Figure 1: Resource Manager Flow.
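The flow above can be sketched in a few lines of Python. This is only a minimal, synchronous illustration under assumptions of my own: the names `Resource`, `Task`, `ResourceManager`, `register`, `submit` and `dispatch` are hypothetical, and "suits that resource" is reduced to matching OS type and free memory. A real resource manager tracks many more parameters and runs tasks asynchronously.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    os_type: str
    free_memory_mb: int
    cpus: int


@dataclass
class Task:
    name: str
    os_type: str
    min_memory_mb: int


class ResourceManager:
    """Hypothetical resource manager following the steps in Figure 1."""

    def __init__(self):
        self.available = deque()  # resources waiting to be assigned a task
        self.pending = deque()    # tasks waiting for a suitable resource

    def register(self, resource):
        # The resource logs on, reports its basic information, and waits.
        self.available.append(resource)

    def submit(self, task):
        self.pending.append(task)

    def dispatch(self):
        """Assign each pending task to the first resource that suits it."""
        results = []
        still_pending = deque()
        while self.pending:
            task = self.pending.popleft()
            match = next(
                (r for r in self.available
                 if r.os_type == task.os_type
                 and r.free_memory_mb >= task.min_memory_mb),
                None,
            )
            if match is None:
                still_pending.append(task)  # nothing suitable yet, keep waiting
                continue
            self.available.remove(match)    # resource takes the task
            results.append(f"{task.name} ran on {match.name}")  # result goes back
            self.available.append(match)    # resource is ready for another task
        self.pending = still_pending
        return results
```

A task that no registered resource can satisfy simply stays in `pending`, which is the starting point for the queueing problem discussed next.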
As the ratio of tasks to resources increases, each task has to wait longer to get a time slot on a resource. This is resource contention at its best, and eventually a bottleneck arises: tasks queue up in the Resource Queue and fill it. In messaging and middleware, we call this the slow consumer problem.
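To see how quickly the backlog builds, here is a rough sketch with made-up numbers. The function name `queue_depth_over_time` and its parameters are illustrative assumptions, and the model is deliberately crude: a fixed number of tasks arrives every tick and every resource finishes a fixed number of tasks per tick.

```python
def queue_depth_over_time(arrivals_per_tick, resources,
                          tasks_per_resource_per_tick, ticks):
    """Return the number of tasks left waiting at the end of each tick."""
    depth = 0
    history = []
    capacity = resources * tasks_per_resource_per_tick
    for _ in range(ticks):
        depth += arrivals_per_tick      # new tasks join the queue
        depth -= min(depth, capacity)   # resources drain what they can
        history.append(depth)
    return history


# 12 tasks arrive per tick, but 10 resources complete only 10 per tick.
print(queue_depth_over_time(12, 10, 1, 5))  # [2, 4, 6, 8, 10]
# With 8 arrivals per tick the same pool keeps up.
print(queue_depth_over_time(8, 10, 1, 5))   # [0, 0, 0, 0, 0]
```

Once arrivals exceed capacity, the queue grows without bound; only adding resources or reducing the load reverses the trend.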
You can add more resources to your infrastructure and alleviate this problem, and that is how the cycle goes:
- Add resources
- Increase the number of users/jobs
- Evaluate and, if needed, go back to the first step
As an architect, you have two choices: embrace this methodology or stay away from it. To many, this scenario is a nightmare in that there is no predictability or accountability. To some, this scenario is perfectly fine in that they gain access to more resources than they thought possible. Both camps are correct! Based on your requirements, budget and SLA demands, you would prefer one over the other.
Imagine a scenario where a number of users are dedicating their resources to a pool. If you have a budget for 10 servers and so do your three peers, you end up with a pool of 40 servers. This is great for you for so many reasons:
- Larger pool of resources to potentially have access to
- Implicit High Availability (HA) of your resources
- No over provisioning required!
This is all due to the fact that you now have access to 40 servers as opposed to 10. Yes, it is true that they are not all yours, and we will get into that. If we assume a proper fairshare of resources, then you can assume that you will get what you require when you require it. The scenario in which this is not true is when the workloads of two or more users coincide in a way that surpasses the total number of available servers. Let me clarify:
Case 1: At 9:30 AM, user A, user B and user C each have to run a job that requires at least 15 servers to be completed on time. That is 45 servers in total, so with 40 servers we will not be able to meet our SLA.
Case 2: At 12:30 PM, user B and user D each have a job that requires 18 servers to be completed on time. That is 36 servers in total, so with 40 servers we will be able to meet and beat our SLA requirement.
For both of the aforementioned scenarios, if each user had its own dedicated pool of 10 servers, it would not have been able to meet its SLA. You can see how sharing helps in case 2 and allows our users to meet an otherwise impossible deadline. You can also see why the architect for user D might look at case 1 and conclude that sharing is a bad idea: he is giving up his resources to someone else and would be unable to handle a “what-if” scenario of his own should it arise.
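The arithmetic behind the two cases is simple enough to check in code. The sketch below assumes that the server counts in each case are per user (15 each in case 1, 18 each in case 2); the helper name `can_meet_sla` is my own.

```python
POOL_SIZE = 40       # four users contributing 10 servers each
DEDICATED_SIZE = 10  # what each user would own without sharing


def can_meet_sla(demands, capacity):
    """True if the combined demand of all active users fits in the capacity."""
    return sum(demands.values()) <= capacity


case_1 = {"A": 15, "B": 15, "C": 15}  # 9:30 AM, 45 servers in total
case_2 = {"B": 18, "D": 18}           # 12:30 PM, 36 servers in total

print(can_meet_sla(case_1, POOL_SIZE))  # False: 45 > 40
print(can_meet_sla(case_2, POOL_SIZE))  # True: 36 <= 40

# With dedicated pools, every user is checked against only 10 servers.
print(all(need <= DEDICATED_SIZE for need in case_1.values()))  # False
print(all(need <= DEDICATED_SIZE for need in case_2.values()))  # False
```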
Don’t Have What-Ifs
Designing for the what-ifs is the most difficult part of any architect’s job. Most such scenarios revolve around failure and disaster situations, but there are times when a normal, non-crisis situation could become a show-stopper if it is not properly planned for.
Let’s go back to our first case, where one of the users had a valid concern about not being able to run a job if anything out of the ordinary happens. This is the case when a job comes in at an unexpected time, when resources are not available because they are being used by some other user. There are a number of ways to solve this otherwise impossible situation:
- Throw money at the problem: buy more hardware. Do not over-provision, but be able to handle what-if scenarios if they arise. This option might seem very simplistic, but if you have more what-if scenarios than you would like to admit, maybe you didn’t provision properly to begin with. We will focus on this in part II of this article.
- Leave it alone: your resource manager must ensure fairshare. Remember that at its core, we are talking about a conflict of interest between users, and we require an objective third party to mediate. So, let it mediate! That’s what schedulers do, or at least should do! Case 1 is not the best example here because even if we do not consider user D, we will still not meet our SLA. In scenarios where resources are tight, and it is not a case of under-provisioning, a scheduler should mediate and ensure fairshare access to the available resources. It should ensure that no one takes over the environment and that all users get to run their jobs.
- Break up the environment into silos. This is my least favorite as it goes against the spirit of what we have been talking about all along, but it is certainly something to consider. User D might have a job profile that is too unpredictable, and sharing out its resources might have side effects that are simply not acceptable. Under such scenarios, we need to reconsider and possibly take User D out of the shared environment and give it its own dedicated environment.
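For the "leave it alone" option, it helps to see what a fairshare split could look like when demand exceeds supply. The sketch below uses max-min fairness, which is one common policy and not necessarily what any particular scheduler implements. The function name `fair_share` is hypothetical, servers are handed out as whole units, and leftovers go to users in name order purely to keep the example deterministic.

```python
def fair_share(demands, capacity):
    """Max-min fair split of whole servers among users with given demands."""
    allocation = {user: 0 for user in demands}
    unsatisfied = {user for user, need in demands.items() if need > 0}
    remaining = capacity
    while unsatisfied and remaining > 0:
        share = remaining // len(unsatisfied)
        if share == 0:
            # Fewer servers left than waiting users: one each, in name order.
            for user in sorted(unsatisfied)[:remaining]:
                allocation[user] += 1
            break
        for user in sorted(unsatisfied):
            # Never give a user more than it asked for.
            grant = min(share, demands[user] - allocation[user])
            allocation[user] += grant
            remaining -= grant
        unsatisfied = {u for u in unsatisfied if allocation[u] < demands[u]}
    return allocation


# Case 1: three users each want 15 of the 40 servers.
print(fair_share({"A": 15, "B": 15, "C": 15}, 40))
# {'A': 14, 'B': 13, 'C': 13}

# A small job is fully served; the large jobs split what is left.
print(fair_share({"A": 5, "B": 30, "C": 30}, 40))
# {'A': 5, 'B': 18, 'C': 17}
```

Nobody takes over the environment and everybody gets to run, but in case 1 nobody gets the 15 servers needed to finish on time either. Mediation divides a shortage fairly; it does not remove it.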
We started to talk about a very complicated topic that haunts every architect and system provider. I tackled the problem from an architect’s point of view, and will focus on the system provider in the next article. What we need to keep in mind is that resource sharing might not be for everyone, and that resources should be shared only between users with complementary workloads. There is no point in sharing a resource between two users whose workloads both peak at 10 PM every night. On the other hand, when you do find scenarios to share, you should embrace them. Sharing saves money and simplifies what-if scenarios for architects.
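One rough way to test whether two workloads are complementary is to add up their hourly demand and compare the combined peak with the pool size. The sketch below is illustrative only: the profiles are invented, and `profile`, `combined_peak` and `is_complementary` are hypothetical helper names.

```python
def profile(peak_hour, peak, baseline=2):
    """24 hourly server demands: a baseline with a single one-hour peak."""
    return [peak if hour == peak_hour else baseline for hour in range(24)]


def combined_peak(profiles):
    """Highest total demand across all users in any single hour."""
    return max(sum(hour) for hour in zip(*profiles.values()))


def is_complementary(profiles, pool_size):
    return combined_peak(profiles) <= pool_size


night_a = profile(peak_hour=22, peak=15)   # peaks at 10 PM
night_b = profile(peak_hour=22, peak=15)   # also peaks at 10 PM
morning = profile(peak_hour=9, peak=15)    # peaks at 9 AM

# Two users contributing 10 servers each share a pool of 20.
print(is_complementary({"A": night_a, "B": night_b}, 20))  # False: 30 at 10 PM
print(is_complementary({"A": night_a, "C": morning}, 20))  # True: peak is 17
```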