How I go about sizing

Originally posted on 22Jan10 to IBM Developerworks where it got 20,321 Views

Sizing of software components (and therefore also hardware) is a task that I often need to perform. I spend a lot of time on it, so I figured I would share how I go about doing it and what factors I take into account. It is an inexact science. While I talk about sizing Telecom Web Services Server for the most part, the same principles apply to any sizing exercise. Please also note that the numbers stated are examples only and should NOT be used to perform any sizing calculations of your own!

Inevitably, when asked to do a sizing, I am always forced to make assumptions about traffic predictions. I don't like doing it, but it is rare for customers to have really thought through the impact that their traffic estimates/projections will have on the sizing of a solution or its price.

Assumptions are OK

Just as long as you state them – in fact, they could be viewed as a way to wiggle out of any commitment to the sizing should ANY of the assumptions not hold true once the solution has been deployed. Let me give you an example – I have seen RFPs that asked for 500 Transactions Per Second (TPS) but neglected to state anywhere what a transaction actually is. When talking about a product like Telecom Web Services Server, you might assume that the transactions they're talking about are SMS, but in reality they might be talking about MMS or some custom transaction – a factor which would have a very significant effect on the sizing estimate. Almost always, different transaction types will place different loads on systems.

Similarly, it is rare for a WebSphere Process Server opportunity (at a telco anyway) to fully define the processes that they will be implementing and their volumes once that system goes into production. So, what do I do in these cases? My first step is to try to get the customer to clear up the confusion. I often make multiple attempts at explaining to the customer why we need such specific information – it is to their benefit, after all: they're much more likely to get the right-sized system for their needs. This is not always successful, so my next step is to make assumptions to fill in the holes in the customer's information. I am always careful to write those assumptions down and include them with my sizing estimates. At this point, industry experience and thinking about potential use cases really helps to make the assumptions I make reasonable (or I think so anyway 🙂 )

For instance, if a telco has stated that the Parlay X Gateway must be able to service 5,760,000 SMS messages per day, I think it would be reasonable to assume that very close to 100% of those will be sent within a 16 hour window (while people are awake, and to avoid complaints to the telco about SMS messages arriving at all hours of the day – remembering we are talking about applications sending SMS messages, nothing to do with user-to-user SMS). That gets us down to 360,000 (5,760,000/16) SMS per hour, or 100 TPS for SendSMS over SMPP. Now, this is fine as an average number, but I guarantee that the distribution of those messages will not be even, so you have to assume that the peak usage will be somewhat higher than 100 TPS, remembering that we have to size for peak load, not average. How much higher will depend on use cases. If the customer can't give you those, then pick a number that your gut tells you is reasonable – let's say 35% higher than average, which is roughly 135 TPS of SendSMS over SMPP. (I say roughly because if that is your peak load, then as our total for the day is constant (5,760,000), the load must be lower during the non-busy hours. As we are making up numbers here anyway, I wouldn't worry about this discrepancy – erring on the side of over-sizing is the safer option anyway, provided you don't overdo the over-sizing.)
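To make that arithmetic concrete, here is a minimal sketch in Python. The function name and the 35% default peak factor are my own, purely for illustration – they come from the example above, not from any official tool:

```python
def peak_tps(messages_per_day: float, active_hours: int = 16,
             peak_factor: float = 1.35) -> float:
    """Rough busy-hour peak TPS from a daily message volume.

    Assumes traffic is confined to `active_hours` per day and that the
    busiest hour runs `peak_factor` times the average rate.
    """
    average_tps = messages_per_day / (active_hours * 3600)
    return average_tps * peak_factor

# The worked example from the text: 5,760,000 SMS/day in a 16 hour window
print(peak_tps(5_760_000))   # 100 TPS average x 1.35 -> 135.0 TPS peak
```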

Assumptions are your friend

I said I prefer not to make lots of assumptions, but stating stringent assumptions can be your friend if the system does not perform as you predicted and the influencing factors are not exactly as you stated in your assumptions. For instance, if you work on the basis of a 35% increase in load during the busy hour and it turns out to be 200%, your sizing is going to be way off. But because you asked the customer for the increase in load during the busy hour and they did not give you the information, you were forced to make an assumption – they know their business better than we ever could, and if they can't or won't predict such an increase during the busy hour, then we cannot reasonably be expected to predict it accurately either. The assumptions you stated will save your (and IBM's) neck. If you didn't explicitly state your assumptions, you would be leaving yourself open to all sorts of consequences – and not good ones at that.

Understand the hardware that you are deploying to

I saw a sizing estimate the other week that was supposed to handle about 500 TPS of SendSMS over SMPP, but the machine quoted would have been able to handle around 850 TPS of SendSMS over SMPP; I would call that overdoing the over-sizing. This over-estimate happened because the person who did the sizing failed to take into account the differences between the chosen deployment platform and the platform that the TWSS performance team did their testing on.

If you look at the way our Processor Value Unit (PVU) based software licensing works, you will pretty quickly come to the conclusion that not all processors are equal. PVUs are based on the architecture of the CPU – some processors are valued at just 30 PVUs per core (SPARC eight-core CPUs), older Intel CPUs are 50 PVUs per core, while newer ones are 70 PVUs per core. PowerPC chips range from 80 to 120 PVUs per core. Basically, the higher the PVU rating, the more powerful each core on that CPU is deemed to be.

Processors that are rated at higher PVUs per core are more likely to handle more load per core than ones with lower PVU ratings. Unfortunately, PVUs are not granular enough to use as the basis for sizing (remember them, though – we will come back to PVUs later in the discussion). To compare the performance of different hardware, I use RPE2 benchmark scores. IBM's Systems and Technology Group (hardware) keeps track of RPE2 scores for IBM hardware (System p and x at least). Since pricing is done by CPU core, you should also do your sizing estimate by CPU core. For TWSS sizing, I use a spreadsheet from Ivan Heninger (ex WebSphere Software for Telecom performance team lead). Ivan's spreadsheet works on the basis of CPU cores for (very old) HS21 blades. Newer servers/CPUs and PowerPC servers are pretty much all faster than the old clunkers Ivan had for his testing. To bridge the gap between the capabilities of his old test environment and modern hardware, I use RPE2 scores. Since Ivan's spreadsheet delivers a number-of-cores-required result, I break the RPE2 score for the server down to an RPE2 score per core, then use the ratio between the RPE2 score per core for the new server and the test servers to figure out how many cores of the new hardware are required to meet the performance requirement.
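In code, that ratio trick looks something like the sketch below. The RPE2-per-core figures are made up for illustration (they are NOT real benchmark numbers – real RPE2 scores come from STG):

```python
import math

def cores_on_new_hardware(ref_cores: int, ref_rpe2_per_core: float,
                          new_rpe2_per_core: float) -> int:
    """Scale a core count from the reference (tested) server to new hardware
    using the ratio of per-core RPE2 scores, rounding up to whole cores."""
    return math.ceil(ref_cores * ref_rpe2_per_core / new_rpe2_per_core)

# Hypothetical: the spreadsheet calls for 12 cores of the old reference
# blade, and the new server's cores score twice as high per core.
print(cores_on_new_hardware(12, ref_rpe2_per_core=1000,
                            new_rpe2_per_core=2000))   # -> 6
```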

Ok – so now, using the spreadsheet, you key in the TPS required for the various transaction types – let's say 500 TPS of SendSMS over SMPP (just to keep it simple; normally you would also have to take into account Push WAP and MMS messages, not to mention other transaction types such as Location requests, which are not covered by the spreadsheet). That's 12 x 2 cores on Ivan's old clunkers, but on newer hardware such as newer HS21s with 3 GHz CPUs it's 6 x 2 cores, and on JS12 blades it is also 6 x 2 cores. "Oh, that's easy," you say, "the HS21s are only 50 PVUs each – I'll just go with Linux on HS21 blades and that will be the best bang for the buck for the customer." Well, don't forget that Intel no longer makes dual-core CPUs for servers – they're all quad-core – so in the above example you have to buy 8 x 2 cores rather than the 6 x 2 cores for the JS12/JS22 blades.

Note the x 2 after each number: that is because for TWSS in production deployments, you must separate the TWSS Access Gateway and the TWSS Service Platform. The x 2 indicates that the AG and the SP each require that number of cores.

Let's work that through:

Let's first say that TWSS is $850 per PVU.

For the faster HS21s – that's 8 x 2 x 50 x $850 = $680,000 for the TWSS licences alone.
For JS12s – that's 6 x 2 x 80 x $850 = $816,000 for the TWSS licences alone.
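Here is a small sketch of that calculation, using the illustrative $850 per PVU figure from above and rounding core counts up to whole CPUs since you cannot buy a fraction of a chip (the function name is mine):

```python
import math

def tier_licence_cost(cores_needed: int, cores_per_cpu: int,
                      pvu_per_core: int, price_per_pvu: int = 850) -> int:
    """TWSS licence cost for one tier (AG or SP), rounding the core
    count up to whole CPUs."""
    cores_bought = math.ceil(cores_needed / cores_per_cpu) * cores_per_cpu
    return cores_bought * pvu_per_core * price_per_pvu

# Both the Access Gateway and the Service Platform need the cores, hence x 2.
hs21 = 2 * tier_licence_cost(6, cores_per_cpu=4, pvu_per_core=50)  # 6 -> 8 cores
js12 = 2 * tier_licence_cost(6, cores_per_cpu=2, pvu_per_core=80)  # stays at 6
print(f"HS21: ${hs21:,}  JS12: ${js12:,}")   # HS21: $680,000  JS12: $816,000
```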

Also (and all sales people who are pricing this should know this), the prerequisites for TWSS must be licensed separately as well. That means the appropriate number of PVUs of WESB (for the TWSS AG) and the appropriate number of PVUs of WAS ND (for the TWSS SP), as well as the database. It's pretty easy to see how the numbers can add up quickly and how much your sizing estimate can affect the price of the solution.

Database sizing for TWSS

For the database, of course we prefer to use DB2, but in my experience most telcos will demand Oracle. For TWSS, the size of the server is usually not the bottleneck in the environment; what matters for achieving high transaction rates with TWSS is the DB writes and reads per second, which equates to disk input/output. It is VITAL to have an appropriate number of disk spindles in the database disk array to achieve the required throughput – the spreadsheet will give you the number of disk drives that need to be in the striped array to achieve it. For the above 500 TPS example, it is 14.6 disks = 15 disks, since you can't buy only part of a disk. While striping (RAID 0) will give you throughput across your disk array, if one drive fails, you're sunk. To achieve protection, you must go with RAID 1+0 (sometimes called RAID 10), which gives you both striping (RAID 0) and mirroring (RAID 1). RAID 1+0 immediately doubles your disk count, so we're up to 30 disks in the array. Our friends at STG should be able to advise on the most suitable disk array unit to go with. In terms of CPU for the database server, as I said, it does not get heavily loaded. The spreadsheet indicates that 70.7% of the reference HS21 (Ivan's clunker) would be suitable, so a single-CPU JS12 or HS21 blade – even an old one – would do.

Every time I do a TWSS sizing, I get asked how much capacity we need in the RAID 1+0 disk array – despite my always asking for the smallest disks possible. Remember, we are going for a (potentially) large array to get throughput, not storage space. In reality, I would expect a single 32 GB HDD could easily handle the size requirements for the database, so space is not an issue at all when we have 30 disks in our array. To answer the question about what size: the smallest possible, since that will also be the cheapest possible, provided it does not compromise the seek and data transfer rates of the drive. In the hypothetical 30-drive array, if we select the smallest drive now available (136 GB), we would have a massive 2 TB or so of usable space (15 x 136 GB – the mirror halves hold copies, not extra data), which is way over what we need in terms of space, but it is the only way we can currently get the throughput needed for the disk I/O on our database server. Exactly the same principles apply regardless of whether DB2 or Oracle is used for the database.
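A quick sketch of the disk arithmetic, using the 14.6-spindle figure from the spreadsheet example above (the function name is mine, not the spreadsheet's):

```python
import math

def raid10_array(stripe_spindles_for_io: float, drive_size_gb: int):
    """Disks needed for a RAID 1+0 array sized for throughput, not space.

    `stripe_spindles_for_io` is the (possibly fractional) stripe-set disk
    count the sizing spreadsheet produces."""
    stripe_disks = math.ceil(stripe_spindles_for_io)  # can't buy part of a disk
    total_disks = stripe_disks * 2                    # mirroring doubles the count
    usable_gb = stripe_disks * drive_size_gb          # mirror copies add no space
    return total_disks, usable_gb

print(raid10_array(14.6, 136))   # -> (30, 2040): 30 disks, ~2 TB usable
```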

Something that I have yet to see empirical data on is how Solid State Drives (SSDs), with their higher I/O rates, would perform in a RAID 1+0 array. In such an I/O-intensive application, I suspect they would allow us to drop the total number of 'disks' in the array quite significantly, but I don't have any real data to back that up or to size an array of SSDs.

We have also considered using an in-memory database such as SolidDB, either as the working database or as a 'cache' in front of DB2, but the problem there is that the level of SQL supported by SolidDB is not the same as that supported by DB2 or Oracle's conventional database. Porting the TWSS code to use SolidDB would require a significant investment in development.

Remember: Sizing estimates must always be multiples of the number of cores per CPU

Make sure you have enough overhead built into your calculations for other processes that may be using CPU cycles on your servers. I assume that the TWSS processes will only ever use a maximum of 50% of the CPU – that leaves the other 50% for other tasks and processes that may be running on the system. I always state that with my assumptions as well (there is a small sketch of the headroom arithmetic after the assumption list below). As an example, I would say:

To achieve 500 TPS (peak) of SendSMS over SMPP at 50% CPU utilisation, you will need 960 PVUs of TWSS on JS12 (BladeCenter JS12 P6 4.0GHz-4MB (1ch/2co)) blades or 800 PVUs of TWSS on HS21 (BladeCenter HS21 XM Xeon L5430 Quad Core 2.66GHz (1ch/4co)) blades. I would then list the assumptions I had made to get to the 500 TPS figure, such as:

  • There is no allowance made for Push WAP or MMS in the sizing estimate.
  • 500 TPS is the peak load and not an average load
  • SMSC has a SMPP interface available
  • All application driven SMS traffic will be during a 16 hour window
  • etc
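
Here is the sketch of the headroom arithmetic promised above. The three-core input is purely illustrative – the point is that a 50% utilisation ceiling doubles the raw core requirement before rounding up to whole CPUs:

```python
import math

def cores_with_headroom(cores_at_full_load: float,
                        target_utilisation: float = 0.5,
                        cores_per_cpu: int = 2) -> int:
    """Core count after applying a CPU-utilisation ceiling,
    rounded up to whole CPUs."""
    raw_cores = cores_at_full_load / target_utilisation
    return math.ceil(raw_cores / cores_per_cpu) * cores_per_cpu

# Hypothetical: 3 cores running at 100% busy becomes 6 cores
# once we insist they never exceed 50% utilisation.
print(cores_with_headroom(3, 0.5, 2))   # -> 6
```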

What about High Availability?

Well, I think that High Availability (HA) is probably a topic in its own right, but it does have a significant effect on sizing, so I will talk about it in that regard. HA is generally specified in nines – by that I mean that if a customer asks for "five nines", they mean 99.999% availability per annum (that's about 5.2 minutes per year of unplanned downtime). Three nines (99.9% available) or even two nines (99%) are also sometimes asked for. Often, customers will ask for five nines without realising the significant impact such a requirement has on the software, hardware and services sizing. If we start adding additional nodes into clusters for server components, that will not only improve the availability of the component, it will also increase its transaction capacity – and the price. The trick is to find the right balance between hardware sizing and HA requirements. For example: say a customer wanted 400 TPS of Transaction X, but also wanted HA. Let's assume a single JS22 (2 x dual-core PowerPC) blade can handle the 400 TPS requirement. We could go with JS22 blades and just add more to the cluster to build up the availability and remove single points of failure. As soon as we do that, we are also increasing the license cost and the actual capacity of the component, so with three nodes in the cluster we would have 1200 TPS capability and three times the price of what the customer actually needs, just to get HA. If instead we use JS12 blades (1 x dual-core PowerPC), which have half the computing power of a JS22, we could have three JS12s in a cluster, achieve 3 x 200 (say) TPS = 600 TPS, and still meet the 400 TPS requirement even if a single node in the cluster is down. With JS12s, we meet the performance requirement and have the same level of HA as with 3 x JS22s, but the licensing price will be half that of the JS22-based solution (at 1.5 x the single-JS22 option).
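A rough sketch of the N+1 arithmetic behind that example. This is my own simplification: it assumes you know each node's TPS capability and that surviving a single node failure is the HA target:

```python
import math

def nplus1_nodes(required_tps: float, tps_per_node: float) -> int:
    """Smallest cluster that still meets required_tps with one node down."""
    return math.ceil(required_tps / tps_per_node) + 1

# Figures from the 400 TPS example above.
js22 = nplus1_nodes(400, 400)   # -> 2 JS22 blades = 8 licensed cores
js12 = nplus1_nodes(400, 200)   # -> 3 JS12 blades = 6 licensed cores
print(js22, js12)
```

The smaller blades meet the same requirement with fewer licensed cores, which is the whole point of fiddling with the deployment hardware.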

I guess the point I am trying to get across is to think about your options and consider if there are ways to fiddle with the deployment hardware to get the most appropriate sizing for the customer and their requirements. The whole thing just requires a bit of thinking…

What other tools are available for sizing?

IBMers have a range of tools available to help with sizing – the TWSS spreadsheet I was talking about earlier, various online tools and of course Techline. Techline is also available to our IBM Business Partners via the Partnerworld web site (you need to be a registered Business Partner to access the Techline pages on the Partnerworld site). For more mainstream products such as WAS, WPS, Portal etc., Techline is the team to help Business Partners – they have questionnaires that they use to gather all the parameters they need to do the sizing. Techline is the initial contact point for sizing support. For more specialised product support (like TWSS and the other WebSphere Software for Telecom products) you may need to contact your local IBM team for help. If you are a partner, feel free to contact me directly for assistance with sizing WsT products.

There is an IBM class for IT Architects called 'Architecting for Performance' – don't let the title put you off, others can do it – I did, and I am neither an architect (I am a specialist) nor from IBM Global Services (although everyone else in the class was!). If you get the opportunity to attend the class, I recommend it – you work through plenty of exercises, and while you don't do any component sizing, you do some whole-system sizing, which is a similar process. I am not sure if the class is open to Business Partners; if it is, I would also encourage architects and specialists from our BPs to do the class as well. Let me take that on as a task – I will see if it is available externally and report back.

Sizing estimation is not an exact science

:-)

As I glance back over this post, I guess I have been rambling a bit, but hopefully you now understand some of the factors involved in doing a sizing estimate. The introduction of assumptions and other factors beyond your knowledge and control makes sizing inexact – it will always be an estimate, and you cannot guarantee its accuracy. That is something you should also state along with your assumptions.