Smarter homes and Smarter Telcos, what’s the link?

Originally posted on 29Jun10 to IBM Developerworks (10,979 Views)

I was looking at where some of the traffic for this blog comes from this morning. Someone had used Google to search for “ibm sdp cloud”, which I am glad to say yielded this blog as the third and fourth results. Above Telco Talk in the results was a 2005 post from fellow MyDeveloperworks blogger Bobby Woolf, What is in RAD 6.0. That is interesting because his post wasn’t about Service Delivery Platforms, and the term “SDP” is only mentioned in the comments, yet it rated higher in Google’s index than my posts, which have been about cloud, SDPs or both! That’s another conversation though…

The thing that really caught my attention was a new whitepaper from IBM on Smarter Homes, an area that has interested me for a few years now. The whitepaper, “The IBM vision of a smarter home enabled by cloud technology”, talks about some of the concepts I have seen coming over the past few years, but it also introduces cloud-based service providers as the key enabler outside the home, the piece that will allow smarter homes to deliver on their lofty promises. The introduction states:

A common services delivery platform based on industry standards supports cooperative interconnection and creation of new services. Implementation inside the cloud delivers quick development of services at lower cost, with shorter time to market, facilitating rapid experimentation and improvement. The emergence of cloud computing, Web services and service-oriented architecture (SOA), together with new standards, is the key that will open up the field for the new smarter home services.

Excerpt from “The IBM vision of a smarter home enabled by cloud technology”

The dependence on external networks (from our homes) and external Communications Service Providers presents an opportunity for them to provide much more than just the pipe to the house. This is an area that some Telcos are already trying to tap into. Here in Australia, Telstra have recently introduced a home-based smart device called the T-Hub, which is intended to arrest some of the decline in homes installing or keeping landline phones (in Australia, more and more homes are buying a naked DSL or Hybrid Fibre Coax (HFC) service for Internet access and using mobile phones for voice calls, with no home phone service at all). I recently cancelled my Telstra home phone service, so I cannot buy one of the T-Hubs, and apparently it won’t work with my home phone service via my HFC connection anyway. It is an intriguing idea though. I find myself wondering if Telstra’s toe in the Smarter Home pond is too little, too late. For years, Telstra’s Innovation Centres (one in Melbourne and one in Sydney) had standing demonstrations of smarter home technology (I think the previous Telstra CEO, Sol Trujillo, closed them down). I even helped to install a Smarter Healthcare demo at the Sydney Telstra Innovation Centre a few years ago (more on that later), and their demos were every bit as good as the ones IBM has at the Austin (Texas, USA) and La Gaude (France) Telecom Solutions Labs.

Further into the whitepaper, when talking about cloud-based Service Delivery Platforms (p. 10), there is a nice summary of why a Telco would consider a cloud deployment of their SDP:

An SDP in the cloud supports the expansion of the services scope by enabling new services in existing markets and by expanding existing services into new markets with minimum risk. By exposing standard service interfaces in the network, it enables third parties to integrate their services quickly, or to build new services based on the service components provided in the SDP. This creates the opportunity for new business models, for instance, for media distribution and advertising throughout multiple delivery scenarios.

I think this illustrates what all Telcos should be thinking about: the agility needed to compete in today’s marketplace. Cloud is one way to enhance that agility, and it also adds elasticity, the ability to grow and shrink as market demand grows and shrinks. Sorry for rambling a bit there… some semi-random thoughts kept popping up when talking about Smarter Homes and Telcos. Anyway, I would encourage you to have a read of the whitepaper for yourself. It’s available via SlideShare:


Disclaimer: I own a small number of shares in Telstra Corp.

Impact 2010 – Orange France: Decreasing the development time for Telco apps

Originally posted on 05May10 to IBM Developerworks (8,995 Views)

Orange in France are using WebSphere sMash to provide an easy development environment in which PHP and Groovy developers can build Telco-enabled applications that consume Orange Application Programming Interfaces (APIs) exposed through pre-built widgets. The custom Orange API is not compliant with either OneAPI or Parlay X, and I would not normally endorse a custom API like this, but time-to-market pressures meant that Orange had to move before the (OneAPI) standards were in place. What I would take from their experience in France is their model and use cases, all of which could now be done with standard APIs. Interestingly, I think Orange could also use IBM Mashup Center to support developers with even fewer skills than the PHP and Groovy developers they’re currently targeting.

http://orange-innovation.tv/webtv/getVideo.php?id=1040

Quality, Speed, Price: Pick two

Originally posted on 02Feb10 to IBM Developerworks where it got 15,259 Views

On the Wednesday of the week before last (the week before my leave), at about 1am my time, I got an urgent request for an RFI response to be presented back to the customer at Friday noon (GMT+8, which is 3pm for me: 2.5 business days for the locals in that timezone).  This RFI asked lots of hypothetical questions about what this particular telco might do with their Service Delivery Platform (SDP).  It had plenty of requirements like “Email service” or “App Store Service” and so on.  These ‘use cases’ made up 25% of the overall score, but had no more detail than I have quoted here.  Two to four words for each use case.  Crazy!  If I am responding to this, such loose scope means I can interpret the use cases any way that I want.  It also means that to meet all 14 use cases, ranging from ‘Instant Messaging and Presence Service (IMPS)’ to ‘Media Content and Management Service’ to ‘Next-Generation Network Convergence innovative services’, the proposal and the system would have to be a monster with lots of components.  The real problem with such vague requirements is that vendors will answer the way they think the customer wants them to, rather than the customer telling them what they want to see in the response.  The result will be six or eight responses that vary so much that they cannot be compared, which defeats the whole point of running an RFI process: to compare vendors and ultimately select one to award the project to.

On top of the poor quality of the RFI itself, the lack of time to respond creates great difficulties for the people responding.  ‘So what, I don’t care, it’s their job’ you might expect them to say, and to an extent you would be correct, but think about it like this: a short timeframe to respond means the vendor has to find whoever they can internally to respond; they don’t have time to find the best person. A short timeframe means the customer is more likely to get a cookie-cutter solution (one the vendor has done before) rather than a solution designed to meet their actual needs. A short timeframe means the vendor may not have enough time to do a proper risk assessment and quality assurance on the proposal, both of which will increase the cost quoted in the proposal.

All of these factors should be of interest to the Telco that is asking for the proposal, because they all have a direct effect on the quality and price of the responses and ultimately the success of the project.

I know this problem is not unique to the Telecom industry, but of all the industries I have worked with in my IT career, the Telcos seem to do it more often.  I could go on and on quoting examples of ultra short lead times to write proposals – sometimes as little as 24 hours (to answer 600 questions in that case), but all it would do is get me riled up thinking about them.

The whole subject reminds me of what my boss in a photolab (long before my IT career began) used to say: “Quality, Speed, Price: Pick two”.  Think about it; it rings true, doesn’t it?

How I go about sizing

Originally posted on 22Jan10 to IBM Developerworks where it got 20,321 Views

Sizing of software components (and therefore also hardware) is a task that I often need to perform.  I spend a lot of time on it, so I figured I would share how I go about doing it and what factors I take into account.  It is an inexact science.  While I talk about sizing Telecom Web Services Server (TWSS) for the most part, the same principles apply to any sizing exercise.  Please also note that the numbers stated are examples only and should NOT be used to perform any sizing calculations of your own!

Inevitably, when asked to do a sizing, I am forced to make assumptions about traffic predictions. I don’t like doing it, but it is rare for customers to have really thought through the impact that their traffic estimates/projections will have on the sizing of a solution or its price.

Assumptions are OK

Just as long as you state them. In fact, they can be viewed as a way to wiggle out of any commitment to the sizing should ANY of the assumptions not hold true once the solution has been deployed. Let me give you an example: I have seen RFPs that asked for 500 Transactions Per Second (TPS) but neglected to state anywhere what a transaction actually is. When talking about a product like Telecom Web Services Server, you might assume that the transactions are SMS, but they might really mean MMS or some custom transaction, a difference which would have a very significant effect on the sizing estimate. Almost always, different transaction types will place different loads on systems.

Similarly, it is rare for a WebSphere Process Server opportunity (at a Telco anyway) to fully define the processes that will be implemented and their volumes once the system goes into production. So, what do I do in these cases? My first step is to try to get the customer to clear up the confusion. I often make multiple attempts at explaining to the customer why we need such specific information; it is to their benefit, after all, since they are much more likely to get the right-sized system for their needs. This is not always successful, so my next step is to make assumptions to fill the holes in the customer’s information. I am always careful to write those assumptions down and include them with my sizing estimates. At this point, industry experience and thinking about potential use cases really helps to make the assumptions reasonable (or so I think, anyway 🙂)

For instance, if a telco has stated that the Parlay X Gateway must be able to service 5,760,000 SMS messages per day, I think it is reasonable to assume that very close to 100% of those will be sent within a 16-hour window (while people are awake, and to avoid complaints to the telco about SMS messages arriving at all hours; remember we are talking about applications sending SMS messages, nothing to do with user-to-user SMS). That gets us down to 360,000 (5,760,000/16) SMS per hour, or 100 TPS for SendSMS over SMPP. Now, this is fine as an average, but I guarantee that the distribution of those messages will not be even, so you have to assume that the peak usage will be somewhat higher than 100 TPS, remembering that we have to size for peak load, not average. How much higher will depend on the use cases. If the customer can’t give you those, then pick a number that your gut tells you is reasonable; let’s say 35% higher than average, which is roughly 135 TPS of SendSMS over SMPP. I say roughly because if that is your peak load, then as the total for the day is constant (5,760,000), the load must be lower during the non-busy hours. As we are making up numbers here anyway, I wouldn’t worry about this discrepancy, and erring on the side of oversizing is the safer option anyway, provided you don’t overdo the oversizing.
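
For what it’s worth, here is that arithmetic as a quick Python sketch. The daily volume, the 16-hour window and the 35% peak factor are the illustrative assumptions from above, not real sizing inputs:

```python
# Illustrative numbers from the example above -- not real sizing inputs.
sms_per_day = 5_760_000   # daily application-driven SMS volume (assumed)
busy_window_hours = 16    # assume ~100% of traffic falls in a 16-hour window
peak_factor = 1.35        # assumed peak-to-average ratio -- ask the customer first!

average_tps = sms_per_day / (busy_window_hours * 3600)  # 100.0 TPS
peak_tps = round(average_tps * peak_factor)             # ~135 TPS

print(f"Average: {average_tps:.0f} TPS; size for peak: {peak_tps} TPS")
```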

Assumptions are your friend

I said I prefer not to make lots of assumptions, but stating stringent assumptions can be your friend if the system does not perform as you predicted and the influencing factors are not exactly as you stated in your assumptions. For instance, if you work on the basis of a 35% increase in load during the busy hour and it turns out to be 200%, your sizing is going to be way off. But because you asked the customer for the increase in load during the busy hour and they did not give you the information, you were forced to make an assumption. They know their business better than we ever could, and if they can’t or won’t predict such an increase during the busy hour, then we cannot reasonably be expected to predict it accurately either; the assumptions you stated will save your (and IBM’s) neck. If you didn’t explicitly state your assumptions, you would be leaving yourself open to all sorts of consequences, and not good ones at that.

Understand the hardware that you are deploying to

I saw a sizing estimate the other week that was supposed to handle about 500 TPS of SendSMS over SMPP, but the machine quoted would have been able to handle around 850 TPS; I would call that overdoing the oversizing. This overestimate happened because the person who did the sizing failed to take into account the differences between the chosen deployment platform and the platform that the TWSS performance team did their testing on.

If you look at the way our Processor Value Unit (PVU) based software licensing works, you will pretty quickly come to the conclusion that not all processors are equal. PVUs are based on the architecture of the CPU: some processors are valued at just 30 PVUs per core (SPARC eight-core CPUs), older Intel CPUs are 50 PVUs per core, while newer ones are 70 PVUs per core. PowerPC chips range from 80 to 120 PVUs per core. Basically, the higher the PVU rating, the more powerful each core on that CPU.

Processors that are rated at higher PVUs per core are more likely to be able to handle more load per core than ones with lower PVU ratings. Unfortunately, PVUs are not granular enough to use as the basis for sizing (remember them, though; we will come back to PVUs later in the discussion). To compare the performance of different hardware, I use RPE2 benchmark scores. IBM’s Systems and Technology Group (Hardware) keeps track of RPE2 scores for IBM hardware (System p and x at least). Since pricing is done by CPU core, you should also do your sizing estimate by CPU core. For TWSS sizing, I use a spreadsheet from Ivan Heninger (ex WebSphere Software for Telecom performance team lead). Ivan’s spreadsheet works on the basis of CPU cores for (very old) HS21 blades. Newer servers/CPUs and PowerPC servers are pretty much all faster than the old clunkers Ivan had for his testing. To bridge the gap between the capabilities of his old test environment and modern hardware, I use RPE2 scores. Since Ivan’s spreadsheet delivers a number-of-cores-required result, I break the RPE2 score for the new server down to an RPE2 score per core, then use the ratio between the per-core RPE2 scores of the new server and the test servers to figure out how many cores of the new hardware are required to meet the performance requirement.
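
As a minimal sketch of that ratio method (the per-core RPE2 figures below are made-up placeholders, not published benchmark results):

```python
import math

# Hypothetical per-core RPE2 scores -- placeholders, not published results.
reference_rpe2_per_core = 1000.0  # the old HS21 test blades the spreadsheet assumes
target_rpe2_per_core = 2000.0     # the proposed deployment server

# The spreadsheet's output: cores required on the reference hardware.
reference_cores_required = 12

# Scale by the per-core performance ratio, rounding up to whole cores.
ratio = target_rpe2_per_core / reference_rpe2_per_core
target_cores_required = math.ceil(reference_cores_required / ratio)

print(f"Cores required on target hardware: {target_cores_required}")  # 6
```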

OK, so now, using the spreadsheet, you key in the TPS required for the various transaction types. Let’s say 500 TPS of SendSMS over SMPP (just to keep it simple; normally you would also have to take into account Push WAP and MMS messages, not to mention other transaction types such as Location requests, which are not covered by the spreadsheet). That’s 12 x 2 cores on Ivan’s old clunkers, but on newer hardware such as newer HS21s with 3 GHz CPUs it’s 6 x 2 cores, and on JS12 blades it is also 6 x 2 cores. “Oh, that’s easy,” you say, “the HS21s are only 50 PVUs each; I’ll just go with Linux on HS21 blades and that will be the best bang for the buck for the customer.” Well, don’t forget that Intel no longer make dual-core CPUs for servers; they’re all quad-core, so in the above example you have to buy 8 x 2 cores rather than the 6 x 2 cores for the JS12/JS22 blades.

Note the x 2 after each number: that is because for TWSS in production deployments, you must separate the TWSS Access Gateway (AG) and the TWSS Service Platform (SP). The x 2 indicates that the AG and the SP each require that number of cores.

Let’s work that through:
 
Let’s first say that TWSS is $850 per PVU.

  • For the fast HS21s, that’s 8 x 2 x 50 x $850 = $680,000 for the TWSS licences alone.
  • For the JS12s, that’s 6 x 2 x 80 x $850 = $816,000 for the TWSS licences alone.
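
The same arithmetic as a small Python sketch, using the illustrative $850-per-PVU price from above (not a real list price):

```python
# Illustrative licence-cost comparison; $850/PVU is the example price above.
price_per_pvu = 850

options = {
    # name: (cores per tier, PVU rating per core)
    "HS21 (quad-core Intel)": (8, 50),
    "JS12 (PowerPC)": (6, 80),
}

for name, (cores_per_tier, pvus_per_core) in options.items():
    total_pvus = cores_per_tier * 2 * pvus_per_core  # x 2: AG plus SP
    cost = total_pvus * price_per_pvu
    print(f"{name}: {total_pvus} PVUs -> ${cost:,} for TWSS licences alone")
```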

Also (and all sales people who are pricing this should know this), the pre-requisites for TWSS must be licensed separately as well. That means the appropriate number of PVUs of WESB (for the TWSS AG) and the appropriate number of PVUs of WAS ND (for the TWSS SP), as well as the database. It’s pretty easy to see how the numbers add up quickly and how much your sizing estimate can affect the price of the solution.

Database sizing for TWSS

For the database, of course we prefer to use DB2, but most telcos will demand Oracle, in my experience. For TWSS, the size of the server is usually not the bottleneck in the environment; what matters is the database writes and reads per second, which equates to disk input/output, needed to achieve high transaction rates with TWSS. It is VITAL to have an appropriate number of disk spindles in the database disk array to achieve the required throughput. The spreadsheet will give you the number of disk drives that need to be in a striped (RAID 0) array to achieve the throughput; for the above 500 TPS example, it is 14.6 disks = 15 disks, since you can’t buy only part of a disk. While RAID 0 will give you striping, and consequently throughput across your disk array, if one drive fails you’re sunk. To get protection, you must go with RAID 1+0 (sometimes called RAID 10), which gives you both striping (RAID 0) and mirroring (RAID 1). RAID 1+0 immediately doubles your disk count, so we’re up to 30 disks in the array. Our friends at STG should be able to advise on the most suitable disk array unit to go with. In terms of CPU for the database server, as I said, it does not get heavily loaded. The spreadsheet indicates that 70.7% of the reference HS21 (Ivan’s clunker) would be sufficient, so a single-CPU JS12 or HS21 blade, even an old one, would be suitable.

Every time I do a TWSS sizing, I get asked how much capacity we need in the RAID 1+0 disk array, despite my always asking for the smallest disks possible. Remember: we are going for a (potentially) large array to get throughput, not storage space. In reality, I would expect a single 32 GB HDD could easily handle the size requirements for the database, so space is not an issue at all when we have 30 disks in our array. To answer the question about what size: the smallest possible, since that will also be the cheapest possible, provided it does not compromise the seek and data transfer rates of the drive. In the hypothetical 30-drive array, if we select the smallest drive now available (136 GB), we would have a massive ~2 TB of usable space (15 x 136 GB, one side of each mirrored pair), which is way over what we need in terms of space, but it is the only way we can currently get the throughput needed for the disk I/O on our database server. Exactly the same principles apply regardless of whether DB2 or Oracle is used for the database.
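
A quick sketch of that disk-array arithmetic; the 14.6-spindle figure is the spreadsheet output from the 500 TPS example, and the drive size is just the example above:

```python
import math

# Spindle count is driven by throughput, not capacity.
spindles_for_throughput = 14.6  # spreadsheet output for the 500 TPS example
smallest_drive_gb = 136         # smallest drive currently available (example)

striped_disks = math.ceil(spindles_for_throughput)  # 15 -- can't buy part of a disk
total_disks = striped_disks * 2                     # RAID 1+0 mirrors every stripe

# Usable space is one side of the mirror -- far more than the database needs.
usable_gb = striped_disks * smallest_drive_gb
print(f"{total_disks} disks in RAID 1+0, ~{usable_gb / 1024:.1f} TB usable")
```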

Something I have yet to see empirical data on is how Solid State Drives (SSDs), with their higher I/O rates, would perform in a RAID 1+0 array.  In such an I/O-intensive application, I suspect they would allow us to drop the total number of ‘disks’ in the array quite significantly, but I don’t have any real data to back that up or to size an array of SSDs.

We have also considered using an in-memory database such as SolidDB, either as the working database or as a ‘cache’ in front of DB2, but the problem there is that the level of SQL supported by SolidDB is not the same as that supported by DB2 or Oracle’s conventional database.  Porting the TWSS code to use SolidDB would require a significant investment in development.

Remember: Sizing estimates must always be multiples of the number of cores per CPU

Make sure you have enough overhead built into your calculations for other processes that may be using CPU cycles on your servers. I assume that the TWSS processes will only ever use a maximum of 50% of the CPU; that leaves the other 50% for other tasks and processes that may be running on the system. I always state that with my assumptions as well.  As an example, I would say:

To achieve 500 TPS (peak) of SendSMS over SMPP at 50% CPU utilisation, you will need 960 PVUs of TWSS on JS12 (BladeCenter JS12 P6 4.0GHz-4MB (1ch/2co)) blades or 800 PVUs of TWSS on HS21 (BladeCenter HS21 XM Xeon L5430 Quad Core 2.66GHz (1ch/4co)) blades. I would then list the assumptions that I had made to get to the 500 TPS figure, such as:

  • There is no allowance made for Push WAP or MMS in the sizing estimate.
  • 500 TPS is the peak load and not an average load
  • SMSC has an SMPP interface available
  • All application driven SMS traffic will be during a 16 hour window
  • etc

What about High Availability?

Well, I think that High Availability (HA) is probably a topic in its own right, but it does have a significant effect on sizing, so I will talk about it in that regard. HA is generally specified in nines; by that I mean if a customer asks for “five nines”, they mean 99.999% availability per annum (that’s about 5.2 minutes per year of unplanned downtime). Three nines (99.9% available) or even two nines (99%) are also sometimes asked for. Often, customers will ask for five nines without realising the significant impact that such a requirement will have on the software, hardware and services sizing. If we start adding additional nodes into clusters for server components, that will not only improve the availability of that component, it will also increase the transaction capacity and the price. The trick is to find the right balance between hardware sizing and HA requirements. For example, say a customer wanted 400 TPS of Transaction X, but also wanted HA. Let’s assume a single JS22 (2 x dual-core PowerPC) blade can handle the 400 TPS requirement. We could go with JS22 blades and just add more to the cluster to build up the availability and remove single points of failure. As soon as we do that, we are also increasing the licence cost and the actual capacity of the component, so with three nodes in the cluster, we would have 1200 TPS capability at three times the price of what they actually need, just to get HA. If we use JS12 blades (1 x dual-core PowerPC), which have half the computing power of a JS22, we could have three JS12s in a cluster, achieve 3 x 200 (say) TPS = 600 TPS, and even with a single node in the cluster down, still achieve the 400 TPS requirement. With JS12s, we meet the performance requirement, we have the same level of HA as we did with 3 x JS22s, but the licensing price will be half that of the JS22-based solution (at 1.5 x the single-JS22 option).
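
Here is that comparison as a small Python sketch; the per-node throughput figures and relative licence costs are the illustrative numbers from the example, not measured results:

```python
# Compare three-node clusters built from big vs small blades (illustrative numbers).
required_tps = 400
nodes = 3

options = {
    # name: (assumed TPS per node, licence cost relative to one JS12)
    "JS22 (2 x dual-core)": (400, 2.0),
    "JS12 (1 x dual-core)": (200, 1.0),
}

for name, (tps_per_node, relative_cost) in options.items():
    capacity = nodes * tps_per_node
    degraded = (nodes - 1) * tps_per_node  # capacity with one node down
    verdict = "meets" if degraded >= required_tps else "misses"
    print(f"{name}: {capacity} TPS total, {degraded} TPS with one node down "
          f"({verdict} {required_tps} TPS), relative cost {nodes * relative_cost:g}")
```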

I guess the point I am trying to get across is to think about your options and consider if there are ways to fiddle with the deployment hardware to get the most appropriate sizing for the customer and their requirements. The whole thing just requires a bit of thinking…

What other tools are available for sizing?

IBMers have a range of tools available to help with sizing: the TWSS spreadsheet I was talking about earlier, various online tools and, of course, Techline.  Techline is also available to our IBM Business Partners via the PartnerWorld web site (you need to be a registered Business Partner to access the Techline pages on the PartnerWorld site).  For more mainstream products such as WAS, WPS, Portal etc., Techline is the team to help Business Partners; they have questionnaires that they will use to get all the parameters they need to do the sizing. Techline is the initial contact point for sizing support. For more specialised product support (like for TWSS and the other WebSphere Software for Telecom products) you may need to contact your local IBM team for help.  If you are a partner, feel free to contact me directly for assistance with sizing WsT products.

There is an IBM class for IT Architects called ‘Architecting for Performance’. Don’t let the title put you off; others can do it. I did it, and I am neither an architect (I am a specialist) nor from IBM Global Services (although everyone else in the class was!). If you get the opportunity to attend the class, I recommend it: you work through plenty of exercises, and while you don’t do any component sizing, you do do some whole-system sizing, which is a similar process.  I am not sure if the class is open to Business Partners; if it is, I would also encourage architects and specialists from our BPs to do the class.  Let me take that on as a task: I will see if it is available externally and report back.

Sizing estimation is not an exact science

:-)

As I glance back over this post, I guess I have been rambling a bit, but hopefully you now understand some of the factors involved in doing a sizing estimate. The introduction of assumptions and other factors beyond your knowledge and control makes sizing non-exact; it will always be an estimate, and you cannot guarantee its accuracy. That is something you should also state with your assumptions.