There has been a cross-organization initiative to define and commit to TP99-based SLAs. Looking back at the post I wrote last September, I really wanted my team to understand our SLAs, and to communicate with clients using well-defined SLAs and monitoring tools.
Before this initiative, most teams tracked their performance (latency) using average processing time. The problem is that when latency has large variance, poor latency is hidden by the mean or median. Max latency is not very useful either. Imagine a Java-based service that does a minor GC every 10 minutes and a full GC every few hours: max latency only reflects the worst latency during GC.
That's why "top percentile" (TP) based latency makes more sense. If you have 100 requests to your service, sort all the request times in ascending order; the 99th request in the list is your TP99.
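The sort-and-pick procedure above can be sketched in a few lines. This is a minimal illustration using the nearest-rank method; real monitoring systems often use interpolation or streaming approximations instead.

```python
import math

def tp99(latencies_ms):
    """Return the TP99 latency: sort ascending, take the value at
    the 99th-percentile rank (nearest-rank method)."""
    if not latencies_ms:
        raise ValueError("no latency data points")
    ordered = sorted(latencies_ms)
    # nearest-rank: ceil(0.99 * n) gives the 1-based rank of TP99
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]
```

For exactly 100 requests this returns the 99th value in the sorted list, matching the description above.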
To design TP99 SLAs, you need to keep a few things in mind:
- Define a time span -- You have to collect latency data for every single request, sort it, and find the 99th-percentile data point. So you want a reasonable amount of data collected within the time span. If you have a low-volume service that gets fewer than 100 requests per 10 minutes, you do not want to define a 10-minute SLA; if you do, your TP99 is simply your worst case. If you have a very high-volume service, you don't want to define a daily TP99, because you'll end up hiding the real problems.
- Watch out for the extra cost of logging -- In order to calculate TP99, you have to log every single request. If the logging system is not designed properly, you might actually degrade overall system performance by logging too much. My recommendation is to truly separate the core business logic from the operational/system-level logic, so the application doesn't have to worry about logging or calculating. I've seen an existing solution that logs into SQL Server. Perf data points have little or no relationship to each other, so writing them to SQL Server just doesn't make sense. I recommend simply writing to a local file or local service, and doing offline/off-hour data aggregation.