tl;dr? Short form...
- Understand the user scenarios and define tests. Review the mix of scenarios per test and the type of tests to be executed (peak, stress, soak, flood).
- Size and prepare the test environment and data. Consider the location of injectors and servers and mock peripheral services and systems where necessary.
- Test the tests!
- Execute and monitor everything. Start small and ramp up.
- Analyse results, tune, rinse and repeat until happy.
- Report the results.
- And question to what level of depth performance testing is really required...
Assuming we've got the tools and the environments, the execution of performance tests should be fairly simple. The first hurdle though is in preparing for testing.
User Scenarios and Test Definitions
In order to test we first need to understand the sort of user scenarios that we're going to encounter in production which warrant testing. For existing systems we can usually do some analysis on web-logs and the like to figure out what users are actually doing and try to model these scenarios. For this we may need a year or more of data to see if there are any seasonal variations and to understand what the growth trend looks like. For new systems we don't have this data so need to make some assumptions and estimates as to what's really going to happen. We also need to determine which of the scenarios we're going to model and the transaction rates we want them to achieve.
When we've got system users calling APIs or running batch-jobs the variability is likely to be low. Human users are a different beast though and can wander off all over the place doing weird things. To model all scenarios can be a lot of effort (which equals a lot of cost) and a risk based approach is usually required. Considerations here include:
- Picking the top few scenarios that account for the majority of activity. It depends on the system, but I'd suggest keeping these scenarios down to <5 - the fewer the better so long as it's reasonably realistic.
- Picking the "heavy" scenarios which we suspect are most intensive for the system (often batch jobs and the like).
- Introducing noise to tests to force the system into doing things it wouldn't normally be doing. This sort of thing can be disruptive (e.g. a forced load of a library not otherwise used may be just enough to push the whole system over the edge in a catastrophic manner).
We next need to consider the relative mix of user scenarios for our tests (60% of users executing scenario A, 30% doing scenario B, 10% doing scenario C etc.) and the combinations of scenarios we want to consider (running scenarios A, B, C vs. A, B, C plus batch job Y).
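As a rough illustration of turning a scenario mix into virtual user counts, here is a minimal sketch. The function name and the 500-user total are assumptions for the example only; the 60/30/10 split is the one mentioned above.

```python
# Hypothetical scenario mix, expressed as integer percentages.
SCENARIO_MIX = {"A": 60, "B": 30, "C": 10}

def allocate_users(total_users, mix):
    """Split total_users across scenarios by weight, using largest-remainder rounding."""
    total_weight = sum(mix.values())
    exact = {name: total_users * weight / total_weight for name, weight in mix.items()}
    alloc = {name: int(value) for name, value in exact.items()}
    leftover = total_users - sum(alloc.values())
    # Hand any remaining users to the scenarios with the largest fractional part.
    by_fraction = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc
```

Calling allocate_users(500, SCENARIO_MIX) gives 300, 150 and 50 users for A, B and C. The rounding step only matters when the total doesn't divide cleanly.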
Some of these tests may not be executed for performance reasons but for operability - e.g. what happens if my backup runs when I'm at peak load? or what happens when a node in a cluster fails?
We also need test data.
For each scenario we should be able to define the test data requirements. This is stuff like user-logins, account numbers, search terms etc.
Just getting 500 test user logins set up can be a nightmare. The associated test authentication system may not have the capacity to handle the number of logins or accounts and we may need to mock it out. It's all too common for peripheral systems not to be in a position to support performance testing as we'd like, and in any case we may want something that is more reliable when testing. For any mock services we do decide to build we need to work out how they should respond and what their performance should look like (it's no good having a mock service return in 0.001 seconds when the real thing takes 1.8 seconds).
Account numbers have security implications and we may need to create dummy data. Search terms, especially from humans, can be wild and wonderful - returning millions or zero records in place of the expected handful.
In all cases, we need to prepare the environment based on the test data we're going to use and size it correctly. Size it? Well, if production is going to have 10 million records it's not much good testing with 100! Copies of production data, possibly obfuscated, can be useful for existing systems. For new systems though we need to create the data. Here be dragons. The distribution of randomly generated data almost certainly won't match that of real data - there are far more instances of surnames like Smith, Jones, Taylor, Williams or Brown than there are like Zebedee. If the distribution isn't correct then the test may be invalid (e.g. we may hit one shard or tablespace and associated nodes and disks too little or too much).
I should point out here that there's a short cut for some situations. For existing systems with little in the way of stringent security requirements, no real functional changes and idempotent requests (think application upgrades or hardware migrations of primarily read-only websites), replaying the legacy web-logs may be a valid way to test. It's cheap, quick and simple - if it's viable.
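To make the point about mock response times concrete, here is a minimal sketch of a mock that waits for a realistic period before replying. The service, the function names and the log-normal spread are all assumptions; only the 1.8 second figure comes from the text above.

```python
import math
import random
import time

def sample_latency(rng, median_s=1.8, sigma=0.25):
    """Draw a response time in seconds from a log-normal with the given median."""
    return rng.lognormvariate(math.log(median_s), sigma)

def mock_login(username, rng, sleep=time.sleep):
    """Hypothetical stand-in for the real login service."""
    delay = sample_latency(rng)
    sleep(delay)  # hold the caller up for roughly as long as the real service would
    return {"user": username, "authenticated": True, "delay_s": delay}
```

The sleep function is passed in so the mock itself can be tested without waiting. Ideally the median and spread would be taken from measurements of the real service rather than guessed.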
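A small sketch of why the distribution matters, assuming a made-up set of surname weights and a hypothetical scheme that shards on the first letter of the surname. None of the numbers are real frequencies.

```python
import random
from collections import Counter

# Illustrative weights only; real frequencies should come from production
# data or a published name-frequency table.
SURNAME_WEIGHTS = {"Smith": 50, "Jones": 40, "Taylor": 30,
                   "Williams": 28, "Brown": 25, "Zebedee": 1}

def generate_surnames(n, weights, seed=None):
    """Generate n surnames, drawn in proportion to their weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)

def shard_counts(surnames, n_shards=4):
    """Count records per shard for a hypothetical first-letter sharding scheme."""
    counts = Counter()
    for surname in surnames:
        counts[(ord(surname[0].upper()) - ord("A")) % n_shards] += 1
    return counts
```

With these weights one of the four shards receives no records at all, which is the sort of skew that uniformly random test data would hide.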
We should also consider the profile and type of tests we want to run. For each test profile there are three parts: the ramp-up time (how long it takes to get to the target volume), the steady-state time (how long the test runs at this level) and the ramp-down time (how quickly we close the test; we usually care little for this and can close the test down quickly, but in some cases we want a nice clean shutdown). In terms of test types there are:
- Peak load test - Typically a 1 to 2 hr test at peak target volumes. e.g. Ramp-up 30 minutes, steady-state 2hrs, ramp-down 5 mins.
- Stress test - A longer test continually adding load beyond peak volumes to see how the system performs under excessive load and potentially where the break point is. e.g. Ramp-up 8 hrs, steady-state 0hrs, ramp-down 5 mins.
- Soak test - A really long test running for 24hrs or more to identify memory leaks and the impact of peripheral/scheduled tasks. e.g. Ramp-up 30 mins, steady-state 24hrs, ramp-down 5 mins.
- Flood test (aka Thundering Herd) - A short test where all users arrive in a very short period. In this scenario we can often see chaos ensue initially but the environment settle down after a short period. e.g. Ramp-up 0mins, steady-state 2hrs, ramp-down 5 mins.
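The three-part profile can be sketched as a small data structure that answers "how many users should be active at minute N?". The class name, the linear ramps and the 500-user peak are assumptions; the durations follow the examples above.

```python
from dataclasses import dataclass

@dataclass
class LoadProfile:
    ramp_up_min: float
    steady_min: float
    ramp_down_min: float
    peak_users: int

    def users_at(self, minute):
        """Target number of concurrent users at `minute` since the test started."""
        if minute < 0:
            return 0
        if minute < self.ramp_up_min:
            return round(self.peak_users * minute / self.ramp_up_min)
        minute -= self.ramp_up_min
        if minute < self.steady_min:
            return self.peak_users
        minute -= self.steady_min
        if minute < self.ramp_down_min:
            return round(self.peak_users * (1 - minute / self.ramp_down_min))
        return 0

# Hypothetical peak of 500 users for both profiles.
PEAK = LoadProfile(ramp_up_min=30, steady_min=120, ramp_down_min=5, peak_users=500)
FLOOD = LoadProfile(ramp_up_min=0, steady_min=120, ramp_down_min=5, peak_users=500)
```

PEAK.users_at(15) is half way up the ramp at 250 users, whereas FLOOD.users_at(0) is already at 500. Most test tools have their own way of expressing this, so treat it as a way of thinking about the shape rather than something to run as-is.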
So we're now ready to script our tests. We have the scenarios, we know the transaction volumes, we have test data, our environment is prep'd and we've mocked out any peripheral services and systems.
Scripting
There are many test tools available from the free Apache JMeter and Microsoft web stress tools to commercial products such as HP LoadRunner and Rational Performance Tester to cloud based solutions such as Soasta or Blitz. Which tool we choose depends on the nature of the application and our budget. Cloud tools are great if we're hosting in the public cloud, not so good if we're an internal service.
The location of the load injectors (the servers which run the actual tests) is also important. If these are sitting next to the test server we'll get different results than if the injector is running on someone's laptop connected via a VPN tunnel over a 256kbit ADSL line somewhere in the Scottish Highlands. Which case is more appropriate will depend on what we're trying to test and where we consider the edge of our responsibility to lie. We have no control over the sort of devices and connectivity internet users have so perhaps our responsibility stops at the point of ingress into our network? Or perhaps it's a corporate network and we're only concerned with the point of ingress into our servers? We do need to design and work within these constraints so measuring and managing page weight and latency is always a concern but we don't want to have the complexity of all that "stuff" out there which isn't our responsibility weighing us down.
Whichever tool we choose, we can now complete the scripting and get on with testing.
Testing
Firstly, check everything is working. Run the scripts with a single user for 20 minutes or so to ensure things are responding as expected and that the transaction load is correct. This will ensure that as we add more users we're scaling as we want and that the scripts aren't themselves defective. We then quite quickly ramp the tests up: 1 user, 10 users, 100 users etc. This helps to identify any concurrency problems early on with fewer users than the full target (which can add too much noise and make it hard to see what's really going on).
If we've an existing system, once we know the scripts work we will want to get a baseline from the legacy system to compare to. This means running the tests on the legacy system. What? Hang on! This means we need another instance of the system available running the old codebase with similar test data and similar (but possibly not identical) scripts! Yup. That it does.
If we've got time-taken logging enabled (%D for Apache mod_log_config) then we could get away with comparing the old production response times with the new system so long as we're happy the environments are comparable (same OS, same types of nodes, same spec, same topology, NOT necessarily the same scale in terms of numbers of servers) and that the results are showing the same thing (which depends on what upstream network connectivity is being used). But really, a direct comparison of test results is better - comparing apples with apples.
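Apache's %D records the time taken to serve each request in microseconds, so pulling response times out of an existing log is straightforward. This is a minimal sketch which assumes a LogFormat of `%h %l %u %t "%r" %>s %b %D` (i.e. %D appended as the last field); the regular expression and function name are mine and would need adjusting for any other format.

```python
import re

# Assumes a LogFormat of: %h %l %u %t "%r" %>s %b %D
LINE_RE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ (?P<micros>\d+)$'
)

def response_times_ms(lines):
    """Yield (path, status, milliseconds) for each line matching the assumed format."""
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match:
            # %D is in microseconds; convert to milliseconds.
            yield (match.group("path"),
                   int(match.group("status")),
                   int(match.group("micros")) / 1000.0)
```

Lines that don't match are silently skipped here, so it's worth counting matches against total lines before trusting the numbers.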
We also need to consider what to measure and monitor. We are probably interested in:
- For the test responses:
  - Average, max, min and 95th percentile for the response time per request type.
  - Average, max, min size for page weight.
  - Response codes - 20x/30x probably good, lots of 40x/50x suggests the test or servers are broken.
  - Network load and latency.
- For the test servers:
  - CPU, memory, disk and network utilisation throughout the test run.
  - Key metrics from middle-ware; queue depths, cache-hit rates, JVM garbage collection (note that JVM memory will look flat at the server level so needs some JVM monitoring tools). These will vary depending on the middle-ware and for databases we'll want a DBA to advise on what to monitor.
  - Number of sessions.
  - Web-logs and other log files.
- For the load injectors:
  - CPU, memory, disk and network utilisation throughout the test run. Just to make sure it's not the injectors that are overstretched.
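Most test tools report the response time figures listed above for us, but as a sketch of what they mean, here is one way to compute them from a list of timings. The nearest-rank percentile is just one of several common definitions, so figures from different tools may differ slightly.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def summarise(times_ms):
    """Return min, average, 95th percentile and max for a non-empty list of timings."""
    ordered = sorted(times_ms)
    return {
        "min": ordered[0],
        "avg": sum(ordered) / len(ordered),
        "p95": percentile(ordered, 95),
        "max": ordered[-1],
    }
```

The 95th percentile is usually more telling than the average, since a handful of very slow responses can hide behind a healthy looking mean.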
And finally we can test.
Analysis and Tuning
It's important to verify that the test achieved the expected transaction rates and usage profiles. Reviewing log files to ensure there are no errors, and web-logs to confirm transaction rates and request types, helps verify that all was correct before we start to review response times and server utilisation.
We can then go through the process of correlating test activity with utilisation, identifying problems and limits near capacity (JVM memory for example) and extrapolating for production - for which some detailed understanding of the scaling nature of the system is required.
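A quick check of achieved versus target transaction rates might look something like the sketch below. The scenario names, the targets and the 10% tolerance are all assumptions for illustration.

```python
def check_rates(counts, duration_s, targets, tolerance=0.1):
    """Return {scenario: (target_tps, actual_tps)} for scenarios outside tolerance."""
    misses = {}
    for name, target in targets.items():
        actual = counts.get(name, 0) / duration_s
        if abs(actual - target) > tolerance * target:
            misses[name] = (target, actual)
    return misses

# Hypothetical request counts taken from the web-logs for a one hour steady state.
COUNTS = {"search": 36000, "login": 3000}
TARGETS_TPS = {"search": 10, "login": 2}
```

Here check_rates(COUNTS, 3600, TARGETS_TPS) flags "login" as running at well under half its target rate, which would point to a problem with the test rather than the system.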
It's worth noting that whilst tests rarely succeed first time, in my experience it's just as likely to be an issue with the test as it is with the system itself. It's therefore necessary to plan to execute tests multiple times. A couple of days is normally not sufficient for proper performance testing.
All performance test results should be documented for reporting and future needs. To already have an understanding of why certain changes have been made and a baseline to compare to the next time the tests are run is invaluable. It's not war-and-peace, just a few pages of findings in a document or wiki. Most test tools will also export the results to a PDF which can be attached to keep track of the detail.
Conclusion?
This post is already too long but one thing to question is... Is it worth the effort?
A Zipf distribution exists for systems and few really have that significant a load. Most handle a few transactions a second if that. I wouldn't suggest "no performance testing" but I would suggest sizing the effort depending on the criticality and expected load. Getting a few guys in the office to hit F5 whilst we eyeball the CPU usage may well be enough. In code we can also include timing metrics in unit tests and execute these a few thousand times in a loop to see if there's any cause for concern. Getting the engineering team to consider and monitor performance early on can help avoid issues later and reduce the need for multiple performance test iterations.
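As a sketch of the timing-in-unit-tests idea, the example below loops a function a few thousand times and fails if it blows a time budget. The function under test, the iteration count and the budget are all hypothetical.

```python
import time
import unittest

def build_greeting(name):
    """Hypothetical function under test."""
    return "Hello, " + name.strip().title()

class GreetingTimingTest(unittest.TestCase):
    ITERATIONS = 5000
    BUDGET_S = 1.0  # hypothetical, deliberately generous budget

    def test_build_greeting_is_fast_enough(self):
        start = time.perf_counter()
        for _ in range(self.ITERATIONS):
            build_greeting("  ada lovelace ")
        elapsed = time.perf_counter() - start
        self.assertLess(
            elapsed, self.BUDGET_S,
            "%d calls took %.3fs" % (self.ITERATIONS, elapsed),
        )
```

This can be run with python -m unittest like any other test. Keep the budget generous: timings vary between machines and a tight limit will produce flaky builds rather than useful warnings.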
Critical systems with complex transactions or an expected high load (which as a rough guide I would say is anything around 10tps or more) should be tested more thoroughly. Combining capacity needs with operational needs informs the decision - four 9's and 2k tps is the high end from my experience - and a risk based approach should always be used when considering performance testing.