Performance tuning usually goes something like this:
- a performance problem occurs
- an experienced person knows what is probably the cause and suggests a specific change
- baseline performance is determined, the change is applied, and performance is measured again
- if the performance has improved compared to the baseline, keep the change, else revert the change
- if the performance is now considered sufficient, you’re done. If not, return to the experienced person to ask what to change next and repeat the above steps
This entire process can be expensive, especially in complex environments where the suggestion of the experienced person is usually a (hopefully well-informed) guess. It can take quite a few iterations before performance is sufficient. If you can make these guesses more accurate by augmenting this informed judgement with data, you can potentially tune more efficiently.
In this blog post I’ll try to do just that. Of course a major disclaimer applies, since every application, environment, and piece of hardware is different, and the definition of performance and how to measure it is also something you can have different opinions on. In short, I looked at many different variables and measured response times and throughput of minimal microservice implementations for every combination of those variables. I fed all that data to a machine learning model and asked the model which variables it used to predict performance. I also presented on this topic at UKOUG Techfest 2019 in Brighton, UK. You can view the presentation here.
I varied several things:
- wrote similar minimal "hello world"-like implementations in 10 frameworks (see the code here)
- varied the number of assigned cores
- varied the memory available to the JVM
- varied the Java version (8, 11, 12, 13)
- varied the JVM flavor (OpenJ9, Zing, OpenJDK, OracleJDK)
- varied the garbage collection algorithm (tried all possible algorithms for every JVM / version)
- varied the number of concurrent requests
What did I do?
I measured response times and throughput for every possible combination of variables. You can look at the data here.
Next I put all the data into a Random Forest Regression model, confirmed it was accurate, and asked the model for feature importances: which feature mattered most in the generated model for determining response time and throughput. Those are the features to start tweaking first; the features with low feature importance are less relevant. Of course, as I already mentioned, the model was generated from the data I provided. I had to make some choices, because even with tests of 20 seconds each, testing every combination took over a week. How accurate the model is for situations outside my test scenario I cannot tell; you have to check that for yourself.
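The modelling step can be sketched roughly as follows. This is a minimal sketch, not the actual notebook: the column names and the tiny DataFrame are made up to stand in for the real benchmark results.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic stand-in for the benchmark results: one row per test run.
# Column names are illustrative, not the ones from the actual dataset.
df = pd.DataFrame({
    "framework":     ["spring", "quarkus", "vertx"] * 20,
    "cores":         [1, 2, 4, 2, 4, 1] * 10,
    "response_time": [12.0, 5.0, 4.1, 5.2, 4.0, 12.3] * 10,
})

# Categorical variables (framework, JVM flavor, GC algorithm, ...) need a
# numerical encoding before they can be fed to the model.
X = pd.get_dummies(df[["framework", "cores"]], columns=["framework"])
y = df["response_time"]

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based feature importances, one value per (encoded) column.
importances = dict(zip(X.columns, model.feature_importances_))
for name, imp in sorted(importances.items(), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

Note that one-hot encoding spreads a categorical variable's importance over several dummy columns, which is one reason to also check permutation importance (more on that below).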
Why did I use Random Forest Regression?
- It was easy to use and suitable for my data: a (supervised learning) regression model with the right balance between bias and variance.
- It allowed easy determination of feature importance
- I also tried Support Vector Regression but was unable to obtain accuracy anywhere close to that of Random Forest Regression. To me this was a sign of underfitting by SVR: the model was unable to capture the patterns in the data
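To illustrate the underfitting point with a toy example (made-up data below, not my benchmark results): on a non-linear target with an interaction between features, a Random Forest with default settings typically outscores a default, untuned SVR by a wide margin. SVR can usually be made competitive with feature/target scaling and hyperparameter tuning, but out of the box it struggles here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(300, 2))
# Non-linear target with an interaction between the two features.
y = np.sin(X[:, 0]) * X[:, 1] ** 2

rf_r2 = cross_val_score(RandomForestRegressor(random_state=0),
                        X, y, cv=5, scoring="r2").mean()
svr_r2 = cross_val_score(SVR(), X, y, cv=5, scoring="r2").mean()
print(f"Random Forest R^2: {rf_r2:.2f}, SVR R^2: {svr_r2:.2f}")
```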
Which tools did I use?
Of course I could write a book about this study: detail the method used, explain all the different microservice frameworks tested, elaborate on the test tooling, etc. I won’t. You can check the scripts yourself here, and I already wrote an article about most of the data here (in Dutch, though). Some highlights:
- I used Apache Bench for load generation. Apache Bench might not be highly regarded by some, but it did the job well enough; at least better than my custom code, written first in Node.js and later rewritten in Python. Comparing its load-generation performance to wrk, for example, there is not much difference except at higher concurrency, which was outside the scope of what I measured (see here). For future tests I’ll try out wrk.
- I used Python for running the different scenarios. Easier than Bash, which I used before.
- For analyzing and visualization of the data I used Jupyter Notebook.
- I first did some warm-up / priming before starting the actual tests
- I took special care not to use virtualization tools such as VirtualBox or Docker
- I also took special care to avoid competition for resources, even though I generated load on the same hardware that ran the services. Splitting the load generation and the service across different machines would not have worked, since the performance differences were sometimes pretty small (sub-millisecond) and would have been lost in network latency.
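As an illustration of the test-driving side (a hypothetical sketch, not the actual scripts linked above): Apache Bench can be invoked from Python with `subprocess` and its summary output parsed with a regular expression:

```python
import re
import subprocess

def parse_rps(ab_output: str) -> float:
    """Extract requests per second from Apache Bench's summary output."""
    match = re.search(r"Requests per second:\s+([\d.]+)", ab_output)
    if match is None:
        raise ValueError("no 'Requests per second' line found")
    return float(match.group(1))

def run_ab(url: str, requests: int = 1000, concurrency: int = 4) -> float:
    """Run one Apache Bench load test and return the measured throughput."""
    result = subprocess.run(
        ["ab", "-n", str(requests), "-c", str(concurrency), url],
        capture_output=True, text=True, check=True)
    return parse_rps(result.stdout)
```

Looping `run_ab` over all combinations of variables, with a warm-up call before each measured run, is essentially the shape of the test harness.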
Confirm the model is accurate
In the plot below I’ve shown predicted values against actual values; the diagonal line indicates perfect accuracy. As you can see, the accuracy of the model is pretty high. The R² value (coefficient of determination) was also around 0.99 for both response times and throughput, which is very nice!
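This kind of accuracy check is only a few lines. The sketch below uses a synthetic target as a stand-in for my data; the real check is in the notebook linked further down:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 3))
y = 10 * X[:, 0] + np.sin(6 * X[:, 1])  # synthetic stand-in target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
model = RandomForestRegressor(random_state=1).fit(X_tr, y_tr)
pred = model.predict(X_te)

r2 = r2_score(y_te, pred)  # coefficient of determination on held-out data
# A predicted-vs-actual scatter plus the diagonal gives the parity plot:
# plt.scatter(y_te, pred); plt.axline((0, 0), slope=1)
```

The important detail is scoring on held-out data: an R² computed on the training set would tell you little about how well the model generalizes.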
The below graphs show the results for feature importance of the different variables.
However, I noticed feature importance becomes less accurate when the number of distinct classes differs per variable. To fix that I also looked at permutation feature importance, which is determined by calculating the reduction in model accuracy when a specific variable is randomized. Luckily this looked very similar:
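scikit-learn provides this directly as `sklearn.inspection.permutation_importance`. A minimal sketch, again with made-up data in which the first feature dominates by construction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 5 * X[:, 0] + X[:, 1]  # feature 0 dominates, feature 2 is pure noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time and measure the drop in R^2 on held-out data.
result = permutation_importance(model, X_te, y_te,
                                n_repeats=10, random_state=0)
means = result.importances_mean
```

Unlike impurity-based importance, this measure is computed on held-out data and is not biased towards variables with many distinct values, which is exactly why it is a useful cross-check here.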
Most important features
As you can see, the feature importance of the framework/implementation used was highest. This indicates the choice of implementation (within the scope of my tests, of course) was more important for response times and throughput than, for example, the JVM supplier (Zing, OpenJ9, OpenJDK, OracleJDK). The JVM supplier in turn was more important than the choice of a specific garbage collection algorithm (the garbage collection algorithm did not appear to matter much at all, although it did become more important when memory became a limiting factor). The Java version did not show much difference.
Least important features
The least important feature during these tests was the number of assigned cores. Apparently assigning more cores did not improve performance much. Because I found this peculiar, I did some additional analysis on the data, and it appeared certain frameworks are better at using more cores or dealing with higher concurrency than others (when not specifically tuning the individual frameworks/HTTP servers, thus using default settings).
You can check the notebook here.
A nice method in theory
This method of determining the most important features for response times and throughput is of course difficult to apply to real-life scenarios. Trying all different combinations of variables is hard, since in more complex environments (for example virtualized ones) there can be many things to vary, and performing an accurate, reproducible performance test can take a long time.
The method suggests the actual implementation (framework) was the most important feature for response times and throughput. How useful is a result like this? Can the model generalize outside the scope of the code/environment/tested variables? Usually you cannot easily switch frameworks for production applications. And even though it matters for performance, the JVM flavor is mostly provided by a team focusing on infrastructure and is ‘a given’ for developers; the choice can be related to support contracts, for example.
So where is the value?
You can use the results as inspiration for which knobs to turn to get better performance. Also, when you have already gathered a lot of performance data, feeding it into a suitable model might provide you with valuable insights you might otherwise have missed.