Dynimize + MySQL Cross-Microarchitecture Analysis
by David Yeager
In this blog post I will discuss how the Dynimize cross-microarchitecture performance results were obtained, followed by an analysis of those results.
You can download the scripts to generate similar graphs for your own system from here. Note that this repository also contains the results generated for all the tests performed. A trace of each command executed by the script was generated using #!/bin/bash -x and saved in the output.log files. A full run across all 5 benchmarks takes approximately 2 hours on the systems I used.
In order to use these scripts, you must first install Dynimize:
Use your access token to start a subscription license for your host.
You'll need to install gnuplot for the graphs to be generated:
The CPU hardware events are collected using the Linux perf tool, which should also be installed if you want to generate the CPU events graphs:
Afterwards, try running the perf command to make sure it was actually installed.
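A quick way to confirm that perf and the other tools mentioned so far are on your PATH is a small check like the one below. This is a sketch. The function name require_tools and the tool list in the usage comment are assumptions and are not part of the published scripts.

```bash
# Hypothetical helper: report any required command that is not on the PATH.
# Returns non-zero if at least one tool is missing.
require_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (tool list is an assumption; adjust for your setup):
#   require_tools perf gnuplot sysbench mysql && perf --version
```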
CPU frequency scaling can produce sporadic results when running as many back-to-back tests as this script does, so disabling it is highly advisable. On my systems, I did that by editing /etc/default/grub (first make a backup copy of that file) and modifying the line with GRUB_CMDLINE_LINUX_DEFAULT= to include intel_pstate=disable. For example:
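Below is a minimal sketch of that edit as a shell function. It assumes the line uses double quotes (GRUB_CMDLINE_LINUX_DEFAULT="...") with nothing after the closing quote, and that GNU sed is available for sed -i. The helper name add_grub_param is hypothetical.

```bash
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# Assumes the line looks like GRUB_CMDLINE_LINUX_DEFAULT="..." with nothing after
# the closing quote, and that the parameter contains no "/" (GNU sed -i is used).
add_grub_param() {
  local file=$1 param=$2   # e.g. /etc/default/grub and intel_pstate=disable
  # Leave the file alone if the parameter is already on that line.
  if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*[\" ]$param[\" ]" "$file"; then
    return 0
  fi
  # Append before the closing quote, then drop the leading space an empty value leaves.
  sed -i \
    -e "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"\$/\1 $param\"/" \
    -e "s/^GRUB_CMDLINE_LINUX_DEFAULT=\" /GRUB_CMDLINE_LINUX_DEFAULT=\"/" \
    "$file"
}

# Example, from a root shell (back up the file first):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_grub_param /etc/default/grub intel_pstate=disable
```

Running it twice is harmless because the grep check skips the edit when the parameter is already present.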
Now update grub and reboot:
After each reboot, issue the following command to fully disable frequency scaling by setting the scaling governor to "performance":
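One way to do this is to write "performance" to each core's scaling_governor file in sysfs. The sketch below takes the sysfs root as a parameter so it can be tried on a scratch directory first. The function name is hypothetical, and on a real system it needs root privileges.

```bash
# Hypothetical helper: set every core's cpufreq governor to "performance".
# root defaults to the real sysfs path; pass a scratch directory to try it safely.
# On a real system this must run as root.
set_performance_governor() {
  local root=${1:-/sys/devices/system/cpu} count=0 gov
  for gov in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -e "$gov" ] || continue          # unmatched glob: no cpufreq interface
    echo performance > "$gov"
    count=$((count + 1))
  done
  echo "$count"                        # number of governors updated
}

# Verify afterwards:
#   cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```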
Now check each core's frequency and verify that it is set to the maximum:
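Here is a minimal sketch for this check. It summarizes the per-core "cpu MHz" values from /proc/cpuinfo as the core count, minimum, and maximum. The function name is hypothetical. The reported values are approximate, so compare them against your CPU's rated frequency rather than expecting an exact match.

```bash
# Hypothetical helper: summarize the "cpu MHz" lines of /proc/cpuinfo as
# "<cores> <min MHz> <max MHz>". With scaling disabled, min and max should both
# sit at or near the rated maximum; a low minimum suggests cores are still scaled.
cpu_mhz_summary() {
  awk -F: '/^cpu MHz/ {
      mhz = $2 + 0                     # numeric conversion ignores leading blanks
      if (n == 0 || mhz < lo) lo = mhz
      if (n == 0 || mhz > hi) hi = mhz
      n++
    }
    END { if (n) printf "%d %.0f %.0f\n", n, lo, hi }' "${1:-/proc/cpuinfo}"
}

# Usage: cpu_mhz_summary
```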
You'll also need to install Sysbench 1.0:
Make sure that you have at least 4 GB of free memory and swap combined for Dynimize:
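Below is a sketch of this check using /proc/meminfo. It assumes MemAvailable is a fair stand-in for free memory and SwapFree for free swap; both fields are reported in kB. The function name is hypothetical.

```bash
# Hypothetical helper: succeed if MemAvailable + SwapFree is at least $1 GiB
# (default 4). Treating MemAvailable as "free Mem" is an assumption.
enough_mem_plus_swap() {
  local need_gib=${1:-4} meminfo=${2:-/proc/meminfo}
  awk -v need="$need_gib" '
    $1 == "MemAvailable:" || $1 == "SwapFree:" { kb += $2 }
    END {
      have = kb / 1048576               # kB -> GiB
      printf "%.1f GiB available (need %d)\n", have, need
      if (have >= need + 0) exit 0
      exit 1
    }' "$meminfo"
}

# Usage: enough_mem_plus_swap 4 || echo "increase swap space"
```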
If not, increase your available swap space.
Obviously, make sure that MySQL is installed, and that innodb_buffer_pool_size is large enough to hold all the tables in memory. For our runs of 10 tables x 1M rows, an innodb_buffer_pool_size of 4 GB was more than sufficient.
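For reference, here is a sketch of setting the buffer pool through a MySQL option file, where 4G is the option-file form of 4 GB. The file path is an assumption: Debian and Ubuntu MySQL packages typically read /etc/mysql/conf.d/, but layouts vary by distribution. The helper name is also hypothetical. Restart mysqld after writing the file.

```bash
# Hypothetical helper: write a MySQL option file that sets the InnoDB buffer pool size.
write_buffer_pool_cnf() {
  local target=$1 size=${2:-4G}   # e.g. /etc/mysql/conf.d/buffer-pool.cnf (path is an assumption)
  cat > "$target" <<EOF
[mysqld]
innodb_buffer_pool_size = $size
EOF
}

# After restarting MySQL, confirm the value actually in use (in GiB):
#   mysql -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024;"
```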
Running the tests
The tests are run by executing plotSysbench.sh. Before running the script, you may need to edit some or all of these variables at the start of the file. Below are the settings used for these runs:
Setting events="true" runs the perf command during each Sysbench run. Because this interferes slightly with the throughput results, we performed two runs of plotSysbench.sh for each MySQL/server combination: one with events="false" to generate the throughput graphs, and another with events="true" to collect perf events separately. Note that the transactions/cycle graphs do use the throughput numbers generated with events="true".
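The two-pass procedure can be scripted roughly as follows. This sketch assumes each setting in plotSysbench.sh is a line of the form name="value" at the beginning of the line. set_script_var and two_pass are hypothetical helpers, not part of the published scripts. Because sed -i edits the script in place, keep a copy of the original.

```bash
# Hypothetical helper: set name="value" in a settings file (edits it in place).
set_script_var() {
  local file=$1 name=$2 value=$3
  sed -i "s/^$name=\"[^\"]*\"/$name=\"$value\"/" "$file"
}

# Hypothetical driver: one pass for throughput, one for CPU hardware events.
two_pass() {
  local script=${1:-./plotSysbench.sh}
  set_script_var "$script" events false && "$script"   # throughput graphs
  set_script_var "$script" events true && "$script"    # perf event graphs
}
```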
Setting cpuUsage="true" generates MySQL CPU usage graphs through the top command, and was set to "false" for all of these runs. Setting power="true" measures power consumption on a laptop with a discharging battery. We'll save those experiments for another post.
useTaskSet was set to "false". Setting it to "true" allows mysqld to run with less interference from Sysbench and may give you slightly greater speedups with Dynimize, at the expense of making these runs less realistic. Ideally, we would have run these tests with Sysbench and mysqld isolated on separate servers with a direct connection, such that Sysbench could still saturate the CPUs on the MySQL server. Running mysqld in isolation this way would show the greatest speedup from Dynimize. When both processes share a server, speeding up mysqld makes Sysbench work harder to keep up, so Sysbench consumes more CPU cycles, stealing them from mysqld and dampening the speedup. That was the case in these runs. We nevertheless chose this setup and accepted the reduced speedup, because it's much easier for the average user to validate these tests on their own systems without requiring a special hardware setup.
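For readers who want to try CPU isolation on a single machine, the sketch below splits the logical CPUs into one range for mysqld and one for Sysbench, in the list format that taskset -c accepts. It is an illustration only. The split_cpus helper is hypothetical and is not how plotSysbench.sh implements useTaskSet. It also assumes contiguous CPU numbering, which may not keep hyperthread siblings together.

```bash
# Hypothetical helper: print "<mysqld CPUs> <sysbench CPUs>" in taskset -c list form,
# giving the last $2 logical CPUs to Sysbench and the rest to mysqld.
split_cpus() {
  local total=$1 bench=$2 first_bench last_db db sb
  if [ "$bench" -lt 1 ] || [ "$bench" -ge "$total" ]; then
    echo "need 1 <= sysbench CPUs < total CPUs" >&2
    return 1
  fi
  first_bench=$((total - bench))
  last_db=$((first_bench - 1))
  if [ "$last_db" -eq 0 ]; then db=0; else db="0-$last_db"; fi
  if [ "$bench" -eq 1 ]; then sb=$first_bench; else sb="$first_bench-$((total - 1))"; fi
  echo "$db $sb"
}

# Example:
#   set -- $(split_cpus "$(nproc)" 2)
#   taskset -a -pc "$1" "$(pidof mysqld)"        # pin all mysqld threads
#   taskset -c "$2" sysbench <your usual sysbench arguments>
```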
For each benchmark, plotSysbench.sh first performs warmup runs with 32 client connection threads until the mysqld process is dynimized, then collects results for 1 to 128 threads without restarting mysqld. It records the number of warmup runs required to dynimize mysqld, each run lasting for the duration set by the warmupIncrementTime variable. It then repeats the same number of warmups before taking measurements for the runs without Dynimize.
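The warmup logic described above can be sketched as two shell functions. run_warmup and is_dynimized are hypothetical hooks you would supply, for example a 32-thread Sysbench run lasting warmupIncrementTime seconds and a check of Dynimize's status. The published script's actual implementation may differ.

```bash
# Sketch: repeat warmups until mysqld is dynimized; print how many were needed.
# run_warmup and is_dynimized are hypothetical hooks supplied by the caller.
count_warmups() {
  local max=${1:-20} n=0
  while [ "$n" -lt "$max" ]; do
    run_warmup
    n=$((n + 1))
    if is_dynimized; then
      echo "$n"
      return 0
    fi
  done
  echo "$n"
  return 1                     # gave up before mysqld was dynimized
}

# Sketch: repeat the same number of warmups for the baseline (no Dynimize) runs.
repeat_warmups() {
  local i=0
  while [ "$i" -lt "$1" ]; do
    run_warmup
    i=$((i + 1))
  done
}
```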
CPU Hardware Events
The Linux perf tool was used to count CPU events of the mysqld process during the Sysbench runs, which helps us understand how Dynimize speeds up mysqld. All events are totals across all CPU cores on a single system. See the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3, for more details on these events.
The following seven events were measured: cycles, instructions retired, instruction cache misses, instruction fetches, instruction TLB misses, conditional branch misses retired, conditional branches retired.
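As a sketch of how these counts could be collected for a running mysqld, the function below uses perf stat with generic event names. Mapping the seven events above to perf's generic names is an approximation and an assumption. branches and branch-misses count all branches rather than only conditional ones. L1-icache-loads is reported as not supported on many Intel CPUs, in which case raw Intel event codes from the SDM would be needed. The function name is hypothetical.

```bash
# Generic perf names used as approximations of the seven events (an assumption).
PERF_EVENTS="cycles,instructions,L1-icache-load-misses,L1-icache-loads,iTLB-load-misses,branch-misses,branches"

# Hypothetical helper: count PERF_EVENTS for mysqld over a fixed window.
perf_mysqld() {
  local seconds=$1 out=$2
  # -x, writes CSV; -p attaches to the running process; "sleep" sets the window.
  perf stat -x, -o "$out" -e "$PERF_EVENTS" -p "$(pidof mysqld)" -- sleep "$seconds"
}

# Usage (during a Sysbench run): perf_mysqld 60 mysqld-events.csv
```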
From these throughput and CPU event results, we can see that the overall trend across each benchmark is that Dynimize significantly reduces instruction cache misses, ITLB misses, and branch mispredictions, thereby increasing IPC (instructions per cycle) and improving performance. Generally speaking, the more instruction cache misses, instruction TLB misses, and branch mispredictions per transaction a benchmark has, the greater the speedup Dynimize provides. I believe the improvements in transactions per cycle are greater than the total speedup measured here primarily because the Sysbench process in this setup runs on the same server as mysqld and steals cycles from it as mysqld speeds up. Another reason is that Dynimize also slightly reduces the number of instructions per transaction, increasing efficiency beyond the IPC gain alone.
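To reproduce this kind of per-transaction comparison from your own perf output, a sketch like the following derives IPC and per-transaction rates. It assumes perf stat -x, CSV output (value first, event name third) using generic perf event names such as cycles, instructions, L1-icache-load-misses, iTLB-load-misses and branch-misses. It also assumes a transaction count taken from the matching Sysbench run. The function name is hypothetical. Cycles per transaction is the reciprocal of transactions per cycle.

```bash
# Hypothetical helper: derive IPC and per-transaction rates from "perf stat -x," output.
perf_metrics() {
  local csv=$1 txns=$2
  awk -F, -v txns="$txns" '
    /^#/ || NF < 3 { next }                 # skip comment and blank lines
    $1 ~ /^[0-9.]+$/ { count[$3] = $1 }     # skip <not counted>/<not supported>
    END {
      if (count["cycles"] + 0 == 0 || txns + 0 == 0) {
        print "missing cycles count or transaction count" > "/dev/stderr"
        exit 1
      }
      printf "IPC %.2f\n", count["instructions"] / count["cycles"]
      printf "cycles/txn %.0f\n", count["cycles"] / txns
      printf "instructions/txn %.0f\n", count["instructions"] / txns
      printf "icache-misses/txn %.1f\n", count["L1-icache-load-misses"] / txns
      printf "itlb-misses/txn %.1f\n", count["iTLB-load-misses"] / txns
      printf "branch-misses/txn %.1f\n", count["branch-misses"] / txns
    }' "$csv"
}
```

Running it on the output from a dynimized run and from the matching run without Dynimize gives two summaries that can be compared line by line.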
Note that we chose to run only read-only Sysbench MySQL benchmarks, since these are easy for anyone to make CPU-bound and therefore show actual increases in throughput. When not CPU-bound, Dynimize will still increase IPC, but may yield the same throughput numbers with reduced CPU utilisation and power consumption. We'll explore Dynimize with the benchmarks that include writes in a future post.
The systems tested here represent the past three major microarchitecture redesigns of Intel CPUs, all of which benefit from Dynimize. I wasn't able to rent a dedicated Xeon server with older processors from my service provider for these tests. However, I've tested Dynimize with MySQL + Sysbench on my own older Nehalem and Core microarchitecture laptops (the latter microarchitecture was introduced back in 2006), and both show speedups similar to those observed on these newer machines. The point here is that Dynimize speedups apply across a broad range of processor generations and MySQL benchmarks.