Dynimize + MySQL Cross-Microarchitecture Analysis

In this blog post I will discuss how the Dynimize cross-microarchitecture performance results were obtained, and then analyze those results.

You can download the scripts to generate similar graphs for your own system from here. Note this repository also contains the results generated for all the tests performed. A trace of the script executing each command was generated by using #!/bin/bash -x, and saved in the output.log files. A full run across all 5 benchmarks takes approximately 2 hrs on the systems I used.

Prerequisites  

In order to use these scripts, you must first install Dynimize:

wget https://dynimize.com/install -O install
wget https://dynimize.com/install.sha256 -O install.sha256
sha256sum -c install.sha256; if [ $? -eq 0 ]; then sudo bash ./install -d; fi


Use your access token to start a subscription license for your host.

$ sudo dyni -license=start -token=your-access-token


You'll need to install gnuplot for the graphs to be generated:

Ubuntu/Debian:

$ sudo apt-get install gnuplot


RHEL/CentOS/Fedora:

$ sudo yum install gnuplot


The CPU hardware events are collected using the Linux perf tool, which should also be installed if you want to generate the CPU events graphs:

Debian/Ubuntu:

$ sudo apt-get install linux-tools-common linux-tools-generic


RHEL/CentOS/Fedora:

$ sudo yum install perf


After this you should try calling the perf command to make sure it was actually installed.
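That check can be scripted for all the tools used in this post. This is a minimal sketch; the helper name require_tool is my own and is not part of the Dynimize scripts. It only confirms that a command is on the PATH. On Ubuntu the perf command can be a wrapper that exists even when the kernel-specific tools package is missing, so running perf --version by hand is still worthwhile.

```shell
#!/bin/bash
# Hypothetical helper (not part of the Dynimize scripts): succeeds only if the
# named command can be found on PATH.
require_tool() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found" >&2
        return 1
    fi
}

# Check the tools used in this post. "|| true" keeps the loop going so that
# every missing tool is reported, not just the first one.
for tool in perf gnuplot sysbench; do
    require_tool "$tool" || true
done
```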

CPU frequency scaling can produce sporadic results when running so many tests back to back, as this script does, so disabling it is highly advisable. On my systems, I did that by editing /etc/default/grub (first make a backup copy of that file) and modifying the line with GRUB_CMDLINE_LINUX_DEFAULT= to include intel_pstate=disable. For example:

GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=disable noquiet nosplash net.ifnames=0 biosdevname=0"


Now update grub and reboot:

$ sudo update-grub
$ sudo reboot
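
Once the machine is back up, it is worth confirming that the parameter actually reached the kernel. A minimal sketch, assuming the running kernel's command line is exposed in /proc/cmdline; the function name has_kernel_param is my own:

```shell
#!/bin/bash
# Succeeds if the kernel command line file contains exactly the given parameter.
# The file argument defaults to /proc/cmdline; it is a parameter only so the
# function can be pointed at a test file.
has_kernel_param() {
    local param="$1" file="${2:-/proc/cmdline}"
    # Put one parameter per line, then require a whole-line, fixed-string match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

if has_kernel_param intel_pstate=disable; then
    echo "intel_pstate is disabled on the kernel command line"
else
    echo "intel_pstate=disable not found; check /etc/default/grub and re-run update-grub"
fi
```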


After each reboot, issue the following command to fully disable frequency scaling by setting the scaling governor to "performance":

$ for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f "$CPUFREQ" ] || continue; echo -n performance | sudo tee "$CPUFREQ" > /dev/null; done


Now check each core's frequency and verify that it is set to the maximum:

$ cat /proc/cpuinfo | grep MHz


cpu MHz : 3701.000
cpu MHz : 3701.000
cpu MHz : 3701.000
cpu MHz : 3701.000
cpu MHz : 3701.000
cpu MHz : 3701.000
cpu MHz : 3701.000
cpu MHz : 3701.000
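
The governor itself can be checked the same way. This is a minimal sketch that reads the same sysfs files the loop above writes to; the function name check_governors is my own:

```shell
#!/bin/bash
# Print the governor of every CPU under the given sysfs root and return
# non-zero if any of them is not "performance".
# The root argument defaults to the real sysfs path; it is a parameter only so
# the function can be pointed at a test directory.
check_governors() {
    local root="${1:-/sys/devices/system/cpu}" status=0 f cpu gov
    for f in "$root"/cpu*/cpufreq/scaling_governor; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        cpu=${f#"$root"/}                # e.g. cpu0/cpufreq/scaling_governor
        cpu=${cpu%%/*}                   # e.g. cpu0
        gov=$(cat "$f")
        echo "$cpu: $gov"
        [ "$gov" = "performance" ] || status=1
    done
    return "$status"
}

check_governors || echo "warning: some CPUs are not using the performance governor"
```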


You'll also need to install Sysbench 1.0:

Debian/Ubuntu:

$ curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.deb.sh | sudo bash
$ sudo apt -y install sysbench


RHEL/CentOS:

$ curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.rpm.sh | sudo bash
$ sudo yum -y install sysbench


Fedora:

$ curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.rpm.sh | sudo bash
$ sudo dnf -y install sysbench


Make sure that you have at least 4 GB total of free Swap + Mem for Dynimize:

$ free


              total        used        free      shared  buff/cache   available
Mem:       16397624     1360380    11741576       13868     3295668    14674468
Swap:       1046520           0     1046520


If not, increase your available swap space.
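
The same check can be scripted. This is a minimal sketch, assuming the procps free layout shown above, where the fourth column of the Mem: and Swap: rows is the free amount in KiB; the function name free_plus_swap_kib is my own:

```shell
#!/bin/bash
# Read `free` output (KiB units) on stdin and print free Mem + free Swap in KiB.
free_plus_swap_kib() {
    awk '/^Mem:/ { mem = $4 } /^Swap:/ { swap = $4 } END { printf "%d\n", mem + swap }'
}

required_kib=$((4 * 1024 * 1024))   # 4 GB expressed in KiB
# LC_ALL=C keeps the row labels in English so the awk patterns match.
available_kib=$(LC_ALL=C free -k | free_plus_swap_kib)

if [ "$available_kib" -ge "$required_kib" ]; then
    echo "OK: ${available_kib} KiB free (Mem + Swap)"
else
    echo "Too low: ${available_kib} KiB free, need ${required_kib} KiB"
fi
```

Fed the sample output above, the function adds 11741576 and 1046520, which is comfortably over the 4 GB requirement.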

Make sure that MySQL is installed and that innodb_buffer_pool_size is large enough to hold all the tables in memory. For our runs of 10 tables x 1M rows, innodb_buffer_pool_size=4G was more than sufficient.


Running the tests

The tests are run by executing plotSysbench.sh. Before running the script, you may need to edit some or all of these variables at the start of the file. Below were the settings used for the runs:

tableSize=1000000
numTables=10
pswd='password'
dbName='test'
user="mysqlUser"
readOnly="on"
measureTime=60
warmupIncrementTime=30
events="false"
cpuUsage="false"
useTaskSet="false"


Setting events="true" will result in the perf command running during each Sysbench run. It will interfere with the throughput results slightly, so for these perf results we decided to perform two runs of plotSysbench.sh for each MySQL/Server combo; one run with events="false" to generate the throughput graphs, another run with events="true" to collect perf events separately. Note that the transactions/cycle graphs do use the throughput numbers generated with events="true".

Setting cpuUsage="true" generates MySQL CPU usage graphs through the top command, and was set to "false" for all of these runs. Setting power="true" measures power consumption on a laptop with a discharging battery. We'll save those experiments for another post.

useTaskSet was set to "false". Setting it to "true" allows mysqld to run with less interference from Sysbench and may give slightly greater speedups with Dynimize than without, at the expense of making the runs less realistic. Ideally we would have run these tests with Sysbench and mysqld isolated on separate servers, with a direct connection so that Sysbench can still saturate the CPUs on the MySQL server. Running mysqld in isolation this way gives the greatest speedup from Dynimize. When both share a server, speeding up mysqld makes Sysbench work harder to keep up, so Sysbench consumes more CPU cycles, taking them away from mysqld and dampening the speedup. That was the case with these runs. We accepted the reduced speedup because it is much easier for the average user to validate these tests on their own systems without a special hardware setup.
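
To illustrate the kind of isolation useTaskSet is aiming for, here is a minimal sketch that pins mysqld and Sysbench to disjoint sets of cores with taskset. The half-and-half split and the function name split_cores are my own assumptions for illustration; plotSysbench.sh may divide the cores differently. The sketch only prints the commands it would run.

```shell
#!/bin/bash
# Hypothetical split (an assumption, not taken from plotSysbench.sh): give the
# first half of the cores to mysqld and the second half to Sysbench.
# Prints two CPU lists in the format `taskset -c` expects, e.g. "0-3 4-7".
split_cores() {
    local n="$1" half
    [ "$n" -ge 2 ] || return 1          # need at least one core for each side
    half=$((n / 2))
    echo "0-$((half - 1)) ${half}-$((n - 1))"
}

n=$(nproc)
if cpus=$(split_cores "$n"); then
    read -r mysqld_cpus sysbench_cpus <<< "$cpus"
    # Dry run: print the commands rather than executing them.
    # -a applies the affinity to every thread of the existing mysqld process.
    echo "sudo taskset -a -cp $mysqld_cpus \$(pidof mysqld)"
    echo "taskset -c $sysbench_cpus sysbench ..."
else
    echo "need at least 2 cores to separate mysqld from Sysbench"
fi
```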

For each benchmark, plotSysbench.sh will first perform a warmup run at 32 client connection threads until the mysqld process is dynimized, then collect results for 1 to 128 threads without restarting mysqld. For each benchmark, it will record the number of warmup runs required to dynimize mysqld, each run executing for a duration set by the warmupIncrementTime variable. It will then repeat the same number of warmups before taking measurements for the runs without Dynimize.
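
As a rough picture of the measurement sweep, the sketch below builds Sysbench 1.0 command lines from the settings listed earlier and prints them without running anything. Two things here are my assumptions and not taken from plotSysbench.sh: the test name oltp_read_only, and the doubling of the thread count from 1 to 128.

```shell
#!/bin/bash
# Settings copied from the variable list in this post.
tableSize=1000000
numTables=10
pswd='password'
dbName='test'
user="mysqlUser"
measureTime=60

# Build (but do not run) a Sysbench 1.0 command line for a given thread count.
# oltp_read_only is an assumed test name; the real script may use others.
sysbench_cmd() {
    local threads="$1"
    echo "sysbench oltp_read_only" \
        "--tables=$numTables --table-size=$tableSize" \
        "--mysql-db=$dbName --mysql-user=$user --mysql-password=$pswd" \
        "--threads=$threads --time=$measureTime run"
}

# Assumed sweep: double the thread count from 1 up to 128.
threads=1
while [ "$threads" -le 128 ]; do
    sysbench_cmd "$threads"
    threads=$((threads * 2))
done
```

Printing the commands first makes it easy to confirm the credentials and table settings before committing to a full run.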


CPU Hardware Events

The Linux perf tool was used here to count CPU events of the mysqld process during the Sysbench runs, which helps us better understand how Dynimize is speeding up the mysqld process. All events are totals across all CPU cores on a single system. See Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3 for more details.

The following seven events were measured: cycles, instructions retired, instruction cache misses, instruction fetches, instruction TLB misses, conditional branch misses retired, conditional branches retired.
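
If you want to collect similar counts by hand, a perf stat invocation along these lines is one way to do it. This is a sketch under assumptions: the generic perf event names below stand in for the events listed above and are not necessarily the ones plotSysbench.sh uses. In particular, branches counts all branches, not only conditional ones, there is no portable generic name for instruction fetches, and event availability varies by CPU (check perf list). The function only prints the command.

```shell
#!/bin/bash
# Generic perf event names used as stand-ins for the events listed above.
# These are assumptions: the events behind the graphs may be
# microarchitecture-specific, and not every CPU exposes all of these.
EVENTS="cycles,instructions,L1-icache-load-misses,iTLB-load-misses,branch-misses,branches"

# Print a perf stat command that counts EVENTS for one process for N seconds.
perf_cmd() {
    local pid="$1" seconds="$2"
    echo "perf stat -e $EVENTS -p $pid -- sleep $seconds"
}

# Dry run for a 60 second measurement; substitute the real PID of mysqld,
# for example the output of `pidof mysqld`.
perf_cmd "<mysqld-pid>" 60
```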


Analysis

From these throughput and CPU event results, we can see that the overall trend across each benchmark is that Dynimize significantly reduces instruction cache misses, ITLB misses, and branch mispredictions, thereby increasing IPC (instructions per cycle) and improving performance. Generally speaking, the greater the number of instruction cache misses, instruction TLB misses and branch mispredictions per transaction, the greater the speedup Dynimize provides these benchmarks. I believe the improvements in transactions per cycle are greater than the total speedup measured here primarily because the Sysbench process in this setup runs on the same server as the mysqld process, stealing cycles from mysqld as mysqld speeds up. It is also because Dynimize slightly reduces instructions per transaction, thereby increasing efficiency beyond IPC alone.
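
The relationship between these metrics is simple arithmetic: transactions per cycle equals IPC divided by instructions per transaction, so a gain in IPC and a reduction in instructions per transaction multiply together. The sketch below works this through. The input numbers are made up purely to illustrate the arithmetic and are not measurements from these runs; the function name derived_metrics is my own.

```shell
#!/bin/bash
# Compute derived metrics from raw totals: cycles, instructions, transactions.
# Transactions per cycle is scaled to "per million cycles" for readability.
derived_metrics() {
    awk -v c="$1" -v i="$2" -v t="$3" 'BEGIN {
        printf "ipc=%.2f instr_per_txn=%.0f txn_per_mcycle=%.2f\n",
               i / c, i / t, t / c * 1e6
    }'
}

# Hypothetical totals, NOT measured results.
echo "baseline:  $(derived_metrics 2000000 1500000 100)"
echo "optimized: $(derived_metrics 2000000 1800000 125)"
```

With these illustrative inputs, IPC rises by a factor of 1.2 (0.75 to 0.90) and instructions per transaction fall from 15000 to 14400, so transactions per cycle rise by 1.25, which is more than the IPC gain alone.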

Note that we chose to only run read-only Sysbench MySQL benchmarks, since these would be easy for anyone to make CPU bound and illustrate actual increases in performance. When not CPU bound, Dynimize will still increase IPC but may result in the same throughput numbers with reduced CPU utilisation and power consumption. We'll explore Dynimize with the benchmarks that include writes in a future post.

The systems tested here represent the past 3 major microarchitecture redesigns of Intel CPUs, all of which show benefits from Dynimize. I wasn't able to rent a dedicated Xeon server with older processors from my service provider for these tests. However, I've tested Dynimize with MySQL + Sysbench on my own older Nehalem and Core microarchitecture laptops, the latter microarchitecture having been introduced all the way back in 2006, and both show similar speedups to those observed with these newer machines. The point here is that Dynimize speedups are applicable across a broad range of processor generations and MySQL benchmarks.


David Yeager is the founder of Dynimize Inc, with over a decade of experience in just-in-time compiler development. He is passionate about computer architectures and software performance, and his mission is to see dynamic compilation and optimization accelerate all workloads (through Dynimize).


COPYRIGHT © DYNIMIZE INC.