Measuring CPU stall reductions from Dynimize

Duration: 30 min

Level: Intermediate

 

In this tutorial we are going to install and experiment with Dynimize using MySQL running the Sysbench OLTP benchmark. Along the way we'll also use the Linux perf command, top, and vmstat. This tutorial assumes that you have MySQL and the Linux perf tool installed, and that no CPU intensive workloads other than those being tested are running on the system. For the Linux perf tool to report CPU event counts, this tutorial should be completed either on a bare metal Linux server or, if using a virtual machine guest, on a hypervisor with virtual PMU support enabled.
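A quick way to verify that CPU event counting works in your environment is to count a single basic event; if perf prints <not supported> or <not counted> in place of a number, the PMU is not exposed to your system:

$ perf stat -e cycles:u -a sleep 1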

The initial part of this tutorial illustrates how to determine whether there is potential for Dynimize to speed up a mysql (or any other program's) workload, by checking that it is CPU bound, executes mostly in user mode (as opposed to kernel mode), and exhibits classic front-end CPU stalls.

In the latter part of the tutorial we'll also use the vmstat tool to measure CPU usage reduction after an I/O bound workload is dynimized.

Two terminals will be used for this tutorial; the first for running the benchmark workload, and the second for controlling Dynimize and collecting CPU performance statistics.

Because no two setups are identical, your results will differ from those here to some extent.

 

1. Install Dynimize

In case you haven't already, install and license Dynimize with the following commands:

wget https://dynimize.com/install -O install
wget https://dynimizecloud.com/install.sha256 -O install.sha256
sha256sum -c install.sha256; if [ $? -eq 0 ]; then sudo bash ./install -d; fi

 

Use your access token to start a subscription license for your host.

$ sudo dyni -license=start -token=<your-access-token>

 

 

2. Benchmarking MySQL with the Sysbench 0.5 OLTP workload in a CPU-bound case

For this CPU-bound part of the tutorial, all steps were performed on CentOS 7.2, with MySQL Community Server 5.7.11 for x86_64 and Sysbench 0.5. The system used here has two 8-core 2.4 GHz Intel Xeon E5-2630 v3 CPUs, 128 GB of RAM, and two 480 GB Intel SSDs. Note that Sysbench 0.5 or newer is required since we're using the --report-interval option.

 

Install sysbench:

The latest version of Sysbench can be installed from packages hosted by Alexey Kopytov on packagecloud. Note that Sysbench versions newer than 0.5 may require slightly different usage:

Debian/Ubuntu:

curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.deb.sh | sudo bash
sudo apt -y install sysbench

 

RHEL/CentOS:

curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.rpm.sh | sudo bash
sudo yum -y install sysbench

 

Fedora:

curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.rpm.sh | sudo bash
sudo dnf -y install sysbench

 

Install the Linux perf tool:

Debian/Ubuntu:

$ sudo apt-get install linux-tools-common linux-tools-generic

 

RHEL/CentOS/Fedora:

$ sudo yum install perf

 

Make sure that innodb_buffer_pool_size=1G or larger is set in /etc/my.cnf. Here is the simple my.cnf file that was used, which is sufficient for this experiment:

[mysqld]

query_cache_size=0
innodb_buffer_pool_size=1G

 

Make sure Dynimize is not running (it may already be stopped, for example after a reboot). Also replace the user and password appropriately for your tests, in both the mysql and sysbench commands.

sudo dyni -stop

 

Restart mysql:

service mysql restart
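
After the restart, you can confirm the buffer pool setting took effect; the value is reported in bytes (1G = 1073741824):

$ mysql -u root --password=password -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"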

 

Create a test database (dropping it first with IF EXISTS in case these steps are repeated):

mysql -u root --password=password -e "DROP DATABASE IF EXISTS test;"
mysql -u root --password=password -e "CREATE DATABASE test;"

 

Fill the test database with 1,000,000 rows of sysbench OLTP data. Note that different versions of Sysbench install oltp.lua into different directories, so you may need to search for oltp.lua before setting the path in the --test parameter.

sysbench \
--test=/usr/share/doc/sysbench/tests/db/oltp.lua \
--oltp-table-size=1000000 \
--mysql-db=test \
--mysql-user=root \
--mysql-password='password' \
--db-driver=mysql \
prepare
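
Optionally, check how much space the generated table occupies, to see that it will fit in the 1G InnoDB buffer pool (a quick query against the standard information_schema view; the exact figure varies, but with --oltp-table-size=1000000 expect a few hundred MB):

$ mysql -u root --password=password -e "SELECT table_name, ROUND((data_length+index_length)/1024/1024) AS size_mb FROM information_schema.tables WHERE table_schema='test';"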

 

We'll now run the benchmark in the first terminal, printing transactions per second every 10 seconds. The server used in this example has 16 cores plus hyperthreading (32 hardware threads), and 64 threads was found to give the best results on this setup, so we'll use --num-threads=64. Adjust the number of threads appropriately for your system for maximum performance.
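
If you're unsure how many hardware threads your system has, nproc reports the count; one or two benchmark threads per hardware thread is a reasonable starting point:

$ nproc
32

With a thread count chosen, start the benchmark: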

sysbench \
--test=/usr/share/doc/sysbench/tests/db/oltp.lua \
--oltp-table-size=1000000 \
--mysql-db=test \
--mysql-user=root \
--mysql-password='password' \
--max-time=999999 \
--max-requests=0 \
--num-threads=64 \
--report-interval=10 \
--oltp-read-only=on \
--db-driver=mysql \
run

 

Note that because we are running with --oltp-read-only=on, the benchmark is generally CPU bound as long as the dataset fits in the MySQL InnoDB buffer pool, which it does in this case (innodb_buffer_pool_size=1G in the my.cnf above, with --oltp-table-size=1000000).

The output should look something like the following:

sysbench 0.5: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 64
Report intermediate results every 10 second(s)
Random number generator seed is 0 and will be ignored


Threads started!

[ 10s] threads: 64, tps: 15257.09, reads: 213653.73...
[ 20s] threads: 64, tps: 15321.51, reads: 214498.32...
[ 30s] threads: 64, tps: 15343.57, reads: 214809.64...
[ 40s] threads: 64, tps: 15309.43, reads: 214322.76...
[ 50s] threads: 64, tps: 15314.10, reads: 214408.40...
[ 60s] threads: 64, tps: 15308.40, reads: 214316.27...
...
...

 

3. Measure CPU performance statistics in the CPU-bound MySQL workload

While sysbench is running in the first terminal, open a second terminal, which we will use to monitor CPU usage statistics, and run:

$ vmstat 10

 

which produces the following output:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
63 0     0 128044960 160476 2182196  0  0  0 25    3      0  5  1 95  0  0
65 0     0 128043216 160476 2182204  0  0  0 11 8417 979927 89 11  0  0  0
63 0     0 128042544 160476 2182224  0  0  0  8 8321 981092 89 11  0  0  0
62 0     0 128043344 160476 2182224  0  0  0  7 8430 981487 89 11  0  0  0
......
......

 

The above command samples performance statistics over 10 second intervals. The first line of output reports averages since boot rather than over the interval, so it should be disregarded. The "us" column shows that user mode CPU execution is consuming around 89% of the system's CPU resources, while the "sy" column shows that kernel mode execution consumes around 11%. Dynimize can only improve machine code that executes in user mode, and at 89% there is a lot of opportunity in this workload.
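
If you prefer an average over several samples to eyeballing the columns, a small awk pipeline over vmstat works (a convenience sketch: it skips the two header lines plus the first since-boot sample, then averages six 10-second samples; us and sy are fields 13 and 14):

$ vmstat 10 7 | tail -n +4 | awk '{us+=$13; sy+=$14; n++} END {printf "avg us=%.1f%%  sy=%.1f%%\n", us/n, sy/n}'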

Let's see which processes are consuming CPU resources, using the top command. In the second terminal run:

$ top

 

which produces output similar to the following:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
6959 mysql    20   0 7457360 647836   8596 S  2646  0.5 370:04.56 mysqld
7171 user     20   0 4267704  13912   2828 S 557.8  0.0  77:59.97 sysbench
8083 user     20   0  157828   2528   1556 R   0.3  0.0   0:00.03 top
......
......

 

Note that most of the output here has been removed, with only the first three process lines shown for readability. As we can see, just two processes, mysqld and sysbench, are consuming most of the CPU resources on the system. To optimize mysqld, it must be listed in the exeList in /etc/dyni.conf, which is the case by default on installation.
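
You can confirm that mysqld is listed with a quick grep (the exact layout of /etc/dyni.conf may vary between versions):

$ grep -A 3 exeList /etc/dyni.conf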

 

Now let's measure a few detailed CPU statistics with the Linux perf command, which counts CPU hardware events. While sysbench is running in the first terminal, run the following in the second terminal as a single command:

$ perf stat -e instructions:u,cycles:u,L1-icache-load-misses:u,iTLB-load-misses:u,branch-misses:u,branches:u -a sleep 10

 

The above usage counts system-wide CPU events for 10 seconds: instructions executed, CPU cycles consumed, L1 instruction cache misses, iTLB (instruction translation lookaside buffer) misses, branch prediction misses, and branches executed, all during user mode execution (as opposed to kernel mode). After 10 seconds the command outputs the following:

Performance counter stats for 'system wide':

442,928,540,208 instructions:u # 0.60 insns per cycle
743,059,263,090 cycles:u
12,493,143,978 L1-icache-load-misses
1,467,653,054 iTLB-load-misses
2,458,731,196 branch-misses:u # 3.19% of all branches
77,095,938,756 branches:u

10.000807682 seconds time elapsed

 

We can see that the IPC (instructions per cycle) is 0.60, and the branch prediction miss rate is 3.19%. We can also calculate the following:

L1-icache-load-misses rate = 12,493,143,978 / 442,928,540,208 instructions = 2.82%
iTLB-load-misses rate = 1,467,653,054 / 442,928,540,208 instructions = 0.33%

Note that the 2.82% i-cache miss rate is very high and likely consumes a large portion of this workload's CPU cycles. This problem is common in many enterprise software workloads.
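
If you'd rather have these rates computed for you, perf's CSV output mode (-x,) is easy to script. A sketch, relying on perf's CSV layout where the count is the first field and the event name the third:

$ perf stat -x, -e instructions:u,L1-icache-load-misses:u,iTLB-load-misses:u -a sleep 10 2>&1 \
  | awk -F, '{v[$3]=$1} END {printf "L1-icache miss rate: %.2f%%\niTLB miss rate: %.2f%%\n", 100*v["L1-icache-load-misses:u"]/v["instructions:u"], 100*v["iTLB-load-misses:u"]/v["instructions:u"]}'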

 

4. Apply Dynimize to this CPU-bound MySQL workload

In the second terminal we used for perf, start Dynimize:

$ sudo dyni -start

 

Wait a few seconds, and then check the status of Dynimize:

$ sudo dyni -status

 

which outputs the following:

Dynimize is running
mysqld, pid: 6959, dynimizing

 

This shows that it is currently dynimizing mysqld. Roughly 70 seconds after starting Dynimize, we can check back to see whether it has finished dynimizing this process. The time actually required depends on the workload and hardware:

$ sudo dyni -status

 

which outputs:

Dynimize is running
mysqld, pid: 6959, dynimized

 

Now go back to the first terminal where sysbench is running and observe the latest output:

...
...
[ 10s] threads: 64, tps: 15195.88, reads: 212794.61...
[ 20s] threads: 64, tps: 15300.89, reads: 214205.66...
[ 30s] threads: 64, tps: 15290.12, reads: 214067.98...
[ 40s] threads: 64, tps: 15267.90, reads: 213752.14...
[ 50s] threads: 64, tps: 15273.80, reads: 213832.62...
[ 60s] threads: 64, tps: 13816.65, reads: 193432.16...
[ 70s] threads: 64, tps: 12343.83, reads: 172821.64...
[ 80s] threads: 64, tps: 8041.09, reads: 112574.85...
[ 90s] threads: 64, tps: 7582.20, reads: 106147.24...
[ 100s] threads: 64, tps: 13311.71, reads: 186358.59...
[ 110s] threads: 64, tps: 16106.31, reads: 225490.37...
[ 120s] threads: 64, tps: 20023.00, reads: 280325.74...
[ 130s] threads: 64, tps: 20157.20, reads: 282204.14...
[ 140s] threads: 64, tps: 20160.71, reads: 282247.60...
[ 150s] threads: 64, tps: 20153.00, reads: 282133.09...
[ 160s] threads: 64, tps: 20147.70, reads: 282068.90...
[ 170s] threads: 64, tps: 20161.78, reads: 282271.60...
...
...

 

You can see a dip in performance that lasted a few seconds. That drop was mostly due to the profiling overhead Dynimize places on its target processes during the dynimizing phase. Once that completes, steady-state performance settles at around 20150 tps, approximately a 31.5% throughput improvement.

Let's now return to the second terminal and rerun the same perf command:

$ perf stat -e instructions:u,cycles:u,L1-icache-load-misses:u,iTLB-load-misses:u,branch-misses:u,branches:u -a sleep 10

 

which outputs:

Performance counter stats for 'system wide':

566,161,551,736 instructions:u # 0.78 insns per cycle
727,107,122,776 cycles:u
8,797,026,557 L1-icache-load-misses
820,245,959 iTLB-load-misses
2,511,365,723 branch-misses:u # 2.76% of all branches
90,998,098,810 branches:u

10.000983461 seconds time elapsed

 

We can see that the user mode instructions per cycle (IPC) has increased to 0.78 (from the original 0.60) and the branch miss rate has decreased to 2.76% (from 3.19%). We can again derive the following miss rates:

L1-icache-load-misses rate = 8,797,026,557 / 566,161,551,736 instructions = 1.55% (down from the original 2.82%)

iTLB-load-misses rate = 820,245,959 / 566,161,551,736 instructions = 0.14% (down from the original 0.33%)

Note that while user mode IPC increased by 0.78/0.60 = 1.30, a 30% relative improvement, Dynimize also performs a small amount of redundant instruction elimination driven by live profiling statistics and other runtime information. It therefore slightly reduces the total instructions executed per transaction, increasing performance marginally beyond the IPC improvement alone.
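
As a rough consistency check, we can estimate instructions per transaction from the perf samples and the approximate steady-state tps values above (about 15310 tps before and 20150 tps after; the instruction counts are per 10-second sample, and shell arithmetic truncates to integers):

$ echo "before: $(( 442928540208 / 10 / 15310 )) instructions/transaction"
before: 2893066 instructions/transaction
$ echo "after: $(( 566161551736 / 10 / 20150 )) instructions/transaction"
after: 2809734 instructions/transaction

That works out to roughly 3% fewer instructions per transaction after dynimizing.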

 

5. Apply Dynimize to an I/O-bound MySQL workload

We are now going to make the previous sysbench OLTP benchmark I/O bound by setting --oltp-read-only to off. In the previous section, with --oltp-read-only=on and a large enough buffer pool, all queries could be satisfied from RAM. By also writing to disk with --oltp-read-only=off, the workload can become I/O bound, meaning the speed of the drive is the hardware bottleneck that ultimately determines performance.

Because the system used in the first part of the tutorial had relatively fast SSDs, running with --oltp-read-only=off was still CPU bound, so the results in the remainder of this section were gathered on a system with a slower magnetic disk drive in order to fully accentuate the effects of Dynimize on I/O bound workloads. If your system has a fast enough drive, you'll likely still see a speedup with Dynimize when using --oltp-read-only=off. With a slow drive you won't see a speedup, but you will see reduced CPU resource consumption, as in the following example. Many systems fall somewhere between these two extremes and exhibit both some speedup and some reduction in CPU consumption, though less of each than in the purely CPU bound or I/O bound setups shown in this tutorial.

The results for the remainder of the tutorial were obtained on a system with a dual-core 2.3 GHz i3-M350 CPU with hyperthreading enabled, 4 GB of RAM, and a HITACHI HTS545032B9A300 HDD, running Ubuntu 14.04 LTS with MySQL 5.5.41 and the same sysbench version.

It is assumed that you will be continuing on the same machine, in which case stop Dynimize in the second terminal:

$ sudo dyni -stop

 

In the first terminal hit Ctrl+C to interrupt sysbench, then restart MySQL:

service mysql restart

In the first terminal run sysbench with --oltp-read-only set to off, so that the benchmark also stresses the I/O subsystem by writing to disk. We also change --report-interval to 60 to better smooth out the results, so expect throughput updates every 60 seconds now. Execute the following commands in the first terminal:

mysql -u root --password=password -e "DROP DATABASE IF EXISTS test;"
mysql -u root --password=password -e "CREATE DATABASE test;"

sysbench \
--test=/usr/share/doc/sysbench/tests/db/oltp.lua \
--oltp-table-size=1000000 \
--mysql-db=test \
--mysql-user=root \
--mysql-password='password' \
--db-driver=mysql \
prepare

sysbench \
--test=/usr/share/doc/sysbench/tests/db/oltp.lua \
--oltp-table-size=1000000 \
--mysql-db=test \
--mysql-user=root \
--mysql-password='password' \
--max-time=999999 \
--max-requests=0 \
--num-threads=64 \
--report-interval=60 \
--oltp-read-only=off \
--db-driver=mysql \
run

 

The above command should produce output similar to the following. Notice the reduced throughput compared to the read-only case; you may also find that the throughput fluctuates more:

sysbench 0.5: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 64
Report intermediate results every 60 second(s)
Random number generator seed is 0 and will be ignored


Threads started!

[ 60s] threads: 64, tps: 314.27, reads: 4414.66, writes: 1261.28...
[ 120s] threads: 64, tps: 300.08, reads: 4201.16, writes: 1200.38...
[ 180s] threads: 64, tps: 296.58, reads: 4152.17, writes: 1184.68...
[ 240s] threads: 64, tps: 300.47, reads: 4202.50, writes: 1200.05...
[ 300s] threads: 64, tps: 273.48, reads: 3831.18, writes: 1096.67...
[ 360s] threads: 64, tps: 280.82, reads: 3933.05, writes: 1123.93...
[ 420s] threads: 64, tps: 296.37, reads: 4147.25, writes: 1184.13...
[ 480s] threads: 64, tps: 310.35, reads: 4346.78, writes: 1242.73...
[ 540s] threads: 64, tps: 306.30, reads: 4287.00, writes: 1223.75...
[ 600s] threads: 64, tps: 293.87, reads: 4115.33, writes: 1174.33...
[ 660s] threads: 64, tps: 298.80, reads: 4183.20, writes: 1197.78...
[ 720s] threads: 64, tps: 298.37, reads: 4177.14, writes: 1191.32...
[ 780s] threads: 64, tps: 297.77, reads: 4167.02, writes: 1190.52...
...
...

 

Let's monitor the CPU usage with vmstat in the second terminal:

$ vmstat 60

 

Here we use a 60 second sampling interval to smooth out the results. This should produce output similar to the following:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
10 1   244   2187420  69236  511996  0  0 19   78  33   243 18  9 72  1 0
0  1   244   2139644  76916  511944  0  0  0 8831 921 19696 23  6 54 17 0
1  0   244   2090004  84452  511988  0  0  0 7950 911 19691 22  6 54 18 0
19 1   244   2081908  92276  512004  0  0  0 8486 955 20751 23  6 53 17 0
...
...

 

Notice the reduced CPU utilization compared to the CPU-bound case: around 23% for user mode and 6% for system mode, a total of 29%. This workload is clearly I/O bound.
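
To confirm that the disk is the bottleneck, you can also watch extended device statistics with iostat from the sysstat package, if installed; a %util figure near 100 on the data drive indicates a saturated disk:

$ iostat -dx 60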

Now start Dynimize in the same terminal:

$ sudo dyni -start

 

Because the workload is consuming far fewer CPU resources, Dynimize may take longer to optimize it. You can repeatedly check dyni -status until mysqld is dynimized:

$ sudo dyni -status

 

Dynimize is running
mysqld, pid: 10725, dynimizing

 

$ sudo dyni -status

 

Dynimize is running
mysqld, pid: 10725, dynimized
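
Rather than rerunning the status command by hand, you can poll for the state change (a convenience one-liner based on the status strings shown above):

$ while ! sudo dyni -status | grep -q dynimized; do sleep 10; done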

 

Now run vmstat 60 again:

$ vmstat 60

 

which outputs:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
12 2  256 1840772 113872 510976  0  0 19  138  109   393 18  9 72  2  0
11 1  256 1835424 117056 511168  0  0  0 8342 1172 23676 17  6 56 20  0
0  1  256 1832656 119972 510980  0  0  0 8236 1104 23624 18  6 56 20  0
0  1  256 1829076 122812 511168  0  0  0 8472 1157 22876 17  6 57 20  0

 

The sysbench throughput numbers in this example did not change after the workload was dynimized, because it was not CPU bound, so we won't bother showing the remaining sysbench output.

Notice the further reduction in user mode CPU utilization, now at around 17% on this system compared to the original 23%, with system mode unchanged. We can assume that CPU power consumption is also reduced given the increased idle and wait time, and that more CPU resources are now free to run other workloads, potentially improving their performance.


David Yeager is the founder of Dynimize Inc., with over a decade of experience in just-in-time compiler development. He is passionate about computer architecture and software performance, and his mission is to see dynamic compilation and optimization accelerate all workloads (through Dynimize).

