Many installations are asking for information on how to manage performance for SMT-enabled LPARs. I will talk about the parts of performance management affected by SMT: performance analysis, capacity planning, and chargeback.
In an SMT-enabled LPAR, it is important to understand that we now talk in threads and IFLs. From a z/VM perspective, we now see threads: wherever the traditional z/VM term "CPU" is used, one now has to think "thread".
At a high level, the ESALPARS report shows how much IFL time is allocated to each LPAR. When an IFL (core) is assigned to an LPAR under SMT, both threads on that IFL are assigned to the LPAR; while the core is assigned, it is not shared. However, both threads might not be used concurrently, so there is time when both threads are assigned to the LPAR but only one is doing work. That unused time is extra capacity, and it shows up on the ESALPARS report as "thread idle" time: the time during which a core (two threads) is assigned to an LPAR but only one of the threads is actually doing work.
At a high level, the LPAR summary below shows the LPARs and their allocations. Starting from the far right, the "Entitled CPU" is the share of the shared engines that the LPAR is guaranteed by the LPAR weights. The LPAR from which this data comes is the LXB5 LPAR, which is guaranteed 10.9 cores. In this case, this LPAR was assigned a physical core 828% of the time, meaning 8.28 cores were assigned on average during the one-minute reporting interval.
When a core is assigned to an LPAR in an SMT-2 environment, both threads are part of that assignment. Even though both threads are assigned, that does not mean both are utilized. The CP monitor provides another metric, the "idle thread" metric. If this LPAR is assigned 828%, subtracting the LPAR overhead of 12% (which runs non-SMT) leaves 816% "core assignment" for real work, which corresponds to 1632% "thread assignment". Of that 1632%, 594% was thread idle in this case: time when one thread was being utilized and the other was idle, but the core was assigned.
Report: ESALPARS     Logical Partition Summary
Monitor initialized: 07/07/15 at 13:03
---------------------------------------------------------------------------
         <--------Logical Partition-------> <-Assigned            Entitled
Virt     CPU                    <%Assigned> <---LPAR--> <-Thread->  CPU Cnt
Time     Name     Nbr CPUs Type Total  Ovhd Weight Pct  Idle   cnt
-------- -------- --- ---- ---- ----- ----- ------ ---- ------ --- -------
13:05:00 Totals:  00   71  IFL   1055  18.7   1001  100
         LXB5     05   20  IFL  828.6  12.6    475 47.5  594.5   2    10.91
         LXBX     0F    1  IFL    0.5   0.1     50  5.0      0   1     1.15
         LXB2     02   12  IFL   1201   0.1    Ded 21.8      0   1        0
         LXB3     03   20  IFL   2000   0.1    Ded 36.4      0   1        0
         LXB8     08   10  IFL  224.7   5.7    475 47.5      0   1    10.91
         TS02     0E    8  IFL    1.3   0.3      1  0.1      0   1     0.02

Totals by Processor type:
     <---------CPU-------> <-Shared Processor busy->
Type Count Ded shared  Total  Logical Ovhd Mgmt
---- ----- --- ------ ------ -------- ---- ----
IFL     55  32     23 1073.7   1036.5 18.7 18.4
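To make the arithmetic concrete, here is a minimal sketch in Python (my own illustrative variable names, not zVPS field names), using the LXB5 numbers from the report above:

# Thread idle arithmetic for LPAR LXB5, from the ESALPARS report above.
# All values are percentages: 100% = one core (or one thread) for the interval.
assigned_total = 828.6   # %Assigned Total: core time assigned to the LPAR
assigned_ovhd  = 12.6    # %Assigned Ovhd:  LPAR overhead, which runs non-SMT
thread_idle    = 594.5   # Thread Idle: core assigned, one thread working, one idle

core_for_work   = assigned_total - assigned_ovhd   # ~816% core assignment
thread_capacity = core_for_work * 2                # ~1632% thread assignment (SMT-2)
threads_busy    = thread_capacity - thread_idle    # roughly the thread-busy time
                                                   # z/VM reports (~1019%, ESACPUU below)
print(f"core assignment for work: {core_for_work:.1f}%")
print(f"thread assignment       : {thread_capacity:.1f}%")
print(f"threads actually busy   : {threads_busy:.1f}%")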
From a capacity planning perspective, there is unused capacity. This happens to be a z13, which had inherent bottlenecks in the TLB that were corrected on the z14 and z15. Understanding this requires hardware knowledge and use of the hardware metrics from the PRCMFC data. Many z13 installations saw a "less than zero" increase in capacity from enabling SMT on the z13. This is only detectable by evaluating the MFC data from production workloads.
The objective of SMT is to better utilize the processor core. The "z" processors have a very sophisticated cache hierarchy to increase the amount of time a core can actually execute instructions. Any instruction execution must have the instruction and all related data in the level 1 cache. Any time there is a cache miss, the core sits idle while the data comes from level 2, level 3, level 4, level 4 on a remote book, or from memory. Each of these sources requires an increasing number of cycles to load the data into the level 1 cache. If, during these cache loads, the core can process instructions from another thread, then core utilization should go up.
There are then two measures for evaluating whether capacity has increased when SMT is enabled.
Without proper measurement capability it is very difficult to know whether capacity has increased or not. One installation reported that its Linux admins think their performance is better - that method of analysis is not scientific. From a capacity planning perspective, look at the instructions per second per core and the cycles per instruction to know whether more work is being processed. If IFL utilization is low, enabling SMT changes very little - SMT is useful from a capacity perspective when IFL utilization is high and more capacity is desired.
Capacity planning becomes more difficult with SMT because there are no longer straight-line capacity growth curves. Multiple metrics are needed, and as CPU utilization grows there will be more contention for cache and TLB, and thus less work done per cycle allocated.
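As a sketch of those two measures (again Python with illustrative names; the sample rates are the system totals from the ESAMFC benchmark report shown later):

# Two capacity measures to watch when SMT is enabled:
#   1) instructions executed per second per core (more is better)
#   2) cycles per instruction, CPI (lower is better)
def capacity_measures(cycles_per_sec, instr_per_sec, cores):
    ips_per_core = instr_per_sec / cores
    cpi = cycles_per_sec / instr_per_sec
    return ips_per_core, cpi

# System totals from the 6-core benchmark interval shown later:
# 25.9G cycles/sec and 10.2G instructions/sec
ips, cpi = capacity_measures(25.9e9, 10.2e9, 6)
print(f"{ips / 1e9:.2f}G instructions/sec/core at a CPI of {cpi:.2f}")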
It is stated by IBM in many places that when running in SMT mode, workloads WILL run slower. With SMT, two workloads now share the same processor core, so at times both workloads will be ready to run but one has to wait. The performance question then becomes the impact of this "core contention".
The IBM monitor provides metrics at the system level and the user level to assist in understanding how SMT impacts the system. There is also the PRCMFC data (mainframe cache statistics) that shows the impact of two threads on the hardware cache. zVPS has been enhanced to utilize and expose these new metrics for every processor from the z196 to the current z15.
For system level performance reporting, it is important to understand that with SMT enabled there are two sets of counters for most metrics. From a z/VM perspective there are twice as many CPUs, all of which have the traditional measurements. But there is still the physical hardware utilization, and as in any performance analysis, utilization of the hardware has an impact on performance and throughput.
In the above case, LPAR LXB5 has 20 physical cores available - and in SMT-2 mode, z/VM will see 40 threads. From an LPAR perspective (ESALPAR below) we see the 20 cores with their assigned percentages and the idle thread time per core.
Report: ESALPAR      Logical Partition Analysis        Velocity Software Corporate  ZMAP 5.1.1 10/29/20 Pg 1257
--------------------------------------------------------------------------------------------------------------
         CEC  <-Logical Partition-> <----------Logical Processor---  <------(percentages)------->        Phys
         Pool VCPU                  <%Assigned>      VCPU Weight/ Total User  Sys   Idle   Stl   Idle  cp1/cp2
Time     CPUs Name     No Name Addr Total Ovhd TYPE  Polar        util  ovrhd ovrhd time   Pct   Time
-------- ---- -------- -- ---- ---- ----- ---- ----  --------    ----- ----- ----- -----  ----  ------ -------
13:05:00  55  LXB5     05 .      0   41.2  0.8  IFL  475 Hor      52.9   1.4   3.0 144.8  2.32   27.81   0 / 0
                                 1   39.6  0.6  IFL  475 Hor      50.3   1.3   2.1 147.9  1.75   26.65   2 / 3
                                 2   34.1  0.6  IFL  475 Hor      41.1   1.1   2.1 157.3  1.63   25.18   4 / 5
                                 3   34.8  0.5  IFL  475 Hor      41.1   0.9   1.7 157.5  1.39   26.68   6 / 7
                                 4   38.4  0.6  IFL  475 Hor      47.3   1.1   1.9 151.0  1.64   27.57   8 / 9
                                 5   43.5  0.6  IFL  475 Hor      55.0   1.2   2.3 143.3  1.66   30.12  10 /11
                                 6   44.1  0.7  IFL  475 Hor      56.5   1.4   2.2 141.6  1.89   29.47  12 /13
                                 7   40.3  0.7  IFL  475 Hor      50.1   1.4   2.3 148.0  1.95   28.37  14 /15
                                 8   44.5  0.5  IFL  475 Hor      53.4   0.8   1.7 145.2  1.36   33.99  16 /17
                                 9   39.2  0.6  IFL  475 Hor      48.1   1.1   1.8 150.3  1.62   28.38  18 /19
                                10    6.4  0.2  IFL  475 Hor       6.4   0.2   0.8 192.9  0.75    5.82  20 /21
                                11    5.8  0.1  IFL  475 Hor       5.8   0.1   0.4 193.8  0.38    5.41  22 /23
                                12   27.9  0.5  IFL  475 Hor      32.3   0.7   1.7 165.4  2.31   21.76  24 /25
                                13   30.4  0.6  IFL  475 Hor      36.0   0.9   2.3 161.2  2.78   22.70  26 /27
                                14   62.6  0.8  IFL  475 Hor      79.0   1.3   3.1 117.6  3.42   43.40  28 /29
                                15   52.7  0.9  IFL  475 Hor      65.1   1.3   3.4 131.4  3.47   37.30  30 /31
                                16   49.9  0.8  IFL  475 Hor      65.0   0.9   3.2 131.8  3.24   31.95  32 /33
                                17   64.6  0.8  IFL  475 Hor      75.4   0.9   3.2 121.3  3.28   50.80  34 /35
                                18   61.4  0.9  IFL  475 Hor      76.8   1.6   3.6 119.3  3.91   42.56  36 /37
                                19   67.1  1.0  IFL  475 Hor      81.9   1.2   3.9 113.9  4.11   48.57  38 /39
                                    ----- ----                   ----- ----- ----- -----  ----  ------ -------
                          LPAR      828.6 12.6                    1020  20.9  46.8  2936  44.9   594.5   0 / 0
And then from the z/VM side, we can look at the system thread by thread (ESACPUU):
Report: ESACPUU      CPU Utilization Report                        Vel
----------------------------------------------------------------------
         <----Load---->           <--------CPU (percentages)-------->
         <-Users-> Tran           Total Emul  User  Sys   Idle  Steal
Time     Actv In Q /sec CPU Type  util  time  ovrhd ovrhd time  time
-------- ---- ---- ---- --- ----  ----- ----- ----- ----- ----- -----
13:05:00   97  218  3.1  0  IFL    26.4  24.2   0.7   1.5  72.4   1.2
                         1  IFL    25.4  23.7   0.6   1.1  73.5   1.2
                         2  IFL    24.5  22.8   0.7   1.1  74.6   0.9
                         3  IFL    25.8  24.1   0.6   1.0  73.3   0.9
                         4  IFL    20.0  18.3   0.6   1.1  79.2   0.8
                         5  IFL    21.1  19.6   0.5   1.0  78.1   0.8
                         6  IFL    20.8  19.5   0.5   0.9  78.5   0.7
                         7  IFL    20.3  19.0   0.5   0.8  79.0   0.7
                         8  IFL    23.8  22.3   0.5   1.0  75.4   0.8
                         9  IFL    23.5  22.0   0.6   1.0  75.6   0.8
                        10  IFL    26.0  24.0   0.6   1.4  73.2   0.8
                        11  IFL    29.0  27.6   0.6   0.9  70.1   0.8
                        12  IFL    27.3  25.4   0.7   1.1  71.8   0.9
                        13  IFL    29.2  27.4   0.7   1.1  69.8   1.0
                        14  IFL    25.5  23.7   0.7   1.2  73.5   1.0
                        15  IFL    24.5  22.8   0.7   1.1  74.5   1.0
                        16  IFL    22.8  21.4   0.4   1.0  76.5   0.7
                        17  IFL    30.6  29.4   0.4   0.8  68.7   0.7
                        18  IFL    23.3  21.8   0.6   0.9  75.9   0.8
                        19  IFL    24.8  23.5   0.5   0.8  74.4   0.8
                        20  IFL     3.3   2.6   0.1   0.5  96.4   0.4
                        21  IFL     3.1   2.7   0.1   0.3  96.5   0.4
                        22  IFL     2.1   1.9   0.1   0.2  97.7   0.2
                        23  IFL     3.7   3.4   0.1   0.2  96.1   0.2
                        24  IFL    16.0  14.8   0.3   0.8  82.9   1.1
                        25  IFL    16.4  15.1   0.4   0.9  82.5   1.2
                        26  IFL    16.9  15.2   0.5   1.2  81.7   1.4
                        27  IFL    19.1  17.5   0.5   1.1  79.5   1.4
                        28  IFL    36.1  33.7   0.7   1.8  62.2   1.7
                        ....
                        38  IFL    36.7  33.9   0.6   2.2  61.2   2.0
                        39  IFL    45.2  43.0   0.5   1.6  52.7   2.1
                                  ----- ----- ----- ----- ----- -----
                  System:          1019 951.3  20.8  46.4  2937  44.9
Now with 816% core assigned time (828% minus the 12% overhead), z/VM sees 1019% "total thread" busy time. So with the 20 cores, there are two different utilization numbers: core busy, 828% out of 20 cores (2000%), and thread utilization, 1019% out of 40 threads (4000%). Both are important from a performance analysis perspective.
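A small sketch of the two utilization views (illustrative Python, using the numbers above):

# Two utilization numbers for the same interval on LPAR LXB5
cores   = 20          # physical cores available to the LPAR
threads = cores * 2   # SMT-2: z/VM sees 40 "CPUs" (threads)

core_assigned_pct = 828.6   # ESALPARS %Assigned Total (includes 12.6% LPAR overhead)
thread_busy_pct   = 1019    # ESACPUU total across all 40 threads

print(f"core utilization  : {core_assigned_pct / (cores * 100):.1%}")
print(f"thread utilization: {thread_busy_pct / (threads * 100):.1%}")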
One of the most interesting scenarios showing the value of the mainframe cache data is the following. The ESAMFC data below comes from an IBM benchmark without SMT. This is a z13 (the 5 GHz processor speed gives that away) with 6 processors in the LPAR. The report shows the cycles used by the workload on each processor and the number of instructions executed by each processor, all as per-second rates. At the tail end of the benchmark, processor utilization drops from 92% to 67% as some of the drivers complete. But please note: the instruction rate goes up!
Even though the utilization dropped, the instructions executed went up: as the remaining drivers stopped fighting for the processor cache, cache residency greatly improved. The last metric is the important one - cycles per instruction. If the processor cache is overloaded, cycles are wasted loading data into the level 1 cache. As contention for the L1 cache drops, so do the cycles used per instruction. As a result, more instructions are executed using much less CPU.
Report: ESAMFC       MainFrame Cache Analysis Rep
-------------------------------------------------
         <-------Processor------>  Speed/ <-Rate/Sec->
Time     CPU Totl User             Hertz  Cycles Instr Ratio
-------- --- ---- ----             -----  ------ ----- -----
14:05:32  0  92.9 64.6             5000M   4642M 1818M 2.554
          1  92.7 64.5             5000M   4630M 1817M 2.548
          2  93.0 64.7             5000M   4646M 1827M 2.544
          3  93.1 64.9             5000M   4654M 1831M 2.541
          4  92.9 64.8             5000M   4641M 1836M 2.528
          5  92.6 64.6             5000M   4630M 1826M 2.536
             ---- ----             -----  ------ ----- -----
System:       557  388             5000M   25.9G 10.2G 2.542
-------------------------------------------------
14:06:02  0  67.7 50.9             5000M   3389M 2052M 1.652
          1  67.8 51.4             5000M   3389M 2111M 1.605
          2  69.0 52.4             5000M   3450M 2150M 1.605
          3  67.2 50.6             5000M   3359M 2018M 1.664
          4  60.8 44.5             5000M   3042M 1625M 1.872
          5  70.1 53.8             5000M   3506M 2325M 1.508
             ---- ----             -----  ------ ----- -----
System:       403  304             5000M   18.8G 11.4G 1.640
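A quick sketch of the cycles-per-instruction change between the two intervals above (Python, using the system totals from the report):

# System totals from the ESAMFC benchmark report: (cycles/sec, instructions/sec)
busy_phase = (25.9e9, 10.2e9)   # 14:05:32, ~93% busy per processor
tail_phase = (18.8e9, 11.4e9)   # 14:06:02, ~67% busy per processor

for label, (cycles, instr) in (("busy", busy_phase), ("tail", tail_phase)):
    print(f"{label}: {instr / 1e9:.1f}G instructions/sec at CPI {cycles / instr:.2f}")
# Fewer cycles are consumed, yet MORE instructions are executed, because with
# less L1 contention fewer cycles are wasted reloading the cache.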
A typical production workload with SMT enabled shows 8 threads with a respectable average cycles per instruction (CPI) ratio of 1.7, at about 50% thread utilization. The question for the capacity planner is what happens to the CPI when core utilization goes up. If the CPI rises significantly, work is taking much more time (and many more cycles) to execute, and the system capacity available is much less than it appears.
Report: ESAMFC       MainFrame Cache Magnitudes
------------------------------------------------
         <-------Processor------>  Speed/ <-Rate/Sec->
Time     CPU Totl User             Hertz  Cycles Instr Ratio
-------- --- ---- ----             -----  ------ ----- -----
09:01:00  0  47.0 45.9             5000M   2290M 1335M 1.716
          1  50.0 48.9             5000M   2439M 1480M 1.648
          2  45.5 44.4             5000M   2219M 1329M 1.669
          3  47.3 46.1             5000M   2313M 1331M 1.738
          4  42.5 41.0             5000M   2078M 1164M 1.785
          5  53.6 52.7             5000M   2623M 1750M 1.499
          6  44.3 43.3             5000M   2163M 1179M 1.834
          7  56.3 55.3             5000M   2758M 1665M 1.657
             ---- ----             -----  ------ ----- -----
System:       386  378             5000M   17.6G 10.5G 1.681
In this case, 17B cycles per second are being utilized. The L1 cache is broken out into instruction cache and data cache. Of the 17B cycles consumed, 2.3B are used for instruction cache loads and another 4.2B for data cache loads. Thus, of the 17B cycles per second used, only 11B are used for executing instructions.
Report: ESAMFC       MainFrame Cache Magnitudes        Velocity Software Corpor
------------------------------------------------------------------------
         <-------Processor------>  Speed/ <-Rate/Sec->        <-Instruction-> <----Data---->
Time     CPU Totl User             Hertz  Cycles Instr Ratio  Writes   Cost   Writes   Cost
-------- --- ---- ----             -----  ------ ----- -----  ------   ----   ------   ----
09:01:00  0  47.0 45.9             5000M   2290M 1335M 1.716     13M   285M    8771K   470M
          1  50.0 48.9             5000M   2439M 1480M 1.648     13M   287M    9592K   564M
          2  45.5 44.4             5000M   2219M 1329M 1.669     13M   285M    8207K   455M
          3  47.3 46.1             5000M   2313M 1331M 1.738     13M   289M    9584K   568M
          4  42.5 41.0             5000M   2078M 1164M 1.785     11M   295M    7381K   447M
          5  53.6 52.7             5000M   2623M 1750M 1.499     14M   283M      11M   566M
          6  44.3 43.3             5000M   2163M 1179M 1.834     12M   309M    9235K   455M
          7  56.3 55.3             5000M   2758M 1665M 1.657     14M   320M      15M   685M
             ---- ----             -----  ------ ----- -----  ------   ----   ------   ----
System:       386  378             5000M   17.6G 10.5G 1.681    102M  2353M      79M  4210M

But it gets worse. There is also the cost of DAT (Dynamic Address Translation). Each reference to an address must have a valid translated address in the TLB (Translation Lookaside Buffer). In this installation's case, of the 17B cycles used, 6B were used for loading the cache, and the report below shows that another 3.6B cycles are used for address translation. In this case, about 19% of the cycles utilized go to address translation. This cost also goes up as the core becomes more utilized and there are more cache misses.
Report: ESAMFC       MainFrame Cache Magnitudes        Velocity Software Corpor
-------------------------------------------------------------
                           <-Translation Lookaside buffer(TLB)->
                        .                            CPU   Cycles
Time     CPU Totl User  .  Instr  Data  Instr  Data  Cost  Lost
-------- --- ---- ----  .  ----- ----- ------ ----- ----- -----
09:01:00  0  47.0 45.9  .     87   517  1832K  539K 19.13  438M
          1  50.0 48.9  .    109   506  1471K  525K 17.48  426M
          2  45.5 44.4  .    127   470  1258K  542K 18.66  414M
          3  47.3 46.1  .     81   522  1980K  560K 19.55  452M
          4  42.5 41.0  .    115   524  1363K  496K 20.06  417M
          5  53.6 52.7  .     47   660  2949K  466K 17.01  446M
          6  44.3 43.3  .     82   541  2050K  538K 21.27  460M
          7  56.3 55.3  .     34   728  4796K  538K 20.10  554M
             ---- ----  .  ----- ----- ------ ----- ----- -----
System:       386  378  .     72   557    18M 4205K 19.11 3609M

At this point, anyone having to perform capacity planning must realize that there is a lot of guesswork in future capacity planning models.
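To put rough numbers on that, here is a minimal cycle-accounting sketch (Python, using the system totals from the two reports above; it assumes the cache-load and translation costs do not overlap):

total_cycles = 17.6e9    # cycles/sec consumed by the LPAR
icache_cost  = 2.353e9   # cycles/sec loading the L1 instruction cache
dcache_cost  = 4.210e9   # cycles/sec loading the L1 data cache
tlb_cost     = 3.609e9   # "CPU Cycles Lost" to address translation

after_cache = total_cycles - icache_cost - dcache_cost
after_tlb   = after_cache - tlb_cost   # assumes the costs do not overlap
print(f"cycles left after L1 cache loads: {after_cache / 1e9:.1f}B")      # ~11B, as stated above
print(f"translation share of all cycles : {tlb_cost / total_cycles:.0%}") # about 20%, in line with the Cost column
print(f"cycles left for instruction work: {after_tlb / 1e9:.1f}B")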
The traditional methods of chargeback are based on CPU seconds consumed. CPU consumed was based on the time the virtual machine was actually dispatched on a CPU, and that number was very repeatable. In the SMT world, that number is no longer repeatable and will be larger for a given workload. It is larger because, even though the virtual machine is dispatched on a thread of a core for a period of time, some of that time the core is also being utilized by the other thread, increasing the time on the thread without necessarily changing the cycles required for the unit of work.
The IBM monitor facility attempts to alleviate this problem. The traditional metrics are still reported, along with two additional sets of metrics, one of which was not "filled in" at first. The new metrics are "MT-Equivalent", what the server would have used if running alone on the core, and "MT-Prorated", which attempts to charge for the cycles actually consumed.
The example we are working with had 1019% of the threads busy, of which, from the CPU performance data, 46% is system overhead. That leaves, from the CPU perspective, 972% that is chargeable to real users. But wait - we only have 816% of core assignment that we should be charging for, and some of that was idle.
Analyzing the user workload by the traditional CPU time - now really thought of as "thread time" - the capture ratio is 100%: we know exactly which virtual machine to charge for the 972% of thread time. As a chargeback model, that validates the data. But realistically, this measure is the time a virtual machine was dispatched on a thread: very accurate, yet less useful for chargeback because it is not repeatable, varying with the workload and with L1 cache contention.
The next set of metrics, "MT-Equivalent", is noticeably smaller and represents the time the workload would have used if SMT were disabled - about 830%, much closer to what should be charged.
In the early (z13) days, a third set of metrics was provided, but it was always zeros. The "MT-Prorated" metric appears to be a "best guess" of the cycles actually consumed, and in early SMT days it was not available. To finish this scenario, the MT-Equivalent values would be the best metrics to use for chargeback.
Report: ESAUSP5      User SMT CPU Consumption Analysis              Ve
---------------------------------------------------------------------
         <------CPU Percent Consumed (Total)---->  <-CPU PCT Prima
UserID/
Class    Total  Virt  Total Virtual  Total Virtual  Total Virtual
-------- ----- ----- ----- -------  ----- -------  ----- -------
13:05:00 972.1 951.3 830.8   813.0  972.1   951.3  830.8   813.0
***Key User Analysis ***
TCPIP     0.34  0.11  0.29    0.09   0.34    0.11   0.29    0.09
***User Class Analysis***
Servers   0.41  0.15  0.35    0.13   0.41    0.15   0.35    0.13
ZVPS      0.70  0.60  0.61    0.53   0.70    0.60   0.61    0.53
TheUsers 971.0 950.6 829.9   812.3  971.0   950.6  829.9   812.3
***Top User Analysis***
LINUX195 202.8 202.3 176.6   176.1  202.8   202.3  176.6   176.1
LINUX203 77.36 76.77 64.13   63.63  77.36   76.77  64.13   63.63
LINUX199 67.44 66.75 56.32   55.73  67.44   66.75  56.32   55.73
LINUX204 57.35 56.22 49.20   48.22  57.35   56.22  49.20   48.22
LINUX198 49.73 48.74 43.41   42.55  49.73   48.74  43.41   42.55
LINUX197 40.01 39.35 34.17   33.61  40.01   39.35  34.17   33.61
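A small sketch of what the system totals above say for chargeback (Python, illustrative names):

# System totals from ESAUSP5 for the 13:05:00 interval (percent of one engine)
thread_time   = 972.1   # traditional CPU time: time dispatched on a thread
mt_equivalent = 830.8   # estimated time if each server had the core to itself

print(f"thread time is {thread_time / mt_equivalent:.2f}x the MT-Equivalent time")
# For a repeatable chargeback model, the MT-Equivalent (~831%) is the better base;
# raw thread time (~972%) varies with how busy the partner thread happens to be.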
In a current analysis of our demonstration workload on our z15, a quick analysis from bottom to top shows the prorated values. From the user chargeback model, there is 25% thread time, 21% "MT-Equivalent" (what the workload would have taken if SMT-2 were not enabled), and then the "prorated" estimate of what the workload actually consumed, about 24%.
Report: ESAUSP5      User SMT CPU Consumption Analys
----------------------------------------------------
         <------CPU Percent Consumed (Total)---->
UserID/
Class    Total  Virt  Total Virtual  Total Virtual
-------- ----- ----- ----- -------  ----- -------
12:00:00 25.14 24.13 21.19   20.34  23.60   22.73
***Key User Analysis ***
TCPIP     0.04  0.03  0.04    0.02   0.04    0.02
TCPIP2    0.17  0.08  0.14    0.07   0.15    0.07
RACFVM    0.00  0.00  0.00    0.00   0.00    0.00
SFSZVPS4  0.04  0.02  0.03    0.02   0.03    0.02
***User Class Analysis***
Servers   0.08  0.06  0.06    0.05   0.07    0.06
Velocity  1.87  1.81  1.63    1.58   1.76    1.70
TEST      0.75  0.62  0.65    0.54   0.65    0.54
Web       0.11  0.10  0.10    0.09   0.09    0.08
REDHAT    0.41  0.40  0.33    0.33   0.35    0.34
SUSE      2.08  2.05  1.70    1.67   1.88    1.86
ORACLE    2.96  2.72  2.50    2.30   2.69    2.48
TheUsrs  16.67 16.26 14.05   13.70  15.92   15.57
***Top User Analysis***
MONGO01  10.29 10.27  8.69    8.68  10.00    9.98
SLES12    4.54  4.53  3.83    3.82   4.40    4.39
S11S2ORA  2.32  2.10  1.96    1.77   2.09    1.90
SLES15    1.82  1.80  1.48    1.46   1.65    1.63
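The same comparison for the z15 interval above, now with the prorated value filled in (Python sketch):

# System totals from the z15 ESAUSP5 report (12:00:00 interval)
thread_time   = 25.14   # raw thread (dispatch) time
mt_equivalent = 21.19   # what the work would have taken without SMT-2
mt_prorated   = 23.60   # estimate of the core cycles actually consumed

print(f"thread time / MT-Equivalent: {thread_time / mt_equivalent:.2f}")
print(f"prorated    / MT-Equivalent: {mt_prorated / mt_equivalent:.2f}")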
For your chargeback model, from a repeatability perspective one could justify using the MT-Equivalent metrics. From a real-consumption perspective, one could justify using the "prorated" CPU time. The numbers for both are there in zVPS.
When looking at real resources consumed: of the 2 IFLs assigned to the LPAR below, 24% core time is assigned, which is 48% thread time; subtracting the idle thread time leaves about 28% thread time actually used. And as shown above, should chargeback be based on cycles, or on instructions consumed? The metrics are available...
Report: ESALPARS     Logical Partition Summary                     Velo
-----------------------------------------------------------------------
         <--------Logical Partition-------> <-Assigned
Virt     CPU                    <%Assigned> <---LPAR--> <-Thread->
Time     Name     Nbr CPUs Type Total  Ovhd Weight Pct  Idle   cnt
-------- -------- --- ---- ---- ----- ----- ------ ---- ------ ---
12:00:00 Totals:  00    7  CP    71.8   0.3    250  100
         Totals:  00   12  IFL   33.3   1.0   1175  100
         VSIVM4   04    2  IFL   24.3   0.5    150 12.8  20.14   2
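The same arithmetic as before, applied to VSIVM4 in the report above (Python sketch):

# VSIVM4: 2 IFLs, SMT-2
core_assigned_pct = 24.3    # %Assigned Total (core time)
thread_idle_pct   = 20.14   # Thread Idle

thread_time_pct = core_assigned_pct * 2              # ~48.6% thread time assigned
threads_used    = thread_time_pct - thread_idle_pct  # ~28.5% thread time actually used
print(f"thread time assigned: {thread_time_pct:.1f}%, actually used: {threads_used:.1f}%")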
There are a lot of new metrics, and a real need to understand how SMT impacts user chargeback and capacity planning. Please provide feedback to Barton on any ideas or information you learn in your endeavors.