Many installations are asking for information on how to manage performance for SMT-enabled LPARs. I will talk about the parts of performance management affected by SMT: performance analysis, capacity planning, and chargeback.
In an SMT-enabled LPAR, it is important to understand that we now talk in threads and IFLs. From a z/VM perspective, we now see threads: wherever the traditional z/VM term "CPU" is used, one now has to think "thread".
At a high level, the ESALPARS report shows how much IFL time is allocated to each LPAR. When an IFL (core) is assigned to an LPAR under SMT, both threads on that IFL are assigned to the LPAR; while the core is assigned, it is not shared. However, both threads might not be used concurrently, so there is time when both threads are assigned to the LPAR but only one is doing work. That unused time is extra capacity, and it shows up on the ESALPARS report as "thread idle" time: the time during which a core (two threads) is assigned to an LPAR but only one of the threads is actually doing work.
At a high level, the LPAR summary below shows the LPARs and their allocations. Starting from the far right, the "Entitled CPU" is the share of the shared engines that the LPAR is guaranteed by the LPAR weights. The LPAR from which this data comes is the LXB5 LPAR, which is guaranteed 10.9 cores. In this case, this LPAR was assigned a physical core 828% of the time, meaning 8.28 cores were assigned on average during the one-minute reporting interval.
When a core is assigned to an LPAR in an SMT-2 environment, both threads are part of that assignment. Even though both threads are assigned, that does not mean both are utilized. The CP monitor provides another metric, the "idle thread" metric. If this LPAR is assigned 828%, subtracting the LPAR overhead of 12% (which runs non-SMT) leaves 816% "core assignment" for real work, which corresponds to 1632% "thread assignment". Of that 1632%, 594% was thread idle in this case: time when one thread was being utilized and the other was idle, but the core was assigned.
Report: ESALPARS     Logical Partition Summary
Monitor initialized: 07/07/15 at 13:03
---------------------------------------------------------------------------
         <--------Logical Partition-------> <-Assigned            Entitled
Virt     CPU                    <%Assigned> <---LPAR--> <-Thread->  CPU Cnt
Time     Name     Nbr CPUs Type Total  Ovhd Weight Pct  Idle   cnt
-------- -------- --- ---- ---- ----- ----- ------ ---- ------ --- -------
13:05:00 Totals:  00   71  IFL   1055  18.7   1001  100
         LXB5     05   20  IFL  828.6  12.6    475 47.5  594.5   2    10.91
         LXBX     0F    1  IFL    0.5   0.1     50  5.0      0   1     1.15
         LXB2     02   12  IFL   1201   0.1    Ded 21.8      0   1        0
         LXB3     03   20  IFL   2000   0.1    Ded 36.4      0   1        0
         LXB8     08   10  IFL  224.7   5.7    475 47.5      0   1    10.91
         TS02     0E    8  IFL    1.3   0.3      1  0.1      0   1     0.02

Totals by Processor type:
     <---------CPU-------> <-Shared Processor busy->
Type Count Ded shared  Total  Logical Ovhd Mgmt
---- ----- --- ------ ------ -------- ---- ----
IFL     55  32     23 1073.7   1036.5 18.7 18.4
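To make the arithmetic concrete, here is a minimal sketch in Python (my own illustrative variable names, not zVPS field names), using the LXB5 numbers from the report above:

# Thread idle arithmetic for LPAR LXB5, from the ESALPARS report above.
# All values are percentages: 100% = one core (or one thread) for the interval.
assigned_total = 828.6   # %Assigned Total: core time assigned to the LPAR
assigned_ovhd  = 12.6    # %Assigned Ovhd:  LPAR overhead, which runs non-SMT
thread_idle    = 594.5   # Thread Idle: core assigned, one thread working, one idle

core_for_work   = assigned_total - assigned_ovhd   # ~816% core assignment
thread_capacity = core_for_work * 2                # ~1632% thread assignment (SMT-2)
threads_busy    = thread_capacity - thread_idle    # roughly the thread-busy time
                                                   # z/VM reports (~1019%, ESACPUU below)
print(f"core assignment for work: {core_for_work:.1f}%")
print(f"thread assignment       : {thread_capacity:.1f}%")
print(f"threads actually busy   : {threads_busy:.1f}%")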
From a capacity planning perspective, there is unused capacity. This happens to be a z13, which had inherent bottlenecks in the TLB that were corrected on the z14 and z15. Understanding this requires hardware knowledge and use of the hardware metrics from the PRCMFC data. Many z13 installations saw a "less than zero" increase in capacity from enabling SMT on the z13. This is only detectable by evaluating the MFC data from production workloads.
The objective of SMT is to better utilize the processor core. The "z" processors have a very sophisticated cache hierarchy to increase the amount of time a core can actually execute instructions. Any instruction execution must have the instruction and all related data in the level 1 cache. Any time there is a cache miss, the core sits idle while the data comes from level 2, level 3, level 4, level 4 on a remote book, or from memory. Each of these sources requires an increasing number of cycles to load the data into the level 1 cache. If, during these cache loads, the core can process instructions from another thread, then core utilization should go up.
There are then two measures for evaluating whether capacity has increased when SMT is enabled.
Without proper measurement capability it is very difficult to know whether capacity has increased or not. One installation reported that its Linux admins think their performance is better - that method of analysis is not scientific. From a capacity planning perspective, look at the instructions per second per core and the cycles per instruction to know whether more work is being processed. If IFL utilization is low, enabling SMT changes very little - SMT is useful from a capacity perspective when IFL utilization is high and more capacity is desired.
Capacity planning becomes more difficult with SMT because there are no longer straight-line capacity growth curves. Multiple metrics are needed, and as CPU utilization grows there will be more contention for cache and TLB, and thus less work done per cycle allocated.
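As a sketch of those two measures (again Python with illustrative names; the sample rates are the system totals from the ESAMFC benchmark report shown later):

# Two capacity measures to watch when SMT is enabled:
#   1) instructions executed per second per core (more is better)
#   2) cycles per instruction, CPI (lower is better)
def capacity_measures(cycles_per_sec, instr_per_sec, cores):
    ips_per_core = instr_per_sec / cores
    cpi = cycles_per_sec / instr_per_sec
    return ips_per_core, cpi

# System totals from the 6-core benchmark interval shown later:
# 25.9G cycles/sec and 10.2G instructions/sec
ips, cpi = capacity_measures(25.9e9, 10.2e9, 6)
print(f"{ips / 1e9:.2f}G instructions/sec/core at a CPI of {cpi:.2f}")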
It is stated by IBM in many places that when running in SMT mode, workloads WILL run slower. With SMT, two workloads now share the same processor core, so at times both workloads will be ready to run but one has to wait. The performance question then becomes the impact of this "core contention".
The IBM monitor provides metrics at the system level and the user level to assist in understanding how SMT impacts the system. There is also the PRCMFC data (mainframe cache statistics) that shows the impact of two threads on the hardware cache. zVPS has been enhanced to utilize and expose these new metrics for every processor from the z196 to the current z15.
For system level performance reporting, it is important to understand that with SMT enabled there are two sets of counters for most metrics. From a z/VM perspective there are twice as many CPUs, all of which have the traditional measurements. But there is still the physical hardware utilization, and as in any performance analysis, utilization of the hardware has an impact on performance and throughput.
In the above case, LPAR LXB5 has 20 physical cores available - and in SMT-2 mode, z/VM will see 40 threads. From an LPAR perspective (ESALPAR below) we see the 20 cores with their assigned percentages and the idle thread time per core.
Report: ESALPAR      Logical Partition Analysis        Velocity Software Corporate  ZMAP 5.1.1 10/29/20 Pg 1257
--------------------------------------------------------------------------------------------------------------
         CEC  <-Logical Partition-> <----------Logical Processor---  <------(percentages)------->        Phys
         Pool VCPU                  <%Assigned>      VCPU Weight/ Total User  Sys   Idle   Stl   Idle  cp1/cp2
Time     CPUs Name     No Name Addr Total Ovhd TYPE  Polar        util  ovrhd ovrhd time   Pct   Time
-------- ---- -------- -- ---- ---- ----- ---- ----  --------    ----- ----- ----- -----  ----  ------ -------
13:05:00  55  LXB5     05 .      0   41.2  0.8  IFL  475 Hor      52.9   1.4   3.0 144.8  2.32   27.81   0 / 0
                                 1   39.6  0.6  IFL  475 Hor      50.3   1.3   2.1 147.9  1.75   26.65   2 / 3
                                 2   34.1  0.6  IFL  475 Hor      41.1   1.1   2.1 157.3  1.63   25.18   4 / 5
                                 3   34.8  0.5  IFL  475 Hor      41.1   0.9   1.7 157.5  1.39   26.68   6 / 7
                                 4   38.4  0.6  IFL  475 Hor      47.3   1.1   1.9 151.0  1.64   27.57   8 / 9
                                 5   43.5  0.6  IFL  475 Hor      55.0   1.2   2.3 143.3  1.66   30.12  10 /11
                                 6   44.1  0.7  IFL  475 Hor      56.5   1.4   2.2 141.6  1.89   29.47  12 /13
                                 7   40.3  0.7  IFL  475 Hor      50.1   1.4   2.3 148.0  1.95   28.37  14 /15
                                 8   44.5  0.5  IFL  475 Hor      53.4   0.8   1.7 145.2  1.36   33.99  16 /17
                                 9   39.2  0.6  IFL  475 Hor      48.1   1.1   1.8 150.3  1.62   28.38  18 /19
                                10    6.4  0.2  IFL  475 Hor       6.4   0.2   0.8 192.9  0.75    5.82  20 /21
                                11    5.8  0.1  IFL  475 Hor       5.8   0.1   0.4 193.8  0.38    5.41  22 /23
                                12   27.9  0.5  IFL  475 Hor      32.3   0.7   1.7 165.4  2.31   21.76  24 /25
                                13   30.4  0.6  IFL  475 Hor      36.0   0.9   2.3 161.2  2.78   22.70  26 /27
                                14   62.6  0.8  IFL  475 Hor      79.0   1.3   3.1 117.6  3.42   43.40  28 /29
                                15   52.7  0.9  IFL  475 Hor      65.1   1.3   3.4 131.4  3.47   37.30  30 /31
                                16   49.9  0.8  IFL  475 Hor      65.0   0.9   3.2 131.8  3.24   31.95  32 /33
                                17   64.6  0.8  IFL  475 Hor      75.4   0.9   3.2 121.3  3.28   50.80  34 /35
                                18   61.4  0.9  IFL  475 Hor      76.8   1.6   3.6 119.3  3.91   42.56  36 /37
                                19   67.1  1.0  IFL  475 Hor      81.9   1.2   3.9 113.9  4.11   48.57  38 /39
                                    ----- ----                   ----- ----- ----- -----  ----  ------ -------
                          LPAR      828.6 12.6                    1020  20.9  46.8  2936  44.9   594.5   0 / 0
And then from the z/VM side, we can look at the system thread by thread (ESACPUU):
Report: ESACPUU      CPU Utilization Report                        Vel
----------------------------------------------------------------------
         <----Load---->           <--------CPU (percentages)-------->
         <-Users-> Tran           Total Emul  User  Sys   Idle  Steal
Time     Actv In Q /sec CPU Type  util  time  ovrhd ovrhd time  time
-------- ---- ---- ---- --- ----  ----- ----- ----- ----- ----- -----
13:05:00   97  218  3.1  0  IFL    26.4  24.2   0.7   1.5  72.4   1.2
                         1  IFL    25.4  23.7   0.6   1.1  73.5   1.2
                         2  IFL    24.5  22.8   0.7   1.1  74.6   0.9
                         3  IFL    25.8  24.1   0.6   1.0  73.3   0.9
                         4  IFL    20.0  18.3   0.6   1.1  79.2   0.8
                         5  IFL    21.1  19.6   0.5   1.0  78.1   0.8
                         6  IFL    20.8  19.5   0.5   0.9  78.5   0.7
                         7  IFL    20.3  19.0   0.5   0.8  79.0   0.7
                         8  IFL    23.8  22.3   0.5   1.0  75.4   0.8
                         9  IFL    23.5  22.0   0.6   1.0  75.6   0.8
                        10  IFL    26.0  24.0   0.6   1.4  73.2   0.8
                        11  IFL    29.0  27.6   0.6   0.9  70.1   0.8
                        12  IFL    27.3  25.4   0.7   1.1  71.8   0.9
                        13  IFL    29.2  27.4   0.7   1.1  69.8   1.0
                        14  IFL    25.5  23.7   0.7   1.2  73.5   1.0
                        15  IFL    24.5  22.8   0.7   1.1  74.5   1.0
                        16  IFL    22.8  21.4   0.4   1.0  76.5   0.7
                        17  IFL    30.6  29.4   0.4   0.8  68.7   0.7
                        18  IFL    23.3  21.8   0.6   0.9  75.9   0.8
                        19  IFL    24.8  23.5   0.5   0.8  74.4   0.8
                        20  IFL     3.3   2.6   0.1   0.5  96.4   0.4
                        21  IFL     3.1   2.7   0.1   0.3  96.5   0.4
                        22  IFL     2.1   1.9   0.1   0.2  97.7   0.2
                        23  IFL     3.7   3.4   0.1   0.2  96.1   0.2
                        24  IFL    16.0  14.8   0.3   0.8  82.9   1.1
                        25  IFL    16.4  15.1   0.4   0.9  82.5   1.2
                        26  IFL    16.9  15.2   0.5   1.2  81.7   1.4
                        27  IFL    19.1  17.5   0.5   1.1  79.5   1.4
                        28  IFL    36.1  33.7   0.7   1.8  62.2   1.7
                        ....
                        38  IFL    36.7  33.9   0.6   2.2  61.2   2.0
                        39  IFL    45.2  43.0   0.5   1.6  52.7   2.1
                                  ----- ----- ----- ----- ----- -----
                  System:          1019 951.3  20.8  46.4  2937  44.9
Now with 816% core assigned time (828% minus the 12% overhead), z/VM sees 1019% "total thread" busy time. So with the 20 cores, there are two different utilization numbers: core busy, 828% out of 20 cores (2000%), and thread utilization, 1019% out of 40 threads (4000%). Both are important from a performance analysis perspective.
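A small sketch of the two utilization views (illustrative Python, using the numbers above):

# Two utilization numbers for the same interval on LPAR LXB5
cores   = 20          # physical cores available to the LPAR
threads = cores * 2   # SMT-2: z/VM sees 40 "CPUs" (threads)

core_assigned_pct = 828.6   # ESALPARS %Assigned Total (includes 12.6% LPAR overhead)
thread_busy_pct   = 1019    # ESACPUU total across all 40 threads

print(f"core utilization  : {core_assigned_pct / (cores * 100):.1%}")
print(f"thread utilization: {thread_busy_pct / (threads * 100):.1%}")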
One of the most interesting scenarios showing the value of the mainframe cache data is the following. The ESAMFC data below comes from an IBM benchmark without SMT. This is a z13 (the 5 GHz processor speed gives that away) with 6 processors in the LPAR. The report shows the cycles used by the workload on each processor and the number of instructions executed by each processor, all as per-second rates. At the tail end of the benchmark, processor utilization drops from 92% to 67% as some of the drivers complete. But please note: the instruction rate goes up!
Even though the utilization dropped, the instructions executed went up: as the remaining drivers stopped fighting for the processor cache, cache residency greatly improved. The last metric is the important one - cycles per instruction. If the processor cache is overloaded, cycles are wasted loading data into the level 1 cache. As contention for the L1 cache drops, so do the cycles used per instruction. As a result, more instructions are executed using much less CPU.
Report: ESAMFC       MainFrame Cache Analysis Rep
-------------------------------------------------
         <-------Processor------>  Speed/ <-Rate/Sec->
Time     CPU Totl User             Hertz  Cycles Instr Ratio
-------- --- ---- ----             -----  ------ ----- -----
14:05:32  0  92.9 64.6             5000M   4642M 1818M 2.554
          1  92.7 64.5             5000M   4630M 1817M 2.548
          2  93.0 64.7             5000M   4646M 1827M 2.544
          3  93.1 64.9             5000M   4654M 1831M 2.541
          4  92.9 64.8             5000M   4641M 1836M 2.528
          5  92.6 64.6             5000M   4630M 1826M 2.536
             ---- ----             -----  ------ ----- -----
System:       557  388             5000M   25.9G 10.2G 2.542
-------------------------------------------------
14:06:02  0  67.7 50.9             5000M   3389M 2052M 1.652
          1  67.8 51.4             5000M   3389M 2111M 1.605
          2  69.0 52.4             5000M   3450M 2150M 1.605
          3  67.2 50.6             5000M   3359M 2018M 1.664
          4  60.8 44.5             5000M   3042M 1625M 1.872
          5  70.1 53.8             5000M   3506M 2325M 1.508
             ---- ----             -----  ------ ----- -----
System:       403  304             5000M   18.8G 11.4G 1.640
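A quick sketch of the cycles-per-instruction change between the two intervals above (Python, using the system totals from the report):

# System totals from the ESAMFC benchmark report: (cycles/sec, instructions/sec)
busy_phase = (25.9e9, 10.2e9)   # 14:05:32, ~93% busy per processor
tail_phase = (18.8e9, 11.4e9)   # 14:06:02, ~67% busy per processor

for label, (cycles, instr) in (("busy", busy_phase), ("tail", tail_phase)):
    print(f"{label}: {instr / 1e9:.1f}G instructions/sec at CPI {cycles / instr:.2f}")
# Fewer cycles are consumed, yet MORE instructions are executed, because with
# less L1 contention fewer cycles are wasted reloading the cache.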
A typical production workload with SMT enabled shows 8 threads with a respectable average cycles per instruction (CPI) ratio of 1.7, at about 50% thread utilization. The question for the capacity planner is what happens to the CPI when core utilization goes up. If the CPI rises significantly, work is taking much more time (and many more cycles) to execute, and the system capacity available is much less than it appears.
Report: ESAMFC       MainFrame Cache Magnitudes
------------------------------------------------
         <-------Processor------>  Speed/ <-Rate/Sec->
Time     CPU Totl User             Hertz  Cycles Instr Ratio
-------- --- ---- ----             -----  ------ ----- -----
09:01:00  0  47.0 45.9             5000M   2290M 1335M 1.716
          1  50.0 48.9             5000M   2439M 1480M 1.648
          2  45.5 44.4             5000M   2219M 1329M 1.669
          3  47.3 46.1             5000M   2313M 1331M 1.738
          4  42.5 41.0             5000M   2078M 1164M 1.785
          5  53.6 52.7             5000M   2623M 1750M 1.499
          6  44.3 43.3             5000M   2163M 1179M 1.834
          7  56.3 55.3             5000M   2758M 1665M 1.657
             ---- ----             -----  ------ ----- -----
System:       386  378             5000M   17.6G 10.5G 1.681
In this case, 17B cycles per second are being utilized. The L1 cache is broken out into instruction cache and data cache. Of the 17B cycles consumed, 2.3B are used for instruction cache loads and another 4.2B for data cache loads. Thus, of the 17B cycles per second used, only 11B are used for executing instructions.
Report: ESAMFC       MainFrame Cache Magnitudes        Velocity Software Corpor
------------------------------------------------------------------------
         <-------Processor------>  Speed/ <-Rate/Sec->        <-Instruction-> <----Data---->
Time     CPU Totl User             Hertz  Cycles Instr Ratio  Writes   Cost   Writes   Cost
-------- --- ---- ----             -----  ------ ----- -----  ------   ----   ------   ----
09:01:00  0  47.0 45.9             5000M   2290M 1335M 1.716     13M   285M    8771K   470M
          1  50.0 48.9             5000M   2439M 1480M 1.648     13M   287M    9592K   564M
          2  45.5 44.4             5000M   2219M 1329M 1.669     13M   285M    8207K   455M
          3  47.3 46.1             5000M   2313M 1331M 1.738     13M   289M    9584K   568M
          4  42.5 41.0             5000M   2078M 1164M 1.785     11M   295M    7381K   447M
          5  53.6 52.7             5000M   2623M 1750M 1.499     14M   283M      11M   566M
          6  44.3 43.3             5000M   2163M 1179M 1.834     12M   309M    9235K   455M
          7  56.3 55.3             5000M   2758M 1665M 1.657     14M   320M      15M   685M
             ---- ----             -----  ------ ----- -----  ------   ----   ------   ----
System:       386  378             5000M   17.6G 10.5G 1.681    102M  2353M      79M  4210M

But it gets worse. There is also the cost of DAT (Dynamic Address Translation). Each reference to an address must have a valid translated address in the TLB (Translation Lookaside Buffer). In this installation's case, of the 17B cycles used, 6B were used for loading the cache, and the report below shows that another 3.6B cycles are used for address translation. In this case, about 19% of the cycles utilized go to address translation. This cost also goes up as the core becomes more utilized and there are more cache misses.
Report: ESAMFC       MainFrame Cache Magnitudes        Velocity Software Corpor
-------------------------------------------------------------
                           <-Translation Lookaside buffer(TLB)->
                        .                            CPU   Cycles
Time     CPU Totl User  .  Instr  Data  Instr  Data  Cost  Lost
-------- --- ---- ----  .  ----- ----- ------ ----- ----- -----
09:01:00  0  47.0 45.9  .     87   517  1832K  539K 19.13  438M
          1  50.0 48.9  .    109   506  1471K  525K 17.48  426M
          2  45.5 44.4  .    127   470  1258K  542K 18.66  414M
          3  47.3 46.1  .     81   522  1980K  560K 19.55  452M
          4  42.5 41.0  .    115   524  1363K  496K 20.06  417M
          5  53.6 52.7  .     47   660  2949K  466K 17.01  446M
          6  44.3 43.3  .     82   541  2050K  538K 21.27  460M
          7  56.3 55.3  .     34   728  4796K  538K 20.10  554M
             ---- ----  .  ----- ----- ------ ----- ----- -----
System:       386  378  .     72   557    18M 4205K 19.11 3609M

At this point, anyone having to perform capacity planning must realize that there is a lot of guesswork in future capacity planning models.
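To put rough numbers on that, here is a minimal cycle-accounting sketch (Python, using the system totals from the two reports above; it assumes the cache-load and translation costs do not overlap):

total_cycles = 17.6e9    # cycles/sec consumed by the LPAR
icache_cost  = 2.353e9   # cycles/sec loading the L1 instruction cache
dcache_cost  = 4.210e9   # cycles/sec loading the L1 data cache
tlb_cost     = 3.609e9   # "CPU Cycles Lost" to address translation

after_cache = total_cycles - icache_cost - dcache_cost
after_tlb   = after_cache - tlb_cost   # assumes the costs do not overlap
print(f"cycles left after L1 cache loads: {after_cache / 1e9:.1f}B")      # ~11B, as stated above
print(f"translation share of all cycles : {tlb_cost / total_cycles:.0%}") # about 20%, in line with the Cost column
print(f"cycles left for instruction work: {after_tlb / 1e9:.1f}B")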
The traditional methods of chargeback are based on CPU seconds consumed. CPU consumed was based on the time the virtual machine was actually dispatched on a CPU, and that number was very repeatable. In the SMT world, that number is no longer repeatable and will be larger for a given workload. It is larger because, even though the virtual machine is dispatched on a thread of a core for a period of time, some of that time the core is also being utilized by the other thread, increasing the time on the thread without necessarily changing the cycles required for the unit of work.
The IBM monitor facility attempts to alleviate this problem. The traditional metrics are still reported, along with two additional sets of metrics, one of which was not "filled in" at first. The new metrics are "MT-Equivalent", what the server would have used if running alone on the core, and "MT-Prorated", which attempts to charge for the cycles actually consumed.
The example we are working with had 1019% of the threads busy, of which, from the CPU performance data, 46% is system overhead. That leaves, from the CPU perspective, 972% that is chargeable to real users. But wait - we only have 816% of core assignment that we should be charging for, and some of that was idle.
Analyzing the user workload by the traditional CPU time - now really thought of as "thread time" - the capture ratio is 100%: we know exactly which virtual machine to charge for the 972% of thread time. As a chargeback model, that validates the data. But realistically, this measure is the time a virtual machine was dispatched on a thread: very accurate, yet less useful for chargeback because it is not repeatable, varying with the workload and with L1 cache contention.
The next set of metrics, "MT-Equivalent", is noticeably smaller and represents the time the workload would have used if SMT were disabled - about 830%, much closer to what should be charged.
In the early (z13) days, a third set of metrics was provided, but it was always zeros. The "MT-Prorated" metric appears to be a "best guess" of the cycles actually consumed, and in early SMT days it was not available. To finish this scenario, the MT-Equivalent values would be the best metrics to use for chargeback.
Report: ESAUSP5      User SMT CPU Consumption Analysis              Ve
---------------------------------------------------------------------
         <------CPU Percent Consumed (Total)---->  <-CPU PCT Prima
UserID/
Class    Total  Virt  Total Virtual  Total Virtual  Total Virtual
-------- ----- ----- ----- -------  ----- -------  ----- -------
13:05:00 972.1 951.3 830.8   813.0  972.1   951.3  830.8   813.0
***Key User Analysis ***
TCPIP     0.34  0.11  0.29    0.09   0.34    0.11   0.29    0.09
***User Class Analysis***
Servers   0.41  0.15  0.35    0.13   0.41    0.15   0.35    0.13
ZVPS      0.70  0.60  0.61    0.53   0.70    0.60   0.61    0.53
TheUsers 971.0 950.6 829.9   812.3  971.0   950.6  829.9   812.3
***Top User Analysis***
LINUX195 202.8 202.3 176.6   176.1  202.8   202.3  176.6   176.1
LINUX203 77.36 76.77 64.13   63.63  77.36   76.77  64.13   63.63
LINUX199 67.44 66.75 56.32   55.73  67.44   66.75  56.32   55.73
LINUX204 57.35 56.22 49.20   48.22  57.35   56.22  49.20   48.22
LINUX198 49.73 48.74 43.41   42.55  49.73   48.74  43.41   42.55
LINUX197 40.01 39.35 34.17   33.61  40.01   39.35  34.17   33.61
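A small sketch of what the system totals above say for chargeback (Python, illustrative names):

# System totals from ESAUSP5 for the 13:05:00 interval (percent of one engine)
thread_time   = 972.1   # traditional CPU time: time dispatched on a thread
mt_equivalent = 830.8   # estimated time if each server had the core to itself

print(f"thread time is {thread_time / mt_equivalent:.2f}x the MT-Equivalent time")
# For a repeatable chargeback model, the MT-Equivalent (~831%) is the better base;
# raw thread time (~972%) varies with how busy the partner thread happens to be.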
In a current analysis of our demonstration workload on our z15, a quick analysis from bottom to top shows the prorated values. From the user chargeback model, there is 25% thread time, 21% "MT-Equivalent" (what the workload would have taken if SMT-2 were not enabled), and then the "prorated" estimate of what the workload actually consumed, about 24%.
Report: ESAUSP5      User SMT CPU Consumption Analys
----------------------------------------------------
         <------CPU Percent Consumed (Total)---->
UserID/
Class    Total  Virt  Total Virtual  Total Virtual
-------- ----- ----- ----- -------  ----- -------
12:00:00 25.14 24.13 21.19   20.34  23.60   22.73
***Key User Analysis ***
TCPIP     0.04  0.03  0.04    0.02   0.04    0.02
TCPIP2    0.17  0.08  0.14    0.07   0.15    0.07
RACFVM    0.00  0.00  0.00    0.00   0.00    0.00
SFSZVPS4  0.04  0.02  0.03    0.02   0.03    0.02
***User Class Analysis***
Servers   0.08  0.06  0.06    0.05   0.07    0.06
Velocity  1.87  1.81  1.63    1.58   1.76    1.70
TEST      0.75  0.62  0.65    0.54   0.65    0.54
Web       0.11  0.10  0.10    0.09   0.09    0.08
REDHAT    0.41  0.40  0.33    0.33   0.35    0.34
SUSE      2.08  2.05  1.70    1.67   1.88    1.86
ORACLE    2.96  2.72  2.50    2.30   2.69    2.48
TheUsrs  16.67 16.26 14.05   13.70  15.92   15.57
***Top User Analysis***
MONGO01  10.29 10.27  8.69    8.68  10.00    9.98
SLES12    4.54  4.53  3.83    3.82   4.40    4.39
S11S2ORA  2.32  2.10  1.96    1.77   2.09    1.90
SLES15    1.82  1.80  1.48    1.46   1.65    1.63
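The same comparison for the z15 interval above, now with the prorated value filled in (Python sketch):

# System totals from the z15 ESAUSP5 report (12:00:00 interval)
thread_time   = 25.14   # raw thread (dispatch) time
mt_equivalent = 21.19   # what the work would have taken without SMT-2
mt_prorated   = 23.60   # estimate of the core cycles actually consumed

print(f"thread time / MT-Equivalent: {thread_time / mt_equivalent:.2f}")
print(f"prorated    / MT-Equivalent: {mt_prorated / mt_equivalent:.2f}")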
For your chargeback model, from a repeatability perspective one could justify using the MT-Equivalent metrics. From a real-consumption perspective, one could justify using the "prorated" CPU time. The numbers for both are there in zVPS.
When looking at real resources consumed: of the 2 IFLs assigned to the LPAR below, 24% core time is assigned, which is 48% thread time; subtracting the idle thread time leaves about 28% thread time actually used. And as shown above, should chargeback be based on cycles, or on instructions consumed? The metrics are available...
Report: ESALPARS     Logical Partition Summary                     Velo
-----------------------------------------------------------------------
         <--------Logical Partition-------> <-Assigned
Virt     CPU                    <%Assigned> <---LPAR--> <-Thread->
Time     Name     Nbr CPUs Type Total  Ovhd Weight Pct  Idle   cnt
-------- -------- --- ---- ---- ----- ----- ------ ---- ------ ---
12:00:00 Totals:  00    7  CP    71.8   0.3    250  100
         Totals:  00   12  IFL   33.3   1.0   1175  100
         VSIVM4   04    2  IFL   24.3   0.5    150 12.8  20.14   2
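The same arithmetic as before, applied to VSIVM4 in the report above (Python sketch):

# VSIVM4: 2 IFLs, SMT-2
core_assigned_pct = 24.3    # %Assigned Total (core time)
thread_idle_pct   = 20.14   # Thread Idle

thread_time_pct = core_assigned_pct * 2              # ~48.6% thread time assigned
threads_used    = thread_time_pct - thread_idle_pct  # ~28.5% thread time actually used
print(f"thread time assigned: {thread_time_pct:.1f}%, actually used: {threads_used:.1f}%")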
There are a lot of new metrics, and a real need to understand how SMT impacts user chargeback and capacity planning. Please provide feedback to Barton on any ideas or information you learn in your endeavors.