CPU Subsystem Analysis

The CPU is the highest level of measurement in z/VM.

Some clarification on CPU naming:

Presentations about the CPU environment and utilization:
Processor Configuration and Analysis Intro
Processor Advanced Topics

Helpful system settings:

Settings that are no longer relevant/useful:

Understanding how to view CPU utilization with SMT

  • When SMT is active, there are x vCPUs and x*2 threads. If viewing from a hardware perspective (ESALPARx/ESAUSP5), the numbers shown are the number of vCPUs. If viewing from a z/VM perspective (ESACPUx), the numbers shown are the number of threads.
  • For example, the pictures below show a system with 7 vCPUs and thus 14 threads. ESAUSP5 is showing the percentage of CPU used as 682.9 (out of 700 - for 7 vCPUs) but ESACPUU shows it as 1220 (out of 1400 - for 14 threads).
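To compare the two views, it can help to express each figure as a share of its own capacity. The following is a minimal sketch and not part of zVPS; the function name `percent_of_capacity` is hypothetical, and the input numbers are the ones quoted in the example above.

```python
def percent_of_capacity(used_percent: float, units: int) -> float:
    """Return utilization as a share of total capacity (0-100).

    used_percent: summed utilization as reported (for example 682.9)
    units: number of vCPUs (hardware view) or threads (z/VM view)
    """
    if units <= 0:
        raise ValueError("units must be positive")
    # Each vCPU or thread contributes 100% to the total capacity.
    return used_percent / (units * 100.0) * 100.0


# Hardware view (ESAUSP5): 682.9 out of 700 for 7 vCPUs
hardware_view = percent_of_capacity(682.9, 7)
# z/VM view (ESACPUU): 1220 out of 1400 for 14 threads
zvm_view = percent_of_capacity(1220, 14)

print(f"hardware view: {hardware_view:.1f}% of capacity")
print(f"z/VM view:     {zvm_view:.1f}% of capacity")
```

With these inputs the hardware view works out to roughly 97.6% of capacity and the z/VM view to roughly 87.1%, which shows why the two screens should not be compared without first checking which unit each one counts.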

    Helpful ESAMON screens/ESAMAP reports:

    Using zVPS to find information for solving issues with the CPU utilization:

    Use zVPS real time monitoring and daily reports to see how efficiently the environment is running. What is the total CPU utilization?
    How is that broken down by LPARs/IFLs? What are the users consuming? What else might be happening? Here are some places to start:


    ESAMAIN - System overview information:

  • Processor Utilization Total - This is the same as CPU Busy. The column is highlighted when it passes a threshold, which the administrator can change. Since this LPAR/system has 6 CPUs/engines, the total percentage available is 600%, so a value over 100% is not necessarily a problem. Other screens show which engine types are being used and whether they are close to their capacity. It is a good idea to set up thresholds to see when utilization is higher than expected. In this example with 6 CPUs, 500% might be flagged yellow and 550% red, depending on your installation.
  • SMT Prort Ratio now shows on ESAMAIN. This shows the thread-to-core ratio. A value of about 0.5 is very good. If it starts to climb, SMT may be providing capacity but could also be impacting response times.
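The threshold idea can be sketched outside the product as well. The example below is illustrative only: `classify_utilization` is a hypothetical helper, not a zVPS function, and the 500%/550% values are simply the example thresholds mentioned above for a 6-CPU system.

```python
def classify_utilization(total_util: float, cpus: int,
                         yellow: float, red: float) -> str:
    """Classify summed CPU utilization (in percent) as ok, yellow, or red."""
    capacity = cpus * 100.0  # each CPU/engine contributes 100%
    if not 0 < yellow < red <= capacity:
        raise ValueError("expected 0 < yellow < red <= capacity")
    if total_util >= red:
        return "red"
    if total_util >= yellow:
        return "yellow"
    return "ok"


# Example thresholds from the text: 6 CPUs, yellow at 500%, red at 550%
for util in (120.0, 520.0, 560.0):
    print(util, classify_utilization(util, cpus=6, yellow=500.0, red=550.0))
```

Note that 120% is classified as "ok" here: on a 6-CPU system, a total above 100% is well within the 600% available.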

  • ESACPUU - Shows information for each CPU/engine on the box.

  • CPU Type/ID - This shows the CPUs/engines (IFLs) on the LPAR/system.
  • Total util - This shows the CPU utilization for each of the engines. This can be helpful to see if all engines are relatively equal in utilization. The ESACPUU report will show totals for each 15 minute increment in the day, which is good for trending.
  • Overhd User/Syst - This shows the CPU overhead, which can be attributed to user functions or system functions. If this is high, there could be issues. High User overhead signifies high master processor simulation; check ESAUSP2 for specific user data that correlates to the problem time. High Syst overhead (along with low Emulation time) usually means there is a master processor bottleneck; alternatively, check the syscontrol control settings above and set them to 0.
  • CPU Wait Idle - This shows the amount of time the CPU was idle (no work to run). If this is high, there might be too many vCPUs assigned to this LPAR.
  • Page Write - This shows page writes to disk. These are done on the Master processor. Watch for high numbers which can show up as Sim wait on ESAXACT.
  • CPU Wait Steal - This shows the amount of time the CPU was waiting to be dispatched (also known as suspended): neither running, in a wait state, nor parked.
  • Vertical Park Secs - This shows the number of seconds a CPU was parked. This can be used to determine whether the CPU has been parked multiple times, which can cause overhead in the PR/SM hypervisor. Look at ESALPAR to see how the LPAR is defined.
  • Look for large fluctuations in numbers in the other columns.

  • ESACPUA - Shows similar information as ESACPUU.

  • CPU Type/ID - This shows the CPUs/engines (IFL) on the LPAR/system.
  • Total util - This shows the CPU utilization for each of the engines. This can be helpful to see if all engines are relatively equal in utilization.
  • Internal Diagnose and User Diag/sec - This shows the number of internal diagnose instructions executed per second and the number of calls to user diagnose codes. If the first diagnose number is over 5000 and the second is over 1000, it is likely caused by a Linux server with too many virtual processors defined. ESAUSRD shows user diagnose calls. ESASRVC shows how many virtual processors are defined to each server.
  • Look for large fluctuations in numbers in the other columns, especially the overhead columns.
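The diagnose-rate rule of thumb above can be applied to exported interval data. This is a minimal sketch under stated assumptions: the function name and the sample values are hypothetical, and only the 5000 and 1000 limits come from the text.

```python
INTERNAL_DIAG_LIMIT = 5000  # internal diagnose instructions per second
USER_DIAG_LIMIT = 1000      # user diagnose calls per second


def too_many_vcpus_suspected(internal_diag_per_sec: float,
                             user_diag_per_sec: float) -> bool:
    """Return True only when both rates exceed their limits."""
    return (internal_diag_per_sec > INTERNAL_DIAG_LIMIT
            and user_diag_per_sec > USER_DIAG_LIMIT)


# Hypothetical interval samples: (time, internal diag/sec, user diag/sec)
samples = [
    ("09:00", 1200.0, 300.0),
    ("09:15", 6400.0, 1500.0),
    ("09:30", 7000.0, 400.0),
]
flagged = [time for time, internal, user in samples
           if too_many_vcpus_suspected(internal, user)]
print(flagged)
```

Only the 09:15 interval is flagged, because the rule requires both rates to be high at the same time. Flagged intervals are the ones worth following up in ESAUSRD and ESASRVC.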

  • ESADIAG - Shows Diagnose rates.

  • Diag Code - A diagnose code of 44 shows an older version of Linux that uses spin locks - which can cause a performance issue. It is better to use diagnose code 9C instead. However, too many 9C calls can show that a server has too many vCPUs. Use ESAUSRD to see what machines are using which diagnose codes.

  • ESAPLDV - Shows Processor Local Dispatch Vector.

  • If the Master Processor (found in the ESAHDR report) is constrained, it will show up as high Simulation wait on ESAXACT. The ESAPLDVC report shows when a VMDBK is moved "To Master". This happens when the z/VM Dispatcher finds something that needs to run on the Master Processor (examples above). Watch for large fluctuations in these numbers. Also see Master Processor Issue for more information on this issue.

  • ESAIUER - Shows IUCV errors.

  • IUCV failures - Since many IUCV services run "master only" (meaning they use only the one specified master processor), any IUCV errors can cause performance issues.

  • ESALCK - Shows spin lock activity.

  • CPU% - Locking doesn't tend to have issues. However, if system utilization is high, check for out-of-control spin locks. If either the exclusive or shared CPU% is over 10, there are spin lock issues. This is most likely an issue that needs to be sent to IBM.
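As a small illustration of the spin lock rule of thumb, the check below flags an interval when either value exceeds 10. The helper name `spin_lock_issue` and the sample values are hypothetical; only the limit of 10 comes from the text.

```python
SPIN_LOCK_CPU_LIMIT = 10.0  # percent, from the rule of thumb above


def spin_lock_issue(exclusive_cpu_pct: float, shared_cpu_pct: float) -> bool:
    """Return True when either the exclusive or shared CPU% exceeds the limit."""
    return (exclusive_cpu_pct > SPIN_LOCK_CPU_LIMIT
            or shared_cpu_pct > SPIN_LOCK_CPU_LIMIT)


print(spin_lock_issue(2.5, 1.0))   # both values under the limit
print(spin_lock_issue(12.0, 1.0))  # exclusive CPU% over the limit
```

Unlike the diagnose-rate check, a single value over the limit is enough to flag the interval.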

  • ESATOPU - Shows CPU utilization by user - top users first.

  • CPU Time - This shows how much CPU the top user for any given minute is utilizing. This picture shows that ZADMIN was utilizing more of the system than normal from 14:51 to 15:06. On this system, a systems programmer was running a trace on that machine. (This wouldn't happen with this ID on your system, but it shows how one user can cause the CPU utilization to go up tremendously).
  • This is a fast way to show a possible abusive user.

  • Conclusions:

    Looking at CPU utilization is one of the quickest ways to find processing issues.

    Just like on the freeway:

    The best thing to do is to know your current environment and what is normal/abnormal. If CPU is not the issue, continue the search in other parts of the system.

