The CPU Level

The CPU is the highest level of measurement in z/VM.

Some clarification on CPU naming:

For a presentation about the CPU environment and utilization, see Processor Analysis and Tuning

Helpful system settings:

Settings that are no longer relevant/useful:

Understanding how to view CPU utilization with SMT

  • When SMT is active, there are x vCPUs and x*2 threads. When viewing from a hardware perspective (ESALPARx/ESAUSP5), the numbers shown are the number of vCPUs. When viewing from a z/VM perspective (ESACPUx), the numbers shown are the number of threads.
  • For example, the pictures below show a system with 7 vCPUs and thus 14 threads. ESAUSP5 is showing the percentage of CPU used as 682.9 (out of 700 - for 7 vCPUs) but ESACPUU shows it as 1220 (out of 1400 - for 14 threads).
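The two views can be compared by converting each summed percentage into a share of its own capacity. The following is a minimal sketch using the figures quoted above. The function name is hypothetical and not part of zVPS, and it assumes each vCPU or thread contributes 100% to the total.

```python
def percent_of_capacity(reported_pct: float, units: int) -> float:
    """Convert a summed utilization figure (100% per unit) into a share of total capacity."""
    if units <= 0:
        raise ValueError("units must be positive")
    return reported_pct / (units * 100.0) * 100.0


vcpus = 7
threads = vcpus * 2  # SMT active: two threads per vCPU

# Hardware view (ESAUSP5): 682.9 out of 700
vcpu_view = percent_of_capacity(682.9, vcpus)
# z/VM view (ESACPUU): 1220 out of 1400
thread_view = percent_of_capacity(1220.0, threads)

print(f"vCPU view:   {vcpu_view:.1f}% of capacity")
print(f"thread view: {thread_view:.1f}% of capacity")
```

The two results differ because the views measure different things, so compare like with like when trending.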

    Helpful ESAMON screens/ESAMAP reports (further explained below):

    Using zVPS to find information for solving issues with the CPU utilization:

    Use zVPS real time monitoring and daily reports to see how efficiently the environment is running. What is the total CPU utilization?
    How is that broken down by LPARs/IFLs? What are the users consuming? What else might be happening? Here are some places to start:


    ESAMAIN - System overview information:


  • Processor Utilization Total - This is the same as CPU Busy. This column is highlighted when it passes a certain threshold, which the administrator can change. Since this LPAR/system has 6 CPUs/engines, the total percentage available is 600% (so over 100% is not necessarily a problem). However, other screens will show which engine types are being used and whether they are close to their utilization capacity. It is a good idea to set up thresholds to see when utilization numbers are higher than expected. In this example, with 6 CPUs, 500% might be flagged as yellow and 550% as red, depending on your installation.
  • SMT Prort Ratio now shows on ESAMAIN. This shows the thread to core ratio. A number of ~0.5 is very good. If it starts to climb, SMT may be providing capacity, but also could be impacting response times.
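The threshold idea for Processor Utilization Total can be sketched as a small classifier. This is only an illustration: the function name is hypothetical, and the 500%/550% defaults are the example values mentioned above for a 6-CPU system, not zVPS defaults.

```python
def classify_total_util(total_pct: float, yellow: float = 500.0, red: float = 550.0) -> str:
    """Return 'ok', 'yellow', or 'red' for a summed CPU utilization percentage."""
    if yellow > red:
        raise ValueError("yellow threshold must not exceed red threshold")
    if total_pct >= red:
        return "red"
    if total_pct >= yellow:
        return "yellow"
    return "ok"


# With 6 CPUs (600% available), 120% is well within capacity.
for value in (120.0, 510.0, 560.0):
    print(value, classify_total_util(value))
```

Thresholds should scale with the number of engines; the values here only make sense for the 6-CPU example.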

  • ESACPUU - Shows information for each CPU/engine on the box.


  • CPU Type/ID - This shows the CPUs/engines (IFLs) on the LPAR/system.
  • Total util - This shows the CPU utilization for each of the engines. This can be helpful to see if all engines are relatively equal in utilization. The ESACPUU report will show totals for each 15 minute increment in the day, which is good for trending.
  • Overhd User/Syst - This shows the CPU overhead. This can be attributed to user functions or system functions. If this is high, there could be issues. High User overhead signifies high master processor simulation. Check ESAUSP2 for specific user data that correlates to the problem time.
  • CPU Wait Idle - This shows the amount of time the CPU was idle (no work to run). If this is high, there might be too many vCPUs assigned to this LPAR.
  • CPU Wait Steal - This shows the amount of time the CPU was waiting to be dispatched, also known as suspended: neither running, in a wait state, nor parked.
  • Vertical Park Secs - This shows the number of seconds a CPU was parked. This can be used to determine if the CPU has been parked multiple times, which can cause overhead in the PR/SM hypervisor. Look at ESALPAR to see how the LPAR is defined.
  • Look for large fluctuations in numbers in the other columns.
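Checking whether all engines are relatively equal in utilization can be done by eye, or roughly automated once the per-engine values are extracted. The sketch below uses hypothetical sample data and a hypothetical 20-point tolerance; neither comes from zVPS, so pick a tolerance that reflects what is normal for your system.

```python
from statistics import mean
from typing import Dict, List


def unbalanced_engines(util_by_cpu: Dict[int, float], tolerance_pct: float = 20.0) -> List[int]:
    """Return IDs of engines whose utilization differs from the average by more than the tolerance."""
    if not util_by_cpu:
        return []
    avg = mean(util_by_cpu.values())
    return sorted(cpu for cpu, util in util_by_cpu.items() if abs(util - avg) > tolerance_pct)


# Hypothetical "Total util" values per CPU ID for one interval
sample = {0: 81.0, 1: 79.5, 2: 83.2, 3: 80.1, 4: 35.0, 5: 82.4}
print(unbalanced_engines(sample))
```

An engine that stands out is not necessarily a problem; a parked engine, for example, would be expected to show lower utilization.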

  • ESACPUA - Shows similar information as ESACPUU.


  • CPU Type/ID - This shows the CPUs/engines (IFL) on the LPAR/system.
  • Total util - This shows the CPU utilization for each of the engines. This can be helpful to see if all engines are relatively equal in utilization.
  • Internal Diagnose and User Diag/sec - This shows the number of internal diagnose instructions executed per second and the number of calls to user diagnose codes. If the first diagnose number is over 5000 and the second is over 1000, it is likely caused by a Linux server with too many virtual processors defined. ESAUSRD shows user diagnose calls. ESASRVC shows how many virtual processors are defined to each server.
  • Look for large fluctuations in numbers in the other columns, especially the overhead columns.
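The diagnose-rate rule of thumb described for ESACPUA (internal diagnose over 5000 per second and user diagnose over 1000 per second) can be written as a simple check. The function name is hypothetical; the limits are the ones stated above.

```python
def excessive_diagnose_rate(
    internal_per_sec: float,
    user_per_sec: float,
    internal_limit: float = 5000.0,
    user_limit: float = 1000.0,
) -> bool:
    """True when both diagnose rates exceed the rule-of-thumb limits."""
    return internal_per_sec > internal_limit and user_per_sec > user_limit


# Hypothetical readings: both rates are high, so check ESAUSRD and ESASRVC next.
print(excessive_diagnose_rate(7200.0, 1500.0))
# Only one rate is high, so the rule does not fire.
print(excessive_diagnose_rate(7200.0, 300.0))
```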

  • ESAMFC - Shows processor instruction information. (The Measurement Facility must be turned on in the LPAR to collect the correct records for this screen/report. See Enabling CPUMFC Records.)


  • Processor Rate/Sec Cycles/Instr/Ratio - Shows processor cache effectiveness. The lower the ratio, the more work is being accomplished.
  • Level 1 Cache/Second Instruction Cost/Data Cost - Shows the cost of cache misses.
  • TLB CPU Cost/Cycles Lost - Also shows the cost of cache misses: cycles being used for 'non-work' (such as address translation) or 'idle' due to time lost moving data from a higher level of cache/memory. Watch for changes in each of these numbers, especially if changing parking settings and/or LPAR weighting.
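To illustrate why a lower cycles-per-instruction ratio means more work is being accomplished, here is a small sketch. It assumes the Ratio column is cycles per second divided by instructions per second. The function name and the sample rates are hypothetical and are not taken from an ESAMFC report.

```python
def cycles_per_instruction(cycles_per_sec: float, instructions_per_sec: float) -> float:
    """Average number of processor cycles spent per instruction executed."""
    if instructions_per_sec <= 0:
        raise ValueError("instructions_per_sec must be positive")
    return cycles_per_sec / instructions_per_sec


# Same cycle rate, but more instructions completed in the second case
before = cycles_per_instruction(4.0e9, 1.0e9)
after = cycles_per_instruction(4.0e9, 1.6e9)
print(before, after)
```

At the same cycle rate, the lower ratio corresponds to more instructions completed per second, which is why fewer cache misses show up as a lower ratio.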

  • ESADIAG - Shows Diagnose rates.


  • Diag Code - A diagnose code of 44 shows an older version of Linux that uses spin locks - which can cause a performance issue. It is better to use diagnose code 9C instead. Use ESAUSRD to see what machines are using which diagnose codes.

  • ESAIUER - Shows IUCV errors.


  • IUCV failures - Since many IUCV services run "master only" (meaning they use only the one specified master processor), any IUCV errors can cause performance issues.

  • ESALCK - Shows spin lock activity.


  • CPU% - Locking doesn't tend to have issues. However, if system utilization is high, check for out of control spin locks. If either the exclusive or shared CPU% is over 10, there are spin lock issues. This is most likely an issue that needs to be sent to IBM.

  • ESATOPU - Shows CPU utilization by user - top users first.


  • CPU Time - This shows how much CPU the top user for any given minute is utilizing. This picture shows that ZADMIN was utilizing more of the system than normal from 14:51 to 15:06. On this system, a systems programmer was running a trace on that machine. (This wouldn't happen with this ID on your system, but it shows how one user can cause the CPU utilization to go up tremendously.)
  • This is a fast way to show a possible abusive user.
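The same "top user per minute" view can be approximated from exported per-user samples. This is a rough sketch, not how ESATOPU is implemented: the record layout, the LINUX01 user ID, and the CPU values are hypothetical, and only the ZADMIN ID and the 14:51 time come from the example above.

```python
from typing import Dict, Iterable, Tuple


def top_user_per_interval(samples: Iterable[Tuple[str, str, float]]) -> Dict[str, Tuple[str, float]]:
    """Map each interval to the (user, cpu_pct) pair with the highest CPU use."""
    best: Dict[str, Tuple[str, float]] = {}
    for interval, user, cpu_pct in samples:
        if interval not in best or cpu_pct > best[interval][1]:
            best[interval] = (user, cpu_pct)
    return best


# Hypothetical (time, user ID, CPU %) samples
samples = [
    ("14:50", "LINUX01", 12.0),
    ("14:50", "ZADMIN", 3.0),
    ("14:51", "LINUX01", 11.5),
    ("14:51", "ZADMIN", 92.0),
]
top = top_user_per_interval(samples)
for interval in sorted(top):
    print(interval, top[interval])
```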

  • Conclusions:

    Looking at CPU utilization is one of the quickest ways to find processing issues.

    Just like on the freeway:

    The best thing to do is to know your current environment and what is normal/abnormal. If CPU is not the issue, continue the search in other parts of the system.

