System Utilization Analysis

System Utilization

The first step in determining system utilization is to review ESAMAIN (screen) and ESASSUM (report). The screen shows current utilization; the report covers a full day, so multiple days can be reviewed to track trends. The ESAOPER screen can also show notable events and obvious errors that can affect performance.

Helpful ESAMON screens/ESAMAP reports:

ESAMAIN - System overview - shows current total CPU processor utilization
ESASSUM (zMAP report) - Subsystem activity - shows the main system overview for the day
ESAOPER - Operator/System Log - shows the log of system events

ESAMAIN - This shows the main system overview.

Users - This shows the users/servers currently logged on and active/in queue. Sometimes many users/servers, or large users/servers, log on at the same time. Depending on their size, this can disrupt the system.

Processor Utilization - This shows how active the system currently is. Look for spikes or dramatic changes, especially at or slightly before the time the problem started. Also compare the processor utilization to the total number of processors (ESAHDR). The figure is summed across processors, so with 6 processors a value of 600 is 100% utilization.
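The comparison against the processor count can be sketched as a small calculation. This is only an illustration of the arithmetic in the text, assuming each processor contributes up to 100 to the reported figure; the function name is hypothetical and is not part of ESAMON.

```python
def cpu_utilization_percent(total_utilization: float, processor_count: int) -> float:
    """Convert a summed utilization figure to a percentage of total capacity.

    Assumption: each processor contributes up to 100 to the summed figure,
    as in the text's example (6 processors, 600 = 100%).
    """
    if processor_count <= 0:
        raise ValueError("processor_count must be positive")
    capacity = processor_count * 100.0  # maximum possible summed figure
    return total_utilization / capacity * 100.0


# The example from the text: 6 processors showing 600 is fully utilized.
print(cpu_utilization_percent(600, 6))  # 100.0
# The same 6 processors showing 450 are at three quarters of capacity.
print(cpu_utilization_percent(450, 6))  # 75.0
```

The processor count would be read from ESAHDR and the utilization figure from ESAMAIN.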

Storage - Watch for large fluctuations in active resident storage, which can indicate changing workloads or large user/server logons that can cause issues. Also compare the active storage to total storage (ESAHDR) to determine the percentage in use.

Paging (to) DASD - Watch for large paging numbers or fluctuations, which indicate a shortage of memory. If there are issues, go to the Paging screens.

I/O - Again, watch for large fluctuations or poor response time. If there are issues, go to the Input/Output Subsystem screens for more specific information.

SMT Prort Ratio - When SMT is enabled, this is the thread-to-core ratio. A number of 0.5x is excellent: it means the hardware is supporting two threads without loss. Any number under one shows SMT is providing value. Above one, SMT may still be providing capacity but may start to impact performance.
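The reading of the ratio described above can be written down as a simple rule of thumb. This is a sketch only: the function name is hypothetical, and the handling of a value of exactly one is an assumption, since the text covers only values under and above one.

```python
def describe_smt_ratio(ratio: float) -> str:
    """Interpret the SMT thread-to-core ratio using the guidance in the text."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio < 1.0:
        # Under one: SMT is providing value; around 0.5 is the excellent case.
        return "SMT is providing value"
    # Assumption: exactly 1.0 is grouped with the above-one case,
    # because the text does not say how to read the boundary.
    return "SMT may still add capacity but may start to impact performance"


print(describe_smt_ratio(0.52))
print(describe_smt_ratio(1.10))
```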

NOTE: Setting up the parameters in zMON to highlight when utilization has crossed a particular threshold makes it very easy to see at a glance if the system load is higher than normal. In the example above, it was easy to see a big spike in Other Rate I/O. Further investigation with ESADEV2 in Input/Output Subsystem (the non-DASD screen) showed more information about the hardware having the spike. ESAUSR3 (an additional user screen) showed the group/servers that caused the spike. Legitimate testing was being done at the time, but it was very easy to see the issue and chase it down.

ESASSUM - This shows the main system overview for the day.

Processor Utilization - This shows how active the system has been, averaged every 15 minutes, for the full day. This is helpful for finding when a problem started for better problem determination and trending.

Storage - This shows the available storage (memory), averaged every 15 minutes, for the full day. Are there any spikes?

Paging - This shows the amount of system paging, averaged every 15 minutes, for the full day. Are there any spikes?

I/O - This shows the amount of DASD activity, averaged every 15 minutes, for the full day. Are there any spikes?
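Looking for spikes in a day of 15-minute averages can also be done mechanically once the values have been copied out of the report. The sketch below flags intervals that exceed a multiple of the day's median. The input layout (a list of time label and value pairs), the function name, and the factor of 2.0 are all assumptions chosen for illustration; they are not taken from ESASSUM or any zMAP setting.

```python
from statistics import median


def find_spikes(samples, factor=2.0):
    """Return the (label, value) pairs whose value exceeds factor * median.

    samples: list of (interval_label, value) pairs, one per 15-minute interval.
    factor:  hypothetical threshold multiplier; tune it to the workload.
    """
    values = [value for _, value in samples]
    if not values:
        return []
    baseline = median(values)  # the median is not skewed by the spikes themselves
    return [(label, value) for label, value in samples if value > factor * baseline]


# Hypothetical processor utilization averages for five intervals.
day = [("09:00", 40.0), ("09:15", 42.0), ("09:30", 41.0),
       ("09:45", 95.0), ("10:00", 43.0)]
print(find_spikes(day))  # [('09:45', 95.0)]
```

The same check applies to the storage, paging, and I/O columns. A flagged interval only marks the time at which to start problem determination; it does not by itself identify the cause.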

ESAOPER - This shows the log of system events.

Many important things can be seen in the Operator log:

Obvious errors - such as a DASD going offline
SHARE settings changing
CPU Parking/Unparking
CMM values changing

Conclusions:

If the processor utilization numbers have jumped dramatically, there is a reason. The screens and report above show, at a given moment or over time, whether system utilization has been higher than normal. If there is an actual problem, follow the System Problem flow to see what else could be the cause. If this is a trend, more hardware may be needed, or there may be something else to find.