Performance Methodology

Velocity Software's Performance Methodology

When there is a performance problem, a hierarchical approach works for most problems. Our suggestion is to start at a high level and then, as information is evaluated, go down to the next appropriate level. This is helpful in a mainframe environment. However, Linux measures in 'steal time' and uses a bottom-up analysis rather than top-down. ZVPS provides end-to-end performance management: it can see and report on everything from the hardware down to the running Linux processes in order to find performance issues. This is a summary of things to consider that will be applied to the rest of the Tuning Guide.

Knowing the current environment is critical! Setting up highlighting, alerts and trending helps to pinpoint when/where problems start, but this can only be done correctly when the baseline performance is known. Often customers don't view performance data until a problem occurs, which may be too late. With some basic knowledge of the "normal" performance levels, it is much easier to see when/where an event may have happened.
Note: Performance management means the performance monitor should NEVER be the performance problem and should always be running - not turned off when there is a performance problem!

Important ZVPS system/data information:

Performance/Operations Data

Database (1 minute granularity - on a minute boundary)
Files are on ZWRITE
Files close every 1 hour by default - can be changed to 15 minutes
Data used for realtime zVIEW/3270 screens
Files kept 1-3 days (should keep 3 days worth for problem determination - defined in LOCAL ESAWRITE file)

Capacity/Chargeback Data

Database (15 minutes granularity - on a minute boundary)
Files are on ZMAP
Files close daily at midnight to create daily reports
Data used for trending analysis and daily totals
Files kept 90 days to x years (90 days is the default - can update in RUNAUTO PARMS file)

Data/Report Issues

zVIEW or 3270 screens not working/missing data - Verify the MONITOR is up/active and that ZWRITE is up and has sufficient space.
Daily reports not being produced - Verify that the ZMAP id is not logged on so it can be autologged during midnight processing and that ZWRITE is producing data.
Daily report process abends on ZMAP with insufficient storage - give more storage to ZMAP id, especially if running a large monthly report.
Missing data from zVIEW or zMAP -

Check the DCSS -

'DCSS is full' or 'Event Monitoring Suspended' messages indicate that the DCSS is sized too small.
The DCSS starts at the beginning and stores System data, Scheduler data, Storage data, User data, CPU data and Disk data. If the later parts of the data are missing (for example, DASD information is missing, or the device id and system id are the same on a DASD report), the DCSS may be full (the configuration area is too small).
If the monitor is stopping and starting every minute, the DCSS may be full.
Verify there are not multiple ZMON or MONDCSS DCSSs. This will cause stale data.

If zVIEW is running slowly - check MDC Resident on ESASTR1 to see if there is MDC space available. zVIEW uses MDC and will run slowly if it is all in use.
No data/missing data from Linux -

Verify ZTCP is up
Verify ZTCP is getting enough cycles to run correctly
Check ZTCP directory settings - IUCV ALLOW MSGLIMIT 2048 and IUCV ANY MSGLIMIT 2048
Verify SNMP is up and running correctly on the Linux servers
Verify the DNS is set up correctly
Verify the Velocity MIB is installed on the server (ESALNXUP/ESALNXD) or SMSG ZTCP STATUS nodename

No z/OS data - Verify ZOSMON is up on your z/OS system.
If reports are too big, the following can be used to trim the data being reported by ZMAP:
- In the LOCAL ESAMAP file on the VMSYSVPS:ZVPS.CONFIG disk on ZVPS userid - add HSTSFT_THRESHOLD = 1; - this will reduce the output by not keeping more granular information (it changes the .1% default to 1%).
- The ESAHST1 report can be removed by updating the ESAMAP51 ESAPRINT file (on the same disk).
Screen/Report general information -

If ending in 'C' it contains configuration data.
If ending in '1' it contains configuration data more relevant to performance.
If ending in 'R' it contains raw data (CPU seconds).
If ending in 'P' it contains data with the percent calculated (CPU seconds/time).
On ESALPARS - the first LPAR is the reporting LPAR (where the report was run)
On ESAUSR2/ESAUSP2/ESASTR1 - the unit of measurement can be either pages or megabytes. The default is changing from pages to megabytes. In the ESAPARM file, the parm uspg_byMB = 'x'b can be added to control the unit used on the display/report - where 'x' is 1 for megabytes and 0 for pages.

Planning considerations:

Set up user classes FIRST! This helps immensely to organize users/servers into groups that can be easily recognized and trended. (Instructions can be found in the zWRITE Installation and Operations manual under 'Defining User Classes' on the Customer Documentation Page).
Suggestions:
- Group Test vs Production
- Group application by application (i.e. all Oracle servers in the same class, etc.)
- Group support servers vs production
Take some time to baseline important performance numbers (THIS IS CRITICAL):

Average CPU utilization/DASD response time/storage use/etc.
Peak time of day/high CPU utilization times/etc.
"Important user(s)" trends.

Once baselines are determined - Set up Alerts and Highlighting - See Tuning Tips for Using Alerts
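
As a rough illustration of the baseline step, the sketch below summarizes a set of utilization samples and flags a new value that sits well above normal. The sample values, the function names and the three-standard-deviation rule are assumptions made for this example; they are not zVPS alert definitions.

```python
from statistics import mean, pstdev


def build_baseline(samples):
    """Summarize utilization samples (percent) as mean, peak and spread."""
    return {"mean": mean(samples), "peak": max(samples), "stdev": pstdev(samples)}


def exceeds_baseline(value, baseline, sigmas=3.0):
    """True when value is more than `sigmas` standard deviations above the mean."""
    return value > baseline["mean"] + sigmas * baseline["stdev"]


# Hypothetical one-minute CPU utilization samples, in absolute percent
history = [210.0, 230.0, 220.0, 240.0, 225.0]
baseline = build_baseline(history)

print(baseline)
print(exceeds_baseline(400.0, baseline))  # well above the normal range
print(exceeds_baseline(230.0, baseline))  # within the normal range
```

The same idea applies to DASD response time or storage use: record what is typical first, then choose alert thresholds relative to it.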

Top Down Methodology:

Allocation Settings -

CEC - has some number of engines
LPAR - each LPAR has weights (becomes 'entitlement' divided equally by vCPU)
Virtual Machine - each virtual machine has shares (becomes 'virtual entitlement')
Process - Linux has priority/nice settings (gets virtual machine vCPU share)
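
The allocation chain above can be sketched with simple arithmetic. This is a minimal sketch, assuming that an LPAR's entitlement is its fraction of the total weight multiplied by the number of shared engines, and that the entitlement is split equally across the LPAR's vCPUs. The function names and the numbers are hypothetical.

```python
def lpar_entitlement(weight, total_weight, shared_engines):
    """Engines' worth of capacity an LPAR is entitled to by its weight (assumed formula)."""
    if total_weight <= 0:
        raise ValueError("total_weight must be positive")
    return weight * shared_engines / total_weight


def per_vcpu_entitlement(entitlement, vcpus):
    """Entitlement divided equally across the LPAR's vCPUs."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return entitlement / vcpus


# Hypothetical LPAR: weight 300 out of a total of 1000, on 10 shared engines
entitlement = lpar_entitlement(300, 1000, 10)
share_per_vcpu = per_vcpu_entitlement(entitlement, 6)

print(f"LPAR entitlement: {entitlement:.2f} engines")
print(f"Per vCPU with 6 vCPUs: {share_per_vcpu:.2f} engines")
```

In this example each vCPU is entitled to half an engine, which shows how defining more vCPUs than the entitlement supports thins out what each one is guaranteed.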

Measuring Top Down -

CEC - are the engines highly utilized?
LPAR - do the LPARS have enough entitlement? Are there cycles to spare?
z/VM - do virtual machines have enough/too much SHARE?
Linux - are processes properly niced?

Platform Considerations

z/VM - if a problem is running in a virtual machine, check z/VM first
Application - if not running on a virtual machine, then the application needs investigation

Tuning Considerations

Measure - have a baseline of the system's typical performance
When making a change - change only one thing at a time
Validate the change had the desired effect

Knowing your environment:

There are many factors that go into creating an efficient processing environment. Each environment is unique.
Things to think about:

Multi-process workloads - better response time will come from having two smaller engines (less queueing vs smaller service time).
Single thread workloads - one larger engine (smaller service time) would provide the better environment.
GP vs IFL vs ZIIP engines affect the environment and cost - know what your workload requires, where it runs now and where it can/should run.
LPAR weighting - is set up early and can affect service times - know what your workload requires.
What is the workload - what machines/servers are running? Is everything running efficiently?

What performance problem needs to be solved?

Problems will show up in many ways such as workload timeouts, applications running slowly, workloads/servers waiting or someone notices high steal time.
Many problems are caused by tuning issues such as (addressed on other pages):

LPAR tuning issues such as: (LPAR Tuning)

LPAR weights vs utilization
LPAR vcpu vs Assigned share
LPAR set up inefficiently (extra cycles to spare but not able to be used).
Parking CPUs on an LPAR (CPU/LPAR Parking)

Inefficient SHARE settings (Setting SHARE values for Virtual Machines) (ESAUSRC)
Inefficient other settings:

Operating on the wrong engine type for the workload (GP engine vs IFL vs ZIIP). (ESAHDR)
HiperPAV turned off for paging. (ESADSD2)
No swap disk for Linux servers. (ESAUCD2)
SMT turned on when not needed. (ESALPAR/ESALPARS/ESACPUU)
Too many virtual CPUs (vCPUs) being allocated but not utilized. This can cause spin lock and/or processor cache issues. (ESALPAR/ESALPARS/ESACPUU)

Workload issues such as:

Synchronized Cron jobs - 100 processes over 100 servers
Spin locks (DIAG 44 vs DIAG 9C) and/or too many vCPUs
Over usage of the master processor

General information flow:

In general, these are good things to investigate. All of these are explained in more detail on the flow chart page:

ESAHDR - Basic configuration:

Note the master processor for reference. It could be over utilized.
Is the system operating on IFLs? Seems obvious, but the system may have been configured incorrectly.
Is SMT being used? This will make a difference in the performance numbers.

ESAMAIN - Overall system performance:

How does the system look overall?
Did anything change drastically at the time the problem started?
Did processor utilization go up? Did resident storage change? How about I/O?

ESAXACT - Wait states (what resources are in low supply):

SIM shows when the master processor is utilized. If waiting on SIM, there is an overload on the master processor.
CPU shows when CPU is a bottleneck - users are waiting for CPU and not getting it.

ESALPARS - LPAR and IFL utilization:

What is the weight/polarization? Is it sufficient?
How busy is each of the processors? Are they running over 80%?
Is the system overhead time over 10%? That can be an indication of system thrashing.
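
The two thresholds above (processors over 80% busy, system overhead over 10%) can be expressed as a simple check. This is only a sketch: the input dictionary and the function name are hypothetical, and the values would have to be read from the ESALPARS screen or report by hand or by a separate extract.

```python
def check_lpar(processor_busy_pct, overhead_pct, busy_limit=80.0, overhead_limit=10.0):
    """Return a list of findings for busy processors and high system overhead."""
    findings = []
    for cpu_id, busy in sorted(processor_busy_pct.items()):
        if busy > busy_limit:
            findings.append(f"processor {cpu_id} is {busy:.1f}% busy (limit {busy_limit:.0f}%)")
    if overhead_pct > overhead_limit:
        findings.append(f"system overhead is {overhead_pct:.1f}% (limit {overhead_limit:.0f}%)")
    return findings


# Hypothetical values for a three-processor LPAR
busy = {0: 92.5, 1: 61.0, 2: 84.0}
findings = check_lpar(busy, overhead_pct=12.0)

for line in findings:
    print(line)
```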

ESACPUU/ESACPUA - z/VM perspective:

Did the system CPU utilization spike? If multiple engines, are they being utilized evenly?
Shows the system overhead from the z/VM perspective.
Shows where the master processor may be over utilized.

ESAUSRC - z/VM SHARE settings

Are there users/systems with incorrect SHARE settings?
When adding a vCPU to a machine, be sure to adjust the SHARE setting accordingly.
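
One way to reason about the SHARE adjustment is sketched below. It assumes the virtual machine's SHARE is spread evenly across its vCPUs, so keeping the per-vCPU share constant means scaling the SHARE by the same factor as the vCPU count. The function names and values are hypothetical, and the right SHARE for a given server still depends on the workload.

```python
def share_per_vcpu(share, vcpus):
    """Share each vCPU receives, assuming an even split across vCPUs."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return share / vcpus


def adjusted_share(share, old_vcpus, new_vcpus):
    """SHARE value that keeps the per-vCPU share unchanged after a vCPU change."""
    return share_per_vcpu(share, old_vcpus) * new_vcpus


# Hypothetical server: relative SHARE 200 with 2 vCPUs, growing to 3 vCPUs
before = share_per_vcpu(200, 2)
unadjusted = share_per_vcpu(200, 3)
new_share = adjusted_share(200, 2, 3)

print(f"per-vCPU share before: {before:.1f}")
print(f"per-vCPU share if SHARE is left alone: {unadjusted:.1f}")
print(f"SHARE needed to keep the per-vCPU share: {new_share:.1f}")
```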

ESAUSP2/ESAUSP5 - Individual virtual machine usage:

Shows total amount of virtual cpu use.
Shows the total to virtual ratio (T:V ratio) which indicates system overhead.
Shows top users which could easily indicate a dominating server/user.
ESAUSP5 - Shows the different ways of calculating user utilization with SMT active.
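
The T:V ratio mentioned above is total CPU time divided by virtual CPU time, so a value close to 1.0 means little system overhead on behalf of the user. A minimal sketch, with hypothetical numbers and a hypothetical function name:

```python
def tv_ratio(total_cpu_seconds, virtual_cpu_seconds):
    """Total-to-virtual CPU ratio; the amount above 1.0 reflects overhead."""
    if virtual_cpu_seconds <= 0:
        raise ValueError("virtual_cpu_seconds must be positive")
    return total_cpu_seconds / virtual_cpu_seconds


# Hypothetical interval: 66 total CPU seconds, of which 60 were virtual time
ratio = tv_ratio(66.0, 60.0)
overhead_pct = (ratio - 1.0) * 100.0

print(f"T:V ratio: {ratio:.2f}")
print(f"overhead relative to virtual time: {overhead_pct:.0f}%")
```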

Top 25 Reports at a Glance:

Configuration:

ESAHDR - Basic System Information
ESAUSRC - Virtual Machine Configuration

System Load:

ESASSUM

Wait States:

ESAXACT

CPU:

ESALPARS - LPAR Summary
ESAUSP2 - CPU Consumer
ESALNXP - Linux Consumer
ESALNXS - Linux Processor
ESAMFC - CPU Cache

Storage:

ESASTR1 - z/VM Requirements
ESAUSPG - User Storage
ESAUCD2 - Linux Storage
ESAVDSK - VDISK for Swap

Paging:

ESAPSDV - Configuration
ESAUSPG - Loads by User

DASD:

ESADSD1 - Configuration
ESADSD2 - DASD Rates
ESADSD5 - DASD Cache
ESAFCP - FCP
ESAEDEV - EDEV

Network:

ESATCPI - Configuration
ESATCP1/2/4 - Management

Important Tips:

zVPS numbers: zVPS performance numbers are expressed in absolute terms. If there are 10 processors, the maximum utilization is 1000%. zVPS measures CPU seconds and cycles per second. It does not report percentages of percentages, which cannot be used for capacity planning or chargeback and are not helpful for performance analysis either.
Linux historically measured CPU in wall clock time, which is less than desirable. A steal timer has since been added, which has helped.
Hardware measurement is the only consistently valid measurement (however with SMT, that is harder to measure.)
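
The absolute utilization described in the first tip can be worked through as follows. This is a sketch of the arithmetic only, with a hypothetical function name; it is not how zVPS itself computes its numbers.

```python
def absolute_utilization_pct(cpu_seconds, interval_seconds):
    """CPU seconds consumed per elapsed second, as a percent (100% = one full processor)."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return cpu_seconds / interval_seconds * 100.0


# 10 processors fully busy for a 60-second interval consume 600 CPU seconds
full = absolute_utilization_pct(600.0, 60.0)
# A server that used 150 CPU seconds in the same interval
server = absolute_utilization_pct(150.0, 60.0)

print(full)    # 1000.0
print(server)  # 250.0
```

Because the result is in processor units rather than a fraction of some varying base, values from different servers and intervals can be added together for capacity planning or chargeback.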

zVPS reports:
zVPS information may be found in a zMON screen (real time), in a zMAP report (for the previous day), or both. Specific information may also be extracted for detailed trending, etc. (see Extracting Performance Data). Most of the explanations in the tuning guide use zMON screens for real time information. However, if a zMAP report is available, it will have the same or similar information averaged over a time frame (usually 15 minutes) for the day to facilitate trending analysis. There are also several reports that have no real time screens, as the information is only valid over time. To see the available zMON screens, use ESAMON ESAINDEX or ESAMON ESATOC on z/VM, or Screen Index on zVIEW zMON. To see the available reports, use ESATOC LISTING on ZMAP 191, or ESAINDEX / Custom Samples Maplists on zVIEW zMAP.

Conclusions:

The best performance methodology is top-down. Start at the highest level and work down. However, one of the most important things to do for performance is to know what is typical for the system so when something changes, it is easily seen. This makes performance issues much easier to determine.