Velocity Software's Performance Methodology
When there is a performance problem, a hierarchical approach is a methodology that works for most problems. Our suggestion is to start at a high level and then, as information is evaluated, go down to the next appropriate level. This is helpful in a mainframe environment. However, Linux measures in 'steal time' and uses a bottom-up analysis rather than top-down. ZVPS provides end-to-end performance management: it can see and report on everything from the hardware down to the running Linux processes in order to find performance issues. This page is a summary of the considerations that are applied throughout the rest of the Tuning Guide.
Knowing the current environment is critical! Setting up highlighting, alerts and trending helps to pinpoint when/where
problems start, but this can only be done correctly when the baseline performance is known. Often customers don't view
performance data until a problem occurs, which may be too late. With some basic knowledge of the "normal" performance
levels, it is much easier to see when/where an event may have happened.
Note: Performance management means the performance monitor should NEVER be the performance problem and should always
be running - not turned off when there is a performance problem!
Important ZVPS system/data information:
- Performance/Operations Data
- Database (1 minute granularity - on a minute boundary)
- Files are on ZWRITE
- Files close every 1 hour by default - can be changed to 15 minutes
- Data used for realtime zVIEW/3270 screens
- Files kept 1-3 days (should keep 3 days worth for problem determination - defined in LOCAL ESAWRITE file)
- Capacity/Chargeback Data
- Database (15 minutes granularity - on a minute boundary)
- Files are on ZMAP
- Files close daily at midnight to create daily reports
- Data used for trending analysis and daily totals
- Files kept 90 days to x years (90 days is the default - can update in RUNAUTO PARMS file)
- Data/Report Issues
- zVIEW or 3270 screens not working/missing data - Verify the MONITOR is up/active and that ZWRITE is up and has sufficient space.
- Daily reports not being produced - Verify that the ZMAP id is not logged on so it can be autologged during midnight processing and that ZWRITE is producing data.
- Daily report process abends on ZMAP with insufficient storage - give more storage to ZMAP id, especially if running a large monthly report.
- Missing data from zVIEW or zMAP -
- Verify the DCSS is sized correctly - look for a message saying the DCSS is full.
- The DCSS starts at the beginning and stores System data, Scheduler data, Storage data, User data, CPU data and Disk data. If missing the later parts of the data, the DCSS may be full.
- No data/missing data from Linux -
- Verify ZTCP is up
- Verify ZTCP is getting enough cycles to run correctly
- Check ZTCP directory settings - IUCV ALLOW MSGLIMIT 2048 and IUCV ANY MSGLIMIT 2048
- Verify SNMP is up and running correctly on the Linux servers
- Verify the DNS is set up correctly
- Verify the Velocity MIB is installed on the server (ESALNXUP/ESALNXD) or SMSG ZTCP STATUS nodename
- No z/OS data - Verify ZOSMON is up on your z/OS system.
- Screen/Report general information -
- If ending in 'C' it contains configuration data.
- If ending in '1' it contains configuration data more relevant to performance.
- If ending in 'R' it contains raw data.
- If ending in 'P' it contains data with calculated percents.
- For ESALPARS - the first LPAR is the reporting LPAR (where the report was run)
Planning considerations:
- Set up user classes FIRST! This helps immensely to organize users/servers into groups that can be easily recognized
and trended. (Instructions can be found in the zMAP Installation and Operations manual found on the
Customer Documentation Page).
Suggestions:
- Group Test vs Production
- Group application by application (e.g., all Oracle servers in the same class)
- Group support servers vs production
- Take some time to baseline important performance numbers (THIS IS CRITICAL):
- Average CPU utilization/DASD response time/storage use/etc.
- Peak time of day/high CPU utilization times/etc.
- "Important user(s)" trends.
- Once baselines are determined - Set up Alerts and Highlighting - See Tuning Tips for Using Alerts
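The baseline step above can be sketched in a few lines. This is a minimal illustration, not part of zVPS: the function names, the sample values and the three-standard-deviation tolerance are all assumptions chosen for the example. In practice the samples would come from extracted trending data and the thresholds would be set through the product's own alert and highlighting definitions.

```python
from statistics import mean, pstdev


def build_baseline(samples):
    """Return (average, standard deviation) for a list of numeric samples."""
    return mean(samples), pstdev(samples)


def is_outside_baseline(value, baseline, tolerance=3.0):
    """True if value lies more than `tolerance` standard deviations from the average.

    The tolerance of 3.0 is an arbitrary illustrative choice.
    """
    average, spread = baseline
    if spread == 0:
        return value != average
    return abs(value - average) > tolerance * spread


# Hypothetical CPU utilization samples (absolute %, several engines).
cpu_history = [220.0, 240.0, 230.0, 250.0, 210.0]
baseline = build_baseline(cpu_history)

for observed in (235.0, 400.0):
    status = "ALERT" if is_outside_baseline(observed, baseline) else "normal"
    print(f"{observed:6.1f}% -> {status}")
```

The same check could be repeated per user class or per time of day, so that a value that is normal at peak is not mistaken for normal overnight.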
Top Down Methodology:
- Allocation Settings -
- CEC - has some number of engines
- LPAR - each LPAR has weights (becomes 'entitlement' divided equally by vcpu)
- Virtual Machine - each virtual machine has shares (becomes 'virtual entitlement')
- Process - Linux has priority/nice settings (gets virtual machine vcpu share)
- Measuring Top Down -
- CEC - are the engines highly utilized?
- LPAR - do the LPARS have enough entitlement? Are there cycles to spare?
- z/VM - do virtual machines have enough/too much SHARE?
- Linux - are processes properly niced?
- Platform Considerations
- z/VM - if a problem is running in a virtual machine, check z/VM first
- Application - if not running on a virtual machine, then the application needs investigation
- Tuning Considerations
- Measure - have a baseline of the system's typical performance
- When making a change - change only one thing at a time
- Validate the change had the desired effect
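As a worked example of the LPAR level in the allocation settings above, the sketch below turns weights into entitlement and then divides it equally across the LPAR's vcpus. It assumes the commonly used relationship in which an LPAR's entitlement is its share of the total weight multiplied by the number of shared engines; the function names and the numbers are hypothetical.

```python
def lpar_entitlement(weight, total_weight, shared_engines):
    """Entitlement in engines: the LPAR's fraction of total weight times shared engines."""
    if total_weight <= 0:
        raise ValueError("total_weight must be positive")
    return shared_engines * weight / total_weight


def entitlement_per_vcpu(entitlement, vcpus):
    """Entitlement is divided equally across the LPAR's vcpus."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return entitlement / vcpus


# Hypothetical CEC: 10 shared engines, total weight 1000 across all LPARs.
entitlement = lpar_entitlement(weight=300, total_weight=1000, shared_engines=10)
per_vcpu = entitlement_per_vcpu(entitlement, vcpus=6)

print(f"LPAR entitlement: {entitlement:.2f} engines")
print(f"Per-vcpu entitlement: {per_vcpu:.2f} engines")
```

In this example the LPAR is entitled to 3 engines but has 6 vcpus, so each vcpu is only guaranteed half an engine. Comparing the two numbers is a quick way to spot an LPAR with more vcpus than its weight supports.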
Knowing your environment:
There are many factors that go into creating an efficient processing environment. Each environment is unique.
Things to think about:
- Multi-process workloads - better response time will come from having two smaller engines (the gain is less queueing rather than a smaller service time).
- Single thread workloads - one larger engine (smaller service time) would provide the better environment.
- GP vs IFL vs ZIIP engines affect the environment and cost - know what your workload requires, where it runs now and where it can/should run.
- LPAR weighting - is set up early and can affect service times - know what your workload requires.
- What is the workload - what machines/servers are running? Is everything running efficiently?
What performance problem needs to be solved?
Problems will show up in many ways such as workload timeouts, applications running slowly, workloads/servers waiting or
someone notices high steal time.
Many problems are caused by tuning issues such as (addressed on other pages):
- LPAR tuning issues such as: (LPAR Tuning)
- LPAR weights vs utilization
- LPAR vcpu vs Assigned share
- LPAR set up inefficiently (extra cycles to spare but not able to be used).
- Parking CPUs on an LPAR (CPU/LPAR Parking)
- Inefficient SHARE settings (Setting SHARE values for Virtual Machines) (ESAUSRC)
- Inefficient other settings:
- Operating on the wrong engine type for the workload (GP engine vs IFL vs ZIIP). (ESAHDR)
- HiperPAV turned off for paging. (ESADSD2)
- No swap disk for Linux servers. (ESAUCD2)
- SMT turned on when not needed. (ESALPAR/ESALPARS/ESACPUU)
- Too many virtual cpus (vcpus) being allocated but not utilized. This can cause spin lock and/or processor cache issues. (ESALPAR/ESALPARS/ESACPUU)
- Workload issues such as:
- Synchronized Cron jobs - 100 processes over 100 servers
- Spin locks (DIAG 44 vs DIAG 9C) and/or too many vcpus
- Over usage of the master processor
General information flow:
In general, these are good things to investigate. All of these are explained in more detail on the flow chart page:
- ESAHDR - Basic configuration:
- Note the master processor for reference. It could be over utilized.
- Is the system operating on IFLs? Seems obvious, but the system may have been configured incorrectly.
- Is SMT being used? This will make a difference in the performance numbers.
- ESAMAIN - Overall system performance:
- How does the system look overall?
- Did anything change drastically at the time the problem started?
- Did processor utilization go up? Did resident storage change? How about I/O?
- ESAXACT - Wait states (what resources are in low supply):
- SIM shows when the master processor is utilized. If waiting on SIM, there is an overload on the master processor.
- CPU shows when CPU is a bottleneck - users are waiting for CPU and not getting it.
- ESALPARS - LPAR and IFL utilization:
- What is the weight/polarization? Is it sufficient?
- How busy is each of the processors? Are they running over 80%?
- Is the system overhead time over 10%? That can be an indication of system thrashing.
- ESACPUU/ESACPUA - z/VM perspective:
- Did the system CPU utilization spike? If multiple engines, are they being utilized evenly?
- Shows the system overhead from the z/VM perspective.
- Shows where the master processor may be over utilized.
- ESAUSRC - z/VM SHARE settings
- Are there users/systems with incorrect SHARE settings?
- When adding a VCPU to a machine, be sure to adjust the SHARE setting accordingly.
- ESAUSP2/ESAUSP5- Individual virtual machine usage:
- Shows total amount of virtual cpu use.
- Shows the total to virtual ratio (T:V ratio) which indicates system overhead.
- Shows top users which could easily indicate a dominating server/user.
- ESAUSP5 - Shows the different ways of calculating user utilization with SMT active.
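Several of the checks in the flow above reduce to simple arithmetic. The sketch below is illustrative only: it assumes the T:V ratio is total CPU time divided by virtual CPU time, and it reuses the 80% processor and 10% overhead figures mentioned under ESALPARS. The function names and input values are hypothetical, and the numbers would be read from the reports by hand or from extracted data.

```python
def tv_ratio(total_cpu_seconds, virtual_cpu_seconds):
    """Total-to-virtual ratio; values above 1.0 represent system overhead."""
    if virtual_cpu_seconds <= 0:
        raise ValueError("virtual_cpu_seconds must be positive")
    return total_cpu_seconds / virtual_cpu_seconds


def review_flags(processor_busy_pct, overhead_pct):
    """Return the list of conditions worth investigating.

    processor_busy_pct: per-processor utilization, each 0-100.
    overhead_pct: system overhead time as a percentage.
    """
    flags = []
    if any(busy > 80.0 for busy in processor_busy_pct):
        flags.append("processor over 80% busy")
    if overhead_pct > 10.0:
        flags.append("system overhead over 10%")
    return flags


# Hypothetical readings for one interval.
print(f"T:V ratio: {tv_ratio(110.0, 100.0):.2f}")
print(review_flags([72.0, 85.5, 64.0], overhead_pct=12.0))
print(review_flags([40.0, 35.0], overhead_pct=3.0))
```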
Top 25 Reports at a Glance:
- Configuration:
- ESAHDR - Basic System Information
- ESAUSRC - Virtual Machine Configuration
- System Load:
- ESASSUM
- Wait States:
- ESAXACT
- CPU:
- ESALPARS - LPAR Summary
- ESAUSP2 - CPU Consumer
- ESALNXP - Linux Consumer
- ESALNXS - Linux Processor
- ESAMFC - CPU Cache
- Storage:
- ESASTR1 - z/VM Requirements
- ESAUSPG - User Storage
- ESAUCD2 - Linux Storage
- ESAVDSK - VDISK for Swap
- Paging:
- ESAPSDV - Configuration
- ESAUSPG - Loads by User
- DASD:
- ESADSD1 - Configuration
- ESADSD2 - DASD Rates
- ESADSD5 - DASD Cache
- ESAFCP - FCP
- ESAEDEV - EDEV
- Network:
- ESATCPI - Configuration
- ESATCP1/2/4 - Management
Important Tips:
zVPS numbers:
zVPS performance numbers are expressed in absolute terms. If there are 10 processors, the maximum utilization will be 1000%.
zVPS measures CPU seconds and cycles per second. It is not based on percentages of percentages, which cannot be used
for capacity planning or chargeback and are not helpful for performance analysis either.
Linux was originally measured in wall clock time, which is less than desirable. A steal timer has since been added, which has helped.
Hardware measurement is the only consistently valid measurement (however, with SMT, that is harder to measure).
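The absolute scale can be illustrated with a short calculation. This is a sketch under the assumption that utilization is CPU seconds consumed divided by the elapsed seconds of the interval, expressed as a percentage; the function names are hypothetical and are not zVPS identifiers.

```python
def absolute_utilization_pct(cpu_seconds, interval_seconds):
    """CPU seconds consumed per elapsed second, as a percentage (100% = one engine)."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return cpu_seconds / interval_seconds * 100.0


def max_utilization_pct(processors):
    """Ceiling on the absolute scale: 100% per processor."""
    return processors * 100.0


# 10 processors fully busy for a 60-second interval consume 600 CPU seconds.
print(absolute_utilization_pct(600.0, 60.0))
print(max_utilization_pct(10))

# A server using 45 CPU seconds in the same interval is using 75% of one engine.
print(absolute_utilization_pct(45.0, 60.0))
```

Because every value is on the same scale, usage by different servers can be added together directly, which is what makes the numbers usable for capacity planning and chargeback.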
zVPS reports:
zVPS information may be found in a zMON screen (real time), in a zMAP report (for the previous day), or both. Also,
specific information may be extracted for detailed trending, etc (Extracting Performance Data).
Most of the explanations in the tuning guide utilize zMON screens for real time information. However, if a zMAP report
is available it will have the same or similar information but averaged over a time frame (usually 15 minutes) for the
day to facilitate trending analysis. There are also several reports that have no real time screens as the information
is only valid over time.
To see the available zMON screens, use ESAMON ESAINDEX or ESAMON ESATOC on z/VM, or the Screen Index on zVIEW zMON.
To see the available reports, use ESATOC LISTING on ZMAP 191, or ESAINDEX / Custom Samples Maplists on zVIEW zMAP.
Conclusions:
The best performance methodology is top-down. Start at the highest level and work down. However, one of the most important things to do for performance is to know what is typical for the system, so that when something changes it is easily seen. This makes performance issues much easier to determine.