Alert Info

Alerts

Tuning Tips for Using Alerts

Alerts make system monitoring easier. Recovery is faster when a problem is identified quickly, whether because a monitored zVIEW screen suddenly shows yellow/red or because an email is sent when a critical node goes down. A little up-front work goes a long way. Periodic maintenance to verify that the values are still relevant also matters; otherwise those monitoring the system end up saying "We are used to seeing the red - it's always red".

Alert tips for use with zALERT:

The zALERT product is a no-charge offering: a powerful, flexible tool for identifying issues before they become real problems. An early alert can make the difference between an impactful system outage and hero recognition! :-) (Information about using zALERT can be found in the zALERT User's Guide on the Customer Documentation page.)
Note: zOPERATOR is another no-charge product that is a perfect landing place for the alerts below. With a little up-front work, an operator's chances of seeing a problem early are greatly improved.
Below are some tuning tips when using alerts.

Critical system elements are no longer functioning:

Service machines such as RACFVM or TCPIP are usually discovered quickly if they are not logged on, because users can't log on, etc. However, service machines such as ZWRITE, SFS servers (system/user), RSCS, FTPSERVE, etc. may be down for long periods of time before anyone notices, even though the outage is impactful. An alert will quickly notify the correct parties that a service machine is down and needs to be brought back up (and the cause investigated).
Network nodes that affect the business may go down but aren't necessarily discovered immediately. Often node issues aren't discovered until traffic backs up or users complain. Having an alert for critical nodes can prevent unnecessary impacts.
TCPIP transport errors can indicate a big issue in the network. Alerting on TCP connection failures, resets, segment transmits in error/resets and UDP datagram errors can be helpful.
Special users such as z/OS systems, DB2 systems, Oracle systems, etc. may have been logged off or gone into a disabled state. Not finding the problem early could drastically affect SLAs and create customer impact.
Special Linux processes such as SNMP, MongoDB or Java may be down when they really need to be running.
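
The "is everything critical still logged on" check described above can be sketched in a few lines. This is plain Python for illustration, not zALERT syntax; the function name, the `CRITICAL_SERVICE_MACHINES` set and the way the list of logged-on user IDs is obtained are all assumptions, and the set should be replaced with the machines that matter on your system.

```python
from typing import Iterable, List, Set

# Hypothetical list of service machines that must always be logged on.
# The names are taken from the examples in the text above.
CRITICAL_SERVICE_MACHINES: Set[str] = {"RACFVM", "TCPIP", "ZWRITE", "RSCS", "FTPSERVE"}


def missing_service_machines(
    logged_on: Iterable[str],
    required: Set[str] = CRITICAL_SERVICE_MACHINES,
) -> List[str]:
    """Return the required service machines that are not in the logged-on list."""
    present = {user.strip().upper() for user in logged_on}
    return sorted(required - present)


if __name__ == "__main__":
    # Example input: user IDs currently logged on (however they were collected).
    down = missing_service_machines(["racfvm", "TCPIP", "RSCS", "LINUX01"])
    for name in down:
        print(f"ALERT: service machine {name} is not logged on")
```

An empty result means nothing needs attention; any name returned is a candidate for an alert to the responsible party.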

System element utilizations are too high/low:

High CPU utilization can dramatically affect system performance. With an alert for early detection, the correct people can investigate and hopefully circumvent the problem before it causes a bigger issue. Trend analysis done over a few days/weeks/months can determine what level of CPU utilization is too high for your particular system. However, alerts for 80-90%, or to flag a large change in percent (maybe climbing 10% in a five-minute period), would be a good place to start.
Paging Utilization can cause issues when too high. Anything over 75% is high.
Spool Utilization will impact the system if there is no longer any spool space. The cause can be a single user or a broader issue. Determine what is "normal" for your system and alert on anything higher, or use 75% here as well.
Top User utilization can be very helpful for problem determination if CPU utilization is climbing.
Memory utilization can show up as either high or low depending on the measurement. A shortage of memory below the 2GB line will lead to issues. However, a large amount of memory suddenly being consumed could end up causing paging issues. Alerts can be created for either or both situations.
DASD service times are high. Slow response time for DASD will cause serious performance issues. This can show up as high response time, service time, pend time, disconnect time, connect time or queue time.
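
As a rough illustration of the CPU starting points suggested above (a fixed high-water mark plus a "climbing fast" check), here is a minimal Python sketch. It is not zALERT syntax. The function name and defaults are hypothetical, and it assumes "climbing 10%" means a rise of 10 percentage points between the first and last sample of a five-minute window.

```python
from typing import Optional, Sequence


def cpu_alert(
    window: Sequence[float],
    high_pct: float = 85.0,   # assumed value inside the suggested 80-90% range
    climb_pts: float = 10.0,  # assumed: rise in percentage points over the window
) -> Optional[str]:
    """Check one five-minute window of CPU utilization samples (oldest first).

    Returns an alert message, or None when neither condition is met.
    """
    if not window:
        return None
    latest = window[-1]
    if latest >= high_pct:
        return f"CPU utilization {latest:.0f}% is at or above {high_pct:.0f}%"
    climb = latest - window[0]
    if climb >= climb_pts:
        return f"CPU utilization climbed {climb:.0f} points in five minutes"
    return None


if __name__ == "__main__":
    print(cpu_alert([50.0, 55.0, 62.0]))  # rapid climb, still below the high-water mark
    print(cpu_alert([80.0, 88.0]))        # above the high-water mark
    print(cpu_alert([50.0, 52.0]))        # nothing to report
```

The same fixed-threshold pattern applies to the 75% suggestions for paging and spool utilization; only the rate-of-change check is specific to the CPU tip.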
Linux:

Swap Rate is high. If swapping is over 80%, an alert should be generated.
Process CPU is high. If a single process is using more than 20 percent CPU, an alert should be generated.
Looping User is a user that is using a lot of CPU but not a lot of I/O. With zALERT, an alert can be coded to check for both high utilization and low I/O for any given Linux server.
Filesystem Utilization can be shown as low available blocks or a climbing percent used.
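
The looping-user condition above combines two measurements: high CPU and low I/O at the same time. The sketch below shows that combined test in Python; it is not zALERT alert code. The `ServerSample` type, the field names and both threshold values are hypothetical and would need to be tuned to what is normal for your servers.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ServerSample:
    name: str          # Linux server (user) name
    cpu_pct: float     # CPU utilization over the interval, in percent
    io_per_sec: float  # I/O rate over the same interval


def looping_servers(
    samples: Iterable[ServerSample],
    cpu_floor: float = 80.0,   # assumed "a lot of CPU"
    io_ceiling: float = 5.0,   # assumed "not a lot of I/O"
) -> List[str]:
    """Return names of servers that show high CPU together with low I/O."""
    return [
        s.name
        for s in samples
        if s.cpu_pct >= cpu_floor and s.io_per_sec <= io_ceiling
    ]


if __name__ == "__main__":
    interval = [
        ServerSample("LINUX01", cpu_pct=95.0, io_per_sec=0.5),    # busy, no I/O: suspect
        ServerSample("LINUX02", cpu_pct=92.0, io_per_sec=400.0),  # busy but doing real work
        ServerSample("LINUX03", cpu_pct=10.0, io_per_sec=1.0),    # idle
    ]
    print(looping_servers(interval))
```

Requiring both conditions keeps a legitimately busy server (high CPU and high I/O) from being flagged as looping.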

Alert tips for use with zVPS:

zVPS monitor screens (zVIEW or 3270) can/should be customized to highlight appropriate high water marks. This makes it much easier to see at a glance when a system number is 'in the yellow/red'.
There are two ways to do this - via zMON or with zVIEW. For zMON, use the instructions explained in the zMON Users Guide - Chapter 4. A MONPROF COPY file is created to override the defaults in ESAMONDF COPY. Using the screen name and column number of the information, a value is coded that will be used to highlight when that value has been exceeded.
For zVIEW, use the instructions explained in the zVIEW Users Guide - Chapter 1 - Setting Screen Thresholds. This sets warning colors of yellow/red in Graphs.
For thresholds in Enterprise view - see zVIEW Users Guide - Chapter 2 - configuration. These are set in VSIMAINT in the ZVIEW CECLIST.
Below are some tuning tips for zVPS alerts (screen highlights):

CPU Utilization - High CPU utilization can indicate a problem. Knowing what is "normal" for your system will allow you to set the alerts on the proper percentages. For instance, if the system has two IFLs (which gives 200% potential utilization) and it usually runs at about 150%, then setting a highlight in yellow for 170% and another in red for 190% can show at a glance that CPU utilization is climbing. Keep in mind that if a third IFL is added, the system now has 300% potential utilization. It is very helpful to return to the settings and change the yellow/red highlights to, say, 255%/285%. Both sets of values correspond to yellow=85% and red=95% of capacity. If that third IFL is added and the highlight numbers aren't changed as well, yellow now fires at 57% of capacity and red at 63%. Anyone watching the system utilization could then get 'numb' to yellow/red.
Note: If SMT is on, then it shows the number of threads (not CPUs) so that number is actually doubled. Verify your percentage is correct.
Suggested screen(s) to set/update: ESAMAIN.8, ESALPAR.16, any other screen that is typically monitored
DASD Utilization - DASD issues definitely cause performance issues. This can be high response time, service time, pend time, disconnect time, connect time or queue time.
Memory Utilization - Many critical system functions are done in memory below the 2GB line. If there is not enough memory in this area, there will be performance issues. Look at trending data and determine what is a "normal" low water mark for your system and set the alert for that number or slightly less.
Suggested screen(s) to set/update: ESASTR1.7
Wait States - When users are waiting on resources, there is almost always a problem. The ESAMONDF file has default definitions for each of the different resources. It is good to be familiar with them and change if needed.
Suggested screen(s) to set/update: ESAXACT (various columns)
Paging Utilization - If paging is climbing, it could be indicative of a big problem. Anything over 75% should be investigated.
Suggested screen(s) to set/update: ESAPSDV.6
Spool Utilization - If spool space runs out, the system can no longer create spool files. Anything over 75% can be an issue.
Suggested screen(s) to set/update: ESAPSDV.12
Linux CPU Utilization - One server can use more than its share of CPU for various reasons. A server using over 80% CPU is worth highlighting.
Suggested screen(s) to set/update: ESANLXS.13 and/or ESAUCD4.3
Linux Swap - If a Linux system has a swap rate that is climbing, it can cause problems not only for that server but also for the whole system. A high swap rate is over 10.
Suggested screen(s) to set/update: ESAUCD4.9/.10
Linux File System Utilization - If a Linux system runs out of file space, it can be a real problem. Monitoring how much space is available and/or the percent used is helpful. Look at trending data and determine what is a "normal" size for the different disks and set the alert for a number slightly less (for available space) or slightly more (percentage used).
Suggested screen(s) to set/update: ESALNXF2.8/.9
TCPIP Bandwidth - Sometimes network traffic can overload the system. Determine what is 'normal' for your system and set a threshold to see if that is being overrun.
Suggested screen(s) to set/update: ESATCP4.4/.5
TCPIP Errors - If the network isn't happy, customers aren't happy. There can be connection failures, resets, segment transmits in error/resets and UDP datagram errors that can be highlighted to show issues. Each ESATCPx screen has different error indicators. See which ones may affect your system and highlight on non-zero conditions.
Suggested screen(s) to set/update: ESATCP1.6/.7/.11/.12/.15/.16, ESATCP2.7/.8/.9/.11/.12, ESATCP3.21-.28, ESATCP4.12-.16
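
The threshold arithmetic in the CPU Utilization tip above (highlights expressed against 100% per IFL, and needing adjustment when IFLs are added) can be worked through with a short Python sketch. The function names are hypothetical and nothing here is zVPS configuration syntax. The `threads_per_ifl` parameter is an assumption based on the SMT note: set it to 2 if the screen counts threads rather than CPUs.

```python
from typing import Tuple


def highlight_thresholds(
    ifl_count: int,
    yellow_pct: int = 85,      # share of total capacity for the yellow highlight
    red_pct: int = 95,         # share of total capacity for the red highlight
    threads_per_ifl: int = 1,  # assumed: 2 when SMT is on and threads are counted
) -> Tuple[int, int]:
    """Return (yellow, red) highlight values on the 100%-per-engine scale."""
    capacity = ifl_count * threads_per_ifl * 100
    return round(capacity * yellow_pct / 100), round(capacity * red_pct / 100)


def effective_percent(threshold: float, ifl_count: int, threads_per_ifl: int = 1) -> float:
    """Express an existing highlight value as a percentage of total capacity."""
    capacity = ifl_count * threads_per_ifl * 100
    return 100 * threshold / capacity


if __name__ == "__main__":
    print(highlight_thresholds(2))  # two IFLs
    print(highlight_thresholds(3))  # after adding a third IFL
    # How far a stale yellow value of 170 drifts once the third IFL is added:
    print(round(effective_percent(170, 3)))
```

Running `highlight_thresholds` again whenever the IFL count or SMT setting changes keeps yellow/red tied to the same share of capacity, while `effective_percent` is a quick check of what an existing setting really means on the current configuration.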