MFC Data

MFC Analysis

Specifics - Understanding MFC Data:

MFC stands for MainFrame Cache. This cache is in the CEC or 'box'. It functions similarly to cache on DASD: it is an area that holds recently used data so that the data can be accessed again quickly. If this cache is used efficiently, performance improves. Unfortunately, in the z/VM environment, with so many guest systems that do polling (like Linux, DB2, etc.) or large amounts of I/O (like TPF), this cache tends to be overwritten easily and thus becomes inefficient. However, there are things that can be done to help maintain the best use of this cache.

Useful Terms and Information:

First - In order to get MFC data, the Measurement Facility must be turned on in the LPAR. See Enabling CPUMFC Records
MFC data is by thread. This is important.
The levels of cache are:

L1 cache - This area is on the core (private) and is the fastest and most efficiently used. All instruction and data information for a transaction must be in L1 cache before it can be run.
L2 cache - This area is also on the core (private), is usually larger than L1 but is also slightly slower.
L3 cache - This area is shared by all of the cores on the same chip.
L4L cache - This area is shared by all of the cores on the same local book.
L4R cache - This area is shared by all of the cores on a different (remote) book.
Memory - This area is actual memory - it means the data was not in any level of cache.

There are cycles wasted waiting for data to come from other areas of cache. SMT was created to increase efficiency by utilizing those 'wasted' cycles. The fewer cycles wasted, the more work can get done.
CPI - Cycles per instruction. This is a very important measurement of performance. A lower CPI means that more instructions can be executed in the same number of cycles. A higher CPI means that more CPU is consumed for the same work and that there is possibly contention.
When instructions go up and cycles go down, it's a good thing. Watch the CPI Ratio on ESAMFCA (lower is good).
RNI - Relative Nest Intensity. This is a concept created by John Burg at WSC. Velocity uses his calculations. However, we have found it does not track well for the high dispatch rates for Linux.
It also does not account for address translation, the cost of which can be considerable.
TLB - Translation Lookaside Buffer. In order to move data from other cache levels or memory to L1 cache, addresses need to be translated. Dynamic Address Translation (DAT) handles this, and the TLB holds recent translations so that they do not have to be repeated.
Originally there was only one DAT per box; later boxes have gone from one DAT to four DATs to help increase the efficiency and performance of this process.
When setting up LPAR hardware, be sure to set cores up as much as possible on the same CHIPs and nodes/books.
Note: Adding virtual CPUs to a server whose vCPUs tend to access the same data leads to cache contention and can negatively affect performance. Only define as many vCPUs as are needed for the workload! If customers demand additional vCPUs when they are not needed, Velocity's zVRM product can help.

Tips For Using MFC Data:

Watch the ESAMFC screen for TLB CPU Cost. A typically healthy environment seems to use about 10% of the cycles (or less) for address translation. Numbers above 20% should be investigated.
Watch the ESAMFC screen/report for 'Ratio' to get CPI. Get a baseline read of the number. Again, the lower the ratio, the more work that is getting done. If this number is high without SMT, do not turn SMT on.
Watch the ESAMFC screen/report for TLB CPU Cost. The higher the number, the less work is getting done.
Watch the ESAMFCA and ESAMFCC screens/reports for the data read from L2/L3/L4L/L4R/Memory. If large amounts of data are read from the farther areas, especially Off-Drawer, it could be a hardware configuration issue.
Use the fewest vCPUs possible to get the workload done. Running vCPUs that are not needed causes cache inefficiencies.
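
The TLB CPU Cost guideline in the tips above can be sketched as a small check. This is only an illustration: the function name and the sample values are hypothetical, and the middle 'watch' band (above 10% up to 20%) is an assumption, since the tips only name 10% or less as healthy and above 20% as needing investigation.

```python
def classify_tlb_cost(tlb_cost_pct: float) -> str:
    """Classify a TLB CPU Cost percentage using the rule-of-thumb thresholds."""
    if tlb_cost_pct < 0:
        raise ValueError("percentage cannot be negative")
    if tlb_cost_pct <= 10.0:
        return "healthy"       # about 10% of cycles or less
    if tlb_cost_pct <= 20.0:
        return "watch"         # assumed middle band, not named in the text
    return "investigate"       # above 20%


# Hypothetical per-processor values, as might be read off the ESAMFC screen.
samples = {"CPU0": 8.5, "CPU1": 14.2, "CPU2": 23.7}
for cpu, pct in samples.items():
    print(cpu, pct, classify_tlb_cost(pct))
```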

Helpful ESAMON screens/ESAMAP reports:

ESAMFC - Processor Cache Analysis - Shows processor instruction information
ESAMFCA - Processor Cache Hit Analysis - Shows processor cache hit information
ESAMFCC - Processor L1 Cache Write Analysis - Shows processor Level one cache write information
ESAMFCN - Processor Cache Intervention Analysis - Shows processor cache intervention analysis
ESAPLDV - Processor Local Dispatch Vector Activity - Shows the dispatcher activity

ESAMFC - Shows processor instruction information.

Processor Rate/Sec Cycles/Instr/Ratio - Shows processor cache effectiveness. The lower the ratio, the more work is being accomplished.

Level 1 Cache/Second Instruction Cost/Data Cost - Shows the cost of cache misses.

TLB CPU Cost/Cycles Lost - Also shows the cost of cache misses - cycles being used for 'non-work' (such as address translation) or 'idle' due to time lost moving data from a higher level of cache/memory. Watch for changes in each of these numbers - especially if changing parking settings and/or LPAR weighting.

ESAMFCA - Shows processor cache hit information.

Processor Rate/Sec Cycles/Instr/Ratio - Shows processor cache effectiveness. The lower the ratio (the average number of cycles required to process an instruction) the more work is being accomplished.
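
As a minimal sketch of how the ratio relates to the other two columns, CPI is simply cycles divided by instructions, and a change is judged against a baseline. The function names and the sample rates below are hypothetical and are not taken from any ESAMON screen.

```python
def cpi(cycles: float, instructions: float) -> float:
    """Average number of cycles needed to process one instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def cpi_change_pct(baseline: float, current: float) -> float:
    """Percentage change of the current CPI relative to the baseline CPI."""
    return (current - baseline) / baseline * 100.0


# Hypothetical per-second rates: same instruction rate, more cycles consumed.
baseline = cpi(cycles=4.0e9, instructions=2.0e9)
current = cpi(cycles=4.5e9, instructions=2.0e9)
print(f"baseline {baseline:.2f}, current {current:.2f}, "
      f"change {cpi_change_pct(baseline, current):+.1f}%")
```

A positive change means each instruction now costs more cycles than it did at baseline, which is the direction to investigate.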

Data source read from L1/L2/L3/L4L/L4R/Mem - Shows the cache hits from the different levels of cache. The farther the system has to go to get the information, the higher the cost.
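
One way to read these columns is as a distribution: what share of the reads was satisfied at each level, and how much came from the most distant sources. The sketch below assumes the inputs are raw counts per level; the names and numbers are hypothetical, and treating L4R plus memory as 'far' is an illustrative choice, not a product definition.

```python
LEVELS = ["L1", "L2", "L3", "L4L", "L4R", "Mem"]


def source_shares(counts: dict) -> dict:
    """Convert per-level read counts into percentages of all reads."""
    total = sum(counts.get(level, 0) for level in LEVELS)
    if total == 0:
        return {level: 0.0 for level in LEVELS}
    return {level: 100.0 * counts.get(level, 0) / total for level in LEVELS}


def far_share(shares: dict) -> float:
    """Share of reads from the most distant sources (assumed: L4R and memory)."""
    return shares["L4R"] + shares["Mem"]


# Hypothetical counts for one processor over one interval.
counts = {"L1": 900, "L2": 60, "L3": 25, "L4L": 10, "L4R": 3, "Mem": 2}
shares = source_shares(counts)
for level in LEVELS:
    print(f"{level:>4}: {shares[level]:5.1f}%")
print(f"far sources: {far_share(shares):.1f}%")
```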

TLB Miss Instr/Data - This shows the Translation Lookaside Buffer misses for both instructions and data. The higher the number, the less actual work is being accomplished.

Overhead Pct Cycles Used TLB%/Total - Shows the amount of overhead caused by TLB misses.

RNI From Burg - Shows the Relative Nest Intensity from the Burg formula. This is a calculation of how long it takes to load L1 cache from the different levels of cache. The smaller the number, the faster L1 cache is being refreshed and the more work is being done. RNI goes up when SMT is enabled because cache usage is affected.
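
The exact Burg formula is not reproduced in this document. As a rough sketch of the idea only, RNI can be thought of as a weighted sum of the percentages of L1 loads sourced from each level beyond L2, with heavier weights for more distant sources. The weights and scale below are placeholders chosen for illustration, not the published machine-specific coefficients, and the function and parameter names are hypothetical.

```python
def relative_nest_intensity(l3p, l4lp, l4rp, memp, weights, scale):
    """Weighted sum of sourcing percentages; farther levels carry heavier weights."""
    weighted = (weights["L3"] * l3p
                + weights["L4L"] * l4lp
                + weights["L4R"] * l4rp
                + weights["MEM"] * memp)
    return scale * weighted / 100.0


# Placeholder values for illustration only (NOT the real coefficients).
PLACEHOLDER_WEIGHTS = {"L3": 0.5, "L4L": 1.5, "L4R": 3.0, "MEM": 7.0}
PLACEHOLDER_SCALE = 2.0

# Hypothetical percentages of L1 loads sourced from L3, L4L, L4R and memory.
rni = relative_nest_intensity(60.0, 25.0, 5.0, 10.0,
                              PLACEHOLDER_WEIGHTS, PLACEHOLDER_SCALE)
print(f"illustrative RNI: {rni:.2f}")
```

Because memory carries the heaviest weight, shifting even a few percent of the loads from L3 to memory raises the result noticeably, which matches the statement that a smaller number means L1 is being refreshed faster.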

ESAMFCC - Shows processor L1 cache write analysis.

L2 Cache Inst/Data - Shows L1 cache writes from L2 cache. The closer the cache is to L1, the more effective it is and the less time it will take to be able to execute the instruction.

L3 Cache Data OnChip/OnBook/Offbk - Shows L1 cache writes from L3 cache - on the same CHIP, on the same book or on a different book for data.

L3 Cache Inst OnChip/OnBook/Offbk - Shows L1 cache writes from L3 cache - on the same CHIP, on the same book or on a different book for instructions.

L4 Cache OnBook/Offbk - Shows L1 cache writes from L4 cache - on the same book or on a different book.

Memory OnChip/OnBook/OffBook/OffDrawer - Shows L1 cache writes from memory - on the same chip, on/off the same book or on a different drawer. This would be the most costly.

SIIS - This shows the Store Into Instruction Stream (from Burg) percentage. Anything over 5% will impact performance.

The farther away the data for an L1 write has to come from, the more time it takes and the more performance suffers. This is a good place to see cache efficiency.

ESAMFCN - Shows processor L1 cache intervention analysis.

TLB Pct Busy - The number of L2 TLB translation engines busy in a cycle.

Note: The TLB busy information only shows on a z16 or higher.

ESAPLDV - Shows processor local dispatch vector activity

VMDBK Moves - Shows the number of VMDBKs that moved to a different processor, either from processor to processor or from a slave processor to the master processor. Watch for any large fluctuations. If the number of VMDBKs moved to the master starts to climb or has a sharp increase, investigation is needed to determine what is being run that must run on the master.
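
Watching for a sharp increase can be sketched as comparing each interval with the average of the intervals just before it. The window size, the factor of 2, the function name and the sample values are all assumptions made for illustration; a suitable threshold depends on the baseline of the system being watched.

```python
from statistics import mean


def flag_spikes(rates, window=4, factor=2.0):
    """Return indexes whose value exceeds factor x the mean of the prior window."""
    flagged = []
    for i in range(window, len(rates)):
        baseline = mean(rates[i - window:i])
        if baseline > 0 and rates[i] > factor * baseline:
            flagged.append(i)
    return flagged


# Hypothetical 'moves to master' values for consecutive intervals.
moves_to_master = [12, 10, 11, 13, 12, 40, 14, 12]
print(flag_spikes(moves_to_master))
```

With these sample values only the interval at index 5 is flagged, since 40 is more than twice the average of the four intervals before it.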

Disptch LngPath - Shows the number of dispatches per second. Look for large fluctuations.

CPU Steals from Other CPUs - This shows when VMDBKs were moved from all the different levels of cache. The farther out a steal goes, the more time it takes and the worse the performance. This is another way to determine if SMT is working for a system. (Be sure to get benchmark numbers before turning on SMT). Also, if there are numbers in columns other than N1, the LPAR may be defined inefficiently and should be corrected. Note the numbers in the NL2 column in this report - this WAS causing issues.