3.3. Red Hat Enterprise Linux-Specific Information

Monitoring bandwidth and CPU utilization under Red Hat Enterprise Linux entails using the tools discussed in Chapter 2 Resource Monitoring; therefore, if you have not yet read that chapter, you should do so before continuing.

3.3.1. Monitoring Bandwidth on Red Hat Enterprise Linux

As stated in Section 2.4.2 Monitoring Bandwidth, it is difficult to directly monitor bandwidth utilization. However, by examining device-level statistics, it is possible to roughly gauge whether insufficient bandwidth is an issue on your system.

By using vmstat, it is possible to determine if overall device activity is excessive by examining the bi and bo fields; in addition, taking note of the si and so fields give you a bit more insight into how much disk activity is due to swap-related I/O:

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0 248088 158636 480804   0   0     2     6  120   120  10   3  87
        

In this example, the bi field shows two blocks/second written to block devices (primarily disk drives), while the bo field shows six blocks/second read from block devices. We can determine that none of this activity was due to swapping, as the si and so fields both show a swap-related I/O rate of zero kilobytes/second.

By using iostat, it is possible to gain a bit more insight into disk-related activity:

Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com)     07/21/2003

avg-cpu:  %user   %nice    %sys   %idle
           5.34    4.60    2.83   87.24

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev8-0            1.10         6.21        25.08     961342    3881610
dev8-1            0.00         0.00         0.00         16          0
        

This output shows us that the device with major number 8 (which is /dev/sda, the first SCSI disk) averaged slightly more than one I/O operation per second (the tsp field). Most of the I/O activity for this device were writes (the Blk_wrtn field), with slightly more than 25 blocks written each second (the Blk_wrtn/s field).

If more detail is required, use iostat's -x option:

Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com)     07/21/2003

avg-cpu:  %user   %nice    %sys   %idle
           5.37    4.54    2.81   87.27

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz
/dev/sda    13.57   2.86  0.36  0.77   32.20   29.05    16.10    14.53    54.52
/dev/sda1    0.17   0.00  0.00  0.00    0.34    0.00     0.17     0.00   133.40
/dev/sda2    0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00    11.56
/dev/sda3    0.31   2.11  0.29  0.62    4.74   21.80     2.37    10.90    29.42
/dev/sda4    0.09   0.75  0.04  0.15    1.06    7.24     0.53     3.62    43.01
        

Over and above the longer lines containing more fields, the first thing to keep in mind is that this iostat output is now displaying statistics on a per-partition level. By using df to associate mount points with device names, it is possible to use this report to determine if, for example, the partition containing /home/ is experiencing an excessive workload.

Actually, each line output from iostat -x is longer and contains more information than this; here is the remainder of each line (with the device column added for easier reading):

Device:    avgqu-sz   await  svctm  %util
/dev/sda       0.24   20.86   3.80   0.43
/dev/sda1      0.00  141.18 122.73   0.03
/dev/sda2      0.00    6.00   6.00   0.00
/dev/sda3      0.12   12.84   2.68   0.24
/dev/sda4      0.11   57.47   8.94   0.17
        

In this example, it is interesting to note that /dev/sda2 is the system swap partition; it is obvious from the many fields reading 0.00 for this partition that swapping is not a problem on this system.

Another interesting point to note is /dev/sda1. The statistics for this partition are unusual; the overall activity seems low, but why are the average I/O request size (the avgrq-sz field), average wait time (the await field), and the average service time (the svctm field) so much larger than the other partitions? The answer is that this partition contains the /boot/ directory, which is where the kernel and initial ramdisk are stored. When the system boots, the read I/Os (notice that only the rsec/s and rkB/s fields are non-zero; no writing is done here on a regular basis) used during the boot process are for large numbers of blocks, resulting in the relatively long wait and service times iostat displays.

It is possible to use sar for a longer-term overview of I/O statistics; for example, sar -b displays a general I/O report:

Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com)     07/21/2003

12:00:00 AM       tps      rtps      wtps   bread/s   bwrtn/s
12:10:00 AM      0.51      0.01      0.50      0.25     14.32
12:20:01 AM      0.48      0.00      0.48      0.00     13.32
…
06:00:02 PM      1.24      0.00      1.24      0.01     36.23
Average:         1.11      0.31      0.80     68.14     34.79
        

Here, like iostat's initial display, the statistics are grouped for all block devices.

Another I/O-related report is produced using sar -d:

Linux 2.4.21-1.1931.2.349.2.2.entsmp (raptor.example.com)     07/21/2003

12:00:00 AM       DEV       tps    sect/s
12:10:00 AM    dev8-0      0.51     14.57
12:10:00 AM    dev8-1      0.00      0.00
12:20:01 AM    dev8-0      0.48     13.32
12:20:01 AM    dev8-1      0.00      0.00
…
06:00:02 PM    dev8-0      1.24     36.25
06:00:02 PM    dev8-1      0.00      0.00
Average:       dev8-0      1.11    102.93
Average:       dev8-1      0.00      0.00
        

This report provides per-device information, but with little detail.

While there are no explicit statistics showing bandwidth utilization for a given bus or datapath, we can at least determine what the devices are doing and use their activity to indirectly determine the bus loading.

3.3.2. Monitoring CPU Utilization on Red Hat Enterprise Linux

Unlike bandwidth, monitoring CPU utilization is much more straightforward. From a single percentage of CPU utilization in GNOME System Monitor, to the more in-depth statistics reported by sar, it is possible to accurately determine how much CPU power is being consumed and by what.

Moving beyond GNOME System Monitor, top is the first resource monitoring tool discussed in Chapter 2 Resource Monitoring to provide a more in-depth representation of CPU utilization. Here is a top report from a dual-processor workstation:

  9:44pm  up 2 days, 2 min,  1 user,  load average: 0.14, 0.12, 0.09
90 processes: 82 sleeping, 1 running, 7 zombie, 0 stopped
CPU0 states:  0.4% user,  1.1% system,  0.0% nice, 97.4% idle
CPU1 states:  0.5% user,  1.3% system,  0.0% nice, 97.1% idle
Mem:  1288720K av, 1056260K used,  232460K free,       0K shrd,  145644K buff
Swap:  522104K av,       0K used,  522104K free                  469764K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
30997 ed        16   0  1100 1100   840 R     1.7  0.0   0:00 top
 1120 root       5 -10  249M 174M 71508 S <   0.9 13.8 254:59 X
 1260 ed        15   0 54408  53M  6864 S     0.7  4.2  12:09 gnome-terminal
  888 root      15   0  2428 2428  1796 S     0.1  0.1   0:06 sendmail
 1264 ed        15   0 16336  15M  9480 S     0.1  1.2   1:58 rhn-applet-gui
    1 root      15   0   476  476   424 S     0.0  0.0   0:05 init
    2 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU0
    3 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU1
    4 root      15   0     0    0     0 SW    0.0  0.0   0:01 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00 ksoftirqd_CPU0
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00 ksoftirqd_CPU1
    7 root      15   0     0    0     0 SW    0.0  0.0   0:05 kswapd
    8 root      15   0     0    0     0 SW    0.0  0.0   0:00 bdflush
    9 root      15   0     0    0     0 SW    0.0  0.0   0:01 kupdated
   10 root      25   0     0    0     0 SW    0.0  0.0   0:00 mdrecoveryd
          

The first CPU-related information is present on the very first line: the load average. The load average is a number corresponding to the average number of runnable processes on the system. The load average is often listed as three sets of numbers (as top does), which represent the load average for the past 1, 5, and 15 minutes, indicating that the system in this example was not very busy.

The next line, although not strictly related to CPU utilization, has an indirect relationship, in that it shows the number of runnable processes (here, only one -- remember this number, as it means something special in this example). The number of runnable processes is a good indicator of how CPU-bound a system might be.

Next are two lines displaying the current utilization for each of the two CPUs in the system. The utilization statistics show whether the CPU cycles were expended for user-level or system-level processing; also included is a statistic showing how much CPU time was expended by processes with altered scheduling priorities. Finally, there is an idle time statistic.

Moving down into the process-related section of the display, we find that the process using the most CPU power is top itself; in other words, the one runnable process on this otherwise-idle system was top taking a "picture" of itself.

TipTip
 

It is important to remember that the very act of running a system monitor affects the resource utilization statistics you receive. All software-based monitors do this to some extent.

To gain more detailed knowledge regarding CPU utilization, we must change tools. If we examine output from vmstat, we obtain a slightly different understanding of our example system:

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0 233276 146636 469808   0   0     7     7   14    27  10   3  87
 0  0  0      0 233276 146636 469808   0   0     0     0  523   138   3   0  96
 0  0  0      0 233276 146636 469808   0   0     0     0  557   385   2   1  97
 0  0  0      0 233276 146636 469808   0   0     0     0  544   343   2   0  97
 0  0  0      0 233276 146636 469808   0   0     0     0  517    89   2   0  98
 0  0  0      0 233276 146636 469808   0   0     0    32  518   102   2   0  98
 0  0  0      0 233276 146636 469808   0   0     0     0  516    91   2   1  98
 0  0  0      0 233276 146636 469808   0   0     0     0  516    72   2   0  98
 0  0  0      0 233276 146636 469808   0   0     0     0  516    88   2   0  97
 0  0  0      0 233276 146636 469808   0   0     0     0  516    81   2   0  97
        

Here we have used the command vmstat 1 10 to sample the system every second for ten times. At first, the CPU-related statistics (the us, sy, and id fields) seem similar to what top displayed, and maybe even appear a bit less detailed. However, unlike top, we can also gain a bit of insight into how the CPU is being used.

If we examine at the system fields, we notice that the CPU is handling about 500 interrupts per second on average and is switching between processes anywhere from 80 to nearly 400 times a second. If you think this seems like a lot of activity, think again, because the user-level processing (the us field) is only averaging 2%, while system-level processing (the sy field) is usually under 1%. Again, this is an idle system.

Reviewing the tools Sysstat offers, we find that iostat and mpstat provide little additional information over what we have already experienced with top and vmstat. However, sar produces a number of reports that can come in handy when monitoring CPU utilization.

The first report is obtained by the command sar -q, which displays the run queue length, total number of processes, and the load averages for the past one and five minutes. Here is a sample:

Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com)      07/21/2003

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5
12:10:00 AM         3       122      0.07      0.28
12:20:01 AM         5       123      0.00      0.03
…
09:50:00 AM         5       124      0.67      0.65
Average:            4       123      0.26      0.26
        

In this example, the system is always busy (given that more than one process is runnable at any given time), but is not overly loaded (due to the fact that this particular system has more than one processor).

The next CPU-related sar report is produced by the command sar -u:

Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com)      07/21/2003

12:00:01 AM       CPU     %user     %nice   %system     %idle
12:10:00 AM       all      3.69     20.10      1.06     75.15
12:20:01 AM       all      1.73      0.22      0.80     97.25
…
10:00:00 AM       all     35.17      0.83      1.06     62.93
Average:          all      7.47      4.85      3.87     83.81
        

The statistics contained in this report are no different from those produced by many of the other tools. The biggest benefit here is that sar makes the data available on an ongoing basis and is therefore more useful for obtaining long-term averages, or for the production of CPU utilization graphs.

On multiprocessor systems, the sar -U command can produce statistics for an individual processor or for all processors. Here is an example of output from sar -U ALL:

Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com)      07/21/2003

12:00:01 AM       CPU     %user     %nice   %system     %idle
12:10:00 AM         0      3.46     21.47      1.09     73.98
12:10:00 AM         1      3.91     18.73      1.03     76.33
12:20:01 AM         0      1.63      0.25      0.78     97.34
12:20:01 AM         1      1.82      0.20      0.81     97.17
…
10:00:00 AM         0     39.12      0.75      1.04     59.09
10:00:00 AM         1     31.22      0.92      1.09     66.77
Average:            0      7.61      4.91      3.86     83.61
Average:            1      7.33      4.78      3.88     84.02
        

The sar -w command reports on the number of context switches per second, making it possible to gain additional insight in where CPU cycles are being spent:

Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com)      07/21/2003

12:00:01 AM   cswch/s
12:10:00 AM    537.97
12:20:01 AM    339.43
…
10:10:00 AM    319.42
Average:      1158.25
        

It is also possible to produce two different sar reports on interrupt activity. The first, (produced using the sar -I SUM command) displays a single "interrupts per second" statistic:

Linux 2.4.21-1.1931.2.349.2.2.entsmp (falcon.example.com)      07/21/2003

12:00:01 AM      INTR    intr/s
12:10:00 AM       sum    539.15
12:20:01 AM       sum    539.49
…
10:40:01 AM       sum    539.10
Average:          sum    541.00
        

By using the command sar -I PROC, it is possible to break down interrupt activity by processor (on multiprocessor systems) and by interrupt level (from 0 to 15):

Linux 2.4.21-1.1931.2.349.2.2.entsmp (pigdog.example.com)     07/21/2003

12:00:00 AM  CPU  i000/s  i001/s  i002/s  i008/s  i009/s  i011/s  i012/s
12:10:01 AM    0  512.01    0.00    0.00    0.00    3.44    0.00    0.00

12:10:01 AM  CPU  i000/s  i001/s  i002/s  i008/s  i009/s  i011/s  i012/s
12:20:01 AM    0  512.00    0.00    0.00    0.00    3.73    0.00    0.00
…
10:30:01 AM  CPU  i000/s  i001/s  i002/s  i003/s  i008/s  i009/s  i010/s
10:40:02 AM    0  512.00    1.67    0.00    0.00    0.00   15.08    0.00
Average:       0  512.00    0.42    0.00     N/A    0.00    6.03     N/A
        

This report (which has been truncated horizontally to fit on the page) includes one column for each interrupt level (for example, the i002/s field illustrating the rate for interrupt level 2). If this were a multiprocessor system, there would be one line per sample period for each CPU.

Another important point to note about this report is that sar adds or removes specific interrupt fields if no data is collected for that field. The example report above provides an example of this, the end of the report includes interrupt levels (3 and 10) that were not present at the start of the sampling period.

NoteNote
 

There are two other interrupt-related sar reports — sar -I ALL and sar -I XALL. However, the default configuration for the sadc data collection utility does not collect the information necessary for these reports. This can be changed by editing the file /etc/cron.d/sysstat, and changing this line:

*/10 * * * * root /usr/lib/sa/sa1 1 1
          

to this:

*/10 * * * * root /usr/lib/sa/sa1 -I 1 1
          

Keep in mind this change does cause additional information to be collected by sadc, and results in larger data file sizes. Therefore, make sure your system configuration can support the additional space consumption.