Showing posts with label monitoring linux. Show all posts

Tuesday, February 17, 2009

vmstat - Virtual Memory Statistics

If someone asks me to check how a Linux/UNIX system is performing right now, the first thing I run is vmstat. A lot of people only look at the CPU and memory utilization figures in vmstat, but it reports much more than that. In this post, I am going to explain vmstat in detail.
vmstat stands for virtual memory statistics; it collects and displays summary information about memory, processes, interrupts, paging, and block I/O. By specifying an interval, it can be used to observe system activity interactively.
Most commonly, people pass two numeric arguments to vmstat: the first is the delay (sleep) between updates and the second is the number of updates to print before vmstat quits. Please note this is not the full syntax of vmstat, and it can vary between OSs. Please refer to your OS man page for more information.
To run vmstat with 7 updates, 10 seconds apart, type
# vmstat 10 7
Please note that the reported metrics might be slightly different on some systems. The headings I describe here are the ones reported on Oracle Linux (Unbreakable Oracle Linux).
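If you want to work with the numbers rather than read them off the screen, the output is easy to split into columns. Here is a minimal Python sketch, assuming the two-line header layout printed by the Linux (procps) vmstat; the sample text and its values are made up for illustration, and the function name parse_vmstat is my own.

```python
# Sample text in the shape printed by `vmstat 10 3` on Linux (procps).
# The numbers are invented for illustration only.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  90112 512000    0    0    12     8  110  240  3  1 95  1  0
 4  1      0 801200  90200 515300    0    0   640   120  950 2100 62 10 20  8  0
 5  2      0 799800  90260 516100    0    0   980   300 1020 2500 70 12  8 10  0
"""


def parse_vmstat(text):
    """Return one dict per sample row, keyed by the column names in the header."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Line 0 is the group banner (procs, memory, ...); line 1 names the columns.
    columns = lines[1].split()
    rows = []
    for line in lines[2:]:
        values = [int(v) for v in line.split()]
        rows.append(dict(zip(columns, values)))
    return rows


samples = parse_vmstat(SAMPLE)
print(samples[1]["r"], samples[1]["us"], samples[1]["wa"])  # 4 62 8
```

Because the column names are read from the header line, the same sketch should cope with systems that report a slightly different set of columns, as long as the two-line header shape holds.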
Process Block: Provides details of the processes that are waiting for something (it can be CPU, I/O, etc.; potentially any resource)
r  -->  Processes waiting for CPU (the run queue). The higher the count, the more processes are waiting to run. A brief spike in the count should not be treated as a bottleneck. If the value is constantly high (most people treat 2 * CPU count as high), it hints that CPU is the bottleneck.
b  -->  Processes in uninterruptible sleep, also known as “blocked” processes. These processes are most likely waiting for I/O, but they could be waiting for something else too
w  -->  Number of processes that can run but have been swapped out to the swap area. This value gives a hint about a memory bottleneck. Please remember, only some systems report this count
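The "spike versus constantly high" distinction for r can be sketched in a few lines of Python. This is only an illustration of the rule of thumb above: the function name, the sample readings, and the choice to require every sample to exceed the limit are my assumptions, not something vmstat defines.

```python
def cpu_is_saturated(r_values, cpu_count, factor=2):
    """True only if every sample's run queue exceeds factor * cpu_count.

    A single spike is ignored on purpose; the threshold (2 x CPUs) is the
    rule of thumb quoted in the text, not a hard limit.
    """
    if not r_values:
        return False
    limit = factor * cpu_count
    return all(r > limit for r in r_values)


# Hypothetical r readings from six consecutive vmstat samples on a 4-CPU box.
spike = [1, 2, 14, 1, 0, 2]
sustained = [9, 11, 10, 12, 9, 13]

print(cpu_is_saturated(spike, cpu_count=4))      # False
print(cpu_is_saturated(sustained, cpu_count=4))  # True
```

On a real host the CPU count could come from os.cpu_count() instead of being typed in by hand.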
Memory Block: Provides detailed memory statistics
swpd  -->  Amount of virtual memory (swap space) used
free  -->  Amount of free physical memory (RAM)
buff  -->  Amount of memory used as buffers. This memory is used to store file metadata such as i-nodes and data from raw block devices
cache  -->  Amount of physical memory used as cache (mostly cached file data).
Note: Most systems report memory block values in KB. Remember I said most, not all. So check the man page.
 
Swap Block: Provides memory swap information
si  -->  Rate at which memory is swapped back in from the disk to physical RAM
so  -->  Rate at which memory is swapped out from physical RAM to the disk
Note: Most systems report swap block values in KB per second. Check the man page
I/O Block: I/O related information
bi  -->  Rate at which the system reads the data from block devices (in blocks/sec)
bo  -->  Rate at which the system sends data to the block devices (in blocks/sec)
System Block: System information
in  -->  Number of interrupts received by the system per second
cs  -->  Rate of context switching in the process space (in number/sec)
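CPU block: The most commonly used CPU information
us  -->  Percentage of CPU time spent in user processes. Most user/application/database processes come under this category
sy  -->  Percentage of CPU time spent running kernel (system) code, such as system calls made on behalf of processes. Note that a process owned by root still counts under us while it runs in user mode
id  -->  Percentage of CPU time spent idle
wa  -->  Percentage of CPU time spent waiting for I/O
A lot of people get confused here. Some say us + sy + id + wa = 100 and others say us + sy + id = 100. Both can be right, depending on the system: on Linux kernels from 2.5.41 onwards, wa is reported separately from id, so all four columns (plus st, where it is reported) add up to 100; on older kernels the I/O wait time is included in id, so us + sy + id = 100. Either way, wa is a form of idle time: waiting for I/O does not consume CPU.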
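A quick way to see which convention your own system follows is to add up the CPU columns of one sample. Here is a small Python sketch; the sample values are invented, and the function name cpu_total is my own. Keep in mind that vmstat prints rounded whole percentages, so a real row may add up to 99 or 101.

```python
def cpu_total(sample):
    """Sum the CPU percentage columns present in one vmstat sample."""
    # 'wa' and 'st' are not reported everywhere, so missing columns count as 0.
    return sum(sample.get(col, 0) for col in ("us", "sy", "id", "wa", "st"))


# Hypothetical sample row with wa reported separately from id.
sample = {"us": 62, "sy": 10, "id": 20, "wa": 8, "st": 0}

print(cpu_total(sample))                 # 100
print(sample["us"] + sample["sy"])       # 72 -> CPU actually doing work
print(sample["id"] + sample["wa"])       # 28 -> CPU not doing work
```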
Interpretation with respect to performance:
The first line of the output is an average of all the metrics since the system was restarted. So, ignore that line since it does not show the current status. The other lines show the metrics in real-time.
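If you post-process the samples, dropping that first row is a one-line slice. A minimal Python sketch, assuming each sample is already a dict of column values (the rows and the function name here are made up for illustration):

```python
def steady_state_average(rows, column):
    """Average one column over all samples except the first (since-boot) row."""
    live = rows[1:]
    if not live:
        raise ValueError("need at least two samples")
    return sum(row[column] for row in live) / len(live)


rows = [
    {"us": 3, "id": 95},    # first row: average since boot, skipped
    {"us": 60, "id": 22},
    {"us": 70, "id": 10},
]

print(steady_state_average(rows, "us"))  # 65.0
```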
Ideally, the r/b/w values under the procs block should be 0 or close to 0. If one or more of these counters constantly report high values, it means that the system may not have sufficient CPU, memory, or I/O bandwidth.
If the value of swpd under memory is high and keeps changing, it means that the system is running short of memory (sustained non-zero si/so values point the same way).
The data under “io” indicates the flow of data to and from the disks. This shows how much disk activity is going on, which does not necessarily indicate a problem (obviously data has to go to disk in order to be persistent). If we see a large number in the “b” column under “procs” (processes being blocked) together with high I/O, the issue could be I/O contention.
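These interpretation rules can be written down as a rough checklist in Python. Treat this strictly as a sketch: the thresholds (b_limit, io_limit) are hypothetical numbers I picked for the example, the sample rows are invented, and requiring the condition in every sample is just one way to express "constantly".

```python
def diagnose(rows, b_limit=2, io_limit=1000):
    """Return hints ('memory', 'io') for conditions that hold in every sample.

    b_limit and io_limit (blocks/s, bi + bo) are illustrative thresholds only.
    """
    hints = []
    if not rows:
        return hints
    # Continuous swapping in or out suggests the system is short of memory.
    if all(row["si"] > 0 or row["so"] > 0 for row in rows):
        hints.append("memory")
    # Blocked processes together with heavy disk traffic suggest I/O contention.
    if all(row["b"] >= b_limit and row["bi"] + row["bo"] >= io_limit for row in rows):
        hints.append("io")
    return hints


# Hypothetical samples (first since-boot row already removed).
io_bound = [
    {"b": 3, "si": 0, "so": 0, "bi": 4200, "bo": 1800},
    {"b": 4, "si": 0, "so": 0, "bi": 3900, "bo": 2500},
    {"b": 5, "si": 0, "so": 0, "bi": 4800, "bo": 2100},
]
swapping = [
    {"b": 0, "si": 120, "so": 340, "bi": 150, "bo": 360},
    {"b": 1, "si": 90, "so": 410, "bi": 110, "bo": 430},
]

print(diagnose(io_bound))  # ['io']
print(diagnose(swapping))  # ['memory']
```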
Rule of thumb in Performance
Adding more CPU, memory, or I/O bandwidth to the system is not always the solution to the problem; it often just postpones the problem, and it can blow up at any time. The real solution is to tune the application (every component in the architecture) as far as possible. Adding hardware capacity or buying more powerful hardware should be the last option.
As usual, comments are always welcome.
-Thiru

Thanks to Anonymous for pointing out the issue in bi/bo.

Friday, January 23, 2009

Sun Management Center - Sun's Ways of monitoring


"Sun Management Center" (Sun MC) is a product from Sun Microsystems for monitoring SPARC and x86 hardware running Solaris and Linux. It provides in-depth monitoring and diagnostics of servers and their services. Sun MC is based on a server-agent model.

Architecture
Sun Management Center software includes three component layers: console, server, and agent. The product is based on the manager and agent architecture:

Console layer: The console layer is the interface for end users. It exposes web, JWS, and console interfaces. Multiple users can access the same Sun MC at the same time.

Server layer: The server is the core, which talks to the console layer and the agent layer. It acts as the central repository and stores data (both historical and current). It includes components such as the configuration manager, event manager, and topology manager. Sun Management Center 4.0 uses the open source PostgreSQL database to store data, whereas the previous version, 3.6, uses Oracle.

Agent layer: The agent layer monitors and gathers information about the server/system on which it is deployed, and it communicates with the server using SNMP (modules are used for gathering monitoring data; different modules are used for different purposes). Apart from monitoring, the agent also has the capability to manage the nodes. The agent uses rules (which it gets from the server layer) to raise an alarm when a rule is not met.


Modules: Modules are the components in the agent layer responsible for monitoring. They can be dynamically loaded, invoked, started, stopped, and uninstalled in Sun MC. Kernel reader, file scanning, directory scanning, config reader, fault manager, print spooler, and process monitoring are some of the modules.

Like nmon, Sun MC is free to download and use (you can pay to get support). Just as Glance is used for monitoring HP machines, Sun MC can be used to monitor Sun-based systems. The next time you plan to do performance testing or tuning on SPARC or x86 hardware running Solaris, try Sun MC.

Wednesday, August 13, 2008

OS Monitoring

Monitoring a server isn't something you should do chaotically. You need to have a clear plan: a set of goals that you hope to achieve. Troubleshooting server performance problems is a key reason for monitoring; the point is not just to plot good-looking graphs and show them to superiors. Without monitoring, tuning is an almost impossible activity.

Basically, monitoring is done for two purposes:
  1. Benchmark
  2. Tuning

While monitoring an OS, the following are the basic things that need to be considered, regardless of the OS platform.

  1. CPU Utilization
  2. Memory, paging
  3. Throughput & retransmission statistics
  4. TCP statistics
  5. Disk statistics

Not everything needs to be presented in the report. Presenting a 10-20 page report containing only the relevant data is always better than presenting a 100-page report (an album). If you want to show your hard work, create an annexure section and attach it there (I would not call that an album; after all, we are showing our hard work in the relevant place. He he).

Linux Monitoring

Monitoring Linux-based systems is not as complex as many people think. All we need to know is some simple, basic shell scripting and a few basic commands.

To find the version of the Linux kernel
uname -a

vmstat command
vmstat reports information about processes, memory, paging, block IO, traps, and CPU activity.

Basic syntax
vmstat [delay [count]]

Example
vmstat 10 7

The first set of statistics printed is averaged over the system uptime. Don't rely on it unless that average really makes sense for what you are checking.
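If you prefer to collect the samples from a script, the same delay/count call can be wrapped in a few lines of Python. This is a minimal sketch using only the standard subprocess module; the function names are my own, and it assumes vmstat is on the PATH of the machine where run_vmstat is called.

```python
import subprocess


def vmstat_command(delay, count):
    """Build the argument list for `vmstat delay count`."""
    if delay <= 0 or count <= 0:
        raise ValueError("delay and count must be positive integers")
    return ["vmstat", str(delay), str(count)]


def run_vmstat(delay, count):
    """Run vmstat and return its raw text output.

    The call blocks until vmstat has printed all of its updates.
    """
    result = subprocess.run(
        vmstat_command(delay, count),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if vmstat exits with an error
    )
    return result.stdout


print(vmstat_command(10, 7))  # ['vmstat', '10', '7']
```

Calling run_vmstat(10, 7) on a Linux host would then return the same text you would see on the terminal, ready to be split into columns.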

See man page for more information

iostat command
iostat displays kernel I/O statistics on terminal, device and cpu operations.

Basic syntax
iostat [delay [count]]

Example
iostat 10 7

The first set of statistics printed is averaged over the system uptime. Don't rely on it unless that average really makes sense for what you are checking.

netstat command

netstat prints network connections, routing tables, interface statistics, masquerade connections, and multicast memberships. There are a number of output formats, depending on the options for the information presented.

See man page for more information

Nmon utility

Quote from IBM
This free tool gives you a huge amount of information all on one screen. Even though IBM doesn't officially support the tool and you must use it at your own risk, you can get a wealth of performance statistics. Why use five or six tools when one free tool can give you everything you need?

The nmon tool runs on:

  1. AIX® 4.1.5, 4.2.0 , 4.3.2, and 4.3.3 (nmon Version 9a: This version is functionally established and will not be developed further.)
  2. AIX 5.1, 5.2, and 5.3 (nmon Version 10: This version now supports AIX 5.3 and POWER5™ processor-based machines, with SMT and shared CPU micro-partitions.)
  3. Linux® SUSE SLES 9, Red Hat EL 3 and 4, Debian on pSeries® p5, and OpenPower™
  4. Linux SUSE, Red Hat, and many recent distributions on x86 (Intel and AMD in 32-bit mode)
  5. Linux SUSE and Red Hat on zSeries® or mainframe

Click here for more information.

Windows Monitoring
Windows provides a huge set of monitoring counters to end users. Not all of them need to be monitored all the time, but any counter might be useful at one point or another. All Windows OS counters can be monitored using perfmon (just go to Run and type perfmon).


I am planning to write step-by-step instructions on how to monitor Windows using perfmon later, as a separate blog post.