SUMMARY: System Performance and Memory Leakage Debugging

Ju-Lien Lim (julienlim@rocketmail.com)
Fri, 23 Jan 1998 05:56:47 -0800 (PST)

I have received a number of requests for this, so I am
reposting it with "Summary" in the subject so that it
will get archived. (Sorry for "flooding" people with this
posting again.)

Ju
julienlim@rocketmail.com
------------------------------------------------------
The most common cause of a system hang/performance problem is that the
system is running low on or out of resources. Normally the first thing
to do is to run some performance tools to see whether that is in fact
the case. Please refer to SunSolve's Whitepaper/software technical
bulletin # 1217 to help you interpret the data from the script.

The following can be placed in a file and then
invoked from cron every 15 minutes to help determine
if the system is CPU bound, I/O bound, or memory
bound. Run as root.

date >> file1
vmstat 30 10 >> file1 &
date >> file2
iostat -xtc 30 10 >> file2 &
date >> file3
/usr/ucb/ps -aux >> file3 &
date >> file4
echo kmastat | crash >> file4 &
date >> file5
echo "map kernelmap" | crash >> file5 &

Most users will want to change the file names to be
placed in some absolute pathname location.
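
One way to keep the output in an absolute location is sketched below.
It is only an illustration: the LOGDIR default (/var/tmp/perfdata) and
the helper name snap are made up for this example and are not part of
the original script. It also assumes a shell with POSIX features
(ksh or similar).

```shell
#!/bin/sh
# Hypothetical log directory; any absolute path that root can write to works.
LOGDIR=${LOGDIR:-/var/tmp/perfdata}

# snap NAME COMMAND [ARGS...]
# Appends a timestamp, then the output of COMMAND, to $LOGDIR/NAME.
snap() {
    name=$1
    shift
    mkdir -p "$LOGDIR" || return 1
    {
        date
        "$@"
    } >> "$LOGDIR/$name" 2>&1
}
```

The cron script would then call it along these lines:

    snap vmstat.out vmstat 30 10 &
    snap kmastat.out sh -c 'echo kmastat | crash' &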

CPU Power
---------

In the vmstat command output, look to see what the run queue size is
(first column). If the run queue has more than 3 processes waiting per
CPU (i.e. more than 3 for a 1 CPU system, more than 6 for a 2 CPU
system, etc.), then this bears watching. If the run queue has more
than 5 processes waiting per CPU, there is insufficient CPU power in
the system.

[Summary: Make sure "r" column (in vmstat) is <= 3
per CPU]
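
The run queue rule can be checked against the collected vmstat log
with a small filter. This is a rough sketch, not a tested tool: the
function name check_runq is hypothetical, and it assumes an awk that
accepts -v (nawk on Solaris). It treats any line whose first field is
a number as a sample, so the first row of each vmstat run (which
reports averages since boot) is included as well.

```shell
# check_runq NCPU < vmstat-log
# Prints the largest run queue value seen and how it compares with the
# 3-per-CPU and 5-per-CPU thresholds.
check_runq() {
    awk -v ncpu="${1:-1}" '
        # numeric sample rows only; date stamps and headers are skipped
        $1 ~ /^[0-9]+$/ && NF > 10 {
            if ($1 + 0 > max) max = $1 + 0
        }
        END {
            if (max > 5 * ncpu)      verdict = "insufficient CPU"
            else if (max > 3 * ncpu) verdict = "watch"
            else                     verdict = "ok"
            printf "max_r=%d ncpu=%d verdict=%s\n", max, ncpu, verdict
        }'
}
```

For a 2 CPU system it would be run as "check_runq 2 < file1".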

Virtual Memory
--------------

If, over time, the amount of memory shown in the swap column goes
down and does not recover, there is probably a memory leak on the
system. To determine whether it is a kernel memory leak, look at the
output of the two crash commands (described later). To determine which
application has a memory leak, look at the SZ column of the ps output.
This column indicates the size of a process's data and stack in
kilobytes.

[Note: use /usr/ucb/ps vs. /bin/ps]

If over time, the swap column goes down and recovers, look at the
lowest value in the swap column. If this value goes below 4000, then
the system is in danger of running out of virtual memory space. More
swap space should be added to the system.

[Summary: find the lowest value in the "swap" column (in vmstat). Make
sure the value is always above 4000]
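
Finding the lowest swap value by eye gets tedious once the log grows.
The filter below is one possible way to do it; the name min_swap is
hypothetical. It locates the swap column from the vmstat header row
rather than assuming a fixed position, and it assumes a POSIX awk
(nawk on Solaris).

```shell
# min_swap < vmstat-log
# Prints the lowest value seen in the swap column and flags it if it is
# below the 4000 threshold.
min_swap() {
    awk '
        $1 == "r" {                        # column-name header row
            for (i = 1; i <= NF; i++)
                if ($i == "swap") col = i
            next
        }
        col && $1 ~ /^[0-9]+$/ {           # numeric sample rows
            v = $col + 0
            if (!seen || v < min) { min = v; seen = 1 }
        }
        END {
            if (!seen) { print "no samples"; exit }
            printf "min_swap=%d %s\n", min, (min < 4000 ? "LOW" : "ok")
        }'
}
```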

Physical Memory
---------------

The sr (scan rate) column of the vmstat output indicates the rate at
which pages are being scanned in order to find needed pages for current
processes. If this rate is over 200 pages per second for prolonged
periods of time, the machine is out of physical memory and could
benefit from additional physical memory.

[Summary: Make sure "sr" column (in vmstat) is <= 200
over time]
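
Since a short burst of scanning is not a problem, what matters is how
long sr stays above the limit. The sketch below reports the longest
run of consecutive samples over the limit; with the 30 second interval
used in the script, a run of N samples is roughly N/2 minutes. The
name sr_longest_run is hypothetical, and a POSIX awk (nawk on Solaris)
is assumed.

```shell
# sr_longest_run [LIMIT] < vmstat-log
# Prints the longest run of consecutive samples whose sr value exceeds
# LIMIT (default 200).
sr_longest_run() {
    awk -v limit="${1:-200}" '
        $1 == "r" {                        # column-name header row
            for (i = 1; i <= NF; i++)
                if ($i == "sr") col = i
            run = 0                        # a new vmstat run starts here
            next
        }
        col && $1 ~ /^[0-9]+$/ {           # numeric sample rows
            if ($col + 0 > limit) {
                run++
                if (run > best) best = run
            } else {
                run = 0
            }
        }
        END { printf "longest_run_over_%d=%d\n", limit, best }'
}
```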

Kernel Memory
-------------

The kernel has a limited amount of memory which it uses for kernel
data allocations. This memory is commonly referred to as the kernel
heap. The maximum size of the kernel heap is fixed, depending on
machine architecture and the amount of physical memory. If a machine
runs out of kernel memory, it will usually hang. To see if this is
the case, look at the output of the crash kernelmap command. This
command shows how many segments of kernel memory exist and how large
each segment is. If there are only 1- and 2-page segments left, the
kernel has run out of memory (even if there are a hundred of them).

[Summary: Ensure there are > 2 page segments left (in
crash->map kernelmap).]

The crash kmastat command shows how much memory has
been allocated to
which bucket. Prior to Solaris 2.4, this showed only
3 buckets making
it difficult to tell which bucket was hogging the
memory (if any).
Starting with 2.4, kmastat breaks memory allocation
down to many
different buckets. If one of these buckets has
several MB of memory,
and there are kernel memory allocation failures,
there is probably
a memory leak involving the large bucket.

In order to diagnose a memory leak problem it is
possible to turn
on some flags in Solaris 2.4 and above (see SRDB
12172). With
these flags turned on, once the kmastat command shows
significant
growth in the offending bucket, L1-A should be used
to stop the
machine and create a core file. SunService can use
this core file
to help determine the cause of the leak.

Disk I/O
--------

To check for disks which are overly busy, look at the iostat output.
The columns of interest are %b (% of time the disk is busy) and svc_t
(average service time in milliseconds). If %b is greater than 20% and
svc_t is greater than 30ms, this is a busy disk. If there are other
disks which are not busy, the load should be balanced. If all disks
are this busy, additional disks should be considered.

There is no direct way to check for an overloaded SCSI bus, but if the
%w column (% of time transactions are waiting for service) is greater
than 5%, then the SCSI bus may be overloaded.
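
Both disk rules can be applied to the iostat log with one filter. This
is a sketch only: the name busy_disks is hypothetical, and it assumes
the header row of "iostat -x" starts with either "disk" or "device"
(this appears to vary between releases) and names the svc_t, %w and %b
columns. A POSIX awk (nawk on Solaris) is assumed.

```shell
# busy_disks < iostat-log
# Prints one line per sample where a disk exceeds the thresholds:
# %b > 20 together with svc_t > 30, or %w > 5.
busy_disks() {
    awk '
        $1 == "disk" || $1 == "device" {   # header row of "iostat -x"
            for (i = 1; i <= NF; i++) {
                if ($i == "svc_t") s = i
                if ($i == "%w")    w = i
                if ($i == "%b")    b = i
            }
            next
        }
        # data rows: enough fields and a numeric %b value
        s && w && b && NF >= b && $b ~ /^[0-9.]+$/ {
            if ($b + 0 > 20 && $s + 0 > 30)
                printf "%s busy: %%b=%s svc_t=%s\n", $1, $b, $s
            if ($w + 0 > 5)
                printf "%s waiting: %%w=%s\n", $1, $w
        }'
}
```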

Information about what levels to check for the various performance
statistics is taken from "Sun Performance and Tuning" by Adrian
Cockcroft, ISBN 0-13-149642-5.

Additional performance gathering scripts are available in Infodoc 2242
for Solaris 2.x.

__________________________________________________________________________________

What is a memory leak? How can I tell I have one?

A memory leak is present whenever a program loses
track of the memory it
has allocated and allocates more to replace memory it
already has. A
common example would be a C program calling malloc()
to allocate memory,
then the same program calling malloc() to allocate
memory to the same
variable again without freeing what it had
malloc()'ed in the first
place:

x = malloc (SOME_AMT_OF_MEMORY);
x = malloc (SOME_AMT_OF_MEMORY);

The first allocation would be lost because the
pointer to the second
chunk of memory would overwrite the pointer to the
first chunk.

Memory leaks present themselves as systems running
out of swap or out of
kernelmap, depending on what type of memory leak it is.

Applications or non-kernel code (including daemons)
that have memory
leaks will eventually use up all swap space. The
/tmp directory will
shrink to almost nothing and will show full, because
the primary swap
area is used for both the /tmp directory and swap
space with swap space
taking precedence. Programs will bomb with "out of
memory" (ENOMEM)
errors. The system might run slower and slower until
it comes to a stop;
all processes requiring swap to continue running will
wait for it
forever.

Both ps commands (/usr/ucb/ps -aux or /usr/bin/ps
-efl) show an SZ
column, which displays the amount of virtual space
consumed. A program
with a memory leak will show more and more virtual
space consumed as
time goes on.
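
To spot a growing process, two ps snapshots taken some time apart can
be compared. The sketch below is an illustration under several
assumptions: the name sz_growth is hypothetical, each input file holds
exactly one "/usr/ucb/ps -aux" snapshot with the header as its first
line, and the columns are separated by white space (wide values can
run together in real ps output). Process IDs can also be reused, so
any hit should be checked against the COMMAND column.

```shell
# sz_growth OLD NEW
# Prints every PID present in both snapshots whose SZ value grew.
sz_growth() {
    awk '
        FNR == 1 {                         # header row: locate PID and SZ
            for (i = 1; i <= NF; i++) {
                if ($i == "PID") p = i
                if ($i == "SZ")  z = i
            }
            file++
            next
        }
        file == 1 { old[$p] = $z + 0; next }     # remember the old sizes
        ($p in old) && $z + 0 > old[$p] {        # compare with the new ones
            printf "pid=%s SZ %d -> %d\n", $p, old[$p], $z
        }' "$1" "$2"
}
```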

Memory leaks in the kernel manifest themselves in the same way: a
system that runs slower and slower until it comes to a stop. A core
dump of a kernel that is out of memory will show threads waiting for
memory (the adb $<threadlist command), and will show little or no
kernelmap (the crash "map kernelmap" command). There will be lots of
errors showing up in the crash kmastat output, which can help pinpoint
the problem: look for lines corresponding to memory pools that have
large memory-in-use values and perhaps lots of allocation attempts
(both successful and unsuccessful).

The best way to debug a memory leak is to turn on kmem_flags in the
memory allocator. See SRDB 12172, included in this document, for
information on how to do this.

_______________________________________________________________________________
SRDB ID: 12172
STATUS: Issued

SYNOPSIS: How to set up kernel memory flags
(kmem_flags)

DETAIL DESCRIPTION:

How do you set up kernel memory flags to detect
memory corruption?

Setting up the memory flags may have the effect of
crashing your system
sooner than it used to if it detects corruption, but
unfortunately that is
the only way to catch the problem early enough.

Solution
--------
There are two ways of setting the kmem_flags in the
kernel:

NOTE: you *cannot* use /etc/system to set the flags,
since this happens
too late in the boot process for this particular
module, and will crash
your system.

1) adb
For Solaris releases 2.4 and below do the following:
# cp /kernel/unix /kernel/unix.orig
# adb -w /kernel/unix
kmem_flags?W 1f
$q
# reboot

For Solaris releases 2.5 and above (non-sun4u) do the
following:
# cp /kernel/genunix /kernel/genunix.orig
# adb -w /kernel/genunix
kmem_flags?W 1f
$q
# reboot

IMPORTANT NOTE: If running Solaris 2.5, make sure patch 104611-01 (or
higher) is installed before performing the "adb" on the genunix file.

For the Sun4u (ultra) platforms do the following:
# cp /platform/sun4u/kernel/genunix /platform/sun4u/kernel/genunix.orig
# adb -w /platform/sun4u/kernel/genunix
kmem_flags?W 1f
$q
# reboot

IMPORTANT NOTE: If running Solaris 2.5, make sure patch 104611-01 (or
higher) is installed before performing the "adb" on the genunix file.

The change will be in effect for every reboot until
the original genunix
file is restored.

2) kadb
Do the following:
# halt
ok boot kadb -d
kadb: (type return)
(kernel load messages)
kadb[0]: kmem_flags/W 1f
kadb[0]: :c

On Solaris 2.5 and above you may see the following:

kadb[0]: kmem_flags/W 1f
symbol not found

If so, do this

kadb[0]: :s

Ignore any "text symbol not found" message, then
follow this with:

kadb[0]: kmem_flags/W 1f
kadb[0]: :c

With the kadb method, the change is temporary,
effective only for the one
boot. It will be cleared on the next reboot.

When the system detects a corruption, it will drop to
the PROM "ok"
prompt (or kadb, if loaded). You can then take the
core dump by
doing one of the following:

1) PROM
ok sync

2) kadb:
kadb[0]: $q
ok sync

The advantage of using kadb is that sometimes, with these kinds of
errors, the system cannot take a core dump successfully. In these
cases, the only recourse is to look at the system under kadb.

** What these flags do **

The meaning of the "1f" value is:
  0x01  KMF_AUDIT     turn on transaction logging/auditing
  0x02  KMF_DEADBEEF  overwrite free'd memory, verify on allocation
  0x04  KMF_REDZONE   detect writes past end of buffer
  0x08  KMF_UNMAP     deallocate arenas when all buffers are free'd
  0x10  KMF_VERIFY    verify that free'd addr was allocated
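
The value is a simple bit mask, so 1f (hex) turns on all five flags.
The small helper below, whose name decode_kmem_flags is hypothetical,
shows the arithmetic for any value. It needs a shell with POSIX
arithmetic expansion (ksh or similar, not the original Bourne shell).

```shell
# decode_kmem_flags VALUE
# VALUE uses shell arithmetic syntax, e.g. 0x1f. Prints the name of
# each flag from the table above whose bit is set in VALUE.
decode_kmem_flags() {
    value=$(( $1 ))
    for pair in 1:KMF_AUDIT 2:KMF_DEADBEEF 4:KMF_REDZONE 8:KMF_UNMAP 16:KMF_VERIFY
    do
        bit=${pair%%:*}      # number before the colon
        name=${pair#*:}      # flag name after the colon
        if [ $(( value & bit )) -ne 0 ]; then
            echo "$name"
        fi
    done
}
```

For example, "decode_kmem_flags 0x05" lists KMF_AUDIT and KMF_REDZONE.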
