SUMMARY: imap server sometimes slow

sys013@abdn.ac.uk
Tue, 11 Nov 1997 13:40:38 +0000 (GMT)

Folks,

Here it is at long last! Apologies for the delay. The problem has only recently
been fixed.

Eventually, at Sun's request, we moved Sol2.5 -> Sol2.6. The effect was dramatic. So
far we've not seen any load problems, and can comfortably accommodate 500+ imap
daemons. At this load we have ~60% CPU idle and quite low paging activity.
Under Sol2.5 we hit very high smtx counts (~300) as shown by mpstat at a load of
~400 imap daemons. Now we're seeing typical smtx counts of 10-20 with >500
daemons. Response at the imap clients is excellent at this load.
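
For anyone watching for the same symptom, the smtx column of mpstat (spins on
mutexes, i.e. kernel lock contention) is the figure quoted above; a simple way
to watch it is just:

    # one line per CPU every 10 seconds; watch the smtx column
    mpstat 10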

I suspect the next bottleneck will be disk I/O on /var/mail, which is now
reaching ~40% busy as shown by iostat -x. But we can do something about that
for modest expenditure.
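
(The column to watch for that is %b - percent of time the disk is busy - in
extended iostat output; the interval below is arbitrary:)

    # extended per-disk statistics every 30 seconds; watch the %b column
    iostat -x 30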

On Sol2.5 we had unsuccessfully tried:

removing as much NIS activity as possible: the server was a NIS client and
used NIS for everything in nsswitch.conf; now it's used only for passwd and
netgroup (a rough nsswitch.conf excerpt is sketched after this list). That
reduced TCP activity between it and the NIS server dramatically - total TCP
traffic fell by ~25%!

adding more swap space

adding DNS services (secondary server)

moving SMTP services onto another server

twiddling TCP parameters

all to no avail.
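
For the NIS change above, the relevant part of /etc/nsswitch.conf now looks
roughly like this (an illustrative excerpt, not a verbatim copy of our file):

    # /etc/nsswitch.conf (excerpt) - only passwd and netgroup still use NIS
    passwd:     files nis
    netgroup:   nis
    group:      files
    hosts:      files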

It seems Sol2.5 may have some deficiencies for the kind of load our imap server
gets. There's some evidence from other sites that this problem is NOT apparent
in Sol2.5.1.

People are smiling at me again - even my wife!

Thanks to all who replied:

From: arthur@spool1.mail.troy.psi.com
From: <Glenn.Satchell@uniq.com.au>
From: rali@meitca.com
From: Scott McDermott <scottm@kcls.org>
From: "Eric M. Stone" <erics@cdcna.com>
From: "Karl E. Vogel" <vogelke@c17mis.region2.wpafb.af.mil>
From: birger@Vest.Sdata.No (Birger A. Wathne)
From: Francis Liu <fxl@pulse.itd.uts.edu.au>
From: Bret Giddings <bret@essex.ac.uk>
From: Clive McDowell <C.McDowell@Queens-Belfast.AC.UK>
From: Kevin Worvill <K.Worvill@uea.ac.uk>

Several people suggested using a different imap daemon. This would have been
a very big change, as other imap implementations are thought not to be
compatible with the present IMSP database. So we chose to take Sun's advice
and move to 2.6. It turned out to be a straightforward upgrade, but we hit
problems with the amd automounter, so we're now using the Solaris automountd.
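
For anyone facing the same switch, it is mostly a matter of re-expressing the
amd maps as standard auto_master/auto_* maps; a minimal, purely illustrative
/etc/auto_master looks like:

    # /etc/auto_master - illustrative only, map names are site-specific
    +auto_master
    /net    -hosts    -nosuid
    /home   auto_home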

Original query shown below. Thanks again to all.

Gordon.
-------------------------------------------------------------------------
Gordon Robertson, Central Systems Manager,
Infrastructure Systems Division,
Directorate of Information Systems and Services,
Aberdeen University, Aberdeen AB24 3FX, U.K.
Tel: +44 (0)224 273340
E-Mail: g.robertson@abdn.ac.uk
--------------------------------------------------------------------------
Folks,

We have an SS1000 with 4 CPUs, 384 Mbytes of memory, 2 x SWIFT cards, and an
FDDI (SUNWnfr 4.0) card. There's a fast/wide SCSI disk on one of the SWIFT
cards (s4 below) and 2 x 1 Gbyte internal disks (s0, s1). Both hme interfaces
are connected.

Every now and again, when user load builds up, the system runs very slowly,
with the run queue getting very high (20+) and "sys" CPU time >90% (see the
vmstat output below).

This server runs Sol2.5 and provides imap, pop and sendmail services. The
biggest load is imap support - we see upwards of 400 imap processes when
busy. All is well up to about 420, then the next few cause the
problem. Then when a few drop off, we get back to normal.
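
(The imap process counts quoted here are just a count of imapd processes,
along the lines of:)

    # count running imap daemons; assumes the daemon's process name is imapd
    ps -e | grep -c imapd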

I've checked the hme and fddi interfaces - the packet rates are modest
compared to some of our less powerful NFS servers.
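
(Packet rates were checked with netstat; something like the following, where
hme0 is just one of the interfaces in question and the interval is arbitrary:)

    # cumulative packet counts for all interfaces
    netstat -i
    # or watch a single interface at a 10 second interval
    netstat -I hme0 10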

When running normally with about 400 imaps, there's a fair bit of memory
allocation activity as shown by vmstat, but response time is good when
running such commands, and imap service is good.

Here's some 'vmstat 10' output showing the transition from normal
-> dreadful...

procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s4 -- in sy cs us sy id
0 0 0 410120 9136 0 443 6 0 0 0 0 18 1 10 0 328 7163 917 12 36 53
0 0 0 411164 9532 0 442 5 0 0 0 0 19 0 16 0 282 2414 527 5 24 71
0 0 0 413992 11692 0 309 2 0 0 0 0 15 0 4 0 180 1196 241 5 10 85
0 0 0 415804 12896 0 342 28 0 0 0 0 12 1 12 0 228 1393 247 5 20 74
0 0 0 413328 10900 0 443 23 0 0 0 0 14 0 7 0 230 2931 1030 6 33 61
0 0 0 410708 8764 3 615 14 18 18 0 0 20 0 16 0 393 3785 1292 10 44 46
1 0 0 404952 6496 21 800 4 221 458 0 101 24 0 10 0 475 4521 1441 13 58 29
0 1 0 404480 7660 14 611 48 8 8 0 0 26 2 26 0 545 5390 1110 13 45 41
0 1 0 408172 9108 0 610 5 0 0 0 0 19 0 16 0 451 4302 1277 11 53 36
0 0 0 408872 9368 0 328 7 0 0 0 0 14 0 5 0 203 2610 855 5 28 67
0 0 0 410960 10928 0 129 0 0 0 0 0 11 2 24 0 266 1066 187 2 18 80
1 0 0 410424 10420 0 490 80 0 0 0 0 10 0 21 0 588 4003 1723 8 65 27
9 0 0 408052 8180 1 428 0 8 8 0 0 15 0 1 0 688 3959 1775 8 91 1
8 1 0 408456 7860 4 513 31 14 14 0 0 24 0 12 0 801 3782 1720 9 89 1
10 1 0 407064 7268 6 551 38 26 26 0 0 24 1 12 0 822 3918 1777 8 90 2
0 1 0 410616 10252 1 738 8 1 1 0 0 29 0 18 0 527 3326 878 16 37 46
1 0 0 410272 9976 0 319 9 0 0 0 0 10 0 8 0 330 3533 1300 6 54 39
5 0 0 408792 8388 0 434 65 0 0 0 0 17 2 15 0 713 4419 1903 9 88 3
9 0 0 405100 6180 5 394 0 100 327 968 96 8 0 1 0 562 3640 1537 8 91 1
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s4 -- in sy cs us sy id
18 1 0 401832 6640 20 608 43 182 259 4 31 24 0 27 0 816 4040 1606 12 87 1
19 0 0 400652 6880 9 420 2 48 58 0 4 14 0 8 0 618 3480 1477 9 91 1
16 0 0 403800 8852 0 310 16 0 0 0 0 9 2 17 0 746 3624 1613 9 90 1
11 0 0 403556 8196 0 280 16 0 0 0 0 9 0 15 0 733 3674 1667 9 90 1
11 0 0 406100 9012 0 230 31 0 0 0 0 6 0 22 0 612 2155 1273 6 92 3
16 0 0 408300 10172 0 155 88 0 0 0 0 8 1 25 0 667 2679 1297 7 92 1
13 0 0 409668 10308 0 215 18 0 0 0 0 4 0 34 0 775 2973 1481 9 89 2
19 0 0 409568 10444 0 216 10 0 0 0 0 5 0 5 0 528 2689 1239 6 94 0
12 0 0 410628 12008 0 211 6 0 0 0 0 37 0 6 0 770 4781 1604 14 85 1

Here's some really dreadful vmstats...
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s4 -- in sy cs us sy id
29 0 0 444512 8080 18 79 32 11 11 0 0 2 0 25 0 565 2283 1180 5 94 1
33 0 0 446080 8376 19 88 24 27 27 0 0 7 2 35 0 588 1900 946 6 91 3
35 0 0 447876 9980 3 122 51 0 0 0 0 5 0 21 0 550 1943 1142 6 94 1
30 0 0 449176 11032 0 111 0 0 0 0 0 1 0 10 0 423 1496 901 6 94 0
28 1 0 448472 10240 0 200 258 0 0 0 0 4 0 33 0 632 2331 1244 8 89 3
33 0 0 449104 9192 0 194 31 0 0 0 0 4 0 18 0 701 2768 1519 16 84 0
30 0 0 447280 8496 0 354 0 0 0 0 0 4 0 3 0 585 2545 1287 18 82 0
28 0 0 445916 7328 0 146 0 0 0 0 0 1 0 0 0 367 1574 850 5 95 0
33 0 0 445008 6648 0 178 1 0 0 384 0 3 2 8 0 420 1967 1024 5 94 0
20 0 0 443972 6344 14 119 0 116 177 432 37 3 0 18 0 570 2277 1221 5 95 1
30 0 0 446844 8804 0 131 1 0 0 0 0 4 0 10 0 540 2221 1158 5 93 1
25 0 0 446112 8248 0 225 2 0 0 0 0 3 0 8 0 534 2370 1173 13 87 0
22 0 0 445832 8080 0 160 0 0 0 0 0 2 0 14 0 552 2223 1132 9 91 0
17 0 0 445396 7684 0 241 2 0 0 0 0 3 0 30 0 758 3033 1468 9 91 0
22 0 0 446264 8420 0 72 0 0 0 0 0 2 0 8 0 393 1978 962 4 95 0
20 0 0 446012 8188 0 280 2 0 0 0 0 5 2 23 0 566 2389 1135 7 92 1
23 0 0 445700 7996 0 168 24 0 0 0 0 4 0 16 0 673 3091 1482 7 93 0
21 0 0 444876 6768 0 155 67 1 1 0 0 1 0 11 0 463 2239 1065 10 90 0

It seems like I have hit some hard limit somewhere, for this change from
good -> poor performance to happen so suddenly.

In my /etc/system file I have 'set maxusers=256'.

I had a look at 'kmastat' output from "crash" and it seems that perhaps
some buffer limits are low, eg...
cache name          size  avail  total    in use  succeed  fail
----------          ----  -----  -----  --------  -------  ----
kmem_alloc_8192     8192      4    929   7610368   351307     0
ptbl_kmcache         272      1    910    266240      909     0
pt_kmcache          4096      0    909   3723264      909     0
inode_cache          320     45   9036   3084288     9036     0
rnode_cache          376     19   4460   1826816    85615     0
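
(That table came from running the kmastat function inside crash on the live
system, i.e. something like:)

    # dump kernel memory allocator statistics from the running kernel
    echo kmastat | crash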

I'm not sure how to interpret this properly, though. If I'm right, I seem to
be near the 'bufhwm' limit, which I think defaults to 2% of memory (2% of
384 Mbytes is roughly 7.7 Mbytes) and so sits very near the "in use" value of
kmem_alloc_8192 (about 7.3 Mbytes). Should I set 'bufhwm' and/or change
maxusers?
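
(If I do end up experimenting, the tunables would go in /etc/system alongside
the existing maxusers line, something like the sketch below; the values are
purely illustrative and I don't yet know whether either actually helps:)

    * /etc/system (excerpt) - illustrative values only; takes effect on reboot
    set maxusers=256
    * raise the buffer-cache high-water mark (bufhwm is specified in Kbytes)
    set bufhwm=16000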

Can anyone suggest what the problem might be, or what to
investigate next, or even provide a cure (preferably without spending
money)?

Gordon.