SUMMARY: hme on Sparc 20

David Robson (robbo@satech.net.au)
Fri, 14 Mar 1997 22:19:34 +0930

I haven't worked out how to pull all the messages together in Eudora, so cut
and paste will have to do!

The majority of replies fell into two obvious camps: tuning or patches.
Since I have raised a call with Sun, I have applied the latest patches, both
the recommended clusters and the latest from Sun, including the latest hme
driver - not even available as a patch yet!
The theory at present is that the interface fails momentarily, enough to
screw NFS, and then starts working again, leaving NFS broken but all other
services OK!
I see this problem as twofold. If there is a problem with hme only, be
it tuning or a bug, why can't I unmount or unshare filesystems or, more
importantly, sync the disks?!
At present we are running off le1. Next week we will venture back to the hme,
at 10Mbit at first and then 100hdx, using some of the tuning parameters
suggested by Francis.Liu@uts.edu.au.
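
For reference, the settings to force the hme interface down to 10Mbit half
duplex look like this (a sketch using the standard hme driver parameters;
verify the names against your driver revision before trusting them):

ndd -set /dev/hme adv_autoneg_cap 0    # no autonegotiation
ndd -set /dev/hme adv_100fdx_cap 0     # don't advertise 100 full duplex
ndd -set /dev/hme adv_100hdx_cap 0     # don't advertise 100 half duplex
ndd -set /dev/hme adv_10fdx_cap 0      # don't advertise 10 full duplex
ndd -set /dev/hme adv_10hdx_cap 1     # advertise only 10 half duplex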

Thanks to the following:
justiny@cluster.engr.subr.edu
varshney@pacbell.net
ake@cs.umu.se
blymn@awadi.com.au
adam@scl.cwru.edu
rickv@mwh.com
fletch@ttmc.com
larry@mitra.com
baldma@aur.alcatel.com
padovani@aaec.com

And here are a few edited highlights:

-----------------------------paste------------------------------------
Hmm... Did you apply all the various networking-related patches to 2.5?
I know there are TCP, NFS, and hme patches. In addition, there are a few
other NFS performance tricks you can do, such as adjusting the
maximum # of TCP connections. Since you said the machine was heavily
loaded, I have a feeling that tuning is also part of the problem. What
does your nfsstat say?
-----------------------------end paste------------------------------------
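
(For anyone wanting to do the same check: 'nfsstat -s' shows the server-side
RPC and NFS counters - badcalls and retransmission-related fields are the
interesting ones - and 'showrev -p' lists the installed patches. The output
format varies a little between Solaris releases.)

nfsstat -s      # server-side RPC/NFS statistics
showrev -p      # list installed patches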

Again, a tuning issue...
-----------------------------paste------------------------------------
In my experience, there are certain parameters BOTH NICs have to be set to.
That is, forcing the ethernet card (non-Sun) to 100bt always, half
duplex... here are a few other parameters you might want to check:

FIFO threshold
fair PCI bus arbitration or not (if a PCI bus is used)

We had a similar problem with some realtime OS with a DEC 100bt card that
was loading its runtime files from a Sun server. NFS had tons of timeouts
and the resulting throughput was like 1.4Mbps when it should have been about
4-5 at least. Luckily the programmer we contracted was able to talk to the
people at DEC, change the parameters I mention above, and play around
with them to get maximum throughput (setting 100bt with no autonegotiation,
half duplex, and I think bus arbitration to true REALLY helped).
-----------------------------end paste------------------------------------
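
Before forcing anything, it's worth seeing what the hme interface actually
negotiated. A quick check using the standard hme driver status variables
(all 0/1 flags; select the instance first on a multi-interface machine):

ndd -set /dev/hme instance 0     # select hme0
ndd -get /dev/hme link_status    # 1 = link up, 0 = link down
ndd -get /dev/hme link_speed     # 1 = 100Mbit, 0 = 10Mbit
ndd -get /dev/hme link_mode      # 1 = full duplex, 0 = half duplex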

-----------------------------paste------------------------------------
There is/was a problem with nfs_readmap and massive NFS traffic.
An easy check would be 'set nfs_readmap = 0' in /etc/system and a reboot, but
please check this for correctness with Sun first; it's a long time since we
got rid of this.
I don't remember if they fixed it in 2.5 or not (I think we had the problem
in 2.4), but under 2.5.1 it's gone.
(Are you running Solstice Backup and Veritas for an SSA? If so, there is a bug
in Veritas which causes filesystem hangs and mostly affects your nfsd's.)
-----------------------------end paste------------------------------------
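
For what it's worth, the workaround quoted above is a one-line /etc/system
entry followed by a reboot (as the poster says, confirm the variable with Sun
first - this is only what worked for them):

* Disable NFS read mapping - check with Sun before using
set nfs_readmap = 0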

-----------------------------paste------------------------------------
Patch-ID# 102979-02
Keywords: memory leak be hme link pulses interrupt level 6 SUNW,hme? qe
Synopsis: SunOS 5.5: /kernel/drv/be, /kernel/drv/hme and /kernel/drv/qe
fixes
Date: Apr/26/96
-----------------------------end paste------------------------------------
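
A quick way to see whether that patch (or a later revision) is already
installed:

showrev -p | grep 102979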

-----------------------------paste------------------------------------
You should try the following settings. These helped to clear up our NFS hangs.

ndd -set /dev/hme adv_autoneg_cap 0
ndd -set /dev/hme adv_100fdx_cap 0
ndd -set /dev/tcp tcp_xmit_hiwat 65535
ndd -set /dev/tcp tcp_recv_hiwat 65535
ndd -set /dev/tcp tcp_cwnd_max 65534
ndd -set /dev/tcp tcp_conn_req_max 64
ndd -set /dev/ip ip_forward_src_routed 0

These commands can be run in a shell or put in an init script. These changes
to the default setup say, respectively:
Don't negotiate half or full duplex or even the speed of the link
Turn off 100Mbit full duplex (i.e., run in 100 half duplex mode)
Change the TCP transmit/receive parameters to more friendly values
Turn off source routing

Also, on your Cisco switch, remember to turn off full duplexing and auto-
negotiation for the SS20 (and any other connected hme interfaces). You
can also try changing the NFS read and write sizes in case you find that it is
a fragmentation or timing problem.
-----------------------------end paste------------------------------------
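
To make these settings stick across reboots, something like the following
init script will do (the script name and run-level link are just an example):

#!/bin/sh
# Example only: save as /etc/init.d/nddconfig and link it as
# /etc/rc2.d/S69nddconfig so it runs at boot.
ndd -set /dev/hme adv_autoneg_cap 0
ndd -set /dev/hme adv_100fdx_cap 0
ndd -set /dev/tcp tcp_xmit_hiwat 65535
ndd -set /dev/tcp tcp_recv_hiwat 65535
ndd -set /dev/tcp tcp_cwnd_max 65534
ndd -set /dev/tcp tcp_conn_req_max 64
ndd -set /dev/ip ip_forward_src_routed 0
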
Robbo (David Robson)
Davtin Systech Pty.Ltd.