SUMMARY: NFS Timeout

Lau, Victoria H (vlau@msmail2.hac.com)
Thu, 29 May 1997 13:43:57 -0800

It has been a month since I posted the "NFS Timeout" question below.
The problem still exists today, but intermittently. I had both network
and hardware personnel involved: they checked all the hardware and took
some heavy traffic generators off the net. That helped somewhat, but the
problem always comes back to haunt me after a few days or a week.

I did not mention in my original post that when I was copying files,
it was between a local file system and an AIX (4.2) NFS-mounted file
system. This AIX NFS-mounted file system is the home file system for
all users in the project. Since these two operating systems run
different NFS versions (Solaris v3, AIX v2), rlogin and cp between the
systems behave differently because they do not use the same protocols.

From all the responses, I added/changed the following files and
patches:

/etc/auto_direct (clients):
/sun_local -rw,intr stfsun1:/sun_local

/etc/init.d/inetinit (both server and clients):
#increase the pending TCP connection queue (listen backlog) from 32 to 1024
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max 1024
#increase the default TCP transmit/receive buffer sizes
/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 65535
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 65535
#raise the maximum TCP congestion window
/usr/sbin/ndd -set /dev/tcp tcp_cwnd_max 65534
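
These settings take effect as soon as the script runs; to double-check
the values currently in effect on a machine, the same tunables can be
queried back:

/usr/sbin/ndd -get /dev/tcp tcp_conn_req_max
/usr/sbin/ndd -get /dev/tcp tcp_xmit_hiwat
/usr/sbin/ndd -get /dev/tcp tcp_recv_hiwat
/usr/sbin/ndd -get /dev/tcp tcp_cwnd_max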

Added the following patches:
- 2.5.1 recommended (includes 103582-10 tcp)
- 103903-02 (le-for sun4m only)
- 104166-01 (nfs)
- 104212-03 (hme-for sun4u only)
- 104672-02 (nfs)
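
A quick way to confirm that a given patch actually made it onto a
machine is showrev, e.g.:

showrev -p | grep 103582
showrev -p | grep 104672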

Credits:
=======
I sincerely thank the following Sun-Managers for helping me with the
problem, especially Justin Young, who continuously supported me with
new ideas:
- Peter Marelas
- Glenn Satchell
- Stuart Little
- Justin Young
- D. Ellen March
- Marc S. Gibian
- Wendy Mullett
- Karl E. Vogel
- Marcos Campos de
- G. Bhaskar

Original Question:
=================
We have an Ultra II (stfsun1) running Solaris 2.5.1, serving as our
nfs file server, exporting a file system to all hosts. The entry
in the dfstab is:

share -F nfs -o anon=0 /sun_local

On the Sun clients, also running Solaris 2.5.1, we have automounted
this file system as follows:

/etc/auto_master: /- /etc/auto_direct -intr
/etc/auto_direct: /sun_local -rw,intr,timeo=20 stfsun1:/sun_local

We have no problems accessing stfsun1 from all the clients (rlogin,
rsh, etc.). But, whenever we copy files to/from /sun_local
on the clients, the following messages appear both on the server
and on the client:

NFS server stfsun1 not responding still trying
NFS server stfsun1 ok

This goes on for a long time, stretching a copy that should take
seconds into hours. Where do I start troubleshooting? If this is a
hardware issue, why don't I see the above messages when I rlogin to
this server from the same client? I can rlogin to the server from the
client and edit files all day without a problem, as long as I don't
copy files from or to /sun_local on the client.

Responses:
=========
I'd say it's using TCP for NFS, which is a connection-oriented transport.
The default timeo for TCP is 100 tenths of a second, and 11 for UDP.
You're setting it to 20 for TCP, which is one fifth of the default.
I would remove "timeo=20".
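
To see which transport and timeout the client actually negotiated for
each NFS mount (including /sun_local), the per-mount statistics can be
listed on the client:

nfsstat -m

The output shows the flags in effect for each mount, including vers=,
proto= and timeo=.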
================
It can be a hardware problem, because rlogin typically uses very
small packets while NFS uses large packets. Maybe you have a hub or
slow clients that can't keep up with the fast Ultra?
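
A quick check for that kind of problem is the interface error and
collision counters on both the server and the clients:

netstat -i

Non-zero and steadily growing Ierrs/Oerrs/Collis counts usually point
at cabling, hub, or duplex trouble rather than NFS itself.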
================
As a matter of interest, have you ensured that
stfsun1 doesn't try to mount /sun_local itself? I don't know
what 2.5.1 does since I've not tried it, but mounting
server-exported filesystems on the server at the same mount point
used to cause problems on 2.3/2.4.

If you haven't, then one fix is to add
+/etc/auto_null in /etc/auto_master before
anything else, then an entry of the form

/sun_local -null

in auto_null.

Either that or always mount exported filesystems somewhere
different to the exported path.
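
Spelled out, the auto_null fix above would look something like this
(a sketch only; check the automounter map syntax on your release):

/etc/auto_master:
+/etc/auto_null
/-          /etc/auto_direct  -intr

/etc/auto_null:
/sun_local  -null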

I also assume that stfsun1 was restarted, or at least
/etc/rc3.d/S??nfs.server ran if this was the first entry in
dfstab for stfsun1.

How many clients do you have? If a lot, then change /etc/rc3.d/S15nfs.server
to start more nfsds. Look for /usr/lib/nfs/nfsd -a <number> and change the
number to, say, 2 * the number of active clients.
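
For example, with roughly 30 active clients the line might be changed
from something like

/usr/lib/nfs/nfsd -a 16

to

/usr/lib/nfs/nfsd -a 64

(the number shipped in S15nfs.server may differ on your release; the
exact value just follows the "2 * active clients" rule of thumb above).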
================
#1) Make sure you've patched your server with the tcp patch and hme patches
from sunsolve. In addition, do the recommended kernel patches, etc.
tcp patch 103582-10
hme patch 104212-03
nfs patch(es) 103600-12,104672-02,104166-01

#2) Add the following lines to /etc/rc2.d/S69inet
#increase the pending TCP connection queue (listen backlog) from 32 to 1024
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max 1024
#increase the default TCP transmit/receive buffer sizes
/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 65535
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 65535
#raise the maximum TCP congestion window
/usr/sbin/ndd -set /dev/tcp tcp_cwnd_max 65534
#Ignore those people who tell you to modify your /etc/system
#changes to the kernel are dangerous and you pretty much get the
#same results from /etc/rc2.d
#The only difference is that /etc/system changes work in single user
#mode. If it means that much, change /etc/rc1.d, too.

[IBM suggested to me that we should downgrade the Solaris
systems' nfs version from v3 to v2. --vicky lau]

The Sun server is perfectly capable of auto-switching to nfs v2.
However, nfs v2 is not very efficient, and Solaris is even less
efficient when it has to switch between nfs v2 and v3.
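
If you did want to test IBM's suggestion, v2 can be requested from the
Solaris client side with the vers= mount option, for example:

mount -F nfs -o vers=2,rw,intr stfsun1:/sun_local /sun_local

or by adding vers=2 to the options in /etc/auto_direct for the
automounted case. (A sketch only; it is not among the changes listed
at the top of this summary.)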

Gigabit products won't be out till later this year.

If your company can afford it, you might consider SONET (ATM-622).

That's about the only thing I can think of to throw at it. That way the
packets will get there quicker and *hopefully* won't have time to time out.

Solaris 2.6 *should* fix any Solaris-related network problems.

I just reproduced your error. Not on a subnet though.

SUNW,hme0: late collision ???

What's the deal with your CPU? I had about 20 engineering students who
all decided to do their analysis at the same time. They *choked* my
Ultra Enterprise 2. Oh well, I told them that they should have bought a
3000.

My load average was above 5 at one point and the idle on the CPU was 0.0%.

Now I'm getting collisions, etc.

That's an easy explanation: the CPU was so busy with everything else
that the scheduler didn't have time for the network.

The fact that you even have a queue column suggests a problem. However,
I'm not convinced that throwing higher bandwidth at it is the proper
solution.
================
Check the number of nfs daemons running on your server to serve the
clients; if it is insufficient, please change the value and restart the
daemons, e.g. nfsd XX, where XX stands for the number of daemons.
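
On Solaris 2.x, one way to restart the server-side NFS daemons after
changing the count is the standard init script:

/etc/init.d/nfs.server stop
/etc/init.d/nfs.server start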
================
SRDB ID: 11153

SYNOPSIS: NFS mounts hang, get SERVER NOT RESPONDING

DETAIL DESCRIPTION:

Hosts remotely mount other hosts, oftentimes via routers or bridges, with
straight NFS mounts or with automount. I tried installing various kernel
and NFS Jumbo patches with no success.

Any process accessing those remote hosts just hangs forever or gets
the message:
NFS SERVER <servername> NOT RESPONDING
...sometimes followed by NFS SERVER <servername> OK

showmount -e <servername> listed the remote exports file as I expected, so
I know that I can talk to the server's NFS process.

SOLUTION SUMMARY:

NFS SERVER NOT RESPONDING means that many things could be at fault:
1. The server is down or unable to respond (e.g. too busy)
2. The network is not reliable
3. Software or firmware problems on any component in the network, possibly
including the NFS client and/or NFS server.

For case 1, lighten the load of the server, or migrate files to a less busy
server.

For case 2 and 3, we recommend the following workaround of
changing mount options, as in the following examples:

/usr/etc/mount -orsize=1024,wsize=1024,timeo=15 server:/disk /mnt (SunOS)

/usr/sbin/mount -F nfs -o rsize=1024,wsize=1024,timeo=15 server:/disk /mnt
(Solaris)

The 1024 read and write packet size allows NFS requests/responses to squeeze
inside a single network (e.g. ethernet) packet instead of the default 8k size.
This helps eliminate fragmentation across a bridge or router, as well as
UDP packet reassembly, although the actual NFS performance is somewhat slower.

Increasing the initial request timeout from 7 to 15 (units are tenths of
seconds) often helps in congested networks.
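
For an automounted setup like the one in the original question, the
equivalent of this workaround would go into the direct map rather than
onto a mount command line, e.g. (a sketch only):

/etc/auto_direct:
/sun_local -rw,intr,rsize=1024,wsize=1024,timeo=15 stfsun1:/sun_local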

Please note that we also recommend installation of any NFS and kernel
jumbo patches.

PRODUCT AREA: Gen. Network
PRODUCT: NFS
SUNOS RELEASE: any
HARDWARE: any

Thank you, everyone.

Vicky Lau
vlau@msmail2.hac.com