[SUMMARY] bizarre socket problem

John Vaughan (jvaughan@onlinemagic.net)
Tue, 17 Feb 1998 14:20:17 -0500 (EST)

The situation:

Netscape Enterprise web servers wouldn't restart after a stop - they failed
with 'address in use', even though both netstat and lsof claimed nothing was
bound to the address in question.

Thanks to the following people (with their suggestions attached):

From: Derek Eichele <Derek_Eichele@AIMFUNDS.COM>

I had similar behavior occur. I had a daemon crash and even
though I couldn't find any trace of it (netstat, lsof, etc),
it also claimed the address was in use. It was a production
box but it was a critical service so I was forced to
(eventually) reboot.

From: Scott VanRavenswaay <scottvr@dfw.dfw.net>

This may be obvious and something you've already checked but typically on
our virtual host servers, that is caused by a hung CGI script. If you
haven't already, check out your 'ps -ef' or '/usr/ucb/ps auxw' listing and
see if there are any 'nobody' (or whatever your server runs as) processes
still running. Of course, that *should* show up in netstat; if it *is* the
case yet nothing shows up there, there are obviously other problems, with
many possible implications I'm sure you're aware of.

From: David Schiffrin <daves@adnc.com>

I don't have any insight into your problem, except I've seen connections
that should be in SYN_RCVD not show up in netstat too. A solution which
should let you avoid the reboot though is:

ifconfig hme0:144 down
ifconfig hme0:144 -trailers aaa.bbb.ccc.144 up

I haven't tried this, as I don't have your problem, but I'd be interested
to know how it goes.

From: David Dhunjishaw <dave@colltech.com>

The patches may help. One kind of ugly hack would be to "down" the
interface and then "up" it again, i.e.:

./stop #stop the web server
ifconfig le0:6 down
ifconfig le0:6 <ip_address args, etc.>
ifconfig le0:6 up
./start

This doesn't solve the problem, but it does allow you to fix the "address
already in use" problem without rebooting the machine. Of course, if this
interface is also used for other connections, i.e. telnet, ftp, etc., then
any existing connections will be lost when you "down" the interface.

From: Arthur Darren Dunham <add@netcom.com>

Had a similar problem (we actually had stuff show up in netstat
though, it was just wacky).

Have you tried ifconfiging the interface down and back up? It worked
for me twice with some odd stuff.

----

Unfortunately, none of this seemed to work consistently.

After a bit more investigating and head-scratching, I think I've finally
figured out what the problem was.

The box runs an in-house chat server system, and one of its elements
attaches itself to port 80. netstat showed that _something_ was bound
to *.80:
[647]root# netstat -an |grep BOUND
*.80 *.* 0 0 33232 0 BOUND

and lsof, with the Netscape servers filtered out, showed that the
only thing left was this element of the chat server.
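
For anyone wanting to automate that check, here is a rough Python sketch
that scans 'netstat -an' output for wildcard entries on a given port. It
assumes the Solaris layout shown above, where the first column is the local
address and a wildcard bind appears as '*.<port>'. The function name
wildcard_entries and the second sample line are made up for illustration.

```python
def wildcard_entries(netstat_output, port):
    """Return netstat lines whose local address is the wildcard '*.<port>'."""
    target = "*.%d" % port
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumption: first column is the local address, as in the
        # Solaris 'netstat -an' line quoted in this post.
        if fields and fields[0] == target:
            hits.append(line.strip())
    return hits


# First line is the one from this post; the second is a hypothetical
# entry for a specific address, included only to show it is skipped.
sample = """\
*.80 *.* 0 0 33232 0 BOUND
aaa.bbb.ccc.144.80 *.* 0 0 8576 0 LISTEN
"""

if __name__ == "__main__":
    for entry in wildcard_entries(sample, 80):
        print(entry)
```

The match is on the exact string '*.80', so entries such as '*.8080' or
address-specific ones are not reported.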

As you may remember, Solaris 2.5.1 has this habit of bind()ing to the
relevant port on _every_ virtual interface unless you tell it otherwise
(hence the problems with old versions of BIND running out of fds). That
is what was happening here, but netstat and lsof were being 'vague' about
it. The exacerbating factor was that the chat server gets restarted
every hour or so, meaning that it tries to rebind to every interface it
can every hour - hence the problem. Also, the chat server uses TLI, so
I have no idea what it might be trying to do when I'm not looking.
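
The general effect is easy to reproduce with ordinary sockets. The sketch
below is not the chat server's code and does not use TLI; it only assumes
plain BSD-style sockets with default options (no SO_REUSEADDR). One socket
binds to the wildcard address, then a second one tries a specific address
on the same port. The name specific_bind_errno is made up for the example.

```python
import errno
import socket


def specific_bind_errno(addr="127.0.0.1"):
    """Bind a wildcard socket, then try a specific address on the same port.

    Returns the errno of the second bind(), or 0 if it succeeded.
    """
    wild = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    specific = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # "" means INADDR_ANY (the '*' that netstat shows);
        # port 0 lets the kernel pick a free port for the demo.
        wild.bind(("", 0))
        port = wild.getsockname()[1]
        try:
            specific.bind((addr, port))
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        specific.close()
        wild.close()


if __name__ == "__main__":
    code = specific_bind_errno()
    print(errno.errorcode.get(code, "bind succeeded"))
```

With default socket options the second bind should fail with EADDRINUSE,
which is the same error the web servers were reporting for their own
addresses.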

So, the conclusion is - even if it appears that nothing can be doing
this, something, somewhere is most likely trying to bind to that port
on all the interfaces...

-- 
John