SUMMARY: Watchdog & forceload errors (SPARCserver 1000E)

tommi.ripatti@siemens.fi
Sat, 14 Feb 1998 17:26:11 +0200

Hello again,

I got some replies concerning these weird crashes.
The bottom line in this case seems to be that
almost certainly nothing can be done without
replacing hardware...

The output (see below) is from POST (power-on
self test) and the prompt '0A' refers to board
0 processor A.
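
If you capture the POST output to a file, the prompt
can be split off mechanically. Here is a minimal
sketch in Python. It assumes every prompt is a board
number followed by one processor letter and '>';
that shape is my reading of the output, not something
taken from the Sun manuals, and parse_post_line is
just a hypothetical helper name.

```python
import re

# Assumed prompt shape: board digits, one processor letter, then '>'.
PROMPT_RE = re.compile(r"^(\d+)([A-Z])>\s?(.*)$")


def parse_post_line(line):
    """Split a POST output line into (board, processor, message).

    Returns None for lines that do not start with a
    board/processor prompt (banners, blank lines, etc.).
    """
    match = PROMPT_RE.match(line.strip())
    if match is None:
        return None
    board, cpu, message = match.groups()
    return int(board), cpu, message
```

For example, parse_post_line("0A> Xbus Parity Error")
gives board 0, processor 'A' and the message text.
Note that the log quoted in my original posting
appears to be typed with a letter 'O' instead of a
zero in the prompt, so those lines would need
correcting before they match.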

This particular configuration has two boards
(slots 0 and 1). Board 0 has two SuperSPARC II
processor modules and 128MB of RAM. Board 1
has only 128MB of RAM.

The fault might be in processor A or in board
0 memory. Swapping memory and replacing
processor modules might help but probably
one would have to change the whole board.

If this doesn't help - well, then you probably
have to do some unwanted spending.

SPARCserver 1000 System Installation & Service
manuals proved to be a bit illuminating but
they didn't actually solve the problem or
help me diagnose the source of the problems.

There is an easy way of getting quite a lot
of diagnostic data out of a SPARCserver 1000E:

Just turn the key switch clockwise from the 'Standby'
position to the 'Diagnostic' position.

This starts the POST self-diagnostic program. One
should probably connect a line printer to serial
port A of the server's board 0 (instead of a
terminal, that is).

Another important point I learned: if the system
is in an unknown state and you wish to enter the
OpenBoot program, just push the reset button (under
the front panel), press v and press s.

Anyway, my thanks for answering my question
go to:

mark.baldwin@aur.alcatel.com and
Ed Weller (garyn@solar.wa.com)

---
Tommi Ripatti
Helsinki University of Technology
trainee @ Siemens Finland
---

Here's my original posting:

> Hello Sun Managers,
>
> Our customer has a SparcServer 1000E with 2 CPUs
> which we had to replace with a similar machine of our own.
>
> The thing is we don't have a service deal with Sun, so any
> help would be extremely welcome before we contact Sun.
>
> First we got, at approx. 24h intervals, the following
> error on the console:
>
> -----cut-----
> Error log analysis for Board 0.
>
> OA> *IOC0
> OA> Multiple Errors
> OA> Client Device Error, Internal Error (s)=XI0ERR
> OA> *SBI
> OA> Xbus Parity Error
> OA> Xbus Protocol Error
> OA> Log Date: Jan 18 23:08:08 GMT 199E
> OA> CPU A Function at time of error: CO IOC
> OA> CPU B Function at time of error: CO Alt Sync 1
> OA> CO IOC Test caused System Watchdog Reset LED=00000020
> OA>
> *** System Watchdog Reset***
> OA>
> -----cut-----
>
> After this message the system froze completely and
> you had to boot it by toggling the power.
>
> Now the server doesn't boot up at all. We only
> get the following errors, after which the system seems
> to halt totally.
>
> -----cut-----
> WARNING: sbus_add_hard cseight3: autovectored
> interrupt at level 0x5 exceeded tuneable sbus_nvect
>
> WARNING: forceload of drv/pio2 failed
> WARNING: forceload of drv/sio4 failed
> -----cut-----
>
> The Question:
> o Is, as I suspect, the server hardware malfunctioning and if so
> what is the most probable component that has to be replaced?
>
> Any help would be welcome. The local Sun Service charges for
> maintenance visits almost $200 per hour!!
>
> Tommi Ripatti
> tommi.ripatti@siemens.fi