SUMMARY: disk replacement

Melanie Dymond Harper (mel@vanyel.herald.co.uk)
Fri, 12 Sep 1997 17:28:45 +0100 (BST)

Thanks to:

sweh@mpn.com
spr@banjo.myxa.com (whose original answer I have misplaced)
oyang@phoebe.fcit.monash.edu.au
cmconwa@sandia.gov

and quite a few others whose mail I managed to delete in a fit of insanity :)

My original question was:

-----------------------------------------------------------------------------
Hi all;

A disk in one of our servers has started to show errors, and we're looking to
replace it with the minimum of downtime. Murphy's Law dictates that it would
be the boot disk :)

Is it possible -- and have any of you ever tried it? -- to do this by
installing the OS on a disk in another machine, swapping the two disks over,
and then rebooting? Obviously there will be some restoration of other files
needed, but we can do that once the machine is up and running again. (I guess
I could also try shifting the OS to one of the other, intact, disks, but I
don't have an install CD at this site. If it can be done without one, I would
be interested.)

Also, I'd like to let repair handle the errors, but there's one block that
repair keeps failing on. What sort of thing might be causing that? (This block
originally contained part of /var/adm/utmpx, which made for a very interesting
time until that file was moved elsewhere.)

TIA and I'll summarise. Sparc 10 running Solaris 2.4, if it matters.

----------------------------------------------------------------------------

What I eventually did was to bite the bullet and reinstall onto the server
needing the new disk, after backing the failing disk up onto a large spare
partition I had. Restoring the config files and user data from the backup
was not too tricky, and the total downtime was just about acceptable.

I could probably have made the spare partition bootable if that disk had had
any swap space on it, but it didn't.

The consensus seems to be that installing on a disk in another machine is
perfectly okay so long as the machines share an architecture and you remember
to get the SCSI IDs straight when installing. Also, if you're copying stuff
across rather than installing from CD, remember that 'installboot' is your
friend.
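
For what it's worth, a minimal installboot invocation looks something like the
following (c0t3d0s0 is just an example device; the location of the ufs bootblk
varies between Solaris 2.x releases, so check installboot(1M) on your system
first):

    # write a fresh boot block onto the raw root slice of the new disk
    # (bootblk path shown is the one from later 2.x releases -- verify with the man page)
    installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t3d0s0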

Nobody came up with any great theories about why repair should be unable to
handle some faults; 'phase of the moon' was the most common suggestion :)

==========================================================================

Hmm. If you have a spare SCSI ID, install the new disk in the machine, then
back up the existing root disk and restore it onto the new disk
(ufsdump/ufsrestore). Shut the machine down, remove the old disk, change the
ID of the new disk, boot from CD, run installboot (perhaps you can installboot
while it still has the wrong ID... I dunno), and reboot from the hard disk.
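
A rough sketch of the copy step, assuming the failing root disk is c0t3d0 and
the new disk shows up as c0t1d0 (substitute your own device names, and do this
in single-user mode so the root filesystem is quiet):

    # make a filesystem on the new root slice and mount it
    newfs /dev/rdsk/c0t1d0s0
    mount /dev/dsk/c0t1d0s0 /mnt

    # copy the old root filesystem across in one pipeline
    ufsdump 0f - /dev/rdsk/c0t3d0s0 | (cd /mnt; ufsrestore rf -)

    # then make the new disk bootable with installboot, as shown above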

That's how I would try and minimise downtime anyway :-)

==========================================================================

I did this once. Not difficult. What I did was to:
1. shut down and plug the new disk into the SCSI chain,
2. partition the new disk to match the old one (see the prtvtoc/fmthard sketch
   after this list),
3. ufsdump from the problem disk to the new disk, partition by partition
   (if you hit bad sectors, just let them time out and record where they are),
4. run installboot on the new disk (the man page has the exact command),
5. shut down, remove the old disk, and change the SCSI ID of the new disk to
   that of the old one,
6. reboot. That's it.
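
For step 2, if the new disk is the same size and geometry as the old one, the
quickest way to duplicate the partition table is to pipe prtvtoc into fmthard
(device names below are examples only; for a disk of a different size you'll
need to partition it by hand with format):

    # copy the label/VTOC from the old disk (c0t3d0) onto the new disk (c0t1d0)
    prtvtoc /dev/rdsk/c0t3d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2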

Steps 2 and 3 can be done via the network on another machine.
(You can ufsdump and dump the results to a file, then ufsrestore it on the
other system; there's a sketch of this below.)
As long as you've got another SS10 running 2.4, you can do step 4 on that
machine as well.
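
A sketch of the over-the-network variant, assuming the other machine is called
'otherhost', rsh access is already set up, and there's enough space for the
dump file (all names here are illustrative):

    # on the ailing machine: dump the root filesystem to a file on the other box
    ufsdump 0f - /dev/rdsk/c0t3d0s0 | rsh otherhost 'dd of=/var/tmp/root.dump obs=32k'

    # on the other machine: restore into the newfs'ed, mounted new root slice
    cd /mnt
    ufsrestore rf /var/tmp/root.dump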

===========================================================================