Disaster recovery for clusters

CFPDSA · 27-01-2008

My new company uses clustering, but no one really knows what they are doing.
I am new to it myself, but I'm doing an Exchange DR review and want to plug
the holes.

The clusters are wonderful. We use SAN-based clusters on Dell hardware,
about ten 2-node active/passive clusters in total. They work great,
fail over beautifully, and I am now in love with clustering, especially for
Exchange.

The scary question is this: what happens if we shut down the nodes on a
cluster (or lose power or something), and when we go to start them, neither
one works? In other words, if we have to restore from backup, can we do it?

From my current reading, the answer would be no, because we are not backing
up anything other than the stores at the moment.

The information out there on recovering an entire cluster is sketchy at
best. The Exchange DR Ops guide suggests that all that is needed is a system
state backup, but when I test this using a VMware SCSI-based cluster,
restoring the system state onto a fresh OS install results in the cluster
service failing to start.

There are hints in that guide and elsewhere that what is needed is an ASR
backup (for the disk signatures). This is fine (and I am in the process of
testing it) but seems impractical for an enterprise backup solution. We are
using BackupExec here at the moment and contemplating switching to Commvault.
AFAIK, neither of these does ASR backups, so we'd have to manually run an ASR
backup on each cluster (the active node at least). And what about the
floppy? Or excluding certain files from the backup? The ASR option just
doesn't make practical sense in a production environment.

What am I missing here? How do most enterprises do their cluster backups?

Thanks for any input...

-JC

Edwin vMierlo [MVP] · 27-01-2008

Have a look at these documents first
http://www.microsoft.com/technet/pro.../sercbrbp.mspx
http://support.microsoft.com/kb/887017

That should get you started. Please post specific questions back to this
newsgroup.

rgds,
edwin.

CFPDSA · 28-01-2008

Ok, here is the current scenario I am troubleshooting:

3 VMs:

DCEXCH - DC/Exchange combo (this is a lab environment)
NodeA/NodeB - two functional nodes hosting an Exchange 2003 clustered
server using a shared SCSI bus for the clustered resources (quorum drive, log
drive, data drive). IDE drive for system/boot partition.

Steps to reproduce the problem:

1) Take a system state backup of NodeA (plus the Program Files directory)
and back it up to a share on DCEXCH. Also back up the system state of NodeB
and the Exchange stores using an Exchange API based backup. All of this uses
NTBackup.
2) Shut down Nodes A and B.
3) Copy a sysprepped base OS IDE drive over top of Node A's system drive
file to simulate a wipe and rebuild of the machine. Boot the machine, enter
the name (NodeA) and password, machine reboots and starts clean (not a member
of the domain). Assign a static IP/subnet so we can communicate with DCEXCH.
4) FYI, the quorum, log and data cluster resource disks are visible at this
point in Windows Explorer, labels intact. No errors in event log.
5) Use NTBackup to restore the system state and program files directories.
Prompted to reboot.
6) Reboot takes a long time and we get the "At least one service or driver
failed during system startup" error.
7) Cluster service is not started. The event log lists event ID 1000:
"Microsoft Clustering Service suffered an unexpected fatal error at line
'<line>' of source module '<source path>'. The error code was '<error code>'."
The path references d:\ which is the quorum drive, and the error code is 2. I
have found no useful information about this error on the internet. The
cluster resource drives show up in Windows Explorer with the correct drive
letters, but the labels do not appear, and clicking on a drive results in a
"the device is not ready" error.
8) Examine the
HKLM\System\currentcontrolset\services\clusdisk\parameters\signatures key and
compare with the disk signatures using diskpart...they are the same.
9) Attempt to use clusterrecovery.exe tool which fails because the cluster
is offline. Attempt using the /fixquorum switch to start the cluster
service, this fails with the same 1000 error.
10) Numerous articles refer to using dumpcfg to re-write the signatures, but
I cannot find dumpcfg anywhere available for download.
11) Based on info in article 217157 was able to determine that the
HKLM\Cluster hive is not loaded (it is blank). Attempted to follow
directions in 224999 to restore the hive by copying the chk backup file over
CLUSDB, but cannot rename CLUSDB as it continues to say there is a process
accessing it (but cluster service is stopped!).
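
Step 8 (comparing the ClusDisk signature keys with what diskpart reports) is easy to get wrong by eye when there are several disks. A minimal sketch of that comparison is below. It assumes you have already copied the signature key names from the registry and the disk IDs from diskpart into two lists by hand; the function names and the sample values are hypothetical and are not taken from this thread.

```python
import re


def normalize_signature(text):
    """Return an 8-digit uppercase hex signature, or raise ValueError."""
    cleaned = text.strip().upper()
    if cleaned.startswith("0X"):
        cleaned = cleaned[2:]
    if not re.fullmatch(r"[0-9A-F]{1,8}", cleaned):
        raise ValueError(f"not a disk signature: {text!r}")
    return cleaned.zfill(8)


def compare_signatures(registry_keys, diskpart_ids):
    """Report which signatures match and which appear on one side only."""
    reg = {normalize_signature(s) for s in registry_keys}
    disk = {normalize_signature(s) for s in diskpart_ids}
    return {
        "matched": sorted(reg & disk),
        "only_in_registry": sorted(reg - disk),
        "only_on_disk": sorted(disk - reg),
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    registry = ["1A2B3C4D", "0x5e6f7a8b", "9C0D1E2F"]
    diskpart = ["1a2b3c4d", "5E6F7A8B", "DEADBEEF"]
    for name, values in compare_signatures(registry, diskpart).items():
        print(name, values)
```

Anything listed under only_in_registry or only_on_disk would be a signature mismatch worth chasing; in the scenario above both lists would be empty, since the signatures were found to be the same.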

Bottom line: what can I do to fix the cluster in this scenario?

I have read the links you've provided, and many, many others. Nothing out
there is helpful.

I just want to verify the procedure for using a plain, ordinary system state
backup of a cluster node to restore from scratch. Seems like a simple
request... I've verified that an ASR backup/restore works, but doing ASR
backups on 20+ servers seems a bit of a tall order, never mind that they
don't have floppy drives and there are no straightforward official MS
instructions on using RIS to do ASR that I can find. The point is that this
is a basic, basic, basic requirement for any clustering solution and it
should be possible.

Any help is appreciated.

-JC

Raistlin · 28-01-2008

The same problem is bothering my team too. We are truly puzzled by
materials out there which discuss a lot without giving a practical
measure to solve the problem. The lack of official support for Win2k3
Cluster has prevented us from deploying more fault-tolerant services.


Edwin vMierlo [MVP] · 28-01-2008

> 6) Reboot takes a long time and we get the "At least one service or driver
> failed during system startup" error.

Did you check the system event log to find out which driver?

> 7) Cluster service is not started. Event log lists event id 1000
> "Microsoft Clustering Service suffered an unexpected fatal error at line
> '<line>' of source module '<source path>'. The error code was '<error code>'.
> " The path references d:\ which is the quorum drive, error code is 2. I
> have found no useful information about this error on the internet. The
> cluster resource drives show up in windows explorer with the correct drive
> letters, but the labels do not appear, and clicking on the drive results in a
> "the device is not ready" error.

Regarding the fatal error: did the cluster.log file show any errors at that
time? (Note: the cluster.log timestamps are written in GMT, regardless of
the timezone settings or the time on the host.)

CFPDSA · 28-01-2008

>
> did you check the system event log to find out which driver ?

As mentioned earlier, it wasn't a driver, it was the cluster service that
failed to start.

> the fatal error, did the cluster.log file show any errors at that time
> (note: the cluster.log file timestamps are written in GMT, regardless of
> timezone settings or time on the host)

Here ya go:

0000067c.00000698::2008/01/28-02:57:17.881 INFO [CS] Cluster Service started - Cluster Node Version 4.3790
0000067c.00000698::2008/01/28-02:57:17.881 INFO OS Version 5.2.3790 - Service Pack 2 (ADS 03000112L)
0000067c.00000698::2008/01/28-02:57:17.881 INFO Local Time is 2008/01/28-05:57:17.881
0000067c.000000b4::2008/01/28-02:57:17.897 INFO [CS] Service Starting...
0000067c.000000b4::2008/01/28-02:57:17.897 INFO [INIT] ClusterInitialize called to start cluster.
0000067c.000000b4::2008/01/28-02:57:17.897 INFO [EP] Initialization...
0000067c.000000b4::2008/01/28-02:57:17.897 INFO [DM] Initialization
0000067c.000000b4::2008/01/28-02:57:17.897 ERR [DM] DmInitialize: The hive was loaded- rollback, unload and reload again
0000067c.000000b4::2008/01/28-02:57:17.897 INFO [DM] DmpRestartFlusher: Entry
0000067c.000000b4::2008/01/28-02:57:17.897 INFO [DM] DmpUnloadHive: unloading the hive
0000067c.000000b4::2008/01/28-02:57:17.928 INFO [Qfs] QfsSetFileAttributes C:\WINDOWS\Cluster\CLUSDB.BKP$ 80, status 2
0000067c.000000b4::2008/01/28-02:57:17.928 INFO [Qfs] QfsDeleteFile C:\WINDOWS\Cluster\CLUSDB.BKP$, status 2
0000067c.000000b4::2008/01/28-02:57:17.928 INFO [DM] Loading cluster database from C:\WINDOWS\Cluster\CLUSDB
0000067c.000000b4::2008/01/28-02:57:17.975 INFO [DM] DmpStartFlusher: Entry
0000067c.000000b4::2008/01/28-02:57:17.975 INFO [DM] DmpStartFlusher: thread created
0000067c.000000b4::2008/01/28-02:57:17.975 ERR [DM] Failed to open key Resources, status 2
0000067c.000000b4::2008/01/28-02:57:17.975 ERR Cluster service suffered an unexpected fatal error at line 1386 of source module d:\nt\base\cluster\service\dm\dminit.c. The error code was 2.
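
For anyone reading a longer cluster.log, a small script can pull out just the ERR lines and work out the GMT-to-local offset mentioned earlier. The sketch below is only an illustration: the line pattern is inferred from the excerpt in this post rather than from any documented log format, the function names are made up, and the mapping of status 2 to ERROR_FILE_NOT_FOUND assumes the status values are Win32 error codes. It tolerates lines that a newsreader has wrapped by gluing continuation lines back onto the entry they belong to.

```python
import re
from datetime import datetime

TS_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"

# Pattern inferred from the excerpt above: pid.tid::timestamp LEVEL message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\b\s*(?P<msg>.*)$"
)

# Assumption: status values are Win32 error codes (2 = file not found).
STATUS_NAMES = {2: "ERROR_FILE_NOT_FOUND"}


def parse_entries(text):
    """Parse log text into entries, rejoining wrapped continuation lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        match = LINE_RE.match(line)
        if match:
            entry = match.groupdict()
            entry["when"] = datetime.strptime(entry["ts"], TS_FORMAT)
            entries.append(entry)
        elif entries and line:
            entries[-1]["msg"] = (entries[-1]["msg"] + " " + line).strip()
    return entries


def only_errors(entries):
    """Keep the entries logged at ERR level."""
    return [e for e in entries if e["level"] == "ERR"]


def status_code(msg):
    """Extract the number after 'status', or None if there is none."""
    match = re.search(r"status (\d+)", msg)
    return int(match.group(1)) if match else None


def local_offset(entries):
    """Local time minus log (GMT) time, taken from the 'Local Time is' entry."""
    for entry in entries:
        match = re.match(r"Local Time is (\S+)", entry["msg"])
        if match:
            return datetime.strptime(match.group(1), TS_FORMAT) - entry["when"]
    return None


SAMPLE = r"""
0000067c.00000698::2008/01/28-02:57:17.881 INFO
Local Time is 2008/01/28-05:57:17.881
0000067c.000000b4::2008/01/28-02:57:17.928 INFO [DM] Loading cluster
database from C:\WINDOWS\Cluster\CLUSDB
0000067c.000000b4::2008/01/28-02:57:17.975 ERR [DM] Failed to open key
Resources, status 2
"""

if __name__ == "__main__":
    parsed = parse_entries(SAMPLE)
    print("offset from GMT:", local_offset(parsed))
    for err in only_errors(parsed):
        code = status_code(err["msg"])
        print(err["ts"], err["msg"], STATUS_NAMES.get(code, ""))
```

Read this way, the excerpt says the service loaded C:\WINDOWS\Cluster\CLUSDB and then could not find the Resources key inside it, which fits the earlier observation that the HKLM\Cluster hive is blank.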

Edwin vMierlo [MVP] · 28-01-2008

Instead of "fixquorum", can you start with "resetquorumlog"?

CFPDSA · 03-02-2008

That didn't work either, same results.

Sorry it's been a while, but we've had a little "undersea cable" problem over
here in the last week that has interrupted regular internet access.

Edwin vMierlo [MVP] · 06-02-2008

OK, back to basics.

If you disable the cluster service, disable clusdisk.sys, and reboot,
do you have access to all your disks?

CFPDSA · 07-02-2008

Yup, they all show up and I can access them.

Edwin vMierlo [MVP] · 07-02-2008

First of all, check the signature of your q:\ (quorum disk).
If it is not the original, you must change it now.

If your signature is OK, then do the following:

Ensure Node 2 is fully shut down.

Rename the q:\mscs folder to q:\mscs_old.

Enable clusdisk.sys again on Node 1.
Reboot.

Start your cluster with -resetquorumlog.

Does it start now?

If yes, you need to stop it and start it without the parameter.

CFPDSA · 11-02-2008

Verified the signatures are correct using diskpart disk detail compared w/
the registry entries.

Renamed mscs, then reenabled clusdisk.sys and rebooted.

Attempted to start cluster service with -resetquorumlog and it fails again
with the same error.

FYI, when the clusdisk.sys driver is enabled, all the disk resources are
inaccessible. They are visible in Explorer, but give an error of "The device
is not ready." when double clicked.

Edwin vMierlo [MVP] · 11-02-2008

> Verified the signatures are correct using diskpart disk detail compared w/
> the registry entries.
>
> Renamed mscs, then reenabled clusdisk.sys and rebooted.
>
> Attempted to start cluster service with -resetquorumlog and it fails again
> with the same error.

Hm, it is going to take a long time to solve this type of problem in a
newsgroup. If you need a quick response, I guess you need to start getting
help from Microsoft.

Something is not right with either your backup or your restore procedure...

If you still want to keep trying to get this restored cluster online: I am
starting to believe this is not the quorum, then. Maybe some group policy is
blocking something, or maybe the account running the cluster service cannot
access the registry or a file... Again, this is going to be a tough one to
troubleshoot in a forum.

Have you checked that the user account running the cluster service is a
local admin?

On the server, log on with the account that is used to run the cluster
service. Launch filemon.exe and regmon.exe, then start the cluster service,
capture the filemon and regmon output, and see if this gives you a clue.

> FYI, when the clusdisk.sys driver is enabled, all the disk resources are
> inaccessible. They are visible in Explorer, but give an error of "The device
> is not ready." when double clicked.

That is normal: the disks first have to be online in the cluster before you
can access them.

CFPDSA · 12-02-2008

Well, the whole reason I started this thread was to see what the answer to
the obvious question of "how do you restore a cluster" was.

The particular cluster we are working with here (as stated previously) is a
VMWARE based scsi cluster, just for testing purposes. But the procedure used
to restore in this case should be the same for any scenario.

We do not have MS support (AFAIK) so that is not an option for us. I have
been doing research into a proper disaster recovery plan for our Exchange
clusters and have been unable to find precise guidance on how to restore a
dead cluster (i.e. the system state was backed up, now the cluster won't
start, how do you restore the cluster?).

I thought this would be an easy question... oh well...

CFPDSA · 12-02-2008

One last question, is the last statement below accurate? (i.e. no ASR = no
cluster rebuild)

From:
http://technet2.microsoft.com/window....mspx?mfr=true

Scenario 8 - Complete Cluster Failure
Symptom: None of the nodes can boot up.

If all nodes fail in a cluster and the quorum disk cannot be repaired,
follow these steps:

• Use Automated System Recovery on one node in the original cluster,
choosing a node that was backed up recently and that was active in the
cluster at the time it was backed up. This restores the disk signatures, the
partition layout of the cluster disks (quorum and nonquorum), and the cluster
configuration data. Do not start other nodes until the first node is
restored. For more information, see To Restore a damaged cluster node using
Automated System Recovery.

• Restore other nodes. For more information, see Restore a damaged cluster
node using Automated System Recovery.

• Restore your applications and application data from backup data sets.

Important

• If you do not have an Automated System Recovery backup of each node, you
cannot restore the cluster. Instead, you must recreate your cluster from
scratch. For more information, see Checklist: Planning and creating a server
cluster.
