Failover behaviour on a 2 node cluster
Hello,
I've got a fibre SAN at work and two blade servers in a chassis...
My goal is to set up a 2-node Exchange cluster. I'm a newbie to clusters.
At this point I have the two nodes configured in a cluster and have just
been testing failover. I went through a few test failure scenarios,
failing in three different ways (the third is the one I'm having issues
with):
1) Gracefully shut down the active node: the passive node took over and
became the active node.
2) Selected a resource and initiated failure on it four times: the
passive node took over that resource.
3) Hard shutdown of the active node: the passive node is unable to take
over.
I'm getting this sort of error in my event log:
-------Begin Event---------
Service: ClusSvc
Category: Physical Disk Resource
Event ID: 1034
The disk associated with cluster disk resource 'Disk Q:' could not be
found. The expected signature of the disk was 059E1D89. If the disk was
removed from the server cluster, the resource should be deleted. If the
disk was replaced, the resource must be deleted and created again in
order to bring the disk online. If the disk has not been removed or
replaced, it may be inaccessible at this time because it is reserved by
another server cluster node.
-------End Event-----------
However, if I power the pseudo-failed node back on, the cluster comes
back up, and if I then fail it in one of the first two ways everything
is fine again (the passive node becomes active)...
At this point I have uninstalled the multipath drivers for our SAN
disks, because I've read that multipath software can cause issues like
the event above...
But this has not fixed it.
It's weird; it seems like the failed active node somehow keeps the
cluster disk resources locked... Does anybody have any ideas?
TIA
Ariel
Re: Failover behaviour on a 2 node cluster
Good afternoon.
As you said, some multipath software can cause this kind of problem
(http://support.microsoft.com/default...en-us;Q293778).
Could you check that, when you extract the disk signatures, the
signature of the quorum disk while it resides on nodeA is the same as
the one reported for it on nodeB after failover?
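A quick way to compare (assuming the standard Windows Server 2003
command-line tools are available) is to run the following on each node;
note that wmic prints the signature in decimal, while the cluster event
shows it in hex:
wmic diskdrive get Index,Signature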
Best regards,
Daniel Escudero
"Ariel" <kamayamaya@gmail.com> wrote in message
news:1114188169.416077.272430@o13g2000cwo.googlegroups.com...
RE: Failover behaviour on a 2 node cluster
Ariel
Take a look at: http://support.microsoft.com/kb/895092
Thanks
CT
"Ariel" wrote:
Re: Failover behaviour on a 2 node cluster
I was unable to find Dumpcfg.exe; it isn't in the W2K3 Resource Kit
that Microsoft has for download, so I found a VBS script
(http://www.castalk.com/ftopic4344.html) to get the disk signatures.
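For reference, that kind of script basically just reads
Win32_DiskDrive.Signature through WMI; a minimal sketch (not the exact
script from that link) looks something like this, run with cscript so
the output goes to the console:

' List each physical disk's signature via WMI, converted to hex so it
' can be compared with the signature shown in the cluster event log.
Set objWMI = GetObject("winmgmts:\\.\root\cimv2")
Set colDisks = objWMI.ExecQuery("SELECT Index, Signature FROM Win32_DiskDrive")
For Each objDisk In colDisks
    WScript.Echo "Disk " & objDisk.Index & ": signature " & Hex(objDisk.Signature)
Next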
After a hard shutdown of NodeA, NodeB showed the same disks but with
BLANK disk signatures... compared against the signatures it reported
before I shut down NodeA (which were the same on both nodes)...
But graceful shutdowns do not have this effect....
Re: Failover behaviour on a 2 node cluster
I did obtain one of the hotfixes mentioned above, as it pertains to my
issue (http://support.microsoft.com/kb/886800), but I did not install
it since I already have SP1 installed...
Thanks Charles, I'm going to review the other hotfixes closely to see
whether they also apply to me.
Re: Failover behaviour on a 2 node cluster
Hi.
Disk signatures should not be blank. Could you check whether all the
data under HKLM\SYSTEM\CurrentControlSet\Services\Clusdisk\Parameters
is the same on both servers?
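For example, the following should dump the whole key so you can compare
the output from the two nodes (reg.exe ships with Windows Server 2003):
reg query HKLM\SYSTEM\CurrentControlSet\Services\Clusdisk\Parameters /s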
Regards,
Daniel Escudero
"Ariel" <kamayamaya@gmail.com> wrote in message
news:1114201342.928293.267640@z14g2000cwz.googlegroups.com...
Re: Failover behaviour on a 2 node cluster
OK, the disk signatures are correct in the registry... Here are the
event failures that happen after a hard shutdown of the active node.
The second event seems interesting, Event ID 1177. Is it saying that
switching the quorum disk from NodeA to NodeB requires NodeA to be up
to help transfer the quorum disk? So are my problems by design? Or
should the passive node be able to fail over when the active node loses
power?
Event Type: Error
Event Source: ClusSvc
Event Category: Physical Disk Resource
Event ID: 1034
Date: 4/25/2005
Time: 9:31:46 AM
User: N/A
Computer: BERT
Description:
The disk associated with cluster disk resource 'Disk Q:' could not be
found. The expected signature of the disk was 059E1D89. If the disk was
removed from the server cluster, the resource should be deleted. If the
disk was replaced, the resource must be deleted and created again in
order to bring the disk online. If the disk has not been removed or
replaced, it may be inaccessible at this time because it is reserved by
another server cluster node.
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.
----------------------------
Event Type: Error
Event Source: ClusSvc
Event Category: Membership Mgr
Event ID: 1177
Date: 4/25/2005
Time: 9:31:46 AM
User: N/A
Computer: BERT
Description:
Cluster service is shutting down because the membership engine failed
to arbitrate for the quorum device. This could be due to the loss of
network connectivity with the current quorum owner. Check your
physical network infrastructure to ensure that communication between
this node and all other nodes in the server cluster is intact.
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.
---------------------------------
Event Type: Error
Event Source: ClusSvc
Event Category: Startup/Shutdown
Event ID: 1073
Date: 4/25/2005
Time: 9:31:46 AM
User: N/A
Computer: BERT
Description:
Cluster service was halted to prevent an inconsistency within the
server cluster. The error code was 5892.
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.
------------------------------------
Event Type: Error
Event Source: Service Control Manager
Event Category: None
Event ID: 7034
Date: 4/25/2005
Time: 9:31:47 AM
User: N/A
Computer: BERT
Description:
The Cluster Service service terminated unexpectedly. It has done this
1 time(s).
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.
Re: Failover behaviour on a 2 node cluster
http://www.microsoft.com/technet/pro...nsfl.mspx#EDAA
Caught this in another thread: a 2-node cluster cannot tolerate any
failure, while a 3-node cluster can tolerate one failure.
So it seems as though this is by design, correct me if I'm wrong...
I have a lot to learn about clusters, I guess...
-Ariel
Re: Failover behaviour on a 2 node cluster
This is only true if you are using a "majority node set" (MNS) quorum.
If you plan to run a 2-node cluster, you should use a volume from the
shared storage as your quorum disk.
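With MNS, a majority of the configured nodes (floor(N/2) + 1) must be
running for the cluster to keep quorum: that is 2 of 2 on a two-node
cluster, so it cannot survive losing a node, but only 2 of 3 on a
three-node cluster. With a shared quorum disk, a single surviving node
that can reserve the disk keeps the cluster up.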
Regards,
John
"Ariel" <kamayamaya@gmail.com> wrote in message
news:1114447541.904032.258840@o13g2000cwo.googlegroups.com...
Re: Failover behaviour on a 2 node cluster
Good evening.
How are your cluster communications configured? Is the public network
set to "mixed" and the private network used only for internal cluster
communication? How are the failover thresholds configured?
What does the cluster.log file on node 2 say?
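(If I remember correctly, the cluster log is written to
%SystemRoot%\Cluster\cluster.log on each node by default, unless the
ClusterLog environment variable points it somewhere else.)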
Best regards,
Daniel Escudero
"John Toner [MVP]" <jtoner@DIE.SPAM.DIE.mvps.org> wrote in message
news:##b0NwnSFHA.3076@tk2msftngp13.phx.gbl...