
Thread: Physical Disk goes offline when cluster node reboots

  1. #1
    Darrek Guest

    Physical Disk goes offline when cluster node reboots

    I have a 2 node Windows 2003 SP1 EE cluster connected to an MSA1000 SAN
    via integrated FC hub. My SAN is single-path since this is our Dev/QA
    environment.

    When I reboot either node in the cluster, all physical disk resources go
    offline while the rebooting server goes through POST. I get Delayed
    Write Failed errors in the event log of the node that is still running.
    Once the rebooted node is up and running, the cluster returns to
    normal.

    I'm worried that our production cluster may exhibit the same issues
    when it goes live even though it is built in a more robust fashion.

    I'm open to suggestions.

    The servers are HP DL145s with Emulex FC2243 cards. If I simply fail over
    a cluster group, everything works great.
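
    (If it helps to reproduce the good case: I drive the manual failover from
    the command line with the built-in cluster.exe tool. A rough sketch, where
    the group and node names are placeholders rather than my real ones:)

    REM Move one group to the other node, then list group and resource states
    cluster group "SQL Group" /moveto:NODEB
    cluster group
    cluster res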

    Thanks.
    -DK


  2. #2
    Edwin vMierlo Guest

    Re: Physical Disk goes offline when cluster node reboots

    Darrek,

    Just to confirm that we have the symptom right:

    - All groups are online on Node 1 (therefore all disks are online on Node 1)
    - You reboot Node 2
    - All disks go offline on Node 1 during the reboot/POST of Node 2

    Please confirm that this is what you are experiencing.

    And two questions:
    Q: Do the disks that go offline on Node 1 fail, or do they go offline?
    (Please specify, as there is a difference.)
    Q: Do you see any "reservation lost" messages/events in the system event
    log?
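
    (A quick way to check for those without clicking through Event Viewer is
    the eventquery.vbs script that ships with Windows Server 2003; just a
    sketch, where 1038 is the ClusSvc reservation-lost event ID:)

    REM List ClusSvc event 1038 (reservation lost) from the System log
    cscript //nologo %windir%\system32\eventquery.vbs /l system /fi "id eq 1038"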

    rgds,
    Edwin.




    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166032177.312098.234640@80g2000cwy.googlegroups.com...
    > I have a 2 node Windows 2003 SP1 EE cluster connected to an MSA1000 SAN
    > via integrated FC hub. My SAN is single-path since this is our Dev/QA
    > environment.
    >
    > When I reboot any node in the cluster all physical disk resources go
    > offline while the rebooted server goes through POST. I get Delayed
    > Write Failed errors in the event log of the node that is still running.
    > Once the rebooted node is up and running the cluster returns to
    > normal.
    >
    > I'm worried that our production cluster may exhibit the same issues
    > when it goes live even though it is built in a more robust fashion.
    >
    > I'm open for suggestions.
    >
    > The servers are HP DL145's, using Emulex FC2243 cards. If I simply
    > failover a cluster group everything works great.
    >
    > Thanks.
    > -DK
    >




  3. #3
    John Toner [MVP] Guest

    Re: Physical Disk goes offline when cluster node reboots

    Make sure you're running supported versions of the HBA drivers. Many
    vendors also require that you apply a Storport hotfix, such as KB 916048.
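
    (As a quick sanity check, WMIC can tell you whether a given fix is already
    installed; a sketch, assuming the hotfix registers under that KB number:)

    REM Check whether the Storport hotfix is installed
    wmic qfe where "HotFixID='KB916048'" get HotFixID,Description,InstalledOn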

    Regards,
    John

    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166032177.312098.234640@80g2000cwy.googlegroups.com...
    > I have a 2 node Windows 2003 SP1 EE cluster connected to an MSA1000 SAN
    > via integrated FC hub. My SAN is single-path since this is our Dev/QA
    > environment.
    >
    > When I reboot any node in the cluster all physical disk resources go
    > offline while the rebooted server goes through POST. I get Delayed
    > Write Failed errors in the event log of the node that is still running.
    > Once the rebooted node is up and running the cluster returns to
    > normal.
    >
    > I'm worried that our production cluster may exhibit the same issues
    > when it goes live even though it is built in a more robust fashion.
    >
    > I'm open for suggestions.
    >
    > The servers are HP DL145's, using Emulex FC2243 cards. If I simply
    > failover a cluster group everything works great.
    >
    > Thanks.
    > -DK
    >




  4. #4
    Darrek Guest

    Re: Physical Disk goes offline when cluster node reboots


    Edwin vMierlo wrote:
    > - All groups are online on Node 1 (therefore all disks are online on Node 1)
    > - You reboot Node 2
    > - All disks go offline on Node 1 during the reboot/POST of Node 2
    >
    > Please confirm that this is what you are experiencing.


    Yes. All groups are online and running fine on Node 1. During Node 2's
    POST, Node 1 reports errors like these in the event log:

    (One for each LUN on the SAN)
    Event Type: Error
    Event Source: Disk
    Event Category: None
    Event ID: 15
    Description:
    The device, \Device\Harddisk1, is not ready for access yet.

    And then...one of these...

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Physical Disk Resource
    Event ID: 1038
    Description:
    Reservation of cluster disk 'Disk T - QASQLBTmp' has been lost. Please
    check your system and disk configuration.


    And then...several of these...

    Event Type: Warning
    Event Source: Ntfs
    Event Category: None
    Event ID: 50
    Description:
    {Delayed Write Failed} Windows was unable to save all the data for the
    file . The data has been lost. This error may be caused by a failure of
    your computer hardware or network connection. Please try to save this
    file elsewhere.

    More Event 15s and 1038s follow for the other LUNs.

    A couple of these mixed in...

    Event Type: Information
    Event Source: Application Popup
    Event Category: None
    Event ID: 26
    Description:
    Application popup: Windows - Delayed Write Failed : Windows was unable
    to save all the data for the file Q:\$Mft. The data has been lost. This
    error may be caused by a failure of your computer hardware or network
    connection. Please try to save this file elsewhere.

    One of these:

    Event Type: Warning
    Event Source: Ftdisk
    Event Category: Disk
    Event ID: 57
    Description:
    The system failed to flush data to the transaction log. Corruption may
    occur.

    At this point Cluster Administrator begins sending service stop commands
    to SQL, and I get these:

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Physical Disk Resource
    Event ID: 1036
    Description:
    Cluster disk resource '' did not respond to a SCSI maintenance command.


    Followed by several more 57s.

    I even managed one of these:

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Physical Disk Resource
    Event ID: 1034
    Description:
    The disk associated with cluster disk resource 'Disk Q:' could not be
    found. The expected signature of the disk was BED1F8F9. If the disk was
    removed from the server cluster, the resource should be deleted. If the
    disk was replaced, the resource must be deleted and created again in
    order to bring the disk online. If the disk has not been removed or
    replaced, it may be inaccessible at this time because it is reserved by
    another server cluster node.

    Followed by one of these:

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Startup/Shutdown
    Event ID: 1009

    Description:
    Cluster service could not join an existing server cluster and could not
    form a new server cluster. Cluster service has terminated.




    The drivers I'm using are the Emulex Storport FC2243 drivers:
    5-1.11X1 11/07/2005 WS2K3 32 bit (elxadjct.sys & elxstor.sys)
    5.1.3.2 (elxstod.dll)

    The MSA 1000 is on firmware 4.48.
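
    (For the record, this is roughly how I read the file versions back with
    WMIC rather than trusting the installer; the path is the default driver
    location, adjust if yours differs:)

    REM Report the version stamped on the Emulex Storport miniport
    wmic datafile where "name='C:\\WINDOWS\\system32\\drivers\\elxstor.sys'" get Name,Version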


    Thanks for your help!

    -DK


  5. #5
    Chuck Timon [Microsoft] Guest

    Re: Physical Disk goes offline when cluster node reboots

    Yep, these are all errors that indicate hardware problems...probably, in
    your case, with the configuration of the hardware. One of the classic
    examples of this is Dell PERC RAID controllers: if 'cluster mode' is not
    set on the controllers, errors like the ones you are seeing will manifest
    themselves. You probably need to speak with your hardware vendor to ensure
    they know you are using their hardware in a cluster and have them review
    the configuration.

    --
    Chuck Timon, Jr.
    Microsoft Corporation
    Longhorn Readiness Team
    This posting is provided "AS IS" with no
    warranties, and confers no rights.

    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166201880.820694.237490@j72g2000cwa.googlegroups.com...
    >
    > Edwin vMierlo wrote:
    >> Darrek,
    >>
    >> Just to confirm that we have the symptom right
    >>
    >> - All groups are online on Node 1 (therefore all disks are online on Node
    >> 1)
    >> - you reboot Node 2
    >> - All disks on Node 1 go offline on Node 1 during reboot/Post of Node 2
    >>
    >> Please confirm that this is what you are experiencing
    >>
    >> and two questions:
    >> Q: are the disks who go offline on Node 2, do they fail or do they go
    >> offline ? (please specify, as there is a difference)
    >> Q: Do you see any "reservation lost" messages/events in the system event
    >> log
    >> ?
    >>
    >> rgds,
    >> Edwin.
    >>

    >
    > Yes. All groups are online and running fine on Node 1. During Node 2
    > POST Node 1 reports errors like this in the event log:
    >
    > (One for each LUN on the SAN)
    > Event Type: Error
    > Event Source: Disk
    > Event Category: None
    > Event ID: 15
    > Description:
    > The device, \Device\Harddisk1, is not ready for access yet.
    >
    > And then...one of these...
    >
    > Event Type: Error
    > Event Source: ClusSvc
    > Event Category: Physical Disk Resource
    > Event ID: 1038
    > Description:
    > Reservation of cluster disk 'Disk T - QASQLBTmp' has been lost. Please
    > check your system and disk configuration.
    >
    >
    > And then...several of these...
    >
    > Event Type: Warning
    > Event Source: Ntfs
    > Event Category: None
    > Event ID: 50
    > Description:
    > {Delayed Write Failed} Windows was unable to save all the data for the
    > file . The data has been lost. This error may be caused by a failure of
    > your computer hardware or network connection. Please try to save this
    > file elsewhere.
    >
    > More Event 15's, and 1038' for other LUNs
    >
    > A couple of these mixed in...
    >
    > Event Type: Information
    > Event Source: Application Popup
    > Event Category: None
    > Event ID: 26
    > Description:
    > Application popup: Windows - Delayed Write Failed : Windows was unable
    > to save all the data for the file Q:\$Mft. The data has been lost. This
    > error may be caused by a failure of your computer hardware or network
    > connection. Please try to save this file elsewhere.
    >
    > One of these:
    >
    > Event Type: Warning
    > Event Source: Ftdisk
    > Event Category: Disk
    > Event ID: 57
    > Description:
    > The system failed to flush data to the transaction log. Corruption may
    > occur.
    >
    > At this point Cluster Admin begins sending service stop commands to
    > SQL.
    > And I get these:
    >
    > Event Type: Error
    > Event Source: ClusSvc
    > Event Category: Physical Disk Resource
    > Event ID: 1036
    > Description:
    > Cluster disk resource '' did not respond to a SCSI maintenance command.
    >
    >
    > Followed by several more 57's:
    >
    > I even managed one of these:
    >
    > Event Type: Error
    > Event Source: ClusSvc
    > Event Category: Physical Disk Resource
    > Event ID: 1034
    > Description:
    > The disk associated with cluster disk resource 'Disk Q:' could not be
    > found. The expected signature of the disk was BED1F8F9. If the disk was
    > removed from the server cluster, the resource should be deleted. If the
    > disk was replaced, the resource must be deleted and created again in
    > order to bring the disk online. If the disk has not been removed or
    > replaced, it may be inaccessible at this time because it is reserved by
    > another server cluster node.
    >
    > Followed by one of these:
    >
    > Event Type: Error
    > Event Source: ClusSvc
    > Event Category: Startup/Shutdown
    > Event ID: 1009
    >
    > Description:
    > Cluster service could not join an existing server cluster and could not
    > form a new server cluster. Cluster service has terminated.
    >
    >
    >
    >
    > The drivers I'm using are Emulex Storport FC2243
    > 5-1.11X1 11/07/2005 WS2K3 32 bit (elxadjct.sys & elxstor.sys)
    > 5.1.3.2 (elxstod.dll)
    >
    > The MSA 1000 is on firmware 4.48.
    >
    >
    > Thanks for your help!
    >
    > -DK
    >




  6. #6
    Darrek Guest

    Re: Physical Disk goes offline when cluster node reboots


    Chuck Timon [Microsoft] wrote:
    > You probably need to speak with your hardware vendor to ensure they
    > know you are using their hardware in a cluster and have them review
    > the configuration.


    I've since found the MS hotfix and the updated Emulex drivers; I will be
    installing them later today. I've gone through the Emulex configuration
    that is available during the POST sequence and have not found anything
    that looks like it needs to be configured differently. The HP website
    also has nothing clearly called out.

    I'll follow up here if the updates help.

    -DK


  7. #7
    Edwin vMierlo Guest

    Re: Physical Disk goes offline when cluster node reboots

    Dear Darrek,

    Ensuring that you are running the latest drivers is a good start.

    However, as you report that the problem occurs during POST, a stage where
    the OS is not even loading, I would be surprised if updating your drivers
    alone improved the situation.

    I would also check the firmware/BIOS running on your Emulex cards and
    ensure you run the latest firmware, as the firmware is running (or
    starting to run) during, or just after, POST.

    (And if you update drivers, you should update firmware anyway, as the two
    go hand in hand.)

    Quick question, as this is a SAN environment: are you booting from your
    SAN?

    rgds,
    Edwin.


    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166207126.999846.241010@j72g2000cwa.googlegroups.com...
    >
    > Chuck Timon [Microsoft] wrote:
    > > Yep, these are all errors that indicate hardware problems...probably, in
    > > your case, with configuration of the hardware. One of the classic

    examples
    > > of this is Dell Perc RAID controllers. If the 'cluster mode' is not set

    on
    > > the controllers, then errors like what you are seeing will manifest
    > > themselves. Probably need to speak with your hardware vendor to ensure

    they
    > > know you are using their hardware on a cluster and have them reviewe the
    > > configuration.
    > >
    > > --
    > > Chuck Timon, Jr.
    > > Microsoft Corporation
    > > Longhorn Readiness Team
    > > This posting is provided "AS IS" with no
    > > warranties, and confers no rights.
    > >

    >
    > I've since found the MS hotfix and updated Emulex drivers. I will be
    > installing them later today. I've gone through the Emulex
    > configuration that is available during the POST sequence and have not
    > found anything that looks like it needs to be configured differently.
    > The HP website also has nothing clearly called out.
    >
    > I'll follow up here if the updates help.
    >
    > -DK
    >




  8. #8
    Darrek Guest

    Re: Physical Disk goes offline when cluster node reboots


    Edwin vMierlo wrote:
    > Quick question, as this is a SAN environment: are you booting from your
    > SAN?


    I understand that drivers may not solve the problem unless they include
    code to interact gracefully with a booting FC HBA during POST. Since
    upgrading the drivers I still have the problem. I will look for a
    firmware update.

    I'm booting from DAS.

    -DK


  9. #9
    Chuck Timon [Microsoft] Guest

    Re: Physical Disk goes offline when cluster node reboots

    If you are booting from SAN, have you read
    http://support.microsoft.com/kb/886569/en-us ?

    Just as an FYI, booting from SAN must be supported by your hardware vendor
    or Microsoft won't support it.

    --
    Chuck Timon, Jr.
    Microsoft Corporation
    Longhorn Readiness Team
    This posting is provided "AS IS" with no
    warranties, and confers no rights.

    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166552589.238041.128670@t46g2000cwa.googlegroups.com...
    >
    > Edwin vMierlo wrote:
    >> Dear Darrek,
    >>
    >> Ensuring that you are running the lastest drivers is a good start.
    >>
    >> However, as you report that your problem occurs during POST, which is the
    >> stage where the OS is not even loading, therefore I would be surprised if
    >> updating your drivers would improve your situation.
    >>
    >> I would think that you also need to check the firmware/bios which is
    >> running
    >> on your Emulex cards, ensure you run the latest firmware, as firmware is
    >> running (or starting to run) during (or just after) POST.
    >>
    >> (and if you update drivers, you should update firmware anyway, as the two
    >> go
    >> hand in hand)
    >>
    >> Quick question: as this is a SAN envioronment: Are you booting from your
    >> SAN
    >> ?
    >>
    >> rgds,
    >> Edwin.
    >>

    >
    > I understand that drivers may not solve the problem unless they have
    > code to interact with a booting FC-HBA gracefully during a POST. Since
    > upgrading the drivers I still have the problem. I will look for a
    > firmware update.
    >
    > I'm booting from DAS.
    >
    > -DK
    >




  10. #10
    Darrek Guest

    Re: Physical Disk goes offline when cluster node reboots



    I've upgraded the firmware to the latest available on the HP website:
    2.10A10.
    I upgraded and then disabled the boot BIOS, so I no longer get the Ctrl-E
    prompt for the Emulex utilities during POST.
    I've set the driver parameters as follows:
    QueueDepth=64;NodeTimeOut=10;LinkTimeOut=40;QueueTarget=1;ResetTPRLO=2

    The ResetTPRLO setting is new; I shut down both nodes and power cycled the
    MSA1000 after implementing it. The problem still occurs.
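
    (For anyone following along: on a Storport miniport these parameters live
    in the registry under the driver's service key. A sketch of what I did;
    the key and value names are what my Emulex 5-series driver uses, so verify
    them against the Emulex documentation before changing anything, and note a
    reboot is required for the change to take effect.)

    REM Inspect, then set, the Storport miniport driver parameters
    reg query HKLM\SYSTEM\CurrentControlSet\Services\elxstor\Parameters\Device /v DriverParameter
    reg add HKLM\SYSTEM\CurrentControlSet\Services\elxstor\Parameters\Device /v DriverParameter /t REG_SZ /d "QueueDepth=64;NodeTimeOut=10;LinkTimeOut=40;QueueTarget=1;ResetTPRLO=2" /f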

    More symptom information:

    I'm able to shut down Node B without any problems. The app groups fail
    over gracefully to Node A, and Node B powers off. When I power Node B back
    ON, Node A loses its LUNs while Node B performs its memory test, and they
    stay lost until Node B has started the Cluster service.

    -DK


  11. #11
    Edwin vMierlo Guest

    Re: Physical Disk goes offline when cluster node reboots

    Darrek,

    I suspect that for some reason you are losing your SCSI reservation, hence
    the *surviving* node loses it and fails the disk.

    I have seen TPRLO cause loss of a SCSI reservation (actually causing a
    process logout on the FC target), so I suggest you confirm with HP that
    your settings are correct. Each storage vendor has its own specific
    settings for connecting to its storage.

    The next step is to examine the cluster.log file on the surviving node, to
    see if it gives us any more clues about what is happening.
    Search the cluster.log for the "reservation lost" message and examine what
    happens before and after it. (If it is a true SCSI reservation loss, you
    will see nothing leading up to it; just bang, lost!)
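
    (findstr does this quickly from a command prompt; the path below is the
    default cluster.log location on Windows Server 2003, adjust if you have
    moved it:)

    REM Pull the reservation-lost lines and the disk arbitration chatter
    findstr /i /c:"reservation lost" "%windir%\Cluster\cluster.log"
    findstr /i /c:"DiskArb" "%windir%\Cluster\cluster.log"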

    If it is a SCSI reservation loss and no more clues or events can be seen
    in the cluster log, it is time to turn to the storage (vendor) and look
    for clues there.

    Let us know what you see in the cluster.log file.

    rgds,
    Edwin.


    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166561785.139389.249010@48g2000cwx.googlegroups.com...
    > Darrek wrote:
    >
    > > I understand that drivers may not solve the problem unless they have
    > > code to interact with a booting FC-HBA gracefully during a POST. Since
    > > upgrading the drivers I still have the problem. I will look for a
    > > firmware update.
    > >
    > > I'm booting from DAS.
    > >
    > > -DK

    >
    > I've upgraded firmware to the latest available on the HP website:
    > 2.10A10
    > I upgraded and then disabled the BOOT BIOS so I no longer get prompted
    > with a Ctrl-E for Emulex utils prompt during POST.
    > I've set the driver parameters as follows:
    > QueueDepth=64;NodeTimeOut=10;LinkTimeOut=40;QueueTarget=1;ResetTPRLO=2
    >
    > The ResetTPRLO is new and I shutdown both nodes and power cycled the
    > MSA1000 after implementing it. Problem still occurs.
    >
    > More symptom information:
    >
    > I'm able to shutdown Node B without any problems. App Groups
    > gracefully failover to Node A and Node B powers off. When I power ON
    > Node B and while it performs memory test, Node A loses its LUNs until
    > Node B has started the Cluster Services.
    >
    > -DK
    >




  12. #12
    Darrek Guest

    Re: Physical Disk goes offline when cluster node reboots

    Edwin vMierlo wrote:
    > Darrek,
    > ...
    >
    > Next step is to examine the cluster.log file on the surviving node, to see
    > if this gives us any more clues on what is happening.
    > ...
    > rgds,
    > Edwin.


    I mentioned earlier in the thread that the reservation is being lost.

    This extract begins when the first errors start appearing in the
    EventLog.
    Cluster.log extract:
    09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34789 type 1
    context 4098
    09:13.566 INFO [FM] FmpOfflineGroup,
    Group=33c75b80-6f81-4657-a059-442224c19f1a
    09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34790
    Generation=0
    09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34790 type 2
    context 15
    09:13.566 INFO [NM] Received update to set state for network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 WARN [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc
    failed (node: NODE_A, network: Private LAN).
    09:13.566 INFO [LM] LogFlush : pLog=0x015cad90 writing the 1024 bytes
    for active page at offset 0x00002000
    09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    09:13.566 WARN [LM] LogFlush::LogpWrite failed, error=0x0000048f
    09:13.566 INFO Physical Disk <Disk K>: [DiskArb] CompletionRoutine,
    status 1167.
    09:13.566 ERR Physical Disk <Disk K>: [DiskArb] CompletionRoutine:
    reservation lost! Status 1167
    09:13.566 INFO [FM] FmpSetResourcePersistentState: Setting persistent
    state for resource 26d811a7-f0a3-42e8-afd4-bbd83aa98676...
    09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    (Private LAN) is down.
    09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34790 type 2
    context 15
    09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 1 context
    4098
    09:13.566 INFO [GUM] Thread 0x1560 UpdateLock wait on Type 1
    09:13.566 INFO Physical Disk <Disk I>: [DiskArb] CompletionRoutine,
    status 1167.
    09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    09:13.566 ERR Physical Disk <Disk I>: [DiskArb] CompletionRoutine:
    reservation lost! Status 1167
    09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34791
    Generation=0
    09:13.566 INFO [NM] Beginning phase 1 of state computation for network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34791 type 1
    context 4098
    09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    Size=152
    09:13.566 INFO [LM] LogCommitSize : Entry RmId=5 Size=152
    09:13.566 INFO [LM] LogCommitSize : Exit, returning 0x00000000
    09:13.566 INFO [DM] DmUpdateSetValue
    09:13.566 INFO [DM] Setting value of PersistentState for key
    Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676 to 0x00000000
    09:13.566 INFO [API] Notification on port 14ebc8, key c00b0 of
    type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    09:13.566 INFO [API] Notification on port 14eee8, key 9d628 of
    type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    09:13.566 INFO [API] Notification on port 14ebc8, key cdd30 of
    type 64. KeyName
    09:13.566 INFO [API] Notification on port 14eee8, key 9f768 of
    type 64. KeyName
    09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    Size=152
    09:13.566 INFO [LM] LogWrite : Entry TrId=34791 RmId=5 RmType = 4098
    Size=152
    09:13.566 INFO [LM] LogpAppendPage : Writing 1024 bytes to disk at
    offset 0x00002000
    09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34791 type 1
    context 4098
    09:13.566 INFO [FM] FmpOfflineResource: SQL Server (LA997DB001) depends
    on Disk K - DevSQLAMDF. Shut down first.
    09:13.566 INFO [FM] FmpOfflineResource: SQL Server Agent (LA997DB001)
    depends on SQL Server (LA997DB001). Shut down first.
    09:13.566 INFO [FM] FmpRmOfflineResource: InterlockedIncrement on
    gdwQuoBlockingResources for resource
    8f068a2b-1649-49db-9628-7e3bcb1c0ff6
    09:13.566 INFO [NM] Node is down for interface 0
    (51173b6b-2071-4f92-a8df-605920339ac1) on network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    09:13.566 INFO [NM] Examining connectivity data for interface 1
    (da2e0aec-f760-49b0-bdc7-7582e3bce5dc) on network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [NM] The report from interface 0 is not valid on network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [NM] Interface 1 (da2e0aec-f760-49b0-bdc7-7582e3bce5dc)
    is up on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [NM] Completed phase 1 of state computation for network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [NM] Unavailable=1, Failed = 0, Unreachable=0,
    Reachable=1, Up=1 on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    09:13.566 INFO [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa is now
    in state 3
    09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34792
    Generation=0
    09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34792 type 2
    context 15
    09:13.566 INFO [NM] Received update to set state for network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc is
    up (node: NODE_A, network: Private LAN).
    09:13.566 INFO Physical Disk <Disk F>: [DiskArb] CompletionRoutine,
    status 1167.
    09:13.566 ERR Physical Disk <Disk F>: [DiskArb] CompletionRoutine:
    reservation lost! Status 1167
    09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    (Private LAN) is up.
    09:13.566 INFO Physical Disk <Disk Q>: [DiskArb] CompletionRoutine,
    status 1167.
    09:13.566 ERR Physical Disk <Disk Q>: [DiskArb] CompletionRoutine:
    reservation lost! Status 1167
    09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34792 type 2
    context 15
    09:13.566 INFO [NM] Worker thread finished processing network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 ERR [RM] LostQuorumResource, cluster service terminated...
    09:13.801 WARN [RM] Going away, Status = 1, Shutdown = 0.
    09:13.801 ERR [RM] Active Resource = 000BD058
    09:13.801 ERR [RM] Resource State is 5, "Offline"
    09:13.801 ERR [RM] Resource name is SQL Server Agent (LA997DB001)
    09:13.801 ERR [RM] Resource type is SQL Server Agent
    09:13.801 INFO [RM] Posting shutdown notification.
    09:13.801 INFO [RM] NotifyChanges shutting down.
    09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    (Partition1) - Received
    09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    (Partition1) - Processed
    09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    (Partition1) - Received
    09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    (Partition1) - Processed
    09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    for Q (Partition1) - Received
    09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    for Q (Partition1) - Processed
    09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    received
    09:13.957 WARN Physical Disk <Disk Q>: [PnP] PokeDiskResource: Can't
    open \\.\PhysicalDrive1
    09:13.957 INFO Physical Disk <Disk Q>: Offset
    String
    09:13.957 INFO Physical Disk <Disk Q>: ================
    ======================================
    09:13.957 INFO Physical Disk <Disk Q>: 0000000000008000
    \??\Volume{95bfac1b-3def-11db-bc56-0017a43fb52d}
    09:13.957 INFO Physical Disk <Disk Q>: *** End of list ***
    09:13.957 INFO Physical Disk <Disk Q>: SetupVolGuids: Processing
    VolGuid list
    09:13.957 WARN Physical Disk <Disk Q>: SetupVolGuids: Unable to assign
    VolGuid to device, error 3221225530
    09:13.957 WARN Physical Disk <Disk Q>: ValidateMountPoints:
    GetVolumeNameForVolumeMountPoint for
    (\\?\GLOBALROOT\Device\Harddisk1\Partition1\) returned 3
    09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    processed
    09:13.972 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    received


    -DK


  13. #13
    Edwin vMierlo Guest

    Re: Physical Disk goes offline when cluster node reboots

    Darrek,

    You can clearly see the reservation loss in this log; however, there are
    some other errors which caught my attention.

    error=0x00000006
    This could be an "ERROR_INVALID_HANDLE" error.

    error=0x0000048f
    This could be an "ERROR_DEVICE_NOT_CONNECTED" error.
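
    (You can confirm these translations yourself, as net helpmsg takes the
    decimal form of a Win32 error code; note that 0x48f is 1167 decimal, the
    same "status 1167" that appears throughout your log:)

    REM 0x6 = 6 decimal, 0x48f = 1167 decimal
    net helpmsg 6
    net helpmsg 1167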

    At the end of the extract of your log you also see
    DBT_DEVICEREMOVECOMPLETE events. Those occur when (media) devices are
    removed from your configuration. Even with a pure reservation loss you
    should not see them, as you are still connected to the storage and your
    disk devices on the storage are still present.

    So, yes, the reservation is lost, but that is not the only thing; it could
    very well be that you are actually disconnected from your storage, just
    for a small period.

    What I would do is the following:

    - Ensure that your hosts are set up (drivers, firmware, settings) as per
    HP's requirements.
    - Open a case with HP and ask them to investigate the "loss of
    connectivity".
    - Get HP to open a case with Microsoft, as my analysis is only a
    "newsgroup analysis" and I might be wrong (possible!), or the HP engineers
    might not accept it, which is fully understandable.
    - Analyse the switch logs; this might be the area where the "disconnect"
    happens, or it might give you clues for further troubleshooting.

    So basically, once you have checked your host setup and config (yet again,
    I know, just being careful), let's turn to the storage and fabrics.

    HTH,
    _Edwin.





    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166639777.056877.11990@i12g2000cwa.googlegroups.com...
    > Edwin vMierlo wrote:
    > > Darrek,
    > > ...
    > >
    > > Next step is to examine the cluster.log file on the surviving node, to

    see
    > > if this gives us any more clues on what is happening.
    > > ...
    > > rgds,
    > > Edwin.

    >
    > I mentioned earlier in the thread that the reservation is being lost.
    >
    > This extract begins when the first errors start appearing in the
    > EventLog.
    > Cluster.log extract:
    > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > 09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    > 09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    > 09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34789 type 1
    > context 4098
    > 09:13.566 INFO [FM] FmpOfflineGroup,
    > Group=33c75b80-6f81-4657-a059-442224c19f1a
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    > 09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34790
    > Generation=0
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34790 type 2
    > context 15
    > 09:13.566 INFO [NM] Received update to set state for network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 WARN [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc
    > failed (node: NODE_A, network: Private LAN).
    > 09:13.566 INFO [LM] LogFlush : pLog=0x015cad90 writing the 1024 bytes
    > for active page at offset 0x00002000
    > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > 09:13.566 WARN [LM] LogFlush::LogpWrite failed, error=0x0000048f
    > 09:13.566 INFO Physical Disk <Disk K>: [DiskArb] CompletionRoutine,
    > status 1167.
    > 09:13.566 ERR Physical Disk <Disk K>: [DiskArb] CompletionRoutine:
    > reservation lost! Status 1167
    > 09:13.566 INFO [FM] FmpSetResourcePersistentState: Setting persistent
    > state for resource 26d811a7-f0a3-42e8-afd4-bbd83aa98676...
    > 09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > (Private LAN) is down.
    > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34790 type 2
    > context 15
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 1 context
    > 4098
    > 09:13.566 INFO [GUM] Thread 0x1560 UpdateLock wait on Type 1
    > 09:13.566 INFO Physical Disk <Disk I>: [DiskArb] CompletionRoutine,
    > status 1167.
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > 09:13.566 ERR Physical Disk <Disk I>: [DiskArb] CompletionRoutine:
    > reservation lost! Status 1167
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34791
    > Generation=0
    > 09:13.566 INFO [NM] Beginning phase 1 of state computation for network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34791 type 1
    > context 4098
    > 09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    > Size=152
    > 09:13.566 INFO [LM] LogCommitSize : Entry RmId=5 Size=152
    > 09:13.566 INFO [LM] LogCommitSize : Exit, returning 0x00000000
    > 09:13.566 INFO [DM] DmUpdateSetValue
    > 09:13.566 INFO [DM] Setting value of PersistentState for key
    > Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676 to 0x00000000
    > 09:13.566 INFO [API] Notification on port 14ebc8, key c00b0 of
    > type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    > 09:13.566 INFO [API] Notification on port 14eee8, key 9d628 of
    > type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    > 09:13.566 INFO [API] Notification on port 14ebc8, key cdd30 of
    > type 64. KeyName
    > 09:13.566 INFO [API] Notification on port 14eee8, key 9f768 of
    > type 64. KeyName
    > 09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    > Size=152
    > 09:13.566 INFO [LM] LogWrite : Entry TrId=34791 RmId=5 RmType = 4098
    > Size=152
    > 09:13.566 INFO [LM] LogpAppendPage : Writing 1024 bytes to disk at
    > offset 0x00002000
    > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > 09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    > 09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    > 09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34791 type 1
    > context 4098
    > 09:13.566 INFO [FM] FmpOfflineResource: SQL Server (LA997DB001) depends
    > on Disk K - DevSQLAMDF. Shut down first.
    > 09:13.566 INFO [FM] FmpOfflineResource: SQL Server Agent (LA997DB001)
    > depends on SQL Server (LA997DB001). Shut down first.
    > 09:13.566 INFO [FM] FmpRmOfflineResource: InterlockedIncrement on
    > gdwQuoBlockingResources for resource
    > 8f068a2b-1649-49db-9628-7e3bcb1c0ff6
    > 09:13.566 INFO [NM] Node is down for interface 0
    > (51173b6b-2071-4f92-a8df-605920339ac1) on network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > 09:13.566 INFO [NM] Examining connectivity data for interface 1
    > (da2e0aec-f760-49b0-bdc7-7582e3bce5dc) on network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [NM] The report from interface 0 is not valid on network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [NM] Interface 1 (da2e0aec-f760-49b0-bdc7-7582e3bce5dc)
    > is up on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [NM] Completed phase 1 of state computation for network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [NM] Unavailable=1, Failed = 0, Unreachable=0,
    > Reachable=1, Up=1 on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > 09:13.566 INFO [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa is now
    > in state 3
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    > 09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34792
    > Generation=0
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34792 type 2
    > context 15
    > 09:13.566 INFO [NM] Received update to set state for network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc is
    > up (node: NODE_A, network: Private LAN).
    > 09:13.566 INFO Physical Disk <Disk F>: [DiskArb] CompletionRoutine,
    > status 1167.
    > 09:13.566 ERR Physical Disk <Disk F>: [DiskArb] CompletionRoutine:
    > reservation lost! Status 1167
    > 09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > (Private LAN) is up.
    > 09:13.566 INFO Physical Disk <Disk Q>: [DiskArb] CompletionRoutine,
    > status 1167.
    > 09:13.566 ERR Physical Disk <Disk Q>: [DiskArb] CompletionRoutine:
    > reservation lost! Status 1167
    > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34792 type 2
    > context 15
    > 09:13.566 INFO [NM] Worker thread finished processing network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 ERR [RM] LostQuorumResource, cluster service terminated...
    > 09:13.801 WARN [RM] Going away, Status = 1, Shutdown = 0.
    > 09:13.801 ERR [RM] Active Resource = 000BD058
    > 09:13.801 ERR [RM] Resource State is 5, "Offline"
    > 09:13.801 ERR [RM] Resource name is SQL Server Agent (LA997DB001)
    > 09:13.801 ERR [RM] Resource type is SQL Server Agent
    > 09:13.801 INFO [RM] Posting shutdown notification.
    > 09:13.801 INFO [RM] NotifyChanges shutting down.
    > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    > (Partition1) - Received
    > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    > (Partition1) - Processed
    > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    > (Partition1) - Received
    > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    > (Partition1) - Processed
    > 09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    > for Q (Partition1) - Received
    > 09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    > for Q (Partition1) - Processed
    > 09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > received
    > 09:13.957 WARN Physical Disk <Disk Q>: [PnP] PokeDiskResource: Can't
    > open \\.\PhysicalDrive1
    > 09:13.957 INFO Physical Disk <Disk Q>: Offset
    > String
    > 09:13.957 INFO Physical Disk <Disk Q>: ================
    > ======================================
    > 09:13.957 INFO Physical Disk <Disk Q>: 0000000000008000
    > \??\Volume{95bfac1b-3def-11db-bc56-0017a43fb52d}
    > 09:13.957 INFO Physical Disk <Disk Q>: *** End of list ***
    > 09:13.957 INFO Physical Disk <Disk Q>: SetupVolGuids: Processing
    > VolGuid list
    > 09:13.957 WARN Physical Disk <Disk Q>: SetupVolGuids: Unable to assign
    > VolGuid to device, error 3221225530
    > 09:13.957 WARN Physical Disk <Disk Q>: ValidateMountPoints:
    > GetVolumeNameForVolumeMountPoint for
    > (\\?\GLOBALROOT\Device\Harddisk1\Partition1\) returned 3
    > 09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > processed
    > 09:13.972 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > received
    >
    >
    > -DK
    >




  14. #14
    Edwin vMierlo Guest

    Re: Physical Disk goes offline when cluster node reboots

    Darrek,

    I hope you are pursuing this issue with your hardware vendor at this
    point, and I hope that my view on the cluster.log was clear enough.

    In short, I think the reservation loss is a symptom of something else,
    either a physical disconnect or a logical disconnect (PRLO or TPRLO).

    In any case let me know how you get on,
    and have a merry Xmas.

    Rgds,
    Edwin.



    "Edwin vMierlo" <EdwinvMierlo@discussions.microsoft.com> wrote in message
    news:OrjTeGRJHHA.1252@TK2MSFTNGP02.phx.gbl...
    > Darrek,
    >
    > You can clearly see the reservation lost in this log, however there are

    some
    > other errors which caught my attention.
    >
    > error=0x00000006
    > This could be a "ERROR_INVALID_HANDLE" error
    >
    > error=0x0000048f
    > This could be a "ERROR_DEVICE_NOT_CONNECTED" error
    >
    > And then at the end of the extract of your log, you see
    > DBT_DEVICEREMOVECOMPLETE events
    > These are when (media) devices are removed from your configuration....

    even
    > when it is a reservation lost, you should not see those, as you are still
    > connected to storage and your disk devices on the storage are still

    present.
    >
    > So, yes the reservation is lost, but that is not the only thing, it could
    > very well be that you are actually disconnected from your storage, just

    for
    > a small period.
    >
    > What I would do, is the following :
    >
    > - Ensure that your hosts are setup (drivers firmware settings) as per
    > requirement of HP
    > - Open a case with HP, and ask them to investigate the "loss of
    > connectivity"
    > - Get HP to open a case with Microsoft, as my analysis is only a

    "newsgroup
    > analysis" and I might be wrong (possible !) or HP Engineers might not

    accept
    > this, which is fully understandable.
    > - analyse the switch logs, this might be an area where this "disconnect"
    > happens, or it might give you clues for further troubleshooting.
    >
    > So basically once you have checked your host setup and config (yet again,

    I
    > know, just being careful), lets turn to the storage and fabrics.
    >
    > HTH,
    > _Edwin.
    >
    >
    >
    >
    >
    > "Darrek" <Darrek.Kay1@nike.com> wrote in message
    > news:1166639777.056877.11990@i12g2000cwa.googlegroups.com...
    > > Edwin vMierlo wrote:
    > > > Darrek,
    > > > ...
    > > >
    > > > Next step is to examine the cluster.log file on the surviving node, to

    > see
    > > > if this gives us any more clues on what is happening.
    > > > ...
    > > > rgds,
    > > > Edwin.

    > >
    > > I mentioned earlier in the thread that the reservation is being lost.
    > >
    > > This extract begins when the first errors start appearing in the
    > > EventLog.
    > > Cluster.log extract:
    > > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > > 09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    > > 09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    > > 09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    > > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34789 type 1
    > > context 4098
    > > 09:13.566 INFO [FM] FmpOfflineGroup,
    > > Group=33c75b80-6f81-4657-a059-442224c19f1a
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    > > 09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34790
    > > Generation=0
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34790 type 2
    > > context 15
    > > 09:13.566 INFO [NM] Received update to set state for network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 WARN [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc
    > > failed (node: NODE_A, network: Private LAN).
    > > 09:13.566 INFO [LM] LogFlush : pLog=0x015cad90 writing the 1024 bytes
    > > for active page at offset 0x00002000
    > > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > > 09:13.566 WARN [LM] LogFlush::LogpWrite failed, error=0x0000048f
    > > 09:13.566 INFO Physical Disk <Disk K>: [DiskArb] CompletionRoutine,
    > > status 1167.
    > > 09:13.566 ERR Physical Disk <Disk K>: [DiskArb] CompletionRoutine:
    > > reservation lost! Status 1167
    > > 09:13.566 INFO [FM] FmpSetResourcePersistentState: Setting persistent
    > > state for resource 26d811a7-f0a3-42e8-afd4-bbd83aa98676...
    > > 09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > > (Private LAN) is down.
    > > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34790 type 2
    > > context 15
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 1 context
    > > 4098
    > > 09:13.566 INFO [GUM] Thread 0x1560 UpdateLock wait on Type 1
    > > 09:13.566 INFO Physical Disk <Disk I>: [DiskArb] CompletionRoutine,
    > > status 1167.
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > > 09:13.566 ERR Physical Disk <Disk I>: [DiskArb] CompletionRoutine:
    > > reservation lost! Status 1167
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34791
    > > Generation=0
    > > 09:13.566 INFO [NM] Beginning phase 1 of state computation for network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34791 type 1
    > > context 4098
    > > 09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    > > Size=152
    > > 09:13.566 INFO [LM] LogCommitSize : Entry RmId=5 Size=152
    > > 09:13.566 INFO [LM] LogCommitSize : Exit, returning 0x00000000
    > > 09:13.566 INFO [DM] DmUpdateSetValue
    > > 09:13.566 INFO [DM] Setting value of PersistentState for key
    > > Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676 to 0x00000000
    > > 09:13.566 INFO [API] Notification on port 14ebc8, key c00b0 of
    > > type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    > > 09:13.566 INFO [API] Notification on port 14eee8, key 9d628 of
    > > type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    > > 09:13.566 INFO [API] Notification on port 14ebc8, key cdd30 of
    > > type 64. KeyName
    > > 09:13.566 INFO [API] Notification on port 14eee8, key 9f768 of
    > > type 64. KeyName
    > > 09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    > > Size=152
    > > 09:13.566 INFO [LM] LogWrite : Entry TrId=34791 RmId=5 RmType = 4098
    > > Size=152
    > > 09:13.566 INFO [LM] LogpAppendPage : Writing 1024 bytes to disk at
    > > offset 0x00002000
    > > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > > 09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    > > 09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    > > 09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    > > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34791 type 1
    > > context 4098
    > > 09:13.566 INFO [FM] FmpOfflineResource: SQL Server (LA997DB001) depends
    > > on Disk K - DevSQLAMDF. Shut down first.
    > > 09:13.566 INFO [FM] FmpOfflineResource: SQL Server Agent (LA997DB001)
    > > depends on SQL Server (LA997DB001). Shut down first.
    > > 09:13.566 INFO [FM] FmpRmOfflineResource: InterlockedIncrement on
    > > gdwQuoBlockingResources for resource
    > > 8f068a2b-1649-49db-9628-7e3bcb1c0ff6
    > > 09:13.566 INFO [NM] Node is down for interface 0
    > > (51173b6b-2071-4f92-a8df-605920339ac1) on network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > > 09:13.566 INFO [NM] Examining connectivity data for interface 1
    > > (da2e0aec-f760-49b0-bdc7-7582e3bce5dc) on network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [NM] The report from interface 0 is not valid on network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [NM] Interface 1 (da2e0aec-f760-49b0-bdc7-7582e3bce5dc)
    > > is up on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [NM] Completed phase 1 of state computation for network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [NM] Unavailable=1, Failed = 0, Unreachable=0,
    > > Reachable=1, Up=1 on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > > 09:13.566 INFO [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa is now
    > > in state 3
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    > > 09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34792
    > > Generation=0
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34792 type 2
    > > context 15
    > > 09:13.566 INFO [NM] Received update to set state for network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc is
    > > up (node: NODE_A, network: Private LAN).
    > > 09:13.566 INFO Physical Disk <Disk F>: [DiskArb] CompletionRoutine,
    > > status 1167.
    > > 09:13.566 ERR Physical Disk <Disk F>: [DiskArb] CompletionRoutine:
    > > reservation lost! Status 1167
    > > 09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > > (Private LAN) is up.
    > > 09:13.566 INFO Physical Disk <Disk Q>: [DiskArb] CompletionRoutine,
    > > status 1167.
    > > 09:13.566 ERR Physical Disk <Disk Q>: [DiskArb] CompletionRoutine:
    > > reservation lost! Status 1167
    > > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34792 type 2
    > > context 15
    > > 09:13.566 INFO [NM] Worker thread finished processing network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 ERR [RM] LostQuorumResource, cluster service terminated...
    > > 09:13.801 WARN [RM] Going away, Status = 1, Shutdown = 0.
    > > 09:13.801 ERR [RM] Active Resource = 000BD058
    > > 09:13.801 ERR [RM] Resource State is 5, "Offline"
    > > 09:13.801 ERR [RM] Resource name is SQL Server Agent (LA997DB001)
    > > 09:13.801 ERR [RM] Resource type is SQL Server Agent
    > > 09:13.801 INFO [RM] Posting shutdown notification.
    > > 09:13.801 INFO [RM] NotifyChanges shutting down.
    > > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    > > (Partition1) - Received
    > > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    > > (Partition1) - Processed
    > > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    > > (Partition1) - Received
    > > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    > > (Partition1) - Processed
    > > 09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    > > for Q (Partition1) - Received
    > > 09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    > > for Q (Partition1) - Processed
    > > 09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > > received
    > > 09:13.957 WARN Physical Disk <Disk Q>: [PnP] PokeDiskResource: Can't
    > > open \\.\PhysicalDrive1
    > > 09:13.957 INFO Physical Disk <Disk Q>: Offset
    > > String
    > > 09:13.957 INFO Physical Disk <Disk Q>: ================
    > > ======================================
    > > 09:13.957 INFO Physical Disk <Disk Q>: 0000000000008000
    > > \??\Volume{95bfac1b-3def-11db-bc56-0017a43fb52d}
    > > 09:13.957 INFO Physical Disk <Disk Q>: *** End of list ***
    > > 09:13.957 INFO Physical Disk <Disk Q>: SetupVolGuids: Processing
    > > VolGuid list
    > > 09:13.957 WARN Physical Disk <Disk Q>: SetupVolGuids: Unable to assign
    > > VolGuid to device, error 3221225530
    > > 09:13.957 WARN Physical Disk <Disk Q>: ValidateMountPoints:
    > > GetVolumeNameForVolumeMountPoint for
    > > (\\?\GLOBALROOT\Device\Harddisk1\Partition1\) returned 3
    > > 09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > > processed
    > > 09:13.972 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > > received
    > >
    > >
    > > -DK
    > >

    >
    >



