
Thread: Physical Disk goes offline when cluster node reboots

  1. #1
    Darrek Guest

    Physical Disk goes offline when cluster node reboots

    I have a 2 node Windows 2003 SP1 EE cluster connected to an MSA1000 SAN
    via integrated FC hub. My SAN is single-path since this is our Dev/QA
    environment.

    When I reboot either node in the cluster, all physical disk resources go
    offline while the rebooting server goes through POST. I get Delayed
    Write Failed errors in the event log of the node that is still running.
    Once the rebooted node is up and running, the cluster returns to
    normal.

    I'm worried that our production cluster may exhibit the same issues
    when it goes live even though it is built in a more robust fashion.

    I'm open to suggestions.

    The servers are HP DL145s with Emulex FC2243 cards. If I simply fail over
    a cluster group, everything works great.
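
    (If it helps to reproduce the good case: I drive the manual failover from
    the command line with the built-in cluster.exe tool. A rough sketch, where
    the group and node names are placeholders rather than my real ones:)

    REM Move one group to the other node, then list group and resource states
    cluster group "SQL Group" /moveto:NODEB
    cluster group
    cluster res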

    Thanks.
    -DK


  2. #2
    Edwin vMierlo Guest

    Re: Physical Disk goes offline when cluster node reboots

    Darrek,

    Just to confirm that we have the symptom right:

    - All groups are online on Node 1 (therefore all disks are online on Node 1)
    - You reboot Node 2
    - All disks go offline on Node 1 during the reboot/POST of Node 2

    Please confirm that this is what you are experiencing.

    And two questions:
    Q: Do the disks that go offline on Node 1 fail, or do they go offline?
    (Please specify, as there is a difference.)
    Q: Do you see any "reservation lost" messages/events in the system event
    log?
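
    (A quick way to check for those without clicking through Event Viewer is
    the eventquery.vbs script that ships with Windows Server 2003; just a
    sketch, where 1038 is the ClusSvc reservation-lost event ID:)

    REM List ClusSvc event 1038 (reservation lost) from the System log
    cscript //nologo %windir%\system32\eventquery.vbs /l system /fi "id eq 1038"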

    rgds,
    Edwin.




    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166032177.312098.234640@80g2000cwy.googlegroups.com...
    > I have a 2 node Windows 2003 SP1 EE cluster connected to an MSA1000 SAN
    > via integrated FC hub. My SAN is single-path since this is our Dev/QA
    > environment.
    >
    > When I reboot any node in the cluster all physical disk resources go
    > offline while the rebooted server goes through POST. I get Delayed
    > Write Failed errors in the event log of the node that is still running.
    > Once the rebooted node is up and running the cluster returns to
    > normal.
    >
    > I'm worried that our production cluster may exhibit the same issues
    > when it goes live even though it is built in a more robust fashion.
    >
    > I'm open for suggestions.
    >
    > The servers are HP DL145's, using Emulex FC2243 cards. If I simply
    > failover a cluster group everything works great.
    >
    > Thanks.
    > -DK
    >




  3. #3
    John Toner [MVP] Guest

    Re: Physical Disk goes offline when cluster node reboots

    Make sure you're running supported versions of the HBA drivers. Many
    vendors also require that you apply a Storport hotfix, such as KB 916048.
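
    (As a quick sanity check, WMIC can tell you whether a given fix is already
    installed; a sketch, assuming the hotfix registers under that KB number:)

    REM Check whether the Storport hotfix is installed
    wmic qfe where "HotFixID='KB916048'" get HotFixID,Description,InstalledOn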

    Regards,
    John

    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166032177.312098.234640@80g2000cwy.googlegroups.com...
    > I have a 2 node Windows 2003 SP1 EE cluster connected to an MSA1000 SAN
    > via integrated FC hub. My SAN is single-path since this is our Dev/QA
    > environment.
    >
    > When I reboot any node in the cluster all physical disk resources go
    > offline while the rebooted server goes through POST. I get Delayed
    > Write Failed errors in the event log of the node that is still running.
    > Once the rebooted node is up and running the cluster returns to
    > normal.
    >
    > I'm worried that our production cluster may exhibit the same issues
    > when it goes live even though it is built in a more robust fashion.
    >
    > I'm open for suggestions.
    >
    > The servers are HP DL145's, using Emulex FC2243 cards. If I simply
    > failover a cluster group everything works great.
    >
    > Thanks.
    > -DK
    >




  4. #4
    Darrek Guest

    Re: Physical Disk goes offline when cluster node reboots


    Edwin vMierlo wrote:
    > - All groups are online on Node 1 (therefore all disks are online on Node 1)
    > - You reboot Node 2
    > - All disks go offline on Node 1 during the reboot/POST of Node 2
    >
    > Please confirm that this is what you are experiencing.


    Yes. All groups are online and running fine on Node 1. During Node 2's
    POST, Node 1 reports errors like these in the event log:

    (One for each LUN on the SAN)
    Event Type: Error
    Event Source: Disk
    Event Category: None
    Event ID: 15
    Description:
    The device, \Device\Harddisk1, is not ready for access yet.

    And then...one of these...

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Physical Disk Resource
    Event ID: 1038
    Description:
    Reservation of cluster disk 'Disk T - QASQLBTmp' has been lost. Please
    check your system and disk configuration.


    And then...several of these...

    Event Type: Warning
    Event Source: Ntfs
    Event Category: None
    Event ID: 50
    Description:
    {Delayed Write Failed} Windows was unable to save all the data for the
    file . The data has been lost. This error may be caused by a failure of
    your computer hardware or network connection. Please try to save this
    file elsewhere.

    More Event 15s and 1038s follow for the other LUNs.

    A couple of these mixed in...

    Event Type: Information
    Event Source: Application Popup
    Event Category: None
    Event ID: 26
    Description:
    Application popup: Windows - Delayed Write Failed : Windows was unable
    to save all the data for the file Q:\$Mft. The data has been lost. This
    error may be caused by a failure of your computer hardware or network
    connection. Please try to save this file elsewhere.

    One of these:

    Event Type: Warning
    Event Source: Ftdisk
    Event Category: Disk
    Event ID: 57
    Description:
    The system failed to flush data to the transaction log. Corruption may
    occur.

    At this point Cluster Administrator begins sending service stop commands
    to SQL, and I get these:

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Physical Disk Resource
    Event ID: 1036
    Description:
    Cluster disk resource '' did not respond to a SCSI maintenance command.


    Followed by several more 57s.

    I even managed one of these:

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Physical Disk Resource
    Event ID: 1034
    Description:
    The disk associated with cluster disk resource 'Disk Q:' could not be
    found. The expected signature of the disk was BED1F8F9. If the disk was
    removed from the server cluster, the resource should be deleted. If the
    disk was replaced, the resource must be deleted and created again in
    order to bring the disk online. If the disk has not been removed or
    replaced, it may be inaccessible at this time because it is reserved by
    another server cluster node.

    Followed by one of these:

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Startup/Shutdown
    Event ID: 1009

    Description:
    Cluster service could not join an existing server cluster and could not
    form a new server cluster. Cluster service has terminated.




    The drivers I'm using are the Emulex Storport FC2243 drivers:
    5-1.11X1 11/07/2005 WS2K3 32 bit (elxadjct.sys & elxstor.sys)
    5.1.3.2 (elxstod.dll)

    The MSA 1000 is on firmware 4.48.
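
    (For the record, this is roughly how I read the file versions back with
    WMIC rather than trusting the installer; the path is the default driver
    location, adjust if yours differs:)

    REM Report the version stamped on the Emulex Storport miniport
    wmic datafile where "name='C:\\WINDOWS\\system32\\drivers\\elxstor.sys'" get Name,Version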


    Thanks for your help!

    -DK


  5. #5
    Chuck Timon [Microsoft] Guest

    Re: Physical Disk goes offline when cluster node reboots

    Yep, these are all errors that indicate hardware problems...probably, in
    your case, with the configuration of the hardware. One of the classic
    examples of this is Dell PERC RAID controllers: if 'cluster mode' is not
    set on the controllers, errors like the ones you are seeing will manifest
    themselves. You probably need to speak with your hardware vendor to ensure
    they know you are using their hardware in a cluster and have them review
    the configuration.

    --
    Chuck Timon, Jr.
    Microsoft Corporation
    Longhorn Readiness Team
    This posting is provided "AS IS" with no
    warranties, and confers no rights.

    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166201880.820694.237490@j72g2000cwa.googlegroups.com...
    >
    > Edwin vMierlo wrote:
    >> Darrek,
    >>
    >> Just to confirm that we have the symptom right
    >>
    >> - All groups are online on Node 1 (therefore all disks are online on Node
    >> 1)
    >> - you reboot Node 2
    >> - All disks on Node 1 go offline on Node 1 during reboot/Post of Node 2
    >>
    >> Please confirm that this is what you are experiencing
    >>
    >> and two questions:
    >> Q: are the disks who go offline on Node 2, do they fail or do they go
    >> offline ? (please specify, as there is a difference)
    >> Q: Do you see any "reservation lost" messages/events in the system event
    >> log
    >> ?
    >>
    >> rgds,
    >> Edwin.
    >>

    >
    > Yes. All groups are online and running fine on Node 1. During Node 2
    > POST Node 1 reports errors like this in the event log:
    >
    > (One for each LUN on the SAN)
    > Event Type: Error
    > Event Source: Disk
    > Event Category: None
    > Event ID: 15
    > Description:
    > The device, \Device\Harddisk1, is not ready for access yet.
    >
    > And then...one of these...
    >
    > Event Type: Error
    > Event Source: ClusSvc
    > Event Category: Physical Disk Resource
    > Event ID: 1038
    > Description:
    > Reservation of cluster disk 'Disk T - QASQLBTmp' has been lost. Please
    > check your system and disk configuration.
    >
    >
    > And then...several of these...
    >
    > Event Type: Warning
    > Event Source: Ntfs
    > Event Category: None
    > Event ID: 50
    > Description:
    > {Delayed Write Failed} Windows was unable to save all the data for the
    > file . The data has been lost. This error may be caused by a failure of
    > your computer hardware or network connection. Please try to save this
    > file elsewhere.
    >
    > More Event 15's, and 1038' for other LUNs
    >
    > A couple of these mixed in...
    >
    > Event Type: Information
    > Event Source: Application Popup
    > Event Category: None
    > Event ID: 26
    > Description:
    > Application popup: Windows - Delayed Write Failed : Windows was unable
    > to save all the data for the file Q:\$Mft. The data has been lost. This
    > error may be caused by a failure of your computer hardware or network
    > connection. Please try to save this file elsewhere.
    >
    > One of these:
    >
    > Event Type: Warning
    > Event Source: Ftdisk
    > Event Category: Disk
    > Event ID: 57
    > Description:
    > The system failed to flush data to the transaction log. Corruption may
    > occur.
    >
    > At this point Cluster Admin begins sending service stop commands to
    > SQL.
    > And I get these:
    >
    > Event Type: Error
    > Event Source: ClusSvc
    > Event Category: Physical Disk Resource
    > Event ID: 1036
    > Description:
    > Cluster disk resource '' did not respond to a SCSI maintenance command.
    >
    >
    > Followed by several more 57's:
    >
    > I even managed one of these:
    >
    > Event Type: Error
    > Event Source: ClusSvc
    > Event Category: Physical Disk Resource
    > Event ID: 1034
    > Description:
    > The disk associated with cluster disk resource 'Disk Q:' could not be
    > found. The expected signature of the disk was BED1F8F9. If the disk was
    > removed from the server cluster, the resource should be deleted. If the
    > disk was replaced, the resource must be deleted and created again in
    > order to bring the disk online. If the disk has not been removed or
    > replaced, it may be inaccessible at this time because it is reserved by
    > another server cluster node.
    >
    > Followed by one of these:
    >
    > Event Type: Error
    > Event Source: ClusSvc
    > Event Category: Startup/Shutdown
    > Event ID: 1009
    >
    > Description:
    > Cluster service could not join an existing server cluster and could not
    > form a new server cluster. Cluster service has terminated.
    >
    >
    >
    >
    > The drivers I'm using are Emulex Storport FC2243
    > 5-1.11X1 11/07/2005 WS2K3 32 bit (elxadjct.sys & elxstor.sys)
    > 5.1.3.2 (elxstod.dll)
    >
    > The MSA 1000 is on firmware 4.48.
    >
    >
    > Thanks for your help!
    >
    > -DK
    >




  6. #6
    Darrek Guest

    Re: Physical Disk goes offline when cluster node reboots


    Chuck Timon [Microsoft] wrote:
    > You probably need to speak with your hardware vendor to ensure they
    > know you are using their hardware in a cluster and have them review
    > the configuration.


    I've since found the MS hotfix and the updated Emulex drivers; I will be
    installing them later today. I've gone through the Emulex configuration
    that is available during the POST sequence and have not found anything
    that looks like it needs to be configured differently. The HP website
    also has nothing clearly called out.

    I'll follow up here if the updates help.

    -DK


  7. #7
    Edwin vMierlo Guest

    Re: Physical Disk goes offline when cluster node reboots

    Dear Darrek,

    Ensuring that you are running the latest drivers is a good start.

    However, as you report that the problem occurs during POST, a stage where
    the OS is not even loading, I would be surprised if updating your drivers
    alone improved the situation.

    I would also check the firmware/BIOS running on your Emulex cards and
    ensure you run the latest firmware, as the firmware is running (or
    starting to run) during, or just after, POST.

    (And if you update drivers, you should update firmware anyway, as the two
    go hand in hand.)

    Quick question, as this is a SAN environment: are you booting from your
    SAN?

    rgds,
    Edwin.


    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166207126.999846.241010@j72g2000cwa.googlegroups.com...
    >
    > Chuck Timon [Microsoft] wrote:
    > > Yep, these are all errors that indicate hardware problems...probably, in
    > > your case, with configuration of the hardware. One of the classic

    examples
    > > of this is Dell Perc RAID controllers. If the 'cluster mode' is not set

    on
    > > the controllers, then errors like what you are seeing will manifest
    > > themselves. Probably need to speak with your hardware vendor to ensure

    they
    > > know you are using their hardware on a cluster and have them reviewe the
    > > configuration.
    > >
    > > --
    > > Chuck Timon, Jr.
    > > Microsoft Corporation
    > > Longhorn Readiness Team
    > > This posting is provided "AS IS" with no
    > > warranties, and confers no rights.
    > >

    >
    > I've since found the MS hotfix and updated Emulex drivers. I will be
    > installing them later today. I've gone through the Emulex
    > configuration that is available during the POST sequence and have not
    > found anything that looks like it needs to be configured differently.
    > The HP website also has nothing clearly called out.
    >
    > I'll follow up here if the updates help.
    >
    > -DK
    >




  8. #8
    Darrek Guest

    Re: Physical Disk goes offline when cluster node reboots


    Edwin vMierlo wrote:
    > Quick question, as this is a SAN environment: are you booting from your
    > SAN?


    I understand that drivers may not solve the problem unless they include
    code to interact gracefully with a booting FC HBA during POST. Since
    upgrading the drivers I still have the problem. I will look for a
    firmware update.

    I'm booting from DAS.

    -DK


  9. #9
    Chuck Timon [Microsoft] Guest

    Re: Physical Disk goes offline when cluster node reboots

    If you are booting from SAN, have you read
    http://support.microsoft.com/kb/886569/en-us ?

    Just as an FYI, booting from SAN must be supported by your hardware vendor
    or Microsoft won't support it.

    --
    Chuck Timon, Jr.
    Microsoft Corporation
    Longhorn Readiness Team
    This posting is provided "AS IS" with no
    warranties, and confers no rights.

    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166552589.238041.128670@t46g2000cwa.googlegroups.com...
    >
    > Edwin vMierlo wrote:
    >> Dear Darrek,
    >>
    >> Ensuring that you are running the lastest drivers is a good start.
    >>
    >> However, as you report that your problem occurs during POST, which is the
    >> stage where the OS is not even loading, therefore I would be surprised if
    >> updating your drivers would improve your situation.
    >>
    >> I would think that you also need to check the firmware/bios which is
    >> running
    >> on your Emulex cards, ensure you run the latest firmware, as firmware is
    >> running (or starting to run) during (or just after) POST.
    >>
    >> (and if you update drivers, you should update firmware anyway, as the two
    >> go
    >> hand in hand)
    >>
    >> Quick question: as this is a SAN envioronment: Are you booting from your
    >> SAN
    >> ?
    >>
    >> rgds,
    >> Edwin.
    >>

    >
    > I understand that drivers may not solve the problem unless they have
    > code to interact with a booting FC-HBA gracefully during a POST. Since
    > upgrading the drivers I still have the problem. I will look for a
    > firmware update.
    >
    > I'm booting from DAS.
    >
    > -DK
    >




  10. #10
    Darrek Guest

    Re: Physical Disk goes offline when cluster node reboots



    I've upgraded the firmware to the latest available on the HP website:
    2.10A10.
    I upgraded and then disabled the boot BIOS, so I no longer get the Ctrl-E
    prompt for the Emulex utilities during POST.
    I've set the driver parameters as follows:
    QueueDepth=64;NodeTimeOut=10;LinkTimeOut=40;QueueTarget=1;ResetTPRLO=2

    The ResetTPRLO setting is new; I shut down both nodes and power cycled the
    MSA1000 after implementing it. The problem still occurs.
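
    (For anyone following along: on a Storport miniport these parameters live
    in the registry under the driver's service key. A sketch of what I did;
    the key and value names are what my Emulex 5-series driver uses, so verify
    them against the Emulex documentation before changing anything, and note a
    reboot is required for the change to take effect.)

    REM Inspect, then set, the Storport miniport driver parameters
    reg query HKLM\SYSTEM\CurrentControlSet\Services\elxstor\Parameters\Device /v DriverParameter
    reg add HKLM\SYSTEM\CurrentControlSet\Services\elxstor\Parameters\Device /v DriverParameter /t REG_SZ /d "QueueDepth=64;NodeTimeOut=10;LinkTimeOut=40;QueueTarget=1;ResetTPRLO=2" /f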

    More symptom information:

    I'm able to shut down Node B without any problems. The app groups fail
    over gracefully to Node A, and Node B powers off. When I power Node B back
    ON, Node A loses its LUNs while Node B performs its memory test, and they
    stay lost until Node B has started the Cluster service.

    -DK


  11. #11
    Edwin vMierlo Guest

    Re: Physical Disk goes offline when cluster node reboots

    Darrek,

    I suspect that for some reason you are losing your SCSI reservation, hence
    the *surviving* node loses it and fails the disk.

    I have seen TPRLO cause loss of a SCSI reservation (actually causing a
    process logout on the FC target), so I suggest you confirm with HP that
    your settings are correct. Each storage vendor has its own specific
    settings for connecting to its storage.

    The next step is to examine the cluster.log file on the surviving node, to
    see if it gives us any more clues about what is happening.
    Search the cluster.log for the "reservation lost" message and examine what
    happens before and after it. (If it is a true SCSI reservation loss, you
    will see nothing leading up to it; just bang, lost!)
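
    (findstr does this quickly from a command prompt; the path below is the
    default cluster.log location on Windows Server 2003, adjust if you have
    moved it:)

    REM Pull the reservation-lost lines and the disk arbitration chatter
    findstr /i /c:"reservation lost" "%windir%\Cluster\cluster.log"
    findstr /i /c:"DiskArb" "%windir%\Cluster\cluster.log"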

    If it is a SCSI reservation loss and no more clues or events can be seen
    in the cluster log, it is time to turn to the storage (vendor) and look
    for clues there.

    Let us know what you see in the cluster.log file.

    rgds,
    Edwin.


    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166561785.139389.249010@48g2000cwx.googlegroups.com...
    > Darrek wrote:
    >
    > > I understand that drivers may not solve the problem unless they have
    > > code to interact with a booting FC-HBA gracefully during a POST. Since
    > > upgrading the drivers I still have the problem. I will look for a
    > > firmware update.
    > >
    > > I'm booting from DAS.
    > >
    > > -DK

    >
    > I've upgraded firmware to the latest available on the HP website:
    > 2.10A10
    > I upgraded and then disabled the BOOT BIOS so I no longer get prompted
    > with a Ctrl-E for Emulex utils prompt during POST.
    > I've set the driver parameters as follows:
    > QueueDepth=64;NodeTimeOut=10;LinkTimeOut=40;QueueTarget=1;ResetTPRLO=2
    >
    > The ResetTPRLO is new and I shutdown both nodes and power cycled the
    > MSA1000 after implementing it. Problem still occurs.
    >
    > More symptom information:
    >
    > I'm able to shutdown Node B without any problems. App Groups
    > gracefully failover to Node A and Node B powers off. When I power ON
    > Node B and while it performs memory test, Node A loses its LUNs until
    > Node B has started the Cluster Services.
    >
    > -DK
    >




  12. #12
    Darrek Guest

    Re: Physical Disk goes offline when cluster node reboots

    Edwin vMierlo wrote:
    > Darrek,
    > ...
    >
    > Next step is to examine the cluster.log file on the surviving node, to see
    > if this gives us any more clues on what is happening.
    > ...
    > rgds,
    > Edwin.


    I mentioned earlier in the thread that the reservation is being lost.

    This extract begins when the first errors start appearing in the
    EventLog.
    Cluster.log extract:
    09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34789 type 1
    context 4098
    09:13.566 INFO [FM] FmpOfflineGroup,
    Group=33c75b80-6f81-4657-a059-442224c19f1a
    09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34790
    Generation=0
    09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34790 type 2
    context 15
    09:13.566 INFO [NM] Received update to set state for network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 WARN [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc
    failed (node: NODE_A, network: Private LAN).
    09:13.566 INFO [LM] LogFlush : pLog=0x015cad90 writing the 1024 bytes
    for active page at offset 0x00002000
    09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    09:13.566 WARN [LM] LogFlush::LogpWrite failed, error=0x0000048f
    09:13.566 INFO Physical Disk <Disk K>: [DiskArb] CompletionRoutine,
    status 1167.
    09:13.566 ERR Physical Disk <Disk K>: [DiskArb] CompletionRoutine:
    reservation lost! Status 1167
    09:13.566 INFO [FM] FmpSetResourcePersistentState: Setting persistent
    state for resource 26d811a7-f0a3-42e8-afd4-bbd83aa98676...
    09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    (Private LAN) is down.
    09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34790 type 2
    context 15
    09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 1 context
    4098
    09:13.566 INFO [GUM] Thread 0x1560 UpdateLock wait on Type 1
    09:13.566 INFO Physical Disk <Disk I>: [DiskArb] CompletionRoutine,
    status 1167.
    09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    09:13.566 ERR Physical Disk <Disk I>: [DiskArb] CompletionRoutine:
    reservation lost! Status 1167
    09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34791
    Generation=0
    09:13.566 INFO [NM] Beginning phase 1 of state computation for network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34791 type 1
    context 4098
    09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    Size=152
    09:13.566 INFO [LM] LogCommitSize : Entry RmId=5 Size=152
    09:13.566 INFO [LM] LogCommitSize : Exit, returning 0x00000000
    09:13.566 INFO [DM] DmUpdateSetValue
    09:13.566 INFO [DM] Setting value of PersistentState for key
    Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676 to 0x00000000
    09:13.566 INFO [API] Notification on port 14ebc8, key c00b0 of
    type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    09:13.566 INFO [API] Notification on port 14eee8, key 9d628 of
    type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    09:13.566 INFO [API] Notification on port 14ebc8, key cdd30 of
    type 64. KeyName
    09:13.566 INFO [API] Notification on port 14eee8, key 9f768 of
    type 64. KeyName
    09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    Size=152
    09:13.566 INFO [LM] LogWrite : Entry TrId=34791 RmId=5 RmType = 4098
    Size=152
    09:13.566 INFO [LM] LogpAppendPage : Writing 1024 bytes to disk at
    offset 0x00002000
    09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34791 type 1
    context 4098
    09:13.566 INFO [FM] FmpOfflineResource: SQL Server (LA997DB001) depends
    on Disk K - DevSQLAMDF. Shut down first.
    09:13.566 INFO [FM] FmpOfflineResource: SQL Server Agent (LA997DB001)
    depends on SQL Server (LA997DB001). Shut down first.
    09:13.566 INFO [FM] FmpRmOfflineResource: InterlockedIncrement on
    gdwQuoBlockingResources for resource
    8f068a2b-1649-49db-9628-7e3bcb1c0ff6
    09:13.566 INFO [NM] Node is down for interface 0
    (51173b6b-2071-4f92-a8df-605920339ac1) on network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    09:13.566 INFO [NM] Examining connectivity data for interface 1
    (da2e0aec-f760-49b0-bdc7-7582e3bce5dc) on network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [NM] The report from interface 0 is not valid on network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [NM] Interface 1 (da2e0aec-f760-49b0-bdc7-7582e3bce5dc)
    is up on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [NM] Completed phase 1 of state computation for network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [NM] Unavailable=1, Failed = 0, Unreachable=0,
    Reachable=1, Up=1 on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    09:13.566 INFO [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa is now
    in state 3
    09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34792
    Generation=0
    09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34792 type 2
    context 15
    09:13.566 INFO [NM] Received update to set state for network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 INFO [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc is
    up (node: NODE_A, network: Private LAN).
    09:13.566 INFO Physical Disk <Disk F>: [DiskArb] CompletionRoutine,
    status 1167.
    09:13.566 ERR Physical Disk <Disk F>: [DiskArb] CompletionRoutine:
    reservation lost! Status 1167
    09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    (Private LAN) is up.
    09:13.566 INFO Physical Disk <Disk Q>: [DiskArb] CompletionRoutine,
    status 1167.
    09:13.566 ERR Physical Disk <Disk Q>: [DiskArb] CompletionRoutine:
    reservation lost! Status 1167
    09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34792 type 2
    context 15
    09:13.566 INFO [NM] Worker thread finished processing network
    83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    09:13.566 ERR [RM] LostQuorumResource, cluster service terminated...
    09:13.801 WARN [RM] Going away, Status = 1, Shutdown = 0.
    09:13.801 ERR [RM] Active Resource = 000BD058
    09:13.801 ERR [RM] Resource State is 5, "Offline"
    09:13.801 ERR [RM] Resource name is SQL Server Agent (LA997DB001)
    09:13.801 ERR [RM] Resource type is SQL Server Agent
    09:13.801 INFO [RM] Posting shutdown notification.
    09:13.801 INFO [RM] NotifyChanges shutting down.
    09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    (Partition1) - Received
    09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    (Partition1) - Processed
    09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    (Partition1) - Received
    09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    (Partition1) - Processed
    09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    for Q (Partition1) - Received
    09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    for Q (Partition1) - Processed
    09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    received
    09:13.957 WARN Physical Disk <Disk Q>: [PnP] PokeDiskResource: Can't
    open \\.\PhysicalDrive1
    09:13.957 INFO Physical Disk <Disk Q>: Offset
    String
    09:13.957 INFO Physical Disk <Disk Q>: ================
    ======================================
    09:13.957 INFO Physical Disk <Disk Q>: 0000000000008000
    \??\Volume{95bfac1b-3def-11db-bc56-0017a43fb52d}
    09:13.957 INFO Physical Disk <Disk Q>: *** End of list ***
    09:13.957 INFO Physical Disk <Disk Q>: SetupVolGuids: Processing
    VolGuid list
    09:13.957 WARN Physical Disk <Disk Q>: SetupVolGuids: Unable to assign
    VolGuid to device, error 3221225530
    09:13.957 WARN Physical Disk <Disk Q>: ValidateMountPoints:
    GetVolumeNameForVolumeMountPoint for
    (\\?\GLOBALROOT\Device\Harddisk1\Partition1\) returned 3
    09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    processed
    09:13.972 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    received


    -DK


  13. #13
    Edwin vMierlo Guest

    Re: Physical Disk goes offline when cluster node reboots

    Darrek,

    You can clearly see the reservation loss in this log; however, there are
    some other errors which caught my attention.

    error=0x00000006
    This could be an "ERROR_INVALID_HANDLE" error.

    error=0x0000048f
    This could be an "ERROR_DEVICE_NOT_CONNECTED" error.
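
    (You can confirm these translations yourself, as net helpmsg takes the
    decimal form of a Win32 error code; note that 0x48f is 1167 decimal, the
    same "status 1167" that appears throughout your log:)

    REM 0x6 = 6 decimal, 0x48f = 1167 decimal
    net helpmsg 6
    net helpmsg 1167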

    At the end of the extract of your log you also see
    DBT_DEVICEREMOVECOMPLETE events. Those occur when (media) devices are
    removed from your configuration. Even with a pure reservation loss you
    should not see them, as you are still connected to the storage and your
    disk devices on the storage are still present.

    So, yes, the reservation is lost, but that is not the only thing; it could
    very well be that you are actually disconnected from your storage, just
    for a small period.

    What I would do is the following:

    - Ensure that your hosts are set up (drivers, firmware, settings) as per
    HP's requirements.
    - Open a case with HP and ask them to investigate the "loss of
    connectivity".
    - Get HP to open a case with Microsoft, as my analysis is only a
    "newsgroup analysis" and I might be wrong (possible!), or the HP engineers
    might not accept it, which is fully understandable.
    - Analyse the switch logs; this might be the area where the "disconnect"
    happens, or it might give you clues for further troubleshooting.

    So basically, once you have checked your host setup and config (yet again,
    I know, just being careful), let's turn to the storage and fabrics.

    HTH,
    _Edwin.





    "Darrek" <Darrek.Kay1@nike.com> wrote in message
    news:1166639777.056877.11990@i12g2000cwa.googlegroups.com...
    > Edwin vMierlo wrote:
    > > Darrek,
    > > ...
    > >
    > > Next step is to examine the cluster.log file on the surviving node, to

    see
    > > if this gives us any more clues on what is happening.
    > > ...
    > > rgds,
    > > Edwin.

    >
    > I mentioned earlier in the thread that the reservation is being lost.
    >
    > This extract begins when the first errors start appearing in the
    > EventLog.
    > Cluster.log extract:
    > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > 09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    > 09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    > 09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34789 type 1
    > context 4098
    > 09:13.566 INFO [FM] FmpOfflineGroup,
    > Group=33c75b80-6f81-4657-a059-442224c19f1a
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    > 09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34790
    > Generation=0
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34790 type 2
    > context 15
    > 09:13.566 INFO [NM] Received update to set state for network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 WARN [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc
    > failed (node: NODE_A, network: Private LAN).
    > 09:13.566 INFO [LM] LogFlush : pLog=0x015cad90 writing the 1024 bytes
    > for active page at offset 0x00002000
    > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > 09:13.566 WARN [LM] LogFlush::LogpWrite failed, error=0x0000048f
    > 09:13.566 INFO Physical Disk <Disk K>: [DiskArb] CompletionRoutine,
    > status 1167.
    > 09:13.566 ERR Physical Disk <Disk K>: [DiskArb] CompletionRoutine:
    > reservation lost! Status 1167
    > 09:13.566 INFO [FM] FmpSetResourcePersistentState: Setting persistent
    > state for resource 26d811a7-f0a3-42e8-afd4-bbd83aa98676...
    > 09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > (Private LAN) is down.
    > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34790 type 2
    > context 15
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 1 context
    > 4098
    > 09:13.566 INFO [GUM] Thread 0x1560 UpdateLock wait on Type 1
    > 09:13.566 INFO Physical Disk <Disk I>: [DiskArb] CompletionRoutine,
    > status 1167.
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > 09:13.566 ERR Physical Disk <Disk I>: [DiskArb] CompletionRoutine:
    > reservation lost! Status 1167
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34791
    > Generation=0
    > 09:13.566 INFO [NM] Beginning phase 1 of state computation for network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34791 type 1
    > context 4098
    > 09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    > Size=152
    > 09:13.566 INFO [LM] LogCommitSize : Entry RmId=5 Size=152
    > 09:13.566 INFO [LM] LogCommitSize : Exit, returning 0x00000000
    > 09:13.566 INFO [DM] DmUpdateSetValue
    > 09:13.566 INFO [DM] Setting value of PersistentState for key
    > Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676 to 0x00000000
    > 09:13.566 INFO [API] Notification on port 14ebc8, key c00b0 of
    > type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    > 09:13.566 INFO [API] Notification on port 14eee8, key 9d628 of
    > type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    > 09:13.566 INFO [API] Notification on port 14ebc8, key cdd30 of
    > type 64. KeyName
    > 09:13.566 INFO [API] Notification on port 14eee8, key 9f768 of
    > type 64. KeyName
    > 09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    > Size=152
    > 09:13.566 INFO [LM] LogWrite : Entry TrId=34791 RmId=5 RmType = 4098
    > Size=152
    > 09:13.566 INFO [LM] LogpAppendPage : Writing 1024 bytes to disk at
    > offset 0x00002000
    > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > 09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    > 09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    > 09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34791 type 1
    > context 4098
    > 09:13.566 INFO [FM] FmpOfflineResource: SQL Server (LA997DB001) depends
    > on Disk K - DevSQLAMDF. Shut down first.
    > 09:13.566 INFO [FM] FmpOfflineResource: SQL Server Agent (LA997DB001)
    > depends on SQL Server (LA997DB001). Shut down first.
    > 09:13.566 INFO [FM] FmpRmOfflineResource: InterlockedIncrement on
    > gdwQuoBlockingResources for resource
    > 8f068a2b-1649-49db-9628-7e3bcb1c0ff6
    > 09:13.566 INFO [NM] Node is down for interface 0
    > (51173b6b-2071-4f92-a8df-605920339ac1) on network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > 09:13.566 INFO [NM] Examining connectivity data for interface 1
    > (da2e0aec-f760-49b0-bdc7-7582e3bce5dc) on network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [NM] The report from interface 0 is not valid on network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [NM] Interface 1 (da2e0aec-f760-49b0-bdc7-7582e3bce5dc)
    > is up on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [NM] Completed phase 1 of state computation for network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [NM] Unavailable=1, Failed = 0, Unreachable=0,
    > Reachable=1, Up=1 on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > 09:13.566 INFO [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa is now
    > in state 3
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    > 09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34792
    > Generation=0
    > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34792 type 2
    > context 15
    > 09:13.566 INFO [NM] Received update to set state for network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 INFO [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc is
    > up (node: NODE_A, network: Private LAN).
    > 09:13.566 INFO Physical Disk <Disk F>: [DiskArb] CompletionRoutine,
    > status 1167.
    > 09:13.566 ERR Physical Disk <Disk F>: [DiskArb] CompletionRoutine:
    > reservation lost! Status 1167
    > 09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > (Private LAN) is up.
    > 09:13.566 INFO Physical Disk <Disk Q>: [DiskArb] CompletionRoutine,
    > status 1167.
    > 09:13.566 ERR Physical Disk <Disk Q>: [DiskArb] CompletionRoutine:
    > reservation lost! Status 1167
    > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34792 type 2
    > context 15
    > 09:13.566 INFO [NM] Worker thread finished processing network
    > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > 09:13.566 ERR [RM] LostQuorumResource, cluster service terminated...
    > 09:13.801 WARN [RM] Going away, Status = 1, Shutdown = 0.
    > 09:13.801 ERR [RM] Active Resource = 000BD058
    > 09:13.801 ERR [RM] Resource State is 5, "Offline"
    > 09:13.801 ERR [RM] Resource name is SQL Server Agent (LA997DB001)
    > 09:13.801 ERR [RM] Resource type is SQL Server Agent
    > 09:13.801 INFO [RM] Posting shutdown notification.
    > 09:13.801 INFO [RM] NotifyChanges shutting down.
    > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    > (Partition1) - Received
    > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    > (Partition1) - Processed
    > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    > (Partition1) - Received
    > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    > (Partition1) - Processed
    > 09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    > for Q (Partition1) - Received
    > 09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    > for Q (Partition1) - Processed
    > 09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > received
    > 09:13.957 WARN Physical Disk <Disk Q>: [PnP] PokeDiskResource: Can't
    > open \\.\PhysicalDrive1
    > 09:13.957 INFO Physical Disk <Disk Q>: Offset
    > String
    > 09:13.957 INFO Physical Disk <Disk Q>: ================
    > ======================================
    > 09:13.957 INFO Physical Disk <Disk Q>: 0000000000008000
    > \??\Volume{95bfac1b-3def-11db-bc56-0017a43fb52d}
    > 09:13.957 INFO Physical Disk <Disk Q>: *** End of list ***
    > 09:13.957 INFO Physical Disk <Disk Q>: SetupVolGuids: Processing
    > VolGuid list
    > 09:13.957 WARN Physical Disk <Disk Q>: SetupVolGuids: Unable to assign
    > VolGuid to device, error 3221225530
    > 09:13.957 WARN Physical Disk <Disk Q>: ValidateMountPoints:
    > GetVolumeNameForVolumeMountPoint for
    > (\\?\GLOBALROOT\Device\Harddisk1\Partition1\) returned 3
    > 09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > processed
    > 09:13.972 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > received
    >
    >
    > -DK
    >




  14. #14
    Edwin vMierlo Guest

    Re: Physical Disk goes offline when cluster node reboots

    Darrek,

    I hope you are pursuing this issue with your hardware vendor at this
    point, and I hope that my view on the cluster.log was clear enough.

    In short, I think the reservation loss is a symptom of something else,
    either a physical disconnect or a logical disconnect (PRLO or TPRLO).

    In any case let me know how you get on,
    and have a merry Xmas.

    Rgds,
    Edwin.



    "Edwin vMierlo" <EdwinvMierlo@discussions.microsoft.com> wrote in message
    news:OrjTeGRJHHA.1252@TK2MSFTNGP02.phx.gbl...
    > Darrek,
    >
    > You can clearly see the reservation lost in this log, however there are

    some
    > other errors which caught my attention.
    >
    > error=0x00000006
    > This could be a "ERROR_INVALID_HANDLE" error
    >
    > error=0x0000048f
    > This could be a "ERROR_DEVICE_NOT_CONNECTED" error
    >
    > And then at the end of the extract of your log, you see
    > DBT_DEVICEREMOVECOMPLETE events
    > These are when (media) devices are removed from your configuration....

    even
    > when it is a reservation lost, you should not see those, as you are still
    > connected to storage and your disk devices on the storage are still

    present.
    >
    > So, yes the reservation is lost, but that is not the only thing, it could
    > very well be that you are actually disconnected from your storage, just

    for
    > a small period.
    >
    > What I would do, is the following :
    >
    > - Ensure that your hosts are setup (drivers firmware settings) as per
    > requirement of HP
    > - Open a case with HP, and ask them to investigate the "loss of
    > connectivity"
    > - Get HP to open a case with Microsoft, as my analysis is only a

    "newsgroup
    > analysis" and I might be wrong (possible !) or HP Engineers might not

    accept
    > this, which is fully understandable.
    > - analyse the switch logs, this might be an area where this "disconnect"
    > happens, or it might give you clues for further troubleshooting.
    >
    > So basically once you have checked your host setup and config (yet again,

    I
    > know, just being careful), lets turn to the storage and fabrics.
    >
    > HTH,
    > _Edwin.
    >
    >
    >
    >
    >
    > "Darrek" <Darrek.Kay1@nike.com> wrote in message
    > news:1166639777.056877.11990@i12g2000cwa.googlegroups.com...
    > > Edwin vMierlo wrote:
    > > > Darrek,
    > > > ...
    > > >
    > > > Next step is to examine the cluster.log file on the surviving node, to

    > see
    > > > if this gives us any more clues on what is happening.
    > > > ...
    > > > rgds,
    > > > Edwin.

    > >
    > > I mentioned earlier in the thread that the reservation is being lost.
    > >
    > > This extract begins when the first errors start appearing in the
    > > EventLog.
    > > Cluster.log extract:
    > > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > > 09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    > > 09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    > > 09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    > > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34789 type 1
    > > context 4098
    > > 09:13.566 INFO [FM] FmpOfflineGroup,
    > > Group=33c75b80-6f81-4657-a059-442224c19f1a
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    > > 09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34790
    > > Generation=0
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34790 type 2
    > > context 15
    > > 09:13.566 INFO [NM] Received update to set state for network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 WARN [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc
    > > failed (node: NODE_A, network: Private LAN).
    > > 09:13.566 INFO [LM] LogFlush : pLog=0x015cad90 writing the 1024 bytes
    > > for active page at offset 0x00002000
    > > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > > 09:13.566 WARN [LM] LogFlush::LogpWrite failed, error=0x0000048f
    > > 09:13.566 INFO Physical Disk <Disk K>: [DiskArb] CompletionRoutine,
    > > status 1167.
    > > 09:13.566 ERR Physical Disk <Disk K>: [DiskArb] CompletionRoutine:
    > > reservation lost! Status 1167
    > > 09:13.566 INFO [FM] FmpSetResourcePersistentState: Setting persistent
    > > state for resource 26d811a7-f0a3-42e8-afd4-bbd83aa98676...
    > > 09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > > (Private LAN) is down.
    > > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34790 type 2
    > > context 15
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 1 context
    > > 4098
    > > 09:13.566 INFO [GUM] Thread 0x1560 UpdateLock wait on Type 1
    > > 09:13.566 INFO Physical Disk <Disk I>: [DiskArb] CompletionRoutine,
    > > status 1167.
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > > 09:13.566 ERR Physical Disk <Disk I>: [DiskArb] CompletionRoutine:
    > > reservation lost! Status 1167
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34791
    > > Generation=0
    > > 09:13.566 INFO [NM] Beginning phase 1 of state computation for network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34791 type 1
    > > context 4098
    > > 09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    > > Size=152
    > > 09:13.566 INFO [LM] LogCommitSize : Entry RmId=5 Size=152
    > > 09:13.566 INFO [LM] LogCommitSize : Exit, returning 0x00000000
    > > 09:13.566 INFO [DM] DmUpdateSetValue
    > > 09:13.566 INFO [DM] Setting value of PersistentState for key
    > > Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676 to 0x00000000
    > > 09:13.566 INFO [API] Notification on port 14ebc8, key c00b0 of
    > > type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    > > 09:13.566 INFO [API] Notification on port 14eee8, key 9d628 of
    > > type 64. KeyName Resources\26d811a7-f0a3-42e8-afd4-bbd83aa98676
    > > 09:13.566 INFO [API] Notification on port 14ebc8, key cdd30 of
    > > type 64. KeyName
    > > 09:13.566 INFO [API] Notification on port 14eee8, key 9f768 of
    > > type 64. KeyName
    > > 09:13.566 INFO [DM] DmWriteToQuorumLog Entry Seq#=34791 Type=4098
    > > Size=152
    > > 09:13.566 INFO [LM] LogWrite : Entry TrId=34791 RmId=5 RmType = 4098
    > > Size=152
    > > 09:13.566 INFO [LM] LogpAppendPage : Writing 1024 bytes to disk at
    > > offset 0x00002000
    > > 09:13.566 INFO [Qfs] WriteFile 768 (....) 1024, status 1167 (0=>0)
    > > 09:13.566 WARN [LM] LogWrite : LogpAppendPage failed.
    > > 09:13.566 INFO [LM] LogWrite : Exit returning=0x00000000
    > > 09:13.566 WARN [DM] DmWriteToQuorumLog failed, error=0x00000006
    > > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34791 type 1
    > > context 4098
    > > 09:13.566 INFO [FM] FmpOfflineResource: SQL Server (LA997DB001) depends
    > > on Disk K - DevSQLAMDF. Shut down first.
    > > 09:13.566 INFO [FM] FmpOfflineResource: SQL Server Agent (LA997DB001)
    > > depends on SQL Server (LA997DB001). Shut down first.
    > > 09:13.566 INFO [FM] FmpRmOfflineResource: InterlockedIncrement on
    > > gdwQuoBlockingResources for resource
    > > 8f068a2b-1649-49db-9628-7e3bcb1c0ff6
    > > 09:13.566 INFO [NM] Node is down for interface 0
    > > (51173b6b-2071-4f92-a8df-605920339ac1) on network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > > 09:13.566 INFO [NM] Examining connectivity data for interface 1
    > > (da2e0aec-f760-49b0-bdc7-7582e3bce5dc) on network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [NM] The report from interface 0 is not valid on network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [NM] Interface 1 (da2e0aec-f760-49b0-bdc7-7582e3bce5dc)
    > > is up on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [NM] Completed phase 1 of state computation for network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [NM] Unavailable=1, Failed = 0, Unreachable=0,
    > > Reachable=1, Up=1 on network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > > 09:13.566 INFO [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa is now
    > > in state 3
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker waiting type 2 context 15
    > > 09:13.566 INFO [GUM] Thread 0xe28 UpdateLock wait on Type 2
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 1
    > > 09:13.566 INFO [GUM] GumpDoLockingUpdate successful, Sequence=34792
    > > Generation=0
    > > 09:13.566 INFO [GUM] GumSendUpdate: Locker dispatching seq 34792 type 2
    > > context 15
    > > 09:13.566 INFO [NM] Received update to set state for network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 INFO [NM] Interface da2e0aec-f760-49b0-bdc7-7582e3bce5dc is
    > > up (node: NODE_A, network: Private LAN).
    > > 09:13.566 INFO Physical Disk <Disk F>: [DiskArb] CompletionRoutine,
    > > status 1167.
    > > 09:13.566 ERR Physical Disk <Disk F>: [DiskArb] CompletionRoutine:
    > > reservation lost! Status 1167
    > > 09:13.566 WARN [NM] Network 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa
    > > (Private LAN) is up.
    > > 09:13.566 INFO Physical Disk <Disk Q>: [DiskArb] CompletionRoutine,
    > > status 1167.
    > > 09:13.566 ERR Physical Disk <Disk Q>: [DiskArb] CompletionRoutine:
    > > reservation lost! Status 1167
    > > 09:13.566 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
    > > 09:13.566 INFO [GUM] GumSendUpdate: completed update seq 34792 type 2
    > > context 15
    > > 09:13.566 INFO [NM] Worker thread finished processing network
    > > 83b31a0f-af60-4f8f-9f71-1fe7de02a8aa.
    > > 09:13.566 ERR [RM] LostQuorumResource, cluster service terminated...
    > > 09:13.801 WARN [RM] Going away, Status = 1, Shutdown = 0.
    > > 09:13.801 ERR [RM] Active Resource = 000BD058
    > > 09:13.801 ERR [RM] Resource State is 5, "Offline"
    > > 09:13.801 ERR [RM] Resource name is SQL Server Agent (LA997DB001)
    > > 09:13.801 ERR [RM] Resource type is SQL Server Agent
    > > 09:13.801 INFO [RM] Posting shutdown notification.
    > > 09:13.801 INFO [RM] NotifyChanges shutting down.
    > > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    > > (Partition1) - Received
    > > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for Q
    > > (Partition1) - Processed
    > > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    > > (Partition1) - Received
    > > 09:13.941 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK for F
    > > (Partition1) - Processed
    > > 09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    > > for Q (Partition1) - Received
    > > 09:13.957 INFO Physical Disk: [PnP] Event GUID_IO_VOLUME_LOCK_FAILED
    > > for Q (Partition1) - Processed
    > > 09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > > received
    > > 09:13.957 WARN Physical Disk <Disk Q>: [PnP] PokeDiskResource: Can't
    > > open \\.\PhysicalDrive1
    > > 09:13.957 INFO Physical Disk <Disk Q>: Offset
    > > String
    > > 09:13.957 INFO Physical Disk <Disk Q>: ================
    > > ======================================
    > > 09:13.957 INFO Physical Disk <Disk Q>: 0000000000008000
    > > \??\Volume{95bfac1b-3def-11db-bc56-0017a43fb52d}
    > > 09:13.957 INFO Physical Disk <Disk Q>: *** End of list ***
    > > 09:13.957 INFO Physical Disk <Disk Q>: SetupVolGuids: Processing
    > > VolGuid list
    > > 09:13.957 WARN Physical Disk <Disk Q>: SetupVolGuids: Unable to assign
    > > VolGuid to device, error 3221225530
    > > 09:13.957 WARN Physical Disk <Disk Q>: ValidateMountPoints:
    > > GetVolumeNameForVolumeMountPoint for
    > > (\\?\GLOBALROOT\Device\Harddisk1\Partition1\) returned 3
    > > 09:13.957 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > > processed
    > > 09:13.972 INFO Physical Disk: [PnP] Event DBT_DEVICEREMOVECOMPLETE
    > > received
    > >
    > >
    > > -DK
    > >

    >
    >



