Windows Server Failover Clustering

Windows Server Failover Clustering (WSFC) is a system-level, high-availability feature that allows server pairs to act as standby nodes for each other. Nodes exchange messages, known as a "heartbeat," over the LAN. Notifications are either sent by the active server to the standby node (push heartbeat) or requested periodically by the standby node from the active server (pull heartbeat). See Failover Clustering in Windows Server on Microsoft Docs for more information about WSFC.

NXLog Agent can collect logs generated by WSFC, parse them, and forward them to a destination of your choice.

WSFC logging changes in Windows Server 2008 R2

In the past, WSFC recorded cluster operations and activity in several system log files. Windows Server 2008 R2 and newer versions consolidate logging in Windows Event Log and ETW. The following table compares the log files found in previous versions with logging in Windows Server 2008 R2 and newer.

Table 1. Windows cluster logs comparison with Windows Server 2008 R2
Pre-2008 R2 log file	Log functionality	Logging in newer versions
%systemroot%\Cluster\cluster.log	Debug log file for clustering operations	Debug level events are available with ETW.
%systemroot%\system32\LogFiles\Cluster\clcfgsrv.log	Cluster installer logs	An HTML installation report is created in %systemroot%\Cluster\CreateCluster.htm.
%systemroot%\system32\LogFiles\Cluster\clusocm.log	Records cluster-related activity during an operating system upgrade	This is now part of the core Windows event logging. ETW also has two providers tracing cluster-aware updating.
%systemroot%\system32\LogFiles\Cluster\cluscomp.log	Records the activity that occurs during the compatibility check at the start of an operating system upgrade on a cluster node	This is now part of the core Windows event logging.

Logging across a Windows cluster

The current active node undertakes logging responsibility in a Windows cluster. As a result, you can install NXLog Agent on any member server, and it will process all logs from the cluster. The following diagram shows an example two-node cluster:

Windows Server Failover Clustering setup

Where:

Cluster-DC is the domain controller since it is a prerequisite for cluster members to be part of the same domain.
ClusterNode1 and ClusterNode2 are cluster members.
ClusterStorage is the iSCSI storage used by the cluster; logs are saved to a location on disk here.

If NXLog Agent is configured on ClusterNode1 and ClusterNode2 to output logs to separate files, both nodes will write the same events to their respective files. The hostname recorded in each event identifies the node where it was generated.

System event logged on a member server

2022-05-31 18:47:35 ClusterNode2.example.com INFO Keywords="9259400833873739776" EventType="INFO" SeverityValue="2" EventID="7036" SourceName="Service Control Manager" ProviderGuid="{555908D1-A6D7-4695-8E1E-26931D2012F4}" Version="0" TaskValue="0" OpcodeValue="0" RecordNumber="6634" ExecutionProcessID="652" ExecutionThreadID="2340" Channel="System" Message="The Software Protection service entered the stopped state." param1="Software Protection" param2="stopped" EventData.Binary="7300700070007300760063002F0031000000"

The advantage of such a setup is that if the active node fails, any other node running NXLog Agent continues to process logs and write them to its own output.
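To check which node an event came from when reviewing such files, the sample line above can be split into its parts. The following is a minimal sketch that assumes the layout shown in the sample (date, time, hostname, severity, then `Key="value"` pairs) and that values contain no embedded double quotes. The function name `parse_event_line` is hypothetical, and the sample string is shortened from the event above.

```python
import re

# Matches Key="value" pairs; assumes values contain no embedded double quotes.
_FIELD_RE = re.compile(r'(\w[\w.]*)="([^"]*)"')


def parse_event_line(line):
    """Split a sample-style line into header parts and Key="value" fields."""
    date, time, hostname, severity, rest = line.split(" ", 4)
    event = {
        "EventTime": f"{date} {time}",
        "Hostname": hostname,
        "Severity": severity,
    }
    # findall() returns (key, value) tuples, which dict.update() accepts.
    event.update(_FIELD_RE.findall(rest))
    return event


sample = (
    '2022-05-31 18:47:35 ClusterNode2.example.com INFO '
    'EventID="7036" SourceName="Service Control Manager" Channel="System" '
    'param1="Software Protection" param2="stopped"'
)
event = parse_event_line(sample)
print(event["Hostname"], event["EventID"])
```

All parsed values are strings. If you need structured fields downstream, converting events to JSON in NXLog Agent, as in the examples in this guide, is likely the more robust choice.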

Collecting events from Windows Event Log

WSFC logs events in the following logs:

System
Microsoft-Windows-FailoverClustering/Operational (found in Event Viewer under Application and Services Logs > Microsoft > Windows > FailoverClustering)

John Marlin, a Senior Product Manager for High Availability and Storage at Microsoft, provides a detailed List of Failover Cluster Events in Windows 2016/2019 in his Microsoft Tech Community blog post. Additionally, you can find a complete list of Failover Clustering system log events on Microsoft Docs.

FailoverClustering includes other unused log categories. They are possibly placeholders for future updates.

You can configure NXLog Agent to collect all events from the FailoverClustering source or specify a query to collect only a subset of events.

Example 1. Collecting Windows Failover cluster events

This configuration uses the im_msvistalog input module to collect the following subset of FailoverClustering events:

Event ID 1000 (UNEXPECTED_FATAL_ERROR): This event is generated when a software or hardware-related issue prevents the cluster service from starting on a node.
Event ID 1006 (NM_EVENT_MEMBERSHIP_HALT): This event is generated when the cluster service is halted on a member node due to a lack of connectivity with other cluster nodes.
Event ID 1635 (RCM_RESOURCE_FAILURE_INFO): This event is generated by the Resource Control Manager when a specific shared resource within the cluster fails to come online. It is typically seen in SQL Server and shared disk failures.
Event ID 1637 (RCM_RESOURCE_STATE_TRANSITION): This event is generated by the Resource Control Manager when there’s a state transition in any clustered resource. This event doesn’t necessarily reflect an error.

Before sending events to their destination, the configuration converts them to JSON format using the to_json() procedure of the xm_json module.

nxlog.conf

<Extension json>
    Module    xm_json
</Extension>

<Input cluster_evt>
    Module    im_msvistalog
    <QueryXML>
        <QueryList>
            <Query Id="0" Path="System">
                <Select Path="System">*[System[Provider[@Name='Microsoft-Windows-FailoverClustering']
                    and (EventID=1000 or EventID=1006)]]
                </Select>
                <Select Path="Microsoft-Windows-FailoverClustering/Operational">
                    *[System[(EventID=1635 or EventID=1637)]]
                </Select>
            </Query>
        </QueryList>
    </QueryXML>
    Exec      to_json();
</Input>
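If you need to adjust the list of event IDs regularly, you could generate the query instead of editing it by hand. The following sketch builds a QueryList with the same structure as the one above. The helper name `build_query_xml` and its tuple-based input are assumptions made for illustration and are not part of NXLog Agent.

```python
from xml.sax.saxutils import escape


def build_query_xml(selections):
    """Build a QueryList from (channel, provider, event_ids) tuples.

    Set provider to None to filter on event IDs only.
    """
    selects = []
    for channel, provider, event_ids in selections:
        id_filter = " or ".join(f"EventID={int(event_id)}" for event_id in event_ids)
        if provider:
            xpath = f"*[System[Provider[@Name='{provider}'] and ({id_filter})]]"
        else:
            xpath = f"*[System[({id_filter})]]"
        selects.append(f'    <Select Path="{escape(channel)}">{escape(xpath)}</Select>')
    # The Query element uses the first channel as its Path, as in the example above.
    first_channel = escape(selections[0][0])
    lines = ["<QueryList>", f'  <Query Id="0" Path="{first_channel}">']
    lines.extend(selects)
    lines.extend(["  </Query>", "</QueryList>"])
    return "\n".join(lines)


query_xml = build_query_xml([
    ("System", "Microsoft-Windows-FailoverClustering", [1000, 1006]),
    ("Microsoft-Windows-FailoverClustering/Operational", None, [1635, 1637]),
])
print(query_xml)
```

The printed text can be pasted between the QueryXML tags of the input instance.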

The following JSON output shows event ID 1637 after it was processed by NXLog Agent. The event was logged when the Cluster IP Address resource transitioned to the OnlinePending state.

Output sample

{
  "EventTime": "2022-05-31T19:48:40.128046-08:00",
  "Hostname": "node2.example.com",
  "Keywords": "4611686018427387904",
  "EventType": "INFO",
  "SeverityValue": 2,
  "Severity": "INFO",
  "EventID": 1637,
  "SourceName": "Microsoft-Windows-FailoverClustering",
  "ProviderGuid": "{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}",
  "Version": 1,
  "TaskValue": 3,
  "OpcodeValue": 0,
  "RecordNumber": 1450,
  "ExecutionProcessID": 2276,
  "ExecutionThreadID": 3544,
  "Channel": "Microsoft-Windows-FailoverClustering/Operational",
  "Domain": "NT AUTHORITY",
  "AccountName": "SYSTEM",
  "UserID": "S-1-5-18",
  "AccountType": "Well Known Group",
  "Message": "Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state OnlineCallIssued to state OnlinePending.",
  "Category": "Resource Control Manager",
  "Opcode": "Info",
  "ResourceName": "Cluster IP Address",
  "GroupName": "Cluster Group",
  "FromState": "OnlineCallIssued",
  "ToState": "OnlinePending",
  "FromStateValue": "133",
  "ToStateValue": "129",
  "EventReceivedTime": "2022-05-31T19:48:40.956099-08:00",
  "SourceModuleName": "cluster_evt",
  "SourceModuleType": "im_msvistalog"
}
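A downstream consumer can use the structured fields to follow resource state changes. The sketch below assumes the events are stored one JSON object per line (the sample above is pretty-printed for readability) and relies only on field names that appear in the sample. The function name `summarize_transitions` is hypothetical.

```python
import json


def summarize_transitions(lines):
    """Return (hostname, resource, from_state, to_state) for event ID 1637."""
    transitions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("EventID") != 1637:
            continue  # not a resource state transition
        transitions.append((
            event.get("Hostname"),
            event.get("ResourceName"),
            event.get("FromState"),
            event.get("ToState"),
        ))
    return transitions


sample_lines = [
    '{"EventID": 1637, "Hostname": "node2.example.com", '
    '"ResourceName": "Cluster IP Address", '
    '"FromState": "OnlineCallIssued", "ToState": "OnlinePending"}',
    '{"EventID": 7036, "Hostname": "node2.example.com"}',
]
for hostname, resource, from_state, to_state in summarize_transitions(sample_lines):
    print(f"{hostname}: {resource} {from_state} -> {to_state}")
```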

Collecting ETW logs

Event Tracing for Windows (ETW) is an advanced debugging feature provided by Microsoft that allows you to create customized event tracing using a provider-consumer model. For more information on how ETW works, refer to About Event Tracing on Microsoft Docs.

WSFC ETW providers

Microsoft documentation on ETW providers tends to be obscure; therefore, working with them often requires a lot of trial and error. The following are three typical providers that you will come across:

Microsoft-Windows-FailoverClustering-SoftwareStorageBusTarget: Shared storage is an important topic in WSFC operations. Whether you are running SCSI, iSCSI, or Fibre Channel, you may need visibility into input and output operations and the hardware state of your storage adapters. This provider helps you monitor storage operations on cluster shared storage.
Microsoft-Windows-FailoverClustering-WMIProvider: WMI is a feature-rich alternative management option for a server, and if you are managing your cluster with WMI, this is a key ETW provider.
Microsoft-Windows-ClusterAwareUpdating: Cluster-Aware Updating is a valuable optional feature that allows the administrator to update cluster members safely. The cluster enters a special maintenance mode that can suspend operations and transition any active roles on the server as needed while updating it. This ETW provider gives you granular visibility into the transitions, stages, and operations related to this feature. See Cluster-Aware Updating overview on Microsoft Docs for more information.

Follow these steps to obtain the WSFC ETW parameters for your NXLog Agent configuration:

Determine the ETW provider(s) you need. Execute the following command to list all the available providers:
```
> logman query providers
```
The following providers are related to WSFC at the time of writing:
- Microsoft-Windows-FailoverClustering {BAF908EA-3421-4CA9-9B84-6689B8C6F85F}
- Microsoft-Windows-FailoverClustering-Client {A82FDA5D-745F-409C-B0FE-18AE0678A0E0}
- Microsoft-Windows-FailoverClustering-ClusBflt-Diagnostic {923BCB94-58D2-42BE-BBA9-B1315F363838}
- Microsoft-Windows-FailoverClustering-ClusDisk-Diagnostic {7FEF367F-E76C-4592-9912-E12B36A99780}
- Microsoft-Windows-FailoverClustering-Clusport-Diagnostic {29C07D0E-E5A0-4E85-A004-1F668531CE22}
- Microsoft-Windows-FailoverClustering-CsvFlt-Diagnostic {151D3C03-E442-4C4F-AF20-BD48FF41F793}
- Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic {6A86AE90-4E9B-4186-B1D1-9CE0E02BCBC1}
- Microsoft-Windows-FailoverClustering-Manager {11B3C6B7-E06F-4191-BBB9-7099FFF55614}
- Microsoft-Windows-FailoverClustering-NetFt {C1FCCEB3-3F19-42A9-95B9-27B550FA1FBA}
- Microsoft-Windows-FailoverClustering-SoftwareStorageBusTarget {0AC0708A-A44E-49EF-AA7E-FBE8CCC603A6}
- Microsoft-Windows-FailoverClustering-WMIProvider {0461BE3C-BC15-4BAD-9A9E-51F3FADFEC75}
- Microsoft-Windows-ClusterAwareUpdating {10629806-46F2-4366-9092-53025E067E8C}
- Microsoft-Windows-ClusterAwareUpdating-Management {9B9E93D6-5569-4179-8C8A-5201CB2B9536}
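The full provider listing is long, so it can help to filter it for cluster-related entries. The sketch below takes the listing as text, for example captured from the command above, and assumes each provider appears on its own line as a name followed by a GUID in braces. The function name `cluster_providers` and the "cluster" substring filter are illustrative choices, not part of any Microsoft or NXLog tooling.

```python
import re

# A provider name without spaces, followed by a GUID in braces.
# Leading non-word characters (spaces, list markers) are ignored.
_PROVIDER_RE = re.compile(r"^\W*([\w-]+)\s+(\{[0-9A-Fa-f-]{36}\})\s*$")


def cluster_providers(listing):
    """Map cluster-related provider names to their GUIDs."""
    providers = {}
    for line in listing.splitlines():
        match = _PROVIDER_RE.match(line)
        if match and "cluster" in match.group(1).lower():
            providers[match.group(1)] = match.group(2)
    return providers


listing = """\
Provider                                 GUID
-------------------------------------------------------------------------------
Microsoft-Windows-FailoverClustering-Client {A82FDA5D-745F-409C-B0FE-18AE0678A0E0}
Microsoft-Windows-ClusterAwareUpdating   {10629806-46F2-4366-9092-53025E067E8C}
Microsoft-Windows-Kernel-Process         {22FB2CD6-0E7B-422B-A0C7-2FAD1FD0E716}
"""
for name, guid in cluster_providers(listing).items():
    print(name, guid)
```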

Use the provider GUID to query the keywords available for tracing, e.g.:

> logman query providers "{A82FDA5D-745F-409C-B0FE-18AE0678A0E0}"

You should see output similar to the following:

Provider                                 GUID
-------------------------------------------------------------------------------
Microsoft-Windows-FailoverClustering-Client {A82FDA5D-745F-409C-B0FE-18AE0678A0E0}

Value               Keyword              Description
-------------------------------------------------------------------------------
0x0000000000000001  Cluster              Cluster
0x0000000000000002  Node
0x0000000000000004  Group
0x0000000000000008  Resource
0x0000000000000010  Network
0x0000000000000020  NetInt
0x0000000000000040  Quorum
0x0000000000000080  Reconnect            Reconnect
0x0000000000000100  ResType
0x0000000000000200  Property
0x0000000000000400  RPCLog
0x8000000000000000  System               System
0x4000000000000000  Microsoft-Windows-FailoverClustering-Client/Diagnostic Microsoft-Windows-FailoverClustering-Client/Diagnostic

Value               Level                Description
-------------------------------------------------------------------------------
0x01                win:Critical         Critical
0x02                win:Error            Error
0x03                win:Warning          Warning
0x04                win:Informational    Information
0x05                win:Verbose          Verbose

PID                 Image
-------------------------------------------------------------------------------
0x000005d4          C:\Windows\System32\msdtc.exe
0x000007a4          C:\Windows\System32\spoolsv.exe
0x00000878          C:\Windows\System32\svchost.exe

The command completed successfully.

The keywords in the output above represent categories of events that may be included in the trace session. Take note of the hexadecimal values next to the keywords you’re interested in. The sum of these values is the keyword mask for the MatchAnyKeyword directive in your NXLog Agent configuration. For example, to trace Cluster (0x01), Node (0x02), Network (0x10), and Quorum (0x40), the resultant value is 0x0000000000000053.
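The keyword arithmetic can be checked with a few lines of code. This sketch combines the values with a bitwise OR, which gives the same result as adding them because each keyword occupies a distinct bit. The table is copied from the logman output above; the names `CLIENT_KEYWORDS` and `keyword_mask` are hypothetical.

```python
from functools import reduce
from operator import or_

# Keyword values for Microsoft-Windows-FailoverClustering-Client,
# copied from the logman output above.
CLIENT_KEYWORDS = {
    "Cluster": 0x001,
    "Node": 0x002,
    "Group": 0x004,
    "Resource": 0x008,
    "Network": 0x010,
    "NetInt": 0x020,
    "Quorum": 0x040,
    "Reconnect": 0x080,
    "ResType": 0x100,
    "Property": 0x200,
    "RPCLog": 0x400,
}


def keyword_mask(names, table=CLIENT_KEYWORDS):
    """Combine the selected keyword values into a single mask."""
    return reduce(or_, (table[name] for name in names), 0)


mask = keyword_mask(["Cluster", "Node", "Network", "Quorum"])
print(f"0x{mask:016X}")  # prints 0x0000000000000053
```

An unknown keyword name raises a KeyError, which guards against typos silently producing an incomplete mask.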

Example 2. Collecting Failover Clustering client trace logs

This configuration uses the im_etw input module to collect ETW events from the Microsoft-Windows-FailoverClustering-Client provider.

The Level directive specifies that the module should capture events with a severity of warning or higher, that is, warning, error, and critical events.

The MatchAnyKeyword directive specifies the sum of event categories we determined above.

Finally, the configuration converts events to JSON format using the to_json() procedure of the xm_json module.

nxlog.conf

<Extension json>
    Module             xm_json
</Extension>

<Input cluster_etw>
    Module             im_etw
    Provider           Microsoft-Windows-FailoverClustering-Client
    Level              Warning
    MatchAnyKeyword    0x00000053
    Exec               to_json();
</Input>

Output sample

{
  "SourceName": "Microsoft-Windows-FailoverClustering",
  "ProviderGuid": "{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}",
  "Channel": "Microsoft-Windows-FailoverClustering/DiagnosticVerbose ",
  "EventID": 5408,
  "Version": 0,
  "ChannelID": 18,
  "OpcodeValue": 0,
  "TaskValue": 0,
  "Keywords": "1152921504606846976",
  "EventTime": "2022-05-31T16:47:18.032921-08:00",
  "ExecutionProcessID": 4336,
  "ExecutionThreadID": 2392,
  "EventType": "DEBUG",
  "SeverityValue": 1,
  "Severity": "DEBUG",
  "Hostname": "node2",
  "Domain": "NT AUTHORITY",
  "AccountName": "SYSTEM",
  "UserID": "S-1-5-18",
  "AccountType": "Well Known Group",
  "Flags": "EXTENDED_INFO|IS_64_BIT_HEADER|PROCESSOR_INDEX (577)",
  "LogString": "[RCM] rcm::PreemptionTracker::GetPreemptedGroups()",
  "EventReceivedTime": "2022-05-31T16:47:19.035684-08:00",
  "SourceModuleName": "cluster_etw",
  "SourceModuleType": "im_etw"
}
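To reason about which events a given Level setting lets through, recall that ETW delivers events whose level value is less than or equal to the level enabled for the session. The sketch below applies that rule to the level values listed in the logman output earlier in this section. The names `ETW_LEVELS` and `captured_levels` are hypothetical, and the level names follow the Description column of that output.

```python
# ETW level values from the logman output; lower values are more severe.
ETW_LEVELS = {
    "Critical": 1,
    "Error": 2,
    "Warning": 3,
    "Information": 4,
    "Verbose": 5,
}


def captured_levels(threshold):
    """List the levels delivered when tracing is enabled at the given level."""
    limit = ETW_LEVELS[threshold]
    return [name for name, value in ETW_LEVELS.items() if value <= limit]


print(captured_levels("Warning"))  # prints ['Critical', 'Error', 'Warning']
```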

Disclaimer

While we endeavor to keep the information in our guides up to date and correct, NXLog makes no representations or warranties of any kind, express or implied about the completeness, accuracy, reliability, suitability, or availability of the content represented here. We update our screenshots and instructions on a best-effort basis.

NXLog does not guarantee that any scripts provided in our guides are error-free, secure, or suitable for any specific environment. Use of the scripts is at your own risk. In no event shall NXLog be liable for any damages or losses arising from using these scripts.

The accuracy of the content was tested and proven to work in our lab environment at the time of the last revision with the following software versions:

Microsoft Windows Server 2019 Standard
NXLog Agent version 5.4.7313

Last revision: 7 June 2022