Windows Server Failover Clustering
Windows Server Failover Clustering (WSFC) is a system-level, high-availability feature that allows server pairs to act as standby nodes for each other. Nodes exchange periodic messages, known as a "heartbeat," over the LAN. Notifications are either sent by the active server to the standby node (push heartbeat) or requested periodically from the standby node by the active server (pulled heartbeat). See Failover Clustering in Windows Server on Microsoft Docs for more information about WSFC.
NXLog can collect logs generated by WSFC, parse them, and forward them to a destination of your choice.
WSFC logging changes in Windows Server 2008 R2
In versions before Windows Server 2008 R2, WSFC recorded cluster operations and activity in several system log files. Windows Server 2008 R2 and newer versions consolidate logging in Windows Event Log and ETW. The following table compares the log files found in previous versions with logging in Windows Server 2008 R2 and newer.
Pre-2008 R2 log file | Log functionality | Logging in newer versions |
---|---|---|
%systemroot%\Cluster\cluster.log | Debug log file for clustering operations | Debug level events are available with ETW. |
%systemroot%\system32\LogFiles\Cluster\clcfgsrv.log | Cluster installer logs | An HTML installation report is created in %systemroot%\Cluster\CreateCluster.htm. |
%systemroot%\system32\LogFiles\Cluster\clusocm.log | Records cluster-related activity during an operating system upgrade | This is now part of the core Windows event logging. ETW also has two providers tracing cluster-aware updating. |
%systemroot%\system32\LogFiles\Cluster\cluscomp.log | Records the activity that occurs during the compatibility check at the start of an operating system upgrade on a cluster node | This is now part of the core Windows event logging. |
Logging across a Windows cluster
The current active node undertakes logging responsibility in a Windows cluster. As a result, you can install NXLog on any member server, and it will process all logs from the cluster. For example, consider the two-node cluster shown in the following diagram:
Where:
- Cluster-DC is the domain controller; cluster members must belong to the same domain.
- ClusterNode1 and ClusterNode2 are cluster members.
- ClusterStorage is the iSCSI storage used by the cluster; logs are saved to a location on disk here.
If NXLog is configured on ClusterNode1 and ClusterNode2 to output logs to separate files, they will write the same events in their respective files. The source of each event identifies the node where it was generated.
2022-05-31 18:47:35 ClusterNode2.example.com INFO Keywords="9259400833873739776" EventType="INFO" SeverityValue="2" EventID="7036" SourceName="Service Control Manager" ProviderGuid="{555908D1-A6D7-4695-8E1E-26931D2012F4}" Version="0" TaskValue="0" OpcodeValue="0" RecordNumber="6634" ExecutionProcessID="652" ExecutionThreadID="2340" Channel="System" Message="The Software Protection service entered the stopped state." param1="Software Protection" param2="stopped" EventData.Binary="7300700070007300760063002F0031000000"
The advantage of such a setup is that if the active node fails, any other node running NXLog will continue to process logs and write them to its own output.
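When comparing the output files of several nodes, it can help to pull the originating node out of each record. The following is a minimal Python sketch, assuming records use the layout of the sample line above (date, time, hostname, severity, then `key="value"` pairs). The `parse_line` helper and the shortened `LINE` sample are illustrative and not part of NXLog; values containing escaped quotes are not handled.

```python
import re

# Shortened copy of the sample record shown above (assumed layout:
# date, time, hostname, severity, then key="value" pairs).
LINE = (
    '2022-05-31 18:47:35 ClusterNode2.example.com INFO '
    'EventID="7036" SourceName="Service Control Manager" Channel="System" '
    'Message="The Software Protection service entered the stopped state."'
)

# A key is made of word characters and dots (e.g. EventData.Binary);
# a value is anything up to the next double quote.
PAIR = re.compile(r'([\w.]+)="([^"]*)"')


def parse_line(line):
    """Split one record into a dict of its fields (hypothetical helper)."""
    date, time, host, severity, rest = line.split(" ", 4)
    fields = dict(PAIR.findall(rest))
    fields.update(EventTime=f"{date} {time}", Hostname=host, Severity=severity)
    return fields


event = parse_line(LINE)
print(event["Hostname"], event["EventID"])
```

The `Hostname` field then tells you which node generated the event, regardless of which node's file the record was read from.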
Collecting events from Windows Event Log
WSFC writes events to the following logs:
- System
- Microsoft-Windows-FailoverClustering/Operational (found in Event Viewer under Application and Services Logs > Microsoft > Windows > FailoverClustering)
John Marlin, a Senior Product Manager for High Availability and Storage at Microsoft, provides a detailed List of Failover Cluster Events in Windows 2016/2019 in his Microsoft Tech Community blog post. Additionally, you can find a complete list of Failover Clustering system log events on Microsoft Docs.
FailoverClustering includes other log categories that are unused; they are possibly placeholders for future updates.
You can configure NXLog to collect all events from the FailoverClustering source or specify a query to collect only a subset of events.
This configuration uses the im_msvistalog input module to collect the following subset of FailoverClustering events:
- Event ID 1000 (UNEXPECTED_FATAL_ERROR)
  This event is generated when a software or hardware-related issue prevents the cluster service from starting on a node.
- Event ID 1006 (NM_EVENT_MEMBERSHIP_HALT)
  This event is generated when the cluster service is halted on a member node due to a lack of connectivity with other cluster nodes.
- Event ID 1635 (RCM_RESOURCE_FAILURE_INFO)
  This event is generated by the Resource Control Manager when a specific shared resource within the cluster fails to come online. It is typically seen in SQL Server and shared disk failures.
- Event ID 1637 (RCM_RESOURCE_STATE_TRANSITION)
  This event is generated by the Resource Control Manager when there’s a state transition in any clustered resource. This event doesn’t necessarily reflect an error.
Before sending events to their destination, the configuration converts them to JSON format using the to_json() procedure of the xm_json module.
<Extension json>
    Module    xm_json
</Extension>
<Input cluster_evt>
    Module    im_msvistalog
    <QueryXML>
        <QueryList>
            <Query Id="0" Path="System">
                <Select Path="System">*[System[Provider[@Name='Microsoft-Windows-FailoverClustering']
                    and (EventID=1000 or EventID=1006)]]
                </Select>
                <Select Path="Microsoft-Windows-FailoverClustering/Operational">
                    *[System[(EventID=1635 or EventID=1637)]]
                </Select>
            </Query>
        </QueryList>
    </QueryXML>
    Exec      to_json();
</Input>
The following JSON output shows event ID 1637 after NXLog processed it. The event was logged when an IP address resource transitioned to the OnlinePending state.
{
  "EventTime": "2022-05-31T19:48:40.128046-08:00",
  "Hostname": "node2.example.com",
  "Keywords": "4611686018427387904",
  "EventType": "INFO",
  "SeverityValue": 2,
  "Severity": "INFO",
  "EventID": 1637,
  "SourceName": "Microsoft-Windows-FailoverClustering",
  "ProviderGuid": "{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}",
  "Version": 1,
  "TaskValue": 3,
  "OpcodeValue": 0,
  "RecordNumber": 1450,
  "ExecutionProcessID": 2276,
  "ExecutionThreadID": 3544,
  "Channel": "Microsoft-Windows-FailoverClustering/Operational",
  "Domain": "NT AUTHORITY",
  "AccountName": "SYSTEM",
  "UserID": "S-1-5-18",
  "AccountType": "Well Known Group",
  "Message": "Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state OnlineCallIssued to state OnlinePending.",
  "Category": "Resource Control Manager",
  "Opcode": "Info",
  "ResourceName": "Cluster IP Address",
  "GroupName": "Cluster Group",
  "FromState": "OnlineCallIssued",
  "ToState": "OnlinePending",
  "FromStateValue": "133",
  "ToStateValue": "129",
  "EventReceivedTime": "2022-05-31T19:48:40.956099-08:00",
  "SourceModuleName": "cluster_evt",
  "SourceModuleType": "im_msvistalog"
}
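Once events arrive in JSON, a downstream consumer can filter on the parsed fields. The following Python sketch is one possible approach, assuming NXLog writes one JSON object per line and that records carry the `EventID`, `Hostname`, `ResourceName`, `FromState`, and `ToState` fields seen in the sample above. The `transitions_to` function and the abbreviated `sample` record are illustrative only.

```python
import json


def transitions_to(lines, target_state="OnlinePending"):
    """Yield (hostname, resource, from_state) for each event ID 1637
    record whose ToState equals target_state (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        event = json.loads(line)
        if event.get("EventID") == 1637 and event.get("ToState") == target_state:
            yield (
                event.get("Hostname"),
                event.get("ResourceName"),
                event.get("FromState"),
            )


# Abbreviated copy of the sample event above, serialized on one line.
sample = json.dumps({
    "Hostname": "node2.example.com",
    "EventID": 1637,
    "ResourceName": "Cluster IP Address",
    "FromState": "OnlineCallIssued",
    "ToState": "OnlinePending",
})

matches = list(transitions_to([sample, ""]))
print(matches)
```

In practice, `lines` could be an open file handle on the NXLog output file, since iterating over a file yields one line at a time.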
Collecting ETW logs
Event Tracing for Windows (ETW) is an advanced debugging feature provided by Microsoft that allows you to create customized event tracing using a provider-consumer model. For more information on how ETW works, refer to About Event Tracing on Microsoft Docs.
WSFC ETW providers
Microsoft documentation on ETW providers tends to be obscure; therefore, working with them often requires a lot of trial and error. The following are three typical providers that you will come across:
- Microsoft-Windows-FailoverClustering-SoftwareStorageBusTarget
  Shared storage is an important topic in WSFC operations. Whether you are running SCSI, iSCSI, or Fibre Channel, you may need visibility into input and output operations and the hardware state of your storage adapters. This provider will help you monitor storage operations on cluster shared storage.
- Microsoft-Windows-FailoverClustering-WMIProvider
  WMI is a feature-rich alternative management option for a server, and if you are managing your cluster with WMI, this is a key ETW provider.
- Microsoft-Windows-ClusterAwareUpdating
  Cluster-Aware Updating is a valuable optional feature that allows the administrator to update cluster members safely. The cluster enters a special maintenance mode that can suspend operations and transition any active roles on the server as needed while updating it. This ETW provider gives you granular visibility into the transitions, stages, and operations related to this feature. See Cluster-Aware Updating overview on Microsoft Docs for more information.
Follow these steps to obtain the WSFC ETW parameters for your NXLog configuration:
- Determine the ETW provider(s) you need. Execute the following command to list all the available providers:
> logman query providers
The following providers are related to WSFC at the time of writing:
  - Microsoft-Windows-FailoverClustering {BAF908EA-3421-4CA9-9B84-6689B8C6F85F}
  - Microsoft-Windows-FailoverClustering-Client {A82FDA5D-745F-409C-B0FE-18AE0678A0E0}
  - Microsoft-Windows-FailoverClustering-ClusBflt-Diagnostic {923BCB94-58D2-42BE-BBA9-B1315F363838}
  - Microsoft-Windows-FailoverClustering-ClusDisk-Diagnostic {7FEF367F-E76C-4592-9912-E12B36A99780}
  - Microsoft-Windows-FailoverClustering-Clusport-Diagnostic {29C07D0E-E5A0-4E85-A004-1F668531CE22}
  - Microsoft-Windows-FailoverClustering-CsvFlt-Diagnostic {151D3C03-E442-4C4F-AF20-BD48FF41F793}
  - Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic {6A86AE90-4E9B-4186-B1D1-9CE0E02BCBC1}
  - Microsoft-Windows-FailoverClustering-Manager {11B3C6B7-E06F-4191-BBB9-7099FFF55614}
  - Microsoft-Windows-FailoverClustering-NetFt {C1FCCEB3-3F19-42A9-95B9-27B550FA1FBA}
  - Microsoft-Windows-FailoverClustering-SoftwareStorageBusTarget {0AC0708A-A44E-49EF-AA7E-FBE8CCC603A6}
  - Microsoft-Windows-FailoverClustering-WMIProvider {0461BE3C-BC15-4BAD-9A9E-51F3FADFEC75}
  - Microsoft-Windows-ClusterAwareUpdating {10629806-46F2-4366-9092-53025E067E8C}
  - Microsoft-Windows-ClusterAwareUpdating-Management {9B9E93D6-5569-4179-8C8A-5201CB2B9536}
- Use the provider GUID to query the keywords available for tracing, e.g.:
> logman query providers "{A82FDA5D-745F-409C-B0FE-18AE0678A0E0}"
You should see output similar to the following:
Provider                                      GUID
-------------------------------------------------------------------------------
Microsoft-Windows-FailoverClustering-Client   {A82FDA5D-745F-409C-B0FE-18AE0678A0E0}

Value               Keyword              Description
-------------------------------------------------------------------------------
0x0000000000000001  Cluster              Cluster
0x0000000000000002  Node
0x0000000000000004  Group
0x0000000000000008  Resource
0x0000000000000010  Network
0x0000000000000020  NetInt
0x0000000000000040  Quorum
0x0000000000000080  Reconnect            Reconnect
0x0000000000000100  ResType
0x0000000000000200  Property
0x0000000000000400  RPCLog
0x8000000000000000  System               System
0x4000000000000000  Microsoft-Windows-FailoverClustering-Client/Diagnostic Microsoft-Windows-FailoverClustering-Client/Diagnostic

Value               Level                Description
-------------------------------------------------------------------------------
0x01                win:Critical         Critical
0x02                win:Error            Error
0x03                win:Warning          Warning
0x04                win:Informational    Information
0x05                win:Verbose          Verbose

PID                 Image
-------------------------------------------------------------------------------
0x000005d4          C:\Windows\System32\msdtc.exe
0x000007a4          C:\Windows\System32\spoolsv.exe
0x00000878          C:\Windows\System32\svchost.exe

The command completed successfully.
- The keywords in the output above represent categories of events that may be included in the trace session. Take note of the hexadecimal values next to the keywords you’re interested in. The sum of these values is the keyword mask for your NXLog configuration. For example, to trace Cluster (0x01), Node (0x02), Network (0x10), and Quorum (0x40), the resultant value is 0x0000000000000053.
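The keyword arithmetic in the last step can be checked with a few lines of Python. This is a minimal sketch: the `CLIENT_KEYWORDS` table copies the values from the `logman` output shown above for the Microsoft-Windows-FailoverClustering-Client provider, and `keyword_mask` is a hypothetical helper name. Because each keyword occupies a distinct bit, combining them with bitwise OR gives the same result as adding them.

```python
# Keyword values copied from the logman output above for the
# Microsoft-Windows-FailoverClustering-Client provider.
CLIENT_KEYWORDS = {
    "Cluster": 0x0001,
    "Node": 0x0002,
    "Group": 0x0004,
    "Resource": 0x0008,
    "Network": 0x0010,
    "NetInt": 0x0020,
    "Quorum": 0x0040,
    "Reconnect": 0x0080,
    "ResType": 0x0100,
    "Property": 0x0200,
    "RPCLog": 0x0400,
}


def keyword_mask(names):
    """Combine the selected keyword bits into one mask (hypothetical helper)."""
    mask = 0
    for name in names:
        mask |= CLIENT_KEYWORDS[name]  # raises KeyError on an unknown keyword
    return mask


mask = keyword_mask(["Cluster", "Node", "Network", "Quorum"])
print(f"0x{mask:016X}")  # prints 0x0000000000000053
```

Using OR rather than plain addition also means that listing a keyword twice by mistake does not change the result.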
This configuration uses the im_etw input module to collect ETW events from the Microsoft-Windows-FailoverClustering-Client provider.
The Level directive specifies that it should capture events with a severity of Warning or higher.
The MatchAnyKeyword directive specifies the keyword mask determined above.
Finally, the configuration converts events to JSON format using the to_json() procedure of the xm_json module.
<Extension json>
    Module             xm_json
</Extension>
<Input cluster_etw>
    Module             im_etw
    Provider           Microsoft-Windows-FailoverClustering-Client
    Level              Warning
    MatchAnyKeyword    0x00000053
    Exec               to_json();
</Input>
{
  "SourceName": "Microsoft-Windows-FailoverClustering",
  "ProviderGuid": "{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}",
  "Channel": "Microsoft-Windows-FailoverClustering/DiagnosticVerbose ",
  "EventID": 5408,
  "Version": 0,
  "ChannelID": 18,
  "OpcodeValue": 0,
  "TaskValue": 0,
  "Keywords": "1152921504606846976",
  "EventTime": "2022-05-31T16:47:18.032921-08:00",
  "ExecutionProcessID": 4336,
  "ExecutionThreadID": 2392,
  "EventType": "DEBUG",
  "SeverityValue": 1,
  "Severity": "DEBUG",
  "Hostname": "node2",
  "Domain": "NT AUTHORITY",
  "AccountName": "SYSTEM",
  "UserID": "S-1-5-18",
  "AccountType": "Well Known Group",
  "Flags": "EXTENDED_INFO|IS_64_BIT_HEADER|PROCESSOR_INDEX (577)",
  "LogString": "[RCM] rcm::PreemptionTracker::GetPreemptedGroups()",
  "EventReceivedTime": "2022-05-31T16:47:19.035684-08:00",
  "SourceModuleName": "cluster_etw",
  "SourceModuleType": "im_etw"
}