Distribute data processing with NXLog Agent

Load balancing is the process of distributing workload across multiple servers to reduce the strain on each server, optimizing processing and availability.

Below, we provide two methods that you can use to distribute NXLog Agent load.

Using a network load balancer

Whenever possible, using a network layer load balancer is the best method to distribute connections for higher throughput. A network load balancer can distribute connections between multiple identically-configured NXLog Agent instances.

Figure 1. Processing data with multiple NXLog Agent instances

Alternatively, you can take advantage of NXLog Agent’s multi-threaded architecture to distribute data processing between multiple input module instances of the same agent.

Figure 2. Processing data with multiple NXLog Agent inputs

There are several commercial and open-source network load balancers available. We will use nginx for our example. You will need nginx Plus or nginx Open Source version 1.9.0 or later for the TCP and UDP Load Balancing feature. See the installation instructions in the nginx Admin Guide to get started.

Example 1. nginx configuration example

This nginx configuration distributes UDP and TCP connections to an NXLog Agent instance configured with multiple input instances listening on different ports. On Debian-based systems, the default location of the nginx configuration file is /etc/nginx/nginx.conf, but this may vary depending on your distribution.

The nginx load balancer routes UDP traffic per message and TCP traffic per connection. As a result, load-balancing TCP traffic works best when data sources send a similar number of events.

nginx.conf

load_module /usr/lib/nginx/modules/ngx_stream_module.so; (1)

stream {
  upstream nxlog_udp { (2)
    server 192.168.1.81:1001;
    server 192.168.1.81:1002;
  }

  upstream nxlog_tcp { (3)
    server 192.168.1.81:1003;
    server 192.168.1.81:1004;
  }

  server {
    listen 192.168.1.81:514 udp; (4)
    proxy_pass nxlog_udp;
    proxy_responses 0;
  }

  server {
    listen 192.168.1.81:1514; (5)
    proxy_pass nxlog_tcp;
  }
}

worker_rlimit_nofile 1000000;

events {
  worker_connections 20000; (6)
}

1	The nginx stream module must be loaded from the configuration or enabled with the `--with-stream` configuration parameter.
2	Lists the NXLog Agent input instances listening for UDP connections.
3	Lists the NXLog Agent input instances listening for TCP connections.
4	Specifies the IP address and port nginx will listen on for UDP connections. Configure your sources to send logs to this IP and port.
5	Specifies the IP address and port nginx will listen on for TCP connections. Configure your sources to send logs to this IP and port.
6	The maximum number of simultaneous connections allowed.

Refer to the nginx documentation for more information on the available configuration directives.
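
The difference between per-message and per-connection balancing noted in Example 1 can be illustrated with a small Python model. This is only a sketch: it assumes simple round-robin selection and hypothetical event rates, and the function names are illustrative rather than part of nginx or NXLog Agent.

```python
def per_connection_load(source_eps, backend_count):
    """Round-robin whole connections: all events of a source land on one backend."""
    load = [0] * backend_count
    for index, eps in enumerate(source_eps):
        load[index % backend_count] += eps
    return load


def per_message_load(source_eps, backend_count):
    """Round-robin individual messages: the total is spread almost evenly."""
    total = sum(source_eps)
    base, extra = divmod(total, backend_count)
    return [base + (1 if i < extra else 0) for i in range(backend_count)]


# Hypothetical events per second for four sources, balanced over two backends.
sources = [9_000, 500, 300, 200]
print(per_connection_load(sources, 2))  # [9300, 700]
print(per_message_load(sources, 2))     # [5000, 5000]
```

In this model, one busy TCP source skews the per-connection result, whereas per-message distribution stays even regardless of how events are spread across sources.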

Example 2. Receiving data on multiple inputs

This NXLog Agent configuration defines two identical instances of the im_udp input module listening for connections on different ports.

nxlog.conf

<Extension syslog>
    Module        xm_syslog
</Extension>

<Extension json>
    Module        xm_json
</Extension>

<Input udp_1>
    Module        im_udp
    ListenAddr    0.0.0.0:1001
    <Exec> (1)
        parse_syslog(); (2)
        to_json(); (3)
    </Exec>
</Input>

<Input udp_2>
    Module        im_udp
    ListenAddr    0.0.0.0:1002
    <Exec>
        parse_syslog();
        to_json();
    </Exec>
</Input>

<Output file>
    Module        om_file
    File          '/path/to/output/file'
</Output>

<Route r1>
    Path          udp_1, udp_2 => file (4)
</Route>

1	Exec block for heavy parsing.
2	Parses syslog messages into structured data using the parse_syslog() procedure of the xm_syslog module.
3	Converts the event to JSON using the to_json() procedure of the xm_json module.
4	Routes data from all input instances to a single output.
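
To check that events flow through the load balancer to the agent, you can send a few test messages from a script. The following Python sketch uses only the standard library and is not part of NXLog Agent. The hostname, tag, and priority value are placeholder assumptions, and the message layout follows the BSD syslog (RFC 3164) convention that parse_syslog() is expected to accept.

```python
import socket
from datetime import datetime


def format_syslog(message, hostname="testhost", tag="lbtest", now=None):
    """Build a BSD syslog (RFC 3164) line with priority 14 (user.info)."""
    now = now or datetime.now()
    # RFC 3164 pads single-digit days with a space, not a zero.
    timestamp = f"{now:%b} {now.day:2d} {now:%H:%M:%S}"
    return f"<14>{timestamp} {hostname} {tag}: {message}"


def send_udp(host, port, line):
    """Send one syslog message as a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


def send_tcp(host, port, lines):
    """Send newline-terminated syslog messages over one TCP connection."""
    with socket.create_connection((host, port), timeout=5) as sock:
        for line in lines:
            sock.sendall(line.encode("utf-8") + b"\n")
```

For example, calling send_udp("192.168.1.81", 514, format_syslog("test message")) would send one datagram to the UDP listener defined in the nginx configuration above. Adjust the address and port to match your environment.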

Using NXLog Agent modules as threads

If deploying a network load balancer is not an option, you can implement parallelization within the NXLog Agent configuration. There are several options depending on your use case.

The first method is to implement a selector function in the input instance to reroute individual records to multiple identical output instances. This way, any intensive processing is distributed between different threads.

Example 3. Routing individual messages

This configuration uses the im_tcp input module to listen for connections on port 1514. It then reroutes messages to three identical output instances, distributing the load between them.

Flow control is explicitly disabled when rerouting messages. As a result, NXLog Agent drops messages if a target module’s queue is full.

nxlog.conf

<Extension syslog>
    Module       xm_syslog
</Extension>

<Extension json>
    Module       xm_json
</Extension>

<Input tcp_routing>
    Module        im_tcp
    ListenAddr    0.0.0.0:1514
    <Exec>
        if (get_var("linecounter") == undef) set_var("linecounter", 0); (1)
        set_var("linecounter", get_var("linecounter") + 1); (2)

        if get_var("linecounter") == 2 reroute("2"); (3)

        if get_var("linecounter") == 3 {
            reroute("3");
            set_var("linecounter", 0); (4)
        }
        log_info(get_var("linecounter")); (5)
    </Exec>
</Input>

<Input null>
    Module       im_null
</Input>

<Output file_1>
    Module       om_file
    File         '/path/to/output/file_1'
    <Exec> (6)
        parse_syslog(); (7)
        to_json(); (8)
    </Exec>
</Output>

<Output file_2>
    Module       om_file
    File         '/path/to/output/file_2'
    <Exec>
        parse_syslog();
        to_json();
    </Exec>
</Output>

<Output file_3>
    Module       om_file
    File         '/path/to/output/file_3'
    <Exec>
        parse_syslog();
        to_json();
    </Exec>
</Output>

<Route 1>
    Path         tcp_routing => file_1
</Route>

<Route 2>
    Path         null => file_2
</Route>

<Route 3>
    Path         null => file_3
</Route>

1	Initializes a module variable as a counter, using the get_var() function to check whether it is defined and the set_var() procedure to set it to 0. When the counter equals 1, the message is not rerouted and continues along route 1.
2	Increases the counter by 1.
3	Reroutes the message to the relevant output module with the reroute() procedure.
4	Resets the counter once it reaches the maximum number of output instances.
5	The log_info() procedure is used to write the counter’s value to the log file for testing purposes only.
6	Exec block for heavy parsing.
7	Parses syslog messages into structured data using the parse_syslog() procedure of the xm_syslog module.
8	Converts the event to JSON using the to_json() procedure of the xm_json module.
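
The rotation implemented by the counter in Example 3 can be modeled outside NXLog Agent to see which route each message takes. The following Python sketch is only an illustration of that logic; the function name and its parameters are hypothetical.

```python
def assign_routes(message_count, route_count=3):
    """Mimic the counter in the Exec block and return the route of each message."""
    routes = []
    counter = 0  # corresponds to the "linecounter" module variable
    for _ in range(message_count):
        counter += 1
        # Counter 1 stays on route "1"; higher values are rerouted by name.
        routes.append(str(counter))
        if counter == route_count:
            counter = 0  # reset after the last output instance
    return routes


print(assign_routes(7))  # ['1', '2', '3', '1', '2', '3', '1']
```

Each output instance receives every third message, so the parsing work is spread evenly as long as messages are of similar size and complexity.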

Another option when receiving data over the network is to route connections to multiple identical input instances by enabling the ReusePort directive of the im_tcp or im_udp modules, which allows multiple threads to receive data on the same port. Routing works best when many simultaneous connections deliver approximately the same number of records; otherwise, connection distribution may be skewed and not yield any benefits.

Let’s consider an example where each of four input threads can handle 7,000 events per second (EPS) with parsing enabled. Three agents send a cumulative 22,000 EPS.

Figure 3. Routing connections to multiple NXLog Agent inputs

One might conclude that the combined throughput of the four threads (28,000 EPS) is sufficient to handle the influx. However, each source’s connection is associated with a single input thread. Therefore, if source A delivers 20,000 EPS while sources B and C deliver 1,000 EPS each, throughput will not scale as expected. Instead, it will equal the saturation throughput of Input 1 (7,000 EPS) plus 2 × 1,000 EPS, for a total of 9,000 EPS. The remaining 13,000 EPS (22,000 - 9,000) builds up as backpressure and causes significant delivery delays.
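
The arithmetic above can be reproduced with a short Python sketch. It assumes the simplified model used in this example: each source is tied to exactly one input thread, no two sources share a thread, and every thread saturates at the same fixed rate. The function and variable names are illustrative only.

```python
def effective_throughput(source_eps, thread_capacity_eps):
    """Return (processed, backlog) in EPS when each source is tied to one thread.

    Simplified model: one source per thread, so each source is capped at the
    saturation throughput of the thread that owns its connection.
    """
    processed = sum(min(eps, thread_capacity_eps) for eps in source_eps)
    backlog = sum(source_eps) - processed
    return processed, backlog


# Sources A, B, and C from the example; each input thread saturates at 7,000 EPS.
sources = [20_000, 1_000, 1_000]
processed, backlog = effective_throughput(sources, 7_000)
print(processed, backlog)  # 9000 13000
```

With evenly distributed sources, for example three sources at roughly 7,300 EPS each, the same model yields 21,000 EPS processed, which shows why similar per-connection rates matter.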

Distributing connections between threads is handled by the operating system. In our tests, we noticed poor results with few connections.

Example 4. Routing TCP connections

This configuration defines two identical im_tcp input module instances listening for connections on port 1514. The ReusePort directive allows both instances to receive data on the same port at the same time.

nxlog.conf

<Extension syslog>
    Module       xm_syslog
</Extension>

<Extension json>
    Module       xm_json
</Extension>

<Input tcp_1>
    Module        im_tcp
    ListenAddr    0.0.0.0:1514
    ReusePort     TRUE
    <Exec> (1)
        parse_syslog(); (2)
        to_json(); (3)
    </Exec>
</Input>

<Input tcp_2>
    Module        im_tcp
    ListenAddr    0.0.0.0:1514
    ReusePort     TRUE
    <Exec>
        parse_syslog();
        to_json();
    </Exec>
</Input>

<Output file>
    Module        om_file
    File          '/path/to/output/file'
</Output>

<Route 1>
    Path          tcp_1, tcp_2 => file (4)
</Route>

1	Exec block for heavy parsing.
2	Parses syslog messages into structured data using the parse_syslog() procedure of the xm_syslog module.
3	Converts the event to JSON using the to_json() procedure of the xm_json module.
4	Routes data from all input instances to a single output.