2023-05-14

Fluentd buffer in multi-tenant scenarios

What are the best practices for setting up the fluentd buffer in a multi-tenant scenario?

I have used the fluent-operator to set up a multi-tenant fluentbit and fluentd logging solution, where fluentbit collects and enriches the logs and fluentd aggregates and ships them to AWS OpenSearch.
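
Fluent Bit forwards everything to the fluentd aggregator over the forward protocol on port 24224 (the forward source you can see in the generated config further down). Conceptually, the Fluent Bit side looks something like this; the host name is purely illustrative, the operator points it at the fluentd service it manages:

[OUTPUT]
    Name    forward
    Match   *
    # illustrative service name; in practice the operator fills this in
    Host    fluentd.fluent.svc
    Port    24224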

The operator uses a label router to separate logs from different tenants.

In my cluster, every time a new application is deployed, its Helm chart applies the following resources:


apiVersion: fluentd.fluent.io/v1alpha1
kind: FluentdConfig
metadata:
  name: -fluentd-config
  labels:
    config.fluentd.fluent.io/enabled: "true"
spec:
  clusterFilterSelector:
    matchLabels:
      filter.fluentd.fluent.io/enabled: "true"
      filter.fluentd.fluent.io/tenant: core
  outputSelector:
    matchLabels:
      output.fluentd.fluent.io/enabled: "true"
      output.fluentd.fluent.io/tenant: 
  watchedLabels:
    

---
apiVersion: fluentd.fluent.io/v1alpha1
kind: Output
metadata:
  name: fluentd-output-
  labels:
    output.fluentd.fluent.io/tenant: 
    output.fluentd.fluent.io/enabled: "true"
spec:
  outputs:
    - customPlugin:
        config: |
          <match **>
            @type opensearch
            host "${FLUENT_OPENSEARCH_HOST}"
            port 443
            logstash_format true
            logstash_prefix logs-
            scheme https
            log_os_400_reason true
            @log_level ${FLUENTD_OUTPUT_LOGLEVEL:=info}
            <buffer>
               ...
            </buffer>
            <endpoint>
              url "https://${FLUENT_OPENSEARCH_HOST}"
              region "${FLUENT_OPENSEARCH_REGION}"
              assume_role_arn "#{ENV['AWS_ROLE_ARN']}"
              assume_role_web_identity_token_file "#{ENV['AWS_WEB_IDENTITY_TOKEN_FILE']}"
            </endpoint>
          </match>
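
(The Helm template values were stripped in the paste above. Filled in for the hypothetical application app-name in namespace app-namespace, the names that show up in the generated config below, the FluentdConfig would render roughly like this:)

apiVersion: fluentd.fluent.io/v1alpha1
kind: FluentdConfig
metadata:
  # illustrative values only; the real names come from the Helm release
  name: app-name-fluentd-config
  labels:
    config.fluentd.fluent.io/enabled: "true"
spec:
  clusterFilterSelector:
    matchLabels:
      filter.fluentd.fluent.io/enabled: "true"
      filter.fluentd.fluent.io/tenant: core
  outputSelector:
    matchLabels:
      output.fluentd.fluent.io/enabled: "true"
      output.fluentd.fluent.io/tenant: app-name
  watchedLabels:
    app: app-name

The matching Output resource gets the same output.fluentd.fluent.io/tenant: app-name label and a logstash_prefix of logs-app-name.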

So, for every new application a new <match> section is created, and with it a new buffer configuration for that application:

<ROOT>
  <system>
    rpc_endpoint "127.0.0.1:24444"
    log_level info
    workers 1
  </system>
  <source>
    @type forward
    bind "0.0.0.0"
    port 24224
  </source>
  <match **>
    @id main
    @type label_router
    <route>
      @label "@c9ce9b26357ba0a190e4d01fbf7ef506"
      <match>
        labels app:app-name
        namespaces app-namespace
      </match>
    </route>
  </match>
  <label @33b5ad9c15abdec648ede544d80f80f5>
    <filter **>
      @type dedot
      de_dot_separator "_"
      de_dot_nested true
    </filter>
    <match **>
      @type opensearch
      host "XXXX.us-west-2.es.amazonaws.com"
      port 443
      logstash_format true
      logstash_prefix "logs-XXX"
      scheme https
      log_os_400_reason true
      @log_level "info"
      <buffer>
         ...
      </buffer>
      <endpoint>
        url https://XXXX.us-west-2.es.amazonaws.com
        region "us-west-2"
        assume_role_arn "arn:aws:iam::XXX:role/raas/fluentd-os-access-us-west-2"
        assume_role_web_identity_token_file "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
      </endpoint>
    </match>
  </label>
  <match **>
    @type null
    @id main-no-output
  </match>
  <label @FLUENT_LOG>
    <match fluent.*>
      @type null
      @id main-fluentd-log
    </match>
  </label>
</ROOT>

To sum it up, I end up with a separate buffer for every pod/application that enables log collection in its Helm chart.

If I had to configure a single buffer for the whole cluster, I would use something like this:

            <buffer>
              @type memory
              flush_mode interval
              flush_interval ${FLUENTD_BUFFER_FLUSH_INTERVAL:=60s}
              flush_thread_count 1
              retry_type exponential_backoff
              retry_max_times 10
              retry_wait 1s
              retry_max_interval 60s
              chunk_limit_size 8MB
              total_limit_size 512MB
              overflow_action throw_exception
              compress gzip
            </buffer>

This buffer configuration is based on the default values from fluentd's documentation.

But this obviously does not scale. I cannot have dozens or maybe even hundreds of applications/pods each with the buffer configuration above, because it would exhaust Fluentd's resources: with a total_limit_size of 512MB per memory buffer, 100 applications could in the worst case claim around 50GB of memory for buffering alone.

How can I define a base "micro-buffer" that would be enough for most pods/applications?
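
For illustration, the kind of scaled-down per-application buffer I have in mind is sketched below; the sizes are pure guesses, which is exactly the part I am unsure about:

            <buffer>
              @type memory
              flush_mode interval
              flush_interval 60s
              flush_thread_count 1
              retry_type exponential_backoff
              retry_max_times 10
              retry_wait 1s
              retry_max_interval 60s
              # guessed values: much smaller chunks and a much smaller cap per application
              chunk_limit_size 1MB
              total_limit_size 32MB
              overflow_action throw_exception
              compress gzip
            </buffer>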


