Use Case: Collecting AWS S3 Logs with LogScale & FluentD

This document provides a cookbook example of how to collect logfiles from AWS S3 and ship that data to LogScale. There are lots of options for how to do this, and this particular example is based on AWS CloudTrail data.

This recipe can be used in any situation where log data is being placed into an AWS S3 bucket and that data needs to be shipped into LogScale with minimal latency. It does not address the scenario of collecting historical data from AWS S3.

It makes use of AWS SQS (Simple Queue Service) to provide high scalability and low latency for collection. It does not rely on scanning the AWS S3 bucket (which is why it does not support historical ingestion), because bucket scanning does not work well with S3 at large scale.

The scenario documented here is based on the combination of two FluentD plugins: the AWS S3 input plugin and the Elasticsearch output plugin.

Why FluentD

FluentD offers many plugins for input and output, and has proven to be a reliable log shipper in many modern deployments. It is used in this example because its configuration is clear and understandable, and it is relatively straightforward to deploy and test.

Prerequisites

The following assumes that you have a working installation of FluentD on a server. This example was built on CentOS (CentOS Linux release 8.1.1911) using the gem-based FluentD installation.

Configure AWS

Assuming that you have an AWS S3 bucket with log data already flowing to it, but no SQS queues configured, you will want to complete the following steps.

This approach uses a dedicated user account with minimal permissions and authenticates with access keys. There are alternative ways to configure the IAM settings if you wish; this is provided as an example.

All items in this example are configured in the same region, which is a requirement for some of the components. The recommendation is to use the region closest to your LogScale or FluentD instances, although this is not critical.

Create SQS Queue

  • In the AWS Console go to Services → Application Integration → Simple Queue Service

  • Choose Create New Queue → Standard Queue

  • Give the queue a name, for example, humio-cloudtrail-queue

  • Choose Quick-Create Queue. (You may want to tune specific queue parameters depending on your data volume and environment; that is beyond the scope of this document.)

Note the ARN, as you will need this later.
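If you prefer the command line, the queue can also be created and its ARN retrieved with the AWS CLI. This is a minimal sketch, assuming the CLI is installed and configured with sufficient permissions and that you keep the example queue name:

shell
# Create the queue (standard is the default type)
aws sqs create-queue --queue-name humio-cloudtrail-queue

# Look up the queue URL, then its ARN (note the ARN for the following steps)
QUEUE_URL=$(aws sqs get-queue-url --queue-name humio-cloudtrail-queue --query QueueUrl --output text)
aws sqs get-queue-attributes --queue-url "$QUEUE_URL" --attribute-names QueueArn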

Configure SQS Permissions for S3 Events

It is necessary to authorize the S3 bucket to push events into the SQS queue. To do this, you will need the ARN for your S3 bucket. Go to the SQS menu in AWS:

Select your SQS queue then choose Queue Actions → Add a Permission.

Then choose the following settings:

yaml
- Effect: `Allow`
- Principal: `Everybody` (checkbox)
- Actions: `SendMessage`
- Add Conditions:
    - Qualifier: `None`
    - Condition: `ArnLike`
    - Key: `aws:SourceArn`
    - Value: `<ARN OF YOUR S3 BUCKET>`

Click Add Permission when done.
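For reference, the console settings above correspond to an SQS access policy document. The sketch below applies an equivalent policy from the command line; it assumes the QUEUE_URL variable from the earlier CLI sketch, jq 1.6+ for the required escaping, and placeholder ARN values that you must fill in:

shell
# The access policy allowing S3 to send messages to the queue (fill in both ARNs)
cat > sqs-access-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "SQS:SendMessage",
      "Resource": "<ARN OF YOUR SQS QUEUE>",
      "Condition": { "ArnLike": { "aws:SourceArn": "<ARN OF YOUR S3 BUCKET>" } }
    }
  ]
}
EOF

# Apply the policy to the queue; jq embeds it as the escaped Policy attribute value
aws sqs set-queue-attributes \
  --queue-url "$QUEUE_URL" \
  --attributes "$(jq -n --rawfile p sqs-access-policy.json '{Policy: $p}')"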

Setup S3 Events to SQS

Go back to the configuration for the S3 bucket holding the CloudTrail logs.

  1. Choose Properties → Events

  2. Select Add Notification

  3. Give the notification a name, such as cloudtrail-to-humio

  4. Check All object create events

  5. Prefix: AWSLogs/XXXXXXXXXXXX/CloudTrail/ where XXXXXXXXXXXX is your AWS account number

  6. Send to: SQS Queue

  7. SQS: YOUR SQS QUEUE NAME

  8. Click Save.

If you get an error at this point, it is likely that you have not set the permissions correctly for S3 to post events to that SQS queue. Review that configuration if needed.
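The same notification can also be configured with the AWS CLI. This is a sketch, assuming the example bucket, notification name, and prefix used above; note that this call replaces the bucket's entire existing notification configuration:

shell
aws s3api put-bucket-notification-configuration \
  --bucket my-s3-bucket \
  --notification-configuration '{
    "QueueConfigurations": [
      {
        "Id": "cloudtrail-to-humio",
        "QueueArn": "<ARN OF YOUR SQS QUEUE>",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {
          "Key": {
            "FilterRules": [
              { "Name": "prefix", "Value": "AWSLogs/XXXXXXXXXXXX/CloudTrail/" }
            ]
          }
        }
      }
    ]
  }'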

Create User Account for FluentD

We recommend that you use a dedicated user account for FluentD. This account will have minimal permissions and be used only for running the FluentD connection.

  1. In the AWS Console go to Security, Identity, & Compliance → IAM

  2. Users → Add User

  3. Provide a user name and choose Programmatic Access (checkbox)

  4. Click Next: Permissions

  5. Click Next: Tags

  6. Click Next: Review

  7. Click Create User (ignore the warning about no permissions for the user)

When you finish creating the user, be sure to download and save the Access key ID and Secret access key, as you will need them to complete the FluentD configuration.
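The same user and programmatic access key can be created from the AWS CLI. A sketch, assuming a hypothetical user name of humio-fluentd:

shell
# Create the dedicated user and an access key; record the AccessKeyId and SecretAccessKey
aws iam create-user --user-name humio-fluentd
aws iam create-access-key --user-name humio-fluentd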

We will now create two inline policies for this user (the policies will exist only as part of this user account).

  1. With the user selected, on the Permissions tab, select Add Inline Policy.

  2. Select the JSON editor and paste the following (editing the bucket name to suit)

JSON
{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Action": [
               "s3:GetObject",
               "s3:GetObjectVersion",
               "s3:ListBucketVersions",
               "s3:ListBucket"
           ],
           "Effect": "Allow",
           "Resource": [
               "arn:aws:s3:::my-s3-bucket/*",
               "arn:aws:s3:::my-s3-bucket"
           ]
       }
   ]
}

Note

This policy gives full read access to the bucket. It is possible to make the Resource section stricter about how the permissions are granted, depending on the layout of your S3 bucket.

  3. Click Review Policy

  4. Give it a name like read-access-to-s3-cloudtrail

  5. Click Create Policy

Repeat the above steps to create a second inline policy that allows the user to read from and manage messages on the SQS queue. The JSON is below; replace the Resource value with the ARN of the SQS queue you created earlier:

JSON
{
  "Statement" : [
     {
        "Action" : [
           "sqs:DeleteMessage",
           "sqs:GetQueueUrl",
           "sqs:ListDeadLetterSourceQueues",
           "sqs:ReceiveMessage",
           "sqs:GetQueueAttributes",
           "sqs:ListQueueTags"
        ],
        "Effect" : "Allow",
        "Resource" : "arn:aws:sqs:eu-west-2:507820635124:humio-demo-sq"
     }
   ],
   "Version" : "2012-10-17"
}
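If you save the two policy documents above to files, they can also be attached as inline policies from the AWS CLI. A sketch, assuming the hypothetical user name and file names shown:

shell
# Attach the S3 read policy and the SQS consume policy as inline policies
aws iam put-user-policy --user-name humio-fluentd \
  --policy-name read-access-to-s3-cloudtrail \
  --policy-document file://s3-read-policy.json

aws iam put-user-policy --user-name humio-fluentd \
  --policy-name consume-sqs-cloudtrail-queue \
  --policy-document file://sqs-consume-policy.json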
Configure AWS CloudTrail to Send Logs to S3

Finally, in AWS, configure AWS CloudTrail to send logs to the S3 bucket, following the official Amazon CloudTrail documentation.

The important points are that the CloudTrail logs go to the S3 bucket configured above, and that the prefix under which those logs are written matches the prefix configured in the SQS notification setup.
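For completeness, a trail can also be created from the AWS CLI. A sketch, assuming a hypothetical trail name and the example bucket; note that the bucket must already have a bucket policy allowing CloudTrail to write to it (the console sets this up for you), and that without an S3 key prefix CloudTrail delivers logs under AWSLogs/<account-id>/CloudTrail/, which matches the SQS notification prefix configured earlier:

shell
# Create the trail and start logging
aws cloudtrail create-trail --name humio-cloudtrail --s3-bucket-name my-s3-bucket
aws cloudtrail start-logging --name humio-cloudtrail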

Create a CloudTrail Parser in LogScale

CloudTrail data is sent as JSON, but the events are wrapped in a top-level Records array. This means that additional parsing is needed for CloudTrail events to appear individually in LogScale. This can be achieved by defining a custom parser in LogScale and associating it with the access token for the repository of your choice.
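To illustrate why the extra parsing is needed: each delivered CloudTrail file is a single JSON object whose Records array contains many individual events. The snippet below prints a heavily abbreviated, hypothetical example of that shape:

shell
# Abbreviated, hypothetical example of a delivered CloudTrail file
cat <<'EOF'
{
  "Records": [
    { "eventVersion": "1.08", "eventTime": "2023-01-01T12:00:00Z", "eventName": "ConsoleLogin", "awsRegion": "eu-west-2" },
    { "eventVersion": "1.08", "eventTime": "2023-01-01T12:00:05Z", "eventName": "PutObject", "awsRegion": "eu-west-2" }
  ]
}
EOF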

To create the custom parser in LogScale:

  1. In your repository of choice go to Parsers → New Parser

  2. For the name choose json-cloudtrail

  3. For the Parser Script you can use

logscale
parseJson()
| split(Records, strip=true)
| @rawstring := rename(@display)
| parseTimestamp(field=eventTime)
| drop([@display, _index])

  4. Save the new parser and associate it with the access token for the repository that you will use in the FluentD configuration.

Configure FluentD Input

Install the relevant FluentD plugin for communicating with AWS S3 and SQS. On your FluentD server you can run:

shell
gem install fluent-plugin-s3 -v 1.0.0 --no-document

The input configuration is below:

yaml Syntax
<source>
   @type s3
   aws_key_id XXXXXXXXXXX
   aws_sec_key XXXXXXXXXXXXXXXXXXXXXXXXXXX
   s3_bucket my-s3-bucket
   s3_region eu-west-2
   add_object_metadata true
   <sqs>
     queue_name my-queue-name
   </sqs>
   store_as gzip
   <parse>
     @type json
   </parse>
</source>

Be sure to configure the plugin with the values relevant for your environment, including the ID and Key for the AWS user, S3 bucket name and region, and the SQS queue name.

More details and options for the input plugin are available on GitHub.
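Before wiring up the output, it can be useful to validate the configuration and then run FluentD in the foreground to watch the S3 input consume messages from SQS. A sketch, assuming a gem-based install with the configuration stored at /etc/fluent/fluentd.conf (adjust the path to your setup):

shell
# Parse and validate the configuration without starting the workers
fluentd --dry-run -c /etc/fluent/fluentd.conf

# Run in the foreground with verbose logging to watch objects being fetched from S3
fluentd -c /etc/fluent/fluentd.conf -v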

Configure FluentD Output

The output for this scenario is the same as the standard output to LogScale when using the Elasticsearch plugin for FluentD as documented here.

To install the elasticsearch plugin on your FluentD server you can run:

shell
fluent-gem install fluent-plugin-elasticsearch

An example output configuration is below:

<match input.s3>
   @type           elasticsearch
   host            my.humio.instance
   port            9200
   scheme          http
   user            ${cloudtrail}
   password        ${YYYYYYYYYYY}
   logstash_format true
</match>

Replace cloudtrail with your LogScale repository name, and YYYYYYYYYYY with your access token.

Note

This configuration filters on the tag input.s3, which should match all the data coming from our S3 input plugin, as we did not set or parse any additional tag data.
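To confirm the pipeline end to end, check that new CloudTrail objects are still arriving under the expected prefix while FluentD is running; events should then appear in the LogScale repository associated with the access token. A sketch, assuming the example bucket and prefix:

shell
# List objects under the CloudTrail prefix; keys are date-based, so the tail of the
# listing is roughly the newest deliveries
aws s3 ls s3://my-s3-bucket/AWSLogs/XXXXXXXXXXXX/CloudTrail/ --recursive | tail -n 5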