Use Case: Collecting AWS S3 Logs with LogScale & FluentD
This document provides a cookbook example of how to collect log files from AWS S3 and ship that data to LogScale. There are many ways to accomplish this; this particular example is based on AWS CloudTrail data.
This recipe can be used in any situation where log data is being placed into an AWS S3 bucket and that data needs to be shipped into LogScale with minimal latency. It does not address the scenario of collecting historical data from AWS S3.
It makes use of AWS SQS (Simple Queue Service) to provide high scalability and low latency for collection. It does not rely on scanning the AWS S3 bucket (which is why it does not support historical ingestion), because bucket scanning does not work with S3 at large scale.
The scenario documented here is based on the combination of two FluentD plugins; the AWS S3 input plugin and the core Elasticsearch output plugin.
Why FluentD
FluentD offers many plugins for input and output, and has proven to be a reliable log shipper for many modern deployments. It is chosen in this example because its configuration is clear and understandable, and it is relatively trivial to deploy and test.
Prerequisites
The following assumes that you have a working installation of FluentD on a server. This example was built using CentOS (CentOS Linux release 8.1.1911) and made use of the gem variant of FluentD installation.
Configure AWS
Assuming that you have an AWS S3 bucket with log data already flowing to it, but no SQS queues configured, you will want to complete the following steps.
This approach uses a dedicated user account with minimal permissions and authenticates using access keys. There are alternative ways to configure the IAM settings if you wish; this is provided as an example.
All items in this example are configured in the same region. This is a requirement for some of the components; the recommendation is to configure everything in the region closest to your LogScale or FluentD instances, although this is not critical.
Create SQS Queue
In the AWS Console go to Services → Application Integration → Simple Queue Service
Choose Create New Queue → Standard Queue
Give the queue a name, for example, humio-cloudtrail-queue
Choose Quick-Create Queue. (You may want to tune specific queue parameters depending on the volume of data and your environment; that is beyond the scope of this document.)
Note the ARN, as you will need this later.
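For reference, the same queue can be created with the AWS CLI. The sketch below assumes the example queue name humio-cloudtrail-queue and the eu-west-2 region used later in this document; the account ID in the queue URL is a placeholder.
# Create the standard queue (returns the queue URL)
aws sqs create-queue --queue-name humio-cloudtrail-queue --region eu-west-2

# Retrieve the queue ARN for use in the following steps
aws sqs get-queue-attributes \
    --queue-url https://sqs.eu-west-2.amazonaws.com/XXXXXXXXXXXX/humio-cloudtrail-queue \
    --attribute-names QueueArn \
    --region eu-west-2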
Configure SQS Permissions for S3 Events
It is necessary to authorize the S3 bucket to push events into the SQS queue. To do this, you will need the ARN for your S3 bucket. Go to the SQS menu in AWS:
Select your SQS queue then choose Queue Actions → Add a Permission.
Then choose the following settings:
- Effect: `Allow`
- Principal: `Everybody` (checkbox)
- Actions: `SendMessage`
- Add Conditions:
  - Qualifier: `None`
  - Condition: `ArnLike`
  - Key: `aws:SourceArn`
  - Value: `<ARN OF YOUR S3 BUCKET>`
Click Add Permission when done.
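For reference, the permission added above is stored as an access policy on the queue, along the lines of the following sketch (the Resource and aws:SourceArn values are placeholders for your queue and bucket ARNs):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "sqs:SendMessage",
      "Resource": "<ARN OF YOUR SQS QUEUE>",
      "Condition": {
        "ArnLike": { "aws:SourceArn": "<ARN OF YOUR S3 BUCKET>" }
      }
    }
  ]
}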
Set Up S3 Events to SQS
Go back to the configuration for the S3 bucket holding the CloudTrail logs.
Choose Properties → Select Add Notification
Give the notification a name, such as cloudtrail-to-humio
Check All object create events
Prefix: AWSLogs/XXXXXXXXXXXX/CloudTrail/ (where XXXXXXXXXXXX is your AWS account number)
Send to: SQS Queue
SQS: YOUR SQS QUEUE NAME
Click Save.
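The same event notification can also be created with the AWS CLI. This is a sketch only; the queue ARN, account number, and bucket name are placeholders. First save the notification configuration as notification.json:
{
  "QueueConfigurations": [
    {
      "QueueArn": "<ARN OF YOUR SQS QUEUE>",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "AWSLogs/XXXXXXXXXXXX/CloudTrail/" }
          ]
        }
      }
    }
  ]
}
Then apply it to the bucket:
aws s3api put-bucket-notification-configuration \
    --bucket my-s3-bucket \
    --notification-configuration file://notification.json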
If you get an error at this point, it is likely that the permissions allowing S3 to post events to the SQS queue are not set correctly. Review that configuration if needed.
Create User Account for FluentD
We recommend that you use a dedicated user account for FluentD. This account will have minimal permissions and be used only for running the FluentD connection.
In the AWS Console go to Security, Identity, & Compliance → IAM
Users → Add User
Provide a user name and choose Programmatic Access (checkbox)
Click Next: Permissions
Click Next: Tags
Click Next: Review
Click Create User (ignore the warning about no permissions for the user)
When you finish creating the user be sure to download and save the Access key ID and Secret access key, as you will need them to complete the FluentD configuration.
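As a sketch, the same user and credentials can be created with the AWS CLI (the user name fluentd-s3-reader is an example):
# Create the dedicated user with no permissions attached yet
aws iam create-user --user-name fluentd-s3-reader

# Create programmatic credentials; save the AccessKeyId and SecretAccessKey from the output
aws iam create-access-key --user-name fluentd-s3-reader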
We will now create two inline policies for this user (the policies will exist only as part of this user account).
With the user selected, on the Permissions tab, select Add inline policy. Select the JSON editor and paste the following (editing the bucket name to suit):
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:ListBucketVersions",
"s3:ListBucket"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::my-s3-bucket/*",
"arn:aws:s3:::my-s3-bucket"
]
}
]
}
Note
This policy gives full read access to the bucket. It is possible to modify the Resource section to grant stricter permissions; this depends on the layout of your S3 bucket.
Click Review Policy
Give it a name, such as read-access-to-s3-cloudtrail
Click Create Policy
Repeat the above steps to create a second inline policy that allows the user to read and manage messages on the SQS queue. The JSON is below; replace the Resource value with the ARN of the queue you created earlier:
{
"Statement" : [
{
"Action" : [
"sqs:DeleteMessage",
"sqs:GetQueueUrl",
"sqs:ListDeadLetterSourceQueues",
"sqs:ReceiveMessage",
"sqs:GetQueueAttributes",
"sqs:ListQueueTags"
],
"Effect" : "Allow",
"Resource" : "arn:aws:sqs:eu-west-2:507820635124:humio-demo-sq"
}
],
"Version" : "2012-10-17"
}
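If you prefer the CLI, both inline policies can be attached with put-user-policy. This is a sketch; the user name, policy names, and file names are examples, with the files containing the JSON documents shown above:
# Attach the S3 read policy as an inline policy on the user
aws iam put-user-policy --user-name fluentd-s3-reader \
    --policy-name read-access-to-s3-cloudtrail \
    --policy-document file://s3-read-policy.json

# Attach the SQS consume policy
aws iam put-user-policy --user-name fluentd-s3-reader \
    --policy-name consume-cloudtrail-sqs-queue \
    --policy-document file://sqs-consume-policy.json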
Configure AWS CloudTrail to Send Logs to S3
Finally, in AWS, configure CloudTrail to send logs to the S3 bucket, following the official Amazon CloudTrail documentation.
What is important is that the CloudTrail logs go to the S3 bucket configured above, and that the prefix used for writing those logs matches the prefix configured in the S3 event notification.
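For reference, a minimal trail can also be created with the AWS CLI; a sketch with example names (the bucket must already carry the bucket policy that CloudTrail requires, as described in the Amazon CloudTrail documentation):
# Create a trail delivering logs to the bucket, then start logging
aws cloudtrail create-trail --name humio-cloudtrail-demo --s3-bucket-name my-s3-bucket
aws cloudtrail start-logging --name humio-cloudtrail-demo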
Create a CloudTrail Parser in LogScale
CloudTrail data is sent as JSON, but it is wrapped in a top-level Records array. This means that additional parsing is needed for CloudTrail events to appear individually in LogScale. This can be achieved by defining a custom parser in LogScale and associating it with the access token for the repository of your choice.
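Each CloudTrail log file delivered to S3 is a single JSON object with the individual events nested in a Records array, along the lines of this simplified illustration:
{
  "Records": [
    { "eventVersion": "1.08", "eventTime": "2021-05-01T12:00:00Z", "eventName": "ConsoleLogin", ... },
    { "eventVersion": "1.08", "eventTime": "2021-05-01T12:00:05Z", "eventName": "AssumeRole", ... }
  ]
}
The parser below splits this array so that each element of Records becomes its own event, and uses the eventTime field for the timestamp.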
To create the custom parser in LogScale:
In your repository of choice go to Parsers → New Parser
For the name choose json-cloudtrail
For the Parser Script you can use:
parseJson()
| split(Records, strip=true)
| @rawstring := rename(@display)
| parseTimestamp(field=eventTime)
| drop([@display, _index])
Save the new parser and associate it with the access token for the repository that you will use in the FluentD configuration.
Configure FluentD Input
Install the relevant FluentD plugin for communicating with AWS S3 and SQS. On your FluentD server you can run:
gem install fluent-plugin-s3 -v 1.0.0 --no-document
The input configuration is below:
<source>
@type s3
aws_key_id XXXXXXXXXXX
aws_sec_key XXXXXXXXXXXXXXXXXXXXXXXXXXX
s3_bucket my-s3-bucket
s3_region eu-west-2
add_object_metadata true
<sqs>
queue_name my-queue-name
</sqs>
store_as gzip
<parse>
@type json
</parse>
</source>
Be sure to configure the plugin with the values relevant for your environment, including the ID and Key for the AWS user, S3 bucket name and region, and the SQS queue name.
More details and options for the input plugin are available on GitHub.
Configure FluentD Output
The output for this scenario is the same as the standard output to LogScale when using the Elasticsearch output plugin for FluentD.
To install the Elasticsearch plugin on your FluentD server you can run:
fluent-gem install fluent-plugin-elasticsearch
An example output configuration is below:
<match input.s3>
@type elasticsearch
host my.humio.instance
port 9200
scheme http
user ${cloudtrail}
password ${YYYYYYYYYYY}
logstash_format true
</match>
Replace cloudtrail with your LogScale repository name, and YYYYYYYYYYY with your access token.
Note
This filters on the tag input.s3, which should match all of the data coming from our S3 input plugin, as we did not set or parse any additional tag data.
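With both the input and output in place, you can validate the configuration and run FluentD in the foreground to confirm that objects picked up from the SQS queue arrive in your repository. A sketch, assuming a gem-based install with the configuration at /etc/fluent/fluent.conf:
# Check the configuration without starting ingestion
fluentd --dry-run -c /etc/fluent/fluent.conf

# Run in the foreground with verbose logging while testing
fluentd -c /etc/fluent/fluent.conf -v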