Google Cloud Storage Archiving
Security Requirements and Controls
Change archiving settings permission
Important
Google Cloud Storage archiving is not supported for Google Cloud Storage buckets where object locking is enabled.
Google Cloud Storage Archiving Storage Format and Layout
When uploading a segment file, LogScale creates the Google Cloud Storage object key based on the tags, start date, and repository name of the segment file. The resulting object key makes the archived data browsable through the Google Cloud Storage management console.
LogScale uses the following pattern:
REPOSITORY/TYPE/TAG_KEY_1/TAG_VALUE_1/../TAG_KEY_N/TAG_VALUE_N/YEAR/MONTH/DAY/START_TIME-SEGMENT_ID.gz
Where:
REPOSITORY
Name of the repository
TYPE
Keyword (static) to identify the format of the enclosed data.
TAG_KEY_1
Name of the tag key (typically the name of the parser used to ingest the data, from the #type field)
TAG_VALUE_1
Value of the corresponding tag key.
YEAR
Year of the timestamp of the events
MONTH
Month of the timestamp of the events
DAY
Day of the timestamp of the events
START_TIME
The start time of the segment, in the format HH-MM-SS
SEGMENT_ID
The unique segment ID of the event data
An example of this layout can be seen in the file list below:
$ s3cmd ls -r s3://logscale2/accesslog/
2023-06-07 08:03 1453 s3://logscale2/accesslog/type/kv/2023/05/02/14-35-52-gy60POKpoe0yYa0zKTAP0o6x.gz
2023-06-07 08:03 373268 s3://logscale2/accesslog/type/kv/humioBackfill/0/2023/03/07/15-09-41-gJ0VFhx2CGlXSYYqSEuBmAx1.gz
For more information about this layout, see Event Tags.
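As a rough illustration of the layout, the following Python sketch splits an object key into its parts. It assumes, based on the file list above, that everything between the repository name and the date reads as alternating tag key and tag value segments (type/kv, humioBackfill/0). The names ArchiveKey and parse_archive_key are hypothetical and are not part of LogScale.

```python
import re
from typing import Dict, NamedTuple


class ArchiveKey(NamedTuple):
    repository: str
    tags: Dict[str, str]
    year: str
    month: str
    day: str
    start_time: str  # HH-MM-SS
    segment_id: str


# File name shape taken from the pattern: START_TIME-SEGMENT_ID.gz
_FILE_RE = re.compile(r"^(\d{2}-\d{2}-\d{2})-(.+)\.gz$")


def parse_archive_key(key: str) -> ArchiveKey:
    """Split an archived object key into repository, tags, date and segment."""
    parts = key.strip("/").split("/")
    if len(parts) < 5:
        raise ValueError(f"key is too short: {key!r}")

    repository = parts[0]
    year, month, day, filename = parts[-4:]

    # Assumption: the segments between repository and date are key/value pairs.
    tag_parts = parts[1:-4]
    if len(tag_parts) % 2 != 0:
        raise ValueError(f"tag segments are not key/value pairs: {tag_parts!r}")
    tags = dict(zip(tag_parts[0::2], tag_parts[1::2]))

    match = _FILE_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")

    return ArchiveKey(repository, tags, year, month, day,
                      match.group(1), match.group(2))
```

Applied to the second entry in the file list (without the s3://logscale2/ prefix), this yields the repository accesslog, the tags type=kv and humioBackfill=0, and the date 2023/03/07.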
File Format
LogScale supports two formats for storage: native format and NDJSON.
Native Format
The native format is the raw data, i.e. the equivalent of the @rawstring of the ingested data:
127.0.0.1 - - [07/Mar/2023:15:09:42 +0000] "GET /falcon-logscale/css-images/176f8f5bd5f02b3abfcf894955d7e919.woff2 HTTP/1.1" 200 15736 "http://localhost:81/falcon-logscale/theme.css" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
127.0.0.1 - - [07/Mar/2023:15:09:43 +0000] "GET /falcon-logscale/css-images/alert-octagon.svg HTTP/1.1" 200 416 "http://localhost:81/falcon-logscale/theme.css" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
127.0.0.1 - - [09/Mar/2023:14:16:56 +0000] "GET /theme-home.css HTTP/1.1" 200 70699 "http://localhost:81/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
127.0.0.1 - - [09/Mar/2023:14:16:59 +0000] "GET /css-images/help-circle-white.svg HTTP/1.1" 200 358 "http://localhost:81/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
127.0.0.1 - - [09/Mar/2023:14:16:59 +0000] "GET /css-images/logo-white.svg HTTP/1.1" 200 2275 "http://localhost:81/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
NDJSON Format
The default archiving format is NDJSON. When using NDJSON, the parsed fields are available along with the raw log line. This incurs some extra storage cost compared to storing only raw log lines, but makes the logs easier to process in an external system.
{"#type":"kv","#repo":"weblog","#humioBackfill":"0","@source":"/var/log/apache2/access_log","@timestamp.nanos":"0","@rawstring":"127.0.0.1 - - [07/Mar/2023:15:09:42 +0000] \"GET /falcon-logscale/css-images/176f8f5bd5f02b3abfcf894955d7e919.woff2 HTTP/1.1\" 200 15736 \"http://localhost:81/falcon-logscale/theme.css\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36\"","@id":"XPcjXSqXywOthZV25sOB1hqZ_0_1_1678201782","@timestamp":1678201782000,"@ingesttimestamp":"1691483483696","@host":"ML-C02FL14GMD6V","@timezone":"Z"}
{"#type":"kv","#repo":"weblog","#humioBackfill":"0","@source":"/var/log/apache2/access_log","@timestamp.nanos":"0","@rawstring":"127.0.0.1 - - [07/Mar/2023:15:09:43 +0000] \"GET /falcon-logscale/css-images/alert-octagon.svg HTTP/1.1\" 200 416 \"http://localhost:81/falcon-logscale/theme.css\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36\"","@id":"XPcjXSqXywOthZV25sOB1hqZ_0_3_1678201783","@timestamp":1678201783000,"@ingesttimestamp":"1691483483696","@host":"ML-C02FL14GMD6V","@timezone":"Z"}
{"#type":"kv","#repo":"weblog","#humioBackfill":"0","@source":"/var/log/apache2/access_log","@timestamp.nanos":"0","@rawstring":"127.0.0.1 - - [09/Mar/2023:14:16:56 +0000] \"GET /theme-home.css HTTP/1.1\" 200 70699 \"http://localhost:81/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36\"","@id":"XPcjXSqXywOthZV25sOB1hqZ_0_15_1678371416","@timestamp":1678371416000,"@ingesttimestamp":"1691483483696","@host":"ML-C02FL14GMD6V","@timezone":"Z"}
{"#type":"kv","#repo":"weblog","#humioBackfill":"0","@source":"/var/log/apache2/access_log","@timestamp.nanos":"0","@rawstring":"127.0.0.1 - - [09/Mar/2023:14:16:59 +0000] \"GET /css-images/help-circle-white.svg HTTP/1.1\" 200 358 \"http://localhost:81/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36\"","@id":"XPcjXSqXywOthZV25sOB1hqZ_0_22_1678371419","@timestamp":1678371419000,"@ingesttimestamp":"1691483483696","@host":"ML-C02FL14GMD6V","@timezone":"Z"}
{"#type":"kv","#repo":"weblog","#humioBackfill":"0","@source":"/var/log/apache2/access_log","@timestamp.nanos":"0","@rawstring":"127.0.0.1 - - [09/Mar/2023:14:16:59 +0000] \"GET /css-images/logo-white.svg HTTP/1.1\" 200 2275 \"http://localhost:81/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36\"","@id":"XPcjXSqXywOthZV25sOB1hqZ_0_23_1678371419","@timestamp":1678371419000,"@ingesttimestamp":"1691483483696","@host":"ML-C02FL14GMD6V","@timezone":"Z"}
A single NDJSON line is just a JSON object, which formatted looks like this:
{
  "#humioBackfill" : "0",
  "#repo" : "weblog",
  "#type" : "kv",
  "@host" : "ML-C02FL14GMD6V",
  "@id" : "XPcjXSqXywOthZV25sOB1hqZ_0_1_1678201782",
  "@ingesttimestamp" : "1691483483696",
  "@rawstring" : "127.0.0.1 - - [07/Mar/2023:15:09:42 +0000] \"GET /falcon-logscale/css-images/176f8f5bd5f02b3abfcf894955d7e919.woff2 HTTP/1.1\" 200 15736 \"http://localhost:81/falcon-logscale/theme.css\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36\"",
  "@source" : "/var/log/apache2/access_log",
  "@timestamp" : 1678201782000,
  "@timestamp.nanos" : "0",
  "@timezone" : "Z"
}
Each record includes the full detail for each event, including parsed fields, the original raw event string, and tagged field entries.
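For processing archived files in an external system, a minimal Python sketch of reading one NDJSON archive could look like the following. It assumes the .gz object has already been downloaded locally, is gzip-compressed, and is UTF-8 encoded with one JSON object per line. It also assumes @timestamp is in milliseconds since the Unix epoch, which is consistent with the sample record above (1678201782000 matches the 15:09:42 timestamp in its @rawstring). The function names are hypothetical.

```python
import gzip
import json
from datetime import datetime, timezone
from typing import Iterator


def read_archived_events(path: str) -> Iterator[dict]:
    """Yield one dict per NDJSON line in a gzip-compressed archive file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def event_time(event: dict) -> datetime:
    """Convert the @timestamp field (assumed epoch milliseconds) to UTC."""
    return datetime.fromtimestamp(event["@timestamp"] / 1000, tz=timezone.utc)
```

A caller can then iterate over read_archived_events(path) and use fields such as "@rawstring" or "#type" directly from each dict.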
How Data is Uploaded to Google Cloud Storage
Data is uploaded to Google Cloud Storage as soon as a segment file has been created during ingest (for more information, see Ingestion: Digest Phase).
Each segment file is sent as a multipart upload, so the upload of a single file may require multiple Google Cloud Storage requests. The exact number of requests depends on the rate of ingest, but expect roughly one request for each 8 MB of ingested data.
The size of each part of the upload is configured using the GCS_ARCHIVING_CHUNK_SIZE configuration variable.
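To get a feel for how the part size relates to the number of requests, the following Python sketch estimates the number of parts for a file of a given size. It is a back-of-the-envelope calculation only: it assumes each part holds at most one chunk, uses 8388608 bytes (8 MB) as the chunk size, and ignores any extra requests needed to start or finish an upload. The function name is hypothetical.

```python
DEFAULT_CHUNK_SIZE = 8 * 1024 * 1024  # 8388608 bytes


def estimate_upload_parts(file_bytes: int,
                          chunk_size: int = DEFAULT_CHUNK_SIZE) -> int:
    """Estimate how many parts a file of file_bytes is split into."""
    if file_bytes < 0:
        raise ValueError("file_bytes must not be negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Ceiling division; even an empty file is counted as one part here.
    return max(1, -(-file_bytes // chunk_size))
```

For example, a 100 MB file with the 8 MB chunk size works out to 13 parts under these assumptions.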
Google Cloud Storage Storage Configuration
Keys & Configuration
You need to create a Google service account that is authorized to manage the contents of the bucket that will hold the data. See Google Authentication Documentation for an explanation of how to obtain and provide service account credentials manually. Go to the Google Service Account Key page to create a service account key.
Once you have the JSON file from Google with a set of credentials, place it in the /etc directory on each LogScale node. Be sure to provide the full path to the file in the configuration file like this:
GCP_ARCHIVING_ACCOUNT_JSON_FILE=/path/GCS-project-example.json
The JSON file must include the fields project_id, client_email and private_key. Any other field in the file is currently ignored. Additionally, you will need to set some options in the LogScale configuration file related to using Google Cloud Bucket Storage. Below is an excerpt from that file showing the options to set; your actual values will be different:
GCP_ARCHIVING_BUCKET=$BUCKET_NAME
GCP_ARCHIVING_ENCRYPTION_KEY=$ENCRYPTION_SECRET
GCP_ARCHIVING_OBJECT_KEY_PREFIX=/basefolder
USING_EPHEMERAL_DISKS=true
These variables set the following values:
GCP_ARCHIVING_BUCKET
Sets the name of the bucket to use.
GCP_ARCHIVING_ENCRYPTION_KEY
Can be any UTF-8 string and is used to encrypt the data stored within the bucket. The suggested value is 64 or more random ASCII characters.
GCP_ARCHIVING_OBJECT_KEY_PREFIX
Sets the optional prefix for all object keys. This option is empty by default. It allows nodes to share a single bucket, but each node must use a unique prefix. There is a performance penalty when using a non-empty prefix, so using a prefix is not recommended.
USING_EPHEMERAL_DISKS
Must be set to true if there are any ephemeral disks in the cluster.
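Before rolling the configuration out to each node, it can help to check the credentials file and generate a suitable encryption key. The Python sketch below is one way to do that. It checks only for the three fields named above, and it builds the key from letters and digits, which is one possible reading of "random ASCII characters". Both function names are hypothetical helpers, not LogScale tooling.

```python
import json
import secrets
import string
from typing import List

REQUIRED_FIELDS = ("project_id", "client_email", "private_key")


def missing_credential_fields(path: str) -> List[str]:
    """Return the required fields that are absent or empty in the JSON file."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return [name for name in REQUIRED_FIELDS if not data.get(name)]


def generate_encryption_key(length: int = 64) -> str:
    """Generate a random key of ASCII letters and digits (64 or more)."""
    if length < 64:
        raise ValueError("the suggested key length is 64 or more characters")
    alphabet = string.ascii_letters + string.digits
    # secrets draws from the operating system's secure random source.
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

An empty list from missing_credential_fields means all three fields are present. The generated key can be used as the value of GCP_ARCHIVING_ENCRYPTION_KEY.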
You can change the GCP_STORAGE_BUCKET setting to point to a fresh bucket at any point in time. From that point, LogScale will write new files to that bucket while still reading from any previously configured buckets. Existing files already written to a previous bucket will not be written to the new bucket. LogScale will continue to delete files from the old buckets that match the file names that LogScale would put there.
Set up Google Cloud Storage archiving with an IAM user
To configure Google Cloud Storage with an IAM user:
In Google Cloud Storage:
Log in to the Google Cloud console and go to the Create service account page.
Click Create service account and select a Google Cloud project.
Enter a service account name to display in the Google Cloud console. The Google Cloud console generates a service account ID based on this name. Edit the ID if necessary. You cannot change the ID later.
Optional: Enter a description of the service account.
Decide whether you want to set access controls now. If you do not want to set access controls now, click Done to finish creating the service account. To set access controls now, click Create and continue and continue to the next step.
If you chose to create access controls in the previous step, choose one or more IAM roles to grant to the service account on the project. When you have added the roles, click Continue.
Optional: In the Service account users role field, add members that need to attach the service account to other resources.
Optional: In the Service account admins role field, add members that need to manage the service account.
Click Done to finish creating the service account.
In LogScale:
Go to the repository you want to archive and select → . Configure by giving the bucket name, region, and then .
Monitor Google Cloud Storage Archiving
To monitor the Google Cloud Storage archiving process, the following query can be executed in the humio repository:
#kind=logs thread=/archiving-upload-latency/ class!=/TimerExecutor|JobAssignmentsImpl/
A monitoring task within LogScale checks this and reports if the latency is greater than 15 minutes. This adds an event entry to the humio repo using the phrase Archiving is lagging by ingest time for more than (ms).
Troubleshoot Google Cloud Storage archiving configuration
If you encounter an access denied error message when configuring Google Cloud Storage archiving, check your configuration settings for missing information or typos.
Tag Grouping
If tag grouping is applied to a repository, the archiving logic uploads one segment into one Google Cloud Storage file, even though tag grouping means each segment may contain multiple unique combinations of tags. The TAG_VALUE part of the Google Cloud Storage file name that corresponds to a tag with tag grouping does not contain any of the specific values for the tag in that segment. Instead, it contains an internal value that denotes which tag group the segment belongs to. This is less human-readable than splitting a segment into separate Google Cloud Storage files for each unique tag combination in the segment, but it avoids the risk of a single segment being split into an unmanageable number of Google Cloud Storage files.
Other options
The following sections describe other options for configuring Google Cloud Storage archiving, and for fine-tuning performance.
HTTP proxy
If LogScale is set up to use an HTTP_PROXY_HOST, it will be used for communicating with Google Cloud Storage by default. To disable it, set the following:
# Use the globally configured HTTP proxy for communicating with GCS.
# Default is true.
GCS_ARCHIVING_USE_HTTP_PROXY=false
Non-default endpoints
If you host a GCP-compatible service, you can point LogScale to your own endpoint for bucket storage.
GCP_ARCHIVING_ENDPOINT_BASE=http://my-own-gcs:8080
Virtual host style (default)
LogScale will construct virtual host-style URLs like https://my-bucket.my-own-gcs:8080/path/inside/bucket/file.txt.
For this style of access, you need to set your base URL, so it contains a placeholder for the bucket name.
GCS_ARCHIVING_ENDPOINT_BASE=http://{bucket}.my-own-gcs:8080
LogScale will replace the placeholder {bucket} with the relevant bucket name at runtime.
Path-style
Some services do not support virtual host style access, and require path-style access. Such URLs have the format https://my-own-gcs:8080/my-bucket/path/inside/bucket/file.txt. If you are using such a service, your endpoint base URL should not contain a bucket placeholder.
GCS_ARCHIVING_ENDPOINT_BASE=http://my-own-gcs:8080
Additionally, you must set GCS_ARCHIVING_PATH_STYLE_ACCESS to true.
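The difference between the two URL styles can be illustrated with a short Python sketch. This is not LogScale's implementation. It is a hypothetical helper that only shows how the {bucket} placeholder and the path-style setting change the resulting URL for the same bucket and object key.

```python
def object_url(endpoint_base: str, bucket: str, key: str,
               path_style: bool = False) -> str:
    """Build an object URL in virtual host style or path style."""
    base = endpoint_base.rstrip("/")
    key = key.lstrip("/")
    placeholder = "{bucket}"

    if path_style:
        # Path style: the bucket is the first path segment; no placeholder.
        if placeholder in base:
            raise ValueError("path-style base URL must not contain {bucket}")
        return base + "/" + bucket + "/" + key

    # Virtual host style: the bucket name becomes part of the host name.
    if placeholder not in base:
        raise ValueError("virtual host style base URL needs a {bucket} placeholder")
    return base.replace(placeholder, bucket) + "/" + key
```

With the base URL http://{bucket}.my-own-gcs:8080 the helper produces a virtual host-style URL, and with http://my-own-gcs:8080 and path_style=True it produces a path-style URL.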
IBM Cloud Storage compatibility
To use Google Cloud Storage Archiving with IBM Cloud Storage, set GCS_ARCHIVING_IBM_COMPAT to true.
Google Bucket Parameters
There are a few options that can help in tuning LogScale performance related to using Google Cloud for archiving.
Important
There may be financial costs associated with increasing these as storage is billed using a combination of the number of operations and storage used.
You can set the maximum number of files that LogScale will concurrently download or upload. If not set in the configuration file, LogScale will take the number of hyperthreads supported by the CPU(s) and divide it by 2 to determine the value for this option. You might want to set it yourself with a different value:
GCP_ARCHIVING_CONCURRENCY=8
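To see what the default would be on a given machine, the rule described above can be approximated in Python. This sketch assumes that os.cpu_count() reports the number of hyperthreads, that the division rounds down, and that the result is never below 1. The rounding and the lower bound are assumptions, not documented behavior.

```python
import os
from typing import Optional


def default_archiving_concurrency(logical_cpus: Optional[int] = None) -> int:
    """Approximate the default: number of hyperthreads divided by 2."""
    if logical_cpus is None:
        # os.cpu_count() may return None; fall back to 1 in that case.
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // 2)
```

On a machine with 16 hyperthreads this gives 8, which matches the example value shown above.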
The option below sets the chunk size for upload and download ranges. The default is 8 MB, which is also the maximum. The minimum value is 5 MB.
GCP_ARCHIVING_CHUNK_SIZE=8388608
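A value can be checked against these limits before it goes into the configuration file. The Python sketch below assumes the limits are expressed in binary megabytes, consistent with the default of 8388608 bytes shown above, which would make the minimum 5242880 bytes. The function name is hypothetical.

```python
MIN_CHUNK_SIZE = 5 * 1024 * 1024  # 5242880 bytes, assumed binary megabytes
MAX_CHUNK_SIZE = 8 * 1024 * 1024  # 8388608 bytes, the default


def check_chunk_size(value: int) -> int:
    """Return value unchanged if it lies within the allowed range."""
    if not MIN_CHUNK_SIZE <= value <= MAX_CHUNK_SIZE:
        raise ValueError(
            f"chunk size {value} is outside {MIN_CHUNK_SIZE}..{MAX_CHUNK_SIZE}"
        )
    return value
```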
Use this next option to set whether you prefer that LogScale fetch data files from the bucket when possible, even if another node in the LogScale cluster has a copy. It is set to false by default.
In some environments, it may be less expensive to transfer files this way. The transfer from the bucket may be billed at a lower cost than a transfer from a node in another region or in another data center.
GCP_ARCHIVING_PREFERRED_COPY_SOURCE=false
Setting the preference doesn't guarantee that the bucket copy will be used. The cluster can still make internal replications directly when the file is not yet in a bucket.
Google Cloud Storage archived log re-ingestion
You can re-ingest log data that has been written to a Google Cloud Storage bucket through Google Cloud Storage archiving by using Log Collector and the native JSON parsing within LogScale.
This process has the following requirements:
The files need to be downloaded from the Google Cloud Storage bucket to the machine running the Log Collector. The Google Cloud Storage files cannot be accessed natively by the Log Collector.
The events will be ingested into the repository that is created for the purpose of receiving the data.
To re-ingest logs:
Create a repo in LogScale where the ingested data will be stored. See Creating a Repository or View.
Create an ingest token, and choose the JSON parser. See Assigning Parsers to Ingest Tokens.
Install the Falcon LogScale Collector to read from a file using the .gz extension as the file match. For example, using a configuration similar to this:
#dataDirectory is only required in the case of local configurations and must not be used for remote configurations files.
dataDirectory: data
sources:
  bucketdata:
    type: file
    # Glob patterns
    include:
      - /bucketdata/*.gz
    sink: my_humio_instance
    parser: json
    ...
For more information, see Sources & Examples.
Copy the log file from the Google Cloud Storage bucket into the configured directory (/bucketdata in the above example).
The Log Collector reads the file that has been copied and sends it to LogScale, where the JSON event data will be parsed and recreated.
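The copy step can be scripted. The Python sketch below assumes the archived files have already been downloaded from the bucket into a local directory that mirrors the bucket layout, and it copies them flat into the directory matched by the /bucketdata/*.gz glob. Each file is first written under a temporary name that does not end in .gz and is then renamed, so that a partially copied file is unlikely to be picked up. That precaution, and the function name, are choices made for this sketch rather than requirements of the Log Collector.

```python
import shutil
from pathlib import Path
from typing import List


def stage_for_reingest(download_dir: str, watch_dir: str) -> List[Path]:
    """Copy downloaded .gz archives flat into the collector's directory."""
    source = Path(download_dir)
    target = Path(watch_dir)
    target.mkdir(parents=True, exist_ok=True)

    staged = []
    for src in sorted(source.rglob("*.gz")):
        # Fold the relative path into the file name so that files from
        # different tags or days cannot overwrite each other.
        flat_name = "_".join(src.relative_to(source).parts)
        partial = target / (flat_name + ".partial")
        final = target / flat_name

        shutil.copyfile(src, partial)
        partial.replace(final)  # rename once the copy is complete
        staged.append(final)
    return staged
```

For the configuration above, the call would be stage_for_reingest(download_dir, "/bucketdata"), where download_dir is wherever the bucket contents were downloaded.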