Query Function Limitations
Some query functions have specific limitations when used with multi-cluster views, specifically those that have a complex flow of data and those that depend on state that may be inconsistent between clusters.
These functions use files which may not be consistent between clusters.
Uses the IOC database, which may not be consistent between clusters.
Uses the IP location database, which may not be consistent between clusters.
Multi-Cluster match() Support
When using the match() function in a multi-cluster scenario, care must be taken to ensure that the same file has been uploaded to each cluster in the multi-cluster view. LogScale does not automatically synchronise information across the clusters.
Although querying is not limited or prevented when the versions of the file do not match, the results returned by the query may not be as expected if the content of the file on each cluster is not identical.
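Because LogScale does not synchronise the files itself, it can help to compare the copies before running a multi-cluster query. The following is a minimal sketch of one way to do that, assuming the content of the file from each cluster has already been obtained by some other means. The function names and the cluster names in the usage note are hypothetical and are not part of LogScale.

```python
import hashlib
from collections import defaultdict


def group_by_digest(copies: dict[str, bytes]) -> dict[str, list[str]]:
    """Group cluster names by the SHA-256 digest of their copy of the file.

    `copies` maps a (hypothetical) cluster name to the raw bytes of the
    lookup file taken from that cluster.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for cluster, content in copies.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(cluster)
    return dict(groups)


def copies_identical(copies: dict[str, bytes]) -> bool:
    """Return True when every cluster holds byte-identical content."""
    return len(group_by_digest(copies)) <= 1
```

For example, calling copies_identical() with the bytes of names.csv from a federating cluster and two remote clusters returns False if any one copy differs, and group_by_digest() shows which clusters share the same content.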
The following conditions apply when executing queries using match():
There must be a file on the federating cluster matching the name of the file used for queries across the multi-cluster view.
If individual clusters have different versions of the same file then queries will behave in a well-defined but possibly unintuitive or unexpected way.
The environment variable UNSAFE_ALLOW_FEDERATED_MATCH will need to be enabled.
To understand the impact of this, it is important to understand how multi-cluster queries are processed. When executing a multi-cluster query, the query is split into two parts:
The query up to and including the first aggregate function. This part is executed on the remote clusters. If a match() appears in this part then each remote cluster will use its own version of the file.
Everything after the first aggregate function. This part is executed on the federating cluster. If a match() appears in this part then it will use the version of the file from the federating cluster.
This means that this multi-cluster query:
match(file="names.csv", field=id, include=[name])
will use the version of names.csv on each of the remote clusters, whereas this query:
groupBy(id) | match(file="names.csv", field=id, include=[name])
will use the names.csv present on the federating cluster. This is because groupBy() is an aggregate function, so the match() comes after the first aggregate.
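The rule described above can be sketched in a few lines of Python. This is only an illustrative model, not how LogScale parses or plans queries: the AGGREGATES set is a small hand-picked subset of aggregate function names, the pipeline is split naively on the pipe character, and the helper names are hypothetical.

```python
import re

# Illustrative subset of aggregate function names (assumption: not exhaustive).
AGGREGATES = {"groupBy", "count", "sum", "avg", "top", "timeChart"}


def split_stages(query: str) -> list[str]:
    """Split a pipeline on '|'. Naive: ignores pipes inside strings or sub-queries."""
    return [stage.strip() for stage in query.split("|") if stage.strip()]


def function_name(stage: str) -> str:
    """Return the leading function name of a stage, or '' if there is none."""
    found = re.match(r"\s*([A-Za-z_:][\w:]*)\s*\(", stage)
    return found.group(1) if found else ""


def match_file_sources(query: str) -> list[str]:
    """For each match() stage, report whose copy of the file would be used.

    Stages up to and including the first aggregate run on the remote
    clusters; everything after it runs on the federating cluster.
    """
    sources = []
    seen_aggregate = False
    for stage in split_stages(query):
        name = function_name(stage)
        if name == "match":
            sources.append("federating" if seen_aggregate else "remote")
        if name in AGGREGATES:
            seen_aggregate = True
    return sources
```

Applied to the two queries above, match_file_sources() reports "remote" for the first query and "federating" for the second, because only the second has an aggregate function before the match().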