Query Function Limitations

Some query functions have specific limitations when used with multi-cluster views, in particular functions that have a complex flow of data and functions that depend on state that may be inconsistent between clusters.

Multi-Cluster match() Support

When using the match() function in a multi-cluster scenario, care must be taken to ensure that the same file has been uploaded to each cluster in the multi-cluster view. LogScale does not automatically synchronise information across the clusters.

Although querying is not limited or prevented when the versions of the file do not match, the results returned by the query may not be as expected if the content of the file on each cluster is not identical.

The following conditions apply when executing queries using match():

  • There must be a file on the federating cluster matching the name of the file used for queries across the multi-cluster view.

  • If individual clusters have different versions of the same file, queries will behave in a well-defined but possibly unintuitive or unexpected way.

  • The environment variable UNSAFE_ALLOW_FEDERATED_MATCH must be enabled.

To understand the impact of these conditions, it helps to know how multi-cluster queries are processed. When executing a multi-cluster query, the query is split into two parts:

  • The query up to and including the first aggregate function. This part is executed on the remote clusters. If match() appears in this part, each remote cluster uses its own version of the file.

  • Everything after the first aggregate function. This part is executed on the federating cluster. If match() appears in this part, it uses the version of the file on the federating cluster.

This means that this multi-cluster query:

logscale
match(file="names.csv", field=id, include=[name])

will use the version of names.csv on each of the remote clusters, whereas this query:

logscale
groupBy(id)
| match(file="names.csv", field=id, include=[name])

will use the names.csv present on the federating cluster. This is because groupBy() is an aggregate function, so match() comes after the first aggregate.
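
The same rule applies when match() is followed by an aggregate in the same query. As an illustrative sketch, assuming the same names.csv file and id field as in the examples above, the following query places match() before the first aggregate:

```logscale
match(file="names.csv", field=id, include=[name])
| groupBy(name, function=count())
```

Here both match() and groupBy() fall within the part of the query that runs on the remote clusters, so each remote cluster would be expected to resolve name from its own copy of names.csv before grouping. If the copies differ, events with the same id may be counted under different name values depending on which cluster holds them.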