Azure Service Fabric Log Format

LogScale integrates with Azure Service Fabric. The integration differentiates between

  1. Service Fabric Platform Events that communicate key operational activities happening within a cluster.

  2. Applications running on a Service Fabric cluster such as .NET core applications using Serilog.

Both types are available for searching and analysis in LogScale.

Service Fabric & LogScale

Azure Service Fabric is a distributed systems platform that makes it easy to package, deploy, and manage scalable and reliable micro-services and containers. Developers and administrators can avoid complex infrastructure problems and focus on implementing mission-critical, demanding workloads that are scalable, reliable, and manageable.

LogScale is a log management system, which enables you to log everything and answer anything in real-time. You can send, search, and visualize all logs instantly, easily, and affordably.

Disclaimer: This document is to be considered an example of how you can build out the setup described. It is by no means intended for production usage, without thorough testing. Please refer to documentation of the individual technologies used.

Overview

For the complete solution these steps are involved:

  1. Setting up LogScale.

  2. Creating a running Service Fabric Cluster.

  3. Installing the Cluster Monitor Service.

  4. Creating a parser for the Service Fabric Events ingested by the Cluster Monitor Service. Make sure it's called servicefabric-platform.

  5. Installing Filebeat.

Install or Update Cluster Monitor

The following procedure describes how to install or update the ClusterMonitor service, using the VMSS Custom Scripts Extension.

  1. Update eventFlowConfig.json with the LogScale ingest token to be used and let serviceUri point to your LogScale cluster.

  2. Upload the following files to a location, accessible through https:

  • InstallClusterMonitor.ps1

  • ClusterMonitor.zip

  • eventFlowConfig.1.json

  1. Change the clusterMonitor_artifactsLocation parameter in the ARM template parameter file to that location. (If you are setting up a new cluster, refer to the Cluster ARM instructions.

  2. Change the forceUpdateTag in the extension to a different string, for example, use as version number.

  3. Update the scale set model or re-run SF cluster deployment as an incremental deployment.

Logging when running distributed applications in a Service Fabric cluster

In a Service Fabric setup, there are three layers of logging which need to take place

  • infrastructure (servers)

  • platform (Service Fabric runtime)

  • applications (your distributed applications running in the cluster).

For more information about these concepts, please refer to the service fabric diagnostics overview.

As an example, there are various metrics and logs we want to collect on all three layers, and we can use different collection methods to accomplish this. Typically by installing an agent. In the case of using LogScale as a logging and monitoring tool for a Service Fabric installation, we are using the following components:

Layer Instrument and Emit Collect Query and Analyze
Infrastructure Hardware, Windows, OS, using syslog, ETW, performance counters. Not covered Not covered
Platform Service Fabric cluster and runtime using ETW. ClusterMonitor service, using Microsoft.Diagnostics.EventFlow LogScale
Applications Serilog for .net core outputting to log files. Filebeat LogScale

For the infrastructure layer which we are not covering in this article, we suggest taking a look at Metricbeat and Winlogbeat which also integrate with LogScale. LogScale offers excellent compression of ingested data (10x) and maintains no indices that take up your disk space and slow down your ingest, supporting the philosophy of logging everything.

Platform

The Service Fabric platform writes several structured events for key operational activities happening within your cluster. These range from cluster upgrades to replica placement decisions. Each event that Service Fabric exposes maps to one of the following entities in the cluster: Cluster, Application, Service, Partition, Replica and Container.

See Service Fabric events for full details.

The platform events are collected by an external service that runs on each machine participating in the cluster to the Service Fabric itself. The service utilizes the EventFlow diagnostics library which have support for several Inputs and Outputs of different formats. In our case the relevant Input is ETW (Event Tracing for Windows) since the Service Fabric Platform Events are available as an ETW provider. The Output is ElasticSearch. LogScale has support for the Elasticsearch bulk API for easy integration with existing log shippers and agents.

Details for installing the service can be found in Install or Update Cluster Monitor.

Note

For the sake of this example, we have increased the verbosity of the runtime to capture as much data as possible.

Parsing Service Fabric Platform Events with LogScale

For the ETW log lines produced by Service Fabric and shipped by EventFlow we are going to write a custom parser in LogScale. Parsers are written in LogScale's query language.

Creating a new parser in LogScale means going to the repository used for ingesting data, and selecting ParsersNew Parser. Make sure to call the parser servicefabric-platform. Afterwards the parser should be assigned to the ingest token to be used, which is done under SettingsIngest API Tokens. You can then change the Assigned Parser through a drop-down box.

The ingest token configured here is used together with the Cluster Monitor Service described above.

The Service Fabric log lines contain both structured and unstructured data. Here's an example (formatted for clarity):

javascript
`{
   "keywords" : 1152921504606846977,
   "level" : 4,
   "payload" : {
      "EventName" : "PerfMonitor",
      "ID" : 1284,
      "Message" : "Thread: 12, ActiveCallback: 1, Memory: 18,223,104, Memory Average: 18,223,104/18,051,085 ",
      "activeCallback" : 1,
      "longavg" : 18051085,
      "memory" : 18223104,
      "shortavg" : 18223104,
      "threadCount" : 12
   },
   "providerName" : "Microsoft-ServiceFabric",
   "timestamp" : "2019-02-26T10:40:51.0501913+00:00"
}`

As the log line stands right now it is ill-fit for a human operator. Reading the json data carefully we see that the Message fields are basically a summary of some of the other fields and represents a nice overview. Our parser will instead present this line in LogScale:

logscale
`2019-02-26 11:40:51.050 
| 4 
| Microsoft-ServiceFabric 
| PerfMonitor 
| 1284 
| Thread: 12, ActiveCallback: 1, Memory: 18,223,104, Memory Average: 18,223,104/18,051,085`

In other words, timestamp followed by level, providername, eventname, eventid and the message we identified before. The parser code we end up with is:

logscale
parseJson()
| @timestamp := parseTimestamp(field=timestamp)
| @display := format(format="%s 
| %s 
| %s 
| %s ", field=[providerName, payload.ID, payload.EventName, payload.Message])
| kvParse()

We start out by calling parseJson() which parses the log line as json and makes the json members available as fields on our event. The result is then piped into parsing of the timestamp field which is assigned to a new @timestamp field. LogScale interprets @timestamp as the event time, so it's essential to get right. If we do not want to display the raw log line in LogScale, in this case json, the [@display] (event) field can be set to some formatted string. We finish the parsing by extracting any key value pairs, like foo=bar, from the original log line.

Note

In newer versions of LogScale, @display is no longer supported. We recommend using the column-based event list that allows for selection of specific fields. In the above parser simply leave the line containing @display out.

Applications

For the sample below, we are using Serilog and Filebeat for log shipping.

Serilog

Serilog is a popular logging framework for .NET. Serilog is focused on structured logging. Doing structured logging means thinking about how the logs will be searched and analyzed afterwards, such as instrumenting a program with

logscale
Log.Information("Number of user sessions {@UserSessions}", userSessionCount);;

Should make it easy to aggregate UserSessions in your log management system afterwards. With serilog you can configure where and how your log lines are outputted. The option we are going to choose is writing JSON formatted log lines to files. The above line would read as:

javascript
{"Timestamp":"2019-01-28T09:38:08.3192070+01:00","Level":"Information","MessageTemplate":"Number of user sessions {@UserSessions}","RenderedMessage":"Number of user sessions 115","Properties":{"UserSessions":115}}

With the following configuration in code:

logscale
var log = Path.Combine("log", $"ConsoleExample.log");
Log.Logger = new LoggerConfiguration()
             .Enrich.FromLogContext()
             .Enrich.With(new ServiceFabricEnricher())
             .WriteTo.File(
                 formatter: new JsonFormatter(renderMessage: true),
                 path: log,
                 retainedFileCountLimit: 5,
                 shared: true)
             .CreateLogger();

Note the renderMessage: true part of the configuration. This instructs Serilog to render the message as part of the formatted log output which ends up in our log files. As before we will use the @display feature in LogScale which allows us to display the rendered message instead of the raw json data. This makes it easier for a human to process the log lines. All data is still available though, for searching and inspection in LogScale.

Our ServiceFabricEnricher class enriches our log lines with information from the node which the service instance is running on:

logscale
public class ServiceFabricEnricher : ILogEventEnricher
   {
    private static readonly string instanceId = Guid.NewGuid().ToString();
    private static readonly string processId = Process.GetCurrentProcess().Id.ToString();
    private static readonly string node = Environment.GetEnvironmentVariable("Fabric_NodeName");
    private static readonly string service = Environment.GetEnvironmentVariable("Fabric_ServiceName");
    private static readonly string application = Environment.GetEnvironmentVariable("Fabric_ApplicationName");
    public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory)
    {
        if (!string.IsNullOrWhiteSpace(instanceId))
        {
            logEvent.AddPropertyIfAbsent(new LogEventProperty("InstanceId", new ScalarValue(instanceId)));
        }
        if (!string.IsNullOrWhiteSpace(node))
        {
            logEvent.AddPropertyIfAbsent(new LogEventProperty("Fabric_NodeName", new ScalarValue(node)));
        }
        if (!string.IsNullOrWhiteSpace(service))
        {
            logEvent.AddPropertyIfAbsent(new LogEventProperty("Fabric_ServiceName", new ScalarValue(service)));
        }
        if (!string.IsNullOrWhiteSpace(application))
        {
            logEvent.AddPropertyIfAbsent(new LogEventProperty("Fabric_ApplicationName", new ScalarValue(application)));
        }
        if (!string.IsNullOrWhiteSpace(processId))
        {
            logEvent.AddPropertyIfAbsent(new LogEventProperty("ProcessId", new ScalarValue(processId)));
        }
    }
 }

Writing to a log file instead of shipping the logs in-process is a deliberate decision. The file acts as a persistent buffer between your program and your log management system which enables us to handle retransmissions and long lasting bursts of log lines.

Filebeat

We have our application logs formatted in JSON and written to files on disk, what's still missing is shipping the logs to LogScale. Filebeat is a lightweight, cross platform shipper that is compatible with LogScale. Filebeat uses few resources, is easy to install and handles network problems gracefully.

The following filebeat configuration scrapes all logs from D:\\SvcFab\_App and subfolders named log with files ending in .log. LogScale is compatible with the elastic bulk API so we are using output.elasticsearch. The hosts parameter points to LogScale server. The $INGEST_TOKEN needs to be replaced by a valid Ingest Token. Ingest Tokens are used in LogScale to identify clients, selecting a repository for the incoming logs and selecting which parser should be used.

logscale
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - "D:\\SvcFab\\_App\\**\\log\\*.log"
  encoding: utf-8
output.elasticsearch:
  hosts: ["https://humio.server.com:443/api/v1/ingest/elastic-bulk"]
  username: $INGEST_TOKEN
  compression_level: 5
  bulk_max_size: 200
  worker: 1

We need to get the $INGEST_TOKEN from LogScale. A default ingest token is configured for your log repository in LogScale. Log in to LogScale, go to your log repository, click Settings and then Ingest API Tokens' and either retrieve the default or create a new one. When you send logs and metrics to LogScale for ingestion it needs to be parsed before it is stored in a repository. LogScale has a built-in parser for Serilog that is configured as above. Make sure the Serilog parser is selected for the ingest token used. This is done by selecting the parser in the dropdown 'Assigned Parser' next to the Ingest Token.

If you are experimenting with filebeat from the command line start filebeat with filebeat.exe -e -c filebeat.yml. The -e flag instructs filebeat to log to the console which is useful when experimenting with the configuration.

Query & Analyze

This section contains some examples of using the humio query language to analyze and visualize the data.

Searching Application Logs

Let's take a look at a simple free text search in LogScale

Free Text

Figure 9. Free Text


We have searched for the word sessions, the result is displayed below and a specific log line has been selected with the details shown below the event list.

Let's say we want to plot the average number of user sessions the last hour:

Sum Timechart

Figure 10. Sum Timechart


The query pipes events containing sessions (through the | operator) into the timeChart() query function which uses the avg() function. Consult the docs for more on Query Language Syntax and Query Functions.

Notice the `Properties.UserSessions. This is a structured part of the Serilog log line that allows for easy analysis. For any unstructured part of your log data that isn't turned into a property on your events by the parser, LogScale allows for extracting the data using regular expressions which are then added as one or more fields to the events in question. See the query function regex() for more information.

Often we want to get an overview of exceptions being thrown by our applications:

logscale
Level=Error 
| groupby(field=[Properties.Fabric_ApplicationName, Exception])

The query groups exceptions by application name and exception message. Getting a total count of application exceptions can be achieved by:

logscale
Level=Error 
| count()

LogScale never rejects incoming logs, even logs it for some reason cannot parse. Doing the search:

logscale
@error=* 
| 
groupBy(@error_msg)

Will reveal any events that haven't been properly parsed and group them by reason.

Searching Service Fabric Platform Events

Service fabric node activity can be detected by looking at Node lifecycle events.

The following query visualizes the different node activities happening in the cluster the last 24 hours:

logscale
#type=servicefabric-platform payload.eventName=Node* 
| timechart(series=payload.eventName)
Node Activity

Figure 11. Node Activity


With the following approach we group node activity by the name of the node and collect all activity events into a single field which gives us a nice overview of the different activities related to a node.

logscale
#type=servicefabric-platform payload.eventName=Node* payload.eventName=* 
| groupby(payload.nodeName, function=collect(fields=payload.eventName))
Node Activity 2

Figure 12. Node Activity 2


Digging around in the logs from Service Fabric reveals an interesting event type that we can use to build a few gauges related to node state.

logscale
#type=servicefabric-platform payload.ID=18601 
| tail(1) 
| max(payload.nodeCountsdeactivatedNodes)

The same message also contains information about the number of nodes currently down:

logscale
#type=servicefabric-platform payload.ID=18601 
| tail(1) 
| max(payload.nodeCountsdownNodeCount)

In the drop-down above the search text field we can select the widget type 'Gauge'. We can save to a new or existing dashboard by clicking Save as....

Node Activity 3

Figure 13. Node activity 3


While we can try and infer node health given the node activity service fabric also provides health reports. Health reports are also communicated as health report events.

For a cluster that is having issues health reports can be a significant fraction of the total number of events:

logscale
#type=servicefabric-platform 
| groupby(payload.category)
Health Reports 1

Figure 14. Health Reports 1


Clicking on Health in the above search result reveals amongst others the NodeNewHealthReport. Let's try and get an overview of those:

logscale
#type="servicefabric-platform"
| payload.eventName = NodeNewHealthReport 
| groupby(field=[payload.description, payload.healthState])
Health Reports 2

Figure 15. Health Reports 2


According to the documentation payload.healthState=3 is an Error and needs investigation. So settings up alerts for #type="servicefabric-platform" payload.category=Health payload.healthState=3 is probably a good idea.

Combining the knowledge we have gained from digging into our logs, we can build a Service Fabric cluster dashboard:

Cluster Dashboard

Figure 16. Cluster Dashboard