Pre-aggregated vs log-based metrics (Prometheus vs Splunk)
Comparing aggregated metrics solutions like Prometheus and log-based metrics solutions like Splunk
Software observability is the ability to measure a system’s current state based on the data it generates, such as logs, metrics, and traces[1]. This post focuses on the two main types of metrics systems.
Broadly, there are two ways in which metrics can be collected and monitored.
Pre-aggregated metrics
In this approach, metrics are explicitly instrumented in the code. The metrics are collected and/or aggregated into a time-series database. Examples of these are Datadog[2] and Prometheus.
Because the database is purpose-built for time-series data, storage and querying are efficient and cheap. These solutions also ship with rich statistical functions (rates, percentiles, aggregations), which makes them easier to query.
The challenge is that you must know in advance, and explicitly instrument, every metric your monitors will need.
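To make the idea concrete, here is a minimal sketch of pre-aggregation in pure Python: each request increments an in-memory counter keyed by metric name and time bucket, and only the aggregated buckets would be shipped to the time-series database. The function names and bucket size are illustrative, not the API of any real metrics client.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # aggregate into one-minute buckets

# (metric_name, bucket_start_time) -> count
counters = defaultdict(int)

def record_request(status_code, timestamp):
    """Explicitly instrumented metric: increment pre-aggregated counters."""
    bucket = int(timestamp // BUCKET_SECONDS) * BUCKET_SECONDS
    counters[("request_count", bucket)] += 1
    if 500 <= status_code < 600:
        counters[("5xx_count", bucket)] += 1

# Simulate three requests in the same minute, one of them a 500.
record_request(200, 120.0)
record_request(200, 130.0)
record_request(500, 150.0)

print(counters[("request_count", 120)])  # 3
print(counters[("5xx_count", 120)])      # 1
```

Note that the raw per-request data is discarded; only the cheap aggregates survive, which is exactly why these systems are efficient and why unanticipated questions can't be answered after the fact.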
Log-based Metrics
In this approach, the system is designed to store logs and provides a query language to analyze and visualize them. Examples include Splunk and AWS CloudWatch Logs Insights (along with CloudWatch metric filters).
Log-based monitors can help capture unknown metrics that were not explicitly instrumented. Also, since logs are fundamental to any observability setup, log-based monitors are easier to get started with.
However, there are challenges: queries can be complicated and brittle, and a seemingly harmless change in a log statement can break a monitor. Performance may also suffer, because the database is not designed to compute metrics and there is no pre-aggregation.
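The brittleness is easy to see in a sketch. Below, a metric is derived from raw log lines with a regex, much like a Splunk search or CloudWatch metric filter would; the log format and field names are hypothetical.

```python
import re

LOG_LINES = [
    "2024-05-01T10:00:00 INFO request handled status=200 latency_ms=12",
    "2024-05-01T10:00:01 ERROR request failed status=500 latency_ms=87",
    "2024-05-01T10:00:02 INFO request handled status=200 latency_ms=9",
]

# The pattern is coupled to the exact log wording: renaming
# "latency_ms" or reordering the fields silently breaks this monitor.
PATTERN = re.compile(r"status=(\d{3}) latency_ms=(\d+)")

statuses, latencies = [], []
for line in LOG_LINES:
    m = PATTERN.search(line)
    if m:
        statuses.append(int(m.group(1)))
        latencies.append(int(m.group(2)))

error_count = sum(1 for s in statuses if s >= 500)
print(error_count)     # 1
print(max(latencies))  # 87
```

The upside is that every field in the log is queryable after the fact; the downside is that every query re-scans and re-parses the raw lines.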
How to use these systems?
If you have a fairly small use case, log-based monitors can be a great place to start.
However, as you scale, it makes sense to incorporate both solutions into your metrics monitoring.
Recommended strategy
Use a pre-aggregated system to monitor core metrics like request counts, 5XX counts, percentile latency, and any other useful custom metrics.
Use log-based metric monitors as a catchall for anything unexpected in the logs, e.g., the number of WARN logs, ERROR logs, or other generic suspicious patterns.
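A catchall of this kind can be sketched as counting log lines by severity, without knowing in advance which specific metrics will matter. The severity keywords here are assumptions about the log format.

```python
import re
from collections import Counter

# Match generic severity keywords rather than any specific event.
LEVEL = re.compile(r"\b(WARN|ERROR)\b")

logs = [
    "INFO starting worker",
    "WARN retrying connection",
    "ERROR upstream timeout",
    "WARN retrying connection",
]

counts = Counter(m.group(1) for line in logs if (m := LEVEL.search(line)))
print(counts["WARN"], counts["ERROR"])  # 2 1
```

An alert on either count rising would surface problems that no explicitly instrumented metric anticipated.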
This strategy achieves broad monitoring coverage without sacrificing efficiency or cost.
Hope this helps you develop a framework for thinking about different metrics systems.
References
[1] https://www.dynatrace.com/news/blog/what-is-observability-2/
[2] https://www.infoq.com/presentations/datadog-metrics-db/