What You Need to Know About Grafana, Prometheus and Loki - The GRAPL Framework

Sid Roy
Jun 30, 2024

Updated: Jul 15, 2024

This article was written by Sid K. Roy, Frankenstein Tools Technology Analyst


What is the GRAPL framework? GRAPL (which stands for GRAfana, Prometheus and Loki) refers to the open-source observability stack that enterprises are adopting at an increasing pace to meet their major observability requirements. GRAPL consists of two well-known open-source platforms, Grafana and Prometheus, and a newer addition to the stack: Loki.

Enterprises are adopting GRAPL to address both common and complex observability scenarios, from traditional enterprise compute on Linux, Windows and IBM iSeries (i.e., CPU, RAM, disk, system events) to AWS, GCP and Azure native environments, including Kubernetes. Other core infrastructure components, ranging from network to database, are also easily handled by this stack. Basic application performance monitoring (without byte-code instrumentation) is also possible through detailed scraping of response time and error data.

A high-level description of each component of the GRAPL framework is provided below:

Grafana

Grafana is generally known as a data visualization tool, famous for its awesome visualizations and interactive graphs, charts, and tables. Grafana, however, is a lot more. Developed by Grafana Labs, it is an open-source capability with “paid-for” options that provide additional features and expanded support targeted at the enterprise. Grafana allows you to connect to multiple data sources and then “fuse” or integrate them for highly customized, aesthetically pleasing analytics that are easier for the user to consume and interpret.

From an alerting and anomaly detection standpoint, users can define alerts on their information and metrics wherever that data is stored. This makes Grafana a highly flexible and powerful alerting engine, free of the constraint found in expensive ISV alternatives that require data to be imported into the observability system (at a cost) before it can be a candidate for alerting.

Grafana has been a top choice for many years among major organizations as a “front end” to other observability platforms that lack the all-important capabilities around user ergonomics, flexible data visualization and ease of data interpretation. This includes Prometheus, but also heavyweights like Datadog, Dynatrace, Splunk and other platforms.

Prometheus

Created and open-sourced by a team at SoundCloud starting in 2012, Prometheus has penetrated virtually every Fortune 500 organization in some form or fashion. A fully open-source initiative, Prometheus has a very active developer and user community. Unlike Grafana, there is no single company that manages the core binary; the project is governed under the Cloud Native Computing Foundation (Prometheus was the second project added after Kubernetes).

Many practitioners liken Prometheus to the next iteration of Nagios, which was at one time considered the “gold standard” for open-source monitoring and competed against the likes of BMC, IBM, SolarWinds and HP / MicroFocus / OpenText for high-intensity infrastructure monitoring. Prometheus extends that value proposition into the world of hybrid and cloud infrastructure. Backed by a powerful time-series database, Prometheus collects and stores its metrics as time series data, i.e., metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. Prometheus offers one of the most flexible and low-cost approaches for managing a highly complex environment.
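To make the data model concrete, here is a minimal Python sketch that renders a single sample (metric name, labels, value and optional timestamp) as one line of the Prometheus text exposition format. The helper name format_sample and the example metric and label values are assumptions chosen for illustration; in practice an official client library would normally produce this output.

```python
def _escape(value):
    """Escape backslash, double quote and newline inside a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_sample(name, labels, value, timestamp_ms=None):
    """Render one sample as a line of Prometheus text exposition format.

    name         -- metric name, e.g. "node_cpu_seconds_total"
    labels       -- dict of label name -> label value (may be empty)
    value        -- numeric sample value
    timestamp_ms -- optional timestamp in milliseconds since the epoch
    """
    label_text = ""
    if labels:
        # Sorting is not required by the format; it just keeps output stable.
        pairs = ",".join(
            f'{key}="{_escape(val)}"' for key, val in sorted(labels.items())
        )
        label_text = "{" + pairs + "}"
    line = f"{name}{label_text} {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line


# Hypothetical sample: idle CPU seconds on an imaginary host "web-01".
example_line = format_sample(
    "node_cpu_seconds_total",
    {"mode": "idle", "instance": "web-01"},
    1234.5,
    timestamp_ms=1719705600000,
)
print(example_line)
```

Each unique combination of metric name and label values identifies one time series, which is why label choices have a direct effect on how many series Prometheus has to store.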

Loki

Not as famous as its two platform brethren, Grafana Loki (released in 2018) has emerged as a key discussion point in 2023 and 2024 for organizations looking to modernize their approach to enterprise logging, or that are experiencing “Splunk-fatigue” associated with cost, complexity, or both. Inspired by Prometheus, Loki takes an efficient approach: it indexes log metadata as labels for each log stream generated by target systems.

Horizontally scalable, Loki was designed to help organizations grow their logging implementation cost-effectively as their data footprint continues to expand (estimates suggest that enterprises are growing their log data footprint by 20% year over year).

Loki uses an agent-based approach that scrapes logs, applies contextual labels, and pushes the streams to a central server that performs several core functions including log ingestion, long-term storage, query processing and log search. Unlike heavier logging solutions in the market, Loki was designed with high compression and scalability in mind. Loki keeps a compact index that covers only selected metadata (the labels) rather than indexing the full content of every raw log entry.
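As a rough illustration of what an agent sends, the Python sketch below builds the JSON body used by Loki's HTTP push endpoint (/loki/api/v1/push): a list of streams, each carrying a label set and a list of timestamp and log line pairs. The function name build_push_payload and the label values are assumptions for illustration, and the sketch only constructs the payload; it does not send anything.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a Loki push body for one log stream.

    labels -- dict of label name -> value identifying the stream
    lines  -- iterable of raw log lines
    now_ns -- base timestamp in nanoseconds (defaults to the current time)
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Loki expects each entry as [timestamp in ns as a string, log line].
    # Adding the index keeps the timestamps in this batch strictly increasing.
    values = [[str(now_ns + offset), line] for offset, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


# Hypothetical stream: two lines from an imaginary "checkout" service.
payload = build_push_payload(
    {"app": "checkout", "env": "prod"},
    ["order 1001 accepted", "payment gateway timeout"],
    now_ns=1719705600000000000,
)
print(json.dumps(payload, indent=2))
```

Only the small label set is indexed, so keeping labels few and low in cardinality is what keeps the index compact.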

This results in several benefits including faster load and search times, smaller data footprints and reduced operating costs. Loki log stores are queried using LogQL, Loki's query language, which supports efficient log analysis and debugging compared to traditional log management solutions.
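For a sense of what LogQL looks like, this small Python sketch assembles two query strings: a log query that selects a stream by labels and filters for lines containing a substring, and a metric query that counts matching lines per second over a time window. The helper names and the label values are illustrative assumptions, and the helpers do no escaping of special characters.

```python
def logql_selector(labels):
    """Build a LogQL stream selector such as {app="checkout", env="prod"}."""
    pairs = ", ".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return "{" + pairs + "}"


def line_filter_query(labels, needle):
    """Log query: lines from the selected streams that contain `needle`."""
    return f'{logql_selector(labels)} |= "{needle}"'


def match_rate_query(labels, needle, window="5m"):
    """Metric query: matching lines per second, summed across streams."""
    return f'sum(rate({line_filter_query(labels, needle)} [{window}]))'


# Hypothetical labels for an imaginary "checkout" service.
stream_labels = {"app": "checkout", "env": "prod"}
log_query = line_filter_query(stream_labels, "timeout")
rate_query = match_rate_query(stream_labels, "timeout")
print(log_query)
print(rate_query)
```

The label selector narrows the search to a few streams using the index, and the line filter then scans only those streams, which is where much of the speed advantage comes from.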


Business Case for GRAPL

By combining these three capabilities into a single operational framework, organizations are experiencing tremendous value in the following areas:

Powerful Observability Tool Belt. GRAPL provides strong and versatile technical capabilities for complex monitoring scenarios, often with better outcomes than comparable capabilities from major observability vendors that apply archaic, legacy-based solutions to modern-day architectures.

30-70% Reduction in Licensing Spend. The GRAPL framework provides a major reduction in observability spend when compared “apples to apples” against the capabilities offered by major observability ISVs from a software licensing standpoint. The larger your implementation, the greater the savings.

Strong Alignment to OpenTelemetry Initiatives. Most major organizations are embarking upon, or at least piloting, OpenTelemetry initiatives. With open standards for telemetry collection and telemetry data storage (via JSON structures), organizations can liberate themselves from vendor lock-in as well as the “black box” of observability telemetry collection and storage.

Cloud Transformation. GRAPL can handle on-premises, hybrid and cloud-native monitoring with ease, providing organizations a viable and effective avenue for tools consolidation, especially across infrastructure and logging. GRAPL even provides strong capabilities around IoT monitoring, which will be key to establishing foundational practices for areas like drones and robotics, as well as more traditional but challenging use cases like Kubernetes.

Greater Adoption, Consumption and Decisioning of Observability Information. DevOps and SRE teams are leading the charge for GRAPL usage because of its strength in cloud-native environments and the diversity of configuration items the framework covers, including on-premises technologies. As a result, GRAPL implementations see higher levels of usage penetration across technology teams than their more expensive contemporaries.

Implementation Considerations

Standing up a GRAPL implementation is not terribly hard, and with managed options from Grafana Labs, AWS, and Chronosphere, getting started is simply a matter of assigning capable resources or working with a partner. Some key considerations to keep in mind:

For Prometheus, plan for multiple servers and a roadmap to scale accordingly. Sharding and data federation are key concepts for addressing increasing metric volumes and for aggregating data across the estate, and both are fundamental to horizontal scaling.
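The sharding idea for Prometheus can be sketched in a few lines of Python: hash each scrape target and take the result modulo the number of Prometheus servers, so every target is owned by exactly one server. This mirrors the intent of hash-based target assignment but is only an illustration; the function names and target addresses are assumptions, and the hash is not guaranteed to match what Prometheus computes internally.

```python
import hashlib


def shard_for(target, shard_count):
    """Map a target address to a shard index in the range [0, shard_count)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[-8:], "big") % shard_count


def assign_targets(targets, shard_count):
    """Group targets by the shard (Prometheus server) that should scrape them."""
    shards = {index: [] for index in range(shard_count)}
    for target in targets:
        shards[shard_for(target, shard_count)].append(target)
    return shards


# Hypothetical fleet of 12 node exporters split across 3 Prometheus servers.
fleet = [f"host-{number:02d}:9100" for number in range(12)]
assignment = assign_targets(fleet, shard_count=3)
for shard_index, members in assignment.items():
    print(shard_index, members)
```

Because the assignment depends only on the target name and the shard count, every server can compute it independently. A higher-level server can then federate aggregated series from each shard to provide an estate-wide view.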

Like Prometheus, Loki relies on a distributed architecture that scales horizontally by deploying multiple instances. Architecture and planning around Loki's distributor and querier components, which manage ingestion and query loads respectively, is key to keeping Loki responsive, especially during peak periods or under heavy search requirements.

You need to define data retention policies upfront to avoid unmanageable growth, which could impact usability of the platform. Keep storage types in mind as well, to balance query response speed against storage infrastructure costs.
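A back-of-the-envelope calculation helps when setting retention for Prometheus. The Prometheus documentation gives a rule of thumb: needed disk space is roughly retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample, with samples averaging around 1 to 2 bytes each. The sketch below applies that formula; the ingestion rate and retention period are made-up inputs, and real usage should be checked against the server's own metrics.

```python
SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate Prometheus disk usage for a given retention window.

    Applies: retention_seconds * samples_per_second * bytes_per_sample.
    The default of 2 bytes per sample is the conservative end of the
    commonly quoted 1-2 byte range.
    """
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical sizing: 15 days of retention at 100,000 samples per second.
estimate = estimate_disk_bytes(retention_days=15, samples_per_second=100_000)
print(f"{estimate / 1e9:.1f} GB")
```

Doubling either the retention window or the ingestion rate doubles the estimate, which is why retention decisions and label cardinality need to be planned together.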

For more mature and larger sites, deploy an effective log / observability data ingestion pipeline that leverages Loki agents such as Promtail, with consistent configuration of labels and selectors. This addresses uncontrolled data growth, standardizes log / observability data storage, and provides the ability to handle a diverse set of sources at high volume.

To truly have parity with major observability platforms, GRAPL users need to plan for basics like Business Continuity and Disaster Recovery (high availability, replication, redundancy, etc.) as well as infrastructure architecture considerations like load balancing and observability data pipelines. Robust health metrics for the GRAPL platform itself are also important. Leverage Prometheus to monitor itself proactively for issues like scraping failures, high ingestion rates, or storage capacity limits.
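One simple self-monitoring check is to look at the built-in up metric, which Prometheus sets to 1 when a scrape succeeds and 0 when it fails. The Python sketch below parses a response in the shape returned by the Prometheus HTTP query API (/api/v1/query) and lists the targets that are currently down. The sample response, job names and instance addresses are invented for illustration, and no network call is made.

```python
import json

# Hypothetical response body for the instant query `up`.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "node", "instance": "web-01:9100"},
       "value": [1719705600.0, "1"]},
      {"metric": {"__name__": "up", "job": "node", "instance": "db-01:9100"},
       "value": [1719705600.0, "0"]}
    ]
  }
}
"""


def down_targets(response_text):
    """Return sorted (job, instance) pairs whose `up` value is 0."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query did not succeed")
    failing = []
    for sample in body["data"]["result"]:
        # Each sample value is a [timestamp, "value as string"] pair.
        _timestamp, raw_value = sample["value"]
        if float(raw_value) == 0.0:
            labels = sample["metric"]
            failing.append((labels.get("job", ""), labels.get("instance", "")))
    return sorted(failing)


failing_targets = down_targets(SAMPLE_RESPONSE)
print(failing_targets)
```

In a real deployment the same condition would more commonly live in an alerting rule, so that a failed scrape pages someone rather than waiting for a script to run.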

Early Grafana users typically fall into the pit of poor dashboard design and “heavy” dashboards that choke on overly large datasets and create performance bottlenecks. Employ design best practices for Grafana, including its templating and dynamic filtering features, which allow for rapid deployment, standards, reusability, variable management, and scalability across the enterprise.
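To show what templating buys you, here is a minimal Python sketch that generates a dashboard definition with one template variable (instance) and one panel whose query references that variable. The field names follow the general shape of Grafana's dashboard JSON model, but exact fields vary between Grafana versions, so treat this as an assumption-laden outline rather than an importable dashboard. The metric name assumes node_exporter is deployed.

```python
import json


def build_dashboard(title, panel_expr):
    """Return a minimal dashboard dict with one variable and one panel."""
    return {
        "title": title,
        "templating": {
            "list": [
                {
                    # The variable is filled from label values at view time.
                    "name": "instance",
                    "type": "query",
                    "query": "label_values(up, instance)",
                    "multi": True,
                    "includeAll": True,
                }
            ]
        },
        "panels": [
            {
                "type": "timeseries",
                "title": "CPU busy rate for $instance",
                "targets": [{"expr": panel_expr}],
            }
        ],
    }


# Hypothetical PromQL expression filtered by the $instance variable.
cpu_expr = (
    'sum by (instance) '
    '(rate(node_cpu_seconds_total{mode!="idle", instance=~"$instance"}[5m]))'
)
dashboard = build_dashboard("Host overview (sketch)", cpu_expr)
print(json.dumps(dashboard, indent=2))
```

One templated dashboard like this can serve every host, and because the query is filtered by the selected instances, each view pulls only the data it needs instead of loading the whole fleet.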

Unlike major observability platforms that have robust capabilities for authentication and authorization, including user-friendly Role-Based Access Controls, Grafana requires specific effort in this area. This includes strong planning around data security and dashboard access. OAuth and LDAP are readily available to address these challenges, as is audit logging.

Managing the actual infrastructure of these tools, including the fleet of agents, probes and OpenTelemetry components, can be daunting. These tools lack the platform management and administrative sophistication of major ISV observability tools, so realistic budgeting of technical resources to execute this function and build out automation is critical. Without this, most GRAPL efforts will fail.

Parting Thoughts . . .

Grafana and Prometheus have come a long way since their inception, from tier-two observability solutions with limited heavy lifting in production to a well-integrated framework capable of handling complex enterprise use cases. The maturity of Loki advances the framework's ability to solve core logging scenarios.

Several major enterprises which have adopted GRAPL with strong results include the following:

Coinbase: Uses Grafana, Prometheus, and Loki for monitoring their extensive infrastructure, including microservices and cloud environments. Coinbase has experienced improved visibility into system performance and operational metrics, faster incident response times, and enhanced scalability to handle growing transaction volumes.

eBay: Utilizes Grafana, Prometheus, and Loki to monitor their global e-commerce platform, which operates across multiple regions and handles millions of transactions daily. eBay has seen increased operational efficiency through centralized monitoring and alerting, proactive detection of performance issues, and optimized resource allocation.

Red Hat: Well known for its open-source software solutions, Red Hat uses Grafana, Prometheus, and Loki for monitoring their cloud-native applications and infrastructure across hybrid cloud environments. Red Hat and customers leveraging their frameworks have enjoyed enhanced observability across their Kubernetes deployments, improved troubleshooting capabilities with centralized logging (Loki), and better insights into application performance metrics.

Cisco: A global leader in networking, observability and IT solutions, Cisco employs Grafana, Prometheus, and Loki for monitoring and managing their extensive network infrastructure and cloud services. Cisco has seen streamlined monitoring operations, proactive identification of network performance issues, and improved capacity planning through detailed metrics and logging analysis.

Sony Interactive Entertainment: The unit responsible for PlayStation products and services leverages Grafana, Prometheus, and Loki for monitoring their gaming platforms and online services. Despite an ever-growing, complex environment, Sony has increased the reliability of online gaming services through proactive monitoring and alerting, optimized the performance of backend systems, and improved user experience metrics.

While there can be financial savings from transitioning away from an expensive ISV-based observability stack, investment in capable technical resources is a priority of the first order. Expertise in all three tools, along with skills in data analysis, systems architecture, log streams / pipelines and system performance, is needed to compensate for the mature technical features offered out of the box by major platforms like Dynatrace, Datadog or Splunk.