DOKK Library

The Open Source Guide to DevOps Monitoring Tools

Authors Dan Barker

License CC-BY-SA-4.0

ABOUT OPENSOURCE.COM

What is Opensource.com?

Opensource.com publishes stories about creating, adopting, and sharing open
source solutions. Visit Opensource.com to learn more about how the open source
way is improving technologies, education, business, government, health, law,
entertainment, humanitarian efforts, and more.

Submit a story idea: https://opensource.com/story

Email us: open@opensource.com

Chat with us in Freenode IRC: #opensource.com

    THE OPEN SOURCE GUIDE TO DEVOPS MONITORING TOOLS                                . CC BY-SA 4.0 . OPENSOURCE.COM                                  3
ABOUT THE AUTHOR

DAN BARKER

Dan spent 12 years in the military as a fighter jet mechanic before
transitioning to a career in technology as a Software Engineer. He's now the
Chief Architect at the National Association of Insurance Commissioners (NAIC).
He's leading technical and cultural transformations for the NAIC, a nonprofit
organization focused on consumer protection in the insurance industry. He's an
active participant in the CNCF's Serverless Working Group and CloudEvents
project. Dan is also an organizer of the DevOps KC Meetup and the DevOpsDays KC
conference.

CONTACT DAN

Website: http://danbarker.codes
Email: dan@danbarker.codes
Twitter: https://twitter.com/@barkerd427

CONTENTS

INTRODUCTION

    A tale of two views

CHAPTERS

    4 open source monitoring tools
    3 open source log aggregation tools
    5 alerting and visualization tools
    3 open source distributed tracing tools

GET INVOLVED | ADDITIONAL RESOURCES

    Get involved | Additional Resources
    Write for Us | Keep in Touch

A tale of two views

Once upon a time, I was troubleshooting some vexing problems in an application
that needed to be scaled several orders of magnitude with only a couple of
weeks to re-architect it. We had no log aggregation, no metrics aggregation, no
distributed tracing, and no visualization. Most of our work had to be done on
the actual production nodes using tools like strace and grepping through logs.
These are great tools, but they don’t make it easy to analyze a distributed
system across dozens of hosts. We got the job done, but it was painful and
involved a lot more guessing and risk than I’d prefer.

At a different job, I was helping to troubleshoot an app in production that was
suffering from an out of memory (OOM) issue. The problem was inconsistent, as
it didn’t seem to correlate with running time, load, time of day, or any other
aspect that would provide some predictability. This was obviously going to be a
difficult problem to diagnose on a system that spanned hundreds of hosts with
many applications calling it. Luckily, we had log aggregation, distributed
tracing, metrics aggregation, and a plethora of visualizations. We looked at
our memory graph and saw a distinct spike in memory usage, so we set an alert
on that spike so we could diagnose the issue in real time when it occurred.

When we received an alert, we went to our log aggregation system to correlate
the logs to the memory spike. We found the OOM error and the related calls
around it. We now understood what application was calling the service that
resulted in the spike and used that information to find the exact transaction
that caused the issue. We determined that someone had stored a huge file in a
database that our service was now trying to load, but the service was running
out of memory before it could fully load and process the record. We should have
been defending against this in the first place, but we were happy to find it so
quickly and fix it with very little effort. Once we understood the error, we
discovered a lot of records had large files like this, and we didn’t need that
part of the record to function properly.

You might think the second situation happened a long time after the first and
we had improved over time. Or maybe you suspect that when I changed jobs, my
new company had better tooling. In reality, the second situation happened
before the first. I moved from a company with fairly advanced observability
tools to one with no observability tools. It was strikingly disturbing as the
developer to have an application in production and know nothing about it. I
learned a lot about the importance of system observability and the related
tools as I began rebuilding that infrastructure. Also, Mike Julian’s Practical
Monitoring [1] is a must-read for those who want to know more about their
systems.

Observability principles
So, what are observability tools? Actually, what is observability?

Observability isn’t just a marketing term; it’s a component of control theory
[2]. If you want to get a quick primer, this video [3] might be helpful.
Basically, observability means that you can estimate a particular state of a
system based on an output. More generally, a system’s state should be
determinable from its outputs. Controllability, the mathematical dual of
observability [4], requires that a system’s state be determined by the inputs
to the system.

This is a fairly simple concept, but it’s very challenging to put into
practice. In a sufficiently complex system, it may be nearly impossible to
implement full observability. However, you should strive to get the right
outputs that allow you to determine the system’s state, especially when you
encounter a failure mode.

Observability tool types
Over the next few chapters we’ll dig into different types of observability
tools. For each type, we’ll cover what they’re used for, what specific tools
are available, some use cases, and any unique characteristics that may come up
during your search for a new tool. These are presented in the order you should
implement them. Metrics aggregation is first, as it’s often easy to instrument
an application built with any
modern language. Second is logging because it will require more application
modifications but provides tremendous value. Third is alerting and
visualizations, which require the first two types for full functionality. And
last is distributed tracing, as it may not be necessary in a simple monolith
and is much harder to implement fully.

Metrics aggregation
This type of tool generally deals with time-series data. Time-series data is
time-ordered data, and it is normally collected with an internally consistent
interval. This consistency allows for some advanced calculations to be applied
to the series and provides for predictive analytics using simple regressions or
more advanced algorithms.

Log aggregation
These tools deal with data types that are related more to events than to a
series of consistent data points. This output is often emitted as a system
enters some undesired state. Some systems output a lot of logs that don’t fit
this condition. We’ll cover more of the do’s and don’ts of logging in a later
chapter.

Alerting/visualizations
This may not appear to fit with the other types listed, as it’s really
subsequent to the others, but it provides a consumable output for the other
types and can produce its own outputs. These types of tools generally make the
system more understandable to humans. They also help create a more interactive
system through both proactive and reactive notifications about negative system
states.

Distributed tracing
Much like tracing within a single application, distributed tracing allows you
to follow a single transaction through an entire system. This allows you to
home in on specific transactions that might be experiencing problems. Due to
performance concerns, a sampling algorithm is often applied.

Common DevOps features
There are several aspects you should look for in any type of observability
tool. We’ll cover these generally now and will bring them back up in later
chapters.

OpenAPI
This specification was previously called Swagger but was renamed when it was
adopted by the OpenAPI Initiative within the Linux Foundation. The OpenAPI
Specification is a language-agnostic specification from which tooling can
automatically generate documentation of methods, parameters, and models. This
is commonly used to generate RESTful interfaces in HTTP, but it is also
protocol-agnostic. A user can create a client in almost any language if one
doesn’t already exist. Every tool should have this type of API (or should be
getting it soon). If your tool doesn’t have it yet, you may want to look
elsewhere. Tools that haven’t implemented this specification or don’t have it
on their roadmap likely have other deficiencies in adopting open, modern
standards and code.

Open source
There are a lot of good tools in this space that aren’t open source but may be
the right fit for your company. If you pick one of those tools, make sure its
documentation and accessory tooling are open source. Open source observability
tools can provide valuable insights into how your other observability tools are
functioning (or maybe not functioning). They also offer all the other benefits
of any open source project, which you can read more about on opensource.com
[5].

Open standards
Regardless of whether or not a tool is open source, it should always use open
standards when possible. We’ve already discussed one of these, OpenAPI, but
there are many more. We’ll discuss these standards in the appropriate sections
to ensure you know they exist and where they’re used.

Wide dissemination
Part of observability and openness is allowing everyone to view data. The tools
you pick should be open by default. You may want to restrict some areas, but
you’ll want to default to open and limit access only if it’s absolutely
required. You never know who in your company might want to solve your problem
or who you’ll need to bring in to help solve a problem. The last thing you’ll
want is access barriers when troubleshooting your income source.

Federated model (preferred)
This is similar to defaulting to open, but it allows everyone to provide input
and control their own areas more locally. Many legacy systems are architected
in a way that requires all data to flow through a central system regardless of
need. This also centralizes control around that data. A federated system allows
for local aggregation, processing, and control while allowing a central
organization to collect the same data or summarized data. The central system
likely only wants a subset of the data stored at the local level. This model
increases agility, flexibility, and usability.

In the following chapters, we’ll be exploring each of the observability tool
types in more detail. We’ll also help you choose the right tool for your use
case.

Links
[1] https://www.practicalmonitoring.com/
[2] https://en.wikipedia.org/wiki/Control_theory
[3] https://www.youtube.com/watch?v=iRZmJBcg1ZA
[4] https://en.m.wikipedia.org/wiki/Duality_(mathematics)
[5] https://opensource.com/
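
The "Metrics aggregation" section in this chapter notes that a consistent
collection interval makes simple regressions possible. As a rough illustration
only, the sketch below fits a straight line to evenly spaced samples and
estimates when the line would reach a threshold. The function name, the
60-second interval, and the disk-usage readings are all hypothetical; real
metrics systems generally provide their own prediction functions.

```python
from statistics import mean


def forecast_crossing(samples, interval_s, threshold):
    """Fit a least-squares line to evenly spaced samples.

    Returns the number of seconds after the first sample at which the
    fitted line reaches `threshold`, or None if the series is not rising.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    # A consistent interval lets us derive timestamps from positions alone.
    times = [i * interval_s for i in range(len(samples))]
    mean_t, mean_v = mean(times), mean(samples)
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, samples))
    variance = sum((t - mean_t) ** 2 for t in times)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    if slope <= 0:
        return None  # flat or falling: the threshold is never reached
    return (threshold - intercept) / slope


# Hypothetical disk-usage readings (percent), one every 60 seconds.
disk_used = [50.0, 52.0, 54.0, 56.0, 58.0]
eta = forecast_crossing(disk_used, interval_s=60, threshold=90.0)
print(eta)  # roughly 1200 seconds after the first sample
```

A straight-line fit is the simplest possible model; it is shown here only
because it relies directly on the evenly spaced samples described above.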


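
The control-theory definition mentioned under "Observability principles" can be
written down for the textbook case of a linear time-invariant system. This is
the standard formulation from control theory rather than anything specific to
this guide, and the symbols (A, B, C, x, u, y, n) are introduced here purely
for illustration.

```latex
% Linear time-invariant system: state x in R^n, input u, output y
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)

% Observable: the state can be reconstructed from the outputs
\operatorname{rank}
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix} = n

% Controllable: the state can be steered by the inputs
\operatorname{rank}
\begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix} = n

% Duality: (A, C) is observable exactly when (A^T, C^T) is controllable
```

Real software systems are rarely linear, so this is best read as the origin of
the term rather than a test you would run against production.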

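
The "Distributed tracing" section in this chapter mentions that a sampling
algorithm is often applied for performance reasons. One common approach,
sketched here under assumptions, is to hash the trace ID so that every service
handling the same transaction reaches the same keep-or-drop decision without
coordinating. The function name and the 10% rate are hypothetical, and this is
not the algorithm of any particular tracing tool.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` (0.0 to 1.0) of all traces."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical trace IDs; every service computes the same answer for each one.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(should_sample(tid, rate=0.10) for tid in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, a sampled transaction is
recorded end to end rather than in fragments.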
4 open source monitoring tools

Isn’t monitoring just monitoring? Doesn’t it include logging, visualization,
and time-series data? The terminology around monitoring has caused a lot of
confusion over the years and has led to some poor tools that tout the ability
to do everything in one format. Observability proponents recognize there are
many levels for observing a system. Metrics aggregation is primarily
time-series data, and that’s what we’ll discuss in this chapter.

Features of time-series data

Counters
A counter is a metric that represents a numeric value that will only increase.
(In other words, a counter should never decrease.) Counters accumulate values
and present the current total when requested. These are commonly used for
things like the total number of web requests, number of errors, number of
visitors, etc. This is analogous to the person with a counter device standing
at the entrance to an event counting all the people entering. There is
generally no option to decrement the counter without resetting it.

Gauges
A gauge is similar to a counter in that it represents a single numeric value,
but it can also decrease. It is essentially a representation of some value at a
point in time. A thermometer is a good example of a gauge. It moves up and down
with the temperature and offers a point-in-time reading. Other uses include CPU
usage, memory usage, network usage, and number of threads.

Quantiles
Quantiles aren’t a type of metric, but they’re germane to the next two
sections, histograms and summaries. Let’s clarify our understanding of
quantiles with an example. A percentile is a type of quantile. Percentiles are
something we see regularly, and they should help us understand the general
concept more easily. Percentiles divide a set of values into 100 “buckets.” We
often see them related to testing or performance and generally stated as
someone scoring within the 85th percentile or some other value. This means the
person scoring within that percentile had a real value that fell within the
bucket between the 85th and 86th percentile. This person also scored in the top
15% of all students. We don’t know the scores in the bucket based off this
metric, but the average score in the bucket can be derived from the sum of all
scores in the bucket divided by the count of those scores. Quantiles allow us
to understand our data better than using a mean or some other statistical
function that doesn’t take into account outliers and uneven distributions.

Histograms
A histogram is a little more complicated than a counter or a gauge. It is a
sample of observations. It consists of a counter, which counts all the
observations, and what is essentially a gauge that sums the values of the
observations. It uses “buckets” or groupings to segment the values in order to
bound the datasets in a productive way. This is commonly seen with quantiles
related to request service-level agreements (SLAs). Let’s say we want to ensure
95% of our requests are below 500ms. We could use a bucket with an upper bound
of 0.5s to collect all values
that fall under 500ms. We would then be able to determine how many of the total
requests have fallen into that bucket. We can also determine how far we are
from our SLA, but this can be difficult to do (as is explained more in the
Prometheus documentation [1]).

Histograms are aggregate metrics that are accumulated from multiple instances
into a central server. This provides an opportunity to understand the system as
a whole rather than on a node-by-node basis.

Summaries
Summaries are similar to histograms in that they are a sample of observations,
but the quantile calculation occurs on the application server being monitored
(what the Prometheus documentation calls the client side) rather than in the
central metrics system. Also, the estimate of the quantile is more accurate
than in a histogram. A summary also uses a sliding time window, so it serves a
slightly different case than a histogram but is generally used for the same
types of metrics. I normally use a histogram unless I need a very accurate
measure of the quantile.

Push/pull
No chapter can be written about metrics aggregation tools without addressing
the push vs. pull debate. What is it? The debate centers around whether it is
better for your metrics aggregation system to have data pushed to it or to have
your metrics aggregation system reach out and gather the data by scraping an
endpoint. Multiple articles discuss this (like this one [2] and this one [3]).
My perspective is that it mostly doesn’t matter. Additional research is left to
the reader’s discretion.

Tool options
There are many tools available, both open source and commercial. We will focus
on open source tools, but some of these have an open core model with a paid
component.

Some of these tools feature additional components of observability, principally
alerting and visualizations. These will be covered in this section as
additional features and won’t be covered in subsequent chapters.

Prometheus
This is the most well-recognized time-series monitoring solution for
cloud-native applications. It is hosted within the Cloud Native Computing
Foundation (CNCF), but it was created by Matt Proud and Julius Volz and
sponsored by SoundCloud, with external contributors coming in early to help
develop it. Brian Brazil of Robust Perception [4] has built a business of
helping companies adopt Prometheus. He also has an excellent blog [5] on his
website. The Prometheus documentation [6] is extensive and provides a lot of
detail for understanding and using the tool.

Prometheus [7] is a pull-based system that uses local configuration to describe
the endpoints to collect from and the interval desired for collection. Each
endpoint has a client collecting the data and updating that representation upon
each request (or however the client is configured). This data is collected and
saved in a highly efficient storage engine on local disk. The storage system
uses an append-only file per metric. This storage isn’t lossy, which means the
fidelity of data from a year ago is as high as the data you are collecting
today. However, you may not want to keep that much data locally. Fortunately,
there is an option for remote storage for long-term retention and analysis.

Prometheus includes an advanced expression language for selecting and
presenting data called PromQL. This data can be displayed graphically,
tabularly, or used by external systems through a REST API. The expression
language allows a user to create regressions, analyze real-time data, or trend
historical data. Labels are also a great tool for filtering and querying data.
Labels can be associated with each metric name.

Prometheus also offers a federation model, which encourages more localized
control by allowing teams to have their own Prometheis while central teams [8]
can also have their own. The central systems could scrape the same endpoints as
the local Prometheis, but they can also scrape the local Prometheis to get the
aggregated data that the local instances are collecting. This reduces overhead
on the endpoints. This federation model also allows local instances to collect
data from each other.

Prometheus comes with Alertmanager to handle alerts. This system allows for
aggregation of alerts as well as more complex flows to limit when an alert is
sent. Let’s say 10 nodes suddenly go down at the same time a switch goes down.
You probably don’t need to send an alert about the 10 nodes, as everyone who
receives them will likely be unable to do anything until the switch is fixed.
With Alertmanager, it’s possible to send an alert only to the networking team
for the switch and include additional information about other systems that
might be affected. It’s also possible to send an email (rather than a page) to
the systems team so they know those nodes are down and they don’t need to
respond unless the systems don’t come up after the switch is repaired. If that
occurs, then Alertmanager will reactivate those alerts that were suppressed by
the switch alert.

Graphite
Graphite [9] has been around for a long time, and the recent book The Art of
Monitoring [10] covers Graphite in detail. Graphite has become ubiquitous in
the industry, with many large companies using it at scale.

Graphite is a push-based system that receives data from applications by having
the application push the data into Graphite’s Carbon component. Carbon stores
this data in the Whisper database, and that database and Carbon are read by the
Graphite web component that allows a user to graph their data in a browser or
pull it through
an API. A really cool feature is the ability to export these graphs as
images or data files to easily embed them in other applications.
   Whisper is a fixed-size database that provides fast, reliable storage
of numeric data over time. It is a lossy database, which means the
resolution of your metrics will degrade over time. It will provide
high-fidelity metrics for the most recent collections and gradually
reduce that fidelity over time.
   Graphite also uses dot-separated naming, which implies
dimensionality. This dimensionality allows for some creative aggregation
of metrics and relationships between metrics. This enables aggregation
of services across different versions or data centers and (getting more
specific) a single version running in one data center in a specific
Kubernetes cluster. Granular-level comparisons can also be made to
determine if a particular cluster is underperforming.
   Another interesting feature of Graphite is the ability to store
arbitrary events that should be related to time-series metrics. In
particular, application or infrastructure deployments can be added and
tracked within Graphite. This allows the operator or developer
troubleshooting an issue to have more context about what has happened in
the environment related to the anomalous behavior being investigated.
   Graphite also has a substantial list of functions [11] that can be
applied to metrics series. However, it lacks a powerful query language,
which some other tools include. It also has no built-in alerting.

InfluxDB
InfluxDB [12] is a relatively new entrant, newer than Prometheus. It
uses an open core model, which means scaling and clustering cost extra.
InfluxDB is part of the larger TICK stack [13] (of Telegraf, InfluxDB,
Chronograf, and Kapacitor), so we will include all those components’
features in this analysis.
   InfluxDB uses a key-value pair system called tags to add
dimensionality to metrics, similar to Prometheus and Graphite. The
results are similar to what we discussed previously for the other
systems. The metric data can be of type float64, int64, bool, or string,
with nanosecond resolution. This is a broader range than most other
tools in this space. In fact, the TICK stack is more of an
event-aggregation platform than a native time-series
metrics-aggregation system.
   InfluxDB uses a system similar to a log-structured merge tree for
storage. It is called a time-structured merge tree in this context. It
uses a write-ahead log and a collection of read-only data files, which
are similar to Sorted String Tables but have series data rather than
pure log data. These files are sharded per block of time. To learn more,
check out this great resource on the InfluxData website [14].
   The architecture of the TICK stack differs depending on whether it’s
the open source or commercial version. The open source InfluxDB system
is self-contained within a single host, while the commercial version is
inherently distributed. This is true of the other central components as
well. In the open source version, everything runs on a single host. No
data or configuration is stored on external systems, so it is fairly
easy to manage, but it isn’t as robust as the commercial version.
   InfluxDB includes a SQL-like language called InfluxQL for querying
data from the databases. The primary means for querying data is the HTTP
API. The query language doesn’t have as many built-in helper functions
as Prometheus, but those familiar with SQL will likely feel more
comfortable with the language.
   The TICK stack also includes an alerting system. This system can do
some mild aggregation but doesn’t have the full capabilities of
Prometheus’ Alertmanager. It does offer many integrations, though. Also,
to reduce load on InfluxDB, continuous queries can be scheduled to store
results of queries that Kapacitor will pick up for alerting.

OpenTSDB
OpenTSDB [15] is an open source time-series database, as its name
implies. It’s unique in this collection of tools in that it stores its
metrics in HBase, which runs on Hadoop. This means it is inherently
scalable. If you already have a Hadoop cluster, this might be a good
option for metrics you want to store over the long term. If you don’t
have a Hadoop cluster, the operational overhead might be too large a
burden for you to bear. However, OpenTSDB now supports Google’s Bigtable
as a backend, which is a cloud service you don’t have to operate.
   OpenTSDB shares a lot of features with the other systems. It uses a
key-value pairing system it calls tags for identifying metrics and
adding dimensionality. It has a query language, but it is more limited
than Prometheus’ PromQL. It does, however, have several built-in
functions that help with learning and usage. The API is the main entry
point for querying, similar to InfluxDB. This system also stores all
data forever, unless there’s a time-to-live set in HBase, so you don’t
have to worry about fidelity degradation.
   OpenTSDB doesn’t offer an alerting capability, which will make it
harder to integrate with your incident response process. This type of
system might be great for long-term Prometheus data storage and for
performing more historical analytics to reveal systemic issues, rather
than as a tool to quickly identify and respond to acute concerns.

OpenMetrics standard
OpenMetrics [16] is a working group seeking to establish a standard
exposition format for metrics data. It is influenced by Prometheus. If
this initiative is successful, we’ll have an industry-wide abstraction
that would allow us to switch between tools and providers with ease.
Leading companies like Datadog [17] have already started offering tools
that can consume the Prometheus exposition format,
which will be easy to convert to the OpenMetrics standard once it’s
released.
   It’s also important to note that the contributors to this project
include Google and InfluxData (among others). This likely means InfluxDB
will eventually adopt the OpenMetrics standard. This may also mean that
one of the three largest cloud providers will adopt it, if Google’s
involvement is an indicator. Of course, the exposition format is already
being used in the Google-created Kubernetes project [18]. SolarWinds,
Robust Perception, and SpaceNet are also involved.

Links
[1] https://prometheus.io/docs/practices/histograms/
[2] https://thenewstack.io/exploring-prometheus-use-cases-brian-brazil/
[3] https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/
[4] https://www.robustperception.io/
[5] https://www.robustperception.io/blog
[6] https://prometheus.io/docs/
[7] https://prometheus.io/
[8] https://prometheus.io/docs/introduction/faq/#what-is-the-plural-of-prometheus
[9] https://graphiteapp.org/
[10] https://artofmonitoring.com/
[11] http://graphite.readthedocs.io/en/latest/functions.html
[12] https://www.influxdata.com/
[13] https://www.thoughtworks.com/radar/platforms/tick-stack
[14] https://docs.influxdata.com/influxdb/v1.5/concepts/storage_engine/
[15] http://opentsdb.net/
[16] https://github.com/RichiH/OpenMetrics
[17] https://www.datadoghq.com/blog/monitor-prometheus-metrics/
[18] https://opensource.com/resources/what-is-kubernetes
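
To make the naming models discussed in this chapter concrete, the following sketch renders one sample in three text formats: Graphite's plaintext protocol (a dot-separated path), InfluxDB's line protocol (tags plus a field), and the Prometheus exposition format (labels). The metric name, tag names, and values are invented for illustration, and escaping of special characters is deliberately left out, so treat this as a rough comparison rather than a client implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One measurement; every name and value used here is hypothetical."""
    name: str
    value: float
    timestamp_s: int
    tags: dict = field(default_factory=dict)


def to_graphite(s: Sample) -> str:
    # Graphite plaintext protocol: "<dotted.path> <value> <unix seconds>".
    # Dimensions are encoded by position in the dotted path.
    path = ".".join(list(s.tags.values()) + [s.name])
    return f"{path} {s.value} {s.timestamp_s}"


def to_influx(s: Sample) -> str:
    # InfluxDB line protocol: "measurement,tag=v field=value <unix nanoseconds>".
    tags = "".join(f",{k}={v}" for k, v in sorted(s.tags.items()))
    return f"{s.name}{tags} value={s.value} {s.timestamp_s * 1_000_000_000}"


def to_prometheus(s: Sample) -> str:
    # Prometheus text exposition: name{label="v",...} value
    labels = ",".join(f'{k}="{v}"' for k, v in sorted(s.tags.items()))
    return f"{s.name}{{{labels}}} {s.value}"


if __name__ == "__main__":
    sample = Sample("requests", 42.0, 1500000000, {"dc": "dc1", "service": "web"})
    for render in (to_graphite, to_influx, to_prometheus):
        print(render(sample))
```

In the Graphite form the meaning of each path segment depends on its position, while the tag and label forms name each dimension explicitly.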

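Whisper's gradual loss of fidelity, described in the Graphite section of this chapter, can be illustrated with a toy rollup: older points are averaged into coarser time buckets. This is only a sketch of the idea, not Whisper's actual storage algorithm; the function name, bucket size, and sample data are assumptions.

```python
def downsample(points, bucket_s):
    """Average (timestamp, value) points into buckets of bucket_s seconds."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % bucket_s  # align to the start of the bucket
        buckets.setdefault(start, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Six hypothetical 10-second samples rolled up into 30-second averages.
raw = [(0, 1.0), (10, 2.0), (20, 3.0), (30, 10.0), (40, 20.0), (50, 30.0)]
rolled = downsample(raw, 30)
print(rolled)
```

After the rollup, the individual 10-second values can no longer be recovered, which is the trade-off the chapter calls lossy.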



3 OPEN SOURCE LOG AGGREGATION TOOLS .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .




3 open source log aggregation tools

How is metrics aggregation different from log aggregation? Can’t logs
include metrics? Can’t log aggregation systems do the same things as
metrics aggregation systems? These are questions I see a lot. I’ve also
seen vendors pitching their log aggregation system as the solution to
all observability problems. Log aggregation is a valuable tool, but it
isn’t normally a good tool for time-series data.
   A couple of valuable features in a time-series metrics aggregation
system are the regular interval and the storage system customized
specifically for time-series data. The regular interval allows a user to
derive real mathematical results consistently. If a log aggregation
system is collecting metrics at a regular interval, it can potentially
work the same way. However, the storage system isn’t optimized for the
types of queries that are typical in a metrics aggregation system. These
queries will take more resources and time to process using storage
systems found in log aggregation tools.
   So, we know a log aggregation system is likely not suitable for
time-series data, but what is it good for? A log aggregation system is a
great place for collecting event data. These are irregular activities
that are significant. An example might be access logs for a web service.
These are significant because we want to know what is accessing our
systems and when. Another example would be an application error
condition; because it is not a normal operating condition, it might be
valuable during troubleshooting.
   A handful of rules for logging:
   • DO include a timestamp
   • DO format in JSON
   • DON’T log insignificant events
   • DO log all application errors
   • MAYBE log warnings
   • DO turn on logging
   • DO write messages in a human-readable form
   • DON’T log informational data in production
   • DON’T log anything a human can’t read or react to

Cloud costs
When investigating log aggregation tools, the cloud might seem like an
attractive option. However, it can come with significant costs. Logs
represent a lot of data when aggregated across hundreds or thousands of
hosts and applications. The ingestion, storage, and retrieval of that
data are expensive in cloud-based systems.
   As a point of reference from a real system, a collection of around
500 nodes with a few hundred apps results in 200 GB of log data per day.
There’s probably room for improvement in that system, but even reducing
it by half will cost nearly $10,000 per month in many SaaS offerings.
This often includes retention of only 30 days,
which isn’t very long if you want to look at trending data
year-over-year.
   This isn’t to discourage the use of these systems, as they can be
very valuable, especially for smaller organizations. The purpose is to
point out that there could be significant costs, and it can be
discouraging when they are realized. The rest of this chapter will focus
on open source and commercial solutions that are self-hosted.

Tool options

ELK
ELK [1], short for Elasticsearch, Logstash, and Kibana, is the most
popular open source log aggregation tool on the market. It’s used by
Netflix, Facebook, Microsoft, LinkedIn, and Cisco. The three components
are all developed and maintained by Elastic [2]. Elasticsearch [3] is
essentially a NoSQL search engine built on Lucene. Logstash [4] is a log
pipeline system that can ingest data, transform it, and load it into a
store like Elasticsearch. Kibana [5] is a visualization layer on top of
Elasticsearch.
   A few years ago, Beats were introduced. Beats are data collectors.
They simplify the process of shipping data to Logstash. Instead of
needing to understand the proper syntax of each type of log, a user can
install a Beat that will export NGINX logs or Envoy proxy logs properly
so they can be used effectively within Elasticsearch.
   When installing a production-level ELK stack, a few other pieces
might be included, like Kafka [6], Redis [7], and NGINX [8]. Also, it is
common to replace Logstash with Fluentd, which we’ll discuss later. This
system can be complex to operate, which in its early days led to a lot
of problems and complaints. These have largely been fixed, but it’s
still a complex system, so you might not want to try it if you’re a
smaller operation.
   That said, there are services available so you don’t have to worry
about that. Logz.io [9] will run it for you, but its list pricing is a
little steep if you have a lot of data. Of course, you’re probably
smaller and may not have a lot of data. If you can’t afford Logz.io, you
could look at something like AWS Elasticsearch Service (ES) [10]. ES is
a service Amazon Web Services (AWS) offers that makes it very easy to
get Elasticsearch working quickly. It also has tooling to get all AWS
logs into ES using Lambda and S3. This is a much cheaper option, but
there is some management required and there are a few limitations.
   Elastic, the parent company of the stack, offers [11] a more robust
product that uses the open core model, which provides additional options
around analytics tools, security tools, and reporting. It can also be
hosted on Google Cloud Platform or AWS. This might be the best option,
as this combination of tools and hosting platforms offers a cheaper
solution than most SaaS options and still provides a lot of value. This
system could effectively replace, or give you the capabilities of, a
security information and event management (SIEM) system [12].
   The ELK stack also offers great visualization tools through Kibana,
but it lacks an alerting function. Elastic provides alerting
functionality within the paid X-Pack add-on, but there is nothing built
in for the open source system. Yelp has created a solution to this
problem, called ElastAlert [13], and there are probably others. This
additional piece of software is fairly robust, but it increases the
complexity of an already complex system.

Graylog
Graylog [14] has recently risen in popularity, but it got its start when
Lennart Koopmann created it back in 2010. A company was born with the
same name two years later. Despite its increasing use, it still lags far
behind the ELK stack. This also means it has fewer community-developed
features, but it can use the same Beats that the ELK stack uses. Graylog
has gained praise in the Go community with the introduction of the
Graylog Collector Sidecar written in Go [15].
   Graylog uses Elasticsearch, MongoDB [16], and the Graylog Server
under the hood. This makes it as complex to run as the ELK stack and
maybe a little more. However, Graylog comes with alerting built into the
open source version, as well as several other notable features like
streaming, message rewriting, and geolocation.
   The streaming feature allows for data to be routed to specific
Streams in real time while it is being processed. With this feature, a
user can see all database errors in a single Stream and web server
errors in a different Stream. Alerts can even be based on these Streams
as new items are added or when a threshold is exceeded. Latency is
probably one of the biggest issues with log aggregation systems, and
Streams eliminate that issue in Graylog. As soon as the log comes in, it
can be routed to other systems through a Stream without being processed
fully.
   The message rewriting feature uses the open source rules engine
Drools [17]. This allows all incoming messages to be evaluated against a
user-defined rules file enabling a message to be dropped (called
Blacklisting); a field to be added or removed; or the message to be
modified.
   The coolest feature might be Graylog’s geolocation capability, which
supports plotting IP addresses on a map. This is a fairly common feature
and is available in Kibana as well, but it adds a lot of value,
especially if you want to use this as your SIEM system. The geolocation
functionality is provided in the open source version of the system.
   Graylog, the company, charges for support on the open source version
if you want it. It also offers an open core model for its Enterprise
version that offers archiving, audit logging, and additional support.
There aren’t many other options for support or hosting, so you’ll likely
be on your own if you don’t use Graylog (the company).
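
The routing idea behind Streams can be sketched in a few lines. This is not Graylog's API or rule syntax; the stream names, matching rules, and message fields below are hypothetical, chosen only to show messages being sorted into streams as they arrive and an alert being raised when a stream exceeds a threshold.

```python
from collections import defaultdict

# Hypothetical stream rules: stream name -> predicate over a message dict.
STREAM_RULES = {
    "database-errors": lambda m: m.get("source") == "db" and m.get("level") == "error",
    "web-errors": lambda m: m.get("source") == "web" and m.get("level") == "error",
}


def route(messages, threshold):
    """Route each message to every matching stream and list streams over threshold."""
    streams = defaultdict(list)
    for msg in messages:
        for name, matches in STREAM_RULES.items():
            if matches(msg):
                streams[name].append(msg)
    alerts = sorted(name for name, msgs in streams.items() if len(msgs) > threshold)
    return dict(streams), alerts


if __name__ == "__main__":
    incoming = [
        {"source": "db", "level": "error", "message": "connection refused"},
        {"source": "db", "level": "error", "message": "deadlock detected"},
        {"source": "web", "level": "error", "message": "502 from upstream"},
        {"source": "web", "level": "info", "message": "GET /health"},
    ]
    streams, alerts = route(incoming, threshold=1)
    print({name: len(msgs) for name, msgs in streams.items()}, alerts)
```

Because each message is matched as it arrives, a downstream consumer of one stream does not have to wait for unrelated messages to be processed.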



Fluentd
Fluentd [18] was developed at Treasure Data [19], and the CNCF [20] has
adopted it as an Incubating project. It was written in C and Ruby and is
recommended by AWS [21] and Google Cloud [22]. Fluentd has become a
common replacement for Logstash in many installations. It acts as a
local aggregator to collect all node logs and send them off to central
storage systems. It is not a log aggregation system.
   It uses a robust plugin system to provide quick and easy integrations
with different data sources and data outputs. Since there are over 500
plugins available, most of your use cases should be covered. If they
aren’t, this sounds like an opportunity to contribute back to the open
source community.
   Fluentd is a common choice in Kubernetes environments due to its low
memory requirements (just tens of megabytes) and its high throughput. In
an environment like Kubernetes [23], where each pod has a Fluentd
sidecar, memory consumption will increase linearly with each new pod
created. Using Fluentd will drastically reduce your system utilization.
This is becoming a common problem with tools developed in Java that are
intended to run one per node where the memory overhead hasn’t been a
major issue.

Links
[1] https://www.elastic.co/webinars/introduction-elk-stack
[2] https://www.elastic.co/
[3] https://www.elastic.co/products/elasticsearch
[4] https://www.elastic.co/products/logstash
[5] https://www.elastic.co/products/kibana
[6] http://kafka.apache.org/
[7] https://redis.io/
[8] https://www.nginx.com/
[9] https://logz.io/
[10] https://aws.amazon.com/elasticsearch-service/
[11] https://www.elastic.co/cloud
[12] https://en.wikipedia.org/wiki/Security_information_and_event_management
[13] https://github.com/Yelp/elastalert
[14] https://www.graylog.org/
[15] https://opensource.com/tags/go
[16] https://www.mongodb.com/
[17] https://www.drools.org/
[18] https://www.fluentd.org/
[19] https://www.treasuredata.com/
[20] https://www.cncf.io/
[21] https://aws.amazon.com/blogs/aws/all-your-data-fluentd/
[22] https://cloud.google.com/logging/docs/agent/
[23] https://opensource.com/resources/what-is-kubernetes
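
Returning to the logging rules listed at the start of this chapter, the sketch below shows one way to emit JSON log lines with a timestamp using Python's standard logging module, with the level set so informational messages are not written. The formatter class, logger name, and field names are illustrative choices, not a prescribed schema.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with an ISO 8601 UTC timestamp."""

    def format(self, record):
        return json.dumps({
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(stream, level=logging.WARNING):
    logger = logging.getLogger("example.app")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("cache warmed")             # dropped: below the WARNING level
log.error("payment service failed")  # written as a single JSON line
print(buffer.getvalue(), end="")
```

Writing to an in-memory buffer keeps the example self-contained; in practice the handler would more likely point at standard output or a file that a collector such as Fluentd reads.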




. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 ALERTING AND VISUALIZATION TOOLS




5 alerting and visualization tools

Perhaps it’s clear by the name what alerting and visualization tools are
used for, but it might not be clear why they are observability tools or
why they’re separated here. Some systems include the visualization
component in their main product, so why separate it here? Observability
comes from control theory and describes our ability to understand a
system based on its inputs and outputs. This chapter focuses on the
output component of observability.
   Alerting and visualization systems are focused on understanding the
outputs of other systems. This is why they’re grouped together.
Visualization and alerting tools could be described as tools that
provide structured representations of system outputs. Alerts are
basically a synthesized understanding of negative system outputs, and
visualizations are disambiguated structured representations focused on
facilitating user comprehension.
   As already mentioned, some systems come with these tools built in,
and those have been covered in other sections with those tools.

Common types of alerts and visualizations

Alerts
Let’s first cover what alerts are not. Alerts should not be sent if the
human responder can’t do anything about the problem. This includes
alerts that go to multiple individuals with only a few who can respond
or situations where every anomaly in the system triggers an alert. This
leads to alert fatigue and receivers ignoring all alerts within a
specific medium until the system escalates to a medium that isn’t
already saturated.
   For example, if an operator is getting hundreds of emails a day from
the alerting system, that operator is going to ignore all emails from
the alerting system. The operator will only respond to a real incident
when he or she is experiencing the problem, emailed by a customer, or
called by the boss. In this case, alerts have lost their meaning and
usefulness.
   Alerts are not a constant stream of information or a status update.
They are meant to convey a problem from which the system can’t
automatically recover, and they are sent only to the individual most
likely to be able to recover the system. Everything that falls outside
this definition isn’t an alert and is only hurting your employees and
company culture.
   Everyone has a different set of alert types, so I’ll not cover things
like priority levels (P1-P5) or models that use words like
Informational, Warning, and Critical. Instead, I’ll describe the generic
categories emergent in complex systems’ incident response.
   You might have noticed I mentioned an “Informational” alert type
right after I wrote that alerts shouldn’t be informational. Not everyone
agrees, but I don’t consider something an alert if it isn’t sent to
anyone. It is a data point that many systems refer to as an alert. It
represents some event that should be known but not responded to. It is
generally part of the visualization system of the alerting tool and not
an event that triggers actual notifications. Mike Julian covers this and
other aspects of alerting in his book Practical Monitoring [1]. It’s a
must-read for work in this area.
   Non-informational alerts consist of types that can be responded to or
require action. I group these into two categories: internal outage and
external outage. (Most companies have more levels than this for
prioritizing their response efforts.) Degraded system performance is
considered an outage in this model, as it’s usually unknown how bad the
impact is to each user.





          Internal outages are lower priority than external outages, but                               Another feature of a line chart is that you can often stack
       they still need to be responded to quickly. They often include                                them to show relationships. For example, you might want
       internal systems that company employees use or components                                     to look at requests on each server individually, but also in
       of applications that are visible only to company employees.                                   aggregate. This allows you to understand both the overall
          External outages consist of any system outage that                                         system as well as each instance in the same graph.
       would immediately impact a customer. These don’t include a
       system outage that prevents releasing updates to the sys-
       tem. They do include customer-facing application failures,
       database outages, and networking partitions that hurt avail-
       ability or consistency if either can impact a user. They also
       include outages of tools that may not have direct impact on
       users, as the application continues to run, but this trans-
       parent dependency impacts performance. This is common
       when the system uses some external service or data source
       that isn’t necessary for full functionality but may cause de-
       lays as the application performs retries or handles errors                                    Image source: Grafana (© Grafana Labs)
       from this external dependency.
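
As a rough illustration of the two-category model above, the following Python sketch sorts events into informational, internal-outage, and external-outage buckets and notifies only for the last two. The `Event` fields and the category names are assumptions made for this example; they do not come from any of the tools discussed here.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    INFORMATIONAL = "informational"  # recorded as a data point, never sent to anyone
    INTERNAL_OUTAGE = "internal"     # affects only company employees
    EXTERNAL_OUTAGE = "external"     # would immediately impact a customer


@dataclass
class Event:
    name: str
    customer_impact: bool  # hypothetical flag: can a customer feel this?
    actionable: bool       # hypothetical flag: can a responder do anything?


def classify(event: Event) -> Category:
    """Map an event onto the categories used in this chapter."""
    if not event.actionable:
        return Category.INFORMATIONAL
    if event.customer_impact:
        return Category.EXTERNAL_OUTAGE
    return Category.INTERNAL_OUTAGE


def should_notify(event: Event) -> bool:
    """Only events a human can act on should produce a notification."""
    return classify(event) is not Category.INFORMATIONAL
```

In this sketch a degraded external dependency that slows the application down would be created with `customer_impact=True`, matching the text's point that degraded performance counts as an outage.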
                                                                                                     Heatmaps
       Visualizations                                                                                Another common visualization is the heatmap. It is useful
         There are a lot of visualization types, and I won’t cover                                   when looking at histograms. This type of visualization is sim-
       them all here. It’s a fascinating area of research. On the                                    ilar to a bar chart but can show gradients within the bars
       data analytics side of my career, this is a constant struggle                                 representing the different percentiles of the overall metric.
       of learning and applying that knowledge. We need to pro-                                      For example, maybe you’re looking at request latencies, and
       vide simple representations of complex system outputs for                                     you want to quickly understand the overall trend as well as
       the widest dissemination of information. Google Charts [2]                                    the distribution of all requests. A heatmap is great for this,
       and Tableau [3] have a wide selection of visualization types.                                 and it can use color to disambiguate the quantity of each
       We’ll cover the most common visualizations and some in-                                       section with a quick glance. The heatmap below shows the
       novative solutions for quickly understanding systems.                                         higher concentration around the centerline of the graph with
                                                                                                     an easy-to-understand visualization of the distribution verti-
       Line chart                                                                                    cally for each time bucket. We might want to review a couple
       The line chart is probably the most common and ubiqui-                                        of points in time where the distribution gets wide while the
       tous visualization available. It also does a pretty good job                                  others are fairly tight like at 14:00. This distribution might be
       of producing an understanding of a system over time. A line                                   a negative performance indicator.
       chart in a metrics system would have a line for each unique
       metric or some aggregation of metrics. This can get confus-
       ing when there are a lot of metrics in the same dashboard
       (as evidenced below), but most systems can select specific
       metrics to view rather than having all of them visible. Also,
       anomalous behavior is easy to spot if it’s significant enough
       to escape the noise of normal operations. Below we can
       see purple, yellow, and light blue lines that might indicate
       anomalous behavior.
                                                                                                     Image source: Grafana.org (© Grafana Labs)

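The heatmap described in this section is essentially a histogram computed for each time bucket. A minimal Python sketch of that bucketing follows, assuming request latencies arrive as `(timestamp_seconds, latency_ms)` pairs; the function name and the default bucket sizes are illustrative choices, not taken from Grafana.

```python
from collections import Counter


def heatmap_counts(samples, time_bucket_s=60, latency_bucket_ms=50):
    """Count samples per (time bucket, latency bucket) cell.

    samples: iterable of (timestamp_seconds, latency_ms) pairs.
    Returns a Counter keyed by (bucket_start_s, bucket_start_ms); the
    count in each cell is what a heatmap would render as color intensity.
    """
    counts = Counter()
    for ts, latency_ms in samples:
        t = int(ts // time_bucket_s) * time_bucket_s
        lat = int(latency_ms // latency_bucket_ms) * latency_bucket_ms
        counts[(t, lat)] += 1
    return counts
```

A time bucket whose counts are spread over many latency cells corresponds to the "wide" distributions worth reviewing.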
                                                                                                     Gauges
                                                                                                     The last common visualization I’ll cover is used to under-
                                                                                                     stand a single metric quickly. Gauges can be used to repre-
                                                                                                     sent a single metric, like your speedometer represents your
                                                                                                     speed or your gas gauge represents the amount of gas in
                                                                                                     your car. Similar to the gas gauge, most monitoring gaug-
                                                                                                     es clearly indicate what is good and what isn’t. Often (as is
                                                                                                     shown below), good is represented by green, getting worse
                                                                                                     by orange, and “everything is breaking” by red. The middle
       Image source: Stackoverflow.com (Creative Commons BY SA 3.0)                                  row below shows traditional gauges.
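
The green, orange, and red convention for gauges comes down to comparing one value against two thresholds. A small sketch, assuming higher values are worse; the threshold parameters are hypothetical and would be chosen per metric.

```python
def gauge_color(value, warn, crit):
    """Return the gauge color for a single metric value.

    warn and crit are illustrative thresholds with warn < crit.
    """
    if value >= crit:
        return "red"     # "everything is breaking"
    if value >= warn:
        return "orange"  # getting worse
    return "green"       # good
```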





                                                                   able to contribute new and innovative features to make these
                                                                   systems even better.

                                                                   Alerting tools

                                                                   Bosun
                                                                   If you’ve ever done anything with computers and gotten
                                                                   stuck, the help you received was probably thanks to a Stack
Image source: Grafana.org (© Grafana Labs)
                                                                   Exchange system. Stack Exchange runs many different web-
   This image shows more than just traditional gauges,             sites around a crowdsourced question-and-answer model.
though. The other gauges are single stat representations           Stack Overflow [5] is very popular with developers, and Su-
that are similar to the function of the classic gauge. They        per User [6] is popular with operations. However, there are
all use the same color scheme for quickly indicating system        now hundreds of sites ranging from parenting to sci-fi and
health with just a glance. Arguably, the bottom row is prob-       philosophy to bicycles.
ably the best example of a gauge that allows you to glance            Stack Exchange open sourced its alert management sys-
at a dashboard and know that everything is healthy (or not).       tem, Bosun [7], around the same time Prometheus and its
This type of visualization is usually what I put on a top-level    Alertmanager [8] system were released. There were a lot of
dashboard. It offers a full, high-level understanding of system    similarities in the two systems, and that’s a really good thing.
health in seconds.                                                 Like Prometheus, Bosun is written in Golang. Bosun’s scope
                                                                   is more extensive than Prometheus’ as it can interact with
Flame graphs                                                       systems beyond metrics aggregation. It can also ingest data
A less common visualization is the flame graph. It’s not ide-      from log and event aggregation systems. It supports Graph-
al for dashboarding or quickly observing high-level system         ite, InfluxDB, OpenTSDB, and Elasticsearch.
concerns; it’s normally seen when trying to understand a              Bosun’s architecture consists of a single server binary, a
specific application problem. Netflix’s Brendan Gregg intro-       backend like OpenTSDB, Redis, and scollector agents. The
duced them in 2011 [4]. This visualization focuses on CPU          scollector agents [9] automatically detect services on a host
and memory and the associated frames. The X-axis lists the         and report metrics for those processes and other system re-
frames alphabetically, and the Y-axis shows stack depth.           sources. This data is sent to a metrics backend. The Bosun
Each rectangle is a stack frame and includes the function          server binary then queries the backends to determine if any
being called. The wider the rectangle, the more often it appears   alerts need to be fired. Bosun can also be used by tools like
in the sampled stacks. This method is invaluable for diagnosing    Grafana [10] to query the underlying backends through one
system performance at the application level and I urge every-      common interface. Redis is used to store state and metadata
one to give them a try.                                            for Bosun.
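
To make the flame graph layout described above more concrete, here is a minimal Python sketch that folds sampled call stacks into rectangles. It assumes each sample is a tuple of function names with the root frame first; the function names `fold_stacks` and `rectangles` are made up for this example and are not part of Brendan Gregg's tooling.

```python
from collections import Counter


def fold_stacks(samples):
    """Count how many samples pass through each stack prefix."""
    widths = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            widths[tuple(stack[:depth])] += 1
    return widths


def rectangles(widths):
    """Return (depth, frame_name, width) rows.

    Sorting by the full path keeps sibling frames in alphabetical
    order, which is the X-axis ordering; depth is the Y-axis.
    """
    return [(len(path) - 1, path[-1], n) for path, n in sorted(widths.items())]
```

A frame that shows up in more samples gets a larger width, which is why wide rectangles point at the code worth investigating.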
                                                                      A really neat feature of Bosun is that it lets you test your
                                                                   alerts against historical data. This was something I missed in
                                                                   Prometheus several years ago when I had data for an issue
                                                                   I wanted alerts on, but no easy way to test my new alert to
                                                                   make sure it would work. I had to create and insert dummy
                                                                   data to test the alert. That was a very time-consuming pro-
                                                                   cess, and this system alleviates that.
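
The idea of testing an alert against historical data can be sketched independently of any tool. The Python below replays stored `(timestamp, value)` points through a rule and reports when it would have fired. It is not Bosun's expression language or API; the rule, the window size, and the threshold are all assumptions for illustration.

```python
def backtest(rule, history, window=3):
    """Replay history through rule and return the timestamps where it fires.

    history: list of (timestamp, value) pairs in time order.
    rule: callable that takes the last `window` values and returns a bool.
    """
    fired = []
    for end in range(window, len(history) + 1):
        values = [v for _, v in history[end - window:end]]
        if rule(values):
            fired.append(history[end - 1][0])
    return fired


def high_error_rate(values, threshold=0.05):
    """Hypothetical rule: the average error rate over the window is too high."""
    return sum(values) / len(values) > threshold
```

Running `backtest(high_error_rate, history)` over data from a past incident shows whether a new rule would have caught it, without creating and inserting dummy data.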
                                                                      Bosun also has the usual features like showing simple
                                                                   graphs and creating alerts. It has a powerful expression lan-
                                                                   guage for writing alerting rules. However, it only has email
                                                                   and HTTP notification configurations, which means connect-
                                                                   ing to Slack and other tools requires a bit more customiza-
                                                                   tion (which its documentation covers [11]). Similar to Pro-
                                                                   metheus, Bosun can use templates for these notifications,
Image source: Wikimedia.org (Creative Commons BY SA 3.0)           which means they can look as awesome as you want them
                                                                   to. You can use all your HTML and CSS skills to create the
Tool options                                                       baddest email alert anyone has ever seen.
There are several commercial options for alerting, but this is
Opensource.com, so we’re not even gonna mention them!              Cabot
We’ll cover systems that are being used at scale by real           Cabot [12] was created by a company called Arachnys [13].
companies that you can use at no cost. Hopefully, you’ll be        Many may not know who that is or what it does, but you





       have probably felt its impact without knowing it. It has built                                other systems. It supports Graphite, StatsD, InfluxDB, and
       the leading cloud-based solution for fighting financial crimes.                               OpenTSDB as inputs, but it can also forward those metrics
       That sounds pretty cool, right? At a previous company, I was                                  to their respective platforms. This is an interesting concept,
       involved in similar functions around “know your customer”                                     but potentially risky as loads increase on a central service.
       laws [14]. Many companies would see it as very bad press                                      However, if the StatsAgg infrastructure is robust enough,
       to be linked to a terrorist group funneling money through                                     it can still produce alerts even when a backend storage
       their systems. These solutions also help defend against                                       platform has an outage.
       less atrocious offenders like fraudsters who pose a risk to                                      StatsAgg is written in Java and only consists of the main
       the institution, even if less so.                                                             server and UI, which keeps complexity to a minimum. It can
          So why did Arachnys create Cabot? Well, it is kind of                                      send alerts based on regular expression matching and is fo-
       a Christmas present to everyone, as it was a Christmas                                        cused on alerting by service rather than host or instance. Its
       project it built because its developers couldn’t wrap their                                   goal was to fill a void in the open source observability stack,
       heads around Nagios [15]. And really, who can blame                                           and I think it does that quite well.
       them? Cabot was written with Django and Bootstrap, so
       it should be easy for most to contribute to the project. An-                                  Visualization tools
       other interesting factoid is that the name comes from the
       creator’s dog.                                                                                Grafana
          The Cabot architecture is similar to Bosun in that it doesn’t                              Almost everyone knows about Grafana [21] and many have
       collect any data. Instead, it accesses data through the APIs                                  used it. I have been using it for years whenever I need a sim-
       of the tools it is alerting for. Therefore, Cabot uses a pull                                 ple dashboard. The tool I used before was deprecated, and
       (rather than a push) model for alerting. It reaches out into                                  although I was fairly distraught when I saw the deprecation
       each system’s API and retrieves the information it needs to                                   notice, Grafana made that okay. Grafana was gifted to us
       make a decision based on a specific check. Cabot stores the                                   by Torkel Ödegaard. Oddly, Grafana is another project that
       alerting data in a Postgres database and also has a cache                                     was created around Christmas time and released in January
       using Redis.                                                                                  2014. It has come a long way in only a few years. It started
          Cabot natively supports Graphite [16], but it also supports                                life as a Kibana dashboarding system, which Torkel forked
       Jenkins [17], which is rare in this area. Arachnys [18] uses                                  into what became Grafana.
       Jenkins like a centralized cron, but I like this idea of treating                                 Grafana’s sole focus is presenting monitoring data in a
       build failures like outages. Obviously, a build failure isn’t as                              more usable and pleasing way. It can natively gather data
       critical as a production outage, but it could still alert the team                            from Graphite, Elasticsearch, OpenTSDB, Prometheus, and
       and escalate if the failure isn’t resolved. Who actually checks                               InfluxDB. There’s an Enterprise version that uses plugins
       Jenkins every time an email comes in about a build failure?                                   for more data sources, but there’s no reason those other
       Yeah, me too!                                                                                 data source plugins couldn’t be created as open source, as
          Another interesting feature is that Cabot can integrate                                    the Grafana plugin ecosystem already offers many other
       with Google Calendar for on-call rotations. Cabot calls                                       data sources.
       this feature Rota, which is a British term for a roster or                                        What does Grafana do for me? It provides a central lo-
       rotation. This makes a lot of sense, and I wish other sys-                                    cation for understanding my system. It is web-based, so
       tems would take this idea further. Cabot doesn’t support                                      anyone can access the information, although it can be re-
       anything more complex than primary and backup person-                                         stricted using different authentication methods. Grafana
       nel, but there is certainly room for additional features.                                     can provide knowledge at a glance using many different
       The docs say if you want something more advanced, you                                         types of visualizations. However, it has started integrating
       should look at a commercial option.                                                           alerting and other features that aren’t traditionally combined
                                                                                                     with visualizations.
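
A pull-style check with a primary and backup responder, as described for Cabot, might look roughly like the Python sketch below. It is not Cabot's code or API: the `Check` class, the `fetch` callable standing in for a call to a monitored system's API, and the rule that the backup is added when the primary has not acknowledged are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    name: str
    fetch: Callable[[], float]  # hypothetical: pulls the current value from an API
    max_value: float

    def passing(self) -> bool:
        """Pull the value and compare it to the allowed maximum."""
        return self.fetch() <= self.max_value


def failing_checks(checks: List[Check]) -> List[str]:
    """Run every check and return the names of those that fail."""
    return [c.name for c in checks if not c.passing()]


def who_to_alert(failing: List[str], primary: str, backup: str,
                 primary_acknowledged: bool) -> Dict[str, List[str]]:
    """Decide who hears about the failing checks."""
    if not failing:
        return {}
    if primary_acknowledged:
        return {primary: failing}
    return {primary: failing, backup: failing}
```

Passing `fetch` in as a callable keeps the sketch runnable without a network; in practice it would wrap a request to Graphite, Jenkins, or another source.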
       StatsAgg                                                                                          Now you can set alerts visually. That means you can look
       StatsAgg [19]? How did that make the list? Well, it’s not                                     at a graph, maybe even one showing where an alert should
       every day you come across a publishing company that                                           have triggered due to some degradation of the system, click
       has created an alerting platform. I think that deserves                                       on the graph where you want the alert to trigger, and then
       recognition. Pearson [20] isn’t just a publishing company                                     tell Grafana where to send the alert. That’s a pretty powerful
       anymore, though. It has several web presences and a joint                                     addition that won’t necessarily replace an alerting platform,
       venture with O’Reilly Media. However, I still think of the                                    but it can certainly help augment it by providing a different
       company as the people who published my school books                                           perspective on alerting criteria.
       and tests.                                                                                        Grafana has also introduced more collaboration features.
          StatsAgg isn’t just an alerting platform; it’s also a met-                                 Users have been able to share dashboards for a long time,
       rics aggregation platform. And it’s kind of like a proxy for                                  meaning you don’t have to create your own dashboard for





your Kubernetes [22] cluster because there are several already
available, with some maintained by Kubernetes developers and
others by Grafana developers.
   The most significant addition around collaboration is
annotations. Annotations allow a user to add context to part of
a graph. Then other users can use this context to understand
the system better. This is an invaluable tool when a team is in
the middle of an incident and communication and common
understanding are critical. Having all the information right
where you’re already looking makes it much more likely that
knowledge will be shared across the team quickly. It’s also a
nice feature to use during blameless postmortems when the team
is trying to understand how the failure occurred and learn more
about their system.

Vizceral
Netflix created Vizceral [23] to understand its traffic patterns
better when performing a traffic failover. Unlike Grafana, which
is a more general tool, Vizceral serves a very specific
use-case. Netflix no longer uses this tool internally and says
it is no longer actively maintained, but it still updates the
tool periodically. I highlight it here primarily to point out an
interesting visualization mechanism and how it can help solve a
problem. It’s worth running it in a demo environment just to
better grasp the concepts and witness what’s possible with these
systems.

Links
[1] https://www.practicalmonitoring.com/
[2] https://developers.google.com/chart/interactive/docs/gallery
[3] https://libguides.libraries.claremont.edu/c.php?g=474417&p=3286401
[4] http://www.brendangregg.com/flamegraphs.html
[5] https://stackoverflow.com/
[6] https://superuser.com/
[7] http://bosun.org/
[8] https://prometheus.io/docs/alerting/alertmanager/
[9] https://bosun.org/scollector/
[10] https://grafana.com/
[11] https://bosun.org/notifications
[12] https://cabotapp.com/
[13] https://www.arachnys.com/
[14] https://en.wikipedia.org/wiki/Know_your_customer
[15] https://www.nagios.org/
[16] https://graphiteapp.org/
[17] https://jenkins.io/
[18] https://www.arachnys.com/
[19] https://github.com/PearsonEducation/StatsAgg
[20] https://www.pearson.com/us/
[21] https://grafana.com/
[22] https://opensource.com/resources/what-is-kubernetes
[23] https://github.com/Netflix/vizceral




3 OPEN SOURCE DISTRIBUTED TRACING TOOLS .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .




       3 open source distributed tracing tools

       DISTRIBUTED TRACING SYSTEMS enable tracking a request
       through a software system that is distributed across
       multiple applications, services, and databases as well as
       intermediaries like proxies. This allows for a deeper under-
       standing of what is happening within a software system.
       These systems produce graphical representations that show
       how much time the request took on each step and list each
       known step.                                                                           Image by Dan Barker (Creative Commons BY SA 4.0)
          A user reviewing this content can determine where the sys-                            This demo is using Istio’s built-in OpenTracing implemen-
       tem is experiencing latencies or blockages. Instead of testing                        tation, so I can get tracing without even modifying my appli-
       the system like a binary search tree when requests start failing,                     cation. It also uses Jaeger, which is OpenTracing-compatible.
       operators and developers can see exactly where the issues                             So what is OpenTracing? Let’s find out.
       begin. This can also reveal where performance changes might
       be occurring from deployment to deployment. It’s always bet-                          OpenTracing API
       ter to catch regressions automatically by alerting to the anom-                       OpenTracing [3] is a spec that grew out of Zipkin [4] to pro-
       alous behavior rather than having your customers tell you.                            vide cross-platform compatibility. It offers a vendor-neutral
          How does this tracing thing work? Well, each request gets                          API for adding tracing to applications and delivering that
       a special ID that’s usually injected into the headers. This ID                        data into distributed tracing systems. A library written for
       uniquely identifies that transaction. This transaction is nor-                        the OpenTracing spec can be used with any system that is
       mally called a trace. The trace is the overall abstract idea                          OpenTracing compliant. Zipkin, Jaeger, and AppDash are
       of the entire transaction. Each trace is made up of spans.                            examples of open source tools that have adopted the open
       These spans are the actual work being performed, like a ser-                          standard, but even proprietary tools like Datadog and Insta-
       vice call or a database request. Each span also has a unique                          na are adopting it. This is expected to continue as OpenTrac-
       ID. Spans can create subsequent spans called child spans,                             ing reaches ubiquitous status.
       and child spans can have multiple parents.
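
The trace and span relationship described above can be modeled in a few lines of Python. This is a simplified sketch, not the OpenTracing API: the `Span` class, the ID formats, and the header names `x-trace-id` and `x-span-id` are assumptions, and each child here records a single parent even though the text notes a span can have more than one.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    trace_id: str                    # shared by every span in the transaction
    operation: str                   # the work performed, e.g. a database call
    parent_id: Optional[str] = None  # None for the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

    def child(self, operation: str) -> "Span":
        """Create a child span that stays inside the same trace."""
        return Span(self.trace_id, operation, parent_id=self.span_id)


def start_trace(operation: str) -> Span:
    """Start a new trace by creating its root span with a fresh trace ID."""
    return Span(trace_id=uuid.uuid4().hex, operation=operation)


def inject(span: Span, headers: Dict[str, str]) -> Dict[str, str]:
    """Copy the IDs into request headers so the next service can continue the trace."""
    headers["x-trace-id"] = span.trace_id  # hypothetical header name
    headers["x-span-id"] = span.span_id    # hypothetical header name
    return headers
```

A downstream service would read those headers and create its own spans with the same trace ID, which is what lets a presentation layer stitch the spans into one picture.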
          Once a transaction (or trace) has run its course, it can be                        OpenCensus
       searched in a presentation layer. There are several tools in                          Okay, we have OpenTracing, but what is this OpenCensus [5]
       this space that we’ll discuss later, but the picture below is                         thing that keeps popping up in my searches? Is it a compet-
       of Jaeger [1] from my Istio walkthrough [2]. It shows multi-                          ing standard, something completely different, or something
       ple spans of a single trace. The power of this is immediately                         complementary? That answer depends on who you ask. I
       clear as you can better understand the transaction’s story at                         will do my best to explain the difference (as I understand it).
       a glance.                                                                                OpenCensus is a more holistic or all-inclusive approach.
                                                                                             OpenTracing is focused on establishing an open API and spec
                                                                                             and not on open implementations for each language and trac-
                                                                                             ing system. OpenCensus provides not only the specification
                                                                                             but also the language implementations and wire protocol. It
                                                                                             also goes beyond tracing by including additional metrics that
                                                                                             are normally outside the scope of distributed tracing systems.
                                                                                                OpenCensus allows viewing data on the host where the
                                                                                             application is running, but it also has a pluggable exporter
                                                                                             system for exporting data to central aggregators. The current
                                                                                             exporters produced by the OpenCensus team are Zipkin,
                                                                                             Prometheus, Jaeger, Stackdriver, Datadog, and SignalFx,
                                                                                             but anyone can create an exporter.





   From my perspective, there’s a lot of overlap. One isn’t
necessarily better than the other, but it’s important to know
what each does and doesn’t do. OpenTracing is primarily a
spec with others doing the implementation and opinionation.
OpenCensus provides a holistic approach for the local
component with more opinionation but still requires other
systems for remote aggregation.

Tool options

Zipkin
Zipkin was one of the first systems of this kind. It was developed
by Twitter based on the Google Dapper paper [6] about the
internal system Google uses. Zipkin was written using Java, and
it can use Cassandra or Elasticsearch as scalable backends.
Most companies should be satisfied with one of those options.
The lowest supported Java version is Java 6. It also uses the
Thrift [7] binary communication protocol, which is popular in the
Twitter stack and is hosted as an Apache project.
   The system consists of reporters (clients), collectors, a
query service, and a web UI. Zipkin is meant to be safe in
production by transmitting only a trace ID within the context
of a transaction to inform receivers that a trace is in process.
The data collected in each reporter is then transmitted
asynchronously to the collectors. The collectors store these
spans in the database, and the web UI presents this data to
the end user in a consumable format. Data can be delivered
to the collectors by three different methods: HTTP, Kafka,
and Scribe.
   The Zipkin community [8] has also created Brave [9], a
Java client implementation compatible with Zipkin. It has no
dependencies, so it won’t drag your projects down or clutter
them with libraries that are incompatible with your corporate
standards. There are many other implementations, and Zipkin
is compatible with the OpenTracing standard, so these
implementations should also work with other distributed tracing
systems. The popular Spring framework has a component called
Spring Cloud Sleuth [10] that is compatible with Zipkin.

Jaeger
Jaeger [11] is a newer project from Uber Technologies that
the CNCF [12] has since adopted as an Incubating project.
It is written in Golang, so you don’t have to worry about having
dependencies installed on the host or any overhead of
interpreters or language virtual machines. Similar to Zipkin,
Jaeger also supports Cassandra and Elasticsearch as scalable
storage backends. Jaeger is also fully compatible with
the OpenTracing standard.
   Jaeger’s architecture is similar to Zipkin’s, with clients
(reporters), collectors, a query service, and a web UI, but it also
has an agent on each host that locally aggregates the data. The
agent receives data over a UDP connection, which it batches
and sends to a collector. The collector receives that data in the
form of the Thrift [13] protocol and stores that data in Cassandra
or Elasticsearch. The query service can access the data store
directly and provide that information to the web UI.
   By default, a user won’t get all the traces from the Jaeger
clients. The system samples 0.1% (1 in 1,000) of traces that
pass through each client. Keeping and transmitting all traces
would be a bit overwhelming to most systems. However, this
can be increased or decreased by configuring the agents,
which the client consults for its configuration. This sampling
isn’t completely random, though, and it’s getting better.
Jaeger uses probabilistic sampling, which tries to make an
educated guess at whether a new trace should be sampled
or not. Adaptive sampling, which is on its roadmap [14], will
improve the sampling algorithm by adding context for making
decisions.

AppDash
AppDash [15] is a distributed tracing system written in Golang,
like Jaeger. It was created by Sourcegraph [16] based on
Google’s Dapper and Twitter’s Zipkin. Similar to Jaeger and
Zipkin, AppDash supports the OpenTracing standard; this was
a later addition and requires a component that is different from
the default component. This adds risk and complexity.
   At a high level, AppDash’s architecture consists mostly of
three components: a client, a local collector, and a remote
collector. There’s not a lot of documentation, so this
description comes from testing the system and reviewing the code.
The client in AppDash gets added to your code. AppDash
provides Python, Golang, and Ruby implementations, but
OpenTracing libraries can be used with AppDash’s OpenTracing
implementation. The client collects the spans and
sends them to the local collector. The local collector then
sends the data to a centralized AppDash server running its
own local collector, which is the remote collector for all other
nodes in the system.

Links
[1] https://www.jaegertracing.io/
[2] https://www.youtube.com/watch?v=T8BbeqZ0Rls
[3] http://opentracing.io/
[4] https://zipkin.io/
[5] https://opencensus.io/
[6] https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf
[7] https://thrift.apache.org/
[8] https://zipkin.io/pages/community.html
[9] https://github.com/openzipkin/brave
[10] https://cloud.spring.io/spring-cloud-sleuth/
[11] https://www.jaegertracing.io/
[12] https://www.cncf.io/
[13] https://en.wikipedia.org/wiki/Apache_Thrift
[14] https://www.jaegertracing.io/docs/roadmap/#adaptive-sampling
[15] https://github.com/sourcegraph/appdash
[16] https://about.sourcegraph.com/
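
The 0.1% sampling rate described in the Jaeger section can be illustrated with a short Python sketch. This is one common way to build a fixed-rate probabilistic sampler and is not taken from Jaeger’s code. The names make_sampler and is_sampled, and the 64-bit trace ID width, are assumptions made for the example.

```python
import random

MAX_TRACE_ID = 2 ** 64  # assumption: trace IDs are random 64-bit integers


def make_sampler(rate: float):
    """Return a function that decides whether a trace ID should be sampled."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    boundary = int(rate * MAX_TRACE_ID)

    def is_sampled(trace_id: int) -> bool:
        # IDs are uniformly random, so roughly `rate` of them fall
        # below the boundary.
        return trace_id < boundary

    return is_sampled


sampler = make_sampler(0.001)          # 0.1%, i.e. about 1 in 1,000
rng = random.Random(42)                # seeded so the run is repeatable
trace_ids = [rng.getrandbits(64) for _ in range(100_000)]
kept = sum(sampler(t) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Deriving the decision from the trace ID, rather than drawing a fresh random number, means that in this sketch every service that sees the same ID reaches the same decision without any coordination.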





GET INVOLVED

                 If you find these articles useful, get involved! Your feedback helps improve the status
                 quo for all things DevOps.
                 Contribute to the Opensource.com DevOps resource collection, and join the team of
                 DevOps practitioners and enthusiasts who want to share the open source stories
                 happening in the world of IT.
                 The Open Source DevOps team is looking for writers, curators, and others who can help
                 us explore the intersection of open source and DevOps. We’re especially interested in
                 stories on the following topics:

                    • DevOps practical how-tos
                    • DevOps and open source
                    • DevOps and talent
                    • DevOps and culture
                    • DevSecOps/rugged software

                 Learn more about the Opensource.com DevOps team: https://opensource.com/devops-team




ADDITIONAL RESOURCES

                 The ultimate DevOps hiring guide
                 This free download provides advice, tactics, and information about the state of DevOps
                 hiring for both job seekers and hiring managers.
                 Download it now: The ultimate DevOps hiring guide


                 The Open Organization Guide to IT Culture Change
                 In The Open Organization Guide to IT Culture Change, more than 25 contributors from
                 open communities, companies, and projects offer hard-won lessons and practical ad-
                 vice on how to create an open IT department that can deliver better, faster results and
                 unparalleled business value.
                 Download it now: The Open Organization Guide to IT Culture Change







WRITE FOR US

     Would you like to write for Opensource.com? Our editorial calendar includes upcoming themes,
     community columns, and topic suggestions: https://opensource.com/calendar
     Learn more about writing for Opensource.com at: https://opensource.com/writers
     We're always looking for open source-related articles on the following topics:
               Big data: Open source big data tools, stories, communities, and news.
               Command-line tips: Tricks and tips for the Linux command line.
               Containers and Kubernetes: Getting started with containers, best practices,
               security, news, projects, and case studies.
               Education: Open source projects, tools, solutions, and resources for educators,
               students, and the classroom.
               Geek culture: Open source-related geek culture stories.
               Hardware: Open source hardware projects, maker culture, new products, howtos,
               and tutorials.
               Machine learning and AI: Open source tools, programs, projects and howtos for
               machine learning and artificial intelligence.
               Programming: Share your favorite scripts, tips for getting started, tricks for
               developers, tutorials, and tell us about your favorite programming languages and
               communities.
               Security: Tips and tricks for securing your systems, best practices, checklists,
               tutorials and tools, case studies, and security-related project updates.



                                          Keep in touch!
                Sign up to receive roundups of our best articles,
               giveaway alerts, and community announcements.
           Visit opensource.com/email-newsletter to subscribe.



