
Kubernetes Logging 101

Containerization brings predictability and consistency across the development pipeline. A developer can package code in a container and ship that same container into production knowing it will behave the same way. For this consistent experience to happen, however, many cogs and levers have to work together in the underlying layers of the container stack. Containers abstract away the complex internals of infrastructure and deliver a simple, consistent user experience. The part of the stack that’s especially important in this respect is the orchestration layer.

An orchestration tool like Kubernetes takes care of the complexity of managing numerous containers by providing many smart defaults. It handles changes and configuration for groups of containers called pods, which run across a set of machines that together form a cluster. In doing so, it lets you focus on what matters most to you – the code and data housed in your Kubernetes cluster. Because of these advantages, Kubernetes has become the leading container orchestration tool today.


Kubernetes makes it easy to manage containers at scale, but it comes with a steep learning curve. This is the reason for the numerous startups offering managed Kubernetes services – Platform9, Kismatic, OpenShift, and CoreOS Tectonic to name a few. However, learning the ins-and-outs of Kubernetes is well worth the effort because of the power and control it gives you.

No matter which route you take to managing your Kubernetes cluster, one fundamental requirement to running a successful system is log analysis. Traditional app infrastructure required log data to troubleshoot performance issues, system failures, bugs, and attacks. With the emergence of modern infrastructure tools like Docker and Kubernetes, the importance of logs has only increased.

The Importance of Log Data in Kubernetes

Log data is essential to Kubernetes management. Kubernetes is a very dynamic platform with tons of changes happening all the time. As containers are started and stopped, and IP addresses and loads change, Kubernetes makes many minute adjustments to ensure services stay available and performance is not impacted. Still, there is the odd time when things break or performance slows down. At those times, you need the detail that only log data can provide. Beyond performance and security, you also need log data to ensure compliance with laws like HIPAA and PCI DSS. Or, if there’s a data breach, you’ll want to go back in time to identify the origin of the attack and its progression across your system. For all these use cases, log data is indispensable.

There are many ways you can access and analyze Kubernetes log data ranging from simple to advanced. Let’s start with the simplest option and move up the chain.

Monitoring A Pod

Pod-level monitoring is the most rudimentary form of viewing Kubernetes logs. You use kubectl commands to fetch log data for each pod individually. These logs are stored with the pod, and when the pod dies, the logs die with it. They are useful when you’re just starting out and have only a few pods: you can instantly check the health of a pod without needing a robust logging setup for a big cluster.
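
As an example, a few kubectl invocations cover most day-to-day pod inspection (the pod and container names here are placeholders):

kubectl logs <pod-name>
kubectl logs -f <pod-name>              # follow the log stream in real time
kubectl logs --previous <pod-name>      # logs from the previous, crashed instance
kubectl logs <pod-name> -c <container>  # pick one container in a multi-container pod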

Monitoring A Node

Logs collected on each node are stored in JSON files. These files can get really large, so you can use the logrotate tool to split the log data into multiple files once a day, or once the data reaches a particular size, such as 10MB. Node-level logs are more persistent than pod-level ones. Even if a container is restarted, its previous logs are retained on the node. But if a pod is evicted from the node, its log data is deleted.
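
As an illustration, assuming the container runtime writes its JSON logs under Docker’s default directory (your paths may differ), a logrotate rule along these lines rotates once a day or once a file passes roughly 10MB (the maxsize directive needs a reasonably recent logrotate):

cat <<'EOF' | sudo tee /etc/logrotate.d/docker-containers
/var/lib/docker/containers/*/*.log {
    daily
    maxsize 10M
    rotate 7
    compress
    missingok
    copytruncate
}
EOF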

While pod-level and node-level logging are important concepts in Kubernetes, they aren’t meant to be real logging solutions. Rather, they act as a building block for the real solution – cluster-level logging.

Monitoring the Cluster

Kubernetes doesn’t provide a default logging mechanism for the entire cluster; it leaves this up to the user and third-party tools. One approach is to build on node-level logging: run a logging agent on every node and combine their output.

On Google Kubernetes Engine, the default option is Stackdriver, which uses a Fluentd agent on each node and writes log output to a local file. It can also be set to send the same data to Google Cloud, where you can use Google Cloud’s CLI to query the log data. This, however, is not the most powerful way to analyze your log data.
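
For reference, querying from the CLI looks roughly like this (the exact filter depends on how your cluster and Stackdriver are configured):

gcloud logging read 'resource.type="container"' --limit 10 --format json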

The ELK Stack

The most common way to implement cluster-level logging is to use a Fluentd agent to collect logs from the nodes and pass them on to an external Elasticsearch cluster. The log data is stored and processed using Elasticsearch and can be visualized with a tool like Kibana. The ELK stack (Elasticsearch, Logstash, Kibana) is the most popular open source solution for logging today, and its components often form the base for many other modern logging solutions, including LogDNA (but that’s a topic for a whole other post). The ELK stack offers more powerful logging and more extensibility than the Stackdriver / Google Cloud option.
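
Once logs land in Elasticsearch, they can be queried directly. A minimal sketch, assuming the common Fluentd setup that writes to logstash-* indices and attaches Kubernetes metadata fields (both are assumptions about your configuration; the host is a placeholder):

curl -s 'http://elasticsearch:9200/logstash-*/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match": {"kubernetes.namespace_name": "production"}}, "size": 5}'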

One example of an organization that uses this setup for centralized logging of its Kubernetes clusters is Samsung. They use the Fluentd / ELK combination, but add Kafka for an extra layer of buffering and monitoring. Samsung has even open sourced this configuration of tools, calling it K2 Charts.

Sidecar Containers

You can stream logs of different formats together, but this is harder to analyze and can get messy at Kubernetes scale. Instead, the preferred way is to attach a sidecar container for each type of log data. A sidecar container is dedicated to collecting logs and is very lightweight; it typically runs an agent such as Fluentd to collect logs and transport them to a destination.
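
A minimal sketch of the pattern (the image names and paths are placeholders): the application writes to a shared volume, and the sidecar streams that file so it can be picked up by whatever collects logs on the node.

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: app-with-logging-sidecar
spec:
  containers:
  - name: app
    image: my-app:latest
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-sidecar
    image: busybox
    # stream the app's log file to stdout so node-level collection sees it
    args: [/bin/sh, -c, 'tail -n+1 -F /var/log/app/app.log']
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  volumes:
  - name: app-logs
    emptyDir: {}
EOF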

Archived Log Storage

Storing logs is critical, especially for security. For example, you may find out about a breach in your system that started two years ago, and want to trace its development. In this case, you need archived log data to go back to that point in time, and see the origin of the breach, and to what extent it has impacted your system.

Kubernetes offers basic local storage for logs, but this is not what you’d want to use for a production cluster. You can either use object storage like AWS S3 or Azure Blob Storage, or ask your log analysis vendor for extended storage on their platform. For archived data, it’s best to leverage cloud storage rather than on-premises servers, as it’s more cost efficient and can be easily accessed when needed.
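
A minimal sketch of the archiving step, assuming node-level logs under /var/log/containers and an S3 bucket whose name is a placeholder:

tar -czf k8s-logs-$(date +%F).tar.gz /var/log/containers/
aws s3 cp k8s-logs-$(date +%F).tar.gz s3://my-log-archive/kubernetes/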

Dedicated Log Analysis Platforms

The ELK stack is a common way to access and manage Kubernetes logs, but it can be quite complex with the number of tools to set up and manage. Ideally, you want your logging tool to get out of the way and let you focus on your log data and your Kubernetes cluster. In that case, it pays to go with a dedicated log analysis platform like LogDNA, which comes with advanced log management features and is fully managed, so you don’t have to worry about availability or scaling your log infrastructure.

You can start collecting Kubernetes logs in LogDNA using just two simple kubectl commands:

kubectl create secret generic logdna-agent-key --from-literal=logdna-agent-key=YOUR-INGESTION-KEY-HERE

kubectl create -f https://raw.githubusercontent.com/logdna/logdna-agent/master/logdna-agent-ds.yaml

Deeply customized for Kubernetes, LogDNA automatically recognizes all metadata for your Kubernetes cluster including pods, nodes, containers, and namespaces. It lets you analyze your Kubernetes cluster in real-time, and provides powerful natural language search, filters, parsing, shortcuts, and alerts.

LogDNA even mines your data using machine learning algorithms and attempts to predict issues even before they happen. This is the holy grail of log analysis, and it wasn’t possible previously. Thanks to advances in machine learning and the cloud enabling computing at this scale, it’s now a reality.

To summarize, Kubernetes is the leading container orchestration platform available today. Yet, running a production cluster of Kubernetes takes a lot of familiarity with the system and robust tooling. When it comes to log analysis, Kubernetes offers basic log collection for pods, nodes, and clusters, but for a production cluster you want unified logging at the cluster level. The ELK stack comes closest to what a logging solution for Kubernetes should look like. However, it’s a pain to maintain and runs into issues once you hit the limits of your underlying infrastructure.

For unified log analysis for Kubernetes, you need a dedicated log analysis platform like LogDNA. It comes with advanced features like powerful search, filtering, and machine learning to help you get the most out of your log data. Being a fully managed service, you can focus on your Kubernetes cluster and leave the drudge of maintaining log infrastructure to LogDNA. As you run a production Kubernetes cluster, you need a powerful log analysis tool like LogDNA to truly enjoy the promise of Kubernetes – running containers at massive scale.

Learn more about LogDNA for Kubernetes here.


Choosing The Right Ingestion Client

LogDNA has a range of options by which you can supply your account with log data. If you’re not entirely familiar, these are via the Collector Agent, syslog, code libraries, PaaS integration, and REST API.

Given all these options, which one is right for you? Unfortunately, like many things in tech (as in life), the question is difficult to answer definitively. I don’t mean this in a non-committal way; it’s just hard to say which option is the right one without some well-thought-through qualification.

Given that, let’s work through the five options which LogDNA provides, and consider when they would, and when they wouldn’t, be the right option to choose. Let’s start off with, arguably, the simplest of the five, the logging agent.

The Collector Agent

If you want the easiest option, one supported across all of the major operating systems (Linux, macOS, and Windows), then the Collector Agent is the choice to make. Written in Node.js, it logs information at the OS level. To quote the official documentation, it:

Reads log files from the computer it is installed on and uploads the log data to your LogDNA account…It opens a secure web socket to LogDNA’s ingestion servers and ‘tails’ for new log files added to your specific logging directories. When those files change, the changes are then sent to LogDNA, via the secure web socket.

In addition to reading log files on your server(s), the Collector Agent can also read from the Windows Application Log.

As you may expect from a system daemon or service, after installation, all that’s required is to provide a configuration. On Linux/UNIX systems, this is stored in /etc/logdna.conf and on Windows, it is stored in C:\ProgramData\logdna\logdna.conf.

As an example, let’s assume that my servers are running Ubuntu 16.04 and that I have installed the agent. With that done, the minimum configuration that I’d need to add to /etc/logdna.conf so that it starts sending my log data is:

logdir = /var/log/nginx,/var/log/auth
key = <YOUR LOGDNA INGESTION KEY>

I’ve only needed to provide the directories to read the log files from that are of interest to me, as well as my LogDNA ingestion key. In the example provided, it would send log information from my NGINX server, as well as authentication requests to my LogDNA account.

If there are files that shouldn’t be included, then I can use the exclude option to have the agent skip over them. And there’s another handy option: tags. It allows you to group hosts automatically into dynamic host groups without having to explicitly assign a host to a group within the LogDNA web app.
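
As a hypothetical extension of the earlier configuration (the paths and tag names are placeholders; check the agent documentation for the exact option names your version supports):

logdir = /var/log/nginx,/var/log/auth
key = <YOUR LOGDNA INGESTION KEY>
exclude = /var/log/nginx/access.log
tags = production,web-tier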

The agent can be installed using package managers such as Homebrew, APT, and Yum, as well as from source. It also integrates with Kubernetes if you’re running Docker containers.

Given this level of simplicity, if you’re just getting started with LogDNA, want an option that requires a minimum investment, both initially and over time, and want something that’s available on all the major platforms, then the Collector Agent is the right one to choose.

However, it is considered a bit of a shotgun approach. That is, while it can send your log data straight to LogDNA with little effort, you have only negligible control over how and when it does so. Specifically, you can’t choose a level of log granularity, such as debug, info, warning, and critical.

syslog

The next option is syslog. It is the veteran logging daemon, available on all Linux/UNIX installations, since its development by Eric Allman back in the 1980s. If you’re not familiar, syslog (also available as syslog-ng and rsyslog) is a logging system which allows for logging on a server locally, as well as to remote syslog servers.

It is an excellent choice for logging information across a wide range of devices, such as printers and routers, and a wide range of system services: authentication, servers such as NGINX, CUPS, and Apache, system-level events, and kernel-level events.

Sending log data to LogDNA using syslog requires a little more effort than the Collector Agent. Thankfully, the steps are well documented under Settings -> "View your logs" -> "Via syslog".

Setting up syslog requires adding an entry to /etc/syslog.conf, which can be generated in the “Log Source” modal dialog. For example: *.* @syslog-a.logdna.com:16312. syslog-ng and rsyslog require a little more configuration effort, but, like syslog, the instructions are well documented.
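
On a host running rsyslog, for instance, the whole setup can be as small as the following (the file name is arbitrary, and the host and port should come from your own “Log Source” dialog):

echo '*.* @syslog-a.logdna.com:16312' | sudo tee /etc/rsyslog.d/22-logdna.conf
sudo systemctl restart rsyslog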

syslog is a particularly appealing option if you’re looking for simplicity and minimal cost. One of the three daemons is likely universally available on any Linux/UNIX system, as they’ve been around for so long. What’s more, it’s a very well understood application, with a wealth of documentation available on how to configure it across a wide variety of use cases.

This gives it an advantage over the Collector Agent. Where the agent passes log data on as-is, syslog lets you choose the format in which log messages are written. This may not be necessary, but it’s handy nonetheless, as you can write log data that better suits the needs of your applications. Doing so makes the data easier to parse when the time comes, because it will make more sense.
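
With rsyslog, for example, a legacy-style template can be attached to the forwarding rule; the format string below is purely illustrative:

$template LogDNAFormat,"%timegenerated% %HOSTNAME% %syslogtag%%msg%\n"
*.* @syslog-a.logdna.com:16312;LogDNAFormat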

Given this advantage, the fact that syslog is free, and that LogDNA provides detailed configuration instructions and supports host tags/dynamic groups, it’s well worth considering.

You won’t have to look far to find support for it, and you won’t have to invest much time in maintaining it. Another handy aspect of sending log data using syslog is that you don’t need to explicitly integrate your applications with LogDNA in order to ship their log data. By logging to the system logs, that information is automatically and transparently sent by syslog.

However, one thing that works against syslog is its lack of native Windows support. Having said that, there are syslog daemons for Windows, such as Kiwi Syslog Server, and you’ll find others if you do further research. But they’re not natively available the way syslog is on Linux/UNIX systems.

PaaS Integration

Let’s say that neither the agent nor a syslog daemon suits your purposes, as you’re using a PaaS (Platform as a Service), such as Heroku. In that case, you still need to send your log data to LogDNA, but you may not have as much control as you otherwise might.

In this case, while the options aren’t extensive, you’re not without options. LogDNA supports Heroku, Kubernetes, Docker (including Docker Cloud, Convox, and Rancher), Fluentd, Flynn, CloudWatch, Elastic Beanstalk, and Cloud Foundry.

Assuming that you’re using one in the above list, let’s first consider what you get, which depends on the platform. Some provide more configuration and flexibility; some provide less. In many cases, there isn’t much choice in how logs can be forwarded.

Then there’s the setup complexity and time. While I won’t say that any of the setups are easy, some will require more time and effort than others. However, regardless of what you’re using, LogDNA provides thorough documentation for each one.

Code Libraries

Now to my favorite option, code libraries. As a software developer, you’d naturally expect that it would be. LogDNA has officially developed and supported libraries for Node.js, Python, Rails, and Ruby. There are also libraries for Go, iOS, Java, and PHP.

Given that these are some of the most popular languages and that they have runtimes available on the major operating systems, it’s likely that at least one of them will suit your needs.

Now let’s consider the positives of using code libraries. Unlike the previous two options, code libraries afford you complete control over both how and when your log data is sent. You can integrate logging support into your existing application(s).

You can log as much or as little as your application or regulatory environment demands. You can add as much optional extra information as you want. And you can specify, exactly, the information that is sent, such as custom metadata.

Here’s a minimalist example using PHP, which is often my language of choice.

<?php

use Monolog\Logger;
use Zwijn\Monolog\Handler\LogdnaHandler;

require_once ('vendor/autoload.php');

$logger = new Logger('general');
$logdnaHandler = new LogdnaHandler(
    'YOUR LOGDNA INGESTION KEY',
    'myappname',
    Logger::DEBUG
);
$logger->pushHandler($logdnaHandler);

# Sends debug level message "mylog" with some related meta-data
$logger->debug(
    "mylog", [
        'logdna-meta-data-field1' => [
            'value1' => 'value',
            'value2' => 5
            ],
        'logdna-meta-data-field2' => [
            'value1' => 'value'
            ]
    ]
);

This example initializes a logger object, which will send log messages, at or above the level of DEBUG to LogDNA. It will also send two sets of custom metadata — metadata which could be anything that makes sense to your application.

To be fair, this is rather a “Hello World” example. And integrating logging into your code base would take more effort than it took to create this short example — especially if it’s an older and quite complex legacy codebase.

That said, I hope you’ll appreciate the level of power and customizability that using a code library provides. On the flip side, code libraries also demand a higher investment, both initially and in their maintenance over the longer term.

They take special planning, design, testing, and debugging. They also need to be developed using security best practices, both in the code itself, as well as in the deployment process. Given that, while you have much more power, the investment will be higher. Something to keep in mind.

The REST API

And now to the last option. If none of the previous four options suit — or your particular setup is entirely custom — there’s one last option that just may work, the LogDNA REST API.

Similar to using a code library, the REST API allows you to send log data, along with custom metadata, to LogDNA.

curl "https://logs.logdna.com/logs/ingest?hostname=EXAMPLE_HOST&mac=C0:FF:EE:C0:FF:EE&ip=10.0.1.101&now=$(date +%s)" \
-u INSERT_INGESTION_KEY: \
-H "Content-Type: application/json; charset=UTF-8" \
-d \
'{
   "lines": [
     {
       "line":"This is an awesome log statement",
       "app":"myapp",
       "level": "INFO",
       "env": "production",
       "meta": {
         "customfield": {
           "nestedfield": "nestedvalue"
         }
       }
     }
   ]
}'

In the example log request above, taken from the docs, you can see that requests to the API are POST requests and require a hostname and your LogDNA ingestion key. A MAC address, IP address, and a timestamp can be included as well.

The request body is in JSON format and can contain one or more message lines. JSON is a lightweight and almost universally accepted format, one that can be produced by nearly every popular programming language, along with a variety of tools.

As with the code libraries, and as you can see in the example above, each message line contains the message body, application name, and log level, as well as optional metadata.

Given that the request is being made using curl, it could be wrapped in a shell script on Linux/UNIX or a batch script on Windows. Then, if applicable, the script could be tied into either cron or the Windows Task Scheduler. If written well, it could also take input from any number of services, or be called from an existing process via a system call.
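
Here’s a minimal sketch of such a wrapper (the ingestion key, app name, and default message are placeholders):

#!/bin/bash
INGESTION_KEY="YOUR-INGESTION-KEY"
HOST=$(hostname)
LINE="${1:-heartbeat from $HOST}"

curl -s "https://logs.logdna.com/logs/ingest?hostname=$HOST&now=$(date +%s)" \
  -u "$INGESTION_KEY:" \
  -H "Content-Type: application/json; charset=UTF-8" \
  -d "{\"lines\":[{\"line\":\"$LINE\",\"app\":\"cron-script\",\"level\":\"INFO\"}]}"

A crontab entry such as */5 * * * * /usr/local/bin/logdna-heartbeat.sh would then ship a line every five minutes.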

However, you don’t need to use curl. You could use any software language that can compose and send a POST request, whether natively or via a third-party package, such as GuzzleHTTP for PHP. Similar to the way the code libraries work, you could integrate logging directly into an existing application, even if it doesn’t have dedicated support via one of the eight code libraries.

The REST API offers quite a significant level of flexibility; a flexibility that the Collector Agent, syslog, and PaaS options simply don’t — or can’t — provide.

Unfortunately though, as with the code library option, there won’t be a clear set of steps required to integrate it. Every application is different and has different needs and requirements. What’s more, it will require a higher investment, both at the start and over time. However, if it’s the one that best suits your needs, then it’s worth considering.

In Conclusion

These are the five ways in which you can send log data from your servers and applications to LogDNA. They all have pros and cons, they’re not all equal, and they can’t be compared in the proverbial “apples to apples” way.

The needs of your particular setup, whether server or application, will supply most of the criteria, as will your budget and the experience of your technical staff. But regardless of the choice you make, you can get started quite quickly, likely using the Collector Agent or syslog, and then progress to the more involved options as time goes by, according to your particular use case and as your budget allows.


Logging In The Era Of Containers

Log analysis will always be fundamental to any monitoring strategy. It is the closest you can get to the source of what’s happening with your app or infrastructure. As application development has undergone drastic change over the past few years with the rise of DevOps, containerization, and microservices architecture, logs have not become less important; rather, they are now at the forefront of running stable, highly available, and performant applications. However, the practice of logging has changed. Logging used to be simple; it is now complex. Where we once dealt with a few hundred lines of logs, we now often see millions. Where logs once all lived in one place, we now deal with distributed log data. Yet as logging has become more challenging, a new breed of tools has arrived on the scene to manage and make sense of all this logging activity. In this post, we’ll look at the sea change that logging has undergone, and how innovative solutions have sprung up to address these challenges.

Complexity of the stack

Traditional client-server applications were simple to build, understand, and manage. The frontend was required to run on only a few browsers or operating systems. The backend consisted of a single consolidated database, or at most a couple of databases on a single server. When something went wrong, you could jump into your system logs at /var/log and easily identify the source of the failure and how to fix it.

With today’s cloud-native apps, the application stack has become tremendously complex. Your apps need to run on numerous combinations of mobile devices, browsers, operating systems, edge devices, and enterprise platforms. Cloud computing has made it possible to deliver apps consistently across the world using the internet, but it comes with its own challenges of management. VMs (virtual machines) brought more flexibility and cost efficiencies over hardware servers, but organizations soon outgrew them and needed a faster way to deliver apps. Enter Docker.

Containers bring consistency to the development pipeline by breaking down complex tasks and code into small manageable chunks. This fragmentation lets organizations ship software faster, but it requires you to manage a completely new set of components. Container registries, the container runtime, an orchestration tool or CaaS service – all make a container stack more complex than VMs.

Volume of data has spiked

Each component generates its own set of logs. Monolithic apps are decomposed into microservices with each service being powered by numerous containers. These containers have short life spans of a few hours compared to VMs which typically run for months or even years. Every request is routed across the service mesh and touches many services before being returned as a response to the end user. As a result, the total volume of logs has multiplied. Correlating the logs in one part of the system with those of another part is difficult, and insights are hard won. Having more log data is an opportunity for better monitoring, but only if you’re able to glean insights out of the data efficiently.

Many logging mechanisms

Each layer has its own logging mechanism. For example, Docker has logging drivers for many log aggregators. Kubernetes, on the other hand, doesn’t support logging drivers; instead, a common setup runs a Fluentd agent on every node (more on Fluentd later in this post). Kubernetes doesn’t have native log storage, so you need to configure log storage externally. If you use a CaaS platform like ECS, it has its own log collector and its own set of log data. With log collection so fragmented, it can be dizzying to jump from one tool to another to make sense of errors when troubleshooting. Containers require you to unify logging from all the various components for the logs to be useful.
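
To make the Docker side concrete, a logging driver can be chosen per container at run time; the Fluentd address below is a placeholder:

docker run --log-driver=fluentd --log-opt fluentd-address=localhost:24224 nginx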

The rise of open source tools

As log data has become more complex, the solutions for logging have matured as well. Today, there are many open source tools available. The most popular is the ELK stack, which is actually a collection of three different open source tools – Elasticsearch, Logstash, and Kibana. Elasticsearch is a distributed full-text search database, Logstash is a log aggregation tool, and Kibana is a visualization tool for time-series data. It’s easy to get started with the ELK stack when you’re dipping your toes into container logging, and it packs a lot of powerful features like high availability, near-real-time analysis, and great visualizations. However, once your log volume reaches the limits of the physical nodes that power the ELK stack, it becomes challenging to keep operations running smoothly. Performance lags and resource consumption become an issue. Despite this, the ELK stack has sparked many other container logging solutions, including LogDNA. These solutions have found innovative ways to deal with the problems that weigh down the ELK stack.

Fluentd is another tool commonly used alongside the ELK stack. It is a log collection tool that manages the flow of log data from the source application to any log analysis platform. Its strength is its wide range of plugins, which let it integrate with a wide variety of sources. However, in a Kubernetes setup, sending logs to Elasticsearch means running a Fluentd agent on every node, which adds a per-node drain on system resources.

Machine learning is the future

While open source tools have led the way in making logging solutions available, they require a lot of maintenance overhead when monitoring real-world applications. Considering the complexity of the stack, the volume of data, and the various logging mechanisms, what’s needed is a modern log analysis platform that can intelligently analyze log data and derive insights. Analyzing log data by manual methods is a thing of the past. Instead, machine learning is opening up possibilities to let algorithms do the heavy lifting of crunching log data and extracting meaningful outcomes. Because algorithms can spot minute anomalies that would be invisible to humans, they can identify threats much sooner than a human would, and in doing so can help prevent outages before they happen. LogDNA is one of the pioneers in this attempt to use machine learning to analyze log data.

In conclusion, it is an exciting time to build and use log analysis solutions. The challenge is great, and the options are plentiful. As you choose a logging solution for your organization, remember the differences between legacy applications and modern cloud-native ones, and choose a tool that supports the latter most comprehensively. And as you think about the future of log management, remember that the key words are ‘machine learning’.


Scaling Elasticsearch – The Good, The Bad and The Ugly

Elasticsearch burst onto the database scene in 2010 and has become a staple of many IT teams’ workflows. It has especially revolutionized data-intensive tasks like ecommerce product search and real-time log analysis. Elasticsearch is a distributed, full-text search engine, or database. Unlike traditional relational databases that store data in tables, Elasticsearch stores data as flexible JSON documents and is a lot more versatile. It can run queries far more complex than traditional databases, and do so at petabyte scale. However, though Elasticsearch is a very capable platform, it requires a fair bit of administration to perform at its best. In this post, we look at the pros and cons of running an Elasticsearch cluster at scale.

The Pros

1. Architected for scale

Elasticsearch has a more nuanced and robust architecture than relational databases. Here are some of the key parts of Elasticsearch, and what they mean:

  • Cluster: A collection of nodes; the term is sometimes used to refer to the Elasticsearch deployment itself
  • Index: A logical namespace that’s used to organize and identify the data in Elasticsearch
  • Type: A category used to organize data within an index
  • Document: A single JSON object stored in an index
  • Shard: A partition of an index’s data that lives on a node
  • Node: The underlying server, physical or virtual, that hosts shards and their data
  • Replica shard: An exact copy of a primary shard, typically placed on a different node than the primary


The levels of abstraction in the architecture make it easy to manage the system. Whether it’s the physical nodes, the data objects, or the shards – they can all be controlled at an individual level or at an aggregate level.
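
For instance, shard and replica counts are set per index through the REST API; a minimal sketch (the host and index name are placeholders):

curl -X PUT 'http://localhost:9200/logs-2018.06.01' \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"number_of_shards": 5, "number_of_replicas": 1}}'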

2. Distributed storage and processing

Though Elasticsearch can process data on a single node, it’s built to function across numerous nodes. It allocates primary and replica shards equally across all available nodes, and generates high throughput using parallel processing. When it receives a query, it knows which shards hold the data required to process the query, and it retrieves data from all of those shards simultaneously. This way it can leverage the memory and processing power of many nodes at the same time.

The best part is that this parallelization is all built in. The user doesn’t have to lift a finger to configure how requests are routed among shards. The strong defaults of the system make it easy to get started with Elasticsearch: it abstracts away the low-level processes and delivers a great user experience.

3. Automated failover handling

When working at scale, the most critical factor is ensuring high availability. Elasticsearch achieves this in the way it manages its nodes and shards. All nodes are managed by a master node, which records cluster changes such as the addition and removal of nodes. Every time a node is added or removed, the master rebalances the cluster, reorganizing how shards are distributed across nodes.

The master node doesn’t control data processing, and in this way doesn’t become a single point of failure. In fact, no single node can bring the system down, not even the master node. If the master node fails, the remaining nodes automatically elect one of their own to replace it. This self-organizing approach to infrastructure is what Elasticsearch excels at, and it’s why Elasticsearch works so well at scale.

4. Powerful ad hoc querying

Elasticsearch breaks the limits of relational databases with its full-text search capabilities. Relational databases store data in rows and columns and are rigid in how they store and process it. Elasticsearch, on the other hand, stores data in the form of documents. These documents can be connected to each other in complex structures that can’t be achieved with rows and columns.

For example, Elasticsearch can sort its text-based search results by relevance to the query. It can execute this complex processing at large scale and return results just as quickly. In fact, results can be returned in near-real time, making Elasticsearch a great option for troubleshooting incidents using logs or powering search suggestions. This speed is what separates Elasticsearch from traditional options.

 

The Cons

For all its strengths, Elasticsearch has a few weaknesses that show up when you start to scale it to hundreds of nodes and petabytes of data. Let’s discuss them.

1. Can be finicky with the underlying infrastructure

Elasticsearch is great for parallel processing, but once you scale up, capacity planning is essential to keep it working at the same speed. Elasticsearch can handle a lot of nodes; however, it requires the right kind of hardware to perform at peak capacity.

If you have too many small servers, the management overhead could be excessive. Similarly, if you have just a few powerful servers, failover could be an issue. Elasticsearch works best on a group of servers with around 64GB of RAM each; less than that and it may run into memory issues. Similarly, queries are much faster on data stored in SSDs than on rotating disks. However, SSDs are expensive, and when storing terabytes or petabytes of data, you need a mix of SSDs and rotating disks.

These considerations require planning and fine-tuning as the system scales. Much of the effort goes into maintaining the infrastructure that powers Elasticsearch rather than managing the data inside it.

2. Needs to be fine-tuned to get the most out of it

Apart from the infrastructure, you also need to set the right number of replica shards to keep the cluster healthy with ‘green’ status rather than ‘yellow’. For querying to run smoothly, you need a well-organized hierarchy of indices, types, and IDs – though this is not as difficult as with relational databases.

You can fine-tune the cluster manually when there’s little data, but today’s apps change so fast that you’ll end up with a lot of management overhead. What you need is a way of organizing data and the underlying infrastructure that will work at scale.

3. No geographic distribution

Though it can be made to work this way, distributing an Elasticsearch cluster across multiple geographic locations is not recommended. The reason is that Elasticsearch treats all nodes as if they were in the same data center and doesn’t take network latency into account, which results in slower query processing if the nodes aren’t colocated. Today’s apps, however, are geographically distributed. With microservices architecture, services can be hosted on servers around the globe and still need the same level of access to the data stored in Elasticsearch.

 

Elasticsearch is a powerful database for full-text search and analysis, and it is essential to many DevOps teams today. However, it takes expertise to successfully scale an Elasticsearch cluster and ensure it functions seamlessly.


Logging Best Practices – New Elevated Importance in the Dev’s Toolkit

A constantly evolving development environment has completely changed the way we approach building apps. Our command-line forebears would tuck tail and run if they saw the ever-increasing complexities modern developers contend with every day: torrents of data streaming in daily, new frameworks, and technical stacks that are starting to make the biological cell look simple.

But amid all of this increased complexity, a powerful means of making sense of it all has been neglected; we’re now finally elevating it to its proper place. We’re talking about the limitless potential that logging can provide app developers. By practicing intelligent server logging from the start, we can use server logs for enhanced security, data optimization, and overall increased performance.

For a moment think about how important logged information is in a few select non-technical fields. Banks must have records of money transfers, airplanes have to track their flights throughout the process, and so on. If an issue were to occur right now or in the future, this data would be there to overview and help us come to a quick solution.

The best logging practices should no longer be an afterthought, but part of the development process. This line of thinking needs to be at the forefront of any future cloud-based logging strategy.

 

Develop Intelligent Logging Practices From the Start

How your log messages are constructed in the first place is crucially important. Proper log construction is integral to making sense of your own data, and it allows third-party applications – like LogDNA – to parse logs efficiently and help you gain deeper insights into your data.

JSON logging is one of the most efficient ways to format your logs so they can be easily stored, searched, and analyzed. This gives you power on your own and helps your other tools get the job done even faster. The JSON format offers a simple standard for reading and writing, without sacrificing the ability to comfortably store the swathes of data your app may be producing.

It’s best to begin JSON logging from the get-go when developing a new application, but with enough elbow grease you can reasonably go back and retrofit an app for JSON support.

JSON logging standards should be widespread and mandatory across your project team. This way everyone is able to comprehend the same data and avoid communication mishaps. Most libraries will assist in creating metadata tags for your logs, and these can then be simply configured into a JSON log. Here is a brief pseudo-example of a logger call at the default logging level and the corresponding JSON.

Logger Output

logger.info("User clicked buy", { "userId": "xyz", "transactionId": "buy", "pageId": "check-out" });

JSON Input

{
  "alert": "User clicked buy",
  "userId": "xyz",
  "transactionId": "buy",
  "pageId": "check-out"
}

A standard like this is important for two major reasons. First, it facilitates a shared understanding between different departments – DevOps, project leads, and business-oriented team members – creating a central, interchangeable environment for using one another’s data. This might span varied business functions across a company, from marketing initiatives on the business front to a developer streamlining a new UI for the checkout cart.

Second, when the log output is in this format, machines can read the data thoroughly and with far greater ease. What could take hours or even days of manual searching is reduced to a few seconds under the machine’s all-seeing eye.

 

Making Sense of Levels

Classifying your data with levels is the next step once you’ve constructed it in an adequate manner. Levels help you monitor how users are experiencing your interface, watch for debilitating performance issues and potential security problems, and give you the tools to turn user trends into a better experience.

Some log levels won’t appear in a production-level app; debug is one example. Others will be streaming in constantly. Our previous example showed the info level, typically used for simple inputs of information from the user. There is a breadth of great information here – you can count on that.

Furthermore, levels such as warning, error, and worst of all fatal or critical, are logs that need immediate attention before besmirching your good name. This leads us to another important practice.  

 

Structuring Data

These levels translate directly into a few important JSON fields. Fields such as json.level and json.severity help determine where warnings and errors are coming from, so you can catch early warnings before they snowball into a major fatal problem. Valuable fields like these help you catch major events in the process. Other important fields to look out for include json.appId, json.platform, and json.microservices. If you’re running your logs through Kubernetes, you know you’ll be running on a wide variety of platforms, and json.platform comes in handy for that.

In summary of these few listed fields: you can filter logs down to only the events that cause errors or warnings, isolate messages from designated add-on apps and microservices, or select parsed logs from multiple staging platforms.

JSON supports several data types, including numbers, strings, and nested objects. The trick is in knowing how not to get these mixed up: a number wrapped in quotes will be treated as a string. It’s important to properly label events and fields so these mix-ups don’t happen.

Many developers are apprehensive about saving and storing all of this data, feeling as if they’ll never be able to look through any of it anyhow. There’s some truth to that, but the best thing about proper logging practices is that they don’t have to – that’s the machine’s job. More importantly, it’s the machine-based approach that LogDNA has down to a science. You never know which part of the data you’ll need down the line.

 

Always Collect Data for Further Use

Logs are nutrient-rich pieces of data that can be stored and collected for future use. They don’t take up much space and can be looked over instantly. Log monitoring is the best way to get a clear picture of your overall system. Logging is a fundamental way to understand your environment and the myriad events going on inside it, and many future trends can be predicted from past data as well.

Proper log monitoring helps detect potential problems so you can rid your app of them before they become a major issue.

Collect as much data as you can.

You may think that not all data is created equal, and that is sometimes true. But you’d be surprised what certain innocuous metrics could point toward in the future. With that being said, you should know your system better than anyone else, and that extends to knowing what type of data is most crucial to log. As a general rule of thumb, the following are two important pieces of information that should always be logged.

Log Poor Performance

If a certain part of your application should be performing at high speed – a database ping, for example – then you should log the event when it is not performing adequately. If an operation takes longer than a certain number of milliseconds, you’d be rightfully worried about overall performance, and the log entry leads you to the problem at hand.

Log Errors

You’re bound to come across a few errors after adding any type of new feature to a system. Logging these errors provides useful information for debugging and future analytics. If an error somehow gets into production, you now have the tools to rid your app of the problem and ensure a happy user experience.

 

Watch for Performance Deviations & Look at Larger Trends

The macro picture of your logging data will be the deciding factor in telling you how your system is functioning and performing overall. This is more important than looking at a single data point. Graphing this type of data in a log management system gives a visual representation to a process that used to be nearly inaccessible.

 

Keep the Bottom-Line in Mind

At the end of the day what matters most is that your development processes are always evolving, growing and keeping the end-user in mind. Helping your internal teams take advantage of their own data is liberating.

We help find the interwoven connections in data and apply new and innovative features to contend with a rapidly evolving logging sphere. Follow these parameters for the best logging practices – proper log message construction, collaborative standards, and well-structured, well-stored data – and you will succeed.

Log management is a real-time understanding of your metrics, events, and more, and it can mean a lot of different things to different parts of your team. What was once forgotten data sitting in an old dusty cavern of your server is now a leading development tool.


Logging with Kubernetes should not be this painful

Before reading this article, we recommend having a basic working knowledge of Kubernetes. Check out our previous article, Kubernetes in a nutshell, for a brief introduction.

Logging Kubernetes with the Elasticsearch stack is free, right?

If the goal of using Kubernetes is to automate management of your production infrastructure, then centralized logging is almost certainly a part of that goal. Because containers managed by Kubernetes are constantly created and destroyed, and each container is an isolated environment in itself, setting up centralized logging on your own can be a challenge. Fortunately, Kubernetes offers an integration script for the free, open-source standard for modern centralized logging: the Elasticsearch stack (ELK). But beware, Elasticsearch and Kibana may be free, but running ELK on Kubernetes is far from cheap.

Easy installation

As outlined in the Kubernetes docs article on Elasticsearch logging, the initial setup required for an Elasticsearch Kubernetes integration is actually fairly trivial. You set an environment variable, run a script, and boom, you’re up and running.
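
At the time of writing, that looks roughly like the following (a sketch based on the cluster bring-up scripts; details vary by Kubernetes version and provider):

KUBE_LOGGING_DESTINATION=elasticsearch cluster/kube-up.sh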

However, this is where the fun stops.

The way the Elasticsearch Kubernetes integration works is by running per-node Fluentd collection pods that send log data to Elasticsearch pods, which can then be viewed through Kibana pods. In theory, this works just fine; in practice, however, Elasticsearch requires significant effort to scale and maintain.

JVM woes

Since Elasticsearch is written in Java, it runs inside a Java Virtual Machine (JVM), which has notoriously high resource overhead even with properly configured garbage collection (GC), and not everyone is a JVM tuning expert. Instead of being several services distributed across multiple pods, Elasticsearch is one giant service inside a single pod. Scaling individual containers with large resource requirements defeats much of the purpose of using Kubernetes, since Elasticsearch pods are likely to eat up all of a given node’s resources.
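
A common mitigation is to pin both the JVM heap and the pod’s resource limits so Elasticsearch can’t starve other workloads on the node. A rough sketch, assuming a StatefulSet named elasticsearch and purely illustrative sizes:

kubectl set env statefulset/elasticsearch ES_JAVA_OPTS="-Xms4g -Xmx4g"
kubectl set resources statefulset/elasticsearch --limits=cpu=2,memory=8Gi --requests=cpu=1,memory=8Gi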

Elasticsearch cluster overhead

Elasticsearch’s architecture requires multiple Elasticsearch master nodes and multiple Elasticsearch data nodes just to start accepting logs at any scale beyond deployment testing. Each of these masters and data nodes runs inside a JVM, and together they consume significant resources. If you are logging at a reasonably high volume, this overhead is inefficient inside a containerized environment, and logging at high volume in general introduces a whole other set of issues. Ask any of our customers who’ve switched to us from running their own ELK.


Free ain’t cheap

While we ourselves are often tantalized by the possibility of using a fire-and-forget open-source solution to solve a specific problem, properly maintaining an Elasticsearch cluster is no easy feat. Even so, we encourage you to learn more about Elasticsearch and what it has to offer, since it is, without a doubt, a very useful piece of software. However, like all software it has its nuances and pitfalls, and it is therefore important to understand how they may affect your use case.

Depending on your logging volume, you may want to configure your Elasticsearch cluster differently to optimize for a particular use case. Too many indices, shards, or documents can each result in crippling and costly performance issues. On top of this, you’ll need to constantly monitor Elasticsearch resource usage within your Kubernetes cluster so your other production pods don’t die because Elasticsearch decides to hog all available memory and disk resources.
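
That monitoring quickly becomes routine work in itself; a sketch of the kinds of checks involved (the host and pod label are placeholders):

curl -s 'http://elasticsearch:9200/_cluster/health?pretty'
curl -s 'http://elasticsearch:9200/_cat/indices?v'
kubectl top pods -l app=elasticsearch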

At some point, you have to ask yourself, is all this effort worthwhile?

LogDNA cloud logging for Kubernetes


As big believers in Kubernetes, we spent a good amount of time researching and optimizing our integration. In the end, we were able to get it down to a copy-pastable set of two kubectl commands:

kubectl create secret generic logdna-agent-key --from-literal=logdna-agent-key=<YOUR-API-KEY-HERE>
kubectl create -f https://raw.githubusercontent.com/logdna/logdna-agent/master/logdna-agent-ds.yaml

This is all it takes to send your Kubernetes logs to LogDNA. No manual setting of environment variables, no editing of configuration files, no maintaining servers or fiddling with Elasticsearch knobs, just copy and paste. Once executed, you will be able to view your logs inside the LogDNA web app. And we extract all pertinent Kubernetes metadata such as pod name, container name, namespace, container id, etc. No Fluentd parsing pods required (or any other dependencies, for that matter).

Easy installation is not unique to our Kubernetes integration. In fact, we strive to provide concise and convenient instructions for all of our integrations. But don’t just take our word for it – you can check out all of our integrations yourself. We also support a multitude of useful features, including alerts, JSON field search, archiving, and line-by-line contextual information.

All for $1.25/GB per month. We’re huge believers in paying only for what you use, and in many cases we’re cheaper than running your own Elasticsearch cluster on Kubernetes.

For those of you not yet using LogDNA, we hope our value proposition is convincing enough to at least give us a try.

If you don’t have a LogDNA account, you can create one at https://logdna.com, or, if you’re on macOS with Homebrew installed:

brew cask install logdna-cli
logdna register 
# now paste the api key into the kubectl commands above

Thanks for reading and we look forward to hearing your feedback.


Kubernetes in a Nutshell


In addition to being the Greek word for helmsman, Kubernetes is a container orchestration tool that enables container management at scale. Before we explain more about Kubernetes, it is important to understand the larger context of containers.

Broadly, containers isolate individual apps within a consistent environment and can be run on a wide variety of hosting providers and platforms. This enables developers to test their app in a development environment identical to their production environment, without worrying about dependency conflicts or hosting provider idiosyncrasies. When new code is deployed, the currently running containers are systematically destroyed and then replaced by new containers running the new code, effectively providing quick, stateless provisioning. This is especially appealing to those using a microservices architecture, with many moving parts and dependencies.

While containers in themselves ensure a consistent, isolated environment, managing a multitude of individual containers is cumbersome without a purpose-built management tool, like Kubernetes. Instead of manually configuring networking and DNS, Kubernetes not only lets you deploy and manage thousands of containers, but also automates nearly all aspects of your development infrastructure. In addition to networking and DNS, Kubernetes also optimizes sharing machine resources across containers, so that any given machine, or node, is properly utilized, reducing costly inefficiencies. This is particularly powerful at scale since you effectively eliminate a significant chunk of DevOps overhead.

To learn more about Kubernetes, check out this handy getting started guide. If you’re feeling bullish on Kubernetes and want to learn more about your logging options, read Logging with Kubernetes should not be this painful.