Comparison, Technical

How Fluentd plays a central role in Kubernetes logging

Written by Twain Taylor

Collecting logs is a complex challenge with containerized applications. Docker enables you to decompose a monolith into microservices, which brings more control over each part of the application. Containers help to separate each layer – infrastructure, networking, and application – from the others. This gives you flexibility in where to run your app, whether in a private data center or on a public cloud platform, and even lets you move between cloud platforms as the need arises. Networking tools plug into the container stack and enable container-to-container communication at large scale. At the application layer, microservices are the preferred model when adopting containers, although containers can be leveraged to improve efficiency in monolithic applications as well. While this is a step up from the vendor-centric options of yesterday, it brings complexity across the application stack. Managing this complexity takes deep visibility: the kind that starts with metrics, but goes deeper with logging.

The challenge with logging containers is the sheer number of data sources. At every level of the stack, logs are generated in great detail, far more than with traditional monolithic applications. Log data is generated as instances are created and retired, as their configurations change, as network proxies communicate with each other, as services are deployed, and as requests are processed. This data is rich with nuggets of information vital for monitoring and controlling the performance of the application.

With the amount of log data ever increasing, it takes specialized tools to manage it and to adapt to the specific needs of containers at every step of the log lifecycle. Kubernetes is built with an open architecture that leaves room for this type of innovation: it allows open source logging tools to extract logs from Kubernetes and process them on their own. In response, a number of logging tools have stepped up to the task. These tools are predominantly open source and give you flexibility in how you’d like to manage logging. There are tools for every step, including log aggregation, log analysis, and log visualization. One tool that has risen to the top for log aggregation is Fluentd.

What is Fluentd?

Fluentd is an open source tool that focuses exclusively on log collection, or log aggregation. It gathers log data from various sources and makes it available to multiple endpoints. Fluentd aims to create a unified logging layer. It is source and destination agnostic and is able to integrate with tools and components of any kind. Fluentd has first-class support for Kubernetes, the leading container orchestration platform, and it is the recommended way to capture Kubernetes events and logs for monitoring. Adopted by the CNCF (Cloud Native Computing Foundation), Fluentd’s future is in step with Kubernetes, and in this sense, it is a reliable tool for the years to come.


EFK is the new ELK

Previously, the ELK stack (Elasticsearch, Logstash, Kibana) was the go-to option for logging applications with open source tools. Elasticsearch is a full-text search engine and database that’s ideally suited to processing and analyzing large quantities of log data. Logstash, like Fluentd, is a log aggregation tool. Kibana focuses exclusively on visualizing the data coming from Elasticsearch. Logstash is still widely used but now has competition from Fluentd – more on this later. So today, we’re not just talking about the ELK stack, but the EFK stack. Although, admittedly, it’s not as easy to pronounce.

Fluentd Deployment

You can install Fluentd from its Docker image, which can be further customized.

Kubernetes also has an add-on that lets you easily deploy the Fluentd agent. If you use Minikube, you can install Fluentd via the EFK addon with a single command: minikube addons enable efk. This installs Fluentd alongside Elasticsearch and Kibana. While Fluentd is lightweight, the other two are resource heavy; they will need additional memory in the VM used to host them, and will take some time to initialize as well.

Kops, the Kubernetes cluster management tool, also has an addon to install Fluentd as part of the EFK trio.

Another way to install Fluentd is to use a Helm chart. If you have Helm set up, this is the simplest and most future-proof way to install Fluentd. Helm is a package manager for Kubernetes and lets you install Fluentd with a single command:

$ helm install --name my-release incubator/fluentd-elasticsearch

Once installed, you can further configure the chart with many options for annotations, Fluentd ConfigMaps, and more. Helm makes it easy to manage versioning for Fluentd, and even has a powerful rollback feature which lets you revert to an older version if needed. It is especially useful if you want to install Fluentd on remote clusters, as you can share Helm charts easily and install them in different environments.

If you use a managed Kubernetes service, it may have its own way of installing Fluentd that’s specific to the platform. For example, with GKE, you’ll need to define variables that are specific to the Google Cloud Platform, like region, zone, and project ID. Then, you’ll need to create a service account, create a Kubernetes cluster, deploy a test logger, and finally deploy the Fluentd DaemonSet to the cluster.

How it works

The Docker runtime collects logs from every container on every host and stores them on the node under /var/log. The Fluentd image is already configured to forward all logs from /var/log/containers and some logs from /var/log. Fluentd reads the logs and parses them into JSON format. Since the logs are stored as JSON, they can be shared with virtually any endpoint.
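
For illustration, the heart of that configuration is a tail source roughly like the following sketch (exact paths, parsers, and plugin options vary between Fluentd images and container runtimes, so treat this as an approximation rather than the shipped config):

<source>
  @type tail
  # Follow every container log file the node exposes
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    # Docker writes one JSON object per line
    @type json
  </parse>
</source>

The tag assigned here is what later filter and output sections match on when deciding where each record should go.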

Fluentd also adds Kubernetes-specific information to the logs. For example, it attaches labels and other metadata to each log message, which can be critical in better managing the flow of logs across different sources and endpoints. It reads Docker logs, etcd logs, and Kubernetes component logs.
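
As a hypothetical example, a single parsed record might look roughly like this once the Kubernetes metadata has been attached (the field names follow the commonly used kubernetes_metadata filter; the exact set depends on your configuration):

{
  "log": "GET /healthz 200",
  "stream": "stdout",
  "time": "2018-06-01T12:00:00Z",
  "kubernetes": {
    "namespace_name": "default",
    "pod_name": "web-5d4b9c-xk2lp",
    "container_name": "web",
    "labels": { "app": "web" },
    "host": "node-1"
  }
}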

The most popular endpoint for log data is Elasticsearch, but you can configure Fluentd to send logs to an external service such as LogDNA for deeper analysis. By using a specialized log analysis tool, you can save time on troubleshooting and monitoring. With features like instant search, saved views, and archival storage of data, a log analysis tool is essential if you’re setting up a robust logging system that involves Fluentd.

Fluentd Alternatives

Logstash

Logstash is the most similar alternative to Fluentd and does log aggregation in a way that works well for the ELK stack.

Logstash uses if-then rules to route logs while Fluentd uses tags to know where to route logs. Both are powerful ways to route logs exactly where you want them to go with great precision. Which you prefer will depend on the kind of programming language you’re familiar with – declarative or procedural.

Both Fluentd and Logstash have vast plugin libraries, which makes them versatile. In terms of getting things done with plugins, both are very capable and have wide support for pretty much any job. There are plugins for the most popular input and output tools like Elasticsearch, Kafka, and AWS S3, as well as plugins for tools used by more niche groups of users. Here Fluentd has a bit of an edge, as its plugin library is comparatively bigger.


When it comes to size, Fluentd is more lightweight than Logstash. This has a bearing on the logging agent that’s attached to containers: the bigger the production application, the larger the number of containers and data sources, and the more agents are required. A lighter logging agent like Fluentd’s is preferred for Kubernetes applications.

Fluent Bit

While Fluentd is pretty light, there’s also Fluent Bit, an even lighter version of the tool that removes some functionality and has a limited library of around 30 plugins. However, it’s extremely lightweight, weighing in at ~450KB next to the ~40MB of the full-blown Fluentd.

Conclusion

Logging is a critical function when running applications in Kubernetes. Logging is difficult with Kubernetes, but thankfully, there are capable solutions at every step of the logging lifecycle. Log aggregation is at the center of logging with Kubernetes: it enables you to collect all logs end-to-end and deliver them to various data analysis tools for consumption. Fluentd is the leading log aggregator for Kubernetes; its small footprint, bigger plugin library, and ability to add useful metadata to the logs make it ideal for the demands of Kubernetes logging. There are many ways to install Fluentd – via the Docker image, Minikube, kops, Helm, or your cloud provider.

Being tool-agnostic, Fluentd can send your logs to Elasticsearch or a specialized logging tool like LogDNA. If you’re looking to centralize logging from Kubernetes and other sources, Fluentd can be the unifying factor that brings more control and consistency to your logging experience. Start using Fluentd and get more out of your log data.

Technical

Challenges with Logging Serverless Applications

Serverless computing is a relatively new trend with big implications for software companies. Teams can now deploy code directly to a platform without having to manage the underlying infrastructure, operating system, or software. While this is great for developers, it introduces some significant challenges in monitoring and logging applications.

This post explores both the challenges in logging serverless applications, and techniques for effective cloud logging.

What is Serverless, and How is it Different?

In many ways, serverless computing is the next logical step in cloud computing. As containers have shown, splitting applications into lightweight units helps DevOps teams build and deploy code faster. Serverless takes this a step further by abstracting away the underlying infrastructure, allowing teams to deploy code without having to configure the environment that it runs on. Unlike with containers or virtual machines, the platform provider manages the environment, provisions resources, and starts or stops the application when needed.

For this post, we’ll focus on a type of serverless computing called Functions as a Service (FaaS). In FaaS, applications consist of individual functions performing specific tasks. Like containers, functions are independent, quick to start and stop, and can run in a distributed environment. Functions can be replaced, scaled, or removed without impacting the rest of the application. And unlike a virtual machine or even a container, functions don’t have to be active to respond to a request. In many ways, they’re more like event handlers than a continuously running service.

With so many moving pieces, how do you log information in a meaningful way? The challenges include:

  • Collecting logs
  • Contextualizing logs
  • Centralizing logs

Collecting Logs

Serverless platforms such as AWS Lambda and Google Cloud Functions provide two types of logs: request logs and application logs.

Request logs contain information about each request that accesses your serverless application. They can also contain information about the platform itself, such as its execution time and resource usage. In App Engine, Google’s earliest serverless product, this includes information about the client who initiated the request as well as information about the function itself. Request logs also contain unique identifiers for the function instance that handled the request, which is important for adding context to logs.

Application logs are those generated by your application code. Any messages written to stdout or stderr are automatically collected by the serverless platform. Depending on your platform, these messages are then streamed to a logging service where you can store them or forward them to another service.

Although it may seem unnecessary to collect both types of logs, doing so will give you a complete view of your application. Request logs provide a high-level view of each request over the course of its lifetime, while application logs provide a low-level view of the request during each function invocation. This makes it easier to trace events and troubleshoot problems across your application.

 

Contextualizing Events

Context is especially important for serverless applications. Because each function is its own independent unit, logging just from the application won’t give you all the information necessary to resolve a problem. For example, if a single request spans multiple functions, filtering through logs to find messages related to that request can quickly become difficult and cumbersome.

Request logs already store unique identifiers for new requests, but application logs likely won’t contain this information. Lambda allows functions to access information about the platform at runtime using a context object. This object lets you access information such as the current process instance and the request that invoked the process directly from your code.

For example, this script uses the context object to include the current request ID in a log:

import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def my_function(event, context):
    # context.aws_request_id uniquely identifies this invocation, letting you
    # correlate this message with the matching request log entry.
    logger.info("Function invoked.", extra={"request_id": context.aws_request_id})

In addition, gateway services such as Amazon API Gateway often create separate request IDs for each request that enters the platform. This lets you correlate log messages not only by function instance, but for the entire call to your application. This makes it much easier to trace requests that involve multiple functions or even other platform services.

Centralizing Logs

The decentralized nature of serverless applications makes collecting and centralizing logs all the more important. Centralized log management makes it easier to aggregate, parse, filter, and sort through log data. However, serverless applications have two very important limitations:

  • Functions generally have read-only filesystems, making it impossible to store or even cache logs locally
  • Using a logging framework to send logs over the network could add latency and incur additional usage charges

Because of this, many serverless platforms automatically ingest logs and forward them to a logging service on the same platform. Lambda and Cloud Functions both detect logs written to stdout and stderr and forward them to AWS CloudWatch Logs and Stackdriver Logs respectively, without any additional configuration. You can then stream these logs to another service such as LogDNA for more advanced searching, filtering, alerting, and graphing. This eliminates the need for complex logging configurations or separate log forwarding services.

Conclusion

Although the serverless architecture is very different from those preceding it, most of our current logging best practices still apply. The main difference is in how logs are collected and the information they contain about the underlying platform. As the technology matures, we will likely see new best practices emerge for logging serverless applications.

ProTip, Technical

Log Alerts & Monitoring – Best Practices

Application logs are more than just tools for root cause analysis. They’re also a way to gain insight into critical events such as a loss in sales, server warnings, HTTP errors, performance, and numerous other activities that impact business productivity. Logs are just thousands of lines of raw data, but they can be parsed and leveraged to provide a better understanding of what goes on under the hood of your application. It’s common for developers to set up logging for code exceptions within the application, but logging provides far more benefits than just bug tracking and should be used to alert administrators to issues that need their attention.

Application and Server Performance

Any administrator who has dealt with performance issues will tell you that it’s one of the most difficult problems to analyze and pinpoint a root cause for. Performance degradation can occur at certain times of the day, during an active DDoS attack, or for what seems like no reason at all. QA can do performance testing on your application, but these tests rarely represent a production environment that supports thousands of users concurrently. For most organizations, performance issues surface as the business grows and can hold back its expansion. Performance is also problematic because it’s unforeseen and rarely a quick fix for developers and administrators.

Using log alerts, you can assess what’s happening when an application’s performance is diminished. It could be a specific event such as a poorly optimized database query, or a spike in CPU usage. Log these types of events, and you can identify not only when application performance will wane but also when server resources could be exhausted. Instead of suffering a server crash, you can use these events to gauge when it’s time to upgrade server hardware. They also help you pinpoint components of your application that could be causing performance degradation.

Your developers would need to set a baseline, but — for instance — you could set an alert for any event that takes longer than 100 milliseconds to process. When you see a pattern, you can then have developers research more into these application components for better optimization.
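
As a minimal sketch of that idea (the 100 millisecond threshold, logger name, and helper function here are illustrative assumptions, not a prescribed implementation), slow operations can be timed and logged so an alerting rule can match on them:

import logging
import time

logger = logging.getLogger("perf")
SLOW_THRESHOLD_MS = 100  # the baseline agreed on with your developers

def timed(operation_name, func, *args, **kwargs):
    # Run the operation and log a warning if it exceeds the baseline
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > SLOW_THRESHOLD_MS:
        # An alert rule in your log management tool can match on "slow_operation"
        logger.warning("slow_operation name=%s duration_ms=%.1f",
                       operation_name, elapsed_ms)
    return result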

When CPU usage spikes to over 80%, set an alert to inform administrators. The fix could be something as simple as upgrading RAM or even your server’s CPU, but having these log alerts gives you the ability to analyze the time of day and any patterns in application activity surrounding the spike.
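
Along the same lines, a small watchdog could sample CPU usage and emit an alert-worthy log line when it crosses that mark (a sketch assuming the psutil package is available; your monitoring agent may already collect this for you):

import logging
import psutil

logger = logging.getLogger("cpu_watch")
CPU_ALERT_PERCENT = 80

def check_cpu():
    # Sample CPU utilization over one second
    usage = psutil.cpu_percent(interval=1)
    if usage > CPU_ALERT_PERCENT:
        # A log alert matching "cpu_spike" can notify administrators
        logger.error("cpu_spike percent=%.1f", usage)
    return usage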

Failed Sales Events

Most applications log exceptions so that developers can look back into errors and provide updates. But not every exception is created equal. The most critical events are those that impact business revenue. These are the events you should monitor closely, sending alerts to the right response team.

It shouldn’t be your customers calling to tell you that there are bugs in your application. You want to find them before they do, and many customers will just bounce to another vendor if you have too many bugs. Many of your customers won’t report application issues at all, so you could be losing sales every hour and never know it.

When you build a shopping cart, you should have a point where you ask a visitor for a way to contact them; email is a common input field during account creation. With an alert on failed shopping cart activity, you have a way to contact a customer should they receive an error and bail on the shopping experience. Alerts are a great tool for salvaging a customer lost to application bugs.

But you also need alerts to tell you when components of your application are creating obstacles in the customer shopping experience. It could be an alert for performance (similar to the previous example), or your customers could be dropping off at a specific point in the sales funnel. Alerts give you insights into the efficacy of your marketing and user experience. Use them generously to identify issues with your sales pages and find solutions for a better user experience.

Security and Suspicious Activity

Developers are good at logging exceptions, but they don’t usually account for security events. Security events can be any suspicious behaviors, such as automated logins from account takeover (ATO) tools using stolen customer data, repeated patterns of failed admin login attempts, ACL changes, or new accounts being created.

Some security events are triggered from the database, but that limits your logging to database-specific activity. Logging and alerts should also be used to make the right people aware of suspicious activity happening on any of your servers outside of the database.

In an account takeover, an attacker uses automated software to log into a customer’s account and uses purchased credit card data to buy products, testing whether the cards are viable. Logs should be used to detect this type of activity and alert administrators to suspicious events.

Any modifications to security permissions or authorization should also be logged and alerts sent. This could be elevation of permissions for any specific user, new routing rules configured on your infrastructure, or user access to critical files. Security events are a primary method organizations use to identify attacks before they become catastrophic data breaches.

How Do You Set Up Alerts?

You need the right data before you can set up alerts. What you log is up to you, but some standard data points are needed to create useful logs that scale across your infrastructure. Some common data points include the following (a sketch of how to capture them follows the list):

  • Date and time
  • Host name
  • Application name
  • Customer or account that experienced the error
  • IP address or other geographic markers for the visitor
  • Raw exception information
  • Line number where it occurred (if applicable)
  • Type of error (fatal, warning, etc.)
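
One hedged way to capture these data points consistently is to emit each event as a structured record. The field names and the example event below are illustrative, not a required schema:

import json
import logging
import socket
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")

def log_event(level, app_name, error_type, message,
              account=None, client_ip=None, line_number=None):
    # Build one structured log line containing the common data points
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "host": socket.gethostname(),
        "application": app_name,
        "account": account,
        "client_ip": client_ip,
        "error_type": error_type,  # fatal, warning, etc.
        "line_number": line_number,
        "message": message,
    }
    logger.log(level, json.dumps(record))

# Example: a failed checkout that should trigger an alert
log_event(logging.ERROR, "storefront", "fatal",
          "payment gateway timeout during checkout",
          account="customer@example.com", client_ip="203.0.113.7")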

You could write a logging application of your own, or you could save time and configuration hassles by using LogDNA. With built-in features for collecting log data and sending alerts, it can save you months of developing and testing your own solution.

Instead of using logs only for basic events, an organization’s best practices should treat alerting as a way to give administrators insight into issues before they become catastrophic, rather than just a way to retroactively find solutions. LogDNA provides the tools and alerts organizations can leverage to avoid revenue-impacting bugs and server errors.

Security, Technical

Building Secure Applications with Machine Learning & Log Data

In recent years, machine learning has swept across the world of software delivery and is changing the way applications are built, shipped, monitored, and secured. Log monitoring is one of the fields that keeps evolving with the new capabilities afforded by machine learning.

What Is Machine Learning?

Machine learning is the process of using algorithms and computer intelligence to analyze and make sense of large quantities of complex data that would otherwise be difficult or impossible for a human security analyst to handle. There are many forms of machine learning, from algorithms that can be trained to replicate human decision making at scale to algorithms that take an open-ended approach and find interesting pieces of data with little input or guidance. Differences aside, machine learning looks to get more value out of data in a way that humans can’t do manually. This is critical for security, which deals with large quantities of data and often misses the important data, or catches it too late.

Machine learning is made possible by the power of cloud computing, which makes crunching big data cheaper and faster. Analyzing large quantities of complex data takes a lot of computing power, readily available memory, and fast networking that’s optimized for scale. Cloud vendors today provide GPU (graphics processing unit) instances that are particularly well suited for machine learning. The alternative is to use numerous cheap servers and a distributed approach to analyze the data at scale, which is possible thanks to the advances in distributed computing over the past few years.

Additionally, cloud storage with fast I/O speeds is necessary for complex queries to be executed within a short period of time. AWS alone offers multiple storage solutions like Amazon EFS, EBS, and S3; each serves different purposes and is ideal for different types of data workloads. The cloud has stepped up in terms of compute, memory, storage, and overall tooling available to support machine learning.

Security Threats Are Becoming Increasingly Complex

The primary use case for machine learning in log analysis has to do with security. There are many forms of security attacks today, which range in complexity. Email phishing attacks, promo code abuse, credit card theft, account takeover, and data breaches are some of the security risks that log analysis can help protect against. According to the Nilson Report, these types of security attacks cost organizations a whopping $21.8 billion each year. And that doesn’t even include the intangible costs associated with losses in trust, customer relationships, brand value, and more that stem from a security attack. Security threats are real, and log analysis can and should be used to counter them.

Attackers are becoming more sophisticated as they look to new technology and tools to carry out their attacks. Indeed, criminals themselves are early adopters of big data, automation tools, and more. They use masking software to hide their tracks, bots to conduct attacks at scale, and in some cases armies of humans assisting in the attack. With such a coordinated effort, they can easily break through the weak defenses of most organizations. That’s why some of the biggest web companies, the most highly secured government institutions, and the fastest growing startups all fall victim to these attackers.

SecOps Is Predominantly Manual

Since much of security is conducted by humans, or at most by tools that use rules to authorize or restrict access, attackers eventually understand the rules and find ways around them. They may find a limit to the number of requests a gatekeeper tool can handle per second, and then bombard the tool with more requests than the limit. Or they may find unsecured IoT devices on the edge of the network that can be easily compromised and taken control of. This was the case with the famous Dyn DDoS attack, which leveraged unsecured DVRs to bombard the Dyn network, taking down with it a large percentage of the internet’s top websites that relied on Dyn for DNS services. The point is that manual security reviews, and even rule-based security, don’t scale and are not enough to secure systems against the most sophisticated attackers. What’s needed is a machine learning approach to security, one that leverages logs to detect and stop attacks before they escalate.


Many security risks occur at the periphery of the system, so it’s essential to keep a close watch on all possible entry points. End users access the system from outside, and can sometimes knowingly or unknowingly compromise the system. You should be able to spot a malicious user from the smallest of triggers; for example, an IP address or a geo location that is known to be suspicious should be investigated. Login attempts are another giveaway that your system may be under attack. Frequent unsuccessful login attempts are a bad sign and need to be further investigated. Access logs need to be watched closely to spot these triggers, and it is best done by a machine learning algorithm.

While manual and rule-based review can work to a certain point, increasingly sophisticated attacks are best thwarted by using machine learning. You may need to crawl external third-party data to identify fraudulent activity, and it helps to look not just inside, but outside of your organization for data that can shed light on suspicious activity. But with growing data sets to analyze, you need more than basic analytics — you need the scale and power of machine learning to spot patterns, and find the needle in the haystack. Correlating your internal log data with external data sets is a challenge, but it can be done with machine learning algorithms that look for patterns in large quantities of unstructured data.

Machine Learning For Security

Machine learning can go further in spotting suspicious patterns from multiple pieces of data. It can look at two different pieces of data, sometimes not obviously associated with each other, and highlight a meaningful pattern. For example, if a new user accesses parts of the system that are sensitive, or tries to gain access to confidential data, a machine learning algorithm can spot this from looking at their browsing patterns or the requests they make. It can decipher that this user is likely looking to breach the system and may be dangerous. Highlighting this behavior an hour or two in advance can potentially prevent the breach from occurring. To do this, machine learning needs to look at the logs showing how the user accesses and moves through the application. The devil is in the details, and logs contain the details. But often, the details are so hidden that human eyes can’t spot them; this is where machine learning can step in and augment what’s missing in a human review.

In today’s cloud-native environment, applications are deeply integrated with each other — no application is an island on its own. This being the case, many attacks occur from neighboring apps which may have escalated their privileges. It’s easy to read the news about data breaches and find cases where organizations blame their partner organizations or an integrated third-party app for a security disaster. Monitoring your own system is hard enough, and it takes much more effort and sophistication to monitor outside applications that interact with yours.

Whereas humans may overlook details when monitoring a large number of integrated applications and APIs, a machine learning algorithm can monitor every API call log, every network request that’s logged, and every kind of resource accessed by third-party applications. It can identify normal patterns as well as suspicious ones. For example, if an application utilizes a large percentage of available memory and compute for a long period of time, it is a clear trigger. A human may notice this after a few minutes or hours of it occurring, but a machine learning algorithm can spot the anomaly in the first few seconds and bring it to your attention. Similarly, it can quickly flag a spike in requests from any single application as a potential threat.
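
As a rough sketch of the idea (assuming scikit-learn is available and that features such as requests per minute and memory usage have already been extracted from the API call and resource logs), an off-the-shelf anomaly detector can surface the unusual combinations a human would miss:

import numpy as np
from sklearn.ensemble import IsolationForest

# Each row describes one application over a time window:
# [requests_per_minute, average_memory_percent], aggregated from its logs.
features = np.array([
    [120, 35], [118, 34], [125, 36], [119, 33],
    [890, 92],  # sudden spike in both requests and memory
])

model = IsolationForest(contamination=0.2, random_state=42).fit(features)
labels = model.predict(features)  # -1 marks anomalies, 1 marks normal
print("rows to investigate:", np.where(labels == -1)[0])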


Machine learning algorithms are especially good at analyzing unstructured or semi-structured data like text documents or lines of text. Logs are full of text data that needs to be analyzed, and traditional analytics tools like SQL databases are not ideally suited for log analysis. This is why newer tools like Elasticsearch have sprung up to help make sense of log data at scale. Machine learning algorithms work alongside these full-text search engines to spot patterns that are suspicious or concerning, deriving insight from the large quantities of log data being generated by applications. In today’s containerized world, the amount of log data to be analyzed is on the rise, and only with the power of machine learning can you get the most insight in the shortest time.

Conclusion

As you look to get more out of your log data, you need an intelligent logging solution like LogDNA that leverages machine learning to give you insight in a proactive manner. Algorithms are more efficient and faster than humans at reading data, and they should be used to preempt attacks by identifying triggers in log data. As you assess a logging solution, do look at its machine learning features. Similarly, as you plan your logging strategy, ensure machine learning is a key component of your plans, and that you rely not just on traditional manual human review, but leverage the power of machine learning algorithms.

Security, Technical

The Role of Log Analysis in Application Security

Security is perhaps the most complex, time-sensitive, and critical aspect of IT Ops. It’s similar to the intensive care unit (ICU) in a hospital, where the most serious cases go and any mistake can have far-reaching consequences. Just as doctors need to monitor various health metrics like heart rate, blood pressure, and more, security analysts need to monitor the health of their systems at all times – both when the system is functioning as expected, and especially when things break. To gain visibility into system health at any time, the go-to resource for any security analyst is their system logs. In this post, we cover what can go wrong in the lifetime of an application, and how security analysts can respond to these emergencies.


Common security incidents

An application can be attacked in any number of ways, but below are some of the most common types of security issues that SecOps teams deal with day in and day out.

  • Viruses & malware: A malicious script or piece of software is installed on your system and tries to disrupt or manipulate its functioning.
  • Data breaches: The data contained in your systems is an extremely valuable asset. A data breach is when this valuable data is accessed or stolen by an attacker.
  • Phishing attacks: This tactic is used on end users. An attacker poses as a genuine person or organization and lures the user into clicking a link, after which credentials such as logins and financial information are stolen.
  • Account takeovers: A cyber criminal gains access to a user’s login credentials and misuses them to their own advantage.
  • DDoS attacks: Distributed denial of service (DDoS) attacks occur when multiple infected computers target a single server and overload it with requests, denying service to genuine users.
  • SQL injection: An attacker runs malicious SQL queries against a database with the aim of reading, altering, or destroying its data.
  • Cross-site scripting: The attacker uses browser scripts to execute harmful actions on the client side.

Initial response

When an incident occurs, the first thing you need to do is estimate the impact of the attack: which parts of the system have failed, which parts are experiencing latency, how many users and accounts are affected, which databases have been compromised, and more. Understanding this will help you decide what needs to be protected right away, and what your first actions should be. The faster you act, the more you can limit the impact of an attack.

Incident management using log data

Once you’ve taken the necessary ‘first aid’ measures, you’ll need to do deeper troubleshooting to find the origin of the attack, and fully plug all vulnerabilities. When dealing with these security risks, log data is indispensable. Let’s look at the various ways you can use log data to troubleshoot, take action, and protect yourself from these types of attacks.

Application level

When an incident occurs, you probably receive an alert from one of your monitoring systems about the issue. To find the origin of the incident, where you look first will depend on the type of issue.

At the application level you want to look at how end users have used your app. You’d look at events like user logins and password changes. If user credentials have been compromised by a phishing attack, it’s hard to spot the attacker initially, but looking for any unusual behavior patterns can help.

Too many transactions in a short period of time by a single user, or transactions of very high value, can be suspicious. Changes in shipping addresses are also a clue that an attacker may be at work.
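
As a hedged illustration of that kind of check (the window size and thresholds below are arbitrary placeholders), you could group recent transaction events by user and flag anyone whose count or total value within a short window exceeds your baseline:

from collections import defaultdict
from datetime import timedelta

WINDOW = timedelta(minutes=10)
MAX_TXNS_PER_WINDOW = 5
MAX_TOTAL_VALUE = 5000

def flag_suspicious(transactions):
    # transactions: iterable of (user_id, timestamp, amount), in time order
    per_user = defaultdict(list)
    flagged = set()
    for user_id, ts, amount in transactions:
        # Keep only this user's events that fall inside the sliding window
        events = [(t, a) for t, a in per_user[user_id] if ts - t <= WINDOW]
        events.append((ts, amount))
        per_user[user_id] = events
        if len(events) > MAX_TXNS_PER_WINDOW or sum(a for _, a in events) > MAX_TOTAL_VALUE:
            flagged.add(user_id)
    return flagged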

At the application layer, you can view who has had access to your codebase. For this you could look at requests for files and data. You could filter your log data by application version, and notice if there are any older versions still in use. Older versions could be a concern as they may not include any critical security patches for known vulnerabilities. If your app is integrated with third-party applications via API, you’ll want to review how it’s been accessed.

When looking at log data it always helps to correlate session IDs with timestamps, and even compare timestamps across different pieces of log data. This gives you the full picture of what happened, how it progressed, and what the situation is now. Especially look for instances of users changing admin settings or opting for their behavior to not be tracked. These are suspicious and are clear signs of an attack. But even if you find initial signs at the application layer, you’ll need to dig deeper to the networking layer to know more about the attack.

Network level

At the network level there are many logs to view. To start with, the IP addresses and locations from which your applications were accessed are important. Also notice the device types, like USBs, IoT devices, or embedded systems. If you notice suspicious patterns like completely new IP addresses and locations, or malfunctioning devices, you can disable them or restrict their access.

In particular, notice if there are any breaches of the firewall, or cases of cookies being modified or deleted. In a legacy architecture a breach in a firewall can leave the entire system vulnerable to attackers, but with modern containerized setups, you can implement granular firewalls using tools like Project Calico.

Look for cases of downtime or performance issues like timeouts across the network; these can help you assess the impact of the attack, and maybe even find the source. The networking layer is where all traffic in your system flows, and in the event of an attack, malicious events are passing through your network. Looking at these various logs will give you visibility into what is transpiring across your network.

Going a level deeper, you can look at the health of your databases for more information.

Database level

Hackers are after the data in your system. This could include financial data like credit card information, or personally identifiable information like a social security number or home address. They could look for loosely stored passwords and the usernames associated with them. All this data is stored in databases on cloud or physical disks in your system, and it needs to be checked for breaches and secured immediately if it hasn’t already been compromised.

To spot these breaches, take a look at the log trail for data that’s been encrypted or decrypted. Especially notice how users have accessed personally identifiable information or payment-related information. Changes to file names and file paths, as well as an unusually high number of events from a single user, are all signs of a data breach.

Your databases are also used to store all your log data, and you’ll want to check whether disk space was insufficient to store all logs, or whether any logs were dropped recently. Notice if the data stored on these disks was altered or tampered with.

OS level

The first thing to check in the operating system (OS) layer should be the access controls for audit logs. This is because savvy attackers will first try to delete or alter audit logs to cover their tracks. Audit logs record all user activity in the system, and if an attack is in progress, the faster you secure your audit logs the better.

Another thing to check is whether timestamps are consistent across all your system logs. Often, attackers will change the timestamps of certain log files, making it difficult for you to correlate one piece of log data with another. This makes investigation difficult and gives the attacker more time to carry out their agenda.

Once you’re confident that your audit logs and timestamps haven’t been tampered with, you can investigate other OS-level logs to trace user activity. How users logged in and out of your system, system changes especially related to access controls, usage of admin privileges, changes to network ports, files and software that were added or removed, and commands executed recently, especially those from privileged accounts – all this information can be analyzed using your system logs.

The operating system is like the control center for your entire application stack, and commands originating from here need to be investigated during an attack. The next place to look for vulnerabilities is the hardware devices in your system.

Hardware or infrastructure level

Some recent DDoS attacks, like the one that affected Dyn last year, were the result of hackers gaining access to hardware devices owned by end users. They cracked the weak passwords of these devices and overloaded the server with requests. If end user devices are a major part of your system, they should be watched carefully as they can be easy targets for hackers. Looking at log data for how edge devices behave on the network can help spot these attacks. With the growth of the internet of things (IoT) and smart devices across all sectors, these types of attacks are becoming more common.

Additionally, if your servers or other hardware are also accessed by third-party vendors, or partner organizations, you need to keep an eye on how these organizations use your hardware and the data on it.

After the incident

Hopefully, all these sources of log data can help you as you troubleshoot and resolve attacks. After the incident, you’ll need to write a post-mortem. Even in this step, log data is critical to telling the entire story of the attack – its origin, progression, impact radius, resolution, and finally restoring the system back to normalcy.

As you can tell, log data is present at every stage of incident management and triaging. If you want to secure your system, pay attention to the logs you’re collecting. If you want to be equipped to deal with today’s large-scale attacks, you’ll need to look to your log data. There’s no doubt that logs are essential to SecOps, and understanding how to use them during an incident is an important skill to learn.

Comparison, Technical

3 Logging Use Cases

The versatility of logs allows them to be used across the development lifecycle, and to solve various challenges within an organization. Let’s look at three logging use cases from leading organizations at various stages of the product life cycle, and see what we can learn from them.


Transferwise – Improving Mobile App Reliability

Transferwise is an online payments platform for transferring money across countries easily, and their app runs across multiple platforms. One of the challenges they face with their mobile app is analyzing crashes. It’s particularly difficult to reproduce crashes with mobile as there are many more possible issues – device-specific features, carrier network, memory issues, battery drain, interference from other apps, and many more. A stack trace doesn’t have enough information to troubleshoot the issue. To deal with this, Transferwise uses logs to better understand crashes. They attach a few lines of logs to a crash report which gives them vital information on the crash.

To implement this, they use the open source tool CocoaLumberjack. It transmits crash logs to external loggers where they can be analyzed further. It enables Transferwise to print a log message to the console. You can save the log messages to the cloud or include them in a user-generated bug report. As soon as the report is sent, the user is notified that Transferwise is already working on fixing the issue. This is much better than being unaware of the crash, or ignoring it because they can’t find the root cause.

You should ensure that sensitive data is excluded from the log messages. To have more control over how log messages are reported and classified, Transferwise uses a logging policy. They classify logs into five categories – error, warning, info, debug, and verbose – each of which has a different priority level and is reported differently.

While CocoaLumberjack works only on Mac and iOS, you can find a similar tool, such as Timber or Hugo, for Android. The key point of this case study is that logging can give you additional insight into crashes, especially in challenging environments like mobile platforms. It takes a few unique tools and some processes and policies in place to ensure the solution is safe enough to handle sensitive data, but the payoff is increased visibility into application performance and the ability to use it to improve user experience.

[Read more here.]

Wealthfront – Enhancing User Experience with A/B Tests

Wealthfront is a wealth management solution that uses data analytics to help its users invest wisely and earn more over the long term. Though the Wealthfront web app is the primary interface for a user to make transactions, their mobile app is more actively engaged with and is an important part of the solution. Wealthfront is a big believer in A/B testing to improve the UI of their applications. While they have a mature A/B testing process set up for the web app, they didn’t have an equivalent for their mobile apps. As a result they just applied the same learnings across both web and mobile. This is not the best strategy, as mobile users are different from web users, and the same results won’t hold across both platforms. They needed to set up an A/B testing process for their mobile apps too.

For inspiration, they looked to Facebook, who had set up something similar for their mobile apps with Airlock – a framework for A/B testing on mobile. Wealthfront focused their efforts on four fronts – backend infrastructure, API design, the mobile client, and experiment analysis. They found logs essential for the fourth part, experiment analysis. This is because logs are a much more accurate representation of the performance and results of an experiment than a backend database. With mobile, the backend infrastructure is very loosely coupled with the frontend client, and reporting can be inaccurate if you rely on backend numbers. With logs, however, you can gain visibility into user actions and each step of a process as it executes. One reason why logging is more accurate is that the logging is coded along with the experiment. Thus, logging brings you deeper visibility into A/B testing and enables you to provide a better user experience. This is what companies like Facebook and Wealthfront have realized, and it can work for you too.

[Read more here.]

Twitter – Achieving Low Latencies for Distributed Systems

At Twitter, where distributed systems manage data at very large scale, they use high-performance replicated logs to solve various challenges brought on by distributed architectures. Leigh Stewart of Twitter comments that “Logs are a building block of distributed systems and once you understand the basic pattern you start to see applications for them everywhere.”

To implement this replicated log service they use two tools. The first is the open source Apache BookKeeper, a low-level log storage tool. They chose BookKeeper for its low latency and high durability even under peak traffic. Second, they built a tool called DistributedLog to provide higher-level features on top of BookKeeper, including naming and metadata for log streams and data segmentation policies such as log retention. Using this combination, they were able to achieve write latencies of around 10 ms, not exceeding 20 ms even for the slowest writes. This is very efficient, and it was possible because they combined the right open source and proprietary tools.

[Read more here.]

As the above examples show, logs play a vital role in various situations across multiple teams and processes. They can be used to make apps more reliable by reducing crashes, to improve the user interface through A/B tests, and to keep large distributed systems fast and dependable. As you look to improve your applications in these areas, the way these organizations have made use of logs is worth taking note of and implementing in a way that’s specific to your organization. You also need a capable log analysis platform like LogDNA to collect, process, and present your log data in a way that’s usable and actionable. Working with log data is challenging, but with the right goals, the right approach, and the right tools, you can gain a lot of value from log data and improve various aspects of your application’s performance.

Comparison, Technical

LogDNA Helps Developers Adopt the AWS Billing Model for More Cost-Effective Logging

Amazon Web Services (AWS) uses a large scale pay-as-you-go model for billing and pricing some seventy plus cloud services. LogDNA has taken a page from that same playbook and offers similar competitive scaling for our log management system. For most companies, managing data centers and a pricey infrastructure is a thing of the past. Droves of tech companies have transitioned into cloud-based services. This radical shift in housing backend data and crucial foundations has completely revolutionized the industry and created a whole new one in the process.


For such an abrupt change, one would think that an intelligent shift in pricing methods would have followed. For the majority of companies, this is simply not the case.

New industries call for new pricing arrangements. Dynamically scalable pricing is practically a necessity for data-based SaaS companies. Flexible pricing just makes sense and accounts for vast and variable customer data usage.

AWS, and for that matter, LogDNA, have taken the utilities approach to a complex problem. The end user will only pay for what they need and use. Adopting this model comes with a set of new challenges and advantages that can be turned into actionable solutions. There is no set precedent for a logging provider using the AWS billing model. We are on the frontier of both pricing and innovation of cloud logging.

LogDNA Pricing Versus a Fixed System

The LogDNA billing model is based on a pay-per-gig foundation. That means each GB used is charged on an individual basis before being totaled at the end of the month. Each plan then comes with low minimums, no daily cap, and scaling functionality.

Here is an example of a fixed tiered system with a daily cap. For simplicity’s sake, here is a four-day usage log (no pun intended) of a log management system with a 1 GB/day cap.

Monthly Plan: 30 GB Monthly – $99

Day 1: 0.2 GB

Day 2: 0.8 GB

Day 3: 1 GB

Day 4: 0.5 GB

This four day usage is equivalent to 2.5 GB logged. That’s an incredible amount of waste because of a daily cap and variable use. Let’s dive into a deeper comparison of the amount of money wasted compared to our lowest tiered plan.

LogDNA’s Birch Plan charges $1.50 per GB. If we had logged that same amount of usage with our pricing system, it would cost roughly $3.75. While the fixed system doesn’t show us the price per GB, we can compare it to LogDNA with some simple math: if a fixed rate of $99 per month buys 30 GB of usage per month, then each GB comes out to about $3.30.
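
Worked out as a quick sketch in code, using the numbers above:

# Four days of variable usage from the example above, in GB
daily_usage = [0.2, 0.8, 1.0, 0.5]
total_gb = sum(daily_usage)               # 2.5 GB

pay_per_gb_rate = 1.50                    # LogDNA Birch, $ per GB
fixed_plan_price, fixed_plan_gb = 99, 30  # fixed 30 GB/month plan

print(total_gb * pay_per_gb_rate)         # 3.75 -> pay-per-GB cost for the period
print(fixed_plan_price / fixed_plan_gb)   # 3.3  -> effective $ per GB on the fixed plan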

Can you spot the difference in not only pricing, but cloud waste as well? With a daily cap, the end-user isn’t even getting to use all of that plan anyhow. A majority of cloud users are underestimating how much they’re wasting. Along with competitive pricing, our billing model cuts down tremendously on wasted cloud spend.       

Challenges of the Model

It’s important again to note that our model is unique amongst logging providers. This unearths a number of interesting challenges. AWS itself has set a great example by publishing a number of guides and guidelines.

The large swath of AWS services (which seems to be growing by the minute) are all available on demand. For simple operations, this means that only a few services will be needed without any contracts or licensing. The scaled pricing allows the company to grow at any rate they can afford, without having to adjust their plan. This lessens the risk of provisioning too much or too little. Simply put, we scale right along with you. So there’s no need to contact a sales rep.

LogDNA as an all-in-one system deals with a number of these same challenges. The ability to track usage is a major focus area for us, so that we can ensure you have full transparency into what your systems are logging with us. Our own systems track and bill down to the MB, so that the end user can have an accurate picture of spend compared to usage rates. This is not only helpful, but allows us to operate in a transparent manner with no hidden fees. Though it is powered by a complex mechanism internally, it provides a simplified, transparent billing experience for our customers.

LogDNA users have direct control over their billing. While this may seem like just another thing to keep track of, it’s rather a powerful form of agency you can now use to take control of your budget and monetary concerns. Users can take their methodical logging mentality and apply that to their own billing process, allowing greater control over budgets and scale.   

Say, for example, that there is an unexpected spike in data volume. Your current pricing tier will be able to handle the surge without any changes to your LogDNA plan. As an added bonus, we also notify you in the event of a sudden increase in volume. And because the stream of log data is ever-changing, we also offer ingestion controls so that you can exclude logs you don’t need and not be billed for them.

Our focus on transparency as part of the user experience not only builds trust, but also fosters a sense of partnership.

Scaling for All Sizes & Purposes

Our distinctly tiered system takes into account how many team members (users) will be using LogDNA on the same instance and the length of retention (historical log data access for metrics and analytics purposes). Additionally, we have a HIPAA-compliant scaled pricing tier for protected health information, which includes a Business Associate Agreement (BAA) for handling sensitive data.

Pictured here is a brief chart of some basic scaled prices for our three initial individual plans. The full scope of the plans is listed here. This is a visualization of a sample plan for each tier.

Plan Estimator

BIRCH – $1.50 /GB – Retention: 7 Days – Up to 5 Users
Monthly GB Used 1GB 4GB 16GB 30GB
Cost Per Month $1.50 $6 $24 $45

Monthly Minimum: $3.00

MAPLE – $2.00 /GB – Retention: 14 Days – Up to 10 Users
Monthly GB Used 10GB 30GB 120GB 1TB
Cost Per Month $20 $60 $240 $2000

Monthly Minimum: $20.00

OAK – $3.00 /GB – Retention: 30 Days – Up to 25 Users
Monthly GB Used 50GB 60GB 150GB 1TB
Cost Per Month $150 $180 $450 $3,000

Monthly Minimum: $100.00

Custom Solutions for All & Competitive Advantages

Many pricing systems attempt to offer a one-size-fits-all model. Where they miss the mark, we succeed with usability that scales from small shops to large enterprise solutions. Our Willow (free) plan is a single-user system that lets an individual see whether a log management system is right for their project before moving a collaborative team effort onto a paid tier. High-data plans are also customized and still retain the AWS billing model. We also offer a full-featured 14-day trial.

The adoption of this model creates a competitive advantage in the marketplace for both parties. LogDNA can provide services to all types of individuals and companies with a fair, transparent pricing structure. The end user is given all relevant data usage and pricing information, along with useful tools to manage it as they see fit.

For example, imagine you are logging conservatively, focusing only on the essentials like poor performance and exceptions. In the middle of putting out a fire, your engineering team realizes that they are missing crucial diagnostic information. Once the change is made, those new log lines will start flowing into LogDNA without ever having to spend time mulling over how to adjust your plan. Having direct control over your usage and spending without touching billing is enormously beneficial to not only our customers, but also reduces our own internal overhead for managing billing.

Competitive Scenario – Bridging the Divide Between Departments

Picture this scenario: there has been an influx of users experiencing difficulty while using your app. The support team has been receiving ticket after ticket. Somewhere there is a discrepancy between what the user is doing and what the app is returning. The support team needs to figure out why these users are having difficulty using the app. These support inquiries have stumped the department, so the director asks the engineering team how they can retrieve pertinent information to remedy the issue.

LogDNA helps bridge the divide by providing the support team with information relevant to the problem at hand. For this particular example, the engineering team instruments new code to log all customer interactions with API endpoints. The support team now has a broader view of how users are interacting with the interface; they’ve been equipped with a new tool in their arsenal from the engineers, and nothing was lost in translation between the departments during this exchange.

After looking through the new logged information, the support team is able to solve the problem many of its users were experiencing. The support team has served its purpose by responding to these inquiries and making the end-user happy. All it took was some collaboration between two different departments.   

The log volume has increased due to the new logs being funneled through the system, but the trade-off between increased log volume and better support is worth it. During this whole process no changes are required to your current account plan with LogDNA. Future issues that arise will be easier to fix as a result of this diagnostic information being readily available. The cost of losing users outweighs the cost of extra logs.

LogDNA places the billing model on an equal level of importance with the log management software itself. It can be used to inform decisions all across the board. LogDNA’s billing model adapts to budgetary concerns, the user experience, and a better grasp of your own data all at once.