ProTip, Technical

Best Practices for Log Alerts

Application logs are more than just tools for root cause analysis. They’re also a way to gain insight into critical events such as a loss in sales, server warnings, HTTP errors, performance, and numerous other activities that impact business productivity. Logs are just thousands of lines of raw data, but they can be parsed and leveraged to provide a better understanding of what goes on under the hood of your application. It’s common for developers to set up logging for code exceptions within the application, but logging provides far more benefits than just bug tracking and should be used to alert administrators to issues that need their attention.

Application and Server Performance

Any administrator who has dealt with performance issues will tell you that they’re among the most difficult problems to analyze and pinpoint a root cause for. Performance degradation can occur at certain times of the day, during an active DDoS attack, or for what seems like no reason at all. QA can do performance testing on your application, but these tests rarely represent a production environment that supports thousands of concurrent users. For most organizations, performance issues surface during periods of business growth and can hamper further expansion. Performance problems are also frustrating because they’re unforeseen and rarely a quick fix for developers and administrators.

Using log alerts, you can assess what’s happening when an application’s performance is diminished. It could be a specific event, such as a poorly optimized database query, or a spike in CPU usage. Log these types of events, and you can identify not only when application performance will wane but also when server resources could be exhausted. Instead of suffering a server crash, logging these events gives you insight into when it might be time to upgrade server hardware. These logs also help you pinpoint the components of your application that could be causing performance degradation.

Your developers would need to set a baseline, but you could, for instance, set an alert for any event that takes longer than 100 milliseconds to process. When you see a pattern, you can have developers investigate those application components further and optimize them.

Similarly, set an alert to inform administrators when CPU usage spikes above 80%. The fix could be as simple as adding RAM or upgrading your server’s CPU, but alerts on these logs give you the ability to analyze the time of day and any patterns in application activity surrounding the spikes.
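Here’s a rough sketch of what this kind of threshold-based alerting could look like in Python. The record fields (duration_ms, cpu_percent, host) and the send_alert() helper are illustrative placeholders rather than part of any particular logging product:

    # Minimal sketch: scan structured log records and alert when thresholds
    # are crossed. Field names and send_alert() are hypothetical placeholders.
    import json

    SLOW_EVENT_MS = 100      # latency baseline agreed on with developers
    CPU_ALERT_PERCENT = 80   # CPU threshold for administrator alerts

    def send_alert(message):
        # Stand-in for email, a pager, or a LogDNA alert channel.
        print("ALERT: " + message)

    def scan_log_line(line):
        record = json.loads(line)
        if record.get("duration_ms", 0) > SLOW_EVENT_MS:
            send_alert("%s took %dms" % (record.get("operation", "event"), record["duration_ms"]))
        if record.get("cpu_percent", 0) > CPU_ALERT_PERCENT:
            send_alert("CPU at %d%% on %s" % (record["cpu_percent"], record.get("host", "unknown")))

    # Example with a couple of hypothetical records:
    for line in [
        '{"operation": "checkout_query", "duration_ms": 240, "host": "web-1"}',
        '{"cpu_percent": 91, "host": "web-2"}',
    ]:
        scan_log_line(line)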

Failed Sales Events

Most applications log exceptions so that developers can look back into errors and provide updates. But not every exception is created equal. The most critical events are those that impact business revenue; these are the ones you should monitor closely, sending alerts to the right response team.

It shouldn’t be your customers calling to tell you that there are bugs in your application. You want to find them before they do, and many customers will just bounce to another vendor if you have too many bugs. Many of your customers won’t report application issues at all, so you could be losing sales every hour and never know it.

When you build a shopping cart, there should be a point where you ask visitors for a way to contact them; email is a common input field during account creation. With an alert on failed shopping cart activity, you have a way to contact a customer should they receive an error and bail on the shopping experience. Alerts are a great tool for salvaging a customer who would otherwise be lost to application bugs.

But you also need alerts to tell you when components of your application are creating obstacles for customer shopping experiences. It could be an alert for performance (similar to the previous example), or your customers could be dropping at a specific point in the sales funnel. Alerts give you insights into the efficacy of your marketing and user experience. Use them generously to identify issues with your sales pages and find solutions for a better user experience.

Security and Suspicious Activity

Developers are good at logging exceptions, but they don’t usually account for security events. Security events can be any suspicious behavior, such as automated logins from account takeover (ATO) tools using stolen customer data, repeated patterns of failed admin login attempts, ACL changes, or new accounts being created.

Some security events are typically triggered from the database, but that limits your logging to database-specific activity. Use application logging and alerts to make the right people aware of suspicious activity on any of your servers, not just the database.

With ATOs, an attacker will use automated software to log into a customer’s account and use purchased credit card data to buy products to test whether the cards are viable. Logs should be used to detect this type of activity and alert administrators to suspicious events.
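As a sketch of the idea, the snippet below flags one simple heuristic: repeated failed logins from a single IP address in a short window, which often accompanies credential-stuffing and ATO attempts. The log format, window, and threshold are illustrative assumptions:

    # Count recent failed logins per IP and flag IPs that exceed a threshold.
    from collections import defaultdict
    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=5)   # assumed detection window
    MAX_FAILURES = 10               # assumed failure threshold

    def detect_suspicious_ips(events):
        """events: iterable of (timestamp, ip, success) tuples parsed from auth logs."""
        failures = defaultdict(list)
        suspicious = set()
        for ts, ip, success in events:
            if success:
                continue
            failures[ip] = [t for t in failures[ip] if ts - t <= WINDOW]
            failures[ip].append(ts)
            if len(failures[ip]) >= MAX_FAILURES:
                suspicious.add(ip)
        return suspicious

    now = datetime.utcnow()
    sample = [(now + timedelta(seconds=i), "203.0.113.7", False) for i in range(12)]
    print(detect_suspicious_ips(sample))  # {'203.0.113.7'}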

Any modifications to security permissions or authorization should also be logged and alerts sent. This could be elevation of permissions for any specific user, new routing rules configured on your infrastructure, or user access to critical files. Security events are a primary method organizations use to identify attacks before they become catastrophic data breaches.

How Do You Set Up Alerts?

You need the right data before you can set up alerts. The data that you log is up to you, but some standard data points are needed to create efficient logs that scale across your infrastructure. Some common data points include:

  • Date and time
  • Host name
  • Application name
  • Customer or account that experienced the error
  • IP address or other geographic markers for the visitor
  • Raw exception information
  • Line number where it occurred (if applicable)
  • Type of error (fatal, warning, etc.)
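As a quick illustration, here’s one way to emit a log record that carries these data points using Python’s standard logging module with a JSON formatter. The field names and the application name are assumptions you would adapt to your own schema:

    import json, logging, socket, traceback

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            entry = {
                "timestamp": self.formatTime(record),
                "host": socket.gethostname(),
                "app": "checkout-service",              # application name (assumed)
                "account": getattr(record, "account", None),
                "ip": getattr(record, "ip", None),
                "level": record.levelname,              # type of error
                "line": record.lineno,                  # line number of the logging call
                "message": record.getMessage(),
            }
            if record.exc_info:                         # raw exception information
                entry["exception"] = traceback.format_exception_only(*record.exc_info[:2])[0].strip()
            return json.dumps(entry)

    logger = logging.getLogger("checkout")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.error("Payment failed", extra={"account": "cust-4821", "ip": "198.51.100.23"})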

You could write a logging application, or you could save time and configuration hassles by using LogDNA. With built-in features for collecting log data and sending alerts, you can save the months of development and testing a homegrown solution would require.

Instead of using logs only for basic events and retroactive troubleshooting, an organization’s best practices should include logging activity that gives administrators the insight to patch issues before they become catastrophic. LogDNA provides the tools and alerts organizations can leverage to avoid revenue-impacting bugs and server errors.

Security, Technical

Building Secure Applications with Machine Learning & Log Data

In recent years, machine learning has swept across the world of software delivery and is changing the way applications are built, shipped, monitored, and secured. Log monitoring is one of the fields that keeps evolving with the new capabilities machine learning affords.

What Is Machine Learning?

Machine learning is the process of using algorithms and computer intelligence to analyze and make sense of large quantities of complex data that would be difficult or impossible for a human security analyst to work through. There are many forms of machine learning, from algorithms that can be trained to replicate human decision-making at scale to algorithms that take an open-ended approach, finding interesting pieces of data with little input or guidance. Differences aside, machine learning looks to get more value out of data than humans can extract manually. This is critical for security, which deals with large quantities of data and often misses the important signals, or catches them too late.

Machine learning is made possible by the power of cloud computing, which makes crunching big data cheaper and faster. Analyzing large quantities of complex data takes a lot of computing power, readily available memory, and fast networking that’s optimized for scale. Cloud vendors today provide GPU (graphical processing unit) instances that are particularly well suited for machine learning. An alternative is to use numerous cheap servers and a distributed approach to analyze the data at scale, something made practical by advances in distributed computing over the past few years.

Additionally, cloud storage with fast I/O speeds is necessary for complex queries to be executed within a short period of time. AWS itself offers multiple storage solutions like Amazon EFS, EBS, and S3. Each of these serves a different purpose and is ideal for different types of data workloads. The cloud has stepped up in terms of compute, memory, storage, and the overall tooling available to support machine learning.

Security Threats Are Becoming Increasingly Complex

The primary use case for machine learning in log analysis has to do with security. There are many forms of security attacks today, which range in complexity. Email phishing attacks, promo code abuse, credit card theft, account takeover, and data breaches are some of the security risks that log analysis can help protect against. According to the Nilson Report, these types of security attacks cost organizations a whopping $21.8 billion each year. And that doesn’t even include the intangible costs associated with losses in trust, customer relationships, brand value, and more that stem from a security attack. Security threats are real, and log analysis can and should be used to counter them.

Attackers are becoming more sophisticated as they look to new technology and tools to carry out their attacks. Indeed, criminals themselves are early adopters of big data, automation tools and more. They use masking software to hide their tracks, bots to conduct attacks at scale, and in some cases have armies of humans assisting in the attack. With such a coordinated effort, they can easily break through the weak defenses of most organizations. That’s why you see some of the biggest web companies, the most highly secured government institutions, and the fastest growing startups all fall victim to these attackers, who can breach their defense easily.

SecOps Is Predominantly Manual

Since much of security is conducted by humans, or at most by tools that use rules to authorize or restrict access, attackers eventually understand the rules and find ways to breach them. They may find a limit to the number of requests a gatekeeper tool can handle per second, and then bombard the tool with more requests than the limit. Or they may find unsecured IoT devices on the edge of the network that can be easily compromised and taken control of. This was the case with the famous Dyn DDoS attack, which leveraged unsecured DVRs and then bombarded the Dyn network, taking down with it a large percentage of the internet’s top websites that relied on Dyn for DNS services. The point is that manual security reviews, and even rule-based security, don’t scale and are not enough to secure systems against the most sophisticated attackers. What’s needed is a machine learning approach to security, one that leverages logs to detect and stop attacks before they escalate.


Many security risks occur at the periphery of the system, so it’s essential to keep a close watch on all possible entry points. End users access the system from outside, and can sometimes knowingly or unknowingly compromise the system. You should be able to spot a malicious user from the smallest of triggers; for example, an IP address or a geo location that is known to be suspicious should be investigated. Login attempts are another giveaway that your system may be under attack. Frequent unsuccessful login attempts are a bad sign and need to be further investigated. Access logs need to be watched closely to spot these triggers, and it is best done by a machine learning algorithm.

While manual and rule-based review can work to a certain point, increasingly sophisticated attacks are best thwarted by using machine learning. You may need to crawl external third-party data to identify fraudulent activity, and it helps to look not just inside, but outside of your organization for data that can shed light on suspicious activity. But with growing data sets to analyze, you need more than basic analytics — you need the scale and power of machine learning to spot patterns, and find the needle in the haystack. Correlating your internal log data with external data sets is a challenge, but it can be done with machine learning algorithms that look for patterns in large quantities of unstructured data.

Machine Learning For Security

Machine learning can go further in spotting suspicious patterns from multiple pieces of data. It can look at two different pieces of data, sometimes not obviously associated with each other, and highlight a meaningful pattern. For example, if a new user accesses parts of the system that are sensitive, or tries to gain access to confidential data, a machine learning algorithm can spot this from looking at their browsing patterns or the requests they make. It can decipher that this user is likely looking to breach the system and may be dangerous. Highlighting this behavior an hour or two in advance can potentially prevent the breach from occurring. To do this, machine learning needs to look at the logs showing how the user accesses and moves through the application. The devil is in the details, and logs contain the details. But often, the details are so hidden that human eyes can’t spot them; this is where machine learning can step in and augment what’s missing in a human review.
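As a concrete, if simplified, illustration: one generic approach is unsupervised anomaly detection over per-user features extracted from access logs. The feature set below (requests per minute, distinct endpoints touched, error ratio) and the use of scikit-learn’s IsolationForest are assumptions for the sketch, not a description of any vendor’s algorithm:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # One row per user session: requests/min, distinct endpoints, 4xx error ratio.
    rng = np.random.default_rng(0)
    normal = rng.normal(loc=[20, 5, 0.02], scale=[5, 2, 0.01], size=(500, 3))
    suspect = np.array([[400, 60, 0.35]])   # hammering many sensitive endpoints
    X = np.vstack([normal, suspect])

    model = IsolationForest(contamination=0.01, random_state=0).fit(X)
    labels = model.predict(X)               # -1 marks anomalies
    print("anomalous rows:", np.where(labels == -1)[0])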

In today’s cloud-native environment, applications are deeply integrated with each other — no application is an island on its own. This being the case, many attacks occur from neighboring apps which may have escalated their privileges. It’s easy to read the news about data breaches and find cases where organizations blame their partner organizations or an integrated third-party app for a security disaster. Monitoring your own system is hard enough, and it takes much more effort and sophistication to monitor outside applications that interact with yours.

Whereas humans may overlook the details when monitoring a large number of integrated applications and APIs, a machine learning algorithm can monitor every API call log, every network request that’s logged, and every kind of resource accessed by third-party applications. It can identify normal patterns as well as suspicious ones. For example, if an application utilizes a large percentage of available memory and compute for a long period of time, it is a clear trigger. A human may notice this after a few minutes or hours of it occurring, but a machine learning algorithm can spot the anomaly in the first few seconds, and bring it to your attention. Similarly, it can highlight a spike in requests from any single application quickly, and highlight that this may be a threat.


Machine learning algorithms are especially good at analyzing unstructured or semi-structured data like text documents or lines of text. Logs are full of text data that needs to be analyzed, and traditional analytics tools like SQL databases are not ideally suited for log analysis. This is why newer tools like Elasticsearch have sprung up to help make sense of log data at scale. Machine learning algorithms work alongside these full-text search engines to spot patterns that are suspicious or concerning, deriving that insight from the large quantities of log data generated by applications. In today’s containerized world, the amount of log data to be analyzed is on the rise, and only with the power of machine learning can you get the most insight in the shortest time.

Conclusion

As you look to get more out of your log data, you need an intelligent logging solution like LogDNA that leverages machine learning to give you insight in a proactive manner. Algorithms are more efficient and faster than humans at reading data, and they should be used to preempt attacks by identifying triggers in log data. As you assess a logging solution, do look at its machine learning features. Similarly, as you plan your logging strategy, ensure machine learning is a key component of your plans, and that you rely not just on traditional manual human review, but leverage the power of machine learning algorithms.

Security, Technical

The Role of Log Analysis in Application Security

Security is perhaps the most complex, time-sensitive, and critical aspect of IT Ops. It’s similar to the ICU (intensive care unit) in a hospital, where the most serious cases go and any mistake can have far-reaching consequences. Just as doctors need to monitor various health metrics like heart rate, blood pressure, and more, Security Analysts need to monitor the health of their systems at all times – both when the system is functioning as expected, and especially when things break. To gain visibility into system health at any time, the go-to resource for any Security Analyst is their system logs. In this post, we cover all that can go wrong in the lifetime of an application, and how Security Analysts can respond to these emergencies.


Common security incidents

An application can be attacked in any number of ways, but below are some of the most common types of security issues that SecOps teams deal with day in and day out.

  • Viruses & malware: A malicious script or piece of software is installed on your system and tries to disrupt or manipulate its functioning.
  • Data breaches: The data contained in your systems is an extremely valuable asset. A data breach is when this valuable data is accessed or stolen by an attacker.
  • Phishing attacks: An attacker poses as a genuine person or organization and lures the user into clicking a link, after which credentials such as logins and financial information are stolen.
  • Account takeovers: A cybercriminal gains access to a user’s login credentials and misuses them to their own advantage.
  • DDoS attacks: Distributed denial of service (DDoS) attacks occur when multiple infected computers target a single server and overload it with requests, forcing it to deny service to genuine users.
  • SQL injection: An attacker runs malicious SQL queries against a database with the aim of reading or manipulating its data, or taking the database down.
  • Cross-site scripting: The attacker uses browser scripts to execute harmful actions on the client side.

Initial response

When an incident occurs, the first thing you need to do is estimate the impact of the attack.

Which parts of the system have failed, which parts are experiencing latency, how many users and accounts are affected, and which databases have been compromised? Understanding this will help you decide what needs to be protected right away and what your first actions should be. The faster you act, the more you reduce the impact of an attack.

Incident management using log data

Once you’ve taken the necessary ‘first aid’ measures, you’ll need to do deeper troubleshooting to find the origin of the attack, and fully plug all vulnerabilities. When dealing with these security risks, log data is indispensable. Let’s look at the various ways you can use log data to troubleshoot, take action, and protect yourself from these types of attacks.

Application level

When an incident occurs, you’ll probably receive an alert about the issue from one of your monitoring systems. Where you look first to find the origin of the incident will depend on the type of issue.

At the application level, you want to look at how end users have used your app. You’d look at events like user logins and password changes. If user credentials have been compromised by a phishing attack, it’s hard to spot the attacker initially, but looking for unusual behavior patterns can help.

Too many transactions in a short period of time by a single user, or transactions of very high value, can be suspicious. Changes in shipping addresses are also a clue that an attacker may be at work.

At the application layer, you can also see who has had access to your codebase; for this, you could look at requests for files and data. You could filter your log data by application version and notice whether any older versions are still in use. Older versions are a concern because they may be missing critical security patches for known vulnerabilities. If your app is integrated with third-party applications via API, you’ll want to review how it has been accessed.

When looking at log data it always helps to correlate session IDs with timestamps, and even compare timestamps across different pieces of log data. This gives you the full picture of what happened, how it progressed, and what the situation is now. Especially look for instances of users changing admin settings or opting for their behavior to not be tracked. These are suspicious and are clear signs of an attack. But even if you find initial signs at the application layer, you’ll need to dig deeper to the networking layer to know more about the attack.
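To make that correlation concrete, here’s a small sketch that groups parsed events by session ID and orders them by timestamp so you can replay what a user did. The field names are assumptions about how your logs are structured:

    from collections import defaultdict
    from datetime import datetime

    events = [
        {"session": "abc123", "ts": "2018-03-01T10:02:11", "event": "login"},
        {"session": "abc123", "ts": "2018-03-01T10:02:40", "event": "changed admin settings"},
        {"session": "xyz789", "ts": "2018-03-01T10:01:05", "event": "login"},
        {"session": "abc123", "ts": "2018-03-01T10:01:59", "event": "password change"},
    ]

    # Build a per-session timeline, sorted by timestamp.
    timeline = defaultdict(list)
    for e in events:
        timeline[e["session"]].append((datetime.fromisoformat(e["ts"]), e["event"]))

    for session, entries in timeline.items():
        print(session)
        for ts, event in sorted(entries):
            print("  %s  %s" % (ts.isoformat(), event))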

Network level

At the network level there are many logs to view. To start with, the IP addresses and locations from which your applications were accessed are important. Also notice device types such as USB devices, IoT devices, or embedded systems. If you notice suspicious patterns, such as completely new IP addresses and locations or malfunctioning devices, you can disable them or restrict their access.

In particular, look for any breaches of the firewall or cases of cookies being modified or deleted. In a legacy architecture, a breach in a firewall can leave the entire system vulnerable to attackers, but with modern containerized setups you can implement granular firewalls using tools like Project Calico.

Look for cases of downtime or performance issues like timeouts across the network; these can help you assess the impact of the attack, and maybe even find the source. The networking layer is where all the traffic in your system flows, and in the event of an attack, malicious events pass through your network. Looking at these logs gives you visibility into what is transpiring across your network.

Going a level deeper, you can look at the health of your databases for more information.

Database level

Hackers are after the data in your system. This could include financial data like credit card information, or even personally identifiable information like a social security number or home address. They could also look for insecurely stored passwords and the usernames associated with them. All of this data lives in databases on cloud or physical disks in your system, and it needs to be checked for breaches and secured immediately if it hasn’t already been compromised.

To spot these breaches, take a look at the log trail for data that’s been encrypted or decrypted. Pay special attention to how users have accessed personally identifiable information or payment-related information. Changes to file names and file paths, or an unusually high number of events from a single user, are all signs of a data breach.

Your databases are also used to store all your log data, and you’ll want to check whether disk space was insufficient to store all logs, or whether any logs were dropped recently. Check whether the data stored on these disks has been altered or tampered with.

OS level

The first thing to check in the operating system (OS) layer should be the access controls for audit logs. This is because savvy attackers will first try to delete or alter audit logs to cover their tracks. Audit logs record all user activity in the system, and if an attack is in progress, the faster you secure your audit logs the better.

Another thing to check is whether timestamps are consistent across all your system logs. Attackers will often change the timestamps of certain log files, making it difficult for you to correlate one piece of log data with another. This complicates the investigation and gives the attacker more time to carry out their agenda.

Once you’re confident that your audit logs and timestamps haven’t been tampered with, you can investigate other OS level logs to trace user activity. How users logged in and out of your system, system changes (especially those related to access controls), usage of admin privileges, changes in network ports, files and software that were added or removed, and commands executed recently (especially those from privileged accounts) – all of this information can be analyzed using your system logs.

The operating system is like the control center for your entire application stack, and commands originating from here need to be investigated during an attack. The next place to look for vulnerabilities is the hardware devices in your system.

Hardware or infrastructure level

Some recent DDoS attacks, like the one that affected Dyn last year, were the result of hackers gaining access to hardware devices owned by end users. The attackers cracked the weak passwords of these devices and used them to overload the target’s servers with requests. If end user devices are a major part of your system, they should be watched carefully, as they can be easy targets for hackers. Looking at log data for how edge devices behave on the network can help spot these attacks. With the growth of the internet of things (IoT) and smart devices across all sectors, these types of attacks are becoming more common.

Additionally, if your servers or other hardware are also accessed by third-party vendors, or partner organizations, you need to keep an eye on how these organizations use your hardware and the data on it.

After the incident

Hopefully, all these sources of log data can help you as you troubleshoot and resolve attacks. After the incident, you’ll need to write a post-mortem. Even in this step, log data is critical to telling the entire story of the attack – its origin, progression, impact radius, resolution, and finally the restoration of the system to normalcy.

As you can tell, log data is present at every stage of incident management and triaging. If you want to secure your system, pay attention to the logs you’re collecting, and if you want to be equipped to deal with today’s large-scale attacks, you’ll need to look to your log data. There’s no doubt that logs are essential to SecOps, and understanding how to use them during an incident is an important skill to learn.

Comparison, Technical

3 Logging Use Cases

The versatility of logs allows them to be used across the development lifecycle, and to solve various challenges within an organization. Let’s look at three logging use cases from leading organizations at various stages of the product life cycle, and see what we can learn from them.


Transferwise – Improving Mobile App Reliability

Transferwise is an online payments platform for transferring money across countries easily, and their app runs across multiple platforms. One of the challenges they face with their mobile app is analyzing crashes. It’s particularly difficult to reproduce crashes on mobile, as there are many more possible causes – device-specific features, carrier networks, memory issues, battery drain, interference from other apps, and so on. A stack trace doesn’t have enough information to troubleshoot the issue. To deal with this, Transferwise uses logs to better understand crashes: they attach a few lines of logs to a crash report, which gives them vital information on the crash.

To implement this, they use the open source tool CocoaLumberjack, which lets Transferwise print log messages to the console and transmits crash logs to external loggers where they can be analyzed further. You can save the log messages to the cloud or include them in a user-generated bug report. As soon as the report is sent, the user is notified that Transferwise is already working on fixing the issue. This is much better than being unaware of the crash, or ignoring it because the root cause can’t be found.

Be sure to exclude sensitive data from the log messages. To have more control over how log messages are reported and classified, Transferwise uses a logging policy: logs are classified into five categories – error, warning, info, debug, and verbose – each with a different priority level and its own reporting behavior.
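The same five-level idea maps closely onto most logging libraries. Purely as an illustration (CocoaLumberjack itself is an Objective-C/Swift library), here is a rough Python analogue that routes the levels to different destinations:

    import logging

    VERBOSE = 5                              # custom level below DEBUG
    logging.addLevelName(VERBOSE, "VERBOSE")

    logger = logging.getLogger("mobile-app")
    logger.setLevel(VERBOSE)

    console = logging.StreamHandler()
    console.setLevel(logging.DEBUG)          # debug and above go to the console
    logger.addHandler(console)

    crash_report = logging.FileHandler("crash_report.log")
    crash_report.setLevel(logging.ERROR)     # only errors ride along with crash reports
    logger.addHandler(crash_report)

    logger.log(VERBOSE, "entered checkout flow")    # kept local, filtered from both handlers
    logger.error("payment SDK returned an error")   # lands in the crash report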

While CocoaLumberjack works only on Mac and iOS, you can find similar tools like Timber or Hugo for Android. The key point of this case study is that logging can give you additional insight into crashes, especially in challenging environments like mobile platforms. It takes a few specialized tools, along with processes and policies to ensure the solution handles sensitive data safely, but the value is in increased visibility into application behavior and the ability to use it to improve the user experience.

[Read more here.]

Wealthfront – Enhancing User Experience with A/B Tests

Wealthfront is a wealth management solution that uses data analytics to help its users invest wisely and earn more over the long term. Though the Wealthfront web app is the primary interface for a user to make transactions, their mobile app is more actively engaged with and is an important part of the solution. Wealthfront is a big believer in A/B testing to improve the UI of their applications. While they have a mature A/B testing process set up for the web app, they didn’t have an equivalent for their mobile apps. As a result, they just applied the same learnings across both web and mobile. This is not the best strategy, as mobile users are different from web users, and the same results won’t hold across both platforms. They needed to set up an A/B testing process for their mobile apps too.

For inspiration, they looked to Facebook, which had set up something similar for their mobile apps with Airlock – a framework for A/B testing on mobile. Wealthfront focused their efforts on four fronts – backend infrastructure, API design, the mobile client, and experiment analysis. They found logs essential for the fourth part, experiment analysis. This is because logs are a much more accurate representation of the performance and results of an experiment than a backend database. With mobile, the backend infrastructure is very loosely coupled with the frontend client, and reporting can be inaccurate if you rely on backend numbers. With logs, however, you can gain visibility into user actions and each step of a process as it executes. One reason why logging is more accurate is that the logging is coded along with the experiment. Thus, logging brings you deeper visibility into A/B testing and enables you to provide a better user experience. This is what companies like Facebook and Wealthfront have realized, and it can work for you too.
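In practice, “logging coded along with the experiment” can be as simple as emitting an exposure event at assignment time and a conversion event when the user completes the action being measured. The names below (experiment, variant, helper functions) are illustrative, not Wealthfront’s actual implementation:

    import json, logging, random, time

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    logger = logging.getLogger("experiments")

    def log_event(event, **fields):
        logger.info(json.dumps(dict(ts=time.time(), event=event, **fields)))

    def assign_variant(user_id, experiment):
        variant = random.choice(["control", "treatment"])
        log_event("exposure", user=user_id, experiment=experiment, variant=variant)
        return variant

    def record_conversion(user_id, experiment):
        log_event("conversion", user=user_id, experiment=experiment)

    variant = assign_variant("user-42", "new_onboarding_flow")
    if variant == "treatment":
        record_conversion("user-42", "new_onboarding_flow")

Because every exposure and conversion lands in the logs, the analysis can be run against the log stream itself rather than a loosely coupled backend table.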

[Read more here.]

Twitter – Achieving Low Latencies for Distributed Systems

At Twitter where they run distributed systems to manage data at very large scale, they use high-performance replicated logs to solve various challenges brought on by distributed architectures. Leigh Stewart of Twitter comments that “Logs are a building block of distributed systems and once you understand the basic pattern you start to see applications for them everywhere.”

To implement this replicated log service they use two tools. The first is the open source Apache BookKeeper, a low-level log storage tool. They chose BookKeeper for its low latency and high durability even under peak traffic. Second, they built a tool called DistributedLog to provide higher level features on top of BookKeeper. These features include naming and metadata for log streams and data management policies like log retention and segmentation. Using this combination, they were able to achieve write latencies of around 10ms, with even the slowest writes not exceeding 20ms. This level of efficiency is possible because they combined the right open source and purpose-built tools.

[Read more here.]

As the above examples show, logs play a vital role in various situations across multiple teams and processes. They can be used to make apps more reliable by reducing crashes, improve the user interface through A/B tests, and keep latencies low in large-scale distributed systems. As you look to improve your applications in these areas, the way these organizations have made use of logs is worth noting and implementing in a way that’s specific to your organization. You also need a capable log analysis platform like LogDNA to collect, process, and present your log data in a way that’s usable and actionable. Working with log data is challenging, but with the right goals, the right approach, and the right tools, you can gain a lot of value from log data to improve various aspects of your application’s performance.

Comparison, Technical

LogDNA Helps Developers Adopt the AWS Billing Model for More Cost-Effective Logging

Amazon Web Services (AWS) uses a large-scale pay-as-you-go model for billing and pricing some seventy-plus cloud services. LogDNA has taken a page from that same playbook and offers similar competitive scaling for our log management system. For most companies, managing data centers and pricey infrastructure is a thing of the past. Droves of tech companies have transitioned to cloud-based services. This radical shift in housing backend data and crucial foundations has completely revolutionized the industry and created a whole new one in the process.


Given such an abrupt change, one would think an equally intelligent shift in pricing methods would have followed. For the majority of companies, this is simply not the case.

New industries call for new pricing arrangements. Dynamically scalable pricing is practically a necessity for data-based SaaS companies. Flexible pricing just makes sense and accounts for vast and variable customer data usage.

AWS, and for that matter LogDNA, have taken a utility approach to a complex problem: the end user only pays for what they need and use. Adopting this model comes with a set of new challenges and advantages that can be turned into actionable solutions. There is no set precedent for a logging provider using the AWS billing model. We are on the frontier of both pricing and cloud logging innovation.

LogDNA Pricing Versus a Fixed System

The LogDNA billing model is built on a pay-per-GB foundation: each GB used is charged individually and totaled at the end of the month. Each plan therefore has low minimums, no daily cap, and pricing that scales with your usage.

Here is an example of a fixed, tiered system with a daily cap. For simplicity’s sake, here is a four-day usage log (no pun intended) of a log management system with a 1 GB/day cap.

Monthly Plan: 30 GB Monthly – $99

Day 1: 0.2 GB

Day 2: 0.8 GB

Day 3: 1 GB

Day 4: 0.5 GB

This four day usage is equivalent to 2.5 GB logged. That’s an incredible amount of waste because of a daily cap and variable use. Let’s dive into a deeper comparison of the amount of money wasted compared to our lowest tiered plan.

LogDNA’s Birch Plan charges $1.50 per GB. If we had logged that same amount of usage with our pricing system, it would cost roughly $3.75. While the fixed system doesn’t show us the price per GB, we can compare it to LogDNA with some simple math: if a fixed monthly plan of $99 covers 30 GB per month, then each GB effectively costs about $3.30.
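The arithmetic above is easy to sanity-check in a few lines (prices taken from the example in the text):

    daily_usage_gb = [0.2, 0.8, 1.0, 0.5]
    total_gb = sum(daily_usage_gb)                       # 2.5 GB over four days

    birch_per_gb = 1.50                                  # LogDNA Birch rate
    fixed_monthly, fixed_gb_cap = 99, 30                 # fixed plan from the example
    fixed_effective_per_gb = fixed_monthly / fixed_gb_cap

    print("Pay-per-GB cost: $%.2f" % (total_gb * birch_per_gb))   # $3.75
    print("Fixed plan $/GB: $%.2f" % fixed_effective_per_gb)      # $3.30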

Can you spot the difference, not only in pricing but in cloud waste as well? With a daily cap, the end user never even gets to use the full plan. A majority of cloud users underestimate how much they’re wasting. Along with competitive pricing, our billing model cuts down tremendously on wasted cloud spend.

Challenges of the Model

It’s important again to note that our model is unique amongst logging providers. This raises a number of interesting challenges. AWS itself has set a great example by publishing a number of guides and guidelines.

The large swath of AWS services (which seems to be growing by the minute) are all available on demand. For simple operations, this means that only a few services will be needed without any contracts or licensing. The scaled pricing allows the company to grow at any rate they can afford, without having to adjust their plan. This lessens the risk of provisioning too much or too little. Simply put, we scale right along with you. So there’s no need to contact a sales rep.

LogDNA as an all-in-one system deals with a number of these same challenges. The ability to track usage is a major focus area for us, so that we can ensure you have full transparency into what your systems are logging with us. Our own systems track and bill down to the MB, so that the end user can have an accurate picture of spend compared to usage rates. This is not only helpful, but allows us to operate in a transparent manner with no hidden fees. Though it is powered by a complex mechanism internally, it provides a simplified, transparent billing experience for our customers.

LogDNA users have direct control over their billing. While this may seem like just another thing to keep track of, it’s actually a powerful form of agency you can use to take control of your budget and monetary concerns. Users can take their methodical logging mentality and apply it to their own billing process, allowing greater control over budgets and scale.

Say, for example, that there is an unexpected spike in data volume. Your current pricing tier will be able to handle the surge without any changes to your LogDNA plan. As an added bonus, we also notify you in the event of a sudden increase in volume. And because log data is an ever-changing stream, we offer ingestion controls so that you can exclude logs you don’t need and not be billed for them.

Our focus on transparency as part of the user experience not only builds trust, but also fosters a sense of partnership.

Scaling for All Sizes & Purposes

Our tiered system takes into account how many team members (users) will be using LogDNA on the same instance and the length of retention (how long historical log data remains accessible for metrics and analytics). Additionally, we have a scaled pricing tier that is HIPAA compliant for protected health information and includes a Business Associate Agreement (BAA) for handling sensitive data.

Pictured here is a brief chart of some basic scaled prices for our three initial individual plans. The full scope of the plans is listed here. This is a visualization of a sample plan for each tier.

Plan Estimator

BIRCH – $1.50/GB – Retention: 7 Days – Up to 5 Users

Monthly GB Used     1 GB      4 GB      16 GB     30 GB
Cost Per Month      $1.50     $6        $24       $45

Monthly Minimum: $3.00

MAPLE – $2.00/GB – Retention: 14 Days – Up to 10 Users

Monthly GB Used     10 GB     30 GB     120 GB    1 TB
Cost Per Month      $20       $60       $240      $2,000

Monthly Minimum: $20.00

OAK – $3.00/GB – Retention: 30 Days – Up to 25 Users

Monthly GB Used     50 GB     60 GB     150 GB    1 TB
Cost Per Month      $150      $180      $450      $3,000

Monthly Minimum: $100.00

Custom Solutions for All & Competitive Advantages

Many pricing systems attempt to offer a one-size-fits-all model. Where they miss the mark, we succeed with usability that scales from small shops to large enterprise deployments. Our Willow (Free) Plan is a single-user system that lets an individual see whether a log management system is right for their project before growing into a collaborative, paid tier. High-volume plans are also customized and still retain the AWS billing model. We also offer a full-featured 14-day trial.

The adoption of this model creates a competitive advantage in the marketplace for both parties. LogDNA can provide services to all types of individuals and companies with a fair transparent pricing structure. The end user is given all relevant data usage and pricing information along with useful tools to manage it as they see fit.

For example, imagine you are logging conservatively, focusing only on the essentials like poor performance and exceptions. In the middle of putting out a fire, your engineering team realizes that they are missing crucial diagnostic information. Once the change is made, those new log lines will start flowing into LogDNA without anyone ever having to spend time mulling over how to adjust your plan. Having direct control over your usage and spending without touching billing is enormously beneficial not only to our customers; it also reduces our own internal overhead for managing billing.

Competitive Scenario – Bridging the Divide Between Departments

Picture this scenario: a growing number of users are experiencing difficulty while using your app. The support team has been receiving ticket after ticket. Somewhere there is a discrepancy between what the user is doing and what the app is returning. The support team needs to figure out why these users are having difficulty using the app. These inquiries have stumped the department, and the director needs to ask the engineering team how to retrieve the information needed to find a fix.

LogDNA helps bridge the divide by providing the support team with information relevant to the problem at hand. For this particular example, the engineering team instruments new code to log all customer interactions with API endpoints. The support team now has a broader view of how users are interacting with the interface and has been equipped with a new tool from the engineers. Nothing was lost in translation between the departments during this exchange.

After looking through the new logged information, the support team is able to solve the problem many of its users were experiencing. The support team has served its purpose by responding to these inquiries and making the end-user happy. All it took was some collaboration between two different departments.   

The log volume has increased because new logs are being funneled through the system, but trading increased log volume for better support is worth it. During this whole process, no changes are required to your current account plan with LogDNA. Future issues that arise will be fixed more easily as a result of this diagnostic information being readily available. The cost of losing users outweighs the cost of extra logs.

LogDNA treats the billing model as being just as important as the log management software itself. It can inform decisions across the board, adapting to budgetary concerns and user experience while giving you a better grasp of your own data, all at once.

Technical

Querying Multiple ElasticSearch Clusters with Cross-Cluster Search

With great power comes great responsibility, and with big data comes the necessity to query it effectively. ElasticSearch has been around for over seven years and has changed the game in terms of running complex queries on big data (petabyte scale). Tasks like e-commerce product search, real-time log analysis for troubleshooting, or generally anything that involves querying big data are considered “data intensive”. ElasticSearch is a distributed, full-text search engine, or database, and the key word here is “distributed”.

A lot of small problems are much easier to deal with than a few big ones, and DevOps is all about spreading out dependencies and responsibility so it’s easier on everyone. ElasticSearch uses the same concept to help query big data; it’s also highly scalable and open source. Imagine you run an online store and need to set up a private “Google search box” that your customers could use to search for anything in your inventory. That’s exactly what ElasticSearch can do for your application monitoring and logging data. It stores all your data (in the context of this post, all your logging data) in nodes that make up one or more clusters.

DevOps on Data

Staying with the online store example, modern day queries can get pretty technical and a customer could, for example, be looking for only products in a certain price range, or a certain colour, or a certain anything. Things can get more complicated if you’re also running a price alerting system that lets customers set alerts if things on their wish list drop below a certain price. ElasticSearch gives you those full-text search and analytics capabilities by breaking data down into nodes, clusters, indexes, types, documents, shards and replicas. This is how it allows you to store, search, and analyze big data quickly and in “near” real time (NRT).

The architecture and design of ElasticSearch is based on the assumption that all nodes are located on a local network. This is the use case that is extensively tested for and in a lot of cases is the environment that users operate in. However, monitoring data can be stored on different servers and clusters and to query them, ElasticSearch needs to run across clusters. If your clusters are at different remote locations, this is where ElasticSearch’s assumption that all nodes are on the same network starts working against you. When data is stored across multiple ElasticSearch clusters, querying it becomes harder.

Global Search


Network disruptions, along with a host of other problems, are much more common between pieces of distributed infrastructure (even with a dedicated link). Workarounds are what adapting to new technology is all about, and there have been a number of them; one notable example is the tribe node. ElasticSearch has a number of different use cases within organizations, spread across departments. It could be used for logging visitor data in one, analyzing financial transactions in another, and deriving insights from social media data in a third.

Since data resides on different nodes and often in different clusters, some complex queries need to pull data from multiple places and process it; for that you need to query multiple clusters. If these clusters are not at the same physical location, a tribe node connects them and lets you deal with them like one big cluster. What makes the tribe node unique is that, unlike cross-cluster search, it doesn’t restrict you to the search APIs: the tribe node supports almost all APIs, with the exception of meta-level APIs like Create Index, which must be executed on each cluster separately.

Tribe Nodes

The tribe node works by executing search requests across multiple clusters and merging the results from each cluster into a single global cluster result. It does this by actually joining each cluster as a node that keeps updating itself on the state of the cluster. This uses considerable resources, as the node has to acknowledge every single cluster state update from every remote cluster.

Additionally, with tribe nodes, the node that receives the request (the coordinating node) basically does all the work. This means the node that receives the request identifies which indices, shards, and nodes the search has to be executed against. It sends requests to all relevant nodes, decides which top-N hits need to be fetched, and then actually fetches them.

The tribe node is also very hard to maintain code-wise over time — especially since it’s the only exception to ElasticSearch’s rule that a node must belong to one and only one cluster.

Cross-Cluster Search

If DevOps is about spreading the load around, it’s pretty obvious what the problem is with the tribe node: one node is taxed with all the processing work while the nodes not relevant to the query stand by idle. With cross-cluster search, you’re remotely querying each cluster through its own _search API, so no additional node that needs to be constantly updated joins the cluster and slows it down. When a search request is executed on a node, instead of doing everything itself, the node sends a single _search_shards request to each remote cluster to find out which indices and shards the query should target.

The _search API allows ElasticSearch to execute searches, queries, aggregations, suggestions, and more against multiple indices which are in turn broken down into shards. The concept is that instead of having a huge Database A, Database B, Database C and so on, it merges everything into one giant solid block of data. The next step is to break it down into bits (shards) and give every node a piece to look after, worry about, care for, maintain, and query when required. This makes it a lot easier to query since the load is spread evenly across all the nodes.

Now, unlike with tribe nodes, where the first node would wait for every node to reply, do the math, and fetch the documents, with cross-cluster search the initial node has done its job already. Once the shard-level requests are sent to the relevant clusters, all further processing is done locally on those clusters. Further time and processing power is saved by connecting to only three gateway nodes per cluster by default; you can also choose how many nodes per cluster you would like discovered.

One Direction


Traffic flows only one way in cross-cluster search, which means the coordinating node just passes the request on to the three default gateway nodes, which carry on the process. You can also choose which nodes you would like to act as gateways and which nodes you would like to store data on. This gives you a lot more control over the traffic going in and out of your cluster.

Again, unlike the tribe node, which is an actual additional node that has to join each of your clusters, cross-cluster search requires no additional or special nodes. Any node can act as a coordinating node, and you can control which nodes get to be coordinating nodes and which don’t. Furthermore, when merging clusters, tribe nodes can’t keep two indices with the same name even if they come from different clusters. Cross-cluster search avoids this by namespacing remote indices with their cluster alias, and it lets you dynamically register, update, and remove remote clusters.
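For a sense of what this looks like in practice, here is a rough sketch of registering a remote cluster and running a cross-cluster search over the REST API using Python’s requests library. The cluster alias (“logs_eu”), hosts, and index names are assumptions, and the exact settings key depends on your ElasticSearch version (search.remote.* in early releases, cluster.remote.* in later ones):

    import json
    import requests

    LOCAL = "http://localhost:9200"

    # 1. Dynamically register a remote cluster (no restart, no tribe node).
    settings = {"persistent": {"cluster": {"remote": {"logs_eu": {"seeds": ["eu-node1:9300"]}}}}}
    requests.put(LOCAL + "/_cluster/settings", json=settings).raise_for_status()

    # 2. Query a local and a remote index in one request; remote indices are
    #    namespaced with the cluster alias, so identical names don't collide.
    query = {"query": {"match": {"message": "timeout"}}, "size": 10}
    resp = requests.get(LOCAL + "/logs-2018.03,logs_eu:logs-2018.03/_search", json=query)
    print(json.dumps(resp.json()["hits"]["total"], indent=2))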

The Need For Logging

There are also commercial solutions built on ElasticSearch to make life and logging even easier. LogDNA is a good example, and we’ve been known to talk about our product in the context of it being the “Apple of logging”. Along with predictive intelligence and machine learning, LogDNA allows users to aggregate, search, and filter from all hosts and apps. LogDNA also features automatic parsing of fields from common log formats, such as weblogs, Mongo, Postgres and JSON. Additionally, we offer a live-streaming tail using a web interface or command line interface (CLI).

LogDNA provides the power to query your log data end-to-end without having to worry about clusters, nodes, or indices. It does all the heavy lifting behind the scenes so you enjoy an intuitive and intelligent experience when analyzing your log data. As we discussed earlier, maintaining your own ElasticSearch stack and stitching together all the infrastructure and dependencies at every level is a pain. Instead, you’re better off saving all that time and opting for a third-party tool that abstracts away low-level challenges and lets you deal with your log data directly. That’s what tools like LogDNA do.

From not being able to query across clusters, to querying through a special node, to finally querying remote clusters directly, ElasticSearch is certainly making progress. In an age where data is “big”, nothing is as important as the ability to make it work for you, and we can certainly expect ElasticSearch to continue improving this feature. However, if you’d rather save yourself the effort of managing multi-cluster querying and instead analyze and derive value from your log data, LogDNA is the way to go.

Kubernetes, Technical

Top Kubernetes Metrics & Logs for End-to-End Monitoring

Kubernetes makes life as a DevOps professional easier by creating levels of abstraction, like pods and services, that are self-sufficient. Though this means we no longer have to worry as much about infrastructure and dependencies, what doesn’t change is the fact that we still need to monitor our apps, the containers they’re running on, and the orchestrators themselves. What makes things more interesting, however, is that the more Kubernetes piles on levels of abstraction to “simplify” our lives, the more levels we have to see through to effectively monitor the stack.

Across the various levels you need to monitor resource sharing, communication, application deployment and management, and discovery. Pods are the smallest deployable units created by Kubernetes that run on nodes which are grouped into clusters. This means that when we say “monitoring” in Kubernetes, it could be at a number of levels — the containers themselves, the pods they’re running on, the services they make up, or the entire cluster. Let’s look at the key metrics and log data that we need to analyze to achieve end-to-end visibility in a Kubernetes stack.

Usage Metrics

Performance issues generally arise from CPU and memory usage, and these are likely the first resource metrics users will want to review. This brings us to cAdvisor, an open source tool that automatically discovers every container and collects CPU, memory, filesystem, and network usage statistics. Additionally, cAdvisor provides overall machine usage by analyzing the ‘root’ container on the machine. Sounds too good to be true, doesn’t it? Well, it is: the catch is that cAdvisor is limited in the sense that it only collects basic resource utilization statistics and doesn’t offer any long-term storage or analysis capabilities.

CPU, Memory and Disk I/O

Why is this important? With traditional monitoring, we’re all used to monitoring actual resource consumption at the node level. With Kubernetes, we’re looking at the sum of the resources consumed by all the containers across nodes and clusters (which keeps changing dynamically). If this sum is less than your nodes’ capacity, your containers have all the resources they need, and there’s always room for Kubernetes to schedule another container if load increases. However, if it goes the other way around and you have too few nodes, your containers might not have enough resources to meet requests. This is why making sure that requests never exceed your collective node capacity is more important than monitoring simple CPU or memory usage.
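One way to keep an eye on this is to compare the sum of container resource requests against total allocatable capacity across your nodes. The sketch below uses the official Kubernetes Python client, handles only CPU, and keeps the unit parsing deliberately simplified:

    from kubernetes import client, config

    def cpu_to_millicores(value):
        # "500m" -> 500, "2" -> 2000; other suffixes ignored for brevity.
        return int(value[:-1]) if value.endswith("m") else int(float(value) * 1000)

    config.load_kube_config()                  # or config.load_incluster_config()
    v1 = client.CoreV1Api()

    allocatable = sum(cpu_to_millicores(n.status.allocatable["cpu"])
                      for n in v1.list_node().items)

    requested = 0
    for pod in v1.list_pod_for_all_namespaces().items:
        for c in pod.spec.containers:
            if c.resources and c.resources.requests and "cpu" in c.resources.requests:
                requested += cpu_to_millicores(c.resources.requests["cpu"])

    print("Requested %dm of %dm allocatable CPU (%.1f%%)"
          % (requested, allocatable, 100.0 * requested / allocatable))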

With regards to disk usage and I/O, with Kubernetes we’re more interested in the percentage of disk in use as opposed to the size of our clusters, so graphs are wired to trigger alerts based on the percentage of disk size being used. I/O is also monitored in terms of Disk I/O per node, so you can easily tell if increased I/O activity is the cause for issues like latency spikes in particular locations.

Kube Metrics

There are a number of ways to collect metrics from Kubernetes, although Kubernetes itself doesn’t report metrics directly; it relies on add-ons like Heapster rather than exposing the underlying cgroup data. This is why a lot of experts say that container metrics should usually be preferred to Kubernetes metrics. A good practice, however, is to collect Kubernetes data along with Docker container resource metrics and correlate them with the health and performance of the apps they run. That said, while Heapster focuses on forwarding metrics already generated by Kubernetes, kube-state-metrics is a simple service focused on generating completely new metrics from the state of Kubernetes objects.

These metrics have really long names which are pretty self-explanatory; kube_node_status_capacity_cpu_cores and kube_node_status_capacity_memory_bytes are the metrics used to access your node’s CPU and memory capacity respectively. Similarly, kube_node_status_allocatable_cpu_cores tracks the CPU resources currently available for scheduling and kube_node_status_allocatable_memory_bytes does the same for memory. Once you get the hang of how they’re named, it’s pretty easy to tell what each metric tracks.

Consuming Metrics

These metrics are designed to be consumed either by Prometheus or a compatible scraper, and you can also open /metrics in a browser to view them raw. Monitoring a Kubernetes cluster with Prometheus is becoming a very popular choice as both Kubernetes & Prometheus have similar origins and are instrumented with the same metrics in the first place. This means less time and effort lost in “translation” and more productivity. Additionally, Prometheus also keeps track of the number of replicas in each deployment, which is an important metric.
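If you just want a quick look at these series without a full Prometheus setup, you can pull the raw /metrics endpoint directly. The service address below is an assumption about how kube-state-metrics is exposed in your cluster:

    import requests

    METRICS_URL = "http://kube-state-metrics.kube-system.svc:8080/metrics"
    WANTED = ("kube_node_status_capacity_cpu_cores",
              "kube_node_status_allocatable_cpu_cores")

    for line in requests.get(METRICS_URL, timeout=5).text.splitlines():
        if line.startswith(WANTED):   # skip comments and other series
            print(line)               # e.g. kube_node_status_capacity_cpu_cores{node="node-1"} 4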

Pods typically sit behind services and are scaled by ReplicaSets, which create or destroy pods as needed. ReplicaSets are in turn controlled by Deployments, which declare the desired number of replicas. This is another example of a feature built to improve performance that makes monitoring more difficult. ReplicaSets need to be monitored and tracked just like everything else if you want to continue making your applications perform better and faster.

Network Metrics


Now, like with everything else in Kubernetes, networking is about a lot more than network in, network out, and network errors. Instead, you have a boatload of metrics to look out for, including request rate, read IOPS, write IOPS, error rate, network traffic per second, and network packets per second. This is because there are new issues to deal with as well, like load balancing and service discovery, and where you used to have a single network-in and network-out, there are now thousands of containers. These thousands of containers make up hundreds of microservices, all communicating with each other all the time.

A lot of organizations are turning to a virtual network to support their microservices as software-defined networking gives you the level of control you need in this situation. That’s why a lot of solutions like Calico, Weave, Istio and Linkerd are gaining popularity with their tools and offerings. SD-WAN especially is becoming a popular choice to deal with microservice architecture.

Kubernetes Logs

Everything a containerized application writes to stdout and stderr is handled and redirected somewhere by a container engine and, more importantly, is logged somewhere. The functionality of a container engine or runtime, however, is usually not enough for a complete logging solution, because when a container crashes, for example, it takes everything with it, including the logs. Therefore, logs need separate storage, independent of nodes, pods, or containers. To implement this, cluster-level logging is used, which provides a separate backend to store and analyze your logs. Kubernetes provides no native storage solution for logs, but you can integrate quite a few existing ones.

Kubectl Logs

Kubectl logs is the command for viewing logs from the Kubernetes CLI and can be used as follows:

$ kubectl logs <pod-name>

This is the most basic way to view logs on Kubernetes, and there are many flags to make your commands more specific. For example, “$ kubectl logs pod1” will only return logs from pod1. “$ kubectl logs -f my-pod” streams your pod logs, and “kubectl logs job/hello” will give you the logs from the first container of a job named hello.

Logs for Troubleshooting


Logs are particularly useful for debugging problems and troubleshooting cluster activity. Some variations of kubectl logs for troubleshooting are:

  • “kubectl logs --tail=20 pod1”, which displays only the most recent 20 lines of output in pod1; or
  • “kubectl logs --since=1h pod1”, which will show you all logs from pod1 written in the last hour.

To get the most out of your log data, you can export your logs to a log analysis service like LogDNA and leverage its advanced logging features. LogDNA’s Live Streaming Tail makes troubleshooting with logs even easier since you can monitor for stack traces and exceptions in real time, in your browser. It also lets you combine data from multiple sources with all related events so you can do a thorough root cause analysis while looking for bugs.

Logging Levels and Verbosity

Additionally, there are different logging levels depending on how deep you want to go; if you don’t see anything useful in the logs and want to dig deeper, you can select a higher level of verbosity. To enable verbose logging on the Kubernetes component you are trying to debug, use --v or --vmodule with at least level 4, though it goes all the way up to level 8. While level 3 gives you a reasonable amount of information about recent changes, level 4 is considered debug-level verbosity. Level 6 displays requested resources, level 7 displays HTTP request headers, and level 8 displays HTTP request contents. The level of verbosity you choose will depend on the task at hand, but it’s good to know that Kubernetes gives you deep visibility when you need it.

Kubernetes monitoring is changing and improving every day, because that’s the name of the new game. Monitoring is so much more “proactive” now because everything rests on how well you understand the ins and outs of your containers. The better the understanding, the better the chances of improvement, and the better the end-user experience. In the end, nearly everything depends on how well you monitor your applications.