LogBlog


More 80s: Rubik's Cube for Log Operations


While log management for operations and log management for compliance or security are different applications, they share many of the same foundational requirements, so system administrators can benefit from recent advances inspired by security applications:

- Collection

The ability to collect log data from a large variety of sources – with different protocols and different formats – through either an agentless or an agent-based infrastructure. Near-real-time collection is also critical to both security and operations use of logs. Such timely collection enables alerting that warns users of recent or even impending system failures.

- Normalization

The ability to compare log data from disparate sources. For example, the ability to run a user activity report aggregating all login activity for a particular user, including logins to the VPN and the finance server, or to run one report that shows all activity for a particular user, from e-mails sent to websites visited. For operational use, performance measurement across different systems can only be done on normalized data.
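To make the normalization idea concrete, here is a minimal Python sketch that maps two differently formatted login messages onto one common record layout and then builds a per-user activity report. Both raw formats, the field names and the "finance-server" source label are invented for illustration; real VPN and server logs will look different.

```python
import re
from datetime import datetime, timezone

# Hypothetical raw formats, assumed only for this example.
VPN_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z vpn login "
    r"user=(?P<user>\S+) result=(?P<result>\w+)"
)
SSH_PATTERN = re.compile(
    r"(?P<ts>\d{10}) sshd: (?P<result>Accepted|Failed) password for (?P<user>\S+)"
)


def normalize(line):
    """Map a raw line to a common record, or None if the format is unknown."""
    m = VPN_PATTERN.search(line)
    if m:
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
        return {"time": ts, "source": "vpn", "user": m["user"].lower(),
                "action": "login", "success": m["result"] == "ok"}
    m = SSH_PATTERN.search(line)
    if m:
        # This source uses epoch seconds instead of an ISO timestamp.
        ts = datetime.fromtimestamp(int(m["ts"]), tz=timezone.utc)
        return {"time": ts, "source": "finance-server", "user": m["user"].lower(),
                "action": "login", "success": m["result"] == "Accepted"}
    return None


def user_activity(lines, user):
    """All normalized records for one user, oldest first, across every source."""
    records = [r for r in map(normalize, lines) if r and r["user"] == user]
    return sorted(records, key=lambda r: r["time"])
```

Once both sources share the same fields (time, source, user, action, success), the report is a plain filter and sort; none of the reporting code needs to know where a record came from.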

- Summarization

The ability to count and summarize the collected log messages by log type, message type and so on. One failed login perhaps isn’t meaningful, but more than five in a row could be significant. The same logic applies to system errors and failures that need to be reviewed when using logs to maintain and optimize system and network operations.
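As a rough sketch of what summarization can mean in practice, the Python below counts messages per (source, event) pair and finds the longest run of consecutive failed logins for one user. The record layout and the event names "login_failed" and "login_ok" are assumptions made for this example, not a real product schema.

```python
from collections import Counter


def summarize(records):
    """Count records per (source, event) pair."""
    return Counter((r["source"], r["event"]) for r in records)


def max_consecutive_failures(records, user):
    """Longest run of failed logins for one user, in the order given."""
    longest = current = 0
    for r in records:
        if r["user"] != user or r["event"] not in ("login_failed", "login_ok"):
            continue  # ignore other users and non-login events
        if r["event"] == "login_failed":
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a successful login ends the run
    return longest
```

A caller could compare the result of `max_consecutive_failures` with a threshold such as five to decide whether the run is worth a closer look.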

- Statistical analysis

Unusual patterns in log data, an unusual ratio between accepted and denied connections on a firewall for example, can be an indication of a security breach. In the future, statistical algorithms applied to log data may enable failure prediction and other advanced analysis that directly contributes to improved SLAs.
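One simple way to flag an unusual accepted/denied ratio is to compare the latest value with the mean and standard deviation of recent history. The sketch below uses a z-score with a threshold of three standard deviations; both the technique and the threshold are my own illustrative choices, and real tools may use quite different statistics.

```python
from statistics import mean, stdev


def denied_ratio(accepted, denied):
    """Fraction of connections that were denied in one interval."""
    total = accepted + denied
    return denied / total if total else 0.0


def is_unusual(history, current, threshold=3.0):
    """True if `current` is more than `threshold` standard deviations
    away from the mean of the past ratios in `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold
```

With a history of ratios around 5%, a sudden interval with 40% denied connections is flagged, while a drift to 5.5% is not.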

- Alerting

The ability to trigger (near) real-time alerts that are user configurable, based either on manually written rules or on automated statistical analysis. Such alerts serve to bring urgent issues to the attention of system operators or security analysts.
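A manually written rule of the kind mentioned above can be as small as "fire when N matching events arrive within T seconds". Here is a minimal Python sketch of such a rule; the class name, the event fields and the alert text are all hypothetical.

```python
from collections import deque


class ThresholdRule:
    """Fire when `count` matching events arrive within `window` seconds."""

    def __init__(self, name, predicate, count, window):
        self.name = name
        self.predicate = predicate  # user-supplied function: event -> bool
        self.count = count
        self.window = window
        self._times = deque()

    def feed(self, event):
        """Return an alert string if the rule fires on this event, else None."""
        if not self.predicate(event):
            return None
        now = event["time"]  # assumed to be seconds, in arrival order
        self._times.append(now)
        # Drop matches that have fallen out of the sliding window.
        while self._times and now - self._times[0] > self.window:
            self._times.popleft()
        if len(self._times) >= self.count:
            self._times.clear()  # reset so one burst yields one alert
            return f"ALERT {self.name}: {self.count} events within {self.window}s"
        return None
```

The predicate is where the user configuration lives: the same class can watch for failed logins, disk errors or anything else expressible as a test on one event.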

- Search

Search is central to log-based investigations, whether for an operations use (such as system fault investigation) or a security use (hacker or insider attack). The ability to search 100% of logs is key for all three uses of logs: security, compliance and operations. Such searches must be fast and easy – so that users are able to run them while under the pressure of a troubleshooting or security incident.
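Fast search over all collected logs usually depends on some form of index. As a toy illustration, here is an in-memory inverted index in Python that maps each token to the lines containing it; the class, the tokenizer and the AND-only query semantics are assumptions for this sketch, and a production system would persist and shard the index.

```python
import re
from collections import defaultdict


class LogIndex:
    """Toy inverted index: token -> set of line ids."""

    def __init__(self):
        self.lines = []
        self.postings = defaultdict(set)

    @staticmethod
    def _tokens(text):
        # Lowercase words, numbers and dotted values such as IP addresses.
        return re.findall(r"[a-z0-9_.]+", text.lower())

    def add(self, line):
        line_id = len(self.lines)
        self.lines.append(line)
        for tok in self._tokens(line):
            self.postings[tok].add(line_id)

    def search(self, query):
        """Return lines containing every query term, in insertion order."""
        terms = self._tokens(query)
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.lines[i] for i in sorted(ids)]
```

Because a query only touches the posting sets for its own terms, lookup cost depends on how many lines match rather than on how many lines were collected.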

It is also important to note that log management for operations has its own unique requirements:

- Collection revisited

Faults are notoriously singular – this means that they occur once, but never again in quite the same manner. It is therefore very difficult to predict which log messages are going to be most useful for problem isolation, and most practitioners now admit it is best to keep all log data around for post-incident analysis. The requirement to collect 100% of log messages from all log sources is thus even more important in operations than it is in security.

- Log browsing (data mining)

While for compliance, an auditor may review the same report (say failed logins) every quarter, no two troubleshooting sessions are quite the same. Problem isolation is an interactive process of trial and error. An administrator may look at the same data from many different angles before understanding the root cause – like examining a Rubik’s cube. Reports have to be customizable on the fly. Pre- and post-report filtering options are important to allow for dynamic report (re)configuration. Search is important, but not sufficient, and you will likely want to combine search with access to normalized and cross-correlated information.
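To illustrate turning the cube, here is a small Python sketch of a report function that can be regrouped and refiltered on the fly. The pre-filter selects which records go into the report, the grouping fields choose the angle, and the post-filter trims the resulting rows. The function name and record fields are hypothetical.

```python
from collections import Counter


def report(records, group_by, pre_filter=None, post_filter=None):
    """Count records per combination of `group_by` fields.

    pre_filter(record) -> bool is applied before counting;
    post_filter(key, count) -> bool is applied to the finished rows.
    Rows are returned most frequent first.
    """
    rows = (r for r in records if pre_filter is None or pre_filter(r))
    counts = Counter(tuple(r[f] for f in group_by) for r in rows)
    return [(key, n) for key, n in counts.most_common()
            if post_filter is None or post_filter(key, n)]
```

The same data can then be viewed per host, per host and message, or per message only by changing `group_by`, without touching the collected records.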

- Search (and reporting) speed

Speed truly matters when it comes to fault detection and problem isolation. Whether a forensic investigation takes one hour or one day or one week usually doesn’t really break the bank, but whether a down-time situation persists for minutes or hours can be a matter of many millions of dollars in missed revenues. When troubleshooting a problem, every query must be very fast: whether indexed search or a report against normalized data, every second and every minute counts.

- GUI and Workflow

An external auditor looking at logs to verify that nobody improperly accessed credit card information is going to follow a very different workflow from an internal investigator examining a potential fraud case. Both differ again from the help-desk person who is trying to tell you why your e-mail isn’t being delivered or why your VPN connection is so slow. For optimal functionality and productivity, the best graphical user interfaces and workflows are application specific.

- SOA-based portal or mash-up

The initial fault alarm will likely land with a help-desk employee, in the form of an HP Software (or equivalent) alert, a log alert or a phone call from an unhappy user. Either way, the first-level support person will attempt to perform some analysis. In many cases, truly understanding the problem requires access to log data. Without log automation, it could require a phone call to a third-level support person and a long wait until the escalation manager returns the log analysis. However, in the brave new world of log analysis, the help-desk employee could access log data remotely with a single mouse-click, assuming the task is made easy enough. That probably means further customizing the workflow and GUI to a particular customer’s situation. This is easy to do with today’s web 2.0 technologies and open web services APIs: a custom portal or mash-up can be created in days.

- SOA-based integration

Unlike log management for security, fault analysis already has very mature consoles and dashboards. These event management systems even have correlation and alerting capabilities. Rather than replacing these systems with yet another console, most companies are going to look for the ability to integrate a new information source, log data in this case, into the existing fault management console. Web services likely will be the mechanism of choice.

- (Lack of) archiving

Keeping log data around for long periods of time is not a requirement. Data quickly loses its value after the fact. However, mining historical data patterns to predict future failures before they occur can be very valuable. This field is still in its infancy, but shows a lot of promise. Given enough data, both error data and fault data, predictive analysis is not far in the future.

It appears to me that the ideal technical architecture for log management recognizes both the similarities and the differences of the various log management use cases (and there are many more than just security and operations). Would the ideal solution perhaps be a common log data platform that can collect, aggregate, summarize, normalize, index and apply basic analytics to log data once, while allowing for many different user experiences depending on the use case?

Posted May 06, 2008

