Vivit Blog

Real-time Monitoring Metrics: The Magical Mundane

Posted By Larry Dragich, Tuesday, October 23, 2012

by Larry Dragich
Director EAS, Auto Club Group

Application Performance Management (APM) has many benefits when implemented with the right support structure and sponsorship. It’s the key for managing action, going red to green, and trending on performance.

As you strive to achieve new levels of sophistication when creating performance baselines, it is important to consider how you will navigate the oscillating winds of application behavior as the numbers come in from all directions. The behavioral context of the user will highlight key threshold settings to consider as you build a framework for real-time alerting into your APM solution.

This will take an understanding of the application and an analysis of the numbers as you begin looking at user patterns. Metrics play a key role in providing this value through different views across multiple comparisons. Even without the behavioral learning engines now emerging in the APM space, you can begin a high-level analysis on your own to come to a common understanding of each business application’s performance.

Just as water seeks its own level, an application performance baseline will eventually emerge as you track the real-time performance metrics outlining the high and low watermarks of the application. This will include the occasional anomalous wave that comes crashing through, affecting the user experience as the numbers fluctuate.

Depending on transaction volume and performance characteristics, there will be a certain level of noise that you will need to squelch down to a volume that can be analyzed. When crunching the numbers and distilling patterns, it is essential to create three baseline comparisons that you can use like a compass for navigating between what is real and what is an exception.

Real-time vs. Yesterday

As the real-time performance metrics come in, it is important to watch the application performance at least at the five-minute interval, compared to the day before, to see if there are any obvious changes in performance.

Real-time vs. 7 days ago

Comparing Monday to Sunday may not be relevant if your core business hours are M-F; comparing the real-time view to the same day of the previous week will be more useful, especially if a new release of the application was rolled out over the weekend and you want to know how it compares with the previous week.

Real-time vs. 10 day rolling average

Using a 10, 15 or 30 day rolling average is helpful in reviewing overall application performance with the business, because everyone can easily understand averages and what they mean when compared against a real-time view.
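To make these three comparisons concrete, here is a minimal sketch, assuming response times arrive as a pandas Series indexed by timestamp at five-minute intervals (288 samples per day). The function and column names are illustrative placeholders, not part of any particular APM product.

```python
import pandas as pd

SAMPLES_PER_DAY = 288  # one reading per five-minute interval

def baseline_comparisons(response_ms: pd.Series) -> pd.DataFrame:
    """Line up each real-time reading against the three baselines."""
    return pd.DataFrame({
        "real_time": response_ms,
        # Same five-minute slot yesterday.
        "yesterday": response_ms.shift(SAMPLES_PER_DAY),
        # Same slot on the same weekday last week.
        "seven_days_ago": response_ms.shift(7 * SAMPLES_PER_DAY),
        # 10-day rolling average: the "water level" the app settles at.
        "rolling_10_day": response_ms.rolling(10 * SAMPLES_PER_DAY).mean(),
    })
```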

Capturing real-time performance metrics in five-minute intervals is a good place to start. Once you get a better understanding of the application behavior, you may increase or decrease the interval as needed. For real-time performance alerting, using the averages will give you a good picture of when something is out of pattern, and reporting on Service Level Management using percentiles (90%, 95%, etc.) will help create an accurate view for the business. To make it simple to remember: alert on the averages and profile with percentiles.
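A hedged sketch of that rule of thumb follows: the alert condition compares the latest five-minute average against the rolling average, and the SLM profile reports the 90th and 95th percentiles. The 1.5x tolerance factor and the function names are assumptions for illustration, not prescribed values.

```python
import statistics

def out_of_pattern(latest_avg_ms: float, rolling_avg_ms: float,
                   tolerance: float = 1.5) -> bool:
    """Alert on the averages: flag a break from the baseline."""
    return latest_avg_ms > rolling_avg_ms * tolerance

def slm_profile(samples_ms: list[float]) -> dict[str, float]:
    """Profile with percentiles for Service Level Management reports."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p90": cuts[89], "p95": cuts[94]}
```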

Conclusion

Operationally, there are things you may not want to think about all of the time (standard deviations, averages, percentiles, and so on), but you have to think about them long enough to create the most accurate picture possible as you begin to distill performance patterns for each business application. This can be accomplished by building meaningful performance baselines that will help feed your Service Level Management processes well into the future.

Related Links:

Prioritizing Gartner's APM Model

Event Management: Reactive, Proactive, or Predictive?

Tags:  Application Performance Management 


Event Management: Reactive, Proactive, or Predictive?

Posted By Larry Dragich, Director EAS, Auto Club Group / AAA Michigan, Thursday, August 30, 2012
Can Event management help foster a curiosity for innovative possibilities to make application performance better? Blue-sky thinkers may not want to deal with the myriad details of managing the events being generated operationally, but they could learn something from this exercise.

Consider the major system failures in your organization over the last 12 to 18 months. What if you had a system or process in place to capture those failures and mitigate them proactively, preventing them from reoccurring? How much better off would you be if you could avoid the proverbial “Groundhog Day” with system outages? The argument that system monitoring is just a nice-to-have, and not really a core requirement for operational readiness, dissipates quickly when a critical application goes down with no warning.

Starting with the Event management and Incident management processes may seem like a reactive approach when implementing an Application Performance Management (APM) solution, but is it really? If “Rome is burning,” wouldn’t the most prudent action be to extinguish the fire, then come up with a proactive approach for prevention? Managing the operational noise can calm the environment, allowing you to focus on APM strategy more effectively.



Asking the right questions during a post-mortem review will help generate dialog, outlining options for alerting and prevention. This will direct your thinking towards a new horizon of continual improvement that will help galvanize proactive monitoring as an operational requirement. Here are three questions that build on each other as you work to mature your solution:

1. Did we alert on it when it went down, or did the user community call us?

2. Can we get a proactive alert on it before it goes down (e.g., a dual power supply failure in a server)?

3. Can we trend on the event, creating a predictive alert before it escalates (e.g., disk space utilization triggering minor@90%, major@95%, critical@98%)?

The preceding questions are directly related to the following categories respectively: Reactive, Proactive, and Predictive.

Reactive – Alerts that occur at failure

Multiple events can occur before a system failure; eventually an alert will come in notifying you that an application is down. It will come either from users calling the Service Desk to report an issue or from a system-generated notification corresponding with the application failure.

Proactive – Alerts that occur before failure

These alerts will most likely come from proactive monitoring, telling you there are component failures that need attention but have not yet affected overall application availability (e.g., a dual power supply failure in a server).

Predictive – Alerts that trend on a possible failure

These alerts are usually set up in parallel with trending reports that help predict subtle changes in the environment (e.g., trending on memory usage or disk utilization before running out of resources).
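As a concrete illustration of these categories, here is a minimal sketch of the disk-utilization example from question 3 above, with tiered thresholds (minor@90%, major@95%, critical@98%) and a naive linear trend to predict when the next tier will be crossed. The helper names and the simple slope estimate are illustrative assumptions, not any particular product’s behavior.

```python
THRESHOLDS = [(98.0, "critical"), (95.0, "major"), (90.0, "minor")]

def severity(disk_pct: float) -> str | None:
    """Reactive/proactive: map current utilization to an alert tier."""
    for limit, level in THRESHOLDS:  # checked most severe first
        if disk_pct >= limit:
            return level
    return None

def days_until(history_pct: list[float], limit: float) -> float | None:
    """Predictive: estimate days until `limit` from daily samples."""
    if len(history_pct) < 2:
        return None
    slope = (history_pct[-1] - history_pct[0]) / (len(history_pct) - 1)
    if slope <= 0:
        return None  # flat or shrinking usage: no predicted breach
    return (limit - history_pct[-1]) / slope
```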

Conclusion

Once you build awareness in the organization that you have a bird’s-eye view of the technical landscape and the ability to monitor the ecosystem of each application (as an ecologist would), people become more meticulous when introducing new elements into the environment. They know that you are watching, taking samples, and trending on overall health and stability, leaving you free to focus on the strategic side of APM without distraction.

Related Links:

For a high-level view of a much broader technology space, refer to the slide show on BrightTALK.com, which describes “The Anatomy of APM – Webcast” in more context.

For more information on the critical success factors in APM adoption and how this centers around the End-User-Experience (EUE), read The Anatomy of APM and the corresponding blog APM’s DNA – Event to Incident Flow.

Prioritizing Gartner's APM Model

APM and MoM – Symbiotic Solution Sets

Tags:  application performance management  event management 


The DNA of APM – Event to Incident Flow

Posted By Larry Dragich, Director EAS, Auto Club Group / AAA Michigan, Thursday, July 12, 2012

This article is the corollary to “The Anatomy of APM,” which outlines four foundational elements of a successful APM strategy: Top Down Monitoring, Bottom Up Monitoring, Reporting, and Incident Management. Here I provide deeper context on how the event-to-incident flow is structured.

It is the correlation of events and the amalgamation of metrics that bring value to the business by way of dashboards and trending reports, and it is the way the business interprets the accuracy of those metrics that determines the success of the implementation. If an event occurs and no one sees it, believes it, or takes action on it, APM’s value can be severely diminished and you run the risk of owning “shelfware.”

Overall, as events are detected and consumed by the system, it is the automation that is the lifeblood of an APM solution, ensuring that the pulse of the incident flow is a steady one. The goal is to show a conceptual view of how events flow through the environment and eventually become incidents. At a high level, the Trouble Ticket Interface (TTI) correlates events into alerts, and alerts into incidents, which then become tickets, enabling the Operations team to begin working toward resolution.
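A conceptual sketch of that flow follows, assuming nothing about the actual TTI: repeated raw events are correlated into an alert, and a persistent alert is promoted to an incident ticket. Class and field names are hypothetical, and a real integration would call the ticketing system’s API rather than print.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class EventCorrelator:
    promote_after: int = 3  # repeated events => alert => incident
    counts: Counter = field(default_factory=Counter)
    open_incidents: set = field(default_factory=set)

    def ingest(self, source: str, check: str) -> None:
        """Consume one raw event; open an incident once the pattern persists."""
        key = (source, check)
        self.counts[key] += 1
        if self.counts[key] >= self.promote_after and key not in self.open_incidents:
            self.open_incidents.add(key)
            print(f"INCIDENT opened: {check} on {source}")  # stand-in for a TTI call

correlator = EventCorrelator()
for _ in range(3):
    correlator.ingest("app-server-01", "response_time_breach")
```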


The event flow moves from the outside in, and then from the center to the right. Here is how it’s managed:

• The outside blue circles represent the monitoring toolsets that collect information directly from the Infrastructure and the critical applications.

• The inner green (teal) circles represent the toolsets the Enterprise Systems Management (ESM) team manages, and is where most of the critical application thresholds are set.

• The dark brown circles are logical connection points depicting how the events are collected as they flow through the system; once the events hit this connection point, they go to three output queues.

• The red circles on the right are the Incident Output queues for each event after it has been tracked and correlated.

The transformation from event to incident is the critical junction where APM and ITIL come together to provide tangible value back to the business. So if you take only one thing away from this picture, it should be the importance of managing the strategic intent of the output queues, because this is the key for managing action, going red to green, and trending.

Conclusion

I’m suggesting that it is not necessarily the number of features or the technical stamina of each monitoring tool in processing large volumes of data that will make an APM implementation successful; it is the choices you make in how you put the tools together to manage the event-to-incident flow that determine your success. Timeliness and accuracy in this area will help you gain credibility and confidence with the constituents and business partners you support.

Related Links:

For a high-level view of a much broader technology space, refer to the slide show on BrightTALK.com, which describes “The Anatomy of APM – Webcast” in more context.

APM and MoM – Symbiotic Solution Sets

The Anatomy of APM

Prioritizing Gartner's APM Model


You can contact me on LinkedIn.

Tags:  APM  Application Performance Management 

