Proactive Problem Management With Redwood Insights

In any complex IT environment, things go wrong. A critical process fails, services are interrupted and the pressure is on. This is the world of incident management: the crucial, immediate “firefight” to restore service as quickly as possible. Tools like the RunMyJobs by Redwood Monitor are essential for this, providing the real-time alerts and control you need to manage the moment.

But what happens after the fire is out? This is where you make real, lasting improvements. This is the world of problem management: the forensic investigation into the root cause of an incident to ensure it never happens again.

Redwood Insights is the essential tool for this investigation in RunMyJobs, enabling you to identify trends that are critical for long-term problem resolution. With persona-based dashboards that visualize near-time historical execution data, Redwood Insights allows you to move beyond guesswork and find the root cause of your most complex operational problems.

This post explores how you can use Redwood Insights to transition from a reactive operational posture to a proactive one, using data to solve complex issues and optimize your automation landscape.

Core challenges of effective problem management

Without the right analytical tools, it’s difficult to move from a “hunch” to a data-driven conclusion about the root cause of an issue. Teams often lack the aggregated historical data needed for a proper investigation. This leads to two common, frustrating scenarios:

  • The major incident post-mortem: A critical production process failed last night, causing significant disruption. The incident team resolved it, but the question remains: Was it a one-time anomaly, or is it a symptom of a deeper flaw that will cause another major outage soon?
  • The “death by a thousand cuts”: A seemingly minor job fails intermittently, causing small disruptions. You log it as a low-priority incident every time and fix it manually. No single incident is big enough to warrant a major investigation, but the cumulative impact on team resources and user confidence is significant.

Real-world problem management scenarios with Redwood Insights

Let’s look at how Redwood Insights helps teams move from putting out fires to preventing them through data-driven investigations into both major incidents and recurring annoyances.

1. The major incident post-mortem – anomaly or systemic flaw?

The process: Following a major outage of a critical data warehousing job that was resolved by the on-call team, you’re tasked with conducting a root-cause analysis to prevent recurrence.

The investigation with Redwood Insights:

Figure: Job Insights. The Job Insights dashboards can be accessed when viewing jobs in the user interface for easy contextual analysis.
  1. You open the Job Insights report for the failed job to get a complete historical view.
  2. You use heat maps to see if failures have ever correlated with this specific date or time of month before, trying to identify patterns.
  3. To determine if this was an infrastructure issue, you switch to the Job Server Analysis dashboard. This allows you to quickly rule out a systemic problem by comparing performance across your environment. 
  4. Confident that the infrastructure is sound, you return to the job’s execution data. As you analyze the widgets, an AI-powered smart narrative (a simple, natural-language summary of the data) helps you clarify whether this failure fits a pattern or stands alone.
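
To make the heat map step concrete, here is a minimal Python sketch of the same idea applied to exported execution history. It is not the Redwood Insights API: the `RUNS` records, the status strings and both function names are hypothetical, chosen only to show how bucketing failures by day of month can separate a one-off from a repeating pattern.

```python
from collections import Counter
from datetime import datetime

# Hypothetical execution history for one job: (start time, status).
# Illustrative values only; this is not real Redwood Insights output.
RUNS = [
    (datetime(2025, 3, 31, 23, 0), "Completed"),
    (datetime(2025, 4, 30, 23, 0), "Completed"),
    (datetime(2025, 5, 31, 23, 0), "Completed"),
    (datetime(2025, 6, 17, 23, 0), "Error"),
    (datetime(2025, 6, 30, 23, 0), "Completed"),
]


def failures_by_day_of_month(runs):
    """Count failed runs per day of month (one axis of a simple heat map)."""
    return Counter(start.day for start, status in runs if status == "Error")


def looks_like_one_off(runs, max_repeats=1):
    """Return True when no day-of-month bucket holds more than max_repeats failures."""
    counts = failures_by_day_of_month(runs)
    return all(n <= max_repeats for n in counts.values())


if __name__ == "__main__":
    print(dict(failures_by_day_of_month(RUNS)))
    print("one-off:", looks_like_one_off(RUNS))
```

With this sample data there is a single failure and no repeated bucket, which is the kind of evidence that supports classifying an incident as an anomaly. A real investigation would also look at other dimensions, such as weekday or hour.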

The business outcome and ROI:

  • Action taken: Based on this clear, data-driven context, you can confidently classify the issue. You document the anomaly and close the problem record, avoiding an unnecessary and costly investigation into a one-off event.
  • Business outcome: This data-driven approach avoids wasting resources chasing ghost issues while ensuring that genuine systemic risks get the attention they deserve.
  • ROI: This leads to improved long-term service stability, more efficient use of skilled engineering resources (who now solve real problems) and increased business confidence in the automation platform.

2. Solving the recurring problem with data

The process: An end-of-day reporting workflow has been failing intermittently for weeks, creating a backlog of low-priority incidents.

The investigation with Redwood Insights:

Figure: Operator Overview. The Operator Overview is your starting point for problem investigations and analysis.
  1. You begin your investigation on the Operator Overview dashboard. Your eyes are immediately drawn to a widget highlighting the “top ten jobs with most frequent failures,” which confirms this reporting job is a chronic offender that needs attention.
  2. You analyze the job’s history and use heat maps to discover a clear pattern: The failures almost always occur on weekday afternoons. 
  3. To understand why, you pivot to the Queue Analysis dashboard to drill down into the systems involved. Here, the data clearly shows that when the reporting job fails, queue wait times are consistently high, indicating resource contention is the likely culprit.
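
The three steps above can be sketched in a few lines of Python over exported run records. Everything here is an assumption made for illustration: the `Run` record, the job names, the status strings and the wait times are hypothetical and do not reflect how Redwood Insights stores or exposes its data.

```python
from collections import Counter
from statistics import median
from typing import NamedTuple


class Run(NamedTuple):
    """Hypothetical record of one job execution."""
    job: str
    weekday: str
    hour: int
    status: str
    queue_wait_s: int


# Illustrative sample data, not real measurements.
RUNS = [
    Run("EOD_Report", "Tue", 15, "Error", 540),
    Run("EOD_Report", "Tue", 3, "Completed", 12),
    Run("EOD_Report", "Wed", 14, "Error", 610),
    Run("EOD_Report", "Sat", 15, "Completed", 20),
    Run("EOD_Report", "Thu", 16, "Error", 480),
    Run("Invoice_Load", "Mon", 2, "Error", 15),
    Run("Invoice_Load", "Tue", 2, "Completed", 10),
]


def top_failing_jobs(runs, n=10):
    """Step 1: rank jobs by failure count, most frequent first."""
    return Counter(r.job for r in runs if r.status == "Error").most_common(n)


def failure_heat_map(runs, job):
    """Step 2: count one job's failures per (weekday, hour) cell."""
    return Counter(
        (r.weekday, r.hour) for r in runs if r.job == job and r.status == "Error"
    )


def median_wait_by_status(runs, job):
    """Step 3: compare median queue wait for failed and successful runs."""
    waits = {}
    for r in runs:
        if r.job == job:
            waits.setdefault(r.status, []).append(r.queue_wait_s)
    return {status: median(values) for status, values in waits.items()}


if __name__ == "__main__":
    print(top_failing_jobs(RUNS))
    print(dict(failure_heat_map(RUNS, "EOD_Report")))
    print(median_wait_by_status(RUNS, "EOD_Report"))
```

In this sample, the failed runs cluster on weekday afternoons and show much longer queue waits than the successful ones. That is consistent with resource contention but does not prove it, which is why the comparison is a starting point for the change request, not the final word.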

The business outcome and ROI:

  • Action taken: With strong evidence pointing to the root cause, you submit a change request to create a dedicated queue for the reporting workflow, a targeted improvement based on historical data.
  • Business outcome: The recurring incidents stop completely. The business service becomes reliable, and the stream of low-priority tickets ceases.
  • ROI: This eliminates the hidden operational cost of repeatedly fixing the same small issue, frees up your Operations team from repetitive tasks and improves the reliability and timeliness of service delivery.

Your toolkit for proactive problem management

Figure: Queue Analysis. The Queue Analysis dashboards provide a system view that enables users to visualize the relationship between performance and platform configurations.

These tools give you the operational visibility and historical context to take IT operations from reactive troubleshooting to a data-driven, intelligent function.

  • Identify recurring issues: Use the Operator dashboards to prioritize the most impactful, systemic problems by highlighting key metrics, such as the top ten failing jobs.
  • Correlate failures to find patterns: Use interactive widgets like heat maps to uncover underlying triggers for recurring problems by correlating failures to specific dates or other factors.
  • Isolate system-specific problems: Use the Job Server Analysis and Queue Analysis dashboards to understand if failures are application-specific or tied to a particular component, which is crucial for problem management.
  • Drive data-driven improvements: Use the detailed Job Insights and Workflow Insights dashboards to perform targeted analysis, enhancing processes through redesign or resource reallocation based on historical performance data.

From reactive firefighting to strategic reliability

Redwood Insights provides the essential tools for a mature problem management practice. It allows you to move beyond the immediate incident and analyze historical trends to find and permanently eliminate the underlying causes.

The result is a more stable, reliable and optimized automation environment. This leads to fewer outages, more efficient use of IT resources and more timely, reliable service delivery.

Watch this video preview of Redwood Insights to learn more.

Ready to move beyond firefighting and start solving problems for good? Discover how Redwood Insights can power your problem management process. Book a demo of RunMyJobs today.

About The Author

Dan Pitman

Dan Pitman is a Senior Product Marketing Manager for RunMyJobs by Redwood. His 25-year technology career has spanned roles in development, service delivery, enterprise architecture and data center and cloud management. Today, Dan focuses his expertise and experience on enabling Redwood’s teams and customers to understand how organizations can get the most from their technology investments.
