With 18 years of experience spanning backend engineering, microservices architecture, and AI infrastructure, Rajesh Kesavalalji has witnessed the evolution from traditional server monitoring to the complex observability challenges of modern GPU-powered systems. Currently serving as Senior Engineer at Omniva, an AI cloud-computing startup, Rajesh has developed new approaches to GPU health monitoring, system reliability, and performance optimization. In this extended interview, he shares how out-of-band telemetry changed their infrastructure efficiency, his approach to microservices migration, and his vision for AI infrastructure observability.
Detecting Hardware Issues Before They Cause Problems
When Rajesh joined Omniva, the company’s AI infrastructure relied on standard monitoring. Traditional metrics like GPU utilization, thermal readings, and power consumption provided only surface-level visibility into system health.
“We quickly realized that thermal and power-related throttling were key contributors to performance degradation, but in-band metrics cannot reliably detect underlying hardware issues like failing fans or GPUs hitting thermal or power limits silently,” Rajesh explains. “To address this, we introduced out-of-band GPU telemetry, which allowed us to monitor hardware health independently of the operating system and workload layers.”
Out-of-band monitoring changed how the team managed infrastructure. Instead of reacting to performance drops after they occurred, they could identify potential issues before they impacted customer workloads.
The implementation focused on three metrics: GPU core temperature, power consumption, and fan speeds. These hardware-level signals provided direct insight into silicon-level behavior, independent of software stack performance.
“Using OOB metrics, we were able to proactively detect thermal throttling events, power cap hits, and fan failures that in-band tools often missed or delayed,” he notes. “This enabled us to take corrective actions such as replacing underperforming fans, rebalancing workloads, and even replacing degraded GPU cores, which significantly reduced downtime and improved sustained performance across the board.”
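The alerting logic Rajesh describes can be sketched in a few lines. This is a minimal illustration, not Omniva's implementation; the threshold values are illustrative assumptions, since real limits come from the GPU vendor's specifications.

```python
# Hedged sketch: turn out-of-band readings into proactive alerts
# before throttling degrades customer workloads.

from dataclasses import dataclass

@dataclass
class OobReading:
    gpu_id: str
    core_temp_c: float   # GPU core temperature from OOB telemetry
    power_draw_w: float  # board power consumption
    fan_rpm: int         # fan speed reported by the BMC

def health_alerts(r: OobReading,
                  temp_limit_c: float = 83.0,
                  power_cap_w: float = 700.0,
                  min_fan_rpm: int = 1500) -> list[str]:
    """Return alerts for readings approaching throttle or failure conditions."""
    alerts = []
    if r.core_temp_c >= temp_limit_c:
        alerts.append(f"{r.gpu_id}: near thermal throttle ({r.core_temp_c} C)")
    if r.power_draw_w >= power_cap_w:
        alerts.append(f"{r.gpu_id}: at power cap ({r.power_draw_w} W)")
    if r.fan_rpm < min_fan_rpm:
        alerts.append(f"{r.gpu_id}: fan underperforming ({r.fan_rpm} RPM)")
    return alerts
```

Because these checks run against out-of-band data, they keep working even when the node's operating system or workload layer is unresponsive.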
From Monoliths to Microservices
Rajesh’s expertise extends beyond hardware monitoring to complex architectural transformations. During his tenure at British Telecom, he led a migration from monolithic architecture to microservices using Spring Boot, navigating challenges that many organizations face today.
“Identifying different services from an existing monolithic codebase to be candidates for microservices and then decoupling them is the biggest challenge,” he explains. Decoupling components that were deeply interdependent, often sharing the same database schema or business logic, required careful domain analysis and coordination across teams.
His solution involved implementing the strangler pattern, moving specific functionality to new microservices while maintaining the monolith’s operational integrity. This approach allowed the team to validate each new service in production without requiring a complete system cutover.
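As a rough illustration of the strangler pattern (the route prefixes and service names here are hypothetical), a thin routing layer can direct migrated paths to new microservices while everything else continues to hit the monolith:

```python
# Hedged sketch of strangler-pattern routing: paths that have been
# migrated are intercepted; all remaining traffic stays on the monolith.

MIGRATED_ROUTES = {
    "/orders": "orders-service",
    "/billing": "billing-service",
}

def route(path: str) -> str:
    """Pick the backend for a request path during incremental migration."""
    for prefix, service in MIGRATED_ROUTES.items():
        if path.startswith(prefix):
            return service
    return "legacy-monolith"
```

Moving one entry at a time into the routing table is what lets each new service be validated in production without a complete cutover.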
Test coverage played a central role in ensuring system reliability throughout the migration. “We focused on building a strong suite of integration and contract tests, particularly around API boundaries and shared data models,” Rajesh notes. “CI/CD pipelines were enhanced to run these validations across both the monolith and the microservices to catch regressions early.”
The team also invested in observability, implementing distributed tracing, service-level dashboards, and alerting systems to detect latency or availability issues during the migration process. Combined with feature flags and gradual rollout strategies, this approach maintained high reliability while decomposing the legacy system.
Event Sourcing in Supply Chain Systems
At Nordstrom, Rajesh applied event sourcing principles to solve supply chain challenges, developing systems that could handle warehouse events, auditing requirements, and data correction scenarios.
“For teams new to event sourcing, it can be tricky to identify events for their data. I normally suggest teams start small and come up with a system design to understand end-to-end data flow,” he explains. “For auditing, we have to compare event-sourced data with the source. In the case of warehouse events, we’ll take daily snapshots of warehouse data and reconcile with enriched events that our applications have processed.”
This reconciliation process acknowledged the inherent challenges in distributed systems. “There is an acceptable delta due to data propagation and processing. If delta is higher, we send correction events to adjust deltas,” Rajesh notes.
This approach strikes a balance between the benefits of event sourcing and the practical need to maintain data consistency across complex enterprise systems.
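The daily snapshot reconciliation Rajesh describes might look something like the following sketch, where SKUs, quantities, and the acceptable-delta tolerance are illustrative assumptions:

```python
# Hedged sketch: compare a warehouse snapshot (source of truth) against
# event-sourced totals, and emit correction events where the drift
# exceeds the acceptable delta.

def reconcile(snapshot: dict[str, int],
              event_sourced: dict[str, int],
              tolerance: int = 0) -> list[dict]:
    """Return correction events for SKUs whose event-sourced totals
    drift beyond the tolerance from the snapshot."""
    corrections = []
    for sku, truth in snapshot.items():
        derived = event_sourced.get(sku, 0)
        delta = truth - derived
        if abs(delta) > tolerance:
            corrections.append({"type": "correction", "sku": sku,
                                "adjustment": delta})
    return corrections
```

Setting a nonzero tolerance encodes the “acceptable delta due to data propagation and processing” that the quote above mentions.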
The Evolution of Observability
Rajesh’s transition from traditional microservices observability to AI infrastructure monitoring illustrates the evolving demands of modern systems.
“My background has largely been in implementing metrics and monitoring for microservices-based architectures, where the tooling and best practices, like Prometheus, OpenTelemetry, and distributed tracing, are already mature and well understood,” he explains. “Transitioning to observability for AI infrastructure, especially for GPU-based systems, has been both exciting and challenging because the ecosystem is still emerging and less standardized.”
The fundamental principles remain constant: clean metric labeling, consistent trace propagation, and proper instrumentation. However, AI systems require extending these concepts to accommodate GPU-specific telemetry and hardware-level signals.
“One key lesson I’ve carried over is the importance of clean metric labeling and consistent trace propagation across services, which becomes even more critical when correlating metrics from disjointed sources in an AI pipeline,” Rajesh notes. “While tools like Prometheus and OpenTelemetry remain foundational, we had to extend our approach to accommodate GPU-specific telemetry, such as scraping low-level hardware metrics via Redfish APIs.”
The challenge lies in connecting disparate signals from hardware, orchestration layers, and model execution runtimes. “In microservices, logs and traces often tell a fairly direct story. In AI systems, the challenge is often in stitching together disparate signals from hardware, orchestration layers, and model execution runtimes.”
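One way to enforce the consistent metric labeling he emphasizes is to validate every sample against a shared label schema before ingestion, so that GPU telemetry and service metrics can later be joined on the same keys. The required label set below is an assumed example, not a standard:

```python
# Hedged sketch: reject metrics that lack the label keys needed to
# correlate hardware, orchestration, and runtime signals.

REQUIRED_LABELS = {"cluster", "node", "gpu_id"}

def validate_labels(metric_name: str, labels: dict[str, str]) -> dict[str, str]:
    """Raise if a metric is missing any label required for correlation."""
    missing = REQUIRED_LABELS - labels.keys()
    if missing:
        raise ValueError(f"{metric_name}: missing labels {sorted(missing)}")
    return labels
```

Failing fast at ingestion is cheaper than discovering at debug time that two telemetry sources cannot be joined.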
Cost-Effective Monitoring at Scale
When designing monitoring solutions for AI environments, Rajesh emphasizes the importance of separating concerns and planning for scale from the beginning.
“To design a cost-effective and scalable solution for aggregate node health monitoring in AI environments, we decoupled telemetry ingestion from visualization and alerting,” he explains. “We used OpenTelemetry Collectors to scrape GPU node metrics covering temperature, utilization, memory bandwidth, power draw, and ECC errors and pushed them into a message broker layer powered by Kafka.”
This architecture addresses one of the major challenges in AI infrastructure monitoring: the volume of high-frequency telemetry data. “For scalability, we implemented Kafka consumers that batched and deduplicated metrics before forwarding them to long-term storage backends, such as Mimir or Object Storage, for analytics. This helped control storage costs and prevent cardinality blow-ups, especially under high-frequency GPU telemetry loads.”
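The batch-and-deduplicate step could be sketched in plain Python as follows; the message shape, dedup key, and batch size are assumptions, and the actual Kafka consumer machinery is omitted:

```python
# Hedged sketch: drop duplicate samples (same metric, labels, and
# timestamp), then chunk the survivors into write batches for the
# long-term storage backend.

def batch_and_dedupe(messages: list[dict],
                     batch_size: int = 100) -> list[list[dict]]:
    """Deduplicate telemetry samples and group them into batches."""
    seen = set()
    unique = []
    for msg in messages:
        key = (msg["metric"], tuple(sorted(msg["labels"].items())), msg["ts"])
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```

Deduplicating before the storage write is what keeps high-frequency GPU telemetry from inflating both storage costs and series cardinality.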
The system’s debugging capabilities focus on correlation and drill-down functionality. “To enhance debugging efficiency, we built dynamic dashboards that correlated node-level logs, traces, and GPU metrics. This allowed us to visually drill down from fleet-wide aggregate health to individual node anomalies, such as identifying how thermal throttling or power capping might degrade model inference throughput.”
Mentoring Engineers
Beyond technical implementation, Rajesh is passionate about developing engineering talent and sharing knowledge within the community.
“When mentoring newer engineers on observability and performance optimization, I start by understanding their current perspective, what they already know, how they think about system health, and what improvements they would suggest based on their recent experiences,” he explains. This helps create a collaborative, non-hierarchical space where learning is contextual and relevant.
His mentoring approach emphasizes foundational understanding over tool-specific knowledge. “From there, I introduce foundational concepts, such as the difference between white-box and black-box monitoring, the value of RED/USE metrics, and how to consider latency, throughput, and resource bottlenecks. I encourage them to treat observability as a design consideration, not a bolt-on.”
Real-world application forms the core of his teaching methodology. “I also walk them through real-world scenarios such as tracing GPU bottlenecks, analyzing high-cardinality metrics, or correlating logs and traces, and explain both the tooling and the mental models behind effective diagnosis.”
Staying Current in a Rapidly Evolving Field
Rajesh maintains his expertise through a combination of community engagement and continuous learning from industry leaders.
“I stay current with AI infrastructure trends by engaging with a mix of open-source communities and technical deep dives,” he explains. “Medium, Substack, and engineering blogs from leading AI companies, such as OpenAI, MosaicML, Weights & Biases, and Uber AI, often reveal detailed post-mortems, GPU optimization tricks, and trade-offs regarding observability and cost.”
He also participates in conferences and technical forums that bridge research and practical implementation. “Conferences and forums like Kubeflow Summit, AI Infrastructure Alliance events, and Arxiv Sanity provide valuable insights for research-meets-practice, especially around distributed training, inference orchestration, and hardware-aware scheduling.”
The Future of AI Infrastructure Observability
Rajesh sees substantial changes coming to AI infrastructure observability tools and practices.
“I see observability tools for AI infrastructure becoming significantly more mature over the next 2-3 years,” he predicts. “We’re already starting to see GPU providers expose more comprehensive and useful metrics out of the box, such as thermal throttling indicators, power consumption, memory bandwidth, and error rates via tools like NVIDIA’s DCGM and Redfish APIs.”
This evolution will likely lead to greater standardization and easier integration across different platforms and vendors.
Building an Observability Stack
For organizations beginning their AI infrastructure journey, Rajesh recommends a phased approach that builds capability over time.
“For an organization just starting to scale AI infrastructure, I’d recommend beginning with Prometheus for metrics collection and Grafana for visualization to monitor core GPU and node health using exporters like DCGM and Node Exporter,” he suggests. “Early integration of OpenTelemetry Collector helps standardize metric, log, and trace ingestion pipelines for future scalability.”
The implementation strategy should prioritize immediate visibility needs while preparing for future complexity. “Logging with Loki or a structured log backend is crucial for correlating issues across AI workloads and system-level events. As workloads grow, introducing Kafka can help buffer and batch high-frequency telemetry data efficiently.”
His recommended implementation sequence reflects lessons learned from scaling complex systems: “Prioritization should focus first on GPU/node health visibility, followed by logging, and then tracing and intelligent alerting as model-serving and infrastructure complexity increase.”
Through his work at the intersection of traditional infrastructure engineering and AI systems, Rajesh demonstrates how foundational engineering principles adapt to meet the demands of emerging technologies. His approach to problem-solving, emphasis on mentorship, and focus on practical solutions provide a roadmap for organizations navigating the complexities of modern AI infrastructure.
A seasoned technology executive with a proven record of developing and executing innovative strategies to scale high-growth SaaS platforms and enterprise solutions. As a hands-on CTO and systems architect, he combines technical excellence with visionary leadership to drive organizational success.