Deepak Soni

Deepak Soni

AI Architect - AI Centre of Excellence

Oracle EMEA | Campanillas, Spain

Senior Architect with more than 20 years of experience designing and deploying high-performance AI/ML, HPC, CAE, and GPU-accelerated cloud infrastructure across global industries

Get In Touch

About Me

Senior Architect with more than 20 years of experience designing and deploying high-performance AI/ML, HPC, CAE, and GPU-accelerated cloud infrastructure across global industries. Proven success leading complex architecture engagements, cloud migrations, and GenAI platform enablement for research, enterprise, and hybrid environments.

Known for driving scalable innovation through deep technical expertise, cross-functional collaboration, and a customer-first mindset. Demonstrated ability to deliver infrastructure solutions aligned with business goals, technical compliance, and operational excellence across OCI, AWS, GCP and Azure ecosystems.

Currently serving as AI Architect at Oracle's Centre of Excellence in Campanillas, Spain, specializing in enterprise reference architectures, performance baselines, and guardrails for GPU-accelerated GenAI/HPC on OCI.

Core Expertise

AI/ML Infrastructure HPC Platforms GPU Computing GenAI Platforms Automotive CAE NVIDIA Ecosystem

🚀 Recent Open Source Work

Production-ready frameworks and benchmarks for distributed AI/ML training, inference, and computer vision on Oracle Cloud Infrastructure

⭐ LATEST

ARBM: Agentic AI Benchmarking

Comprehensive 15-track benchmark suite for evaluating LLM agentic workflow capabilities including planning, tool orchestration, self-healing, and context persistence.

Key Results:

Agent Persistence: 95% loop completion
Self-Healing: 100% error recovery
Overall Score: 85% agentic capability

15 Benchmark Tracks:

  • Planning, Tool Use, Execution, Validation
  • Context Retention, Error Recovery, JSON Output

Tech Stack:

Llama-3 Nemotron Mixtral vLLM OCI
Agentic Workflows 15 Tracks
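The self-healing track above (100% error recovery) boils down to a retry-and-repair loop. A minimal sketch follows; the function names and the JSON-fence repair are illustrative examples, not the actual ARBM harness:

```python
import json

def self_healing_call(tool, args, repair, max_retries=3):
    """Call a tool and repair the arguments on failure (illustrative only).

    `tool` raises on bad input; `repair` maps (args, error) -> fixed args.
    Returns (result, attempts) so a benchmark can score recovery rate.
    """
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            return tool(args), attempt
        except Exception as err:  # a real harness would narrow this
            last_err = err
            args = repair(args, err)
    raise RuntimeError(f"unrecovered after {max_retries} attempts: {last_err}")

# Toy tool: expects valid JSON, a common failure mode in agentic pipelines.
def parse_json_tool(payload):
    return json.loads(payload)

def strip_fences(payload, _err):
    # Typical repair: the model wrapped its JSON in markdown fences.
    return payload.strip().removeprefix("```json").removesuffix("```").strip()

result, attempts = self_healing_call(
    parse_json_tool, '```json\n{"ok": true}\n```', strip_fences
)
```

A benchmark can then report recovery rate as the fraction of calls that return within the retry budget.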
🆕 NEW

LLM Quantization Benchmark

Comprehensive benchmarking framework for evaluating LLM inference across quantization methods (AWQ, GPTQ, GGUF, FP8) with throughput, latency, memory, and quality metrics.

Key Results:

AWQ 4-bit: 73% faster vs FP16
Memory: 75% reduction with INT4
Quality: 98% retained accuracy

Methods Compared:

  • AWQ, GPTQ, GGUF, FP8, AQLM
  • 7B to 70B models tested

Tech Stack:

vLLM AWQ GPTQ GGUF K8s
Quantization 5+ Methods
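A back-of-envelope check of the INT4 memory figure above; real deployments carry extra overhead for quantization scales, zero-points, and the KV cache, so treat this as an approximation:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory (GB), ignoring KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9

params_7b = 7e9
fp16 = weight_memory_gb(params_7b, 16)  # ~14 GB
int4 = weight_memory_gb(params_7b, 4)   # ~3.5 GB
reduction = 1 - int4 / fp16             # 0.75, matching the ~75% figure
```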
🕐 RECENT

LLM Observability Stack v2.0

Production-ready observability framework for monitoring LLM inference workloads on Kubernetes with NVIDIA GPUs, combining centralized logging with infrastructure validation testing.

Key Capabilities:

Documents: 248,805+ indexed logs
Validation: 11 infrastructure tests
Dashboards: 4 pre-built Kibana views

Stack Components:

  • ELK Stack (Elasticsearch, Logstash, Kibana, Filebeat)
  • Pytest validation suite, vLLM inference

Tech Stack:

ELK 8.12 vLLM Pytest OKE A10 GPUs
LLM Monitoring 96GB VRAM

Reasoning Model Benchmarking

Comprehensive benchmarking framework for reasoning-first LLMs like NVIDIA Nemotron-3-Nano, analyzing hybrid Mamba-Transformer MoE architectures on OCI GPU infrastructure.

Key Results:

Throughput: 1,390 tokens/sec peak
GSM8K: 60% math reasoning accuracy
HumanEval: 40% code generation

12 Benchmarks:

  • 7 Custom Tracks (MoE, TP, Context, Concurrency)
  • 5 Industry Standards (MMLU, GSM8K, HumanEval, MT-Bench)

Tech Stack:

Nemotron Mamba vLLM OKE OCI
Reasoning Analysis 12 Benchmarks

RAG Evaluation Framework

Comprehensive framework for evaluating Retrieval-Augmented Generation (RAG) systems, measuring retrieval quality, generation accuracy, and end-to-end performance on OCI.

Evaluation Metrics:

Retrieval: Precision, Recall, MRR
Generation: Faithfulness, Relevance
Latency: E2E response time

Tech Stack:

RAG Vector DB LLM OCI
RAG Pipeline Quality Metrics
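The retrieval metrics listed above (Precision, Recall, MRR) reduce to a few lines of Python. This is a generic sketch of the standard definitions, not the framework's own implementation:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and Recall@k for one query's ranked results."""
    topk = retrieved[:k]
    hits = sum(1 for doc in topk if doc in relevant)
    return hits / k, hits / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank over a batch of queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(ranked_lists)

p, r = precision_recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, k=3)  # p=1/3, r=1/2
score = mrr([["d3", "d1"], ["d2", "d5"]], [{"d1"}, {"d2"}])          # (1/2 + 1)/2
```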
🆕 NEW

Speculative Decoding Framework

Comprehensive performance evaluation of Speculative Decoding techniques for LLM inference acceleration, comparing draft-target model configurations on OCI GPU infrastructure.

Key Benchmarks:

Speedup: Token generation acceleration
Acceptance Rate: Draft token verification
Trade-offs: Quality vs speed analysis

Tech Stack:

Speculative Decoding vLLM Draft Models OCI
Inference Speedup Deep Analysis
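As a rough model of the speedup/acceptance-rate trade-off above: if each draft token is accepted independently with probability alpha, a draft length of k yields the following expected tokens per target-model pass. This is the standard i.i.d. approximation; real acceptance is correlated and draft-model cost eats into the gain:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass,
    with draft length k and per-token acceptance probability alpha."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With alpha=0.8 and 4 draft tokens, each target pass yields ~3.36 tokens,
# an upper bound on speedup before subtracting draft-model overhead.
gain = expected_tokens_per_step(0.8, 4)
```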
🆕 NEW

MoE Inferencing Benchmark

Comprehensive benchmark framework for Mixture of Experts (MoE) model inference on OCI, analyzing expert routing efficiency, throughput scaling, and memory utilization patterns.

Key Focus Areas:

Expert Routing: Load balancing analysis
Throughput: Token generation benchmarks
Memory: Sparse activation patterns

Tech Stack:

MoE Models vLLM Kubernetes OCI
Expert Analysis Performance Metrics

Distributed LLM Training Benchmark

Comprehensive benchmark framework comparing PyTorch DDP, FSDP, and DeepSpeed ZeRO-2/ZeRO-3 for distributed LLM training on Oracle Kubernetes Engine (OKE) with NVIDIA GPUs.

Key Results (4 NVIDIA A10 GPUs):

Best Throughput: ZeRO-2 at 18,147 tokens/sec
Best Memory: ZeRO-3 at 9.67 GB VRAM
Best Scaling: 41.2% efficiency

Challenges Solved:

  • ✅ DeepSpeed configuration bugs (string vs int)
  • ✅ Kubernetes pod results collection
  • ✅ NCCL networking for A10/A100/H100/H200
  • ✅ Worker RANK computation from K8s index

Tech Stack:

PyTorch DeepSpeed Kubernetes OCI NVIDIA GPUs
1,000+ lines Python 4,055 lines docs
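Scaling efficiency as reported above is measured throughput relative to ideal linear scaling. The single-GPU baseline below is a hypothetical number chosen only to illustrate how a 41.2% figure arises:

```python
def scaling_efficiency(single_gpu_tps: float, multi_gpu_tps: float, n_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved across n_gpus."""
    return multi_gpu_tps / (single_gpu_tps * n_gpus)

# Illustrative: if 1 GPU sustains 11,000 tok/s and 4 GPUs sustain 18,147 tok/s,
# efficiency = 18,147 / 44,000, i.e. about 41.2%.
eff = scaling_efficiency(11_000, 18_147, 4)
```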

Mistral-7B QLoRA Fine-tuning

Production-ready implementation of Mistral-7B-Instruct fine-tuning using QLoRA with 4-bit quantization for efficient training on consumer/cloud GPUs.

Key Features:

4-bit quantization - Reduced memory footprint
QLoRA - Efficient parameter updates
Single GPU - Consumer hardware compatible

Use Cases:

  • Custom domain adaptation (legal, medical, financial)
  • Instruction following for specific tasks
  • Cost-effective LLM customization
  • Research and experimentation

Tech Stack:

Mistral-7B QLoRA bitsandbytes PEFT HuggingFace
Memory efficient Production ready
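To see why QLoRA is memory-efficient on the update side, count the trainable parameters LoRA adds. The dimensions below are illustrative round numbers (Mistral's GQA k/v projections are in fact smaller than d_model x d_model):

```python
def lora_trainable_params(d_model: int, r: int, n_target_matrices: int) -> int:
    """Parameters added by LoRA: each adapted d_model x d_model projection
    gains two low-rank factors, A (d_model x r) and B (r x d_model)."""
    return n_target_matrices * 2 * d_model * r

# Illustrative 7B-class setup: d_model=4096, rank 16, q/k/v/o over 32 layers.
added = lora_trainable_params(4096, 16, 4 * 32)   # ~16.8M parameters
fraction = added / 7_000_000_000                  # well under 1% of weights
```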

YOLO + Triton Inference

High-performance YOLOv8 object detection deployment using NVIDIA Triton Inference Server on Oracle Kubernetes Engine with TensorRT optimization.

Performance Highlights:

TensorRT - GPU-accelerated inference
Triton - Production-grade serving
Kubernetes - Scalable deployment

Applications:

  • Real-time object detection and tracking
  • Computer vision pipelines
  • Edge-to-cloud deployment patterns
  • Benchmarking inference performance

Tech Stack:

YOLOv8 Triton TensorRT OKE NVIDIA GPU
Low latency Auto-scaling

NVIDIA Nsight Systems Profiling

Deep-dive GPU profiling framework using NVIDIA Nsight Systems to analyze CUDA kernels, NVTX markers, and NCCL communication patterns across distributed training strategies.

Key Insights (2 NVIDIA A10 GPUs):

JIT Compilation: DeepSpeed fused_adam overhead
NCCL Analysis: AllReduce vs AllGather patterns
Timeline: GPU utilization visualization

Profiling Coverage:

  • DDP, FSDP, ZeRO-2, ZeRO-3 strategies
  • CUDA memory allocation tracking
  • CPU-GPU synchronization analysis
  • Communication bottleneck detection

Tech Stack:

Nsight Systems CUDA NVTX NCCL DeepSpeed
GPU Profiling Timeline Analysis

LLM Inference Benchmarking

Comprehensive framework for benchmarking vLLM vs NVIDIA Triton vs HuggingFace TGI inference servers on Kubernetes with NVIDIA Nsight Systems GPU profiling.

Key Results (NVIDIA A10 - Mistral-7B):

Peak Throughput: TGI at 8.07 req/s
Token Rate: vLLM at 412 tok/s
GPU Utilization: vLLM at 99% SM

Framework Features:

  • Side-by-side inference server comparison
  • Nsight Systems CUDA kernel profiling
  • Kubernetes-native deployment manifests
  • A10, A100, H100, H200, B200 GPU support

Tech Stack:

vLLM Triton TGI Kubernetes Nsight
3 Inference Servers Performance Analysis

IBM Fusion HCI LLM Benchmarking

Reusable benchmarking framework for evaluating LLM inference server performance on OpenShift clusters with GPU acceleration using IBM Fusion HCI and NVIDIA A100 MIG GPUs.

Key Results (A100 MIG 20GB):

vLLM: 560.71 tok/s (3.4x faster)
Latency: 2543ms P50 (vLLM)
Triton: 404.72 tok/s

Framework Features:

  • ACM ManifestWork templates for managed clusters
  • Universal benchmark client for all backends
  • Visualization tools for performance analysis
  • Complete GPU troubleshooting docs

Tech Stack:

OpenShift IBM Fusion HCI vLLM Red Hat ACM A100 GPU
OpenShift Native 3 Inference Engines

NVIDIA cuOpt EV Fleet Optimization

Complete framework for deploying NVIDIA cuOpt on OCI for electric vehicle (EV) fleet optimization, with GPU-accelerated route optimization running 10-100x faster than CPU solvers.

Key Results (4x A10 GPUs):

Success Rate: 100% across 17 scenarios
Throughput: 150+ vehicles/min at scale
Cost Savings: 15-25% delivery reduction

Use Cases:

  • Last-mile delivery optimization (24-77s)
  • EV charging station routing
  • Real-time fleet dispatch (<15s)
  • Enterprise 500+ vehicle operations

Tech Stack:

NVIDIA cuOpt OCI OKE NVIDIA NIM A10 GPU
10-100x Faster Route Optimization
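For intuition on what the route optimizer is accelerating, here is a greedy nearest-neighbor baseline for a single vehicle. This is only an illustrative CPU heuristic; cuOpt solves the full constrained vehicle-routing problem on the GPU:

```python
import math

def nearest_neighbor_route(depot, stops):
    """Greedy tour from the depot: always drive to the closest unvisited stop.
    A weak baseline that GPU VRP solvers like cuOpt substantially beat."""
    route, current, remaining = [], depot, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda s: math.dist(current, s))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route

route = nearest_neighbor_route((0, 0), [(5, 5), (1, 0), (2, 2)])
# Visits the closest stop first: (1, 0), then (2, 2), then (5, 5)
```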

LLM Observability Stack

Complete Prometheus + Grafana observability stack for monitoring GPU clusters, vLLM inference, and LLM training workloads on Oracle Kubernetes Engine (OKE).

Key Features:

GPU Metrics: DCGM Exporter for NVIDIA GPUs
vLLM Metrics: Latency, throughput, KV cache
Training: Loss curves, gradient norms

Dashboards Included:

  • Cluster Management Home (unified view)
  • GPU Cluster Overview & GPU Health Alerts
  • vLLM Inference & Training Cluster
  • OKE Cluster Overview

Tech Stack:

Prometheus Grafana DCGM OKE AlertManager
7 Dashboards 30+ Alert Rules

LLM Serving Benchmark

Comprehensive benchmarking framework for evaluating LLM serving performance comparing vLLM, TGI, and NVIDIA NIM on Kubernetes with detailed latency and throughput analysis.

Key Metrics:

Latency: P50, P95, P99, TTFT, TPOT
Throughput: Tokens/sec, Requests/sec
Resources: GPU memory, utilization

Framework Features:

  • Side-by-side inference server comparison
  • Configurable concurrency levels
  • Kubernetes deployment manifests
  • A10, A100, H100 GPU support

Tech Stack:

vLLM TGI NIM Kubernetes NVIDIA GPU
3 Inference Servers Performance Analysis
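The latency metrics above (P50/P95/P99, TTFT, TPOT) can be computed from raw request logs. Below is a generic sketch using a simple nearest-rank percentile, not the benchmark's own code:

```python
def percentile(samples, p):
    """Nearest-rank percentile (a simple variant used for quick reports)."""
    ordered = sorted(samples)
    idx = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[idx]

def tpot(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Time Per Output Token: decode time spread over tokens after the first."""
    return (e2e_latency_s - ttft_s) / max(output_tokens - 1, 1)

# Illustrative end-to-end latencies (seconds) for ten requests.
latencies = [0.9, 1.1, 1.0, 2.5, 1.2, 1.3, 1.1, 1.0, 3.0, 1.4]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
per_token = tpot(e2e_latency_s=2.0, ttft_s=0.25, output_tokens=128)
```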

MoE Training Parallelism Framework

Comprehensive framework for benchmarking distributed Mixture-of-Experts (MoE) training using Expert Parallelism (EP) and hybrid EP+Data Parallelism strategies on Oracle Kubernetes Engine with NVIDIA GPUs.

Key Results (4 NVIDIA A10 GPUs):

Hybrid EP=2,DP=2: 8.77x speedup (96,592 tok/s)
Memory Reduction: 56% with Expert Parallelism
Compute/Comm: 34.75x efficient ratio

Benchmark Tracks:

  • Expert routing & load balancing analysis
  • NCCL AlltoAll communication profiling
  • EP vs DP vs Hybrid scaling comparison
  • Auxiliary loss tuning (CV=0.04 optimal)

Tech Stack:

PyTorch NCCL Expert Parallelism Kubernetes OCI
5 Benchmark Tracks Expert Parallelism
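The load-balancing analysis above scores expert routing with the coefficient of variation (CV) of tokens per expert; lower is better, and the auxiliary loss is tuned to drive it down. A minimal version, with illustrative token counts:

```python
from statistics import mean, pstdev

def load_balance_cv(tokens_per_expert):
    """Coefficient of variation of tokens routed to each expert:
    0 means perfectly balanced routing, higher means hot experts."""
    return pstdev(tokens_per_expert) / mean(tokens_per_expert)

balanced = load_balance_cv([1000, 980, 1020, 1000])  # nearly uniform routing
skewed = load_balance_cv([2500, 500, 500, 500])      # one hot expert
```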

LLM Training Parallelism Guide

Practical strategy guide for selecting LLM training parallelism approaches, comparing DDP, Pipeline, Tensor Parallelism, and hybrid strategies with detailed NCCL communication pattern analysis.

Key Results (4 NVIDIA A10 GPUs):

DDP (4 GPU): 52,847 tok/s (54% scaling)
NCCL Analysis: AllReduce, Send/Recv, AllGather
Hybrid PP=2xTP=2: 10,069 tok/s

Strategy Coverage:

  • Data Parallelism (DDP) - single & multi-node
  • Pipeline Parallelism (PP=2, PP=4)
  • Tensor Parallelism (TP) strategies
  • NCCL communication pattern visualization

Tech Stack:

PyTorch DDP NCCL Nsight Systems OKE
NCCL Patterns Strategy Guide
16
Open Source Projects
Production-ready frameworks
5K+
Lines of Code
Comprehensive documentation
100%
Cloud Native
OCI + Kubernetes ready

All projects include comprehensive documentation, performance benchmarks, and production deployment guides

View All Projects on GitHub

📰 LinkedIn Newsletter

Weekly insights on AI infrastructure, LLM optimization, and practical engineering

Beyond the Model

Where AI infrastructure meets practical engineering

Weekly deep-dives into AI/ML infrastructure, LLM benchmarking, inference optimization, and real-world engineering challenges. From RAG pipelines to GPU clusters.

400+ Subscribers
Weekly Publication

Recent Articles:

  • Beyond MMLU: Why Traditional AI Benchmarks Are Failing Us
  • Understanding Reasoning-First LLMs: NVIDIA Nemotron Study
  • LLM Observability: Why Monitoring Your AI Infrastructure is Critical
  • RAG Quality vs Speed: A Framework for Measuring What Matters
  • Speculative Decoding: Get 14-17% Faster LLM Inference
Subscribe on LinkedIn

Topics Covered:

AI Benchmarking LLM Inference RAG Systems MoE Architecture GPU Infrastructure Agentic AI Training Parallelism OCI/Cloud

Core Expertise

20+ years of specialized expertise in AI/ML infrastructure, HPC systems, and cloud architecture delivering enterprise-scale solutions

AI/ML & HPC Infrastructure Architecture

Enterprise-Scale Solutions

Leading the design and implementation of large-scale, high-performance computing environments for AI/ML workloads.

  • Large-Scale GPU Cluster Design (NVIDIA A100/H100)
  • Multi-node GPU clusters for GenAI and LLM training
  • Performance optimization & tuning for HPC/AI workloads

Cloud Solutions Architecture

Oracle Cloud Infrastructure

Designing and implementing robust, scalable, and cost-effective cloud solutions on Oracle Cloud Infrastructure (OCI).

  • Hybrid & Multi-Cloud Architecture
  • Kubernetes & Docker Containerization
  • Infrastructure as Code (Terraform, Ansible)

Domain Knowledge

Industry Specialization

  • Automotive HPC: Autonomous Driving (AD/ADAS), Computer-Aided Engineering (CAE), simulation workloads
  • Financial & Defence: Monte Carlo simulation, financial app orchestration
  • Generative AI & NVIDIA Ecosystem: GenAI platforms, LLM training, full NVIDIA AI stack

HPC & Technical Skills

System Architecture

Workload Management

  • Slurm
  • IBM LSF

Storage & Networking

  • Lustre, GPFS
  • RDMA (RoCE v2)

DevOps & Monitoring

  • CI/CD, Git
  • Prometheus, Grafana

Programming

  • Python, Shell
  • Linux Systems

Professional Skills

C-Level & Executive Advisory

Customer consultation

Pre-Sales Engineering

Technical consulting

System Design

Documentation

Technical Mentoring

Leadership

Professional Experience


AI Architect - AI Centre of Excellence

Oracle Iberia

Feb 2021 - Present | Campanillas, Spain

• Define enterprise reference architectures, performance baselines, and guardrails for GPU-accelerated GenAI/HPC on OCI

• Architect and operate GPU-accelerated GenAI and HPC/AI platforms on OCI (Kubernetes/OKE plus Slurm & PBS Pro)

• Lead performance engineering & benchmarking, CUDA/NCCL micro-benchmarks, optimize GPU utilization and throughput

• Enable distributed training & inference for LLMs and CV/NLP (DeepSpeed/FSDP/Horovod on Slurm/PBS Pro and OKE)

• Build reusable IaC blueprints (Terraform/Resource Manager, Helm, OCI DevOps/OCIR) for rapid GPU cluster deployment

• Partner with automotive CAE/simulation teams to map CFD/FEA/crash workloads to optimal shapes/schedulers


Senior Professional, Emerging Technologies

DXC Technology

Nov 2018 – Jan 2021 | Europe & UK

• HPC and emerging technologies consultant supporting scientific computing workloads across financial services, aerospace, and automotive

• Delivered HPC infrastructure engineering using NVIDIA Bright Cluster Manager, xCAT, LSF, and PBS Pro

• Led automation initiatives with Ansible, Docker, and Python for cluster provisioning and application deployment

• Enabled hybrid cloud integration with AWS and GCP for scalable compute environments

• Conducted extensive application and hardware benchmarking with performance optimization


HPC Analyst

Citi (Citicorp Services India)

Aug 2016 – Oct 2018 | Financial Engineering

• HPC Engineer supporting Financial Engineering Research Group for real-time financial trading and risk modeling

• Point-of-Contact for emerging HPC technologies, driving innovation in simulation grid architecture

• Conducted hardware and application benchmarking, validating performance for production trading environments

• Designed and tested hybrid HPC architecture PoCs ensuring scalability and reliability

• Customized ELK stack for infrastructure observability, log correlation, and anomaly detection


Lead HPC Solutions Developer

Tata Motors / Tata Technologies

Jun 2008 – Aug 2016 | Automotive R&D

• HPC operations and infrastructure lead for Computer-Aided Engineering (CAE) Research Group

• Directed daily operations of multi-node, heterogeneous HPC cluster for automotive simulations

• Led integration and performance tuning of LS-DYNA, Abaqus, Ansys Fluent, MSC Nastran, StarCCM+, OptiStruct

• Developed custom CAE job submission portal integrated with PBS Pro

• Enabled centralized CAE access via Altair e-Compute portal across engineering teams


Senior Linux System Administrator

Sankalp Venture

Mar 2007 – May 2008

Led enterprise Linux infrastructure administration, web/mail servers, and team management for Indian Express news sites.


Programmer & Academic Mentor

Vindhaya Institute

Mar 2004 – Feb 2007

Taught computer science subjects, conducted lab sessions, mentored B.E. students on software development projects.

Professional Portfolio

Strategic partnerships with leading organizations across global markets, delivering transformational AI infrastructure solutions with proven results and measurable impact

Fortune 500

Global Enterprises

• Energy & Oil Companies

• Manufacturing Giants

• $2.4T+ combined market cap

AI Unicorns

Innovation Leaders

• LLM Builders

• Biotech AI

• Research Pioneers

Government

Public Sector

• Smart Cities

• National Initiatives

• Vision 2030 Projects

FinTech

Financial Innovation

• Payment Leaders

• Cross-Border Platforms

• Digital Banking

Global Reach

EMEA
Europe, Middle East, Africa
AMERICAS
North & South America
APAC
Asia Pacific
NORDIC
Scandinavian Region
50+
Global Organizations
Across 4 continents
100k+
GPU Hours
AI/ML workloads
$50M+
Infrastructure Value
Deployed solutions

Professional Impact

Delivering mission-critical AI/ML infrastructure solutions that drive digital transformation across industries

Energy Sector

DataRobot AI platform deployment for digital transformation initiatives

✓ Production deployment success

FinTech

Cross-border payments platform with data sovereignty compliance

✓ 25% cost reduction achieved

AI Innovation

LLM training infrastructure for next-generation AI companies

✓ 50+ GPU cluster deployed

Telecom

5G network optimization with AI-driven analytics

✓ Full compliance achieved

Public Sector

Smart city initiatives and digital governance platforms

✓ Multi-region deployment

Manufacturing

Predictive maintenance and quality optimization systems

✓ 30% efficiency gain

Service Categories

Infrastructure Architecture

AI/ML platform design, HPC cluster deployment, cloud migration strategy

Performance Optimization

GPU utilization, RDMA networking, workload scheduling, cost reduction

Compliance & Security

Data sovereignty, regulatory compliance, security best practices

LATEST SUCCESS STORY

Austrian AI Pioneer Breakthrough

50-Node GPU Cluster • xLSTM Technology • Research to Production

Challenge: Austrian LLM builder entering productization phase, seeking European AI leadership position

Solution: Deployed a 50+3-node BM.GPU.H100.8 cluster with RDMA networking for xLSTM technology research

Impact: Enabled transition from university research to commercial AI products, competing with leading European AI companies

Technical Architecture:

GPU Infrastructure:

  • 50+3-node GPU cluster
  • BM.GPU.H100.8 configuration
  • RDMA cluster networking

HPC Components:

  • 14+2-node CPU cluster
  • BM.HPC.E5.144 nodes
  • File System Storage (FSS)
xLSTM Technology H100 GPUs RDMA Networking Production Ready

Recognition Received

Q1FY25 EMEA Technology Engineering Excellence Award for outstanding collaborative work and customer success

Professional Network

Building strategic partnerships with industry leaders across global markets to deliver transformational AI infrastructure solutions

Professional Certifications

Oracle Cloud Infrastructure

Architect Associate

2021, 2023

🔒

Security Professional

2023

☁️

Cloud Foundation

2021

⚙️

Operations Associate

2021

NVIDIA AI & Data Centers

🤖

Introduction to AI

Data Centers 2023

💻

GPU Computing

Certified

🚀

AI Infrastructure

Expert Level

Multi-Cloud Expertise

☁️

AWS Architect

Associate Level

🔵

Google Cloud

Professional Series

Bright Cluster

8.0 Administration

4 Oracle Cloud • 3 NVIDIA AI • 6 Multi-Cloud • 2 Leadership

Total of 15 Professional Certifications spanning cloud architecture, AI/ML, data platforms, security, and quality management across Oracle, AWS, GCP, NVIDIA, and industry standards

Client Success Stories

Delivering transformational AI/ML and HPC infrastructure solutions across diverse industry verticals

Global Energy Corporation

Digital Transformation - AI/ML Platform

Fortune 500 Oil & Gas Company

Challenge: Deploy DataRobot AI platform for digital transformation initiatives

Solution: Enhanced performance with specialized OCI features and HPC expertise

Impact: ✅ Delivered on timeline, customer adopted OCI for production workloads

FinTech Payment Leader

Cross-Border Payments Platform

FTSE 250 Listed Company

Challenge: Expand into new market with data sovereignty compliance

Solution: Oracle Cloud deployment with enhanced GlusterFS integration

Impact: ✅ 25% cost reduction vs on-premises, enabling rapid market entry

Telecommunications Giant

5G Network Optimization

Leading Middle East Telecom Provider

Challenge: 5G network optimization with AI-driven analytics platform

Solution: GPU-accelerated HPC cluster with Oracle Linux optimization

Impact: ✅ Full compliance achieved, enhanced performance delivered

Biotech AI Pioneer

Therapeutic AI Research Platform

Healthcare AI Innovator

Challenge: Generative AI for therapeutic antibody design and protein discovery

Solution: High-performance GPU cluster with specialized AI frameworks

Impact: ✅ Accelerated drug discovery, research breakthrough achieved

Government Smart City Initiative

AI-Powered Urban Management

National Vision 2030 Project

Challenge: AI-powered visual pollution detection processing 100K-200K images daily

Solution: Scalable cloud infrastructure with automated AI pipeline

Impact: ✅ Revolutionized environmental monitoring and urban planning

LATEST

European AI Research Unicorn

xLSTM Technology • 50-Node GPU Cluster

Austrian LLM Pioneer

Challenge: European LLM builder entering productization phase, seeking AI leadership position

Solution: Deployed a 50+3-node BM.GPU.H100.8 cluster with RDMA networking

Impact: ✅ Enabled transition from university research to commercial AI products

Technical Publications & Insights

Sharing knowledge and insights through technical blogs, reference architectures, and open-source contributions

OCI Reference Architectures

Deploy Scalable OwnGPT Model on Oracle Cloud

Reference architecture for deploying enterprise-scale generative AI solutions on OCI with comprehensive ERP integration capabilities.

Oracle Cloud GenAI ERP Integration
View Architecture

Accelerate VM Image Storage in KVM

Accelerate and scale the storage of virtual machine images in a KVM environment with enterprise-grade reliability.

KVM Storage Performance
View Architecture

Remote Synchronous Block Replication

Use remote synchronous block replication on Oracle Cloud Infrastructure for enterprise-grade data replication.

Block Storage Replication OCI
View Architecture

Video Surveillance Analytics Performance

Video surveillance and analytics software performance optimization on OCI for enhanced security operations.

Video Analytics Performance OCI Blog
Read Blog

Protein Large Language Models

Powering protein large language models in antibody discovery on OCI for pharmaceutical innovation.

Protein LLM Antibody OCI Blog
Read Blog

Telco Innovation with GPUs

Accelerating telco innovation by leveraging power of GPUs on OCI for enhanced customer experiences.

Telco GPU OCI Blog
Read Blog

One Lexiicon ownGPT AI Model

Pioneering collaboration for AI innovation and excellence with One Lexiicon ownGPT AI model on OCI.

ownGPT AI Model OCI Blog
Read Blog

De Novo Antibody Design

Pioneering de novo antibody design with OCI, supporting Silica Corpora's AI mission for precision and efficacy.

AI Design Antibody OCI Blog
Read Blog

Primary GitHub Repository

Personal collection of AI/ML infrastructure projects, HPC configurations, automation scripts, and technical implementations for enterprise-scale deployments.

AI Infrastructure HPC Automation
View Repository

Academic & Research Projects

Research-focused repository containing academic projects, mathematical computing implementations, and early-stage experimental work in system architecture.

Research Academic Mathematics
View Repository

Oracle DevRel Contributions

Enterprise-grade implementations for Oracle Developer Relations, featuring DeepSpeed training, GPU clustering, and production-ready AI infrastructure patterns.

Oracle DeepSpeed Enterprise
View Contributions

Medical RAG Chatbot

Advanced Retrieval-Augmented Generation (RAG) chatbot for medical information and healthcare applications, featuring vector search, semantic understanding, and context-aware responses for medical queries.

RAG Healthcare AI/ML Vector Search
View Repository

Technical Articles from Tata Technologies Experience

Published insights and technical achievements from my tenure as HPC & CAE Systems Engineer at Tata Technologies, focusing on performance optimization and enterprise-scale solutions.

NUMA Benchmarking Results

Performance Comparison Chart

17.7s → 5.59s
68% faster

Unlocking Performance: How NUMA Tuning Can Triple Your CAE Simulation Speed on HPC

Demonstrated 68% reduction in runtime through NUMA optimization techniques for compute-intensive CAE applications like LS-DYNA in automotive simulations. Achieved 45-50% performance improvement in CPU time through strategic process and memory placement.

NUMA HPC Optimization CAE Performance LS-DYNA
Read Article

HPC Ecosystem Architecture

MSC Patran - Stress Analysis

12 weeks → 9 weeks
25% faster

From Bottleneck to Breakthrough: Revolutionizing CAE Workflows with a Tailored HPC Ecosystem

Engineered custom HPC environment that reduced CAE loop time by 25% (12 weeks to 9 weeks) with 8x increase in license utilization. Implemented GlusterFS distributed file system and Torque/Maui job scheduling for enterprise-scale CAE workflows.

HPC Architecture CAE Workflow MSC Nastran Performance
Read Article

Technical Impact

5+

OCI Reference Architectures

10+

Technical Articles

3+

Open Source Projects

1000+

Developer Engagements

Education & Academic Background

Master of Computer Applications (M.C.A)

1999 - 2002

Rajiv Gandhi Proudyogiki Vishwavidyalaya (RGPV)

Madhya Pradesh, India • State Technical University

Comprehensive graduate program in computer applications covering advanced software engineering, distributed systems architecture, database management, network programming, and enterprise computing solutions. Specialized coursework in system design, performance optimization, and scalable application development.

Software Engineering System Architecture Database Systems Network Programming

Bachelor of Science in Mathematics

1995 - 1998

Post Graduate College, Satna

Madhya Pradesh, India • Affiliated College

Rigorous undergraduate program in pure and applied mathematics covering advanced calculus, linear algebra, differential equations, probability theory, statistics, and numerical analysis. Built strong analytical and problem-solving foundation essential for understanding AI/ML algorithms, performance optimization, and computational complexity in large-scale infrastructure systems.

Mathematical Analysis Statistics Numerical Methods Linear Algebra

Academic Foundation

Software Engineering

Systems design, architecture patterns, enterprise software development methodologies, and scalable application frameworks

Mathematical Computing

Statistical analysis, computational mathematics, algorithmic optimization, and mathematical foundations for AI/ML

System Architecture

Distributed systems design, network architecture, high-performance computing principles, and infrastructure scalability

Educational Philosophy

"The combination of computer science expertise and mathematical foundations provides the perfect foundation for understanding both the theoretical principles and practical applications of modern AI/ML infrastructure architecture."

Professional Communities

Active participant in global professional communities spanning AI, HPC, cloud computing, and technology domains

29

Professional Groups

Active member in specialized communities and technical forums worldwide

Key Professional Communities

Big Data & AI

388,963 members

Data Science | Machine Learning | Deep Learning | AI

Project Manager Community

698,184 members

Best Group for Project Management

Cloud Computing

545,631 members

AWS, Azure, GCP, IBM, Alibaba, OCI

High Performance Computing

27,763 members

HPC Infrastructure & Supercomputing

Auto OEM & Dealer Network

409,895 members

World's Largest Automotive Group

Linux Expert

195,565 members

Linux Systems & Open Source

Additional Specialized Communities

CAE & Engineering
CAD, CAE, FEM, MBD & Optimization

AI & ML
OpenAI, ChatGPT, NLP, AI Agents

Scientific Computing
HPC-AI Advisory Council, CSSC

Beyond Technology

When not architecting AI infrastructures, I find balance and inspiration through sports and music

Cricket Enthusiast

The Gentleman's Game

Cricket has been a lifelong passion - from following international matches to understanding the strategic complexities that mirror the analytical thinking required in AI architecture.

Cricket Interests:

  • International cricket analysis and statistics
  • Following World Cup and tournament strategies
  • Player performance analytics and data trends
  • Team dynamics and leadership insights

"Cricket teaches patience, strategy, and the importance of both individual excellence and team collaboration - principles that directly apply to leading complex infrastructure projects."

Music & Singing

Creative Expression

Music provides the perfect counterbalance to technical work - offering creative expression and emotional release that keeps me energized and inspired.

Musical Journey:

  • Vocal performance and singing practice
  • Exploring diverse musical genres and styles
  • Bollywood classics and contemporary hits
  • International music appreciation

"Music teaches rhythm, timing, and the art of harmonious collaboration - qualities essential for orchestrating complex AI infrastructure deployments."

Work-Life Harmony

"The analytical precision required for AI architecture finds perfect balance in the strategic thinking of cricket and the creative flow of music. These passions keep me grounded, inspired, and bring fresh perspectives to solving complex technical challenges."

Strategic Thinking

Cricket strategy enhances architectural planning

Creative Problem-Solving

Musical creativity drives innovative solutions

Team Leadership

Sports and music build collaborative skills

Let's Build the Future Together

Ready to transform your AI/ML infrastructure? Let's discuss how we can accelerate your journey to production-scale AI solutions.

Professional Network

Connect for strategic AI infrastructure discussions and industry insights

LinkedIn Profile

Technical Collaboration

Explore open-source contributions and technical implementations

GitHub Profile
Available on Demand

Open to strategic AI infrastructure consulting and enterprise architecture engagements upon request