Job Title: Lead Solution Architect
Company: Yotta Data Services Private Limited
Location: Mumbai, Maharashtra
Created: 2026-01-04
Job Type: Full Time
Job Description:
About Yotta:
Yotta Data Services is powering digital transformation with scalable cloud, colocation, and managed services. Yotta offers a comprehensive suite of cloud, data center, and managed services designed to accelerate digital transformation for businesses of all sizes. With state-of-the-art infrastructure, cutting-edge AI capabilities, and a commitment to data sovereignty, we empower organisations to innovate securely and efficiently.

Total Experience: 10+ years in systems engineering, network engineering, cloud infrastructure, or datacenter design.

Key Responsibilities:

1. AI Systems Architecture (Compute, GPU, OS)
• Design and deploy large-scale GPU clusters (H100, H200, GB200, and GB300) for distributed training and inference.
• Architect multi-node GPU systems using NVLink/NVSwitch and PCIe Gen5.
• Define OS, kernel, driver, and runtime configurations optimized for AI workloads (CUDA/ROCm, NCCL, UCX, OFED).
• Develop high-performance compute blueprints for diverse use cases: training, fine-tuning, retrieval, and batch inference.

2. High-Performance Networking
• Architect AI fabric networks, including InfiniBand HDR/NDR/XDR/SPX, RoCEv2/RDMA, and 100/200/400/800 Gbps Ethernet fabrics.
• Design low-latency, high-bandwidth topologies (fat-tree, dragonfly+, multi-plane architectures).
• Plan and tune inter-node communication for distributed AI training (NCCL, MPI, UCX).
• Implement network segmentation, isolation, and multi-tenant security for AI compute clusters.

3. Storage & Data Pipeline Infrastructure
• Architect high-throughput storage solutions for AI: parallel file systems (Lustre, BeeGFS, IBM Spectrum Scale); cloud-native high-performance storage (Amazon FSx for Lustre, Azure NetApp Files, Google Cloud Filestore High Scale); and NVMe, NVMe-over-Fabrics, and object storage.
• Optimize data pipelines for large-scale dataset ingestion, feature extraction, checkpointing, and streaming.

4. Platform Integration & Orchestration
• Integrate systems with Kubernetes GPU environments (EKS/AKS/GKE, on-prem K8s, Kueue, Volcano).
• Design infrastructure to support distributed training frameworks: PyTorch DDP, DeepSpeed, Ray Train, and JAX/TPU alternatives.
• Enable robust scheduling, multi-tenancy, and job orchestration.

5. Reliability, Monitoring & Performance Optimization
• Implement monitoring for GPU utilization, network telemetry, I/O performance, and cluster health (Prometheus, Grafana, DCGM, NetQ).
• Conduct performance tuning across the NIC/driver stack, GPU topology, storage throughput, and network congestion management (ECN, PFC, QoS).
• Design systems for high availability, resilience, and disaster recovery.

6. Security & Compliance (Infra-Level)
• Implement hardware-level and network-level security controls: IAM, RBAC, ACLs, segmentation, and encryption in transit.
• Architect secure multi-tenant GPU environments, including confidential computing where supported.
• Ensure system compliance with SOC 2, ISO 27001, or industry-specific security frameworks.

Good-to-have skills:
• Experience building clusters for AI training at >100-GPU scale.
• Familiarity with AI data engineering systems (Kafka, Spark, Ray Data).
• Experience with bare-metal provisioning tools (MAAS, iPXE, Metal³).
• Knowledge of GPU virtualization, MIG/partitioning, or multi-tenant GPU scheduling.

Qualification Criteria:
• 10+ years in systems engineering, network engineering, cloud infrastructure, or datacenter design.
• Deep hands-on experience with GPU systems (NVIDIA), InfiniBand/RDMA/RoCEv2, high-performance storage solutions, Linux systems tuning, and HPC/AI cluster design.
• Strong networking background (L2/L3 switching, routing, QoS, congestion control, BGP/EVPN).
• Familiarity with AI frameworks and distributed training, even if not a data scientist.
• Expertise with infrastructure automation: Terraform, Ansible, and Kubernetes manifests/Helm.

Interested candidates can share their updated resume at sadhatrao@
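To give candidates a flavour of the fabric-design work described under "High-Performance Networking", here is a minimal sketch of the standard sizing arithmetic for a k-ary fat-tree built from k-port switches (the function name is our own illustration, not part of the role's tooling):

```python
# Sketch: host and switch counts for a k-ary fat-tree topology.
# A fat-tree of k-port switches has k pods, each with k/2 edge and k/2
# aggregation switches, plus (k/2)^2 core switches, supporting k^3/4 hosts
# at full bisection bandwidth. Function name is illustrative only.

def fat_tree_capacity(k: int) -> dict:
    """Return host and switch counts for a fat-tree of k-port switches."""
    if k % 2 != 0:
        raise ValueError("k must be even")
    hosts = k ** 3 // 4        # k pods * (k/2 edge switches) * (k/2 hosts each)
    edge = agg = k * (k // 2)  # k pods, each with k/2 edge and k/2 agg switches
    core = (k // 2) ** 2       # (k/2)^2 core switches
    return {"hosts": hosts, "edge": edge, "aggregation": agg, "core": core}

print(fat_tree_capacity(64))
# A fat-tree of 64-port switches supports 65,536 hosts.
```

The same arithmetic explains why radix matters so much in AI fabrics: host capacity grows with the cube of the switch port count.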
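The "plan and tune inter-node communication" responsibility is largely about budgeting collective-communication traffic. A short sketch of the well-known ring all-reduce cost model (the algorithm NCCL commonly selects for gradient synchronization) illustrates the kind of back-of-envelope work involved; the function names and the 400 Gbps figure are illustrative assumptions:

```python
# Sketch: traffic and bandwidth-bound time for a ring all-reduce.
# Each of N GPUs sends (and receives) 2*(N-1)/N * S bytes for a payload
# of S bytes. Names and link speed are illustrative, not a real API.

def ring_allreduce_traffic(n_gpus: int, payload_bytes: float) -> float:
    """Bytes each GPU sends (and receives) in a ring all-reduce."""
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

def allreduce_time_s(n_gpus: int, payload_bytes: float, link_gbps: float) -> float:
    """Bandwidth-bound lower bound on completion time over one link."""
    return ring_allreduce_traffic(n_gpus, payload_bytes) / (link_gbps * 1e9 / 8)

# Example: 8 GPUs synchronizing 1 GiB of gradients over a 400 Gbps fabric.
traffic = ring_allreduce_traffic(8, 2**30)
print(f"{traffic / 2**30:.2f} GiB per GPU")              # 1.75 GiB per GPU
print(f"{allreduce_time_s(8, 2**30, 400) * 1e3:.1f} ms")  # 37.6 ms
```

Because per-GPU traffic approaches 2S as N grows, this bound is nearly independent of cluster size, which is why per-link bandwidth (and congestion control) dominates all-reduce performance at scale.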