Job Title:

Senior Data Center Engineer – AI/ML & GPU Platforms

Company: DC Tech Consulting

Location: Kolhapur, Maharashtra

Created: 2025-12-18

Job Type: Full Time

Job Description:

Location: Remote
Experience: 7+ Years
Type: Full-time

Role Overview

We are seeking a highly skilled Senior Data Center Compute Engineer to design, build, and operate GPU-enabled compute platforms for AI/ML and high-performance workloads. This role is heavily focused on Kubernetes, virtualization, orchestration platforms, and GPU infrastructure, with responsibility for building and managing scalable, production-grade GPU compute fabrics.

The ideal candidate will have deep hands-on experience in Kubernetes cluster deployment and lifecycle management, virtualization platforms, and GPU hardware management, enabling reliable and high-performance AI workloads across on-prem and hybrid data center environments.

Key Responsibilities

- Design, deploy, and operate GPU-enabled compute infrastructure for AI/ML, HPC, and accelerated workloads.
- Build and manage Kubernetes clusters at scale, including:
  - Cluster bootstrap, upgrades, and lifecycle management
  - High availability control planes and worker nodes
  - Multi-tenant and multi-cluster environments
- Implement GPU scheduling, isolation, and sharing within Kubernetes (MIG, device plugins, GPU operators).
- Deploy and manage virtualization platforms (VMware, KVM, OpenStack, or similar) supporting AI and container workloads.
- Design and operate compute orchestration platforms spanning VMs, containers, and bare-metal nodes.
- Integrate GPU servers (NVIDIA A100, H100, L40S, etc.) into Kubernetes and virtualization environments.
- Automate compute and cluster provisioning using Ansible, Terraform, Helm, and scripting (Bash/Python).
- Optimize compute performance, GPU utilization, and resource efficiency across clusters.
- Manage bare-metal provisioning, OS imaging, and firmware lifecycle for compute nodes.
- Collaborate with networking and storage teams to deliver a fully integrated AI compute fabric.
- Implement monitoring, logging, and capacity planning for compute and GPU resources.
- Maintain detailed documentation for cluster architecture, compute design, and operational runbooks.

Required Skills & Qualifications

- 7+ years of experience in data center compute or platform engineering roles.
- Strong expertise in Kubernetes deployment and management, including:
  - Production-grade cluster design
  - Upgrades, scaling, and troubleshooting
  - Kubernetes scheduling and resource management
- Hands-on experience with virtualization platforms such as VMware, KVM, OpenStack, or equivalent.
- Solid understanding of container runtimes, orchestration, and cloud-native architectures.
- Experience managing GPU hardware and drivers, including:
  - NVIDIA GPU installation and firmware
  - CUDA, NVIDIA drivers, and GPU operators
- Proficiency in automation and IaC tools (Ansible, Terraform, Helm).
- Strong Linux administration skills (RHEL, Ubuntu, CentOS).
- Experience with performance tuning and capacity planning for compute-intensive workloads.
- Excellent troubleshooting skills across OS, Kubernetes, virtualization, and GPU layers.

Preferred / Good to Have

- Experience building GPU compute fabrics / GPUaaS platforms.
- Knowledge of NVIDIA technologies such as MIG, NVLink, NVSwitch, GPUDirect, and CUDA ecosystems.
- Familiarity with containerized AI/ML frameworks (Kubeflow, Ray, MLflow).
- Exposure to bare-metal Kubernetes (RKE2, OpenShift, kubeadm, MAAS).
- Experience with monitoring and observability tools (Prometheus, Grafana).
- Understanding of hybrid cloud compute models and on-prem to cloud integrations.
- Kubernetes certifications (CKA / CKAD) or virtualization certifications are a plus.
