Job Title:
Senior Linux Administrator – AI/ML & Data Center Networking
Company: DC Tech Consulting
Location: Eluru, Andhra Pradesh
Created: 2025-12-18
Job Type: Full Time
Job Description:
Location: Remote
Experience: 7+ Years
Type: Full-time

Role Overview
We are seeking a highly skilled Senior Linux Administrator with strong Data Center Networking expertise to join our AI/ML infrastructure team. The role focuses on designing, deploying, and operating on-premises Linux and Kubernetes environments optimized for AI/ML and high-performance computing (HPC) workloads, with a significant emphasis on data center networking architectures.

The ideal candidate will bring hands-on experience across Linux systems, Kubernetes, and modern data center networks, including high-speed Ethernet or InfiniBand fabrics used for AI/ML and GPU clusters. This role requires a deep understanding of how network design, latency, throughput, and reliability directly impact AI/ML performance.

Key Responsibilities
- Deploy, configure, and manage on-premises Linux servers supporting AI/ML and GPU-accelerated workloads.
- Design, implement, and operate data center networking for AI/ML infrastructure, including:
  - High-speed Ethernet (25G/40G/100G/400G) or InfiniBand fabrics
  - Spine-leaf architectures and low-latency network designs
- Configure and troubleshoot Kubernetes networking, including CNI plugins (Calico, Cilium, Flannel), service networking, ingress, and network policies.
- Optimize network performance, latency, and throughput for distributed training, storage access, and HPC workloads.
- Work closely with network teams to integrate switching, routing, VLAN/VXLAN, BGP, and load balancing into Kubernetes and AI platforms.
- (Desirable) Automate infrastructure and network provisioning using Ansible, Terraform, and scripting (Bash/Python).
- Administer and monitor data center components such as compute servers, network switches, storage systems, and virtualization platforms.
- Troubleshoot end-to-end issues spanning Linux OS, Kubernetes and network layers.
- Ensure security, segmentation, and compliance across compute and network environments.
- Plan and implement scalable, highly available architectures for AI/ML platforms.
- Maintain accurate documentation including network diagrams, IP plans, topology maps, and runbooks.

Required Skills & Qualifications
- 7+ years of experience in Linux system administration (RHEL, Ubuntu, CentOS).
- Strong hands-on experience with data center networking, including:
  - L2/L3 networking fundamentals (VLANs, routing, BGP, VXLAN)
  - Spine-leaf architectures and modern DC network designs
  - High-bandwidth, low-latency networks for AI/HPC workloads
- Proven experience managing Kubernetes clusters, with solid understanding of Kubernetes networking concepts.
- Experience integrating compute, storage, and networking for large-scale on-prem or hybrid data centers.
- Working knowledge of network performance tuning, packet flow, and troubleshooting tools (tcpdump, iperf, ethtool, etc.).
- Experience with automation tools such as Ansible, Terraform, and CI/CD pipelines.
- Proficiency in Bash and Python scripting.
- Strong understanding of system and network performance optimization.
- Excellent problem-solving and cross-team collaboration skills.

Preferred / Good to Have
- Experience with NVIDIA GPU networking, GPUDirect, RDMA, or InfiniBand environments.
- Familiarity with HPC and distributed AI training frameworks.
- Exposure to data center switches from vendors such as Cisco, Arista, Juniper, NVIDIA (Spectrum), etc.
- Experience with monitoring and observability tools (Prometheus, Grafana).
- Knowledge of hybrid cloud networking and on-prem to cloud connectivity.
- CKA (Certified Kubernetes Administrator) or networking certifications (CCNA/CCNP or equivalent) are a plus.