Job Title: Machine Learning Engineer
Company: Pilotcrew AI
Location: New Delhi, Delhi
Created: 2026-03-07
Job Type: Full Time
Job Description:
Machine Learning Engineer – Generative AI & Agent Evaluation

Location: Remote
Company: Pilotcrew AI
Type: Full-Time
Experience: 2–6 Years

About Pilotcrew AI

Pilotcrew AI builds infrastructure for AI agent evaluation. We benchmark large language models, run automated agent evaluations, power human-in-the-loop assessments, and host AI arenas for competitive testing. Our mission is to make AI agents measurable, reliable, and production-ready through structured, scalable evaluation systems.

Role Overview

We are hiring a Machine Learning Engineer with strong Generative AI expertise to design and build scalable evaluation infrastructure for LLMs and AI agents. You will architect distributed inference pipelines, structured trace-logging systems, tool-call validation frameworks, and automated grading engines. The role involves benchmarking proprietary and open-weight LLMs, implementing pass@k and robustness metrics, building adversarial stress-testing pipelines, and analyzing agent failure modes under real-world conditions. This is a systems-heavy, production-focused GenAI role requiring strong ML fundamentals and engineering rigor.

Key Responsibilities

- Design and implement distributed LLM inference pipelines
- Build automated benchmarking systems for reasoning, planning, and tool use
- Implement pass@k, reliability metrics, variance analysis, and statistical confidence evaluation
- Develop adversarial testing frameworks for stress-testing agents
- Create structured evaluation pipelines (rule-based and model-based graders)
- Build trace capture, logging, and telemetry systems for multi-step agent workflows
- Validate tool calls and sandboxed execution environments
- Optimize inference for latency, cost, and throughput
- Manage dataset versioning and reproducible benchmark pipelines
- Deploy and monitor GenAI systems in production (AWS/GCP/Azure)

Required Skills

- Strong Python programming and system design skills
- Hands-on experience with Generative AI systems and LLM APIs
- Experience with PyTorch or TensorFlow
- Experience building production ML or GenAI systems
- Strong understanding of decoding strategies, temperature effects, and sampling variance
- Familiarity with async processing, distributed task execution, or job scheduling
- Experience with Docker and cloud deployment
- Strong debugging, observability, and reliability engineering mindset

Preferred Skills

- Experience with AI agent architectures (ReAct, tool calling, planner-executor loops)
- Experience with reward modeling or evaluation science
- Knowledge of RLHF or alignment pipelines
- Familiarity with vector databases (FAISS, Pinecone, Weaviate)
- Experience with distributed systems (Ray, Celery, Kubernetes)
- Experience building internal benchmarking platforms

What We Value

- Ownership and a bias toward execution
- Systems thinking and failure-mode analysis
- Comfort working with non-deterministic model behavior
- Ability to design measurable, reproducible evaluation pipelines
- Clear technical communication

Why Join Pilotcrew AI

- Work on cutting-edge AI agent evaluation infrastructure
- Solve real-world GenAI reliability challenges
- High technical ownership and autonomy
- Opportunity to shape how AI agents are benchmarked at scale
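For candidates unfamiliar with the pass@k metric named in the responsibilities above: it estimates the probability that at least one of k outputs sampled from a model is correct, given n total samples of which c passed. A minimal sketch of the standard unbiased estimator in Python (function and variable names here are illustrative, not part of Pilotcrew's codebase):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per task
    c: number of those samples that passed
    k: budget of samples considered
    Returns the probability that at least one of k draws
    (without replacement) from the n samples is correct,
    computed as 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-sample
        # must contain at least one correct output.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over a benchmark's tasks yields the reported pass@k score.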