
Job Title:

Machine Learning Engineer

Company: Pilotcrew AI

Location: New Delhi, Delhi

Created: 2026-03-07

Job Type: Full Time

Job Description:

Machine Learning Engineer – Generative AI & Agent Evaluation

Location: Remote
Company: Pilotcrew AI
Type: Full-Time
Experience: 2–6 Years

About Pilotcrew AI

Pilotcrew AI builds infrastructure for AI Agent Evaluation. We benchmark large language models, run automated agent evaluations, power human-in-the-loop assessments, and host AI arenas for competitive testing.

Our mission is to make AI agents measurable, reliable, and production-ready through structured, scalable evaluation systems.

Role Overview

We are hiring a Machine Learning Engineer with strong Generative AI expertise to design and build scalable evaluation infrastructure for LLMs and AI agents.

You will architect distributed inference pipelines, structured trace logging systems, tool-call validation frameworks, and automated grading engines. The role involves benchmarking proprietary and open-weight LLMs, implementing pass@k and robustness metrics, building adversarial stress-testing pipelines, and analyzing agent failure modes under real-world conditions.

This is a systems-heavy, production-focused GenAI role requiring strong ML fundamentals and engineering rigor.

Key Responsibilities

- Design and implement distributed LLM inference pipelines
- Build automated benchmarking systems for reasoning, planning, and tool use
- Implement pass@k, reliability metrics, variance analysis, and statistical confidence evaluation
- Develop adversarial testing frameworks for stress-testing agents
- Create structured evaluation pipelines (rule-based and model-based graders)
- Build trace capture, logging, and telemetry systems for multi-step agent workflows
- Validate tool calls and sandboxed execution environments
- Optimize inference for latency, cost, and throughput
- Manage dataset versioning and reproducible benchmark pipelines
- Deploy and monitor GenAI systems in production (AWS/GCP/Azure)

Required Skills

- Strong Python programming and system design skills
- Hands-on experience with Generative AI systems and LLM APIs
- Experience with PyTorch or TensorFlow
- Experience building production ML or GenAI systems
- Strong understanding of decoding strategies, temperature effects, and sampling variance
- Familiarity with async processing, distributed task execution, or job scheduling
- Experience with Docker and cloud deployment
- Strong debugging, observability, and reliability engineering mindset

Preferred Skills

- Experience with AI agent architectures (ReAct, tool-calling, planner-executor loops)
- Experience with reward modeling or evaluation science
- Knowledge of RLHF or alignment pipelines
- Familiarity with vector databases (FAISS, Pinecone, Weaviate)
- Experience with distributed systems (Ray, Celery, Kubernetes)
- Experience building internal benchmarking platforms

What We Value

- Ownership and bias toward execution
- Systems thinking and failure-mode analysis
- Comfort working with non-deterministic model behavior
- Ability to design measurable, reproducible evaluation pipelines
- Clear technical communication

Why Join Pilotcrew AI

- Work on cutting-edge AI agent evaluation infrastructure
- Solve real-world GenAI reliability challenges
- High technical ownership and autonomy
- Opportunity to shape how AI agents are benchmarked at scale
