MLOps - Model Deployment (Machine Learning Operations)

High-Level Objective: Take existing open-source machine learning (ML) models for chemical synthesis (predicting products from reactants) and retrosynthesis (determining the inverse route) and deploy them on Google Cloud Platform (GCP) so that they are accessible to dependent services.

Specific Tasks: The scope requires the deployment of several models, including, but not limited to, MolecularTransformer, ChemFormer, ReactionT5, and RetroChimera; each model may present distinct challenges related to memory, architecture, and dependencies. For each model, the deployment process involves:
(1) Creating a Dockerfile and building the corresponding Docker image.
(2) Writing the necessary inference logic.
(3) Deploying and testing the resulting endpoint on GCP.
(4) Optimizing the deployment for latency.

Expected Outputs: Successful completion of the project will result in multiple ML models accessible to other services on GCP, deployed, for example, on a Vertex AI endpoint.

Required Skillset: The executing vendor must possess expertise in the following technologies:
o GCP: Vertex AI, Artifact Registry
o Tools/Frameworks: Docker, PyTorch, Hugging Face

NetOps - GPU Cluster Upgrades (Network Operations)

High-Level Objective: The vendor will provide software implementation, scheduling, and observability setup for a cluster of approximately 100 NVIDIA A100 GPUs located in Mountain View, California. The vendor shall deploy the industry-standard "MAAS + Ansible" stack to automate cluster lifecycle management; custom tooling and proprietary scripts are explicitly out of scope.

Specific Tasks:
o Bare Metal Provisioning (MAAS): Install and configure Canonical MAAS (Metal-as-a-Service) on the designated head node.
o Configuration Management (Ansible): Develop and deploy modular Ansible playbooks to configure the software layer on top of the OS.
Automate the installation and version-locking of the NVIDIA drivers (headless), the CUDA Toolkit (v12.x), InfiniBand/RDMA networking, and the Google-mandated security agents (fleetspeakd/GRR, CrowdStrike).
o Workload Management (Scheduling): Install and configure the Slurm Workload Manager using Ansible. Configure fair-share scheduling and preemption rules so that multiple research teams can share cluster resources equitably. Integrate user management with either local accounts or LDAP, as specified by the Google point of contact.
o Observability Stack: Deploy the Prometheus database and the Grafana visualization tool. Install the NVIDIA DCGM Exporter agent on all compute nodes using Ansible to capture deep GPU telemetry. Import the standard NVIDIA dashboards into Grafana.

Expected Outputs:
(1) A functional MAAS web UI, accessible to Google admins, that allows automated discovery, power cycling, disk wiping, and OS provisioning of all compute nodes via IPMI/BMC.
(2) A master playbook that can take a fresh Ubuntu install and fully configure it to standard without manual intervention.
(3) A functional slurmctld (controller) and slurmd (compute node) installation.
(4) The standard NVIDIA dashboards imported into Grafana, displaying real-time GPU temperature, power draw, and per-user usage metrics.

Required Skillset: The executing vendor requires proven expertise in the following areas:
o Bare Metal and OS Management: Canonical MAAS (Metal-as-a-Service) installation and configuration, IPMI/BMC, and Ubuntu 22.04 LTS operating system provisioning.
o Configuration Management: Advanced proficiency in developing and deploying Ansible playbooks.
o NVIDIA Software Stack: Installation and version-locking of the NVIDIA drivers (headless), the CUDA Toolkit (v12.x), and the NVIDIA DCGM Exporter. Ability to interpret NVIDIA XID errors and PCIe "bus falling off" issues.
o Networking: Experience configuring InfiniBand/RDMA networking.
o Workload Scheduling: Expertise in installing and configuring the Slurm Workload Manager, including fair-share scheduling, preemption rules, and user management integration (local or LDAP).
o Monitoring and Visualization: Deployment and configuration of Prometheus and Grafana.
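For the MLOps containerization step above, a per-model Dockerfile could look like the following sketch. The base image tag, port, and entrypoint script name are illustrative assumptions, not project requirements.

```dockerfile
# Illustrative Dockerfile for serving a PyTorch chemistry model on Vertex AI.
# The base image tag and server.py entrypoint are placeholders.
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Vertex AI routes prediction traffic to the port declared in the
# model's containerSpec; 8080 is a common choice.
EXPOSE 8080
CMD ["python", "server.py"]
```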
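The driver version-locking task in the Configuration Management section could be expressed as an Ansible fragment like the one below; the package name and version pin are examples only, not the versions mandated for this cluster.

```yaml
# Illustrative Ansible tasks: install a specific NVIDIA headless driver
# and hold it so unattended upgrades cannot move the version.
- name: Install a pinned NVIDIA headless driver (example version)
  ansible.builtin.apt:
    name: nvidia-headless-535
    state: present

- name: Hold the driver package at the installed version
  ansible.builtin.dpkg_selections:
    name: nvidia-headless-535
    selection: hold
```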
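The fair-share and preemption requirements in the Workload Management section map onto a handful of slurm.conf parameters; the values below are a sketch, not site-specific settings.

```
# Illustrative slurm.conf excerpt for fair-share scheduling and preemption.
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
# Fair-share weighting requires usage accounting via slurmdbd.
AccountingStorageType=accounting_storage/slurmdbd
```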
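For the Observability Stack, Prometheus would scrape the DCGM Exporter on each compute node; the exporter serves metrics on port 9400 by default. The hostnames below are placeholders.

```yaml
# Illustrative Prometheus scrape job for the NVIDIA DCGM Exporter.
scrape_configs:
  - job_name: "dcgm"
    static_configs:
      - targets: ["node001:9400", "node002:9400"]  # placeholder hosts
```

In practice the target list would be generated by Ansible from the cluster inventory rather than maintained by hand.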
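The inference-logic step in the MLOps section can be sketched as a handler that follows the Vertex AI custom-container prediction contract: requests arrive as {"instances": [...]} and responses must return {"predictions": [...]}. The model function below is a hypothetical stand-in, not one of the listed models, and a real container would expose this handler behind a web server.

```python
# Minimal sketch of Vertex AI custom-container inference logic.
# Vertex AI sends POST /predict bodies shaped {"instances": [...]}
# and expects {"predictions": [...]} back.

def handle_predict(body: dict, model_fn) -> dict:
    """Apply model_fn to each instance, following the Vertex AI contract."""
    instances = body.get("instances", [])
    return {"predictions": [model_fn(x) for x in instances]}

# Hypothetical stand-in for a forward-synthesis model; a real model
# (e.g. MolecularTransformer) would run tokenization and beam search here.
def fake_forward_synthesis(smiles: str) -> str:
    return smiles + ">>PRODUCT"

if __name__ == "__main__":
    out = handle_predict({"instances": ["CCO.CC(=O)O"]}, fake_forward_synthesis)
    print(out)  # {'predictions': ['CCO.CC(=O)O>>PRODUCT']}
```

Keeping the request parsing separate from the model call makes the handler easy to unit-test before the endpoint is deployed.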