Job Title:

Remote Data Engineer - 78243

Company: Turing

Location: Hubballi, Karnataka

Created: 2026-05-17

Job Type: Full Time

Job Description:

ABOUT THE ROLE:

The TDD's multi-modal expertise capture pipeline and the Graph RAG branch of the Hybrid architecture require a dedicated data engineer who builds and operates the document ingestion pipeline, chunking engine, vector database indexing, retrieval/re-ranking pipeline, golden dataset management system, and data pool routing infrastructure. This role executes the data architecture designs produced by the IC6 Data Engineer and operates the data infrastructure throughout the engagement.

Project Context:

You will build and operate the data pipelines that power the Tradecraft Evaluation Platform. You will implement the batch extraction pipeline that ingests 11 years of historical expertise artifacts (transcripts, memos, reports, code), the Graph RAG pipeline that indexes tradecraft in Weaviate for retrieval-augmented evaluation, the golden dataset management system that tracks 300–500 validated rows with full lineage, and the data pool routing infrastructure that enforces physical separation between eval, training, and holdout datasets.

KEY RESPONSIBILITIES:

A. STANDARD RESPONSIBILITIES:

- Build and maintain data ingestion pipelines that process multiple file formats reliably at scale
- Operate and optimize database systems (relational, vector, cache) for performance and reliability
- Implement data quality checks, validation rules, and monitoring for pipeline health
- Write and maintain data transformation logic in Python with comprehensive test coverage

B. PROJECT-SPECIFIC RESPONSIBILITIES:

- Build the batch extraction pipeline that ingests historical expertise artifacts in multiple formats (PDF, audio, video, text) using Apache Tika and PyMuPDF for document parsing, stores them in S3 with metadata in PostgreSQL, and routes them through the chunking engine
- Implement the semantic chunking engine that processes different document types with type-specific strategies (semantic chunking for transcripts, structural chunking for reports, function-level chunking for code) and generates embeddings for vector database indexing
- Build and operate the Weaviate vector database indexing pipeline, including schema configuration, batch embedding ingestion, metadata tagging, and retrieval/re-ranking pipeline with hybrid search (vector + keyword)
- Implement the golden dataset management system in PostgreSQL with versioning (dataset version → row versions with immutable snapshots), pool routing (eval/training/holdout), and full lineage tracking (source_artifact → extraction_run → candidate_row → validation_event → golden_row)
- Build the data pool routing service that enforces physical separation between eval, training, and holdout pools using separate S3 buckets and PostgreSQL schemas, with audit logging of all routing decisions
- Implement the synthetic augmentation data pipeline that takes validated human examples, generates synthetic variants via LLM, and routes them through the human review queue with "synthetic - pending validation" status
- Build the Graph Context Retriever that extracts entity subgraphs from the knowledge graph via read-only API for the evaluation scenario context
- Implement the cost tracking data pipeline using TimescaleDB to record per-evaluation-run, per-model LLM API costs with attribution

REQUIRED SKILLS & EXPERIENCE:

- [STANDARD] 7–10 years of experience in data engineering with production pipeline development
- [STANDARD] Expert-level Python proficiency (3.11+) for data pipeline development, including async patterns
- [PROJECT-SPECIFIC] Hands-on experience building and operating Weaviate (or Pinecone/Qdrant) vector database pipelines, including schema design, batch ingestion, and retrieval optimization
- [PROJECT-SPECIFIC] Experience with document processing pipelines using Apache Tika, PyMuPDF, or equivalent for multi-format ingestion (PDF, DOCX, audio transcription output)
- [STANDARD] Expert-level PostgreSQL experience, including schema design, indexing, triggers, and operational management
- [PROJECT-SPECIFIC] Experience implementing data versioning and lineage tracking systems for ML datasets
- [STANDARD] Experience with S3 (or equivalent object storage) for large-scale document and artifact storage
- [STANDARD] Experience with Redis for caching and Celery for async task orchestration

Experience Requirements:

YEARS OF EXPERIENCE: 7–10 years in data engineering

SENIORITY LEVEL: Senior

TYPICAL BACKGROUND: Senior data engineer at an AI/ML platform company; data pipeline engineer at a search/retrieval company; backend engineer who transitioned into data engineering for NLP/LLM systems; data engineer at a risk/compliance technology company

COMPLEXITY INDICATORS: Has built pipelines processing 10K+ documents in multiple formats; has operated vector databases with 1M+ embeddings; has implemented data versioning systems for ML datasets; has built data separation infrastructure for compliance

LEADERSHIP / OWNERSHIP EXPECTATIONS: Owns all data pipeline implementation and operations; makes independent decisions on pipeline design within the architecture defined by IC6 Data Engineer; operates data infrastructure without dedicated DBA support

SUCCESS INDICATORS:

- Has built and deployed a production RAG data pipeline with vector database indexing and retrieval achieving >70% precision
- Has implemented a golden dataset or ML evaluation dataset management system with versioning and lineage
- Has built multi-format document ingestion pipelines processing 10K+ documents reliably
- Has implemented physical data separation for a compliance-sensitive system

Project-Specific Skills and Domain Knowledge

Must-Have:

- Experience implementing semantic chunking strategies for different document types (transcripts, reports, code) with measurable retrieval quality impact
- Experience building data pipelines that integrate with LLM APIs for extraction and augmentation tasks
- Experience implementing physical data separation (separate storage, separate schemas) for compliance-sensitive ML datasets
- Experience with TimescaleDB or equivalent time-series databases for metrics and cost tracking

PREFERRED QUALIFICATIONS

- Experience with knowledge graph data models and entity resolution pipelines
- Experience operating data infrastructure in FedRAMP-compatible environments
- AWS Data Analytics Specialty or equivalent certification
- Experience with OpenAI Whisper for audio transcription pipeline integration
- Experience with embedding model selection and evaluation for RAG systems
- Contributions to open-source data engineering tools

Project-Specific Skills and Domain Knowledge

Strongly Preferred:

- Experience with graph database APIs for subgraph extraction (Neo4j, Neptune, or similar)
- Experience with FastAPI for building data service APIs
- Familiarity with NLP preprocessing pipelines (tokenization, NER, text normalization)
- Experience with PII detection and anonymization in data pipelines

★ Trade-Craft Experience: A Significant Plus

Candidates with backgrounds in intelligence analysis, signals intelligence, law enforcement data fusion, or related trade-craft disciplines are strongly encouraged to apply. Understanding of link analysis, entity disambiguation under adversarial conditions, handling classified or compartmentalised data, and mission-driven product constraints will set you apart.
