We are looking for a Data Engineer to join our AI/LLM Delivery Unit, responsible for building scalable data pipelines and infrastructure that power AI and machine learning solutions.

This role plays a critical part in enabling LLM-based applications, data workflows, and AI model lifecycle management. The ideal candidate has strong experience in data engineering, cloud platforms, and pipeline automation, with exposure to AI/ML environments.


Key Responsibilities

1. Data Pipeline Development

  • Design, build, and maintain scalable data pipelines (ETL/ELT) for structured and unstructured data
  • Ensure reliable ingestion, transformation, and delivery of high-quality datasets
  • Optimize pipelines for performance, cost, and scalability


2. AI / LLM Data Infrastructure

  • Support data workflows for AI/ML and LLM systems, including training, fine-tuning, and evaluation datasets
  • Build data pipelines for:

o Text corpora and unstructured datasets

o Embeddings and vector databases

o Retrieval-Augmented Generation (RAG) systems

  • Enable efficient data access for Data Scientists and ML Engineers


3. Data Processing & Automation

  • Automate data extraction, transformation, and validation processes
  • Implement batch and real-time data processing solutions
  • Improve operational efficiency through data automation, in line with process optimization use cases


4. Data Quality & Governance

  • Implement data validation, monitoring, and quality checks
  • Ensure data integrity, consistency, and compliance with security standards
  • Maintain data documentation and lineage tracking


5. Collaboration & Delivery

  • Work closely with Data Scientists, ML Engineers, and Delivery teams
  • Translate business and AI requirements into scalable data architectures
  • Support the end-to-end AI delivery lifecycle, from data ingestion to deployment


Qualifications


Education

  • Bachelor’s degree in Computer Science, Data Engineering, Information Systems, or a related field
  • An advanced degree is a plus


Experience

  • 3–7+ years of experience in data engineering or related roles
  • Experience supporting AI/ML or analytics platforms
  • Exposure to AI/LLM-related data pipelines is a strong advantage


Technical Skills


Core Skills

· Strong programming skills in Python and/or Scala

· Expertise in SQL and database design

· Experience building ETL pipelines with orchestration tools (Airflow, Dagster, or similar)

Data & Platform Skills


· Experience with:

o Data warehouses (Snowflake, BigQuery, Redshift)

o Distributed data processing (Spark)

o APIs and data integration

· Familiarity with streaming tools (Kafka, Kinesis) is a plus


AI/LLM-Related Skills

· Experience working with unstructured data pipelines (text, NLP datasets)

· Familiarity with:

o Vector databases and similarity-search libraries (Pinecone, Weaviate, FAISS)

o Embedding pipelines

o RAG architectures


Cloud & DevOps

· Hands-on experience with AWS, Azure, or GCP

· Knowledge of:

o Docker / containerization

o CI/CD pipelines

o Infrastructure-as-Code (Terraform is a plus)

---

Core Competencies

· Strong data modeling and system design skills

· Attention to detail and data quality

· Problem-solving and analytical thinking

· Effective communication with both technical and non-technical stakeholders

· Ability to work in fast-paced, delivery-oriented environments

---

Nice-to-Have

· Experience in AI/LLM or Generative AI projects

· Familiarity with annotation pipelines or data labeling workflows

· Exposure to MLOps frameworks

· Experience in high-scale or enterprise data environments

---

What Success Looks Like

· Builds robust, scalable data pipelines supporting AI/LLM projects

· Improves efficiency and reliability of data workflows

· Enables faster model development through high-quality datasets

· Supports successful delivery of client-facing AI solutions

