About Us


AiLogic Neural Network Pvt Ltd is an AI-driven product company focused on building advanced language technology solutions, including machine translation, document intelligence, and large-scale NLP systems. We are looking for a highly motivated Data Engineer to join our growing AI team and contribute to the development of scalable data processing pipelines for NLP and LLM applications.


Roles & Responsibilities


  • Design, develop, and maintain scalable data pipelines for processing large volumes of structured and unstructured data.
  • Build document ingestion and processing workflows for PDFs, scanned documents, HTML pages, and other text sources.
  • Implement OCR, PDF parsing, HTML parsing, and text extraction pipelines.
  • Develop document chunking and preprocessing frameworks for NLP and LLM-based applications.
  • Work with Hugging Face models and NLP libraries for text processing tasks.
  • Create and optimize data transformation workflows using Python, Apache Spark, and Spark SQL.
  • Develop and manage vector database pipelines for embedding storage and retrieval.
  • Implement text normalization, sentence segmentation, deduplication, and data quality processes.
  • Design and implement data masking, classification, and categorization solutions.
  • Collaborate with AI/ML engineers to prepare datasets for model training and inference.
  • Optimize large-scale data processing workflows for performance, scalability, and cost efficiency.
  • Maintain CI/CD pipelines and follow software engineering best practices.
  • Monitor, troubleshoot, and improve production data processing systems.


Mandatory Skills


  • Strong experience with Python programming.
  • Hands-on experience in NLP concepts such as:
      • Tokenization
      • Text Processing
      • Hugging Face Transformers
  • Experience in:
      • PDF Parsing
      • OCR
      • HTML Parsing
      • Text Extraction
      • Document Chunking
  • Experience with Apache Spark and Spark SQL.
  • Working knowledge of vector databases.
  • Good understanding of Git and CI/CD practices.
  • Experience building data pipelines and ETL workflows.
  • Strong debugging and problem-solving skills.


Preferred Skills


  • Text Normalization
  • Sentence Segmentation
  • Exact Deduplication and Near Deduplication
  • Data Masking
  • Data Classification & Categorization
  • Embedding Generation and Retrieval Pipelines
  • Large-scale Document Processing Systems
  • RAG (Retrieval-Augmented Generation) Pipelines


Performance Optimization Skills


  • CPU Distribution and Parallel Processing
  • Pre-batch Generation Techniques
  • Chunking Optimization Strategies
  • Stream Processing vs. Batch Processing Trade-offs
  • GPU and CPU Parallel Distribution
  • CUDA Optimization
  • PyTorch Performance Tuning
  • Spark Performance Optimization


Qualifications


  • Bachelor's or Master's degree in Computer Science, Data Science, Information Technology, or a related field.
  • 2–3 years of experience in Data Engineering, NLP Engineering, or AI Data Processing.
  • Experience working with large-scale datasets and distributed computing frameworks.
