About Us
AiLogic Neural Network Pvt Ltd is an AI-driven product company focused on building advanced language technology solutions, including machine translation, document intelligence, and large-scale NLP systems. We are looking for a highly motivated Data Engineer to join our growing AI team and contribute to the development of scalable data processing pipelines for NLP and LLM applications.
Roles & Responsibilities
- Design, develop, and maintain scalable data pipelines for processing large volumes of structured and unstructured data.
- Build document ingestion and processing workflows for PDFs, scanned documents, HTML pages, and other text sources.
- Implement OCR, PDF parsing, HTML parsing, and text extraction pipelines.
- Develop document chunking and preprocessing frameworks for NLP and LLM-based applications.
- Work with Hugging Face models and NLP libraries for text processing tasks.
- Create and optimize data transformation workflows using Python, Apache Spark, and Spark SQL.
- Develop and manage vector database pipelines for storing and retrieving embeddings.
- Implement text normalization, sentence segmentation, deduplication, and data quality processes.
- Design and implement data masking, classification, and categorization solutions.
- Collaborate with AI/ML engineers to prepare datasets for model training and inference.
- Optimize large-scale data processing workflows for performance, scalability, and cost efficiency.
- Maintain CI/CD pipelines and follow software engineering best practices.
- Monitor, troubleshoot, and improve production data processing systems.
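As a rough illustration of the preprocessing work listed above (text normalization, document chunking, and exact deduplication), here is a minimal sketch using only the Python standard library. The function names, the word-window chunking strategy, and the default sizes are assumptions made for this example; they do not describe AiLogic's actual codebase or pipeline.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, collapse whitespace runs, and lowercase (illustrative rules)."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def dedupe_exact(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, comparing hashes of normalized text."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

In production this logic would more likely run inside Spark jobs, and near-duplicate detection would need a similarity technique such as MinHash instead of exact hashing; the sketch only covers the exact case.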
Mandatory Skills
- Strong experience with Python programming.
- Hands-on experience with NLP concepts and tools such as:
  - Tokenization
  - Text Processing
  - Hugging Face Transformers
- Experience in:
  - PDF Parsing
  - OCR
  - HTML Parsing
  - Text Extraction
  - Document Chunking
- Experience with Apache Spark and Spark SQL.
- Working knowledge of vector databases.
- Good understanding of Git and CI/CD practices.
- Experience building data pipelines and ETL workflows.
- Strong debugging and problem-solving skills.
Preferred Skills
- Text Normalization
- Sentence Segmentation
- Exact and Near-Duplicate Deduplication
- Data Masking
- Data Classification & Categorization
- Embedding Generation and Retrieval Pipelines
- Large-scale Document Processing Systems
- RAG (Retrieval-Augmented Generation) Pipelines
Performance Optimization Skills
- Parallel Processing Across CPU Cores
- Pre-batch Generation Techniques
- Chunking Optimization Strategies
- Stream vs. Batch Processing Trade-offs
- Distributing Workloads Across GPUs and CPUs
- CUDA Optimization
- PyTorch Performance Tuning
- Spark Performance Optimization
Qualifications
- Bachelor's or Master's degree in Computer Science, Data Science, Information Technology, or a related field.
- 2–3 years of experience in Data Engineering, NLP Engineering, or AI Data Processing.
- Experience working with large-scale datasets and distributed computing frameworks.