Data Engineer

AiLogic Neural Network Pvt Ltd (Hyderabad, TS, India), posted 2 days ago

About Us

AiLogic Neural Network Pvt Ltd is an AI-driven product company focused on building advanced language technology solutions, including machine translation, document intelligence, and large-scale NLP systems. We are looking for a highly motivated Data Engineer to join our growing AI team and contribute to the development of scalable data processing pipelines for NLP and LLM applications.

Roles & Responsibilities

Design, develop, and maintain scalable data pipelines for processing large volumes of structured and unstructured data.
Build document ingestion and processing workflows for PDFs, scanned documents, HTML pages, and other text sources.
Implement OCR, PDF parsing, HTML parsing, and text extraction pipelines.
Develop document chunking and preprocessing frameworks for NLP and LLM-based applications.
Work with Hugging Face models and NLP libraries for text processing tasks.
Create and optimize data transformation workflows using Python, Apache Spark, and Spark SQL.
Develop and manage Vector Database pipelines for embedding storage and retrieval.
Implement text normalization, sentence segmentation, deduplication, and data quality processes.
Design and implement data masking, classification, and categorization solutions.
Collaborate with AI/ML engineers to prepare datasets for model training and inference.
Optimize large-scale data processing workflows for performance, scalability, and cost efficiency.
Maintain CI/CD pipelines and follow software engineering best practices.
Monitor, troubleshoot, and improve production data processing systems.

Mandatory Skills

Strong experience with Python programming.
Hands-on experience with NLP concepts and tools such as tokenization, text processing, and Hugging Face Transformers.
Experience in PDF parsing, OCR, HTML parsing, text extraction, and document chunking.
Experience with Apache Spark and Spark SQL.
Working knowledge of Vector Databases.
Good understanding of Git and CI/CD practices.
Experience building data pipelines and ETL workflows.
Strong debugging and problem-solving skills.

Preferred Skills

Text Normalization
Sentence Segmentation
Exact Deduplication and Near Deduplication
Data Masking
Data Classification & Categorization
Embedding Generation and Retrieval Pipelines
Large-scale Document Processing Systems
RAG (Retrieval-Augmented Generation) Pipelines

Performance Optimization Skills

CPU Distribution and Parallel Processing
Pre-batch Generation Techniques
Chunking Optimization Strategies
Stream Processing vs Batch Processing
GPU and CPU Parallel Distribution
CUDA Optimization
PyTorch Performance Tuning
Spark Performance Optimization

Qualifications

Bachelor's or Master's degree in Computer Science, Data Science, Information Technology, or a related field.
2–3 years of experience in Data Engineering, NLP Engineering, or AI Data Processing.
Experience working with large-scale datasets and distributed computing frameworks.
