Job Title: Senior DevOps Engineer – Generative AI & Cloud Solutions
Company: BrezQ
Location: Hyderabad, India (Hybrid / On-site Options)
Employment Type: Full-Time
About BrezQ
At BrezQ, we don’t just deliver IT consulting and software services; we develop future-driven tech solutions that help businesses lead the digital era. From robust Cloud-Native Development and Salesforce CRM solutions to highly secure, scalable infrastructure, we specialize in modernizing technology ecosystems. As we scale our AI & ML Integration practices, we are looking for a forward-thinking DevOps Engineer to bridge the gap between traditional cloud infrastructure and the evolving world of Generative AI.
Role Summary
We are seeking a Senior DevOps Engineer with specialized experience in Generative AI / LLMOps pipelines. In this role, you will design, automate, and secure the infrastructure that powers our custom software solutions and AI integrations. You will be responsible for setting up scalable cloud environments, building seamless CI/CD pipelines, and establishing cutting-edge deployment frameworks for Large Language Models (LLMs) and foundational AI applications.
Key Responsibilities
1. GenAI & LLMOps Infrastructure
- Design and manage cloud infrastructure tailored for hosting, fine-tuning, and serving Generative AI models.
- Implement and manage infrastructure for AI orchestration frameworks (e.g., LangChain, LlamaIndex, Semantic Kernel).
- Optimize vector databases (e.g., Pinecone, Milvus, Qdrant, or pgvector in PostgreSQL) for highly efficient Retrieval-Augmented Generation (RAG) pipelines.
- Implement cost-optimization strategies for heavy compute resources (GPUs/TPUs).
2. Core DevOps & Cloud Engineering
- Architect, scale, and maintain secure, cloud-native infrastructure utilizing AWS, Azure, or GCP.
- Orchestrate and manage containerized workloads using Docker and Kubernetes.
- Implement robust Infrastructure as Code (IaC) using Terraform, OpenTofu, or CloudFormation.
- Build, maintain, and optimize secure CI/CD pipelines (GitHub Actions, GitLab CI, or Jenkins) for automated software and model deployment.
3. Monitoring, Observability & Security
- Set up advanced observability tools (Prometheus, Grafana, Datadog) to track both infrastructure health and AI model performance metrics (latency, token usage, drift).
- Collaborate with engineering teams to deliver highly secure IT solutions, embedding data privacy controls, vulnerability scanning, and guardrails for LLM prompts and outputs.
Required Technical Skills & Qualifications
- Experience: 5+ years of total experience in DevOps/Cloud Engineering, with at least 1–2 years actively supporting AI/ML or GenAI workloads in production.
- AI/LLM Ecosystem: Practical exposure to deploying LLMs (OpenAI APIs, Hugging Face, Anthropic, or open-source models like Llama via Ollama/vLLM).
- Containerization & Orchestration: Deep expertise in Kubernetes (EKS, AKS, or GKE) and Docker.
- Infrastructure as Code: Proven mastery of Terraform.
- Databases: Familiarity with traditional databases (PostgreSQL, MongoDB) as well as vector stores.
- Programming: Strong scripting and automation skills in Python (highly preferred for AI workflows), Go (Golang), or Bash.