Create Alert
Email me similar jobs

Platform Engineer (Python)

Full-time
Job Title: Senior LLMOps / AI Platform Engineer
Location: Atlanta, GA
Visa: USC

Key Requirement: Azure OpenAI, Kubernetes, ArgoCD, Jenkins, Azure AI Search, Langfuse, LLMOps, RAG, CI/CD, Observability, Python Must be willing to work onsite in Atlanta, GA

Job Description: This role owns the operational foundation for George and other generative AI applications. The person should make the AI platform reliable, observable, cost-aware, secure, and release-ready across development, QA, and production environments.


Role Area Expected Ownership

Azure AI / Azure OpenAI Model deployments, quotas, TPM/RPM planning, rate-limit troubleshooting, deployment configuration, model upgrade support, fallback planning, and cost/performance monitoring.

Azure AI Search / RAG Infrastructure Index, indexer, skillset, embedding, semantic search, hybrid search, retrieval quality support, and performance troubleshooting for George knowledge sources.

CI/CD and Release Engineering Jenkins pipelines, build promotion, environment readiness, automated gates, release checklists, rollback planning, and deployment validation.

Kubernetes / Argo CD Argo CD sync health, manifest drift, Kubernetes pod health, environment promotion, scaling, config maps, secrets, and rollback operations.

Langfuse / Observability Trace ingestion, dashboards, prompt/version visibility, datasets support, experiment troubleshooting, latency, token usage, cost, and failure analysis.

Production Support Incident triage, root-cause analysis, operational runbooks, cross-team issue resolution, and executive-ready status reporting.

George-Specific Responsibilities

  • Azure resource configuration
  • Model operations
  • RAG platform support
  • Langfuse ownership
  • Jenkins pipeline ownership
  • Argo CD / Kubernetes ownership
  • Operational observability
  • Incident management

Responsibilities for Broader Agentic Applications Capability

  • Agent runtime operations
  • Tool-call observability
  • Prompt/model release operations
  • Evaluation gates in CI/CD
  • Security and governance

Candidate Requirements Must-Have

  • Azure cloud experience
  • Kubernetes experience
  • Jenkins or comparable CI/CD experience
  • Argo CD / GitOps experience
  • Observability and incident response
  • LLM application understanding
  • Python/scripting ability
  • Cross-functional communication

Strongly Preferred

  • Langfuse
  • Azure AI Foundry / Azure OpenAI
  • Azure AI Search
  • Dynatrace / App Insights / OpenTelemetry
  • LangGraph / MCP / agent frameworks
  • IaC: Bicep, Terraform, Helm, Kustomize

Expected Deliverables

  • LLMOps runbook
  • Environment readiness checklist
  • Release readiness gate
  • Observability dashboard set
  • Incident RCA template
  • Model/prompt deployment process

30 / 60 / 90 Day Expectations

  • First 30 days: Understand George architecture, Azure resources, model deployments, Jenkins, Argo CD, Langfuse, environments, release process, and current operational gaps. Produce an initial LLMOps assessment.
  • First 60 days: Improve dashboards, runbooks, release checklists, Langfuse health checks, Azure quota monitoring, and Argo CD/Jenkins troubleshooting documentation.
  • First 90 days: Implement or formalize AI release gates, improve observability coverage, reduce production troubleshooting time, and establish repeatable deployment/rollback process for model, prompt, index, and backend changes.

Cleo Consulting is an equal opportunity employer (Minorities/Women/Veterans/Disabled)

For applications and inquiries, contact: [email protected]

Similar jobs

More from Openkyber
Openkyber 4 hours ago
Openkyber 4 hours ago
Openkyber 4 hours ago

Platform Engineer (Python)

Apply Now
Back to search page