AI Data: Data Algorithm Expert / Architect [33252]

Stealth Startup (Menlo Park, CA, USA), posted 11 hours ago

Responsibilities

Core data infrastructure & pipeline R&D. Lead the build-out of the underlying data processing engine supporting trillion-parameter code models. Design high-performance, reusable automated processing operators, and build the end-to-end production pipeline from raw code crawling, cleaning, and de-identification through to delivery of high-quality training sets, resolving compute and storage bottlenecks at massive data scale.
High-quality code corpus construction. Drive the fine-grained cleaning and mixing strategy for complex data including multilingual code, open-source projects, technical documentation, and commit history. Use heuristic rules and model-based scoring to build a competitive “golden code corpus” that ensures the highest pretraining data quality.
Frontier data synthesis & test-time scaling. Track the technical paradigms of leading AI coding tools (Cursor, Composer-2, Claude Code, etc.); research and ship LLM-based high-quality code data synthesis strategies; explore data optimization methods for test-time scaling and build a data augmentation flywheel that breaks past the scale ceiling of natural code data.
Evaluation-driven data iteration (EDD). Partner with the algorithm team to build scientific benchmarks for code capability, establishing a virtuous loop of data production → model evaluation → bad-case analysis → data iteration. Precisely surface model weaknesses to drive dynamic dataset adjustment and expansion.
Tokenization & data alignment optimization. Given the unique nature of code (AST structure, special symbols), lead or optimize the design of a dedicated tokenization vocabulary; study the effect of different code data types on model scaling laws to improve token and training efficiency.

Qualifications

Education. Bachelor's degree or above in Computer Science, AI, Mathematics, or a related field, with a strong algorithm foundation and very strong hands-on engineering ability.
Big-data engineering. Proficient in Python; skilled with large-scale distributed data processing frameworks such as Spark / Flink / Ray; comfortable in Linux and able to independently develop and tune batch-and-stream processing for TB/PB-scale unlabeled text/code data.
LLM & code understanding. Deep understanding of the Transformer architecture and the training flow of mainstream open-source LLMs (Qwen, Llama, DeepSeek, etc.); some grasp of code syntax and compiler principles; deep understanding of how different training stages (pretrain / SFT / RLHF) demand different data distributions and quality.
Hands-on data experience. Complete project experience processing massive code/text data; mastery of core techniques including MinHash/SimHash deduplication, privacy-compliance filtering, and hard-example mining; able to abstract tedious cleaning logic into systematic standards.
Research + engineering. Publications at top venues (NeurIPS, ICML, ICLR, ACL) on data-centric AI, data synthesis, RL, or efficient training are a plus, alongside a strong ability to turn theory into high-performance production code.

Nice-to-haves

Heavy hands-on use of mainstream AI coding tools (Cursor, Composer-2, Claude Code, Copilot) with deep insight into coding-agent workflows.
Familiarity with large-scale data processing frameworks (e.g., datatrove) and real experience handling trillion-scale datasets.
Core-contributor experience on the data side of a well-known open-source LLM project, or strong placements in competitions such as Kaggle.
Background in low-resource-language code processing, multimodal interleaved data processing, or large-scale knowledge-graph construction.
