Job Description
Scale AI is on a mission to accelerate the development of AI applications. As a Machine Learning Systems Research Engineer, you will build algorithms for next-gen Agent RL training platforms and collaborate with ML teams to enhance research and development.
Responsibilities
- Build, profile and optimize our training and inference framework
- Post-train state-of-the-art models, both internally developed and community-released, to define stable post-training recipes for our enterprise engagements
- Collaborate with ML teams to accelerate their research and development, and enable them to develop the next generation of models and data curation
- Create a next-gen agent training algorithm for multi-agent/multi-tool rollouts
Skills
- 1-3 years of experience training LLMs in a production environment
- Passionate about system optimization
- Experience with post-training methods such as RLHF and RLVR, and related algorithms such as PPO and GRPO
- Demonstrated understanding of modern GPU cluster architecture and how to operate it
- Experience with multi-node LLM training and inference
- Strong software engineering skills, including proficiency with frameworks and tools such as CUDA, PyTorch, Transformers, and FlashAttention
- Strong written and verbal communication skills to operate in a cross-functional team environment
- PhD or Master's degree in Computer Science or a related field
Benefits
- Comprehensive health, dental and vision coverage
- Retirement benefits
- A learning and development stipend
- Generous PTO
- Commuter stipend