Job Description
About the position
Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build
an end-to-end platform for developing, training, and deploying AI
systems—designed to take ideas from research to production with less friction.
Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI
combines developer-first software with cost-efficient, large-scale compute.
Teams get the tools they need for experimentation, training, and production
inference, with security, observability, and control built in.
We serve solo researchers, startups, and large enterprises. Lightning AI
operates globally with offices in New York City, San Francisco, Seattle, and
London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and
Firstminute.
We are seeking a highly skilled Research Engineer to work on optimizing training
and inference workloads on compute accelerators and clusters, through the
Lightning Thunder compiler and the broader PyTorch Lightning ecosystem. This
role sits at the intersection of deep learning research, compiler development,
and large-scale system optimization. You’ll be shaping technology that pushes
the boundaries of model performance and efficiency, creating foundational
software that will impact the entire machine learning ecosystem. You will join
the Engineering Team and report to our Tech Lead. This is a hybrid role
based in our New York City, San Francisco, or London office, with an in-office
requirement of two days per week. The salary range for this role is
$180,000-$250,000.
- Responsibilities
- Develop performance-oriented model optimizations at multiple levels:
  - Graph-level (e.g., operator fusion, kernel scheduling, memory planning)
  - Kernel-level (CUDA, Triton, custom operators for specialized hardware)
  - System-level (distributed training across GPUs/TPUs, inference serving at scale)
- Advance the Thunder compiler by building optimization passes, graph transformations, and integration hooks to accelerate training and inference workloads.
- Work across the software stack to ensure optimizations are accessible to end users through clean APIs, automated tooling, and seamless integration with PyTorch Lightning.
- Design and implement profiling and debugging tools to analyze model execution, identify bottlenecks, and guide optimization strategies.
- Collaborate with hardware vendors and ecosystem partners to ensure Thunder runs efficiently across diverse backends (NVIDIA, AMD, TPU, specialized accelerators).
- Contribute to open-source projects by developing new features, improving documentation, and supporting community adoption.
- Engage with researchers and engineers in the community, providing guidance on performance tuning and advocating for Thunder as the go-to optimization layer in ML workflows.
- Work cross-functionally with Lightning's product and engineering teams to ensure compiler and optimization improvements align with the broader product vision.
- Requirements
- Strong expertise with deep learning frameworks such as PyTorch.
- Hands-on experience with model optimization techniques, including graph-level optimizations, quantization, pruning, mixed precision, or memory-efficient training.
- Knowledge of distributed systems and parallelism strategies (data/model/pipeline parallelism, checkpointing, elastic scaling).
- Familiarity with software engineering practices: designing APIs, building robust tooling, testing, CI/CD for performance-sensitive systems.
- Excellent collaboration and communication skills, with the ability to partner across research, engineering, and external contributors.
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Nice-to-haves
- Experience with CUDA, Triton, or other GPU programming models for developing custom kernels.
- Deep understanding of deep learning compiler internals (IR design, operator fusion, scheduling, optimization passes) or proven work in performance-critical software.
- Proven track record contributing to open-source projects in ML, HPC, or compiler domains.
- Advanced degree (Master’s or PhD) in machine learning, compilers, or systems highly preferred.
- Benefits
- We offer competitive base salaries and equity with a 25% one year cliff and monthly vesting thereafter. For our international employees, we work with our EOR to pay you in your local currency and provide equitable benefits across the globe.
- Medical, dental and vision
- Life and AD&D insurance
- Flexible paid time off including winter closure
- Paid family leave benefits
- $500 monthly meal reimbursement, including groceries & food delivery services
- $500 one time home office stipend
- $1,000 annual learning & development stipend
- 100% Citibike membership (NYC only)
- $45/month gym membership
- Additional various medical and mental health services
Apply to this job