Job Description
Summary
As a member of the AI model team, you will drive innovation in architecture development for cutting-edge models of various scales, including small, large, and multi-modal systems. Your work will enhance intelligence, improve efficiency, and introduce new capabilities to advance the field.
You will have deep expertise in LLM architectures and a strong grasp of pre-training optimization, with a hands-on, research-driven approach. Your mission is to explore and implement novel techniques and algorithms that lead to groundbreaking advancements: curating data, strengthening baselines, and identifying and resolving existing pre-training bottlenecks to push the limits of AI performance.
Responsibilities:
- Conduct pre-training of AI models on large, distributed servers equipped with thousands of NVIDIA GPUs.
- Design, prototype, and scale innovative architectures to enhance model intelligence.
- Independently and collaboratively execute experiments, analyze results, and refine methodologies for optimal performance.
- Investigate, debug, and improve both model efficiency and computational performance.
- Contribute to the advancement of training systems to ensure seamless scalability and efficiency on target platforms.
Job Requirements
- A degree in Computer Science or a related field, ideally a PhD in NLP, Machine Learning, or a related field, complemented by a solid track record in AI R&D (with strong publications at A* conferences).
- Hands-on experience contributing to large-scale LLM training runs on large, distributed servers equipped with thousands of NVIDIA GPUs, with a focus on scalability and impactful advancements in model performance.
- Familiarity with, and practical experience using, large-scale distributed training frameworks, libraries, and tools.
- Deep knowledge of state-of-the-art modifications to transformer and non-transformer architectures aimed at enhancing intelligence, efficiency, and scalability.
- Strong expertise in PyTorch and Hugging Face libraries, with practical experience in model development, continual pre-training, and deployment.
Skills
- Development
- Machine Learning
- Software Engineering