Research Engineer - Vision Language Action Models for Intelligent Cyber Physical Systems (f/m/div.)

Robert Bosch GmbH

Renningen, Baden-Württemberg, Germany
Published Mar 4, 2026
Full-time

Job Summary

As a Research Engineer at Bosch, you will lead the development of cutting-edge Vision-Language-Action (VLA) architectures, enabling AI agents to interpret human instructions and act autonomously within complex physical environments. Your day-to-day work involves connecting multimodal representation learning with long-term control and planning to move beyond reactive AI toward cognitive intelligence. You will be responsible for building scalable infrastructure for training and deployment, including simulation tools and evaluation methods. This role is unique as it bridges the gap between fundamental research and practical industrial application, allowing you to implement advanced AI methods in robotics, automated driving, and smart building systems. You will collaborate with interdisciplinary teams to shape Bosch's long-term strategy for intelligent automation, ensuring that research prototypes are transformed into robust, explainable, and semantically grounded solutions for real-world cyber-physical systems.

Required Skills

Education

Master's degree with excellent grades in Computer Science, Machine Learning, Robotics, or a related technical field; PhD in Multimodal AI, Robotics, Reinforcement Learning, or Generative AI preferred.

Experience

  • Multiple years of experience in developing and deploying machine learning solutions in distributed software development teams
  • Demonstrated industrial software development experience through code contributions in large-scale machine learning projects or benchmarks
  • Proven track record of academic excellence with publications in leading AI and robotics conferences (e.g., NeurIPS, CVPR, ICRA)
  • Experience in designing and training multimodal architectures such as Flamingo, GPT-4V, or RT-2
  • Hands-on experience with visual grounding, cross-modal attention, and instruction-following architectures
  • Experience with cloud infrastructure and multi-GPU training pipelines

Languages

  • German (Basic)
  • English (Fluent)

Additional

  • Submission of GitHub or Kaggle profile links is requested.
  • The role involves driving research into practical innovation across interdisciplinary teams.