Participation Instructions - Robotics

General rules:

Our benchmark consists of three tasks that involve 1) learning policies from datasets and analysing the learned policies, and 2) blue team / red team approaches to model attacks. For the latter, blue teams aim to provide robust solutions, and red teams aim to attack the blue teams' solutions.

  • Data:

We will prepare a pre-selection of existing public datasets, our own datasets, and a fully working simulator with a robot model. These will be provided via a web platform, together with a textual introduction and tutorials. We aim to make the data as accessible as possible to allow wide participation across the research communities.

  • Simulator:

We will develop and release a sophisticated simulation environment with a robot model and the ability to load the competition datasets automatically. For the simulation part, we will rely on the simulators included in the open-source Robot Operating System (ROS). For the robot models, we will use PAL Robotics' existing robot models, which will allow participants to learn behaviours for a complex real system. For easy installation, we will bundle the simulation environment into a Docker container. This setup allows participants to install the simulator quickly on various operating systems, making it more accessible to the research communities. The Docker image and all required simulation data will be provided via the competition platform.

Ground Truth format


Submissions file format



  • Safety metrics (track 1): Classical safety metrics measure, for instance, the distance to humans, the robot's velocity in a human's proximity, and how well the learned policy behaves when humans are assumed to act randomly. We would like to compute these metrics either as an average over a whole test dataset or over the k worst scenarios in the test dataset. These measures ultimately allow one to determine how often the robot performs the desired behaviour.
  • Task robustness (track 2): Classical robustness metrics measure the task performance and accuracy during training of the robot's policy, typically using different validation datasets. For instance, we can measure how often the robot successfully accomplishes the task, how fast the robot accomplishes its task, or how often the robot needs to replan its actions to achieve the task. We aim to employ a blue team / red team approach to measure task robustness. Blue teams will receive individual datasets to learn policies that are robust against attacks from a single client or a small number of clients. Red teams will take the solutions of the blue teams and try to break them by developing malicious clients that attempt to degrade the performance as much as possible. The degradation in performance can be measured with the classical robustness metrics.
  • Data privacy (track 3): We are interested in ensuring that sensitive data is kept private at all times. Proposed solutions should not allow the inference of specific attributes of the environment in which the robot is operating. We aim to measure progress in data privacy by employing a blue team / red team approach. Blue teams get individual datasets with different home environments. They will learn robot policies in these environments that fulfil the desired task with a predefined accuracy while maximising privacy, i.e., without allowing the inference of sensitive attributes of the environment. Red teams will get the solutions of the blue teams and the training datasets that were used. Their task is to perform an attribute inference attack, i.e., they need to identify sensitive attributes of the training environment each blue team has used during training. We will measure how well the privacy claims of a blue team hold by computing how many times red teams were able to identify the environment it used. Vice versa, we will measure the success of a red team's attacks by computing how many times it correctly identified the environments used by the blue teams.
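
As a rough illustration of how the three kinds of metrics above could be aggregated, the following Python sketch computes a dataset-wide average and a k-worst average for a per-scenario safety score, the drop in task success rate under attack, and the fraction of correct environment guesses in an attribute inference attack. All function names, the example values, and the choice of "minimum distance to a human in metres" as the safety score are assumptions made for illustration; this is not the official evaluation code.

```python
from statistics import mean


def k_worst_mean(scores, k, lower_is_worse=True):
    """Average of the k worst per-scenario scores.

    Assumption: one scalar score per test scenario. With
    lower_is_worse=True (e.g. minimum distance to a human),
    the k smallest values are treated as the worst ones.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of scenarios")
    ordered = sorted(scores, reverse=not lower_is_worse)
    return mean(ordered[:k])


def success_rate(outcomes):
    """Fraction of episodes in which the task was accomplished."""
    if not outcomes:
        raise ValueError("at least one episode is required")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def degradation(clean_outcomes, attacked_outcomes):
    """Drop in success rate caused by an attack (positive = worse)."""
    return success_rate(clean_outcomes) - success_rate(attacked_outcomes)


def inference_success_rate(guesses, truths):
    """Fraction of blue-team environments a red team identified correctly."""
    if len(guesses) != len(truths) or not truths:
        raise ValueError("guesses and truths must be non-empty and aligned")
    hits = sum(1 for g, t in zip(guesses, truths) if g == t)
    return hits / len(truths)


if __name__ == "__main__":
    # Hypothetical per-scenario minimum distances to a human, in metres.
    min_distances = [1.5, 0.5, 1.0, 2.0, 0.25]
    print("mean over dataset:", mean(min_distances))
    print("mean of 2 worst:  ", k_worst_mean(min_distances, k=2))

    # Hypothetical task outcomes without and with malicious clients.
    clean = [True, True, True, False]
    attacked = [True, False, False, False]
    print("degradation:      ", degradation(clean, attacked))

    # Hypothetical red-team guesses versus the true environments.
    guesses = ["env_a", "env_b", "env_c", "env_a"]
    truths = ["env_a", "env_c", "env_c", "env_b"]
    print("inference success:", inference_success_rate(guesses, truths))
```

The k-worst average complements the dataset-wide average because a policy can look safe on average while still behaving badly in a few scenarios. For a score where higher values are worse (for example, velocity near a human), `lower_is_worse=False` would select the k largest values instead.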