Participation Instructions - Cybersecurity

General rules

  1. The binary classification task consists of distinguishing malware samples from benign applications, relying only on ML-based approaches. The use of whitelisting, blacklisting, or signatures is not allowed. The submitted models may only rely on statically extracted features, i.e., applications must not be executed during the feature extraction process.
  2. Participants must train their models only on the provided training dataset and evaluate them on the provided test datasets using the provided evaluation code.
  3. Everything must be fully reproducible. Participants must provide all the code required to train and deploy their models, including the feature extraction process (except for Track 1, where the features will be provided) and, if necessary, the pre-set random seeds that guarantee reproducible results. All submitted models and results are subject to re-evaluation. We ask participants to publicly release their pre-trained models and source code (e.g., in a GitHub repository).
  4. To participate in a track by submitting a new model, participants must train the model and follow the track instructions on how to compute the predicted labels and scores on the released test datasets. The models must be evaluated on all the provided test sets.
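As a minimal sketch of the seed-fixing mentioned in rule 3, the snippet below pins the standard-library random sources; the seed value and helper name are illustrative, and any additional libraries your pipeline uses (e.g., NumPy, PyTorch, scikit-learn) should be seeded as well.

```python
import os
import random

SEED = 42  # illustrative pre-set seed; document the value you actually use


def set_seeds(seed: int = SEED) -> None:
    """Fix the stdlib randomness sources for reproducible runs."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # If your pipeline uses other libraries, seed them here too, e.g.:
    # numpy.random.seed(seed); torch.manual_seed(seed)


set_seeds()
```

Calling `set_seeds()` at the start of both training and deployment scripts makes reruns with the same seed produce the same random draws.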

Track 1: Adversarial Robustness to Feature-space Attacks (starting soon)

  1. The submitted models must only rely on the provided feature set or a custom subset thereof (in this case, the user must specify the selected features).
  2. The submitted models must accept feature vectors as input and provide the classification score of the positive class and the predicted class labels as output.
  3. The testing must be performed with the provided code, which will execute a feature-space attack and then provide the predicted labels and scores for the generated adversarial samples.

Track 2: Adversarial Robustness to Problem-space Attacks (starting soon)

  1. The submitted models must accept APK files as input and provide the classification scores of the positive class and the predicted class labels as output.
  2. The testing must be performed with the provided code, which will execute a problem-space attack and then provide the predicted labels and scores for the generated adversarial samples.

Track 3: Temporal Robustness to Data Drift

  1. The submitted models must accept APK files as input and provide the classification scores of the positive class and the predicted class labels as output.
  2. To perform the testing, the participants must classify the test applications with their model and provide the predicted labels and the classification scores of the positive class.

In this repository, you can find the detailed instructions required to build your detector, already-implemented baseline methods, and code examples showing how to create the submission file.

Submission file format

For all the evaluation tracks, the submission must be uploaded as a JSON file containing a list with one dictionary per evaluation round (the first dictionary corresponds to the first round, and so on). The keys of each dictionary are the SHA256 hashes of the test set samples for the respective round. Each SHA256 hash must map to an array containing the predicted class label (either 0 or 1) and the positive-class score.

[
  {
    "sha256": [label, score],
    …
  },
  …
]
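A minimal sketch of producing such a file with the standard library, assuming your model's per-round predictions are already collected as dictionaries; the hashes, labels, scores, and output filename below are illustrative placeholders.

```python
import json

# One dictionary per evaluation round, in round order. Each key is a test
# sample's SHA256 hash; each value is [predicted label, positive-class score].
rounds = [
    {  # round 1 (example hashes and predictions)
        "aa" * 32: [1, 0.97],
        "bb" * 32: [0, 0.12],
    },
    {  # round 2
        "cc" * 32: [1, 0.85],
    },
]

with open("submission.json", "w") as f:
    json.dump(rounds, f, indent=2)
```

Note that the hashes must be JSON string keys and the list order must match the round order, since the first dictionary is matched to the first round.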

Metrics

  • TPR - True Positive Rate (Tracks 1 and 2): this metric is computed as the percentage of correctly detected malware and will be used for Tracks 1 and 2, where the test set contains only malware samples.
  • F1 Score (Track 3): this metric is computed as the harmonic mean of Precision and Recall, and it is particularly suited for summarizing, in a single value, binary classification performance on unbalanced datasets.
  • Area Under Time - AUT: this metric was introduced in [1] to evaluate the performance of malware detectors over time. Like AUC-based metrics, it is based on the trapezoidal rule. Its value lies in the [0, 1] interval, where an ideal detector robust to temporal performance decay has AUT = 1. We compute the metric from point estimates of the TPR (Tracks 1 and 2) or the F1 Score (Track 3) over the time period of the test samples.
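As a sketch of the AUT computation described above, the function below applies the trapezoidal rule to a sequence of per-period metric values (TPR or F1) and normalizes by the number of intervals; the function name and the example values are illustrative, not the official evaluation code.

```python
def aut(values):
    """Area Under Time for per-period metric values in [0, 1].

    Averages the trapezoids between consecutive time periods, so a
    detector with constant perfect performance scores exactly 1.0.
    """
    n = len(values)
    if n < 2:
        raise ValueError("AUT needs at least two time periods")
    return sum((values[k] + values[k + 1]) / 2 for k in range(n - 1)) / (n - 1)


print(aut([1.0, 1.0, 1.0]))  # ideal, temporally robust detector -> 1.0
print(aut([0.9, 0.7, 0.5]))  # detector with performance decay -> 0.7
```

Because each trapezoid averages two consecutive periods, a detector whose performance decays over time is penalized more than one that stays flat at the same mean value's endpoints.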

References

[1] Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., & Cavallaro, L. (2018). TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time. USENIX Security Symposium.