Participation Instructions - Cybersecurity

General rules

  1. The binary classification task consists of distinguishing malware samples from benign applications using only ML-based approaches. The use of whitelisting, blacklisting, or signatures is not allowed. The submitted models can only rely on statically extracted features, i.e., applications must not be executed during the feature extraction process.
  2. Participants must train their models only on the provided training dataset and evaluate them on the provided test datasets using the provided evaluation code.
  3. Everything must be fully reproducible. Participants must provide all the code required to train and deploy their models, including the feature extraction process (except for Track 1, where the features are provided) and, if necessary, pre-set random seeds to guarantee reproducibility (see the seeding sketch after this list). All submitted models and results are subject to re-evaluation. We ask that pre-trained models and source code be publicly released (e.g., in a GitHub repository).
  4. To participate in a track with a new model, users must train the model and follow the track instructions to compute the predicted labels and scores on the released test datasets. The models must be evaluated on all the provided test sets.
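
Rule 3 mentions pre-set random seeds. The snippet below is a minimal sketch of what such seeding could look like in Python, assuming NumPy-based tooling; extend it with the equivalent calls of your ML framework (e.g., torch.manual_seed for PyTorch):

import random

import numpy as np


def set_global_seed(seed: int = 42) -> None:
    # Fix the common sources of randomness so that training runs are repeatable.
    random.seed(seed)     # Python's built-in RNG
    np.random.seed(seed)  # NumPy's global RNG (also used by scikit-learn by default)


set_global_seed(42)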

Track 1: Adversarial Robustness to Feature-space Attacks

  1. The submitted models must only rely on the provided feature set or a custom subset thereof (in this case, the user must specify the selected features).
  2. The submitted models must accept feature vectors as input and provide the classification scores of the positive class and the predicted class labels as output.
  3. The submitted models must have a False Positive Rate equal to or lower than 1% on the provided validation set, which is composed of benign samples only (see the calibration sketch after this list).
  4. The testing must be performed with the provided code, which will classify the test sets, execute a feature-space attack, and output the submission file with predicted labels and scores.
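
As an illustration of rule 3, the following sketch calibrates a decision threshold so that at most about 1% of the benign validation samples are classified as malware. It assumes a scikit-learn-style classifier and hypothetical file names for the released Track 1 data; it is not part of the official evaluation code:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file names: replace them with the feature files actually released for Track 1.
X_train = np.load("track1_train_features.npy")
y_train = np.load("track1_train_labels.npy")          # 1 = malware, 0 = benign
X_val_benign = np.load("track1_val_benign_features.npy")

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Positive-class (malware) scores on the benign-only validation set.
benign_scores = clf.predict_proba(X_val_benign)[:, 1]

# The 99th percentile leaves roughly 1% of the benign scores above it, so using it
# as the decision threshold keeps the FPR close to the 1% budget. Always verify the
# resulting FPR with the provided evaluation code.
threshold = np.percentile(benign_scores, 99)


def predict(X):
    # Return (labels, scores) in the format required by the track (label 1 = malware).
    scores = clf.predict_proba(X)[:, 1]
    labels = (scores >= threshold).astype(int)
    return labels, scores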

Track 2: Adversarial Robustness to Problem-space Attacks (starting soon)

  1. The submitted models must accept APK files as input and provide the classification scores of the positive class and the predicted class labels as output (see the interface sketch after this list).
  2. The submitted models must have a False Positive Rate equal to or lower than 1% on the provided validation set composed of benign samples only.
  3. The testing must be performed with the provided code, which will classify the test sets, execute a problem-space attack, and output the submission file with predicted labels and scores.
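
Since Track 2 models must take APK files as input, a wrapper along the following lines could expose the required interface. The feature extractor is a placeholder for your own static analysis; the class and function names are illustrative and not part of the provided code:

import numpy as np


def extract_static_features(apk_path):
    # Placeholder: statically parse the APK (manifest, DEX code, etc.) and return
    # a fixed-length feature vector. The application must never be executed.
    raise NotImplementedError


class ApkDetector:
    # Wraps a trained model behind the APK-in, (labels, scores)-out interface.

    def __init__(self, model, threshold):
        self.model = model          # any classifier exposing predict_proba
        self.threshold = threshold  # calibrated on the benign-only validation set

    def predict(self, apk_paths):
        features = np.vstack([extract_static_features(p) for p in apk_paths])
        scores = self.model.predict_proba(features)[:, 1]  # positive-class scores
        labels = (scores >= self.threshold).astype(int)    # 1 = malware
        return labels, scores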

Track 3: Temporal Robustness to Data Drift

  1. The submitted models must accept APK files as input and provide the classification scores of the positive class and the predicted class labels as output.
  2. To perform the testing, the participants must classify the test applications with their model and provide the predicted labels and the classification scores of the positive class.

In this repository, you can find the detailed instructions required to build your detector, the evaluation code for Tracks 1 and 2, already-implemented baseline methods, and code examples showing how to create the submission files.

Submission file format

For all the evaluation tracks, the submission must be uploaded as a JSON file containing a list with one dictionary per required evaluation. The keys of each dictionary are the SHA256 hashes of the test set samples of the respective dataset, and each hash is associated with an array containing the predicted class label (either 0 or 1) and the positive-class score. For Tracks 1 and 2, the first dictionary contains the classification results on the provided goodware-only test set (used to check the False Positive Rate), while the following ones contain the classification results on the provided malware-only test set under different amounts of adversarial perturbation. For Track 3, each dictionary corresponds to an evaluation round test set (the order must be preserved).

[
  {
    "<sha256>": [<label>, <score>],
    …
  },
  …
]
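
As an example, the file above could be assembled in Python as follows; the variable and file names are placeholders, and the order of the dictionaries must follow the track-specific requirements described above:

import json


def build_submission(evaluations, out_path="submission.json"):
    # `evaluations` is a list with one dict per required evaluation, in the required
    # order; each dict maps a SHA256 hash to a (label, score) pair.
    payload = [
        {sha256: [int(label), float(score)] for sha256, (label, score) in evaluation.items()}
        for evaluation in evaluations
    ]
    with open(out_path, "w") as f:
        json.dump(payload, f)


# Hypothetical usage for Tracks 1 and 2: first the goodware-only test set,
# then one dictionary per adversarial-perturbation setting.
# build_submission([goodware_results, adv_results_low, adv_results_high])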

Metrics

  • Detection Rate (a.k.a. True Positive Rate, Tracks 1 and 2): this metric is computed as the percentage of correctly detected malware samples and is used for Tracks 1 and 2 on a test set containing only malware samples.
  • False Positive Rate (Tracks 1 and 2): this metric is computed as the percentage of legitimate samples wrongly detected as malware and is used for Tracks 1 and 2 on a test set containing only legitimate samples.
  • F1 Score (Track 3): this metric is computed as the harmonic mean of Precision and Recall, i.e., F1 = (2 × Precision × Recall) / (Precision + Recall), and it is particularly suited for summarizing binary classification performance on unbalanced datasets in a single value.
  • Area Under Time - AUT (Track 3): this metric was introduced in [1] to evaluate the performance of malware detectors over time. Like AUC-based metrics, it is based on the trapezoidal rule. Its value lies in the [0, 1] interval, where an ideal detector that is robust to temporal performance decay has AUT = 1. We compute the metric on point estimates of the F1 Score over the time period spanned by the test samples (a computation sketch follows this list).
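
For reference, the F1-based AUT can be sketched as below, assuming one F1 point estimate per evaluation round, evenly spaced in time; the provided evaluation code remains the authoritative implementation:

def aut(f1_per_round):
    # Area Under Time over N point estimates, normalised to [0, 1]:
    #   AUT(f, N) = 1/(N-1) * sum_{k=1}^{N-1} (f(x_k) + f(x_{k+1})) / 2
    n = len(f1_per_round)
    if n < 2:
        raise ValueError("AUT needs at least two evaluation rounds")
    trapezoids = ((f1_per_round[k] + f1_per_round[k + 1]) / 2 for k in range(n - 1))
    return sum(trapezoids) / (n - 1)


# Example with hypothetical F1 values that decay over four rounds:
print(aut([0.95, 0.90, 0.82, 0.75]))  # ≈ 0.857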

References

[1] Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., & Cavallaro, L. (2019). TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time. USENIX Security Symposium.