Participation Instructions - Cybersecurity

General rules

  1. The binary classification task consists of distinguishing malware samples from benign applications using only ML-based approaches. The use of whitelisting, blacklisting, or signatures is not allowed. The submitted models can only rely on statically extracted features, i.e., applications must not be executed during the feature extraction process.
  2. Participants must train their models only on the provided training dataset and evaluate them on the provided test datasets using the provided evaluation code.
  3. Everything must be fully reproducible. Participants must provide all the code required to train and deploy their models, including the feature extraction process (except for Track 1, where the features are provided) and, if necessary, pre-set random seeds to guarantee reproducibility (see the seeding sketch after this list). All submitted models and results are subject to re-evaluation. We ask that pre-trained models and source code be publicly released (e.g., in a GitHub repository).
  4. To participate in a track with a new model, users must train the model and follow the track instructions to compute the predicted labels and scores on the released test datasets. The models must be evaluated on all the provided test sets.
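
Rule 3 mentions pre-set random seeds. The snippet below is a minimal sketch of what such seeding could look like in Python, assuming NumPy-based tooling; extend it with the equivalent calls of your ML framework (e.g., torch.manual_seed for PyTorch):

import random

import numpy as np


def set_global_seed(seed: int = 42) -> None:
    # Fix the common sources of randomness so that training runs are repeatable.
    random.seed(seed)     # Python's built-in RNG
    np.random.seed(seed)  # NumPy's global RNG (also used by scikit-learn by default)


set_global_seed(42)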

Track 1: Adversarial Robustness to Feature-space Attacks

  1. The submitted models must only rely on the provided feature set or a custom subset thereof (in this case, the user must specify the selected features).
  2. The submitted models must accept feature vectors as input and provide the classification scores of the positive class and the predicted class labels as output.
  3. The submitted models must have a False Positive Rate equal to or lower than 1% on the provided validation set, which is composed of benign samples only (see the calibration sketch after this list).
  4. The testing must be performed with the provided code, which will classify the test sets, execute a feature-space attack, and output the submission file with predicted labels and scores.
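
As an illustration of rule 3, the following sketch calibrates a decision threshold so that at most about 1% of the benign validation samples are classified as malware. It assumes a scikit-learn-style classifier and hypothetical file names for the released Track 1 data; it is not part of the official evaluation code:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file names: replace them with the feature files actually released for Track 1.
X_train = np.load("track1_train_features.npy")
y_train = np.load("track1_train_labels.npy")          # 1 = malware, 0 = benign
X_val_benign = np.load("track1_val_benign_features.npy")

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Positive-class (malware) scores on the benign-only validation set.
benign_scores = clf.predict_proba(X_val_benign)[:, 1]

# The 99th percentile leaves roughly 1% of the benign scores above it, so using it
# as the decision threshold keeps the FPR close to the 1% budget. Always verify the
# resulting FPR with the provided evaluation code.
threshold = np.percentile(benign_scores, 99)


def predict(X):
    # Return (labels, scores) in the format required by the track (label 1 = malware).
    scores = clf.predict_proba(X)[:, 1]
    labels = (scores >= threshold).astype(int)
    return labels, scores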

Track 2: Adversarial Robustness to Problem-space Attacks (starting soon)

  1. The submitted models must accept APK files as input and provide the classification scores of the positive class and the predicted class labels as output (see the interface sketch after this list).
  2. The submitted models must have a False Positive Rate equal to or lower than 1% on the provided validation set composed of benign samples only.
  3. The testing must be performed with the provided code, which will classify the test sets, execute a problem-space attack, and output the submission file with predicted labels and scores.
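
Since Track 2 models must take APK files as input, a wrapper along the following lines could expose the required interface. The feature extractor is a placeholder for your own static analysis; the class and function names are illustrative and not part of the provided code:

import numpy as np


def extract_static_features(apk_path):
    # Placeholder: statically parse the APK (manifest, DEX code, etc.) and return
    # a fixed-length feature vector. The application must never be executed.
    raise NotImplementedError


class ApkDetector:
    # Wraps a trained model behind the APK-in, (labels, scores)-out interface.

    def __init__(self, model, threshold):
        self.model = model          # any classifier exposing predict_proba
        self.threshold = threshold  # calibrated on the benign-only validation set

    def predict(self, apk_paths):
        features = np.vstack([extract_static_features(p) for p in apk_paths])
        scores = self.model.predict_proba(features)[:, 1]  # positive-class scores
        labels = (scores >= self.threshold).astype(int)    # 1 = malware
        return labels, scores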

Track 3: Temporal Robustness to Data Drift

  1. The submitted models must accept APK files as input and provide the classification scores of the positive class and the predicted class labels as output.
  2. To perform the testing, the participants must classify the test applications with their model and provide the predicted labels and the classification scores of the positive class.

In this repository, you can find the detailed instructions required to build your detector, the evaluation code for Tracks 1 and 2, already-implemented baseline methods, and code examples showing how to create the submission files.

Submission file format

For all the evaluation tracks, the submission must be uploaded as a JSON file containing a list with one dictionary per required evaluation. The keys of each dictionary are the SHA256 hashes of the test set samples of the respective dataset, and each hash is associated with an array containing the predicted class label (either 0 or 1) and the positive-class score. For Tracks 1 and 2, the first dictionary contains the classification results on the provided goodware-only test set (used to check the False Positive Rate), while the following ones contain the classification results on the provided malware-only test set under different amounts of adversarial perturbation. For Track 3, each dictionary corresponds to an evaluation round test set (the order must be preserved).

[
  {
    "<sha256>": [<label>, <score>],
    …
  },
  …
]
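
As an example, the file above could be assembled in Python as follows; the variable and file names are placeholders, and the order of the dictionaries must follow the track-specific requirements described above:

import json


def build_submission(evaluations, out_path="submission.json"):
    # `evaluations` is a list with one dict per required evaluation, in the required
    # order; each dict maps a SHA256 hash to a (label, score) pair.
    payload = [
        {sha256: [int(label), float(score)] for sha256, (label, score) in evaluation.items()}
        for evaluation in evaluations
    ]
    with open(out_path, "w") as f:
        json.dump(payload, f)


# Hypothetical usage for Tracks 1 and 2: first the goodware-only test set,
# then one dictionary per adversarial-perturbation setting.
# build_submission([goodware_results, adv_results_low, adv_results_high])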

Metrics

  • Detection Rate (a.k.a. True Positive Rate, Tracks 1 and 2): this metric is computed as the percentage of correctly detected malware samples and is used for Tracks 1 and 2 on a test set containing only malware samples.
  • False Positive Rate (Tracks 1 and 2): this metric is computed as the percentage of legitimate samples wrongly detected as malware and is used for Tracks 1 and 2 on a test set containing only legitimate samples.
  • F1 Score (Track 3): this metric is computed as the harmonic mean of Precision and Recall, i.e., F1 = (2 × Precision × Recall) / (Precision + Recall), and it is particularly suited for summarizing binary classification performance on unbalanced datasets in a single value.
  • Area Under Time - AUT (Track 3): this metric was introduced in [1] to evaluate the performance of malware detectors over time. Like AUC-based metrics, it is based on the trapezoidal rule. Its value lies in the [0, 1] interval, where an ideal detector that is robust to temporal performance decay has AUT = 1. We compute the metric on point estimates of the F1 Score over the time period spanned by the test samples (a computation sketch follows this list).
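
For reference, the F1-based AUT can be sketched as below, assuming one F1 point estimate per evaluation round, evenly spaced in time; the provided evaluation code remains the authoritative implementation:

def aut(f1_per_round):
    # Area Under Time over N point estimates, normalised to [0, 1]:
    #   AUT(f, N) = 1/(N-1) * sum_{k=1}^{N-1} (f(x_k) + f(x_{k+1})) / 2
    n = len(f1_per_round)
    if n < 2:
        raise ValueError("AUT needs at least two evaluation rounds")
    trapezoids = ((f1_per_round[k] + f1_per_round[k + 1]) / 2 for k in range(n - 1))
    return sum(trapezoids) / (n - 1)


# Example with hypothetical F1 values that decay over four rounds:
print(aut([0.95, 0.90, 0.82, 0.75]))  # ≈ 0.857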

References

[1] Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., & Cavallaro, L. (2019). TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time. USENIX Security Symposium.