method: Swin Transformer DCT (2023-09-04)

Authors: Davide Alessandro Coccomini, Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro

Affiliation: ISTI-CNR

Email: davidealessandro.coccomini@isti.cnr.it

Description: We fine-tuned a Swin Transformer Base, pre-trained on ImageNet, on the provided training set. The training images underwent heavy random data augmentation (inspired by [1]) to encourage the model to generalize better. Because images generated by Diffusion Models are known to contain characteristic noise, a model could overfit by learning to recognize that noise exclusively. To avoid this, the applied transformations include many noise-addition and compression techniques, also in combination, as well as random rotations, brightness changes, crops, dropouts, resizing, and other manipulations that boost generalization.
During training, each image is also transformed into the DCT domain with a probability of 50%, since, as shown in [2], this should emphasize the artifacts.
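As an illustration, the probabilistic DCT transformation described above could be sketched as follows. This is an assumed implementation, not the authors' code: the exact DCT variant and scaling they use may differ (log-scaling of the magnitude spectrum is a common choice).

```python
import numpy as np
from scipy.fftpack import dct

def dct2(channel):
    # 2-D type-II DCT: apply the 1-D DCT along rows, then along columns
    return dct(dct(channel, axis=0, norm="ortho"), axis=1, norm="ortho")

def maybe_dct(image, p=0.5, rng=None):
    """With probability p, replace each channel of an HxWxC image
    with the log-scaled magnitude of its DCT spectrum."""
    rng = rng or np.random.default_rng()
    if rng.random() >= p:          # keep the image in the pixel domain
        return image
    out = np.empty_like(image, dtype=np.float64)
    for c in range(image.shape[2]):
        coeffs = dct2(image[:, :, c].astype(np.float64))
        out[:, :, c] = np.log(np.abs(coeffs) + 1e-8)  # compress dynamic range
    return out
```

In a training pipeline this would be applied per sample, so roughly half of the batches the model sees are frequency-domain representations.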

In order to choose the best model, we also created a custom validation set composed of real images taken from the Flickr dataset and images generated by GANs (ProGAN, StyleGAN, StyleGAN2, and RelGAN) and by Diffusion Models (Stable Diffusion and GLIDE), inspired by "Detecting Images Generated by Diffusers".

method: Swin Transformer + Swin Transformer DCT (2023-08-31)

Authors: Davide Alessandro Coccomini, Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro

Description: We fine-tuned two Deep Learning models pre-trained on ImageNet, specifically two Swin Transformer Base models. The training images underwent heavy random data augmentation (inspired by [1]) to encourage the models to generalize better. Because images generated by Diffusion Models are known to contain characteristic noise, a model could overfit by learning to recognize that noise exclusively. To avoid this, the applied transformations include many noise-addition and compression techniques, also in combination, as well as random rotations, brightness changes, crops, dropouts, resizing, and other manipulations that boost generalization.
During the training of one of the two Swin Transformers, each image is also transformed into the DCT domain with a probability of 50%, since, as shown in [1], this should emphasize the artifacts.
Both models make a prediction on each image in the test set, and the final prediction is the mean of the two predictions.
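The averaging step can be sketched as below. The scores `p_rgb` and `p_dct` are hypothetical per-image fake probabilities for the pixel-domain and DCT-domain models; they are not the authors' actual outputs.

```python
import numpy as np

def ensemble_predict(prob_a, prob_b):
    """Average the per-image probabilities of two detectors."""
    return (np.asarray(prob_a) + np.asarray(prob_b)) / 2.0

# hypothetical scores from the two models on three test images
p_rgb = [0.92, 0.10, 0.55]
p_dct = [0.88, 0.30, 0.45]

final = ensemble_predict(p_rgb, p_dct)   # [0.90, 0.20, 0.50]
labels = (final >= 0.5).astype(int)      # 1 = generated, 0 = real
```

Averaging probabilities (rather than hard labels) lets a confident model outvote an uncertain one, which is the usual motivation for this kind of two-model ensemble.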

In order to choose the best model, we also created a custom validation set composed of real images taken from the Flickr dataset and images generated by GANs (ProGAN, StyleGAN, StyleGAN2, and RelGAN) and by Diffusion Models (Stable Diffusion and GLIDE), inspired by "Detecting Images Generated by Diffusers".

method: Swin Transformer (2023-08-24)

Authors: Davide Alessandro Coccomini, Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro

Affiliation: ISTI-CNR

Email: davidealessandro.coccomini@isti.cnr.it

Description: We fine-tuned a Swin Transformer Base pre-trained on ImageNet. The architecture was chosen because of a previous analysis of its generalization capabilities in the field of deepfake detection [1]. Training applied heavy random data augmentation (inspired by [2]) to camouflage the trace left by diffusion models on the images and force the model to focus on visual inconsistencies and introduced artifacts. In addition to transformations such as shift, scale, and rotation, we used noise manipulation via FFT, the introduction and combination of noises such as Multiplicative Noise and ISO Noise, and various levels of compression.
Image resizing was also handled with different techniques, including random crop and Isotropic Resize.
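Isotropic Resize is not specified further here; a common interpretation, scaling the longer side to a target size while preserving the aspect ratio and padding to a square, could be sketched as follows. This is an assumed implementation using nearest-neighbour resampling for simplicity; the authors likely rely on an augmentation library such as Albumentations.

```python
import numpy as np

def isotropic_resize(image, target=224):
    """Scale the longer side of an HxW[xC] image to `target`, preserving
    aspect ratio, then zero-pad to a square target x target canvas."""
    h, w = image.shape[:2]
    scale = target / max(h, w)
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    # nearest-neighbour resampling via integer index maps
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = image[rows][:, cols]
    canvas = np.zeros((target, target) + image.shape[2:], dtype=image.dtype)
    canvas[:new_h, :new_w] = resized   # image sits in the top-left corner
    return canvas
```

Unlike a plain resize, this keeps shapes undistorted, so any stretching artifacts the model sees come from the generator rather than from preprocessing.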
In order to choose the best model, we also created a custom validation set composed of real images taken from the Flickr dataset and images generated by GANs (ProGAN, StyleGAN, StyleGAN2, and RelGAN) and by Diffusion Models (Stable Diffusion and GLIDE), inspired by "Detecting Images Generated by Diffusers".

Coccomini, D.A.; Caldelli, R.; Falchi, F.; Gennaro, C. On the Generalization of Deep Learning Models in Video Deepfake Detection. J. Imaging 2023, 9, 89.

Coccomini, D.A.; Zilos, G.K.; Caldelli, R.; Falchi, F.; Amato, G.; Papadopoulos, S.; Gennaro, C. MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection. arXiv, 2022.

Source code

Source code 2

Ranking Table

Date        Method                                     f1_score
2023-09-04  Swin Transformer DCT                       0.97725668575014
2023-08-31  Swin Transformer + Swin Transformer DCT    0.97365746892832
2023-08-24  Swin Transformer                           0.97105355677956
2023-08-24  Swin Transformer + Resnet50 DCT            0.95234775873754
2023-08-22  Resnet50 + Swin Transformer                0.94966915523661
2023-09-28  CNN detection with Multi-modal             0.88971233544612
2023-09-08  Basic                                      0.80222598068634
2023-09-08  MiniVGG                                    0.8006292644557
2023-10-27  First Submission                           0.79736329918108
2023-09-02  Baseline                                   0.77303002356799
2023-09-10  Task1 testing submission                   0.68246036940662
2023-08-26  swin baseline                              0.20702247191011
2023-09-25  grag 2epoch                                0.13617305480316
2023-09-25  grag 3epoch                                0.063666215955186
2023-09-25  grag 5epoch                                0.059210526315789
2023-09-25  grag 4epoch                                0.035153797865662
2023-08-02  Random                                     0
2023-08-24  Random                                     0
2023-08-24  Random 01                                  0
2023-08-24  Random 02                                  0

Ranking Graphic