Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection

https://www.alphaxiv.org/overview/2503.19683v2

The core of the authors' method is to use a pretrained CLIP-ViT-L/14 image encoder as the feature extractor, combined with strategic fine-tuning and regularization techniques to strengthen generalization.

Instead of full-model fine-tuning (which risks overfitting on limited deepfake data), the authors adopt LN-tuning as their main parameter-efficient fine-tuning (PEFT) strategy. It fine-tunes only the layer-normalization parameters of the vision encoder, introducing roughly 104 K trainable parameters out of 303 M in total (about 0.03%). This selective training preserves CLIP's generalizable pretrained representations while still allowing task-specific adaptation.
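As a rough illustration of LN-tuning (a minimal PyTorch sketch, not the authors' code; the module matching is an assumption and the printed counts are only meant to roughly reproduce the 104 K / 303 M figures above):

import torch
from transformers import CLIPVisionModel

# Load the CLIP ViT-L/14 vision encoder (~303 M parameters).
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# Freeze every parameter first ...
for p in encoder.parameters():
    p.requires_grad = False
# ... then unfreeze only the LayerNorm affine parameters (LN-tuning).
for m in encoder.modules():
    if isinstance(m, torch.nn.LayerNorm):
        for p in m.parameters():
            p.requires_grad = True

trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
total = sum(p.numel() for p in encoder.parameters())
print(f"trainable {trainable:,} / total {total:,} ({100 * trainable / total:.2f}%)")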

Rather than inventing a brand-new algorithm, the paper is the first to combine three techniques already proven effective in other domains (such as face recognition) and apply them to CLIP-based deepfake detection, with notable results (a combined code sketch follows the three points below):

Feature normalization: project the extracted features onto a hypersphere. This is the most critical step; it substantially improves generalization and prevents overfitting.

Metric learning: further shape the feature distribution on the hypersphere so that "real" and "fake" features are pushed farther apart and clustered more tightly.

Latent augmentation: synthesize new fake samples directly in feature space during training, so the model can better recognize previously unseen forgery techniques.
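A minimal PyTorch sketch of what these three ingredients could look like, assuming L2 normalization onto the unit hypersphere, an alignment/uniformity-style metric term (matching the train/alignment and train/uniformity values in the training log further down), and slerp-based mixing of features in latent space (the checkpoint name below contains "slerp"); this is an interpretation, not the authors' exact implementation:

import torch
import torch.nn.functional as F

def normalize(z: torch.Tensor) -> torch.Tensor:
    # Feature normalization: project features onto the unit hypersphere.
    return F.normalize(z, dim=-1)

def alignment_uniformity(z: torch.Tensor, y: torch.Tensor):
    # Metric learning on the hypersphere: alignment pulls same-class
    # features together, uniformity spreads all features over the sphere.
    z = normalize(z)
    same = y[:, None] == y[None, :]
    same.fill_diagonal_(False)
    align = torch.cdist(z, z).pow(2)[same].mean()
    unif = torch.pdist(z).pow(2).mul(-2).exp().mean().log()
    return align, unif

def slerp(z1: torch.Tensor, z2: torch.Tensor, t: float) -> torch.Tensor:
    # Latent augmentation: spherical interpolation between two fake features
    # to synthesize a new "fake" training point directly in feature space.
    z1, z2 = normalize(z1), normalize(z2)
    cos = (z1 * z2).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    omega = torch.acos(cos)
    return (torch.sin((1 - t) * omega) * z1 + torch.sin(t * omega) * z2) / torch.sin(omega)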

In short, the innovation lies in showing that this simple, classic combination is enough for a general-purpose foundation model (CLIP) to match or even surpass many complex, specialized detectors on deepfake detection, setting a strong yet simple baseline for the field.

The model is trained on FaceForensics++ (c23 compression) with a custom validation set that combines FF++ test samples with additional datasets, to better reflect out-of-distribution performance. Training uses the Adam optimizer with a cosine learning-rate schedule and standard image augmentation. Evaluation follows a cross-dataset protocol to assess generalization, testing on Celeb-DF-v2, DFDC, Google's DFD, FFIW, and DeepSpeak v1.0. The primary metric is video-level AUROC, computed by averaging the predictions over all sampled frames of each video.
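As a small sketch of the video-level metric: frame predictions are averaged per video and the AUROC is then computed over videos (the data layout here is an assumption, not the repo's actual evaluation code):

from collections import defaultdict
import numpy as np
from sklearn.metrics import roc_auc_score

def video_level_auroc(frame_scores, frame_video_ids, video_labels):
    # frame_scores: p(fake) per frame; frame_video_ids: video id per frame;
    # video_labels: dict mapping video id -> 0 (real) or 1 (fake).
    per_video = defaultdict(list)
    for score, vid in zip(frame_scores, frame_video_ids):
        per_video[vid].append(score)
    vids = sorted(per_video)
    y_score = [float(np.mean(per_video[v])) for v in vids]  # average over frames
    y_true = [video_labels[v] for v in vids]
    return roc_auc_score(y_true, y_score)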

You can check out the AI-generated blog about the paper, as well as the open-source code: https://github.com/yermandy/deepfake-detection?tab=readme-ov-file

That said, this paper is not as strong as another one:

Effort: Efficient Orthogonal Modeling for Generalizable AI-Generated Image Detection, which received an ICML 2025 spotlight: https://www.alphaxiv.org/abs/2411.15633v1

It feels a bit like taking a bunch of methods already used in the large-model world and running them once more on deepfake detection [facepalm].

Environment setup for reproduction

An NVIDIA GeForce RTX 4090 in the Chongqing region (later swapped for an H20 in the Shanghai region, because PEFT fine-tuning had to be run).

Some mirror issues with the command-line tools had to be sorted out (on my own machine I would normally just put the terminal behind a proxy, since some mirrors have sync problems).

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
merges.txt: 525kB [00:00, 2.89MB/s]
Traceback (most recent call last):
  File "/root/along/deepfake-detection/inference_torchscript.py", line 22, in <module>
    preprocess = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/dfdet/lib/python3.12/site-packages/transformers/processing_utils.py", line 1070, in from_pretrained
    args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)

For example, both uv and Hugging Face need their mirrors configured:

export UV_INDEX_URL=https://pypi.tuna.tsinghua.edu.cn/simple
export HF_ENDPOINT=https://hf-mirror.com
export TRANSFORMERS_OFFLINE=0

What the TRANSFORMERS_OFFLINE environment variable does:

Unset or 0 (online mode): the local cache is checked first; if a model or tokenizer is missing, it is downloaded from the Hugging Face Hub. This is the default, for networked training/inference.

1 (offline mode): only the local cache is used; no download is attempted, and missing files cause an error. Suited to servers without network access, local-cache deployments, and cases where dependencies must be strictly controlled.
So if model files downloaded from Hugging Face ended up missing because of network problems, the recommendation is to configure the mirror first and then make sure TRANSFORMERS_OFFLINE=0 is set, so the missing files can be fetched again.
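One way to avoid a partially downloaded cache altogether is to pre-fetch the whole model repository through the mirror once and let later runs hit the cache (a minimal sketch using huggingface_hub; run it with HF_ENDPOINT already exported):

from huggingface_hub import snapshot_download

# Pull every file of the CLIP repo into the local Hugging Face cache, so that
# later CLIPProcessor/CLIPModel.from_pretrained calls do not touch the network.
path = snapshot_download(repo_id="openai/clip-vit-large-patch14")
print("cached at:", path)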

When one mirror site gets throttled from too much traffic, you can rotate to another:

HF_ENDPOINT=https://huggingface.tuna.tsinghua.edu.cn python inference_torchscript.py

Minimal inference verification

(dfdet) root@gpu-4090-24g-instance-4783-jvmdmc1c-7c7c75bb4f-gr65j:~/along/deepfake-detection# HF_ENDPOINT=https://hf-mirror.com python inference_torchscript.py
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
tokenizer_config.json: 905B [00:00, 4.51MB/s]
vocab.json: 961kB [00:00, 3.38MB/s]
p(real) = 0.8087, p(fake) = 0.1913, image: datasets/CDFv2/Celeb-synthesis/id0_id1_0000/000.png
p(real) = 0.3680, p(fake) = 0.6320, image: datasets/CDFv2/Celeb-synthesis/id0_id1_0000/045.png
p(real) = 0.1854, p(fake) = 0.8146, image: datasets/CDFv2/Celeb-synthesis/id0_id1_0000/030.png
p(real) = 0.1476, p(fake) = 0.8524, image: datasets/CDFv2/Celeb-synthesis/id0_id1_0000/015.png
p(real) = 0.9300, p(fake) = 0.0700, image: datasets/CDFv2/YouTube-real/00000/000.png
p(real) = 0.8800, p(fake) = 0.1200, image: datasets/CDFv2/YouTube-real/00000/014.png
p(real) = 0.6876, p(fake) = 0.3124, image: datasets/CDFv2/YouTube-real/00000/028.png
p(real) = 0.2524, p(fake) = 0.7476, image: datasets/CDFv2/YouTube-real/00000/043.png
p(real) = 0.1790, p(fake) = 0.8210, image: datasets/CDFv2/Celeb-real/id0_0000/045.png
p(real) = 0.0947, p(fake) = 0.9053, image: datasets/CDFv2/Celeb-real/id0_0000/030.png
p(real) = 0.1089, p(fake) = 0.8911, image: datasets/CDFv2/Celeb-real/id0_0000/015.png
p(real) = 0.6545, p(fake) = 0.3455, image: datasets/CDFv2/Celeb-real/id0_0000/000.png

In the end it's still best to run a global Clash proxy (I splurged on a good node).

Some repositories are simply not synced on the mirror sites.

(dfdet) root@gpu-h20-96g-instance-4783-hnpwvuy2-6444d8fb46-hjjgx:~/deepfake-detection# python inference.py 
FF-ALL-ClipL-LinearNorm-LN-CE+LS+UnAl-slerp
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████| 1.71G/1.71G [00:21<00:00, 79.1MB/s]
Using bfloat16 Automatic Mixed Precision (AMP)
p(real) = 0.8056, p(fake) = 0.1944, image: datasets/CDFv2/Celeb-synthesis/id0_id1_0000/000.png
p(real) = 0.3693, p(fake) = 0.6307, image: datasets/CDFv2/Celeb-synthesis/id0_id1_0000/045.png
p(real) = 0.1807, p(fake) = 0.8193, image: datasets/CDFv2/Celeb-synthesis/id0_id1_0000/030.png
p(real) = 0.1456, p(fake) = 0.8544, image: datasets/CDFv2/Celeb-synthesis/id0_id1_0000/015.png
p(real) = 0.9294, p(fake) = 0.0706, image: datasets/CDFv2/YouTube-real/00000/000.png
p(real) = 0.8775, p(fake) = 0.1225, image: datasets/CDFv2/YouTube-real/00000/014.png
p(real) = 0.6610, p(fake) = 0.3390, image: datasets/CDFv2/YouTube-real/00000/028.png
p(real) = 0.2524, p(fake) = 0.7476, image: datasets/CDFv2/YouTube-real/00000/043.png
p(real) = 0.1807, p(fake) = 0.8193, image: datasets/CDFv2/Celeb-real/id0_0000/045.png
p(real) = 0.0933, p(fake) = 0.9067, image: datasets/CDFv2/Celeb-real/id0_0000/030.png
p(real) = 0.1074, p(fake) = 0.8926, image: datasets/CDFv2/Celeb-real/id0_0000/015.png
p(real) = 0.6575, p(fake) = 0.3425, image: datasets/CDFv2/Celeb-real/id0_0000/000.png

Training

Preparing the data

Naturally there were two bottlenecks; the proxy removed one, but the dataset's official site is also overloaded, so I turned to the FaceForensics++ copy on Kaggle and, by checking the original paper, confirmed that it uses the c23 compressed version of the dataset:

https://www.kaggle.com/datasets/xdxd003/ff-c23

Remember to set up ~/.kaggle/kaggle.json first; after that you can download with curl.

(dfdet) root@gpu-h20-96g-instance-4783-hnpwvuy2-6444d8fb46-hjjgx:~/deepfake-detection# curl -L -o ~/deepfake-detection/ff-c23.zip https://www.kaggle.com/api/v1/datasets/download/xdxd003/ff-c23
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 74 16.6G   74 12.4G    0     0  26.5M      0  0:10:41  0:07:58  0:02:43 26.2M
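Equivalently, the download can be scripted with the official kaggle Python package (assuming it is installed and ~/.kaggle/kaggle.json is in place):

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json
# Download the FF++ c23 dataset archive into the current directory (as a zip).
api.dataset_download_files("xdxd003/ff-c23", path=".", unzip=False)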

Processing the data

The repo asks you to use this repository:

https://github.com/SCLBD/DeepfakeBench

DeepfakeBench already provides preprocessed data (frames extracted from the videos with the faces cropped out), but the directory structure differs from what this paper's code expects, so I manually rearranged the directories before training (partly guessing what the structure was supposed to be).

Launching the training run

 
┏━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃   ┃ Name               ┃ Type              ┃ Params ┃ Mode  ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ 0 │ feature_extractor  │ PeftModel         │ 303 M  │ train │
│ 1 │ model              │ LinearProbe       │ 2.0 K  │ train │
│ 2 │ criterion          │ Loss              │ 0      │ train │
│ 3 │ train_step_outputs │ OutputsForMetrics │ 0      │ train │
│ 4 │ val_step_outputs   │ OutputsForMetrics │ 0      │ train │
│ 5 │ test_step_outputs  │ OutputsForMetrics │ 0      │ train │
└───┴────────────────────┴───────────────────┴────────┴───────┘
Trainable params: 104 K                                                                                                                   
Non-trainable params: 303 M                                                                                                               
Total params: 303 M                                                                                                                       
Total estimated model params size (MB): 1.2 K                                                                                             
Modules in train mode: 118                                                                                                                
Modules in eval mode: 346                                                                                                                 
/opt/miniconda3/envs/dfdet/lib/python3.12/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an 
ambiguous collection. The batch size we found is 12. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
Logs: runs/train/example-run/
/opt/miniconda3/envs/dfdet/lib/python3.12/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an 
ambiguous collection. The batch size we found is 128. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
Epoch 0/9  ━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129/1331 0:01:05 0:10:10 1.97it/s train/loss_step: 0.608 train/loss_ce_step: 0.608  
                                                                                        train/alignment_step: 0.602 train/uniformity_step:
                                                                                        -1.188                                            
 

Training process

The training curves for each metric can be viewed here: https://wandb.ai/alongforllm-ucas/deepfake?nw=nwuseralongforllm

Final testing

Training hung at epoch 7 (GPU memory usage also dropped to 0), so after force-killing the process I ran the test with the current checkpoint:

python run.py --test

The test set is independent of the training set; the catch with this repo, though, is that the test set contains very few samples. (The 1.0 scores are most likely just an "accidentally perfect" result on this particular small sample.)
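A toy illustration (not the repo's evaluation code) of why a perfect video-level AUROC on a handful of videos says little: with only a few real/fake videos, even a weak scorer often ranks them perfectly. The score distributions below are hypothetical.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
trials, perfect = 10_000, 0
for _ in range(trials):
    # 3 fake videos and 1 real video, scored by a weak detector whose
    # fake scores are only slightly shifted upward.
    y_score = np.concatenate([rng.normal(0.6, 0.2, 3), rng.normal(0.4, 0.2, 1)])
    y_true = [1, 1, 1, 0]
    perfect += roc_auc_score(y_true, y_score) == 1.0
print(f"AUROC == 1.0 in {perfect / trials:.0%} of trials")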

Logs: runs/test/example-run/
/opt/miniconda3/envs/dfdet/lib/python3.12/site-packages/lightning/pytorch/utilities/data.py:79: Trying to infer the `batch_size` from an 
ambiguous collection. The batch size we found is 12. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃             Test metric             ┃            DataLoader 0             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│           num_test_files            │                12.0                 │
│           test/acc_frame            │                0.75                 │
│           test/acc_video            │         0.6666666865348816          │
│           test/alignment            │         0.6961564421653748          │
│          test/auroc_frame           │               0.8125                │
│          test/auroc_video           │                 1.0                 │
│       test/balanced_acc_frame       │                0.75                 │
│       test/balanced_acc_video       │                0.75                 │
│     test/dataset/CDF/acc_frame      │                0.75                 │
│     test/dataset/CDF/acc_video      │         0.6666666865348816          │
│    test/dataset/CDF/auroc_frame     │               0.8125                │
│    test/dataset/CDF/auroc_video     │                 1.0                 │
│ test/dataset/CDF/balanced_acc_frame │                0.75                 │
│ test/dataset/CDF/balanced_acc_video │                0.75                 │
│     test/dataset/CDF/eer_frame      │                0.25                 │
│     test/dataset/CDF/eer_video      │                 0.0                 │
│   test/dataset/CDF/f1_score_frame   │         0.7333333492279053          │
│   test/dataset/CDF/f1_score_video   │         0.6666666865348816          │
│     test/dataset/CDF/mAP_frame      │         0.8417658805847168          │
│     test/dataset/CDF/mAP_video      │                 1.0                 │
│           test/eer_frame            │                0.25                 │
│           test/eer_video            │                 0.0                 │
│         test/f1_score_frame         │         0.7333333492279053          │
│         test/f1_score_video         │         0.6666666865348816          │
│              test/loss              │         0.5597330927848816          │
│            test/loss_ce             │         0.5597330927848816          │
│           test/mAP_frame            │         0.8417658805847168          │
│           test/mAP_video            │                 1.0                 │
│           test/uniformity           │         -1.2965565919876099         │
└─────────────────────────────────────┴─────────────────────────────────────┘