DeepFense Tutorial: Configuration Guide

This tutorial walks you through the entire DeepFense configuration file step by step. By the end, you'll understand every section and know how to customize experiments.


How to Train

Everything starts with one command:

python train.py --config deepfense/config/train.yaml

That's it. The entire training process (model architecture, data loading, augmentations, loss functions, optimizer settings) is controlled by this single YAML file.

Let's go through each section.


1. Global Settings

exp_name: "W2V_AASIST"
output_dir: "./outputs/"
seed: 1234

| Parameter | What it does |
|---|---|
| exp_name | Name of your experiment. A folder with this name plus a timestamp is created in output_dir |
| output_dir | Where to save checkpoints, logs, and plots |
| seed | Random seed for reproducibility. Reusing the same seed makes runs repeatable, although some GPU operations can still introduce small run-to-run differences |

Output structure:

outputs/W2V_AASIST_20240115_143000/
├── config.yaml      # Copy of your config
├── train.log        # Training logs
├── best_model.pth   # Best checkpoint
└── ckpts/           # All checkpoints
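
As a rough illustration of how the folder name above could be derived, here is a minimal sketch. The timestamp format `%Y%m%d_%H%M%S` is inferred from the example path, and `make_run_dir` is a hypothetical helper, not a DeepFense function.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(output_dir: str, exp_name: str, now: datetime) -> Path:
    """Build '<output_dir>/<exp_name>_<YYYYMMDD_HHMMSS>' (assumed naming scheme)."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{exp_name}_{stamp}"


run_dir = make_run_dir("./outputs/", "W2V_AASIST", datetime(2024, 1, 15, 14, 30, 0))
print(run_dir.as_posix())  # outputs/W2V_AASIST_20240115_143000
```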


2. Data Configuration

data:
  sampling_rate: 16000
  label_map: {"bonafide": 1, "spoof": 0}

  train: {...}
  val: {...}
  test: {...}

2.1 Global Data Settings

| Parameter | What it does |
|---|---|
| sampling_rate | Target audio sample rate (Hz). Most SSL models expect 16000 |
| label_map | Maps string labels to integer classes. With the map above, bonafide (real speech) is class 1 and spoof is class 0 |
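
To make the mapping concrete, the sketch below encodes a list of string labels with the same label_map. `encode_labels` is a hypothetical helper written for this tutorial; DeepFense's own dataset code may handle unknown labels differently.

```python
label_map = {"bonafide": 1, "spoof": 0}


def encode_labels(labels, label_map):
    """Convert string labels to integer class indices."""
    unknown = sorted(set(labels) - set(label_map))
    if unknown:
        # Fail early with a readable message instead of a bare KeyError.
        raise KeyError(f"labels missing from label_map: {unknown}")
    return [label_map[label] for label in labels]


print(encode_labels(["bonafide", "spoof", "spoof"], label_map))  # [1, 0, 0]
```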

2.2 Train/Val/Test Splits

Each split follows the same structure:

train:
  dataset_type: "StandardDataset"
  dataset_names: ["ASVSpoof19"]
  parquet_files: ["./data/train.parquet"]
  root_dir: "/path/to/audio/root"    # Optional: prepended to paths in parquet

  batch_size: 32
  shuffle: True

  base_transform: [...]
  augment_transform: [...]

| Parameter | What it does | Typical Values |
|---|---|---|
| dataset_type | Dataset class to use | "StandardDataset" |
| parquet_files | List of Parquet files containing metadata | Absolute or relative paths |
| root_dir | Base directory prepended to path column in parquet | Optional, for relative paths |
| dataset_names | Names for each parquet (for logging) | Optional |
| batch_size | Samples per batch | 8-64 (depends on GPU memory) |
| shuffle | Randomize order each epoch | True for train, False for val/test |
| max_per_class | Limit samples per class | Optional, for debugging |
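
The interaction between root_dir and the path column can be pictured as a simple join. The sketch below is an assumption about the behavior, not DeepFense's code: `resolve_audio_path` is a hypothetical helper, and leaving absolute paths untouched is a choice made here for illustration.

```python
import posixpath


def resolve_audio_path(path, root_dir=None):
    """Join root_dir and a relative path; leave absolute paths untouched."""
    if root_dir is None or posixpath.isabs(path):
        return path
    return posixpath.join(root_dir, path)


print(resolve_audio_path("audio/LA_E_001.flac", "/path/to/audio/root"))
# /path/to/audio/root/audio/LA_E_001.flac
print(resolve_audio_path("/abs/LA_E_002.flac", "/path/to/audio/root"))
# /abs/LA_E_002.flac
```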

2.3 Parquet Format

Your Parquet files use these columns (path and label are required):

| Column | Required | Description |
|---|---|---|
| path | ✅ Yes | Path to audio file (absolute, or relative if using root_dir) |
| label | ✅ Yes | String label: "bonafide" or "spoof" |
| ID | Optional | Unique identifier (useful for leaderboard submissions) |

Example: Creating a Parquet file

import pandas as pd

df = pd.DataFrame({
    "path": ["audio/LA_E_001.flac", "audio/LA_E_002.flac"],
    "label": ["bonafide", "spoof"],
    "ID": ["LA_E_001", "LA_E_002"]
})
df.to_parquet("train.parquet")
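
Before writing the file, it can help to sanity-check the rows against the column rules above. This is a minimal sketch with a hypothetical `check_metadata` helper; it works on plain dictionaries so it runs without pandas.

```python
def check_metadata(rows):
    """Return a list of problems found in rows (one dict per audio file)."""
    problems = []
    for i, row in enumerate(rows):
        for name in ("path", "label"):  # the two required columns
            if not row.get(name):
                problems.append(f"row {i}: missing {name}")
    ids = [row["ID"] for row in rows if "ID" in row]
    if len(ids) != len(set(ids)):  # ID is optional, but should be unique
        problems.append("duplicate ID values")
    return problems


rows = [
    {"path": "audio/LA_E_001.flac", "label": "bonafide", "ID": "LA_E_001"},
    {"path": "audio/LA_E_002.flac", "label": "spoof", "ID": "LA_E_002"},
]
print(check_metadata(rows))  # []
```

With a DataFrame like the one above, the rows can be obtained with `df.to_dict("records")`.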


3. Transforms

Transforms process audio before it goes into the model.

3.1 Base Transform (Applied to all data)

base_transform:
  - type: "pad"
    max_len: 64600      # ~4 seconds at 16kHz
    random_pad: False   # If audio > max_len: random crop (True) or take start (False)
    pad_type: "repeat"  # If audio < max_len: repeat to fill

| Transform | What it does |
|---|---|
| pad | Ensures all audio is exactly max_len samples |
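
The comments in the pad config describe two cases: cropping long clips and repeating short ones. The sketch below shows one way those semantics could work on a plain list of samples. `pad_or_crop` is a hypothetical stand-in, and only the "repeat" pad_type is modeled.

```python
import random


def pad_or_crop(samples, max_len, random_pad=False, rng=random):
    """Return exactly max_len samples (assumed semantics of the "pad" transform)."""
    n = len(samples)
    if n == 0:
        raise ValueError("cannot pad an empty signal")
    if n >= max_len:
        # Long clip: crop a random window, or keep the first max_len samples.
        start = rng.randint(0, n - max_len) if random_pad else 0
        return samples[start:start + max_len]
    # Short clip: tile the signal ("repeat" padding), then cut to length.
    repeats = -(-max_len // n)  # ceiling division
    return (samples * repeats)[:max_len]


print(pad_or_crop([1, 2, 3], 7))        # [1, 2, 3, 1, 2, 3, 1]
print(pad_or_crop([1, 2, 3, 4, 5], 3))  # [1, 2, 3]
```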

Calculating max_len:
- 16000 Hz × 4 seconds = 64000 samples
- Common values: 48000 (3 s), 64600 (about 4.04 s), 96000 (6 s)
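
The same arithmetic as a small worked example, in case you want to pick max_len for a different clip length. The helper names are illustrative only.

```python
SAMPLING_RATE = 16000  # Hz, matches data.sampling_rate


def seconds_to_samples(seconds, sampling_rate=SAMPLING_RATE):
    """Number of samples covering the given duration."""
    return int(round(seconds * sampling_rate))


def samples_to_seconds(num_samples, sampling_rate=SAMPLING_RATE):
    """Duration in seconds of a clip with num_samples samples."""
    return num_samples / sampling_rate


print(seconds_to_samples(4))      # 64000
print(samples_to_seconds(64600))  # 4.0375
```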

3.2 Augment Transform (Training only)

augment_transform:
  - type: "rawboost"
    noise_ratio: 0.4    # 40% chance to apply
    algo: 5             # Algorithm variant (0-8)

  - type: "rir"
    noise_ratio: 0.5
    csv_file: "./data/rirs.csv"
    # csv_file format: CSV with a 'path' column containing absolute paths to audio files.

When you list multiple augmentations, they are applied sequentially (one after another).
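
One consequence of sequential application is worth working through. Assuming each noise_ratio is an independent per-sample probability (an assumption, based on the "40% chance to apply" comment), the share of samples that pass through untouched is the product of the individual skip probabilities.

```python
def chance_untouched(noise_ratios):
    """Probability that none of the listed augmentations fires."""
    p = 1.0
    for ratio in noise_ratios:
        p *= 1.0 - ratio
    return p


ratios = [0.4, 0.5]  # rawboost, rir from the config above
print(round(chance_untouched(ratios), 3))     # 0.3
print(round(ratios[0] * ratios[1], 3))        # 0.2  (both applied)
print(round(1 - chance_untouched(ratios), 3)) # 0.7  (at least one applied)
```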

Available Augmentations:

| Type | What it does | Key Parameters |
|---|---|---|
| rawboost | Adds convolutive/impulsive noise | algo (0-8), noise_ratio |
| rir | Room impulse response (reverb) | csv_file, noise_ratio |
| add_noise | Additive background noise | csv_file, snr_low, snr_high |
| add_babble | Mix multiple speakers | csv_file, speaker_count |
| speed_perturb | Change speed/pitch | speeds: [90, 100, 110] |
| codec | Compression artifacts | noise_ratio |
| drop_chunk | Zero out time segments | drop_length_low/high |
| drop_freq | Apply notch filters | drop_freq_low/high |

Using an Augmentation Pipeline (advanced):

For more control, wrap augmentations in a pipeline:

augment_transform:
  - type: "augmentation_pipeline"
    mode: "parallel"        # Pick ONE random augmentation
    concat_original: false  # Don't keep original
    p: 0.5                  # 50% chance to augment
    transforms:
      - type: "rawboost"
        noise_ratio: 1.0
      - type: "rir"
        noise_ratio: 1.0

| Mode | Behavior |
|---|---|
| parallel | Randomly pick one transform |
| sequential | Apply all transforms in order |
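
The two modes can be sketched in a few lines. This is an illustration of the behavior described in the table, not DeepFense's implementation: `run_pipeline` and the tagging transforms are hypothetical, and concat_original is left out.

```python
import random


def run_pipeline(audio, transforms, mode="parallel", p=0.5, rng=random):
    """Apply callables to audio the way the two modes are described above."""
    if rng.random() >= p:
        return audio  # pipeline skipped for this sample
    if mode == "parallel":
        return rng.choice(transforms)(audio)  # exactly one transform
    if mode == "sequential":
        for transform in transforms:  # every transform, in order
            audio = transform(audio)
        return audio
    raise ValueError(f"unknown mode: {mode}")


# Stand-in transforms that only tag the input, so the effect is visible.
def add_rawboost(audio):
    return audio + ["rawboost"]


def add_rir(audio):
    return audio + ["rir"]


print(run_pipeline([], [add_rawboost, add_rir], mode="sequential", p=1.0))
# ['rawboost', 'rir']
```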

4. Model Configuration

model:
  type: "StandardDetector"

  frontend: {...}
  backend: {...}
  loss: [...]

The model has three components:

Audio → [Frontend] → Features → [Backend] → Embeddings → [Loss] → Score

4.1 Frontend (Feature Extractor)

The frontend converts raw audio into high-level features.

frontend:
  type: "wavlm"
  args:
    source: "unil"
    ckpt_path: "./models/WavLM-Large.pt"
    freeze: True

Available Frontends:

| Type | Source | Description | Output Dim |
|---|---|---|---|
| wav2vec2 | fairseq / huggingface | Meta's Wav2Vec 2.0 | 768 (base), 1024 (large) |
| wavlm | unil / huggingface | Microsoft's WavLM | 768 (base), 1024 (large) |
| hubert | fairseq / huggingface | Meta's HuBERT | 768 (base), 1024 (large) |
| mert | huggingface | Music foundation model | 768-1024 |
| eat | huggingface | Efficient Audio Transformer | 768 |

Key Parameters:

| Parameter | What it does |
|---|---|
| source | Where to load the model from. "huggingface" = HuggingFace Hub, "fairseq" = local .pt file. DeepFense handles the differing forward-pass logic automatically |
| ckpt_path | Path or HuggingFace model ID |
| freeze | If True, frontend weights are not trained (recommended when you only want to train the backend; it also lowers GPU memory use) |

Examples:

# HuggingFace WavLM
frontend:
  type: "wavlm"
  args:
    source: "huggingface"
    ckpt_path: "microsoft/wavlm-base"
    freeze: True

# Local Fairseq Wav2Vec2
frontend:
  type: "wav2vec2"
  args:
    source: "fairseq"
    ckpt_path: "/path/to/wav2vec_large.pt"
    freeze: True

4.2 Backend (Classifier)

The backend takes features and produces a fixed-size embedding.

backend:
  type: "AASIST"
  args:
    input_dim: 1024       # Must match frontend output
    filts: [70, [1, 32], [32, 32], [32, 64], [64, 64]]
    gat_dims: [64, 32]

Available Backends:

| Type | Description | Best For |
|---|---|---|
| AASIST | Graph Attention Network | State-of-the-art spoofing detection |
| ECAPA_TDNN | Speaker verification architecture | Strong generalization |
| Nes2Net | Res2Net-based CNN | Efficient, good performance |
| RawNet2 | CNN + GRU | Classic architecture |
| MLP | Simple feedforward | Quick experiments, SSL frontends |

Important: input_dim must match your frontend's output dimension!

| Frontend | Output Dim |
|---|---|
| *-base models | 768 |
| *-large models | 1024 |
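
A mismatch here usually only surfaces as a shape error at the first forward pass, so a quick check on the parsed config can save time. The sketch below assumes the model section has been loaded into a dictionary and that checkpoint names contain "base" or "large"; both helpers are hypothetical.

```python
def expected_frontend_dim(ckpt_path):
    """Guess the output size from the checkpoint name (sizes from the table above)."""
    name = ckpt_path.lower()
    if "large" in name:
        return 1024
    if "base" in name:
        return 768
    return None  # unknown size: nothing to compare against


def check_input_dim(model_cfg):
    """Raise ValueError if backend input_dim contradicts the frontend size."""
    expected = expected_frontend_dim(model_cfg["frontend"]["args"]["ckpt_path"])
    actual = model_cfg["backend"]["args"]["input_dim"]
    if expected is not None and expected != actual:
        raise ValueError(f"input_dim is {actual}, frontend outputs {expected}")
    return actual


model_cfg = {
    "frontend": {"type": "wavlm", "args": {"ckpt_path": "microsoft/wavlm-large"}},
    "backend": {"type": "AASIST", "args": {"input_dim": 1024}},
}
print(check_input_dim(model_cfg))  # 1024
```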

MLP Backend (simplest):

backend:
  type: "MLP"
  args:
    input_dim: 768
    projection: [256, 64]    # Hidden layers
    pooling_type: "mean"     # mean, max, or asp (attentive)
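
To show what pooling_type controls, here is a minimal sketch of mean and max pooling over the time axis, using nested lists in place of tensors. `pool_frames` is illustrative only; attentive pooling (asp) learns its weights and is not modeled here.

```python
def pool_frames(frames, pooling_type="mean"):
    """Collapse a (time, dim) list of feature frames into one dim-sized vector."""
    columns = list(zip(*frames))  # one tuple per feature dimension
    if pooling_type == "mean":
        return [sum(col) / len(col) for col in columns]
    if pooling_type == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unsupported pooling_type: {pooling_type}")


frames = [[1.0, 4.0], [3.0, 0.0]]  # 2 time steps, 2 feature dims
print(pool_frames(frames, "mean"))  # [2.0, 2.0]
print(pool_frames(frames, "max"))   # [3.0, 4.0]
```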

4.3 Loss Functions

The loss module computes the training objective and produces scores for evaluation.

loss:
  - type: "CrossEntropy"
    weight: 1.0
    embedding_dim: 64
    n_classes: 2

Available Losses:

| Type | Description | When to Use |
|---|---|---|
| CrossEntropy | Standard classification | Simple baseline, quick experiments |
| OCSoftmax | One-Class Softmax | Best for spoofing (compacts bonafide, pushes spoof) |
| AMSoftmax | Additive Margin Softmax | Good generalization, angular margin |
| ASoftmax | Angular Softmax | Alternative angular metric |

Multi-Loss Training:

You can combine multiple losses with weights:

loss:
  - type: "CrossEntropy"
    weight: 0.5
    embedding_dim: 64
    n_classes: 2

  - type: "AMSoftmax"
    weight: 0.5
    embedding_dim: 64
    n_classes: 2
    s: 30.0
    m: 0.4

The total loss = 0.5 × CrossEntropy + 0.5 × AMSoftmax
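
In code, the combination is a weighted sum over the loss list. The sketch below uses made-up per-loss values and a hypothetical `total_loss` helper to show the arithmetic.

```python
def total_loss(loss_values, loss_cfgs):
    """Weighted sum of per-loss values, using each entry's `weight`."""
    return sum(cfg["weight"] * value for cfg, value in zip(loss_cfgs, loss_values))


loss_cfgs = [
    {"type": "CrossEntropy", "weight": 0.5},
    {"type": "AMSoftmax", "weight": 0.5},
]
# Illustrative values: CrossEntropy = 1.0, AMSoftmax = 3.0 for one batch.
print(total_loss([1.0, 3.0], loss_cfgs))  # 2.0
```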

Loss Parameters:

| Parameter | Description |
|---|---|
| embedding_dim | Must match backend output dimension |
| n_classes | Number of classes (usually 2: bonafide/spoof) |
| weight | Contribution to total loss |
| s (AMSoftmax) | Scale factor (typical: 30) |
| m (AMSoftmax) | Margin (typical: 0.2-0.5) |
| w_posi, w_nega (OCSoftmax) | Margins for positive/negative class |
| alpha (OCSoftmax) | Scale factor (typical: 20) |

OCSoftmax (Recommended for Spoofing):

loss:
  - type: "OCSoftmax"
    weight: 1.0
    embedding_dim: 64
    w_posi: 0.9
    w_nega: 0.2
    alpha: 20.0
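
To give a feel for what these three numbers do, here is a sketch of the OC-Softmax objective as it is usually formulated: the cosine similarity between an embedding and a learned center is pushed above one margin for bonafide and below another for spoof. Mapping w_posi to the bonafide margin and w_nega to the spoof margin is an assumption, and the cosine scores are passed in directly rather than computed from embeddings.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def oc_softmax_loss(cos_scores, is_bonafide, w_posi=0.9, w_nega=0.2, alpha=20.0):
    """Mean OC-Softmax-style loss over a batch of cosine scores in [-1, 1]."""
    terms = []
    for score, real in zip(cos_scores, is_bonafide):
        # Bonafide: penalize scores below w_posi. Spoof: penalize scores above w_nega.
        gap = (w_posi - score) if real else (score - w_nega)
        terms.append(softplus(alpha * gap))
    return sum(terms) / len(terms)


good = oc_softmax_loss([0.95, -0.5], [True, False])  # well separated
bad = oc_softmax_loss([-0.5, 0.95], [True, False])   # scores swapped
print(good < bad)  # True
```

A larger alpha makes the penalty rise more steeply once a score is on the wrong side of its margin.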

5. Training Configuration

training:
  trainer: "StandardTrainer"
  epochs: 50
  device: "cuda"
  num_workers: 4

5.1 Basic Settings

| Parameter | What it does | Typical Values |
|---|---|---|
| epochs | Number of training epochs | 30-100 |
| device | Where to run | "cuda", "cpu", "cuda:0" |
| num_workers | DataLoader workers | 4-8 |

5.2 Logging

batch_log_interval: 50   # Log every 50 batches

5.3 Evaluation

eval_every_epochs: 1     # Validate every epoch
# OR
eval_every_steps: 500    # Validate every 500 steps

monitor_metric: "EER"    # Metric to track for best model
monitor_mode: "min"      # "min" = lower is better, "max" = higher is better
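
The pairing of monitor_metric and monitor_mode boils down to a comparison against the best value seen so far. A minimal sketch, with a hypothetical `is_improvement` helper and made-up EER values:

```python
def is_improvement(current, best, monitor_mode="min"):
    """True if `current` beats `best` under the given mode."""
    if best is None:
        return True  # first evaluation always counts
    if monitor_mode == "min":
        return current < best
    if monitor_mode == "max":
        return current > best
    raise ValueError(f"unknown monitor_mode: {monitor_mode}")


best = None
for eer in [0.12, 0.08, 0.09]:  # validation EER after each evaluation
    if is_improvement(eer, best, "min"):
        best = eer  # this is the point where a best checkpoint would be saved
print(best)  # 0.08
```

For metrics where higher is better, such as ACC, the mode would be "max".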

5.4 Optimizer

optimizer:
  type: "adam"
  lr: 0.0001
  weight_decay: 0.0001

| Type | When to Use |
|---|---|
| adam | Default choice, works well for most cases |
| adamw | Adam with decoupled weight decay |
| sgd | Sometimes better for fine-tuning |

Learning Rate Tips:
- Frozen frontend: lr: 0.0001 to 0.00001
- End-to-end training: lr: 0.00001 to 0.000001
- If loss is NaN: reduce learning rate

5.5 Scheduler

scheduler:
  type: "cosine_annealing"
  T_max: 50              # Should equal epochs
  eta_min: 0.000001      # Minimum LR

| Type | Behavior |
|---|---|
| cosine_annealing | Smoothly decreases LR following cosine curve |
| step_lr | Drops LR by gamma every step_size epochs |
| exponential | Multiplies LR by gamma each epoch |
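
To see what the cosine schedule does with the values above, the sketch below evaluates the standard closed-form cosine annealing curve. It assumes the scheduler steps once per epoch and starts from the optimizer's lr of 0.0001; `cosine_annealing_lr` is an illustrative function, not a DeepFense API.

```python
import math


def cosine_annealing_lr(epoch, base_lr=0.0001, t_max=50, eta_min=0.000001):
    """Learning rate at `epoch` on a cosine curve from base_lr down to eta_min."""
    cos_term = (1 + math.cos(math.pi * epoch / t_max)) / 2
    return eta_min + (base_lr - eta_min) * cos_term


for epoch in (0, 25, 50):
    print(epoch, f"{cosine_annealing_lr(epoch):.7f}")
# 0 0.0001000
# 25 0.0000505
# 50 0.0000010
```

This is why T_max should equal epochs: the curve reaches eta_min exactly at the last epoch.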

5.6 Metrics

metrics:
  ACC: {}
  F1_SCORE: {}
  EER: {}
  minDCF:
    Pspoof: 0.05
    Cmiss: 1
    Cfa: 1

| Metric | Description | Good Value |
|---|---|---|
| ACC | Accuracy | > 95% |
| F1_SCORE | F1 Score | > 0.95 |
| EER | Equal Error Rate | < 5% |
| minDCF | Minimum Detection Cost | < 0.1 |
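
Since EER is the default metric to monitor, it helps to see how it is derived from scores. The sketch below is a simple threshold sweep, assuming higher scores mean "more bonafide". `compute_eer` is illustrative; DeepFense's metric may interpolate between thresholds and give slightly different values.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Equal Error Rate, assuming higher scores mean 'more bonafide'."""
    best_gap, eer = None, None
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: bonafide scored below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoof scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


print(compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))            # 0.0
print(compute_eer([0.9, 0.8, 0.4, 0.7], [0.1, 0.2, 0.6, 0.3]))  # 0.25
```

In the second call, one of four bonafide clips scores below one of four spoof clips, so both error rates meet at 25%.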

6. Complete Example Configurations

Example 1: Quick Baseline

exp_name: "baseline"
output_dir: "./outputs/"
seed: 42

data:
  sampling_rate: 16000
  label_map: {"bonafide": 1, "spoof": 0}

  train:
    parquet_files: ["./data/train.parquet"]
    root_dir: "/data/asvspoof19"
    batch_size: 32
    shuffle: True
    base_transform:
      - type: "pad"
        max_len: 64600

  val:
    parquet_files: ["./data/val.parquet"]
    root_dir: "/data/asvspoof19"
    batch_size: 64
    base_transform:
      - type: "pad"
        max_len: 64600

model:
  type: "StandardDetector"
  frontend:
    type: "wav2vec2"
    args:
      source: "huggingface"
      ckpt_path: "facebook/wav2vec2-base"
      freeze: True
  backend:
    type: "MLP"
    args:
      input_dim: 768
      projection: [256, 64]
      pooling_type: "mean"
  loss:
    - type: "CrossEntropy"
      embedding_dim: 64
      n_classes: 2

training:
  epochs: 20
  device: "cuda"
  optimizer:
    type: "adam"
    lr: 0.0001
  metrics:
    EER: {}
    ACC: {}

Example 2: State-of-the-Art Setup

exp_name: "wavlm_aasist_oc"
output_dir: "./outputs/"
seed: 1234

data:
  sampling_rate: 16000
  label_map: {"bonafide": 1, "spoof": 0}

  train:
    parquet_files: ["./data/train.parquet"]
    root_dir: "/data/asvspoof"
    batch_size: 24
    shuffle: True
    base_transform:
      - type: "pad"
        max_len: 64600
        random_pad: True
    augment_transform:
      - type: "augmentation_pipeline"
        mode: "parallel"
        p: 0.5
        transforms:
          - type: "rawboost"
            noise_ratio: 1.0
            algo: 5
          - type: "rir"
            noise_ratio: 1.0
            csv_file: "./data/rirs.csv"
          - type: "add_noise"
            noise_ratio: 1.0
            csv_file: "./data/noise.csv"
            snr_low: 5
            snr_high: 20

  val:
    parquet_files: ["./data/val.parquet"]
    root_dir: "/data/asvspoof"
    batch_size: 48
    base_transform:
      - type: "pad"
        max_len: 64600

model:
  type: "StandardDetector"
  frontend:
    type: "wavlm"
    args:
      source: "huggingface"
      ckpt_path: "microsoft/wavlm-large"
      freeze: True
  backend:
    type: "AASIST"
    args:
      input_dim: 1024
      filts: [70, [1, 32], [32, 32], [32, 64], [64, 64]]
      gat_dims: [64, 32]
  loss:
    - type: "OCSoftmax"
      embedding_dim: 32
      w_posi: 0.9
      w_nega: 0.2
      alpha: 20.0

training:
  epochs: 50
  device: "cuda"
  num_workers: 4
  eval_every_epochs: 1
  monitor_metric: "EER"
  monitor_mode: "min"
  optimizer:
    type: "adam"
    lr: 0.0001
    weight_decay: 0.0001
  scheduler:
    type: "cosine_annealing"
    T_max: 50
    eta_min: 0.000001
  metrics:
    EER: {}
    F1_SCORE: {}
    minDCF:
      Pspoof: 0.05

7. Testing Your Model

After training:

python test.py \
    --config deepfense/config/train.yaml \
    --checkpoint outputs/your_exp/best_model.pth

To test on a different dataset, update the data.test section in your config:

data:
  test:
    dataset_type: "StandardDataset"
    parquet_files: ["./data/eval.parquet"]
    batch_size: 64
    base_transform:
      - type: "pad"
        max_len: 64600

Then run the same test command -- it reads the test split from the config.


8. Common Issues & Solutions

| Problem | Solution |
|---|---|
| CUDA out of memory | Reduce batch_size, set freeze: True on frontend |
| Loss is NaN | Reduce lr (learning rate) |
| EER stuck at 50% | Model not learning: check data paths, try different lr |
| FileNotFoundError | Check parquet_files paths, use root_dir for relative paths |
| KeyError: bonafide | Check label_map matches your parquet's label column |

Summary Cheatsheet

| Section | Key Parameters |
|---|---|
| Data | parquet_files, root_dir, batch_size, base_transform |
| Frontend | type, ckpt_path, freeze |
| Backend | type, input_dim (must match frontend output) |
| Loss | type, embedding_dim (must match backend output), weight |
| Training | epochs, lr, monitor_metric |

Quick Start Command:

python train.py --config deepfense/config/train.yaml