Frontend Components

Frontends convert raw audio waveforms into high-level features or representations.

Available Frontends

1. Wav2Vec2 (`wav2vec2`)

Supports both Fairseq (original) and HuggingFace implementations.

Configuration Signature:

frontend:
  type: wav2vec2
  args:
    source: string
    ckpt_path: string
    freeze: bool

Parameters:

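source - (str) Implementation to load: "fairseq" (default) or "huggingface".
ckpt_path - (str) For Fairseq, the path to a .pt checkpoint file. For HuggingFace, a model ID (e.g., facebook/wav2vec2-base) or a local path.
freeze - (bool) Whether to freeze the frontend's weights so they are not updated during training.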

Example:

frontend:
  type: wav2vec2
  args:
    source: huggingface
    ckpt_path: facebook/wav2vec2-base
    freeze: True

2. WavLM (`wavlm`)

Supports both Microsoft (original) and HuggingFace implementations.

Configuration Signature:

frontend:
  type: wavlm
  args:
    source: string
    ckpt_path: string
    freeze: bool

Parameters:

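source - (str) Implementation to load: "unil" (default, the original Microsoft implementation) or "huggingface".
ckpt_path - (str) For Microsoft, the path to a .pt checkpoint file. For HuggingFace, a model ID (e.g., microsoft/wavlm-base) or a local path.
freeze - (bool) Whether to freeze the frontend's weights so they are not updated during training.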

Example:

frontend:
  type: wavlm
  args:
    source: huggingface
    ckpt_path: microsoft/wavlm-base
    freeze: True

3. HuBERT (`hubert`)

Supports both Fairseq (original) and HuggingFace implementations.

Configuration Signature:

frontend:
  type: hubert
  args:
    source: string
    ckpt_path: string
    freeze: bool

Parameters:

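source - (str) Implementation to load: "fairseq" (default) or "huggingface".
ckpt_path - (str) For Fairseq, the path to a .pt checkpoint file. For HuggingFace, a model ID.
freeze - (bool) Whether to freeze the frontend's weights so they are not updated during training.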

Example:

frontend:
  type: hubert
  args:
    source: fairseq
    ckpt_path: /path/to/hubert_large_ll60k.pt
    freeze: True

4. MERT (`mert`)

A model pre-trained on music audio. It is specialized for music but can also be useful for general audio.

Configuration Signature:

frontend:
  type: mert
  args:
    ckpt_path: string
    trust_remote_code: bool
    freeze: bool

Parameters:

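source - (str) "huggingface" (default and the only documented value, so it can be omitted as in the signature above).
ckpt_path - (str) HuggingFace model ID (e.g., m-a-p/MERT-v1-95M) or a local path.
trust_remote_code - (bool) Whether to allow loading custom model code shipped with the checkpoint; needed for some MERT versions (default: True).
freeze - (bool) Whether to freeze the frontend's weights so they are not updated during training.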

Example:

frontend:
  type: mert
  args:
    ckpt_path: m-a-p/MERT-v1-95M
    freeze: True

5. EAT (`eat`)

Efficient Audio Transformer. It extracts filterbank (Fbank) features internally, so it still takes raw waveforms as input.

Configuration Signature:

frontend:
  type: eat
  args:
    ckpt_path: string
    trust_remote_code: bool
    freeze: bool

Parameters:

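source - (str) "huggingface" (default and the only documented value, so it can be omitted as in the signature above).
ckpt_path - (str) HuggingFace model ID (e.g., worstchan/EAT-large_epoch20_pretrain).
trust_remote_code - (bool) Whether to allow loading custom model code shipped with the checkpoint (default: True).
freeze - (bool) Whether to freeze the frontend's weights so they are not updated during training.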

Example:

frontend:
  type: eat
  args:
    ckpt_path: worstchan/EAT-large_epoch20_pretrain
    freeze: True
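
The `source` values and defaults documented for the five pretrained frontends above can be collected into one lookup table. The sketch below is illustrative only: `ALLOWED_SOURCES`, `REMOTE_CODE_FRONTENDS`, and `normalize_frontend_config` are hypothetical names, not part of the library, and the function assumes the YAML has already been parsed into a plain dict. It fills in the documented defaults and rejects a `source` that is not listed for the given frontend.

```python
# Documented `source` values per frontend type; the first entry is the documented default.
ALLOWED_SOURCES = {
    "wav2vec2": ("fairseq", "huggingface"),
    "wavlm": ("unil", "huggingface"),
    "hubert": ("fairseq", "huggingface"),
    "mert": ("huggingface",),
    "eat": ("huggingface",),
}

# Frontends documented with a `trust_remote_code` argument (default True).
REMOTE_CODE_FRONTENDS = ("mert", "eat")


def normalize_frontend_config(cfg: dict) -> dict:
    """Hypothetical helper: fill documented defaults and reject unknown values."""
    frontend = cfg["frontend"]
    ftype = frontend["type"]
    if ftype not in ALLOWED_SOURCES:
        raise ValueError(f"frontend type not covered by this sketch: {ftype!r}")

    args = dict(frontend.get("args") or {})  # copy so the caller's dict is not mutated
    allowed = ALLOWED_SOURCES[ftype]
    source = args.setdefault("source", allowed[0])
    if source not in allowed:
        raise ValueError(f"{ftype}: source must be one of {allowed}, got {source!r}")
    if ftype in REMOTE_CODE_FRONTENDS:
        args.setdefault("trust_remote_code", True)
    return {"frontend": {"type": ftype, "args": args}}


# The EAT example above, as a parsed dict.
example = {
    "frontend": {
        "type": "eat",
        "args": {"ckpt_path": "worstchan/EAT-large_epoch20_pretrain", "freeze": True},
    }
}
print(normalize_frontend_config(example)["frontend"]["args"])
```

The table covers only the frontends that take a `source`; `mel_spec` has no checkpoint and is deliberately left out.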

6. Mel Spectrogram (`mel_spec`)

Standard Mel-spectrogram extraction using torchaudio.

Configuration Signature:

frontend:
  type: mel_spec
  args:
    n_fft: int
    hop_length: int
    n_mels: int

Parameters:

n_fft - (int) FFT size in samples (default: 1024).
hop_length - (int) Hop between successive frames, in samples (default: 160).
n_mels - (int) Number of mel filterbank bands (default: 80).

Example:

frontend:
  type: mel_spec
  args:
    n_fft: 1024
    hop_length: 160
    n_mels: 80
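
To make the effect of `hop_length` concrete, the following sketch computes how many frames the default settings produce. It assumes a 16 kHz sample rate (not stated in this document), a window length equal to `n_fft`, and torchaudio's default `center=True` padding; `mel_num_frames` is a hypothetical helper written only for this illustration.

```python
def mel_num_frames(num_samples: int, n_fft: int = 1024,
                   hop_length: int = 160, center: bool = True) -> int:
    """Number of STFT frames for a waveform of `num_samples` samples."""
    if center:
        # The signal is padded by n_fft // 2 on each side, so n_fft cancels out.
        return 1 + num_samples // hop_length
    # Without padding, only windows that fit entirely inside the signal count.
    return 1 + (num_samples - n_fft) // hop_length


sample_rate = 16_000  # assumption: 16 kHz audio

print(mel_num_frames(sample_rate))                # one second of audio -> 101 frames
print(mel_num_frames(sample_rate, center=False))  # -> 94 frames
print(1000 * 160 / sample_rate)                   # hop of 160 samples -> 10.0 ms
```

Under these assumptions, one second of audio yields a Time dimension of 101 and a Freq dimension of 80 in the frontend's output.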

Input/Output

Input: Raw waveform Tensor of shape (Batch, Time).
Output: Feature Tensor. The shape depends on the frontend:
- Transformer-based frontends (Wav2Vec2, WavLM, etc.): (Batch, Time, Dim)
- Spectrograms (MelSpec): (Batch, Channels, Freq, Time). Note: this output is 4D, so backends that expect 3D (Batch, Time, Dim) input must be adapted to handle it.
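
The shapes above can be illustrated with a small, dependency-free sketch. It makes two assumptions that this document does not confirm: the transformer frontend uses the convolutional feature-encoder layout of the base wav2vec 2.0 architecture (seven layers, listed as kernel and stride pairs), and a backend adapts 4D spectrograms by folding Channels and Freq into a single feature dimension. Both helper functions are hypothetical.

```python
# (kernel, stride) of each conv layer in the base wav2vec 2.0 feature encoder.
# Assumption: the loaded checkpoint uses this layout.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def transformer_num_frames(num_samples: int) -> int:
    """Length of the Time axis after the unpadded conv feature encoder."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length


def flatten_spectrogram_shape(shape: tuple) -> tuple:
    """Map (Batch, Channels, Freq, Time) to a 3D (Batch, Time, Channels * Freq)."""
    batch, channels, freq, time = shape
    return (batch, time, channels * freq)


# 16,000 samples (one second at an assumed 16 kHz) -> 49 frames.
print(transformer_num_frames(16_000))

# A 4D mel output with one channel, 80 bands, and 101 frames -> (4, 101, 80).
print(flatten_spectrogram_shape((4, 1, 80, 101)))
```

On tensors, the second helper corresponds to a permute followed by a reshape. Whether that is the right adaptation depends on the backend.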

Next Step: Backends →
