Frontend Components
Frontends are responsible for converting raw audio waveforms into high-level features or representations.
Available Frontends
1. Wav2Vec2 (wav2vec2)
Supports both Fairseq (original) and HuggingFace implementations.
Configuration Signature:
Parameters:
- source - (str) "fairseq" (default) or "huggingface".
- ckpt_path - (str) Fairseq: Path to .pt file. HuggingFace: ID (e.g., facebook/wav2vec2-base) or path.
- freeze - (bool) Whether to freeze weights.
Example:
2. WavLM (wavlm)
Supports both Microsoft (original) and HuggingFace implementations.
Configuration Signature:
Parameters:
- source - (str) "unil" (default) or "huggingface".
- ckpt_path - (str) Microsoft: Path to .pt file. HuggingFace: ID (e.g., microsoft/wavlm-base) or path.
- freeze - (bool) Whether to freeze weights.
Example:
3. HuBERT (hubert)
Supports both Fairseq (original) and HuggingFace implementations.
Configuration Signature:
Parameters:
- source - (str) "fairseq" (default) or "huggingface".
- ckpt_path - (str) Fairseq: Path to .pt file. HuggingFace: Model ID.
- freeze - (bool) Whether to freeze weights.
Example:
4. MERT (mert)
Music Audio Pre-training model, specialized for music but useful for general audio.
Configuration Signature:
Parameters:
- source - (str) "huggingface" (default).
- ckpt_path - (str) HF ID (e.g., m-a-p/MERT-v1-95M) or path.
- trust_remote_code - (bool) Needed for some MERT versions (default True).
- freeze - (bool) Whether to freeze weights.
Example:
5. EAT (eat)
Efficient Audio Transformer. It computes filterbank (Fbank) features internally, so it takes the raw waveform directly.
Configuration Signature:
Parameters:
- source - (str) "huggingface" (default).
- ckpt_path - (str) HF ID (e.g., worstchan/EAT-large_epoch20_pretrain).
- trust_remote_code - (bool) (default True).
- freeze - (bool) Whether to freeze weights.
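The source options and defaults listed for the five pretrained frontends can be collected in one place. The following is a minimal sketch, not the project's actual configuration code: the helper name `resolve_frontend_config` and the two lookup tables are hypothetical, and only the option values and defaults come from the parameter lists above.

```python
# Hypothetical helper: these names are illustrative, not the project's API.
# Only the allowed `source` values and the defaults come from the parameter lists.

# frontend name -> (allowed sources, default source)
FRONTEND_SOURCES = {
    "wav2vec2": (("fairseq", "huggingface"), "fairseq"),
    "wavlm": (("unil", "huggingface"), "unil"),
    "hubert": (("fairseq", "huggingface"), "fairseq"),
    "mert": (("huggingface",), "huggingface"),
    "eat": (("huggingface",), "huggingface"),
}

# Frontends documented with a `trust_remote_code` option (default True).
REMOTE_CODE_FRONTENDS = {"mert", "eat"}


def resolve_frontend_config(name, ckpt_path, **overrides):
    """Return a config dict with the documented defaults filled in."""
    if name not in FRONTEND_SOURCES:
        raise ValueError(f"unknown frontend: {name!r}")
    allowed, default_source = FRONTEND_SOURCES[name]

    config = {"source": default_source, "ckpt_path": ckpt_path}
    if name in REMOTE_CODE_FRONTENDS:
        config["trust_remote_code"] = True
    # Explicit settings (source, freeze, ...) win over the defaults.
    config.update(overrides)

    if config["source"] not in allowed:
        raise ValueError(
            f"{name} does not support source={config['source']!r}; "
            f"expected one of {allowed}"
        )
    return config
```

For instance, `resolve_frontend_config("wavlm", "microsoft/wavlm-base", source="huggingface", freeze=True)` yields a dict with the HuggingFace source selected, while an unsupported combination such as `source="fairseq"` for `eat` raises a `ValueError`.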
Example:
6. Mel Spectrogram (mel_spec)
Standard Mel-spectrogram extraction using torchaudio.
Configuration Signature:
Parameters:
- n_fft - (int) FFT window size (default: 1024).
- hop_length - (int) Hop length (default: 160).
- n_mels - (int) Number of mel bands (default: 80).
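To get a feel for what these defaults produce, the number of output frames can be estimated from the waveform length. This is a rough sketch assuming torchaudio's default centered STFT (`center=True`), where the frame count is `1 + num_samples // hop_length`. The function name is hypothetical, and the 16 kHz sample rate is an assumption for illustration that is not stated above.

```python
def mel_output_shape(num_samples, hop_length=160, n_mels=80):
    """Estimate the (Freq, Time) size of a mel spectrogram for one waveform.

    Assumes a centered STFT (torchaudio's default), which yields
    1 + num_samples // hop_length frames.
    """
    n_frames = 1 + num_samples // hop_length
    return n_mels, n_frames


# One second of audio at an assumed 16 kHz sample rate.
print(mel_output_shape(16000))  # (80, 101)
```

At 16 kHz, a hop of 160 samples corresponds to a 10 ms frame shift, and an FFT size of 1024 spans 64 ms of audio.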
Example:
Input/Output
- Input: Raw waveform Tensor of shape (Batch, Time).
- Output: Feature Tensor. Shape depends on the frontend:
  - Transformers (Wav2Vec2, WavLM, etc.): (Batch, Time, Dim)
  - Spectrograms (MelSpec): (Batch, Channels, Freq, Time)
    - Note: Backends need adjustment to handle 4D input.
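The Time axis of a Transformer frontend's output counts frames, not input samples. As a rough sketch, the frame count can be derived from the convolutional feature encoder. The kernel sizes and strides below are those of the base Wav2Vec2 architecture and are assumed here for illustration (other checkpoints and frontends may differ). The function name is hypothetical.

```python
# Kernel sizes and strides of the Wav2Vec2-base convolutional feature encoder.
# Assumed for illustration; check the configuration of the checkpoint you load.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Length of the Time axis after a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio gives 49 frames, roughly one frame per 20 ms.
print(num_output_frames(16000))  # 49
```

Under these assumptions, a batch of one-second 16 kHz waveforms of shape (Batch, 16000) would come out as (Batch, 49, Dim).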
Next Step: Backends →