Technology · Apr 28, 2026

MOSS-Audio: 8B Parameters Challenge 30B, New Benchmark for Open-Source Audio Understanding Models


Understanding an audio clip goes far beyond simply "transcribing spoken words into text."

A real-world audio clip may simultaneously contain human speech, background ambience, music, emotional shifts, and even overlapping multi-party conversations. A truly usable audio understanding system needs to simultaneously identify who is speaking, detect emotional states, interpret background sounds, analyze musical content, and answer time-aware questions like "What did the speaker say at the 2-minute mark?"

In April 2026, the OpenMOSS team, in collaboration with MOSI.AI and Shanghai Science and Technology Innovation Engine Co., Ltd., released MOSS-Audio—an open-source audio understanding model that unifies speech, environmental sound, music comprehension, and time-aware reasoning into a single foundation model.

MOSS-Audio-8B outperforms 30B-class models with several times its parameter count across multiple benchmarks, with a particularly striking advantage in timestamp ASR tasks.

Model Family

Four variants were released at launch, all built on the Qwen3 language model backbone:

Model                    LLM Backbone   Total Parameters   Optimization Direction
MOSS-Audio-4B-Instruct   Qwen3-4B       ~4.6B              Direct instruction following
MOSS-Audio-4B-Thinking   Qwen3-4B       ~4.6B              Chain-of-thought (CoT) reasoning
MOSS-Audio-8B-Instruct   Qwen3-8B       ~8.6B              Direct instruction following
MOSS-Audio-8B-Thinking   Qwen3-8B       ~8.6B              Chain-of-thought (CoT) reasoning

The Instruct variants are designed for direct instruction following, producing structured, predictable outputs suitable for integration into production pipelines. The Thinking variants are trained with chain-of-thought reasoning and reinforcement learning, delivering stronger performance on multi-step reasoning tasks.

Architecture Deep Dive

Overall Architecture

MOSS-Audio adopts a modular three-stage design: audio encoder → modality adapter → language model backbone. Raw audio is encoded into a continuous temporal representation at 12.5 Hz, projected into the LLM embedding space, and then processed through autoregressive text generation.

MOSS-Audio Overall Architecture
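
To make the flow concrete, here is a minimal PyTorch sketch of the three-stage design. Every module name and dimension below is illustrative, a stand-in rather than the released implementation:

import torch
import torch.nn as nn

class TinyAudioEncoder(nn.Module):
    """Stand-in for the custom encoder: folds a 16 kHz waveform into 12.5 Hz frames."""
    def __init__(self, d_audio=256, sample_rate=16000, frame_hz=12.5):
        super().__init__()
        self.hop = int(sample_rate / frame_hz)            # 1280 samples per frame
        self.proj = nn.Linear(self.hop, d_audio)

    def forward(self, wav):                               # wav: (B, num_samples)
        frames = wav.unfold(1, self.hop, self.hop)        # (B, T, hop) at 12.5 Hz
        return self.proj(frames)                          # (B, T, d_audio)

class Pipeline(nn.Module):
    def __init__(self, d_audio=256, d_model=512):
        super().__init__()
        self.encoder = TinyAudioEncoder(d_audio)
        self.adapter = nn.Linear(d_audio, d_model)        # modality adapter into LLM space
        self.embed = nn.Embedding(32000, d_model)         # LLM token embeddings
        self.llm = nn.TransformerEncoder(                 # toy stand-in for the Qwen3 backbone
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)

    def forward(self, wav, text_ids):
        audio = self.adapter(self.encoder(wav))           # (B, T_audio, d_model)
        text = self.embed(text_ids)                       # (B, T_text, d_model)
        return self.llm(torch.cat([audio, text], dim=1))  # joint sequence through the backbone

out = Pipeline()(torch.randn(1, 16000 * 4), torch.randint(0, 32000, (1, 8)))
print(out.shape)  # 4 s of audio -> 50 frames + 8 text tokens = (1, 58, 512)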

Custom Audio Encoder

Unlike many multimodal models that directly use off-the-shelf frontends (such as Wav2Vec2 or CLAP), MOSS-Audio trains a dedicated audio encoder from scratch. This design brings two key advantages: the encoder is jointly optimized across multiple acoustic domains—speech, environmental sounds, and music—avoiding the poor performance of off-the-shelf encoders in specialized domains; and the encoder trains more cohesively with the language model backbone, reducing the modality gap.

DeepStack Cross-Layer Feature Injection

This is the most noteworthy innovation in MOSS-Audio's architecture.

Traditional multimodal architectures typically pass only the encoder's top-layer output to the LLM, causing low-level acoustic details (prosody, transients, rhythm, timbre, background structure) to be lost during deep abstraction. MOSS-Audio introduces a DeepStack cross-layer injection module:

  • Selects features from early and middle layers of the encoder
  • Projects them independently and injects them directly into the early layers of the LLM
  • Preserves multi-granularity information from low-level acoustic details to high-level semantic abstractions

This design enables the model to maintain its semantic understanding capabilities without losing sensitivity to subtle acoustic cues—especially critical for music analysis, emotion recognition, and environmental sound classification.
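
A hedged sketch of the idea follows; the tap layers, the projection design, and the additive fusion rule are all assumptions chosen for illustration, not the actual module:

import torch
import torch.nn as nn

class DeepStackInjector(nn.Module):
    """Illustrative cross-layer injection: tapped encoder features are projected
    independently and added into the hidden states of early LLM layers."""
    def __init__(self, taps=(4, 8, 12), d_audio=256, d_model=512):
        super().__init__()
        self.taps = taps                                  # early/middle encoder layers to tap
        self.projs = nn.ModuleList(nn.Linear(d_audio, d_model) for _ in taps)

    def forward(self, encoder_hiddens, llm_hidden, layer_idx):
        # encoder_hiddens: {encoder_layer_index: (B, T, d_audio)}
        if layer_idx < len(self.taps):                    # only the first few LLM layers
            feats = encoder_hiddens[self.taps[layer_idx]]
            llm_hidden = llm_hidden + self.projs[layer_idx](feats)
        return llm_hidden

injector = DeepStackInjector()
hiddens = {l: torch.randn(1, 50, 256) for l in (4, 8, 12)}
h = torch.randn(1, 50, 512)                               # running LLM hidden state
for i in range(4):                                        # layers 0-2 get injections, layer 3 none
    h = injector(hiddens, h, i)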

Time-Aware Representation

Time awareness is the core dimension that distinguishes audio understanding from image understanding. During pre-training, MOSS-Audio inserts explicit time-marker tokens between audio frame representations at fixed time intervals.

The model natively learns "what happened when," supporting timestamp ASR, event localization, time-based question answering, and long-audio lookback—all without requiring additional localization heads or post-processing pipelines.
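
The sketch below shows the mechanism in miniature; the 2-second interval and the marker embedding are assumptions, not the model's actual configuration:

import torch

def insert_time_markers(frames, marker, frame_hz=12.5, interval_s=2.0):
    """Splice an explicit time-marker embedding between 12.5 Hz audio frames
    at fixed intervals. frames: (T, d); marker(t_seconds) -> (d,) embedding."""
    step = int(frame_hz * interval_s)          # frames between consecutive markers
    out, t = [], 0.0
    for i in range(0, frames.size(0), step):
        out.append(marker(t).unsqueeze(0))     # e.g. the embedding of a "<time=2.0s>" token
        out.append(frames[i:i + step])
        t += interval_s
    return torch.cat(out, dim=0)

frames = torch.randn(125, 512)                 # 10 s of audio at 12.5 Hz
marked = insert_time_markers(frames, lambda t: torch.randn(512))
print(marked.shape)                            # 125 frames + 5 markers -> torch.Size([130, 512])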

Benchmark Performance

General Audio Understanding

MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08 across four benchmarks:

General Audio Understanding Benchmark

Model                    Size   MMAU    MMAU-Pro   MMAR    MMSU    Average
MOSS-Audio-8B-Thinking   8B     77.33   64.92      66.53   75.52   71.08
Step-Audio-R1            33B    78.67   59.68      69.15   75.18   70.67
Qwen3-Omni-30B           30B    72.06   61.22      66.40   69.00   67.91
MOSS-Audio-4B-Thinking   4B     75.78   63.13      64.83   73.88   68.37

MOSS-Audio-4B-Thinking (68.37) already outperforms all open-source competitors in the 7B/9B size range. The 8B version surpasses the 33B Step-Audio-R1 on both MMAU-Pro and MMSU, as well as on the overall average.

Speech Captioning

MOSS-Audio-8B-Instruct achieves the highest average score of 3.7252 on speech captioning tasks, leading in 11 out of 13 fine-grained dimensions (gender, accent, pitch, volume, timbre, clarity, fluency, personality, etc.).

Speech Captioning Radar Chart

ASR Performance

MOSS-Audio-8B-Instruct leads with an overall CER (Character Error Rate) of 11.30. It stands out in the following challenging scenarios:

  • Dialect recognition: CER 8.76 (91.24% character accuracy)
  • Singing transcription: CER 9.81
  • Code-switching: CER 10.18
  • Non-speech vocalizations: CER 4.31

These results not only surpass traditional ASR models (Paraformer, Fun-ASR, SenseVoice) but also hold their own against larger multimodal models.

Timestamp ASR

This is where MOSS-Audio truly shines. Timestamp ASR measures a model's ability to transcribe audio while precisely annotating the timing of each word (lower scores are better):

Model                    AISHELL-1 (Chinese)   LibriSpeech (English)
MOSS-Audio-8B-Instruct   35.77                 131.61
MOSS-Audio-4B-Instruct   76.96                 358.13
Qwen3-Omni-30B           833.66                646.95
Gemini-3.1-Pro           708.24                871.19

MOSS-Audio-8B scores 35.77 on AISHELL-1, far surpassing Qwen3-Omni-30B's 833.66—a gap of over 23x. This advantage stems directly from the time-aware representation design: the model natively learns temporal alignment rather than relying on post-processing.
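
For readers unfamiliar with the task, a word-level timestamped transcript looks roughly like the record below; the field names are hypothetical, not MOSS-Audio's actual output schema:

# Each transcribed word carries predicted start/end times in seconds.
result = [
    {"word": "hello", "start": 0.48, "end": 0.81},
    {"word": "world", "start": 0.86, "end": 1.27},
]
# Benchmarks score both the transcript and how far these predicted times
# drift from reference alignments, hence lower scores are better.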

Core Capabilities

MOSS-Audio covers six major capabilities:

  1. Speech and Content Understanding — Precise transcription + word-level/sentence-level timestamp alignment
  2. Speaker/Emotion/Event Analysis — Identify speaker characteristics, analyze emotional states, detect key acoustic events
  3. Scene/Sound Cue Extraction — Infer context from background noise and environmental sounds
  4. Music Understanding — Analyze musical style, emotional progression, and instrumentation
  5. Audio QA and Summarization — Generate summaries and answer questions for podcasts, meetings, and interviews
  6. Complex Reasoning — Multi-hop reasoning through chain-of-thought

One model covering all scenarios—developers no longer need to stitch together multiple specialized models for different audio tasks.

Deployment and Fine-Tuning

Environment Setup

git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio
conda create -n moss-audio python=3.12 -y
conda activate moss-audio
conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"

Inference

python infer.py  # Default prompt: Describe this audio.

Gradio UI

python app.py

SGLang Service Deployment

git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang && pip install -e "python[all]"
sglang serve --model-path ./weights/MOSS-Audio --trust-remote-code
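
Once the server is up, a client can query it. The sketch below assumes the moss-audio fork keeps SGLang's standard OpenAI-compatible /v1 endpoint on its default port 30000 and accepts base64 audio as an "input_audio" content part; verify both against the repo's documentation:

import base64
from openai import OpenAI

# Assumed endpoint and payload format -- check the MOSS-Audio/SGLang docs.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
audio_b64 = base64.b64encode(open("clip.wav", "rb").read()).decode()

resp = client.chat.completions.create(
    model="MOSS-Audio",
    messages=[{"role": "user", "content": [
        {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        {"type": "text", "text": "What did the speaker say at the 2-minute mark?"},
    ]}],
)
print(resp.choices[0].message.content)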

Fine-Tuning

Official fine-tuning scripts (finetune/finetune.py) are provided, supporting both LoRA and full-parameter fine-tuning, with training data in a JSONL audio-text dialogue format, as sketched below.
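
A plausible shape for one training record follows; the field names are guesses for illustration, so check the finetune documentation for the real schema:

import json

# Hypothetical record layout for the JSONL audio-text dialogue format.
record = {
    "audio": "clips/meeting_001.wav",
    "conversations": [
        {"role": "user", "content": "<audio> Summarize the main decision in this clip."},
        {"role": "assistant", "content": "The team agreed to ship the beta on May 5."},
    ],
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one record per line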

Technical Deep Analysis

Why Can 8B Challenge 30B?

MOSS-Audio's efficiency advantage comes from three design choices.

Audio encoding efficiency. The self-trained encoder is optimized for 12.5 Hz temporal resolution. Compared to general-purpose encoders (such as Wav2Vec2's 50 Hz output), sequence length is compressed approximately 4x, significantly reducing the number of input tokens to the LLM.
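
In concrete numbers (a back-of-the-envelope check, not measured throughput):

# 60 s of audio: frame counts at 12.5 Hz vs. a 50 Hz general-purpose encoder.
seconds = 60
for name, hz in [("12.5 Hz (MOSS-Audio)", 12.5), ("50 Hz (Wav2Vec2-style)", 50)]:
    print(f"{name}: {int(seconds * hz)} frames")  # 750 vs 3000 -> ~4x shorter input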

Information density of cross-layer injection. The DeepStack design enables the LLM to receive multi-level features simultaneously, avoiding the problem in traditional architectures where the LLM needs to "learn from scratch" low-level acoustic features. It's equivalent to providing the LLM with pre-processed acoustic knowledge rather than raw encoded representations.

Native integration of time awareness. Time-marker tokens are embedded into the sequence from the pre-training stage onward, so time-aware capabilities are encoded directly into model weights, adding zero extra overhead during inference.

Why Is the Timestamp ASR Gap So Large?

The root cause of competitors' poor timestamp ASR performance lies in architectural design. Models like Qwen3-Omni rely on post-processing modules or additional localization heads to generate timestamps, essentially treating temporal alignment as a separate task. MOSS-Audio embeds time markers into the sequence from the pre-training stage—time awareness is a core capability, not an add-on.

This is similar to the gap between models with native multilingual support and those that understand through translation—the former establishes mappings at the foundational level, while the latter requires an extra conversion layer.

Apache 2.0 License

MOSS-Audio is open-sourced under the Apache License 2.0, which allows commercial use, modification, and distribution with no copyleft restrictions.

Conclusion

The release of MOSS-Audio marks a significant advancement in open-source audio understanding. An 8B-parameter model achieving performance that surpasses 30B models, along with order-of-magnitude advantages in timestamp ASR, demonstrates the core value of architectural innovation. Two key innovations—DeepStack cross-layer injection and time-aware representation—provide a reference for future audio-language model design.

With fine-tuning support and service deployment tools fully in place, MOSS-Audio now has a complete pipeline from research to production.

Source

This article was originally published on DEV Community by Garyvov.