Nemotron-3 Nano Omni × BibiGPT

NVIDIA released Nemotron-3 Nano Omni on 2026-04-28: a 30B-A3B Mamba2-Transformer MoE multimodal model with ~3B active parameters per token that jointly processes image, video, audio, and text. It landed Day-0 on Hugging Face under NVIDIA's Open Model Agreement with full commercial-use rights. BibiGPT routes long-form video understanding, long-context audio Q&A, and document intelligence through Nemotron-tier multimodal backbones for creator and enterprise workflows.

Released · 2026-04-28 · 30B-A3B MoE · Multimodal · Hugging Face Day-0

Key facts (90-second read)

NVIDIA released Nemotron-3 Nano Omni on 2026-04-28: a 30B-A3B Mamba2-Transformer MoE multimodal model with ~3B active parameters per token, jointly processing image, video, audio, and text. It shipped Day-0 on Hugging Face under NVIDIA's Open Model Agreement with full commercial-use rights, plus OpenRouter and a build.nvidia.com NIM. Best-in-class on MMLongBench-Doc, OCRBench v2, WorldSense, and DailyOmni, with up to 9× higher multimodal throughput than alternatives. For BibiGPT users, Nemotron-3 Nano Omni is the long-form multimodal backbone we route long videos, podcasts, and document Q&A through.

Features

What is Nemotron-3 Nano Omni?

NVIDIA's 2026-04-28 multimodal flagship in the Nemotron 3 Nano family — a 30B-parameter Mamba2-Transformer hybrid MoE backbone with 128 experts, top-6 routing, and roughly 3B active parameters per token. It unifies image, video, audio, and text understanding in a single model, available Day-0 on Hugging Face.

30B-A3B MoE multimodal backbone

31B total parameters with ~3B active per token via 128-expert, top-6 MoE routing. The hybrid stacks 23 Mamba selective-state-space layers (long-context efficiency), 23 MoE layers, and 6 grouped-query attention layers, delivering long-context multimodal intelligence at a 3B-active inference cost.
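
For intuition on how the active-parameter count stays low, here is a minimal top-k routing sketch in plain NumPy. The 128-expert, top-6 shape matches the figures above; the hidden size, router weights, and function names are illustrative placeholders, not the model's real internals.

```python
import numpy as np

# Illustrative top-k MoE router: 128 experts, top-6 routing per token,
# matching the figures quoted above. Hidden size is a made-up placeholder.
NUM_EXPERTS, TOP_K, HIDDEN = 128, 6, 2048

rng = np.random.default_rng(0)
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02

def route(token_hidden: np.ndarray):
    """Return the top-k expert indices and their normalized weights for one token."""
    logits = token_hidden @ router_w                    # (NUM_EXPERTS,)
    top_idx = np.argpartition(logits, -TOP_K)[-TOP_K:]  # unordered top-k experts
    top_logits = logits[top_idx]
    weights = np.exp(top_logits - top_logits.max())
    weights /= weights.sum()                            # softmax over the selected experts only
    return top_idx, weights

token = rng.standard_normal(HIDDEN)
experts, weights = route(token)
print(experts, weights.round(3))
# Only 6 of 128 expert FFNs run for this token, which is why a ~31B-total
# model can activate only ~3B parameters per token.
```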

Image · video · audio · text in one model

CRADIO v4-H acts as the vision encoder for image and video frames; Parakeet acts as the speech encoder for audio inputs. One model handles document Q&A, summarization, transcription, and video reasoning — no separate stack per modality.
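
In practice that means one request can carry several modalities at once. Below is a sketch of what such a payload could look like in the OpenAI-style "content parts" convention; the field names, URLs, and whether a given deployment accepts audio parts in exactly this form are assumptions, not the model's documented interface.

```python
# Sketch of a mixed-modality request body (OpenAI-style content parts).
# Treat the shapes below as an assumption about the serving stack, not a spec.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize this lecture and answer: what formula is on the slide at 12:30?"},
            # Sampled video frames go through the vision encoder...
            {"type": "image_url",
             "image_url": {"url": "https://example.com/frames/frame_0750.jpg"}},
            # ...while the audio track goes through the speech encoder.
            {"type": "input_audio",
             "input_audio": {"data": "<base64-encoded wav>", "format": "wav"}},
        ],
    }
]
# Send `messages` to whatever OpenAI-compatible endpoint serves the model.
```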

Hugging Face Day-0, commercial-friendly

Released under NVIDIA's Open Model Agreement with full commercial use rights. BF16, FP8, and NVFP4 variants are all on Hugging Face on day one (plus OpenRouter and build.nvidia.com NIM), making local and serverless deployment straightforward.
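
As a rough sketch of the serverless path, any OpenAI-compatible client can talk to OpenRouter or a NIM endpoint; the model identifier below is a hypothetical placeholder, so check the actual model page before use.

```python
# Minimal sketch: calling a hosted deployment through an OpenAI-compatible
# endpoint. OpenRouter and build.nvidia.com NIM both speak this protocol.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # or your NIM endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni",  # hypothetical ID; confirm on the model page
    messages=[
        {"role": "user",
         "content": "Summarize the key decisions from this meeting transcript: ..."},
    ],
)
print(resp.choices[0].message.content)
```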

Why this matters for BibiGPT users

BibiGPT is the AI audio/video assistant for creators and enterprises — long-video summarization, visual analysis, document intelligence, and knowledge-product generation. Nemotron-3 Nano Omni is exactly the multimodal backbone shape that BibiGPT routes long-form audio/video understanding through.

Long-form video understanding gets cheaper

A 30B-A3B model with ~3B active parameters runs roughly an order of magnitude cheaper at inference time than a dense 30B, while leading the WorldSense and DailyOmni video/audio benchmarks. BibiGPT can route long lectures, podcasts, and conferences through Nemotron-tier reasoning without burning premium budget.
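
The cost claim follows from a standard back-of-envelope: decode compute is roughly 2 FLOPs per active parameter per token, so ~3B active versus a dense 30B works out to about 10× fewer FLOPs. A tiny illustrative calculation:

```python
# Back-of-envelope for the inference-cost claim. The 2-FLOPs-per-parameter
# rule ignores attention/state-space overhead and encoder cost.
ACTIVE_PARAMS = 3e9   # ~3B active per token (MoE)
DENSE_PARAMS = 30e9   # hypothetical dense 30B comparison point

flops_moe = 2 * ACTIVE_PARAMS    # ~6 GFLOPs per generated token
flops_dense = 2 * DENSE_PARAMS   # ~60 GFLOPs per generated token
print(f"~{flops_dense / flops_moe:.0f}x fewer decode FLOPs per token")  # ~10x
```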

Document intelligence + audio in one pass

Best-in-class on MMLongBench-Doc and OCRBench v2, plus Parakeet for audio. BibiGPT's document Q&A, subtitle translation, and audio transcription pipelines benefit from a single model handling OCR-heavy PDFs, long videos, and meeting recordings together.

Edge and self-host pathways open up

FP8 (~32.8 GB) and NVFP4 (~20.9 GB) variants make Nemotron-3 Nano Omni viable on a single GPU. For BibiGPT's enterprise API customers, that means an on-prem multimodal option for sensitive footage — not just a hosted-only flagship.
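
A quick sanity check on those footprints, using the rough rule that weight memory is total parameters × bits-per-weight ÷ 8; this ignores KV cache, activations, and any components kept at higher precision, so it will not match the published sizes exactly.

```python
def weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: params * bits / 8.
    Ignores KV cache, activations, and higher-precision components."""
    return total_params * bits_per_weight / 8 / 1e9

TOTAL = 31e9  # ~31B total parameters
print(f"FP8   ~{weight_gb(TOTAL, 8.5):.1f} GB")   # ~32.9 GB, close to the ~32.8 GB checkpoint
print(f"NVFP4 ~{weight_gb(TOTAL, 4.98):.1f} GB")  # ~19.3 GB; the reported ~20.9 GB presumably
                                                  # also counts parts not quantized to FP4
# Either variant fits comfortably on a single 80 GB data-center GPU.
```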

5 key changes (90-second read)

Headline shifts from the Nemotron-3 Nano Omni release on 2026-04-28.

  1. 30B-A3B MoE goes multimodal

    NVIDIA extends the Nemotron 3 Nano family to a unified image / video / audio / text model. 31B total parameters, ~3B active per token via 128-expert top-6 MoE — long-context multimodal at a 3B-dense inference cost.

  2. Mamba2-Transformer hybrid backbone

    The architecture interleaves 23 Mamba selective-state-space layers, 23 MoE layers, and 6 grouped-query attention layers. Mamba carries the long-context heavy lifting; MoE adds conditional capacity; GQA layers provide attention where it matters most.

  3. Vision and audio encoders unified

    CRADIO v4-H handles image and video frames; Parakeet handles audio. One model covers document intelligence, video understanding, transcription, and audio Q&A — no separate stack per modality.

  4. Hugging Face Day-0 with commercial-use license

    Released under NVIDIA's Open Model Agreement with full commercial use rights. BF16, FP8, and NVFP4 variants on Hugging Face on day one, plus OpenRouter (free tier) and build.nvidia.com NIM microservice.

  5. Quantization for single-GPU deployment

    FP8 variant ≈ 32.8 GB (8.5 effective bits/weight, with FP8 KV cache); NVFP4 mixed-precision ≈ 20.9 GB (~4.98 bits/weight). Edge and self-host become viable for enterprises that need on-prem multimodal reasoning.

3 typical scenarios for BibiGPT users

Where Nemotron-3 Nano Omni pays off most for BibiGPT's creator and enterprise audience.

Long video understanding at low active-parameter cost

BibiGPT summarizes 90-minute lectures, podcasts, and conferences. With a 30B-A3B MoE that activates only ~3B parameters per token, Nemotron-tier multimodal reasoning runs at a fraction of dense-30B inference cost — leading on WorldSense and DailyOmni video/audio benchmarks.

Document Q&A + audio intelligence in one model

Nemotron-3 Nano Omni is best-in-class on MMLongBench-Doc and OCRBench v2 while also handling audio via Parakeet. BibiGPT's document Q&A, subtitle translation, and meeting transcription pipelines collapse into a single multimodal pass.

On-prem multimodal for enterprise API customers

FP8 (~32.8 GB) and NVFP4 (~20.9 GB) variants make single-GPU deployment realistic. For BibiGPT's enterprise API customers with sensitive footage, Nemotron-3 Nano Omni is the on-prem backbone option — not just a hosted-only multimodal flagship.

Use BibiGPT to summarize long videos — backed by Nemotron-tier multimodal models

BibiGPT routes long-form video, audio, and document understanding through multimodal backbones like NVIDIA Nemotron-3 Nano Omni. Paste a Bilibili / YouTube / podcast link or upload a file to get summaries, mind maps, AI Q&A, and short-form re-renders without leaving your workflow.