Llama 4 × BibiGPT
Meta shipped Llama 4 on 2025-04-05 — the first natively multimodal Llama and the first to use a mixture-of-experts (MoE) architecture. Scout ships with 17B active / 109B total parameters across 16 experts and a 10M token context window; Maverick ships with 17B active / 400B total parameters across 128 experts and a 1M token context. BibiGPT routes long-form video summarization, multi-document Q&A and self-hosted pipelines through Llama 4 as one of its open-weight backbones, alongside Mistral Medium 3.5 and DeepSeek-V4.
Key facts (90-second read)
As of 2026-05-09: Meta released Llama 4 on 2025-04-05 — the first natively multimodal Llama family and the first to use a mixture-of-experts (MoE) architecture. Scout ships at 17B active / 109B total / 16 experts with a 10M token context window; Maverick ships at 17B active / 400B total / 128 experts with a 1M context window. Both are open-weight, both run on a single H100-class host, and both are licensed under Meta's Llama 4 Community License. For BibiGPT users, Scout's 10M context fits dozens of full transcripts in one prompt — no chunking, no cross-chunk reference loss.
Features
What ships in Llama 4?
Two open-weight checkpoints — Scout and Maverick — both natively multimodal, both built on a mixture-of-experts (MoE) architecture. Scout targets 10M context on a single H100 GPU; Maverick targets best-in-class multimodal reasoning on a single H100 DGX host.
Scout — 17B active / 109B total / 10M context
Scout is a 17 billion active parameter MoE with 16 experts and 109B total parameters. Its 10M token context window is the longest in the open-weight tier and fits on a single NVIDIA H100 with Int4 quantization.
Maverick — 17B active / 400B total / 1M context
Maverick is a 17 billion active parameter MoE with 128 routed experts plus a shared expert and 400B total parameters. Its 1M token context window targets long-form reasoning on a single H100 DGX host. Meta benchmarks Maverick ahead of GPT-4o and Gemini 2.0 Flash on multimodal tasks.
Open weights, natively multimodal
Scout and Maverick ship as open-weight downloads on llama.com and Hugging Face. Both accept text and image inputs natively (no separate vision adapter), and both can be self-hosted under Meta's Llama 4 Community License — review terms before commercial deployment.
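A minimal sketch of pulling the Scout weights for self-hosting, assuming you have already accepted the Llama 4 Community License on Hugging Face; the repo ID and local directory are placeholders to verify against the model card.

```python
# Minimal sketch: download the Scout checkpoint for self-hosting.
# Requires `huggingface-cli login` (or a token) and prior license acceptance.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed repo ID; check the model card
    local_dir="./llama-4-scout",                          # hypothetical local path
)
print(f"Weights saved to {local_dir}")
```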
What 10M context + open weights mean for BibiGPT users
BibiGPT's job is turning hour-long videos and podcasts into structured notes. Scout's 10M context is enough headroom to fit dozens of full transcripts in one prompt; Maverick's native image understanding means image-rich content (slides, screenshots, frame extracts) gets first-class treatment.
Multi-episode course summarization
A full 20-episode YouTube course or a year of podcast back-catalogue fits in Scout's 10M context. Cross-episode references ("which episode introduced concept X?") resolve in a single inference, with no retrieval index in between.
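Below is a minimal sketch of that single-prompt pattern, assuming Scout is served behind an OpenAI-compatible endpoint (for example a self-hosted vLLM server); the endpoint URL, model name and transcript paths are placeholders, and the hosted BibiGPT product handles this routing for you.

```python
# Minimal sketch: summarize a 20-episode course in one Scout call.
# Endpoint URL, model name, and transcript files are placeholders.
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")

# One transcript file per episode, e.g. exported from BibiGPT.
episodes = sorted(Path("transcripts").glob("episode_*.txt"))
corpus = "\n\n".join(
    f"=== Episode {i + 1}: {p.stem} ===\n{p.read_text()}" for i, p in enumerate(episodes)
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model name on the server
    messages=[
        {"role": "system", "content": "You turn multi-episode courses into structured notes."},
        {"role": "user", "content": f"Summarize each episode, then list cross-episode themes.\n\n{corpus}"},
    ],
)
print(resp.choices[0].message.content)
```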
Slide + transcript multimodal Q&A
Pair BibiGPT-extracted transcripts with frame screenshots from a lecture or product demo. Maverick's native image understanding answers questions that span both modalities — "on which slide did the speaker show the architecture diagram?" — without OCR pre-processing.
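Here is a minimal sketch of such a mixed image-and-text question, assuming Maverick sits behind an OpenAI-compatible, vision-capable endpoint that accepts image_url content parts; the frame path, endpoint and model name are placeholders.

```python
# Minimal sketch: ask a question that spans a slide frame and the transcript.
# Frame path, endpoint, and model name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")

with open("frames/slide_12.png", "rb") as f:
    slide_b64 = base64.b64encode(f.read()).decode()
transcript = open("transcripts/lecture_03.txt").read()

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed model name on the server
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Transcript:\n{transcript}\n\nIs this the slide where the speaker shows the architecture diagram?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{slide_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```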
Self-host for privacy-sensitive content
Open weights mean Scout or Maverick can run on your own GPUs. Sensitive corporate meetings, paywalled course content, and internal training materials can be summarized on-prem — audio, transcripts and frames never leave your network.
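As a sketch of what fully local inference can look like, the example below uses vLLM (one option among several open-source servers) and makes no network calls once the weights are on disk; the model path, GPU count and context length are assumptions to size against your hardware.

```python
# Minimal sketch: fully local inference with vLLM, so transcripts never leave the box.
# Model path, tensor_parallel_size, and max_model_len are assumptions; size them to your GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./llama-4-scout",    # local weights downloaded earlier
    tensor_parallel_size=1,     # bump on multi-GPU hosts
    max_model_len=131_072,      # raise toward 10M as KV-cache memory allows
)

prompt = "Summarize this internal meeting transcript:\n\n" + open("meeting.txt").read()
outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=1024))
print(outputs[0].outputs[0].text)
```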
5 key changes (90-second read)
Headline shifts from the Llama 4 release.
1. Released 2025-04-05
Meta dropped Llama 4 Scout and Maverick on April 5, 2025 — the first open-weight Llama herd to ship natively multimodal and on a mixture-of-experts architecture.
2. First Llama on MoE
Llama 4 is Meta's first Llama family to use mixture-of-experts routing. Only ~17B parameters fire per token even though total parameter counts run 109B (Scout) or 400B (Maverick), keeping inference cost close to a 17B dense model; see the routing sketch after this list.
3. Scout — 10M token context
Scout's 10M context window is the longest in any open-weight Llama and beats most closed-weight peers. Meta attributes it to interleaved attention layers without positional embeddings plus inference-time temperature scaling of attention.
4. Maverick — 400B / 128 experts / multimodal SOTA
Maverick uses 128 routed experts plus a shared expert for 400B total parameters. Meta benchmarks it ahead of GPT-4o and Gemini 2.0 Flash on multimodal tasks; deployable on a single H100 DGX host.
5. Behemoth previewed (~2T total)
Meta also previewed Llama 4 Behemoth, a ~2T total parameter teacher model used to train Scout and Maverick. Not yet released as an open-weight checkpoint.
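For readers new to MoE, the sketch below is an illustrative top-k router with toy dimensions, not Meta's implementation; it only shows why a 109B or 400B total model can run at roughly 17B active parameters per token.

```python
# Illustrative-only MoE routing sketch (toy sizes, not Meta's code):
# a router scores experts per token and only the top-k routed experts run,
# so the parameters touched per token stay a small fraction of the total.
import torch
import torch.nn.functional as F

d_model, n_experts, top_k = 64, 16, 1              # Scout-style: 16 experts, 1 routed expert per token
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
shared_expert = torch.nn.Linear(d_model, d_model)  # Llama 4 also keeps a shared expert
router = torch.nn.Linear(d_model, n_experts)

def moe_layer(x):                                  # x: (tokens, d_model)
    scores = F.softmax(router(x), dim=-1)          # routing probabilities per token
    weight, idx = scores.topk(top_k, dim=-1)       # pick the top-k experts per token
    out = shared_expert(x)                         # the shared expert sees every token
    for k in range(top_k):
        for e in range(n_experts):
            mask = idx[:, k] == e                  # tokens routed to expert e
            if mask.any():
                out[mask] += weight[mask, k, None] * experts[e](x[mask])
    return out

tokens = torch.randn(8, d_model)
print(moe_layer(tokens).shape)                     # torch.Size([8, 64])
```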
3 typical scenarios for BibiGPT users
Grounded in real BibiGPT user personas — all actionable today.
Multi-episode course — full summary in one prompt
Use BibiGPT to extract transcripts from a 20-episode YouTube course, then route the summarization step through Llama 4 Scout. The full 20-episode stack fits in 10M context, so cross-episode references stay intact instead of being stitched from chunk summaries.
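A minimal sketch of that cross-episode question, reusing the same assumed Scout endpoint as in the earlier summarization sketch; the transcript path, model name and example question are placeholders.

```python
# Minimal sketch: resolve a cross-episode reference in a single inference.
# Assumes the concatenated 20-episode transcripts are already on disk.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")
corpus = open("course_transcripts.txt").read()  # all 20 episodes concatenated

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model name on the server
    messages=[{
        "role": "user",
        "content": f"{corpus}\n\nWhich episode first introduced gradient descent, and how was it explained?",  # placeholder question
    }],
)
print(resp.choices[0].message.content)
```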
Slide + transcript multimodal Q&A
Pair BibiGPT-extracted lecture transcripts with frame screenshots. Maverick's native image understanding answers spanning questions like "on which slide did the speaker introduce the architecture diagram?" — no OCR pipeline, no caption preprocessing in between.
Self-host for privacy — open weights in production
Run Scout or Maverick on your own GPUs under the Llama 4 Community License, then pair with BibiGPT's transcript extractor for sensitive corporate meetings or paywalled course content. Audio, transcripts and frames stay on-prem; summaries never leave your network.
FAQs
Frequently Asked Questions
Ask us anything!
Summarize a 20-episode course in one prompt — Llama 4 routing included
BibiGPT auto-routes long-form video and podcast summarization through long-context backbones (Llama 4 Scout's 10M context included). Drop a YouTube, Bilibili or podcast URL and get full-transcript summaries plus AI Q&A in 5 languages — no chunking artifacts, no cross-chunk reference loss.