Llama 4 × BibiGPT

Meta shipped Llama 4 on 2025-04-05 — the first natively multimodal Llama and the first to use a mixture-of-experts (MoE) architecture. Scout ships with 17B active / 109B total parameters across 16 experts and a 10M token context window; Maverick ships with 17B active / 400B total parameters across 128 experts and a 1M token context. BibiGPT routes long-form video summarization, multi-document Q&A, and self-hosted pipelines through Llama 4 as one of the open-weight backbones, alongside Mistral Medium 3.5 and DeepSeek-V4.

Released · 2025-04-05 · Scout 10M context · Maverick 400B MoE · Open-weight · multimodal

Key facts (90-second read)

As of 2026-05-09: Meta released Llama 4 on 2025-04-05, the first natively multimodal Llama family and the first to use a mixture-of-experts (MoE) architecture. Scout ships at 17B active / 109B total / 16 experts with a 10M token context window; Maverick ships at 17B active / 400B total / 128 experts with a 1M context window. Both are open-weight and licensed under Meta's Llama 4 Community License; Scout fits on a single H100 GPU with Int4 quantization, while Maverick runs on a single H100 DGX host. For BibiGPT users, Scout's 10M context fits dozens of full transcripts in one prompt: no chunking, no cross-chunk reference loss.

Features

What ships in Llama 4?

Two open-weight checkpoints — Scout and Maverick — both natively multimodal, both built on a mixture-of-experts (MoE) architecture. Scout targets 10M context on a single H100; Maverick targets best-in-class multimodal reasoning on a single H100 host.

Scout — 17B active / 109B total / 10M context

Scout is a 17 billion active parameter MoE with 16 experts and 109B total parameters. Its 10M token context window is the longest in the open-weight tier and fits on a single NVIDIA H100 with Int4 quantization.

Maverick — 17B active / 400B total / 1M context

Maverick is a 17 billion active parameter MoE with 128 routed experts plus a shared expert and 400B total parameters. Its 1M token context window targets long-form reasoning on a single H100 DGX host. Meta benchmarks Maverick ahead of GPT-4o and Gemini 2.0 Flash on multimodal tasks.
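The 17B-active / 400B-total split comes from routing: for each token, only the top-scoring routed experts run, plus a shared expert that sees every token. A minimal numpy sketch of top-k routing with a shared expert (hypothetical shapes, router, and expert maps; not Meta's implementation):

```python
import numpy as np

def moe_layer(x, experts, shared_expert, router_w, top_k=1):
    """Toy MoE layer: route each token to its top-k experts,
    then always add the shared expert's output."""
    logits = x @ router_w                          # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # chosen expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over the selected experts' logits only
        w = np.exp(logits[t, top[t]])
        w /= w.sum()
        for weight, e in zip(w, top[t]):
            out[t] += weight * experts[e](x[t])    # only top_k experts run
    return out + shared_expert(x)                  # shared expert sees all tokens

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 4, 5
# each "expert" is just a fixed linear map in this sketch
mats = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda v, M=M: v @ M for M in mats]
shared = lambda v: 0.1 * v
router_w = rng.standard_normal((d, n_experts))
x = rng.standard_normal((tokens, d))
y = moe_layer(x, experts, shared, router_w)
print(y.shape)  # (5, 8)
```

Per token, only 1 of the 4 toy experts fires, which is why inference cost tracks the active parameter count rather than the total.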

Open weights, natively multimodal

Scout and Maverick ship as open-weight downloads on llama.com and Hugging Face. Both accept text and image inputs natively (no separate vision adapter), and both can be self-hosted under Meta's Llama 4 Community License — review terms before commercial deployment.

What 10M context + open weights mean for BibiGPT users

BibiGPT's job is turning hour-long videos and podcasts into structured notes. Scout's 10M context is enough headroom to fit dozens of full transcripts in one prompt; Maverick's multimodal head means image-rich content (slides, screenshots, frame extracts) gets first-class treatment.

Multi-episode course summarization

A full 20-episode YouTube course or a year of podcast back-catalogue fits in Scout's 10M context. Cross-episode references ("which episode introduced concept X?") resolve in a single inference, with no retrieval index in between.

Slide + transcript multimodal Q&A

Pair BibiGPT-extracted transcripts with frame screenshots from a lecture or product demo. Maverick's native multimodal head answers questions that span both modalities — "on which slide did the speaker show the architecture diagram?" — without OCR pre-processing.

Self-host for privacy-sensitive content

Open weights mean Scout or Maverick can run on your own GPUs. Sensitive corporate meetings, paywalled course content, and internal training materials can be summarized on-prem — audio, transcripts and frames never leave your network.

5 key changes (90-second read)

Headline shifts from the Llama 4 release.

  1. Released 2025-04-05

    Meta dropped Llama 4 Scout and Maverick on April 5, 2025 — the first open-weight Llama herd to ship natively multimodal and on a mixture-of-experts architecture.

  2. First Llama on MoE

    Llama 4 is Meta's first Llama family to use mixture-of-experts routing. Only ~17B parameters fire per token even though total parameter counts run 109B (Scout) or 400B (Maverick), keeping inference cost close to a 17B dense model.

  3. Scout — 10M token context

    Scout's 10M context window is the longest of any open-weight Llama and beats most closed-weight peers. Meta attributes it to interleaved attention layers without positional embeddings plus inference-time temperature scaling of attention.

  4. Maverick — 400B / 128 experts / multimodal SOTA

    Maverick uses 128 routed experts plus a shared expert for 400B total parameters. Meta benchmarks it ahead of GPT-4o and Gemini 2.0 Flash on multimodal tasks; deployable on a single H100 DGX host.

  5. Behemoth previewed (~2T total)

    Meta also previewed Llama 4 Behemoth, a ~2T total parameter teacher model used to train Scout and Maverick. Not yet released as an open-weight checkpoint.
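The inference-time temperature scaling mentioned for Scout's long context can be pictured as an extra temperature dividing the attention logits before softmax. A toy numpy sketch of that single idea (Meta's scheme, part of what it calls the iRoPE architecture, adapts scaling per position; this sketch does not attempt that):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, temp=1.0):
    """Scaled dot-product attention with an extra temperature on the
    logits; a toy illustration, not Meta's exact scheme."""
    d = q.shape[-1]
    scores = (q @ k.T) / (np.sqrt(d) * temp)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d, n = 16, 1024                       # one query over a long key/value run
q = rng.standard_normal((1, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

sharp = attention(q, k, v, temp=0.5)  # lower temp -> sharper attention
flat = attention(q, k, v, temp=2.0)   # higher temp -> more uniform
print(sharp.shape, flat.shape)        # (1, 16) (1, 16)
```

The point of tuning this at inference time is to keep attention from flattening out as the key sequence grows toward millions of tokens.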

3 typical scenarios for BibiGPT users

Grounded in real BibiGPT user personas — all actionable today.

Multi-episode course — full summary in one prompt

Use BibiGPT to extract transcripts from a 20-episode YouTube course, then route the summarization step through Llama 4 Scout. The full 20-episode stack fits in 10M context, so cross-episode references stay intact instead of being stitched from chunk summaries.
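Concretely, "one prompt" means concatenating every transcript into a single user message. A sketch of the request payload with placeholder transcripts and an assumed deployment model name (it only builds the dict; no network call is made):

```python
# Placeholder transcripts standing in for BibiGPT-extracted episodes.
transcripts = {f"Episode {i}": f"(transcript text of episode {i})"
               for i in range(1, 21)}

# One prompt holding the whole course, episode headers included.
prompt = "\n\n".join(f"## {title}\n{text}"
                     for title, text in transcripts.items())

request = {
    "model": "llama-4-scout",  # assumed deployment name
    "messages": [
        {"role": "system",
         "content": "Summarize the course and keep cross-episode "
                    "references intact."},
        {"role": "user", "content": prompt},
    ],
}
print(len(request["messages"]), prompt.count("## Episode"))
```

With a 10M token window there is no chunking step: the entire course travels in the single user message, so "which episode introduced concept X?" resolves against the full text.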

Slide + transcript multimodal Q&A

Pair BibiGPT-extracted lecture transcripts with frame screenshots. Maverick's native multimodal head answers spanning questions like "on which slide did the speaker introduce the architecture diagram?" — no OCR pipeline, no caption preprocessing in between.
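In an OpenAI-style chat API, a transcript-plus-frames question is one user message whose content mixes text and image parts. A sketch with placeholder frame URLs and an assumed model name (it builds the payload only):

```python
# Placeholder URLs standing in for BibiGPT frame extracts.
frames = [f"https://example.com/frames/slide_{i:02d}.png"
          for i in range(1, 4)]

# One message: a text part followed by one image part per frame.
content = [{"type": "text",
            "text": "Transcript excerpt: ... On which slide did the "
                    "speaker show the architecture diagram?"}]
content += [{"type": "image_url", "image_url": {"url": u}} for u in frames]

request = {
    "model": "llama-4-maverick",  # assumed deployment name
    "messages": [{"role": "user", "content": content}],
}
print(len(content))
```

Because the model ingests the images natively, no OCR or captioning stage sits between the frames and the answer.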

Self-host for privacy — open weights in production

Run Scout or Maverick on your own GPUs under the Llama 4 Community License, then pair with BibiGPT's transcript extractor for sensitive corporate meetings or paywalled course content. Audio, transcripts and frames stay on-prem; summaries never leave your network.
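As one possible setup, a vLLM launch for Scout might look like the following; the model id is the one published on Hugging Face, and the parallelism and context cap are assumptions to tune for your hardware:

```shell
# Hypothetical self-host launch for Llama 4 Scout via vLLM.
# --tensor-parallel-size and --max-model-len depend on your GPUs;
# the full 10M window needs far more memory than a typical host has.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000
```

BibiGPT's transcript extractor can then target the server's OpenAI-compatible endpoint (http://localhost:8000/v1 by default), keeping every request inside your network.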

Frequently Asked Questions

Ask us anything!

Summarize a 20-episode course in one prompt — Llama 4 routing included

BibiGPT auto-routes long-form video and podcast summarization through long-context backbones (Llama 4 Scout's 10M context included). Drop a YouTube, Bilibili or podcast URL and get full-transcript summaries plus AI Q&A in 5 languages — no chunking artifacts, no cross-chunk reference loss.