DeepSeek V4 Preview × BibiGPT — Pro + Flash dual SKU

DeepSeek published the V4 Preview lineup on 2026-04-24: V4-Pro (1.6T MoE / 49B active) and V4-Flash (284B / 13B active), both at 1M context, with a new Hybrid CSA+HCA attention scheme and three API modes (Fast / Expert / Vision). BibiGPT users can put the preview lineup to work on hour-long video, podcast and multi-document summarization once it lands behind the routing layer.

Released 2026-04-24 · Pro 1.6T / Flash 284B · 1M context · CSA+HCA

Key facts (90-second read)

As of 2026-05-08: DeepSeek released the V4 Preview lineup on 2026-04-24. Two SKUs ship together — V4-Pro (1.6T MoE / 49B active) and V4-Flash (284B / 13B active) — both at a 1M token context window, both running on the new Hybrid CSA + HCA attention scheme, both reachable through Fast / Expert / Vision API modes. Compared with the earlier V4 release (covered separately at /features/deepseek-v4-1m-context-explained), V4 Preview's news is the dual SKU split, the Hybrid CSA+HCA attention upgrade, and the explicit three-mode API surface — not the 1M jump itself. For BibiGPT users, V4-Flash is the cheap default for hour-long video and podcast summarization, V4-Pro is reserved for harder reasoning passes on the same transcript, and Vision mode pairs cleanly with BibiGPT's frame-extraction workflow. Authoritative sources: the news260424 release note on api-docs.deepseek.com and the deepseek-ai collection on Hugging Face.

Features

What ships in DeepSeek V4 Preview?

Two SKUs released together on 2026-04-24 — V4-Pro and V4-Flash — both at a 1M token context window, both running on the new Hybrid CSA+HCA attention scheme, both reachable through three distinct API modes.

Pro vs Flash dual SKU

V4-Pro is a 1.6T MoE checkpoint with 49B parameters firing per token. V4-Flash is a 284B MoE checkpoint with only 13B active per token — same context window, same attention scheme, but a much lighter inference footprint at a fraction of the per-token cost.
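As a rough illustration (not a benchmark): per-token decode compute for an MoE model scales roughly with its active parameters, which is where the Flash footprint advantage comes from. The sketch below uses that approximation and deliberately ignores attention-over-context cost, which grows with position depth in a 1M window.

```python
# Back-of-envelope decode compute per generated token for an MoE decoder:
# roughly 2 FLOPs per active parameter. This ignores the attention-over-context
# term, which dominates deep into a 1M-token window.
ACTIVE_PARAMS = {"V4-Pro": 49e9, "V4-Flash": 13e9}

for sku, active in ACTIVE_PARAMS.items():
    print(f"{sku}: ~{2 * active / 1e9:.0f} GFLOPs/token")

ratio = ACTIVE_PARAMS["V4-Pro"] / ACTIVE_PARAMS["V4-Flash"]
print(f"V4-Pro is ~{ratio:.1f}x heavier per generated token than V4-Flash")
```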

Hybrid CSA + HCA attention

V4 Preview pairs the MoE backbone with a new attention stack, Hybrid CSA + HCA: cross-shared attention plus hierarchical-causal attention, replacing the single attention scheme of the earlier V4 release. The hybrid scheme is designed to keep semantic coherence intact across long documents instead of degrading toward the tail of the context window.

Three API modes — Fast / Expert / Vision

Each preview SKU is reachable through three modes. Fast prioritizes throughput; Expert prioritizes reasoning quality; Vision adds multimodal input on top of the same backbone — one API surface, three knobs to dial cost-vs-quality and modality.
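A minimal client-side sketch, assuming the Preview keeps DeepSeek's existing OpenAI-compatible chat endpoint. DeepSeek has not published the Preview's model identifiers or how a mode is selected, so the model IDs below are hypothetical placeholders.

```python
# Hypothetical mode selection via model ID. The endpoint shape follows
# DeepSeek's current OpenAI-compatible API; the IDs are NOT published names.
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

MODES = {
    "fast":   "deepseek-v4-flash-fast",    # throughput-first (hypothetical ID)
    "expert": "deepseek-v4-flash-expert",  # reasoning-first (hypothetical ID)
}

for mode, model_id in MODES.items():
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Summarize this transcript: ..."}],
    )
    print(mode, "->", resp.choices[0].message.content[:80])
```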

What V4 Preview means for BibiGPT users

BibiGPT turns hour-long videos and podcasts into structured notes. V4-Flash sharply cuts the per-token cost of a 1M-context summarization run, V4-Pro raises the reasoning ceiling, and Vision opens the door to screenshot-grade frame analysis, all on the same context budget.

1M context — 8h podcast end-to-end

1,000,000 tokens fit a full 8-hour conference recording, an entire multi-episode course, or a stack of related papers in one prompt. BibiGPT's chunk-and-stitch pipeline can collapse to a single inference for content that previously needed retrieval, cutting cross-chunk reference loss between hours one and eight.
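A quick sanity check on that claim, assuming conversational English at roughly 150 words per minute and about 1.3 tokens per word (rules of thumb, not measured figures; Chinese transcripts tokenize differently):

```python
# Rough token budget for an 8-hour recording under the stated assumptions.
minutes = 8 * 60                      # one full conference day
words = minutes * 150                 # ~150 spoken words/min
tokens = int(words * 1.3)             # ~1.3 tokens/word for English
print(f"~{tokens:,} transcript tokens")                          # ~93,600
print(f"~{1_000_000 // tokens} such recordings per 1M window")   # ~10
```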

V4-Flash unlocks cheap long-context summary

Only 13B parameters fire per token on V4-Flash. For BibiGPT-style summarization workloads — long transcript in, structured outline out — Flash is the dominant cost-quality point inside the 1M-context tier. Pro is reserved for harder reasoning hops on the same transcript.

Vision mode + BibiGPT visual analysis

V4-Vision takes screenshots and frames as input. BibiGPT's existing visual analysis workflow — extract key frames from a video, then ask the model what's on screen — can pair directly with V4-Vision once exposed in the routing layer. Frame-level Q&A becomes one inference, not a separate captioner pass.
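A sketch of what a single frame-plus-transcript question could look like, assuming V4-Vision accepts OpenAI-style image_url content parts; both the input schema and the model identifier are assumptions until DeepSeek publishes the Preview docs.

```python
# Frame-level Q&A in one call. The message shape mirrors OpenAI's vision
# input format; whether V4-Vision uses the same schema is an assumption.
import base64
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

with open("frame_0014.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="deepseek-v4-flash-vision",  # hypothetical ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcript excerpt: ...\nWhat is shown on screen at this point?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```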

5 key changes (90-second read)

Headline shifts from the DeepSeek V4 Preview release on 2026-04-24.

  1. Pro vs Flash dual SKU

    V4-Pro is 1.6T MoE / 49B active per token. V4-Flash is 284B / 13B active — same context window, same attention, much lighter inference. Pick Flash for cheap long-context summarization, Pro for harder reasoning passes on the same transcript.

  2. Hybrid CSA + HCA attention

    Cross-shared attention plus hierarchical-causal attention replaces the single attention scheme of the earlier V4 release. The hybrid is designed to preserve semantic coherence across the full 1M-token context, exactly where hour-long-video summarization tends to break down.

  3. Three API modes — Fast / Expert / Vision

    Each Preview SKU exposes Fast (throughput), Expert (reasoning quality) and Vision (multimodal input) on the same API surface. One context budget, three knobs to dial cost-vs-quality and modality.

  4. 1M context, 8h-podcast friendly

    Both Pro and Flash keep the V4 family's 1M token context window. A full 8-hour conference recording or a multi-episode course series fits in one prompt — BibiGPT's chunk-and-stitch pipeline can collapse to a single inference for content that previously needed retrieval.

  5. Open weights on Hugging Face

    V4 Preview checkpoints land in the deepseek-ai collection on Hugging Face the same week. Self-hostable for privacy-sensitive workloads — paywalled course content, internal meeting recordings — without sending audio or transcripts to a third-party API.

3 typical scenarios for BibiGPT users

Grounded in real BibiGPT user personas — all actionable today by extracting a transcript with BibiGPT and calling V4 Preview directly until native routing lands.

Creator — 8-hour podcast, single-prompt outline

Use BibiGPT to extract an 8-hour podcast or all-day conference recording transcript, then route the outline-and-summary step through V4-Flash on Expert mode. Full transcript fits in 1M context, so chapter references stay coherent end-to-end without chunk-stitch artifacts.
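Under the same assumptions as earlier (OpenAI-compatible endpoint, hypothetical model identifier), the routing step is one call:

```python
# Single-prompt outline for an 8-hour transcript. Model ID is a placeholder.
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

with open("podcast_8h_transcript.txt", encoding="utf-8") as f:
    transcript = f.read()

resp = client.chat.completions.create(
    model="deepseek-v4-flash-expert",  # hypothetical ID
    messages=[
        {"role": "system",
         "content": "Produce a chaptered outline: timestamps, chapter titles, "
                    "and three summary bullets per chapter."},
        {"role": "user", "content": transcript},
    ],
)
print(resp.choices[0].message.content)
```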

Student — multi-episode course cross-Q&A

Concatenate BibiGPT-extracted transcripts from a multi-episode lecture series. With 1M headroom, ask 'which episode covered topic X?' and resolve directly on V4-Flash without an external retrieval index that drops citations between episode boundaries.
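One way to build that concatenated prompt so answers can cite episodes, again with a hypothetical model identifier:

```python
# Tag each BibiGPT-extracted transcript with an episode marker before
# concatenating, so the model can name the episode in its answer.
from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

corpus = "\n\n".join(
    f"=== EPISODE {i}: {p.stem} ===\n{p.read_text(encoding='utf-8')}"
    for i, p in enumerate(sorted(Path("transcripts").glob("*.txt")), start=1)
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash-expert",  # hypothetical ID
    messages=[{
        "role": "user",
        "content": corpus + "\n\nWhich episode covered topic X? Quote the passage.",
    }],
)
print(resp.choices[0].message.content)
```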

Power user — frame-level visual analysis with V4-Vision

Extract key frames from a slide-deck talk or chart-heavy video with BibiGPT, then send the frames to V4-Vision alongside the transcript. Frame-level Q&A — 'what's the y-axis on slide 14?' — collapses to one inference, no separate captioner pass.
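BibiGPT's frame-extraction internals aren't public; a standalone approximation with plain ffmpeg, grabbing one frame per minute for the Vision pass, looks like this:

```python
# Sample one frame per minute from a talk recording with ffmpeg.
import subprocess
from pathlib import Path

Path("frames").mkdir(exist_ok=True)
subprocess.run([
    "ffmpeg", "-i", "talk.mp4",
    "-vf", "fps=1/60",              # one frame every 60 seconds
    "frames/frame_%04d.png",
], check=True)
```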

Frequently Asked Questions

Run V4-Flash on a 1M-context podcast — start with BibiGPT transcript extraction

BibiGPT extracts long-form transcripts from YouTube, Bilibili and podcast URLs in 5 languages. Pair the transcript with V4-Flash for the cheapest 1M-context summary point in this tier, V4-Pro for hardest reasoning, V4-Vision for frame-level analysis. Once V4 Preview is routed inside BibiGPT, the same workflow runs end-to-end behind a single URL.
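Until that native routing ships, a caller can approximate the routing layer with a small lookup; as everywhere above, the model identifiers are hypothetical placeholders, not published names.

```python
# Approximate BibiGPT's future routing layer: map a task type to a SKU/mode.
ROUTES = {
    "summarize": "deepseek-v4-flash-fast",    # cheap bulk summarization
    "reason":    "deepseek-v4-pro-expert",    # hard reasoning passes
    "frame_qa":  "deepseek-v4-flash-vision",  # frame-level visual Q&A
}

def pick_model(task: str) -> str:
    """Return the model ID for a task, defaulting to the cheapest route."""
    return ROUTES.get(task, "deepseek-v4-flash-fast")
```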