Introduction
From an NLP and systems perspective, Sonar is a retrieval-augmented generation (RAG) family. At inference time, Sonar-Medium-Online runs a retrieval step (live web search), converts the retrieved documents into a compact context, and conditions a language model to generate an answer that includes explicit citation metadata. In other words: retrieval → context assembly → conditional language generation → citation attachment.
Why Sonar-Medium-Online Matters:
- It changes the conditioning distribution of the model: Responses are explicitly anchored to retrieved source content, so hallucination surface area drops compared to closed models that rely solely on parametric memory.
- It introduces new operational signals to measure and optimize: Retrieval latency, retrieval precision, context size vs token cost, citation correctness, and the end-to-end latency from caller to first token.
- It requires production-grade shaping: Caching, domain filters, post-filters, active monitoring of token usage, and fallback paths if the search layer fails.
This guide treats Sonar-Medium-Online as a mid-tier RAG model: Lower cost and lower search-context depth than Pro/Deep tiers, but optimized for short, citation-aware answers and low latency.
What is Sonar-Medium-Online
Sonar = Retrieval + Generator + Attributor:
- Retriever (online): Live web queries run at request time. The retriever returns candidate documents or snippets (URLs, content snippets, metadata).
- Context builder: A component that decides how much retrieved content to include in the LM conditioning buffer (search_context_size: low/medium/high).
- Conditional LM: The sequence model (the “Sonar” model) that receives the user prompt plus the selected retrieved context and emits a grounded answer.
- Attribution post-processor: The model (or the pipeline) formats inline citations or a citation block mapping claims to source URLs and snippets.
From an NLP quality view, Sonar reduces unconstrained generation by constraining the LM to ground on retrieved evidence. That said, retrieval quality and the heuristics for context selection remain critical to overall accuracy.
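To make the division of labor concrete, here is a toy end-to-end sketch of the four components above. The function names, data shapes, and the stubbed retriever/LM are purely illustrative; in the real service, retrieval and generation run server-side behind the API.

```python
# Conceptual sketch of the four pipeline stages. The retriever and the LM are
# stubbed with toy stand-ins here; in the real service both run server-side.
from dataclasses import dataclass

@dataclass
class Snippet:
    url: str
    text: str

def retrieve(query: str) -> list[Snippet]:
    """Live web retrieval: returns candidate snippets (URL + text + metadata)."""
    return [Snippet("https://example.com/a", f"Toy evidence about: {query}")]

def build_context(snippets: list[Snippet], search_context_size: str = "medium") -> list[Snippet]:
    """Context builder: decides how much retrieved content enters the LM buffer."""
    budget = {"low": 3, "medium": 6, "high": 12}[search_context_size]
    return snippets[:budget]

def generate(prompt: str, context: list[Snippet]) -> str:
    """Conditional LM: emits an answer grounded on the selected snippets."""
    evidence = " ".join(s.text for s in context)
    return f"Answer to '{prompt}' based on: {evidence}"

def attach_citations(answer: str, context: list[Snippet]) -> dict:
    """Attribution post-processor: maps the answer to its supporting URLs."""
    return {"answer": answer, "citations": [s.url for s in context]}

def ask(query: str) -> dict:
    context = build_context(retrieve(query))
    return attach_citations(generate(query, context), context)

print(ask("What is RAG?"))
```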
Sonar-Medium-Online explained: variants & when to use them
Terminology:
- Sonar — family name (RAG-capable models).
- Medium — mid-level compute/parameter/capacity tradeoff. Think of it as “balanced perplexity vs cost”.
- Online — retrieval happens live against web indexes at inference time (vs offline or cached retrieval).
When to pick Sonar-Medium-Online:
- You need fast, short answers with citations (user-facing Q&A, help widgets).
- You need freshness (news, product pages, specs) but can accept medium citation depth (2–5 sources most of the time).
- Latency and cost are constraints: pick medium-online for chat UIs where response time needs to be low and per-request cost must be moderate.
When not to pick it:
- When you need long-form, deeply-sourced research across hundreds of papers — choose Pro/Deep or a heavy RAG pipeline with specialized retrievers.
- When you need heavy multimodal synthesis (images + reasoning) — pick a multimodal stack.
Key Features of Sonar-Medium-Online
- Live retrieval grounding: Reduces parametric hallucinations by giving the LM explicit evidence to condition on.
- search_context_size control: A tunable knob that changes the number and length of retrieval snippets added to the prompt. This is the main cost/coverage tradeoff.
- Citation-aware outputs: Model includes URLs and short snippets or inline citations, enabling user verification and EEAT improvement.
- Latency-optimized for short answers: The model and retrieval pipeline are tuned to return first tokens quickly for chat-like interactions.
- API-first design: Integrates with standard REST/SDK flows and supports usual production controls (max_tokens, stop tokens, timeouts).
Note: because the model conditions on retrieved snippets, the effective context window for the LM includes both the user prompt and the retrieved tokens — so prompt engineering must balance prompt length vs the retrieval budget.
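As a concrete starting point, the sketch below shows what a request might look like against Perplexity's OpenAI-compatible chat-completions endpoint. The environment variable name and the exact nesting of search_context_size are assumptions; confirm parameter names and the response shape against the current API reference.

```python
# Hedged request sketch: confirm the endpoint, model name, and the exact
# placement of search_context_size against the current Perplexity API
# reference. PPLX_API_KEY is an assumed environment variable name.
import os
import requests

payload = {
    "model": "sonar-medium-online",
    "messages": [
        {"role": "system", "content": "Answer briefly and cite your sources."},
        {"role": "user", "content": "What changed in the latest Kubernetes release?"},
    ],
    "max_tokens": 300,                 # caps output cost and cuts off runaway loops
    "web_search_options": {            # assumed parameter nesting; verify in the docs
        "search_context_size": "medium",
    },
}

resp = requests.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['PPLX_API_KEY']}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data["choices"][0]["message"]["content"])
print(data.get("citations", []))       # citation list, if the response includes one
```

Keeping the system prompt terse leaves more of the shared context window for retrieved snippets, which is exactly the tradeoff the note above describes.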
Real-World Performance & Benchmarks
Below is a reproducible benchmarking plan you can run with your prompts. Publish raw prompts and test vectors for EEAT.
What to Measure
- Latency (ms): Time to first token (TTFT) and time to last token (total generation time). Use client timestamps; a streaming measurement sketch follows this list.
- Token usage: Input tokens (prompt + retrieved context) vs output tokens.
- Citation precision: % of claims backed by cited sources (manual audit).
- Answer quality: Human-blind rating (1–5) for correctness, completeness, and clarity.
- Failure modes: Percentage of requests that return malformed JSON, loops, or finish_reason: length.
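A minimal client-side harness for the latency metrics above, assuming an OpenAI-compatible SSE streaming response; it records TTFT and total time on the caller's clock, which is what your users actually experience.

```python
# Client-side TTFT / total-latency measurement, assuming an OpenAI-compatible
# SSE stream ("data: {...}" lines). Verify streaming support and chunk framing
# against the provider's docs; PPLX_API_KEY is an assumed environment variable.
import json
import os
import time
import requests

def measure(prompt: str, model: str = "sonar-medium-online") -> dict:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
        "stream": True,
    }
    t0 = time.perf_counter()
    ttft = None
    chunks = []
    with requests.post(
        "https://api.perplexity.ai/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['PPLX_API_KEY']}"},
        json=payload,
        stream=True,
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            body = line[len(b"data: "):]
            if body == b"[DONE]":
                break
            if ttft is None:
                ttft = time.perf_counter() - t0          # first streamed chunk ≈ TTFT
            delta = json.loads(body)["choices"][0].get("delta", {})
            chunks.append(delta.get("content") or "")
    return {
        "ttft_s": ttft,
        "total_s": time.perf_counter() - t0,             # time to last token
        "output": "".join(chunks),
    }

print(measure("When was HTTP/3 standardized?"))
```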
Recommended Test Tasks
- Short factual Q&A (10–50 tokens): 200 sample questions across domains (dates, specs, definitions). Measure accuracy and TTFT.
- Multi-source summaries: Provide 3–8 URLs and ask for a 150–250 word summary with citations. Measure citation recall and coverage (a scoring helper for the citation metrics follows this list).
- Attribution test: Ask the model to state a fact and cite the exact sentence/paragraph and URL that supports it. Manually validate.
- Stress test: Long prompts, heavy retrieval (high search_context_size), and repeated rapid requests to observe rate limits and instability.
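For the citation metrics, a small scoring helper can turn manual audit notes into numbers. The audit-record format below (claim, cited URL, a human "supported" verdict) is hypothetical; adapt it to however your reviewers record their judgments.

```python
# Scoring helpers for the citation metrics above. The audit-record format is
# hypothetical — adapt it to however your reviewers capture their judgments.

def citation_precision(audited_claims: list[dict]) -> float:
    """Share of audited claims whose cited source actually supports them."""
    if not audited_claims:
        return 0.0
    supported = sum(1 for c in audited_claims if c["supported"])
    return supported / len(audited_claims)

def citation_recall(provided_urls: list[str], cited_urls: list[str]) -> float:
    """For multi-source summaries: share of the provided URLs that got cited."""
    if not provided_urls:
        return 0.0
    cited = set(cited_urls)
    return sum(1 for u in provided_urls if u in cited) / len(provided_urls)

# Example audit records from a manual pass:
claims = [
    {"claim": "Release date was 2024-03-12", "url": "https://example.com/a", "supported": True},
    {"claim": "Supports 48 kHz audio", "url": "https://example.com/b", "supported": False},
]
print(citation_precision(claims))  # 0.5
```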
Example Benchmark Table
| Task | Metric | Expectation (sonar-medium-online) | Notes |
| --- | --- | --- | --- |
| Short factual Q&A | TTFT | Low (<300 ms typical) | Great for chat UX |
| Multi-doc summary | Accuracy | Medium-high | May miss deep scholarly citations |
| Citation correctness | Precision | Medium | Increase search_context_size for more sources |
| Stability | Failure modes | Occasional loops reported | Log request IDs when reproducible |
Benchmarking Tips
- Use fixed seeds for deterministic generation when possible.
- Save raw request/response payloads and request IDs for reproducibility and bug reports.
- Include environment metadata (region, API version, date-tested) with each benchmark result; a minimal logging sketch follows.
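One simple way to follow these tips is to append every benchmark run as a JSON line bundling the raw payloads with environment metadata. The field names below are suggestions, not a required schema.

```python
# Persist raw payloads plus environment metadata for reproducible benchmarks:
# one JSON object per request, appended to a .jsonl file.
import json
import datetime

def log_benchmark_record(path: str, request_payload: dict, response_body: dict,
                         request_id: str, region: str, api_version: str) -> None:
    record = {
        "date_tested": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "region": region,
        "api_version": api_version,
        "request_id": request_id,
        "request": request_payload,
        "response": response_body,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```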
Pricing Explained
Pricing primitives
- Input tokens: Everything you send (system prompt + user input + retrieval snippets).
- Output tokens: What the model returns.
- Per-1M token price: The price per 1,000,000 tokens, quoted separately for input and output (the two rates may differ).
- Total cost per request = input_cost + output_cost.
Important: pricing changes, and you should validate current per-1M rates on the official price page before publishing.

Illustrative worked examples for Sonar-Medium-Online
Assume:
- Input price = $1 per 1,000,000 tokens
- Output price = $5 per 1,000,000 tokens
Example A — short chat
- Input: 200 tokens → $0.0002
- Output: 150 tokens → $0.00075
- Total ≈ $0.00095 per short request.
Example B — Multi-Doc Summary
- Input: 2000 tokens → $0.002
- Output: 1500 tokens → $0.0075
- Total ≈ $0.0095 per summary request.
Monthly Sample
- 200,000 short chat requests @ $0.00095 ≈ $190 / month (the small calculator below reproduces these figures).
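A small calculator makes these figures reproducible and easy to re-run once you plug in the current official rates. The prices below are the illustrative ones from this section, not quoted pricing.

```python
# Per-request cost calculator using the illustrative rates above; swap in the
# current per-1M prices from the official pricing page before trusting numbers.
INPUT_PRICE_PER_M = 1.0    # USD per 1,000,000 input tokens (illustrative)
OUTPUT_PRICE_PER_M = 5.0   # USD per 1,000,000 output tokens (illustrative)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

print(request_cost(200, 150))            # 0.00095 -> Example A
print(request_cost(2000, 1500))          # 0.0095  -> Example B
print(200_000 * request_cost(200, 150))  # ~190 per month for the sample volume
```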
Integration and Infra Patterns
- search_context_size: Set to low for fast chat, medium for typical QA, and high for summarization or deep research.
- Retries & backoff: Handle 429/5xx with exponential backoff.
- Streaming support: If the API supports streaming, show partial content quickly and continue receiving tokens.
- Post-filtering: Apply domain whitelists/blacklists to remove low-quality or NSFW sources before presenting to users.
- Caching: Cache retrieved documents and model outputs for repeated queries; cache keys should include the normalized query, and entries should carry a TTL.
- Monitoring: Log request_id, latencies, token counts, and sample outputs for manual review.
- Fallbacks: Fall back to cached answers or a safe canned response when the retrieval layer fails (a combined retry/cache/fallback sketch follows this list).
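The sketch below combines the retry, caching, and fallback patterns from the list above into one wrapper. It assumes you supply your own call_sonar function that performs the actual API request and raises on 429/5xx; the TTL, retry count, and canned response are placeholder values.

```python
# Combined retry-with-backoff, caching, and fallback wrapper. call_sonar is
# assumed to be your own function that performs the API request and raises on
# 429/5xx; TTL, retry count, and the canned response are placeholders.
import hashlib
import random
import time

CACHE: dict[str, tuple[float, dict]] = {}    # cache_key -> (expires_at, response)
CACHE_TTL_S = 15 * 60                        # 15-minute TTL; tune to your freshness needs
FALLBACK = {"answer": "Live search is unavailable right now, please retry shortly.",
            "citations": []}

def cache_key(query: str) -> str:
    normalized = " ".join(query.lower().split())   # normalize before hashing
    return hashlib.sha256(normalized.encode()).hexdigest()

def answer_with_resilience(query: str, call_sonar, max_retries: int = 3) -> dict:
    key = cache_key(query)
    hit = CACHE.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                              # fresh cached answer
    for attempt in range(max_retries):
        try:
            response = call_sonar(query)           # your API call; raises on 429/5xx
            CACHE[key] = (time.time() + CACHE_TTL_S, response)
            return response
        except Exception:
            if attempt < max_retries - 1:
                time.sleep((2 ** attempt) + random.random())  # exponential backoff + jitter
    # The search/model layer kept failing: serve a stale cached answer if one
    # exists, otherwise a safe canned response.
    return hit[1] if hit else FALLBACK
```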
Common problems with Sonar-Medium-Online and fixes
Garbled Output or Token Loops
- Regenerate (sometimes transient).
- Increase model tier to Pro/Deep to see if behavior persists.
- Add explicit stop tokens and lower max_tokens.
- Save request_id and payload, and file a bug with support if reproducible.
Low-Quality or NSFW Citations
- Use server-side domain whitelists and blacklists (see the filtering sketch after this list).
- Post-filter results and remove suspicious domains.
- Report reproducible examples to the community/support.
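A server-side post-filter for citations can be as simple as the sketch below; the allow/deny lists and the assumption that citations arrive as a flat list of URLs are placeholders to adapt.

```python
# Server-side domain filtering for citations. The allow/deny lists and the
# flat-list-of-URLs citation shape are assumptions to adapt to your pipeline.
from urllib.parse import urlparse

DENYLIST = {"example-spam.com", "nsfw-site.example"}
ALLOWLIST: set[str] = set()        # leave empty to allow any non-denied domain

def filter_citations(urls: list[str]) -> list[str]:
    kept = []
    for url in urls:
        domain = urlparse(url).netloc.lower().removeprefix("www.")
        if domain in DENYLIST:
            continue
        if ALLOWLIST and domain not in ALLOWLIST:
            continue
        kept.append(url)
    return kept
```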
Unexpected Costs
- Enforce max_tokens and cap search_context_size.
- Use progressive summarization.
- Set alerts on billing dashboards and periodically audit heavy users/queries.
Retrieval misses/false positives
- Increase search_context_size or tune retriever query generation.
- Use domain-limited retrieval for sensitive apps.
- Preprocess user queries to disambiguate entities (e.g., canonicalize names, include dates); a small preprocessing sketch follows this list.
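A lightweight query-preprocessing step might look like the sketch below; the canonical-name map and the append-the-current-year heuristic are illustrative placeholders, not a real entity resolver.

```python
# Lightweight query preprocessing before retrieval. The canonical-name map and
# the date heuristic are illustrative placeholders, not a real resolver.
import datetime

CANONICAL_NAMES = {
    "k8s": "Kubernetes",
    "pg": "PostgreSQL",
}

def preprocess_query(query: str, add_year: bool = True) -> str:
    words = [CANONICAL_NAMES.get(w.lower(), w) for w in query.split()]
    out = " ".join(words)
    if add_year:
        out += f" {datetime.date.today().year}"   # bias retrieval toward fresh pages
    return out

print(preprocess_query("k8s release notes"))  # e.g. "Kubernetes release notes 2025"
```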
Sonar-Medium-Online vs Alternatives
Short verdict: Choose Sonar when you need live web grounding + citations, and you care about latency. Use deeper tiers for research, and different stacks for multimodal or heavy fine-tuning.
| Feature / Need | Sonar-Medium-Online | Sonar-Pro / Deep | Large generative LLMs (no live retrieval) | Specialized stacks (Claude, Google models) |
| --- | --- | --- | --- | --- |
| Live web grounding | Yes | Yes (deeper) | Varies | Some integrations |
| Cost | Mid | Higher | Varies | Varies |
| Best for | Fast Q&A, citation-aware UIs | Research reports | General assistants | Safety-critical / enterprise |
| Citation depth | Medium | High | Low | Medium / varies |
Decision checklist
- Do you need citations + fresh web content? → Sonar family.
- Is sub-second latency critical? → prefer medium tier and small retrieval context.
- Need deep analysis across many sources? → Sonar-Pro/Deep or custom RAG stack.
Pros & Cons of Sonar-Medium-Online
Pros
- Built for live retrieval, grounding, and citations.
- Good latency for short queries.
- Flexible tiers to control cost/quality tradeoffs.
Cons
- Some community reports of instability (rare token loops).
- Citation quality depends on retrieval quality; post-filtering is often needed.
- Pricing depends on retrieval context size; test with your workload.
Sonar-Medium-Online FAQs
Q: What is Sonar-Medium-Online best suited for?
A: Fast, citation-aware Q&A and short summaries where you need up-to-date grounding without paying for higher-end reasoning tiers.
Q: How do I keep per-request costs under control?
A: Use token limits, concise system instructions, chunking, caching, and the cost calculator to estimate per-request price. Enforce max_tokens and prefer progressive expansion.
Q: Are there known stability issues?
A: Some users report garbled or looping outputs in certain builds. If you see reproducible problems, collect request/response IDs and report them to Perplexity support or the community.
Q: Where can I find authoritative documentation and pricing?
A: Perplexity’s official model docs and pricing pages are the canonical sources. Add links to those pages in your article.
Conclusion: Sonar-Medium-Online
Sonar-Medium-Online is a practical, mid-tier RAG model for delivering quick, citation-aware answers at moderate cost. It’s ideal for chat UIs, support bots, and news Q&A. Always benchmark with your real prompts, log reproducible failures, and publish your “how we tested” details to improve trust.

