
Muse Spark: When Meta Betrays Open Source and Rejoins the Race from Behind

After the Llama 4 scandal and Alexandr Wang's ground-up rebuild, Meta launches its first closed-source model Muse Spark in an existential bet on the future of AI.


AI DayaHimour Team

April 11, 2026


Between the April 8, 2026 announcement and the April 2025 scandal, only twelve months had passed for Meta. But they were enough to change everything: the architecture, the strategy, the ambition, and the company's very identity.

On that day, Meta unveiled Muse Spark — the first model produced by Meta Superintelligence Labs (MSL), the research arm established in mid-2025 after one of the most critical moments in the company’s technical history. The model is available for free via meta.ai and the Meta AI app from the moment of announcement, and is on its way to rolling out across WhatsApp, Instagram, Facebook, Messenger, and Ray-Ban smart glasses in the coming weeks.

What makes this launch an event beyond a mere new-model announcement is not its technical capabilities alone, but the decision embedded in it: Muse Spark is entirely closed-source. No weights, no self-hosting, no public API. This announcement, simply put, marks the end of an entire era for Meta.

From Avocado to Muse Spark: Nine Months from Scratch

To understand what this launch represents, one must return to April 2025. Meta launched Llama 4 to reactions it did not anticipate. Quickly, independent researchers discovered that the version the company submitted to the LM Arena platform was not the same version available to the public — but a version specifically optimized to improve benchmark results. Later, additional investigations revealed that Meta had privately tested 27 different variants of Llama 4 and selected the best-performing one for submission to the platform.

The scandal was costly for a reason deeper than mere deception: Meta had built its AI reputation on the discourse of transparency and open source. Llama 3 was the star of the open community. Llama 4 turned that into a public relations nightmare. According to later investigations, the Llama 4 team resorted in the final training stages to mixing test data into training data, leading to severe overfitting.

The $14.3 Billion Deal

In June 2025, Mark Zuckerberg made a radical decision: he spent $14.3 billion for a non-voting 49% stake in Scale AI, bringing its founder Alexandr Wang to assume — for the first time in the company’s history — the position of Chief AI Officer. Under the deal, Scale AI’s valuation rose to over $29 billion. According to reports, Wang maintained his board seat at Scale AI while transitioning to Meta.

In August 2025, Wang announced via an internal memo that MSL would be divided into four divisions: AI Research, Superintelligence Research, Product Development, and Infrastructure. The dissolution of the AGI Foundations team and redistribution of its members was also announced, further centralizing authority around Wang. In November 2025, Yann LeCun — one of AI’s “godfathers” — left Meta after 12 years to establish an independent lab focused on “world models.”

What Is Muse Spark, Actually?

Muse Spark is a natively multimodal model — not merely a text model later fine-tuned to process images, but designed from the ground up to integrate text, images, and audio as inputs within a single unified framework. The model natively supports tool-use, visual chain of thought, and multi-agent orchestration. The context window reaches 262,000 tokens.

What truly distinguishes Muse Spark is not these features per se — most have become the “minimum bar” for any leading model in 2026 — but the core design philosophy. According to Wang, the team rebuilt everything from scratch: the architecture, the training data pipeline, the computational infrastructure — including the massive Hyperion data center.

Benchmarks: A Complex Picture

Official Benchmarks: Muse Spark vs. Competitors (April 2026)

Source: Official Meta AI Blog. 🏆 marks a category lead for Muse Spark; ⚠️ marks a documented gap.

📷 Multimodal

CharXiv Reasoning (chart understanding) 🏆: Muse Spark 86.4 / GPT-5.4 82.8 / Gemini 3.1 Pro 80.2 / Opus 4.6 65.3
MMMU Pro (multidisciplinary understanding): Muse Spark 80.4 / Gemini 3.1 Pro 83.9 / GPT-5.4 81.2 / Opus 4.6 77.4
SimpleVQA (visual accuracy): Muse Spark 71.3 / Gemini 3.1 Pro 72.4 / Opus 4.6 62.2 / GPT-5.4 61.1
ZeroBench (multi-step visual reasoning): Muse Spark 33.0 / GPT-5.4 41.0 / Gemini 3.1 Pro 29.0 / Opus 4.6 not reported

🧠 Text / Reasoning

Humanity's Last Exam, without tools: Muse Spark 42.8% / Gemini 3.1 Pro 45.4% / GPT-5.4 43.9% / Opus 4.6 40.0%
Humanity's Last Exam, with tools: Muse Spark 50.4% / Opus 4.6 53.1% / GPT-5.4 52.1% / Gemini 3.1 Pro 51.4%
GPQA Diamond (PhD-level reasoning): Muse Spark 89.5% / Gemini 3.1 Pro 94.3% / GPT-5.4 92.8% / Opus 4.6 92.7%
LiveCodeBench Pro (competitive programming): Muse Spark 80.0% / GPT-5.4 87.5% / Gemini 3.1 Pro 82.9% / Opus 4.6 70.7%
ARC-AGI-2 (abstract reasoning) ⚠️: Muse Spark 42.5% / Gemini 3.1 Pro 76.5% / GPT-5.4 76.1% / Opus 4.6 63.3% (largest gap)

🏥 Health

HealthBench Hard 🏆: Muse Spark 42.8% / GPT-5.4 40.1% / Gemini 3.1 Pro 20.6% / Grok 4.2 20.3% / Opus 4.6 14.8%
MedXpertQA Multimodal: Muse Spark 78.4% / Gemini 3.1 Pro 81.3% / GPT-5.4 77.1% / Opus 4.6 64.8%

🤖 Agentic

DeepSearchQA (agentic search) 🏆: Muse Spark 74.8% / Opus 4.6 73.7% / GPT-5.4 73.6% / Gemini 3.1 Pro 69.7%
SWE-Bench Verified (agentic coding): Muse Spark 77.4% / Opus 4.6 80.8% / Gemini 3.1 Pro 80.6% / Grok 4.2 76.7%
Terminal-Bench 2.0 ⚠️: Muse Spark 59.0% / GPT-5.4 75.1% / Gemini 3.1 Pro 68.5% / Opus 4.6 65.4%
τ²-Bench Telecom (tool use): Muse Spark 91.5 / Grok 4.2 96.5 / Gemini 3.1 Pro 95.6 / Opus 4.6 92.1
GDPval-AA Elo (office tasks) ⚠️: Muse Spark 1444 / GPT-5.4 1672 / Opus 4.6 1606 / Gemini 3.1 Pro 1320

These numbers reveal a clearly mixed picture. The model leads globally in health (HealthBench Hard: 42.8%, ahead of GPT-5.4 at 40.1%), in visual understanding (CharXiv Reasoning: 86.4%), and in agentic tasks (DeepSearchQA: 74.8%). But it lags sharply in competitive programming (LiveCodeBench Pro: 80.0% vs. 87.5% for GPT-5.4), and in abstract reasoning in particular.

On ARC-AGI-2 specifically — the benchmark that tests recognition of entirely novel patterns that cannot be memorized — Muse Spark scores 42.5% while Gemini 3.1 Pro and GPT-5.4 reach approximately 76%. The shortfall here is not a mere lag; it is a structural gap suggesting that the model still struggles with tasks requiring abstract symbolic reasoning far removed from the text and images it was trained on.
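To make the size of these weak spots concrete, here is a quick back-of-the-envelope calculation of Muse Spark's deficit against the best reported competitor on each of the lagging benchmarks, using only the official numbers quoted above (the script itself is illustrative, not from Meta):

```python
# Deficit of Muse Spark vs. the best reported competitor score,
# using the official April 2026 figures quoted in this article.
scores = {
    # benchmark: (muse_spark, best_competitor)
    "ARC-AGI-2":          (42.5, 76.5),   # best: Gemini 3.1 Pro
    "LiveCodeBench Pro":  (80.0, 87.5),   # best: GPT-5.4
    "Terminal-Bench 2.0": (59.0, 75.1),   # best: GPT-5.4
}

for name, (muse, best) in scores.items():
    gap_pts = best - muse                 # absolute gap in points
    gap_rel = gap_pts / best * 100        # gap relative to the leader
    print(f"{name}: {gap_pts:.1f} pts behind ({gap_rel:.0f}% relative)")
```

The relative view is what makes ARC-AGI-2 stand out: the coding gaps are single-digit to low-double-digit percentages of the leader's score, while the abstract-reasoning deficit exceeds 40% of it.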

Meta itself acknowledged in its technical post the existence of “performance gaps” in long-horizon agent systems and programming workflows. But the company points out that Muse Spark is “the first and smallest model in the series,” hinting that larger upcoming models may close these gaps.

The Break with Open Source

The weightiest decision was not technical. Muse Spark is entirely closed-source — the weights are not available for download, there is no self-hosting, and API access is currently restricted to a private preview for selected partners, with no announced pricing or timeline for general availability.

This shift starkly contradicts Zuckerberg’s previous rhetoric, in which he stated that “open source represents the best opportunity for the world to harness this technology to create the greatest economic opportunity and safety for all.” One analyst described Muse Spark as “closed like the private school Zuckerberg attended.” Some commentators view the closure not as a change in philosophy but as an implicit acknowledgment that “open source stopped being a competitive advantage and became a competitive burden.”

The tech community, which had built thousands of projects and studies on Llama, received this reversal with widespread skepticism. On Reddit, the r/LocalLLaMA community, thousands of developers who rely on Meta's open models, reacted angrily.

Three Billion Users as a Deployment Arena

What Muse Spark possesses that none of the competitor models do is the immediate deployment arena: over three billion people use Meta’s apps daily. The model will reach WhatsApp, Instagram, Facebook, and Messenger within weeks — not as an option the user seeks out, but as an assistant embedded in the products they already use.

This scale of deployment gives Meta an exceptional advantage in collecting real-world usage data, thereby continuously improving the model. The Shopping Mode that Meta is testing adds a behavioral data layer derived from user interactions across its platforms — from purchases to interactions with ads and content.

The Cost: Between $115 and $135 Billion

During the announcement of Q4 2025 results, Meta revealed its capital expenditure plan for 2026: between $115 billion and $135 billion, roughly 1.6 to 1.9 times the $72.22 billion spent in 2025. The figure exceeds analyst expectations of around $110 billion. A significant portion of these funds is directed toward funding MSL, expanding data centers, and strengthening third-party cloud capacity.
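The scale of the jump is easy to verify from the figures above (a simple sanity check, not an official Meta calculation):

```python
# Sanity check on Meta's 2026 capex guidance vs. actual 2025 spend,
# using the figures reported in this article (all in billions of USD).
capex_2025 = 72.22
guidance_low, guidance_high = 115.0, 135.0

midpoint = (guidance_low + guidance_high) / 2
print(f"Midpoint of 2026 guidance: ${midpoint:.0f}B")
print(f"Growth vs. 2025 at midpoint: {midpoint / capex_2025:.2f}x")
print(f"Growth vs. 2025 at low end:  {guidance_low / capex_2025:.2f}x")
print(f"Growth vs. 2025 at high end: {guidance_high / capex_2025:.2f}x")
```

Even the bottom of the guided range is well above half again what Meta spent in 2025, and the top end approaches a doubling.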

This scale of spending explains the logic of closure: when spending at this level, it becomes difficult to give away the weights for free. But the deeper question is: was there another way? Could Meta have remained in the open-source camp while maintaining its competitive edge?

The Roadmap: Larger Models Ahead

Wang indicated on X that Muse Spark is merely the beginning: “This is the first step. Larger models are in development, and there are plans to open-source future releases.” But the tech community received this promise with clear skepticism — the announcement of Muse Spark without open weights makes it, for now, exclusive to Meta’s ecosystem alone.

Internal sources indicate that the codename “Avocado” referred specifically to Muse Spark, and that the next model in the Muse series is already in development. Meta has not mentioned any specific timeline for future releases.

Performance Gaps and Independent Verification

Beyond the official numbers, there is another story. Independent evaluations — such as those conducted by Artificial Analysis after obtaining early access from Meta — paint a more conservative picture. In the Humanity’s Last Exam test, Artificial Analysis recorded 39.9% for Muse Spark, trailing Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (41.6%). However, it is important to note that these evaluations were not “fully independent,” as the access itself came through Meta.

Independent evaluations suggest that Muse Spark is “entering the leading group” without being “a leader in every domain.” In areas such as programming and abstract reasoning, the model remains behind. This aligns with what official benchmarks showed, but adds a layer of caution: the official numbers reveal superiority in some areas and deficiency in others, but the magnitude of some of these gaps may be larger than the initial figures suggest.
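The divergence between the official and independent Humanity's Last Exam numbers can be quantified directly from the figures quoted in this article (a small illustrative comparison, not an official methodology):

```python
# Humanity's Last Exam (without tools): official numbers from Meta's
# blog vs. the Artificial Analysis measurements quoted in this article.
official    = {"Muse Spark": 42.8, "GPT-5.4": 43.9, "Gemini 3.1 Pro": 45.4}
independent = {"Muse Spark": 39.9, "GPT-5.4": 41.6, "Gemini 3.1 Pro": 44.7}

for model in official:
    delta = official[model] - independent[model]
    print(f"{model}: official {official[model]} vs. "
          f"independent {independent[model]} (delta {delta:+.1f} pts)")
```

Notably, the official-to-independent drop is largest for Muse Spark itself, which is exactly the kind of discrepancy that invites caution after the Llama 4 episode.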

Release Season: A Race That Never Stops

The Muse Spark announcement came in the same week that Cursor announced its third version and Claude Code expanded into Auto Mode. Competitors have not paused. OpenAI continues developing GPT-5.4, Google is strengthening Gemini, Anthropic is pushing Claude forward. In this context, Meta’s return to the race does not mean winning it — it only means the company is no longer out of the race.

The same week also saw Meta announce new collaborations with app developers to integrate Muse Spark into their products via API — a move indicating that the company does not intend to rely solely on its own apps, but rather seeks to build a closed ecosystem around its new model.


What remains open is not the question of Muse Spark's quality, which appears convincing within its announced limits for health and science, but the credibility of the methodology behind it. After the Llama 4 experience, the definitive judgment will depend on what independent evaluation reveals outside Meta's internal labs. And this is perhaps the real challenge Alexandr Wang now faces: not building a powerful model, but the harder task of restoring the trust of a community that once felt betrayed.

On X, one AI developer wrote: “Llama 4 was a scandal. Muse Spark is closed. When do we trust Meta again?” The answer to that question may take longer than the nine months it took to build the model. And perhaps the answer will not come from comparing numbers on benchmarks, but from daily testing, in the apps of millions, over the months ahead.

Muse Spark · Meta · Meta Superintelligence Labs · Alexandr Wang · 2026


Related Articles

FLUX.2 Pro: Black Forest Labs' Production Model for 4-Megapixel Image Generation and Editing (Apr 4, 2026). Comprehensive review of FLUX.2 Pro released in November 2025, with technical specifications, multi-reference editing capabilities, and practical applications.

Kimi K2.5: The Chinese Model Redefining the Boundaries of Open-Source Performance (Apr 9, 2026). A comprehensive analysis of Moonshot AI's Kimi K2.5: trillion-parameter architecture, parallel agent swarm, and performance that challenges GPT-5.4 at a fraction of the cost.

Qwen3.6 Plus: Qwen's Next-Generation Model Launched as a Free Preview on OpenRouter with 1-Million-Token Context (Apr 2, 2026). Launch of Qwen3.6 Plus Preview as a free beta on OpenRouter (expires 3 April 2026), with detailed analysis of its hybrid architecture, agentic coding capabilities, multimodal vision, official benchmark results, and practical applications for developers.