Microsoft Unveils MAI-Thinking-1, Its First AI Reasoning Model
At Build 2026, Microsoft announced the arrival of new in-house AI models, now directly available through Microsoft Foundry and Azure Speech. This new generation of models targets four use cases: text, image, voice, and transcription.
The MAI models, for Microsoft AI, are built directly by Microsoft, without OpenAI. This is an important detail that marks a clear break from the company led by Sam Altman.
Table of Contents
Reasoning with MAI-Thinking-1
For logical processing and text generation, Microsoft introduces MAI-Thinking-1, its very first LLM dedicated to reasoning. Until now, the MAI family consisted of only 4 models dedicated to image generation and audio capabilities (available in Foundry since April). As mentioned earlier, this model was trained from scratch on proprietary data, without relying on the training of a third-party model.
Positioned as a mid-sized model, it is based on a Mixture-of-Experts (MoE) architecture, which selectively activates only the parts of the model needed for each request. The goal: increase the model's overall capabilities without dramatically increasing compute requirements. In a way, this architecture makes the model more efficient.
"MAI-Thinking-1 is particularly well suited to enterprise use cases that often require deep contextual understanding: long-document analysis, complex multi-step reasoning, and processing extended agent traces without fragmentation or stitching.", Microsoft says.
How does this model compare with the competition? I'm thinking in particular of Claude Opus by asking this question.
Microsoft addresses this point. It states that MAI-Thinking-1 matches Claude Opus 4.6 on the SWE-Bench Pro benchmark, with a notable difference for customers: the cost would be significantly lower. It is therefore less capable than GPT-5.5, Claude Opus 4.8, and Gemini 3.5-flash. However, depending on the usage scenario, results may vary.
This model is currently available in private preview (accessible on request via this form).
Image generation with MAI-Image-2.5
Microsoft also unveiled a family of models dedicated to image generation and image-to-image editing. There is the standard version on one hand, and the MAI-Image-2.5-Flash version on the other, which is faster and better suited to high-volume production.
The Redmond-based company emphasizes the integration of a set of control tools, especially to maintain consistency between generated or edited images:
- Identity and character consistency: preservation of faces, hair, clothing, and overall body identity across different styles or poses.
- Style and scene control: the ability to apply a full restyle (anime effect, color grading, film grain, rejuvenation) or rearrange the scene (adding, removing, or moving objects, adjusting human poses).
- Text, graphics, and layout control: generation of typography and logos, text changes based on natural-language prompts, or creation of PowerPoint-ready infographics.

Here are the pricing details for these two models available through Microsoft Foundry:
| Model | Text input (per million tokens) | Image input (per million tokens) | Image output (per million tokens) |
| MAI-Image-2.5 | $5.00 | $8.00 | $47.00 |
| MAI-Image-2.5 Flash | $1.75 | $1.75 | $33.00 |
Voice cloning and transcription with MAI models
Finally, the last area Microsoft is targeting is audio. At Build 2026, two models were unveiled to address text-to-speech and audio transcription needs.
- MAI-Voice-2
This multilingual text-to-speech model supports 15 languages and can recreate the unique voice identity of a specific person. To do so, it uses a short reference audio sample to instantly capture the characteristics of the voice (tone, emotion, accent, rhythm, etc.).
A faster version, MAI-Voice-2 Flash, will follow later.
- MAI-Transcribe-1.5
This speech-to-text model supports a total of 43 languages. It was designed to remain accurate even in the presence of background noise, overlapping conversations, or when analyzing long meetings. It is also customizable so it can recognize proper names, brand names, or specific technical vocabulary.
On the standard multilingual FLEURS benchmark (covering 25 languages), its word error rate improves from 3.9% to 3.7% (compared with MAI-Transcribe-1), strengthening its global first-place ranking. In fact, according to Microsoft, this model ranks first in 11 key languages and outperforms Whisper-large-v3.
Here are the pricing details for these two models available through Azure Speech:
| Model | Price |
| MAI-Voice-2 | $22.00 per million characters |
| MAI-Transcribe-1.5 | From $0.36 per hour |
Ultimately, Microsoft seems determined to build a complete ecosystem of AI models developed in-house, and therefore without relying on third-party technologies. It will be interesting to follow the evolution of these models, especially compared with the competition, over the coming months.

