👾 The AI Coding Showdown: 6 Platforms, 1 App, Shocking Results

New model releases from OpenAI (GPT-5 and gpt-oss), Anthropic (Opus 4.1), and Alibaba (Qwen3 Flash series). Plus: six AI coding platforms tested, with results.

Welcome to The Agent Roundup

So overall, if we’re given the time, I believe in human ingenuity. I think we’ll get this right. I think also, humans are infinitely adaptable. I mean, look where we are today. Our brains were evolved for a hunter-gatherer lifestyle and we’re in modern civilisation. The difference here is, it’s going to be 10 times bigger than the Industrial Revolution, and maybe 10 times faster.

Demis Hassabis - Head of Google DeepMind

This week’s topics:

  • OpenAI releases GPT-5 and the open-weight gpt-oss models

  • Anthropic further enhances Claude with Opus 4.1

  • Alibaba leading open models with enhanced Qwen3

  • Test results of 6 vibe coding AI agent platforms

  • Plus market updates, trending AI tools, community highlights, and more

AI Agent News Roundup

💥 Breakthroughs

OpenAI GPT-5 & OSS

[Chart: GPT-5 math benchmark. Source: OpenAI]

GPT-5 is now available in five variants:

  • GPT-5 and GPT-5 Pro: The latter comes with extended reasoning.

  • GPT-5 Mini: Designed for lighter-weight reasoning tasks, with reduced latency and cost.

  • GPT-5 Chat: Designed for conversations.

  • GPT-5 Nano: The fastest variant, for real-time applications.

OpenAI also released two open-weight models:

  • gpt-oss-120b: For general-purpose use cases; optimized to run on a single 80GB GPU such as the H100.

  • gpt-oss-20b: For lower-latency use cases; needs at least 16GB of GPU VRAM.
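As a rough sanity check on those VRAM figures, here is a back-of-envelope sketch of the memory needed just to store the weights. It rests on two assumptions that are not from this newsletter: that the parameter counts equal the nominal sizes in the model names, and that weights are stored at roughly 4.25 bits each (a 4-bit format plus shared scaling factors). Activations and KV cache are ignored, so real usage will be higher.

```python
# Back-of-envelope weight-memory estimate (a sketch, not a measurement).
# ASSUMPTIONS: parameter counts are the nominal sizes in the model names,
# and weights are stored at roughly 4.25 bits each. Neither figure comes
# from the newsletter. Activations and KV cache are ignored.

BITS_PER_WEIGHT = 4.25  # assumed storage cost per parameter

MODELS = {
    "gpt-oss-120b": {"params": 120e9, "vram_gb": 80},
    "gpt-oss-20b": {"params": 20e9, "vram_gb": 16},
}


def weight_memory_gb(params: float, bits_per_weight: float = BITS_PER_WEIGHT) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def estimate_all() -> dict[str, float]:
    """Estimate weight memory for every model in MODELS."""
    return {name: weight_memory_gb(spec["params"]) for name, spec in MODELS.items()}


if __name__ == "__main__":
    for name, gb in estimate_all().items():
        budget = MODELS[name]["vram_gb"]
        print(f"{name}: ~{gb:.1f} GB of weights vs {budget} GB of VRAM")
```

Under these assumptions both estimates land below the stated VRAM figures, which leaves some headroom for activations and context.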

Claude Opus 4.1

[Chart: Claude Opus 4.1 benchmark. Source: Anthropic]

Anthropic released Claude Opus 4.1 with upgrades to agentic tasks, coding, and reasoning:

  • Performance on the SWE-bench Verified benchmark improved to 74.5%, up from 72.5% for Claude Opus 4.

  • Research and data analysis skills improved, particularly detail tracking and agentic search.

The upgrade targets real-world coding, in-depth research, and data analysis tasks that require close attention to detail, in line with the growing demand for more precise AI-driven development and research tools.

Flash series of Qwen3

[Chart: Qwen3 benchmark. Source: Alibaba]

Alibaba released:

  • Qwen3-Coder-Flash: A 30B-parameter model with a context window of up to 1M tokens.

  • Qwen3-Coder-480B-A35B-Instruct.

  • Qwen3-2507 in two variants: Qwen3-Instruct-2507 and Qwen3-Thinking-2507.

Capabilities:

  • AI-powered analysis and generation across entire codebases in a single prompt.

  • Significant improvements in instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.

These releases position Alibaba competitively in the coding AI space.

📈 More Market Updates

🇺🇸 Domino's: Approximately 80% of North American phone orders are now processed through an AI assistant featuring localized speech patterns, powered by text-to-speech technology from San Francisco-based Rime Labs.

🇺🇸 Meta: $65 billion allocated for AI initiatives in 2025, targeting nearly 1 gigawatt of computing capacity supported by over 1.3 million GPUs.

🇺🇸 Microsoft: Over $30 billion in capital expenditures planned for current quarter to scale cloud platform and expand AI-focused data center infrastructure.

🇺🇸 Amazon: More than $100 billion committed to AWS AI development, described by CEO Andy Jassy as "once-in-a-lifetime" market opportunity.

🇨🇦 Shopify: New commerce-focused features including checkout integration tools, global product search capabilities, and universal cart functionality for AI agents.

An AI scheduling assistant that lives up to the hype.

Skej is an AI scheduling assistant that works just like a human. You can CC Skej on any email, and watch it book all your meetings. Skej handles scheduling, rescheduling, and event reminders. Imagine life with a 24/7 assistant who responds so naturally, you’ll forget it’s AI.

I Put 6 Vibe Coding Apps to The Ultimate Test – The Results Shocked Me

[Screenshot: My vibe-coded investment portfolio tracker (Replit version)]

Can AI really replace developers? I put this question to the ultimate test by building the same investment portfolio tracker across 6 different AI coding platforms. What I discovered will change how you think about AI-powered development.

🎯 The Challenge: One App, Six Platforms

I tasked each platform with building a comprehensive investment portfolio tracker with these exact requirements:

The Benchmark App:

  • Real-time data integration for Bitcoin, Ethereum, Nasdaq stocks, global equities, and ETFs

  • Interactive dashboard with account overview, 12-month performance tracking

  • Dual visualization via line charts (portfolio value over time) and pie charts (asset allocation)

  • Transaction management system with buy/sell functionality

  • Cash account integration with automatic balance updates

  • Complete transaction history in chronological order

  • Mobile-responsive design with modern dark theme

Each platform got the identical prompt. No technical specifications. No second chances.
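To make the transaction requirements concrete, here is a minimal sketch of the core logic the prompt asks for: buy/sell operations that update a cash balance automatically, a chronological transaction history, and the asset allocation that would feed the pie chart. It is not the code any of the platforms generated. The class names, ticker symbols, and prices are hypothetical, and prices are passed in by hand rather than fetched from a real-time feed.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Transaction:
    timestamp: datetime
    side: str  # "buy" or "sell"
    symbol: str
    quantity: float
    price: float


@dataclass
class Portfolio:
    """Hypothetical in-memory portfolio; no persistence or live prices."""
    cash: float
    holdings: dict[str, float] = field(default_factory=dict)
    history: list[Transaction] = field(default_factory=list)

    def buy(self, symbol: str, quantity: float, price: float, when: datetime) -> None:
        cost = quantity * price
        if quantity <= 0 or price <= 0:
            raise ValueError("quantity and price must be positive")
        if cost > self.cash:
            raise ValueError("insufficient cash")
        self.cash -= cost  # the cash account updates automatically
        self.holdings[symbol] = self.holdings.get(symbol, 0.0) + quantity
        self.history.append(Transaction(when, "buy", symbol, quantity, price))

    def sell(self, symbol: str, quantity: float, price: float, when: datetime) -> None:
        held = self.holdings.get(symbol, 0.0)
        if quantity <= 0 or price <= 0:
            raise ValueError("quantity and price must be positive")
        if quantity > held:
            raise ValueError("insufficient holdings")
        remaining = held - quantity
        if remaining > 0:
            self.holdings[symbol] = remaining
        else:
            self.holdings.pop(symbol, None)  # position fully closed
        self.cash += quantity * price
        self.history.append(Transaction(when, "sell", symbol, quantity, price))

    def chronological_history(self) -> list[Transaction]:
        """Transactions sorted oldest first, regardless of entry order."""
        return sorted(self.history, key=lambda t: t.timestamp)

    def allocation(self, prices: dict[str, float]) -> dict[str, float]:
        """Fraction of total value per asset (including cash), for a pie chart."""
        values = {s: q * prices[s] for s, q in self.holdings.items()}
        values["CASH"] = self.cash
        total = sum(values.values())
        return {k: v / total for k, v in values.items()} if total else {}


if __name__ == "__main__":
    p = Portfolio(cash=10_000.0)
    p.buy("BTC", 0.5, 8_000.0, datetime(2025, 1, 2))
    p.buy("QQQ", 5, 400.0, datetime(2025, 1, 3))
    p.sell("QQQ", 2, 500.0, datetime(2025, 2, 1))
    print(p.cash, p.allocation({"BTC": 8_000.0, "QQQ": 500.0}))
```

Even this toy version shows where generated apps can go wrong: every trade has to touch the holdings, the cash balance, and the history together, or the dashboard numbers drift apart.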

🏆 The Results That Shocked Me

1st Place: Caffeine AI (7.7/10) 🥇

The dark horse that came out of nowhere! This platform nailed the user experience with fast generation, solid reliability, and a mostly functional app after the initial prompt.

[Screenshot: investment portfolio tracker built with Caffeine AI]

Pros
  • Simple and easy-to-use interface

  • Entire app runs on a blockchain, which makes it more secure

  • All-in-one (persistent storage, authentication, one-click deployment)

Cons
  • Alpha version, not yet released to the general public

  • Lacks some features (code editing/export, file upload)

  • UI design quality a bit behind competitors

2nd Place: Firebase Studio (7.2/10) 🥈

Google's offering delivered moderate results after the initial prompt: a few features worked, but fixing the remaining issues took more prompts.

[Screenshot: investment portfolio tracker built with Firebase Studio]

Pros
  • Suits both beginners and advanced users

  • Full IDE support with Gemini models

  • Great customization and available features

Cons
  • Steeper learning curve for beginners than the other platforms

3rd Place: v0 (6.3/10) 🥉

With Vercel's offering, I needed three attempts and manual bug fixing to get the preview running. The UI quality is high, but the app needs more work to become fully functional.

[Screenshot: investment portfolio tracker built with v0]

Pros
  • Easy to use

  • Good app quality

  • Rich features (code export, database integrations, deployment on Vercel)

Cons
  • Very simple code editor

  • Slow code generation

4th Place: Replit (6.0/10)

The coding giant delivered solid fundamentals but lacked the polish for production-ready apps. After the initial prompt, none of the features worked, but the app dashboard looked great.

[Screenshot: investment portfolio tracker built with Replit]

Pros
  • Ease of use

  • Supported features (file uploads, prompt enhancer, deployment, database)

  • Great UI design in apps

Cons
  • Very slow code generation

  • Vendor lock-in to Replit

  • None of the prompted features worked

5th Place: Lovable (5.5/10)

Despite the charming name, this platform struggled with prompt alignment and complex feature integration. It produced the simplest app of all the tested platforms.

[Screenshot: investment portfolio tracker built with Lovable]

Pros
  • Easy to use and has all the necessary features

Cons
  • Doesn’t allow code editing or export in the free version

  • Slow code generation and low app quality

  • Apps are always public in free mode

6th Place: Bolt.new (FAIL)

The platform that almost made me rage quit. It broke down completely on core functionality and couldn't spin up the app.

[Screenshot: investment portfolio tracker attempt on Bolt.new]

  • No rating. Consumed all the tokens for the day without a visual result.

📊 The Score Breakdown

Here's how each platform performed across critical developer metrics:

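| Metric | Replit | Firebase Studio | Lovable | Bolt.new | v0 | Caffeine AI |
|---|---|---|---|---|---|---|
| Ease of Use | 8 | 6 | 7 | n/a | 7 | 9 |
| Speed of Generation | 4 | 7 | 5 | n/a | 5 | 9 |
| App Quality | 6 | 6 | 5 | n/a | 7 | 7 |
| Supported Features | 7 | 9 | 7 | n/a | 7 | 6 |
| Customization/Flexibility | 5 | 8 | 4 | n/a | 6 | 6 |
| Scalability/Reliability | 6 | 7 | 5 | n/a | 6 | 9 |
| Total Average | 6.0 | 7.2 | 5.5 | FAIL | 6.3 | 7.7 |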
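The totals in the score breakdown are consistent with a simple unweighted mean of the six metric scores, rounded to one decimal. The sketch below recomputes them and the resulting ranking. Equal weighting is an assumption inferred from the numbers, not a stated methodology; Bolt.new is left out because it received no rating.

```python
# Per-platform scores in metric order: ease of use, speed of generation,
# app quality, supported features, customization/flexibility,
# scalability/reliability. Bolt.new is omitted (no rating).
SCORES = {
    "Replit": [8, 4, 6, 7, 5, 6],
    "Firebase Studio": [6, 7, 6, 9, 8, 7],
    "Lovable": [7, 5, 5, 7, 4, 5],
    "v0": [7, 5, 7, 7, 6, 6],
    "Caffeine AI": [9, 9, 7, 6, 6, 9],
}


def total_average(scores: list[int]) -> float:
    """Unweighted mean rounded to one decimal (assumed methodology)."""
    return round(sum(scores) / len(scores), 1)


def ranking() -> list[tuple[str, float]]:
    """Platforms sorted by total average, best first."""
    totals = {name: total_average(s) for name, s in SCORES.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for place, (name, avg) in enumerate(ranking(), start=1):
        print(f"{place}. {name}: {avg}")
```

Changing the weights, for example counting App Quality double, would be a one-line change in `total_average` and could reorder the middle of the table, so the ranking is best read alongside the individual metric scores.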

💡 Key Takeaways

🔹 Features: Most platforms offer similar features; Firebase Studio additionally lets you edit the code in a web IDE.

🔹 The Reliability Gap: Security is nonexistent in most of the apps. Upgrading an app after it's live is difficult without losing user data. Caffeine AI is a promising alternative thanks to its decentralized architecture, integrated authentication system, and data persistence.

🔹 Great for Rapid Prototyping and MVPs: AI coding tools are rapidly evolving, but they’re not ready to replace seasoned developers. However, they’re powerful for rapid prototyping and MVP development.

Tool Spotlight

A 20B MMDiT image foundation model that delivers significant advances in complex text rendering and precise image editing, especially strong at creating stunning graphic posters with native text.

A general-purpose world model that can generate an unprecedented diversity of interactive environments from text prompts, creating dynamic worlds navigable in real time at 24 frames per second at 720p resolution.

ElevenLabs' latest AI Audio innovation featuring a studio-grade AI engine built from high-quality stems, delivering 44.1kHz audio that can closely follow lyrics, key, and BPM across complex, multi-layered prompts.

Google's asynchronous, agent-based coding tool powered by Gemini 2.5 Pro that integrates with GitHub, clones codebases into Google Cloud virtual machines, and is now officially out of beta after thousands of developers completed tens of thousands of tasks.

MiniMax's next-generation speech generation model featuring significantly enhanced multilingual performance, more realistic voice cloning accuracy, and expanded language support to over 40 languages.

More Resources

Blog: In-depth articles on AI workflows and practical strategies for growth
AI Tool Collection: Discover and compare validated AI solutions
Consultancy: Explore AI potential or make your team AI-fit
Agency: Production-ready AI implementation services

See you next time!

Tobias