What Google's New Chips Mean If You Train Your Own Models

This is a submission for the Google Cloud NEXT Writing Challenge

Google just shipped two fully custom AI chips in a single year.

Not derivatives of each other. Not a "pro" and "lite" tier of the same design. Two chips, built from the ground up for fundamentally different jobs: one for training and one for inference. If you're building anything on Google Cloud that touches AI workloads, this announcement deserves more than a passing glance.

Let me break down what TPU v8 actually is, why the split-chip strategy matters, and, most practically, what it means for developers making cost and architecture decisions today.

A Quick Orientation: What Are TPUs?

Before diving into v8, some context on what TPUs are.

Google kicked off its Tensor Processing Unit program in 2013, four years before the transformer paper, a decade before the current AI arms race. The origin story is surprisingly pragmatic. Google realized that if every user interacted with Google Search via voice for just 30 seconds a day, running the matrix multiplications on general-purpose CPUs would require building "two or three additional complete Googles." The infrastructure cost was simply impossible.

Custom silicon was the answer, delivering roughly a 100x efficiency gain rather than waiting for general-purpose CPUs to catch up.

TPU v1 shipped in 2015 as a true ASIC (an application-specific chip, purpose-built for inference). Since then, generation by generation, Google has added liquid cooling, custom numeric formats (bfloat16), custom network interconnects, and training support. By the time Ironwood (TPU v7, announced at Cloud Next 2025) arrived, they had the most powerful TPU supercomputer ever built, now running at massive scale across Google's own services and Google Cloud.

So the question heading into 2026 was: how do you follow that up?

Enter TPU v8: The Year of Two Chips

The answer turned out to be: you don't ship one chip. You ship two.

TPU 8T - The Training Powerhouse

Spec                            Value
Chips per pod                   ~9,600
FP performance vs Ironwood      ~3x per pod
Scale-out network bandwidth     2x
Scale-up network bandwidth      4x
Interconnect                    Custom ICI

8T is the raw horsepower play. If you are training frontier models, fine-tuning large foundation models, or running research workloads at scale, this is your chip. The 4x scale-up bandwidth improvement is particularly significant: it means faster gradient synchronization across chips, which directly translates to shorter training runs.
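To see why that matters, here is a rough back-of-the-envelope sketch. Every number in it is an illustrative assumption (model size, bytes per gradient, link bandwidth), not a published spec; the point is only that all-reduce time scales with gradient bytes over bandwidth, so a 4x bandwidth bump shrinks the synchronization term by roughly 4x.

# Back-of-the-envelope only: the model size and bandwidth figures below are
# illustrative assumptions, not published TPU specs.

def allreduce_seconds(param_count, bytes_per_param, bandwidth_gb_per_s):
    """Rough ring all-reduce time: each chip moves roughly 2x the gradient bytes."""
    gradient_bytes = param_count * bytes_per_param
    return 2 * gradient_bytes / (bandwidth_gb_per_s * 1e9)

params = 70_000_000_000   # assumed 70B-parameter model
bf16_bytes = 2            # bfloat16 gradients

baseline = allreduce_seconds(params, bf16_bytes, bandwidth_gb_per_s=400)   # assumed baseline link
faster = allreduce_seconds(params, bf16_bytes, bandwidth_gb_per_s=1600)    # 4x scale-up bandwidth

print(f"per-step gradient sync: {baseline:.2f}s -> {faster:.2f}s")         # roughly 0.7s vs 0.2s

Multiply that per-step saving across hundreds of thousands of optimizer steps and the shorter-training-run claim stops being abstract.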

TPU 8i - The Inference Engine

Spec                            Value
Chips per pod                   1,152
FP exaflops vs Ironwood         10x per pod
HBM capacity vs Ironwood        7x (same scale-up)
Network topology                Boardfly (novel)
Primary design goal             Minimum latency

8i is where the architectural story gets really interesting. The pod is smaller in chip count but packs a 10x leap in floating point exaflops and 7x more HBM. This chip was designed around one constraint above all else: latency.

The Boardfly Network: Why Topology Matters for Agents

Here's the detail from the announcement that most headlines will miss.

Previous TPU generations used a network topology optimized for throughput — getting large volumes of data through the interconnect efficiently. Great for training. But for agentic inference workloads, what you actually care about is the minimum time for any one token to travel between chips — the network diameter.

Google collaborated with DeepMind to design a completely new interconnect topology for 8i called Boardfly. It dramatically reduces the diameter of the chip network, meaning the distance between any two chips, and therefore the latency floor, drops significantly.

Why does this matter to you as a developer?

If you are building an AI agent that needs to orchestrate sub-agents, run tool calls, stream responses, or operate under real-time constraints, your user experience is bounded by inference latency. A 10 millisecond improvement at the chip level cascades through every layer of your stack. At scale, it's the difference between an agent that feels responsive and one that feels like it's "thinking."
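To put rough numbers on that cascade (all of the figures below are illustrative assumptions, not measurements): an agent that chains several sequential model calls multiplies whatever per-token latency the serving hardware imposes.

# Illustrative only: every figure here is an assumption, not a measurement.

def agent_response_seconds(sequential_calls, tokens_per_call, seconds_per_token):
    """Chained tool-calling agents pay per-token latency on every sequential hop."""
    return sequential_calls * tokens_per_call * seconds_per_token

calls, tokens = 5, 200    # assumed agent: 5 chained model calls, 200 tokens each
slow = agent_response_seconds(calls, tokens, seconds_per_token=0.030)
fast = agent_response_seconds(calls, tokens, seconds_per_token=0.020)

print(f"user-perceived latency: {slow:.0f}s -> {fast:.0f}s")    # 30s -> 20s

A third of the wait disappears without touching your application code, purely because the latency floor under each call dropped.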

The fact that Google was internally discussing agents two years ago, when 8i's design began and before agents became the dominant industry conversation, speaks to how tightly DeepMind's research roadmap feeds into the hardware pipeline.

The Cost Angle: It's Not Just About AI Models

Here's the example that should make every developer pay attention.

Citadel, one of the world's leading securities trading firms, is running TPUs. Not for frontier model training, not even for LLM inference: they are using TPUs for trading systems.

They have achieved the following:

  • 2–4x efficiency improvements
  • 30% cost reduction

This is important to unpack. TPUs have classically been positioned as specialized ML hardware. But across eight generations, they've become increasingly capable at any mathematically intensive workload, not just neural network ops. Matrix-heavy computation of almost any kind can benefit.

For developers, the implication is this: if your workload is dominated by dense numerical computation (financial modeling, scientific simulation, signal processing, large-scale recommendation engines), the TPU cost curve may now be favorable even if you're not training a model.

The Training to Inference Strategic Shift

In the early 2000s, Google's dominant infrastructure problem was building the web index, which is a massive upfront compute task. By 2005, the index was built. The dominant workload flipped entirely to serving it. The value was not in the index itself; it was in delivering search results to billions of users in milliseconds.

The same transition is happening with AI. The value is going to come from serving the model; serving is where value is created for Gemini Enterprise, Search, ads, and YouTube alike.

Training is the index-build. Inference is the serving. And just like web search, the inference workload is going to dwarf training in the long run, both in volume and in the aggregate compute it consumes.

This is the strategic logic for 8i. It is not an afterthought; it is a deliberate bet that inference is the dominant workload of the next era, made two years before that became obvious to the market.

For developers building on Cloud TPUs: this means Google is investing serious silicon real estate in making inference cheaper and faster. That should flow through to pricing over time as capacity scales.

Reliability: The Problem Nobody Talks About Enough

The most candid part of the announcement was about reliability at scale.

When you're coordinating tens of thousands of chips on a single computation, the math gets uncomfortable fast. If even one chip fails, the entire computation halts — like a nervous system with a severed node. At 100,000 chips, statistically, several chips will fail every single day.

The engineering challenge isn't just detecting that something failed. It's:

  1. Detecting it within seconds, not minutes
  2. Diagnosing which chip (needle-in-haystack at scale)
  3. Reconfiguring and resuming without human intervention

Google says they're achieving 97% "goodput" across 10,000+ chip workloads. "Goodput" is a term worth understanding: it is not theoretical peak FLOPS but the percentage of time the system is making actual forward progress on the computation. 97% means automated recovery is happening continuously, fast enough to maintain effective throughput.

The harder unsolved problem, acknowledged openly, is silent data corruption. A chip that fails cleanly is manageable. A chip that silently produces incorrect results and propagates those errors to every chip it communicates with is the nightmare scenario. It's an active area of work across the entire industry.

For developers using Cloud TPUs for long training runs, this directly affects your wall-clock training time and checkpoint strategy. Understanding goodput vs. raw FLOPS is increasingly important when evaluating real infrastructure cost.
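A quick sketch of the arithmetic (the per-chip failure rate here is an assumed, illustrative figure; the 97% goodput number is the one quoted in the announcement):

# The annual per-chip failure rate is an illustrative assumption; the 97%
# goodput figure is the one quoted in the announcement.

chips = 100_000
annual_failure_rate = 0.02    # assumed: 2% of chips fail per year
print(f"expected chip failures per day: {chips * annual_failure_rate / 365:.1f}")    # ~5.5

ideal_run_hours = 200         # assumed wall-clock time at a hypothetical 100% goodput
for goodput in (0.97, 0.85):
    print(f"goodput {goodput:.0%}: actual run ~{ideal_run_hours / goodput:.0f} hours")

The gap between 97% and a weaker recovery story is measured in billable chip-hours, which is why goodput belongs in your cost model next to the headline FLOPS.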

What This Means Looking Forward

A few predictions worth noting from the announcement:

CPUs are making a comeback. Not for the ML compute itself, but for agentic orchestration — spinning up sandboxes, running tool calls, managing state between inference steps. The general-purpose compute layer of the agentic stack is going to grow substantially.

Specialization is accelerating. Two chips might become more. As specific workloads (long-context retrieval, multimodal processing, real-time RL) demand their own optimization profiles, the era of one-chip-fits-all is ending. If you're designing AI infrastructure architecture today, design for heterogeneous compute.

The performance-per-dollar curve is still steep. 10x exaflops in a single generation is a number that should recalibrate any cost model you built 12 months ago. If you're making infrastructure decisions based on 2025 TPU pricing and performance, revisit them.

TorchTPU: The Announcement That Should Have Led

If the chip specs are the hardware story, TorchTPU is the software story that makes all of it accessible to the 90% of ML teams who live in PyTorch.

The previous path to TPUs in PyTorch was torch_xla, a package that bolted XLA compilation onto PyTorch via a lazy tensor execution model. It worked, but it came with real friction:

  • Recompilations triggered by dynamic shapes (variable sequence lengths, different batch sizes between steps)
  • Lazy execution semantics that didn't match PyTorch's eager-by-default mental model
  • Limited distributed training APIs: pure SPMD only, no DDP or FSDP compatibility
  • A learning curve that made GPU-trained engineers feel like they were context-switching into a different ecosystem

Using TPUs from PyTorch (which over 70% of the industry builds on) meant paying a careful rewrite cost before you could even evaluate whether TPU was worth it. TorchTPU lets developers migrate existing PyTorch workloads with minimal code changes while still giving them the APIs and tools to extract every ounce of compute from the hardware. You can run your existing PyTorch code in eager mode on a TPU VM today: no graph captures, no lazy tensor surprises. Your debugging workflows, your pdb calls, your print(tensor) statements all work as expected. For full-graph compilation, TorchTPU integrates natively with the torch.compile interface, using XLA as the backend compiler.

TorchTPU natively supports custom kernels written in Pallas and JAX. Many third-party libraries that build on PyTorch's distributed APIs work unchanged.
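For a sense of what a Pallas kernel looks like, here is a minimal standalone JAX example in the canonical element-wise-add style. How you would register a kernel like this inside a TorchTPU training loop isn't shown here, and the integration details may differ in the preview.

# Minimal standalone Pallas kernel (JAX). How it plugs into a TorchTPU
# workload is not shown here; this only illustrates the kernel style.
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # Inside the kernel, refs are read and written like arrays.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.arange(8, dtype=jnp.float32)
print(add(x, x))    # [ 0.  2.  4.  6.  8. 10. 12. 14.]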

Here’s what it looks like to move an existing PyTorch training script to a TPU VM with TorchTPU:

# Before (GPU training)
import torch

device = torch.device("cuda")
model = MyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    inputs, labels = batch[0].to(device), batch[1].to(device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After (TorchTPU — minimal changes)
import torch
import torch_tpu  # new import

device = torch.device("tpu")  # change device string
model = MyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Optional: compile for full performance
model = torch.compile(model, backend="tpu")

for batch in dataloader:
    inputs, labels = batch[0].to(device), batch[1].to(device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

The device string changes, the torch_tpu import is added, and torch.compile is added if you want peak performance. That’s the migration for a standard training loop.

For distributed training across a TPU slice:

# Provision a TPU VM slice
gcloud compute tpus tpu-vm create my-tpu \
  --zone=us-central2-b \
  --accelerator-type=v8t-32 \
  --version=tpu-vm-tf-2.18.0-pjrt

# Install TorchTPU on all hosts
gcloud compute tpus tpu-vm ssh my-tpu \
  --worker=all \
  --command="pip install torch torchtpu"

# Launch distributed training (standard torchrun API)
gcloud compute tpus tpu-vm ssh my-tpu \
  --worker=all \
  --command="torchrun --nproc_per_node=8 train.py"

TorchTPU is currently in preview, with a 2026 roadmap to add more functionality.

Cost Math: Training + Serving on TPUs

Let’s run through a concrete scenario.

Scenario: a mid-sized SaaS company fine-tuning a 7B parameter domain-specific model and serving it to ~500 concurrent users.

Training side (8T)

A typical 7B fine-tuning run on TPU v7 (Ironwood) at ~$5/chip-hour across a v7-256 slice (256 chips):

  • Estimated run time: ~8 hours per training run
  • Cost per run: 256 chips × $5/hr × 8 hrs ≈ $10,240

With 8T's 2.7× training price-performance improvement, the same job at the same quality:

  • Same cost delivers 2.7× more throughput → run completes in ~3 hours
  • Or: same 8-hour budget now trains a meaningfully larger model variant
  • Cost-effective run: ≈ $3,800 for equivalent training quality

For teams running weekly fine-tuning cycles on fresh data, this compounds quickly.

Inference side (8i)

The 80% inference price-performance improvement is the number that hits the monthly bill hardest. If you're serving a 7B model 24/7 to real users, your inference cost is ongoing; unlike a training run, it never actually stops.

  • A current TPU v7 inference budget of $8,000/month
  • Equivalent workload on 8i: ~$4,400/month
  • Annual saving: ~$43,000 on inference alone

At SMB margins, $43K/year in infrastructure savings is real money. It’s a junior engineer’s salary. It’s runway.
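Here is the same arithmetic as a small script you can adapt. The chip-hour rate and the 2.7x / 1.8x improvement factors are the assumptions used in this scenario, not published pricing.

# Recomputes the scenario above. The $/chip-hour rate and the 2.7x / 1.8x
# improvement factors are this scenario's assumptions, not published pricing.

chips, rate_per_hour, hours = 256, 5.00, 8
v7_run_cost = chips * rate_per_hour * hours
print(f"v7 fine-tuning run:  ${v7_run_cost:,.0f}")                        # $10,240

train_price_perf = 2.7
print(f"8T-equivalent run:   ${v7_run_cost / train_price_perf:,.0f}")     # ~ $3,800

v7_monthly_inference = 8_000
infer_price_perf = 1.8    # the ~80% inference price-performance improvement
v8i_monthly = v7_monthly_inference / infer_price_perf
print(f"8i monthly serving:  ${v8i_monthly:,.0f}")                        # ~ $4,400
print(f"annual saving:       ${(v7_monthly_inference - v8i_monthly) * 12:,.0f}")    # ~ $43K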

Developer Takeaways

  1. If you're running inference at scale, 8i is the chip to watch. The Boardfly topology and 10x FP improvement are purpose-built for agentic, low-latency workloads. Monitor Cloud TPU availability and pricing as 8i reaches GA later in 2026.

  2. If you're training large models, 8T's 3x pod performance and 4x scale-up bandwidth meaningfully compress training timelines. At the margin, shorter training runs = lower cost = more iteration cycles.

  3. If your workload is numerically intensive but not ML, the Citadel case study is permission to evaluate TPUs seriously. Profile your compute bottlenecks and benchmark.

  4. Design for reliability, not peak FLOPS. For production workloads, goodput — not theoretical throughput — is the number that determines your actual infrastructure cost. Ask your cloud provider for goodput SLAs, not just FLOPS specs.

  5. Plan for heterogeneous compute. The agentic stack will likely involve TPUs for inference, CPUs for orchestration, and possibly additional specialized chips. Start thinking about your architecture at that level now.

Google has been building towards this for 13 years. TPU v8, with its two chips, is the most ambitious step yet. The developers who understand why the split happened, and what each chip was designed for, will be better positioned to make infrastructure decisions that hold up as the agentic era fully arrives. TorchTPU finally makes TPUs accessible to PyTorch-native teams without a painful rewrite. The friction that kept most ML engineers defaulting to NVIDIA is being systematically removed.

The infrastructure that powers intelligence is still being defined. But it's coming into focus fast.

This article was originally published by DEV Community and written by Samuel Komfi.
