Sohu And The End Of General‑Purpose AI Chips

A look at Etched’s transformer‑only ASIC and what it means for data centers, humanoid robots, and the next phase of AI infrastructure.

Jessica Alvarez

May 3, 2026

Etched is pursuing perhaps the purest bet in AI hardware. Its Sohu chip is an application-specific integrated circuit (ASIC) that runs only transformer models and hardwires the full transformer computation graph directly into silicon. It cannot execute CNNs, RNNs, LSTMs, protein folding systems, or legacy vision stacks. It is built for the single architecture that now underpins language models, multimodal systems, and emerging agent frameworks.

The company has raised $120 million to bring this design to market, manufacturing on TSMC’s 4-nanometer node and explicitly positioning itself as a challenger to Nvidia’s grip on AI inference economics. Founder Gavin Uberti summarized the risk profile directly: if transformers disappear, the company disappears. If they persist, he argues, the payoff could be on the scale of the largest hardware franchises ever built.

Hardwired transformers in silicon

Sohu is not a flexible GPU with a new marketing label. It is a transformer-specific pipeline built from dedicated hardware blocks. Attention, QKV projections, softmax, feed-forward layers with GELU or SiLU, and layer normalization are all implemented as fixed-function stages in a single dataflow. There is no traditional instruction decoder or general-purpose shader core. Data flows through a rigid graph that mirrors what modern decoder and encoder-decoder transformers actually do at inference time.
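The stages named above can be sketched in software. What follows is a minimal pure-Python illustration of one decoder block, not a description of Sohu's actual circuitry. The function names, the single attention head, the pre-norm layout, the layer normalization without learned scale or shift, the missing biases and KV cache, and the random weights are all assumptions made to keep the example short.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(v):
    """Tanh approximation of GELU."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Token i may only attend to tokens 0..i.
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))])
    return out

def decoder_block(x, wq, wk, wv, wo, w1, w2):
    """Fixed stage order: norm, QKV, attention, residual, norm, FFN, residual."""
    h = layer_norm(x)
    q, k, v = matmul(h, wq), matmul(h, wk), matmul(h, wv)
    x = add(x, matmul(causal_attention(q, k, v), wo))
    h = layer_norm(x)
    hidden = [[gelu(t) for t in row] for row in matmul(h, w1)]
    return add(x, matmul(hidden, w2))

def make_weights(d_model, d_ff, seed=0):
    """Hypothetical random weights in the order wq, wk, wv, wo, w1, w2."""
    rng = random.Random(seed)
    shapes = [(d_model, d_model)] * 4 + [(d_model, d_ff), (d_ff, d_model)]
    return [[[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)] for r, c in shapes]

if __name__ == "__main__":
    tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [0.3, 0.3, -0.2, 0.1]]
    out = decoder_block(tokens, *make_weights(d_model=4, d_ff=8))
    print(len(out), len(out[0]))  # 3 4
```

The point of the sketch is that the order of operations inside `decoder_block` never changes from one input to the next. Only the weights and activations vary, which is the property that makes a fixed-function pipeline plausible in the first place.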

The chip is fabricated at near-reticle-limit size on TSMC’s 4-nanometer node and paired with 144 gigabytes of HBM3E memory, which delivers an estimated 4.8 terabytes per second of bandwidth. Etched claims that this architecture allows it to sustain very high utilization on transformer workloads, with internal figures pointing to more than 90 percent of peak compute being applied to useful operations.
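To put the memory figures in context, here is a rough back-of-envelope sketch of a bandwidth-bound decoding ceiling. It takes the 144 GB and 4.8 TB/s figures as quoted above and treats them as per-chip values. The rest is assumption for illustration: a hypothetical 70-billion-parameter model stored at one byte per parameter, weights read in full once per decoding step, and no allowance for KV-cache traffic, compute limits, or sharding across chips.

```python
# Figures quoted in the article (treated here as per-chip values).
HBM_CAPACITY_BYTES = 144e9
HBM_BANDWIDTH_BYTES_PER_S = 4.8e12

# Assumptions for illustration only.
N_PARAMS = 70e9        # hypothetical 70B-parameter model
BYTES_PER_PARAM = 1    # e.g. 8-bit weights

def weight_passes_per_second(bandwidth, n_params, bytes_per_param):
    """How many times per second the full weight set can be streamed from memory."""
    return bandwidth / (n_params * bytes_per_param)

def bandwidth_bound_tokens_per_second(batch_size, passes_per_s):
    """Upper bound when each pass over the weights yields one token per sequence."""
    return batch_size * passes_per_s

weights_fit = N_PARAMS * BYTES_PER_PARAM <= HBM_CAPACITY_BYTES
passes = weight_passes_per_second(HBM_BANDWIDTH_BYTES_PER_S, N_PARAMS, BYTES_PER_PARAM)

print(f"weights fit in HBM: {weights_fit}")
print(f"weight passes per second: {passes:.1f}")
for batch in (1, 64, 1024):
    tokens = bandwidth_bound_tokens_per_second(batch, passes)
    print(f"batch {batch}: about {tokens:,.0f} tokens/s ceiling")
```

Under these assumptions a single sequence is capped at a few dozen tokens per second by memory bandwidth alone, and the ceiling scales with batch size. That is why high aggregate throughput on any memory-bound accelerator tends to depend on serving many sequences at once.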

Across public material, Etched repeatedly highlights one number. An eight-chip Sohu server reportedly surpasses 500,000 tokens per second on Llama 70B, compared with roughly 23,000 tokens per second for an eight-H100 system and about 45,000 for an eight-B200 system on similar models. Independent analysts have noted that these figures likely depend on aggressive batching and model-specific tuning, yet even with conservative assumptions the throughput advantage appears material.
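The ratios implied by those figures are easy to check. The short sketch below uses only the vendor-reported numbers quoted above. The 50 percent haircut is a hypothetical discount chosen to illustrate a more conservative reading, not a measured value.

```python
# Vendor-reported throughput for eight-accelerator servers, in tokens per second.
claimed = {"8x Sohu": 500_000, "8x H100": 23_000, "8x B200": 45_000}

def speedup(baseline, candidate="8x Sohu", haircut=0.0):
    """Ratio of candidate to baseline throughput after discounting the candidate's claim."""
    return claimed[candidate] * (1.0 - haircut) / claimed[baseline]

for name in ("8x H100", "8x B200"):
    as_claimed = speedup(name)
    discounted = speedup(name, haircut=0.5)  # hypothetical 50% discount
    print(f"vs {name}: {as_claimed:.1f}x as claimed, {discounted:.1f}x after a 50% haircut")
```

As claimed, the figures work out to roughly 21.7x over the H100 server and 11.1x over the B200 server. Even with the Sohu number halved, the ratios stay near 10.9x and 5.6x.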

This profile does not just help frontier labs chasing leaderboard wins. It reshapes the bill of materials for any product that spends most of its time running large decoders, from low-latency agents to simulation-heavy planning systems and eventual humanoid stacks. Because the headline figures appear to rely on heavy batching, the size of the advantage at low batch sizes remains to be demonstrated. Instead of contorting workloads around the economics of general-purpose GPUs, Sohu effectively assumes that transformers are the default and optimizes the entire hardware budget around keeping them saturated.

A narrow architecture in a transformer world

The obvious trade-off is brittleness. Etched states openly that if transformers are replaced by state-space models, RWKV-style architectures, or something not yet invented, Sohu becomes essentially useless. The chip cannot be repurposed for earlier model families or non-transformer deep learning systems. It is a single-purpose machine that lives or dies with one architecture.

The counterweight is that transformers now anchor almost every state-of-the-art model in production, from ChatGPT and Gemini to Sora and Stable Diffusion 3, and they continue to absorb new modalities rather than lose them. When the cost of training and serving frontier systems runs into billions of dollars, even single-digit percentage gains in inference efficiency can justify specialized silicon. Etched is not chasing a marginal improvement. It is targeting an order-of-magnitude jump in throughput per dollar on the dominant architecture of the current cycle.

That is the core of the Sohu thesis. Strip away flexibility, commit completely to transformers, and turn the resulting focus into raw speed.