HiFloat4 Format for Language Model Pre-training on Ascend NPUs
Huawei's HiFloat4 format trains large language models at 4-bit precision on Ascend chips, keeping relative error within 1% of full-precision baselines.

The Thesis
This paper tests whether a new 4-bit floating-point format called HiFloat4 can train large language models on Huawei's Ascend hardware without meaningful accuracy loss. The motivation is straightforward: cutting numerical precision from 16-bit to 4-bit can, in theory, quadruple a chip's arithmetic throughput and deliver up to a 4x reduction in memory use for the affected operations. The authors find that with specific stabilization techniques, HiFloat4 training stays within 1% relative error of full-precision baselines across both dense models and mixture-of-experts (MoE) architectures, models where computation is routed to specialized sub-networks rather than through one monolithic network. The catch is that this result is confined entirely to Huawei's own Ascend NPU clusters, with no cross-hardware validation, and the comparison landscape is limited to MXFP4, the format standardized by the Open Compute Project's Microscaling (MX) specification.
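To make the quantization concrete, here is a minimal NumPy sketch of an MXFP4-style quantize-dequantize round trip: 32-element blocks, a shared power-of-two scale, and the eight E2M1 magnitudes. This is a sketch under stated assumptions, not HiFloat4's actual encoding (which the paper defines separately) and not the exact OCP scaling formula, and the printed per-tensor error is a different quantity from the paper's end-to-end 1% figure.

```python
import numpy as np

# Representable magnitudes of the FP4 (E2M1) element format used by MXFP4;
# sign is handled separately. HiFloat4's exact code points differ and are
# not reproduced here.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quant_fp4(x, block=32):
    """Quantize-dequantize a 1-D tensor with MXFP4-style block scaling:
    every block of 32 values shares one power-of-two scale, and each value
    is rounded to the nearest FP4 code point. Illustrative only; the scaling
    rule is simplified relative to the OCP spec."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block
    xb = np.pad(x, (0, pad)).reshape(-1, block)

    # Power-of-two scale per block, chosen so the block maximum fits within
    # the largest FP4 magnitude (6.0).
    amax = np.abs(xb).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / FP4_GRID[-1]))

    # Round each scaled value to the nearest representable FP4 magnitude.
    scaled = xb / scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scale
    return deq.reshape(-1)[: len(x)]

# Round-trip error on a random weight tensor. Note this per-tensor figure is
# not the paper's end-to-end "within 1% of full precision" training metric.
w = np.random.randn(4096).astype(np.float32)
rel_err = np.linalg.norm(w - fake_quant_fp4(w)) / np.linalg.norm(w)
print(f"per-tensor relative quantization error: {rel_err:.2%}")
```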
Catalyst
Industry momentum shifted sharply toward 4-bit training in 2024-2025 after Nvidia demonstrated NVFP4 in its Blackwell architecture and the Open Compute Project standardized the MXFP4 format, giving researchers concrete hardware targets to benchmark against. Huawei's Ascend NPU line has become strategically critical for Chinese AI development under U.S. export controls, creating urgency to prove that Ascend can support frontier-scale training at competitive efficiency. Those two forces — a maturing FP4 ecosystem and geopolitical hardware isolation — converged to make this specific comparison paper both technically timely and commercially relevant.
What's New
Earlier FP4 training work, including Nvidia's NVFP4 papers and the MXFP4 specification from the MX consortium, focused primarily on inference (running a trained model) or small-scale training experiments on GPU hardware. This paper extends FP4 all the way through large-scale pre-training on Ascend NPUs, covering both linear layers and the expert-routing layers inside MoE models, which are notoriously sensitive to numerical noise. The authors also contribute a set of stabilization techniques — gradient scaling and outlier handling strategies — that they claim are necessary to prevent numerical degradation specific to the Ascend hardware and HiFloat4 encoding.
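The paper's exact recipe is not spelled out in this summary, but a hedged sketch of what "gradient scaling plus outlier handling" typically means before FP4 quantization looks like the following; the percentile threshold and the scaling target are illustrative choices, not values taken from the paper.

```python
import numpy as np

def stabilize_for_fp4(grad, clip_pct=99.9, target=4.0):
    """Generic pre-quantization stabilization for a gradient tensor:
    clamp rare outliers, then rescale so typical magnitudes sit within the
    4-bit grid. Illustrative only; not the paper's specific recipe."""
    # Outlier handling: clip the top 0.1% of magnitudes so a single extreme
    # value cannot force the shared scale to waste dynamic range.
    thresh = np.percentile(np.abs(grad), clip_pct)
    clipped = np.clip(grad, -thresh, thresh)

    # Gradient scaling: a per-tensor scale (stored in higher precision and
    # divided back out after the FP4 GEMM) keeps values in representable range.
    scale = np.abs(clipped).max() / target + 1e-12
    return (clipped / scale).astype(np.float32), scale

# Heavy-tailed input, loosely mimicking real gradient statistics.
g = np.random.standard_cauchy(8192).astype(np.float32)
g_ready, s = stabilize_for_fp4(g)
# g_ready would now go through FP4 fake-quantization (see the earlier sketch);
# multiplying by s afterwards restores the original magnitude.
print(f"per-tensor scale {s:.3g}")
```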
The Counter
Every experiment in this paper runs exclusively on Huawei Ascend NPU clusters, making it impossible to independently verify the results or compare them against published GPU baselines under identical conditions. The 1% relative error claim is promising, but the paper does not disclose training loss curves or downstream benchmark scores that would let readers judge whether that error accumulates into meaningful model quality degradation at scale. The comparison is also narrow: HiFloat4 vs. MXFP4, with no head-to-head against NVFP4 on the hardware it was designed for. More fundamentally, even if the technique works perfectly, its addressable market is currently gated behind U.S. export controls, meaning the hardware it depends on cannot legally reach most of the world's largest AI labs. Any claim that HiFloat4 influences global AI training efficiency is speculative until the format and hardware can be reproduced or licensed outside Huawei's supply chain.
Longs
- CAMT — Camtek, semiconductor inspection equipment used in advanced chip packaging for NPU-class devices
- AMAT (Applied Materials) — deposition and etch tools for advanced logic nodes including NPU fabrication
- SMIC (0981.HK) — primary foundry for Huawei Ascend chips, direct beneficiary of Ascend volume
- SOXX (semiconductor ETF) — broad exposure to low-precision compute silicon demand
- Baidu (BIDU) — major Chinese hyperscaler likely to adopt Ascend-native training formats under export controls
Shorts
- Nvidia — if HiFloat4 on Ascend proves competitive, Chinese hyperscalers have less incentive to seek workarounds to acquire H100/H200 hardware
- MXFP4 ecosystem vendors — a differentiated HiFloat4 format fragments the low-precision standard, complicating toolchain and compiler investment for companies that bet on MX as the universal format
Enablers (Picks & Shovels)
- Huawei CANN (Compute Architecture for Neural Networks) — the software stack that exposes HiFloat4 operations to training frameworks on Ascend
- MindSpore — Huawei's open-source training framework where HiFloat4 kernels are implemented
- Open Compute Project MX specification — defines MXFP4, the primary benchmark format this paper compares against
- LLaMA model architecture (Meta AI) — used as an open reference architecture for cross-format benchmarking in the paper
Private Watchlist
- Enflame Technology — Chinese AI chip startup competing with Ascend in domestic training clusters
- Biren Technology — Chinese GPU-class chip designer targeting large model training workloads
- Vastai Technologies — Chinese AI chip designer building data-center accelerators, another domestic alternative vying for the workloads Ascend targets
Resources
The Paper
Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats, such as MXFP4 and NVFP4, can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines. In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We evaluate both dense architectures (e.g., Pangu and LLaMA-style models) and mixture-of-experts (MoE) models, where both standard linear layers and expert-specific GEMMs operate in FP4. Furthermore, we explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation, maintaining relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation. Our results provide a comprehensive empirical study of FP4 training on NPUs and highlight the practical trade-offs between FP4 formats in large-scale dense and MoE models.
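As one concrete reading of "expert GEMMs operate in FP4", below is a toy NumPy sketch of an MoE layer in which each selected expert's matrix multiply takes fake-quantized activations and weights while the router stays in higher precision. Everything here (per-tensor scaling instead of block scaling, top-2 routing without gate weighting, the shapes and names) is an illustrative assumption, not the Pangu architecture or the paper's actual kernels.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fp4_fake_quant(x):
    """Per-tensor FP4 quantize-dequantize (block scaling omitted for brevity)."""
    scale = np.abs(x).max() / FP4_GRID[-1] + 1e-12
    idx = np.abs(np.abs(x / scale)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx] * scale

def moe_forward_fp4(x, router_w, expert_ws, top_k=2):
    """Toy MoE forward pass: router in full precision, expert GEMMs with FP4
    inputs on both sides. Gate weighting is omitted to keep the sketch short."""
    logits = x @ router_w                              # (tokens, n_experts)
    top = np.argsort(-logits, axis=1)[:, :top_k]       # chosen experts per token
    out = np.zeros_like(x)
    for e, w in enumerate(expert_ws):
        mask = (top == e).any(axis=1)                  # tokens routed to expert e
        if mask.any():
            out[mask] += fp4_fake_quant(x[mask]) @ fp4_fake_quant(w)
    return out

tokens, d, n_experts = 16, 32, 4
x = np.random.randn(tokens, d).astype(np.float32)
router = np.random.randn(d, n_experts).astype(np.float32)
experts = [np.random.randn(d, d).astype(np.float32) for _ in range(n_experts)]
print(moe_forward_fp4(x, router, experts).shape)       # (16, 32)
```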