HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment
A new neural network architecture claims to detect driver drowsiness from video more accurately and cheaply — but real-world deployment remains unproven.

The Thesis
Driver fatigue is one of the leading causes of road fatalities, and automakers are under growing regulatory pressure to detect it reliably in real time. Most existing video-based approaches either burn too much compute for in-car hardware or rely on simple pairwise relationships between facial landmarks — missing the complex, multi-point interactions that distinguish a genuine yawn from a spoken word. This paper proposes HST-HGN, a system that models facial geometry using hypergraphs (graph structures where a single edge can connect more than two points at once, capturing group-level interactions) combined with a bidirectional sequence model called Bi-Mamba that processes video timelines efficiently in both forward and backward directions. The claimed result is better fatigue detection at lower computational cost — a combination that would matter most in the always-on, battery-constrained environment of a production vehicle. The catch is that the evaluation is benchmark-only; no deployment on actual automotive hardware is demonstrated.
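To make the hypergraph idea concrete, here is a minimal sketch of simplified hypergraph message passing — the landmark indices, hyperedge groupings, and normalization are illustrative assumptions for exposition, not the paper's actual construction. The key point: one hyperedge column ties together six landmarks at once, something no pairwise edge can do.

```python
import numpy as np

# Hypothetical landmark layout (illustrative only):
# 0-3: lips/mouth corners, 4-5: jaw, 6-7: left eye, 8-9: right eye
num_nodes, num_edges = 10, 3

# Incidence matrix H: H[v, e] = 1 if landmark v belongs to hyperedge e.
# Hyperedge 0 connects SIX landmarks simultaneously, modeling the
# mouth+jaw co-movement of a yawn as one group-level interaction.
H = np.zeros((num_nodes, num_edges))
H[[0, 1, 2, 3, 4, 5], 0] = 1      # hyperedge 0: yawn (mouth + jaw)
H[[6, 7], 1] = 1                  # hyperedge 1: left-eye closure
H[[8, 9], 2] = 1                  # hyperedge 2: right-eye closure

# One round of simplified hypergraph smoothing:
# aggregate node features per hyperedge, then redistribute to members.
X = np.random.randn(num_nodes, 4)         # per-landmark feature vectors
Dv = np.diag(1.0 / H.sum(axis=1))         # node-degree normalization
De = np.diag(1.0 / H.sum(axis=0))         # hyperedge-size normalization
X_out = Dv @ H @ De @ H.T @ X             # group-level feature mixing
print(X_out.shape)                        # (10, 4)
```

After one pass, every landmark in the yawn hyperedge carries the group's pooled signal — the "higher-order co-movement" the paper argues pairwise GCNs miss.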
Catalyst
Two enabling trends converged recently. First, the Mamba state-space architecture, a class of sequence model that scales linearly rather than quadratically with sequence length, was published in December 2023, making long-range temporal modeling far more tractable on modest hardware. Second, increasingly strict EU driver-monitoring regulation (GSR2 mandates drowsiness-warning systems in all new vehicles from July 2024) and Euro NCAP scoring updates are creating immediate commercial pull for efficient, embeddable fatigue detection.
What's New
Earlier approaches fell into two camps: heavyweight transformer-based models whose quadratic attention cost struggles on embedded chips, and lightweight graph neural networks that model only pairwise facial-landmark relationships and so miss higher-order co-movements. Prior driver-monitoring systems such as RT-GENE (a gaze-estimation pipeline) and standard graph convolutional networks (GCNs) are limited to these pairwise edges. HST-HGN instead uses hypergraph edges that can encode three-way, four-way, or higher-order facial co-movements — theoretically richer for capturing whole-face expressions — and replaces the transformer's expensive attention with Mamba's linear-complexity state-space scan, cutting compute while extending the temporal window the model can consider.
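The complexity contrast can be sketched in a few lines. This is a toy comparison under stated assumptions — the clip length, feature dimension, and the scalar recurrence are stand-ins, not the paper's actual layers — but it shows why attention is quadratic while a state-space scan is linear.

```python
import numpy as np

L, d = 1000, 16                          # frames in a clip, feature width

# Self-attention materializes an L x L score matrix: quadratic in clip length.
Q, K, V = (np.random.randn(L, d) for _ in range(3))
scores = Q @ K.T / np.sqrt(d)            # shape (1000, 1000): one million entries
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
y_attn = attn @ V

# A drastically simplified SSM scan touches each frame exactly once: linear
# in L with constant state. Mamba's real recurrence is input-dependent and
# multi-channel; only the complexity skeleton h_t = a*h_{t-1} + b*x_t is kept.
a, b = 0.9, 0.1
h = np.zeros(d)
y_ssm = np.empty((L, d))
for t in range(L):
    h = a * h + b * V[t]
    y_ssm[t] = h

print(y_attn.shape, y_ssm.shape)         # (1000, 16) (1000, 16)
```

Doubling the clip length quadruples the attention matrix but only doubles the scan's work — the asymmetry behind the paper's efficiency claim.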
The Counter
The paper benchmarks on standard academic fatigue datasets, but these are almost certainly cleaner and more controlled than the chaotic reality of in-cabin video: variable lighting, glasses, masks, head poses, and camera placements that shift between vehicle models. The claim of suitability for 'real-time in-cabin edge deployment' is made without actually running the model on automotive-grade hardware — no latency numbers on a production chip appear in the abstract, only an assertion of linear complexity. Hypergraph networks are theoretically expressive, but that expressiveness comes with its own implementation overhead, and prior work has shown that simple GCNs often close the gap once properly tuned. The Bi-Mamba temporal module processes sequences bidirectionally, which is fine for offline analysis but creates a fundamental causality problem for true real-time use: you cannot look backward in time at frames that haven't happened yet. Finally, the fatigue detection space is already occupied by well-funded, regulation-tested commercial players with years of OEM integration data — academic benchmark supremacy rarely translates directly to that environment.
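The causality objection can be made concrete with a toy scalar recurrence — an assumption-laden stand-in for the actual Bi-Mamba module, not the paper's code. Perturbing a frame that hasn't happened yet changes the bidirectional output at earlier timesteps, while a purely causal forward scan is unaffected:

```python
import numpy as np

def fwd_scan(x, a=0.9, b=0.1):
    # Causal scan: h_t = a*h_{t-1} + b*x_t. Each output sees past frames only.
    h, out = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        out.append(h)
    return np.array(out)

def bi_scan(x):
    # Bidirectional fusion: the backward pass runs over the reversed clip,
    # so no output is final until the LAST frame has arrived.
    return fwd_scan(x) + fwd_scan(x[::-1])[::-1]

clip = np.zeros(100)
edited = clip.copy()
edited[90] = 1.0                       # perturb only a FUTURE frame

causal_same = np.allclose(fwd_scan(clip)[:50], fwd_scan(edited)[:50])
bidi_same = np.allclose(bi_scan(clip)[:50], bi_scan(edited)[:50])
print(causal_same, bidi_same)          # True False
```

A deployed system would have to fall back to causal-only scans or buffer a sliding window of frames, trading latency for the backward context — a trade-off the benchmark numbers don't surface.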
Longs
- Mobileye (MBLY) — ADAS supplier to OEMs expanding into driver monitoring; would adopt or compete with this approach
- STMicroelectronics (STM) — edge AI chips used in automotive cabin sensing
- Aptiv (APTV) — vehicle interior sensing and ADAS integration
- BOTZ (robotics/automation ETF) — broad exposure to embedded AI sensing
- Ambarella (AMBA) — low-power computer vision SoCs targeted at in-cabin cameras
Shorts
- Seeing Machines and Smart Eye — if a cheaper, more accurate open-research approach commoditizes the core algorithm, their proprietary software moat erodes
- Transformer-centric CV startups — architectural advantage of attention-based models weakens if linear-complexity SSMs match quality at lower cost
- Tier-1 suppliers using older rule-based or simple CNN drowsiness detectors — face replacement by learning-based hypergraph approaches
Enablers (Picks & Shovels)
- Mamba / state-space model open-source implementations (e.g., the original Mamba GitHub repo by Gu & Dao)
- NTHU-DDD, DROZY, and similar public fatigue benchmark datasets used for evaluation
- MediaPipe and OpenFace facial landmark extraction pipelines that supply the geometric inputs
- Qualcomm Snapdragon Ride and similar automotive-grade edge AI SoCs that would host deployment
- EU General Safety Regulation 2 (GSR2) compliance testing frameworks that define the performance bar
Watchlist
- Seeing Machines — Australian driver-monitoring AI specialist, listed on London's AIM (SEE.L)
- Smart Eye — publicly traded in Sweden (SEYE), OEM driver monitoring systems
- Cipia (formerly Eyesight Technologies) — Israeli in-cabin driver and occupant sensing company
- Eyeris Technologies — private company focused on in-vehicle occupant and driver monitoring
Resources
The Paper
It remains challenging to assess driver fatigue from untrimmed videos under constrained computational budgets, due to the difficulty of modeling long-range temporal dependencies in subtle facial expressions. Some existing approaches rely on computationally heavy architectures, whereas others employ traditional lightweight pairwise graph networks, despite their limited capacity to model high-order synergies and global temporal context. Therefore, we propose HST-HGN, a novel Heterogeneous Spatial-Temporal Hypergraph Network driven by Bidirectional State Space Models. Spatially, we introduce a hierarchical hypergraph network to fuse pose-disentangled geometric topologies with multi-modal texture patches dynamically. This formulation encapsulates high-order synergistic facial deformations, effectively overcoming the limitations of conventional methods. In temporal terms, a Bi-Mamba module with linear complexity is applied to perform bidirectional sequence modeling. This explicit temporal-evolution filtering enables the network to distinguish highly ambiguous transient actions, such as yawning versus speaking, while encompassing their complete physiological lifecycles. Extensive evaluations across diverse fatigue benchmarks demonstrate that HST-HGN achieves state-of-the-art performance. In particular, our method strikes a balance between discriminative power and computational efficiency, making it well-suited for real-time in-cabin edge deployment.