This work introduces EgoTraj-Bench, the first real-world benchmark for robust trajectory prediction under noisy ego-centric observations, and BiFlow, a dual-stream flow matching model with an EgoAnchor mechanism that achieves superior robustness and state-of-the-art performance.
Reliable trajectory prediction from an ego-centric perspective is crucial for robotic navigation in human-centric environments. However, existing methods typically assume idealized observation histories, failing to account for the perceptual artifacts inherent in first-person vision, such as occlusions, ID switches, and tracking drift. This discrepancy between training assumptions and deployment reality severely limits model robustness. To bridge this gap, we introduce EgoTraj-Bench, the first real-world benchmark that grounds noisy, first-person visual histories in clean, bird's-eye-view future trajectories, enabling robust learning under realistic perceptual constraints. Building on this benchmark, we propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion by leveraging a shared latent representation. To better model agent intent, BiFlow incorporates our EgoAnchor mechanism, which conditions the prediction decoder on distilled historical features via feature modulation. Extensive experiments show that BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10-15% on average and demonstrating superior robustness. We anticipate that our benchmark and model will provide a critical foundation for developing trajectory forecasting systems truly resilient to the challenges of real-world, ego-centric perception.
Most trajectory prediction methods are trained and evaluated under the assumption of clean, complete observations from a bird's-eye view (BEV). In real robot deployment, however, observations come from a first-person view (FPV) camera, where occlusions, ID switches, tracking drift, and perspective distortions are unavoidable. This gap between training assumptions and deployment reality severely limits model robustness.
■ Cyan: occlusion-induced gaps · ■ Red: ID switches · ■ Green: ego-centric perspective distortions.
Dashed: first-person view (FPV) derived history · Solid: bird's-eye view (BEV) derived trajectory.
The radar chart below illustrates the performance gap of the state-of-the-art model MoFlow when evaluated on clean BEV histories versus noisy FPV histories across all ETH-UCY folds and the TBD dataset. FPV noise causes dramatic degradation.
MoFlow minADE@20 under BEV (clean) vs. FPV (noisy) settings. The large gap across all datasets motivates the need for benchmarks that explicitly model ego-view noise.
EgoTraj-Bench directly addresses this by providing real-world FPV noisy histories paired with clean BEV future ground truth, enabling fair evaluation and robust learning under deployment-level noise.
EgoTraj-Bench is built on the TBD dataset, leveraging its synchronized BEV overhead and FPV front-facing camera recordings. FPV trajectories are extracted via YOLOv8 detection and BotSort tracking, then projected to world coordinates and paired with clean BEV ground truth via Hungarian matching. The result is a benchmark with 210 minutes of real-world recordings, 30 Hz annotation, and 36,947 aligned trajectory pairs.
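For illustration, the FPV-to-BEV pairing step can be sketched with SciPy's Hungarian solver. The cost definition (mean Euclidean distance over time-aligned frames) and the distance threshold below are assumptions for the sketch, not the benchmark's exact recipe:

```python
# Hedged sketch of pairing FPV tracks with BEV ground truth via Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks(fpv_tracks, bev_tracks, max_cost=1.0):
    """fpv_tracks, bev_tracks: lists of (T, 2) world-coordinate arrays sampled
    on the same timestamps. Returns matched (fpv_idx, bev_idx) pairs."""
    cost = np.zeros((len(fpv_tracks), len(bev_tracks)))
    for i, f in enumerate(fpv_tracks):
        for j, b in enumerate(bev_tracks):
            # mean Euclidean distance between the two tracks (assumed cost)
            cost[i, j] = np.linalg.norm(f - b, axis=-1).mean()
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    # keep only plausible pairs (threshold in meters, assumed value)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_cost]
```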
EgoTraj-Bench captures three types of perceptual artifacts inherent in first-person vision: occlusion-induced gaps, ID switches, and ego-centric perspective distortions.
EgoTraj-TBD is the only benchmark featuring both perceptual and real-world physical ego-noise, with the lowest history MSE among FPV-noise datasets.
BiFlow employs a dual-stream flow matching framework that jointly learns two mappings from the same noisy FPV input: a history stream that reconstructs clean BEV past trajectories, and a prediction stream that forecasts clean BEV future trajectories. Both streams share a unified contextual encoder, allowing denoising knowledge learned in the history stream to directly inform future prediction.
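To make the dual-stream setup concrete, here is a minimal PyTorch sketch assuming a simple GRU encoder and straight-line conditional flow matching; the module names, dimensions, and pooling choices are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of a dual-stream flow-matching objective with a shared encoder.
import torch
import torch.nn as nn

class BiFlowSketch(nn.Module):
    def __init__(self, d_model=128, horizon_h=8, horizon_f=12, dim=2):
        super().__init__()
        self.encoder = nn.GRU(dim, d_model, batch_first=True)  # shared contextual encoder
        self.v_hist = nn.Sequential(nn.Linear(d_model + horizon_h * dim + 1, 256),
                                    nn.ReLU(), nn.Linear(256, horizon_h * dim))
        self.v_pred = nn.Sequential(nn.Linear(d_model + horizon_f * dim + 1, 256),
                                    nn.ReLU(), nn.Linear(256, horizon_f * dim))

    def cfm_loss(self, head, ctx, target):
        """Conditional flow matching: regress the velocity (x1 - x0) along
        the straight-line path x_t = (1 - t) * x0 + t * x1."""
        x1 = target.flatten(1)                  # clean BEV trajectory, flattened
        x0 = torch.randn_like(x1)               # noise sample
        t = torch.rand(x1.size(0), 1, device=x1.device)
        xt = (1 - t) * x0 + t * x1
        v = head(torch.cat([ctx, xt, t], dim=-1))
        return ((v - (x1 - x0)) ** 2).mean()

    def forward(self, fpv_hist, bev_hist, bev_future):
        _, h = self.encoder(fpv_hist)           # encode noisy FPV history once
        ctx = h[-1]                             # shared latent representation
        loss_h = self.cfm_loss(self.v_hist, ctx, bev_hist)    # history denoising stream
        loss_f = self.cfm_loss(self.v_pred, ctx, bev_future)  # future prediction stream
        return loss_h + loss_f
```

At inference, sampling would integrate each learned velocity field from a noise sample toward a clean trajectory (e.g., a few Euler steps); because both heads read the same latent, denoising signal learned on the history stream shapes the context used for prediction.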
The EgoAnchor mechanism distills intent priors from historical hidden features, combines agent-level and scene-level anchors, and injects them into the prediction decoder via adaptive feature modulation, improving stability under heavy or missing observations.
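One plausible reading of this mechanism is FiLM-style conditioning. The sketch below assumes mean/max pooling for the agent- and scene-level anchors and a single linear projection to scale/shift terms; these choices are our guesses, not the paper's exact design:

```python
# Hedged sketch of anchor-conditioned adaptive feature modulation.
import torch
import torch.nn as nn

class EgoAnchorModulation(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        # project concatenated anchors into per-channel scale (gamma) and shift (beta)
        self.to_gamma_beta = nn.Linear(2 * d_model, 2 * d_model)

    def forward(self, decoder_feats, hist_feats):
        # hist_feats: (batch, agents, d_model) hidden features from the history stream
        agent_anchor = hist_feats.mean(dim=1)        # agent-level anchor (assumed pooling)
        scene_anchor, _ = hist_feats.max(dim=1)      # scene-level anchor (assumed pooling)
        gamma, beta = self.to_gamma_beta(
            torch.cat([agent_anchor, scene_anchor], dim=-1)).chunk(2, dim=-1)
        # modulate the prediction decoder features: (batch, steps, d_model)
        return gamma.unsqueeze(1) * decoder_feats + beta.unsqueeze(1)
```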
The ablation below confirms that the robustness gains come from holistic joint modeling: simply correcting invalid observation points is insufficient, because the input also contains tracking errors and perspective distortions that require full trajectory-level denoising.
Under noisy FPV inputs, BiFlow produces predictions with stronger continuity and consistency. Even when the noisy FPV history deviates significantly from the clean BEV reference, BiFlow maintains reasonable motion trends with lower endpoint error.
Gray: history (shared for BEV & FPV inputs). Red: FPV predicted future. Orange: BEV clean future ground truth.
@article{liu2025egotraj,
title={EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations},
author={Liu, Jiayi and Zhou, Jiaming and Ye, Ke and Lin, Kun-Yu and Wang, Allan and Liang, Junwei},
journal={arXiv preprint arXiv:2510.00405},
year={2025}
}