Why 3D Scenes May Emerge as a Transformative Modality in Human Communication

November 2025 | Feiran Wang

The Evolution of Information Dimensions

From ancient postal systems to the early World Wide Web, people have shared and connected predominantly through one-dimensional information: text and voice. As communication technologies advanced, information modalities gradually expanded into two dimensions, first with images and later with video. Today, 1D and 2D modalities coexist in modern life, enabling increasingly instant communication and underpinning much of contemporary society.

The Appeal of Immersive Information

Higher-dimensional information formats appear to resonate with human perception, though not simply because they contain more raw data. Rather, they may align more closely with how we naturally experience the world, potentially triggering stronger emotional and cognitive responses. A photograph of a battlefield can convey the immediacy of conflict more viscerally than text descriptions; a video of a child playing might evoke joy more spontaneously than written accounts. This suggests that visual media can communicate certain types of information, particularly emotional and contextual, more efficiently than text alone.

3D scenes represent a further step in this progression. By immersing individuals in spatial environments that approximate physical presence, they offer the potential for more intuitive understanding and richer contextual information transfer, though the extent of this advantage likely varies significantly across different use cases.

The Trade-offs Between Modalities

It's important to recognize that different modalities (text, images, video, and 3D scenes) each have distinct strengths rather than existing in a simple hierarchy. Lower-dimensional formats excel at logical argumentation and precise communication, while requiring less bandwidth and cognitive load. Higher-dimensional formats can convey experiential and emotional content more readily, as suggested by the adage "a picture is worth a thousand words." One might extend this to suggest that experiencing a 3D scene could communicate spatial and contextual information more effectively than multiple videos.

The emergence of 3D as a more accessible modality would likely complement rather than replace existing formats, with each medium supporting the others to convey facts, perspectives, and experiences.

However, higher-dimensional information comes with trade-offs: increased data requirements, potential for information overload, and sometimes weaker capacity for structured logical expression. In practice, effective communication often involves thoughtful combinations of text, images, video, and increasingly, 3D elements.
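
To make the data-requirement trade-off concrete, here is a rough back-of-envelope sketch in Python. Every figure in it is an illustrative assumption rather than a measurement: the message length, image resolution, clip duration, and Gaussian count are hypothetical, all sizes are raw and uncompressed, and the 59 floats per primitive follow the parameter layout commonly used with 3D Gaussian Splatting (position, scale, rotation, opacity, and degree-3 spherical-harmonics color).

```python
# Back-of-envelope raw (uncompressed) sizes for one "message" per modality.
# All quantities below are illustrative assumptions, not measurements.

BYTES_PER_FLOAT32 = 4


def text_bytes(num_chars: int, bytes_per_char: int = 1) -> int:
    """Plain text, assuming one byte per character (ASCII-like)."""
    return num_chars * bytes_per_char


def image_bytes(width: int, height: int, channels: int = 3) -> int:
    """One 8-bit-per-channel image, uncompressed."""
    return width * height * channels


def video_bytes(width: int, height: int, fps: int, seconds: int) -> int:
    """Uncompressed video: one raw image per frame."""
    return image_bytes(width, height) * fps * seconds


def gaussian_scene_bytes(num_gaussians: int, floats_per_gaussian: int = 59) -> int:
    """Gaussian-splat scene stored as float32 values.

    59 = 3 (position) + 3 (scale) + 4 (rotation quaternion)
       + 1 (opacity) + 48 (degree-3 spherical harmonics, 16 coeffs x RGB).
    """
    return num_gaussians * floats_per_gaussian * BYTES_PER_FLOAT32


# Hypothetical example messages, one per modality.
sizes = {
    "text (1,000 characters)": text_bytes(1_000),
    "image (1920x1080)": image_bytes(1920, 1080),
    "video (1920x1080, 30 fps, 10 s)": video_bytes(1920, 1080, 30, 10),
    "3D scene (1,000,000 Gaussians)": gaussian_scene_bytes(1_000_000),
}

for name, size in sizes.items():
    print(f"{name:35s} {size / 1e6:10.3f} MB")
```

Under these particular assumptions, the static scene (about 236 MB) is far larger than a single image (about 6 MB) but smaller than ten seconds of raw video. Compression, scene scale, and dynamic content can change the picture considerably, so the numbers indicate orders of magnitude at best.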

The Missing Piece: Accessible Infrastructure

While 3D scene technology has advanced significantly, it has not yet achieved the kind of societal penetration seen with the World Wide Web or, more recently, large language models. A key factor appears to be the lack of a ubiquitous, user-friendly delivery platform.

Historical patterns offer some perspective: early text-based communication required telegraph infrastructure and later the internet; widespread image and video sharing became practical only with modern smartphones and high-bandwidth networks. Currently, 3D content creation and consumption remain largely confined to specialized domains such as gaming, professional visualization, and research. What's still missing is a device that seamlessly integrates into everyday life for the general public.

Despite substantial investment from major technology companies and startups in AR/VR headsets and 3D imaging systems over the past two decades, no product has yet achieved the trifecta of comfort, affordability, and broad utility. Many devices generate initial excitement but fail to establish sustained usage patterns, suggesting that significant technical or design challenges remain.

Technology adoption often follows nonlinear trajectories. The iPhone's introduction in 2007 catalyzed rapid mainstream adoption of mobile internet and visual media, features that had existed in various forms for years but hadn't found the right platform.

A similar inflection point for 3D technology is plausible, though its timing and specific form remain to be determined.

Accelerating Progress in 3D Research

The research landscape for 3D technologies has evolved remarkably over the past decade, driven by improvements in computing hardware, depth sensors, and machine learning algorithms.

Traditional geometry-based methods such as COLMAP [1] and multi-view stereo provided important foundations. Microsoft's KinectFusion [2] demonstrated real-time reconstruction with a consumer-grade RGB-D sensor. The integration of learning-based approaches, such as MVSNet [3], marked a significant methodological shift. More recently, Neural Radiance Fields (NeRF) [4] and 3D Gaussian Splatting [5] have enabled novel view synthesis with impressive visual quality, while transformer architectures have shown promise across multiple 3D tasks.
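
As a rough illustration of the idea behind radiance-field rendering, the sketch below composites color samples along a single camera ray using the discrete volume rendering rule described in the NeRF paper [4]: each sample contributes its color, weighted by its own opacity and by the transmittance remaining in front of it. The function name and the sample values are hypothetical. A real system would predict densities and colors with a learned model, or, in Gaussian Splatting, blend depth-sorted projected Gaussians with a similar front-to-back rule, rather than take them from hard-coded lists.

```python
import math


def composite_ray(densities, colors, deltas):
    """Front-to-back compositing of samples along one ray.

    densities: non-negative volume density (sigma) per sample
    colors:    (r, g, b) tuple per sample
    deltas:    length of the ray segment covered by each sample
    Returns the accumulated (r, g, b) color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha  # light remaining behind this sample
    return tuple(pixel), weights


# Hypothetical ray: empty space, then a dense red surface, then a blue one behind it.
densities = [0.0, 50.0, 50.0]
colors = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
deltas = [0.1, 0.1, 0.1]

pixel, weights = composite_ray(densities, colors, deltas)
print(pixel)    # dominated by red: the front surface occludes the blue one
print(weights)  # the empty-space sample gets zero weight
```

Because the weights depend only on density and sample spacing, the same rule yields a color for any viewpoint once a scene representation can be queried along arbitrary rays, which is what makes novel view synthesis possible.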

The emergence of vision foundation models such as DUSt3R [6] and VGGT [7] represents another potential step change, dramatically improving the quality and speed of 2D-to-3D conversion in certain scenarios. Current research increasingly focuses on multimodal integration, incorporating language models and pursuing "world models" that combine multiple sensing and reasoning capabilities.

These developments suggest that many technical barriers to high-quality 3D reconstruction and rendering are diminishing, though challenges around efficiency, generalization, and real-time performance remain active research areas.

A Future Closer Than It Appears

3D scenes offer a distinctive capacity for conveying spatial presence and contextual richness. Humans' innate spatial reasoning abilities suggest a natural affinity for well-designed 3D interfaces. The practical advantages of this match between human perception and 3D content are increasingly being demonstrated across various applications.

The foundations have developed substantially: hardware capabilities continue to advance, algorithmic approaches are maturing rapidly, and theoretical understanding is deepening. Whether 3D modalities will achieve mainstream adoption comparable to 2D visual media depends on multiple factors: continued technical progress, the emergence of compelling use cases, the development of accessible platforms, and ultimately user behavior and preferences.

The trajectory suggests we may be approaching an inflection point. The pace of capability development in both research and industry has accelerated markedly.

While the specific form factor and timeline remain open questions, the convergence of multiple technological threads suggests that this future may be nearer than conventional wisdom holds. This is a moment of genuine possibility, one where the pieces are rapidly coming together. The question is perhaps less whether 3D communication will become mainstream and more whether we will recognize the transition as it begins to unfold.

About the Author

Feiran Wang is a PhD student in Computer Vision, specializing in 3D reconstruction, vision foundation models, medical imaging, and generative AI. His research focuses on bridging the physical and digital worlds to create real-world impact.

Visit the author's homepage to learn more about his research and publications.

References

  1. Schönberger, Johannes Lutz, and Jan-Michael Frahm. "Structure-from-Motion Revisited." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  2. Newcombe, Richard A., et al. "KinectFusion: Real-Time Dense Surface Mapping and Tracking." IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011.
  3. Yao, Yao, et al. "MVSNet: Depth Inference for Unstructured Multi-View Stereo." European Conference on Computer Vision (ECCV), 2018.
  4. Mildenhall, Ben, et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." Communications of the ACM 65.1 (2021): 99-106.
  5. Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics 42.4 (2023): 139:1-139:14.
  6. Wang, Shuzhe, et al. "DUSt3R: Geometric 3D Vision Made Easy." IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  7. Wang, Jianyuan, et al. "VGGT: Visual Geometry Grounded Transformer." IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.