Feature Articles: NTT R&D at Expo 2025 Osaka, Kansai, Japan

Vol. 23, No. 10, pp. 45–50, Oct. 2025. https://doi.org/10.53829/ntr202510fa5

A Pavilion Clad in Emotions: Harmonized Communication Experiences between People and Objects

Junichi Oka, Xu Shi, Shintaro Mizuno, Satoshi Suzuki,
Yuichi Nakazawa, and Motohiro Takagi

Abstract

This article introduces one of the key concepts behind the NTT Pavilion at Expo 2025 Osaka, Kansai, Japan: the creation of future communication through the Pavilion Clad in Emotions. It showcases the efforts behind IOWN (Innovative Optical and Wireless Network) Photonic Disaggregated Computing and MediaGnosis, NTT’s next-generation media processing artificial intelligence. It also presents initiatives using IOWN Photonic Disaggregated Computing and Embodied Knowledge Understanding Technology to ensure safety and security for visitors to the NTT Pavilion.

Keywords: IOWN Photonic Disaggregated Computing, data-centric infrastructure (DCI), photonics-electronics convergence (PEC) devices

1. Production overview of the NTT Pavilion at Expo 2025 Osaka, Kansai, Japan

To create future communication in which human emotions and sensations are linked with objects, NTT used Innovative Optical and Wireless Network (IOWN) Photonic Disaggregated Computing, the computing platform of the IOWN era, along with its proprietary next-generation media processing artificial intelligence (AI) MediaGnosis.

In Zone 2 of the NTT Pavilion, where performances by the music group Perfume were presented, five cameras were installed to capture visitors’ facial expressions. These images were analyzed once per second to count the number of smiles. On the basis of the number of smiles detected, the curtain that enveloped the pavilion was dynamically and intricately controlled (Fig. 1).


Fig. 1. Overview of the DCI system at the NTT Pavilion, Expo 2025 Osaka, Kansai, Japan.

To provide a safe and secure experience for visitors, NTT implemented a system that combines IOWN Photonic Disaggregated Computing with Embodied Knowledge Understanding Technology. Sixteen cameras installed around the pavilion detected individuals and calculated crowd density by dividing the number of people detected by the area covered by each camera.
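
The density calculation itself is a simple ratio. The following is a minimal sketch, with hypothetical per-camera coverage areas, since the article does not give the actual values.

```python
# Minimal sketch of the crowd-density calculation described above.
# The per-camera coverage areas are hypothetical values for
# illustration; the article does not specify them.

CAMERA_COVERAGE_M2 = {"cam01": 120.0, "cam02": 95.0}  # area seen by each camera

def crowd_density(camera_id: str, people_detected: int) -> float:
    """People per square meter within a single camera's coverage area."""
    return people_detected / CAMERA_COVERAGE_M2[camera_id]

# Example: 48 people detected by cam01 -> 0.4 people/m^2
print(crowd_density("cam01", 48))
```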

The system also estimated skeletal posture to determine whether a person had fallen. If a fall was detected, the location information was sent to mobile devices used by pavilion staff. Upon receiving an alert, staff could promptly guide visitors to alleviate congestion or assist anyone who had fallen, enabling swift and responsive operations.
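
A minimal sketch of this alert flow follows. The posture labels, stationary-time threshold, and staff notification callable are all assumptions; the article does not specify them.

```python
import time
from dataclasses import dataclass

# Sketch of the fall-alert flow described above. The posture labels,
# the confirmation threshold, and the notification callable are
# hypothetical; the article does not specify them.

FALL_CONFIRM_SECONDS = 10.0  # assumed threshold before alerting staff

@dataclass
class TrackedPerson:
    location: tuple[float, float]
    posture: str = "standing"          # e.g., "standing" or "fallen"
    fallen_since: float | None = None

def check_fall(person: TrackedPerson, notify_staff) -> None:
    """Alert staff once a fallen posture persists beyond the threshold."""
    if person.posture == "fallen":
        if person.fallen_since is None:
            person.fallen_since = time.monotonic()
        elif time.monotonic() - person.fallen_since > FALL_CONFIRM_SECONDS:
            notify_staff(person.location)  # push location to staff devices
    else:
        person.fallen_since = None
```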

2. Features of IOWN Photonic Disaggregated Computing

IOWN Photonic Disaggregated Computing is a next-generation computing platform composed of a hardware layer, consisting of servers equipped with graphics processing units (GPUs) and photonics-electronics convergence (PEC) switches, and a software layer featuring the data-centric infrastructure (DCI) controller, which flexibly manages GPUs and other accelerator resources (Fig. 2).


Fig. 2. Composition and features of IOWN Photonic Disaggregated Computing.

The use of AI for video processing has been accelerating in fields such as marketing and security. However, conventional systems face challenges, including linear growth in the use of physical resources such as GPUs and the limits of Ethernet-switch processing capacity under growing data traffic. In particular, the resulting increase in power consumption has raised concerns about environmental impact. To address these issues, IOWN Photonic Disaggregated Computing promotes efficient use of physical resources and applies PEC devices to Ethernet switches, reducing power consumption to one-eighth that of standard server configurations.

The DCI controller has three main features. The first feature is a software implementation technique that draws out the full potential of cutting-edge hardware to achieve high-performance data processing. By offloading processing to dedicated hardware, such as GPUs and SmartNICs (smart network interface cards), it enables fast data processing and low power consumption.

The second feature is a technology that optimizes hardware usage according to the load and inference conditions of the applications. For example, in the case of video analysis of people, the system increases the number of physical resources, such as GPUs, during busy daytime hours and reduces them during quieter nighttime periods. Adjusting the amount of physical resources required for inference contributes to reducing power consumption.
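
A minimal sketch of this idea follows, assuming a hypothetical DCI-controller client (`dci`) with simple allocation calls; the actual controller interface is not described in the article, and the thresholds are illustrative.

```python
# Hypothetical sketch of load-based GPU scaling as described above.
# `dci` stands in for a DCI-controller client; its API and the load
# thresholds below are assumptions, not taken from the article.

def target_gpu_count(people_per_frame: float) -> int:
    """Scale GPU resources with the observed analysis load."""
    if people_per_frame > 20:      # busy daytime hours
        return 4
    if people_per_frame > 5:
        return 2
    return 1                       # quiet nighttime periods

def rebalance(dci, current_load: float) -> None:
    desired = target_gpu_count(current_load)
    if desired != dci.allocated_gpus():
        dci.set_gpu_allocation(desired)  # release or attach pooled GPUs
```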

The third feature is inference selection technology tailored to the content of the video. Unlike conventional methods that apply the same inference regardless of whether people appear in the image, this technology distinguishes between images with and without people. Images with people are processed using advanced inference, while those without are handled with lightweight inference. By determining the type of inference for each image, the system minimizes the number of advanced inference executions and achieves lower power consumption.
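
The gating logic can be sketched as follows; the detector and model objects and their methods are hypothetical stand-ins, not the system's actual API.

```python
# Sketch of per-image inference selection: a lightweight person check
# gates the advanced (expensive) inference. The objects and method
# names are placeholders for illustration only.

def analyze(frame, light_detector, heavy_model):
    """Run advanced inference only when people are present."""
    if light_detector.contains_person(frame):   # cheap, low-power check
        return heavy_model.infer(frame)         # advanced inference
    return None                                 # lightweight path: skip
```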

The PEC switch differs from conventional Ethernet switch mechanisms in that it replaces internal electrical connections with optical ones. By incorporating PEC devices, the PEC switch supports 128 Ethernet ports within a single compact chassis, thus improving performance and reducing power consumption. The PEC devices embedded within the PEC switch are extremely compact and efficient components. They offer data processing performance equivalent to eight conventional 400G optical modules while consuming only half the power.

3. Features of MediaGnosis

In the NTT Pavilion, we analyzed the facial expressions of visitors and controlled the facade accordingly. Specifically, the faces detected with the DCI technology were analyzed by MediaGnosis, NTT's original AI. From the analysis results, we calculated the ratio of happy and surprised visitors and reflected it in the control of the facade as an excitement level. Toward Expo 2025 Osaka, Kansai, Japan (hereafter, Expo 2025), we improved the facial expression analysis engine in MediaGnosis to resolve the following two issues (Fig. 3).


Fig. 3. Challenges with MediaGnosis and approaches.
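
As a rough sketch of the excitement-level calculation described above: the expression labels and output format are assumptions, since MediaGnosis's interface is not detailed here.

```python
# Sketch of the excitement-level calculation described above.
# Expression labels are illustrative; MediaGnosis's actual output
# schema is not described in the article.

def excitement_level(expressions: list[str]) -> float:
    """Ratio of happy or surprised faces among all detected faces."""
    if not expressions:
        return 0.0
    excited = sum(e in ("happy", "surprised") for e in expressions)
    return excited / len(expressions)

# e.g., 3 of 4 detected faces are happy/surprised -> 0.75
print(excitement_level(["happy", "neutral", "surprised", "happy"]))
```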

The first issue is that in Zone 2 of the NTT Pavilion, where five cameras capture visitors, illumination conditions are highly unstable: the performance-driven displays and the light-emitting diodes installed on the stage frequently turn on and off. Consequently, the appearance of visitors' faces changes significantly over time. Additionally, when multiple visitors appear in the same camera frame, some faces often overlap and cannot be fully captured.

Facial expression analysis requires detailed facial information; however, it must work robustly even under these conditions. To address this, we incorporated more than ten data augmentation techniques during training of the AI model, including contrast transformations and artificial image occlusions. These augmentations simulate variations that may occur in the actual pavilion, allowing the AI model to perform consistently under such unstable conditions. We thus developed a facial expression analysis AI with high robustness, even in the complex and unstable conditions expected at Expo 2025.
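
As a rough illustration of such training-time augmentation, here is a sketch using torchvision; the article names contrast transformations and artificial occlusions among more than ten techniques, and the specific transforms and parameter values below are assumptions.

```python
import torchvision.transforms as T

# Illustrative training-time augmentations in the spirit described
# above: contrast/brightness variation (unstable stage lighting) and
# random erasing (partially occluded faces). Parameter values are
# assumptions, not the article's actual settings.
augment = T.Compose([
    T.ColorJitter(brightness=0.5, contrast=0.5),  # simulate lighting swings
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.RandomErasing(p=0.5, scale=(0.02, 0.2)),    # simulate face occlusion
])
```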

The second issue is that a wide range of age groups was expected to visit our pavilion; in particular, a large number of children were anticipated. However, most publicly available datasets for facial expression analysis consist primarily of facial images of individuals over age 15, which poses a significant challenge for achieving high-performance facial expression analysis for individuals under 15. One likely reason is that publicly accessible datasets often exclude children's facial images due to privacy concerns. It therefore remains difficult to improve the AI's accuracy in analyzing children's facial expressions using public datasets alone.

To address this, we created a valuable original dataset consisting primarily of facial images of individuals aged 0 to 15, which is accessible only within NTT laboratories. Since it is difficult to recruit as many child participants as appear in adult datasets, we prepared a wide variety of camera angles and background patterns during data collection to maximize variation even with a limited number of subjects. By leveraging this dataset for training, we successfully enhanced facial expression analysis performance across a broad range of age groups, including children.

Before Expo 2025’s opening, we conducted a field trial of the improved facial expression analysis AI model in Zone 2 of the actual NTT Pavilion. Through this trial with extras representing a wide range of ages and nationalities, we were able to confirm that the model delivers sufficient performance and robustness under dynamic stage conditions. We believe that this achievement demonstrates the strength of MediaGnosis. Its applications extend beyond Expo 2025 to a wide range of potential use cases.

4. Features of Embodied Knowledge Understanding Technology

Embodied Knowledge Understanding Technology recognizes human behavior from surveillance cameras and cameras mounted on robots. At Expo 2025, the NTT Pavilion used video footage from cameras installed outside the pavilion and AI running on the DCI system to detect individuals who had fallen within the premises, thus achieving efficient surveillance of the NTT Pavilion.

Many fall detection technologies either deploy AI models directly on edge devices, such as surveillance cameras, or install AI models on central processing unit (CPU)/GPU servers. However, the former approach often requires simplifying the AI models, which can reduce accuracy, while the latter cannot efficiently process multiple people across numerous cameras. To enable efficient AI-based monitoring for the 16 cameras installed around the NTT Pavilion, it was necessary to develop a fall detection system with an AI model capable of multi-camera, multi-person inference.

There are two key technical points (Fig. 4). The first is the construction of a fall detection model that can adapt to variations in camera height and angle. The cameras installed outside the NTT Pavilion were positioned at heights ranging from approximately 4 to 8 meters, and their angles also differed significantly. Previous fall detection models were designed mainly for indoor environments, making it difficult for them to adapt to the diverse heights and angles of outdoor cameras. To address this challenge, we created a video dataset modeled after an event venue: multi-camera, multi-person footage simulating a scenario in which approximately 50 performers engage in various actions, including falls, all of which was then annotated. Using this dataset, we constructed a model capable of detecting falls in footage from cameras with a wide range of heights and angles, achieving better performance than conventional fall detection models.


Fig. 4. Key points in Embodied Knowledge Understanding Technology.

The second key point is a person tracking technology whose cost is independent of the number of people present. To detect falls, it is first necessary to detect individuals, then monitor whether they remain stationary for a certain period, and finally issue a fall alert if the relevant criteria are met. It is thus essential to track every person appearing in the video, assign each an identifier, and manage their status accordingly. Tracking each individual traditionally requires dedicated processing resources per person, so the computational load increases as the number of people grows. This limitation has made it difficult to apply the technology to large-scale event venues, where cameras often capture many people simultaneously. With our technology, we constructed a person tracking model that is resilient to fluctuations in the number of people by processing the internal tracking operations for all individuals in parallel on a GPU. Specifically, by offloading loop processing, which previously had to be executed on a CPU, to parallel processing on a GPU, we minimized the increase in CPU load even when many people appear across multiple camera feeds, enabling efficient GPU execution.
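
A minimal sketch of this design choice in PyTorch follows, using a simplified nearest-neighbor association for illustration; the actual tracking model and its internal operations are not described in the article.

```python
import torch

# Sketch of replacing a per-person CPU loop with one batched GPU step.
# A simplified nearest-neighbor association stands in for the real
# tracking logic, which the article does not describe.

def associate(tracks: torch.Tensor, detections: torch.Tensor) -> torch.Tensor:
    """Match each track to its nearest detection for all people at once.

    tracks:     (T, 2) last known (x, y) per tracked person
    detections: (D, 2) detected (x, y) in the current frame
    Returns the index of the nearest detection for every track.
    """
    # (T, D) pairwise distances computed in parallel on the GPU,
    # instead of looping over tracks one by one on the CPU.
    dists = torch.cdist(tracks, detections)
    return dists.argmin(dim=1)

device = "cuda" if torch.cuda.is_available() else "cpu"
tracks = torch.rand(50, 2, device=device)      # ~50 people, as in the dataset
detections = torch.rand(55, 2, device=device)
matches = associate(tracks, detections)
```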

5. Future outlook

DCI will continue to develop by leveraging the insights gained through its implementation at Expo 2025, with the goal of completing a commercial version by fiscal year 2026. To actualize the IOWN vision for 2030, efforts will be made to further reduce power consumption and generate new value.

PEC devices will undergo further miniaturization and be adapted to replace electrical connections within computers, including those between CPUs and GPUs, with optical connections. A resource pooling mechanism will be introduced to aggregate resources of the same type, such as CPUs and GPUs, enabling the creation of computers that can freely connect to the necessary physical resources. These developments will contribute to even greater reductions in power consumption.

MediaGnosis is a media processing AI with a wide range of capabilities, not limited to facial expression analysis. One can also experience it through its demo site (https://www.rd.ntt/mediagnosis/demo/). On the basis of the technologies and data accumulated through our efforts at Expo 2025, we believe that we can further enhance the robustness of MediaGnosis, even under unstable conditions.

For Embodied Knowledge Understanding Technology, we will build on the multi-camera, multi-person behavior recognition capability obtained at Expo 2025 and aim to establish technology that can detect not only falls but also more advanced human behavior in real time in multi-camera, multi-person environments. This will enable us to expand the application of this technology to work behavior in a variety of industries.

Junichi Oka
Director, System Software Project, NTT Software Innovation Center.
He received a B.E. and M.E. in mechanical engineering from Waseda University, Tokyo, in 1996 and 1998. From 1998 to 2022, he was engaged in the research and development of NTT's operational monitoring software "Crane," diskless boot systems "STRAGEX," and open source cloud systems, such as OpenStack and OpenShift. He is currently engaged in research and development of the DCI system exhibited at Expo 2025 Osaka, Kansai, Japan, and of efficient software development processes.
Xu Shi
Senior Research Scientist, AI Application Platform Project, NTT Software Innovation Center.
She received an M.E. from the Department of Computer and Network Engineering, the University of Electro-Communications, Tokyo, in 2014. She joined NTT Software Innovation Center the same year and has since been engaged in research and development in the fields of data analytics, video AI technologies, and AI framework for IOWN, including their practical application and commercialization.
Shintaro Mizuno
Vice President, Head of NTT Software Innovation Center.
He received a B.E. and M.E. in mechanical and environmental informatics from Tokyo Institute of Technology in 1995 and 1997. He joined NTT Software Laboratory in 1997 and studied distributed computing and authentication systems. From 2011 to 2025, he was engaged in the research and development of computing systems, including open source cloud systems, such as OpenStack and Kubernetes, and a next-generation computing architecture for IOWN.
Satoshi Suzuki
Research Scientist, NTT Human Informatics Laboratories.
He received a B.E., M.E., and Ph.D. from the University of Electro-Communications, Tokyo, in 2015, 2017, and 2022. He joined NTT in 2017. His current research interests include neural networks, computer vision, and machine learning. He is a member of the Information Processing Society of Japan.
Yuichi Nakazawa
Senior Research Engineer, NTT Human Informatics Laboratories. 
He received a B.A. in environment and information studies and an M.A. in media and governance from Keio University, Tokyo, in 2001 and 2003. He joined NTT in 2003. His current role is driving the development and business expansion of MediaGnosis. He is a member of the Acoustical Society of Japan.
Motohiro Takagi
Senior Research Engineer, NTT Human Informatics Laboratories.
He received a B.E., M.E., and Ph.D. from Keio University, Tokyo, in 2009, 2011, and 2020. In 2011, he joined NTT. His research interests include human behavior understanding through machine learning, computer vision, and natural language processing.
