Efficient Video Redaction at the Edge: Human Motion Tracking for Privacy Protection

Haotian Qiao1, Vidya Srinivas2, Peter Dinda3, Robert P. Dick1

1University of Michigan, 2University of Washington, 3Northwestern University

For a quick evaluation, we recommend using our provided Docker image. See Dataset and Docker Image for details.

Abstract

Computationally efficient, camera-based, real-time human position tracking on low-end edge devices would enable numerous applications, including privacy-preserving video redaction and analysis. Unfortunately, running most deep neural network based models in real time requires expensive hardware, making widespread deployment difficult, particularly on edge devices. Shifting inference to the cloud increases the attack surface, generally requiring that users trust cloud servers, and increases demands on wireless networks in deployment venues. Our goal is to determine the extreme to which edge video redaction efficiency can be taken, with a particular interest in enabling, for the first time, low-cost, real-time deployments with inexpensive commodity hardware. We present an efficient solution to the human detection (and redaction) problem based on singular value decomposition (SVD) background removal and describe a novel time- and energy-efficient sensor-fusion algorithm that leverages human position information in real-world coordinates to enable real-time visual human detection and tracking at the edge. These ideas are evaluated using a prototype built from resource-constrained commodity hardware representative of commonly used low-cost IoT edge devices. The speed and accuracy of the system are evaluated via a deployment study, and it is compared with the most advanced relevant alternatives. The multi-modal system operates at a frame rate ranging from 20 FPS to 60 FPS, achieves a wIoU0.3 score ranging from 0.71 to 0.79, and successfully performs complete redaction of privacy-sensitive pixels with a success rate of 91%–99% in human head regions and 77%–91% in upper body regions, depending on the number of individuals present in the field of view. These results demonstrate that it is possible to achieve adequate efficiency to enable real-time redaction on inexpensive, commodity edge hardware.
Key Words: ERASE (Efficient Redaction Automation System at the Edge), Human detection and tracking, SVD background subtraction, UWB localization, Sensor fusion, Real-time edge computing, Privacy preservation
Appropriate Audience: Efficient human detection and tracking on edge devices, UWB + vision sensor fusion, Privacy-preserving video redaction

Method

ERASE fuses visual and UWB localization information. By projecting the UWB position estimates from physical coordinates to pixel coordinates, the system estimates the number of people in the camera view and initializes tentative bounding box solutions (blue in the figure below), thereby accelerating and improving the accuracy of vision-based human detection. The system then optimizes these bounding boxes (red is the final prediction). The system finally redacts the regions containing privacy information.
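
The projection step described above can be illustrated with a minimal sketch. It assumes a calibrated pinhole camera and a UWB position that marks the centre of a person's body; the class `PinholeCamera`, the function `tentative_box`, and the default person dimensions and image size are hypothetical names and values chosen for illustration, not part of the ERASE code base.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class PinholeCamera:
    """Hypothetical calibrated camera; every field is an illustrative assumption."""
    fx: float  # focal length along x, in pixels
    fy: float  # focal length along y, in pixels
    cx: float  # principal point x, in pixels
    cy: float  # principal point y, in pixels
    rotation: Sequence[Sequence[float]]  # 3x3 world-to-camera rotation (row-major)
    translation: Vec3  # world-to-camera translation, in metres

    def world_to_camera(self, p: Vec3) -> Vec3:
        # Apply p_cam = R @ p_world + t, one output coordinate per rotation row.
        x, y, z = (
            sum(r * c for r, c in zip(row, p)) + t
            for row, t in zip(self.rotation, self.translation)
        )
        return (x, y, z)

    def project(self, p_world: Vec3) -> Optional[Tuple[float, float, float]]:
        """Return (u, v, depth), or None if the point is behind the camera."""
        x, y, z = self.world_to_camera(p_world)
        if z <= 0:
            return None
        return (self.fx * x / z + self.cx, self.fy * y / z + self.cy, z)


def tentative_box(
    cam: PinholeCamera,
    p_world: Vec3,
    person_height_m: float = 1.7,  # assumed nominal height
    person_width_m: float = 0.6,   # assumed nominal width
    image_size: Tuple[int, int] = (1280, 720),  # assumed (width, height)
) -> Optional[Box]:
    """Seed a bounding box around a projected UWB position estimate."""
    projected = cam.project(p_world)
    if projected is None:
        return None
    u, v, depth = projected
    # Apparent size in pixels shrinks in proportion to depth.
    half_w = 0.5 * cam.fx * person_width_m / depth
    half_h = 0.5 * cam.fy * person_height_m / depth
    width, height = image_size
    left, top = max(0.0, u - half_w), max(0.0, v - half_h)
    right, bottom = min(float(width), u + half_w), min(float(height), v + half_h)
    if left >= right or top >= bottom:
        return None  # the person falls outside the camera view
    return (left, top, right, bottom)
```

Counting the non-`None` boxes returned for all tracked UWB tags gives an estimate of how many people are in view, and each box can serve as the starting point for the vision-based refinement.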

Figure 1: ERASE system flow chart. Sub-figures 1 - 6 show the intermediate outputs at each stage of ERASE, illustrating the redaction process. The red box is the final bounding box prediction.
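
The abstract names SVD background removal as the basis of the detection stage. A minimal sketch of the general technique follows, assuming a short stack of grayscale frames from a static camera: a low-rank approximation of the frame matrix captures the static background, and pixels that deviate strongly from it are treated as foreground. The function names, the rank, and the threshold are illustrative assumptions, and this batch formulation is not necessarily the formulation ERASE uses.

```python
import numpy as np


def svd_background(frames: np.ndarray, rank: int = 1) -> np.ndarray:
    """Low-rank background estimate for a (T, H, W) stack of grayscale frames."""
    t, h, w = frames.shape
    # Each row of the data matrix is one flattened frame.
    data = frames.reshape(t, h * w).astype(np.float64)
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    # Keep only the leading singular components (the static scene content).
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return low_rank.reshape(t, h, w)


def foreground_mask(
    frames: np.ndarray, rank: int = 1, threshold: float = 25.0
) -> np.ndarray:
    """Boolean (T, H, W) mask of pixels that deviate from the background."""
    background = svd_background(frames, rank)
    return np.abs(frames.astype(np.float64) - background) > threshold
```

In a fused system of the kind described above, a mask like this would only need to be examined inside the tentative boxes seeded from UWB positions, which is one plausible source of the efficiency gain.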


Visuals & Key Results

Figure 2: ERASE is capable of detecting individuals across a variety of poses and under partial occlusion. The predicted bounding boxes (in red) accurately enclose the human figures in these challenging scenarios.

Figure 3: Accuracies as functions of redaction method and number of people in scenes. YOLO11m achieves the best accuracy in all measures. ERASE achieves good recall and wIoU0.3, which implies that it redacts most privacy-relevant pixels without redacting many privacy-irrelevant pixels. However, its precision is lower than that of the neural network models, implying that its estimated boxes cover more privacy-irrelevant pixels. ERASE is the only method capable of running in real time.
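
The pixel-level reading of precision and recall in this caption can be made concrete with a small sketch. It assumes both metrics are computed over pixels, with predicted boxes defining the redacted set and a ground-truth set marking privacy-relevant pixels; the helper names are hypothetical, and wIoU0.3 is not reproduced because its definition is not given on this page.

```python
from typing import Iterable, Set, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Pixel = Tuple[int, int]


def box_pixels(boxes: Iterable[Box]) -> Set[Pixel]:
    """Set of (x, y) pixels covered by at least one box."""
    return {
        (x, y)
        for left, top, right, bottom in boxes
        for x in range(left, right)
        for y in range(top, bottom)
    }


def pixel_precision_recall(
    predicted_boxes: Iterable[Box], sensitive_pixels: Set[Pixel]
) -> Tuple[float, float]:
    """Precision: share of redacted pixels that are privacy-relevant.
    Recall: share of privacy-relevant pixels that were redacted."""
    redacted = box_pixels(predicted_boxes)
    hits = len(redacted & sensitive_pixels)
    precision = hits / len(redacted) if redacted else 0.0
    recall = hits / len(sensitive_pixels) if sensitive_pixels else 1.0
    return precision, recall
```

Under this reading, an oversized box keeps recall high because it still covers the person, but lowers precision because it also covers background, which matches the trade-off the caption describes.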

Figure 4: ERASE runs at over 20 FPS when five people are in the scene, approximately 4× faster than the neural network models. Note that the lines for YOLO11n and MediaPipe Detect overlap.

Figure 5: Success rates for redacting heads, upper bodies, and whole bodies for various numbers of people in the field of view. After redaction by ERASE, most privacy-relevant pixels are successfully removed.


Dataset and Docker Image

The data provided in the GitHub repository is a subset of the full dataset. Public access without additional consent is limited to this subset. If you need access to the full dataset for research purposes, please contact the authors by email. Once you have access, please store the data securely on a local machine and do not distribute it.

For a quick evaluation, we also provide a Docker image containing the necessary pre-built environment, the repository, and the partial dataset. The image is published on Docker Hub. To pull the image, run: `docker pull qhaotian0525/erase:v1.0`

For instructions on using the redaction code, see `docs/redaction.md` in the repository.

Contact Info: Haotian Qiao -- qhaotian@umich.edu

Citation

If you find our work useful in your research, please consider citing:

@article{qiao2025efficient,
  author    = {Qiao, Haotian and Srinivas, Vidya and Dinda, Peter and Dick, Robert P.},
  title     = {Efficient Video Redaction at the Edge: Human Motion Tracking for Privacy Protection},
  journal   = {{ACM} Trans.\ Embedded Computing Systems},
  year      = 2025,
  volume    = 24,
  number    = {5s},
  articleno = 120,
  pages     = 22,
  url       = {https://doi.org/10.1145/3762994},
  month     = sep
}