Trauma is the leading cause of mortality in children and young adults. The initial resuscitation of injured patients is critical for identifying and managing life-threatening injuries. Despite the use of a standardized protocol, errors remain frequent during this initial evaluation. Computerized decision support has been proposed as a method for reducing errors in this setting. Medical activities are key components of clinical workflows, so automatic activity recognition during trauma resuscitation merits study both to generate computerized decision support for the next step and to analyze errors after the resuscitation. Video understanding has advanced rapidly in recent years owing to the success of deep learning methods in computer vision. Our work on video-based activity recognition in clinical settings is important because videos contain rich texture features for recognizing clinical activities.
We first present medical phase recognition during trauma resuscitation. Each medical phase can be considered a sequence of activities and represents the progress of the current resuscitation. Based on the Advanced Trauma Life Support (ATLS) protocol, each trauma resuscitation case can be divided into five sequential phases: the pre-arrival phase, focused on preparation for the patient; the primary survey, for identifying and managing life-threatening injuries; the secondary survey, for identifying additional injuries that need management; the post-secondary phase, for initiating additional injury management; and the patient-departure phase, for identifying when the patient leaves the room. Identifying phases aids in detecting errors in the type and order of activities. Decision support in this domain should reflect the priorities of each phase: knowledge of the current phase aids in prioritizing required activities based on the underlying goals of each. We used depth videos recorded with a Kinect-v2 as input to preserve the privacy of patients and providers. We also introduced a reduced long-term operation (RLO) method for modeling long-term video context and a progress gate (PG) method that uses video progress to distinguish visually similar phases.
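As a rough illustration of the progress-gate idea, the PyTorch sketch below modulates per-phase logits with a learned function of the video's normalized progress, so that visually similar phases can be separated by when they occur. The class name, gate architecture, and layer sizes are our own illustrative assumptions, not the exact design used in this work.

```python
import torch
import torch.nn as nn

class ProgressGate(nn.Module):
    """Illustrative progress gate (PG): scales per-phase logits by a gate
    computed from the video's normalized progress in [0, 1]."""

    def __init__(self, feat_dim: int, num_phases: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_phases)
        # Maps scalar progress to a per-phase gate in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(1, 32), nn.ReLU(),
            nn.Linear(32, num_phases), nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor, progress: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim) clip features; progress: (B, 1) in [0, 1]
        logits = self.classifier(feats)
        return logits * self.gate(progress)  # suppress phases unlikely at this progress

# Usage: score a batch of 4 clips taken halfway through their videos.
pg = ProgressGate(feat_dim=512, num_phases=5)
gated_logits = pg(torch.randn(4, 512), torch.full((4, 1), 0.5))
```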
We next present a medical activity recognition system for trauma resuscitation. Unlike phase recognition, activity recognition failed with depth videos: recognizing activities requires detailed texture features from RGB videos. To use the texture features provided by RGB videos while preserving patient and provider privacy, we feed the RGB videos into an inflated 3D convolution (I3D) network pre-trained on a public activity recognition dataset and store the pre-computed features for fine-tuning. Although fine-tuning the network using pre-computed features might cause a performance gap compared with end-to-end training, the system is privacy-preserving because the 3D convolution filters contain nonlinear operations that are irreversible. We evaluated the system on the five medical activities most frequently performed during trauma resuscitation.
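A minimal sketch of this two-stage pipeline follows, assuming a generic pre-trained I3D feature extractor (the placeholder `i3d_backbone`) and a 1024-dimensional pooled feature; both are illustrative assumptions. Only the pooled features, never the raw RGB frames, are stored for fine-tuning.

```python
import torch
import torch.nn as nn

# Stage 1 (offline): pass clips through a frozen, pre-trained I3D backbone
# and keep only the pooled feature vectors; raw RGB frames are discarded.
# `i3d_backbone` is a placeholder for any pre-trained I3D feature extractor.
@torch.no_grad()
def precompute_features(i3d_backbone: nn.Module, clips: torch.Tensor) -> torch.Tensor:
    i3d_backbone.eval()
    return i3d_backbone(clips).cpu()  # (B, feat_dim), stored instead of video

# Stage 2: fine-tune only a lightweight head on the stored features.
class ActivityHead(nn.Module):
    def __init__(self, feat_dim: int = 1024, num_activities: int = 5):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_activities)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.fc(feats)  # logits over the five activities

# Usage: train the head with cross-entropy on pre-computed features.
head = ActivityHead()
logits = head(torch.randn(8, 1024))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 5, (8,)))
```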
Multi-label activity recognition targets multiple activities performed simultaneously or sequentially in each video. It is an understudied field but has broader real-world use cases. Most recent multi-label activity recognition methods are derived from single-activity architectures that generate a shared feature vector and apply a sigmoid output activation. Although these methods can produce multi-label outputs, the shared feature vector is not designed for multi-label activities and ignores the correlations between activities. Activities during trauma resuscitation are also multi-label. We therefore present an approach to multi-label activity recognition that extracts independent feature descriptors for each activity and learns activity correlations. This structure can be trained end-to-end and plugged into any existing video classification network. We evaluated the introduced structure on our trauma resuscitation data and showed more than a 5% mAP improvement over the baseline. We also evaluated our method on four public multi-label activity recognition datasets and outperformed the state of the art, which shows the generalizability of our method.
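The sketch below shows one way such a head could look in PyTorch: an independent descriptor branch per activity, followed by a learned correlation matrix over the per-activity logits. The module name, descriptor dimension, and the specific correlation mechanism are illustrative assumptions, not the exact structure from this work.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Illustrative multi-label head: one feature descriptor per activity,
    plus a learned pairwise correlation applied to the per-activity logits."""

    def __init__(self, feat_dim: int, num_activities: int, desc_dim: int = 256):
        super().__init__()
        # Independent descriptor extractor and scorer for each activity.
        self.descriptors = nn.ModuleList(
            nn.Linear(feat_dim, desc_dim) for _ in range(num_activities)
        )
        self.scorers = nn.ModuleList(
            nn.Linear(desc_dim, 1) for _ in range(num_activities)
        )
        # Learned activity-correlation matrix, initialized to identity.
        self.correlation = nn.Parameter(torch.eye(num_activities))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim) shared backbone features
        logits = torch.cat(
            [s(d(feats)) for d, s in zip(self.descriptors, self.scorers)], dim=1
        )                                    # (B, num_activities)
        logits = logits @ self.correlation   # propagate label correlations
        return torch.sigmoid(logits)         # per-activity probabilities

# Usage: five activities, 1024-dim backbone features.
head = MultiLabelHead(feat_dim=1024, num_activities=5)
probs = head(torch.randn(4, 1024))  # (4, 5)
```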
The Transformer was first proposed for NLP tasks and has recently been adopted for computer vision tasks because of its ability to model long-range feature dependencies. We finally present the Video Transformer (VidTr), which extends ViT, a pure transformer-based image classification network, to video classification with spatio-temporal modeling from raw pixels (without convolutions). We also apply separable attention, which reduces memory cost by 3.3× while maintaining the same performance. To further compact the model, we propose standard-deviation-based topK pooling attention, which reduces computation by dropping non-informative features. Using the proposed VidTr as the backbone further improved our trauma activity recognition system by 3% mAP, owing to better modeling of long-term spatio-temporal features. To show the generalizability of VidTr, we also show that it achieves state-of-the-art performance on five commonly used datasets with lower computational requirements, demonstrating both the efficiency and effectiveness of our design.
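The PyTorch sketch below illustrates the two ideas in simplified form: separable attention that attends over time and space independently (memory on the order of T² + S² rather than (T·S)²), and a standard-deviation-based topK pooling that keeps the k tokens with the highest feature variance. Both are simplified readings of the described mechanisms, not the exact VidTr implementation.

```python
import torch
import torch.nn as nn

class SeparableAttention(nn.Module):
    """Illustrative separable attention: attend over the temporal axis,
    then the spatial axis, instead of over all T*S tokens jointly."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, S, C) -- T frames, S spatial patches, C channels
        B, T, S, C = x.shape
        t = x.permute(0, 2, 1, 3).reshape(B * S, T, C)  # group by spatial location
        t, _ = self.temporal(t, t, t)                   # attention over time
        x = t.reshape(B, S, T, C).permute(0, 2, 1, 3)
        s = x.reshape(B * T, S, C)                      # group by frame
        s, _ = self.spatial(s, s, s)                    # attention over space
        return s.reshape(B, T, S, C)

def std_topk_pool(tokens: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k tokens whose features vary most; near-constant tokens
    are treated as non-informative and dropped."""
    scores = tokens.std(dim=-1)                         # (B, N) per-token std
    idx = scores.topk(k, dim=1).indices                 # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)                        # (B, k, C)

# Usage: separable attention on 8 frames x 49 patches, then halve the
# temporal length by std-based topK pooling over frame tokens.
attn = SeparableAttention(dim=256, heads=4)
x = attn(torch.randn(2, 8, 49, 256))
frames = x.mean(dim=2)                                  # (B, T, C) frame tokens
pooled = std_topk_pool(frames, k=4)                     # (B, 4, C)
```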