TY - JOUR
TI - Process progress estimation and activity recognition
DO - https://doi.org/10.7282/T3CF9TJH
PY - 2018
AB - Activity recognition is fundamentally necessary in many real-world applications, making it a valuable research topic. For example, activity tracking and decision support are crucial in medical settings, and activity recognition and prediction are critical in smart home applications. In this work, we focus on activity recognition strategies and their applications to real-world problems. Depending on the application scenario, activities can be hierarchically categorized into high-level and low-level activities. A high-level activity may contain one or more low-level activities. For example, if cooking is a high-level activity, it may contain several low-level activities such as preparing, chopping, stirring, etc.

Although studied for decades, several challenges remain for high-level activity recognition, also known as process phase detection. A high-level activity usually has a long duration and consists of several low-level activities. Treating high-level activity recognition as a per-time-instance classification problem overlooks the associations between activities over time. We thus proposed treating high-level activity recognition as a regression problem. Based on this assumption, we implemented a deep learning framework that extracts features from input data, and designed a rectified tanh activation function to generate a continuous regression curve between 0 and 1. We used the regression result to represent the overall completeness of the event process.

Because different instances of the same event type often follow similar high-level activity processes, we then used a Gaussian mixture model (GMM) that takes the estimated overall completeness as input to supplement high-level activity recognition. Since the GMM requires that no high-level activity be duplicated within an event (each single activity has to follow a Gaussian distribution), it might not fully represent real-world scenarios. To overcome this limitation, we further proposed replacing the GMM with LSTM layers for high-level activity prediction.
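A minimal sketch of how such a rectified tanh completeness regressor might look (the feature extractor and layer sizes here are hypothetical stand-ins, not taken from the thesis):

    # Completeness regression with a "rectified tanh" output activation.
    # The small feature extractor is an illustrative stand-in for the
    # thesis's deep feature extraction framework.
    import torch
    import torch.nn as nn

    def rectified_tanh(x):
        # tanh clamped at zero: negative pre-activations map to 0,
        # positive ones approach 1, giving a completeness score in [0, 1).
        return torch.clamp(torch.tanh(x), min=0.0)

    class CompletenessRegressor(nn.Module):
        def __init__(self, in_dim=128):
            super().__init__()
            self.features = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
            self.head = nn.Linear(64, 1)

        def forward(self, x):
            # Returns the estimated overall completeness of the process.
            return rectified_tanh(self.head(self.features(x)))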
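One plausible reading of the GMM step, for illustration only: fit a one-dimensional Gaussian per high-level activity over the completeness values observed during that activity, then report the maximum-likelihood activity for a new completeness estimate. The function names and fitting procedure below are assumptions, not the thesis's exact method:

    # Per-activity Gaussians over estimated completeness. Assumes each
    # high-level activity occurs once per event, so its completeness
    # values cluster around a single mean -- the limitation noted above.
    import numpy as np

    def fit_phase_gaussians(completeness, phase_labels):
        # completeness: 1-D array of estimated completeness values;
        # phase_labels: the high-level activity active at each value.
        params = {}
        for phase in np.unique(phase_labels):
            vals = completeness[phase_labels == phase]
            params[phase] = (vals.mean(), vals.std() + 1e-6)
        return params

    def predict_phase(c, params):
        # Maximum-likelihood activity for a completeness estimate c.
        def log_lik(mu, sigma):
            return -0.5 * ((c - mu) / sigma) ** 2 - np.log(sigma)
        return max(params, key=lambda p: log_lik(*params[p]))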
We applied our system to four real-world sports and medical datasets, achieving state-of-the-art performance. The system is now deployed in a trauma room at Children's National Medical Center, estimating in real time the overall completeness of each trauma resuscitation, its current high-level activity, and the remaining time until the resuscitation is complete.

Compared to high-level activities, low-level activities are more challenging to recognize, because low-level activity recognition requires detailed, noise-free sensor data that is difficult to obtain in real-world scenarios. Many manually crafted features were proposed to combat the data noise, but these features were often not generalizable and their selection was frequently arbitrary. We are the first to propose deep learning with passive RFID data for activity recognition. The automatic feature extraction requires no manual input, making our system transferable and generalizable. We further proposed the RSS-map representation of RFID data, which works well with ConvNet structures by capturing both spatial and temporal associations. Because of the limitations of passive RFID, we extended our system from a single sensor to a sensor network.

We studied activity recognition with multiple sensor types, including RGB-D cameras, a microphone array, and passive RFID sensors, following previously successful activity recognition research for each sensor type. To build a system that makes final decisions based on features extracted from all sensors, we developed a modified slow fusion strategy instead of traditional voting: a deep multimodal neural network with multiple feature extraction sub-networks for the different input modalities that feed into a single activity prediction network.

The multimodal structure increases overall activity recognition accuracy, but one key problem remains: the features extracted from different sensors contain both useful and misleading information, and the system simply takes all of them because it does not know which features to rely on. To address this issue, we proposed a network that automatically generates "masks" highlighting the important features for video-based activity recognition. Unlike many "attention"-based deep learning frameworks, we used a conditional generative adversarial network (cGAN) for mask generation, because the cGAN gives us additional control over the generated masks, whereas regular attention networks give no control over the generated attention map. Our experimental results demonstrate that, given manually generated activity-performer masks as ground truth, the cGAN is able to generate masks that highlight only the activity performer. The activity recognition network with our proposed mask generator achieved performance comparable to other online systems on the published dataset.

Although proven applicable, training the cGAN requires a large number of manually generated masks as ground truth, which are not often available in real-world applications. Building on the idea of the cGAN mask generator, we proposed a multimodal deep learning framework with attention that works with multi-sensory input, using feature attention and modality attention for feature extraction and fusion. The network can be fine-tuned by our asynchronous fine-tuning strategy using deep Q-learning. Our experimental results demonstrate that this attention network with deep reinforcement learning based fine-tuning outperforms previous research, and the proposed fine-tuning also prevents over-fitting when training a deep network on small datasets.
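For illustration, a sketch of the RSS-map idea described earlier: arrange received signal strength (RSS) readings from N passive tags over a T-step window into an N x T grid, so a ConvNet can exploit spatial (across tags) and temporal (across time) associations together. Tag count, window length, and layer sizes are invented for this sketch:

    # A small ConvNet over an RSS map (tags x time). All dimensions are
    # hypothetical; the thesis's actual network is not specified here.
    import torch
    import torch.nn as nn

    N_TAGS, WINDOW = 20, 30  # illustrative values

    class RSSMapNet(nn.Module):
        def __init__(self, n_classes=10):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(32, n_classes)

        def forward(self, rss_map):  # rss_map: (batch, 1, N_TAGS, WINDOW)
            return self.fc(self.conv(rss_map).flatten(1))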
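Similarly, the multimodal layout described above (per-modality feature extraction sub-networks feeding one activity prediction network) might be sketched as follows; the modality names and dimensions are assumptions:

    # One feature-extraction branch per modality, fused into a single
    # activity prediction head. Branch depths are illustrative.
    import torch
    import torch.nn as nn

    class MultimodalActivityNet(nn.Module):
        def __init__(self, dims, n_classes=10):
            # dims, e.g. {'video': 512, 'audio': 128, 'rfid': 64}
            super().__init__()
            self.branches = nn.ModuleDict({
                m: nn.Sequential(nn.Linear(d, 128), nn.ReLU())
                for m, d in dims.items()
            })
            self.classifier = nn.Sequential(
                nn.Linear(128 * len(dims), 256), nn.ReLU(),
                nn.Linear(256, n_classes),
            )

        def forward(self, inputs):  # inputs: dict of per-modality tensors
            feats = [b(inputs[m]) for m, b in self.branches.items()]
            return self.classifier(torch.cat(feats, dim=1))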
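And one plausible form of the feature and modality attention: sigmoid gates over individual feature dimensions within each modality, plus softmax weights across whole modalities before fusion. This is one reading of the description above, not the thesis's exact design:

    # Feature attention (per-dimension gates) and modality attention
    # (softmax weights over modalities). Layer sizes are illustrative.
    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        def __init__(self, n_modalities, feat_dim=128):
            super().__init__()
            self.feature_attn = nn.ModuleList(
                nn.Linear(feat_dim, feat_dim) for _ in range(n_modalities))
            self.modality_attn = nn.Linear(n_modalities * feat_dim,
                                           n_modalities)

        def forward(self, feats):  # feats: list of (batch, feat_dim)
            gated = [torch.sigmoid(a(f)) * f
                     for a, f in zip(self.feature_attn, feats)]
            w = torch.softmax(self.modality_attn(torch.cat(gated, 1)), dim=1)
            # Weighted sum of modality features; feeds the classifier.
            return sum(w[:, i:i + 1] * g for i, g in enumerate(gated))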
Finally, we introduce our ongoing work on concurrent activity recognition and our future work. Concurrent activity performance is common in the real world: a person can drink while watching TV, and a medical team can perform multiple tasks simultaneously through different medical personnel. Recognizing concurrent activities remains an open research topic, however, because it is neither a simple multi-class nor a binary classification problem. We proposed a shared feature extractor to extract features from the different input modalities, treated concurrent activity recognition as a coding problem, and trained a deep auto-encoder to generate a binary code denoting each activity's relevant and irrelevant features. However, this network was hard to train and slow to converge because the shared features contain both activity-relevant and activity-irrelevant information, and the recognition network easily over-fit to the unrelated features as opposed to the activity itself.

Because the ground-truth labels only indicate whether the recognized activity is correct or incorrect, they disregard the associations between recognition results and the feature space. To address this issue, we further proposed modifying the reinforcement learning based plugin that was used successfully in our attention tuning to provide additional information for concurrent activity recognition. We first asked humans to provide feedback on whether the system made its decision based on the correct, associated features, and then only partially tuned the network weights based on that feedback.
KW - Electrical and Computer Engineering
LA - eng
ER -