In the era of the information explosion, data from many different sources are growing rapidly, and analysis of those data is in great demand. Media such as news and Twitter are popular sources. Two aspects of such data are of particular interest. One is discovering the chronological patterns behind text data, which has applications in decision making and future planning. The other is detecting abnormalities: as data volumes grow, automatic and simultaneous detection of anomalies is essential for network security and even military surveillance to prevent attacks. This thesis develops approaches to these two problems.
Part I focuses on discovering the dynamics of texts over time using state space models and sequential Monte Carlo methods. Specifically, we analyze how the distribution of topics (latent variables that summarize documents) within each document evolves over time. Inspired by the Latent Dirichlet Allocation model, we build autoregressive state space models to describe the dynamic structure. Simulations and a real data example are presented to demonstrate the new model setup and inference process.
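As a rough illustration of the kind of inference Part I relies on, the following is a minimal sketch of a bootstrap particle filter (one basic sequential Monte Carlo method) applied to a scalar AR(1) state space model with Gaussian noise. It is not the topic model developed in the thesis: the toy model, the function name `bootstrap_particle_filter`, and all parameter values (`phi`, `sigma_state`, `sigma_obs`) are assumptions chosen only for illustration.

```python
import numpy as np


def bootstrap_particle_filter(y, phi, sigma_state, sigma_obs,
                              n_particles=1000, seed=0):
    """Filtered means E[x_t | y_1..y_t] for the toy model (assumed here):
        x_t = phi * x_{t-1} + N(0, sigma_state^2),   |phi| < 1
        y_t = x_t + N(0, sigma_obs^2)
    """
    rng = np.random.default_rng(seed)
    # Start from the stationary distribution of the AR(1) state.
    particles = rng.normal(0.0, sigma_state / np.sqrt(1.0 - phi ** 2),
                           size=n_particles)
    means = np.empty(len(y))
    for t, obs in enumerate(y):
        # Propagate every particle through the state equation.
        particles = phi * particles + rng.normal(0.0, sigma_state,
                                                 size=n_particles)
        # Weight by the Gaussian observation likelihood (in log space).
        log_w = -0.5 * ((obs - particles) / sigma_obs) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        means[t] = np.sum(w * particles)
        # Multinomial resampling to avoid weight degeneracy.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return means


# Simulate a short series from the same toy model (illustrative values).
T, phi, sigma_state, sigma_obs = 200, 0.9, 0.5, 1.0
sim_rng = np.random.default_rng(1)
x = np.empty(T)
x_prev = 0.0
for t in range(T):
    x[t] = phi * x_prev + sim_rng.normal(0.0, sigma_state)
    x_prev = x[t]
y = x + sim_rng.normal(0.0, sigma_obs, size=T)

filtered = bootstrap_particle_filter(y, phi, sigma_state, sigma_obs)
rmse_filtered = np.sqrt(np.mean((filtered - x) ** 2))
rmse_raw = np.sqrt(np.mean((y - x) ** 2))
print(f"RMSE of filtered means: {rmse_filtered:.3f}")
print(f"RMSE of raw observations: {rmse_raw:.3f}")
```

In this toy setting the filtered means should track the hidden state more closely than the raw observations do. A topic model would replace the scalar state with a vector of topic proportions and the Gaussian observation equation with a word-count likelihood.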
In Part II, we propose a non-parametric framework for detecting multiple outliers. Conformal analysis, a recently developed tool, can determine precise levels of confidence in new predictions. Building on conformal analysis, our framework accommodates various data formats and models without assuming that the data distribution is known. In addition, a multiple testing scheme with a controlled False Discovery Rate (FDR) is established at the same time. Simulation results show that, even under a ‘wrong model’, the outlier detection framework still works with controlled FDR.
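The general idea behind Part II can be sketched as follows, under stated assumptions: conformal p-values are computed by ranking each test point's nonconformity score against scores from a held-out calibration set of inliers, and the Benjamini-Hochberg procedure is then applied to those p-values. Benjamini-Hochberg is a standard FDR procedure swapped in here for illustration; the multiple testing scheme in the thesis may differ. The function names, the distance-from-mean score, and the simulated data are all hypothetical.

```python
import numpy as np


def conformal_p_values(cal_scores, test_scores):
    """Conformal p-value per test point: (1 + #{cal >= test}) / (n + 1).
    Larger scores are taken to mean 'more unusual'."""
    cal = np.asarray(cal_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    counts = (cal[None, :] >= test[:, None]).sum(axis=1)
    return (1 + counts) / (cal.size + 1)


def benjamini_hochberg(p_values, alpha=0.1):
    """Boolean mask of rejected hypotheses under the BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank passing the rule
        reject[order[:k + 1]] = True
    return reject


# Illustrative data (assumed): inliers ~ N(0, 1), outliers ~ N(8, 1).
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=500)          # used to fit the score
calibration = rng.normal(0.0, 1.0, size=1000)   # held-out inliers
test_points = np.concatenate([rng.normal(0.0, 1.0, size=90),
                              rng.normal(8.0, 1.0, size=10)])
is_outlier = np.arange(test_points.size) >= 90

# Deliberately crude nonconformity score: distance from the training mean.
center = train.mean()
p_vals = conformal_p_values(np.abs(calibration - center),
                            np.abs(test_points - center))
rejected = benjamini_hochberg(p_vals, alpha=0.1)
print("flagged:", int(rejected.sum()),
      "| true outliers flagged:", int(rejected[is_outlier].sum()))
```

The score function here makes no attempt to model the data well. The validity of conformal p-values rests on the calibration and test inliers being exchangeable rather than on the score being correctly specified, which is consistent with the abstract's remark about a ‘wrong model’.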