Semantic reverse engineering has become the main approach for exploring and understanding the big picture of binary code in closed-source software packages. However, it still faces two unsolved challenges: (1) recognizing and recovering data structure instances from binary memory images without execution traces; and (2) locating critical algorithm implementations and extracting the high-level semantic meaning of the associated memory addresses and registers. These capabilities have many applications in computer security and forensics, such as vulnerability discovery and sensitive data protection.
In this dissertation, I present new techniques for automatic semantic reverse engineering that address these challenges. First, I present a systematic framework, ReViver, for semantic reverse engineering of data structure instances from live memory without execution traces. Using the data structure instances discovered in live memory, I develop a new domain-specific semantic memory data attack against power grid controllers. Second, I propose a framework, Mismo, that analyzes embedded system binaries to extract semantic information about the control algorithms they implement. Finally, I build BinSec, a vulnerability assessment tool that combines deep learning and dynamic analysis to perform cross-platform binary code similarity detection and identify known vulnerabilities. I demonstrate how these techniques can be integrated to explore semantic information for binary protection and exploitation.
The experimental results are as follows. ReViver achieved 98.1% average accuracy in recovering memory data structure instances without execution traces for real-world applications. Mismo’s accuracy for data discovery was an average of 89.82%, and 84.96% for code and data semantics discovery, respectively. For BinSec, I evaluated 25 existing CVE vulnerability functions on Google Pixel 2 smartphone and Android Things IoT firmware images. The deep learning model identified vulnerabilities with an accuracy of over 93%, and the dynamic analysis placed the correct match among the top 3 ranked outcomes 100% of the time.