Abstract
Pattern mining is one of the best-known concepts in Data Mining. A big problem in pattern mining is that humongous amounts of patterns can be mined even from small datasets. This makes it hard for domain experts to discover knowledge using pattern mining, for example in the field of Bioinformatics. In this thesis we address the pattern explosion using compression. We argue that the best pattern set is that set of patterns that compresses the data best. Based on an analysis from MDL (Minimum Description Length) perspective, we introduce a heuristic algorithm, called Krimp, which finds the best set of patterns. High compression ratios and good classification scores confirm that Krimp selects patterns that are very characteristic for the data. After this, we proceed with a series of well-known problems in Knowledge Discovery, which we each unravel with our compression approach. We propose a database dissimilarity measure and show how compression can be used to characterise differences between databases. We present an algorithm that generates synthetic data that is virtually indiscernible from the original data, but can also be used to preserve privacy. Changes in data streams are detected by using a Krimp compressor to check whether the data distribution has been changed or not. Finally, compression is used to identify the components of a database and to find interesting groups in a database. In each chapter, we provide an extensive experimental evaluation to show that the proposed methods perform well on a large variety of datasets. In the end, we conclude that having less, but more characteristic patterns is key to successful Knowledge Discovery and that compression is very useful in this respect. Not as goal in itself, but as means to an end: compression picks the patterns that matter.
Original language | Undefined/Unknown |
---|---|
Qualification | Doctor of Philosophy |
Awarding Institution |
|
Supervisors/Advisors |
|
Award date | 9 Feb 2010 |
Publisher | |
Print ISBNs | 978-90-393-5235-9 |
Publication status | Published - 9 Feb 2010 |