Abstract
Because data storage is cheap, the question nowadays is no longer whether a company or institution collects data, but how much it collects. Transforming data into information, and gaining insight into that information, is perhaps the most important problem in our data-rich society. Simply collecting data serves no goal; data becomes valuable only when insight can be gained from it.
Data mining is the subfield of computer science concerned with transforming large amounts of data into information in the form of patterns. The idea is that the identified patterns lead to new insights by exposing interesting structure or behaviour in the data. Defining what exactly counts as interesting is one of the key challenges.
One of the main applications of data mining, and the one on which we focus in this thesis, is exploratory data analysis. In this analysis we use summaries and characterisations of a dataset to gain insight: by inspecting and exploring the patterns that make up these models, we can extract important information from the data. In this thesis we employ the Minimum Description Length (MDL) principle to find such models, which we call summaries. That is, we define the best summary as the set of patterns that gives the best compression of the data.
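As an illustration only, the selection criterion can be written in the common two-part form of MDL. The abstract does not state which MDL variant the thesis uses, so the two-part form and the symbols below are assumptions: $D$ is the dataset, $\mathcal{M}$ the set of candidate summaries (pattern sets), $L(M)$ the encoded length of a summary in bits, and $L(D \mid M)$ the encoded length of the data given that summary.

```latex
M^{*} \;=\; \operatorname*{arg\,min}_{M \in \mathcal{M}} \; \bigl( L(M) + L(D \mid M) \bigr)
```

Under this reading, a pattern is added to the summary only if the bits it saves in describing the data outweigh the bits needed to describe the pattern itself.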
Additionally, these summaries can be used for other data mining tasks, such as identifying irregular or abnormal data points. Such deviations from what could be expected are called anomalies. We also focus on anomaly detection in this thesis, where the goal is to gain more insight into the information we already have.
Finally, we conclude that the MDL principle can be successfully employed in the domain of multivariate sequential data. For both summarisation and anomaly detection, successful algorithms are introduced and tested on a variety of synthetic and real-world datasets.
Original language | English |
---|---|
Awarding Institution | |
Supervisors/Advisors | |
Award date | 17 May 2017 |
Publisher | |
Print ISBNs | 978-90-393-6721-6 |
Publication status | Published - 17 May 2017 |
Bibliographical note
SIKS Dissertation Series ; 2017-07
Keywords
- Sequence Mining
- MDL
- Multivariate Event Sequences
- Summarisation
- Anomaly Detection