TY - GEN
T1 - An MDL-based Frequent Itemset Hierarchical Clustering Technique to Improve Query Search Results of An Individual Search Engine
AU - Puspitaningrum, D.
AU - Fauzi
AU - Susilo, B.
AU - Pagua, J. A.
AU - Erlansari, A.
AU - Andreswari, D.
AU - Efendi, R.
AU - Prasetya, I.S.W.B.
PY - 2015
Y1 - 2015
N2 - In this research we propose a frequent itemset hierarchical clustering (FIHC) technique that uses an MDL-based algorithm, namely KRIMP. Unlike the original FIHC technique, the proposed method defines clustering as a rank sequence problem over the top-3 ranked list of each itemsets-of-keywords cluster in the web documents returned by a search engine for a given query. The key idea of an MDL compression-based approach is the code table: only frequent and representative keywords, such as those in a KRIMP code table, can be used as candidates, instead of all important keywords from a keyword extractor such as RAKE. To simulate real-world information needs, the web documents originate from the search results of a multi-domain query. Starting in a meta-search engine environment to gather many relevant documents, we set k=50, 100, 200 for the k-toplist of documents retrieved from each search engine to build a dataset for automatic relevance judgement. We then apply the MDL-based FIHC algorithm to the best individual search engine, with k=50, 100, 200 for the k-toplist of documents retrieved from each search engine, minimum support=5 for KRIMP itemset compression, and minimum cluster support=0.1 for FIHC clustering. Our results show that MDL-based FIHC clustering can significantly improve the relevance scores of web search results on an individual search engine (by up to 39.2% at precision P10, k-toplist=50).
M3 - Conference contribution
BT - Asia Information Retrieval Societies Conference (AIRS)
PB - Springer
ER -