Abstract
We compare three known semantic web page segmentation
algorithms, each serving as an example of a particular approach to the
problem, and one self-developed algorithm, WebTerrain, that combines
two of the approaches. We compare the performance of the four algorithms
on a large benchmark of modern websites that we have constructed,
examining each algorithm in a total of eight configurations. We found
that, on average, all algorithms performed better on random pages than
on popular pages, and that results were better when running the algorithms
on the HTML obtained from the DOM rather than on the plain HTML.
Overall, there is much room for improvement: the best average
F-score we found is 0.49, indicating that currently available
algorithms are not yet of practical use for modern websites.
Original language | English |
---|---|
Title of host publication | Proceedings of ICWE 2015 |
Publisher | Springer |
Pages | 374-391 |
Volume | 9114 |
Publication status | Published - 2015 |
Publication series
Name | LNCS |
---|---|
Publisher | Springer |