A Machine Learning Based Approach for Table Detection
on The Web*
Yalin Wang and Jianying Hu
Abstract
Tables are a commonly used presentation scheme, especially for describing
relational information. However, table understanding remains
an open problem in both the document image analysis and information retrieval
fields. In this paper, we consider the problem of table detection in web
documents. Its potential applications include web mining, knowledge management,
and web content summarization and delivery to narrow-bandwidth devices.
We describe a machine learning based approach
to classify each given table entity as either genuine or non-genuine.
Various features reflecting the layout as well as content characteristics of
tables are studied.
In order to facilitate the training and evaluation of our table classifier,
we designed a novel web document table ground truthing protocol and used it to
build a large table ground truth database. The database consists of 1,393 HTML
files collected from hundreds of different web sites and contains 11,477
leaf <TABLE> elements, out of which 1,740 are genuine tables.
Experiments were conducted using cross validation, and an F-measure of
95.88% was achieved.
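As a rough illustration of how such a score can be computed for a genuine/non-genuine table classifier, the sketch below derives precision, recall, and F-measure from classification counts. It assumes the common harmonic-mean definition F = 2PR / (P + R); the abstract does not state which definition was used, so the paper's own formula may differ. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Return (precision, recall, F) for the 'genuine table' class.

    Assumption: F is the harmonic mean of precision and recall.
    The abstract does not specify the formula actually used.
    """
    predicted = true_pos + false_pos   # tables labeled genuine by the classifier
    actual = true_pos + false_neg      # tables that are genuine in the ground truth
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Hypothetical counts, for illustration only (not results from the paper).
precision, recall, f = f_measure(true_pos=90, false_pos=10, false_neg=5)
print(f"precision={precision:.4f} recall={recall:.4f} F={f:.4f}")
```

In a cross-validation setting, one option is to sum the counts over all held-out folds before calling the function, so that every table in the database contributes exactly once to the final score.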
Related Publications
- Y. Wang and J. Hu, "Automatic Table Detection in HTML Documents", Web Document Analysis: Challenges and Opportunities, A. Antonacopoulos and J. Hu (Eds.), World Scientific, pp. 135-154.
- Y. Wang and J. Hu, "Detecting Tables in HTML Documents", D. Lopresti, J. Hu, and R. Kashi (Eds.), Document Analysis Systems V, 5th International Workshop, DAS 2002, Princeton, NJ, USA, Aug. 2002, Proceedings, pp. 249-260. This paper won the best student paper award.
- Y. Wang and J. Hu, "A Machine Learning Based Approach for Table Detection on The Web", The Eleventh International World Wide Web Conference, WWW2002, pp. 242-250, Hawaii, USA, May 2002.
*The web table ground truth database can be downloaded here.