TY - GEN
T1 - Information extraction from web pages using semi-structured data alignment
AU - Kuboyama, Tetsuji
AU - Miyahara, Tetsuhiro
AU - Hirokawa, Sachio
AU - Itou, Eisuke
PY - 2005/12/1
Y1 - 2005/12/1
N2 - Information extraction from semistructured data such as HTML documents is gaining importance with the continuing growth of data stored on the Web. This paper proposes a structure-based method for extracting Web contents and their metadata from a set of HTML documents generated from a common template, such as university syllabus and staff pages. These HTML documents contain a number of HTML syntax errors, as well as redundant or missing fragments introduced by manual editing. The method first finds a canonical HTML document that complies with the common template. Next, an approximate matching algorithm identifies the correspondences between the data in the canonical document and the data in the other documents, and the data are aligned according to these correspondences. Experiments were conducted to extract attribute names for metadata construction and to align data records from university syllabus Web pages.
AB - Information extraction from semistructured data such as HTML documents is gaining importance with the continuing growth of data stored on the Web. This paper proposes a structure-based method for extracting Web contents and their metadata from a set of HTML documents generated from a common template, such as university syllabus and staff pages. These HTML documents contain a number of HTML syntax errors, as well as redundant or missing fragments introduced by manual editing. The method first finds a canonical HTML document that complies with the common template. Next, an approximate matching algorithm identifies the correspondences between the data in the canonical document and the data in the other documents, and the data are aligned according to these correspondences. Experiments were conducted to extract attribute names for metadata construction and to align data records from university syllabus Web pages.
UR - http://www.scopus.com/inward/record.url?scp=84867368995&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84867368995&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84867368995
SN - 9806560531
SN - 9789806560536
T3 - WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings
SP - 42
EP - 47
BT - WMSCI 2005 - The 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Proceedings
T2 - 9th World Multi-Conference on Systemics, Cybernetics and Informatics, WMSCI 2005
Y2 - 10 July 2005 through 13 July 2005
ER -