TY - GEN
T1 - Testbed for information extraction from deep web
AU - Yamada, Yasuhiro
AU - Craswell, Nick
AU - Nakatoh, Tetsuya
AU - Hirokawa, Sachio
N1 - Copyright: Copyright 2017 Elsevier B.V., All rights reserved.
PY - 2004/5/19
Y1 - 2004/5/19
N2 - Search results generated by searchable databases are served dynamically and are far larger in volume than the static documents on the Web. These results pages have been referred to as the Deep Web [1]. To integrate results from different searchable databases, we need to extract the target data from their results pages. We propose a testbed for information extraction from search results. We chose 100 databases at random from 114,540 pages with search forms, so the selection covers a good variety of databases. We selected 51 databases whose results pages include URLs and manually identified the target information to be extracted. We also suggest evaluation measures for comparing extraction methods, and methods for extending the target data.
AB - Search results generated by searchable databases are served dynamically and are far larger in volume than the static documents on the Web. These results pages have been referred to as the Deep Web [1]. To integrate results from different searchable databases, we need to extract the target data from their results pages. We propose a testbed for information extraction from search results. We chose 100 databases at random from 114,540 pages with search forms, so the selection covers a good variety of databases. We selected 51 databases whose results pages include URLs and manually identified the target information to be extracted. We also suggest evaluation measures for comparing extraction methods, and methods for extending the target data.
UR - http://www.scopus.com/inward/record.url?scp=84880089492&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84880089492&partnerID=8YFLogxK
U2 - 10.1145/1013367.1013468
DO - 10.1145/1013367.1013468
M3 - Conference contribution
AN - SCOPUS:84880089492
T3 - Proceedings of the 13th International World Wide Web Conference on Alternate Track, Papers and Posters, WWW Alt. 2004
SP - 346
EP - 347
BT - Proceedings of the 13th International World Wide Web Conference on Alternate Track, Papers and Posters, WWW Alt. 2004
PB - Association for Computing Machinery, Inc
T2 - 13th International World Wide Web Conference on Alternate Track, Papers and Posters, WWW Alt. 2004
Y2 - 19 May 2004 through 21 May 2004
ER -