Please use this identifier to cite or link to this item: https://hdl.handle.net/10216/5255
Author(s): Sérgio Nunes
Cristina Ribeiro
Gabriel David
Title: Using neighbors to date web documents
Issue Date: 2007
Abstract: Time has been successfully used as a feature in web information retrieval tasks. In this context, estimating a document's inception date or last update date is a necessary task. Classic approaches have used HTTP header fields to estimate a document's last update time. The main problem with this approach is that it is applicable to a small part of web documents. In this work, we evaluate an alternative strategy based on a document's neighborhood. Using a random sample containing 10,000 URLs from the Yahoo! Directory, we study each document's links and media assets to determine its age. If we only consider isolated documents, we are able to date 52% of them. Including the document's neighborhood, we are able to estimate the date of more than 85% of the same sample. Also, we find that estimates differ significantly according to the type of neighbors used. The most reliable estimates are based on the document's media assets, while the worst estimates are based on incoming links. These results are experimentally evaluated with a real world application using different datasets.
Subject: Information technology
URI: https://repositorio-aberto.up.pt/handle/10216/5255
Source: WIDM '07: Proceedings of the 9th annual ACM international workshop on Web information and data management
Document Type: Article in International Conference Proceedings
Rights: restrictedAccess
License: https://creativecommons.org/licenses/by-nc/4.0/
Appears in Collections: FEUP - Article in International Conference Proceedings

Files in This Item:
File: 54513.pdf (Restricted Access)
Size: 5.24 MB
Format: Adobe PDF


This item is licensed under a Creative Commons License.