Web Archiving
This book will focus on the Web as a publishing medium, which was its original purpose and which has made it the largest information repository ever. The volume of public information available on the Internet, mainly on the Web, now exceeds that distributed on any other medium. Strategic business information and communication are now also usually accessed and operated through internet/extranet/intranet services.
The raw nature of Web content, the unpredictable remote changes that can affect it, and the growth of database-driven Web sites make the usual archiving and back-up procedures inefficient for preserving Web material. Specific archiving procedures to preserve this content for the mid and long term are hardly ever in place in most organizations.
The organization of this book follows three stages: building, using, and preserving. Each chapter is intended to provide, first, a detailed presentation of existing methods and available technology, often inspired by other domains but adapted to the specific topic of Web archiving.
Introduction
This introduction will provide a brief historical background on archiving activity, its aims, and its organization for the various media that existed before the Web appeared. A systematic survey of how the Web differs from these media will then lay the groundwork for analyzing the main issues raised by Web archiving. These issues will in turn introduce the parts of the book. This introductory chapter will also present an overview of existing Web archives.
Part I - Building Web archives
This first part will address the process of building Web archives from both the technical and the selection points of view. The main method, harvesting, will be detailed at the site level as well as at the scale of large domains (crawling), with emphasis on what makes harvesting for archiving different from traditional crawling for search engines. The archiving of deep Web sites will also be addressed, with a description of the technical ingest process that these complex sites require. Finally, a chapter on automatic selection of Web material will address the critical issue, given the huge amount of information available on the Web, of content selection based on automatic tools.
Part II - Using Web Archives
This part is dedicated to the tools necessary for using Web archives, and will provide a technical view of how they operate and how they are built.
An overview of traditional as well as research uses of the Web opens this part, providing the background needed to understand which needs Web archives have to fulfill. Among the tools needed to provide access to these collections, some are ultimately intended to allow manual browsing of selected sets of items, much like traditional information retrieval tools, though adapted to Web material. Others are built to extract and automatically process large sets of information. Both kinds will be addressed in this part.
Part III - Preserving Web Collections
Preserving Web material is highly challenging given the wide variety of formats found on the Web and the complex architecture of today's Web sites. It is also challenging in the sense that the commitments of traditional actors in this domain (libraries, archives, museums, etc.) are called into question by the nature of the Web itself as a new publishing medium.
The first chapter will analyze the technical dependencies for accessing and rendering this material over time. It will also cover storage issues and the various approaches (migration, emulation) that can be applied to these collections, with practical examples of migration and of the technical metadata used to document formats and rendering tools.
The second chapter will address the various collaboration models that can be applied and their technical implications (distributed crawl, distributed storage, and cross-access).