The computer revolution has produced a society that feeds on information. Yet much of the information is in its raw form: data. There is no shortage of this raw material. It is created in vast quantities by financial transactions, legal proceedings, and government activities; reproduced in an overwhelming flood of reports, magazines, and newspapers; and dumped wholesale into filing cabinets, libraries, and computers. The challenge is to manage the stuff efficiently and effectively, so that pertinent items can be located and information extracted without undue expense or inconvenience.
The traditional method of storing documents on paper is expensive in terms of both storage space and, more importantly, the time it takes to locate and retrieve information when it is required. It is becoming ever more attractive to store and access documents electronically. The text in a stack of books hundreds of feet high can be held on just one computer disk, which makes electronic media astonishingly efficient in terms of physical space. In addition, the information can be accessed using keywords drawn from the text itself. Compared with manual document-indexing schemes, this approach provides both flexibility (all words are keywords) and reliability (because indexing is accomplished without any human interpretation or intervention). Moreover, organizations nowadays have to cope with diverse sources of electronic information such as machine-readable text, fax and other scanned documents, and digitized graphics. All these can be stored and accessed efficiently using electronic media rather than paper.
This book discusses how to manage large numbers of documents -- gigabytes of data. A gigabyte is approximately one thousand million bytes, enough to store the text of a thousand books, a stack about the size of an office wall packed floor to ceiling. The term has gained currency only recently, as the capacity of mass storage devices has grown. Just two decades ago, requirements measured in megabytes (one million bytes) seemed extravagant, even fanciful. Now personal computers come with gigabytes of storage, and it is commonplace for even small organizations to store many gigabytes of data. Since the first edition of this book, the explosion of the World Wide Web has made terabytes (one trillion bytes) of data available to the public, making even more people aware of the problems involved in handling this quantity of data.
There are two challenges in managing such huge volumes of data, both of which are addressed in this book. The first is storing the data efficiently; this is done by compressing it. The second is providing fast access through keyword searches; for this, a tailor-made electronic index must be constructed. Traditional methods of compression and searching need to be adapted to meet these requirements. The end result of applying the techniques described here is a computer system that can store millions of documents and retrieve those that contain any given combination of keywords in a matter of seconds, or even in a fraction of a second.
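To make these two challenges concrete, here is a minimal sketch in Python, assuming a toy in-memory corpus; the document names, contents, and query words are illustrative only. The documents are compressed for storage, and an inverted index maps each word to the set of documents containing it, so a query for a combination of keywords becomes a set intersection.

    import zlib
    from collections import defaultdict

    docs = {
        "doc1": "managing gigabytes of text with compression",
        "doc2": "indexing documents for fast keyword search",
        "doc3": "compression and indexing of documents and images",
    }

    # Challenge 1: store the data efficiently by compressing it.
    stored = {name: zlib.compress(text.encode()) for name, text in docs.items()}

    # Challenge 2: build an inverted index for fast keyword access.
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.split():
            index[word].add(name)

    def search(*words):
        """Return the documents that contain every query word."""
        postings = [index[word] for word in words]
        return set.intersection(*postings) if postings else set()

    print(search("compression", "indexing"))         # {'doc3'}
    print(zlib.decompress(stored["doc3"]).decode())  # recover the original text

A real system must, of course, keep the index itself compressed and on disk rather than in a main-memory dictionary; providing exactly that is what the techniques in this book are about.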
Here is an example to illustrate the power of the methods described in this book. With them, you can create a database from a few gigabytes of text and use it to answer a query like "retrieve all documents that include paragraphs containing the two words 'managing' and 'gigabytes'" in just a few seconds on an office workstation. In truth, given an appropriate index to the text, this is not such a remarkable feat. What is impressive, though, is that the database that needs to be created, which includes the index and the complete text (both compressed, of course), is less than half the size of the original text alone. In addition, the time it takes to build this database on a workstation of moderate size is just a few hours. And perhaps most amazing of all, the time required to answer the query is less than if the database had not been compressed, because compression means there is less data to read from disk.
Many of the techniques described in this book have been invented and tested recently and are only now being put into practice. Ways to index the text for rapid search and retrieval are thoroughly examined; this material forms the core of the book. Topics covered include text compression and modeling, methods for the compression of images, and page layout recognition to separate pictures and diagrams from text.
Full-text indexes are inevitably very large and therefore potentially expensive. However, this book shows how a complete index to every word (and, if desired, every number) in the text can be provided with minimal storage overhead and extremely rapid access.
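To illustrate why the overhead can be so small, here is a sketch of one family of techniques examined in this book: the posting list for a word (a sorted list of the numbers of the documents containing it) is stored as the gaps between successive entries, and each gap is written with a variable-length code, in this case the Elias gamma code, so that the small gaps typical of frequent words take only a few bits each. The posting list below is made up for illustration.

    def gamma_encode(n):
        """Elias gamma code for a positive integer n:
        (b - 1) zeros followed by the b-bit binary form of n."""
        binary = bin(n)[2:]                 # e.g. 9 -> '1001'
        return "0" * (len(binary) - 1) + binary

    def encode_postings(doc_numbers):
        """Encode a sorted posting list as gamma-coded gaps."""
        bits, previous = [], 0
        for d in doc_numbers:
            bits.append(gamma_encode(d - previous))  # store the gap, not the number
            previous = d
        return "".join(bits)

    postings = [3, 5, 20, 21, 23, 76]   # documents containing some word
    code = encode_postings(postings)
    print(len(code), "bits")            # 28 bits, against 192 for six 32-bit integers

Decoding simply reverses the process, reading each gamma code and accumulating the gaps back into document numbers; because posting lists are read sequentially at query time, this decoding cost is negligible.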
The objective of this book is to introduce a new generation of techniques for managing large collections of documents and images. After reading it, you will understand what these techniques are and appreciate their strengths and applicability.