1) David Hawking, Web Search Engines: Part 1 and Part 2, IEEE Computer, June 2006. http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2006.213 and http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2006.286
Part 1
GYM search engines - Google, Yahoo!, and Microsoft
"Currently, the amount of Web data that search engines crawl and index is on the order of 400 terabytes, placing heavy loads on server and network infrastructure. Allowing for overheads, a full crawl would saturate a 10-Gbps network link for more than 10 days" (86)
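The figures in this quote can be checked with a rough back-of-the-envelope calculation. The sketch below assumes decimal units (1 terabyte = 10^12 bytes, 10 Gbps = 10^10 bits per second); the function name and the overhead parameter are illustrative and do not come from the article.

```python
SECONDS_PER_DAY = 86_400

def transfer_days(data_bytes, link_bits_per_second, overhead_factor=1.0):
    """Days needed to move data_bytes over a fully saturated link."""
    seconds = data_bytes * 8 * overhead_factor / link_bits_per_second
    return seconds / SECONDS_PER_DAY

# Raw transfer time for 400 TB over a 10-Gbps link, with no overhead at all.
raw_days = transfer_days(400e12, 10e9)

# Overhead multiplier needed to stretch the raw figure to the quoted 10 days.
implied_overhead = 10 / raw_days

print(f"raw transfer: {raw_days:.1f} days")
print(f"implied overhead factor: {implied_overhead:.1f}x")
```

Under these assumptions the raw transfer takes a little under four days, so the "more than 10 days" in the quote implies that overheads multiply the traffic by roughly a factor of three.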
"Engineering a Web-scale crawler is not for the unskilled or fainthearted. Crawlers are highly complex parallel systems, communicating with millions of different Web servers, among which can be found every conceivable failure mode, all manner of deliberate and accidental crawler traps, and every variety of noncompliance with published standards." (88)
Part 2
Indexers create an inverted file in two phases: scanning and inversion
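The two phases can be illustrated with a toy in-memory version. This is only a sketch under simplifying assumptions: terms are whitespace-separated lowercase words, document IDs are list positions, and everything fits in memory. The function names and sample documents are made up for illustration; a production indexer works with on-disk structures at a very different scale.

```python
from itertools import groupby

def scan(docs):
    """Scanning phase: emit one (term, doc_id) posting per distinct term per document."""
    postings = []
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings.append((term, doc_id))
    return postings

def invert(postings):
    """Inversion phase: sort postings by term, then group them into posting lists."""
    ordered = sorted(postings)
    return {
        term: [doc_id for _, doc_id in group]
        for term, group in groupby(ordered, key=lambda posting: posting[0])
    }

docs = ["the onion soup", "the onion newspaper", "gardening soup"]
index = invert(scan(docs))
print(index["onion"])
```

Sorting in the inversion phase is what turns a document-ordered stream of postings into term-ordered posting lists that a query processor can look up directly.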
"Compression. Indexers can reduce demands on disk space and memory by using compression algorithms for key data structures. Compressed data structures mean fewer disk accesses and can lead to faster indexing and faster query processing, despite the CPU cost of compression and decompression." (88)
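The quote does not name a specific compression algorithm. As one commonly used illustration, the sketch below stores a posting list as gaps between sorted document IDs and encodes each gap with variable-byte coding, so small gaps take a single byte. The choice of scheme, the function names, and the sample IDs are assumptions for illustration only.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Replace sorted doc IDs with the first ID followed by successive differences."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte, low bits first; high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # final byte of this number
            numbers.append(current | ((byte & 0x7F) << shift))
            current, shift = 0, 0
        else:
            current |= byte << shift
            shift += 7
    return numbers

doc_ids = [3, 7, 130, 100000]
encoded = vbyte_encode(to_gaps(doc_ids))
decoded = list(accumulate(vbyte_decode(encoded)))
print(len(encoded), "bytes instead of", 4 * len(doc_ids))
```

Decoding costs a few extra CPU operations per posting, which is the trade-off the quote describes: less data read from disk in exchange for some decompression work.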
"The major problem with the simple-query processor is that it returns poor results. In response to the query “the Onion” (seeking the satirical newspaper site), pages about soup and gardening would almost certainly swamp the desired result." (90)
"A high priority for search engine operation is monitoring the search quality to ensure that it does not decrease when a new index is installed or when the search algorithm is modified." (90)
2) Shreeves, S. L., Habing, T. O., Hagedorn, K., & Young, J. A. (2005). Current developments and future trends for the OAI protocol for metadata harvesting. Library Trends, 53(4), 576-589.
OAI-PMH = Open Archives Initiative Protocol for Metadata Harvesting (2001)
data providers or repositories = make metadata available
service providers or harvesters = selectively harvest metadata
Open Language Archives Community = creating a worldwide language resource
Extensible Repository Resource Locators = ERRoLs (allows an OAI repository to stand alone as a web application)
"Controlled vocabularies will become more important..." (587)
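To make the data provider / harvester split in these notes concrete, here is a small sketch of how a harvester might build a selective ListRecords request. The verb and the metadataPrefix, from, until, and set arguments are part of OAI-PMH; the base URL, the set name, and the helper function are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords URL; optional arguments narrow the harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # only records changed on or after this date
    if until_date:
        params["until"] = until_date    # only records changed on or before this date
    if set_spec:
        params["set"] = set_spec        # only records in this repository-defined set
    return f"{base_url}?{urlencode(params)}"

# Hypothetical repository endpoint and set name, for illustration only.
url = list_records_url("https://repository.example.org/oai",
                       from_date="2005-01-01", set_spec="maps")
print(url)
```

The date range and set arguments are what make harvesting selective: the service provider asks only for records that changed in a given window or that belong to a given collection.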
3) Michael K. Bergman, “The Deep Web: Surfacing Hidden Value” http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104
Standard search engines cannot find websites in the "deep web"
"Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web."
"Legitimate criticism has been leveled against search engines for these indiscriminate crawls, mostly because they provide too many results"
"Serious information seekers can no longer avoid the importance or quality of deep Web information. But deep Web information is only a component of total information available. Searching must evolve to encompass the complete Web."