It has taken me several years to grasp just how vast a gulf lies between searching and researching the Internet. Searching the Internet is computer science. We practice our understanding of search technologies and search engines. Researching the Internet is library science. We act upon our understanding of how information is arranged.
These two approaches are very distinct.
I shall attempt to trace a gradual evolution in how we find information using the Internet. I believe we have been moving from Internet searching to Internet research - from computer science to library science. If I am right, this portends perhaps the single most dramatic change to library science in decades: a renaissance of library science and librarianship.
The two or three most effective ways to search the Internet change every year or two. It comes as a bit of a shock to realize, but even the very short history of the Internet has seen a wide range of tools and techniques come and go. Today, there appears to be a consensus that Google is the primary tool for finding Internet information. And yet this same conviction was directed at Yahoo! just two years ago. What has happened?
In the very early days, before the Web arrived, I remember pleading with my Internet service provider to mirror a copy of the many guidebooks that made up the Internet Clearinghouse Project. You may know of this project as its later re-incarnation: the Argus Clearinghouse. In its heyday it was internationally famous. One of its typical text guidebooks, "Not Just Cows," described in detail all of the better Internet resources and active mailing lists for agriculture. When I met this archive, it was racing past 130 guidebooks.
Archie complemented this as a database of all the publicly accessible files found on FTP sites. Actually, Archie was not a complete database but was thought to index well over 95% of all FTP material. This coverage was so complete that it started the tradition that the publisher was responsible for informing a nearby Archie if a new FTP site was launched.
How far we have come today. Most of the guidebooks have grown up or disintegrated in time. Argus has not been updating for several years and is being folded into the Internet Public Library (IPL) directory. Lou Rosenfeld (of Argus) formed his own consulting company [www.lourosenfeld.com] and gives seminars in conjunction with the Nielsen Norman Group. Argus' direct competitor, AlphaSearch, is gone too. Even Archie gave way to Shareware.com, which was then purchased by C|net, then lost all pretence at completeness. But much more was lost. The idea that a single person could organize all the resources in a given topic was one casualty. So was the idea of a search engine that indexed all Internet resources, as Archie did for FTP. The Internet simply outgrew these ideas. In the early days, both were possible and brilliantly executed.
With the arrival of Gophers, Veronica stepped in and became a third vital approach to finding Internet information. Veronica was a quasi-definitive list of all Gopher categories. It never attained the completeness that Archie had for FTP resources and its fame slipped rapidly away once it became apparent that the Web was going to be far more interesting than Gopherspace.
THE WEB ARRIVES
The early search engines, with names like the World Wide Web Worm and Webcrawler, changed this environment significantly. These search engines indexed most of the Web, certainly achieving initially over 50% coverage, then slipping to 30% as the Web grew. These tools were as famous as Google and Yahoo! are today. Everyone used them. And when the Web was young, they sparkled.
Unfortunately, the search algorithms used by early search engines were of the kind used by commercial databases of the day. A search for "Internet Research" returned a list of Web pages ranked by frequency and title. Web pages with "Internet Research" in their titles would lead the list, followed by pages with the words "Internet research" occurring several times in the text. This gave rise to the uninspired marketing maxim that you must place your primary keywords in the title and three or four times in the first paragraph.
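The ranking described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a reconstruction of any particular engine: the function name rank_pages, the page dictionaries, and the example addresses are all hypothetical. Pages with the phrase in the title come first, and the rest are ordered by how often the phrase occurs in the text.

```python
import re

def rank_pages(query, pages):
    """Rank pages for a phrase: title matches first, then by frequency in the body.

    Hypothetical sketch of title-and-frequency ranking; real engines of the
    period differed in their details.
    """
    phrase = query.lower()
    scored = []
    for page in pages:
        in_title = phrase in page["title"].lower()
        # Count non-overlapping occurrences of the phrase in the body text.
        frequency = len(re.findall(re.escape(phrase), page["body"].lower()))
        if in_title or frequency > 0:
            scored.append((in_title, frequency, page["url"]))
    # Title matches lead the list; within each group, higher frequency wins.
    scored.sort(key=lambda item: (not item[0], -item[1]))
    return [url for _, _, url in scored]

# Invented sample pages for illustration only.
pages = [
    {"url": "a.example", "title": "Gardening tips",
     "body": "Internet research is mentioned once."},
    {"url": "b.example", "title": "Internet Research basics",
     "body": "A short page."},
    {"url": "c.example", "title": "Notes",
     "body": "Internet research, internet research, and more Internet research."},
    {"url": "d.example", "title": "Recipes",
     "body": "Nothing relevant here."},
]

print(rank_pages("Internet Research", pages))
```

Repeating the phrase in the title and body is all it takes to climb such a list, which is exactly why the keyword-stuffing maxim followed.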
These early search engines also invited and even expected publishers to inform them of new Web pages. The search engines would dutifully send out their spiders, sometimes immediately. For some reason, though, I don't remember much use of field searching in these early days. Perhaps the early search engines did not permit Title and URL searching, or perhaps we didn't know we needed these tools.
Complementing these early search engines were two simple techniques that set Internet surfing in motion. Initially, we would search for a hotlinks page. A search for "Accounting Hotlinks" would likely unearth a page created by someone who had just finished a scan of accounting resources. If it was a month or two old, it served as a very fine starting point for your efforts to do the same.
About a year later, as Hotlinks stopped being the word du jour, we would visit the "further links" section of an interesting Web site. Publishers were kindly creating these lists more and more, pointing out and linking to comparable sites. This may have been where the habit of surfing arose - you could hop on and gradually move from one Web site, to its further links page, to the next Web site, to its further links page - surfing to the information that peers recognized as useful.
THE AGE OF THE DIRECTORY
The World Wide Web Virtual Library, soon followed by Yahoo!, began to succeed as the guidebooks began to falter. Yahoo! required much less effort to update, and so rapidly delivered a far more extensive list of resources - though sadly listing few of the cherished mailing lists.
Yahoo! really made its move at a time when the early search engines were struggling to make the transition to popularity ranking. There were too many resources out there. The basic search algorithms that had delivered such brilliant results only a year earlier were now increasingly exasperating. They didn't work any more. The best information was often buried deep within a mass of other information.
Essentially, as the Web grew, and search engine databases struggled unsuccessfully to keep pace, the search engine results deteriorated. It did not help that these early search engines defaulted to OR, so that even a simple search for three blind mice would deliver millions of results. Adding the + symbol before each word - making an explicit request for a Boolean AND search - initially tamed this mess, but the trouble was more fundamental. It required a major rethink in how information was ranked to revitalize these search engines.
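The difference between an OR default and an AND default is easy to see in a small sketch. The function name matches, the default_and flag, and the sample documents are assumptions made for illustration; the + prefix follows the convention described above, and real engines handled punctuation, stemming, and phrases in ways not modelled here.

```python
def matches(query, text, default_and=False):
    """Return True if the text satisfies the query.

    Hypothetical sketch: terms prefixed with + are required; unprefixed
    terms are optional (OR) unless default_and is set.
    """
    words = set(text.lower().split())
    terms = query.lower().split()
    required = [t[1:] for t in terms if t.startswith("+") and len(t) > 1]
    optional = [t for t in terms if not t.startswith("+")]
    if default_and:
        # An AND default treats every term as required.
        required += optional
        optional = []
    if any(term not in words for term in required):
        return False
    if required:
        return True
    # With only optional terms, any single match is enough (OR).
    return any(term in words for term in optional)

# Invented sample documents for illustration only.
docs = ["three blind mice", "three little pigs", "blind faith", "country mice"]

or_hits = [d for d in docs if matches("three blind mice", d)]
plus_hits = [d for d in docs if matches("+three +blind +mice", d)]
and_hits = [d for d in docs if matches("three blind mice", d, default_and=True)]

print(len(or_hits), len(plus_hits), len(and_hits))
```

Under the OR default every sample document matches, since each contains at least one of the three words; with + on each term, or with an AND default, only the first does.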
In this chaotic transition, Yahoo! reigned supreme. Suddenly you could not move fast enough to see what Yahoo! had to offer. The age of the directory also heralded a raging business model that, through massive promotion, made Yahoo! synonymous with Internet research for a time.
LATE ERA SEARCH ENGINES
The growth of the Internet continued. When Google introduced ranking technologies, it changed everything. Here was a way to float the more popular and coincidentally the more recognized resources to the top of the long search engine lists. With the default changed to AND, the search engines began to work again as an effective research tool. Then the databases searched by search engines swelled in size.
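As a rough illustration of how popularity can float recognized resources to the top, the sketch below orders matching pages by the number of other pages that link to them. This is a deliberately simple stand-in, an inbound link count, and not Google's actual algorithm; the function name and the sample link data are hypothetical.

```python
from collections import Counter

def rank_by_inbound_links(matching_pages, links):
    """Order matching pages by how many other pages link to them.

    Hypothetical sketch: links is a list of (source, target) pairs, and
    a page linking to itself is not counted.
    """
    inbound = Counter(target for source, target in links if source != target)
    # Most-linked pages first; ties are broken alphabetically for stability.
    return sorted(matching_pages, key=lambda url: (-inbound[url], url))

# Invented link graph for illustration only.
links = [
    ("a.example", "c.example"),
    ("b.example", "c.example"),
    ("d.example", "c.example"),
    ("a.example", "b.example"),
    ("c.example", "c.example"),  # self-link, ignored
]

print(rank_by_inbound_links(["a.example", "b.example", "c.example"], links))
```

The point of the design is that the ordering depends on what other publishers do, not on how often a page repeats its own keywords.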
There were fundamental shifts taking place. With these new algorithms, the search engines no longer required the assistance of publishers to index the best information. Initially, they began asking for email addresses - often bathing a publisher in spam as a price for indexing - and then some gradually stopped altogether. At the same time, as databases grew, the potential pay-off for a publisher shrank. Most new publishers would only occasionally see a visitor sent their way from any effort in informing the search engines of new pages.
When Google crested one billion records, the limitations of Yahoo! were becoming increasingly apparent. No directory could ever index the complete volume of the Internet effectively, it was said, forgetting that only a few years earlier Archie had effectively indexed all FTP resources. What had happened, of course, was rapid Internet growth that diluted earlier achievements to the point of being inadequate. It did not help that at this time Yahoo! began to charge a consideration fee for publishers wishing to be indexed.
BOOLEAN, FIELDS AND MORE
Another change happened. The search engines allowed for field searching, and those in the know began to make much greater use of additional techniques to further refine their searching. A title search could be most helpful in certain circumstances. AlltheWeb permitted a title search using title.normal:words. This was later changed to match AltaVista's simpler title:words, though Google persisted for a long time in not inviting users to use its title search capability.
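A field search of the title:words kind can be sketched as follows. The sketch assumes a simplified syntax with only title: and url: prefixes and plain substring matching; the function names and the sample page are hypothetical, and no real engine's query parser is being reproduced.

```python
def parse_query(query):
    """Split a query into (field, term) pairs.

    Hypothetical sketch: only the title: and url: prefixes are recognized;
    anything else searches every field.
    """
    parsed = []
    for token in query.split():
        field, sep, term = token.partition(":")
        if sep and field.lower() in ("title", "url") and term:
            parsed.append((field.lower(), term.lower()))
        else:
            parsed.append(("any", token.lower()))
    return parsed

def page_matches(page, query):
    """Return True if every term is found in the field it is restricted to."""
    for field, term in parse_query(query):
        if field == "any":
            haystack = " ".join(page.values())
        else:
            haystack = page[field]
        # Plain substring matching is a simplification.
        if term not in haystack.lower():
            return False
    return True

# Invented sample page for illustration only.
page = {
    "title": "Internet Research Guide",
    "url": "http://example.gov.au/guide.htm",
    "body": "Notes on library science.",
}

print(page_matches(page, "title:research"))
print(page_matches(page, "title:library"))
```

Restricting a term to the title turns a word that appears somewhere on millions of pages into one that must be the page's stated subject, which is why the technique narrows results so sharply.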
Almost by accident, many researchers began extending a skill I refer to as URL interpretation. From an early understanding that .gov means government and .au Australia, researchers could intuit additional information from the Web address. On a good day, I can tell the format, date, publisher, and type of author from the URL. Guessing these elements helps me to anticipate type and quality of information on the site.
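Part of this skill is mechanical enough to sketch in code. The example below uses Python's standard urllib.parse to pull a country, a sector, and a file format out of a Web address. The lookup tables are tiny, hand-picked assumptions, the function name and sample address are hypothetical, and guessing date, publisher, or type of author takes human judgement that this sketch does not attempt.

```python
import posixpath
from urllib.parse import urlparse

# Small illustrative lookup tables; a real list would be far longer.
SECTORS = {"gov": "government", "edu": "education",
           "com": "commercial", "org": "organization"}
COUNTRIES = {"au": "Australia", "uk": "United Kingdom", "ca": "Canada"}

def interpret_url(url):
    """Guess country, sector, and file format from a Web address.

    Hypothetical sketch: the address must include a scheme such as
    http:// for the host name to be recognized.
    """
    parsed = urlparse(url)
    labels = (parsed.hostname or "").split(".")
    # The file extension, if any, hints at the format of the document.
    extension = posixpath.splitext(parsed.path)[1].lstrip(".").lower()
    return {
        # The final label of the host name may be a country code.
        "country": COUNTRIES.get(labels[-1]),
        # The sector usually sits in one of the last two labels.
        "sector": next((SECTORS[l] for l in labels[-2:] if l in SECTORS), None),
        "format": extension or None,
    }

print(interpret_url("http://www.example.gov.au/reports/2002/summary.pdf"))
```

For the sample address the sketch reports an Australian government site serving a PDF, the same reading a practised researcher makes at a glance.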
Region also came into play. A simple url:.au would limit results to Australia. Even more effectively, Bryan Strome with his SearchEngineColossus.com would (and still does) lead you quickly to a regional search engine: an Australian-only search engine, for example. Predictions swept the Web that the next great step forward would be in regional Webspace and in topic-specific search engines. Both predictions, I am mindful, play as yet minor roles in Internet research.
BACK TO CHAOS
As the Internet grows further, search engines begin to run into trouble again. Google stands at just about 2 1/2 billion records now but the Web races ahead at a much faster pace. There are complex reasons for this pace - not least that the number of people capable of Internet publishing grows at an exponential rate. I've explained my views at www.SpireProject.com/art10.htm and www.SpireProject.com/art13.htm. This growth is real and seriously disrupts popularity ranking. Estimating an absolute size of the Web is perilous, but if you accept an estimate of 15 billion Web pages, only about 17% of the Web is indexed. Next year, as this figure surely dips below 7%, ranking technology takes on a whole new meaning.
Where once ranking would float the best information to our attention, by next year it will retreat to become similar to Yahoo! with its emphasis on site, time, and money. Google is not losing its battle but is definitely losing the technological war on organizing chaos. However, this war is being fought more successfully on other fronts.
CHANGES IN APPROACH
There is more to this evolution than a change in tools. This is really a story about a change in approach. In the early days we expected almost all FTP resources to be indexed by Archie. With the early search engines, we expected most important Web pages to be represented. Tomorrow, we will expect most important Web sites to be represented. Yes, we will leap from Web pages to Web sites.