We’ve crawled the web for 32 years: What’s changed?

It was 20 years ago this year that I authored a book called “Search Engine Marketing: The Essential Best Practice Guide.” It is generally regarded as the first comprehensive guide to SEO and the underlying science of information retrieval (IR).

I thought it would be useful to look at what I wrote back in 2002 to see how it stacks up today. We’ll start with the fundamental aspects of what’s involved with crawling the web.

It’s important to understand the history and background of the internet and search to understand where we are today and what’s next. And let me tell you, there is a lot of ground to cover.

Our industry is now hurtling into another new iteration of the internet. We’ll start by reviewing the groundwork I covered in 2002. Then we’ll explore the present, with an eye toward the future of SEO, looking at a few important examples (e.g., structured data, cloud computing, IoT, edge computing, 5G).

All of this is a mega leap from where the internet began.

Join me, won’t you, as we meander down search engine optimization memory lane.

An important history lesson

We use the terms world wide web and internet interchangeably. However, they are not the same thing.

You’d be surprised how many don’t understand the difference.

The first iteration of the internet was invented in 1966. A further iteration that brought it closer to what we know now was invented in 1973 by scientist Vint Cerf (currently chief internet evangelist for Google).

The world wide web was invented by British scientist Tim Berners-Lee (now Sir) in the late 1980s.

Interestingly, most people have the notion that he spent something equivalent to a lifetime of scientific research and experimentation before his invention was launched. But that’s not the case at all. Berners-Lee invented the world wide web during his lunch hour one day in 1989 while enjoying a ham sandwich in the staff café at the CERN Laboratory in Switzerland.

To add a little clarity to the headline of this article: from the following year (1990) to the present day, the web has been crawled one way or another by one bot or another (hence 32 years of crawling the web).

Why you need to know all of this

The web was never meant to do what we’ve now come to expect from it (and those expectations are constantly becoming greater).

Berners-Lee originally conceived and developed the web to meet the demand for automated information-sharing between scientists in universities and institutes around the world.

So, a lot of what we’re trying to make the web do is alien to the inventor and the browser (which Berners-Lee also invented).

This is very relevant to the major scalability challenges search engines face: they have to harvest content to index and keep it fresh while also discovering and indexing new content.

Search engines can’t access the entire web

Clearly, the world wide web came with inherent challenges. And that brings me to another hugely important fact to highlight.

It’s a pervasive myth that began when Google first launched and seems to be as widespread now as it was back then: the belief that Google has access to the entire web.

Nope. Not true. In fact, nowhere near it.

When Google first started crawling the web in 1998, its index was around 25 million unique URLs. Ten years later, in 2008, they announced they had hit the major milestone of having had sight of 1 trillion unique URLs on the web.

More recently, I’ve seen numbers suggesting Google is aware of some 50 trillion URLs. But here’s the big difference we SEOs all need to know: 50 trillion is a whole lot of URLs, but it is only a tiny fraction of the entire web.

Google (or any other search engine) can crawl an enormous amount of content on the surface of the web. But there’s also a huge amount of content on the “deep web” that crawlers simply can’t get access to. It’s locked behind interfaces leading to colossal amounts of database content. As I highlighted in 2002, crawlers don’t come equipped with a monitor and keyboard!

Also, the 50 trillion unique URLs figure is arbitrary. I have no idea what the real figure is at Google right now (and they have no idea themselves of how many pages there really are on the world wide web either).

These URLs don’t all lead to unique content, either. The web is full of spam, duplicate content, iterative links to nowhere and all sorts of other kinds of web debris.

Understanding search engine architecture

In 2002, I created a visual interpretation of the “general anatomy of a crawler-based search engine”:

Clearly, this image didn’t earn me any graphic design awards. But it was an accurate indication of how the various components of a web search engine came together in 2002. It certainly helped the emerging SEO industry gain a better insight into why the industry, and its practices, were so necessary.

Although the technologies search engines use have advanced greatly (think: artificial intelligence/machine learning), the principal drivers, processes and underlying science remain the same.

Although the terms “machine learning” and “artificial intelligence” have found their way more frequently into the industry lexicon in recent years, I wrote this in the section on the anatomy of a search engine 20 years ago:

“In the conclusion to this section I’ll be touching on ‘learning machines’ (vector support machines) and artificial intelligence (AI) which is where the field of web search and retrieval inevitably has to go next.”

‘New generation’ search engine crawlers

It’s hard to believe that there are literally only a handful of general-purpose search engines around the planet crawling the web, with Google (arguably) being the largest. I say that because back in 2002, there were dozens of search engines, with new startups almost every week.

As I frequently mix with much younger practitioners in the industry, I still find it kind of amusing that many don’t even realize that SEO existed before Google was around.

Although Google gets a lot of credit for the innovative way it approached web search, it learned a great deal from a guy named Brian Pinkerton. I was fortunate enough to interview Pinkerton (on more than one occasion).

He’s the inventor of the world’s first full-text retrieval search engine, WebCrawler. And although he was ahead of his time at the dawning of the search industry, he had a good laugh with me when he explained his first setup for a web search engine. It ran on a single 486 machine with 800MB of disk and 128MB of memory, and a single crawler downloading and storing pages from only 6,000 websites!

That’s somewhat different from what I wrote about Google in 2002 as a “new generation” search engine crawling the web:

“The word ‘crawler’ is almost always used in the singular; however, most search engines actually have a number of crawlers with a ‘fleet’ of agents carrying out the work on a massive scale. For instance, Google, as a new generation search engine, started with four crawlers, each keeping open about three hundred connections. At peak speeds, they downloaded the information from over one hundred pages per second. Google (at the time of writing) now relies on 3,000 PCs running Linux, with more than ninety terabytes of disk storage. They add thirty new machines per day to their server farm just to keep up with growth.”

That pattern of scaling up and growth at Google has continued apace since I wrote that. It’s been a while since I saw an accurate figure, but a few years back I saw an estimate that Google was crawling 20 billion pages a day. It’s likely even more than that now.
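To put those numbers side by side, here is a rough back-of-the-envelope calculation in Python. It uses only the figures quoted above (four crawlers, about three hundred connections each, a peak of about one hundred pages per second, and the 20 billion pages a day estimate). The variable names are mine, and the 20 billion figure is an estimate rather than an official number, so treat the result as an order-of-magnitude illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted in the article; none of them are official Google numbers.
early_crawlers = 4
connections_per_crawler = 300
early_pages_per_second = 100              # "over one hundred pages per second" at peak
estimated_pages_per_day = 20_000_000_000  # an estimate the author recalls seeing

# Total connections held open by the early crawler fleet.
early_open_connections = early_crawlers * connections_per_crawler

# Convert the daily estimate into an average per-second rate.
estimated_pages_per_second = estimated_pages_per_day / SECONDS_PER_DAY

# Compare the average estimated rate with the early peak rate.
growth_factor = estimated_pages_per_second / early_pages_per_second

print(f"Early open connections: {early_open_connections}")
print(f"Estimated pages per second: {estimated_pages_per_second:,.0f}")
print(f"Growth over the early peak rate: about {growth_factor:,.0f}x")
```

Even allowing for how loose the inputs are, the gap between roughly a hundred pages per second and a few hundred thousand gives a feel for the scale involved.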

Hyperlink analysis and the crawling/indexing/whole-of-the-web conundrum

Is it possible to rank in the top 10 at Google if your page has never been crawled?

Improbable as it may seem in the asking, the answer is “yes.” And again, it’s something I touched on in 2002 in the book:

From time to time, Google will return a list, or even a single link to a document, which has not yet been crawled but with notification that the document only appears because the keywords appear in other documents with links, which point to it.
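The general idea can be sketched in a few lines of Python. This is a toy illustration, not how Google implements it: the URLs, anchor text and function names below are all hypothetical. It indexes the words in link text found on crawled pages against the link targets, so a target can be returned for a query even though its own content was never downloaded.

```python
from collections import defaultdict

# Hypothetical crawled pages: each maps to the (anchor text, target URL)
# links found on it. The target example.org/uncrawled is never fetched.
crawled_pages = {
    "https://example.com/a": [
        ("vintage guitar repair", "https://example.org/uncrawled"),
    ],
    "https://example.com/b": [
        ("guitar repair guide", "https://example.org/uncrawled"),
        ("drum lessons", "https://example.net/drums"),
    ],
}


def build_anchor_index(pages):
    """Map each lowercase anchor-text word to the set of link targets it points at."""
    index = defaultdict(set)
    for links in pages.values():
        for anchor_text, target in links:
            for word in anchor_text.lower().split():
                index[word].add(target)
    return index


def search(index, query):
    """Return targets whose inbound anchor text (across all links) contains every query word."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


anchor_index = build_anchor_index(crawled_pages)
print(search(anchor_index, "guitar repair"))
```

Here the query “guitar repair” matches the uncrawled URL purely because two crawled pages link to it using those words.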