Search Engine Robots - How They Work, What They Do (Part I)
Daria Goetsch
Search Innovation
April 11, 2003
Automated search engine robots, sometimes called
"spiders" or "crawlers", are the seekers of web
pages. How do they work? What is it they really
do? Why are they important?
You'd think, with all the fuss about indexing
web pages to add to search engine databases, that
robots would be great and powerful beings. Wrong.
Search engine robots have only basic functionality
like that of early browsers in terms of what they
can understand in a web page. Like early browsers,
robots just can't do certain things. Robots don't
understand frames, Flash movies, images or JavaScript.
They can't enter password protected areas and
they can't click all those buttons you have on
your website. They can be stopped cold while indexing
a dynamically generated URL and slowed to a stop
with JavaScript navigation.
How Do Search Engine Robots Work?
Think of search engine robots as automated data
retrieval programs, traveling the web to find
information and links.
When you submit a web page to a search engine
at the "Submit a URL" page, the new URL is added
to the robot's queue of websites to visit on its
next foray out onto the web. Even if you don't
directly submit a page, many robots will find
your site because of links from other sites that
point back to yours. This is one of the reasons
why it is important to build your link popularity
and to get links from other topical sites back
to yours.
When arriving at your website, the automated
robots first check to see if you have a robots.txt
file. This file is used to tell robots which areas
of your site are off-limits to them. Typically
these may be directories containing only binaries
or other files the robot doesn't need to concern
itself with.
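As a rough illustration of that robots.txt check, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt rules, the "ExampleBot" user agent and the example.com URLs are all made up for the example; a real robot would download the file from the site before deciding what to fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /downloads/binaries/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/cgi-bin/search.cgi"):
        print(url, "allowed" if allowed("ExampleBot", url) else "off-limits")
```

A well-behaved robot runs a check like this before every request and simply skips any URL the file marks as off-limits.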
Robots collect links from each page they visit,
and later follow those links through to other
pages. The entire World Wide Web is made up of
links, the original idea being that you could
follow links from one place to another. This is
how robots get around.
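The link-following behaviour described above can be sketched as a queue of URLs to visit. This is only a toy model, not how any particular search engine's robot is built: the PAGES dictionary stands in for the web so the example runs without a network connection, and the example.com pages are invented.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical in-memory "web": URL -> HTML source.
PAGES = {
    "http://example.com/":
        '<a href="/about.html">About</a> <a href="/products.html">Products</a>',
    "http://example.com/about.html":
        '<a href="/">Home</a>',
    "http://example.com/products.html":
        '<a href="/about.html">About</a> <a href="/missing.html">Gone</a>',
}


def crawl(start_url, pages=PAGES):
    """Visit pages breadth-first, queueing each newly discovered link once."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:          # page could not be retrieved; skip it
            continue
        visited.append(url)
        collector = LinkCollector(url)
        collector.feed(html)
        for link in collector.links:
            if link not in seen:  # avoid queueing the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    for url in crawl("http://example.com/"):
        print(url)
```

A URL submitted through a "Submit a URL" page would, in this model, simply be one more entry placed on the queue.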
The "smarts" about indexing pages online come
from the search engine engineers, who devise the
methods used to evaluate the information the search
engine robots retrieve. Once it is added to the
search engine database, that information is available
to searchers. When a user enters a query, the search
engine performs a number of quick calculations to
present the set of results most relevant to that
query.
You can see which pages on your site the search
engine robots have visited by looking at your
server logs or the results from your log statistics
program. Identifying the robots will show you
when they visited your website, which pages they
visited and how often they visit. Some robots
are readily identifiable by their user agent names,
like Google's "Googlebot"; others are a bit more
obscure, like Inktomi's "Slurp". Still other robots
may be listed in your logs that you cannot readily
identify; some of them may even appear to be human-powered
browsers.
Along with identifying individual robots and
counting the number of their visits, the statistics
can also show you aggressive bandwidth-grabbing
robots or robots you may not want visiting your
website. In the resources section at the end of
this article, you will find sites that list names
and IP addresses of search engine robots to help
you identify them.
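If you would rather look at the raw logs than a statistics program, a short script can count robot visits for you. The sketch below assumes your server writes the common "combined" log format, and it matches robots by the two user agent names mentioned above. The sample log lines, IP addresses and the KNOWN_ROBOTS table are invented for illustration; extend the table with names from the resources listed at the end of this article.

```python
import re
from collections import Counter

# Substrings assumed to identify a robot in the User-Agent field.
KNOWN_ROBOTS = ("Googlebot", "Slurp")

# Pattern for the "combined" log format (an assumption about your server).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Made-up sample entries, not taken from a real server.
SAMPLE_LOG = [
    '192.0.2.10 - - [11/Apr/2003:10:15:32 -0700] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1"',
    '192.0.2.10 - - [11/Apr/2003:10:15:33 -0700] "GET /index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
    '192.0.2.44 - - [11/Apr/2003:11:02:07 -0700] "GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/3.0 (Slurp/si)"',
    '192.0.2.77 - - [11/Apr/2003:11:30:00 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]


def robot_visits(lines):
    """Count requests per (robot name, path) for recognised robots."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue              # not in the expected format; ignore
        agent = match.group("agent").lower()
        for robot in KNOWN_ROBOTS:
            if robot.lower() in agent:
                counts[(robot, match.group("path"))] += 1
                break
    return counts


if __name__ == "__main__":
    for (robot, path), hits in sorted(robot_visits(SAMPLE_LOG).items()):
        print(robot, path, hits)
```

Matching on the user agent alone is only a first pass, since any program can claim any name; checking the IP address against a published list is a useful second step.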
How Do They Read The Pages On Your Website?
When the search engine robot visits your page,
it looks at the visible text on the page, the
content of the various tags in your page's source
code (title tag, meta tags, etc.), and the hyperlinks
on your page. From the words and the links that
the robot finds, the search engine decides what
your page is about. Many factors are used to
figure out what "matters", and each search engine
has its own algorithm to evaluate and process the
information. Depending on how the search engine
has set up its robot, the information is indexed
and then delivered to the search engine's database.
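To make the idea of "what the robot reads" more concrete, here is a small sketch that pulls the title, meta tags, visible text and hyperlinks out of a page with Python's standard html.parser module. The PageReader class and the sample page are hypothetical, and real robots are certainly more elaborate; note how the script content is skipped, much as the robots described here pass over JavaScript.

```python
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Gathers the parts of a page a simple robot might look at."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}          # meta name -> content
        self.links = []         # href values of <a> tags
        self.text_parts = []    # visible text fragments
        self._in_title = False
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


# Made-up page used only to exercise the sketch.
SAMPLE_PAGE = """<html><head><title>Acme Widgets</title>
<meta name="description" content="Hand-made widgets.">
<script>var x = "hidden";</script></head>
<body><h1>Welcome</h1>
<p>We sell <a href="/widgets.html">widgets</a>.</p></body></html>"""

if __name__ == "__main__":
    reader = PageReader()
    reader.feed(SAMPLE_PAGE)
    print("title:", reader.title)
    print("meta: ", reader.meta)
    print("links:", reader.links)
    print("text: ", " ".join(reader.text_parts))
```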
The information delivered to the databases then
becomes part of the search engine and directory
ranking process. When the search engine visitor
submits their query, the search engine digs through
its database to give the final listing that is
displayed on the results page.
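How a query is matched against the database differs from one search engine to another and is not public, but the basic idea of looking words up in stored page data can be shown with a toy inverted index. Everything below (the function names, the sample pages, and the rule that a page must contain every query word) is an assumption made for the example, and it does no ranking at all.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query, sorted by URL."""
    words = query.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# Made-up page text, standing in for what the robots delivered.
SAMPLE_PAGES = {
    "http://example.com/": "welcome to acme widgets",
    "http://example.com/widgets.html": "blue widgets and red widgets",
    "http://example.com/about.html": "about acme",
}

if __name__ == "__main__":
    index = build_index(SAMPLE_PAGES)
    print(search(index, "acme widgets"))
    print(search(index, "widgets"))
```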
The search engine databases update at varying
times. Once you are in the search engine databases,
the robots keep visiting you periodically, to
pick up any changes to your pages, and to make
sure they have the latest info. How often you
are visited depends on how each search engine
schedules its robot's visits, which varies from
one search engine to another.
Sometimes visiting robots are unable to access
the website they are visiting. If your site is
down, or you are experiencing huge amounts of
traffic, the robot may not be able to access your
site. When this happens, the website may not be
re-indexed, depending on the frequency of the
robot visits to your website. In most cases, robots
that cannot access your pages will try again later,
hoping that your site will be accessible then.
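The "try again later" behaviour can be modelled by putting a failed URL back at the end of the visit queue. This is a sketch under stated assumptions, not a description of any real robot: the limit of three attempts is arbitrary, and make_flaky_fetch is a hypothetical stand-in for a site that is down or overloaded for its first few requests.

```python
from collections import deque


def crawl_with_retries(urls, fetch, max_attempts=3):
    """Fetch each URL, re-queueing failures until max_attempts is reached.

    Returns (fetched, gave_up): a dict of URL -> content, and a list of
    URLs that still failed on their final attempt.
    """
    queue = deque((url, 1) for url in urls)
    fetched, gave_up = {}, []
    while queue:
        url, attempt = queue.popleft()
        try:
            fetched[url] = fetch(url)
        except OSError:
            if attempt < max_attempts:
                queue.append((url, attempt + 1))   # try again later
            else:
                gave_up.append(url)
    return fetched, gave_up


def make_flaky_fetch(failures):
    """Build a fake fetch() that fails failures[url] times per URL."""
    remaining = dict(failures)

    def fetch(url):
        if remaining.get(url, 0) > 0:
            remaining[url] -= 1
            raise OSError("site unavailable: " + url)
        return "content of " + url

    return fetch


if __name__ == "__main__":
    fetch = make_flaky_fetch({"http://example.com/busy.html": 2,
                              "http://example.com/down.html": 99})
    fetched, gave_up = crawl_with_retries(
        ["http://example.com/ok.html",
         "http://example.com/busy.html",
         "http://example.com/down.html"],
        fetch,
    )
    print("fetched:", sorted(fetched))
    print("gave up:", gave_up)
```

A production robot would wait hours or days between attempts rather than retrying within the same run; the queue here only shows the bookkeeping.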
Resources
- SpiderSpotting - Search Engine Watch
  http://searchenginewatch.com/webmasters/spiders.html
- Robotstxt.org
  List of robots and protocols for setting up a robots.txt file.
  http://www.robotstxt.org/
- Spider-Food
  Tutorials, forums and articles about Search Engine spiders and
  Search Engine Marketing.
  http://spider-food.net/
- Spiderhunter.com
  Articles and resources about tracking Search Engine spiders.
  http://www.spiderhunter.com/
- Sim Spider Search Engine Robot Simulator
  Search Engine World has a spider that simulates what the Search
  Engine robots read from your website.
  http://www.searchengineworld.com/cgi-bin/sim_spider.cgi
About the Author:
Daria Goetsch is the founder and Search Engine
Marketing Consultant for Search Innovation Marketing
(http://www.searchinnovation.com),
a Search Engine Promotion company serving small
businesses. Besides running her own company, Daria
is an associate of WebMama.com, an Internet web
marketing strategies company. She has specialized
in search engine optimization since 1998, including
three years as the Search Engine Specialist for
O'Reilly & Associates, a technical book publishing
company.
Copyright © 2003 Search Innovation Marketing.
All Rights Reserved.