Search engines, especially Google, work by doing three essential activities:
- Crawling – an automated “bot” (sometimes called a spider) visits the pages on your site and “crawls” them, following links and using a search algorithm to figure out what each page is about. Sitemaps and robots.txt files make it easier for bots to work on your site.
- Indexing – the bot will index your site. You can think of the web as a library, a website as a book and Google as a digital librarian that keeps track of where all the books are on the shelves. As new books and new editions of existing books come out, Google needs to keep them in the right order on the shelves and maintain an index of where all the books are. For those who actually remember card catalogs, Google is the world’s largest card catalog, indexing websites by various attributes.
- Serving Results – when you search on Google, mathematical algorithms sift through the terabytes of data in its massive index to return the most relevant results for the information you seek. The results are returned on a webpage called a Search Engine Results Page, or SERP for short.
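To make the indexing and serving steps above a little more concrete, here is a toy sketch in Python. It is not how Google actually works; the page paths and text are made up, and the "ranking" is nothing more than returning every page that contains all the words you searched for.

```python
from collections import defaultdict

# Hypothetical pages standing in for content a bot has already crawled.
pages = {
    "/home": "seo tips for small business websites",
    "/blog/sitemaps": "how sitemaps help search engines crawl websites",
    "/blog/robots": "using robots txt to guide search engines",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query, sorted."""
    words = query.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

index = build_index(pages)
print(search(index, "search engines"))
# prints ['/blog/robots', '/blog/sitemaps']
```

The index is the card catalog from the analogy: instead of scanning every page on every search, the lookup goes straight to the words and finds the pages filed under them.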
Sitemaps help search engine robots crawl your site to index its content. Sitemaps are especially important when your site is large or new, or has pages that are not well linked from your home page. A Google sitemap is an XML file that lists the URLs of your pages along with metadata about each one, such as when it was last modified and how often it changes. Sitemaps should be updated every time you add a new page or update an existing page. Sitemaps are also helpful if your pages contain dynamic scripts that change the content continuously, or contain videos, special graphics, etc. To see the sitemap for the Bridge Group Online, visit: http://www.bridgegrouponline.com/sitemap.xml
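If you are curious what goes into a sitemap file, the short Python sketch below builds a minimal one. The `urlset`, `url`, `loc` and `lastmod` element names come from the public sitemap protocol; the example.com addresses, the dates and the `build_sitemap` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Return sitemap XML as a string from (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages and dates, for illustration only.
entries = [
    ("http://www.example.com/", "2024-01-15"),
    ("http://www.example.com/blog/first-post", "2024-02-01"),
]
sitemap_xml = build_sitemap(entries)
print(sitemap_xml)
```

In practice most content management systems and SEO plugins generate this file for you, so a script like this is mainly useful for seeing how little is actually in it.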
Sitemaps should be submitted to Google through Google Search Console (formerly called Webmaster Tools). You can have more than one sitemap for your site. For example, you might have one sitemap that includes all the URLs in your site and separate sitemaps for podcasts, videos or blogs. You can also submit an RSS feed of your blog as an additional sitemap, and Google will crawl it and index the content. A bot can compare the old sitemap to the updated one and save time by only crawling the new or updated pages. The sitemap file should be placed in the root directory of your website.
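The idea of comparing an old sitemap to an updated one can be sketched in a few lines of Python. This is only an illustration of the concept, not a description of what Google's bot does internally; the URLs, dates and the `read_sitemap` and `changed_urls` helpers are all hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def read_sitemap(xml_text):
    """Return a dict mapping each URL to its lastmod value (or None)."""
    root = ET.fromstring(xml_text)
    pages = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        pages[loc] = url.findtext("sm:lastmod", namespaces=NS)
    return pages

def changed_urls(old_xml, new_xml):
    """Return URLs that are new, or whose lastmod differs from the old sitemap."""
    old, new = read_sitemap(old_xml), read_sitemap(new_xml)
    return sorted(u for u, mod in new.items() if u not in old or old[u] != mod)

# Hypothetical sitemaps: the home page was updated and a blog post was added.
OLD = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>http://www.example.com/about</loc><lastmod>2024-01-10</lastmod></url>
</urlset>"""
NEW = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc><lastmod>2024-03-01</lastmod></url>
  <url><loc>http://www.example.com/about</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>http://www.example.com/blog/new-post</loc><lastmod>2024-03-02</lastmod></url>
</urlset>"""

print(changed_urls(OLD, NEW))
# prints ['http://www.example.com/', 'http://www.example.com/blog/new-post']
```

The unchanged "about" page is skipped, which is exactly the time saving described above. It also shows why keeping the last-modified dates accurate matters.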
In addition to letting Google know what URLs you have in your site, you may want Google to ignore certain pages. For example, let’s say you are testing different landing pages as part of a banner ad campaign where you want to test whether certain headlines or keywords make a difference in conversion rates. The pages are essentially the same, with very minor differences. You want Google to know about the page, but not get penalized for duplicate content. Google hates duplicate content and doesn’t like to waste its bot’s time indexing pages that don’t enrich the user experience or, worse, are black hat SEO techniques that stuff keywords at Google that really are not relevant.
To avoid getting penalized, you can add a robots.txt file that tells bots not to visit certain pages. You may also have private directories on your website that you don’t want robots to access, such as subscription areas or company intranet pages that you do not want indexed. The robots.txt file tells Google not to crawl those pages. Keep in mind that not all bots, especially spam bots, will pay attention to the rules in a robots.txt file, so it is not a substitute for password protection. The robots.txt file should be placed in the root directory of your website.
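Here is a small Python sketch of what a robots.txt file might look like and how a well-behaved bot would read it. The `User-agent`, `Disallow` and `Sitemap` lines are standard robots.txt directives; the example.com address and the `/landing-tests/` and `/intranet/` paths are made up for illustration. Python's built-in `urllib.robotparser` module stands in for the bot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the test landing pages and the intranet for all bots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /landing-tests/
Disallow: /intranet/

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask, as a polite bot would, whether each page may be crawled.
for path in ["/", "/landing-tests/headline-b.html", "/intranet/staff.html"]:
    allowed = parser.can_fetch("Googlebot", "http://www.example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

Run as written, this reports the home page as allowed and the other two pages as blocked. Notice that the check happens on the bot's side: the file is only a request, which is why a bot that chooses to ignore it can still reach those pages.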