A robots.txt file is an important part of your website's SEO: it controls how search engine robots crawl your site and directs them to the specific paths you want them to access. Without a robots.txt file, a website is not properly optimized for SEO, because not every file on your site needs to be indexed by Google and other search engines, and these unnecessary files can hurt your organic search ranking and may even contribute to a Google Panda penalty.
So if you don't have a robots.txt file on your site, you should create one and optimize it for SEO. In this tutorial I will cover the basics, plus a few advanced tips, for optimizing robots.txt for better search engine ranking.
What is a Robots.txt file?
A robots.txt file is a simple text file that sits in the root directory of a website's server and points search engine bots to the right paths for crawling and indexing the site's content. The file follows the robots exclusion standard, a protocol with a small set of commands that tells search engine bots which sections of your site they may access, and the rules can target specific kinds of web crawlers, such as mobile crawlers and desktop crawlers.
A robots.txt file contains instructions in a specific format (syntax). It cannot enforce crawler behavior, but it directs well-behaved search robots to access the files you want shown in search results and to ignore the files that are unimportant; pretty simple. If a website has no robots.txt file, search engine bots will crawl and index the entire site: posts, pages, scripts, images, and everything else that resides in your website's directories.
Different websites have different robots.txt files. For example, the robots.txt file for your root domain example.com is separate from the robots.txt file for your sub-domain blog.example.com, and the rules that apply to example.com do not apply to blog.example.com. In addition, each protocol and port needs its own robots.txt file: http://example.com/robots.txt does not apply to pages under https://example.com:8080/ or https://example.com/.
Robots.txt Uses
Non-image files
For non-image files (web pages), a robots.txt file is used to manage crawl traffic and to keep crawlers away from unimportant pages on your site. You should not use robots.txt to hide your web pages from search results, because your pages can still be indexed through other pages that link to them and are already indexed by search engines.
Hiding URLs with robots.txt rules is not a reliable practice; instead, use password protection or noindex tags or directives to keep pages out of search results.
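For example, a noindex rule can be placed as a meta tag in a page's head section; this is a minimal illustration, and note that the page must not be blocked in robots.txt, otherwise crawlers will never see the tag:

<meta name="robots" content="noindex">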
Image files
Using a robots.txt file you can prevent your images from appearing in search results, though it cannot prevent other pages from linking to your images.
Resource files
Robots.txt is also useful for blocking unnecessary resource files, such as CSS, JavaScript, and images, that are loaded with your website but play no role in its design or performance. If you are sure your site loads resources that are not needed, you can block those resource files with robots.txt. However, if the absence of these resources would make it harder for Google and other search engines to read and understand your web pages, you should not block them.
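For example, if you have verified that a directory of scripts is never needed to render your pages, you could block it like this (the /unused-scripts/ path is purely hypothetical; substitute the actual directory on your site):

User-agent: *
Disallow: /unused-scripts/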
Limitations of Robots.txt for Web Crawlers
Before using a robots.txt file on your website, you should understand the risks and limitations of the robots.txt blocking method. The following are the main limitations of robots.txt for your website.
The instructions in your robots.txt file cannot enforce a web crawler's behavior on your site; they can only direct crawlers to access your site in a specified way. Googlebot understands and obeys most robots.txt directives, but many other search engine bots may not obey the instructions in your robots.txt file. So, to make sure no bot reaches your private files, password-protect those files on the server.
While Google will not crawl a page you have blocked in robots.txt, that disallowed page can still be indexed from other places on the web, usually sites that link to it. To keep those pages out of Google search results, use other blocking methods such as password-protecting the files on your server, or using a noindex meta tag or response header.
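The same noindex signal can also be sent as an HTTP response header, which is handy for non-HTML files such as PDFs (again, the URL must stay crawlable so the header can be seen):

X-Robots-Tag: noindex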
Robots.txt Syntax
To create a robots.txt file for your website, you need access to the root directory of your web server. If you can't access the root directory, contact your hosting provider or use a different blocking method, such as password-protecting files on the server or inserting noindex tags into your pages' header sections.
A simple robots.txt file contains two keywords: User-agent and Disallow.
User-agent is a search engine robot such as Googlebot; you can find most user-agents in the Web Robots Database.
Disallow is a command that tells the matching user-agent not to access a particular URL or section of your website.
Google uses two main search robots: Googlebot for Google Search and Googlebot-Image for Google Image Search. For ad serving through AdSense, Google uses Mediapartners-Google, a crawler that analyzes a website's content so the most relevant ads can be displayed.
The syntax for an ideal robots.txt file is as follows:
User-agent: [the name of the robot the following rule applies to]
Disallow: [the URL path you want to block]
Allow: [the URL path of a subdirectory, within a blocked parent directory, that you want to unblock]
These lines together are called a single entry, where the Disallow rules apply only to the user-agent(s) listed above them. You can create as many entries as you want, and multiple Disallow lines can apply to multiple user-agents, all in one entry.
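For example, the following single entry (the paths are illustrative only) applies two Disallow rules and one Allow rule to Googlebot; the Allow line unblocks one page inside an otherwise blocked directory:

User-agent: Googlebot
Disallow: /private/
Disallow: /archive/
Allow: /private/public-page.html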
If you want a rule to apply to all available search robots, insert an asterisk (*) after User-agent:, like this:
User-agent: *
Disallow:
The above robots.txt file tells all search robots that they may index the entire site (an empty Disallow value blocks nothing).
The following are the most commonly used commands in robots.txt to block specific files or sections of a site.
Block… | Sample |
---|---|
The entire site, using a forward slash (/) | Disallow: / |
A directory and its contents, using the directory name followed by a forward slash | Disallow: /sample-directory/ |
A web page, by listing the page after the slash | Disallow: /private_file.html |
A specific image from Google Images | User-agent: Googlebot-Image<br>Disallow: /images/dogs.jpg |
All images on your site from Google Images | User-agent: Googlebot-Image<br>Disallow: / |
Files of a specific file type (for example, .gif) | User-agent: Googlebot<br>Disallow: /*.gif$ |
Allow Mediapartners-Google to analyze your entire site | User-agent: Mediapartners-Google<br>Allow: / |
Note: Directives are case-sensitive. For example, Disallow: /file.asp would block http://www.example.com/file.asp, but would allow http://www.example.com/File.asp. Googlebot also ignores whitespace and unknown directives in robots.txt.
For pattern matching, you can use the following rules:
Pattern-matching rule | Sample |
---|---|
To block any sequence of characters, use an asterisk (*). For instance, this rule blocks access to all subdirectories that begin with the word “private” | User-agent: Googlebot<br>Disallow: /private*/ |
To block access to all URLs that include a question mark (?). For example, this rule blocks URLs that begin with your domain name, followed by any string, followed by a question mark, and ending with any string | User-agent: Googlebot<br>Disallow: /*? |
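You can also use a dollar sign ($) to match the end of a URL. As a small sketch (assuming, for illustration, that you want to keep PDF files out of the crawl), this rule blocks any URL ending in .pdf:

User-agent: Googlebot
Disallow: /*.pdf$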
Warning: Both Google and Yahoo support wildcards, but you have to be very careful when changing a robots.txt file with pattern-matching rules. A single line of faulty code cost Aaron Wall (founder of SEO Book) $10K in profit when he lost organic search ranking in Google: while pruning URLs with wildcards in robots.txt, he accidentally removed one of his top linked pages that had a similar URL. Read more about it.
WordPress Robots.txt File
So far you have learned the basics and a few advanced points about the robots.txt file, which controls how search robots crawl your entire site. Now you will learn how to create and optimize a robots.txt file for WordPress.
You can check whether a robots.txt file exists in your website's directory by typing http://yoursite.com/robots.txt into your browser.
If a robots.txt file has already been created, you will see something like:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Otherwise, you will get a 404 error page because the robots.txt file does not exist, and you will have to create a new robots.txt file and save it in the root directory of your website.
Creating a Robots.txt for WordPress
A robots.txt file is a simple text document (as already mentioned), so you can create one easily by creating a new text document on your desktop and naming the file "robots" [all lowercase letters].
If your site is an "addon domain", you have to access your website's directory under "public_html" and upload robots.txt there.
After uploading robots.txt, don't forget to check that it is stored in the root directory correctly by typing http://yoursite.com/robots.txt into your browser.
Once robots.txt is found at the right path, you're done!
Alternative Way to Create a Robots.txt
Alternatively, you can create a new robots.txt file using the WordPress SEO plugin's Tools option.
To create a robots.txt file with WordPress SEO tools, first log in to your WordPress account, then navigate to SEO > Tools > File editor.
You will see the following syntax:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Now you can make changes to your robots.txt file and save it.
Optimize Your Robots.txt for Better SEO
Now that you have a new robots.txt file stored in the root directory of your website, you're ready to optimize the file for better SEO.
You can edit your robots.txt from the WordPress SEO > Tools section or by using a third-party plugin such as Multipart robots.txt editor on your WordPress site.
If you use WordPress SEO, follow this process:
Log in to your WordPress account and go to SEO > Tools.
Now click the "File Editor" link, and you can create an optimized robots.txt.
Below is a common example of a standard robots.txt for WordPress:
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /comments/feed/
Disallow: /trackback/
Disallow: /index.php
Disallow: /xmlrpc.php
Disallow: /readme.html
Disallow: /refer/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google
Allow: /

Sitemap: https://www.upstartwebdesign.com/sitemap.xml
Sitemap: https://www.upstartwebdesign.com/sitemap-image.xml
In the robots.txt file above I listed two sitemaps so they can be discovered by Google and other search engines that follow the robots.txt syntax.
Relying on robots.txt alone to submit a sitemap is not recommended; sitemaps should also be submitted manually from the Sitemaps section in Google and Bing webmaster tools for better indexing and crawling. I placed those two sitemaps in robots.txt to keep the file standard, though I have already submitted them to Google, Bing, and other search engines manually.
If you don't know how to submit a WordPress sitemap to Google, read this tutorial.
Affiliate Links
If you use an affiliate manager in WordPress to manage your affiliate links, it outputs a standard format that uses your own domain name to refer to third-party URLs, in this way:
http://yoursite.com/out/bluehost or http://yoursite.com/refer/bluehost
Here both "out" and "refer" work as affiliate slugs and use a 301 redirect to send traffic to the target site through your affiliate link.
These links may appear in Google search results if you don't block them in your robots.txt file. In the robots.txt example above I used Disallow: /refer/ because I use "refer" to send traffic through my affiliate links, and I don't want these links indexed and appearing in Google search results.
Also, if these links get indexed in Google search results, I could face a bigger penalty for low-quality content under Panda. So it's important to block your affiliate links in your robots.txt file.
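If your affiliate plugin uses a different slug, such as the "out" slug shown above, block that path the same way (adjust the paths to whatever slugs your plugin actually uses):

User-agent: *
Disallow: /out/
Disallow: /refer/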
When you have created your optimized robots.txt syntax, simply enter it into the robots.txt section in SEO Tools
and click "Save changes to Robots.txt". You have successfully updated your robots.txt.
Update your Robots.txt file with Google Search Console
Now you should test your robots.txt file using the robots.txt Tester in Google Search Console. To do so, simply log in to Google Search Console with your Google account and open your site's Search Console dashboard.
Then go to Crawl > robots.txt Tester and test your updated file.
Googlebot and Googlebot-Image will then obey your robots.txt rules and crawl your website the way you defined in the robots.txt file.
Conclusion
A robots.txt file is an integral part of any website, and without it a site is not fully optimized for SEO. A robots.txt file can play a great role in your site's overall SEO configuration and can improve how your pages appear in search results as well as your search engine ranking.
Again, there are major risks in robots.txt optimization if you get the syntax wrong and accidentally block important pages on your site, just as Aaron Wall (a well-known SEO) did: a single-line mistake in his robots.txt file cost him $10K by losing organic search ranking for one of his top linked pages.
So be very careful with your robots.txt optimization, and be sure about what you're telling search engines to access on your site.