How to Create Optimized Robots.txt for Your WordPress Site

A robots.txt file is an important part of your website's SEO: it controls how search engine robots crawl your site and points them to the specific paths you want them to access. Without a robots.txt file, a website is not properly optimized, because not all of your website's files need to be indexed by Google and other search engines, and letting these unnecessary files be crawled can hurt your organic search ranking and can even contribute to a Google Panda penalty.

So if your site does not have a robots.txt file yet, you should create one and optimize it for SEO. In this tutorial I will cover the basics, along with a few advanced tips, for optimizing robots.txt for better search engine ranking.

What is a Robots.txt file?

A robots.txt file is a simple text file that sits in the root directory of a website's server and points search engine bots to the right paths to crawl and index on that website. The file follows the robots exclusion standard, a protocol with a small set of commands that controls how bots access your site, section by section and per specific kind of web crawler, such as mobile crawlers and desktop crawlers.

A robots.txt file contains instructions in a specific format (syntax) that cannot enforce search robot behavior, but it does direct search robots to access the files you want shown in search results and to ignore the files that are unimportant, pretty simple. If a website has no robots.txt file, search engine bots will crawl the entire site: posts, pages, scripts, images, and everything else that resides in your website's directories.

Different websites have different robots.txt files. For example, the robots.txt file for your root domain example.com is quite different from the robots.txt file for your sub-domain blog.example.com, and the rules that apply to example.com do not apply to blog.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under https://example.com:8080/ or https://example.com/.

Uses of Robots.txt

Non-image files

For non-image files (web pages), a robots.txt file is used to control crawling traffic and to keep crawlers away from unimportant pages on your site. You should not use robots.txt to hide web pages from search results, because your pages can still be indexed through other pages that link to them and are already indexed by search engines.

Hiding URLs with robots.txt commands is not a reliable practice; instead, use password protection or noindex tags or directives to keep pages out of search results.

Image files

Using a robots.txt file, you can keep images out of image search results, though it cannot prevent other pages from linking to your images.
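For example, a minimal sketch, assuming your images live in a hypothetical /photos/ directory, that keeps them out of Google Images:

User-agent: Googlebot-Image
Disallow: /photos/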

Resource files

A robots.txt file is also useful for blocking unnecessary resource files, such as CSS, JavaScript, and images, that are loaded with your website but play no role in how it looks or performs. If you think your site loads resources that contribute nothing to its design or performance, you can block those resource files with robots.txt. However, if the absence of these resources makes it harder for Google and other search engines to read and understand your web page, you should not block them.
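As a minimal sketch, assuming your site loads leftover files from a hypothetical /old-assets/ directory that no longer affects how your pages render (the path is an example, not a real WordPress directory), you could block it like this:

User-agent: *
Disallow: /old-assets/

Leave your theme's CSS and the JavaScript your pages actually need unblocked; otherwise search engines may not render your pages correctly.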

Limitations of Robots.txt for Web Crawlers

Before using a robots.txt file on your website, you must understand the risks and limitations of the robots.txt blocking method. The following are the main limitations of robots.txt for your website.

The instructions in your robots.txt cannot enforce web crawler behavior on your site; they can only direct crawlers to access your site in a specified way. Googlebot understands and obeys most robots.txt directives, but many other search engine bots may not obey the instructions in your robots.txt file. So, to make sure every bot stays out of your private files, password-protect them on your server.

While Google will not crawl a page you have blocked in your robots.txt file, that disallowed page can still be indexed from other places on the web, typically from a site that links to it. To keep such pages out of Google search results, use other blocking methods, such as password-protecting the files on your server or using noindex meta tags or X-Robots-Tag response headers.
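As a quick illustration, a page-level noindex is either a meta tag placed in the page's head section or a header sent with the server's response; the two lines below show the generic form of each, assuming you add them yourself or through an SEO plugin:

<meta name="robots" content="noindex">
X-Robots-Tag: noindex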

Robots.txt Syntax

To create a robots.txt file for your website, you need access to the root directory of your web server, where the file will live. If you cannot access the root directory of your web server, contact your hosting provider, or use a different blocking method such as password-protecting files on the server or inserting noindex tags into your pages' header sections.

A simple robots.txt file contains two keywords:

User-agent and Disallow

User-agent is a search engine robot like Googlebot; you can find most User-agents in the Web Robots Database.

Disallow is a command that tells the specified user-agent not to access a particular URL or section of your website.

Google uses two search robots for organic results: Googlebot for Google Search and Googlebot-Image for Google Image Search. For ads served through AdSense, Google uses the Mediapartners-Google robot, which analyzes a website's content so the most relevant ads can be matched to its keywords.

The syntax of an ideal robots.txt file is as follows:

User-agent: [the name of the robot the following rule applies to]
Disallow: [the URL path you want to block]
Allow: [the URL path of a subdirectory, within a blocked parent directory, that you want to unblock]

The lines for one user-agent are together called a single entry, where the Disallow rules apply only to the user-agent(s) named in that entry. You can create as many entries as you want, and multiple Disallow lines can apply to multiple user-agents, all within one entry.
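For example, a single entry (the paths here are hypothetical) that blocks Googlebot from a /private/ directory while still allowing one page inside it would look like this:

User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html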

If you want your commands to apply to all available search robots, insert an asterisk (*) after User-agent:, like this:

User-agent: *
Disallow:

The above robots.txt file tells all search robots that nothing is disallowed, so the entire site may be crawled and indexed.

The following are the most common robots.txt commands for blocking specific files or sections of a site.

To block the entire site, use a forward slash (/):
Disallow: /

To block a directory and its contents, use the directory name followed by a forward slash:
Disallow: /sample-directory/

To block a single web page, list the page path after the slash:
Disallow: /private_file.html

To block a specific image from Google Images:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg

To block all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /

To block files of a specific file type (for example, .gif):
User-agent: Googlebot
Disallow: /*.gif$

To allow Mediapartners-Google to analyze your entire site:
User-agent: Mediapartners-Google
Allow: /

Note: Directives are case-sensitive. For example, Disallow: /file.asp would block http://www.example.com/file.asp but would allow http://www.example.com/File.asp. Googlebot also ignores white space and unknown directives in robots.txt.

For pattern matching, you can use the following rules:

To match any sequence of characters, use an asterisk (*). For instance, the following rule blocks access to all subdirectories that begin with the word "private":
User-agent: Googlebot
Disallow: /private*/

To block access to all URLs that include a question mark (?), that is, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string:
User-agent: Googlebot
Disallow: /*?
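Search engines that support wildcards also recognize a dollar sign ($), which anchors a pattern to the end of the URL. For instance, the following rule blocks only URLs that end with a question mark, while URLs that merely contain one somewhere in the middle are left alone:

User-agent: Googlebot
Disallow: /*?$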

Warning: Both Google and Yahoo support wildcards, but you have to be very careful when changing a robots.txt file with pattern-matching rules. A one-line mistake caused Aaron Wall (founder of SEO Book) to lose $10K in profit: while pruning URLs with wildcards in robots.txt, he accidentally removed one of his most-linked pages, which had a similar URL, and lost its organic search ranking position in Google. Read more about it.

WordPress Robots.txt File

So far you have learned the basics, plus a few advanced tips, about the robots.txt file and how it controls search robot crawling across your entire site. Now you will learn how to create and optimize a robots.txt file for WordPress.

Once a robots.txt file exists in your website's directory, you can check it by typing http://yoursite.com/robots.txt into your browser.

If a robots.txt file has already been created, you will see something like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Otherwise, you will get a 404 error page because the robots.txt file does not exist, and you will have to create a new robots.txt file and save it in the root directory of your website.

Creating a Robots.txt for WordPress

A robots.txt file is a simple text document (as I already mentioned), so you can create one easily by creating a new text document on your desktop and naming the file "robots" [all lowercase letters].

If your site is an "Addon Domain", you have to access your website directory inside "public_html" and upload robots.txt there.

After uploading robots.txt, don't forget to check that it is stored in the root directory correctly by typing

http://yoursite.com/robots.txt

into your browser and accessing it.

Once robots.txt is reachable at the right path, you're done!

An Alternative Way to Create Robots.txt

Alternatively, you can create a new robots.txt file using the WordPress SEO plugin's Tools option.

To create a robots.txt file with WordPress SEO's tools, first log in to your WordPress account and then navigate to SEO > Tools > File editor.

You will see the following syntax:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Now you can make changes to your robots.txt file and save it.

Optimize Your Robots.txt for Better SEO

Now that you have a new robots.txt file stored in the root directory of your website, you're ready to optimize the file for better SEO.

You can edit your robots.txt from the WordPress SEO > Tools section or by using a third-party plugin such as Multipart robots.txt editor on your WordPress site.

If you use WordPress SEO, follow this process:

Log in to your WordPress account and go to SEO > Tools.

Now click the "File Editor" link, and you can create an optimized robots.txt.

Below are the most common practices for creating a standard robots.txt for WordPress:

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /comments/feed/
Disallow: /trackback/
Disallow: /index.php
Disallow: /xmlrpc.php
Disallow: /readme.html
Disallow: /refer/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google
Allow: /

Sitemap: https://www.upstartwebdesign.com/sitemap.xml
Sitemap: https://www.upstartwebdesign.com/sitemap-image.xml

In the robots.txt above, I included two sitemaps to be picked up by Google and the other search engines that follow robots.txt syntax.

Submitting a sitemap only through the robots.txt file is not recommended; for a better indexing and crawling ratio, sitemaps should be submitted manually in the Sitemaps section of Google and Bing. I placed those two sitemaps in robots.txt to keep the file complete and standard, but I had already submitted them manually to Google, Bing, and other search engines.

If you don't know how to submit a WordPress sitemap to Google, read this tutorial.

Affiliate Links

If you use an affiliate manager plugin in WordPress to manage your affiliate links, have it output links in a standard format that uses your own domain name to point to third-party URLs, like this:

http://yoursite.com/out/bluehost or http://yoursite.com/refer/bluehost

Here both "out" and "refer" work as affiliate slugs and use a 301 redirect to send traffic to a specific site through your affiliate link.

These links may appear in Google search results if you don't block them in your robots.txt file.

In the robots.txt above I used Disallow: /refer/ because I use "refer" to send traffic to my affiliate links, and I don't want these links indexed and shown in Google search results.

If these links do get indexed in Google search results, I may also face a bigger penalty from Panda for low-quality content. So it's imperative to block your affiliate links in your robots.txt file.
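As a minimal sketch, assuming your affiliate manager uses the "out" and "refer" slugs mentioned above (adjust the paths to whatever slug your own setup uses), you could block them like this:

User-agent: *
Disallow: /out/
Disallow: /refer/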

When you have created your optimized robots.txt syntax, simply paste it into the Robots.txt section in SEO Tools, then click "Save changes to Robots.txt". You have successfully updated your robots.txt.

Update your Robots.txt file with Google Search Console

Now you need to update your robots.txt file using the robots.txt Tester in Google Search Console. To do so, simply log in to Google Search Console with your Google account and open your site's Search Console dashboard.

Now go to Crawl > robots.txt Tester.

Once your updated robots.txt has been submitted, Googlebot and Googlebot-Image will obey your robots.txt commands and crawl your website the way you defined in the file.

Conclusion

A robots.txt file is an integral part of any website, and without it a site is not fully optimized for SEO. A robots.txt file can play a great role in your site's overall SEO configuration and improve both its appearance in search results and its search engine ranking.

Then again, there are major risks to robots.txt optimization if you get the syntax wrong and unintentionally remove important pages from your site, just as Aaron Wall (a famous SEO) did: a single-line mistake in his robots.txt file cost him $10K by losing organic search ranking for one of his most-linked pages.

So be very careful about your robots.txt optimization and be sure about what you're telling search engines to access on your site.
