There are a couple of directives that tell search engine bots which pages and other content should be crawled and indexed. The most common are the robots.txt file and the meta robots tag.
The first, the robots.txt file, tells search engines which specific parts of the website should be crawled, whether that is a page, a subfolder, a sitemap, etc.
This makes crawling more efficient by signalling which sections of your website are important and which you do not want prioritized for indexing.
However, it is important to remember that search engine bots are not required to respect this file.
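As an illustration, a minimal robots.txt might look like the following (the paths shown are hypothetical examples, not recommendations for any particular site):

```text
# Applies to all crawlers
User-agent: *
# Keep bots out of internal search results and the admin area
Disallow: /search/
Disallow: /wp-admin/
# Point crawlers at the XML sitemap
Sitemap: https://www.example.com/sitemap.xml
```

The file always lives at the root of the domain (e.g. example.com/robots.txt), which is where crawlers look for it.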
The other commonly used directive is the meta robots tag, which allows indexation control at the page level.
A meta robots tag can include a number of values, such as:
- Index: allows search engines to add the page to their index
- Noindex: tells search engines not to add the page to their index, which keeps it from appearing in that engine's search results
- Follow: instructs search engines to follow the links on a page, so that bots can crawl the other links or pages within
- Nofollow: instructs search engines not to follow the links on that page
- None: a shortcut for noindex, nofollow
- All: a shortcut for index, follow
- Noimageindex: instructs search engines not to index the images on a page (images can still be indexed if they are linked to from another site)
- Noarchive: tells search engines not to show a cached version of the page
- Nocache: the same as noarchive, but specific to Bingbot/MSNbot
- Nosnippet: instructs search engines not to display text or video snippets for the page
- Notranslate: instructs search engines not to offer translations of the page in search results
- Unavailable_after: specifies a date and time after which search engines should not display the page in their results
- Noyaca: instructs Yandex's crawler not to use the page description in its search results
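For reference, the meta robots tag is placed in the `<head>` of a page, and multiple values are comma-separated. A couple of illustrative examples (the date format for unavailable_after can vary by search engine):

```html
<!-- Keep the page out of the index but still let bots follow its links -->
<meta name="robots" content="noindex, follow">

<!-- Shortcut equivalent to "noindex, nofollow" -->
<meta name="robots" content="none">

<!-- Stop displaying the page in results after a given date and time -->
<meta name="robots" content="unavailable_after: 2025-12-31T23:59:00+00:00">
```

You can also target a specific crawler by replacing "robots" with its user agent name, e.g. `<meta name="googlebot" content="noindex">`.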
In addition to these, there is another tag that allows you to issue noindex and nofollow directives.
The X-Robots-Tag differs from the robots.txt file and the meta robots tag mainly in where it is placed and what it controls. The X-Robots-Tag is part of the HTTP header, and it controls the indexing of a page as a whole as well as of specific elements on that page. According to Google, any directive that can be used in a meta robots tag can also be used in an X-Robots-Tag.
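Because the header travels with the HTTP response itself, it also works for non-HTML files that cannot carry a meta tag, such as PDFs or images. In a raw response it might look like:

```text
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, nofollow
```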
Developers are highly encouraged to use the X-Robots-Tag over the other directives, as it offers more flexibility.
How to Check
There are a few ways to check for an X-Robots-Tag on the site. One of them is through Screaming Frog.
After crawling a site with the tool, you can navigate to the “Directives” tab and look for the “X-Robots-Tag” column. There you will see which sections of the site use the tag, along with the specific directives.
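If you prefer to check by hand, you can inspect a page's response headers directly. Below is a minimal sketch in Python using only the standard library; the URL and the helper function name are illustrative, not part of any tool mentioned above:

```python
from urllib.request import Request, urlopen

def get_x_robots_tag(headers):
    """Return the X-Robots-Tag value from (name, value) header pairs, or None.

    HTTP header names are case-insensitive, so compare in lower case.
    """
    for name, value in headers:
        if name.lower() == "x-robots-tag":
            return value
    return None

# Example usage against a live site (requires network access):
# response = urlopen(Request("https://www.example.com", method="HEAD"))
# print(get_x_robots_tag(response.getheaders()))

# The same helper works on any list of (name, value) pairs:
print(get_x_robots_tag([("Content-Type", "text/html"),
                        ("X-Robots-Tag", "noindex, nofollow")]))
```

If the function returns None, the page is not sending an X-Robots-Tag, and any directives would have to come from a meta robots tag or robots.txt instead.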
There are also a few browser plugins you can download that show whether an X-Robots-Tag is being used.
There are a number of ways to instruct search engines not to crawl or index certain parts of a site. Understanding each directive, and how they affect one another, is crucial to avoiding major SEO issues.