- Considering crawling and indexing instructions.
- Crawling links from the link tag.
- Automatic stop of crawling.
- Additional settings.
You can find the advanced settings under ‘Settings → Advanced'. They are used to configure the crawling process, in particular:
- whether or not to follow indexing instructions;
- whether or not to consider links from the <link> tag;
- whether to stop crawling automatically.
1. Considering crawling and indexing instructions
This is the first and the most important section of the advanced settings. It includes such configurations as:
1.1. Robots.txt → tick to take into account directives from the robots.txt file for the chosen User Agent: the Allow/Disallow directives determine whether a specific page is added to the results table.
Please note that the Google Chrome browser is used as the default User Agent for HTTP requests, but for virtual robots.txt the Netpeak Spider bot is used. The reason is that the Google Chrome User Agent doesn't consider robots.txt directives, while we need to check how they work for different bots.
You can test directives from the robots.txt file while a website is still at the development stage using the ‘Virtual robots.txt' feature in Netpeak Spider. It allows you to test new or updated directives in robots.txt without changing the real file.
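A minimal robots.txt illustrating how these directives interact (the paths and file names here are hypothetical) could look like this:

```
User-agent: *
Disallow: /admin/
Allow: /admin/public-page.html
```

With the ‘Robots.txt' option ticked, pages under /admin/ would be skipped by the crawler, while /admin/public-page.html would still be crawled and added to the results table.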
1.2. Canonical → tick to take into account canonical instructions in the <link rel="canonical" /> tag in the <head> section of the document or the ‘Link: rel="canonical"' HTTP response header, and consider links from them the only outgoing links from the page. This parameter is enabled by default.
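Both variants of the canonical instruction take the following form (example.com is a placeholder domain):

```html
<!-- tag variant in the <head> section of the document -->
<link rel="canonical" href="https://example.com/main-page/" />
```

```
Link: <https://example.com/main-page/>; rel="canonical"
```

The first snippet is the tag variant, the second is the HTTP response header variant; both point the crawler to https://example.com/main-page/ as the canonical address of the page.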
1.3. Refresh → tick to take into account the refresh instructions in HTTP response headers or the <meta http-equiv="refresh" /> tag in the <head> section of a document and consider links from this directive the only outgoing links from the page.
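For reference, the refresh instruction can appear in either of these forms (the target URL is a placeholder):

```html
<!-- tag variant in the <head> section -->
<meta http-equiv="refresh" content="0; url=https://example.com/new-page/" />
```

```
Refresh: 0; url=https://example.com/new-page/
```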
1.4. X-Robots-Tag → tick to take into account the X-Robots-Tag instructions in HTTP response header for a chosen User Agent:
- Follow/Nofollow determines whether links from a specific page are followed;
- Index/Noindex determines whether a specific page is added to the results table.
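For example, an X-Robots-Tag instruction in an HTTP response header may look like this (the combination of values is illustrative):

```
X-Robots-Tag: noindex, nofollow
```

A page served with this header would not be added to the results table, and links from it would not be followed when the setting is ticked.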
1.5. ‘Nofollow' link attributes → tick to not follow links with the ‘nofollow' attribute, e.g. <a href="https://example.com/" rel="nofollow">Example</a>.
Note that when Netpeak Spider follows the indexing instructions, disallowed pages will not be crawled but will be added to the ‘Skipped URLs‘ table. However, regardless of settings and parameters, Netpeak Spider always splits results into compliant, non-compliant, and non-HTML pages.
Keep in mind that search engine robots always take into account canonical instructions, directives in robots.txt, and Meta Robots, which is why a website may have indexing issues if they are missing or configured incorrectly.
2. Crawling links from the link tag
To configure crawling links from the <link> tag, use the following settings:
- Next/Prev → tick to follow links from the <link rel="next" /> and <link rel="prev" /> tags in the <head> section of a document.
- AMP HTML → tick to follow links from the <link rel="amphtml" /> tags in the <head> section.
- Other → tick to add all URLs from other <link> tags in the <head> section of a document to the results table. Note that this setting ignores the rel="stylesheet" (CSS), rel="next"/rel="prev", and rel="amphtml" values because they are covered by other settings.
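Taken together, the tags covered by this section might appear in a paginated document's <head> like this (all URLs are placeholders):

```html
<link rel="prev" href="https://example.com/category/page/1/" />
<link rel="next" href="https://example.com/category/page/3/" />
<link rel="amphtml" href="https://example.com/category/page/2/amp/" />
<link rel="alternate" hreflang="de" href="https://example.com/de/category/page/2/" />
```

The first two tags are handled by the ‘Next/Prev' setting, the third by ‘AMP HTML', and the last one by ‘Other'.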
3. Automatic Stop of Crawling
In the ‘Pause crawling automatically' section, you can configure automatic stop of crawling in the following cases:
- When the website returns the ‘429 Too Many Requests' status code → crawling will be paused when the server returns the 429 status code.
- When the response timeout is exceeded → crawling will be paused if the response timeout is exceeded. By default, it is 30 seconds, but you can change this parameter in the ‘General' tab of the crawling settings.
You can resume the crawling at any time.
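A 429 response that triggers the automatic pause typically looks like this (the Retry-After header is optional, and its value here is illustrative):

```
HTTP/1.1 429 Too Many Requests
Retry-After: 120
```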
4. Additional Settings
Additionally, this tab contains such settings as:
- Allow cookies → tick if the analyzed website refuses all requests that do not carry cookies. With this option, all requests are also tracked within one session; otherwise, every new request starts a new session. This parameter is enabled by default.
- Retrieve 4xx error pages content → tick to retrieve all selected parameters of the pages that return a 4xx status code.
- Crawl relative canonical URLs → tick to enable crawling of relative canonical URLs in the <link rel="canonical" /> tag in the <head> section of a document or the ‘Link: rel="canonical"' HTTP response header. In this case, found URLs will be added to the results table.
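A relative canonical URL omits the scheme and host and is resolved against the page's own address; for example (the path is a placeholder):

```html
<link rel="canonical" href="/main-page/" />
```

On the page https://example.com/main-page/?sort=asc, this tag resolves to https://example.com/main-page/, which is the URL the crawler would follow.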
To reset settings on the current settings tab, use the ‘Reset settings to default' button, or set the ‘Default' template to reset settings on all the tabs.