In Netpeak Spider 3.6 we’ve added support for custom HTTP headers to make the program configuration more flexible. This feature lets you solve advanced tasks such as checking SEO issues on websites that use web form authentication.
1. What this feature is for
1.1. Crawling websites requiring authentication
Custom HTTP headers allow you to crawl or scrape data from websites whose content is available only to authorized users.
1.2. Avoiding crawling protection
With custom HTTP headers, a web server will treat requests sent by Netpeak Spider as if they came from a regular user rather than from an automated crawler.
1.3. Getting dynamic versions of pages
This feature is necessary when you have to crawl a website that serves different source code depending on parameters in HTTP headers: device, client, region, language, or screen resolution.
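As a minimal sketch of this idea (not Netpeak Spider’s internal code), the same URL can be requested with different header sets to receive different page variants. All header names and values below are illustrative examples:

```python
# Default headers a crawler might always send (illustrative values).
DEFAULT_HEADERS = {
    "Accept": "text/html",
    "Accept-Encoding": "gzip, deflate",
}

def build_request_headers(custom_headers):
    """Merge custom headers over the defaults, as a crawler might do."""
    headers = dict(DEFAULT_HEADERS)
    headers.update(custom_headers)
    return headers

# Requesting a mobile, German-language variant of the same page:
mobile_de = build_request_headers({
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_0 like Mac OS X)",
    "Accept-Language": "de-DE,de;q=0.9",
})
print(mobile_de["Accept-Language"])  # → de-DE,de;q=0.9
```

A server that adapts its output to these headers would then return the mobile German version of the page instead of the default one.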
Please use this feature responsibly. If you do not fully understand how it works, avoid using it. When this feature is enabled, Netpeak Spider will follow links that may not be available to ordinary users, including links that add or delete data on the website. This can cause serious damage to your website.
2. How to configure HTTP headers
To configure custom HTTP headers, go to ‘Settings’ → ‘HTTP headers’.
2.1. Fields such as ‘User-Agent’, ‘Accept’, and ‘Accept-Encoding’ can’t be changed. If you create another header with the same name, the crawler will ignore it to avoid errors during crawling.
Please note that the user agent should be configured on the ‘User agent’ tab of the settings menu.
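The duplicate-protection rule above can be sketched as follows. This is an illustrative simulation, not the program’s actual code; the custom header names are made up:

```python
# Header names that custom headers are not allowed to override.
RESERVED = {"user-agent", "accept", "accept-encoding"}

def filter_custom_headers(custom):
    """Drop custom headers that shadow reserved ones (case-insensitive)."""
    return {name: value for name, value in custom.items()
            if name.lower() not in RESERVED}

print(filter_custom_headers({
    "Accept": "application/json",  # shadows a reserved header: ignored
    "X-Test-Run": "42",            # unique name: kept
}))  # → {'X-Test-Run': '42'}
```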
2.2. The ‘Add header’ button adds a new row with ‘Name’ and ‘Value’ fields and a ‘Delete’ button. You can enter your own header name and value. The number of headers you can add in the settings is not restricted.
2.3. The ‘Clear all’ button removes all added headers except the first three, and the ‘Reset settings to default’ button clears all added headers, returning the standard list of headers.
2.4. You can save an added set of headers as a template using the corresponding button.
A few points to remember here:
- If the ‘Allow cookies’ checkbox on the ‘Advanced’ tab is NOT ticked, the program will NOT send cookies. Otherwise, it will send the entered headers in requests to the web server and process the cookies it receives.
- For custom HTTP headers to be applied during crawling, enter a URL in the ‘Initial URL‘ bar before you start crawling.
- The ‘Authorization’ header set on the ‘Authentication’ tab (used to access websites requiring basic authentication) will be ignored if the same header is defined in the ‘HTTP headers’ settings.
- Some headers, such as ‘Authorization’ and ‘Referer’, can’t contain multiple comma-separated values. If such a field has several values separated by commas, the request will include only the last specified value.
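The last rule above can be illustrated with a small sketch. It mimics the described behavior for single-value headers; it is not the program’s actual code:

```python
# Headers that may hold only one value (per the rule described above).
SINGLE_VALUE_HEADERS = {"authorization", "referer"}

def effective_header_value(name, raw_value):
    """For single-value headers, collapse a comma-separated field
    to the last value; other headers are passed through unchanged."""
    if name.lower() in SINGLE_VALUE_HEADERS and "," in raw_value:
        return raw_value.split(",")[-1].strip()
    return raw_value

print(effective_header_value("Referer", "https://a.example, https://b.example"))
# → https://b.example
print(effective_header_value("Accept-Language", "en-US,en;q=0.9"))
# → en-US,en;q=0.9 (unchanged)
```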
3. Use cases
3.1. Checking changes on a website with ‘If-Modified-Since’ header
1. Add a new header named ‘If-Modified-Since’ in the ‘HTTP headers’ settings with a value in the following format: <day-name>, <day> <month> <year> <hour>:<minute>:<second> GMT.
Example: If-Modified-Since: Wed, 01 Jan 2020 07:28:00 GMT
2. Set a user agent in the ‘User agent’ settings; it will be used in the request headers sent to the web server of the crawled website.
3. Enter an initial URL and hit the ‘Start’ button.
What is it for?
If a page returns a 200 status code during crawling, it means it has been modified since the date specified in the ‘If-Modified-Since’ header. Otherwise, the page should return a 304 status code.
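The server-side logic behind this check can be sketched as follows: the page’s last-modified time is compared with the ‘If-Modified-Since’ date to decide between 200 and 304. The dates below are illustrative:

```python
from email.utils import parsedate_to_datetime

def conditional_status(last_modified, if_modified_since):
    """Return 200 if the page changed after the given date, else 304."""
    page_time = parsedate_to_datetime(last_modified)
    threshold = parsedate_to_datetime(if_modified_since)
    return 200 if page_time > threshold else 304

# Page modified after the threshold date → full response:
print(conditional_status("Thu, 05 Mar 2020 12:00:00 GMT",
                         "Wed, 01 Jan 2020 07:28:00 GMT"))  # → 200

# Page not modified since the threshold date → not modified:
print(conditional_status("Tue, 31 Dec 2019 23:59:59 GMT",
                         "Wed, 01 Jan 2020 07:28:00 GMT"))  # → 304
```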
3.2. Crawling locale-adaptive content
You can set language and region values in headers such as ‘Accept-Language’, ‘Cookie’, or ‘Referer’ (or in any header with a unique name) to analyze locale-adaptive content.