How can I limit crawling by folder, depth, or part of a URL in Netpeak Spider?

Modified on Mon, 09 Oct 2023 at 07:44 PM

If you need to limit website crawling (e.g. exclude subdomains or crawl only one folder), you can use the following settings:

Checkboxes on the ‘General’ tab:
- Crawl only in directory – crawls only the given website directory without leaving it.
- Crawl all subdomains – turn this option off to treat pages on subdomains other than the host specified in ‘Initial URL’ as external.
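To make the two checkboxes more concrete, here is a minimal Python sketch of what such scope checks might look like. It is an illustration only, not Netpeak Spider's actual logic: the function names `in_directory` and `is_subdomain` are hypothetical, and the exact rules the program applies (for example, how it treats `www`) may differ.

```python
from urllib.parse import urlsplit


def in_directory(initial_url: str, url: str) -> bool:
    """Return True if url is on the same host and inside the directory of initial_url."""
    start, target = urlsplit(initial_url), urlsplit(url)
    # Assumption: the directory is the initial path up to and including its last slash.
    directory = start.path.rsplit("/", 1)[0] + "/"
    return target.hostname == start.hostname and target.path.startswith(directory)


def is_subdomain(initial_url: str, url: str) -> bool:
    """Return True if url's host is a subdomain of the initial URL's host."""
    host = urlsplit(initial_url).hostname or ""
    other = urlsplit(url).hostname or ""
    return other != host and other.endswith("." + host)
```

With an initial URL of `https://example.com/blog/`, this sketch keeps `https://example.com/blog/post-1` in scope, leaves `https://example.com/shop/item` out, and flags `https://shop.example.com/x` as a subdomain.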

On the ‘Restrictions’ tab you can set:
- Max number of crawled URLs – limits the number of pages the program crawls.
- Max crawling depth – limits how deep the program crawls a website, measured in clicks from the initial URL to the crawled page.
- Max URL depth – limits how deep the program crawls into the directories of a website, measured in the number of segments in a URL.
- Max number of redirects – this value affects two things:
  - The number of redirects the program will follow to reach the target URL.
  - The number of redirects used as the threshold for reporting the corresponding issue in the sidebar.
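The two depth settings are easy to confuse, so here is a rough Python sketch of the difference. It assumes that URL depth is the count of non-empty path segments and that crawling depth is the minimum number of clicks from the initial URL. The helper names and the tiny link graph are hypothetical, and Netpeak Spider may count edge cases differently.

```python
from collections import deque
from urllib.parse import urlsplit


def url_depth(url: str) -> int:
    """Count the non-empty path segments of a URL."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def click_depths(links: dict, start: str) -> dict:
    """Breadth-first search: fewest clicks needed to reach each page from start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for linked in links.get(page, []):
            if linked not in depths:
                depths[linked] = depths[page] + 1
                queue.append(linked)
    return depths


# Hypothetical site: the home page links straight to a deeply nested URL.
links = {
    "https://example.com/": ["https://example.com/a/b/c/", "https://example.com/x/"],
    "https://example.com/x/": ["https://example.com/y/"],
}
depths = click_depths(links, "https://example.com/")
```

In this example `https://example.com/a/b/c/` is one click from the initial URL but has three URL segments, while `https://example.com/y/` has one segment but is two clicks away. A limit on one kind of depth therefore does not imply a limit on the other.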

On the ‘Rules’ tab, you can add rules that include or exclude pages whose URLs match a set condition.
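As a rough illustration of URL-based rules, the Python sketch below filters URLs with include and exclude patterns. It is not the program's own implementation: the function name `is_allowed`, the use of regular expressions, and the choice to let exclusions win over inclusions are all assumptions made for the example.

```python
import re


def is_allowed(url: str, include=(), exclude=()) -> bool:
    """Return True if url passes the given include and exclude patterns."""
    # Assumption: any matching exclude pattern rejects the URL outright.
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    # Assumption: with no include patterns, every remaining URL is allowed.
    return not include or any(re.search(pattern, url) for pattern in include)
```

For instance, `is_allowed(url, include=[r"/blog/"], exclude=[r"\?page="])` would accept blog pages while skipping their paginated variants.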

You can combine these options however you like and save them as a template for future partial crawls of your website.

If you still have questions after reading this article, please contact our Customer Support.