How to Crawl Slow Websites

  1. Selecting crawling mode and crawling area.
  2. Crawling speed settings.
  3. Automatic crawling pause and resuming.

1. Selecting crawling mode and crawling area

If you don’t need to crawl the entire website, you can restrict the crawling area so that the program doesn’t put a prolonged load on the website. There are several ways to do it:

  • Limit crawling to one category → enter the URL of the category in the ‘Initial URL’ field and enable the ‘Crawl only in directory’ option under ‘Settings → General’. Keep in mind that this mode works only if the URLs of the category and its pages begin with the same path. For example: website.com/category and website.com/category/first-item.

  • Limit crawling by using rules → this feature lets you crawl only the pages that match certain rules, for example pages whose URLs contain particular words.
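
The two restrictions above can be illustrated with a minimal Python sketch. It is not Netpeak Spider's implementation; the function names `in_directory` and `matches_rule` are hypothetical, and it assumes that "crawl only in directory" means "same host, and the URL path starts with the path of the initial URL".

```python
from urllib.parse import urlsplit


def in_directory(url: str, initial_url: str) -> bool:
    """Return True if url is on the same host as initial_url and under its path.

    Assumption: this is how a 'crawl only in directory' check could work;
    the real crawler may apply different rules.
    """
    target, base = urlsplit(url), urlsplit(initial_url)
    base_path = base.path.rstrip("/")
    same_host = target.netloc.lower() == base.netloc.lower()
    # Require an exact match or a "/" boundary so that /category-2 is not
    # treated as being inside /category.
    under_path = target.path == base_path or target.path.startswith(base_path + "/")
    return same_host and under_path


def matches_rule(url: str, words: list) -> bool:
    """Return True if the URL contains at least one of the given words."""
    return any(word in url for word in words)
```

With the example from this section, `in_directory("https://website.com/category/first-item", "https://website.com/category")` is true, while a URL such as `https://website.com/category-2/item` falls outside the directory.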

2. Crawling speed settings 

To adapt the crawling speed to a slow website, use the following settings under ‘Settings → General’:

  • Decrease the number of threads → set no more than 5 threads in the corresponding field. This reduces the number of concurrent connections and decreases the load on the website.


  • Set up a delay between requests → in the corresponding field, set the delay between the requests that the crawler sends to the server. The delay is applied to each thread, so if the website is sensitive to high load, combine a delay with a minimum number of threads.


  • Increase response timeout → by default, Netpeak Spider waits 30,000 milliseconds (30 seconds) for a page response and moves on to the next page if no response arrives within this time. If you know in advance that the website responds slowly, increase the response timeout.
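
To see how threads and delay combine, here is a rough Python estimate of the upper bound on request rate. It is only a sketch: the function `max_requests_per_second` is hypothetical, and it assumes that each thread sends a request, waits for the response, then waits for the configured delay before sending the next one.

```python
def max_requests_per_second(threads: int, delay_ms: float,
                            avg_response_ms: float = 0.0) -> float:
    """Estimate the highest request rate a crawl configuration can produce.

    Assumption: every thread works sequentially (request, response, delay),
    so one request per thread takes delay_ms + avg_response_ms.
    """
    per_request_ms = delay_ms + avg_response_ms
    if per_request_ms <= 0:
        raise ValueError("delay_ms + avg_response_ms must be positive")
    return threads * 1000.0 / per_request_ms


# 5 threads with a 1,000 ms delay can send at most about 5 requests per second.
five_threads = max_requests_per_second(threads=5, delay_ms=1000)

# 1 thread with a 2,000 ms delay and 500 ms responses sends far fewer.
one_thread = max_requests_per_second(threads=1, delay_ms=2000, avg_response_ms=500)
```

Because the delay applies per thread, halving the number of threads halves this bound, which is why a delay works best together with a small thread count.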


3. Automatic crawling pause and resuming

If you encounter the ‘429 Too Many Requests’ status code during crawling, we recommend taking the following steps:

  1. Go to ‘Settings → Advanced’ and tick the following options in the ‘Pause crawling automatically’ section:

  • When website returns the ‘429 Too Many Requests’ status code.
  • When the response timeout is exceeded.

  2. Decrease the number of threads.

  3. Change settings according to the recommendations in the first paragraph of this article.

  4. Save settings.

  5. Continue crawling if the error appeared at the beginning, restart crawling, or recrawl only the pages that returned incorrect status codes.
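
The pause conditions and the final recrawl step can be sketched in Python as follows. This is an illustration, not the crawler's actual logic: `PageResult`, `should_pause`, and `urls_to_recrawl` are hypothetical names, and the 30,000 ms default simply mirrors the response timeout mentioned in this article.

```python
from dataclasses import dataclass
from typing import List, Optional

TOO_MANY_REQUESTS = 429


@dataclass
class PageResult:
    url: str
    status: Optional[int]  # None means no response was received
    elapsed_ms: float      # time spent waiting for the response


def should_pause(result: PageResult, timeout_ms: float = 30_000) -> bool:
    """Return True if either auto-pause condition is met for this page."""
    timed_out = result.status is None or result.elapsed_ms > timeout_ms
    return result.status == TOO_MANY_REQUESTS or timed_out


def urls_to_recrawl(results: List[PageResult],
                    timeout_ms: float = 30_000) -> List[str]:
    """Collect the pages that hit a pause condition and need another attempt."""
    return [r.url for r in results if should_pause(r, timeout_ms)]


results = [
    PageResult("https://website.com/category", 200, 850),
    PageResult("https://website.com/category/first-item", 429, 120),
    PageResult("https://website.com/category/second-item", None, 30_000),
]
retry = urls_to_recrawl(results)
```

Here `retry` contains only the two pages that returned 429 or timed out, which corresponds to recrawling certain pages instead of restarting the whole crawl.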