Skip to main content
Skip table of contents

Start or Schedule Website Crawling

There are two options to start a crawling of a website:

  • Crawl Now

  • Schedule Crawling

Go to Germain Workspace > Left Menu > Analytics > Website Crawler

image-20240220-144701.png

Start website crawling - Germain UX

Crawl Now

This is the quickest way to execute a website crawler and get its results once completed. Configuration settings in this option are limited but sufficient for most use cases.

Configuration Settings

  • URL: Starting URL for the crawler.

  • Stay On Domain

    • True: Visit only URLs on the same domain or its subdomain (e.g. google.com and drive.google.com are on the same domain)

    • False: Visit every URL

  • Store Successfully Visited URLs

    • True: Store all Website URL Availability facts (these representing available and not available URLs)

    • False: Store only failed Website URL Availability facts (not available URLs only)

  • HTTP Failure Status Code: Any visited URL with returned HTTP status code equal or bigger to this value will be considered as unavailable.

  • Maximum Crawling Depth: This value represents how deep the crawler can visit URLs. Null value means no cap for maximum crawling depth.

  • Maximum URLs To Crawl: This value puts a cap on how many URLs can be visited by the crawler. Null value means there is no cap for maximum URLs to crawl.

  • Crawler Threads: How many independent threads will be used to crawl your website. More threads means more resources needed but quicker execution time.

  • Ignore URLs: URLs which shouldn’t be ignored by the crawler. Regex patterns are allowed if you want to exclude all domains (e.g. to ignore all URLs from drive.google.com, you need to add .*drive.google.com.* value)

image-20240220-150956.png

Start website crawling (2) - Germain UX

Schedule Crawling of a Website

This is a more advanced option and it allows to configure:

  • a website crawler on a schedule

  • provide more advanced settings to the crawler (e.g. customer headers, authentication settings, connection settings and more)

image-20240220-150300.png

Start website crawling (3)- Germain UX

Tip(s)

Crawler does not work?

A website may refuse to be crawled unless it is performed from a known browser. e.g. Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36

image-20240826-203903.png

If you are having any difficulty, please contact us.

Service: Automation

Feature Availability: 2024.1

JavaScript errors detected

Please note, these errors can depend on your browser setup.

If this problem persists, please contact our support.