
Start or Schedule Website Crawling

There are two ways to start crawling a website:

  • Crawl Now

  • Schedule Crawling

Go to Germain Workspace > Left Menu > Analytics > Website Crawler

Start website crawling - Germain UX

Crawl Now

This is the quickest way to run a website crawl and get its results once it completes. The configuration settings available in this option are limited but sufficient for most use cases.

Configuration Settings

  • URL: Starting URL for the crawler.

  • Stay On Domain

    • True: Visit only URLs on the same domain or one of its subdomains (e.g. google.com and drive.google.com count as the same domain)

    • False: Visit every URL

  • Store Successfully Visited URLs

    • True: Store all Website URL Availability facts (for both available and unavailable URLs)

    • False: Store only failed Website URL Availability facts (unavailable URLs only)

  • HTTP Failure Status Code: Any visited URL that returns an HTTP status code greater than or equal to this value is considered unavailable.

  • Maximum Crawling Depth: How many levels of links the crawler may follow from the starting URL. A null value means there is no cap on crawling depth.

  • Maximum URLs To Crawl: The maximum number of URLs the crawler may visit. A null value means there is no cap on the number of URLs crawled.

  • Crawler Threads: The number of independent threads used to crawl your website. More threads require more resources but shorten execution time.

  • Ignore URLs: URLs that should be ignored by the crawler. Regex patterns are allowed if you want to exclude an entire domain (e.g. to ignore all URLs from drive.google.com, add the value .*drive.google.com.*)
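
The settings above can be read as a handful of simple rules. The following Python sketch is only an illustration of those rules, not Germain UX code: the function names, the in-memory link graph, the choice to count the starting URL as depth 0, and the use of full-string regex matching are all assumptions made for the example.

```python
import re
from collections import deque
from typing import Dict, Iterable, List, Optional
from urllib.parse import urlsplit


def on_same_domain(url: str, base_domain: str) -> bool:
    """Stay On Domain = True: accept the base domain and its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    base = base_domain.lower()
    return host == base or host.endswith("." + base)


def is_unavailable(status_code: int, failure_status_code: int) -> bool:
    """HTTP Failure Status Code: codes at or above the threshold count as failures."""
    return status_code >= failure_status_code


def is_ignored(url: str, ignore_patterns: Iterable[str]) -> bool:
    """Ignore URLs: assumed here to be regexes matched against the whole URL."""
    return any(re.fullmatch(pattern, url) for pattern in ignore_patterns)


def crawl_order(
    links: Dict[str, List[str]],
    start: str,
    max_depth: Optional[int] = None,  # None plays the role of the null (no cap) value
    max_urls: Optional[int] = None,   # None plays the role of the null (no cap) value
) -> List[str]:
    """Breadth-first walk over a hypothetical link graph, honoring both caps."""
    visited: List[str] = []
    seen = {start}
    queue = deque([(start, 0)])  # assumption: the starting URL is depth 0
    while queue:
        if max_urls is not None and len(visited) >= max_urls:
            break
        url, depth = queue.popleft()
        visited.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links beyond the depth cap
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited
```

In standard regex syntax an unescaped dot matches any character, so a pattern such as .*drive\.google\.com.* is stricter than the unescaped form if only literal dots should match.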

Start website crawling (2) - Germain UX

Schedule Crawling of a Website

This is a more advanced option. It lets you:

  • run a website crawler on a schedule

  • provide more advanced settings to the crawler (e.g. custom headers, authentication settings, connection settings and more)

Tip(s)

No Message Available

This message likely indicates that the Germain UX Services are not deployed or not running. Please refer to the guide on how to deploy Germain UX Services.

No Message Available on Crawler - Germain UX

Crawler does not work?

A website may refuse to be crawled unless the request appears to come from a known browser, for example by sending a User-Agent such as: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
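
To check outside the crawler whether a site responds differently to a browser-like client, a request can be sent with and without that User-Agent. This is a minimal Python sketch using only the standard library; the helper names and the idea of comparing two status codes are assumptions for illustration, not part of Germain UX.

```python
import urllib.error
import urllib.request
from typing import Optional

BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
)


def build_request(url: str, user_agent: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request, optionally identifying as the given User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_status(url: str, user_agent: Optional[str] = None, timeout: float = 10.0) -> int:
    """Return the HTTP status code the site answers with for this User-Agent."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still the answer.
        return err.code
```

For example, if fetch_status(url) returns 403 while fetch_status(url, BROWSER_UA) returns 200, the site is probably filtering on the User-Agent, and the same header value can be tried in the crawler's settings.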

If you are having any difficulty, please contact us.

Service: Automation

Feature Availability: 2024.1
