Start or Schedule Website Crawling
There are two ways to start crawling a website:
Crawl Now
Schedule Crawling
Go to Germain Workspace > Left Menu > Analytics > Website Crawler

Start website crawling - Germain UX
Crawl Now
This is the quickest way to run a website crawler and see its results once it completes. This option exposes fewer configuration settings, but they are sufficient for most use cases.
Configuration Settings
URL: Starting URL for the crawler.
Stay On Domain
True: Visit only URLs on the same domain or one of its subdomains (e.g. google.com and drive.google.com are on the same domain)
False: Visit every URL
Store Successfully Visited URLs
True: Store all Website URL Availability facts (for both available and unavailable URLs)
False: Store only failed Website URL Availability facts (unavailable URLs only)
HTTP Failure Status Code: Any visited URL that returns an HTTP status code greater than or equal to this value is considered unavailable.
Maximum Crawling Depth: The maximum depth of links the crawler will follow. A null value means there is no depth limit.
Maximum URLs To Crawl: The maximum number of URLs the crawler will visit. A null value means there is no limit.
Crawler Threads: The number of independent threads used to crawl your website. More threads require more resources but shorten execution time.
Ignore URLs: URLs that the crawler should skip. Regex patterns are allowed, which lets you exclude a whole domain (e.g. to ignore all URLs from drive.google.com, add the value .*drive.google.com.*)
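The following Python sketch illustrates how the Stay On Domain, HTTP Failure Status Code and Ignore URLs settings could be interpreted. It is not Germain UX code. The function names, the default threshold of 400 and the choice to match ignore patterns against the whole URL are assumptions made for illustration only.

```python
import re
from typing import Iterable
from urllib.parse import urlsplit


def on_domain(url: str, domain: str) -> bool:
    """Stay On Domain = True: accept the domain itself and any of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def is_unavailable(status_code: int, failure_status_code: int = 400) -> bool:
    """A URL counts as unavailable when its status is >= the configured threshold.

    The default of 400 is an assumption for this sketch, not a documented default.
    """
    return status_code >= failure_status_code


def is_ignored(url: str, ignore_patterns: Iterable[str]) -> bool:
    """Ignore URLs: skip a URL if any pattern matches the whole URL (assumed semantics)."""
    return any(re.fullmatch(pattern, url) for pattern in ignore_patterns)


if __name__ == "__main__":
    patterns = [".*drive.google.com.*"]
    for candidate in ("https://drive.google.com/file/1", "https://www.google.com/"):
        print(candidate, on_domain(candidate, "google.com"), is_ignored(candidate, patterns))
```

Note that an unescaped dot in a regex matches any character, so .*drive.google.com.* is slightly broader than it looks. Writing .*drive\.google\.com.* restricts the match to literal dots.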

Start website crawling (2) - Germain UX
Schedule Crawling of a Website
This is a more advanced option. It allows you to:
run a website crawler on a schedule
provide more advanced settings to the crawler (e.g. custom headers, authentication settings, connection settings and more)
Tip(s)
No Message Available
This message likely indicates that the Germain UX Services are not deployed or not running. Please refer to this guide on how to deploy Germain UX Services.

No Message Available on Crawler - Germain UX
Crawler does not work?
A website may refuse to be crawled unless the request appears to come from a known browser, i.e. it carries a recognized User-Agent header, e.g. Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
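To check whether a site is rejecting non-browser clients, you can compare the status it returns with and without a browser User-Agent. The sketch below is a standalone diagnostic using Python's standard library, not part of Germain UX. The function names are hypothetical and the target URL is a placeholder you would replace with your own site.

```python
import urllib.error
import urllib.request

# Browser-like User-Agent value taken from the example above.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
)


def build_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_with_user_agent(url: str, user_agent: str = BROWSER_UA, timeout: float = 10.0) -> int:
    """Return the HTTP status code the site sends back for this User-Agent."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still the answer we want.
        return err.code


if __name__ == "__main__":
    target = "https://example.com/"  # placeholder URL
    print("browser UA:", status_with_user_agent(target))
    print("default UA:", status_with_user_agent(target, "Python-urllib"))
```

If the browser User-Agent succeeds where the other one is refused, configuring the same value as a header on a scheduled crawler is a reasonable next step to try.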

If you are having any difficulty, please contact us.
Service: Automation
Feature Availability: 2024.1