Start or Schedule Website Crawling

There are two ways to start crawling a website:

Crawl Now
Schedule Crawling

Go to Germain Workspace > Left Menu > Analytics > Website Crawler

Crawl Now

This is the quickest way to run a website crawler and get its results once it completes. The configuration settings in this option are limited but sufficient for most use cases.

Configuration Settings

URL: Starting URL for the crawler.
Stay On Domain
- True: Visit only URLs on the same domain or its subdomains (e.g. google.com and drive.google.com are on the same domain)
- False: Visit every URL
Store Successfully Visited URLs
- True: Store all Website URL Availability facts (for both available and unavailable URLs)
- False: Store only failed Website URL Availability facts (unavailable URLs only)
HTTP Failure Status Code: Any visited URL that returns an HTTP status code greater than or equal to this value is considered unavailable.
Maximum Crawling Depth: How many levels of links the crawler may follow from the starting URL. A null value means there is no cap on crawling depth.
Maximum URLs To Crawl: A cap on how many URLs the crawler can visit. A null value means there is no cap on the number of URLs to crawl.
Crawler Threads: How many independent threads are used to crawl your website. More threads require more resources but shorten execution time.
Ignore URLs: URLs that the crawler should ignore. Regex patterns are allowed if you want to exclude an entire domain (e.g. to ignore all URLs from drive.google.com, add the value .*drive.google.com.*)
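
The settings above can be read as a handful of simple checks. The sketch below is only an illustration of that reading, not the crawler's actual implementation: the function names are hypothetical, the domain check assumes the base host is supplied directly, the regex check assumes a pattern must match the whole URL, and the depth check assumes the starting URL is at depth 0.

```python
import re
from typing import Iterable, Optional
from urllib.parse import urlparse


def on_same_domain(base_host: str, url: str) -> bool:
    """Stay On Domain = True: accept base_host itself and any of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    base = base_host.lower()
    return host == base or host.endswith("." + base)


def is_unavailable(status_code: int, failure_status_code: int) -> bool:
    """HTTP Failure Status Code: codes at or above the threshold count as unavailable."""
    return status_code >= failure_status_code


def is_ignored(url: str, ignore_patterns: Iterable[str]) -> bool:
    """Ignore URLs: assumes (hypothetically) that a pattern must match the whole URL."""
    return any(re.fullmatch(pattern, url) for pattern in ignore_patterns)


def within_limits(depth: int, visited: int,
                  max_depth: Optional[int], max_urls: Optional[int]) -> bool:
    """Maximum Crawling Depth / Maximum URLs To Crawl: None (null) means no cap."""
    depth_ok = max_depth is None or depth <= max_depth
    count_ok = max_urls is None or visited < max_urls
    return depth_ok and count_ok


if __name__ == "__main__":
    patterns = [r".*drive.google.com.*"]
    print(on_same_domain("google.com", "https://drive.google.com/file"))
    print(is_unavailable(404, 400))
    print(is_ignored("https://drive.google.com/file", patterns))
    print(within_limits(depth=3, visited=120, max_depth=None, max_urls=500))
```

Note that an unescaped dot in a regex matches any character, so .*drive.google.com.* is slightly broader than the literal host name; writing the dots as \. makes the pattern stricter.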

Schedule Crawling of a Website

This is a more advanced option. It allows you to:

run a website crawler on a schedule
provide more advanced settings to the crawler (e.g. custom headers, authentication settings, connection settings and more)

Tip(s)

No Message Available

This message likely indicates that the Germain UX Services are not deployed or not running. Please refer to this guide on how to deploy Germain UX Services.

Crawler does not work?

A website may refuse to be crawled unless the requests appear to come from a known browser, i.e. carry a browser User-Agent header, e.g. Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
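
To check outside the product whether a site treats browser and non-browser clients differently, a request can be built with the browser User-Agent above. This is a minimal sketch using Python's standard library; the function name build_request and the URL https://example.com/ are placeholders, and it does not show how the header is configured in Germain itself (custom headers are among the advanced settings of the scheduled option).

```python
import urllib.request

# Browser User-Agent string taken from the tip above.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
)


def build_request(url: str, user_agent: str = BROWSER_USER_AGENT) -> urllib.request.Request:
    """Build (but do not send) a request that identifies itself as a browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    request = build_request("https://example.com/")  # placeholder URL
    # Sending it would look like: urllib.request.urlopen(request, timeout=10)
    print(request.get_header("User-agent"))
```

If the site responds normally to this request but rejects one without a browser User-Agent, the same header value is a reasonable candidate for the crawler's custom headers.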

If you are having any difficulty, please contact us.

Service: Automation

Feature Availability: 2024.1