Start or Schedule Website Crawling
There are two ways to start crawling a website:
Crawl Now
Schedule Crawling
Go to Germain Workspace > Left Menu > Analytics > Website Crawler
Crawl Now
This is the quickest way to run a website crawler and get its results once it completes. The configuration settings for this option are limited but sufficient for most use cases.
Configuration Settings
URL: Starting URL for the crawler.
Stay On Domain
True: Visit only URLs on the same domain or its subdomains (e.g. google.com and drive.google.com are on the same domain)
False: Visit every URL
Store Successfully Visited URLs
True: Store all Website URL Availability facts (both available and unavailable URLs)
False: Store only failed Website URL Availability facts (unavailable URLs only)
HTTP Failure Status Code: Any visited URL that returns an HTTP status code greater than or equal to this value is considered unavailable.
Maximum Crawling Depth: How many levels of links the crawler may follow from the starting URL. A null value means there is no cap on crawling depth.
Maximum URLs To Crawl: Caps how many URLs the crawler can visit. A null value means there is no cap on the number of URLs to crawl.
Crawler Threads: Number of independent threads used to crawl your website. More threads require more resources but shorten execution time.
Ignore URLs: URLs that the crawler should ignore. Regex patterns are allowed if you want to exclude an entire domain (e.g. to ignore all URLs from drive.google.com, add the value .*drive.google.com.*)
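The settings above can be read as a set of filters applied while the crawler walks links. The following Python sketch illustrates that reading only; it is not Germain's implementation. All function names (same_domain, is_ignored, is_unavailable, plan_crawl) are hypothetical, and several behaviors are assumptions: the domain check compares against the starting URL's host, Ignore URLs patterns must match the whole URL, and the starting URL counts as depth 0. Links come from an in-memory map so the example runs without network access.

```python
import re
from collections import deque
from urllib.parse import urlsplit


def same_domain(start_url, url):
    """Assumed 'Stay On Domain' check: same host as the start URL, or a subdomain of it."""
    start_host = urlsplit(start_url).hostname or ""
    host = urlsplit(url).hostname or ""
    return host == start_host or host.endswith("." + start_host)


def is_ignored(url, ignore_patterns):
    """Assumed 'Ignore URLs' check: any pattern matches the whole URL."""
    return any(re.fullmatch(pattern, url) for pattern in ignore_patterns)


def is_unavailable(status_code, failure_status_code):
    """'HTTP Failure Status Code': at or above the threshold counts as unavailable."""
    return status_code >= failure_status_code


def plan_crawl(start_url, links, stay_on_domain=True, ignore_patterns=(),
               max_depth=None, max_urls=None):
    """Breadth-first walk over a link map; None means no cap (like a null setting)."""
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth); the start URL is depth 0
    while queue:
        if max_urls is not None and len(visited) >= max_urls:
            break  # 'Maximum URLs To Crawl' reached
        url, depth = queue.popleft()
        visited.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # 'Maximum Crawling Depth' reached: do not follow links further
        for next_url in links.get(url, []):
            if next_url in seen or is_ignored(next_url, ignore_patterns):
                continue
            if stay_on_domain and not same_domain(start_url, next_url):
                continue
            seen.add(next_url)
            queue.append((next_url, depth + 1))
    return visited
```

For example, calling plan_crawl with ignore_patterns=[".*drive.google.com.*"] skips every link whose URL contains that host. Note that an unescaped dot in a regex matches any character, so a stricter pattern would escape the dots (.*drive\.google\.com.*).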
Schedule Crawling of a Website
This is a more advanced option. It lets you:
run a website crawler on a schedule
provide more advanced settings to the crawler (e.g. custom headers, authentication settings, connection settings and more)
Tip(s)
Crawler does not work?
A website may refuse to be crawled unless the request appears to come from a known browser, e.g. one that sends a User-Agent such as Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
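To check outside of Germain whether a site responds differently to a browser-like client, one option is to send a request that carries the browser string above as its User-Agent header. The sketch below uses only the Python standard library; build_request is a hypothetical helper name, and the code builds the request without sending it.

```python
import urllib.request

# Browser-like User-Agent string taken from the tip above.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Build (but do not send) a GET request with a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

To send it, pass the result to urllib.request.urlopen, for example urllib.request.urlopen(build_request(url), timeout=10). If the site responds to this request but not to the crawler, a custom User-Agent header in the scheduled crawler's advanced settings may help.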
If you are having any difficulty, please contact us.
Service: Automation
Feature Availability: 2024.1