Crawls and downloads web pages running on a list of provided naked domains, e.g. "example.com". The actor stores an HTML snapshot, screenshot, text body, and HTTP response headers for every page it visits. It also extracts email addresses, phone numbers, and social handles for Facebook, Twitter, LinkedIn, and Instagram.
The actor performs a crawl of web pages from a provided list of naked domain names using headless Chrome. For each web page visited, the crawler extracts and saves the following information:
- Screenshot of the web page (only if the `saveScreenshot` setting is `true`). The screenshots are stored in JPEG format to save disk space. Note that taking screenshots is quite resource intensive and will slow your crawler down.
- HTML snapshot of the web page (only if the `saveHtml` setting is `true`)
- Text body of the web page (only if the `saveText` setting is `true`)

For each domain (e.g. `example.com`) from the input, the actor tries to load the following pages:
- `http://example.com`
- `https://example.com` (only if the `crawlHttpsVersion` setting is `true`)
- `http://www.example.com` (only if the `crawlWwwSubdomain` setting is `true`)
- `https://www.example.com` (only if both the `crawlHttpsVersion` and `crawlWwwSubdomain` settings are `true`)

Additionally, if the `crawlLinkCount` setting is greater than zero, for each domain the crawler tries to open `crawlLinkCount` pages linked from the main page and analyze them too. The crawler prefers links that contain the `/contact` text, to increase the chance of finding more emails, phone numbers, and other social handles. An example of running the actor with these settings is sketched below.
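For illustration, here is a minimal sketch of running the actor with these settings via the official `apify-client` package for Node.js. The actor ID `username/analyze-domains` and the `domains` input field name are assumptions for this sketch; check the actor's input schema for the exact names.

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// 'username/analyze-domains' is a placeholder actor ID, and the `domains`
// field name is an assumption; verify both against the actor's input schema.
const run = await client.actor('username/analyze-domains').call({
    domains: ['example.com', 'another-example.org'],
    crawlHttpsVersion: true,
    crawlWwwSubdomain: true,
    crawlLinkCount: 5,
    saveScreenshot: false,
    saveHtml: true,
    saveText: true,
});

console.log(`Run finished with status: ${run.status}`);
```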
The results of the crawler are stored in a dataset, with one record for each web page crawled. The optional screenshots and HTML snapshots of web pages are stored as separate records in the key-value store.
For example, for the web page `http://example.com` the resulting record in the dataset will look as follows (in JSON format):
```json
{
  "domain": "example.com",
  "url": "http://example.com",
  "response": {
    "url": "http://example.com/",
    "status": 200,
    "remoteAddress": {
      "ip": "93.184.216.34",
      "port": 80
    },
    "headers": {
      "content-encoding": "gzip",
      "cache-control": "max-age=604800",
      "content-type": "text/html; charset=UTF-8",
      "date": "Sat, 24 Nov 2018 22:04:40 GMT",
      "etag": "\"1541025663+gzip\"",
      "expires": "Sat, 01 Dec 2018 22:04:40 GMT",
      "last-modified": "Fri, 09 Aug 2013 23:54:35 GMT",
      "server": "ECS (dca/24D5)",
      "vary": "Accept-Encoding",
      "x-cache": "HIT",
      "content-length": "606"
    },
    "securityDetails": null
  },
  "page": {
    "title": "Example Domain",
    "linkUrls": [
      "http://www.iana.org/domains/example"
    ],
    "linkedDataObjects": []
  },
  "social": {
    "emails": [],
    "phones": [],
    "phonesUncertain": [],
    "linkedIns": [],
    "twitters": [],
    "instagrams": [],
    "facebooks": []
  },
  "screenshot": {
    "url": "https://api.apify.com/v2/key-value-stores/<actor_run_id>/records/screenshot-example.com-00.jpg",
    "length": 18572
  },
  "html": {
    "url": "https://api.apify.com/v2/key-value-stores/<actor_run_id>/records/content-example.com-00.html",
    "length": 1262
  },
  "text": " EXAMPLE DOMAIN\nThis domain is established to be used for illustrative examples in documents.\nYou may use this domain in examples without prior coordination or asking for\npermission.\n\nMore information..."
}
```
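The stored results can be read back through the Apify API, for instance with the `apify-client` package. A minimal sketch, assuming the record shape shown above (`<dataset_id>` is a placeholder for the run's default dataset ID):

```typescript
import { ApifyClient } from 'apify-client';

// Partial shape of a dataset record, per the example above.
interface DomainRecord {
    domain: string;
    url: string;
    social?: { emails: string[]; phones: string[] };
}

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// '<dataset_id>' is a placeholder for the run's default dataset ID
// (available as run.defaultDatasetId on the run object).
const { items } = await client.dataset('<dataset_id>').listItems();

for (const record of items as unknown as DomainRecord[]) {
    console.log(record.domain, record.social?.emails ?? []);
}
```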
If the web page cannot be loaded for any reason, the record contains information about the error:
```json
{
  "domain": "non-existent-domain.net",
  "url": "http://non-existent-domain.net",
  "errorMessage": "Error: net::ERR_NAME_NOT_RESOLVED at http://non-existent-domain.net\n at navigate (/Users/jan/Projects/actor-analyze-domains/node_modules/puppeteer/lib/FrameManager.js:103:37)\n at <anonymous>\n at process._tickCallback (internal/process/next_tick.js:189:7)\n -- ASYNC --\n at Frame.<anonymous> (/Users/jan/Projects/actor-analyse-domains/node_modules/puppeteer/lib/helper.js:144:27)\n at Page.goto (/Users/jan/Projects/actor-analyse-domains/node_modules/puppeteer/lib/Page.js:587:49)\n at Page.<anonymous> (/Users/jan/Projects/actor-analyse-domains/node_modules/puppeteer/lib/helper.js:145:23)\n at PuppeteerCrawler.gotoFunction (/Users/jan/Projects/actor-analyse-domains/node_modules/apify/build/puppeteer_crawler.js:30:53)\n at PuppeteerCrawler._handleRequestFunction (/Users/jan/Projects/actor-analyse-domains/node_modules/apify/build/puppeteer_crawler.js:322:48)\n at <anonymous>\n at process._tickCallback (internal/process/next_tick.js:189:7)"
}
```
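Because failed loads land in the same dataset as successful ones, a consumer can separate the two by the presence of the `errorMessage` field. A small sketch, assuming the two record shapes shown above:

```typescript
// Minimal record shape: only the fields needed to tell outcomes apart.
interface CrawlRecord {
    domain: string;
    errorMessage?: string;
}

// Split dataset records into successful and failed loads by the
// presence of the errorMessage field.
function splitByOutcome(records: CrawlRecord[]) {
    const failed = records.filter((r) => r.errorMessage !== undefined);
    const succeeded = records.filter((r) => r.errorMessage === undefined);
    return { succeeded, failed };
}
```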