Automatically scrape metadata such as the title, description, heading, and article content from websites. The Actor crawls the start URLs, navigates through pagination automatically, and scrapes the metadata from the detail pages.
Be sure to use JSON mode for the input and not Manual mode. Here's an overview of the input parameters:
- `startUrls`: An array of objects, each containing:
  - `url`: The starting URL for the scrape
  - `scrapeUrlGlobs`: An array of URL patterns for detail pages to scrape
  - `paginationUrlGlobs`: An array of URL patterns for pagination pages (optional)
- `maxRequestsPerCrawl`: Maximum number of requests per crawl (default: 100)
- `urlsToIgnore`: An array of URLs to ignore when processing (optional)

Here's an example of the input data structure:
```json
{
  "startUrls": [
    {
      "url": "https://roger-hannah.co.uk/property-search/?search_properties=1&tenure=&property_type%5B%5D=Development&property_type%5B%5D=Industrial&size_min=0&size_max=1000000",
      "scrapeUrlGlobs": ["https://roger-hannah.co.uk/properties/*"],
      "paginationUrlGlobs": []
    }
  ],
  "maxRequestsPerCrawl": 100,
  "urlsToIgnore": [
    "https://roger-hannah.co.uk/properties/development-site-with-potential-for-10-houses-planning-permission/",
    "https://roger-hannah.co.uk/properties/lower-mill-mill-street/"
  ]
}
```
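If you prefer to start runs programmatically rather than from the Apify Console, a minimal sketch with the Apify JavaScript client could look like the following. The token, Actor ID, and URLs below are placeholders, not values tied to this Actor.

```typescript
import { ApifyClient } from 'apify-client';

// Placeholder credentials and Actor ID: substitute your own values.
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const input = {
    startUrls: [
        {
            url: 'https://example.com/listings',                   // hypothetical start URL
            scrapeUrlGlobs: ['https://example.com/listings/*'],     // detail-page pattern
            paginationUrlGlobs: ['https://example.com/listings/page/*'],
        },
    ],
    maxRequestsPerCrawl: 100,
    urlsToIgnore: [],
};

// Start the Actor run, wait for it to finish, then read the default dataset.
const run = await client.actor('YOUR_USERNAME/YOUR_ACTOR_NAME').call(input);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} pages`);
```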
Glob patterns are used to match URLs. They are similar to regular expressions but simpler to write, and in this Actor they define which detail pages and pagination pages are matched.
Here are some common glob patterns used in URL matching:
- `*`: Matches any number of characters (except `/`)
  - Example: `https://example.com/*.html` matches all HTML files in the root directory
- `**`: Matches any number of characters (including `/`)
  - Example: `https://example.com/**/*.jpg` matches all JPG files in any subdirectory
- `?`: Matches exactly one character
  - Example: `https://example.com/page?.html` matches page1.html, pageA.html, etc.
- `[...]`: Matches any one character in the brackets
  - Example: `https://example.com/file[123].txt` matches file1.txt, file2.txt, and file3.txt
- `[!...]`: Matches any one character not in the brackets
  - Example: `https://example.com/img[!0-9].png` matches imgA.png but not img1.png
- `{...}`: Matches any of the comma-separated patterns
  - Example: `https://example.com/{blog,news}/*.html` matches both blog and news HTML files

Examples in the context of web scraping:
- `https://example.com/products/*.html`: Matches all product detail pages
- `https://example.com/category/*/page-*.html`: Matches pagination pages in all categories
- `https://example.com/{2021,2022,2023}/**`: Matches all pages from specific years
- `https://example.com/page/*`: Matches all pages directly under `/page/`
- `https://example.com/page/**`: Matches all pages under `/page/`, including subdirectories

When using glob patterns in the `scrapeUrlGlobs` and `paginationUrlGlobs` configuration, make sure they accurately represent the structure of the website you're scraping so that all relevant pages are captured.
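If you want to sanity-check a glob before starting a crawl, you can test it locally with a standard glob matcher. The sketch below assumes the widely used `minimatch` npm package; the Actor's own matcher may differ in edge cases such as query strings, and the candidate URLs are illustrative.

```typescript
import { minimatch } from 'minimatch';

// Detail-page glob taken from the input example above.
const detailGlob = 'https://roger-hannah.co.uk/properties/*';

const candidates = [
    'https://roger-hannah.co.uk/properties/bolton-street/',
    'https://roger-hannah.co.uk/property-search/?search_properties=1',
];

for (const url of candidates) {
    // Prints true for URLs the glob treats as detail pages, false otherwise.
    console.log(minimatch(url, detailGlob), url);
}
```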
The Actor outputs the following data for each scraped detail page:
- `url`: The URL of the scraped page
- `title`: The title of the detail page
- `description`: The description of the detail page
- `heading`: The main heading of the detail page
- `article`: The content of the detail page

Here's an example of the output data structure:
```json
{
  "url": "https://roger-hannah.co.uk/properties/bolton-street/",
  "title": "Bolton Street - Roger Hannah",
  "description": "Property Information The property comprises of a detached former warehouse/showroom facility constructed by way of a steel portal frame with concrete render under a pitched tiled roof. Access to the property is via personnel entrance doors fronting Bolton Street with rear loading access off Millett Street via two electrically operated roller shutter loading doors. There is a small private yard/parking/loading area to the rear of the premises. Internally, the facility provided flexible ground fl...",
  "heading": "Bolton Street",
  "article": "Property Information The property comprises of a detached former warehouse/showroom facility constructed by way of a steel portal frame with concrete render under a pitched tiled roof. Access to the property is via personnel entrance doors fronting Bolton Street with rear loading access off Millett Street via two electrically operated roller shutter loading doors. There is a small private yard/parking/loading area to the rear of the premises. Internally, the facility provided flexible ground fl..."
}
```
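For downstream processing, the output items map naturally onto a small TypeScript type. This is a hedged sketch based on the field list above; the interface name and helper function are illustrative, not part of the Actor.

```typescript
// Shape of one dataset item, based on the fields documented above.
interface ScrapedPage {
    url: string;
    title: string;
    description: string;
    heading: string;
    article: string;
}

// Illustrative post-processing helper: keep only pages whose article mentions a keyword.
function filterByKeyword(items: ScrapedPage[], keyword: string): ScrapedPage[] {
    const needle = keyword.toLowerCase();
    return items.filter((item) => item.article.toLowerCase().includes(needle));
}
```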
Yes, if you're scraping publicly available data for personal or internal use. Always review the website's Terms of Service before large-scale use or redistribution.
No. This is a no-code tool: just enter your start URLs and URL patterns, then run the scraper directly from your dashboard or the Apify Actor page.
It extracts the URL, title, description, heading, and article content of each scraped page. You can export all of it to Excel or JSON.
Yes, you can scrape multiple pages and refine the crawl with detail-page globs, pagination globs, URLs to ignore, and other input settings.
You can use the Try Now button on this page to go to the scraper. You'll be guided to enter your start URLs and get structured results. No setup needed!