Ultimate Articles Extractor

A modular web scraping tool designed to extract content from webpages, articles, and news sites. It returns clean, structured data using optimized extraction algorithms, anti-bot evasion, and proxy support.

web.harvester

$25

Try Now →

Ultimate Articles Extractor: Advanced Scraping & Content Extraction Tool

Overview

Ultimate Articles Extractor combines seven specialized extraction engines to pull meaningful content from web pages. It's designed for data scientists, researchers, journalists, and developers who need to analyze web content at scale.

Perfect for:

Content aggregation
News monitoring
Research data collection
SEO analysis
Topic modeling and NLP projects
Web archiving
Market intelligence

Key Features

7 Specialized Extraction Engines: Choose from Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, and JusText
Universal Website Compatibility: Works with any website regardless of structure or layout
Complete Content Extraction: Captures title, description, full text, authors, publication date, images, and metadata
Smart Fallback System: Automatically tries alternative extraction methods if the primary one fails
Advanced Header Generation: Uses sophisticated browser fingerprinting to bypass anti-bot measures
Proxy Support: Integrates with residential proxies to prevent IP blocking
Domain-Specific Rate Limiting: Automatically manages request rates per domain to avoid detection
Customizable Output: Save article HTML, full page HTML, plaintext, or structured JSON
Parallel Processing: Process multiple URLs concurrently with optimized resource usage
State Persistence: Handles interruptions gracefully by saving progress
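
The domain-specific rate limiting listed above can be pictured with a small sketch. This is only an illustration of the general technique, not the tool's actual implementation: the class name DomainRateLimiter, the min_interval parameter, and the injectable clock and sleep functions are all assumptions made for this example.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits per domain (assumed value)
        self._clock = clock               # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_request = {}           # domain -> time of the most recent request

    def wait(self, url):
        """Block until the URL's domain may be requested again; return the seconds waited."""
        domain = urlparse(url).netloc.lower()
        now = self._clock()
        last = self._last_request.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually sent (after any wait).
        self._last_request[domain] = now + delay
        return delay
```

Calling wait(url) before each request slows down repeated hits to one site while leaving requests to other domains unaffected.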

Extraction Methods Compared

Extractor	Best For	Key Strengths	Output Fields
Newspaper4k	General news articles	NLP capabilities, metadata extraction	Title, text, authors, publish date, keywords, summary
Trafilatura	News & blog content	Optimized for news, metadata support	Title, text, author, date, language, categories, tags
Boilerpy3	Simple article extraction	Fast, efficient text extraction	Title, text, text density metrics
News-Please	Comprehensive extraction	Rich metadata, fallback capabilities	Title, text, authors, publish date, language, images
Goose3	Article content & images	Image extraction, metadata support	Title, text, authors, images, keywords
Article Parser	HTML & markdown output	Multiple output formats	Title, HTML content, markdown content
JusText	Boilerplate removal	Focuses on main content	Text, paragraphs count, language
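
Because the engines differ in strengths, the smart fallback system mentioned under Key Features amounts to trying them in a preferred order until one returns usable text. The sketch below shows that idea under stated assumptions: extract_with_fallback is a hypothetical function, each engine is represented by a plain callable that takes HTML and returns a dict, and "usable" is assumed to mean a non-empty text field. The tool's real fallback order and success criteria are not documented here.

```python
def extract_with_fallback(html, engines, order):
    """Try each engine in `order`; return the first result with non-empty text.

    `engines` maps an engine name to a callable taking HTML and returning a dict.
    """
    errors = {}
    for name in order:
        try:
            result = engines[name](html)
        except Exception as exc:  # an engine may raise on unusual markup
            errors[name] = str(exc)
            continue
        if result and (result.get("text") or "").strip():
            # Tag the record with the engine that actually produced it.
            return {**result, "extractorEngine": name}
        errors[name] = "empty result"
    raise RuntimeError(f"all engines failed: {errors}")


# Stand-in callables for illustration only; they do not call the real libraries.
def failing_engine(html):
    raise ValueError("cannot parse")


def empty_engine(html):
    return {"text": ""}


def simple_engine(html):
    return {"text": html.strip()}


engines = {
    "newspaper4k": failing_engine,
    "trafilatura": empty_engine,
    "justext": simple_engine,
}
result = extract_with_fallback(" Hello ", engines, ["newspaper4k", "trafilatura", "justext"])
```

Here the first stand-in raises, the second returns empty text, and the third succeeds, so the returned record is tagged with the third engine's name.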

Input Configuration

The application accepts the following input parameters:

{
  "startUrls": [
    "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire"
  ],
  "extractorEngine": "newspaper4k",
  "saveHtml": false,
  "saveArticleHtml": false,
  "useHeaderGenerator": true,
  "headerGeneratorOptions": {
    "browsers": ["chrome", "firefox", "safari", "edge"],
    "devices": ["desktop"]
  },
  "customHeaders": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  },
  "maxRetries": 15
}

Input Parameters Explained

startUrls (required): Array of article URLs to extract content from
extractorEngine (optional): Choose your preferred extraction library:
- newspaper4k - Best all-around extractor with NLP capabilities (default)
- trafilatura - Optimized for news content
- boilerpy3 - Fast and efficient text extraction
- news-please - Rich metadata extraction
- goose3 - Good for extracting images and article content
- article-parser - Supports multiple output formats
- justext - Focused on boilerplate removal
saveHtml (optional): When true, saves the complete HTML of the webpage
saveArticleHtml (optional): When true, saves the extracted article HTML (for supported extractors)
useHeaderGenerator (optional): Enables sophisticated header generation to bypass detection
headerGeneratorOptions (optional): Configure which browsers and devices to emulate
customHeaders (optional): Set custom HTTP headers for requests
proxyConfiguration (optional): Configure proxy settings to avoid IP blocking
maxRetries (optional): Maximum number of retry attempts for failed requests (default: 15)
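
If you build the input programmatically, it can help to check it against the rules above before starting a run. The following is a minimal sketch, not part of the tool: normalize_input is a hypothetical helper, and it applies only the two defaults this page documents (newspaper4k and 15 retries). Defaults for the other optional parameters are not stated, so they are passed through untouched.

```python
# Engine names as listed for the extractorEngine parameter.
SUPPORTED_ENGINES = {
    "newspaper4k", "trafilatura", "boilerpy3", "news-please",
    "goose3", "article-parser", "justext",
}

# Only the defaults documented on this page.
DEFAULTS = {"extractorEngine": "newspaper4k", "maxRetries": 15}


def normalize_input(raw):
    """Return a copy of `raw` with documented defaults applied, or raise ValueError."""
    if not raw.get("startUrls"):
        raise ValueError("startUrls is required and must contain at least one URL")
    config = {**DEFAULTS, **raw}
    if config["extractorEngine"] not in SUPPORTED_ENGINES:
        raise ValueError(f"unknown extractorEngine: {config['extractorEngine']!r}")
    if not isinstance(config["maxRetries"], int) or config["maxRetries"] < 0:
        raise ValueError("maxRetries must be a non-negative integer")
    return config


config = normalize_input({"startUrls": ["https://example.com/article"]})
```

A misspelled engine name such as "trafilatura4k" is rejected before any request is made.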

Example Outputs by Extractor

Newspaper4k Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "The authorities said there was no immediate indication of foul play in the substation fire...",
  "author": ["Michael Levenson", "Andrew Das"],
  "publishedDate": "2025-03-21T04:09:20",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "keywords": ["airport", "heathrow", "power outage", "london"],
  "summary": "Heathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "extractorEngine": "newspaper4k"
}

Trafilatura Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "text": "Flights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "categories": ["world", "europe"],
  "tags": ["heathrow", "airport", "power outage", "london"],
  "extractorEngine": "trafilatura"
}

Boilerpy3 Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure - The New York Times",
  "text": "SKIP ADVERTISEMENT\nFlights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "textDensity": 0.85,
  "markupToTextRatio": 0.32,
  "extractorUsed": "ArticleExtractor",
  "extractorEngine": "boilerpy3"
}

Goose3 Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "Heathrow Airport in London resumed some flight departures and arrivals late Friday as one of the world's busiest air travel hubs began to rumble back to life...",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "keywords": ["heathrow", "airport", "power outage", "london"],
  "extractorEngine": "goose3"
}

JusText Example

{
  "text": "Flights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "paragraphsCount": 15,
  "languageUsed": "English",
  "extractorEngine": "justext"
}

Article Parser Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "articleHtml": "<div><p>Heathrow Airport in London resumed some flight departures and arrivals late Friday...</p></div>",
  "text": "# Flights Resume at Heathrow After Fire Forced Its Closure\n\nHeathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "extractorEngine": "article-parser"
}

News-Please Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "Heathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "author": ["Michael Levenson", "Andrew Das"],
  "publishedDate": "2025-03-21T04:09:20",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "extractorEngine": "news-please"
}
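
As the examples show, each engine returns a slightly different set of fields (JusText, for instance, has no title and reports the language under languageUsed). If you mix engines in one dataset, a small post-processing step can map every record onto one shape. The sketch below is one possible approach and is not part of the tool: to_common_record and the choice of COMMON_FIELDS are assumptions made for this example.

```python
# Fields kept in the unified record; the selection is an assumption for this sketch.
COMMON_FIELDS = (
    "title", "text", "url", "language",
    "author", "publishedDate", "image", "extractorEngine",
)


def to_common_record(item):
    """Map one extractor output onto a fixed set of keys, using None for missing fields."""
    record = {field: item.get(field) for field in COMMON_FIELDS}
    # JusText reports the language as "languageUsed"; the value is copied as-is
    # ("English" rather than "en"), with no code conversion attempted.
    if record["language"] is None and "languageUsed" in item:
        record["language"] = item["languageUsed"]
    # Represent "no authors found" as an empty list so callers can always iterate.
    if record["author"] is None:
        record["author"] = []
    return record


justext_item = {
    "text": "Flights Resume at Heathrow After Fire Forced Its Closure...",
    "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
    "paragraphsCount": 15,
    "languageUsed": "English",
    "extractorEngine": "justext",
}
record = to_common_record(justext_item)
```

Engine-specific extras such as paragraphsCount or textDensity are dropped in this sketch; keep them in a separate key if you need them downstream.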

Frequently Asked Questions

Is it legal to scrape articles or public data?

Generally, yes, if you're scraping publicly available data for personal or internal use. Always review the website's Terms of Service before large-scale use or redistribution.

Do I need to code to use this scraper?

No. This is a no-code tool: just enter one or more article URLs and run the scraper directly from your dashboard or Apify actor page.

What data does it extract?

It extracts the title, description, full text, authors, publication date, images, and metadata, depending on the extraction engine you choose. You can export all of it to Excel or JSON.

Can I scrape multiple pages?

Yes. Add every article URL you want to process to startUrls; multiple URLs are processed concurrently.

How do I get started?

You can use the Try Now button on this page to go to the scraper. You'll be guided to enter your article URLs and get structured results. No setup needed!