Ultimate Articles Extractor

Ultimate Articles Extractor: Advanced Scraping & Content Extraction Tool

A modular web scraping tool for extracting content from webpages, articles, and news sites. It returns clean, structured data using several extraction engines, browser-like header generation to avoid anti-bot detection, and proxy support.

Overview

Ultimate Articles Extractor uses seven specialized extraction engines to pull the main content out of a webpage. It is designed for data scientists, researchers, journalists, and developers who need to analyze web content at scale.

Perfect for:

  • Content aggregation
  • News monitoring
  • Research data collection
  • SEO analysis
  • Topic modeling and NLP projects
  • Web archiving
  • Market intelligence

Key Features

  • 7 Specialized Extraction Engines: Choose from Newspaper4k, Trafilatura, Boilerpy3, News-Please, Goose3, Article Parser, and JusText
  • Universal Website Compatibility: Works with any website regardless of structure or layout
  • Complete Content Extraction: Captures title, description, full text, authors, publication date, images, and metadata
  • Smart Fallback System: Automatically tries alternative extraction methods if the primary one fails
  • Advanced Header Generation: Generates realistic browser headers (browser and device fingerprints) to reduce the chance of being blocked by anti-bot measures
  • Proxy Support: Integrates with residential proxies to prevent IP blocking
  • Domain-Specific Rate Limiting: Automatically manages request rates per domain to avoid detection
  • Customizable Output: Save article HTML, full page HTML, plaintext, or structured JSON
  • Parallel Processing: Process multiple URLs concurrently with optimized resource usage
  • State Persistence: Handles interruptions gracefully by saving progress
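
The Smart Fallback System listed above can be pictured roughly as follows. This is a minimal sketch, not the actor's actual implementation: the `Extractor` signature, the `extract_with_fallback` function, and the `min_chars` threshold are all hypothetical names introduced for illustration.

```python
from typing import Callable, Optional

# Hypothetical extractor signature (an assumption, not the actor's API):
# takes raw HTML and returns a dict with at least a "text" key, or None.
Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(
    html: str,
    extractors: dict[str, Extractor],
    primary: str,
    min_chars: int = 200,  # assumed threshold for "useful" output
) -> Optional[dict]:
    """Try the primary engine first, then the remaining engines in order."""
    order = [primary] + [name for name in extractors if name != primary]
    for name in order:
        try:
            result = extractors[name](html)
        except Exception:
            continue  # any engine error counts as a failed attempt
        if result and len(result.get("text") or "") >= min_chars:
            # Record which engine actually produced the result.
            return {**result, "extractorEngine": name}
    return None  # every engine failed or returned too little text
```

In this sketch the first engine that returns enough text wins, and the `extractorEngine` field records which engine that was, mirroring the field shown in the example outputs.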

Extraction Methods Compared

| Extractor | Best For | Key Strengths | Output Fields |
|---|---|---|---|
| Newspaper4k | General news articles | NLP capabilities, metadata extraction | Title, text, authors, publish date, keywords, summary |
| Trafilatura | News & blog content | Optimized for news, metadata support | Title, text, author, date, language, categories, tags |
| Boilerpy3 | Simple article extraction | Fast, efficient text extraction | Title, text, text density metrics |
| News-Please | Comprehensive extraction | Rich metadata, fallback capabilities | Title, text, authors, publish date, language, images |
| Goose3 | Article content & images | Image extraction, metadata support | Title, text, authors, images, keywords |
| Article Parser | HTML & markdown output | Multiple output formats | Title, HTML content, markdown content |
| JusText | Boilerplate removal | Focuses on main content | Text, paragraphs count, language |
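
One way to use the comparison above is to pick an engine by the fields you need. The sketch below simply transcribes the "Output Fields" column into a lookup. The field labels are an assumption for illustration (Trafilatura's "author" and "date" are folded into "authors" and "publish date"), and `engines_providing` is a hypothetical helper, not part of the tool.

```python
# Output fields per engine, transcribed from the comparison table.
# Labels are normalized for illustration only.
ENGINE_FIELDS: dict[str, set[str]] = {
    "newspaper4k": {"title", "text", "authors", "publish date", "keywords", "summary"},
    "trafilatura": {"title", "text", "authors", "publish date", "language", "categories", "tags"},
    "boilerpy3": {"title", "text", "text density metrics"},
    "news-please": {"title", "text", "authors", "publish date", "language", "images"},
    "goose3": {"title", "text", "authors", "images", "keywords"},
    "article-parser": {"title", "html content", "markdown content"},
    "justext": {"text", "paragraphs count", "language"},
}


def engines_providing(required: set[str]) -> list[str]:
    """Return the engines whose listed output fields cover every required field."""
    return [name for name, fields in ENGINE_FIELDS.items() if required <= fields]


if __name__ == "__main__":
    print(engines_providing({"title", "text", "images"}))
```

For instance, asking for `{"summary"}` leaves only `newspaper4k`, since it is the only engine whose row lists a summary.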

Input Configuration

The application accepts the following input parameters:

{
  "startUrls": [
    "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire"
  ],
  "extractorEngine": "newspaper4k",
  "saveHtml": false,
  "saveArticleHtml": false,
  "useHeaderGenerator": true,
  "headerGeneratorOptions": {
    "browsers": ["chrome", "firefox", "safari", "edge"],
    "devices": ["desktop"]
  },
  "customHeaders": {},
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  },
  "maxRetries": 15
}

Input Parameters Explained

  • startUrls (required): Array of article URLs to extract content from
  • extractorEngine (optional): Choose your preferred extraction library:
    • newspaper4k - Best all-around extractor with NLP capabilities (default)
    • trafilatura - Optimized for news content
    • boilerpy3 - Fast and efficient text extraction
    • news-please - Rich metadata extraction
    • goose3 - Good for extracting images and article content
    • article-parser - Supports multiple output formats
    • justext - Focused on boilerplate removal
  • saveHtml (optional): When true, saves the complete HTML of the webpage
  • saveArticleHtml (optional): When true, saves the extracted article HTML (for supported extractors)
  • useHeaderGenerator (optional): Enables sophisticated header generation to bypass detection
  • headerGeneratorOptions (optional): Configure which browsers and devices to emulate
  • customHeaders (optional): Set custom HTTP headers for requests
  • proxyConfiguration (optional): Configure proxy settings to avoid IP blocking
  • maxRetries (optional): Maximum number of retry attempts for failed requests (default: 15)
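
The parameters above can be assembled and sanity-checked before a run. The following is a minimal sketch under stated assumptions: `build_run_input` is a hypothetical helper, and the validation rules (non-empty `startUrls`, known engine name, non-negative `maxRetries`) are inferred from the descriptions above rather than taken from the actor's own schema.

```python
from typing import Any, Iterable

# Engine names and optional keys as listed in this document.
ALLOWED_ENGINES = {
    "newspaper4k", "trafilatura", "boilerpy3", "news-please",
    "goose3", "article-parser", "justext",
}
OPTIONAL_KEYS = {
    "saveHtml", "saveArticleHtml", "useHeaderGenerator",
    "headerGeneratorOptions", "customHeaders", "proxyConfiguration",
}


def build_run_input(
    start_urls: Iterable[str],
    extractor_engine: str = "newspaper4k",  # documented default
    max_retries: int = 15,                  # documented default
    **optional: Any,
) -> dict[str, Any]:
    """Build an input dict, rejecting values the documentation does not list."""
    urls = list(start_urls)
    if not urls:
        raise ValueError("startUrls is required and must not be empty")
    if extractor_engine not in ALLOWED_ENGINES:
        raise ValueError(f"unknown extractorEngine: {extractor_engine!r}")
    if max_retries < 0:
        raise ValueError("maxRetries must be non-negative")
    unknown = set(optional) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown input keys: {sorted(unknown)}")
    return {
        "startUrls": urls,
        "extractorEngine": extractor_engine,
        "maxRetries": max_retries,
        **optional,
    }
```

A call such as `build_run_input(["https://example.com/article"], saveHtml=True)` yields a dictionary in the same shape as the JSON input shown earlier, which could then be supplied as the run input when starting the actor.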

Example Outputs by Extractor

Newspaper4k Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "The authorities said there was no immediate indication of foul play in the substation fire...",
  "author": ["Michael Levenson", "Andrew Das"],
  "publishedDate": "2025-03-21T04:09:20",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "keywords": ["airport", "heathrow", "power outage", "london"],
  "summary": "Heathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "extractorEngine": "newspaper4k"
}

Trafilatura Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "text": "Flights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "categories": ["world", "europe"],
  "tags": ["heathrow", "airport", "power outage", "london"],
  "extractorEngine": "trafilatura"
}

Boilerpy3 Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure - The New York Times",
  "text": "SKIP ADVERTISEMENT\nFlights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "textDensity": 0.85,
  "markupToTextRatio": 0.32,
  "extractorUsed": "ArticleExtractor",
  "extractorEngine": "boilerpy3"
}

Goose3 Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "Heathrow Airport in London resumed some flight departures and arrivals late Friday as one of the world's busiest air travel hubs began to rumble back to life...",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "keywords": ["heathrow", "airport", "power outage", "london"],
  "extractorEngine": "goose3"
}

JusText Example

{
  "text": "Flights Resume at Heathrow After Fire Forced Its Closure\nThe cause of a blaze that knocked out power to one of the world's busiest airports was under investigation...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "paragraphsCount": 15,
  "languageUsed": "English",
  "extractorEngine": "justext"
}

Article Parser Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "articleHtml": "<div><p>Heathrow Airport in London resumed some flight departures and arrivals late Friday...</p></div>",
  "text": "# Flights Resume at Heathrow After Fire Forced Its Closure\n\nHeathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "extractorEngine": "article-parser"
}

News-Please Example

{
  "title": "Flights Resume at Heathrow After Fire Forced Its Closure",
  "description": "The cause of a blaze that knocked out power to one of the world's busiest airports was under investigation.",
  "text": "Heathrow Airport in London resumed some flight departures and arrivals late Friday...",
  "author": ["Michael Levenson", "Andrew Das"],
  "publishedDate": "2025-03-21T04:09:20",
  "url": "https://www.nytimes.com/live/2025/03/21/world/heathrow-airport-power-outage-fire",
  "language": "en",
  "image": "https://static01.nyt.com/images/2025/03/21/multimedia/21vid-heathrow-closure-package-cover-zqhj/21vid-heathrow-closure-package-cover-zqhj-superJumbo.jpg",
  "extractorEngine": "news-please"
}
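
Because each engine returns a slightly different set of fields, downstream code may want to map results onto one common shape. The sketch below is one possible approach, assuming only the field names visible in the examples above (`author`, `publishedDate`, `language`, `languageUsed`, `extractorEngine`). The `normalize_record` helper and its output keys are hypothetical.

```python
from datetime import datetime
from typing import Any, Optional


def normalize_record(item: dict[str, Any]) -> dict[str, Any]:
    """Map one extractor result onto a common, engine-independent shape."""
    # publishedDate appears as an ISO 8601 string in the examples; not every
    # engine provides it, so a missing or unparsable value becomes None.
    published: Optional[datetime] = None
    raw_date = item.get("publishedDate")
    if raw_date:
        try:
            published = datetime.fromisoformat(raw_date)
        except ValueError:
            published = None

    # "author" is a list in the examples; tolerate a bare string as well.
    author = item.get("author") or []
    authors = [author] if isinstance(author, str) else list(author)

    return {
        "url": item.get("url"),
        "title": item.get("title"),  # the JusText example has no title
        "text": item.get("text", ""),
        "authors": authors,
        "published": published,
        # The JusText example reports "languageUsed" instead of "language".
        "language": item.get("language") or item.get("languageUsed"),
        "engine": item.get("extractorEngine"),
    }
```

Note that the language value is passed through unchanged, so a JusText result would carry "English" where other engines carry "en". Mapping between the two is left out of this sketch.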

Frequently Asked Questions

Is it legal to scrape articles or public data?

Yes, if you're scraping publicly available data for personal or internal use. Always review the website's Terms of Service before large-scale use or redistribution.

Do I need to code to use this scraper?

No. This is a no-code tool: enter one or more article URLs, choose an extraction engine, and run the scraper directly from your dashboard or the Apify actor page.

What data does it extract?

It extracts the title, description, full text, authors, publication date, images, and metadata, depending on the extraction engine you choose. You can export all of it to Excel or JSON.

Can I scrape multiple pages?

Yes, add as many URLs as you need to startUrls. Multiple URLs are processed concurrently, with request rates managed per domain.

How do I get started?

You can use the Try Now button on this page to go to the scraper. You’ll be guided to enter your article URLs and get structured results. No setup needed!