PDF Extractor 2.0

💫 Extract PDF Document Contents including Metadata, Images, Pages, Tables, Attachments, etc.


Welcome to PDF Extractor

🍂 About PDF Format

Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.[2][3] Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images, and other information needed to display it. PDF has its roots in "The Camelot Project", initiated by Adobe co-founder John Warnock in 1991.[4] PDF was first standardized as ISO 32000 in 2008.[5] The latest edition, ISO 32000-2:2020, was published in December 2020.

🍂 About This Actor

💫 Extract contents from PDF documents

Features:

  • ⭐ Extract PDF pages as text or images (SVG, PNG, JPEG).
  • ⭐ Extract PDF metadata.
  • ⭐ Extract the PDF table of contents.
  • ⭐ Extract PDF tables.
  • ⭐ Extract from encrypted (password-protected) PDFs.
  • ⭐ Extract embedded images.
  • ⭐ Extract attachments.
  • ⭐ Process multiple PDF URLs in a single run.

🍂 Tutorial

Input Parameters

Name         Type                  Description
url          Array [String]        List of PDF document URLs
content      String                Output page format (text, svg, png, jpg)
images       Boolean (true/false)  Extract embedded images
attachments  Boolean (true/false)  Extract embedded files
tables       Boolean (true/false)  Extract tables

Note: All extracted resources other than text are saved to the default key-value store.
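As a rough illustration of the parameters above, the following sketch assembles an input object in Python. The helper name build_input and its default values are assumptions made for this example, not part of the actor; only the field names and the allowed content values come from the table.

```python
import json

# Allowed page output formats, taken from the "content" row of the table above.
ALLOWED_CONTENT = {"text", "svg", "png", "jpg"}


def build_input(urls, content="text", images=False, attachments=False, tables=False):
    """Build an input object using the documented parameter names.

    Hypothetical helper: the defaults are assumptions for illustration,
    not the actor's own defaults.
    """
    if isinstance(urls, str):
        urls = [urls]  # "url" is documented as an array, so wrap a single URL
    if not urls:
        raise ValueError("at least one PDF URL is required")
    if content not in ALLOWED_CONTENT:
        raise ValueError(f"content must be one of {sorted(ALLOWED_CONTENT)}")
    return {
        "url": list(urls),
        "content": content,
        "images": bool(images),
        "attachments": bool(attachments),
        "tables": bool(tables),
    }


run_input = build_input(
    "https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf",
    tables=True,
)
print(json.dumps(run_input, indent=2))
```

The resulting dictionary can be pasted into the actor's JSON input editor, or passed as run_input when starting the actor through the Apify API client.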

Dataset Output Format:

[
  # URL-1: Metadata
  { "metadata": { "headers": { ... }, "url": "...", "mime": "..." } },
  # URL-1: Page Contents
  { "index": 0, "content": "...page-0 contents...", "images": [...], "tables": [...] },
  { "index": 1, "content": "...page-1 contents...", "images": [...], "tables": [...] },
  ...
  # URL-2: Metadata
  { "metadata": { "headers": { ... }, "url": "...", "mime": "..." } },
  # URL-2: Page Contents
  { "index": 0, "content": "...page-0 contents...", "images": [...], "tables": [...] },
  { "index": 1, "content": "...page-1 contents...", "images": [...], "tables": [...] },
  ...
]
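Because the dataset is a flat list in which each metadata record is followed by that document's page records, it can help to regroup the items per document. The sketch below assumes exactly that ordering; the function name group_by_document and the sample items (including the example.com URLs) are made up for illustration.

```python
def group_by_document(items):
    """Regroup a flat item list into one entry per PDF document.

    Assumes the ordering shown above: a record with a "metadata" key
    opens a new document, and the page records that follow belong to it.
    """
    documents = []
    for item in items:
        if "metadata" in item:
            # A metadata record marks the start of the next document.
            documents.append({"metadata": item["metadata"], "pages": []})
        elif documents:
            # Any other record is a page of the most recent document.
            documents[-1]["pages"].append(item)
        else:
            raise ValueError("page record found before any metadata record")
    return documents


# Hypothetical sample data shaped like the output format above.
sample_items = [
    {"metadata": {"headers": {}, "url": "https://example.com/a.pdf", "mime": "application/pdf"}},
    {"index": 0, "content": "first page of a", "images": [], "tables": []},
    {"index": 1, "content": "second page of a", "images": [], "tables": []},
    {"metadata": {"headers": {}, "url": "https://example.com/b.pdf", "mime": "application/pdf"}},
    {"index": 0, "content": "first page of b", "images": [], "tables": []},
]

docs = group_by_document(sample_items)
for doc in docs:
    print(doc["metadata"]["url"], len(doc["pages"]))
```

In practice the items would come from the run's dataset (for example, downloaded as JSON) rather than from a hard-coded list.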

🍂 Output Samples

PDF Sample #1

URL: https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf

{
}

PDF Sample #2

URL: https://apify.com/img/web-scraping/beginners-guide-to-web-scraping.pdf

{
}

✏️ Support

⚡️ Feel free to reach out to the developer for any issues or suggestions for improvement.

Frequently Asked Questions

Is it legal to scrape job listings or public data?

Yes, if you're scraping publicly available data for personal or internal use. Always review the website's Terms of Service before large-scale use or redistribution.

Do I need to code to use this scraper?

No. This is a no-code tool: just enter one or more PDF URLs, choose the output options, and run the actor directly from your dashboard or the Apify actor page.

What data does it extract?

It extracts page contents (as text or as SVG, PNG, or JPEG images), metadata, the table of contents, tables, embedded images, and attachments. You can export the dataset to Excel or JSON.

Can I process multiple PDF files in one run?

Yes. The url input accepts a list of PDF document URLs, and each document's metadata and pages are written to the dataset in order.

How do I get started?

You can use the Try Now button on this page to go to the actor. You’ll be guided to enter one or more PDF URLs and get structured results. No setup needed!