A Python-based Apify actor that scrapes GitHub repository information using requests and BeautifulSoup. It extracts repository metadata including star counts, fork counts, topics/tags, license information, primary programming language, and last-updated timestamps, using plain HTTP requests and HTML parsing rather than a headless browser.
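Conceptually, the scraper fetches each repository page over HTTPS and parses the returned HTML. The sketch below illustrates that flow under stated assumptions; it is not the actor's actual code, and the CSS selector for the star counter is a guess about GitHub's current markup that may need adjusting if the page layout changes.

```python
import time

import requests
from bs4 import BeautifulSoup


def scrape_repo(url: str) -> dict:
    """Fetch a GitHub repository page and extract basic metadata.

    Illustrative sketch only; selectors are assumptions about GitHub's markup.
    """
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    owner, name = url.rstrip("/").split("/")[-2:]

    # Assumed selector for the star counter; verify against the live page.
    stars = soup.select_one("#repo-stars-counter-star")
    # The meta description usually mirrors the repository description.
    description = soup.select_one('meta[name="description"]')

    return {
        "url": url,
        "name": name,
        "owner": owner,
        "fullName": f"{owner}/{name}",
        "description": description["content"] if description else None,
        "stats": {"stars": stars.text.strip() if stars else None},
    }


if __name__ == "__main__":
    for repo_url in ["https://github.com/microsoft/playwright"]:
        print(scrape_repo(repo_url))
        time.sleep(3)  # sleepBetweenRequests: stay polite to GitHub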
## Files

- `apify_actor.py` - The main actor code for Apify deployment
- `requests_github_scraper.py` - Standalone GitHub scraper (for local testing)
- `INPUT_SCHEMA.json` - Input schema for the Apify actor
- `requirements.txt` - Python dependencies
- `package.json` - Actor metadata for Apify

## Local testing

Install the dependencies and run the standalone scraper:

```bash
pip install -r requirements.txt
python requests_github_scraper.py
```

Results are saved to the `apify_storage` directory.

## Deploying to Apify

Install the Apify CLI and log in:

```bash
npm install -g apify-cli
apify login
```
Initialize your project folder (if you haven't already):

```bash
apify init github-scraper
```
Modify the `Dockerfile` to use Python:

```dockerfile
FROM apify/actor-python:3.9

# Copy source code
COPY . ./

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Define how to run the actor
CMD ["python3", "apify_actor.py"]
```
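For reference, the entry point that the `CMD` line runs typically has a shape like the sketch below. This assumes the Apify Python SDK (the `apify` package) and reuses the hypothetical `scrape_repo()` helper from the sketch above; it illustrates the overall structure, not the actor's exact code.

```python
import asyncio

from apify import Actor


async def main() -> None:
    async with Actor:
        # Read the actor input defined in INPUT_SCHEMA.json.
        actor_input = await Actor.get_input() or {}
        repo_urls = actor_input.get("repoUrls", [])
        delay = actor_input.get("sleepBetweenRequests", 3)

        for url in repo_urls:
            record = scrape_repo(url)  # hypothetical helper, see sketch above
            await Actor.push_data(record)  # write to the default dataset
            await asyncio.sleep(delay)


if __name__ == "__main__":
    asyncio.run(main())
```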
Push your actor to Apify:

```bash
apify push
```
After pushing, your actor will be available in the Apify Console.
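Once deployed, you can also trigger runs programmatically with the `apify-client` package instead of the Console. A minimal sketch, where the token and the actor ID `your-username/github-scraper` are placeholders:

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # placeholder token

# Start the actor and wait for it to finish.
run = client.actor("your-username/github-scraper").call(
    run_input={"repoUrls": ["https://github.com/microsoft/playwright"]}
)

# Iterate over the scraped records in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["fullName"], item["stats"]["stars"])
```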
## Input

- `repoUrls` (required): Array of GitHub repository URLs to scrape
- `sleepBetweenRequests` (optional): Delay between requests in seconds (default: 3)

Example input:

```json
{
  "repoUrls": [
    "https://github.com/microsoft/playwright",
    "https://github.com/facebook/react",
    "https://github.com/tensorflow/tensorflow"
  ],
  "sleepBetweenRequests": 5
}
```
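The `INPUT_SCHEMA.json` backing these options would look roughly like the sketch below; the titles, descriptions, and editor choices are illustrative rather than the actor's actual schema.

```json
{
  "title": "GitHub Repository Scraper",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "repoUrls": {
      "title": "Repository URLs",
      "type": "array",
      "description": "GitHub repository URLs to scrape",
      "editor": "stringList"
    },
    "sleepBetweenRequests": {
      "title": "Sleep between requests (seconds)",
      "type": "integer",
      "description": "Delay between requests in seconds",
      "default": 3
    }
  },
  "required": ["repoUrls"]
}
```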
## Output

The actor provides clean, well-structured data for each GitHub repository in the following format:
1{ 2 "url": "https://github.com/microsoft/playwright", 3 "name": "playwright", 4 "owner": "microsoft", 5 "fullName": "microsoft/playwright", 6 "description": "Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.", 7 "stats": { 8 "stars": "71.2k", 9 "forks": "4k" 10 }, 11 "language": "TypeScript", 12 "topics": [ 13 "electron", 14 "javascript", 15 "testing", 16 "firefox", 17 "chrome", 18 "automation", 19 "web", 20 "test", 21 "chromium", 22 "test-automation", 23 "testing-tools", 24 "webkit", 25 "end-to-end-testing", 26 "e2e-testing", 27 "playwright" 28 ], 29 "lastUpdated": "2025-03-17T17:00:47Z", 30 "license": "Apache-2.0 license" 31}
| Field | Type | Description |
|---|---|---|
| `url` | String | The full URL of the GitHub repository |
| `name` | String | Repository name (without owner) |
| `owner` | String | Username or organization that owns the repository |
| `fullName` | String | Complete repository identifier (owner/name) |
| `stats.stars` | String | Number of stars, as shown on the page (abbreviated, e.g. "71.2k") |
| `stats.forks` | String | Number of forks, as shown on the page (abbreviated, e.g. "4k") |
| `description` | String | Repository description |
| `language` | String | Primary programming language |
| `topics` | Array | List of topics/tags associated with the repository |
| `lastUpdated` | String | ISO 8601 timestamp of the last update |
| `license` | String | Repository license information |
This structured output makes the data easy to filter, sort, and load into spreadsheets, databases, or downstream analysis tools.
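For example, after exporting the dataset as JSON you can rank repositories by stars with a few lines of Python. Note that counts like "71.2k" are human-readable strings and need converting first; the file name `dataset.json` is a placeholder for your export.

```python
import json


def parse_count(value: str) -> float:
    """Convert GitHub's abbreviated counts ('71.2k', '4k') to numbers."""
    value = value.strip().lower()
    if value.endswith("k"):
        return float(value[:-1]) * 1_000
    if value.endswith("m"):
        return float(value[:-1]) * 1_000_000
    return float(value.replace(",", ""))


with open("dataset.json") as f:  # placeholder: your exported dataset
    repos = json.load(f)

repos.sort(key=lambda r: parse_count(r["stats"]["stars"]), reverse=True)
for repo in repos:
    print(f'{repo["fullName"]}: {repo["stats"]["stars"]} stars')
```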
## FAQ

**Is it legal to scrape GitHub?**

Yes, if you're scraping publicly available data for personal or internal use. Always review GitHub's Terms of Service before large-scale use or redistribution.

**Do I need to know how to code?**

No. This is a no-code tool: just enter the repository URLs and run the scraper directly from your dashboard or the Apify actor page.

**What data does it extract?**

It extracts repository names, owners, descriptions, star and fork counts, primary languages, topics, licenses, and last-updated timestamps. You can export all of it to Excel or JSON.

**Can I scrape more than one repository at a time?**

Yes, you can scrape multiple repositories in a single run and control the request pacing through the input settings you use.

**How do I get started?**

You can use the Try Now button on this page to go to the scraper. You'll be guided to input repository URLs and get structured results. No setup needed!