A robust, asynchronous web scraper for Isranews.org with a modern GUI and CLI support.
isranews-scraper is a high-performance web scraping tool designed specifically for isranews.org. It allows users to extract news articles from various categories efficiently using asynchronous operations.
Key features include:
- Asynchronous Scraping: Built with `asyncio` and `playwright` for maximum speed and concurrency.
- Dual Interface: Offers both a Command Line Interface (CLI) for automation and a Graphical User Interface (GUI) for ease of use.
- Multi-Format Export: Save data in CSV, Excel, JSON, or TXT formats.
- Smart Filtering: Filter news by date and automatically merge new data with existing files.
- Robust Error Handling: Handles network issues and encoding errors gracefully.
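The concurrency model behind the first feature can be sketched in plain `asyncio`. This is a hypothetical illustration, not the project's actual code: the real scraper drives Playwright browser pages, but the pattern of bounding in-flight work with a semaphore (mirroring the `--max-threads` option) is the same.

```python
import asyncio

# Assumption: a limit of 5 mirrors the documented --max-threads default.
MAX_CONCURRENT = 5

async def scrape_page(page_number: int, sem: asyncio.Semaphore) -> str:
    async with sem:  # at most MAX_CONCURRENT pages are scraped at once
        await asyncio.sleep(0)  # stand-in for browser navigation and parsing
        return f"page-{page_number}"

async def scrape_range(start: int, end: int) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [scrape_page(n, sem) for n in range(start, end + 1)]
    # gather() preserves submission order, so results line up with page numbers
    return await asyncio.gather(*tasks)

results = asyncio.run(scrape_range(1, 5))
```

Because every page is an awaitable task rather than a blocking call, slow pages do not stall the rest of the batch.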
This project is built using robust Python libraries to ensure reliability and performance.
To get a local copy up and running, follow these simple steps.
- Python 3.8 or higher
- pip
- Clone the repo: `git clone https://github.com/naravid19/isranews-scraper.git`
- Install Python packages: `pip install -r requirements.txt`
- Install Playwright browsers: `python -m playwright install`
For a user-friendly experience, run the GUI application:
`python isranews_scraper_gui.py`

- Select Categories: Choose one or more news categories from the list.
- Set Range: Define the start and end pages to scrape.
- Filter: Optionally set a date to filter news items.
- Export: Choose your desired output format and filename.
- Start: Click the "Start Scraping" button.
For automation or server environments, use the CLI:
`python isranews_scraper.py -c "ศูนย์ข่าวสืบสวน" -s 1 -e 5 -o investigative_news`

Arguments:
- `-c`, `--categories`: Category name or index (comma-separated). Use "all" for everything.
- `-s`, `--start`: Start page number (default: 1).
- `-e`, `--end`: End page number (0 for all).
- `-o`, `--output`: Output filename (without extension).
- `-f`, `--format`: Output format (`csv`, `excel`, `json`, `txt`).
- `-d`, `--date`: Filter date (YYYY-MM-DD).
- `--max-threads`: Maximum concurrent pages (default: 5).
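The argument surface above maps naturally onto `argparse`. The sketch below is an illustrative approximation of that interface, not the project's actual parser; details such as defaults for `--output` may differ in `isranews_scraper.py`.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the documented CLI options.
    p = argparse.ArgumentParser(prog="isranews_scraper.py")
    p.add_argument("-c", "--categories", default="all",
                   help='Category name or index, comma-separated; "all" for everything')
    p.add_argument("-s", "--start", type=int, default=1, help="Start page number")
    p.add_argument("-e", "--end", type=int, default=0, help="End page (0 for all)")
    p.add_argument("-o", "--output", help="Output filename (without extension)")
    p.add_argument("-f", "--format", choices=["csv", "excel", "json", "txt"],
                   default="csv", help="Output format")
    p.add_argument("-d", "--date", help="Filter date (YYYY-MM-DD)")
    p.add_argument("--max-threads", type=int, default=5,
                   help="Maximum concurrent pages")
    return p

args = build_parser().parse_args(["-c", "all", "-s", "1", "-e", "5", "-o", "news"])
```

Options not supplied on the command line fall back to their defaults, e.g. `args.max_threads` is 5 here.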
- Migrated to Asynchronous Architecture (`asyncio` + `playwright`)
- Modern Dark-Themed GUI (`PyQt6`)
- Multi-format Export Support
- Automatic Data Merging
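"Automatic Data Merging" means newly scraped items are combined with an existing export without creating duplicates. The sketch below shows one plausible approach; keying on the article URL is an assumption, and the project's actual merge logic may use a different field or strategy.

```python
def merge_items(existing: list[dict], new: list[dict], key: str = "url") -> list[dict]:
    """Append only items whose key is not already in the existing data."""
    seen = {item[key] for item in existing}  # URLs already on disk (assumed key)
    merged = list(existing)
    for item in new:
        if item[key] not in seen:
            merged.append(item)
            seen.add(item[key])
    return merged

old = [{"url": "https://isranews.org/a", "title": "A"}]
fresh = [{"url": "https://isranews.org/a", "title": "A"},
         {"url": "https://isranews.org/b", "title": "B"}]
merged = merge_items(old, fresh)  # only the previously unseen article is appended
```

Re-running a scrape over an overlapping page range therefore grows the file only by genuinely new articles.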
- Add support for downloading article images/attachments
- Implement scheduled scraping (Cron/Task Scheduler integration)
- REST API for remote triggering
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Naravid - @naravid19
Project Link: https://github.com/naravid19/isranews-scraper