naravid19/isranews-scraper

Isranews Scraper

A robust, asynchronous web scraper for Isranews.org with a modern GUI and CLI support.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

isranews-scraper is a web scraping tool built specifically for isranews.org. It extracts news articles from one or more categories and uses asynchronous operations so that several pages can be fetched concurrently.

Key features include:

  • Asynchronous Scraping: Built with asyncio and playwright so that multiple pages are scraped concurrently.
  • Dual Interface: Offers both a Command Line Interface (CLI) for automation and a Graphical User Interface (GUI) for ease of use.
  • Multi-Format Export: Save data in CSV, Excel, JSON, or TXT formats.
  • Smart Filtering: Filter news by date and automatically merge new data with existing files.
  • Robust Error Handling: Handles network issues and encoding errors gracefully.
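
To illustrate the asynchronous scraping idea, here is a minimal sketch of bounded concurrency with asyncio. It is not taken from the project's source: `fetch_page` is a hypothetical stand-in for a real Playwright page load, and `scrape_pages` and `max_concurrent` are illustrative names. The limit of 5 mirrors the documented `--max-threads` default.

```python
import asyncio
from typing import List


async def fetch_page(page_number: int) -> str:
    # Hypothetical stand-in: a real scraper would load the page in a browser here.
    await asyncio.sleep(0.01)
    return f"page-{page_number}"


async def scrape_pages(start: int, end: int, max_concurrent: int = 5) -> List[str]:
    # The semaphore caps how many fetches are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(page_number: int) -> str:
        async with semaphore:
            return await fetch_page(page_number)

    # gather() returns results in the order the tasks were passed in.
    tasks = (bounded_fetch(n) for n in range(start, end + 1))
    return list(await asyncio.gather(*tasks))
```

Calling `asyncio.run(scrape_pages(1, 5))` runs all five fetches through the same semaphore, so raising or lowering `max_concurrent` trades speed against load on the target site.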

Built With

This project is built on the following Python libraries:

  • Python
  • Playwright
  • Pandas
  • BeautifulSoup
  • PyQt6

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

  • Python 3.8 or higher
  • pip

Installation

  1. Clone the repo and enter the project directory
    git clone https://github.com/naravid19/isranews-scraper.git
    cd isranews-scraper
  2. Install Python packages
    pip install -r requirements.txt
  3. Install Playwright browsers
    python -m playwright install
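
After the steps above, a quick sanity check can confirm that the main dependencies are importable. This is a small sketch, not part of the repository: the module names are assumed from the "Built With" list (`bs4` being the import name of BeautifulSoup), and requirements.txt remains the authoritative list. `check_modules` is a hypothetical helper.

```python
import importlib.util
from typing import Dict, Iterable

# Assumed import names; requirements.txt is the authoritative source.
REQUIRED_MODULES = ("playwright", "pandas", "bs4", "PyQt6")


def check_modules(names: Iterable[str] = REQUIRED_MODULES) -> Dict[str, bool]:
    # find_spec() returns None when a top-level module cannot be located.
    return {name: importlib.util.find_spec(name) is not None for name in names}


for module_name, found in check_modules().items():
    print(f"{module_name}: {'ok' if found else 'MISSING'}")
```

Any module reported as MISSING suggests that step 2 did not complete in the active Python environment.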

Usage

Graphical User Interface (GUI)

For a user-friendly experience, run the GUI application:

python isranews_scraper_gui.py

  1. Select Categories: Choose one or more news categories from the list.
  2. Set Range: Define the start and end pages to scrape.
  3. Filter: Optionally set a date to filter news items.
  4. Export: Choose your desired output format and filename.
  5. Start: Click the "Start Scraping" button.
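
The filter and export steps above, together with the automatic merging listed under key features, can be pictured with the following sketch. It uses only the standard library and is not the project's implementation: the field names `url`, `date`, and `title` are hypothetical, duplicates are assumed to be identified by URL, and the date filter is assumed to keep articles published on or after the chosen date.

```python
import json
from datetime import date
from typing import Dict, List

Article = Dict[str, str]  # hypothetical record shape: url, date (YYYY-MM-DD), title


def filter_since(articles: List[Article], cutoff: date) -> List[Article]:
    # Assumption: keep articles published on or after the cutoff date.
    return [a for a in articles if date.fromisoformat(a["date"]) >= cutoff]


def merge_articles(existing: List[Article], new: List[Article]) -> List[Article]:
    # Assumption: the URL identifies an article; a newly scraped copy replaces the old one.
    merged = {a["url"]: a for a in existing}
    for article in new:
        merged[article["url"]] = article
    return list(merged.values())


def to_json(articles: List[Article]) -> str:
    # ensure_ascii=False keeps Thai text readable in the output.
    return json.dumps(articles, ensure_ascii=False, indent=2)
```

Keying the merge on a dictionary keeps previously saved articles in their original order and appends only the ones that were not seen before.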

Command Line Interface (CLI)

For automation or server environments, use the CLI:

python isranews_scraper.py -c "ศูนย์ข่าวสืบสวน" -s 1 -e 5 -o investigative_news

Arguments:

  • -c, --categories: Category names or indices, comma-separated. Use "all" to scrape every category.
  • -s, --start: Start page number (default: 1).
  • -e, --end: End page number (0 for all).
  • -o, --output: Output filename (without extension).
  • -f, --format: Output format (csv, excel, json, txt).
  • -d, --date: Filter date (YYYY-MM-DD).
  • --max-threads: Maximum concurrent pages (default: 5).
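
For readers who want to see how these flags fit together, the sketch below rebuilds the interface with argparse from the list above. It is a reconstruction for illustration, not the script's actual parser. Only the defaults for `--start` (1) and `--max-threads` (5) come from the documentation; the defaults for `--categories`, `--end`, `--output`, and `--format` are assumptions, and `build_parser` and `split_categories` are hypothetical names.

```python
import argparse
from datetime import date
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="isranews_scraper.py")
    # Assumed default: scrape everything when no category is given.
    parser.add_argument("-c", "--categories", default="all",
                        help='category names or indices, comma-separated, or "all"')
    parser.add_argument("-s", "--start", type=int, default=1, help="start page number")
    # Assumed default: 0, which the documentation describes as "all pages".
    parser.add_argument("-e", "--end", type=int, default=0, help="end page number (0 for all)")
    # Assumed default filename.
    parser.add_argument("-o", "--output", default="output", help="output filename without extension")
    # Assumed default format.
    parser.add_argument("-f", "--format", choices=["csv", "excel", "json", "txt"], default="csv")
    # argparse reports a usage error if the value is not a valid YYYY-MM-DD date.
    parser.add_argument("-d", "--date", type=date.fromisoformat, default=None,
                        help="filter date (YYYY-MM-DD)")
    parser.add_argument("--max-threads", type=int, default=5, help="maximum concurrent pages")
    return parser


def split_categories(value: str) -> List[str]:
    # Turn "a, b" into ["a", "b"], dropping empty entries.
    return [part.strip() for part in value.split(",") if part.strip()]
```

Parsing the example command's arguments with `build_parser().parse_args([...])` yields `start=1`, `end=5`, and `output="investigative_news"`, with `max_threads` falling back to 5.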

Roadmap

  Completed:
  • Migrated to Asynchronous Architecture (asyncio + playwright)
  • Modern Dark-Themed GUI (PyQt6)
  • Multi-format Export Support
  • Automatic Data Merging

  Planned:
  • Add support for downloading article images/attachments
  • Implement scheduled scraping (Cron/Task Scheduler integration)
  • REST API for remote triggering

See the open issues for a full list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Naravid - @naravid19

Project Link: https://github.com/naravid19/isranews-scraper

Acknowledgments
