Changes from all commits
25 commits
3a62f5e
Use f-strings instead of .format()
rivermont Nov 1, 2021
59e124d
Remove stray parenthesis.
rivermont Apr 28, 2022
15d4e8c
Remove obsolete configs.
rivermont Apr 28, 2022
a242c7c
Adding argparse stuff
lkotlus Aug 1, 2024
584a6ac
Basic outline of out of scope options
lkotlus Aug 1, 2024
043a834
Add out of scope functionality and adjust the restricted domain logic
lkotlus Aug 1, 2024
cb0e33e
Fix my wording on out of scope stuff
lkotlus Aug 1, 2024
69e4255
Fix syntax error (I am a programming genius)
lkotlus Aug 1, 2024
6155f7b
Fix some of my logic
lkotlus Aug 1, 2024
251230d
If the argument is used, don't go looking for user input
lkotlus Aug 1, 2024
0b109c7
Check if OUT_OF_SCOPE was set
lkotlus Aug 1, 2024
da4d2c7
Scratch that previous commit...
lkotlus Aug 1, 2024
91080bc
Optimize by preventing multiple checks of the same URL
lkotlus Aug 1, 2024
2ebe062
Fix some globals and whatnot
lkotlus Aug 2, 2024
305ee30
Update config files, add selenium (import only, no code yet) to crawl…
lkotlus Aug 2, 2024
2ccf5b1
Fix imports
lkotlus Aug 2, 2024
6b9f1b8
This should work
lkotlus Aug 2, 2024
476ccc0
Fix interceptor function
lkotlus Aug 2, 2024
b05e006
Bug fixes and testing
lkotlus Aug 2, 2024
cb4f856
Fix requirements
lkotlus Aug 2, 2024
e417d34
Update docs and fix comments
lkotlus Aug 2, 2024
1456694
Contributors
lkotlus Aug 2, 2024
62b4668
Remove unnecessary print
lkotlus Aug 5, 2024
1547563
KNOWN_ERROR_COUNT referenced before assignment fixed.
lkotlus Aug 5, 2024
b37fd41
Add maximum time
lkotlus Aug 8, 2024
6 changes: 4 additions & 2 deletions README.md
@@ -13,8 +13,8 @@ Pretty simple!
![All Platforms!](https://img.shields.io/badge/Windows,%20OS/X,%20Linux-%20%20-brightgreen.svg)
![Open Source Love](https://badges.frapsoft.com/os/v1/open-source.png?v=103)
<br>
![Lines of Code: 1553](https://img.shields.io/badge/lines%20of%20code-1553-brightgreen.svg)
![Lines of Docs: 605](https://img.shields.io/badge/lines%20of%20docs-605-orange.svg)
![Lines of Code: 1811](https://img.shields.io/badge/lines%20of%20code-1811-brightgreen.svg)
![Lines of Docs: 619](https://img.shields.io/badge/lines%20of%20docs-619-orange.svg)
[![Last Commit](https://img.shields.io/github/last-commit/rivermont/spidy.svg)](https://github.com/rivermont/spidy/graphs/punch-card)
[![Travis CI Status](https://img.shields.io/travis/com/rivermont/spidy)](https://travis-ci.com/github/rivermont/spidy)
[![PyPI Wheel](https://img.shields.io/pypi/wheel/spidy-web-crawler.svg)](https://pypi.org/project/spidy-web-crawler/)
@@ -101,6 +101,7 @@ Here are some features we figure are worth noting.
- Cross-Platform compatibility: spidy will work on all three major operating systems, Windows, Mac OS/X, and Linux!
- Frequent Timestamp Logging: Spidy logs almost every action it takes to both the console and one of two log files.
- Browser Spoofing: Make requests using User Agents from 4 popular web browsers, use a custom spidy bot one, or create your own!
- Headless Browser Support: Render full webpages in a headless browser to capture dynamically loaded content.
- Portability: Move spidy's folder and its contents somewhere else and it will run right where it left off. *Note*: This only works if you run it from source code.
- User-Friendly Logs: Both the console and log file messages are simple and easy to interpret, but packed with information.
- Webpage saving: Spidy downloads each page that it runs into, regardless of file type. The crawler uses the HTTP `Content-Type` header returned with most files to determine the file type.
@@ -225,6 +226,7 @@ See the [`CONTRIBUTING.md`](https://github.com/rivermont/spidy/blob/master/spidy
* [quatroka](https://github.com/quatroka) - Fixed testing bugs.
* [stevelle](https://github.com/stevelle) - Respect robots.txt.
* [thatguywiththatname](https://github.com/thatguywiththatname) - README link corrections.
* [lkotlus](https://github.com/lkotlus) - Optimizations, out-of-scope filtering, and headless browser support.

# License
We used the [Gnu General Public License](https://www.gnu.org/licenses/gpl-3.0.en.html) (see [`LICENSE`](https://github.com/rivermont/spidy/blob/master/LICENSE)) as it was the license that best suited our needs.<br>
3 changes: 3 additions & 0 deletions requirements.txt
@@ -2,3 +2,6 @@ requests
lxml
flake8
reppy
selenium
selenium-wire
blinker==1.7.0
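
The three new dependencies above support the headless-browser mode toggled by the new `USE_BROWSER` option: selenium drives the browser, selenium-wire exposes the underlying HTTP traffic, and blinker is pinned to 1.7.0, presumably for selenium-wire compatibility. A minimal sketch of what a headless fetch might look like; the function name and flow are illustrative assumptions, not spidy's actual crawler code:

```python
# Illustrative sketch only (not spidy's actual code): fetch a page with
# headless Chrome through selenium-wire, then read the rendered HTML and
# the Content-Type of the main response.
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver  # selenium-wire wraps selenium's webdriver


def render_page(url):
    options = Options()
    options.add_argument("--headless")  # run the browser without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        html = driver.page_source  # DOM after JavaScript has executed
        content_type = None
        # selenium-wire records the HTTP traffic the browser generated,
        # so response headers remain available for file-type detection.
        for request in driver.requests:
            if request.response and request.url == url:
                content_type = request.response.headers.get("Content-Type")
                break
        return html, content_type
    finally:
        driver.quit()
```

When `USE_BROWSER` is False, pages are presumably still fetched with plain `requests`, as before.
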
9 changes: 9 additions & 0 deletions spidy/config/blank.cfg
@@ -28,6 +28,9 @@ RESTRICT = <True/False>
# The domain within which to restrict crawling.
DOMAIN = ''

# Domains, subdomains, and paths that are out of scope for the crawl
OUT_OF_SCOPE = ['', '']

# Whether to respect sites' robots.txt or not
RESPECT_ROBOTS = <True/False>

@@ -48,11 +51,17 @@ HEADER = HEADERS['<Header>']
# Or if you want to use custom headers:
HEADER = {'<Header Name>': '<Value>', '<Header2>': '<Value2>'}

# Whether to render pages with a headless browser (more thorough, but slower)
USE_BROWSER = <True/False>

# Amount of errors allowed to happen before automatic shutdown.
MAX_NEW_ERRORS = <Int>
MAX_KNOWN_ERRORS = <Int>
MAX_HTTP_ERRORS = <Int>
MAX_NEW_MIMES = <Int>

# Maximum time (in seconds) the crawl is allowed to run (set to float('inf') to run indefinitely)
MAX_TIME = <Int>

# Pages to start crawling on in case TODO is empty at start.
START = ['', '']
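
blank.cfg now documents three related knobs: `RESTRICT` and `DOMAIN` confine the crawl to one domain, while the new `OUT_OF_SCOPE` list carves exceptions out of it. A minimal sketch of how such a check might combine them, assuming simple substring matching; the function name and matching rule are assumptions, not spidy's implementation:

```python
# Hypothetical helper; spidy's real logic lives in crawler.py and may differ.
def in_scope(url, restrict, domain, out_of_scope):
    # URLs containing any out-of-scope fragment are skipped outright.
    if any(fragment in url for fragment in out_of_scope):
        return False
    # With RESTRICT enabled, only URLs containing DOMAIN are kept.
    if restrict and domain not in url:
        return False
    return True
```
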
3 changes: 3 additions & 0 deletions spidy/config/default.cfg
@@ -7,14 +7,17 @@ ZIP_FILES = True
OVERRIDE_SIZE = False
RESTRICT = False
DOMAIN = ''
OUT_OF_SCOPE = []
RESPECT_ROBOTS = True
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
USE_BROWSER = False
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 10
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
MAX_TIME = float('inf')
START = ['https://en.wikipedia.org/wiki/Main_Page']
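
default.cfg leaves the new `MAX_TIME` limit disabled by setting it to `float('inf')`. A rough sketch of how a wall-clock limit like this could be enforced around a crawl loop; variable names and loop structure are hypothetical, not spidy's code:

```python
import time

MAX_TIME = 600  # seconds; float('inf') disables the limit, as in default.cfg
START = ['https://en.wikipedia.org/wiki/Main_Page']

start_time = time.time()
todo = list(START)

while todo:
    # Stop once the elapsed wall-clock time exceeds MAX_TIME.
    if time.time() - start_time > MAX_TIME:
        print('Maximum crawl time reached; saving progress and shutting down.')
        break
    url = todo.pop(0)
    # ... fetch url, save the page, and append newly found links to todo ...
```
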
3 changes: 3 additions & 0 deletions spidy/config/docker.cfg
@@ -7,14 +7,17 @@ ZIP_FILES = True
OVERRIDE_SIZE = False
RESTRICT = False
DOMAIN = ''
OUT_OF_SCOPE = []
RESPECT_ROBOTS = True
TODO_FILE = '/data/crawler_todo.txt'
DONE_FILE = '/data/crawler_done.txt'
WORD_FILE = '/data/crawler_words.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
USE_BROWSER = False
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 10
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
MAX_TIME = float('inf')
START = ['https://en.wikipedia.org/wiki/Main_Page']
3 changes: 3 additions & 0 deletions spidy/config/heavy.cfg
@@ -7,14 +7,17 @@ ZIP_FILES = True
OVERRIDE_SIZE = True
RESTRICT = False
DOMAIN = ''
OUT_OF_SCOPE = []
RESPECT_ROBOTS = False
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
USE_BROWSER = True
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 10
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
MAX_TIME = float('inf')
START = ['https://en.wikipedia.org/wiki/Main_Page']
3 changes: 3 additions & 0 deletions spidy/config/infinite.cfg
@@ -7,14 +7,17 @@ ZIP_FILES = True
OVERRIDE_SIZE = False
RESTRICT = False
DOMAIN = ''
OUT_OF_SCOPE = []
RESPECT_ROBOTS = True
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
SAVE_COUNT = 250
HEADER = HEADERS['spidy']
USE_BROWSER = False
MAX_NEW_ERRORS = 1000000
MAX_KNOWN_ERRORS = 1000000
MAX_HTTP_ERRORS = 1000000
MAX_NEW_MIMES = 1000000
MAX_TIME = float('inf')
START = ['https://en.wikipedia.org/wiki/Main_Page']
3 changes: 3 additions & 0 deletions spidy/config/light.cfg
@@ -7,14 +7,17 @@ OVERRIDE_SIZE = False
SAVE_WORDS = False
RESTRICT = False
DOMAIN = ''
OUT_OF_SCOPE = []
RESPECT_ROBOTS = True
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
SAVE_COUNT = 150
HEADER = HEADERS['spidy']
USE_BROWSER = False
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 10
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
MAX_TIME = 600
START = ['https://en.wikipedia.org/wiki/Main_Page']
3 changes: 3 additions & 0 deletions spidy/config/multithreaded.cfg
@@ -7,14 +7,17 @@ ZIP_FILES = True
OVERRIDE_SIZE = False
RESTRICT = False
DOMAIN = ''
OUT_OF_SCOPE = []
RESPECT_ROBOTS = False
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
USE_BROWSER = False
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 10
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
MAX_TIME = float('inf')
START = ['https://en.wikipedia.org/wiki/Main_Page']
21 changes: 0 additions & 21 deletions spidy/config/rivermont-infinite.cfg

This file was deleted.

20 changes: 0 additions & 20 deletions spidy/config/rivermont.cfg

This file was deleted.

5 changes: 5 additions & 0 deletions spidy/config/wsj.cfg
@@ -12,14 +12,19 @@ RESTRICT = True
# The domain within which to restrict crawling.
DOMAIN = 'wsj.com/'

# Specific pages and subdomains that are excluded from the crawl
OUT_OF_SCOPE = ['wsj.com/business/airlines', 'africa.wsj.com']

RESPECT_ROBOTS = True
TODO_FILE = 'wsj_todo.txt'
DONE_FILE = 'wsj_done.txt'
WORD_FILE = 'wsj_words.txt'
SAVE_COUNT = 60
HEADER = HEADERS['spidy']
USE_BROWSER = False
MAX_NEW_ERRORS = 100
MAX_KNOWN_ERRORS = 100
MAX_HTTP_ERRORS = 100
MAX_NEW_MIMES = 5
MAX_TIME = float('inf')
START = ['https://www.wsj.com/']
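
With these values, and assuming the substring-style matching sketched after blank.cfg above, the wsj.cfg scope rules would behave roughly as follows; the example URLs are illustrative:

```python
# Illustrative only; assumes substring matching against OUT_OF_SCOPE entries.
out_of_scope = ['wsj.com/business/airlines', 'africa.wsj.com']

for url in ['https://www.wsj.com/politics',
            'https://www.wsj.com/business/airlines/delta',
            'https://africa.wsj.com/news']:
    skipped = any(fragment in url for fragment in out_of_scope)
    print(url, '-> skipped' if skipped else '-> crawled')
```
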