Metadata-Version: 2.1
Name: bhfutils
Version: 0.0.91
Summary: Utilities used by every spider in the Behoof project
Home-page: https://behoof.app/
Author: Teplygin Vladimir
Author-email: vvteplygin@gmail.com
License: MIT
Keywords: behoof,scrapy-cluster,utilities
Description-Content-Type: text/x-rst
Requires-Dist: python-json-logger (==0.1.8)
Requires-Dist: redis (>=4.0.2)
Requires-Dist: kazoo (>=2.8.0)
Requires-Dist: mock (>=4.0.3)
Requires-Dist: playwright (>=1.17.2)
Requires-Dist: testfixtures (>=6.18.3)
Requires-Dist: ujson (>=4.3.0)
Requires-Dist: future (>=0.18.2)
Provides-Extra: all
Requires-Dist: python-json-logger (==0.1.8) ; extra == 'all'
Requires-Dist: redis (>=4.0.2) ; extra == 'all'
Requires-Dist: kazoo (>=2.8.0) ; extra == 'all'
Requires-Dist: mock (>=4.0.3) ; extra == 'all'
Requires-Dist: playwright (>=1.17.2) ; extra == 'all'
Requires-Dist: testfixtures (>=6.18.3) ; extra == 'all'
Requires-Dist: ujson (>=4.3.0) ; extra == 'all'
Requires-Dist: future (>=0.18.2) ; extra == 'all'
Provides-Extra: docs
Requires-Dist: sphinx ; extra == 'docs'
Requires-Dist: mock (>=2.0.0) ; extra == 'docs'
Requires-Dist: testfixtures (>=4.13.5) ; extra == 'docs'
Provides-Extra: lint
Requires-Dist: pep8 ; extra == 'lint'
Requires-Dist: pyflakes ; extra == 'lint'
Provides-Extra: test
Requires-Dist: mock (>=2.0.0) ; extra == 'test'
Requires-Dist: testfixtures (>=4.13.5) ; extra == 'test'

******************************
Behoof Scrapy Cluster Template
******************************

Overview
--------

The ``bhfutils`` package is a collection of utilities used by every spider in the Behoof project.

Requirements
------------

- Unix-based machine (Linux or OS X)
- Python 2.7 or 3.6

Installation
------------

Inside a virtualenv, run ``pip install -U bhfutils``. This installs the latest version of the Behoof Scrapy Cluster spider utilities. After that, you can use a ``settings.py`` compatible with Scrapy Cluster (a template is provided in ``crawler/setting_template.py``).
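A ``settings.py`` built from the template might wire the ``bhfutils`` components into Scrapy like this. This is a hypothetical excerpt: the exact dotted paths and setting names below are assumptions for illustration, not taken from ``setting_template.py`` itself.

```python
# Hypothetical settings.py fragment; class paths and setting names are
# illustrative assumptions, not the verbatim contents of the template.
SCHEDULER = "bhfutils.crawler.distributed_scheduler.DistributedScheduler"
DUPEFILTER_CLASS = "bhfutils.crawler.redis_dupefilter.RFPDupeFilter"

# Redis connection shared by the scheduler and the filters.
REDIS_HOST = "localhost"
REDIS_PORT = 6379

# Hard cap per spiderid+domain+crawlid combination (see the filter modules
# described below).
GLOBAL_PAGE_PER_DOMAIN_LIMIT = 1000
```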

Documentation
-------------

Full documentation for the ``bhfutils`` package does not yet exist. The sections below briefly describe each module.

custom_cookies.py
==================

The ``custom_cookies`` module is a custom cookies middleware that passes the required cookies along with each request but does not persist cookies between calls.
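The idea can be sketched with a minimal in-memory middleware: a fixed set of required cookies is attached to every outgoing request, and any cookies the server tries to set are dropped instead of being stored. The class and field names here are illustrative, not the ``bhfutils`` API.

```python
# Minimal sketch of a non-persistent cookies middleware. Requests and
# responses are modeled as plain dicts; names are illustrative only.
REQUIRED_COOKIES = {"currency": "USD", "locale": "en"}

class NonPersistentCookiesMiddleware:
    def process_request(self, request):
        # Always start from the required cookies only; anything a previous
        # response may have set is deliberately not carried over.
        request["cookies"] = dict(REQUIRED_COOKIES)
        return request

    def process_response(self, response):
        # Drop Set-Cookie data instead of storing it in a cookie jar.
        response.pop("set_cookie", None)
        return response

mw = NonPersistentCookiesMiddleware()
req = mw.process_request({"url": "https://example.com"})
print(req["cookies"])  # {'currency': 'USD', 'locale': 'en'}
```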

distributed_scheduler.py
========================

The ``distributed_scheduler`` module is a Scrapy request scheduler that uses Redis-backed throttled priority queues to moderate scrape requests to different domains across a distributed Scrapy cluster.
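The core mechanism, a priority queue whose pops are rate-limited per domain, can be sketched in plain Python. The real scheduler keeps these queues in Redis so every spider in the cluster shares the same throttle; this in-memory version only illustrates the behavior.

```python
import heapq
import time

# In-memory sketch of one domain's throttled priority queue; a cluster
# deployment would store the queue and throttle state in Redis instead.
class ThrottledPriorityQueue:
    def __init__(self, min_interval_s):
        self._heap = []              # (negated priority, seq, request)
        self._seq = 0                # tie-breaker keeps FIFO order
        self._min_interval = min_interval_s
        self._last_pop = float("-inf")

    def push(self, request, priority):
        heapq.heappush(self._heap, (-priority, self._seq, request))
        self._seq += 1

    def pop(self):
        # Return None while the domain is throttled or the queue is empty.
        now = time.monotonic()
        if not self._heap or now - self._last_pop < self._min_interval:
            return None
        self._last_pop = now
        return heapq.heappop(self._heap)[2]

q = ThrottledPriorityQueue(min_interval_s=0.0)
q.push("http://example.com/low", priority=1)
q.push("http://example.com/high", priority=9)
print(q.pop())  # http://example.com/high
```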

redis_domain_max_page_filter.py
===============================

The ``redis_domain_max_page_filter`` module is a Redis-based max-page filter applied per domain: it bounds the maximum number of pages crawled for any particular domain.
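A per-domain bound amounts to a counter keyed by domain that rejects requests once the limit is reached. The sketch below keeps the counters in a local dict; ``bhfutils`` keeps them in Redis so the bound holds across the whole cluster.

```python
from collections import defaultdict
from urllib.parse import urlparse

# In-memory sketch of a per-domain max-page filter; names are illustrative.
class DomainMaxPageFilter:
    def __init__(self, max_pages):
        self.max_pages = max_pages
        self._counts = defaultdict(int)  # domain -> pages allowed so far

    def request_allowed(self, url):
        domain = urlparse(url).netloc
        if self._counts[domain] >= self.max_pages:
            return False
        self._counts[domain] += 1
        return True

f = DomainMaxPageFilter(max_pages=2)
print([f.request_allowed(u) for u in [
    "http://a.com/1", "http://a.com/2", "http://a.com/3", "http://b.com/1",
]])  # [True, True, False, True]
```

Note that the third ``a.com`` request is rejected while ``b.com`` still passes: each domain has its own counter.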

redis_dupefilter.py
===================

The ``redis_dupefilter`` module is a Redis-based request deduplication filter.
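Request deduplication typically hashes each request into a fingerprint and checks it against a shared set. The sketch below uses a local Python set; the real module would store fingerprints in a Redis set so duplicates are detected cluster-wide. The fingerprinting scheme shown is an assumption for illustration.

```python
import hashlib

# Sketch of fingerprint-based request deduplication with a local set
# standing in for a shared Redis set.
class DupeFilter:
    def __init__(self):
        self._seen = set()

    def request_seen(self, method, url):
        # Hash method + URL into a stable fingerprint (illustrative scheme).
        fp = hashlib.sha1(f"{method} {url}".encode()).hexdigest()
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False

df = DupeFilter()
print(df.request_seen("GET", "http://example.com/"))  # False (first time)
print(df.request_seen("GET", "http://example.com/"))  # True (duplicate)
```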

redis_global_page_per_domain_filter.py
======================================

The ``redis_global_page_per_domain_filter`` module is a Redis-based request-count filter. When this filter is enabled, every crawl job treats ``GLOBAL_PAGE_PER_DOMAIN_LIMIT`` as a hard limit on the maximum number of pages it may crawl for each individual spiderid+domain+crawlid combination.
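The spiderid+domain+crawlid combination maps naturally to one counter per composite key, each capped at the limit. A minimal in-memory sketch (the real filter keeps these counters in Redis, and the function name here is hypothetical):

```python
from collections import defaultdict

# Hard cap per spiderid+domain+crawlid combination, as described above.
GLOBAL_PAGE_PER_DOMAIN_LIMIT = 2

counts = defaultdict(int)  # "spiderid:domain:crawlid" -> pages crawled

def allowed(spiderid, domain, crawlid):
    key = f"{spiderid}:{domain}:{crawlid}"
    if counts[key] >= GLOBAL_PAGE_PER_DOMAIN_LIMIT:
        return False
    counts[key] += 1
    return True

results = [allowed("link", "a.com", "job1") for _ in range(3)]
print(results)  # [True, True, False]
```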
