Metadata-Version: 2.0
Name: aioscrape
Version: 0.0.2
Summary: Async scraping library
Home-page: http://github.com/Suor/aioscrape
Author: Alexander Schepanovski
Author-email: suor.web@gmail.com
License: BSD
Description-Content-Type: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Intended Audience :: Developers
Requires-Dist: aiohttp (>=3.5.4)
Requires-Dist: parsechain (>=0.0.3)
Requires-Dist: funcy (<2.0,>=1.11)
Requires-Dist: aiocontextvars; python_version < "3.7"
Provides-Extra: cache
Requires-Dist: aiocache; extra == 'cache'
Requires-Dist: aiofilecache; extra == 'cache'

AioScrape
=========

An async scraping library built on top of `aiohttp <https://aiohttp.readthedocs.io>`_ and `parsechain <https://github.com/Suor/parsechain>`_. Note that this is **alpha** software.


Installation
------------

::

    pip install aioscrape


Usage
-----

.. code:: python

    from aioscrape import run, fetch, settings
    from aioscrape.middleware import last_fetch, make_filecache
    from aioscrape.utils import SOME_HEADERS # To not look like a bot

    from urllib.parse import urljoin
    from parsechain import C
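    from funcy import lcat, lconcat

    # START_URL, wait_all(), html_to_text() and write_to_csv() are not imported
    # or defined in this snippet; it assumes they are available in scope.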


    def main():
        # Settings are scoped and can be redefined later with another "with"
        cache = make_filecache('.fcache')
        with settings(headers=SOME_HEADERS, middleware=[cache, last_fetch]):
            print(run(scrape_all()))


    async def scrape_all():
        # All settings in scope, such as headers and middleware, apply to fetch()
        start_page = await fetch(START_URL)

        # AioScrape integrates with parsechain to make extracting a breeze
        urls = start_page.css('.pagingLinks a').attrs('href')
        list_urls = [urljoin(start_page.url, page_url) for page_url in urls]

        # Use asyncio.wait() and friends to run the requests concurrently
        list_pages = [start_page] + await wait_all(map(fetch, list_urls))

        # Scrape articles
        result = lcat(await wait_all(map(scrape_articles, list_pages)))
        write_to_csv('export.csv', result)


    async def scrape_articles(list_page):
        urls = list_page.css('#headlines .titleLink').attrs('href')
        abs_urls = [urljoin(list_page.url, url) for url in urls]
        return await wait_all(map(scrape_article, abs_urls))


    async def scrape_article(url):
        resp = await fetch(url)
        return resp.root.multi({
            'url': C.const(resp.url),
            'title': C.microdata('headline').first,
            'date': C.microdata('datePublished').first,
            'text': C.microdata('articleBody').first,
            'contacts': C.css('.sidebars .contact p')
                         .map(C.inner_html + html_to_text) + lconcat + ''.join,
        })


    if __name__ == '__main__':
        main()

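The example above uses ``START_URL``, ``wait_all()``, ``html_to_text()`` and ``write_to_csv()`` without showing where they come from. The block below is a minimal stand-in built only on the standard library, so the snippet can be tried end to end. These are assumptions for illustration, not necessarily what aioscrape ships: the URL is a placeholder, ``wait_all()`` is sketched with ``asyncio.gather()``, and the text and CSV helpers are deliberately naive.

```python
import asyncio
import csv
from html.parser import HTMLParser

# Hypothetical placeholder; point it at the listing page you want to scrape.
START_URL = 'https://example.com/'


async def wait_all(coros):
    """Run awaitables concurrently and return their results in input order."""
    return list(await asyncio.gather(*coros))


class _TextExtractor(HTMLParser):
    """Collect text nodes and ignore tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Strip tags from an HTML fragment and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return ' '.join(''.join(parser.chunks).split())


def write_to_csv(filename, rows):
    """Write a list of dicts to a CSV file, taking the header from the first row."""
    rows = list(rows)
    if not rows:
        return
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

``asyncio.gather()`` keeps results in the order the awaitables were passed, which is why ``[start_page] + await wait_all(...)`` lines up with ``list_urls`` in the example. Note that it puts no limit on concurrency, so a long page list would be fetched all at once.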

TODO
----

- Response.follow()
- non-GET requests
- work with forms


