Metadata-Version: 2.2
Name: airscrapy
Version: 1.0.1
Summary: Scrapy contrib for Airflow
Author: Fabien Vauchelles
Project-URL: Homepage, https://github.com/scrapoxy/airscrapy
Project-URL: Issues, https://github.com/scrapoxy/airscrapy/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: scrapy

# Scrapy contrib for Airflow

## Installation

```shell
pip install airscrapy
```


## Airflow Operator

This operator runs Scrapy inside the Airflow worker process
by invoking the Scrapy engine directly, so no separate process is needed.


### Example

If the spider is structured as follows:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        yield {
            "text": response.css(".info").get()
        }
```

Here’s how you can create a DAG using the operator:

```python
from airflow import DAG
from airscrapy import ScrapyOperator
from myscrapers.spiders.example import ExampleSpider
import os

with DAG(
    dag_id="scrapers",
    # Add extra settings such as credentials or tokens
    params={
        "extra_settings": {
            "CONCURRENT_REQUESTS": 2,
        }
    },
) as dag:
    # Point Scrapy at the shared settings module
    os.environ["SCRAPY_SETTINGS_MODULE"] = "myscrapers.settings"

    task = ScrapyOperator(spider=ExampleSpider)

if __name__ == "__main__":
    dag.test()
```

The `extra_settings` parameter supplies values at run time,
such as credentials or tokens, and complements the `settings.py` file.
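As a minimal sketch of building that dictionary at run time, the helper below reads a token from the environment and adds it only when present. The helper name `build_extra_settings` and the variable `EXAMPLE_API_TOKEN` are hypothetical; neither airscrapy nor Scrapy defines them.

```python
import os

# Hypothetical name: EXAMPLE_API_TOKEN is not defined by airscrapy or Scrapy.
TOKEN_ENV_VAR = "EXAMPLE_API_TOKEN"


def build_extra_settings(env=None):
    """Return run-specific Scrapy settings, adding a token only if one is set."""
    env = os.environ if env is None else env
    extra = {"CONCURRENT_REQUESTS": 2}
    token = env.get(TOKEN_ENV_VAR)
    if token:
        # Custom setting key; a spider could read it through its settings object.
        extra[TOKEN_ENV_VAR] = token
    return extra
```

The result could then be passed as `params={"extra_settings": build_extra_settings()}` in the DAG, which keeps secrets out of the committed `settings.py`.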

Additionally, ensure you set the `SCRAPY_SETTINGS_MODULE` environment variable. 
Without it, Scrapy won't be able to locate the settings.
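
To catch a missing or mistyped settings module early, a check along these lines may help. It is only a sketch using the standard library; the helper name `settings_module_is_importable` is hypothetical and not part of airscrapy.

```python
import importlib.util
import os


def settings_module_is_importable(env=None):
    """Return True if SCRAPY_SETTINGS_MODULE is set and names an importable module."""
    env = os.environ if env is None else env
    name = env.get("SCRAPY_SETTINGS_MODULE")
    if not name:
        return False
    try:
        # find_spec imports parent packages, so a missing package raises here.
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False
```

Calling it right after setting the variable in the DAG file gives a clear failure point if `myscrapers` is not on the Python path.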

The DAG directory is organized as follows:

```
dags
|- myscrapers
   |- spiders
      |- __init__.py
      |- example.py
   |- __init__.py
   |- items.py
   |- middlewares.py
   |- pipelines.py
   |- settings.py
|- mydag.py
|- scrapy.cfg
```

This structure enables us to run the DAG in local debugging mode:

```shell
python mydag.py
```


## Build and publish

Install dependencies:

```shell
pip install build twine
```

Build the package:

```shell
python -m build --outdir dist
```

Publish to PyPI:

```shell
python -m twine upload dist/*
```
