Metadata-Version: 2.1
Name: arche
Version: 0.3.2
Summary: Analyze Scrapy Cloud data
Home-page: http://github.com/scrapinghub/arche
Maintainer: Scrapinghub, manycoding
Maintainer-email: info@scrapinghub.com
License: MIT
Description: # Arche
        
        [![PyPI](https://img.shields.io/pypi/v/arche.svg)](https://pypi.org/project/arche)
        [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/arche.svg)](https://pypi.org/project/arche)
        ![GitHub](https://img.shields.io/github/license/scrapinghub/arche.svg)
        [![Build Status](https://travis-ci.org/scrapinghub/arche.svg?branch=master)](https://travis-ci.org/scrapinghub/arche)
        [![Codecov](https://img.shields.io/codecov/c/github/scrapinghub/arche.svg)](https://codecov.io/gh/scrapinghub/arche)
        [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)
        [![GitHub commit activity](https://img.shields.io/github/commit-activity/m/scrapinghub/arche.svg)](https://github.com/scrapinghub/arche/commits/master)
        [![Join the chat at https://gitter.im/scrapinghub/arche](https://badges.gitter.im/scrapinghub/arche.svg)](https://gitter.im/scrapinghub/arche?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
        
            pip install arche
        
        Arche (pronounced as "Arkey") helps to verify data using a set of defined rules, for example:
          * Validation with JSON schema
          * Coverage
          * Duplicates
          * Garbage symbols
          * Comparison of two jobs
          
        _We use it at Scrapinghub to ensure the quality of scraped data._
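
        To illustrate what rules like *Coverage* and *Duplicates* compute, here is a minimal pandas sketch on toy data (an illustration only, not Arche's implementation, which also renders results as reports and graphs):

        ```python
        import pandas as pd

        # Toy scraped items; in practice they come from a Scrapy Cloud job
        df = pd.DataFrame(
            {
                "title": ["Blue shirt", "Blue shirt", "Red hat"],
                "category": ["clothes", "clothes", None],
            }
        )

        # Coverage: the share of non-missing values per field
        print(df.notna().mean())  # title 1.0, category ~0.67

        # Duplicates by chosen columns, similar in spirit to
        # arche.rules.duplicates.find_by(df, ["title", "category"])
        dupes = df[df.duplicated(subset=["title", "category"], keep=False)]
        print(len(dupes))  # 2
        ```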
        
        ## Installation
        
        Arche requires a [Jupyter](https://jupyter.org/install) environment and supports both the [JupyterLab](https://github.com/jupyterlab/jupyterlab#installation) and [Notebook](https://github.com/jupyter/notebook) UIs.
        
        For JupyterLab, you will also need to properly install the [plotly extensions](https://github.com/plotly/plotly.py#jupyterlab-support-python-35).
        
        Then just `pip install arche`.
        
        ## Use case
        * You need to check the quality of data from Scrapy Cloud jobs continuously.
        
          Say you scraped a website and have the data ready in the cloud. A typical approach would be:
            * Create a JSON schema and validate the data with it
            * Use the created schema in [Spidermon Validation](https://spidermon.readthedocs.io/en/latest/item-validation.html#with-json-schema)
        * You want to use Arche in your own application to verify Scrapy Cloud data.
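
        The "create a JSON schema and validate" step can be sketched with the generic `jsonschema` package (the schema and items below are toy examples):

        ```python
        from jsonschema import ValidationError, validate

        # Toy schema describing one scraped item (example only)
        schema = {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "number"},
            },
            "required": ["title", "price"],
        }

        validate({"title": "Blue shirt", "price": 19.99}, schema)  # passes silently

        try:
            validate({"title": "Blue shirt"}, schema)  # "price" is missing
        except ValidationError as e:
            print(e.message)
        ```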
        
        ## Developer Setup
        
        	pipenv install --dev
        	pipenv shell
        	tox
        
        ## Contribution
        Any contributions are welcome!
        
        * Fork or create a new branch
        * Make desired changes
        * Open a pull request
        
        # Changes
        Most recent releases are shown at the top. Each release shows:
        
        - **Added**: New classes, methods, functions, etc.
        - **Changed**: Additional parameters, changes to inputs or outputs, etc.
        - **Fixed**: Bug fixes that don't change documented behaviour
        
        Note that the top-most release reflects unreleased changes on the master branch on GitHub. Parentheses after an item show the name or GitHub ID of the contributor of that change.
        
        [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
        
        
        ## [0.4.0] (Work In Progress)
        ### Added
        ### Changed
        ### Fixed
        ### Removed
        
        
        ## [0.3.2] (2019-04-18)
        ### Added
        - Allow reading private raw schemas directly from bitbucket, #58
        
        ### Changed
        - Progress widgets are removed before printing graphs
        - New plotly v4 API
        
        ### Fixed
        - Failing `Compare Prices For Same Urls` when url is `nan`, #67
        - Empty graphs in Jupyter Notebook, #63
        
        ### Removed
        - Scraped Items History graphs
        
        
        ## [0.3.1] (2019-04-12)
        
        ### Fixed
        - Empty graphs due to lack of plotlyjs, #61
        
        
        ## [0.3.0] (2019-04-12)
        
        ### Fixed
        - Big notebook size, replaced cufflinks with plotly and ipython, #39
        
        ### Changed
        - *Fields Coverage* now is printed as a bar plot, #9
        - *Fields Counts* renamed to *Coverage Difference* and results in 2 bar plots, #9, #51:
           * *Coverage from job stats fields counts* which reflects coverage for each field for both jobs
           * *Coverage difference more than 5%* which prints >5% difference between the coverages (was ratio difference before)
        - *Compare Scraped Categories* renamed to *Category Coverage Difference* and results in 2 bar plots for each category, #52:
           * *Coverage for `field`* which reflects value counts (categories) coverage for the field for both jobs
           * *Coverage difference more than 10% for `field`* which shows >10% differences between the category coverages
        - *Boolean Fields* plots *Coverage for boolean fields* graph which reflects normalized value counts for boolean fields for both jobs, #53
        
        ### Removed
        - `cufflinks` dependency
        - Deprecated `category_field` tag
        
        
        ## [2019.03.25]
        ### Added
        - CHANGES.md
        - New `arche.rules.duplicates.find_by()` to find duplicates by chosen columns
        ```python
        import arche
        from arche.readers.items import JobItems

        # Read items from a Scrapy Cloud job into a pandas DataFrame
        df = JobItems(0, "235801/1/15").df
        # Find items which share the same "title" and "category" values
        arche.rules.duplicates.find_by(df, ["title", "category"]).show()
        ```
        - `basic_json_schema().json()` prints a schema in JSON format
        - `Result.show()` to print a rule result, e.g.
        ```python
        from arche.readers.items import JobItems
        from arche.rules.garbage_symbols import garbage_symbols

        # Read items from a Scrapy Cloud job and check them for garbage symbols
        items = JobItems(0, "235801/1/15")
        garbage_symbols(items).show()
        ```
        - notebooks to documentation
        
        ### Changed
        - Tags rule returns unused tags, #2
        - `basic_json_schema()` prints a schema as a python dict
        
        ### Deprecated
        - `Arche().basic_json_schema()` deprecated in favor of `arche.basic_json_schema()`
        ### Removed
        ### Fixed
        - `Arche().basic_json_schema()` not using `items_numbers` argument
        
        
        ## 2019.03.18
        
        - Last release without CHANGES updates
        
Keywords: scrapinghub,scraping,data,data-visualization,data-analysis,pandas
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
Provides-Extra: pep8tests
Provides-Extra: tests
