Metadata-Version: 2.1
Name: arche
Version: 0.3.6
Summary: Analyze Scrapy Cloud data
Home-page: http://github.com/scrapinghub/arche
Maintainer: Scrapinghub, manycoding
Maintainer-email: info@scrapinghub.com
License: MIT
Description: # Arche
        
        [![PyPI](https://img.shields.io/pypi/v/arche.svg)](https://pypi.org/project/arche)
        [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/arche.svg)](https://pypi.org/project/arche)
        ![GitHub](https://img.shields.io/github/license/scrapinghub/arche.svg)
        [![Build Status](https://travis-ci.com/scrapinghub/arche.svg?branch=master)](https://travis-ci.com/scrapinghub/arche)
        [![Codecov](https://img.shields.io/codecov/c/github/scrapinghub/arche.svg)](https://codecov.io/gh/scrapinghub/arche)
        [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)
        [![GitHub commit activity](https://img.shields.io/github/commit-activity/m/scrapinghub/arche.svg)](https://github.com/scrapinghub/arche/commits/master)
        
            pip install arche
        
        Arche (pronounced *Arkey*) helps to verify scraped data using a set of defined rules, for example:
          * Validation with [JSON schema](https://json-schema.org/)
          * Coverage (items, fields, categorical data, including booleans and enums)
          * Duplicates
          * Garbage symbols
          * Comparison of two jobs
          
        _We use it at Scrapinghub, among other tools, to ensure the quality of scraped data._
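        
        To make two of these rules concrete, here is a minimal sketch of a field coverage check and a duplicates check over plain item dicts. It uses only the Python standard library and does not reflect Arche's actual implementation; the sample items and the helper names `field_coverage` and `find_duplicates` are hypothetical.
        
        ```python
        from collections import Counter
        
        # Hypothetical sample items; the field names are illustrative only.
        items = [
            {"_key": "0", "title": "Chair", "price": "10.0", "category": "home"},
            {"_key": "1", "title": "Table", "price": None, "category": "home"},
            {"_key": "2", "title": "Chair", "price": "10.0", "category": "home"},
        ]
        
        
        def field_coverage(items):
            """Return the share of items with a non-empty value, per field.
        
            Fields that are never filled are omitted from the result.
            """
            counts = Counter()
            for item in items:
                for field, value in item.items():
                    if value not in (None, ""):
                        counts[field] += 1
            return {field: count / len(items) for field, count in counts.items()}
        
        
        def find_duplicates(items, fields):
            """Group the keys of items that share the same values in `fields`."""
            groups = {}
            for item in items:
                signature = tuple(item.get(field) for field in fields)
                groups.setdefault(signature, []).append(item["_key"])
            # Only groups with more than one item are duplicates.
            return [keys for keys in groups.values() if len(keys) > 1]
        
        
        coverage = field_coverage(items)
        duplicates = find_duplicates(items, ["title", "category"])
        ```
        
        In this sample, `price` is filled in two of three items, and the items with keys `0` and `2` share the same `title` and `category`.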
        
        ## Installation
        
        Arche requires a [Jupyter](https://jupyter.org/install) environment and supports both the [JupyterLab](https://github.com/jupyterlab/jupyterlab#installation) and [Notebook](https://github.com/jupyter/notebook) UIs.
        
        For JupyterLab, you will need to properly install [plotly extensions](https://github.com/plotly/plotly.py#jupyterlab-support-python-35)
        
        Then just `pip install arche`
        
        ## Why
        To check the quality of scraped data continuously. For example, if you scraped a website, a typical approach would be to validate the data with Arche. You can also create a schema and then set up [Spidermon](https://spidermon.readthedocs.io/en/latest/item-validation.html#with-json-schema)
        
        ## Developer Setup
        
        	pipenv install --dev
        	pipenv shell
        	tox
        
        ## Contribution
        Any contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.
        
        # Changes
        Most recent releases are shown at the top. Each release shows:
        
        - **Added**: New classes, methods, functions, etc
        - **Changed**: Additional parameters, changes to inputs or outputs, etc
        - **Fixed**: Bug fixes that don't change documented behaviour
        
        Note that the top-most release lists changes in the unreleased master branch on GitHub. Parentheses after an item show the name or GitHub ID of the contributor of that change.
        
        [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
        
        
        ## [0.3.6] (2019-07-12)
        ### Added
        - **Categories** rule with a plot showing unique values and their counts per field. By default, `report_all()` only includes fields with 10 or fewer unique values. See https://arche.readthedocs.io/en/latest/nbs/Rules.html#Category-fields, #100
        - Category documentation
        ### Changed
        - `Arche.report_all()` does not shorten the report by default; added a `short` parameter.
        - Data is consistent with Dash and Spidermon: `_type, _key` fields are dropped from the dataframe, raw data, and basic schema, #104, #106
        - `df.index` now stores `_key` instead
        - `basic_json_schema()` works with `deleted` jobs
        - `start` is supported for Collections, #112
        - `enum` is counted as a `category` tag, #18
        - `Garbage Symbols` searches in str representation of nested fields instead of expanded df, #130
        - Show real coverage difference (negative/positive) instead of absolute, #114
        ### Fixed
        - `Arche.glance()`, #88
        - Item links in Schema validation errors, #89
        - Empty NAN bars on category graphs, #93
        - `data_quality_report()`, #95
        - Wrong number of Collection Items if it contains item 0, #112
        ### Removed
        - **Responses Per Item Ratio** rule
        - Deprecated `expand` parameter and removed `flat_df`, since `Garbage Rule` deals with nested data itself, #133
        
        
        ## [0.3.5] (2019-05-14)
        ### Added
        - `Arche()` supports any iterables with item dicts, fixing jsonschema consistency, #83
        - `Items.from_array` to read raw data from iterables, #83
        ### Changed
        - If reading from pandas df directly, store raw data in numpy array. See gotchas http://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na
        
        
        ## [0.3.4] (2019-05-06)
        ### Fixed
        - `basic_json_schema()` fails with long `1.0` types, #80
        
        
        ## [0.3.3] (2019-05-03)
        ### Added
        - Accept dataframes as source or target, #69
        ### Changed
        - `data_quality_report()` plots the same "Fields Coverage" instead of the green "Scraped Fields Coverage"
        - Plot theme changed from ggplot2 to seaborn, #62
        - Passing the same target and source now raises an error (previously a warning)
        - Passed rules are marked with a green PASSED.
        ### Fixed
        - Online documentation now renders graphs https://arche.readthedocs.io/en/latest/, #41
        - Error colours are back in `report_all()`. 
        ### Removed
        - Deprecated `Arche.basic_json_schema()`, use `basic_json_schema()`
        - Removed Quickstart.md as redundant - documentation lives in notebooks
        
        
        ## [0.3.2] (2019-04-18)
        ### Added
        - Allow reading private raw schemas directly from bitbucket, #58
        ### Changed
        - Progress widgets are removed before printing graphs
        - New plotly v4 API
        ### Fixed
        - Failing `Compare Prices For Same Urls` when url is `nan`, #67
        - Empty graphs in Jupyter Notebook, #63
        ### Removed
        - Scraped Items History graphs
        
        
        ## [0.3.1] (2019-04-12)
        ### Fixed
        - Empty graphs due to lack of plotlyjs, #61
        
        
        ## [0.3.0] (2019-04-12)
        ### Fixed
        - Big notebook size, replaced cufflinks with plotly and ipython, #39
        ### Changed
        - *Fields Coverage* now is printed as a bar plot, #9
        - *Fields Counts* renamed to *Coverage Difference* and results in 2 bar plots, #9, #51:
           * *Coverage from job stats fields counts* which reflects coverage for each field for both jobs
           * *Coverage difference more than 5%* which prints >5% difference between the coverages (was ratio difference before)
        - *Compare Scraped Categories* renamed to *Category Coverage Difference* and results in 2 bar plots for each category, #52:
           * *Coverage for `field`* which reflects value counts (categories) coverage for the field for both jobs
           * *Coverage difference more than 10% for `field`* which shows >10% differences between the category coverages
        - *Boolean Fields* plots *Coverage for boolean fields* graph which reflects normalized value counts for boolean fields for both jobs, #53
        ### Removed
        - `cufflinks` dependency
        - Deprecated `category_field` tag
        
        
        ## [2019.03.25]
        ### Added
        - CHANGES.md
        - new `arche.rules.duplicates.find_by()` to find duplicates by chosen columns
        ```
        import arche
        from arche.readers.items import JobItems
        df = JobItems(0, "235801/1/15").df
        arche.rules.duplicates.find_by(df, ["title", "category"]).show()
        ```
        - `basic_json_schema().json()` prints a schema in JSON format
        - `Result.show()` to print a rule result, e.g.
        ```
        from arche.rules.garbage_symbols import garbage_symbols
        from arche.readers.items import JobItems
        items = JobItems(0, "235801/1/15")
        garbage_symbols(items).show()
        ```
        - notebooks to documentation
        ### Changed
        - Tags rule returns unused tags, #2
        - `basic_json_schema()` prints a schema as a python dict
        ### Deprecated
        - `Arche().basic_json_schema()` deprecated in favor of `arche.basic_json_schema()`
        ### Fixed
        - `Arche().basic_json_schema()` not using `items_numbers` argument
        
        
        ## 2019.03.18
        - Last release without CHANGES updates
        
Keywords: scrapinghub,scraping,data,data-visualization,data-analysis,pandas
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
Provides-Extra: docs
Provides-Extra: pep8tests
Provides-Extra: tests
