Metadata-Version: 2.1
Name: afd-measures
Version: 0.9.2
Summary: A collection of measures for Approximate Functional Dependencies in relational data.
Home-page: https://github.com/MarcelPa/AFD_comparative_study
License: MIT
Author: Marcel Parciak
Author-email: marcel.parciak@uhasselt.be
Requires-Python: >=3.9,<4.0
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Provides-Extra: experiments
Requires-Dist: joblib (>=1.2.0,<2.0.0) ; extra == "experiments"
Requires-Dist: jupyterlab (>=3.5.0,<4.0.0) ; extra == "experiments"
Requires-Dist: numpy (>=1.23.5,<2.0.0)
Requires-Dist: pandas (>=1.5.2,<2.0.0)
Requires-Dist: plotly (>=5.13.0,<6.0.0) ; extra == "experiments"
Requires-Dist: scikit-learn (>=1.1.3,<2.0.0) ; extra == "experiments"
Requires-Dist: tqdm (>=4.64.1,<5.0.0) ; extra == "experiments"
Project-URL: Repository, https://github.com/MarcelPa/AFD_comparative_study
Description-Content-Type: text/markdown

# AFD comparative study

This repository contains all artifacts for the paper "Approximately Measuring Functional Dependencies: a Comparative Study".

## Overview

* `code`: this directory holds the code used to generate the results in the paper
	* `afd_measures`: all Python source code relating to the implemented AFD measures
	* `experiments`: Jupyter notebooks containing the processing steps to generate the results, figures or tables in the paper
	* `synthetic_data`: all Python source code relating to the synthetic data generation process
* `data`: the datasets used in the paper
	* `rwd`: manually annotated dataset of files found on the web (see `data/ground_truth.csv`)
	* `rwd_e`: datasets from `rwd` with errors introduced into them. Generated by the notebook `code/experiments/create_rwd_e_dataset.ipynb`.
	* `syn_e`: synthetic dataset generated focussing on errors. Generated by the notebook `code/experiments/create_syn_e.ipynb`.
	* `syn_u`: synthetic dataset generated focussing on left-hand side uniqueness. Generated by the notebook `code/experiments/create_syn_u.ipynb`.
	* `syn_s`: synthetic dataset generated focussing on right-hand side skewness. Generated by the notebook `code/experiments/create_syn_s.ipynb`.
* `paper`: A full version of the paper including all proofs.
* `results`: results of applying the AFD measures to the datasets.
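
To give a feel for what an AFD measure computes, here is a minimal sketch in plain Python. It is not the API of `afd_measures`. The function name `fraction_of_rows_kept`, the toy rows and the column names are all hypothetical and chosen only for illustration. The sketch computes the largest fraction of rows that can be kept so that the dependency `lhs -> rhs` holds exactly, which is one common way to quantify how close a relation is to satisfying a functional dependency.

```python
from collections import Counter, defaultdict


def fraction_of_rows_kept(rows, lhs, rhs):
    """Largest fraction of rows that can be kept so that lhs -> rhs holds exactly.

    Hypothetical helper for illustration; not part of afd_measures.
    """
    if not rows:
        return 1.0
    # Count how often each right-hand side value occurs per left-hand side value.
    rhs_counts = defaultdict(Counter)
    for row in rows:
        rhs_counts[row[lhs]][row[rhs]] += 1
    # Within each left-hand side group, keep only the most frequent right-hand side value.
    kept = sum(max(counts.values()) for counts in rhs_counts.values())
    return kept / len(rows)


# Toy data (made up): one misspelled city violates zip -> city.
rows = [
    {"zip": "3500", "city": "Hasselt"},
    {"zip": "3500", "city": "Hasselt"},
    {"zip": "3500", "city": "Haselt"},
    {"zip": "9000", "city": "Gent"},
]
score = fraction_of_rows_kept(rows, "zip", "city")
print(score)  # 0.75: dropping the misspelled row makes zip -> city hold
```

A score of 1.0 means the dependency holds exactly. Lower values mean more rows have to be removed before it holds.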

## Installation

Use the code in this repository with [Poetry](https://python-poetry.org) or [Conda](https://conda.io).

### Poetry

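Install all dependencies via Poetry, including the `experiments` extra that provides JupyterLab, and start JupyterLab inside the Poetry environment to investigate the code.

```sh
$ poetry install --extras experiments
$ poetry run jupyter lab
```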

### Conda

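Create a new environment from the `conda_environment.yaml` file, activate it and run JupyterLab to investigate the code. Replace `<environment-name>` with the name defined in `conda_environment.yaml`.

```sh
$ conda env create -f conda_environment.yaml
$ conda activate <environment-name>
$ jupyter lab
```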

## Dataset References

In addition to this repository, our benchmark is also available on Zenodo: [find it here](https://www.zenodo.org/record/8098909).

* `adult.csv`: Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science. 
* `claims.csv`: TSA Claims Data 2002 to 2006, [published by the U.S. Department of Homeland Security](https://www.dhs.gov/tsa-claims-data).
* `dblp10k.csv`: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. [Made available as DBLP Dataset 2](https://hpi.de/naumann/projects/repeatability/datasets/dblp-dataset.html).
* `hospital.csv`: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. [Made available as part of the dataset collection for that paper.](https://owncloud.hpi.de/s/j6Z0yvXC0qhtGCk/download)
* `t_biocase_...` files: t\_bioc\_... files used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. [Made available as part of the dataset collection for that paper.](https://owncloud.hpi.de/s/j6Z0yvXC0qhtGCk/download)
* `tax.csv`: Tax dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. [Made available as part of the dataset collection for that paper.](https://owncloud.hpi.de/s/j6Z0yvXC0qhtGCk/download)

