Metadata-Version: 2.1
Name: Expanda
Version: 1.2.1
Summary: Integrated Corpus-Building Environment
Home-page: https://github.com/affjljoo3581/Expanda
Author: Jungwoo Park
Author-email: affjljoo3581@gmail.com
License: Apache-2.0
Keywords: expanda,corpus,dataset,nlp
Platform: UNKNOWN
Classifier: Environment :: Console
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.6.0
Description-Content-Type: text/markdown
Requires-Dist: nltk
Requires-Dist: ijson
Requires-Dist: tqdm (>=4.46.0)
Requires-Dist: mwparserfromhell (>=0.5.4)
Requires-Dist: tokenizers (>=0.7.0)
Requires-Dist: kss (==1.3.1)

# Expanda

**The universal integrated corpus-building environment.**

[![PyPI version](https://badge.fury.io/py/Expanda.svg)](https://badge.fury.io/py/Expanda)
![build](https://github.com/affjljoo3581/Expanda/workflows/build/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/expanda/badge/?version=latest)](https://expanda.readthedocs.io/en/latest/?badge=latest)
![GitHub](https://img.shields.io/github/license/affjljoo3581/Expanda)
[![codecov](https://codecov.io/gh/affjljoo3581/Expanda/branch/master/graph/badge.svg)](https://codecov.io/gh/affjljoo3581/Expanda)
[![CodeFactor](https://www.codefactor.io/repository/github/affjljoo3581/expanda/badge)](https://www.codefactor.io/repository/github/affjljoo3581/expanda)

## Introduction
**Expanda** is an **integrated corpus-building environment**. It provides
integrated pipelines for building a corpus dataset. Building a corpus dataset
requires several complicated stages such as parsing, shuffling, and
tokenization. When the corpora are gathered from different sources, parsing
their various formats becomes a problem. Expanda lets you build a corpus in a
single step by writing a build configuration.

## Main Features
* Easy to build, and simple to extend with new extensions
* Manages the build environment systematically
* Fast builds through performance optimization (even though it is written in
  Python)
* Supports multi-processing
* Extremely low memory usage
* No need to write new code for each corpus. Adding a new corpus takes just one
  line of configuration.

## Dependencies
* nltk
* ijson
* tqdm>=4.46.0
* mwparserfromhell>=0.5.4
* tokenizers>=0.7.0
* kss==1.3.1

## Installation

### With pip
Expanda can be installed using pip as follows:

```console
$ pip install expanda
```

### From source
You can install from source by cloning the repository and running:

```console
$ git clone https://github.com/affjljoo3581/Expanda.git
$ cd Expanda
$ python setup.py install
```

## Build your first dataset
Let's build a **Wikipedia** dataset using Expanda. First, install Expanda:
```console
$ pip install expanda
```
Next, create a workspace for building the dataset by running:
```console
$ mkdir workspace
$ cd workspace
```
Then, download a Wikipedia dump file from [here](https://dumps.wikimedia.org/).
In this example, we are going to test with [part of the wiki](https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2).
Download the file through the browser, move it to `workspace/src`, and rename it
to `wiki.xml.bz2`. Alternatively, run the commands below:
```console
$ mkdir src
$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
```
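If `wget` is not available, the same download can be sketched with the Python standard library. This is only an illustration and not part of Expanda: the helper name `download_dump` is hypothetical, and the URL is the one used above.

```python
import shutil
import urllib.request
from pathlib import Path

# Same partial dump as in the wget command above.
DUMP_URL = (
    "https://dumps.wikimedia.org/enwiki/20200520/"
    "enwiki-20200520-pages-articles1.xml-p1p30303.bz2"
)


def download_dump(url, dest):
    """Stream `url` to `dest`, creating the parent directory if needed.

    Hypothetical helper for this example; Expanda does not provide it.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so the whole dump is never held in memory.
    with urllib.request.urlopen(url) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Run from inside `workspace`:
# download_dump(DUMP_URL, "src/wiki.xml.bz2")
```

The call is left commented out so that importing the snippet does not start a large download.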
After downloading the dump file, we need to set up the configuration file.
Create an `expanda.cfg` file with the following content:
```ini
[expanda.ext.wikipedia]
num-cores           = 6

[tokenization]
unk-token           = <unk>
control-tokens      = <s>
                      </s>
                      <pad>

[build]
input-files         =
    --expanda.ext.wikipedia     src/wiki.xml.bz2
```
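To see how the multi-line values in this file are structured, the sketch below reads the same text with Python's standard `configparser`. Whether Expanda itself parses the file this way is an assumption, and `parse_input_files` is a hypothetical helper written only for this illustration.

```python
import configparser

# The same configuration as above, inlined so the example is self-contained.
CONFIG_TEXT = """\
[expanda.ext.wikipedia]
num-cores           = 6

[tokenization]
unk-token           = <unk>
control-tokens      = <s>
                      </s>
                      <pad>

[build]
input-files         =
    --expanda.ext.wikipedia     src/wiki.xml.bz2
"""


def parse_input_files(raw):
    """Split each non-empty line into an (extension, path) pair.

    Hypothetical helper: assumes every line looks like `--<extension> <path>`.
    """
    entries = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        extension, path = line.split(None, 1)
        entries.append((extension.lstrip("-"), path))
    return entries


parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

num_cores = parser.getint("expanda.ext.wikipedia", "num-cores")
# Indented continuation lines are joined with newlines, so split() separates them.
control_tokens = parser.get("tokenization", "control-tokens").split()
input_files = parse_input_files(parser.get("build", "input-files"))

print(num_cores, control_tokens, input_files)
```

Each entry under `input-files` pairs an extension module with the file it should process, so adding another corpus means adding one more line of that form.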
The current directory structure of `workspace` should be as follows:
```
workspace
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
```
Now we are ready to build! Run Expanda by using:
```console
$ expanda build
```
The build then prints output similar to the following:
```
[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[*] merge extracted texts.
[*] start shuffling merged corpus...
[*] optimum stride: 17, buckets: 34
[*] create temporary bucket files.
[*] successfully shuffle offsets. total offsets: 102936
[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
[*] start copying buckets to the output file.
[*] finish copying buckets. remove the buckets...
[*] complete preparing corpus. start training tokenizer...
[00:00:59] Reading files                            ████████████████████                 100
[00:00:04] Tokenize words                           ████████████████████ 405802   /   405802
[00:00:00] Count pairs                              ████████████████████ 405802   /   405802
[00:00:01] Compute merges                           ████████████████████ 6332     /     6332

[*] create tokenized corpus.
[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
[*] split the corpus into train and test dataset.
[*] remove temporary directory.
[*] finish building corpus.
```
If the build succeeds, the workspace has the following directory tree:
```
workspace
├── build
│   ├── corpus.raw.txt
│   ├── corpus.train.txt
│   ├── corpus.test.txt
│   └── vocab.txt
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
```
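A quick way to check the result is to count the lines in each generated file. The sketch below assumes the files in `build` are line-oriented UTF-8 text, which this README does not state explicitly. The helpers `head`, `count_lines`, and `summarize` are hypothetical names for this example.

```python
from itertools import islice
from pathlib import Path

# File names taken from the directory tree above.
OUTPUT_FILES = ("corpus.raw.txt", "corpus.train.txt", "corpus.test.txt", "vocab.txt")


def head(path, n=3):
    """Return the first `n` lines of a text file without reading all of it."""
    with Path(path).open(encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def count_lines(path):
    """Count lines lazily so large corpora do not need to fit in memory."""
    with Path(path).open(encoding="utf-8") as f:
        return sum(1 for _ in f)


def summarize(build_dir="build"):
    """Map each existing output file name to its line count."""
    build = Path(build_dir)
    return {
        name: count_lines(build / name)
        for name in OUTPUT_FILES
        if (build / name).exists()
    }


# Run from inside `workspace`; prints nothing if `build` does not exist yet.
for name, lines in summarize().items():
    print(f"{name}: {lines} lines")
```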


