Metadata-Version: 2.1
Name: aovec
Version: 0.13
Summary: Make Word2Vec from aozorabunko/aozorabunko
Home-page: https://github.com/eggplants/aovec
Author: eggplants
License: MIT
Platform: UNKNOWN
Requires-Python: >=3.5
Description-Content-Type: text/markdown
Requires-Dist: beautifulsoup4
Requires-Dist: gensim (>=4.0.1)
Requires-Dist: mecab-python3
Requires-Dist: python-Levenshtein

# aovec

[![Model scheduled release](https://github.com/eggplants/aovec/actions/workflows/model_release.yml/badge.svg)](https://github.com/eggplants/aovec/actions/workflows/model_release.yml)
[![Release Package](https://github.com/eggplants/aovec/actions/workflows/release.yml/badge.svg)](https://github.com/eggplants/aovec/actions/workflows/release.yml) [![PyPI version](https://badge.fury.io/py/aovec.svg)](https://badge.fury.io/py/aovec)

- Make Word2Vec from [aozorabunko/aozorabunko](https://github.com/aozorabunko/aozorabunko)

- Pre-built models are available from `week*` [Releases](https://github.com/eggplants/aovec/releases).

[![model](https://img.shields.io/badge/dynamic/json.svg?label=Model&query=$[0].assets[0].browser_download_url&url=https://api.github.com/repos/eggplants/aovec/releases)](https://github.com/eggplants/aovec/releases)

## Requirements

- Git
- MeCab
  - MeCab Checker: [src/check_mecab.py](https://github.com/eggplants/aovec/blob/master/src/check_mecab.py)
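
As a quick pre-flight check, the sketch below reports which of the required tools are not on `PATH`. It assumes both are exposed as executables named `git` and `mecab`; the helper name `missing_tools` is hypothetical and is not part of aovec. It does not replace the MeCab checker linked above.

```python
import shutil

# Assumption: the requirements are reachable as these executable names.
REQUIRED_TOOLS = ("git", "mecab")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def report():
    """Print a one-line summary of the pre-flight check."""
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Calling `report()` before `aovec clone` gives an early hint if something is not installed.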

## How to use

- Make a `*.model` file

```bash
# Install from PyPI
$ pip install aovec
# Clone aozorabunko/aozorabunko (>20GB)
$ aovec clone
# Parse HTML files and write results to novels/
$ aovec parse
# Make word2vec and write to aozora_model.model
$ aovec mkvec
```

- Use from Python (see the [official documentation](https://radimrehurek.com/gensim/models/word2vec.html))

```python
from gensim.models import Word2Vec, KeyedVectors

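# Load a full model saved in gensim's native format
model = Word2Vec.load('aozora_model.model')

# or, load only the word vectors from a word2vec text-format file
model = KeyedVectors.load_word2vec_format('aozora_model.kv',
                                          unicode_errors='ignore')
# or, load only the word vectors from a word2vec binary-format file
model = KeyedVectors.load_word2vec_format('aozora_model.kv.bin',
                                          binary=True,
                                          unicode_errors='ignore')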
```
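
Once a model is loaded, a typical next step is a nearest-neighbour query. The following is a minimal sketch, assuming gensim 4.x, where a `Word2Vec` object exposes its vectors as `.wv` and a `KeyedVectors` object is used directly. The helper names `as_keyed_vectors`, `similar_words` and `load_and_query` are hypothetical and are not part of aovec or gensim.

```python
def as_keyed_vectors(model):
    """Return the KeyedVectors whether `model` is a Word2Vec or a KeyedVectors."""
    return getattr(model, "wv", model)


def similar_words(model, word, topn=5):
    """Return up to `topn` (word, similarity) pairs, or [] for an unknown word."""
    vectors = as_keyed_vectors(model)
    if word not in vectors.key_to_index:
        return []
    return vectors.most_similar(positive=[word], topn=topn)


def load_and_query(path, word, topn=5):
    """Load a model saved in gensim's native format and query it."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim

    return similar_words(Word2Vec.load(path), word, topn=topn)
```

For example, `load_and_query('aozora_model.model', '猫')` would list the words closest to `猫`, provided that word survived the `min_count` cutoff during training.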

## (Optional) Set up `mecab-ipadic-neologd` on Ubuntu

- Download and install

```bash
$ sudo apt install build-essential
$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd neologd && cd $_
$ sudo bin/install-mecab-ipadic-neologd -y
$ sudo mv /usr/lib/*/mecab/dic/mecab-ipadic-neologd /var/lib/mecab/dic
```

- Update `/etc/mecabrc`

```bash
$ sudo cp /etc/mecabrc /etc/mecabrc.bak
$ sudo sed -i 's_^dicdir.*_; &\'$'\ndicdir = /var/lib/mecab/dic/mecab-ipadic-neologd_' /etc/mecabrc
```

```diff
--- /etc/mecabrc.bak
+++ /etc/mecabrc
@@ -3,7 +3,8 @@
 ;
 ; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
 ;
-dicdir = /var/lib/mecab/dic/debian
+; dicdir = /var/lib/mecab/dic/debian
+dicdir = /var/lib/mecab/dic/mecab-ipadic-neologd

 ; userdic = /home/foo/bar/user.dic
```
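
The same rewrite can be sketched in Python, which may be easier to read than the quoted `sed` expression. This is an illustration only: the function name `switch_dicdir` is hypothetical, it works on the file contents as a string, and it does not write to `/etc/mecabrc` itself.

```python
import re


def switch_dicdir(mecabrc_text, new_dicdir):
    """Comment out each active `dicdir` line and add one pointing at `new_dicdir`."""

    def replace(match):
        # Keep the old setting as a `;` comment, then append the new one.
        return "; " + match.group(0) + "\ndicdir = " + new_dicdir

    return re.sub(r"^dicdir.*$", replace, mecabrc_text, flags=re.MULTILINE)
```

Lines that are already commented out, such as `; userdic = ...`, start with `;` and are therefore left untouched.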

## Help

```bash
$ aovec -h
usage: aovec [-h] [-V] {clone,c,parse,p,mkvec,m} ...

Make Word2Vec from aozorabunko/aozorabunko

positional arguments:
  {clone,c,parse,p,mkvec,m}
    clone (c)           clone aozorabunko/aozorabunko (>20GB)
    parse (p)           parse html files and write to results
    mkvec (m)           make word2vec and write to *.model

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
```

```bash
$ aovec clone -h
usage: aovec clone [-h]

optional arguments:
  -h, --help  show this help message and exit
```

```bash
$ aovec parse -h
usage: aovec parse [-h] [-d DIR]

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --savedir DIR
                        directory name of saving results (default: novels)
```

```bash
$ aovec mkvec -h
usage: aovec mkvec [-h] [-d DIR] [-o NAME] [-e INT] [-v INT] [-m INT] [-w INT]
                   [-p INT] [-b] [--both]

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --parsedir DIR
                        directory name of saved parsing results (default:
                        novels)
  -o NAME, --model NAME
                        name of word2vec model (default: aozora_model)
  -e INT, --epochs INT  number of word2vec epochs (default: 5)
  -v INT, --vector_size INT
                        dimensionality of the word vectors (default: 1000)
  -m INT, --min_count INT
                        ignore words total frequency lower than this (default:
                        5)
  -w INT, --window INT  window size of words before and for learning (default:
                        5)
  -p INT, --workers INT
                        worker threads (default: 3)
  -b, --binary          save model files as one binary (default: False)
  --both                save model files as both row data and binary (default:
                        False)
```
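
When training is driven from a script, the options above can be assembled into an argument list. The sketch below only uses the long flags and defaults shown in the `aovec mkvec -h` output; the function name `build_mkvec_command` is hypothetical and not part of aovec. Whether `--binary` and `--both` may be combined is not stated in the help text, so each is passed through as given.

```python
def build_mkvec_command(parsedir="novels", model="aozora_model", epochs=5,
                        vector_size=1000, min_count=5, window=5, workers=3,
                        binary=False, both=False):
    """Build an argv list for `aovec mkvec` from keyword arguments."""
    command = [
        "aovec", "mkvec",
        "--parsedir", parsedir,
        "--model", model,
        "--epochs", str(epochs),
        "--vector_size", str(vector_size),
        "--min_count", str(min_count),
        "--window", str(window),
        "--workers", str(workers),
    ]
    if binary:
        command.append("--binary")
    if both:
        command.append("--both")
    return command
```

The resulting list can be handed to `subprocess.run(command, check=True)`, for instance `build_mkvec_command(vector_size=300, epochs=10)` to train smaller vectors for more epochs.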

## License

MIT


