Metadata-Version: 2.1
Name: attacut
Version: 1.1.0.dev0
Summary: Fast and Reasonably Accurate Word Tokenizer for Thai
Home-page: https://github.com/PyThaiNLP/attacut
Author: Pattarawat Chormai et al.
Author-email: foomail@foo.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Natural Language :: Thai
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: docopt (>=0.6.2)
Requires-Dist: fire (>=0.1.3)
Requires-Dist: nptyping (>=0.2.0)
Requires-Dist: numpy (>=1.17.0)
Requires-Dist: pyyaml (>=5.1.2)
Requires-Dist: six (>=1.12.0)
Requires-Dist: ssg (>=0.0.4)
Requires-Dist: torch (>=1.2.0)

# AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai

[![Build Status](https://travis-ci.org/PyThaiNLP/attacut.svg?branch=master)](https://travis-ci.org/PyThaiNLP/attacut)
[![Build status](https://ci.appveyor.com/api/projects/status/ittfnb2pyg95kpxk/branch/master?svg=true)](https://ci.appveyor.com/project/wannaphongcom/attacut/branch/master)
[![](https://img.shields.io/badge/-presentation-informational)](https://drive.google.com/file/d/16AUNZv1HXVmERgryfBf4JpCo1QrQyHHE/view?usp=sharing)
[![](http://img.shields.io/badge/arXiv-1911.07056-yellow.svg?style=flat)](https://arxiv.org/abs/1911.07056)

## How does AttaCut look like?

<div align="center">
    <img src="https://i.imgur.com/8yMq7IB.png" width="700px"/>
    <br/>
    <b>TL;DR:</b> 
3-Layer Dilated CNN on syllable and character features. It’s <b>6x faster</b> than DeepCut (SOTA) while its WL-f1 on BEST is <b>91%</b>, only 2% lower.
</div>

## Installation

```
$ pip install attacut
```

**Remarks:** Windows users need to install **PyTorch** before the command above.
Please consult [PyTorch.org](https://pytorch.org) for more details.

## Usage

### Command-Line Interface

```
$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai

Usage:
  attacut-cli <src> [--dest=<dest>] [--model=<model>]
  attacut-cli [-v | --version]
  attacut-cli [-h | --help]

Arguments:
  <src>             Path to input text file to be tokenized

Options:
  -h --help         Show this screen.
  --model=<model>   Model to be used [default: attacut-sc].
  --dest=<dest>     If not specified, it'll be <src>-tokenized-by-<model>.txt
  -v --version      Show version
```

### High-Level API 
```
from attacut import tokenize, Tokenizer

# tokenize `txt` using our best model `attacut-sc`
words = tokenize(txt)

# alternatively, an AttaCut tokenizer might be instantiated directly, allowing
# one to specify whether to use `attacut-sc` or `attacut-c`.
atta = Tokenizer(model="attacut-sc")
words = atta.tokenize(txt)
```

## Benchmark Results

Belows are brief summaries. More details can be found on [our benchmarking page](https://pythainlp.github.io/attacut/benchmark.html).


### Tokenization Quality
![](https://pythainlp.github.io/attacut/_images/quality-benchmark-in-of-domain.png)

### Speed
![](https://pythainlp.github.io/attacut/_images/speed-benchmark-ec2.png)


## Retraining on Custom Dataset
Please refer to [our retraining page](https://pythainlp.github.io/attacut/)

## Related Resources
- [Tokenization Visualization][tovis]
- [Thai Tokenizer Dockers][docker]

## Acknowledgements
This repository was initially done by [Pattarawat Chormai][pat], while interning at [Dr. Attapol Thamrongrattanarit's NLP Lab][ate], Chulalongkorn University, Bangkok, Thailand.
Many people have involed in this project. Complete list of names can be found on [Acknowledgement](https://pythainlp.github.io/attacut/acknowledgement.html).


[pat]: http://pat.chormai.org
[ate]: https://attapol.github.io/lab.html
[noom]: https://github.com/Ekkalak-T
[can]: https://github.com/c4n
[ake]: https://github.com/ekapolc
[tovis]: https://pythainlp.github.io/tokenization-benchmark-visualization/
[docker]: https://github.com/PyThaiNLP/docker-thai-tokenizers


