Metadata-Version: 2.1
Name: OSKut
Version: 1.2
Summary: Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation (ACL 2020 Findings) Stacked Ensemble Framework and DeepCut as Baseline model
Home-page: https://github.com/mrpeerat/OSKut
Author: Peerat Limkonchotiwat
Author-email: peerat.limkonchotiwat@gmail.com
License: MIT
Keywords: thai,word segmentation,deep learning
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python :: 3
Classifier: Natural Language :: Thai
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Description-Content-Type: text/markdown
Requires-Dist: tensorflow (>=2.0.0)
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: numpy
Requires-Dist: scikit-learn
Requires-Dist: pyahocorasick (<=1.4.0)

# OSKut (Out-of-domain StacKed cut for Word Segmentation) 

[![](https://img.shields.io/badge/license-MIT-green)](https://github.com/mrpeerat/OSKut/blob/main/LICENSE)
[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mrpeerat/OSKut/blob/main/notebooks/OSKut_how1.ipynb)

Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation (ACL 2021 Findings) <br>
Stacked Ensemble Framework and DeepCut as Baseline model<br>

## Read more:
- Paper: [Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation](https://aclanthology.org/2021.findings-acl.86.pdf)
- Related Work (EMNLP2020): [Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble](https://www.aclweb.org/anthology/2020.emnlp-main.315/)
- Blog: [Domain Adaptation กับตัวตัดคำ มันดีย์จริงๆ](https://medium.com/@pingloaf)

## Citation
```
@inproceedings{limkonchotiwat-etal-2021-handling,
    title = "Handling Cross- and Out-of-Domain Samples in {T}hai Word Segmentation",
    author = "Limkonchotiwat, Peerat  and
      Phatthiyaphaibun, Wannaphong  and
      Sarwar, Raheem  and
      Chuangsuwanich, Ekapol  and
      Nutanong, Sarana",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.86",
    doi = "10.18653/v1/2021.findings-acl.86",
    pages = "1003--1016",
}
```

## Install
> pip install OSKut

## How To use
### Requirements
- python >= 3.6
- tensorflow >= 2.0

## Example
- Example files are on [OSKut Example notebook](https://github.com/mrpeerat/OSKut/blob/main/notebooks/OSKut_how1.ipynb)
- [Try it on Colab](https://colab.research.google.com/github/mrpeerat/OSKut/blob/main/notebooks/OSKut_how1.ipynb)
### Load Engine & Engine Mode
- ws, tnhc, and BEST !!
  - ws: The model trained on Wisesight-1000 and test on Wisesight-160
  - ws-augment-60p: The model trained on Wisesight-1000 augmented with top-60% entropy
  - tnhc: The model trained on TNHC (80:20 train&test split with random seed 42)
  - BEST: The model trained on BEST-2010 Corpus (NECTEC)
  - SCADS: The model trained on VISTEC-TP-TH-2021 Corpus (VISTEC)
  ```python
  oskut.load_model(engine='ws')
  # OR
  oskut.load_model(engine='ws-augment-60p')
  # OR
  oskut.load_model(engine='tnhc')
  # OR
  oskut.load_model(engine='best')
  # OR
  oskut.load_model(engine='scads')
  # OR
  ```
- tl-deepcut-XXXX
  - We also provide transfer learning of deepcut on 'Wisesight' as tl-deepcut-ws, 'TNHC' as tl-deepcut-tnhc, and 'LST20' as tl-deepcut-lst20
  ```python
  oskut.load_model(engine='tl-deepcut-ws')
  # OR
  oskut.load_model(engine='tl-deepcut-tnhc')
  ```
- deepcut
  - We also provide the original deepcut
  ```python
  oskut.load_model(engine='deepcut')
  ```
### Segment Example
You need to read the paper to understand why we have $k$ value!
- Tokenize with default k-value
  ```python
  oskut.load_model(engine='ws')
  print(oskut.OSKut(['เบียร์ยูไม่อร่อยสัดๆๆๆๆๆฟๆ']))
  print(oskut.OSKut('เบียร์ยูไม่อร่อยสัดๆๆๆๆๆฟๆ'))

  ['เบียร์', 'ยู', 'ไม่', 'อร่อย', 'สัด', 'ๆ', 'ๆ', 'ๆ', 'ๆ', 'ๆฟ', 'ๆ']
  ['เบียร์', 'ยู', 'ไม่', 'อร่อย', 'สัด', 'ๆ', 'ๆ', 'ๆ', 'ๆ', 'ๆฟ', 'ๆ']
  ```
- Tokenize with a various k-value
  ```python
  oskut.load_model(engine='ws')
  print(oskut.OSKut('เบียร์ยูไม่อร่อยสัดๆๆๆๆๆฟๆ',k=5)) # refine only 5% of character number
  print(oskut.OSKut('เบียร์ยูไม่อร่อยสัดๆๆๆๆๆฟๆ',k=100)) # refine 100% of character number

  ['เบียร์', 'ยู', 'ไม่', 'อร่อย', 'สัด', 'ๆ', 'ๆ', 'ๆ', 'ๆ', 'ๆฟๆ']
  ['เบียร์', 'ยู', 'ไม่', 'อร่อย', 'สัด', 'ๆ', 'ๆ', 'ๆ', 'ๆ', 'ๆฟ', 'ๆ']
  ```

## New datasets!!
VISTEC-TP-TH-2021 (VISTEC), which consists of 49,997 text samples from Twitter (2017-2019). \
VISTEC corpus contains 49,997 sentences with 3.39M words where the collection was manually annotated by linguists on four tasks, namely word segmentation, misspelling detection and correction, and named entity recognition. \
For more information and download [click here](https://github.com/mrpeerat/OSKut/tree/main/VISTEC-TP-TH-2021)

## Performance
### Model
<img src="https://user-images.githubusercontent.com/21156980/117925237-24f10500-b321-11eb-8e69-8efee577e1d7.png" width="600"/>

### Without Data Augmentation
<img src="https://user-images.githubusercontent.com/21156980/117925463-75686280-b321-11eb-8e39-fcdae3c569ea.png" width="600"/>

### With Data Augmentation
<img src="https://user-images.githubusercontent.com/21156980/117925342-4d78ff00-b321-11eb-80fa-59d71ce46a5a.png" width="600"/>
<img src="https://user-images.githubusercontent.com/21156980/117925347-4f42c280-b321-11eb-86a3-475b876b8851.png" width="600"/>


Thank you many code from

- [Deepcut](https://github.com/rkcosmos/deepcut) (Baseline Model) : We used some of code from Deepcut to perform transfer learning 



