Metadata-Version: 2.1
Name: alea-data-generator
Version: 0.1.1
Summary: ALEA low-level data generation techniques (procedural, KL3M)
Home-page: https://aleainstitute.ai/
License: MIT
Keywords: alea,synthetic,data,kl3m
Author: ALEA Institute
Author-email: hello@aleainstitute.ai
Requires-Python: >=3.10,<4.0.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Dist: faker (>=28.4.1,<29.0.0)
Requires-Dist: httpx[http2] (>=0.27.2,<0.28.0)
Requires-Dist: jinja2 (>=3.1.4,<4.0.0)
Requires-Dist: numpy (>=2.1.1,<3.0.0)
Requires-Dist: rapidfuzz (>=3.9.7,<4.0.0)
Requires-Dist: scipy (>=1.14.1,<2.0.0)
Requires-Dist: tqdm (>=4.66.5,<5.0.0)
Project-URL: Repository, https://github.com/alea-institute/alea-data-generator
Description-Content-Type: text/markdown

# ALEA Data Generator

[![PyPI version](https://badge.fury.io/py/alea-data-generator.svg)](https://badge.fury.io/py/alea-data-generator)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python Versions](https://img.shields.io/pypi/pyversions/alea-data-generator.svg)](https://pypi.org/project/alea-data-generator/)

This is a basic synthetic data generation and perturbation library designed by the ALEA Institute
to support the creation and augmentation of data without relying on "tainted" LLMs.

Data generation techniques in this library:
 * do not require the use of any LLM or external data source
 * can be used with [KL3M](https://kl3m.ai), our Fairly Trained LLM

## Supported Patterns

The following data generation patterns are supported:

 * [x] Simple string templates with sampled values (e.g., `This Agreement, by and between <|company:a|> and <|company:b|>, is made as of <|date|>.`)
   - [x] Faker integration for common data types (e.g., names, addresses, dates, etc.)
 * [x] Large templates with sampled values (e.g., `jinja2` templates in files)
 * [ ] Common document types (e.g., emails, contracts, memos, etc. using templates)
 * [ ] Data perturbation (e.g., realistic errors introduced by humans, OCR, or other automated systems)
   - [x] Skipping, doubling, or transposing/swapping characters
   - [x] Skipping, doubling, or transposing/swapping tokens
   - [x] QWERTY and mobile keyboard mistakes (off-by-one key, shift errors, etc.)
   - [ ] Homophones (e.g., `their` vs. `there`)
   - [ ] Synonyms (e.g., `big` vs. `large`)
   - [ ] Negation/antonyms (e.g., `big` vs. `small`)
   - [ ] Capitalization errors (e.g., `big` vs. `Big`)
   - [ ] Punctuation errors (e.g., `big` vs. `big.`)
   - [ ] OCR-like errors (e.g., misreading characters, smudges, etc.)
 * [ ] Representation conversion (e.g., `429` to `four hundred twenty-nine` or `four twenty-nine`)
 * [ ] Format conversion (e.g., Markdown <-> HTML variants)
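
The simple string template pattern above can be illustrated with a short, standard-library-only sketch. This is not the library's API: `SAMPLERS`, `TAG_PATTERN`, and `fill_template` are hypothetical names, the value pools are made up, and the rule that tags sharing a type and label (e.g., `<|company:a|>`) reuse one sampled value is an assumption about how labels might behave.

```python
import random
import re

# Hypothetical value pools for illustration only; the library itself
# integrates with Faker for common data types.
SAMPLERS = {
    "company": lambda rng: rng.choice(["Acme Corp.", "Globex LLC", "Initech, Inc."]),
    "date": lambda rng: "{} {}, {}".format(
        rng.choice(["January", "June", "October"]),
        rng.randint(1, 28),
        rng.randint(2015, 2024),
    ),
}

# Matches <|type|> and <|type:label|> tags.
TAG_PATTERN = re.compile(r"<\|(\w+)(?::(\w+))?\|>")


def fill_template(template, rng):
    """Replace each tag in the template with a sampled value.

    Assumption: tags with the same type and label share one value, so a
    labeled entity stays consistent across the text. Unlabeled tags are
    sampled independently each time they appear.
    """
    cache = {}

    def substitute(match):
        kind, label = match.group(1), match.group(2)
        if label is None:
            return SAMPLERS[kind](rng)
        key = (kind, label)
        if key not in cache:
            cache[key] = SAMPLERS[kind](rng)
        return cache[key]

    return TAG_PATTERN.sub(substitute, template)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducible samples
    print(fill_template(
        "This Agreement, by and between <|company:a|> and <|company:b|>, "
        "is made as of <|date|>.",
        rng,
    ))
```

Passing an explicit `random.Random` instance keeps sampling reproducible, which helps when regenerating the same synthetic dataset.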

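The character-level perturbations listed above (skipping, doubling, transposing, and off-by-one keyboard mistakes) can be sketched in the same spirit. The function names and the partial `QWERTY_NEIGHBORS` map below are illustrative assumptions, not the library's implementation.

```python
import random


def skip_char(text, rng):
    """Drop one randomly chosen character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def double_char(text, rng):
    """Repeat one randomly chosen character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i + 1] + text[i] + text[i + 1:]


def swap_chars(text, rng):
    """Transpose one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


# Partial, illustrative adjacency map for a QWERTY layout (assumption:
# a real map would cover every key and possibly shifted characters).
QWERTY_NEIGHBORS = {
    "a": "qwsz",
    "s": "awedxz",
    "e": "wsdr",
    "t": "rfgy",
    "o": "iklp",
    "n": "bhjm",
}


def neighbor_key(text, rng):
    """Replace one mapped character with a key adjacent to it."""
    positions = [i for i, c in enumerate(text) if c in QWERTY_NEIGHBORS]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + rng.choice(QWERTY_NEIGHBORS[text[i]]) + text[i + 1:]


if __name__ == "__main__":
    rng = random.Random(42)  # seeded for reproducible perturbations
    for perturb in (skip_char, double_char, swap_chars, neighbor_key):
        print(perturb.__name__, "->", perturb("agreement", rng))
```

The same skip, double, and swap operations carry over to token-level perturbation by applying them to a list of tokens instead of a string of characters.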

## Future Roadmap

 * Document image generation for document/OCR models

## License

The ALEA Data Generator library is released under the MIT License. See the [LICENSE](LICENSE) file for details.

Some of the data generation techniques in this library may also retrieve data from external sources,
which have their own licensing terms. These terms are documented in the `alea-data-resources` repository:

 * [alea-data-resources](https://github.com/alea-institute/alea-data-resources)

See, e.g., the CMU Pronouncing Dictionary (`cmudict`), which is used in tasks like homophonic errors:

 * [cmudict metadata](https://github.com/alea-institute/alea-data-resources/blob/v0.1.0/alea_data_resources/sources/cmudict.py#L10)

## Support

If you encounter any issues or have questions about using the ALEA Data Generator library, please [open an issue](https://github.com/alea-institute/alea-data-generator/issues) on GitHub.

## Learn More

To learn more about ALEA and its software and research projects like KL3M and leeky, visit the [ALEA website](https://aleainstitute.ai/).

