Metadata-Version: 2.1
Name: FuzzTypes
Version: 0.0.1
Summary: FuzzTypes is a Pydantic extension for annotating autocorrecting fields
Author-email: Ian Maurer <ian@genomoncology.com>
License-File: LICENSE
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.9
Requires-Dist: pydantic>=2.6.1
Provides-Extra: ext
Requires-Dist: anyascii; extra == 'ext'
Requires-Dist: dateparser; extra == 'ext'
Requires-Dist: emoji; extra == 'ext'
Requires-Dist: lancedb; extra == 'ext'
Requires-Dist: nameparser; extra == 'ext'
Requires-Dist: number-parser; extra == 'ext'
Requires-Dist: rapidfuzz; extra == 'ext'
Requires-Dist: sentence-transformers; extra == 'ext'
Requires-Dist: tantivy; extra == 'ext'
Requires-Dist: unidecode; extra == 'ext'
Provides-Extra: local
Requires-Dist: build; extra == 'local'
Requires-Dist: ipython; extra == 'local'
Requires-Dist: jupyter; extra == 'local'
Requires-Dist: pip; extra == 'local'
Requires-Dist: setuptools; extra == 'local'
Provides-Extra: test
Requires-Dist: coverage[toml]; extra == 'test'
Requires-Dist: pytest; extra == 'test'
Description-Content-Type: text/markdown

# FuzzTypes

FuzzTypes is a set of "autocorrecting" annotation types that expands
upon [Pydantic](https://github.com/pydantic/pydantic)'s included [data
conversions](https://docs.pydantic.dev/latest/concepts/conversion_table/).
Designed for simplicity, it provides powerful normalization capabilities
(e.g. named entity linking) to ensure structured data is composed of
"smart things" not "dumb strings".

### Basic Use Case

**todo** compare and contrast with default Pydantic data conversion

### Structured Data Generation Use Case

Several libraries (e.g. [Instructor](https://github.com/jxnl/instructor),
[Outlines](https://github.com/outlines-dev/outlines),
[Marvin](https://github.com/prefecthq/marvin)) use Pydantic to define models for structured data generation
with Large Language Models (LLMs). They rely on function calling or on
grammar/regex-constrained sampling driven by the [JSON schema generated by Pydantic](https://docs.pydantic.dev/latest/concepts/json_schema/).

This approach allows for the enumeration of allowed values using
Python's `Literal`, `Enum` or JSON Schema's `examples` field directly
in your Pydantic class declaration, which is used by the LLM to
generate valid values. This approach works exceptionally well for
low-cardinality fields (few unique allowed values) such as the world's
continents (7 in total).
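
As a minimal sketch of the low-cardinality case, the following uses plain Pydantic (no FuzzTypes) to show how a `Literal` ends up as an `enum` in the generated JSON schema. The `Place` model and `Continent` alias are hypothetical names chosen for illustration.

```python
from typing import Literal

from pydantic import BaseModel

# The seven continents: a small set that fits comfortably inside a schema.
Continent = Literal[
    "Africa",
    "Antarctica",
    "Asia",
    "Europe",
    "North America",
    "Oceania",
    "South America",
]


class Place(BaseModel):  # hypothetical model, for illustration only
    name: str
    continent: Continent


# The allowed values are embedded in the JSON schema that is handed to the LLM.
schema = Place.model_json_schema()
allowed = schema["properties"]["continent"]["enum"]
print(allowed)
```

Every allowed value travels with the schema, which is why this works for seven continents but not for millions of values.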

This approach, however, doesn't scale well to high-cardinality fields (many
unique allowed values) such as known human genomic variants (~325M).
Where exactly the cutoff is between "low" and "high" cardinality is an exercise
left to the reader and their use case.

That's where FuzzTypes comes in. The allowed values are managed by the FuzzTypes
annotations and the values are resolved during the Pydantic validation process.
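
The general idea of resolving values during validation can be sketched with Pydantic's own `BeforeValidator`. This is not the FuzzTypes API: the `ALIASES` table, `resolve_country` function, and `Address` model are hypothetical stand-ins for a lookup that FuzzTypes would manage for you.

```python
from typing import Annotated

from pydantic import BaseModel, BeforeValidator

# Hypothetical alias table: lowercased alias -> canonical name.
ALIASES = {
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
    "deutschland": "Germany",
    "germany": "Germany",
}


def resolve_country(value):
    """Map a raw input string to its canonical name, or fail validation."""
    if not isinstance(value, str):
        raise ValueError("country must be a string")
    key = value.strip().lower()
    if key not in ALIASES:
        raise ValueError(f"unknown country: {value!r}")
    return ALIASES[key]


# The lookup runs before Pydantic's own `str` validation.
CountryName = Annotated[str, BeforeValidator(resolve_country)]


class Address(BaseModel):  # hypothetical model, for illustration only
    country: CountryName


address = Address(country="  USA ")
print(address.country)
```

The allowed values live in the lookup rather than in the schema, so the set can grow without changing the model declaration.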

## Base Types

| Type       | Description                                                                                                                               |
|------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| Alias      | Match by name or alias.                                                                                                                   |
| Function   | Match by calling a custom function.                                                                                                       |
| Fuzz       | Match by name or alias via fuzzy string similarity using [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz).                             |
| Hybrid     | Match by name or alias via [reciprocal rank](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) fusion of semantic and fuzzy similarity. |
| Name       | Match by name only.                                                                                                                       |
| Regex      | Match by regular expression pattern using `re` standard library.                                                                          |
| Semantic   | Match by name or alias via vector-based semantic similarity using [PyNNDescent](https://github.com/lmcinnes/pynndescent).                 |
| Typeahead  | Match by name or alias prefix via Trie lookups with fuzzy or semantic fallback.                                                           |
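
To illustrate the rank fusion idea behind the Hybrid row, here is a generic sketch of reciprocal rank fusion using only the standard library. It is not FuzzTypes' implementation: the function name, the constant `k=60` (a commonly used default), and the example rankings are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each item scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical results from a fuzzy matcher and a semantic matcher.
fuzzy_ranking = ["aspirin", "asparagine", "aspartame"]
semantic_ranking = ["acetylsalicylic acid", "aspirin", "ibuprofen"]

fused = reciprocal_rank_fusion([fuzzy_ranking, semantic_ranking])
print(fused[0])
```

An item that ranks well in both lists ("aspirin" here) outscores an item that tops only one of them.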

## Usable Types

| Type         | Description                                                                                                                                                      |
|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ASCII        | Convert Unicode string to ASCII equivalent using [anyascii](https://github.com/anyascii/anyascii).                                                               |
| Airport      | Represents airport names (e.g., O'Hare International Airport) for detailed aviation-related data.                                                                |
| AirportCode  | Manages airport codes (e.g., ORD) for quick and standardized airport identification.                                                                             |
| CleanURL     | Normalized URL with trackers removed using [url-normalize](https://github.com/niksite/url-normalize).                                                            |
| Country      | Represents country names, such as Germany or United States, for standardized country identification.                                                             |
| CountryCode  | Handles ISO country codes (e.g., DE, GB, US) for concise representation of countries.                                                                            |
| Currency     | Handles currency codes (e.g., USD) for financial transactions and currency representation.                                                                       |
| Date         | Convert date strings to `Date` object using [DateParser](https://pypi.org/project/dateparser/).                                                                  |
| Email        | Regex for extracting a single valid email from a string.                                                                                                         |
| Emoji        | Matches emojis based on Unicode Consortium aliases. Utilizes the [Emoji project](https://github.com/carpedm20/emoji/) for matching.                              |
| Integer      | Convert number or ordinal text to an `int` using [NumberParser](https://github.com/scrapinghub/number-parser/).                                                  |
| Language     | Manages full language names (e.g., English, German) for clear language specification.                                                                            |
| LanguageCode | Deals with ISO language codes (e.g., en, de) for brief language identification.                                                                                  |
| Person       | Parse human name into subfields (e.g. first, last, suffix) using [python-nameparser](https://github.com/derek73/python-nameparser?tab=License-1-ov-file#readme). |
| Quantity     | Converts strings to Quantity objects, combining value and unit of measurement, via [Pint](https://github.com/hgrecco/pint).                                      |
| SSN          | Regex for extracting a single social security number from a string.                                                                                              |
| Time         | Convert date time strings to `DateTime` object using [DateParser](https://pypi.org/project/dateparser/).                                                         |
| USState      | Represents U.S. state names (e.g., Ohio) for detailed geographical categorization within the United States.                                                      |
| USStateCode  | Manages U.S. state codes (e.g., OH) for abbreviated state representation.                                                                                        |
| Zipcode      | Regex for extracting a 5 or 9 digit zipcode from a string.                                                                                                       |
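
The regex-based types can be pictured with a small standard-library sketch. The pattern and the `extract_zipcode` helper below are assumptions made for illustration and may differ from the expression FuzzTypes actually uses.

```python
import re

# Assumed pattern: 5 digits, optionally followed by a hyphen and 4 more digits.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def extract_zipcode(text):
    """Return the first 5 or 9 digit zipcode found in text, or None."""
    match = ZIP_PATTERN.search(text)
    return match.group(0) if match else None


print(extract_zipcode("Ship to Columbus, OH 43215-1234 by Friday"))
```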

## Common Arguments

| Argument        | Type    | Description                                                                                                               |
|-----------------|---------|---------------------------------------------------------------------------------------------------------------------------|
| case_sensitive  | bool    | If False, matches regardless of case. If True, matches only if case is exact. Default False.                              |
| examples        | list    | Example values used in schema generation.                                                                                 |
| notfound_mode   | Literal | raise: Raises an error if key not found. none: Returns None if key not found. allow: Returns key if not found.            |
| tiebreaker_mode | Literal | raise: Raises error if tied (value, priority). lesser: Returns lower value answer. greater: Returns greater value answer. |
| validator_mode  | str     | before: Resolves value before validation. *Currently the only tested option.*                                             |
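
The semantics of `case_sensitive` and `notfound_mode` described in the table can be sketched in plain Python. The `lookup` function is a hypothetical helper, not part of FuzzTypes; it only mirrors the behavior the table describes.

```python
def lookup(key, names, case_sensitive=False, notfound_mode="raise"):
    """Resolve key against names, following the table's described behavior."""
    # Index names by exact text, or by lowercased text when case is ignored.
    table = {(n if case_sensitive else n.lower()): n for n in names}
    probe = key if case_sensitive else key.lower()
    if probe in table:
        return table[probe]
    if notfound_mode == "none":
        return None  # none: return None if the key is not found
    if notfound_mode == "allow":
        return key  # allow: pass the unmatched key through unchanged
    raise KeyError(f"key not found: {key!r}")  # raise: fail loudly


states = ["Ohio", "Oregon"]
print(lookup("ohio", states))
print(lookup("ohio", states, case_sensitive=True, notfound_mode="none"))
print(lookup("Mars", states, notfound_mode="allow"))
```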


## Lazy Dependencies

FuzzTypes leverages several powerful libraries to extend its functionality.

These dependencies are not installed by default with FuzzTypes to keep the
installation lightweight. Instead, they are optional and can be installed
as needed depending on which types you use.

Below is a list of these dependencies, including their licenses and what
specific Types require them.

| Type     | Dependency            | License | Usage                                                                                                     |
|----------|-----------------------|---------|-----------------------------------------------------------------------------------------------------------|
| ASCII    | anyascii              | ISC     | An alternative to unidecode for Unicode to ASCII conversion, offering extensive character mapping.        |
| ASCII    | unidecode             | GPL     | Converts Unicode strings to their ASCII equivalents, providing broad character support with minimal size. |
| Date     | dateparser            | BSD-3   | Parses date strings in almost any string format to `Date` objects, supporting multiple locales.           |
| Emoji    | emoji                 | BSD     | Matches emojis based on Unicode Consortium aliases, enhancing text processing with emoji support.         |
| Fuzz     | rapidfuzz             | MIT     | Performs fuzzy string matching to find close matches to names or aliases with high performance.           |
| Integer  | number-parser         | BSD-3   | Converts number or ordinal text to integers, handling both written and numerical forms.                   |
| Person   | nameparser            | LGPL    | Parses human names into subfields (e.g., first, last, suffix), aiding in structured name handling.        |
| Semantic | pynndescent           | MIT     | Fast Approximate Nearest Neighbors library for retrieving similar text.                                   |
| Semantic | sentence-transformers | MIT     | Default embedding library for encoding text into dense vector embeddings.                                 |
