Metadata-Version: 2.4
Name: lokit-python
Version: 0.3.1
Summary: Python localization toolkit for parsing, converting, and matching TMX, XLIFF, PO, JSON, HTML, CSV, XLSX, IDML, DOCX, and PPTX. Includes direct translation memory database ingestion.
License-Expression: MIT
Project-URL: Homepage, https://lokit.org
Project-URL: Repository, https://github.com/ciarandarby/lokit
Project-URL: Issues, https://github.com/ciarandarby/lokit/issues
Project-URL: Documentation, https://lokit.org
Keywords: localization,l10n,i18n,translation,translation-memory,translation-memory-database,tmx,tmx-parser,xliff,xliff-parser,gettext,po,idml,docx,pptx,xlsx,csv,json,html,type-safe,async,streaming,mypy,mypyc,backend-localization,parsing,parsers,localization-parsers,performance,internationalization,backend,memory-safe,translate-toolkit-alternative,postgresql-localization,python-i18n-tools
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development :: Internationalization
Classifier: Topic :: Software Development :: Localization
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: XML
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: lxml>=6.1.1
Requires-Dist: psycopg[pool]>=3.2
Requires-Dist: psycopg-binary>=3.2; platform_system != "Windows"
Requires-Dist: python-calamine>=0.6.2
Requires-Dist: polib>=1.2.0
Requires-Dist: rustpy-xlsxwriter<0.5,>=0.4.4
Requires-Dist: tqdm>=4.66
Provides-Extra: office
Provides-Extra: accelerators
Requires-Dist: numba>=0.63; extra == "accelerators"
Provides-Extra: perf
Requires-Dist: numba>=0.63; extra == "perf"
Requires-Dist: psutil>=5.9; extra == "perf"
Provides-Extra: db
Provides-Extra: db-aws
Requires-Dist: boto3>=1.35; extra == "db-aws"
Provides-Extra: db-gcp
Requires-Dist: cloud-sql-python-connector>=1.12; extra == "db-gcp"
Dynamic: license-file

# Lokit

[![PyPI Downloads](https://static.pepy.tech/personalized-badge/lokit-python?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=BLUE&left_text=downloads)](https://pepy.tech/projects/lokit-python)

> [!WARNING]
> **Beta Release:** lokit is currently in Beta. The API is volatile and subject to rapid, breaking changes prior to the official V1 release.

<br>

Lokit is a high-performance, strictly type-safe, and highly memory-efficient localization toolkit for Python.

<br>

Supports Python 3.10+.

<br>
<hr>
<br>

Unlike legacy tools that hold XML DOM element trees in memory, lokit moves away from XML-based localization interchange formats towards language-native parsing. It ingests localization formats (TMX, XLIFF, PO, XLSX, CSV, JSON, HTML, IDML, DOCX, PPTX) and compiles them into a strict, unified data model. This enables not just parsing but also data manipulation, semantic extraction, and translation memory features out of the box. Lokit favors streaming and asynchronous processing over synchronous handling of files loaded fully into memory.

<br>

This common structure can be easily converted to JSON for interchange with other systems. I've made parsing and data transfer as native as possible by capturing all elements of traditional interchange formats in one common structure. This improves compatibility, especially for segment matching and leveraging, because it uses flattened strings as standard. Tags are preserved in a common form, so the structure parsed from XLIFF is the same as the structure parsed from HTML.
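
<br>

As a rough illustration of the idea, the sketch below models a unit as flattened text plus tag records kept as data, then serializes it to JSON. The class and field names (`InlineTag`, `Segment`, `TranslationUnit`) are hypothetical and chosen for this example only; they are not lokit's actual data model.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class InlineTag:
    """Hypothetical record for one inline tag, stored apart from the text."""
    kind: str   # e.g. "bold", "link", "placeholder"
    start: int  # offset into the flattened text where the tagged span begins
    end: int    # offset where the tagged span ends (exclusive)


@dataclass(frozen=True)
class Segment:
    """Flattened text plus the tags that were lifted out of it."""
    locale: str
    text: str
    tags: list[InlineTag] = field(default_factory=list)


@dataclass(frozen=True)
class TranslationUnit:
    unit_id: str
    source: Segment
    target: Segment


# The same structure could come from XLIFF or from HTML such as
# "Click <b>here</b>": the text is flattened and the tag is kept as data.
unit = TranslationUnit(
    unit_id="u1",
    source=Segment("en-US", "Click here", [InlineTag("bold", 6, 10)]),
    target=Segment("fr-FR", "Cliquez ici", [InlineTag("bold", 8, 11)]),
)

# asdict() recurses through nested dataclasses and lists.
payload = json.dumps(asdict(unit), ensure_ascii=False)
print(payload)
```

Because the text is stored without markup, string comparison between units does not depend on which file format they were read from.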

<br>

These legacy file formats have supported vendor lock-in for many years, making it difficult for clients to move to another system. This is a major issue across the domain, and something new is needed so that vendors cannot use hidden, legacy technology to lock in their clients. Localization deserves innovation. Lokit is the first open source package that supports direct ingestion from localization interchange formats into a database, even outside the Python ecosystem.

<br>
<hr>

> The main premise here is a common, structured and type-safe dataclass model structure that is intentionally compatible with any file format, not just localization interchange formats, although these are optimized for performance and memory efficiency due to the verbose nature of XML based formats.

<br>

Note: This project was originally written in Rust, and that version is still unreleased. Adding Rust extensions did not show a major performance improvement over the current C-extension modules because of bridging overhead; this will be revisited in future releases. SDKs in other languages, including the Rust prototype, are coming soon.

<br>

## Core Features

<br>

Lokit provides a comprehensive suite of tools for managing localization data:

* **Native Structural Modeling:** Converts interchange formats into strict, unified Python dataclasses, ensuring complete type safety.
* **Advanced Matching Engine:** Provides Exact Matching, Fuzzy Matching (via SequenceMatcher), and In-Context Exact (ICE) Matching that uses the previous and next segments as context, with support for inline tags.
* **Sub-segment Extraction:** Automatically parses and isolates inline tags, properties, and formatting markers, allowing for safe manipulation of text without corrupting code.
* **Semantic Querying:** Easily filter translation units using any attribute, exact ID lookups, or deep nested JSON path querying (`where()`).
* **Plural Support:** Native extraction and structuring of pluralized translation units, compatible with UI frameworks.
* **Universal Format Conversion:** Instantly import and export between any supported format (e.g., TMX to JSON, HTML to XLIFF) with zero data loss.
* **Synchronous and Asynchronous Streaming:** Process massive enterprise files natively using Python async generators to keep memory overhead to an absolute minimum.
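
<br>

To make the matching terms above concrete, here is a minimal standard-library sketch of fuzzy ranking with `difflib.SequenceMatcher` and an in-context exact check. The `Unit` class and both function names are hypothetical stand-ins for this example; lokit's own engine may normalize text, weight tags, or score differently.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Unit:
    """Hypothetical translation unit used only for this sketch."""
    unit_id: str
    source: str
    previous_source: str = ""
    next_source: str = ""


def fuzzy_find(units, query, limit=5, threshold=0.75):
    """Rank units by SequenceMatcher ratio, keeping scores >= threshold."""
    scored = [
        (SequenceMatcher(None, query.lower(), unit.source.lower()).ratio(), unit)
        for unit in units
    ]
    hits = [(score, unit) for score, unit in scored if score >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:limit]


def is_ice_match(unit, source, previous_source, next_source):
    """In-context exact: the text and both neighbouring segments are identical."""
    return (
        unit.source == source
        and unit.previous_source == previous_source
        and unit.next_source == next_source
    )


units = [
    Unit("u1", "Complete your purchase", "Review your cart", "Thank you"),
    Unit("u2", "Complete your profile"),
    Unit("u3", "Cancel order"),
]

hits = fuzzy_find(units, "Complete your purchase!")
for score, unit in hits:
    print(f"{unit.unit_id}: {score:.3f}")

ice = is_ice_match(units[0], "Complete your purchase", "Review your cart", "Thank you")
print("ICE match:", ice)
```

An ICE match is stricter than an exact match because the same string can need different translations depending on what surrounds it.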

<br>

### Type Safety and C-Extensions

<br>

The entire library is strictly typed and mypy compliant, strict enough that it compiles to C extensions via mypyc, which ship prebuilt in the wheels. Additionally, all XML processing uses C-based packages. Compiling to these extensions has shown a 23% overall performance increase over the pure-Python modules, with additional benefits such as lower memory usage. C extensions are standard on macOS (ARM and Intel), Windows, and Linux.

<br>

## Parsing Performance

<br>

When dealing with enterprise-scale localization environments, parsing performance and memory efficiency are paramount. lokit is designed to be significantly leaner and faster than the industry standard.

<br>

We benchmarked lokit's modules against their equivalents in `translate-toolkit`, the de facto, feature-rich standard for localization file format parsing and conversion in Python.

<br>

In a stress-test benchmark on a 600+ MB `.TMX` file containing **557,058 segments**, converting to JSON with `Lokit.to_json_async()` over 3 iterations yielded the following comparative averages:

<br>

| Library | Avg Duration | Peak Memory | Relative Peak Memory |
|---------|--------------|-------------|----------------------|
| **lokit** | 13.57 s | 135.9 MB | ~15x less |
| **translate-toolkit** | 20.30 s | 2,034.5 MB (~2.0 GB) | baseline |

<br>

Both tests covered TMX to JSON conversion with inline tag sanitization, each using its own package's tooling.
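
<br>

The ratios implied by the benchmark figures can be checked with a few lines of arithmetic. The snippet below only restates the numbers reported above; the throughput values are derived from them and were not measured separately.

```python
# Figures copied from the benchmark results above.
segments = 557_058
lokit_seconds, lokit_peak_mb = 13.57, 135.9
ttk_seconds, ttk_peak_mb = 20.30, 2034.5

memory_ratio = ttk_peak_mb / lokit_peak_mb    # how many times less peak memory
speedup = ttk_seconds / lokit_seconds         # how many times faster
lokit_throughput = segments / lokit_seconds   # segments per second
ttk_throughput = segments / ttk_seconds

print(f"peak memory ratio: {memory_ratio:.1f}x")
print(f"speedup: {speedup:.2f}x")
print(f"lokit throughput: {lokit_throughput:,.0f} segments/s")
print(f"translate-toolkit throughput: {ttk_throughput:,.0f} segments/s")
```

On these numbers lokit uses roughly 15 times less peak memory and finishes about 1.5 times faster.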

<br>

The major focus on low memory usage allows for parallel processing, making lokit suitable for large-scale localization workflows and backend systems.

<br>

**Note:** this package is not a replacement or substitute for the already amazing translate-toolkit. The functionality differs considerably between the two libraries, and each has its own use cases.

<br>
<hr>

## SDK Usage Reference

<br>

Lokit operates around a central `BaseStructure` dataclass model, which standardizes localization units and segments. This gives better standardization and branching in a more language-native way than XML-based file formats. Parsing SDKs cover both extraction and export for localization interchange formats as well as common file types.

<br>

### Installation

<br>

Install lokit via pip:

```bash
pip install lokit-python
```

<br>

### Basic Parsing and Conversion

<br>

Converting files synchronously is straightforward through the modular `lokit` API. Import the package once, then use `lokit.parse` and `lokit.parse.write`.

```python
import lokit

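# Each parser returns a document in the common structure.
# The three calls are alternatives; as written, the last assignment wins.
document = lokit.parse.tmx("path/to/source.tmx")
document = lokit.parse.docx("path/to/document.docx")
document = lokit.parse.pptx("path/to/presentation.pptx")

# Write through the module-level writer, or export from the document itself.
lokit.parse.write.xliff(document, "path/to/target.xliff")
document.export.csv("path/to/target.csv")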
```

<br>

### Asynchronous Streaming for Large Interchange Files

<br>

For files spanning hundreds of megabytes, parsing the entire DOM structure into memory is inefficient. Lokit supports stream-parsing natively.

<br>

Here's a simple script to show how easy it is. The program needs no boilerplate and could be reduced to a few lines of code, but for the purpose of showcasing we added some wrapper functions. The stream APIs read the static attributes, such as language codes, once and keep them immutable, then quickly stream the mutable data. All other parsing modules also use streaming to parse to and from the common typed format.

```python
import asyncio
import os

import lokit

input_dir = "data/language_tmx"
output_dir = "data/out"


async def convert_to_json(filepath: str):
    print(f"Starting: {filepath}")
    output = f"{output_dir}/{os.path.splitext(os.path.basename(filepath))[0]}.json"
    await lokit.stream.async_.json(
        filepath=filepath,
        output=output,
    )
    print(f"Completed: {output}")


async def process():
    # exist_ok=True makes a separate existence check unnecessary.
    os.makedirs(output_dir, exist_ok=True)

    # Convert every file in the input directory concurrently.
    files = [os.path.join(input_dir, i) for i in os.listdir(input_dir)]
    tasks = [convert_to_json(filepath=file) for file in files]
    await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(process())
```
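
For intuition about why stream-parsing keeps memory low, the following standard-library sketch reads a tiny TMX document one `<tu>` at a time with `xml.etree.ElementTree.iterparse` and releases each unit after it is consumed. It illustrates the general technique only and is not lokit's implementation; `stream_units` and the sample document are made up for this example.

```python
import io
import xml.etree.ElementTree as ET

TMX_SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en-US" datatype="plaintext" segtype="sentence"
          adminlang="en-US" creationtool="sketch" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu tuid="1">
      <tuv xml:lang="en-US"><seg>Hello</seg></tuv>
      <tuv xml:lang="fr-FR"><seg>Bonjour</seg></tuv>
    </tu>
    <tu tuid="2">
      <tuv xml:lang="en-US"><seg>Goodbye</seg></tuv>
      <tuv xml:lang="fr-FR"><seg>Au revoir</seg></tuv>
    </tu>
  </body>
</tmx>
"""

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def stream_units(source):
    """Yield one dict per <tu>, releasing each element once it is consumed."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag != "tu":
            continue
        unit = {"id": element.get("tuid")}
        for tuv in element.iter("tuv"):
            seg = tuv.find("seg")
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext()) if seg is not None else ""
        yield unit
        # Drop the unit's subtree; only an empty <tu> stub stays in the tree.
        element.clear()


units = list(stream_units(io.BytesIO(TMX_SAMPLE)))
for unit in units:
    print(unit)
```

A DOM-style parse would keep every unit alive until the end of the file, which is where the gap in peak memory on large files comes from.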

<br>

### Advanced Querying and Matching

<br>

The `Lokit` logic wrapper provides access to the matching engine and data manipulation features. It is not a substitute for enterprise database semantic search, but it can be used as a follow-up step to evaluate matching results after retrieving translation units from a semantic/vector database.

```python
import lokit

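# Parse a file into the Lokit logic wrapper.
engine = lokit.Lokit.parse("path/to/source.xliff")

# Filter units by a nested attribute path.
button_units = engine.where("extensions.component", "checkout_button")

# Fuzzy matching: at most 5 results, with a 0.75 score threshold.
results = engine.fuzzy_find("Complete your purchase", limit=5, threshold=0.75)
for match in results:
    print(f"Match found: {match.unit_id} (Score: {match.score})")

# In-context exact (ICE) matching using the preceding segment as context.
ice_match = engine.match(
    source="Submit",
    target_unit_id="submit_btn_1",
    previous_source="Enter your email",
    require_context=True
)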
```

### Structured API Paths

The preferred public API is available from a single package import:

```python
import lokit

document = lokit.parse.file("path/to/source.tmx")
document = lokit.parse.csv("path/to/source.csv", source_locale="en-US")
document = lokit.parse.docx("path/to/source.docx")
streamed_tmx = lokit.stream.tmx("path/to/source.tmx")
streamed_docx = lokit.stream.docx("path/to/source.docx")


async def stream_to_json() -> None:
    await lokit.stream.async_.json("path/to/source.tmx", "path/to/out")

lokit.parse.write.csv(document, "path/to/target.csv")
document.export.xliff("path/to/target.xliff")
document.export.docx("path/to/translated.docx", source_docx="path/to/source.docx")


async def export_xlsx() -> None:
    await lokit.parse.write.async_.xlsx(document, "path/to/target.xlsx")

CsvExtractor = lokit.parsers.extractors.csv
```

Predefined conversions live under `lokit.quick_parse`:

```python
import lokit

stats = lokit.quick_parse.tmx_to_csv("path/to/source.tmx", "path/to/target.csv")
lokit.quick_parse.csv_to_xliff("path/to/source.csv", "path/to/target.xliff")
```

<br>

### PostgreSQL API

Install the optional database extra to use PostgreSQL-backed translation memory:

```bash
pip install "lokit-python[db]"
```

```python
import lokit


async def load_and_match() -> None:
    tm = await lokit.database.connect("postgresql://localhost/lokit_tm")
    async with tm:
        await tm.setup()

        stream = lokit.stream.tmx("translation_memory.tmx")
        await tm.load(stream)

        results = await tm.match(
            source="Roses are red",
            source_locale="en-US",
            target_locale="fr-FR",
            limit=5,
            threshold=0.5,
        )
        print(results[0].unit_id, results[0].kind, results[0].score)
```

The database stores plain source and target text in PostgreSQL, uses `pg_trgm`
for exact/fuzzy lookup, and reconstructs lokit `Data` objects with tags,
comments, and metadata along with adjacent context. This allows for plain string matching with tag and metadata propagation.
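
To give a feel for what trigram similarity measures, the sketch below approximates the `pg_trgm` approach in pure Python: each word is lowercased, padded with two leading spaces and one trailing space, and split into three-character windows, and similarity is the share of trigrams two strings have in common. This is an approximation for illustration (for example, `\w` also treats underscores as word characters); the real scoring happens inside PostgreSQL.

```python
import re


def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm: lowercase, split into words, pad, take 3-grams."""
    result: set[str] = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


query = "Roses are red"
candidates = ["Roses are red", "Roses are dead", "Violets are blue"]

# Rank stored sources by similarity to the query, best first.
ranked = sorted(candidates, key=lambda text: similarity(query, text), reverse=True)
for text in ranked:
    print(f"{similarity(query, text):.3f}  {text}")
```

Since the comparison runs on plain text, tags and metadata do not affect the score and can be reattached to the matched unit afterwards.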

<br>

### Enterprise Database Support

Building on the local database ingestion and runtime logic above, Lokit has direct connection APIs to external enterprise services.
It currently supports AWS (RDS and Aurora) and GCP (Cloud SQL and AlloyDB), along with the serverless platforms Supabase and Neon.
Pipelining is supported and enabled by default, but it is configurable.
Dual read and write URIs are also accepted for maximum performance, while a single URI can still be used for simplicity or where the service does not support separate endpoints.

The API includes a full backend framework for handling localization database operations, including matching; tag, pluralization, and property propagation; reads and writes; and concurrent data handling to and from the database server. Everything is async, with concurrency where the service supports it.

Lokit can handle direct streaming from legacy interchange formats to enterprise databases with complete customization, no hidden dependencies, no boilerplate, and highly optimized data flows.

Lokit is the first ever package to support this in any language ecosystem.

```bash
pip install "lokit-python[db-aws]"
pip install "lokit-python[db-gcp]"
```

```python
import asyncio

import lokit


async def main() -> None:
    # AWS RDS: a single URI handles both reads and writes.
    tm_rds = await lokit.database.connect(
        "postgresql://user:pass@instance.rds.amazonaws.com:5432/tm?sslmode=require"
    )

    # AWS Aurora: separate writer and reader endpoints.
    tm_aurora = await lokit.database.connect(
        "postgresql://user:pass@cluster.rds.amazonaws.com:5432/tm?sslmode=require",
        reader_uri="postgresql://user:pass@cluster-ro.rds.amazonaws.com:5432/tm?sslmode=require"
    )

    # GCP Cloud SQL through a Unix socket path.
    tm_gcp = await lokit.database.connect(
        "postgresql://user:pass@/tm?host=/cloudsql/project:region:instance"
    )

    # Supabase pooler: pipelining disabled.
    tm_supabase = await lokit.database.connect(
        "postgresql://postgres.project-ref:pass@aws-0-region.pooler.supabase.com:6543/postgres?sslmode=require",
        pipeline=False
    )

    # Neon: pipelining disabled.
    tm_neon = await lokit.database.connect(
        "postgresql://user:pass@ep-cool-darkness-123456.us-east-2.aws.neon.tech/tm?sslmode=require",
        pipeline=False
    )


# `await` needs a running event loop, so the calls live inside a coroutine.
asyncio.run(main())
```

<br>
<hr>


## Supported Formats for Parsing

<br>

* TMX
* XLIFF 
* PO/POT
* XLSX
* CSV
* JSON
* HTML
* IDML
* DOCX
* PPTX

<br>
<hr>

## Learn More

Visit the official homepage at **[lokit.org](https://lokit.org)**. More detailed documentation will follow before the V1 release.

