Metadata-Version: 2.4
Name: blitzer-cli
Version: 0.1.1
Summary: A CLI for language-learning vocabulary extraction
Author: Samiddhi
License: GPL-3.0-or-later
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Education
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click
Requires-Dist: importlib-metadata; python_version < "3.8"
Requires-Dist: requests
Requires-Dist: tomli; python_version < "3.11"
Dynamic: license-file

# Blitzer CLI
## A minimalist command-line tool for language text processing

*Author: Blitzer Team*

## Overview

Blitzer CLI is a command-line tool for language learners that produces a word frequency list from input text. Beyond the plain frequency list, the tool supports lemmatization for different languages. Taking English as an example, this means the output can optionally treat "running", "ran", and "runs" as three instances of the same word rather than three different words.
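
To make the idea concrete, here is a minimal Python sketch of a lemmatized frequency count. It is not Blitzer's implementation: the `TOY_LEMMAS` table and the `word_frequencies` helper are hypothetical stand-ins for what a language pack would provide.

```python
import re
from collections import Counter

# Hypothetical toy lemma table; real language packs ship their own data.
TOY_LEMMAS = {"running": "run", "ran": "run", "runs": "run"}

def word_frequencies(text, lemmatize=False):
    """Count words in text, optionally grouping inflected forms by lemma."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if lemmatize:
        words = [TOY_LEMMAS.get(w, w) for w in words]
    return Counter(words)

text = "She runs daily. He ran yesterday. They are running now."
plain = word_frequencies(text)
grouped = word_frequencies(text, lemmatize=True)
print(plain["runs"], plain["ran"], plain["running"])  # 1 1 1
print(grouped["run"])                                 # 3
```

Without lemmatization the three forms are counted separately; with it they collapse into a single entry.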

Users can create and include their own word exclusion lists, which keep the listed words out of the results. The upfront cost of making and maintaining an exclusion list pays off as instant custom vocabulary for texts you intend to read, listen to, or study, and as a way to track how many words you know in your language. Pretty cool!

The tool was designed for language learners and supports fully downloadable language pack plugins. There are **no built-in lemmatization processors**: all language processing capabilities are installed with the `blitzer languages install <lang-code>` command. Currently **2** languages are supported. Contributions for new languages are warmly welcomed and should be fairly straightforward to implement.

## Installation

### Using pip (recommended)
```sh
# Run from the root of a source checkout (editable install)
pip install -e .
```

### Prerequisites
- Python 3.8+
- `click` library
- `requests` library
- `tomli` library (for Python < 3.11)

## Usage

### Basic Usage
The tool reads text from stdin (or from the `-t` flag) and writes the processed word list to stdout. Processing runs through the `blitz` subcommand and is controlled with flags:

```sh
# With stdin
echo "Your text here" | blitzer blitz -l [language_code] [flags]

# With direct text input
blitzer blitz -t "Your text here" -l [language_code] [flags]
```

### Available Languages
- `base` :: Basic processor for space-separated languages (no lemmatization support)
- Downloadable language packs available via plugins (e.g., `blitzer languages install pli` for Pali, `blitzer languages install slv` for Slovenian)

### Available Commands
- `blitz` :: Main command for processing text with configurable flags
- `list` :: Lists supported languages for lemmatization
- `languages` :: Manage language packs (install, uninstall, list)

### Flags
- `-L`, `--lemmatize` :: Treat different declensions/forms of the same word as one word
- `-f`, `--freq` :: Include frequency counts in the output
- `-c`, `--context` :: Include sample context for each word in the output
- `-p`, `--prompt` :: Include the custom LLM prompt at the top of the output
- `-s`, `--src` :: Include the full source text at the top of the output
- `-l`, `--language_code` :: ISO 639 three-letter language code
- `-t`, `--text` :: Direct text input (overrides stdin)
- `-h`, `--help` :: Show help message

### Examples
```sh
# Basic word list
echo "This is a test." | blitzer blitz -l pli

# With frequency counts
echo "This is a test. This is only a test." | blitzer blitz -l pli -f

# With context sentences
echo "This is a test." | blitzer blitz -l pli -f -c

# Lemmatized output
echo "To je test." | blitzer blitz -l slv -L

# Using multiple flags
echo "This is a test." | blitzer blitz -l pli -L -f -c -p

# List available languages (plugins only)
blitzer languages list

# Install or uninstall a language pack
blitzer languages install [lang-code]
blitzer languages uninstall [lang-code]
```

## Configuration

The tool follows the XDG Base Directory specification for configuration management:

### Configuration Location
- Config file: `~/.config/blitzer/config.toml`
- Exclusion files: `~/.config/blitzer/[lang_code]_exclusion.txt` (the only location searched)

### Default Configuration
When the config file doesn't exist, it will be automatically created with these defaults:
```toml
# Blitzer CLI Configuration
# This file uses TOML format

# Default flag values
default_lemmatize = false  # Default value for --lemmatize/-L flag
default_freq = false       # Default value for --freq/-f flag
default_context = false    # Default value for --context/-c flag
default_prompt = false     # Default value for --prompt/-p flag
default_src = false        # Default value for --src/-s flag

# Language-specific prompts
# Each key in the prompts table represents a language code with its custom prompt
[prompts]
"base" = "Convert the following wordlist into tab separated anki cards."
"en" = "Convert the following wordlist into tab separated anki cards."
```

Configuration defaults can be overridden with negative flags: `--no-lemmatize`, `--no-freq`, `--no-context`, `--no-prompt`, `--no-src`.
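
The way a config default and a command-line flag might combine can be sketched in Python as follows. This is an illustration under assumptions, not the code in `config.py`: the `resolve_flag` helper and the `DEFAULTS` dictionary are hypothetical names, and only the key names are taken from the default configuration above.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport for older Pythons

# Key names mirror the default configuration; the dict itself is illustrative.
DEFAULTS = {
    "default_lemmatize": False,
    "default_freq": False,
    "default_context": False,
    "default_prompt": False,
    "default_src": False,
}

def resolve_flag(config, name, cli_value=None):
    """Return the effective value of a flag.

    cli_value is True for e.g. --freq, False for --no-freq, and None when
    neither was passed, in which case the config default applies.
    """
    if cli_value is not None:
        return cli_value
    key = f"default_{name}"
    return config.get(key, DEFAULTS[key])

# A user config that turns frequency counts on by default.
user_toml = 'default_freq = true\n\n[prompts]\nbase = "Make cards."\n'
config = {**DEFAULTS, **tomllib.loads(user_toml)}

print(resolve_flag(config, "freq"))                   # True (from config)
print(resolve_flag(config, "freq", cli_value=False))  # False (--no-freq wins)
print(resolve_flag(config, "lemmatize"))              # False (built-in default)
```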

### Exclusion Lists
Exclusion lists prevent known words from appearing in the output. A language-specific exclusion list is created automatically in the config directory the first time a language is used.
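
As a rough illustration of exclusion filtering, the Python sketch below loads a list and removes its words from a token list. The file format (one word per line, compared case-insensitively) is an assumption, since it is not specified here, and `load_exclusions` and `filter_known` are hypothetical helper names.

```python
import tempfile
from pathlib import Path

def load_exclusions(path):
    """Read an exclusion file into a set (assumed format: one word per line)."""
    path = Path(path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}

def filter_known(words, exclusions):
    """Keep only words that are not in the exclusion set, preserving order."""
    return [w for w in words if w.lower() not in exclusions]

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for ~/.config/blitzer/[lang_code]_exclusion.txt
    exclusion_file = Path(tmp) / "base_exclusion.txt"
    exclusion_file.write_text("this\nis\na\n", encoding="utf-8")
    known = load_exclusions(exclusion_file)

remaining = filter_known(["This", "is", "a", "test"], known)
print(remaining)  # ['test']
```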

## Language Support

### Extending Language Support
The architecture uses a plugin system based on entry points, which makes adding new languages dynamic and extensible. Plugins are the only way to add support for a new language:

**For Plugin Languages (downloadable packages):**
1. Create a separate Python package named `blitzer-language-[lang_code]`
2. Include your processor configuration in the package's `__init__.py` file with a `register()` function
3. Add an entry point in your package's `setup.py` or `pyproject.toml` under the `blitzer.languages` group
4. Bundle any required data files (like SQLite databases) with the package
5. Distribute as a pip-installable package
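
The steps above can be sketched in Python. The first function shows what a plugin's `register()` might look like, and the second shows how a host application can discover entry points in the `blitzer.languages` group using the standard library. The dictionary returned by `register()` is purely hypothetical, because the real contract is defined by Blitzer and not documented here. `discover_language_plugins` is likewise an illustrative name, not Blitzer's own loader.

```python
from importlib import metadata

def register():
    """Hypothetical plugin hook; the keys below are assumptions."""
    return {
        "language_code": "xyz",                   # placeholder code
        "lemmatize": lambda word: word.lower(),   # trivial stand-in
    }

def discover_language_plugins(group="blitzer.languages"):
    """Map entry point names to EntryPoint objects for the given group."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):       # Python 3.10 and later
        selected = eps.select(group=group)
    else:                            # Python 3.8 / 3.9 return a plain dict
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}

plugins = discover_language_plugins()
print(sorted(plugins))  # names of installed language packs, if any

# A host would then call ep.load() to import the object each entry point
# references, for example the plugin's register function.
```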

### Language Pack Management
The `languages` command allows for easy management of downloadable language packs:

```sh
# List all available languages (built-in and installed plugins)
blitzer languages list

# Install a language pack
blitzer languages install [lang-code]

# Uninstall a language pack
blitzer languages uninstall [lang-code]
```

### Language Dictionaries
The tool supports language-specific dictionaries that enable lemmatization when using the `-L` flag. Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. For example, in Pali, both "deva" and "devo" would be mapped to the same root form "deva". Language dictionaries are stored in SQLite databases bundled with language plugins:

- Database location for plugins: Bundled with the plugin package

The tool looks for these databases in the installed plugins but does not create them automatically. When a language dictionary is not available, the tool falls back to basic word processing without lemmatization capabilities.
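
The lookup-with-fallback behavior can be illustrated with Python's built-in `sqlite3` module. The schema used here (a `lemmas` table with `form` and `lemma` columns) is an assumption made for the example. Actual language packs may structure their databases differently.

```python
import sqlite3

def make_demo_db():
    """Build an in-memory database with a hypothetical two-column schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lemmas (form TEXT PRIMARY KEY, lemma TEXT)")
    conn.executemany(
        "INSERT INTO lemmas VALUES (?, ?)",
        [("devo", "deva"), ("deva", "deva")],
    )
    return conn

def lemmatize(word, conn=None):
    """Return the lemma for word, or the word itself when no entry exists."""
    if conn is None:  # no dictionary available: keep the surface form
        return word
    row = conn.execute(
        "SELECT lemma FROM lemmas WHERE form = ?", (word,)
    ).fetchone()
    return row[0] if row else word

conn = make_demo_db()
print(lemmatize("devo", conn))     # deva
print(lemmatize("unknown", conn))  # unknown (no entry in the dictionary)
print(lemmatize("devo"))           # devo (no dictionary at all)
```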

### Current Language Support
- `base` :: Basic processor for space-separated languages without lemmatization (always available)
- Downloadable processors: Available as separate plugins (install with `blitzer languages install [lang-code]`)

## Technical Architecture

### Core Components
- `cli.py` :: Command-line interface using Click
- `config.py` :: XDG configuration management
- `processor.py` :: Core text processing logic
- `data_manager.py` :: Language data management utilities

### Processing Pipeline
1. Read text from stdin or from the `-t`/`--text` argument
2. Load the appropriate language specification via entry points and the plugin's `register()` function
3. Apply text normalization (if language-specific)
4. Tokenize the text into words and lemmatize if requested
5. Apply exclusion filtering
6. Format output according to flags
7. Write results to stdout
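
A compressed Python sketch of such a pipeline is shown below. It is a simplified stand-in rather than the logic in `processor.py`: the `blitz` function, the regex tokenizer, the sentence splitter, and the tab-separated output format are all assumptions chosen for illustration, and plugin loading and lemmatization are left out.

```python
import re
from collections import Counter

def blitz(text, exclusions=frozenset(), freq=False, context=False):
    """Return output lines for text, one word per line."""
    # Split into sentences so each word can carry a sample context.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter()
    first_seen = {}  # word -> first sentence that contains it
    for sentence in sentences:
        for word in re.findall(r"[^\W\d_]+", sentence.lower()):
            if word in exclusions:  # exclusion filtering
                continue
            counts[word] += 1
            first_seen.setdefault(word, sentence)
    lines = []
    for word, count in counts.most_common():  # most frequent first
        fields = [word]
        if freq:
            fields.append(str(count))
        if context:
            fields.append(first_seen[word])
        lines.append("\t".join(fields))
    return lines

out = blitz("This is a test. This is only a test.", {"this", "is", "a"}, freq=True)
print("\n".join(out))
```

With the exclusions given, only `test` (count 2) and `only` (count 1) remain, in that order.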

### Dependencies
- `click` :: Command-line interface framework
- `requests` :: HTTP client library
- `tomli` (or `tomllib` for Python 3.11+) :: TOML configuration parsing

## Development

### Adding New Features
The architecture is designed for extensibility:
- Add new language processors in the `languages/` directory with a `get_processor` function
- Extend processing capabilities in `processor.py`
- Modify configuration schema in `config.py`



