Metadata-Version: 2.4
Name: canonmap
Version: 0.2.52
Summary: CanonMap - A Python library for entity canonicalization and mapping with enhanced configuration and response models
Home-page: https://github.com/vinceberry/canonmap
Author: Vince Berry
Author-email: vince.berry@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE.txt
Requires-Dist: python-dotenv
Requires-Dist: pydantic
Requires-Dist: google-cloud-storage
Requires-Dist: pandas
Requires-Dist: chardet
Requires-Dist: numpy
Requires-Dist: scikit-learn
Requires-Dist: rapidfuzz
Requires-Dist: metaphone
Requires-Dist: tqdm
Requires-Dist: requests
Requires-Dist: codename
Provides-Extra: embedding
Requires-Dist: sentence-transformers; extra == "embedding"
Requires-Dist: transformers; extra == "embedding"
Requires-Dist: torch; extra == "embedding"
Requires-Dist: tokenizers; extra == "embedding"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# CanonMap

CanonMap is a Python library for generating and managing canonical entity artifacts from various data sources. It provides a streamlined interface for processing data files and generating standardized artifacts that can be used for entity matching and data integration.

## Features

- **Flexible Input Support**: Process data from:
  - CSV/JSON files
  - Directories of data files
  - Pandas DataFrames
  - Python dictionaries

- **Artifact Generation**:
  - Generate canonical entity lists
  - Create database schemas (supports multiple database types)
  - Generate semantic embeddings for entities
  - Clean and standardize field names
  - Process metadata fields

- **Database Support**:
  - DuckDB (default)
  - SQLite
  - BigQuery
  - MariaDB
  - MySQL
  - PostgreSQL

- **Enhanced Configuration**:
  - Separate configuration for artifacts and embeddings
  - **Optional GCP integration** with bucket management
  - Flexible sync strategies for cloud storage
  - Comprehensive error handling and logging
  - **Local-only mode** for development and testing

## Installation

### Lightweight Installation (Core Features Only)
```bash
pip install canonmap
```

### Full Installation (Including Embedding Support)
```bash
pip install "canonmap[embedding]"
```

**Note**: The lightweight installation includes all core features (GCP integration, file processing, schema generation) but excludes embedding functionality. If you need semantic embeddings, use the full installation with `[embedding]` extras.
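To confirm which installation is active, one option is to check whether the packages pulled in by the `embedding` extra can be imported. The sketch below uses only the standard library; the helper name `missing_embedding_modules` is hypothetical and not part of CanonMap.

```python
import importlib.util

# Import names of the packages listed under the `embedding` extra.
EMBEDDING_MODULES = ("sentence_transformers", "transformers", "torch", "tokenizers")


def missing_embedding_modules(modules=EMBEDDING_MODULES):
    """Return the names of modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_embedding_modules()
    if missing:
        print(f"Embedding extra not fully installed; missing: {', '.join(missing)}")
    else:
        print("Embedding extra is available.")
```

If anything is reported missing, reinstalling with the `[embedding]` extra should bring it in.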

## Quick Start

### Local-Only Mode (Recommended for Development)

```python
from canonmap import (
    CanonMap,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField
)

# Simple local-only configuration
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=None  # No GCS integration
)

embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models/sentence-transformers/all-MiniLM-L12-v2",
    gcs_config=None  # No GCS integration
)

# Initialize CanonMap
canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True
)

# Configure artifact generation
request = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id")
    ],
    semantic_fields=[
        SemanticField(table_name="my_table", field_name="description"),
        SemanticField(table_name="my_table", field_name="notes")
    ],
    generate_schemas=True,
    generate_embeddings=True,
    generate_semantic_texts=True
)

# Generate artifacts
response = canonmap.generate_artifacts(request)
print(f"Generated {len(response.generated_artifacts)} artifacts")
```

### Cache Prioritization (Prevents Repeated Downloads)

CanonMap checks the model cache directories on your machine before downloading a model. This keeps the same model from being downloaded again for each project that uses it.

```python
from canonmap import (
    CanonMap,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField
)

# Configuration with cache prioritization (default behavior)
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=None
)

embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models/sentence-transformers/all-MiniLM-L12-v2",
    gcs_config=None,
    prioritize_cache=True  # Default: True - checks cache first
)

# Initialize CanonMap
canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True
)
```

**Cache Locations Checked:**
- `~/.huggingface_hub/`
- `~/.sentence_transformers/`
- `~/.cache/huggingface/`
- `~/.cache/sentence_transformers/`
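
To see which of these directories exist on a given machine, a small standard-library check can help. This is a sketch for inspection only, not CanonMap's lookup logic; the helper name `existing_cache_dirs` is hypothetical.

```python
from pathlib import Path

# Cache directories listed above, relative to the user's home directory.
CACHE_DIRS = (
    ".huggingface_hub",
    ".sentence_transformers",
    ".cache/huggingface",
    ".cache/sentence_transformers",
)


def existing_cache_dirs(home=None):
    """Return the listed cache directories that exist under `home`."""
    base = Path(home) if home is not None else Path.home()
    return [base / rel for rel in CACHE_DIRS if (base / rel).is_dir()]


if __name__ == "__main__":
    for path in existing_cache_dirs():
        print(path)
```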

**To disable cache prioritization:**
```python
embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models/sentence-transformers/all-MiniLM-L12-v2",
    prioritize_cache=False  # Use only the specified local path
)
```

### With GCP Integration

```python
from canonmap import (
    CanonMap,
    CanonMapGCPConfig,
    CanonMapCustomGCSConfig,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField
)

# 1. Set up base GCP configuration
base_gcp = CanonMapGCPConfig(
    gcp_service_account_json_path="path/to/service_account.json",
    troubleshooting=False
)

# 2. Configure GCS for artifacts and embeddings
artifacts_gcs = CanonMapCustomGCSConfig(
    gcp_config=base_gcp,
    bucket_name="your-artifacts-bucket",
    bucket_prefix="artifacts/",
    auto_create_bucket=True,
    sync_strategy="refresh"
)

embedding_gcs = CanonMapCustomGCSConfig(
    gcp_config=base_gcp,
    bucket_name="your-models-bucket",
    bucket_prefix="models/",
    auto_create_bucket=True,
    sync_strategy="refresh"
)

# 3. Create application-specific configs
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=artifacts_gcs
)

embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models/sentence-transformers/all-MiniLM-L12-v2",
    gcs_config=embedding_gcs
)

# 4. Initialize CanonMap
canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True,
    api_mode=False
)

# 5. Configure artifact generation
request = ArtifactGenerationRequest(
    input_path="path/to/your/data.csv",
    source_name="my_source",
    table_name="my_table",
    entity_fields=[
        EntityField(table_name="my_table", field_name="name"),
        EntityField(table_name="my_table", field_name="id")
    ],
    semantic_fields=[
        SemanticField(table_name="my_table", field_name="description"),
        SemanticField(table_name="my_table", field_name="notes")
    ],
    generate_schemas=True,
    generate_embeddings=True,
    generate_semantic_texts=True
)

# 6. Generate artifacts
response = canonmap.generate_artifacts(request)
print(f"Generated {len(response.generated_artifacts)} artifacts")
```

## Artifact Generation Example

```python
from canonmap import (
    CanonMap,
    CanonMapArtifactsConfig,
    CanonMapEmbeddingConfig,
    ArtifactGenerationRequest,
    EntityField,
    SemanticField,
    ArtifactGenerationResponse
)

# Set up configurations (local-only for this example)
artifacts_config = CanonMapArtifactsConfig(
    artifacts_local_path="./artifacts",
    gcs_config=None  # Local-only mode
)

embedding_config = CanonMapEmbeddingConfig(
    embedding_model_hf_name="sentence-transformers/all-MiniLM-L12-v2",
    embedding_model_local_path="./models",
    gcs_config=None  # Local-only mode
)

# Initialize CanonMap
cm = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True
)

# Create generation request
gen_req = ArtifactGenerationRequest(
    input_path="input",
    source_name="football_data",
    entity_fields=[
        EntityField(table_name="passing", field_name="player"),
        EntityField(table_name="rushing", field_name="rusher_name"),
    ],
    semantic_fields=[
        SemanticField(table_name="passing", field_name="description"),
        SemanticField(table_name="rushing", field_name="notes"),
    ],
    generate_schemas=True,
    save_processed_data=True,
    generate_semantic_texts=True
)

# Generate artifacts
resp: ArtifactGenerationResponse = cm.generate_artifacts(gen_req)

# Access response details
print(f"Status: {resp.status}")
print(f"Generated {len(resp.generated_artifacts)} artifacts")
print(f"Processing time: {resp.processing_stats.processing_time_seconds:.2f} seconds")
```

## Entity Mapping Example

```python
from canonmap import (
    CanonMap,
    EntityMappingRequest,
    TableFieldFilter,
    EntityMappingResponse
)

# Initialize CanonMap (artifacts_config and embedding_config are defined in the previous example)
cm = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config
)

# Create mapping request
mapping_request = EntityMappingRequest(
    entities=["tim brady", "jake alan"],
    filters=[
        TableFieldFilter(table_name="passing", table_fields=["player"])
    ],
    num_results=3,
)

# Map entities
resp: EntityMappingResponse = cm.map_entities(mapping_request)

# Access mapping results
print(f"Processed {resp.total_entities_processed} entities")
print(f"Found {resp.total_matches_found} matches")

for mapping in resp.mappings:
    print(f"\nEntity: {mapping.entity}")
    for match in mapping.matches:
        print(f"  Match: {match.matched_entity} (Score: {match.score:.3f})")
```
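
For intuition about what a ranked, scored match list looks like, the sketch below ranks candidate names by plain string similarity using `difflib` from the standard library. It is not CanonMap's matching algorithm; the function name `top_matches` and the sample names are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def top_matches(query, candidates, num_results=3):
    """Rank candidates by string similarity to query, highest score first."""
    scored = [
        (candidate, SequenceMatcher(None, query.lower(), candidate.lower()).ratio())
        for candidate in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_results]


# Illustrative canonical names only; real candidates come from generated artifacts.
players = ["Tom Brady", "Josh Allen", "Jake Elliott", "Tim Boyle"]
for name, score in top_matches("tim brady", players):
    print(f"{name}: {score:.3f}")
```

A character-level ratio like this is sensitive to typos but knows nothing about phonetics or meaning, which is one reason a library might combine several signals.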

## Configuration Options

### CanonMapGCPConfig
Base GCP configuration with service account and troubleshooting settings:
- `gcp_service_account_json_path`: Path to GCP service account JSON file
- `troubleshooting`: Enable detailed logging and validation

### CanonMapCustomGCSConfig
Bucket-specific configuration extending the base GCP config:
- `gcp_config`: Base GCP configuration
- `bucket_name`: GCS bucket name
- `bucket_prefix`: Optional prefix for bucket operations
- `auto_create_bucket`: Automatically create bucket if it doesn't exist
- `auto_create_bucket_prefix`: Automatically create prefix directory
- `sync_strategy`: Sync strategy ("none", "missing", "overwrite", "refresh")

### CanonMapArtifactsConfig
Configuration for artifact storage and management:
- `artifacts_local_path`: Local directory for artifacts
- `gcs_config`: **Optional** GCS configuration for artifact storage
- `troubleshooting`: Enable troubleshooting mode

### CanonMapEmbeddingConfig
Configuration for embedding model management:
- `embedding_model_hf_name`: HuggingFace model name
- `embedding_model_local_path`: Local path for model storage
- `gcs_config`: **Optional** GCS configuration for model storage
- `troubleshooting`: Enable troubleshooting mode
- `prioritize_cache`: Check user's home directory cache first (default: True)
  - Looks in `.huggingface_hub`, `.sentence_transformers`, and other common cache locations
  - Prevents repeated downloads of the same model
  - Can be disabled by setting to `False` if you want to use only the specified local path

### ArtifactGenerationRequest
Comprehensive configuration for artifact generation:
- **Input/Output**:
  - `input_path`: Path to a data file or directory, or an in-memory DataFrame or dict
  - `source_name`: Logical source name
  - `table_name`: Logical table name

- **Directory Processing**:
  - `recursive`: Process subdirectories
  - `file_pattern`: File matching pattern (e.g., "*.csv")
  - `table_name_from_file`: Use filename as table name

- **Entity Processing**:
  - `entity_fields`: List of fields to treat as entities
  - `semantic_fields`: List of fields to extract as individual semantic text files
  - `use_other_fields_as_metadata`: Include non-entity fields as metadata

- **Generation Options**:
  - `generate_canonical_entities`: Generate entity list
  - `generate_schemas`: Generate database schema
  - `generate_embeddings`: Generate semantic embeddings
  - `generate_semantic_texts`: Generate semantic text files from semantic_fields
  - `save_processed_data`: Save cleaned data
  - `database_type`: Target database type
  - `normalize_field_names`: Standardize field names
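
The directory-processing options can be pictured with a short standard-library sketch. It shows one common reading of `recursive`, `file_pattern`, and `table_name_from_file` (glob the directory, use each file's stem as its table name); CanonMap's actual behavior may differ, and `discover_tables` is a hypothetical helper.

```python
from pathlib import Path


def discover_tables(input_dir, file_pattern="*.csv", recursive=False):
    """Map a table name (the file stem) to each matching file under input_dir."""
    root = Path(input_dir)
    # rglob descends into subdirectories; glob looks at the top level only.
    matches = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    return {path.stem: path for path in sorted(matches) if path.is_file()}


if __name__ == "__main__":
    for table, path in discover_tables("input", recursive=True).items():
        print(f"{table}: {path}")
```

Under this reading, two files with the same stem in different subdirectories would collide on the table name, so distinct file names are the safer choice.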

## Response Models

### ArtifactGenerationResponse
Comprehensive response containing:
- `status`: Success/failure status
- `message`: Human-readable message
- `generated_artifacts`: List of generated artifacts with metadata
- `processing_stats`: Detailed processing statistics
- `errors`: List of errors encountered
- `warnings`: List of warnings
- `gcp_upload_info`: GCP upload details
- Convenience paths for common artifacts
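
A small helper can turn these fields into a one-line log message. The sketch below reads only the attributes listed above and runs against a stand-in object; the helper name `summarize_response` and the `"success"` status value are assumptions, not part of the documented API.

```python
from types import SimpleNamespace


def summarize_response(resp):
    """Build a one-line summary from the documented response fields."""
    artifacts = list(getattr(resp, "generated_artifacts", None) or [])
    errors = list(getattr(resp, "errors", None) or [])
    warnings = list(getattr(resp, "warnings", None) or [])
    return (
        f"status={resp.status} artifacts={len(artifacts)} "
        f"errors={len(errors)} warnings={len(warnings)}"
    )


# Stand-in for illustration; a real response comes from generate_artifacts().
fake = SimpleNamespace(
    status="success", generated_artifacts=["a", "b"], errors=[], warnings=["w"]
)
print(summarize_response(fake))  # status=success artifacts=2 errors=0 warnings=1
```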

### EntityMappingResponse
Detailed mapping results including:
- `status`: Success/failure status
- `mappings`: List of entity mappings with matches
- `total_entities_processed`: Number of entities processed
- `total_matches_found`: Total number of matches found
- `processing_stats`: Performance metrics
- `configuration_summary`: Request configuration summary
- `errors`: List of errors encountered
- `warnings`: List of warnings

## API Mode

For API deployments, initialize CanonMap with `api_mode=True`:

```python
canonmap = CanonMap(
    artifacts_config=artifacts_config,
    embedding_config=embedding_config,
    verbose=True,
    api_mode=True  # Enables API-specific optimizations
)
```

## Output

The `generate_artifacts()` method returns an `ArtifactGenerationResponse` containing:
- Generated artifacts with metadata
- Processing statistics and timing information
- Error and warning information
- GCP upload details (if applicable)
- Convenience paths to common artifacts

### Semantic Text Files

When `semantic_fields` is specified, CanonMap creates zip files containing individual text files for each non-null semantic field value:

- **Single table**: `{source}_{table}_semantic_texts.zip`
- **Multiple tables**: `{source}_semantic_texts.zip` (combined)
- **File naming**: `{table_name}_row_{row_index}_{field_name}.txt`
- **Content**: Raw text content from the specified semantic fields
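
The naming scheme can be reproduced with the standard library to preview which files a given table would yield. This is a sketch of the documented pattern only, not CanonMap's implementation; the helper names are hypothetical, and the zero-based row index is an assumption.

```python
import io
import zipfile


def semantic_text_name(table_name, row_index, field_name):
    """Build a member name following the documented file-naming pattern."""
    return f"{table_name}_row_{row_index}_{field_name}.txt"


def build_semantic_zip(table_name, rows, semantic_fields):
    """Write one text file per non-null semantic field value to an in-memory zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for row_index, row in enumerate(rows):  # assumption: zero-based index
            for field_name in semantic_fields:
                value = row.get(field_name)
                if value is None:
                    continue  # null values produce no file
                name = semantic_text_name(table_name, row_index, field_name)
                archive.writestr(name, str(value))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    rows = [
        {"description": "Deep pass", "notes": None},
        {"description": None, "notes": "Short gain"},
    ]
    with zipfile.ZipFile(build_semantic_zip("passing", rows, ["description", "notes"])) as z:
        print(z.namelist())
```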

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
