Metadata-Version: 2.4
Name: icebug-format
Version: 1.0.0
Summary: Convert graph data from relational formats to icebug format.
Project-URL: Homepage, https://github.com/Ladybug-Memory/icebug-format
Project-URL: Repository, https://github.com/Ladybug-Memory/icebug-format
Project-URL: PyPI, https://pypi.org/project/icebug-format
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: duckdb>=1.3.2
Requires-Dist: pyarrow>=21.0.0
Provides-Extra: graphar
Requires-Dist: graphar; extra == "graphar"

# Icebug Format

Icebug is a standardized graph format designed for efficient graph data interchange. It comes in two flavours:

| Format | Storage | Use case |
|---|---|---|
| **icebug-disk** | Parquet files | Object storage, persistence |
| **icebug-memory** | Apache Arrow tables | In-process, zero-copy access |

Both represent *directed* graphs in [CSR (Compressed Sparse Row)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)) format, which enables fast adjacency-list traversal.
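
To illustrate why CSR makes traversal cheap, here is a minimal sketch in plain Python. The four-node graph, the list-based arrays, and the `neighbors` helper are hypothetical and not part of the icebug-format API; they only show that the out-neighbours of a node form one contiguous slice.

```python
# Hypothetical 4-node directed graph with edges 0->1, 0->2, 1->2, 3->0.
indptr = [0, 2, 3, 3, 4]   # row pointers, length N + 1
indices = [1, 2, 2, 0]     # edge targets grouped by source, length E


def neighbors(node: int) -> list[int]:
    """Return the out-neighbours of `node` as one contiguous slice of `indices`."""
    return indices[indptr[node]:indptr[node + 1]]


for node in range(len(indptr) - 1):
    print(node, neighbors(node))
```

Node 2 has no outgoing edges, so its two row pointers are equal and the slice is empty.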

---

## icebug-disk v1

### CLI

Convert a DuckDB source database containing `nodes_*` / `edges_*` tables into Parquet files and a `schema.cypher` that a graph database can mount directly:

```bash
uv run icebug-format \
  --source-db examples/karate/duckdb/karate_random.duckdb \
  --schema examples/karate/duckdb/schema.cypher      # input schema for rel tables
```

### Output structure

For each node table `nodes_<name>` and edge table `edges_<name>`, the following files/tables are produced:

| Name | Description |
|---|---|
| `nodes_<name>.parquet` | Original node table with attributes |
| `indices_<name>.parquet` | Target node for each edge, sorted by source (size E) |
| `indptr_<name>.parquet` | Row-pointer array of size N+1 |
| `schema.cypher` | Cypher schema for mounting in a graph database |

NOTE: Each Parquet file stores `icebug_disk_version` in its metadata.
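
As a rough illustration of how the `indptr` and `indices` arrays relate to an edge list, the sketch below builds both from `(source, target)` pairs. It assumes nodes are already numbered with dense 0-based ids, and `build_csr` is a hypothetical helper, not the converter's actual implementation.

```python
def build_csr(num_nodes, edges):
    """Build (indptr, indices) from (source, target) pairs with dense 0-based ids."""
    # Count the out-degree of every source node, shifted by one slot.
    indptr = [0] * (num_nodes + 1)
    for src, _ in edges:
        indptr[src + 1] += 1
    # Prefix-sum the counts so indptr[i] marks where node i's targets start.
    for i in range(num_nodes):
        indptr[i + 1] += indptr[i]
    # A stable sort by source groups each node's targets together.
    indices = [dst for _, dst in sorted(edges, key=lambda e: e[0])]
    return indptr, indices


edges = [(2, 0), (0, 1), (0, 2), (1, 2)]
indptr, indices = build_csr(3, edges)
print(indptr)   # [0, 2, 3, 4]
print(indices)  # [1, 2, 2, 0]
```

The result has N+1 row pointers and E targets, matching the sizes listed in the table above.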

### Example

Starting from a `demo-db.duckdb` with `nodes_user`, `nodes_city`, `edges_follows`, and `edges_livesin` tables:

```bash
uv run icebug-format \
  --source-db demo-db.duckdb \
  --schema demo-db/schema.cypher
```

Verify the result with `test_csr_duckdb.py`:

```bash
uv run ./icebug-format/test_csr_duckdb.py --input demo-db_csr
```

```
Metadata: 7 nodes, 8 edges, directed=True

Node Tables:
Table: demo_nodes_user
(100, 'Adam', 30) ...

Edge Tables (reconstructed from CSR):
Table: follows (FROM user TO user)
(100, 250, 2020) ...
```

---

## icebug-memory v1

### Python API

Convert Arrow tables directly into an in-memory CSR graph:

```python
from icebug_format import IcebugMemGraph

# Directed heterogeneous graph (different node types on each end)
graph: IcebugMemGraph = IcebugMemGraph.from_arrow_tables(
    from_node_arrow_table=users,   # pa.Table, first column is the primary key
    rel_arrow_table=livesin,       # pa.Table with 'source' and 'target' columns
    to_node_arrow_table=cities,    # pa.Table, first column is the primary key
)

# Homogeneous graph (same node type on both ends) with reverse edges added
graph: IcebugMemGraph = IcebugMemGraph.from_arrow_tables(
    from_node_arrow_table=users,   # pa.Table, first column is the primary key
    rel_arrow_table=follows,       # pa.Table with 'source' and 'target' columns
    add_reverse_edges=True,        # to_node_arrow_table must be omitted
)

# Node tables are passed through unchanged
graph.src    # pa.Table — source nodes
graph.dest   # pa.Table — destination nodes

# CSR adjacency structure
graph.indices  # pa.Table — 'target' column (+ any edge properties), sorted by source
graph.indptr   # pa.Table — 'ptr' column of length len(src) + 1
```

The `rel_arrow_table` source and target columns are resolved by name in priority order, with a positional fallback:

| Role | Accepted names (in order) | Fallback |
|---|---|---|
| Source | `source`, `src`, `from` | First column (index 0) |
| Target | `target`, `destination`, `dest`, `to` | Second column (index 1) |

Any remaining columns are preserved as edge properties in `graph.indices`.
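
The lookup rule in the table can be sketched as follows. This is an illustration only: `resolve_column` is a hypothetical helper, and exact, case-sensitive name matching is an assumption rather than documented behaviour.

```python
SOURCE_NAMES = ("source", "src", "from")
TARGET_NAMES = ("target", "destination", "dest", "to")


def resolve_column(column_names, accepted, fallback_index):
    """Return the first accepted name that is present, else the positional fallback."""
    for name in accepted:
        if name in column_names:  # assumed: exact, case-sensitive match
            return name
    return column_names[fallback_index]


columns = ["from", "dest", "since"]
src_col = resolve_column(columns, SOURCE_NAMES, 0)
dst_col = resolve_column(columns, TARGET_NAMES, 1)
# Whatever is left over would be carried along as edge properties.
edge_props = [c for c in columns if c not in (src_col, dst_col)]
print(src_col, dst_col, edge_props)  # from dest ['since']
```

With column names that match nothing, such as `["a", "b", "weight"]`, the same helper falls back to `a` and `b` by position.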

Use `--add-reverse-edges` in the CLI, or `add_reverse_edges=True` in the Python API, to emit a symmetric adjacency by adding reverse edges. For reverse-edge expansion, `to_node_arrow_table` must be omitted; the same node table is used for both sides of every edge.
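
Conceptually, reverse-edge expansion appends a flipped copy of every edge. The sketch below shows the idea on plain tuples; `add_reverse_edges` here is a hypothetical stand-alone function, and details such as deduplication or self-loop handling in icebug-format itself are not covered by this README, so the sketch does neither.

```python
def add_reverse_edges(edges):
    """Append (target, source, *props) for every (source, target, *props) row."""
    reversed_rows = [(dst, src, *props) for src, dst, *props in edges]
    return edges + reversed_rows


# Hypothetical `friends` rows: (source, target, since)
friends = [(100, 250, 2020), (250, 300, 2021)]
symmetric = add_reverse_edges(friends)
for row in symmetric:
    print(row)
```

Because both endpoints swap roles, the flipped rows only make sense when source and target ids come from the same node table, which is why `to_node_arrow_table` must be omitted.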

## Caveats

- icebug-format always outputs a directed graph.
- If an algorithm needs symmetric adjacency, pass `--add-reverse-edges` to the CLI or `add_reverse_edges=True` to the Python API; reverse edges are then added automatically. Reverse-edge expansion is supported only for rel tables with the same node type on both ends.
- Reverse-edge expansion is all or nothing for a conversion. If your graph mixes edge types that should be symmetric, such as `friends`, with edge types that should stay directed, such as `follows`, run separate conversions or add reverse edges before calling icebug-format; `--add-reverse-edges` cannot be applied selectively per edge type.

---

## Further reading

[Blog post: Graph Archiving with Apache GraphAR](https://adsharma.github.io/graph-archiving/)
