Metadata-Version: 2.4
Name: project-monitor-sdk
Version: 0.1.2
Summary: Official SDK for the Project Monitor AI Observability Platform
License: MIT
Keywords: monitoring,observability,logging,apm
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.28
Provides-Extra: asgi
Provides-Extra: flask
Requires-Dist: flask>=2.0; extra == "flask"

# Project Monitor

AI-powered observability platform — collect logs, surface AI insights, fire alerts, and track all your running services from a single dashboard.

---

## Table of Contents

- [Overview](#overview)
- [Architecture](#architecture)
- [Repository Layout](#repository-layout)
- [Prerequisites](#prerequisites)
- [Backend Setup](#backend-setup)
- [Frontend Setup](#frontend-setup)
- [SDK Setup](#sdk-setup)
- [Environment Variables](#environment-variables)
- [API Reference](#api-reference)
- [Dashboard Pages](#dashboard-pages)
- [SDK Usage](#sdk-usage)
- [Cloud Integrations](#cloud-integrations)
- [AI Insights](#ai-insights)
- [Alerts](#alerts)
- [Database Migrations](#database-migrations)
- [Running Tests](#running-tests)

---

## Overview

Project Monitor is a self-hosted monitoring platform with three components that work together:

| Component | Technology | Default Port |
|-----------|-----------|--------------|
| **Backend API** | FastAPI + PostgreSQL | `8000` |
| **React Dashboard** | Vite + React 18 | `8001` (dev) |
| **Python SDK** | `project-monitor-sdk` (PyPI) | — |

Services instrument themselves with the SDK. The SDK buffers log events and flushes them to the backend over HTTP. The dashboard shows live metrics and AI-generated root-cause analysis, lets you configure alert delivery and cloud webhooks, and provides a per-service Servers view.
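
The buffering step can be pictured with a small size-triggered buffer. This is a minimal sketch, not the SDK's actual code: the `BatchBuffer` class and its `send` callback are hypothetical names, and it leaves out the timer-based flush, locking, and retries that a real client would need.

```python
from typing import Callable


class BatchBuffer:
    """Collects events and hands them to `send` in batches (illustrative only)."""

    def __init__(self, send: Callable[[list[dict]], None], batch_size: int = 50) -> None:
        self._send = send
        self._batch_size = batch_size
        self._events: list[dict] = []

    def add(self, event: dict) -> None:
        self._events.append(event)
        # Size-triggered flush; a real client would also flush on a timer.
        if len(self._events) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._events:
            return
        # Swap the buffer out before sending so new events start a fresh batch.
        batch, self._events = self._events, []
        self._send(batch)


sent_batches: list[list[dict]] = []
buffer = BatchBuffer(send=sent_batches.append, batch_size=2)
for i in range(3):
    buffer.add({"message": f"event {i}"})
buffer.flush()  # push the one event still waiting in the buffer
```

With `batch_size=2`, the three events leave as one batch of two followed by one batch of one.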

---

## Architecture

```
Your Services
  └── Monitor SDK (heartbeat + log events)
        │  POST /api/v1/logs  (batched, idempotent)
        ▼
  FastAPI Backend (port 8000)
  ├── PostgreSQL  ── projects / api_keys / logs / work_queue / cloud_integrations / insight_feedback
  ├── Background queue worker  ── processes log batches
  ├── AI Insights service  ── LLM root-cause analysis (OpenAI-compatible or Ollama)
  ├── Alert service  ── Slack / Teams / Email delivery
  └── Static file server  ── serves React build at /app

  React Dashboard (port 8001 in dev)
  └── proxies /api → backend
```

---

## Repository Layout

```
Project Monitor/
├── backend/                   FastAPI application
│   ├── app/
│   │   ├── main.py            App entry point; mounts API router + React dist
│   │   ├── core/
│   │   │   ├── config.py      Pydantic-settings config (database, LLM, alerts)
│   │   │   └── security.py    API-key hashing + require_api_key dependency
│   │   ├── api/v1/routes/
│   │   │   ├── projects.py    POST /projects  — create project + API key
│   │   │   ├── logs.py        POST /logs (ingest)  GET /logs (paginated query)
│   │   │   ├── insights.py    GET /insights  POST /insights/feedback
│   │   │   ├── alerts.py      POST /alerts/test  POST /alerts/insights/notify
│   │   │   ├── integrations.py CRUD + webhook receiver for cloud integrations
│   │   │   └── services.py    GET /services  — per-service log summary
│   │   ├── models/            SQLAlchemy ORM models
│   │   │   ├── project.py     projects table
│   │   │   ├── api_key.py     api_keys table (SHA-256 hashed, prefixed pm_)
│   │   │   ├── log.py         logs table (indexed on project, service, level, time, correlation)
│   │   │   ├── ingest_request.py  idempotency keys for log ingest
│   │   │   ├── work_queue.py  background job queue
│   │   │   ├── cloud_integration.py  cloud connections
│   │   │   └── insight_feedback.py  LLM feedback ratings
│   │   ├── schemas/           Pydantic request/response schemas
│   │   ├── services/
│   │   │   ├── insights_service.py  rule-based + LLM insight engine
│   │   │   ├── alert_service.py     Slack / Teams / Email delivery
│   │   │   ├── cloud_normalizer.py  universal webhook normalizer (AWS/Azure/GCP)
│   │   │   └── queue_service.py     background work queue helpers
│   │   ├── workers/
│   │   │   └── queue_worker.py  standalone worker process
│   │   └── db/
│   │       ├── base.py        declarative Base
│   │       └── session.py     SessionLocal + get_db dependency
│   ├── alembic/               Database migrations
│   │   └── versions/          5 migration files (initial schema → insight feedback)
│   ├── dashboard/             Legacy single-file HTML dashboard (app.html)
│   ├── monitor_sdk/           Local copy of SDK for backend dev/testing
│   ├── requirements.txt
│   └── alembic.ini
│
├── frontend/                  React + Vite dashboard
│   ├── src/
│   │   ├── App.jsx            Root component; AppContext (apiKey, activePage)
│   │   ├── components/
│   │   │   ├── Sidebar.jsx    Navigation sidebar
│   │   │   └── Topbar.jsx     Top bar with project/API key input
│   │   ├── pages/
│   │   │   ├── Overview.jsx   Live metrics, recent errors, error groups
│   │   │   ├── LogExplorer.jsx  Paginated log search with filters
│   │   │   ├── AIInsights.jsx   LLM root-cause analysis panel
│   │   │   ├── Servers.jsx    Per-service health + error drill-down
│   │   │   ├── Alerts.jsx     Alert channel configuration + test delivery
│   │   │   ├── Integrations.jsx  Cloud integration management
│   │   │   └── NewProject.jsx  Project + API key creation wizard
│   │   └── utils/
│   │       ├── api.js         apiFetch helper (reads X-API-Key from context)
│   │       └── helpers.js     tsShort, badgeLevelClass, formatGroupLabel
│   ├── vite.config.js         Proxies /api → backend; base = /app/ in production
│   └── package.json
│
└── sdk/                       Publishable Python SDK
    ├── src/monitor_sdk/
    │   ├── client.py          Monitor class — batching, retry, heartbeat
    │   ├── middleware.py       MonitorASGIMiddleware — ASGI request logging
    │   └── context.py         ContextVar correlation ID propagation
    ├── pyproject.toml         Package metadata (project-monitor-sdk 0.1.2)
    └── README.md              This file
```

---

## Prerequisites

- Python 3.10+
- Node.js 18+
- PostgreSQL 14+ running locally (default: `localhost:5432/project_monitor`)

---

## Backend Setup

```bash
cd "backend"

# Create and activate a virtual environment
python -m venv .venv
.venv\Scripts\activate          # Windows
# source .venv/bin/activate     # macOS/Linux

# Install dependencies
python -m pip install -r requirements.txt

# Apply database migrations
alembic upgrade head

# Start the API server
uvicorn app.main:app --reload --port 8000
```

The API is then available at `http://localhost:8000`.  
Interactive docs: `http://localhost:8000/docs`

### Create your first project

```bash
curl -X POST http://localhost:8000/api/v1/projects \
  -H "Content-Type: application/json" \
  -d '{"name": "My Service", "description": "Production backend"}'
```

Response:
```json
{
  "project_id": "550e8400-e29b-41d4-a716-446655440000",
  "api_key": "pm_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
```

Save the `api_key` — it is shown only once.

---

## Frontend Setup

```bash
cd "frontend"
npm install
npm run dev          # starts on http://localhost:8001
```

For a production build (served by FastAPI at `/app`):

```bash
npm run build        # outputs to frontend/dist/
```

FastAPI automatically serves `frontend/dist/` at `/app` if the directory exists.

---

## SDK Setup

### Install from PyPI

```bash
pip install project-monitor-sdk
```

### Install from local source (development)

```bash
pip install -e "/path/to/Project Monitor/sdk"
```

---

## Environment Variables

### Backend (`backend/.env` or shell environment)

| Variable | Default | Description |
|----------|---------|-------------|
| `DATABASE_URL` | `postgresql+psycopg://postgres:postgres@localhost:5432/project_monitor` | PostgreSQL connection string |
| `LLM_ENABLED` | `true` | Enable LLM-powered insights |
| `LLM_PROVIDER` | `openai` | LLM provider (`openai` for any OpenAI-compatible endpoint) |
| `LLM_BASE_URL` | — | OpenAI-compatible API base URL |
| `LLM_MODEL` | `Qwen/Qwen3.6-35B-A3B-FP8` | Model name |
| `LLM_API_KEY` | `EMPTY` | API key for the LLM endpoint |
| `LLM_TIMEOUT_SECONDS` | `20` | Request timeout |
| `LLM_MAX_LOGS` | `25` | Max recent logs sent for analysis |
| `OLLAMA_BASE_URL` | `http://192.168.1.34:11434` | Ollama endpoint (used if LLM_BASE_URL not set) |
| `OLLAMA_MODEL` | `qwen3:4b-q4_K_M` | Ollama model |
| `SLACK_WEBHOOK_URL` | — | Slack incoming webhook for alerts |
| `TEAMS_WEBHOOK_URL` | — | Microsoft Teams webhook |
| `ALERT_EMAIL_FROM` | — | SMTP sender address |
| `ALERT_EMAIL_TO` | — | Default alert recipient |
| `SMTP_HOST` | — | SMTP host |
| `SMTP_PORT` | `587` | SMTP port |
| `SMTP_USERNAME` | — | SMTP auth username |
| `SMTP_PASSWORD` | — | SMTP auth password |
| `QUEUE_POLL_INTERVAL_SECONDS` | `5` | Background worker poll interval |

### Frontend (`frontend/.env.development`)

| Variable | Default | Description |
|----------|---------|-------------|
| `VITE_API_BASE` | *(empty)* | Backend base URL. Leave empty to use the Vite proxy. Set to `http://localhost:8000` only when bypassing the proxy. |

### SDK (passed to `Monitor(...)` constructor or via env)

| Variable | Description |
|----------|-------------|
| `PROJECT_MONITOR_API_KEY` | API key for log ingestion |
| `PROJECT_MONITOR_BASE_URL` | Backend base URL |
| `PROJECT_MONITOR_SERVICE_NAME` | Service name shown in the dashboard |
| `PROJECT_MONITOR_MIN_LEVEL` | Minimum log level to send (`DEBUG` / `INFO` / `WARN` / `ERROR` / `CRITICAL`) |
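
If you prefer to wire these variables up explicitly, a helper along the following lines would work. It is a sketch under assumptions: `monitor_kwargs_from_env` is a hypothetical function, not part of the SDK, and the table above does not say whether `Monitor` reads the variables by itself.

```python
import os
from typing import Mapping

# Mapping from the documented environment variables to Monitor(...) keyword names.
ENV_TO_KWARG = {
    "PROJECT_MONITOR_API_KEY": "api_key",
    "PROJECT_MONITOR_BASE_URL": "base_url",
    "PROJECT_MONITOR_SERVICE_NAME": "service_name",
    "PROJECT_MONITOR_MIN_LEVEL": "min_level",
}


def monitor_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect the variables that are set; unset ones fall back to constructor defaults."""
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items() if env.get(name)}


kwargs = monitor_kwargs_from_env(
    {
        "PROJECT_MONITOR_API_KEY": "pm_your_api_key",
        "PROJECT_MONITOR_BASE_URL": "http://localhost:8000",
    }
)
# monitor = Monitor(**kwargs)
```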

---

## API Reference

All routes are prefixed with `/api/v1`. Authenticated routes require the header `X-API-Key: pm_...`.

### Projects

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `POST` | `/projects` | None | Create a project; returns `project_id` and `api_key` |

### Logs

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `POST` | `/logs` | Yes | Ingest a batch of log events. Supports `Idempotency-Key` header. |
| `GET` | `/logs` | Yes | Query logs with filters: `level`, `service_name`, `start_time`, `end_time`, cursor-based pagination |

#### Log ingest payload

```json
{
  "logs": [
    {
      "service_name": "my-service",
      "level": "ERROR",
      "message": "Database connection failed",
      "operation": "db_connect",
      "status": "error",
      "error_type": "ConnectionError",
      "correlation_id": "abc-123",
      "metadata": { "host": "db.internal", "retry": 3 },
      "source": "sdk"
    }
  ]
}
```
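
For clients that cannot use the SDK, the same request can be assembled with the standard library. The sketch below only builds the request. `build_ingest_request` is a hypothetical helper, generating one UUID per batch for `Idempotency-Key` is an assumption about how the header is meant to be used, and which payload fields are mandatory is not stated here.

```python
import json
import urllib.request
import uuid


def build_ingest_request(base_url: str, api_key: str, logs: list[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/logs request for one batch."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/logs",
        data=json.dumps({"logs": logs}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,
            # Assumption: one fresh UUID per batch, reused if the same batch is retried.
            "Idempotency-Key": str(uuid.uuid4()),
        },
        method="POST",
    )


request = build_ingest_request(
    "http://localhost:8000",
    "pm_your_api_key",
    [{"service_name": "my-service", "level": "ERROR", "message": "Database connection failed"}],
)
# Send it against a running backend with:
# urllib.request.urlopen(request, timeout=5)
```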

### Insights

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `GET` | `/insights` | Yes | AI root-cause analysis. Query params: `lookback_minutes` (5–1440), `deep_analysis` (bool) |
| `POST` | `/insights/feedback` | Yes | Submit thumbs-up/down rating with correction text |

### Alerts

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `POST` | `/alerts/test` | Yes | Send a test alert to Slack, Teams, or Email |
| `POST` | `/alerts/insights/notify` | Yes | Run insights and email the result |

### Integrations (Cloud Webhooks)

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `POST` | `/integrations` | Yes | Register a cloud provider connection |
| `GET` | `/integrations` | Yes | List all connections for the project |
| `DELETE` | `/integrations/{id}` | Yes | Remove a connection |
| `POST` | `/integrations/webhook/{id}?token=<webhook_token>` | Token | Receive a cloud webhook, authenticated by the `token` query parameter instead of `X-API-Key`; auto-normalizes and stores as logs |

### Services

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `GET` | `/services` | Yes | Per-service summary: total logs, error count, status, last seen. Query param: `lookback_minutes` (1–10080, default 1440) |

Service `status` values:

| Value | Condition |
|-------|-----------|
| `healthy` | No errors in window |
| `degraded` | At least one error, error ratio < 30 % |
| `critical` | Error ratio ≥ 30 % |
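
Read as code, the thresholds in the table amount to the following. This is a sketch of the documented rule, not the backend's source: `service_status` is a hypothetical name, and which log levels count as errors is not specified here.

```python
def service_status(total_logs: int, error_count: int) -> str:
    """Classify a service from its log counts in the lookback window."""
    if error_count == 0:
        return "healthy"
    error_ratio = error_count / total_logs
    # Below 30 % is degraded; 30 % and above is critical.
    return "degraded" if error_ratio < 0.30 else "critical"
```

For example, 29 errors in 100 logs is `degraded`, while 30 in 100 is already `critical`.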

### Health

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `GET` | `/health` | None | Returns `{"status": "ok"}` |

---

## Dashboard Pages

| Page | Route key | Description |
|------|-----------|-------------|
| **Overview** | `overview` | Live metrics card, recent errors table, error groups, dependent error groups |
| **Log Explorer** | `logs` | Full-text log search with level / service / time range filters and cursor pagination |
| **AI Insights** | `insights` | LLM-generated root-cause analysis with timeline, error groups, contributing groups, and feedback |
| **Servers** | `servers` | All services that have reported in the lookback window; click a service for its error drill-down |
| **Alerts** | `alerts` | Configure and test Slack / Teams / Email delivery |
| **Integrations** | `integrations` | Connect AWS, Azure, or GCP; get a webhook URL to paste into cloud consoles |
| **New Project** | `new_project` | Create a project and copy the API key |

The API key is entered in the top bar, held in memory in the React `AppContext`, and sent as `X-API-Key` on every request. It is never persisted to localStorage, so re-enter it after a page refresh.

---

## SDK Usage

### Initialise

```python
from monitor_sdk import Monitor

monitor = Monitor(
    api_key="pm_your_api_key",
    base_url="http://localhost:8000",
    service_name="my-service",
    min_level="WARN",          # drop DEBUG / INFO locally
)
```

### Register the service at startup

```python
# One-shot: appears in the Servers dashboard immediately
monitor.heartbeat()

# Continuous: daemon thread pings every 30 s; stops when monitor.close() is called
monitor.start_heartbeat_loop(interval=30)
```

Heartbeat events always bypass `min_level` so the service registers even when `min_level="ERROR"`.

### Logging

```python
monitor.debug("cache miss", operation="cache_lookup")
monitor.info("order created", operation="create_order", metadata={"order_id": 42})
monitor.warn("retry attempt", operation="send_email", metadata={"attempt": 2})
monitor.error("payment declined", operation="checkout", error_type="PaymentError")

# Generic with any level
monitor.log("custom message", level="CRITICAL", operation="scheduler")
```

All logging methods accept the same keyword arguments:

| Keyword | Description |
|---------|-------------|
| `operation` | Function or operation name |
| `status` | Short status tag, e.g. `"success"`, `"error"` |
| `error_type` | Exception class name |
| `metadata` | Arbitrary `dict` stored as JSON |
| `correlation_id` | Distributed trace ID (auto-filled from context if omitted) |
| `service_name` | Override the client-level `service_name` for this single event |
| `source` | Override the client-level `source` tag |

### Capture exceptions

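```python
url = "https://api.example.com/v1/status"  # hypothetical endpoint, for illustration only

try:
    call_external_api(url)  # placeholder for your own function
except Exception as exc:
    monitor.capture_exception(exc, operation="external_api_call", metadata={"url": url})
```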

Logs the exception as `ERROR` and attaches the full traceback to `metadata["traceback"]`.

### Trace a block

```python
with monitor.trace("checkout_flow", metadata={"cart_id": cart.id}):
    process_order()
    charge_card()
```

Emits `INFO` on entry and `ERROR` (with `duration_ms`) if the block raises.

### ASGI middleware (FastAPI / Starlette)

```python
from fastapi import FastAPI
from monitor_sdk import MonitorASGIMiddleware

app = FastAPI()
app.add_middleware(MonitorASGIMiddleware, monitor=monitor)
```

The middleware:
- Reads `X-Request-Id` / `X-Correlation-Id` headers (or generates a UUID)
- Propagates the correlation ID via a `ContextVar` so all logs within a request share it
- Logs `INFO` for every request that returns < 500, `ERROR` for 5xx responses

### Unhandled exception hook

```python
monitor.install_excepthook()
```

Wraps `sys.excepthook` to capture and flush unhandled exceptions before the process exits.
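
Conceptually, wrapping the hook looks like the sketch below. This is an illustration of the general pattern, not the SDK's code: the standalone `install_excepthook` function and its `capture` and `flush` callbacks are hypothetical, and chaining to the previous hook is an assumption.

```python
import sys


def install_excepthook(capture, flush) -> None:
    """Chain a capture-and-flush step in front of the existing sys.excepthook."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc, tb):
        try:
            capture(exc)
            flush()
        finally:
            # Always defer to the previous hook so the traceback is still printed.
            previous_hook(exc_type, exc, tb)

    sys.excepthook = hook
```

Note that `sys.excepthook` only covers exceptions that reach the top of the main thread; exceptions in other threads go through `threading.excepthook`.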

### Lifecycle

```python
monitor.start()   # start background flush thread (automatic on init when start_background=True)
monitor.flush()   # force a synchronous flush right now
monitor.close()   # stop threads, final flush (registered via atexit automatically)
```

### Dead-letter queue

Events that fail all retries are kept in memory:

```python
failed_events = monitor.dead_letter()  # list[dict]
```

### Constructor parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `api_key` | required | Project Monitor API key (`pm_...`) |
| `base_url` | required | Backend URL, e.g. `http://localhost:8000` |
| `service_name` | `None` | Service name shown in the dashboard |
| `source` | `"sdk"` | Free-form source tag on every event |
| `min_level` | `"WARN"` | Drop events below this level |
| `batch_size` | `50` | Flush when buffer reaches this size |
| `flush_interval` | `2.0` | Seconds between automatic background flushes |
| `timeout_seconds` | `5.0` | HTTP timeout per attempt |
| `max_retries` | `3` | Retries per batch (exponential back-off) |
| `retry_backoff_seconds` | `0.5` | Base back-off delay (doubles on each retry) |
| `start_background` | `True` | Start flush thread immediately |
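
To see what the retry defaults mean in practice, the delays can be computed directly. This sketch assumes the first retry waits exactly `retry_backoff_seconds` and that no jitter is added; `backoff_delays` is a hypothetical helper, not an SDK function.

```python
def backoff_delays(max_retries: int = 3, retry_backoff_seconds: float = 0.5) -> list[float]:
    """Delay before each retry, doubling from the base value."""
    return [retry_backoff_seconds * (2 ** attempt) for attempt in range(max_retries)]


delays = backoff_delays()        # [0.5, 1.0, 2.0] with the defaults
worst_case_wait = sum(delays)    # 3.5 seconds of sleeping, excluding HTTP timeouts
```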

---

## Cloud Integrations

Register a connection to receive cloud provider events as normalized logs.

**Supported providers and event types:**

| Provider | Event Types |
|----------|------------|
| **AWS** | CloudWatch Alarms, CloudWatch Logs, CloudTrail, GuardDuty, RDS, Lambda, Security Hub, EventBridge |
| **Azure** | Monitor Alerts, Activity Log, Application Insights, Service Health, Defender, AKS, Event Grid |
| **GCP** | Cloud Logging, Cloud Monitoring, Security Command Center |

**Workflow:**
1. `POST /api/v1/integrations` with `{"name": "...", "provider": "aws"}` → receive `webhook_url` and `webhook_token`
2. Paste the `webhook_url` into the cloud console (SNS, Event Grid topic, Pub/Sub, etc.)
3. The backend auto-detects the event shape and normalizes it into the logs table
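
The backend returns the `webhook_url` for you, but its shape follows from the route in the API reference. As a sketch, with a hypothetical `webhook_url` helper and made-up host, ID, and token values:

```python
from urllib.parse import quote, urlencode


def webhook_url(base_url: str, integration_id: str, webhook_token: str) -> str:
    """Compose the receiver URL from the documented webhook route."""
    path = f"/api/v1/integrations/webhook/{quote(integration_id, safe='')}"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'token': webhook_token})}"


url = webhook_url("https://monitor.example.com", "42", "s3cr3t")
```

Since the token travels in the query string, treat the whole URL as a secret when pasting it into cloud consoles.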

---

## AI Insights

`GET /api/v1/insights?lookback_minutes=60&deep_analysis=false`

The insights engine:
1. Queries recent logs and computes error metrics, top error types, and service-level grouped errors
2. If `LLM_ENABLED=true`, sends the summary to an OpenAI-compatible LLM (or Ollama) for root-cause analysis
3. Falls back to rule-based heuristics if the LLM is unreachable or returns an invalid response

Response includes:
- `incident_summary` — plain-English summary of the incident
- `root_cause` — LLM-identified root cause
- `suggestion` — recommended fix
- `error_groups` — top error types with counts
- `dependent_error_groups` — correlated errors across services (shared correlation IDs)
- `timeline` — chronological sequence of significant events
- `fallback_reason` — set when rule-based fallback was used
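
The LLM-then-fallback flow can be sketched as follows. Everything here is illustrative: `generate_insight`, its callbacks, and the reason strings are hypothetical, and setting `fallback_reason` when the LLM is disabled is an assumption rather than documented behaviour.

```python
from typing import Callable


def generate_insight(
    summary: dict,
    llm_analyze: Callable[[dict], dict],
    rule_based: Callable[[dict], dict],
    llm_enabled: bool = True,
) -> dict:
    """Try the LLM first; on any failure return the rule-based result with a reason."""
    if not llm_enabled:
        return {**rule_based(summary), "fallback_reason": "llm_disabled"}
    try:
        result = llm_analyze(summary)
        if "root_cause" not in result:
            raise ValueError("LLM response missing root_cause")
        return {**result, "fallback_reason": None}
    except Exception as exc:
        return {**rule_based(summary), "fallback_reason": f"{type(exc).__name__}: {exc}"}


def unreachable_llm(summary: dict) -> dict:
    raise TimeoutError("LLM endpoint did not respond")


insight = generate_insight(
    {"error_count": 12},
    llm_analyze=unreachable_llm,
    rule_based=lambda s: {"root_cause": "error spike", "suggestion": "check recent deploys"},
)
```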

Submit feedback after reviewing an insight:

```bash
curl -X POST http://localhost:8000/api/v1/insights/feedback \
  -H "X-API-Key: pm_..." \
  -H "Content-Type: application/json" \
  -d '{"rating": 1, "lookback_minutes": 60, "correction": "Root cause was actually a deploy"}'
```

---

## Alerts

Test that alert channels are working:

```bash
curl -X POST http://localhost:8000/api/v1/alerts/test \
  -H "X-API-Key: pm_..." \
  -H "Content-Type: application/json" \
  -d '{
    "channel": "slack",
    "severity": "HIGH",
    "title": "Test Alert",
    "message": "This is a test alert from Project Monitor."
  }'
```

Supported `channel` values: `slack`, `teams`, `email`.

To run insights and email the result in one call:

```bash
curl -X POST http://localhost:8000/api/v1/alerts/insights/notify \
  -H "X-API-Key: pm_..." \
  -H "Content-Type: application/json" \
  -d '{"lookback_minutes": 60, "recipient_email": "ops@example.com"}'
```

---

## Database Migrations

Migrations are managed with Alembic.

```bash
cd backend

# Apply all pending migrations
alembic upgrade head

# Create a new migration after changing a model
alembic revision --autogenerate -m "describe_change"

# Check current migration state
alembic current
```

Migration history:

| File | Changes |
|------|---------|
| `0001_initial_schema` | projects, api_keys, logs tables |
| `0002_add_last_used_at_to_api_keys` | `last_used_at` on api_keys |
| `0003_add_ingest_requests_and_work_queue` | idempotency keys + job queue |
| `0004_add_cloud_integrations` | cloud_integrations table |
| `0005_add_insight_feedback` | insight_feedback table |

---

## Running Tests

```bash
cd backend
pytest tests/ -v
```

Test files:

| File | Coverage |
|------|---------|
| `test_health.py` | `GET /health` |
| `test_logs_api.py` | Log ingest + query |
| `test_alerts_api.py` | Alert delivery |
| `test_insights_api.py` | Insights + feedback |
| `test_llm.py` | LLM insight generation |
| `test_ollama.py` | Ollama fallback |
| `test_sdk.py` | SDK client unit tests |


## Quick start

```python
from monitor_sdk import Monitor, MonitorASGIMiddleware

monitor = Monitor(
    api_key="pm_your_api_key",
    base_url="http://localhost:8000",
    service_name="my-service",
    min_level="WARN",   # only WARN and above are sent
)

# Register the service immediately so it appears in the Servers dashboard
monitor.heartbeat()

# Keep the service marked as "active" with periodic heartbeats (every 30 s)
monitor.start_heartbeat_loop(interval=30)

# Manual logging
monitor.info("server started", operation="startup")   # dropped locally: INFO is below min_level="WARN"
monitor.warn("disk usage above 90%", operation="disk_check", metadata={"usage_pct": 91})
monitor.error("payment failed", operation="checkout", error_type="TimeoutError")

# Capture exceptions automatically
try:
    risky_operation()
except Exception as exc:
    monitor.capture_exception(exc, operation="risky_operation")

# Trace a block
with monitor.trace("checkout_flow"):
    process_order()

# Global unhandled exception hook (calls capture_exception + flush before crash)
monitor.install_excepthook()

# ASGI middleware (FastAPI / Starlette) – auto-logs every request
from fastapi import FastAPI
app = FastAPI()
app.add_middleware(MonitorASGIMiddleware, monitor=monitor)
```

## Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `api_key` | required | Project Monitor API key |
| `base_url` | required | Backend URL, e.g. `http://localhost:8000` |
| `service_name` | `None` | Name shown in the dashboard |
| `source` | `"sdk"` | Free-form source tag attached to every event |
| `min_level` | `"WARN"` | Drop logs below this level (`DEBUG` / `INFO` / `WARN` / `ERROR` / `CRITICAL`) |
| `batch_size` | `50` | Flush when the buffer reaches this many events |
| `flush_interval` | `2.0` | Seconds between automatic background flushes |
| `timeout_seconds` | `5.0` | HTTP request timeout per attempt |
| `max_retries` | `3` | HTTP retries per batch (exponential back-off) |
| `retry_backoff_seconds` | `0.5` | Base back-off delay (doubles on each retry) |
| `start_background` | `True` | Start the background flush thread immediately on init |

## Log levels

```
DEBUG < INFO < WARN < ERROR < CRITICAL
```

Only events **at or above** `min_level` are sent to the backend.  
`heartbeat()` always bypasses `min_level` so the service registers even when `min_level="ERROR"`.
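
The filtering rule can be written down in a few lines. This is a sketch of the documented ordering, with `should_send` as a hypothetical helper rather than an SDK function.

```python
LEVELS = ["DEBUG", "INFO", "WARN", "ERROR", "CRITICAL"]


def should_send(level: str, min_level: str = "WARN", is_heartbeat: bool = False) -> bool:
    """Return True if an event passes the local min_level filter."""
    if is_heartbeat:
        return True  # heartbeats bypass the filter
    return LEVELS.index(level.upper()) >= LEVELS.index(min_level.upper())
```

With the default `min_level="WARN"`, an `INFO` event is dropped locally while `WARN`, `ERROR`, and `CRITICAL` are sent.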

## API reference

### Logging

| Method | Level | Description |
|--------|-------|-------------|
| `monitor.debug(msg, **kwargs)` | DEBUG | Low-level diagnostic message |
| `monitor.info(msg, **kwargs)` | INFO | Informational message |
| `monitor.warn(msg, **kwargs)` | WARN | Warning – potential issue |
| `monitor.error(msg, **kwargs)` | ERROR | Recoverable error |
| `monitor.log(msg, level=..., **kwargs)` | any | Generic log with explicit level |

All logging methods accept these keyword arguments:

| Keyword | Type | Description |
|---------|------|-------------|
| `operation` | `str` | Name of the operation/function being logged |
| `status` | `str` | Status tag, e.g. `"success"`, `"error"` |
| `error_type` | `str` | Exception class name |
| `metadata` | `dict` | Arbitrary key-value pairs stored as JSON |
| `correlation_id` | `str` | Distributed trace ID (auto-filled from context if omitted) |
| `service_name` | `str` | Override the client-level `service_name` for this event |
| `source` | `str` | Override the client-level `source` for this event |

### Exception capture

```python
monitor.capture_exception(exc, operation="my_op", metadata={"user_id": 42})
```

Logs the exception as `ERROR` and attaches the full traceback to `metadata["traceback"]`.

### Tracing

```python
with monitor.trace("checkout_flow", metadata={"cart_id": cart.id}):
    process_order()
```

Logs an `INFO` event at entry and an `ERROR` event (with duration) if the block raises.

### Heartbeat

```python
# One-shot – call at startup
monitor.heartbeat()

# Continuous – daemon thread pings every `interval` seconds; stops on monitor.close()
monitor.start_heartbeat_loop(interval=30)
```

Heartbeat events bypass `min_level` so the service always appears in the Servers dashboard.

### Lifecycle

```python
monitor.start()   # start the background flush thread (called automatically on init when start_background=True)
monitor.flush()   # flush the buffer synchronously right now
monitor.close()   # stop background threads and do a final flush

monitor.install_excepthook()  # capture unhandled exceptions via sys.excepthook
```

### Dead-letter queue

Events that could not be delivered after all retries are stored in memory:

```python
failed = monitor.dead_letter()  # returns list[dict]
```
