Metadata-Version: 2.4
Name: org21-git-scan
Version: 1.0.4
Summary: Scan your GitHub or GitLab org — generates metadata JSON for service, feature, and ownership discovery
Author-email: Org21 <support@org21.ai>
License: MIT
Project-URL: Homepage, https://org21.ai
Project-URL: Documentation, https://github.com/Org21-ai/src-ingest-github-collector
Project-URL: Repository, https://github.com/Org21-ai/src-ingest-github-collector
Keywords: github,gitlab,scanner,services,ownership,codeowners,org21
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31.0
Dynamic: license-file

# Org21 Git Scanner

Scan your GitHub or GitLab organization to discover services, features, and code ownership — then upload the results to the Org21 platform for automatic knowledge graph enrichment.

Supports **GitHub**, **GitLab.com**, and **self-hosted GitLab** instances.

## What It Does

The scanner reads your Git organization's **metadata only** — it never accesses source code content. It collects:

| Data | How It's Used |
|---|---|
| Repository names & descriptions | Identifies your services |
| README first paragraph | Helps name services accurately |
| Sub-module structure (build files) | Identifies features within services |
| Domain directory structure | Identifies capability areas |
| Contributor commit counts | Determines code ownership |
| Contributor emails | Bridges identity across tools (Jira, HiBob, Git) |
| CODEOWNERS file | Maps explicit ownership declarations |
| Collaborator permissions | Identifies repo admins (strongest ownership signal) |
| First commit author | Identifies who created each repo |
| File extension counts | Identifies IaC, deployment, and language patterns |
| AI SDK usage | Detects repos using OpenAI, Anthropic, Gemini, etc. |
| Visibility (private/public/internal) | Privacy-aware processing |
| Archived status | Tracks inactive services |

**No source code is read, stored, or transmitted.** Only repository metadata.

## Quick Start

### 1. Install

```bash
pip install --upgrade org21-git-scan
```

Requires Python 3.9+. Use `--upgrade` to pick up the latest fixes (1.0.1+ includes GitLab subgroup support and diagnostic logging for missing permissions).

> **Windows users:** if `org21-git-scan` returns "command not found" after install, your Python `Scripts` folder isn't on `PATH`. See [Troubleshooting](#troubleshooting) below.

### 2. Set Up Authentication

The scanner supports two authentication methods: **token** (recommended) or **username/password** (for Enterprise and self-hosted setups).

#### Option A — Access Token (recommended)

**For GitHub — Classic PAT (simplest):**
1. Go to [github.com/settings/tokens](https://github.com/settings/tokens) → **Generate new token (classic)**
2. Tick exactly these two scopes (leave the rest unchecked):
   - ☑ **`repo`** — Full control of private repositories (scanner reads private repo trees, contents, contributors, languages, collaborators)
   - ☑ **`read:org`** — Read org and team membership, read org projects
3. Click **Generate token** and copy the `ghp_…` string
4. **If your org uses SAML SSO:** after creating the token, click **"Configure SSO"** next to it on the token list page and authorize your organization. Without this step the scanner gets a `401 Unauthorized`.

> For public-only orgs, `public_repo` + `read:org` is enough. For most orgs, stick with `repo`.

**For GitHub — Fine-grained PAT (tighter scoping):**
1. Go to [github.com/settings/personal-access-tokens/new](https://github.com/settings/personal-access-tokens/new)
2. Set **Resource owner** to your organization, **Repository access** to All repositories
3. Under **Repository permissions**, grant **Read-only** on:
   - Contents, Metadata, Administration, Pull requests
4. Under **Organization permissions**, grant **Read-only** on: Members
5. Generate, then (if SSO-enforced) authorize the token against your org via the button that appears

Fine-grained tokens may need organization admin approval before they become active — classic PATs don't.

**For GitLab (Personal Access Token):**
1. Go to [gitlab.com/-/user_settings/personal_access_tokens](https://gitlab.com/-/user_settings/personal_access_tokens)
2. Create a token with scopes: `read_api`, `read_repository`
3. Copy the token (`glpat-xxxx`)

**For GitLab (Group Access Token — fine-grained, recommended):**
1. Go to your group → **Settings → Access Tokens**
2. Create a token with **only these permissions** (all read-only):

| Category | Permission | Why |
|---|---|---|
| **Repository** | Read | CODEOWNERS, README, directory tree, file extensions |
| **Group** | Read | List projects in group (including subgroups) |
| **Member** | Read | Member access levels (Owner/Maintainer/Developer/Reporter/Guest) |
| **Global Search** | Read | AI usage detection (searches for openai/anthropic/etc.) |

3. Set role to **Reporter** (minimum needed)
4. Copy the token (`glpat-xxxx`)

**Important:** The scanner only reads metadata — it never creates, modifies, or deletes anything. No write permissions are needed. Do NOT grant Branch, Merge Request, Pipeline, CI/CD, Work Item, or Enterprise User permissions.

**If your GitLab group enforces SAML SSO:** after creating a **Personal Access Token**, visit the group once in a browser to establish the SSO session — otherwise API calls return `401 Unauthorized`. **Group access tokens skip this step** and are the recommended choice for automated or headless scans.

#### Verify your token before scanning

A two-call sanity check saves you from cryptic 401s inside the scanner:

```bash
# Does the token work at all?
curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/user

# Does it have access to the org?
curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/orgs/YourOrgName
```

Both should return JSON. If the first succeeds but the second returns 404, your token needs SSO authorization (see above). Same pattern for GitLab — hit `/api/v4/user` and `/api/v4/groups/YourGroup`.
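
The same two-call check can be scripted. The sketch below is illustrative only and is not part of the scanner: `check_urls`, `http_status`, and `diagnose` are hypothetical helper names, and it assumes both providers accept an `Authorization: Bearer` header for personal access tokens.

```python
import urllib.error
import urllib.parse
import urllib.request


def check_urls(provider, org, gitlab_url="https://gitlab.com"):
    """Return the (user, org) endpoints named above for a provider."""
    quoted = urllib.parse.quote(org, safe="")
    if provider == "github":
        base = "https://api.github.com"
        return f"{base}/user", f"{base}/orgs/{quoted}"
    if provider == "gitlab":
        base = gitlab_url.rstrip("/") + "/api/v4"
        # Nested group paths such as "parent/child" are URL-encoded by quote().
        return f"{base}/user", f"{base}/groups/{quoted}"
    raise ValueError(f"unknown provider: {provider}")


def http_status(url, token):
    """GET a URL with a bearer token and return only the HTTP status code."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is what we want.
        return err.code


def diagnose(user_status, org_status):
    """Map the two status codes to the guidance given in this section."""
    if user_status != 200:
        return "token invalid or expired"
    if org_status == 200:
        return "ok"
    if org_status == 404:
        return "token works but needs SSO authorization for the org"
    return f"unexpected org response: {org_status}"
```

To run it against a real token, call `diagnose(http_status(user_url, token), http_status(org_url, token))` with the pair returned by `check_urls`.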

#### Option B — Username & Password

For GitHub Enterprise or self-hosted GitLab with local accounts, you can use username and password instead of a token.

**Note:** GitHub.com requires a token; it no longer accepts password authentication for API requests. Username/password works for GitHub Enterprise Server and self-hosted GitLab.

### 3. Run the Scanner

**GitHub with token:**
```bash
org21-git-scan --provider github --org YourOrgName --token ghp_your_token
```

**GitHub Enterprise with username/password:**
```bash
org21-git-scan --provider github --org YourOrgName --username your_user --password your_pass
```

**GitLab.com with token:**
```bash
org21-git-scan --provider gitlab --org YourGroupName --token glpat-your_token
```

**Self-hosted GitLab with token:**
```bash
org21-git-scan --provider gitlab --org YourGroupName --token glpat-your_token --url https://gitlab.internal.company.com
```

**Self-hosted GitLab with username/password:**
```bash
org21-git-scan --provider gitlab --org YourGroupName --username your_user --password your_pass --url https://gitlab.internal.company.com
```

This produces a `git-metadata.json` file in your current directory.

#### Windows / PowerShell syntax

PowerShell doesn't understand bash's `export VAR=value` or backslash line continuations. Use `$env:VAR = "value"` and keep the command on one line (or use a backtick `` ` `` as the continuation character — no trailing spaces after it):

```powershell
$env:GITHUB_TOKEN = "ghp_your_token"
org21-git-scan --provider github --org YourOrgName --output github-metadata.json --verbose
```

If `org21-git-scan` isn't on `PATH` (common on Microsoft Store Python), call the executable by full path:

```powershell
pip show -f org21-git-scan | Select-String "Location|org21-git-scan.exe"
# ...then run the .exe directly using the path it prints.
```

### 4. Upload to Org21

Upload the `git-metadata.json` file to your Org21 dashboard:

1. Log into your Org21 dashboard
2. Go to **Settings → Integrations → Git**
3. Click **Upload Scan Results**
4. Select your `git-metadata.json` file

Your services, features, and ownership data will appear in the knowledge graph within minutes.

## Options

```
org21-git-scan --help

usage: org21-git-scan [-h] [--provider {github,gitlab}] --org ORG
                      [--token TOKEN] [--username USERNAME]
                      [--password PASSWORD] [--url URL] [--output OUTPUT]
                      [--max-repos MAX_REPOS] [--verbose] [--debug]
                      [--bot-patterns PATTERNS]

options:
  --provider {github,gitlab}  Git provider (default: github)
  --org ORG                   GitHub organization or GitLab group name (required)
  --token TOKEN               Access token (or set GITHUB_TOKEN / GITLAB_TOKEN env var)
  --username USERNAME         Username for basic auth (or set GIT_USERNAME env var)
  --password PASSWORD         Password for basic auth (or set GIT_PASSWORD env var)
  --url URL                   GitLab instance URL for self-hosted (default: https://gitlab.com)
  --output, -o OUTPUT         Output file path (default: git-metadata.json)
  --max-repos MAX_REPOS       Maximum repos to scan (default: 100)
  --verbose, -v               Show detailed scanning progress
  --debug                     Mark the output JSON with debug: true. When this file is
                              ingested, the pipeline analyzes it but does not publish to
                              NATS / Neo4j; the resulting envelopes are logged by
                              camel-ingest so you can dry-run an upload safely.
  --bot-patterns PATTERNS     Comma-separated case-insensitive regex patterns of bot /
                              service-account logins to filter out of contributors,
                              collaborators, CODEOWNERS, and first-commit-author. Defaults
                              to a curated list (dependabot, renovate, semantic-release,
                              github-actions, ci, bot, [bot] suffix, -access-token, etc.).
                              Pass --bot-patterns '' to disable filtering; also honors the
                              BOT_LOGIN_PATTERNS env var. Dropped logins appear in the
                              per-repo `filtered_bots` field of the output JSON.
```

**Authentication:** Use either `--token` OR `--username` + `--password`. Not both.
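
To make the `--bot-patterns` semantics concrete, here is a minimal sketch of how a comma-separated list of case-insensitive regex patterns could split logins into kept and filtered groups. It is an assumption about the behavior, not the scanner's actual code: `split_bots` is a hypothetical name, and matching with `re.search` (a match anywhere in the login) is a guess.

```python
import re


def split_bots(logins, patterns):
    """Split logins into (kept, filtered) using comma-separated regex patterns."""
    compiled = [
        re.compile(part.strip(), re.IGNORECASE)
        for part in patterns.split(",")
        if part.strip()  # an empty string yields no patterns, i.e. no filtering
    ]
    kept, filtered = [], []
    for login in logins:
        if any(rx.search(login) for rx in compiled):
            filtered.append(login)
        else:
            kept.append(login)
    return kept, filtered
```

Because each entry is a regex, literal brackets need escaping: a pattern such as `\[bot\]$` matches the `[bot]` suffix, whereas an unescaped `[bot]` would be read as a character class.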

### Using Environment Variables

```bash
# GitHub with token
export GITHUB_TOKEN=ghp_your_token
org21-git-scan --provider github --org YourOrgName

# GitLab with token
export GITLAB_TOKEN=glpat-your_token
org21-git-scan --provider gitlab --org YourGroupName

# Any provider with username/password
export GIT_USERNAME=your_user
export GIT_PASSWORD=your_pass
org21-git-scan --provider gitlab --org YourGroupName --url https://gitlab.internal.com
```
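
The flag and environment-variable rules can be summarized as a small lookup. The sketch below is hypothetical and may not match the scanner's internals. The assumptions are that an explicit flag wins over its env var and that combining a token with username/password is rejected; `resolve_credentials` is an illustrative name.

```python
import os

# Env var names as listed in the Options section.
TOKEN_ENV = {"github": "GITHUB_TOKEN", "gitlab": "GITLAB_TOKEN"}


def resolve_credentials(provider, token=None, username=None, password=None, env=None):
    """Pick token or basic auth from flags first, then environment variables."""
    env = os.environ if env is None else env
    token = token or env.get(TOKEN_ENV[provider])
    username = username or env.get("GIT_USERNAME")
    password = password or env.get("GIT_PASSWORD")
    has_basic = bool(username and password)
    if token and has_basic:
        raise ValueError("use either a token or username + password, not both")
    if token:
        return {"mode": "token", "token": token}
    if has_basic:
        return {"mode": "basic", "username": username, "password": password}
    raise ValueError("no credentials found")
```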

## Provider Comparison

| Feature | GitHub | GitLab |
|---|---|---|
| List repos | ✓ (org repos) | ✓ (group projects, including subgroups) |
| Contributors + commits | ✓ | ✓ (includes email directly) |
| CODEOWNERS | ✓ (3 standard locations) | ✓ (3 standard locations) |
| Collaborator permissions | ✓ (admin/maintain/write/read) | ✓ (Owner/Maintainer/Developer/Reporter/Guest) |
| README parsing | ✓ | ✓ |
| Sub-module detection | ✓ | ✓ |
| File extensions | ✓ | ✓ |
| AI usage search | ✓ (Code Search API) | ✓ (Group Search API) |
| First commit author | ✓ | ✓ |
| Self-hosted | ✗ (GitHub Enterprise only) | ✓ (any GitLab instance via --url) |
| Visibility levels | private / public | private / internal / public |

## What Gets Scanned

For each non-forked repository in your organization:

| Item | API Calls | Notes |
|---|---|---|
| Repository info | 1 per page | Name, description, language, last push, visibility, archived status |
| Contributors | 1 per repo | Login, display name, commit count, email |
| Languages | 1 per repo | Language breakdown |
| CODEOWNERS | Up to 3 per repo | Ownership patterns and assigned owners |
| README | 1 per repo | Heading + first paragraph (for service naming) |
| Collaborators | 1 per repo | Permission levels (admin/write/read) |
| First commit | 1 per repo | Proxy for repo creator |
| Directory tree | 1 per repo | Build manifests, sub-modules, domain directories, file extensions |
| Sub-READMEs | 1 per documented dir | Subdirectory documentation |
| AI usage | ~10 per org | Code search for AI SDK keywords |

**Typical scan time:** 2-3 minutes for a 50-repo organization.
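
For rate-limit planning, the table above can be turned into a rough upper bound on API calls. This is a back-of-the-envelope sketch: the page size of 100 repos per listing call is an assumption (the table only says "1 per page"), and `estimate_max_calls` is an illustrative name.

```python
import math

# Per-repo maximum from the table: contributors, languages, CODEOWNERS (up to 3),
# README, collaborators, first commit, directory tree.
PER_REPO_MAX = 1 + 1 + 3 + 1 + 1 + 1 + 1


def estimate_max_calls(repo_count, documented_dirs=0, page_size=100, org_search_calls=10):
    """Upper-bound estimate of API calls for one scan."""
    list_pages = max(1, math.ceil(repo_count / page_size))
    return list_pages + repo_count * PER_REPO_MAX + documented_dirs + org_search_calls
```

Under these assumptions, a 50-repo organization with no documented subdirectories needs at most 461 calls.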

## Output Format

The scanner produces a single JSON file with identical format regardless of provider:

```json
{
  "org": "YourOrgName",
  "scanned_at": "2026-04-09T10:30:00Z",
  "provider": "github",
  "repos": [
    {
      "name": "auth-service",
      "description": "Authentication and authorization service",
      "readme_heading": "Auth Service",
      "readme_summary": "Multi-tenant authentication with OAuth2, PKCE, and session management.",
      "language": "Python",
      "languages": {"Python": 45000, "Shell": 2000},
      "file_extensions": {".py": 120, ".yaml": 15, ".md": 8},
      "last_push": "2026-04-07T15:30:00Z",
      "visibility": "private",
      "archived": false,
      "contributors": [
        {"login": "alice", "name": "Alice Smith", "commits": 142, "email": "alice@company.com"}
      ],
      "collaborators": [
        {"login": "alice", "name": "Alice Smith", "permission": "admin"}
      ],
      "codeowners": [
        {"pattern": "/src/oauth/", "owners": ["@alice", "@bob"]}
      ],
      "sub_modules": [
        {"path": "api-gateway", "name": "api-gateway", "build_file": "package.json"}
      ],
      "created_by": "alice",
      "admins": ["alice"],
      "uses_ai": true,
      "ai_references": ["openai", "langchain"]
    }
  ]
}
```
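
Before uploading, you may want to review the file programmatically. The sketch below reads the structure shown above and prints nothing on its own; `load_metadata` and `summarize` are illustrative helpers, and "top contributor by commit count" is only a quick review heuristic, not Org21's ownership logic.

```python
import json


def load_metadata(path="git-metadata.json"):
    """Parse the scanner's output file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(metadata):
    """Return one summary row per repo from a parsed metadata document."""
    rows = []
    for repo in metadata.get("repos", []):
        contributors = repo.get("contributors", [])
        # Highest commit count wins; None when the repo lists no contributors.
        top = max(contributors, key=lambda c: c.get("commits", 0), default=None)
        rows.append({
            "name": repo["name"],
            "top_contributor": top["login"] if top else None,
            "admins": repo.get("admins", []),
            "uses_ai": repo.get("uses_ai", False),
        })
    return rows
```

For example, `summarize(load_metadata())` returns a list of dicts that is easy to print or diff between scans.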

## Privacy & Security

- **No source code** is accessed — only repository metadata (names, structure, contributors)
- **No data leaves your machine during the scan** — the JSON file is generated locally and reaches Org21 only when you upload it
- **You control what's shared** — review the JSON before uploading
- **Fine-grained token** — only read permissions needed, no write access
- **Credentials stay local** — tokens and passwords are never sent to Org21, only used for Git API calls on your machine
- **Revoke anytime** — delete the token or change the password to revoke access immediately
- **Basic auth supported** — for environments where token generation is restricted (e.g., self-hosted GitLab with LDAP)
- **Self-hosted GitLab** — for organizations that can't use cloud Git, the scanner runs entirely within your network

## How Org21 Uses This Data

Once uploaded, Org21 analyzes the metadata to enrich your knowledge graph:

| Input | Output |
|---|---|
| Repositories | **Services** with human-readable names |
| Sub-modules & domain directories | **Features** within services |
| Contributors, CODEOWNERS, admins | **Ownership** — who owns which service |
| Contributor emails | **Identity bridging** across Jira, Git, and HR systems |
| File extensions | Service classification (IaC, deployment, application) |
| AI references | AI stack discovery (which services use AI) |

This data cross-references with other connectors (Jira, HiBob, NetSuite) to build a complete picture of your organization's technology landscape.

## Scheduling Regular Scans

```bash
# Crontab (daily at 2am)
0 2 * * * GITHUB_TOKEN=ghp_xxx org21-git-scan --provider github --org YourOrg -o /path/to/git-metadata.json

# Or run manually after significant changes
org21-git-scan --provider github --org YourOrg
```

Then upload the updated file to your Org21 dashboard.

## Troubleshooting

| Error | Solution |
|---|---|
| `Error: Access token required` | Set `--token` or the appropriate env var |
| `401 Unauthorized` on the very first org/group call | Token is invalid, expired, or not SSO-authorized. Run the curl tests in [Verify your token](#verify-your-token-before-scanning). If `/user` works but `/orgs/<org>` 404s, click **Configure SSO** next to the token on GitHub (or visit the group in a browser on GitLab) |
| `403 Forbidden` | Token doesn't have required permissions — re-check scopes |
| `404 Not Found` on specific fields | Expected for repos without CODEOWNERS / README — harmless |
| Many `api_call_failed context=<repo>/<field> error=status=403` lines | Token scope gap. For GitLab, most often means `read_repository` is missing. The scanner now logs every silent failure (v1.0.1+) so the specific endpoint that failed is visible |
| `api_call_failed context=<repo>/_resolve_project_id error=status=404` on GitLab | Only affects pre-1.0.1. Upgrade: `pip install --upgrade org21-git-scan` |
| `Rate limit exceeded` | Wait and retry, or use a token with higher limits |
| Scan takes too long | Use `--max-repos 50` to limit scope |
| Self-hosted GitLab SSL error | Ensure your GitLab URL uses a valid certificate |
| `org21-git-scan: command not found` (or `is not recognized` on Windows) | Your Python `Scripts` folder isn't on `PATH`. Find the exe with `pip show -f org21-git-scan` and either call it by full path or add the Scripts folder to `PATH`. On PowerShell: `[Environment]::SetEnvironmentVariable("PATH", "$env:PATH;<scripts-path>", "User")` — then open a new shell |
| PowerShell `Missing expression after unary operator '--'` | The command was broken across lines with `\`. PowerShell uses backtick `` ` `` for line continuation, or keep the whole command on one line |

## Support

For questions or issues: [Contact your Org21 account team](mailto:support@org21.ai)
