Metadata-Version: 2.4
Name: pooled-job-scraper
Version: 0.1.1
Summary: Generic job listings scraper with baseline dedupe and optional Netlify cache submission.
Author: lramos0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# Pooled Job Scraper

`pooled-job-scraper` is published on PyPI and provides a CLI that scrapes a business's careers page, extracts its job listings, and writes a unique delta: the listings not already present in the April 2026 baseline dataset.

PyPI package:

- https://pypi.org/project/pooled-job-scraper/

## Platform Links

- [mewannajob.com](https://www.mewannajob.com/)
- [jobpool.live](https://jobpool.live/)

### How They Fit Together

- `jobpool.live` is the open data pool and hydration surface.  
  Use this scraper to discover and normalize job listings, then review unique delta rows before promoting data into the pool workflow.
- `mewannajob.com` is the consumer-facing experience for browsing and using listings data.  
  Data prepared through the pool workflow ultimately feeds job discovery on that site.

### Documentation Surfaces

- Public pool context and hydration navigation: [jobpool.live](https://jobpool.live/)
- Hydration docs in this repository: `pool/hydration/docs/`
- Scraper implementation in this repository: `scripts/generic_job_listings_scraper.py`

## Install

### Windows

```powershell
py -m pip install --upgrade pooled-job-scraper
```

### WSL / Linux / macOS

```bash
python3 -m pip install --upgrade pooled-job-scraper
```

## Run (Installed CLI)

```bash
job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv
```
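
The `--unique-output` file holds the scraped rows that survive the baseline dedupe. The sketch below illustrates that kind of delta in plain Python. It is not the package's implementation: the key columns (`company`, `title`, `url`) and the helper names `row_key` and `unique_delta` are assumptions made for illustration.

```python
def row_key(row, key_fields=("company", "title", "url")):
    """Build a normalized comparison key (assumed columns, not the real schema)."""
    return tuple((row.get(field) or "").strip().lower() for field in key_fields)


def unique_delta(scraped_rows, baseline_rows, key_fields=("company", "title", "url")):
    """Return scraped rows whose key is absent from the baseline, dropping repeats."""
    seen = {row_key(row, key_fields) for row in baseline_rows}
    delta = []
    for row in scraped_rows:
        key = row_key(row, key_fields)
        if key not in seen:
            seen.add(key)  # also filters duplicates within the scrape itself
            delta.append(row)
    return delta


# Hypothetical sample rows, for illustration only.
baseline = [{"company": "Mossy Honda", "title": "Service Advisor", "url": "https://example.com/1"}]
scraped = [
    {"company": "mossy honda ", "title": "Service Advisor", "url": "https://example.com/1"},
    {"company": "Mossy Honda", "title": "Lot Attendant", "url": "https://example.com/2"},
]
print([row["title"] for row in unique_delta(scraped, baseline)])  # ['Lot Attendant']
```

Normalizing case and whitespace before comparing keeps cosmetic differences from producing false "new" rows; the real scraper may key on different fields.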

## Run (From Repository Script)

```powershell
py scripts/generic_job_listings_scraper.py `
  --business-url https://mossyhonda.hireology.careers/ `
  --company-name "Mossy Honda" `
  --output output/mossy-scraped.csv `
  --unique-output output/mossy-unique.csv
```

## Send Unique Rows To Netlify Cache

```bash
job-scraper \
  --business-url https://mossyhonda.hireology.careers/ \
  --company-name "Mossy Honda" \
  --cache-endpoint https://<your-netlify-site>/api/scrape-cache \
  --output output/mossy-scraped.csv \
  --unique-output output/mossy-unique.csv
```

The scraper infers `user_name` from the local environment or git config and sends the following fields:

- `user_name`
- `request_timestamp`
- `source_business_urls`
- `listings` (standard listing fields plus any additional discovered fields)
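
To make the payload shape concrete, here is a rough sketch that assembles the four top-level fields listed above. Only the field names come from this README. The timestamp format, the `getpass` fallback, the listing keys, and the helper name `build_cache_payload` are assumptions.

```python
import getpass
import json
from datetime import datetime, timezone


def build_cache_payload(listings, source_business_urls, user_name=None):
    """Assemble a dict with the four documented top-level fields."""
    return {
        # Assumption: fall back to the OS login name when no name is given.
        "user_name": user_name or getpass.getuser(),
        # Assumption: ISO 8601 UTC; the actual timestamp format is not documented here.
        "request_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_business_urls": list(source_business_urls),
        "listings": list(listings),
    }


payload = build_cache_payload(
    listings=[{"title": "Service Advisor", "company": "Mossy Honda"}],  # hypothetical row
    source_business_urls=["https://mossyhonda.hireology.careers/"],
    user_name="example-user",
)
print(json.dumps(payload, indent=2))
```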

## Cache API

- `POST /api/scrape-cache` stores a scrape request payload.
- `GET /api/scrape-cache?limit=25&user_name=<name>` returns recent cached submissions.
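
As a minimal sketch, the `GET` endpoint can be queried with the standard library alone. The base URL below is a placeholder, the helper names are hypothetical, and the code assumes the endpoint answers with JSON, which this README does not state.

```python
import json
import urllib.parse
import urllib.request


def build_recent_url(base_url, limit=25, user_name=None):
    """Build the GET URL using the documented `limit` and `user_name` parameters."""
    params = {"limit": limit}
    if user_name:
        params["user_name"] = user_name
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/api/scrape-cache?{query}"


def fetch_recent(base_url, limit=25, user_name=None, timeout=10):
    """Fetch recent cached submissions (assumes a JSON response body)."""
    url = build_recent_url(base_url, limit, user_name)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Placeholder site name; substitute your own Netlify site.
print(build_recent_url("https://example.netlify.app", user_name="example-user"))
# https://example.netlify.app/api/scrape-cache?limit=25&user_name=example-user
```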

## Publishing Flow

Publishing is automated through:

- `.github/workflows/publish-pypi.yml`

Behavior:

- Triggers on push/merge to `main`.
- Builds distributions from `pyproject.toml`.
- Checks whether the current version already exists on PyPI.
- Publishes only when the version is new.
- Skips cleanly when that version already exists.
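
The version check can be pictured roughly as follows. This is an illustrative sketch, not the contents of `publish-pypi.yml`: it assumes the check reads the `releases` mapping from PyPI's JSON API (`https://pypi.org/pypi/<package>/json`), and the helper names and sample data are hypothetical.

```python
import json
import urllib.error
import urllib.request


def fetch_pypi_metadata(package, timeout=10):
    """Fetch project metadata from PyPI's JSON API; treat 404 as 'no releases yet'."""
    url = f"https://pypi.org/pypi/{package}/json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return {"releases": {}}
        raise


def should_publish(pypi_metadata, version):
    """Publish only when the version is not already listed under `releases`."""
    return version not in pypi_metadata.get("releases", {})


# Made-up sample metadata so the decision logic runs without network access.
sample = {"releases": {"0.1.0": [], "0.1.1": []}}
print(should_publish(sample, "0.1.1"))  # False: already published, so skip
print(should_publish(sample, "0.1.2"))  # True: new version, so publish
```

In a live check, `should_publish(fetch_pypi_metadata("pooled-job-scraper"), version)` would make the same decision against real metadata.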

To release a new version:

1. Bump `project.version` in `pyproject.toml`.
2. Merge to `main`.
3. Wait for the publish workflow to complete.
