Metadata-Version: 2.4
Name: min-llm-server-client
Version: 0.4.5
Summary: A minimal API server for local HuggingFace LLMs or VLLM LLMs
Author-email: Afshin Sadeghi <sadeghi.afshin@gmail.com>
License-Expression: Apache-2.0
Requires-Python: ==3.12.*
Description-Content-Type: text/markdown
License-File: LICENSE-2.0.txt
Requires-Dist: kernels
Requires-Dist: transformers>=4.40.0
Requires-Dist: accelerate>=0.28.0
Requires-Dist: torch==2.10
Requires-Dist: flask
Requires-Dist: sentencepiece
Requires-Dist: nvidia-ml-py
Provides-Extra: vllm
Requires-Dist: vllm>=0.3.0; extra == "vllm"
Dynamic: license-file




# Minimal LLM Server for API Calls ![PyPI Total Downloads](https://img.shields.io/pepy/dt/min_llm_server_client)


The simplest possible Python code for running local LLM inference as a REST API server and a simple client.

This package lets you start an inference server for Hugging Face–compatible models (such as LLaMA, Qwen, and GPT-OSS) on your own computer or server, and make it accessible to applications via HTTP. It supports both the standard HuggingFace Transformers backend and the high-performance vLLM backend.

See the [Tutorial](https://medium.com/@sadeghi.afshin/run-gpt-oss-20b-and-gpt-oss-120b-locally-with-a-minimal-api-server-in-the-style-of-openai-1872e68a93b7) page for extended information.

---

## Backend Options

This package now supports two inference backends:

### 1. **HuggingFace Transformers (Standard)**
- ✓ Widely compatible
- ✓ CPU support available
- ✓ Smaller installation size
- ✓ Good for development and testing

### 2. **vLLM Optimized (High-Performance)** 
- ✓ Up to **24x faster** throughput than standard transformers
- ✓ Lower latency for single requests
- ✓ Better GPU memory utilization with PagedAttention
- ✓ Automatic multi-GPU support with tensor parallelism
- ✓ Continuous batching for higher throughput
- ⚠ Requires CUDA GPUs (no CPU support)
- ⚠ Best for production deployments

Compared with using vLLM directly, min_llm_server_client offers:
- ✓ Automatic GPU selection based on free VRAM
- ✓ Auto-configured multi-GPU tensor parallelism
- ✓ Ultra-lightweight API with few dependencies that runs with minimal or no configuration
- ✓ Easier to customize and integrate into research or internal AI pipelines on research clusters
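
To make "GPU selection based on free VRAM" concrete, here is a minimal sketch of one way such a selection could work. It is not the package's actual implementation: `query_free_vram` and `pick_gpus` are hypothetical helpers, and the greedy "most free memory first" rule is an assumption. The NVML calls come from `nvidia-ml-py` (imported as `pynvml`), which this package lists as a dependency.

```python
from typing import Dict, List


def query_free_vram() -> Dict[int, int]:
    """Return {gpu_index: free_bytes} via NVML (needs nvidia-ml-py and an NVIDIA driver)."""
    import pynvml  # imported lazily so pick_gpus() below also works on machines without a GPU

    pynvml.nvmlInit()
    try:
        free = {}
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            free[index] = pynvml.nvmlDeviceGetMemoryInfo(handle).free
        return free
    finally:
        pynvml.nvmlShutdown()


def pick_gpus(free_bytes: Dict[int, int], needed_bytes: int) -> List[int]:
    """Greedily take the GPUs with the most free memory until the request fits.

    Returns an empty list when the combined free memory is still too small.
    (Assumed selection rule, for illustration only.)
    """
    chosen: List[int] = []
    total = 0
    ranked = sorted(free_bytes.items(), key=lambda item: item[1], reverse=True)
    for index, free in ranked:
        chosen.append(index)
        total += free
        if total >= needed_bytes:
            return sorted(chosen)
    return []


GIB = 1024 ** 3
example = {0: 10 * GIB, 1: 70 * GIB, 2: 40 * GIB}  # made-up readings, not measurements
print(pick_gpus(example, 100 * GIB))  # GPUs 1 and 2 together offer 110 GiB
```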
---

## Installation by pip 

### Prerequisite
```bash
uv venv --python 3.12
source .venv/bin/activate
```

**Standard lightweight installation (HuggingFace):**

```bash
uv pip install min-llm-server-client
```

**With vLLM Support:**

```bash
uv pip install "min-llm-server-client[vllm]"
```

#### Installation From Source:

```bash
git clone https://github.com/afshinsadeghi/min_llm_server_client.git
cd min_llm_server_client

# Standard installation
uv pip install .

# Or with vLLM support
uv pip install ".[vllm]"
```

---

## Usage

### Starting the Server

#### Standard HuggingFace Transformers Server

```bash
uv run min-llm-server --model_name meta-llama/Llama-3.3-70B-Instruct --max_new_tokens 100 --device cuda:0
```

#### vLLM Optimized Inference Server

```bash
uv run min-llm-server-vllm --model_name openai/gpt-oss-20b --max_new_tokens 100 --device cuda:2
```

**Command Options:**
- `--model_name` : Hugging Face model name or local path. Suggested models:
  - `openai/gpt-oss-20b`
  - `openai/gpt-oss-120b`
  - `meta-llama/Llama-3.3-70B-Instruct`
  - `casperhansen/llama-3.3-70b-instruct-awq` **(the vLLM backend runs this on a single A100 GPU)**
  - `meta-llama/Llama-3.1-8B`
  - `Qwen/Qwen3-0.6B`
  - `Qwen/Qwen2-VL-72B-Instruct-AWQ`
  - `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`
  - `Qwen/Qwen3-235B-A22B-FP8` **(runs on 4 A100 GPUs)**
  - `Qwen/Qwen3-30B-A3B-Instruct-250`
  - or a local model on your device, given as `/path/to/model`



- `--max_new_tokens` : maximum number of tokens to generate in response.
- `--device` : Device selection
  - `auto` - Auto-detect available GPUs (default)
  - `cpu` - Force CPU (HuggingFace only; vLLM requires a GPU)
  - `cuda:0`, `cuda:1`, or a comma-separated list of GPUs such as `cuda:2,3,4,5,6,7`

- Specific to vLLM:
  - `--max_model_len` : Maximum model context length. If not specified, will auto-detect from model config. Example: 8192
  - `--gpu_memory_utilization` : Fraction of GPU memory to use (0.0 to 1.0). Default: 0.90 (90%). Lower this value if sharing GPU with other processes. Examples: 0.85, 0.80, 0.75
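
As an illustration of how the vLLM-specific options could reach the engine, the sketch below turns CLI-style values into keyword arguments. `build_vllm_settings` is a hypothetical helper and not part of this package; the keys `model`, `tensor_parallel_size`, `gpu_memory_utilization`, and `max_model_len` are argument names accepted by vLLM's `LLM` class, while pinning GPUs through `CUDA_VISIBLE_DEVICES` is an assumption about one common approach.

```python
from typing import Any, Dict, List, Optional


def build_vllm_settings(
    model_name: str,
    gpu_ids: List[int],
    gpu_memory_utilization: float = 0.90,
    max_model_len: Optional[int] = None,
) -> Dict[str, Any]:
    """Translate CLI-style options into vLLM engine arguments (hypothetical helper)."""
    if not gpu_ids:
        raise ValueError("vLLM needs at least one CUDA GPU")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0.0, 1.0]")

    engine_args: Dict[str, Any] = {
        "model": model_name,
        "tensor_parallel_size": len(gpu_ids),  # one model shard per selected GPU
        "gpu_memory_utilization": gpu_memory_utilization,
    }
    if max_model_len is not None:
        # When omitted, the context length is taken from the model config.
        engine_args["max_model_len"] = max_model_len

    # Assumption: restrict the process to the chosen GPUs via the environment.
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in gpu_ids)}
    return {"engine_args": engine_args, "env": env}


settings = build_vllm_settings("openai/gpt-oss-20b", [2, 3], 0.85, 8192)
print(settings["engine_args"])
print(settings["env"])
```

Under these assumptions, the environment would be applied before importing vLLM and the arguments passed as `vllm.LLM(**settings["engine_args"])`. Note that vLLM generally needs a tensor-parallel size that divides the model's attention-head count evenly, so not every GPU count works for every model.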

If `--device` is omitted or set to `auto`, the server detects the available GPUs and uses them. If no GPU is available, the HuggingFace server falls back to the CPU; the vLLM server requires a GPU.
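
The accepted `--device` forms can be summarized with a small parser. This is only a sketch of the syntax described above; `parse_device` is a hypothetical function, and the server's real argument handling may differ.

```python
from typing import List, Union


def parse_device(device: str = "auto") -> Union[str, List[int]]:
    """Parse a --device value into "auto", "cpu", or a list of GPU indices."""
    value = device.strip().lower()
    if value in ("auto", "cpu"):
        return value
    if not value.startswith("cuda:"):
        raise ValueError(f"unsupported device string: {device!r}")

    # "cuda:2,3,4" -> [2, 3, 4]; int() raises ValueError on empty or non-numeric parts
    indices = [int(part) for part in value[len("cuda:"):].split(",")]
    if any(index < 0 for index in indices):
        raise ValueError(f"negative GPU index in {device!r}")
    if len(set(indices)) != len(indices):
        raise ValueError(f"duplicate GPU index in {device!r}")
    return indices


for example in ("auto", "cpu", "cuda:0", "cuda:2,3,4,5,6,7"):
    print(example, "->", parse_device(example))
```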

#### Example runs

Standard server with default settings (auto GPU detection):
```bash
min-llm-server 
```

Standard server on a specific GPU (e.g., GPU 0):
```bash
min-llm-server --model_name openai/gpt-oss-20b --device cuda:0
```

Standard server on a specific GPU (e.g., GPU 1):
```bash
min-llm-server --model_name openai/gpt-oss-120b --device cuda:1
```

Standard server forced on CPU:
```bash
min-llm-server --model_name openai/gpt-oss-20b --max_new_tokens 50 --device cpu
```

vLLM server with auto GPU detection (uses all available GPUs):
```bash
min-llm-server-vllm --model_name meta-llama/Llama-3.3-70B-Instruct
```

vLLM server on a specific GPU (e.g., GPU 2):
```bash
min-llm-server-vllm --model_name meta-llama/Llama-3.3-70B-Instruct --device cuda:2
```

vLLM server with reduced GPU memory usage (for shared GPU scenarios):
```bash
min-llm-server-vllm --model_name meta-llama/Llama-3.3-70B-Instruct --device cuda:0 --gpu_memory_utilization 0.85
```

Standard server on several GPUs:
```bash
min-llm-server --model_name meta-llama/Llama-3.3-70B-Instruct --device cuda:2,3,4,5,6,7
```
---

### Sending Queries

Once the server is running (default: `http://127.0.0.1:5000/llm/q`), you can query it with `curl` or Python.

**Basic Curl Example:**

```bash
curl -X POST http://127.0.0.1:5000/llm/q \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Earth?", "key": "key1"}'
```

**Advanced Curl Example with Generation Parameters:**

```bash
curl -X POST http://127.0.0.1:5000/llm/q \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Explain quantum computing in simple terms",
    "key": "key1",
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.1,
    "presence_penalty": 0.5
  }'
```

**Python Client - Basic:**

```python
from min_llm_server_client.local_llm_inference_api_client import send_query

response = send_query("What is the capital of France?", user="user1", key="key1")
print(response)
```

**Python Client - Advanced with Parameters:**

```python
import requests

url = "http://127.0.0.1:5000/llm/q"
payload = {
    "query": "Write a short poem about AI",
    "key": "key1",
    "temperature": 0.8,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.2,
    "presence_penalty": 0.6
}

response = requests.post(url, json=payload)
result = response.json()
print(result['answer'])
```

**Available Generation Parameters:**

- `query` (required): The text prompt/question
- `key` (required): API authentication key (default: "key1")
- `temperature` (optional, default: 0.1): Controls randomness (0.0 = deterministic, 1.0 = very random)
- `top_p` (optional, default: 0.9): Nucleus sampling threshold (0.0-1.0)
- `top_k` (optional, default: None): Top-k sampling - limits to k most likely tokens
- `repetition_penalty` (optional, default: 1.2): Penalty for repeating tokens (1.0 = no penalty)
- `presence_penalty` (optional, default: None): Penalty for using tokens that have appeared (vLLM only)
- `extra_body` (optional, default: None): Dictionary of additional custom parameters
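
The parameter list above can be assembled into a request body with a small helper. `build_payload` is a hypothetical function for illustration, not part of the package API; its defaults mirror the values documented above, and it sends the unset optional fields (`top_k`, `presence_penalty`, `extra_body`) only when a value is given, which is an assumption about what the server expects.

```python
from typing import Any, Dict, Optional


def build_payload(
    query: str,
    key: str = "key1",
    temperature: float = 0.1,
    top_p: float = 0.9,
    top_k: Optional[int] = None,
    repetition_penalty: float = 1.2,
    presence_penalty: Optional[float] = None,
    extra_body: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the JSON body for POST /llm/q, validating the obvious ranges."""
    if not query:
        raise ValueError("query is required")
    if temperature < 0.0:
        raise ValueError("temperature must be >= 0.0")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0.0 and 1.0")

    payload: Dict[str, Any] = {
        "query": query,
        "key": key,
        "temperature": temperature,
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
    }
    optional = {
        "top_k": top_k,
        "presence_penalty": presence_penalty,  # vLLM backend only
        "extra_body": extra_body,
    }
    # Leave out optional fields that were not set.
    payload.update({name: value for name, value in optional.items() if value is not None})
    return payload


print(build_payload("Write a short poem about AI", temperature=0.8, top_k=40))
```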

**Note:** The server now handles requests asynchronously using a thread pool, preventing blocking during inference.
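
Because the server does not block while a request is being processed, a client can keep several requests in flight at once. The sketch below uses only the standard library; `post_query` and `query_many` are hypothetical helpers, the URL and `key1` are the defaults mentioned in this README, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

URL = "http://127.0.0.1:5000/llm/q"  # default endpoint from this README


def post_query(payload: Dict) -> Dict:
    """POST one JSON payload to the server and decode the JSON reply."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.loads(reply.read().decode("utf-8"))


def query_many(
    prompts: List[str],
    send: Callable[[Dict], Dict] = post_query,
    workers: int = 4,
) -> List[Dict]:
    """Send prompts concurrently; results are returned in prompt order."""
    payloads = [{"query": prompt, "key": "key1"} for prompt in prompts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, payloads))
```

With a server running, `query_many(["What is Earth?", "What is Mars?"])` would return one decoded reply per prompt. The `send` parameter exists so the function can be exercised with a stub instead of a live server.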

---

### Performance Comparison

**LLaMA 3.1 8B - Standard HuggingFace Backend:**
- Intel CPU → ~30 seconds per request, ~2.4 GB RAM
- A100 GPU → <1 second per request, ~34 GB GPU memory, ~4.8 GB CPU RAM

**LLaMA 3.1 8B - vLLM Optimized Backend:**
- A100 GPU → ~0.1-0.3 seconds per request (3-10x faster)
- Better memory efficiency with PagedAttention
- Supports higher concurrent request throughput

**Performance Tips:**
- Use vLLM for production deployments with high request volumes
- Use standard backend for development, testing, or CPU-only environments
- Both backends use multiple GPUs automatically; vLLM does so with tensor parallelism
- Both backends support the same API, making it easy to switch
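
Since both backends expose the same API, their per-request latency can be compared with one timing loop. The following is a minimal sketch: `measure_latency` is a hypothetical helper, it sends requests one at a time, and it reports wall-clock time as seen by the client, so the results will not necessarily match the figures quoted above.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(
    send: Callable[[Dict], object],
    prompts: List[str],
    key: str = "key1",
) -> Dict[str, float]:
    """Time each request sequentially and summarize the latencies in seconds."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        send({"query": prompt, "key": key})
        latencies.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


# Stand-in sender so the sketch runs without a server; replace it with a real HTTP call.
print(measure_latency(lambda payload: time.sleep(0.01), ["a", "b", "c"]))
```

To measure a running server, pass a sender such as `lambda p: requests.post("http://127.0.0.1:5000/llm/q", json=p, timeout=120)` and run the same prompts against each backend.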

---

## Project Structure

```
min_llm_server_client/
├── src/
│   ├── local_llm_inference_api_client.py
│   ├── local_llm_inference_server_api.py
│   └── ...
└── README.md
```

---

## License

This project is open source under the [Apache 2.0 License](./LICENSE-2.0.txt).

---

## Author
Afshin Sadeghi   
🔗 [GitHub](https://github.com/afshinsadeghi)  
🔗 [Google Scholar](https://scholar.google.com/citations?user=uWTszVEAAAAJ&hl=en&oi=ao)  
🔗 [LinkedIn](https://www.linkedin.com/in/afshin-sadeghi)
