LLM Service API

Centralized LLM management for low-memory VPS environments

A centralized LLM management service for VPS environments using Ollama. Designed for low-memory environments where only one model can run at a time.

Base URL

https://www.saacho.com/api/v1/llm

Authentication

All endpoints require API key authentication. Include the token in the X-API-Key header:

X-API-Key: <your-token>

Set the token via environment variable LLM_SERVICE_TOKEN or update the default in api.php.
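
For programmatic clients, attaching the header looks like this (a minimal Python sketch using only the standard library; the helper name and placeholder token are ours, not part of the service):

```python
import json
import urllib.request

BASE_URL = "https://www.saacho.com/api/v1/llm"

def build_request(path, token, payload=None):
    """Build a urllib Request carrying the X-API-Key header for any endpoint."""
    headers = {"X-API-Key": token}
    data = None
    if payload is not None:
        # JSON bodies are only needed for the POST endpoints.
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers=headers,
        method="POST" if data else "GET",
    )

# Build (but do not send) an authenticated status request.
req = build_request("/status", "your-token-here")
```

The same helper serves every endpoint below; pass a dict as `payload` for the POST endpoints that take a JSON body.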

Available Models

Model Selection Guide

  • Need quick summarization? Use llama3.2:1b - fastest and most memory-efficient
  • General conversation & tasks? Use llama3.2:3b - balanced performance
  • Mathematical logic & reasoning? Use phi4:mini - optimized for logic
  • Code generation? Use qwen2.5-coder:3b - specialized programming model
  • Multilingual support? Use qwen3:4b - excellent non-English support
  • Complex reasoning? Use gemma3:4b - advanced reasoning capabilities

Endpoints

    1. Generate Completion

    Generate text using the specified model.

    Endpoint: POST /generate

    Request Body:

    {
      "model": "llama3.2:3b",
      "prompt": "Explain quantum computing in simple terms",
      "options": {
        "temperature": 0.7,
        "max_tokens": 512
      }
    }
    

    Parameters:

      • model (string, required) - Model name to run; see Available Models
      • prompt (string, required) - The input text to complete
      • options (object, optional) - Generation settings, e.g. temperature and max_tokens

    Response (Success):

    {
      "success": true,
      "request_id": "req_6789abcdef",
      "response": "Quantum computing is a type of computing...",
      "model": "llama3.2:3b",
      "done": true,
      "queue_length": 0
    }
    

    Response (Queued):

    {
      "success": true,
      "message": "Request queued",
      "request_id": "req_6789abcdef",
      "queue_position": 2,
      "queue_length": 3
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/generate \
      -H "X-API-Key: your-token-here" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3.2:3b",
        "prompt": "What is PHP?",
        "options": {"max_tokens": 256}
      }'
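
A caller must handle both response shapes shown above, since a busy service may queue the request instead of answering inline. A sketch of that branching (Python; the helper name is ours):

```python
def handle_generate_response(body):
    """Classify a /generate response as completed or queued.

    Returns ("completed", text) when the model answered inline,
    or ("queued", request_id) when the request was deferred.
    """
    if not body.get("success"):
        raise RuntimeError(body.get("message", "generate failed"))
    if body.get("done"):
        return "completed", body["response"]
    return "queued", body["request_id"]

# Inline answer (matches the Success response above):
state, payload = handle_generate_response({
    "success": True, "request_id": "req_6789abcdef",
    "response": "Quantum computing is...", "model": "llama3.2:3b",
    "done": True, "queue_length": 0,
})

# Deferred request (matches the Queued response above):
state2, payload2 = handle_generate_response({
    "success": True, "message": "Request queued",
    "request_id": "req_6789abcdef", "queue_position": 2, "queue_length": 3,
})
```

In the queued case, hold on to `request_id` and monitor /status (or a worker) until the queue drains.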
    

    2. Get Status

    Get current service status including active model, queue length, and installed models.

    Endpoint: GET /status

    Response:

    {
      "success": true,
      "ollama_running": true,
      "current_model": "llama3.2:3b",
      "queue_length": 0,
      "is_processing": false,
      "available_models": [
        "llama3.2:3b",
        "phi4:mini",
        "qwen3:4b",
        "gemma3:4b",
        "llama3.2:1b",
        "qwen2.5-coder:3b"
      ],
      "installed_models": [
        {
          "name": "llama3.2:3b",
          "size": 1890000000,
          "modified_at": "2026-05-10T12:00:00Z"
        }
      ]
    }
    

    Example curl:

    curl -X GET https://www.saacho.com/api/v1/llm/status \
      -H "X-API-Key: your-token-here"
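
Because only one model runs at a time, clients often want to wait for the service to go idle before submitting large jobs. A possible polling loop over /status (Python sketch; `fetch_status` stands in for an authenticated GET to the endpoint):

```python
import time

def wait_until_idle(fetch_status, poll_seconds=2.0, timeout=60.0):
    """Poll until the queue is empty and nothing is processing.

    fetch_status is any zero-argument callable returning a /status JSON dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status["queue_length"] == 0 and not status["is_processing"]:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("queue did not drain in time")
        time.sleep(poll_seconds)

# Simulated statuses: busy on the first poll, idle on the second.
responses = iter([
    {"queue_length": 2, "is_processing": True},
    {"queue_length": 0, "is_processing": False},
])
final = wait_until_idle(lambda: next(responses), poll_seconds=0.01)
```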
    

    3. Pull All Models

    Download all available models from the Ollama registry.

    Endpoint: POST /pull-all

    Response:

    {
      "success": true,
      "message": "Pulled 6 of 6 models",
      "results": {
        "llama3.2:3b": { "success": true, "message": "Model llama3.2:3b pulled successfully" },
        "phi4:mini": { "success": true, "message": "Model phi4:mini pulled successfully" },
        "qwen3:4b": { "success": true, "message": "Model qwen3:4b pulled successfully" },
        "gemma3:4b": { "success": true, "message": "Model gemma3:4b pulled successfully" },
        "llama3.2:1b": { "success": true, "message": "Model llama3.2:1b pulled successfully" },
        "qwen2.5-coder:3b": { "success": true, "message": "Model qwen2.5-coder:3b pulled successfully" }
      }
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/pull-all \
      -H "X-API-Key: your-token-here"
    

    4. Pull Single Model

    Download a specific model from the Ollama registry.

    Endpoint: POST /pull

    Request Body:

    {
      "model": "qwen2.5-coder:3b"
    }
    

    Response:

    {
      "success": true,
      "message": "Model qwen2.5-coder:3b pulled successfully",
      "model": "qwen2.5-coder:3b"
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/pull \
      -H "X-API-Key: your-token-here" \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen2.5-coder:3b"}'
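
Instead of re-pulling everything with /pull-all, a client can diff the available_models and installed_models fields from /status and pull only what is missing. A sketch (the helper name is ours):

```python
def missing_models(status):
    """Return available models not yet installed, per the /status fields."""
    installed = {m["name"] for m in status.get("installed_models", [])}
    return [m for m in status.get("available_models", []) if m not in installed]

# Shape matches the /status response documented above.
status = {
    "available_models": ["llama3.2:3b", "phi4:mini", "qwen2.5-coder:3b"],
    "installed_models": [
        {"name": "llama3.2:3b", "size": 1890000000,
         "modified_at": "2026-05-10T12:00:00Z"},
    ],
}
to_pull = missing_models(status)  # each of these would go to POST /pull
```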
    

    5. Switch Model

    Manually switch to a different model. The current model is unloaded from memory before the new one is loaded.

    Endpoint: POST /switch

    Request Body:

    {
      "model": "phi4:mini"
    }
    

    Response:

    {
      "success": true,
      "message": "Switched to model phi4:mini",
      "model": "phi4:mini"
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/switch \
      -H "X-API-Key: your-token-here" \
      -H "Content-Type: application/json" \
      -d '{"model": "phi4:mini"}'
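
Since switching evicts the current model from memory, it is worth skipping the call when the wanted model is already active. A sketch of that guard (Python; the callable stands in for a POST to /switch, and `current_model` comes from /status):

```python
def ensure_model(current_model, wanted, switch):
    """Switch models only when needed; returns True if a switch was made."""
    if current_model == wanted:
        return False       # already loaded, avoid an unload/load cycle
    switch(wanted)         # would POST {"model": wanted} to /switch
    return True

calls = []
switched = ensure_model("llama3.2:3b", "phi4:mini", calls.append)
noop = ensure_model("phi4:mini", "phi4:mini", calls.append)
```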
    

    6. List Available Models

    Get the list of all available models with their descriptions.

    Endpoint: GET /models

    Response:

    {
      "success": true,
      "models": {
        "llama3.2:3b": "General Purpose / Balanced",
        "phi4:mini": "Logic & Reasoning / 3.8B",
        "qwen3:4b": "Coding & Multilingual",
        "gemma3:4b": "Advanced Reasoning",
        "llama3.2:1b": "Ultra-light / Summarization",
        "qwen2.5-coder:3b": "Specialized Programming"
      }
    }
    

    Example curl:

    curl -X GET https://www.saacho.com/api/v1/llm/models \
      -H "X-API-Key: your-token-here"
    

    7. Process Queue

    Manually process queued requests (for background worker scenarios).

    Endpoint: POST /queue/process

    Request Body:

    {
      "max_iterations": 10
    }
    

    Response:

    {
      "success": true,
      "processed": 3,
      "remaining": 0
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/queue/process \
      -H "X-API-Key: your-token-here" \
      -H "Content-Type: application/json" \
      -d '{"max_iterations": 5}'
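
A background worker typically calls this endpoint in a loop until `remaining` reaches zero. A sketch of such a loop (Python; `process` stands in for an authenticated POST to /queue/process):

```python
def drain_queue(process, max_rounds=10):
    """Invoke the queue processor until nothing remains; returns total processed.

    process is a callable returning the endpoint's JSON dict,
    e.g. {"success": true, "processed": 3, "remaining": 0}.
    """
    total = 0
    for _ in range(max_rounds):
        result = process()
        total += result["processed"]
        if result["remaining"] == 0:
            break
    return total

# Simulated worker rounds: two calls drain the queue.
rounds = iter([
    {"success": True, "processed": 3, "remaining": 2},
    {"success": True, "processed": 2, "remaining": 0},
])
handled = drain_queue(lambda: next(rounds))
```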
    

    8. Clear Queue

    Clear all pending requests from the queue.

    Endpoint: POST /queue/clear

    Response:

    {
      "success": true,
      "cleared": 5
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/queue/clear \
      -H "X-API-Key: your-token-here"
    

    Error Responses

    401 Unauthorized

    {
      "success": false,
      "error": "Unauthorized",
      "message": "Invalid or missing Bearer token"
    }
    

    400 Bad Request

    {
      "success": false,
      "error": "Bad Request",
      "message": "model and prompt are required"
    }
    

    404 Not Found

    {
      "success": false,
      "error": "Not Found",
      "message": "Endpoint not found: /invalid"
    }
    

    Architecture Notes

    Memory Management

    The service implements a single-model constraint:

    1. When a new model is requested, the current model is unloaded using keep_alive: 0

    2. The new model is then loaded with a 5-minute keep-alive window

    3. All requests are queued to prevent out-of-memory (OOM) errors
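
Steps 1 and 2 above map onto Ollama's keep_alive request parameter: a value of 0 evicts the model immediately, while a duration string keeps it resident. A sketch of the two payloads (Python; the field values follow the steps above, the helper names are ours):

```python
def unload_payload(model):
    """Ollama request body that evicts a loaded model immediately."""
    return {"model": model, "keep_alive": 0}

def load_payload(model, prompt):
    """Ollama request body that answers a prompt and keeps the model
    resident for the service's 5-minute window."""
    return {"model": model, "prompt": prompt, "keep_alive": "5m"}

old = unload_payload("llama3.2:3b")
new = load_payload("phi4:mini", "2 + 2 = ?")
```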

    Queue Behavior

  • Requests are processed FIFO (First-In-First-Out)
  • If a request arrives while another is processing, it's queued
  • The queue automatically processes when the previous request completes
  • Use /status to monitor queue length
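
The queue behavior above can be sketched as a simple FIFO structure (Python; the class and field names are ours, not the service's internals):

```python
from collections import deque

class RequestQueue:
    """FIFO queue mirroring the service's single-worker behavior."""

    def __init__(self):
        self._pending = deque()

    def submit(self, request_id):
        """Enqueue a request; returns its 1-based queue position."""
        self._pending.append(request_id)
        return len(self._pending)

    def next(self):
        """Pop the oldest request (first in, first out), or None if empty."""
        return self._pending.popleft() if self._pending else None

q = RequestQueue()
q.submit("req_a")
q.submit("req_b")
first = q.next()  # the oldest request comes out first
```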

    Token Configuration

    Set your API key via the LLM_SERVICE_TOKEN environment variable:

    export LLM_SERVICE_TOKEN="your-secure-random-token"
    

    Or edit the default token in api.php:

    $BEARER_TOKEN = getenv('LLM_SERVICE_TOKEN') ?: 'change-this-to-a-secure-token';
    

    Quick Reference