LLM Service API

Centralized LLM management for low-memory VPS environments

A centralized LLM management service for VPS environments using Ollama. Designed for low-memory environments where only one model can run at a time.

Base URL

https://www.saacho.com/api/v1/llm

Authentication

All endpoints require API key authentication. Include the token in the X-API-Key header:

X-API-Key: <your-token>

Set the token via environment variable LLM_SERVICE_TOKEN or update the default in api.php.
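
For programmatic clients, attaching the header looks like this (a minimal Python sketch using only the standard library; the helper name and placeholder token are ours, not part of the service):

```python
import json
import urllib.request

BASE_URL = "https://www.saacho.com/api/v1/llm"

def build_request(path, token, payload=None):
    """Build a urllib Request carrying the X-API-Key header for any endpoint."""
    headers = {"X-API-Key": token}
    data = None
    if payload is not None:
        # JSON bodies are only needed for the POST endpoints.
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers=headers,
        method="POST" if data else "GET",
    )

# Build (but do not send) an authenticated status request.
req = build_request("/status", "your-token-here")
```

The same helper serves every endpoint below; pass a dict as `payload` for the POST endpoints that take a JSON body.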

Available Models

Model Selection Guide

  • Need quick summarization? Use llama3.2:1b - fastest and most memory-efficient
  • General conversation & tasks? Use llama3.2:3b - balanced performance
  • Mathematical logic & reasoning? Use phi4:mini - optimized for logic
  • Code generation? Use qwen2.5-coder:3b - specialized programming model
  • Multilingual support? Use qwen3:4b - excellent non-English support
  • Complex reasoning? Use gemma3:4b - advanced reasoning capabilities

Endpoints

    1. Generate Completion

    Generate text using the specified model.

    Endpoint: POST /generate

    Request Body:

    {
      "model": "llama3.2:3b",
      "prompt": "Explain quantum computing in simple terms",
      "options": {
        "temperature": 0.7,
        "max_tokens": 512
      }
    }
    

    Parameters:

      • model (string, required) - Model name to run; see Available Models
      • prompt (string, required) - The input text to complete
      • options (object, optional) - Generation settings, e.g. temperature and max_tokens

    Response (Success):

    {
      "success": true,
      "request_id": "req_6789abcdef",
      "response": "Quantum computing is a type of computing...",
      "model": "llama3.2:3b",
      "done": true,
      "queue_length": 0
    }
    

    Response (Queued):

    {
      "success": true,
      "message": "Request queued",
      "request_id": "req_6789abcdef",
      "queue_position": 2,
      "queue_length": 3
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/generate \
      -H "X-API-Key: your-token-here" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3.2:3b",
        "prompt": "What is PHP?",
        "options": {"max_tokens": 256}
      }'
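
A caller must handle both response shapes shown above, since a busy service may queue the request instead of answering inline. A sketch of that branching (Python; the helper name is ours):

```python
def handle_generate_response(body):
    """Classify a /generate response as completed or queued.

    Returns ("completed", text) when the model answered inline,
    or ("queued", request_id) when the request was deferred.
    """
    if not body.get("success"):
        raise RuntimeError(body.get("message", "generate failed"))
    if body.get("done"):
        return "completed", body["response"]
    return "queued", body["request_id"]

# Inline answer (matches the Success response above):
state, payload = handle_generate_response({
    "success": True, "request_id": "req_6789abcdef",
    "response": "Quantum computing is...", "model": "llama3.2:3b",
    "done": True, "queue_length": 0,
})

# Deferred request (matches the Queued response above):
state2, payload2 = handle_generate_response({
    "success": True, "message": "Request queued",
    "request_id": "req_6789abcdef", "queue_position": 2, "queue_length": 3,
})
```

In the queued case, hold on to `request_id` and monitor /status (or a worker) until the queue drains.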
    

    2. Get Status

    Get current service status including active model, queue length, and installed models.

    Endpoint: GET /status

    Response:

    {
      "success": true,
      "ollama_running": true,
      "current_model": "llama3.2:3b",
      "queue_length": 0,
      "is_processing": false,
      "available_models": [
        "llama3.2:3b",
        "phi4:mini",
        "qwen3:4b",
        "gemma3:4b",
        "llama3.2:1b",
        "qwen2.5-coder:3b"
      ],
      "installed_models": [
        {
          "name": "llama3.2:3b",
          "size": 1890000000,
          "modified_at": "2026-05-10T12:00:00Z"
        }
      ]
    }
    

    Example curl:

    curl -X GET https://www.saacho.com/api/v1/llm/status \
      -H "X-API-Key: your-token-here"
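
Because only one model runs at a time, clients often want to wait for the service to go idle before submitting large jobs. A possible polling loop over /status (Python sketch; `fetch_status` stands in for an authenticated GET to the endpoint):

```python
import time

def wait_until_idle(fetch_status, poll_seconds=2.0, timeout=60.0):
    """Poll until the queue is empty and nothing is processing.

    fetch_status is any zero-argument callable returning a /status JSON dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status["queue_length"] == 0 and not status["is_processing"]:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("queue did not drain in time")
        time.sleep(poll_seconds)

# Simulated statuses: busy on the first poll, idle on the second.
responses = iter([
    {"queue_length": 2, "is_processing": True},
    {"queue_length": 0, "is_processing": False},
])
final = wait_until_idle(lambda: next(responses), poll_seconds=0.01)
```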
    

    3. Pull All Models

    Download all available models from the Ollama registry.

    Endpoint: POST /pull-all

    Response:

    {
      "success": true,
      "message": "Pulled 6 of 6 models",
      "results": {
        "llama3.2:3b": { "success": true, "message": "Model llama3.2:3b pulled successfully" },
        "phi4:mini": { "success": true, "message": "Model phi4:mini pulled successfully" },
        "qwen3:4b": { "success": true, "message": "Model qwen3:4b pulled successfully" },
        "gemma3:4b": { "success": true, "message": "Model gemma3:4b pulled successfully" },
        "llama3.2:1b": { "success": true, "message": "Model llama3.2:1b pulled successfully" },
        "qwen2.5-coder:3b": { "success": true, "message": "Model qwen2.5-coder:3b pulled successfully" }
      }
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/pull-all \
      -H "X-API-Key: your-token-here"
    

    4. Pull Single Model

    Download a specific model from the Ollama registry.

    Endpoint: POST /pull

    Request Body:

    {
      "model": "qwen2.5-coder:3b"
    }
    

    Response:

    {
      "success": true,
      "message": "Model qwen2.5-coder:3b pulled successfully",
      "model": "qwen2.5-coder:3b"
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/pull \
      -H "X-API-Key: your-token-here" \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen2.5-coder:3b"}'
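
Instead of re-pulling everything with /pull-all, a client can diff the available_models and installed_models fields from /status and pull only what is missing. A sketch (the helper name is ours):

```python
def missing_models(status):
    """Return available models not yet installed, per the /status fields."""
    installed = {m["name"] for m in status.get("installed_models", [])}
    return [m for m in status.get("available_models", []) if m not in installed]

# Shape matches the /status response documented above.
status = {
    "available_models": ["llama3.2:3b", "phi4:mini", "qwen2.5-coder:3b"],
    "installed_models": [
        {"name": "llama3.2:3b", "size": 1890000000,
         "modified_at": "2026-05-10T12:00:00Z"},
    ],
}
to_pull = missing_models(status)  # each of these would go to POST /pull
```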
    

    5. Switch Model

    Manually switch to a different model. The current model is unloaded from memory before the new one is loaded.

    Endpoint: POST /switch

    Request Body:

    {
      "model": "phi4:mini"
    }
    

    Response:

    {
      "success": true,
      "message": "Switched to model phi4:mini",
      "model": "phi4:mini"
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/switch \
      -H "X-API-Key: your-token-here" \
      -H "Content-Type: application/json" \
      -d '{"model": "phi4:mini"}'
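
Since switching evicts the current model from memory, it is worth skipping the call when the wanted model is already active. A sketch of that guard (Python; the callable stands in for a POST to /switch, and `current_model` comes from /status):

```python
def ensure_model(current_model, wanted, switch):
    """Switch models only when needed; returns True if a switch was made."""
    if current_model == wanted:
        return False       # already loaded, avoid an unload/load cycle
    switch(wanted)         # would POST {"model": wanted} to /switch
    return True

calls = []
switched = ensure_model("llama3.2:3b", "phi4:mini", calls.append)
noop = ensure_model("phi4:mini", "phi4:mini", calls.append)
```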
    

    6. List Available Models

    Get the list of all available models with their descriptions.

    Endpoint: GET /models

    Response:

    {
      "success": true,
      "models": {
        "llama3.2:3b": "General Purpose / Balanced",
        "phi4:mini": "Logic & Reasoning / 3.8B",
        "qwen3:4b": "Coding & Multilingual",
        "gemma3:4b": "Advanced Reasoning",
        "llama3.2:1b": "Ultra-light / Summarization",
        "qwen2.5-coder:3b": "Specialized Programming"
      }
    }
    

    Example curl:

    curl -X GET https://www.saacho.com/api/v1/llm/models \
      -H "X-API-Key: your-token-here"
    

    7. Process Queue

    Manually process queued requests (for background worker scenarios).

    Endpoint: POST /queue/process

    Request Body:

    {
      "max_iterations": 10
    }
    

    Response:

    {
      "success": true,
      "processed": 3,
      "remaining": 0
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/queue/process \
      -H "X-API-Key: your-token-here" \
      -H "Content-Type: application/json" \
      -d '{"max_iterations": 5}'
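
A background worker typically calls this endpoint in a loop until `remaining` reaches zero. A sketch of such a loop (Python; `process` stands in for an authenticated POST to /queue/process):

```python
def drain_queue(process, max_rounds=10):
    """Invoke the queue processor until nothing remains; returns total processed.

    process is a callable returning the endpoint's JSON dict,
    e.g. {"success": true, "processed": 3, "remaining": 0}.
    """
    total = 0
    for _ in range(max_rounds):
        result = process()
        total += result["processed"]
        if result["remaining"] == 0:
            break
    return total

# Simulated worker rounds: two calls drain the queue.
rounds = iter([
    {"success": True, "processed": 3, "remaining": 2},
    {"success": True, "processed": 2, "remaining": 0},
])
handled = drain_queue(lambda: next(rounds))
```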
    

    8. Clear Queue

    Clear all pending requests from the queue.

    Endpoint: POST /queue/clear

    Response:

    {
      "success": true,
      "cleared": 5
    }
    

    Example curl:

    curl -X POST https://www.saacho.com/api/v1/llm/queue/clear \
      -H "X-API-Key: your-token-here"
    

    Error Responses

    401 Unauthorized

    {
      "success": false,
      "error": "Unauthorized",
      "message": "Invalid or missing Bearer token"
    }
    

    400 Bad Request

    {
      "success": false,
      "error": "Bad Request",
      "message": "model and prompt are required"
    }
    

    404 Not Found

    {
      "success": false,
      "error": "Not Found",
      "message": "Endpoint not found: /invalid"
    }
    

    Architecture Notes

    Memory Management

    The service implements a single-model constraint:

    1. When a new model is requested, the current model is unloaded using keep_alive: 0

    2. The new model is then loaded with a 5-minute keep-alive window

    3. All requests are queued to prevent out-of-memory (OOM) errors
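
Steps 1 and 2 above map onto Ollama's keep_alive request parameter: a value of 0 evicts the model immediately, while a duration string keeps it resident. A sketch of the two payloads (Python; the field values follow the steps above, the helper names are ours):

```python
def unload_payload(model):
    """Ollama request body that evicts a loaded model immediately."""
    return {"model": model, "keep_alive": 0}

def load_payload(model, prompt):
    """Ollama request body that answers a prompt and keeps the model
    resident for the service's 5-minute window."""
    return {"model": model, "prompt": prompt, "keep_alive": "5m"}

old = unload_payload("llama3.2:3b")
new = load_payload("phi4:mini", "2 + 2 = ?")
```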

    Queue Behavior

  • Requests are processed FIFO (First-In-First-Out)
  • If a request arrives while another is processing, it's queued
  • The queue automatically processes when the previous request completes
  • Use /status to monitor queue length
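
The queue behavior above can be sketched as a simple FIFO structure (Python; the class and field names are ours, not the service's internals):

```python
from collections import deque

class RequestQueue:
    """FIFO queue mirroring the service's single-worker behavior."""

    def __init__(self):
        self._pending = deque()

    def submit(self, request_id):
        """Enqueue a request; returns its 1-based queue position."""
        self._pending.append(request_id)
        return len(self._pending)

    def next(self):
        """Pop the oldest request (first in, first out), or None if empty."""
        return self._pending.popleft() if self._pending else None

q = RequestQueue()
q.submit("req_a")
q.submit("req_b")
first = q.next()  # the oldest request comes out first
```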

    Token Configuration

    Set your API key via the LLM_SERVICE_TOKEN environment variable:

    export LLM_SERVICE_TOKEN="your-secure-random-token"
    

    Or edit the default token in api.php:

    $BEARER_TOKEN = getenv('LLM_SERVICE_TOKEN') ?: 'change-this-to-a-secure-token';
    

    Quick Reference