Centralized LLM management for low-memory VPS environments
A centralized LLM management service for VPS environments, built on Ollama. Designed for low-memory servers where only one model can be loaded at a time.
Base URL: https://www.saacho.com/api/v1/llm
All endpoints require API key authentication. Include the token in the X-API-Key header:
X-API-Key: <your-token>
Set the token via environment variable LLM_SERVICE_TOKEN or update the default in api.php.
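For example, a minimal authenticated request from PHP (a sketch assuming the PHP cURL extension; it calls the /status endpoint documented below):

<?php
// Read the shared token from the environment, with the same fallback
// default that api.php uses.
$token = getenv('LLM_SERVICE_TOKEN') ?: 'change-this-to-a-secure-token';

$ch = curl_init('https://www.saacho.com/api/v1/llm/status');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => ['X-API-Key: ' . $token],
]);
$body = curl_exec($ch);
curl_close($ch);

var_dump(json_decode($body, true));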
Available models:

llama3.2:1b - fastest and most memory-efficient
llama3.2:3b - balanced performance
phi4:mini - optimized for logic
qwen2.5-coder:3b - specialized programming model
qwen3:4b - excellent non-English support
gemma3:4b - advanced reasoning capabilities

Generate text using the specified model.
Endpoint: POST /generate
Request Body:
{
  "model": "llama3.2:3b",
  "prompt": "Explain quantum computing in simple terms",
  "options": {
    "temperature": 0.7,
    "max_tokens": 512
  }
}
Parameters:

model (string, required) - one of the available models listed above
prompt (string, required) - the text prompt to generate from
options (object, optional) - generation options such as temperature and max_tokens
Response (Success):
{
  "success": true,
  "request_id": "req_6789abcdef",
  "response": "Quantum computing is a type of computing...",
  "model": "llama3.2:3b",
  "done": true,
  "queue_length": 0
}
Response (Queued):
{
  "success": true,
  "message": "Request queued",
  "request_id": "req_6789abcdef",
  "queue_position": 2,
  "queue_length": 3
}
Example curl:
curl -X POST https://www.saacho.com/api/v1/llm/generate \
  -H "X-API-Key: your-token-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "prompt": "What is PHP?",
    "options": {"max_tokens": 256}
  }'
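A PHP client should be prepared for both response shapes above. The following sketch wraps /generate and branches on whether the request completed immediately or was queued; generateText() is a hypothetical helper name, not part of the service:

<?php
// Hypothetical wrapper around POST /generate.
function generateText(string $model, string $prompt, array $options = []): array
{
    $token = getenv('LLM_SERVICE_TOKEN') ?: 'change-this-to-a-secure-token';

    $ch = curl_init('https://www.saacho.com/api/v1/llm/generate');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode([
            'model'   => $model,
            'prompt'  => $prompt,
            // Encode an empty options array as a JSON object, not [].
            'options' => $options ?: new stdClass(),
        ]),
        CURLOPT_HTTPHEADER     => [
            'X-API-Key: ' . $token,
            'Content-Type: application/json',
        ],
    ]);
    $result = json_decode(curl_exec($ch), true);
    curl_close($ch);

    return $result ?? [];
}

$result = generateText('llama3.2:3b', 'What is PHP?', ['max_tokens' => 256]);

if (isset($result['response'])) {
    echo $result['response'], PHP_EOL;        // completed immediately
} elseif (isset($result['queue_position'])) {
    echo "Queued at position {$result['queue_position']}", PHP_EOL;  // check back later
}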
Get current service status including active model, queue length, and installed models.
Endpoint: GET /status
Response:
{
  "success": true,
  "ollama_running": true,
  "current_model": "llama3.2:3b",
  "queue_length": 0,
  "is_processing": false,
  "available_models": [
    "llama3.2:3b",
    "phi4:mini",
    "qwen3:4b",
    "gemma3:4b",
    "llama3.2:1b",
    "qwen2.5-coder:3b"
  ],
  "installed_models": [
    {
      "name": "llama3.2:3b",
      "size": 1890000000,
      "modified_at": "2026-05-10T12:00:00Z"
    }
  ]
}
Example curl:
curl -X GET https://www.saacho.com/api/v1/llm/status \
  -H "X-API-Key: your-token-here"
Download all available models from the Ollama registry.
Endpoint: POST /pull-all
Response:
{
  "success": true,
  "message": "Pulled 6 of 6 models",
  "results": {
    "llama3.2:3b": { "success": true, "message": "Model llama3.2:3b pulled successfully" },
    "phi4:mini": { "success": true, "message": "Model phi4:mini pulled successfully" },
    "qwen3:4b": { "success": true, "message": "Model qwen3:4b pulled successfully" },
    "gemma3:4b": { "success": true, "message": "Model gemma3:4b pulled successfully" },
    "llama3.2:1b": { "success": true, "message": "Model llama3.2:1b pulled successfully" },
    "qwen2.5-coder:3b": { "success": true, "message": "Model qwen2.5-coder:3b pulled successfully" }
  }
}
Example curl:
curl -X POST https://www.saacho.com/api/v1/llm/pull-all \
  -H "X-API-Key: your-token-here"
Download a specific model from the Ollama registry.
Endpoint: POST /pull
Request Body:
{
  "model": "qwen2.5-coder:3b"
}
Response:
{
  "success": true,
  "message": "Model qwen2.5-coder:3b pulled successfully",
  "model": "qwen2.5-coder:3b"
}
Example curl:
curl -X POST https://www.saacho.com/api/v1/llm/pull \
  -H "X-API-Key: your-token-here" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder:3b"}'
Manually switch to a different model. This unloads the current model from memory before loading the new one.
Endpoint: POST /switch
Request Body:
{
  "model": "phi4:mini"
}
Response:
{
  "success": true,
  "message": "Switched to model phi4:mini",
  "model": "phi4:mini"
}
Example curl:
curl -X POST https://www.saacho.com/api/v1/llm/switch \
  -H "X-API-Key: your-token-here" \
  -H "Content-Type: application/json" \
  -d '{"model": "phi4:mini"}'
Get the list of all available models with their descriptions.
Endpoint: GET /models
Response:
{
  "success": true,
  "models": {
    "llama3.2:3b": "General Purpose / Balanced",
    "phi4:mini": "Logic & Reasoning / 3.8B",
    "qwen3:4b": "Coding & Multilingual",
    "gemma3:4b": "Advanced Reasoning",
    "llama3.2:1b": "Ultra-light / Summarization",
    "qwen2.5-coder:3b": "Specialized Programming"
  }
}
Example curl:
curl -X GET https://www.saacho.com/api/v1/llm/models \
  -H "X-API-Key: your-token-here"
Manually process queued requests (for background worker scenarios).
Endpoint: POST /queue/process
Request Body:
{
  "max_iterations": 10
}
Response:
{
  "success": true,
  "processed": 3,
  "remaining": 0
}
Example curl:
curl -X POST https://www.saacho.com/api/v1/llm/queue/process \
  -H "X-API-Key: your-token-here" \
  -H "Content-Type: application/json" \
  -d '{"max_iterations": 5}'
Clear all pending requests from the queue.
Endpoint: POST /queue/clear
Response:
{
  "success": true,
  "cleared": 5
}
Example curl:
curl -X POST https://www.saacho.com/api/v1/llm/queue/clear \
  -H "X-API-Key: your-token-here"
Error Responses:

401 Unauthorized:
{
  "success": false,
  "error": "Unauthorized",
  "message": "Invalid or missing Bearer token"
}

400 Bad Request:
{
  "success": false,
  "error": "Bad Request",
  "message": "model and prompt are required"
}

404 Not Found:
{
  "success": false,
  "error": "Not Found",
  "message": "Endpoint not found: /invalid"
}
The service enforces a single-model constraint:
1. When a new model is requested, the current model is unloaded by setting keep_alive: 0
2. The new model is then loaded with a 5-minute keep-alive window
3. All requests are queued and processed one at a time to prevent out-of-memory (OOM) errors
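The unload step leans on Ollama's own API: a request to its /api/generate endpoint with a model name, no prompt, and keep_alive: 0 evicts that model from memory, while a positive keep_alive loads the model and keeps it resident. A sketch of the sequence against a local Ollama on its default port 11434 (ollamaKeepAlive() is a hypothetical helper, not part of this service):

<?php
// Send a prompt-less /api/generate request that only adjusts how long
// Ollama keeps the model in memory.
function ollamaKeepAlive(string $model, int|string $keepAlive): void
{
    $ch = curl_init('http://localhost:11434/api/generate');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode([
            'model'      => $model,
            'keep_alive' => $keepAlive,
        ]),
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    ]);
    curl_exec($ch);
    curl_close($ch);
}

ollamaKeepAlive('llama3.2:3b', 0);   // unload the current model immediately
ollamaKeepAlive('phi4:mini', '5m');  // load the new model with a 5-minute window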
Use GET /status to monitor the queue length and processing state.

Set your API token via environment variable:
export LLM_SERVICE_TOKEN="your-secure-random-token"
Or edit the default token in api.php:
$BEARER_TOKEN = getenv('LLM_SERVICE_TOKEN') ?: 'change-this-to-a-secure-token';
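One way to mint a strong random token (a sketch using PHP's built-in CSPRNG):

<?php
// 32 random bytes, hex-encoded: a 64-character token.
echo bin2hex(random_bytes(32)), PHP_EOL;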