My HP DL380 Is Now Running An AI I Can Literally Call On The Phone
Introduction
In the world of enterprise hardware repurposing, few moments are as satisfying as breathing new life into retired server equipment. When an HP DL380 Gen9 server - originally designed for traditional data center workloads - starts conducting natural voice conversations through a self-hosted AI stack, we’ve crossed into uncharted territory for homelab enthusiasts and DevOps professionals.
The challenge of creating a truly autonomous conversational AI has traditionally required massive cloud resources and proprietary APIs. But recent advancements in open-source machine learning have enabled a fascinating convergence: enterprise-grade hardware meeting cutting-edge AI in completely self-contained deployments. This breakthrough eliminates cloud dependencies, reduces latency to human-conversation levels (under 300ms round-trip), and maintains strict data privacy - all critical concerns for infrastructure professionals.
In this comprehensive guide, you’ll learn how to transform an HP DL380 (or similar server hardware) into a fully functional AI telephone companion using:
- Asterisk PBX (voice call infrastructure)
- OpenAI’s Whisper (real-time speech recognition)
- Mistral 7B (local large language model)
- Coqui XTTS (neural text-to-speech with voice cloning)
We’ll cover hardware requirements, software configuration, performance optimization, and security hardening - with all components running on bare metal or Docker containers. Whether you’re a sysadmin exploring AI workloads or a DevOps engineer building on-premises voice assistants, this stack demonstrates what’s possible with modern open-source tooling.
Understanding the Technology Stack
Component Breakdown
1. Asterisk PBX
The bedrock of our telephony system, Asterisk handles SIP signaling, call routing, and audio stream management. Its modular architecture allows integration with our AI components through the AGI (Asterisk Gateway Interface).
Key Features:
- SIP/TLS for secure call setup
- RTP (Real-time Transport Protocol) audio handling
- AGI interface for external program control
- Dialplan scripting for call flow management
2. Whisper (OpenAI’s Speech Recognition)
The real-time transcription engine converts spoken words to text with remarkable accuracy. We’re using the faster-whisper implementation which provides:
- Real-time streaming transcription
- Multi-language support
- Word-level timestamps
- Optimized CUDA execution
3. Mistral 7B
This 7-billion parameter language model delivers surprisingly coherent responses while remaining small enough to run locally on consumer GPUs. Key advantages include:
- Apache 2.0 license (commercial-friendly)
- 32k token context window
- Instruction-following capabilities
- Optimized for low-latency inference
4. Coqui XTTS
The open-source text-to-speech system that gives our AI a human-like voice:
- Voice cloning from short samples
- Emotional tone control
- Streaming API for real-time playback
- Support for multiple speakers
Architectural Flow
Caller --> SIP (Asterisk) --> Audio Stream --> Whisper (Speech-to-Text)
                                                   ↓
Mistral (Process Text) --> Response Text --> XTTS (Text-to-Speech)
                                                   ↓
Asterisk <-- Audio Stream <-- Synthesized Speech
Why Local Deployment Matters
Latency Control: Cloud-based solutions introduce unpredictable delays (often 500ms+). Our local stack achieves 200-300ms round-trip latency.
Data Sovereignty: Voice data never leaves your infrastructure - critical for healthcare, finance, or personal projects.
Cost Predictability: Eliminates API call expenses - particularly important for high-volume usage.
Customization: Full control over models, prompts, and voice characteristics.
Prerequisites
Hardware Requirements
Minimum Specifications (Tested Configuration):
- HP DL380 Gen9 (or comparable server)
- Dual Intel Xeon E5-2690v3 (24 cores total)
- 128GB DDR4 ECC RAM
- NVIDIA T4 GPU (16GB VRAM) - critical for ML workloads
- Hardware RAID controller (RAID 10 recommended)
- Dual power supplies
- Intel X520-DA2 10GbE NIC (for VoIP traffic isolation)
Storage Considerations:
- 500GB SSD for OS and applications
- 1TB NVMe cache for Whisper temp files
- 2TB HDD for voice samples and logs
Software Requirements
Base Operating System:
# Ubuntu 22.04.3 LTS (Jammy Jellyfish)
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
Critical Dependencies:
# NVIDIA Drivers and CUDA
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
Version-Locked Components:
- Python 3.10.12
- Docker 24.0.7
- NVIDIA Container Toolkit 1.14.1
- Asterisk 20.3.0
- faster-whisper 0.10.0
- Mistral 7B Instruct v0.2
- Coqui XTTS 2.0.2
Network Configuration
Required Ports:

| Port        | Protocol | Service | Notes                        |
|-------------|----------|---------|------------------------------|
| 5060        | TCP/UDP  | SIP     | Standard SIP port            |
| 5061        | TCP      | SIP/TLS | Secure SIP                   |
| 10000-20000 | UDP      | RTP     | Dynamic audio ports          |
| 8000        | TCP      | API     | faster-whisper HTTP endpoint |
| 8020        | TCP      | API     | XTTS HTTP endpoint           |
| 11434       | TCP      | API     | Ollama HTTP endpoint         |
Security Considerations:
- Physically separate VoIP VLAN
- Fail2ban configuration for SIP ports (see the jail sketch below)
- TLS 1.3 for SIP signaling
- SRTP (Secure RTP) for audio encryption
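For the Fail2ban item, a jail along these lines works with the asterisk filter that ships with Fail2ban (the logpath assumes Asterisk's security log is enabled in logger.conf):

# /etc/fail2ban/jail.d/asterisk.local
[asterisk]
enabled  = true
port     = 5060,5061
logpath  = /var/log/asterisk/security
maxretry = 5
bantime  = 1h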
Installation & Setup
1. Base System Preparation
Kernel Optimization:
# /etc/sysctl.conf
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_keepalive_time=60
vm.swappiness=10
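Apply the settings without rebooting:

$ sudo sysctl -p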
GPU Driver Installation:
$ sudo apt install -y nvidia-driver-535 cuda-toolkit-12-2
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
2. Asterisk Configuration
Installation:
$ sudo apt install -y asterisk asterisk-dev libopus-dev

Note: Ubuntu 22.04's stock repositories ship Asterisk 18, so build 20.3.0 from source (or use a third-party repository) to match the version-locked components above.
/etc/asterisk/pjsip.conf:
[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0:5060

[transport-tls]
type=transport
protocol=tls
bind=0.0.0.0:5061

[ai_phone]
type=endpoint
context=ai_incoming
disallow=all
allow=ulaw
allow=opus
auth=ai_phone
aors=ai_phone

[ai_phone]
type=auth
auth_type=userpass
password=SecurePass123!
username=ai_phone

[ai_phone]
type=aor
max_contacts=1
/etc/asterisk/extensions.conf:
[ai_incoming]
exten => s,1,Answer()
same => n,AGI(agi://127.0.0.1:3000/ai-agi)
same => n,Hangup()
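The dialplan hands the call to a FastAGI service listening on port 3000, which you supply yourself. Below is a minimal Python sketch of that service, assuming the faster-whisper and Ollama endpoints configured in the next sections and a hypothetical XTTS HTTP route at http://localhost:8020/tts (adapt to whatever your XTTS container actually exposes); a production version would stream audio in both directions instead of record-then-reply:

# Minimal FastAGI sketch backing agi://127.0.0.1:3000/ai-agi
import socketserver

import requests

class AIAgiHandler(socketserver.StreamRequestHandler):
    def agi(self, command: str) -> str:
        """Send one AGI command and return Asterisk's '200 result=...' reply."""
        self.wfile.write((command + "\n").encode())
        return self.rfile.readline().decode().strip()

    def handle(self):
        # Asterisk sends 'agi_*: value' headers first, terminated by a blank line.
        while self.rfile.readline().decode().strip():
            pass

        # Record the caller's question (stops on '#' or after 10 s of audio).
        self.agi('RECORD FILE /tmp/caller wav "#" 10000')

        # 1. Speech-to-text via the faster-whisper HTTP endpoint.
        with open("/tmp/caller.wav", "rb") as f:
            text = requests.post(
                "http://localhost:8000/asr",
                headers={"Content-Type": "audio/wav"},
                data=f,
            ).json()["text"]

        # 2. Generate a reply with Mistral via Ollama ("stream": False -> one JSON object).
        reply = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "mistral", "prompt": text, "stream": False},
        ).json()["response"]

        # 3. Text-to-speech; the /tts route and JSON shape are assumptions. Asterisk
        #    plays .wav as 8 kHz 16-bit PCM, so resample the XTTS output accordingly.
        audio = requests.post(
            "http://localhost:8020/tts",
            json={"text": reply, "language": "en"},
        ).content
        with open("/tmp/reply.wav", "wb") as f:
            f.write(audio)

        # Play the synthesized answer (AGI file paths omit the extension).
        self.agi('STREAM FILE /tmp/reply ""')
        self.agi("HANGUP")

if __name__ == "__main__":
    with socketserver.ThreadingTCPServer(("127.0.0.1", 3000), AIAgiHandler) as server:
        server.serve_forever()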
3. Whisper Deployment
Using faster-whisper in Docker:
$ docker run -d --gpus all --name whisper \
-p 8000:8000 \
-v /opt/whisper/cache:/root/.cache \
ghcr.io/guillaumekln/faster-whisper:latest \
--model small.en \
--compute_type float16 \
--server_port 8000
Verification:
$ curl -X POST http://localhost:8000/asr \
-H "Content-Type: audio/wav" \
--data-binary @test.wav
{"text":"this is a test of the whisper transcription system","language":"en"}
4. Mistral Inference Server
Using Ollama for local LLM:
$ docker run -d --gpus all --name ollama \
-p 11434:11434 \
-v /opt/ollama:/root/.ollama \
ollama/ollama:latest
$ docker exec ollama ollama pull mistral:7b-instruct-v0.2-q4_K_M
Test Query:
$ curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt":"Why is the sky blue?"
}'
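By default this endpoint streams the answer as newline-delimited JSON chunks; add "stream": false if you want a single JSON object back (as the AGI sketch above does):

$ curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'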
5. Coqui XTTS Setup
Docker Deployment:
$ docker run -d --gpus all --name xtts \
-p 8020:8020 \
-e "XTTS_MODEL=tts_models/multilingual/multi-dataset/xtts_v2" \
coqui/xtts:v2.0.2
Voice Cloning:
from TTS.api import TTS

# Load the XTTS v2 model (same model ID the container uses)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello world",
    speaker_wav="reference.wav",  # short, clean sample of the target voice
    language="en",
    file_path="output.wav",
)
Configuration & Optimization
Asterisk Performance Tuning
/etc/asterisk/asterisk.conf:
[options]
execincludes=yes
highpriority=yes
maxload = 1.5 ; Avoid overloading the system

[directories]
astcachedir = /dev/shm/astcache ; Keep the cache on a RAM disk
RTP Optimization:
; /etc/asterisk/rtp.conf
[general]
rtpstart=10000
rtpend=20000
rtpchecksums=no ; Skip UDP checksums (only on a trusted LAN)
Whisper Model Selection
| Model     | VRAM Usage | Speed (RTF) | Accuracy |
|-----------|------------|-------------|----------|
| tiny.en   | 1GB        | 0.1x        | 60%      |
| base.en   | 1.5GB      | 0.2x        | 70%      |
| small.en  | 5GB        | 0.4x        | 80%      |
| medium.en | 10GB       | 0.8x        | 90%      |
# Start container with different model
$ docker run ... faster-whisper --model medium.en
Mistral Prompt Engineering
System Prompt Template:
You are an AI assistant named "JARVIS" answering phone calls.
Respond concisely in under 15 words.
Current time: {time}.
Last caller: {last_caller}.
Context: {call_context}
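At call time the placeholders are filled in and the result is passed as the system prompt. A minimal sketch, assuming Ollama's /api/generate endpoint and treating last_caller and call_context as state you track yourself:

from datetime import datetime

import requests

SYSTEM_TEMPLATE = (
    'You are an AI assistant named "JARVIS" answering phone calls.\n'
    "Respond concisely in under 15 words.\n"
    "Current time: {time}.\n"
    "Last caller: {last_caller}.\n"
    "Context: {call_context}"
)

def ask(user_text: str, last_caller: str, call_context: str) -> str:
    # Fill the template, then send it as Ollama's system prompt
    system = SYSTEM_TEMPLATE.format(
        time=datetime.now().strftime("%H:%M"),
        last_caller=last_caller,
        call_context=call_context,
    )
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "system": system,
              "prompt": user_text, "stream": False},
        timeout=30,
    )
    return response.json()["response"]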
Temperature Settings:
# ollama Modelfile
FROM mistral:7b-instruct-v0.2-q4_K_M
# Lower temperature for predictable responses
PARAMETER temperature 0.3
# Balance memory use against context length
PARAMETER num_ctx 4096
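Build the customized model with ollama create and use the new tag (jarvis-phone here is an arbitrary name) in place of the stock model. Since /opt/ollama is mounted at /root/.ollama, saving the Modelfile under /opt/ollama makes it visible inside the container:

$ docker exec ollama ollama create jarvis-phone -f /root/.ollama/Modelfile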
XTTS Voice Cloning Optimization
High-Quality Samples:
- 10-30 seconds of clean speech
- Consistent microphone positioning
- Minimal background noise
- Multiple emotional tones
Real-Time Streaming:
# Illustrative pseudocode: play TTS audio into the call while XTTS is still
# generating (tts_stream and asterisk_stream are placeholders for your XTTS
# streaming client and the Asterisk audio channel)
for chunk in tts_stream:
    asterisk_stream.write(chunk)
Usage & Operations
Starting the Full Stack
Systemd Service File (/etc/systemd/system/ai-phone.service):
[Unit]
Description=AI Phone System
After=docker.service
[Service]
ExecStart=/usr/bin/docker-compose -f /opt/ai-phone/docker-compose.yml up
ExecStop=/usr/bin/docker-compose -f /opt/ai-phone/docker-compose.yml down
[Install]
WantedBy=multi-user.target
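Reload systemd and enable the stack at boot:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now ai-phone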
docker-compose.yml:
version: '3.8'

services:
  asterisk:
    image: asterisk:20            # assumes a locally built Asterisk 20 image
    ports:
      - "5060:5060/udp"
      - "5061:5061/tcp"
      # note: RTP ports 10000-20000/udp must also be reachable; consider network_mode: host
    volumes:
      - ./asterisk/config:/etc/asterisk
    devices:
      - "/dev/dsp:/dev/dsp"

  whisper:
    image: faster-whisper:gpu     # assumes a locally tagged GPU build
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"

  mistral:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  xtts:
    image: xtts:gpu               # assumes a locally tagged GPU build
    ports:
      - "8020:8020"

volumes:
  ollama_data:
Monitoring Commands
Check GPU Utilization:
$ watch -n 1 nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
Asterisk Call Monitoring:
$ asterisk -rx "core show channels"
$ asterisk -rx "pjsip show endpoints"
Backup Procedures
Model Backup Script:
#!/bin/bash
# Backup AI models
rsync -av /opt/ollama /backup/ollama-$(date +%F)
rsync -av /opt/xtts-voices /backup/voices-$(date +%F)
# Backup Asterisk config
tar czf /backup/asterisk-$(date +%F).tgz /etc/asterisk
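To run the backup nightly, install the script (the path below is arbitrary) and add a cron entry:

$ sudo crontab -e
0 3 * * * /opt/ai-phone/backup.sh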
Troubleshooting
Common Issues
1. Audio Latency Spikes
- Check GPU temperature:
nvidia-smi -q -d TEMPERATURE
- Reduce Whisper model size
- Isolate RTP traffic to dedicated NIC
2. Incomplete Transcriptions
# Increase Whisper beam size
$ docker run ... faster-whisper --beam_size 5
3. LLM Response Delays
- Monitor VRAM usage:
nvidia-smi -l 1
- Increase GPU layer offloading for Mistral (Ollama's num_gpu parameter sets how many layers run on the GPU; add it to the Modelfile above and rebuild with ollama create):
PARAMETER num_gpu 32
4. SIP Registration Failures
$ asterisk -rx "pjsip set logger on"
$ tail -f /var/log/aster