👁️ Netra: Building an AI Agent That Watches Your Home While You Sleep, for $0
What if your CCTV could think?
TL;DR: We built an AI-powered surveillance system using a local Vision-Language Model that continuously monitors our CCTV feed, stores intelligent captions, and sends us notifications when something unusual happens, all running on a potato home server with 8GB RAM.
🌱 The Spark
It started with a YouTube video.
My brother and I were watching a demo of local video-captioning using LFM2.5-VL-1.6B, a lightweight Vision-Language Model. I'd played around with LFM2.5-1.2B before for other open-source experiments, so this felt like familiar territory.
"What if we point this at our CCTV?" one of us said.
And just like that, a weekend project was born.
🏗️ The Architecture (Spoiler: It's Beautifully Simple)
Here's what we ended up building:
┌─────────────────────────────────────────────────────────────────────────────┐
│                                NETRA SYSTEM                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────┐      RTSP      ┌──────────────┐     Frame      ┌──────────┐   │
│  │   CCTV   │───────────────▶│    Frame     │───────────────▶│    LM    │   │
│  │  Camera  │    (24 FPS)    │   Sampler    │   (1/10 sec)   │  Studio  │   │
│  └──────────┘                │   (Python)   │                │  (VLM)   │   │
│                              └──────────────┘                └────┬─────┘   │
│                                                                   │         │
│                                                               Captions      │
│                                                                   │         │
│                                                                   ▼         │
│  ┌──────────┐                ┌──────────────┐                ┌──────────┐   │
│  │  Nexus   │◀──── MCP ─────▶│    SQLite    │◀───────────────│ Caption  │   │
│  │ (Agent)  │                │   Database   │                │  Store   │   │
│  └────┬─────┘                └──────────────┘                └──────────┘   │
│       │                                                                     │
│       │ Queries & Notifications                                             │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                          Family Group Chat                          │    │
│  │  • Real-time alerts ("Person detected for 1 minute")                │    │
│  │  • Morning summaries ("Here's what happened last night...")         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
🔧 Problem #1: Getting the Feed
The Bad Idea
Our first approach was... not great. We thought we'd:
- Manually download video exports from the GCMOB app
- Run the VLM on those static files
- Get captions
Reality check: The GCMOB app has the worst UX imaginable. To download 1 hour of footage, you literally have to scroll back in history and hold the record button... for 1 hour. 🤦
Also, LFM2.5-VL-1.6B is a Vision-Language model: it takes images and text, not native video files.
The Good Idea: RTSP to the Rescue
After some googling and tinkering, we discovered our CCTV supports RTSP (Real Time Streaming Protocol).
# The magic URL that changed everything
RTSP_URL = "rtsp://admin:yourpassword@192.168.1.XXX:554/stream1"
We enabled the right settings on the camera, wrestled with some network configs, and then...
Victory #1: Live feed on our home server!
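Camera passwords often contain characters such as `@` or `/`, which break an RTSP URL unless they are percent-encoded. A minimal sketch of one way to handle that, assuming a hypothetical helper named `build_rtsp_url` (not part of our script) and a placeholder IP address; the port and stream path depend on the camera:

```python
from urllib.parse import quote


def build_rtsp_url(user, password, host, port=554, path="stream1"):
    """Build an RTSP URL with percent-encoded credentials.

    Hypothetical helper for illustration only; the default port and
    stream path are assumptions and vary between camera models.
    """
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"rtsp://{safe_user}:{safe_password}@{host}:{port}/{path.lstrip('/')}"


if __name__ == "__main__":
    # Placeholder credentials and address, not a real camera
    print(build_rtsp_url("admin", "p@ss/word", "192.168.1.10"))
    # rtsp://admin:p%40ss%2Fword@192.168.1.10:554/stream1
```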
🧠 Problem #2: Running the Vision Model
Step 1: The Quick & Dirty Way
Initially, I threw together a Python script using transformers:
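from transformers import AutoModelForImageTextToText, AutoProcessor
MODEL_ID = "LiquidAI/LFM2.5-VL-1.6B"  # Hugging Face repo id (the org is LiquidAI)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)
def caption_frame(image):
    # A VLM needs the image-text model class and a chat-style prompt with the image embedded
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe what you see:"}]}]
    inputs = processor.apply_chat_template(messages, add_generation_prompt=True,
        tokenize=True, return_dict=True, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    # Decode only the newly generated tokens, not the echoed prompt
    return processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)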
It worked, but managing the model in Python felt clunky.
Step 2: LM Studio for the Win
We already had LM Studio running on our home server for other experiments. Why not use its inference server?
Downloaded the Q4_K_M quantized version (smaller, faster, still smart enough) and pointed our code at it:
import base64
import requests

def caption_frame_via_lmstudio(image_bytes):
    """Send frame to LM Studio's OpenAI-compatible API"""
    base64_image = base64.b64encode(image_bytes).decode('utf-8')
    response = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            # Must match the identifier LM Studio shows for the loaded model
            "model": "liquid-ai/LFM2.5-VL-1.6B",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Describe what you see in this image concisely."},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                    ]
                }
            ],
            "max_tokens": 150
        },
        timeout=60  # don't hang forever if the server stalls
    )
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return response.json()["choices"][0]["message"]["content"]
[Screenshot: LM Studio server running the LFM2.5-VL-1.6B model showing the inference stats]
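The `model` field has to match whatever identifier LM Studio assigns to the loaded model, which may differ from the name used above. OpenAI-compatible servers list their models at `/v1/models`, so one way to look it up is sketched below. The helper names `pick_model_id` and `find_vlm_model` are made up for illustration, and the default `needle` is only a guess at how the identifier is spelled:

```python
import json
from urllib.request import urlopen


def pick_model_id(models_response, needle):
    """Return the first model id containing `needle` (case-insensitive).

    `models_response` is the parsed JSON of an OpenAI-style model list:
    {"data": [{"id": "..."}, ...]}. Returns None when nothing matches.
    """
    for entry in models_response.get("data", []):
        model_id = entry.get("id", "")
        if needle.lower() in model_id.lower():
            return model_id
    return None


def find_vlm_model(base_url="http://localhost:1234", needle="lfm2.5-vl"):
    """Ask the local server which models it exposes and pick the VLM.

    Assumes the inference server is running locally on port 1234.
    """
    with urlopen(f"{base_url}/v1/models", timeout=10) as response:
        return pick_model_id(json.load(response), needle)
```

The returned string can then be passed as the `model` value in the request payload.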
⏱️ Problem #3: Our Potato Server Can't Keep Up
The Math Problem
- CCTV runs at 24 FPS
- Our server has 8GB RAM
- VLM inference takes ~1-2 seconds per frame
You see where this is going. We immediately got a backlog of 80+ seconds, and it kept growing.
Frame Queue: [████████████████████████████████████████] 80+ frames behind
Server RAM:  [███████████████████████████████████████░] 98%
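The backlog follows from simple rate arithmetic: frames arrive far faster than they can be captioned, so the queue grows by the difference every second. A rough sketch, taking 1.5 s as an assumed midpoint of the 1-2 s inference time (the function names are illustrative, and the model ignores everything else the server is doing):

```python
def backlog_after(seconds, arrival_fps, inference_seconds):
    """Frames waiting after `seconds` if every arriving frame is queued."""
    service_fps = 1.0 / inference_seconds
    growth_per_second = max(0.0, arrival_fps - service_fps)
    return growth_per_second * seconds


def is_sustainable(sample_interval, inference_seconds):
    """A sampling interval keeps up only if inference finishes within it."""
    return inference_seconds <= sample_interval


if __name__ == "__main__":
    # 24 FPS in, ~1.5 s per caption out: roughly 23 extra frames every second
    print(round(backlog_after(10, 24, 1.5)))  # 233
    print(is_sustainable(10, 1.5))            # True
    print(is_sustainable(1, 1.5))             # False
```

This is a best-case estimate. Real measurements on a loaded 8GB machine will look worse than the arithmetic suggests.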
The Solution: Strategic Sampling
We don't need EVERY frame. In home surveillance, things don't change that fast.
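import cv2
import time

SAMPLE_INTERVAL = 10  # seconds - our sweet spot
RECONNECT_DELAY = 5   # seconds to wait before reopening a dropped stream (our choice)

def process_rtsp_stream(rtsp_url):
    cap = cv2.VideoCapture(rtsp_url)
    last_capture_time = 0
    while True:
        # Keep reading every frame so we always sample a recent one
        ret, frame = cap.read()
        if not ret:
            # Stream dropped: reopen it instead of spinning on a dead handle
            cap.release()
            time.sleep(RECONNECT_DELAY)
            cap = cv2.VideoCapture(rtsp_url)
            continue
        current_time = time.time()
        # Only process 1 frame every 10 seconds
        if current_time - last_capture_time >= SAMPLE_INTERVAL:
            last_capture_time = current_time
            # Convert frame to JPEG bytes and send to VLM
            ok, buffer = cv2.imencode('.jpg', frame)
            if not ok:
                continue  # skip frames that fail to encode
            caption = caption_frame_via_lmstudio(buffer.tobytes())
            store_caption(timestamp=current_time, caption=caption)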
| Interval | Backlog | CPU Usage | Our Verdict |
|---|---|---|---|
| 1 sec | Growing | 100% | Impossible |
| 5 sec | ⚠️ Slight | 85% | Risky |
| 10 sec | ✅ Stable | 60% | Sweet spot |
| 30 sec | ✅ Minimal | 30% | Too slow |
⚠️ Trade-off acknowledged: 10 seconds is a long time. Someone could walk through your frame and leave before the next sample. We'll revisit this with `llama.cpp` for faster inference.
💾 Problem #4: Making the Data Useful
Captions every 10 seconds = 8,640 captions per day.
Nobody's reading that.
Step 1: Store Everything
We went with SQLite because:
- Dead simple to set up
- Perfect for single-user, local-first apps
- At ~100 bytes per caption, a full day of captions is under 1 MB, so a few GB holds years of data
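The storage estimate is easy to check. A back-of-the-envelope sketch, using the ~100 bytes per caption figure and ignoring SQLite's per-row overhead (an assumption, so real files will be somewhat larger):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def captions_per_day(sample_interval_seconds):
    """Number of captions produced per day at a fixed sampling interval."""
    return SECONDS_PER_DAY // sample_interval_seconds


def megabytes_per_year(sample_interval_seconds, bytes_per_caption=100):
    """Approximate yearly storage in MB (caption text only)."""
    daily_bytes = captions_per_day(sample_interval_seconds) * bytes_per_caption
    return daily_bytes * 365 / 1_000_000


if __name__ == "__main__":
    print(captions_per_day(10))               # 8640
    print(round(megabytes_per_year(10), 1))   # 315.4
```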
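import sqlite3
from contextlib import closing
from datetime import datetime

def init_db():
    conn = sqlite3.connect('netra.db')
    conn.execute('''
        CREATE TABLE IF NOT EXISTS captions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
            caption TEXT NOT NULL
        )
    ''')
    conn.commit()
    return conn

def store_caption(timestamp, caption):
    # Store ISO 8601 text (e.g. "2024-01-25T10:00:00") so it compares
    # correctly against the ISO strings used when querying
    iso_time = datetime.fromtimestamp(timestamp).isoformat(timespec="seconds")
    # closing() makes sure the connection is released after every insert
    with closing(sqlite3.connect('netra.db')) as conn:
        conn.execute(
            "INSERT INTO captions (timestamp, caption) VALUES (?, ?)",
            (iso_time, caption)
        )
        conn.commit()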
💡 Yes, we built CRUD without the U (Update). When would you ever need to edit a historical caption?
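One detail worth checking: SQLite stores these timestamps as text, so range queries compare strings. If rows are written with a space separator (`2024-01-25 10:30:00`) but queried with ISO `T` strings, `BETWEEN` can silently miss rows, because a space sorts before `T`. A small in-memory sketch of the effect (the table is a throwaway copy of the schema, not the real `netra.db`, and `count_in_range` is a made-up helper):

```python
import sqlite3
from contextlib import closing


def count_in_range(stored_value, from_time, to_time):
    """Insert one caption with `stored_value` as its timestamp text and
    count how many rows a BETWEEN query over [from_time, to_time] finds."""
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.execute("CREATE TABLE captions (timestamp TEXT, caption TEXT)")
        conn.execute(
            "INSERT INTO captions (timestamp, caption) VALUES (?, ?)",
            (stored_value, "a person at the gate"),
        )
        row = conn.execute(
            "SELECT COUNT(*) FROM captions WHERE timestamp BETWEEN ? AND ?",
            (from_time, to_time),
        ).fetchone()
        return row[0]


if __name__ == "__main__":
    query = ("2024-01-25T10:00:00", "2024-01-25T11:00:00")
    print(count_in_range("2024-01-25 10:30:00", *query))  # 0: formats differ
    print(count_in_range("2024-01-25T10:30:00", *query))  # 1: formats match
```

Writing timestamps with `datetime.isoformat()` keeps both sides of the comparison in the same format.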
Step 2: Connect to Our Home Assistant
We have an existing home automation setup with an MCP (Model Context Protocol) server. Our AI assistant Nexus already uses it for things like controlling lights and checking weather.
We added a new tool to read from the caption database:
@mcp.tool()
def get_cctv_captions(from_time: str, to_time: str) -> list[dict]:
    """
    Retrieve CCTV captions between two timestamps.

    Args:
        from_time: Start time in ISO format (e.g., "2024-01-25T10:00:00")
        to_time: End time in ISO format (e.g., "2024-01-25T11:00:00")

    Returns:
        List of caption records with timestamp and description
    """
    conn = sqlite3.connect('netra.db')
    try:
        cursor = conn.execute(
            "SELECT timestamp, caption FROM captions "
            "WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
            (from_time, to_time)
        )
        return [{"timestamp": row[0], "caption": row[1]} for row in cursor.fetchall()]
    finally:
        conn.close()  # release the connection after every query
Now Nexus Can Answer:
- "What happened in the last 10 minutes?"
- "Did anyone come to the door while I was out?"
- "Summarize last night's activity"
Passive Intelligence: ✅ Achieved
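For a question like "What happened in the last 10 minutes?", the agent has to turn a relative phrase into the two ISO strings the tool expects. How Nexus does this internally isn't shown here; the following is only a sketch, with a hypothetical helper named `last_minutes_window`:

```python
from datetime import datetime, timedelta


def last_minutes_window(minutes, now=None):
    """Return (from_time, to_time) ISO strings covering the last `minutes`."""
    now = now or datetime.now()
    start = now - timedelta(minutes=minutes)
    return (
        start.isoformat(timespec="seconds"),
        now.isoformat(timespec="seconds"),
    )


if __name__ == "__main__":
    fixed_now = datetime(2024, 1, 25, 10, 0, 0)
    print(last_minutes_window(10, now=fixed_now))
    # ('2024-01-25T09:50:00', '2024-01-25T10:00:00')
```

The two strings can be passed straight to `get_cctv_captions(from_time, to_time)`.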
🚨 Problem #5: From Passive to Active
Querying the past is cool, but what about real-time alerts?
Active Intelligence v1: Person Detection Pipeline
┌─────────────────────────────────────────────────────────────┐
│                     ALERT DECISION FLOW                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Frame ──▶ VLM ──▶ "PERSON_DETECTED"?                      │
│                             │                               │
│                       ┌─────┴─────┐                         │
│                       │           │                         │
│                      YES          NO                        │
│                       │           │                         │
│                       ▼           ▼                         │
│                   counter++  counter = 0                    │
│                       │                                     │
│                       ▼                                     │
│                 counter >= 6?                               │
│                  (1 minute)                                 │
│                       │                                     │
│                 ┌─────┴─────┐                               │
│                 │           │                               │
│                YES          NO                              │
│                 │           │                               │
│                 ▼           ▼                               │
│             🚨 ALERT!   Keep watching                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
import re
import requests

PERSON_DETECTION_THRESHOLD = 6  # 6 frames × 10 sec = 1 minute
consecutive_person_frames = 0

# Simple keyword detection (VLM usually says "person", "man", "woman", etc.)
# Word boundaries stop "many" or "command" from matching "man"
PERSON_PATTERN = re.compile(r"\b(person|man|woman|someone|people|human)\b", re.IGNORECASE)

def process_frame_for_alerts(caption):
    global consecutive_person_frames
    if PERSON_PATTERN.search(caption):
        consecutive_person_frames += 1
        if consecutive_person_frames >= PERSON_DETECTION_THRESHOLD:
            send_alert("🚨 Person detected for over 1 minute!")
            consecutive_person_frames = 0  # Reset after alert
    else:
        consecutive_person_frames = 0

def send_alert(message):
    # Integration with your notification system (Telegram, WhatsApp, etc.)
    # NOTIFICATION_WEBHOOK is your own webhook URL, defined elsewhere
    requests.post(NOTIFICATION_WEBHOOK, json={"text": message}, timeout=10)
Why 1 minute?
| Threshold | Experience |
|---|---|
| 20 sec | Too many false alarms (delivery guy walking past) |
| 1 min | Catches loiterers, ignores passersby |
| 2 min | Too late for security purposes |
We haven't finalized 6 as the threshold; it's simply where we landed after a bit of hyperparameter tweaking.
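The counter logic can also be packaged without a global variable, which makes it easier to test and to run one tracker per camera. A sketch, not our deployed code: the class name `LoiterTracker` is made up for illustration, and keywords are matched on word boundaries so that "many" does not count as "man":

```python
import re

PERSON_PATTERN = re.compile(
    r"\b(person|man|woman|someone|people|human)\b", re.IGNORECASE
)


class LoiterTracker:
    """Counts consecutive captions that mention a person."""

    def __init__(self, threshold=6):
        self.threshold = threshold
        self.consecutive = 0

    def update(self, caption):
        """Feed one caption; return True when an alert should fire."""
        if PERSON_PATTERN.search(caption):
            self.consecutive += 1
            if self.consecutive >= self.threshold:
                self.consecutive = 0  # reset after alerting
                return True
        else:
            self.consecutive = 0  # any person-free frame breaks the streak
        return False


if __name__ == "__main__":
    tracker = LoiterTracker(threshold=6)
    results = [tracker.update("A man standing near the gate") for _ in range(6)]
    print(results)  # [False, False, False, False, False, True]
```

Plurals such as "men" or "women" are not covered by this pattern, so the keyword list would need extending to match how the model actually phrases its captions.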
Active Intelligence v2: The Morning Briefing ☀️
Every day at sunrise + 30 minutes (~6:45 AM in Hyderabad), Nexus automatically:
- Queries all captions from the previous night (10 PM to 6 AM)
- Summarizes any unusual activity
- Sends a digest to our family WhatsApp group
from datetime import datetime, timedelta
from suntime import Sun

def get_sunrise_time():
    sun = Sun(17.3850, 78.4867)  # Hyderabad latitude, longitude
    return sun.get_sunrise_time()

def get_briefing_time():
    # Sunrise + 30 minutes; a scheduler (not shown) runs morning_briefing() at this time
    return get_sunrise_time() + timedelta(minutes=30)

def morning_briefing():
    # Get last night's captions (10 PM yesterday to 6 AM today)
    now = datetime.now()
    yesterday_night = (now - timedelta(days=1)).replace(hour=22, minute=0, second=0, microsecond=0)
    this_morning = now.replace(hour=6, minute=0, second=0, microsecond=0)
    captions = get_cctv_captions(
        from_time=yesterday_night.isoformat(),
        to_time=this_morning.isoformat()
    )
    # Let Nexus summarize
    summary = nexus.summarize_night_activity(captions)
    # Send to family group along with weather & traffic
    send_family_update(
        cctv_summary=summary,
        weather=get_weather(),
        traffic=get_traffic_status()
    )
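The window boundaries are easiest to get right by starting from midnight, so stray seconds or microseconds from `now()` cannot leak into the query. A sketch of a helper that pins both ends exactly (the name `night_window` is hypothetical):

```python
from datetime import datetime, timedelta


def night_window(now=None):
    """Return (start, end) for last night: 10 PM yesterday to 6 AM today."""
    now = now or datetime.now()
    today_midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = today_midnight - timedelta(hours=2)  # 22:00 the previous day
    end = today_midnight + timedelta(hours=6)    # 06:00 today
    return start, end


if __name__ == "__main__":
    start, end = night_window(datetime(2026, 1, 25, 7, 15, 42))
    print(start.isoformat(), end.isoformat())
    # 2026-01-24T22:00:00 2026-01-25T06:00:00
```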
🛠️ The Complete Stack
| Component | Technology | Why |
|---|---|---|
| CCTV Camera | CP Plus | Already had it |
| Streaming | RTSP | Universal protocol |
| Frame Capture | OpenCV + Python | Battle-tested |
| Vision Model | LFM2.5-VL-1.6B (Q4_K_M) | Small, fast, local |
| Inference Server | LM Studio | Easy management |
| Database | SQLite | Simple, zero-config |
| Home Assistant | Nexus | Our existing home agent |
| Notifications | WhatsApp & Telegram | Where we already are |
Total cost: $0 (excluding electricity for the home server we already had)
🔮 What's Next
- Faster inference: exploring `llama.cpp` to get down to 1-3 seconds per frame
- Multi-camera support: we have 4 cameras, only monitoring 1 right now
- Better anomaly detection: train a classifier on "normal" vs "unusual" activity
- Historical trends: "Show me activity patterns for the past month"
- Voice queries: "Hey Nexus, show me the last person who came to the door"
💬 Final Thoughts
What started as a random YouTube rabbit hole turned into a genuinely useful home security upgrade. The best part? Everything runs locally. No cloud subscriptions, no privacy concerns, no monthly fees.
Is it perfect? No. Is it production-ready? Probably not. But it works, it's ours, and it cost us nothing but a weekend.
Sometimes the best projects come from asking "wait, what if we just..."
Have questions? Reach out to us:
Kartheek Akella - Twitter
Kausheek Akella - Email
Netra (नेत्र): Sanskrit for "eye" 👁️
Did this on Republic Day weekend - 24 Jan, 2026.