👁️ Netra: Building an AI Agent That Watches Your Home While You Sleep, for $0

25 Jan, 2026

What if your CCTV could think?

TL;DR: We built an AI-powered surveillance system using a local Vision-Language Model that continuously monitors our CCTV feed, stores intelligent captions, and sends us notifications when something unusual happens — all running on a potato home server with 8GB RAM.

🌱 The Spark

It started with a YouTube video.

My brother and I were watching a demo of local video-captioning using LFM2.5-VL-1.6B — a lightweight Vision-Language Model. I'd played around with LFM2.5-1.2B before for other open-source experiments, so this felt like familiar territory.

"What if we point this at our CCTV?" — one of us said.

And just like that, a weekend project was born.

🏗️ The Architecture (Spoiler: It's Beautifully Simple)

Here's what we ended up building:

┌─────────────────────────────────────────────────────────────────────────────┐
│                              NETRA SYSTEM                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│    ┌──────────┐     RTSP      ┌──────────────┐     Frame     ┌──────────┐  │
│    │   CCTV   │──────────────▶│  Frame       │──────────────▶│  LM      │  │
│    │  Camera  │   (24 FPS)    │  Sampler     │  (1/10 sec)   │  Studio  │  │
│    └──────────┘               │  (Python)    │               │  (VLM)   │  │
│                               └──────────────┘               └────┬─────┘  │
│                                                                   │        │
│                                                              Captions      │
│                                                                   │        │
│                                                                   ▼        │
│    ┌──────────┐               ┌──────────────┐               ┌──────────┐  │
│    │  Nexus   │◀──── MCP ────▶│   SQLite     │◀──────────────│ Caption  │  │
│    │ (Agent)  │               │   Database   │               │  Store   │  │
│    └────┬─────┘               └──────────────┘               └──────────┘  │
│         │                                                                   │
│         │  Queries & Notifications                                         │
│         ▼                                                                   │
│    ┌─────────────────────────────────────────────────────────────────────┐ │
│    │                         Family Group Chat                           │ │
│    │  • Real-time alerts ("Person detected for 1 minute")                │ │
│    │  • Morning summaries ("Here's what happened last night...")         │ │
│    └─────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

🚧 Problem #1: Getting the Feed

The Bad Idea

Our first approach was... not great. We thought we'd:

Manually download video exports from the GCMOB app
Run the VLM on those static files
Get captions

Reality check: The GCMOB app has the worst UX imaginable. To download 1 hour of footage, you literally have to scroll back in history and hold the record button... for 1 hour. 🤦

Also, LFM2.5-VL-1.6B is a Vision-Language model — it takes images and text, not native video files.

The Good Idea: RTSP to the Rescue

After some googling and tinkering, we discovered our CCTV supports RTSP (Real Time Streaming Protocol).

# The magic URL that changed everything
RTSP_URL = "rtsp://admin:yourpassword@192.168.1.XXX:554/stream1"
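One practical snag with URLs like this is a password containing characters such as `@`, `:` or `/`, which get parsed as URL delimiters. A minimal sketch of a helper that percent-encodes the credentials is shown here. The function name `build_rtsp_url`, the example IP address and the example password are hypothetical, and whether your camera or client accepts percent-encoded credentials is an assumption worth verifying on your own setup.

```python
from urllib.parse import quote


def build_rtsp_url(user, password, host, port=554, path="stream1"):
    """Assemble an RTSP URL, percent-encoding the credentials."""
    # safe="" makes quote() escape '/', '@' and ':' as well, since these
    # would otherwise be read as URL delimiters.
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"rtsp://{creds}@{host}:{port}/{path.lstrip('/')}"


# Hypothetical host and password, for illustration only
print(build_rtsp_url("admin", "p@ss:word/1", "192.168.1.50"))
# rtsp://admin:p%40ss%3Aword%2F1@192.168.1.50:554/stream1
```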

We enabled the right settings on the camera, wrestled with some network configs, and then...

Victory #1: Live feed on our home server! 🎉

🧠 Problem #2: Running the Vision Model

Step 1: The Quick & Dirty Way

Initially, I threw together a Python script using transformers:

from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained("liquid-ai/LFM2.5-VL-1.6B")
processor = AutoProcessor.from_pretrained("liquid-ai/LFM2.5-VL-1.6B")

def caption_frame(image):
    inputs = processor(images=image, text="Describe what you see:", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    # generate() returns prompt + completion; decode only the newly generated tokens
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)

It worked, but managing the model in Python felt clunky.

Step 2: LM Studio for the Win

We already had LM Studio running on our home server for other experiments. Why not use its inference server?

Downloaded the Q4_K_M quantized version (smaller, faster, still smart enough) and pointed our code at it:

import base64
import requests

def caption_frame_via_lmstudio(image_bytes):
    """Send a JPEG-encoded frame to LM Studio's OpenAI-compatible API and return the caption"""
    base64_image = base64.b64encode(image_bytes).decode('utf-8')

    response = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "liquid-ai/LFM2.5-VL-1.6B",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Describe what you see in this image concisely."},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                    ]
                }
            ],
            "max_tokens": 150
        },
        timeout=30  # don't let a stalled inference call hang the capture loop
    )
    # Surface HTTP errors here instead of failing later on a missing "choices" key
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
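To make the request shape easier to test without a running server, the payload construction and response parsing can be pulled into pure functions. This is only a sketch: `build_caption_payload` and `extract_caption` are hypothetical names, not part of our actual code, and the defensive parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import base64


def build_caption_payload(image_bytes,
                          model="liquid-ai/LFM2.5-VL-1.6B",
                          prompt="Describe what you see in this image concisely.",
                          max_tokens=150):
    """Build the chat-completions JSON body for one JPEG frame."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": max_tokens,
    }


def extract_caption(response_json):
    """Return the caption text, or None if the response has no usable choice."""
    choices = response_json.get("choices") or []
    if not choices:
        return None
    content = choices[0].get("message", {}).get("content")
    if isinstance(content, str) and content.strip():
        return content.strip()
    return None
```

Returning `None` on a malformed response lets the capture loop skip a frame rather than crash on a `KeyError`.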

[Screenshot: LM Studio server running the LFM2.5-VL-1.6B model showing the inference stats]

⏱️ Problem #3: Our Potato Server Can't Keep Up

The Math Problem

CCTV runs at 24 FPS
Our server has 8GB RAM
VLM inference takes ~1-2 seconds per frame

You see where this is going. We immediately got a backlog of 80+ seconds, and it kept growing. 📈

Frame Queue:  [████████████████████████████████░░░░░░░░] 80+ frames behind
Server RAM:   [████████████████████████████████████████] 98% 💀
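The arithmetic behind that backlog can be sketched in a few lines. The 1.5-second inference time is an assumed midpoint of our ~1-2 second measurement, and the function names are hypothetical; the model also ignores memory pressure, so treat it as a back-of-the-envelope estimate.

```python
def backlog_frames(elapsed_s, fps=24, inference_s=1.5):
    """Frames still waiting after elapsed_s seconds if every frame is queued."""
    arrived = fps * elapsed_s            # frames delivered by the camera
    processed = elapsed_s / inference_s  # frames the VLM had time to caption
    return max(0.0, arrived - processed)


def lag_seconds(elapsed_s, fps=24, inference_s=1.5):
    """Amount of queued video, in seconds of footage."""
    return backlog_frames(elapsed_s, fps, inference_s) / fps


print(backlog_frames(60))         # 1400.0 frames queued after one minute
print(round(lag_seconds(60), 1))  # 58.3 seconds of footage waiting
```

In other words, the queue falls behind by almost one second for every second of wall-clock time, which is why processing every frame was never going to work.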

The Solution: Strategic Sampling

We don't need EVERY frame. In home surveillance, things don't change that fast.

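import cv2
import time

SAMPLE_INTERVAL = 10  # seconds - our sweet spot
RECONNECT_DELAY = 5   # seconds to wait before reopening a dropped stream (assumed value)

def process_rtsp_stream(rtsp_url):
    cap = cv2.VideoCapture(rtsp_url)
    last_capture_time = 0

    while True:
        # Keep reading frames between samples so OpenCV's buffer stays drained
        ret, frame = cap.read()
        if not ret:
            # The stream dropped: back off and reconnect instead of spinning on a dead capture
            cap.release()
            time.sleep(RECONNECT_DELAY)
            cap = cv2.VideoCapture(rtsp_url)
            continue

        current_time = time.time()

        # Only process 1 frame every 10 seconds
        if current_time - last_capture_time >= SAMPLE_INTERVAL:
            last_capture_time = current_time

            # Convert frame to JPEG bytes and send to VLM
            ok, buffer = cv2.imencode('.jpg', frame)
            if not ok:
                continue
            caption = caption_frame_via_lmstudio(buffer.tobytes())

            store_caption(timestamp=current_time, caption=caption)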

Interval	Backlog	CPU Usage	Our Verdict
1 sec	💀 Growing	100%	Impossible
5 sec	⚠️ Slight	85%	Risky
10 sec	✅ Stable	60%	Sweet spot
30 sec	✅ Minimal	30%	Too slow

⚠️ Trade-off acknowledged: 10 seconds is a long time. Someone could walk through your frame and leave before the next sample. We'll revisit this with llama.cpp for faster inference.
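That trade-off can be put in rough numbers. Assuming a person is continuously visible for some number of seconds and the sampling phase is effectively random relative to when they walk in, the chance that at least one sampled frame contains them is simply the ratio of the two durations, capped at 1. The function name is hypothetical and this is a simplified model, not something we measured.

```python
def capture_probability(visible_s, interval_s=10):
    """Chance that a sample lands while someone is in frame.

    Assumes the person is continuously visible for visible_s seconds and
    that samples are taken every interval_s seconds with a random phase.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return min(1.0, max(0.0, visible_s) / interval_s)


for seconds in (3, 5, 10):
    print(seconds, capture_probability(seconds))
# 3 0.3
# 5 0.5
# 10 1.0
```

So someone who crosses the frame in 3 seconds is missed about 70% of the time at a 10-second interval, while anyone lingering for 10 seconds or more is always sampled at least once.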

💾 Problem #4: Making the Data Useful

Captions every 10 seconds = 8,640 captions per day.

Nobody's reading that.

Step 1: Store Everything

We went with SQLite because:

Dead simple to set up
Perfect for single-user, local-first apps
At ~100 bytes per caption, we could store years of data in a few GB
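The storage claim is easy to check. This sketch uses the ~100 bytes per caption figure from above and ignores SQLite's own overhead (row IDs, timestamps, page structure), so real usage will be somewhat higher. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def captions_per_day(interval_s=10):
    """Number of captions produced per day at one caption per interval."""
    return SECONDS_PER_DAY // interval_s


def storage_mb_per_year(interval_s=10, bytes_per_caption=100):
    """Approximate caption text volume per year, in megabytes."""
    total_bytes = captions_per_day(interval_s) * bytes_per_caption * 365
    return total_bytes / 1_000_000


print(captions_per_day())     # 8640
print(storage_mb_per_year())  # 315.36
```

At roughly 315 MB of caption text per year, a few GB covers the better part of a decade.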

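import sqlite3
from datetime import datetime

def init_db():
    """Create the captions table if it doesn't exist; call once at startup."""
    conn = sqlite3.connect('netra.db')
    conn.execute('''
        CREATE TABLE IF NOT EXISTS captions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
            caption TEXT NOT NULL
        )
    ''')
    conn.commit()
    return conn

def store_caption(timestamp, caption):
    # Store local time as an ISO 8601 string ("2026-01-25T10:00:00") so that
    # string comparison against ISO-formatted query bounds is chronological.
    iso_timestamp = datetime.fromtimestamp(timestamp).isoformat(timespec='seconds')
    conn = sqlite3.connect('netra.db')
    try:
        conn.execute(
            "INSERT INTO captions (timestamp, caption) VALUES (?, ?)",
            (iso_timestamp, caption)
        )
        conn.commit()
    finally:
        conn.close()  # one short-lived connection per insert; always release it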

💡 Yes, we built CRUD without the U (Update). When would you ever need to edit a historical caption?

Step 2: Connect to Our Home Assistant

We have an existing home automation setup with an MCP (Model Context Protocol) server. Our AI assistant Nexus already uses it for things like controlling lights and checking weather.

We added a new tool to read from the caption database:

@mcp.tool()
def get_cctv_captions(from_time: str, to_time: str) -> list[dict]:
    """
    Retrieve CCTV captions between two timestamps.

    Args:
        from_time: Start time in ISO format (e.g., "2024-01-25T10:00:00")
        to_time: End time in ISO format (e.g., "2024-01-25T11:00:00")

    Returns:
        List of caption records with timestamp and description, oldest first
    """
    conn = sqlite3.connect('netra.db')
    try:
        cursor = conn.execute(
            "SELECT timestamp, caption FROM captions "
            "WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
            (from_time, to_time)
        )
        return [{"timestamp": row[0], "caption": row[1]} for row in cursor.fetchall()]
    finally:
        conn.close()
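A range query like this compares timestamps as strings, so it only behaves chronologically if the stored values and the query bounds share one format. The sketch below runs the same idea end to end against an in-memory database, storing ISO 8601 strings and adding an index on the timestamp column. The helper names, the index and the sample captions are illustrative assumptions, not our production schema.

```python
import sqlite3
from datetime import datetime


def make_db():
    """In-memory stand-in for netra.db, with an index for range queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE captions ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "timestamp TEXT NOT NULL, "
        "caption TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_captions_timestamp ON captions (timestamp)")
    return conn


def add_caption(conn, when, caption):
    # ISO 8601 strings sort lexicographically in chronological order
    conn.execute(
        "INSERT INTO captions (timestamp, caption) VALUES (?, ?)",
        (when.isoformat(timespec="seconds"), caption),
    )


def captions_between(conn, from_time, to_time):
    rows = conn.execute(
        "SELECT timestamp, caption FROM captions "
        "WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
        (from_time, to_time),
    )
    return [{"timestamp": ts, "caption": text} for ts, text in rows]


conn = make_db()
add_caption(conn, datetime(2026, 1, 25, 9, 59, 50), "Empty driveway.")
add_caption(conn, datetime(2026, 1, 25, 10, 30, 0), "A person stands at the gate.")
add_caption(conn, datetime(2026, 1, 25, 11, 0, 10), "A car pulls in.")

result = captions_between(conn, "2026-01-25T10:00:00", "2026-01-25T11:00:00")
print(result)
# [{'timestamp': '2026-01-25T10:30:00', 'caption': 'A person stands at the gate.'}]
```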

Now Nexus Can Answer:

"What happened in the last 10 minutes?"
"Did anyone come to the door while I was out?"
"Summarize last night's activity"

Passive Intelligence: ✅ Achieved
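A question like "What happened in the last 10 minutes?" ultimately has to become a pair of timestamps for the captions tool. How Nexus does that internally is not shown here; the helper below is a hypothetical sketch of the conversion, with `now` injectable so it can be tested against a fixed clock.

```python
from datetime import datetime, timedelta


def last_minutes_window(minutes, now=None):
    """Return (from_time, to_time) ISO strings covering the last N minutes."""
    now = now or datetime.now()
    start = now - timedelta(minutes=minutes)
    return (start.isoformat(timespec="seconds"),
            now.isoformat(timespec="seconds"))


print(last_minutes_window(10, now=datetime(2026, 1, 25, 10, 0, 0)))
# ('2026-01-25T09:50:00', '2026-01-25T10:00:00')
```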

🚨 Problem #5: From Passive to Active

Querying the past is cool, but what about real-time alerts?

Active Intelligence v1: Person Detection Pipeline

┌─────────────────────────────────────────────────────────────┐
│                    ALERT DECISION FLOW                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│    Frame ──▶ VLM ──▶ "PERSON_DETECTED"?                    │
│                           │                                 │
│                     ┌─────┴─────┐                          │
│                     │           │                          │
│                    YES         NO                          │
│                     │           │                          │
│                     ▼           ▼                          │
│              counter++     counter = 0                     │
│                     │                                      │
│                     ▼                                      │
│              counter >= 6?                                 │
│              (1 minute)                                    │
│                     │                                      │
│               ┌─────┴─────┐                                │
│               │           │                                │
│              YES         NO                                │
│               │           │                                │
│               ▼           ▼                                │
│         🚨 ALERT!     Keep watching                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

import re
import requests

PERSON_DETECTION_THRESHOLD = 6  # 6 frames × 10 sec = 1 minute

# Match whole words only, so "man" does not fire on "many" or "manhole"
PERSON_PATTERN = re.compile(r"\b(persons?|man|woman|someone|people|humans?)\b")

consecutive_person_frames = 0

def process_frame_for_alerts(caption):
    global consecutive_person_frames

    # Simple keyword detection (VLM usually says "person", "man", "woman", etc.)
    if PERSON_PATTERN.search(caption.lower()):
        consecutive_person_frames += 1

        if consecutive_person_frames >= PERSON_DETECTION_THRESHOLD:
            send_alert("🚨 Person detected for over 1 minute!")
            consecutive_person_frames = 0  # Reset after alert
    else:
        consecutive_person_frames = 0

def send_alert(message):
    # Integration with your notification system (Telegram, WhatsApp, etc.)
    # NOTIFICATION_WEBHOOK is that integration's URL, configured elsewhere
    requests.post(NOTIFICATION_WEBHOOK, json={"text": message}, timeout=10)

Why 1 minute?

Threshold	Experience
20 sec	Too many false alarms (delivery guy walking past)
1 min	Catches loiterers, ignores passersby
2 min	Too late for security purposes

We haven't finalized the threshold of 6; it's simply what we settled on after a bit of hyperparameter tweaking.
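One way to compare thresholds without waiting for real visitors is to replay a recorded sequence of per-frame detections through the same counter logic. The sketch below is hypothetical (we tuned by hand), and the two detection sequences are made-up examples of a passerby and a loiterer, sampled every 10 seconds.

```python
def count_alerts(detections, threshold=6):
    """Replay per-frame person detections and count how many alerts fire.

    Mirrors the counter logic: a streak of `threshold` consecutive positive
    frames raises one alert and resets the streak.
    """
    alerts = 0
    streak = 0
    for seen in detections:
        if seen:
            streak += 1
            if streak >= threshold:
                alerts += 1
                streak = 0
        else:
            streak = 0
    return alerts


passerby = [True] * 2 + [False] * 10   # in frame for ~20 seconds
loiterer = [True] * 7                  # in frame for ~70 seconds

for threshold in (2, 6, 12):
    print(threshold, count_alerts(passerby, threshold), count_alerts(loiterer, threshold))
# 2 1 3
# 6 0 1
# 12 0 0
```

With these made-up sequences, a threshold of 2 alerts on the passerby, 12 misses the loiterer entirely, and 6 separates the two cases.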

Active Intelligence v2: The Morning Briefing ☀️

Every day at sunrise + 30 minutes (~6:45 AM in Hyderabad), Nexus automatically:

Queries all captions from the previous night (10 PM to 6 AM)
Summarizes any unusual activity
Sends a digest to our family WhatsApp group

from datetime import datetime, timedelta
from suntime import Sun

def get_sunrise_time():
    sun = Sun(17.3850, 78.4867)  # Hyderabad latitude, longitude
    # suntime returns a timezone-aware datetime (UTC by default);
    # convert it to the server's local time zone
    return sun.get_sunrise_time().astimezone()

def get_briefing_time():
    # The scheduler fires morning_briefing() at sunrise + 30 minutes
    return get_sunrise_time() + timedelta(minutes=30)

def morning_briefing():
    # Last night's window: 10 PM yesterday to 6 AM today, on exact hour boundaries
    this_morning = datetime.now().replace(hour=6, minute=0, second=0, microsecond=0)
    yesterday_night = (this_morning - timedelta(days=1)).replace(hour=22)

    captions = get_cctv_captions(
        from_time=yesterday_night.isoformat(),
        to_time=this_morning.isoformat()
    )

    # Let Nexus summarize
    summary = nexus.summarize_night_activity(captions)

    # Send to family group along with weather & traffic
    send_family_update(
        cctv_summary=summary,
        weather=get_weather(),
        traffic=get_traffic_status()
    )
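The 10 PM to 6 AM window is the part of the briefing most worth testing in isolation, since off-by-a-day mistakes are easy to make around midnight. A minimal sketch as a pure function follows; the name `night_window` and its parameters are hypothetical, and it assumes the briefing runs after 6 AM on the morning in question.

```python
from datetime import datetime, timedelta


def night_window(now, start_hour=22, end_hour=6):
    """Return (start, end) for the night that ended on the morning of `now`."""
    end = now.replace(hour=end_hour, minute=0, second=0, microsecond=0)
    start = (end - timedelta(days=1)).replace(hour=start_hour)
    return start, end


start, end = night_window(datetime(2026, 1, 25, 6, 45, 30))
print(start.isoformat(), end.isoformat())
# 2026-01-24T22:00:00 2026-01-25T06:00:00
```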

🛠️ The Complete Stack

Component	Technology	Why
CCTV Camera	CP Plus	Already had it
Streaming	RTSP	Universal protocol
Frame Capture	OpenCV + Python	Battle-tested
Vision Model	LFM2.5-VL-1.6B (Q4_K_M)	Small, fast, local
Inference Server	LM Studio	Easy management
Database	SQLite	Simple, zero-config
Home Assistant	Nexus	Our existing home agent
Notifications	WhatsApp & Telegram	Where we already are

Total cost: $0 (excluding electricity for the home server we already had)

🔮 What's Next

Faster inference — Exploring llama.cpp so we can sample a frame every 1-3 seconds instead of every 10
Multi-camera support — We have 4 cameras, only monitoring 1 right now
Better anomaly detection — Train a classifier on "normal" vs "unusual" activity
Historical trends — "Show me activity patterns for the past month"
Voice queries — "Hey Nexus, show me the last person who came to the door"

🎬 Final Thoughts

What started as a random YouTube rabbit hole turned into a genuinely useful home security upgrade. The best part? Everything runs locally. No cloud subscriptions, no privacy concerns, no monthly fees.

Is it perfect? No. Is it production-ready? Probably not. But it works, it's ours, and it cost us nothing but a weekend.

Sometimes the best projects come from asking "wait, what if we just..."

Have questions? Reach out to us:

Kartheek Akella - Twitter

Kausheek Akella - Email

Netra (नेत्र) — Sanskrit for "eye" 👁️

Did this on Republic Day weekend - 24 Jan, 2026.