Random things I do

👁️ Netra: Building an AI Agent That Watches Your Home While You Sleep, for $0

What if your CCTV could think?

TL;DR: We built an AI-powered surveillance system using a local Vision-Language Model that continuously monitors our CCTV feed, stores intelligent captions, and sends us notifications when something unusual happens, all running on a potato home server with 8GB RAM.


🌱 The Spark

It started with a YouTube video.

My brother and I were watching a demo of local video-captioning using LFM2.5-VL-1.6B, a lightweight Vision-Language Model. I'd played around with LFM2.5-1.2B before for other open-source experiments, so this felt like familiar territory.

"What if we point this at our CCTV?" one of us said.

And just like that, a weekend project was born.


🏗️ The Architecture (Spoiler: It's Beautifully Simple)

Here's what we ended up building:

┌─────────────────────────────────────────────────────────────────────────────┐
│                              NETRA SYSTEM                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│    ┌──────────┐     RTSP      ┌──────────────┐     Frame     ┌──────────┐   │
│    │   CCTV   │──────────────▶│  Frame       │──────────────▶│  LM      │   │
│    │  Camera  │   (24 FPS)    │  Sampler     │  (1/10 sec)   │  Studio  │   │
│    └──────────┘               │  (Python)    │               │  (VLM)   │   │
│                               └──────────────┘               └────┬─────┘   │
│                                                                   │         │
│                                                               Captions      │
│                                                                   │         │
│                                                                   ▼         │
│    ┌──────────┐               ┌──────────────┐               ┌──────────┐   │
│    │  Nexus   │◀──── MCP ────▶│   SQLite     │◀──────────────│ Caption  │   │
│    │ (Agent)  │               │   Database   │               │  Store   │   │
│    └────┬─────┘               └──────────────┘               └──────────┘   │
│         │                                                                   │
│         │  Queries & Notifications                                          │
│         ▼                                                                   │
│    ┌─────────────────────────────────────────────────────────────────────┐  │
│    │                         Family Group Chat                           │  │
│    │  • Real-time alerts ("Person detected for 1 minute")                │  │
│    │  • Morning summaries ("Here's what happened last night...")         │  │
│    └─────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

🚧 Problem #1: Getting the Feed

The Bad Idea

Our first approach was... not great. We thought we'd:

  1. Manually download video exports from the GCMOB app
  2. Run the VLM on those static files
  3. Get captions

Reality check: The GCMOB app has the worst UX imaginable. To download 1 hour of footage, you literally have to scroll back in history and hold the record button... for 1 hour. 🤦

Also, LFM2.5-VL-1.6B is a Vision-Language model: it takes images and text, not native video files.

The Good Idea: RTSP to the Rescue

After some googling and tinkering, we discovered our CCTV supports RTSP (Real Time Streaming Protocol).

# The magic URL that changed everything
RTSP_URL = "rtsp://admin:yourpassword@192.168.1.XXX:554/stream1"
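
One gotcha worth sketching: if the camera password contains characters like `@` or `/`, a hand-written URL like the one above breaks unless the credentials are percent-encoded. The helper name `build_rtsp_url` and the example password are made up for illustration, and the port and `stream1` path simply mirror the URL above (your camera may expose a different path):

```python
from urllib.parse import quote

def build_rtsp_url(user: str, password: str, host: str,
                   port: int = 554, path: str = "stream1") -> str:
    """Build an RTSP URL, percent-encoding the credentials."""
    # safe="" makes quote() encode every reserved character, including "/" and "@"
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"rtsp://{user_enc}:{password_enc}@{host}:{port}/{path}"

print(build_rtsp_url("admin", "p@ss/word", "192.168.1.10"))
# rtsp://admin:p%40ss%2Fword@192.168.1.10:554/stream1
```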

We enabled the right settings on the camera, wrestled with some network configs, and then...

Victory #1: Live feed on our home server! 🎉


🧠 Problem #2: Running the Vision Model

Step 1: The Quick & Dirty Way

Initially, I threw together a Python script using transformers:

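from transformers import AutoModelForImageTextToText, AutoProcessor

# Vision-language checkpoints load through the image-text-to-text auto class
model = AutoModelForImageTextToText.from_pretrained("liquid-ai/LFM2.5-VL-1.6B")
processor = AutoProcessor.from_pretrained("liquid-ai/LFM2.5-VL-1.6B")

def caption_frame(image):
    # The image and the prompt travel together in one chat-style message
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe what you see:"},
    ]}]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(outputs[0], skip_special_tokens=True)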

It worked, but managing the model in Python felt clunky.

Step 2: LM Studio for the Win

We already had LM Studio running on our home server for other experiments. Why not use its inference server?

We downloaded the Q4_K_M quantized version (smaller, faster, still smart enough) and pointed our code at it:

import base64
import requests

def caption_frame_via_lmstudio(image_bytes):
    """Send frame to LM Studio's OpenAI-compatible API"""
    base64_image = base64.b64encode(image_bytes).decode('utf-8')
    
    response = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "liquid-ai/LFM2.5-VL-1.6B",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Describe what you see in this image concisely."},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                    ]
                }
            ],
            "max_tokens": 150
        }
    )
    return response.json()["choices"][0]["message"]["content"]
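
Indexing straight into `choices` raises an exception if the server returns an error body or an empty result. A defensive variant, as a sketch only: `extract_caption` is a hypothetical helper, and it assumes nothing beyond the response shape already used above.

```python
from typing import Optional

def extract_caption(payload: dict) -> Optional[str]:
    """Return the caption text from a chat-completions response, or None."""
    try:
        content = payload["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        return None  # error payloads or empty results carry no usable caption
    content = content.strip() if isinstance(content, str) else ""
    return content or None

ok = {"choices": [{"message": {"content": " A cat sits on the wall. "}}]}
print(extract_caption(ok))                # A cat sits on the wall.
print(extract_caption({"error": "busy"})) # None
```

Returning `None` lets the capture loop skip a bad frame instead of crashing in the middle of the night.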

[Screenshot: LM Studio server running the LFM2.5-VL-1.6B model showing the inference stats]


⏱️ Problem #3: Our Potato Server Can't Keep Up

The Math Problem

The camera streams 24 frames per second, and our server cannot caption frames anywhere near that fast. You see where this is going. We immediately got a backlog of 80+ seconds, and it kept growing. 📈

Frame Queue:  [████████████████████████████████░░░░░░░░] 80+ frames behind
Server RAM:   [████████████████████████████████████████] 98% 💀
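
The arithmetic behind that backlog, as a rough sketch. The 24 FPS figure comes from the architecture diagram; the 5 seconds per caption is an assumed, illustrative number, not something we measured:

```python
def backlog_after(elapsed_s: float, fps: float, seconds_per_caption: float) -> float:
    """Frames still waiting after elapsed_s seconds if every frame is queued."""
    arrived = fps * elapsed_s                    # frames the camera produced
    processed = elapsed_s / seconds_per_caption  # frames the model finished
    return max(0.0, arrived - processed)

# Assumed numbers: 24 FPS camera, 5 s of inference per frame, one minute of video
print(backlog_after(60, fps=24, seconds_per_caption=5))   # 1428.0
# Sampling one frame every 10 seconds instead: no backlog
print(backlog_after(60, fps=0.1, seconds_per_caption=5))  # 0.0
```

As long as frames arrive faster than the model can caption them, the queue grows without bound, which is why the fix has to be on the input side.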

The Solution: Strategic Sampling

We don't need EVERY frame. In home surveillance, things don't change that fast.

import cv2
import time

SAMPLE_INTERVAL = 10  # seconds - our sweet spot

def process_rtsp_stream(rtsp_url):
    cap = cv2.VideoCapture(rtsp_url)
    last_capture_time = 0

    while True:
        # Read continuously; frames between samples are simply discarded
        ret, frame = cap.read()
        if not ret:
            # Stream dropped: pause, then reconnect instead of spinning the CPU
            cap.release()
            time.sleep(1)
            cap = cv2.VideoCapture(rtsp_url)
            continue

        current_time = time.time()

        # Only process 1 frame every 10 seconds
        if current_time - last_capture_time >= SAMPLE_INTERVAL:
            last_capture_time = current_time

            # Convert frame to JPEG bytes and send to VLM
            ok, buffer = cv2.imencode('.jpg', frame)
            if not ok:
                continue
            caption = caption_frame_via_lmstudio(buffer.tobytes())

            store_caption(timestamp=current_time, caption=caption)

Interval   Backlog       CPU Usage   Our Verdict
--------   -----------   ---------   -----------
1 sec      💀 Growing     100%        Impossible
5 sec      ⚠️ Slight      85%         Risky
10 sec     ✅ Stable      60%         Sweet spot
30 sec     ✅ Minimal     30%         Too slow

⚠️ Trade-off acknowledged: 10 seconds is a long time. Someone could walk through your frame and leave before the next sample. We'll revisit this with llama.cpp for faster inference.


💾 Problem #4: Making the Data Useful

Captions every 10 seconds = 8,640 captions per day.

Nobody's reading that.

Step 1: Store Everything

We went with SQLite because it's simple and needs zero configuration:

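import sqlite3
from contextlib import closing
from datetime import datetime

def init_db():
    conn = sqlite3.connect('netra.db')
    conn.execute('''
        CREATE TABLE IF NOT EXISTS captions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
            caption TEXT NOT NULL
        )
    ''')
    conn.commit()
    return conn

def store_caption(timestamp, caption):
    # Store local time as ISO 8601 text (e.g. "2024-01-25T10:00:00") so that
    # string comparison in range queries lines up with ISO-formatted bounds
    iso_timestamp = datetime.fromtimestamp(timestamp).isoformat(timespec="seconds")
    # closing() releases the connection once the insert is committed
    with closing(sqlite3.connect('netra.db')) as conn:
        conn.execute(
            "INSERT INTO captions (timestamp, caption) VALUES (?, ?)",
            (iso_timestamp, caption)
        )
        conn.commit()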

💡 Yes, we built CRUD without the U (Update). When would you ever need to edit a historical caption?
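
With thousands of rows a day, time-range lookups are the query that matters. A small self-contained sketch, assuming timestamps are stored as ISO 8601 text in one consistent format (such strings sort in chronological order). The index name and the sample captions are made up for the demo, and it runs against an in-memory database rather than `netra.db`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo
conn.execute("CREATE TABLE captions (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "timestamp TEXT NOT NULL, caption TEXT NOT NULL)")
# Hypothetical index name; an index on timestamp helps BETWEEN range queries
conn.execute("CREATE INDEX idx_captions_timestamp ON captions (timestamp)")

rows = [
    ("2026-01-24T21:59:50", "Empty driveway."),
    ("2026-01-24T22:00:00", "A cat walks past the gate."),
    ("2026-01-25T05:59:50", "A person delivers newspapers."),
    ("2026-01-25T06:00:10", "A car pulls out."),
]
conn.executemany("INSERT INTO captions (timestamp, caption) VALUES (?, ?)", rows)

night = conn.execute(
    "SELECT caption FROM captions WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
    ("2026-01-24T22:00:00", "2026-01-25T06:00:00"),
).fetchall()
print([caption for (caption,) in night])
# ['A cat walks past the gate.', 'A person delivers newspapers.']
```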

Step 2: Connect to Our Home Assistant

We have an existing home automation setup with an MCP (Model Context Protocol) server. Our AI assistant Nexus already uses it for things like controlling lights and checking weather.

We added a new tool to read from the caption database:

@mcp.tool()
def get_cctv_captions(from_time: str, to_time: str) -> list[dict]:
    """
    Retrieve CCTV captions between two timestamps.
    
    Args:
        from_time: Start time in ISO format (e.g., "2024-01-25T10:00:00")
        to_time: End time in ISO format (e.g., "2024-01-25T11:00:00")
    
    Returns:
        List of caption records with timestamp and description
    """
    conn = sqlite3.connect('netra.db')
    cursor = conn.execute(
        "SELECT timestamp, caption FROM captions WHERE timestamp BETWEEN ? AND ?",
        (from_time, to_time)
    )
    return [{"timestamp": row[0], "caption": row[1]} for row in cursor.fetchall()]
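
Since the agent supplies the timestamps as free-form strings, it can help to validate them before they reach SQL. A minimal sketch; `parse_time_range` is a hypothetical helper, not part of our MCP server:

```python
from datetime import datetime

def parse_time_range(from_time: str, to_time: str) -> tuple[str, str]:
    """Validate two ISO timestamps and return them normalized to seconds."""
    start = datetime.fromisoformat(from_time)  # raises ValueError on bad input
    end = datetime.fromisoformat(to_time)
    if start > end:
        raise ValueError("from_time must not be later than to_time")
    return (start.isoformat(timespec="seconds"), end.isoformat(timespec="seconds"))

print(parse_time_range("2024-01-25 10:00", "2024-01-25T11:00:00"))
# ('2024-01-25T10:00:00', '2024-01-25T11:00:00')
```

Normalizing both bounds to the same `T`-separated form keeps the string comparison in `BETWEEN` predictable.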

Now Nexus Can Answer:

Passive Intelligence: ✅ Achieved


🚨 Problem #5: From Passive to Active

Querying the past is cool, but what about real-time alerts?

Active Intelligence v1: Person Detection Pipeline

┌─────────────────────────────────────────────────────────────┐
│                    ALERT DECISION FLOW                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│    Frame ──▶ VLM ──▶ "PERSON_DETECTED"?                     │
│                           │                                 │
│                     ┌─────┴─────┐                           │
│                     │           │                           │
│                    YES         NO                           │
│                     │           │                           │
│                     ▼           ▼                           │
│              counter++     counter = 0                      │
│                     │                                       │
│                     ▼                                       │
│              counter >= 6?                                  │
│              (1 minute)                                     │
│                     │                                       │
│               ┌─────┴─────┐                                 │
│               │           │                                 │
│              YES         NO                                 │
│               │           │                                 │
│               ▼           ▼                                 │
│         🚨 ALERT!     Keep watching                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

import requests

# Placeholder: set this to your own notification endpoint
NOTIFICATION_WEBHOOK = "https://example.com/your-webhook"

PERSON_DETECTION_THRESHOLD = 6  # 6 frames × 10 sec = 1 minute

consecutive_person_frames = 0

def process_frame_for_alerts(caption):
    global consecutive_person_frames

    # Simple keyword detection (VLM usually says "person", "man", "woman", etc.)
    # Note: this is substring matching, so "man" also matches inside "many"
    person_keywords = ["person", "man", "woman", "someone", "people", "human"]

    if any(keyword in caption.lower() for keyword in person_keywords):
        consecutive_person_frames += 1

        if consecutive_person_frames >= PERSON_DETECTION_THRESHOLD:
            send_alert("🚨 Person detected for over 1 minute!")
            consecutive_person_frames = 0  # Reset after alert
    else:
        consecutive_person_frames = 0

def send_alert(message):
    # Integration with your notification system (Telegram, WhatsApp, etc.)
    requests.post(NOTIFICATION_WEBHOOK, json={"text": message}, timeout=10)
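
A possible refinement of the counter above, sketched under our own assumptions: whole-word matching avoids false hits on words that merely contain "man", and a small class replaces the global variable. `LoiterDetector` and `PERSON_WORDS` are hypothetical names, and the word list adds "men" and "women" to the keywords used above:

```python
import re

# Whole-word match, so "man" does not fire on "many" or "command"
PERSON_WORDS = re.compile(
    r"\b(person|people|man|men|woman|women|someone|human)s?\b", re.IGNORECASE
)

class LoiterDetector:
    """Counts consecutive captions that mention a person."""

    def __init__(self, threshold: int = 6):
        self.threshold = threshold
        self.streak = 0

    def update(self, caption: str) -> bool:
        """Feed one caption; return True when an alert should fire."""
        if PERSON_WORDS.search(caption):
            self.streak += 1
        else:
            self.streak = 0
        if self.streak >= self.threshold:
            self.streak = 0  # reset after alerting, as in the code above
            return True
        return False

detector = LoiterDetector(threshold=3)
print([detector.update("A man stands at the gate.") for _ in range(3)])
# [False, False, True]
```

Returning a boolean instead of sending the alert directly keeps the detection logic easy to test without a webhook.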

Why 1 minute?

Threshold   Experience
---------   ----------
20 sec      Too many false alarms (delivery guy walking past)
1 min       Catches loiterers, ignores passersby
2 min       Too late for security purposes

We haven't locked in 6 as the final value; it's simply where we landed after a bit of hyperparameter tweaking.
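
To try other thresholds, the frame count follows directly from the sampling interval. A tiny sketch using the same convention as the "6 frames × 10 sec = 1 minute" comment in the alert code; `frames_for_duration` is a hypothetical helper:

```python
import math

def frames_for_duration(duration_s: float, sample_interval_s: float = 10) -> int:
    """Consecutive person-frames needed to cover duration_s seconds."""
    return max(1, math.ceil(duration_s / sample_interval_s))

for seconds in (20, 60, 120):
    print(seconds, "sec ->", frames_for_duration(seconds), "frames")
# 20 sec -> 2 frames
# 60 sec -> 6 frames
# 120 sec -> 12 frames
```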

Active Intelligence v2: The Morning Briefing ☀️

Every day at sunrise + 30 minutes (~6:45 AM in Hyderabad), Nexus automatically:

  1. Queries all captions from the previous night (10 PM to 6 AM)
  2. Summarizes any unusual activity
  3. Sends a digest to our family WhatsApp group

from datetime import datetime, timedelta
from suntime import Sun

def get_sunrise_time():
    # Coordinates for Hyderabad
    sun = Sun(17.3850, 78.4867)
    return sun.get_sunrise_time()

def morning_briefing():
    sunrise = get_sunrise_time()
    # Sunrise + 30 minutes: the time this briefing is meant to run
    briefing_time = sunrise + timedelta(minutes=30)

    # Get last night's captions (10 PM yesterday to 6 AM today),
    # zeroing seconds and microseconds so the window edges are exact
    now = datetime.now()
    yesterday_night = (now - timedelta(days=1)).replace(hour=22, minute=0, second=0, microsecond=0)
    this_morning = now.replace(hour=6, minute=0, second=0, microsecond=0)

    captions = get_cctv_captions(
        from_time=yesterday_night.isoformat(),
        to_time=this_morning.isoformat()
    )

    # Let Nexus summarize
    summary = nexus.summarize_night_activity(captions)

    # Send to family group along with weather & traffic
    send_family_update(
        cctv_summary=summary,
        weather=get_weather(),
        traffic=get_traffic_status()
    )
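
The window calculation is the part most worth testing on its own, since it depends on when the job runs. A sketch that takes the current time as a parameter; `night_window` is a hypothetical helper and the example date is arbitrary:

```python
from datetime import datetime, timedelta

def night_window(now: datetime) -> tuple[datetime, datetime]:
    """Return (yesterday 22:00, today 06:00) relative to now."""
    today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = today - timedelta(days=1) + timedelta(hours=22)
    end = today + timedelta(hours=6)
    return start, end

start, end = night_window(datetime(2026, 1, 25, 6, 45, 12))
print(start.isoformat(), end.isoformat())
# 2026-01-24T22:00:00 2026-01-25T06:00:00
```

Passing `now` in, rather than calling `datetime.now()` inside, makes the function deterministic and easy to check for any date.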

🛠️ The Complete Stack

Component          Technology                Why
----------------   -----------------------   -----------------------
CCTV Camera        CP Plus                   Already had it
Streaming          RTSP                      Universal protocol
Frame Capture      OpenCV + Python           Battle-tested
Vision Model       LFM2.5-VL-1.6B (Q4_K_M)   Small, fast, local
Inference Server   LM Studio                 Easy management
Database           SQLite                    Simple, zero-config
Home Assistant     Nexus                     Our existing home agent
Notifications      WhatsApp & Telegram       Where we already are

Total cost: $0 (excluding electricity for the home server we already had)


🔮 What's Next


🎬 Final Thoughts

What started as a random YouTube rabbit hole turned into a genuinely useful home security upgrade. The best part? Everything runs locally. No cloud subscriptions, no privacy concerns, no monthly fees.

Is it perfect? No. Is it production-ready? Probably not. But it works, it's ours, and it cost us nothing but a weekend.

Sometimes the best projects come from asking "wait, what if we just..."


Have questions? Reach out to us:

Kartheek Akella - Twitter

Kausheek Akella - Email


Netra (नेत्र): Sanskrit for "eye" 👁️

Did this on Republic Day weekend - 24 Jan, 2026.