BMS/project_plan.md
2026-03-19 11:32:17 +00:00


DCIM Platform — Project Plan

Stack: Next.js + TypeScript + shadcn/ui | Python FastAPI | PostgreSQL + TimescaleDB | MQTT (Mosquitto) | Clerk Auth

Approach: Build as a production shell from day one. Simulated sensor bots feed real data pipelines. Swap bots for real hardware when ready; nothing else changes.


How the Simulator Bots Work

Each bot is a small Python script that behaves exactly like a real physical sensor or device. It generates realistic data with natural variation and drift, then publishes it to the MQTT broker on the same topic a real sensor would use.

[Bot: Rack A01 Temp Sensor]  →  MQTT topic: dc/site1/room1/rack-A01/temperature
[Bot: UPS Unit 1]            →  MQTT topic: dc/site1/power/ups-01/status
[Bot: CRAC Unit 2]           →  MQTT topic: dc/site1/cooling/crac-02/status

The backend subscribes to those topics and stores the data. The frontend never knows the difference. When real hardware is connected, it publishes to the same topics — bots get switched off, everything else keeps working.
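
As a rough illustration of the shared-topic idea, the sketch below builds a rack-level topic and a JSON payload. The helper names and the payload fields (`value`, `unit`, `ts`) are assumptions made for this example; the plan does not define a message format. Publishing through an MQTT client library is left out.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_topic(site: str, room: str, rack: str, metric: str) -> str:
    """Build a rack-level topic such as dc/site1/room1/rack-A01/temperature."""
    return f"dc/{site}/{room}/rack-{rack}/{metric}"


def build_payload(value: float, unit: str, ts: Optional[datetime] = None) -> str:
    """Serialise one reading as JSON (field names are illustrative only)."""
    ts = ts or datetime.now(timezone.utc)
    return json.dumps({"value": round(value, 2), "unit": unit, "ts": ts.isoformat()})


topic = build_topic("site1", "room1", "A01", "temperature")
payload = build_payload(23.4, "C", datetime(2026, 1, 1, tzinfo=timezone.utc))
```

A bot and a real sensor that both emit this topic and payload shape would be indistinguishable to the backend, which is the property the plan relies on.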

Bots can also simulate events and scenarios for demo purposes:

  • Gradual temperature rise in a rack (simulating cooling failure)
  • Power load spike across a PDU
  • UPS battery degradation over time
  • Water leak alert trigger
  • Alarm escalation sequences

Phase Overview

| Phase | Name | Deliverable |
|-------|------|-------------|
| 1 | Foundation | Running skeleton: frontend, backend, auth, DB all connected |
| 2 | Data Pipeline + Bots | MQTT broker, simulator bots, data flowing into DB |
| 3 | Core Dashboard | Live overview dashboard pulling real simulated data |
| 4 | Environmental Monitoring | Temperature/humidity views, heatmaps |
| 5 | Power Monitoring | PDU, UPS, PUE tracking |
| 6 | Cooling & AI Panel | CRAC status, simulated AI optimization |
| 7 | Asset Management | Rack views, device inventory |
| 8 | Alarms & Events | Live alarm feed, acknowledgement, escalation |
| 9 | Reporting | Charts, summaries, export |
| 10 | Polish & Hardening | RBAC, audit log, multi-site, production readiness |

Phase 1 — Foundation

Goal: Every layer of the stack is running and connected. No real features yet, just a working skeleton.

Tasks

  • Initialise Next.js project with TypeScript
  • Set up shadcn/ui component library
  • Set up Python FastAPI project structure
  • Connect Clerk authentication (login, logout, protected routes)
  • Provision PostgreSQL database (local via Docker)
  • Basic API route: frontend calls backend, gets a response
  • Docker Compose file running: frontend + backend + database together
  • Placeholder layout: sidebar nav, top bar, main content area

Folder Structure

/dcim
  /frontend          ← Next.js app
    /app
    /components
    /lib
  /backend           ← FastAPI app
    /api
    /models
    /services
  /simulators        ← Sensor bots (Python scripts)
  /infra             ← Docker Compose, config files
  docker-compose.yml

End of Phase

You can log in, see a blank dashboard shell, and the backend responds to API calls. Database is running.


Phase 2 — Data Pipeline & Simulator Bots

Goal: Simulated sensor data flows continuously from bots → MQTT → backend → database. The system behaves as if real hardware is connected.

Infrastructure

  • Add Mosquitto MQTT broker to Docker Compose
  • Add TimescaleDB extension to PostgreSQL
  • Create hypertable for sensor readings (time-series optimised)

Database Schema (core tables)

sites        — site name, location, timezone
rooms        — belongs to site, physical room
racks        — belongs to room, U-height, position
devices      — belongs to rack, type, model, serial
sensors      — belongs to device or rack, sensor type
readings     — (TimescaleDB hypertable) sensor_id, timestamp, value
alarms       — sensor_id, severity, message, state, acknowledged_at

Backend Data Ingestion

  • MQTT subscriber service (Python) — listens to all sensor topics
  • Parses incoming messages, validates, writes to readings table
  • WebSocket endpoint — streams latest readings to connected frontends
  • REST endpoints — historical data queries, aggregations
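
The parse-and-validate step could look roughly like the sketch below. It assumes a JSON payload with `value` and `ts` fields and uses the topic string as a stand-in for the sensor identifier; both are assumptions for illustration, not part of the plan. Writing to the readings table is omitted.

```python
import json
import math
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Reading:
    sensor_key: str      # topic path, used here in place of a real sensor_id
    timestamp: datetime
    value: float


def parse_message(topic: str, payload: bytes) -> Optional[Reading]:
    """Return a Reading, or None when the message is malformed."""
    if not topic.startswith("dc/"):
        return None
    try:
        data = json.loads(payload)
        value = float(data["value"])
        ts = datetime.fromisoformat(data["ts"])
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, missing fields, and wrongly typed fields
        return None
    if not math.isfinite(value):
        return None      # reject NaN and infinities
    return Reading(sensor_key=topic, timestamp=ts, value=value)
```

Returning `None` rather than raising keeps one bad message from stopping the subscriber loop; a real service would probably also log the rejection.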

Simulator Bots

Each bot runs as an independent Python process, configurable via a simple config file.

Bot: Temperature/Humidity (per rack)

  • Publishes every 30 seconds
  • Base temperature: 22–26°C with ±0.5°C natural drift
  • Humidity: 40–55% RH with slow drift
  • Scenario: COOLING_FAILURE — temperature rises 0.3°C/min until alarm threshold
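
One possible way to generate these values is a bounded random walk around a per-rack base temperature, switching to a steady climb during COOLING_FAILURE. The step size of 0.1°C per tick and the reading of "±0.5°C drift" as a hard band around the base are assumptions of this sketch.

```python
import random

PUBLISH_INTERVAL_S = 30          # from the bot spec above
FAILURE_RATE_C_PER_MIN = 0.3     # from the COOLING_FAILURE scenario
DRIFT_BAND_C = 0.5               # drift is kept within base +/- 0.5 °C (assumed reading)
MAX_STEP_C = 0.1                 # per-tick step size, an assumption


def next_temperature(current: float, base: float, rng: random.Random,
                     cooling_failure: bool = False) -> float:
    """Return the next simulated reading, one publish interval later."""
    if cooling_failure:
        # 0.3 °C/min at a 30 s interval is 0.15 °C per tick
        return current + FAILURE_RATE_C_PER_MIN * PUBLISH_INTERVAL_S / 60
    step = rng.uniform(-MAX_STEP_C, MAX_STEP_C)
    low, high = base - DRIFT_BAND_C, base + DRIFT_BAND_C
    return min(high, max(low, current + step))
```

Passing in a seeded `random.Random` makes a demo run repeatable, which can help when rehearsing a scenario.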

Bot: PDU Power Monitor (per rack)

  • Publishes every 60 seconds
  • Load: 2–8 kW per rack, fluctuates with simulated workload patterns
  • Simulates day/night load patterns (higher load 9am–6pm)
  • Scenario: POWER_SPIKE — sudden 40% load increase
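
The load model could be sketched as a base load with a daytime uplift and an optional spike multiplier. The 20% daytime uplift is a placeholder chosen for this example; only the business-hours window and the 40% spike come from the spec above.

```python
DAY_FACTOR = 1.2      # daytime uplift, an assumed value
SPIKE_FACTOR = 1.4    # POWER_SPIKE: sudden 40% load increase


def rack_load_kw(base_kw: float, hour: int, spike: bool = False) -> float:
    """Return the simulated rack load for a given hour of day (0-23)."""
    is_daytime = 9 <= hour < 18
    load = base_kw * (DAY_FACTOR if is_daytime else 1.0)
    if spike:
        load *= SPIKE_FACTOR
    return load
```

A fuller bot would likely add small random fluctuation on top of this deterministic shape.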

Bot: UPS Unit

  • Publishes every 60 seconds
  • Input/output voltage, load percentage, battery charge, runtime estimate
  • Battery health (SOH) degrades slowly over simulated time
  • Scenario: MAINS_FAILURE — switches to battery, runtime counts down
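
A minimal sketch of the MAINS_FAILURE countdown follows, assuming a linear battery model in which a full charge lasts 10 minutes at 100% load. That figure, the class, and the function names are all hypothetical; real UPS discharge curves are not linear.

```python
from dataclasses import dataclass

FULL_LOAD_RUNTIME_MIN = 10.0   # assumed runtime at 100% load and full charge


@dataclass
class UpsState:
    on_battery: bool = False
    charge_pct: float = 100.0
    load_pct: float = 50.0


def runtime_estimate_min(state: UpsState) -> float:
    """Estimate remaining runtime in minutes under the linear model."""
    if state.load_pct <= 0:
        return float("inf")
    return FULL_LOAD_RUNTIME_MIN * (state.charge_pct / 100) / (state.load_pct / 100)


def tick(state: UpsState, minutes: float = 1.0) -> UpsState:
    """Advance simulated time; drain the battery only while on battery."""
    if state.on_battery:
        drain = 100.0 * minutes * (state.load_pct / 100) / FULL_LOAD_RUNTIME_MIN
        state.charge_pct = max(0.0, state.charge_pct - drain)
    return state
```

With the defaults (50% load), the estimate starts at 20 minutes and falls by about one minute per simulated minute on battery.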

Bot: CRAC/Cooling Unit

  • Publishes every 60 seconds
  • Supply/return air temperature, setpoint, fan speed, compressor state
  • Responds to rack temperature increases (simulated feedback loop)
  • Scenario: UNIT_FAULT — unit goes offline, temperature in zone starts rising

Bot: Water Leak Sensor

  • Normally silent (no leak)
  • Scenario: LEAK_DETECTED — publishes alert, alarm triggers

Bot: Battery Cell Monitor

  • Cell voltages, internal resistance per cell
  • Scenario: CELL_DEGRADATION — one cell's resistance rises, SOH drops

Scenario Runner

  • Simple CLI script: python scenarios/run.py --scenario COOLING_FAILURE --rack A01
  • Useful for demos — trigger realistic alarm sequences on demand
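
The command-line surface of such a runner could be sketched with `argparse` as below. The scenario names are taken from the bot descriptions above; how the runner actually signals a bot is not specified in the plan, so this sketch only parses the arguments and reports what it would start.

```python
import argparse
from typing import List, Optional

SCENARIOS = [
    "COOLING_FAILURE", "POWER_SPIKE", "MAINS_FAILURE",
    "UNIT_FAULT", "LEAK_DETECTED", "CELL_DEGRADATION",
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Trigger a simulator scenario")
    parser.add_argument("--scenario", required=True, choices=SCENARIOS)
    parser.add_argument("--rack", help="Target rack, e.g. A01")
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    args = build_parser().parse_args(argv)
    target = f" on rack {args.rack}" if args.rack else ""
    message = f"Starting {args.scenario}{target}"
    # A real runner would notify the matching bot here (mechanism is an open choice)
    print(message)
    return message
```

Calling `main(["--scenario", "COOLING_FAILURE", "--rack", "A01"])` mirrors the command shown above; an unknown scenario name is rejected by `argparse` before any bot is touched.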

End of Phase

Bots are running, data is flowing into the database every 30–60 seconds across a simulated data center with ~20 racks, 2 rooms, 1 site. You can query the database and see live readings.


Phase 3 — Core Dashboard

Goal: The main overview screen, live-updating, showing the health of the entire facility at a glance.

Screens

Site Overview Dashboard

  • Facility health score (aggregate status)
  • KPI cards: Total Power (kW), PUE, Avg Temperature, Active Alarms count
  • Active alarm feed (live, colour-coded by severity)
  • Power trend chart (last 24 hours)
  • Temperature trend chart (last 24 hours)
  • Room status summary (green/amber/red per room)

Technical

  • WebSocket hook in frontend — subscribes to live data stream
  • KPI card component (value, trend arrow, threshold colour)
  • Live-updating line chart component (Recharts)
  • Alarm badge component
  • Auto-refresh every 30 seconds as fallback

End of Phase

Opening the dashboard shows a live, moving picture of the simulated data center. Numbers change in real time. Alarms appear when bots trigger scenarios.


Phase 4 — Environmental Monitoring

Goal: Deep visibility into temperature and humidity across all zones and racks.

Screens

Environmental Overview

  • Room selector (dropdown or tab)
  • Floor plan / rack layout — each rack colour-coded by temperature (cool blue → hot red)
  • Click a rack → side panel showing temp/humidity chart for last 24h
  • Hot/cold aisle average temperatures

Rack Detail Panel

  • Temperature trend (line chart, last 24h / 7d selectable)
  • Humidity trend
  • Current reading with timestamp
  • Threshold indicators (warning / critical bands shown on chart)

Technical

  • SVG floor plan component — racks as rectangles, colour interpolated from temp value
  • Historical data endpoint: GET /api/sensors/{id}/readings?from=&to=&interval=
  • Threshold configuration stored in DB, compared on ingest
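
The colour interpolation for the floor plan can be illustrated with a linear blend from blue to red. The logic is shown in Python for consistency with the backend examples; in practice it would likely live in the TypeScript frontend. The 18°C and 35°C endpoints are assumed values, not thresholds from the plan.

```python
def temp_to_colour(temp_c: float, cool: float = 18.0, hot: float = 35.0) -> str:
    """Map a temperature to a hex colour between blue (cool) and red (hot)."""
    t = (temp_c - cool) / (hot - cool)
    t = min(1.0, max(0.0, t))          # clamp readings outside the range
    red = round(255 * t)
    blue = round(255 * (1 - t))
    return f"#{red:02x}00{blue:02x}"
```

The returned string can be used directly as the `fill` attribute of an SVG rectangle.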

End of Phase

You can see exactly which racks are running hot. Heatmap updates live. Clicking a rack shows its history.


Phase 5 — Power Monitoring

Goal: Full visibility into power consumption, distribution, and UPS health.

Screens

Power Overview

  • Total facility power (kW) — live gauge
  • PUE metric (Power Usage Effectiveness) — live, with trend
  • PDU breakdown — per-rack load as a bar chart
  • Power trend — last 24 hours area chart

UPS Status Panel

  • Per-unit: input voltage, output voltage, load %, battery charge %, estimated runtime
  • Battery health (SOH) indicator
  • Status badge (Online / On Battery / Fault)
  • Historical battery charge chart

PDU Detail

  • Per-rack power readings
  • Alert if any rack exceeds capacity threshold

Technical

  • PUE calculation: Total Facility Power / IT Equipment Power (computed server-side)
  • Gauge chart component (Recharts RadialBarChart or similar)
  • UPS status card component
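
The PUE formula above translates directly into a small server-side helper. The function name and the choice to return `None` when IT power is zero or negative are assumptions of this sketch.

```python
from typing import Optional


def compute_pue(total_facility_kw: float, it_equipment_kw: float) -> Optional[float]:
    """PUE = total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        return None    # undefined without IT load; caller decides how to display it
    return total_facility_kw / it_equipment_kw
```

For example, 150 kW of facility power against 100 kW of IT load gives a PUE of 1.5.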

End of Phase

Full picture of power health. UPS bot scenario (MAINS_FAILURE) visibly shows battery rundown on screen.


Phase 6 — Cooling & AI Optimization Panel

Goal: Cooling unit visibility plus a simulated AI optimization engine showing energy savings.

Screens

Cooling Overview

  • Per-unit status: CRAC/CHILLER name, supply temp, return temp, setpoint, fan speed, state
  • Zone temperature vs setpoint comparison
  • Cooling efficiency trend

AI Optimization Panel

  • Toggle: AI Optimization: ON / OFF
  • When ON: simulated PUE improvement animation, setpoint adjustment suggestions displayed
  • Energy savings counter (kWh saved today, this month)
  • Simulated recommendation feed: "Raise setpoint in Room 2 by 1°C — estimated 3% saving"
  • Before/after PUE comparison chart

Technical

  • AI optimization is simulated: a backend service generates plausible recommendations based on current temp readings vs setpoints
  • Simple rule engine (if return_temp - supply_temp > X, suggest setpoint raise)
  • Energy savings are calculated from the delta, displayed as a running total
  • This is the layer that gets replaced by a real ML model in production
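
The rule described above might be sketched as follows. The threshold value standing in for X and the wording of the suggestion are placeholders; the plan leaves both open.

```python
from typing import Optional

DELTA_T_THRESHOLD_C = 10.0   # stands in for "X"; placeholder value


def suggest_setpoint_change(unit: str, supply_temp: float, return_temp: float,
                            setpoint: float) -> Optional[str]:
    """Suggest a setpoint raise when return minus supply exceeds the threshold."""
    if return_temp - supply_temp > DELTA_T_THRESHOLD_C:
        return f"Raise setpoint on {unit} by 1°C (currently {setpoint:.1f}°C)"
    return None
```

Keeping the rule behind a single function gives a natural seam for swapping in a real model later, since callers only depend on "current readings in, optional suggestion out".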

End of Phase

The AI panel looks and behaves like a real optimization engine. Recommendations update as conditions change. The CRAC fault scenario visibly impacts the cooling overview.


Phase 7 — Asset Management

Goal: Know exactly what hardware is where, and manage capacity.

Screens

Rack View

  • Visual U-position diagram for each rack (1U–42U slots)
  • Each populated slot shows: device name, type, power draw
  • Empty slots shown as available (grey)
  • Click device → detail panel (model, serial, IP, status, power)

Device Inventory

  • Searchable/filterable table of all devices
  • Columns: name, type, rack, U-position, IP, status, power draw, install date
  • Export to CSV

Capacity Overview

  • Per-rack: U-space used/total, power used/allocated
  • Site-wide capacity summary
  • Highlight over-capacity racks

Technical

  • Rack diagram component — SVG or CSS grid, U-slots rendered from device data
  • Device CRUD endpoints (add/edit/remove devices)
  • Capacity calculation queries
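
A per-rack capacity calculation could be sketched like this. In the real system it would probably be a SQL aggregate; the in-memory version below only shows the arithmetic. The `Device` fields and the 8000 W default allocation are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Device:
    name: str
    u_height: int
    power_w: float


def rack_capacity(devices: List[Device], total_u: int = 42,
                  allocated_w: float = 8000.0) -> Dict[str, object]:
    """Summarise U-space and power use for one rack."""
    used_u = sum(d.u_height for d in devices)
    used_w = sum(d.power_w for d in devices)
    return {
        "u_used": used_u,
        "u_total": total_u,
        "power_used_w": used_w,
        "power_allocated_w": allocated_w,
        "over_capacity": used_u > total_u or used_w > allocated_w,
    }
```

The `over_capacity` flag is what the Capacity Overview screen would use to highlight a rack.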

End of Phase

You can visually browse every rack, see what's installed where, and identify capacity constraints.


Phase 8 — Alarms & Events

Goal: A complete alarm management system — detection, notification, acknowledgement, history.

Screens

Active Alarms

  • Live list: severity (Critical / Major / Minor / Info), source, message, time raised
  • Acknowledge button per alarm
  • Filter by severity, site, room, system type

Alarm History

  • Searchable log of all past alarms
  • Resolution time, acknowledged by, notes

Alarm Rules (simple config)

  • View and edit threshold rules: e.g. "Rack temp > 30°C = Critical alarm"

Technical

  • Alarm engine in backend: on each sensor reading, check against thresholds, create alarm if breached, auto-resolve when reading returns to normal
  • Alarm state machine: ACTIVE → ACKNOWLEDGED → RESOLVED
  • WebSocket push for new alarms (red badge appears instantly)
  • Email notification hook (stub — wire up SMTP later)
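
The engine and state machine could be sketched together as below. One assumption goes beyond the listed transitions: an alarm that is still ACTIVE is allowed to move straight to RESOLVED, since auto-resolve is described without requiring acknowledgement first. The class and function names are hypothetical, and a single threshold with a fixed severity is used to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED = {
    "ACTIVE": {"ACKNOWLEDGED", "RESOLVED"},   # ACTIVE -> RESOLVED is an assumption
    "ACKNOWLEDGED": {"RESOLVED"},
    "RESOLVED": set(),
}


@dataclass
class Alarm:
    sensor_id: str
    severity: str
    state: str = "ACTIVE"

    def transition(self, new_state: str) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state} -> {new_state} not allowed")
        self.state = new_state


def on_reading(alarm: Optional[Alarm], sensor_id: str, value: float,
               threshold: float) -> Optional[Alarm]:
    """Create an alarm on breach; auto-resolve an open alarm when back to normal."""
    breached = value > threshold
    if breached and (alarm is None or alarm.state == "RESOLVED"):
        return Alarm(sensor_id=sensor_id, severity="Critical")
    if not breached and alarm is not None and alarm.state != "RESOLVED":
        alarm.transition("RESOLVED")
    return alarm
```

A repeated breach while an alarm is already open returns the same alarm unchanged, so a sensor that stays hot does not flood the feed with duplicates.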

Scenario Demo

Running python scenarios/run.py --scenario COOLING_FAILURE --rack A01:

  1. Rack A01 temperature starts rising
  2. Warning alarm fires at 28°C
  3. Critical alarm fires at 32°C
  4. Alarm appears live on dashboard
  5. Acknowledge it → status updates
  6. Stop scenario → temperature drops → alarm auto-resolves

End of Phase

Alarm management works end-to-end. Scenarios produce realistic alarm sequences that can be demonstrated live.


Phase 9 — Reporting

Goal: Exportable summaries for management, compliance, and capacity planning.

Screens

Reports Dashboard

  • Pre-built report types: Energy Summary, Temperature Compliance, Uptime Summary, Capacity Report
  • Date range selector
  • Chart previews inline

Report Detail

  • Full chart view
  • Key stats summary
  • Export to PDF / CSV

Reports Included

| Report | Content |
|--------|---------|
| Energy Summary | Total kWh, PUE trend, cost estimate, comparison vs prior period |
| Temperature Compliance | % of time within threshold per rack, worst offenders |
| Uptime & Availability | Alarm frequency, MTTR, critical events |
| Capacity Planning | Space and power utilisation per rack/room, projected headroom |
| Battery Health | UPS SOH trends, recommended replacements |

Technical

  • Report query endpoints (aggregations over TimescaleDB)
  • Chart components reused from earlier phases
  • PDF export via browser print or a library like react-pdf
  • CSV export from table data
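
As an example of a report aggregation, the Temperature Compliance figure could be computed as the share of readings inside a band. The sketch assumes evenly spaced readings, so the share of readings approximates the share of time; in production this would more likely be a database aggregate. The function name and band limits are illustrative.

```python
from typing import Optional, Sequence


def compliance_pct(values: Sequence[float], low: float, high: float) -> Optional[float]:
    """Percentage of readings within [low, high]; None when there is no data."""
    if not values:
        return None
    within = sum(1 for v in values if low <= v <= high)
    return 100.0 * within / len(values)
```

Returning `None` for an empty period lets the report show "no data" instead of a misleading 0% or 100%.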

End of Phase

Management-ready reports that look professional and pull from real (simulated) historical data.


Phase 10 — Polish & Production Hardening

Goal: Make the system genuinely enterprise-ready (secure, auditable, multi-site capable).

Security

  • Role-based access control: Admin, Operator, Read-only, Site Manager
  • Permissions enforced on both frontend routes and backend API endpoints
  • API rate limiting
  • Input validation and sanitisation throughout
  • HTTPS enforced
  • Secrets management (environment variables, never hardcoded)
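
A backend permission check for the four roles could start from a simple lookup table. The permission names and the role-to-permission mapping below are entirely hypothetical; the plan names the roles but does not define what each may do.

```python
ROLE_PERMISSIONS = {
    # Hypothetical mapping, for illustration only
    "Admin": {"read", "acknowledge_alarm", "edit_devices", "manage_users"},
    "Site Manager": {"read", "acknowledge_alarm", "edit_devices"},
    "Operator": {"read", "acknowledge_alarm"},
    "Read-only": {"read"},
}


def is_allowed(role: str, permission: str) -> bool:
    """Return True when the role grants the permission; unknown roles get nothing."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Denying by default for unknown roles keeps a misconfigured account from gaining access by accident.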

Audit & Compliance

  • Audit log table: every user action recorded (who, what, when, from where)
  • Audit log viewer in admin panel
  • Data retention policy configuration

Multi-site

  • Site switcher in top bar
  • All queries scoped to selected site
  • Cross-site summary view for administrators

Operational

  • Health check endpoints
  • Structured logging throughout backend
  • Error boundary handling in frontend
  • Loading and empty states on all screens
  • Mobile-responsive layout (tablet minimum)

End of Phase

System is ready for a real pilot deployment. Security reviewed, roles working, audit trail intact.


What Comes After (Production Path)

When the mockup phases are complete, these are the additions needed to turn it into a real product:

| Addition | Description |
|----------|-------------|
| Real hardware ingestion | Replace simulator bots with real MQTT/SNMP/Modbus adapters |
| TimescaleDB scaling | Move to managed TimescaleDB cloud or dedicated server |
| Real AI engine | Replace rule-based cooling suggestions with ML model |
| SSO / SAML | Enterprise single sign-on via Auth0 enterprise tier |
| Multi-tenancy | Full data isolation per customer (for SaaS model) |
| Mobile app | React Native app reusing component logic |
| Hardware onboarding | UI for registering new devices and sensors |
| SLA monitoring | Uptime tracking and alerting for contracted SLAs |

The mockup-to-production transition is incremental — each bot gets replaced by real hardware one at a time, with zero changes to the rest of the system.


Summary

  • 10 phases, each with a clear, testable deliverable
  • Simulator bots make every phase fully demonstrable with realistic data
  • Scenario runner lets you trigger alarm sequences on demand for demos
  • Production-ready architecture from day one — no throwaway work
  • Real hardware integration is a drop-in replacement when you're ready