BMS/project_plan.md
2026-03-19 11:32:17 +00:00


# DCIM Platform — Project Plan

**Stack:** Next.js + TypeScript + shadcn/ui | Python FastAPI | PostgreSQL + TimescaleDB | MQTT (Mosquitto) | Clerk Auth

**Approach:** Build as a production shell from day one. Simulated sensor bots feed real data pipelines. Swap bots for real hardware when ready — nothing else changes.

---
## How the Simulator Bots Work
Each bot is a small Python script that behaves exactly like a real physical sensor or device. It generates realistic data with natural variation and drift, then publishes it to the MQTT broker on the same topic a real sensor would use.
```
[Bot: Rack A01 Temp Sensor] → MQTT topic: dc/site1/room1/rack-A01/temperature
[Bot: UPS Unit 1] → MQTT topic: dc/site1/power/ups-01/status
[Bot: CRAC Unit 2] → MQTT topic: dc/site1/cooling/crac-02/status
```
The backend subscribes to those topics and stores the data. The frontend never knows the difference. When real hardware is connected, it publishes to the same topics — bots get switched off, everything else keeps working.
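
A minimal bot of this shape might look like the following sketch, assuming the `paho-mqtt` client library (2.x), a broker on `localhost:1883`, and a simple JSON payload — the `value`/`ts` field names are illustrative, not a finalised contract:

```python
import json
import random
import time

def make_reading(base: float, drift: float) -> dict:
    """Build one payload with natural variation around a base value."""
    return {"value": round(base + random.uniform(-drift, drift), 2),
            "ts": int(time.time())}

def run(topic: str = "dc/site1/room1/rack-A01/temperature") -> None:
    """Publish a reading every 30 seconds, exactly as a real sensor would."""
    import paho.mqtt.client as mqtt  # third-party: pip install paho-mqtt
    client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)  # paho-mqtt >= 2.0
    client.connect("localhost", 1883)
    while True:
        client.publish(topic, json.dumps(make_reading(24.0, 0.5)))
        time.sleep(30)
```

Because the topic and payload are the bot's only contract with the system, swapping in real hardware later only requires the hardware to publish the same shape.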

Bots can also simulate **events and scenarios** for demo purposes:
- Gradual temperature rise in a rack (simulating cooling failure)
- Power load spike across a PDU
- UPS battery degradation over time
- Water leak alert trigger
- Alarm escalation sequences
---
## Phase Overview
| Phase | Name | Deliverable |
|---|---|---|
| 1 | Foundation | Running skeleton — frontend, backend, auth, DB all connected |
| 2 | Data Pipeline + Bots | MQTT broker, simulator bots, data flowing into DB |
| 3 | Core Dashboard | Live overview dashboard pulling real simulated data |
| 4 | Environmental Monitoring | Temperature/humidity views, heatmaps |
| 5 | Power Monitoring | PDU, UPS, PUE tracking |
| 6 | Cooling & AI Panel | CRAC status, simulated AI optimization |
| 7 | Asset Management | Rack views, device inventory |
| 8 | Alarms & Events | Live alarm feed, acknowledgement, escalation |
| 9 | Reporting | Charts, summaries, export |
| 10 | Polish & Hardening | RBAC, audit log, multi-site, production readiness |
---
## Phase 1 — Foundation
**Goal:** Every layer of the stack is running and connected. No real features yet, just a working skeleton.
### Tasks
- [ ] Initialise Next.js project with TypeScript
- [ ] Set up shadcn/ui component library
- [ ] Set up Python FastAPI project structure
- [ ] Connect Clerk authentication (login, logout, protected routes)
- [ ] Provision PostgreSQL database (local via Docker)
- [ ] Basic API route: frontend calls backend, gets a response
- [ ] Docker Compose file running: frontend + backend + database together
- [ ] Placeholder layout: sidebar nav, top bar, main content area
### Folder Structure
```
/dcim
/frontend ← Next.js app
/app
/components
/lib
/backend ← FastAPI app
/api
/models
/services
/simulators ← Sensor bots (Python scripts)
/infra ← Docker Compose, config files
docker-compose.yml
```
### End of Phase
You can log in, see a blank dashboard shell, and the backend responds to API calls. Database is running.

---
## Phase 2 — Data Pipeline & Simulator Bots
**Goal:** Simulated sensor data flows continuously from bots → MQTT → backend → database. The system behaves as if real hardware is connected.
### Infrastructure
- [ ] Add Mosquitto MQTT broker to Docker Compose
- [ ] Add TimescaleDB extension to PostgreSQL
- [ ] Create hypertable for sensor readings (time-series optimised)
### Database Schema (core tables)
```
sites — site name, location, timezone
rooms — belongs to site, physical room
racks — belongs to room, U-height, position
devices — belongs to rack, type, model, serial
sensors — belongs to device or rack, sensor type
readings — (TimescaleDB hypertable) sensor_id, timestamp, value
alarms — sensor_id, severity, message, state, acknowledged_at
```
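
As a sketch, the `readings` hypertable could be declared like this — the column types and the 7-day chunk interval are assumptions, not a finalised schema:

```python
# Illustrative DDL for the TimescaleDB-backed readings table.
# Column types and the 7-day chunk interval are assumptions.
READINGS_DDL = """
CREATE TABLE IF NOT EXISTS readings (
    sensor_id INTEGER          NOT NULL REFERENCES sensors (id),
    ts        TIMESTAMPTZ      NOT NULL,
    value     DOUBLE PRECISION NOT NULL
);
SELECT create_hypertable('readings', 'ts',
                         chunk_time_interval => INTERVAL '7 days',
                         if_not_exists => TRUE);
"""

def apply_schema(conn) -> None:
    """Apply the DDL on a psycopg2 connection (multi-statement execute)."""
    with conn.cursor() as cur:
        cur.execute(READINGS_DDL)
    conn.commit()
```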
### Backend Data Ingestion
- [ ] MQTT subscriber service (Python) — listens to all sensor topics
- [ ] Parses incoming messages, validates, writes to `readings` table
- [ ] WebSocket endpoint — streams latest readings to connected frontends
- [ ] REST endpoints — historical data queries, aggregations
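
The subscriber's parsing step can be sketched as a pure function — the topic layout follows the bot topics above, while the payload fields and the `parse_message` name itself are illustrative:

```python
import json

def parse_message(topic: str, payload: bytes) -> dict:
    """Turn one MQTT message into a row for the readings table.

    Assumes the topic layout dc/<site>/<zone>/<unit>/<metric> used by the
    bots, and a JSON payload like {"value": 24.3, "ts": 1710000000}.
    """
    parts = topic.split("/")
    if len(parts) != 5 or parts[0] != "dc":
        raise ValueError(f"unexpected topic: {topic}")
    body = json.loads(payload)
    return {"site": parts[1], "zone": parts[2], "unit": parts[3],
            "metric": parts[4], "value": float(body["value"]),
            "ts": int(body["ts"])}
```

Keeping the parser pure (no DB or MQTT dependency) makes the ingestion path easy to unit-test before any broker is running.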
### Simulator Bots
Each bot runs as an independent Python process, configurable via a simple config file.

**Bot: Temperature/Humidity (per rack)**
- Publishes every 30 seconds
- Base temperature: 22-26°C with ±0.5°C natural drift
- Humidity: 40-55% RH with slow drift
- Scenario: `COOLING_FAILURE` — temperature rises 0.3°C/min until alarm threshold

**Bot: PDU Power Monitor (per rack)**
- Publishes every 60 seconds
- Load: 2-8 kW per rack, fluctuates with simulated workload patterns
- Simulates day/night load patterns (higher load 9am-6pm)
- Scenario: `POWER_SPIKE` — sudden 40% load increase

**Bot: UPS Unit**
- Publishes every 60 seconds
- Input/output voltage, load percentage, battery charge, runtime estimate
- Battery health (SOH) degrades slowly over simulated time
- Scenario: `MAINS_FAILURE` — switches to battery, runtime counts down

**Bot: CRAC/Cooling Unit**
- Publishes every 60 seconds
- Supply/return air temperature, setpoint, fan speed, compressor state
- Responds to rack temperature increases (simulated feedback loop)
- Scenario: `UNIT_FAULT` — unit goes offline, temperature in zone starts rising

**Bot: Water Leak Sensor**
- Normally silent (no leak)
- Scenario: `LEAK_DETECTED` — publishes alert, alarm triggers

**Bot: Battery Cell Monitor**
- Cell voltages, internal resistance per cell
- Scenario: `CELL_DEGRADATION` — one cell's resistance rises, SOH drops
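
The "natural drift" and scenario-ramp behaviours above can be sketched as simple generators — the step size and function names are illustrative:

```python
import random

def drift_series(base: float, band: float, n: int,
                 step: float = 0.05) -> list[float]:
    """Bounded random walk: each sample wanders by up to ±step but is
    clamped to base ± band, giving the bots' natural drift."""
    value, out = base, []
    for _ in range(n):
        value = max(base - band,
                    min(base + band, value + random.uniform(-step, step)))
        out.append(round(value, 2))
    return out

def cooling_failure_ramp(start: float, minutes: int,
                         rate: float = 0.3) -> list[float]:
    """COOLING_FAILURE scenario: temperature rises `rate` °C per minute."""
    return [round(start + rate * m, 2) for m in range(minutes + 1)]
```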
### Scenario Runner
- [ ] Simple CLI script: `python scenarios/run.py --scenario COOLING_FAILURE --rack A01`
- [ ] Useful for demos — trigger realistic alarm sequences on demand
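
The CLI's argument handling can be sketched with `argparse` — the `--duration` flag is an assumed extra, not part of the plan's command:

```python
import argparse

SCENARIOS = ["COOLING_FAILURE", "POWER_SPIKE", "MAINS_FAILURE",
             "UNIT_FAULT", "LEAK_DETECTED", "CELL_DEGRADATION"]

def build_parser() -> argparse.ArgumentParser:
    """CLI for scenarios/run.py, mirroring the command shown above."""
    p = argparse.ArgumentParser(description="Trigger a simulator scenario")
    p.add_argument("--scenario", choices=SCENARIOS, required=True)
    p.add_argument("--rack", help="target rack, e.g. A01")
    p.add_argument("--duration", type=int, default=10,
                   help="minutes to run before reverting (assumed flag)")
    return p
```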
### End of Phase
Bots are running, data is flowing into the database every 30-60 seconds across a simulated data center with ~20 racks, 2 rooms, 1 site. You can query the database and see live readings.

---
## Phase 3 — Core Dashboard
**Goal:** The main overview screen, live-updating, showing the health of the entire facility at a glance.
### Screens
**Site Overview Dashboard**
- Facility health score (aggregate status)
- KPI cards: Total Power (kW), PUE, Avg Temperature, Active Alarms count
- Active alarm feed (live, colour-coded by severity)
- Power trend chart (last 24 hours)
- Temperature trend chart (last 24 hours)
- Room status summary (green/amber/red per room)
### Technical
- [ ] WebSocket hook in frontend — subscribes to live data stream
- [ ] KPI card component (value, trend arrow, threshold colour)
- [ ] Live-updating line chart component (Recharts)
- [ ] Alarm badge component
- [ ] Auto-refresh every 30 seconds as fallback
### End of Phase
Opening the dashboard shows a live, moving picture of the simulated data center. Numbers change in real time. Alarms appear when bots trigger scenarios.

---
## Phase 4 — Environmental Monitoring
**Goal:** Deep visibility into temperature and humidity across all zones and racks.
### Screens
**Environmental Overview**
- Room selector (dropdown or tab)
- Floor plan / rack layout — each rack colour-coded by temperature (cool blue → hot red)
- Click a rack → side panel showing temp/humidity chart for last 24h
- Hot/cold aisle average temperatures

**Rack Detail Panel**
- Temperature trend (line chart, last 24h / 7d selectable)
- Humidity trend
- Current reading with timestamp
- Threshold indicators (warning / critical bands shown on chart)
### Technical
- [ ] SVG floor plan component — racks as rectangles, colour interpolated from temp value
- [ ] Historical data endpoint: `GET /api/sensors/{id}/readings?from=&to=&interval=`
- [ ] Threshold configuration stored in DB, compared on ingest
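
Server-side, the `interval` parameter would typically map to TimescaleDB's `time_bucket()`; the equivalent downsampling logic, sketched in plain Python:

```python
from collections import defaultdict

def bucket_readings(readings: list[tuple[int, float]],
                    interval: int) -> list[tuple[int, float]]:
    """Average (unix_ts, value) pairs into fixed-width time buckets —
    the same shape of result time_bucket() + avg() would return."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in readings:
        buckets[ts - ts % interval].append(value)  # floor ts to bucket start
    return [(start, round(sum(vs) / len(vs), 3))
            for start, vs in sorted(buckets.items())]
```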
### End of Phase
You can see exactly which racks are running hot. Heatmap updates live. Clicking a rack shows its history.

---
## Phase 5 — Power Monitoring
**Goal:** Full visibility into power consumption, distribution, and UPS health.
### Screens
**Power Overview**
- Total facility power (kW) — live gauge
- PUE metric (Power Usage Effectiveness) — live, with trend
- PDU breakdown — per-rack load as a bar chart
- Power trend — last 24 hours area chart

**UPS Status Panel**
- Per-unit: input voltage, output voltage, load %, battery charge %, estimated runtime
- Battery health (SOH) indicator
- Status badge (Online / On Battery / Fault)
- Historical battery charge chart

**PDU Detail**
- Per-rack power readings
- Alert if any rack exceeds capacity threshold
### Technical
- [ ] PUE calculation: Total Facility Power / IT Equipment Power (computed server-side)
- [ ] Gauge chart component (Recharts RadialBarChart or similar)
- [ ] UPS status card component
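
The PUE computation itself is a one-liner; a guarded sketch:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """PUE = Total Facility Power / IT Equipment Power; 1.0 is the ideal."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment load must be positive")
    return round(total_facility_kw / it_equipment_kw, 3)
```

For example, 150 kW at the facility meter against 100 kW of IT load gives a PUE of 1.5.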
### End of Phase
Full picture of power health. UPS bot scenario (`MAINS_FAILURE`) visibly shows battery rundown on screen.

---
## Phase 6 — Cooling & AI Optimization Panel
**Goal:** Cooling unit visibility plus a simulated AI optimization engine showing energy savings.
### Screens
**Cooling Overview**
- Per-unit status: CRAC/CHILLER name, supply temp, return temp, setpoint, fan speed, state
- Zone temperature vs setpoint comparison
- Cooling efficiency trend

**AI Optimization Panel**
- Toggle: `AI Optimization: ON / OFF`
- When ON: simulated PUE improvement animation, setpoint adjustment suggestions displayed
- Energy savings counter (kWh saved today, this month)
- Simulated recommendation feed: "Raise setpoint in Room 2 by 1°C — estimated 3% saving"
- Before/after PUE comparison chart
### Technical
- [ ] AI optimization is simulated: a backend service generates plausible recommendations based on current temp readings vs setpoints
- [ ] Simple rule engine (if return_temp - supply_temp > X, suggest setpoint raise)
- [ ] Energy savings are calculated from the delta, displayed as a running total
- [ ] This is the layer that gets replaced by a real ML model in production
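
A minimal version of that rule engine might look like this — the 10°C delta-T threshold is an illustrative assumption, and the ~3% figure echoes the sample recommendation above:

```python
def recommend(units: list[dict], max_delta_t: float = 10.0) -> list[str]:
    """Rule-based stand-in for the AI engine: flag cooling units whose
    return/supply delta-T suggests the setpoint could be raised."""
    recs = []
    for u in units:
        delta = u["return_temp"] - u["supply_temp"]
        if delta > max_delta_t:
            recs.append(f"Raise setpoint on {u['name']} by 1°C "
                        f"(delta-T {delta:.1f}°C, estimated ~3% saving)")
    return recs
```

Because the panel only consumes a list of recommendation strings, swapping this loop for a real ML model later changes nothing downstream.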
### End of Phase
The AI panel looks and behaves like a real optimization engine. Recommendations update as conditions change. The CRAC fault scenario visibly impacts the cooling overview.

---
## Phase 7 — Asset Management
**Goal:** Know exactly what hardware is where, and manage capacity.
### Screens
**Rack View**
- Visual U-position diagram for each rack (1U-42U slots)
- Each populated slot shows: device name, type, power draw
- Empty slots shown as available (grey)
- Click device → detail panel (model, serial, IP, status, power)

**Device Inventory**
- Searchable/filterable table of all devices
- Columns: name, type, rack, U-position, IP, status, power draw, install date
- Export to CSV

**Capacity Overview**
- Per-rack: U-space used/total, power used/allocated
- Site-wide capacity summary
- Highlight over-capacity racks
### Technical
- [ ] Rack diagram component — SVG or CSS grid, U-slots rendered from device data
- [ ] Device CRUD endpoints (add/edit/remove devices)
- [ ] Capacity calculation queries
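
The capacity calculation can be sketched as a per-rack summary — the 42U height and the 8 kW power budget are assumed defaults:

```python
def rack_capacity(devices: list[dict], total_u: int = 42,
                  power_budget_kw: float = 8.0) -> dict:
    """Summarise U-space and power utilisation for one rack.

    Each device dict is assumed to carry u_height and power_kw fields.
    """
    used_u = sum(d["u_height"] for d in devices)
    used_kw = sum(d["power_kw"] for d in devices)
    return {
        "u_used": used_u,
        "u_free": total_u - used_u,
        "power_used_kw": round(used_kw, 2),
        "over_capacity": used_u > total_u or used_kw > power_budget_kw,
    }
```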
### End of Phase
You can visually browse every rack, see what's installed where, and identify capacity constraints.

---
## Phase 8 — Alarms & Events
**Goal:** A complete alarm management system — detection, notification, acknowledgement, history.
### Screens
**Active Alarms**
- Live list: severity (Critical / Major / Minor / Info), source, message, time raised
- Acknowledge button per alarm
- Filter by severity, site, room, system type

**Alarm History**
- Searchable log of all past alarms
- Resolution time, acknowledged by, notes

**Alarm Rules** (simple config)
- View and edit threshold rules: e.g. "Rack temp > 30°C = Critical alarm"
### Technical
- [ ] Alarm engine in backend: on each sensor reading, check against thresholds, create alarm if breached, auto-resolve when reading returns to normal
- [ ] Alarm state machine: `ACTIVE → ACKNOWLEDGED → RESOLVED`
- [ ] WebSocket push for new alarms (red badge appears instantly)
- [ ] Email notification hook (stub — wire up SMTP later)
### Scenario Demo
Running `python scenarios/run.py --scenario COOLING_FAILURE --rack A01`:
1. Rack A01 temperature starts rising
2. Warning alarm fires at 28°C
3. Critical alarm fires at 32°C
4. Alarm appears live on dashboard
5. Acknowledge it → status updates
6. Stop scenario → temperature drops → alarm auto-resolves
### End of Phase
Alarm management works end-to-end. Scenarios produce realistic alarm sequences that can be demonstrated live.

---
## Phase 9 — Reporting
**Goal:** Exportable summaries for management, compliance, and capacity planning.
### Screens
**Reports Dashboard**
- Pre-built report types: Energy Summary, Temperature Compliance, Uptime Summary, Capacity Report
- Date range selector
- Chart previews inline

**Report Detail**
- Full chart view
- Key stats summary
- Export to PDF / CSV
### Reports Included
| Report | Content |
|---|---|
| Energy Summary | Total kWh, PUE trend, cost estimate, comparison vs prior period |
| Temperature Compliance | % of time within threshold per rack, worst offenders |
| Uptime & Availability | Alarm frequency, MTTR, critical events |
| Capacity Planning | Space and power utilisation per rack/room, projected headroom |
| Battery Health | UPS SOH trends, recommended replacements |
### Technical
- [ ] Report query endpoints (aggregations over TimescaleDB)
- [ ] Chart components reused from earlier phases
- [ ] PDF export via browser print or a library like `react-pdf`
- [ ] CSV export from table data
### End of Phase
Management-ready reports that look professional and pull from real (simulated) historical data.

---
## Phase 10 — Polish & Production Hardening
**Goal:** Make the system genuinely enterprise-ready — secure, auditable, multi-tenant capable.
### Security
- [ ] Role-based access control: `Admin`, `Operator`, `Read-only`, `Site Manager`
- [ ] Permissions enforced on both frontend routes and backend API endpoints
- [ ] API rate limiting
- [ ] Input validation and sanitisation throughout
- [ ] HTTPS enforced
- [ ] Secrets management (environment variables, never hardcoded)
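
The four roles can be backed by a simple permission map checked on every endpoint — the permission names here are illustrative, and real role claims would come from Clerk:

```python
# Role → permission map (illustrative; real roles come from Clerk claims).
PERMISSIONS = {
    "Admin":        {"read", "write", "acknowledge_alarms", "manage_users"},
    "Site Manager": {"read", "write", "acknowledge_alarms"},
    "Operator":     {"read", "acknowledge_alarms"},
    "Read-only":    {"read"},
}

def require(role: str, permission: str) -> None:
    """Raise if the role lacks the permission; usable from a FastAPI dependency."""
    if permission not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"role {role!r} lacks permission {permission!r}")
```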
### Audit & Compliance
- [ ] Audit log table: every user action recorded (who, what, when, from where)
- [ ] Audit log viewer in admin panel
- [ ] Data retention policy configuration
### Multi-site
- [ ] Site switcher in top bar
- [ ] All queries scoped to selected site
- [ ] Cross-site summary view for administrators
### Operational
- [ ] Health check endpoints
- [ ] Structured logging throughout backend
- [ ] Error boundary handling in frontend
- [ ] Loading and empty states on all screens
- [ ] Mobile-responsive layout (tablet minimum)
### End of Phase
System is ready for a real pilot deployment. Security reviewed, roles working, audit trail intact.

---
## What Comes After (Production Path)
When the mockup phases are complete, these are the additions needed to turn it into a real product:

| Addition | Description |
|---|---|
| Real hardware ingestion | Replace simulator bots with real MQTT/SNMP/Modbus adapters |
| TimescaleDB scaling | Move to managed TimescaleDB cloud or dedicated server |
| Real AI engine | Replace rule-based cooling suggestions with ML model |
| SSO / SAML | Enterprise single sign-on via the auth provider's enterprise tier (the stack uses Clerk) |
| Multi-tenancy | Full data isolation per customer (for SaaS model) |
| Mobile app | React Native app reusing component logic |
| Hardware onboarding | UI for registering new devices and sensors |
| SLA monitoring | Uptime tracking and alerting for contracted SLAs |
The mockup-to-production transition is incremental — each bot gets replaced by real hardware one at a time, with zero changes to the rest of the system.

---
## Summary
- **10 phases**, each with a clear, testable deliverable
- **Simulator bots** make every phase fully demonstrable with realistic data
- **Scenario runner** lets you trigger alarm sequences on demand for demos
- **Production-ready architecture** from day one — no throwaway work
- Real hardware integration is a drop-in replacement when you're ready