BMS/project_plan.md
2026-03-19 11:32:17 +00:00


# DCIM Platform — Project Plan

**Stack:** Next.js + TypeScript + shadcn/ui | Python FastAPI | PostgreSQL + TimescaleDB | MQTT (Mosquitto) | Clerk Auth

**Approach:** Build as a production shell from day one. Simulated sensor bots feed real data pipelines. Swap bots for real hardware when ready — nothing else changes.

---
## How the Simulator Bots Work
Each bot is a small Python script that behaves exactly like a real physical sensor or device. It generates realistic data with natural variation and drift, then publishes it to the MQTT broker on the same topic a real sensor would use.
```
[Bot: Rack A01 Temp Sensor] → MQTT topic: dc/site1/room1/rack-A01/temperature
[Bot: UPS Unit 1] → MQTT topic: dc/site1/power/ups-01/status
[Bot: CRAC Unit 2] → MQTT topic: dc/site1/cooling/crac-02/status
```
The backend subscribes to those topics and stores the data. The frontend never knows the difference. When real hardware is connected, it publishes to the same topics — bots get switched off, everything else keeps working.
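
A minimal bot of this shape might look like the following sketch, assuming the `paho-mqtt` client library (2.x), a broker on `localhost:1883`, and a simple JSON payload — the `value`/`ts` field names are illustrative, not a finalised contract:

```python
import json
import random
import time

def make_reading(base: float, drift: float) -> dict:
    """Build one payload with natural variation around a base value."""
    return {"value": round(base + random.uniform(-drift, drift), 2),
            "ts": int(time.time())}

def run(topic: str = "dc/site1/room1/rack-A01/temperature") -> None:
    """Publish a reading every 30 seconds, exactly as a real sensor would."""
    import paho.mqtt.client as mqtt  # third-party: pip install paho-mqtt
    client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)  # paho-mqtt >= 2.0
    client.connect("localhost", 1883)
    while True:
        client.publish(topic, json.dumps(make_reading(24.0, 0.5)))
        time.sleep(30)
```

Because the topic and payload are the bot's only contract with the system, swapping in real hardware later only requires the hardware to publish the same shape.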

Bots can also simulate **events and scenarios** for demo purposes:
- Gradual temperature rise in a rack (simulating cooling failure)
- Power load spike across a PDU
- UPS battery degradation over time
- Water leak alert trigger
- Alarm escalation sequences
---
## Phase Overview
| Phase | Name | Deliverable |
|---|---|---|
| 1 | Foundation | Running skeleton — frontend, backend, auth, DB all connected |
| 2 | Data Pipeline + Bots | MQTT broker, simulator bots, data flowing into DB |
| 3 | Core Dashboard | Live overview dashboard pulling real simulated data |
| 4 | Environmental Monitoring | Temperature/humidity views, heatmaps |
| 5 | Power Monitoring | PDU, UPS, PUE tracking |
| 6 | Cooling & AI Panel | CRAC status, simulated AI optimization |
| 7 | Asset Management | Rack views, device inventory |
| 8 | Alarms & Events | Live alarm feed, acknowledgement, escalation |
| 9 | Reporting | Charts, summaries, export |
| 10 | Polish & Hardening | RBAC, audit log, multi-site, production readiness |
---
## Phase 1 — Foundation
**Goal:** Every layer of the stack is running and connected. No real features yet, just a working skeleton.
### Tasks
- [ ] Initialise Next.js project with TypeScript
- [ ] Set up shadcn/ui component library
- [ ] Set up Python FastAPI project structure
- [ ] Connect Clerk authentication (login, logout, protected routes)
- [ ] Provision PostgreSQL database (local via Docker)
- [ ] Basic API route: frontend calls backend, gets a response
- [ ] Docker Compose file running: frontend + backend + database together
- [ ] Placeholder layout: sidebar nav, top bar, main content area
### Folder Structure
```
/dcim
/frontend ← Next.js app
/app
/components
/lib
/backend ← FastAPI app
/api
/models
/services
/simulators ← Sensor bots (Python scripts)
/infra ← Docker Compose, config files
docker-compose.yml
```
### End of Phase
You can log in, see a blank dashboard shell, and the backend responds to API calls. Database is running.

---
## Phase 2 — Data Pipeline & Simulator Bots
**Goal:** Simulated sensor data flows continuously from bots → MQTT → backend → database. The system behaves as if real hardware is connected.
### Infrastructure
- [ ] Add Mosquitto MQTT broker to Docker Compose
- [ ] Add TimescaleDB extension to PostgreSQL
- [ ] Create hypertable for sensor readings (time-series optimised)
### Database Schema (core tables)
```
sites — site name, location, timezone
rooms — belongs to site, physical room
racks — belongs to room, U-height, position
devices — belongs to rack, type, model, serial
sensors — belongs to device or rack, sensor type
readings — (TimescaleDB hypertable) sensor_id, timestamp, value
alarms — sensor_id, severity, message, state, acknowledged_at
```
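
As a sketch, the `readings` hypertable could be declared like this — the column types and the 7-day chunk interval are assumptions, not a finalised schema:

```python
# Illustrative DDL for the TimescaleDB-backed readings table.
# Column types and the 7-day chunk interval are assumptions.
READINGS_DDL = """
CREATE TABLE IF NOT EXISTS readings (
    sensor_id INTEGER          NOT NULL REFERENCES sensors (id),
    ts        TIMESTAMPTZ      NOT NULL,
    value     DOUBLE PRECISION NOT NULL
);
SELECT create_hypertable('readings', 'ts',
                         chunk_time_interval => INTERVAL '7 days',
                         if_not_exists => TRUE);
"""

def apply_schema(conn) -> None:
    """Apply the DDL on a psycopg2 connection (multi-statement execute)."""
    with conn.cursor() as cur:
        cur.execute(READINGS_DDL)
    conn.commit()
```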
### Backend Data Ingestion
- [ ] MQTT subscriber service (Python) — listens to all sensor topics
- [ ] Parses incoming messages, validates, writes to `readings` table
- [ ] WebSocket endpoint — streams latest readings to connected frontends
- [ ] REST endpoints — historical data queries, aggregations
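
The subscriber's parsing step can be sketched as a pure function — the topic layout follows the bot topics above, while the payload fields and the `parse_message` name itself are illustrative:

```python
import json

def parse_message(topic: str, payload: bytes) -> dict:
    """Turn one MQTT message into a row for the readings table.

    Assumes the topic layout dc/<site>/<zone>/<unit>/<metric> used by the
    bots, and a JSON payload like {"value": 24.3, "ts": 1710000000}.
    """
    parts = topic.split("/")
    if len(parts) != 5 or parts[0] != "dc":
        raise ValueError(f"unexpected topic: {topic}")
    body = json.loads(payload)
    return {"site": parts[1], "zone": parts[2], "unit": parts[3],
            "metric": parts[4], "value": float(body["value"]),
            "ts": int(body["ts"])}
```

Keeping the parser pure (no DB or MQTT dependency) makes the ingestion path easy to unit-test before any broker is running.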
### Simulator Bots
Each bot runs as an independent Python process, configurable via a simple config file.

**Bot: Temperature/Humidity (per rack)**
- Publishes every 30 seconds
- Base temperature: 22-26°C with ±0.5°C natural drift
- Humidity: 40-55% RH with slow drift
- Scenario: `COOLING_FAILURE` — temperature rises 0.3°C/min until alarm threshold

**Bot: PDU Power Monitor (per rack)**
- Publishes every 60 seconds
- Load: 2-8 kW per rack, fluctuates with simulated workload patterns
- Simulates day/night load patterns (higher load 9am-6pm)
- Scenario: `POWER_SPIKE` — sudden 40% load increase

**Bot: UPS Unit**
- Publishes every 60 seconds
- Input/output voltage, load percentage, battery charge, runtime estimate
- Battery health (SOH) degrades slowly over simulated time
- Scenario: `MAINS_FAILURE` — switches to battery, runtime counts down

**Bot: CRAC/Cooling Unit**
- Publishes every 60 seconds
- Supply/return air temperature, setpoint, fan speed, compressor state
- Responds to rack temperature increases (simulated feedback loop)
- Scenario: `UNIT_FAULT` — unit goes offline, temperature in zone starts rising

**Bot: Water Leak Sensor**
- Normally silent (no leak)
- Scenario: `LEAK_DETECTED` — publishes alert, alarm triggers

**Bot: Battery Cell Monitor**
- Cell voltages, internal resistance per cell
- Scenario: `CELL_DEGRADATION` — one cell's resistance rises, SOH drops
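
The "natural drift" and scenario-ramp behaviours above can be sketched as simple generators — the step size and function names are illustrative:

```python
import random

def drift_series(base: float, band: float, n: int,
                 step: float = 0.05) -> list[float]:
    """Bounded random walk: each sample wanders by up to ±step but is
    clamped to base ± band, giving the bots' natural drift."""
    value, out = base, []
    for _ in range(n):
        value = max(base - band,
                    min(base + band, value + random.uniform(-step, step)))
        out.append(round(value, 2))
    return out

def cooling_failure_ramp(start: float, minutes: int,
                         rate: float = 0.3) -> list[float]:
    """COOLING_FAILURE scenario: temperature rises `rate` °C per minute."""
    return [round(start + rate * m, 2) for m in range(minutes + 1)]
```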
### Scenario Runner
- [ ] Simple CLI script: `python scenarios/run.py --scenario COOLING_FAILURE --rack A01`
- [ ] Useful for demos — trigger realistic alarm sequences on demand
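
The CLI's argument handling can be sketched with `argparse` — the `--duration` flag is an assumed extra, not part of the plan's command:

```python
import argparse

SCENARIOS = ["COOLING_FAILURE", "POWER_SPIKE", "MAINS_FAILURE",
             "UNIT_FAULT", "LEAK_DETECTED", "CELL_DEGRADATION"]

def build_parser() -> argparse.ArgumentParser:
    """CLI for scenarios/run.py, mirroring the command shown above."""
    p = argparse.ArgumentParser(description="Trigger a simulator scenario")
    p.add_argument("--scenario", choices=SCENARIOS, required=True)
    p.add_argument("--rack", help="target rack, e.g. A01")
    p.add_argument("--duration", type=int, default=10,
                   help="minutes to run before reverting (assumed flag)")
    return p
```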
### End of Phase
Bots are running, data is flowing into the database every 30-60 seconds across a simulated data center with ~20 racks, 2 rooms, 1 site. You can query the database and see live readings.

---
## Phase 3 — Core Dashboard
**Goal:** The main overview screen, live-updating, showing the health of the entire facility at a glance.
### Screens
**Site Overview Dashboard**
- Facility health score (aggregate status)
- KPI cards: Total Power (kW), PUE, Avg Temperature, Active Alarms count
- Active alarm feed (live, colour-coded by severity)
- Power trend chart (last 24 hours)
- Temperature trend chart (last 24 hours)
- Room status summary (green/amber/red per room)
### Technical
- [ ] WebSocket hook in frontend — subscribes to live data stream
- [ ] KPI card component (value, trend arrow, threshold colour)
- [ ] Live-updating line chart component (Recharts)
- [ ] Alarm badge component
- [ ] Auto-refresh every 30 seconds as fallback
### End of Phase
Opening the dashboard shows a live, moving picture of the simulated data center. Numbers change in real time. Alarms appear when bots trigger scenarios.

---
## Phase 4 — Environmental Monitoring
**Goal:** Deep visibility into temperature and humidity across all zones and racks.
### Screens
**Environmental Overview**
- Room selector (dropdown or tab)
- Floor plan / rack layout — each rack colour-coded by temperature (cool blue → hot red)
- Click a rack → side panel showing temp/humidity chart for last 24h
- Hot/cold aisle average temperatures

**Rack Detail Panel**
- Temperature trend (line chart, last 24h / 7d selectable)
- Humidity trend
- Current reading with timestamp
- Threshold indicators (warning / critical bands shown on chart)
### Technical
- [ ] SVG floor plan component — racks as rectangles, colour interpolated from temp value
- [ ] Historical data endpoint: `GET /api/sensors/{id}/readings?from=&to=&interval=`
- [ ] Threshold configuration stored in DB, compared on ingest
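
Server-side, the `interval` parameter would typically map to TimescaleDB's `time_bucket()`; the equivalent downsampling logic, sketched in plain Python:

```python
from collections import defaultdict

def bucket_readings(readings: list[tuple[int, float]],
                    interval: int) -> list[tuple[int, float]]:
    """Average (unix_ts, value) pairs into fixed-width time buckets —
    the same shape of result time_bucket() + avg() would return."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in readings:
        buckets[ts - ts % interval].append(value)  # floor ts to bucket start
    return [(start, round(sum(vs) / len(vs), 3))
            for start, vs in sorted(buckets.items())]
```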
### End of Phase
You can see exactly which racks are running hot. Heatmap updates live. Clicking a rack shows its history.

---
## Phase 5 — Power Monitoring
**Goal:** Full visibility into power consumption, distribution, and UPS health.
### Screens
**Power Overview**
- Total facility power (kW) — live gauge
- PUE metric (Power Usage Effectiveness) — live, with trend
- PDU breakdown — per-rack load as a bar chart
- Power trend — last 24 hours area chart

**UPS Status Panel**
- Per-unit: input voltage, output voltage, load %, battery charge %, estimated runtime
- Battery health (SOH) indicator
- Status badge (Online / On Battery / Fault)
- Historical battery charge chart

**PDU Detail**
- Per-rack power readings
- Alert if any rack exceeds capacity threshold
### Technical
- [ ] PUE calculation: Total Facility Power / IT Equipment Power (computed server-side)
- [ ] Gauge chart component (Recharts RadialBarChart or similar)
- [ ] UPS status card component
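
The PUE computation itself is a one-liner; a guarded sketch:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """PUE = Total Facility Power / IT Equipment Power; 1.0 is the ideal."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment load must be positive")
    return round(total_facility_kw / it_equipment_kw, 3)
```

For example, 150 kW at the facility meter against 100 kW of IT load gives a PUE of 1.5.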
### End of Phase
Full picture of power health. UPS bot scenario (`MAINS_FAILURE`) visibly shows battery rundown on screen.

---
## Phase 6 — Cooling & AI Optimization Panel
**Goal:** Cooling unit visibility plus a simulated AI optimization engine showing energy savings.
### Screens
**Cooling Overview**
- Per-unit status: CRAC/CHILLER name, supply temp, return temp, setpoint, fan speed, state
- Zone temperature vs setpoint comparison
- Cooling efficiency trend

**AI Optimization Panel**
- Toggle: `AI Optimization: ON / OFF`
- When ON: simulated PUE improvement animation, setpoint adjustment suggestions displayed
- Energy savings counter (kWh saved today, this month)
- Simulated recommendation feed: "Raise setpoint in Room 2 by 1°C — estimated 3% saving"
- Before/after PUE comparison chart
### Technical
- [ ] AI optimization is simulated: a backend service generates plausible recommendations based on current temp readings vs setpoints
- [ ] Simple rule engine (if return_temp - supply_temp > X, suggest setpoint raise)
- [ ] Energy savings are calculated from the delta, displayed as a running total
- [ ] This is the layer that gets replaced by a real ML model in production
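
A minimal version of that rule engine might look like this — the 10°C delta-T threshold is an illustrative assumption, and the ~3% figure echoes the sample recommendation above:

```python
def recommend(units: list[dict], max_delta_t: float = 10.0) -> list[str]:
    """Rule-based stand-in for the AI engine: flag cooling units whose
    return/supply delta-T suggests the setpoint could be raised."""
    recs = []
    for u in units:
        delta = u["return_temp"] - u["supply_temp"]
        if delta > max_delta_t:
            recs.append(f"Raise setpoint on {u['name']} by 1°C "
                        f"(delta-T {delta:.1f}°C, estimated ~3% saving)")
    return recs
```

Because the panel only consumes a list of recommendation strings, swapping this loop for a real ML model later changes nothing downstream.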
### End of Phase
The AI panel looks and behaves like a real optimization engine. Recommendations update as conditions change. The CRAC fault scenario visibly impacts the cooling overview.

---
## Phase 7 — Asset Management
**Goal:** Know exactly what hardware is where, and manage capacity.
### Screens
**Rack View**
- Visual U-position diagram for each rack (1U-42U slots)
- Each populated slot shows: device name, type, power draw
- Empty slots shown as available (grey)
- Click device → detail panel (model, serial, IP, status, power)

**Device Inventory**
- Searchable/filterable table of all devices
- Columns: name, type, rack, U-position, IP, status, power draw, install date
- Export to CSV

**Capacity Overview**
- Per-rack: U-space used/total, power used/allocated
- Site-wide capacity summary
- Highlight over-capacity racks
### Technical
- [ ] Rack diagram component — SVG or CSS grid, U-slots rendered from device data
- [ ] Device CRUD endpoints (add/edit/remove devices)
- [ ] Capacity calculation queries
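
The capacity calculation can be sketched as a per-rack summary — the 42U height and the 8 kW power budget are assumed defaults:

```python
def rack_capacity(devices: list[dict], total_u: int = 42,
                  power_budget_kw: float = 8.0) -> dict:
    """Summarise U-space and power utilisation for one rack.

    Each device dict is assumed to carry u_height and power_kw fields.
    """
    used_u = sum(d["u_height"] for d in devices)
    used_kw = sum(d["power_kw"] for d in devices)
    return {
        "u_used": used_u,
        "u_free": total_u - used_u,
        "power_used_kw": round(used_kw, 2),
        "over_capacity": used_u > total_u or used_kw > power_budget_kw,
    }
```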
### End of Phase
You can visually browse every rack, see what's installed where, and identify capacity constraints.

---
## Phase 8 — Alarms & Events
**Goal:** A complete alarm management system — detection, notification, acknowledgement, history.
### Screens
**Active Alarms**
- Live list: severity (Critical / Major / Minor / Info), source, message, time raised
- Acknowledge button per alarm
- Filter by severity, site, room, system type

**Alarm History**
- Searchable log of all past alarms
- Resolution time, acknowledged by, notes

**Alarm Rules** (simple config)
- View and edit threshold rules: e.g. "Rack temp > 30°C = Critical alarm"
### Technical
- [ ] Alarm engine in backend: on each sensor reading, check against thresholds, create alarm if breached, auto-resolve when reading returns to normal
- [ ] Alarm state machine: `ACTIVE → ACKNOWLEDGED → RESOLVED`
- [ ] WebSocket push for new alarms (red badge appears instantly)
- [ ] Email notification hook (stub — wire up SMTP later)
### Scenario Demo
Running `python scenarios/run.py --scenario COOLING_FAILURE --rack A01`:
1. Rack A01 temperature starts rising
2. Warning alarm fires at 28°C
3. Critical alarm fires at 32°C
4. Alarm appears live on dashboard
5. Acknowledge it → status updates
6. Stop scenario → temperature drops → alarm auto-resolves
### End of Phase
Alarm management works end-to-end. Scenarios produce realistic alarm sequences that can be demonstrated live.

---
## Phase 9 — Reporting
**Goal:** Exportable summaries for management, compliance, and capacity planning.
### Screens
**Reports Dashboard**
- Pre-built report types: Energy Summary, Temperature Compliance, Uptime Summary, Capacity Report
- Date range selector
- Chart previews inline

**Report Detail**
- Full chart view
- Key stats summary
- Export to PDF / CSV
### Reports Included
| Report | Content |
|---|---|
| Energy Summary | Total kWh, PUE trend, cost estimate, comparison vs prior period |
| Temperature Compliance | % of time within threshold per rack, worst offenders |
| Uptime & Availability | Alarm frequency, MTTR, critical events |
| Capacity Planning | Space and power utilisation per rack/room, projected headroom |
| Battery Health | UPS SOH trends, recommended replacements |
### Technical
- [ ] Report query endpoints (aggregations over TimescaleDB)
- [ ] Chart components reused from earlier phases
- [ ] PDF export via browser print or a library like `react-pdf`
- [ ] CSV export from table data
### End of Phase
Management-ready reports that look professional and pull from real (simulated) historical data.

---
## Phase 10 — Polish & Production Hardening
**Goal:** Make the system genuinely enterprise-ready — secure, auditable, multi-tenant capable.
### Security
- [ ] Role-based access control: `Admin`, `Operator`, `Read-only`, `Site Manager`
- [ ] Permissions enforced on both frontend routes and backend API endpoints
- [ ] API rate limiting
- [ ] Input validation and sanitisation throughout
- [ ] HTTPS enforced
- [ ] Secrets management (environment variables, never hardcoded)
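
The four roles can be backed by a simple permission map checked on every endpoint — the permission names here are illustrative, and real role claims would come from Clerk:

```python
# Role → permission map (illustrative; real roles come from Clerk claims).
PERMISSIONS = {
    "Admin":        {"read", "write", "acknowledge_alarms", "manage_users"},
    "Site Manager": {"read", "write", "acknowledge_alarms"},
    "Operator":     {"read", "acknowledge_alarms"},
    "Read-only":    {"read"},
}

def require(role: str, permission: str) -> None:
    """Raise if the role lacks the permission; usable from a FastAPI dependency."""
    if permission not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"role {role!r} lacks permission {permission!r}")
```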
### Audit & Compliance
- [ ] Audit log table: every user action recorded (who, what, when, from where)
- [ ] Audit log viewer in admin panel
- [ ] Data retention policy configuration
### Multi-site
- [ ] Site switcher in top bar
- [ ] All queries scoped to selected site
- [ ] Cross-site summary view for administrators
### Operational
- [ ] Health check endpoints
- [ ] Structured logging throughout backend
- [ ] Error boundary handling in frontend
- [ ] Loading and empty states on all screens
- [ ] Mobile-responsive layout (tablet minimum)
### End of Phase
System is ready for a real pilot deployment. Security reviewed, roles working, audit trail intact.

---
## What Comes After (Production Path)
When the mockup phases are complete, these are the additions needed to turn it into a real product:

| Addition | Description |
|---|---|
| Real hardware ingestion | Replace simulator bots with real MQTT/SNMP/Modbus adapters |
| TimescaleDB scaling | Move to managed TimescaleDB cloud or dedicated server |
| Real AI engine | Replace rule-based cooling suggestions with ML model |
| SSO / SAML | Enterprise single sign-on via the auth provider's enterprise tier (the stack uses Clerk) |
| Multi-tenancy | Full data isolation per customer (for SaaS model) |
| Mobile app | React Native app reusing component logic |
| Hardware onboarding | UI for registering new devices and sensors |
| SLA monitoring | Uptime tracking and alerting for contracted SLAs |
The mockup-to-production transition is incremental — each bot gets replaced by real hardware one at a time, with zero changes to the rest of the system.

---
## Summary
- **10 phases**, each with a clear, testable deliverable
- **Simulator bots** make every phase fully demonstrable with realistic data
- **Scenario runner** lets you trigger alarm sequences on demand for demos
- **Production-ready architecture** from day one — no throwaway work
- Real hardware integration is a drop-in replacement when you're ready