# DCIM Platform — Project Plan

**Stack:** Next.js + TypeScript + shadcn/ui | Python FastAPI | PostgreSQL + TimescaleDB | MQTT (Mosquitto) | Clerk Auth

**Approach:** Build as a production shell from day one. Simulated sensor bots feed real data pipelines. Swap bots for real hardware when ready — nothing else changes.

---

## How the Simulator Bots Work

Each bot is a small Python script that behaves exactly like a real physical sensor or device. It generates realistic data with natural variation and drift, then publishes it to the MQTT broker on the same topic a real sensor would use.

```
[Bot: Rack A01 Temp Sensor]  →  MQTT topic: dc/site1/room1/rack-A01/temperature
[Bot: UPS Unit 1]            →  MQTT topic: dc/site1/power/ups-01/status
[Bot: CRAC Unit 2]           →  MQTT topic: dc/site1/cooling/crac-02/status
```

The backend subscribes to those topics and stores the data. The frontend never knows the difference. When real hardware is connected, it publishes to the same topics — bots get switched off, everything else keeps working.
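
The topic convention above is the whole contract between a publisher and the backend. Below is a minimal sketch of that contract. It assumes a JSON payload with `value`, `unit`, and `ts` fields (the plan does not define a payload format) and a hypothetical `publish` callable that stands in for whatever MQTT client a bot would use.

```python
import json
import time
from typing import Callable, Optional

# Hypothetical publish signature: (topic, payload) -> None.
# A real bot would wrap its MQTT client's publish call in this shape.
Publish = Callable[[str, str], None]


def rack_topic(site: str, room: str, rack: str, metric: str) -> str:
    """Build a topic following the dc/<site>/<room>/rack-<id>/<metric> pattern."""
    return f"dc/{site}/{room}/rack-{rack}/{metric}"


def make_payload(value: float, unit: str, ts: Optional[float] = None) -> str:
    """Serialise one reading as JSON (the field names are an assumption)."""
    return json.dumps({
        "value": round(value, 2),
        "unit": unit,
        "ts": ts if ts is not None else time.time(),
    })


def publish_reading(publish: Publish, topic: str, value: float, unit: str) -> None:
    """Send one reading; a bot and a real sensor would look identical here."""
    publish(topic, make_payload(value, unit))
```

Because the publisher is passed in, the same function can be exercised in tests with a plain list-appending lambda and in a bot with a real client.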

Bots can also simulate **events and scenarios** for demo purposes:

- Gradual temperature rise in a rack (simulating cooling failure)
- Power load spike across a PDU
- UPS battery degradation over time
- Water leak alert trigger
- Alarm escalation sequences

---

## Phase Overview

| Phase | Name | Deliverable |
|---|---|---|
| 1 | Foundation | Running skeleton — frontend, backend, auth, DB all connected |
| 2 | Data Pipeline + Bots | MQTT broker, simulator bots, data flowing into DB |
| 3 | Core Dashboard | Live overview dashboard pulling real simulated data |
| 4 | Environmental Monitoring | Temperature/humidity views, heatmaps |
| 5 | Power Monitoring | PDU, UPS, PUE tracking |
| 6 | Cooling & AI Panel | CRAC status, simulated AI optimization |
| 7 | Asset Management | Rack views, device inventory |
| 8 | Alarms & Events | Live alarm feed, acknowledgement, escalation |
| 9 | Reporting | Charts, summaries, export |
| 10 | Polish & Hardening | RBAC, audit log, multi-site, production readiness |

---

## Phase 1 — Foundation

**Goal:** Every layer of the stack is running and connected. No real features yet, just a working skeleton.

### Tasks
- [ ] Initialise Next.js project with TypeScript
- [ ] Set up shadcn/ui component library
- [ ] Set up Python FastAPI project structure
- [ ] Connect Clerk authentication (login, logout, protected routes)
- [ ] Provision PostgreSQL database (local via Docker)
- [ ] Basic API route: frontend calls backend, gets a response
- [ ] Docker Compose file running: frontend + backend + database together
- [ ] Placeholder layout: sidebar nav, top bar, main content area

### Folder Structure
```
/dcim
  /frontend          ← Next.js app
    /app
    /components
    /lib
  /backend           ← FastAPI app
    /api
    /models
    /services
  /simulators        ← Sensor bots (Python scripts)
  /infra             ← Docker Compose, config files
  docker-compose.yml
```

### End of Phase
You can log in and see a blank dashboard shell, the backend responds to API calls, and the database is running.

---

## Phase 2 — Data Pipeline & Simulator Bots

**Goal:** Simulated sensor data flows continuously from bots → MQTT → backend → database. The system behaves as if real hardware is connected.

### Infrastructure
- [ ] Add Mosquitto MQTT broker to Docker Compose
- [ ] Add TimescaleDB extension to PostgreSQL
- [ ] Create hypertable for sensor readings (time-series optimised)

### Database Schema (core tables)
```
sites       — site name, location, timezone
rooms       — belongs to site, physical room
racks       — belongs to room, U-height, position
devices     — belongs to rack, type, model, serial
sensors     — belongs to device or rack, sensor type
readings    — (TimescaleDB hypertable) sensor_id, timestamp, value
alarms      — sensor_id, severity, message, state, acknowledged_at
```

### Backend Data Ingestion
- [ ] MQTT subscriber service (Python) — listens to all sensor topics
- [ ] Parses incoming messages, validates, writes to `readings` table
- [ ] WebSocket endpoint — streams latest readings to connected frontends
- [ ] REST endpoints — historical data queries, aggregations
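
The parse-and-validate step is where malformed messages should be stopped before they reach the `readings` table. The sketch below shows one way to do it. It assumes a JSON payload with numeric `value` and `ts` fields, and it uses the topic string as a stand-in sensor identifier; both are assumptions, since the plan does not fix a payload format or a topic-to-`sensor_id` mapping.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    sensor_key: str   # topic used as a stand-in sensor identifier (assumption)
    value: float
    ts: float


def parse_message(topic: str, payload: bytes) -> Reading:
    """Validate one MQTT message and turn it into a candidate `readings` row.

    Raises ValueError for anything that should not reach the database.
    """
    if not topic.startswith("dc/"):
        raise ValueError(f"unexpected topic: {topic}")
    try:
        data = json.loads(payload)
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        raise ValueError("payload is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("payload must be a JSON object")
    value, ts = data.get("value"), data.get("ts")
    # bool is a subclass of int in Python, so reject it explicitly.
    for name, field in (("value", value), ("ts", ts)):
        if isinstance(field, bool) or not isinstance(field, (int, float)):
            raise ValueError(f"missing or non-numeric field: {name}")
    return Reading(sensor_key=topic, value=float(value), ts=float(ts))
```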

### Simulator Bots

Each bot runs as an independent Python process, configurable via a simple config file.

**Bot: Temperature/Humidity (per rack)**
- Publishes every 30 seconds
- Base temperature: 22–26°C with ±0.5°C natural drift
- Humidity: 40–55% RH with slow drift
- Scenario: `COOLING_FAILURE` — temperature rises 0.3°C/min until alarm threshold
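
A sketch of how the temperature bot's value generator might look under these numbers. Several details are assumptions rather than part of the plan: the natural drift is modelled as bounded random noise, the ramp stops at an assumed 32°C ceiling (the critical threshold used in the Phase 8 demo), and recovery after the scenario ends cools at the same rate as the failure heats.

```python
import random
from typing import Optional


class RackTemperatureModel:
    """Produces one temperature value per publish tick for a single rack."""

    def __init__(self, base_c: float, interval_s: float = 30.0,
                 ceiling_c: float = 32.0, seed: Optional[int] = None):
        self.base_c = base_c            # plan: somewhere in 22-26 °C
        self.interval_s = interval_s    # plan: publish every 30 s
        self.ceiling_c = ceiling_c      # assumed threshold where the ramp stops
        self.offset_c = 0.0             # heat added by the scenario so far
        self.cooling_failure = False    # set True to start COOLING_FAILURE
        self._rng = random.Random(seed)

    def next_value(self) -> float:
        # 0.3 °C per minute, scaled to the publish interval.
        step = 0.3 * self.interval_s / 60.0
        if self.cooling_failure:
            max_offset = max(0.0, self.ceiling_c - self.base_c)
            self.offset_c = min(self.offset_c + step, max_offset)
        else:
            # Assumption: recovery cools at the same rate as the failure heats.
            self.offset_c = max(0.0, self.offset_c - step)
        noise = self._rng.uniform(-0.5, 0.5)  # stands in for natural drift
        return round(self.base_c + self.offset_c + noise, 2)
```

At a 30-second interval the scenario adds 0.15°C per tick, so a rack starting at 24°C would take roughly 27 minutes to climb 8°C to the assumed ceiling.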

**Bot: PDU Power Monitor (per rack)**
- Publishes every 60 seconds
- Load: 2–8 kW per rack, fluctuates with simulated workload patterns
- Simulates day/night load patterns (higher load 9am–6pm)
- Scenario: `POWER_SPIKE` — sudden 40% load increase
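
The deterministic part of that load curve could be sketched as follows. The baseline and daytime uplift values are illustrative assumptions chosen to stay inside the 2–8 kW range; a real bot would add random fluctuation on top.

```python
def pdu_load_kw(hour: float, baseline_kw: float = 3.0,
                daytime_extra_kw: float = 2.0, spike: bool = False) -> float:
    """Return the rack load in kW for a given hour of the day (0-24).

    baseline_kw and daytime_extra_kw are illustrative defaults, not values
    taken from the plan.
    """
    # Higher load between 9am and 6pm.
    load = baseline_kw + (daytime_extra_kw if 9 <= hour < 18 else 0.0)
    if spike:
        load *= 1.4  # POWER_SPIKE: sudden 40% increase
    return load
```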

**Bot: UPS Unit**
- Publishes every 60 seconds
- Input/output voltage, load percentage, battery charge, runtime estimate
- Battery health (SOH) degrades slowly over simulated time
- Scenario: `MAINS_FAILURE` — switches to battery, runtime counts down
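
The `MAINS_FAILURE` countdown can be illustrated with a small state object. This is only a sketch: the 30-minute full runtime and the linear relationship between charge and remaining runtime are assumptions, and voltages and SOH are left out.

```python
from dataclasses import dataclass


@dataclass
class UpsState:
    on_battery: bool = False          # set True to start MAINS_FAILURE
    charge_pct: float = 100.0
    full_runtime_min: float = 30.0    # assumed runtime at 100% charge

    @property
    def status(self) -> str:
        return "On Battery" if self.on_battery else "Online"

    @property
    def runtime_min(self) -> float:
        """Remaining runtime, assumed proportional to charge."""
        return self.full_runtime_min * self.charge_pct / 100.0

    def tick(self, interval_s: float = 60.0) -> None:
        """Advance one publish interval; the battery drains only on battery."""
        if self.on_battery:
            drained = 100.0 * (interval_s / 60.0) / self.full_runtime_min
            self.charge_pct = max(0.0, self.charge_pct - drained)
```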

**Bot: CRAC/Cooling Unit**
- Publishes every 60 seconds
- Supply/return air temperature, setpoint, fan speed, compressor state
- Responds to rack temperature increases (simulated feedback loop)
- Scenario: `UNIT_FAULT` — unit goes offline, temperature in zone starts rising

**Bot: Water Leak Sensor**
- Normally silent (no leak)
- Scenario: `LEAK_DETECTED` — publishes alert, alarm triggers

**Bot: Battery Cell Monitor**
- Cell voltages, internal resistance per cell
- Scenario: `CELL_DEGRADATION` — one cell's resistance rises, SOH drops

### Scenario Runner
- [ ] Simple CLI script: `python scenarios/run.py --scenario COOLING_FAILURE --rack A01`
- [ ] Useful for demos — trigger realistic alarm sequences on demand
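
The command-line surface of such a runner could be sketched with `argparse`. The scenario names are the ones listed for the bots above; the `to_command` helper and the shape of the command it returns are hypothetical, and how the command reaches a running bot is deliberately left open.

```python
import argparse

# Scenario names taken from the bot descriptions in this phase.
SCENARIOS = (
    "COOLING_FAILURE", "POWER_SPIKE", "MAINS_FAILURE",
    "UNIT_FAULT", "LEAK_DETECTED", "CELL_DEGRADATION",
)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="run.py", description="Trigger a simulator scenario.")
    parser.add_argument("--scenario", required=True, choices=SCENARIOS)
    parser.add_argument("--rack", help="target rack, e.g. A01")
    return parser


def to_command(args: argparse.Namespace) -> dict:
    """Hypothetical command structure a bot could act on."""
    return {"scenario": args.scenario, "rack": args.rack}
```

Restricting `--scenario` to a fixed set of choices means a typo fails immediately with a usage message instead of silently doing nothing during a demo.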

### End of Phase
Bots are running, data is flowing into the database every 30–60 seconds across a simulated data center with ~20 racks, 2 rooms, 1 site. You can query the database and see live readings.

---

## Phase 3 — Core Dashboard

**Goal:** The main overview screen, live-updating, showing the health of the entire facility at a glance.

### Screens
**Site Overview Dashboard**
- Facility health score (aggregate status)
- KPI cards: Total Power (kW), PUE, Avg Temperature, Active Alarms count
- Active alarm feed (live, colour-coded by severity)
- Power trend chart (last 24 hours)
- Temperature trend chart (last 24 hours)
- Room status summary (green/amber/red per room)

### Technical
- [ ] WebSocket hook in frontend — subscribes to live data stream
- [ ] KPI card component (value, trend arrow, threshold colour)
- [ ] Live-updating line chart component (Recharts)
- [ ] Alarm badge component
- [ ] Auto-refresh every 30 seconds as fallback

### End of Phase
Opening the dashboard shows a live, moving picture of the simulated data center. Numbers change in real time. Alarms appear when bots trigger scenarios.

---

## Phase 4 — Environmental Monitoring

**Goal:** Deep visibility into temperature and humidity across all zones and racks.

### Screens
**Environmental Overview**
- Room selector (dropdown or tab)
- Floor plan / rack layout — each rack colour-coded by temperature (cool blue → hot red)
- Click a rack → side panel showing temp/humidity chart for last 24h
- Hot/cold aisle average temperatures

**Rack Detail Panel**
- Temperature trend (line chart, last 24h / 7d selectable)
- Humidity trend
- Current reading with timestamp
- Threshold indicators (warning / critical bands shown on chart)

### Technical
- [ ] SVG floor plan component — racks as rectangles, colour interpolated from temp value
- [ ] Historical data endpoint: `GET /api/sensors/{id}/readings?from=&to=&interval=`
- [ ] Threshold configuration stored in DB, compared on ingest
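
The `interval` parameter on the historical endpoint implies downsampling readings into fixed-width time buckets. In TimescaleDB this would typically be done in SQL with the `time_bucket` function; the pure-Python sketch below only illustrates the intended semantics, assuming readings arrive as `(timestamp_seconds, value)` pairs and that buckets are averaged.

```python
from collections import defaultdict
from statistics import fmean


def bucket_average(readings, interval_s):
    """Average (ts, value) pairs into fixed-width buckets aligned to the epoch.

    Returns a list of (bucket_start_ts, mean_value) sorted by time.
    Buckets with no readings are omitted.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    buckets = defaultdict(list)
    for ts, value in readings:
        bucket_start = int(ts // interval_s) * interval_s
        buckets[bucket_start].append(value)
    return [(start, fmean(vals)) for start, vals in sorted(buckets.items())]
```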

### End of Phase
You can see exactly which racks are running hot. Heatmap updates live. Clicking a rack shows its history.

---

## Phase 5 — Power Monitoring

**Goal:** Full visibility into power consumption, distribution, and UPS health.

### Screens
**Power Overview**
- Total facility power (kW) — live gauge
- PUE metric (Power Usage Effectiveness) — live, with trend
- PDU breakdown — per-rack load as a bar chart
- Power trend — last 24 hours area chart

**UPS Status Panel**
- Per-unit: input voltage, output voltage, load %, battery charge %, estimated runtime
- Battery health (SOH) indicator
- Status badge (Online / On Battery / Fault)
- Historical battery charge chart

**PDU Detail**
- Per-rack power readings
- Alert if any rack exceeds capacity threshold

### Technical
- [ ] PUE calculation: Total Facility Power / IT Equipment Power (computed server-side)
- [ ] Gauge chart component (Recharts RadialBarChart or similar)
- [ ] UPS status card component
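
The PUE formula is a single division, but the server-side version needs a guard for the case where no IT load has been reported yet. A minimal sketch, where returning `None` for a non-positive IT load is an assumption about how the dashboard would want to treat missing data:

```python
from typing import Optional


def compute_pue(total_facility_kw: float, it_equipment_kw: float) -> Optional[float]:
    """PUE = total facility power / IT equipment power.

    Returns None when the IT load is zero or negative, since the ratio
    is undefined there.
    """
    if it_equipment_kw <= 0:
        return None
    return total_facility_kw / it_equipment_kw
```

For example, a facility drawing 150 kW in total with 100 kW of IT load has a PUE of 1.5; the remaining 50 kW goes to cooling, power conversion losses, and other overhead.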

### End of Phase
Full picture of power health. UPS bot scenario (`MAINS_FAILURE`) visibly shows battery rundown on screen.

---

## Phase 6 — Cooling & AI Optimization Panel

**Goal:** Cooling unit visibility plus a simulated AI optimization engine showing energy savings.

### Screens
**Cooling Overview**
- Per-unit status: CRAC/CHILLER name, supply temp, return temp, setpoint, fan speed, state
- Zone temperature vs setpoint comparison
- Cooling efficiency trend

**AI Optimization Panel**
- Toggle: `AI Optimization: ON / OFF`
- When ON: simulated PUE improvement animation, setpoint adjustment suggestions displayed
- Energy savings counter (kWh saved today, this month)
- Simulated recommendation feed: "Raise setpoint in Room 2 by 1°C — estimated 3% saving"
- Before/after PUE comparison chart

### Technical
- [ ] AI optimization is simulated: a backend service generates plausible recommendations based on current temp readings vs setpoints
- [ ] Simple rule engine (if return_temp - supply_temp > X, suggest setpoint raise)
- [ ] Energy savings are calculated from the delta, displayed as a running total
- [ ] This is the layer that gets replaced by a real ML model in production
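
The single rule named above can be sketched as a pure function. The threshold `X` is not specified in the plan, so the 10°C default here is a placeholder; the 1°C step and 3% saving per degree are borrowed from the example recommendation text and are equally illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CracReading:
    unit: str
    supply_c: float
    return_c: float
    setpoint_c: float


def recommend(reading: CracReading, delta_threshold_c: float = 10.0,
              step_c: float = 1.0, saving_pct_per_c: float = 3.0) -> Optional[dict]:
    """Apply the rule from the plan; return None when it does not fire.

    All three tuning parameters are placeholder assumptions.
    """
    delta = reading.return_c - reading.supply_c
    if delta <= delta_threshold_c:
        return None
    saving = saving_pct_per_c * step_c
    return {
        "unit": reading.unit,
        "new_setpoint_c": reading.setpoint_c + step_c,
        "estimated_saving_pct": saving,
        "message": f"Raise setpoint on {reading.unit} by {step_c:g}°C, "
                   f"estimated {saving:g}% saving",
    }
```

Keeping the rule behind one function signature is what would let a real model replace it later without touching the recommendation feed.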

### End of Phase
The AI panel looks and behaves like a real optimization engine. Recommendations update as conditions change. The CRAC fault scenario visibly impacts the cooling overview.

---

## Phase 7 — Asset Management

**Goal:** Know exactly what hardware is where, and manage capacity.

### Screens
**Rack View**
- Visual U-position diagram for each rack (1U–42U slots)
- Each populated slot shows: device name, type, power draw
- Empty slots shown as available (grey)
- Click device → detail panel (model, serial, IP, status, power)

**Device Inventory**
- Searchable/filterable table of all devices
- Columns: name, type, rack, U-position, IP, status, power draw, install date
- Export to CSV

**Capacity Overview**
- Per-rack: U-space used/total, power used/allocated
- Site-wide capacity summary
- Highlight over-capacity racks

### Technical
- [ ] Rack diagram component — SVG or CSS grid, U-slots rendered from device data
- [ ] Device CRUD endpoints (add/edit/remove devices)
- [ ] Capacity calculation queries
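
The per-rack capacity figures reduce to two sums and a comparison. The sketch below shows the calculation in plain Python rather than SQL; the `Device` fields, the 42U default (from the rack view above), and the 8 kW power allocation (the upper end of the simulated PDU load range) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Device:
    name: str
    u_height: int      # rack units occupied
    power_w: float     # current power draw in watts


def rack_capacity(devices: Iterable[Device], total_u: int = 42,
                  allocated_w: float = 8000.0) -> dict:
    """Summarise space and power use for one rack."""
    devices = list(devices)
    used_u = sum(d.u_height for d in devices)
    used_w = sum(d.power_w for d in devices)
    return {
        "u_used": used_u,
        "u_total": total_u,
        "u_free": total_u - used_u,
        "power_used_w": used_w,
        "power_allocated_w": allocated_w,
        # Flag racks that exceed either limit.
        "over_capacity": used_u > total_u or used_w > allocated_w,
    }
```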

### End of Phase
You can visually browse every rack, see what's installed where, and identify capacity constraints.

---

## Phase 8 — Alarms & Events

**Goal:** A complete alarm management system — detection, notification, acknowledgement, history.

### Screens
**Active Alarms**
- Live list: severity (Critical / Major / Minor / Info), source, message, time raised
- Acknowledge button per alarm
- Filter by severity, site, room, system type

**Alarm History**
- Searchable log of all past alarms
- Resolution time, acknowledged by, notes

**Alarm Rules** (simple config)
- View and edit threshold rules: e.g. "Rack temp > 30°C = Critical alarm"

### Technical
- [ ] Alarm engine in backend: on each sensor reading, check against thresholds, create alarm if breached, auto-resolve when reading returns to normal
- [ ] Alarm state machine: `ACTIVE → ACKNOWLEDGED → RESOLVED`
- [ ] WebSocket push for new alarms (red badge appears instantly)
- [ ] Email notification hook (stub — wire up SMTP later)
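
The threshold check and the state machine could be sketched as below. Two things here are assumptions: the direct `ACTIVE → RESOLVED` transition, added so an alarm can auto-resolve before anyone acknowledges it, and the 28°C / 32°C defaults with "Warning" and "Critical" labels, which mirror the scenario demo in this phase rather than a defined rule set.

```python
from enum import Enum
from typing import Optional


class AlarmState(Enum):
    ACTIVE = "ACTIVE"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    RESOLVED = "RESOLVED"


# Allowed transitions. ACTIVE -> RESOLVED is an assumption that permits
# auto-resolve without a prior acknowledgement.
TRANSITIONS = {
    AlarmState.ACTIVE: {AlarmState.ACKNOWLEDGED, AlarmState.RESOLVED},
    AlarmState.ACKNOWLEDGED: {AlarmState.RESOLVED},
    AlarmState.RESOLVED: set(),
}


class Alarm:
    def __init__(self, sensor_id: str, severity: str, message: str):
        self.sensor_id = sensor_id
        self.severity = severity
        self.message = message
        self.state = AlarmState.ACTIVE

    def transition(self, new_state: AlarmState) -> None:
        """Move to new_state, rejecting transitions the machine does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state


def evaluate(value_c: float, warning_c: float = 28.0,
             critical_c: float = 32.0) -> Optional[str]:
    """Return the severity a reading triggers, or None when within limits."""
    if value_c >= critical_c:
        return "Critical"
    if value_c >= warning_c:
        return "Warning"
    return None
```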

### Scenario Demo
Running `python scenarios/run.py --scenario COOLING_FAILURE --rack A01`:
1. Rack A01 temperature starts rising
2. Warning alarm fires at 28°C
3. Critical alarm fires at 32°C
4. Alarm appears live on dashboard
5. Acknowledge it → status updates
6. Stop scenario → temperature drops → alarm auto-resolves

### End of Phase
Alarm management works end-to-end. Scenarios produce realistic alarm sequences that can be demonstrated live.

---

## Phase 9 — Reporting

**Goal:** Exportable summaries for management, compliance, and capacity planning.

### Screens
**Reports Dashboard**
- Pre-built report types: Energy Summary, Temperature Compliance, Uptime & Availability, Capacity Planning, Battery Health
- Date range selector
- Chart previews inline

**Report Detail**
- Full chart view
- Key stats summary
- Export to PDF / CSV

### Reports Included

| Report | Content |
|---|---|
| Energy Summary | Total kWh, PUE trend, cost estimate, comparison vs prior period |
| Temperature Compliance | % of time within threshold per rack, worst offenders |
| Uptime & Availability | Alarm frequency, MTTR, critical events |
| Capacity Planning | Space and power utilisation per rack/room, projected headroom |
| Battery Health | UPS SOH trends, recommended replacements |

### Technical
- [ ] Report query endpoints (aggregations over TimescaleDB)
- [ ] Chart components reused from earlier phases
- [ ] PDF export via browser print or a library like `react-pdf`
- [ ] CSV export from table data
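
As one example of a report aggregation, the Temperature Compliance figure can be approximated from the readings themselves. The sketch assumes readings are evenly spaced, so the share of readings inside the band stands in for the share of time; the band limits are passed in by the caller and are not taken from the plan.

```python
from typing import Optional, Sequence


def compliance_pct(values_c: Sequence[float], low_c: float,
                   high_c: float) -> Optional[float]:
    """Percentage of readings within [low_c, high_c].

    Returns None when there are no readings for the period.
    """
    if not values_c:
        return None
    within = sum(1 for v in values_c if low_c <= v <= high_c)
    return 100.0 * within / len(values_c)
```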

### End of Phase
Management-ready reports that look professional and pull from real (simulated) historical data.

---

## Phase 10 — Polish & Production Hardening

**Goal:** Make the system genuinely enterprise-ready — secure, auditable, multi-site capable.

### Security
- [ ] Role-based access control: `Admin`, `Operator`, `Read-only`, `Site Manager`
- [ ] Permissions enforced on both frontend routes and backend API endpoints
- [ ] API rate limiting
- [ ] Input validation and sanitisation throughout
- [ ] HTTPS enforced
- [ ] Secrets management (environment variables, never hardcoded)

### Audit & Compliance
- [ ] Audit log table: every user action recorded (who, what, when, from where)
- [ ] Audit log viewer in admin panel
- [ ] Data retention policy configuration

### Multi-site
- [ ] Site switcher in top bar
- [ ] All queries scoped to selected site
- [ ] Cross-site summary view for administrators

### Operational
- [ ] Health check endpoints
- [ ] Structured logging throughout backend
- [ ] Error boundary handling in frontend
- [ ] Loading and empty states on all screens
- [ ] Mobile-responsive layout (tablet minimum)

### End of Phase
System is ready for a real pilot deployment. Security reviewed, roles working, audit trail intact.

---

## What Comes After (Production Path)

When the mockup phases are complete, these are the additions needed to turn it into a real product:

| Addition | Description |
|---|---|
| Real hardware ingestion | Replace simulator bots with real MQTT/SNMP/Modbus adapters |
| TimescaleDB scaling | Move to managed TimescaleDB cloud or dedicated server |
| Real AI engine | Replace rule-based cooling suggestions with ML model |
| SSO / SAML | Enterprise single sign-on via the auth provider's enterprise tier (Clerk in the current stack) |
| Multi-tenancy | Full data isolation per customer (for SaaS model) |
| Mobile app | React Native app reusing component logic |
| Hardware onboarding | UI for registering new devices and sensors |
| SLA monitoring | Uptime tracking and alerting for contracted SLAs |

The mockup-to-production transition is incremental — each bot gets replaced by real hardware one at a time, with zero changes to the rest of the system.

---

## Summary

- **10 phases**, each with a clear, testable deliverable
- **Simulator bots** make every phase fully demonstrable with realistic data
- **Scenario runner** lets you trigger alarm sequences on demand for demos
- **Production-ready architecture** from day one — no throwaway work
- Real hardware integration is a drop-in replacement when you're ready