# DCIM Platform — Project Plan

**Stack:** Next.js + TypeScript + shadcn/ui | Python FastAPI | PostgreSQL + TimescaleDB | MQTT (Mosquitto) | Clerk Auth

**Approach:** Build as a production shell from day one. Simulated sensor bots feed real data pipelines. Swap bots for real hardware when ready — nothing else changes.

---

## How the Simulator Bots Work

Each bot is a small Python script that behaves exactly like a real physical sensor or device. It generates realistic data with natural variation and drift, then publishes it to the MQTT broker on the same topic a real sensor would use.

```
[Bot: Rack A01 Temp Sensor]  → MQTT topic: dc/site1/room1/rack-A01/temperature
[Bot: UPS Unit 1]            → MQTT topic: dc/site1/power/ups-01/status
[Bot: CRAC Unit 2]           → MQTT topic: dc/site1/cooling/crac-02/status
```

The backend subscribes to those topics and stores the data. The frontend never knows the difference. When real hardware is connected, it publishes to the same topics — bots get switched off, everything else keeps working.
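
A bot of this kind can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the class name `TemperatureBot`, the JSON payload fields, and the mean-reverting drift model are hypothetical, and `publish` is a stand-in callable for whatever MQTT client is eventually used, so no broker is needed to run it.

```python
import json
import random
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TemperatureBot:
    """Simulated rack temperature sensor (hypothetical helper, not a library class)."""

    site: str
    room: str
    rack: str
    # Stand-in for an MQTT client's publish(topic, payload) call.
    publish: Callable[[str, str], None]
    base_c: float = 24.0     # assumed resting temperature
    max_step_c: float = 0.5  # assumed maximum random drift per reading
    rng: random.Random = field(default_factory=random.Random)
    current_c: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.current_c = self.base_c

    @property
    def topic(self) -> str:
        # Same topic layout as the rack temperature example above.
        return f"dc/{self.site}/{self.room}/rack-{self.rack}/temperature"

    def step(self) -> dict:
        """Advance the simulation by one reading and publish it."""
        # Random walk that is pulled gently back towards the base temperature.
        pull = (self.base_c - self.current_c) * 0.1
        self.current_c += pull + self.rng.uniform(-self.max_step_c, self.max_step_c)
        reading = {"value": round(self.current_c, 2), "unit": "C", "ts": time.time()}
        self.publish(self.topic, json.dumps(reading))
        return reading


if __name__ == "__main__":
    # Print instead of publishing, so the sketch runs without a broker.
    bot = TemperatureBot("site1", "room1", "A01", publish=lambda t, p: print(t, p))
    for _ in range(3):
        bot.step()
```

Because the bot only depends on a `publish(topic, payload)` callable, swapping the print stub for a real MQTT client should not require touching the simulation logic.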
Bots can also simulate **events and scenarios** for demo purposes:

- Gradual temperature rise in a rack (simulating cooling failure)
- Power load spike across a PDU
- UPS battery degradation over time
- Water leak alert trigger
- Alarm escalation sequences

---

## Phase Overview

| Phase | Name | Deliverable |
|---|---|---|
| 1 | Foundation | Running skeleton — frontend, backend, auth, DB all connected |
| 2 | Data Pipeline + Bots | MQTT broker, simulator bots, data flowing into DB |
| 3 | Core Dashboard | Live overview dashboard pulling real simulated data |
| 4 | Environmental Monitoring | Temperature/humidity views, heatmaps |
| 5 | Power Monitoring | PDU, UPS, PUE tracking |
| 6 | Cooling & AI Panel | CRAC status, simulated AI optimization |
| 7 | Asset Management | Rack views, device inventory |
| 8 | Alarms & Events | Live alarm feed, acknowledgement, escalation |
| 9 | Reporting | Charts, summaries, export |
| 10 | Polish & Hardening | RBAC, audit log, multi-site, production readiness |

---

## Phase 1 — Foundation

**Goal:** Every layer of the stack is running and connected. No real features yet, just a working skeleton.

### Tasks

- [ ] Initialise Next.js project with TypeScript
- [ ] Set up shadcn/ui component library
- [ ] Set up Python FastAPI project structure
- [ ] Connect Clerk authentication (login, logout, protected routes)
- [ ] Provision PostgreSQL database (local via Docker)
- [ ] Basic API route: frontend calls backend, gets a response
- [ ] Docker Compose file running: frontend + backend + database together
- [ ] Placeholder layout: sidebar nav, top bar, main content area

### Folder Structure

```
/dcim
  /frontend          ← Next.js app
    /app
    /components
    /lib
  /backend           ← FastAPI app
    /api
    /models
    /services
  /simulators        ← Sensor bots (Python scripts)
  /infra             ← Docker Compose, config files
  docker-compose.yml
```

### End of Phase

You can log in, see a blank dashboard shell, and the backend responds to API calls. Database is running.

---

## Phase 2 — Data Pipeline & Simulator Bots

**Goal:** Simulated sensor data flows continuously from bots → MQTT → backend → database. The system behaves as if real hardware is connected.

### Infrastructure

- [ ] Add Mosquitto MQTT broker to Docker Compose
- [ ] Add TimescaleDB extension to PostgreSQL
- [ ] Create hypertable for sensor readings (time-series optimised)

### Database Schema (core tables)

```
sites      — site name, location, timezone
rooms      — belongs to site, physical room
racks      — belongs to room, U-height, position
devices    — belongs to rack, type, model, serial
sensors    — belongs to device or rack, sensor type
readings   — (TimescaleDB hypertable) sensor_id, timestamp, value
alarms     — sensor_id, severity, message, state, acknowledged_at
```

### Backend Data Ingestion

- [ ] MQTT subscriber service (Python) — listens to all sensor topics
- [ ] Parses incoming messages, validates, writes to `readings` table
- [ ] WebSocket endpoint — streams latest readings to connected frontends
- [ ] REST endpoints — historical data queries, aggregations

### Simulator Bots

Each bot runs as an independent Python process, configurable via a simple config file.
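
As one possible shape for that config file, the sketch below assumes JSON and a hypothetical `BotConfig` dataclass. The field names (`kind`, `topic`, `interval_s`, `params`) and the `load_bot_configs` helper are illustrative assumptions; the plan itself does not prescribe a format.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BotConfig:
    """One simulated device (hypothetical structure, for illustration only)."""

    kind: str        # e.g. "temperature", "pdu", "ups"
    topic: str       # MQTT topic the bot publishes to
    interval_s: int  # seconds between readings
    params: dict = field(default_factory=dict)  # bot-specific tuning values


def load_bot_configs(text: str) -> list[BotConfig]:
    """Parse a JSON document of the form {"bots": [...]} into BotConfig objects."""
    raw = json.loads(text)
    configs = []
    for entry in raw["bots"]:
        if entry["interval_s"] <= 0:
            raise ValueError(f"interval_s must be positive for {entry['topic']}")
        configs.append(
            BotConfig(
                kind=entry["kind"],
                topic=entry["topic"],
                interval_s=entry["interval_s"],
                params=entry.get("params", {}),
            )
        )
    return configs


# Example config; the parameter names inside "params" are assumptions.
EXAMPLE = """
{
  "bots": [
    {"kind": "temperature", "topic": "dc/site1/room1/rack-A01/temperature",
     "interval_s": 30, "params": {"base_c": 24.0, "drift_c": 0.5}},
    {"kind": "ups", "topic": "dc/site1/power/ups-01/status", "interval_s": 60}
  ]
}
"""

if __name__ == "__main__":
    for cfg in load_bot_configs(EXAMPLE):
        print(cfg.kind, cfg.topic, cfg.interval_s)
```

Keeping the topic in the config rather than in the bot code means a real device can take over a topic simply by removing the matching entry.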

**Bot: Temperature/Humidity (per rack)**

- Publishes every 30 seconds
- Base temperature: 22–26°C with ±0.5°C natural drift
- Humidity: 40–55% RH with slow drift
- Scenario: `COOLING_FAILURE` — temperature rises 0.3°C/min until alarm threshold

**Bot: PDU Power Monitor (per rack)**

- Publishes every 60 seconds
- Load: 2–8 kW per rack, fluctuates with simulated workload patterns
- Simulates day/night load patterns (higher load 9am–6pm)
- Scenario: `POWER_SPIKE` — sudden 40% load increase

**Bot: UPS Unit**

- Publishes every 60 seconds
- Input/output voltage, load percentage, battery charge, runtime estimate
- Battery health (SOH) degrades slowly over simulated time
- Scenario: `MAINS_FAILURE` — switches to battery, runtime counts down

**Bot: CRAC/Cooling Unit**

- Publishes every 60 seconds
- Supply/return air temperature, setpoint, fan speed, compressor state
- Responds to rack temperature increases (simulated feedback loop)
- Scenario: `UNIT_FAULT` — unit goes offline, temperature in zone starts rising

**Bot: Water Leak Sensor**

- Normally silent (no leak)
- Scenario: `LEAK_DETECTED` — publishes alert, alarm triggers

**Bot: Battery Cell Monitor**

- Cell voltages, internal resistance per cell
- Scenario: `CELL_DEGRADATION` — one cell's resistance rises, SOH drops

### Scenario Runner

- [ ] Simple CLI script: `python scenarios/run.py --scenario COOLING_FAILURE --rack A01`
- [ ] Useful for demos — trigger realistic alarm sequences on demand

### End of Phase

Bots are running, data is flowing into the database every 30–60 seconds across a simulated data center with ~20 racks, 2 rooms, 1 site. You can query the database and see live readings.

---

## Phase 3 — Core Dashboard

**Goal:** The main overview screen, live-updating, showing the health of the entire facility at a glance.

### Screens

**Site Overview Dashboard**

- Facility health score (aggregate status)
- KPI cards: Total Power (kW), PUE, Avg Temperature, Active Alarms count
- Active alarm feed (live, colour-coded by severity)
- Power trend chart (last 24 hours)
- Temperature trend chart (last 24 hours)
- Room status summary (green/amber/red per room)

### Technical

- [ ] WebSocket hook in frontend — subscribes to live data stream
- [ ] KPI card component (value, trend arrow, threshold colour)
- [ ] Live-updating line chart component (Recharts)
- [ ] Alarm badge component
- [ ] Auto-refresh every 30 seconds as fallback

### End of Phase

Opening the dashboard shows a live, moving picture of the simulated data center. Numbers change in real time. Alarms appear when bots trigger scenarios.

---

## Phase 4 — Environmental Monitoring

**Goal:** Deep visibility into temperature and humidity across all zones and racks.

### Screens

**Environmental Overview**

- Room selector (dropdown or tab)
- Floor plan / rack layout — each rack colour-coded by temperature (cool blue → hot red)
- Click a rack → side panel showing temp/humidity chart for last 24h
- Hot/cold aisle average temperatures

**Rack Detail Panel**

- Temperature trend (line chart, last 24h / 7d selectable)
- Humidity trend
- Current reading with timestamp
- Threshold indicators (warning / critical bands shown on chart)

### Technical

- [ ] SVG floor plan component — racks as rectangles, colour interpolated from temp value
- [ ] Historical data endpoint: `GET /api/sensors/{id}/readings?from=&to=&interval=`
- [ ] Threshold configuration stored in DB, compared on ingest

### End of Phase

You can see exactly which racks are running hot. Heatmap updates live. Clicking a rack shows its history.

---

## Phase 5 — Power Monitoring

**Goal:** Full visibility into power consumption, distribution, and UPS health.
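
PUE (Power Usage Effectiveness) is the headline metric for this phase: total facility power divided by IT equipment power. A minimal sketch of that calculation, assuming both inputs are in kW; the function name `compute_pue` and the decision to return `None` when no IT load is reported are illustrative assumptions.

```python
from typing import Optional


def compute_pue(total_facility_kw: float, it_equipment_kw: float) -> Optional[float]:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        # No IT load reported: PUE is undefined, so signal "no value".
        return None
    return total_facility_kw / it_equipment_kw


if __name__ == "__main__":
    # Made-up figures: 20 racks at 5 kW each, plus 40 kW of cooling and other overhead.
    it_kw = 20 * 5.0
    print(compute_pue(it_kw + 40.0, it_kw))
```

A value of 1.0 would mean every watt reaches IT equipment, so a lower PUE indicates less overhead from cooling and power distribution.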

### Screens

**Power Overview**

- Total facility power (kW) — live gauge
- PUE metric (Power Usage Effectiveness) — live, with trend
- PDU breakdown — per-rack load as a bar chart
- Power trend — last 24 hours area chart

**UPS Status Panel**

- Per-unit: input voltage, output voltage, load %, battery charge %, estimated runtime
- Battery health (SOH) indicator
- Status badge (Online / On Battery / Fault)
- Historical battery charge chart

**PDU Detail**

- Per-rack power readings
- Alert if any rack exceeds capacity threshold

### Technical

- [ ] PUE calculation: Total Facility Power / IT Equipment Power (computed server-side)
- [ ] Gauge chart component (Recharts RadialBarChart or similar)
- [ ] UPS status card component

### End of Phase

Full picture of power health. UPS bot scenario (`MAINS_FAILURE`) visibly shows battery rundown on screen.

---

## Phase 6 — Cooling & AI Optimization Panel

**Goal:** Cooling unit visibility plus a simulated AI optimization engine showing energy savings.

### Screens

**Cooling Overview**

- Per-unit status: CRAC/CHILLER name, supply temp, return temp, setpoint, fan speed, state
- Zone temperature vs setpoint comparison
- Cooling efficiency trend

**AI Optimization Panel**

- Toggle: `AI Optimization: ON / OFF`
- When ON: simulated PUE improvement animation, setpoint adjustment suggestions displayed
- Energy savings counter (kWh saved today, this month)
- Simulated recommendation feed: "Raise setpoint in Room 2 by 1°C — estimated 3% saving"
- Before/after PUE comparison chart

### Technical

- [ ] AI optimization is simulated: a backend service generates plausible recommendations based on current temp readings vs setpoints
- [ ] Simple rule engine (if return_temp - supply_temp > X, suggest setpoint raise)
- [ ] Energy savings are calculated from the delta, displayed as a running total
- [ ] This is the layer that gets replaced by a real ML model in production

### End of Phase

The AI panel looks and behaves like a real optimization engine.
Recommendations update as conditions change. The CRAC fault scenario visibly impacts the cooling overview.

---

## Phase 7 — Asset Management

**Goal:** Know exactly what hardware is where, and manage capacity.

### Screens

**Rack View**

- Visual U-position diagram for each rack (1U–42U slots)
- Each populated slot shows: device name, type, power draw
- Empty slots shown as available (grey)
- Click device → detail panel (model, serial, IP, status, power)

**Device Inventory**

- Searchable/filterable table of all devices
- Columns: name, type, rack, U-position, IP, status, power draw, install date
- Export to CSV

**Capacity Overview**

- Per-rack: U-space used/total, power used/allocated
- Site-wide capacity summary
- Highlight over-capacity racks

### Technical

- [ ] Rack diagram component — SVG or CSS grid, U-slots rendered from device data
- [ ] Device CRUD endpoints (add/edit/remove devices)
- [ ] Capacity calculation queries

### End of Phase

You can visually browse every rack, see what's installed where, and identify capacity constraints.

---

## Phase 8 — Alarms & Events

**Goal:** A complete alarm management system — detection, notification, acknowledgement, history.

### Screens

**Active Alarms**

- Live list: severity (Critical / Major / Minor / Info), source, message, time raised
- Acknowledge button per alarm
- Filter by severity, site, room, system type

**Alarm History**

- Searchable log of all past alarms
- Resolution time, acknowledged by, notes

**Alarm Rules** (simple config)

- View and edit threshold rules, e.g. "Rack temp > 32°C = Critical alarm"

### Technical

- [ ] Alarm engine in backend: on each sensor reading, check against thresholds, create alarm if breached, auto-resolve when reading returns to normal
- [ ] Alarm state machine: `ACTIVE → ACKNOWLEDGED → RESOLVED`
- [ ] WebSocket push for new alarms (red badge appears instantly)
- [ ] Email notification hook (stub — wire up SMTP later)

### Scenario Demo

Running `python scenarios/run.py --scenario COOLING_FAILURE --rack A01`:

1. Rack A01 temperature starts rising
2. Warning alarm fires at 28°C
3. Critical alarm fires at 32°C
4. Alarm appears live on dashboard
5. Acknowledge it → status updates
6. Stop scenario → temperature drops → alarm auto-resolves

### End of Phase

Alarm management works end-to-end. Scenarios produce realistic alarm sequences that can be demonstrated live.

---

## Phase 9 — Reporting

**Goal:** Exportable summaries for management, compliance, and capacity planning.

### Screens

**Reports Dashboard**

- Pre-built report types: Energy Summary, Temperature Compliance, Uptime & Availability, Capacity Planning, Battery Health
- Date range selector
- Chart previews inline

**Report Detail**

- Full chart view
- Key stats summary
- Export to PDF / CSV

### Reports Included

| Report | Content |
|---|---|
| Energy Summary | Total kWh, PUE trend, cost estimate, comparison vs prior period |
| Temperature Compliance | % of time within threshold per rack, worst offenders |
| Uptime & Availability | Alarm frequency, MTTR, critical events |
| Capacity Planning | Space and power utilisation per rack/room, projected headroom |
| Battery Health | UPS SOH trends, recommended replacements |

### Technical

- [ ] Report query endpoints (aggregations over TimescaleDB)
- [ ] Chart components reused from earlier phases
- [ ] PDF export via browser print or a library like `@react-pdf/renderer`
- [ ] CSV export from table data

### End of Phase

Management-ready reports that look professional and pull from real (simulated) historical data.
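
To illustrate the Temperature Compliance report, the sketch below computes the share of readings inside a threshold band for each rack and lists the worst offenders. It assumes evenly spaced readings, so the share of readings approximates the share of time. The function names and the 18–27°C band are hypothetical, and a real implementation would more likely run this as an aggregation query in TimescaleDB.

```python
from collections import defaultdict
from typing import Iterable


def compliance_by_rack(
    readings: Iterable[tuple[str, float]],
    low_c: float = 18.0,   # assumed lower bound of the acceptable band
    high_c: float = 27.0,  # assumed upper bound of the acceptable band
) -> dict[str, float]:
    """Return the percentage of readings within [low_c, high_c] per rack.

    `readings` is an iterable of (rack_id, temperature_c) pairs.
    """
    total: dict[str, int] = defaultdict(int)
    within: dict[str, int] = defaultdict(int)
    for rack_id, temp_c in readings:
        total[rack_id] += 1
        if low_c <= temp_c <= high_c:
            within[rack_id] += 1
    return {rack: 100.0 * within[rack] / total[rack] for rack in total}


def worst_offenders(compliance: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return up to n racks, lowest compliance first."""
    return sorted(compliance.items(), key=lambda item: item[1])[:n]


if __name__ == "__main__":
    sample = [("A01", 24.0), ("A01", 29.0), ("A02", 22.0), ("A02", 23.0)]
    result = compliance_by_rack(sample)
    print(result)
    print(worst_offenders(result, n=1))
```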

---

## Phase 10 — Polish & Production Hardening

**Goal:** Make the system genuinely enterprise-ready: secure, auditable, and multi-site capable.

### Security

- [ ] Role-based access control: `Admin`, `Operator`, `Read-only`, `Site Manager`
- [ ] Permissions enforced on both frontend routes and backend API endpoints
- [ ] API rate limiting
- [ ] Input validation and sanitisation throughout
- [ ] HTTPS enforced
- [ ] Secrets management (environment variables, never hardcoded)

### Audit & Compliance

- [ ] Audit log table: every user action recorded (who, what, when, from where)
- [ ] Audit log viewer in admin panel
- [ ] Data retention policy configuration

### Multi-site

- [ ] Site switcher in top bar
- [ ] All queries scoped to selected site
- [ ] Cross-site summary view for administrators

### Operational

- [ ] Health check endpoints
- [ ] Structured logging throughout backend
- [ ] Error boundary handling in frontend
- [ ] Loading and empty states on all screens
- [ ] Mobile-responsive layout (tablet minimum)

### End of Phase

System is ready for a real pilot deployment. Security reviewed, roles working, audit trail intact.
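
One way to picture the backend side of the role checks in this phase is a role-to-permission table with a single lookup function. The four roles come from the Security list; the permission names and their assignment to roles are purely hypothetical and would need to be agreed separately.

```python
from enum import Enum


class Role(str, Enum):
    ADMIN = "Admin"
    OPERATOR = "Operator"
    READ_ONLY = "Read-only"
    SITE_MANAGER = "Site Manager"


# Hypothetical permission names and assignments, for illustration only.
PERMISSIONS: dict[Role, frozenset[str]] = {
    Role.ADMIN: frozenset({"view", "ack_alarm", "edit_assets", "manage_users"}),
    Role.SITE_MANAGER: frozenset({"view", "ack_alarm", "edit_assets"}),
    Role.OPERATOR: frozenset({"view", "ack_alarm"}),
    Role.READ_ONLY: frozenset({"view"}),
}


def is_allowed(role: Role, permission: str) -> bool:
    """Return True if the role grants the permission; unmapped roles get nothing."""
    return permission in PERMISSIONS.get(role, frozenset())


if __name__ == "__main__":
    print(is_allowed(Role.OPERATOR, "ack_alarm"))
    print(is_allowed(Role.READ_ONLY, "ack_alarm"))
```

Keeping the mapping in one table makes it straightforward to reuse the same check in API dependencies and to expose it to the frontend for route guarding.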

---

## What Comes After (Production Path)

When the mockup phases are complete, these are the additions needed to turn it into a real product:

| Addition | Description |
|---|---|
| Real hardware ingestion | Replace simulator bots with real MQTT/SNMP/Modbus adapters |
| TimescaleDB scaling | Move to managed TimescaleDB cloud or dedicated server |
| Real AI engine | Replace rule-based cooling suggestions with ML model |
| SSO / SAML | Enterprise single sign-on via Auth0 enterprise tier |
| Multi-tenancy | Full data isolation per customer (for SaaS model) |
| Mobile app | React Native app reusing component logic |
| Hardware onboarding | UI for registering new devices and sensors |
| SLA monitoring | Uptime tracking and alerting for contracted SLAs |

The mockup-to-production transition is incremental — each bot gets replaced by real hardware one at a time, with zero changes to the rest of the system.

---

## Summary

- **10 phases**, each with a clear, testable deliverable
- **Simulator bots** make every phase fully demonstrable with realistic data
- **Scenario runner** lets you trigger alarm sequences on demand for demos
- **Production-ready architecture** from day one — no throwaway work
- Real hardware integration is a drop-in replacement when you're ready