DCIM Platform — Project Plan
Stack: Next.js + TypeScript + shadcn/ui | Python FastAPI | PostgreSQL + TimescaleDB | MQTT (Mosquitto) | Clerk Auth
Approach: Build as a production shell from day one. Simulated sensor bots feed real data pipelines. Swap bots for real hardware when ready — nothing else changes.
How the Simulator Bots Work
Each bot is a small Python script that behaves exactly like a real physical sensor or device. It generates realistic data with natural variation and drift, then publishes it to the MQTT broker on the same topic a real sensor would use.
[Bot: Rack A01 Temp Sensor] → MQTT topic: dc/site1/room1/rack-A01/temperature
[Bot: UPS Unit 1] → MQTT topic: dc/site1/power/ups-01/status
[Bot: CRAC Unit 2] → MQTT topic: dc/site1/cooling/crac-02/status
The backend subscribes to those topics and stores the data. The frontend never knows the difference. When real hardware is connected, it publishes to the same topics — bots get switched off, everything else keeps working.
Bots can also simulate events and scenarios for demo purposes:
- Gradual temperature rise in a rack (simulating cooling failure)
- Power load spike across a PDU
- UPS battery degradation over time
- Water leak alert trigger
- Alarm escalation sequences
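A minimal sketch of such a bot, assuming the payload shape and topic naming above (the class name, field names, and clamping band are illustrative, not part of the plan):

```python
import json
import random

class TempSensorBot:
    """Simulated rack temperature sensor: a base value plus bounded random drift."""

    def __init__(self, topic: str, base_temp: float = 24.0):
        self.topic = topic  # e.g. "dc/site1/room1/rack-A01/temperature"
        self.temp = base_temp

    def next_reading(self) -> str:
        # Drift by up to ±0.5°C per tick, clamped to a plausible operating band.
        self.temp += random.uniform(-0.5, 0.5)
        self.temp = max(18.0, min(35.0, self.temp))
        return json.dumps({"topic": self.topic, "value": round(self.temp, 2), "unit": "C"})

bot = TempSensorBot("dc/site1/room1/rack-A01/temperature")
payload = json.loads(bot.next_reading())
# A real bot would call mqtt_client.publish(bot.topic, ...) every 30 s
# using an MQTT client library such as paho-mqtt.
```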
Phase Overview
| Phase | Name | Deliverable |
|---|---|---|
| 1 | Foundation | Running skeleton — frontend, backend, auth, DB all connected |
| 2 | Data Pipeline + Bots | MQTT broker, simulator bots, data flowing into DB |
| 3 | Core Dashboard | Live overview dashboard pulling real simulated data |
| 4 | Environmental Monitoring | Temperature/humidity views, heatmaps |
| 5 | Power Monitoring | PDU, UPS, PUE tracking |
| 6 | Cooling & AI Panel | CRAC status, simulated AI optimization |
| 7 | Asset Management | Rack views, device inventory |
| 8 | Alarms & Events | Live alarm feed, acknowledgement, escalation |
| 9 | Reporting | Charts, summaries, export |
| 10 | Polish & Hardening | RBAC, audit log, multi-site, production readiness |
Phase 1 — Foundation
Goal: Every layer of the stack is running and connected. No real features yet, just a working skeleton.
Tasks
- Initialise Next.js project with TypeScript
- Set up shadcn/ui component library
- Set up Python FastAPI project structure
- Connect Clerk authentication (login, logout, protected routes)
- Provision PostgreSQL database (local via Docker)
- Basic API route: frontend calls backend, gets a response
- Docker Compose file running: frontend + backend + database together
- Placeholder layout: sidebar nav, top bar, main content area
Folder Structure
/dcim
  /frontend        ← Next.js app
    /app
    /components
    /lib
  /backend         ← FastAPI app
    /api
    /models
    /services
  /simulators      ← Sensor bots (Python scripts)
  /infra           ← Docker Compose, config files
  docker-compose.yml
End of Phase
You can log in, see a blank dashboard shell, and the backend responds to API calls. Database is running.
Phase 2 — Data Pipeline & Simulator Bots
Goal: Simulated sensor data flows continuously from bots → MQTT → backend → database. The system behaves as if real hardware is connected.
Infrastructure
- Add Mosquitto MQTT broker to Docker Compose
- Add TimescaleDB extension to PostgreSQL
- Create hypertable for sensor readings (time-series optimised)
Database Schema (core tables)
sites — site name, location, timezone
rooms — belongs to site, physical room
racks — belongs to room, U-height, position
devices — belongs to rack, type, model, serial
sensors — belongs to device or rack, sensor type
readings — (TimescaleDB hypertable) sensor_id, timestamp, value
alarms — sensor_id, severity, message, state, acknowledged_at
Backend Data Ingestion
- MQTT subscriber service (Python) — listens to all sensor topics
- Parses incoming messages, validates, writes to the readings table
- WebSocket endpoint — streams latest readings to connected frontends
- REST endpoints — historical data queries, aggregations
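The ingestion service's first job is turning a topic string into database keys. A sketch of that parsing step, assuming the five-segment topic shape shown earlier (the field names are illustrative):

```python
def parse_topic(topic: str) -> dict:
    """Split a sensor topic like 'dc/site1/room1/rack-A01/temperature'
    into the fields the ingestion service needs to look up sensor IDs."""
    parts = topic.split("/")
    if len(parts) != 5 or parts[0] != "dc":
        raise ValueError(f"unexpected topic shape: {topic}")
    return {"site": parts[1], "zone": parts[2], "source": parts[3], "metric": parts[4]}

info = parse_topic("dc/site1/room1/rack-A01/temperature")
```

The same function handles the power and cooling topics (`dc/site1/power/ups-01/status`), since all follow the five-segment convention.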
Simulator Bots
Each bot runs as an independent Python process, configurable via a simple config file.
Bot: Temperature/Humidity (per rack)
- Publishes every 30 seconds
- Base temperature: 22–26°C with ±0.5°C natural drift
- Humidity: 40–55% RH with slow drift
- Scenario: COOLING_FAILURE — temperature rises 0.3°C/min until alarm threshold
Bot: PDU Power Monitor (per rack)
- Publishes every 60 seconds
- Load: 2–8 kW per rack, fluctuates with simulated workload patterns
- Simulates day/night load patterns (higher load 9am–6pm)
- Scenario: POWER_SPIKE — sudden 40% load increase
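The day/night pattern and the POWER_SPIKE scenario can be captured in one small function; the base load, jitter, and multipliers here are illustrative values within the 2–8 kW band above:

```python
import random

def pdu_load_kw(hour: int, base_kw: float = 4.0, spike: bool = False) -> float:
    """Simulated per-rack PDU load: higher during business hours (9am–6pm),
    small random jitter, optional POWER_SPIKE scenario (+40%)."""
    daytime = 1.25 if 9 <= hour < 18 else 0.85
    load = base_kw * daytime + random.uniform(-0.3, 0.3)
    if spike:
        load *= 1.4
    return max(0.0, round(load, 2))

midday = pdu_load_kw(13)
night = pdu_load_kw(3)
```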
Bot: UPS Unit
- Publishes every 60 seconds
- Input/output voltage, load percentage, battery charge, runtime estimate
- Battery health (SOH) degrades slowly over simulated time
- Scenario: MAINS_FAILURE — switches to battery, runtime counts down
Bot: CRAC/Cooling Unit
- Publishes every 60 seconds
- Supply/return air temperature, setpoint, fan speed, compressor state
- Responds to rack temperature increases (simulated feedback loop)
- Scenario: UNIT_FAULT — unit goes offline, temperature in zone starts rising
Bot: Water Leak Sensor
- Normally silent (no leak)
- Scenario: LEAK_DETECTED — publishes alert, alarm triggers
Bot: Battery Cell Monitor
- Cell voltages, internal resistance per cell
- Scenario: CELL_DEGRADATION — one cell's resistance rises, SOH drops
Scenario Runner
- Simple CLI script: python scenarios/run.py --scenario COOLING_FAILURE --rack A01
- Useful for demos — trigger realistic alarm sequences on demand
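The runner's command-line interface can be a few lines of argparse; the scenario names come from the bot list above, and the `--rack` default is an assumption:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI for the scenario runner: which scenario to trigger, on which rack."""
    p = argparse.ArgumentParser(prog="run.py")
    p.add_argument("--scenario", required=True,
                   choices=["COOLING_FAILURE", "POWER_SPIKE", "MAINS_FAILURE",
                            "UNIT_FAULT", "LEAK_DETECTED", "CELL_DEGRADATION"])
    p.add_argument("--rack", default=None, help="target rack, e.g. A01")
    return p

args = build_parser().parse_args(["--scenario", "COOLING_FAILURE", "--rack", "A01"])
# The runner would then flip the matching bot into the requested scenario mode.
```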
End of Phase
Bots are running, data is flowing into the database every 30–60 seconds across a simulated data center with ~20 racks, 2 rooms, 1 site. You can query the database and see live readings.
Phase 3 — Core Dashboard
Goal: The main overview screen, live-updating, showing the health of the entire facility at a glance.
Screens
Site Overview Dashboard
- Facility health score (aggregate status)
- KPI cards: Total Power (kW), PUE, Avg Temperature, Active Alarms count
- Active alarm feed (live, colour-coded by severity)
- Power trend chart (last 24 hours)
- Temperature trend chart (last 24 hours)
- Room status summary (green/amber/red per room)
Technical
- WebSocket hook in frontend — subscribes to live data stream
- KPI card component (value, trend arrow, threshold colour)
- Live-updating line chart component (Recharts)
- Alarm badge component
- Auto-refresh every 30 seconds as fallback
End of Phase
Opening the dashboard shows a live, moving picture of the simulated data center. Numbers change in real time. Alarms appear when bots trigger scenarios.
Phase 4 — Environmental Monitoring
Goal: Deep visibility into temperature and humidity across all zones and racks.
Screens
Environmental Overview
- Room selector (dropdown or tab)
- Floor plan / rack layout — each rack colour-coded by temperature (cool blue → hot red)
- Click a rack → side panel showing temp/humidity chart for last 24h
- Hot/cold aisle average temperatures
Rack Detail Panel
- Temperature trend (line chart, last 24h / 7d selectable)
- Humidity trend
- Current reading with timestamp
- Threshold indicators (warning / critical bands shown on chart)
Technical
- SVG floor plan component — racks as rectangles, colour interpolated from temp value
- Historical data endpoint: GET /api/sensors/{id}/readings?from=&to=&interval=
- Threshold configuration stored in DB, compared on ingest
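The `interval=` parameter implies server-side downsampling. In production TimescaleDB's `time_bucket` would do this in SQL; a pure-Python sketch of the same averaging (function name and shapes are illustrative):

```python
from datetime import datetime, timedelta

def bucket_readings(readings, interval_s: int):
    """Average raw (timestamp, value) readings into fixed-size time buckets,
    mirroring what the ?interval= query parameter asks the backend for."""
    buckets = {}
    for ts, value in readings:
        key = int(ts.timestamp()) // interval_s * interval_s
        buckets.setdefault(key, []).append(value)
    return {datetime.fromtimestamp(k): sum(v) / len(v) for k, v in sorted(buckets.items())}

t0 = datetime(2025, 1, 1, 12, 0, 0)
raw = [(t0 + timedelta(seconds=30 * i), 22.0 + i) for i in range(4)]
per_minute = bucket_readings(raw, 60)
```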
End of Phase
You can see exactly which racks are running hot. Heatmap updates live. Clicking a rack shows its history.
Phase 5 — Power Monitoring
Goal: Full visibility into power consumption, distribution, and UPS health.
Screens
Power Overview
- Total facility power (kW) — live gauge
- PUE metric (Power Usage Effectiveness) — live, with trend
- PDU breakdown — per-rack load as a bar chart
- Power trend — last 24 hours area chart
UPS Status Panel
- Per-unit: input voltage, output voltage, load %, battery charge %, estimated runtime
- Battery health (SOH) indicator
- Status badge (Online / On Battery / Fault)
- Historical battery charge chart
PDU Detail
- Per-rack power readings
- Alert if any rack exceeds capacity threshold
Technical
- PUE calculation: Total Facility Power / IT Equipment Power (computed server-side)
- Gauge chart component (Recharts RadialBarChart or similar)
- UPS status card component
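The PUE calculation named above is a single division, computed server-side; a sketch with illustrative input values:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """PUE = Total Facility Power / IT Equipment Power.
    1.0 is the theoretical ideal; typical facilities land around 1.2–2.0."""
    if it_equipment_kw <= 0:
        raise ValueError("IT load must be positive")
    return round(total_facility_kw / it_equipment_kw, 2)

value = pue(total_facility_kw=120.0, it_equipment_kw=80.0)  # → 1.5
```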
End of Phase
Full picture of power health. UPS bot scenario (MAINS_FAILURE) visibly shows battery rundown on screen.
Phase 6 — Cooling & AI Optimization Panel
Goal: Cooling unit visibility plus a simulated AI optimization engine showing energy savings.
Screens
Cooling Overview
- Per-unit status: CRAC/CHILLER name, supply temp, return temp, setpoint, fan speed, state
- Zone temperature vs setpoint comparison
- Cooling efficiency trend
AI Optimization Panel
- Toggle: AI Optimization: ON / OFF
- When ON: simulated PUE improvement animation, setpoint adjustment suggestions displayed
- Energy savings counter (kWh saved today, this month)
- Simulated recommendation feed: "Raise setpoint in Room 2 by 1°C — estimated 3% saving"
- Before/after PUE comparison chart
Technical
- AI optimization is simulated: a backend service generates plausible recommendations based on current temp readings vs setpoints
- Simple rule engine (if return_temp - supply_temp > X, suggest setpoint raise)
- Energy savings are calculated from the delta, displayed as a running total
- This is the layer that gets replaced by a real ML model in production
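A sketch of that rule engine, assuming a return/supply delta threshold as described (the threshold value, field names, and message wording are illustrative):

```python
def cooling_recommendations(units, max_delta_c: float = 10.0):
    """Rule sketch: if a CRAC's return/supply delta exceeds a threshold,
    suggest raising its setpoint. This is the stand-in that a real ML
    model replaces in production."""
    recs = []
    for u in units:
        delta = u["return_temp"] - u["supply_temp"]
        if delta > max_delta_c:
            recs.append(f"Raise setpoint on {u['name']} by 1°C "
                        f"(return/supply delta {delta:.1f}°C)")
    return recs

units = [
    {"name": "CRAC-01", "supply_temp": 14.0, "return_temp": 26.5},
    {"name": "CRAC-02", "supply_temp": 15.0, "return_temp": 22.0},
]
recs = cooling_recommendations(units)
```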
End of Phase
The AI panel looks and behaves like a real optimization engine. Recommendations update as conditions change. The CRAC fault scenario visibly impacts the cooling overview.
Phase 7 — Asset Management
Goal: Know exactly what hardware is where, and manage capacity.
Screens
Rack View
- Visual U-position diagram for each rack (1U–42U slots)
- Each populated slot shows: device name, type, power draw
- Empty slots shown as available (grey)
- Click device → detail panel (model, serial, IP, status, power)
Device Inventory
- Searchable/filterable table of all devices
- Columns: name, type, rack, U-position, IP, status, power draw, install date
- Export to CSV
Capacity Overview
- Per-rack: U-space used/total, power used/allocated
- Site-wide capacity summary
- Highlight over-capacity racks
Technical
- Rack diagram component — SVG or CSS grid, U-slots rendered from device data
- Device CRUD endpoints (add/edit/remove devices)
- Capacity calculation queries
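The capacity queries reduce to sums over the device table; a sketch assuming a 42U rack and a hypothetical per-rack power budget:

```python
def rack_capacity(devices, total_u: int = 42, power_budget_kw: float = 8.0):
    """Per-rack capacity summary from a device list: U-space and power
    used vs available, plus an over-capacity flag for highlighting."""
    used_u = sum(d["u_height"] for d in devices)
    used_kw = sum(d["power_kw"] for d in devices)
    return {
        "u_used": used_u,
        "u_free": total_u - used_u,
        "power_used_kw": round(used_kw, 2),
        "over_capacity": used_u > total_u or used_kw > power_budget_kw,
    }

summary = rack_capacity([
    {"name": "srv-01", "u_height": 2, "power_kw": 0.6},
    {"name": "srv-02", "u_height": 4, "power_kw": 1.1},
])
```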
End of Phase
You can visually browse every rack, see what's installed where, and identify capacity constraints.
Phase 8 — Alarms & Events
Goal: A complete alarm management system — detection, notification, acknowledgement, history.
Screens
Active Alarms
- Live list: severity (Critical / Major / Minor / Info), source, message, time raised
- Acknowledge button per alarm
- Filter by severity, site, room, system type
Alarm History
- Searchable log of all past alarms
- Resolution time, acknowledged by, notes
Alarm Rules (simple config)
- View and edit threshold rules: e.g. "Rack temp > 30°C = Critical alarm"
Technical
- Alarm engine in backend: on each sensor reading, check against thresholds, create alarm if breached, auto-resolve when reading returns to normal
- Alarm state machine: ACTIVE → ACKNOWLEDGED → RESOLVED
- WebSocket push for new alarms (red badge appears instantly)
- Email notification hook (stub — wire up SMTP later)
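The ACTIVE → ACKNOWLEDGED → RESOLVED lifecycle can be enforced with a small transition table; auto-resolve from ACTIVE (when a reading returns to normal before anyone acknowledges) is assumed to be legal:

```python
class Alarm:
    """Alarm lifecycle sketch: ACTIVE → ACKNOWLEDGED → RESOLVED,
    with auto-resolve allowed from either live state."""
    TRANSITIONS = {
        "ACTIVE": {"ACKNOWLEDGED", "RESOLVED"},
        "ACKNOWLEDGED": {"RESOLVED"},
        "RESOLVED": set(),
    }

    def __init__(self):
        self.state = "ACTIVE"

    def transition(self, new_state: str):
        if new_state not in self.TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

alarm = Alarm()
alarm.transition("ACKNOWLEDGED")
alarm.transition("RESOLVED")
```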
Scenario Demo
Running python scenarios/run.py --scenario COOLING_FAILURE --rack A01:
- Rack A01 temperature starts rising
- Warning alarm fires at 28°C
- Critical alarm fires at 32°C
- Alarm appears live on dashboard
- Acknowledge it → status updates
- Stop scenario → temperature drops → alarm auto-resolves
End of Phase
Alarm management works end-to-end. Scenarios produce realistic alarm sequences that can be demonstrated live.
Phase 9 — Reporting
Goal: Exportable summaries for management, compliance, and capacity planning.
Screens
Reports Dashboard
- Pre-built report types: Energy Summary, Temperature Compliance, Uptime Summary, Capacity Report
- Date range selector
- Chart previews inline
Report Detail
- Full chart view
- Key stats summary
- Export to PDF / CSV
Reports Included
| Report | Content |
|---|---|
| Energy Summary | Total kWh, PUE trend, cost estimate, comparison vs prior period |
| Temperature Compliance | % of time within threshold per rack, worst offenders |
| Uptime & Availability | Alarm frequency, MTTR, critical events |
| Capacity Planning | Space and power utilisation per rack/room, projected headroom |
| Battery Health | UPS SOH trends, recommended replacements |
Technical
- Report query endpoints (aggregations over TimescaleDB)
- Chart components reused from earlier phases
- PDF export via browser print or a library like react-pdf
- CSV export from table data
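The CSV export is the simplest piece to pin down; a backend-side sketch using the standard library (column names are illustrative):

```python
import csv
import io

def rows_to_csv(rows, columns):
    """Serialise inventory/report rows to CSV in memory, the kind of
    payload the 'Export to CSV' buttons would stream back to the browser."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

csv_text = rows_to_csv(
    [{"name": "srv-01", "rack": "A01", "power_kw": 0.6}],
    columns=["name", "rack", "power_kw"],
)
```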
End of Phase
Management-ready reports that look professional and pull from real (simulated) historical data.
Phase 10 — Polish & Production Hardening
Goal: Make the system genuinely enterprise-ready — secure, auditable, multi-tenant capable.
Security
- Role-based access control: Admin, Operator, Read-only, Site Manager
- Permissions enforced on both frontend routes and backend API endpoints
- API rate limiting
- Input validation and sanitisation throughout
- HTTPS enforced
- Secrets management (environment variables, never hardcoded)
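A role-to-permission table keeps the four roles enforceable on both sides of the stack; this mapping and the permission names are hypothetical, illustrating the shape rather than the final policy:

```python
# Hypothetical permission map; role and permission names are illustrative.
ROLE_PERMISSIONS = {
    "Admin":        {"view", "acknowledge_alarms", "edit_devices", "manage_users"},
    "Operator":     {"view", "acknowledge_alarms", "edit_devices"},
    "Site Manager": {"view", "acknowledge_alarms"},
    "Read-only":    {"view"},
}

def is_allowed(role: str, permission: str) -> bool:
    """Backend-side check; the same table would drive frontend route guards."""
    return permission in ROLE_PERMISSIONS.get(role, set())

allowed = is_allowed("Operator", "edit_devices")   # True
denied = is_allowed("Read-only", "manage_users")   # False
```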
Audit & Compliance
- Audit log table: every user action recorded (who, what, when, from where)
- Audit log viewer in admin panel
- Data retention policy configuration
Multi-site
- Site switcher in top bar
- All queries scoped to selected site
- Cross-site summary view for administrators
Operational
- Health check endpoints
- Structured logging throughout backend
- Error boundary handling in frontend
- Loading and empty states on all screens
- Mobile-responsive layout (tablet minimum)
End of Phase
System is ready for a real pilot deployment. Security reviewed, roles working, audit trail intact.
What Comes After (Production Path)
When the mockup phases are complete, these are the additions needed to turn it into a real product:
| Addition | Description |
|---|---|
| Real hardware ingestion | Replace simulator bots with real MQTT/SNMP/Modbus adapters |
| TimescaleDB scaling | Move to managed TimescaleDB cloud or dedicated server |
| Real AI engine | Replace rule-based cooling suggestions with ML model |
| SSO / SAML | Enterprise single sign-on via Clerk's enterprise tier |
| Multi-tenancy | Full data isolation per customer (for SaaS model) |
| Mobile app | React Native app reusing component logic |
| Hardware onboarding | UI for registering new devices and sensors |
| SLA monitoring | Uptime tracking and alerting for contracted SLAs |
The mockup-to-production transition is incremental — each bot gets replaced by real hardware one at a time, with zero changes to the rest of the system.
Summary
- 10 phases, each with a clear, testable deliverable
- Simulator bots make every phase fully demonstrable with realistic data
- Scenario runner lets you trigger alarm sequences on demand for demos
- Production-ready architecture from day one — no throwaway work
- Real hardware integration is a drop-in replacement when you're ready