2026-03-19 11:32:17 +00:00


DCIM Platform — Project Plan

Stack: Next.js + TypeScript + shadcn/ui | Python FastAPI | PostgreSQL + TimescaleDB | MQTT (Mosquitto) | Clerk Auth

Approach: Build as a production shell from day one. Simulated sensor bots feed real data pipelines. Swap bots for real hardware when ready; nothing else changes.

How the Simulator Bots Work

Each bot is a small Python script that behaves exactly like a real physical sensor or device. It generates realistic data with natural variation and drift, then publishes it to the MQTT broker on the same topic a real sensor would use.

[Bot: Rack A01 Temp Sensor]  →  MQTT topic: dc/site1/room1/rack-A01/temperature
[Bot: UPS Unit 1]            →  MQTT topic: dc/site1/power/ups-01/status
[Bot: CRAC Unit 2]           →  MQTT topic: dc/site1/cooling/crac-02/status

The backend subscribes to those topics and stores the data. The frontend never knows the difference. When real hardware is connected, it publishes to the same topics — bots get switched off, everything else keeps working.
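As a minimal sketch of the publishing side, the snippet below builds a rack topic in the layout shown above and serialises one reading. The helper names, the JSON payload shape (value plus Unix timestamp), and the injected publish callable are assumptions for illustration; a real bot would pass in its MQTT client's publish method instead.

```python
import json
import time
from typing import Callable


def rack_topic(site: str, room: str, rack: str, metric: str) -> str:
    """Build a topic such as dc/site1/room1/rack-A01/temperature."""
    return f"dc/{site}/{room}/rack-{rack}/{metric}"


def publish_reading(publish: Callable[[str, str], None], topic: str, value: float) -> str:
    """Serialise one reading and hand it to the injected publish function."""
    # Assumed payload shape: the plan does not define the message format.
    payload = json.dumps({"value": round(value, 2), "ts": int(time.time())})
    publish(topic, payload)
    return payload
```

Because the publish function is injected, the same code can print to the console during development and send to the broker once one is running.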

Bots can also simulate events and scenarios for demo purposes:

Gradual temperature rise in a rack (simulating cooling failure)
Power load spike across a PDU
UPS battery degradation over time
Water leak alert trigger
Alarm escalation sequences

Phase Overview

Phase	Name	Deliverable
1	Foundation	Running skeleton — frontend, backend, auth, DB all connected
2	Data Pipeline + Bots	MQTT broker, simulator bots, data flowing into DB
3	Core Dashboard	Live overview dashboard pulling real simulated data
4	Environmental Monitoring	Temperature/humidity views, heatmaps
5	Power Monitoring	PDU, UPS, PUE tracking
6	Cooling & AI Panel	CRAC status, simulated AI optimization
7	Asset Management	Rack views, device inventory
8	Alarms & Events	Live alarm feed, acknowledgement, escalation
9	Reporting	Charts, summaries, export
10	Polish & Hardening	RBAC, audit log, multi-site, production readiness

Phase 1 — Foundation

Goal: Every layer of the stack is running and connected. No real features yet, just a working skeleton.

Tasks

Initialise Next.js project with TypeScript
Set up shadcn/ui component library
Set up Python FastAPI project structure
Connect Clerk authentication (login, logout, protected routes)
Provision PostgreSQL database (local via Docker)
Basic API route: frontend calls backend, gets a response
Docker Compose file running: frontend + backend + database together
Placeholder layout: sidebar nav, top bar, main content area

Folder Structure

/dcim
  /frontend          ← Next.js app
    /app
    /components
    /lib
  /backend           ← FastAPI app
    /api
    /models
    /services
  /simulators        ← Sensor bots (Python scripts)
  /infra             ← Docker Compose, config files
  docker-compose.yml

End of Phase

You can log in, see a blank dashboard shell, and the backend responds to API calls. Database is running.

Phase 2 — Data Pipeline & Simulator Bots

Goal: Simulated sensor data flows continuously from bots → MQTT → backend → database. The system behaves as if real hardware is connected.

Infrastructure

Add Mosquitto MQTT broker to Docker Compose
Add TimescaleDB extension to PostgreSQL
Create hypertable for sensor readings (time-series optimised)

Database Schema (core tables)

sites        — site name, location, timezone
rooms        — belongs to site, physical room
racks        — belongs to room, U-height, position
devices      — belongs to rack, type, model, serial
sensors      — belongs to device or rack, sensor type
readings     — (TimescaleDB hypertable) sensor_id, timestamp, value
alarms       — sensor_id, severity, message, state, acknowledged_at
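One possible shape for the readings hypertable migration is sketched below as SQL strings held in Python. The column types and the function name migration_statements are assumptions; only the three column names and the use of a TimescaleDB hypertable come from the plan. create_hypertable is TimescaleDB's function for converting a plain table into a hypertable.

```python
# Assumed column types; the plan only names sensor_id, timestamp, value.
READINGS_DDL = """
CREATE TABLE IF NOT EXISTS readings (
    sensor_id   INTEGER          NOT NULL,
    "timestamp" TIMESTAMPTZ      NOT NULL,
    value       DOUBLE PRECISION NOT NULL
);
"""

# Converts the plain table into a hypertable partitioned on the time column.
HYPERTABLE_DDL = (
    "SELECT create_hypertable('readings', 'timestamp', if_not_exists => TRUE);"
)


def migration_statements() -> list[str]:
    """Return the statements in the order they would need to run."""
    return [
        "CREATE EXTENSION IF NOT EXISTS timescaledb;",
        READINGS_DDL.strip(),
        HYPERTABLE_DDL,
    ]
```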

Backend Data Ingestion

MQTT subscriber service (Python) — listens to all sensor topics
Parses incoming messages, validates, writes to readings table
WebSocket endpoint — streams latest readings to connected frontends
REST endpoints — historical data queries, aggregations
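The parse-and-validate step could look roughly like the sketch below. It assumes a JSON payload with a numeric "value" field and treats the last topic segment as the metric name; both are illustrative choices, not something the plan fixes. Invalid messages return None so the subscriber can drop them without crashing.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    topic: str
    metric: str
    value: float


def parse_message(topic: str, payload: bytes) -> Optional[Reading]:
    """Return a Reading, or None if the message fails validation."""
    parts = topic.split("/")
    # All topics in the plan start with the "dc" prefix.
    if len(parts) < 3 or parts[0] != "dc":
        return None
    try:
        body = json.loads(payload)
        value = float(body["value"])
    except (ValueError, KeyError, TypeError):
        # Covers malformed JSON, a missing field, and non-numeric values.
        return None
    return Reading(topic=topic, metric=parts[-1], value=value)
```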

Simulator Bots

Each bot runs as an independent Python process, configurable via a simple config file.

Bot: Temperature/Humidity (per rack)

Publishes every 30 seconds
Base temperature: 22–26°C with ±0.5°C natural drift
Humidity: 40–55% RH with slow drift
Scenario: COOLING_FAILURE — temperature rises 0.3°C/min until alarm threshold
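A rough sketch of the temperature update for one publish tick follows. The ±0.5°C band, the 30-second interval, and the 0.3°C/min failure rate come from the spec above; the random step size of 0.1°C and the function name are assumptions. Stopping at the alarm threshold is left to the caller.

```python
import random

PUBLISH_INTERVAL_S = 30
FAILURE_RISE_PER_MIN = 0.3  # COOLING_FAILURE rate from the bot spec
DRIFT_BAND_C = 0.5          # natural drift stays within base +/- 0.5 C


def next_temperature(current: float, base: float, cooling_failure: bool,
                     rng: random.Random) -> float:
    """Return the temperature to publish on the next tick."""
    if cooling_failure:
        # 0.3 C/min at a 30 s interval works out to 0.15 C per publish.
        return current + FAILURE_RISE_PER_MIN * PUBLISH_INTERVAL_S / 60
    # Assumed step size: a small random walk, clamped to the drift band.
    step = rng.uniform(-0.1, 0.1)
    return min(base + DRIFT_BAND_C, max(base - DRIFT_BAND_C, current + step))
```

Passing in a seeded random.Random keeps demo runs repeatable.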

Bot: PDU Power Monitor (per rack)

Publishes every 60 seconds
Load: 2–8 kW per rack, fluctuates with simulated workload patterns
Simulates day/night load patterns (higher load 9am–6pm)
Scenario: POWER_SPIKE — sudden 40% load increase
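The load model could be as simple as the sketch below. The 09:00 to 18:00 window and the 40% spike come from the spec; the 20% daytime uplift and the function name are assumptions chosen only to make the pattern visible on a chart.

```python
def pdu_load_kw(base_kw: float, hour: int, power_spike: bool = False) -> float:
    """Load for one rack at a given local hour (0-23)."""
    # Assumed day/night shape: 20% uplift between 09:00 and 18:00.
    load = base_kw * (1.2 if 9 <= hour < 18 else 1.0)
    if power_spike:
        load *= 1.4  # POWER_SPIKE: sudden 40% load increase
    return round(load, 2)
```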

Bot: UPS Unit

Publishes every 60 seconds
Input/output voltage, load percentage, battery charge, runtime estimate
Battery health (SOH) degrades slowly over simulated time
Scenario: MAINS_FAILURE — switches to battery, runtime counts down
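A minimal sketch of the MAINS_FAILURE countdown is shown below. The 30-minute full-charge runtime, the linear relation between charge and runtime, and the absence of recharging are all simplifying assumptions; the names are hypothetical.

```python
from dataclasses import dataclass

FULL_RUNTIME_MIN = 30.0  # assumed runtime at 100% charge


@dataclass
class UpsState:
    on_battery: bool = False
    charge_pct: float = 100.0
    runtime_min: float = FULL_RUNTIME_MIN


def tick(state: UpsState, mains_failure: bool, minutes: float = 1.0) -> UpsState:
    """Advance the UPS by one publish interval."""
    if mains_failure:
        # On battery: runtime counts down and charge follows it linearly.
        runtime = max(0.0, state.runtime_min - minutes)
        return UpsState(True, 100.0 * runtime / FULL_RUNTIME_MIN, runtime)
    # Mains present: back online; recharging is not modelled in this sketch.
    return UpsState(False, state.charge_pct, state.runtime_min)
```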

Bot: CRAC/Cooling Unit

Publishes every 60 seconds
Supply/return air temperature, setpoint, fan speed, compressor state
Responds to rack temperature increases (simulated feedback loop)
Scenario: UNIT_FAULT — unit goes offline, temperature in zone starts rising

Bot: Water Leak Sensor

Normally silent (no leak)
Scenario: LEAK_DETECTED — publishes alert, alarm triggers

Bot: Battery Cell Monitor

Cell voltages, internal resistance per cell
Scenario: CELL_DEGRADATION — one cell's resistance rises, SOH drops

Scenario Runner

Simple CLI script: python scenarios/run.py --scenario COOLING_FAILURE --rack A01
Useful for demos — trigger realistic alarm sequences on demand
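The argument handling for such a runner might look like the sketch below, using the scenario names defined for the bots above. How the runner actually signals a bot is not specified in the plan, so this sketch stops at producing a command dictionary; build_command is a hypothetical name.

```python
import argparse

# Scenario names taken from the bot specifications in this phase.
SCENARIOS = [
    "COOLING_FAILURE", "POWER_SPIKE", "MAINS_FAILURE",
    "UNIT_FAULT", "LEAK_DETECTED", "CELL_DEGRADATION",
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Trigger a simulator scenario")
    parser.add_argument("--scenario", required=True, choices=SCENARIOS)
    parser.add_argument("--rack", help="target rack, e.g. A01")
    return parser


def build_command(argv: list[str]) -> dict:
    """Parse CLI arguments into a command a bot could act on."""
    args = build_parser().parse_args(argv)
    return {"scenario": args.scenario, "rack": args.rack}
```

Restricting --scenario to a fixed list means a typo fails at the command line rather than silently doing nothing during a demo.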

End of Phase

Bots are running, data is flowing into the database every 30–60 seconds across a simulated data center with ~20 racks, 2 rooms, 1 site. You can query the database and see live readings.

Phase 3 — Core Dashboard

Goal: The main overview screen, live-updating, showing the health of the entire facility at a glance.

Screens

Site Overview Dashboard

Facility health score (aggregate status)
KPI cards: Total Power (kW), PUE, Avg Temperature, Active Alarms count
Active alarm feed (live, colour-coded by severity)
Power trend chart (last 24 hours)
Temperature trend chart (last 24 hours)
Room status summary (green/amber/red per room)

Technical

WebSocket hook in frontend — subscribes to live data stream
KPI card component (value, trend arrow, threshold colour)
Live-updating line chart component (Recharts)
Alarm badge component
Polling refresh every 30 seconds as a fallback if the WebSocket connection drops

End of Phase

Opening the dashboard shows a live, moving picture of the simulated data center. Numbers change in real time. Alarms appear when bots trigger scenarios.

Phase 4 — Environmental Monitoring

Goal: Deep visibility into temperature and humidity across all zones and racks.

Screens

Environmental Overview

Room selector (dropdown or tab)
Floor plan / rack layout — each rack colour-coded by temperature (cool blue → hot red)
Click a rack → side panel showing temp/humidity chart for last 24h
Hot/cold aisle average temperatures

Rack Detail Panel

Temperature trend (line chart, last 24h / 7d selectable)
Humidity trend
Current reading with timestamp
Threshold indicators (warning / critical bands shown on chart)

Technical

SVG floor plan component — racks as rectangles, colour interpolated from temp value
Historical data endpoint: GET /api/sensors/{id}/readings?from=&to=&interval=
Threshold configuration stored in DB, compared on ingest
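The colour interpolation behind the heatmap is a linear blend between blue and red. The arithmetic is sketched here in Python to match the other examples; the component itself would live in the frontend. The 18°C and 35°C endpoints are assumed values, not thresholds from the plan.

```python
def temp_to_rgb(temp_c: float, cool_c: float = 18.0,
                hot_c: float = 35.0) -> tuple[int, int, int]:
    """Map a temperature to an RGB colour between blue (cool) and red (hot)."""
    # Normalise to 0..1 and clamp so out-of-range readings saturate.
    t = min(1.0, max(0.0, (temp_c - cool_c) / (hot_c - cool_c)))
    # Linear blend from (0, 0, 255) to (255, 0, 0).
    return (round(255 * t), 0, round(255 * (1 - t)))
```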

End of Phase

You can see exactly which racks are running hot. Heatmap updates live. Clicking a rack shows its history.

Phase 5 — Power Monitoring

Goal: Full visibility into power consumption, distribution, and UPS health.

Screens

Power Overview

Total facility power (kW) — live gauge
PUE metric (Power Usage Effectiveness) — live, with trend
PDU breakdown — per-rack load as a bar chart
Power trend — last 24 hours area chart

UPS Status Panel

Per-unit: input voltage, output voltage, load %, battery charge %, estimated runtime
Battery health (SOH) indicator
Status badge (Online / On Battery / Fault)
Historical battery charge chart

PDU Detail

Per-rack power readings
Alert if any rack exceeds capacity threshold

Technical

PUE calculation: Total Facility Power / IT Equipment Power (computed server-side)
Gauge chart component (Recharts RadialBarChart or similar)
UPS status card component
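The PUE formula above translates directly into code. The only addition in this sketch is a guard for a zero or negative IT load, which returns None instead of dividing by zero; that behaviour is an assumption.

```python
from typing import Optional


def pue(total_facility_kw: float, it_equipment_kw: float) -> Optional[float]:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        # No IT load reported yet, so the ratio is undefined.
        return None
    return total_facility_kw / it_equipment_kw
```

For example, 150 kW of total facility power against 100 kW of IT load gives a PUE of 1.5.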

End of Phase

Full picture of power health. UPS bot scenario (MAINS_FAILURE) visibly shows battery rundown on screen.

Phase 6 — Cooling & AI Optimization Panel

Goal: Cooling unit visibility plus a simulated AI optimization engine showing energy savings.

Screens

Cooling Overview

Per-unit status: CRAC/CHILLER name, supply temp, return temp, setpoint, fan speed, state
Zone temperature vs setpoint comparison
Cooling efficiency trend

AI Optimization Panel

Toggle: AI Optimization: ON / OFF
When ON: simulated PUE improvement animation, setpoint adjustment suggestions displayed
Energy savings counter (kWh saved today, this month)
Simulated recommendation feed: "Raise setpoint in Room 2 by 1°C — estimated 3% saving"
Before/after PUE comparison chart

Technical

AI optimization is simulated: a backend service generates plausible recommendations based on current temp readings vs setpoints
Simple rule engine (if return_temp - supply_temp > X, suggest setpoint raise)
Energy savings are calculated from the delta, displayed as a running total
This is the layer that gets replaced by a real ML model in production
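The rule described above could be sketched as follows. The threshold value standing in for X is an assumption, as is the function name; the message text follows the recommendation feed example, without the savings estimate, since the plan does not define how that figure is computed.

```python
from typing import Optional

DELTA_THRESHOLD_C = 10.0  # the "X" in the rule; this value is an assumption


def recommend(room: str, supply_temp: float, return_temp: float) -> Optional[str]:
    """Suggest a setpoint raise when the return/supply delta exceeds X."""
    if return_temp - supply_temp > DELTA_THRESHOLD_C:
        return f"Raise setpoint in {room} by 1°C"
    return None
```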

End of Phase

The AI panel looks and behaves like a real optimization engine. Recommendations update as conditions change. The CRAC fault scenario visibly impacts the cooling overview.

Phase 7 — Asset Management

Goal: Know exactly what hardware is where, and manage capacity.

Screens

Rack View

Visual U-position diagram for each rack (1U–42U slots)
Each populated slot shows: device name, type, power draw
Empty slots shown as available (grey)
Click device → detail panel (model, serial, IP, status, power)

Device Inventory

Searchable/filterable table of all devices
Columns: name, type, rack, U-position, IP, status, power draw, install date
Export to CSV

Capacity Overview

Per-rack: U-space used/total, power used/allocated
Site-wide capacity summary
Highlight over-capacity racks

Technical

Rack diagram component — SVG or CSS grid, U-slots rendered from device data
Device CRUD endpoints (add/edit/remove devices)
Capacity calculation queries
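A per-rack capacity calculation might look like the sketch below. It assumes 42U racks, matching the rack view above, and a per-rack power allocation passed in by the caller; the Device fields and function name are hypothetical.

```python
from dataclasses import dataclass

RACK_HEIGHT_U = 42  # matches the 1U-42U rack diagram


@dataclass
class Device:
    name: str
    height_u: int
    power_kw: float


def rack_capacity(devices: list[Device], allocated_kw: float) -> dict:
    """Summarise U-space and power usage for one rack."""
    used_u = sum(d.height_u for d in devices)
    used_kw = sum(d.power_kw for d in devices)
    return {
        "u_used": used_u,
        "u_free": RACK_HEIGHT_U - used_u,
        "kw_used": used_kw,
        # Over capacity if either space or power exceeds its limit.
        "over_capacity": used_u > RACK_HEIGHT_U or used_kw > allocated_kw,
    }
```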

End of Phase

You can visually browse every rack, see what's installed where, and identify capacity constraints.

Phase 8 — Alarms & Events

Goal: A complete alarm management system — detection, notification, acknowledgement, history.

Screens

Active Alarms

Live list: severity (Critical / Major / Minor / Info), source, message, time raised
Acknowledge button per alarm
Filter by severity, site, room, system type

Alarm History

Searchable log of all past alarms
Resolution time, acknowledged by, notes

Alarm Rules (simple config)

View and edit threshold rules: e.g. "Rack temp > 32°C = Critical alarm"

Technical

Alarm engine in backend: on each sensor reading, check against thresholds, create alarm if breached, auto-resolve when reading returns to normal
Alarm state machine: ACTIVE → ACKNOWLEDGED → RESOLVED
WebSocket push for new alarms (red badge appears instantly)
Email notification hook (stub — wire up SMTP later)

Scenario Demo

Running python scenarios/run.py --scenario COOLING_FAILURE --rack A01:

Rack A01 temperature starts rising
Warning alarm fires at 28°C
Critical alarm fires at 32°C
Alarm appears live on dashboard
Acknowledge it → status updates
Stop scenario → temperature drops → alarm auto-resolves
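The demo sequence above can be sketched as a threshold check plus a small state transition function. The 28°C and 32°C thresholds and the three states come from this phase; resolving directly from ACTIVE when the reading recovers before anyone acknowledges is an assumption, as are the function names.

```python
from enum import Enum
from typing import Optional


class AlarmState(Enum):
    ACTIVE = "ACTIVE"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    RESOLVED = "RESOLVED"


WARNING_C = 28.0   # warning threshold from the scenario demo
CRITICAL_C = 32.0  # critical threshold from the scenario demo


def severity_for(temp_c: float) -> Optional[str]:
    """Return the alarm severity for a rack temperature, or None if normal."""
    if temp_c >= CRITICAL_C:
        return "Critical"
    if temp_c >= WARNING_C:
        return "Warning"
    return None


def next_state(state: AlarmState, temp_c: float,
               acknowledged: bool = False) -> AlarmState:
    """Advance one alarm given the latest reading and operator action."""
    if state is AlarmState.RESOLVED:
        return state
    if severity_for(temp_c) is None:
        # Reading is back to normal: auto-resolve.
        return AlarmState.RESOLVED
    if acknowledged:
        return AlarmState.ACKNOWLEDGED
    return state
```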

End of Phase

Alarm management works end-to-end. Scenarios produce realistic alarm sequences that can be demonstrated live.

Phase 9 — Reporting

Goal: Exportable summaries for management, compliance, and capacity planning.

Screens

Reports Dashboard

Pre-built report types: Energy Summary, Temperature Compliance, Uptime & Availability, Capacity Planning, Battery Health
Date range selector
Chart previews inline

Report Detail

Full chart view
Key stats summary
Export to PDF / CSV

Reports Included

Report	Content
Energy Summary	Total kWh, PUE trend, cost estimate, comparison vs prior period
Temperature Compliance	% of time within threshold per rack, worst offenders
Uptime & Availability	Alarm frequency, MTTR, critical events
Capacity Planning	Space and power utilisation per rack/room, projected headroom
Battery Health	UPS SOH trends, recommended replacements

Technical

Report query endpoints (aggregations over TimescaleDB)
Chart components reused from earlier phases
PDF export via browser print or a library like react-pdf
CSV export from table data
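As one example of a report aggregation, the Temperature Compliance figure can be approximated as the share of readings inside the allowed band. This sketch assumes evenly spaced readings, so the share of readings stands in for the share of time; in practice the aggregation would more likely run as a database query. The function name and rounding are assumptions.

```python
from typing import Optional


def compliance_pct(values: list[float], low: float, high: float) -> Optional[float]:
    """Percentage of readings within [low, high], or None if there is no data."""
    if not values:
        return None
    within = sum(1 for v in values if low <= v <= high)
    return round(100.0 * within / len(values), 1)
```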

End of Phase

Management-ready reports that look professional and pull from real (simulated) historical data.

Phase 10 — Polish & Production Hardening

Goal: Make the system enterprise-ready (secure, auditable, multi-site capable).

Security

Role-based access control: Admin, Operator, Read-only, Site Manager
Permissions enforced on both frontend routes and backend API endpoints
API rate limiting
Input validation and sanitisation throughout
HTTPS enforced
Secrets management (environment variables, never hardcoded)
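A role check on the backend could start from a simple mapping like the sketch below. The four role names come from the list above; the permission names and which role holds which permission are purely hypothetical and would need to be agreed before implementation.

```python
# Hypothetical permission names; only the role names come from the plan.
ROLE_PERMISSIONS = {
    "Admin": {"read", "acknowledge_alarm", "edit_devices", "manage_users"},
    "Site Manager": {"read", "acknowledge_alarm", "edit_devices"},
    "Operator": {"read", "acknowledge_alarm"},
    "Read-only": {"read"},
}


def is_allowed(role: str, permission: str) -> bool:
    """Return True if the role grants the permission; unknown roles get nothing."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Denying by default for unknown roles keeps a misconfigured account from gaining access by accident.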

Audit & Compliance

Audit log table: every user action recorded (who, what, when, from where)
Audit log viewer in admin panel
Data retention policy configuration

Multi-site

Site switcher in top bar
All queries scoped to selected site
Cross-site summary view for administrators

Operational

Health check endpoints
Structured logging throughout backend
Error boundary handling in frontend
Loading and empty states on all screens
Mobile-responsive layout (tablet minimum)

End of Phase

System is ready for a real pilot deployment. Security reviewed, roles working, audit trail intact.

What Comes After (Production Path)

When the mockup phases are complete, these are the additions needed to turn it into a real product:

Addition	Description
Real hardware ingestion	Replace simulator bots with real MQTT/SNMP/Modbus adapters
TimescaleDB scaling	Move to managed TimescaleDB cloud or dedicated server
Real AI engine	Replace rule-based cooling suggestions with ML model
SSO / SAML	Enterprise single sign-on via the auth provider's enterprise tier (Clerk in this stack)
Multi-tenancy	Full data isolation per customer (for SaaS model)
Mobile app	React Native app reusing component logic
Hardware onboarding	UI for registering new devices and sensors
SLA monitoring	Uptime tracking and alerting for contracted SLAs

The mockup-to-production transition is incremental — each bot gets replaced by real hardware one at a time, with zero changes to the rest of the system.

Summary

10 phases, each with a clear, testable deliverable
Simulator bots make every phase fully demonstrable with realistic data
Scenario runner lets you trigger alarm sequences on demand for demos
Production-ready architecture from day one — no throwaway work
Real hardware integration is a drop-in replacement when you're ready
