first commit

This commit is contained in:
mega 2026-03-19 11:32:17 +00:00
commit 4b98219bf7
144 changed files with 31561 additions and 0 deletions

project_plan.md Normal file

@@ -0,0 +1,424 @@
# DCIM Platform — Project Plan
**Stack:** Next.js + TypeScript + shadcn/ui | Python FastAPI | PostgreSQL + TimescaleDB | MQTT (Mosquitto) | Clerk Auth
**Approach:** Build as a production shell from day one. Simulated sensor bots feed real data pipelines. Swap bots for real hardware when ready — nothing else changes.
---
## How the Simulator Bots Work
Each bot is a small Python script that behaves exactly like a real physical sensor or device. It generates realistic data with natural variation and drift, then publishes it to the MQTT broker on the same topic a real sensor would use.
```
[Bot: Rack A01 Temp Sensor] → MQTT topic: dc/site1/room1/rack-A01/temperature
[Bot: UPS Unit 1] → MQTT topic: dc/site1/power/ups-01/status
[Bot: CRAC Unit 2] → MQTT topic: dc/site1/cooling/crac-02/status
```
The backend subscribes to those topics and stores the data. The frontend never knows the difference. When real hardware is connected, it publishes to the same topics — bots get switched off, everything else keeps working.
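
As a minimal sketch of what "publishing on the same topic a real sensor would use" could look like, the helper below builds a topic in the `dc/<site>/<room>/<rack>/<metric>` shape from the examples above and sends one JSON reading. The payload fields (`value`, `unit`, `ts`) and the function names are assumptions made for illustration; the plan does not define a payload format. `publish` is any callable taking `(topic, payload)`, for example the `publish` method of an MQTT client.

```python
import json
import time
from typing import Callable, Optional


def build_topic(site: str, room: str, rack: str, metric: str) -> str:
    """Build a topic such as dc/site1/room1/rack-A01/temperature."""
    return f"dc/{site}/{room}/{rack}/{metric}"


def build_payload(value: float, unit: str, ts: Optional[float] = None) -> str:
    """Serialise one reading as JSON (hypothetical payload format)."""
    return json.dumps({
        "value": round(value, 2),
        "unit": unit,
        "ts": ts if ts is not None else time.time(),
    })


def publish_reading(publish: Callable[[str, str], object], site: str, room: str,
                    rack: str, metric: str, value: float, unit: str,
                    ts: Optional[float] = None) -> str:
    """Send one reading through the given publish callable; return the topic used."""
    topic = build_topic(site, room, rack, metric)
    publish(topic, build_payload(value, unit, ts))
    return topic
```

Because the transport is injected, the same function can be exercised in tests with a plain list-appending lambda and pointed at a real broker later.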
Bots can also simulate **events and scenarios** for demo purposes:
- Gradual temperature rise in a rack (simulating cooling failure)
- Power load spike across a PDU
- UPS battery degradation over time
- Water leak alert trigger
- Alarm escalation sequences
---
## Phase Overview
| Phase | Name | Deliverable |
|---|---|---|
| 1 | Foundation | Running skeleton — frontend, backend, auth, DB all connected |
| 2 | Data Pipeline + Bots | MQTT broker, simulator bots, data flowing into DB |
| 3 | Core Dashboard | Live overview dashboard pulling real simulated data |
| 4 | Environmental Monitoring | Temperature/humidity views, heatmaps |
| 5 | Power Monitoring | PDU, UPS, PUE tracking |
| 6 | Cooling & AI Panel | CRAC status, simulated AI optimization |
| 7 | Asset Management | Rack views, device inventory |
| 8 | Alarms & Events | Live alarm feed, acknowledgement, escalation |
| 9 | Reporting | Charts, summaries, export |
| 10 | Polish & Hardening | RBAC, audit log, multi-site, production readiness |
---
## Phase 1 — Foundation
**Goal:** Every layer of the stack is running and connected. No real features yet, just a working skeleton.
### Tasks
- [ ] Initialise Next.js project with TypeScript
- [ ] Set up shadcn/ui component library
- [ ] Set up Python FastAPI project structure
- [ ] Connect Clerk authentication (login, logout, protected routes)
- [ ] Provision PostgreSQL database (local via Docker)
- [ ] Basic API route: frontend calls backend, gets a response
- [ ] Docker Compose file running: frontend + backend + database together
- [ ] Placeholder layout: sidebar nav, top bar, main content area
### Folder Structure
```
/dcim
/frontend ← Next.js app
/app
/components
/lib
/backend ← FastAPI app
/api
/models
/services
/simulators ← Sensor bots (Python scripts)
/infra ← Docker Compose, config files
docker-compose.yml
```
### End of Phase
You can log in, see a blank dashboard shell, and the backend responds to API calls. Database is running.
---
## Phase 2 — Data Pipeline & Simulator Bots
**Goal:** Simulated sensor data flows continuously from bots → MQTT → backend → database. The system behaves as if real hardware is connected.
### Infrastructure
- [ ] Add Mosquitto MQTT broker to Docker Compose
- [ ] Add TimescaleDB extension to PostgreSQL
- [ ] Create hypertable for sensor readings (time-series optimised)
### Database Schema (core tables)
```
sites — site name, location, timezone
rooms — belongs to site, physical room
racks — belongs to room, U-height, position
devices — belongs to rack, type, model, serial
sensors — belongs to device or rack, sensor type
readings — (TimescaleDB hypertable) sensor_id, timestamp, value
alarms — sensor_id, severity, message, state, acknowledged_at
```
### Backend Data Ingestion
- [ ] MQTT subscriber service (Python) — listens to all sensor topics
- [ ] Parses incoming messages, validates, writes to `readings` table
- [ ] WebSocket endpoint — streams latest readings to connected frontends
- [ ] REST endpoints — historical data queries, aggregations
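
The parse-and-validate step of the subscriber could be sketched roughly as follows. It assumes five-segment topics like the examples earlier in this plan and a JSON body with numeric `value` and `ts` fields; both the payload shape and the `Reading` type are hypothetical. Anything malformed is rejected by returning `None` rather than raising, so one bad message cannot stop the ingest loop.

```python
import json
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    site: str
    source: str   # middle topic segments, e.g. "room1/rack-A01"
    metric: str   # last topic segment, e.g. "temperature"
    value: float
    ts: float


def parse_message(topic: str, payload: bytes) -> Optional[Reading]:
    """Turn one MQTT message into a Reading, or None if it is invalid."""
    parts = topic.split("/")
    if len(parts) != 5 or parts[0] != "dc":
        return None
    try:
        body = json.loads(payload)
        value = float(body["value"])
        ts = float(body["ts"])
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, missing fields, and non-numeric values.
        return None
    if not (math.isfinite(value) and math.isfinite(ts)):
        return None
    return Reading(site=parts[1], source=f"{parts[2]}/{parts[3]}",
                   metric=parts[4], value=value, ts=ts)
```

Multi-field messages such as a UPS `status` payload would need their own parser; this sketch only covers single-value readings.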
### Simulator Bots
Each bot runs as an independent Python process, configurable via a simple config file.
**Bot: Temperature/Humidity (per rack)**
- Publishes every 30 seconds
- Base temperature: 22–26°C with ±0.5°C natural drift
- Humidity: 40–55% RH with slow drift
- Scenario: `COOLING_FAILURE` — temperature rises 0.3°C/min until alarm threshold
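
One way the temperature bot's value generator might look, as a sketch only: a base temperature with ±0.5°C of random variation, plus a `COOLING_FAILURE` mode that adds 0.3°C per minute, scaled to the 30-second publish interval. The class name, the 32°C ceiling, and the use of uniform noise are assumptions, not part of the plan.

```python
import random


class RackTempBot:
    """Simulated rack temperature source (illustrative sketch)."""

    def __init__(self, base_c=24.0, drift_c=0.5, interval_s=30,
                 alarm_threshold_c=32.0, seed=None):
        self.base_c = base_c                        # somewhere in the 22-26 C band
        self.drift_c = drift_c                      # +/- variation per reading
        self.interval_s = interval_s                # publish interval in seconds
        self.alarm_threshold_c = alarm_threshold_c  # assumed ceiling for the scenario
        self.rise_c = 0.0                           # accumulated scenario rise
        self.cooling_failure = False
        self._rng = random.Random(seed)

    def next_reading(self) -> float:
        """Return the next temperature in degrees C."""
        if self.cooling_failure and self.base_c + self.rise_c < self.alarm_threshold_c:
            # 0.3 C per minute, converted to one publish interval.
            self.rise_c += 0.3 * self.interval_s / 60.0
        noise = self._rng.uniform(-self.drift_c, self.drift_c)
        return round(self.base_c + self.rise_c + noise, 2)
```

Setting `bot.cooling_failure = True` starts the ramp; with the defaults, ten readings (five simulated minutes) add about 1.5°C.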
**Bot: PDU Power Monitor (per rack)**
- Publishes every 60 seconds
- Load: 2–8 kW per rack, fluctuates with simulated workload patterns
- Simulates day/night load patterns (higher load 9am–6pm)
- Scenario: `POWER_SPIKE` — sudden 40% load increase
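
The day/night pattern and the `POWER_SPIKE` scenario reduce to a small piece of arithmetic. The sketch below assumes a flat night-time base load and a fixed daytime uplift between 09:00 and 18:00; the specific kW figures and the function name are illustrative, chosen to stay inside the per-rack range given above.

```python
def pdu_load_kw(hour: float, base_kw: float = 4.0, day_boost_kw: float = 2.0,
                spike: bool = False) -> float:
    """Return the simulated rack load in kW for a given hour of day (0-24)."""
    load = base_kw
    if 9 <= hour < 18:
        # Business hours carry the higher simulated workload.
        load += day_boost_kw
    if spike:
        # POWER_SPIKE scenario: sudden 40% increase over the current load.
        load *= 1.4
    return load
```

For example, `pdu_load_kw(3)` gives the night-time base, `pdu_load_kw(12)` adds the daytime uplift, and `pdu_load_kw(12, spike=True)` applies the 40% increase on top of that. A real bot would add random fluctuation around these values.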
**Bot: UPS Unit**
- Publishes every 60 seconds
- Input/output voltage, load percentage, battery charge, runtime estimate
- Battery health (SOH) degrades slowly over simulated time
- Scenario: `MAINS_FAILURE` — switches to battery, runtime counts down
**Bot: CRAC/Cooling Unit**
- Publishes every 60 seconds
- Supply/return air temperature, setpoint, fan speed, compressor state
- Responds to rack temperature increases (simulated feedback loop)
- Scenario: `UNIT_FAULT` — unit goes offline, temperature in zone starts rising
**Bot: Water Leak Sensor**
- Normally silent (no leak)
- Scenario: `LEAK_DETECTED` — publishes alert, alarm triggers
**Bot: Battery Cell Monitor**
- Cell voltages, internal resistance per cell
- Scenario: `CELL_DEGRADATION` — one cell's resistance rises, SOH drops
### Scenario Runner
- [ ] Simple CLI script: `python scenarios/run.py --scenario COOLING_FAILURE --rack A01`
- [ ] Useful for demos — trigger realistic alarm sequences on demand
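
A possible skeleton for `scenarios/run.py`, using the standard-library `argparse` module. How the runner actually signals the bots is not specified in this plan; the sketch assumes it publishes a JSON control message to a hypothetical `dc/control/scenarios` topic that the bots listen on. The scenario names are the ones listed for the bots above.

```python
import argparse
import json
from typing import List, Optional, Tuple

SCENARIOS = [
    "CELL_DEGRADATION", "COOLING_FAILURE", "LEAK_DETECTED",
    "MAINS_FAILURE", "POWER_SPIKE", "UNIT_FAULT",
]

CONTROL_TOPIC = "dc/control/scenarios"  # hypothetical control topic


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse --scenario and --rack from the command line (or a given list)."""
    parser = argparse.ArgumentParser(description="Trigger a simulator scenario")
    parser.add_argument("--scenario", required=True, choices=SCENARIOS)
    parser.add_argument("--rack", help="target rack, e.g. A01")
    return parser.parse_args(argv)


def control_message(args: argparse.Namespace) -> Tuple[str, str]:
    """Build the (topic, payload) pair a bot would react to."""
    payload = json.dumps({"scenario": args.scenario, "rack": args.rack,
                          "action": "start"})
    return CONTROL_TOPIC, payload
```

A `main()` wrapper would call `parse_args()`, pass the result to `control_message()`, and hand the pair to the MQTT client. Unknown scenario names are rejected by `argparse` before anything is published.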
### End of Phase
Bots are running, data is flowing into the database every 30–60 seconds across a simulated data center with ~20 racks, 2 rooms, 1 site. You can query the database and see live readings.
---
## Phase 3 — Core Dashboard
**Goal:** The main overview screen, live-updating, showing the health of the entire facility at a glance.
### Screens
**Site Overview Dashboard**
- Facility health score (aggregate status)
- KPI cards: Total Power (kW), PUE, Avg Temperature, Active Alarms count
- Active alarm feed (live, colour-coded by severity)
- Power trend chart (last 24 hours)
- Temperature trend chart (last 24 hours)
- Room status summary (green/amber/red per room)
### Technical
- [ ] WebSocket hook in frontend — subscribes to live data stream
- [ ] KPI card component (value, trend arrow, threshold colour)
- [ ] Live-updating line chart component (Recharts)
- [ ] Alarm badge component
- [ ] Auto-refresh every 30 seconds as fallback
### End of Phase
Opening the dashboard shows a live, moving picture of the simulated data center. Numbers change in real time. Alarms appear when bots trigger scenarios.
---
## Phase 4 — Environmental Monitoring
**Goal:** Deep visibility into temperature and humidity across all zones and racks.
### Screens
**Environmental Overview**
- Room selector (dropdown or tab)
- Floor plan / rack layout — each rack colour-coded by temperature (cool blue → hot red)
- Click a rack → side panel showing temp/humidity chart for last 24h
- Hot/cold aisle average temperatures
**Rack Detail Panel**
- Temperature trend (line chart, last 24h / 7d selectable)
- Humidity trend
- Current reading with timestamp
- Threshold indicators (warning / critical bands shown on chart)
### Technical
- [ ] SVG floor plan component — racks as rectangles, colour interpolated from temp value
- [ ] Historical data endpoint: `GET /api/sensors/{id}/readings?from=&to=&interval=`
- [ ] Threshold configuration stored in DB, compared on ingest
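
The `interval` parameter on the historical readings endpoint implies some form of downsampling. The sketch below shows the idea in plain Python: group `(timestamp, value)` pairs into fixed-width buckets and average each one. The function name and the choice of a mean are assumptions; in the real stack this aggregation would more likely be pushed into the database (TimescaleDB provides a `time_bucket` function for this purpose).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_average(readings: Iterable[Tuple[float, float]],
                   interval_s: int) -> List[Tuple[int, float]]:
    """Average (ts, value) readings per interval_s-wide bucket, ordered by time."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in readings:
        # Align each timestamp to the start of its bucket.
        start = int(ts // interval_s) * interval_s
        buckets[start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]
```

With a 24h view at 30-second resolution this turns 2,880 points per sensor into, say, 288 five-minute averages, which is far easier for a chart to render.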
### End of Phase
You can see exactly which racks are running hot. Heatmap updates live. Clicking a rack shows its history.
---
## Phase 5 — Power Monitoring
**Goal:** Full visibility into power consumption, distribution, and UPS health.
### Screens
**Power Overview**
- Total facility power (kW) — live gauge
- PUE metric (Power Usage Effectiveness) — live, with trend
- PDU breakdown — per-rack load as a bar chart
- Power trend — last 24 hours area chart
**UPS Status Panel**
- Per-unit: input voltage, output voltage, load %, battery charge %, estimated runtime
- Battery health (SOH) indicator
- Status badge (Online / On Battery / Fault)
- Historical battery charge chart
**PDU Detail**
- Per-rack power readings
- Alert if any rack exceeds capacity threshold
### Technical
- [ ] PUE calculation: Total Facility Power / IT Equipment Power (computed server-side)
- [ ] Gauge chart component (Recharts RadialBarChart or similar)
- [ ] UPS status card component
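
The PUE calculation itself is a single division, but it needs a guard for the case where no IT load has been reported yet. A minimal sketch, with a hypothetical function name:

```python
from typing import Optional


def compute_pue(total_facility_kw: float, it_equipment_kw: float) -> Optional[float]:
    """Return Total Facility Power / IT Equipment Power, or None if undefined."""
    if it_equipment_kw <= 0:
        # No IT load reported yet: PUE is undefined, so avoid dividing by zero.
        return None
    return total_facility_kw / it_equipment_kw
```

For instance, 150 kW at the facility level with 100 kW of IT load gives a PUE of 1.5. In the simulated site, the IT figure would come from summing the PDU bot readings and the facility figure from adding cooling and other overhead on top.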
### End of Phase
Full picture of power health. UPS bot scenario (`MAINS_FAILURE`) visibly shows battery rundown on screen.
---
## Phase 6 — Cooling & AI Optimization Panel
**Goal:** Cooling unit visibility plus a simulated AI optimization engine showing energy savings.
### Screens
**Cooling Overview**
- Per-unit status: CRAC/CHILLER name, supply temp, return temp, setpoint, fan speed, state
- Zone temperature vs setpoint comparison
- Cooling efficiency trend
**AI Optimization Panel**
- Toggle: `AI Optimization: ON / OFF`
- When ON: simulated PUE improvement animation, setpoint adjustment suggestions displayed
- Energy savings counter (kWh saved today, this month)
- Simulated recommendation feed: "Raise setpoint in Room 2 by 1°C — estimated 3% saving"
- Before/after PUE comparison chart
### Technical
- [ ] AI optimization is simulated: a backend service generates plausible recommendations based on current temp readings vs setpoints
- [ ] Simple rule engine (if return_temp - supply_temp > X, suggest setpoint raise)
- [ ] Energy savings are calculated from the delta, displayed as a running total
- [ ] This is the layer that gets replaced by a real ML model in production
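
The rule described above (`return_temp - supply_temp > X`) could be sketched as below. The threshold `X` (10°C here), the 1°C step, the message wording, and the `CracStatus` type are all placeholders chosen for illustration; the plan leaves them open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CracStatus:
    name: str
    supply_c: float    # supply air temperature
    return_c: float    # return air temperature
    setpoint_c: float


def recommend(unit: CracStatus, delta_threshold_c: float = 10.0,
              step_c: float = 1.0) -> Optional[str]:
    """Suggest a setpoint raise when the return-supply delta exceeds the threshold."""
    delta = unit.return_c - unit.supply_c
    if delta > delta_threshold_c:
        return (f"Raise setpoint on {unit.name} by {step_c:g}°C "
                f"(return-supply delta {delta:.1f}°C)")
    return None
```

Keeping the rule behind a single function with a plain input type is what would let a real ML model replace it later without touching the panel that displays the recommendations.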
### End of Phase
The AI panel looks and behaves like a real optimization engine. Recommendations update as conditions change. The CRAC fault scenario visibly impacts the cooling overview.
---
## Phase 7 — Asset Management
**Goal:** Know exactly what hardware is where, and manage capacity.
### Screens
**Rack View**
- Visual U-position diagram for each rack (1U–42U slots)
- Each populated slot shows: device name, type, power draw
- Empty slots shown as available (grey)
- Click device → detail panel (model, serial, IP, status, power)
**Device Inventory**
- Searchable/filterable table of all devices
- Columns: name, type, rack, U-position, IP, status, power draw, install date
- Export to CSV
**Capacity Overview**
- Per-rack: U-space used/total, power used/allocated
- Site-wide capacity summary
- Highlight over-capacity racks
### Technical
- [ ] Rack diagram component — SVG or CSS grid, U-slots rendered from device data
- [ ] Device CRUD endpoints (add/edit/remove devices)
- [ ] Capacity calculation queries
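
The per-rack capacity figures can be derived from the device list alone. The sketch below assumes a 42U rack and an 8 kW power allocation as defaults; the `Device` type, field names, and the over-capacity rule are illustrative. In the real system this would probably be a SQL aggregation over the `devices` table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Device:
    name: str
    u_height: int    # rack units occupied
    power_w: float   # power draw in watts


def rack_capacity(devices: List[Device], total_u: int = 42,
                  allocated_w: float = 8000.0) -> Dict[str, object]:
    """Summarise space and power usage for one rack."""
    used_u = sum(d.u_height for d in devices)
    used_w = sum(d.power_w for d in devices)
    return {
        "u_used": used_u,
        "u_free": total_u - used_u,
        "power_used_w": used_w,
        "power_allocated_w": allocated_w,
        # Flag the rack if either space or power exceeds what is available.
        "over_capacity": used_u > total_u or used_w > allocated_w,
    }
```

The `over_capacity` flag is what the Capacity Overview screen would use to highlight a rack.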
### End of Phase
You can visually browse every rack, see what's installed where, and identify capacity constraints.
---
## Phase 8 — Alarms & Events
**Goal:** A complete alarm management system — detection, notification, acknowledgement, history.
### Screens
**Active Alarms**
- Live list: severity (Critical / Major / Minor / Info), source, message, time raised
- Acknowledge button per alarm
- Filter by severity, site, room, system type
**Alarm History**
- Searchable log of all past alarms
- Resolution time, acknowledged by, notes
**Alarm Rules** (simple config)
- View and edit threshold rules: e.g. "Rack temp > 32°C = Critical alarm"
### Technical
- [ ] Alarm engine in backend: on each sensor reading, check against thresholds, create alarm if breached, auto-resolve when reading returns to normal
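- [ ] Alarm state machine: `ACTIVE → ACKNOWLEDGED → RESOLVED` (an alarm that auto-resolves before anyone acknowledges it goes straight from `ACTIVE` to `RESOLVED`)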
- [ ] WebSocket push for new alarms (red badge appears instantly)
- [ ] Email notification hook (stub — wire up SMTP later)
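
The alarm state machine can be expressed as a small transition table. This sketch uses the three states named above and additionally assumes that an unacknowledged alarm may auto-resolve directly (`ACTIVE` to `RESOLVED`), which follows from the auto-resolve behaviour of the alarm engine but is not spelled out in the plan. The names `ALLOWED` and `transition` are hypothetical.

```python
from enum import Enum


class AlarmState(Enum):
    ACTIVE = "ACTIVE"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    RESOLVED = "RESOLVED"


# Permitted next states for each state. ACTIVE -> RESOLVED is an assumption
# covering alarms that clear before an operator acknowledges them.
ALLOWED = {
    AlarmState.ACTIVE: {AlarmState.ACKNOWLEDGED, AlarmState.RESOLVED},
    AlarmState.ACKNOWLEDGED: {AlarmState.RESOLVED},
    AlarmState.RESOLVED: set(),
}


def transition(current: AlarmState, target: AlarmState) -> AlarmState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal alarm transition: {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves in one place means the acknowledge endpoint and the auto-resolve path cannot leave an alarm in an inconsistent state, for example re-activating one that is already resolved.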
### Scenario Demo
Running `python scenarios/run.py --scenario COOLING_FAILURE --rack A01`:
1. Rack A01 temperature starts rising
2. Warning alarm fires at 28°C
3. Critical alarm fires at 32°C
4. Alarm appears live on dashboard
5. Acknowledge it → status updates
6. Stop scenario → temperature drops → alarm auto-resolves
### End of Phase
Alarm management works end-to-end. Scenarios produce realistic alarm sequences that can be demonstrated live.
---
## Phase 9 — Reporting
**Goal:** Exportable summaries for management, compliance, and capacity planning.
### Screens
**Reports Dashboard**
- Pre-built report types: Energy Summary, Temperature Compliance, Uptime & Availability, Capacity Planning, Battery Health
- Date range selector
- Chart previews inline
**Report Detail**
- Full chart view
- Key stats summary
- Export to PDF / CSV
### Reports Included
| Report | Content |
|---|---|
| Energy Summary | Total kWh, PUE trend, cost estimate, comparison vs prior period |
| Temperature Compliance | % of time within threshold per rack, worst offenders |
| Uptime & Availability | Alarm frequency, MTTR, critical events |
| Capacity Planning | Space and power utilisation per rack/room, projected headroom |
| Battery Health | UPS SOH trends, recommended replacements |
### Technical
- [ ] Report query endpoints (aggregations over TimescaleDB)
- [ ] Chart components reused from earlier phases
- [ ] PDF export via browser print or a library like `@react-pdf/renderer`
- [ ] CSV export from table data
### End of Phase
Management-ready reports that look professional and pull from real (simulated) historical data.
---
## Phase 10 — Polish & Production Hardening
**Goal:** Make the system genuinely enterprise-ready: secure, auditable, and multi-site capable.
### Security
- [ ] Role-based access control: `Admin`, `Operator`, `Read-only`, `Site Manager`
- [ ] Permissions enforced on both frontend routes and backend API endpoints
- [ ] API rate limiting
- [ ] Input validation and sanitisation throughout
- [ ] HTTPS enforced
- [ ] Secrets management (environment variables, never hardcoded)
### Audit & Compliance
- [ ] Audit log table: every user action recorded (who, what, when, from where)
- [ ] Audit log viewer in admin panel
- [ ] Data retention policy configuration
### Multi-site
- [ ] Site switcher in top bar
- [ ] All queries scoped to selected site
- [ ] Cross-site summary view for administrators
### Operational
- [ ] Health check endpoints
- [ ] Structured logging throughout backend
- [ ] Error boundary handling in frontend
- [ ] Loading and empty states on all screens
- [ ] Mobile-responsive layout (tablet minimum)
### End of Phase
System is ready for a real pilot deployment. Security reviewed, roles working, audit trail intact.
---
## What Comes After (Production Path)
When the mockup phases are complete, these are the additions needed to turn it into a real product:
| Addition | Description |
|---|---|
| Real hardware ingestion | Replace simulator bots with real MQTT/SNMP/Modbus adapters |
| TimescaleDB scaling | Move to managed TimescaleDB cloud or dedicated server |
| Real AI engine | Replace rule-based cooling suggestions with ML model |
| SSO / SAML | Enterprise single sign-on (SAML) through the auth provider's enterprise tier (Clerk, per the stack above) |
| Multi-tenancy | Full data isolation per customer (for SaaS model) |
| Mobile app | React Native app reusing component logic |
| Hardware onboarding | UI for registering new devices and sensors |
| SLA monitoring | Uptime tracking and alerting for contracted SLAs |
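The mockup-to-production transition is incremental: each bot is replaced by real hardware one at a time, and as long as the hardware (or its adapter) publishes to the same MQTT topics in the same payload format, the rest of the system needs no changes.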
---
## Summary
- **10 phases**, each with a clear, testable deliverable
- **Simulator bots** make every phase fully demonstrable with realistic data
- **Scenario runner** lets you trigger alarm sequences on demand for demos
- **Production-ready architecture** from day one — no throwaway work
- Real hardware integration is a drop-in replacement when you're ready