Skip to content

Latest commit

 

History

History
200 lines (156 loc) · 4.85 KB

README.md

File metadata and controls

200 lines (156 loc) · 4.85 KB

DanteGPU - GPU Share VM Manager

DanteGPU is a sophisticated virtual machine management system designed specifically for AI workload distribution and GPU resource sharing. Built with Rust, it provides a robust, high-performance solution for managing VMs with GPU passthrough capabilities.

flames

Overview

DanteGPU serves as the core component of the GPU Share Platform, offering:

  • VM lifecycle management with GPU passthrough
  • Real-time resource monitoring
  • Automated GPU management
  • RESTful API interface
  • CLI tools for system management

Key Features

VM Management

  • Full lifecycle control (create, start, stop, delete)
  • GPU passthrough support
  • Resource allocation optimization
  • Template-based VM creation
  • Automated recovery mechanisms

GPU Management

  • Automated device discovery
  • Dynamic GPU allocation
  • Multi-vendor support (NVIDIA, AMD)
  • Performance metrics tracking
  • Resource isolation

Monitoring System

  • Real-time resource tracking
  • Performance metrics collection
  • GPU utilization monitoring
  • Memory usage tracking
  • Temperature and power monitoring

API & CLI Interface

  • RESTful API endpoints
  • Git-style CLI commands
  • Colored terminal output
  • Async command processing
  • Comprehensive error handling

🔧 Technical Architecture

Core Components

  1. Configuration Management

    • Hierarchical config system
    • Multiple override layers
    • Environment variable support
    • TOML-based configuration
    • Secure secrets handling
  2. CLI System

    gpu-share
    ├── serve [--port]          # API server management
    ├── vm                      # VM operations
    │   ├── list               # List all VMs
    │   ├── create             # Create new VM
    │   ├── start              # Start VM
    │   ├── stop               # Stop VM
    │   └── delete             # Remove VM
    ├── gpu                     # GPU management
    │   ├── list               # List GPUs
    │   ├── attach             # Attach GPU to VM
    │   └── detach             # Detach GPU from VM
    └── init                    # Generate config
  3. API Endpoints

    • /api/v1/vms - VM management
    • /api/v1/gpus - GPU operations
    • /api/v1/metrics - Performance metrics
    • RESTful design principles
    • JSON payload support
  4. Monitoring System

    • Resource metrics collection
    • Performance tracking
    • Health monitoring
    • Metrics retention management
    • Real-time alerts

🛠 Prerequisites

  • System Requirements

    • Linux kernel with IOMMU support
    • QEMU/KVM virtualization
    • Libvirt daemon
    • Compatible GPU (NVIDIA/AMD)
    • Rust toolchain (latest stable)
  • Optional Components

    • NVIDIA driver (for NVIDIA GPUs)
    • AMD driver (for AMD GPUs)
    • Docker (for containerized deployment)

📦 Installation

  1. System Setup

    # Install dependencies
    sudo apt install qemu-kvm libvirt-daemon-system
    
    # Clone repository
    git clone https://github.com/yourusername/gpu-share-vm-manager
    cd gpu-share-vm-manager
    
    # Build project
    cargo build --release
  2. Configuration

    # Generate default config
    ./target/release/gpu-share init
    
    # Edit configuration (optional)
    vim config/default.toml
  3. Start Service

    # Run API server
    ./target/release/gpu-share serve --port 3000

Security Considerations

  • Input validation on all endpoints
  • Resource limits enforcement
  • Secure configuration management
  • Environment variable protection
  • API authentication (coming soon)
  • Resource isolation

Usage Examples

# Create new VM with GPU
gpu-share vm create --name ai-worker-01 --memory 8192 --vcpus 4 --gpu

# List available GPUs
gpu-share gpu list

# Attach GPU to VM
gpu-share gpu attach --vm-name ai-worker-01 --gpu-id 0

🔍 Monitoring & Metrics

  • CPU usage tracking
  • Memory utilization
  • GPU metrics
    • Utilization percentage
    • Memory usage
    • Temperature
    • Power consumption
  • Performance analytics
  • Resource optimization

🤝 Contributing

We welcome contributions! Please see our CONTRIBUTING.md for guidelines.

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

📝 License

MIT License

Project Status

Currently in active development. Features being worked on:

  • Enhanced GPU scheduling
  • Multi-node support
  • Advanced monitoring
  • Security enhancements
  • Performance optimizations

📚 Documentation

Full documentation available in /docs:

  • Installation Guide
  • Configuration Reference
  • API Documentation
  • Development Guide
  • Security Guidelines

Remember: With great GPU power comes great electricity bills! 🔋