You’re basically asking: “How do I make one 3060 feel like a well-governed shared resource, not a bar fight?” Let’s give it a small GPU constitution.

---

### 1. Decide a hard VRAM budget for each side

On a 12 GB 3060, your Ollama logs show only ~1.2 GB used for the model, KV cache, and compute graph, so Ollama itself is cheap. The real danger is the hypergraph engine grabbing big chunks for its own kernels and pushing the GPU into OOM.

Set explicit budgets:

- **Ollama:** keep current model/GPULayers config; assume ~2 GB max (model + transient spikes).
- **Hypergraph engine:** design it to **never exceed ~8–9 GB** VRAM.
- **Safety margin:** leave **1–2 GB free** at all times to absorb fragmentation and spikes.

You enforce this not by magic flags, but by **how much you allocate** and by **when** you run heavy kernels.
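
One way to make that concrete: pin the budget in code as named constants instead of scattered magic numbers. A minimal sketch in Go; the struct name and exact figures are assumptions for a 12 GB 3060, meant to be tuned against your own logs:

```go
// VRAMBudget pins the split above in one place. All values are MB; the
// numbers below are assumptions for a 12 GB RTX 3060, not measured facts.
type VRAMBudget struct {
	TotalMB      int // physical VRAM
	OllamaMB     int // standing reservation for the runner
	HypergraphMB int // ceiling for engine allocations
	SafetyMB     int // never-touch margin for fragmentation and spikes
}

var rtx3060 = VRAMBudget{
	TotalMB:      12288,
	OllamaMB:     2048,
	HypergraphMB: 8704, // ~8.5 GB
	SafetyMB:     1536, // ~1.5 GB kept free at all times
}
```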
An adaptive approach is in order here, because this should eventually scale beyond one card: the goal is for someone to be able to run it on an NVIDIA DGX.
So let’s grow the plan from “one 3060 under my desk” to “a DGX in a lab” by turning the static budget idea into an **adaptive, portable scheduling pattern**.

---

### 2. Core pattern: policy + telemetry + arbiter

Same three building blocks everywhere:

- **Policy:** who gets how much GPU, and who wins under contention.
- **Telemetry:** what’s actually happening (VRAM used, active jobs, QoS).
- **Arbiter:** the thing that turns policy + telemetry into yes/no decisions.

On your 3060 this can be a tiny local process; on DGX it becomes a proper cluster scheduler (SLURM/Kubernetes) plus NVIDIA tooling (DCGM, MIG, etc.).  
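
In code, the portable part is just a couple of small interfaces; everything backend-specific hides behind them. A sketch in Go (all names are illustrative, not an existing API):

```go
// Telemetry reports what the GPU is actually doing. On a 3060 this can
// shell out to nvidia-smi; on DGX it can query DCGM or MIG inventories.
type Telemetry interface {
	UsedMB() (int, error)
}

// Lease is the unit of policy: a sized, prioritized, time-bounded claim.
type Lease struct {
	ID         string
	RequiredMB int
	Priority   string // e.g. "interactive" vs "batch"
	TTLMs      int
}

// Arbiter turns policy + telemetry into yes/no decisions.
type Arbiter interface {
	Request(l Lease) (granted bool, err error)
	Release(id string) error
}
```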

---

### 3. Adaptive policy on a single 3060

Instead of hard-coding “Ollama 2 GB, hypergraph 8–9 GB,” make those **targets**, not absolutes:

- **Measure:** poll `nvidia-smi` or NVML every second for `memory.used`.
- **Ask for a lease:** before a heavy phase, the hypergraph engine calls your arbiter:
  - “I’d like 6000 MB for ~3 s, priority=normal.”
- **Arbiter decides:**
  - If `used + requested + safety_margin <= total`, grant.
  - Else:
    - deny (engine queues), or  
    - downgrade (engine runs a smaller batch/graph).

Ollama is treated as a **standing reservation**: if the runner is up, the arbiter assumes ~2 GB is “owned,” even if it is currently idle. That’s your adaptive layer: the engine scales its batch size / problem size based on what the arbiter grants, not on a fixed number.
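
The grant rule itself fits in a few lines. A sketch reusing the hypothetical `VRAMBudget` struct from section 1:

```go
// canGrant applies the lease rule: measured use (or the standing
// reservation, whichever is larger) plus the request plus the safety
// margin must fit under physical VRAM. Counting the reservation even
// while Ollama is idle means a grant can never be invalidated by the
// runner waking up mid-lease.
func canGrant(measuredMB, requestedMB int, b VRAMBudget) bool {
	committed := measuredMB
	if committed < b.OllamaMB {
		committed = b.OllamaMB
	}
	return committed+requestedMB+b.SafetyMB <= b.TotalMB
}
```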

This gives you the same conceptual contract you’d have on a DGX node, just implemented locally.

---

### 4. How this generalizes to DGX

On DGX you get more powerful primitives, but the pattern is identical.

#### a) MIG for hard isolation

On A100/H100-based DGX systems, **Multi‑Instance GPU (MIG)** can carve a physical GPU into multiple isolated slices, each with its own dedicated memory and compute.  

You can encode your policy as:

- “LLM inference gets one MIG slice (e.g., 20 GB).”
- “Hypergraph jobs get one or more other slices.”
- “No one can starve anyone else, because the hardware enforces the partition.”

Your arbiter then operates at the **MIG-slice level** instead of raw VRAM.
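
Once slices exist, pinning a process to one is mostly environment plumbing: CUDA honors MIG device UUIDs in `CUDA_VISIBLE_DEVICES`. A sketch; the UUID below is a placeholder, and real ones come from `nvidia-smi -L`:

```go
import (
	"os"
	"os/exec"
)

// launchOnSlice starts a workload pinned to a single MIG slice by
// exporting that slice's UUID. The UUID here is a placeholder; list
// the real ones on your system with `nvidia-smi -L`.
func launchOnSlice(bin string, args ...string) error {
	cmd := exec.Command(bin, args...)
	cmd.Env = append(os.Environ(),
		"CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-placeholder-uuid")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}
```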

#### b) Time-slicing / MPS for soft sharing

Where MIG isn’t used (or you want more elasticity), you can rely on:

- **Time-slicing:** multiple processes share a GPU by taking turns; good for many small jobs.  
- **MPS (Multi‑Process Service):** lets multiple CUDA processes share a GPU more efficiently by multiplexing kernels; good for throughput-oriented workloads.

Your policy becomes:

- “LLM pods are high priority, latency-sensitive.”
- “Hypergraph jobs are batch, can be preempted or slowed.”

The arbiter is now your **cluster scheduler** (SLURM, Kubernetes with the NVIDIA device plugin, etc.), which already knows how to enforce priorities and quotas on DGX systems.

---

### 5. Concrete migration path from 3060 → DGX

**On your 3060 now:**

- Implement a **local GPU arbiter**:
  - NVML-based telemetry.
  - Simple lease API (`request_lease`, `release_lease`).
  - Policy: Ollama reserved; hypergraph adapts.
- Make the **hypergraph engine batch size / graph size a function of granted MB**, not a constant.
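
A minimal sketch of that sizing rule; the per-item VRAM cost is a calibration you’d measure yourself, not a known constant:

```go
// batchSizeFor derives the engine's batch size from an arbiter grant
// instead of a compile-time constant. perItemMB is the measured VRAM
// cost of one unit of work (an assumption to calibrate empirically).
func batchSizeFor(grantedMB, perItemMB, maxBatch int) int {
	n := grantedMB / perItemMB
	if n > maxBatch {
		n = maxBatch
	}
	if n < 1 {
		n = 0 // grant too small to be useful: queue instead of thrashing
	}
	return n
}
```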

**On DGX later:**

- Map the same concepts onto:
  - **MIG slices** for hard partitions (LLM vs hypergraph).
  - Or **Kubernetes/SLURM queues** with priorities and GPU quotas.
- Keep your **lease protocol and ledger** the same—only the implementation behind it changes (local arbiter → scheduler plugin / sidecar).

From your operator’s point of view, it’s always:

> “Ask for GPU capacity with a declared size and priority; obey the answer.”

---

The rest of this note fleshes out the single-GPU side of that contract: the arbiter process itself, the scheduling patterns it enables, and the concrete knobs for your current setup. On DGX, the same arbiter survives by swapping its `nvidia-smi` backend for DCGM/MIG-aware queries.

---

### 6. Add a tiny “GPU arbiter” process

Instead of both systems touching the GPU whenever they feel like it, make them ask permission.

**Arbiter responsibilities:**

- **Track current VRAM usage** via `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` (or NVML bindings).
- **Expose a local API** (Unix socket / localhost HTTP / gRPC) with:
  - `request_lease(required_mb, priority, ttl_ms)`
  - `release_lease(id)`
- **Policy:**  
  - Grant a lease only if `used + required_mb + safety_margin <= total`.  
  - Optionally prioritize: interactive LLM > batch hypergraph jobs, or vice versa.

Both the Ollama-side code and the hypergraph engine call this arbiter before launching heavy work. If no lease is granted, they **wait or degrade** (smaller batch, smaller graph, etc.).
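
A minimal sketch of such an arbiter in Go: telemetry via the `nvidia-smi` query above, a localhost HTTP lease API, and the budget numbers from section 1. The port, JSON shape, and the simplified release path are assumptions; a real version would hand out lease IDs and expire them by TTL:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os/exec"
	"strconv"
	"strings"
	"sync"
)

const (
	totalMB  = 12288 // 12 GB RTX 3060
	ollamaMB = 2048  // standing reservation while the runner is up
	safetyMB = 1536  // never-touch margin
)

type leaseReq struct {
	RequiredMB int    `json:"required_mb"`
	Priority   string `json:"priority"`
	TTLMs      int    `json:"ttl_ms"`
}

var (
	mu       sync.Mutex
	leasedMB int // sum of currently granted leases
)

// usedMB polls nvidia-smi for current VRAM usage in MB.
func usedMB() (int, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=memory.used", "--format=csv,noheader,nounits").Output()
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(out)))
}

func lease(w http.ResponseWriter, r *http.Request) {
	var req leaseReq
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	mu.Lock()
	defer mu.Unlock()
	used, err := usedMB()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	// Commit whichever is larger: measured use, or outstanding
	// reservations (granted leases plus Ollama's standing claim).
	committed := leasedMB + ollamaMB
	if used > committed {
		committed = used
	}
	granted := committed+req.RequiredMB+safetyMB <= totalMB
	if granted {
		leasedMB += req.RequiredMB
	}
	json.NewEncoder(w).Encode(map[string]bool{"granted": granted})
}

// release is simplified to "give back N MB"; a real arbiter would key
// releases by lease ID and also expire leases whose TTL has passed.
func release(w http.ResponseWriter, r *http.Request) {
	var req leaseReq
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	mu.Lock()
	leasedMB -= req.RequiredMB
	if leasedMB < 0 {
		leasedMB = 0
	}
	mu.Unlock()
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/lease", lease)
	http.HandleFunc("/release", release)
	log.Fatal(http.ListenAndServe("127.0.0.1:7077", nil))
}
```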

This gives you **operator-grade control** without touching CUDA itself.

---

### 7. Scheduling patterns you can choose

**Pattern A — Strict serialization (simplest, safest)**  
- When Ollama is generating (lease active), hypergraph engine only runs **lightweight kernels** or waits.
- When hypergraph engine runs a heavy phase (lease for, say, 6000 MB), you temporarily stop sending LLM requests or cap them to tiny prompts.

**Pattern B — Soft concurrency with caps**  
- Hypergraph engine never asks for more than, say, 6000–7000 MB.
- Arbiter refuses large leases if Ollama is active; engine falls back to a smaller problem size.
- This lets both run, but with **bounded** VRAM usage.
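
Engine-side, Pattern B reduces to a shrink-on-denial loop. A sketch, assuming a hypothetical `requestLease` helper (a minimal HTTP version of it appears in section 8):

```go
// acquireAdaptive caps the ask at Pattern B's ceiling, then halves it
// on each denial until something fits or the job should be queued.
// capMB and floorMB are illustrative, not derived numbers.
func acquireAdaptive(wantMB int) (grantedMB int, ok bool) {
	const capMB = 7000   // never ask for more than this
	const floorMB = 1000 // below this, queue instead of degrading further
	if wantMB > capMB {
		wantMB = capMB
	}
	for mb := wantMB; mb >= floorMB; mb /= 2 {
		if requestLease(mb, "batch", 5000) {
			return mb, true
		}
	}
	return 0, false // denied at every size: queue the job
}
```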

If you ever move to containerized / multi-tenant land, you can layer in NVIDIA time-slicing or MPS for more formal sharing, but on a single box this arbiter pattern is usually cleaner than full-blown K8s GPU partitioning. 

---

### 8. Concrete knobs for your current setup

**On the Ollama side:**

- **Fix the model + GPULayers** so its VRAM footprint is stable.
- Optionally **reserve it a fixed budget** in your arbiter (e.g., always assume 2 GB in use when the runner is up).
- If you want to be extra safe, only allow hypergraph heavy phases when **no generation is in flight** (you already see `/api/generate` in your logs; use that as a signal).

**On the hypergraph side:**

- Before launching a big kernel batch, call `request_lease(required_mb=6000–8000, ttl_ms=job_estimate)`.
- If denied, either:
  - **Queue** the job, or  
  - **Run a reduced-size variant** (smaller graph, fewer concurrent paths).
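
For completeness, a minimal `requestLease` client matching the arbiter sketch in section 6; the endpoint, port, and payload shape are the same assumptions made there:

```go
import (
	"bytes"
	"encoding/json"
	"net/http"
)

// requestLease asks the local arbiter for VRAM. false means denied or
// unreachable, which the engine treats as "queue or shrink"; failing
// closed keeps heavy kernels off the GPU when the arbiter is down.
func requestLease(requiredMB int, priority string, ttlMs int) bool {
	body, _ := json.Marshal(map[string]any{
		"required_mb": requiredMB,
		"priority":    priority,
		"ttl_ms":      ttlMs,
	})
	resp, err := http.Post("http://127.0.0.1:7077/lease",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	var out struct {
		Granted bool `json:"granted"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return false
	}
	return out.Granted
}
```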

---

### 9. If you want to go one level deeper

I can sketch:

- A minimal **Go-based GPU arbiter** with NVML bindings and a tiny HTTP API.
- A **lease protocol** that you can log into your existing ledger so every GPU decision is auditable.
- A **priority scheme** (e.g., “operator at keyboard” → LLM wins; “batch window” → hypergraph wins).

If you tell me your hypergraph engine’s typical peak VRAM usage and whether you’re okay with occasional LLM pauses during heavy runs, I can turn this into a concrete policy + code skeleton.
