Overview
The survey synthesizes many systems papers with measured gains, but most results are prototype-level or evaluated on limited traces; practical deployment needs staged rollouts and continuous retraining.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 13
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 6/6
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 45%
Novelty: 55%
Why It Matters For Business
AI techniques can reduce tail latency, improve throughput, lower storage errors, and cut datacenter costs, but require guardrails and staged deployment to avoid regressions and privacy risks.
Summary TLDR
This 68-page survey maps the two-way interaction between AI and operating systems. It summarizes how traditional ML, large language models (LLMs), and agent systems improve OS subsystems (scheduling, I/O, storage, memory, networking, security, GUI/CLI, ops/tuning, verification, education). It also explains how OS designs (kernel-bypass, modular kernels, memory and scheduler interfaces) accelerate AI workloads (short- and long-context inference, distributed training, edge inference). The paper lists representative systems, quantifies several empirical gains (I/O latency, tail latency, storage throughput, error reduction, energy/TCO), identifies pitfalls (model drift, overhead, explainability,
Problem Statement
Modern OSs face growing heterogeneity and dynamic workloads that break static heuristics. At the same time, AI methods (ML, LLMs, agents) can automate and optimize OS decisions but are fragmented and raise new overhead, reliability, and governance issues. The paper surveys techniques and gaps in both "AI for OS" and "OS for AI" to guide engineering and research.
Main Contribution
Categorize research into two directions: AI for OS (apply AI inside OS) and OS for AI (OS changes to support AI workloads).
Survey representative systems across kernel subsystems and the OS ecosystem, summarizing goals, methods, and measured impacts.
Key Findings
Lightweight ML in the kernel can sharply improve I/O predictability and throughput.
A production-focused ML pipeline can deliver sub-microsecond decisions and reduce latency vs heuristics.
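To make the first finding concrete, here is a minimal sketch of a per-request "will this I/O be slow?" predictor that reroutes requests away from a contended device. It is only loosely inspired by the idea of a light model in the block path and is not the LinnOS or Heimdall implementation. The class name, features (queue depth, mean of recent latencies), weights, and threshold are all hypothetical placeholders; a real pilot would fit them on traced I/O.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TinyLatencyPredictor:
    # Hypothetical hand-set weights for illustration only;
    # a real pilot would learn these from traced I/O.
    w_queue: float = 0.25
    w_recent: float = 0.004
    bias: float = -4.0

    def slow_probability(self, queue_depth: int,
                         recent_latencies_us: Sequence[float]) -> float:
        """Logistic score in (0, 1): higher means 'likely slow'."""
        if recent_latencies_us:
            mean_recent = sum(recent_latencies_us) / len(recent_latencies_us)
        else:
            mean_recent = 0.0
        z = self.w_queue * queue_depth + self.w_recent * mean_recent + self.bias
        return 1.0 / (1.0 + math.exp(-z))


def choose_device(predictor: TinyLatencyPredictor, queue_depth: int,
                  recent_latencies_us: Sequence[float],
                  threshold: float = 0.5) -> str:
    """Send the request to a replica when the primary is predicted slow."""
    p = predictor.slow_probability(queue_depth, recent_latencies_us)
    return "replica" if p >= threshold else "primary"


if __name__ == "__main__":
    predictor = TinyLatencyPredictor()
    print(choose_device(predictor, 2, [100, 120, 90]))
    print(choose_device(predictor, 16, [800, 900, 1000]))
```

The model is deliberately tiny (three parameters, one exponential) because the decision sits on the request path; anything heavier would need to be justified against the latency it adds.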
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| I/O latency (avg) | up to 40% lower | heuristic kernel I/O path | up to -40% avg latency | contention-heavy SSD workloads | LinnOS embeds a light NN in the block path to predict device behavior | Section 4.1.2; LinnOS [77] |
| Accuracy | 93% | heuristic admission controllers | n/a | production traces from Microsoft/Alibaba/Tencent | Heimdall full ML pipeline, sub-µs inference, 28 KB overhead | Section 4.1.2; Heimdall [79] |
What To Try In 7 Days
Run a lightweight ML pilot on a hot I/O path (for example, a LinnOS-style latency predictor in simulation) and measure average and tail latency against the existing heuristic.
Audit logs and telemetry to build a small dataset for anomaly detection or failure prediction, as groundwork for a Desh-style pipeline.
Prototype a simple LLM-assisted DevOps flow for kernel configuration changes in a sandbox, borrowing ideas from AutoOS or BYOS.
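For the first pilot above, the measurement step can be sketched as a comparison of median and tail latency between a baseline run and a pilot run. This is an illustrative helper, not a tool from the survey: the function names and the nearest-rank percentile definition are assumptions, and the sample data is synthetic.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(baseline_us, pilot_us, quantiles=(50, 99)):
    """Compare baseline and pilot latencies at each requested quantile."""
    report = {}
    for q in quantiles:
        base = percentile(baseline_us, q)
        new = percentile(pilot_us, q)
        report[f"p{q}"] = {
            "baseline": base,
            "pilot": new,
            # Negative delta means the pilot is faster than the baseline.
            "delta_pct": 100.0 * (new - base) / base,
        }
    return report


if __name__ == "__main__":
    baseline = list(range(1, 101))          # synthetic latencies in µs
    pilot = [x / 2 for x in baseline]       # synthetic "improved" run
    for name, row in latency_report(baseline, pilot).items():
        print(name, row)
```

Reporting p99 alongside p50 matters here because a learned policy can improve the average while making the tail worse, which an average-only comparison would hide.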
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Model drift: learned policies degrade as hardware and workloads change.
Inference overhead: kernel-embedded models must meet tight latency budgets.
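One simple way to notice the model drift described above is to track prediction accuracy over a sliding window and flag the model for retraining when it drops below a floor. The sketch below assumes ground-truth outcomes become available shortly after each prediction; the class name, window size, and accuracy floor are hypothetical and would need tuning per workload.

```python
from collections import deque


class DriftMonitor:
    """Rolling-window accuracy check used as a retraining trigger."""

    def __init__(self, window=100, min_accuracy=0.9):
        # deque with maxlen silently drops the oldest outcome when full.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Wait for a full window so a few early misses do not trigger retraining.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


if __name__ == "__main__":
    monitor = DriftMonitor(window=10, min_accuracy=0.8)
    for _ in range(10):
        monitor.record("fast", "fast")      # model is right
    print(monitor.needs_retraining())
    for _ in range(3):
        monitor.record("fast", "slow")      # workload shifted, model is wrong
    print(monitor.accuracy(), monitor.needs_retraining())
```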
When Not To Use
In hard real-time kernel paths with strict determinism requirements.
On resource-limited embedded devices without inference acceleration.
Failure Modes
Hallucinated or incorrect code patches from LLMs causing regressions.
Model drift leading to performance regressions or SLO violations.
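A common guardrail against the second failure mode is to wrap the learned policy so that the system falls back to the existing heuristic when the model fails or repeatedly exceeds its latency budget. The following is a minimal sketch under stated assumptions: `GuardedPolicy`, its parameters, and the "disable after N violations" rule are hypothetical, and a production guard would also need a re-enable path and telemetry.

```python
import time


class GuardedPolicy:
    """Use the learned policy until it misbehaves, then fall back."""

    def __init__(self, learned, heuristic, budget_ns,
                 max_violations=3, clock=time.perf_counter_ns):
        self.learned = learned
        self.heuristic = heuristic
        self.budget_ns = budget_ns
        self.max_violations = max_violations
        self.clock = clock          # injectable so the guard can be tested
        self.violations = 0
        self.disabled = False

    def _violation(self):
        self.violations += 1
        if self.violations >= self.max_violations:
            self.disabled = True

    def decide(self, request):
        if self.disabled:
            return self.heuristic(request)
        start = self.clock()
        try:
            decision = self.learned(request)
        except Exception:
            # A failing model must never take the request path down.
            self._violation()
            return self.heuristic(request)
        if self.clock() - start > self.budget_ns:
            # The decision is already computed, so use it, but count
            # the overrun toward disabling the learned policy.
            self._violation()
        return decision


if __name__ == "__main__":
    guard = GuardedPolicy(lambda r: "ml", lambda r: "heuristic",
                          budget_ns=1_000_000)
    print(guard.decide({"queue_depth": 4}))
```

Keeping the heuristic as a permanent fallback also supports staged rollout: the guard can start with a strict budget and a low violation limit, then be relaxed as confidence grows.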

