Survey: how machine learning, LLMs, and agents are reshaping operating systems and the OS stack

July 19, 2024 · 9 min

Overview

Decision Snapshot: Needs Validation

The survey synthesizes many systems papers with measured gains, but most results are prototype-level or evaluated on limited traces; practical deployment needs staged rollouts and continuous retraining.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 45%

Novelty: 55%

Authors

Yifan Zhang, Xinkui Zhao, Ziying Li, Guanjie Cheng, Jianwei Yin, Lufei Zhang, Zuoning Chen

Why It Matters For Business

AI techniques can reduce tail latency, improve throughput, lower storage errors, and cut datacenter costs, but require guardrails and staged deployment to avoid regressions and privacy risks.

Who Should Care

Summary TLDR

This 68-page survey maps the two-way interaction between AI and operating systems. It summarizes how traditional ML, large language models (LLMs), and agent systems improve OS subsystems (scheduling, I/O, storage, memory, networking, security, GUI/CLI, ops/tuning, verification, education). It also explains how OS designs (kernel-bypass, modular kernels, memory and scheduler interfaces) accelerate AI workloads (short- and long-context inference, distributed training, edge inference). The paper lists representative systems, quantifies several empirical gains (I/O latency, tail latency, storage throughput, error reduction, energy/TCO), identifies pitfalls (model drift, overhead, explainability,

Problem Statement

Modern OSs face growing heterogeneity and dynamic workloads that break static heuristics. At the same time, AI methods (ML, LLMs, agents) can automate and optimize OS decisions but are fragmented and raise new overhead, reliability, and governance issues. The paper surveys techniques and gaps in both "AI for OS" and "OS for AI" to guide engineering and research.

Main Contribution

Categorize research into two directions: AI for OS (apply AI inside OS) and OS for AI (OS changes to support AI workloads).

Survey representative systems across kernel subsystems and the OS ecosystem, summarizing goals, methods, and measured impacts.

Key Findings

Lightweight ML in the kernel can sharply improve I/O predictability and throughput.

Numbers: LinnOS: up to 40% lower I/O latency; up to […] throughput under contention

Practical Use: Embed small per-I/O inference models for SSDs when microsecond-level predictability matters, but ensure models are low overhead and updateable.

Evidence Ref: Section 4.1.2; LinnOS [77]
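
The idea of a small per-I/O inference model can be sketched as below. This is a minimal illustration and not LinnOS's actual network: the feature choice (queue depth plus two recent latencies), the normalization ranges, the layer sizes, every weight, and the "route to a replica" policy are hypothetical placeholders that a real system would derive from device traces.

```python
# Hypothetical weights for a tiny network: 3 inputs, 2 hidden units, 1 output.
# A real system would learn these offline from device traces.
W1 = [[0.8, 0.5, 0.5],
      [-0.3, 0.9, 0.2]]
B1 = [-1.0, -0.5]
W2 = [1.2, 0.7]
B2 = -0.6


def relu(x):
    return x if x > 0.0 else 0.0


def predict_slow(queue_depth, last_latency_us, prev_latency_us):
    """Return True when the tiny network predicts a slow I/O."""
    # Clamp each feature into [0, 1]; the ranges (32, 1000 us) are assumed.
    feats = [
        min(queue_depth / 32.0, 1.0),
        min(last_latency_us / 1000.0, 1.0),
        min(prev_latency_us / 1000.0, 1.0),
    ]
    hidden = [relu(sum(w * f for w, f in zip(row, feats)) + b)
              for row, b in zip(W1, B1)]
    logit = sum(w * h for w, h in zip(W2, hidden)) + B2
    return logit > 0.0


def choose_device(queue_depth, last_latency_us, prev_latency_us):
    """Route to a replica when the primary is predicted slow (assumed policy)."""
    if predict_slow(queue_depth, last_latency_us, prev_latency_us):
        return "replica"
    return "primary"


print(choose_device(1, 80, 90))        # idle device
print(choose_device(32, 1000, 1000))   # saturated device
```

The whole forward pass is a handful of multiply-adds, which is the property that matters on a hot I/O path; keeping the weights in plain tables also makes them easy to swap when the model is retrained.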

A production-focused ML pipeline can deliver sub-microsecond decisions and reduce latency vs heuristics.

Numbers: Heimdall: 93% decision accuracy; sub-µs inference; 15–35% lower avg I/O latency vs heuristics; up to […] vs baseline

Practical Use: Use full ML pipelines (labeling, filtering, quantization) plus careful deployment tuning to replace heuristics in storage admission control.

Evidence Ref: Section 4.1.2; Heimdall [79]
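
The quantization step in such a pipeline can be illustrated with a small sketch. It shows generic symmetric int8 weight quantization for a linear admit/reject scorer, so that the decision needs only integer arithmetic. It is not Heimdall's code: the weights, bias, feature encoding, and function names are all hypothetical.

```python
def quantize_symmetric(weights, bits=8):
    """Map float weights to signed integers that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [int(round(w / scale)) for w in weights], scale


def int_score(q_weights, features):
    """Integer-only dot product; features are assumed to be small integers."""
    return sum(w * f for w, f in zip(q_weights, features))


# Hypothetical trained parameters for an admit/reject scorer.
float_weights = [0.42, -1.30, 0.07, 0.88]
bias = -0.25

q_weights, w_scale = quantize_symmetric(float_weights)
# Fold the bias into an integer threshold once, outside the hot path.
q_threshold = int(round(-bias / w_scale))


def decide_float(features):
    return sum(w * f for w, f in zip(float_weights, features)) + bias > 0


def decide_int(features):
    return int_score(q_weights, features) > q_threshold


for feats in ([3, 1, 2, 1], [0, 2, 0, 1]):
    print(feats, decide_float(feats), decide_int(feats))
```

Comparing `decide_float` and `decide_int` over a held-out trace is a cheap way to check how much accuracy the quantized model gives up before it is deployed.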

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
I/O latency (avg) | down up to 40% | heuristic kernel I/O path | up to -40% avg latency | contention-heavy SSD workloads | LinnOS embeds a light NN in the block path to predict device behavior | Section 4.1.2; LinnOS [77]
Accuracy | 93% | heuristic admission controllers | n/a | production traces from Microsoft/Alibaba/Tencent | Heimdall full ML pipeline, sub-µs inference, 28 KB overhead | Section 4.1.2; Heimdall [79]

What To Try In 7 Days

Run a lightweight ML pilot on a hot I/O path (simulate LinnOS) and measure latency/tail improvements.

Audit logs and telemetry to build a small dataset for anomaly detection or failure prediction (prepare a Desh-style pipeline).

Prototype a simple LLM-assisted devops flow for kernel config changes using AutoOS or BYOS ideas in a sandbox.
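
For the first pilot, the measurement side can be as simple as the sketch below, which compares average and P99 latency between a baseline run and a pilot run. The sample values are synthetic and the helper names are hypothetical; the percentile uses the nearest-rank definition, which is one of several common conventions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    return {"avg": sum(samples) / len(samples), "p99": percentile(samples, 99)}


def delta_percent(baseline, pilot):
    """Relative change per metric; negative means the pilot is lower (better)."""
    return {k: 100.0 * (pilot[k] - baseline[k]) / baseline[k] for k in baseline}


# Synthetic latencies in microseconds, for illustration only.
baseline_us = [100] * 98 + [900, 1000]
pilot_us = [100] * 98 + [300, 400]

base = summarize(baseline_us)
pilot = summarize(pilot_us)
for metric, change in delta_percent(base, pilot).items():
    print(f"{metric}: {base[metric]:.0f} -> {pilot[metric]:.0f} us ({change:+.1f}%)")
```

Reporting the tail separately from the average matters here: in the synthetic data the average moves by about 10% while P99 moves by about 67%, and a learned policy that targets slow I/Os would be expected to show up mostly in the tail.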

Agent Features

Memory
external memory vectors (semantic memory); context window as short-term memory; KV cache retrieval
Planning
multi-step reasoning; tree-of-thought prompting; state-machine orchestration
Tool Use
LLM + symbolic executor; system tool invocation; fuzzers and validators
Frameworks
AIOS; OSAgent; AIOS-Agent; LSFS; CoRE
Is Agentic

Yes

Architectures
single-agent; multi-agent; memory-enhanced agents
Collaboration
role-specialized agents; agent orchestration pipelines; multi-agent grading/education

Optimization Features

Token Efficiency
attention reuse (AttentionStore); paged attention / memory paging for long context; KV cache management
Infra Optimization
library OS / Demikernel for microsecond datacenter paths; modular kernels and per-application OS instances; software-defined far memory
Model Optimization
quantization; model distillation; lightweight NN for kernel paths
System Optimization
communication-computation overlap; GPU-initiated I/O (BaM); device-aware scheduling
Training Optimization
federated/continuous retraining; noise filtering and period-based labeling; domain-specific data curation
Inference Optimization
kernel-bypass and in-kernel inference; quantized sub-µs inference (Heimdall); adaptive batching and preemption (ExeGPT, XSched)

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Model drift: learned policies degrade as hardware and workloads change.

Inference overhead: kernel-embedded models must meet tight latency budgets.
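
One common guardrail against model drift is to track recent prediction accuracy and fall back to the existing heuristic when it drops. The sketch below is a generic pattern, not something the survey prescribes: the class name, the window size, the accuracy threshold, and the choice to trust the model until the window fills are all assumptions.

```python
from collections import deque


class DriftGuard:
    """Track recent prediction accuracy; signal fallback when it degrades."""

    def __init__(self, window=100, min_accuracy=0.85):
        self.outcomes = deque(maxlen=window)    # 1 = correct, 0 = wrong
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def model_trusted(self):
        # Assumed policy: trust the model until the window is full.
        if len(self.outcomes) < self.outcomes.maxlen:
            return True
        return sum(self.outcomes) / len(self.outcomes) >= self.min_accuracy


def decide(guard, model_decision, heuristic_decision):
    """Use the learned decision only while the guard still trusts the model."""
    return model_decision if guard.model_trusted() else heuristic_decision


guard = DriftGuard(window=10, min_accuracy=0.8)
for _ in range(10):
    guard.record("fast", "fast")    # model agrees with ground truth
print(decide(guard, "model", "heuristic"))
for _ in range(5):
    guard.record("fast", "slow")    # workload shifts, predictions go wrong
print(decide(guard, "model", "heuristic"))
```

A stricter variant would start in the fallback state and promote the model only after a full window of good predictions; which default is safer depends on how costly a wrong decision is on the path in question.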

When Not To Use

In hard real-time kernel paths with strict determinism requirements.

On resource-limited embedded devices without inference acceleration.

Failure Modes

Hallucinated or incorrect code patches from LLMs causing regressions.

Model drift leading to performance regressions or SLO violations.
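
For the LLM failure mode above, a simple mitigation is to validate every proposed change against an allowlist before it reaches even a sandbox. The sketch below is illustrative only: the `NAME=value` line format is a simplification of real kernel `.config` syntax, and the allowlist contents and the function name are hypothetical.

```python
import re

# Hypothetical allowlist: options the pilot may touch and the values permitted.
ALLOWED = {
    "CONFIG_HZ": {"100", "250", "300", "1000"},
    "CONFIG_PREEMPT": {"y"},
    "CONFIG_TRANSPARENT_HUGEPAGE": {"y"},
}

LINE_RE = re.compile(r"^(CONFIG_[A-Z0-9_]+)=(\S+)$")


def validate_proposal(text):
    """Split a proposed change list into accepted changes and rejected lines."""
    accepted, rejected = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match and match.group(2) in ALLOWED.get(match.group(1), ()):
            accepted[match.group(1)] = match.group(2)
        else:
            rejected.append(line)       # unknown option, bad value, or not a setting
    return accepted, rejected


proposal = """
CONFIG_HZ=1000
CONFIG_MODULE_SIG=n
rm -rf /
CONFIG_PREEMPT=y
"""
accepted, rejected = validate_proposal(proposal)
print("accepted:", accepted)
print("rejected:", rejected)
```

Anything in `rejected` should block the change and be surfaced to a human reviewer; accepted changes would still go through a build and benchmark run in the sandbox before any rollout.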

Core Entities

Models

GPT-4; LLaMA; Gemini; MLP; LSTM; Autoencoder; Random Forest

Metrics

P99 latency; average latency; throughput (QPS); accuracy; energy/TCO; device endurance; code coverage

Datasets

Microsoft/Alibaba/Tencent I/O traces (Heimdall); Linux kernel bug corpus (LinuxFLBench); VulnLoc; SAN2VULN

Benchmarks

UnixBench (AutoOS experiments); SLURM benchmarks (Chronus); Kernel fuzzing coverage (ECG, KernelGPT)