Survey: how machine learning, LLMs, and agents are reshaping operating systems and the OS stack

Overview

Decision Snapshot: Needs Validation

The survey synthesizes many systems papers with measured gains, but most results are prototype-level or evaluated on limited traces; practical deployment needs staged rollouts and continuous retraining.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 45%

Novelty: 55%

Authors

Yifan Zhang, Xinkui Zhao, Ziying Li, Guanjie Cheng, Jianwei Yin, Lufei Zhang, Zuoning Chen

Why It Matters For Business

AI techniques can reduce tail latency, improve throughput, lower storage errors, and cut datacenter costs, but require guardrails and staged deployment to avoid regressions and privacy risks.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This 68-page survey maps the two-way interaction between AI and operating systems. It summarizes how traditional ML, large language models (LLMs), and agent systems improve OS subsystems (scheduling, I/O, storage, memory, networking, security, GUI/CLI, ops/tuning, verification, education). It also explains how OS designs (kernel-bypass, modular kernels, memory and scheduler interfaces) accelerate AI workloads (short- and long-context inference, distributed training, edge inference). The paper lists representative systems, quantifies several empirical gains (I/O latency, tail latency, storage throughput, error reduction, energy/TCO), identifies pitfalls (model drift, overhead, explainability,

Problem Statement

Modern OSs face growing heterogeneity and dynamic workloads that break static heuristics. At the same time, AI methods (ML, LLMs, agents) can automate and optimize OS decisions but are fragmented and raise new overhead, reliability, and governance issues. The paper surveys techniques and gaps in both "AI for OS" and "OS for AI" to guide engineering and research.

Main Contribution

Categorize research into two directions: AI for OS (apply AI inside OS) and OS for AI (OS changes to support AI workloads).

Survey representative systems across kernel subsystems and the OS ecosystem, summarizing goals, methods, and measured impacts.

Key Findings

Lightweight ML in the kernel can sharply improve I/O predictability and throughput.

Numbers: LinnOS: up to 40% lower I/O latency; up to 3× throughput under contention

Practical Use: Embed small per-I/O inference models for SSDs when microsecond-level predictability matters, but ensure the models are low-overhead and updatable.

Evidence Ref: Section 4.1.2; LinnOS [77]
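
To make the idea of a small per-I/O model concrete, here is a minimal sketch of a tiny neural network that labels an incoming request as likely fast or slow. It is not the LinnOS model: the feature set, the normalization constants, the hand-picked weights, and the names `predict_slow` and `route_io` are all assumptions made for illustration, and a real in-kernel model would be trained offline on device traces and would not be written in Python.

```python
import math

# Hypothetical, hand-picked weights for illustration only. A real model
# would be trained offline on traces from the target device.
HIDDEN_WEIGHTS = [
    [0.9, 0.6, 0.4],   # hidden unit 0: queue depth plus recent latencies
    [-0.5, 0.8, 0.7],  # hidden unit 1: mostly recent latencies
]
HIDDEN_BIAS = [-1.0, -0.8]
OUTPUT_WEIGHTS = [1.5, 1.2]
OUTPUT_BIAS = -1.0


def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def predict_slow(queue_depth, last_latency_us, prev_latency_us, threshold=0.5):
    """Return True if the request is predicted to be slow.

    The three inputs and their scaling factors are assumed features,
    chosen only to keep the example small.
    """
    features = [
        queue_depth / 32.0,
        last_latency_us / 1000.0,
        prev_latency_us / 1000.0,
    ]
    hidden = [
        relu(sum(w * f for w, f in zip(row, features)) + bias)
        for row, bias in zip(HIDDEN_WEIGHTS, HIDDEN_BIAS)
    ]
    logit = sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden)) + OUTPUT_BIAS
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability >= threshold


def route_io(queue_depth, last_latency_us, prev_latency_us):
    """Assumed policy: send predicted-slow requests to a replica."""
    if predict_slow(queue_depth, last_latency_us, prev_latency_us):
        return "replica"
    return "primary"


print(route_io(1, 100, 100))      # lightly loaded device
print(route_io(64, 5000, 4000))   # deep queue, high recent latency
```

The network has only two hidden units, so a decision costs a handful of multiply-adds. That is the property the finding depends on: the prediction must be far cheaper than the I/O it is steering.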

A production-focused ML pipeline can deliver sub-microsecond decisions and reduce latency vs heuristics.

Numbers: Heimdall: 93% decision accuracy; sub-µs inference; 15–35% lower avg I/O latency vs heuristics; up to 2× vs baseline

Practical Use: Use full ML pipelines (labeling, filtering, quantization) plus careful deployment tuning to replace heuristics in storage admission control.

Evidence Ref: Section 4.1.2; Heimdall [79]
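
Quantization is one of the pipeline stages named above. The sketch below shows plain symmetric int8 quantization of a weight vector and an integer dot product. It illustrates the general technique only and makes no claim about how Heimdall quantizes its model; the function names and sample values are assumptions.

```python
def quantize(values, num_bits=8):
    """Symmetric quantization: map floats to signed integers plus one scale."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


def int_dot(q_weights, w_scale, q_inputs, x_scale):
    """Dot product accumulated in integers, rescaled once at the end."""
    accumulator = sum(w * x for w, x in zip(q_weights, q_inputs))
    return accumulator * w_scale * x_scale


weights = [0.5, -1.0, 0.25]   # illustrative values, not from any real model
inputs = [1.0, 2.0, -0.5]

q_w, w_scale = quantize(weights)
q_x, x_scale = quantize(inputs)

exact = sum(w * x for w, x in zip(weights, inputs))
approx = int_dot(q_w, w_scale, q_x, x_scale)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The inner loop uses integer arithmetic only, with a single floating-point rescale at the end. That keeps both the per-decision cost and the model footprint small, at the price of a bounded rounding error per weight.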

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
I/O latency (avg)	up to 40% lower	heuristic kernel I/O path	up to -40% avg latency	contention-heavy SSD workloads	LinnOS embeds a light NN in the block path to predict device behavior	Section 4.1.2; LinnOS [77]
Accuracy	93%	heuristic admission controllers	n/a	production traces from Microsoft/Alibaba/Tencent	Heimdall full ML pipeline, sub-µs inference, 28 KB overhead	Section 4.1.2; Heimdall [79]

What To Try In 7 Days

Run a lightweight ML pilot on a hot I/O path (simulate LinnOS) and measure latency/tail improvements.

Audit logs and telemetry to build a small dataset for anomaly detection or failure prediction (prepare Desh-style pipeline).

Prototype a simple LLM-assisted devops flow for kernel config changes using AutoOS or BYOS ideas in a sandbox.
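
For the first pilot above, the measurement side can be as simple as comparing average and P99 latency between a baseline run and a pilot run. The following is a minimal sketch using the nearest-rank percentile definition; the function names and the synthetic samples are assumptions, and real measurements would come from your own traces.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Average and P99 latency for one run."""
    return {
        "avg": sum(samples) / len(samples),
        "p99": percentile(samples, 99),
    }


def relative_change(baseline, candidate):
    """Percent change of candidate relative to baseline (negative means lower)."""
    return (candidate - baseline) * 100 / baseline


# Synthetic latencies in microseconds, for illustration only.
baseline_run = list(range(1, 101))
pilot_run = [value * 0.8 for value in baseline_run]

base = summarize(baseline_run)
pilot = summarize(pilot_run)
for metric in ("avg", "p99"):
    change = relative_change(base[metric], pilot[metric])
    print(f"{metric}: {base[metric]:.1f} -> {pilot[metric]:.1f} ({change:+.1f}%)")
```

Reporting the tail alongside the average matters here because a learned policy can improve the mean while leaving P99 unchanged, or even making it worse.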

Agent Features

Memory

external memory vectors (semantic memory); context window as short-term memory; KV cache retrieval

Planning

multi-step reasoning; tree-of-thought prompting; state-machine orchestration

Tool Use

LLM + symbolic executor; system tool invocation; fuzzers and validators

Frameworks

AIOS, OSAgent, AIOS-Agent, LSFS, CoRE

Is Agentic

Yes

Architectures

single-agent; multi-agent; memory-enhanced agents

Collaboration

role-specialized agents; agent orchestration pipelines; multi-agent grading/education

Optimization Features

Token Efficiency

attention reuse (AttentionStore); paged attention / memory paging for long context; KV cache management

Infra Optimization

library OS / Demikernel for microsecond datacenter paths; modular kernels and per-application OS instances; software-defined far memory

Model Optimization

quantization; model distillation; lightweight NN for kernel paths

System Optimization

communication-computation overlap; GPU-initiated I/O (BaM); device-aware scheduling

Training Optimization

federated/continuous retraining; noise filtering and period-based labeling; domain-specific data curation

Inference Optimization

kernel-bypass and in-kernel inference; quantized sub-µs inference (Heimdall); adaptive batching and preemption (ExeGPT, XSched)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Model drift: learned policies degrade as hardware and workloads change.

Inference overhead: kernel-embedded models must meet tight latency budgets.

When Not To Use

In hard real-time kernel paths with strict determinism requirements.

On resource-limited embedded devices without inference acceleration.

Failure Modes

Hallucinated or incorrect code patches from LLMs causing regressions.

Model drift leading to performance regressions or SLO violations.
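
One common guardrail against drift-induced regressions is to track recent prediction accuracy and fall back to the existing heuristic when it drops. The sketch below is a minimal version of that idea. The class name `DriftGuard`, the window size, and the thresholds are assumptions for illustration rather than anything prescribed by the survey.

```python
from collections import deque


class DriftGuard:
    """Tracks recent prediction accuracy and gates use of a learned policy."""

    def __init__(self, window=100, min_accuracy=0.85, min_samples=20):
        self.outcomes = deque(maxlen=window)   # True where a prediction was correct
        self.min_accuracy = min_accuracy
        self.min_samples = min_samples

    def record(self, predicted, actual):
        """Store whether the latest prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the rolling window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def use_model(self):
        """Trust the model only with enough evidence and enough accuracy."""
        if len(self.outcomes) < self.min_samples:
            return False   # conservative default: keep the heuristic
        return self.accuracy() >= self.min_accuracy


guard = DriftGuard(window=10, min_accuracy=0.8, min_samples=5)
print(guard.use_model())            # no evidence yet, heuristic stays in charge

for _ in range(5):
    guard.record("fast", "fast")    # model agrees with observed behavior
print(guard.use_model())

for _ in range(10):
    guard.record("fast", "slow")    # workload shifts, model is now wrong
print(guard.use_model())
```

Defaulting to the heuristic until enough outcomes are recorded matches the staged-rollout advice earlier in this summary: the learned policy has to earn control and loses it once the rolling accuracy falls below the threshold.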

Core Entities

Models

GPT-4, LLaMA, Gemini, MLP, LSTM, Autoencoder, Random Forest

Metrics

P99 latency; average latency; throughput (QPS); accuracy; energy/TCO; device endurance; code coverage

Datasets

Microsoft/Alibaba/Tencent I/O traces (Heimdall); Linux kernel bug corpus (LinuxFLBench); VulnLoc; SAN2VULN

Benchmarks

UnixBench (AutoOS experiments); SLURM benchmarks (Chronus); kernel fuzzing coverage (ECG, KernelGPT)
