Survey: how machine learning, LLMs, and agents are reshaping operating systems and the OS stack

July 19, 2024 · 9 min

Overview

Production Readiness

0.45

Novelty Score

0.55

Cost Impact Score

0.6

Citation Count

1

Authors

Yifan Zhang, Xinkui Zhao, Ziying Li, Guanjie Cheng, Jianwei Yin, Lufei Zhang, Zuoning Chen

Links

Abstract / PDF

Why It Matters For Business

AI techniques can reduce tail latency, improve throughput, reduce storage errors, and cut datacenter costs, but they require guardrails and staged deployment to avoid regressions and privacy risks.

Summary TLDR

This 68-page survey maps the two-way interaction between AI and operating systems. It summarizes how traditional ML, large language models (LLMs), and agent systems improve OS subsystems (scheduling, I/O, storage, memory, networking, security, GUI/CLI, ops/tuning, verification, education). It also explains how OS designs (kernel-bypass, modular kernels, memory and scheduler interfaces) accelerate AI workloads (short- and long-context inference, distributed training, edge inference). The paper lists representative systems, quantifies several empirical gains (I/O latency, tail latency, storage throughput, error reduction, energy/TCO), identifies pitfalls (model drift, overhead, explainability,

Problem Statement

Modern OSs face growing heterogeneity and dynamic workloads that break static heuristics. At the same time, AI methods (ML, LLMs, agents) can automate and optimize OS decisions but are fragmented and raise new overhead, reliability, and governance issues. The paper surveys techniques and gaps in both "AI for OS" and "OS for AI" to guide engineering and research.

Main Contribution

Categorize research into two directions: AI for OS (apply AI inside OS) and OS for AI (OS changes to support AI workloads).

Survey representative systems across kernel subsystems and the OS ecosystem, summarizing goals, methods, and measured impacts.

Distill common evaluation axes, engineering patterns, deployment suggestions, and a three-stage roadmap: AI-powered, AI-refactored, AI-driven OSs.

Identify core risks (model drift, runtime overhead, explainability, privacy) and propose rules+AI guardrails, modular kernels, and unified toolchains.
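
The "rules+AI guardrails" idea can be sketched as a wrapper that accepts a learned policy's suggestion only when it is confident and inside rule-defined bounds, and otherwise falls back to a static heuristic. This is a minimal illustration under stated assumptions, not code from the survey: `Guardrail`, `guarded_decision`, the confidence threshold, and the readahead example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Guardrail:
    """Rule-based limits around a learned policy (all values are illustrative)."""
    lower: float            # smallest value the rules allow
    upper: float            # largest value the rules allow
    min_confidence: float   # below this, ignore the model entirely


def guarded_decision(
    model: Callable[[dict], Tuple[float, float]],
    heuristic: Callable[[dict], float],
    rail: Guardrail,
    state: dict,
) -> Tuple[float, str]:
    """Return (decision, source), where source is 'model' or 'heuristic'."""
    try:
        value, confidence = model(state)
    except Exception:
        # A failing model must never take the system down with it.
        return heuristic(state), "heuristic"
    if confidence < rail.min_confidence or not (rail.lower <= value <= rail.upper):
        return heuristic(state), "heuristic"
    return value, "model"


# Hypothetical example: choosing a readahead window in KiB.
rail = Guardrail(lower=16, upper=1024, min_confidence=0.8)
static_readahead = lambda state: 128.0
confident_model = lambda state: (256.0, 0.95)
wild_model = lambda state: (8192.0, 0.99)   # confident but out of bounds

print(guarded_decision(confident_model, static_readahead, rail, {}))
print(guarded_decision(wild_model, static_readahead, rail, {}))
```

Returning the decision source alongside the value makes it easy to log how often the fallback fires, which is a useful signal during staged deployment.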

Key Findings

Lightweight ML in the kernel can sharply improve I/O predictability and throughput.

Numbers: LinnOS: up to 40% lower I/O latency; up to 3× throughput under contention
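
To make the idea of lightweight per-I/O prediction concrete, the sketch below classifies the next I/O on each replica as fast or slow from queue depth and recent latencies, then routes to the first replica predicted fast. It is written in the spirit of LinnOS but is not its implementation: the single-neuron model, the weights, the feature choice, and the names `predict_slow` and `choose_device` are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical weights for a single-neuron classifier; a real system would
# train these offline from traces of the target device.
WEIGHTS = (0.30, 0.004)   # (per queued I/O, per microsecond of recent latency)
BIAS = -4.0


def predict_slow(queue_depth: int, recent_latencies_us: Sequence[float]) -> bool:
    """Return True if the next I/O on this device is predicted to be slow."""
    if recent_latencies_us:
        recent = sum(recent_latencies_us) / len(recent_latencies_us)
    else:
        recent = 0.0
    z = WEIGHTS[0] * queue_depth + WEIGHTS[1] * recent + BIAS
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= 0.5


def choose_device(devices: Dict[str, Tuple[int, Sequence[float]]]) -> str:
    """Pick the first replica predicted fast; fall back to the first one listed."""
    names = list(devices)
    for name in names:
        depth, history = devices[name]
        if not predict_slow(depth, history):
            return name
    return names[0]   # every replica looks slow: do not bounce forever


replicas = {
    "ssd0": (24, [900.0, 1200.0, 1100.0]),  # deep queue, slow recent I/Os
    "ssd1": (2, [90.0, 110.0, 100.0]),      # idle and fast
}
print(choose_device(replicas))
```

The model is deliberately tiny: the prediction sits on the I/O path, so its cost has to stay well below the latency it is trying to avoid.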

A production-focused ML pipeline can deliver sub-microsecond decisions and reduce latency vs heuristics.

Numbers: Heimdall: 93% decision accuracy; sub-µs inference; 15–35% lower avg I/O latency vs heuristics; up to 2× vs baseline
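
One generic way to keep inference cheap enough for a hot path is integer quantization. The sketch below shows symmetric int8 quantization of a single linear layer and compares it with the floating-point result. It illustrates the general technique only; it is not Heimdall's pipeline, and the weights, features, and function names are made up for the example.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization, so that each value is roughly q * scale."""
    peak = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = peak / 127.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def quantized_dot(q_w: List[int], w_scale: float,
                  q_x: List[int], x_scale: float) -> float:
    """Integer multiply-accumulate, with one float rescale at the end."""
    acc = sum(w * x for w, x in zip(q_w, q_x))
    return acc * w_scale * x_scale


# Illustrative layer: four input features, one output neuron.
weights = [0.8, -0.35, 0.12, 0.05]
features = [3.0, 1.5, 0.25, 8.0]
bias = -1.0

q_w, w_scale = quantize(weights)
q_x, x_scale = quantize(features)

exact = sum(w * x for w, x in zip(weights, features)) + bias
approx = quantized_dot(q_w, w_scale, q_x, x_scale) + bias

print(f"float: {exact:.4f}  int8: {approx:.4f}")
```

The inner loop uses only integer arithmetic, which is what makes this style of model attractive where floating-point use is restricted or expensive. The price is a small approximation error that should be checked against decision accuracy.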

Learned indexes at the block layer can cut redundant work and shrink tail latency drastically.

Numbers: LearnedFTL: 55.5% fewer double reads; up to 12.2× P99 latency improvement; 8.2× average throughput gain
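
The sketch below illustrates how a learned index can avoid a double read in address translation. A linear model predicts the physical page for a logical page, and a per-page flag records whether that prediction is exact, so a wrong guess is never used. It is a simplified illustration, not LearnedFTL's actual design. The class and function names are hypothetical, and the cost model (one flash read on a model hit, two when the translation page must be read first) is an assumption.

```python
from typing import Dict, Optional, Tuple


class LearnedSegment:
    """Linear model PPN = slope * LPN + intercept over one LPN range,
    plus a flag per LPN saying whether the model predicts it exactly."""

    def __init__(self, mapping: Dict[int, int]):
        lpns = sorted(mapping)
        first, last = lpns[0], lpns[-1]
        span = max(last - first, 1)
        self.slope = (mapping[last] - mapping[first]) / span
        self.intercept = mapping[first] - self.slope * first
        # Only trust the model where it reproduces the true mapping.
        self.exact = {lpn: self._predict(lpn) == ppn for lpn, ppn in mapping.items()}

    def _predict(self, lpn: int) -> int:
        return round(self.slope * lpn + self.intercept)

    def lookup(self, lpn: int) -> Optional[int]:
        return self._predict(lpn) if self.exact.get(lpn, False) else None


def translate(lpn: int, segment: LearnedSegment,
              mapping_table: Dict[int, int]) -> Tuple[int, int]:
    """Return (ppn, flash_reads) under the assumed one-read/two-read cost model."""
    ppn = segment.lookup(lpn)
    if ppn is not None:
        return ppn, 1                 # model hit: read the data page only
    return mapping_table[lpn], 2      # miss: translation page, then data page


# Mostly sequential mapping with one page that was rewritten elsewhere.
table = {lpn: 5000 + (lpn - 100) for lpn in range(100, 108)}
table[103] = 9000
segment = LearnedSegment(table)

reads = sum(translate(lpn, segment, table)[1] for lpn in table)
print(translate(101, segment, table), translate(103, segment, table), reads)
```

In this toy mapping the model covers seven of eight pages, so the eight lookups cost nine reads instead of sixteen. Real gains depend on how sequential the physical layout is.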

ML methods can improve NVM reliability and device lifetime.

Numbers: LearnWD: -20.1% write-disturbance errors; -11.0% write latency; +21.9% device endurance

System-level autotuning with ML can lower datacenter memory cost with small performance impact.

Numbers: Software-defined far memory (Google deployment): 4–5% TCO reduction

LLM-powered automation boosts kernel testing and vulnerability discovery but is not flawless.

Numbers: KernelGPT: discovered 24 unique Linux bugs (12 fixed, 11 CVEs); ECG: +16.02% avg kernel coverage; SAN2PATCH: 79.5% Vuln-

Results

I/O latency (avg)

Value: up to 40% lower

Baseline: heuristic kernel I/O path

Decision accuracy (Heimdall)

Value: 93%

Baseline: heuristic admission controllers

Flash translation double reads

Value: -55.5%

Baseline: state-of-the-art FTL designs

Device endurance (NVM)

Value: +21.9%

Baseline: non-ML remapping

Datacenter memory TCO

Value: 4–5% lower TCO

Baseline: no far-memory autotuning

Kernel fuzzing / vulnerability discovery

Value: 24 unique bugs (KernelGPT); ECG +16.02% average code coverage

Baseline: traditional fuzzers

Who Should Care

What To Try In 7 Days

Run a lightweight ML pilot on a hot I/O path (simulate LinnOS) and measure latency/tail improvements.

Audit logs and telemetry to build a small dataset for anomaly detection or failure prediction (prepare Desh-style pipeline).

Prototype a simple LLM-assisted devops flow for kernel config changes using AutoOS or BYOS ideas in a sandbox.
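
For the I/O pilot above, the measurement side can be sketched as a small harness that reports average, P50, and P99 latency for a baseline run and a pilot run. The traces here are synthetic and exist only to exercise the code; they are not results. The function names, the nearest-rank percentile definition, and the slow-I/O fractions are assumptions.

```python
import math
import random
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_us: List[float]) -> Dict[str, float]:
    return {
        "avg": sum(latencies_us) / len(latencies_us),
        "p50": percentile(latencies_us, 50),
        "p99": percentile(latencies_us, 99),
    }


def synthetic_trace(n: int, slow_fraction: float, seed: int) -> List[float]:
    """Made-up latencies: mostly fast I/Os with a fraction of slow outliers."""
    rng = random.Random(seed)
    return [
        rng.uniform(800, 2000) if rng.random() < slow_fraction else rng.uniform(80, 120)
        for _ in range(n)
    ]


baseline = summarize(synthetic_trace(10_000, slow_fraction=0.05, seed=1))
pilot = summarize(synthetic_trace(10_000, slow_fraction=0.005, seed=1))

for name, stats in (("baseline", baseline), ("pilot", pilot)):
    print(name, {k: round(v, 1) for k, v in stats.items()})
```

In a real pilot, replace `synthetic_trace` with latencies captured from the same workload with and without the model enabled. Compare P99 as well as the average, since tail regressions are easy to miss in means.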

Agent Features

Memory

  • external memory vectors (semantic memory)
  • context window as short-term memory
  • KV cache retrieval

Planning

  • multi-step reasoning
  • tree-of-thought prompting
  • state-machine orchestration
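
State-machine orchestration can be sketched as an explicit plan, act, verify loop with a bounded number of retries. The callables below stand in for LLM or tool calls. The states, the `run_agent` function, and the tuning example are hypothetical and not taken from any of the surveyed frameworks.

```python
from enum import Enum, auto
from typing import Any, Callable, List, Tuple


class State(Enum):
    PLAN = auto()
    ACT = auto()
    VERIFY = auto()
    DONE = auto()
    FAILED = auto()


def run_agent(
    plan: Callable[[int], Any],
    act: Callable[[Any], Any],
    verify: Callable[[Any], bool],
    max_attempts: int = 3,
) -> Tuple[State, Any, List[State]]:
    """Drive plan -> act -> verify, replanning after a failed verification."""
    state, attempts, step, result = State.PLAN, 0, None, None
    trace: List[State] = []
    while state not in (State.DONE, State.FAILED):
        trace.append(state)
        if state is State.PLAN:
            attempts += 1
            step = plan(attempts)
            state = State.ACT
        elif state is State.ACT:
            result = act(step)
            state = State.VERIFY
        else:  # State.VERIFY
            if verify(result):
                state = State.DONE
            elif attempts >= max_attempts:
                state = State.FAILED
            else:
                state = State.PLAN
    trace.append(state)
    return state, result, trace


# Stand-in callables: propose a tunable value, "apply" it, check a limit.
propose = lambda attempt: 70 - 10 * attempt
apply_setting = lambda value: value
within_limit = lambda value: value <= 40

final_state, final_value, trace = run_agent(propose, apply_setting, within_limit)
print(final_state.name, final_value, len(trace))
```

Keeping the transitions explicit bounds the agent's behavior: it can only retry a fixed number of times, and every run leaves a trace that can be audited.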

Tool Use

  • LLM + symbolic executor
  • system tool invocation
  • fuzzers and validators

Frameworks

  • AIOS
  • OSAgent
  • AIOS-Agent
  • LSFS
  • CoRE

Is Agentic

true

Architectures

  • single-agent
  • multi-agent
  • memory-enhanced agents

Collaboration

  • role-specialized agents
  • agent orchestration pipelines
  • multi-agent grading/education

Optimization Features

Token Efficiency

  • attention reuse (AttentionStore)
  • paged attention / memory paging for long context
  • KV cache management
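
The paging idea behind these entries can be sketched as a block allocator. The KV cache is split into fixed-size blocks, and each sequence holds a block table mapping its logical positions to physical blocks allocated on demand. This is a bookkeeping-only illustration with no tensors; the class and method names are hypothetical and not an API of any surveyed system.

```python
from typing import Dict, List, Tuple


class PagedKVCache:
    """Tracks which physical cache blocks each sequence owns."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> Tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:   # owned blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; caller must preempt or swap")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("request-a") for _ in range(3)]
print(slots, len(cache.free_blocks))
cache.release("request-a")
print(len(cache.free_blocks))
```

Because blocks are allocated one at a time, a sequence wastes at most one partially filled block, and blocks freed by a finished request are immediately reusable by others.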

Infra Optimization

  • library OS / Demikernel for microsecond datacenter paths
  • modular kernels and per-application OS instances
  • software-defined far memory

Model Optimization

  • quantization
  • model distillation
  • lightweight NN for kernel paths

System Optimization

  • communication-computation overlap
  • GPU-initiated I/O (BaM)
  • device-aware scheduling

Training Optimization

  • federated/continuous retraining
  • noise filtering and period-based labeling
  • domain-specific data curation

Inference Optimization

  • kernel-bypass and in-kernel inference
  • quantized sub-µs inference (Heimdall)
  • adaptive batching and preemption (ExeGPT, XSched)
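
Adaptive batching can be illustrated with a simple size-or-deadline rule. A batch closes when it is full or when the next request arrives after the batch has been open longer than a wait budget. This is a generic sketch and does not reproduce the schedulers in ExeGPT or XSched; preemption is not modeled, and the function name and parameters are assumptions.

```python
from typing import List


def form_batches(arrivals_ms: List[float], max_batch: int,
                 max_wait_ms: float) -> List[List[int]]:
    """Group request indices into batches from sorted arrival times.

    A batch closes when it holds max_batch requests, or when a request
    arrives more than max_wait_ms after the batch was opened.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    opened_at = 0.0
    for index, arrival in enumerate(arrivals_ms):
        if current and arrival - opened_at > max_wait_ms:
            batches.append(current)      # deadline passed: flush what we have
            current = []
        if not current:
            opened_at = arrival
        current.append(index)
        if len(current) == max_batch:
            batches.append(current)      # batch is full: flush immediately
            current = []
    if current:
        batches.append(current)
    return batches


# A burst of four requests, then two close together, then a straggler.
arrivals = [0.0, 1.0, 2.0, 3.0, 50.0, 51.0, 200.0]
print(form_batches(arrivals, max_batch=4, max_wait_ms=10.0))
```

Under bursty load the batches fill up and throughput dominates. Under sparse load the deadline keeps individual requests from waiting for a batch that never fills.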

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Model drift: learned policies degrade as hardware and workloads change.
  • Inference overhead: kernel-embedded models must meet tight latency budgets.
  • Explainability: LLM/agent decisions are hard to audit, and their outputs can be hallucinated or include unsafe patches.
  • Data scarcity: representative, labeled OS traces are hard to collect and share.
  • Engineering complexity: legacy kernels lack modular hooks for safe AI integration.
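
A common mitigation for model drift is to monitor the model's rolling accuracy online and disable it when accuracy falls below a floor. The sketch below shows one way this could look; the class name, window size, and threshold are hypothetical, and the survey does not prescribe this particular mechanism.

```python
from collections import deque


class DriftMonitor:
    """Tracks recent prediction outcomes and gates use of the model."""

    def __init__(self, window: int, min_accuracy: float):
        self.outcomes = deque(maxlen=window)   # True where prediction was right
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual) -> None:
        self.outcomes.append(predicted == actual)

    @property
    def accuracy(self) -> float:
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def model_enabled(self) -> bool:
        # Wait for a full window before judging, to avoid noisy early flips.
        if len(self.outcomes) < self.outcomes.maxlen:
            return True
        return self.accuracy >= self.min_accuracy


monitor = DriftMonitor(window=10, min_accuracy=0.8)
for _ in range(10):
    monitor.record("fast", "fast")      # model agrees with observed outcome
print(monitor.accuracy, monitor.model_enabled())

for _ in range(5):
    monitor.record("fast", "slow")      # workload shifted; model is now wrong
print(monitor.accuracy, monitor.model_enabled())
```

When `model_enabled()` returns False, the caller would fall back to the static heuristic and flag the model for retraining.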

When Not To Use

  • In hard real-time kernel paths with strict determinism requirements.
  • On resource-limited embedded devices without inference acceleration.
  • When high-quality, representative telemetry is unavailable.
  • In environments requiring provable, formally verified behavior without AI fallback.

Failure Modes

  • Hallucinated or incorrect code patches from LLMs causing regressions.
  • Model drift leading to performance regressions or SLO violations.
  • Inference-induced contention that increases tail latency.
  • Adversarial inputs that trigger incorrect scheduling or security alerts.

Core Entities

Models

  • GPT-4
  • LLaMA
  • Gemini
  • MLP
  • LSTM
  • Autoencoder
  • Random Forest

Metrics

  • P99 latency
  • average latency
  • throughput (QPS)
  • accuracy
  • energy/TCO
  • device endurance
  • code coverage

Datasets

  • Microsoft/Alibaba/Tencent I/O traces (Heimdall)
  • Linux kernel bug corpus (LinuxFLBench)
  • VulnLoc
  • SAN2VULN

Benchmarks

  • UnixBench (AutoOS experiments)
  • SLURM benchmarks (Chronus)
  • Kernel fuzzing coverage (ECG, KernelGPT)