Notes for Linux Kernel

39 minute read




  • Preemptive multitasking
  • Virtual memory
  • Shared libraries
  • Demand loading, dynamic kernel modules
  • TCP/IP networking
  • Symmetrical Multi-Processing support
  • Open source

Two modes in Linux

  • User mode: application software(including different libraries)
  • Kernel mode: system calls, linux kernel, hardware

What’s a kernel

  • executive system monitor
  • controls and mediates access to hardware
  • implements and supports fundamental abstractions (processes, files, devices)
  • schedules/allocates system resources (memory, cpu, disk, descriptors)
  • security and protection
  • respond to user requests for service (system calls)

Kernel design goal

performance, stability, capability, security and protection, portability, extensibility

Linux source tree (directories)

  • /root : the home directory for the root user
  • /home : the user’s home directories along with directories for services (ftp, http…)
  • /bin : commands needed during booting up that may be needed by normal users
  • /sbin : similar to /bin. but not intended for normal users. run by Linux.
  • /proc : a virtual file system that exits in the kernels imagination, which is memory. not on a disk.
    • a directory with info about process number
    • each process has a directory below /proc
  • /usr : contain all commands, libraries, man pages, games and static files for normal operations
  • /boot : files used bu the bootstrap loader. kernel images are often kept here.
  • /lib : shared libraries needed by the programs on the root file system
  • /modules : loadable kernel modules, especially those needed to boot the system after disasters
  • /dev : device files
  • /etc : configuration files specific to the machine
  • /skel : when a home directory is created, it is initialized with files from this directory
  • /sysconfig : files that configure the linux system for devices
  • /var : files that change for mail, news , printers log files, man pages, temp files
  • /mnt : mount points for temporary mounts by the system administrator
  • /tmp : temporary files. programs running after bootup should use /var/tmp


  • subdirectories for each current port
  • each contains kernel, lib, mm, boot and other directories whose contents override code stubs in architecture independent code


  • largest amount of code in the kernel tree
  • device, bus, platform and general directories


  • contains virtual file system framework (VFS) and subdirectories for actual file systems


  • include/asm-* : architecture-dependent include subdirectories
  • include/linux :
    • header info needed both by the kernel and user apps
    • usually linked to /usr/include/linux
    • kernel-only portions guarded by
        /* kernel stuff */


  • version.c : contains the version banner that prints at boot
  • main.c : architecture-independent boot code (start_kernel is the main entry point)


  • the core kernel code
  • sched.c : the main kernel file (scheduler, wait queues, timers, alarms, task queues)
  • process control : fork.c, exec.c, signal.c, exit.c etc…
  • kernel module support : kmod.c, ksyms.c, module.c
  • other operations : time.c, resource.c, dma.c, printk.c, info.c, sys.c …


  • kernel code cannot call standard C library routines


  • paging and swapping
  • allocation and deallocation
  • memory mapping


scripts for:

  • menu-based kernel configuration
  • kernel patching
  • generating kernel documentation


  • Linux is a modular, UNIX-like monolithic kernel
  • kernel is the heart of the OS that executes with special hardware permission (kernel mode)
  • “core kernel” provides framework, data structures, support for drivers, modules, subsystems
  • architecture dependent source subtrees live in /arch


How computer startup?

  • booting is a bootstrapping process that starts operating systems when the user turns on a computer system
  • a boot sequence is the set of operations the computer performs when it is switched on that load an operating system

Booting sequence

  • Turn on
  • CPU jump to address of BIOS (0xFFFF0)
  • BIOS runs POST (Power-On Self Test)
  • Find bootable devices
  • Load and execute boot sector from MBR
  • Load OS

BIOS (Basic Input/Output System)

  • BIOS refers to the software code run by a computer when first powered on
  • the primary function is code program embedded on a chip and controls various devices that make up the computer

MBR (Master Boot Record)


  • OS is booted from a hard disk, where MBR contains the primary boot loader
  • The MBR is a 512-byte sector, located in the first sector in the disk (sector 1 of cylinder 0, head 0)
  • After the MBR is loaded into RAM, the BIOS yields control to it
  • The first 446 bytes are the primary boot loader, which contains both executable code and error message
  • The next 64 bytes are the partition table, which contains a record for each of four partitions
  • The MBR ends with two bytes that are defined as the magic number (0xAA55 or 0x55AA). The magic number serves as a validation check of the MBR.
  • To see the contents of MBR, use:
      sudo dd if=/dev/sda of=mbr.bin bs=512 count=1
      od -xa mbr.bin

Boot loader

  • more aptly called the kernel loader. the task at this stage is to load the linux kernel
  • optional, initial RAM disk
  • GRUB and LILO are the most popular Linux boot loader
    • GRUB: GRand Unified Bootloader
      • an operating system independent bootloader
      • a multiboot software packet from GNU
      • flexible command line interface
      • file system access
      • support multiple executable format
      • support diskless system
      • download OS from network
    • LILO: LInux LOader
      • not dependent on a specific file system
      • can boot from hard disk and floppy
      • up to 16 different images
      • must change LILO when kernel image file or config file is changed

Kernel Image

  • The kernel is the central part in most OS beacause of its task, which is the management of the system’s resources and the communication between hardware and software components.
  • kernel always store in memory until computer is turned off.
  • kernel image is not an executable kernel, but a compressed kernel image
  • zImage size is less than 512 KB
  • bzImage is larger than 512 KB

Task of kernel

  • process management
  • memory management
  • device management
  • system call

Init process

  • the first thing the kernel does is to execute init process
  • init is the root/parent of all processes executing in linux, and is responsible for staring all other processes
  • the first process that init starts is a script: /etc/rc.d/rc.sysinit
  • based on the appropriate run-level, scripts are executed to start various processes to run the system and make it functional
  • the init process id is “1”
  • init is responsible for starting system processes defined in /etc/inittab file
  • init typically will start multiple instances of “getty”, which waits for console logins which spawn one’s user shell process
  • upon shutdown, init control the sequence and processes for shutdown


  • a runlevel is a software configuration of the system which allows only a selected group of processes to exist
  • init can be in one of seven runlevels: 0-6

rc#.d files

  • rc#.d files are the scripts for a given runlevel that run during boot and shutdown
  • the scripts are found in /etc/rc.d/rc#.d, where the symbol # represents the run level


  • deamon is a background process
  • init.d is a directory that admin can start/stop individual deamons by changing on it
  • admin can issue the command and either the satrt, stop, status, restart or reload option
  • e.g. to stop the web server
      cd /etc/rc.d/init.d/
      httpd stop


  • setup.S连同内核映象由bootsect.S装入。setup.S从BIOS获取计算机系统的参数,放到内存参数区,仍在实模式下运行
  • 辅助程序setup为内核映象的执行做好准备,然后跳转到0x10000(小内核),0x100000(大内核)开始内核本身的执行,此后就是内核的初始化过程
  • setup.S:
    • 版本检查和参数设置
    • 为进入保护模式做准备


  • 初始化寄存器和数据区
  • 核心代码解压缩:调用mics.c 中的 decompress_kernel
  • 页表初始化
  • 初始化 idt, gdt, ldt
  • 启动核心
    • 前面是对CPU进行初始化,并启动保护模式
    • 现在的任务是初始化内核的核心数据结构,主要涉及:
      • 中断管理
      • 进程管理
      • 内存管理
      • 设备管理
    • 进入保护模式后,系统从start_kernel开始执行,start_kernel函数变成0号进程,不再返回
    • Start_kernel显示版本信息,调用setup_arch()初始化核心的数据结构
    • 最后,调用kernel_thread()创建1号进程init进程
    • 父进程创建init子进程之后,返回执行cpu_idle


  1. Q: 在i386中,内核可执行代码在内存中的首地址是否可随意选择?
    A: 从原理上讲是可以的,但实际上要考虑许多因素。例如微机内存的低端1MB的地址空间是不连续的。所以要把内核代码放在低端就要看是否能放的下。如果内核代码模块太大就不能将内核放在低端。
  2. Q: 主引导扇区位于硬盘的什么位置,如果一个硬盘的主引导扇区有故障此硬盘是否还可以使用?
    A: 主引导扇区位于硬盘的0面0道1扇区。一般来讲如果一个硬盘的主引导扇区有故障,此硬盘虽然可以使用,但不能作为引导盘使用了,因为它的主引导扇区不能读出内容。

/proc File System and Kernel Module Programming

What is a kernel module?

  • (wiki) an object file that contains code to extend the running kernel
  • (RedHat) modules are pieces of code that can be loaded and unloaded into the kernel upon demand

    advantages and disadvantages

  • advantages:
    • allowing the dynamic insertion and removal of code from the kernel at run-time
    • save memory cost
  • disadvantages: fragmentation penalty -> decrease memory performance

current kernel modules

cd /lib/modules/2.6.32-22-generic/
find . -name "*.ko"

current loaded modules

lsmod or cat /proc/modules


  • a pseudo file system
  • real time, resides in the virtual memory
  • tracks the processes running on the machine and the state of the system
  • a new /proc file system is created every time your linux machine reboots
  • highly dynamic. the size of the proc directory is 0 and the last time of modification is the last bootup time
  • /proc file system doesn’t exist on any particular media
  • the contents of the /proc file system can be read by anyone who has the requisite permissions
  • certain parts of the /proc file system can be read only by the owner of the process and of course root (some not even by root)
  • the contents of the /proc are used by many utilities which grab the data from the particular /proc directory and display it. e.g. top, ps, lspci, dmesg etc

Tweak kernel parameters

  • /proc/sys: making changes to this directory enables you to make real time changes to certain kernel parameters
  • e.g. /proc/sys/net/ipv4/ip_forward, which has value of “0”, which can be seen using cat. This can be changed in real time by echo 1 > /proc/sys/net/ipv4/ip_forward, thus allowing IP forwarding.

Details of some files in /proc

  • buddyinfo: contains the number of free areas of each order for the kernel buddy system
  • cmdline: kernel command line
  • cpuinfo: human-readable information about the processors
  • devices: list of device drivers configured into the currently running kernel (block and character)
  • dma: shows which DMA channels are being used at the moment
  • execdomains: related to security
  • fb: frame buffer devices
  • filesystems: filesystems configured/supported into/by the kernel
  • interrupts: number of interrupts per IRQ on the x86 architecture
  • iomem: shows the current map of the system’s memory for its various devices
  • ioports: provides a list of currently registered port regions used for input or output communication with a device
  • kcore:
    • this file represents the physical memory of the system and is stored in the core file format
    • unlike most /proc files, kcore does display a size. This value is given in bytes and is equal to the size of physical memory (RAM) used plus 4 KB.
    • its content are designed to be examined by a debugger, such as gdb, the GNU Debugger
    • only root user has the rights to view this file
  • kmsg: used to hold messages generated by the kernel, which are then picked up by other programs, such as klogd
  • loadavg:
    • provides a look at load average
    • the first three columns measure CPU utilization of the last 1, 5 and 10 minutes periods.
    • the fourth column shows the number of currently running processes and the total number of processes
    • the last column displays the last process ID used
  • locks: displays the files currently locked by the kernel
  • mdstat: contains the current information for multiple-disk, RAID configurations
  • meminfo:
    • one of the most commonly used /proc files
    • it reports plenty of valuable information about the current utilization of RAM on the system
  • misc: this files lists miscellaneous drivers registered on the miscellaneous major device, which is number 10
  • modules: displays a list of all modules that have been loade by the system
  • mounts: provides a quick list of all mounts in use by the system
  • mtrr: refers to the current Memory Type Range Registers (MTRRs) in use with the system
  • partitions: detailed information on the various partitions currently available to the system
  • pci: full list of every PCI device on your system
  • slabinfo: information about the memory usage on the slab level
  • stat: keeps track of a variety of different statistics about the system since it was last restarted
  • swap: measure swap space and its uilization
  • uptime: contains information about how long the system has on since its last restart
  • version: tells the versions of the kinux kernel and gcc, as well as the version of red hat linux installed on the system

The numerical named directories

  • the processed that are running at the instant a snapshot of the /proc file system was taken
  • the contents of all the directories are the same as these directories contain the carious parameters and status of the corresponding process
  • you have full access only to the processes that you have started

A typical process directory

  • cmdline: contains the whole command line used to invoke the process. the contents of this file are the command line arguments with all the parameters (without formatting/spaces)
  • cwd: symbolic link to the current working directory
  • environ: contains all the process-specific environment variables
  • exe: symbolic link of the executable
  • maps: parts of the process’ address space mapped to a file
  • fd: this directory contains the list of file descriptors as opened by the particular process
  • root: symbolic link pointing to the directory which is the root file system for the particular process
  • status: information about the process

Other subdirectories in /proc

  • /proc/self: link to the currently running process
  • /proc/bus:
    • contains information specific to the various buses available on the system
    • for ISA, PCI and USB buses, current data on each is available in /proc/bus/<bus type directory>
    • individual bus directories, signified with numbers, contains binary files that refer to the various devices available on that bus
  • /proc/driver: specific drivers in use by kernel
  • /proc/fs: specific file system, file handle, inode, dentry and quota information
  • /proc/ide: information abput IDE devices
  • /proc/irq: used to set IRQ to CPU affinity
  • /proc/net: networking parameters and statistics
    • arp: kernel’s ARP table. useful for connecting hardware address to an IP address on a system
    • dev: lists the network devices along with transmit and receive statistics
    • route: displays the kernel’s routing table
  • /proc/scsi: like /proc/ide, it gives info about scsi devices
  • /proc/sys:
    • allows you to make configuration changes to a running kernel, by echo command
    • any configuration changes made will disappear when the system is restarted

/proc/sys subdirectories

  • /proc/sys/dev: provides parameters for particular devices on the system. e.g. cdrom/info: many important CD-ROM parameters
  • /proc/sys/kernel:
    • acct: controls the suspension of process accounting based on the percentage of free space available on the filesystem containing the log
    • ctrl-alt-del: controls whether [Ctrl]-[Alt]-[Delete] will gracefully restart the computer using init (value 0) or force an immediate reboot without syncing the dirty buffers to disk (value 1).
    • domainname: allows you to configure the system’s domain name, such as
    • hostname: allows you to configure the system’s host name, such as
    • threads-max: sets the maximum number of threads to be used in the kernel, with a default value of 4096
    • panic: defines the number of seconds the kernel will postpone rebooting the system when a kernel panic is experienced. By default, the value is set to 0, which disables automatic rebooting after a panic.
  • /proc/sys/vm: facilitates the configuration of the Linux kernel’s virtual memory subsystem

Advantages and disadvantages of the /proc file system

  • advantages:
    • coherent, intuitive interface to the kernel
    • great for tweaking and collecting status info
    • easy to use and programming
  • disadvantages:
    • certain amount of overhead, must use fs calls, alleviated somewhat by sysctl() interface
    • user can possibly cause system instability

/proc file system entries

  • to use any of the procfs functions, you have to include the correct header file #include <linux/proc_fs.h>
  • struct proc_dir_entry* create_proc_entry(const char* name, mode_t mode, struct proc_dir_entry* parent);
    • this function creates a regular file with the name name, the mode mode in the directory parent
    • to create a file in the root of the procfs, use NULL as parent parameter
    • when successful, the function will return a pointer to the freshly created struct proc_dir_entry
    • e.g. foo_file = create_proc_entry("foo", 0644, example_dir);
  • struct proc_dir_entry* proc_mkdir(const char* name, struct proc_dir_entry* parent);
    • create a directory name in the procfs directory parent
  • struct proc_dir_entry* proc_symlink(const char* name, struct proc_dir_entry* parent, const char* dest);
    • this creates a symlink in the procfs directory parent that points from name to dest. this translates in userland to ln -s dest name
  • void remove_proc_entry(const char* name, struct proc_dir_entry* parent);
    • removes the entry name in the directory parent from the procfs.
    • be sure to free the data entry from the struct proc_dir_entry before the function is called


  • modules can only use APIs exported by kernel and other modules
    • no libcs
    • kernel exports some common APIs
  • modules are part of kernel
    • modules can control the whole system
    • as a result, can damage the whole system
  • modules basically can’t be written with C++
  • determine which part should be in kernel

Process Management

Processes, Lightweight Processes and Threads

  • Process: an instance of a program in execution
  • (User) Thread: an execution flow of the process
    • Pthread (POSIX thread) library
  • Lightweight process (LWP): used to offer better support for multithreaded applications
    • LWP may share resources: address space, open files…
    • To associate a lightweight process with each thread

Process descriptor

task_struct data structure:

  • state: process state
  • thread_info: low-level information for the process
  • mm: pointers to memory area descriptors
  • tty: tty associated with the process
  • fs: current directory
  • files: pointers to file descriptors
  • signal: signals received

Process state

  • TASK_RUNNING: 该状态表示进程处于可运行状态,也就是说要么正在CPU中运行,要么在runqueue队列中等待运行
  • TASK_INTERRUPTABLE: 该状态表示进程处于可中断的睡眠状态。该进程正处在睡眠,但是可以被任何信号唤醒。当信号将该进程唤醒后,进程会去对信号做出响应。
  • TASK_UNINTERRUPTABLE: 该状态表示进程处于不可中断的睡眠状态。该进程正处于睡眠,专心等待某一个事件(一般是IO事件),并且不希望被其他信号唤醒。
  • EXIT_ZOMBIE: 该状态是该进程变为僵尸进程,即其父进程没有对该进程的结束信号进行处理

Identifying a process

  • process descriptor pointers: 32 bits
  • process ID (PID): 16 bits (~32767 for compatibility)
    • linux associates different PID with each process or LWP
    • programmers expect threads in the same group to have a common PID
    • thread group: a collection of LWPs
      • the PID of the first LWP in the group
      • tgid field in process descriptor: using getpid() system call

The process list

  • tasks field in task_struct structure
    • type list_head
    • prev, next fields point to the previous and the next task_struct
  • process 0 (swapper): init_task

Parenthood relationships among processes

  • process 0 and 1: created by the kernel
    • process 1 init: the ancestor of all processes
  • fields in process descriptor for parenthood
    • real_parent
    • parent
    • children
    • sibling

Pidhash table and chained lists

  • to support the search for the process descriptor of a PID, since sequential search in the process list is inefficient
  • the pid_hash array contains four hash tables and correspnding field in the process descriptor
    • pid: PIDTYPE_PID
    • tgid: PIDTYPE_TGID (thread group leader)
    • pgrp: PIDTYPE_PGID (group leader)
    • session: PIDTYPE_SID (session leader)
  • chaining is used to handle PID collisions
  • the size of each pidhash table is dependent on the available memory


pids field of the process descriptor: the pid data structure

  • nr: PID number
  • pid_chain: links to the previous and the next elements in the hash chain list
  • pid_list: head of the per-PID list (in thread group)

How processes are organized

  • processes in TASK_STOPPED, EXIT_ZOMBIE, EXIT_DEAD: not linked in lists
  • processes in TASK_INTERRUPTABLE, TASK_UNINTERRUPTABLE: waiting queues
  • two kinds of sleeping processes:
    • exclusive process
    • nonexclusive process: always woken up by the kernel when the event occurs

Process switch, task switch, context switch

  • hardware context switch: a far jmp (in older Linux)
  • software context switch: a sequence of mov instructions
    • allows better control over the validity of data being loaded
    • the amount of time required is about the same

Performing the process switch

  • switch the page global directory
  • switch the kernel mode stack and the hardware context

Task State Segment

TSS: a specific segment type in x86 architecture to store hardware contexts

Creating processes

  • in traditional UNIX, resources owned by parent process are duplicated
    • very slow and inefficient
    • mechanisms to solve this problem
      • Copy on Write: parent and child read the same physical page
      • Ligitweight process (LWP): parent and child share per-process kernel data structures
      • vfork() system call: parent and child share the memory address space

clone(), fork() and vfork() system calls

  • clone(fn, arg, flags, child_stack, tls, ptid, ctid): creating lightweight process
    • a wrapper function in C library
    • Uses clone() system call
  • fork() and vfork() system calls: implemented by clone with different parameters
  • each invokes do_fork() function

Kernel threads

  • kernel threads run only in kernel mode
  • they use only linear addresses greater than PAGE_OFFSET
  • kernel_thread(): to create a kernel thread
  • example kernel threads:
    • process 0 (swapper process), the ancestor of all processes
    • process 1 (init process)
    • others: keventd, kapm…

Destrying processes

  • exit() library function
    • two system calls in Linux 2.6
      • _exit() system call: handled by do_exit() function
      • exit_group() system call: handled by do_group_exit() function
  • process removal: releasing the process descriptor of a zombie process by release_task()

Scheduling policy

  • based on time-sharing:
    • time slice
  • based on priority ranking:
    • dynamic
  • classification of processes:
    • interactive processes: e.g. shells, text editors, GUI applications
    • batch processes: e.g. compilers, database search engine, web server
    • real-time processes: e.g. audio/video applications, data-collection from physical sensors, robit controllers

Process preemption

  • Linux (user) processes are preemptive
    • when a new process has higher priority than the current
    • when its time quantum expires, TIF_NEED_RESCHED in thread_info will be set
  • a preempted process is not suspended since it remains in the TASK_RUNNING state
  • Linux kernel before 2.6 is nonpreemptive, simpler

How long should a quantum last

  • neither too long nor too short:
    • too short: overhead for process switch
    • too long: process no longer appear to be executed concurrently
  • always a compromise: the rule of thumb, to choose a duration as long as possible, while keeping good system response

Scheduling algorithm

  • earlier version of Linux: simple and straightforward
    • at every process switch, scan the runnable processes, compute the priorities and select the best one to run
  • much more complex in Linux 2.6
    • scales well with the number of processes and processors
    • constant time scheduling
  • 3 scheduling classes
    • SCHED_FIFO: first-in first-out real-time
    • SCHED_RR: round-robin real-time
    • SCHED_NORMAL: conventional time-shared processes

Scheduling of conventional processes

Static priority

  • static priority
    • conventional processes: 100 - 139
      • can be changed by nice(), setpriority() system calls
    • base time quantum: the time-quantum assigned by the scheduler if it has exhausted its previous time quantum
      base time quantum

Dynamic priority and average sleep time

  • dynamic priority: 100 - 139
  • dynamic priority = max(100, min(static_priority - bonus + 5, 139))
    • bonus: 0 - 10
      • < 5: penalty
      • > 5: premium
    • dependent on the average sleep time: average number of nanoseconds that the process spent while sleeping
    • the more average sleep time, the more bonus:

Determine the status of a process

To determine whether a process is considered to be interactive or batch

  • interactive if: dynamic_priority <= 3 * static_priority / 4 + 28
  • i.e. bonux - 5 >= static_priority / 4 - 28
    • interactive delta

Active and expired processes

  • active processes: runnable processes that have not exhausted their time quantum
  • expired processes: runnable processes that have exhausted their time quantum
  • periodically, the role of processes changes

Scheduling of real-time processes

  • real-time priority: 1 - 99
  • real-time processes are always considered active
    • can be changed by sched_setparam() and sched_setscheduler() system calls
  • a real-time process is replaced only when:
    • another process has higher real-time priority
    • put to sleep by blocking operation
    • stopped or killed
    • voluntarily relinquishes the CPU by sched_yield() system call
    • for round-robin real time (SCHED_RR), time quantum exhausted

Implementation Support of Scheduling

Data structures used by the scheduler

  • runqueue data structure for each CPU
  • two sets of runnable processes in the runqueue structure
  • prio_array_t data structure
    • nr_active: # of process descriptors in the list
    • bitmap: priority bitmap
    • queue: the 140 list_heads

Functions used by the scheduler

  • scheduler_tick(): keep the time_slice counter up-to-date
  • try_to_wake_up() awaken a sleeping process
  • recalc_task_prio(): updates the dynamic priority
  • load_balance(): keep the runqueue of multiprocess system balanced
  • schedule(): select a new process to run
    • Direct invocation:
      • when the process must be blocked to wait for resource
      • when long iterative tasks are executed in device drivers
    • Lazy invocation: by setting the TIF_NEED_RESCHED flag
      • when current process has used up its quantum, by scheduler_tick()
      • when a process is woken up, by try_to_wake_up()
      • when a sched_setscheduler() system call is issued

Runqueue balancing in multiprocessor systems

  • 3 types of multiprocessor systems
    • Classic multiprocessor architecture
    • Hyper-threading
    • NUMA
  • A CPU can execute only the runnable processes in the corresponding runqueue
  • The kernel periodically checks if the workloads of the runqueues are balanced

Scheduling domains

A set of CPUs whose workloads are kept balanced by the kernel

  • hierarchically organized
  • partitioned into groups: a subset of CPUs
  • nice(): for backward compatability only
    • allow processes to change their base priority
    • replaced by setpriority()
  • getpriority() and setpriority():
    • act on the base priorities of all processes in a given group
    • who: the value of pid, pgrp, or uid field to be used for selecting the processes
    • niceval: the new base priority value: -20 - +19
  • sched_getscheduler(), sched_setscheduler(): queries/sets the scheduling policy
  • sched_getparam(), sched_setparam(): queries/sets the scheduling parameters
  • sched_yield(): to relinquish the CPU voluntarily without being suspended
  • sched_get_priority_min(), sched_get_priority_max(): to return the minimum/maximum real-time static priority
  • sched_rr_get_interval(): to write into user space the round-robin time quantum for real-time process

Completely Fair Scheduler (CFS)

  • Motivation of CFS: running task gets 100% usage of the CPU, all other tasks get 0% usage of the CPU
  • New scheduling algorithm in linux kernel 2.6
  • Design:
    • the same virtual runtime for each task
    • increasing the priority for sleeping task
  • Implementation
    • stores the records about the planned tasks in a red-black tree
    • pick efficiently the process that has used the least amount of time (the leftmost node of the tree)
    • the entry of the picked process is then removed form the tree, the spent execution time is updated and the entry is then returned to the tree where it normally takes some other location.

Kernel synchronization

Kernel control paths

  • Linux kernel: like a server that answers requests
    • parts of the kernel run in an interleaved way
  • Kernel control path: a sequence of instructions executed in kernel mode on behalf of current process
    • interrupts and exceptions
    • lighter than a process (less context)
  • Three CPU states are considered:
    • User: running a process in user mode
    • Excp: running an exception or a system call handler
    • Intr: running an interrupt handler

Kernel preemption

  • Preemptive kernel: a process running in kernel mode can be replaced by another process while in the middle of a kernel function
  • The main motivation for making a kernel preemptive is to reduce the dispatch latency of the user mode process
    • dispatch latency: delay between the time they become runnable and the time they actually begin running
  • The kernel can be preempted only when it is excuting an exception handler (in particular a system call) and the kernel preemption has not been explicitly disabled

When synchronization is necessary

  • A race condition can occur when the outcome of a computation depends on how two or more interleaved kernel control paths are nested
  • To identify and protect the critical regions in exception handlers, interrupt handlers, deferrable functions, and kernel threads
    • On single CPU, critical region can be implemented by disabling interrupts while accessing shared data
    • If the same data is shared only by the service routines of system calls, critical region can be implemented by disabling kernel preemption while accessing shared data
  • Things are more complicated on multiprocessor systems
    • different synchronization techniques are necessary

When synchronization is not necessary

  • The same interrupt cannot occur until the handler terminates
  • Interrupt handlers and softirqs are nonpreemptable, non-blocking
  • A kernel control path performing interrupt handling cannot be interrupted by a kernel control path executing a deferrable function or a system call service routine
  • Softirqs cannot be interleaved

Per-CPU variables

  • The simplest and most efficient synchronization technique consists of declaring kernel variables as per-cpu variables
    • an array of data structures, one element per CPU in the system
    • a CPU should not access the elements of the array corresponding to the other CPUs
  • While per-cpu variables provide protection against concurrent accesses from several CPUs, they do not provide protection against accesses from asynchronous functions (interrupt handles and deferrable functions)
  • per-cpu variables are prone to race conditions caused by kernel preemption, both in uniprocessor and multiprocessor systems

Memory Barriers

  • when dealing with synchronization, instruction reordering must be avoided
  • a memory barrier primitive ensures that the operations before the primitive are finished before the operations after the primitive
    • all instructions that operate on I/O ports
    • all instructions prefixed by lock byte
    • all instructions that write into control registers, system registers or debug registers
    • a few special instructions, e.g. iret

Spin locks

  • Spin locks are a special kind of lock designed to work in a multiprocessor environment
    • busy waiting
    • very convenient

Read/Write Spin Locks

  • To increase the amount of concurrency in the kernel
    • multiple reads, one write
  • rwlock_t structure
    • lock field: 32 bit


  • Introduced in Linux 2.6
  • similar to read/write spin locks
  • except that they give a much higher priority to writers
  • a writer is allowed to proceed even when readers are active

Read-Copy Update

  • Read-Copy update (RCU): another synchronization technique designed to protect data structures that are mostly accessed for reading by several CPUs
    • RCU allows many readers and many writers to proceed concurrently
    • RCU is lock free
  • Key ideas
    • only data structures that are dynamically allocated and referenced via pointers can be protected by RCU
    • No kernel control path can sleep inside a critical section protected by RCU


  • two kinds of semaphores
    • kernel semaphores: by kernel control paths
    • System V IPC semaphores: by user processes
  • Kernel semaphores
    • struct semaphore
      • count
      • wait
      • sleepers
    • up(): to release a kernel semaphore (similar to signal)
    • down(): to acquire kernel semaphore (similar to wait)

Read/Write semaphores

  • similar to read/write spin locks
    • except that waiting processes are suspended instead of spinning
  • struct rw_semaphore
    • count
    • wait_list
    • wait_lock
  • init_rwsem()
  • down_read(), down_write: acquire a read/write semaphore
  • up_read(),up_write(): release a read/write semaphore


  • to solve a subtle race condition in multiprocessor systems
    • similar to semaphores
  • struct completion
    • done
    • wait
  • complete(): corresponding to up()
  • wait_for_completion(): corresponding to down()

Local interrupt disabling

  • interrupts can be disabled on a CPU with cli instructions
    • local_irq_disable() macro
  • interrupts can enabled by sti instructions
    • local_irq_enable() macro

Disabling/Enabling deferrable functions

  • softirq
  • the kernel sometimes need to disable deferrable functions without diabling interrupts
    • local_bh_disable() macro
    • local_bh_enable() macro

Synchronizing accesses to kernel data structures

  • rule of thumb for kernel developers:
    • always keep the concurrency level as high as possible in the system
    • two factors:
      • the number of I/O devices that operate concurrently
      • the number of CPUs that do productive work
  • a shared data structure consisting of a single integer value can be updated by declaring it as an atomic_t type and by using atomic operations
  • inserting an element into a shared linked list is never atomic since it consists of at leasr pointer assignments

Examples of race condition

  • when a program uses two or more semaphores, the potential for deadlock is present because two different paths could wait for each other
  • Linux has few problems with deadlocks on semaphore requests since each path usually acquire just one semaphore
  • in cases such as rmdir() and rename() system calls, two semaphores requests
  • to avoid such deadlocks, semaphore requests are performed in address order
  • semaphore request are performed in predefined address order

Symmetric multiprocessing (SMP)

Traditional View

  • Traditionally, the computer has been viewed as a sequential machine
    • a processor executes instructions one at a time in sequence
    • each instruction is a sequence of operations
  • two popular approached to providing parallelism
    • symmetric multiprocesors
    • clusters

Categories of computer systems

  • single instruction single data (SISD) stream
    • single processor executes a single instruction stream to operate on data stored in a single memory
  • Single instruction Multiple data (SIMD) stream
    • each instruction is executed on a different set of data by the different processors
  • multiple instruction single data (MISD) stream
    • a sequence of data is transmitted to a set of processors, each execute a different instruction sequence
  • multiple instruction multiple data (MIMD)
    • a set of processors simultaneously execute different instruction sequences on different data sets

Symmetric multiprocessing

  • kernel can execute on any processor
    • allowing portions of the kernel to execute in parallel
    • typically each processor does self-scheduling from the pool of available process or threads


  • Symmetric multiprocessing (SMP) involves a multiprocessor computer hardware and software where two or more identical processors connect to a single shared main memory, have full access to all I/O devices and are controlled by a single OS instance that treats all processors equally, reserving none for special purposes
  • Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as seperate processors

Linux Support for SMP

  • 启动过程:BSP负责操作系统的启动,在启动的最后阶段,BSP通过IPI激活各个AP,在系统的正常运行过程中,BSP和AP基本上是无差别的
  • 进程调度:与单处理器系统的主要差别是执行进程切换后,被换下的进程有可能会换到其他CPU上继续运行。在计算优先权时,如果进程上次运行的CPU也是当前CPU,则会适当提高优先权,这样可以更有效地利用Cache
  • 中断系统:为了支持SMP,在硬件上需要APIC中断控制系统。Linux定义了各种IPI的中断向量以及传送IPI的函数


  • Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor
  • Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memeory shared between processors)
  • Benefits: the benefits of NUMA are limited to particular workloads, notably on servers where the data are often associated strongly with certain tasks or users

SMP 系统启动过程

  1. BIOS初始化:屏蔽AP(Application Processor),建立系统配置表
  2. MBR里面的引导程序(LILO,GRUB等)将内核载入内存
  3. 执行 head.S中的 startup_32函数,最后调用 start_kernel
  4. 执行 start_kernel,这个函数相当于应用程序里面的 main 函数
  5. start_kernel 进行一系列的初始化工作,最后将执行:
    • smp_init:关键的一步,启动各AP
    • rest_init,调用 init 创建1号 init 进程,自身调用 cpu_idle成为0号进程
  6. 1号进程 init 完成剩下的工作

reschedule_idle 工作过程

  1. 先检查p进程上一次运行的cpu是否空闲,如果空闲,这是最好的cpu,直接返回。
  2. 找一个合适的cpu,查看SMP中的每个CPU上运行的进程,与p进程相比的抢先权,把具有最高的抢先权值的进程记录在target_task中,该进程运行的cpu为最合适的CPU。
  3. 如target_task为空,说明没有找到合适的cpu,直接返回。
  4. 如果target_task不为空,则说明找到了合适的cpu,因此将target_task->need_resched置为1,如果运行target_task的cpu不是当前运行的cpu,则向运行target_task的cpu发送一个IPI中断,让它重新调度。

Memory Management - Addressing

Memory Addresses

  • three kinds of addresses in 80x86 microprocessors
    • logical address: included in the machine language instructions
    • linear address (virtual address): a single 32-bit unsigned integer that can be used to address up to 4GB
    • physical address (32-bit unsigned integers): used to address memory cells in memory chips
    • logical address => segmentation unit => linear address => paging unit => physical address

Virtual File System (VFS)

Role of VFS

  • a common interface to several kinds of file systems
  • File systems supported by the VFS
    • Disk-based filesystems
    • Network filesystems
    • special filesystems (e.g. /proc)

Common file system interface

  • enable system calls such as open(), read() and write() to work regardless of file system or storage media


  • file system: storage of data adhering to a specific structure
  • file: ordered string of bytes
  • directory: analogous to a folder
    • special type of file
    • instead of normal data, it contains pointers to other files
    • directories are hooked together to create the hierarchical name space
  • metadata: information describing a file

Physical file representation

  • Inode:
    • unique index
    • holds file attributes and data block locations pertaining to a file
  • Data blocks
    • contain file data
    • may not be physically contiguous
  • File name:
    • human-readable identifier for each file

The common file model

  • defines common file model conceptual interfaces and data structures
  • e.g. user space (write()) => VFS (sys_write()) => file system (file system write methods) => physical media
  • object-oriented: data structures and associated operations
    • Superblock object: a mounted file system
    • Inode object: information about a file, represents a specific file
    • File object: interaction between an open file and a process
    • Dentry object: directory entry, single component of a path name

VFS operations

  • each object contains operations object with methods:
    • super_operations: invoked on a specific file system
    • inode_operations: invoked on a specific inodes (which point to a file)
    • dentry_operations: invoked on a specific directory entry
    • file_operations: invoked on a file
  • lower file system can implement own version of methods to be called by VFS
  • if an operation is not defined by a lower file system (NULL), VFS will often call a generic version of the method

VFS Data Structures

Superblock object

  • implemented by each file system
  • used to store information describing that specific file system
  • often physically written at the beginning of the partition (file system control block)
  • found in linux/fs.h

Inode object

  • represent all the information needed to manipulate a file or directory
  • constructed in memory, regardless of how file system stores metadata information
  • three doubly linked lists
    • List of valid unused inodes
    • List of in-use inodes
    • List of dirty inodes

Power Management - From Linux Kernel to Android

Linux Power Management

  • Two popular power management standards in Linux
    • APM (Advanced Power Management)
      • Power management happens in two ways:
        1. function calls from APM driver to the BIOS requesting power state change
        2. automatically based on device activities
      • Power states in APM:
        1. Full on
        2. APM enabled
        3. APM standby
        4. APM suspend
        5. Sleep/Hibernation
        6. Off
      • Weakness of APM:
        1. each BIOS has its own policy, different computers have different activities
        2. unable to know why system is suspend
        3. BIOS does not know user activities
        4. does not know add-on devices (USB)
    • ACPI (Advanced Configuration and Power Interface)
      • control divided between OS and BIOS, decisions managed through OS
      • Enables sophisticated power policies for general-purpose computers with standard usage patterns and hardware

Android Power Management

  • Built on top of Linux Power Management
  • Designed for devices which have a “default-off” behaviors
    • the phone is not supposed to be on when we do not want to use it
    • powered on only when requested to be run, off by default
    • unlike PC, which has a “default-on” behavior
  • Apps and services must request CPU resource with “wake locks” through the Android application framework (PowerManager class) and native Linux libraries in order to keep power on, otherwise Android will shut down the CPU
  • Android PM uses wake locks and time out mechanism to switch state of system power, so that system power consumption decreases
  • If there are no wake locks, CPU will be turned off
  • If there are partial wake locks, screen and keyboard will be turned off

Android PM State Machine

  • When an user application acquire full wake lock or screen/keyboard touch activity event occur, the machine will enter ‘AWAKE’ state
  • If timeout happens or power key is pressed, the machine will enter “notification” state:
    • if partial wake lock is acquired, the machine will remain “notification”
    • if all partial wake locks are released, the machine will enter “sleep” state
  • There is one main wake lock called ‘main’ in the kernel to keep the kernel awake
    • It is the last wake lock to be released when system goes to suspend

      Acquiring wake locks

      1. request sent to PowerManager to acquire a wake lock
      2. PowerManagementService (in the kernel) to take a wake lock
      3. Add wake lock to the list
      4. set the power state:
    • for a full wake lock, state should be set to “ON”
      1. For taking Partial wake lock, if it is the first partial wake lock, a kernel wake lock is taken. This will protect all the partial wake locks. For subsequent requests, kernel wake lock is not taken, but just added to the list

System Sleep

  • early_suspend:
    • Used by drivers that need to handle power mode settings to the device before kernel is suspended
    • Used to turn off screen and non-wakeup source input devices
    • E.g., consider a display driver
      • In early suspend, the screen can be turned off
      • In the suspend, other things like closing the driver can be done
  • late_resume:
    • When system is resumed, resume is called first, followed by resume late

Battery Service

  • The BatteryService monitors the battery status, level, temperature etc.
  • A Battery Driver in the kernel interacts with the physical battery via ADC to read battery voltage and I2C
  • Whenever BatteryService receives information from the BatteryDriver, it will act accordingly
    • E.g. if battery level is low, it will ask system to shutdown
  • Battery Service will monitor the battery status based on received uevent from the kernel
    • uevent : An asynchronous communication channel for kernel