
Linux Kernel Internals(2)--Process and Interrupt Management


2. Process and Interrupt Management


2.1 Task Structure and Process Table

Every process under Linux is dynamically allocated a struct task_struct structure. The maximum number of processes that can be created is limited only by the amount of physical memory present (see kernel/fork.c:fork_init()):



--------------------------------------------------------------------------------

/*
* The default maximum number of threads is set to a safe
* value: the thread structures can take up at most half
* of memory.
*/
max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;


--------------------------------------------------------------------------------

On the IA32 architecture, this basically amounts to num_physpages/4. As an example, on a 512M machine you can create 32k threads. This is a considerable improvement over the 4k-epsilon limit of older (2.2 and earlier) kernels. Moreover, this can be changed at runtime using the KERN_MAX_THREADS sysctl(2), or simply through the procfs interface to kernel tunables:



--------------------------------------------------------------------------------

# cat /proc/sys/kernel/threads-max
32764
# echo 100000 > /proc/sys/kernel/threads-max
# cat /proc/sys/kernel/threads-max
100000
# gdb -q vmlinux /proc/kcore
Core was generated by `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118'.
#0 0x0 in ?? ()
(gdb) p max_threads
$1 = 100000


--------------------------------------------------------------------------------

The set of processes on a Linux system is represented as a collection of struct task_struct structures, which are linked together in two ways:


as a hashtable, hashed by pid, and
as a circular, doubly-linked list using the p->next_task and p->prev_task pointers.
The hashtable is called pidhash[] and is defined in include/linux/sched.h:



--------------------------------------------------------------------------------

/* PID hashing. (shouldnt this be dynamic?) */
#define PIDHASH_SZ (4096 >> 2)
extern struct task_struct *pidhash[PIDHASH_SZ];

#define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))


--------------------------------------------------------------------------------

Tasks are hashed by their pid value, and the above hashing function is supposed to distribute the elements uniformly over their domain (0 to PID_MAX-1). The hashtable is used to quickly find a task by a given pid, using the find_task_by_pid() inline function from include/linux/sched.h:



--------------------------------------------------------------------------------

static inline struct task_struct *find_task_by_pid(int pid)
{
struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];

for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
;

return p;
}


--------------------------------------------------------------------------------

The tasks on each hash list (i.e. hashed to the same value) are linked by p->pidhash_next/pidhash_pprev, which are used by hash_pid() and unhash_pid() to insert a given process into and remove it from the hashtable. All of this is done under the protection of the read-write spinlock called tasklist_lock.

The circular doubly-linked list, maintained via p->next_task/prev_task, makes it easy to walk through all the tasks on the system. This is achieved by the for_each_task() macro from include/linux/sched.h:



--------------------------------------------------------------------------------

#define for_each_task(p) \
	for (p = &init_task ; (p = p->next_task) != &init_task ; )


--------------------------------------------------------------------------------

Users of for_each_task() should take tasklist_lock for READ. Note that for_each_task() uses init_task to mark the beginning (and end) of the list - this is safe because the idle task (pid 0) never exits.

The modifiers of the process hashtable and/or the process table links, notably fork(), exit() and ptrace(), must take tasklist_lock for WRITE. What is more interesting is that writers must also disable interrupts on the local CPU. The reason is not trivial: the send_sigio() function walks the task list and thus takes tasklist_lock for READ, and it is called from kill_fasync() in interrupt context. This is why writers must disable interrupts, while readers don't need to.

Now that we understand how the task_struct structures are linked together, let us examine the members of task_struct. They loosely correspond to the members of the UNIX 'proc' and 'user' structures combined.

Other versions of UNIX separate the task state information into one part which is kept memory-resident at all times (called the 'proc structure', which includes process state, scheduling information etc.) and another part which is only needed when the process is running (called the 'u area', which includes the file descriptor table, disk quota information etc.). The only reason for such an ugly design was that memory was a very scarce resource. Modern operating systems (well, only Linux at the moment, but others, e.g. FreeBSD, seem to be improving in this direction) do not need such separation and therefore maintain process state in a kernel memory-resident data structure at all times.

The task_struct structure is declared in include/linux/sched.h and is currently 1680 bytes in size.

The state field is declared as:



--------------------------------------------------------------------------------

volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */

#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define TASK_ZOMBIE 4
#define TASK_STOPPED 8
#define TASK_EXCLUSIVE 32


--------------------------------------------------------------------------------

Why is TASK_EXCLUSIVE defined as 32 and not 16? Because 16 was used up by TASK_SWAPPING, and I forgot to shift TASK_EXCLUSIVE up when I removed all references to TASK_SWAPPING (sometime in 2.3.x).

The volatile in the p->state declaration means it can be modified asynchronously (from an interrupt handler):


TASK_RUNNING: means the task is 'supposed to be' on the run queue. The reason it may not yet be on the run queue is that marking a task as TASK_RUNNING and placing it on the run queue is not atomic. To look at the run queue, you need to hold the runqueue_lock read-write spinlock for READ. If you do so, you will see that every task on the run queue is in TASK_RUNNING state. However, the converse is not true, for the reason explained above. Similarly, drivers (or rather, the process context they run in) can mark themselves as TASK_INTERRUPTIBLE (or TASK_UNINTERRUPTIBLE) and then call schedule(), which will remove them from the run queue (unless there is a pending signal, in which case they stay on the run queue).
TASK_INTERRUPTIBLE: means the task is sleeping but can be woken up by a signal or by the expiry of a timer.
TASK_UNINTERRUPTIBLE: same as TASK_INTERRUPTIBLE, except it cannot be woken up.
TASK_ZOMBIE: the task has terminated but has not had its status collected (by wait()) by its parent (natural or adoptive).
TASK_STOPPED: the task was stopped, either due to job control signals or due to ptrace(2).
TASK_EXCLUSIVE: this is not a separate state; it can be OR-ed with either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. It means that when this task sleeps on a wait queue with many other tasks, it will be woken up alone instead of causing a 'thundering herd' problem by waking up all the sleepers on the queue.
Task flags contain information about process states which are not mutually exclusive:


--------------------------------------------------------------------------------

unsigned long flags; /* per process flags, defined below */
/*
* Per process flags
*/
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
/* Not implemented yet, only for 486*/
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */
#define PF_DUMPCORE 0x00000200 /* dumped core */
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */


--------------------------------------------------------------------------------

The fields p->has_cpu, p->processor, p->counter, p->priority, p->policy and p->rt_priority are related to the scheduler and will be looked at later.

The fields p->mm and p->active_mm point respectively to the process' address space, described by the mm_struct structure, and to the active address space if the process doesn't have a real one (e.g. kernel threads). This helps minimise TLB flushes when switching address spaces as tasks are scheduled out. So, if we are scheduling in a kernel thread (which has no p->mm), then its next->active_mm will be set to the prev->active_mm of the task that was scheduled out, which will be the same as prev->mm if prev->mm != NULL. The address space can be shared between threads if the CLONE_VM flag is passed to the clone(2) system call, or by means of the vfork(2) system call.

The fields p->exec_domain and p->personality relate to the personality of the task, i.e. to the way certain system calls behave in order to emulate the 'personality' of foreign flavours of UNIX.

The field p->fs contains filesystem information, which under Linux means three pieces of information:


the root directory's dentry and mountpoint,
the alternate root dentry and mountpoint,
the current working directory's dentry and mountpoint.
This structure also includes a reference count, because it can be shared between cloned tasks when the CLONE_FS flag is passed to the clone(2) system call.

The field p->files contains the file descriptor table. This too can be shared between tasks, provided CLONE_FILES is specified with the clone(2) system call.

The field p->sig contains signal handlers and can be shared between cloned tasks by means of CLONE_SIGHAND.


2.2 Creation and Termination of Tasks and Kernel Threads

Different books on operating systems define a 'process' in different ways, starting from 'instance of a program in execution' and ending with 'that which is produced by the clone(2) or fork(2) system calls'. Under Linux, there are three kinds of processes:


the idle thread(s),
kernel threads,
user tasks.
The idle thread is created at compile time for the first CPU; it is then 'manufactured' for each additional CPU by means of the arch-specific fork_by_hand() in arch/i386/kernel/smpboot.c, which (on some architectures) unrolls the fork(2) system call by hand. Idle tasks share one init_task structure but have a private TSS structure, in the per-CPU array init_tss. Idle tasks all have pid = 0, and no other task can share this pid, i.e. use the CLONE_PID flag to clone(2).

Kernel threads are created using the kernel_thread() function, which invokes the clone(2) system call in kernel mode. Kernel threads usually have no user address space, i.e. p->mm = NULL, because they explicitly do exit_mm(), e.g. via the daemonize() function. Kernel threads can always access the kernel address space directly. They are allocated pid numbers in the low range. Running at the processor's ring 0 implies that kernel threads enjoy all I/O privileges and cannot be preempted by the scheduler.

User tasks are created by means of the clone(2) or fork(2) system calls, both of which internally invoke kernel/fork.c:do_fork().

Let us understand what happens when a user process makes a fork(2) system call. Although fork(2) is architecture-dependent, due to the different ways of passing the user stack and registers, the underlying function that actually does the job, do_fork(), is portable. It is located in kernel/fork.c.

It does the following steps:


The local variable retval is set to -ENOMEM, as this is the value errno should be set to if fork(2) fails to allocate a new task structure.
If CLONE_PID is set in clone_flags, return an error (-EPERM) unless the caller is the idle thread (during boot only). So, ordinary user threads cannot pass CLONE_PID to clone(2) and expect it to succeed. For fork(2) this is irrelevant, as clone_flags is set to SIGCHLD - it only matters when do_fork() is invoked from sys_clone(), which passes clone_flags from the value requested from userspace.
current->vfork_sem is initialised (it is later cleared in the child). This is used by sys_vfork() (the vfork(2) system call, corresponding to clone_flags = CLONE_VFORK|CLONE_VM|SIGCHLD) to make the parent sleep until the child does mm_release(), for example as a result of exec()ing another program or exit(2)-ing.
A new task structure is allocated using the arch-dependent alloc_task_struct() macro. On x86 it simply obtains a free page at GFP_KERNEL priority. This is the first reason why the fork(2) system call may sleep. If this allocation fails, we return -ENOMEM.
All the values from the current process' task structure are copied into the new one, using the structure assignment *p = *current. Perhaps this should be replaced by a memset? Later on, the fields that should not be inherited by the child are set to the correct values.
The big kernel lock is taken, as the rest of the code would otherwise be non-reentrant.
If the parent has user resources (a concept of UID - Linux is flexible enough to make it a question rather than a fact), then verify whether the user has exceeded the RLIMIT_NPROC soft limit - if so, fail with -EAGAIN; if not, increment the count of processes with the given uid, p->user->count.
If the system-wide number of tasks exceeds the value of the tunable max_threads, fail with -EAGAIN.
If the binary being executed belongs to a modularised execution domain, increment the corresponding module's reference count.
If the binary being executed belongs to a modularised binary format, increment the corresponding module's reference count.
The child is marked as 'has not execed' (p->did_exec = 0).
The child is marked as 'not swappable' (p->swappable = 0).
The child is put into 'uninterruptible sleep' state, i.e. p->state = TASK_UNINTERRUPTIBLE (TODO: why is this done? I think it is not needed - get rid of it; Linus confirms it is not needed).
The child's p->flags are set according to the value of clone_flags; for plain fork(2) this will be p->flags = PF_FORKNOEXEC.
The child's pid p->pid is set using the fast algorithm in kernel/fork.c:get_pid() (TODO: the lastpid_lock spinlock is rendered redundant, since get_pid() is always called under the big kernel lock from do_fork(); also, the flags argument of get_pid() could be removed - patch sent to Alan on 2000-06-20, and followed up later).
The rest of the code in do_fork() initialises the remainder of the child's task structure. At the very end, the child's task structure is hashed into the pidhash hashtable and the child is woken up (TODO: wake_up_process(p) sets p->state = TASK_RUNNING and adds the process to the run queue, therefore we probably didn't need to set p->state to TASK_RUNNING earlier on in do_fork()). The interesting part is setting p->exit_signal to clone_flags & CSIGNAL, which for fork(2) means just SIGCHLD, and setting p->pdeath_signal to 0. The pdeath_signal is used when a process 'forgets' its original parent (because it died); it can be set/got by means of the PR_GET/SET_PDEATHSIG commands of the prctl(2) system call (you might argue that the way the value of pdeath_signal is returned via a userspace pointer argument in prctl(2) is rather silly - mea culpa; after Andries Brouwer updated the manpage it was too late to fix it ;)
Thus tasks are created. There are several ways for tasks to terminate:


by making the exit(2) system call;
by being delivered a signal whose default disposition is to die;
by being forced to die under certain exceptional conditions;
by calling bdflush(2) with func == 1 (this is Linux-specific, kept for compatibility with old distributions that still had the 'update' line in /etc/inittab - nowadays the work of update is done by the kernel thread kupdate).
Functions implementing system calls under Linux are prefixed with sys_, but they are usually concerned only with argument checking or with arch-specific ways of passing some information; the actual work is done by do_ functions. So it is with sys_exit(), which calls do_exit() to do the work. However, other parts of the kernel sometimes invoke sys_exit() when they should really call do_exit().

The function do_exit() is found in kernel/exit.c. The points to note about do_exit() are:


It uses the global kernel lock (locks but does not unlock).
It calls schedule() at the end, which never returns.
It sets the task state to TASK_ZOMBIE.
It notifies any child with current->pdeath_signal, if it is not 0.
It notifies the parent with current->exit_signal, which is usually equal to SIGCHLD.
It releases resources allocated by fork, closes open files etc.
On architectures that use lazy FPU switching (ia64, mips, mips64) (TODO: remove the 'flags' argument of sparc, sparc64), it does whatever the hardware requires to pass the FPU ownership (if owned by current) to 'none'.

2.3 Linux Scheduler

The job of a scheduler is to arbitrate access to the current CPU between multiple processes. The scheduler is implemented in the 'main kernel file' kernel/sched.c. The corresponding header file include/linux/sched.h is included (either explicitly or indirectly) by virtually every kernel source file.

The fields of task structure relevant to scheduler include:


p->need_resched: this field is set if schedule() should be invoked at the 'next opportunity'.
p->counter: number of clock ticks left to run in this scheduling slice, decremented by a timer. When this field becomes lower than or equal to zero, it is reset to 0 and p->need_resched is set. This is also sometimes called 'dynamic priority' of a process because it can change by itself.
p->priority: the process' static priority, only changed through well-known system calls like nice(2), POSIX.1b sched_setparam(2) or 4.4BSD/SVR4 setpriority(2).
p->rt_priority: realtime priority
p->policy: the scheduling policy, specifies which scheduling class the task belongs to. Tasks can change their scheduling class using the sched_setscheduler(2) system call. The valid values are SCHED_OTHER (traditional UNIX process), SCHED_FIFO (POSIX.1b FIFO realtime process) and SCHED_RR (POSIX round-robin realtime process). One can also OR SCHED_YIELD to any of these values to signify that the process decided to yield the CPU, for example by calling sched_yield(2) system call. A FIFO realtime process will run until either a) it blocks on I/O, b) it explicitly yields the CPU or c) it is preempted by another realtime process with a higher p->rt_priority value. SCHED_RR is the same as SCHED_FIFO, except that when its timeslice expires it goes back to the end of the runqueue.
The scheduler's algorithm is simple, despite the great apparent complexity of the schedule() function. The function is complex because it implements three scheduling algorithms in one and also because of the subtle SMP-specifics.

The apparently 'useless' gotos in schedule() are there for a purpose - to generate the best optimised (for i386) code. Also, note that scheduler (like most of the kernel) was completely rewritten for 2.4, therefore the discussion below does not apply to 2.2 or earlier kernels.

Let us look at the function in detail:


If current->active_mm == NULL then something is wrong. Current process, even a kernel thread (current->mm == NULL) must have a valid p->active_mm at all times.
If there is something to do on the tq_scheduler task queue, process it now. Task queues provide a kernel mechanism to schedule execution of functions at a later time. We shall look at it in detail elsewhere.
Initialise local variables prev and this_cpu to current task and current CPU respectively.
Check if schedule() was invoked from interrupt handler (due to a bug) and panic if so.
Release the global kernel lock.
If there is some work to do via softirq mechanism, do it now.
Initialise local pointer struct schedule_data *sched_data to point to per-CPU (cacheline-aligned to prevent cacheline ping-pong) scheduling data area, which contains the TSC value of last_schedule and the pointer to last scheduled task structure (TODO: sched_data is used on SMP only but why does init_idle() initialises it on UP as well?).
runqueue_lock spinlock is taken. Note that we use spin_lock_irq() because in schedule() we guarantee that interrupts are enabled. Therefore, when we unlock runqueue_lock, we can just re-enable them instead of saving/restoring eflags (spin_lock_irqsave/restore variant).
task state machine: if the task is in TASK_RUNNING state, it is left alone; if it is in TASK_INTERRUPTIBLE state and a signal is pending, it is moved into TASK_RUNNING state. In all other cases, it is deleted from the runqueue.
next (best candidate to be scheduled) is set to the idle task of this cpu. However, the goodness of this candidate is set to a very low value (-1000), in hope that there is someone better than that.
If the prev (current) task is in TASK_RUNNING state, then the current goodness is set to its goodness and it is marked as a better candidate to be scheduled than the idle task.
Now the runqueue is examined and a goodness of each process that can be scheduled on this cpu is compared with current value; the process with highest goodness wins. Now the concept of "can be scheduled on this cpu" must be clarified: on UP, every process on the runqueue is eligible to be scheduled; on SMP, only process not already running on another cpu is eligible to be scheduled on this cpu. The goodness is calculated by a function called goodness(), which treats realtime processes by making their goodness very high (1000 + p->rt_priority), this being greater than 1000 guarantees that no SCHED_OTHER process can win; so they only contend with other realtime processes that may have a greater p->rt_priority. The goodness function returns 0 if the process' time slice (p->counter) is over. For non-realtime processes, the initial value of goodness is set to p->counter - this way, the process is less likely to get CPU if it already had it for a while, i.e. interactive processes are favoured more than CPU bound number crunchers. The arch-specific constant PROC_CHANGE_PENALTY attempts to implement "cpu affinity" (i.e. give advantage to a process on the same CPU). It also gives a slight advantage to processes with mm pointing to current active_mm or to processes with no (user) address space, i.e. kernel threads.
If the current value of goodness is 0 then the entire list of processes (not just the ones on the runqueue!) is examined and their dynamic priorities are recalculated using a simple algorithm:

--------------------------------------------------------------------------------


recalculate:
{
struct task_struct *p;
spin_unlock_irq(&runqueue_lock);
read_lock(&tasklist_lock);
for_each_task(p)
p->counter = (p->counter >> 1) + p->priority;
read_unlock(&tasklist_lock);
spin_lock_irq(&runqueue_lock);
}


--------------------------------------------------------------------------------

Note that we drop the runqueue_lock before we recalculate. The reason is that we go through entire set of processes; this can take a long time, during which the schedule() could be called on another CPU and select a process with goodness good enough for that CPU, whilst we on this CPU were forced to recalculate. Ok, admittedly this is somewhat inconsistent because while we (on this CPU) are selecting a process with the best goodness, schedule() running on another CPU could be recalculating dynamic priorities.
From this point on it is certain that next points to the task to be scheduled, so we initialise next->has_cpu to 1 and next->processor to this_cpu. The runqueue_lock can now be unlocked.
If we are switching back to the same task (next == prev) then we can simply reacquire the global kernel lock and return, i.e. skip all the hardware-level (registers, stack etc.) and VM-related (switch page directory, recalculate active_mm etc.) stuff.
The macro switch_to() is architecture specific. On i386, it is concerned with a) FPU handling, b) LDT handling, c) reloading segment registers, d) TSS handling and e) reloading debug registers.

2.4 Linux linked list implementation

Before we go on to examine implementation of wait queues, we must acquaint ourselves with the Linux standard doubly-linked list implementation. Wait queues (as well as everything else in Linux) make heavy use of them and they are called in jargon "list.h implementation" because the most relevant file is include/linux/list.h.

The fundamental data structure here is struct list_head:



--------------------------------------------------------------------------------

struct list_head {
struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
	struct list_head name = LIST_HEAD_INIT(name)

#define INIT_LIST_HEAD(ptr) do { \
	(ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))

#define list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)


--------------------------------------------------------------------------------

The first three macros are for initialising an empty list by pointing both next and prev pointers to itself. It is obvious from C syntactical restrictions which ones should be used where - for example, LIST_HEAD_INIT() can be used for structure's element initialisation in declaration, the second can be used for static variable initialising declarations and the third can be used inside a function.

The macro list_entry() gives access to individual list element, for example (from fs/file_table.c:fs_may_remount_ro()):



--------------------------------------------------------------------------------

struct super_block {
...
struct list_head s_files;
...
} *sb = &some_super_block;

struct file {
...
struct list_head f_list;
...
} *file;

struct list_head *p;

for (p = sb->s_files.next; p != &sb->s_files; p = p->next) {
struct file *file = list_entry(p, struct file, f_list);
/* do something to 'file' */
}


--------------------------------------------------------------------------------

A good example of the use of list_for_each() macro is in the scheduler where we walk the runqueue looking for the process with highest goodness:



--------------------------------------------------------------------------------

static LIST_HEAD(runqueue_head);
struct list_head *tmp;
struct task_struct *p;

list_for_each(tmp, &runqueue_head) {
p = list_entry(tmp, struct task_struct, run_list);
if (can_schedule(p)) {
int weight = goodness(p, this_cpu, prev->active_mm);
if (weight > c)
c = weight, next = p;
}
}


--------------------------------------------------------------------------------

Here, p->run_list is declared as struct list_head run_list inside task_struct structure and serves as anchor to the list. Removing an element from the list and adding (to head or tail of the list) is done by list_del()/list_add()/list_add_tail() macros. The examples below are adding and removing a task from runqueue:



--------------------------------------------------------------------------------

static inline void del_from_runqueue(struct task_struct * p)
{
nr_running--;
list_del(&p->run_list);
p->run_list.next = NULL;
}

static inline void add_to_runqueue(struct task_struct * p)
{
list_add(&p->run_list, &runqueue_head);
nr_running++;
}

static inline void move_last_runqueue(struct task_struct * p)
{
list_del(&p->run_list);
list_add_tail(&p->run_list, &runqueue_head);
}

static inline void move_first_runqueue(struct task_struct * p)
{
list_del(&p->run_list);
list_add(&p->run_list, &runqueue_head);
}


--------------------------------------------------------------------------------


2.5 Wait Queues

When a process requests the kernel to do something which is currently impossible but that may become possible later, the process is put to sleep and is woken up when the request is more likely to be satisfied. One of the kernel mechanisms used for this is called a 'wait queue'.

Linux implementation allows wake-on semantics using TASK_EXCLUSIVE flag. With waitqueues, you can either use a well-known queue and then simply sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout, or you can define your own waitqueue and use add/remove_wait_queue to add and remove yourself from it and wake_up/wake_up_interruptible to wake up when needed.

An example of the first usage of waitqueues is interaction between the page allocator (in mm/page_alloc.c:__alloc_pages()) and the kswapd kernel daemon (in mm/vmscan.c:kswap()), by means of wait queue kswapd_wait, declared in mm/vmscan.c; the kswapd daemon sleeps on this queue, and it is woken up whenever the page allocator needs to free up some pages.

An example of autonomous waitqueue usage is interaction between user process requesting data via read(2) system call and kernel running in the interrupt context to supply the data. An interrupt handler might look like (simplified drivers/char/rtc_interrupt()):



--------------------------------------------------------------------------------

static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);

void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
spin_lock(&rtc_lock);
rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS);
spin_unlock(&rtc_lock);
wake_up_interruptible(&rtc_wait);
}


--------------------------------------------------------------------------------

So, the interrupt handler obtains the data by reading from some device-specific I/O port (CMOS_READ() macro turns into a couple outb/inb) and then wakes up whoever is sleeping on the rtc_wait wait queue.

Now, the read(2) system call could be implemented as:



--------------------------------------------------------------------------------

ssize_t rtc_read(struct file *file, char *buf, size_t count, loff_t *ppos)
{
DECLARE_WAITQUEUE(wait, current);
unsigned long data;
ssize_t retval;

add_wait_queue(&rtc_wait, &wait);
current->state = TASK_INTERRUPTIBLE;
do {
spin_lock_irq(&rtc_lock);
data = rtc_irq_data;
rtc_irq_data = 0;
spin_unlock_irq(&rtc_lock);

if (data != 0)
break;

if (file->f_flags & O_NONBLOCK) {
retval = -EAGAIN;
goto out;
}
if (signal_pending(current)) {
retval = -ERESTARTSYS;
goto out;
}
schedule();
} while(1);
retval = put_user(data, (unsigned long *)buf);
if (!retval)
retval = sizeof(unsigned long);

out:
current->state = TASK_RUNNING;
remove_wait_queue(&rtc_wait, &wait);
return retval;
}


--------------------------------------------------------------------------------

What happens in rtc_read() is this:


We declare a wait queue element pointing to current process context.
We add this element to the rtc_wait waitqueue.
We mark current context as TASK_INTERRUPTIBLE which means it will not be rescheduled after the next time it sleeps.
We check if there is no data available; if there is we break out, copy data to user buffer, mark ourselves as TASK_RUNNING, remove ourselves from the wait queue and return
If there is no data yet, we check whether the user specified non-blocking I/O and if so we fail with EAGAIN (which is the same as EWOULDBLOCK)
We also check if a signal is pending and if so inform the "higher layers" to restart the system call if necessary. By "if necessary" I meant the details of signal disposition as specified in sigaction(2) system call.
Then we "switch out", i.e. fall asleep, until woken up by the interrupt handler. If we didn't mark ourselves as TASK_INTERRUPTIBLE then the scheduler could schedule us sooner than when the data is available, thus causing unneeded processing.
It is also worth pointing out that, using wait queues, it is rather easy to implement the poll(2) system call:



--------------------------------------------------------------------------------

static unsigned int rtc_poll(struct file *file, poll_table *wait)
{
unsigned long l;

poll_wait(file, &rtc_wait, wait);

spin_lock_irq(&rtc_lock);
l = rtc_irq_data;
spin_unlock_irq(&rtc_lock);

if (l != 0)
return POLLIN | POLLRDNORM;
return 0;
}


--------------------------------------------------------------------------------

All the work is done by the device-independent function poll_wait() which does the necessary waitqueue manipulations; all we need to do is point it to the waitqueue which is woken up by our device-specific interrupt handler.


2.6 Kernel Timers

Now let us turn our attention to kernel timers. Kernel timers are used to dispatch execution of a particular function (called 'timer handler') at a specified time in the future. The main data structure is struct timer_list declared in include/linux/timer.h:



--------------------------------------------------------------------------------

struct timer_list {
struct list_head list;
unsigned long expires;
unsigned long data;
void (*function)(unsigned long);
volatile int running;
};


--------------------------------------------------------------------------------

The list field is for linking into the internal list, protected by the timerlist_lock spinlock. The expires field is the value of jiffies when the function handler should be invoked with data passed as a parameter. The running field is used on SMP to test if the timer handler is currently running on another CPU.

The functions add_timer() and del_timer() add and remove a given timer to the list. When a timer expires, it is removed automatically. Before a timer is used, it MUST be initialised by means of init_timer() function. And before it is added, the fields function and expires must be set.

