<HTML>
<HEAD>
<TITLE>State Threads Library Programming Notes</TITLE>
</HEAD>
<BODY BGCOLOR=#FFFFFF>
<H2>Programming Notes</H2>
<P>
<B>
<UL>
<LI><A HREF=#porting>Porting</A></LI>
<LI><A HREF=#signals>Signals</A></LI>
<LI><A HREF=#intra>Intra-Process Synchronization</A></LI>
<LI><A HREF=#inter>Inter-Process Synchronization</A></LI>
<LI><A HREF=#nonnet>Non-Network I/O</A></LI>
<LI><A HREF=#timeouts>Timeouts</A></LI>
</UL>
</B>
<P>
<HR>
<P>
<A NAME="porting"></A>
<H3>Porting</H3>
The State Threads library uses OS concepts that are available in some
form on most UNIX platforms, making the library very portable across
many flavors of UNIX. However, several parts of the library rely on
platform-specific features:
<P>
<UL>
<LI><I>Thread context initialization</I>: Two fields of the
<TT>jmp_buf</TT>
data structure (the program counter and the stack pointer) have to be
set manually in the thread creation routine. The <TT>jmp_buf</TT> data
structure is defined in the <TT>setjmp.h</TT> header file and differs from
platform to platform. Usually the program counter is a structure member
with <TT>PC</TT> in its name and the stack pointer is a structure member
with <TT>SP</TT> in its name. One can also look at
<A HREF="http://www.mozilla.org/source.html">Netscape's NSPR library source</A>,
which already has this code for many UNIX-like platforms
(<TT>mozilla/nsprpub/pr/include/md/*.h</TT> files).
<P>
Note that on some BSD-derived platforms the <TT>_setjmp(3)/_longjmp(3)</TT>
calls should be used instead of <TT>setjmp(3)/longjmp(3)</TT>, that is,
the calls that manipulate only the stack and registers and do <I>not</I>
save and restore the process's signal mask.
<P>
Starting with glibc 2.4 on Linux, <TT>setjmp(3)/longjmp(3)</TT> enforce the
opacity of the <TT>jmp_buf</TT> data structure, so its fields can no longer
be accessed directly (unless the special environment variable
<TT>LD_POINTER_GUARD</TT> is set before the application starts). To avoid
depending on a custom environment, the State Threads library provides
<TT>setjmp/longjmp</TT> replacement functions for all Intel CPU
architectures. Other CPU architectures can also be supported easily (the
<TT>setjmp/longjmp</TT> source code is widely available for many CPU
architectures).</LI>
<P>
<LI><I>High resolution time function</I>: Some platforms (IRIX, Solaris)
provide a high resolution time function based on a free-running hardware
counter. This function returns the time counted since some arbitrary
moment in the past (usually machine power-up time). It is not correlated in
any way with the time of day, and thus is not subject to resetting,
drifting, etc. This type of time is ideal for tasks that require cheap,
accurate interval timing. If such a function is not available on a
particular platform, the <TT>gettimeofday(3)</TT> function can be used
(though on some platforms it involves a system call).</LI>
<P>
<LI><I>The stack growth direction</I>: The library needs to know whether the
stack grows toward lower (down) or higher (up) memory addresses.
One can write a simple test program that detects the stack growth direction
on a particular platform.</LI>
<P>
<LI><I>Non-blocking attribute inheritance</I>: On some platforms (e.g., IRIX)
the socket created as a result of the <TT>accept(2)</TT> call inherits the
non-blocking attribute of the listening socket. Consult the manual
pages or write a simple test program to see whether this applies to a
specific platform.</LI>
<P>
<LI><I>Anonymous memory mapping</I>: The library allocates memory segments
for thread stacks by doing anonymous memory mapping (<TT>mmap(2)</TT>). This
mapping is somewhat different on SVR4-derived and BSD 4.3-derived platforms.
<P>
The memory mapping can be avoided altogether by using <TT>malloc(3)</TT> for
stack allocation. In this case the <TT>MALLOC_STACK</TT> macro should be
defined.</LI>
</UL>
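<P>Where no platform-specific counter exists, an interval timer could be built along the following lines. This sketch swaps in POSIX <TT>clock_gettime()</TT> with <TT>CLOCK_MONOTONIC</TT>, which these notes do not mention, as a portable stand-in for the IRIX and Solaris counters, and falls back to <TT>gettimeofday(3)</TT> as suggested above. The function name <TT>interval_time_usec()</TT> is hypothetical, not part of the library.

```c
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: returns a time stamp in microseconds for measuring
   intervals. Only differences between two returned values are meaningful. */
long long interval_time_usec(void)
{
#ifdef CLOCK_MONOTONIC
    /* Monotonic clock: unaffected by changes to the time of day. */
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
        return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
#endif
    /* Fallback: time of day, which can jump if the clock is reset. */
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}
```

<P>Subtracting two such values gives an elapsed time that, on the monotonic path, is never negative even if the system clock is adjusted in between.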
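<P>As an illustration of the stack growth test program mentioned above, here is a minimal C sketch (not taken from the library) that compares the address of a local variable in a caller's frame with one in a callee's frame. The function names are hypothetical, and the sketch assumes a conventional flat address space in which converting pointers to <TT>uintptr_t</TT> preserves their ordering.

```c
#include <stdint.h>

/* Callee: its local variable lives in a frame pushed after the caller's. */
static int compare_with_callee_frame(uintptr_t outer)
{
    volatile char inner_local = 0;
    uintptr_t inner = (uintptr_t)&inner_local;
    return inner < outer ? -1 : 1;
}

/* Calling through a volatile function pointer keeps the compiler from
   inlining the callee into the caller's frame. */
static int (*volatile probe)(uintptr_t) = compare_with_callee_frame;

/* Returns -1 if the stack grows toward lower addresses (down) and
   +1 if it grows toward higher addresses (up). */
int stack_direction(void)
{
    volatile char outer_local = 0;   /* lives in the caller's frame */
    return probe((uintptr_t)&outer_local);
}
```

<P>A build script could run such a program once per platform and record the result as a feature test macro in <TT>md.h</TT>.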
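<P>Non-blocking attribute inheritance can be probed the same way. The following sketch (hypothetical function name; it assumes a working loopback interface) makes a listening TCP socket non-blocking, connects to it, and checks whether the accepted socket has <TT>O_NONBLOCK</TT> set. On Linux, <TT>accept(2)</TT> is documented not to inherit file status flags, so the sketch should report 0 there.

```c
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if the socket returned by accept(2) inherits O_NONBLOCK from a
   non-blocking listening socket, 0 if it does not, and -1 on error. */
int accept_inherits_nonblock(void)
{
    int lfd = -1, cfd = -1, afd = -1, flags = 0, result = -1;
    struct sockaddr_in sin;
    socklen_t len = sizeof sin;
    struct pollfd pfd;

    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = 0;                      /* kernel picks a free port */

    /* Listening socket, switched to non-blocking mode. */
    if ((lfd = socket(AF_INET, SOCK_STREAM, 0)) < 0 ||
        bind(lfd, (struct sockaddr *)&sin, sizeof sin) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&sin, &len) < 0 ||
        (flags = fcntl(lfd, F_GETFL)) < 0 ||
        fcntl(lfd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto out;

    /* Connect to it; the connection completes in the listen backlog. */
    if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) < 0 ||
        connect(cfd, (struct sockaddr *)&sin, sizeof sin) < 0)
        goto out;

    /* Wait (up to 1 s) until the connection can be accepted. */
    pfd.fd = lfd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 1000) <= 0 || (afd = accept(lfd, NULL, NULL)) < 0)
        goto out;

    /* Inspect the accepted socket's file status flags. */
    if ((flags = fcntl(afd, F_GETFL)) >= 0)
        result = (flags & O_NONBLOCK) ? 1 : 0;
out:
    if (afd >= 0)
        close(afd);
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return result;
}
```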
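<P>The difference between the two mapping styles, and the <TT>MALLOC_STACK</TT> alternative, can be sketched as follows. This is an illustration rather than the library's code: <TT>stack_alloc()</TT> and <TT>stack_free()</TT> are hypothetical names, and the sketch assumes that the BSD-style <TT>MAP_ANON</TT>/<TT>MAP_ANONYMOUS</TT> flag is defined wherever anonymous mapping is supported, falling back to the SVR4-style private mapping of <TT>/dev/zero</TT> otherwise.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* BSD-derived systems traditionally spell the flag MAP_ANON. */
#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON
#endif

/* Hypothetical helper: allocate 'size' bytes for a thread stack.
   Returns NULL on failure. */
void *stack_alloc(size_t size)
{
#if defined(MALLOC_STACK)
    return malloc(size);
#elif defined(MAP_ANONYMOUS)
    /* BSD-style anonymous mapping: no file descriptor needed. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    /* SVR4-style: a private mapping of /dev/zero yields zero-filled pages. */
    void *p;
    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
#endif
}

/* Release a stack obtained from stack_alloc(). */
void stack_free(void *stack, size_t size)
{
#if defined(MALLOC_STACK)
    (void)size;
    free(stack);
#else
    munmap(stack, size);
#endif
}
```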
<P>
All machine-dependent feature test macros should be defined in the
<TT>md.h</TT> header file. The assembly code for the <TT>setjmp/longjmp</TT>
replacement functions for all CPU architectures should be placed in
the <TT>md.S</TT> file.
<P>
The current version of the library is ported to:
<UL>
<LI>IRIX 6.x (both 32 and 64 bit)</LI>
<LI>Linux (kernel 2.x and glibc 2.x) on x86, Alpha, MIPS and MIPSEL,
SPARC, ARM, PowerPC, 68k, HPPA, S390, IA-64, and Opteron (AMD-64)</LI>
<LI>Solaris 2.x (SunOS 5.x) on x86, AMD64, SPARC, and SPARC-64</LI>
<LI>AIX 4.x</LI>
<LI>HP-UX 11 (both 32 and 64 bit)</LI>
<LI>Tru64/OSF1</LI>
<LI>FreeBSD on x86, AMD64, and Alpha</LI>
<LI>OpenBSD on x86, AMD64, Alpha, and SPARC</LI>
<LI>NetBSD on x86, Alpha, SPARC, and VAX</LI>
<LI>MacOS X (Darwin) on PowerPC (32 bit) and Intel (both 32 and 64 bit) [universal]</LI>
<LI>Cygwin</LI>
</UL>
<P>
<A NAME="signals"></A>
<H3>Signals</H3>
Signal handling in an application using State Threads should be treated the
same way as in a classical UNIX process application. There is no such
thing as a per-thread signal mask: all threads share the same signal handlers,
and only async-signal-safe functions can be used in signal handlers.
However, signals can be processed synchronously by converting a signal
event into an I/O event: the signal catching function writes to a pipe,
and a dedicated signal handling thread reads from that pipe synchronously.
The following code demonstrates this technique (error handling is mostly
omitted for clarity):
<PRE>
#include &lt;errno.h&gt;
#include &lt;signal.h&gt;
#include &lt;unistd.h&gt;
#include &lt;st.h&gt;

/* Per-process pipe which is used as a signal queue. */
/* Up to PIPE_BUF/sizeof(int) signals can be queued up. */
int sig_pipe[2];

/* Signal catching function. */
/* Converts signal event to I/O event. */
void sig_catcher(int signo)
{
    int err;

    /* Save errno to restore it after the write() */
    err = errno;
    /* write() is reentrant/async-safe */
    write(sig_pipe[1], &signo, sizeof(int));
    errno = err;
}

/* Signal processing function. */
/* This is the "main" function of the signal processing thread. */
void *sig_process(void *arg)
{
    st_netfd_t nfd;
    int signo;

    nfd = st_netfd_open(sig_pipe[0]);

    for ( ; ; ) {
        /* Read the next signal from the pipe */
        if (st_read(nfd, &signo, sizeof(int),
                    ST_UTIME_NO_TIMEOUT) != (ssize_t)sizeof(int))
            break;  /* read error: stop processing signals */

        /* Process signal synchronously */
        switch (signo) {
        case SIGHUP:
            /* do something here - reread config files, etc. */
            break;
        case SIGTERM:
            /* do something here - cleanup, etc. */
            break;
        /*  .
            .
            Other signals
            .
            .
        */
        }
    }

    return NULL;
}

int main(int argc, char *argv[])
{
    struct sigaction sa;
    .
    .
    .
    /* Create signal pipe */
    pipe(sig_pipe);

    /* Create signal processing thread */
    st_thread_create(sig_process, NULL, 0, 0);

    /* Install sig_catcher() as the handler for SIGHUP and SIGTERM */
    sa.sa_handler = sig_catcher;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGHUP, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);
    .
    .
    .
}
</PRE>
<P>
Note that if multiple processes are used (see below), the signal pipe should
be initialized after the <TT>fork(2)</TT> call so that each process has its
own private pipe.
<P>
<A NAME="intra"></A>
<H3>Intra-Process Synchronization</H3>
Due to the event-driven nature of the library scheduler, a thread context
switch (process state change) can only happen in a well-known set of
library functions. This set includes functions in which a thread may
"block": I/O functions (<TT>st_read()</TT>, <TT>st_write()</TT>, etc.),
sleep functions (<TT>st_sleep()</TT>, etc.), and thread synchronization
functions (<TT>st_thread_join()</TT>, <TT>st_cond_wait()</TT>, etc.). As a
result, process-specific global data need not be protected by locks, since a
thread cannot be rescheduled while in a critical section (and only one thread
at a time can access the same memory location). By the same token,
non-thread-safe functions (in the traditional sense) can be used safely with
State Threads. The library's mutex facilities are practically useless
for a correctly written application (one with no blocking functions in
critical sections) and are provided mostly for completeness. This absence of
locking greatly simplifies application design and provides a foundation for
scalability.
<P>
<A NAME="inter"></A>
<H3>Inter-Process Synchronization</H3>
The State Threads library makes it possible to multiplex a large number
of simultaneous connections onto a much smaller number of separate
processes, where each process uses a many-to-one user-level threading
implementation (<B>N</B> of <B>M:1</B> mappings rather than one <B>M:N</B>
mapping used in native threading libraries on some platforms). This design
is key to the application's scalability. One can think of it as partitioning
the set of all threads into separate groups (processes), where
each group has a separate pool of resources (virtual address space, file
descriptors, etc.). An application designer has full control over how many
groups (processes) an application creates and what resources, if any,
are shared among different groups via standard UNIX inter-process
communication (IPC) facilities.
<P>
There are several reasons for creating multiple processes:
<P>
<UL>
<LI>To take advantage of multiple hardware entities (CPUs, disks, etc.)
available in the system (hardware parallelism).</LI>
<P>
<LI>To reduce the risk of losing a large number of user connections when one
of the processes crashes. For example, if <B>C</B> user connections (threads)
are evenly multiplexed onto <B>P</B> processes and one of the processes
crashes, only <B>C/P</B> connections (a <B>1/P</B> fraction of all
connections) will be lost.</LI>
<P>
<LI>To overcome per-process resource limitations imposed by the OS. For
example, if <TT>select(2)</TT> is used for event polling, the number of
simultaneous connections (threads) per process is
limited by the <TT>FD_SETSIZE</TT> parameter (see <TT>select(2)</TT>).
If <TT>FD_SETSIZE</TT> is equal to 1024 and each connection needs one file
descriptor, then an application should create at least 10 processes to
support 10,000 simultaneous connections.</LI>
</UL>
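<P>The arithmetic in the last two items can be written down as a pair of small helpers. The <TT>reserved</TT> parameter (descriptors a process needs for listening sockets, pipes, log files, and so on) is an assumption added for illustration; with <TT>reserved</TT> set to 0, the first helper reproduces the 10-process figure for 10,000 connections and an <TT>FD_SETSIZE</TT> of 1024. Both function names are hypothetical.

```c
/* Minimum number of processes for 'connections' simultaneous connections
   when each process can poll at most 'fds_per_process' descriptors and
   'reserved' of them are used for other purposes. Returns -1 if no
   descriptor is left for connections. */
long processes_needed(long connections, long fds_per_process, long reserved)
{
    long usable = fds_per_process - reserved;
    if (usable <= 0)
        return -1;
    return (connections + usable - 1) / usable;   /* ceiling division */
}

/* Connections lost when one of 'procs' processes crashes, assuming the
   connections are spread evenly (C/P). */
double connections_lost(long connections, long procs)
{
    return (double)connections / (double)procs;
}
```

<P>Reserving even a few descriptors per process can raise the count: with 25 reserved descriptors, 10,000 connections need 11 processes instead of 10.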
<P>
Ideally all user sessions are completely independent, so there is no need for
inter-process communication. It is always better to have several separate
smaller process-specific resources (e.g., data caches) than to have one large
resource shared (and modified) by all processes. Sometimes, however, a common
resource must be shared among different processes. In that case,
standard UNIX IPC facilities can be used. In addition, there is a way
to synchronize different processes so that only the thread accessing the
shared resource will be suspended (but not the entire process) if that resource
is unavailable. In the following code fragment a pipe is used as a counting
semaphore for inter-process synchronization:
<PRE>
#include &lt;limits.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;
#include &lt;st.h&gt;

#ifndef PIPE_BUF
#define PIPE_BUF 512  /* POSIX */
#endif

/* Semaphore data structure */
typedef struct ipc_sem {
    st_netfd_t rdfd;  /* read descriptor */
    st_netfd_t wrfd;  /* write descriptor */
} ipc_sem_t;

/* Create and initialize the semaphore. Should be called before fork(2). */
/* 'value' must be less than PIPE_BUF. */
/* If 'value' is 1, the semaphore works as a mutex. */
ipc_sem_t *ipc_sem_create(int value)
{
    ipc_sem_t *sem;
    int p[2];
    char b[PIPE_BUF] = { 0 };  /* token bytes; their contents do not matter */

    /* Error checking is omitted for clarity */
    sem = malloc(sizeof(ipc_sem_t));

    /* Create the pipe */
    pipe(p);
    sem->rdfd = st_netfd_open(p[0]);
    sem->wrfd = st_netfd_open(p[1]);

    /* Initialize the semaphore: put 'value' bytes into the pipe */
    write(p[1], b, value);

    return sem;
}

/* Try to decrement the "value" of the semaphore. */
/* If "value" is 0, the calling thread blocks on the semaphore. */
int ipc_sem_wait(ipc_sem_t *sem)
{
    char c;

    /* Read one byte from the pipe */
    if (st_read(sem->rdfd, &c, 1, ST_UTIME_NO_TIMEOUT) != 1)
        return -1;

    return 0;
}

/* Increment the "value" of the semaphore. */
int ipc_sem_post(ipc_sem_t *sem)
{
    char c = 0;

    if (st_write(sem->wrfd, &c, 1, ST_UTIME_NO_TIMEOUT) != 1)
        return -1;

    return 0;
}
</PRE>
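<P>For experimenting outside the library, the same counting-semaphore idea can be expressed with plain POSIX calls. In this sketch (all names hypothetical) the read end of the pipe is made non-blocking, as an <TT>st_netfd_t</TT> would be, and a blocking wait is emulated with <TT>poll(2)</TT>; because another process may take the byte first, the wait retries after every wakeup. Unlike <TT>ipc_sem_wait()</TT>, this blocking wait suspends the whole process, not just the calling thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <poll.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef PIPE_BUF
#define PIPE_BUF 512  /* POSIX minimum */
#endif

/* Hypothetical semaphore: the bytes buffered in the pipe are its tokens. */
typedef struct pipe_sem {
    int rd;   /* read end, non-blocking */
    int wr;   /* write end */
} pipe_sem_t;

/* Create the semaphore with 'value' tokens (0 <= value <= PIPE_BUF).
   Call before fork(2) so that all processes share the same pipe. */
int pipe_sem_init(pipe_sem_t *s, int value)
{
    char tokens[PIPE_BUF];
    int p[2], flags;

    if (value < 0 || value > PIPE_BUF || pipe(p) < 0)
        return -1;
    memset(tokens, 0, sizeof tokens);
    if ((flags = fcntl(p[0], F_GETFL)) < 0 ||
        fcntl(p[0], F_SETFL, flags | O_NONBLOCK) < 0 ||
        (value > 0 && write(p[1], tokens, (size_t)value) != (ssize_t)value)) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    s->rd = p[0];
    s->wr = p[1];
    return 0;
}

/* Take a token if one is available: 0 on success, 1 if the count is
   zero, -1 on error. */
int pipe_sem_trywait(pipe_sem_t *s)
{
    char c;
    ssize_t n = read(s->rd, &c, 1);
    if (n == 1)
        return 0;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 1;
    return -1;
}

/* Block until a token is taken; retries when another process wins. */
int pipe_sem_wait(pipe_sem_t *s)
{
    struct pollfd pfd;
    int r;

    while ((r = pipe_sem_trywait(s)) == 1) {
        pfd.fd = s->rd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
            return -1;
    }
    return r;
}

/* Return a token to the semaphore. */
int pipe_sem_post(pipe_sem_t *s)
{
    char c = 0;
    return write(s->wr, &c, 1) == 1 ? 0 : -1;
}
```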
<P>
Generally, the following steps should be followed when writing an application
using the State Threads library:
<P>
<OL>
<LI>Initialize the library (<TT>st_init()</TT>).</LI>
<P>
<LI>Create resources that will be shared among different processes:
create and bind listening sockets, create shared memory segments, IPC
channels, synchronization primitives, etc.</LI>
<P>
<LI>Create several processes (<TT>fork(2)</TT>). The parent process should
either exit or become a "watchdog" (e.g., it starts a new process when
an existing one crashes, does a cleanup upon application termination,
etc.).</LI>
<P>
<LI>In each child process, create a pool of threads
(<TT>st_thread_create()</TT>) to handle user connections.</LI>
</OL>
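<P>Step 3 can be sketched with plain <TT>fork(2)</TT> and <TT>waitpid(2)</TT>. In this illustration the parent acts as a watchdog that restarts any child that crashes (exits with a nonzero status or is killed by a signal). The worker body is a hypothetical stand-in that simulates one crash; steps 1, 2, and 4 (<TT>st_init()</TT>, creating shared resources, <TT>st_thread_create()</TT>) are only indicated in comments, and the <TT>max_restarts</TT> limit is an assumption that keeps the sketch from looping forever.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 64

/* Hypothetical per-process body. A real application would create its pool
   of threads here (step 4, st_thread_create()) and serve connections; this
   stand-in simulates a crash the first time worker 0 runs. */
static void worker(int id, int generation)
{
    if (id == 0 && generation == 0)
        _exit(1);                 /* simulated crash */
    _exit(0);                     /* clean shutdown */
}

static pid_t spawn(int id, int generation)
{
    pid_t pid = fork();
    if (pid == 0)
        worker(id, generation);   /* never returns */
    return pid;                   /* -1 if fork() failed */
}

/* Watchdog: start nprocs children and restart any child that crashes,
   until all children have exited normally or max_restarts is used up.
   Steps 1 and 2 (st_init(), listening sockets, IPC objects) would run
   before this call so that every child inherits them.
   Returns the number of restarts performed. */
int watchdog(int nprocs, int max_restarts)
{
    pid_t pids[MAX_PROCS];
    int gen[MAX_PROCS];
    int i, live = 0, restarts = 0;

    if (nprocs > MAX_PROCS)
        nprocs = MAX_PROCS;
    for (i = 0; i < nprocs; i++) {
        gen[i] = 0;
        pids[i] = spawn(i, 0);
        if (pids[i] > 0)
            live++;
    }
    while (live > 0) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);
        if (pid < 0)
            break;                /* ECHILD or another error */
        for (i = 0; i < nprocs; i++) {
            if (pids[i] != pid)
                continue;
            live--;
            pids[i] = -1;
            if ((!WIFEXITED(status) || WEXITSTATUS(status) != 0) &&
                restarts < max_restarts) {
                pids[i] = spawn(i, ++gen[i]);
                if (pids[i] > 0) {
                    live++;
                    restarts++;
                }
            }
            break;
        }
    }
    return restarts;
}
```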
<P>
<A NAME="nonnet"></A>
<H3>Non-Network I/O</H3>
The State Threads architecture uses non-blocking I/O on
<TT>st_netfd_t</TT> objects for concurrent processing of multiple user
connections. This architecture has a drawback: the entire process and
all its threads may block for the duration of a <I>disk</I> or other
non-network I/O operation, whether through State Threads I/O functions,
direct system calls, or standard I/O functions. (This applies
mostly to disk <I>reads</I>; disk <I>writes</I> are usually performed
asynchronously: data goes to the buffer cache to be written to disk
later.) Fortunately, disk I/O (unlike network I/O) usually takes a
finite and predictable amount of time, but this may not be true for
special devices or user input devices (including stdin). Nevertheless,
such I/O reduces the throughput of the system and increases response times.
There are several ways to design an application to overcome this
drawback:
<P>
<UL>
<LI>Create several identical main processes as described above (symmetric
architecture). This will improve CPU utilization and thus improve the
overall throughput of the system.</LI>
<P>
<LI>Create multiple "helper" processes in addition to the main process that
will handle blocking I/O operations (asymmetric architecture).
This approach was suggested for Web servers in a
<A HREF="http://www.cs.rice.edu/~vivek/flash99/">paper</A> by Peter
Druschel et al. In this architecture the main process communicates with
a helper process via an IPC channel (<TT>pipe(2)</TT>, <TT>socketpair(2)</TT>).
The main process instructs a helper to perform the potentially blocking
operation. Once the operation completes, the helper returns a
notification via IPC.</LI>
</UL>
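<P>A minimal sketch of the asymmetric design, using a <TT>socketpair(2)</TT> channel and <TT>stat(2)</TT> as the potentially blocking operation. The message layout and all function names are hypothetical. Here the main process uses plain blocking <TT>read(2)</TT>/<TT>write(2)</TT> on its end of the channel; in a State Threads process that end would instead be wrapped with <TT>st_netfd_open()</TT> so that only the requesting thread waits.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical fixed-size request and reply messages. */
struct helper_req { char path[256]; };
struct helper_rep { long long size; int err; };

/* Transfer exactly len bytes; return 0 on success, -1 on EOF or error. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

static int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Helper process: performs the potentially blocking stat(2) on behalf of
   the main process; exits when the main process closes the channel. */
static void helper_loop(int fd)
{
    struct helper_req rq;
    while (read_full(fd, &rq, sizeof rq) == 0) {
        struct stat sb;
        struct helper_rep rp = { 0, 0 };
        if (stat(rq.path, &sb) == 0)
            rp.size = (long long)sb.st_size;
        else
            rp.err = 1;
        if (write_full(fd, &rp, sizeof rp) < 0)
            break;
    }
    _exit(0);
}

/* Fork a helper; returns the main process's end of the channel or -1. */
int start_helper(pid_t *pid_out)
{
    int sv[2];
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if ((pid = fork()) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        close(sv[0]);
        helper_loop(sv[1]);       /* never returns */
    }
    close(sv[1]);
    *pid_out = pid;
    return sv[0];
}

/* Ask the helper for a file's size; returns -1 on any failure. */
long long helper_file_size(int fd, const char *path)
{
    struct helper_req rq;
    struct helper_rep rp;

    memset(&rq, 0, sizeof rq);
    strncpy(rq.path, path, sizeof rq.path - 1);
    if (write_full(fd, &rq, sizeof rq) < 0 || read_full(fd, &rp, sizeof rp) < 0)
        return -1;
    return rp.err ? -1 : rp.size;
}
```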
<P>
<A NAME="timeouts"></A>
<H3>Timeouts</H3>
The <TT>timeout</TT> parameter to <TT>st_cond_timedwait()</TT> and the
I/O functions, and the arguments to <TT>st_sleep()</TT> and
<TT>st_usleep()</TT>, specify a maximum time to wait <I>since the last
context switch</I>, not since the beginning of the function call.
<P>The State Threads library's time resolution is actually the time interval
between context switches. That time interval may be large in some
situations, for example, when a single thread does a lot of work
continuously. Note that a steady, uninterrupted stream of network I/O
qualifies for this description; a context switch occurs only when a
thread blocks.
<P>If a specified I/O timeout is less than the time interval between
context switches, the function may return with a timeout error before
that amount of time has elapsed since the beginning of the function
call. For example, if eight milliseconds have passed since the last
context switch and an I/O function with a timeout of 10 milliseconds
blocks, causing a switch, the call may return with a timeout error as
little as two milliseconds after it was called. (On Linux,
<TT>select()</TT>'s timeout is an <I>upper</I> bound on the amount of
time elapsed before select returns.) Similarly, if 12 ms have passed
already, the function may return immediately.
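<P>The arithmetic of this example fits in a one-line model. The function below is hypothetical and deliberately ignores scheduling delays and <TT>select()</TT>/<TT>poll()</TT> granularity; it only computes the shortest time the call may actually wait, given the requested timeout and the time already elapsed since the last context switch.

```c
/* Hypothetical model: the timeout clock effectively starts at the last
   context switch, so the wait inside the call can be as short as
   timeout - elapsed (never negative). All values are in milliseconds. */
long shortest_wait_ms(long timeout_ms, long since_switch_ms)
{
    long remaining = timeout_ms - since_switch_ms;
    return remaining > 0 ? remaining : 0;
}
```

<P>With the figures above, <TT>shortest_wait_ms(10, 8)</TT> gives 2 and <TT>shortest_wait_ms(10, 12)</TT> gives 0, the "may return immediately" case.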
<P>In almost all cases, I/O timeouts should be used only for detecting a
broken network connection or for preventing a peer from holding an idle
connection for too long. Therefore, for most applications, realistic I/O
timeouts should be on the order of seconds. Furthermore, there's
probably no point in retrying operations that time out. Rather than
retrying, simply use a larger timeout in the first place.
<P>The largest valid timeout value is platform-dependent and may be
significantly less than <TT>INT_MAX</TT> seconds for <TT>select()</TT>
or <TT>INT_MAX</TT> milliseconds for <TT>poll()</TT>. Generally, you
should not use timeouts exceeding several hours. Use
<TT>ST_UTIME_NO_TIMEOUT</TT> (<TT>-1</TT>) as a special value to
indicate an infinite timeout or indefinite sleep. Use
<TT>ST_UTIME_NO_WAIT</TT> (<TT>0</TT>) to indicate no waiting at all.
<P>
<HR>
<P>
</BODY>
</HTML>