123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504 |
- <HTML>
- <HEAD>
- <TITLE>State Threads for Internet Applications</TITLE>
- </HEAD>
- <BODY BGCOLOR=#FFFFFF>
- <H2>State Threads for Internet Applications</H2>
- <H3>Introduction</H3>
- <P>
- State Threads is an application library which provides a
- foundation for writing fast and highly scalable Internet Applications
- on UNIX-like platforms. It combines the simplicity of the multithreaded
- programming paradigm, in which one thread supports each simultaneous
- connection, with the performance and scalability of an event-driven
- state machine architecture.</P>
- <H3>1. Definitions</H3>
- <P>
- <A NAME="IA">
- <H4>1.1 Internet Applications</H4>
- </A>
- <P>
- An <I>Internet Application</I> (IA) is either a server or client network
- application that accepts connections from clients and may or may not
- connect to servers. In an IA the arrival or departure of network data
- often controls processing (that is, IA is a <I>data-driven</I> application).
- For each connection, an IA does some finite amount of work
- involving data exchange with its peer, where its peer may be either
- a client or a server.
- The typical transaction steps of an IA are to accept a connection,
- read a request, do some finite and predictable amount of work to
- process the request, then write a response to the peer that sent the
- request. One example of an IA is a Web server;
- the most general example of an IA is a proxy server, because it both
- accepts connections from clients and connects to other servers.</P>
- <P>
- We assume that the performance of an IA is constrained by available CPU
- cycles rather than network bandwidth or disk I/O (that is, CPU
- is a bottleneck resource).
- <P>
- <A NAME="PS">
- <H4>1.2 Performance and Scalability</H4>
- </A>
- <P>
- The <I>performance</I> of an IA is usually evaluated as its
- throughput measured in transactions per second or bytes per second (one
- can be converted to the other, given the average transaction size). There are
- several benchmarks that can be used to measure throughput of Web serving
- applications for specific workloads (such as
- <A HREF="http://www.spec.org/osg/web96/">SPECweb96</A>,
- <A HREF="http://www.mindcraft.com/webstone/">WebStone</A>,
- <A HREF="http://www.zdnet.com/zdbop/webbench/">WebBench</A>).
- Although there is no common definition for <I>scalability</I>, in general it
- expresses the ability of an application to sustain its performance when some
- external condition changes. For IAs this external condition is either the
- number of clients (also known as "users," "simultaneous connections," or "load
- generators") or the underlying hardware system size (number of CPUs, memory
- size, and so on). Thus there are two types of scalability: <I>load
- scalability</I> and <I>system scalability</I>, respectively.
- <P>
- The figure below shows how the throughput of an idealized IA changes with
- the increasing number of clients (solid blue line). Initially the throughput
- grows linearly (the slope represents the maximal throughput that one client
- can provide). Within this initial range, the IA is underutilized and CPUs are
- partially idle. Further increase in the number of clients leads to a system
- saturation, and the throughput gradually stops growing as all CPUs become fully
- utilized. After that point, the throughput stays flat because there are no
- more CPU cycles available.
- In the real world, however, each simultaneous connection
- consumes some computational and memory resources, even when idle, and this
- overhead grows with the number of clients. Therefore, the throughput of the
- real world IA starts dropping after some point (dashed blue line in the figure
- below). The rate at which the throughput drops depends, among other things, on
- application design.
- <P>
- We say that an application has a good <I>load scalability</I> if it can
- sustain its throughput over a wide range of loads.
- Interestingly, the <A HREF="http://www.spec.org/osg/web99/">SPECweb99</A>
- benchmark somewhat reflects the Web server's load scalability because it
- measures the number of clients (load generators) given a mandatory minimal
- throughput per client (that is, it measures the server's <I>capacity</I>).
- This is unlike <A HREF="http://www.spec.org/osg/web96/">SPECweb96</A> and
- other benchmarks that use the throughput as their main metric (see the figure
- below).
- <P>
- <CENTER><IMG SRC="fig.gif" ALT="Figure: Throughput vs. Number of clients">
- </CENTER>
- <P>
- <I>System scalability</I> is the ability of an application to sustain its
- performance per hardware unit (such as a CPU) with the increasing number of
- these units. In other words, good system scalability means that doubling the
- number of processors will roughly double the application's throughput (dashed
- green line). We assume here that the underlying operating system also scales
- well. Good system scalability allows you to initially run an application on
- the smallest system possible, while retaining the ability to move that
- application to a larger system if necessary, without excessive effort or
- expense. That is, an application need not be rewritten or even undergo a
- major porting effort when changing system size.
- <P>
- Although scalability and performance are more important in the case of server
- IAs, they should also be considered for some client applications (such as
- benchmark load generators).
- <P>
- <A NAME="CONC">
- <H4>1.3 Concurrency</H4>
- </A>
- <P>
- Concurrency reflects the parallelism in a system. The two unrelated types
- are <I>virtual</I> concurrency and <I>real</I> concurrency.
- <UL>
- <LI>Virtual (or apparent) concurrency is the number of simultaneous
- connections that a system supports.
- <BR><BR>
- <LI>Real concurrency is the number of hardware devices, including
- CPUs, network cards, and disks, that actually allow a system to perform
- tasks in parallel.
- </UL>
- <P>
- An IA must provide virtual concurrency in order to serve many users
- simultaneously.
- To achieve maximum performance and scalability in doing so, the number of
- programming entities than an IA creates to be scheduled by the OS kernel
- should be
- kept close to (within an order of magnitude of) the real concurrency found on
- the system. These programming entities scheduled by the kernel are known as
- <I>kernel execution vehicles</I>. Examples of kernel execution vehicles
- include Solaris lightweight processes and IRIX kernel threads.
- In other words, the number of kernel execution vehicles should be dictated by
- the system size and not by the number of simultaneous connections.
- <P>
- <H3>2. Existing Architectures</H3>
- <P>
- There are a few different architectures that are commonly used by IAs.
- These include the <I>Multi-Process</I>,
- <I>Multi-Threaded</I>, and <I>Event-Driven State Machine</I>
- architectures.
- <P>
- <A NAME="MP">
- <H4>2.1 Multi-Process Architecture</H4>
- </A>
- <P>
- In the Multi-Process (MP) architecture, an individual process is
- dedicated to each simultaneous connection.
- A process performs all of a transaction's initialization steps
- and services a connection completely before moving on to service
- a new connection.
- <P>
- User sessions in IAs are relatively independent; therefore, no
- synchronization between processes handling different connections is
- necessary. Because each process has its own private address space,
- this architecture is very robust. If a process serving one of the connections
- crashes, the other sessions will not be affected. However, to serve many
- concurrent connections, an equal number of processes must be employed.
- Because processes are kernel entities (and are in fact the heaviest ones),
- the number of kernel entities will be at least as large as the number of
- concurrent sessions. On most systems, good performance will not be achieved
- when more than a few hundred processes are created because of the high
- context-switching overhead. In other words, MP applications have poor load
- scalability.
- <P>
- On the other hand, MP applications have very good system scalability, because
- no resources are shared among different processes and there is no
- synchronization overhead.
- <P>
- The Apache Web Server 1.x (<A HREF=#refs1>[Reference 1]</A>) uses the MP
- architecture on UNIX systems.
- <P>
- <A NAME="MT">
- <H4>2.2 Multi-Threaded Architecture</H4>
- </A>
- <P>
- In the Multi-Threaded (MT) architecture, multiple independent threads
- of control are employed within a single shared address space. Like a
- process in the MP architecture, each thread performs all of a
- transaction's initialization steps and services a connection completely
- before moving on to service a new connection.
- <P>
- Many modern UNIX operating systems implement a <I>many-to-few</I> model when
- mapping user-level threads to kernel entities. In this model, an
- arbitrarily large number of user-level threads is multiplexed onto a
- lesser number of kernel execution vehicles. Kernel execution
- vehicles are also known as <I>virtual processors</I>. Whenever a user-level
- thread makes a blocking system call, the kernel execution vehicle it is using
- will become blocked in the kernel. If there are no other non-blocked kernel
- execution vehicles and there are other runnable user-level threads, a new
- kernel execution vehicle will be created automatically. This prevents the
- application from blocking when it can continue to make useful forward
- progress.
- <P>
- Because IAs are by nature network I/O driven, all concurrent sessions block on
- network I/O at various points. As a result, the number of virtual processors
- created in the kernel grows close to the number of user-level threads
- (or simultaneous connections). When this occurs, the many-to-few model
- effectively degenerates to a <I>one-to-one</I> model. Again, like in
- the MP architecture, the number of kernel execution vehicles is dictated by
- the number of simultaneous connections rather than by number of CPUs. This
- reduces an application's load scalability. However, because kernel threads
- (lightweight processes) use fewer resources and are more light-weight than
- traditional UNIX processes, an MT application should scale better with load
- than an MP application.
- <P>
- Unexpectedly, the small number of virtual processors sharing the same address
- space in the MT architecture destroys an application's system scalability
- because of contention among the threads on various locks. Even if an
- application itself is carefully
- optimized to avoid lock contention around its own global data (a non-trivial
- task), there are still standard library functions and system calls
- that use common resources hidden from the application. For example,
- on many platforms thread safety of memory allocation routines
- (<TT>malloc(3)</TT>, <TT>free(3)</TT>, and so on) is achieved by using a single
- global lock. Another example is a per-process file descriptor table.
- This common resource table is shared by all kernel execution vehicles within
- the same process and must be protected when one modifies it via
- certain system calls (such as <TT>open(2)</TT>, <TT>close(2)</TT>, and so on).
- In addition to that, maintaining the caches coherent
- among CPUs on multiprocessor systems hurts performance when different threads
- running on different CPUs modify data items on the same cache line.
- <P>
- In order to improve load scalability, some applications employ a different
- type of MT architecture: they create one or more thread(s) <I>per task</I>
- rather than one thread <I>per connection</I>. For example, one small group
- of threads may be responsible for accepting client connections, another
- for request processing, and yet another for serving responses. The main
- advantage of this architecture is that it eliminates the tight coupling
- between the number of threads and number of simultaneous connections. However,
- in this architecture, different task-specific thread groups must share common
- work queues that must be protected by mutual exclusion locks (a typical
- producer-consumer problem). This adds synchronization overhead that causes an
- application to perform badly on multiprocessor systems. In other words, in
- this architecture, the application's system scalability is sacrificed for the
- sake of load scalability.
- <P>
- Of course, the usual nightmares of threaded programming, including data
- corruption, deadlocks, and race conditions, also make MT architecture (in any
- form) non-simplistic to use.
- <P>
- <A NAME="EDSM">
- <H4>2.3 Event-Driven State Machine Architecture</H4>
- </A>
- <P>
- In the Event-Driven State Machine (EDSM) architecture, a single process
- is employed to concurrently process multiple connections. The basics of this
- architecture are described in Comer and Stevens
- <A HREF=#refs2>[Reference 2]</A>.
- The EDSM architecture performs one basic data-driven step associated with
- a particular connection at a time, thus multiplexing many concurrent
- connections. The process operates as a state machine that receives an event
- and then reacts to it.
- <P>
- In the idle state the EDSM calls <TT>select(2)</TT> or <TT>poll(2)</TT> to
- wait for network I/O events. When a particular file descriptor is ready for
- I/O, the EDSM completes the corresponding basic step (usually by invoking a
- handler function) and starts the next one. This architecture uses
- non-blocking system calls to perform asynchronous network I/O operations.
- For more details on non-blocking I/O see Stevens
- <A HREF=#refs3>[Reference 3]</A>.
- <P>
- To take advantage of hardware parallelism (real concurrency), multiple
- identical processes may be created. This is called Symmetric Multi-Process
- EDSM and is used, for example, in the Zeus Web Server
- (<A HREF=#refs4>[Reference 4]</A>). To more efficiently multiplex disk I/O,
- special "helper" processes may be created. This is called Asymmetric
- Multi-Process EDSM and was proposed for Web servers by Druschel
- and others <A HREF=#refs5>[Reference 5]</A>.
- <P>
- EDSM is probably the most scalable architecture for IAs.
- Because the number of simultaneous connections (virtual concurrency) is
- completely decoupled from the number of kernel execution vehicles (processes),
- this architecture has very good load scalability. It requires only minimal
- user-level resources to create and maintain additional connection.
- <P>
- Like MP applications, Multi-Process EDSM has very good system scalability
- because no resources are shared among different processes and there is no
- synchronization overhead.
- <P>
- Unfortunately, the EDSM architecture is monolithic rather than based on the
- concept of threads, so new applications generally need to be implemented from
- the ground up. In effect, the EDSM architecture simulates threads and their
- stacks the hard way.
- <P>
- <A NAME="ST">
- <H3>3. State Threads Library</H3>
- </A>
- <P>
- The State Threads library combines the advantages of all of the above
- architectures. The interface preserves the programming simplicity of thread
- abstraction, allowing each simultaneous connection to be treated as a separate
- thread of execution within a single process. The underlying implementation is
- close to the EDSM architecture as the state of each particular concurrent
- session is saved in a separate memory segment.
- <P>
- <H4>3.1 State Changes and Scheduling</H4>
- <P>
- The state of each concurrent session includes its stack environment
- (stack pointer, program counter, CPU registers) and its stack. Conceptually,
- a thread context switch can be viewed as a process changing its state. There
- are no kernel entities involved other than processes.
- Unlike other general-purpose threading libraries, the State Threads library
- is fully deterministic. The thread context switch (process state change) can
- only happen in a well-known set of functions (at I/O points or at explicit
- synchronization points). As a result, process-specific global data does not
- have to be protected by mutual exclusion locks in most cases. The entire
- application is free to use all the static variables and non-reentrant library
- functions it wants, greatly simplifying programming and debugging while
- increasing performance. This is somewhat similar to a <I>co-routine</I> model
- (co-operatively multitasked threads), except that no explicit yield is needed
- --
- sooner or later, a thread performs a blocking I/O operation and thus surrenders
- control. All threads of execution (simultaneous connections) have the
- same priority, so scheduling is non-preemptive, like in the EDSM architecture.
- Because IAs are data-driven (processing is limited by the size of network
- buffers and data arrival rates), scheduling is non-time-slicing.
- <P>
- Only two types of external events are handled by the library's
- scheduler, because only these events can be detected by
- <TT>select(2)</TT> or <TT>poll(2)</TT>: I/O events (a file descriptor is ready
- for I/O) and time events
- (some timeout has expired). However, other types of events (such as
- a signal sent to a process) can also be handled by converting them to I/O
- events. For example, a signal handling function can perform a write to a pipe
- (<TT>write(2)</TT> is reentrant/asynchronous-safe), thus converting a signal
- event to an I/O event.
- <P>
- To take advantage of hardware parallelism, as in the EDSM architecture,
- multiple processes can be created in either a symmetric or asymmetric manner.
- Process management is not in the library's scope but instead is left up to the
- application.
- <P>
- There are several general-purpose threading libraries that implement a
- <I>many-to-one</I> model (many user-level threads to one kernel execution
- vehicle), using the same basic techniques as the State Threads library
- (non-blocking I/O, event-driven scheduler, and so on). For an example, see GNU
- Portable Threads (<A HREF=#refs6>[Reference 6]</A>). Because they are
- general-purpose, these libraries have different objectives than the State
- Threads library. The State Threads library is <I>not</I> a general-purpose
- threading library,
- but rather an application library that targets only certain types of
- applications (IAs) in order to achieve the highest possible performance and
- scalability for those applications.
- <P>
- <H4>3.2 Scalability</H4>
- <P>
- State threads are very lightweight user-level entities, and therefore creating
- and maintaining user connections requires minimal resources. An application
- using the State Threads library scales very well with the increasing number
- of connections.
- <P>
- On multiprocessor systems an application should create multiple processes
- to take advantage of hardware parallelism. Using multiple separate processes
- is the <I>only</I> way to achieve the highest possible system scalability.
- This is because duplicating per-process resources is the only way to avoid
- significant synchronization overhead on multiprocessor systems. Creating
- separate UNIX processes naturally offers resource duplication. Again,
- as in the EDSM architecture, there is no connection between the number of
- simultaneous connections (which may be very large and changes within a wide
- range) and the number of kernel entities (which is usually small and constant).
- In other words, the State Threads library makes it possible to multiplex a
- large number of simultaneous connections onto a much smaller number of
- separate processes, thus allowing an application to scale well with both
- the load and system size.
- <P>
- <H4>3.3 Performance</H4>
- <P>
- Performance is one of the library's main objectives. The State Threads
- library is implemented to minimize the number of system calls and
- to make thread creation and context switching as fast as possible.
- For example, per-thread signal mask does not exist (unlike
- POSIX threads), so there is no need to save and restore a process's
- signal mask on every thread context switch. This eliminates two system
- calls per context switch. Signal events can be handled much more
- efficiently by converting them to I/O events (see above).
- <P>
- <H4>3.4 Portability</H4>
- <P>
- The library uses the same general, underlying concepts as the EDSM
- architecture, including non-blocking I/O, file descriptors, and
- I/O multiplexing. These concepts are available in some form on most
- UNIX platforms, making the library very portable across many
- flavors of UNIX. There are only a few platform-dependent sections in the
- source.
- <P>
- <H4>3.5 State Threads and NSPR</H4>
- <P>
- The State Threads library is a derivative of the Netscape Portable
- Runtime library (NSPR) <A HREF=#refs7>[Reference 7]</A>. The primary goal of
- NSPR is to provide a platform-independent layer for system facilities,
- where system facilities include threads, thread synchronization, and I/O.
- Performance and scalability are not the main concern of NSPR. The
- State Threads library addresses performance and scalability while
- remaining much smaller than NSPR. It is contained in 8 source files
- as opposed to more than 400, but provides all the functionality that
- is needed to write efficient IAs on UNIX-like platforms.
- <P>
- <TABLE CELLPADDING=3>
- <TR>
- <TD></TD>
- <TH>NSPR</TH>
- <TH>State Threads</TH>
- </TR>
- <TR>
- <TD><B>Lines of code</B></TD>
- <TD ALIGN=RIGHT>~150,000</TD>
- <TD ALIGN=RIGHT>~3000</TD>
- </TR>
- <TR>
- <TD><B>Dynamic library size <BR>(debug version)</B></TD>
- <TD></TD>
- <TD></TD>
- </TR>
- <TR>
- <TD>IRIX</TD>
- <TD ALIGN=RIGHT>~700 KB</TD>
- <TD ALIGN=RIGHT>~60 KB</TD>
- </TR>
- <TR>
- <TD>Linux</TD>
- <TD ALIGN=RIGHT>~900 KB</TD>
- <TD ALIGN=RIGHT>~70 KB</TD>
- </TR>
- </TABLE>
- <P>
- <H3>Conclusion</H3>
- <P>
- State Threads is an application library which provides a foundation for
- writing <A HREF=#IA>Internet Applications</A>. To summarize, it has the
- following <I>advantages</I>:
- <P>
- <UL>
- <LI>It allows the design of fast and highly scalable applications. An
- application will scale well with both load and number of CPUs.
- <P>
- <LI>It greatly simplifies application programming and debugging because, as a
- rule, no mutual exclusion locking is necessary and the entire application is
- free to use static variables and non-reentrant library functions.
- </UL>
- <P>
- The library's main <I>limitation</I>:
- <P>
- <UL>
- <LI>All I/O operations on sockets must use the State Thread library's I/O
- functions because only those functions perform thread scheduling and prevent
- the application's processes from blocking.
- </UL>
- <P>
- <H3>References</H3>
- <OL>
- <A NAME="refs1">
- <LI> Apache Software Foundation,
- <A HREF="http://www.apache.org">http://www.apache.org</A>.
- <A NAME="refs2">
- <LI> Douglas E. Comer, David L. Stevens, <I>Internetworking With TCP/IP,
- Vol. III: Client-Server Programming And Applications</I>, Second Edition,
- Ch. 8, 12.
- <A NAME="refs3">
- <LI> W. Richard Stevens, <I>UNIX Network Programming</I>, Second Edition,
- Vol. 1, Ch. 15.
- <A NAME="refs4">
- <LI> Zeus Technology Limited,
- <A HREF="http://www.zeus.co.uk/">http://www.zeus.co.uk</A>.
- <A NAME="refs5">
- <LI> Peter Druschel, Vivek S. Pai, Willy Zwaenepoel,
- <A HREF="http://www.cs.rice.edu/~druschel/usenix99flash.ps.gz">
- Flash: An Efficient and Portable Web Server</A>. In <I>Proceedings of the
- USENIX 1999 Annual Technical Conference</I>, Monterey, CA, June 1999.
- <A NAME="refs6">
- <LI> GNU Portable Threads,
- <A HREF="http://www.gnu.org/software/pth/">http://www.gnu.org/software/pth/</A>.
- <A NAME="refs7">
- <LI> Netscape Portable Runtime,
- <A HREF="http://www.mozilla.org/docs/refList/refNSPR/">http://www.mozilla.org/docs/refList/refNSPR/</A>.
- </OL>
- <H3>Other resources covering various architectural issues in IAs</H3>
- <OL START=8>
- <LI> Dan Kegel, <I>The C10K problem</I>,
- <A HREF="http://www.kegel.com/c10k.html">http://www.kegel.com/c10k.html</A>.
- </LI>
- <LI> James C. Hu, Douglas C. Schmidt, Irfan Pyarali, <I>JAWS: Understanding
- High Performance Web Systems</I>,
- <A HREF="http://www.cs.wustl.edu/~jxh/research/research.html">http://www.cs.wustl.edu/~jxh/research/research.html</A>.</LI>
- </OL>
- <P>
- <HR>
- <P>
- <CENTER><FONT SIZE=-1>Portions created by SGI are Copyright © 2000
- Silicon Graphics, Inc. All rights reserved.</FONT></CENTER>
- <P>
- </BODY>
- </HTML>
|