Author: Sahlan Diver, Contractor
Date: Jan 2009
Revised: April 2009
Revised May 2nd 2009 after further
simplifications to the code
Revised May 7th 2009 after further tidying up
the code
IO Completion Ports
Thread Pool Requirement
Simplified Completion Port Coding
Completion Port Problems
Timeouts
Socket close handling
Graceful socket close
Unused Facilities
CancelIo
PostQueuedCompletionStatus
Maximising
Performance
AcceptEx()
Denial Of Service
DisconnectEx()
Threading and Memory Issues
Multithreading Issues
Memory Management
Using an alternative memory cache in C++
Class member function operator new and delete
Zero-byte receive
TCP Packets
Testing
Summary
While working on a C++ SMTP mail server class,
I came across an article extolling the virtues of Windows I/O Completion Ports
for scaleable, high-performance socket programming. Previously I had not needed to worry about performance in socket
classes, as the applications I had used them in were relatively lightweight.
Clearly the performance requirements of an SMTP server were going to be much
more stringent, so I set about designing some C++ classes to provide socket
functionality using the Win32 IO completion port mechanism.
There were two principle sources of the
information needed to write the code: the MSDN, and the many articles and
source code examples on the very excellent “The Code Project” web site.
Question:
If this diagram models a single-threaded
application:
Does this diagram model a multi-threaded
application?
![]()
![]()
Answer:
It depends. The diagram gives the impression
that the multi-threaded application does three times the amount of work in the
same amount of time, but this would only be true if there were three processors
running simultaneously. In a single processor system, the processor has to
divide its time across the threads, so a more correct diagram would be:
The program is not running any faster, it is
just splitting its available time across its threads.
If threads do not cause a program to run
faster, why bother with them at all? When handling socket send and receives,
for example, why not, in a single-threaded program, do all the i/o for one
socket, then go to the next socket, finally doing the i/o for the third socket?
There are several reasons:
Firstly, a single-threaded model does not scale
to multi-processor machines. A single-threaded program running on a three
processor machine would still only use one processor, whereas a multi-threaded
program could automatically take advantage of the additional processors and run
up to three times as fast.
Secondly, threads are not always fully active.
A thread might need to go dormant while waiting for something to occur, such as
data to arrive from a remote connection. When a thread is put into a dormant
wait state, this will free up processor time for the remaining threads, so the
overall processing time for the tasks is shorter. By contrast, a single
threaded application carrying out tasks one after the other can do nothing but
be idle when it is forced to wait for some external event.
Thirdly, operations carried out by a program
can have timeouts. If an operation in a single threaded application is queued
up behind other operations for too long, it might be aborted with a timeout
error, whereas in a multithreaded application the continual intermittent
activity on an operation keeps it moving along, reducing the possibility of any
intermediate stage of the operation timing out.
It might be thought that, since multi-threading
offers such advantages, all we need to do to design a multithreaded server is
to allow a separate thread to be created to service each incoming client
connection, so that all clients can be serviced simultaneously. Unfortunately
this tactic does not scale well, and the reason why not can be seen if we
redraw our multi-threading diagram to more accurately represent how
multithreading works:
It can be seen that switching between one
thread and another is not instantaneous; some time is lost making the switch,
as represented by the diagonal lines in the diagram This is a fixed amount of
time per switch, so what happens as an application adds more threads is that a
higher and higher proportion of processor time is spent on thread switching and
less and less on actual processing within each thread. Eventually the number of
threads can become high enough that the processor spends all its time switching
from one thread to the next, and no time is spent on actual data processing.
Microsoft provides a solution to this problem,
inbuilt into the Windows O/S. It is known as the IO Completion Port mechanism.
IO Completion ports provide a way that
processor time can be divided between any number of I/O objects, such as
sockets, without requiring a proportionate increase in the number of threads.
Windows
sockets programming is typically done asynchronously, that is, a sockets
function is called and returns immediately without waiting for a result. Some
time later when the result is ready, for example a remote connection has sent
some data, the application is notified and the result can then be processed.
IO Completion Ports provide a specialised
notification mechanism for completed asynchronous I/O. Completed I/O is queued
at an object called a completion port which then calls a
programmer-supplied call-back, known as a completion function, to
process the I/O. What is unique about the completion port mechanism is that it
tries to run the call-backs for multiple socket I/O in as few threads as
possible, to maximise efficiency by minimising thread context switching.
The completion function that processes
completed I/O is called within the context of a thread. When the completion
function has returned, the completion port looks to see if it has more I/O
results queued up, waiting to be processed. If so, it immediately calls the
completion function again in the same thread as previously. A whole
sequence of I/O operations can therefore be processed in the one thread,
avoiding losing time through context switching to other threads.
Take an SMTP server as an example application.
The server receives some mail data from a client in a completion routine and,
in accordance with the SMTP protocol sends back a response to the client.
Before further data comes back from that client, I/O might arrive at the
completion port from a second client. The completion port immediately causes
this new data to be processed in the same thread that processed the first
client’s data. No context switch to a second thread is necessary.
The completion port mechanism optimally tries
to use one thread per processor in the system, so if you have a dual processor
machine the completion port can field two I/O completions at a time, through
two parallel running threads, thus automatically doubling performance.
Although completion ports will try to use just
one thread per processor, occasionally this ideal is not possible. For example,
the thread in use may have called a blocking function. In such circumstances,
so as not to be delayed, the completion port will grab an additional thread to
use to process other I/O results that are currently queued. So a pool of
available threads is required. The completion port will automatically choose
the most recently used thread, as threads that have not been used for some time
may have had their information swapped out of memory, and would take longer to
reactivate.
To try to get as reliable a result in as short
as possible a time, I did not set about designing my own thread-pool, but
instead used a Microsoft library function, BindIoCompletionCallback(),
that conveniently allows a socket handle to be associated with a resident
Windows thread pool and completion port:
// binding socket to completion port associated with Windows-managed
thread pool
if ( !BindIoCompletionCallback( RECAST (HANDLE,sockHandle),
SocketCompletionRoutine, 0 ) )
// …
All that was then necessary to do was write the
completion routine for handling completed I/O. This has the prototype:
VOID CALLBACK SocketCompletionRoutine ( DWORD errcode, DWORD bytes_transfered, LPOVERLAPPED overlapped )
Note the pointer to an OVERLAPPED structure. This is a compulsory parameter to asynchronous socket I/O functions which is passed on as a parameter to the SocketCompletionRoutine. We can use the parameter to store additional context-sensitive information that will assist our processing of the call completion.
Microsoft’s OVERLAPPED structure contains very little spare field space for context-sensitive information, so it was necessary to design a new struct which impersonates an OVERLAPPED while at the same time containing additional fields. More of this technique later.
Completion Port
Problems
So
far, the coding of the completion port mechanism sounds very straightforward.
We were able to call a library function that provided us with a
ready-functioning thread-pool and completion port, and we only needed to
provide two pieces of code, the completion routine itself and a special
structure to provide context sensitive data. However there are several thorny
issues involved in completion port programming, which will be discussed in the
following sections.
In
asynchronous design we fire off an operation and forget about it until a result
comes back. What happens if a result never comes back? We need to be able to
set reasonable time-outs on operations and if they haven’t completed within
that time, assume that there is a problem and return a timeout error message.
In
my former asynchronous sockets programming, handling timeouts was easy. The
dedicated thread for a connection would simply wait on a handle representing the
completion of the operation. The wait function would have a timeout specified,
so the conclusion of a wait would either be a successful completion or a
timeout error.
Unfortunately
we can’t do this with completion ports because waiting on a result or timeout
will put a completion port thread to sleep until a waited-on condition has
occurred. This will cause the completion port to context-switch to another
thread in the thread pool, as it does not tolerate sitting idle if it has
queued i/o waiting to be processed. If every client i/o operation contained a
wait and forced a context switch, consider what would happen if our thread pool
were of unlimited size, we would be back to the situation where as many threads
were running as there were connected clients, which is not a scaleable
situation. Conversely if we limit the thread pool size, we will then have
queued i/o blocked by i/o, in front of it in the queue, that is waiting for its
threads to wake up. Either way, performance is severely compromised.
Clearly
a different mechanism is needed for handling timeouts, and that mechanism also
needs to be scaleable.
The
solution devised was to have a single extra thread dedicated to monitoring I/O
timeouts.
When
an I/O operation is initiated, it is associated with an enhanced OVERLAPPED
structure. This enhanced structure is made to store the start time of the
operation and the required timeout for the operation. We also store a pointer
to the OVERLAPPED on a special list under the control of the timeout monitoring
thread. The timeout monitoring thread is woken up every second to examine its
list and see if any of the OVERLAPPED’s on it are old enough that their
corresponding I/O operation can be considered to have timed out. If so, the
completion routine is called directly to process the timed-out item and return
an error to the application.
Some
care is needed with arbitrary timeouts. It may be that an i/o operation that we
deemed timed out does eventually successfully return a result to the completion
routine. We can’t then return this result to the application, because we have
already told the application the operation was in error. What we do for a timed-out operation is
leave it’s corresponding OVERLAPPED information on the timeout monitoring
thread’s list. If the operation does subsequently return we will know from the
flag value stored on the list that the I/O was regarded as having timed out, so
we ignore and discard the late returned result. If the operation never returns
there is some slight memory overhead, as the OVERLAPPED structure has to remain
on the list as a precaution and can’t actually be cleaned up until the socket
is closed.
We do know
in a timeout situation the completion routine will be called twice, once from
the timeout thread and once from the i/o thread (either because of a subsequent
successful completion, or because of an i/o abort error when the non-returning
socket is eventually closed) So the first time we report the timeout error, the
second time we do not report but could cleanup instead. We distinguish the
second time in with the TIMEOUTREPORTED flag. It's not as simple as this,
however, because of thread scheduling. The thread charged with reporting the
timeout may go slower than the one charged with cleaning up the OVERLAPPED. We
don't want any thread to start cleaning up until the other thread has finished
with the data, so what is relevant to cleanup is not which thread is enters the
timeout handling code first – that will be given the task of reporting the
timeout -- but which thread gets to the cleanup part of the code last, that
will be the one charged with cleaning up.
Another problem with io completion ports comes about when wanting to delete a socket object.
Traditionally
in C++, well-designed destructors should clean up the resources used by an
object. This is OK where those resources can be immediately cleaned up, but in
an asynchronous situation, where it might take some for resources to be
released, it can be dangerous to allow a C++ object to be prematurely deleted.
Imagine
this scenario:
1.
We delete a Socket class object
2.
The destructor of the object causes close() on
the socket
3.
Closing a socket causes any outstanding i/o on
the socket to be reported back to the completion routine as having failed with
an error
4.
In the completion routine we try to access the
Socket class object. However the object has already been deleted in step 2, so
we are accessing an invalid object, and the code fails.
This
shows that the socket class designers, rather than the programmers using the
class, need to control precisely when and how a socket object can be deleted.
One
solution in situations like this is to make an object’s destructor wait on an
event handle until a parallel running thread responsible for cleaning up the
object has signalled completion of its task. We can’t do that because we want
to have the option to close a socket within the data processing or error
handlers called by a completion routine, and we have already said above that
any code that waits or blocks a thread during a completion routine is
unacceptable.
A
more complicated solution is required. The solution is encapsulated within the
private member functions of our socket classes and also takes advantage of C++
semantics to ensure safe asynchronous closure of sockets without any special
action being required by the user of the socket classes.
In
summary:
1.
We declare the object’s destructor as
“protected”. This prevents the object from being destroyed by a programmer’s
direct call to delete. The programmer can’t use delete to free up the
heap-based object since the destructor which delete activates is not
publicly accessible. But making the destructor “protected” also implies that we
can’t create a local stack-based socket object, we always have to create it on
the heap, since a stack based object needs access to a public destructor for
automatic cleanup when it goes out of scope.
2.
We provide a boolean parameter for the Close()
method as a way of requesting that the socket object be destroyed once all
pending i/o has been completed.
3.
We make use of the fact that closing a socket
causes all pending i/o to be queued on the completion port with an abort error.
This is guaranteed as part of the way in which completion ports work. For each
socket object, we keep a reference count of all pending i/o. When this has been
reduced down to zero through each I/O result being reported back to the
completion port, we know that the socket object is not expecting any further
communications and can now be deleted.
4.
When all pending i/o has returned, the socket
object can delete itself by doing “delete this;” This is safe because
(a) the protected destructor means that the socket class object could only have
been created on the heap – there is no danger of trying to delete a stack-based
object (b) the protected destructor is unavailable to outsiders but it is
callable by another method of the same class, so it will be callable as usual
as part of the delete operation
5.
Notice that we make the destructor protected
rather than private, so it is accessible to be called in the destruction
of any classes derived from the Socket class.
Microsoft:
Shutting down a socket connection involves an exchange of protocol
messages between the two endpoints, hereafter referred to as a shutdown
sequence. Two general classes of shutdown sequences are defined: graceful and
abortive (also called hard). In a graceful shutdown sequence, any data that has
been queued, but not yet transmitted can be sent prior to the connection being
closed. In an abortive shutdown, any unsent data is lost. The occurrence of a
shutdown sequence (graceful or abortive) can also be used to provide an
FD_CLOSE indication to the associated applications signifying that a shutdown
is in progress.
Sockets
need to be closed gracefully, I.e. if there is a data transfer still in
progress, it is allowed to complete
before the socket is closed, firstly so that data is not lost, secondly so that
consequent communications errors do not occur for what should be an error free
operation.
The
WinSock function, shutdown (), is called to close a socket gracefully.
The shutdown action needs to
be different on the server and client sides, because on TCP there is a TCP enforced delay, known as
TIME_WAIT, before a port / i.p.
combination can be reused. The side that does the active close goes to TIME_WAIT.
We don't want the server side
of a socket connection in time wait because we want to be able to use
DisconnectEx() to put the socket handles into a pool for re-use, to improve the
efficiency at which we can offer new connections.
These handles need to be reusable as soon as possible.
Calling shutdown(SD_SEND) from the server initiates a
passive close. The client does the final active close by calling shutdown(SD_BOTH.
That way the client ends up in TIME_WAIT, which generally doesn't matter as
much as the server being in TIME_WAIT.
The precise sequence differs
slightly depending on whether the server or client side initiates the close.
Server
side closing:
- if
the server wishes to initiate the close down, it should call shutdown
(FD_SEND). The client receives 0 data bytes telling it that the server is
sending no more data. The client can then call shutdown (FD_BOTH). The server's
recv() completion routine gets any remaining sent client data followed by a
zero byte receive telling it the client has closed down. The server can then
close the connection socket.
Client
side closing:
- if
the client wishes to initiate the close down, it should call shutdown (FD_SEND).
The server receives 0 data bytes telling it that the client is sending no more
data. It can then call shutdown (FD_SEND). The client then gets FD_CLOSE and
can close. The server also then gets FD_CLOSE and can close.
Note
that both sides initially do the same thing, sending FD_SEND. They differ in
how they react to a close down. A virtual function is tested to see whether the
socket class is for the server or client and the final part of the shutdown is
handled accordingly.
This solution had two problems
o
It
required creating an extra thread per connection. Although the thread was
dormant, waiting on an event, this would nevertheless create overhead that
might limit scalability.
o
In testing we found that the FD_CLOSE could not be relied on, and was
sometimes leaving connections hanging without proper termination.
Fortunately we notice that all closed connections infallibly return zero
bytes with no error as their closing recv() operation, and we now rely on this
to start of the socket close sequence described above. This simplified the
design by removing a thread and also by removing a redundant mechanism – if a
zero-byte receive is sufficient to detect socket closure then we don’t need to
handle FD_CLOSE as well.
Unused Facilities
If we were managing our own thread pool this
would be difficult enough to achieve, but we chose to initialise our sockets by
calling BindIoCompletionPort() which means that we are using an internal
Windows thread pool over which we have no control. Because of this limitation
of CancelIo() we decided not to give it any further consideration.
This function can be used by a programmer to
post an item to a completion port. However, because we used BindIoCompletionPort(), we don’t have the actual handle of the completion port in use. There
were a couple of cases where in our own threads we needed to send an item to
the completion routine. We cannot do this via the completion port, since we
have no handle for it. Instead the code calls the completion routine directly,
bypassing the completion port. The completion routine will be called within one
of our socket class worker threads and
not within a thread from the Windows thread pool. However, the code is designed
for thread safety so that calling the completion routine in any thread is not a
problem.
Several articles on server design recommended
using AcceptEx() for the sockets that a server creates to represent
client connections. Creating a socket and opening it when a client connects is
apparently an expensive operation in terms of time. AcceptEx() allows
sockets to be created and opened in advance of a connection being made. The
recommendation is to keep a pool of these sockets available to speed connection
times. This recommendation was followed.
A further advantage of AcceptEx() is
that it can combine both connection and the receiving of the first data from
the client into a single operation, thus further increasing efficiency. The
read data is queued to the completion port, just like a regular read.
The code uses WSAEventSelect to request the setting of the FD_ACCEPT event
when a connection is ready to be made but there are no more pre-opened sockets
left in the pool. A
special thread waiting on this event can create a new stock of required sockets
and call AcceptEx on them.
Unfortunately a malicious client can make a
connection but then not send any data. In a denial of service attack a
multitude of such inactive connections are requested, leading to the pool of
sockets being used up and more needing to be created. Eventually one hits a
resource limit of too many created sockets.
So when asked to create more
sockets by an FD_ACCEPT event, the code tests the existing sockets to see if any
have been connected but have been waiting too long for data.
get_socketopt() with the SO_CONNECT_TIME
option is used to test connection time. Any sockets connected for a long time
without sending data are closed and deleted.
Several articles recommended using DisconnectEx()
to disconnect sockets when finished with but keep them available for reuse.
Reusing an existing socket offers performance benefits over discarding a socket
and having to create a new one when required.
A limitation is that a disconnected socket can
only be immediately re-used if not subject to the TCP time wait after closure.
This limitation is placed on a socket on the disconnecting side. Our code, as
described above in this document, ensures that the side performing shutdown is
not the server rather but the client. So DisconnectEx can be applied to
sockets on the server side.
When a socket is disconnected, it is placed on
separate list for re-use. When we next get a FD_ACCEPT, showing that we have
run out of sockets in our pool, we first grab the sockets on the disconnected
list and try to reactivate them with AcceptEx(). For simplicity, any such
sockets that give an error on reactivation are simply discarded.
Following the reactivation of former
disconnected sockets, we look at the new size of the socket pool, and if
smaller than a specified minimum size, we create more sockets and pre-accept
them with AcceptEx, until the pool is full.
We found an unexpected problem with
DisconnectEx. At each FD_ACCPET we first call AcceptEx() to formerly
disconnected sockets, then we do a check for denial of service, by checking how
long the socket has already been connected. What we found was that the
getsockopt() routine for returning connection time returned the time since when
the socket had first been connected, before it was disconnected for
re-use. Therefore the denial of service test was rejecting the socket by
looking too far back into its connection history.
We worked round this problem by testing the
connection time immediately after calling AcceptEx() on a formerly
disconnected socket, so this gives us the time of reconnection. Any future DOS
test is made to refer correctly to this latest reconnection time, which we
store in the socket object.
Threading and
Memory Issues
The code makes use of threads it creates itself
to monitor timeouts, acceptance of incoming connections and closure of
connections. There is the possibility of simultaneous access of the same data
between these threads and the completion port threads. All such data is
protected by CRITICAL_SECTION objects. Critical sections are the most efficient
method in Windows of protecting data from simultaneous thread access.
We know that completion ports try to keep the
number of threads in use to a minimum. This gives an additional advantage in
the matter of access to shared data. The likelihood of a thread having to wait
temporarily while another thread accesses critical section protected data is
greatly reduced because there are relatively very few threads running.
Sockets
communication inevitably involves memory buffers, and some articles on sockets
programming suggest caching freed-up buffers for re-use, to reduce the time
spent on allocating memory. Other articles caution against this by saying that
one would want to be certain that we really were designing a caching mechanism
more efficient than the default heap.
Having
made several attempts in the past to design a fast memory cache, I considered
it worthwhile re-visiting the subject.
I was not very happy with my past designs which involved too much
searching in the cache for a memory block of the correct size. I tried to make
the search algorithms efficient but I had little evidence they bettered the
default heap mechanisms. So I considered the problem afresh.
It
seemed to me that the ideal cache mechanism allows one to retrieve a block of
memory of a given size in just one step, without any kind of looping search.
Suppose
there were an array of pointers to stacks of free memory blocks.
o
Array slot number 10 would point to a stack of
memory blocks of size 10 bytes
o
Slot number 11 would point to a stack of memory
blocks of size 11 bytes
o
Slot number 12 would point to a stack of memory
blocks of size 12 bytes
o
and so on.
Then
if you wanted say 12 bytes, one step would take you to straight to the stack of
12-byte blocks and you could just grab the top block off the stack. Or you
could get new memory from the default heap if the stack were found to be empty.
Likewise putting the 12 byte block back when finished with would only require
one step, for example to slot number 12 in the array, to find the stack where
the block should go.
The disadvantage of this proposal is the enormous amount of memory that would be taken up by the array and the stacks because of having a separate stack for every possible size of memory block.
Suppose, however, we increased the block size increments, so we had something like this:
o
Slot number 10 would point to a stack of memory
blocks of size 1000 bytes
o
Slot number 11 would point to a stack of memory
blocks of size 1100 bytes
o
Slot number 12 would point to a stack of memory
blocks of size 1200 bytes
If
we wanted some memory, we would round up the nearest size. For example, if we
wanted 932 bytes we would take a 1000 byte sized block and just ignore the
surplus trailing 68 bytes of the block. This approach keeps the cache size very
much smaller, at the expense of wasting a bit of memory in the application. The
method also has the compensating advantage that the chance of resuing any block
is greatly increased. For example, a cached 1000 byte block could be used by
any memory request for between 901 and 1000 bytes of memory. There’s less
likely to be infrequently used sizes of memory block sitting on the cache doing
nothing.
Instrumenting solutions like this is vital. When a cache designed according to the ideas above was first instrumented with a test program, it was actually found to be slower than the default heap! This was quickly traced to a mixture of genuine bugs and silly mistakes. However, after putting these right, the cache performance was still found to be unpredictable. Eventually the cause was discovered -- the caching code was fine, but there was a fault in the design of the code for reorganising the cache when too much memory had been placed on it. After revamping the algorithms, a test rig program showed performance increases, in release builds, of between 20 – 70 times the allocation/deallocation speed of the default heap. So this proved to be an experiment worth undertaking
Using an alternative memory cache in C++
The reader will know that memory allocation for
an object and calling of the constructor for an object is achieved through a
single operator; new.
For example:
std::string s = new
std::string ( “Hello” ) ;
// 1) new gets memory from
the heap
// 2) new calls the
constructor for the string, passing it “Hello”
It is possible to redefine new to
use alternative sources of memory. We don’t want this, however. We want our
applications to continue to use the default heap, but for a limited number of
socket buffer allocations we wish instead to use our fast memory cache
described above. It’s no problem to design a class or function that will return
us a pointer to some allocated memory.
e.g.
class FastHeap {
public:
void * Allocate (size_t) ;
void DeAllocate ( void * ) ;
}
;
but how do we construct a C++
object using this memory, as new() insists on using the default heap?
Fortunately C++ provides a way to
get what we want. I mention this here because the techniques are not used very
often and the reader may not have come across them before.
There is a special operator
function called the “placement new” which performs the dual function of
“placing” an object at a memory address specified by the programmer and then
calling the object’s constructor.
For our string example, the syntax
looks like this:
FastHeap fastHeap ;
void * mem =
fastHeap.Allocate(sizeof(std::string)) ; // get memory
std::string * s = new ( mem ) std::string ( “Hello” ) ;
// place and construct object
The converse problem occurs when
we want to delete the object. The standard operator delete calls the objects
destructor, then deallocates its memory onto the default heap. We don’t want
the second part of the operation.
Again C++ provides a
way to get what we want. We can call the object’s destructor directly with some
special syntax. Then we can deallocate our memory back to our fast heap.
s->std::string::~std::string()
; // call destructor
fastHeap.Deallocate ( &s
); // fee up memory
A lot of the time we use FastHeap to
allocate simple byte buffers for socket data, but we also happen to use the FastHeap
to allocate C++ class objects having constructors and destructors and the above
techniques are used for that.
An alternative technique for
allocating and deallocating through an alternative heap, is to define member
operator new and operator delete functions for a class. Any call to new and delete uses these
functions for the to get and free the memory, so the functions can be defined
to use alternative heaps. The advantage is that there is no need for a
placement new, or an explicit constructor call since the normal construction
and destruction mechanisms are applied by the compiler, all that is different
is the source of the allocated memory. We use this latter technique in the
code.
Note that operator new and
operator delete are static functions since they have to be called when an
object is either not in existence yet or is in the process of being destroyed.
They are static implicitly and do not require to be declared as static.
The problem is solved by a technique known as “zero-byte receive”. We call WSARecv() to ready our socket for the next data receive, but we specify a zero-byte buffer. Because of the buffer it uses is zero length, no memory is locked by OS. Thus a server
WSAEWOULDBLOCK., showing that no more data is available
Of course, this optimisation will be less than
effective if most data is received in only small chunks with just an occasional
large chunk, because then we will be getting a larger amount of memory than we
need for most receives. For this reason, it should be possible for the user to
fine-tune the optimisation by specifying a start size and an upper limit for
the buffer size. We have not added this facility yet.
The TCP
socket class contains virtual functions for identifying and handling TCP
packets. The classes ProcessRecv() has been overridden to call a virtual
function ExtractPackets() which should be defined by a derived class to analyse
the data into packets, returning only whole packets and keeping back a trailing
partial packet until the conclusion of that packet is received in the
subsequent ProcessRecv(s). The extracted packets are passed on to
ProcessPackets(), which a derived class must define to act on appropriately.
If it is
not required for the derived class to process packets, then ProcessRecv()
should be overridden instead to process the raw received data in some other
way.
Microsoft
defines an WSAOVERLAPPED structure that must be supplied to i/o function calls
for the purpose of passing data that may be needed by completion routines.
Since the data is passed as a pointer we can supply a pointer that actually
points to a much larger structure containing context-sensitive data of our own
that we need in the completion routine.
An
example of such a structure is:
struct
OVERLAPPEDPLUS {
OVERLAPPEDPLUS ( WinSocket * originator,
DWORD timeout_secs ) ;
~OVERLAPPEDPLUS () ;
// …………..
enum
CompletionState { PENDING, TIMEDOUT, TIMEOUTREPORTED, COMPLETED }
WSAOVERLAPPED
wsaOverlapped ; // used by Winsock
WinSocket * winSocket ; // the socket
that initiated the overlapped operation
DWORD timeoutPeriod ; // 0 for infinite
timeout, or timeout in seconds
DWORD queuedTime ; // for the timeout
monitoring thread
CompletionState completionState ;
int timeoutHandlersCompleted ;
} .
There is a derived class hierarchy of
these structs, namely UDP_OVERLPAPPED, TCP_OVERLAPPED, ACCEPT_OVERLAPPED, all
derived from the base OVERLAPPEDPLUS and serving UDP,
TCP and connection completion operations.
It should be noted that connection completion does not occur in the
connection listening socket but in the sockets created by it to service the
connection.
The question is how can we pass a pointer to our enhanced structure to a Microsoft routine that is expecting a pointer to an entirely different structure? It may be that the WSAOVERLAPPED field, being the first declared in the struct, is at offset zero, but it is not safe to rely on that assumption. We need to do this in a portable way that makes no assumptions about how the fields will be offset within the structure.
The
answer should be that we use a little known C and C++ macro, offsetof(),
to get the offest of the WSAOVERLAPPED field within the OVERLAPPEDPLUS structure. By adding or
subtracting the offset we can convert the address of one structure to the
corresponding address of the other.
Alternative to offsetof()
Unfortunately OVERLAPPEDPLUS
and its derived classes break the rules for offsetof() in that they
contains member functions and is not just a simple data object.
I found a neat solution in a
programmer’s blog. Unfortunately I have lost his name or I would credit him
here. He said that the rule is enforced because offsetof() works by
pretend placement of data at address zero and then working out the address of
the required field relative to the zero address placement of its containing
object. The compiler doesn’t like object members used on an object placed at
the invalid address of zero, so it issues an error. The work round is to devise our own version of offsetof(),
which places the object at non-zero address, 0x0000001, and works out the
offset of the required field from that non-zero address by subtracting 1 from
the field’s address:
#define fieldoffset(classname,fieldname) ((size_t)&(((classname*)1)->fieldname)-1)
2) It is a very bad idea to bind a client to a
specific port
This one caused us huge grief in our first test
program. We bound the client to a specific port when connecting to the server,
and were able to open and close the socket without a problem. But then we
couldn’t reconnect on the same port. Much time was spent looking for a software
bug that didn’t exist. Here’s the real reason for the problem: (quote) “When you close a TCP
connection, it goes into the TIME_WAIT state for a short period
(between 30 and 120 seconds, typically), during which you cannot reuse that
connection's "5-tuple:" the combination of {local host, local port,
remote host, remote port, transport protocol}. (This timeout period is a
feature of all correctly-written TCP/IP stacks, and is covered in RFC 793 and
especially RFC
1122.) In practical terms, this means that if you bind to a specific
port all the time, you cannot connect to the same host using the same remote
port until the TIME_WAIT period expires”
A very bad common mistake seen in the design of
C++ class hierarchies is to stuff the base class of the hierarchy full of too
much functionality. We know for sockets that we might need udp sockets, tcp
sockets, bound and unbound sockets, sockets that read and write data, sockets
that accept connections from other sockets, sockets that are simple clients
connecting to servers, and so on. It could be tempting to put all that
functionality in the base class -- after all, base classes are supposed to
provide code to assist their derived classes. Then in the base class member
functions we just test the type of the derived class and make decisions
accordingly: if the derived class is for a udp socket, do this, if it is for a
tcp socket, do that, and so on.
This way of designing class hierarchies in C++
is very bad for at least the following reasons.
1)
It breaks encapsulation. A base class should
not hold any knowledge of matters that are the responsibility of a derived
class. It should be as separate from its derived classes as it would be to
classes not related to it by inheritance. If you are going to have classes all
knowing about each other’s internal workings then you might as well not use C++
at all
2)
It effectively creates one giant base class
that knows everything there is to be known about the problem domain, with the
derived classes having only trivial functionality. Such code for a class is
very difficult to maintain because it is usually far too voluminous and
muddled; the opportunity for clarity and separation has been lost.
3)
You usually can’t derive new classes without
adding to the code in the base class that accommodates the existing special
cases. Alteration of existing code increases the likelihood of disrupting
existing code and extending the testing cycle unnecessarily.
By contrast our design splits the socket
functionality into a hierarchy of classes that each add progressive
functionality to their immediate base classes
In the code header examples below, we have
omitted a considerable amount of the private and protected sections of the
classes where these are just implementation detail. We focus only on the public
detail that is of direct interest to the user of the class.
Please
refer to the latest source as some of the detail below may have been further
refined since this document was written.
Notice how the public interface to the I/O
completion port functionality is very simple and straightforward. All the
implementation detail complexities discussed in this document are in private
and protected areas of the class declaration, which are not accessible to, and
therefore not under the control of, the user of the classes.
I
We start with a base class that provides
general-purpose functionality without anticipating precisely which classes will
be derived from it:
class WinSocket
{
protected:
WinSocket () ;
virtual ~WinSocket () ;
public:
// -------------- config --------------
void RequestBound ( unsigned short port = 0, ulong address = INADDR_ANY ) ;
// Call before opening the socket if it should be
bound
static void SetIoTimeout ( DWORD secs ) { ioTimeout =
secs ; }
// Sets seconds timeout for all send and receives in
the application
// --------------- info ---------------
bool SocketOpen();
// Returns whether socket is currently open or not
// -------------- action ---------------
virtual ErrorInfo * Open() ;
// socket must be opened before performing operations
virtual ErrorInfo * Close (bool destroy_on_close =
false) ;
// A non-blocking call
// If a socket has a pending i/o operation and
'destroy_on_close' is true
// then asynchronously destroys the socket after all i/o is completed
private:
virtual SOCKET WSAOpenSocket() = 0 ;
// open the socket and return a handle according to
socket type
} ;
Notes:
1.
The WinSocket class constructor is protected,
as are the constructors of all classes below which should be derived from. This
prevents the class being created directly, when that should not be allowed. The
class only has meaning as part of a derived class which fills in missing
functionality
2.
The destructor (unusually) is declared as protected
as well. As discussed above this prevents a WinSocket object or
derivative from being deliberately deleted. We want the object to destroy
itself asynchronously after all pending i/o has completed.
3.
We use protected rather than private
destructors so that the destructor is still callable from derived classes.
4.
We provide an abstract function WSAOpenSocket()
which must be defined in a derived class to specify how the socket is to be
opened. This is an example of the correct technique for separating out
knowledge between base and derived classes. The base class knows it needs an
opened socket handle, so this is provided as a function in the base class for
it to call internally. However it doesn’t know precisely how to create the
socket handle because only a derived class knows what type of socket it is. So the provided function is made abstract
within the base class forcing its derived classes to define the function to
return the correct socket type.
Next we specialise the base WinSocket class for
UDP type connections:
class UDPSocket : public
WinSocket
{
protected:
UDPSocket () ;
~UDPSocket () ;
public:
ErrorInfo * RequestSendTo ( const char * buffer,
DWORD bufsize,
SOCKADDR_IN * specified_address = NULL ) ;
// Requests item to be sent. A non-blocking call.
ErrorInfo * RequestRecvFrom ( DWORD buflen ) ;
// Requests a RecvFrom() into a given size of buffer
private:
virtual ErrorInfo * ProcessRecvFrom ( char * sender_addr, short
sender_port, DWORD
bytes_received, char * buffer ) = 0 ;
// Overload to process RecvFrom's. Assume info is
transitory, so copy it
virtual ErrorInfo * ProcessSendTo () { return NULL ;
}
// Optionally process the results of a successful
sendto
virtual ErrorInfo * ProcessSendToError (ErrorInfo *
err, bool timed_out ) = 0 ;
virtual ErrorInfo * ProcessRecvFromError (ErrorInfo*
err, bool timed_out) = 0 ;
virtual SOCKET WSAOpenSocket() ;
} ;
Notes:
1.
A derived WinSocket class that handles UDP type
connections.
2.
Still has a protected constructor because must
be derived from to add more functionality.
3.
Defines the function WSAOpenSocket()
needed by the base class, to return a udp socket handle.
4.
Provides new abstract virtual functions for
data and error processing. The user of these classes will need to define their
own derived class and implement these functions. Our base class can only detect
and inform of data received or sent and inform of any errors. It cannot know
how a specific application wishes to handle the errors
We also specialise the base WinSocket class for
TCP type connections:
class TCPSocket : public
WinSocket
{
protected:
TCPSocket () ;
~TCPSocket () ;
public:
ErrorInfo *
RequestSend ( const char * buffer,
DWORD bufsize ) ;
// Requests item to be sent. A non-blocking call.
ErrorInfo * RequestRecv ( DWORD buflen ) ;
// Requests data to be sent to the socket
private:
virtual ErrorInfo * ProcessRecv ( DWORD
bytes_received, char * buffer ) = 0 ;
// Overload to process Recv's on connected sockets.
// Info is transitory, so copy it.
virtual ErrorInfo * ProcessSend () { return NULL ; }
// Optionally process the results of a successful
send
virtual ErrorInfo * ProcessSendError ( ErrorInfo *
err, bool timed_out ) = 0 ;
virtual ErrorInfo * ProcessRecvError ( ErrorInfo *
err, bool timed_out ) = 0 ;
virtual SOCKET WSAOpenSocket() ;
} ;
Notes:
1.
A derived WinSocket class that handles TCP type
connections.
2.
Still has a protected constructor because does
not have complete functionality
3.
Defines the function WSAOpenSocket() needed by
the base class, to return a tcp socket handle.
4.
Provides new abstract virtual functions for
data and error processing. The user of these classes will need to define their
own derived class and implement these functions. Our base class can only detect
and inform of data received or sent and inform of any errors. It cannot know
how a specific application wishes to handle the errors
Next, we
provide two derived UDP sockets dedicated to handling the sending and receiving
of broadcast messages:
class BroadcastingSocket :
private UDPSocket
{
public:
BroadcastingSocket (ushort broadcast_port) ;
virtual ErrorInfo * Open() ; // open the socket
before broadcasting
virtual ErrorInfo * Broadcast ( char* buffer, ulong
buflen ) ;
;
Notes:
1.
A derived UDPSocket class that will broadcast a
message on a specified port
2.
Note the private derivation from the
base. We do this to make the public functions of the base classes inaccessible.
This gives us very tight control over what functionality from the base can be
called. Only new functions declared as public or base class functions redeclared
as public can be used. So in this example, the user could not call SendTo(),
which is a public member of the base class, they can only call Broadcast()
which implements the special conditions needed for broadcast functionality. In
other words we make it impossible to use the class wrongly by accident.
class ALVSOCK_DLL
BroadcastListeningSocket : private UDPSocket
{
protected:
BroadcastListeningSocket ( ushort broadcast_port ) ;
public:
virtual ErrorInfo * StartListening() ; // also opens
the socket
private:
virtual ErrorInfo * ProcessRecvFrom ( char *
sender_addr,
short sender_port, DWORD bytes_received, char * buffer ) = 0 ;
virtual ErrorInfo * ProcessRecvError ( ErrorInfo *
err, bool timed_out ) = 0 ;
} ;
Notes:
1.
A derived UDPSocket class that will listen for
a broadcast a message on a specified port and call a private virtual function
with the received message
2.
The user of this class will need to define
their own derived class and implement the virtual functions for handling
received data or errors.
Next is a derived TCPSocket class that can very
simply connect to a TCP server:
class ALVSOCK_DLL
ClientSocket : public TCPSocket
{
public:
ErrorInfo * RequestConnect ( const char *
dot_address,
unsigned short port, DWORD timeout = WSA_INFINITE ) ;
} ;
Notes:
1.
The user of this class will need to define
their own derived classes and implement the base class virtual functions for
handling received data or errors.
Finally we have two classes to be used together
to listen for incoming connections and create a derived TCPSocket object
representing each accepted connection:
class ALVSOCK_DLL
ConnectedClient : public TCPSocket
{
protected:
ConnectedClient () : TCPSocket() {}
private:
virtual ErrorInfo * ProcessRecv ( DWORD
bytes_received, char * buffer ) = 0 ;
virtual ErrorInfo * ProcessRecvError ( ErrorInfo *
err, bool timed_out ) = 0 ;
} ;
Notes:
1.
A derived TCPSocket class, representing an
incoming connection that has been accepted from a client
2.
The user of this class will need to define their
own derived class and implement the virtual functions for handling received
data or errors.
class
ConnectionListeningSocket : private TCPSocket
{
public:
ConnectionListeningSocket ( ushort
listening_port,
ulong address = INADDR_ANY ) ;
~ConnectionListeningSocket () ;
// This socket class SHOULD be deleted by the
programmer. Or it can be created on the stack.
// It does not self-delete
protected:
// ---------- configuration ---------------
void SetMinPendingAccepts( uint number ) ;
// Set the minimum number of pending accepts that should be kept in
reserve to // rapidly service connection requests
// Default is 20 if this is not called
void SetDOSTimeout( DWORD secs ) ;
// The timeout beyond which an accepted connection which has sent no
data is
// timed out, to prevent denial of service attacks
// Defaults to 2 minutes
// -------- action --------
virtual ErrorInfo * StartListening() ; // also opens
the socket
private:
virtual ConnectedClient * OnCreate () = 0 ;
// Override to create an unopened derived
ConnectedClient that will be
// used to make the data-receiving connection to the client.
// Delete the socket when finished with by calling
Close(true) on it
} ;
Notes:
1.
A derived TCPSocket class representing a server
socket that is listening for incoming
connections from a client
2.
For maximum efficiency sets up a pool of sockets in advance with AcceptEx.The
user of this class will need to define their own derived class and implement
the virtual function OnCreate() to create and return a derived ConnectedClient
class.
Two GUI-based test programs were written.
A server test program creates a server at a
known address and port, and can receive incoming connections and data. Any
received data it echoes back to the sender.
A client test program creates 1 .. n clients at
an ip and ports chosen by itself and fires off messages at a selectable time
interval to the server’s known i.p. address and port.
The server reports progress at one second
intervals, the client reports progress more frequently.
The server was run on Windows 64-bit XP
Professional on a dual-processor PC. Two clients were run simultaneously, both
on 32-bit machines, one running XP Home, the other running Vista Home. The
relevant test port was opened on the Windows firewall.
Testing has mainly been concerned with
examining basic functionality and removing bugs. At the time of writing,
metering of performance and identification of any bottlenecks has not been
done.
The UDP sockets have not been tested.
The Recv and Send timeout facility was tested
briefly early on in the development.
We noticed a small memory leak on the server
side in testing. We have overridden new and delete operators that can test for
memory leaks but have not had time to run these tests yet.
Sockets programming is very time and sequence
sensitive. Multi-threading, the use of threads with overlapping responsibilities,
the use of a Microsoft thread pool to which we have no direct access and the
high-speed and high-volume nature of the data flow can make certain problems
impossible to trap in the Visual Studio debugger. For these we used a simple
trace function writing to a text file to get the precise sequence of events.
The trace function used a critical section to protect against simultaneous
multi-threaded access. The actual traces have been removed from the code,
though the trace function remains.
This document started by describing the need
for a more efficient set of socket classes than I had used in the past, ones
that would be fully scaleble to the stringent requirements of an SMTP server.
The document moved on to a discussion of the
theory behind IO Completion ports followed by some notes on how these ports can
be used very easily by calling just a handful of Win32 library functions.
The document then discussed issues such as
handling timeouts and asynchronous socket closure that make IO completion port
programming much less simple than it might appear to be at first sight.
The document concluded by setting out the need
to simplify and clarify C++ class design by clearly separating out specialised
functionalities into derived classes and resisting the temptation to
incorrectly cram all functionality into a single base class.
The described socket classes provide a
lightweight and straightforward interface to Windows IO completion port
functionality, with all technicalities relating to io completion ports hidden
away in the private implementation of the classes. Someone who knew almost
nothing about I/O completion ports could use these classes with confidence. The
only rule they need to be aware of is to try to avoid making thread-blocking
calls (such as WaitForSingleObject() during their handler functions for
received data and errors, as blocking calls adversely affect the efficiency of
io completion port threads.