Neural Network for Recognition of Handwritten Digits

By  | 6 Dec 2006 | Article
A convolutional neural network achieves 99.26% accuracy on a modified NIST database of hand-written digits.
 Prize winner in Competition "MFC/C++ Nov 2006"

Graphical view of the neural network



This article chronicles the development of an artificial neural network designed to recognize handwritten digits. Although some theory of neural networks is given here, it would be better if you already understood some neural network concepts, like neurons, layers, weights, and backpropagation.

The neural network described here is not a general-purpose neural network, and it's not some kind of a neural network workbench. Rather, we will focus on one very specific neural network (a five-layer convolutional neural network) built for one very specific purpose (to recognize handwritten digits).

The idea of using neural networks for the purpose of recognizing handwritten digits is not a new one. The inspiration for the architecture described here comes from articles written by two separate authors. The first is Dr. Yann LeCun, who was an independent discoverer of the basic backpropagation algorithm. Dr. LeCun hosts an excellent site on his research into neural networks. In particular, you should view his "Learning and Visual Perception" section, which uses animated GIFs to show results of his research. The MNIST database (which provides the database of handwritten digits) was developed by him. I used two of his publications as primary source materials for much of my work, and I highly recommend reading his other publications too (they're posted at his site). Unlike many other publications on neural networks, Dr. LeCun's publications are not inordinately theoretical and math-intensive; rather, they are extremely readable, and provide practical insights and explanations. His articles and publications can be found here. Here are the two publications that I relied on:

The second author is Dr. Patrice Simard, a former collaborator with Dr. LeCun when they both worked at AT&T Laboratories. Dr. Simard is now a researcher at Microsoft's "Document Processing and Understanding" group. His articles and publications can be found here, and the publication that I relied on is:

One of my goals here was to reproduce the accuracy achieved by Dr. LeCun, who was able to train his neural network to achieve 99.18% accuracy (i.e., an error rate of only 0.82%). This error rate served as a type of "benchmark", guiding my work.

As a final introductory note, I'm not overly proud of the source code, which is most definitely an engineering work-in-progress. I started out with good intentions, to make source code that was flexible and easy to understand and to change. As things progressed, the code started to turn ugly. I began to write code simply to get the job done, sometimes at the expense of clean code and comprehensibility. To add to the mix, I was also experimenting with different ideas, some of which worked and some of which did not. As I removed the failed ideas, I did not always back out all the changes and there are therefore some dangling stubs and dead ends. I contemplated the possibility of not releasing the code. But that was one of my criticisms of the articles I read: none of them included code. So, with trepidation and the recognition that the code is easy to criticize and could really use a re-write, here it is.


Some Neural Network Theory

This is not a neural network tutorial, but to understand the code and the names of the variables used in it, it helps to see some neural network basics.

The following discussion is not completely general. It considers only feed-forward neural networks, that is, neural networks composed of multiple layers, in which each layer of neurons feeds only the very next layer of neurons, and receives input only from the immediately preceding layer of neurons. In other words, the neurons don't skip layers.

Consider a neural network that is composed of multiple layers, with multiple neurons in each layer. Focus on one neuron in layer n, namely the i-th neuron. This neuron gets its inputs from the outputs of neurons in the previous layer, plus a bias whose value is one ("1"). I use the variable "x" to refer to outputs of neurons. The i-th neuron applies a weight to each of its inputs, and then adds the weighted inputs together so as to obtain something called the "activation value". I use the variable "y" to refer to activation values. The i-th neuron then calculates its output value "x" by applying an "activation function" to the activation value. I use the notation "F()" to refer to the activation function. The activation function is sometimes referred to as a "Sigmoid" function, a "Squashing" function and other names, since its primary purpose is to limit the output of the neuron to some reasonable range like a range of -1 to +1, and thereby inject some degree of non-linearity into the network. Here's a diagram of a small part of the neural network; remember to focus on the i-th neuron in layer n:

General diagram of a neuron in a neural network

This is what each variable means:

$x_n^i$ is the output of the i-th neuron in layer n.

$x_{n-1}^j$ is the output of the j-th neuron in layer n-1.

$x_{n-1}^k$ is the output of the k-th neuron in layer n-1.

$w_n^{ij}$ is the weight that the i-th neuron in layer n applies to the output of the j-th neuron from layer n-1 (i.e., the previous layer). In other words, it's the weight from the output of the j-th neuron in the previous layer to the i-th neuron in the current (n-th) layer.

$w_n^{ik}$ is the weight that the i-th neuron in layer n applies to the output of the k-th neuron in layer n-1.

$$x_n^i = F(y_n^i) = F\left( \sum_{j} w_n^{ij} \, x_{n-1}^j \right)$$

is the general feed-forward equation, where F() is the activation function and the sum runs over all inputs of the neuron, including the bias input whose value is 1. We will discuss the activation function in more detail in a moment.
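To make the feed-forward equation concrete, here is a minimal sketch of the calculation for a single neuron. The function name neuronOutput and the convention that weights[0] holds the bias weight are assumptions made only for this illustration, and plain tanh stands in for F(); none of this is taken from the program's source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the article's code): output of one neuron.
// weights[0] is the bias weight; weights[j + 1] multiplies prevOutputs[j].
double neuronOutput(const std::vector<double>& weights,
                    const std::vector<double>& prevOutputs)
{
    assert(weights.size() == prevOutputs.size() + 1);

    double y = weights[0];  // the bias input is the constant 1
    for (std::size_t j = 0; j < prevOutputs.size(); ++j)
        y += weights[j + 1] * prevOutputs[j];  // accumulate the activation value y

    return std::tanh(y);  // x = F(y), with tanh assumed as the activation function
}
```

For example, with a bias weight of 0, weights of 1 and -1, and two inputs of 0.5 each, the activation value is 0 and so is the output.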

How does this translate into code and C++ classes? The way I saw it, the above diagram suggested that a neural network is composed of objects of four different classes: layers, neurons in the layers, connections from neurons in one layer to those in another layer, and weights that are applied to connections. Those four classes are reflected in the code, together with a fifth class -- the neural network itself -- which acts as a container for all other objects and which serves as the main interface with the outside world. Here's a simplified view of the classes. Note that the code makes heavy use of std::vector, particularly std::vector< double >:

// simplified view: some members have been omitted,
// and some signatures have been altered

// helpful typedef's

typedef std::vector< NNLayer* >  VectorLayers;
typedef std::vector< NNWeight* >  VectorWeights;
typedef std::vector< NNNeuron* >  VectorNeurons;
typedef std::vector< NNConnection > VectorConnections;

// Neural Network class

class NeuralNetwork
{
public:
    virtual ~NeuralNetwork();
    void Calculate( double* inputVector, UINT iCount, 
        double* outputVector = NULL, UINT oCount = 0 );

    void Backpropagate( double *actualOutput, 
         double *desiredOutput, UINT count );

    VectorLayers m_Layers;
};

// Layer class

class NNLayer
{
public:
    NNLayer( LPCTSTR str, NNLayer* pPrev = NULL );
    virtual ~NNLayer();
    void Calculate();
    void Backpropagate( std::vector< double >& dErr_wrt_dXn /* in */, 
        std::vector< double >& dErr_wrt_dXnm1 /* out */, 
        double etaLearningRate );

    NNLayer* m_pPrevLayer;
    VectorNeurons m_Neurons;
    VectorWeights m_Weights;
};

// Neuron class

class NNNeuron
{
public:
    NNNeuron( LPCTSTR str );
    virtual ~NNNeuron();

    void AddConnection( UINT iNeuron, UINT iWeight );
    void AddConnection( NNConnection const & conn );

    double output;

    VectorConnections m_Connections;
};

// Connection class

class NNConnection
{
public:
    NNConnection(UINT neuron = ULONG_MAX, UINT weight = ULONG_MAX);
    virtual ~NNConnection();

    UINT NeuronIndex;
    UINT WeightIndex;
};

// Weight class

class NNWeight
{
public:
    NNWeight( LPCTSTR str, double val = 0.0 );
    virtual ~NNWeight();

    double value;
};

As you can see from the above, class NeuralNetwork stores a vector of pointers to layers in the neural network, which are represented by class NNLayer. There is no special function to add a layer (there probably should be one); simply use the std::vector::push_back() function. The NeuralNetwork class also provides the two primary interfaces with the outside world, namely, a function to forward propagate the neural network (the Calculate() function) and a function to Backpropagate() the neural network so as to train it.

Each NNLayer stores a pointer to the previous layer, so that it knows where to look for its input values. In addition, it stores a vector of pointers to the neurons in the layer, represented by class NNNeuron, and a vector of pointers to weights, represented by class NNWeight. Similar to the NeuralNetwork class, the pointers to the neurons and to the weights are added using the std::vector::push_back() function. Finally, the NNLayer class includes functions to Calculate() the output values of neurons in the layer, and to Backpropagate() them; in fact, the corresponding functions in the NeuralNetwork class simply iterate through all layers in the network and call these functions.

Each NNNeuron stores a vector of connections that tell the neuron where to get its inputs. Connections are added using the NNNeuron::AddConnection() function, which takes an index to a neuron and an index to a weight, constructs an NNConnection object, and push_back()'s the new connection onto the vector of connections. Each neuron also stores its own output value, even though it's the NNLayer class that is responsible for calculating the actual value of the output and storing it there. The NNConnection and NNWeight classes respectively store obviously-labeled information.

One legitimate question about the class structure is, why are there separate classes for the weights and the connections? According to the diagram above, each connection has a weight, so why not put them in the same class? The answer lies in the fact that weights are often shared between connections. In fact, the convolutional neural network of this program specifically shares weights amongst its connections. So, for example, even though there might be several hundred neurons in a layer, there might only be a few dozen weights due to sharing. By making the NNWeight class separate from the NNConnection class, this sharing is more readily accomplished.
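The effect of keeping weights separate from connections can be sketched in a few lines. The Connection struct and the weightedSum function below are simplified stand-ins invented for this illustration (the program's NNConnection and NNWeight classes carry more members); the point is only that a connection stores an index to a weight, so several connections can refer to the same weight.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified, hypothetical stand-in for NNConnection: two indices only.
struct Connection
{
    std::size_t neuronIndex;  // which neuron of the previous layer feeds this connection
    std::size_t weightIndex;  // which entry of the layer's weight vector it uses
};

// Weighted sum for one neuron. Each connection looks up its weight by index,
// so many connections (in this neuron or in others) can share one weight.
double weightedSum(const std::vector<Connection>& connections,
                   const std::vector<double>& weights,
                   const std::vector<double>& prevOutputs)
{
    double sum = 0.0;
    for (const Connection& c : connections)
    {
        assert(c.weightIndex < weights.size());
        assert(c.neuronIndex < prevOutputs.size());
        sum += weights[c.weightIndex] * prevOutputs[c.neuronIndex];
    }
    return sum;
}
```

If two neurons each hold a connection with the same weightIndex, changing that single weight changes the output of both neurons, which is the behaviour a shared convolution kernel relies on.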


Forward Propagation

Forward propagation is the process whereby each neuron calculates its output value, based on inputs provided by the output values of the neurons that feed it.

In the code, the process is initiated by calling NeuralNetwork::Calculate(). NeuralNetwork::Calculate() directly sets the values of neurons in the input layer, and then iterates through the remaining layers, calling each layer's NNLayer::Calculate() function. This results in a forward propagation that's completely sequential, starting from neurons in the input layer and progressing through to the neurons in the output layer. A sequential calculation is not the only way to forward propagate, but it's the most straightforward. Here's simplified code, which takes a pointer to a C-style array of doubles representing the input to the neural network, and stores the output of the neural network to another C-style array of doubles:

// simplified code

void NeuralNetwork::Calculate(double* inputVector, UINT iCount, 
               double* outputVector /* =NULL */, 
               UINT oCount /* =0 */)
{
    VectorLayers::iterator lit = m_Layers.begin();
    VectorNeurons::iterator nit;

    // first layer is input layer: directly
    // set outputs of all of its neurons
    // to the given input vector
    if ( lit < m_Layers.end() )  
    {
        nit = (*lit)->m_Neurons.begin();
        UINT count = 0;

        ASSERT( iCount == (*lit)->m_Neurons.size() );
        // there should be exactly one neuron per input

        while( ( nit < (*lit)->m_Neurons.end() ) && ( count < iCount ) )
        {
            (*nit)->output = inputVector[ count ];
            nit++;
            count++;
        }
    }

    // iterate through remaining layers,
    // calling their Calculate() functions
    for( lit++; lit<m_Layers.end(); lit++ )
    {
        (*lit)->Calculate();
    }

    // load up output vector with results
    if ( outputVector != NULL )
    {
        // end() points one past the last layer, so step back to it
        lit = m_Layers.end();
        lit--;

        nit = (*lit)->m_Neurons.begin();
        for ( UINT ii=0; ii<oCount; ++ii )
        {
            outputVector[ ii ] = (*nit)->output;
            nit++;
        }
    }
}

Inside the layer's Calculate() function, the layer iterates through all neurons in the layer, and for each neuron the output is calculated according to the feed-forward formula given above, namely:

$$x_n^i = F(y_n^i) = F\left( \sum_{j} w_n^{ij} \, x_{n-1}^j \right)$$

This formula is applied by iterating through all connections for the neuron, and for each connection, obtaining the corresponding weight and the corresponding output value from a neuron in the previous layer:

// simplified code

void NNLayer::Calculate()
{
    ASSERT( m_pPrevLayer != NULL );

    VectorNeurons::iterator nit;
    VectorConnections::iterator cit;
    double dSum;

    for( nit=m_Neurons.begin(); nit<m_Neurons.end(); nit++ )
    {
        NNNeuron& n = *(*nit);  // to ease the terminology

        cit = n.m_Connections.begin();
        ASSERT( (*cit).WeightIndex < m_Weights.size() );

        // weight of the first connection is the bias;
        // its neuron-index is ignored
        dSum = m_Weights[ (*cit).WeightIndex ]->value;  

        for ( cit++ ; cit<n.m_Connections.end(); cit++ )
        {
            ASSERT( (*cit).WeightIndex < m_Weights.size() );
            ASSERT( (*cit).NeuronIndex < 
                     m_pPrevLayer->m_Neurons.size() );

            dSum += ( m_Weights[ (*cit).WeightIndex ]->value ) * 
                ( m_pPrevLayer->m_Neurons[ 
                   (*cit).NeuronIndex ]->output );
        }

        n.output = SIGMOID( dSum );
    }
}

In this code, SIGMOID is #defined to the activation function, which is described in the next section.


The Activation Function (or, "Sigmoid" or "Squashing" Function)

Selection of a good activation function is an important part of the design of a neural network. Generally speaking, the activation function should be symmetric, and the neural network should be trained to a value that is lower than the limits of the function.

One function that should never be used as the activation function is the classical sigmoid function (or "logistic" function), defined as $f(x) = \frac{1}{1 + e^{-x}}$. It should never be used since it is not symmetric: its value approaches +1 for increasing x, but for decreasing x its value approaches zero (i.e., it does not approach -1 which it should for symmetry). The reason the logistic function is even mentioned here is that there are many articles on the web that recommend its use in neural networks, for example, Sigmoid function in the Wikipedia. In my view, this is a poor recommendation and should be avoided.

One good selection for the activation function is the hyperbolic tangent, or $F(y) = \tanh(y)$. This function is a good choice because it's completely symmetric, as shown in the following graph. If used, then do not train the neural network to ±1.0. Instead, choose an intermediate value, like ±0.8.

Graph of hyperbolic tangent

Another reason why hyperbolic tangent is a good choice is that it's easy to obtain its derivative. Yes, sorry, but a bit of calculus is needed in neural networks. Not only is it easy to obtain its derivative, but also the value of derivative can be expressed in terms of the output value (i.e., as opposed to the input value). More specifically, given that:

$x = F(y) = \tanh(y) = \frac{\sinh(y)}{\cosh(y)}$, where (using the notation established above) y is the input to the function (corresponding to the activation value of a neuron) and x is the output of the neuron.

Then $\frac{dF}{dy} = \frac{d}{dy}\left( \frac{\sinh(y)}{\cosh(y)} \right) = \frac{\cosh(y)\cosh(y) - \sinh(y)\sinh(y)}{\cosh^2(y)}$

Which simplifies to $\frac{dF}{dy} = 1 - \tanh^2(y)$

Or, since $x = \tanh(y)$, the result is $\frac{dF}{dy} = 1 - x^2$. This result means that we can calculate the value of the derivative of F() given only the output of the function, without any knowledge of the input. In this article, we will refer to the derivative of the activation function as G(x).
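The claim that the derivative can be computed from the output alone is easy to check numerically. The sketch below compares G(x) = 1 - x^2, evaluated at x = F(y), against a central finite-difference estimate of dF/dy. The function names are chosen for this illustration only and do not appear in the program.

```cpp
#include <cassert>
#include <cmath>

// F(y) = tanh(y), the activation function discussed above.
double F(double y) { return std::tanh(y); }

// G(x) = 1 - x^2: the derivative of F, expressed in terms of the OUTPUT x = F(y).
double G(double x) { return 1.0 - x * x; }

// Central finite-difference estimate of dF/dy, used only as a cross-check.
double numericalDerivative(double y, double h = 1e-6)
{
    assert(h > 0.0);
    return (F(y + h) - F(y - h)) / (2.0 * h);
}
```

For any activation value y, G(F(y)) and numericalDerivative(y) should agree to within the accuracy of the finite-difference estimate.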

The activation function used in the code is a scaled version of the hyperbolic tangent. It was chosen based on a recommendation in one of Dr. LeCun's articles. Scaling causes the function to vary between ±1.7159, and permits us to train the network to values of ±1.0.
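The exact scaled function is not written out in this section. A commonly cited form that matches the ±1.7159 range is F(y) = 1.7159 tanh(2y/3); the sketch below assumes that form, and in particular the input scaling of 2/3 is an assumption rather than something stated here. With that assumption, the derivative can again be written in terms of the output x, as (2/3)/1.7159 × (1.7159 - x)(1.7159 + x).

```cpp
#include <cassert>
#include <cmath>

const double kAmplitude = 1.7159;     // outputs lie between -1.7159 and +1.7159
const double kSlope     = 2.0 / 3.0;  // ASSUMED input scaling, not stated in the text

// Scaled hyperbolic tangent: F(y) = A * tanh(s * y).
double scaledTanh(double y)
{
    return kAmplitude * std::tanh(kSlope * y);
}

// Derivative of the scaled tanh in terms of its output x:
// dF/dy = A * s * (1 - tanh^2(s*y)) = (s / A) * (A - x) * (A + x).
double scaledTanhDerivativeFromOutput(double x)
{
    assert(std::fabs(x) <= kAmplitude);
    return kSlope / kAmplitude * (kAmplitude - x) * (kAmplitude + x);
}
```

Under this assumed form, an activation value of ±1 maps to an output very close to ±1, which is consistent with training the network to target values of ±1.0 while staying well inside the limits of the function.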



Backpropagation

Backpropagation is an iterative process that starts with the last layer and moves backwards through the layers until the first layer is reached. Assume that for each layer, we know the error in the output of the layer. If we know the error of the output, then it is not hard to calculate changes for the weights, so as to reduce that error. The problem is that we can only observe the error in the output of the very last layer.

Backpropagation gives us a way to determine the error in the output of a prior layer given the error in the output of the current layer. The process is therefore iterative: start at the last layer and calculate the change in the weights for the last layer. Then calculate the error in the output of the prior layer. Repeat.

The backpropagation equations are given below. My purpose in showing you the equations is so that you can find them in the code, and understand them. For example, equation (3) shows how to calculate the partial derivative of the error $E_n^P$ with respect to the activation value $y_n^i$ at the n-th layer. In the code, you will see a variable named dErr_wrt_dYn[ ii ].

Start the process off by computing the partial derivative of the error due to a single input image pattern with respect to the outputs of the neurons on the last layer. The error due to a single pattern is calculated as follows:

$$E_n^P = \frac{1}{2} \sum_{i} \left( x_n^i - T_n^i \right)^2$$ (equation 1)

where:

$E_n^P$ is the error due to a single pattern P at the last layer n;

$T_n^i$ is the target output at the last layer (i.e., the desired output at the last layer); and

$x_n^i$ is the actual value of the output at the last layer.

Given equation (1), then taking the partial derivative yields:

$$\frac{\partial E_n^P}{\partial x_n^i} = x_n^i - T_n^i$$ (equation 2)

Equation (2) gives us a starting value for the backpropagation process. We use the numeric values for the quantities on the right side of equation (2) in order to calculate numeric values for the derivative. Using the numeric values of the derivative, we calculate the numeric values for the changes in the weights, by applying the following two equations (3) and then (4):

$$\frac{\partial E_n^P}{\partial y_n^i} = G(x_n^i) \cdot \frac{\partial E_n^P}{\partial x_n^i}$$ (equation 3)

where $G(x_n^i)$ is the derivative of the activation function.

$$\frac{\partial E_n^P}{\partial w_n^{ij}} = x_{n-1}^j \cdot \frac{\partial E_n^P}{\partial y_n^i}$$ (equation 4)

Then, using equation (2) again and also equation (3), we calculate the error for the previous layer, using the following equation (5):

$$\frac{\partial E_{n-1}^P}{\partial x_{n-1}^k} = \sum_{i} w_n^{ik} \cdot \frac{\partial E_n^P}{\partial y_n^i}$$ (equation 5)

The values we obtain from equation (5) are used as starting values for the calculations on the immediately preceding layer. This is the single most important point in understanding backpropagation. In other words, we take the numeric values obtained from equation (5), and use them in a repetition of equations (3), (4) and (5) for the immediately preceding layer.

Meanwhile, the values from equation (4) tell us how much to change the weights in the current layer n, which was the whole purpose of this gigantic exercise. In particular, we update the value of each weight according to the formula:

$$\left( w_n^{ij} \right)_{new} = \left( w_n^{ij} \right)_{old} - \eta \cdot \frac{\partial E_n^P}{\partial w_n^{ij}}$$ (equation 6)

where $\eta$ (eta) is the "learning rate", typically a small number like 0.0005 that is gradually decreased during training.
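As a worked illustration of equations (1) through (4) and (6), here is a minimal sketch of one training step for a single neuron in the last layer. The function name trainStep, the use of plain tanh as the activation function, and the convention that weights[0] is the bias weight are all assumptions made for this example. Equation (5) does not appear because there is no earlier layer to pass the error back to.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: one gradient-descent step for a single tanh neuron.
// weights[0] is the bias weight; weights[j + 1] applies to inputs[j].
// Returns the pattern error of equation (1), measured BEFORE the update.
double trainStep(std::vector<double>& weights,
                 const std::vector<double>& inputs,
                 double target, double eta)
{
    assert(weights.size() == inputs.size() + 1);

    // forward pass: y is the activation value, x is the neuron's output
    double y = weights[0];
    for (std::size_t j = 0; j < inputs.size(); ++j)
        y += weights[j + 1] * inputs[j];
    const double x = std::tanh(y);

    const double error   = 0.5 * (x - target) * (x - target);  // equation (1)
    const double dErr_dX = x - target;                         // equation (2)
    const double dErr_dY = (1.0 - x * x) * dErr_dX;            // equation (3), G(x) = 1 - x^2

    // equation (4) gives dErr/dW = input * dErr_dY; equation (6) applies the update
    weights[0] -= eta * dErr_dY * 1.0;  // the bias input is the constant 1
    for (std::size_t j = 0; j < inputs.size(); ++j)
        weights[j + 1] -= eta * dErr_dY * inputs[j];

    return error;
}
```

Calling trainStep() repeatedly on the same pattern with a small eta should return a steadily shrinking error, which is a quick sanity check that the signs in equations (2) and (6) agree.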

In the code, these equations are implemented by calling NeuralNetwork::Backpropagate(). The input to the NeuralNetwork::Backpropagate() function is the actual output of the neural network and the desired output. Using these two inputs, the NeuralNetwork::Backpropagate() function calculates the value of equation (2). It then iterates through all layers in the network, starting from the last layer and proceeding backwards toward the first layer. For each layer, the layer's NNLayer::Backpropagate() function is called. The input to NNLayer::Backpropagate() is the derivative of the error with respect to the layer's outputs, and the output is the value of equation (5). These derivatives are all stored in a vector of a vector of doubles named differentials. The output from one layer is then fed as the input to the next preceding layer:

// simplified code
void NeuralNetwork::Backpropagate(double *actualOutput, 
     double *desiredOutput, UINT count)
{
    // Backpropagates through the neural net
    // Proceed from the last layer to the first, iteratively
    // We calculate the last layer separately, and first,
    // since it provides the needed derivative
    // (i.e., dErr_wrt_dXnm1) for the previous layers
    // nomenclature:
    // Err is output error of the entire neural net
    // Xn is the output vector on the n-th layer
    // Xnm1 is the output vector of the previous layer
    // Wn is the vector of weights of the n-th layer
    // Yn is the activation value of the n-th layer,
    // i.e., the weighted sum of inputs BEFORE 
    //    the squashing function is applied
    // F is the squashing function: Xn = F(Yn)
    // F' is the derivative of the squashing function
    //   Conveniently, for F = tanh,
    //   then F'(Yn) = 1 - Xn^2, i.e., the derivative can be 
    //   calculated from the output, without knowledge of the input
    VectorLayers::iterator lit = m_Layers.end() - 1;
    std::vector< double > dErr_wrt_dXlast( (*lit)->m_Neurons.size() );
    std::vector< std::vector< double > > differentials;
    int iSize = m_Layers.size();
    differentials.resize( iSize );
    int ii;

    // start the process by calculating dErr_wrt_dXn for the last layer.
    // for the standard MSE Err function
    // (i.e., 0.5*sumof( (actual-target)^2 ) ), this differential is simply
    // the difference between the actual and the target
    for ( ii=0; ii<(*lit)->m_Neurons.size(); ++ii )
    {
        dErr_wrt_dXlast[ ii ] = 
            actualOutput[ ii ] - desiredOutput[ ii ];
    }

    // store Xlast and reserve memory for
    // the remaining vectors stored in differentials
    differentials[ iSize-1 ] = dErr_wrt_dXlast;  // last one
    for ( ii=0; ii<iSize-1; ++ii )
    {
        differentials[ ii ].resize( 
             m_Layers[ii]->m_Neurons.size(), 0.0 );
    }

    // now iterate through all layers including
    // the last but excluding the first, and ask each of
    // them to backpropagate error and adjust
    // their weights, and to return the differential
    // dErr_wrt_dXnm1 for use as the input value
    // of dErr_wrt_dXn for the next iterated layer
    ii = iSize - 1;
    for ( ; lit>m_Layers.begin(); lit--)
    {
        (*lit)->Backpropagate( differentials[ ii ], 
              differentials[ ii - 1 ], m_etaLearningRate );
        --ii;  // keep the index in step with the layer iterator
    }
}

Inside the NNLayer::Backpropagate() function, the layer implements equations (3) through (5) in order to determine the derivative for use by the next preceding layer. It then implements equation (6) in order to update the weights in its layer. In the following code, the derivative of the activation function G(x) is #defined as DSIGMOID:

// simplified code

void NNLayer::Backpropagate( std::vector< double >& dErr_wrt_dXn /* in */, 
                            std::vector< double >& dErr_wrt_dXnm1 /* out */, 
                            double etaLearningRate )
{
    // local working variables
    UINT ii, jj, kk, nIndex;
    double output, oldValue, newValue;
    VectorNeurons::iterator nit;
    VectorConnections::iterator cit;
    std::vector< double > dErr_wrt_dYn( m_Neurons.size() );
    std::vector< double > dErr_wrt_dWn( m_Weights.size(), 0.0 );

    // calculate equation (3): dErr_wrt_dYn = F'(Yn) * dErr_wrt_Xn
    for ( ii=0; ii<m_Neurons.size(); ++ii )
    {
        output = m_Neurons[ ii ]->output;
        dErr_wrt_dYn[ ii ] = DSIGMOID( output ) * dErr_wrt_dXn[ ii ];
    }

    // calculate equation (4): dErr_wrt_Wn = Xnm1 * dErr_wrt_Yn
    // For each neuron in this layer, go through
    // the list of connections from the prior layer, and
    // update the differential for the corresponding weight
    ii = 0;
    for ( nit=m_Neurons.begin(); nit<m_Neurons.end(); nit++ )
    {
        NNNeuron& n = *(*nit);  // for simplifying the terminology
        for ( cit=n.m_Connections.begin(); cit<n.m_Connections.end(); cit++ )
        {
            kk = (*cit).NeuronIndex;
            if ( kk == ULONG_MAX )
            {
                output = 1.0;  // this is the bias weight
            }
            else
            {
                output = m_pPrevLayer->m_Neurons[ kk ]->output;
            }
            dErr_wrt_dWn[ (*cit).WeightIndex ] += dErr_wrt_dYn[ ii ] * output;
        }
        ii++;  // ii tracks the neuron iterator
    }

    // calculate equation (5): dErr_wrt_Xnm1 = Wn * dErr_wrt_dYn,
    // which is needed as the input value of
    // dErr_wrt_Xn for backpropagation of the next (i.e., previous) layer
    // For each neuron in this layer
    ii = 0;
    for ( nit=m_Neurons.begin(); nit<m_Neurons.end(); nit++ )
    {
        NNNeuron& n = *(*nit);  // for simplifying the terminology
        for ( cit=n.m_Connections.begin(); 
              cit<n.m_Connections.end(); cit++ )
        {
            kk = (*cit).NeuronIndex;
            if ( kk != ULONG_MAX )
            {
                // we exclude ULONG_MAX, which signifies
                // the phantom bias neuron with
                // constant output of "1",
                // since we cannot train the bias neuron
                nIndex = kk;
                dErr_wrt_dXnm1[ nIndex ] += dErr_wrt_dYn[ ii ] * 
                       m_Weights[ (*cit).WeightIndex ]->value;
            }
        }
        ii++;  // ii tracks the neuron iterator
    }

    // calculate equation (6): update the weights
    // in this layer using dErr_wrt_dW (from 
    // equation (4)) and the learning rate eta
    for ( jj=0; jj<m_Weights.size(); ++jj )
    {
        oldValue = m_Weights[ jj ]->value;
        newValue = oldValue - etaLearningRate * dErr_wrt_dWn[ jj ];
        m_Weights[ jj ]->value = newValue;
    }
}


Second Order Methods

All second order techniques have one goal in mind: to increase the speed with which backpropagation converges to optimal weights. All second order techniques (at least in principle) accomplish this in the same fundamental way: by adjusting each weight differently, e.g., by applying a learning rate eta that differs for each individual weight.

In his "Efficient BackProp" article, Dr. LeCun proposes a second order technique that he calls the "stochastic diagonal Levenberg-Marquardt method". He compares the performance of this method with a "carefully tuned stochastic gradient algorithm", which is an algorithm that does not rely on second order techniques, but which does apply different learning rates to each individual weight. According to his comparisons, he concludes that "the additional cost [of stochastic diagonal Levenberg-Marquardt] over regular backpropagation is negligible and convergence is - as a rule of thumb - about three times faster than a carefully tuned stochastic gradient algorithm." (See page 35 of the article.)

It was clear to me that I needed a second order algorithm. Convergence without it was tediously slow. Dr. Simard, in his article titled "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis," indicated that he wanted to keep his algorithm as simple as possible and therefore did not use second order techniques. He also admitted that he required hundreds of epochs for convergence (my guess is that he required a few thousand).

With the MNIST database, each epoch requires 60,000 backpropagations, and on my computer each epoch took around 40 minutes. I did not have the patience (or confidence in the correctness of the code) to wait for thousands of epochs. It was also clear that, unlike Dr. LeCun, I did not have the skill to design "a carefully tuned stochastic gradient algorithm". So, in keeping with the advice that stochastic diagonal Levenberg-Marquardt would be around three times faster than that anyway, my neural network implements this second order technique.

I will not go into the math or the code for the stochastic diagonal Levenberg-Marquardt algorithm. It's actually not too dissimilar from standard backpropagation. Using this technique, I was able to achieve good convergence in around 20-25 epochs. In my mind this was terrific for two reasons. First, it increased my confidence that the network was performing correctly, since Dr. LeCun also reported convergence in around 20 epochs. Second, at 40 minutes per epoch, the network converged in around 14-16 hours, which is a palatable amount of time for an overnight run.

If you have the inclination to inspect the code on this point, the functions you want to focus on are named CMNistDoc::CalculateHessian() (which is in the document class - yes, the program is an MFC doc/view program), and NeuralNetwork::BackpropagateSecondDervatives(). In addition, you should note that the NNWeight class includes a double member that was not mentioned in the simplified view above. This member is named "diagHessian", and it stores the curvature (in weight space) that is calculated by Dr. LeCun's algorithm. Basically, when CMNistDoc::CalculateHessian() is called, 500 MNIST patterns are selected at random. For each pattern, the NeuralNetwork::BackpropagateSecondDervatives() function calculates the Hessian for each weight caused by the pattern, and this number is accumulated in diagHessian. After the 500 patterns are run, the value in diagHessian is divided by 500, which results in a unique value of diagHessian for each and every weight. During actual backpropagation, the diagHessian value is used to scale the current learning rate eta, such that in highly curved areas in weight space, the learning rate is slowed, whereas in flat areas in weight space, the learning rate is amplified.
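One common way to turn a curvature estimate into a per-weight learning rate is to divide the global rate by the curvature plus a small constant. The sketch below shows only that idea. The function name perWeightLearningRate, the parameter mu, and the exact formula are assumptions for illustration; they are not taken from the program's source.

```cpp
#include <cassert>

// Hypothetical helper: learning rate for one weight.
//   eta         - global learning rate
//   diagHessian - averaged curvature estimate for this weight (non-negative)
//   mu          - small positive constant (an assumed parameter) that keeps
//                 the rate bounded when the curvature is close to zero
double perWeightLearningRate(double eta, double diagHessian, double mu)
{
    assert(eta > 0.0 && mu > 0.0 && diagHessian >= 0.0);
    return eta / (diagHessian + mu);
}
```

With this form, a weight sitting in a flat region (curvature near zero) receives a rate of roughly eta / mu, while a weight in a highly curved region receives a much smaller rate, which is the qualitative behaviour described above.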


Structure of the Convolutional Neural Network

As indicated above, the program does not implement a generalized neural network, and is not a neural network workbench. Rather, it is a very specific neural network, namely, a five-layer convolutional neural network. The input layer takes grayscale data of a 29x29 image of a handwritten digit, and the output layer is composed of ten neurons of which exactly one neuron has a value of +1 corresponding to the answer (hopefully) while all other nine neurons have an output of -1.

Convolutional neural networks are also known as "shared weight" neural networks. The idea is that a small kernel window is moved over neurons from a prior layer. In this network, I use a kernel sized to 5x5 elements. Each element in the 5x5 kernel window has a weight independent of that of another element, so there are 25 weights (plus one additional weight for the bias term). This kernel is shared across all elements in the prior layer, hence the name "shared weight". A more detailed explanation follows.


Illustration and General Description

Here is an illustration of the neural network:

[Figure: Illustration of the Neural Network. Input Layer; Layer #1: 6 feature maps, each 13x13; Layer #2: 50 feature maps, each 5x5; Layer #3: fully connected, 100 neurons; Layer #4: fully connected, 10 neurons.]

The input layer (Layer #0) is the grayscale image of the handwritten character. The MNIST image database has images whose size is 28x28 pixels each, but because of the considerations described by Dr. Simard in his article "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis," the image size is padded to 29x29 pixels. There are therefore 29x29 = 841 neurons in the input layer.
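As a rough sketch of the padding step, the following copies a 28x28 image into a 29x29 buffer that has been pre-filled with a background value. The function name padTo29(), the background parameter, and the choice of where the extra row and column go (controlled by rowOffset and colOffset) are assumptions made for illustration; the program's own placement may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a 28x28 image into a 29x29 buffer pre-filled with a background value.
// rowOffset/colOffset (0 or 1) choose which edges receive the padding;
// this placement is an assumption, not taken from the program.
std::vector<double> padTo29(const std::vector<double>& image28,
                            double background,
                            std::size_t rowOffset = 0,
                            std::size_t colOffset = 0) {
    const std::size_t kSrc = 28, kDst = 29;
    assert(image28.size() == kSrc * kSrc);
    assert(rowOffset <= 1 && colOffset <= 1);
    std::vector<double> padded(kDst * kDst, background);
    for (std::size_t r = 0; r < kSrc; ++r)
        for (std::size_t c = 0; c < kSrc; ++c)
            padded[(r + rowOffset) * kDst + (c + colOffset)] = image28[r * kSrc + c];
    return padded;
}
```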

Layer #1 is a convolutional layer with six (6) feature maps. Each feature map is sized to 13x13 pixels/neurons. Each neuron in each feature map is a 5x5 convolutional kernel of the input layer, but every other pixel of the input layer is skipped (as described in Dr. Simard's article). As a consequence, there are 13 positions where the 5x5 kernel will fit in each row of the input layer (which is 29 neurons wide), and 13 positions where the 5x5 kernel will fit in each column of the input layer (which is 29 neurons high). There are therefore 13x13x6 = 1014 neurons in Layer #1, and (5x5+1)x6 = 156 weights. (The "+1" is for the bias.)

On the other hand, since each of the 1014 neurons has 26 connections, there are 1014x26 = 26364 connections from this Layer #1 to the prior layer. At this point, one of the benefits of a convolutional "shared weight" neural network should become more clear: because the weights are shared, even though there are 26364 connections, only 156 weights are needed to control those connections. As a consequence, only 156 weights need training. In comparison, a traditional "fully connected" neural network would have needed a unique weight for each connection, and would therefore have required training for 26364 different weights. None of that excess training is needed here.
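The arithmetic for Layer #1 can be checked with a few lines of code. This is a sketch with hypothetical helper names (kernelPositions(), convLayerCounts(), LayerCounts); it only covers a shared-weight layer fed by a single input map, which is the Layer #1 case.

```cpp
#include <cassert>

// Number of positions a kernel fits along one dimension when moved by `stride`.
int kernelPositions(int inputSize, int kernelSize, int stride) {
    return (inputSize - kernelSize) / stride + 1;
}

struct LayerCounts {
    int neurons;
    int weights;
    int connections;
};

// Counts for a shared-weight layer fed by a single input map (as in Layer #1).
LayerCounts convLayerCounts(int inputSize, int kernelSize,
                            int stride, int featureMaps) {
    const int side = kernelPositions(inputSize, kernelSize, stride);
    const int perKernel = kernelSize * kernelSize + 1;  // +1 for the bias
    const int neurons = side * side * featureMaps;
    // Weights are shared per feature map; connections are counted per neuron.
    return LayerCounts{neurons, perKernel * featureMaps, neurons * perKernel};
}
```

Calling convLayerCounts(29, 5, 2, 6) reproduces the 1014 neurons, 156 weights, and 26364 connections quoted above, and kernelPositions(13, 5, 2) gives the 5x5 size of the feature maps in the next layer.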

Layer #2 is also a convolutional layer, but with 50 feature maps. Each feature map is 5x5, and each unit in the feature maps is a 5x5 convolutional kernel of corresponding areas of all 6 of the feature maps of the previous layer, each of which is a 13x13 feature map. There are therefore 5x5x50 = 1250 neurons in Layer #2, (5x5+1)x6x50 = 7800 weights, and 1250x26 = 32500 connections.

Before proceeding to Layer #3, it's worthwhile to mention a few points on the architecture of the neural network in general, and on Layer #2 in particular. As mentioned above, each feature map in Layer #2 is connected to all 6 of the feature maps of the previous layer. This was a design decision, but it's not the only decision possible. As far as I can tell, the design is the same as Dr. Simard's design. But it's distinctly different from Dr. LeCun's design. Dr. LeCun deliberately chose not to connect each feature map in Layer #2 to all of the feature maps in the previous layer. Instead, he connected each feature map in Layer #2 to only a few selected ones of the feature maps in the previous layer. Each feature map was, in addition, connected to a different combination of feature maps from the previous layer. As Dr. LeCun explained it, his non-complete connection scheme would force the feature maps to extract different and (hopefully) complementary information, by virtue of the fact that they are provided with different inputs. One way of thinking about this is to imagine that you are forcing information through fewer connections, which should result in the connections becoming more meaningful. I think Dr. LeCun's approach is correct. However, to avoid additional complications to programming that was already complicated enough, I chose the simpler approach of Dr. Simard.

Other than this, the architectures of all three networks (i.e., the one described here and those described by Drs. LeCun and Simard) are largely similar.

Turning to Layer #3, Layer #3 is a fully-connected layer with 100 units. Since it is fully-connected, each of the 100 neurons in the layer is connected to all 1250 neurons in the previous layer. There are therefore 100 neurons in Layer #3, 100*(1250+1) = 125100 weights, and 100x1251 = 125100 connections.

Layer #4 is the final, output layer. This layer is a fully-connected layer with 10 units. Since it is fully-connected, each of the 10 neurons in the layer is connected to all 100 neurons in the previous layer. There are therefore 10 neurons in Layer #4, 10x(100+1) = 1010 weights, and 10x101 = 1010 connections.

Like Layer #2, this output Layer #4 also warrants an architectural note. Here, Layer #4 is implemented as a standard, fully connected layer, which is the same as the implementation of Dr. Simard's network. Again, however, it's different from Dr. LeCun's implementation. Dr. LeCun implemented his output layer as a "radial basis function", which basically measures the Euclidean distance between the actual inputs and a desired input (i.e., a target input). This allowed Dr. LeCun to experiment in tuning his neural network so that the output of Layer #3 (the previous layer) matched idealized forms of handwritten digits. This was clever, and it also yields some very impressive graphics. For example, you can basically look at the outputs of his Layer #3 to determine whether the network is doing a good job at recognition. For my Layer #3 (and Dr. Simard's), the outputs of Layer #3 are meaningful only to the network; looking at them will tell you nothing. Nevertheless, the implementation of a standard, fully connected layer is far simpler than the implementation of radial basis functions, both for forward propagation and for training during backpropagation. I therefore chose the simpler approach.

Altogether, adding the above numbers, there are a total of 3215 neurons in the neural network, 134066 weights, and 184974 connections.
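For readers who want to verify the totals, here is a small tally that simply adds up the per-layer figures exactly as they are quoted in the text above. The LayerSummary struct and the function names are hypothetical and exist only for this check.

```cpp
#include <cassert>
#include <vector>

struct LayerSummary {
    const char* name;
    int neurons;
    int weights;
    int connections;
};

// Per-layer figures exactly as quoted in the text.
std::vector<LayerSummary> quotedLayers() {
    return {
        {"Layer #0", 841, 0, 0},
        {"Layer #1", 1014, 156, 26364},
        {"Layer #2", 1250, 7800, 32500},
        {"Layer #3", 100, 125100, 125100},
        {"Layer #4", 10, 1010, 1010},
    };
}

// Sum the three columns across all layers.
LayerSummary total(const std::vector<LayerSummary>& layers) {
    LayerSummary sum{"Total", 0, 0, 0};
    for (const LayerSummary& l : layers) {
        sum.neurons += l.neurons;
        sum.weights += l.weights;
        sum.connections += l.connections;
    }
    return sum;
}
```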

The object is to train all 134066 weights so that, for an arbitrary input of a handwritten digit at the input layer, there is exactly one neuron at the output layer whose value is +1 whereas all other nine (9) neurons at the output layer have a value of -1. Again, the benchmark was an error rate of 0.82% or better, corresponding to the results obtained by Dr. LeCun.
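The +1/-1 output coding can be sketched in a few lines. The names targetFor() and recognizedDigit() are hypothetical, and picking the largest output as the recognized digit is an assumption about how a winner would be chosen when the outputs are not exactly +1 and -1.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Target vector for a digit: +1 at the digit's position, -1 elsewhere.
std::array<double, 10> targetFor(int digit) {
    assert(digit >= 0 && digit <= 9);
    std::array<double, 10> target;
    target.fill(-1.0);
    target[static_cast<std::size_t>(digit)] = 1.0;
    return target;
}

// Recognized digit: index of the largest output (assumed decision rule).
int recognizedDigit(const std::array<double, 10>& outputs) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < outputs.size(); ++i)
        if (outputs[i] > outputs[best]) best = i;
    return static_cast<int>(best);
}
```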

Code For Building the Neural Network

The code for building the neural network is found in the CMNistDoc::OnNewDocument() function of the document class. Using the above illustration, together with its description, it should be possible to follow the code which is reproduced in simplified form below:

// simplified code

BOOL CMNistDoc::OnNewDocument()
{
    if (!COleDocument::OnNewDocument())
        return FALSE;

    // grab the mutex for the neural network
    CAutoMutex tlo( m_utxNeuralNet );

    // initialize and build the neural net
    NeuralNetwork& NN = m_NN;  // for easier nomenclature
    NNLayer* pLayer;
    int ii, jj, kk;
    double initWeight;

    // layer zero, the input layer.
    // Create neurons: exactly the same number of neurons as the input
    // vector of 29x29=841 pixels, and no weights/connections
    pLayer = new NNLayer( _T("Layer00") );
    NN.m_Layers.push_back( pLayer );

    for ( ii=0; ii<841; ++ii )
    {
        pLayer->m_Neurons.push_back( new NNNeuron() );
    }

    // layer one:
    // This layer is a convolutional layer that
    // has 6 feature maps.  Each feature 
    // map is 13x13, and each unit in the
    // feature maps is a 5x5 convolutional kernel
    // of the input layer.
    // So, there are 13x13x6 = 1014 neurons, (5x5+1)x6 = 156 weights
    pLayer = new NNLayer( _T("Layer01"), pLayer );
    NN.m_Layers.push_back( pLayer );

    for ( ii=0; ii<1014; ++ii )
    {
        pLayer->m_Neurons.push_back( new NNNeuron() );
    }

    for ( ii=0; ii<156; ++ii )
    {
        // uniform random distribution
        initWeight = 0.05 * UNIFORM_PLUS_MINUS_ONE;
        pLayer->m_Weights.push_back( new NNWeight( initWeight ) );
    }

    // interconnections with previous layer: this is difficult
    // The previous layer is a top-down bitmap
    // image that has been padded to size 29x29
    // Each neuron in this layer is connected
    // to a 5x5 kernel in its feature map, which 
    // is also a top-down bitmap of size 13x13. 
    // We move the kernel by TWO pixels, i.e., we
    // skip every other pixel in the input image
    int kernelTemplate[25] = {
        0,  1,  2,  3,  4,
        29, 30, 31, 32, 33,
        58, 59, 60, 61, 62,
        87, 88, 89, 90, 91,
        116,117,118,119,120 };

    int iNumWeight;
    int fm;  // "fm" stands for "feature map"

    for ( fm=0; fm<6; ++fm)
    {
        for ( ii=0; ii<13; ++ii )
        {
            for ( jj=0; jj<13; ++jj )
            {
                // 26 is the number of weights per feature map
                iNumWeight = fm * 26;
                NNNeuron& n = 
                   *( pLayer->m_Neurons[ jj + ii*13 + fm*169 ] );

                n.AddConnection( ULONG_MAX, iNumWeight++ );  // bias weight

                for ( kk=0; kk<25; ++kk )
                {
                    // note: max val of index == 840, 
                    // corresponding to 841 neurons in prev layer
                    n.AddConnection( 2*jj + 58*ii + 
                        kernelTemplate[kk], iNumWeight++ );
                }
            }
        }
    }

    // layer two:
    // This layer is a convolutional layer
    // that has 50 feature maps.  Each feature 
    // map is 5x5, and each unit in the feature
    // maps is a 5x5 convolutional kernel
    // of corresponding areas of all 6 feature maps
    // of the previous layer, each of which is 13x13
    // So, there are 5x5x50 = 1250 neurons, (5x5+1)x6x50 = 7800 weights
    pLayer = new NNLayer( _T("Layer02"), pLayer );
    NN.m_Layers.push_back( pLayer );

    for ( ii=0; ii<1250; ++ii )
    {
        pLayer->m_Neurons.push_back( new NNNeuron() );
    }

    for ( ii=0; ii<7800; ++ii )
    {
        initWeight = 0.05 * UNIFORM_PLUS_MINUS_ONE;
        pLayer->m_Weights.push_back( new NNWeight( initWeight ) );
    }

    // Interconnections with previous layer: this is difficult
    // Each feature map in the previous layer
    // is a top-down bitmap image whose size
    // is 13x13, and there are 6 such feature maps.
    // Each neuron in one 5x5 feature map of this 
    // layer is connected to a 5x5 kernel
    // positioned correspondingly in all 6 parent
    // feature maps, and there are individual
    // weights for the six different 5x5 kernels.  As
    // before, we move the kernel by TWO pixels, i.e., we
    // skip every other pixel in the input image.
    // The result is 50 different 5x5 top-down bitmap
    // feature maps
    int kernelTemplate2[25] = {
        0,  1,  2,  3,  4,
        13, 14, 15, 16, 17, 
        26, 27, 28, 29, 30,
        39, 40, 41, 42, 43, 
        52, 53, 54, 55, 56   };

    for ( fm=0; fm<50; ++fm)
    {
        for ( ii=0; ii<5; ++ii )
        {
            for ( jj=0; jj<5; ++jj )
            {
                // 156 = (5x5+1)x6 is the number of weights per feature map
                iNumWeight = fm * 156;
                NNNeuron& n = *( pLayer->m_Neurons[ jj + ii*5 + fm*25 ] );

                n.AddConnection( ULONG_MAX, iNumWeight++ );  // bias weight

                for ( kk=0; kk<25; ++kk )
                {
                    // note: max val of index == 1013,
                    // corresponding to 1014 neurons in prev layer
                    n.AddConnection(       2*jj + 26*ii + 
                     kernelTemplate2[kk], iNumWeight++ );
                    n.AddConnection( 169 + 2*jj + 26*ii + 
                     kernelTemplate2[kk], iNumWeight++ );
                    n.AddConnection( 338 + 2*jj + 26*ii + 
                     kernelTemplate2[kk], iNumWeight++ );
                    n.AddConnection( 507 + 2*jj + 26*ii + 
                     kernelTemplate2[kk], iNumWeight++ );
                    n.AddConnection( 676 + 2*jj + 26*ii + 
                     kernelTemplate2[kk], iNumWeight++ );
                    n.AddConnection( 845 + 2*jj + 26*ii + 
                     kernelTemplate2[kk], iNumWeight++ );
                }
            }
        }
    }

    // layer three:
    // This layer is a fully-connected layer
    // with 100 units.  Since it is fully-connected,
    // each of the 100 neurons in the
    // layer is connected to all 1250 neurons in
    // the previous layer.
    // So, there are 100 neurons and 100*(1250+1)=125100 weights
    pLayer = new NNLayer( _T("Layer03"), pLayer );
    NN.m_Layers.push_back( pLayer );

    for ( ii=0; ii<100; ++ii )
    {
        pLayer->m_Neurons.push_back( new NNNeuron() );
    }

    for ( ii=0; ii<125100; ++ii )
    {
        initWeight = 0.05 * UNIFORM_PLUS_MINUS_ONE;
        pLayer->m_Weights.push_back( new NNWeight( initWeight ) );
    }

    // Interconnections with previous layer: fully-connected
    iNumWeight = 0;  // weights are not shared in this layer

    for ( fm=0; fm<100; ++fm )
    {
        NNNeuron& n = *( pLayer->m_Neurons[ fm ] );
        n.AddConnection( ULONG_MAX, iNumWeight++ );  // bias weight

        for ( ii=0; ii<1250; ++ii )
        {
            n.AddConnection( ii, iNumWeight++ );
        }
    }

    // layer four, the final (output) layer:
    // This layer is a fully-connected layer
    // with 10 units.  Since it is fully-connected,
    // each of the 10 neurons in the layer
    // is connected to all 100 neurons in
    // the previous layer.
    // So, there are 10 neurons and 10*(100+1)=1010 weights
    pLayer = new NNLayer( _T("Layer04"), pLayer );
    NN.m_Layers.push_back( pLayer );

    for ( ii=0; ii<10; ++ii )
    {
        pLayer->m_Neurons.push_back( new NNNeuron() );
    }

    for ( ii=0; ii<1010; ++ii )
    {
        initWeight = 0.05 * UNIFORM_PLUS_MINUS_ONE;
        pLayer->m_Weights.push_back( new NNWeight( initWeight ) );
    }

    // Interconnections with previous layer: fully-connected
    iNumWeight = 0;  // weights are not shared in this layer

    for ( fm=0; fm<10; ++fm )
    {
        NNNeuron& n = *( pLayer->m_Neurons[ fm ] );
        n.AddConnection( ULONG_MAX, iNumWeight++ );  // bias weight

        for ( ii=0; ii<100; ++ii )
        {
            n.AddConnection( ii, iNumWeight++ );
        }
    }

    SetModifiedFlag( TRUE );
    return TRUE;
}

This code builds the illustrated neural network in stages, one stage for each layer. In each stage, an NNLayer is new'd and then added to the NeuralNetwork's vector of layers. The needed number of NNNeurons and NNWeights are new'd and then added respectively to the layer's vector of neurons and vector of weights. Finally, for each neuron in the layer, NNConnections are added (using the NNNeuron::AddConnection() function), passing in appropriate indices for weights and neurons.
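To make the index bookkeeping concrete, here is a self-contained sketch of the Layer #1 connections for a single neuron, together with the weighted sum that such connections imply. The Connection struct and both functions are hypothetical simplifications rather than the program's classes; the row-major kernel offsets (29*ky + kx) are equivalent to the kernelTemplate table, and the activation function applied after the sum is deliberately left out.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical simplified connection: an index into the previous layer's
// neurons and an index into this layer's weights. ULONG_MAX marks the bias.
struct Connection {
    unsigned long neuronIndex;
    unsigned long weightIndex;
};

// Connections of the Layer #1 neuron at row ii, column jj of feature map fm.
std::vector<Connection> layer1Connections(int fm, int ii, int jj) {
    assert(fm >= 0 && fm < 6 && ii >= 0 && ii < 13 && jj >= 0 && jj < 13);
    std::vector<Connection> conns;
    unsigned long w = static_cast<unsigned long>(fm) * 26;  // 26 weights per map
    conns.push_back({ULONG_MAX, w++});                       // bias weight
    for (int ky = 0; ky < 5; ++ky) {
        for (int kx = 0; kx < 5; ++kx) {
            // kernel moves by two pixels per step in each direction
            unsigned long n =
                static_cast<unsigned long>(2 * jj + 58 * ii + 29 * ky + kx);
            conns.push_back({n, w++});
        }
    }
    return conns;
}

// Weighted sum over the connections; the bias input is a constant 1.0.
double weightedSum(const std::vector<Connection>& conns,
                   const std::vector<double>& prevOutputs,
                   const std::vector<double>& weights) {
    double sum = 0.0;
    for (const Connection& c : conns) {
        const double input =
            (c.neuronIndex == ULONG_MAX) ? 1.0 : prevOutputs[c.neuronIndex];
        sum += weights[c.weightIndex] * input;
    }
    return sum;
}
```

For the last neuron of the last feature map (fm=5, ii=12, jj=12), the final connection refers to neuron index 840 and weight index 155, which matches the bounds noted in the code comments above.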

MNIST Database of Handwritten Digits

The MNIST database is modified (hence the "M") from a database of handwritten patterns offered by the National Institute of Standards and Technology ("NIST"). As explained by Dr. LeCun at the MNIST section of his web site, the database has been broken down into two distinct sets, a first set of 60,000 images of handwritten digits that is used as a training set for the neural network, and a second set of 10,000 images that is used as a testing set. The training set is composed of digits written by around 250 different writers, of which approximately half are high school students and the other half are U.S. census workers. The number of writers in the testing set is unclear, but special precautions were taken to ensure that no writer in the testing set was also in the training set. This makes for a strong test, since the neural network is tested on images from writers that it has never before seen. It is thus a good test of the ability of the neural network to generalize, i.e., to extract intrinsically important features from the patterns in the training set that are also applicable to patterns that it has not seen before.

To use the neural network, you must download the MNIST database. Besides the two files that compose the patterns from the training and testing sets, there are also two companion files that give the "answers", i.e., the digit that is represented by a corresponding handwritten pattern. These two files are called "label" files. As indicated at the beginning of this article, the four files can be downloaded from here (11,594 Kb total).
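If you want to sanity-check the downloaded files before opening them in the program, the header of each file can be inspected. The sketch below follows the file format documented on the MNIST page (big-endian 32-bit fields; magic number 2051 for image files and 2049 for label files); the struct and function names are hypothetical, and this is not the program's own reader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read a 32-bit big-endian integer from a byte buffer.
std::uint32_t readBigEndian32(const std::vector<unsigned char>& bytes,
                              std::size_t offset) {
    assert(offset + 4 <= bytes.size());
    return (static_cast<std::uint32_t>(bytes[offset]) << 24) |
           (static_cast<std::uint32_t>(bytes[offset + 1]) << 16) |
           (static_cast<std::uint32_t>(bytes[offset + 2]) << 8) |
           static_cast<std::uint32_t>(bytes[offset + 3]);
}

// Header of an MNIST image file: magic number, image count, rows, columns.
struct ImageFileHeader {
    std::uint32_t magic;
    std::uint32_t count;
    std::uint32_t rows;
    std::uint32_t cols;
};

ImageFileHeader parseImageHeader(const std::vector<unsigned char>& bytes) {
    return ImageFileHeader{readBigEndian32(bytes, 0), readBigEndian32(bytes, 4),
                           readBigEndian32(bytes, 8), readBigEndian32(bytes, 12)};
}

// Magic numbers from the published file format.
bool isImageMagic(std::uint32_t magic) { return magic == 2051; }
bool isLabelMagic(std::uint32_t magic) { return magic == 2049; }
```

For the training pattern file, the first 16 bytes should decode to a magic number of 2051, a count of 60000, and 28 rows by 28 columns.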

Incidentally, it's been mentioned that Dr. LeCun's achievement of an error rate of 0.82% has been used as a benchmark. If you read Dr. Simard's article, you will see that he claims an even better error rate of 0.40%. Why not use Dr. Simard's 0.40% as the benchmark?

The reason is that Dr. Simard did not respect the boundary between the training set and the testing set. In particular, he did not respect the fact that the writers in the training set were distinct from the writers in the testing set. In fact, Dr. Simard did not use the testing set at all. Instead, he trained with the first 50,000 patterns in the training set, and then tested with the remaining 10,000 patterns. This raises the possibility that, during testing, Dr. Simard's network was fed patterns from writers whose handwriting had already been seen before, which would give the network an unfair advantage. Dr. LeCun, on the other hand, took pains to ensure that his network was tested with patterns from writers it had never seen before. Thus, as compared with Dr. Simard's testing, Dr. LeCun's testing was more representative of real-world results since he did not give his network any unfair advantages. That's why I used Dr. LeCun's error rate of 0.82% as the benchmark.

Finally, you should note that the MNIST database is still widely used for study and testing, despite the fact that it was created back in 1998. As one recent example, published in February 2006, see:

In the Lauer et al. article, the authors used a convolutional neural network for everything except the actual classification/recognition step. Instead, they used the convolutional neural network for black-box extraction of feature vectors, which they then fed to a different type of classification engine, namely a support vector machine ("SVM") engine. With this architecture, they were able to obtain the excellent error rate of just 0.54%. Good stuff.

Overall Architecture of the Test/Demo Program

The test program is an MFC SDI doc/view application, using worker threads for the various neural network tasks.

The document owns the neural network as a protected member variable. Weights for the neural network are saved to and loaded from an .nnt file in the CMNistDoc::Serialize() function. Storage/retrieval of the weights occurs in response to the menu items "File->Open" and "File->Save" or "Save As". For this purpose, the neural network also has a Serialize() function, which was not shown in the simplified code above, and it is the neural network that does the heavy lifting of storing its weights to a disk file, and extracting them later.

The document further holds two static functions that are used for the worker threads to run backpropagation and testing on the neural network. These functions are unimaginatively named CMNistDoc::BackpropagationThread() and CMNistDoc::TestingThread(). The functions are not called directly; instead, other classes that wish to begin a backpropagation or testing cycle ask the document to do so for them, by calling the document's StartBackpropagation() and StartTesting() functions. These functions accept parameters for the backpropagation or the testing, and then kick off the threads to do the work. Results are reported back to the class that requested the work using the API's ::PostMessage() function with a user-defined message and user-defined codings of the wParam and lParam parameters.

Two helper classes are also defined in the document, named CAutoCS and CAutoMutex. These are responsible for automatic locking and release of critical sections and mutexes that are used for thread synchronization. Note that the neural network itself does not have a critical section. I felt that thread synchronization should be the responsibility of the owner of the neural network, and not the responsibility of the neural network, which is why the document holds these synchronization objects. The CAutoCS and CAutoMutex classes lock and release the synchronization objects through use of well-known RAII techniques ("resource acquisition is initialization").
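The RAII idea behind CAutoCS and CAutoMutex can be illustrated with a portable analog. The sketch below uses std::mutex instead of the Win32 synchronization objects that the program wraps, and the AutoLock and GuardedCounter names are hypothetical; it only shows the pattern of locking in the constructor and releasing in the destructor, with the owner (not the protected object) holding the mutex.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical portable analog of CAutoMutex: locks in the constructor and
// unlocks in the destructor, so every exit path releases the mutex.
class AutoLock {
public:
    explicit AutoLock(std::mutex& m) : m_mutex(m) { m_mutex.lock(); }
    ~AutoLock() { m_mutex.unlock(); }
    AutoLock(const AutoLock&) = delete;
    AutoLock& operator=(const AutoLock&) = delete;

private:
    std::mutex& m_mutex;
};

// Example owner: the mutex lives in the owner, and the guard protects the
// counter for the duration of each call.
struct GuardedCounter {
    std::mutex mutex;
    int value = 0;

    int increment() {
        AutoLock lock(mutex);
        return ++value;  // mutex is released when `lock` goes out of scope
    }
};
```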

The document further owns the MNIST database files, which are opened and closed in the CMNistDoc::OnButtonOpenMnistFiles() and CMNistDoc::OnButtonCloseMnistFiles() functions. These functions are button handlers for correspondingly-labeled buttons on the view. When opening the MNIST files, all four files must be opened, i.e., the pattern files for the training and testing sets, and the label files for the training and testing sets. There are therefore four "Open File" dialogs, and prompts on the dialogs tell you which file should be opened next.

The view is based on CFormView, and holds a single tab control with three tabs, a first for a graphical view of the neural network and the outputs of each of its neurons, a second for backpropagation and training, and a third for testing. The view is resizable, using an excellent class named DlgResizeHelper, described at Dialog Resize Helper by Stephan Keil.

Each tab on the tab control hosts a different CDialog-derived class. The first tab hosts a CDlgCharacterImage dialog which provides the graphical view of the neural network and the outputs of each neuron. The second tab hosts a CDlgNeuralNet dialog which provides control over training of the neural network, and which also gives feedback and progress during the training process (which can be lengthy). The third tab hosts a CDlgTesting dialog which provides control over testing and outputs test results. Each of these tabs/dialogs is described in more detail in the following sections.

Using The Demo Program

To use the program, you must open five files: four files comprising the MNIST database, and one file comprising the weights for the neural network.

To open the MNIST database files, click the "Open MNist" button at the bottom of the screen. You will be prompted four successive times to open the four files comprising the MNIST database, namely, the training patterns, the labels (i.e., the answers) for the training patterns, the testing patterns, and the labels for the testing patterns.

To open the weights for the neural network, from the main menu, choose File->Open, or choose the file-open icon from the toolbar. You will be prompted to open a .nnt file that contains the weights and the structure of the neural network.

Graphical View of the Neural Network

The graphical view of the neural network is the same as the screenshot at the top of this article, and it's repeated here:

Graphical view of the neural network

The window at the mid-right shows the output of all 3215 neurons. For each neuron, the output is represented as a single grayscale pixel whose gray level corresponds to the neuron's output: white equals a fully off neuron (value of -1) and black equals a fully on neuron (value of +1). The neurons are grouped into their layers. The input layer is the 29x29 pattern being recognized. Next are the six 13x13 feature maps of the second layer, followed in turn by the 50 5x5 feature maps of the third layer, the 100 neurons of the fourth layer (arranged in a column), and finally the 10 output neurons in the output layer.

Since individual pixels are hard to see clearly, a magnified view is provided. Simply move your mouse over the window, and a magnified window is displayed that lets you see each pixel's value clearly.

The display for the output layer is a bit different than for the other layers. Instead of a single pixel, the 10 output neurons are shown as a grayscale representation of the digit coded by that neuron. If the neuron is fully off (value of -1), the digit is shown as pure white, which of course means that it can't be seen against the white background. If the neuron is fully on (value of +1) then the digit is shown in pure black. In-between values are shown by varying degrees of grayness. In the screenshot above, the output of the neural network shows that it firmly believes that the pattern being recognized is an "8" (which is correct). But the dimly displayed "2" also shows that the neural network sees at least some similarity in the pattern to the digit "2". A red marker points to the most likely result.
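The mapping from a neuron's output to a gray level can be sketched as a simple linear scale. The function name grayLevel() and the use of 8-bit gray values (0 for black, 255 for white) are assumptions for illustration; the program's drawing code may compute the shade differently.

```cpp
#include <algorithm>
#include <cassert>

// Map a neuron output in [-1, +1] to an 8-bit gray level:
// -1 (fully off) -> 255 (white), +1 (fully on) -> 0 (black).
// Outputs outside [-1, +1] are clamped first.
int grayLevel(double output) {
    const double clamped = std::max(-1.0, std::min(1.0, output));
    return static_cast<int>(255.0 * (1.0 - clamped) / 2.0 + 0.5);
}
```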

The mid-left of the dialog shows the appearance of the pattern at a selectable index. Here the pattern's index is 1598 in the training set. Controls are arranged to allow fast sequencing through the patterns, and to allow selection of either the training set or the testing set. There is also a check box for distortion of the pattern. Distortion helps in training of the neural network, since it forces the network to extract the intrinsic shapes of the patterns, rather than allowing the network to (incorrectly) focus on peculiarities of individual patterns. This is discussed in greater detail below, but in essence, it helps the neural network to generalize. If you look carefully at the screenshot above, you will notice that the input pattern for the digit "8" (on the left) has been slightly distorted when it is given as the input layer to the neural network (on the right).

The "Calculate" button causes the neural network to forward propagate the pattern in the input layer and attempt a recognition of it. The actual numeric values of the output neurons are shown in the center, together with the time taken for the calculation. Although almost all other operations on the neural network are performed in a thread, for recognitions of single patterns, the time taken is so short (typically around 15 milliseconds) that the calculation is performed in the main UI thread.

Training View and Control Over the Neural Network

Unless training is actually ongoing, the training view hides many of its controls. Training is started by clicking on the "Start Backpropagation" button, and during training, the training view looks like the following screenshot:

Training view of the neural network

During training, the 60,000 patterns in the training set are continuously processed (in a random sequence) by the backpropagation algorithm. Each pass through the 60,000 patterns is called an "epoch". The display shows statistics about this process. Starting at the top, the display gives an estimate of the current mean squared error ("MSE") of the neural network. MSE is the error due to a single pattern P at the last layer n, as given by equation (1) above, averaged across all 60,000 patterns. The value at the top is only a running estimate, calculated over the last 200 patterns.
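A running estimate of this kind can be kept with a sliding window. The sketch below assumes the usual half sum-of-squares form for the per-pattern error and a plain average over the most recent errors; the patternError() function and the RunningMse class are hypothetical, and the program may maintain its estimate differently.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Error for one pattern: half the sum of squared differences between the
// actual and target outputs (assumed form of the per-pattern error).
double patternError(const std::vector<double>& actual,
                    const std::vector<double>& target) {
    assert(actual.size() == target.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const double d = actual[i] - target[i];
        sum += d * d;
    }
    return 0.5 * sum;
}

// Running average of the most recent `window` per-pattern errors.
class RunningMse {
public:
    explicit RunningMse(std::size_t window) : m_window(window) {
        assert(window > 0);
    }

    void add(double error) {
        m_errors.push_back(error);
        m_sum += error;
        if (m_errors.size() > m_window) {  // drop the oldest error
            m_sum -= m_errors.front();
            m_errors.pop_front();
        }
    }

    double estimate() const {
        return m_errors.empty() ? 0.0
                                : m_sum / static_cast<double>(m_errors.size());
    }

private:
    std::size_t m_window;
    std::deque<double> m_errors;
    double m_sum = 0.0;
};
```

A window of 200, as in RunningMse estimate(200), would correspond to the "last 200 patterns" mentioned above.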

A history of the current estimate of MSE is seen graphically just to the right. This is a graph of the current estimates of MSE, over the last four epochs.

Just below are a few more statistics, namely the number of fully-completed epochs, and the cardinal number of the current pattern being backpropagated. A progress bar is also shown.

Below that is an edit control showing information about each epoch. The information is updated as each epoch is completed.

The edit control also indicates when the Hessian (needed for second order backpropagation) is being calculated. During this time, which requires an analysis of 500 random patterns, the neural network is not able to do much else, so a string of dots slowly scrolls across the control. Frankly, it's not logical to put this display inside the edit control, and this might change in future developments.

Backpropagation can be stopped at any time by clicking the "Stop Backpropagation" button.

Testing View of the Neural Network

The testing view of the neural network is shown in the following screenshot:

Testing view of the neural network

In the testing mode, the 10,000 patterns in the testing set are run through the neural network, and if the output of the neural network does not match the expected output, an error is recorded in the view's main edit control. A progress bar through the testing set is displayed, and at the completion of testing, a summary of the total number of errors is displayed.
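The bookkeeping of a test run amounts to counting mismatches and converting the count to a percentage. A minimal sketch, with hypothetical function names and with the recognized digits and labels already collected into vectors:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count patterns whose recognized digit differs from the expected label.
int countErrors(const std::vector<int>& recognized,
                const std::vector<int>& labels) {
    assert(recognized.size() == labels.size());
    int errors = 0;
    for (std::size_t i = 0; i < labels.size(); ++i)
        if (recognized[i] != labels[i]) ++errors;
    return errors;
}

// Error rate as a percentage of the patterns tested.
double errorRatePercent(int errors, int total) {
    assert(total > 0);
    return (100.0 * errors) / total;
}
```

On the 10,000-pattern testing set, the benchmark error rate of 0.82% corresponds to 82 misrecognized patterns.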

A testing sequence is commenced by clicking the "Start Testing" button, at which point you are shown a dialog from which you can select the testing parameters:

Selection of testing parameters

It's usually best simply to select the default values. The meaning of the available parameters will become clearer once we look at the training parameters in the next section. Note that it is possible to "test" the training set, which can be useful for checking on convergence of the weights.

Training the Neural Network

After the design of the overall architecture of the neural network, training is probably the next most important topic. We will discuss it over the next few sections.

The idea of training is to force the neural network to generalize. The word "generalization" deserves a few moments of your thought. An enormous amount of time is spent in analyzing the training set, but out in the real world, the performance of the neural network on the training set is of absolutely no interest. Almost any neural network can be trained so well it might not encounter any errors at all on the training set. But who cares about that? We already know the answers for the training set. The real question is, how well will the neural network perform out in the real world on patterns that it has never before seen?

The ability of a neural network to do well on patterns that it has never seen before is called "generalization". The effort spent in training is to force the neural network to see intrinsic characteristics in the training set, which we hope will also be present in the real world. When we calculate the MSE of the training set, we do so with the understanding that this is only an estimate of the number that we are really interested in, namely, the MSE of real-world data.

A few techniques are used during training in an effort to improve the neural network's generalization. As one example, the training set is passed through the neural network in a randomized order in each epoch. Conceptually, these techniques cause the weights to "bounce around" a bit, thereby preventing the neural network from settling on weights that emphasize false attributes of the patterns, i.e., attributes that are peculiar to patterns in the training set but which are not intrinsic characteristics of patterns encountered out in the real world.
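Randomizing the presentation order for each epoch can be done by shuffling a list of pattern indices. This is a sketch using the standard library; the function name shuffledOrder() is hypothetical and the program may randomize its sequence in a different way.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Build a fresh random visiting order for one epoch over `count` patterns.
// Every index in [0, count) appears exactly once.
std::vector<std::size_t> shuffledOrder(std::size_t count, std::mt19937& rng) {
    std::vector<std::size_t> order(count);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```

Calling shuffledOrder(60000, rng) at the start of each epoch would give a different sequence every time while still visiting every training pattern once.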

One of the more important techniques for improving generalization is distortion: the training patterns are distorted slightly before being used in backpropagation. The theory is that these distortions force the neural network to look more closely at the important aspects of the training patterns, since these patterns will be different each time the neural network sees them. Another way of looking at this is that distortions enlarge the size of the training set, by artificially creating new training patterns from existing ones.

In this program, the CMNistDoc::GenerateDistortionMap() function generates a distortion map that is applied to the training patterns (using the CMNistDoc::ApplyDistortionMap() function) during training. The distortion is calculated randomly for each pattern just before backpropagation. Three different types of distortions are applied:

  • Scale Factor: The scale of the pattern is changed so as to enlarge or shrink it. The scale factor for the horizontal direction is different from the scale factor for the vertical direction, such that it is possible to see shrinkage in the vertical scale and enlargement in the horizontal scale.
  • Rotation: The entire pattern is rotated clockwise or counterclockwise.
  • Elastic: This is a word borrowed from Dr. Simard's "Best Practices" paper. Visually, elastic distortions look like a gentle pushing and pulling of pixels in the pattern, almost like viewing the pattern through wavy water.

The result of distortions is illustrated in the following diagram, which shows the original of a pattern for the digit "3" together with 25 examples of distortions applied to the pattern:

Examples of distortions applied to training patterns

You can see the effect of the three types of distortions, but to a human viewer, it is clear that each distortion still "looks like" the digit "3". To the neural network, however, the pixel patterns are remarkably different. Because of these differences, the neural network is forced to see beyond the actual pixel patterns, and hopefully to understand and apply the intrinsic qualities that make a "3" look like a "3".
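To give a feel for what a distortion map contains, here is a sketch that builds a per-pixel displacement field for the scaling and rotation distortions only. The elastic component is omitted, and the Displacement struct, the scaleRotateMap() function, and the exact formulas (scaling and rotation about the image center, simply added together) are assumptions for illustration rather than a copy of CMNistDoc::GenerateDistortionMap().

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// How far, in pixels, the sample point for one pixel is moved.
struct Displacement {
    double dx;
    double dy;
};

// Displacement field over a size x size image for a scale change
// (scaleX, scaleY as fractions, e.g. 0.1 = 10% larger) plus a rotation
// by `angle` radians about the image center. Elastic distortion is omitted.
std::vector<Displacement> scaleRotateMap(int size, double scaleX,
                                         double scaleY, double angle) {
    assert(size > 0);
    const double center = (size - 1) / 2.0;
    const double c = std::cos(angle);
    const double s = std::sin(angle);
    std::vector<Displacement> map(static_cast<std::size_t>(size) * size);
    for (int row = 0; row < size; ++row) {
        for (int col = 0; col < size; ++col) {
            const double x = col - center;
            const double y = row - center;
            Displacement d;
            d.dx = scaleX * x + (x * (c - 1.0) - y * s);
            d.dy = scaleY * y + (x * s + y * (c - 1.0));
            map[static_cast<std::size_t>(row) * size + col] = d;
        }
    }
    return map;
}
```

With all three parameters set to zero the field is zero everywhere, and the center pixel never moves regardless of the parameters; independent scaleX and scaleY values give the mixed shrink/enlarge effect described in the list above.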

You select whether or not to apply distortions when you click the "Start Backpropagation" button in the training view, at which point you are shown the following dialog which lets you select other training parameters as well:

Selection of training parameters

Besides distortions, this dialog lets you set other parameters for backpropagation. Most notably, you must specify an initial learning rate "eta" and a final learning rate. Generally speaking, the learning rate should be larger for untrained neural networks, since a large learning rate will allow large changes in the weights during backpropagation. The learning rate is slowly decreased during training, as the neural network learns, so as to allow the weights to converge to some final value. But it never really makes sense to allow the learning rate to decrease too much, or the network will stop learning. That's the purpose of the entry for the final value.

For the initial value of the learning rate, never use a value larger than around 0.001 (which is the default). Larger values cause the weights to change too quickly, and actually cause the weights to diverge rather than converge.

Most of the literature on training of neural networks gives a schedule of learning rates as a function of epoch number. Dr. LeCun, for example, recommends a schedule of 0.0005 for two epochs, followed by 0.0002 for the next three, 0.0001 for the next three, 0.00005 for the next four, and 0.00001 thereafter. I found it easier to program a reduction factor that is applied after N recognitions. Frankly, I think it might have been better to listen to the experts, and this might change in the future to a schedule of learning rates. At present, however, at the end of N recognitions (such as 120,000 recognitions which corresponds to two epochs and is the default value), the program simply multiplies the current learning rate by a factor that's less than one, which results in a continuously decreasing learning rate.

The dialog was also written at a naive time of development, when I thought it might be possible to reduce the learning rate within a single epoch, rather than only after an epoch is completed. So, the reduction factor is given in terms of a reduction after N recognitions, rather than after N epochs which would have been more appropriate.

The default values for the dialog are values that I found to work well for untrained networks. The initial learning rate is 0.001, which is rather large and results in good coarse training quickly, particularly given the second order nature of the backpropagation algorithm. The learning rate is reduced to 79.4% of its value after every two epochs. So, for example, the training rate for the third epoch is 0.000794. The final learning rate is 0.00005, which is reached after around 26 epochs. At this point the neural network is well-trained.
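
The decay rule described above can be sketched in a few lines. The function name LearningRateAfter and its parameter names are hypothetical and are not taken from the program's source; the sketch only assumes what the text says, namely that the rate is multiplied by a fixed factor after every N recognitions and is never allowed to fall below the final value.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: learning rate in effect after a given number of
// pattern recognitions, for a rate that is multiplied by `decay` after
// every `interval` recognitions and is never allowed below `etaMin`.
double LearningRateAfter( unsigned long recognitions, double etaInitial,
                          double etaMin, double decay, unsigned long interval )
{
    assert( interval > 0 && decay > 0.0 && decay < 1.0 );
    const unsigned long reductions = recognitions / interval;
    const double eta = etaInitial * std::pow( decay, static_cast<double>( reductions ) );
    return eta < etaMin ? etaMin : eta;
}
```

With an initial rate of 0.001, a factor of 0.794183335 and an interval of 120,000 recognitions, the rate at the start of the third epoch is about 0.000794, and after 13 reductions (26 epochs) it is about 0.00005, because 0.794183335 raised to the 13th power is very nearly 0.05.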

The other options in the dialog are discussed below in the "Tricks" section.

Tricks That Make Training Faster

As indicated in the beginning of the article, this project is an engineering work-in-progress, where many different adjustments are made in an effort to improve accuracy. Against that background, training can be a maddeningly slow process, where it takes several CPU hours to determine whether or not a change was beneficial. As a consequence, there were a few things done to make training progress more quickly.

Three things in particular were done: second order backpropagation, multithreading, and backpropagation skipping. The first has been done before, but I have not seen the second and third techniques in any of the resources that I found (maybe they're new??). Each is discussed below.

Second Order Backpropagation Using Pseudo-Hessian

Second order backpropagation is not really a "trick"; it's a well-understood mathematical technique. And it was discussed above in the section on "Second Order Methods". It's mentioned here again since it clearly had the greatest impact on speeding convergence of the neural network, in that it dramatically reduced the number of epochs needed for convergence of the weights.

The next two techniques are slightly different in emphasis, since they reduce the amount of time spent in any one epoch by speeding progress through the epoch. Put another way, although second order backpropagation reduces the overall time needed for training, the next two techniques reduce the time needed to go through all of the 60,000 patterns that compose a single epoch.

Simultaneous Backpropagation and Forward Propagation

In this technique, multiple threads are used to reduce the time needed for one epoch.

Note that in general, multiple threads will almost always have a negative impact on performance. In other words, if it takes time T to perform all work in a single thread, then it will take time T+x to perform the same work with multiple threads. The reason is that the overhead of context switching between threads, and the added burden of synchronization between threads, adds time that would not be necessary if only a single thread were used.

The only circumstance where multiple threads will improve performance is where there are also multiple processors. If there are multiple processors, then it makes sense to divide the work amongst the processors, since if there were only a single thread, then all the other processors would wastefully be idle.

So, this technique helps only if your machine has multiple processors, i.e., if you have a hyper-threaded or dual core processor (Intel terminology) or dual processors (AMD terminology). If you do not have multiple processors, then you will not see any improvement when using multiple threads. Moreover, there is absolutely no benefit to running more threads than you have processors. So, if you have a hyperthreaded machine, use two threads exactly; use of three or more threads does not give any advantage.

But there's a significant problem in trying to implement a multi-threaded backpropagation algorithm. The problem is that the backpropagation algorithm depends on the numerical outputs of the individual neurons from the forward propagation step. So consider a simple multi-threaded scenario: a first thread forward propagates a first input pattern, resulting in neuron outputs for each layer of the neural network including the output layer. The backpropagation stage is then started, and equation (2) above is used to calculate how errors in the pattern depend on outputs of the output layer.

Meanwhile, because of multithreading, another thread selects a second pattern and forward propagates it through the neural network, in preparation for the backpropagation stage. There's a context switch back to the first thread, which now tries to complete backpropagation for the first pattern.

And here's where the problem becomes evident. The first thread needs to apply equations (3) and (4). But these equations depend on the numerical outputs of the neurons, and all of those values have now changed, because of the action of the second thread's forward propagation of the second pattern.

My solution to this dilemma is to memorize the outputs of all the neurons. In other words, when forward propagating an input pattern, the outputs of all 3215 neurons are stored in separate memory locations, and these memorized outputs are used during the backpropagation stage. Since the outputs are stored, another thread's forward propagation of a different pattern will not affect the backpropagation calculations.

The sequence runs something like this. First, a thread tries to lock a mutex that protects the neural network, and will wait until it can obtain the lock. Once the lock is obtained, the thread forward propagates a pattern and memorizes the output values of all neurons in the network. The mutex is then released, which allows another thread to do the same thing.
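
The lock, forward propagate, memorize, unlock sequence might be sketched as follows. This uses standard C++11 (std::mutex and std::lock_guard) rather than the facilities of the original development platform, and the SharedNetwork type with its placeholder forward pass is purely hypothetical; it stands in for the real network only so that the locking pattern can be shown in isolation.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the network: forward propagation overwrites a
// single shared buffer of neuron outputs, as described in the text.
struct SharedNetwork
{
    std::mutex guard;                       // protects sharedOutputs
    std::vector<double> sharedOutputs;      // one entry per neuron

    explicit SharedNetwork( std::size_t neuronCount )
        : sharedOutputs( neuronCount, 0.0 )
    {
        assert( neuronCount > 0 );
    }

    // Placeholder for the real forward pass: every neuron output is simply
    // set to the pattern value so that the effect is easy to observe.
    void ForwardPropagateUnlocked( double pattern )
    {
        for ( std::size_t i = 0; i < sharedOutputs.size(); ++i )
            sharedOutputs[ i ] = pattern;
    }

    // Lock, forward propagate, copy ("memorize") the outputs, unlock.
    // The returned copy belongs to the calling thread and is unaffected by
    // later forward propagations performed by other threads.
    std::vector<double> ForwardPropagateAndMemorize( double pattern )
    {
        std::lock_guard<std::mutex> lock( guard );
        ForwardPropagateUnlocked( pattern );
        return sharedOutputs;   // copied before the lock is released
    }
};
```

Each worker thread would keep the vector it receives and use it, rather than the shared buffer, for its own backpropagation pass.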

Astute readers will note that this solution might solve the problem of equations (2) through (4), since those equations depend on the numerical values of the neuron outputs, but that this solution does not address additional problems caused by equations (5) and (6). These two equations depend on the values of the weights, and the values of the weights might be changed out from under one thread by another thread's backpropagation. For example, continuing the above explanation, the first thread starts to backpropagate, and changes the values of the weights in (say) the last two layers. At this point, there is a context switch to the second thread, which also tries to calculate equation (5) and also tries to change the values of weights according to equation (6). But since both equations depend on weights that have now been changed by the first thread, the calculations performed by the second thread are not valid.

The solution here is to ignore the problem, and explain why it's numerically valid to do so. It's valid to ignore the problem because the errors caused by using incorrect values for the weights are generally negligible. To see that the errors are negligible, let's use a numeric example. Consider a worst-case situation that might occur during very early training when the errors in the output layer are large. How large can the errors be? Well, the target value for the output of a neuron might be +1, whereas the actual output might be -1 which is as wrong as it can get. So, per equation (2), the error is 2. Let's backpropagate that error of 2. In equation (3) we see that the error is multiplied by G(x), which is the derivative of the activation function. The maximum value that G(x) can assume is around +1, so the error of 2 is still 2. Likewise, in equation (4), the value of 2 is multiplied by the outputs of neurons in the previous layer. But those outputs generally cannot exceed a value of +1 (since the outputs are limited by the activation function). So again, the error of 2 is still 2.

We need to skip equation (5) for a moment, since it's equation (6) that actually changes the values of the weights. In equation (6) we see that the change in the weight is our value of 2 multiplied by the learning rate eta (possibly multiplied by a second order Hessian). We know eta is small, since we deliberately set it to a small value like 0.001 or less. So, any weight might be changed by as much as our value of 2 multiplied by eta, or 2 x 0.001 = 0.002.

Now we can tackle equation (5). Let's assume that the second thread has just modified a weight as described above, which is now 0.002 different from the value that the first thread should be using. The answer here is a big "so what". It's true that the value of the weight has been changed, but the change is only 0.2%. So even though the first thread is now using a value for the weight that is wrong, the amount by which the weight is wrong is completely negligible, and it's justifiable to ignore it.

It can also be seen that the above explanation is a worst-case explanation. As the training of the neural network improves, the errors in each layer are increasingly smaller, such that it becomes even more numerically justifiable to ignore this effect of multithreading.

There's one final note on implementation in the code. Since the program is multithreaded, it cannot alter weights without careful synchronization with other threads. The nature of equation (6) is also such that one thread must be respectful of changes to the value of weight made by other threads. In other words, if the first thread simply wrote over the value of the weight with the value it wants, then changes made by other threads would be lost. In addition, since the weights are stored as 64-bit doubles, there would be a chance that the actual weight might be corrupt if a first thread were interrupted before it had a chance to update both of the 32-bit halves of the weight.

In this case, synchronization is achieved without a kernel-mode locking mechanism. Rather, synchronization is achieved with a compare-and-swap ("CAS") approach, which is a lock-free approach. This approach makes use of atomic operations, which are guaranteed to complete indivisibly: no other thread can observe or interrupt a partially completed operation. The function used is InterlockedCompareExchange64(), which performs an atomic compare-and-exchange operation on specified 64-bit values. The function compares a destination value with a comparand value and replaces it with an exchange value only if the two are equal. If the two are not equal, then that means that another thread has modified the value of the weight, and the code tries again to make another exchange. Theoretically (and unlike locking mechanisms for synchronization), the retry loop of a lock-free method is not guaranteed to complete in finite time for any one thread, but that's theory only and in practice lock-free mechanisms are widely used.

The code for updating the weight is therefore the following, which can be found in the NNLayer::Backpropagate() function. Note that the code given above in the "Backpropagation" section was simplified code that omitted this level of detail in the interests of brevity:

// simplified code showing an excerpt
// from the NNLayer::Backpropagate() function

// finally, update the weights of this layer
// neuron using dErr_wrt_dW and the learning rate eta
// Use an atomic compare-and-exchange operation,
// which means that another thread might be in 
// the process of backpropagation
// and the weights might have shifted slightly

// view the same 64 bits either as a double or as an integer
union DOUBLE_UNION
{
    double dd;
    unsigned __int64 ullong;
};

DOUBLE_UNION oldValue, newValue;

double epsilon, divisor;

for ( jj=0; jj<m_Weights.size(); ++jj )
{
    divisor = /* a value that depends on the second order Hessian */ ;
    epsilon = etaLearningRate / divisor;
    // amplify the learning rate based on the Hessian

    oldValue.dd = m_Weights[ jj ]->value;
    newValue.dd = oldValue.dd - epsilon * dErr_wrt_dWn[ jj ];
    while ( oldValue.ullong != _InterlockedCompareExchange64( 
           (unsigned __int64*)(&m_Weights[ jj ]->value), 
            newValue.ullong, oldValue.ullong ) ) 
    {
        // another thread must have modified the weight.
        // Obtain its new value, adjust it, and try again
        oldValue.dd = m_Weights[ jj ]->value;
        newValue.dd = oldValue.dd - epsilon * dErr_wrt_dWn[ jj ];
    }
}
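
For readers on a newer compiler, the same retry loop can be written with standard C++11 atomics instead of the Windows interlocked functions. The sketch below is an alternative formulation and is not what the program does; AtomicSubtract is a hypothetical name, and it assumes the weight is stored as a std::atomic<double>.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>

// Subtracts `delta` from an atomically stored weight using a
// compare-and-swap retry loop, so that concurrent updates made by other
// threads are never overwritten.
void AtomicSubtract( std::atomic<double>& weight, double delta )
{
    assert( std::isfinite( delta ) );   // a non-finite update would ruin the weight
    double oldValue = weight.load();
    // On failure, compare_exchange_weak stores the value currently held by
    // `weight` into oldValue, so the new value is recomputed on every retry.
    while ( !weight.compare_exchange_weak( oldValue, oldValue - delta ) )
    {
        // another thread changed the weight in the meantime; try again
    }
}
```

In a multithreaded run, each worker thread would call AtomicSubtract( weight, epsilon * gradient ) for every weight it wants to adjust, which plays the same role as the loop around _InterlockedCompareExchange64() shown above.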

I was very pleased with this speed-up trick, which showed dramatic improvements in speed of backpropagation without any apparent adverse effects on convergence. In one test, the machine had an Intel Pentium 4 hyperthreaded processor, running at 2.8 GHz. In single-threaded usage, the machine was able to backpropagate at around 13 patterns per second. In two-threaded use, backpropagation speed was increased to around 18.6 patterns per second. That's an improvement of around 43% in backpropagation speed, which translates into a reduction of 30% in backpropagation time.

In another test, the machine had an AMD Athlon 64 x2 Dual Core 4400+ processor, running at 2.21 GHz. For one-threaded backpropagation the speed was around 12 patterns per second, whereas for two-threaded backpropagation the speed was around 23 patterns per second. That's a speed improvement of 92% and a corresponding reduction of about 48% in backpropagation time.
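
The relationship between a gain in patterns per second and the resulting saving in time is easy to get wrong, so here is a small sketch of the arithmetic. The two function names are hypothetical; the only assumption is that the time to process a fixed number of patterns is inversely proportional to the throughput.

```cpp
#include <cassert>

// Fractional gain in throughput when going from `before` to `after`
// patterns per second (0.43 means 43% faster).
double SpeedImprovement( double before, double after )
{
    assert( before > 0.0 && after > 0.0 );
    return after / before - 1.0;
}

// Fractional reduction in the time needed to process a fixed number of
// patterns; time is inversely proportional to throughput.
double TimeReduction( double before, double after )
{
    assert( before > 0.0 && after > 0.0 );
    return 1.0 - before / after;
}
```

For the Pentium 4 figures, SpeedImprovement( 13.0, 18.6 ) is roughly 0.43 and TimeReduction( 13.0, 18.6 ) is roughly 0.30. For the Athlon figures, SpeedImprovement( 12.0, 23.0 ) is roughly 0.92 and TimeReduction( 12.0, 23.0 ) is roughly 0.478.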

As one final implementation note, my development platform is VC++ 6.0, which does not have the InterlockedCompareExchange64() function, not even as a compiler intrinsic. With help from the comp.programming.threads newsgroup (see this thread), I wrote my own assembler version, which I am reproducing here:

inline unsigned __int64 
_InterlockedCompareExchange64(volatile unsigned __int64 *dest,
                           unsigned __int64 exchange,
                           unsigned __int64 comparand) 
{
    //value returned in edx:eax
    __asm {
        lea esi,comparand;
        lea edi,exchange;
        mov eax,[esi];
        mov edx,4[esi];
        mov ebx,[edi];
        mov ecx,4[edi];
        mov esi,dest;
        //lock CMPXCHG8B [esi] is equivalent to the following except
        //that it's atomic:
        //ZeroFlag = (edx:eax == *esi);
        //if (ZeroFlag) *esi = ecx:ebx;
        //else edx:eax = *esi;
        lock CMPXCHG8B [esi];            
    }
}

Skip Backpropagation for Small Errors

For a decently-trained network, there are some patterns for which the error is very small. The idea of this speed-up trick, therefore, is that for such patterns where there is only a small error, it is frankly meaningless to backpropagate the error. The error is so small that the weights won't change. Simply put, for errors that are small, there are bigger fish to fry, and there's no point in spending time to backpropagate small errors.

In practice, the demo program calculates the error for each pattern. If the error is smaller than some small fraction of the current MSE, then backpropagation is skipped entirely, and the program proceeds immediately to the next pattern.

The small fraction is set empirically to one-tenth: if the error for a pattern is smaller than one-tenth of the overall MSE in the prior epoch, then backpropagation is skipped for that pattern.
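
The decision itself amounts to a single comparison, sketched below. The function names are hypothetical, and how the program computes the per-pattern error is not shown here; the sketch takes that error as an input and only illustrates the threshold test.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when the error for a pattern is large enough to be worth
// backpropagating, i.e. when it is not below the given fraction of the MSE.
bool ShouldBackpropagate( double patternError, double currentMSE,
                          double thresholdFraction )
{
    assert( patternError >= 0.0 && currentMSE >= 0.0 && thresholdFraction >= 0.0 );
    return patternError >= thresholdFraction * currentMSE;
}

// Counts how many of the given pattern errors would be backpropagated.
std::size_t CountBackpropagated( const std::vector<double>& patternErrors,
                                 double currentMSE, double thresholdFraction )
{
    std::size_t count = 0;
    for ( std::size_t i = 0; i < patternErrors.size(); ++i )
    {
        if ( ShouldBackpropagate( patternErrors[ i ], currentMSE, thresholdFraction ) )
            ++count;
    }
    return count;
}
```

With an MSE of 0.1 and a fraction of one-tenth, the cut-off is 0.01, so a pattern with an error of 0.005 would be skipped while one with an error of 0.02 would be backpropagated. Entering a very small MSE pushes the cut-off toward zero, so that essentially every pattern is backpropagated.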

The actual value of the fractional threshold is settable in the program's .ini file. The dialog for setting training parameters asks you to input the current MSE, so that it has some idea of the cut-off point between performing and not performing backpropagation. It advises you to input a small number for the MSE if you are unsure of the actual MSE, since for small MSE's, essentially all errors are backpropagated.

In testing, as the neural network gets increasingly well-trained, I have found that approximately one-third of the patterns result in an error so small that there's no point in backpropagation. Since each epoch is 60,000 patterns, there's no backpropagation of error for around 20,000 patterns. The remaining 40,000 patterns result in an error large enough to justify backpropagation.

On an Intel Pentium 4 hyperthreaded processor running at 2.8 GHz, when these two speed-up tricks are combined (i.e., the multithreading speed-up, combined with skipping of backpropagations for patterns that result in a small error), each epoch completes in around 40 minutes. It's still maddeningly slow, but it's not bad.

Experiences In Training the Neural Network

The default values for parameters found on the "Training" dialog are the result of my experiences in training the neural network for best performance on the testing set. This section gives an abbreviated history of those experiences.

Note: I only recently realized that I should have been tracking the overall MSE for both the training set and the testing set. Before writing this article, I only tracked the MSE of the training set, without also tracking the MSE of the testing set. It was only when I sat down to write the article that I realized my mistake. So, when you see a discussion of MSE, it's the MSE of the training set and not the MSE of the testing set.

My first experimentations were designed to determine whether there was a need for distortions while training, or if they just got in the way. Over many epochs without distortions, I was never able to get the error rate in the testing set below around 140 mis-recognitions out of the 10,000 testing patterns, or a 1.40% error rate. To me this meant that distortions were needed, since without them, the neural network was not able to effectively generalize its learned behavior when confronted with a pattern (from the testing set) that it had never before seen.

I am still not able to explain why Dr. LeCun was able to achieve an error rate of 0.89% without distortions. Perhaps one day I will repeat these experiments, now that I have more experience in training the network.

Anyway, having determined that distortions were needed to improve training and generalization, my experimentations turned toward the selection of "good" values for distortions of the training patterns. Too much distortion seemed to prevent the neural network from reaching a stable point in its weights. In other words, if the distortions were large, then at the end of each epoch, the MSE for the training set remained very high, and did not reach a stable and small value. On the other hand, too little distortion prevented the neural network from generalizing its result to the testing set. In other words, if the distortion was small, then the results from the testing set were not satisfactory.

In the end, I selected values as follows. Note that the values of all tune-able parameters are found in the .ini file for the program:

  • Maximum scale factor change (percent, like 20.0 for 20%) = 15.0
  • Maximum rotational change (degrees, like 20.0 for 20 degrees) = 15.0
  • Sigma for elastic distortions (higher numbers are more smooth and less distorted; Simard uses 4.0) = 8.0
  • Scaling for elastic distortions (higher numbers amplify distortions; Simard uses 0.34) = 0.5

Once these values were selected, I began to concentrate on the training progression, and on the influence of the learning rate eta. I set up a 24-hour run, starting from purely randomized weights, over which 30 epochs were completed. The training rate was varied from an initial value of 0.001 down to a minimum of 0.00001 over 20 epochs, and thereafter (i.e., for the final 10 epochs) was maintained at this minimum value. After training was completed, I ran the testing set to see how well the neural network performed.

The results were a very respectable 1.14% error rate in the set of testing patterns. That is, in the 10,000 pattern testing set, of which the neural net had never before seen any of the patterns, and had never before even seen the handwriting of any of the writers, the neural network correctly recognized 9,886 patterns and misrecognized only 114 patterns. For the set of training patterns, there were around 1880 mis-recognitions in the distorted patterns.

Naturally, although I was pleased that the neural network worked at all, I was not satisfied with this performance. Seven years earlier, Dr. LeCun had achieved an error rate of 0.82%. Certainly, given the passage of seven years, better results could be obtained.

So, I graphed backpropagation progress as a function of epoch. At each epoch, I graphed the learning rate eta, the MSE (of the training set), and the change in MSE relative to the previous epoch. Since the change in MSE bounced around a bit, I also graphed a smoothed version of delta MSE. Here are the results. Note that the right-hand axis is for MSE, whereas the left-hand axis is for everything else.

Backpropagation progress over 30 epochs starting from scratch

The graph shows very clearly that the neural network reaches a constant MSE quickly, after around 12-15 epochs, and then does not improve any further. The asymptotic value of MSE was 0.135, corresponding to around 1880 mis-recognized patterns in the distorted training set. There was no significant improvement in the neural network's performance after 12-15 epochs, at least not for the learning rates that were used for the latter epochs.

My first reaction, given the asymptotic nature of the curve, was one of concern: maybe the neural network was incapable of learning any better than this. So, I restarted the training with the initial learning rate of 0.001. But this time, instead of starting from purely random weights, I used the weights that resulted from the first 30 epochs. After another 15 hours of training, here is the graph of results:

Backpropagation progress over another 25 epochs

This graph was reassuring, since it showed that the neural network still had the capability to learn. It was simply a matter of teaching it correctly. At this point, the neural network had settled in on an MSE of around 0.113, corresponding to around 1550 mis-recognized patterns in the distorted training set. For the testing set, there were 103 mis-recognized patterns out of 10,000, which is an error rate of 1.03%. At least there was some improvement, which meant that the neural network could learn more, but not much improvement.

Looking at the graph, I decided that the learning rate should be set to a value that would allow the neural network to improve at a slow but deliberate pace. Arbitrarily, I tried a constant learning rate of 0.00015, which (as seen in the above graph) should have allowed the neural network to improve around 0.01 MSE every three or so epochs. More specifically, looking at the "Smoothed Delta MSE" graph above, at a learning rate of around 0.00015, the delta MSE per epoch was around 0.0033, which should have translated to an improvement of around 0.01 MSE after every three epochs.

The results were miserable. As shown in the following graph, even after an additional 15 epochs at the carefully chosen constant learning rate of 0.00015, there was no noticeable improvement in the network. For the set of training patterns, there were still around 1550 mis-recognized patterns; for the set of testing patterns, there was a slight improvement to 100 errors, for an error rate of exactly 1.00%:

Backpropagation progress over another 15 epochs at a constant learning rate of 0.00015

I was not certain of why the neural network failed to improve. Clearly the learning rate was too small, given the small size of the errors being produced by the network, which after all, had decent performance. So, I decided to increase the learning rate to a much larger value. I abandoned the idea of trying to estimate the change in MSE as a function of learning rate, and arbitrarily chose a starting value of eta = 0.0005. It was also clear that the minimum learning rate of 0.00001, used previously, was too small; instead, I chose a minimum learning rate of 0.0002.

On the other hand, I thought that I might be rushing the neural network, in the sense that I had been decreasing the learning rate after each and every epoch. After all, the training patterns were distorted; as a consequence, the neural network never really saw the same training pattern twice, despite the fact that the same patterns were used for all epochs. I decided to let the neural network stick with the same training rate for four epochs in a row, before decreasing it. This produced good results:

Backpropagation progress over another 35 epochs with stepped learning rate

As seen in the above graph, the neural network continued to learn for at least 15-20 epochs. The final MSE was somewhere around 0.102, corresponding to around 1410 mis-recognized characters in the distorted training set. On the testing set, there were only 94 errors (an error rate of 0.94%), so I had finally crossed the 1.00% threshold, and could see the possibility of meeting or exceeding the benchmark performance of Dr. LeCun's implementation.

But it had taken many, many epochs to get here. And each training cycle (except the first) had started with an already-trained network. I decided it was time to check whether the results could be reproduced, starting again from scratch, i.e., a blank neural network with purely random weights.

I again chose an initial learning rate of 0.001, but in keeping with the experiences to date, chose a minimum learning rate of 0.00005 (which is five times larger than the minimum learning rate in the first experiment). I also used a stepped decrease in the learning rate, as above, where the learning rate was kept at the same value for 4 epochs before it was reduced. I frankly did not like the math when these parameters were multiplied together: It would take 52 epochs to reach the minimum learning rate, and at 40 minutes per epoch, that would require at least 35 hours of processing time. (Because of the timing of overnight runs, I actually let it run for 63 epochs, or 42 hours.) I bit the bullet, hit the "Backpropagate" button, and waited. The results showed something interesting:
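
The arithmetic behind the 52 epochs and 35 hours can be sketched as follows. The function names are hypothetical, and the decay factor of 0.794183335 used in the note after the code is the default value for the rate of decay; the small tolerance in the rounding is an assumption of this sketch, added so that a result that is a whole number up to rounding error is not bumped up to the next integer.

```cpp
#include <cassert>
#include <cmath>

// Number of multiplicative reductions needed to bring the learning rate
// from etaInitial down to etaMin when each reduction multiplies by `decay`.
int ReductionsToReachMinimum( double etaInitial, double etaMin, double decay )
{
    assert( etaInitial > etaMin && etaMin > 0.0 && decay > 0.0 && decay < 1.0 );
    const double exact = std::log( etaMin / etaInitial ) / std::log( decay );
    // tolerate floating-point noise when the exact answer is a whole number
    return static_cast<int>( std::ceil( exact - 1e-3 ) );
}

// Total training time in hours when the rate is held for epochsPerStep
// epochs between reductions and each epoch takes minutesPerEpoch minutes.
double TrainingHours( int reductions, int epochsPerStep, double minutesPerEpoch )
{
    assert( reductions >= 0 && epochsPerStep > 0 && minutesPerEpoch > 0.0 );
    return reductions * epochsPerStep * minutesPerEpoch / 60.0;
}
```

Going from 0.001 to 0.00005 with a factor of 0.794183335 takes 13 reductions. Holding the rate for 4 epochs at a time gives 13 x 4 = 52 epochs, or a little under 35 hours at 40 minutes per epoch; holding it for 2 epochs at a time gives 26 epochs, or a little over 17 hours.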

Backpropagation progress over 63 epochs, starting from scratch, with a stepped learning rate

First off, it should be apparent that the network trained itself well. The final MSE for the training set settled in at around 0.102, corresponding to around 1440 mis-recognitions in the distorted patterns of the training set. More importantly, mis-recognitions for the testing set were quite good: there were only 86 mis-recognitions out of the 10,000 patterns in the testing set, for an error rate of 0.86%. Dr. LeCun's benchmark was in sight.

The interesting thing about the graph is the "bump" in the MSE that seems to appear after each time that the learning rate is decreased. Look carefully: there's a kind of periodicity to the "Delta MSE" curve, with a spike (which represents a good reduction in MSE) occurring right after the learning rate is decreased. I theorized that the neural network needs to see the same learning rate for more than one epoch, but does not need to see it for as many as four epochs. After more experimentation, I concluded that the learning rate needs to be constant for only two epochs. This is good news for training, since it means that the neural network can be trained, reproducibly from scratch, to sub-one-percent errors in less than around 15-18 hours.

All together, based on the above experimentation, these are the defaults that you see in the "Backpropagation" dialog:

  • Initial learning rate (eta) = 0.001
  • Minimum learning rate (eta) = 0.00005
  • Rate of decay for learning rate (eta) = 0.794183335
  • Decay rate is applied after this number of backprops = 120000

The 0.86% error rate obtained above is very good, but it's still not as good as the 0.74% error rate in the title of this article. How did I get the error reduced further?

After some reflection on the results obtained so far, it appeared to me that there was still an apparent paradox in the error rates. On the training set, which the neural network actually sees, over and over again with only slight distortions, the network was not able to perform better than 1440 mis-recognitions out of 60,000 patterns, which works out to an error rate of 2.40%. That's nearly three times larger than the error rate of 0.86% for the testing set, which the neural network had never even seen. I theorized that while the distortions were clearly needed to force the neural network to generalize (remember: without distortions I was not able to achieve anything better than a 1.40% error rate on the testing set), after a while, they actually hampered the ability of the neural network to recognize and settle in on "good" weights. In a sense, the network might benefit from a final "polishing" with non-distorted patterns.

That's exactly what I did. First I trained (as above) with the distorted training set until I had obtained an error rate of 0.86% on the testing set. Then, I set the learning rate to a constant value of 0.0001, which is small but is still twice as large as the minimum value of 0.00005. At this learning rate, I ran non-distorted training patterns for five "polishing" epochs. The selection of five epochs was somewhat arbitrary/intuitive: at five epochs, the neural network showed 272 errors out of the 60,000 patterns in the training set, for an error rate of 0.45%. This value for error rate seemed appropriate, since it was smaller than the error rate on the testing set, yet not so small that all of the good generalizations introduced by distortions might be undone.

The "polishing" worked well, and resulted in the 0.74% error rate advertised above.


All 74 errors made by the neural network are shown in the following collage. For each pattern, the sequence number is given, together with an indication of the error made by the network. For example, "4176 2=>7" means that for pattern number 4176, the network misrecognized a 2, and thought it was a 7:

Collage showing all 74 errors made by the neural network

For some of these patterns, it's easy to see why the neural network was wrong: the patterns are genuinely confusing. For example, pattern number 5937, which is purported to be a 5, looks to me like a 3, which is what the neural network thought too. Other patterns appear to have artifacts from pre-processing by the MNIST authors, and it's the presence of these artifacts that confused the network. For example, pattern number 5457, which is clearly a 1, also has artifacts on the right-hand side, possibly caused when the pattern was separated from an adjacent pattern. Other patterns, primarily in the 8's, have too much missing from the bottom of the patterns.

But for the vast majority, the neural network simply got it wrong. The patterns aren't especially confusing. Each of the 7's looks like a 7, and there's no good reason why the neural network failed to recognize them as 7's.

Perhaps more training, with a larger training set, would help. I'm inclined to believe, however, that the network is at a point of diminishing returns with respect to training. In other words, more training might help some, but probably not much.

Instead, I believe that a different architecture is needed. For example, I think that the input layer should be designed with due recognition for the fact that the human visual system is very sensitive to spatial frequency. As a result, I believe that better performance could be obtained by somehow converting the input pattern to a frequency domain, which could be used to supplement the purely spatial input now being given as input.
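The text does not say how the conversion to a frequency domain would be done. One possible reading, shown purely as an assumption, is to compute the magnitude of a two-dimensional discrete Fourier transform of the input pattern and feed it to the network alongside the pixels. The sketch below uses a naive transform for clarity rather than speed, and assumes a rectangular image.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<double>>;  // rows of pixel intensities

// Naive 2-D discrete Fourier transform; returns the magnitude |F(u,v)| for
// every frequency pair. Cost is O(N^4), acceptable only for small patterns.
// Assumes every row of img has the same length.
Image dftMagnitude(const Image& img) {
    assert(!img.empty() && !img[0].empty());
    const std::size_t rows = img.size(), cols = img[0].size();
    const double twoPi = 2.0 * std::acos(-1.0);
    Image mag(rows, std::vector<double>(cols, 0.0));
    for (std::size_t u = 0; u < rows; ++u) {
        for (std::size_t v = 0; v < cols; ++v) {
            double re = 0.0, im = 0.0;
            for (std::size_t y = 0; y < rows; ++y) {
                for (std::size_t x = 0; x < cols; ++x) {
                    const double angle =
                        twoPi * (static_cast<double>(u * y) / rows +
                                 static_cast<double>(v * x) / cols);
                    re += img[y][x] * std::cos(angle);
                    im -= img[y][x] * std::sin(angle);
                }
            }
            mag[u][v] = std::sqrt(re * re + im * im);
        }
    }
    return mag;
}
```

For a uniformly gray pattern, all of the energy lands in the zero-frequency term `mag[0][0]`, while strokes and edges spread energy into the higher frequencies.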



Here, in one place, is a list of all the articles and links mentioned in the article, as well as a few extra articles that might be helpful. Clicking the link will open a new window.


License and Version Information

The source code is licensed under the MIT X11-style open source license. Basically, under this license, you can use/modify the code for almost anything you want. For a comparison with the BSD-style license and the much more restrictive GNU GPL-style license, read this Wikipedia article: "BSD and GPL licensing".

Version information:

  • 26 November 2006: Original release of the code and this article.




About the Author

Mike O'Neill

United States


Mike O'Neill is a patent attorney in Southern California, where he specializes in computer and software-related patents. He programs as a hobby, and in a vain attempt to keep up with and understand the tech




MEncoder

From Wikipedia, the free encyclopedia
Operating system: Cross-platform
License: GNU General Public License

MEncoder is a free command line video decoding, encoding and filtering tool released under the GNU General Public License. It is a sibling of MPlayer, and can convert all the formats that MPlayer understands into a variety of compressed and uncompressed formats using different codecs.[1]

MEncoder is included in the MPlayer distribution.




As it is built from the same code as MPlayer, it can read from every source which MPlayer can read, decode all media which MPlayer can decode and it supports all filters which MPlayer can use. MPlayer can also be used to view the output of most of the filters (or of a whole pipeline of filters) before running MEncoder. If the system is not able to process this in realtime, audio can be disabled using -nosound to allow a smooth review of the video filtering results.

It is possible to copy audio and/or video unmodified into the output file to avoid the quality loss caused by re-encoding. This is useful, for example, to modify only the audio or only the video, or to put unmodified audio/video data into a different container format.

Because it uses the same code as MPlayer, it features the same large set of highly configurable video and audio filters for transforming the video and audio streams. Filters include cropping, scaling, vertical flipping, horizontal mirroring, expanding to create letterboxes, rotating, brightness/contrast, changing the aspect ratio, colorspace conversion, hue/saturation, color-specific gamma correction, filters for reducing the visibility of compression artifacts caused by MPEG compression (deblocking, deringing), automatic brightness/contrast enhancement (autolevel), sharpness/blur, denoising filters, several ways of deinterlacing, and reversing telecine.

Frame rate conversions and slow-motion

Changing the frame rate is possible using the -ofps or -speed options and by using the framestep filter for skipping frames. Reducing the frame rate can be used to create fast-motion "speed" effects which are sometimes seen in films.
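To see why skipping frames produces a fast-motion effect, consider what happens to the running time. The sketch below is a conceptual illustration only, with hypothetical function names; it does not reproduce how MEncoder's framestep filter is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keeps every step-th frame (indices 0, step, 2*step, ...) and drops the rest.
template <typename Frame>
std::vector<Frame> keepEveryNth(const std::vector<Frame>& frames, std::size_t step) {
    assert(step >= 1);
    std::vector<Frame> kept;
    for (std::size_t i = 0; i < frames.size(); i += step) {
        kept.push_back(frames[i]);
    }
    return kept;
}

// Playback duration in seconds of frameCount frames shown at fps frames/second.
double durationSeconds(std::size_t frameCount, double fps) {
    assert(fps > 0.0);
    return static_cast<double>(frameCount) / fps;
}
```

If 100 frames recorded at 25 frames per second are thinned to every second frame and still played at 25 frames per second, the same action takes 2 seconds instead of 4, so it appears twice as fast.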

Doubling the frame rate of interlaced footage without duplicating or morphing frames is possible using the tfields filter to create two different frames from each of the two fields in one frame of interlaced video. This allows playback on progressive displays, while preserving the full resolution and framerate of interlaced video, unlike other deinterlacing methods. It also makes the video stream usable for framerate conversion, and creating slow-motion scenes from streams taken at standard video/TV frame rates, e.g. using cheap consumer camcorders. If the filter gets incorrect information about the top/bottom field order, the resulting output will have juddering motion, because the two frames created would be displayed in the wrong order.
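The role of field order can be illustrated with a small sketch of field separation. This is a simplified model written for this explanation, not the tfields implementation: it only splits the rows, whereas a real filter would also have to bring each field back to full frame height. The names `splitFields`, `Row`, and `Frame` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Row   = std::vector<unsigned char>;
using Frame = std::vector<Row>;  // rows from top to bottom

// Splits one interlaced frame into its two fields and returns them in
// display order. Even rows (0, 2, ...) form the top field.
std::pair<Frame, Frame> splitFields(const Frame& interlaced, bool topFieldFirst) {
    assert(interlaced.size() % 2 == 0);
    Frame top, bottom;
    for (std::size_t y = 0; y < interlaced.size(); ++y) {
        (y % 2 == 0 ? top : bottom).push_back(interlaced[y]);
    }
    return topFieldFirst ? std::make_pair(top, bottom)
                         : std::make_pair(bottom, top);
}
```

Passing the wrong value for `topFieldFirst` swaps the two output frames, so each pair is shown in reverse temporal order; that is the juddering motion described above.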

See also

  • MPlayer, the media player built from the same source code as MEncoder
  • FFmpeg, similar to MEncoder
  • MediaCoder, a media transcoding application for Windows OSs, uses MEncoder as one of its backends
  • Transcode, a command-line transcoding application for Unix-like OSs
  • MPlayer Wikibook: almost all decoding-related and filtering arguments are shared with MEncoder
  • RetroCode, a universal mobile content encoder/decoder


  1. MPlayer and MEncoder: Status of codecs support. Retrieved on 2009-07-19.


Microsoft DirectX 9.0 DirectShow Korean Description



DirectShow

Microsoft DirectX 9.0


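The Microsoft® DirectShow® application programming interface is a media-streaming architecture for the Microsoft® Windows® platform. With DirectShow, applications can perform high-quality video and audio playback and capture.

The DirectShow documentation is divided into the following topics.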


Windows Development




This documentation provides information about developing applications and drivers for the Windows operating system. The Win32 and COM application programming interfaces (APIs) were designed primarily for development in C and C++, and support development for both 32- and 64-bit Windows. For more information, see these Dev Centers: Windows Desktop Development and Windows Hardware Development.

The .NET Framework also provides programming interfaces for developing Windows applications, components, and controls. These programming interfaces can be used with a variety of programming languages, including Visual Basic, C#, and C++. For more info, see the .NET Development documentation.

You can also develop Metro style apps. For more info, see Metro style app development.


Hilo: Developing C++ Applications for Windows 7


"Hilo" is a series of articles and sample applications that show how you can leverage the power of Windows 7, using the Visual Studio 2010 and Visual C++ development systems to build high performance, responsive rich client applications. Hilo provides both source code and written guidance to help you design and develop Windows applications of your own.

The series covers many topics, including the key capabilities and features of Windows 7, the design process for the user experience, and application design and architecture. Source code is provided so that you can see firsthand how the accompanying sample applications were designed and implemented. You can also use the source code in your own projects to produce your own rich, compelling applications for Windows 7. The Hilo sample applications are designed for high performance and responsiveness and are written entirely in C++ using Visual C++.

These articles describe the design and implementation of a set of touch-enabled applications that allow you to browse, select, and work with images. They will illustrate how to write applications that leverage some of the powerful capabilities that Windows 7 provides. You will see how the various technologies for Windows 7 can be used together to create a compelling user experience.

In This Section

Topic Description

Chapter 1: Introducing Hilo

The first Hilo sample application—the Hilo Browser—implements a touch-enabled user interface for browsing and selecting photos and images.

Chapter 2: Setting up the Hilo Development Environment

This article outlines how to set up a workstation for the development environment so that you can compile and run the Hilo Browser sample application.

Chapter 3: Choosing Windows Development Technologies

This article describes the rationale for the choice of the development technologies used to implement the Hilo applications.

Chapter 4: Designing the Hilo User Experience

This article describes the process and thoughts when developing the Hilo User Experience.

Chapter 5: The Hilo Common Library

This article introduces the Hilo Common Library, a lightweight object-oriented library that helps create and manage Hilo-based application windows and handle messages sent to them.

Chapter 6: Using Windows Direct2D

This article describes how hardware accelerated Direct2D and DirectWrite are used in the Hilo sample application.

Chapter 7: Using Windows Animation Manager

This article explores the Windows 7 Windows Animation Manager, which handles the complexities of image changes over time.

Chapter 8: Using Windows 7 Libraries and the Shell

Files can be accessed through a single logical location according to their type, even though they are stored in many different locations.

Libraries are user defined collections of content that are indexed to enable faster search and sorting. Hilo uses the Windows 7 Libraries feature to access the user’s images.

Chapter 9: Introducing Hilo Annotator

This article describes the Hilo Annotator application, which allows you to crop, rotate, and draw on the photographs you have selected. Hilo Annotator uses the Windows Ribbon Control to provide easy access to the various annotation functions, and the Windows Imaging Component to load and manipulate the images and their metadata.

Chapter 10: Using the Windows Ribbon

This article examines the use of the Windows Ribbon control, which is designed to help users find, use, and understand available commands for a particular application in a way that’s more natural and intuitive than menu bars or toolbars.

Chapter 11: Using the Windows Imaging Component

In this article you will learn how the Windows 7 Imaging Component is used in the Hilo Browser and Annotator applications. The Windows 7 Imaging Component (WIC) allows you to load and manipulate images and their metadata. The WIC Application Programming Interface (API) has built-in component support for all standard formats. In addition, the images created by the WIC can be used to create Direct2D bitmaps so you can use Direct2D to change images.

Chapter 12: Sharing Photos with Hilo

In this article, we’ll describe how the Hilo applications have been extended to allow you to share photos via an online photo sharing site. To do this, Hilo uses the Windows 7 Web Services Application Programming Interface (WWSAPI). The Hilo Browser application has also been updated to provide additional user interface (UI) and touch screen features, and the Hilo Annotator application has been extended to support Windows 7 Taskbar Jump Lists. This chapter provides an overview of these new features.

Chapter 13: Enhancing the Hilo Browser User Interface

In the final version of Hilo, the Annotator and Browser applications provide a number of enhanced user interface (UI) features. For example, the Hilo Browser now provides buttons to launch the Annotator application, to share photos via Flickr, and touch screen gestures to pan and zoom images. In this chapter we will see how these features were implemented.

Chapter 14: Adding Support for Windows 7 Jump Lists & Taskbar Tabs

The Hilo Browser and Annotator support Windows 7 Jump Lists and taskbar tabs. Jump Lists provide the user with easy access to recent files and provide a mechanism to launch key tasks. Taskbar tabs provide a preview image and access to additional actions within the Windows taskbar. In this Chapter we will see how the Hilo Browser and Annotator applications implement support for Windows 7 Jump Lists and taskbar tabs.

Chapter 15: Using Windows HTTP Services

The Hilo Browser application allows you to upload photos to the Flickr online photo sharing application. To do this, Hilo uses Windows HTTP Services. This chapter will explore how this library is used in the Hilo Browser to implement its photo sharing feature.

Chapter 16: Using the Windows 7 Web Services API

The Hilo Browser application allows you to share your photos via Flickr by using the Share dialog. The previous chapter showed how the Share dialog uses the Windows HTTP Services API to upload the selected photos to Flickr using a multi-part HTTP POST request. Before the photo can be uploaded the Hilo Browser must first be authenticated with Flickr by obtaining a session token (called a frob), and then authorized to upload photos by obtaining an access token. To accomplish these two steps, Hilo Browser uses the Windows 7 Web Services Application Programming Interface (WWSAPI) to access Flickr using web services. In this chapter we will explore how the Hilo Browser uses this library.


Additional Resources

This series is targeted at C++ developers. If you are a C++ developer but are not familiar with Windows development, you may want to check out the Learn to Program for Windows in C++ series of articles.

These articles are available as a single, downloadable PDF.

The code of the sample Hilo applications is available in the Hilo project on MSDN Code Gallery.


How to Write a Windows Application with Hilo




“Hilo” is a series of articles and sample applications that show how you can program for Windows 7 using Visual Studio 2010 and Visual C++. The end goal of this series is to enable developers to create high performance client applications that are responsive and attractive. Hilo provides both source code and the written guidance that will help you design and develop compelling, touch-enabled Windows applications of your own. If you want to read the documentation offline, download the whitepaper for Developing Hilo from the Hilo Code Gallery page.

A series of articles are now on MSDN that describe the design and implementation of a set of touch-enabled Windows applications that allow you to browse, select, and work with photos and images. The articles cover key Windows 7 technologies, describe how they are used together to create a compelling user experience, and detail the design and implementation of the applications themselves. You can find the Hilo articles on the MSDN page Hilo: Developing C++ Applications for Windows 7 and you can find the Hilo sample code on the Hilo Code Gallery page. The first article provides an overview of Hilo and describes the goals of the articles and sample applications in the series and the rest go into details of the design and coding practices used throughout the release.

The Hilo article series, along with the sample application source code, is intended to jumpstart your development and show you how to take advantage of key Windows capabilities in your own applications.

The following sections will describe the Hilo applications as they become available.

The Hilo Browser

This first release of Hilo contains the source code for the Hilo Browser application. This application implements an innovative carousel-style navigation user interface. It’s touch-enabled so you can quickly browse and select images using touch gestures. Download the source code for the Hilo browser.

Read the Hilo articles on MSDN to learn more about how to program for Windows with Hilo.

The Hilo Annotator

The Hilo sample applications allow you to browse, annotate, and share photographs and images. The latest Hilo source code release includes the Hilo Annotator application, which allows you to crop, rotate, and draw on the photographs you have selected. Download the source code for the Hilo Annotator.

The Hilo Annotator uses the Windows Ribbon Control to provide easy access to the various annotation functions, and the Windows Imaging Component to load and manipulate the images and their metadata.

Share Your Photos With Hilo

The Hilo sample applications allow you to browse, annotate, and share photographs and images. Previous articles in this series have described the design and implementation of the Hilo Browser, which allows you to browse and select images using a touch-enabled user interface, and the Hilo Annotator, which allows you to crop, rotate, and draw on the photographs you have selected. Now, in the third release of Hilo, we've added support for sharing your photo via Flickr! You can download the source code for the Hilo applications from here.

We've also added support for Windows 7 taskbar and jump lists!


Windows SDK Archive




Windows SDK for Windows 7 and .NET Framework 3.5 SP1

Released in August 2009, this SDK provides Windows 7 headers, libraries, documentation, samples, and tools (including VS 2008 SP1 C++ compilers) to develop applications for Windows 7, Windows XP, Windows Server 2003, Windows Vista, Windows Server 2008, and .NET Framework versions 2.0, 3.0, and 3.5 SP1. It is available in ISO and Web setup form.

Windows SDK for Windows Server 2008 and .NET Framework 3.5

Released in February 2008, this SDK provides the documentation, samples, header files, libraries, and tools (including C++ compilers) that you need to develop applications to run on Windows Server 2008 and the .NET Framework 3.5. To build and run .NET Framework applications, you must have the corresponding version of the .NET Framework installed.

Microsoft Windows SDK Update for Windows Vista and .NET Framework 3.0

Released in February 2007, this SDK provides documentation, samples, header files, libraries, and tools you need to develop applications that run on Windows. This release of the SDK supplies updated compilers and documentation. The updated compilers are the same ones that recently shipped in Visual Studio 2005 Service Pack 1. This SDK also includes the samples, tools, headers, and libraries that shipped in the Windows SDK for Vista in November, 2006.


Learn Windows: Audio and Video Development



Get Started with Windows Audio and Video Development

Get Started with Media Foundation
Media Foundation enables you to produce high quality multimedia experiences. This link will lead you to MSDN documentation on what Media Foundation is and has a high level overview of the developer content. Click the following link to learn more about supported media formats in Media Foundation.

Get Started with Core Audio
Core Audio is the preferred way to add support for low-latency, well-featured, and glitch-resistant audio in the latest Windows operating systems. This link will lead you to an overview of Core Audio highlighting its features.

Getting Started with DirectShow
DirectShow is an older end-to-end media pipeline, which supports playback, audio/video capture, encoding, DVD navigation and playback, analog television, and MPEG-2. This link will lead you to the DirectShow introduction that will lead you through the process of developing DirectShow applications. Click the following link to learn more about supported media formats in DirectShow.

Windows SDK
The Microsoft Windows SDK is a set of tools, code samples, documentation, compilers, headers, and libraries that developers can use to create applications that run on Microsoft Windows operating systems using native (Win32) or managed (.NET Framework) programming models. Download the Windows SDK to add support for Windows media formats and develop Windows applications.

Integrate Windows Audio and Video into your Application

Media Foundation Programming Guide
Read the developer documentation covering the best practices and key developer scenarios for developing applications that feature the latest Windows multimedia APIs.

Core Audio Programming Guide
Read the developer documentation covering developer scenarios for creating applications that feature the latest Windows audio APIs.

DirectShow Programming Guide
Read the developer documentation covering DirectShow.

Media Foundation SDK Samples
Gives descriptions of the Media Foundation samples that ship with the Windows SDK.

SDK Samples that use the Core Audio APIs
Lists the samples that ship with the Windows SDK and demonstrate how to use Core Audio.

DirectShow Samples
Lists the various samples that ship with the Windows SDK and demonstrate how to use DirectShow.

Learn More about Windows Audio and Video Development

Media Foundation Development Forum
Discuss development of multimedia applications.

Windows Pro-Audio Application Development Forum
Discuss topics and ask questions related to DAWs, Plug-Ins, Soft-Synths, and more.

DirectShow Development Forum
Discuss the use of the DirectShow API.

Other Resources

The Green Button

This site is the official Windows Media Center community portal. It contains developer forums for Windows Media Center and Microsoft TV Technologies.

Windows Media Licensing Administration (WMLA) License Request Form

Use this form to request licenses for Windows Media technologies, including PlayReady and Windows Media Rights Manager SDK.

Windows Media Foundation Team Blog

This blog provides in-depth information about Media Foundation programming.

Windows Media

Using MFTrace to Trace Your Own Code
Now that you know how to trace Media Foundation and analyze those traces to figure out what Media Foundation is doing, the next step is to figure out what your own code is doing. That means adding traces for MFTrace to your code. The simplest way to...
Automating Trace Analysis
In our last post, we discussed how to examine Media Foundation traces manually, using TextAnalysisTool.NET. This time, we present a few scripts to automate the process and identify problems more quickly. Getting Started First, install Perl, Graph...
Analyzing Media Foundation traces
As we mentioned in our previous blog post, log files generated by MFTrace quickly become huge, and it can be difficult to sort out interesting traces from background noise. This blog post will share a few tips to help analyze those log files. TextAn...

Develop great Metro style apps for Windows 8



Download the Windows 8 Release Preview

Experience the newest version of Windows and see for yourself how apps are at the center of the Windows 8 experience.


Download the tools and SDK

Get the tools to build Metro style apps for Windows 8. Our free download includes Microsoft Visual Studio Express 2012 RC for Windows 8 and Blend for Visual Studio to help jumpstart your project.

Explore the documentation

Our docs are optimized to make you more productive. Discover everything you need to plan, build, and sell great apps.

Read the developer guide

Windows 8 Release Preview has many powerful features for developers. Discover the new features for Desktop, Web, and Metro style app developers.

Build apps with the experts at Windows Dev Camps

Dev Camps are free events that bring together developers like you to learn more about building apps. Learn interactively and get advice from expert app developers in hundreds of locations around the world.


DirectX Developer Center




Windows Developer Center – Developing Games
The DirectX Developer Center will soon become part of the Windows Development Center. Over the next few months, expect most of the...
GDC 2012 Presentations Now Available
Presentations from the Microsoft Developer Day at the Game Developers Conference 2012 have been posted to the Microsoft Download ...
Where is the DirectX SDK?
As of the Windows 8 Developer Preview, the DirectX SDK is now part of the Windows SDK. For more details on the transition, visit t...

3D Space Flight Demo for Android OS, with Accelerometer-Based Controls




By | 8 Jun 2012 | Article
Simulation of flight in 3D space using OpenGL. The accelerometer-based UI is of particular interest.
Prize winner in Competition "Best Mobile article of May 2012"


Figure 1: Image of the program running on a Motorola Droid Bionic XT875


This article describes a modest 3D game ("Demofox") written for devices running the Android OS. While the scope of this game is limited (it has two levels and no enemies to fight), the code provided does address some difficult issues effectively, particularly in the area of user input. 

The objective of game play is to direct a spaceship forward through a continuous series of obstacles (rectangular prisms) without contacting them. Forward flight (positive motion in the "Z" dimension) takes place at all times. The user is able to effect position changes in the X and Y dimensions, within limits. The perspective shown to the user is first person, with the spaceship itself not visible.
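The movement rule described above can be summarized in a few lines. This is a minimal sketch under stated assumptions: the structure, the function name, and the idea of a tunnel centered on the origin are illustrative choices and are not taken from the Demofox source.

```cpp
#include <algorithm>
#include <cassert>

struct ShipState {
    double x, y, z;
};

// Advances the ship by one time step. Z always increases; X and Y follow the
// user's input but are clamped to the tunnel's half-width and half-height.
ShipState advance(ShipState s, double dx, double dy, double forwardStep,
                  double halfWidth, double halfHeight) {
    assert(forwardStep > 0.0 && halfWidth > 0.0 && halfHeight > 0.0);
    s.x = std::max(-halfWidth,  std::min(halfWidth,  s.x + dx));
    s.y = std::max(-halfHeight, std::min(halfHeight, s.y + dy));
    s.z += forwardStep;
    return s;
}
```

Whatever the user does, `z` grows by `forwardStep` on every call, while an over-large `dx` or `dy` simply pins the ship against the tunnel wall.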

Figure 2 below shows a model of the game world. For clarity, this figure is rendered using parallel projection, whereas the game itself uses perspective projection with a vanishing point. As shown, movement is constrained to the interior of a long, tunnel-like rectangular solid of indefinite length in the "Z" dimension.

The obstacles faced by the user are rectangular prisms occupying discrete channels. Prisms reside completely within these channels, of which there are 16 in Figure 2. Though this is not shown in the model, prisms can touch one or more other prisms; more formally, the space between such prisms is zero.

The universe within which movement takes place in the model figure below is four prisms wide by four prisms high. In the real program as supplied, the universe is defined by a 14 x 14 matrix of channels, each of which contains random prisms of constant size, interspersed with the open areas through which the player must direct the spaceship as it moves forward.
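One way to model the channel grid and test for contact is sketched below. It is an illustration only: the `Prism` structure, the function names, and the treatment of the ship as a single point are assumptions made for brevity, not details of the supplied program.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

const int kChannelsPerSide = 14;  // the supplied program uses a 14 x 14 grid

// A prism occupies one channel over the Z interval [zStart, zStart + length).
struct Prism {
    int column, row;
    double zStart, length;
};

// Maps a coordinate in [0, extent) to a channel index in [0, 14).
int channelIndex(double coordinate, double extent) {
    assert(extent > 0.0 && coordinate >= 0.0 && coordinate < extent);
    const int index = static_cast<int>(coordinate / extent * kChannelsPerSide);
    return std::min(index, kChannelsPerSide - 1);  // guard against rounding
}

// True if a point-sized ship at (x, y, z) lies inside any prism.
bool hitsPrism(const std::vector<Prism>& prisms, double x, double y, double z,
               double width, double height) {
    const int column = channelIndex(x, width);
    const int row = channelIndex(y, height);
    for (const Prism& p : prisms) {
        if (p.column == column && p.row == row &&
            z >= p.zStart && z < p.zStart + p.length) {
            return true;
        }
    }
    return false;
}
```

Because every prism sits wholly inside one channel, the contact test reduces to comparing two integer channel indices and one Z interval.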

Figure 2: Model of game universe, rendered using parallel projection

As the user proceeds successfully forward, the score shown in the text control immediately above the main game display increments. When the user contacts an obstacle, the ship is pushed backwards slightly, and the score resets to zero. At this point, if the user has exceeded the previous high score, three short vibrations are emitted by the device as a notification mechanism. The high score resets to zero only when the device is powered off, or if the "Demofox" task is forcibly terminated. It is displayed at the very top of the game UI.
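The scoring behavior just described amounts to a small piece of state. The class below is a hypothetical sketch of that logic; the names are illustrative, and the actual vibration call is left to the caller.

```cpp
#include <cassert>

// Hypothetical score keeper; names are illustrative, not from the Demofox source.
class ScoreKeeper {
public:
    // Called as the ship moves forward without contact.
    void onProgress(int points) {
        assert(points >= 0);
        score_ += points;
    }

    // Called on contact with an obstacle. Returns true if the run that just
    // ended beat the previous high score, i.e. the device should vibrate.
    bool onCollision() {
        const bool newHigh = score_ > highScore_;
        if (newHigh) {
            highScore_ = score_;
        }
        score_ = 0;
        return newHigh;
    }

    int score() const { return score_; }
    int highScore() const { return highScore_; }

private:
    int score_ = 0;
    int highScore_ = 0;  // kept only in memory, so it is lost when the task ends
};
```

Holding the high score only in memory matches the behavior described above, where it resets when the device is powered off or the task is terminated.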

At program startup, the user is asked to select between touchscreen and accelerometer-based play, and also to select between two levels. The button-based GUI that prompts the user for these pieces of information is landscape-oriented, as this is the mode of play that is preferred [1]. An image of this button-based GUI is shown below:

Figure 3: Button-based startup GUI, running on a Motorola Droid 4

The second level has a different background from the first level, and also has more obstacles. Accordingly, the user's score accrues more quickly in the second level. Figure 1 showed first level game play; a picture of second level game play is shown below:

Figure 4: Image of the program running on a Motorola Droid Bionic XT875 (Level 2)


In general, libraries have been selected with maximum portability in mind. Only the "ES1.0" version of OpenGL is required, for example, and the Android version targeted is 2.1 or better. Similarly, touchscreen user input is supported, since some devices do not support accelerometer input. This is true of the emulator provided with the Android SDK, for example.

Despite its rudimentary nature, this emulator is fully supported by the code and binary files included with this article. A picture of the program running in the emulator is shown below:

Figure 5: Image of the program running on the Android SDK emulator


Two ZIP archives are provided at the top of the article: a "demo" archive and a "source" archive. The demo archive contains an ".apk" file. This is similar to an ".exe" file in the Windows OS, in that it contains executable code with some embedded resources (images). To run the Demofox game, it is necessary to copy the ".apk" file to some folder on the Android device file system. This can generally be done by connecting the device to a PC via USB, at which point Windows will give the user the option of browsing the Android file system using Explorer.

After the file has been copied and the Android device has been disconnected, browse to the Android folder where the ".apk" file is located and tap it. The OS will give you the option of installing the application. It is necessary to enable a setting called "Unknown sources" / "Allow installation of non-Market applications" (or similar) in order to install an application in this way, but most versions of Android will allow you to change this setting interactively as needed when you attempt to install a non-market ".apk" file.

User Input

The simple game described in the last section is mildly amusing, but the area in which the author believes it really excels is that of user input. While the overall application described here is an arcade-style game, the user input techniques described should prove applicable to a wide variety of flight simulators of varying complexity and purpose.

The supplied program exposes two user interfaces: an accelerometer-based (or motion-sensing) user interface and a touchscreen user interface. Accelerometer-based input devices have grown explosively in number since the introduction of the Nintendo Wii in late 2006. In particular, devices running the iOS and Android operating systems almost universally have multiple accelerometers. All of these devices offer the ability to construct a user experience that is closer to the real experience being modeled than is possible with a mouse, keyboard, or joystick.

The accelerometer-based user interface employed here is an example. It operates in two modes. One mode of operation allows the user to hold the device (e.g. a smart phone) like the control yoke of an airplane. Two such yokes are featured prominently in the figure below. The user must pull back to point the nose up, push forward to point it down, rotate the device clockwise to effect a right turn, and rotate it counterclockwise to turn left. If an Android device running Demofox were attached to a dummy airplane yoke, its video output could be fed to a display in front of this yoke to create a sort of virtual reality.

Figure 6: Two yokes in a fixed-wing aircraft.

Strictly speaking, in accelerometer control mode, the Demofox user rotates the Android device about its longest axis to effect up-and-down movement. In a yoke like the one shown in the figure above, pulling back / pushing forward naturally causes the necessary rotation. This is part of the mechanical design of the yoke. The Android user operating without a dummy yoke, on the other hand, must take care to rotate the device, not just push it backwards and forwards. This is a familiar motion for many Android users, though, since many accelerometer-based Android user interfaces are based on axial rotation. At the Google Play store, many games that use such an interface have "tilt" in their names, like "aTilt Labyrinth 3D" or "Tilt Racing."

The little snippet of text below the buttons in Figure 3 relates to the other mode in which the accelerometer-based UI can operate: "Divecam" mode. The name of this mode draws an analogy with a diver holding an underwater camera as shown in the next figure below. Such a diver generally has the camera pointed in the direction in which he or she is moving, and this is similarly true of the Android device operating in "Divecam" mode. If the player wishes to navigate the Demofox ship toward the left, for example, he or she can do so by rotating the Android device as if he or she were taking a picture of an object to his or her left.

Figure 7: A diving camera

The selection between "Divecam" mode and "Yoke" mode is made by the user, but this is done indirectly. While the button-based GUI is showing, if the user holds the device like a yoke, i.e. basically perpendicular to the ground or floor plane beneath it, then the snippet of text below the buttons will indicate "Yoke" mode. If the user does not do this, then the text indicates "Divecam" mode. The mode in effect when the user presses a level selection button (i.e. when the user makes a selection on the second and final screen of the button-based GUI) is the mode used for game play. If the user chooses touchscreen mode, then this distinction is irrelevant.
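The orientation test described above could be sketched roughly as follows. This is only an illustration, not the Demofox implementation: the class name ControlMode, the method classify(), and the threshold value are all assumptions. It relies on the Android convention that the accelerometer's Z axis points out of the screen, so a device held upright reports very little gravity along Z.

```java
// Hypothetical sketch: choose "Yoke" or "Divecam" mode from a gravity reading.
final class ControlMode {
    static final int YOKE = 0;
    static final int DIVECAM = 1;

    // Assumed threshold: the largest share of gravity that may lie along
    // the screen normal while the device still counts as "upright".
    private static final float MAX_NORMAL_FRACTION = 0.5f;

    // gx, gy, gz: accelerometer reading in device coordinates (Z out of screen).
    static int classify(final float gx, final float gy, final float gz) {
        final double magnitude = Math.sqrt(gx * gx + gy * gy + gz * gz);
        if (magnitude == 0.0) {
            return DIVECAM; // no usable reading; fall back to the default mode
        }
        final double normalFraction = Math.abs(gz) / magnitude;
        // Upright device: gravity lies mostly in the screen plane.
        return normalFraction < MAX_NORMAL_FRACTION ? YOKE : DIVECAM;
    }
}
```

Under these assumptions, a device held upright is classified as "Yoke" and a device lying flat on a table is classified as "Divecam".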

The touchscreen input mode operates in a manner that should be familiar to Android users. To effect a maneuver to the left, the player must touch his or her finger down somewhere on the screen and drag it to the right. This is the same technique used to move toward the left edge of a zoomed-in webpage in the Android browser, to move toward the left in the Android version of Google Earth, and so on. Maneuvers in other directions are performed according to the same pattern: the user must touch down a finger somewhere on the screen and drag it in the direction 180 degrees opposite the desired direction. This results in a very intuitive mode of game play.
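The "drag opposite to the desired direction" rule can be illustrated with a few lines of Java. This is a minimal sketch; the class TouchSteering, the method steer(), and the scale factor are hypothetical and do not come from the Demofox source.

```java
// Hypothetical sketch of inverted-drag steering.
final class TouchSteering {
    // Assumed scale factor: world units moved per pixel dragged.
    private static final float UNITS_PER_PIXEL = 0.01f;

    // Returns the {x, y} change in ship position for a drag that starts
    // at (downX, downY) and ends at (upX, upY), in screen pixels.
    static float[] steer(final float downX, final float downY,
                         final float upX, final float upY) {
        final float dragX = upX - downX;
        final float dragY = upY - downY;
        // The ship moves 180 degrees opposite to the drag direction.
        return new float[] { -dragX * UNITS_PER_PIXEL, -dragY * UNITS_PER_PIXEL };
    }
}
```

With these assumed values, a 100-pixel drag to the right yields a negative X offset, i.e. a maneuver to the left.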


Most basically, the code base supplied here is built around the principles of interface-based programming, and of modular programming, which relies on the use of interfaces. In keeping with these principles, the Demofox source consists of a series of classes that communicate with each other exclusively using defined sets of methods.

Despite its use of the class keyword to subdivide code, the implementation provided is not really object-oriented, in several key senses. The author saw no need to use polymorphism beyond what is strictly required by the Java / Android infrastructure, for example. In fact, the same can be said about object instantiation; all members introduced by the author (as opposed to required overridden methods) are static.

This design offers the advantage of deterministic management of resources. These are allocated at application start, and reclaimed upon process termination, without any dependency on object instantiation or destruction.

Also, the author did not want to obfuscate the code presented in the article with architectural concerns. The design selected allows for a clear presentation of techniques that can be integrated into a wide variety of overall architectures. 

Despite these simple structured programming underpinnings, every effort has been made to respect good programming principles. Most basically, the structured, procedural nature of the code is fully reflected in its declarations. Except where the infrastructure requires a true object instance, classes contain static members only, and any initialization that is necessary takes place in the static constructor. This article thus provides a good example of the proper syntax for procedural programming in Java. 

Another design principle which is respected in the Demofox design is the principle of least privilege. Access modifiers (public, protected, etc.) are no more permissive than they need to be, and parameters are marked final wherever possible.

The principle of information hiding is also respected, in that the program consists of a series of classes that interact only by way of a functional interface. Only private fields are used, and the protected interface exposed by each class is comprised of a series of methods in a designated portion of each class. 

Well-defined standards for the naming of members, and for the ordering of their declarations, were followed for all Demofox code. The member order used in the supplied source code files is shown below, from top to bottom:

  • Constants  
  • Constructor (if strictly required)  
  • public method overrides (if strictly required) 
  • protected methods  
  • private methods  
  • private fields  
  • static constructor   

The listing shown above reflects the basic design already laid out. In keeping with least privilege, the only allowance made for any public member in the listing above is for overrides that are absolutely required by the Android / Java infrastructure.

Similarly, an instance constructor is only allowed for in the listing above if it is absolutely necessary. All of this reflects the overall design goal stated above, which was to implement a basic structured programming design, but to do so in a well-specified manner, and to adhere to general good practices.

The design used for this application is interface-based. Accordingly, each class's interface has a designated position in its file. This is the fourth item in the list shown above, "protected methods."
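To make the member ordering concrete, here is a small, made-up class laid out in the order listed above. The class scorekeeper and its members are hypothetical and are not part of Demofox; the sketch only shows where constants, the protected interface, private fields, and the static constructor would sit.

```java
// Hypothetical static-only class following the member order described above.
final class scorekeeper {
    // Constants
    private static final int POINTS_PER_PRISM = 10;

    // protected methods: the interface other classes may call
    protected static void addPrism() {
        score += POINTS_PER_PRISM;
    }

    protected static int getScore() {
        return score;
    }

    // private fields
    private static int score;

    // static constructor: one-time initialization when the class loads
    static {
        score = 0;
    }
}
```

No instance is ever created; a caller in the same package simply writes scorekeeper.addPrism() and reads the result with scorekeeper.getScore().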

Naming standards are as shown in the listing below:

  • "Lazy" case is used unless otherwise specified, e.g. xbackplane 
  • Constant names are uppercase and are underbar-delimited e.g. YOKE_UI_X_LIMIT 
  • Type suffixes are used for GUI elements, e.g. highscoretv (a TextView)
  • "Camel" case is used for protected methods, if 2+ words are necessary, e.g. setScore() 
  • Abbreviations are generally avoided 
  • As shown above, type name suffixes for GUI controls are abbreviated. 
  • Local variable names can be abbreviated  
  • Units-of-measure can be abbreviated 
  • The word "alternate" is abbreviated "alt"    
  • The word "minimum" is abbreviated "min"    
  • The word "maximum" is abbreviated "max"  

The author does not pretend that these naming and ordering conventions will be completely satisfactory for any other application. They represent his attempt to impose an appropriate but not stifling level of discipline on this relatively limited project. The same is true of the overall design selected. It is adequate for the author's purpose here.


    At the bottom level of the application architecture lies a class named cube. This class is capable of rendering cubes into the model world associated with an instance of type javax.microedition.khronos.opengles.GL10. In keeping with the interface-based architecture described above, this GL10 instance is passed into a static method, draw().

    The draw() method draws a cube [2] whose sides are 1.0 units in width, centered about the origin at (0,0,0). It is up to the caller to apply transformations to position each cube properly before the method is called. The cubes are drawn using a top-to-bottom gray gradient coloration; no texture is applied. This is a good, neutral building block for many 3D applications. In the Demofox application, groups of three cubes are used to create obstacles for the player.

    Like all of the Demofox classes, the cube class begins with its local constant declarations. These are shown and discussed below. The first segment appears thus:

    final class cube
     private static final int PRISM_FULL_COLOR = 0x08000;
     private static final int PRISM_HALF_COLOR = 0x04000;

    These first two declarations define the brightness of the beginning and ending shades of gray used to render the cube. These are of type int, and will be used as OpenGL "fixed point" values (GL_FIXED, a 16.16 format in which 0x10000 represents 1.0). The second value is half the first, so the darker shade is 50% as bright as the lighter one. The code continues as shown below:

     private static final float PRISM_CUBE_WIDTH = 1f;
     private static final float VERTICES[] = 
       // Back
      // Front

    The constant PRISM_CUBE_WIDTH specifies the length of each side of the cube. Then, the lengthy array declaration shown in the snippet above defines the eight vertices of the cube in 3D space. Two possible coordinates, 0.5 and -0.5, are permuted across the three dimensions to yield 8 array elements.
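    The permutation of -0.5 and 0.5 across three dimensions can also be expressed programmatically. The sketch below is only illustrative: the class CubeCorners is hypothetical, and the vertex order it produces is an assumption that need not match the order used in cube.VERTICES.

```java
// Hypothetical sketch: generate the eight corners of an origin-centered cube.
final class CubeCorners {
    // Returns 24 floats: x, y, z for each of the 8 corners.
    static float[] corners(final float width) {
        final float half = width / 2f;
        final float[] out = new float[8 * 3];
        int n = 0;
        for (int i = 0; i < 8; ++i) {
            // Each bit of i selects the low or high value for one axis.
            out[n++] = (i & 1) == 0 ? -half : half; // x
            out[n++] = (i & 2) == 0 ? -half : half; // y
            out[n++] = (i & 4) == 0 ? -half : half; // z
        }
        return out;
    }
}
```

    For a width of 1.0, every coordinate produced is either -0.5 or 0.5, and each of the eight combinations appears exactly once.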

    It is not sufficient, though, to simply supply the eight vertices of the cube to OpenGL. OpenGL rendering is triangle-driven and does not work with rectangular faces. Instead, 12 triangles must be drawn (two per face, times six cube faces), resulting in a 36-vertex sequence. The next set of constant declarations takes care of these details:

     // 6 sides * 2 triangles per side * 3 vertices per triangle = 36
     private static final int CUBE_VERTEX_COUNT = 36;
     // This is the specific, clockwise zig-zagging order
     // necessary for optimal rendering of the solid.
     private static final byte CUBE_VERTEX_INDICES[] = 
     {
      0, 4, 5,  //First triangle drawn
      0, 5, 1,  //Second... etc.
      1, 5, 6, 
      1, 6, 2, 
      2, 6, 7, 
      2, 7, 3, 
      3, 7, 4, 
      3, 4, 0, 
      4, 7, 6, 
      4, 6, 5, 
      3, 0, 1, 
      3, 1, 2 
     };

    Each of the elements in CUBE_VERTEX_INDICES is a number from 0 to 7, and uniquely identifies one of the 8 cube vertices. Specifically, these 0 to 7 elements are indices into array cube.VERTICES.

    The coordinates for these faces are rendered, and therefore defined above, in a specific order that allows for a performance benefit. This ordering is such that the front faces (i.e. those that are visible in a given frame) end up getting drawn using a clockwise motion from each vertex to the next, whereas the equivalent motion for hidden faces ends up being counterclockwise. This allows OpenGL to cull (i.e. not render) the back faces of the solid.

    The cube is a very easy example of this optimization. When defining the cube, we do so such that the vertices of the front face (the face lying entirely in the plane where Z = -0.5) are drawn in clockwise order, and those of the back face are drawn counterclockwise, assuming the camera is looking at the front face.

    If we rotate the cube 180 degrees about the yaw ("Y") axis, and leave the camera in the same spot, then when these same three vertices are rendered, it will take place in a counterclockwise fashion. OpenGL, when configured to cull surfaces, deals with both scenarios by only drawing the clockwise renderings.  
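    The way a triangle's apparent winding flips when it is turned away from the camera can be checked numerically. The following is a minimal sketch and not part of Demofox; the class Winding and its methods are hypothetical. It uses the sign of the triangle's projected area: for a viewer looking down the -Z axis with +X to the right and +Y up, a positive value means counterclockwise and a negative value means clockwise.

```java
// Hypothetical sketch: winding order before and after a 180-degree yaw.
final class Winding {
    // Twice the signed area of the triangle's projection onto the XY plane.
    // Positive: counterclockwise; negative: clockwise (viewer looks down -Z).
    static float signedArea2(final float[] a, final float[] b, final float[] c) {
        return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]);
    }

    // Rotates a point 180 degrees about the Y axis: (x, y, z) -> (-x, y, -z).
    static float[] rotateY180(final float[] p) {
        return new float[] { -p[0], p[1], -p[2] };
    }
}
```

    Applying rotateY180() to all three vertices of a triangle negates signedArea2(), which is exactly the property that lets a culling renderer discard faces that point away from the camera.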

    This brief discussion of surface culling necessarily leaves out some detail. Some more information about this technique is available in another of the author's articles on this site. Though that article uses Direct3D instead of OpenGL, the design techniques required are the same, and some of the illustrative figures used in that article may help make the text discussion above clearer.

    After the declaration of these constants, the protected interface exposed by class cube is defined. This consists of two static methods, the first of which is draw():  

     protected static void draw(final GL10 gl)
      // 3 means dimensions: x, y, and z
      gl.glVertexPointer(3, GL10.GL_FLOAT, 0, vertexbuffer);
      // 4 means R, G, B, A here
      gl.glColorPointer(4, GL10.GL_FIXED, 0, colorbuffer);
      gl.glDrawElements(GL10.GL_TRIANGLES, CUBE_VERTEX_COUNT,
       GL10.GL_UNSIGNED_BYTE, indexbuffer);

    The method implementation begins by disabling texturing, since the cube to be drawn is not textured. Then, two calls are made, which prepare the OpenGL rendering engine to receive vertices, and enable clockwise surface culling, as described above. The floating point coordinate buffer, and the fixed point color buffer, are supplied to OpenGL by the calls to glVertexPointer() and glColorPointer(). Then, glDrawElements() is used to effect the actual drawing of the cube, and a call is made to return the OpenGL rendering engine to its pre-call state.

    The other protected method that is exposed by cube returns the width of a cube side. This is shown below: 

     protected static float getPrismCubeWidth()
     {
      return PRISM_CUBE_WIDTH;
     }

    This is used by cuber, for example, to translate channel numbers into OpenGL coordinates.

    Many readers will likely be accustomed to a flow of information which is the reverse of what's shown here. Specifically, such readers might expect a cube class to expose a property allowing its size to be set by the consumer of the class, accompanied, perhaps, by similar properties relating to color and texture.

    Here, these values are instead contained within the cube class, and are selectively exposed via a defined interface. This is done only in cases where some specific consumer is known to require access. 

    While not an example of OOP, this design does respect several principles the author holds very important. The principle of least privilege is an example; Demofox does not need cuber, demofoxactivity, touchableglview, etc., to manipulate the color or width of the solid rendered by cube, so these other classes are not allowed to do so.

    Instead, the constants used to determine such things as cube color and width are colocated with the code that uses them most, and are hidden unless they specifically need to be exposed. The cuber class needs to read the width value, but not write it, and it is therefore granted this privilege only. Consistent with the interface-based  approach promised earlier, access to this value is given by way of a method, not a field.

    The code shown above also exemplifies modular programming. There is a strict separation-of-concerns between two key tasks: the generation of cubes, which is a generic and potentially reusable operation existing at a relatively low level, and their construction into groups of prisms and integration into the overall application, which is a higher-level task.

    The cube class is self-contained. It does one thing well (renders a neutral cube using a gray gradient) and eschews all other roles (e.g. positioning the cube).

    While the cube class is admittedly inflexible, it also is most definitely a component, or (literal) building block. To force some notion of adaptability onto it would be an example of speculative generality, and this is something the author considers it wise to avoid, at least for the expository purposes of this article.   

    All other executable code for the cube class resides in the static constructor. In each of the Demofox classes, the static constructor takes care of the allocation of application resources. For the cube class, this constructor begins as shown below:   

      final int fullcolor = PRISM_FULL_COLOR;
      final int halfcolor = PRISM_HALF_COLOR;
      final int colors[] = 
      {
       halfcolor, halfcolor, halfcolor, fullcolor,
       halfcolor, halfcolor, halfcolor, fullcolor, 
       fullcolor, fullcolor, fullcolor, fullcolor, 
       fullcolor, fullcolor, fullcolor, fullcolor, 
       halfcolor, halfcolor, halfcolor, fullcolor, 
       halfcolor, halfcolor, halfcolor, fullcolor,
       fullcolor, fullcolor, fullcolor, fullcolor, 
       fullcolor, fullcolor, fullcolor, fullcolor 
      };

    This initial portion of the static constructor is focused on coloration. First, the aliases fullcolor and halfcolor are created. Then, these are used to declare and initialize array colors. This array parallels array VERTICES. Each of the eight rows in its initializer corresponds to the similarly-numbered row in the initializer of VERTICES, and to the same cube vertex as that row. Where each row in VERTICES had three individual numeric coordinates, above, each row has four: red, green, blue, and alpha (opacity).

    In examining the declaration of colors, consider that all of the vertices with lower "Y" coordinate values are gray, while vertices with higher "Y" values are white. This is what creates the gradient effect evident in the rendered cubes. 
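    Because the color components are handed to OpenGL as GL_FIXED values, it can help to see how such values relate to ordinary floating point intensities. The helper below is a hedged sketch and is not taken from Demofox; the class FixedPoint and its method names are hypothetical. It assumes the standard 16.16 layout, in which 0x10000 represents 1.0.

```java
// Hypothetical helper for OpenGL ES 16.16 fixed-point values.
final class FixedPoint {
    // 1.0 expressed in 16.16 fixed point.
    private static final int ONE = 0x10000;

    // Converts a float intensity (e.g. 0.0 to 1.0) to fixed point.
    static int toFixed(final float value) {
        return (int) (value * ONE);
    }

    // Converts a fixed-point value back to a float.
    static float toFloat(final int fixed) {
        return fixed / (float) ONE;
    }
}
```

    Under this layout, toFixed(0.5f) yields 0x8000 and toFloat(0x4000) yields 0.25.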

    The static constructor for the cube class concludes with the code shown below:

      final ByteBuffer vbb = ByteBuffer.allocateDirect(VERTICES.length * Float.SIZE
       / Byte.SIZE);
      vertexbuffer = vbb.asFloatBuffer();
      // Integer.SIZE is in bits; divide by Byte.SIZE to get a byte count
      final ByteBuffer cbb = ByteBuffer
       .allocateDirect(colors.length * Integer.SIZE / Byte.SIZE);
      colorbuffer = cbb.asIntBuffer();
      indexbuffer = ByteBuffer.allocateDirect(CUBE_VERTEX_INDICES.length);

    This code sets up three data structures required for OpenGL 3D rendering. Member vertexbuffer holds the 8 coordinates of the cube vertices. This is basically a copy of VERTICES, in a very specific format. In particular, vertexbuffer is a direct buffer. In the nomenclature of Java, this means that it does not reside on the main heap, and will not get relocated by the Java garbage collector. These requirements are imposed by OpenGL.

    Another direct buffer, indexbuffer, holds the 36 indices into VERTICES that define the way in which the cube is actually rendered. Again, this is basically just a copy of a more familiar-looking data structure, in this case CUBE_VERTEX_INDICES.   

    Finally, colorbuffer is a direct buffer copy of colors. Like the other two data structures for which a direct buffer gets created, colorbuffer contains data to which OpenGL requires quick access during rendering. 
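    For readers unfamiliar with direct buffers, the general pattern for copying a Java array into one is sketched below using only standard java.nio calls. The class DirectBuffers and the method toDirectFloatBuffer() are hypothetical names introduced for illustration; the Demofox source may arrange these steps differently.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helper: copy a float array into a direct, native-order buffer.
final class DirectBuffers {
    static FloatBuffer toDirectFloatBuffer(final float[] data) {
        // allocateDirect() takes a size in bytes, hence bits / bits-per-byte.
        final ByteBuffer bytes =
            ByteBuffer.allocateDirect(data.length * Float.SIZE / Byte.SIZE);
        // Native code generally expects the platform's own byte order.
        bytes.order(ByteOrder.nativeOrder());
        final FloatBuffer floats = bytes.asFloatBuffer();
        floats.put(data);
        // Rewind so that a consumer reads from the first element.
        floats.position(0);
        return floats;
    }
}
```

    The same pattern applies to IntBuffer (via asIntBuffer()) for fixed-point color data.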

    Other Classes

    The other classes used for Demofox are fundamentally similar to cube in their construction. Each of them exhibits the same member ordering, naming conventions, and overall interface-based, structured philosophy. A listing of all Demofox classes is given below: 

    • cube : See above. 
    • plane : This is similar to cube, but renders a flat plane defined by the equation Z=K , where K is some constant. It is used to render a background image for the game.  
    • cuber : This is a subclass of GLSurfaceView.Renderer, and it performs high-level management of the 3D rendering. It is the class that calls into plane and cube, for example. 
    • touchableglview :  This is a subclass of GLSurfaceView, with methods overridden to facilitate the touchscreen UI. 
    • demofoxactivity : This is a subclass of Activity, and is thus the top-level unit of the application from a procedural standpoint. The main point-of-entry is defined here, for example. 

    The remainder of this article deals with those portions of these other classes that the author thought most noteworthy.

    Prism Lifetime   

    One very significant role played by cuber is the generation and management of the prismatic obstacles. These are presented to the user in a continuous stream, as if they comprised a sort of permanent, asteroid-belt-like obstruction in space.

    It would be sub-optimal to render such a huge conglomeration of figures with each frame, though. Only a small subset of these really needs to be rendered to provide a convincing game experience. 

    A simplistic way to optimize the generation of prisms might be to generate a small group of them, in near proximity to the user, at application start. These would be positioned in front of the user; in this case, that would place them at slightly higher "Z" coordinates than the user's ship. Under such a design, another, similar group of prisms would be generated after the user had passed by all of the first group. Also, at that time, the first group of prisms would cease to be rendered, and any associated resources would be deallocated.

    This simplistic approach does create a stream of obstacles that is basically continuous,  but it also suffers from gaps in the presentation of these obstacles. Typically, as a prism group passes out of view, there is a point at which the last prism of the group becomes invisible, and then the next group of prisms appears suddenly, all at once. This results in an unconvincing simulation.

    The strategy actually used by Demofox, though, does rely on the basic narrative given above, despite its simplistic nature. The actual code given here compensates for the natural abruptness of the prism group transitions by rendering two such groups at once, and staggering them such that the transitions between groups are much less obvious. 

    In essence, the approach actually used is identical to the simplistic approach described, but involving two sets of prism groups. The second of these groups does not start getting rendered until the user has reached a certain "Z" coordinate, cuber.START_ALT_PRISMS. After that, each group is regenerated once it has completely passed the user in the "Z" dimension.
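    The effect of staggering can be seen in a simplified model. The sketch below is not the Demofox algorithm; it assumes that every group spans a fixed depth (GROUP_DEPTH) and that the alternate group starts half a group later (START_ALT). The class StaggerDemo and both constants are hypothetical.

```java
// Hypothetical model of two staggered, repeatedly regenerated prism groups.
final class StaggerDemo {
    static final float GROUP_DEPTH = 100f;            // assumed depth of one group
    static final float START_ALT = GROUP_DEPTH / 2f;  // assumed stagger offset

    // Position at which the main group was most recently regenerated.
    static float mainGroupStart(final float position) {
        return (float) Math.floor(position / GROUP_DEPTH) * GROUP_DEPTH;
    }

    // Position at which the alternate group was most recently regenerated,
    // or NaN if the alternate group has not been started yet.
    static float altGroupStart(final float position) {
        if (position < START_ALT) {
            return Float.NaN;
        }
        return (float) Math.floor((position - START_ALT) / GROUP_DEPTH)
            * GROUP_DEPTH + START_ALT;
    }
}
```

    In this model the two groups never regenerate at the same position, so at the moment one group pops into existence the other is already partway through its lifetime and still on screen.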

    To be specific, this regeneration process is handled by calling the reset method of class cuber. This method begins as shown below:

    private static void reset(final boolean alternate)
      if (!alternate)
       paging = getPositionstate() - demofoxactivity.getStartDistance();
       xstate = new ArrayList<Integer>();
       ystate = new ArrayList<Integer>();
       zstate = new ArrayList<Integer>();
       for (int index = 0; index < simultaneaty; ++index)
         - PRISMS_PER_DIMENSION / 2);
         - PRISMS_PER_DIMENSION / 2);
         - PRISMS_PER_DIMENSION / 2);

    First, note that the portion of reset() shown above pertains to the first set of obstacle prisms, i.e. the one that was present even in the simplistic algorithm given above, before the second set of obstacles was added to improve continuity. The outer if shown above has an else clause which is very similar to the if clause shown above, but uses its own set of variables (altpaging instead of paging, altxstate instead of xstate, and so on [3]).

    Most of the code shown above takes care of resetting the first obstacle group into new positions after it has passed out of view. To begin, member paging is set to the current "Z" position of the user's ship (yielded by accessor getPositionstate()) minus an initial offset value.

    All of the prisms' "Z" positions will be expressed as offsets from paging, since at the start of their lifetimes they must all be visible to the end user. This mechanism is used to deal with the ever-increasing "Z" position of the user, ship, and camera. It is evident in much of the code relating to position in the "Z" dimension. 

    Member lists xstate, ystate, and zstate serve to locate each of the new obstacle prisms somewhere in 3D space in front of the user. These are lists, with one element per prism. These lists exist in parallel. The first elements from all three lists together form a single coordinate, as do the three second elements, and so on. 

    These lists hold integer elements. Rather than holding coordinates in our OpenGL universe (which would be floating point numbers), xstate and ystate hold channel numbers, as described in the introduction and as shown in Figure 2.

    Because each channel is exactly large enough in the "X" and "Y" dimensions to accommodate one full prism, the translation of the integer data in xstate and ystate into floating point OpenGL coordinates involves a multiplication by cube.getPrismCubeWidth(). The same numbering convention is continued into the "Z" dimension, although there are no channels per se in this dimension. Because the integer value in zstate is offset from paging, not 0, the value of paging must be added in for calculations involving the "Z" dimension.
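    The translation from integer state to OpenGL coordinates might be sketched as follows. The class ChannelMath and its method names are hypothetical. The "Z" formula mirrors the expression -cube.getPrismCubeWidth() * z + tpaging that appears in the collision detection code of this article; treating it as the general conversion is an assumption.

```java
// Hypothetical sketch: convert channel numbers and Z offsets to coordinates.
final class ChannelMath {
    // X or Y: one channel is exactly one cube width wide.
    static float channelToWorld(final int channel, final float cubeWidth) {
        return channel * cubeWidth;
    }

    // Z: the integer state is an offset relative to paging, not to 0.
    static float depthToWorldZ(final int zIndex, final float cubeWidth,
                               final float paging) {
        return -cubeWidth * zIndex + paging;
    }
}
```

    With a cube width of 1.0, channel -2 maps to coordinate -2.0, and a "Z" state of 3 with paging of 100.0 maps to 97.0.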

    Background Effect  

    The prismatic obstacles shown during game play are rendered in front of an unchanging background. This is implemented using a flat OpenGL plane at a fixed distance from the camera. This plane is of sufficient size that it occupies the entire area of the display, other than those parts covered by obstacles. This is an inexpensive way to add interest to the game's 3D scene without unduly taxing the device's processing resources. 

    Unlike the OpenGL surfaces used to construct the obstacles, the single face used to create this background is textured, since a solid color or even a gradient would make for an uninteresting effect. Though the background surface is unchanging in its appearance and position, this is not unrealistic. Stellar features like the ones depicted in the background images reside at huge distances from the viewer. In the real world, the apparent position of such stellar features does not move much if at all in response to viewer movements. This is particularly true of movement like the sort modeled here, i.e. movement forward with constant orientation. 

    Class plane takes responsibility for drawing the background. This is similar to cube, with some important additions related to texturing, as well as some simplifications related to the simpler solid being rendered. Much of the additional work associated with texturing is handled by method  loadtexture():   

     private static void loadtexture(GL10 gl, Context context)
      Bitmap bitmap;
      // Loading texture
      if (demofoxactivity.getLevel() == 1)
       bitmap = BitmapFactory.decodeResource(context.getResources(),
       bitmap = BitmapFactory.decodeResource(context.getResources(),
      // Generate one texture pointer
      gl.glGenTextures(1, textures, 0);
      // ...and bind it to our array
      gl.glBindTexture(GL10.GL_TEXTURE_2D, textures[0]);
      // Use "nearest pixel" filter for sizing
      gl.glTexParameterf(GL10.GL_TEXTURE_2D, GL10.GL_TEXTURE_MIN_FILTER,
      gl.glTexParameterf(GL10.GL_TEXTURE_2D, GL10.GL_TEXTURE_MAG_FILTER,
      // Use GLUtils to make a two-dimensional texture image from
      // the bitmap
      GLUtils.texImage2D(GL10.GL_TEXTURE_2D, 0, bitmap, 0);
      // Allow for resource reclamation

    Above, an object of type Bitmap is first created, from an embedded resource identified by a member of class R, an Android facility provided for this purpose. Then, the next two calls set up a single texture level for the GL10 rendering engine object passed into the method. For generality, these calls require an int[] array parameter, which holds an identifier for the texture created in its single element. The call to GLUtils.texImage2D() serves to load the appearance data from the Bitmap as this level 0 texture's appearance. After that, the  Bitmap is expendable, and is deallocated. 

    The remainder of the new code in plane associated with surface texturing is in its static constructor. This constructor begins with code that is very similar to the code at the start of the static constructor for class cube. The code in plane diverges from that in cube, though, at the point shown below:   

      float texture[] =
      {
       0.0f, 0.0f, // bottom left 
       1.0f, 0.0f, // bottom right
       1.0f, 1.0f, // top right
       0.0f, 1.0f  // top left
      };
      ByteBuffer cbb3 = ByteBuffer.allocateDirect(texture.length * Float.SIZE
       / Byte.SIZE);
      texturebuffer = cbb3.asFloatBuffer();

    Here, another sort of direct buffer is created for OpenGL. This serves to map the two-dimensional texture appearance onto the single face that gets rendered by OpenGL. These texture coordinates are floating point numbers ranging from 0.0 to 1.0 for all textures, and are associated with the four vertices held in vertexbuffer. Here, we are simply spreading a flat square texture over a flat square face, so the determination of the correct texture coordinates is very simple. The portions of this code devoted to allocating the direct buffer are fundamentally similar to code already shown in the discussion of the cube class above.  

    Drawing a Frame 

    Much key game logic is present in method cuber.onDrawFrame(). This is an overridden method which runs for each OpenGL frame. It is the longest method in the Demofox code base, at 60 lines. In general, long functions are avoided in this code base, but the length of onDrawFrame() does allow for a linear presentation below. This method begins thus:

     public void onDrawFrame(final GL10 gl)

    This code first performs some OpenGL preliminaries common to many 3D applications. The depth buffer, which tracks which pixels are in front and visible, is cleared, and certain basic rendering facilities are enabled. Then, the code shown above calls plane() to draw the background image. The method implementation continues as shown below:

      boolean allbehind;
      allbehind = true;
      for (int index = 0; index < xstate.size(); ++index)
       allbehind &= (prism(gl, xstate.get(index), ystate.get(index),

    Boolean variable allbehind plays a key role in the management of prism lifetime. As the prisms confronting the user are drawn by the prism() method, allbehind is updated to reflect whether or not each prism is behind the current camera position. If any prism is not, allbehind is set to false. After all of the prisms in the first group are drawn, if allbehind is still true, then special action is taken:

      if (allbehind)
       if (wrapitup && altdone)
        positionstate = demofoxactivity.getStartDistance();
        startedalt = false;
        wrapitup = false;
        altdone = false;

    Variables altdone and wrapitup are part of a system that restarts the ship at position 0.0 after a very high maximum "Z" position is reached. Once this happens, wrapitup is set to true. This begins a process by which both prism groups are allowed to pass behind the camera, without regeneration, before the game is essentially restarted. (Position is reset to 0.0 and the initialization code for the prism group runs, but the score is not reset and accrues normally.)

    After wrapitup is set to true, neither of the two groups of prismatic obstacles will be reallocated until the overall reset to ship position 0.0 has been completed. This system exists to prevent ship position from climbing indefinitely high and causing number system anomalies. After an extreme position is reached, and both prism groups have subsequently passed behind the current position, (wrapitup && altdone) will evaluate to true and the final reset of ship position to 0.0 will be effected by the innermost sequence of statements above (followed by the subsequent call to reset()). Otherwise, the main prism group is simply recycled using a call to reset().    

    The cuber.onDrawFrame() method continues as shown below:  

      if (positionstate >= START_ALT_PRISMS)
       if (!startedalt)
        startedalt = true;
       allbehind = true;
       for (int index = 0; index < altxstate.size(); ++index)
        allbehind &= (prism(gl, altxstate.get(index), altystate.get(index),
         altzstate.get(index), true));
       if (allbehind)
        if (!wrapitup)
         altdone = true;

    Above, we see a section of code that is similar to the previous snippet, but for the second or alternate prism group. One immediately obvious difference relates to variable startedalt. In order to stagger the two prism groups properly, the alternate prism group does not get drawn until a specific positional milestone is reached by the player / ship / camera. At this point, reset(true) is invoked to create the alternate prism group for the first time, and startedalt is set to true. Next, a series of calls to prism() is used to render the actual obstacle group. Throughout this process, variable allbehind is used to track whether the second group is due to be regenerated.

    At the end of the code shown above, the situation where allbehind is true is dealt with. Normally, this results in regeneration of the alternate prism group. However, when it is time to effect the major reset of ship position to 0.0, and wrapitup is true, regeneration of the prism group is deferred, and altdone is set to true instead. As shown in an earlier code snippet, this serves to trigger what amounts to a game restart at ship position 0.0. 

    Finally, this method concludes with the code to increase the current "Z" position, and, if it is sufficiently high, begin the position reset process already described: 

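       // Only one of the two increments below applies on a given run: judging
       // by the constant names, one speed is for the emulator and the other for
       // a physical phone. The selecting conditional is not shown in this excerpt.
       positionstate = positionstate + EMULATOR_FORWARD_SPEED;
       positionstate = positionstate + PHONE_FORWARD_SPEED;
      if (positionstate > MAX_POSITION)
       wrapitup = true;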
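
    The wrap-around behavior described in this section can be summarized as a small state machine. What follows is a minimal sketch under stated assumptions, not the application's actual code: class WrapSketch, method step(), the restarts counter, and the constant values are all hypothetical, rendering is omitted, and only the names positionstate, wrapitup, and altdone come from the article.

```java
// Hypothetical, simplified model of the wrap-around logic; rendering is omitted.
// Only positionstate, wrapitup and altdone are names taken from the article.
class WrapSketch {
    static final double MAX_POSITION = 10.0;  // assumed value, kept small for the demo
    static final double FORWARD_SPEED = 1.0;  // assumed value

    double positionstate = 0.0;
    boolean wrapitup = false;
    boolean altdone = false;
    int restarts = 0; // counts resets to position 0.0 (illustrative only)

    // Simulates one frame. mainBehind / altBehind report whether the main or
    // alternate prism group has passed entirely behind the camera this frame.
    void step(boolean mainBehind, boolean altBehind) {
        if (mainBehind && wrapitup && altdone) {
            // Both groups are behind the camera during wrap-up: restart at 0.0.
            positionstate = 0.0;
            wrapitup = false;
            altdone = false;
            restarts++;
        }
        if (altBehind && wrapitup) {
            // Regeneration of the alternate group is deferred; note that it is done.
            altdone = true;
        }
        positionstate += FORWARD_SPEED;
        if (positionstate > MAX_POSITION) {
            wrapitup = true; // begin the reset process
        }
    }

    public static void main(String[] args) {
        WrapSketch w = new WrapSketch();
        for (int i = 0; i < 11; i++) {
            w.step(false, false);   // fly forward until MAX_POSITION is exceeded
        }
        w.step(false, true);        // alternate group passes behind the camera
        w.step(true, false);        // main group passes behind: position resets
        System.out.println("position=" + w.positionstate + " restarts=" + w.restarts);
    }
}
```

    The point of the two flags is ordering: the position is only reset once both groups are out of view, so the player never sees obstacles vanish mid-flight.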

    Collision Detection  

    The detection of collisions between the user's ship and the prismatic obstacles is handled within private method cuber.cube(). The relevant portion of this method is explored below. It begins with this conditional structure:  

      // 0.5 is for rounding
      if (positionstate + cube.getPrismCubeWidth() + 0.5 + SHIP_RADIUS >= -cube
       .getPrismCubeWidth() * z + tpaging
       && positionstate + cube.getPrismCubeWidth() - 0.5 - SHIP_RADIUS <= -cube
        .getPrismCubeWidth() * z + tpaging)

    This initial check returns true if the user's ship and the front plane of a cube intersect in the "Z" dimension. This check runs once per frame for the front face of each cube, i.e. three times per prism. While not perfect, this logic ensures that the vast majority of user collisions with any portion of the overall prism are detected. 

    Note that the ship is assumed to have volume; it is treated as a cube of width 2.0 * SHIP_RADIUS. The literal value 0.5 is included for rounding purposes. In conjunction with an implied cast to an integer type, which results in truncation, the addition of 0.5 results in proper rounding-up of numbers having a fractional portion greater than or equal to 0.5. Variable tpaging is equal to member paging when drawing the main prism group and is equal to altpaging otherwise. Either way, this value is a component of the overall "Z" position of the cube. If the logic shown above evaluates to true, the conditional shown below then executes: 

       if (getXpos() >= x - 0.5 - SHIP_RADIUS && getXpos() <= x + 0.5 + SHIP_RADIUS)
        if (getYpos() >= y - 0.5 - SHIP_RADIUS
         && getYpos() <= y + 0.5 + SHIP_RADIUS)

    The outer conditional is true when the ship and the front plane intersect in the "X" dimension, and the inner conditional plays a similar role for the "Y" dimension. These calculations are reminiscent of the "Z" dimension's conditional presented earlier, except that there is no need to deal with any paging value for the prism group. The "X" and "Y" values associated with the cubes are absolute positions in the OpenGL 3D universe. 

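    If the two conditional expressions shown above both evaluate to true, the collision code runs. This code is simple: the score is set to zero, the ship bounces back slightly in the "Z" dimension (by value BOUNCE_BACK), and, if a new high score has been set, a distinctive three-vibration indication is given to the user, using method pulse(). The excerpt below shows only the high-score test and the bounce-back; the calls that reset the score and invoke pulse() are not shown, and the bounce-back applies to every collision: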

         if (demofoxactivity.getHighScore() <= demofoxactivity.getScore())
         positionstate -= BOUNCE_BACK;

    During development, the author also experimented with vibration feedback during prism collisions. This was ultimately excluded from the end product provided here, since it is battery-draining, and because it was judged to be redundant in light of the other feedback provided for collisions. 
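
    The checks in this section all follow one pattern: on each axis, the ship and the cube face are treated as colliding when their coordinates differ by no more than 0.5 + SHIP_RADIUS. A minimal sketch of that pattern is given below. The class and method names are hypothetical, the SHIP_RADIUS value is assumed, and the paging offsets used by the real "Z" check are left out. The roundHalfUp() helper illustrates the add-0.5-then-truncate idiom mentioned above; as written it is only meaningful for non-negative values.

```java
// Hypothetical sketch of the per-axis overlap test; not the application's code.
class CollisionSketch {
    static final double SHIP_RADIUS = 0.25; // assumed value
    static final double ALLOWANCE = 0.5;    // the literal 0.5 used in the article's checks

    // True when two coordinates on one axis are within ALLOWANCE + SHIP_RADIUS.
    static boolean overlaps(double shipCoord, double cubeCoord) {
        return Math.abs(shipCoord - cubeCoord) <= ALLOWANCE + SHIP_RADIUS;
    }

    // A collision requires overlap on all three axes; "Z" is tested first,
    // mirroring the order of the conditionals in the article.
    static boolean collides(double shipX, double shipY, double shipZ,
                            double cubeX, double cubeY, double cubeZ) {
        return overlaps(shipZ, cubeZ)
            && overlaps(shipX, cubeX)
            && overlaps(shipY, cubeY);
    }

    // Adding 0.5 before a truncating cast rounds non-negative values to nearest.
    static int roundHalfUp(double value) {
        return (int) (value + 0.5);
    }

    public static void main(String[] args) {
        System.out.println(collides(0, 0, 0, 0.5, 0.5, 0.5)); // close on every axis
        System.out.println(collides(0, 0, 0, 1.0, 0.0, 0.0)); // too far apart in X
        System.out.println(roundHalfUp(2.4) + " " + roundHalfUp(2.5));
    }
}
```

    Testing "Z" first is a cheap early exit: most cubes are far from the ship along the flight path, so the "X" and "Y" comparisons are rarely reached.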

    Touchscreen Interface       

    The touchscreen user interface is built around whole drag events. When the user touches his or her finger down, the ship immediately begins moving correspondingly in the "X" and "Y" dimensions, as described in the "User Input" section above, and continues to do so until the user removes the finger. At this point, the ship remains in the last position attained. If the user at any point attempts to navigate past a world boundary, i.e. to pass outside of the long rectangular play area shown in Figure 2, a long vibration is given, and the ship bounces back to a corrected position. 

    This intuitive and interactive system of control does not require much logic. The handler for all touchscreen events is in class touchableglview, in method onTouchEvent(). The ACTION_DOWN event must be handled specially, but all other events associated with the touchscreen share common handler logic. The snippet of code shown below executes when the user first touches a finger down on the screen. 

      if (event.getAction() == MotionEvent.ACTION_DOWN)
      {
       // Record the finger and ship positions at the start of the drag
       oldy = event.getY();
       originaly = cuber.getYpos();
       oldx = event.getX();
       originalx = cuber.getXpos();
      }

    Values oldx and oldy establish a frame-of-reference for the whole drag event. Throughout the event, displacement of the ship in the "X" and "Y" dimensions will be a linear function of the corresponding "X" or "Y" difference of the finger at any point in time from oldx or oldy. The ship moves forward throughout such events, resulting in a  user-controlled flight path in 3D space.

    Constant FINENESS defines how steep the linear relationship between finger position and ship position actually is. The next segment of code shows the associated calculations. This is the start of the handler logic for all of the touchscreen events that execute after the initial ACTION_DOWN event. Actual ship movement is performed using calls to cuber.setXpos() and cuber.setYpos():

       dy = event.getY() - oldy;
       cuber.setYpos(originaly + (dy / FINENESS));
       // Dragger UI
       dx = event.getX() - oldx;
       cuber.setXpos(originalx + (dx / FINENESS));  

    After these brief calculations, all that is necessary is the check for world extremes. Here, method vibe() is used to give the designated pulse when the user exceeds a boundary. Movement based on constant BOUNCE_TO is applied, in a direction opposite to the user's disallowed maneuver: 

       if (cuber.getXpos() > demofoxactivity.getWorldLimit())
        cuber.setXpos(demofoxactivity.getWorldLimit() * BOUNCE_TO);
       if (cuber.getYpos() > demofoxactivity.getWorldLimit())
        cuber.setYpos(demofoxactivity.getWorldLimit() * BOUNCE_TO);
       if (cuber.getXpos() < -demofoxactivity.getWorldLimit())
        cuber.setXpos(-demofoxactivity.getWorldLimit() * BOUNCE_TO);
       if (cuber.getYpos() < -demofoxactivity.getWorldLimit())
        cuber.setYpos(-demofoxactivity.getWorldLimit() * BOUNCE_TO);
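
    Taken together, the drag handling amounts to a linear mapping followed by a boundary correction. The sketch below models a single axis in plain Java, with no Android classes, so that it can be run anywhere. Class DragSketch, its methods, and all three constant values are assumptions made for illustration; WORLD_LIMIT stands in for demofoxactivity.getWorldLimit(), and the boolean return value marks the point where the application would vibrate.

```java
// Hypothetical single-axis model of the drag logic; the "Y" axis works the same way.
class DragSketch {
    static final double FINENESS = 100.0;  // assumed: screen units per world unit
    static final double WORLD_LIMIT = 4.0; // assumed stand-in for getWorldLimit()
    static final double BOUNCE_TO = 0.75;  // assumed fraction of the limit to bounce back to

    double oldx;       // finger position when the drag began
    double originalx;  // ship position when the drag began
    double shipx;      // current ship position

    // Corresponds to the ACTION_DOWN case: remember the frame of reference.
    void down(double touchX) {
        oldx = touchX;
        originalx = shipx;
    }

    // Corresponds to every later event of the drag. Returns true when a world
    // boundary was exceeded (where the application would give a long vibration).
    boolean move(double touchX) {
        shipx = originalx + (touchX - oldx) / FINENESS;
        if (shipx > WORLD_LIMIT) {
            shipx = WORLD_LIMIT * BOUNCE_TO;
            return true;
        }
        if (shipx < -WORLD_LIMIT) {
            shipx = -WORLD_LIMIT * BOUNCE_TO;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        DragSketch d = new DragSketch();
        d.down(200.0);
        System.out.println(d.move(300.0) + " " + d.shipx); // within the play area
        System.out.println(d.move(700.0) + " " + d.shipx); // past the boundary
    }
}
```

    Because displacement is always measured from the values captured in down(), the ship never jumps when a new drag begins, regardless of where the finger lands.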

    Accelerometer Interface     

    The accelerometer-based control system takes a form that is broadly similar to that of the touchscreen system just discussed. Again, it is displacement from a base point that effects ship position changes in the "X" and "Y" dimensions, while movement forward in the "Z" dimension continues unabated.

    In the touchscreen system described above, the "base point" for each position change was the point on the touchscreen where the user initially touched down. For the accelerometer-based system, the "base point" is the position of the Android device according to the device accelerometer at the beginning of the flight itself. This is stored in variables xbackplane and ybackplane. The accelerometer sensor values in all cases originate from a member of class SensorEvent, named values. The key subtraction operation for each sensor event therefore involves xbackplane or ybackplane and an element of values.

    Like the touchscreen system, the accelerometer-based system relies on a linear translation from input movement to ship movement. Here, the constant factors at play are TILT_SPEED_X and TILT_SPEED_Y, which scale the result of the central subtraction operation (in the code shown below, that result is divided by them).  

    The dimensions in the names of these constants refer to the movement of the ship, not to the associated accelerometer dimensions. In default mode (i.e. when yokemode is false), these dimensions are swapped. It is SensorEvent.values[Y_ACCELEROMETER] that controls movement in the "X" dimension, and SensorEvent.values[X_ACCELEROMETER] that controls movement in the "Y" dimension. 

    In "yoke" mode, SensorEvent.values[Y_ACCELEROMETER] is used to similar purpose, but is negated. The value of SensorEvent.values[X_ACCELEROMETER] is not used. Instead, it is SensorEvent.values[Z_ACCELEROMETER] that is used to control movement in the "Y" dimension of the ship. 

    These modes of operation were set up empirically by the author. Another program was developed to output the raw value of all three sensors to the device screen, and it was used to track desired control movements.  

    Finally, variables ytiltfactor and xtiltfactor are set to zero to disable the accelerometer system for touchscreen operation. Each is multiplied by the result of the corresponding linear translation, so a value of zero cancels the accelerometer's contribution to ship position. This is a slightly inefficient, but reliable, method. 

    All of the logic described above takes place in method demofoxactivity.readaccelerometers(). The body of this method is shown below, in its entirety. As was true of the touchscreen system, all of the calculations just described result in a set of calls to cuber.setXpos() and cuber.setYpos():

      // "Y" movement: yoke mode reads the Z sensor axis, default mode the X axis
      if (yokemode)
       cuber.setYpos(cuber.getYpos() - ytiltfactor
        * ((event.values[Z_ACCELEROMETER] - xbackplane) / TILT_SPEED_Y));
      else
       cuber.setYpos(cuber.getYpos() + ytiltfactor
        * ((event.values[X_ACCELEROMETER] - xbackplane) / TILT_SPEED_Y));
      // "X" movement: both modes read the Y sensor axis; yoke mode negates it
      if (yokemode)
       cuber.setXpos(cuber.getXpos() - xtiltfactor
        * ((event.values[Y_ACCELEROMETER] - ybackplane) / TILT_SPEED_X));
      else
       cuber.setXpos(cuber.getXpos() + xtiltfactor
        * ((event.values[Y_ACCELEROMETER] - ybackplane) / TILT_SPEED_X)); 
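
    The same mapping can be written as a pure function, which makes the two modes easy to compare side by side. This is a sketch under stated assumptions rather than the application's code: class TiltSketch and method apply() are hypothetical, the TILT_SPEED values are invented, and the index constants are assumed to follow the x, y, z ordering that Android uses for accelerometer readings in SensorEvent.values. A plain float array stands in for the sensor event so that the example runs without Android.

```java
// Hypothetical, Android-free restatement of the tilt-to-position mapping.
class TiltSketch {
    // Assumed to match the x, y, z ordering of accelerometer readings.
    static final int X_ACCELEROMETER = 0;
    static final int Y_ACCELEROMETER = 1;
    static final int Z_ACCELEROMETER = 2;
    static final double TILT_SPEED_X = 4.0; // assumed value
    static final double TILT_SPEED_Y = 4.0; // assumed value

    // Returns the new {x, y} ship position for one sensor reading.
    // A tilt factor of 0.0 disables accelerometer input on that axis.
    static double[] apply(double x, double y, float[] values,
                          double xbackplane, double ybackplane, boolean yokemode,
                          double xtiltfactor, double ytiltfactor) {
        if (yokemode) {
            // Yoke mode: Z sensor axis drives "Y"; Y sensor axis drives "X", negated.
            y -= ytiltfactor * ((values[Z_ACCELEROMETER] - xbackplane) / TILT_SPEED_Y);
            x -= xtiltfactor * ((values[Y_ACCELEROMETER] - ybackplane) / TILT_SPEED_X);
        } else {
            // Default mode: X sensor axis drives "Y"; Y sensor axis drives "X".
            y += ytiltfactor * ((values[X_ACCELEROMETER] - xbackplane) / TILT_SPEED_Y);
            x += xtiltfactor * ((values[Y_ACCELEROMETER] - ybackplane) / TILT_SPEED_X);
        }
        return new double[] {x, y};
    }

    public static void main(String[] args) {
        float[] reading = {2f, 4f, 8f};
        double[] normal = apply(0.0, 0.0, reading, 0.0, 0.0, false, 1.0, 1.0);
        double[] yoke = apply(0.0, 0.0, reading, 0.0, 0.0, true, 1.0, 1.0);
        System.out.println(normal[0] + ", " + normal[1]);
        System.out.println(yoke[0] + ", " + yoke[1]);
    }
}
```

    Keeping the tilt factors as multipliers, rather than branching on the input mode, means a single code path serves both touchscreen and accelerometer play.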

    Process Management    

    The Android OS employs a process model in which applications do not usually exit once started. Rather, they are suspended and then they resume at some indeterminate point in the future. The Demofox application responds to being suspended and resumed by restarting the application. That is, the user returns to the start of the button-based GUI used to select level and input mode. Full credit is given for high scores set before process suspension. 


    The author's experience with Android and its OpenGL implementation has been positive. This implementation is a powerful and accessible one. The Eclipse IDE and Java language are economical and familiar choices, whose operations and semantics should be familiar to many developers. 

    The accelerometer and touchscreen input devices are somewhat more novel technologies. The author hopes that the code discussed here offers some useful techniques for dealing with these forms of input.    


    Many of the images used in the article, and used as embedded resources in the game, were photographs taken by, or in conjunction with, the United States federal government. The first game level uses a photograph taken by the Hubble Space Telescope, of the Boomerang Nebula. The European Space Agency shares credit for this image, which is in the public domain.

    Similarly, the image of the diver and camera shown above in the article was released by the U.S. Navy. Credit goes to Petty Officer Shane Tuck, and the individual shown is Petty Officer Jayme Pastoric. 

    Credit for the picture of aircraft yokes shown in the article body above goes to Christian Kath. This image was made available under the GNU Free Documentation License.  

    The background image for the second game level is an original work by my daughter, Holly. 


    1. Portrait mode play is possible, but play control seemed inferior to the author. Also, "yoke" mode, in which the accelerometer-based UI directly models the yoke of an airplane, is not available when the device is held in portrait orientation. This reflects the fact that real airplane yokes are generally wider than they are tall.

    2. Issues of aspect ratio will cause these cubes to actually appear elongated in actual practice. Because the prisms shown to the user in this application are elongated by design, this is not a problem.   

    3. Consideration was given to using an array here. Ultimately, the required syntax was judged to be less intuitive than having two sets of variables. 


    This is the third major version of this article. Both revisions contained improvements to the article text only. The code and binary files have not changed. 


    This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)


    Windows SDK for Windows 7 and .NET Framework 4 Release Notes



    1. Welcome

    2. License Agreement

    3. Installing and Uninstalling the Windows SDK

    4. Build Environment

    5. Documentation

    6. Known Issues    

    7. Windows SDK Product Support and Feedback




    1. Welcome

    Welcome to the Microsoft Windows Software Development Kit (SDK) for Windows 7 and .NET Framework 4.

    The Windows SDK contains a set of tools, code samples, documentation, compilers, headers, and libraries that developers can use to create applications that run on Microsoft Windows. You can use the Windows SDK to write applications using the native (Win32/COM) or managed (.NET Framework) programming model.

    For access to additional resources and information, such as downloads, forum posts, and the Windows SDK team blog, go to the Windows SDK Developer Center.


    1.1 What’s New in the Windows SDK for Windows 7 and .NET Framework 4 Release?

    ·         Smaller/Faster: at less than 600MB, this SDK is one third the size of the Windows 7 RTM SDK; it installs faster and has a smaller footprint.

    ·         Cleaner setup: features on setup screens have been grouped into native, managed, and common buckets to help you choose the components you need faster.

    ·         New Microsoft Help System v1.0: this brand new system was first introduced with Visual Studio 2010.  You can import just the content you need from the MSDN cloud, and update it according to your schedule.

    ·         Visual C++ 2010 compilers/CRT with improved compilation performance and speed.  These are the same compilers and toolset that ship with Visual Studio 2010.

    ·         New command line build environment that uses MSBuild 4.0, now the common Microsoft build system for all languages, and supporting the new Visual C++ project type .vcxproj.


    1.2 What Does the Windows SDK for Windows 7 and .NET Framework 4 Support?

    ·         Operating Systems: you can install this SDK on and/or create applications for Windows 7, Windows Server 2008 R2, Windows Server 2008, Windows XP SP3, Windows Vista, and Windows Server 2003 R2.

    ·         Platform architecture: you can install this SDK on and/or create applications for platform chipsets X86, X64, and IA64 (Itanium).

    ·         .NET Framework: you can use the SDK resources to create applications that target .NET Framework versions 2.0, 3.0, 3.5, 4.

    ·         Visual Studio: you can use the resources in this SDK with Visual Studio versions 2005, 2008, and 2010, including Express editions. (Not all features work with all versions of Visual Studio. For example, you can’t use the .NET 4 tools with Visual Studio 2008.)

    ·         Setup/Install options: Win SDK v7.1 will be available through an ISO or a Web setup download and install experience.  Web setup allows you to install selected components of the SDK without having to download the entire SDK.  The DVD ISO setup allows you to download the entire SDK to install later, or share among different computers. 



    1.3 What’s Been Removed?

    1.3.1 Documentation

    The DExplore document viewer that shipped with previous SDKs is no longer delivered via the SDK, and documentation is no longer delivered in-box with the SDK.  You’ll be prompted at the end of SDK setup to download documentation to your computer using the Microsoft Help System if you wish to do so.


    1.3.2 Managed Samples

    Managed samples have been removed from this release of the Windows SDK.  Managed samples can be found on Code Gallery.


    1.3.3 Tools

    The following tools were included in the Windows SDK for Server 2008 and .NET Framework 3.5 release, but are not included in this release:


    Tools Removed in the Windows SDK for Windows 7 and .NET Framework 4

    ·         UISpy.exe

    ·         Wpt_arch.msi



    1.4 Required Resources

    This release of the Windows SDK does not include a .NET Framework Redistributable Package.


    .NET Framework 4 (Required)

    ·         The Microsoft Windows SDK for Windows 7 and .NET Framework 4 requires the RTM version of the full, extended .NET Framework 4 Redistributable Components  

    ·         The .NET Framework 4 Redistributable Components must be installed prior to installing the Windows SDK for Windows 7 and .NET Framework 4

    NOTE:  The Client version of the .NET Framework 4 is not sufficient and the SDK will not install on a computer that has a pre-release (Beta or RC) version of the .NET Framework 4.


    2. License Agreement

    The contents included in the Windows SDK are licensed to you, the end user. Your use of the SDK is subject to the terms of an End User License Agreement (EULA) accompanying the SDK and located in the \License subdirectory. You must read and accept the terms of the EULA before you access or use the SDK. If you do not agree to the terms of the EULA, you are not authorized to use the SDK.

    3. Installing and Uninstalling the Windows SDK

    To optimize your Windows SDK setup experience, we strongly recommend that you install the latest updates and patches from Microsoft Update before you begin installing the Windows SDK.


    3.1 Windows SDK Disk Space Requirements

    The complete installation of the Windows SDK requires less than 600MB of disk space; this SDK is one third the size of the Windows 7 RTM SDK, so it installs faster and has a smaller footprint. Please verify that the computer you are installing to has the minimum required disk space before beginning setup. If the minimum required disk space is not available, setup will return a fatal error.


    3.2 How to Uninstall SDK Components

    When you uninstall the SDK through Programs and Features (Add/Remove Programs on pre-Vista OS's) most of the SDK components will be uninstalled automatically. However, a few shared components installed by the SDK may need to be uninstalled separately. This guide provides instructions for uninstalling those shared components.


    To uninstall shared SDK components:

    1.    From the Start Menu, go to Control Panel, Programs and Features (Add/Remove Programs on pre-Vista OS's)

    2.    Select and remove the following entry:

    ·    Microsoft Windows SDK for Windows 7 (7.1) (the Windows SDK core-component files)

    3.    Remove the shared components. This list provides some of the components you may see. If you are running a 64 bit version of Windows you may see both 32 bit (x86) and 64 bit (x64, IA-64) versions of the components listed:

    ·         Application Verifier

    ·         Debugging Tools for Windows

    ·         Windows Performance Toolkit

    ·         Microsoft Help Viewer 1.0

    ·         Microsoft Visual C++ 2010 Redistributable

    ·         Microsoft Visual C++ 2010 Standard Edition


    4. Build Environment

    4.1 Moving From the VCBuild to the MSBuild 4.0 Environment

    Visual C++ 2010 moved to a newer build system (MSBuild v4.0) to provide better performance, scalability, and extensibility; earlier versions of Visual C++ used VCBuild. To make the migration possible, the Visual C++ 2010 project file format changed to be compatible with MSBuild v4.0, and the project file extension changed from “.vcproj” to “.vcxproj”.  Project files with the “.vcproj” extension can be upgraded using the vcupgrade tool (included in this release of the SDK).  For more information on how to upgrade a project file, see the section titled “Upgrading Projects to Visual C++ 2010.”


    4.2 Setting Build Environment Switches

    To set specific targets in the build environment:

    ·         Launch the Windows SDK build environment - From the Start menu, click on  All Programs > Microsoft Windows SDK v7.1 > Windows SDK 7.1 Command Prompt

    ·         Set the build environment -  At the prompt, type:
    setenv  [/Debug | /Release][/x86 | /x64 | /ia64 ][/vista | /xp | /2003 | /win7][-h | /?]


    The setenv.cmd help [-h | /?] displays the usage

                    /Debug   - Create a Debug configuration build environment

                    /Release - Create a Release configuration build environment

                    /x86     - Create 32-bit x86 applications

                    /x64     - Create 64-bit x64 applications

                    /ia64    - Create 64-bit ia64 applications

                    /vista   - Create Windows Vista SP1 or Windows Server 2008 applications

                    /xp      - Create Windows XP applications

                    /2003    - Create Windows Server 2003 applications

                    /win7    - Create Windows 7 or Windows Server 2008 R2 applications


    4.3 Upgrading Projects to Visual C++ 2010

    The Windows SDK for Windows 7 and .NET Framework 4 installs the vcupgrade.exe tool which will convert the previous project file format to the MSBuild compatible project file format. In this process the extension of the project file (.vcproj) will change to .vcxproj.  To upgrade a Visual C++ 2005 or a Visual C++ 2008 (ex: sample.vcproj) file to VC 2010 (sample.vcxproj) file format in the SDK build environment, at the command prompt type: “vcupgrade sample.vcproj”.



    VCUpgrade [options] <project file>



         -nologo            Suppresses the copyright message

         -nocolor           Do not output error and warning messages in color

         -overwrite         Overwrite existing files

         -PersistFramework  Keep the Target Framework Version while upgrading. (-p)


    The vcupgrade tool can upgrade any project version from VC6 through VC 2008 to the VC 2010 format, and returns a success message if the conversion succeeded.  If unsuccessful, a message listing conversion errors is returned.


    5. Documentation

    The new Microsoft Help System allows you to view documents on the MSDN Library using a standard browser, and select documents to download from the MSDN Online content publication web site (MSDN cloud) to your computer for viewing when a connection to the Internet is unavailable or undesired.  You can download, update or delete content on your own schedule.  

    The Microsoft Help System is also delivered via Visual Studio 2010. 


    If you select the Microsoft Help System during SDK setup, you will be prompted at the end of SDK setup to import documentation to your computer using the Help Library Manager.  You will be prompted to select a location for content to be stored on your computer’s hard drive.  You may select the default location to store content on your computer at this time only.  The content store location cannot be changed later. 


    Microsoft Help System is the replacement for the Document Explorer (DExplore) help viewer that was delivered in earlier versions of the Windows SDK.  The Document Explorer has been deprecated.

    5.1 Start Menu shortcuts for Microsoft Help System

    Three Start Menu shortcuts will be installed with the MHS under All Programs, Windows SDK v7.1, Documentation:

    ·         Manage Help Settings: use the Help Library Manager to manage content in online and offline modes

    ·         Microsoft Help System Documentation: open the MHS Help-on-Help documentation stored locally on your computer when in offline mode, or online on MSDN when in online mode.

    ·         Windows SDK Documentation: open the documentation for the Windows SDK product stored locally on your computer when in offline mode, or online on MSDN when in online mode.

    5.2 Confirm that you wish to go online

    The MHS is set to Online Mode by default.  The first time you click the MHS shortcuts on the Start Menu you will be asked to confirm that you wish to connect to the Internet to view documentation in the MSDN cloud. 

    5.3 Offline documentation

    If you wish to view documentation when a connection to the Internet is unavailable, you can import documentation sets (books) from the MSDN cloud and install these books to your computer.  You can switch to Offline Mode to view content on your computer by default. 

    5.4 Configuring MHS to use Offline mode

    The Help Viewer is set to online mode by default upon installation.  This setting affects all instances of the Help System installed on a computer.  If you have downloaded content from the MSDN cloud using the Help Library Manager (Start, All Programs, Windows SDK v7.1, Documentation, Help Library Manager) and wish to view these documents when not connected to the Internet, you must change the configuration to offline mode:

    1.  Open the Help Library Manager (Start, All Programs, Windows SDK v7.1, Documentation, Help Library Manager)

    2.  Select Choose online or local help

    3.  Select I want to use local help

    4.  Click OK

    5.5 Importing documentation to your local computer

    If you wish to import (download documentation) to your local computer for viewing in offline mode (no Internet connection):

    1.  From the Start Menu, select Start, All Programs, Windows SDK v7.1, Documentation, Help Library Manager. 

    2.  On the options screen, select ‘Install Content from online’.

    3.  Select the Add link next to any content you wish to add locally, example:  .NET Framework 3.5

    4.  When the Add link is selected, the Actions column will change to say Cancel and Status to Update Pending. 

    5.  Click the Update button.

    6.  While the content is updating you will see a status screen.

    7.  When the update is finished you will see a Finished Updating notification.

    8.  On the options screen, select ‘Choose online or local help’.

    9.  Select ‘I want to use local help’ and click OK

    5.6 Checking for updates on the MSDN cloud

    Documentation is frequently updated on the MSDN Cloud.  If you wish to update the documentation you have previously downloaded to your local computer for viewing in offline mode (no Internet connection):

    1.  From the Start Menu, select Start, All Programs, Windows SDK v7.1, Documentation, Help Library Manager. 

    2.  On the options screen, select ‘Check for updates online’.

    3.  A Checking for Updates screen will show the documentation you have imported to your computer.  The Status column next to each feature will show the status of the documentation for that feature:

          - Up to date: no updates are available

          - Update Available: click to import updates for this feature documentation

    4.  Click the Update button.

    5.  While the content is updating you will see a status screen.

    6.  When the update is finished you will see a Finished Updating notification.

    5.7 Removing documentation content from your computer

    If you wish to remove the documentation you have previously downloaded to your local computer for viewing in offline mode:

    1.  From the Start Menu, select Start, All Programs, Windows SDK v7.1, Documentation, Help Library Manager. 

    2.  On the options screen, select ‘Remove content’.

    3.  A Remove Content screen will show the documentation stored locally on your computer.  The Actions column next to each feature will show the available actions for that documentation set.

    4.  Click the Remove link to remove a documentation set.

    5.  The Status column will show Remove Pending as content is being deleted.

    6.  Click the Cancel link during removal if you wish to cancel this action.

    7.  Click the Remove button.

    8.  While the content is being removed you will see an Updating Local Library screen.

    9.  When the removal is finished you will see a Finished Updating notification.

    5.8 Installing Offline Content from media

    Offline content on media is not provided for this release of the Windows SDK.  This content may be made available at a future date.



    6. Known Issues

    This release of the Windows SDK has the following known issues, categorized by type.

    6.1 Build Environment

    6.1.1 Limitations of the vcupgrade tool

    ·         The vcupgrade tool cannot upgrade a solution file

    Proposed workaround:

    o   Hand edit/update the solution file to conform with the Visual Studio 2010 format. This involves

    §  Changing the header in the file to say Visual Studio 2010.

    §  Changing the extension of the projects in the solution file to .vcxproj


    ·         The vcupgrade tool cannot upgrade a project which has a reference to another project.

    Proposed workaround:

    §  Run the tool from the solution directory.


    6.1.2 SDK Build Environment may Fail on X86 XP with VS2005

    If your usage scenario matches the one listed below, you will be unable to build in the Windows SDK command line build environment.  If you type cl.exe in the SDK command window and press Enter, you will see this error:

    This application has failed to start because mspdb80.dll was not found.  Re-installing the application may fix this problem.

    Computer setup required to repro issue:

    1.           Windows XP on x86 machine (which has version 5.1.2600.2180 of REG.exe)

    2.           Visual Studio 2005 installed, but Visual Studio 2008 or 2010 is NOT installed

    Cause: When the SDK build environment window is launched, the SDK file SetEnv.cmd launches Reg.exe.  Reg.exe generates standard output when a valid KeyPath is specified and also generates error output when invalid Value is specified.  In this scenario, the KeyPath is valid but the value doesn’t exist.  For more information, see the Windows SDK blog post on this issue.

    Workaround: follow these instructions to manually edit SetEnv.cmd to remove the second call to REG:

    1.    Open C:\Program Files\Microsoft SDKs\Windows\v7.0\Bin\SetEnv.cmd in Notepad or another editor

    2.    For this line:
    FOR /F "tokens=2* delims=  " %%A IN ('REG QUERY "%VSRegKeyPath%" /v 9.0') DO SET VSRoot=%%B

    Either comment out using the REM command (like this):
    REM FOR /F "tokens=2* delims=     " %%A IN ('REG QUERY "%VSRegKeyPath%" /v 9.0') DO SET VSRoot=%%B

    OR delete the line completely.

    3.    Save SetEnv.cmd

    4.    Restart the Windows SDK command prompt


    6.1.3 Windows 7 SDK with Visual C++ 2005: Failure to compile in Debug mode.

    If your usage scenario matches the one listed below, you will be unable to debug in the Windows SDK command line build environment or Visual Studio 2005 SP1.

    Symptom: You have an .lib file or an .obj file that exposes C interfaces that was built by using Microsoft Visual C++ 2008 or for Windows 7. You add this file to a project as a link dependency. When you build the project in Microsoft Visual Studio 2005 Service Pack 1 (SP1) to generate an .exe file or a .dll file, you may receive the following link error:

    Fatal error LNK1103: debugging information corrupt

    Cause: This problem occurs because of a compatibility issue between Visual Studio 2005 and Visual Studio 2008 versions. For more information, see the Microsoft Support page for the patch:

    Fix: Install the patch for Visual Studio 2005 SP1 available from:


    6.1.4 Platform Used in /p:platform=[target platform] and setenv [target platform] Must Match

    If you encounter one of the following error messages, make sure that platform in /p:platform=[target platform] and setenv [target platform] match.


    Error messages:

    You are attempting to build an AMD64 application from an x86 environment.

    You are attempting to build an Itanium (IA64) application from an x86 environment.

    You are attempting to build a Win32 application from an x64 environment.

    You are attempting to build an Itanium (IA64) application from an x64 environment.

    You are attempting to build a Win32 application from an Itanium (IA64) environment.

    You are attempting to build an AMD64 application from an Itanium (IA64) environment.
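    The pairing rule behind these messages can be sketched in a few lines of C++. This is only an illustration, not how MSBuild or SetEnv.cmd perform the check: PlatformMismatchMessage is a hypothetical helper, and the environment names "x86", "x64" and "ia64" are assumed labels. The target names Win32, x64 and Itanium follow the /p:platform values used elsewhere in these notes.

```cpp
#include <string>

// Hypothetical helper (not part of the SDK): returns the mismatch message for
// a build environment and a /p:platform target, or an empty string when the
// two agree. Assumed environment names: "x86", "x64", "ia64".
std::string PlatformMismatchMessage(const std::string& env, const std::string& target)
{
    // Map each target platform onto the environment that is meant to build it.
    std::string required;
    std::string targetLabel;
    if (target == "Win32")        { required = "x86";  targetLabel = "a Win32"; }
    else if (target == "x64")     { required = "x64";  targetLabel = "an AMD64"; }
    else if (target == "Itanium") { required = "ia64"; targetLabel = "an Itanium (IA64)"; }
    else return "Unknown target platform: " + target;

    if (env == required) return "";  // environment and target match

    const std::string envLabel = (env == "ia64") ? "Itanium (IA64)" : env;
    return "You are attempting to build " + targetLabel +
           " application from an " + envLabel + " environment.";
}
```

    An empty result means the two settings agree; any other combination reproduces one of the six messages listed above.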


    6.1.5 Builds Requiring hhc.exe May Fail When Using MSBuild

    Builds requiring hhc.exe (HTML Help Workshop compiler) may fail and throw a "Windows cannot find 'hhc'..." error message.  There are a couple of workarounds.


    Workaround 1:

    Modify setenv.cmd to include %programfiles(x86)%\HTML Help Workshop

    ·         Open an elevated command-prompt.

    ·         Change directories to the Windows SDK 7.1 bin directory (%programfiles%\Microsoft SDKs\Windows\v7.1\bin)

    ·         Type: Notepad setenv.cmd

    ·         After line 369 (which looks like this): SET Path=%FxTools%;%VSTools%;%VCTools%;%SdkTools%;%Path%
    Add the following:
    IF "%CURRENT_CPU%"=="x86" (
    SET "Path=%path%;%ProgramFiles%\HTML Help Workshop"
    ) ELSE (
    SET "Path=%path%;%ProgramFiles(x86)%\HTML Help Workshop"
    )


    Workaround 2:

    Add %programfiles(x86)%\HTML Help Workshop to %path% in the environment variables.

    On Vista/Server 2008/Windows 7:

    ·         Click Start

    ·         Right-click “Computer”

    ·         Click “Properties”

    ·         Click “Advanced system settings”

    ·         Click the “Environment Variables…” button

    ·         Use the scroll bar to find the “Path” variable

    ·         Click “Path” to highlight it, then click the “Edit…” button

    ·         At the end of the Variable Value text box, add a semicolon and the path to the HTML Help Workshop (i.e. “;%programfiles%\HTML Help Workshop” on x86)


     6.2 Microsoft Help System

    6.2.1 Microsoft Help Viewer is unavailable during SDK setup

    If the Microsoft Help System (MHS) has already been installed on your computer by a Visual Studio 2010 product, the Microsoft Help Viewer node on the SDK setup screen will be unavailable, or grayed out.  This is by design.  The MHS will only be installed once on a computer.

    6.2.2 Microsoft Help System remains after uninstall of the Windows SDK

    Help Viewer is a separate entry in Programs and Features (Add/Remove Programs in pre-Vista OSes) and will not be uninstalled when the Windows SDK is uninstalled.  The Help Viewer must be uninstalled separately. 

    6.2.3 Help Library Manager will not update/synch content if Internet connection is not available

    The Help Library Manager requires an Internet connection in order to connect to the MSDN cloud to update or synch content.   If you attempt to synch offline content with updated documentation available in the MSDN cloud, you will receive an error message that a network connectivity problem has occurred.

    6.2.4 Downloading documentation requires an internet connection

    Documentation is not included in the SDK setup package and must be downloaded from the MSDN cloud using the Help Library Manager (Start, All Programs, Windows SDK v7.1, Documentation, Help Library Manager).

    6.2.5 The Help Viewer is not supported on computers with an Itanium (IA64) chip

    The Help Viewer will not install on a computer that is using an Itanium (IA64) processor.  A Help Viewer option will not be available when installing the Windows SDK on a computer with an Itanium (IA64) processor.

    6.2.6 Help Library Manager requires Administrator permission to update content

    For security purposes, the Help Library Manager installs content to a location that is locked down to users who are not members of the HelpLibraryUpdaters security group.  Non-members cannot update, add or delete content to the store.  This security group is created during setup, and by default, it includes the administrators group and the user account that created the local store.  To enable support for multiple users (non-administrators) to update content on the machine, an administrator must add the additional accounts to the HelpLibraryUpdaters security group. 

    6.2.7 Help Library Manager validates digital signatures on content

    As a security check, the Help Library Manager checks the digital signature used to sign cabinet files that contain documentation content to be installed locally. When installing content interactively, users can view the certificate of signed files and approve or reject the content.


    6.3 SDK Tools and Compilers

    6.3.1 Some Tools Require .NET Framework 3.5 SP1

    In order to function correctly some tools included in the Windows SDK also require you to install .NET Framework 3.5 SP1.

    6.3.2 The Command-Line Environment for the Windows SDK Configuration Tool Supports Only Visual Studio 2008

    The command-line environment for the Windows SDK Version Selection tool supports only Visual Studio 2008. It does not support earlier versions of Visual Studio.

    6.3.3 GuidGen.exe may fail to run without Visual Studio Installed.

    The .NET Framework 3.5 version of the GuidGen.exe tool has a dependency on version 9.0 of the MFC Runtime. If you do not have the correct version installed you may see an error similar to this:


    C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin>guidgen.exe

    The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log or use the command-line sxstrace.exe tool for more detail.


    In order for the tool to function correctly we recommend installing any Visual Studio 2008 Express edition to obtain the required version of the MFC Runtime.


    NOTE:  This does not occur with the .NET Framework 4 version of GuidGen.exe.

    6.3.4 MSIL Disassembler (ILDASM.EXE) Help does not display in Windows 7.

    Help for the ILDASM tool does not display because it uses a help file format that is not natively supported in Windows 7.  Online documentation is available:

    6.3.5 OLE-COM Object Viewer (OLEView.exe) – error message appears and some tool functions are unavailable if tool is not run for the first time with administrator permissions.

    If the first run of the OLE-COM Object Viewer tool is not done with administrator permissions the following error message will appear:

    DllRegisterServer in IVIEWERS.DLL failed. OLEViewer will operate correctly without this DLL, however you will not be able to use the interface viewers.

    To avoid this issue use “Run as administrator” when you run the tool for the first time.

    6.3.6 FXCop Setup is Now Located Under the Windows SDK “\Bin” Directory.

    The installer for FXCop, fxcopsetup.exe, is now located in [Program Files]\Microsoft SDKs\Windows\v7.1\Bin\FXCop.


    6.3.7 Windows SDK Configuration Tool Does Not Update Visual Studio 2008 Paths for Itanium Headers and Libraries

     Symptom: When the Windows SDK Configuration Tool is used to integrate the Windows 7 SDK with Visual Studio 2008, the Visual C++ Directories for headers and libraries are set to point to the Windows 7 SDK content for x86 and x64 platforms only. Visual Studio will still point to the IA64 headers, libraries and tools that ship with Visual Studio 2008.

    Fix: Open Visual Studio 2008, Select Tools->Options->Projects and Solutions->C++ Directories. From the ‘Platform’ drop-down, Select ‘Itanium’ and from the “Show directories for” drop-down select ‘Include files’. Replace all instances of ‘WindowsSDKDirIA64’ with ‘WindowsSDKDir’. Select ‘Library files’ from the “Show directories for” drop down menu and repeat the previous step.


    6.3.8 Windows UI Automation May Fail When Using AccEvent or Inspect on Windows XP, Windows Vista, Windows Server 2003 R2 or Windows Server 2008


    Symptom in UIA Events Mode: Start AccEvent, Select Mode, Select UIA Event, Select Settings.  An error occurs stating “Accessible Event Watcher (32-bit UNICODE Release) has encountered a problem and needs to close.  We are sorry for the inconvenience.”

    Symptom in WinEvents Mode: Start AccEvent, Select Mode menu, Select WinEvents (In Context) or WinEvents (Out of Context), Select Events menu, Select Start Listening.  At the bottom corner of the screen a message is displayed stating “Registration failed, Stopped”.

    Cause: The latest Automation API is not present on the machine.  For more information, see the Microsoft Support page for the Windows Automation API:

    Fix:   Install the latest Windows Automation API available from:



    Symptom: When starting Inspect an error occurs stating “The program was unable to query the UI Automation client interfaces. Please confirm that the latest framework is installed.”

    Cause: The latest Automation API is not present on the machine.  For more information, see the Microsoft Support page for the Windows Automation API:

    Fix:   Install the latest Windows Automation API available from:


    6.3.9 A Failure May Occur When Using aximp.exe to Add a COM Component to VS2010 Sample Form Application


    The following error message is encountered when adding a COM component to a VS2010 Sample Form Application.

    Error Message:

    “Failed to import ActiveX control. Please ensure it is properly registered.”



    Workaround: Install comlibrary.interop.dll and axcontrol.interop.dll into the GAC.  Open a Windows SDK 7.1 Command Prompt.  At the prompt, type cd bin.  Then type Gacutil /if comlibrary.interop.dll and Gacutil /if axcontrol.interop.dll.


    6.3.10 Windows Troubleshooting Pack Designer May Fail on Startup

    Issue: Windows Troubleshooting Pack Designer fails to start.

    Cause: TSPDesigner.exe and DesignerFunction.dll are delay signed only.

    Workaround: Run the SN tool to bypass strong name validation for Windows Troubleshooting Pack Designer:

    -          Open a Windows SDK Command Prompt as Administrator

    -          Change directories to the folder where TSPDesigner.exe is located.  The default location is C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin\TSPDesigner

    -          Enter ‘sn.exe -Vr TSPDesigner.exe’

    -          Enter ‘sn.exe -Vr DesignFunction.dll’

    Note that ‘sn.exe -Vr’ is case sensitive


    6.4 Samples

    6.4.1 Some C++ Samples Require Upgrade Prior to Building

  Building Samples Using Visual Studio 2005/2008

    Many of the sample project files in the SDK are written for VC 8.0 compilers, the compilers that shipped in Microsoft Visual Studio 2005. Samples with version 8.0 project files can be built in Microsoft Visual Studio 2005 but must be upgraded before being built in the Visual Studio 2008 build environment. 

  Building Samples Using Visual Studio 2010 or Windows SDK v7.1 (MSBuild 4.0)

    Visual Studio 2010 and the Windows SDK v7.1 use the MSBuild 4.0 build environment.  The Visual C++ compilers shipped in the Windows SDK v7.1 are the same compilers that ship in Visual Studio 2010 RTM.  All native sample project and solution files in the Windows SDK v7.1 must be upgraded prior to building using the MSBuild 4.0 environment.  

    NOTE: For more information about upgrading samples to build in the MSBuild 4.0 build environment, see the section titled “Moving From the VCBuild to the MSBuild 4.0 Environment”.

    6.4.2 Some C++ Samples will not Build in Visual Studio 2005

    Some of the samples in the SDK contain v9.0 project files, which will not build in Visual Studio 2005. You cannot downgrade the project files to a lower version.  To work around this issue, install Visual Studio 2008 or 2010 (Express or Retail SKU) or build the sample in the Windows SDK command line build environment.

    6.4.3 Some Samples have External Dependencies

    Some samples included with the Windows SDK have dependencies on components outside the Windows SDK.

    ATL/MFC Dependency

    Some samples require the ATL and/or MFC headers, libraries, or runtime, which are included with Visual C++ (non-Express editions). When building a sample that depends on ATL/MFC without Visual Studio installed on your computer, you might see an error similar to this:

    fatal error C1083: Cannot open include file: 'afxwin.h': No such file or directory

    To work around this issue, install a non-Express version of Microsoft Visual Studio with the compatible MFC/ATL.

    Windows Media Player Dependency

    The Multimedia\WMP_11\dotNet\SchemaReader sample requires Windows Media Player 11 or later to be installed.

    Microsoft Management Console 3.0 Dependency

    The samples in the \Samples\SysMgmt\MMC3.0 directory require Microsoft Management Console 3.0 or later to be installed.

    DirectX SDK Dependency

    Some samples require the DirectX SDK (refer to the sample's readme for additional information).

    msime.h Dependency

    Some samples fail to build because the file msime.h is not found. Msime.h is not shipped with the Windows SDK. Msime.h is for use by developers when customizing applications for the 2007 Microsoft Office System.

    The affected samples are:

    ·         winui\Input\tsf\TSFApps\ImmPad-Interim

    ·         winui\Input\tsf\TSFApps\ImmPad-Level3-Step3

    ·         winui\Input\tsf\TSFApps\TsfPad-Hybrid

    To work around this issue, download msime.h from the Microsoft Download Center and copy it to the Windows SDK \Include directory.

    6.4.4 Some Samples Require Microsoft Visual Studio 2005 and will not Build with Microsoft Visual Studio 2008

    A few unmanaged samples rely on mspbase.h, mtype.h, or mfc80ud.lib. These files are included with Microsoft Visual Studio 2005 and do not ship with Microsoft Visual Studio 2008.

    1.    mspbase.h

    a.    netds\Tapi\Tapi3\Cpp\Msp\MSPBase

    b.    netds\Tapi\Tapi3\Cpp\Msp\Sample

    c.    netds\Tapi\Tapi3\Cpp\pluggable

    2.    mtype.h

    a.    netds\Tapi\Tapi3\Cpp\tapirecv

    b.    netds\Tapi\Tapi3\Cpp\tapisend

    3.    mfc80ud.lib

    a.    Sysmgmt\Wmi\VC\AdvClient

    6.4.5 Some C++ Samples do not have Configurations for x64

    When building a C++ sample that has no x64 configuration on an x64 computer, you might see the following error message:

    fatal error LNK1112: module machine type 'x64' conflicts with target machine type 'x86'

    To work around this issue, perform one of the following actions:

    1.       Build the sample targeting x86 by using this command:

    msbuild *.vcxproj /p:platform=win32

    2.       Add x64 support by doing the following:

    a.       Load the sample in Microsoft Visual Studio (C++).

    b.      Update the Configuration Manager under Project | Properties.


    For detailed instructions, see the Windows SDK Blog post "How to add 64-bit support to vcproj files."


    Note: if you do not install libraries for all CPU architectures during SDK setup, some samples with Visual C++ project files might fail to build with this error for all configurations in the project file:


    Fatal error LNK1181: cannot open input file


    For example, if a sample has an x86 configuration and x86 libraries were not installed (these libraries are installed by default when installing the SDK on all platforms), the sample will fail to compile.

    6.4.6 Visual J# Samples Require VJ++ (Visual Studio)

    J# samples will not build using the Windows SDK because there is no appropriate build environment. This edition of the Windows SDK does not support building J# applications. To work around this issue, install Visual Studio (VJ++) 2005.

    6.4.7 Setupvroot.bat Setup Script for WCF Samples Fails on Windows Vista if the NetMsmqActivator Service is Enabled and Message Queuing (MSMQ) is not Installed

    The Windows Communication Foundation (WCF) samples setup script Setupvroot.bat does not work on Windows Vista if the NetMsmqActivator service is enabled and Message Queuing (MSMQ) is not installed. The iisreset utility does not work unless MSMQ is installed or the NetMsmqActivator service is disabled.

    Make sure MSMQ is installed or disable the NetMsmqActivator service on Windows Vista before you run the WCF samples setup script Setupvroot.bat.

    6.4.8 Some Samples Fail to Compile: Debug:Itanium Error

    Some samples might fail to compile on all platforms with the following error:

    vcbuild.exe: error VCBLD0004: Project 'C:\Samples\Technologies\DirectoryServices\BEREncoding\CP\BerEncoding\BerEncoding.vcproj' does not contain a configuration called 'Debug|Itanium'

    This error occurs because platform configurations are listed alphabetically by default in a project or solution file created by Visual Studio. If Debug|Itanium is a supported configuration, it will be listed first in the samples' solution and/or project files. This configuration will be built first by default.

    To work around this issue, use a configuration switch to specify what platform you want to build for:

    Msbuild.exe *.sln /p:platform=Win32
    Msbuild.exe *.sln /p:platform=x64
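    To see why Debug|Itanium ends up first, here is a small C++ sketch of the alphabetical ordering described above. It assumes a plain case-sensitive string sort, which may differ in detail from how Visual Studio orders entries; FirstConfigurationAlphabetically is a hypothetical name used only for this illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the configuration that comes first when the list is sorted
// alphabetically (plain byte-wise comparison; an assumption for this sketch).
std::string FirstConfigurationAlphabetically(std::vector<std::string> configs)
{
    if (configs.empty()) return "";
    std::sort(configs.begin(), configs.end());
    return configs.front();
}
```

    For a sample that lists Debug|Win32, Debug|x64 and Debug|Itanium, the sorted list starts with Debug|Itanium, which is why a build with no platform switch tries that configuration first.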

    6.4.9 Windows Media Services SDK Plug-ins Fail to Build on Windows 2008 Server

    The “Playlistparser” and “Authorization” plug-in samples should be available in Windows Media Services. The user should be able to enable and disable the newly built plug-ins. However, the “Playlistparser” and “Authorization” plug-ins fail to build and produce the following error:

    Syntax error for the code given below in Unknwn.idl

       HRESULT QueryInterface(

            [in] REFIID riid,

            [out, iid_is(riid), annotation("__RPC__deref_out")] void **ppvObject);

    To work around this issue, build the “Playlistparser” and “Authorization” plug-ins on Windows Vista or Windows 2003 Server and copy the plug-ins to Windows 2008 Server.

    6.4.10 WSDAPI StockQuote and FileService samples may fail to build on the command line (using msbuild) if Visual Studio 2008 is installed

    The WSDAPI StockQuote and FileService samples may fail to build on the command line (using msbuild) if Visual Studio 2008 is installed. These samples include project files which reference WSDL and XSD files, and msbuild attempts to invoke sproxy.exe to process these files. Visual Studio 2008 does not include sproxy.exe, and compilation fails if the tool is not present. Compiling from inside the Visual Studio IDE is unaffected.

    It is not necessary to use sproxy to process these files. WsdCodeGen, a WSDL/XSD compiler for WSDAPI, can be used to generate C++ code from these files. This generated code is already included in the sample.

    If you encounter this error, you can remove the XSD and WSDL files from the affected project files and recompile with msbuild:

    ·         StockQuote\StockQuoteContract\StockQuoteContract.vcproj: remove StockQuote.xsd, StockQuote.wsdl, and StockQuoteService.wsdl

    ·         FileService\FileServiceContract\FileServiceContract.vcproj: remove FileService.wsdl

    6.4.11 WinBase\RDC sample generates assertion failure when built in Visual Studio

    You may receive an “Assertion failed” message when attempting to build this sample in Visual Studio. This is caused by a post-build instruction that attempts to register the COM component, but fails. To work around this issue, build the sample in Administrator mode. From the Start menu, select Microsoft Visual Studio 2008, then Visual Studio Tools, then right-click a Visual Studio command prompt and select “Run as Administrator”.

    6.5 Headers and Libraries

    6.5.1 The wincodec_proxy.h Header File is Missing From the Windows SDK for Windows 7 and .NET Framework 4

    Symptom: The wincodec_proxy.h header file is missing from the Windows SDK for Windows 7 and .NET Framework 4.

    Fix: Download the wincodec_proxy.h header file from the Microsoft Code Gallery:

    6.5.2 Warnings May Occur When Compiling Programs Containing the ws2tcpip.h Header File

    Symptom: Warnings occur when compiling programs containing the ws2tcpip.h header file.

    Fix: Precede #include <ws2tcpip.h> with #pragma warning(push), #pragma warning(disable : 6386) and follow it with #pragma warning(pop).

    6.5.3 Conflict in definition of INTSAFE_E_ARITHMETIC_OVERFLOW in intsafe.h/comutil.h

    Symptom: You are using the Windows 7 SDK development environment or Visual Studio 2008 or earlier compilers with the Windows 7 version of intsafe.h and at compile time receive the below warning:

    warning C4005: 'INTSAFE_E_ARITHMETIC_OVERFLOW' : macro redefinition c:\program files\microsoft sdks\windows\v7.1\include\intsafe.h

    Cause: This warning is reported because intsafe.h and comutil.h have conflicting definitions of INTSAFE_E_ARITHMETIC_OVERFLOW.

    Workaround: Intsafe.h must be included in the code before comutil.h and the warning will not occur.

    Fix: This conflict is resolved in Visual Studio 2010 and later versions of the compiler package.


    6.6 Windows Native Development

    6.6.1 Windows SDK Configuration Tool (Command Line) Fails With Visual Studio 2005

    This is not a supported scenario and will not work with Visual Studio 2005.


    7.   Windows SDK Product Support and Feedback

    The Windows SDK is provided as-is and is not supported by Microsoft. For technical support, there are a number of options:

    7.1 Professional Support for Developers

    Microsoft Professional Support for Developers provides incident-based access to Microsoft support professionals and rich information services to help developers to create and enhance their software solutions with Microsoft products and technologies.

    For more information about Professional Support for Developers, or to purchase Professional Support incidents, please contact a Customer Representative at 1-800-936-3500. To access Professional Support for Developers, visit the MSDN Web site. If you have already purchased support incidents and would like to speak directly with a Microsoft support professional, call 1-800-936-5800.

    7.2 MSDN Online

    MSDN Online provides Developer Support search, support incident submission, support highlights, service packs, downloads, technical articles, API information, blogs, newsgroups, forums, webcasts, and other resources to help optimize development.

    7.3 Ways to Find Support and Send Feedback

    Your feedback is important to us. Your participation and feedback through the locations listed below is appreciated.

    ·         The MSDN Forums are available for peer-to-peer support.

    ·         Windows SDK Developer Center is the official site about development using the Windows SDK, and provides information about the SDKs, links to the Windows SDK Blog, Forum, online release notes and other resources.

    ·         The Windows SDK Forum deals with topics related specifically to the Windows SDK.

    ·         The Software Development for Windows Client forum contains an updated list of related forums.

    ·         You can also send mail to the Windows SDK Feedback alias at

    ·         The Windows SDK Blog contains workarounds,  late-breaking and forward-looking news.

    Copyright © 2010 Microsoft Corporation. All rights reserved. Legal Notices:



    Using Direct2D with WPF




    By | 3 Nov 2010 | Article
    Hosting Direct2D content in WPF controls.


    With Windows 7, Microsoft introduced a new technology called Direct2D (which is also supported on Windows Vista SP2 with the Platform Update installed). Looking through all its documentation, you'll notice it's aimed at Win32 developers; however, the Windows API Code Pack allows .NET developers to use the features of Windows 7 easily, with Direct2D being one of the features supported. Unfortunately, all the WPF examples included with the Code Pack require hosting the control in a HwndHost, which is a problem as it has airspace issues. This basically means that the Direct2D control needs to be separated from the rest of the WPF controls, which means no overlapping controls with transparency.

    The attached code allows Direct2D to be treated as a normal WPF control and, thanks to some COM interfaces, doesn't require you to download the DirectX SDK or even play around with any C++ - the only dependency is the aforementioned Code Pack (the binaries of which are included in the attached file). This article is more about the problems found along the way and the challenges involved in creating the control, so feel free to skip to the Using the code section if you want to jump right in.


    WPF architecture

    WPF is built on top of DirectX 9, and uses a retained rendering system. What this means is that you don't draw anything to the screen, but instead create a tree of visual objects; their drawing instructions are cached and later rendered automatically by the framework. This, coupled with using DirectX to do the graphics processing, not only enables WPF applications to remain responsive when they have to be redrawn, but also allows WPF to use a "painter's algorithm" painting model. In this model, each component (starting at the back of the display, going towards the front) is asked to draw itself, allowing it to paint over the previous component's display. This is the reason it's so easy to have complex and/or partially transparent shapes with WPF: it was designed with this scenario in mind. For more information, check out the MSDN article.
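    As a rough illustration of the painter's algorithm (not WPF's actual implementation), the C++ sketch below paints a one-dimensional row of grey levels. The Visual struct and the Paint function are hypothetical names invented for this example, and the blend is the usual "source over" formula.

```cpp
#include <vector>

// A hypothetical visual: a horizontal span [begin, end) filled with a grey
// level (0..1) at a given opacity (0 = transparent, 1 = opaque).
struct Visual {
    int begin;
    int end;
    double value;
    double opacity;
};

// Painter's algorithm: draw the visuals in back-to-front order, letting each
// one blend over whatever was painted before it.
std::vector<double> Paint(int width, double background,
                          const std::vector<Visual>& backToFront)
{
    if (width <= 0) return {};
    std::vector<double> row(static_cast<std::size_t>(width), background);
    for (const Visual& v : backToFront) {
        for (int x = v.begin; x < v.end && x < width; ++x) {
            if (x < 0) continue;  // skip the part of the span left of the row
            // "Source over" blend: the new visual covers the old pixel in
            // proportion to its opacity.
            const std::size_t i = static_cast<std::size_t>(x);
            row[i] = v.opacity * v.value + (1.0 - v.opacity) * row[i];
        }
    }
    return row;
}
```

    Painting an opaque white span and then a half-transparent black span over part of it leaves the overlap mid-grey, which is the kind of layered transparency the model makes easy.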

    Direct2D architecture

    In contrast to the managed WPF model, Direct2D is an immediate-mode API in which the developer is responsible for everything. This means you are responsible for creating your resources, refreshing the screen, and cleaning up after yourself. It's built on top of Direct3D 10.1, which gives it high-performance rendering, yet it still provides several of the advantages of WPF (such as device independent units, ClearType text rendering, per primitive anti-aliasing, and solid/linear/radial/bitmap brushes). MSDN has a more in-depth introduction; however, it's more aimed at native developers.


    Direct2D has been designed to be easily integrated into existing projects that use GDI, GDI+, or Direct3D, with multiple options available for incorporating Direct2D content with Direct3D 10.1 or above. The Direct2D SDK even includes a nice sample called DXGI Interop to show how to do this.

    To host Direct3D content inside WPF, the D3DImage class was introduced in .NET 3.5 SP1. This allows you to host Direct3D 9 content as an ImageSource, enabling it to be used inside an Image control, or as an ImageBrush etc. There's a great article here on CodeProject with more information and examples.

    The astute would have noticed that whilst both technologies can work with Direct3D, Direct2D requires version 10.1 or later, whilst the D3DImage in WPF only supports version 9. A quick internet search resulted in this blog post by Jeremiah Morrill. He explains that an IDirect3DDevice9Ex (which is supported by D3DImage) supports sharing resources between devices. A shared render target created in Direct3D 10.1 can therefore be pulled into a D3DImage via an intermediate IDirect3DDevice9Ex device. He also includes example source code which does exactly this, and the attached code is derived from his work.

    So, we now have a way of getting Direct2D working with Direct3D 10.1, and we can get WPF working with Direct3D 10.1; the only problem is the dependency of both of the examples on unmanaged C++ code and the DirectX SDK. To get around this problem, we'll access DirectX through its COM interface.

    Component Object Model

    I'll admit I know nothing about COM, apart from to avoid it! However, there's an article here on CodeProject that helped to make it a bit less scary. To use COM, we have to use low level techniques, and I was surprised (and relieved!) to find that the Marshal class has methods which could mimic anything that would normally have to be done in unmanaged code.

    Since there are only a few objects we need from Direct3D 9, and there are only one or two functions in each object that are of interest to us, instead of trying to convert all the interfaces and their functions to their C# equivalent, we'll manually map the V-table as discussed in the linked article. To do this, we'll create a helper function that will extract a method from the specified slot in the V-table:

    public static bool GetComMethod<T, U>(T comObj, int slot, out U method) where U : class
    {
        IntPtr objectAddress = Marshal.GetComInterfaceForObject(comObj, typeof(T));
        if (objectAddress == IntPtr.Zero)
        {
            method = null;
            return false;
        }

        try
        {
            IntPtr vTable = Marshal.ReadIntPtr(objectAddress, 0);
            IntPtr methodAddress = Marshal.ReadIntPtr(vTable, slot * IntPtr.Size);

            // We can't have a Delegate constraint, so we have to cast to
            // object then to our desired delegate
            method = (U)((object)Marshal.GetDelegateForFunctionPointer(
                                 methodAddress, typeof(U)));
            return true;
        }
        finally
        {
            Marshal.Release(objectAddress); // Prevent memory leak
        }
    }

    This code first gets the address of the COM object (using Marshal.GetComInterfaceForObject), then gets the location of the V-table stored at the start of the COM object (using Marshal.ReadIntPtr), then gets the address of the method at the specified slot from the V-table (multiplying by the system size of a pointer, as Marshal.ReadIntPtr specifies the offset in bytes), then finally creates a callable delegate to the returned function pointer (Marshal.GetDelegateForFunctionPointer). Simple!
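    The same walk can be sketched outside .NET. The C++ below is an illustration only: FakeComObject, kFakeVTable, SlotByteOffset and MethodAtSlot are hypothetical names, and the object is a hand-built stand-in laid out like a COM object (a table pointer at offset 0), not a real Direct3D interface.

```cpp
#include <cstddef>
#include <cstring>

// Generic function-pointer type used only to store entries in the table.
using AnyFn = void (*)();

// A COM-style object: its first field points at a table of function pointers.
struct FakeComObject {
    const AnyFn* vtable;
    int value;
};

// Signature of the methods stored in the table; like a COM method, each one
// receives the object itself as its first argument.
using GetValueFn = int (*)(FakeComObject*);

static int GetValueImpl(FakeComObject* self) { return self->value; }
static int GetDoubleImpl(FakeComObject* self) { return self->value * 2; }

static const AnyFn kFakeVTable[] = {
    reinterpret_cast<AnyFn>(&GetValueImpl),   // slot 0
    reinterpret_cast<AnyFn>(&GetDoubleImpl),  // slot 1
};

// Byte offset of a slot: the slot index times the size of a pointer.
std::size_t SlotByteOffset(std::size_t slot) { return slot * sizeof(AnyFn); }

// Reads the table pointer stored at the start of the object, then reads the
// entry at the slot's byte offset and casts it back to its real signature.
GetValueFn MethodAtSlot(const FakeComObject* object, std::size_t slot)
{
    const unsigned char* tableBytes =
        reinterpret_cast<const unsigned char*>(object->vtable);
    AnyFn entry;
    std::memcpy(&entry, tableBytes + SlotByteOffset(slot), sizeof(entry));
    return reinterpret_cast<GetValueFn>(entry);
}
```

    Calling MethodAtSlot(&obj, 1)(&obj) on an object built with kFakeVTable runs GetDoubleImpl, just as the delegate returned by GetComMethod invokes the method found at the requested slot.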

    An important thing to note is that the IntPtr returned by the call to Marshal.GetComInterfaceForObject must be released; I wasn't aware of this, and found my program leaking memory when the resources were being re-created. Also, the function uses an out parameter for the delegate so we get all the nice benefits of type inference, which reduces the amount of typing required for the caller. Finally, you'll notice there's some nasty casting to object and then to the delegate type. This is unfortunate but necessary, as there's no way to specify a delegate generic constraint in C# (the CLI does actually allow this constraint, as mentioned by Jon Skeet in his blog). Since this is an internal class, we'll assume that the caller of the function knows this constraint.

    With this helper function, it becomes a lot easier to create a wrapper around the COM interfaces, so let's take a look at how to provide a wrapper around the IDirect3DTexture9 interface. First, we'll create an internal interface with the ComImport, Guid, and InterfaceType attributes attached so that the Marshal class knows how to use the object. For the GUID, we'll need to look inside the DirectX SDK header files, in particular d3d9.h:

    interface DECLSPEC_UUID("85C31227-3DE5-4f00-9B3A-F11AC38C18B5") IDirect3DTexture9;

    With the same header open, we can also look for the interface's declaration, which looks like this after running it through the pre-processor and removing the __declspec and __stdcall attributes:

    struct IDirect3DTexture9 : public IDirect3DBaseTexture9
    {
        virtual HRESULT QueryInterface( const IID & riid, void** ppvObj) = 0;
        virtual ULONG AddRef(void) = 0;
        virtual ULONG Release(void) = 0;
        virtual HRESULT GetDevice( IDirect3DDevice9** ppDevice) = 0;
        virtual HRESULT SetPrivateData( const GUID & refguid, 
                const void* pData,DWORD SizeOfData,DWORD Flags) = 0;
        virtual HRESULT GetPrivateData( const GUID & refguid, 
                void* pData,DWORD* pSizeOfData) = 0;
        virtual HRESULT FreePrivateData( const GUID & refguid) = 0;
        virtual DWORD SetPriority( DWORD PriorityNew) = 0;
        virtual DWORD GetPriority(void) = 0;
        virtual void PreLoad(void) = 0;
        virtual D3DRESOURCETYPE GetType(void) = 0;
        virtual DWORD SetLOD( DWORD LODNew) = 0;
        virtual DWORD GetLOD(void) = 0;
        virtual DWORD GetLevelCount(void) = 0;
        virtual HRESULT SetAutoGenFilterType( D3DTEXTUREFILTERTYPE FilterType) = 0;
        virtual D3DTEXTUREFILTERTYPE GetAutoGenFilterType(void) = 0;
        virtual void GenerateMipSubLevels(void) = 0;
        virtual HRESULT GetLevelDesc( UINT Level,D3DSURFACE_DESC *pDesc) = 0;
        virtual HRESULT GetSurfaceLevel( UINT Level,IDirect3DSurface9** ppSurfaceLevel) = 0;
        virtual HRESULT LockRect( UINT Level,D3DLOCKED_RECT* pLockedRect, 
                const RECT* pRect,DWORD Flags) = 0;
        virtual HRESULT UnlockRect( UINT Level) = 0;
        virtual HRESULT AddDirtyRect( const RECT* pDirtyRect) = 0;
    };

    We only need one of these methods for our code, which is the GetSurfaceLevel method. Starting from the top and counting down, we can see that this is the 19th method, so it will be at slot 18 in the (zero-based) V-table. We can now create a wrapper class around this interface.

    internal sealed class Direct3DTexture9 : IDisposable
    {
        private delegate int GetSurfaceLevelSignature(IDirect3DTexture9 texture, 
                             uint Level, out IntPtr ppSurfaceLevel);
        [ComImport, Guid("85C31227-3DE5-4f00-9B3A-F11AC38C18B5"), 
         InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
        internal interface IDirect3DTexture9
        {
        }
        private IDirect3DTexture9 comObject;
        private GetSurfaceLevelSignature getSurfaceLevel;
        internal Direct3DTexture9(IDirect3DTexture9 obj)
        {
            this.comObject = obj;
            HelperMethods.GetComMethod(this.comObject, 18, 
                                       out this.getSurfaceLevel);
        }
        public void Dispose()
        {
            this.Release();
        }
        public IntPtr GetSurfaceLevel(uint Level)
        {
            IntPtr surface;
            Marshal.ThrowExceptionForHR(this.getSurfaceLevel(
                                  this.comObject, Level, out surface));
            return surface;
        }
        private void Release()
        {
            if (this.comObject != null)
            {
                Marshal.ReleaseComObject(this.comObject);
                this.comObject = null;
                this.getSurfaceLevel = null;
            }
        }
    }

    In the code, I've used Marshal.ThrowExceptionForHR to make sure that the call succeeds - if there's an error, then it will throw the relevant .NET type (e.g., a result of E_NOTIMPL will result in a NotImplementedException being thrown).
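    The slot lookup that GetComMethod performs can be mimicked in portable C++ without any COM at all. The sketch below is only an illustration under stated assumptions: FakeComObject, SlotFn, and get_method are hypothetical names, and the "V-table" is an ordinary array of function pointers whose address is stored at the start of the object. It follows the same steps as the text: read the table address from the start of the object, then read the entry at byte offset slot * (pointer size).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a COM method: every slot holds one pointer.
using SlotFn = int (*)(void* self, int arg);

// Hypothetical COM-like object: the table address is its first member.
struct FakeComObject {
    const SlotFn* vtable;
    int state;
};

static int add_slot(void* self, int arg) {
    return static_cast<FakeComObject*>(self)->state + arg;
}

static int mul_slot(void* self, int arg) {
    return static_cast<FakeComObject*>(self)->state * arg;
}

// Slot 0 and slot 1 of the hand-built table.
static const SlotFn kTable[] = { add_slot, mul_slot };

// Reads the table address from the start of the object, then the function
// pointer stored at byte offset slot * sizeof(pointer) within the table.
SlotFn get_method(const void* object, std::size_t slot) {
    assert(object != nullptr);
    const unsigned char* table = nullptr;
    std::memcpy(&table, object, sizeof table);
    SlotFn fn = nullptr;
    std::memcpy(&fn, table + slot * sizeof(SlotFn), sizeof fn);
    return fn;
}
```

    With an object built as FakeComObject obj{ kTable, 5 }, get_method(&obj, 1) yields mul_slot, just as slot 18 yields GetSurfaceLevel for the real texture interface.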

    Using the code

    To use the attached code, you can either include the compiled binary into your project, or include the code as there's not a lot of it (despite the time spent on creating it!). Either way, you'll need to make sure you reference the Windows API Code Pack DirectX library in your project.

    In the code, there are three classes of interest: D3D10Image, Direct2DControl, and Scene.

    The D3D10Image class inherits from D3DImage, and adds an overload of the SetBackBuffer method that accepts a Direct3D 10 texture (in the form of a Microsoft.WindowsAPICodePack.DirectX.Direct3D10.Texture2D object). As the code is written, the texture must be in the DXGI_FORMAT_B8G8R8A8_UNORM format; however, feel free to edit the code inside the GetSharedSurface function to support whatever format you want (in fact, the original code by Jeremiah Morrill did allow for different formats, so take a look at that for inspiration).

    Direct2DControl is a wrapper around the D3D10Image control, and provides an easy way to display a Scene. The control takes care of redrawing the Scene and D3D10Image when it's invalidated, and also resizes their contents. To help improve performance, the control uses a timer to resize the contents 100ms after the resize event has been received. If another request to be resized occurs during this time, the timer is reset to 100ms again. This might sound like it could cause problems when resizing, but internally, the control uses an Image control, which will stretch its contents when it's resized so the contents will always be visible; they just might get temporarily blurry. Once resizing has finished, the control will redraw its contents at the correct resolution. Sometimes, for reasons unknown to me, there will be a flicker when this happens, but by using the timer, this will occur infrequently.
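    The resize delay described above is a classic debounce. The following C++ sketch shows the idea in isolation; it is not the control's actual implementation. ResizeDebouncer and its members are hypothetical names, and time is passed in explicitly as milliseconds so the logic can be followed without a real timer.

```cpp
#include <cassert>

// Hypothetical debouncer: redraw only once the resize events have stopped
// arriving for delay_ms milliseconds.
class ResizeDebouncer {
public:
    explicit ResizeDebouncer(long delay_ms) : delay_ms_(delay_ms) {
        assert(delay_ms > 0);
    }

    // Call on every resize event; this (re)starts the countdown.
    void on_resize(long now_ms) {
        deadline_ms_ = now_ms + delay_ms_;
        pending_ = true;
    }

    // Call periodically; returns true once per burst of resize events,
    // after the delay has elapsed since the most recent event.
    bool should_redraw(long now_ms) {
        if (pending_ && now_ms >= deadline_ms_) {
            pending_ = false;
            return true;
        }
        return false;
    }

private:
    long delay_ms_;
    long deadline_ms_ = 0;
    bool pending_ = false;
};
```

    For a 100 ms delay, resize events at t=0 and t=60 produce a single redraw at or after t=160; the event at t=60 resets the countdown started at t=0.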

    The Scene class is an abstract class containing three main functions for you to override: OnCreateResources, OnFreeResources, and OnRender. The reason for the first two functions is that a DirectX device can get destroyed (for example, if you switch users), and afterwards, you will need to create a new device. These methods allow you to create/free device dependent resources, such as brushes for example. The OnRender method, as the name implies, is where you do the actual drawing.

    Putting this together gives us this code to create a simple rectangle on a semi-transparent blue background:

    <!-- Inside your main window XAML code -->
    <!-- Make sure you put an xmlns reference for the d2d prefix at the top of the file -->
    <d2d:Direct2DControl x:Name="d2DControl" />

    using D2D = Microsoft.WindowsAPICodePack.DirectX.Direct2D1;

    internal sealed class MyScene : Direct2D.Scene
    {
        private D2D.SolidColorBrush redBrush;

        protected override void OnCreateResources()
        {
            // We'll fill our rectangle with this brush
            this.redBrush = this.RenderTarget.CreateSolidColorBrush(
                                 new D2D.ColorF(1, 0, 0));
        }

        protected override void OnFreeResources()
        {
            if (this.redBrush != null)
            {
                this.redBrush.Dispose();
                this.redBrush = null;
            }
        }

        protected override void OnRender()
        {
            // This is what we're going to draw
            var size = this.RenderTarget.Size;
            var rect = new D2D.Rect
                (
                    10,
                    10,
                    (int)size.Width - 10,
                    (int)size.Height - 10
                );
            // This actually draws the rectangle
            this.RenderTarget.Clear(new D2D.ColorF(0, 0, 1, 0.5f));
            this.RenderTarget.FillRectangle(rect, this.redBrush);
        }
    }

    // This is the code behind class for the XAML
    public partial class MainWindow : Window
    {
        public MainWindow()
        {
            InitializeComponent();
            // Add this after the call to InitializeComponent. Really you should
            // store this object as a member so you can dispose of it, but in our
            // example it will get disposed when the window is closed.
            this.d2DControl.Scene = new MyScene();
        }
    }

    Updating the Scene

    In the original code, you needed to call Direct2DControl.InvalidateVisual to update the Scene. This has now been changed so that calling the Render method on Scene fires the new Updated event; the Direct2DControl subscribes to this event and invalidates its area accordingly.

    I also discovered that the Scene would sometimes flicker when redrawn. This seems to be an issue with the D3DImage control, and the solution (whilst not 100% effective) is to synchronize the AddDirtyRect call with WPF's rendering (by subscribing to the CompositionTarget.Rendering event). This is all handled by the Direct2DControl for you.

    To make things easier still, there's a new class deriving from Scene called AnimatableScene. After releasing the first version, there was some confusion with how to do continuous scene updates, so hopefully this class should make it easier - you use it the same as the Scene class, but your OnRender code will be called, when required, by setting the desired frames per second in the constructor (though see the Limitations section). Also note that if you override the OnCreateResources method, you need to make sure to call the base's version at the end of your code to start the animation, and when you override the OnFreeResources method, you need to call the base's version first to stop the animation (see the example in the attached code).

    Mixed mode assembly is built against version 'v2.0.50727'

    The attached code is compiled against .NET 4.0 (though it could probably be retargeted to work under .NET 2.0), but the Code Pack is compiled against .NET 2.0. When I first referenced the Code Pack and tried running the application, the above exception kept getting raised. The solution, found here, is to include an app.config file in the project with the following startup information:

    <?xml version="1.0"?>
    <configuration>
      <startup useLegacyV2RuntimeActivationPolicy="true">
        <supportedRuntime version="v4.0"/>
      </startup>
    </configuration>

    Limitations

    Direct2D will work over remote desktop; however (as far as I can tell), the D3DImage control is not rendered. Unfortunately, I only have a Home Premium version of Windows 7, so cannot test any workarounds, but would welcome feedback in the comments.

    The code written will work with targeting either x86 or x64 platforms (or even using the Any CPU setting); however, you'll need to use the correct version of Microsoft.WindowsAPICodePack.DirectX.dll; I couldn't find a way of making this automatic, and I don't think the Code Pack can be compiled to use Any CPU as it uses unmanaged code.

    The timer used in the AnimatableScene is a DispatcherTimer. MSDN states:

    [The DispatcherTimer is] not guaranteed to execute exactly when the time interval occurs [...]. This is because DispatcherTimer operations are placed on the Dispatcher queue like other operations. When the DispatcherTimer operation executes is dependent on the other jobs in the queue and their priorities.


    History

    • 02/11/10 - Direct2DControl has been changed to use a DispatcherTimer so that it doesn't contain any controls needing to be disposed of (makes FxCop a little happier), and the control is now synchronized with WPF's CompositionTarget.Rendering event to reduce flickering. Scene has been changed to include an Updated event and to give derived classes access to its D2DFactory. Also, the AnimatableScene class has been added.
    • 21/09/10 - Initial version.


    This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


    Hex Grids and Hex Coordinate Systems in Windows: Drawing and Printing


    By | 29 Apr 2012 | Article
    A library (DLL) for the generation of hexagon grids (or "tessellations"), and for the management of the resultant coordinate systems.


    Figure 1: "Hexplane.exe", a demo in which the user flies through 3D space over a hex-based terrain


    Figure 2: Spinning cube demo with four dynamic hex tessellation surfaces ("Hex3d.exe")


    Whether called a hex grid, a honeycomb, or a tessellation (which is the mathematical term for a space-filling pattern), groupings of hexagons like the ones shown above are a useful way to fill up two-dimensional space. They provide for a more interesting visual effect than tessellations of 3- or 4-sided polygons; and unlike tessellations composed of several different polygons, hex tessellations create a regular grid with a consistent coordinate system, a fact which has been exploited by all sorts of computer games, and also by many board games.

    The work at hand describes how to use a library created by the author to draw a wide variety of hex tessellations. This is accomplished using selected calls into the GDI and GDI+ APIs made by the library. The library is named "hexdll.dll". This somewhat repetitive name makes more sense in the context of its codebase, where it is found alongside "hexdll.cpp", "hexdll.h", "hex3d.cpp", and so on.

    The development of the programs provided with this article presented some performance challenges. In one demonstration, a hexagon grid is drawn onto a Direct3D surface. The surface is dynamic; with each frame, a slightly different hex tessellation is drawn, creating an interesting flicker effect. Care was taken to ensure that the many calls into "hexdll.dll" necessitated by this application did not result in decreased frame rate. This requires "hexdll.dll" itself to be capable of operating quickly, and also presents potential interface issues between GDI and Direct3D, which are discussed more extensively further down in the article.

    In another of the demo programs, a large hex grid is redrawn in its entirety with each Resize event. Again, if done incorrectly, this action will introduce a very noticeable lag.

    These high-performance applications are both enabled by a single common design decision: GDI+ is avoided, in favor of GDI, unless the caller specifically requests anti-aliasing. GDI does not offer this capability, so GDI+ must be used for anti-aliased drawing. However, for drawing operations not involving anti-aliasing, GDI is significantly faster than GDI+. This is an interesting result which is discussed at some length below. Here, suffice it to say that the OOP-style interface exposed by GDI+ comes at a cost, and at a cost which is in some cases dramatic. This article is thus a reminder of the high performance potential of non-OO procedural and structured programming techniques.


    The "device context" (DC) is ubiquitous in Windows programming. The Direct3D surface interface used here (IDirect3DSurface9 [1]), for example, exposes a GetDC() method. Most Windows controls (be they MFC, Win32, or .NET-based) expose an HWND, which can be converted into a DC using a single Windows API call. Each of these entities is ultimately just a different kind of 2D Windows surface, and "hexdll.dll" can draw hex tessellations on all of them, using the DC as a common medium. Many of these operations are demonstrated in the code base provided, and discussed in this article.

    The author's DLL is designed for hassle-free use from native or .NET programs [2]; the source code provided contains complete examples of both types of clients. The main ".cs" file for the .NET demo is only 87 lines long. None of the C++ demos use more than 450 lines of code, despite their high feature content. The "hex2d.cpp" printer-friendly demo uses only 236 lines of code.

    The next section of the article deals with the creation of client apps that use the DLL. For simplicity, a Visual Studio 2010 / C# client app ("hexdotnet.sln" / "hexdotnet.exe") is shown first. The folder tree for this C# application is present in the "hexdotnet" subfolder of the provided source code archive.

    After the presentation of the .NET client, the text below continues with the presentation of a variety of C++ client programs, and then concludes with a discussion of the internal implementation of "hexdll.dll". The code for the library is written in C++, and built using the MinGW compiler.

    The client programs provided were developed using Visual Studio for the .NET app and MinGW for the C++ apps and for the DLL itself. In constructing the C++ development tool chain, this article relies heavily on techniques described in two previous articles by the same author, GDI+ Programming With MinGW and Direct3D Programming with MinGW. The text below attempts to be self-contained, but it does make reference to these predecessor articles, as necessary, and they do provide more detailed explanations of some of the background topics for this article. There are some minor differences between the instructions given in this article and those given in its predecessors, due to changes in MinGW. These are discussed in the section titled "Building", further below.

    API Selection

    The author faced a choice between several 2D drawing APIs in developing the programs described in this article. The APIs considered were GDI, GDI+, DirectDraw, and Direct2D. Of these, Direct2D is the newest and likely the fastest-running alternative. Unfortunately, MinGW does not support it, at least not as downloaded. DirectDraw is, like Direct2D, a component of DirectX, but it is a deprecated one.

    Of course, it would be difficult to integrate either of these DirectX-based technologies into typical (i.e., raster-based) Windows applications as seamlessly as was done for the GDI/GDI+ implementation present in "hexdll.dll". Two main advantages of the approach selected are therefore its generality and its lack of bothersome architectural requirements.

    Using the Code

    One simple way to develop a client application that uses "hexdll.dll" is to use the .NET System.Windows.Forms namespace to create a control onto which the DLL can draw. Any C# application can access the functions exposed by "hexdll.dll". The first step is to insert the declarations shown below into an application class:

    [DllImport("hexdll.dll", CallingConvention = CallingConvention.Cdecl)]
    static extern void hexdllstart();
    [DllImport("hexdll.dll", CallingConvention = CallingConvention.Cdecl)]
    static extern void hexdllend();
    [DllImport("hexdll.dll", CallingConvention = CallingConvention.Cdecl)]
    static extern void systemhex
    (
      IntPtr hdc,  //DC we are drawing upon
      Int32 origx, //Top left corner of (0,0) hex in system - X
      Int32 origy, //Top left corner of (0,0) hex in system - Y
      Int32 magn,  //One-half hex width; also, length of each hex side
      Int32 r,     //Color of hex - R
      Int32 g,     //Color of hex - G
      Int32 b,     //Color of hex - B
      Int32 coordx,//Which hex in the system is being drawn? - X
      Int32 coordy,//Which hex in the system is being drawn? - Y
      Int32 penr,  //Outline (pen) color - R
      Int32 peng,  //Outline (pen) color - G
      Int32 penb,  //Outline (pen) color - B
      Int32 anti   //Anti-alias? (0 means "no")
    );

    In the C# application provided, these declarations are inserted directly into the Form1 class, very near the top of "Form1.cs". The systemhex(), hexdllstart(), and hexdllend() functions are therefore accessible as static methods of the Form1 class.

    At runtime, "hexdll.dll" must be present in the same folder as the .NET executable ("hexdotnet.exe" here), or at least present in the Windows search path, for this technique to work.

    In the declarations shown above, note that the Cdecl calling convention is used, as opposed to the default Stdcall option. Programmers uninterested in this distinction can simply copy the declarations as shown above without giving this any thought. For those interested in detail, the author found that using Stdcall in the DLL implementation code caused MinGW to engage in some undesirable name mangling. The DLL function names ended up looking like hexdllstart@0.

    The Stdcall convention appends this suffix to record the number of bytes of arguments the function expects, since under Stdcall the callee removes its own arguments from the stack; under Cdecl the caller cleans up the stack, so no such suffix is needed. It is worth noting, too, that C++ class methods are subject to a more elaborate form of name mangling, which encodes class names and parameter types; the library presented here thus makes no attempt to expose an OO interface.

    Calls to functions hexdllstart() and hexdllend() must bracket any use of "hexdll.dll" for drawing. These functions exist basically to call Microsoft's GdiplusStartup and GdiplusShutdown API functions, at app startup / shutdown. This design is in keeping with Microsoft's guidelines for the construction of DLLs that use GDI+.
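    In a C++ client, one way to guarantee that the start and end calls always bracket the drawing code is a small RAII guard. The sketch below is an assumption-laden illustration, not part of "hexdll.dll": ScopedBracket is a hypothetical helper that takes any pair of parameterless functions, calls the first on construction, and calls the second on destruction.

```cpp
#include <cassert>

// Hypothetical RAII helper: runs start() now and end() when the guard
// goes out of scope, even if the scope is left by an exception.
class ScopedBracket {
public:
    using Fn = void (*)();

    ScopedBracket(Fn start, Fn end) : end_(end) {
        assert(start != nullptr && end != nullptr);
        start();
    }

    ~ScopedBracket() { end_(); }

    // Copying would cause end() to run more than once.
    ScopedBracket(const ScopedBracket&) = delete;
    ScopedBracket& operator=(const ScopedBracket&) = delete;

private:
    Fn end_;
};
```

    A client that includes "hexdll.h" could then write ScopedBracket gdiplus(hexdllstart, hexdllend); near the top of its main function, assuming the compiler accepts the imported functions as plain function pointers of this type.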

    The actual hex-drawing code in each client app consists of call(s) to systemhex(). In this identifier, the word "system" refers not to some sort of low-level privilege, but to the system of coordinates created by a hex tessellation. Any such tessellation has a hexagon designated (0,0), at its top / left corner. Unless the tessellation is exceedingly narrow, there is a (1,0) hexagon to its right. Unless the tessellation is very short, there is a (0,1) hexagon beneath the (0,0) hexagon.

    The figure below shows an example hex tessellation with several of its constituent hexagons labeled with their (X,Y) coordinates, as defined in the system used by "hexdll.dll". In this figure, many arbitrary but necessary decisions made by the author are evident. The placement of the origin at the top left is an obvious example. More subtly, note that the entire grid is oriented such that vertical columns of hexagons can be identified (e.g., the column of hexagons with "X" coordinate 0). The grid could be rotated 90 degrees such that these formed rows instead, but this is not the orientation used here. Finally, note that hexagons at odd "X" coordinates, by convention, are located slightly lower in the "Y" dimension than those at even "X" coordinates. This is another one of these arbitrary decisions made by the author, each of which will potentially impact the implementation of any application that uses "hexdll.dll".


    Figure 3: Coordinate system used by "hexdll.dll"

    Returning to the declaration of systemhex(), the coordx and coordy parameters to this function define the coordinate of the single hexagon drawn by each call to systemhex(). This (X,Y) point defines the entire hexagon in terms of a coordinate system like the one shown in the figure above. The specifics of this coordinate system are passed in parameters origx, origy, and magn. The origx and origy parameters, taken together, define where the leftmost vertex of the hexagon (0,0) is located. These coordinates are expressed in pixels, relative to coordinate (0,0) of the surface onto which the hexagon is being drawn.

    The magn parameter defines the size of each hexagon. Each hexagon is 2.0 * magn pixels wide. Each hexagon's height is slightly less than that, at approximately 1.7321 times magn. (This is 2.0 * sin(60°) * magn.)
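    To make the geometry concrete, the C++ sketch below computes the bounding box of the hexagon at a given system coordinate. It is not taken from "hexdll.dll". It assumes that origx and origy give the top left corner of the bounding box of hexagon (0,0), that adjacent columns are spaced 1.5 * magn pixels apart, and that odd columns are shifted down by half a hexagon height. HexBox and hex_bounds are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: bounding box of one hexagon, in pixels.
struct HexBox {
    double left;
    double top;
    double width;
    double height;
};

// Assumed layout: vertical columns of hexagons, each 2 * magn wide and
// 2 * sin(60 degrees) * magn tall, with odd columns shifted down by half
// a hexagon height so that neighbouring columns interlock.
HexBox hex_bounds(int origx, int origy, int magn, int coordx, int coordy) {
    assert(magn > 0);
    const double kPi = 3.14159265358979323846;
    const double width = 2.0 * magn;
    const double height = 2.0 * std::sin(60.0 * kPi / 180.0) * magn;
    const double left = origx + 1.5 * magn * coordx;
    const double shift = (coordx % 2 != 0) ? height / 2.0 : 0.0;
    const double top = origy + height * coordy + shift;
    return HexBox{ left, top, width, height };
}
```

    With magn set to 10, each hexagon is 20 pixels wide and about 17.32 pixels tall, which matches the 1.7321 factor quoted above.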

    Two RGB color triads are passed to systemhex(): parameters r, g, and b define the interior color of the hexagon, while penr, peng, and penb define the color of its single-pixel outline. Each of these parameters can range from 0 to 255.

    Finally, the IntPtr parameter to systemhex() is a HANDLE to the DC to be drawn upon. In the .NET example client provided, this is obtained by taking the Handle property of a Panel control created for this purpose, and passing it to the Win32 GetDC() function. This function is brought into the .NET program using a DllImport declaration very similar to the three already shown, along with the corresponding cleanup function ReleaseDC():

    [DllImport("user32.dll")]
    static extern IntPtr GetDC(IntPtr hWnd);
    [DllImport("user32.dll")]
    static extern bool ReleaseDC(IntPtr hWnd, IntPtr hDC);

    In the .NET example program, the MakeHex() method of Form1 does the actual drawing. It is deliberately ill-behaved, redrawing an 80 x 80 hex coordinate system in its entirety. Because MakeHex() gets called one time for every Resize event, this presents severe performance issues unless each call to systemhex() executes with sufficient speed. The code for MakeHex() is shown in its entirety below:

    private void MakeHex()
    {
      IntPtr h = GetDC(this.panel1.Handle);
      //Not efficient. Good for testing.
      for (int row = 0; row < 80; ++row)
       for (int col = 0; col < 80; ++col)
        systemhex(h, 30, 30, 10, 255, 255, 255, row, col, 255, 0, 0, 0);
      ReleaseDC(this.panel1.Handle, h);
    }

    Above, note that each hex drawn is part of a system having its (0,0) hex at raster coordinate (30,30). This is measured in pixels from the top left corner of panel1, which is configured to fill the entire client area of Form1. Each hex is 20 pixels wide (twice the magn parameter of 10). The hexagons are white (red=255, green=255, blue=255), with a bright red outline. A full 6,400 hexagons are drawn with each call to MakeHex(): an 80 x 80 grid, at system coordinates (0,0) through (79,79). The result of this process is shown below; note that the window is not sufficiently large at this point in time to show the entire 80 x 80 grid:


    Figure 4: "Hexdotnet.exe" at startup (not anti-aliased)

    As the code exists in the download provided, the final parameter to systemhex(), named anti, is set to 0. This disables anti-aliasing and allows for GDI (as opposed to GDI+) to be used, which is key to obtaining good Resize performance. The tradeoff is a somewhat jagged rendering, as evident in the picture above.

    If anti is set to a non-zero value, and the .NET example client is recompiled, then a significant performance lag will be perceptible when resizing Form1. In the author's test, a simple maximize operation performed immediately after app startup took about 2 seconds with anti-aliasing enabled.

    Significantly, GDI's performance advantage was present even when compared to GDI+ running without anti-aliasing enabled (i.e., with SmoothingModeHighSpeed in effect). If OVERRIDE_C_GDI is defined when "hexdll.cpp" is built, GDI+ will be used for all calls. The resultant performance lag is, again, quite perceptible, and the author provides this option only for performance testing.


    The Build Script

    The C# demonstration described in the last section can be built and executed by simply opening "hexdotnet.sln" and pressing F5. A pre-built copy of the DLL is included along with its source code.

    The DLL can be rebuilt, though, along with all of the C++ demonstration programs, using the build script "make.bat". This batch file also copies "hexdll.dll" to the requisite locations under the "hexdotnet" folder tree.

    The script "clean.bat" is also provided; it removes all traces of the build process, except for the pre-built version of the DLL included with the .NET solution. Both scripts are intended for execution from a Command Prompt window, not directly from Explorer or Task Manager. Before attempting to run "make.bat", it is necessary to include the MinGW binary path in the environment PATH, e.g.:


    Figure 5: C++ Build Steps

    A batch file that sets PATH properly is also provided in the source code archive. It is named "envvars.bat". This can be run instead of the set command shown above.

    The build script itself consists mostly of calls to g++. The commands that compile "hex3d.cpp" and "hexplane.cpp" rely on build commands that are very similar to those shown in Direct3D Programming with MinGW. The commands that build "hexdll.dll" itself rely heavily on the MinGW / GDI+ build instructions given in GDI+ Programming With MinGW, and also on some detailed instructions for DLL construction given by the developers of MinGW.

    In all of the "g++" commands in "make.bat", the -w option is used to disable warnings. In the versions of MinGW used by the author, this either had no effect, i.e., there were no warnings even without -w, or, if there were warnings, they came from Microsoft's DirectX source files.

    MinGW Version Differences

    The author used the November, 2011 release of MinGW during the final development of the code supplied, with the MinGW Developer Toolkit selected for inclusion during installation. Slightly different techniques were necessary with earlier versions of MinGW.

    For example, GDI+ headers are now included in the distribution, and do not have to be obtained from elsewhere. These headers are in a "gdiplus" subfolder, though, which must be considered in constructing one's #include directives.

    Also, it used to be possible to run the MinGW compiler without including "c:\mingw\bin" (or equivalent) in the search path. In the latest versions of MinGW, this will result in missing dependency errors when attempting to use "g++.exe".

    Some of these earlier techniques were used in GDI+ Programming With MinGW and Direct3D Programming with MinGW, and the instructions given in these articles remain valid when the compiler actually recommended in those specific articles is used.

    C++ Demonstrations

    At a high level, the steps necessary to create a client application for "hexdll.dll" are the same in both C# and C++. In both cases, the DLL itself must be present alongside the client EXE at runtime (or, at least, in its search path). Similarly, in both C# and C++ clients, a sort of function prototype or stub declaration is inserted into the client code base, to represent the DLL functions. Once these preconditions are met, the DLL functions can be called exactly like the functions (in the case of C++) or methods (in C#) implemented in the client code.

    In the C++ client code written here, these declarations are brought into the code base from a header file, "hexdll.h", using #include. This is a very typical way for C++ programs to share declarations, and to, in this way, expose interfaces to each other. The C++ declarations comprising the external interface of "hexdll.dll" are shown below. This is the core of "hexdll.h":

    void HEXDLL systemhex(HDC hdc,int origx,int origy,int magn,int r,
          int g,int b,int coordx,int coordy,int penr,int peng,int penb,BOOL anti);
    void HEXDLL hexdllstart();
    void HEXDLL hexdllend();

    These declarations are analogous to the C# declarations of systemhex(), hexdllstart(), and hexdllend(), shown earlier. The HEXDLL macro evaluates, when included in a client application, to __declspec(dllimport), a Windows-specific modifier for functions imported from a DLL. During the DLL build, HEXDLL evaluates to __declspec(dllexport); this is all managed using the preprocessor macro BUILDING_HEXDLL.

    When included by a C++ compilation unit, the declarations shown above get wrapped in an extern "C" block. This action is based on the fact that __cplusplus is defined. The extern "C" block gives the functions C linkage, so their names are not mangled even in C++ programs; the calling convention remains Cdecl, which is the compiler's default. Finally, all of this code is bracketed by an #ifdef directive designed to keep these declarations from being repeated if the header is included more than once. Of course, the author of the client application needs only to #include the header file and call its functions.

    In neither (C++ / .NET) case does the client application code need to make any direct reference to the GDI+ libraries. Rather, they are included indirectly, as a part of "hexdll.dll".

    Spinning Cube Demo

    Three C++ example programs are provided with the article. First, "hex3d.exe" is a variation on the spinning cube demo shown in Direct3D Programming with MinGW. This is the application shown earlier in Figure 2. It is built from a single source code file, "hex3d.cpp". In this program, a static texture is not used for the cube surfaces. Instead, with each iteration of the rendering loop, a DC is obtained for the texture's main appearance surface, and is drawn on using systemhex(). Random shades of red are used for each hexagon, resulting in an appealing frame-by-frame flicker effect. The application exits after a certain number of frames have been rendered (by default, a thousand frames). This allows for easy determination of frame rate, by timing the demo's execution time.

    The code to get a DC from a texture surface is a new addition to "hex3d.cpp", compared to its spinning cube predecessor. This task is performed by the function do2dwork, shown below this paragraph. This function is called with each iteration of the main loop, prior to the call to render().

    void do2dwork()
    {
     IDirect3DSurface9* surface=NULL;
     hexgridtexture->GetSurfaceLevel(0, &surface);
     HDC hdc;
     surface->GetDC(&hdc);
     for(int hexcx=0;hexcx<TESSEL_ROWS;++hexcx)
      for(int hexcy=0;hexcy<TESSEL_ROWS;++hexcy) 
      {
       //Slight flicker to red in hexagons
       //Red values range from MIN_HEX_RED to 255
       int red=(rand()%(256-MIN_HEX_RED))+MIN_HEX_RED;     
       //systemhex() is called here to draw hex (hexcx,hexcy) using this red value
      }
     surface->ReleaseDC(hdc);
     surface->Release();
    }

    The first four lines of code in the function body above serve to get the necessary DC handle for systemhex(). The loop immediately after that is very similar in its implementation to the C# loop from MakeHex(). The color randomization code in the loop body is new, but straightforward. As is typical of C++ compared to C#, the final two statements above clean up resources.

    Like "hexdotnet.exe", "hex3d.exe" expects "hexdll.dll" to be present in the current search path at runtime. In addition, it requires the file "hex3d.png" to be present. This contains appearance information for the static texture applied to the top and bottom of the demo solid.


    This demonstration program creates an illusion of flight in 3D space, above a flat terrain covered by a hex tessellation. The program is built from a single source code file, "hexplane.cpp". It is shown in action in Figure 1, near the top of the article. In this demo, flight takes place in the positive "Z" direction (forward), with rotation about the "Z" axis occurring throughout the flight. Movement continues at an accelerating (but limited) rate until shortly after the terrain below passes out of view. At that point, the demo restarts. The sky is simulated by applying a horizon image to a rectangular solid off in the distance. Like the spinning cube demo, "hexplane.exe" exits after a set number of frames have been rendered.

    In many ways, this demo is a simplification of the spinning cube demo. Only two rectangular faces must be drawn, versus six in the spinning cube demo. The declaration of the eight vertices required to draw these two rectangular faces is shown below:

    // These are our vertex declarations, for both of the rectangular faces
    //  being drawn.
    MYVERTEXTYPE demo_vertices[] =
    {
     { -SKY_SIZE,  SKY_SIZE, SKY_DISTANCE, 0, 0, -1, 0, 0 },         // Sky face  
     {  SKY_SIZE,  SKY_SIZE, SKY_DISTANCE, 0, 0, -1, 1, 0 },
     { -SKY_SIZE, -SKY_SIZE, SKY_DISTANCE, 0, 0, -1, 0, 1 },
     {  SKY_SIZE, -SKY_SIZE, SKY_DISTANCE, 0, 0, -1, 1, 1 },
     { -GROUND_SIZE, -GROUND_DEPTH,  GROUND_SIZE, 0, 1, 0, 0, 0 },    // Ground face
     {  GROUND_SIZE, -GROUND_DEPTH,  GROUND_SIZE, 0, 1, 0, 1, 0 },
     { -GROUND_SIZE, -GROUND_DEPTH, -GROUND_SIZE, 0, 1, 0, 0, 1 },
     {  GROUND_SIZE, -GROUND_DEPTH, -GROUND_SIZE, 0, 1, 0, 1, 1 },
    };

    This declaration consists of eight distinct vertex structures, each occupying its own line in the code. Each of these begins with "X", "Y", and "Z" coordinates. These coordinates are defined using preprocessor constants that hint at their purposes. More details about the actual design of 3D solids is available in Direct3D Programming with MinGW; the ground face is roughly analogous to the top face of the original spinning cube, and the sky face is analogous to its front.

    The remainder of the initializers are explained by the declaration of MYVERTEXTYPE, the custom vertex struct used by both of the Direct3D demo programs presented here. This declaration is shown below:


    Note that immediately after the coordinates comes the normal vector, followed by 2D point (U,V). The normal vector extends outward into space from the solid, and is perpendicular to the face; this is necessary for lighting purposes. For the ground face, the normal vectors are <0,1,0>, i.e., a vector sticking straight up in the "Y" dimension. For the sky face, the normal vectors point at the user, i.e., in the negative "Z" direction. They thus have a value of <0,0,-1>.
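    A vertex layout matching this description might look like the following sketch. The struct name and field names are placeholders rather than the demo's actual MYVERTEXTYPE declaration; only the ordering (position, normal, texture coordinates) comes from the text, and the coordinate values are arbitrary.

```cpp
// Illustrative vertex layout: position, then normal, then texture coordinates.
// The names are assumptions; only the field order follows the article.
struct SketchVertex {
    float x, y, z;     // position in 3D space
    float nx, ny, nz;  // normal vector, perpendicular to the face
    float u, v;        // texture coordinates
};

// A ground-face style vertex: the normal <0,1,0> points straight up.
const SketchVertex groundCorner = { -1.0f, -1.0f, 1.0f,  0.0f, 1.0f, 0.0f,  0.0f, 0.0f };

// A sky-face style vertex: the normal <0,0,-1> points back toward the viewer.
const SketchVertex skyCorner = { -1.0f, 1.0f, 1.0f,  0.0f, 0.0f, -1.0f,  0.0f, 0.0f };
```

    In Direct3D 9's flexible vertex format, a layout in this order would correspond to D3DFVF_XYZ | D3DFVF_NORMAL | D3DFVF_TEX1, although whether the demos declare their format that way is not established here.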

    Point (U,V) maps the vertex to a point on the 2D surface of whatever texture is applied to it. The texture's 2D coordinate system has "U" increasing from left to right, and "V" increasing from top to bottom. Because both rectangular faces are defined as triangle strips, a criss-cross pattern is evident in (U,V), as well as in the "X", "Y", and "Z" coordinates themselves; the vertices do not go around the rectangle from vertex 0, to 1, to 2, to 3; rather, they cross over the rectangle in diagonal fashion between vertex 1 and vertex 2. This is consistent with Direct3D's general expectation that solids be composed of triangular facets.
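    The criss-cross ordering follows from how a triangle strip is expanded into triangles. The sketch below lists the vertex indices used by each triangle in a strip; stripTriangles is a hypothetical helper written for illustration, not part of the demo code.

```cpp
#include <array>
#include <vector>

// Expands a triangle strip of `vertexCount` vertices into index triples.
// Triangle i uses vertices i, i+1, and i+2, so a four-vertex strip yields
// two triangles that share the diagonal edge between vertex 1 and vertex 2.
std::vector<std::array<int, 3>> stripTriangles(int vertexCount) {
    std::vector<std::array<int, 3>> triangles;
    for (int i = 0; i + 2 < vertexCount; ++i) {
        triangles.push_back(std::array<int, 3>{{i, i + 1, i + 2}});
    }
    return triangles;
}
```

    For each four-vertex face, this yields the triangles (0,1,2) and (1,2,3), which is why vertices 1 and 2 must sit on opposite corners of the rectangle.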

    Both textures used have a static appearance. As a result, anti is set to 1; because the hex tessellation is drawn just once, there is no real performance penalty associated with this improvement. There is still a function do2dwork(), as was seen in "hex3d.cpp", but it is called only once, before the first frame is rendered, to set up the static texture appearance. The code for this function is shown below:

    void do2dwork()
     IDirect3DSurface9* surface=NULL;
     HDC hdc;
     planetexture->GetSurfaceLevel(0, &surface);
     for(int hexcx=0;hexcx<TESSEL_ROWS;++hexcx)
      for(int hexcy=0;hexcy<TESSEL_ROWS;++hexcy)
        case 0:
         //255 means full color for R, G or B
        case 1:
        case 2:
        case 3:  

    As in "hex3d.cpp", the function begins by obtaining a handle to a DC for the surface's appearance. Again, a tessellation of fixed size is drawn. Here, the randomization component is different; either a red, green, or blue hex can be drawn, or no hex at all can be drawn for a given system coordinate. This allows a default appearance, dictated by file "hexplane.png", to show through. This default appearance is loaded from "hexplane.png" earlier in the startup sequence using a call to D3DXLoadSurfaceFromFile. Preprocessor constants TESSEL_ORIG_X, TESSEL_ORIG_Y, and TESSEL_MAGNITUDE define the coordinate system used for the hex terrain; these were tuned to yield hexagons of an acceptable size, and to achieve full coverage of the ground surface. In particular, slightly negative values are used for TESSEL_ORIG_X and TESSEL_ORIG_Y, to avoid leaving unfilled space around the top and left edges of the tessellation.
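    The four-way random choice described above can be sketched as follows. The HexFill type, the function names, and the assignment of particular case numbers to particular colors are all assumptions made for illustration; only the set of outcomes (red, green, blue, or nothing) comes from the text.

```cpp
#include <cstdlib>

// Result of the random choice; `draw` is false when the cell is skipped
// so that the underlying texture shows through.
struct HexFill {
    bool draw;
    int r, g, b;
};

// Maps a selector in [0, 3] to red, green, blue, or "no hexagon".
// Which case maps to which outcome is an assumption of this sketch.
HexFill fillForSelector(int selector) {
    switch (selector) {
        case 0:  return {true, 255, 0, 0};   // 255 means full color for R, G or B
        case 1:  return {true, 0, 255, 0};
        case 2:  return {true, 0, 0, 255};
        default: return {false, 0, 0, 0};    // leave the background visible
    }
}

// Picks one of the four outcomes with roughly equal probability.
HexFill randomFill() {
    return fillForSelector(std::rand() % 4);
}
```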


    This demo creates a high-resolution, 8.5" x 11.0" bitmap file. The executable shows a modal message box with the message "DONE!" after it has finished creating the output bitmap file. The bitmap is completely covered by a black and white hex tessellation, drawn with anti-aliasing enabled. If printed, the result could be useful for a board or pen-and-paper game built around a hexagonal coordinate system. This program is built from source code file "hex2d.cpp".

    Unlike the other two C++ demos, DirectX is not used here. Rather, GDI and Win32 API calls only are used, in conjunction with calls into "hexdll.dll", to achieve the desired result. Specifically, a starting BITMAP is created. A DC is then obtained, for drawing onto this BITMAP. This DC is passed to systemhex() repeatedly, in a nested for loop, to draw the hex tessellation. Finally, the resultant appearance data after drawing must be written from memory out to a properly formatted bitmap file. This last step in particular requires a significant amount of new low-level code compared to the two 3D C++ demos.

    The series of steps outlined in the last paragraph are mostly executed directly from main(). After declaring some local variables, main() begins as shown below:

    //Delete "out.bmp"
    synchexec("cmd.exe","/c del out.bmp");
    //Make blank "temp.bmp"
    synchexec("cmd.exe","/c copy blank.bmp temp.bmp");
    //Modify TEMP.BMP...
    hbm = (HBITMAP) LoadImage(NULL, "temp.bmp", IMAGE_BITMAP, 0, 0,
    if(hbm==NULL) //Error
        MessageBox(0,"BITMAP ERROR","Hex2D.exe",
        return 1;

    The two calls to synchexec() (a wrapper for ShellExecuteEx()) serve to delete "out.bmp", which is the program output file, and then to create working bitmap file "temp.bmp". Note that this working file is a copy of "blank.bmp", which is a plain white bitmap having 16-bit color depth (like the output bitmap). In the application code as provided, this is just a starting point, which is completely overwritten using systemhex calls.

    The main() function continues as shown below:

    static BITMAP bm;
    GetObject(
      (HGDIOBJ)hbm,     // handle to graphics object of interest
      sizeof(BITMAP),   // size of buffer for object information
      (LPVOID)&bm   // pointer to buffer for object information
    );

    This code snippet takes the HBITMAP value hbm, which is a pointer-like identifier for a structure held by Windows, and converts it into a BITMAP object proper, present in the static storage of "hex2d.cpp". Getting the actual BITMAP structure (vs. an HBITMAP) is useful as a way to access properties such as width and height using the "dot" operator. Variable bigbuff, which is declared with a static size equal to the known memory requirements of the high-resolution bitmap, holds the local copy of the BITMAP appearance information.
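    The memory requirement of an uncompressed bitmap can be estimated from its dimensions and color depth, remembering that each scan line of a Windows bitmap is padded to a multiple of four bytes. The helper names below are hypothetical, and the sample figures in the note that follows assume a 1632 x 2112 pixel image (8.5" x 11.0" at 192 DPI) at 16 bits per pixel.

```cpp
#include <cstddef>

// Bytes per scan line of an uncompressed Windows bitmap: each row is
// padded up to a multiple of 4 bytes (32 bits).
std::size_t dibStrideBytes(std::size_t widthPixels, std::size_t bitsPerPixel) {
    return ((widthPixels * bitsPerPixel + 31) / 32) * 4;
}

// Total bytes of pixel data for a bitmap of the given size and depth.
std::size_t dibImageBytes(std::size_t widthPixels, std::size_t heightPixels,
                          std::size_t bitsPerPixel) {
    return dibStrideBytes(widthPixels, bitsPerPixel) * heightPixels;
}
```

    Under those assumptions, dibImageBytes(1632, 2112, 16) comes to 6,893,568 bytes of pixel data.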

    Next, main() continues with the code shown below:


    The series of calls shown above first create a new and independent DC, as opposed to one obtained for a control or window. The DC created is compatible with the desktop (HWND zero), since there is no app main window DC to pass instead. Then, the code associates this DC, and the drawing about to happen, with hbm. Now, with this relationship established, the actual hexagon drawing can take place, with the newly created DC passed as the first parameter to systemhex():

    for(int ccx=0;ccx<HEX_GRID_COLS;++ccx)
      for(int ccy=0;ccy<HEX_GRID_ROWS ;++ccy)
       systemhex( hdc,
        1 );

    This code fragment is very reminiscent of the earlier demos. Note that the last parameter is 1, indicating that anti-aliasing is enabled. All of the other parameters are constants which, as before, were tweaked by the author, based on observation, to yield complete coverage of the target surface.

    The remainder of main() writes out the image identified by hbm to a ".bmp" file. This is a somewhat tedious process, which is already well-summarized elsewhere online. One noteworthy addition made for this application is that DPI is explicitly set to 192, using the bit of code presented below. Note that the actual setting involves the somewhat more obscure terminology "pels per meter". Application constant IMG_PELS_PER_METER contains the correct value of 7,560 pels per meter:

    lpbi->bmiHeader.biYPelsPerMeter = IMG_PELS_PER_METER;
    lpbi->bmiHeader.biXPelsPerMeter = IMG_PELS_PER_METER;

    Several online sources simply set these values to 0. The author wished for a high-resolution, printable image of the correct 8.5" x 11.0" size, though, so setting DPI (or "pels per meter") correctly was deemed necessary.
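    The relationship between DPI, "pels per meter", and the pixel dimensions of the page is simple arithmetic. The sketch below uses hypothetical helper names; the only fixed fact it relies on is that one inch is exactly 0.0254 meters.

```cpp
#include <cmath>

// One inch is exactly 0.0254 meters.
const double METERS_PER_INCH = 0.0254;

// Converts dots per inch to pixels ("pels") per meter, rounded to nearest.
long pelsPerMeterFromDpi(double dpi) {
    return std::lround(dpi / METERS_PER_INCH);
}

// Converts a physical length in inches to a pixel count at the given DPI.
long pixelsForInches(double inches, double dpi) {
    return std::lround(inches * dpi);
}
```

    At 192 DPI, an 8.5" x 11.0" page works out to 1632 x 2112 pixels. The exact conversion gives 7,559 pels per meter, one less than the 7,560 quoted above; the difference is on the order of 0.01% and should have no visible effect on printed size.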

    Library Implementation

    Many of the calculations required to draw hexagons will involve real numbers. In order to maximize the accuracy of these computations, and to minimize the number of typecast operations necessary, systemhex begins by converting all of its pixel parameters into doubles, and passing them to an inner implementation function:

    void systemhex(HDC hdc,int origx,int origy,int magn,int r,
         int g,int b,int coordx,int coordy,int pr,int pg,int pb,BOOL anti)
    {
     innerhex(hdc,(double)origx,(double)origy,(double)magn,r,
          g, b,coordx,coordy,pr,pg, pb,anti);
    }

    This inner function translates coordx and coordy (hex system coordinates) into actual screen coordinates. In doing so, it largely just enforces the arbitrary decisions made by the author in designing the coordinate system. Its if, for example, ensures that hexagons at odd "X" coordinates are located slightly lower in the "Y" dimension than those at even "X" coordinates, as is the stated convention of "hexdll.dll":

    void innerhex(HDC hdc,double origx,double origy,double magn,int r,
         int g,int b,int x,int y,int pr,int pg,int pb,BOOL anti)
     //Odd X translates drawing up and left a bit 
      abstracthex( hdc, 
       magn, r, g, b,pr,pg,pb,anti); 
      abstracthex( hdc, 
       magn, r, g, b,pr,pg,pb,anti); 

    As shown above, the bottom-level function responsible for drawing hexagons in the internal implementation of "hexdll.dll" is another function called abstracthex(). This lower level function operates in terms of system coordinates (as opposed to the hexagon coordinates shown in Figure 3). The prototype of abstracthex() is shown below:

    void abstracthex(HDC hdc,double origx,double origy,double magn,
         int r,int g,int b,int pr,int pg,int pb,BOOL anti)
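    Stepping back to innerhex() for a moment, the translation it performs from hex coordinates to screen coordinates can be illustrated with the sketch below. The spacing formulas, the function name hexToScreen, and the ScreenPoint type are assumptions chosen to give a gap-free tessellation of flat-topped hexagons with odd columns placed lower; they are not taken from the source of "hexdll.dll".

```cpp
#include <cmath>

struct ScreenPoint {
    double x;
    double y;
};

// Leftmost-vertex position of the hexagon at hex coordinates (coordx, coordy)
// for a flat-topped grid whose sides have length magn. Assumed layout:
// columns are 1.5*magn apart, rows are 2*sin(60 deg)*magn apart, and odd
// columns are shifted down by half a row (screen "Y" grows downward).
ScreenPoint hexToScreen(double origx, double origy, double magn,
                        int coordx, int coordy) {
    const double sinHex = std::sqrt(3.0) / 2.0;  // sin(60 degrees)
    const double cosHex = 0.5;                   // cos(60 degrees)
    const double columnStep = magn + cosHex * magn;
    const double rowStep = 2.0 * sinHex * magn;

    ScreenPoint p;
    p.x = origx + coordx * columnStep;
    p.y = origy + coordy * rowStep;
    if (coordx % 2 != 0) {
        p.y += sinHex * magn;  // odd column: half a row lower
    }
    return p;
}
```

    The point returned here plays the role of the (x,y) starting vertex used in the geometry discussion that follows.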

    Note that in performing this final translation into raster coordinates, the geometry of the hexagon must be considered in depth. Figure 6, below, is a useful aid to understanding this geometry:


    Figure 6: Hexagon Geometry

    The diagram above gives all of the dimensions necessary to implement abstracthex(). The leftmost vertex of the hexagon is, by definition, located at (x,y). This vertex is the first vertex drawn by the abstracthex() function. From there, drawing moves down and right to the next vertex. As shown in Figure 6, the 60° angle is key to these calculations. We can view any side of the hexagon as the hypotenuse of a 30-60-90 triangle. The triangle constructed using dotted lines in Figure 6 is an example of one of these 30-60-90 triangles. The other sides of such a triangle measure cos(60°) times the hypotenuse length (for the shorter side) and sin(60°) times the hypotenuse length (for the longer side). Here, the hypotenuse has length magn, and the two sides other than the hypotenuse therefore have lengths of cos(60°)*magn and sin(60°)*magn. The actual measurement shown in Figure 6 is negative, since positive "Y" movement in the Direct3D texture coordinate system is down.

    As shown in the picture above, the shorter of these two triangle sides approximates the movement from the first vertex drawn to the second in the "X" dimension. Similarly, the longer of these two sides approximates the movement from the first vertex drawn to the second in the "Y" dimension. As we move from the first vertex drawn at (x,y) to the next vertex, we therefore move cos(60°)*magn pixels in the "X" dimension and sin(60°)*magn in the "Y" dimension. The coordinate of this second vertex is thus (x+cos(60°)*magn, y+sin(60°)*magn).

    The next vertex drawn is the one directly to the right of the vertex just drawn. Because the length of the side between these two is magn, the third coordinate is located at (x+cos(60°)*magn+magn, y+sin(60°)*magn).

    Instead of passing these coordinate expressions to GDI/GDI+ as shown above, though, the code provided uses a system of running totals, in an effort to minimize repeated calculations. Near the top of abstracthex(), the following initializations are present:

    double cham=COS_HEX_ANGLE*magn;
    double sham=SIN_HEX_ANGLE*magn;
    double opx=(x+cham);   //Second vertex's "X" location
    double opy=(y+sham);   //Hex bottom "Y" location
    double opm=(opx+magn); //Third vertex's "X" location
    double oms=(y-sham);   //Hex top "Y" location

    After the execution of this code, the first vertex drawn will have coordinates (x,y). The second will be located at (opx,opy), and the third at (opm,opy). The fourth vertex drawn, at the extreme right side of the hexagon, is just a bit further to the right, at (opm+cham,y). The drawing of the fifth vertex moves back toward the left and up, to (opm,oms). Finally, we move back magn pixels to the left, and draw the sixth vertex at (opx,oms).
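    The running totals can be checked with a small, platform-independent sketch that computes all six vertices in drawing order. Vertex2D and hexVertices are hypothetical names, and the constants 0.5 and sqrt(3)/2 stand in for COS_HEX_ANGLE and SIN_HEX_ANGLE on the assumption that these hold cos(60°) and sin(60°).

```cpp
#include <array>
#include <cmath>

struct Vertex2D {
    double x;
    double y;
};

// Computes the six vertices of a flat-topped hexagon in drawing order,
// starting at the leftmost vertex (x, y). Side length is magn, and "Y"
// grows downward, as in screen coordinates.
std::array<Vertex2D, 6> hexVertices(double x, double y, double magn) {
    const double cham = 0.5 * magn;                   // cos(60 deg) * magn
    const double sham = std::sqrt(3.0) / 2.0 * magn;  // sin(60 deg) * magn
    const double opx = x + cham;    // second vertex's "X" location
    const double opy = y + sham;    // hex bottom "Y" location
    const double opm = opx + magn;  // third vertex's "X" location
    const double oms = y - sham;    // hex top "Y" location

    return {{
        {x, y},           // leftmost vertex
        {opx, opy},       // bottom-left
        {opm, opy},       // bottom-right
        {opm + cham, y},  // rightmost vertex
        {opm, oms},       // top-right
        {opx, oms},       // top-left
    }};
}
```

    A useful sanity check is that every side, including the closing side from the sixth vertex back to the first, has length magn.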

    Depending on whether or not anti is true, either GDI or GDI+ will be used for the actual drawing operations. In either case, a data structure holding all of the vertex coordinates, in drawing order, is first constructed. For GDI, this is an array of POINT structures, whose construction is shown below:

    POINT hex1[6];
    //Start hex at origin... leftmost point of hex
    //Move [ cos(theta) , sin(theta) ] units in positive (down/right) direction
    //Move ((0.5) * hexwidth) more units right, to make "bottom" of hex
    //Move to vertex opposite origin... Y is same as origin
    //Move to right corner of hex "top"
    //Complete the "top" side of the hex

    Note that the addition of 0.5 to each term serves to achieve proper rounding; otherwise, the decimal portion of each floating point value (x, y, opx, etc.) would simply be abandoned.
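    The effect of adding 0.5 before truncation can be seen in the sketch below, which also compares it with std::lround. Both helper names are hypothetical. The two approaches agree for non-negative values but differ for negative ones, because a cast to int truncates toward zero; whether this matters for the slightly negative origins used in the demos is not established here, and the discrepancy is at most one pixel.

```cpp
#include <cmath>

// Rounds by adding 0.5 and truncating, as described in the text.
// This matches conventional rounding only for non-negative inputs,
// because the cast truncates toward zero.
int roundByTruncation(double value) {
    return static_cast<int>(value + 0.5);
}

// Rounds to the nearest integer for inputs of either sign.
int roundNearest(double value) {
    return static_cast<int>(std::lround(value));
}
```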

    If GDI+ is used, an array of PointF structures is built instead. These structures use floating point coordinates, and no rounding or typecasting is necessary. Their declaration is shown below:

    PointF myPointFArray[] = 
    {
      //Start hex at origin... leftmost point of hex
      PointF(x, y),
      //Move [ cos(theta) , sin(theta) ] units in positive (down/right) direction
      PointF((opx), (opy)),
      //Move ((0.5) * hexwidth) more units right, to make "bottom" of hex
      PointF((opm), (opy)),
      //Move to vertex opposite origin... Y is same as origin
      PointF(opm+cham, y),
      //Move to right corner of hex "top"
      PointF((opm), (oms)),
      //Complete the "top" side of the hex
      PointF((opx), (oms))
    };

    If GDI is in use, the vertex data structure gets passed to a function named Polygon. The SelectObject API is first used to select a pen with the requested outline color, and then to select a brush with the requested interior color. This series of actions results in a polygon with the requested outline and interior colors.

    Under GDI+, two calls are necessary to achieve the same result, one to DrawPolygon() and one to FillPolygon(). It is once again necessary to create both a pen and a brush, with the first of these getting passed to DrawPolygon() and the second to FillPolygon(). It should be noted that the necessity of two distinct function calls here plays some role in the relatively slow performance obtained using GDI+. However, the author made a point of running tests with a single call to FillPolygon() only, and GDI+ was still much slower than GDI.


    The work presented here led the author to several conclusions about recent versions of Windows, its APIs, and its rendering architecture. GDI, of course, is much more efficient than GDI+. This should be no surprise, given that much of GDI was originally written to work well on the relatively primitive hardware that ran the earliest versions of Windows.

    GDI+ is useful primarily because of its anti-aliasing capability. It also offers a cleaner interface than GDI, e.g. in the area of memory management. This OO interface comes at a significant cost, though. Comparable operations are much slower in GDI+ than in GDI, even with anti-aliasing disabled.

    While both are imperfect, GDI and GDI+ do seem to complement each other well. In the demonstration programs provided, GDI+ works well for generating a high-quality printable image, and this, fortuitously, is not a task that needs to happen with incredible quickness anyway. GDI, on the other hand, provides the high level of speed and efficiency necessary for the dynamic texturing demo ("hex3d.exe"), and in this arena its lack of anti-aliasing will usually go unnoticed. The texture will be moving quickly at runtime, and will also get passed through the Direct3D interpolation filters necessary to scale the texture for each frame. Whatever jagged edges GDI might generate compared to GDI+ are quite likely lost in the translation and animation process.

    Finally, some conclusions about combining Direct3D and GDI in the latest versions of Windows were reached by the author in preparing this work. While the changes in GUI rendering that came with Windows Vista were significant, nothing in them seems to rule out the possibility of using GDI to draw on Direct3D surfaces with a level of efficiency that is at least reasonably good. The process of obtaining the necessary DC remains quick and intuitive, and the GDI operations themselves seem mostly to be fast enough to keep up with Direct3D.


    1. MinGW did not support later versions of DirectX when the article was written. At least, DirectX 9 was the newest version for which headers were present in "c:\mingw\include\", and web searches yielded no obvious way to incorporate later versions. Microsoft's "August 2007" version of the DirectX SDK should therefore be installed in order to build the 3D demonstration programs. Detailed instructions for obtaining and installing the SDK are given in Direct3D Programming with MinGW.
    2. At present, only 32-bit client applications are supported. To support 64-bit clients would require the code for "hexdll.dll" to be rebuilt in a 64-bit development environment. While there is no reason to suspect that this would not work, 64-bit compilation has not been tested.


    This is the second major version of this article. Compared to the first version, some improvements in formatting and clarity have been made. The code and binary files have not changed. 


    This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)



    Endogine sprite engine





    By | 17 Jul 2006 | Article
    Sprite engine for D3D and GDI+ (with several game examples).

    Sample Image

    Some of the examples are included in the source.


    Endogine is a sprite and game engine, originally written in Macromedia Director to overcome the many limitations of its default sprite engine. I started working in my spare time on a C# version of the project about a year ago, and now it has far more features than the original - in fact, I've redesigned the architecture many times so it has little in common with the Director++ paradigm I tried to achieve. Moving away from those patterns has made the project much better. However, there are still a lot of things I want to implement before I can even call it a beta.

    Some of the features are:

    • Easy media management.
    • Sprite hierarchy (parent/child relations, where children's Rotation, LOC etc. are inherited from the parent).
    • Behaviors.
    • Collision detection.
    • Plugin-based rendering (Direct3D, GDI+, Irrlicht is next).
    • Custom raster operations.
    • Procedural textures (Perlin/Wood/Marble/Plasma/others).
    • Particle systems.
    • Flash, Photoshop, and Director import (not scripts). NB: Only prototype functionality.
    • Mouse events by sprite (enter/leave/up/down etc. events).
    • Widgets (button, frame, window, scrollbar etc.). All are sprite-based, so blending/scaling/rotating works on widget elements as well.
    • Animator object (can animate almost any property of a sprite).
    • Interpolators (for generating smooth animations, color gradients etc.).
    • Sprite texts (each character is a sprite which can be animated, supports custom multicolor bitmap fonts and kerning).
    • Example game prototypes (Puzzle Bobble, Parallax Asteroids, Snooker/Minigolf, Cave Hunter).
    • IDE with scene graph, sprite/behavior editing, resource management, and debugging tools.
    • Simple scripting language, FlowScript, for animation and sound control.
    • Plug-in sound system (currently BASS or DirectSound).
    • New - Color editor toolset: Gradient, Painter-style HSB picker, Color swatches (supports .aco, .act, Painter.txt).
    • New - Classes for RGB, HSB, HSL, HWB, Lab, XYZ color spaces (with plug-in functionality). Picker that handles any 3-dimensional color space.

    Sample Image

    Some of the current GUI tools (editors, managers etc.).


    I had been developing games professionally in Macromedia Director for 10 years, and was very disappointed with the development of the product over the last 5 years. To make up for this, I wrote several graphical sub-systems, some very project-specific, but finally I designed one that fulfilled the more generic criteria I had for a 2D game creation graphics API. It was developed in Director's scripting language, Lingo, from autumn 2004 to spring 2005, and since then it has been a C# project.

    It's a prototype

    The current engine design is not carved in stone, and I have already made several major changes during its development, and even more are planned.

    Optimizations will have to wait until all functionality is implemented. The GDI+ mode is extremely slow, because I haven't ported my dirty rect system yet. The D3D full-screen mode has the best performance.

    The code is poorly commented at this stage, as it is still possible I'll rewrite many parts. If there is a demand for documentation, I will create it as questions appear. For now, you can get a feel for how to use it, by investigating the Tests project.

    Example projects

    There are two solutions in the download. One is the actual engine, including a project called Tests which contains most of the examples and code. I chose to include it in the solution since it's a little bit easier to develop/debug the engine if the test is part of it, but that's not the way your own projects should be set up. The MusicGame project is closer to how it should be done.

    There's also a simple tutorial text on how to set up your own project.

    I wanted to have a simple, but real-life testbed, so I'm creating a few game prototypes. Currently, they are Puzzle Bobble, a scrolling asteroid game, a golf/snooker game, CaveHunter, and Space Invaders. Other non-game tests are also available in the project. Turn them on and off by bringing the main window into focus and selecting items from the "Engine tests" menu.

    Example walkthrough

    Note: For using dialogs / editors, the Endogine.Editors.dll has to be present in the .exe folder. For sound, include Endogine.Audio.Bass.dll and the files in (shareware license).

    To try out some of the examples in the Tests project, run the Tests solution and follow these steps:

    • Start in 3D/MDI mode (default).
    • Set focus on the Main window, so the Engine Tests menu appears in the MDI parent.
    • Select Particle System from the menu. The green "dialog" controls a few aspects of the particle system. The top slider is numParticles, the bottom the size. The buttons switch between color and size schemes. After playing around with it, turn it off by selecting Particle System again from the menu.
    • Select GDI+ random procedural, and also Font from the menu. This demonstrates a few ways to create procedural textures and manipulate bitmaps, as well as the support for bitmap fonts. Each letter sprite also has a behavior that makes it swing. Note that both are extremely slow - they're using my old systems. I'll upgrade them soon which will make them a hundred times faster.
    • Go to the SceneGraphViewer, and expand the nodes until you get to the Label sprite. Right-click it and select the Loc/Scale/Rot control. Try the different modes of changing the properties. Notice the mouse wrap-around feature.
    • Close the Loc/Scale/Rot control, go back to the SceneGraphViewer, expand the Label node. Right-click one of the letter sprites and select Properties (the LocY and Rotation properties are under the program control so they are hard to change here).
    • Click the Behaviors button to see which behaviors the sprite has. Mark ThisMovie.BhSwing, and click Remove to make it stop swinging. (Add... and Properties... aren't implemented yet).
    • Stop the program, and start it again, but now deselect MDI mode (because of .NET's keyboard problems in MDI mode).
    • Set focus to the Main window and select Puzzle Bobble from the menu. Player 1 uses the arrow keys to steer (Down to shoot, Up to center), player 2 uses AWSD. Select it again to turn it off. Note: due to changes in the ERectangle(F) classes, there's currently a bug which makes the ball stick in the wrong places. Will look into that later. (May be fixed.)
    • Select Snooker from the menu and click-drag on the white ball to shoot it. Open the Property Inspector for the topology sprite object, change Depth and Function to see how a new image is generated, use the Loc/Scale/Rot control to drag it around, and see how the balls react (buggy, sometimes!).
    • Select Parallax Scroll from the menu and try the Asteroids-like game. Arrow keys to steer, Space to shoot. Note: the Camera control's LOC won't work now, because the camera is managed by the player, centering the camera over the ship for each frame update. That's how the parallax works - just move the camera, and the parallax layers will automatically create the depth effect.
    • (Broken in the Feb '06 version) Go to the main menu, Edit, check Onscreen sprite edit, and right-click on a sprite on the screen, and you'll get a menu with the sprites under the mouse LOC. Select one of the sprites, and it should appear as selected in the SceneGraph. It doesn't update properly now, so you'll have to click the SceneGraph's toolbar to see the selection.

    Using the code

    Prerequisites: .NET 2.0 and DirectX 9.0c with Managed Extensions (Feb 06) for the demo executable, and DirectX SDK Feb 06 for compiling the source. You can download SharpDevelop 2.0 or Microsoft's Visual Studio Express C# for free, if you need a C# developer IDE. Read the README.txt included in the source for further instructions.

    Note that Managed DirectX versions are neither backward nor forward compatible. I think later versions will work if you recompile the project, but the demo executable needs MDX Feb 06.

    I'm currently redesigning the workflow, and setting up a new solution that uses Endogine would take a fairly long list of instructions, which I haven't written yet. The easiest way to get started is by looking at the Tests project. You can also have a look at the MusicGame solution, which is more like how a real project would be organized.

    Most of the terminology is borrowed from Director. Some examples of sprite creation:

    Sprite sp1 = new Sprite();
    //Loads the first bitmap file named 
    //Ball.* from the default directory
    sp1.MemberName = "Ball";
    sp1.Loc = new Point(30,30); //moves the sprite
    Sprite sp2 = new Sprite();
    //If it's an animated gif, 
    //the sprite will automatically animate
    sp2.MemberName = "GifAnim";
    sp2.Animator.StepSize = 0.1; //set the animation speed
    sp2.Rect = new RectangleF(50,50,200,200); //stretches and moves the sprite
    Sprite sp2Child = new Sprite();
    //same texture/bitmap as sp1's will be used 
    //- no duplicate memory areas
    sp2Child.MemberName = "Ball";
    //now, sp2Child will follow sp2's location, 
    //rotation, and stretching
    sp2Child.Parent = sp2;

    Road map

    I'll be using the engine for some commercial projects this year. This means I'll concentrate on the features that are necessary for those games, and that I probably won't work on polishing the "common" feature set a lot.

    There will be a number of updates during the year, but I've revised my 1.0 ETA to late autumn '06. I expect the projects to put an evolutionary pressure on the engine, forcing refactoring and new usage patterns, but resulting in a much better architecture. Another side effect is documentation; I'll have to write at least a few tutorials for the other team members.

    Currently, I put most of my spare time into PaintLab, an image editor, which is based on OpenBlackBox, an open source modular signal processing framework. They both use Endogine as their graphics core, and many GUI elements are Endogine editors, so it's common that I need to improve/add stuff in Endogine while working on them.

    Goals for next update (unchanged)

    I've started using Subversion for source control, which will make it easier to assemble new versions for posting, so updates with new tutorials and bug-fixes should appear more often. Some probable tasks for the next few months:

    • Switch between IDE/MDI mode and "clean" mode in runtime.
    • Clean up terminology, duplicate systems, continue transition to .NET 2.0.
    • Look into supporting XAML, harmonize with its terminology and patterns.
    • Fix IDE GUI, work on some more editors.


    Update 2006-07-06

    Again, other projects have taken most of my time - currently it's the OpenBlackBox/Endogine-based PaintLab paint program. Side effects for Endogine:

    • Color management tools: editors, color space converters, color palette file filters etc.
    • Refactoring: WinForms elements are being moved out of the main project into a separate DLL. Simplifies porting to mono or WPF, and makes the core DLL 50% smaller.
    • Improved Canvas, Vector3, Vector4, and Matrix4 classes.
    • Improved PSD parser.
    • More collision/intersection detection algorithms.
    • Several new or updated user controls.

    Update 2006-03-15

    I've focused on releasing the first version of OpenBlackBox, so most of the modifications have been made in order to provide interop functionality (OBB is highly dependent on Endogine).

    • Optimizations (can be many times faster when lots of vertices are involved).
    • Continued transition to .NET 2.0.
    • Requires DirectX SDK Feb '06.
    • Simple support for HLSL shaders and RenderToTexture.
    • Better proxies for pixel manipulation - Canvas and PixelDataProvider classes instead of the old PixelManipulator.
    • Extended interfaces to the rendering parts of Endogine. Projects (such as OpenBlackBox) can take advantage of the abstracted rendering API without necessarily firing up the whole Endogine sprite system.
    • Forgot to mention it last time, but there's a small tutorial included since the last update.

    That's pretty much it. But don't miss OpenBlackBox, in time it will become a very useful component for developing applications with Endogine!

    Update 2006-02-01

    Some major architectural changes in this version, especially in the textures/bitmap/animation system. The transition isn't complete yet, so several usage patterns can co-exist and cause some confusion. Some utilities will be needed to make the animation system easier to use.

    • Moved to .NET 2.0. Note: refactoring is still in progress (e.g., generics isn't used everywhere yet).
    • The Isometric example has been removed. I've continued to work on it as a separate project, which I'll make available later if possible.
    • Renderers are now plug-ins, allowing for easier development/deployment. (Also started on a Tao-based OpenGL renderer, but lost my temper with its API.)
    • PixelManipulator - easy access to pixels regardless of if the source is a bitmap or texture surface.
    • Examples of how to use the PixelManipulator (adapted Smoke and Cellular Automata3 - thanks Mike Davis and Glen Murphy for letting me use them).
    • C# version of Carlos J. Quintero's VB.NET TriStateTreeView - thanks Carlos. Added a TriStateTreeNode class.
    • Started on a plugin-based sound system. Currently, I've implemented two sound sub-systems: BASS and a simple DirectX player. OpenAL can be done, but BASS is cross-platform enough, and it has a better feature set. Later, I'll add DirectShow support.
    • Included a modified version of Leslie Sanford's MidiToolKit (supports multiple playbacks and has a slightly different messaging system). Thanks!
    • Flash parser/renderer has been restructured and improved. Can render basic shapes and animations.
    • System for managing file system and settings for different configurations (a bit like app.config but easier to use and better for my purposes).
    • RegEx-based file finder (extended search mechanism so you can use non-regex counters like [39-80] and [1-130pad:3] - the latter will find the strings 001 to 130).
    • Helper for creating packed textures (tree map packing).
    • Abstraction layer for texture usage - the user doesn't have to care if a sprite's image comes from a packed texture or not.
    • The concept of Members is on its way out, replaced by PicRefs. Macromedia Director terminology in general will disappear over time from now on.
    • New, more abstracted animation system. Not fully implemented.
    • .NET scripting system with code injection. Currently only implemented for Boo, but will support other languages. Will be restructured later, with a strategy pattern for the different languages.
    • New vastly improved bitmap font system, both for rendering and creating fonts. Real-time kerning instead of precalculated kerning tables.
    • Localization/translation helper (for multi-language products).
    • A number of helper classes such as the IntervalString class (translates "-1-3,5-7" to and from the array [-1,0,1,2,3,5,6,7]).
    • Unknown number of bug fixes and optimizations.
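
    The IntervalString idea mentioned in the list above (translating "-1-3,5-7" to the array [-1,0,1,2,3,5,6,7]) can be sketched in a few lines of C#. This is only an illustration of the parsing direction: the class name IntervalStringSketch and its Parse method are hypothetical and are not taken from the Endogine sources.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for illustration; not Endogine's actual IntervalString class.
public static class IntervalStringSketch
{
    // Expands a string such as "-1-3,5-7" into the integers it describes.
    public static int[] Parse(string text)
    {
        var result = new List<int>();
        foreach (string part in text.Split(','))
        {
            string item = part.Trim();
            if (item.Length == 0) continue;

            // Look for the range separator starting at index 1, so that a
            // leading minus sign is treated as part of the first number.
            int sep = item.IndexOf('-', 1);
            if (sep < 0)
            {
                // A single value such as "4" or "-4".
                result.Add(int.Parse(item, CultureInfo.InvariantCulture));
                continue;
            }

            int first = int.Parse(item.Substring(0, sep), CultureInfo.InvariantCulture);
            int last = int.Parse(item.Substring(sep + 1), CultureInfo.InvariantCulture);
            for (int i = first; i <= last; i++) result.Add(i);
        }
        return result.ToArray();
    }
}
```

    Calling IntervalStringSketch.Parse("-1-3,5-7") yields the eight values listed above; the reverse direction (array to string) is left out to keep the sketch short.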

    Update 2005-10-10

    Since the last update wasn't that exciting from a gaming POV, I decided to throw in a new version with a prototype isometric game. Totally R.A.D.

    • Isometric rendering, based on David Skoglund's game CrunchTime, including his graphics (see _readme.txt in the Isometric folder). Thanks, pal!
    • A* pathfinding algorithm adapted from an implementation by John Kenedy. Thank you!
    • Removed references to LiveInterface.
    • Added "resource fork" for bitmap files - an XML file with additional info such as number of animation frames and offset point.

    Update 2005-10-04

    OK, it's over a month late, and it doesn't include stuff you might have been waiting for, and the things I've been working on - mainly creating two script languages - aren't that useful in their current state (especially with no good examples of their use). I think it was worth putting some time into, as I'm certain FlowScript will become a great tool later on. Here's what I've got for this version:

    • Compiled using the August 2005 SDK.
    • Added Space Invaders prototype.
    • Added curves rendering to .swf import (it still renders strange images).
    • Reorganized the engine into a .dll.
    • Fixed mouse LOC <-> screen LOC error.
    • Fixed transparency / alpha / color-key issues.
    • Map (probably a bad name) class - like a SortedList, but accepts multiple identical "keys".
    • Node class, like XmlNode, but for non-text data.
    • Started preparing the use of 3D matrices for scale/LOC/rotation.
    • Removed DirectDraw support.
    • Basic sound support.
    • A badly sync'ed drum machine example.
    • FlowScript, a time-based scripting language, aimed at non-programmers for animation and sound control, based on:
    • EScript, simple scripting language based on reflection (no bytecode).
    • Simple CheckBox widget.

    Update 2005-08-01

    • Compiled using the June 2005 SDK.
    • Sprite.Cursor property (works like Control.Cursor).
    • Simple XML editor.
    • .NET standardized serializing.
    • PropertyGrid instead of custom property editors.
    • VersatileDataGrid User Control (a new DataGrid control which implements functionality missing in .NET's standard DataGrid).
    • TreeGrid User Control - a bit like Explorer, but the right-hand pane is a VersatileDataGrid locked to the treeview.
    • Two new game prototypes: CaveHunter and Snooker/MiniGolf.
    • ResourceManager editor w/ drag 'n' drop to scene (and to SceneGraph viewer).
    • Better structure.
    • Transformer behavior - Photoshop-style sprite overlay for moving/scaling/rotating.
    • Scene XML import/export.
    • Director Xtra for exporting movies to Endogine XML scene format.
    • Import Photoshop documents - each layer becomes a sprite, layer effects become behaviors. Decoding of layer bitmaps incomplete.
    • Import Flash swf files (rendering code incomplete).
    • BinaryReverseReader which reads bytes in reverse order (for .psd), and BinaryFlashReader which reads data with sub-byte precision (for .swf).
    • Extended EPoint(F) and ERectangle(F) classes. Note that Puzzle Bobble doesn't work properly after the latest changes, I'll take care of that later.

    Update 2005-07-07

    • Camera node.
    • Parallax layers.
    • LOC/Scale control toolbox.
    • User Controls: ValueEdit (arrow keys to change a Point), JogShuttle (mouse drag to change a Point value by jog or shuttle method).
    • MDI mode crash fixed (thanks to adrian cirstei) - but keys still don't work in MDI mode.
    • Multiple Scene Graph windows.
    • Select sprites directly in scene.
    • Asteroids-type game with parallax layers.
    • Extended EPoint(F) and ERectangle(F) classes.

    Update 2005-06-27

    • Optional MDI interface: editors and game window as MDI windows. (Problem: I get an error when trying to create the 3D device in a MDI window.)
    • Scene Graph: treeview of all sprites.
    • Sprite marker: creates a marker around the sprite which is selected in the scene graph.
    • Property Inspector: interface for viewing and editing sprite properties.
    • Sprite Behaviors: easy way to add functionality to sprites. The swinging letters animation is done with a behavior.
    • Behavior Inspector: add/remove/edit behaviors in runtime.
    • Inks use an enumeration instead of an int (ROPs.Multiply instead of 103).
    • Switched from MS' too simplistic Point(F)/Rectangle(F) classes to my own EPoint(F)/ERectangle(F), which have operator overloading and many more methods. Note that I've chosen to write them as classes, not structs - i.e., they're passed as references, not values.
    • Easier keyboard handling (assigns names to keys - makes it easier to let the user define keys, or to have multiple players on one keyboard).
    • Puzzle Bobble allows multiple players in each area.
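
    The note above about EPoint(F)/ERectangle(F) being classes rather than structs has a practical consequence: a method that modifies such an object changes the caller's instance, whereas a struct is copied on every call. The following minimal sketch shows the difference using two hypothetical stand-in types (RefPoint and ValPoint), not the real Endogine or System.Drawing classes.

```csharp
using System;

// Hypothetical stand-ins for illustration only.
public class RefPoint { public int X, Y; }   // class: reference semantics
public struct ValPoint { public int X, Y; }  // struct: value semantics

public static class PointSemanticsSketch
{
    // Modifies the object the caller also refers to.
    public static void MoveRight(RefPoint p) { p.X += 10; }

    // Modifies only the local copy; the caller's value is unchanged.
    public static void MoveRight(ValPoint p) { p.X += 10; }

    // Returns { X of the class instance, X of the struct value } after the calls.
    public static int[] Demo()
    {
        var r = new RefPoint { X = 1, Y = 1 };
        var v = new ValPoint { X = 1, Y = 1 };
        MoveRight(r);
        MoveRight(v);
        return new[] { r.X, v.X };
    }
}
```

    Here Demo() returns 11 for the class instance and 1 for the struct, which is why code that keeps a reference to an EPoint should copy it before modifying it if the original must stay intact.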

    You can read about my early thoughts about the Endogine concept, future plans, and see some Shockwave/Lingo demos here.

    I have added an Endogine C# specific page here, but it probably lags behind this page.


    This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

    A list of licenses authors might use can be found here

    About the Author

    Jonas Beckeman

    Web Developer

    Sweden





    Paint.NET is free image and photo editing software for computers that run Windows. It features an intuitive and innovative user interface with support for layers, unlimited undo, special effects, and a wide variety of useful and powerful tools. An active and growing online community provides friendly help, tutorials, and plugins.
    It started development as an undergraduate college senior design project mentored by Microsoft, and is currently being maintained by some of the alumni that originally worked on it. Originally intended as a free replacement for the Microsoft Paint software that comes with Windows, it has grown into a powerful yet simple image and photo editor tool. It has been compared to other digital photo editing software packages such as Adobe® Photoshop®, Corel® Paint Shop Pro®, Microsoft Photo Editor, and The GIMP.





    Writing effect plug-ins for Paint.NET 2.1 in C#

    By | 10 May 2005 | Article
    This article is an introduction on how to create your own effect plug-ins for Paint.NET 2.1 in C#.


    Paint.NET 2.1 was released last week. Created to be a free replacement for the good old Paint that ships with every copy of Windows, it is very interesting for end users at large. But it is even more interesting for developers, for two reasons. First, it is open source. So if you like to study a few megabytes of C# code or how some architectural problems can be solved, go and get it. Second, the application provides a simple but appealing interface for creating your own effect plug-ins. And that's what this article is all about (if you're searching for some fancy effect algorithm, go somewhere else as the effect used in the article is quite simple).

    Getting started

    The first thing you need to do is to get the Paint.NET source code. Besides being the source code, it also serves as its own documentation and as the Paint.NET SDK. The solution consists of several projects. However, the only interesting ones when developing Paint.NET effect plug-ins are the PdnLib library which contains the classes we will use for rendering our effect and the Effects library which contains the base classes for deriving your own effect implementations.

    The project basics

    To create a new effect plug-in, we start with creating a new C# Class Library and add references to the official release versions of the PdnLib (PdnLib.dll) and the PaintDotNet.Effects library (PaintDotNet.Effects.dll). The root namespace for our project should be PaintDotNet.Effects as we're creating a plug-in that is supposed to fit in seamlessly. This is, of course, not limited to the namespace but more of a general rule: when writing software for Paint.NET, do as the Paint.NET developers do. The actual implementation requires deriving three classes:

    1. Effect is the base class for all Paint.NET effect implementations and it's also the interface Paint.NET will use for un-parameterized effects. It contains the method public virtual void Render(RenderArgs, RenderArgs, Rectangle) which derived un-parameterized effects override.
    2. Most of the effects are parameterized. The EffectConfigToken class is the base class for all specific effect parameter classes.
    3. And finally, as parameterized effects most likely will need a UI, there is a base class for effect dialogs: EffectConfigDialog.

    Implementing the infrastructure

    Now, we will take a look at the implementation details on the basis of the Noise Effect (as the name implies, it simply adds noise to an image). By the way, when using the sources provided with this article, you will most likely need to update the references to the Paint.NET libraries.

    The effect parameters

    As I said before, we need to derive a class from EffectConfigToken to be able to pass around our effect parameters. Given that our effect is called Noise Effect and that we want to achieve consistency with the existing sources, our parameter class has to be named NoiseEffectConfigToken.

    public class NoiseEffectConfigToken : EffectConfigToken

    There is no rule for what your constructor has to look like. You can use a simple default constructor or one with parameters. From Paint.NET's point of view, it simply does not matter because (as you will see later) the class derived from EffectConfigDialog is responsible for creating an instance of the EffectConfigToken. So, you do not necessarily need to do anything other than provide a non-private constructor.

    public NoiseEffectConfigToken() : base()

    However, our base class implements the ICloneable interface and also defines a pattern for how cloning should be handled. Therefore, we need to create a protected constructor that expects an object of the class' own type and uses it to duplicate all values. We then have to override Clone() and use the protected constructor for the actual cloning. This also means that the constructor should invoke the base constructor but Clone() must not call its base implementation.

    protected NoiseEffectConfigToken(NoiseEffectConfigToken copyMe) : base(copyMe)
    {
      this.frequency      = copyMe.frequency;
      this.amplitude      = copyMe.amplitude;
      this.brightnessOnly = copyMe.brightnessOnly;
    }

    public override object Clone()
    {
      return new NoiseEffectConfigToken(this);
    }

    The rest of the implementation details are again really up to you. Most likely, you will define some private fields and corresponding public properties (possibly with some plausibility checks).

    The UI to set the effect parameters

    Now that we've got a container for our parameters, we need a UI to set them. As mentioned before, we will derive the UI dialog from EffectConfigDialog. This is important as it helps to ensure consistency of the whole UI. For example, in Paint.NET 2.0, an effect dialog is by default shown with an opacity of 0.9 (except for sessions over terminal services). If I don't use the base class of Paint.NET and the developers decide that an opacity of 0.6 is a whole lot cooler, my dialog would all of a sudden look "wrong". Because we still try to be consistent with the original code, our UI class is called NoiseEffectConfigDialog.

    Again, you have a lot of freedom when it comes to designing your dialog, so I will again focus on the mandatory implementation details. The effect dialog is entirely responsible for creating and maintaining effect parameter objects. Therefore, there are three virtual base methods you must override. Perhaps unexpectedly, you must not call their base implementations (it seems that earlier versions of the base implementations would even generally throw exceptions when called). The first is InitialInitToken(), which is responsible for creating a new concrete EffectConfigToken and storing a reference to it in the protected field theEffectToken (the reference is implicitly converted to an EffectConfigToken reference).

    protected override void InitialInitToken()
    {
      theEffectToken = new NoiseEffectConfigToken();
    }

    Second, we need a method to update the effect token according to the state of the dialog. Therefore, we need to override the method InitTokenFromDialog().

    protected override void InitTokenFromDialog()
    {
      NoiseEffectConfigToken token = (NoiseEffectConfigToken)theEffectToken;
      token.Frequency      = (double)FrequencyTrackBar.Value / 100.0;
      token.Amplitude      = (double)AmplitudeTrackBar.Value / 100.0;
      token.BrightnessOnly = BrightnessOnlyCheckBox.Checked;
    }

    And finally, we need to be able to do what we did before the other way round. That is, updating the UI according to the values of a token. That's what InitDialogFromToken() is for. Unlike the other two methods, this one expects a reference to the token to process.

    protected override void InitDialogFromToken(EffectConfigToken effectToken)
    {
      NoiseEffectConfigToken token = (NoiseEffectConfigToken)effectToken;

      /* Clamp the token values to the range supported by the track bars */
      if ((int)(token.Frequency * 100.0) > FrequencyTrackBar.Maximum)
        FrequencyTrackBar.Value = FrequencyTrackBar.Maximum;
      else if ((int)(token.Frequency * 100.0) < FrequencyTrackBar.Minimum)
        FrequencyTrackBar.Value = FrequencyTrackBar.Minimum;
      else
        FrequencyTrackBar.Value = (int)(token.Frequency * 100.0);

      if ((int)(token.Amplitude * 100.0) > AmplitudeTrackBar.Maximum)
        AmplitudeTrackBar.Value = AmplitudeTrackBar.Maximum;
      else if ((int)(token.Amplitude * 100.0) < AmplitudeTrackBar.Minimum)
        AmplitudeTrackBar.Value = AmplitudeTrackBar.Minimum;
      else
        AmplitudeTrackBar.Value = (int)(token.Amplitude * 100.0);

      FrequencyValueLabel.Text = FrequencyTrackBar.Value.ToString("D") + "%";
      AmplitudeValueLabel.Text = AmplitudeTrackBar.Value.ToString("D") + "%";
      BrightnessOnlyCheckBox.Checked = token.BrightnessOnly;
    }
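
    The range checks in InitDialogFromToken() follow the same pattern for both track bars: scale the fraction to a percentage, then clamp it to the control's range. As a possible simplification, that pattern could be factored into a small helper. The class TrackBarMappingSketch and the method ToTrackBarValue are hypothetical names introduced here for illustration; they are not part of Paint.NET.

```csharp
using System;

// Hypothetical helper for illustration; not part of the Paint.NET libraries.
public static class TrackBarMappingSketch
{
    // Converts a fraction (e.g. 0.5) to a percentage and clamps it to
    // the inclusive range [minimum, maximum] of a track bar.
    public static int ToTrackBarValue(double fraction, int minimum, int maximum)
    {
        int percent = (int)(fraction * 100.0);
        return Math.Max(minimum, Math.Min(maximum, percent));
    }
}
```

    With such a helper, an assignment like FrequencyTrackBar.Value = TrackBarMappingSketch.ToTrackBarValue(token.Frequency, FrequencyTrackBar.Minimum, FrequencyTrackBar.Maximum) would replace each three-way if/else chain.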

    We're almost done. What's still missing is that we need to signal the application when values have been changed and about the user's final decision to either apply the changes to the image or cancel the operation. Therefore, whenever a value has been changed by the user, call UpdateToken() to let the application know that it needs to update the preview. Also, call Close() when leaving the dialog and set the appropriate DialogResult. For example:

    private void AmplitudeTrackBar_Scroll(object sender, System.EventArgs e)
    {
      AmplitudeValueLabel.Text = AmplitudeTrackBar.Value.ToString("D") + "%";
      UpdateToken();  /* tell the application to refresh the preview */
    }
    private void OkButton_Click(object sender, System.EventArgs e)
    {
      DialogResult = DialogResult.OK;
      Close();
    }
    private void EscButton_Click(object sender, System.EventArgs e)
    {
      DialogResult = DialogResult.Cancel;
      Close();
    }

    Implementing the effect

    Now everything is in place to start the implementation of the effect. As I mentioned before, there is a base class for un-parameterized effects. The Noise Effect is parameterized but that will not keep us from deriving from Effect. However, in order to let Paint.NET know that this is a parameterized effect, we need to also implement the IConfigurableEffect interface which adds another overload of the Render() method. It also introduces the method CreateConfigDialog() which allows the application to create an effect dialog.

    public class NoiseEffect : Effect, IConfigurableEffect

    But how do we construct an Effect object, or in this case, a NoiseEffect object? This time, we have to follow the patterns of the application which means that we use a public default constructor which invokes one of the two base constructors. The first one expects the effect's name, its description, and an icon to be shown in the Effects menu. The second constructor, in addition, requires a shortcut key for the effect. The shortcut key, however, will only be applied to effects which are categorized as an adjustment. In case of a normal effect, it will be ignored (see chapter Effect Attributes for details on effects and adjustments). In conjunction with some resource management, this might look like this:

    public NoiseEffect() : base(NoiseEffect.resources.GetString("Text.EffectName"),

    The only mandatory implementations we need are those that come with the implementation of the interface IConfigurableEffect. Implementing CreateConfigDialog() is quite simple as it does not involve anything but creating a dialog object and returning a reference to it.

    public EffectConfigDialog CreateConfigDialog()
    {
      return new NoiseEffectConfigDialog();
    }

    Applying the effect is more interesting but we're going to deal with some strange classes we may never have heard of. So let's first take a look at the signature of the Render() method:

    public void Render(EffectConfigToken properties,
                       PaintDotNet.RenderArgs dstArgs,
                       PaintDotNet.RenderArgs srcArgs,
                       PaintDotNet.PdnRegion roi)

    The class RenderArgs contains all we need to manipulate images; most important, it provides us with Surface objects which actually allow reading and writing pixels. However, beware not to confuse dstArgs and srcArgs. The object srcArgs (of course, including its Surface) deals with the original image. Therefore, you should never ever perform any write operations on those objects. But you will constantly read from the source Surface as once you made changes to the target Surface, nobody is going to reset those. The target (or destination) Surface is accessible via the dstArgs object. A pixel at a certain point can be easily addressed by using an indexer which expects x and y coordinates. The following code snippet, for example, takes a pixel from the original image, performs an operation, and then assigns the changed pixel to the same position in the destination Surface.

    point = srcArgs.Surface[x, y];
    VaryBrightness(ref point, token.Amplitude);
    dstArgs.Surface[x, y] = point;

    But that's not all. The region, represented by the fourth object roi, which the application orders us to manipulate, can have any shape. Therefore, we need to call a method like GetRegionScansReadOnlyInt() to obtain a collection of rectangles that approximate the drawing region. Furthermore, we should process the image line by line beginning at the top. These rules lead to a pattern like this:

    public void Render(EffectConfigToken properties, RenderArgs dstArgs,
                       RenderArgs srcArgs, PdnRegion roi)
    {
      /* Loop through all the rectangles that approximate the region */
      foreach (Rectangle rect in roi.GetRegionScansReadOnlyInt())
      {
        for (int y = rect.Top; y < rect.Bottom; y++)
        {
          /* Do something to process every line in the current rectangle */
          for (int x = rect.Left; x < rect.Right; x++)
          {
            /* Do something to process every point in the current line */
          }
        }
      }
    }
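
    To make the loop pattern concrete without depending on the Paint.NET libraries, here is a self-contained sketch of the same idea: read from a source buffer, write to a destination buffer, and only touch the pixels covered by a list of rectangles. The types ScanRect and ScanLoopSketch, the byte[,] buffers and the inversion operation are all assumptions made for this illustration; Paint.NET itself uses Surface, PdnRegion and Rectangle as described above.

```csharp
using System;

// Hypothetical stand-in for a scan rectangle; Right and Bottom are exclusive.
public struct ScanRect
{
    public int Left, Top, Right, Bottom;

    public ScanRect(int left, int top, int right, int bottom)
    {
        Left = left; Top = top; Right = right; Bottom = bottom;
    }
}

public static class ScanLoopSketch
{
    // Inverts the pixels of src that are covered by scans and writes the
    // results to dst. src is only read and dst is only written, mirroring
    // the srcArgs/dstArgs rule. The arrays are indexed [y, x] in this sketch.
    public static void InvertRegion(byte[,] src, byte[,] dst, ScanRect[] scans)
    {
        foreach (ScanRect rect in scans)
        {
            for (int y = rect.Top; y < rect.Bottom; y++)
            {
                for (int x = rect.Left; x < rect.Right; x++)
                {
                    dst[y, x] = (byte)(255 - src[y, x]);
                }
            }
        }
    }
}
```

    Pixels outside the given rectangles are left untouched in the destination, which matches the requirement to manipulate only the region of interest.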

    The last interesting fact that should be mentioned is that the Surface class generally uses a 32-bit format with four channels (red, green, blue and alpha) and 8 bits per channel where each pixel is represented by a ColorBgra object. Keep in mind that ColorBgra is actually a struct, so in order to pass an object of that type by reference, you have to use the ref keyword. Furthermore, the struct allows accessing each channel through a public field:

    private void VaryBrightness(ref ColorBgra c, double amplitude)
    {
      /* Pick a random offset, then flip its sign with a probability of 50% */
      short newOffset = (short)(random.NextDouble() * 127.0 * amplitude);
      if (random.NextDouble() > 0.5)
        newOffset *= -1;

      /* Apply the same offset to each color channel, clamped to the byte range */
      if (c.R + newOffset < byte.MinValue)
        c.R = byte.MinValue;
      else if (c.R + newOffset > byte.MaxValue)
        c.R = byte.MaxValue;
      else
        c.R = (byte)(c.R + newOffset);

      if (c.G + newOffset < byte.MinValue)
        c.G = byte.MinValue;
      else if (c.G + newOffset > byte.MaxValue)
        c.G = byte.MaxValue;
      else
        c.G = (byte)(c.G + newOffset);

      if (c.B + newOffset < byte.MinValue)
        c.B = byte.MinValue;
      else if (c.B + newOffset > byte.MaxValue)
        c.B = byte.MaxValue;
      else
        c.B = (byte)(c.B + newOffset);
    }
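
    The three per-channel blocks in VaryBrightness() all perform the same saturating addition. A compact way to express that step is sketched below; ChannelClampSketch and AddClamped are hypothetical names used only for this illustration and do not exist in Paint.NET.

```csharp
using System;

// Hypothetical helper for illustration; not part of the Paint.NET libraries.
public static class ChannelClampSketch
{
    // Adds offset to a channel value and clamps the result to 0..255, so
    // the addition saturates instead of wrapping around on overflow.
    public static byte AddClamped(byte channel, int offset)
    {
        int value = channel + offset;
        return (byte)Math.Max(byte.MinValue, Math.Min(byte.MaxValue, value));
    }
}
```

    Each channel update could then be written as c.R = ChannelClampSketch.AddClamped(c.R, newOffset), and likewise for c.G and c.B.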

    Effect Attributes

    Now we've got our effect up and running. Is there something else we have to do? Well, in this case everything is fine. However, as every effect is different you might want to apply one of the three attributes that are available in the PaintDotNet.Effects namespace. First, there is the attribute EffectCategoryAttribute which is used to let Paint.NET know if the effect is an effect or an adjustment. The difference between those two is that effects are meant to perform substantial changes on an image and are listed in the Effects menu while adjustments only perform small corrections on the image and are listed in the submenu Adjustments in the menu Layers. Just take a look at the effects and adjustments that are integrated in Paint.NET to get a feeling for how to categorize a certain plug-in. The EffectCategoryAttribute explicitly sets the category of an effect by using the EffectCategory value which is passed to the attribute's constructor. By default, every effect plug-in which does not have an EffectCategoryAttribute is considered to be an effect (and therefore appears in the Effects menu) which is equivalent to applying the attribute as follows:

    [EffectCategoryAttribute(EffectCategory.Effect)]

    Of course, the enumeration EffectCategory contains two values and the second one, EffectCategory.Adjustment, is used to categorize an effect as an adjustment so that it will appear in the Adjustments submenu in Paint.NET.

    [EffectCategoryAttribute(EffectCategory.Adjustment)]

    Besides being able to categorize effects, you can also define your own submenu by applying the EffectSubMenu attribute. Imagine you created ten ultra-cool effects and now want to group them within the Effects menu of Paint.NET to show that they form a toolbox. Now, all you would have to do in order to put all those plug-ins in the submenu 'My Ultra-Cool Toolbox' within the Effects menu would be to apply the EffectSubMenu attribute to every plug-in of your toolbox. This of course can also be done with adjustment plug-ins in order to create submenus within the Adjustments submenu. However, there is one important restriction: because of the way effects are managed in Paint.NET, the effect name must be unique. This means that you can't have an effect called Foo directly in the Effects menu and a second effect which is also called Foo in the submenu 'My Ultra-Cool Toolbox'. If you try something like this, Paint.NET will call only one of the two effects no matter if you use the command in the Effects menu or the one in the submenu.

    [EffectSubMenu("My Ultra-Cool Toolbox")]

    Last but not least there is the SingleThreadedEffect attribute. Now, let's talk about multithreading first. In general, Paint.NET is a multithreaded application. That means, for example, that when it needs to render an effect, it will incorporate worker threads to do the actual rendering. This ensures that the UI stays responsive and in case the rendering is done by at least two threads and Paint.NET is running on a multi-core CPU or a multiprocessor system, it also reduces the rendering time significantly. By default, Paint.NET will use as many threads to render an effect as there are logical processors in the system with a minimum number of threads of two.


                                              physical CPUs   logical CPUs   threads
    Intel Pentium III                         1               1              2
    Intel Pentium 4 with hyper-threading      1               2              2
    Dual Intel Xeon without hyper-threading   2               2              2
    Dual Intel Xeon with hyper-threading      2               4              4

    However, Paint.NET will use only one thread if the SingleThreadedEffect attribute has been applied regardless of the number of logical processors. If the rendering is done by multiple threads, you have to ensure that the usage of any object in the method Render() is thread-safe. The effect configuration token is usually no problem (as long as you don't change its values, which is not recommended anyway) as the rendering threads get a copy of the token instance used by the UI. Also Paint.NET's own PdnRegion class is thread-safe, accordingly you don't have to worry about those objects. However, GDI+ objects like RenderArgs.Graphics or RenderArgs.Bitmap are not thread-safe so whenever you want to use these objects to render your effect, you have to apply the SingleThreadedEffect attribute. You also may apply the attribute whenever you are not sure if your implementation is actually thread-safe or you simply don't want to ponder about multi-threading. Although doing so will lead to a decreased performance on multiprocessor systems and multi-core CPUs, you'll at least be literally on the safe side.
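
    The thread-count rule described in this chapter (one thread per logical processor, at least two, and exactly one when SingleThreadedEffect is applied) can be written down in a few lines. This is only a sketch of the rule as stated in the text, not Paint.NET's actual scheduling code; RenderThreadSketch and ThreadCount are hypothetical names.

```csharp
using System;

// Hypothetical illustration of the rule described in the text; this is
// not taken from the Paint.NET sources.
public static class RenderThreadSketch
{
    // One thread per logical processor with a minimum of two, unless the
    // effect is marked as single-threaded, in which case one thread is used.
    public static int ThreadCount(int logicalProcessors, bool singleThreaded)
    {
        if (singleThreaded) return 1;
        return Math.Max(2, logicalProcessors);
    }
}
```

    On a real machine, the number of logical processors can be read from Environment.ProcessorCount and passed as the first argument.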



    Creating effect plug-ins for Paint.NET is not too difficult after all. The parts of the object model you need in order to do this are not very complex (trying this is an excellent idea for the weekend) and it even seems quite robust. Of course, this article does not cover everything there is to know about Paint.NET effect plug-ins but it should be enough to create your first own plug-in.


    I'd like to thank Rick Brewster and Craig Taylor for their feedback and for proof-reading this article.

    Change history

    • 2005-05-08: Added a note that shortcut keys are only applied to adjustments and a chapter about attributes.
    • 2005-01-06: Corrected a major bug in NoiseEffectConfigDialog.InitDialogFromToken(EffectConfigToken). The old implementation used the property EffectConfigDialog.EffectToken instead of the parameter effectToken.
    • 2005-01-03: Initial release.


    This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


    About the Author

    Dennis C. Dietrich

    Web Developer

    Ireland






    GPGPU on Accelerating Wave PDE

    Sharing source code I found by searching.






    GPGPU on Accelerating Wave PDE

    By | 20 Apr 2012 | Article
    A Wave PDE simulation using GPGPU capabilities


    This article aims at exploiting GPGPU (GP2U) computation capabilities to improve physical and scientific simulations. In order for the reader to understand all the passages, we will gradually proceed in the explanation of a simple physical simulation based on the well-known Wave equation. A classic CPU implementation will be developed and eventually another version using GPU workloads is going to be presented.

    Wave PDE

    PDE stands for “partial differential equation” and indicates an equation whose terms contain one or more partial derivatives of an unknown function with respect to its independent variables. The order of a PDE is defined as the order of the highest partial derivative in it. The following is a second-order PDE:

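    Usually a PDE of order n having m variables $x_i$ for $i = 1, 2, \dots, m$ is expressed as $F\left(x_1, \dots, x_m,\; u,\; \frac{\partial u}{\partial x_1}, \dots, \frac{\partial^n u}{\partial x_m^n}\right) = 0$.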

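    A compact form to express $\frac{\partial u}{\partial x}$ is $u_x$, and the same applies to $\frac{\partial^2 u}{\partial x^2}$ ($u_{xx}$) and $\frac{\partial^2 u}{\partial x\,\partial y}$ ($u_{xy}$).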

    The wave equation is a second order partial differential equation used to describe water waves, light waves, sound waves, etc. It is a fundamental equation in fields such as electro-magnetics, fluid dynamics and acoustics. Its applications involve modeling the vibration of a string or the air flow dynamics as a result from an aircraft movement.

    We can derive the one-dimensional wave equation (i.e. the vibration of a string) by considering a flexible elastic string that is tightly bound between two fixed end-points lying of the x axis

    A guitar string has a restoring force that is proportional to how much it’s stretched. Suppose that, neglecting gravity, we apply a displacement in the y direction to the string (i.e. we’re slightly pulling it). Let’s consider only a short segment of it between $x$ and $x+\Delta x$:

    Let’s write down Newton’s second law, $\vec{F} = m\vec{a}$, for the small red segment in the diagram above. We assume that the string has a linear density $\mu$ (linear density is a measure of mass per unit of length and is often used with one-dimensional objects). Recalling that if a one-dimensional object is made of a homogeneous substance of length $L$ and total mass $m$ the linear density is

    $$\mu = \frac{m}{L}$$

    we have $m = \mu\,\Delta x$ for our segment.

    The forces (as already stated we neglect gravity as well as air resistance and other tiny forces) are the tensions T at the ends. Since the string is actually waving, we can’t assume that the two T vectors cancel each other out: their net result is the force we are looking for. Now let’s make some assumptions to simplify things: first we consider the net force we are searching for as vertical (actually it’s not exactly vertical but very close)

    Furthermore we consider the wave amplitude small. That is, the vertical component of the tension at the $x+\Delta x$ end of the small segment of string is

    $$T\sin\theta$$

    where $\theta$ is the angle of the string’s slope at that end. The slope is $\tan\theta = \frac{dy}{dx}$ if we consider dx and dy as the horizontal and vertical components of the above image. Since we have considered the wave amplitude to be very tiny, we can therefore assume $\sin\theta \approx \tan\theta \approx \theta$. This greatly helps us: $T\sin\theta \approx T\frac{dy}{dx}$, and the total vertical force from the tensions at the two ends becomes

    $$T\left(\frac{dy}{dx}\bigg|_{x+\Delta x} - \frac{dy}{dx}\bigg|_{x}\right) \approx T\,\frac{d^2 y}{dx^2}\,\Delta x$$

    The equality becomes exact in the limit $\Delta x \to 0$.

    We see that y is a function of x, but it is a function of t as well: y=y(x,t). The standard convention for denoting differentiation with respect to one variable while the other is held constant (we’re looking at the sum of forces at one instant of time) lets us write the total vertical force as

    $$T\,\frac{\partial^2 y}{\partial x^2}\,\Delta x$$

    The final part is to use Newton’s second law and put everything together: the sum of all forces, the mass (substituted with the linear density $\mu$ multiplied by the segment length $\Delta x$) and the acceleration (i.e. just $\frac{\partial^2 y}{\partial t^2}$, because we’re only moving in the vertical direction, remember the small amplitude approximation).

    $$T\,\frac{\partial^2 y}{\partial x^2}\,\Delta x = \mu\,\Delta x\,\frac{\partial^2 y}{\partial t^2}$$

    And we’re finally there: the wave equation. Using the spatial Laplacian operator $\nabla^2$, indicating the y function (depending on x and t) as u and substituting $c^2 = T/\mu$ (a fixed constant) we have the common compact form

    $$\frac{\partial^2 u}{\partial t^2} = c^2\,\nabla^2 u$$

    The two-dimensional wave equation is obtained by adding a second spatial term to the equation as follows

    $$\frac{\partial^2 u}{\partial t^2} = c^2\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)$$

    The constant c has dimensions of distance per unit time and thus represents a velocity. We won’t prove here that c is actually the velocity at which waves propagate along a string or through a surface (although it’s surely worth noting). This makes sense since the wave speed increases with tension experienced by the medium and decreases with the density of the medium.
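
    The dependence of c on tension and density can be checked numerically. The snippet below is a minimal sketch that assumes the string relation $c = \sqrt{T/\mu}$, with T the tension and $\mu$ the linear density; the function name waveSpeed and the sample values are illustrative and not part of the article’s project.

```cpp
#include <cmath>

// Propagation speed on a string, assuming c = sqrt(T / mu).
// tension: string tension T (newtons); linearDensity: mu (kilograms per meter).
double waveSpeed(double tension, double linearDensity)
{
    return std::sqrt(tension / linearDensity);
}
```

    With these assumptions waveSpeed(100.0, 0.01) is about 100 m/s; quadrupling the tension doubles the speed, while quadrupling the density halves it.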

    In a physical simulation we need to take into account forces other than just the surface tension: the average amplitude of the waves on the surface diminishes in real-world fluids. We may therefore add a viscous damping force to the equation by introducing a force that acts in the opposite direction of the velocity of a point on the surface:

    $$\frac{\partial^2 u}{\partial t^2} = c^2\,\nabla^2 u - k\,\frac{\partial u}{\partial t}$$

    where the nonnegative constant $k$ represents the viscosity of the fluid (it controls how long it takes for the wave on the surface to calm down: a small $k$ allows waves to exist for a long time, as with water, while a large $k$ causes them to diminish rapidly, as for thick oil).

    Solving the Wave PDE with finite difference method

    To actually implement a physical simulation which uses the wave PDE we need to find a method to solve it. Let’s solve the last equation we wrote, the one with the damping force:

    $$\frac{\partial^2 u}{\partial t^2} = c^2\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) - k\,\frac{\partial u}{\partial t}$$

    Here t is the time, x and y are the spatial coordinates of the 2D wave, $c^2$ is the squared wave speed and $k$ is the damping factor. u=u(t,x,y) is the function we need to calculate: generally speaking it could represent various effects like a change of height of a pool’s water, the electric potential in an electromagnetic wave, etc.

    A common method to solve this problem is the finite difference method. The idea behind it is to replace derivatives with finite differences, which can be easily calculated in a discrete algorithm. If there is a function f=f(x) and we want to calculate its derivative with respect to the x variable we can write

    $$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

    If h is a small discrete value (which is not zero), then we can approximate the derivative with

    $$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$

    The error of such an approach could be derived from Taylor’s theorem, but that isn’t the purpose of this paper.

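    A finite difference approach uses one of the following three approximations to express the derivative

    • Forward difference: $f'(x) \approx \frac{f(x+h) - f(x)}{h}$

    • Backward difference: $f'(x) \approx \frac{f(x) - f(x-h)}{h}$

    • Central difference: $f'(x) \approx \frac{f(x+h/2) - f(x-h/2)}{h}$

    Let’s stick with the latter (i.e. central difference); this kind of approach can be generalized, so if we want to calculate the discrete version of f''(x) we could first write

    $$f''(x) \approx \frac{f'(x+h/2) - f'(x-h/2)}{h}$$

    and then calculate f''(x) as follows

    $$f''(x) \approx \frac{\frac{f(x+h) - f(x)}{h} - \frac{f(x) - f(x-h)}{h}}{h} = \frac{f(x+h) - 2f(x) + f(x-h)}{h^2}$$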
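
    The difference quotients above translate directly into code. The following is a minimal sketch, not taken from the article’s project: the function names are made up for illustration, and the central difference is assumed to use half steps (x+h/2 and x-h/2), which is what makes applying it twice yield the three-point second-derivative formula.

```cpp
#include <cmath>

// Forward difference: (f(x+h) - f(x)) / h
template <typename F>
double forwardDiff(F f, double x, double h)
{
    return (f(x + h) - f(x)) / h;
}

// Backward difference: (f(x) - f(x-h)) / h
template <typename F>
double backwardDiff(F f, double x, double h)
{
    return (f(x) - f(x - h)) / h;
}

// Central difference over half steps: (f(x+h/2) - f(x-h/2)) / h
template <typename F>
double centralDiff(F f, double x, double h)
{
    return (f(x + 0.5 * h) - f(x - 0.5 * h)) / h;
}

// Second derivative, i.e. the central difference applied twice:
// (f(x+h) - 2 f(x) + f(x-h)) / h^2
template <typename F>
double secondDiff(F f, double x, double h)
{
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h);
}
```

    Called with f = sin, x = 1 and h = 0.001, the forward and backward versions approximate cos(1) with an error proportional to h, the central version with an error proportional to h squared, and secondDiff approximates -sin(1).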

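    The same idea is used to estimate the partial derivatives below. Indices denote discrete steps in time and space; the first time derivative is taken as a backward difference, which is the choice that produces the q and r coefficients used later on:

    $$\frac{\partial^2 u}{\partial t^2} \approx \frac{u_{t+1,x,y} - 2u_{t,x,y} + u_{t-1,x,y}}{\Delta t^2} \qquad \frac{\partial u}{\partial t} \approx \frac{u_{t,x,y} - u_{t-1,x,y}}{\Delta t}$$

    $$\frac{\partial^2 u}{\partial x^2} \approx \frac{u_{t,x+1,y} - 2u_{t,x,y} + u_{t,x-1,y}}{\Delta x^2} \qquad \frac{\partial^2 u}{\partial y^2} \approx \frac{u_{t,x,y+1} - 2u_{t,x,y} + u_{t,x,y-1}}{\Delta y^2}$$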

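    Let’s get back to the wave equation. We had the following

    $$\frac{\partial^2 u}{\partial t^2} = c^2\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) - k\,\frac{\partial u}{\partial t}$$

    Let’s apply the finite difference formulas to it (remember that u=u(t,x,y), that is, u depends on t, x and y)

    $$\frac{u_{t+1,x,y} - 2u_{t,x,y} + u_{t-1,x,y}}{\Delta t^2} = c^2\left(\frac{u_{t,x+1,y} - 2u_{t,x,y} + u_{t,x-1,y}}{\Delta x^2} + \frac{u_{t,x,y+1} - 2u_{t,x,y} + u_{t,x,y-1}}{\Delta y^2}\right) - k\,\frac{u_{t,x,y} - u_{t-1,x,y}}{\Delta t}$$

    This is quite a long expression. We just substituted the second derivatives (with central differences) and the first time derivative (with a backward difference).

    If we now consider $\Delta x = \Delta y = h$, we are basically forcing the intervals where the derivatives are calculated to be the same for both the x and the y directions (and we can greatly simplify the expression):

    $$u_{t+1,x,y} = (2 - k\Delta t)\,u_{t,x,y} + (k\Delta t - 1)\,u_{t-1,x,y} + \frac{c^2\Delta t^2}{h^2}\left(u_{t,x+1,y} + u_{t,x-1,y} + u_{t,x,y+1} + u_{t,x,y-1} - 4u_{t,x,y}\right)$$

    To improve the readability, let us substitute some of the terms with letters

    $$q = 2 - k\Delta t \qquad r = k\Delta t - 1 \qquad b = \frac{c^2\Delta t^2}{h^2}$$

    and we have

    $$u_{t+1,x,y} = q\,u_{t,x,y} + r\,u_{t-1,x,y} + b\left(u_{t,x+1,y} + u_{t,x-1,y} + u_{t,x,y+1} + u_{t,x,y-1} - 4u_{t,x,y}\right)$$

    If you look at the t variables (leaving the constant coefficients aside) we have something like

    $$u_{t+1} \leftarrow u_t + u_{t-1} + \text{something\_at\_t}(x,y)$$

    This tells us that the new wave height (at t+1) is a combination of the current wave height (t), the previous wave height (t-1) and something in the present (at t) that depends only on the values around the wave point we are considering.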

    This can be visualized as a sequence of time frames, one after another, through which the point on the surface we are considering evolves

    The object

    has a central dot which represents a point (x,y) on the surface at time t. Since the term we previously called something_at_t(x,y) is actually

    $$u_{t,x+1,y} + u_{t,x-1,y} + u_{t,x,y+1} + u_{t,x,y-1} - 4u_{t,x,y}$$

    the value of the central point is influenced by five terms, the last of which is its own value multiplied by -4 (that is, $-4u_{t,x,y}$).

    Creating the algorithm

    As we stated before, the wave PDE can be effectively solved with finite difference methods. However we still need to write some resolution code before a real physical simulation can be set up. In the last paragraph we eventually ended up obtaining

    $$u_{t+1,x,y} = q\,u_{t,x,y} + r\,u_{t-1,x,y} + b\left(u_{t,x+1,y} + u_{t,x-1,y} + u_{t,x,y+1} + u_{t,x,y-1} - 4u_{t,x,y}\right)$$

    we then simplified this expression with

    $$u_{t+1} \leftarrow u_t + u_{t-1} + \text{something\_at\_t}(x,y)$$

    This is indeed a recursive form which may be modeled with the following pseudo-code

    for each t in time
    {
        ut+1 <- ut + ut-1 + something_at_t(x,y)

        // Do something with the new ut+1, e.g. you can draw the wave here

        ut-1 <- ut
        ut <- ut+1
    }

    The above pseudo-code is supposed to run in retained-mode, so the application is able to draw the wave at each point of the surface we’re considering by simply calling draw_wave(ut+1).

    Let’s assume that we’re modeling how a water wave evolves, so let the u function represent the wave height (0 being the horizontal rest level): we can now write down the beginning of our pseudo-code

    // Surface data
    height = 500;
    width = 500;

    // Wave parameters
    c = 1; // Speed of wave
    dx = 1; // Space step
    dt = 0.05; // Time step
    k = 0.002; // Decay factor
    kdt = k*dt; // Decay factor per timestep, recall that q = 2 - kdt, r = -1 + kdt
    c1 = dt^2*c^2/dx^2; // b factor

    // Initial conditions
    u_current_t = zeros(height,width); // Create a height x width zero matrix
    u_previous_t = u_current_t;

    We basically defined a surface where the wave evolution will be drawn (a 500x500 area) and initialized the wave parameters we saw a few paragraphs ago (make sure to recall the q, r and b substitutions we did before). The initial condition is a zero matrix (u_current_t), so the entire surface is quiet and there’s no wave evolving.

    Given that we are considering a matrix surface (every point located at (x;y) coordinates is described by a u value indicating the height of the wave there) we need to write code to implement the

    ut+1 <- ut + ut-1 + something_at_t(x,y)

    line in the for cycle. Actually the above expression is a simplified form for

    $$u_{t+1,x,y} = q\,u_{t,x,y} + r\,u_{t-1,x,y} + b\left(u_{t,x+1,y} + u_{t,x-1,y} + u_{t,x,y+1} + u_{t,x,y-1} - 4u_{t,x,y}\right)$$

    and we need to implement this last one. We may write down something like the following

    for(t=0; t<A_LOT_OF_TIME; t += dt)
    {
        u_next_t = (2-kdt)*u_current_t+(kdt-1)*u_previous_t+c1*something_at_t(x,y)

        u_previous_t = u_current_t; // Current becomes old
        u_current_t = u_next_t; // New becomes current

        // Draw the wave
    }



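    that is, a for cycle with index variable t increasing by dt at every step. Everything should be familiar by now because the (2-kdt), (kdt-1) and c1 terms are the usual q, r and b substitutions. The last thing we need to implement is the something_at_t(x,y) term, also known as:

    $$u_{t,x+1,y} + u_{t,x-1,y} + u_{t,x,y+1} + u_{t,x,y-1} - 4u_{t,x,y}$$

    The wave PDE we started from was

    $$\frac{\partial^2 u}{\partial t^2} = c^2\,\nabla^2 u - k\,\frac{\partial u}{\partial t}$$

    and the term we are interested in now is this:

    $$\nabla^2 u$$

    that, in our case, is

    $$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}$$

    Since we have a matrix of points representing our surface, we are entirely in the discrete domain. Computing a second derivative on a matrix of discrete points is the same problem as having an image with pixel intensity values I(x,y) and needing to calculate its Laplacian

    $$\nabla^2 I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}$$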

    This is a common image processing task and it’s usually solved by applying a convolution filter to the image we are interested in (in our case: the matrix representing the surface). A common small kernel used to approximate the second derivatives in the definition of the Laplacian is the following

    $$D = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$

    So in order to implement the

    $$u_{t,x+1,y} + u_{t,x-1,y} + u_{t,x,y+1} + u_{t,x,y-1} - 4u_{t,x,y}$$

    term, we need to apply the D Laplacian kernel as a filter to our image (i.e. the u_current_t):

    u_next_t=(2-kdt)*u_current_t+(kdt-1)*u_previous_t+c1* convolution(u_current_t,D);

    In fact in the element we saw earlier

    the red dot elements are weighted and calculated with a 2D convolution of the D kernel.

    An important element to keep in mind while performing the D kernel convolution with our u_current_t matrix is that every value in the outside halo (every value involved in the convolution but outside the u_current_t matrix boundaries) should be zero, as in the image below

    In the picture above the red border is the u_current_t matrix perimeter while the blue 3x3 matrix is the D Laplacian kernel; everything outside the red border is zero. This is important because we want our surface to act like water held in a container, with the waves in the water “bouncing back” when they hit the container’s border. By zeroing all the values outside the surface matrix we don’t receive wave contributions from outside our perimeter, nor do we influence in any way what’s beyond it. In addition the “energy” of the wave doesn’t spread out and is partially “bounced back” by the equation.
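
    The zero halo can be handled with a small accessor instead of physically padding the matrix. The sketch below assumes a row-major std::vector<float> surface and the five-point kernel (1 for the four neighbors, -4 for the center); the names haloValue and laplacianAt are hypothetical and do not appear in the article’s project.

```cpp
#include <cstddef>
#include <vector>

// Reads u(i,j) from a row-major rows x cols surface.
// Anything outside the surface belongs to the halo and reads as zero.
float haloValue(const std::vector<float>& u, int rows, int cols, int i, int j)
{
    if (i < 0 || j < 0 || i >= rows || j >= cols)
        return 0.0f;
    return u[static_cast<std::size_t>(i) * cols + j];
}

// Applies the five-point Laplacian stencil at (i,j) using the zero halo.
float laplacianAt(const std::vector<float>& u, int rows, int cols, int i, int j)
{
    return haloValue(u, rows, cols, i - 1, j)
         + haloValue(u, rows, cols, i + 1, j)
         + haloValue(u, rows, cols, i, j - 1)
         + haloValue(u, rows, cols, i, j + 1)
         - 4.0f * haloValue(u, rows, cols, i, j);
}
```

    On a surface filled with ones the stencil gives 0 in the interior but -1 on an edge and -2 in a corner: the missing neighbors read as zero, which is how the border comes to influence the wave.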

    Now the algorithm is almost complete: our PDE assures us that every wave crest in our surface will be properly “transmitted” as a real wave. The only problem is: starting with a zero-everywhere matrix and letting it evolve would produce just nothing. Forever.

    We will now add a small droplet to our surface to perturb it and set the wheels in motion.

    To simulate as realistically as possible the effect of a droplet falling onto our surface we will introduce a “packet” in our surface matrix. That means we are going to add a matrix that represents a discrete Gaussian function (similar to a Gaussian kernel) to our surface matrix. Notice that we’re going to “add”, not to “convolve”.

    Given the 2D Gaussian formula

    $$f(x,y) = A\,\exp\left(-\left(\frac{(x-x_0)^2}{2\sigma_x^2} + \frac{(y-y_0)^2}{2\sigma_y^2}\right)\right)$$

    we have that A is the Gaussian amplitude (so it will be our droplet amplitude), $x_0$ and $y_0$ are the center’s coordinates and $\sigma_x$ and $\sigma_y$ are the spreads of the blob. By using a negative amplitude, as in the following image, we can simulate a droplet that has just fallen onto our surface.

    To generate the droplet matrix we can use the following pseudo-code

    // Generate a droplet matrix
    dsz = 3; // Droplet size
    da = 0.07; // Droplet amplitude
    [X,Y] = generate_XY_planes(dsz,da);
    DropletMatrix = -da * exp( -(X/dsz)^2 -(Y/dsz)^2);

    da is the droplet amplitude (it enters the formula with a negative sign, as we just said), while dsz is the droplet size, that is, the Gaussian “bell” radius. X and Y are two matrices representing the X and Y plane discrete values

    so the X and Y matrices for the image above are

    And the final DropletMatrix is similar to the following

    where the central value is -0.0700. If you drew the above matrix you would obtain a 3D Gaussian function which can now model our droplet.
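
    The droplet matrix can be generated with a couple of loops. The sketch below is only an assumption about what generate_XY_planes produces: it takes the X and Y planes to range from -2·dsz to +2·dsz (so the matrix is (4·dsz+1) wide), a size suggested by the 2·dsz offsets used when the droplet is later placed on the surface. The function name makeDroplet and the row-major std::vector storage are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Builds a square (4*dsz+1) x (4*dsz+1) droplet matrix, stored row-major,
// holding -da * exp(-(x/dsz)^2 - (y/dsz)^2) with (x,y) measured from the center.
std::vector<float> makeDroplet(int dsz, float da)
{
    const int dim = 4 * dsz + 1;
    std::vector<float> z(static_cast<std::size_t>(dim) * dim);
    for (int i = 0; i < dim; ++i)
    {
        for (int j = 0; j < dim; ++j)
        {
            const float x = static_cast<float>(j - 2 * dsz) / dsz; // X plane value divided by dsz
            const float y = static_cast<float>(i - 2 * dsz) / dsz; // Y plane value divided by dsz
            z[static_cast<std::size_t>(i) * dim + j] = -da * std::exp(-x * x - y * y);
        }
    }
    return z;
}
```

    With dsz = 3 and da = 0.07 this yields a 13x13 matrix whose central value is -0.07 and whose values fade towards zero at the corners.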

    The final pseudo-code for our wave algorithm is the following

    // Surface data
    height = 500;
    width = 500;

    // Wave parameters
    c = 1; // Speed of wave
    dx = 1; // Space step
    dt = 0.05; // Time step
    k = 0.002; // Decay factor
    kdt = k*dt; // Decay factor per timestep, recall that q = 2 - kdt, r = -1 + kdt
    c1 = dt^2*c^2/dx^2; // b factor

    // Initial conditions
    u_current_t = zeros(height,width); // Create a height x width zero matrix
    u_previous_t = u_current_t;

    // Generate a droplet matrix
    dsz = 3; // Droplet size
    da = 0.07; // Droplet amplitude
    [X,Y] = generate_XY_planes(dsz,da);
    DropletMatrix = -da * exp( -(X/dsz)^2 -(Y/dsz)^2);

    // This variable is used to add just one droplet
    One_single_droplet_added = 0;

    for(t=0; t<A_LOT_OF_TIME; t = t + dt)
    {
        u_next_t = (2-kdt)*u_current_t+(kdt-1)*u_previous_t+c1*convolution(u_current_t,D);

        u_previous_t = u_current_t; // Current becomes old
        u_current_t = u_next_t; // New becomes current

        // Draw the wave

        if(One_single_droplet_added == 0)
        {
            One_single_droplet_added = 1; // no more droplets
            addMatrix(u_current_t, DropletMatrix, [+100;+100]);
        }
    }

    The variable One_single_droplet_added is used to check whether the droplet has already been inserted (we want just one droplet). The addMatrix function adds the DropletMatrix values to the u_current_t surface matrix values, centering the DropletMatrix at the point (100;100). Remember that the DropletMatrix is smaller than (or equal to) the surface matrix, so we just add the DropletMatrix’s values to the u_current_t values that fall within the dimensions of the DropletMatrix.
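
    A possible shape for such an addMatrix helper is sketched below. It is an illustration rather than the article’s code: the Grid struct, the function name addMatrixCentered and the decision to silently skip the parts of the droplet that fall outside the surface are all assumptions.

```cpp
#include <cstddef>
#include <vector>

// Minimal row-major matrix used only for this sketch.
struct Grid
{
    int rows;
    int cols;
    std::vector<float> v;
};

// Adds patch to surface so that the patch center lands on (centerRow, centerCol).
// Patch cells that would fall outside the surface are skipped.
void addMatrixCentered(Grid& surface, const Grid& patch, int centerRow, int centerCol)
{
    const int top = centerRow - patch.rows / 2;
    const int left = centerCol - patch.cols / 2;
    for (int i = 0; i < patch.rows; ++i)
    {
        for (int j = 0; j < patch.cols; ++j)
        {
            const int r = top + i;
            const int c = left + j;
            if (r < 0 || c < 0 || r >= surface.rows || c >= surface.cols)
                continue; // outside the surface: nothing to add to
            surface.v[static_cast<std::size_t>(r) * surface.cols + c] +=
                patch.v[static_cast<std::size_t>(i) * patch.cols + j];
        }
    }
}
```

    Values are accumulated with += rather than assigned, so a droplet dropped on an already perturbed area adds to the existing wave instead of erasing it.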

    Now the algorithm is complete, although it is still a theoretical simulation. We will soon implement it with real code.


    Implementing the Wave simulation

    We will now discuss how the above algorithm has been implemented in a real C++ project to create a fully functional OpenGL physical simulation.

    The sequence diagram below shows the skeleton of the program, which basically consists of three main parts: the main module, where the startup function resides as well as the kernel function which creates the matrix image for the entire window; an OpenGL renderer wrapper that encapsulates GLUT library functions and callback handlers; and a hand-written matrix class that simplifies matrix data access and manipulation. A sequence diagram would normally be used within a standard software engineering methodology and follow its predetermined rules; here we simply use it as an abstraction to model the program’s control flow

    The program starts at main() and creates an openGLRenderer object which will handle all the graphic communication with the GLUT library and the callback events (like mouse movements, keyboard press events, etc.). OpenGL doesn’t provide routines for interfacing with a windowing system, so in this project we will rely on the GLUT library, which provides a platform-independent interface to manage windows and input events. To create an animation that runs as fast as possible we will set an idle callback with the glutIdleFunc() function. We will explain more about this later.

    Initially the algorithm sets its initialization variables (time step, space step, droplet amplitude, Laplacian 2D kernel, etc.: practically everything we saw in the theory section) and every matrix corresponding to the image to be rendered is zeroed out. The Gaussian matrix corresponding to the droplets is also preprocessed. A structure defined in openGLRenderer’s header file contains all the data which should be retained across image renderings

    typedef struct kernelData
    {
        float a1; // 2-kdt
        float a2; // kdt-1
        float c1; // b factor: dt^2*c^2/dx^2
        sMatrix* u; // current surface
        sMatrix* u0; // previous surface
        sMatrix* D; // Laplacian kernel
        int Ddim; // Droplet matrix width/height
        int dsz; // Gaussian radius
        sMatrix* Zd; // Droplet matrix
    } kernelData;

    The structure is updated each time a time step is performed, since it contains both the current and the previous matrices that describe the wave’s evolution across time. Since this structure is used both by the OpenGL renderer and by the main module that initializes it, the variable is declared as external and defined afterwards in the openGLRenderer cpp file (so its scope goes beyond the single translation unit). After everything has been set up, the openGLRenderer class’ startRendering() method is called and the OpenGL main loop starts fetching events. The core of the algorithm we saw is in the main module’s kernel() function, which is called every time an OpenGL idle event is dispatched. The screen is updated, and the changes shown, only when the idle callback has completed, so the amount of rendering work performed here should be minimized to avoid performance loss.

    The kernel function’s code is the following

    // This kernel is called at each iteration
    // It implements the main loop algorithm and somehow "rasterizes" the matrix data
    // to pass to the openGL renderer. It also adds droplets in the waiting queue
    void kernel(unsigned char *ptr, kernelData& data)
    {
        // Wave evolution
        sMatrix un(DIM,DIM);
        // The iterative discrete update (see documentation)
        un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolutionCPU((*data.u),(*data.D))*data.c1;
        // Step forward in time
        (*data.u0) = (*data.u);
        (*data.u) = un;
        // Prepare matrix data for rendering
        matrix2Bitmap( (*data.u), ptr );
        if(first_droplet == 1) // By default there's just one initial droplet
        {
            first_droplet = 0;
            int x0d= DIM / 2; // Default droplet center
            int y0d= DIM / 2;
            // Place the (x0d;y0d) centered Zd droplet on the wave data (it will be added at the next iteration)
            for(int Zdi=0; Zdi < data.Ddim; Zdi++)
            {
                for(int Zdj=0; Zdj < data.Ddim; Zdj++)
                {
                    (*data.u)(y0d-2*data.dsz+Zdi,x0d-2*data.dsz+Zdj) += (*data.Zd)(Zdi, Zdj);
                }
            }
        }
        // Render the droplet queue
    }

    The pattern we followed in the wave PDE evolution can be easily recognized in the computationally intensive code line

    un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolutionCPU((*data.u),(*data.D))*data.c1;

    which basically performs the well-known iterative step

    $$u_{t+1,x,y} = q\,u_{t,x,y} + r\,u_{t-1,x,y} + b\left(u_{t,x+1,y} + u_{t,x-1,y} + u_{t,x,y+1} + u_{t,x,y-1} - 4u_{t,x,y}\right)$$

    All constants are precomputed to improve performance.

    Notice that the line adds up large matrices, which are referred to in the code as sMatrix objects. The sMatrix class is a simple hand-written class that exposes a few operator overloads to ease working with matrices. Two things should be borne in mind: large matrix operations should avoid passing arguments by value, and creating a new matrix and copying it to the destination requires a copy constructor (to avoid obtaining a shallow copy without the actual data). Apart from that the code is pretty straightforward, so no more words will be spent on it

    #include <cstring> // memcpy

    // This class handles matrix objects
    class sMatrix
    {
    public:
       int rows, columns;
       float *values;

       sMatrix(int height, int width)
       {
           if (height == 0 || width == 0)
               throw "Matrix constructor has 0 size";
           rows = height;
           columns = width;
           values = new float[rows*columns];
       }

       // Copy constructor, this is needed to perform a deep copy (not a shallow one)
       sMatrix(const sMatrix& mt)
       {
           this->rows = mt.rows;
           this->columns = mt.columns;
           this->values = new float[rows*columns];
           // Copy the values
           memcpy(this->values, mt.values, this->rows*this->columns*sizeof(float));
       }

       // Destructor, releases the values buffer
       ~sMatrix()
       {
           delete [] values;
       }

       // Allows matrix1 = matrix2
       sMatrix& operator= (sMatrix const& m)
       {
           if(m.rows != this->rows || m.columns != this->columns)
               throw "Size mismatch";
           // Copy the values into the already allocated buffer
           memcpy(this->values, m.values, this->rows*this->columns*sizeof(float));
           return *this; // Since "this" continues to exist after the function call, it is perfectly legit to return a reference
       }

       // Allows both matrix(3,3) = value and value = matrix(3,3)
       float& operator() (const int row, const int column)
       {
           // May be suppressed to slightly increase performances
           if (row < 0 || column < 0 || row >= this->rows || column >= this->columns)
               throw "Size mismatch";
           return values[row*columns+column]; // Since the float value continues to exist after the function call, it is perfectly legit to return a reference
       }

       // Allows matrix*scalar (e.g. matrix*3) for each element
       sMatrix operator* (const float scalar)
       {
           sMatrix result(this->rows, this->columns);
           // Multiply each value by the scalar
           for(int i=0; i<rows*columns; i++)
               result.values[i] = this->values[i]*scalar;
           return result; // Copy constructor
       }

       // Allows matrix+matrix (if same size)
       sMatrix operator+ (const sMatrix& mt)
       {
           if (this->rows != mt.rows || this->columns != mt.columns)
               throw "Size mismatch";
           sMatrix result(this->rows, this->columns);
           // Sum each couple of values
           for(int i=0; i<rows; i++)
               for(int j=0; j<columns; j++)
                   result.values[i*columns+j] = this->values[i*columns+j] + mt.values[i*columns+j];
           return result; // Copy constructor
       }
    };

    The convolution is performed with the following code (a classic approach):

    // Returns the convolution between the matrix A and the kernel matrix B,
    // A's size is preserved
    sMatrix convolutionCPU(sMatrix& A, sMatrix& B)
    {
      sMatrix result(A.rows, A.columns);
      int kernelHradius = (B.rows - 1) / 2;
      int kernelWradius = (B.columns - 1) / 2;
      float convSum, convProd, Avalue;
      for(int i=0; i<A.rows; i++)
      {
        for(int j=0; j<A.columns; j++)
        {
            // --------j--------->
            // _ _ _ _ _ _ _
            // |            |    |
            // |            |    |
            // |       A    |    i
            // |            |    |
            // |            |    |
            // _ _ _ _ _ _ _|    v
            convSum = 0;
            for(int bi=0; bi<B.rows; bi++)
            {
                for(int bj=0; bj<B.columns; bj++)
                {
                    // A's value with respect to the kernel center
                    int relpointI = i-kernelHradius+bi;
                    int relpointJ = j-kernelWradius+bj;
                    if(relpointI < 0 || relpointJ < 0 || relpointI >= A.rows || relpointJ >= A.columns)
                        Avalue = 0; // Outside halo: zero
                    else
                        Avalue = A(relpointI,relpointJ);
                    convProd = Avalue*B(bi,bj);
                    convSum += convProd;
                }
            }
            // Store the convolution result
            result(i,j) = convSum;
        }
      }
      return result;
    }

    After calculating the system’s evolution, the passing of time is simulated by making the new matrix the current one, making the current one the previous one and discarding the old previous state, as we described before.

    Then a matrix2Bitmap() call performs a matrix-to-bitmap conversion, as its name suggests. More precisely, the entire simulation area is described by a large sMatrix object which contains float values. To actually render these values as pixels we need to convert each value to its corresponding RGBA value and pass it to the openGLRenderer class (which in turn will pass the entire bitmap buffer to the GLUT library). In brief: we need to perform a float-to-RGB color mapping.
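
    The conversion boils down to one loop over the matrix values. The article does not list matrix2Bitmap(), so the following is only a sketch of what such a function might look like: the name matrixToBitmap, the 4-bytes-per-pixel RGBA layout with an opaque alpha channel and the grayscale stand-in grayFromValue (used in place of the article’s color mapping to keep the example self-contained) are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct RGB
{
    unsigned char r, g, b;
};

// Stand-in color mapping: clamps value to [-1;+1] and maps it to a gray level 0..255.
RGB grayFromValue(float value)
{
    const float clamped = std::min(1.0f, std::max(-1.0f, value));
    const unsigned char level = static_cast<unsigned char>((clamped + 1.0f) * 0.5f * 255.0f);
    return RGB{level, level, level};
}

// Writes one RGBA pixel (4 bytes) per matrix value into ptr.
// ptr must point to at least 4 * u.size() bytes.
void matrixToBitmap(const std::vector<float>& u, unsigned char* ptr)
{
    for (std::size_t i = 0; i < u.size(); ++i)
    {
        const RGB color = grayFromValue(u[i]);
        ptr[4 * i + 0] = color.r;
        ptr[4 * i + 1] = color.g;
        ptr[4 * i + 2] = color.b;
        ptr[4 * i + 3] = 255; // fully opaque
    }
}
```

    Since this loop touches every value of the surface at every frame, the cost of the per-value color mapping matters as much as the cost of the loop itself.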

    Since in the physical simulation we assumed that the resting water height is at 0 and every perturbation raises or lowers this value (in particular the droplet Gaussian matrix lowers it by at most 0.07), we are looking for a mapping from [-1;1] to colors. An HSV color model would better simulate a gradual color transition, as we actually experience it with our own eyes, but it would require converting back to RGB values to build the bitmap to pass to the GLUT wrapper. For performance reasons we chose to assign each value a color directly (a colormap). A first solution would have been implementing a full [-1;1] -> [0x0;0xFFFFFF] mapping in order to cover all the possible colors in the RGB format

    // Returns a RGB color from a value in the interval between [-1;+1]
    RGB getColorFromValue(float value)
    {
        RGB result;
        if(value <= -1.0f)
        {
                result.r = 0x00;
                result.g = 0x00;
                result.b = 0x00;
        }
        else if(value >= 1.0f)
        {
                result.r = 0xFF;
                result.g = 0xFF;
                result.b = 0xFF;
        }
        else
        {
                float step = 2.0f/0xFFFFFF;
                unsigned int cvalue = (unsigned int)((value + 1.0f)/step);
                // cvalue is unsigned, so only the upper bound needs clamping
                if(cvalue > 0xFFFFFF)
                        cvalue = 0xFFFFFF;
                result.r = cvalue & 0xFF;
                result.g = (cvalue & 0xFF00) >> 8;
                result.b = (cvalue & 0xFF0000) >> 16;
        }
        return result;
    }

    However the above method is both performance-intensive and poor at rendering a smooth color transition: let’s take a look at a droplet mapped like that

    It looks more like a fractal than a droplet, so the above solution won’t work. A better way to improve performance (and the look of the image) is to hard-code a colormap in an array and to use it when needed:

    float pp_step = 2.0f / (float)COLOR_NUM;
    // The custom colormap, you can customize it to whatever you like
    unsigned char m_colorMap[COLOR_NUM][3] =
    {
        // ... COLOR_NUM entries of {r, g, b} ...
    };
    // Returns a RGB color from a value in the interval between [-1;+1] using the above colormap
    RGB getColorFromValue(float value)
    {
      RGB result;
      // Clamp first: casting an out-of-range float to unsigned int is undefined
      if(value < -1.0f)
        value = -1.0f;
      else if(value > 1.0f)
        value = 1.0f;
      unsigned int cvalue = (unsigned int)((value + 1.0f)/pp_step);
      if(cvalue >= COLOR_NUM)
        cvalue = COLOR_NUM-1;
      result.r = m_colorMap[cvalue][0];
      result.g = m_colorMap[cvalue][1];
      result.b = m_colorMap[cvalue][2];
      return result;
    }

    Creating a colormap isn’t hard, and different colormaps producing different transition effects are freely available on the web. This time the result was much nicer (see the screenshot later on) and the performance (although converting every value is always an intensive task) increased substantially.
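
    As an illustration of how little is needed, the sketch below fills a colormap with a plain linear fade from red (lowest values) to blue (highest values). It is not the colormap used in the project: the constant kColorCount stands in for COLOR_NUM, and the function name and the red-to-blue choice are assumptions made for the example.

```cpp
#include <array>

const int kColorCount = 256; // stand-in for COLOR_NUM

using Colormap = std::array<std::array<unsigned char, 3>, kColorCount>;

// Builds a colormap that fades linearly from pure red (index 0) to pure blue (last index).
Colormap makeRedToBlueColormap()
{
    Colormap map{};
    for (int i = 0; i < kColorCount; ++i)
    {
        const float t = static_cast<float>(i) / (kColorCount - 1); // 0 at the bottom, 1 at the top
        map[i][0] = static_cast<unsigned char>(255.0f * (1.0f - t) + 0.5f); // red
        map[i][1] = 0;                                                      // green
        map[i][2] = static_cast<unsigned char>(255.0f * t + 0.5f);          // blue
    }
    return map;
}
```

    Because neighboring entries differ by at most one level per channel, neighboring heights on the surface map to nearly identical colors, which is what gives a smooth transition.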

    The last part involved in the on-screen rendering is adding a droplet wherever the user clicks on the window with the cursor. One droplet is automatically added at the center of the surface (you can find the code in the kernel() function, it is controlled by the first_droplet variable) but the user can click (almost) everywhere on the surface to add another droplet in that spot. To achieve this, a queue has been implemented that holds at most 60 droplet centers where the Gaussian matrix will be placed (notice that the matrix is added to the surface values already present in those spots, it does not replace them).

    #define MAX_DROPLET        60
    typedef struct Droplet
    {
        int mx;
        int my;
    } Droplet;
    Droplet dropletQueue[MAX_DROPLET];
    int dropletQueueCount = 0;

    The queue system has been implemented for a reason: unlike the algorithm in pseudo-code we wrote before, rendering a scene with OpenGL requires the program to control the objects to be displayed in immediate-mode. That means the program needs to take care of what should be drawn before the actual rendering is performed; it cannot simply put a droplet inside the data to be drawn, because that data could be in use (you can do this in retained-mode). Besides, we don’t know when a droplet will be added because it’s totally user-dependent. Because of that, every time kernel() finishes, the droplet queue is emptied and the droplet Gaussian matrices are added to the data to be rendered (the surface data). The code which performs this is the following

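    void openGLRenderer::renderWaitingDroplets()
    {
       // If there are droplets waiting to be rendered, schedule for the next rendering
       while(dropletQueueCount > 0)
       {
           // Body reconstructed from context: pop a queued center and add its droplet
           dropletQueueCount--;
           addDroplet(dropletQueue[dropletQueueCount].mx, dropletQueue[dropletQueueCount].my);
       }
    }
    void addDroplet( int x0d, int y0d )
    {
       y0d = DIM - y0d;
       // Place the (x0d;y0d) centered Zd droplet on the wave data (it will be added at the next iteration)
       for(int Zdi=0; Zdi< m_simulationData.Ddim; Zdi++)
           for(int Zdj=0; Zdj< m_simulationData.Ddim; Zdj++)
               (*m_simulationData.u)(y0d-2*m_simulationData.dsz+Zdi, x0d-2*m_simulationData.dsz+Zdj) += (*m_simulationData.Zd)(Zdi, Zdj);
    }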

    The code should be familiar by now: the addDroplet function simply adds the Zd 2D Gaussian matrix to the simulation data (the kernel data) at the “current” time, a.k.a. the u matrix which represents the surface.
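
    The other half of the mechanism, storing a click until the next kernel() call has finished, can be sketched as follows. This is a hypothetical, self-contained rewrite of the queue idea: the DropletQueue struct, its push and drain members and the decision to drop clicks once 60 centers are waiting are assumptions, not the project’s actual mouse handler.

```cpp
const int kMaxDroplets = 60; // same capacity as MAX_DROPLET

struct DropletCenter
{
    int mx;
    int my;
};

struct DropletQueue
{
    DropletCenter items[kMaxDroplets];
    int count = 0;

    // Called from the mouse handler: remembers a click position.
    // Returns false (and ignores the click) when the queue is already full.
    bool push(int mx, int my)
    {
        if (count >= kMaxDroplets)
            return false;
        items[count].mx = mx;
        items[count].my = my;
        ++count;
        return true;
    }

    // Called once the surface data is no longer in use:
    // hands every waiting center to addDroplet and empties the queue.
    template <typename AddDroplet>
    void drain(AddDroplet addDroplet)
    {
        while (count > 0)
        {
            --count;
            addDroplet(items[count].mx, items[count].my);
        }
    }
};
```

    Separating push (driven by the user) from drain (driven by the rendering loop) is what keeps the surface matrix from being modified while it is being read.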

    The code loops until the keyboard callback handler (defined by the openGLRenderer) detects the Esc keypress; after that the application is issued a termination signal and the loop ends. The resources allocated by the program are freed before the exit signal is issued, although this might not be strictly necessary since on program termination the OS frees any previously allocated resources.

    With the droplet-adding feature and all the handlers set, the code is ready to run. This time the result is much nicer than the previous one, because we used a smoother colormap (take a look at the images below). Notice how the convolution term creates the “wave spreading” and the “bouncing” effect when computing values against the padded zero data outside the surface matrix (i.e. when the wave hits the window’s borders and is reflected back). The first image is the simulation in its early stage, that is, when some droplets have just been added; the second image represents a later stage when the surface is about to calm down (in our colormap blue values are higher than red ones).

    Since we introduced a damping factor (recall it from the theory section), the waves are eventually going to cease and the surface will sooner or later be quiet again. The entire simulation is (except for the colors) quite realistic but quite slow too. That’s because the entire surface matrix is recalculated at every step. The kernel() function runs continuously, updating the rendering buffer. For a 512x512 image the CPU has to process a large quantity of data and it also has to perform 512x512 floating point convolutions. Using a profiler (like the one integrated in Visual Studio) shows that the program spends most of its time in the kernel() call (as we expected) and that the convolution operation is the most CPU-intensive.

    It is also interesting to notice that the simulation speed decreases substantially when adding lots of droplets.

    In a real scientific simulation environment gigantic quantities of data need to be processed in relatively small time intervals. That’s where GPGPU computing comes into the scene. We will briefly summarize what this acronym means and then we will present a GPU-accelerated version of the wave PDE simulation. 

    GPGPU Architectures

    GPGPU stands for General Purpose computation on Graphics Processing Units and indicates using graphics processors and devices to perform highly parallelizable computations that would normally be handled by CPUs. The idea of using graphics devices to help CPUs with their workloads isn’t new, but until recent architectures and frameworks like CUDA (NVIDIA, vendor-specific) or OpenCL showed up, programmers had to rely on series of workarounds and tricks and to work with inconvenient and unintuitive methods and data structures. The reason why developers should think about porting their CPU-native code to a GPU version resides in the architectural design differences between CPUs and GPUs. While CPUs evolved (multicore) to gain performance advantages with sequential execution (pipelines, caches, control flow, etc.), GPUs evolved in a many-core way: they tended to operate at higher data bandwidths and to heavily increase the number of their execution threads. In recent years GPGPU has been seen as a new opportunity to use graphics processing units as algebraic coprocessors capable of handling massive parallelization and floating point calculations. The idea behind GPGPU architectures is to let the CPU handle the sequential parts of a program and the GPU take over the parallelizable parts. In fact many scientific applications and systems found their performance increased by such an approach, and GPGPU is now a fundamental technology in many fields like medical imaging, physics simulations, signal processing, cryptography, intrusion detection, environmental sciences, etc.

    Massive parallelization with CUDA

    We chose the CUDA architecture to parallelize some parts of our code. Parallelizing with GPGPU techniques means moving from sequentially-designed code to parallel-designed code, which often means rewriting large parts of it. The most obvious part of the simulation algorithm that could benefit from a parallel approach is the convolution of the surface matrix with the 2D Laplacian kernel.

    Notice: for brevity’s sake, we will assume that the reader is already familiar with CUDA C/C++.

    The CUDA SDK comes with a large variety of examples regarding how to implement an efficient convolution between matrices and how to apply a kernel as an image filter. However we decided to implement our own version rather than rely on standard techniques. The reasons are many:

    • we wanted to show how a massively parallel algorithm is designed and implemented
    • since our kernel is very small (a 3x3 2D Laplacian kernel, basically 9 float values), using an FFT approach like the one described by Victor Podlozhnyuk would be inefficient
    • the two-dimensional Gaussian kernel is the only radially symmetric function that is also separable; our kernel is not separable, so we cannot use separable convolution methods
    • such a small kernel seems perfect to be “aggressively cached” in the convolution operation. We’ll expand on that as soon as we describe the CUDA kernel we designed
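
    The separability point in the list above can be checked numerically. The following is a minimal host-side sketch, not part of the simulator: a 3x3 kernel is separable exactly when it has rank at most one, that is, when all of its 2x2 minors vanish. The function name isSeparable3x3 and the tolerance eps are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Returns true when a 3x3 kernel can be written as the outer product of a
// column vector and a row vector (rank <= 1), i.e. when it is separable.
// A matrix has rank <= 1 exactly when every 2x2 minor is zero.
// Hypothetical helper, used only for this illustration.
bool isSeparable3x3(const float k[3][3], float eps = 1e-6f)
{
    for (int r0 = 0; r0 < 3; ++r0)
        for (int r1 = r0 + 1; r1 < 3; ++r1)
            for (int c0 = 0; c0 < 3; ++c0)
                for (int c1 = c0 + 1; c1 < 3; ++c1) {
                    const float minor = k[r0][c0] * k[r1][c1] - k[r0][c1] * k[r1][c0];
                    if (std::fabs(minor) > eps)
                        return false;  // one non-zero minor is enough
                }
    return true;
}
```

    Applied to the Laplacian used here (0 1 0 / 1 -4 1 / 0 1 0) the check fails on the first minor it inspects, while a smoothing kernel built as the outer product of (1 2 1) with itself passes.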

    The most obvious (although extremely inefficient) way to perform a convolution is to assign each GPU thread several output elements spread across the entire image and have it compute the full convolution sum for each of them.

    Take a look at the following image: we will use a 9x9 thread grid (just one block, to simplify things) to perform the convolution. The purple square is our 9x9 grid, while the red grids correspond to the 3x3 kernel. Each thread performs the convolution for its own element, then the X coordinate is shifted and the entire block is “virtually” moved to the right. When the X direction is complete (that is, the entire horizontal extent of the image has been covered), the Y coordinate is incremented and the process starts again until completion. In the border area, every value outside the image is treated as zero.
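
    Before looking at GPU code, it can help to have a plain CPU reference that applies the same zero-border rule, for instance to compare results against a kernel's output. This is a sketch under assumptions: the name convolveZeroPadded and the use of std::vector are choices made for this illustration, and the kernel is applied without flipping (which makes no difference for a symmetric kernel such as the Laplacian).

```cpp
#include <cassert>
#include <vector>

// CPU reference: convolve a row-major image with a small row-major kernel.
// Samples that fall outside the image are treated as zero.
// Hypothetical helper, not part of the article's sources.
std::vector<float> convolveZeroPadded(const std::vector<float>& img, int cols, int rows,
                                      const std::vector<float>& ker, int kCols, int kRows)
{
    const int rx = kCols / 2;  // horizontal kernel radius
    const int ry = kRows / 2;  // vertical kernel radius
    std::vector<float> out(img.size(), 0.0f);
    for (int y = 0; y < rows; ++y)
        for (int x = 0; x < cols; ++x) {
            float sum = 0.0f;
            for (int j = -ry; j <= ry; ++j)
                for (int i = -rx; i <= rx; ++i) {
                    const int xx = x + i;
                    const int yy = y + j;
                    const bool inside = xx >= 0 && yy >= 0 && xx < cols && yy < rows;
                    const float a = inside ? img[yy * cols + xx] : 0.0f;
                    sum += a * ker[(j + ry) * kCols + (i + rx)];
                }
            out[y * cols + x] = sum;
        }
    return out;
}
```

    On a 3x3 image filled with ones, the Laplacian gives 0 at the center, -1 on the edges and -2 in the corners, because the missing neighbours count as zero.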

    The CUDA code for this simple approach is the following, where A is the image matrix and B is the kernel.

    __global__ void convolutionGPU(float *A, int A_cols, int A_rows, float *B, int B_Wradius, int B_Hradius,
    int B_cols, int B_rows, float *result)
    {
        // Initial position
        int threadXabs = blockIdx.x*blockDim.x + threadIdx.x;
        int threadYabs = blockIdx.y*blockDim.y + threadIdx.y;
        int threadXabsInitialPos = threadXabs;
        float convSum;
        while(threadYabs < A_rows)
        {
            while(threadXabs < A_cols)
            {
                 // If we are still in the image, start the convolution
                 convSum = 0.0f;
                 // relative x coord to the absolute thread
                 #pragma unroll
                 for(int xrel=-B_Wradius;xrel<(B_Wradius+1); xrel++)
                 {
                     #pragma unroll
                     for(int yrel=-B_Hradius;yrel<(B_Hradius+1); yrel++)
                     {
                           // Check the borders, 0 if outside
                           float Avalue;
                           if(threadXabs + xrel < 0 || threadYabs + yrel <0 || threadXabs + xrel >= A_cols || threadYabs + yrel >= A_rows)
                                Avalue = 0;
                           else
                                Avalue = A[ (threadYabs+yrel)*A_cols + (threadXabs + xrel) ];
                           // yrel+B_Hradius makes the interval positive
                           float Bvalue = B[ (yrel+B_Hradius)*B_cols + (xrel+B_Wradius) ];
                           convSum += Avalue * Bvalue;
                     }
                 }
                 // Store the result and proceed ahead in the grid
                 result[threadYabs*A_cols + threadXabs ] = convSum;
                 threadXabs += blockDim.x * gridDim.x;
            }
            // reset X pos and forward the Y pos
            threadXabs = threadXabsInitialPos;
            threadYabs += blockDim.y * gridDim.y;
        }
        // convolution finished
    }

    As already stated, this simple approach has several disadvantages:

    • The kernel is very small; keeping it in global memory and accessing it for every convolution performed is extremely inefficient
    • Although the matrix reads are partially coalesced, thread divergence can be significant between threads that are active in the border area and threads that are inactive
    • There’s no collaboration among threads, although they all use the same kernel and share a large part of the apron region

    Hence a much better method to perform the GPU convolution has been designed with the points above in mind.

    The idea is simple: let each thread load part of the apron and data regions into shared memory, thus maximizing read coalescing and reducing divergence.

    The code that performs the convolution in the GPU version of the simulation is the following:

    // For a 516x516 image the grid is 172x172 blocks of 3x3 threads each
    __global__ void convolutionGPU(float *A, float *result)
    {
       __shared__ float data[laplacianD*2][laplacianD*2];
       // Absolute position into the image
       const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
       // Image-relative position
       const int x0 = threadIdx.x + IMUL(blockIdx.x,blockDim.x);
       const int y0 = threadIdx.y + IMUL(blockIdx.y,blockDim.y);
       // Load the apron and data regions
       int x,y;
       // Upper left square
       x = x0 - kernelRadius;
       y = y0 - kernelRadius;
       if(x < 0 || y < 0)
             data[threadIdx.x][threadIdx.y] = 0.0f;
       else
             data[threadIdx.x][threadIdx.y] = A[ gLoc - kernelRadius - IMUL(DIM,kernelRadius)];
       // Upper right square
       x = x0 + kernelRadius + 1;
       y = y0 - kernelRadius;
       if(x >= DIM || y < 0)
             data[threadIdx.x + blockDim.x][threadIdx.y] = 0.0f;
       else
             data[threadIdx.x + blockDim.x][threadIdx.y] = A[ gLoc + kernelRadius+1 - IMUL(DIM,kernelRadius)];
       // Lower left square
       x = x0 - kernelRadius;
       y = y0 + kernelRadius+1;
       if(x < 0 || y >= DIM)
             data[threadIdx.x][threadIdx.y + blockDim.y] = 0.0f;
       else
             data[threadIdx.x][threadIdx.y + blockDim.y] = A[ gLoc - kernelRadius + IMUL(DIM,(kernelRadius+1))];
       // Lower right square
       x = x0 + kernelRadius+1;
       y = y0 + kernelRadius+1;
       if(x >= DIM || y >= DIM)
             data[threadIdx.x + blockDim.x][threadIdx.y + blockDim.y] = 0.0f;
       else
             data[threadIdx.x + blockDim.x][threadIdx.y + blockDim.y] = A[ gLoc + kernelRadius+1 + IMUL(DIM,(kernelRadius+1))];
       // Wait until every thread of the block has stored its part of the shared array
       __syncthreads();
       float sum = 0;
       x = kernelRadius + threadIdx.x;
       y = kernelRadius + threadIdx.y;
       // Execute the convolution in the shared memory (kernel is in constant memory)
    #pragma unroll
       for(int i = -kernelRadius; i<=kernelRadius; i++)
              for(int j=-kernelRadius; j<=kernelRadius; j++)
                      sum += data[x+i][y+j]  * gpu_D[i+kernelRadius][j+kernelRadius];
       // Transfer the result to global memory
       result[gLoc] = sum;
    }

    The kernel only receives the surface matrix and the result buffer where the convolved image is stored. The convolution kernel itself isn’t passed as a parameter because it has been put into a special memory called “constant memory”, which is read-only for kernels, cached, and optimized to let all threads read from the same location with minimum latency. The downside is that this kind of memory is extremely limited (on the order of 64 KB), so it should be used wisely. Declaring our 3x3 kernel in constant memory grants us a significant speed advantage:

    __device__ __constant__ float gpu_D[laplacianD][laplacianD]; // Laplacian 2D kernel

    The image below shows how threads load data from the surface matrix in global memory and store it into faster on-chip shared memory before actually using it in the convolution operation. The purple 3x3 square is the kernel window, and the central element is the value we are pivoting on. The grid is made of 172x172 blocks of 3x3 threads each; each block of 3x3 threads has four stages to complete before entering the convolution loop: load the upper left apron and image data into shared memory (the upper left red square relative to the purple kernel window), load the upper right area (red square), load the lower left area (red square) and load the lower right area (likewise). Since shared memory is only visible to the threads of a block, each block loads its own shared area. Notice that we chose to let every thread read something from global memory to maximize coalescing, but we are not actually going to use every single element. The image shows a yellow area and a gray area: the yellow data is going to be used in the convolution for each element in the purple kernel square (it comprises apron and data), while the gray area isn’t used by any convolution performed by the block we are considering.

    After each block’s shared memory array has been filled, the threads of the block are synchronized so that every load has completed before any thread reads the shared data. Then the convolution itself is executed: shared data is multiplied by the kernel values held in constant memory, which makes for a highly optimized operation.
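
    The four load stages can be replayed on the host to see which image coordinates a block touches and which of them it really needs. The sketch below mirrors the index arithmetic of the shared-memory kernel for 3x3 blocks and a kernel radius of 1; the names tileFootprint and tileNeeded and the constants kBlock and kRadius are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>

constexpr int kBlock  = 3;  // threads per block side, as in the launch configuration
constexpr int kRadius = 1;  // radius of the 3x3 Laplacian kernel

// Every image coordinate (x, y) read by the 3x3 threads of block (bx, by)
// during the four load stages. Coordinates outside the image can appear here;
// the kernel stores zero for those instead of reading global memory.
std::set<std::pair<int, int>> tileFootprint(int bx, int by)
{
    std::set<std::pair<int, int>> loaded;
    for (int ty = 0; ty < kBlock; ++ty)
        for (int tx = 0; tx < kBlock; ++tx) {
            const int x0 = bx * kBlock + tx;
            const int y0 = by * kBlock + ty;
            loaded.emplace(x0 - kRadius,     y0 - kRadius);      // upper left
            loaded.emplace(x0 + kRadius + 1, y0 - kRadius);      // upper right
            loaded.emplace(x0 - kRadius,     y0 + kRadius + 1);  // lower left
            loaded.emplace(x0 + kRadius + 1, y0 + kRadius + 1);  // lower right
        }
    return loaded;
}

// Coordinates the block actually uses: its 3x3 data region plus a one-pixel apron.
std::set<std::pair<int, int>> tileNeeded(int bx, int by)
{
    std::set<std::pair<int, int>> needed;
    for (int y = by * kBlock - kRadius; y < (by + 1) * kBlock + kRadius; ++y)
        for (int x = bx * kBlock - kRadius; x < (bx + 1) * kBlock + kRadius; ++x)
            needed.emplace(x, y);
    return needed;
}
```

    With these sizes each block loads a 6x6 tile (36 values) and uses a 5x5 region of it (25 values); the remaining 11 values correspond to the gray area described above.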

    The #pragma unroll directive instructs the compiler to unroll the loop (where possible) to reduce loop control overhead and improve performance. A small example of loop unrolling: the following loop

    for(int i=0;i<1000;i++)
        a[i] = b[i] + c[i];

    might be optimized by unrolling it

    for(int i=0;i<1000;i+=2)
    {
        a[i] = b[i] + c[i];
        a[i+1] = b[i+1] + c[i+1];
    }

    so that the control instructions are executed fewer times and the loop as a whole runs faster. Note that almost every optimization in CUDA code needs to be carefully and thoroughly tested, because a different architecture and a different program control flow might produce different results (as might different compiler optimizations which, unfortunately, cannot always be trusted).
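
    The unrolled loop above works because 1000 is a multiple of the unroll factor. When the trip count is arbitrary, a manual unroll needs a tail loop for the leftover iterations. A minimal sketch, assuming b and c are at least as long as a (the function name addUnrolled2 is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise a = b + c, manually unrolled by a factor of two.
// The second loop picks up the last element when the length is odd.
void addUnrolled2(std::vector<float>& a, const std::vector<float>& b, const std::vector<float>& c)
{
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {        // two elements per iteration
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
    }
    for (; i < n; ++i)                 // remainder (at most one element)
        a[i] = b[i] + c[i];
}
```

    A compiler that honours #pragma unroll generates the equivalent of this remainder handling itself, which is one reason to prefer the directive over unrolling by hand.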

    Also notice that the code uses the IMUL macro, which is defined as

    #define IMUL(a,b) __mul24(a,b)

    On devices of CUDA compute capability 1.x, 32-bit integer multiplication is not natively supported and is implemented using multiple instructions, while 24-bit integer multiplication is natively supported via the __[u]mul24 intrinsic. On devices of compute capability 2.0, however, 32-bit integer multiplication is natively supported but 24-bit integer multiplication is not; __[u]mul24 is therefore implemented using multiple instructions and should not be used. So if you are planning to use the code on 2.x devices, make sure to redefine the macro.
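
    One possible way to redefine the macro is to let the preprocessor pick the variant. This is a sketch, not the article's code: it relies on the __CUDA_ARCH__ macro that nvcc defines while compiling device code (values below 200 for compute capability 1.x), and it falls back to the ordinary multiplication everywhere else, including plain host compilation. The helper flatIndex is hypothetical and only shows the macro in use.

```cpp
#include <cassert>

// Choose the multiplication that is native to the compilation target.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 200)
    // Device code for compute capability 1.x: 24-bit multiply is the fast path.
    // Only valid while both operands fit in 24 bits.
    #define IMUL(a, b) __mul24((a), (b))
#else
    // Compute capability 2.x and later, or host code: plain 32-bit multiply.
    #define IMUL(a, b) ((a) * (b))
#endif

// Hypothetical helper: row-major offset of element (x, y) in a matrix
// whose rows are `pitch` elements wide.
inline int flatIndex(int x, int y, int pitch)
{
    return x + IMUL(y, pitch);
}
```

    With this definition the kernels can keep using IMUL unchanged, and the choice is made once per target at compile time.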

    Typical host code that calls the kernel we just wrote could be

    sMatrix convolutionGPU_i(sMatrix& A, sMatrix& B)
    {
        unsigned int A_bytes = A.rows*A.columns*sizeof(float);
        sMatrix result(A.rows, A.columns);
        float *cpu_result = (float*)malloc(A_bytes);
        // Copy A data to the GPU global memory (B aka the kernel is already there)
        cudaError_t chk;
        chk = cudaMemcpy(m_gpuData->gpu_matrixA, A.values, A_bytes, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
            return result;
        // Call the convolution kernel
        dim3 blocks(172,172);
        dim3 threads(3,3);
        convolutionGPU<<<blocks,threads>>>(m_gpuData->gpu_matrixA, m_gpuData->gpu_matrixResult);
        // Copy back the result
        chk = cudaMemcpy(cpu_result, m_gpuData->gpu_matrixResult, A_bytes, cudaMemcpyDeviceToHost);
        if(chk != cudaSuccess)
             return result;
        // Allocate a sMatrix and return it with the GPU data
        result.values = cpu_result;
        return result;
    }

    Obviously, CUDA memory should be allocated with cudaMalloc at the beginning of our program and freed only when the GPU work is complete.

    However, as we stated before, converting a sequentially-designed program into a parallel one isn’t an easy task and often requires more than just a plain function-to-function conversion (it depends on the application). In our case, replacing just the CPU convolution function with a GPU convolution function won’t work. In fact, even though we distributed our workload better than in the CPU version (see the images below for the CPU-GPU exclusive time percentages), we actually slowed down the whole simulation.

    The reason is simple: our kernel() function is called whenever a draw event is dispatched, so it is called very often. Although the CUDA kernel is faster than the CPU convolution function, and although GPU memory bandwidth is higher than the CPU’s, transferring data back and forth between (possibly paged-out) host memory and global device memory just kills our simulation’s performance. Applications that benefit most from a CUDA approach usually run a single, computationally heavy kernel workload and then transfer the results back. Real-time applications might benefit from a concurrent-kernels approach, but a device of compute capability 2.x would be required.
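
    To put a number on the traffic, the following sketch counts the bytes that cross the bus per frame when the whole float matrix is uploaded before the kernel and downloaded after it, as convolutionGPU_i does. The function name roundTripBytes is hypothetical, and the 516x516 size is taken from the 172x172 grid of 3x3-thread blocks used in the launch.

```cpp
#include <cassert>
#include <cstddef>

// Bytes transferred per frame for one host-to-device upload and one
// device-to-host download of a dim x dim matrix of floats.
constexpr std::size_t roundTripBytes(std::size_t dim)
{
    return 2 * dim * dim * sizeof(float);
}
```

    For a 516x516 surface this is 2,130,048 bytes per frame. The figure only describes the volume of data; each cudaMemcpy call also carries a fixed latency and forces the host to wait, so it should be read as a lower bound on the cost rather than a timing estimate.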

    In order to actually accelerate our simulation, a greater code revision is required.

    Another, more subtle thing to take into account when working with GPU code is CPU optimizations: take a look at the following assembly code for the CPU version of the line

    un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolutionCPU((*data.u),(*data.D))*data.c1;

    000000013F2321EF  mov        r8,qword ptr [m_simulationData+20h (13F234920h)]  
    000000013F2321F6  mov        rdx,qword ptr [m_simulationData+10h (13F234910h)]  
    000000013F2321FD  lea        rcx,[rbp+2Fh]  
    000000013F232201  call       convolutionCPU (13F231EC0h)  
    000000013F232206  nop  
    000000013F232207  movss      xmm2,dword ptr [m_simulationData+8 (13F234908h)]  
    000000013F23220F  lea        rdx,[rbp+1Fh]  
    000000013F232213  mov        rcx,rax  
    000000013F232216  call       sMatrix::operator* (13F2314E0h)  
    000000013F23221B  mov        rdi,rax  
    000000013F23221E  movss      xmm2,dword ptr [m_simulationData+4 (13F234904h)]  
    000000013F232226  lea        rdx,[rbp+0Fh]  
    000000013F23222A  mov        rcx,qword ptr [m_simulationData+18h (13F234918h)]  
    000000013F232231  call       sMatrix::operator* (13F2314E0h)  
    000000013F232236  mov        rbx,rax  
    000000013F232239  movss      xmm2,dword ptr [m_simulationData (13F234900h)]  
    000000013F232241  lea        rdx,[rbp-1]  
    000000013F232245  mov        rcx,qword ptr [m_simulationData+10h (13F234910h)]  
    000000013F23224C  call       sMatrix::operator* (13F2314E0h)  
    000000013F232251  nop  
    000000013F232252  mov        r8,rbx  
    000000013F232255  lea        rdx,[rbp-11h]  
    000000013F232259  mov        rcx,rax  
    000000013F23225C  call       sMatrix::operator+ (13F2315B0h)  
    000000013F232261  nop  
    000000013F232262  mov        r8,rdi  
    000000013F232265  lea        rdx,[rbp-21h]  
    000000013F232269  mov        rcx,rax  
    000000013F23226C  call       sMatrix::operator+ (13F2315B0h)  
    000000013F232271  nop  
    000000013F232272  cmp        dword ptr [rax],1F4h  
    000000013F232278  jne        kernel+33Fh (13F2324CFh)  
    000000013F23227E  cmp        dword ptr [rax+4],1F4h  
    000000013F232285  jne        kernel+33Fh (13F2324CFh)  
    000000013F23228B  mov        r8d,0F4240h  
    000000013F232291  mov        rdx,qword ptr [rax+8]  
    000000013F232295  mov        rcx,r12  
    000000013F232298  call       memcpy (13F232DDEh)  
    000000013F23229D  nop  
    000000013F23229E  mov        rcx,qword ptr [rbp-19h]  
    000000013F2322A2  call       qword ptr [__imp_operator delete (13F233090h)]  
    000000013F2322A8  nop  
    000000013F2322A9  mov        rcx,qword ptr [rbp-9]  
    000000013F2322AD  call       qword ptr [__imp_operator delete (13F233090h)]  
    000000013F2322B3  nop  
    000000013F2322B4  mov        rcx,qword ptr [rbp+7]  
    000000013F2322B8  call       qword ptr [__imp_operator delete (13F233090h)]  
    000000013F2322BE  nop  
    000000013F2322BF  mov        rcx,qword ptr [rbp+17h]  
    000000013F2322C3  call       qword ptr [__imp_operator delete (13F233090h)]  
    000000013F2322C9  nop  
    000000013F2322CA  mov        rcx,qword ptr [rbp+27h]  
    000000013F2322CE  call       qword ptr [__imp_operator delete (13F233090h)]  
    000000013F2322D4  nop  
    000000013F2322D5  mov        rcx,qword ptr [rbp+37h]  
    000000013F2322D9  call       qword ptr [__imp_operator delete (13F233090h)]  

    and now take a look at the GPU version of the line

    un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolutionGPU_i ((*data.u),(*data.D))*data.c1;

    000000013F7E23A3  mov        rax,qword ptr [data]  
    000000013F7E23AB  movss      xmm0,dword ptr [rax+8]  
    000000013F7E23B0  movss      dword ptr [rsp+0A8h],xmm0  
    000000013F7E23B9  mov        rax,qword ptr [data]  
    000000013F7E23C1  mov        r8,qword ptr [rax+20h]  
    000000013F7E23C5  mov        rax,qword ptr [data]  
    000000013F7E23CD  mov        rdx,qword ptr [rax+10h]  
    000000013F7E23D1  lea        rcx,[rsp+70h]  
    000000013F7E23D6  call       convolutionGPU_i (13F7E1F20h)  
    000000013F7E23DB  mov        qword ptr [rsp+0B0h],rax  
    000000013F7E23E3  mov        rax,qword ptr [rsp+0B0h]  
    000000013F7E23EB  mov        qword ptr [rsp+0B8h],rax  
    000000013F7E23F3  movss      xmm0,dword ptr [rsp+0A8h]  
    000000013F7E23FC  movaps     xmm2,xmm0  
    000000013F7E23FF  lea        rdx,[rsp+80h]  
    000000013F7E2407  mov        rcx,qword ptr [rsp+0B8h]  
    000000013F7E240F  call       sMatrix::operator* (13F7E2B20h)  
    000000013F7E2414  mov        qword ptr [rsp+0C0h],rax  
    000000013F7E241C  mov        rax,qword ptr [rsp+0C0h]  
    000000013F7E2424  mov        qword ptr [rsp+0C8h],rax  
    000000013F7E242C  mov        rax,qword ptr [data]  
    000000013F7E2434  movss      xmm0,dword ptr [rax+4]  
    000000013F7E2439  movaps     xmm2,xmm0  
    000000013F7E243C  lea        rdx,[rsp+50h]  
    000000013F7E2441  mov        rax,qword ptr [data]  
    000000013F7E2449  mov        rcx,qword ptr [rax+18h]  
    000000013F7E244D  call       sMatrix::operator* (13F7E2B20h)  
    000000013F7E2452  mov        qword ptr [rsp+0D0h],rax  
    000000013F7E245A  mov        rax,qword ptr [rsp+0D0h]  
    000000013F7E2462  mov        qword ptr [rsp+0D8h],rax  
    000000013F7E246A  mov        rax,qword ptr [data]  
    000000013F7E2472  movss      xmm2,dword ptr [rax]  
    000000013F7E2476  lea        rdx,[rsp+40h]  
    000000013F7E247B  mov        rax,qword ptr [data]  
    000000013F7E2483  mov        rcx,qword ptr [rax+10h]  
    000000013F7E2487  call       sMatrix::operator* (13F7E2B20h)  
    000000013F7E248C  mov        qword ptr [rsp+0E0h],rax  
    000000013F7E2494  mov        rax,qword ptr [rsp+0E0h]  
    000000013F7E249C  mov        qword ptr [rsp+0E8h],rax  
    000000013F7E24A4  mov        r8,qword ptr [rsp+0D8h]  
    000000013F7E24AC  lea        rdx,[rsp+60h]  
    000000013F7E24B1  mov        rcx,qword ptr [rsp+0E8h]  
    000000013F7E24B9  call       sMatrix::operator+ (13F7E2BF0h)  
    000000013F7E24BE  mov        qword ptr [rsp+0F0h],rax  
    000000013F7E24C6  mov        rax,qword ptr [rsp+0F0h]  
    000000013F7E24CE  mov        qword ptr [rsp+0F8h],rax  
    000000013F7E24D6  mov        r8,qword ptr [rsp+0C8h]  
    000000013F7E24DE  lea        rdx,[rsp+90h]  
    000000013F7E24E6  mov        rcx,qword ptr [rsp+0F8h]  
    000000013F7E24EE  call       sMatrix::operator+ (13F7E2BF0h)  
    000000013F7E24F3  mov        qword ptr [rsp+100h],rax  
    000000013F7E24FB  mov        rax,qword ptr [rsp+100h]  
    000000013F7E2503  mov        qword ptr [rsp+108h],rax  
    000000013F7E250B  mov        rdx,qword ptr [rsp+108h]  
    000000013F7E2513  lea        rcx,[un]  
    000000013F7E2518  call       sMatrix::operator= (13F7E2A90h)  
    000000013F7E251D  nop  
    000000013F7E251E  lea        rcx,[rsp+90h]  
    000000013F7E2526  call        sMatrix::~sMatrix (13F7E2970h)  
    000000013F7E252B  nop  
    000000013F7E252C  lea        rcx,[rsp+60h]  
    000000013F7E2531  call       sMatrix::~sMatrix (13F7E2970h)  
    000000013F7E2536  nop  
    000000013F7E2537  lea        rcx,[rsp+40h]  
    000000013F7E253C  call       sMatrix::~sMatrix (13F7E2970h)  
    000000013F7E2541  nop  
    000000013F7E2542  lea        rcx,[rsp+50h]  
    000000013F7E2547  call       sMatrix::~sMatrix (13F7E2970h)  
    000000013F7E254C  nop  
    000000013F7E254D  lea        rcx,[rsp+80h]  
    000000013F7E2555  call       sMatrix::~sMatrix (13F7E2970h)  
    000000013F7E255A  nop  
    000000013F7E255B  lea        rcx,[rsp+70h]  
    000000013F7E2560  call       sMatrix::~sMatrix (13F7E2970h)  

    Although the data involved are practically the same, this code looks much more bloated (there are even pointless operations, look at address 000000013F7E23DB). Letting the CPU finish the calculation after the GPU has done its work is probably not a good idea.

    Since there are other functions which can be parallelized (like the matrix2bitmap() function), we need to move as much workload as possible to the device.

    First we need to allocate memory on the device at the beginning of the program and deallocate it when finished. Small data like our algorithm parameters can be stored in constant memory to boost performance, while large data like the matrices is better suited to global memory (constant memory size is very limited).

    // Initialize all data used by the device
    // and the rendering simulation data as well
    void initializeGPUData()
    {
        /* Algorithm parameters */
        // Time step
        float dt = (float)0.05;
        // Speed of the wave
        float c = 1;
        // Space step
        float dx = 1;
        // Decay factor
        float k = (float)0.002;
        // Droplet amplitude (Gaussian amplitude)
        float da = (float)0.07;
        // Initialize u0
        sMatrix u0(DIM,DIM);
        for(int i=0; i<DIM; i++)
              for(int j=0; j<DIM; j++)
                    u0(i,j) = 0.0f; // The corresponding color in the colormap for 0 is green
        // Initialize the rendering img to the u0 matrix
        CPUsMatrix2Bitmap(u0, renderImg);
        // Decay per timestep
        float kdt=k*dt;
        // c1 constant
        float c1=pow(dt,2)*pow(c,2)/pow(dx,2);
        // Droplet as gaussian
        // This code creates a gaussian discrete droplet, see the documentation for more information
        const int dim = 4*dropletRadius+1;
        sMatrix xd(dim, dim);
        sMatrix yd(dim, dim);
        for(int i=0; i<dim; i++)
        {
               for(int j=-2*dropletRadius; j<=2*dropletRadius; j++)
               {
                      xd(i,j+2*dropletRadius) = j;
                      yd(j+2*dropletRadius,i) = j;
               }
        }
        float m_Zd[dim][dim];
        for(int i=0; i<dim; i++)
               for(int j=0; j<dim; j++)
                      // Calculate Gaussian centered on zero
                      m_Zd[i][j] = -da*exp(-pow(xd(i,j)/dropletRadius,2)-pow(yd(i,j)/dropletRadius,2));
        /* GPU data initialization */
        // Allocate memory on the GPU for u and u0 matrices
        unsigned int UU0_bytes = DIM*DIM*sizeof(float);
        cudaError_t chk;
        chk = cudaMalloc((void**)&m_gpuData.gpu_u, UU0_bytes);
        if(chk != cudaSuccess)
             printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
        chk = cudaMalloc((void**)&m_gpuData.gpu_u0, UU0_bytes);
        if(chk != cudaSuccess)
             printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
        // Allocate memory for ris0, ris1, ris2 and ptr matrices
        chk = cudaMalloc((void**)&m_gpuData.ris0, UU0_bytes);
        if(chk != cudaSuccess)
             printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
        chk = cudaMalloc((void**)&m_gpuData.ris1, UU0_bytes);
        if(chk != cudaSuccess)
             printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
        chk = cudaMalloc((void**)&m_gpuData.ris2, UU0_bytes);
        if(chk != cudaSuccess)
             printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
        chk = cudaMalloc((void**)&m_gpuData.gpu_ptr, DIM*DIM*4);
        if(chk != cudaSuccess)
             printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
        // Initialize to zero both u and u0
        chk = cudaMemcpy(m_gpuData.gpu_u0, u0.values, UU0_bytes, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
             return;
        chk = cudaMemcpy(m_gpuData.gpu_u, u0.values, UU0_bytes, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
             return;
        // Preload Laplacian kernel
        float m_D[3][3];
        m_D[0][0] = 0.0f; m_D[1][0] = 1.0f;  m_D[2][0]=0.0f;
        m_D[0][1] = 1.0f; m_D[1][1] = -4.0f; m_D[2][1]=1.0f;
        m_D[0][2] = 0.0f; m_D[1][2] = 1.0f;  m_D[2][2]=0.0f;
        // Copy Laplacian to constant memory
        chk = cudaMemcpyToSymbol((const char*)gpu_D, m_D, 9*sizeof(float), 0, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
              printf("\nCONSTANT MEMORY TRANSFER FAILED");
        // Store all static algorithm parameters in constant memory
        const float a1 = (2-kdt);
        chk = cudaMemcpyToSymbol((const char*)&gpu_a1, &a1, sizeof(float), 0, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
              printf("\nCONSTANT MEMORY TRANSFER FAILED");
        const float a2 = (kdt-1);
        chk = cudaMemcpyToSymbol((const char*)&gpu_a2, &a2, sizeof(float), 0, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
             printf("\nCONSTANT MEMORY TRANSFER FAILED");
        chk = cudaMemcpyToSymbol((const char*)&gpu_c1, &c1, sizeof(float), 0, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
             printf("\nCONSTANT MEMORY TRANSFER FAILED");
        const int ddim = dim;
        chk = cudaMemcpyToSymbol((const char*)&gpu_Ddim, &ddim, sizeof(int), 0, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
             printf("\nCONSTANT MEMORY TRANSFER FAILED");
        const int droplet_dsz = dropletRadius;
        chk = cudaMemcpyToSymbol((const char*)&gpu_dsz, &droplet_dsz, sizeof(int), 0, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
             printf("\nCONSTANT MEMORY TRANSFER FAILED");
        chk = cudaMemcpyToSymbol((const char*)&gpu_Zd, &m_Zd, sizeof(float)*dim*dim, 0, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
             printf("\nCONSTANT MEMORY TRANSFER FAILED");
        // Initialize colormap and ppstep in constant memory
        chk = cudaMemcpyToSymbol((const char*)&gpu_pp_step, &pp_step, sizeof(float), 0, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
             printf("\nCONSTANT MEMORY TRANSFER FAILED");
        chk = cudaMemcpyToSymbol((const char*)&gpu_m_colorMap, &m_colorMap, sizeof(unsigned char)*COLOR_NUM*3, 0, cudaMemcpyHostToDevice);
        if(chk != cudaSuccess)
             printf("\nCONSTANT MEMORY TRANSFER FAILED");
    }

    void deinitializeGPUData()
    {
        // Free everything from device memory
        cudaFree(m_gpuData.gpu_u);
        cudaFree(m_gpuData.gpu_u0);
        cudaFree(m_gpuData.gpu_ptr);
        cudaFree(m_gpuData.ris0);
        cudaFree(m_gpuData.ris1);
        cudaFree(m_gpuData.ris2);
    }
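
    The parameter arithmetic in the initialization routine can be isolated and checked on the host. The sketch below recomputes the constants a1, a2 and c1 and the Gaussian droplet with the same formulas; the struct WaveConstants and the functions makeWaveConstants and makeDroplet are hypothetical names introduced for this illustration, and the droplet is stored row-major in a std::vector instead of a fixed-size array.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct WaveConstants { float a1, a2, c1; };

// kdt = k*dt, a1 = 2 - kdt, a2 = kdt - 1, c1 = dt^2 * c^2 / dx^2
WaveConstants makeWaveConstants(float dt, float c, float dx, float k)
{
    const float kdt = k * dt;
    const float ratio = dt * c / dx;
    return { 2.0f - kdt, kdt - 1.0f, ratio * ratio };
}

// Discrete Gaussian droplet on a (4r+1) x (4r+1) grid centered on zero.
// The amplitude is negative, so the center value is -da.
std::vector<float> makeDroplet(int r, float da)
{
    const int dim = 4 * r + 1;
    std::vector<float> z(static_cast<std::size_t>(dim) * dim);
    for (int i = 0; i < dim; ++i)
        for (int j = 0; j < dim; ++j) {
            const float x = static_cast<float>(j - 2 * r) / r;  // column offset / radius
            const float y = static_cast<float>(i - 2 * r) / r;  // row offset / radius
            z[i * dim + j] = -da * std::exp(-x * x - y * y);
        }
    return z;
}
```

    With the values used above (dt = 0.05, c = 1, dx = 1, k = 0.002) this gives a1 = 1.9999, a2 = -0.9999 and c1 = 0.0025. The droplet radius is not shown in this section, so any radius passed to makeDroplet is only an example value.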

    After initializing the GPU memory, the openGLRenderer can be started as usual and will call the kernel() function to obtain a valid renderable surface image matrix. But there is a difference now, right in the openGLRenderer constructor:

         //. . .
         // Sets up the CUBLAS
         cublasStatus_t status = cublasInit();
         if (status != CUBLAS_STATUS_SUCCESS)
               // CUBLAS initialization error
               printf("\nCRITICAL: CUBLAS LIBRARY FAILED TO LOAD");
         // Set up the bitmap data with page-locked memory for fast access
         //no more : renderImg = new unsigned char[DIM*DIM*4];
         cudaError_t chk = cudaHostAlloc((void**)&renderImg, DIM*DIM*4*sizeof(char), cudaHostAllocDefault);
         if(chk != cudaSuccess)
               return ;

    First, we decided to use the CUBLAS library to perform matrix addition, for two reasons:

    • our row-major data on the device is already in a form the CUBLAS functions can use (cublasAlloc is just a wrapper around cudaMalloc)
    • the CUBLAS library is extremely optimized for operations on large matrices; our matrices aren’t that big, but this could help extend the architecture in a future version

    Using our sMatrix wrapper is no longer an efficient choice and we need to get rid of it while working on the device, although we can still use it for the initialization stage.

    The second fundamental thing to notice in the openGLRenderer constructor is that we allocated host-side memory (the memory that will contain the data to be rendered) with cudaHostAlloc instead of the classic malloc. As the documentation states, the CUDA driver tracks the virtual memory ranges allocated with this function and accelerates calls to functions like cudaMemcpy. Host memory allocated with cudaHostAlloc is often referred to as “pinned memory” and cannot be paged out (because of that, allocating excessive amounts of it may degrade system performance, since it reduces the amount of memory available to the system for paging). This expedient grants additional speed in memory transfers between device and host.

    We are now ready to take a peek at the revised kernel() function

    // This kernel is called at each iteration
    // It implements the main loop algorithm and somehow "rasterizes" the matrix data
    // to be passed to the openGL renderer. It also adds droplets in the waiting queue
    void kernel(unsigned char *ptr)
    {
        // Set up the grid
        dim3 blocks(172,172);
        dim3 threads(3,3); // 516x516 img is 172x172 (3x3 thread) blocks
        // Implements the un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolution((*data.u),(*data.D))*data.c1;
        // line by means of several kernel calls
        convolutionGPU<<<blocks,threads>>>(m_gpuData.gpu_u, m_gpuData.ris0);
        // Now multiply everything by c1 constant
        multiplyEachElementby_c1<<<blocks,threads>>>(m_gpuData.ris0, m_gpuData.ris1);
        // First term is ready, now u*a1
        multiplyEachElementby_a1<<<blocks,threads>>>(m_gpuData.gpu_u, m_gpuData.ris0);
        // u0*a2
        multiplyEachElementby_a2<<<blocks,threads>>>(m_gpuData.gpu_u0, m_gpuData.ris2);
        // Perform the matrix addition with the CUBLAS library
        // un = ris0 + ris2 + ris1
        // Since everything is already stored as row-major device vectors, we don't need to do anything to pass it to the CUBLAS
        cublasSaxpy(DIM*DIM, 1.0f, m_gpuData.ris0, 1, m_gpuData.ris2, 1);
        cublasSaxpy(DIM*DIM, 1.0f, m_gpuData.ris2, 1, m_gpuData.ris1, 1);
        // Result is now in m_gpuData.ris1
        // Step forward in time
        cudaMemcpy(m_gpuData.gpu_u0, m_gpuData.gpu_u, DIM*DIM*sizeof(float), cudaMemcpyDeviceToDevice);
        cudaMemcpy(m_gpuData.gpu_u, m_gpuData.ris1, DIM*DIM*sizeof(float), cudaMemcpyDeviceToDevice);
        // Draw the u surface matrix and "rasterize" it into gpu_ptr
        gpuMatrix2Bitmap<<<blocks,threads>>>(m_gpuData.gpu_u, m_gpuData.gpu_ptr);
        // Back on the pagelocked host memory
        cudaMemcpy(ptr, m_gpuData.gpu_ptr, DIM*DIM*4, cudaMemcpyDeviceToHost);
        if(first_droplet == 1) // By default there's just one initial droplet
        {
             first_droplet = 0;
             int x0d= DIM / 2; // Default droplet center
             int y0d= DIM / 2;
             cudaMemcpy(m_gpuData.ris0, m_gpuData.gpu_u, DIM*DIM*sizeof(float), cudaMemcpyDeviceToDevice);
             addDropletToU<<<blocks,threads>>>(m_gpuData.ris0, x0d,y0d, m_gpuData.gpu_u);
        }
        // Add all the remaining droplets in the queue
        while(dropletQueueCount >0)
        {
             // Take the last queued droplet (valid indices are 0..dropletQueueCount-1)
             dropletQueueCount--;
             int y0d = DIM - dropletQueue[dropletQueueCount].my;
             // Copy from u to one of our buffers
             cudaMemcpy(m_gpuData.ris0, m_gpuData.gpu_u, DIM*DIM*sizeof(float), cudaMemcpyDeviceToDevice);
             addDropletToU<<<blocks,threads>>>(m_gpuData.ris0, dropletQueue[dropletQueueCount].mx,y0d, m_gpuData.gpu_u);
        }
        // Synchronize to make sure all kernels executions are done
        cudaDeviceSynchronize();
    }

    The line

    un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolution((*data.u),(*data.D))*data.c1;

    has been completely superseded by several kernel calls that, in turn, perform the convolution, multiply the matrix data by the algorithm constants, and add the resulting terms together via CUBLAS. Everything runs on the device, including the point-to-RGB-value mapping (a highly parallelizable operation, since it must be performed for every value in the surface image matrix). Stepping forward in time is also done with device-side copies. Finally, the rendered data is copied back to the page-locked (pinned) host memory, and the droplets waiting in the queue are added to the u surface matrix for the next iteration.
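
To make the data flow of the two CUBLAS calls concrete, here is a minimal CPU-side sketch in plain C++. The helper `saxpy` is a hypothetical stand-in that only mirrors the saxpy semantics (y = alpha*x + y) with unit strides; it is not the CUBLAS API. The buffer names reuse ris0, ris1 and ris2 from the listing, and `sumThreeTerms` is an illustrative name, not part of the project.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for a unit-stride saxpy: y = alpha * x + y.
void saxpy(float alpha, const std::vector<float>& x, std::vector<float>& y)
{
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = alpha * x[i] + y[i];
}

// Accumulates the three partial results in the same order as the listing:
// first ris2 += ris0, then ris1 += ris2, so ris1 ends up holding the full sum.
std::vector<float> sumThreeTerms(std::vector<float> ris0,  // u * a1
                                 std::vector<float> ris1,  // convolution(u, D) * c1
                                 std::vector<float> ris2)  // u0 * a2
{
    saxpy(1.0f, ris0, ris2); // ris2 = u*a1 + u0*a2
    saxpy(1.0f, ris2, ris1); // ris1 = u*a1 + u0*a2 + convolution(u, D)*c1
    return ris1;
}
```

Calling `sumThreeTerms({1.0f, 2.0f}, {10.0f, 20.0f}, {100.0f, 200.0f})` returns the element-wise sum of the three vectors, which is why the listing reads the new surface from ris1 afterwards.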

    The CUDA kernels called by the kernel() function are the following:

                                    CUDA KERNELS
    // For a 512x512 image the grid is 170x170 blocks 3x3 threads each one
    __global__ void convolutionGPU(float *A, float *result)
    {
       __shared__ float data[laplacianD*2][laplacianD*2];
       // Absolute position into the image
       const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
       // Image-relative position
       const int x0 = threadIdx.x + IMUL(blockIdx.x,blockDim.x);
       const int y0 = threadIdx.y + IMUL(blockIdx.y,blockDim.y);
       // Load the apron and data regions
       int x,y;
       // Upper left square
       x = x0 - kernelRadius;
       y = y0 - kernelRadius;
       if(x < 0 || y < 0)
            data[threadIdx.x][threadIdx.y] = 0.0f;
       else
            data[threadIdx.x][threadIdx.y] = A[ gLoc - kernelRadius - IMUL(DIM,kernelRadius)];
       // Upper right square
       x = x0 + kernelRadius + 1;
       y = y0 - kernelRadius;
       if(x >= DIM || y < 0)
            data[threadIdx.x + blockDim.x][threadIdx.y] = 0.0f;
       else
            data[threadIdx.x + blockDim.x][threadIdx.y] = A[ gLoc + kernelRadius+1 - IMUL(DIM,kernelRadius)];
       // Lower left square
       x = x0 - kernelRadius;
       y = y0 + kernelRadius+1;
       if(x < 0 || y >= DIM)
            data[threadIdx.x][threadIdx.y + blockDim.y] = 0.0f;
       else
            data[threadIdx.x][threadIdx.y + blockDim.y] = A[ gLoc - kernelRadius + IMUL(DIM,(kernelRadius+1))];
       // Lower right square
       x = x0 + kernelRadius+1;
       y = y0 + kernelRadius+1;
       if(x >= DIM || y >= DIM)
             data[threadIdx.x + blockDim.x][threadIdx.y + blockDim.y] = 0.0f;
       else
             data[threadIdx.x + blockDim.x][threadIdx.y + blockDim.y] = A[ gLoc + kernelRadius+1 + IMUL(DIM,(kernelRadius+1))];
       // Wait until every thread of the block has loaded its part of the tile
       __syncthreads();
       float sum = 0;
       x = kernelRadius + threadIdx.x;
       y = kernelRadius + threadIdx.y;
       // Execute the convolution in the shared memory (kernel is in constant memory)
    #pragma unroll
       for(int i = -kernelRadius; i<=kernelRadius; i++)
             for(int j=-kernelRadius; j<=kernelRadius; j++)
                      sum += data[x+i][y+j]  * gpu_D[i+kernelRadius][j+kernelRadius];
       // Transfer the result to global memory
       result[gLoc] = sum;
    }
    __global__ void multiplyEachElementby_c1(float *matrix, float *result)
    {
        // Absolute position into the image
        const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
        // Multiply each matrix element by c1
        result[gLoc] = matrix[gLoc]*gpu_c1;
    }
    __global__ void multiplyEachElementby_a1(float *matrix, float *result)
    {
        // Absolute position into the image
        const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
        // Multiply each matrix element by a1
        result[gLoc] = matrix[gLoc]*gpu_a1;
    }
    __global__ void multiplyEachElementby_a2(float *matrix, float *result)
    {
        // Absolute position into the image
        const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
        // Multiply each matrix element by a2
        result[gLoc] = matrix[gLoc]*gpu_a2;
    }
    // Associate a colormap RGB value to each point
    __global__ void gpuMatrix2Bitmap(float *matrix, BYTE *bitmap)
    {
        // Absolute position into the image
        const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
        int cvalue = (int)((matrix[gLoc] + 1.0f)/gpu_pp_step);
        if(cvalue < 0)
              cvalue = 0;
        else if(cvalue >= COLOR_NUM)
              cvalue = COLOR_NUM-1;
        bitmap[gLoc*4] = gpu_m_colorMap[cvalue][0];
        bitmap[gLoc*4 + 1] = gpu_m_colorMap[cvalue][1];
        bitmap[gLoc*4 + 2] = gpu_m_colorMap[cvalue][2];
        bitmap[gLoc*4 + 3] = 0xFF; // Alpha
    }
    // Add a Gaussian 2D droplet matrix to the surface data
    // Warning: this kernel has a high divergence factor, it is meant to be seldom called
    __global__ void addDropletToU(float *matrix, int x0d, int y0d, float *result)
    {
        // Absolute position into the image
        const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
        // Image relative position
        const int x0 = threadIdx.x + IMUL(blockIdx.x,blockDim.x);
        const int y0 = threadIdx.y + IMUL(blockIdx.y,blockDim.y);
        // Place the (x0d;y0d) centered Zd droplet on the wave data (it will be added at the next iteration)
        if (x0 >= x0d-gpu_dsz*2 && y0 >= y0d-gpu_dsz*2 && x0 <= x0d+gpu_dsz*2 && y0 <= y0d+gpu_dsz*2)
        {
             // Add to result the matrix value plus the Zd corresponding value
             result[gLoc] = matrix[gLoc] + gpu_Zd[x0 -(x0d-gpu_dsz*2)][y0 - (y0d-gpu_dsz*2)];
        }
        else
            result[gLoc] = matrix[gLoc]; // This value shouldn't be changed
    }

    Notice that we preferred to “hardcode” the use of each constant in a separate kernel rather than introduce divergence with a conditional branch. The only kernel that increases thread divergence is addDropletToU, since only a few threads actually perform the Gaussian packet-starting routine (see the theoretical algorithm described a few paragraphs ago), but this isn’t a problem because it is called infrequently.
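
Every kernel above starts by computing the same linear offset gLoc. As a rough illustration, the plain C++ sketch below reproduces that arithmetic on the CPU and shows that it equals y0*DIM + x0 for a row-major image. `Dim2`, `linearIndex`, `imageX` and `imageY` are hypothetical names introduced only for this example, and the index values in the usage note are arbitrary.

```cpp
// Hypothetical plain-C++ mirror of the 2D CUDA built-ins (threadIdx, blockIdx, blockDim).
struct Dim2 { int x; int y; };

// Same arithmetic as the gLoc expression in the kernels, for a row-major image `dim` pixels wide.
constexpr int linearIndex(Dim2 threadIdx, Dim2 blockIdx, Dim2 blockDim, int dim)
{
    return threadIdx.x + blockIdx.x * blockDim.x
         + threadIdx.y * dim + blockIdx.y * blockDim.y * dim;
}

// Image-relative column, as computed for x0 in the kernels.
constexpr int imageX(Dim2 threadIdx, Dim2 blockIdx, Dim2 blockDim)
{
    return threadIdx.x + blockIdx.x * blockDim.x;
}

// Image-relative row, as computed for y0 in the kernels.
constexpr int imageY(Dim2 threadIdx, Dim2 blockIdx, Dim2 blockDim)
{
    return threadIdx.y + blockIdx.y * blockDim.y;
}
```

For thread (2,1) of block (4,7) with 3x3 blocks and a 512-pixel-wide image, `imageX` gives 14, `imageY` gives 22, and `linearIndex` gives 22*512 + 14.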

    Performance comparison

    The timing measurements and performance comparisons were performed on the following system:

    Intel Core 2 Quad CPU Q9650 @ 3.00 GHz

    6 GB RAM

    64-bit OS

    NVIDIA GeForce GTX 285 (1GB DDR3 @ 1476 MHz, 240 CUDA cores)

    The projects were compiled with CUDA 4.2; if you have problems, make sure to install the right version or change it as described in the readme file.

    To benchmark the CUDA kernel execution we used the cudaEventCreate / cudaEventRecord / cudaEventSynchronize / cudaEventElapsedTime functions shipped with every CUDA version, while for the CPU version we used two Windows platform-dependent APIs: QueryPerformanceFrequency and QueryPerformanceCounter.
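
The CPU-side measurement pattern (read a high-resolution counter, run the work, read the counter again) can be sketched portably as follows. This is not the article's code: `std::chrono::steady_clock` is swapped in for the Windows QueryPerformanceFrequency / QueryPerformanceCounter pair, and `timeMilliseconds` is a hypothetical helper name.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double timeMilliseconds(Work work)
{
    const auto start = std::chrono::steady_clock::now(); // monotonic clock, never goes backwards
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A host timer like this only measures GPU work correctly if the host waits for the device before the second reading, because kernel launches return immediately. That is one reason to time the GPU side with CUDA events instead.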

    We split the benchmark into four stages: a startup stage, where the only droplet in the image is the default one; a second stage, once both the CPU and the GPU versions have stabilized; a third stage, in which we add 60-70 droplets to the rendering queue; and a final stage, in which the application is left running for 15-20 minutes. In every test the CPU version performed worse than the GPU version, which could rely on a large grid of threads ready to split up a sudden significant workload and so deliver a fixed rendering time. In the long run, on the other hand, although the GPU still did better, the CPU version showed a small performance improvement, perhaps thanks to caching mechanisms.

    Notice that an application operating on larger data would certainly have benefited more from a massively parallel approach. Our wave PDE simulation is quite simple and does not involve a significant workload, which limits the performance gain that can be achieved.

    Once and for all: there is no general rule for converting a sequentially designed algorithm into a parallel one; each case must be evaluated in its own context and architecture. Also, CUDA can provide a great advantage in scientific simulations, but the GPU should not be considered a substitute for the CPU; it is rather an algebraic coprocessor that can rely on massive data parallelism. Combining sequential CPU code with parallel GPU code is the key to success.

    CUDA Kernels best practices

    The last, but not least, section of this paper provides a checklist of best practices and of errors to avoid when writing CUDA kernels, in order to get the most from your GPU-accelerated application.

    1. Minimize host <-> device transfers, especially device -> host transfers, which are slower, even if that means running on the device kernels that would not have been slower on the CPU.
    2. Use pinned (page-locked) memory on the host side to exploit bandwidth advantages. Be careful not to abuse it or you’ll slow your entire system down.
    3. cudaMemcpy is a blocking function; if possible, use cudaMemcpyAsync with pinned memory and a CUDA stream in order to overlap transfers with kernel executions.
    4. If your graphics card is integrated, zero-copy memory (that is: pinned memory allocated with the cudaHostAllocMapped flag) always grants an advantage. If not, there is no certainty, since the memory is not cached by the GPU.
    5. Always check your device compute capability (use cudaGetDeviceProperties); if it is < 2.0 you cannot use more than 512 threads per block (and at most 65535 blocks per grid dimension).
    6. Check your graphics card specifications: when launching an AxB block grid kernel, each SM (streaming multiprocessor) can serve part of it. If your card has a maximum of 1024 threads per SM, you should size your blocks to fill as many of them as possible, but not too many (otherwise you would incur scheduling latencies). A warp is usually 32 threads (although this is not a fixed value and is architecture dependent) and is the minimum scheduling unit on each SM (only one warp is executed at any time and all threads in a warp execute the same instruction - SIMD). Consider the following example: on a GT200 card you need to perform a matrix multiplication. Should you use 8x8, 16x16 or 32x32 threads per block?

    For 8x8 blocks, we have 64 threads per block. Since each SM can take up to 1024 threads, that would allow 16 blocks (1024/64). However, each SM can only take up to 8 blocks, so only 512 (64*8) threads will go into each SM -> SM execution resources are under-utilized; fewer warps to schedule around long-latency operations.

    For 16x16 blocks, we have 256 threads per block. Since each SM can take up to 1024 threads, it can take up to 4 blocks (1024/256) and the 8 blocks limit isn’t hit -> full thread capacity in each SM and the maximum number of warps for scheduling around long-latency operations (1024/32 = 32 warps).

    For 32x32 blocks, we have 1024 threads per block -> not even one block can fit into an SM! (There’s a limit of 512 threads per block.)
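
The arithmetic of the three cases above can be sketched in a few lines of plain C++. The limits are the ones quoted in the example (1024 threads per SM, 8 blocks per SM, 512 threads per block) and are assumptions to be replaced with your own card's values. `SmLimits` and `residentThreadsPerSM` are hypothetical names, and register and shared-memory limits are deliberately ignored.

```cpp
// Limits quoted in the text for the GT200 example (assumed; check your own card).
struct SmLimits {
    int maxThreadsPerSM    = 1024;
    int maxBlocksPerSM     = 8;
    int maxThreadsPerBlock = 512;
};

// Threads that can be resident on one SM for a given block size, taking only
// the three limits above into account (registers and shared memory are ignored).
int residentThreadsPerSM(int threadsPerBlock, SmLimits limits = SmLimits{})
{
    if (threadsPerBlock <= 0 || threadsPerBlock > limits.maxThreadsPerBlock)
        return 0; // such a block cannot be launched at all
    int blocks = limits.maxThreadsPerSM / threadsPerBlock; // bound by the thread capacity
    if (blocks > limits.maxBlocksPerSM)
        blocks = limits.maxBlocksPerSM;                    // bound by the block capacity
    return blocks * threadsPerBlock;
}
```

With these limits, `residentThreadsPerSM(8 * 8)` yields 512, `residentThreadsPerSM(16 * 16)` yields 1024, and `residentThreadsPerSM(32 * 32)` yields 0, matching the three cases discussed above.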

    1. Check the SM register limit per block and divide it by the number of threads: you’ll get the maximum number of registers you can use in a thread. Exceeding it by just one register per thread will cause fewer warps to be scheduled and decrease performance.
    2. Check the shared memory available per block and make sure you don’t exceed that value. Exceeding it will cause fewer warps to be scheduled and decrease performance.
    3. Always check that the number of threads does not exceed the maximum supported by your device.
    4. Use reduction and locality techniques wherever applicable.
    5. If possible, never split your kernel code into conditional branches (if-then-else): different paths force the same warp to execute more than once, with the resulting overhead.
    6. Use reduction techniques with divergence minimization when possible (that is, try to perform a reduction with warps performing a coalesced read per cycle, as described in chapter 6 of the Kirk and Hwu book).
    7. Coalescing is achieved by making the hardware read consecutive data. If each thread in a warp accesses consecutive memory, coalescing is significantly increased (half of the threads in a warp should access global memory at the same time); that’s why, with large (row-major) matrices, reading by columns is better than reading by rows. Coalescing can be increased with locality (i.e. threads cooperating in loading data needed by other threads); kernels should perform coalesced reads with locality in mind in order to maximize memory throughput. Storing values in shared memory (when possible) is good practice too.
    8. Make sure you don’t have unused threads/blocks, by design or because of your code. As said before, graphics cards have limits such as maximum threads per SM and maximum blocks per SM. Designing your grid without consulting your card specifications is highly discouraged.
    9. As stated before, using more registers than the card’s register limit is a recipe for performance loss. However, adding a register may also cause instructions to be added, which means more time to overlap transfers or to schedule warps, and therefore better performance. Again: there’s no general rule; you should abide by the best practices when designing your code and then experiment for yourself.
    10. Data prefetching means preloading data you don’t actually need at the moment to gain performance in a future operation (closely related to locality). Combining data prefetching with matrix tiles can solve many long-latency memory access problems.
    11. Unrolling loops is preferable when applicable (for example, when the loop is small). Ideally, loop unrolling should be done automatically by the compiler, but it is always better to check that it actually is.
    12. Reduce thread granularity with rectangular tiles (chapter 6 of the Kirk and Hwu book) when working with matrices, to avoid multiple row/column reads from global memory by different blocks.
    13. Textures are cached memory; if used properly they are significantly faster than global memory. Textures are better suited to accesses with 2D spatial locality (e.g. multidimensional arrays) and might perform better in some specific cases (again, this is a case-by-case rule).
    14. Try to maintain at least 25% of the overall registers occupied; otherwise access latency cannot be hidden by other warps’ computations.
    15. The number of threads per block should always be a multiple of warp size to favor coalesced readings.
    16. Generally, if an SM supports more than one block, more than 64 threads per block should be used.
    17. Good starting values for experimenting with kernel grids are between 128 and 256 threads per block.
    18. High-latency problems might be solved by using more blocks with fewer threads instead of just one block with a lot of threads per SM. This is especially important for kernels that often call __syncthreads().
    19. If the kernel fails, use cudaGetLastError() to check the error value; you might have used too many registers, too much shared or constant memory, or too many threads.
    20. CUDA functions and hardware perform best with float data type. Its use is highly encouraged.
    21. Integer division and modulus (%) are expensive operators; replace them with bitwise operations whenever possible. If n is a power of 2, i/n is equivalent to i >> log2(n) and i % n is equivalent to i & (n-1).
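
As a small illustration of the power-of-2 case, the following plain C++ sketch replaces division and modulus with a shift and a mask for unsigned values. `isPowerOfTwo`, `modPow2` and `divPow2` are hypothetical helper names; note that `divPow2` takes the exponent log2(n) rather than n itself.

```cpp
#include <cstdint>

// True when n is a power of two (exactly one bit set).
constexpr bool isPowerOfTwo(std::uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// i % n for a power-of-two n: keep only the low log2(n) bits.
constexpr std::uint32_t modPow2(std::uint32_t i, std::uint32_t n)
{
    return i & (n - 1);
}

// i / n for n = 2^log2n: drop the low log2n bits.
constexpr std::uint32_t divPow2(std::uint32_t i, std::uint32_t log2n)
{
    return i >> log2n;
}
```

For example, with n = 16 (log2(n) = 4), `modPow2(77, 16)` equals 77 % 16 and `divPow2(77, 4)` equals 77 / 16. Compilers usually apply this transformation themselves when n is a compile-time constant, so the manual form matters mostly when n is only known at run time.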

    1. Avoid automatic data conversion from double to float if possible.
    2. Mathematical functions preceded by __ are hardware implemented; they’re faster but less accurate. Use them if precision is not a critical goal (e.g. __sinf(x) rather than sinf(x)).
    3. Use signed integers rather than unsigned integers as loop counters because some compilers optimize signed integers better (signed overflow is undefined behavior, so compilers can optimize more aggressively).
    4. Remember that floating point math is not associative because of round-off errors. Massively parallel results might differ because of this.
    5. If your device supports concurrent kernel executions (see the concurrentKernels device property) you might be able to gain additional performance by running kernels in parallel. Check your device specifications and the CUDA programming guides.
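
The remark on floating point associativity is easy to reproduce on the CPU. The sketch below, in plain C++ with hypothetical function names, adds the same three floats in two different groupings, which is what happens when a parallel reduction combines partial sums in a different order than a sequential loop. The `volatile` temporaries are only there to force each partial sum to be rounded to single precision.

```cpp
// Adds three floats as (a + b) + c.
float sumLeftToRight(float a, float b, float c)
{
    volatile float partial = a + b; // rounded to single precision here
    return partial + c;
}

// Adds the same three floats as a + (b + c).
float sumRightToLeft(float a, float b, float c)
{
    volatile float partial = b + c; // rounded to single precision here
    return a + partial;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, the first grouping returns 1.0f, while the second returns 0.0f, because -1e8f + 1.0f rounds back to -1e8f in single precision.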

    This concludes our article. The goal was to explore how GPGPU techniques can improve scientific simulations and programs that manipulate large amounts of data.




    This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

    About the Author

    Alesiani Marco

    I'm a Computer Science Engineer and I've been programming with a large variety of technologies for years. I love writing software with C/C++, CUDA, .NET and playing around with reverse engineering