Network multiple computers/processors for scientific parallel computing

I. Introduction to a simple and free toolkit SocketPro for engineering computation

udaparts
06/17/2006

 

 

Download C++, C# and VB.NET source code -- ppi.zip

Contents

 

1.    Introduction

        Scientific computation places heavy demands on CPU power and is often time-consuming. It is not uncommon for a scientific computation to require many networked computers that solve a large, complicated problem in parallel to save wall-clock time. Two languages, FORTRAN and C/C++, are often used for developing parallel computations, and Java is sometimes used as well. Over the past decades, a number of technologies have been developed for this purpose, including MPI (Message Passing Interface), OpenMP, Grid computing, and Charm++. An internet search on these keywords will turn up many introductions, both detailed and concise.

        Parallel computation is popular in scientific computation, but the technique is not widely applied to common business applications. The basic reason is that the major processor manufacturers roughly doubled CPU speed every two years over the past three decades, so business software developers did not have to care much about code performance. However, this free lunch is now over, and all chip makers design their CPUs with multi-core architectures. To take full advantage of a multi-core architecture, programmers must write applications with concurrency (parallelism) in mind, because software scalability and performance depend largely on code parallelism. Parallel computation is becoming more and more important.

        This article comes with a simple sample that demonstrates the basic steps of parallel computation in C++, C# and VB.NET. The sample does not require much math knowledge. We are going to calculate the value of π as accurately as possible using numerical integration on multiple computers.

2.    Sample numerical integration

        Consider the problem of computing the value of π by numerically evaluating the integral π = ∫ 4/(1 + x²) dx, taken from x = 0 to x = 1.

        We can use simple numerical integration (the rectangle rule) to evaluate the integral. The basic idea is to fill the area under the curve with a series of tiny rectangles. As the width of the rectangles approaches 0, the sum of their areas approaches the true value of π. To get an accurate value, we must divide the area into as many small rectangles as possible. We are going to divide the integration into a set of sections and send them to different computing machines (servers). One central machine will collect the integral values of these sections and sum them to get the value of π.
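        The idea can be sketched on a single machine before any networking is involved. The following is only a minimal illustration, not part of the SocketPro sample: the function names IntegrateSection and EstimatePi are made up for this sketch, each rectangle is sampled at its midpoint, and the step is much coarser than the one used later in the article so that it finishes quickly.

```cpp
#include <cassert>

// Hypothetical helper: integrate 4/(1 + x*x) over nNum rectangles of
// width dStep starting at dStart, sampling each rectangle at its midpoint.
double IntegrateSection(double dStart, double dStep, int nNum)
{
    assert(dStep > 0.0 && nNum >= 0);
    double dSum = 0.0;
    for (int n = 0; n < nNum; ++n)
    {
        double dX = dStart + (n + 0.5) * dStep; // midpoint of rectangle n
        dSum += 4.0 / (1.0 + dX * dX);          // height of rectangle n
    }
    return dSum * dStep; // total area = sum of heights * common width
}

// Hypothetical driver: three sections covering 0 ~ 1 stand in for the
// three machines; the "center machine" simply adds the partial results.
double EstimatePi()
{
    const double dStep = 1.0e-6; // assumption: coarse step for a quick run
    return IntegrateSection(0.0, dStep, 400000)   // 0   ~ 0.4
         + IntegrateSection(0.4, dStep, 200000)   // 0.4 ~ 0.6
         + IntegrateSection(0.6, dStep, 400000);  // 0.6 ~ 1
}
```

        Because the sections do not depend on each other, each call to IntegrateSection could run on a different machine, which is what the rest of the article sets up.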

3.    Downloading SocketPro Package

        To try this sample parallel computation, download the SocketPro package from www.udaparts.com. After downloading the package, install it and make sure that your Visual C++ development environment includes the directory ..\include. You may also need to add the file sprowrap.cpp to the demo project manually so that it compiles. If you use C# or VB.NET, make sure that the sample projects reference SocketProAdapter (SProAdapter.dll for .NET version 2).

4.    Universal interface definition file

        To simplify development, SocketPro provides a tool, uidparser.exe, that quickly creates client and server skeleton code from a given universal interface definition file. For details, see tutorial one inside the SocketPro package. The file for this sample is pi.uid, which contains the following simple code.

[
        ServiceID = 1212
]
CPPi
{
        $double Compute(in double dStart, in double dStep, in int nNum);
}

        The parameter dStart is the starting point of a portion of the integration. The parameter dStep is the width of a tiny integration rectangle. The parameter nNum is the number of tiny rectangles assigned to a machine for integration. The request returns a portion of the integral as a double. Since it is a slow request, I put the character $ in front of it, so that uidparser generates code that runs the computation in a worker thread automatically. Next, find the tool uidparser.exe in the directory ..\bin, and create skeleton C++ client and server code by executing the following command at a command prompt.

        uidparser -FE:\uskt\PPi\pi.uid -L1

        Alternatively, you can generate C# and VB.NET skeleton code with the options -L0 and -L2, respectively.

5.    Designing and coding parallel application

       A.    Decomposition -- The first step in designing a parallel program is to break the problem into discrete "chunks" of work that can be distributed to different computers or processors for processing. This step is known as decomposition or partitioning.

        B.    Domain decomposition and functional decomposition -- These are the two basic ways to decompose a large problem. With domain decomposition, each processor or machine works on a portion of the data. This demo uses domain decomposition to partition the problem. See the code below from the file PPiDlg.cpp.

void CPPiDlg::PrepareComputeContexts()
{
        //divide integration sections according to machine performance

        //fast machine
        m_pComputeContext[0].m_strIPAddr = _T("192.168.1.100"); //or MyMachineName
        m_pComputeContext[0].m_nPort = 20901;
        m_pComputeContext[0].m_dStart = 0; //0 - 0.4
        m_pComputeContext[0].m_dStep = 0.000000001;
        m_pComputeContext[0].m_nNum = 400000000; //400000000 * 0.000000001 = 0.4

        //slow machine
        m_pComputeContext[1].m_strIPAddr = _T("localhost");
        m_pComputeContext[1].m_nPort = 20901;
        m_pComputeContext[1].m_dStart = 0.4; //0.4 - 0.6
        m_pComputeContext[1].m_dStep = 0.000000001;
        m_pComputeContext[1].m_nNum = 200000000; //200000000 * 0.000000001 = 0.2

        //fast machine
        m_pComputeContext[2].m_strIPAddr = _T("yyexp");
        m_pComputeContext[2].m_nPort = 20901;
        m_pComputeContext[2].m_dStart = 0.6; //0.6 - 1
        m_pComputeContext[2].m_dStep = 0.000000001;
        m_pComputeContext[2].m_nNum = 400000000; //400000000 * 0.000000001 = 0.4
}

        This sample client application builds three socket connections to three different machines: 192.168.1.100, localhost and yyexp. Comparing the code with the integral in Section 2, you will find that it decomposes the integration into three separate parts: 0 ~ 0.4, 0.4 ~ 0.6 and 0.6 ~ 1. Each tiny rectangle has a width of 0.000000001, as detailed in the code comments above. In total, 1,000,000,000 steps are used to calculate the value of π.
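        The section boundaries above are written by hand. As a rough sketch of how the same split could be derived from relative machine speeds, the following hypothetical helper (the names Section and Partition are not part of the sample) assigns rectangles in proportion to integer weights and gives any rounding remainder to the last machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one machine's share of the integration
struct Section
{
    double dStart; // starting point of the section
    double dStep;  // width of one rectangle
    int nNum;      // number of rectangles in the section
};

// Split nTotal rectangles of width dStep, starting at dBegin, in
// proportion to the given weights (for example 2 = fast, 1 = slow).
std::vector<Section> Partition(double dBegin, double dStep, int nTotal,
                               const std::vector<int>& weights)
{
    assert(!weights.empty() && nTotal >= 0);
    long long nWeightSum = 0;
    for (int w : weights)
    {
        assert(w > 0);
        nWeightSum += w;
    }

    std::vector<Section> sections;
    int nAssigned = 0; // rectangles already handed out
    for (std::size_t i = 0; i < weights.size(); ++i)
    {
        // the last section takes whatever is left, so nothing is lost to rounding
        int nNum = (i + 1 == weights.size())
            ? nTotal - nAssigned
            : static_cast<int>(static_cast<long long>(nTotal) * weights[i] / nWeightSum);
        sections.push_back({dBegin + nAssigned * dStep, dStep, nNum});
        nAssigned += nNum;
    }
    return sections;
}
```

        Calling Partition(0.0, 0.000000001, 1000000000, {2, 1, 2}) yields the same three sections of 400000000, 200000000 and 400000000 rectangles that the sample hard-codes.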

        When a large problem can be divided into different types of tasks, functional decomposition should be used to partition it. I will use a new sample to demonstrate functional decomposition.

        C.    Integrating a portion of π -- Look at the server implementation skeleton code in the file PiImpl.h and find the function Compute inside the class CPPiPeer. The code for integrating a portion of π is really simple, as follows.

void Compute(double dStart, double dStep, int nNum, /*out*/double &ComputeRtn)
{
        int n;
        int n100 = nNum/100; //number of rectangles per 1% of progress
        double dX = dStart;
        dX += dStep/2; //sample each rectangle at its midpoint
        double dd = dStep * 4.0;
        ComputeRtn = 0.0;
        for(n=0; n<nNum; n++)
        {
                ComputeRtn += dd/(1 + dX*dX);
                //advance only after using the current midpoint, so the first
                //rectangle is not skipped and the last stays inside the section
                dX += dStep;
                if(n100 > 0 && (n%n100) == 0 && n>0)
                {
                        int nPercent = n/n100;
                        //send a completion percentage to a client
                        ULONG ulRtn = SendReturnData(idComputing, (BYTE*)&nPercent, sizeof(nPercent));

                        if(ulRtn == SOCKET_NOT_FOUND || ulRtn == REQUEST_CANCELED)
                        {
                                //when a client closes socket or cancels the computation,
                                //set the returned result to NOT_COMPLETED
                                ComputeRtn = NOT_COMPLETED;
                                break;
                        }
                }
        }
}

        The above code also sends a result (idComputing) to the client to indicate computing progress. Additionally, SocketPro can monitor network events and a special Cancel request while the computation is running. When a client wants to stop computing, the loop can be exited gracefully.
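        The progress-and-cancel pattern does not depend on SocketPro itself. The sketch below shows the same loop with a plain callback standing in for SendReturnData; the names ComputeWithProgress and NOT_COMPLETED_SKETCH, and the sentinel value -1.0, are assumptions made for this illustration only. The callback receives the completed percentage and returns false to request cancellation.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sentinel; the integral is always positive, so -1.0 is unambiguous
const double NOT_COMPLETED_SKETCH = -1.0;

// Integrate 4/(1 + x*x) over one section, reporting progress about once
// per percent. onProgress stands in for sending a message to the client.
double ComputeWithProgress(double dStart, double dStep, int nNum,
                           const std::function<bool(int)>& onProgress)
{
    assert(dStep > 0.0 && nNum >= 0);
    const int n100 = nNum / 100; // rectangles per 1% of progress
    double dSum = 0.0;
    for (int n = 0; n < nNum; ++n)
    {
        double dX = dStart + (n + 0.5) * dStep; // midpoint of rectangle n
        dSum += 4.0 / (1.0 + dX * dX);
        if (n100 > 0 && n > 0 && (n % n100) == 0)
        {
            // a false return plays the role of a closed socket or a Cancel request
            if (!onProgress(n / n100))
                return NOT_COMPLETED_SKETCH;
        }
    }
    return dSum * dStep;
}
```

        Checking for cancellation only once per percent keeps the overhead of the callback negligible compared with the arithmetic in the loop.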

        D.    Asynchronous requests, barriers, synchronizations, and collecting results -- Parallel computation starts with asynchronous requests from a central computer. Usually, you would issue these requests to the different computing machines from separate worker threads, and sum the returned results with careful synchronization of shared variables. Parallel computation also requires setting up events or callbacks to track returned results. Now, let's look at the code inside the function void CPPiDlg::OnComputeButton() in the file PPiDlg.cpp.

//send requests to servers without waiting for results
for(n=0; n<COMPUTE_PI_COUNT; n++)
{
        //send asynchronous requests to different machines or processors
        m_pComputeContext[n].m_PiHanlder.ComputeAsyn(m_pComputeContext[n].m_dStart, m_pComputeContext[n].m_dStep, m_pComputeContext[n].m_nNum);
}

//now, we are waiting for each of results
for(n=0; n<COMPUTE_PI_COUNT; n++)
{
        m_pComputeContext[n].m_ClientSocket.WaitAll(); //WaitAll -- barrier
        if(m_pComputeContext[n].m_PiHanlder.m_ComputeRtn != NOT_COMPLETED)
        {
                //collect and sum every portion of integral values
                m_dPi += m_pComputeContext[n].m_PiHanlder.m_ComputeRtn;
        }
}

        Look at the comments in the code above. It should be clear how to send asynchronous requests, wait at a barrier for the results, and collect and sum the results with SocketPro. However, you will not find any code that creates threads to issue the asynchronous requests, nor any thread-locking objects to synchronize variables. Why? SocketPro is built on non-blocking socket communication, which is a natural fit for parallel computation. There is no need to use worker threads to issue asynchronous requests, and because no worker threads are involved, no locking objects are required for any variables. As you can see, this feature simplifies the development of parallel computations.

        The next question is how to track the various results returned from a server. In fact, the tool uidparser.exe has already done all of the work for you, as shown in the code below from the class CPPi in the file Pi.h.

virtual void OnResultReturned(unsigned short usRequestID, CUQueue &UQueue)
{
        if(m_err.m_hr != S_OK) return; //exception transferred from SocketPro server
        switch(usRequestID)
        {
        case idComputeCPPi:
                UQueue >> m_ComputeRtn;
                break;
        case idComputing:  //manually add code here to track computing progress
                UQueue >> m_nPercent;
                break;
        default:
                break;
        }
}

        Note that we manually added the code that tracks the computing progress returned from a server for idComputing. There is other code in the files PPiDlg.h and PPiDlg.cpp. As long as you are familiar with C++ and MFC, you can follow all of the code execution by setting breakpoints wherever you like.
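        To make the dispatch pattern easier to experiment with outside the toolkit, here is a small stand-alone sketch. It is not SocketPro code: MiniQueue is a hypothetical stand-in for a message buffer (it is not CUQueue), and the request IDs and the handler class are invented for the illustration. It shows one callback that handles both the final result and the intermediate progress messages by switching on the request ID.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical byte buffer: values are appended with Push and read back
// in the same order with operator>>.
class MiniQueue
{
public:
    template <typename T>
    void Push(const T& value)
    {
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
        m_bytes.insert(m_bytes.end(), p, p + sizeof(T));
    }

    template <typename T>
    MiniQueue& operator>>(T& value)
    {
        assert(m_bytes.size() - m_pos >= sizeof(T)); // enough unread bytes left
        std::memcpy(&value, m_bytes.data() + m_pos, sizeof(T));
        m_pos += sizeof(T);
        return *this;
    }

private:
    std::vector<unsigned char> m_bytes;
    std::size_t m_pos = 0; // read position
};

// Hypothetical request IDs for this sketch
const unsigned short idComputeSketch = 1;   // final result (a double)
const unsigned short idComputingSketch = 2; // progress message (an int)

struct PiHandlerSketch
{
    double m_ComputeRtn = 0.0;
    int m_nPercent = 0;

    // One entry point for every message; the ID decides how to decode it
    void OnResultReturned(unsigned short usRequestID, MiniQueue& queue)
    {
        switch (usRequestID)
        {
        case idComputeSketch:
            queue >> m_ComputeRtn;
            break;
        case idComputingSketch:
            queue >> m_nPercent;
            break;
        default:
            break; // unknown messages are ignored
        }
    }
};
```

        Because the sender and the receiver must agree on the type stored for each ID, keeping the IDs and the decoding code side by side, as the generated skeleton does, helps avoid mismatches.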

6.    Major challenges with parallel computation

        Programming a parallel application is not simple work. A programmer or designer will face a number of challenges, listed here.

  • Designing -- Dig into a large problem and find the best way to partition it. Pay special attention to overheads such as thread creation, communication, and data synchronization.

  • Implementation and human resources -- Programmers must be very comfortable with multithreading, network/inter-process communication, and data synchronization, and must select a proper toolkit for the job. Good logical reasoning is also essential.

  • Debugging -- Find a friendly tool to debug code. Debugging a parallel application is somewhat challenging.

        This toolkit is written for data communication among machines using 100% non-blocking sockets. Because of this, you usually don't have to create threads on either the client or the server side to parallelize a large problem. Because you don't use worker threads as much as with other toolkits, you also avoid synchronizing various global or static variables, which simplifies development. You can use IUSocket::WaitAll or IUSocket::Wait as a barrier for requests. As shown in this sample, debugging is also simple and pleasant with commercial MS Visual Studio (note that MS Visual Studio Express works fine too). Best of all, you can divide a large and complex problem into a set of small and simple problems, and solve them one by one. Additionally, SocketPro supports two socket connections for free: as long as there are no more than two socket connections to a server application, SocketPro runs without requiring any registration.

7.    Some references

        This is a very simple demo of parallel computation, designed for beginners. To help you understand parallel computation further, I list a number of references.