New trends in High Performance Computing
[Figure: Typical high-performance cluster architecture]
New hardware and software technologies can reduce costs and computational time very effectively. For a cluster to be productive, the right choice of operating system, computer hardware, interconnect and disk storage is crucial. Deployment and support for the installation of computational software must also be taken into account, so that the solution is cost-effective and does not become a nightmare for users and administrators.
Operating system and queue system
Two worlds lead cluster development today: Linux with the Perceus project, and Microsoft Windows HPC Server 2008.
Perceus
Perceus is a next-generation cluster and enterprise toolkit for the deployment, provisioning, and management of groups of servers. Using the Perceus OS and framework, the user can quickly provision a machine out of the box. Perceus treats the computer as a commodity, allowing an organization to manage large numbers of machines in a scalable fashion.
Perceus is developed and provided to the world under the GNU GPL by Infiscale.com.
[Figure: When using the VGL Image Transport (formerly "Direct Mode"), the 3D rendering occurs on the application server, but the 2D rendering occurs on the client machine. VirtualGL compresses the rendered images from the 3D application and sends them as a video stream to the client, which decompresses and displays the video stream in real time.]
HPC Server 2008
Windows HPC Server 2008 provides a productive, cost-effective high-performance computing (HPC) solution that runs on 64-bit (x64) hardware. It can be deployed, managed, and extended using familiar tools and technologies. It enables broader adoption of HPC by providing a rich and integrated end-user experience, ranging from the desktop application to the cluster. A wide range of software vendors, in various verticals, have designed their applications to work seamlessly with Windows HPC Server 2008, so that users can submit and monitor jobs from within familiar applications without having to learn new or complex user interfaces.

The queue system: the heart of a cluster
A queue system involves several components. The descriptions below use Grid Engine terminology.
HOSTS
- Master host – The master host is central to the overall cluster activity. The master host runs the master daemon sge_qmaster. This daemon controls all Grid Engine system scheduling and components, such as queues and jobs. The daemon maintains tables about the status of the components, user access permissions, etc. By default, the master host is also an administration host and a submit host.
- Execution hosts – Execution hosts are systems allowed to execute jobs. Therefore, queue instances are attached to the execution hosts. Execution hosts run the execution daemon, sge_execd.
- Administration hosts – Administration hosts are hosts allowed to carry out any kind of administrative activity for the Grid Engine system.
- Submit hosts – Submit hosts enable users to submit and control batch jobs only. In particular, a user who is logged in to a submit host can submit jobs with the qsub command and monitor the job status with the qstat command.
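
As a rough illustration of what a user does on a submit host, the sketch below builds the argument lists for `qsub` and `qstat` without running them. The helper names, the script name `run_solver.sh`, the user name `alice` and the resource values are hypothetical; `-N`, `-cwd`, `-l`, `-q` and `-u` are standard Grid Engine options, but which resource names are available depends on how the cluster is configured.

```python
from typing import Dict, List, Optional


def build_qsub_command(script: str, job_name: str,
                       resources: Optional[Dict[str, str]] = None,
                       queue: Optional[str] = None) -> List[str]:
    """Build (but do not run) the argument list for a Grid Engine qsub call."""
    cmd = ["qsub", "-N", job_name, "-cwd"]
    if resources:
        # -l takes a comma-separated list of resource=value requests.
        request = ",".join(f"{name}={value}"
                           for name, value in sorted(resources.items()))
        cmd += ["-l", request]
    if queue:
        # Optional: bind the job to one named queue instead of letting
        # the scheduler choose.
        cmd += ["-q", queue]
    cmd.append(script)
    return cmd


def build_qstat_command(user: str) -> List[str]:
    """Build the argument list that lists the jobs of one user."""
    return ["qstat", "-u", user]


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from a real cluster).
    print(" ".join(build_qsub_command("run_solver.sh", "crash_test",
                                      {"mem_free": "8G"})))
    print(" ".join(build_qstat_command("alice")))
```

On a real submit host the returned lists could be passed to `subprocess.run(cmd, check=True)`.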

QUEUES
A queue is a container for a class of jobs allowed to run on one or more hosts concurrently. A queue determines certain job attributes, for example, whether the job can be migrated. Throughout its lifetime, a running job is associated with its queue, and this association affects some of the things that can happen to the job. For example, if a queue is suspended, all jobs associated with that queue are also suspended. Jobs do not need to be submitted directly to a queue. If you do submit a job to a specific queue, the job is bound to that queue, and the Grid Engine system daemons can no longer select a better-suited host or one with a lighter load.
Normally you only need to specify the requirement profile of the job. A profile might include requirements such as memory, operating system, available software, and so forth. The Grid Engine software automatically dispatches the job to a suitable queue and a suitable host with a light execution load.
A queue can reside on a single host or can extend among multiple hosts. For this reason, Grid Engine system queues are also referred to as cluster queues. Cluster queues enable users and administrators to work with a cluster of execution hosts by means of a single queue configuration. Each host that is attached to a cluster queue receives its own queue instance from the cluster queue.
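
The dispatch behaviour described above can be sketched as a toy model: each host attached to a cluster queue gets a queue instance, and a job with a requirement profile goes to the matching instance with the lightest load. This is only an illustration under simplifying assumptions (exact attribute matching, a single load number per host); the class, the function and the queue and host names are hypothetical and are not Grid Engine's scheduler.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class QueueInstance:
    """One host's share of a cluster queue."""
    queue: str
    host: str
    load: float                                   # lower means less busy
    attributes: Dict[str, str] = field(default_factory=dict)
    suspended: bool = False


def dispatch(profile: Dict[str, str],
             instances: List[QueueInstance],
             bound_queue: Optional[str] = None) -> Optional[QueueInstance]:
    """Pick the least-loaded queue instance that satisfies the profile.

    If bound_queue is given, only instances of that queue are considered,
    which mirrors a job submitted directly to a named queue.
    """
    candidates = [
        qi for qi in instances
        if not qi.suspended
        and (bound_queue is None or qi.queue == bound_queue)
        and all(qi.attributes.get(key) == value
                for key, value in profile.items())
    ]
    return min(candidates, key=lambda qi: qi.load, default=None)


if __name__ == "__main__":
    instances = [
        QueueInstance("all.q", "node01", 0.90, {"os": "linux"}),
        QueueInstance("all.q", "node02", 0.20, {"os": "linux"}),
        QueueInstance("short.q", "node04", 0.05, {"os": "linux"}),
    ]
    free_choice = dispatch({"os": "linux"}, instances)
    bound_choice = dispatch({"os": "linux"}, instances, bound_queue="all.q")
    print(free_choice.host, bound_choice.host)
```

In this example the unbound job lands on `node04`, while the job bound to `all.q` has to settle for the busier `node02`, which is the cost of binding a job to a queue.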
License management
Most commercial software uses the FLEXlm (tm) license management system to distribute licenses. In recent months, combining the licensing system with the queue system has become a serious issue for large-scale, intensive optimization runs, for users and system administrators alike.
Licenses are checked out only after the job has already entered the queue system, and at that point it is too late to react if the request is denied because no licenses are left.
This is very disappointing for users who come back after the weekend to find that their optimization job has not completed, just because other batch jobs were launched by other departments or because of network delays. Controlling this situation requires a deep understanding of how queue systems work and of the interactions between all system components: the customization must be well engineered to avoid interference between the license manager and the cluster.
We have developed many custom scripts for Sun Grid Engine (fully platform independent and portable to the Microsoft cluster system) to solve this problem, so that queued jobs start at the right time and are allocated the right licenses and sub-licenses.
In about 0.1% of cases these procedures will not work and a job will be spawned at the wrong time. This is a side effect of the communication among daemons (queue, system, cluster, etc.) and cannot be eliminated.
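
The idea of reserving licenses before a job is released can be sketched with a small in-memory ledger. This is not the custom Sun Grid Engine scripts mentioned above, only a minimal model under stated assumptions: the feature names `solver` and `optimizer` and all class and function names are hypothetical, and the ledger only knows about licenses taken through it, so seats checked out elsewhere would still cause the occasional mismatch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    name: str
    licenses: Dict[str, int]    # feature name -> number of seats needed


class LicenseLedger:
    """Tracks how many seats of each feature are reserved by running jobs."""

    def __init__(self, total: Dict[str, int]) -> None:
        self.total = dict(total)
        self.reserved = {feature: 0 for feature in total}

    def can_reserve(self, job: Job) -> bool:
        return all(self.reserved.get(f, 0) + n <= self.total.get(f, 0)
                   for f, n in job.licenses.items())

    def reserve(self, job: Job) -> bool:
        """Reserve every seat the job needs, or nothing at all."""
        if not self.can_reserve(job):
            return False
        for f, n in job.licenses.items():
            self.reserved[f] = self.reserved.get(f, 0) + n
        return True

    def release(self, job: Job) -> None:
        for f, n in job.licenses.items():
            self.reserved[f] -= n


def start_ready_jobs(pending: List[Job], ledger: LicenseLedger) -> List[Job]:
    """Start every pending job whose licenses can be reserved right now.

    Jobs that cannot get all their seats stay in the pending list.
    """
    started = []
    for job in list(pending):
        if ledger.reserve(job):
            pending.remove(job)
            started.append(job)
    return started


if __name__ == "__main__":
    ledger = LicenseLedger({"solver": 2, "optimizer": 1})
    pending = [Job("opt_run", {"solver": 1, "optimizer": 1}),
               Job("batch_a", {"solver": 1}),
               Job("batch_b", {"solver": 1})]
    print([job.name for job in start_ready_jobs(pending, ledger)])
    print([job.name for job in pending])
```

Note that this sketch lets later jobs overtake earlier ones that are still waiting for a seat; a production setup would need an explicit policy for that.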
[Figure: Example of a large model with non-linear material and deformations, solved on a 64-node cluster system]
Parallel applications
The development of parallel programs requires integrated development environments along with support for distributed computing standards. Visual Studio 2008 provides a comprehensive parallel programming environment for Windows HPC Server 2008. Besides supporting OpenMP, MPI, and Web Services, Windows HPC Server 2008 also supports third-party numerical library providers, performance optimizers, compilers, and a native parallel debugger for developing and troubleshooting parallel programs.
Common bottleneck sources
As the CAE industry continues an aggressive platform migration from proprietary Unix servers to commodity HPC clusters, CAE models are also becoming more realistic, requiring clusters to handle ever-increasing volumes of I/O and the movement of large files.
As organizations rapidly expand their cluster deployments, many encounter I/O bottlenecks when using legacy network attached storage (NAS) architectures.
Initially, these NAS systems offered advantages such as shared storage and simplified IT administration, which reduced costs, but today few of them provide the scalability required for effective I/O performance in parallel CAE simulations. Recently, a new class of shared parallel storage technology has been developed to remove serial bottlenecks and improve I/O performance, thereby extending the overall scalability of CAE simulations on clusters.
Parallel storage, in the form of parallel NAS, makes the most advanced and I/O-demanding CAE challenges practical. Examples include high-fidelity transient CFD, large eddy simulation (LES), aeroacoustics, large-DOF structural dynamic response, parameterized non-deterministic CAE simulations for design optimization, and the coupling of CAE disciplines such as fluid-structure interaction (FSI). CAE workflows lose productivity when engineers and scientists must wait for serial I/O operations and large file transfers to complete.
Furthermore, as simulation and workflow performance degrades, so do analyst efficiency and effective workgroup collaboration. Parallel storage removes these I/O bottlenecks with a cost-saving solution that restores productivity.
[Figure: Typical cluster management system and visualization nodes]
The benefits of parallel I/O for transient CFD were demonstrated with a production case: an ANSYS aerodynamics model of 111M cells, provided by an industrial truck manufacturer. Figure 2 below illustrates the I/O schematic of the performance tests, which comprised a case file read, a compute solve of 5 time steps with 100 iterations, and a write of the data file. In a full transient simulation the solve and write tasks would be repeated for a much larger number of time steps and iterations, with roughly the same amount of computational work for each repetition.
It is important to note that the performance of the CFD solver and its numerical operations is not affected by the choice of file system, which only performs I/O operations. That is, a CFD solver will perform the same on a given cluster regardless of whether a parallel file system or a serial NFS file system is used. The advantage of parallel I/O is best illustrated by comparing the computational profiles of each scheme. ANSYS CFD 12 on PanFS keeps the I/O share of the total job time in the range of 3% at 64 cores to 8% at 256 cores, whereas 6.3 and NFS spend as much as 50% of the total job time in I/O.
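
To see what these I/O shares mean for total job time, the following back-of-the-envelope sketch converts an I/O fraction into a relative job time. It assumes that the compute time is identical in both cases (consistent with the statement that the solver is unaffected by the file system) and compares the 50% and 8% figures as if they were measured on the same core count, which the source does not state; the function names are hypothetical.

```python
def io_time_per_compute(io_fraction: float) -> float:
    """I/O time in units of compute time, given I/O's share of the whole job."""
    if not 0.0 <= io_fraction < 1.0:
        raise ValueError("io_fraction must be in [0, 1)")
    return io_fraction / (1.0 - io_fraction)


def job_speedup(io_fraction_serial: float, io_fraction_parallel: float) -> float:
    """Ratio of total job times (serial I/O over parallel I/O).

    Compute time is normalized to 1.0 and assumed equal in both cases.
    """
    t_serial = 1.0 + io_time_per_compute(io_fraction_serial)
    t_parallel = 1.0 + io_time_per_compute(io_fraction_parallel)
    return t_serial / t_parallel


if __name__ == "__main__":
    # 50% of job time in I/O versus 8% and 3% (figures quoted in the text).
    print(round(job_speedup(0.50, 0.08), 2))   # 1.84
    print(round(job_speedup(0.50, 0.03), 2))   # 1.94
```

Under these assumptions, cutting the I/O share from 50% to 8% shortens the whole job by a factor of about 1.8 even though the solver itself runs no faster.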
Visualization and Postprocessing
Another important issue for large clusters is the visualization and post-processing of results over relatively slow networks. An effective solution is to perform the 3D rendering with OpenGL inside the cluster and give the client a remote display.
VirtualGL is an open source package which gives any Unix or Linux remote display software the ability to run OpenGL applications with full 3D hardware acceleration. Some remote display software, such as VNC, lacks the ability to run OpenGL applications at all.
Other remote display software forces OpenGL applications to use a slow software-only OpenGL renderer, to the detriment of performance as well as compatibility. The traditional method of displaying OpenGL applications on a remote X server (indirect rendering) supports 3D hardware acceleration, but this approach causes all of the OpenGL commands and 3D data to be sent over the network to be rendered on the client machine. This is not a tenable proposition unless the data is relatively small and static, the network is very fast, and the OpenGL application is specifically tuned for a remote X-Windows environment.
[Figure: An example of mesh generation for a reactor pressure vessel, 11 million nodes and 35 million DOFs.]
With VirtualGL, the OpenGL commands and 3D data are instead redirected to a 3D graphics accelerator on the application server, and only the rendered 3D images are sent to the client machine. Thus VirtualGL "virtualizes" 3D graphics hardware, allowing it to be placed in the "cold room" with compute and storage resources. VirtualGL also allows the 3D graphics hardware to be shared among multiple users and provides "workstation-like" levels of performance even on the most modest of networks. This makes it possible for large, noisy, hot 3D workstations to be replaced with laptops or even thinner clients. More importantly, VirtualGL eliminates the workstation and the network as barriers to data size. Users can now visualize gigabytes of data in real time without copying any of the data over the network or sitting in front of the machine that is rendering it.
Usually, a Unix OpenGL application would send all of its drawing commands and data, both 2D and 3D, to an X-Windows server which may be located across the network from the application server. VirtualGL, however, employs a technique called "split rendering" to force the 3D commands from the application to go to a 3D graphics card in the application server. VGL does this by pre-loading a dynamic shared object (DSO) into the application at run time. This DSO intercepts a handful of GLX, OpenGL, and X11 commands that are necessary to perform the split rendering. Whenever a window is created by the application, VirtualGL creates a corresponding 3D pixel buffer ("Pbuffer") on a 3D graphics card in the application server.
Whenever the application requests that an OpenGL rendering context be created for the window, VirtualGL intercepts the request and creates the context on the corresponding Pbuffer instead. Whenever the application swaps or flushes the drawing buffer to indicate that it has finished rendering a frame, VirtualGL reads back the Pbuffer and sends the rendered 3D image to the client.
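
The control flow of split rendering can be mimicked with a toy model. This is not VirtualGL code: the real interposer is a preloaded native shared object that intercepts GLX, OpenGL and X11 calls, whereas the sketch below only imitates the routing with plain Python objects. All class and method names are hypothetical, and a "frame" is just a string.

```python
from typing import Dict, List


class ServerGPU:
    """Stand-in for the 3D graphics card in the application server."""

    def __init__(self) -> None:
        self.pbuffers: Dict[int, List[str]] = {}

    def create_pbuffer(self, window_id: int) -> None:
        self.pbuffers[window_id] = []

    def render(self, window_id: int, command: str) -> None:
        self.pbuffers[window_id].append(command)

    def read_back(self, window_id: int) -> str:
        """Return the finished frame and clear the buffer for the next one."""
        frame = "image(" + ",".join(self.pbuffers[window_id]) + ")"
        self.pbuffers[window_id].clear()
        return frame


class Client:
    """Stand-in for the remote machine: receives 2D commands and 3D frames."""

    def __init__(self) -> None:
        self.x11_commands: List[str] = []
        self.frames: List[str] = []


class SplitRenderer:
    """Interposer: 2D calls go to the client, 3D calls stay on the server."""

    def __init__(self, gpu: ServerGPU, client: Client) -> None:
        self.gpu = gpu
        self.client = client

    def create_window(self, window_id: int) -> None:
        # The window itself is created on the client's display ...
        self.client.x11_commands.append(f"create_window({window_id})")
        # ... and a matching Pbuffer is created on the server GPU.
        self.gpu.create_pbuffer(window_id)

    def create_context(self, window_id: int) -> str:
        # The rendering context is attached to the Pbuffer, not the window.
        return f"context-on-pbuffer-{window_id}"

    def draw_3d(self, window_id: int, command: str) -> None:
        self.gpu.render(window_id, command)

    def swap_buffers(self, window_id: int) -> None:
        # Frame finished: read back the Pbuffer and ship only the image.
        self.client.frames.append(self.gpu.read_back(window_id))


if __name__ == "__main__":
    client = Client()
    renderer = SplitRenderer(ServerGPU(), client)
    renderer.create_window(1)
    renderer.create_context(1)
    renderer.draw_3d(1, "mesh")
    renderer.draw_3d(1, "contours")
    renderer.swap_buffers(1)
    print(client.x11_commands, client.frames)
```

The point of the model is that the client only ever receives the window-creation command and finished images; the 3D drawing commands and data never leave the server.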
For further information:
Ing. Gino Perna - ICT Manager
info@enginsoft.it
EnginSoft provides a full range of HPC solutions, from ready-to-use systems to dedicated HPC setups for specific needs in the simulation market.
EnginSoft's expertise covers system configuration, queue control, monitoring tools, licensing integration and the building of heterogeneous systems, in order to maintain cluster efficiency over time.
Integration with parallel file systems and remote graphics systems is also continuously monitored, to provide our customers with best-in-class solutions.