# PROGRAMMING AND OPTIMIZATION WITH INTEL XEON PHI COPROCESSORS

Colfax Developer Training One-day Labs CDT 102

**Abstract**: Colfax Developer Training (CDT) is an in-depth intensive course on efficient parallel programming of Intel Xeon family processors and Intel Xeon Phi coprocessors. The one-day labs course (CDT 102) features hands-on exercises on the available programming models and best optimization practices for the Intel many-core platform, and on the usage of the Intel software development and diagnostic tools. The prerequisite for this class is the one-day seminar CDT 101.

## 9 am to 4 pm: Hands-on session.

- Offload and Native: "Hello World" to complex; using MPI.
- Performance Analysis: VTune.
- Case Study: all aspects of tuning in the N-body calculation.
- **Optimization I**: strip-mining for vectorization, parallel reduction.
- **Optimization II**: loop tiling, thread affinity.

Intel Xeon Phi coprocessors, featuring the Intel Many Integrated Core (MIC) architecture, are novel many-core computing accelerators for highly parallel applications, capable of delivering greater performance per system and per watt than general-purpose CPUs. Unlike GPGPUs, they support traditional HPC programming frameworks, including OpenMP and MPI, and require the same software optimization methods as multi-core CPUs.
# Schedule

| Time | Session |
|-------------|---------|
| 9:00-9:30 | Remote Access Configuration, Lab Orientation |
| 9:30-10:30 | Programming with Explicit Offload <ul> <li>Offload pragmas and object markup</li> <li>Diagnostics and control with environment variables</li> <li>Data persistence and memory retention</li> <li>Multiple coprocessors</li> <li>Overlapping communication with computation</li> </ul> |
| 10:30-11:00 | Native Programming <ul> <li>Cross-compilation</li> <li>Running a native application with ssh, micnativeloadex</li> <li>Using native applications in MPI</li> </ul> |
| 11:00-12:00 | Performance Analysis <ul> <li>Using Intel VTune Amplifier</li> </ul> |
| | Lunch break |
| 1:00-2:00 | Comprehensive optimization: N-body calculation <ul> <li>all areas of optimization in one exercise</li> </ul> |
| 2:00-3:00 | Partnering vectors and cores: histogram example <ul> <li>strip-mining for vectorization</li> <li>eliminating synchronization through parallel reduction</li> <li>first-touch allocation impact on Xeon</li> </ul> |
| | Boosting memory and cache traffic: transposition example <ul> <li>loop tiling for cached data re-use</li> <li>compiler hints for vectorization</li> </ul> |

Instructor: Vadim Karpusenko, Ph.D., is Principal HPC Research Engineer at Colfax International, involved in training and consultancy projects on data mining, software development and statistical analysis of complex systems. His research interests are in the area of physical modeling with HPC clusters, highly parallel architectures, and code optimization. Vadim holds a Ph.D. from North Carolina State University for his computational biophysics research on the free energy and stability of helical secondary structures of proteins. He is a co-author of the book "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"<sup>1</sup>, and a regular contributor to the online resource Colfax Research<sup>2</sup>.

Instructor: Andrey Vladimirov, Ph.D., is Head of HPC Research at Colfax International. His primary interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, A. Vladimirov was involved in computational astrophysics research at Stanford University, North Carolina State University, and the Ioffe Institute (Russia), where he studied cosmic rays, collisionless plasmas and the interstellar medium using computer simulations. He is a co-author of the book "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", a regular contributor to the online resource Colfax Research, and an author or co-author of over 10 peer-reviewed publications in the fields of theoretical astrophysics and scientific computing.

Instructor: Ryo Asai is a Researcher at Colfax International. Ryo holds a B.A.
degree in Physics from the University of California, Berkeley. He develops optimization methods for scientific applications targeting emerging parallel computing platforms, computing accelerators and interconnect technologies. Having joined Colfax's research team early on, Ryo has acquired deep domain expertise in programming the Intel MIC architecture. He has committed a great deal of work to the Colfax Developer Training materials, and his peer-reviewed work is among the most widely read publications of Colfax Research.

<sup>1</sup> March 2013, ISBN-10: 0-9885234-1-8, ISBN-13: 978-0-9885234-1-8; more details available at http://www.colfax-intl.com/nd/xeonphi/book.aspx

<sup>2</sup> http://research.colfaxinternational.com/

# Notes

## Presentations

Video and audio recording and still photography during Colfax Developer Training (CDT) are permitted only for private or institutional use by the attendees and their direct collaborators. No recorded materials shall be publicly disseminated without explicit written authorization from Colfax International.

## Materials

The slides of all presentations will be made available to all attendees in electronic form. Attendees are free to use these materials privately and share them with direct collaborators. However, no materials shall be publicly disseminated without explicit written authorization from Colfax International.

The book on which the CDT is based, "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", is available in electronic format and as a hard copy at http://www.colfax-intl.com/nd/xeonphi/book.aspx. An electronic copy of the book, together with the enclosed exercise code, is included in the training price.

# Contacts and Resources

The instructors of this CDT can be contacted via email at vadim@colfax-intl.com, andrey@colfax-intl.com and ryo@colfax-intl.com.
You may also find our online resource, research.colfaxinternational.com, useful; it hosts explanatory and research publications. General inquiries regarding Colfax's business can be sent to phi@colfax-intl.com. Colfax's business website, www.colfax-intl.com, contains information about the company's hardware solutions, education and consulting offerings.