UBE 520
PARALLEL PROGRAMMING
20112012 SPRING
Parallel
and Distributed Systems Programming, 3 hours lecture.
Parallel programming techniques, architectures and algorithms.
Course TA (Elif Acar)
Instructor: PROF.DR.M.E.DALKILIC
Textbook: Barry Wilkinson
& Michael Allen, Parallel Programming Techniques and Applications Using
Networked Workstations and Parallel Computers Second edition, PrenHall 2004
http://www.cs.uncc.edu/par_prog
References:
 I. Foster: Designing and Building Parallel Programs,
AddisonWesley Publishing, 1995
 Blaise
Barney, Paralel
Computing Tutorial Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/
 A. Grama, A. Gupta, G.
Karypis and V. Kumar, Introduction to Parallel Computing, 2nd Ed.,
AddisonWesly, 2003
http://wwwusers.cs.umn.edu/~karypis/parbook
 Calvin Lin & Larry
Snyder, Principles of Parallel Programming, PrenHall 2009
Goals: To
introduce the theory and implementation of parallel programming techniques.
Programming platform:
·
MPI
(Message Passing Interface) on a 15 node Linux Cluster
·
Threads
/ OpenMP
·
Multicore
architectures
·
GPU
(Graphical Processing Unit)
Prerequisites: Data Structures and Algorithms, Computer Architecture
and C/Java Programming
Topics:
PART I: BASIC TECHNIQUES
Chapter 1. Parallel Computers
Chapter 2. Message Passing Computing
Chapter 3. Embarrasingly Parallel Computations
Chapter 4. Partitioning and DivideandConquer Strategies
Chapter 5. Pipelined Computations
Chapter 6. Synchronous Computations
Chapter 7. Load Balancing and Termination Detection
Chapter 8. Programming with Shared Memory
Chapter 9. Distributed Shared Memory Systems and Programming
PART II
ALGORITHMS and APPLICATIONS
Chapter 10. Sorting Algorithms
Chapter 11. Numerical Algorithms
Chapter 12. Image Processing
Grading
 Assignments, 40% (+ 5/10)
 Midterm Exam, 25% (+/ 5)
 Final Exam, 35% (+/ 5)
Homework #1 due date: March 1st, 2012
Homework 2 due date: March 8st, 2012
Homework #3 (due March 29^{th}, 2012)
1. Write an MPI program to
distribute MPI source files
to
all machines in your Lamhosts file and to compile
them
automatically. To be more specific your program
may
take two arguments: a source file name
and
a distribution list.
Note: When MPI nodes activated (e.g., lamboot) MPI assigns
a
node id to each Machine. Thus, you can use these node ids
instead
of taking a distribution list as an input parameter.
In
summary, you don’t actually need a distribution list parameter.
(Thanks
Murat Kurt for pointing this out.)
Homework #4 (due April 5th^{th}, 2012)
Suppose you have N integers to be summed using p
processors.
You can implement an MPI program for the above
task;
i)
using
separate send()s and recv()s,,
ii)
using
a broadcast routine with separate recv()s to return partial results, and
iii)
using
scatter and reduce routines.
Write MPI programs for all three implementations
and execute, time and compare them
for N= 100, 1000, 10000 and p= 4, 8.
Homework
#6 (due April 12th,2012)
Compare experimentally
a fully synchronous parallel MPI implementation
and a partially
synchronous parallel MPI implementation of the
heatdistribution
problem described in Section 6.3.2 of the textbook.
Try different
values of s in the convergence condition specified in
Section 6.4. Write
a report on your findings that includes the specific
Speed improvements
you obtained.
Homework #7 (due April 19^{th}, 2012)
Here
is the OpenMP programming tasks.
UBI 520 HW#8 Due Date: 03.05.2012
Java or .NET Thread Programming
Simulate the Gale Shapley algorithm for stable
marriage problem.
Initialization:
Create n m_threads and n w_threads, each
representing a participant.
Each thread is in either the free state
or the engaged state. Initially each thread is in the free state.
Each thread, when it is fired,
creates a random preference list of the other gender.
Additionally, each w_thread creates
the inverse list (as shown in lecture notes).
Each thread keeps its state and its
fiancée if exists.
Shared Variables:number_of_free_man
(initially n). When a man becomes engaged,
decrease number_of_free_man by 1.
When a man becomes disengaged, increase
number_of_free_man by 1. But be
careful about the order in case of a disengagement.
Do the increment first otherwise
number_of_free_man may go down to 0 causing the
algorithm terminates prematurely.
Algorithm for m_threads:
Each free m_thread offers to the
first w_thread that didn’t reject it yet.
(Note the order that the free man
proposes doesn’t matter.)
If its offer (m_thread’s) is
accepted, the m_thread sets its fiancée variable and goes in the engaged
state
where he doesn’t offer as long as it is
engaged.
If rejected, the free m_thread may
go on to offer the next w_thread in its preference list.
Algorithm for w_threads:
Each w_thread that receives an offer
must either reject or accept. If it is in free state, it accepts.
If it accepts an offer, a w_thread
becomes engaged and never becomes free again; however,
may change its fiancée when it gets a
better offer.
Termination:
When there is no free m_thread,
algorithm terminates (When all man have fiancées, we are done).
Please make sure that you present:
1.
Your code
2.
Your test results for n=5 (preference
lists, intermediate steps and final matchings)
3.
Run time results n=5, 10, 20, 50,
100 for average and the longest thread execution times in millisecond.
Homework #9 (due date May
17th, 2012)
F(x) fonksiyonun (analitik bir fonksiyon olmak zorunda değil,
simülasyon sonucu da olabilir)
[a,b]aralığında minimum değerini bulmak istiyoruz. Bunun için
[a,b] aralığını eşit n parçaya
bölerek
fonksiyonun bu noktalardaki değerlerini hesaplıyoruz. Daha sonra
hesaplanan bu değerlerden
en küçüğünü üreten x değerini bulmaya çalışıyoruz.
·
Yukarıda açıklanan problemin
çözümüne yönelik aşağıda verilen fonksiyonların
local minima’larını CPU’da ve GPU’da hesaplayan
iki farklı prosedür yazınız.
__host__ __device__
float func1(float x)
{
return x
* x;
}
__host__ __device__ float fun2(float x)
{
return sin(x);
}
__host__ __device__ float fun3(float x)
{
return sqrt(sin(x*x));
}
float calcMinCPU(float
a, float
b, int funcIndex) /* funcIndex
hangi fonksiyonu kullanacağımızı belirliyor */
{
...
}
float calcMinGPU(float
a, float
b, int funcIndex) /*
funcIndex hangi fonksiyonu kullanacağımızı belirliyor */
{
...
}
·
GPU’da minimum bulmak için
kullandığınız paralel algoritmayı açıklayınız.
·
Her üç fonksiyon için [0, 1]
aralığında CPU’da ve üç farklı blok konfigürasyonu ile GPU’da çalışma
sürelerini ölçünüz.
·
Çalışma sürelerinin farklılıklarını
yorumlayınız.
Cuda slides
MPI tutorials one, two,three,four, GroupComm
Startup
document for MPI UBE_LAM_MPI.doc
Send any comments or suggestions
to dalkilic
Last revised in Feb 12,1996
