Sunday, 3 March 2013

MEV - Manycore Edge Vectorization

Many-core architectures have become widespread, ranging from HPC systems to low-power SoC devices, and allow for a mix of SISD and SIMD processing techniques. Image vectorization refers to converting raster graphics (in general, a rectangular grid of pixels) into vector graphics, i.e. a representation through mathematical expressions. Common applications include computer-aided design (CAD), geographic information systems (GIS), graphic design, and photography.


Image filters (Algorithm Phase 1)
The aim is to bring the image into a black-and-white format that contains no filled regions. The number of filters applied depends on whether the original image is grayscale or black and white. A threshold with the fill removal filter is applied regardless of the input. Additionally, if the input is grayscale, a blur followed by a Sobel or Canny filter is used. Ideally, the resulting set of connected curves indicates the boundaries of objects.
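The threshold and fill removal step could look roughly like the sketch below. It assumes an 8-bit grayscale buffer in row-major order, with black (0) as foreground and white (255) as background. The function names and the 4-neighbourhood rule used for fill removal are illustrative assumptions, not taken from the mvec source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Binarize in place: pixels >= cutoff become white (255), the rest black (0). */
void threshold_u8(uint8_t *pixels, size_t count, uint8_t cutoff)
{
    assert(pixels != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        pixels[i] = (pixels[i] >= cutoff) ? 255 : 0;
}

/* Hypothetical fill removal: a black pixel whose four direct neighbours are
 * all black lies inside a filled region and is turned white, so only the
 * outline of each region survives. src and dst must not overlap. */
void remove_fill_u8(const uint8_t *src, uint8_t *dst, size_t width, size_t height)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            size_t idx = y * width + x;
            int on_border = (x == 0 || y == 0 || x + 1 == width || y + 1 == height);
            /* Neighbours are only read when the pixel is not on the image border. */
            int interior = src[idx] == 0 && !on_border &&
                           src[idx - 1] == 0 && src[idx + 1] == 0 &&
                           src[idx - width] == 0 && src[idx + width] == 0;
            dst[idx] = interior ? 255 : src[idx];
        }
    }
}
```

Applied to a fully black 5x5 image, this leaves a one-pixel black frame and turns the inner 3x3 block white.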

Tracing (Algorithm Phase 2)
Contour tracing algorithms consist of simple rules for traversing the image, in which points are collected into groups that constitute paths. Two tracing algorithms have been designed: one slower but more precise, and one faster but less precise. The former is a modified version of the Moore-Neighbor algorithm.
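For orientation, here is a minimal sketch of the classic (unmodified) Moore-Neighbor tracing, not the modified version used in this work. It assumes non-zero pixels are foreground, that the start pixel was found by a raster scan (so its west neighbour is background), and it uses the simplest stopping rule: stop when the start pixel is reached again. The names point_t and trace_contour are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int x, y; } point_t;

/* Moore neighbourhood in clockwise order, starting from west (y grows downwards). */
static const int DX[8] = {-1, -1,  0,  1, 1, 1, 0, -1};
static const int DY[8] = { 0, -1, -1, -1, 0, 1, 1,  1};

static int is_set(const uint8_t *img, int w, int h, int x, int y)
{
    return x >= 0 && y >= 0 && x < w && y < h && img[(size_t)y * w + x] != 0;
}

/* Traces one contour starting at (sx, sy); returns the number of points written. */
size_t trace_contour(const uint8_t *img, int w, int h, int sx, int sy,
                     point_t *out, size_t max_points)
{
    assert(img != NULL && out != NULL);
    int cx = sx, cy = sy;
    int back = 0;                 /* direction of the last known background pixel */
    size_t n = 0;

    while (n < max_points) {
        out[n].x = cx;
        out[n].y = cy;
        n++;

        /* Sweep clockwise around the current pixel, starting after 'back'. */
        int found = -1;
        for (int step = 1; step <= 8; step++) {
            int d = (back + step) % 8;
            if (is_set(img, w, h, cx + DX[d], cy + DY[d])) { found = d; break; }
        }
        if (found < 0)
            break;                /* isolated pixel: nothing to follow */

        /* The neighbour examined just before the hit is background. */
        int prev = (found + 7) % 8;
        int bx = cx + DX[prev], by = cy + DY[prev];
        cx += DX[found];
        cy += DY[found];

        /* Re-express that background pixel relative to the new position. */
        for (int k = 0; k < 8; k++)
            if (cx + DX[k] == bx && cy + DY[k] == by)
                back = k;

        if (cx == sx && cy == sy)
            break;                /* back at the start: contour closed */
    }
    return n;
}
```

On a 2x2 block of foreground pixels the function visits the four pixels clockwise. The simple stopping rule can end early on shapes that pass through the start pixel more than once, which is one reason practical variants modify the algorithm.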



General considerations
Image vectorization, as introduced in this work, depends largely on the size of the image and the set of paths that can be extracted from it. Based on this, realtime processing of small images of up to HD quality (e.g. 2MP with 15-30fps) can be achieved on modest architectures such as Intel Atom. On the other hand, processing images of 1MP to 10GP can be optimized on high throughput architectures such as Intel Xeon or AMD Opteron.

It is usually quite difficult to evaluate a new hardware architecture, mainly due to the large set of features that it provides. The proposed application can act as a benchmark when using images exceeding 16MP (e.g. over 4096x4096), or HD video streams (e.g. greater than 1366x768). Running the application in a realtime video mode will output a score representing the average fps. When processing images, execution time can be measured either as a whole or per processing step (i.e. for filter, extraction, or reduction respectively). Both methods can be used to differentiate between processors, to evaluate different load conditions -- for CPU, GPU, or RAM -- and even compare entire systems.

Image vectorization can be performed on either SISD or SIMD units, making it a good fit for heterogeneous and high-performance homogeneous architectures alike. The problem can be solved using homogeneous manycore CPUs, but low-power heterogeneous processing units are still best suited for this task. Extra specialized units, such as GPUs, are able to complement the CPU and achieve high throughput with low CPU utilization. Architecture-dependent instruction sets, such as SSE or AVX, can further enhance the overall performance of the algorithms without increasing the CPU load.

glupescu@clusterLG:~/Desktop/mvec/mvec$ ./mev

 OPTION             INFO
 -------------------------------------------
 -help,-h           list commands

 -mode 1            threshold + contour
 -mode 2            blur + sobel + contour
 -mode 3            gauss + sobel + contour
 -mode 4            gauss + sobel + contourSSE
 -mode 5            blurSSE + sobel + contourSSE
 -mode 6            canny + contourSSE
 -mode 7            blurCL + sobelCL + contourCL

 -webcam            realtime processing, webcam[0] as input CV
 -video [filename]  realtime processing, vidfile[0] as inputCV
 -image [filename]  process each image file
 -limit [num]       limit total stream frame count
 -buf   [num]       buffer limit in MB
 -aprox [num]       specify level of aproximation
 -fast              force fast contour extraction alg
 -gui               display image processing steps
 -ps                output postscript file (out.ps)

The source code for the program can be found at https://gitorious.org/mvec/mvec
It supports any combination of OpenMP, OpenCL, SSE, and OpenCV/CImg, and can act as a benchmark. It can be compiled on both Linux and Windows (it uses CMake). Tested on Ubuntu 11/12 and Windows 7.


Friday, 30 March 2012

OpenCL platform/SDKs


OpenCL [1] has come a long way, evolving towards a widely adopted industry standard. For now its main targets remain x86 CPUs (AMD, Intel) and discrete GPUs (AMD, NVIDIA), with ARM architectures catching on (support for ARM is still at an early stage at this point). Though it has been nearly 3 years since the standard appeared, the implementations (AMD, Intel, NVIDIA) still differ in some respects. Below I present some of the differences found in these 3 major implementations. An older article on this subject can be found here [7].

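 VENDOR       SDK                 OPENCL       REQUIREMENTS
 ---------------------------------------------------------------------------------
 AMD [2]      APP SDK v2.6        OpenCL 1.1   Min CPU SSE3 (even Intel) or GPU HD4xxx
 INTEL [3]    OpenCL SDK v1.5     OpenCL 1.1   Intel CPUs only, min CPU SSE4
 NVIDIA [4]   CUDA Toolkit v4.1   OpenCL 1.1   Min GeForce 8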

The fact that the AMD APP SDK also supports Intel processors can be problematic if the Intel SDK is present as well and supports the same processor. The device then shows up in two different platforms; a naive implementation could end up using the device twice at the same time, which can lead to instability and performance loss.
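One way to guard against this is to filter the device list before creating contexts. The sketch below is only an illustration: it assumes the application has already walked every platform (clGetPlatformIDs, clGetDeviceIDs) and stored each CPU device's CL_DEVICE_NAME string, obtained with clGetDeviceInfo, in a hypothetical device_entry record. The helper itself is plain C and keeps the first entry seen for each name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record filled by the application while enumerating platforms. */
typedef struct {
    int  platform_index;   /* index of the platform that reported the device */
    char name[128];        /* CL_DEVICE_NAME string of the device */
} device_entry;

/* Compacts 'entries' in place so that each device name appears only once,
 * keeping the first platform that reported it. Returns the new count. */
size_t dedupe_devices(device_entry *entries, size_t count)
{
    assert(entries != NULL || count == 0);
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        int seen = 0;
        for (size_t j = 0; j < kept; j++) {
            if (strcmp(entries[j].name, entries[i].name) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            entries[kept++] = entries[i];
    }
    return kept;
}
```

Matching by name is only reasonable for CPU devices. Two identical GPUs in one machine report the same name and are distinct devices, so they should not be passed through this filter.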


Support in general is a sensitive area in OpenCL. First of all, GPUs are rather unstable devices (especially older ones, and under Linux) due to unsupported features, software emulation, and poor drivers.

For example, consider AMD's HD4xxx series, which is close to a mess regarding OpenCL: it has limited support for OpenCL (as stated here [8]) in many areas. For instance, byte-level operations are forbidden by default ( cl_uchar testErr = (cl_uchar) 255; => error). Likewise, processing errors tend to freeze the whole system. While Windows has a driver reset function, the X server on Linux would freeze in some cases, leading to a complete system freeze (no access to the ttys).



OpenCL setup takes quite some time (device query, kernel read, kernel compilation, mem write, computation, mem read back), making it unsuitable for anything other than long stream-like computations (realtime video processing, large data processing).

Another interesting aspect of the implementations is when threads are spawned, even if the actual work is not yet available for processing. Analyzing a typical OpenCL application (for example, on a Core i5), we find the following:

 OCL APP Steps       AMD (threads)   INTEL (threads)   NVIDIA (threads)
 ----------------------------------------------------------------------
 clGetDeviceInfo     1               1                 1
 clCreateContext     1               1                 1 + 7
 clCreateQueue       1 + 5           1                 1 + 7
 clCreateProgram     1 + 5           1                 1 + 7
 clBuildProgram      1 + 5           1 + 4             1 + 7
 clCreateKernel      1 + 5           1 + 4             1 + 7
 clSetKernelArgs     1 + 5           1 + 4             1 + 7
 clEnqueue           1 + 5           1 + 4             1 + 7
 clFinish            1 + 5           1 + 4             1 + 7
 clReleaseContext    1 + 5           1 + 4             1 + 7
 clReleaseQueue      1 + 1           1 + 4             1 + 7

From the thread counts above, NVIDIA spawns the most threads and leaves them active even after the OpenCL objects go out of scope or are destroyed, as does Intel. A couple of screenshots follow.

AMD thread spawn


Intel thread spawn 


NVIDIA thread spawn



How much memory is available for processing is an interesting aspect to evaluate. With a 1GB graphics card and a 64-bit system with 4GB of RAM, I received errors of the kind “not enough” when allocating 2 chunks of ~450MB. This is odd and needs further investigation.

CPU memory problem => "Invalid mem object"
GPU no memory => "Mem object allocation failed"

Heterogeneous architectures first emerged with AMD’s Brazos/Llano platforms and Intel’s upcoming Ivy Bridge lineup. Their main advantage is zero-copy, which can speed up transfers, but it comes with the disadvantage of a bandwidth shared between CPU and GPU. The bandwidth of an E-350 is approx. 8GB/sec, which is very low by today’s standards for a chip that has both a dual-core CPU (similar in performance to K8) and an 80-core VLIW5 GPU (similar to the HD5470).

Finally, 2 utilities for querying OpenCL-compatible devices: GPU Shark [5] and GPU Caps Viewer [6].



Saturday, 19 November 2011

Maximum subarray problem on multicore/manycore architectures (Part 2)

After connecting to the front-end, a user can submit jobs with qsub to one of the available clusters in the queue. A cluster consists of 4 Xeon processors, each with 10 cores and 24MB of cache, and has a total of ~260GB of RAM in quad channel (HT is deactivated by default). Scalability is hard to achieve, mainly due to the multiprocessor configuration, even on highly parallel problems such as Kadane2D.



Kadane2D adapted for MTL

Kadane2D consists mainly of 3 for loops: the first 2 select the (i, j) combination of rows, whilst the last (say k) runs over the columns, where we use Kadane1D.

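for (int i = 0; i < row_count; i++)        /* top row of the band */
    for (int j = i; j < row_count; j++)    /* bottom row of the band */
        Kadane1D( k = 0 → col_count )      /* 1D scan over the column sums of rows i..j */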
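A minimal single-threaded sketch of these loops in C is given below. It assumes a row-major int matrix; the function names kadane1d and kadane2d and the running column-sum buffer are illustrative choices, not taken from the code that ran on the MTL.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Kadane 1D: largest sum of a non-empty contiguous run in v[0..n-1]. */
static long kadane1d(const long *v, int n)
{
    long best = v[0], cur = v[0];
    for (int k = 1; k < n; k++) {
        cur = (cur > 0 ? cur : 0) + v[k];   /* restart the run if it went negative */
        if (cur > best)
            best = cur;
    }
    return best;
}

/* Kadane 2D: largest sum of a non-empty sub-rectangle of a rows x cols matrix.
 * Returns LONG_MIN if the temporary buffer cannot be allocated. */
long kadane2d(const int *m, int rows, int cols)
{
    assert(m != NULL && rows > 0 && cols > 0);
    long *col_sum = malloc((size_t)cols * sizeof *col_sum);
    if (col_sum == NULL)
        return LONG_MIN;

    long best = LONG_MIN;
    for (int i = 0; i < rows; i++) {              /* top row of the band */
        for (int k = 0; k < cols; k++)
            col_sum[k] = 0;
        for (int j = i; j < rows; j++) {          /* bottom row of the band */
            for (int k = 0; k < cols; k++)        /* add row j to the column sums */
                col_sum[k] += m[(size_t)j * cols + k];
            long s = kadane1d(col_sum, cols);
            if (s > best)
                best = s;
        }
    }
    free(col_sum);
    return best;
}
```

For the 3x3 matrix {1, -2, 3, -4, 5, 6, 7, -8, 9} the function returns 18, the sum of the third column. Reusing the column sums from band (i, j) for band (i, j+1) is what keeps the total cost at O(rows² · cols).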

As presented previously, we need to assign each thread a set of {(i, j) | i ≤ j}, and we can do this in at least 3 ways: equal chunks by i (solution A, inefficient); uneven chunks by i that hold an equal number of (i, j) pairs (solution B); and finally grouping zones so as to create redundancy (solution C).
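The split used by solution B can be sketched as follows, under the assumption that "equal" means an approximately equal number of (i, j) pairs per thread. Row i contributes row_count - i pairs, so the chunks of i get longer towards the end. The function name and the bounds array layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fills bounds[0..threads] so that thread t handles the rows i in
 * [bounds[t], bounds[t + 1]). A chunk is closed at the first row where the
 * cumulative pair count reaches t/threads of the total. */
void split_rows_by_pairs(int row_count, int threads, int *bounds)
{
    assert(row_count > 0 && threads > 0 && bounds != NULL);
    long long total = (long long)row_count * (row_count + 1) / 2;
    long long done = 0;     /* pairs covered by rows 0..i */
    int t = 1;              /* next boundary to place */

    bounds[0] = 0;
    for (int i = 0; i < row_count && t < threads; i++) {
        done += row_count - i;
        while (t < threads && done * threads >= total * t)
            bounds[t++] = i + 1;
    }
    while (t <= threads)    /* remaining boundaries end at the last row */
        bounds[t++] = row_count;
}
```

With 10 rows and 3 threads this gives the boundaries {0, 2, 5, 10}, i.e. 19, 21 and 15 pairs per thread, whereas equal chunks by i (solution A) would leave the first thread with far more work than the last.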

Solution A does not balance the workload among threads and should yield bad results on anything but a single-core computer; the differences grow as the core count increases. Comparing the latter 2, B will yield weaker results than C due to the lack of data redundancy; in other words, B will need more bandwidth and cache than C.

On the MTL, for a 10000x10000 matrix, solution C more than doubles the performance of solution B (150 / 60 = 2.5x):
solution B (2N columns in cache)  => 150 sec
solution C (N+1 columns in cache) => 60 sec

Consider the following questions:
1) Will “solution C” scale near perfectly on the MTL (given high bandwidth and large cache)?
2) Will increasing redundancy help more? For example, instead of {(0,1), (0,2), (0,3)} use {(0,1), (0,2), (1,2)}

Implementing solution C, we obtained the following scores on 10000x10000 and 5000x5000 matrices:











The answer to both questions is NO for a configuration similar to the MTL. This is because the processors of a multiprocessor system have separate caches and because the OS/middleware is in charge of thread distribution. Threads that require common elements can thus end up on different processors, and as the number of processors increases, solution C comes closer to solution B, which does not benefit from redundancy.

For the Intel MTL system, the conclusions for solution C are:

Spikes appear when work is not divided equally between processors, mainly due to cache misses and data distribution among processors
The best speedup is obtained up to 20 cores (2 processors), a moderate speedup up to 30 cores (3 processors), whilst only a modest speedup can be seen between 30 and 40 cores (3-4 processors)