Ns3 Projects for B.E/B.Tech M.E/M.Tech PhD Scholars.  Phone-Number:9790238391   E-mail: ns3simulation@gmail.com

A performance study of CUDA UVM vs. manual optimizations in a real-world setup: Application to a Monte Carlo wave-particle event-based interaction model

The performance of a Monte Carlo model for simulating electromagnetic wave propagation in particle-filled atmospheres has been studied for different CUDA versions and design approaches. The proposed algorithm exhibits a high degree of parallelism, which makes it well suited to implementation on a GPU. Practical implementation aspects of the model, such as the use of the different types of memory present in a GPU, are also explained and their impact assessed.

A number of setups have been chosen in order to compare performance for manually optimized vs. UVM (Unified Virtual Memory) implementations for different CUDA versions. Features and relative performance impact of the different options have been discussed, extracting practical hints and rules useful to speed up CUDA programs.

Swap-And-Randomize: A Method for Building Low-Latency HPC Interconnects

Random network topologies have been proposed to create low-diameter, low-latency interconnection networks in large-scale computing systems. However, these topologies are difficult to deploy in practice, especially when re-designing existing systems, because they lead to increased total cable length and cable packaging complexity.

In this work we propose a new method for creating random topologies without increasing cable length: randomly swap link endpoints in a non-random topology that is already deployed across several cabinets in a machine room. We quantitatively evaluate topologies created in this manner using both graph analysis and cycle-accurate network simulation, including comparisons with non-random topologies and previously-proposed random topologies.
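The endpoint-swapping idea can be illustrated with a minimal sketch. It assumes a 2D torus as the non-random starting topology and swaps endpoints between randomly chosen pairs of links so that every node keeps its degree. The function names (`torus_topology`, `swap_endpoints`, `diameter`) are hypothetical, and the cabinet layout that the method relies on to keep cable length from growing is not modeled here.

```python
import random
from collections import deque

def torus_topology(k):
    """Edge list of a k x k 2D torus (assumed non-random starting topology, k >= 3)."""
    edges = []
    for x in range(k):
        for y in range(k):
            node = x * k + y
            edges.append((node, ((x + 1) % k) * k + y))  # link to the next row
            edges.append((node, x * k + (y + 1) % k))    # link to the next column
    return edges

def swap_endpoints(edges, num_swaps, seed=0):
    """Rewire (a, b), (c, d) -> (a, d), (c, b) for random link pairs, preserving degrees."""
    rng = random.Random(seed)
    edges = list(edges)
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < num_swaps and attempts < 100 * num_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # the two links share a node; swapping would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # swapping would create a duplicate link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

def diameter(edges, num_nodes):
    """Longest shortest-path hop count, or infinity if the graph is disconnected."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    worst = 0
    for src in range(num_nodes):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < num_nodes:
            return float("inf")
        worst = max(worst, max(dist.values()))
    return worst
```

Comparing `diameter(torus_topology(8), 64)` with the diameter after `swap_endpoints` gives a rough graph-level view of the effect; a faithful version would restrict swaps to link pairs whose endpoints sit in the same cabinet.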

Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU Nodes

OpenCL is an open standard to write parallel applications for heterogeneous computing systems. Since its usage is restricted to a single operating system instance, programmers need to use a mix of OpenCL and MPI to program a heterogeneous cluster. In this paper, we introduce an MPI-OpenCL implementation of the LINPACK benchmark for a cluster with multi-GPU nodes. The LINPACK benchmark is one of the most widely used benchmark applications for evaluating high performance computing systems.

Our implementation is based on High Performance LINPACK (HPL) and uses the blocked LU decomposition algorithm. We argue that optimizations aimed at reducing CPU overhead are necessary to overcome the performance gap between the CPUs and the multiple GPUs. Our LINPACK implementation achieves 93.69 Tflops (46 percent of the theoretical peak) on the target cluster with 49 nodes, each node containing two eight-core CPUs and four GPUs.
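For readers unfamiliar with blocked LU decomposition, the following is a minimal single-process sketch of the right-looking variant. It is not the MPI-OpenCL implementation described above: it omits the partial pivoting that HPL performs, so it assumes a matrix (for example a diagonally dominant one) for which no row exchanges are needed. The function names and the `block` parameter are illustrative.

```python
def blocked_lu(a, block):
    """Factor the square matrix a (list of lists) in place, without pivoting.

    On return, the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U.
    """
    n = len(a)
    for k0 in range(0, n, block):
        k1 = min(k0 + block, n)
        # Step 1: factor the current panel (columns k0..k1-1) with unblocked LU.
        for k in range(k0, k1):
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 2: forward-substitute to obtain the block row of U right of the panel.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 3: update the trailing submatrix (the matrix-multiply-heavy part).
        for i in range(k1, n):
            for j in range(k1, n):
                acc = 0.0
                for k in range(k0, k1):
                    acc += a[i][k] * a[k][j]
                a[i][j] -= acc
    return a

def lu_product(lu):
    """Multiply the packed L and U factors back together (for checking)."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                total += l_ik * lu[k][j]
            out[i][j] = total
    return out
```

Step 3 is a dense matrix product, which is the part a GPU-accelerated implementation would typically offload, while the panel factorization in step 1 is less parallel.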

Heterogeneous NoC Router Architecture

We introduce a novel heterogeneous NoC router architecture that supports different link bandwidths and different numbers of virtual channels (VCs) per unidirectional port. The NoC router is based on a shared-buffer architecture and has the advantages of ingress and egress bandwidth decoupling and better performance than an input-buffer router architecture. We present the challenges facing the design of such a heterogeneous NoC router and describe how this router architecture addresses them.

We introduce and formally prove a novel approach that reduces the number of required middle shared-buffers without affecting the performance of the router. In comparison with an optimal input-buffer homogeneous router, our NoC router improves saturation throughput by 6-47 percent for standard traffic patterns. The router achieves significant run-time improvement for NoC-based CMP running PARSEC benchmarks. It offers better scalability, area, and power reduction of 15-60 percent, for NoC based CMPs of size 4 × 4 up to 16 × 16, as compared with optimal input-buffer homogeneous and heterogeneous routers.

In-Place Matrix Transposition on GPUs

Matrix transposition is an important algorithmic building block for many numeric algorithms such as FFT. With more and more algebra libraries offloading to GPUs, a high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to their limited on-board memory capacity and high throughput. However, direct application of CPU in-place transposition algorithms lacks the parallelism and locality required by GPUs to achieve good performance. In this paper we present our in-place matrix transposition approach for GPUs, which is performed using elementary tile-wise transpositions.

We propose low-level optimizations for the elementary transpositions and find the best-performing configurations for them. Then, we compare all sequences of transpositions that achieve full transposition and detect which is the most favorable for each matrix. We present a heuristic to guide the selection of tile sizes and compare it to brute-force search. We diagnose the drawback of our approach and propose a solution using minimal padding. With fast padding and unpadding kernels, the overall throughput is significantly increased. Finally, we compare our method to another recent implementation.
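As background, the classic CPU-style in-place transposition that the abstract contrasts against can be sketched as cycle following on a flat row-major array. This is not the tile-wise GPU method of the paper; it is a minimal sequential illustration, and the function name is hypothetical. It uses the standard index map for a rows x cols matrix: the element at flat index i moves to index (i * rows) mod (rows * cols - 1), with the first and last elements staying put.

```python
def transpose_in_place(data, rows, cols):
    """Transpose a rows x cols row-major matrix stored in the flat list data.

    After the call, data holds the cols x rows transpose in row-major order.
    Only a constant amount of extra storage is used.
    """
    last = rows * cols - 1
    for start in range(1, last):
        # Process each permutation cycle once, from its smallest index.
        nxt = (start * rows) % last
        while nxt > start:
            nxt = (nxt * rows) % last
        if nxt < start:
            continue  # this cycle was already rotated from a smaller start
        # Rotate the cycle: carry each value to its destination slot.
        carry = data[start]
        pos = start
        while True:
            dest = (pos * rows) % last
            carry, data[dest] = data[dest], carry
            pos = dest
            if pos == start:
                break
    return data
```

For example, `transpose_in_place([1, 2, 3, 4, 5, 6], 2, 3)` yields `[1, 4, 2, 5, 3, 6]`. The cycles have irregular lengths and scattered memory accesses, which is consistent with the abstract's point that such algorithms lack the parallelism and locality a GPU needs.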

Verifying Pipelined-RAM Consistency over Read/Write Traces of Data Replicas

Data replication technologies enable efficient and highly available data access, and are therefore attracting growing interest in both academia and industry. However, they introduce the problem of data consistency. For high performance and availability, modern data replication systems often settle for weak consistency. Pipelined-RAM consistency is one of the well-known weak consistency models. It does not require all the replicas to agree on the same global view of the order in which data operations occur. To determine whether a data replication system indeed provides Pipelined-RAM consistency, we study the problem of verifying Pipelined-RAM consistency over read/write traces (VPC, for short). We identify four variants of VPC according to a) whether write operations can assign Duplicate values (or only Unique values) for each shared variable, and b) whether there are Multiple shared variables (or one Single variable); the four variants are labeled VPC-SU, VPC-MU, VPC-SD, and VPC-MD.

For the VPC problem, all read operations are totally ordered on the same process p0. It turns out that the VPC problem is NP-complete if write operations can assign duplicate values for each shared variable. In contrast, it is polynomially tractable if only unique values are allowed. Specifically, we prove that VPC-SD is NP-complete (and so is VPC-MD) by reducing the strongly NP-complete problem 3-PARTITION to it. On the other hand, we present a polynomial algorithm, called READ-CENTRIC, for the VPC-MU variant. The READ-CENTRIC algorithm constructs an operation graph by iteratively applying a rule which guarantees that no overwritten values are allowed to be read. It makes use of the total order between the dictating writes on the same variable to avoid redundant applications of the rule, achieving a time complexity of O(n^4), where n is the number of operations in the trace. Moreover, the READ-CENTRIC algorithm is incremental in that it processes the read operations on process p0 one by one. The experiments have demonstrated its practical efficiency and scalability.

Parallel and High-Speed Computations of Elliptic Curve Cryptography Using Hybrid-Double Multipliers

High-performance and fast implementation of point multiplication is crucial for elliptic curve cryptographic systems. Recently, considerable research has investigated the implementation of point multiplication on different curves over binary extension fields. In this paper, we propose efficient and high-speed architectures to implement point multiplication on binary Edwards and generalized Hessian curves. We perform a data-flow analysis and investigate the maximum number of parallel multipliers that can be employed to reduce the latency of point multiplication on these curves.

Then, we modify the addition and doubling formulations and employ a newly proposed digit-level hybrid-double Gaussian normal basis multiplier to remove the data dependencies and hence reduce the latency of point multiplication. To the best of our knowledge, this is the first time that the hybrid-double multiplication technique has been employed to reduce the computation time of point multiplication. Moreover, we have implemented our proposed architectures for point multiplication on FPGA and obtained timing and area results. Our results indicate that the proposed scheme is a step forward in improving the performance of point multiplication on binary Edwards and generalized Hessian curves.

A Comment on “Fast Bloom Filters and Their Generalization”

A Bloom filter is a data structure that provides probabilistic membership checking. Bloom filters have many applications in computing and communications systems. The performance of a Bloom filter is measured by false positive rate, memory size requirement, and query (or memory look-up) overhead. A recent paper by Qiao et al. proposes the Fast Bloom Filter, also called Bloom-1, which requires only a single memory look-up for a membership test.

Bloom-1 achieves a reduced query overhead at the expense of a slightly higher false positive rate for a given memory size. The false positive rate of Bloom-1 has been analyzed theoretically by Qiao et al. relying on a well-known, but flawed, approximation for the false positive rate for a Bloom filter. In this comment paper we show that the Qiao et al. analysis of Bloom-1 under-estimates the false positive rate for low loads. We provide a correct analysis of Bloom-1 yielding an expression for the exact false positive rate.
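The single-look-up idea can be sketched as follows, under the assumption that the filter is an array of fixed-width words, one hash selects a word, and k further hash values select bits inside that word. The class name, the SHA-256-based hash derivation, and the default parameters are illustrative choices, not details taken from Qiao et al.

```python
import hashlib

class Bloom1Sketch:
    """Illustrative Bloom-1-style filter: each item touches exactly one word."""

    def __init__(self, num_words, word_bits=64, k=3):
        self.num_words = num_words
        self.word_bits = word_bits
        self.k = k
        self.words = [0] * num_words

    def _word_and_mask(self, item):
        # Assumption: derive all hash values from a single SHA-256 digest.
        value = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest(), "big")
        word_index = value % self.num_words
        value //= self.num_words
        mask = 0
        for _ in range(self.k):
            mask |= 1 << (value % self.word_bits)  # bit positions may coincide
            value //= self.word_bits
        return word_index, mask

    def add(self, item):
        index, mask = self._word_and_mask(item)
        self.words[index] |= mask

    def __contains__(self, item):
        index, mask = self._word_and_mask(item)
        # One word is read; the item may be present only if all its bits are set.
        return (self.words[index] & mask) == mask
```

Because all k bits of an item are confined to one word, items that hash to the same word compete for a small number of bit positions, which is one intuition for the higher false positive rate at a given memory size.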

A parallel out-of-core algorithm for the time-domain adaptive integral method

A parallel multi-level out-of-core algorithm is presented to mitigate the high memory requirement of the time-domain adaptive integral method.

Numerical results demonstrate the performance of the method for large-scale simulations where terabytes of data are transferred to/from a distributed storage system.

A Partial PIC Based Receiver Design for SFBC-OFDM Cooperative Relay Systems

Space frequency block codes (SFBC) have been widely adopted with orthogonal frequency division multiplexing (OFDM) to achieve both spatial and frequency diversity. However, OFDM is very sensitive to carrier frequency offset (CFO), which causes intercarrier interference (ICI) and severely degrades system performance. In this paper, we propose a low-complexity partial parallel interference cancellation (PIC) based receiver with common CFO compensation, modified SFBC decoding, and iterative partial PIC for Alamouti coded SFBC-OFDM cooperative relay systems.

For the proposed receiver, the received signal is first compensated by a common CFO value to mitigate the effect of different CFOs from the distributed relays, and then the result is processed by a modified SFBC decoder. The decoded data is subsequently fed back to reconstruct the ICI for use in the developed partial PIC procedure. As compared with a previous related work, the proposed partial PIC based receiver achieves better performance with lower computational complexity.
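As background for the SFBC decoding step, the following sketch shows textbook Alamouti encoding over two adjacent subcarriers and the corresponding linear combining at the receiver. It is an idealized illustration, not the modified decoder or the partial PIC procedure of the paper: it assumes no CFO, no ICI, no noise, and channel gains `h1` and `h2` that are identical on both subcarriers. All function names are hypothetical.

```python
def alamouti_encode(s1, s2):
    """Map two symbols onto two transmitters and two adjacent subcarriers.

    Returns ((tx1, tx2) on subcarrier k, (tx1, tx2) on subcarrier k + 1).
    """
    return (s1, s2), (-s2.conjugate(), s1.conjugate())

def channel(tx, h1, h2):
    """Noise-free received samples on the two subcarriers (idealized assumption)."""
    (a1, a2), (b1, b2) = tx
    return h1 * a1 + h2 * a2, h1 * b1 + h2 * b2

def alamouti_decode(r1, r2, h1, h2):
    """Combine the two received samples to recover s1 and s2."""
    gain = abs(h1) ** 2 + abs(h2) ** 2
    s1_hat = (h1.conjugate() * r1 + h2 * r2.conjugate()) / gain
    s2_hat = (h2.conjugate() * r1 - h1 * r2.conjugate()) / gain
    return s1_hat, s2_hat
```

With a CFO present, energy from neighboring subcarriers leaks into `r1` and `r2`, so this simple combining no longer separates the symbols cleanly; that residual interference is what a PIC stage would reconstruct and subtract.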