A performance study of CUDA UVM vs. manual optimizations in a real-world setup: Application to a Monte Carlo wave-particle event-based interaction model
The performance of a Monte Carlo model for the simulation of electromagnetic wave propagation in particle-filled atmospheres has been evaluated for different CUDA versions and design approaches. The proposed algorithm exhibits a high degree of parallelism, which makes it well suited to implementation on a GPU. Practical implementation aspects of the model have also been explained and their […]
Swap-And-Randomize: A Method for Building Low-Latency HPC Interconnects
Random network topologies have been proposed to create low-diameter, low-latency interconnection networks in large-scale computing systems. However, these topologies are difficult to deploy in practice, especially when re-designing existing systems, because they lead to increased total cable length and cable packaging complexity. In this work we propose a new method for creating random topologies without increasing cable […]
Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU Nodes
OpenCL is an open standard for writing parallel applications for heterogeneous computing systems. Since its usage is restricted to a single operating system instance, programmers need to use a mix of OpenCL and MPI to program a heterogeneous cluster. In this paper, we introduce an MPI-OpenCL implementation of the LINPACK benchmark for a cluster with multi-GPU nodes. The LINPACK benchmark […]
Heterogeneous NoC Router Architecture
We introduce a novel heterogeneous NoC router architecture that supports different link bandwidths and different numbers of virtual channels (VCs) per unidirectional port. The NoC router is based on a shared-buffer architecture, which decouples ingress and egress bandwidth and offers better performance than an input-buffer router architecture. We present the challenges facing the […]
In-Place Matrix Transposition on GPUs
Matrix transposition is an important algorithmic building block for many numeric algorithms such as FFT. With more and more algebra libraries offloading to GPUs, high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to limited available on-board memory capacity and high throughput. However, direct application […]
Verifying Pipelined-RAM Consistency over Read/Write Traces of Data Replicas
Data replication technologies enable efficient and highly available data access, and are therefore attracting growing interest in both academia and industry. However, replication introduces the problem of data consistency. For high performance and availability, modern data replication systems often settle for weak consistency. Pipelined-RAM consistency is one of the well-known weak consistency models. It does not […]
Parallel and High-Speed Computations of Elliptic Curve Cryptography Using Hybrid-Double Multipliers
Fast, high-performance implementation of point multiplication is crucial for elliptic curve cryptographic systems. Recently, considerable research has investigated the implementation of point multiplication on different curves over binary extension fields. In this paper, we propose efficient and high-speed architectures to implement point multiplication on binary Edwards and generalized Hessian curves. We perform a data-flow […]
A Comment on “Fast Bloom Filters and Their Generalization”
A Bloom filter is a data structure that provides probabilistic membership checking. Bloom filters have many applications in computing and communications systems. The performance of a Bloom filter is measured by false positive rate, memory size requirement, and query (or memory look-up) overhead. A recent paper by Qiao et al. proposes the Fast Bloom Filter, also […]
A parallel out-of-core algorithm for the time-domain adaptive integral method
A parallel multi-level out-of-core algorithm is presented to mitigate the high memory requirement of the time-domain adaptive integral method. Numerical results demonstrate the performance of the method for large-scale simulations where terabytes of data are transferred to/from a distributed storage system.
A Partial PIC Based Receiver Design for SFBC-OFDM Cooperative Relay Systems
Space frequency block codes (SFBC) have been widely adopted with orthogonal frequency division multiplexing (OFDM) to achieve both spatial and frequency diversity. However, OFDM is very sensitive to carrier frequency offset (CFO), which causes intercarrier interference (ICI) and severely degrades system performance. In this paper, we propose a low-complexity partial parallel interference cancellation (PIC) based receiver with […]









