Ns3 Projects for B.E/B.Tech M.E/M.Tech PhD Scholars.  Phone-Number:9790238391   E-mail: ns3simulation@gmail.com

FP-NUCA: A Fast NOC Layer for Implementing Large NUCA Caches

NUCA caches have traditionally been proposed as a solution for mitigating wire delays, and delays introduced due to complex networks on chip. Traditional approaches have reported significant performance gains with intelligent block placement, location, replication, and migration schemes. In this paper, we propose a novel approach in this space, called FP-NUCA. It differs from conventional approaches, and relies on a novel method of co-designing the last level cache and the network on chip. We artificially constrain the communication pattern in the NUCA cache such that all the messages travel along a few predefined paths (fast paths) for each set of banks.

We leverage this communication pattern by designing a new type of NOC router called the Freeze router, which augments a regular router with a layer of circuitry that gates the clock of the regular router whenever a fast path message is waiting to be transmitted. Messages along the fast path do not require buffering, switching, or routing. We incorporate a bank predictor into our novel NOC to reduce the number of messages and the resultant energy consumption. We compare our performance with state-of-the-art protocols and report speedups of up to 31 percent (mean: 6.3 percent) and ED2 reductions of up to 46 percent (mean: 10.4 percent) for a suite of Splash and Parsec benchmarks. We implement the Freeze router in VHDL and show that the additional fast path logic has minimal area and timing overheads.
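
As a reading aid for the ED2 figures quoted above, the following minimal sketch assumes that ED2 denotes the usual energy-delay-squared product (energy multiplied by the square of execution time). The function names and the numbers are hypothetical and are not taken from the paper.

```python
def ed2(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay-squared product, E * D^2 (lower is better)."""
    return energy_joules * delay_seconds ** 2


def ed2_reduction(base_energy: float, base_delay: float,
                  new_energy: float, new_delay: float) -> float:
    """Fractional ED^2 reduction of a new design relative to a baseline."""
    baseline = ed2(base_energy, base_delay)
    return 1.0 - ed2(new_energy, new_delay) / baseline


# Hypothetical numbers: identical energy, 10 percent shorter execution time.
reduction = ed2_reduction(1.0, 1.0, 1.0, 0.9)
print(f"ED^2 reduction: {reduction:.1%}")
```

Because delay enters quadratically, a modest speedup alone already yields a noticeably larger ED2 reduction, which is why the ED2 percentages can exceed the speedup percentages.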

Spatial Locality Aware Disk Scheduling in Virtualized Environment

Exploiting spatial locality, a key technique for improving disk I/O utilization and performance, faces additional challenges in the virtualized cloud because of the transparency feature of virtualization. This paper contributes a novel disk I/O scheduling framework, named Pregather, to improve disk I/O efficiency through exposure and exploitation of the special spatial locality in the virtualized environment, thereby improving the performance of disk-intensive applications without harming the transparency feature of virtualization. The key idea behind Pregather is to implement an intelligent model to predict the access regularity of spatial locality for each VM.

Moreover, Pregather adopts an adaptive time slice allocation scheme to further reduce resource contention and ensure fairness among VMs. We implement the Pregather disk scheduling framework and perform extensive experiments that involve multiple simultaneous applications, including both synthetic benchmarks and MapReduce applications, on Xen-based platforms. Our experiments demonstrate the accuracy of our prediction model and indicate that Pregather results in high disk spatial locality and a significant improvement in disk throughput and application performance.

Goodput-Aware Load Distribution for Real-Time Traffic over Multipath Networks

Load distribution is a key research issue in deploying the limited network resources available to support traffic transmissions. Developing an effective solution is critical for enhancing traffic performance and network utilization. In this paper, we investigate the problem of load distribution for real-time traffic over multipath networks. Due to the path diversity and unreliability in heterogeneous overlay networks, large end-to-end delay and consecutive packet losses can significantly degrade the traffic flow’s goodput, whereas existing studies mainly focus on delay or throughput performance. To address these challenges, we propose a Goodput-Aware Load distribuTiON (GALTON) model that includes three phases: (1) path status estimation to accurately sense the quality of each transport link, (2) flow rate assignment to optimize the aggregate goodput of input traffic, and (3) deadline-constrained packet interleaving to mitigate consecutive losses.

We present a mathematical formulation for multipath load distribution and derive the solution based on utility theory. The performance of the proposed model is evaluated through semi-physical emulations in Exata involving both real Internet traffic traces and H.264 video streaming. Experimental results show that GALTON outperforms existing traffic distribution models in terms of goodput, video Peak Signal-to-Noise Ratio (PSNR), end-to-end delay, and aggregate loss rate.
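
For readers unfamiliar with the PSNR metric named above, the sketch below computes it with the standard definition, 10 * log10(MAX^2 / MSE). The function name, the flat-list representation of samples, and the default peak value of 255 (8-bit video) are illustrative assumptions, not details from the paper.

```python
import math


def psnr(reference, distorted, max_value=255):
    """Peak Signal-to-Noise Ratio in dB between two equal-length sample lists."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 8-bit samples from an original and a received frame.
original = [0, 255, 128, 64]
received = [0, 250, 130, 64]
print(f"PSNR: {psnr(original, received):.2f} dB")
```

A higher PSNR means the received video is closer to the original, so packet losses that corrupt frames show up as a lower value.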

GPU Acceleration for Simulating Massively Parallel Many-Core Platforms

Emerging massively parallel architectures such as a general-purpose processor plus many-core programmable accelerators are creating an increasing demand for novel methods to perform their architectural simulation. Most state-of-the-art simulation technologies are exceedingly slow, and the need to model full-system many-core architectures adds further to the complexity. This paper presents a fast, scalable, and parallel simulator that uses a novel methodology to accelerate the simulation of a many-core coprocessor using GPU platforms. The simulation of many target nodes is mapped to the many hardware threads available on highly parallel GPU platforms.

We demonstrate the challenges, feasibility, and benefits of using a heterogeneous system (CPU and GPU) to simulate future many-core heterogeneous platforms. The target architecture selected to evaluate our methodology consists of an ARM general-purpose CPU coupled with a many-core coprocessor with thousands of simple in-order cores connected in a tile network. This work presents optimization techniques used to parallelize the simulation specifically for acceleration on GPUs. We partition the full-system simulation between CPU and GPU: the target general-purpose CPU is simulated on the host CPU, whereas the many-core coprocessor is simulated on the NVIDIA Tesla 2070 GPU platform. Our experiments show performance of up to 50 MIPS when simulating the entire heterogeneous chip, and high scalability as the number of coprocessor cores increases.

FreeRider: Non-Local Adaptive Network-on-Chip Routing with Packet-Carried Propagation of Congestion Information

Non-local adaptive routing techniques, which use the status of both local and distant links to make routing decisions, have recently been shown to be effective at improving the performance of Network-on-Chip (NoC). The essence of non-local adaptive routing has been an additional network dedicated to propagating congestion information about distant links on the NoC. While the dedicated Congestion Propagation Network (CPN) helps routers make promising routing decisions, it incurs additional wiring and power costs and becomes an unnecessary decoration when the NoC load is light. Moreover, the CPN has to be extended if more sophisticated congestion information is to be used to enhance NoC performance, bringing even larger wiring and power costs.

This paper proposes an innovative non-local adaptive routing technique called FreeRider, which does not use a dedicated CPN but instead leverages free bits in head flits of existing packets to carry and propagate rich congestion information without introducing additional wires or flits. In order to balance the network load, FreeRider adopts a novel three-stage strategy of output link selection, which adequately utilizes the propagated information to make routing decisions. Experimental results on both synthetic traffic patterns and application traces show that FreeRider achieves better throughput, shorter latency, and smaller power consumption than a state-of-the-art adaptive routing technique with dedicated CPN.
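
The idea of carrying congestion information in otherwise unused head-flit bits can be sketched as plain bit packing. The layout below is a hypothetical illustration, not the encoding used by FreeRider: it assumes the low 8 bits of the head flit are free and stores a 2-bit congestion level for each of four directions.

```python
FREE_BITS = 8        # assumed number of unused low-order bits in the head flit
LEVEL_BITS = 2       # assumed width of one congestion level (values 0..3)
DIRECTIONS = ("north", "east", "south", "west")
FREE_MASK = (1 << FREE_BITS) - 1
LEVEL_MASK = (1 << LEVEL_BITS) - 1


def pack_congestion(head_flit: int, levels: dict) -> int:
    """Write per-direction congestion levels into the free bits of a head flit."""
    field = 0
    for i, direction in enumerate(DIRECTIONS):
        level = levels[direction]
        if not 0 <= level <= LEVEL_MASK:
            raise ValueError(f"level for {direction} out of range")
        field |= level << (i * LEVEL_BITS)
    # Keep every non-free bit of the flit untouched.
    return (head_flit & ~FREE_MASK) | field


def unpack_congestion(head_flit: int) -> dict:
    """Read the per-direction congestion levels back out of a head flit."""
    field = head_flit & FREE_MASK
    return {d: (field >> (i * LEVEL_BITS)) & LEVEL_MASK
            for i, d in enumerate(DIRECTIONS)}


levels = {"north": 3, "east": 0, "south": 2, "west": 1}
flit = 0xABCD1200            # hypothetical head flit whose low 8 bits are unused
packed = pack_congestion(flit, levels)
print(hex(packed), unpack_congestion(packed))
```

Since the information rides in bits that already exist, a downstream router can read it from any passing packet without extra wires or extra flits.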

A Novel Method for Scaling Iterative Solvers: Avoiding Latency Overhead of Parallel Sparse-Matrix Vector Multiplies

In parallel linear iterative solvers, sparse matrix vector multiplication (SpMxV) incurs irregular point-to-point (P2P) communications, whereas inner product computations incur regular collective communications. These P2P communications cause an additional synchronization point with relatively high message latency costs due to small message sizes. In these solvers, each SpMxV is usually followed by an inner product computation that involves the output vector of SpMxV. Here, we exploit this property to propose a novel parallelization method that avoids the latency costs and synchronization overhead of P2P communications. Our method involves a computational and a communication rearrangement scheme. The computational rearrangement provides an alternative method for forming the input vector of SpMxV and allows P2P and collective communications to be performed in a single phase.

The communication rearrangement realizes this opportunity by embedding P2P communications into global collective communication operations. The proposed method guarantees a fixed bound on the maximum number of messages communicated, regardless of the sparsity pattern of the matrix. The downside, however, is an increased message volume and a negligible amount of redundant computation. We favor reducing the message latency costs at the expense of increasing message volume. Nevertheless, we propose two iterative-improvement-based heuristics to alleviate the increase in volume through one-to-one task-to-processor mapping. Our experiments on two supercomputers, Cray XE6 and IBM BlueGene/Q, with up to 2,048 processors show that the proposed parallelization method exhibits superior scalable performance compared to the conventional parallelization method.
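
The dependency this abstract builds on, an SpMxV whose output vector immediately feeds an inner product, can be shown with a small serial sketch. It does not implement the proposed parallel method; the CSR layout, function names, and example matrix are illustrative assumptions.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A x for a sparse matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # In a parallel run, x[col_idx[k]] may live on another processor,
            # which is what triggers the irregular P2P messages.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def inner_product(u, v):
    """In a parallel run this sum needs a global (collective) reduction."""
    return sum(a * b for a, b in zip(u, v))


# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = spmv_csr(values, col_idx, row_ptr, x)   # SpMxV
alpha = inner_product(x, y)                 # inner product on the SpMxV output
print(y, alpha)
```

In a conventional parallel solver these two steps form two separate communication phases; the method described above rearranges them so that both kinds of communication happen in one phase.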

A PTAS Mechanism for Provisioning and Allocation of Heterogeneous Cloud Resources

Cloud providers provision their heterogeneous resources such as CPUs, memory, and storage in the form of virtual machine (VM) instances which are then allocated to the users. One of the major challenges faced by the cloud providers is to allocate and provision these resources such that their profit is maximized, and the resources are utilized efficiently. Recently, cloud providers have introduced auction-based models which allow users to submit bids for their requested VMs. We address the problem of autonomic VM provisioning and allocation for the auction-based model considering multiple types of resources by designing an approximation mechanism.

In addition, the mechanism determines the payment the users have to pay for using the allocated resources. This problem is computationally intractable, and our proposed mechanism is by far the strongest approximation result that can be achieved for this problem. We show that the proposed approximation mechanism is a polynomial-time approximation scheme (PTAS). Furthermore, our proposed mechanism drives the system into an equilibrium in which the users do not have incentives to manipulate the system by untruthfully reporting their VM bundle requests and valuations. We perform extensive experiments using real workload traces in order to investigate the performance of the proposed mechanism.
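
For reference, the standard definition of a PTAS for a maximization problem can be written as follows. The symbols $\mathcal{A}_{\varepsilon}$ (the algorithm run with accuracy parameter $\varepsilon$), $I$ (a problem instance), and $\mathrm{OPT}(I)$ (the optimal objective value) are notation introduced here and do not come from the abstract.

```latex
\forall \varepsilon > 0,\ \forall I:\qquad
\mathcal{A}_{\varepsilon}(I) \;\ge\; (1 - \varepsilon)\,\mathrm{OPT}(I),
\qquad \text{with running time polynomial in } |I| \text{ for each fixed } \varepsilon .
```

The running time may grow quickly as $\varepsilon$ shrinks; the definition only requires it to be polynomial in the instance size once $\varepsilon$ is fixed.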

A Differentiated Quality Adaptation Approach for Scalable Streaming Services

Providing scalable video streaming services for heterogeneous users in dynamic networked environments requires efficient and adaptive quality management mechanisms that deliver quality-customized services according to the client’s preferences and adapt the services to cope with various network conditions. In this paper, we address the issue of quality adaptation for providing personalized scalable media streaming services in dynamic network environments. We propose a differentiated adaptive quality optimization algorithm, called the Scalable Video Coding Quality Adaptation algorithm (SVC-QA), which combines system-level and client-level optimization to optimize streaming quality according to network bandwidth conditions, content characteristics, a user’s quality preferences, and the buffering capacities of different client devices (e.g., mobile phones, PCs, and HDTVs).

We conduct comparative studies of our proposed algorithms against other adaptive methods. We show that the two-level SVC quality adaptation method can achieve better SVC streaming quality, with both high peak signal-to-noise ratio (PSNR) and low quality variance, under dynamic resource constraints. Moreover, the proposed distributed method substantially reduces the computational complexity at the server side, making it practical and flexible for providing scalable streaming services.

Social-Aware Replication in Geo-Diverse Online Systems

Distributing long-tail content is a difficult task due to the low amortization of bandwidth transfer costs, as such content has a limited number of views. Two recent trends are making this problem harder. First, the increasing popularity of user-generated content and online social networks creates and reinforces such popularity distributions. Second, content is increasingly geo-replicated across multiple points of presence (PoPs) spread around the world to improve quality of experience (QoE) for users. In this paper, we analyze and explore the tradeoff between the “freshness” of the information available to the users and WAN bandwidth costs, and we propose ways to reduce the latter through smart update propagation scheduling that leverages knowledge of the mapping between social relationships and geographic location, the timing regularities, and the time differences in end user activity. We first assess the potential of our approach by implementing a simple social-aware scheduling algorithm that operates under bandwidth budget constraints and by quantifying its benefits through a trace-driven analysis.

We show that it can reduce WAN traffic by up to 55 percent compared to an immediate update of all replicas, with a minimal effect on information freshness and latency. Second, we build TailGate, a practical system that implements our social-aware scheduling approach and distributes long-tail content across PoPs on the fly at reduced bandwidth costs by flattening the traffic. We evaluate TailGate using traces from an OSN and show that it can decrease WAN bandwidth costs by as much as 80 percent and improve QoE. We deploy TailGate on PlanetLab and show that even when only imprecise social information is available, it can still reduce the latency of accessing long-tail YouTube videos by a factor of 2.
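
Scheduling update propagation under a bandwidth budget can be illustrated with a simple greedy sketch. This is not the algorithm from the paper or TailGate's implementation: the `Update` record, the `expected_views` estimate (standing in for whatever social and timing information predicts near-term demand at a replica), and the views-per-megabyte ranking are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Update:
    item: str              # content identifier
    replica: str           # destination point of presence
    size_mb: float         # transfer size, must be positive
    expected_views: float  # hypothetical near-term demand estimate at the replica


def schedule(updates, budget_mb):
    """Greedily send the updates with the most expected views per MB first.

    Updates that do not fit into the current budget are deferred to a later,
    presumably less loaded, time slot.
    """
    send_now, deferred = [], []
    remaining = budget_mb
    ranked = sorted(updates, key=lambda u: u.expected_views / u.size_mb,
                    reverse=True)
    for update in ranked:
        if update.size_mb <= remaining:
            send_now.append(update)
            remaining -= update.size_mb
        else:
            deferred.append(update)
    return send_now, deferred


pending = [
    Update("video1", "eu", 50.0, 20.0),    # many friends of the uploader are in Europe
    Update("video1", "asia", 50.0, 1.0),   # few likely viewers, and it is night there
    Update("video2", "us", 10.0, 5.0),
]
send_now, deferred = schedule(pending, budget_mb=60.0)
print([(u.item, u.replica) for u in send_now])
print([(u.item, u.replica) for u in deferred])
```

Deferring the low-demand transfer trades a little freshness at one replica for a flatter traffic profile, which is the tradeoff the abstract describes.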

Efficient and Cost-Effective Hybrid Congestion Control for HPC Interconnection Networks

Interconnection networks are key components in high-performance computing (HPC) systems, and their performance strongly influences that of the overall system. However, at high load, congestion and its negative effects (e.g., head-of-line blocking) threaten the performance of the network, and thus that of the entire system. Congestion control (CC) is crucial to ensure efficient utilization of the interconnection network during congestion. As one major trend is to reduce the effective wiring in interconnection networks to cut cost and power consumption, the network will operate very close to its capacity, which makes congestion control essential. Existing CC techniques can be divided into two general approaches. One is to throttle traffic injection at the sources that contribute to congestion, and the other is to isolate the congested traffic in specially designated resources.

However, the two approaches have different, non-overlapping weaknesses: injection throttling techniques react slowly to congestion, while isolating traffic in special resources may lead the system to run out of those resources. In this paper we propose EcoCC, a new Efficient and Cost-Effective CC technique that combines injection throttling and congested-flow isolation to minimize their respective drawbacks and maximize overall system performance. This new strategy is suitable for current commercial switch architectures, where it could be implemented without significant added complexity. Experimental results, using simulations under synthetic and real trace-based traffic patterns, show that this technique achieves improvements of up to 55 percent over some of the most successful congestion control techniques.