Daniel Castro

Scopus Publications

PIM-STM: Software Transactional Memory for Processing-In-Memory Systems
André Lopes, Daniel Castro, and Paolo Romano
ACM
Processing-In-Memory (PIM) is a novel approach that augments existing DRAM memory chips with lightweight logic. By allowing computations to be offloaded to the PIM system, this architecture circumvents the data-bottleneck problem that affects many modern workloads. This work tackles the problem of how to build efficient software implementations of the Transactional Memory (TM) abstraction by introducing PIM-STM, a library that provides a range of diverse TM implementations for UPMEM, the first commercial PIM system. Via an extensive study, we assess the efficiency of alternative choices in the design space of TM algorithms on this emerging architecture. We further quantify the impact of using different memory tiers of the UPMEM system (which offer different trade-offs between latency and capacity) to store the metadata used by different TM implementations. Finally, we assess the gains achievable in terms of performance and memory efficiency when using PIM-STM to accelerate TM applications originally conceived for conventional CPU-based systems.

TIGER: Tor Traffic Generator for Realistic Experiments
Daniela Lopes, Daniel Castro, Diogo Barradas, and Nuno Santos
ACM
Tor is the most widely adopted anonymity network, helping safeguard the privacy of Internet users, including journalists and human rights activists. However, effective attacks aimed at deanonymizing Tor users remain a significant threat. Unfortunately, evaluating the impact of such attacks by collecting realistic Tor traffic without gathering real users' data poses a significant challenge. This paper introduces TIGER (Tor traffIc GEnerator for Realistic experiments), a novel framework that automates the generation of realistic Tor traffic datasets towards improving our understanding of the robustness of Tor's privacy mechanisms. To this end, TIGER allows researchers to design large-scale testbeds and collect data on the live Tor network while responsibly avoiding the need to collect real users' traffic. We motivate the usefulness of TIGER by collecting a preliminary dataset with applicability to the evaluation of traffic confirmation attacks and defenses.

CSMV: A highly scalable multi-versioned software transactional memory for GPUs
Diogo Nunes, Daniel Castro, and Paolo Romano
Elsevier BV

Stochastic simulated annealing for directed feedback vertex set
Luís M.S. Russo, Daniel Castro, Aleksandar Ilic, Paolo Romano, and Ana D. Correia
Elsevier BV

Persistent Memory: A Survey of Programming Support and Implementations
Alexandro Baldassin, João Barreto, Daniel Castro, and Paolo Romano
Association for Computing Machinery (ACM)
The recent rise of byte-addressable non-volatile memory technologies is blurring the dichotomy between memory and storage. In particular, they allow programmers to have direct access to persistent data instead of relying on traditional interfaces, such as file and database systems. However, they also bring new challenges, as a failure may leave the program in an unrecoverable and inconsistent state. Consequently, both industry and academia have put considerable effort into making the task of programming with such memories easier while, at the same time, keeping it efficient from the runtime perspective. This survey summarizes this body of research, from the abstractions to the implementation level. As persistent memory is starting to appear commercially, the state-of-the-art research condensed here will help investigators to quickly get up to date while also motivating others to pursue research in the field.

CSMV: A Highly Scalable Multi-Versioned Software Transactional Memory for GPUs
Diogo Nunes, Daniel Castro, and Paolo Romano
IEEE
GPUs have traditionally focused on streaming applications with regular parallelism. Over the last years, though, GPUs have also been successfully used to accelerate irregular applications in a number of application domains by using fine-grained synchronization schemes. Unfortunately, fine-grained synchronization strategies are notoriously complex and error-prone. This has motivated the search for alternative paradigms aimed at simplifying concurrent programming and, among these, Transactional Memory (TM) is probably one of the most prominent proposals. This paper introduces CSMV (Client Server Multiversioned), a multi-versioned Software TM (STM) for GPUs that adopts an innovative client-server design. By decoupling the execution of transactions from their commit process, CSMV provides two main benefits: (i) it enables the use of fast on-chip memory to access the global metadata used to synchronize transactions; (ii) it allows for implementing highly efficient collaborative commit procedures, tailored to take full advantage of the architectural characteristics of GPUs. Via an extensive experimental study, we show that CSMV achieves speed-ups of up to 3 orders of magnitude with respect to state-of-the-art STMs for GPUs and that it can accelerate irregular applications running on state-of-the-art STMs for CPUs by up to 20×.

SPHT: Scalable persistent hardware transactions

NV-PhTM: An efficient phase-based transactional system for non-volatile memory
Alexandro Baldassin, Rafael Murari, João P. L. de Carvalho, Guido Araujo, Daniel Castro, João Barreto, and Paolo Romano
Springer International Publishing

HeTM: Transactional memory for heterogeneous systems
Daniel Castro, Paolo Romano, Aleksandar Ilic, and Amin M. Khan
IEEE
Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. However, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, referred to herein as Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages speculative techniques to hide the inherently large communication latency between CPUs and discrete GPUs and to minimize inter-device synchronization overhead. We demonstrate the efficiency of SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a popular object caching system.

Hardware Transactional Memory meets memory persistency
Daniel Castro, Paolo Romano, and João Barreto
Elsevier BV

Hardware transactional memory meets memory persistency
Daniel Castro, Paolo Romano, and João Barreto
IEEE
Persistent Memory (PM) and Hardware Transactional Memory (HTM) are two recent architectural developments whose joint usage promises to drastically accelerate the performance of concurrent, data-intensive applications. Unfortunately, combining these two mechanisms using existing architectural supports is far from trivial. This paper presents NV-HTM, a system that allows the execution of transactions over PM using unmodified commodity HTM implementations. NV-HTM relies on a hardware-software co-design technique, which is based on three key ideas: i) relying on software to persist transactional modifications after they have been committed via HTM; ii) postponing the externalization of commit events to applications until it is ensured, via software, that any data version produced and observed by committed transactions is first logged in PM; iii) pruning the commit logs via checkpointing schemes that not only bound the log space and recovery time, but also implement wear levelling techniques to enhance PM's endurance. By means of an extensive experimental evaluation, we show that NV-HTM can achieve up to 10× speed-ups and up to 11.6× fewer flush operations with respect to state-of-the-art solutions, which, unlike NV-HTM, require custom modifications to existing HTM systems.

An Analytical Model of Hardware Transactional Memory
Daniel Castro, Paolo Romano, Diego Didona, and Willy Zwaenepoel
IEEE
This paper investigates the problem of deriving a white-box performance model of Hardware Transactional Memory (HTM) systems. The proposed model targets TSX, a popular implementation of HTM integrated in Intel processors starting with the Haswell family in 2013. An inherent difficulty with building white-box models of commercially available HTM systems is that their internals are either vaguely documented or undisclosed by their manufacturers. We tackle this challenge by designing a set of experiments that allow us to shed light on the internal mechanisms used in TSX to manage conflicts among transactions and to track their readsets and writesets. We exploit the information inferred from this experimental study to build an analytical model of TSX focused on capturing the performance impact of two key mechanisms: the concurrency control scheme and the management of transactional meta-data in the processor's caches. We validate the proposed model by means of an extensive experimental study encompassing a broad range of workloads executed on a real system.