@illumina.com
Emerging Solutions
Illumina
genomics, data analysis, federated analysis
Scopus Publications
Ernst Joachim Houtgast, Vlad-Mihai Sima, Koen Bertels, and Zaid Al-Ars
IEEE
Bioinformatics workloads are characterized by huge data sets and complex algorithms, requiring enormous data processing and making high performance heterogeneous computation platforms such as FPGAs and GPUs highly relevant. We compare three accelerated implementations of the widely used BWA-MEM genomic mapping tool as a case study on design-time optimization for heterogeneous architectures: BWA-MEM-CUDA, BWA-MEM-OpenCL, and BWA-MEM-VHDL, each using an optimized Smith-Waterman algorithm implementation. Optimization of design-time is important because of the significant development effort of such implementations: BWA-MEM-CUDA and BWA-MEM-OpenCL require 5-7x more lines of code to express the Smith-Waterman algorithm, while BWA-MEM-VHDL requires more than 40x as many lines of code. Similar differences hold for required implementation time, ranging from one month for BWA-MEM-OpenCL to six months for BWA-MEM-VHDL. The advantages and disadvantages of each implementation are described using both quantitative and qualitative metrics, and recommendations are given for future algorithm implementations.
Ernst Joachim Houtgast, Vlad-Mihai Sima, Koen Bertels, and Zaid Al-Ars
Elsevier BV
We present our work on hardware accelerated genomics pipelines, using either FPGAs or GPUs to accelerate execution of BWA-MEM, a widely-used algorithm for genomic short read mapping. The mapping stage can take up to 40% of overall processing time for genomics pipelines. Our implementation offloads the Seed Extension function, one of the main BWA-MEM computational functions, onto an accelerator. Sequencers typically output reads with a length of 150 base pairs. However, read length is expected to increase in the near future. Here, we investigate the influence of read length on BWA-MEM performance using data sets with read length up to 400 base pairs, and introduce methods to ameliorate the impact of longer read length. For the industry-standard 150 base pair read length, our implementation achieves an up to two-fold increase in overall application-level performance for systems with at most twenty-two logical CPU cores. Longer read length requires commensurately bigger data structures, which directly impacts accelerator efficiency. The two-fold performance increase is sustained for read length of at most 250 base pairs. To improve performance, we perform a classification of the inefficiency of the underlying systolic array architecture. By eliminating idle regions as much as possible, efficiency is improved by up to +95%. Moreover, adaptive load balancing intelligently distributes work between host and accelerator to ensure use of an accelerator always results in performance improvement, which in GPU-constrained scenarios provides up to +45% more performance.
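The adaptive load balancing described above, which divides work between host and accelerator so that a slow accelerator never holds the host back, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `split_batch` and the throughput figures are hypothetical, and the split is simply made proportional to measured throughput.

```python
def split_batch(n_items, host_rate, accel_rate):
    """Split n_items so host and accelerator finish at about the same time.

    host_rate and accel_rate are measured throughputs in items per second.
    Hypothetical helper for illustration; not taken from the paper.
    Returns (host_items, accel_items).
    """
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    if host_rate < 0 or accel_rate < 0 or host_rate + accel_rate == 0:
        raise ValueError("need non-negative rates with a positive sum")
    # The accelerator's share is proportional to its share of total throughput.
    accel_items = round(n_items * accel_rate / (host_rate + accel_rate))
    return n_items - accel_items, accel_items


if __name__ == "__main__":
    # Assumed rates: the accelerator is three times as fast as the host.
    print(split_batch(100, 1.0, 3.0))
```

With this kind of split, a slower accelerator automatically receives a smaller share of each batch, and an accelerator with zero throughput receives none.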
Ernst Houtgast, Vlad-Mihai Sima, and Zaid Al-Ars
IEEE
The Smith-Waterman algorithm is widely used in bioinformatics and is often used as a benchmark of FPGA performance. Here we present our highly optimized Smith-Waterman implementation on Intel FPGAs using OpenCL. Our implementation is both faster and more efficient than other current Smith-Waterman implementations, obtaining a theoretical performance of 214 GCUPS. Moreover, due to the streaming, implicitly synchronizing nature of our implementation, which streams alignments and places no restrictions on the number of alignments in flight, it achieves 99.8% of this performance in practice, almost three times as fast as previous implementations. The expressiveness of OpenCL results in a significant reduction in lines of code, and in a significant reduction of development time compared to programming in regular hardware description languages.
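For readers unfamiliar with the algorithm, a plain software version of the Smith-Waterman scoring recurrence is sketched below, together with the GCUPS metric (giga cell updates per second). This is an illustrative sketch only: the scoring values and the linear gap penalty are assumptions, and nothing here reflects the FPGA design itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    Uses a linear gap penalty; the scoring values are illustrative assumptions.
    Only two rows of the dynamic programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def gcups(cell_updates, seconds):
    """Giga cell updates per second for a given amount of matrix work."""
    return cell_updates / seconds / 1e9


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Each pair of sequences of lengths m and n costs m × n cell updates, which is the quantity counted by GCUPS.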
Razvan Nane, Vlad-Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu Ting Chen, Hsuan Hsiao, Stephen Brown, Fabrizio Ferrandi, et al.
Institute of Electrical and Electronics Engineers (IEEE)
High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today's system complexity. HLS allows designers to work at a higher level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools and give an overview of the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources.
Ernst Joachim Houtgast, Vlad-Mihai Sima, Giacomo Marchiori, Koen Bertels, and Zaid Al-Ars
IEEE
We propose a novel FPGA-accelerated implementation of BWA-MEM, a popular tool for genomic data mapping. The performance and power-efficiency of the FPGA implementation on a single Xilinx Virtex-7 Alpha Data add-in card are compared against a software-only baseline system. Offloading the Seed Extension phase onto the FPGA achieves a two-fold speedup in overall application-level performance and a 1.6x gain in power-efficiency. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 34 thousand base pairs per Joule.
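The base pairs per Joule metric can be illustrated with a short calculation. The abstract does not spell out the exact definition, so the sketch below assumes it is the number of base pairs mapped divided by the energy consumed (average power times runtime); the function name and the numbers in the usage example are hypothetical.

```python
def base_pairs_per_joule(base_pairs, avg_power_watts, runtime_seconds):
    """Power-efficiency as base pairs mapped per Joule of energy consumed.

    Assumes energy = average power (W) x runtime (s); this definition is an
    assumption for illustration and is not quoted from the paper.
    """
    energy_joules = avg_power_watts * runtime_seconds
    if energy_joules <= 0:
        raise ValueError("energy must be positive")
    return base_pairs / energy_joules


if __name__ == "__main__":
    # Hypothetical run: 1 million base pairs mapped in 2 s at 50 W.
    print(base_pairs_per_joule(1_000_000, 50.0, 2.0))
```

Because the metric depends only on work done and energy used, it can be compared across CPUs, GPUs, and FPGAs regardless of tooling.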
Nauman Ahmed, Vlad-Mihai Sima, Ernst Houtgast, Koen Bertels, and Zaid Al-Ars
IEEE
The fast decrease in cost of DNA sequencing has resulted in an enormous growth in available genome data, and hence led to an increasing demand for fast DNA analysis algorithms used for diagnostics of genetic disorders, such as cancer. One of the most computationally intensive steps in the analysis is represented by the DNA read alignment. In this paper, we present an accelerated version of BWA-MEM, one of the most popular read alignment algorithms, by implementing a heterogeneous hardware/software optimized version on the Convey HC2ex platform. A challenging factor of the BWA-MEM algorithm is the fact that it consists of not one, but three computationally intensive kernels: SMEM generation, suffix array lookup and local Smith-Waterman. Obtaining substantial speedup is hence contingent on accelerating all of these three kernels at once. The paper shows an architecture containing two hardware-accelerated kernels and one kernel optimized in software. The two hardware kernels of suffix array lookup and local Smith-Waterman are able to reach speedups of 2.8x and 5.7x, respectively. The software optimization of the SMEM generation kernel is able to achieve a speedup of 1.7x. This enables a total application acceleration of 2.6x compared to the original software version.
Ernst Joachim Houtgast, Vlad-Mihai Sima, Giacomo Marchiori, Koen Bertels, and Zaid Al-Ars
IEEE
Next Generation Sequencing techniques have dramatically reduced the cost of sequencing genetic material, resulting in huge amounts of data being sequenced. The processing of this data poses huge challenges, both from a performance perspective, as well as from a power-efficiency perspective. Heterogeneous computing can help on both fronts, by enabling more performant and more power-efficient solutions. In this paper, power-efficiency of the BWA-MEM algorithm, a popular tool for genomic data mapping, is studied on two heterogeneous architectures. The performance and power-efficiency of an FPGA-based implementation using a single Xilinx Virtex-7 FPGA on the Alpha Data add-in card is compared to a GPU-based implementation using an NVIDIA GeForce GTX 970 and against the software-only baseline system. By offloading the Seed Extension phase on an accelerator, both implementations are able to achieve a two-fold speedup in overall application-level performance over the software-only implementation. Moreover, the highly customizable nature of the FPGA results in much higher power-efficiency, as the FPGA power consumption is less than one fourth of that of the GPU. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 44 thousand base pairs per Joule, a 2.1x gain in power-efficiency as compared to the software-only baseline.
Ernst Joachim Houtgast, Vlad-Mihai Sima, Koen Bertels, and Zaid Al-Ars
Springer International Publishing
Genomic sequencing is rapidly becoming a premier generator of Big Data, posing great computational challenges. Hence, acceleration of the algorithms used is of utmost importance. This paper presents a GPU-accelerated implementation of BWA-MEM, a widely used algorithm to map genomic sequences onto a reference genome. BWA-MEM contains three main computational functions: Seed Generation, Seed Extension and Output Generation. This paper discusses acceleration of the Seed Extension function on a GPU accelerator.
The GPU-based Extend kernel achieves three times higher performance and, by offloading the kernel onto an accelerator and overlapping its execution with the other functions, this results in an overall improvement to application-level execution time of up to 1.6x.
To ensure that using an accelerator always results in an overall performance improvement, especially when considering slower GPUs, an adaptive load balancing solution is introduced, which intelligently distributes work between host and GPU. This provides, compared to not using load balancing, up to +46% more performance.
Ernst Joachim Houtgast, Vlad-Mihai Sima, Koen Bertels, and Zaid Al-Ars
IEEE
We present the first accelerated implementation of BWA-MEM, a popular genome sequence alignment algorithm widely used in next generation sequencing genomics pipelines. The Smith-Waterman-like sequence alignment kernel requires a significant portion of overall execution time. We propose and evaluate a number of FPGA-based systolic array architectures, presenting optimizations generally applicable to variable length Smith-Waterman execution. Our kernel implementation is up to 3× faster, compared to software-only execution. This translates into an overall application speedup of up to 45%, which is 96% of the theoretical maximum achievable speedup when accelerating only this kernel.
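Why a 3× kernel speedup produces a smaller application-level gain follows from Amdahl's law. The sketch below is generic: the kernel fraction used in the example is hypothetical rather than taken from the paper, and the model does not account for overlapping host and accelerator execution.

```python
import math


def amdahl_speedup(kernel_fraction, kernel_speedup):
    """Overall speedup when only a fraction of the runtime is accelerated."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must lie in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


def amdahl_limit(kernel_fraction):
    """Upper bound on overall speedup as the kernel speedup grows without limit."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must lie in [0, 1]")
    if kernel_fraction == 1.0:
        return math.inf
    return 1.0 / (1.0 - kernel_fraction)


if __name__ == "__main__":
    # Hypothetical: the kernel takes half of the runtime and becomes 3x faster.
    print(amdahl_speedup(0.5, 3.0), amdahl_limit(0.5))
```

The ratio between the achieved speedup and this limit is one way to express how much of the attainable gain a given kernel acceleration captures.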
Shanshan Ren, Vlad-Mihai Sima, and Zaid Al-Ars
IEEE
Many DNA sequence analysis tools have been developed to turn the massive raw DNA sequencing data generated by NGS (Next Generation Sequencing) platforms into biologically meaningful information. The pair-HMMs forward algorithm is widely used to calculate the overall alignment probability needed by a number of DNA analysis tools. In this paper, we propose a novel systolic array design to accelerate the pair-HMMs forward algorithm on FPGAs. A number of architectural features have been implemented to improve the performance of the design, such as early exit points to increase the utilization of the array for small sequence sizes, as well as on-chip buffering to enable the processing of long sequences effectively. We present an implementation of the design on the Convey supercomputing platform. Experimental results show that the FPGA implementation of the pair-HMMs forward algorithm is up to 67x faster, compared to software-only execution.
Mark de Jong, Vlad-Mihai Sima, Koen Bertels, and David Thomas
IEEE
Monte-Carlo Integration (MCI) is a numerical technique for evaluating integrals which have no closed form solution. Naive MCI randomly samples the integrand at uniformly distributed points. This naive approach converges very slowly. Stratified sampling can be used to concentrate the samples on segments of the integration domain where the integrand has the highest variance. Even with stratified sampling, MCI converges very slowly for multidimensional integrals. In this work, we implement an FPGA-accelerated design for MISER, a widely used adaptive MCI algorithm applying stratified sampling. We show how to eliminate the recursion from MISER and partition the algorithm between CPUs and FPGAs. The CPUs manage the control-heavy stratification strategy, while the FPGA is responsible for sampling the integrand. The integrand is compiled into a deep pipeline on the FPGA, producing one function evaluation per clock cycle. We demonstrate the FPGA-accelerated design by pricing a path dependent financial derivative called an Asian option. To make optimal use of the stratification, we implement a Brownian bridge on the FPGA that produces one entire bridge per clock cycle. The FPGA-accelerated design is up to 880 times faster compared to a software reference using the GSL implementation of MISER. Compared to naive MCI in software, our design even requires up to 3572 times less execution time to achieve the same accuracy.
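The difference between naive and stratified sampling can be shown in a few lines. This is a simplified sketch, not MISER: it uses fixed equal-width strata with one sample each in one dimension, whereas MISER allocates samples adaptively through recursive bisection. The function names and the test integrand are assumptions chosen for illustration.

```python
import random


def naive_mci(f, a, b, n, rng):
    """Estimate the integral of f over [a, b] from n uniform random samples."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n


def stratified_mci(f, a, b, strata, rng):
    """Estimate the integral using one random sample per equal-width stratum."""
    width = (b - a) / strata
    total = 0.0
    for i in range(strata):
        lo = a + i * width
        # Draw a single point uniformly inside the i-th stratum.
        total += f(lo + width * rng.random())
    return width * total


if __name__ == "__main__":
    rng = random.Random(0)
    square = lambda x: x * x  # exact integral over [0, 1] is 1/3
    print(naive_mci(square, 0.0, 1.0, 1000, rng))
    print(stratified_mci(square, 0.0, 1.0, 1000, rng))
```

For a monotone integrand such as this one, the stratified estimate with 1000 strata is guaranteed to lie within 0.001 of the exact value, while the naive estimate carries only a statistical error bound.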
Giovanni Mariani, Gianluca Palermo, Roel Meeuws, Vlad-Mihai Sima, Cristina Silvano, and Koen Bertels
IEEE
Application development for heterogeneous platforms requires coding functionalities and mapping them onto a set of different computing elements. As a consequence, the development process needs a clear understanding of both application requirements and heterogeneous computing technologies. To support the development process, we propose a framework called DRuiD capable of learning application characteristics that make them suitable for certain computing elements. The framework is composed of an expert system that supports the designer in the mapping decision and gives hints on possible code modifications to be applied to make the functionality more suitable for a computing element. The experimental results are tailored for a heterogeneous and reconfigurable platform (the Xilinx-ml510) including two computational elements, i.e. a Virtex5 FPGA and a PowerPC. In 88.9% of cases, the expert system correctly identifies the functionalities that are accelerated efficiently by using the FPGA, without requiring the kernel to be ported. Additionally, we present two case studies demonstrating the potential of the framework to give hints on high level code modifications for an efficient kernel mapping on the FPGA.
Razvan Nane, Vlad Mihai Sima, Cuong Pham Quoc, Fernando Goncalves, and Koen Bertels
IEEE
High-level synthesis (HLS) is an automated design process that deals with the generation of behavioral hardware descriptions from high-level algorithmic specifications. The main benefit of this approach is that ever-increasing system-on-chip (SoC) design complexity and ever-shorter time-to-market can still be both manageable and achievable. This advantage, coupled with the increasing number of available heterogeneous platforms that loosely couple general-purpose processors with Field Programmable Gate Array-based co-processors, has led to increasing attention to HLS tool development and optimization from both academia and industry. However, in order for HLS to fully reach its potential, it is imperative to look simultaneously at local HLS optimizations and at HLS system-level integration and design space exploration issues. In this paper, we present the Delft Workbench tool-chain that takes C-code as input and generates, in a semi-automatic way, a complete system. Subsequently, we describe the design and output code optimization of the DWARV 3.0 HLS compiler using the CoSy compiler framework. Based on this experience, we provide an overview of similarities and differences in leveraging this commercial compiler framework to build a hardware compiler as opposed to building a software compiler. Finally, we report speedups up to 3.72x at application level and development times measurable in hours rather than weeks.
Fernando Gonçalves, Zlatko Petrov, José Gabriel de F. Coutinho, Razvan Nane, Vlad-Mihai Sima, João M. P. Cardoso, Stephan Werner, Sujit Bhattacharya, Tiago Carvalho, Ricardo Nobre, et al.
Springer New York
Ricardo Nobre, João M. P. Cardoso, Bryan Olivier, Razvan Nane, Liam Fitzpatrick, José Gabriel de F. Coutinho, Hans van Someren, Vlad-Mihai Sima, Koen Bertels, and Pedro C. Diniz
Springer New York
This chapter describes the CoSy-based [1, 2] compilers developed in the context of the REFLECT project to support its aspect-oriented design-flow. In particular, these CoSy-based compilers are guided by LARA strategies and are responsible for generating code targeting traditional processors, as well as generating behavioral-RTL VHDL code [3] targeting hardware accelerators. Throughout this chapter, these compilers are referred to collectively as reflectc, except when a specific compilation flow (with its specific name) is used. We also describe the compiler development extensions to support the REFLECT approach [4] and the weaving process controlled by LARA strategies [5–7] as described in Chap. 3.
João M. P. Cardoso, José Gabriel de F. Coutinho, Razvan Nane, Vlad-Mihai Sima, Bryan Olivier, Tiago Carvalho, Ricardo Nobre, Pedro C. Diniz, Zlatko Petrov, Koen Bertels, et al.
Springer New York
This chapter describes the design-flow approach developed in the REFLECT project as presented originally in [1]. Over the course of the project, this design-flow has evolved and has been extended into a fully operational toolchain. We begin by presenting an overview of the underlying aspect-oriented compilation flow followed by an extended description of the design-flow and its toolchain.
Roel Meeuws, S. Arash Ostadzadeh, Carlo Galuzzi, Vlad Mihai Sima, Razvan Nane, and Koen Bertels
Association for Computing Machinery (ACM)
There has been a steady increase in the utilization of heterogeneous architectures to tackle the growing need for computing performance and low-power systems. The execution of computation-intensive functions on specialized hardware enables substantial speedups and power savings. However, with a large legacy code base and software engineering experts, it is not at all obvious how to easily utilize these new architectures. As a result, there is a need for comprehensive tool support to bridge the knowledge gap of many engineers as well as to retarget legacy code. In this article, we present the Quipu modeling approach, which consists of a set of tools and a modeling methodology that can generate hardware estimation models, which provide valuable information for developers. This information helps to focus their efforts, to partition their application, and to select the right heterogeneous components. We present Quipu's capability to generate domain-specific models that are up to several times more accurate within their particular domain (error: 4.6%) as compared to domain-agnostic models (error: 23%). Finally, we show how Quipu can generate models for a new toolchain and platform within a few days.
Giovanni Mariani, Vlad-Mihai Sima, Gianluca Palermo, Vittorio Zaccaria, Giacomo Marchiori, Cristina Silvano, and Koen Bertels
IEEE
A key tool to increase the exploitation of dynamic reconfigurable platforms is the run-time resource manager. This system module coordinates the usage of both software and reconfigurable hardware resources in the context of a multi-programmed environment, by alleviating the operating system's induced overhead. This paper introduces a two-layer run-time resource manager for dynamic reconfigurable platforms. The upper level is composed of several application-level managers (one for each application) that provide the most suitable mapping based on resource constraints and performance prediction. The lower level is composed of a centralized system-level resource manager that assigns the HW/SW resources to each application. We present a video surveillance case study in which the proposed resource management technique outperforms the state of the art by 28% on average, introducing a computational time overhead within 2%.
Razvan Nane, Vlad-Mihai Sima, Bryan Olivier, Roel Meeuws, Yana Yankova, and Koen Bertels
IEEE
In the last decade, a considerable amount of effort was spent on raising the implementation level of hardware systems by automatically extracting the parallelism from input applications and using tools to generate Hardware/Software co-design solutions. However, the tools developed thus far either focus on particular application domains or they impose severe restrictions on the input language. In this paper, we present the DWARV 2.0 compiler that accepts general C-code as input and generates synthesizable VHDL for unrestricted application domains. Unlike previous hardware compilers, this implementation is based on the CoSy compiler framework. This allowed us to build a highly modular compiler in which standard or custom optimizations can be easily integrated. Validation experiments showed speed-ups of up to 4.41× when comparing against another state of the art hardware compiler.
Razvan Nane, Vlad-Mihai Sima, and Koen Bertels
IEEE
If-conversion is a known software technique to speed up applications containing conditional expressions and targeting processors with predication support. However, the success of this scheme is highly dependent on the structure of the if-statements, i.e., if they are balanced or unbalanced, as well as on the path taken. Therefore, the predication scheme does not always provide a better execution time than the conventional jump scheme. In this paper, we present an algorithm that leverages the benefits of both jump and predication schemes adapted for hardware execution. The results show that performance degradation no longer occurs for unbalanced if-statements, and that all test cases achieve a speedup of between 4% and 21%.
R. Nane, V.M. Sima, and K. Bertels
IEEE
Hardware compilers which generate hardware descriptions from high-level languages are rapidly gaining in popularity. These generated descriptions are used to obtain fast implementations of software/hardware solutions in heterogeneous computing platforms. However, to obtain optimal solutions under certain platform constraints, we need intelligent hardware compilers that choose proper values for the different design parameters automatically. In this paper, we present a two-step algorithm to optimize the performance for different area constraints. The design parameters under investigation are the maximum unroll factor and the optimal allocation of resource types. Experimental results show that generated solutions are mapped into the available area at an occupancy rate between 74% and 99%. Furthermore, these solutions provide the best execution time when compared to the other solutions that satisfy the same area constraint. Finally, a reduction in design time of 42x on average can be achieved when these parameters are chosen by the compiler compared to manually selecting them.
Koen Bertels, Ariano Lattanzi, Emanuele Ciavattini, Ferruccio Bettarelli, Maria Teresa Chiaradia, Raffaele Nutricato, Alberto Morea, Anna Antola, Fabrizio Ferrandi, Marco Lattuada, et al.
Springer Netherlands
This chapter describes the different design steps needed to go from legacy code to a transformed application that can be efficiently mapped on the hArtes platform.
Ferruccio Bettarelli, Emanuele Ciavattini, Ariano Lattanzi, Giovanni Beltrame, Fabrizio Ferrandi, Luca Fossati, Christian Pilato, Donatella Sciuto, Roel J. Meeuws, S. Arash Ostadzadeh, et al.
Springer Netherlands
In this chapter, we describe functionality that was also developed in the context of the hArtes project but was either not included in the final release or is released separately. The development of the tools described here was often initiated after certain limitations of the current toolset were identified. This was the case of the memory analyser QUAD which does a detailed analysis of the memory accesses. Other tools, such as the rSesame tool, were developed and explored in parallel with the hArtes tool chain. This tool assumes a KPN-version of the application and then allows for high level simulation and experimentation with different mappings and partitionings. Finally, ReSP was developed to validate the partitioning results before a real implementation was possible.
G. Mariani, V. Sima, G. Palermo, V. Zaccaria, C. Silvano, and K. Bertels
IEEE
Resource run-time managers have been shown to be particularly effective for coordinating the usage of the hardware resources by multiple applications, eliminating the necessity of a full-blown operating system. For this reason, we expect that this technology will be increasingly adopted in emerging multi-application reconfigurable systems. This paper introduces a fully automated design flow that exploits multi-objective design space exploration to enable runtime resource management for the Molen reconfigurable architecture. The entry point of the design flow is the application source code; our flow is able to heuristically determine a set of candidate hardware/software configurations of the application (i.e., operating points) that trade off the occupation of the reconfigurable fabric (in this case, an FPGA), the load of the master processor and the performance of the application itself. This information enables a run-time manager to exploit more efficiently the available system resources in the context of multiple applications. We present the results of an experimental campaign where we applied the proposed design flow to two reference audio applications mapped on the Molen architecture. The analysis proved that the overhead of the design space exploration and operating points extraction with respect to the original Molen flow is within reasonable bounds since the final synthesis time still represents the major contribution. Besides, we have found that there is a high variance in terms of execution time speedup associated with the operating points of the application (characterized by a different usage of the FPGA) which can be exploited by the run-time manager to increase/decrease the quality of service of the application depending on the available resources.
Razvan Nane, Sven van Haastregt, Todor Stefanov, Bart Kienhuis, Vlad Mihai Sima, and Koen Bertels
IEEE
Many of today's embedded multiprocessor systems are implemented as heterogeneous systems, consisting of hardware and software components. To automate the composition and integration of multiprocessor systems, the IP-XACT standard was defined to describe hardware IP blocks and (sub)systems. However, the IP-XACT standard does not provide sufficient means to express Reconfigurable Computing (RC) specific information, such as Hardware dependent Software (HdS) meta-data, which prevents automated integration. In this paper, we propose several IP-XACT extensions such that the HdS can be generated and integrated automatically. We validate these specific extensions and demonstrate the interoperability of the approach based on an H.264 decoder application case study. For this case study we achieved an overall 30.4% application-wise speed-up and we reduced the development time of HdS from days to a few seconds.