Comparative analysis of system-level acceleration techniques in bioinformatics: A case study of accelerating the Smith-Waterman Algorithm for BWA-MEM Ernst Joachim Houtgast, Vlad-Mihai Sima, Koen Bertels, Zaid Al-Ars Proceedings 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering Bibe 2018, 2018 Bioinformatics workloads are characterized by huge data sets and complex algorithms, requiring enormous data processing and making high performance heterogeneous computation platforms such as FPGAs and GPUs highly relevant. We compare three accelerated implementations of the widely used BWA-MEM genomic mapping tool as a case study on design-time optimization for heterogeneous architectures: BWA-MEM-CUDA, BWA-MEM-OpenCL, and BWA-MEM-VHDL, each using an optimized Smith-Waterman algorithm implementation. Optimization of design-time is important because of the significant development effort of such implementations: BWA-MEM-CUDA and BWA-MEM-OpenCL require 5-7x more lines of code to express the Smith-Waterman algorithm, while BWA-MEM-VHDL requires more than 40x as many lines of code. Similar differences hold for required implementation time, ranging from one month for BWA-MEM-OpenCL to six months for BWA-MEM-VHDL. The advantages and disadvantages of each implementation are described using both quantitative and qualitative metrics, and recommendations are given for future algorithm implementations.
Hardware acceleration of BWA-MEM genomic short read mapping for longer read lengths Ernst Joachim Houtgast, Vlad-Mihai Sima, Koen Bertels, Zaid Al-Ars Computational Biology and Chemistry, 2018 We present our work on hardware accelerated genomics pipelines, using either FPGAs or GPUs to accelerate execution of BWA-MEM, a widely-used algorithm for genomic short read mapping. The mapping stage can take up to 40% of overall processing time for genomics pipelines. Our implementation offloads the Seed Extension function, one of the main BWA-MEM computational functions, onto an accelerator. Sequencers typically output reads with a length of 150 base pairs. However, read length is expected to increase in the near future. Here, we investigate the influence of read length on BWA-MEM performance using data sets with read length up to 400 base pairs, and introduce methods to ameliorate the impact of longer read length. For the industry-standard 150 base pair read length, our implementation achieves an up to two-fold increase in overall application-level performance for systems with at most twenty-two logical CPU cores. Longer read length requires commensurately bigger data structures, which directly impacts accelerator efficiency. The two-fold performance increase is sustained for read length of at most 250 base pairs. To improve performance, we perform a classification of the inefficiency of the underlying systolic array architecture. By eliminating idle regions as much as possible, efficiency is improved by up to +95%. Moreover, adaptive load balancing intelligently distributes work between host and accelerator to ensure use of an accelerator always results in performance improvement, which in GPU-constrained scenarios provides up to +45% more performance.
High performance streaming Smith-Waterman implementation with implicit synchronization on Intel FPGA using OpenCL Ernst Houtgast, Vlad-Mihai Sima, Zaid Al-Ars Proceedings 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering Bibe 2017, 2017 The Smith-Waterman algorithm is widely used in bioinformatics and is often used as a benchmark of FPGA performance. Here we present our highly optimized Smith-Waterman implementation on Intel FPGAs using OpenCL. Our implementation is both faster and more efficient than other current Smith-Waterman implementations, obtaining a theoretical performance of 214 GCUPS. Moreover, due to the streaming, implicit synchronizing nature of our implementation, which streams alignments and places no restrictions on the number of alignments in flight, it achieves 99.8% of this performance in practice, almost three times as fast as previous implementations. The expressiveness of OpenCL results in a significant reduction in lines of code, and in a significant reduction of development time compared to programming in regular hardware description languages.
A Survey and Evaluation of FPGA High-Level Synthesis Tools Razvan Nane, Vlad-Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, et al. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 2016 High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today's system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources.
Power-Efficient Accelerated Genomic Short Read Mapping on Heterogeneous Computing Platforms Ernst Joachim Houtgast, Vlad-Mihai Sima, Giacomo Marchiori, Koen Bertels, Zaid Al-Ars Proceedings 24th IEEE International Symposium on Field Programmable Custom Computing Machines Fccm 2016, 2016 We propose a novel FPGA-accelerated BWA-MEM implementation, a popular tool for genomic data mapping. The performance and power-efficiency of the FPGA implementation on the single Xilinx Virtex-7 Alpha Data add-in card is compared against a software-only baseline system. By offloading the Seed Extension phase onto the FPGA, a two-fold speedup in overall application-level performance is achieved and a 1.6x gain in power-efficiency. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 34 thousand base pairs per Joule.
Heterogeneous hardware/software acceleration of the BWA-MEM DNA alignment algorithm Nauman Ahmed, Vlad-Mihai Sima, Ernst Houtgast, Koen Bertels, Zaid Al-Ars 2015 IEEE ACM International Conference on Computer Aided Design Iccad 2015, 2016 The fast decrease in cost of DNA sequencing has resulted in an enormous growth in available genome data, and hence led to an increasing demand for fast DNA analysis algorithms used for diagnostics of genetic disorders, such as cancer. One of the most computationally intensive steps in the analysis is represented by the DNA read alignment. In this paper, we present an accelerated version of BWA-MEM, one of the most popular read alignment algorithms, by implementing a heterogeneous hardware/software optimized version on the Convey HC2ex platform. A challenging factor of the BWA-MEM algorithm is the fact that it consists of not one, but three computationally intensive kernels: SMEM generation, suffix array lookup and local Smith-Waterman. Obtaining substantial speedup is hence contingent on accelerating all of these three kernels at once. The paper shows an architecture containing two hardware-accelerated kernels and one kernel optimized in software. The two hardware kernels of suffix array lookup and local Smith-Waterman are able to reach speedups of 2.8x and 5.7x, respectively. The software optimization of the SMEM generation kernel is able to achieve a speedup of 1.7x. This enables a total application acceleration of 2.6x compared to the original software version.
GPU-accelerated BWA-MEM genomic mapping algorithm using adaptive load balancing Ernst Joachim Houtgast, Vlad-Mihai Sima, Koen Bertels, Zaid Al-Ars Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2016 Genomic sequencing is rapidly becoming a premier generator of Big Data, posing great computational challenges. Hence, acceleration of the algorithms used is of utmost importance. This paper presents a GPU-accelerated implementation of BWA-MEM, a widely used algorithm to map genomic sequences onto a reference genome. BWA-MEM contains three main computational functions: Seed Generation, Seed Extension and Output Generation. This paper discusses acceleration of the Seed Extension function on a GPU accelerator. The GPU-based Extend kernel achieves three times higher performance and, by offloading the kernel onto an accelerator and overlapping its execution with the other functions, this results in an overall improvement to application-level execution time of up to 1.6x. To ensure that using an accelerator always results in an overall performance improvement, especially when considering slower GPUs, an adaptive load balancing solution is introduced, which intelligently distributes work between host and GPU. This provides, compared to not using load balancing, up to +46% more performance.
Power-efficiency analysis of accelerated BWA-MEM implementations on heterogeneous computing platforms Ernst Joachim Houtgast, Vlad-Mihai Sima, Giacomo Marchiori, Koen Bertels, Zaid Al-Ars 2016 International Conference on Reconfigurable Computing and Fpgas Reconfig 2016, 2016 Next Generation Sequencing techniques have dramatically reduced the cost of sequencing genetic material, resulting in huge amounts of data being sequenced. The processing of this data poses huge challenges, both from a performance perspective, as well as from a power-efficiency perspective. Heterogeneous computing can help on both fronts, by enabling more performant and more power-efficient solutions. In this paper, power-efficiency of the BWA-MEM algorithm, a popular tool for genomic data mapping, is studied on two heterogeneous architectures. The performance and power-efficiency of an FPGA-based implementation using a single Xilinx Virtex-7 FPGA on the Alpha Data add-in card is compared to a GPU-based implementation using an NVIDIA GeForce GTX 970 and against the software-only baseline system. By offloading the Seed Extension phase on an accelerator, both implementations are able to achieve a two-fold speedup in overall application-level performance over the software-only implementation. Moreover, the highly customizable nature of the FPGA results in much higher power-efficiency, as the FPGA power consumption is less than one fourth of that of the GPU. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 44 thousand base pairs per Joule, a 2.1x gain in power-efficiency as compared to the software-only baseline.
An FPGA-based systolic array to accelerate the BWA-MEM genomic mapping algorithm Ernst Joachim Houtgast, Vlad-Mihai Sima, Koen Bertels, Zaid Al-Ars Proceedings 2015 International Conference on Embedded Computer Systems Architectures Modeling and Simulation Samos 2015, 2015 We present the first accelerated implementation of BWA-MEM, a popular genome sequence alignment algorithm widely used in next generation sequencing genomics pipelines. The Smith-Waterman-like sequence alignment kernel requires a significant portion of overall execution time. We propose and evaluate a number of FPGA-based systolic array architectures, presenting optimizations generally applicable to variable length Smith-Waterman execution. Our kernel implementation is up to 3× faster, compared to software-only execution. This translates into an overall application speedup of up to 45%, which is 96% of the theoretically maximum achievable speedup when accelerating only this kernel.
FPGA acceleration of the pair-HMMs forward algorithm for DNA sequence analysis Shanshan Ren, Vlad-Mihai Sima, Zaid Al-Ars Proceedings 2015 IEEE International Conference on Bioinformatics and Biomedicine Bibm 2015, 2015 Many DNA sequence analysis tools have been developed to turn the massive raw DNA sequencing data generated by NGS (Next Generation Sequencing) platforms into biologically meaningful information. The pair-HMMs forward algorithm is widely used to calculate the overall alignment probability needed by a number of DNA analysis tools. In this paper, we propose a novel systolic array design to accelerate the pair-HMMs forward algorithm on FPGAs. A number of architectural features have been implemented to improve the performance of the design, such as early exit points to increase the utilization of the array for small sequence sizes, as well as on-chip buffering to enable the processing of long sequences effectively. We present an implementation of the design on the Convey supercomputing platform. Experimental results show that the FPGA implementation of the pair-HMMs forward algorithm is up to 67x faster, compared to software-only execution.
Hardware/software compilation Ricardo Nobre, João M. P. Cardoso, Bryan Olivier, Razvan Nane, Liam Fitzpatrick, et al. Compilation and Synthesis for Embedded Reconfigurable Systems an Aspect Oriented Approach, 2013
LARA experiments Fernando Gonçalves, Zlatko Petrov, José Gabriel de F. Coutinho, Razvan Nane, Vlad-Mihai Sima, et al. Compilation and Synthesis for Embedded Reconfigurable Systems an Aspect Oriented Approach, 2013
The REFLECT design-flow João M. P. Cardoso, José Gabriel de F. Coutinho, Razvan Nane, Vlad-Mihai Sima, Bryan Olivier, et al. Compilation and Synthesis for Embedded Reconfigurable Systems an Aspect Oriented Approach, 2013
DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler Razvan Nane, Vlad-Mihai Sima, Bryan Olivier, Roel Meeuws, Yana Yankova, et al. Proceedings 22nd International Conference on Field Programmable Logic and Applications Fpl 2012, 2012
Extensions of the hArtes tool chain Ferruccio Bettarelli, Emanuele Ciavattini, Ariano Lattanzi, Giovanni Beltrame, Fabrizio Ferrandi, et al. Hardware Software Co Design for Heterogeneous Multi Core Platforms the Hartes Toolchain, 2012
The hArtes tool chain Koen Bertels, Ariano Lattanzi, Emanuele Ciavattini, Ferruccio Bettarelli, Maria Teresa Chiaradia, et al. Hardware Software Co Design for Heterogeneous Multi Core Platforms the Hartes Toolchain, 2012
The hArtes CarLab: A new approach to advanced algorithms development for automotive audio AES Journal of the Audio Engineering Society, 2011
RECENT SCHOLAR PUBLICATIONS
Comparative analysis of system-level acceleration techniques in bioinformatics: A case study of accelerating the Smith-Waterman algorithm for BWA-MEM EJ Houtgast, VM Sima, K Bertels, Z Al-Ars 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering … , 2018 2018 Citations: 10
Hardware acceleration of BWA-MEM genomic short read mapping for longer read lengths EJ Houtgast, VM Sima, K Bertels, Z Al-Ars Computational biology and chemistry 75, 54-64 , 2018 2018 Citations: 178
High performance streaming Smith-Waterman implementation with implicit synchronization on Intel FPGA using OpenCL E Houtgast, VM Sima, Z Al-Ars 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering … , 2017 2017 Citations: 26
An efficient GPU-accelerated implementation of genomic short read mapping with BWA-MEM EJ Houtgast, VM Sima, K Bertels, Z Al-Ars ACM SIGARCH Computer Architecture News 44 (4), 38-43 , 2017 2017 Citations: 25
Power-efficiency analysis of accelerated BWA-MEM implementations on heterogeneous computing platforms EJ Houtgast, VM Sima, G Marchiori, K Bertels, Z Al-Ars 2016 international conference on reconfigurable computing and fpgas … , 2016 2016 Citations: 16
Power-efficient accelerated genomic short read mapping on heterogeneous computing platforms EJ Houtgast, VM Sima, G Marchiori, K Bertels, Z Al-Ars 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom … , 2016 2016 Citations: 5
GPU-accelerated BWA-MEM genomic mapping algorithm using adaptive load balancing EJ Houtgast, VM Sima, K Bertels, Z Al-Ars International conference on architecture of computing systems, 130-142 , 2016 2016 Citations: 34
Computational Challenges of Next Generation Sequencing Pipelines Using Heterogeneous Systems EJ Houtgast, VM Sima, K Bertels, Z Al-Ars 12th International Summer School on Advanced Computer Architecture and … , 2016 2016 Citations: 2
A survey and evaluation of FPGA high-level synthesis tools R Nane, VM Sima, C Pilato, J Choi, B Fort, A Canis, YT Chen, H Hsiao, ... IEEE Transactions on Computer-Aided Design of Integrated Circuits and … , 2015 2015 Citations: 896
FPGA acceleration of the pair-HMMs forward algorithm for DNA sequence analysis S Ren, VM Sima, Z Al-Ars 2015 IEEE international conference on bioinformatics and biomedicine (BIBM … , 2015 2015 Citations: 53
Heterogeneous hardware/software acceleration of the BWA-MEM DNA alignment algorithm N Ahmed, VM Sima, E Houtgast, K Bertels, Z Al-Ars 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 240-246 , 2015 2015 Citations: 69
An FPGA-based systolic array to accelerate the BWA-MEM genomic mapping algorithm EJ Houtgast, VM Sima, K Bertels, Z Al-Ars 2015 international conference on embedded computer systems: Architectures … , 2015 2015 Citations: 78
Intra-application data-communication characterization I Ashraf, VM Sima, K Bertels Proc. 1st Int. Workshop Commun. Archit. Extreme Scale, 1-11 , 2015 2015 Citations: 7
FPGA-accelerated Monte-Carlo integration using stratified sampling and Brownian bridges M De Jong, VM Sima, K Bertels, D Thomas 2014 International Conference on Field-Programmable Technology (FPT), 68-75 , 2014 2014 Citations: 6
High-level synthesis in the delft workbench hardware/software co-design tool-chain R Nane, VM Sima, CP Quoc, F Goncalves, K Bertels 2014 12th IEEE International Conference on Embedded and Ubiquitous Computing … , 2014 2014 Citations: 20
DRuiD: Designing reconfigurable architectures with decision-making support G Mariani, G Palermo, R Meeuws, VM Sima, C Silvano, K Bertels 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), 213-218 , 2014 2014 Citations: 6
Run-time optimization of a dynamically reconfigurable embedded system through performance prediction G Mariani, VM Sima, G Palermo, V Zaccaria, G Marchiori, C Silvano, ... 2013 23rd International Conference on Field programmable Logic and … , 2013 2013 Citations: 2
LARA experiments F Gonçalves, Z Petrov, JG de F. Coutinho, R Nane, VM Sima, ... Compilation and Synthesis for Embedded Reconfigurable Systems: An Aspect … , 2013 2013
The REFLECT design-flow JMP Cardoso, JG de F. Coutinho, R Nane, VM Sima, B Olivier, T Carvalho, ... Compilation and Synthesis for Embedded Reconfigurable Systems: An Aspect … , 2013 2013 Citations: 2
Hardware/Software Compilation R Nobre, JMP Cardoso, B Olivier, R Nane, L Fitzpatrick, ... Compilation and Synthesis for Embedded Reconfigurable Systems: An Aspect … , 2013 2013 Citations: 3
MOST CITED SCHOLAR PUBLICATIONS
A survey and evaluation of FPGA high-level synthesis tools R Nane, VM Sima, C Pilato, J Choi, B Fort, A Canis, YT Chen, H Hsiao, ... IEEE Transactions on Computer-Aided Design of Integrated Circuits and … , 2015 2015 Citations: 896
Hardware acceleration of BWA-MEM genomic short read mapping for longer read lengths EJ Houtgast, VM Sima, K Bertels, Z Al-Ars Computational biology and chemistry 75, 54-64 , 2018 2018 Citations: 178
DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler R Nane, VM Sima, B Olivier, R Meeuws, Y Yankova, K Bertels 22nd international conference on field programmable logic and applications … , 2012 2012 Citations: 133
An FPGA-based systolic array to accelerate the BWA-MEM genomic mapping algorithm EJ Houtgast, VM Sima, K Bertels, Z Al-Ars 2015 international conference on embedded computer systems: Architectures … , 2015 2015 Citations: 78
Heterogeneous hardware/software acceleration of the BWA-MEM DNA alignment algorithm N Ahmed, VM Sima, E Houtgast, K Bertels, Z Al-Ars 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 240-246 , 2015 2015 Citations: 69
FPGA acceleration of the pair-HMMs forward algorithm for DNA sequence analysis S Ren, VM Sima, Z Al-Ars 2015 IEEE international conference on bioinformatics and biomedicine (BIBM … , 2015 2015 Citations: 53
Hartes: Hardware-software codesign for heterogeneous multicore platforms K Bertels, VM Sima, Y Yankova, G Kuzmanov, W Luk, G Coutinho, ... IEEE micro 30 (5), 88-97 , 2010 2010 Citations: 38
Using multi-objective design space exploration to enable run-time resource management for reconfigurable architectures G Mariani, VM Sima, G Palermo, V Zaccaria, C Silvano, K Bertels 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE … , 2012 2012 Citations: 36
GPU-accelerated BWA-MEM genomic mapping algorithm using adaptive load balancing EJ Houtgast, VM Sima, K Bertels, Z Al-Ars International conference on architecture of computing systems, 130-142 , 2016 2016 Citations: 34
Runtime decision of hardware or software execution on a heterogeneous reconfigurable platform VM Sima, K Bertels 2009 IEEE International Symposium on Parallel & Distributed Processing, 1-6 , 2009 2009 Citations: 27
High performance streaming Smith-Waterman implementation with implicit synchronization on Intel FPGA using OpenCL E Houtgast, VM Sima, Z Al-Ars 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering … , 2017 2017 Citations: 26
An efficient GPU-accelerated implementation of genomic short read mapping with BWA-MEM EJ Houtgast, VM Sima, K Bertels, Z Al-Ars ACM SIGARCH Computer Architecture News 44 (4), 38-43 , 2017 2017 Citations: 25
Compiler assisted runtime task scheduling on a reconfigurable computer M Sabeghi, VM Sima, K Bertels 2009 International Conference on Field Programmable Logic and Applications … , 2009 2009 Citations: 22
High-level synthesis in the delft workbench hardware/software co-design tool-chain R Nane, VM Sima, CP Quoc, F Goncalves, K Bertels 2014 12th IEEE International Conference on Embedded and Ubiquitous Computing … , 2014 2014 Citations: 20
REFLECT: rendering FPGAs to multi-core embedded computing JMP Cardoso, PC Diniz, Z Petrov, K Bertels, M Hübner, H van Someren, ... Reconfigurable Computing: From FPGAs to Hardware/Software Codesign, 261-289 , 2011 2011 Citations: 18
Power-efficiency analysis of accelerated BWA-MEM implementations on heterogeneous computing platforms EJ Houtgast, VM Sima, G Marchiori, K Bertels, Z Al-Ars 2016 international conference on reconfigurable computing and fpgas … , 2016 2016 Citations: 16
Quipu: A statistical model for predicting hardware resources R Meeuws, SA Ostadzadeh, C Galuzzi, VM Sima, R Nane, K Bertels ACM Transactions on Reconfigurable Technology and Systems (TRETS) 6 (1), 1-25 , 2013 2013 Citations: 14
Hartes toolchain early evaluation: Profiling, Compilation and HDL generation K Bertels, G Kuzmanov, EM Panainte, G Gaydadjiev, Y Yankova, ... 2007 International Conference on Field Programmable Logic and Applications … , 2007 2007 Citations: 12
Comparative analysis of system-level acceleration techniques in bioinformatics: A case study of accelerating the Smith-Waterman algorithm for BWA-MEM EJ Houtgast, VM Sima, K Bertels, Z Al-Ars 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering … , 2018 2018 Citations: 10
Area constraint propagation in high level synthesis R Nane, VM Sima, K Bertels 2012 International Conference on Field-Programmable Technology, 247-252 , 2012 2012 Citations: 10