Keynote
Monday, 14:00 CET - 14:45 CET
Session chair: Michael Hutter, PQShield, Vienna, Austria
The Impact of Logic Synthesis and Technology Mapping on Logic Locking Security
Lilas Alrahis, NYU Abu Dhabi
The time zone for all times mentioned on the DATE website is CET – Central European Time (UTC+1). AoE = Anywhere on Earth.
The detailed programme of DATE 2024 will be updated continuously.
Date: Monday, 25 March 2024
Time: 08:30 CET - 09:00 CET
Location / Room: Auditorium 1
Session chair:
Andy Pimentel, University of Amsterdam, NL
Session co-chair:
Valeria Bertacco, University of Michigan, US
Time | Label | Presentation Title, Authors & Abstract |
---|---|---|
08:30 CET | OC.1 | WELCOME ADDRESSES Presenter: Andy Pimentel, University of Amsterdam, NL Authors: Andy Pimentel1 and Valeria Bertacco2 1University of Amsterdam, NL; 2University of Michigan, US Abstract Welcome Addresses from the Chairs |
08:45 CET | OC.2 | PRESENTATION OF AWARDS Presenter: Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Authors: Jürgen Teich1, David Atienza2, Georges Gielen3 and Yervant Zorian4 1Friedrich-Alexander-Universität Erlangen-Nürnberg, DE; 2EPFL, CH; 3KU Leuven, BE; 4Synopsys, US Abstract Presentation of Awards |
Date: Monday, 25 March 2024
Time: 09:00 CET - 09:45 CET
Location / Room: Auditorium 1
Session chair:
Andy Pimentel, University of Amsterdam, NL
Session co-chair:
Valeria Bertacco, University of Michigan, US
Time | Label | Presentation Title, Authors & Abstract |
---|---|---|
09:00 CET | OK01.1 | CHIPLET STANDARDS: A NEW ROUTE TO ARM-BASED CUSTOM SILICON Presenter: Robert Dimond, ARM, GB Author: Robert Dimond, ARM, GB Abstract A key challenge our partners are consistently looking to solve is: How can we continue to push performance boundaries, with maximum efficiency, while managing costs associated with manufacturing and yield? Today, as the ever more complex AI-accelerated computing landscape evolves, a key solution emerging is chiplets. Chiplets are designed to be combined to create larger and more complex systems that can be packaged and sold as a single solution, made of a number of smaller dice instead of one single larger monolithic die. This creates interesting new design possibilities, with one of the most exciting being a potential route to custom silicon for manufacturers who historically chose off-the-shelf solutions. This talk will describe two complementary approaches to realising this chiplet opportunity: · Decomposing an existing system across multiple chiplets, in the same way a monolithic chip is composed of IP blocks. · Aggregating well-defined peripherals across a motherboard into a single package. Both of these approaches require collaboration in standards to align on the many non-differentiating choices in chiplet partitioning. This talk will describe the standards framework that Arm is building with our partners, and the broader industry. Including, own specifications such as the Arm Chiplet System Architecture (Arm CSA), AMBA chip-to-chip and the role of industry standards such as UCIe. |
Date: Monday, 25 March 2024
Time: 09:45 CET - 10:30 CET
Location / Room: Auditorium 1
Session chair:
Andy Pimentel, University of Amsterdam, NL
Session co-chair:
Valeria Bertacco, University of Michigan, US
Time | Label | Presentation Title, Authors & Abstract |
---|---|---|
09:45 CET | OK02.1 | ENLIGHTEN YOUR DESIGNS WITH PHOTONIC INTEGRATED CIRCUITS Presenter: Luc Augustin, SMART Photonics, NL Author: Luc Augustin, SMART Photonics, NL Abstract The field of integrated photonics holds great promise for overcoming societal challenges in data and telecom, autonomous driving and healthcare in terms of cost, performance, and scalability. Similar to the semiconductor industry, the ever-increasing demands of various applications are driving the necessity for platform integration in photonics as well, enabling seamless integration of diverse functionalities into compact and efficient photonic devices. This high level of integration reduces footprint and drives down system level costs. In this trend towards high levels of integration , Indium Phosphide (InP) is the material of choice for long-distance communication lasers, owing to its proven track record over several decades. Leveraging standardized fabrication processes, the cost and performance targets can be addressed. The key advantages of InP-based integration lie in its ability to fully integrate lasers, amplifiers, modulators, and passives, providing a flexible and reliable platform for building complex Photonic Integrated Circuits (PICs). This paper will address the photonic integration platforms, the applicability to current and future markets requiring the need for further heterogenous integration with other technologies, and the change to a foundry business model, much like its electronics counterpart. |
Date: Monday, 25 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Break-Out Room S8
Time | Label | Presentation Title, Authors & Abstract |
---|---|---|
11:00 CET | ASD01.1 | CONTEXT-AWARE MULTI-MODEL OBJECT DETECTION FOR DIVERSELY HETEROGENEOUS COMPUTE SYSTEMS Speaker: Justin Davis, Colorado School of Mines, US Authors: Justin Davis and Mehmet Belviranli, Colorado School of Mines, US Abstract In recent years, deep neural networks (DNNs) have gained widespread adoption for continuous mobile object detection (OD) tasks, particularly in autonomous systems. However, a prevalent issue in their deployment is the one-size-fits-all approach, where a single DNN is used, resulting in inefficient utilization of computational resources. This inefficiency is particularly detrimental in energy-constrained systems, as it degrades overall system efficiency. We identify that, the contextual information embedded in the input data stream (e.g., the frames in the camera feed that the OD models are run on) could be exploited to allow a more efficient multi-model-based OD process. In this paper, we propose SHIFT which continuously selects from a variety of DNN-based OD models depending on the dynamically changing contextual information and computational constraints. During this selection, SHIFT uniquely considers multi-accelerator execution to better optimize the energy-efficiency while satisfying the latency constraints. Our proposed methodology results in improvements of up to 7.5x in energy usage and 2.8x in latency compared to state-of-the-art GPU-based single model OD approaches./proceedings-archive/2024/DATA/6004_pdf_upload.pdf |
11:30 CET | ASD01.2 | ADAPTIVE LOCALIZATION FOR AUTONOMOUS RACING VEHICLES WITH RESOURCE-CONSTRAINED EMBEDDED PLATFORMS Speaker: Gianluca Brilli, Università di Modena e Reggio Emilia, IT Authors: Federico Gavioli1, Gianluca Brilli2, Paolo Burgio1 and Davide Bertozzi3 1Università di Modena e Reggio Emilia, IT; 2Unimore, IT; 3University of Ferrara, IT Abstract Modern autonomous vehicles have to cope with the consolidation of multiple critical software modules processing huge amounts of real-time data on power- and resource-constrained embedded MPSoCs. In such a highly-congested and dynamic scenario, it is extremely complex to ensure that all components meet their quality-of-service requirements (e.g., sensor frequencies, accuracy, responsiveness, reliability) under all possible working conditions and within tight power budgets. One promising solution consists of taking advantage of complementary resource usage patterns of software components by implementing dynamic resource provisioning. A key enabler of this paradigm consists of augmenting applications with dynamic reconfiguration capability, thus adaptively modulating quality-of-service based on resource availability or proactively demanding resources based just on the complexity of the input at hand. The goal of this paper is to explore the feasibility of such a dynamic model of computation for the critical localization function of self-driving vehicles, so that it can burden on system resources just for what is needed at any point in time or gracefully degrade accuracy in case of resource shortage. We validate our approach in a harsh scenario, by implementing it in the localization module of an autonomous racing vehicle. Experiments show an overall reduction of platform utilization and power consumption for this computation-greedy software module by up to 1.6× and 1.5×, respectively, for roughly the same quality of service./proceedings-archive/2024/DATA/6009_pdf_upload.pdf |
12:00 CET | ASD01.3 | ADAPTIVE DEEP LEARNING FOR EFFICIENT VISUAL POSE ESTIMATION ABOARD ULTRA-LOW-POWER NANO-DRONES Speaker: Beatrice Alessandra Motetti, Politecnico di Torino, IT Authors: Beatrice Alessandra Motetti1, Luca Crupi2, Omer Mohammed Mustafa1, Matteo Risso1, Daniele Jahier Pagliari1, Daniele Palossi3 and Alessio Burrello4 1Politecnico di Torino, IT; 2Dalle Molle Institute for Artificial Intelligence, USI and SUPSI, N/; 3ETH - Zurich, CH; 4Politecnico di Torino | Università di Bologna, IT Abstract Sub-10cm diameter nano-drones are gaining momentum thanks to their applicability in scenarios prevented to bigger flying drones, such as in narrow environments and close to humans. However, their tiny form factor also brings their major drawback: ultra-constrained memory and processors for the onboard execution of their perception pipelines. Therefore, lightweight deep learning-based approaches are becoming increasingly popular, stressing how computational efficiency and energy-saving are paramount as they can make the difference between a fully working closed-loop system and a failing one. In this work, to maximize the exploitation of the ultra-limited resources aboard nano-drones, we present a novel adaptive deep learning-based mechanism for the efficient execution of a vision-based human pose estimation task. We leverage two State-of-the-Art (SoA) convolutional neural networks (CNNs) with different regression performance vs. computational costs trade-offs. By combining these CNNs with three novel adaptation strategies based on the output's temporal consistency and on auxiliary tasks to swap the CNN being executed proactively, we present six different systems. On a real-world dataset and the actual nano-drone hardware, our best-performing system, compared to executing only the bigger and most accurate SoA model, shows 28% latency reduction while keeping the same mean absolute error (MAE), 3% MAE reduction while being iso-latency, and the absolute peak performance, i.e., 6% better than SoA model./proceedings-archive/2024/DATA/6014_pdf_upload.pdf |
Date: Monday, 25 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Break-Out Room S1+2
Session chair:
Elena-Ioana Vatajelu, Grenoble-INP, FR
Session co-chair:
Marie-Minerve Louerat, Sorbonne Université - LIP6, FR
Time | Label | Presentation Title, Authors & Abstract |
---|---|---|
11:00 CET | BPA02.1 | SCALABLE SEQUENTIAL OPTIMIZATION UNDER OBSERVABILITY DON'T CARES Speaker: Dewmini Marakkalage, EPFL, CH Authors: Dewmini Marakkalage1, Eleonora Testa2, Walter Lau Neto3, Alan Mishchenko4, Giovanni De Micheli1 and Luca Amaru2 1EPFL, CH; 2Synopsys, US; 3University of Utah, US; 4University of California, Berkeley, US Abstract Sequential logic synthesis can provide better Power-Performance-Area (PPA) than combinational logic synthesis since it explores a larger solution space. As the gate cost in advanced technologies keeps rising, sequential logic synthesis provides a powerful alternative that is gaining momentum in the EDA community. In this work, we present a new scalable algorithm for don't-care-based sequential logic synthesis. Our new approach is based on sequential k-step induction and can apply both redundancy removal and resubstitution transformations under Sequential Observability Don't Cares (SODCs). Using SODC-based optimizations with induction is a challenging problem due to dependencies and alignment of don't cares among the base case and the inductive case. We propose a new approach utilizing the full power of SODCs without limiting the solution space. Our algorithm is implemented as part of an industrial tool and achieves a 6.9% average area improvement after technology mapping when compared to state-of-the-art sequential synthesis methods. Moreover, all the new sequential optimizations can be verified using state-of-the-art sequential verification tools./proceedings-archive/2024/DATA/935_pdf_upload.pdf |
11:20 CET | BPA02.2 | VACSEM: VERIFYING AVERAGE ERRORS IN APPROXIMATE CIRCUITS USING SIMULATION-ENHANCED MODEL COUNTING Speaker: Chang Meng, EPFL, CH Authors: Chang Meng1, Hanyu Wang2, Yuqi Mai3, Weikang Qian3 and Giovanni De Micheli1 1EPFL, CH; 2ETH Zurich, CH; 3Shanghai Jiao Tong University, CN Abstract Approximate computing is an effective computing paradigm to reduce area, delay, and power for error-tolerant applications. Average error is a widely-used metric for approximate circuits, measuring the deviation between the outputs of exact and approximate circuits. This paper proposes VACSEM, a formal method to verify average errors in approximate circuits using simulation-enhanced model counting. VACSEM leverages circuit structure information and logic simulation to speed up verification. Experimental results show that VACSEM is on average 35× faster than the state-of-the-art method./proceedings-archive/2024/DATA/1199_pdf_upload.pdf |
11:40 CET | BPA02.3 | SELCC: ENHANCING MLC RELIABILITY AND ENDURANCE WITH SINGLE-CELL ERROR CORRECTION CODES Speaker: Yujin Lim, Sungkyunkwan University, KR Authors: Yujin Lim, Dongwhee Kim and Jungrae Kim, Sungkyunkwan University, KR Abstract Conventional DRAM's limitations in volatility, high static power consumption, and scalability have led to the exploration of alternative technologies such as Phase Change Memory (PCM) and Resistive RAM (ReRAM). Storage-Class Memory (SCM) arises as a target application for these emerging technologies with non-volatility and higher capacity through Multi-Level Cells (MLCs). However, MLCs face issues of reliability and reduced endurance. To address this, our paper introduces a novel Error Correction Codes (ECC) method, "Single Eight-Level Cell Correcting" (SELCC) ECC. This technique efficiently corrects single-cell errors in 8-level cell memories using existing ECC syndromes without added redundancy. SELCC enhances memory reliability and improves 8LC memory endurance by 3.2 times, surpassing previous solutions without significant overheads./proceedings-archive/2024/DATA/645_pdf_upload.pdf |
12:00 CET | BPA02.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
Date: Monday, 25 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Break-Out Room S3+4
This tutorial provides an overview of the emerging topic of On-Device Learning (ODL) for ultra-low-power extreme-edge devices, such as microcontroller units (MCUs).
Nowadays, these devices are capable of running Deep Neural Network (DNN) inference tasks to extract high-level information from data captured by on-board sensors. The DNN algorithms are typically trained off-device on high-performance servers and then frozen and deployed on resource-constrained MCU-powered platforms. However, the data used for DNN training may not be representative of the deployment environment, causing mispredictions/misclassifications that eventually lead to (i) expensive model redesigns and (ii) re-deployments at scale. Recently proposed Continual Learning methods stand out as potential solutions to this fundamental issue, enabling DNN model personalization by incorporating new knowledge (e.g., new classes, new domains, or both) from the data retrieved in the target environment. However, the DNN learning task has normally been considered out of scope for highly resource-constrained devices because of its high memory and computation requirements, limiting its application to server machines where, thanks to the potentially unlimited resources, custom DNN models can be retrained from scratch as soon as new data becomes available.
This tutorial focuses on the application of the Continual Learning (CL) paradigm on MCU devices, to enable small sensor nodes to adapt their DNN models in the field without relying on external computing resources. After providing a brief taxonomy of the main CL algorithms and scenarios, we will review the fundamental ODL operations, referring to the backpropagation learning algorithm. We will analyze the memory and computational costs of the learning process when targeting a multi-core RISC-V-based MCU derived from the PULP-project template, and we will use a case study to see how these costs constrain an on-device learning application. Next, a hands-on session will allow the audience to familiarize themselves with software optimizations for DNN learning primitives using PULP-TrainLib (https://github.com/pulp-platform/pulp-trainlib), the first software library for DNN training on RISC-V multicore MCU-class devices. Finally, we will conclude by reviewing the main ODL challenges and limitations and describing the latest major results in this field.
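To make the cost argument concrete, the short, self-contained C sketch below estimates the extra memory that backpropagation needs for a single fully-connected layer compared with inference alone (stored input activations, output-error gradients, and a weight-gradient buffer on top of the weights). The layer sizes and the FP32 accounting are hypothetical illustration values, not figures taken from the tutorial or from PULP-TrainLib.

```c
#include <stdio.h>

/* Simplified memory accounting for one fully-connected layer (FP32).
 * Inference needs the weights plus a small activation buffer;
 * backpropagation additionally needs the saved input activations,
 * the output-error gradients, and a weight-gradient buffer.
 * The layer dimensions below are hypothetical example values. */
int main(void) {
    const unsigned long in_features  = 256;  /* hypothetical layer width  */
    const unsigned long out_features = 64;   /* hypothetical layer height */
    const unsigned long bytes_per_el = 4;    /* FP32 */

    unsigned long weights     = in_features * out_features * bytes_per_el;
    unsigned long activations = in_features * bytes_per_el;   /* input saved for the backward pass */
    unsigned long out_grad    = out_features * bytes_per_el;  /* dL/dy */
    unsigned long weight_grad = in_features * out_features * bytes_per_el;

    unsigned long inference_bytes = weights + activations;
    unsigned long training_bytes  = weights + activations + out_grad + weight_grad;

    printf("inference: %lu bytes, training: %lu bytes (%.2fx)\n",
           inference_bytes, training_bytes,
           (double)training_bytes / (double)inference_bytes);
    return 0;
}
```

Even this simplified accounting roughly doubles the footprint of a fully-connected layer in this toy example, which is why the tutorial pairs the cost analysis with aggressive software optimizations.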
This tutorial targets researchers and practitioners interested in new hardware and software solutions for On-Device Learning on low-end devices, such as MCUs. Participants should be familiar with the concepts of Deep Neural Networks (main building blocks, inference vs. training) and basic C programming for microcontrollers.
The hands-on session will demonstrate the open-source PULP-TrainLib software library (https://github.com/pulp-platform/pulp-trainlib), the state-of-the-art software package for MCU-class devices, to provide a concrete embodiment of the ODL concepts and to explain the application of software optimizations to the learning routines, i.e., parallelization, low precision (half-precision floating point), loop unrolling, and vectorization. The speaker will show the main library features using reference code examples. To this end, we will adopt the open-source PULP platform simulator (https://github.com/pulp-platform/gvsoc) and encourage the audience to practice with the PULP-TrainLib ODL framework during the session. We will collect all the materials and installation instructions in a dedicated GitHub repository, which will be made available to the audience before and after the tutorial.
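As a flavour of the optimizations listed above, the following plain-C sketch shows a matrix-vector kernel (the basic building block of fully-connected forward and backward passes) with the inner loop manually unrolled by a factor of four. It is an illustrative example only, with hypothetical sizes and no PULP-specific code; it is not the PULP-TrainLib API, which additionally applies multi-core parallelization, half-precision arithmetic, and vectorization on top of this kind of loop restructuring.

```c
#include <stdio.h>

#define IN_F  256   /* hypothetical input width  */
#define OUT_F 64    /* hypothetical output width */

/* y = W * x with the inner loop manually unrolled by 4.
 * IN_F is assumed to be a multiple of 4 for brevity. */
static void matvec_unrolled(float w[OUT_F][IN_F], const float x[IN_F], float y[OUT_F])
{
    for (int o = 0; o < OUT_F; o++) {
        float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
        for (int i = 0; i < IN_F; i += 4) {
            acc0 += w[o][i]     * x[i];
            acc1 += w[o][i + 1] * x[i + 1];
            acc2 += w[o][i + 2] * x[i + 2];
            acc3 += w[o][i + 3] * x[i + 3];
        }
        /* independent accumulators reduce dependency stalls on in-order cores */
        y[o] = (acc0 + acc1) + (acc2 + acc3);
    }
}

int main(void)
{
    static float w[OUT_F][IN_F], x[IN_F], y[OUT_F];
    for (int i = 0; i < IN_F; i++) x[i] = 1.0f;
    for (int o = 0; o < OUT_F; o++)
        for (int i = 0; i < IN_F; i++) w[o][i] = 0.01f;

    matvec_unrolled(w, x, y);
    printf("y[0] = %f\n", y[0]);  /* expect roughly 256 * 0.01 = 2.56 */
    return 0;
}
```

Keeping four independent accumulators lets the compiler and an in-order core overlap multiply-accumulate operations instead of serializing them on a single dependency chain; the hands-on explores how the same loop structure is further parallelized across cores and vectorized with half-precision floats in PULP-TrainLib.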
Part I: M. Rusci (KUL). On-Device Continual Learning: motivation and intro. (10’)
Part II: C. Cioflan (ETHZ). On-device Adaptation on a multi-core MCU device (25’ + 5’ Q&A)
Part III: D. Nadalini (UNIBO). Hands-on On-Device Learning on MCUs: Pulp-TrainLib. (25’ + 5’ Q&A)
Part IV: M. Rusci (KUL). Challenges and Research Directions for On-Device Continual Learning. (15’ + 5’ Q&A)
Final Q&A
Date: Monday, 25 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Auditorium 2
Session chair:
Krishnendu Chakrabarty, Arizona State University, US
Session co-chair:
Hussam Amrouch, TU Munich (TUM), DE
Organiser:
Partha Pande, Washington State University, US
Time | Label | Presentation Title, Authors & Abstract |
---|---|---|
11:00 CET | FS05.1 | ALGORITHM TO TECHNOLOGY CO-OPTIMIZATION FOR CIM-BASED HYPERDIMENSIONAL COMPUTING Speaker: Mehdi Tahoori, Karlsruhe Institute of Technology, DE Authors: Mahta Mayahinia1, Simon Thomann2, Paul Genssler3, Christopher Münch1, Hussam Amrouch4 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2Chair of AI Processor Design, TU Munich (TUM), DE; 3University of Stuttgart, DE; 4TU Munich (TUM), DE Abstract Hyperdimensional computing (HDC) has been recognized as an efficient machine learning algorithm in recent years. Robustness against noise, and simple computational operations, while being limited by the memory bandwidth, make it a perfect fit for the concept of computation in memory (CiM) with emerging nonvolatile memory (NVM) technologies. For an HDC accelerator based on NVM-CiM, there are different parameters from the algorithm all the way down to the technology that interact with each other and affecting the overall inference accuracy as well as the energy efficiency of the accelerator. Therefore, in this paper, we propose, for the first time, a full-stack co-optimization method and use it to design an HDC accelerator based on NVM-based content addressable memory (CAM). By incorporating the device manufacturing variability and co-optimizing the algorithm and hardware design, HDC inference on our proposed NVM-based CiM accelerator can reduce the energy consumption by 3.27x, while compared to the purely software-based implementation, the inference accuracy loss is merely 0.125%. |
11:30 CET | FS05.2 | ACCELERATING NEURAL NETWORKS FOR LARGE LANGUAGE MODELS AND GRAPH PROCESSING WITH SILICON PHOTONICS Speaker: Sudeep Pasricha, Colorado State University, US Authors: Salma Afifi, Febin Sunny, Mahdi Nikdast and Sudeep Pasricha, Colorado State University, US Abstract In the rapidly evolving landscape of artificial intelligence, large language models (LLMs) and graph processing have emerged as transformative technologies for natural language processing (NLP), computer vision, and graph-structured data applications. However, the complex structures of these models pose challenges for acceleration on conventional electronic platforms. In this paper, we describe novel hardware accelerators based on silicon photonics to accelerate transformer neural networks that are used in LLMs and graph neural networks for graph data processing. Our analysis demonstrates that both hardware accelerators achieve at least 10.2× throughput improvement and 3.8× better energy efficiency over multiple state-of-the-art electronic hardware accelerators designed for LLMs and graph processing./proceedings-archive/2024/DATA/1305_pdf_upload.pdf |
12:00 CET | FS05.3 | DATAFLOW-AWARE PIM-ENABLED MANYCORE ARCHITECTURE FOR DEEP LEARNING WORKLOADS Speaker: Partha Pratim Pande, Washington State University, IN Authors: Harsh Sharma1, Gaurav Narang1, Jana Doppa1, Umit Ogras2 and Partha Pratim Pande1 1Washington State University, US; 2University of Wisconsin Madison, Madison, US Abstract Processing-in-memory (PIM) has emerged as an enabler for the energy-efficient and high-performance acceleration of deep learning (DL) workloads. Resistive random-access memory (ReRAM) is one of the most promising technologies to implement PIM. However, as the complexity of Deep convolutional neural networks (DNNs) grows, we need to design a manycore architecture with multiple ReRAM-based processing elements (PEs) on a single chip. Existing PIM-based architectures mostly focus on computation while ignoring the role of communication. ReRAM-based tiled manycore architectures often involve many Processing Elements (PEs), which need to be interconnected via an efficient on-chip communication infrastructure. Simply allocating more resources (ReRAMs) to speed up only computation is ineffective if the communication infrastructure cannot keep up with it. In this paper, we highlight the design principles of a dataflow-aware PIM-enabled manycore platform tailor-made for various types of DL workloads. We consider the design challenges with both 2.5D interposer- and 3D integration-enabled architectures. |
Date: Monday, 25 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: VIP Room
Speakers: Robert Dimond & Luc Augustin
Date: Monday, 25 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Auditorium 3
Session chair:
Georgios Zervakis, University of Patras, GR
Session co-chair:
Kuan-Hsun Chen, University of Twente, NL
Time | Label | Presentation Title, Authors & Abstract |
---|---|---|
11:00 CET | TS07.1 | UNVEILING THE BLACK-BOX: LEVERAGING EXPLAINABLE AI FOR FPGA DESIGN SPACE OPTIMIZATION Speaker: Jaemin Seo, Pohang University of Science and Technology, KR Authors: Jaemin Seo, Sejin Park and Seokhyeong Kang, Pohang University of Science and Technology, KR Abstract With significant advancements in various design methodologies, modern integrated circuits have experienced noteworthy improvements in power, performance, and area. Among various methodologies, design space optimization (DSO), which automatically explores electronic design automation (EDA) tool parameters for a given design, has been extensively studied in recent years. In this study, we propose an approach to fine-tuning an effective FPGA design space to suit a specific design. By utilizing our ML-based prediction and explainable artificial intelligence (XAI) approach, we quantify parameter contribution scores, which reveal the correlation between each parameter and the final timing results. Using the valuable insights from the parameter contribution scores, we can refine the design space only with effective parameters for subsequent timing optimization. During the optimization, our framework improved the maximum operating frequency by 26% on average in six test designs. To accomplish this, our framework required even 47% fewer FPGA compilations than the baseline, demonstrating its superior capacity for achieving fast convergence./proceedings-archive/2024/DATA/155_pdf_upload.pdf |
11:05 CET | TS07.2 | AN AGILE DEPLOYING APPROACH FOR LARGE-SCALE WORKLOADS ON CGRA-CPU ARCHITECTURE Speaker: Jiahang Lou, State Key Laboratory of ASIC and System, Fudan University, Shanghai, China, CN Authors: Jiahang Lou, Xuchen Gao, Yiqing Mao, Yunhui Qiu, Yihan Hu, Wenbo Yin and Lingli Wang, Fudan University, CN Abstract Adopting specialized accelerators such as Coarse-Grained Reconfigurable Architectures (CGRAs) alongside CPUs to enhance performance within specific domains is an astute choice. However, the integration of heterogeneous architectures introduces complex challenges for compiler design. Simultaneously, the ever-expanding scale of workloads imposes substantial burdens on deployment. To address above challenges, this paper introduces CGRV-OPT, a user-friendly multi-level compiler designed to deploy large-scale workloads to CGRA and RISC-V CPU architecture. Built upon the MLIR framework, CGRV-OPT serves as a pivotal bridge, facilitating the seamless conversion of high-level workload descriptions into low-level intermediate representations (IRs) for different architectures. A salient feature of our approach is the automation of a comprehensive suite of optimizations and transformations, which speed up each kernel computing within the intricate SoC. Additionally, we have seamlessly integrated an automated software-hardware partitioning mechanism, guided by our multi-level optimizations, resulting in a remarkable 2.14x speed up over large-scale workloads. The CGRV-OPT framework significantly alleviates the challenges faced by software developers, including those with limited expertise in hardware architectures./proceedings-archive/2024/DATA/829_pdf_upload.pdf |
11:10 CET | TS07.3 | CUPER: CUSTOMIZED DATAFLOW AND PERCEPTUAL DECODING FOR SPARSE MATRIX-VECTOR MULTIPLICATION ON HBM-EQUIPPED FPGAS Speaker: zhou jin, Super Scientific Software Laboratory, China University of Petroleum-Beijing, China, CN Authors: Enxin Yi1, Yiru Duan1, Yinuo Bai1, Kang Zhao2, Zhou Jin1 and Weifeng Liu1 1China University of Petroleum-Beijing, CN; 2Beijing University Of Posts and Telecommunications, CN Abstract Sparse matrix-vector multiplication (SpMV) is pivotal in many scientific computing and engineering applications. Considering the memory-intensive nature and irregular data access patterns inherent in SpMV, its acceleration is typically bounded by the limited bandwidth. Multiple memory channels of the emerging high bandwidth memory (HBM) provide exceptional bandwidth, offering a great opportunity to boost the performance of SpMV. However, ensuring high bandwidth utilization with low memory access conflicts is still non-trivial. In this paper, we present Cuper, a high-performance SpMV accelerator on HBM-equipped FPGAs. Through customizing the dataflow to be HBM-compatible with the proposed sparse storage format, the bandwidth utilization can be sufficiently enhanced. Furthermore, a two-step reordering algorithm and perceptual decoder-centric hardware architecture are designed to greatly mitigate read-after-write (RAW) conflicts, enhance the vector reusability and on-chip memory utilization. The evaluation of 12 large matrices shows that Cuper's geomean throughput outperforms the four latest SpMV accelerators HiSparse, GraphLily, Sextans, and Serpens, by 3.28×, 1.99×, 1.75×, and 1.44×, respectively. Furthermore, the geomean bandwidth efficiency shows 3.28×, 2.20×, 2.82×, and 1.31× improvements, while the geomean energy efficiency has 3.59×, 2.08×, 2.21×, and 1.44× optimizations, respectively. Cuper also demonstrates 2.51× throughput and 7.97× energy efficiency of improvement over the K80 GPU on 2,757 SuiteSparse matrices./proceedings-archive/2024/DATA/217_pdf_upload.pdf |
11:15 CET | TS07.4 | TOWARDS HIGH-THROUGHPUT NEURAL NETWORK INFERENCE WITH COMPUTATIONAL BRAM ON NONVOLATILE FPGAS Speaker: Hao Zhang, Shandong University, CN Authors: Hao Zhang1, Mengying Zhao2, Huichuan Zheng2, Yuqing Xiong2, Yuhao Zhang3 and Zhaoyan Shen2 1Shandong university, CN; 2Shandong University, CN; 3Tsinghua University, CN Abstract Field-programmable gate arrays (FPGAs) have been widely used in artificial intelligence applications. As the capacity requirements of both computation and memory resources continuously increase, emerging nonvolatile memory has been proposed to replace static random access memory (SRAM) in FPGAs to build nonvolatile FPGAs (NV-FPGAs), which have advantages of high density and near-zero leakage power. Features of emerging nonvolatile memory should be fully explored to improve performance, energy efficiency as well as lifetime of NV-FPGAs. In this paper, we study an intrinsic characteristic of emerging nonvolatile memory, i.e., computing-in-memory, in nonvolatile block random access memory (BRAM) of NV-FPGAs. Specifically, we present a computational BRAM architecture (C-BRAM), and propose a computational density aware operator allocation strategy to fully utilize C-BRAM. Neural network inference is taken as an example to evaluate the proposed architecture and strategy, showing 68% and 62% improvement in computational density compared to traditional SRAM-based FPGA and existing NV-FPGA, respectively./proceedings-archive/2024/DATA/581_pdf_upload.pdf |
11:20 CET | TS07.5 | BITSTREAM FAULT INJECTION ATTACKS ON CRYSTALS KYBER IMPLEMENTATIONS ON FPGAS Speaker: Ziying Ni, Centre for Secure Information Technologies (CSIT), Queen's University Belfast, GB Authors: Ziying Ni1, Ayesha Khalid2, Weiqiang Liu3 and Maire O'Neill2 1Centre for Secure Information Technologies (CSIT), Queen's University Belfast, GB; 2Queen's University Belfast, GB; 3Nanjing University of Aeronautics and Astronautics, CN Abstract CRYSTALS-Kyber is the only Public-key Encryption (PKE)/ Key-encapsulation Mechanism (KEM) scheme that was chosen for standardization by the National Institute of Standards and Technology initiated Post-quantum Cryptography competition (so-called NIST PQC). In this paper, we show the first successful malicious modifications of the bitstream of a Kyber FPGA implementation. We successfully demonstrate 4 different attacks on Kyber hardware implementations on Artix-7 FPGAs that either reduce the complexity of polynomial multiplication operations or enable direct secret key/message recovery by: disabling BRAMs, disabling DSPs, zeroing NTT ROM and tampering with CBD2 results. Two of our attacks are generic in nature and the other two require reverse-engineering or a detailed knowledge of the design. We evaluate the feasibility of the four attacks, among which the zeroing NTT ROM and tampering with the CBD2 result attacks produce higher public key and ciphertext complexity and thus are difficult to detect. Two countermeasures are proposed to prevent the attacks proposed in this paper./proceedings-archive/2024/DATA/608_pdf_upload.pdf |
11:25 CET | TS07.6 | ON-FPGA SPIKING NEURAL NETWORKS FOR INTEGRATED NEAR-SENSOR ECG ANALYSIS Speaker: Matteo Antonio Scrugli, University of Cagliari, IT Authors: Matteo Antonio Scrugli, Paola Busia, Gianluca Leone and Paolo Meloni, Università degli studi di Cagliari, IT Abstract The identification of cardiac arrhythmias is a significant issue in modern healthcare and a major application for Artificial Intelligence (AI) systems based on artificial neural networks. This research introduces a real-time arrhythmia diagnosis system that uses a Spiking Neural Network (SNN) to classify heartbeats into five types of arrhythmias from a single-lead electrocardiogram (ECG) signal. The system is implemented on a custom SNN processor running on a low-power Lattice iCE40-UltraPlus FPGA. It was tested using the MIT-BIH dataset, and achieved accuracy results that are comparable to the most advanced SNN models, reaching 98.4% accuracy. The proposed modules take advantage of the energy efficiency of SNNs to reduce the average execution time to 4.32 ms and energy consumption to 50.98 uJ per classification./proceedings-archive/2024/DATA/950_pdf_upload.pdf |
11:30 CET | TS07.7 | SPECHD: HYPERDIMENSIONAL COMPUTING FRAMEWORK FOR FPGA-BASED MASS SPECTROMETRY CLUSTERING Speaker: Sumukh Pinge, University of California, San Diego, US Authors: Sumukh Pinge1, Weihong Xu1, Jaeyoung Kang1, Tianqi Zhang1, Niema Moshiri1, Wout Bittremieux2 and Tajana Rosing1 1University of California, San Diego, US; 2University of Antwerp, BE Abstract Mass spectrometry-based proteomics is a key enabler for personalized healthcare, providing a deep dive into the complex protein compositions of biological systems. This technology has vast applications in biotechnology and biomedicine but faces significant computational bottlenecks. Current methodologies often require multiple hours or even days to process extensive datasets, particularly in the domain of spectral clustering. To tackle these inefficiencies, we introduce Spec-HD, a hyperdimensional computing framework supplemented by an FPGA-accelerated architecture with integrated near-storage preprocessing. Utilizing streamlined binary operations in a hyperdimensional computational environment, Spec-HD capitalizes on the low-latency and parallel capabilities of FPGAs. This approach markedly improves clustering speed and efficiency, serving as a catalyst for real-time, high-throughput data analysis in future healthcare applications. Our evaluations demonstrate that Spec-HD not only maintains but often surpasses existing clustering quality metrics while drastically cutting computational time. Specifically, it can cluster a large-scale human proteome dataset—comprising 25 million MS/MS spectra and 131 GB of MS data—in mere 5 minutes. With energy efficiency exceeding 31× and a speedup factor that spans a range of 6× to 54× over existing state-of-the-art solutions, Spec-HD emerges as a promising solution for the rapid analysis of mass spectrometry data with great implications for personalized healthcare./proceedings-archive/2024/DATA/490_pdf_upload.pdf |
11:35 CET | TS07.8 | MEMORY SCRAPING ATTACK ON XILINX FPGAS: PRIVATE DATA EXTRACTION FROM TERMINATED PROCESSES Speaker: Sandip Kundu, University of Massachusetts Amherst, US Authors: Bharadwaj Madabhushi, Sandip Kundu and Daniel Holcomb, University of Massachusetts Amherst, US Abstract FPGA-based hardware accelerators are becoming increasingly popular due to their versatility, customizability, energy efficiency, constant latency, and scalability. FPGAs can be tailored to specific algorithms, enabling efficient hardware implementations that effectively leverage algorithm parallelism. This can lead to significant performance improvements over CPUs and GPUs, particularly for highly parallel applications. For example, a recent study found that Stratix 10 FPGAs can achieve up to 90% of the performance of a TitanX Pascal GPU while consuming less than 50% of the power. This makes FPGAs an attractive choice for accelerating machine learning (ML) workloads. However, our research finds privacy and security vulnerabilities in existing Xilinx FPGA-based hardware acceleration solutions. These vulnerabilities arise from the lack of memory initialization and insufficient process isolation, which creates potential avenues for unauthorized access to private data used by processes. To illustrate this issue, we conducted experiments using a Xilinx ZCU104 board running the PetaLinux tool from Xilinx. We found that PetaLinux does not effectively clear memory locations associated with a terminated process, leaving them vulnerable to memory scraping attack (MSA). This paper makes two main contributions. The first contribution is an attack methodology of using the Xilinx debugger from a different user space. We find that we are able to access process IDs, virtual address spaces, and pagemaps of one user from a different user space because of lack of adequate process isolation. The second contribution is a methodology for characterizing terminated processes and accessing their private data. We illustrate this on Xilinx ML application library. These vulnerabilities were reported to Xilinx and confirmed by them./proceedings-archive/2024/DATA/845_pdf_upload.pdf |
11:40 CET | TS07.9 | HIGH-EFFICIENCY FPGA-BASED APPROXIMATE MULTIPLIERS WITH LUT SHARING AND CARRY SWITCHING Speaker: Qilin Zhou, Yunnan University, CN Authors: Yi GUO1, Qilin ZHOU1, Xiu CHEN1 and Heming SUN2 1Yunnan University, CN; 2Yokohama National University, JP Abstract Approximate multiplier saves energy and improves hardware performance for error-tolerant computation-intensive applications. This work proposes hardware-efficient FPGA-based approximate multipliers with look-up table (LUT) sharing and carry switching. Sharing two LUTs with the same inputs enables to fully utilize the available LUT resources. To mitigate the accuracy loss incurred from this approach, the truncated carry is partially reserved by switching it to the adjacent calculation. In addition, we create a library of 8×8 approximate multipliers to provide various multiplication choices. The proposed design can provide enhancements of up to 38.75% in power, 17.29% in latency, and 28.17% in area compared to the Xilinx exact multiplier. Our proposed designs are open-source at https://github.com/YnuGuoLab/DATE_FPGA_Approx_Mul and assist in further reproducing and development./proceedings-archive/2024/DATA/514_pdf_upload.pdf |
11:41 CET | TS07.10 | OTFGENCODER-HDC: HARDWARE-EFFICIENT ENCODING TECHNIQUES FOR HYPERDIMENSIONAL COMPUTING Speaker: Mahboobe Sadeghipourrudsari, Technology Factory Karlsruhe GmbH, DE Authors: Mahboobe Sadeghipourrudsari, Jonas Krautter and Mehdi Tahoori, Karlsruhe Institute of Technology, DE Abstract Hyper-Dimensional Computing (HDC), a brain-inspired computing paradigm for cognitive tasks, is especially suited for resource-constrained edge devices due to its hardware-efficient and fault-resistant inference. However, existing HDC approaches require large amounts of memory, resulting in high power consumption, limiting their use in edge devices. We offer a hardware-aware encoding where computation parameters in hardware implementations can be reproduced on-the-fly through low-overhead cyclic digital circuits, significantly reducing memory utilization and subsequently power consumption./proceedings-archive/2024/DATA/423_pdf_upload.pdf |
11:42 CET | TS07.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
Date: Monday, 25 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
Amit Singh, University of Essex, GB
Session co-chair:
Nicola Bombieri, Università di Verona, IT
Time | Label | Presentation Title, Authors & Abstract |
---|---|---|
11:00 CET | TS17.1 | OISA: ARCHITECTING AN OPTICAL IN-SENSOR ACCELERATOR FOR EFFICIENT VISUAL COMPUTING Speaker: Shaahin Angizi, New Jersey Institute of Technology, US Authors: Mehrdad Morsali1, Sepehr Tabrizchi2, Deniz Najafi3, Mohsen Imani4, Mahdi Nikdast5, Arman Roohi6 and Shaahin Angizi3 1New jersey Institute of Technology, US; 2University of Nebraska–Lincoln, US; 3New Jersey Institute of Technology, US; 4University of California, Irvine, US; 5Colorado State University, US; 6University of Nebraska–Lincoln (UNL), US Abstract Targeting vision applications at the edge, in this work, we systematically explore and propose a high-performance and energy-efficient Optical In-Sensor Accelerator architecture called OISA for the first time. Taking advantage of the promising efficiency of photonic devices, the OISA intrinsically implements a coarse-grained convolution operation on the input frames in an innovative minimum-conversion fashion in low-bit-width neural networks. Such a design remarkably reduces the power consumption of data conversion, transmission, and processing in the conventional cloud-centric architecture as well as recently-presented edge accelerators. Our device-to-architecture simulation results on various image data-sets demonstrate acceptable accuracy while OISA achieves 6.68 TOp/s/W efficiency. OISA reduces power consumption by a factor of 7.9 and 18.4 on average compared with existing electronic in-/near-sensor and ASIC accelerators./proceedings-archive/2024/DATA/38_pdf_upload.pdf |
11:05 CET | TS17.2 | PATHDRIVER-WASH: A PATH-DRIVEN WASH OPTIMIZATION METHOD FOR CONTINUOUS-FLOW LAB-ON-A-CHIP SYSTEMS Speaker: Jiaxuan Wang, Northwestern Polytechnical University, CN Authors: Xing Huang1, Jiaxuan Wang1, Zhiwen Yu1, Bin Guo1, Tsung-Yi Ho2, Ulf Schlichtmann3 and Krishnendu Chakrabarty4 1Northwestern Polytechnical University, CN; 2The Chinese University of Hong Kong, HK; 3TU Munich, DE; 4Arizona State University, US Abstract Rapid advances in microfluidics technologies have facilitated the emergence of highly integrated lab-on-a-chip (LoC) biochip systems. With such coin-sized biochips, complicated bioassay procedures can be executed efficiently without any human intervention. To ensure the correctness and precision of assay outcomes, however, cross-contamination among different fluid samples/reagents needs to be dealt with separately during assay execution. As a consequence, wash operations have to be introduced and a wash path network needs to be established on the chip to remove the residues left in flow channels. To realize optimized assay procedures with efficient wash operations, we propose PathDriver-Wash in this paper, a path-driven wash optimization method for continuous-flow LoC biochip systems. The proposed method includes the following three key techniques: 1) The necessity of contamination removals is analyzed systemically to avoid unnecessary wash operations, 2) wash operations are integrated with the regular removal of excess fluids, so that extra path occupations caused by wash can be minimized, and 3) optimized wash paths and time windows are computed and assigned to wash operations, so that the completion time of assays can be minimized. Experimental results demonstrate that the proposed method leads to highly efficient wash procedures as well as minimized assay completion times./proceedings-archive/2024/DATA/256_pdf_upload.pdf |
11:10 CET | TS17.3 | MULTI-AGENT REINFORCEMENT LEARNING FOR THERMALLY-RESTRICTED PERFORMANCE OPTIMIZATION ON MANYCORES Speaker: Heba Khdr, Karlsruhe Institute of Technology, DE Authors: Heba Khdr1, Mustafa Batur1, Kanran Zhou1, Mohammed Bakr Sikal2 and Joerg Henkel1 1Karlsruhe Institute of Technology, DE; 2Chair for Embedded Systems, Karlsruhe Institute of Technology, DE Abstract The problem of performance maximization under a thermal constraint has been tackled by means of dynamic voltage and frequency scaling (DVFS) in many system-level optimization techniques. State-of-the-art ones have exploited Supervised Learning (SL) to develop models that predict power and performance characteristics of applications and temperature of the cores. Such predictions enable proactive and efficient optimization decisions that exploit performance potentials under a temperature constraint. SL-based models are built at design time based on training data generated considering specific environment settings, i.e., processor architecture, cooling system, ambient temperature, etc. Hence, these models cannot adapt at runtime to different environment settings. In contrast, Reinforcement Learning (RL) employs an agent that explores and learns the environment at runtime, and hence can adapt to its potential changes. Nonetheless, using an RL agent to perform optimization on manycores is challenging because of the inherent large state/action spaces that might hinder the agent's ability to converge. To get the advantages of RL while tackling this challenge, we employ for the first time multi-agent RL to perform thermally-restricted performance optimization for manycores through DVFS. We investigated two RL algorithms—Table-based Q-Learning (TQL) and Deep QLearning (DQL)—and demonstrated that the latter outperforms the former. Compared to the state of the art, our DQL delivers a significant performance improvement of 34.96% on average, while also guaranteeing thermally-safe operation on the manycore. Our evaluation reveals the runtime adaptability of our DQL to varying workloads and ambient temperatures./proceedings-archive/2024/DATA/1149_pdf_upload.pdf |
11:15 CET | TS17.4 | OPLIXNET: TOWARDS AREA-EFFICIENT OPTICAL SPLIT-COMPLEX NETWORKS WITH REAL-TO-COMPLEX DATA ASSIGNMENT AND KNOWLEDGE DISTILLATION Speaker: Ruidi Qiu, TU Munich, DE Authors: Ruidi Qiu1, Amro Eldebiky1, Grace Li Zhang2, Xunzhao Yin3, Cheng Zhuo3, Ulf Schlichtmann1 and Bing Li1 1TU Munich, DE; 2TU Darmstadt, DE; 3Zhejiang University, CN Abstract Having the potential for high computing speed, high throughput, and low energy cost, optical neural networks (ONNs) have emerged as a promising candidate for accelerating deep learning tasks. In conventional ONNs, light amplitudes are modulated at the input and detected at the output. However, the light phases are still ignored in conventional structures, although they can also carry information for computing. To address this issue, in this paper, we propose a framework called OplixNet to compress the areas of ONNs by modulating input image data into the amplitudes and phases of light signals. The input and output parts of the ONNs are redesigned to make full use of both amplitude and phase information. Moreover, mutual learning across different ONN structures is introduced to maintain the accuracy. Experimental results demonstrate that the proposed framework significantly reduces the areas of ONN with the accuracy within an acceptable range. For instance, 75.03% area is reduced with a 0.33% accuracy decrease on fully connected neural network (FCNN) and 74.88% area is reduced with a 2.38% accuracy decrease on ResNet-32./proceedings-archive/2024/DATA/500_pdf_upload.pdf |
11:20 CET | TS17.5 | DESIGN AUTOMATION FOR ORGANS-ON-CHIP Speaker: Maria Emmerich, TU Munich, DE Authors: Maria Emmerich1, Philipp Ebner2 and Robert Wille1 1TU Munich, DE; 2Johannes Kepler University Linz, AT Abstract Organs-on-Chips (OoCs) are testing platforms for the pharmaceutical, cosmetic, and chemical industries. They are composed of miniaturized organ tissues (so-called organ modules) that are connected via a microfluidic channel network and, by this, emulate human or other animal physiology on a miniaturized chip. The design of those chips, however, requires a sophisticated orchestration of numerous aspects, such as the size of organ modules, the required shear stress on membranes, the dimensions and geometry of channels, pump pressures, etc. Mastering all this constitutes a non-trivial design task for which, unfortunately, no automatic support exists yet. In this work, we propose a first design automation solution for OoCs. To this end, we review the respective design steps and formalize a corresponding design specification from it. Based on that, we then propose an automatic method which generates a design of the desired device. Evaluations (inspired by real-world use cases and confirmed by CFD simulations) demonstrate the applicability and validity of the proposed approach./proceedings-archive/2024/DATA/513_pdf_upload.pdf |
11:25 CET | TS17.6 | TRACE-ENABLED TIMING MODEL SYNTHESIS FOR ROS2-BASED AUTONOMOUS APPLICATIONS Speaker: Hazem Abaza, Huawei Technologies, DE Authors: Hazem Abaza1, Debayan Roy2, Shiqing Fan3, Selma Saidi4 and Antonios Motakis2 1Huawei Dresden Research Center - TU Dortmund, DE; 2Huawei Dresden Research Center, DE; 3Huawei Munich Research Center, DE; 4TU Dortmund, DE Abstract Autonomous applications are typically developed over Robot Operating System 2.0 (ROS2) even in time-critical systems like automotive. Recent years have seen increased interest in developing model-based timing analysis and schedule optimization approaches for ROS2-based applications. To complement these approaches, we propose a tracing and measurement framework to obtain timing models of ROS2-based applications. It offers a tracer based on extended Berkeley Packet Filter that probes different functions in ROS2 middleware and reads their arguments or return values to reason about the data flow in applications. It combines event traces from ROS2 and the operating system to generate a directed acyclic graph showing ROS2 callbacks, precedence relations between them, and their timing attributes. While being compatible with existing analyses, we also show how to model (i) message synchronization, e.g., in sensor fusion, and (ii) service requests from multiple clients, e.g., in motion planning. Considering that, in real-world scenarios, the application code might be confidential and formal models are unavailable, our framework still enables the application of existing analysis and optimization techniques. We demonstrate our framework's capabilities by synthesizing the timing model of a real-world benchmark implementing LIDAR-based localization in Autoware's Autonomous Valet Parking./proceedings-archive/2024/DATA/360_pdf_upload.pdf |
11:30 CET | TS17.7 | SLET FOR DISTRIBUTED AEROSPACE LANDING SYSTEM Speaker: Guillaume Phavorin, ASTERIOS Technologies, FR Authors: Damien Chabrol1, Guillaume Phavorin1 and Eric Jenn2 1Krono-Safe, FR; 2IRT Saint Exupéry, FR Abstract The aerospace industry is moving towards digital systems that are both more condensed (merging criticality-heterogenous functions on a same equipment) and more distributed (for robustness, availability, actuators/sensors closeness) while software-defined. This makes integration activities highly critical for next-generation systems, due to the interaction complexity between the software components and their deployment on the hardware platform, combined with outdated development processes with regard to the multicore transition. Therefore, predictability, testability, and ultimately strong determinism are crucial high-level properties needed not only at equipment level but at the whole system scope, which cannot be tackled without changes in the design process. This paper deals with an innovative solution, based on the sLET paradigm, to bring drastic integration time reduction whatever the underlying architecture (multicore, multi-node). Already proved worthy for multicore platforms, sLET deployment is applied to an aerospace landing system over a distributed system architecture./proceedings-archive/2024/DATA/1151_pdf_upload.pdf |
11:35 CET | TS17.8 | A MAPPING OF TRIANGULAR BLOCK INTERLEAVERS TO DRAM FOR OPTICAL SATELLITE COMMUNICATION Speaker: Lukas Steiner, University of Kaiserslautern-Landau, DE Authors: Lukas Steiner1, Timo Lehnigk-Emden2, Markus Fehrenz2 and Norbert Wehn3 1University of Kaiserslautern-Landau, DE; 2Creonic GmbH, DE; 3TU Kaiserslautern, DE Abstract Communication in optical downlinks of low earth orbit (LEO) satellites requires interleaving to enable reliable data transmission. These interleavers are orders of magnitude larger than conventional interleavers utilized for example in wireless communication. Hence, the capacity of on-chip memories (SRAMs) is insufficient to store all symbols and external memories (DRAMs) must be used. Due to the overall requirement for very high data rates beyond 100 Gbit/s, DRAM bandwidth then quickly becomes a critical bottleneck of the communication system. In this paper, we investigate triangular block interleavers for the aforementioned application and show that the standard mapping of symbols used for SRAMs results in low bandwidth utilization for DRAMs, in some cases below 50 %. As a solution, we present a novel mapping approach that combines different optimizations and achieves over 90 % bandwidth utilization in all tested configurations. Further, the mapping can be applied to any JEDEC-compliant DRAM device./proceedings-archive/2024/DATA/1192_pdf_upload.pdf |
11:40 CET | TS17.9 | OC-DLRM: MINIMIZING THE I/O TRAFFIC OF DLRM BETWEEN MAIN MEMORY AND OCSSD Speaker: Tseng-Yi Chen, National Central University, TW Authors: Shang-Hung Ti1, Tseng-Yi Chen1, Tsung Tai Yeh2, Shuo-Han Chen2 and Yu-Pei Liang3 1National Central University, TW; 2National Yang Ming Chiao Tung University, TW; 3National Chung Cheng University, TW Abstract Due to the exponential growth of data in computing, DRAM-based main memory is now insufficient for data-intensive applications like machine learning and recommendation systems. This has led to a performance issue involving data transfer between main memory and storage devices. Conventional NAND-based SSDs are unable to efficiently handle this problem as they can't distinguish between data types from the host system. In contrast, open-channel SSDs (OCSSD) offer a solution by optimizing data placement from the host-side system. This research focuses on developing a new data access model for deep learning recommendation systems (DLRM) using OCSSD storage drives, called OC-DLRM. OC-DLRM reduces I/O traffic to flash memory by aggregating frequently-accessed data using the I/O unit of a flash memory drive. Our experiments show that OC-DLRM has significant performance improvement compared with traditional swapping space management techniques./proceedings-archive/2024/DATA/208_pdf_upload.pdf |
11:41 CET | TS17.10 | SEAL: SENSING EFFICIENT ACTIVE LEARNING ON WEARABLES THROUGH CONTEXT-AWARENESS Speaker: Anil Kanduri, University of Turku, FI Authors: Hamidreza Alikhani Koshkak1, Ziyu Wang1, Anil Kanduri2, Pasi Liljeberg2, Amir M. Rahmani1 and Nikil Dutt1 1University of California, Irvine, US; 2University of Turku, FI Abstract In this paper, we introduce SEAL, a co-optimization framework designed to enhance both sensing and querying strategies in wearable devices for mHealth applications. Employing Reinforcement Learning (RL), SEAL strategically utilizes user contextual information and the machine learning model's confidence levels to make efficient decisions. This innovative approach is particularly significant in addressing the challenge of battery drain due to continuous physiological signal sensing, such as Photoplethysmography (PPG). Our framework demonstrates its effectiveness in a stress monitoring application, achieving a substantial reduction of 76% in the volume of PPG signals collected, while only experiencing a minor 6% decrease in user-labeled data quality. This balance showcases SEAL's potential in optimizing data collection in a way that is considerate of both device constraints and data integrity./proceedings-archive/2024/DATA/630_pdf_upload.pdf |
11:42 CET | TS17.11 | DYNAMIC PER-FLOW QUEUES FOR TSN SWITCHES Speaker: Wenxue Wu, Lanzhou University, CN Authors: Wenxue Wu1, Zhen Li1, Tong Zhang2, Xiaoqin Feng1, Liwei Zhang1, Xuelong Qi1 and Fengyuan Ren3 1Lanzhou University, CN; 2Nanjing University of Aeronautics and Astronautics, CN; 3Lanzhou University | Tsinghua University, CN Abstract Dynamic Per-Flow Queues (DFQ) extend queues from per-class to per-flow in Time-Sensitive Networking (TSN) switches that overcome large resource consumption by dynamically mapping a fixed number of physical queues to active flows. It can implement per-flow queuing with much less on-chip resource. Compared to brute-force hardware queues, DFQ prototyped on an FPGA, can effectively manage more per-flow queues, allowing for improved priority scheduling with minimal throughput and latency impact./proceedings-archive/2024/DATA/220_pdf_upload.pdf |
11:43 CET | TS17.12 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Multi-Purpose Room M1B+D
Session chair:
Ricardo Chaves, INESC-ID, IST/ULisboa, PT
Session co-chair:
Francesco Regazzoni, UvA & ALaRI, CH
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | TS22.1 | FLUSH+EARLYRELOAD: COVERT CHANNELS ATTACK ON SHARED LLC USING MSHR MERGING Speaker: Prathamesh Tanksale, Dept. of CSE, Indian Institute of Technology Ropar, Punjab, India, IN Authors: Aditya Gangwar1, Prathamesh Tanksale1, Shirshendu Das2 and Sudeepta Mishra1 1Department of CSE Indian Institute of Technology Ropar, Punjab, India, IN; 2Department of CSE Indian Institute of Technology Hyderabad, Telangana, India, IN Abstract Modern multiprocessors include multiple cores, all of which share a large Last Level Cache (LLC). Because of the shared nature of the LLC, different timing channel attacks can be built by exploiting LLC behaviors. Covert Channel and Side Channel are two well-known attacks used to steal sensitive information from a secure application. While several countermeasures have already been proposed to prevent these attacks, the possibility of discovering new variants cannot be overlooked. In this paper, we propose a covert channel attack designed to circumvent the state-of-the-art countermeasure for preventing the Flush+Reload attack. Experimental results indicate that the proposed attack renders the current state-of-the-art countermeasure futile and ineffective./proceedings-archive/2024/DATA/751_pdf_upload.pdf |
11:05 CET | TS22.2 | PRIME+RESET: INTRODUCING A NOVEL CROSS-WORLD COVERT-CHANNEL THROUGH COMPREHENSIVE SECURITY ANALYSIS ON ARM TRUSTZONE Speaker: Arash Pashrashid, National University of Singapore (NUS), SG Authors: Yun Chen1, Arash Pashrashid1, Yongzheng Wu2 and Trevor E. Carlson1 1National University of Singapore, SG; 2Huawei Singapore, SG Abstract ARM TrustZone, a robust security technique, thwarts a wide range of threats by partitioning the system-on-chip hardware and software into two distinct worlds, namely the normal world and the secure world. However, it still remains susceptible to malicious attacks, including side-channel and covert-channel vulnerabilities. Previous efforts to leak data from TrustZone focused on cache-based and performance monitoring unit (PMU)-based channels; in this paper, we, however, propose a security analysis benchmark suite by traversing the hardware components involved in the microarchitecture to study their security impact on the secure world. Our investigation unveils an undisclosed leakage source stemming from the L2 prefetcher. We design a new cross-core and cross-world covert-channel attack based on our reverse engineering of the L2 prefetcher, named Prime+Reset. Compared to most cross-world covert-channel attacks, Prime+Reset is a cache- and PMU-agnostic attack that effectively bypasses many existing defenses. The throughput of Prime+Reset can achieve 776Kib/s, which demonstrates a significant improvement, 70x, over the state-of-the-art, while maintaining a similar error rate (<2%). One can find the code at https://github.com/yunchen-juuuump/prime-reset./proceedings-archive/2024/DATA/1056_pdf_upload.pdf |
11:10 CET | TS22.3 | STATISTICAL PROFILING OF MICRO-ARCHITECTURAL TRACES AND MACHINE LEARNING FOR SPECTRE DETECTION: A SYSTEMATIC EVALUATION Speaker: Mai AL-Zu'bi, TU Wien, AT Authors: Mai AL-Zu'bi and Georg Weissenbacher, TU Wien, AT Abstract Security vulnerabilities like Spectre exploit features of modern processors to leak sensitive data through speculative execution and shared resources (such as caches). A popular approach to detect such attacks deploys Machine Learning (ML) to identify suspicious micro-architectural patterns. These techniques, however, are often rather ad-hoc in terms of the selection of micro-architectural features as well as ML techniques, and frequently lack a description of the underlying training and test data. To address these shortcomings, we systematically evaluate a large range of (combinations of) micro-architectural features recorded in up to 40 Hardware Performance Counters (HPCs) and multiple ML algorithms on a comprehensive set of well-documented scenarios and datasets. Using statistical methods, we rank the HPCs used to generate our dataset, which helps us determine the minimum number of features required for detecting Spectre attacks with high accuracy and minimal overhead. Furthermore, we identify the best-performing ML classifiers, and provide a comprehensive description of our data collection, running scenarios, selected HPCs, and chosen classification models./proceedings-archive/2024/DATA/154_pdf_upload.pdf (an illustrative code sketch related to this entry follows this session's table) |
11:15 CET | TS22.4 | THREE SIDEKICKS TO SUPPORT SPECTRE COUNTERMEASURES Speaker: Markus Krausz, Ruhr University Bochum, DE Authors: Markus Krausz1, Jan Thoma1, Florian Stolz1, Marc Fyrbiak2 and Tim Güneysu1 1Ruhr University Bochum, DE; 2emproof GmbH, DE Abstract The Spectre attack revealed a critical security threat posed by speculative execution and since then numerous related attacks have been discovered and exploited to leak secrets across process boundaries. As the primary cause of the attack is deeply rooted in the microarchitectural processor design, mitigating speculative execution attacks with minimal impact on performance is far from straightforward. For example, various countermeasures have been proposed to limit speculative execution for certain instruction patterns, however, resulting in severe performance overheads. In this paper, we propose a set of code transformations to reduce the number of speculatively executed instructions and therefore significantly reduce the performance overhead of various countermeasures. We evaluate our code transformations combined with a hardware-based countermeasure in gem5. Our results demonstrate that our code transformations speed up the secure system by up to 16.6%./proceedings-archive/2024/DATA/53_pdf_upload.pdf |
11:20 CET | TS22.5 | DETECTING BACKDOOR ATTACKS IN BLACK-BOX NEURAL NETWORKS THROUGH HARDWARE PERFORMANCE COUNTERS Speaker: Manaar Alam, New York University Abu Dhabi, AE Authors: Manaar Alam1, Yue Wang2 and Michail Maniatakos1 1New York University Abu Dhabi, AE; 2New York University, US Abstract Deep Neural Networks (DNNs) have made significant strides, but their susceptibility to backdoor attacks still remains a concern. Most defenses typically assume access to white-box models or poisoned data, requirements that are often not feasible in practice, especially for proprietary DNNs. Existing defenses in a black-box setting usually rely on confidence scores of DNN's predictions. However, this exposes DNNs to the risk of model stealing attacks, a significant concern for proprietary DNNs. In this paper, we introduce a novel strategy for detecting backdoors, focusing on a more realistic black-box scenario where only hard-label (i.e., without any prediction confidence) query access is available. Our strategy utilizes data flow dynamics in a computational environment during DNN inference to identify potential backdoor inputs and is agnostic of trigger types or their locations in the input. We observe that a clean image and its corresponding backdoor counterpart with a trigger induce distinct patterns across various microarchitectural activities during the inference phase. We exploit these variations captured by Hardware Performance Counters (HPCs) and use principles of the Gaussian Mixture Model to detect backdoor inputs. To the best of our knowledge, this is the first work that utilizes HPCs for detecting backdoors in DNNs. Extensive evaluation considering a range of benchmark datasets, DNN architectures, and trigger patterns shows the efficacy of the proposed method in distinguishing between clean and backdoor inputs using HPCs./proceedings-archive/2024/DATA/1142_pdf_upload.pdf (an illustrative code sketch related to this entry follows this session's table) |
11:25 CET | TS22.6 | CAN MACHINE LEARN PIPELINE LEAKAGE? Speaker: Parisa Amiri Eliasi, Radboud University, NL Authors: Omid Bazangani1, Parisa Amiri Eliasi2, Stjepan Picek2 and Lejla Batina3 1Digital Security Group, Radboud University, NL; 2Radboud University, NL; 3Radboud University Nijmegen, NL Abstract Side-channel attacks pose a significant threat to security implementations in embedded devices. Accordingly, an automated framework simulating side-channel behaviours can offer invaluable insights into leakage origins and characteristics, helping developers improve those devices during the design phase. While there has been a substantial effort towards crafting leakage simulators, earlier methods either necessitated significant manual work for reverse engineering the micro-architectural layer or depended on Deep Learning (DL) models where the neural network's complexity increased considerably with the addition of pipeline stages. This paper presents a novel modelling approach using Recurrent Neural Networks (RNNs) to construct instruction-level power models that exhibit enhanced performance in detecting pipeline leakage. Our findings indicate that with memory-based machine learning models, it becomes unnecessary to input data accounting for the pipeline effect. This strategy reduces feature dimensionality by at least one-third for a three-stage pipeline, albeit at a modest compromise in model performance. This reduced feature set underscores our model's scalability, making it a preferred choice for analyzing microprocessors with extended pipeline stages. Importantly, our methodology accelerates the micro-architectural profiling phase in side-channel simulator design. When evaluated on an expansive dataset, the performance of our memory-based model closely matches that of the Multilayer Perceptron (MLP) with an R2 value of 0.79. On a reduced dataset (removing the pipeline effect), our model achieves an R2 value of 0.65, outperforming the MLP, which reaches an R2 value of 0.39. Moreover, our model is designed with scalability in mind, making it suitable for profiling microcontrollers with advanced pipeline stages. For the practical realisation of our approach, we employed the open-source ABBY-CM0 dataset from the ARM Cortex-M0 microcontroller, which has three pipeline stages. To provide a detailed analysis, we also consider a Convolutional Neural Network (CNN) besides two RNN architectures (Long Short-Term Memory and Gated Recurrent Unit)./proceedings-archive/2024/DATA/1231_pdf_upload.pdf |
11:30 CET | TS22.7 | A DEEP-LEARNING TECHNIQUE TO LOCATE CRYPTOGRAPHIC OPERATIONS IN SIDE-CHANNEL TRACES Speaker: Davide Galli, Politecnico di Milano, IT Authors: Giuseppe Chiari, Davide Galli, Francesco Lattari, Matteo Matteucci and Davide Zoni, Politecnico di Milano, IT Abstract Side-channel attacks allow extracting secret information from the execution of cryptographic primitives by correlating the partially known computed data and the measured side-channel signal. However, to set up a successful side-channel attack, the attacker has to perform i) the challenging task of locating the time instant in which the target cryptographic primitive is executed inside a side-channel trace and then ii) the time-alignment of the measured data on that time instant. This paper presents a novel deep-learning technique to locate the time instant in which the target computed cryptographic operations are executed in the side-channel trace. In contrast to state-of-the-art solutions, the proposed methodology works even in the presence of trace deformations obtained through random delay insertion techniques. We validated our proposal through a successful attack against a variety of unprotected and protected cryptographic primitives that have been executed on an FPGA-implemented system-on-chip featuring a RISC-V CPU./proceedings-archive/2024/DATA/1221_pdf_upload.pdf |
11:35 CET | TS22.8 | IMCE: AN IN-MEMORY COMPUTING AND ENCRYPTING HARDWARE ARCHITECTURE FOR ROBUST EDGE SECURITY Speaker: Hanyong Shao, Peking University, CN Authors: Hanyong Shao, Boyi Fu, Jinghao Yang, Wenpu Luo, Chang Su, Zhiyuan Fu, Kechao Tang and Ru Huang, Peking University, CN Abstract Edge devices deployed in unsupervised scenarios employ Physical Unclonable Functions (PUFs) for identity authentication and embedded XOR encoding for data encryption. However, on the one hand, the existing strong PUFs such as CMOS-based XOR Arbiter PUFs and NVM-based RRAM PUFs are vulnerable to various machine learning (ML) modeling attacks. On the other hand, the transmission of keys for embedded XOR encoding also faces the risk of being eavesdropped in unsecured channels. In response to these challenges, this paper proposes a high-security In-Memory Computing and Encrypting (IMCE) hardware architecture based on a FeFET macro, featuring both a PUF mode for identity authentication and an encrypted CIM mode with in-situ decryption. The PUF mode ensures a prediction accuracy close to 50% (equivalent to random guessing attack) under various ML models due to the proposed Hamming distance comparison used in challenge-response pairs (CRPs) generation. In addition, by utilizing the CRPs generated in PUF mode as encryption keys, the CIM mode of IMCE achieves robust security through public-key cryptography via CRPs-masked key transfer, preventing the leakage of keys and data. Therefore, by applying a novel CRPs generation scheme and reusing the generated CRPs for in-situ CIM decryption, the security of both PUF and encrypted CIM mode is enhanced concurrently. In addition, IMCE significantly reduces the power overhead thanks to the high energy efficiency of ferroelectric FETs (FeFETs), making it highly suitable for secure applications in edge computing devices./proceedings-archive/2024/DATA/105_pdf_upload.pdf |
11:40 CET | TS22.9 | DEMONSTRATING POST-QUANTUM REMOTE ATTESTATION FOR RISC-V DEVICES Speaker: Maximilian Barger, Vrije Universiteit Amsterdam and University of Amsterdam, NL Authors: Maximilian Barger1, Marco Brohet2 and Francesco Regazzoni3 1Vrije Universiteit Amsterdam and University of Amsterdam, NL; 2University of Amsterdam, NL; 3University of Amsterdam and Università della Svizzera italiana, CH Abstract The rapid proliferation of Internet of Things (IoT) devices has revolutionized many aspects of modern computing. Experience has shown that these devices often have severe security problems and are common targets for malware. One approach to ensure that only trusted software is executed on these devices is Remote Attestation (RA), which allows a verifier to attest the integrity of software running on such a prover device. As malware is typically not trusted, an infected device will fail to generate a valid signature, which allows the verifier to detect the presence of malware on the prover. To achieve its security guarantees, RA requires a trust anchor, often found in the form of dedicated hardware on the prover. For IoT and embedded devices such hardware has only recently become largely deployed. Current RA protocols rely on classical asymmetric signatures that are vulnerable to quantum attacks, which are expected to become feasible in the near future. In this work we present SPRAV, a software-based RA system that leverages the Physical Memory Protection (PMP) primitive of RISC-V to achieve its security guarantees and employs quantum-safe cryptographic algorithms to ensure resistance against quantum attacks in the future. Our evaluation shows that it is feasible to deploy this solution on RISC-V devices without incurring a prohibitive overhead or the need for additional hardware, paving the way towards quantum-resistant functionalities also in IoT./proceedings-archive/2024/DATA/338_pdf_upload.pdf |
11:41 CET | TS22.10 | CIRCUMVENTING RESTRICTIONS IN COMMERCIAL HIGH-LEVEL SYNTHESIS TOOLS Speaker: Benjamin Carrion Schaefer, University of Texas at Dallas, US Authors: Benjamin Carrion Schaefer and Chaitali Sathe, University of Texas at Dallas, US Abstract Many Software (SW) vendors limit the functionality of their product based on the version purchased. This trend has also carried over to Electronic Design Automation (EDA). For example, Field-Programmable Gate Array (FPGA) vendors make their Lite versions freely available to anyone, but charge for their full versions, e.g., Intel Quartus Prime Lite vs. Quartus Prime. Some High-Level Synthesis (HLS) tool vendors have started to do the same in order to appeal more to FPGA users, who are more price conscious than ASIC users. FPGA tools are typically free or very inexpensive and hence it makes sense to have dedicated FPGA versions of their HLS tools. To enable this strategy, some HLS vendors have put in place different control mechanisms to prevent anyone from using their inexpensive FPGA version to target ASICs, as this would defeat their price discrimination strategy. In this work, we review different strategies used by the HLS vendors and propose, to the best of our knowledge, the first technique to circumvent these. In particular, we show how we can generate ASIC circuits with similar area and performance using the Lite HLS version, which only allows targeting small FPGAs, as compared to using the full ASIC HLS version./proceedings-archive/2024/DATA/621_pdf_upload.pdf |
11:42 CET | TS22.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
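For TS22.3 above, the core flow—collecting hardware performance counter (HPC) readings, training a classifier to separate benign from Spectre-like behaviour, and ranking the most informative counters—can be illustrated with a small, generic sketch. The counter names, the random-forest choice, and the synthetic data below are assumptions made for illustration only; they are not the paper's actual features, datasets, or models.

```python
# Illustrative sketch only: classify HPC feature vectors as benign vs. Spectre-like and
# rank counters by importance. Counter names and data are invented for this example.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
counters = ["branch_misses", "llc_load_misses", "llc_loads", "instructions", "cycles"]

# Synthetic stand-in for profiled monitoring windows: rows = windows, cols = HPC readings.
benign = rng.normal([5, 20, 200, 1e4, 2.0e4], [2, 5, 30, 1e3, 2e3], size=(500, 5))
attack = rng.normal([9, 80, 260, 1e4, 2.4e4], [2, 10, 30, 1e3, 2e3], size=(500, 5))
X = np.vstack([benign, attack])
y = np.array([0] * 500 + [1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# One possible way to rank counters: the classifier's own feature importances.
for name, score in sorted(zip(counters, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```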
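TS22.5 above flags backdoored inputs by modelling HPC measurements of clean inference runs with a Gaussian Mixture Model and treating unlikely measurements as suspicious. A minimal, hedged sketch of that idea follows; the counters, the percentile threshold rule, and the synthetic data are placeholders, not the authors' setup.

```python
# Illustrative sketch only: model HPC vectors from known-clean inputs with a GMM and flag
# inference runs whose measurements are unlikely under that model. Counters, threshold
# rule, and data are assumptions, not the paper's setup.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# HPC vectors measured while running inference on known-clean inputs (synthetic here).
clean_hpc = rng.normal([100.0, 40.0, 5.0], [10.0, 5.0, 1.0], size=(1000, 3))

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=1).fit(clean_hpc)

# Pick a threshold from the clean distribution, e.g. the 1st percentile of log-likelihood.
threshold = np.percentile(gmm.score_samples(clean_hpc), 1)

def looks_backdoored(hpc_vector: np.ndarray) -> bool:
    """True if the measured HPC vector is unlikely under the clean-input model."""
    return bool(gmm.score_samples(hpc_vector.reshape(1, -1))[0] < threshold)

suspicious = rng.normal([140.0, 70.0, 9.0], [10.0, 5.0, 1.0])
print(looks_backdoored(suspicious), looks_backdoored(clean_hpc[0]))
```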
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 13:15 CET - 14:00 CET
Location / Room: Auditorium 2
Session chair:
Ian O’Connor, Ecole Centrale de Lyon, FR
Session co-chair:
Luis Miguel Silveira, TU Lisbon, PT
Supported by
Time | Label | Presentation Title Authors |
---|---|---|
13:15 CET | LK01.1 | AI MODELS FOR EDGE COMPUTING: HARDWARE-AWARE OPTIMIZATIONS FOR EFFICIENCY Speaker and Author: Hai (Helen) Li, Duke University, US Abstract As artificial intelligence (AI) transforms various industries, state-of-the-art models have exploded in size and capability. The growth in AI model complexity is rapidly outstripping hardware evolution, making the deployment of these models on edge devices remain challenging. To enable advanced AI locally, models must be optimized for fitting into the hardware constraints. In this presentation, we will first discuss how computing hardware designs impact the effectiveness of commonly used AI model optimizations for efficiency, including techniques like quantization and pruning. Additionally, we will present several methods, such as hardware-aware quantization and structured pruning, to demonstrate the significance of software/hardware co-design. We will also demonstrate how these methods can be understood via a straightforward theoretical framework, facilitating their seamless integration in practical applications and their straightforward extension to distributed edge computing. At the conclusion of our presentation, we will share our insights and vision for achieving efficient and robust AI at the edge./proceedings-archive/2024/DATA/1292_pdf_upload.pdf |
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Break-Out Room S8
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | ASD02.1 | BACK TO THE FUTURE: REVERSIBLE RUNTIME NEURAL NETWORK PRUNING FOR SAFE AUTONOMOUS SYSTEMS Speaker: Danny Abraham, University of California, Irvine, US Authors: Danny Abraham1, Biswadip Maity1, Bryan Donyanavard2 and Nikil Dutt1 1University of California, Irvine, US; 2San Diego State University, US Abstract Neural network pruning has emerged as a technique to reduce the size of networks at the cost of accuracy to enable deployment in resource-constrained systems. However, low-accuracy pruned models may compromise the safety of real-time autonomous systems when encountering unpredictable scenarios, e.g., due to anomalous or emergent behavior. We propose Back to the Future: a novel approach that combines pruning with dynamic routing to achieve both latency gains and dynamic reconfiguration to meet desired accuracy at runtime. Our approach enables the pruned model to quickly revert to the full model when unsafe behavior is detected, enhancing safety and reliability. Experimental results demonstrate that our swapping approach is 32x faster than loading the original model from disk, providing seamless reversion to the accurate version of the model, demonstrating its applicability for safe autonomous systems design./proceedings-archive/2024/DATA/6031_pdf_upload.pdf (an illustrative code sketch related to this entry follows this session's table) |
14:30 CET | ASD02.2 | AUTOMATED TRAFFIC SCENARIO DESCRIPTION EXTRACTION USING VIDEO TRANSFORMERS Speaker: Aron Harder, University of Virginia, US Authors: Aron Harder and Madhur Behl, University of Virginia, US Abstract Scenario Description Languages (SDLs) serve as high-level encodings, offering an interpretable representation of traffic situations encountered by autonomous vehicles (AVs). Their utility extends to critical safety analyses, such as identifying analogous traffic scenarios within vast AV datasets, and aiding in real-to-simulation transfers. This paper addresses the challenging task of autonomously deriving SDL embeddings from AV data. We introduce the Scenario2Vector method, leveraging video transformers to automatically detect spatio-temporal actions of the ego AV through front-camera video footage. Our methodology draws upon the Berkeley Deep Drive - eXplanations (BDD-X) dataset. To determine ground truth actions of the ego AV, we employ BERT combined with dependency grammar-based trees, utilizing the resulting labels for Scenario2Vector training. Our approach is benchmarked against a 3D convolution (C3D)-based method and a transfer-learned video transformer (ViViT) model, evaluating both extraction accuracy and scenario retrieval capabilities. The results reveal that Scenario2Vector is highly effective in detecting ego vehicle actions from video input, adeptly handling traffic scenarios with multiple ego vehicle maneuvers./proceedings-archive/2024/DATA/6018_pdf_upload.pdf |
15:00 CET | ASD02.3 | ADASSURE: DEBUGGING METHODOLOGY FOR AUTONOMOUS DRIVING CONTROL ALGORITHMS Speaker: Andrew Roberts, FinEst Center for Smart Cities, Tallinn University of Technology, EE Authors: Andrew Roberts1, Mohammad Reza Heidari Iman1, Mauro Bellone2, Tara Ghasempouri3, Olaf Maennel4, Jaan Raik1, Mohammad Hamad5 and Sebastian Steinhorst5 1Tallinn University of Technology, EE; 2FinEst Centre for Smart Cities, EE; 3Department of Computer System, Tallinn University of Technology, Estonia, EE; 4University of Adelaide, AU; 5TU Munich, DE Abstract Autonomous driving (AD) system designers need methods to efficiently debug vulnerabilities found in control algorithms. Existing methods lack alignment to the requirements of AD control designers to provide an analysis of the parameters of the AD system and how they are affected by cyber-attacks. We introduce ADAssure, a methodology for debugging AD control system algorithms that incorporates automated mechanisms which support generation of assertions to guide the AD system designer to identify vulnerabilities in the system. Our evaluation of ADAssure on a real-world AD vehicular system, using diverse cyber-attacks, developed a set of assertions that identified weaknesses in the OpenPlanner 2.5 AD planning algorithm and its constituent planning functions. Working with an AD control system designer and a safety validation engineer, we used the results of ADAssure to identify remediations for the AD control system, which can support the implementation of a redundant observer for data integrity checking and improvements to the planning algorithm. The adoption of ADAssure improves autonomous system design by providing a systematic approach to enhance safety and reliability through the identification and mitigation of vulnerabilities from corner cases./proceedings-archive/2024/DATA/6013_pdf_upload.pdf |
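The central mechanism of ASD02.1 above—keeping a pruned model for fast inference while being able to revert instantly to the full, accurate model when unsafe behaviour is detected—can be sketched in a few lines of PyTorch. The models, the anomaly signal, and the dispatch policy below are placeholders; the paper's dynamic-routing and pruning details are not reproduced.

```python
# Illustrative sketch only: keep a fast pruned model and the full model resident in memory
# and route each input to the full model whenever an anomaly flag is raised. The models,
# the pruning itself, and the anomaly signal are placeholders, not the paper's design.
import torch
import torch.nn as nn

full_model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
pruned_model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

def infer(x: torch.Tensor, anomaly_detected: bool) -> torch.Tensor:
    # Reverting to the accurate model is a dispatch decision, not a load from disk.
    model = full_model if anomaly_detected else pruned_model
    with torch.no_grad():
        return model(x)

x = torch.randn(1, 64)
print(infer(x, anomaly_detected=False).shape)  # fast path under normal conditions
print(infer(x, anomaly_detected=True).shape)   # accurate path when behaviour looks unsafe
```

Because both models stay resident in memory, reverting is only a dispatch decision rather than a reload from storage, which is consistent with the abstract's comparison against loading the original model from disk.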
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Break-Out Room S1+2
Session chair:
Aida Todri Sanial, TUE, NL
Session co-chair:
Rajendra Bishnoi, TU Delft, NL
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | BPA03.1 | DYNAMIC REALIZATION OF MULTIPLE CONTROL TOFFOLI GATE Speaker: Abhoy Kole, DFKI, DE Authors: Abhoy Kole1, Arighna Deb2, Kamalika Datta3 and Rolf Drechsler4 1DFKI, DE; 2Kalinga Institute of Industrial Technology, IN; 3University of Bremen, DE; 4University of Bremen | DFKI, DE Abstract Dynamic Quantum Circuits (DQC) is an inevitable solution for today's Noisy Intermediate Scale Quantum (NISQ) systems. This enables realization of an n-qubit (where, n > 2) quantum circuit using only 2-qubits with the aid of additional non-unitary operations which is evident from the recent dynamic realizations of algorithms like Quantum Phase Estimation (QPE) and Bernstein–Vazirani (BV) as well as 3-qubit Toffoli operation. In this work, we introduce two different dynamic realization schemes for Multiple Control Toffoli (MCT) gates, for the first time to the best of our knowledge. We compare the respective realizations in terms of resources (e.g., gate, depth and nearest neighbor overhead) and computational accuracy. For this purpose, we apply the proposed dynamic MCT gates in Deutsch-Jozsa (DJ) algorithm, thereby realizing the traditional DJ algorithm as DQCs. Experimental evaluations show that one dynamic scheme for MCT gates leads to DQCs with better computational accuracy, while the other one results in DQCs with better computational resources./proceedings-archive/2024/DATA/805_pdf_upload.pdf |
14:20 CET | BPA03.2 | A FEFET-BASED TIME-DOMAIN ASSOCIATIVE MEMORY FOR MULTI-BIT SIMILARITY COMPUTATION Speaker: Xunzhao Yin, Zhejiang University, CN Authors: Qingrong Huang1, Hamza Errahmouni Barkam2, Zeyu Yang1, Jianyi Yang1, Thomas Kämpfe3, Kai Ni4, Grace Li Zhang5, Bing Li6, Ulf Schlichtmann6, Mohsen Imani2, Cheng Zhuo1 and Xunzhao Yin1 1Zhejiang University, CN; 2University of California, Irvine, US; 3Fraunhofer IPMS, DE; 4University of Notre Dame, US; 5TU Darmstadt, DE; 6TU Munich, DE Abstract The exponential growth of data across various domains of human society necessitates rapid and efficient data processing. In many contemporary data-intensive applications, similarity computation (SC) is one of the most fundamental and indispensable operations. In recent years, in-memory computing (IMC) architectures have been designed to accelerate SC by reducing data movement costs; however, they encounter challenges with signal domain conversion, variation sensitivity, and limited precision. This paper proposes a ferroelectric FET (FeFET) based time-domain (TD) associative memory (AM) for energy-efficient SC. Such a TD design can convert its output (i.e., time interval) to digits with relatively simple sensing circuitry, thus saving a large amount of area and energy compared with conventional IMC designs that process analog voltage/current signals. The variable-capacitance (VC) delay chain structure in our design supports quantitative SC and enhances robustness against variations. Furthermore, by exploiting multi-domain ferroelectric FETs (FeFETs), our design is capable of performing SC on vectors with multi-bit elements, enabling support for higher-precision algorithms. Simulation results show that the proposed TD-AM achieves 13.8×/1.47× energy savings compared to CMOS/NVM-based TD-IMC designs. Additionally, our design exhibits good robustness in Monte Carlo simulations with variations extracted from experimental measurements. Investigation of the precision of hyperdimensional computing (HDC) shows that higher element precision reduces the size of the HDC model needed to achieve the same accuracy, indicating improved efficiency. Benchmarking against GPUs demonstrates roughly two/three orders of magnitude speedup/energy-efficiency improvement for our design. Our proposed multi-bit TD-AM promises energy-efficient quantitative SC for diverse data-intensive processing applications, especially in energy-constrained scenarios./proceedings-archive/2024/DATA/883_pdf_upload.pdf |
14:40 CET | BPA03.3 | RVCE-FAL: A RISC-V VECTOR-SCALAR CUSTOM EXTENSION FOR FASTER FALCON DIGITAL SIGNATURE Speaker: Xinglong Yu, State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai, CN Authors: Xinglong Yu, Yi Sun, Yifan Zhao, Honglin Kuang and Jun Han, Fudan University, CN Abstract The National Institute of Standards and Technology (NIST) has selected FALCON as one of the standardized digital signature algorithms against quantum attacks in 2022. Compared with the other post-quantum cryptography (PQC) schemes, lattice-based FALCON is more appropriate for future Internet of Things (IoT) applications due to the fastest signature verification process and the lowest transmission overhead. In this paper, we propose a custom extension based on the RISC-V scalar-vector framework for efficient implementation of FALCON. To our best knowledge, this work is the first hardware-software co-design for complete FALCON signature generation and verification routines. Besides, we design the first FALCON Gaussian sampling hardware and a RISC-V vector extension (RVV) based domain-specific core. The proposed architecture accelerates kernel operations in FALCON, such as discrete Gaussian sampling, number theoretic transform (NTT), inverse NTT, and polynomial operations. Compared with the reference implementation, results on the gem5-RTL simulation platform present a speedup for signature generation and verification of up to 18× and 6.9×./proceedings-archive/2024/DATA/551_pdf_upload.pdf |
15:00 CET | BPA03.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 14:00 CET - 14:45 CET
Location / Room: Auditorium 3
Session chair:
Marina Saryan, Synopsys, AM
This is a Young People Programme event. During the Company Presentation Session, presenters will introduce their companies, ongoing activities, and work culture. Presenting companies include Cadence Design Systems, Synopsys, Racyics, Semidynamics, Arm, ams OSRAM and Graf Research
Time | Label | Presentation Title Authors |
---|---|---|
14:05 CET | CF01.1.1 | INTRODUCING SEMIDYNAMICS Presenter: Pedro Marcuello, Semidynamics, ES Author: Pedro Marcuello, Semidynamics, ES Abstract Introducing Semidynamics |
14:11 CET | CF01.1.2 | INTRODUCING SYNOPSYS Presenter: Asya Mkhitaryan, Synopsys, AM Author: Asya Mkhitaryan, Synopsys, AM Abstract Introducing Synopsys |
14:17 CET | CF01.1.3 | INTRODUCING ARM Presenter: Nick Brill, Arm, GB Author: Nick Brill, Arm, GB Abstract Introducing ARM |
14:23 CET | CF01.1.4 | INTRODUCING RACYICS Presenter: Florian Bilstein, RacyICs, DE Author: Florian Bilstein, RacyICs, DE Abstract Introducing RACYICS |
14:29 CET | CF01.1.5 | INTRODUCING AMS OSRAM Presenter: Rafael Serrano-Gotarredona, ams OSRAM, ES Author: Rafael Serrano-Gotarredona, ams OSRAM, ES Abstract Introducing AMS OSRAM |
14:34 CET | CF01.1.6 | INTRODUCING GRAF RESEARCH Presenter: Jonathan Graf, Graf Research, US Author: Jonathan Graf, Graf Research, US Abstract Introducing GRAF RESEARCH |
14:39 CET | CF01.1.7 | INTRODUCING CADENCE DESIGN SYSTEMS Presenter: Lizzy Kiely, Cadence Design Systems, US Author: Lizzy Kiely, Cadence Design Systems, US Abstract Introducing CADENCE DESIGN SYSTEMS |
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Break-Out Room S5
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | ES.1 | MAKECHIP - AN ACCESSIBLE, COST-EFFECTIVE, AND CLOUD-BASED CHIP DESIGN PLATFORM FOR STARTUPS AND ACADEMIA Presenter: Patrick Doell, Racyics GmbH, DE Author: Patrick Doell, Racyics GmbH, DE Abstract The presentation is about makeChip, a cloud-based design platform developed by Racyics. The platform targets start-ups, SMEs, research institutes and universities and is a central gateway to design advanced integrated circuits. Young companies and research groups don't have to invest upfront in costly IT infrastructure and have direct access to EDA tools, IPs, PDKs and silicon-proven design flows. The platform provides reliable IT infrastructure with a full set of EDA tool installations and technology data setup, i.e. PDKs, foundation IP, complex IP. All tools and design data are linked by Racyics' silicon-proven design flow and project management system. The turnkey environment enables any makeChip customer to realize complex Systems on Chip in the most advanced technology nodes. Racyics supports makeChip customers with on-demand design services, such as digital layout generation, tape-out sign-off execution and many more. After giving an introduction to the concept and structure of makeChip, we outline the unique benefits for academia, start-ups, and SMEs. Afterwards, we go into detail about the technical aspects of the infrastructure itself: Resource and Data Management, Tool Wrapper, etc. Furthermore, a successful research project which was realized on makeChip is presented. At the end, additional tools such as a Rapid Adoption Kit for rapid chip development are shown. |
14:45 CET | ES.2 | AUTOMATING SECURITY VERIFICATION ON FIRMWARE BINARIES: THE FAULT INJECTION EXAMPLE Presenter: Lionel Rivière, eShard, FR Authors: Lionel Rivière and Guillaume Vilcocq, eShard, FR Abstract In the context of the vulnerability analysis, automating the validation of the target resistance against physical attacks is a must. Checking the efficiency of side-channel and fault injection countermeasures implemented by the designer requires systematic analysis of the attack paths built on the leakage and fault model of the target. With this intent, dynamic binary analysis (DBA) based on emulation can efficiently support automation thanks to a validation strategy both top-down and parallel. In this talk, we will focus on emulating fault injection on a secure bootloader. |
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Auditorium 2
Session chair:
Olivier Sentieys, INRIA, FR
Organisers:
Catherine Le Lan, Synopsys, FR
Olivier Sentieys, INRIA, FR
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | FS10.1 | WRITING SOFTWARE TO ELABORATE HARDWARE (SPINALHDL) Presenter: Charles Papon, Independent, FR Author: Charles Papon, Independent, FR Abstract This talk will provide an overview of how software approaches can be used to capture hardware specifications and generate their corresponding Verilog/VHDL "netlist". The main motivation for such an approach is to go beyond what SystemVerilog and VHDL tooling supports in terms of hardware elaboration: increasing developer productivity and building hardware-generation abstractions while keeping full control over the generated hardware. Another motivation is to be able to use an extendable tool, meaning not having to restrict yourself to baked-in features, but instead being able to extend the scope of the tool, avoiding having to switch between different languages, and so avoiding a flow fracture or mismatch. This talk will be based around the recent developments made on VexiiRiscv (a RISC-V multi-issue processor softcore), including its hardware pipelining API and scheduler generation, which are based on Scala (a general-purpose programming language) and SpinalHDL (a Scala library providing a hardware description API). |
14:30 CET | FS10.2 | VERIHDL: ENFORCING CORRECTNESS IN LLM-GENERATED HDL Presenter: Valerio Tenace, PrimisAI, US Authors: Valerio Tenace1 and Pierre-Emmanuel Gaillardon2 1PrimisAI, US; 2University of Utah, US Abstract Over the past few months, the rapid proliferation of Large Language Models (LLMs) for Hardware Description Language (HDL) generation has attracted considerable attention. These technologies, by automating complex design tasks, aim to streamline workflows and hasten the innovation process. However, the intrinsic probabilistic nature of LLMs introduces a degree of uncertainty, often manifesting as errors or hallucinations in the generated HDL code—a matter of great concern in the realm of hardware design where accuracy is non-negotiable. To tackle this issue, we present VeriHDL, a novel LLM-based framework tailored to bolster the accuracy and dependability of HDL code produced by generative artificial intelligence. Central to VeriHDL is a harmonious blend of sophisticated and interconnected LLMs that enable: (1) a systematic HDL generation process that leverages a dedicated knowledge base of pre-defined and verified IPs as an initial safeguard against most common mistakes, (2) an automated and iterative testbench generation mechanism, to enforce functional correctness, and (3) a seamless interface with Electronic Design Automation tools, as to enable an automatic verification process with no-human-in-the-loop. This innovative and original approach, currently in its early beta stage within PrimisAI's RapidGPT platform, is not only meant to improve the accuracy of generated code, but also to significantly reduce the time and resources required for verification. As a result, VeriHDL represents a substantial paradigm shift in hardware design, facilitating a more efficient, reliable, and streamlined development process. In this presentation, we will delve into the foundational principles of VeriHDL, starting with an in-depth analysis of its architecture. We will then explore the dynamic interplay among its components, illustrating how they collectively contribute to the framework's effectiveness. The session will culminate in a demonstration, showcasing VeriHDL's practical application and its impact on streamlining the hardware design process. |
15:00 CET | FS10.3 | CUMULUS: A CLOUD-BASED EDA PLATFORM FOR TEACHING AND RESEARCH Presenter: Phillip Christie, Trinity College Dublin, IE Author: Phillip Christie, Trinity College Dublin, IE Abstract In this presentation, I will relate our early experiences in the development and deployment of a cloud-based Electronic Design Automation (EDA) platform in Microsoft Azure, to be used as the basis of a set of laboratories associated with a Masters' degree level module in microelectronics at Trinity College Dublin. The intent of the CUMULUS platform is not to implement a commercial design flow but rather to permit students to see how choices made at one stage affect performance at later stages. The flow from device model, library characterization, synthesis, place and route, to timing analysis creates a sequence of data files in standardized formats which, while being mostly text-based, are not designed for human assessment. A key design concept of the CUMULUS platform has therefore been to use a suite of specially created MATLAB toolboxes to visualise and assess these data (in liberty, LEF, DEF, timing reports, GDSII formats) generated at each stage. This presentation will end by providing an overview of costs for the lab, show examples of the graphical representation of data files and make recommendations for future improvements to the CUMULUS platform. |
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: VIP Room
Speaker: Hai (Helen) Li
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
Mihai Lazarescu, Politecnico di Torino, IT
Session co-chair:
Alessio Burrello, Politecnico di Torino, IT
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | TS28.1 | PIPETTE: AUTOMATIC FINE-GRAINED LARGE LANGUAGE MODEL TRAINING CONFIGURATOR FOR REAL-WORLD CLUSTERS Speaker: Jinkyu Yim, Seoul National University, KR Authors: Jinkyu Yim1, Jaeyong Song1, Yerim Choi2, Jaebeen Lee2, Jaewon Jung1, Hongsun Jang1 and Jinho Lee1 1Seoul National University, KR; 2Samsung Electronics, KR Abstract Training large language models (LLMs) is known to be challenging because of the huge computational and memory capacity requirements. To address these issues, it is common to use a cluster of GPUs with 3D parallelism, which splits a model along the data batch, pipeline stage, and intra-layer tensor dimensions. However, the use of 3D parallelism produces the additional challenge of finding the optimal number of ways on each dimension and mapping the split models onto the GPUs. Several previous studies have attempted to automatically find the optimal configuration, but many of these lacked several important aspects. For instance, the heterogeneous nature of the interconnect speeds is often ignored. While the peak bandwidths for the interconnects are usually made equal, the actual attained bandwidth varies per link in real-world clusters. Combined with the critical path modeling that does not properly consider the communication, they easily fall into sub-optimal configurations. In addition, they often fail to consider the memory requirement per GPU, often recommending solutions that could not be executed. To address these challenges, we propose Pipette, which is an automatic fine-grained LLM training configurator for real-world clusters. By devising better performance models along with the memory estimator and fine-grained individual GPU assignment, Pipette achieves faster configurations that satisfy the memory constraints. We evaluated Pipette on large clusters to show that it provides a significant speedup over the prior art./proceedings-archive/2024/DATA/772_pdf_upload.pdf (an illustrative code sketch related to this entry follows this session's table) |
14:05 CET | TS28.2 | A COMPUTATIONALLY EFFICIENT NEURAL VIDEO COMPRESSION ACCELERATOR BASED ON A SPARSE CNN-TRANSFORMER HYBRID NETWORK Speaker: Wendong Mao, National Sun Yat-Sen University, TW Authors: Siyu Zhang1, Wendong Mao2, Huihong Shi1 and Zhongfeng Wang1 1Nanjing University, CN; 2National Sun Yat-Sen University, TW Abstract Video compression is widely used in digital television, surveillance systems, and virtual reality. Real-time video decoding is crucial in practical scenarios. Recently, neural video compression (NVC) combines traditional coding with deep learning, achieving impressive compression efficiency. Nevertheless, NVC models involve high computational costs and complex memory access patterns, challenging real-time hardware implementations. To relieve this burden, we propose an algorithm and hardware co-design framework named NVCA for video decoding on resource-limited devices. Firstly, a CNN-Transformer hybrid network is developed to improve compression performance by capturing multi-scale non-local features. In addition, we propose a fast algorithm-based sparse strategy that leverages the dual advantages of pruning and fast algorithms, sufficiently reducing computational complexity while maintaining video compression efficiency. Secondly, a configurable sparse computing core is designed to flexibly support sparse convolutions and deconvolutions based on the fast algorithm-based sparse strategy. Furthermore, a novel heterogeneous layer chaining dataflow is incorporated to reduce off-chip memory traffic stemming from extensive inter-frame motion and residual information. Thirdly, the overall architecture of NVCA is designed and synthesized under TSMC 28nm CMOS technology. Extensive experiments demonstrate that our design provides superior coding quality and up to 22.7× decoding speed improvements over other video compression designs. Meanwhile, our design achieves up to 2.2× improvements in energy efficiency compared to prior accelerators./proceedings-archive/2024/DATA/258_pdf_upload.pdf |
14:10 CET | TS28.3 | MULTIMODALHD: FEDERATED LEARNING OVER HETEROGENEOUS SENSOR MODALITIES USING HYPERDIMENSIONAL COMPUTING Speaker: Quanling Zhao, University of California, San Diego, US Authors: Quanling Zhao, Xiaofan Yu, Shengfan Hu and Tajana Rosing, University of California, San Diego, US Abstract Federated Learning (FL) has gained increasing interest as a privacy-preserving distributed learning paradigm in recent years. Although previous works have addressed data and system heterogeneities in FL, there has been less exploration of modality heterogeneity, where clients collect data from various sensor types such as accelerometer, gyroscope, etc. As a result, traditional FL methods assuming uni-modal sensors are not applicable in multimodal federated learning (MFL). State-of-the-art MFL methods use modality-specific blocks, usually recurrent neural networks, to process each modality. However, executing these methods on edge devices proves challenging and resource intensive. A new MFL algorithm is needed to jointly learn from heterogeneous sensor modalities while operating within limited resources and energy. We propose a novel hybrid framework based on Hyperdimensional Computing (HD) and deep learning, named MultimodalHD, to learn effectively and efficiently from edge devices with different sensor modalities. MultimodalHD uses a static HD encoder to encode raw sensory data from different modalities into high-dimensional low-precision hypervectors. These multimodal hypervectors are then fed to an attentive fusion module for learning richer representations via inter-modality attention. Moreover, we design a proximity-based aggregation strategy to alleviate modality interference between clients. MultimodalHD is designed to fully utilize the strengths of both worlds: the computing efficiency of HD and the capability of deep learning. We conduct experiments on multimodal human activity recognition datasets. Results show that MultimodalHD delivers comparable (if not better) accuracy compared to state-of-the-art MFL algorithms, while being 2x - 8x more efficient in terms of training time. Our code is available online./proceedings-archive/2024/DATA/671_pdf_upload.pdf (an illustrative code sketch related to this entry follows this session's table) |
14:15 CET | TS28.4 | DYPIM: DYNAMIC-INFERENCE-ENABLED PROCESSING-IN-MEMORY ACCELERATOR Speaker: Tongxin Xie, Tsinghua University, CN Authors: Tongxin Xie1, Tianchen Zhao1, Zhenhua Zhu1, Xuefei Ning1, Bing Li2, Guohao Dai3, Huazhong Yang1 and Yu Wang1 1Tsinghua University, CN; 2Capital Normal University, CN; 3Shanghai Jiao Tong University, CN Abstract Dynamic inference is an emerging research topic in deep learning. By selectively activating model components conditioned on the input, dynamic inference promises substantial computation reduction without significant impact on accuracy. While many dynamic inference algorithms have demonstrated superior trade-offs between accuracy and inference efficiency, memory I/O becomes irregular and dominant in dynamic networks, making it hard to transform theoretical computation reduction into end-to-end speedup on hardware. Processing-In-Memory (PIM), which can perform Matrix-Vector Multiplications inside the memory, eliminating the data movement of network parameters, is promising to address the Memory Wall challenge. However, deploying dynamic algorithms on PIM architectures faces the challenges of pipeline stalls, granularity mismatch, and the constraint of fixed computing resources allocated to each layer. To tackle these problems, we propose a software-hardware co-design strategy including a PIM-friendly dynamic inference algorithm with independent decision flow and a computing-resource-aware training scheme. It also includes a PIM architecture enabling dynamic dataflow and efficient exploitation of the algorithm's computation reduction. Experiments show that DyPIM can achieve 1.52x to 2.74x speedup and 2.05x to 3.95x throughput improvement over the baseline ResNet networks./proceedings-archive/2024/DATA/206_pdf_upload.pdf |
14:20 CET | TS28.5 | MULTI-LEVEL ANALYSIS OF GPU UTILIZATION IN ML TRAINING WORKLOADS Speaker: Paul Delestrac, LIRMM, University of Montpellier, FR Authors: Paul Delestrac1, Debjyoti Bhattacharjee2, Simei Yang2, Diksha Moolchandani2, Francky Catthoor2, Lionel Torres1 and David Novo3 1LIRMM, University of Montpellier, FR; 2imec, BE; 3Université de Montpellier, FR Abstract Training time has become a critical bottleneck due to the recent proliferation of large-parameter ML models. GPUs continue to be the prevailing architecture for training ML models. However, the complex execution flow of ML frameworks makes it difficult to understand GPU computing resource utilization. Our main goal is to provide a better understanding of how efficiently ML training workloads use the computing resources of modern GPUs. To this end, we first describe an ideal reference execution of a GPU-accelerated ML training loop and identify relevant metrics that can be measured using existing profiling tools. Second, we produce a coherent integration of the traces obtained from each profiling tool. Third, we leverage the metrics within our integrated trace to analyze the impact of different software optimizations (e.g., mixed-precision, various ML frameworks, and execution modes) on the throughput and the associated utilization at multiple levels of hardware abstraction (i.e., whole GPU, SM subpartitions, issue slots, and tensor cores). In our results on two modern GPUs, we present seven takeaways and show that although close to 100% utilization is generally achieved at the GPU level, average utilization of the issue slots and tensor cores always remains below 50% and 5.2%, respectively./proceedings-archive/2024/DATA/300_pdf_upload.pdf |
14:25 CET | TS28.6 | 12 MJ PER CLASS ON-DEVICE ONLINE FEW-SHOT CLASS-INCREMENTAL LEARNING Speaker: Fransiskus Yoga Esa Wibowo, ETH Zurich, CH Authors: Fransiskus Yoga Esa Wibowo1, Cristian Cioflan1, Thorir Ingolfsson1, Michael Hersche1, Leo Zhao1, Abbas Rahimi2 and Luca Benini3 1ETH Zurich, CH; 2IBM Research, CH; 3ETH Zurich, CH | Università di Bologna, IT Abstract Few-Shot Class-Incremental Learning (FSCIL) enables machine learning systems to expand their inference capabilities to new classes using only a few labeled examples, without forgetting the previously learned classes. Classical backpropagation-based learning and its variants are often unsuitable for battery-powered, memory-constrained systems at the extreme edge. In this work, we introduce Online Few-Shot Class-Incremental Learning (O-FSCIL), based on a lightweight model consisting of a pretrained and metalearned feature extractor and an expandable explicit memory storing the class prototypes. The architecture is pretrained with a novel feature orthogonality regularization and metalearned with a multi-margin loss. For learning a new class, our approach extends the explicit memory with novel class prototypes, while the remaining architecture is kept frozen. This allows learning previously unseen classes based on only a few examples with one single pass (hence online). O-FSCIL obtains an average accuracy of 68.62% on the FSCIL CIFAR100 benchmark, achieving state-of-the-art results. Tailored for ultra-low-power platforms, we implement O-FSCIL on the 60 mW GAP9 microcontroller, demonstrating online learning capabilities within just 12 mJ per new class./proceedings-archive/2024/DATA/811_pdf_upload.pdf (an illustrative code sketch related to this entry follows this session's table) |
14:30 CET | TS28.7 | DECOUPLED ACCESS-EXECUTE ENABLED DVFS FOR TINYML DEPLOYMENTS ON STM32 MICROCONTROLLERS Speaker: Manolis Katsaragakis, National TU Athens (NTUA), Greece, GR Authors: Elisavet Alvanaki1, Manolis Katsaragakis1, Dimosthenis Masouros1, Sotirios Xydis1 and Dimitrios Soudris2 1National TU Athens, GR; 2National Technical University of Athens, GR Abstract Over the last years, the number of Machine Learning (ML) inference applications deployed on the Edge has been growing rapidly. Recent Internet of Things (IoT) devices and microcontrollers (MCUs) are becoming more and more mainstream in everyday activities. In this work we focus on the family of STM32 MCUs. We propose a novel methodology for CNN deployment on the STM32 family, focusing on power optimization through effective clocking exploration and configuration and decoupled access-execute convolution kernel execution. Our approach is enhanced with optimization of the power consumption through Dynamic Voltage and Frequency Scaling (DVFS) under various latency constraints, composing an NP-complete optimization problem. We compare our approach against the state-of-the-art TinyEngine inference engine, as well as TinyEngine coupled with power-saving modes of the STM32 MCUs, indicating that we can achieve up to 25.2% less energy consumption for varying QoS levels./proceedings-archive/2024/DATA/866_pdf_upload.pdf |
14:35 CET | TS28.8 | AN ISOTROPIC SHIFT-POINTWISE NETWORK FOR CROSSBAR-EFFICIENT NEURAL NETWORK DESIGN Speaker: Boyu Li, University of Hong Kong, HK Authors: Ziyi Guan1, Boyu Li1, Yuan Ren1, Muqun Niu1, Hantao Huang2, Graziano Chesi1, Hao Yu2 and Ngai Wong1 1University of Hong Kong, HK; 2Southern University of Science and Technology, CN Abstract Resistive random-access memory (RRAM), with its programmable and nonvolatile conductance, permits compute-in-memory (CIM) at a much higher energy efficiency than the traditional von Neumann architecture, making it a promising candidate for edge AI. Nonetheless, the fixed-size crossbar tiles on RRAM are inherently unfit for conventional pyramid-shape convolutional neural networks (CNNs) that incur low crossbar utilization. To this end, we recognize the mixed-signal (digital-analog) nature in RRAM circuits and customize an isotropic shift-pointwise network that exploits digital shift operations for efficient spatial mixing and analog pointwise operations for channel mixing. To fast ablate various shift-pointwise topologies, a new reconfigurable energy-efficient shift module is designed and packaged into a seamless mixed-domain simulator. The optimized design achieves a near-100% crossbar utilization, providing a state-of-the-art INT8 accuracy of 94.88% (76.55%) on the CIFAR-10 (CIFAR-100) dataset with 1.6M parameters, which sets a new standard for RRAM-based AI accelerators./proceedings-archive/2024/DATA/402_pdf_upload.pdf |
14:40 CET | TS28.9 | MICRONAS: ZERO-SHOT NEURAL ARCHITECTURE SEARCH FOR MCUS Speaker: Ye Qiao, University of California, Irvine, US Authors: Ye Qiao, Haocheng Xu, Yifan Zhang and Sitao Huang, University of California, Irvine, US Abstract Neural Architecture Search (NAS) effectively discovers new Convolutional Neural Network (CNN) architectures, particularly for accuracy optimization. However, prior approaches often require resource-intensive training on super networks or extensive architecture evaluations, limiting practical applications. To address these challenges, we propose MicroNAS, a hardware-aware zero-shot NAS framework designed for microcontroller units (MCUs) in edge computing. MicroNAS considers target hardware optimality during the search, utilizing specialized performance indicators to identify optimal neural architectures without high computational costs. MicroNAS achieves up to 1104× improvement in search efficiency compared to previous works and discovers models with over 3.23× faster MCU inference while maintaining similar accuracy./proceedings-archive/2024/DATA/528_pdf_upload.pdf |
14:41 CET | TS28.10 | ZERO-SHOT CLASSIFICATION USING HYPERDIMENSIONAL COMPUTING Speaker: Abbas Rahimi, IBM Research, CH Authors: Samuele Ruffino1, Geethan Karunaratne2, Michael Hersche1, Luca Benini3, Abu Sebastian2 and Abbas Rahimi2 1ETH Zurich, CH; 2IBM Research, CH; 3ETH Zurich, CH | Università di Bologna, IT Abstract Classification based on Zero-shot Learning (ZSL) is the ability of a model to classify images into novel classes on which the model has not previously seen any training examples. Providing an auxiliary descriptor in the form of a set of attributes describing the new classes involved in the ZSL-based classification is one of the favored approaches to solving this task. In this work, we present how stationary binary codebooks of symbol-like distributed representations, inspired by Hyperdimensional Computing (HDC), can be used as attribute encoders to compactly represent a computationally simpler end-to-end trainable model, which we name the HDC-enhanced model. The proposed model consists of a trainable image encoder, an HDC-based frozen attribute encoder, and a similarity kernel. We show that our model can be used to first perform zero-shot attribute extraction tasks and can later be repurposed for Zero-shot Classification tasks with minimal architectural change and minimal model retraining. The HDC-enhanced model achieves Pareto-optimal results with a 63.8% top-1 classification accuracy on the CUB-200 dataset while having only 26.6 million trainable parameters. Compared to two other state-of-the-art non-generative approaches, the HDC-enhanced model achieves 4.3% and 9.9% better accuracy, while those approaches require more than 1.85x and 1.72x the parameters of the HDC-enhanced model, respectively. Finally, we show how our HDC-enhanced model's architecture is judiciously designed to fit in modern energy-efficient heterogeneous computing platforms./proceedings-archive/2024/DATA/413_pdf_upload.pdf |
14:42 CET | TS28.11 | ACCELERATING DNNS USING WEIGHT CLUSTERING ON RISC-V CUSTOM FUNCTIONAL UNITS Speaker: Muhammad Sabih, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Authors: Muhammad Sabih1, Batuhan Sesli2, Frank Hannig3 and Jürgen Teich3 1University of Erlangen-Nuremberg (Friedrich-Alexander-Universität Erlangen-Nürnberg), DE; 2Friedrich-Alexander-Universität Erlangen-Nürnberg, DE; 3Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Abstract Weight clustering is typically used to compress a Deep Neural Network (DNN) by reducing the number of unique weight values, which can then be encoded using a few bits. However, using weight clustering for acceleration remains an unexplored area. In this work, we propose a design using Custom Functional Units (CFUs) to accelerate DNNs with weight clustering on a RISC-V-based SoC. We evaluate our accelerator on resource-constrained ML use cases and report considerable speedups of up to 8 times with minimal overhead in the utilization of FPGA resources. (A small weight-clustering sketch follows this session's table.)/proceedings-archive/2024/DATA/752_pdf_upload.pdf |
14:43 CET | TS28.12 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
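As referenced in TS28.10 above, HDC-style zero-shot classification builds class representations from a frozen codebook of attribute hypervectors and compares a query embedding against them with a similarity kernel. The sketch below is only a minimal, self-contained illustration of that encoding idea, not the authors' model: the dimensionality, attribute names, classes, and the way the query is produced are all invented for the example.

```python
import numpy as np

# Hypothetical attributes, classes, and sizes -- illustration only.
D = 10_000                                    # hypervector dimensionality
rng = np.random.default_rng(0)

attributes = ["has_wings", "has_beak", "striped", "aquatic"]
classes = {
    "sparrow": ["has_wings", "has_beak"],
    "zebra":   ["striped"],
    "duck":    ["has_wings", "has_beak", "aquatic"],
}

# Frozen codebook: one random bipolar (+1/-1) hypervector per attribute.
codebook = {a: rng.choice([-1, 1], size=D) for a in attributes}

def class_prototype(attr_list):
    """Bundle a class's attribute hypervectors by majority sign."""
    vecs = [codebook[a] for a in attr_list]
    if len(vecs) % 2 == 0:                    # odd count avoids zero ties
        vecs.append(rng.choice([-1, 1], size=D))
    return np.sign(np.sum(vecs, axis=0)).astype(int)

prototypes = {c: class_prototype(al) for c, al in classes.items()}

def classify(query):
    """Similarity kernel: pick the class with the highest cosine similarity."""
    sims = {c: float(query @ p) / (np.linalg.norm(query) * np.linalg.norm(p))
            for c, p in prototypes.items()}
    return max(sims, key=sims.get)

# A query would normally come from the trained image encoder; here we fake
# one by flipping 20% of the signs of the "duck" prototype.
flip = rng.random(D) < 0.2
query = prototypes["duck"] * np.where(flip, -1, 1)
print(classify(query))                        # expected: 'duck'
```

Because agreement between random high-dimensional bipolar vectors concentrates sharply around chance, the noisy query still lands closest to its own class prototype, which is the robustness property HDC-based encoders rely on.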
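TS28.11 above accelerates DNNs whose weights have been clustered into a small codebook. The following is a minimal sketch of the offline clustering step only (k-means over a toy weight tensor with invented sizes); the RISC-V custom functional unit that exploits the resulting indices at inference time is not modelled here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)   # toy layer weights

K = 16                                 # 16 shared values -> 4-bit indices
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(weights.reshape(-1, 1))
codebook = km.cluster_centers_.ravel().astype(np.float32)   # shared values
indices = km.labels_.reshape(weights.shape).astype(np.uint8)

# Inference-side reconstruction is a table lookup per weight -- the kind of
# operation a small custom functional unit can serve from local storage.
reconstructed = codebook[indices]
print("max abs quantisation error:", float(np.abs(weights - reconstructed).max()))
print("unique weight values      :", np.unique(reconstructed).size)
```

With 16 shared values, each weight is stored as a 4-bit index plus one shared value per cluster, which is what makes a lookup-based hardware path attractive.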
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 14:00 CET - 18:00 CET
Location / Room: Break-Out Room S6+7
Organisers
Ilia Polian, University of Stuttgart, Germany
Nan Du, Friedrich Schiller University Jena, Germany
Shahar Kvatinsky, Technion – Israel Institute of Technology, Israel
Ingrid Verbauwhede, KU Leuven, Belgium
URL: Workshop website at the University of Stuttgart
Today’s societies critically depend on electronic systems. The security of such systems is facing completely new challenges due to the ongoing transition to radically new types of nano-electronic devices, such as memristors, spintronics, or carbon nanotubes. The use of such emerging nano-technologies is inevitable to address the essential needs related to energy efficiency, computing power and performance. Therefore, the entire industry is switching to emerging nano-electronics alongside scaled CMOS technologies in heterogeneous integrated systems. These technologies come with new properties and also facilitate the development of radically different computer architectures.
The second edition of the NanoSec workshop will bring together researchers from hardware-oriented security and from emerging hardware technology. It will explore the potential of new technologies and architectures to provide new opportunities for achieving security targets, but it will also raise questions about their vulnerabilities to new types of hardware-oriented attacks. The workshop is based on a Priority Program (https://spp-nanosecurity.uni-stuttgart.de/) funded since 2019 by the German DFG and will be open to members and non-members of that Priority Program alike.
Mon, 14:00 - 14:45
Session chair
Michael Hutter, PQShield, Vienna, Austria
The Impact of Logic Synthesis and Technology Mapping on Logic Locking Security
Lilas Alrahis, NYU Abu Dhabi
Mon, 14:45 - 15:15
Session chair
Giorgio Di Natale, TIMA, France
Okapi: A Lightweight Architecture for Secure Speculation Exploiting Locality of Memory Accesses
Philipp Schmitz1, Tobias Jauch1, Alex Wezel1, Mohammad R. Fadiheh2, Thore Tiemann3, Jonah Heller3, Thomas Eisenbarth3, Dominik Stoffel1, Wolfgang Kunz1
1RPTU Kaiserslautern-Landau, 2Stanford U, 3U Lübeck
Neuromorphic and In-Memory Computing Based on Memristive Circuits for Predictive Maintenance and Supply-Chain Management and Security
Nikolaos Athanasios Anagnostopoulos, Nico Mexis, Stefan Katzenbeisser, Elif Bilge Kavun, Tolga Arul, U Passau
Mon, 15:15 - 16:00
Organiser
Ilia Polian, University of Stuttgart, Germany
OnE-Secure: Securing State-of-the-Art Chips Against High-Resolution Contactless Optical and Electron-Beam Probing Attacks
Sebastian Brand (FhG IMWS), Rolf Drechsler (U Bremen), Jean-Pierre Seifert (TU Berlin), Frank Sill Torres (DLR)
STAMPS-PLUS: Exploration of an integrated Strain-based TAMPer Sensor for Puf and trng concepts with best-in-class Leakage resilience and robUStness
Ralf Brederlow (TU Munich), Matthias Hiller (FhG AISEC), Michael Pehl (TU Munich)
RAINCOAT: Randomization in Secure Nano-Scale Microarchitectures 2
Lucas Davi (U Duisburg-Essen), Tim Güneysu (RU Bochum)
EMBOSOM: Embedded Software Security into Modern Emerging Hardware Paradigms
Rolf Drechsler (U Bremen), Tim Güneysu (RU Bochum)
MemCrypto: Towards Secure Electroforming-free Memristive Cryptographic Implementations
Nan Du (FSU Jena), Ilia Polian (U Stuttgart)
HaSPro: Verifiable Hardware Security for Out-of-Order Processors
Thomas Eisenbarth (U Lübeck), Wolfgang Kunz (TU Kaiserslautern)
NanoSec2: Nanomaterial-based platform electronics for PUF circuits with extended entropy sources
Sascha Herrmann (TU Chemnitz), Stefan Katzenbeisser (U Passau), Elif Kavun (U Passau)
SecuReFET: Secure Circuits through Inherent Reconfigurable FET
Akash Kumar (TU Dresden), Thomas Mikolajick (NaMLab GmbH)
SSIMA: Scalable Side-Channel Immune Micro-Architecture
Amir Moradi (TU Darmstadt)
SeMSiNN: Secure Mixed-SIgnal Neural Networks
Maurits Ortmanns (U Ulm), Ilia Polian (U Stuttgart)
Mon, 16:30 - 17:15
Session chair
Francesco Regazzoni, University of Amsterdam, NL and Università della Svizzera italiana, Switzerland
Hardware Trojan Detection Using Optical Probing
Sajjad Parvin1, Frank Sill Torres2, Rolf Drechsler1, 1U Bremen, 2DLR Bremen
A Cautionary Note about Bit Flips in ReRAM
Felix Staudigl1, Jan Philipp Thoma2, Christian Niesler3, Karl J. X. Sturm1, Rebecca Pelke1, Dominik Sisejkovic1, Jan Moritz Joseph1, Tim Güneysu2, Lucas Davi3, Rainer Leupers1
1RWTH Aachen, 2RU Bochum, 3U Duisburg-Essen
An Analysis of the Effects of Temperature on the Performance of ReRAM-Based TRNGs
Nico Mexis, Nikolaos Athanasios Anagnostopoulos, Stefan Katzenbeisser, Tolga Arul, U Passau
Mon, 17:15 - 18:00
Session chair
Haralampos-G. Stratigopoulos, Sorbonne Universités, CNRS, LIP6, France
A Guide to Assessing Emerging Reconfigurable Nanotechnologies for Robust IP Protection
Armin Darjani, Nima Kavand, Akash Kumar, TU Dresden
Fingerprinting and Identification of Hall Sensors
Christoph Frisch1, Tobias Chlan1, Carl Riehm1, Markus Sand2, Markus Stahl-Offergeld3, Michael Pehl1, Ralf Brederlow1,3
1TU Munich, 2LZE GmbH, 3Fraunhofer Institute for Integrated Circuits IIS
Memristors in the Context of Security and AI
Alexander Tekles, Tolga Arul, Nico Mexis, Stefan Katzenbeisser, Nikolaos Athanasios Anagnostopoulos, U Passau
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 14:00 CET - 18:00 CET
Location / Room: Multi-Purpose Room M1B+D
OSDA intends to provide an avenue for industry, academics, and hobbyists to collaborate, network, and share their latest visions and open-source contributions, with a view to promoting reproducibility and re-usability in the design automation space. DATE provides the ideal venue to reach this audience since it is the flagship European conference in this field -- particularly pertinent given the recent efforts across the European Union (and beyond) that mandate “open access” for publicly funded research to both published manuscripts as well as the software code necessary for reproducing its conclusions. A secondary objective of this workshop is to provide a peer-reviewed forum for researchers to publish “enabling” technology such as infrastructure or tooling as open-source contributions -- standalone technology that would not normally be regarded as novel by traditional conferences -- such that others inside and outside of academia may build upon it.
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 14:45 CET - 15:30 CET
Location / Room: Auditorium 3
Session chair:
Marina Saryan, Synopsys, AM
This is a Young People Programme event. At the Speed Dating event, attendees of the Young People Programme can meet the recruiters and exchange business cards and CVs. Recruiters from Cadence Design Systems, Synopsys, Racyics and Semidynamics will attend.
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 16:15 CET - 17:00 CET
Location / Room: Break-Out Room S3+4
Session chair:
Victor Grimblatt, Synopsys, CL
Panellists:
Mariusz Grabowski, Cadence Design Systems, PL
Rui Machado, Synopsys, PT
Patrick Doell, Racyics, DE
Josep Sans, Semidynamics, ES
Javier Ramos, Biobee, ES
This is a Young People Programme event. At the Panel on Industry Career Perspectives, young professionals from companies and startups will talk about their experience of moving from academia to industry or of founding a startup. The panellists are listed above.
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Break-Out Room S8
Organiser:
Marilyn Wolf, University of Nebraska, US
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | ASD03.1 | ADAPTIVE PERCEPTION CONTROL FOR AERIAL ROBOTS WITH TWIN DELAYED DDPG Speaker and Author: Marilyn Wolf, University of Nebraska, US Abstract Robotic perception is commonly assisted by convolutional neural networks. However, these networks are static in nature and do not adjust to changes in the environment. Additionally, they are computationally complex and impose latency during inference. We propose an adaptive perception system that changes in response to the robot's requirements. The perception controller has been designed using a recently proposed reinforcement learning technique called Twin Delayed DDPG (TD3). Our proposed method outperformed the baseline approaches./proceedings-archive/2024/DATA/6040_pdf_upload.pdf |
16:53 CET | ASD03.2 | FRONTIERS IN EDGE AI WITH RISC-V: HYPERDIMENSIONAL COMPUTING VS. QUANTIZED NEURAL NETWORKS Speaker: Hussam Amrouch, TU Munich (TUM), DE Authors: Hussam Amrouch1, Paul Genssler2, Sandy A. Wasif1, Miran Wael1 and Rodion Novkin1 1TU Munich (TUM), DE; 2University of Stuttgart, DE Abstract Hyperdimensional Computing (HDC) is an emerging paradigm that stands as a compelling alternative to conventional Deep Learning algorithms. HDC holds four key promises. First, the ability to learn from little data. Second, robustness against noise in this data. Third, resilience against errors in the underlying hardware, including the memory on which the model is stored and errors in the computation of the operations, which is attributed to the encoding of information across an expansive dimensional space. Fourth, HDC can be implemented efficiently in hardware due to its lightweight and embarrassingly parallel computations. In this work, those four key promises are evaluated in a holistic way. A fixed-point and a binary HDC implementation are compared against neural network implementations. The models are executed on a RISC-V processor to ensure a fair comparison. While the results confirm the ability to learn from little data and the resiliency against errors, the higher inference accuracy of neural networks favors them in most experiments. Based on these insights, we formulate challenges and opportunities for HDC. Our implementations for QNN, binary and fixed-point HDC are available online: https://github.com/TUM-AIPro/HDC-vs-QNN./proceedings-archive/2024/DATA/6039_pdf_upload.pdf |
17:15 CET | ASD03.3 | DRIVING AUTONOMY WITH EVENT-BASED CAMERAS: ALGORITHM AND HARDWARE PERSPECTIVES Speaker and Author: Saibal Mukhopadhyay, Georgia Tech, US Abstract In high-speed robotics and autonomous vehicles, rapid environmental adaptation is necessary. Traditional cameras often face issues with motion blur and limited dynamic range. Event-based cameras address these by tracking pixel changes continuously and asynchronously, offering higher temporal resolution with minimal blur. In this work, we highlight our recent efforts in solving the challenge of processing event-camera data efficiently from both algorithm and hardware perspectives. Specifically, we present how brain-inspired algorithms such as spiking neural networks (SNNs) can efficiently detect and track object motion from event-camera data. Next, we discuss how we can leverage associative memory structures for efficient event-based representation learning. Finally, we show how our Application Specific Integrated Circuit (ASIC) architecture for low-latency, energy-efficient processing outperforms typical GPU/CPU solutions, thus enabling real-time event-based processing. With a 100× reduction in latency and a 1000× lower energy per event compared to state-of-the-art GPU/CPU setups, this enhances the capability of front-end camera systems in autonomous vehicles to handle higher rates of event generation, improving control./proceedings-archive/2024/DATA/6041_pdf_upload.pdf |
17:38 CET | ASD03.4 | CORPORATE GOVERNANCE AND MANAGEMENT OF AI-DRIVEN PRODUCT DEVELOPMENT: VEHICLE AUTOMATION Speaker: William Widen, University of Miami, US Author: Marilyn Wolf, University of Nebraska, US Abstract This essay explores the interplay between proper corporate governance and engineering expertise in developing products that use artificial intelligence (with a focus on vehicle automation) and considers how an organization might balance maximizing earnings with the public interest in safe and secure AI-driven products. The essay recommends that, for oversight, directors use a management model with structures of communication and reporting that provide for more direct engagement with engineers and product managers./proceedings-archive/2024/DATA/6038_pdf_upload.pdf |
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Break-Out Room S1+2
Session chair:
Sara Vinco, Politecnico di Torino, IT
Session co-chair:
Tanja Harbaum, Karlsruhe Institute of Technology, DE
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | BPA04.1 | FUSIONARCH: A FUSION-BASED ACCELERATOR FOR POINT-BASED POINT CLOUD NEURAL NETWORKS Speaker: Xueyuan Liu, Shanghai Jiao Tong University, CN Authors: Xueyuan Liu1, Zhuoran Song1, Guohao Dai1, Gang Li1, Can Xiao2, Yan Xiang2, Dehui Kong2, Ke Xu2 and Xiaoyao Liang1 1Shanghai Jiao Tong University, CN; 2Sanechips Technology, CN Abstract Point-based Point Cloud Neural Networks (PCNNs) have attracted much attention for their higher accuracy than voxel-based and multi-view-based PCNNs. Nevertheless, the increasing scale of point cloud data poses a challenge for real-time processing. Numerous previous works focus on accelerating PCNN inference but only optimize specific stages, limiting their generality to different networks with diverse performance bottlenecks. In this paper, we take nearly all stages of PCNNs into account, and propose 3 orthogonal algorithms, including Fusion-FPS, Fusion-Computation, and Fusion-Aggregation. We introduce Fusion-FPS to alter the sequential execution flow by reducing the Farthest Point Sampling (FPS) across layers to once and organize all neighbor search stages in parallel. To exclude redundant feature computations of "Filling Points", we propose Fusion-Computation, identifying the presence and locations of "Filling Points" and directly borrowing the nearest neighbor features for them. To eliminate redundant memory accesses caused by shared neighbors in aggregation, we present Fusion-Aggregation, which clusters nearby centroids and coalesces their replicated accesses. In support of our algorithms, we co-design FusionArch, an architecture that implements our strategies and further optimizes memory access via a Local Fusion-Aggregation Table (LFT). We evaluate FusionArch on both server-level and edge-level platforms on 5 PCNNs across 4 applications and show remarkable accuracy and performance gains. On average, FusionArch achieves 2.6x, 5.6x, 13x speedup and 17x, 22x, 62.4x energy savings over PointAcc.Server, NVIDIA A100 GPU and Intel Xeon CPU, respectively. Moreover, it outperforms PRADA, PointAcc.Edge, Mesorasi and GPU with speedups of 2.4x, 2.9x, 5.3x, 5.5x, and energy savings of 4.4x, 7.2x, 12.4x, 11.5x, respectively. (A plain farthest point sampling sketch follows this session's table.)/proceedings-archive/2024/DATA/196_pdf_upload.pdf |
16:50 CET | BPA04.2 | EFFICIENT EXPLORATION OF CYBER-PHYSICAL SYSTEM ARCHITECTURES USING CONTRACTS AND SUBGRAPH ISOMORPHISM Speaker: Yifeng Xiao, University of Southern California, US Authors: Yifeng Xiao1, Chanwook Oh1, Michele Lora2 and Pierluigi Nuzzo1 1University of Southern California, US; 2Università di Verona, IT Abstract We present ContrArc, a methodology for the exploration of cyber-physical system architectures aiming to minimize a cost function while adhering to a set of heterogeneous constraints. We assume a system topology, defined as a graph, where components (nodes) are selected from an implementation library, and connections between components (edges) are drawn from a finite set of possible connection choices. ContrArc uses assume-guarantee contracts to formalize different viewpoints in the system requirements, such as timing and power consumption, as well as the interface of different components, and translate the exploration problem into a mixed integer linear programming problem. It then searches for efficient solutions by relying on contract decompositions and a method based on subgraph isomorphism to iteratively prune infeasible architectures out of the search space. Experiments on a reconfigurable production line and an aircraft power distribution network show up to two orders of magnitude acceleration in architectural exploration with respect to comparable approaches./proceedings-archive/2024/DATA/664_pdf_upload.pdf |
17:10 CET | BPA04.3 | SYNTHESIZING HARDWARE-SOFTWARE LEAKAGE CONTRACTS FOR RISC-V OPEN-SOURCE PROCESSORS Speaker: Gideon Mohr, Saarland University, DE Authors: Gideon Mohr1, Marco Guarnieri2 and Jan Reineke1 1Saarland University, DE; 2IMDEA Software Institute, ES Abstract Microarchitectural attacks compromise security by exploiting software-visible artifacts of microarchitectural optimizations such as caches and speculative execution. Defending against such attacks at the software level requires an appropriate abstraction at the instruction set architecture (ISA) level that captures microarchitectural leakage. Hardware-software leakage contracts have recently been proposed as such an abstraction. In this paper, we propose a semi-automatic methodology for synthesizing hardware-software leakage contracts for open-source microarchitectures. For a given ISA, our approach relies on human experts to (a) capture the space of possible contracts in the form of contract templates and (b) devise a test-case generation strategy to explore a microarchitecture's potential leakage. For a given implementation of an ISA, these two ingredients are then used to automatically synthesize the most precise leakage contract that is satisfied by the microarchitecture. We have instantiated this methodology for the RISC-V ISA and applied it to the Ibex and CVA6 open-source processors. Our experiments demonstrate the practical applicability of the methodology and uncover subtle and unexpected leaks./proceedings-archive/2024/DATA/1212_pdf_upload.pdf |
17:30 CET | BPA04.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
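For background on the farthest point sampling stage that Fusion-FPS (BPA04.1) collapses into a single pass, the snippet below shows a plain, textbook FPS loop over a toy point cloud. It is not the authors' fused variant, and the point count and sample count are arbitrary.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]                 # random first centroid
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(n_samples - 1):
        idx = int(np.argmax(dist))                  # farthest from current set
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

cloud = np.random.default_rng(1).random((2048, 3))  # toy point cloud
centroids = farthest_point_sampling(cloud, 128)
print(centroids.shape)                              # (128,)
```

Each iteration updates a running nearest-distance array, so the loop is O(n) per sampled centroid; in pyramid-shaped PCNNs this step is repeated per layer, which is exactly the redundancy the fused approach targets.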
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Auditorium 2
Session chair:
Rolf Drechsler, University of Bremen | DFKI, DE
Organiser:
Lukas Sekanina, Brno University of Technology, CZ
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | FS07.1 | POLYNOMIAL FORMAL VERIFICATION OF SEQUENTIAL CIRCUITS Speaker: Caroline Dominik, University of Bremen | DFKI, DE Authors: Caroline Dominik1 and Rolf Drechsler2 1Institute of Computer Science, University of Bremen/DFKI, DE; 2University of Bremen | DFKI, DE Abstract Recently, the concept of Polynomial Formal Verification (PFV) has been introduced and successfully applied to several classes of functions, allowing complete verification under resource constraints. But so far, all studies were carried out for combinational circuits only. In this paper we show how the concept of PFV can be extended to sequential circuits. As a first case study we show for counters that PFV can be performed, even though they have an exponential number of states, i.e., they can be fully formally verified within polynomial upper bounds on run-time and memory requirement./proceedings-archive/2024/DATA/1307_pdf_upload.pdf |
16:53 CET | FS07.2 | AUTOMATED VERIFIABILITY-DRIVEN DESIGN OF APPROXIMATE CIRCUITS: EXPLOITING ERROR ANALYSIS Speaker: Zdenek Vasicek, Brno University of Technology, CZ Authors: Zdenek Vasicek, Vojtech Mrazek and Lukas Sekanina, Brno University of Technology, CZ Abstract A fundamental assumption for search-based circuit approximation methods is the ability to massively and efficiently traverse the search space and evaluate candidate solutions. For complex approximate circuits (adders and multipliers), common error metrics, and error analysis approaches (SAT solving, BDD analysis), we perform a detailed analysis to understand the behavior of the error analysis methods under constrained resources, such as limited execution time. In addition, we show that when evaluating the error of a candidate approximate circuit, it is highly beneficial to reuse knowledge obtained during the evaluation of previous circuit instances to reduce the total design time. When an adaptive search strategy that drives the search towards promptly verifiable approximate circuits is employed, the method can discover circuits that exhibit better trade-offs between error and desired parameters (such as area) than the same method with unconstrained verification resources and within the same overall time budget. For 16-bit and 20-bit approximate multipliers, it was possible to achieve a 75% reduction in area when compared with the baseline method. (A brute-force error-metric sketch for a toy approximate adder follows this session's table.) |
17:15 CET | FS07.3 | USING FORMAL VERIFICATION METHODS FOR OPTIMIZATION OF CIRCUITS UNDER EXTERNAL CONSTRAINTS Speaker: Daniel Grosse, Johannes Kepler University Linz, AT Authors: Daniel Grosse, Lucas Klemmer and Dominik Bonora, Johannes Kepler University Linz, AT Abstract This paper targets the optimization of circuit netlists by eliminating redundant gates under given external constraints. Typical examples for external constraints – which can be viewed as external don't cares – are restrictions on input operands, instruction subsets used by a processor for specific applications, or limited operation modes of an integrated IP block. Targeting external don't cares presents a challenge because the optimization problem changes from a completely specified Boolean function to a Boolean relation. We propose an optimization approach that utilizes formal verification methods. We demonstrate how to formulate Property Checking (PC) and Equivalence Checking (EC) problems to determine if a gate is redundant under given external constraints. Essentially, the validity of up to four rules must be checked per gate. We show that these checks can be solved concurrently, resulting in faster overall optimization. We have implemented our approach as the tool Formal SYNthesis (FSYN). FSYN utilizes open-source tools to scale the solving of formal instances with available hardware resources. We demonstrate that our approach can achieve substantial reductions in the number of gates for combinational circuits under given external constraints./proceedings-archive/2024/DATA/1309_pdf_upload.pdf |
17:38 CET | FS07.4 | COMBINING FORMAL VERIFICATION AND TESTING FOR DEBUGGING OF ARITHMETIC CIRCUITS Speaker: Jiteshri Dasari, University of Massachusetts Amherst, US Authors: Jiteshri Dasari and Maciej Ciesielski, University of Massachusetts Amherst, US Abstract Formal verification has been successfully used to verify different types of digital circuits, including combinational and sequential logic, arithmetic circuits, and datapath designs. However, the verification techniques concentrate on confirming whether the circuit performs its intended function, while the issue of debugging, i.e., detection and correction of functional errors of the design, remains an open problem. Elaborate testing techniques have been developed that target certain types of manufacturing faults, but there are no general techniques that address the debugging issue for functional bugs. This paper addresses the issue of debugging of arithmetic circuits that, due to their large size and complexity, are particularly hard to verify and debug. Current debugging techniques handle only simple types of bugs: gate replacement, wrong gate polarity, or a missing gate, but cannot handle more realistic faults, such as wrong wiring or using a wrong combination of logic gates. We describe a novel method that combines formal verification and testing techniques to enable efficient identification and correction of faults. The technique involves setting select signals to some predefined constants to reduce the design to easily verifiable circuit components; these components are then verified using logic equivalence checking and SAT tools. The fault can then be identified in the form of a small logic area (with a few logic gates) to be replaced by new, functionally correct logic. The proposed technique is illustrated with the debugging of different types of divider circuits up to 1024 bits wide. |
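As noted in FS07.2 above, evaluating candidate approximate circuits hinges on error metrics such as worst-case error and error rate. For small bit-widths these metrics can be obtained by brute-force enumeration, which is the naive baseline that the paper's SAT- and BDD-based analyses replace. The sketch below uses a simple truncation-based approximate adder invented purely for the example.

```python
from itertools import product

BITS = 8
CUT = 3                        # drop the CUT least significant bits

def approx_add(a, b):
    """Truncated adder: ignore the low CUT bits of both operands."""
    return ((a >> CUT) + (b >> CUT)) << CUT

worst, err_sum, err_cnt, total = 0, 0, 0, 0
for a, b in product(range(2 ** BITS), repeat=2):    # exhaustive enumeration
    exact, approx = a + b, approx_add(a, b)
    e = abs(exact - approx)
    worst = max(worst, e)
    err_sum += e
    err_cnt += (e != 0)
    total += 1

print("worst-case error:", worst)          # 2 * (2**CUT - 1) for this adder
print("mean error      :", err_sum / total)
print("error rate      :", err_cnt / total)
```

Exhaustive evaluation grows as 2^(2·BITS), which is why formal error analysis becomes the practical route for the 16-bit and 20-bit multipliers discussed in the talk.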
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Auditorium 3
Session chair:
Mario Casu, Politecnico di Torino, IT
Session co-chair:
Todor Stefanov, Leiden University, NL
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | TS01.1 | PIMLC: LOGIC COMPILER FOR BIT-SERIAL BASED PIM Speaker: Chenyu Tang, Shanghai Jiao Tong University, CN Authors: Chenyu Tang, Chen Nie, Weikang Qian and Zhezhi He, Shanghai Jiao Tong University, CN Abstract Recently, bit-serial-based processing-in-memory (PIM) has evolved as a promising solution to enhance the computing performance of data-intensive applications, due to its high performance and programmability. However, no compiler exists that can automatically convert an arbitrary Boolean function (generic workload) into PIM instructions, with optimized scheduling w.r.t. varying hardware resources and specifications. To fill the gap, we develop a logic compiler for bit-serial-based PIM (PIMLC). In PIMLC, we propose a workload-resource-aware scheduling to minimize the execution latency of a given parallel workload. Thanks to PIMLC, PIM can achieve 15.55× and 19.03× speedup (geo-mean) for SRAM- and ReRAM-PIM respectively, compared to the naive scheduling of prior work. PIMLC is publicly available at: https://github.com/Intelligent-Computing-Research-Group/PIMLC. (A software sketch of bit-serial addition over bit-planes follows this session's table.)/proceedings-archive/2024/DATA/1178_pdf_upload.pdf |
16:35 CET | TS01.2 | PIMSYN: SYNTHESIZING PROCESSING-IN-MEMORY CNN ACCELERATORS Speaker: Wanqian Li, Institute of Computing Technology, Chinese Academy of Sciences, CN Authors: Wanqian Li, Xiaotian Sun, Xinyu Wang, Lei Wang, Yinhe Han and Xiaoming Chen, Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract Processing-in-memory architectures have been regarded as a promising solution for CNN acceleration. Existing PIM accelerator designs rely heavily on the experience of experts and require significant manual design overhead. Manual design cannot effectively optimize and explore architecture implementations. In this work, we develop an automatic framework PIMSYN for synthesizing PIM-based CNN accelerators, which greatly facilitates architecture design and helps generate energy efficient accelerators. PIMSYN can automatically transform CNN applications into execution workflows and hardware construction of PIM accelerators. To systematically optimize the architecture, we embed an architectural exploration flow into the synthesis framework, providing a more comprehensive design space. Experiments demonstrate that PIMSYN improves the power efficiency by several times compared with existing works./proceedings-archive/2024/DATA/9_pdf_upload.pdf |
16:40 CET | TS01.3 | MATADOR: AUTOMATED SYSTEM-ON-CHIP TSETLIN MACHINE DESIGN GENERATION FOR EDGE APPLICATIONS Speaker: Tousif Rahman, Newcastle University, GB Authors: Tousif Rahman1, Gang Mao1, Sidharth Maheshwari2, Rishad Shafik1 and Alex Yakovlev1 1Newcastle University, GB; 2IIT Jammu, IN Abstract System-on-Chip Field-Programmable Gate Arrays (SoC-FPGAs) offer significant throughput gains for machine learning (ML) edge inference applications via the design of co-processor accelerator systems. However, the design effort for training and translating ML models into SoC-FPGA solutions can be substantial and requires specialist knowledge to navigate trade-offs between model performance, power consumption, latency and resource utilization. Contrary to other ML algorithms, the Tsetlin Machine (TM) performs classification by forming logic propositions between Boolean actions from the Tsetlin Automata (the learning elements) and Boolean input features. A trained TM model usually exhibits high sparsity and considerable overlapping of these logic propositions both within and among the classes. The model can thus be translated to an RTL-level design using a minuscule number of AND and NOT gates. This paper presents MATADOR, an automated Boolean-to-silicon tool with a GUI, capable of implementing an optimized accelerator design of the TM model on SoC-FPGA for inference at the edge. It offers automation of the full development pipeline: model training, system-level design generation, design verification and deployment. It makes use of the logic sharing that ensues from propositional overlap and creates a compact design by effectively utilizing the TM model's sparsity. MATADOR accelerator designs are shown to be up to 13.4x faster, up to 7x more resource frugal and up to 2x more power efficient when compared to state-of-the-art Quantized and Binary Deep Neural Network implementations./proceedings-archive/2024/DATA/1184_pdf_upload.pdf |
16:45 CET | TS01.4 | SUBGRAPH EXTRACTION-BASED FEEDBACK-GUIDED ITERATIVE SCHEDULING FOR HLS Speaker: David Pan, The University of Texas at Austin, US Authors: Hanchen Ye1, David Z. Pan2, Chris Leary3, Deming Chen1 and Xiaoqing Xu4 1University of Illinois at Urbana-Champaign, US; 2The University of Texas at Austin, US; 3Google, US; 4X, US Abstract This paper proposes ISDC, a novel feedback-guided iterative system of difference constraints (SDC) scheduling algorithm for high-level synthesis (HLS). ISDC leverages subgraph extraction-based low-level feedback from downstream tools like logic synthesizers to iteratively refine HLS scheduling. Technical innovations include: (1) An enhanced SDC formulation that effectively integrates low-level feedback into the linear-programming (LP) problem; (2) A fanout and window-based subgraph extraction mechanism driving the feedback cycle; (3) A no-human-in-loop ISDC flow compatible with a wide range of downstream tools and process design kits (PDKs). Evaluation shows that ISDC reduces register usage by 28.5% against an industrial-strength open-source HLS tool./proceedings-archive/2024/DATA/175_pdf_upload.pdf |
16:50 CET | TS01.5 | HIERARCHICAL SOURCE-TO-POST-ROUTE QOR PREDICTION IN HIGH-LEVEL SYNTHESIS WITH GNNS Speaker: Mingzhe Gao, Shanghai Jiao Tong University, CN Authors: Mingzhe Gao1, Jieru Zhao1, Zhe Lin2 and Minyi Guo1 1Shanghai Jiao Tong University, CN; 2National Sun Yat-Sen University, TW Abstract High-level synthesis (HLS) notably speeds up the hardware design process by avoiding RTL programming. However, the turnaround time of HLS increases significantly when post-route quality of results (QoR) are considered during optimization. To tackle this issue, we propose a hierarchical post-route QoR prediction approach for FPGA HLS, which features: (1) a modeling flow that directly estimates latency and post-route resource usage from C/C++ programs; (2) a graph construction method that effectively represents the control and data flow graph of source code and effects of HLS pragmas; and (3) a hierarchical GNN training and prediction method capable of capturing the impact of loop hierarchies. Experimental results show that our method presents a prediction error of less than 10% for different types of QoR metrics, which gains tremendous improvement compared with the state-of-the-art GNN methods. By adopting our proposed methodology, the runtime for design space exploration in HLS is shortened to tens of minutes and the achieved ADRS is reduced to 6.91% on average. Code and models are available at https://github.com/sjtu-zhao-lab/hierarchical-gnn-for-hls./proceedings-archive/2024/DATA/718_pdf_upload.pdf |
16:55 CET | TS01.6 | CASCO: CASCADED CO-OPTIMIZATION FOR HOLISTIC NEURAL NETWORK ACCELERATION Speaker: Chao Gao, Huawei Technologies Canada Co., Ltd, CN Authors: Bahador Rashidi1, Shan Lu1, Kiarash Aghakasiri1, Chao Gao1, Fred Xuefei Han1, Zhisheng Wang2, Laiyuan Gong2 and Fengyu Sun2 1Huawei Inc, CA; 2Huawei Inc, CN Abstract Automatic design space exploration of neural accelerators has become essential to maintain high productivity in deep learning acceleration. Previous accelerator co-design approaches mainly focus on exploring hardware and software design spaces assuming individual mapping of neural network layers is deployed on hardware independently while neglecting the vast yet crucial joint effect of layer fusion. To address this shortcoming, we propose CASCO, a cascaded and holistic co-optimization aware of all co-design aspects, including the accelerator's hardware architecture, on-chip operator mapping, and off-chip layer fusion optimization. We then propose an efficient joint-optimization algorithm framework for exploring the joint-design space efficiently by adaptive search resource allocation. Empirical results show that CASCO outperforms the existing co-design framework HASCO by a large margin. Especially, when training on the same networks, CASCO produces generalizable HW design on unseen neural network applications with EDP reduction from 1.4x to 3.2x./proceedings-archive/2024/DATA/410_pdf_upload.pdf |
17:00 CET | TS01.7 | BUSYMAP, AN EFFICIENT DATA STRUCTURE TO OBSERVE INTERCONNECT CONTENTION IN SYSTEMC TLM-2.0 Speaker: Emad Arasteh, Chapman University, US Authors: Emad Arasteh1, Vivek Govindasamy2 and Rainer Doemer2 1Chapman University, US; 2University of California, Irvine, US Abstract For designing embedded computer architectures that meet desired performance constraints at low cost, fast and accurate simulation models are needed early in the design flow. To identify and avoid bottlenecks early at the system level, observing the contention of shared resources is critical. In this paper, we propose and evaluate a novel data structure called BusyMap that accurately reflects contention at system busses or similar interconnect components. BusyMap is an efficient data structure that allows the system designer to accurately model and easily observe contention in IEEE loosely-timed TLM-2.0. In contrast to prior state-of-the-art, our model fully supports temporal decoupling and multiple levels of interconnect. Our experiments demonstrate the effectiveness of BusyBus with results showing high accuracy at high-speed SystemC simulation./proceedings-archive/2024/DATA/462_pdf_upload.pdf |
17:05 CET | TS01.8 | SENSEDSE: SENSITIVITY-BASED PERFORMANCE EVALUATION FOR DESIGN SPACE EXPLORATION OF MICROARCHITECTURE Speaker: Zheng Wu, Fudan University, CN Authors: Zheng Wu, Xiaoling Yi, Li Shang and Fan Yang, Fudan University, CN Abstract The design of modern processors is driven by plenty of benchmarks. As processors evolve and applications expand, the complexity of benchmark programs grows, which increases the computational cost of architecture design space exploration (DSE). To accelerate performance evaluations of processors in DSE, we developed a sensitivity-based framework for performance evaluation of a large set of benchmarks. The framework avoids simulating the insensitive benchmarks to the modified parameters during the exploration of designs. We developed a sampling algorithm based on evolutionary strategies to provide learning data for the sensitivity analysis and enhance the performance of the fast performance evaluation algorithm. We integrated this framework into a RISC-V processor architecture exploration framework. Our experiments revealed that we could achieve a significant acceleration in runtime with negligible accuracy loss in DSE./proceedings-archive/2024/DATA/903_pdf_upload.pdf |
17:10 CET | TS01.9 | MICROPROCESSOR DESIGN SPACE EXPLORATION VIA SPACE PARTITIONING AND BAYESIAN OPTIMIZATION Speaker: Zijun Jiang, Hong Kong University of Science and Technology, HK Authors: Zijun JIANG1 and Yangdi Lyu2 1Hong Kong University of Science & Technology (Guangzhou), CN; 2Hong Kong University of Science and Technology, HK Abstract Design space exploration (DSE) has long been a very important topic in electronic design automation (EDA), but the growing diversity of applications and the complexity of integrated circuits make conventional DSE frameworks less effective and efficient. Therefore, an exploration algorithm that can find the optimal designs with fewer samples is demanded. This paper proposes a DSE framework for microprocessors that integrates a novel optimization algorithm with EDA flows. The proposed optimization algorithm utilizes space partitioning and Bayesian optimization to explore diverse and high-dimensional design spaces in microprocessors efficiently. Using the framework, we explore the design space of VexRiscv CPUs for TinyML workloads, where our proposed optimization algorithm obtains more Pareto-optimal designs and higher hypervolume with fewer samples./proceedings-archive/2024/DATA/279_pdf_upload.pdf |
17:11 CET | TS01.10 | DEEPSEQ: DEEP SEQUENTIAL CIRCUIT LEARNING Speaker: Sadaf Khan, The Chinese University of Hong Kong, HK Authors: Sadaf Khan1, Zhengyuan Shi1, Min Li2 and Qiang Xu1 1The Chinese University of Hong Kong, HK; 2Huawei Technologies, CN Abstract Circuit representation learning is a promising research direction in the electronic design automation (EDA) field. With sufficient data for pre-training, the learned general yet effective representation can help to solve multiple downstream EDA tasks by fine-tuning it on a small set of task-related data. However, existing solutions mainly target combinational circuits, significantly limiting their applications. In this work, we propose DeepSeq, a novel representation learning framework for sequential netlists. It utilizes a dedicated graph neural network (GNN) with a customized propagation scheme to exploit the temporal correlations between gates in sequential circuits. To ensure effective learning, we propose a multi-task training objective with two sets of strongly related supervision: logic probability and transition probability at each logic gate. A novel dual attention aggregation mechanism is introduced to facilitate learning both tasks efficiently. Experimental results on various benchmark circuits show that DeepSeq outperforms other GNN models for sequential circuit learning. We evaluate the generalization capability of DeepSeq on two downstream tasks: power estimation and reliability analysis. After fine-tuning, DeepSeq accurately estimates reliability and power across various circuits under different workloads. (A tiny logic-probability propagation sketch follows this session's table.)/proceedings-archive/2024/DATA/741_pdf_upload.pdf |
17:12 CET | TS01.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
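The bit-serial PIM execution model targeted by PIMLC (TS01.1) processes many words in lockstep, one bit position per instruction. Below is a small software model of bit-serial ripple-carry addition over bit-planes; the word width and data are arbitrary, and the actual PIM instruction set and scheduling from the paper are not modelled.

```python
import numpy as np

W = 8                                             # word width in bits
rng = np.random.default_rng(0)
a = rng.integers(0, 2 ** W, size=1024)            # one operand per "row"
b = rng.integers(0, 2 ** W, size=1024)

def bit_plane(x, i):
    return (x >> i) & 1

# Bit-serial addition: all rows advance together, one bit position per step,
# mirroring how a bit-serial PIM macro would sequence its instructions.
carry = np.zeros_like(a)
result = np.zeros_like(a)
for i in range(W + 1):                            # +1 for the final carry-out
    ai, bi = bit_plane(a, i), bit_plane(b, i)
    s = ai ^ bi ^ carry                           # sum bit of this plane
    carry = (ai & bi) | (ai & carry) | (bi & carry)
    result |= s << i

assert np.array_equal(result, a + b)
print("bit-serial sum matches a + b for all", a.size, "rows")
```

The latency of this style of computation grows with the word width rather than the number of rows, which is why compiler-side scheduling of the per-bit operations matters so much for throughput.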
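DeepSeq (TS01.10) supervises its GNN with logic and transition probabilities at each gate. Purely to illustrate what a per-gate logic probability is, the snippet below propagates 1-probabilities through a three-gate toy netlist under the usual independence assumption; the paper's exact computation over large sequential netlists and workloads is more involved.

```python
# Probability that each gate output is logic 1, assuming independent inputs.
def p_not(p):
    return 1.0 - p

def p_and(p1, p2):
    return p1 * p2

def p_or(p1, p2):
    return 1.0 - (1.0 - p1) * (1.0 - p2)

# Toy netlist: y = (a AND b) OR (NOT c), with given input 1-probabilities.
p_a, p_b, p_c = 0.5, 0.5, 0.25
p_n1 = p_and(p_a, p_b)        # a AND b      -> 0.25
p_n2 = p_not(p_c)             # NOT c        -> 0.75
p_y = p_or(p_n1, p_n2)        # final output -> 0.8125
print(p_n1, p_n2, p_y)
```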
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
David Novo, LIRMM, FR
Session co-chair:
Shaahin Angizi, New Jersey Institute of Technology, US
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | TS12.1 | PARA-ZNS: IMPROVING SMALL-ZONE ZNS SSDS PARALLELISM THROUGH DYNAMIC ZONE MAPPING Speaker: Zhenhua Tan, Chongqing University of Posts and Telecommunications, CN Authors: Zhenhua Tan1, Linbo Long1, Jingcheng Shen1, Congming Gao2, Renping Liu1 and Yi Jiang1 1College of Computer Science and Technology of Chongqing University of Posts and Telecommunications, CN; 2Xiamen University, CN Abstract The emerging Zoned Namespace (ZNS) interface helps flash-based SSDs achieve high performance by dividing the logical space into fixed-size zones. Typically, a zone is mapped to blocks across multiple dies to achieve I/O parallelism. Small zones can make better use of space and are therefore widely studied. However, a small zone fails to be mapped to blocks residing on all dies, causing underutilized die-level parallelism. Meanwhile, fine-grained (i.e., plane-level) parallelism is rarely exploited for ZNS SSDs due to a strict limitation mandating that only the same type of operation can be simultaneously performed on the same address across different planes within a die. To address these issues, this paper proposes a novel small-zone ZNS-SSD design with dynamic zone mapping, named Para-ZNS. First, a new parallel block grouping module is devised to group blocks across all planes from multiple dies as a basic unit to be mapped to a zone. Such a basic mapping unit achieves parallelism among multiple dies and plane-level parallelism. Then, a die-parallelism identification module is implemented to locate idle dies. Subsequently, to fully exploit the die-level parallelism, a dynamic zone mapping scheme is employed to intelligently map the basic mapping units on the identified idle dies to open zones. The evaluation results based on a widely-used I/O tester (FIO) demonstrate that Para-ZNS improves the bandwidth by 3.42× on average in comparison to state-of-the-art work./proceedings-archive/2024/DATA/705_pdf_upload.pdf |
16:35 CET | TS12.2 | QUANTUM STATE PREPARATION USING AN EXACT CNOT SYNTHESIS FORMULATION Speaker: Hanyu Wang, ETH Zurich, CH Authors: Hanyu Wang1, Jason Cong2 and Giovanni De Micheli3 1ETH Zurich, CH; 2University of California, Los Angeles, US; 3EPFL, CH Abstract Minimizing the use of CNOT gates in quantum state preparation is a crucial step in quantum compilation, as they introduce coupling constraints and more noise than single-qubit gates. Reducing the number of CNOT gates can lead to more efficient and accurate quantum computations. However, the attainment of optimal solutions is hindered by the complexity of modeling superposition and entanglement. In this paper, we propose an effective state preparation algorithm using an exact CNOT synthesis formulation. Our method represents a milestone as the first design automation algorithm to surpass manual design, reducing the best CNOT count to prepare a Dicke state by 2x. For general states with up to 20 qubits, our method reduces the CNOT count by 9% and 32% for dense and sparse states, respectively, on average, compared to the latest algorithms./proceedings-archive/2024/DATA/417_pdf_upload.pdf |
16:40 CET | TS12.3 | BLOCKAMC: SCALABLE IN-MEMORY ANALOG MATRIX COMPUTING FOR SOLVING LINEAR SYSTEMS Speaker: Lunshuai Pan, Peking University, CN Authors: Lunshuai Pan, Pushen Zuo, Yubiao Luo, Zhong Sun and Ru Huang, Peking University, CN Abstract Recently, in-memory analog matrix computing (AMC) with nonvolatile resistive memory has been developed for solving matrix problems in one step, e.g., matrix inversion of solving linear systems. However, the analog nature sets up a barrier to the scalability of AMC, due to the limits on the manufacturability and yield of resistive memory arrays, non-idealities of device and circuit, and cost of hardware implementations. Aiming to deliver a scalable AMC approach for solving linear systems, this work presents BlockAMC, which partitions a large original matrix into smaller ones on different memory arrays. A macro is designed to perform matrix inversion and matrix-vector multiplication with the block matrices, obtaining the partial solutions to recover the original solution. The size of block matrices can be exponentially reduced by performing multiple stages of divide-and-conquer, resulting in a two-stage solver design that enhances the scalability of this approach. BlockAMC is also advantageous in alleviating the accuracy issue of AMC, especially in the presence of device and circuit non-idealities, such as conductance variations and interconnect resistances. Compared to a single AMC circuit solving the same problem, BlockAMC improves the area and energy efficiency by 48.83% and 40%, respectively./proceedings-archive/2024/DATA/753_pdf_upload.pdf |
16:45 CET | TS12.4 | A PARALLEL TEMPERING PROCESSING ARCHITECTURE WITH MULTI-SPIN UPDATE FOR FULLY-CONNECTED ISING MODELS Speaker: Yang Zhang, South China University of Technology, CN Authors: Yang Zhang, Xiangrui Wang, Dong Jiang, Zhanhong Huang, Gaopeng Fan and Enyi Yao, South China University of Technology, CN Abstract Combinatorial optimization problems (COPs), which are ubiquitous across application domains, are notoriously difficult to solve on classic von Neumann computers. As a state-of-the-art hardware acceleration scheme for COPs, Ising machines are one of the promising research directions for the next generation of computing, but they still suffer from low solution accuracy and speed due to the high complexity of the fully-connected Ising model. In this work, a novel parallel tempering processing architecture (PTPA) is proposed with a modified parallel tempering algorithm, aimed at reducing search time and improving solution quality. Several techniques are developed to further reduce hardware overhead and enhance parallelism, including an independent pipelined spin update architecture, approximated probability equations, and compact random number generators. Its prototype is implemented on FPGA with eight replicas, each replica containing 1,024 fully-connected spins and at most 64 concurrently updated spins. The proposed design achieves an average cut accuracy of 99.43% within 1 ms solution time on various G-set problems. Compared with a CPU-based parallel tempering implementation, it enhances the speed of solving max-cut problems by 5,160 times. (A compact software reference of parallel tempering on a toy Ising instance follows this session's table.)/proceedings-archive/2024/DATA/467_pdf_upload.pdf |
16:50 CET | TS12.5 | A3PIM: AN AUTOMATED, ANALYTIC AND ACCURATE PROCESSING-IN-MEMORY OFFLOADER Speaker: Qingcai Jiang, University of Science and Technology of China, CN Authors: Qingcai Jiang, Shaojie Tan, Junshi Chen and Hong An, University of Science and Technology of China, CN Abstract The performance gap between memory and processor has grown rapidly. Consequently, the energy and wall-clock time costs of moving data between the CPU and main memory dominate the overall computational cost. The Processing-in-Memory (PIM) paradigm emerges as a promising architecture that mitigates the need for extensive data movements by strategically positioning computing units proximate to the memory. Despite the abundant efforts devoted to building robust and highly available PIM systems, identifying PIM-friendly segments of applications poses significant challenges due to the lack of a comprehensive tool to evaluate the intrinsic memory access pattern of each segment. To tackle this challenge, we propose A3PIM: an Automated, Analytic and Accurate Processing-in-Memory offloader. We synthetically consider the cross-segment data movement and the intrinsic memory access pattern of each code segment via a static code analyzer. We evaluate A3PIM across a wide range of real-world workloads including the GAP and PrIM benchmarks and achieve an average speedup of 2.63x and 4.45x (up to 7.14x and 10.64x) when compared to CPU-only and PIM-only executions, respectively./proceedings-archive/2024/DATA/308_pdf_upload.pdf |
16:55 CET | TS12.6 | HYGHD: HYPERDIMENSIONAL HYPERGRAPH LEARNING Speaker: You Hak Lee, University of California, San Diego, US Authors: Jaeyoung Kang, You Hak Lee, Minxuan Zhou, Weihong Xu and Tajana Rosing, University of California, San Diego, US Abstract Hypergraphs can model real-world data that has higher-order relationships. Graph neural network (GNN)-based solutions emerged as a hypergraph learning solution, but they face non-uniform memory accesses and accompany memory-intensive and compute-intensive operations, making the acceleration with near-data processing challenging. In this work, we propose a hyperdimensional computing (HDC)-based hypergraph learning framework called HygHD, which consists of highly parallelizable and lightweight HDC operations. HygHD accelerates both the training and inference on ferroelectric field-effect transistor (FeFET)-based processing-in-memory (PIM) hardware. Furthermore, we devise a hardware-friendly block-level concatenation and fine-grained block-level scheduler for high efficiency. Our evaluation results show that HygHD offers comparable accuracy to existing GNN-based solutions. Also, HygHD on GPU is up to 443x (7.67x) faster and 142x (2.78x) more energy efficient in training (inference) than the fastest GNN-based approach on GPU. The HygHD accelerator further accelerates the HygHD algorithm, providing an average speedup of 40.0x (3.41x) on training (inference) compared to the HygHD GPU implementation./proceedings-archive/2024/DATA/973_pdf_upload.pdf |
17:00 CET | TS12.7 | SUPERFLOW: A FULLY-CUSTOMIZED RTL-TO-GDS DESIGN AUTOMATION FLOW FOR ADIABATIC QUANTUM-FLUX-PARAMETRON SUPERCONDUCTING CIRCUITS Speaker: Shuyuan Lai, Northeastern University, US Authors: Yanyue Xie1, Peiyan Dong1, Geng Yuan2, Zhengang Li1, Masoud Zabihi3, Chao Wu1, Sung-En Chang1, Xufeng Zhang1, Xue Lin1, Caiwen Ding4, Nobuyuki Yoshikawa5, Olivia Chen6 and Yanzhi Wang1 1Northeastern University, US; 2University of Georgia, US; 3Northeastern University, US; 4University of Connecticut, US; 5Yokohama National University, JP; 6Tokyo City University, JP Abstract Superconducting circuits, like Adiabatic Quantum-Flux-Parametron (AQFP), offer exceptional energy efficiency but face challenges in physical design due to sophisticated spacing and timing constraints. Current design tools often neglect the importance of constraint adherence throughout the entire design flow. In this paper, we propose SuperFlow, a fully-customized RTL-to-GDS design flow tailored for AQFP devices. SuperFlow leverages a synthesis tool based on CMOS technology to transform any input RTL netlist to an AQFP-based netlist. Subsequently, we devise a novel place-and-route procedure that simultaneously considers wirelength, timing, and routability for AQFP circuits. The process culminates in the generation of the AQFP circuit layout, followed by a Design Rule Check (DRC) to identify and rectify any layout violations. Our experimental results demonstrate that SuperFlow achieves 12.8% wirelength improvement on average and 12.1% better timing quality compared with previous state-of-the-art placers for AQFP circuits./proceedings-archive/2024/DATA/691_pdf_upload.pdf |
17:05 CET | TS12.8 | JPLACE: A CLOCK-AWARE LENGTH-MATCHING PLACEMENT FOR RAPID SINGLE-FLUX-QUANTUM CIRCUITS Speaker: Siyan Chen, ShanghaiTech University, Shanghai Innovation Center for Processor Technologies, Shanghai, China, CN Authors: Siyan Chen1, Rongliang Fu2, Junying Huang3, Zhimin Zhang4, Xiaochun Ye4, Tsung-Yi Ho5 and Dongrui Fan6 1ShanghaiTech University; Shanghai Innovation Center for Processor Technologies, CN; 2The Chinese University of Hong Kong, CN; 3SKLP, Institute of Computing Technology, CAS, CN; 4SKLP, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China, CN; 5The Chinese University of Hong Kong, HK; 6SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; ShanghaiTech, Shanghai, China, CN Abstract Superconducting rapid single-flux-quantum (RSFQ) logic has emerged as a promising candidate for future computing technology, owing to its low power consumption and high frequency characteristics. Given its ultra-high frequency operation, achieving precise timing alignment is crucial for RSFQ circuit physical design. To address the timing issue, this paper introduces JPlace, a clock-aware length-matching placement framework for RSFQ circuits. JPlace simultaneously addresses data and clock signal length matching, effectively ensuring accurate timing alignment and mitigating timing alignment challenges during the routing phase. We propose a heuristic method for constructing the clock distribution and a dynamic programming-based approach for minimizing the total vertical wirelength while maintaining fixed placement orders. Additionally, we introduce a barycenter-based reordering method to further explore the solution space and reduce wirelength. Experimental results on the RSFQ benchmark demonstrate the effectiveness and efficiency of JPlace./proceedings-archive/2024/DATA/761_pdf_upload.pdf |
17:10 CET | TS12.9 | ADAP-CIM: COMPUTE-IN-MEMORY BASED NEURAL NETWORK ACCELERATOR USING ADAPTIVE POSIT Speaker: Jingyu He, Hong Kong University of Science and Technology, HK Authors: Jingyu He, Fengbin Tu, Tim Cheng and Chi Ying Tsui, Hong Kong University of Science and Technology, HK Abstract This study proposes two novel approaches to address memory wall issues in AI accelerator designs for large neural networks. The first approach introduces a new format called adaptive Posit (AdaP) with two exponent encoding schemes that dynamically extend the dynamic range of its representation at run time with minimal hardware overhead. The second approach proposes using compute-in-memory (CIM) with speculative input alignment (SAU) to implement the AdaP multiply-and-accumulate (MAC) computation, significantly reducing the delay, area, and power consumption for the max exponent computation. The proposed approaches outperform state-of-the-art quantization methods and achieve significant energy and area efficiency improvements./proceedings-archive/2024/DATA/1077_pdf_upload.pdf |
17:11 CET | TS12.10 | DNA-BASED SIMILAR IMAGE RETRIEVAL VIA TRIPLET NETWORK-DRIVEN ENCODER Speaker: Takefumi Koike, Graduate School of Informatics, Kyoto University, JP Authors: Takefumi Koike, Hiromitsu Awano and Takashi Sato, Kyoto University, JP Abstract With the exponential growth of digital data, DNA has emerged as an attractive medium for storage and computing. Design methods for encoding, storing, and searching digital data within DNA storage are thus of utmost importance. This paper introduces image classification as a measurable task for evaluating the performance of DNA encoders in similar image searches. In addition, we propose a triplet network-based DNA encoder to improve accuracy and efficiency. The evaluation using the CIFAR-100 dataset demonstrates that the proposed encoder outperforms existing encoders in retrieving similar images, with an accuracy of 0.77, which is equivalent to 94% of the practical upper limit and 16 times faster training time./proceedings-archive/2024/DATA/478_pdf_upload.pdf |
17:12 CET | TS12.11 | UNLEASHING THE POWER OF T1-CELLS IN SFQ ARITHMETIC CIRCUITS Speaker: Rassul Bairamkulov, EPFL, CH Authors: Rassul Bairamkulov, Mingfei Yu and Giovanni De Micheli, EPFL, CH Abstract Rapid single-flux quantum (RSFQ), a leading cryogenic superconductive electronics (SCE) technology, offers extremely low power dissipation and high speed. However, implementing RSFQ systems at VLSI complexity faces challenges, such as substantial area overhead from gate-level pipelining and path balancing, exacerbated by RSFQ's limited layout density. T1 flip-flop (T1-FF) is an RSFQ logic cell operating as a pulse counter. Using T1-FF the full adder function can be realized with only 40% of the area required by the conventional realization. This cell however imposes complex constraints on input signal timing, complicating its use. Multiphase clocking has been recently proposed to alleviate gate-level pipelining overhead. The fanin signals can be efficiently controlled using multiphase clocking. We present the novel two-stage SFQ technology mapping methodology supporting the T1-FF. Compatible parts of the SFQ network are first replaced by the efficient T1-FFs. Multiphase retiming is next applied to assign clock phases to each logic gate and insert DFFs to satisfy the input timing. Using our flow, the area of the SFQ networks is reduced, on average, by 6% with up to 25% reduction in optimizing the 128-bit adder./proceedings-archive/2024/DATA/1248_pdf_upload.pdf |
17:13 CET | TS12.12 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
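For readers unfamiliar with the triplet objective mentioned in TS12.10, the sketch below is a minimal, generic Python illustration of a triplet hinge loss together with a crude base-4 quantization of an embedding to the DNA alphabet. The embedding size, margin, and thresholds are illustrative assumptions; this is not the authors' encoder.

```python
# Generic illustration, not the TS12.10 implementation: a triplet hinge loss
# pulls an anchor embedding towards a similar image and away from a dissimilar
# one; each embedding dimension is then quantized to one of the four DNA bases.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = np.linalg.norm(anchor - positive)  # distance to a similar image
    d_neg = np.linalg.norm(anchor - negative)  # distance to a dissimilar image
    return max(0.0, d_pos - d_neg + margin)    # zero once the gap exceeds the margin

def to_dna(embedding, thresholds=(-0.5, 0.0, 0.5)):
    bases = np.array(list("ACGT"))             # 2 bits per dimension, one base each
    return "".join(bases[np.digitize(embedding, thresholds)])

rng = np.random.default_rng(0)
anchor, positive, negative = rng.normal(size=(3, 8))  # toy 8-dimensional embeddings
print(triplet_loss(anchor, positive, negative), to_dna(anchor))
```

Minimising such a loss drives the codes of similar images closer together, which is what makes nearest-neighbour retrieval over the stored DNA codes meaningful.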
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 17:00 CET - 17:30 CET
Location / Room: Break-Out Room S3+4
Post-Doctoral Researcher Position on Hardware-Efficient Realization of UA Cryptographic Standards
Levent Aksoy
levent [dot] aksoy [at] taltech [dot] ee
PhD/Postdoc positions on Deep Learning Hardware Security
Ruben Salvador
ruben [dot] salvador [at] inria [dot] fr
Postdoc Position in Computer Architecture/VLSI working on In-memory computing or approximate computing
Nima TaheriNejad
nima [dot] taherinejad [at] ziti [dot] uni-heidelberg [dot] de
Postdoc and Researcher positions at Processor Research Team in RIKEN
Kentaro Sano
kentaro [dot] sano [at] gmail [dot] com
Design enablement for active back-side and early exploration flows for System-Technology Co-Optimization (STCO) in 2.5D/3D IC - Joint degree Cadence-imec PhD
Cadence Design Systems
lizzy [at] cadence [dot] com
The UNISCO project: Design of an IoT-based ecosystem for frail people
Graziano Pravadelli
graziano [dot] pravadelli [at] univr [dot] it
Research Assistant/Associate in AI Compilers and Edge Computing
José Cano
jose [dot] canoreyes [at] glasgow [dot] ac [dot] uk
System Level Postdoc at Technical University of Munich
Ulf Schlichtmann
ulf [dot] schlichtmann [at] tum [dot] de
Associate professor position in Nantes University
Sebastien Pillement
sebastien [dot] pillement [at] univ-nantes [dot] fr
Job Opportunities at the Department of Design Automation and Computing System, Peking University
Yibo Lin
yibolin [at] pku [dot] edu [dot] cn
Faculty Openings in Quantum Computing and Edge & Cloud Computing in NTU, Singapore
Arvind Easwaran
arvinde [at] ntu [dot] edu [dot] sg
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 17:30 CET - 18:00 CET
Location / Room: Break-Out Room S3+4
Session chair:
Sara Vinco, Politecnico di Torino, IT
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | YPPK.1 | GIVE YOUR CV THE EDGE WITH OPEN SOURCE EDA Presenter: Matthew Venn, TinyTapeout LLC, ES Author: Matthew Venn, TinyTapeout LLC, ES Abstract While it's important to list your experience with the proprietary tools, open source EDA tools can help you improve your CV. How? For starters, they allow you to continue gaining experience without access to expensive proprietary tools. We now have free and open tools for synthesis, place and route, static timing analysis - everything needed for a whole ASIC flow. How about adding a few more custom chips to your CV? TinyTapeout is a new platform that allows anyone to tapeout small digital and mixed signal designs for just a few hundred euros. And if you're excited to get involved in a startup, then the open source tools can dramatically extend your runway and increase your percentage ownership by driving down the need to borrow money. Speaker Bio Matthew Venn is a science & technology communicator and electronic engineer living in Valencia, Spain. He has been involved with open source silicon for the last 3 years and has sent over 20 open source chips for manufacture. He has helped over 400 people learn the design tools, with over 600 people taking part in manufacturable designs. https://mattvenn.net/ https://zerotoasiccourse.com/ https://tinytapeout.com/ |
Add this session to my calendar
Date: Monday, 25 March 2024
Time: 18:30 CET - 20:00 CET
Location / Room: Foyer
The PhD Forum is a great opportunity for PhD students to present their work to a broad audience in the system design and design automation community from both industry and academia, and to establish contacts for entering the job market. Representatives from industry and academia get a glance of state-of-the-art in system design and design automation. The PhD Forum is hosted by EDAA, ACM SIGDA, and IEEE CEDA.
The Student Teams fair brings together University Student Teams involved in international competitions and personnel from EDA and microelectronics companies to build new collaborations.
Team name: KITcar e.V. Team
KITcar is a young and motivated team of around 50 students from the Karlsruhe Institute of Technology. The goal of the team is to participate in competitions for autonomous driving. Thanks to the continuous development of its cars and many years of participation in the former Carolo Cup, KITcar e.V. is one of the most experienced and successful teams in this area, with its most recent victory in 2023.
Time | Label | Presentation Title Authors |
---|---|---|
18:30 CET | PhDF.1 | PROCESSING-IN-MEMORY ARCHITECTURES FOR GRAPH PROCESSING Presenter: Yu Huang, Huazhong University of Science & Technology, CN Author: Yu Huang, Huazhong University of Science & Technology, CN Abstract Despite the potential benefits of PIM devices for graph processing, there are several issues with the current PIM architecture research. One of the major issues is the insufficient parallel capability, as the scale of graph data increases rapidly, demanding higher parallelism from the computing architecture. However, the limited parallelism of the planar memristor architecture makes it difficult to meet the high parallelism requirements. Another issue is the low parallel efficiency due to the sparse topology structure of real-world graph data. This results in a low execution efficiency of graph processing loads on regularized PIM architectures, limiting the computational potential of the PIM architecture. Additionally, the current PIM architectures exhibit a single parallel mode, which cannot effectively accommodate the diverse data structures and heterogeneous computational characteristics of emerging graph processing applications, such as graph learning and genome graph analysis. Further, graph data exhibits dynamic behavior in practical applications, but current approaches are predominantly geared towards static graph data, neglecting the dynamic aspects. This dissertation aims to overcome the challenges in graph processing by implementing architecture-level optimizations. |
18:30 CET | PhDF.2 | REVOLUTIONIZING CRYOGENIC COMPUTING THROUGH FERROELECTRIC-SUPERCONDUCTOR INTERPLAY Speaker and Author: Shamiul Alam, University of Tennessee Knoxville, US Abstract To unleash the full potential of a quantum computer, it must be scaled up to thousands of qubits. This poses a major challenge due to the requirement of a huge number of lossy wires and interconnects to link room temperature control circuitry and storage units with the qubits placed at cryogenic temperature. The key solution lies in relocating the controller and memory to cryogenic temperatures, enabling the use of a more efficient, smaller number of lossless superconducting interconnects. Such a cryogenic computing platform also holds exceptional promise for exa-scale high-performance computing and aerospace explorations. While both semiconductor and superconductor-based technologies have been explored for developing cryogenic memory and processing units, semiconductor-based CMOS technology falls short due to its high-power consumption. On the other hand, superconducting devices offer unparalleled energy efficiency but face challenges such as low-density memory systems, single fanout issues in logic circuits, and difficulties in driving high load impedances. During my PhD, I have focused on developing suitable cryogenic memory and logic systems by leveraging the interplay between non-superconducting ferroelectric materials and superconducting devices. I have developed a non-volatile cryogenic memory system, capable of providing the advantages of superconducting memories but with significantly higher scalability. Subsequently, I used the same device to design a voltage-controlled Boolean logic family. This novel topology overcomes the fanout limitations of the existing superconducting logic circuits and enables seamless connections between two logic stages. Finally, I also used the unique properties of the developed memory system to enable in-memory computing, bringing unique advantages tailored specifically for the cryogenic environment./proceedings-archive/2024/pdf/20002_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.3 | HARDWARE DESIGNS FOR SECURE MICROARCHITECTURES Speaker: Jan Philipp Thoma, Ruhr-University Bochum, DE Authors: Jan Thoma and Tim Güneysu, Ruhr-University Bochum, DE Abstract Much of the massive increase in computing power over the past decades has been due to advances in processor microarchitecture design. Today's processors are immensely complex and implement a variety of performance-enhancing technologies such as multi-core architectures, superscalar pipelines, multi-level caches, out-of-order execution, and branch prediction. Among other things, these features enable processors to be highly efficient and to execute many different applications seemingly in parallel. Crucially, these parallel processes are not always mutually trusted. Therefore, the CPU (in combination with the operating system) has the crucial responsibility of separating the processes from each other. However, despite the separation, they are still running on the same hardware. This can lead to contention, where the operations of one process affect the latency of the operations of the other process. These timing differences form what is known as a side channel, an unintended communication channel through which one process can learn information from the other. In the past, it has been shown that side channels in CPU microarchitectures can be exploited for a variety of attacks, including key recovery attacks on cryptographic algorithms. In this thesis, we investigate microarchitectural attacks and countermeasures on modern processors. Since caches are typically shared by all processes and thus provide an accessible and reliable side channel, the first part of the thesis focuses on cache side channel attacks and countermeasures. We present new cache-based side channels, explore different mitigation strategies, apply them to the CPU microarchitecture, and evaluate the security and performance of the proposed designs. In the second part of the thesis, we investigate methods to prevent speculation-based transient execution attacks. Our approach is based on the premise that efficient CPUs do not need to rely on speculation as much as current processors do. Instead, we investigate ISA features that reduce the dependency on speculation./proceedings-archive/2024/pdf/20003_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.4 | DYNAMO: A FRAMEWORK TO OPTIMIZE, VERIFY AND RECONFIGURE FLEXIBLE MANUFACTURING SYSTEMS Speaker: Sebastiano Gaiardelli, University of Verona, Dept. of Engineering for Innovation Medicine, Italy, IT Authors: Sebastiano Gaiardelli and Franco Fummi, Università di Verona, IT Abstract Industry 4.0 revolutionized the concept of manufacturing systems by introducing technologies such as Industrial Internet of Things (IIoT) sensors and collaborative robots. This revolution has deeply changed manufacturing systems by increasing their interconnection and flexibility capabilities. In such a context, timely response to unforeseen events (e.g., machine breakdown and production delays) is a fundamental requirement for reducing production costs and wastes. This thesis proposes a novel model-based framework titled Dynamic Manufacturing Orchestrator (DynaMO). Leveraging knowledge embedded in models, DynaMO reasons over an abstraction layer of the underlying manufacturing systems, enabling its application across diverse manufacturing contexts. On top of this abstraction layer, the framework incorporates a set of methodologies to optimize, verify, and reconfigure the underlying manufacturing system. The proposed framework has been prototyped to govern a real production line, interacting with a commercial Manufacturing Execution System (MES). This integration fully unleashes the production line's flexibility, showcasing the advantages of the proposed framework./proceedings-archive/2024/pdf/20004_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.5 | DESIGN AUTOMATION TOOLS AND SOFTWARE FOR QUANTUM COMPUTING Speaker and Author: Lukas Burgholzer, TU Munich, DE Abstract In the 1970s, researchers started to use quantum mechanics to address questions in computer science—establishing new research directions such as quantum computing. Now, five decades later, we are at the dawn of a new "computing age" in which quantum computers indeed will find their way into practical applications. However, while impressive accomplishments can be observed in physical realizations of quantum computers, the development of automated tools and methods that provide assistance in the design of applications for those devices is at risk of not being able to keep up with this development anymore—leaving a situation where we might have powerful quantum computers but hardly any proper means to actually use them. The main contribution of the thesis is the development of various design automation tools and open-source software packages for the domain of quantum computing that support the design and development of this future computing technology. More precisely, the work covers selected contributions to the areas of simulation, compilation and verification of quantum circuits./proceedings-archive/2024/pdf/20005_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.6 | HW-SW CO-EXPLORATION AND OPTIMIZATION FOR NEXT-GENERATION LEARNING MACHINES Speaker: Chunyun Chen, Nanyang Technological University, SG Authors: Chunyun Chen and Mohamed M. Sabry Aly, Nanyang Technological University, SG Abstract Deep Neural Networks (DNNs) are proliferating in numerous AI applications, thanks to their high accuracy. For instance, Convolution Neural Networks (CNNs), one variety of DNNs, are used in object detection for autonomous driving and have reached or exceeded the performance of humans in some object detection problems. Commonly adopted CNNs such as ResNet and MobileNet are becoming deeper (more layers) while narrower (smaller feature maps) than early AlexNet and VGG. Nevertheless, due to the race for better accuracy, the scaling up of DNN models, especially Transformers (another variety of DNNs), to trillions of parameters and trillions of Multiply-Accumulate (MAC) operations, as in the case of GPT-4, during both training and inference, has made DNN models both data-intensive and compute-intensive, placing heavier demands on memory capacity for storing weights and on computation. This poses a significant challenge for the deployment of these models in an area-efficient and power-efficient manner. Given these challenges, model compression is a vital research topic to alleviate the crucial difficulties of memory capacity from the algorithmic perspective. Pruning, quantization, and entropy coding are three directions of model compression for DNNs. The effectiveness of pruning and quantization can be enhanced with entropy coding for further model compression. Entropy coding focuses on encoding the quantized values of weights or features in a more compact representation by utilizing the peaky distribution of the quantized values, to achieve a lower number of bits per variable, without any accuracy loss. Currently employed Fixed-to-Variable (F2V) entropy coding schemes such as Huffman coding and Arithmetic coding are inefficient to decode on hardware platforms, suffering from a high decoding complexity of O(n · k), where n is the number of codewords (quantized values) and k is the reciprocal of the compression ratio. Read more ... /proceedings-archive/2024/pdf/20007_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.7 | DYNAMIC MEMORY MANAGEMENT TECHNIQUES FOR OPTIMIZATION OVER HETEROGENEOUS DRAM/NVM SYSTEMS Speaker: Manolis Katsaragakis, National Technical University of Athens, GR Authors: Manolis Katsaragakis1, Francky Catthoor2 and Dimitrios Soudris1 1National TU Athens, GR; 2IMEC, BE Abstract This PhD focuses on the development of a systematic methodology for source code organization, data structure refinement, and data exploration and placement over emerging memory technologies. The goal is to extract alternative solutions that provide multi-criteria tradeoffs across different optimization aspects, such as memory footprint, accesses, performance and energy consumption./proceedings-archive/2024/pdf/20008_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.8 | INTEGRATING BIOLOGICAL AND ARTIFICIAL NEURAL NETWORKS PROCESSING ON FPGAS Speaker: Gianluca Leone, University of Cagliari, IT Authors: Gianluca Leone and Paolo Meloni, Università degli Studi di Cagliari, IT Abstract Neural interfaces are rapidly gaining momentum in the current landscape of neuroscience and bioengineering. This is due to a) unprecedented technology capable of sensing biological neural network electrical activity b) increasingly accurate analytical models usable to represent and understand dynamics and behavior in neural networks c) novel and improved artificial intelligence methods usable to extract information from recorded neural activity. Nevertheless, all these instruments pose significant requirements in terms of processing capabilities, especially when focusing on embedded implementations, respecting real-time constraints, and exploiting resource-constrained computing platforms. Acquisition frequencies, as well as the complexity of neuron models and artificial intelligence methods based on neural networks, pose the need for high throughput processing of very high data rates and expose a significant level of intrinsic parallelism. Thus, a promising technology serving as a substrate for implementing efficient embedded neural interfaces is represented by FPGAs, that enable the use of configurable logic, organizable memory blocks, and parallel DSP slices. In this thesis, we assess the usability of FPGA in this domain by focusing on: a) Real-time processing and analysis of MEA-acquired signals featuring spike detection and spike sorting on 5,500 recording electrodes; b) Real-time emulation of a biologically relevant spiking neural network counting 3,098 Izhikevich neurons and 9.6e6 synaptic interconnections; c) Real-time execution of spiking neural networks for neural activity decoding during a delayed reach-to-grasp task addressing low-power embedded applications./proceedings-archive/2024/pdf/20009_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.9 | REINFORCEMENT LEARNING FRAMEWORKS FOR PROTECTING INTEGRATED CIRCUITS Speaker and Author: Vasudev Gohil, Texas A&M University, US Abstract Integrated circuits (ICs) are essential components of modern electronics, but they are vulnerable to various security threats, such as hardware Trojans (HTs) and fault-injection attacks. These attacks can compromise the integrity and functionality of ICs, leading to severe consequences in critical applications. To protect ICs from these attacks, we need to develop effective techniques for detecting and analyzing them. In this research, we propose a novel application of reinforcement learning (RL), an optimization technique that can handle dynamic and uncertain environments, to address these security challenges. We use RL agents to perform various tasks related to HT detection and fault-injection vulnerability analysis. In particular, (i) we devise an RL agent for detecting HTs using logic testing, (ii) we devise another RL agent for detecting HTs using side-channel analyses, (iii) we question the false sense of security provided by existing HT detection techniques and devise an RL agent that inserts stealthy HTs that evade all prior detection techniques, and (iv) we devise an RL agent that automatically evaluates vulnerabilities of block ciphers to fault attacks. Experimental results demonstrate the effectiveness of our techniques. We outperform prior works on logic testing-based HT detection and side-channel-analysis-based HT detection by factors of 169⨉ and 34⨉. Our RL agent for inserting stealthy HTs evades 8 HT detection techniques. Our RL agent for finding fault-injection vulnerabilities finds all known faults for the AES and GIFT block ciphers (which took more than a decade of human expert research) in less than 24 hours. Additionally, we also find a novel fault in the GIFT block cipher that had not been found by any human expert. Our research contributes to the advancement of the state-of-the-art in IC security, and opens up new possibilities for using RL in this domain./proceedings-archive/2024/pdf/20011_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.10 | EFFICIENT DEPLOYMENT OF EARLY-EXIT DNN ARCHITECTURES ON FPGA PLATFORMS Speaker: Anastasios Dimitriou, University of Southampton, GB Authors: Anastasios Dimitriou, Geoff Merrett and Jonathon Hare, University of Southampton, GB Abstract Dynamic Deep Neural Networks (DNNs) can enhance speed and reduce computational intensity during inference by allocating fewer resources to easily recognizable or less informative parts of an input. These networks strategically deactivate components, such as layers, channels, or sub-networks, based on data-dependent decisions. However, their application has only been explored on traditional computing systems (CPU+GPU) using libraries designed for static networks, limiting their impact. This paper explores the benefits and limitation of realizing these networks on FPGAs. It introduces and investigates two approaches to efficiently implement early-exit dynamic networks on these platforms, focusing on the implementation of the sub-networks responsible for the data-dependent decisions. The pipeline approach utilizes existing hardware for the sub-network execution, while the parallel approach employs dedicated circuitry. We assess the performance of each approach using the BranchyNet early exit architecture on LeNet-5, Alexnet, VGG19 and ResNet32, evaluating on a Xilinx ZCU106. The pipeline approach demonstrates a 36% improvement over a desktop CPU, consuming 0.51 mJ per inference, 16 times lower than a non-dynamic network on the same platform and 8 times lower than an Nvidia Jetson Xavier NX. Meanwhile, the parallel approach achieves a 17% speedup over the pipeline approach in scenarios without early exits during dynamic inference, albeit with a 28% increase in energy consumption. Finally, we explore a dynamic placement of the exit points in these types of dynamic networks, taking advantage of the versatility that the parallel approach offers. Thus, we managed to further reduce the number of computations on an early-exit ResNet32 by 24%./proceedings-archive/2024/pdf/20012_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.11 | FORMALLY PROVED MEMORY CONTROLLERS Presenter: Felipe Lisboa Malaquias, Télécom Paris, FR Author: Felipe Lisboa Malaquias, Télécom Paris, FR Abstract In order to prove conformance to memory standards and bound memory access latency, recently proposed real-time DRAM controllers rely on paper and pencil proofs, which can be troubling: they are difficult to read and review, they are often shown only partially and/or rely on abstractions for the sake of conciseness, and they can easily diverge from the controller implementation, as no formal link is established between the two. We propose a new framework written in Coq, in which we model a DRAM controller and its expected behavior as a formal specification. The trustworthiness of our solution is two-fold: 1) proofs that are typically done on paper and pencil are now done in Coq and thus certified by its kernel, and 2) the reviewer's job becomes making sure that the formal specification matches the standards – instead of performing a thorough check of the underlying mathematical formalism. Our framework provides a generic DRAM model capturing a set of controller properties as proof obligations, which all implementations must comply with. We focus on properties related to the respect of timing constraints imposed by the memory standards, the correctness of the DRAM command protocol and the assurance that every incoming request is handled in bounded time. We refine our specification with two implementations based on widely-known arbitration policies – First-in First-Out (FIFO) and Time-Division Multiplexing (TDM). We extract proved code from our model and use it as a "trusted core" on a cycle-accurate DRAM simulator./proceedings-archive/2024/pdf/20013_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.12 | EFFICIENT HARDWARE ACCELERATOR FOR CONVOLUTIONAL NEURAL NETWORK BASED INFERENCE ENGINES Speaker: Deepika S, Vellore Institute of Technology (VIT), Vellore, IN Authors: Deepika S and Arunachalam V, VIT, IN Abstract To improve the speed and accuracy of inference in various fields such as automotive, medical, and industrial, Convolutional Neural Network (CNN) based image classifiers are employed. Further, these pre-trained Inference Engines (IEs) need to be accurate and energy efficient. In the first part of the thesis, an optimized 2×1 convolutional operator (CO) is designed, implementing a Fix/Float data format with half-precision floating-point (HPFP) for weights and 16-bit fixed-point representation for input feature maps. MATLAB-based range and precision analysis guide the selection of these formats, demonstrating that the proposed CO achieves a remarkable 97% accuracy in worst-case scenarios when compared to single-precision floating-point (SPFP). The convolutional operator, featuring a multiplication operation processing unit (MOPU), exhibits superior hardware efficiency, requiring 22% less area and 17.98% less power than HPFP at a clock speed of 250 MHz. A speed of 750 MOPS and a hardware efficiency of 24.22 TOPS/W are also achieved. The second part of the thesis explores the energy efficiency of CNN-based IEs, emphasizing the role of performance and power consumption. A sparse-based accelerator is introduced, capitalizing on sparsity in both IFMs and weights to compress insignificant inputs and skip inefficient computations. The study reveals layer-wise sparsity ranging from 18% to 90% in IFMs, paving the way for a 3×1 convolutional array with zero-detect-skip (CZS-3×1) control units. The proposed 20-compressed Processing Element (CPE) using 3×3 of CZS-3×1 achieves impressive performance metrics, with 90 GOPS and an energy efficiency of 3.42 TOPS/W. The proposed design achieves high throughput and low latency, making it well suited for real-time edge computing applications./proceedings-archive/2024/pdf/20015_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.13 | OPTIMIZING NEURAL NETWORKS FOR EMBEDDED EDGE-PROCESSING PLATFORMS Speaker: Paola Busia, Università degli Studi di Cagliari, IT Authors: Paola Busia and Paolo Meloni, DIEE, Università degli Studi di Cagliari, IT Abstract Convolutional Neural Networks (CNNs) have become a standard approach in several application fields. This has created a growing interest in enabling their deployment on processing systems designed to operate in the edge computing domain. In this context, network design demands a tight trade-off between the model's accuracy and its efficiency, to be obtained with careful evaluation of a large space of available design parameters. Completing this task to reach a near-optimal design within a reasonable time creates the need for efficient exploration strategies. Additionally, given the heterogeneity of the hardware architectural solutions, the comparison of the efficiency of alternative CNN models requires a certain degree of platform awareness. Finally, the neural network domain is dynamic and always evolving, with the introduction of new operands and architectural solutions, enriching the CNN baseline of new features and capabilities. A relevant example is represented by the transformer, whose success motivates the interest in enabling its efficient deployment at the edge. The aim of this work is to contribute to these challenges, as outlined in the three following points: • the definition of an effective design exploration strategy, comprehensively considering the effects of tuning the available design parameters on the performance and efficiency of edge implementation, within a limited exploration time; • the development of an accurate estimation method to predict the on-hardware efficiency of a candidate design point on a given target platform; • the application of hardware-aware design concepts to a transformer architecture./proceedings-archive/2024/pdf/20017_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.14 | SECURITY VALIDATION OF SYSTEMS IMPLEMENTATIONS Presenter: Aruna Jayasena, University of Florida, US Author: Aruna Jayasena, University of Florida, US Abstract A System-on-Chip (SoC) is an integrated circuit (IC) that combines various components and functionalities of a computer or electronic system into a single chip. These components typically include a central processing unit (CPU), memory, input/output ports, and sometimes additional components like graphics processing units (GPUs), networking modules, and more. Firmware is used to specify the functionality of and incorporate the controlling logic for different peripherals. Hardware and firmware are combined to provide the expected functionality of the device, along with reconfigurability and updatability as user requirements change, contributing to a compact design, power efficiency, and overall performance. While SoCs offer great functionality and efficiency, the complexity of the devices built with them raises more security concerns for users and their data. My research thesis is focused on security validation of systems implementations (hardware and firmware). Based on the current progress of my research, I was able to pursue several research directions and they can be divided into three categories. 1) Vulnerability analysis and mitigations is the first category, where I analyzed system implementations to identify vulnerable components. These vulnerabilities are confirmed using different directed test generation techniques (Jayasena et al., 2024). Then mitigations and detection techniques are proposed for each of the identified vulnerabilities. 2) The next category is focused on directed test generation for achieving different security objectives. This avenue involves proposing novel test generation techniques for objectives such as side-channel evaluations and malicious implant detection. 3) The final category focuses on different system validation techniques based on directed test generation. |
18:30 CET | PhDF.15 | ARCHITECTING FAST AND ENERGY-EFFICIENT HARDWARE ACCELERATORS FOR EMERGING ML WORKLOADS USING HARDWARE-SOFTWARE CO-DESIGN Speaker: Tom Glint, FZ Jülich, DE Author: Tom Glint, IIT Gandhinagar, IN Abstract This research presents a suite of innovative contributions to the field of DNN hardware accelerators, addressing the critical need for energy efficiency and computational speed in machine learning. It encompasses four distinct studies: the Approximate Fixed Posit Number System, 3D Stacked Accelerators, the 3D UNet Accelerator, and the DeepFrack Scheduling Framework. Each study introduces novel approaches to enhance DNN processing, ranging from efficient data representation and leveraging 3D stacking technology to optimizing specific computational challenges in medical imaging and revolutionizing DNN layer processing. Collectively, these works demonstrate significant strides in reducing energy consumption, accelerating computational speeds, and increasing the adaptability of DNN accelerators. The research contributes to setting new benchmarks in the efficiency and practicality of hardware accelerators, poised to meet the escalating computational demands in the evolving landscape of machine learning./proceedings-archive/2024/pdf/20019_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.16 | OPTIMIZED FLOATING-POINT ARITHMETIC UNIT FOR EFFICIENT OPERATIONS IN DNN ACCELERATORS Speaker: Jing Gong, University of New South Wales, AU Authors: Jing Gong1 and Sri Parameswaran2 1University of New South Wales, AU; 2University of Sydney, AU Abstract Over the past decade, the evolution of deep neural networks (DNNs) has boosted the development of specialized accelerators to enhance performance and resource efficiency for both training and inference phases. These accelerators incorporate advanced spatial array architectures like systolic arrays and vector processing units, optimized dataflows, and on-chip storage hierarchies. Because floating-point (FP) arithmetic is essential in training, the need for efficient FP operations becomes paramount due to their high resource demands compared to the integer or fixed-point computations used in inference accelerators. This necessitates a shift towards optimizing FP arithmetic operations within these sophisticated architectures to enhance DNN accelerator performance during the resource-intensive training phase. Central to DNN computations is the General Matrix Multiplication (GEMM) operation, involving extensive multiplication and accumulation operations that dictate the computational intensity and processing power required. Our research focuses on optimizing these key aspects, multiplication and accumulation, through tools like ApproxTrain, which simulates the impact of approximate FP multipliers on DNN accuracy, and the same-signed separate accumulation (SEA) scheme, an innovative accumulation method that improves overall efficiency by separately accumulating same-signed terms using efficient same-signed FP adders before combining them (an illustrative sketch of this separate-accumulation idea is given after the PhD Forum listing below)./proceedings-archive/2024/pdf/20020_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.17 | INVESTIGATION OF SENSITIVITY OF DIFFERENT LOGIC AND MEMORY CELLS TO LASER FAULT INJECTIONS Speaker and Author: Dmytro Petryk, IHP – Leibniz Institute for High Performance Microelectronics, DE Abstract Plenty of semiconductor devices are developed to operate with private data. To guarantee security requirements, e.g. confidentiality, data integrity, authentication, etc., cryptographic algorithms are used. Theoretically, the cryptographic algorithms using keys with recommended lengths are secure. The issue is that the devices are usually physically accessible, i.e. an attacker can steal them and attack in a lab with the goal to extract cryptographic keys, e.g. by means of Fault Injections (FIs). One class of FI attacks exploits the sensitivity of semiconductors to light and is typically performed using a laser. This work is devoted to a study of the sensitivity of different logic and memory cells to optical FI attacks. Front-side attacks against cells manufactured in different IHP technologies were performed using two different red lasers. To reach the repeatability of the experimental results and to increase the comparability of the results with attack results published in literature the laser beam parameters, positioning accuracy as well as different setting parameters of the setup were experimentally evaluated. Attacks were performed against inverter, NAND, NOR, flip-flop cells from standard libraries, radiation-hard flip-flops based on Junction Isolated Common Gate technique, radiation-tolerant Triple Modular Redundancy registers as well as non-volatile Resistive Random Access Memory (RRAM) cells. The results of attacks against volatile circuits were successful transient bit-set and bit-reset as well as permanent stuck-at faults. The results of attacks against RRAM cells were successful manipulation of all logic states. The faults injected are repeatable and reproducible. The goal of this work was not only to achieve successful manipulation of the cell state output but also to determine cell area(s) sensitive to laser. Knowledge about areas sensitive to laser can be used by designers to implement corresponding countermeasure(s) at the initial stage of chip development and is the necessary step to design appropriate countermeasures./proceedings-archive/2024/pdf/20021_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.18 | HORIZONTAL ADDRESS-BIT SCA ATTACKS AND COUNTERMEASURES Speaker and Author: Ievgen Kabin, IHP – Leibniz Institute for High Performance Microelectronics, DE Abstract Due to the main application areas of Elliptic Curve Cryptography (ECC), i.e. IoT and WSN, efficient countermeasures against physical attacks are highly important for cryptographic accelerators. Especially dangerous are horizontal non-invasive power and electromagnetic analysis attacks, i.e. attacks revealing the secret scalar k by analysing only a single power or electromagnetic trace measured during a kP execution. The regularity and the atomicity principles are well-known kinds of algorithmic countermeasures against simple SCA attacks. In this work, we investigated the resistance of binary kP implementations based on the Montgomery ladder as well as on an atomic pattern algorithm against horizontal SCA attacks. It was possible to completely reveal the secret value k by analysing a single execution trace, using simple statistical methods as well as automated simple power analysis. We determined the SCA leakage sources and their nature: the reason causing the success of our attacks is the key-dependent addressing of the registers and other design blocks, which is an inherent feature of binary kP algorithms. We investigated the influence of state-of-the-art countermeasures on the resistance of the design. Nowadays, effective algorithmic countermeasures against horizontal address-bit SCA are not known. As a means of reducing the attack success, we investigated the possibility of using one of the design blocks as a source of digital noise. Additionally, we proposed regular scheduling when addressing the block as an effective strategy for reducing the success of the attacks. The mentioned analysis methods can be successfully applied to determine SCA leakage sources in early design phases. Furthermore, the address-bit attacks and countermeasures presented in this study are relevant not only to ECC but also to any cryptographic algorithms that involve key-dependent addressing. The protection mechanisms proposed here can be a basis for automating the design process. It can be helpful for engineering side-channel resistant cryptographic implementations./proceedings-archive/2024/pdf/20022_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.19 | POWER-SUPPLY NOISE MONITORING TO EVALUATE EMBEDDED VRMS AND SI-BACKSIDE BURRIED DECAPS Speaker: Kazuki Monta, Kobe University, JP Authors: Kazuki Monta and Makoto Nagata, Kobe University, JP Abstract Power supply noise is a critical issue in modern IC chips that has a significant impact on performance and reliability. In particular, advanced CMOS technology and dense chip stacking have increased the need for power supply noise management and optimization. Properly evaluating and effectively suppressing power supply noise have been more important than ever. We first developed a method to accurately understand and evaluate the behavior of power supply noise inside IC chips in "Testing Embedded Toggle Generation Through On-Chip IRDrop Measurements". This technique is essential for a detailed analysis of power supply noise. Then, in "3-D CMOS Chip Stacking for Security ICs Featuring Backside Buried Metal Power Delivery Networks With Distributed Capacitance," we approached the problem of power supply noise suppression by utilizing the structure of 3-D CMOS stacking. These two papers make important contributions regarding the understanding and management of power supply noise./proceedings-archive/2024/pdf/20023_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.20 | MODELING ENERGY AND PERFORMANCE IN ENERGY HARVESTING IOT SYSTEMS Speaker: Fatemeh Ghasemi, Norwegian University of Science and Technology, NO Authors: Fatemeh Ghasemi and Magnus Jahre, Norwegian University of Science and Technology, NO Abstract The PhD thesis that this extended abstract summarizes focuses on the fundamental question of how to apply analytical modeling to optimize performance and energy consumption in energy-harvesting IoT systems. More specifically, we apply analytical modeling along three distinct research directions: (i) early-stage analysis of energy-harvesting IoT systems which led to proposing the Periodic Energy Harvesting Systems (PES) model; (ii) facilitating efficient and accurate empirical evaluation of energy-harvesting IoT systems where our key contribution is the Energy Subsystem Simulator (ESS); and (iii) Energy-aware Connection Management (ECM)./proceedings-archive/2024/pdf/20024_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.21 | THIN-FILM COMPUTE-IN-MEMORY CIRCUITS AND SYSTEMS FOR LARGE-AREA SENSING APPLICATIONS Speaker: Jialong Liu, Tsinghua University, CN Authors: Jialong Liu, Huazhong Yang and Xueqing Li, Tsinghua University, CN Abstract Large-area and flexible sensing has been widely used in various scenarios. These applications require more intelligent processing and faster data transmission with the development of Internet-of-Things (IoT) and increase of edge data. However, most large-area sensing devices suffer from unstable long wiring, complicated interfaces, as well as huge cost of data movement, which is still far away to satisfy the developing demands. Addressing these challenges, this abstract is aiming at achieving large-area smart sensing systems with low power, low latency and high reliability. The key technology we take is to adopt thin-film transistor (TFT)-based compute-in-memory (CiM) in large-area sensing applications. However, there still exist several barriers on the path to this expected computing paradigm, including the low performance and high variation of TFT devices, the tradeoff between pre-processing and data compression, and the lacking of design and evaluation tools. This work has made a cross-layer exploration from devices to systems with a series of works solving the current problems of TFT-based CiM and in-sensor computing. On circuit level, several TFT-based CiM chips with different computing schemes and cell structures are proposed, achieving high linearity, density and reliability; On architecture and algorithm level, high-parallelism pre-processing and data compression are performed simultaneously by hardware-software co-optimization; On system and tool-chain level, large-area behavior-level simulator, EDA tools and test platform for TFT technologies are built up. we expect the proposed TFT-based smart sensing systems to reach higher improvements in wider range of application scenarios in the future./proceedings-archive/2024/pdf/20028_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.22 | OPERATIONAL DATA ANALYTICS IN HPC: TOWARDS ANOMALY PREDICTION Speaker: Martin Molan, Università di Bologna, IT Authors: Martin Molan and Andrea Bartolini, Università di Bologna, IT Abstract The transition to exascale in high-performance computing (HPC) systems and the associated increase in complexity motivates the introduction of machine learning methodologies to support the work of system administrators. This work examines the current self-supervised approaches for anomaly detection and the transition to supervised anomaly prediction powered by the introduction of graph neural networks. We present an anomaly prediction model where compute nodes are represented as vertexes in a line graph, described by the attributes collected by the monitoring system. The preliminary results of the graph anomaly prediction model show that it outperforms the current anomaly detection model, even predicting up to 24 hours in advance./proceedings-archive/2024/pdf/20029_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.23 | DYNAMIC DNNS AND RUNTIME MANAGEMENT FOR EFFICIENT INFERENCE ON MOBILE/EMBEDDED DEVICES Speaker: Lei Xun, University of Southampton, GB Authors: Lei Xun, Jonathon Hare and Geoff Merrett, University of Southampton, GB Abstract Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms due to several key advantages in latency, privacy and always-on availability. However, due to limited computing resources, efficient DNN deployment on mobile and embedded platforms is challenging. Although many hardware accelerators and static model compression methods were proposed by previous works, at system runtime, multiple applications are typically executed concurrently and compete for hardware resources. This raises two main challenges: Runtime Hardware Availability and Runtime Application Variability. Previous works have addressed these challenges through either dynamic neural networks that contain sub-networks with different performance trade-offs or runtime hardware resource management. In this thesis, we propose a combined method: a system for DNN performance trade-off management that combines the runtime trade-off opportunities in both algorithms and hardware to meet dynamically changing application performance targets and hardware constraints in real time. We co-designed novel Dynamic Super-Networks to maximise runtime system-level performance and energy efficiency on heterogeneous hardware platforms. Compared with SOTA, our experimental results using ImageNet on the GPU of Jetson Xavier NX show our model is 2.4x faster for similar ImageNet Top-1 accuracy, or 5.1% higher accuracy at similar latency. We also designed a hierarchical runtime resource manager that tunes both dynamic neural networks and DVFS at runtime. Compared with the Linux DVFS governor schedutil, our runtime approach achieves up to a 19% energy reduction and a 9% latency reduction in a single-model deployment scenario, and an 89% energy reduction and a 23% latency reduction in a two-concurrent-model deployment scenario./proceedings-archive/2024/pdf/20030_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.24 | DESIGN METHODOLOGIES AND ARCHITECTURES FOR APPLICATION-SPECIFIC COARSE-GRAIN RECONFIGURABLE ACCELERATORS Speaker and Author: Francesco Ratto, Università degli Studi di Cagliari, IT Abstract FPGA HeSoCs integrate programmable logic with hard processor cores on the same chip. Despite their energy efficiency and performance, their usage complexity limits their adoption. To take full advantage of FPGA HeSoCs, programming models and tools that target them efficiently are needed. The research activity described in this abstract focuses on: power awareness, intended as the possibility of having a clear view of the impact of design choices on the final power consumption; flexibility, intended as the capability of the system to execute various workloads; and adaptivity, the ability to change the execution mode to adapt to internal or external changes. These challenges are addressed by focussing on accelerators implemented on the FPGA side. The target architectures are dataflow-based Coarse-Grain Reconfigurable (CGR) accelerators: adopting a model-based architecture eases the translation of high-level specifications into a custom accelerator; coarse-grain reconfigurability ensures fast and memory-lightweight reconfiguration. Power awareness is investigated by implementing Clock Gating (CG) at different granularity levels in CGR accelerators. This study analyzes the mutual impact between the CG granularity and the HLS tool used for the design of the accelerator. Results show that HLS-generated designs leave room for power optimizations. Flexible hardware multi-threading is delivered through a design methodology that leverages tagged dataflow models and extends the features of the MDC tool. A complete hardware/software architecture and design automation process is defined, which allows the design, deployment, and programming of multi-thread hardware accelerators on FPGA HeSoCs. A complete toolchain for adaptive CNN accelerators is also presented that takes as input a CNN model and produces a runtime adaptive accelerator to be deployed on an FPGA HeSoC. A resulting accelerator, applied to a video surveillance use case, is demonstrated to be capable of trading off execution latency with reduced energy consumption at runtime./proceedings-archive/2024/pdf/20031_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.25 | EFFICIENT SYNTHESIS METHODS FOR HIGH-QUALITY APPROXIMATE COMPUTING CIRCUITS AND ARCHITECTURES Speaker and Author: Chang Meng, EPFL, CH Abstract Approximate computing is an emerging computing paradigm to reduce hardware cost for error-tolerant applications. This thesis proposes efficient synthesis methods for high-quality approximate circuits and architectures. At the circuit layer, this thesis studies approximate logic synthesis (ALS), which can automatically produce approximate designs under specific error constraints. The performance of an ALS algorithm can be evaluated from two aspects: quality and efficiency. To improve the quality of ALS algorithms, this thesis develops area- and delay-driven ALS methods. For one thing, area-driven ALS frameworks are proposed, which achieve 7% to 18% area reduction compared to the state-of-the-art methods. For another, a delay-driven ALS framework is proposed, which can reduce the circuit delay by 40.7% compared to the state-of-the-art method. To improve the efficiency of ALS methods, this thesis tries to accelerate the most time-consuming step of ALS, error evaluation. It focuses on one of the widely-used error metrics, the maximum error, and proposes an efficient maximum error checking method. It depicts the behavior of the maximum error with partial Boolean difference and performs efficient maximum error checking with satisfiability sweeping. The proposed method accelerates the ALS framework by 13 times. At the architecture layer, this thesis proposes a low-power and high-speed approximate LUT architecture. To implement an arbitrary function with the architecture at the cost of the smallest error, efficient synthesis algorithms are proposed using integer linear programming and heuristic methods. The proposed architecture achieves energy and latency savings by 56.5% and 92.4%, respectively, compared to the state-of-the-art method. To sum up, this thesis proposes novel techniques that can efficiently synthesize high-quality approximate circuits and architectures. They advance the study of approximate computing and pave a promising way to improve the quality of VLSI designs in the post-Moore era./proceedings-archive/2024/pdf/20033_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.26 | AUTOMATIC HARDWARE-AWARE DESIGN AND OPTIMIZATION OF DEEP LEARNING MODELS Speaker: Matteo Risso, Politecnico di Torino, IT Authors: Matteo Risso, Massimo Poncino and Daniele Jahier Pagliari, Politecnico di Torino, IT Abstract Deep learning is becoming pervasive in several applications thanks to its unprecedented capabilities in solving complex tasks. Nevertheless, deploying Deep Neural Networks (DNNs) on real-world devices requires many modifications and optimizations to models to comply with the specific use-case constraints. The scenario is even more critical when the final deployment target platforms are represented by edge devices and Internet of Things (IoT) nodes with limited computing capabilities and tight power, latency, and memory constraints. Optimizing DNNs for tiny platforms includes search over a large set of architectural hyper-parameters such as the number and type of layers, their configuration, the weights and activations bitwidth, etc. Traditionally, these hyper-parameters are picked manually following not-well motivated heuristics and rules of thumb and often focusing only on the final accuracy of the model without considering the target platform for the final deployment. In order to efficiently explore this complex and multi-faceted landscape, automated optimization tools, generally referred to as AutoML, are becoming the go-to approach to the development of DNN. In particular, Neural Architecture Search (NAS) automates the search for optimal combinations of layers and their configurations whereas Mixed-Precision Search (MPS) solutions look for the optimal data representation for each model tensor. Nevertheless, most popular automated approaches show important limitations: first, they are typically complex, and impractical for users with modest computing power; second, the HW awareness of such tools is still poor and consider simple surrogate metrics which poorly correlate with real HW quantities. The purpose of this Ph.D. thesis is to address these limitations realizing a lightweight yet flexible and powerful DNN optimization toolchain capable of taking into account both task-specific performance and target HW underlying structure and its related metrics. This thesis presents PLiNIO, an open-source library for Plug-and-play Lightweight Neural Inference Optimization./proceedings-archive/2024/pdf/20034_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.27 | LEVERAGE WIFI CHANNEL STATE INFORMATION AND WEARABLE SENSORS FOR HUMAN MONITORING Speaker: Cristian Turetta, Università di Verona, IT Authors: Cristian Turetta and Graziano Pravadelli, Università di Verona, IT Abstract In recent years, the evolution of remote monitoring and control systems, particularly in healthcare, ambient assisted living, and Industry 4.0, has been significantly propelled by advancements in WiFi and wearable sensing technologies. The widespread availability of WiFi and the cost-effectiveness of commodity WiFi routers have positioned WiFi sensing as a prominent, insightful solution in signal propagation analysis. Using CSI, it provides a detailed perspective on environmental interactions, surpassing conventional signal strength indicators in applications such as indoor positioning and human activity recognition. Simultaneously, WBANs and wearable sensors have ushered in a significant shift in healthcare and personal wellness management, facilitating continuous physiological monitoring in a discreet and privacy-preserving manner. However, these technologies are not without limitations. The objective of this research is to address these challenges, aiming to integrate and enhance both WiFi and wearable sensing technologies for more efficient and accurate remote monitoring systems./proceedings-archive/2024/pdf/20036_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.28 | VERSATILE HARDWARE ANALYSIS TECHNIQUES - FROM WAVEFORM-BASED ANALYSIS TO FORMAL VERIFICATION Speaker: Lucas Klemmer, Johannes Kepler University Linz, AT Authors: Lucas Klemmer and Daniel Grosse, Johannes Kepler University Linz, AT Abstract The development of next-generation electronic systems poses significant challenges to all phases of the design process. In particular, the verification phase is the most time-consuming part. Verification aims to find design errors as early as possible, and verification is multidimensional: Both, functional requirements and non-functional requirements (e.g., timing, performance, latency) have to be verified, and in practice there is often an intersection of the respective tasks and models used. If one again breaks down the verification task, then it is dominated by debugging. Even worse, debugging is rated as least predictable since it requires a deep design understanding. Both, verification and debugging analyze the design or artifacts of the design (e.g., waveforms created during simulation) to uncover bugs or to gain insights into the design. However, analyzing hardware designs is complicated, for example, due to the high complexity of the systems or the enormous amounts of data that has to be analyzed. Further, much of the analysis done today is performed manually, for example in waveform viewers or during the creation of test benches. This thesis advances the field of hardware analysis with works ranging from performance analysis methods based on waveforms to formal verification. We provide various techniques related to design, debug, and verification of functional and of non-functional properties, and present several works with both theoretical foundations and practical applications on real-world examples./proceedings-archive/2024/pdf/20038_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.29 | EFFICIENT SECURITY FOR EMERGING COMPUTING PLATFORMS: SIDE-CHANNEL ATTACKS AND DEFENSES Speaker: Mahya Morid Ahmadi, TU Wien, AT Authors: Mahya Morid Ahmadi1 and Muhammad Shafique2 1TU Wien, AT; 2New York University Abu Dhabi, AE Abstract As the use of embedded systems becomes more widespread in mission-critical applications, the need to ensure their security is becoming increasingly important. Side-channel attacks (SCAs) represent a significant threat to these systems, as they can allow attackers to extract sensitive information from them by analyzing the power consumption, electromagnetic emissions, or other side channels. To address this threat, research in SCAs and countermeasures has become a growing area of interest. By exploring SCAs in both hardware modules and software applications, we can gain a better understanding of the vulnerabilities present in modern systems and develop effective mitigation techniques. This research not only contributes to the security of embedded systems but also has broader implications for securing sensitive information in various areas such as the Internet of Things (IoT) and critical infrastructure./proceedings-archive/2024/pdf/20039_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.30 | ASSESSING THE RELIABILITY OF THE SKYWATER 130NM PDK FOR FUTURE SPACE PROCESSORS Speaker: Ivan Rodriguez-Ferrandez, Barcelona Supercomputing Center, ES Authors: Ivan Rodriguez Ferrandez and Leonidas Kosmidis, UPC | BSC, ES Abstract Recently, the ASIC industry has experienced a massive change, with more and more small and medium businesses entering custom ASIC development. This trend is fueled by the recent open hardware movement and relevant government and privately funded initiatives. These new developments can open new opportunities in the space sector -- which is traditionally characterised by very low volumes and very high non-recurring engineering (NRE) costs -- if we can show that the produced chips have favourable radiation properties. In this paper, we describe the design and tape-out of Space Shuttle, the first test chip for evaluating the suitability of the SkyWater 130nm PDK and the OpenLane EDA toolchain, using the Google/E-fabless shuttle run, for future space processors./proceedings-archive/2024/pdf/20040_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.31 | A SELF-ADAPTIVE RESILIENT METHOD FOR IMPLEMENTING AND MANAGING THE HIGH-RELIABILITY PROCESSING SYSTEM Speaker: Junchao Chen, IHP – Leibniz Institute for High Performance Microelectronics, DE Authors: Junchao Chen and Milos Krstic, IHP – Leibniz Institute for High Performance Microelectronics, DE Abstract As a result of CMOS scaling, radiation-induced Single-Event Effects (SEEs) in electronic circuits became a critical reliability issue for modern Integrated Circuits (ICs) operating under harsh radiation conditions. SEEs can be triggered in combinational or sequential logic by the impact of high-energy particles, leading to destructive or non-destructive faults, resulting in data corruption or even system failure. Usually, the SEE mitigation methods are deployed statically in processing architectures based on the worst-case radiation conditions, which is most of the time unnecessary and results in a resource overhead. Moreover, the space radiation conditions are dynamically changing, especially during Solar Particle Events (SPEs). The intensity of space radiation can differ over five orders of magnitude within a few hours or days, resulting in several orders of magnitude fault probability variation in ICs during SPEs. To overcome the static mitigation overhead issue, this thesis introduces a comprehensive approach for designing a self-adaptive fault resilient multiprocessing system. This work mainly addresses the following topics: • Design of on-chip radiation particle monitor for real-time radiation environment detection • Investigation of space environment predictor, as the support for solar condition forecast • Dynamic mode configuration in the resilient multiprocessing system Therefore, according to detected and predicted in-flight space radiation conditions, the target system can be configured to use no mitigation or low-overhead mitigation during non-critical periods. The redundant resources can be used to improve system performance or save power. On the other hand, during increased radiation activity periods, such as SPEs, the mitigation methods can be configured appropriately depending on the dynamic space environment, resulting in higher system reliability. Thus, a dynamic trade-off in the target system between reliability, performance, and power consumption in real-time can be achieved. All results of this work are evaluated in a highly reliable quad-core multiprocessing system that allows the self-adaptive setting of optimal radiation mitigation mechanisms during run-time. Proposed methods can serve as a basis for establishing a comprehensive self-adaptive resilient system design process. Successful implementation of the proposed design in a quad-core multiprocessor also shows its application perspective in other designs./proceedings-archive/2024/pdf/20041_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.32 | SYNERGISTIC SOFTWARE-HARDWARE APPROXIMATION AND CO-DESIGN FOR MACHINE LEARNING CLASSIFICATION IN PRINTED ELECTRONICS Speaker: Giorgos Armeniakos, National Technical University of Athens, GR Authors: Giorgos Armeniakos1, Dimitrios Soudris2 and Joerg Henkel3 1National Technical University of Athens, GR; 2National Technical University of Athens, GR; 3Karlsruhe Institute of Technology, DE Abstract This PhD dissertation proposes a holistic cross-layer approach that aims to minimize the hardware requirements of printed ML classifiers, incorporating approximate computing principles. The goal is to enable efficient approximation in printed electronics, to build battery-powered ML classifiers, and to explore the co-design space in an automated manner, thus expanding the printed electronics ecosystem./proceedings-archive/2024/pdf/20042_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.33 | SECURING NANO-CIRCUITS AGAINST OPTICAL PROBING ATTACK Speaker: Sajjad Parvin, University of Bremen, DE Authors: Sajjad Parvin1 and Rolf Drechsler2 1University of Bremen, DE; 2University of Bremen | DFKI, DE Abstract Recently, a non-invasive laser-based Side-Channel Analysis (SCA) attack, namely Optical Probing (OP) attack has been shown to pose an immense threat to the security of sensitive information on chips. In the literature, several countermeasures are proposed which require changes in the chip fabrication process. As a result, these countermeasures are costly and not economical. In this work, we focused on proposing countermeasures against OP at the layout level. Consequently, our proposed countermeasures against OP are industry standard-friendly, and cost-effective. Moreover, to evaluate the security robustness of our proposed layout-level countermeasures, we developed an OP simulator. The developed OP simulator enables designers to analyze the information leakage of a design pre-silicon. An additional feature of our developed simulator is its capability to detect Hardware Trojan (HT) inserted in a design when fabricated in an untrusted foundry. Finally, we designed a chip using a 28 nm commercial technology that consists of more than 300 security test structures to evaluate the robustness of our proposed countermeasures and the accuracy of our developed OP simulator./proceedings-archive/2024/pdf/20043_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.34 | FAULT-BASED ANALYSIS COVERING THE ANALOG SIDE OF A SMART OR CYBER-PHYSICAL SYSTEM Speaker and Author: Nicola Dall'Ora, Università di Verona, IT Abstract Over the last decade, the industrial world has been involved in a massive revolution guided by the adoption of digital technologies. In this context, complex systems like cyber-physical systems play a fundamental role, since they are designed and realized by composing heterogeneous components. The combined simulation of the behavioral models of these components allows the nominal behavior of the real system to be reproduced. Similarly, a smart system is a device that integrates heterogeneous components in a miniaturized form factor. The development of smart or cyber-physical systems, in combination with faulty behaviors modeled for the different physical domains composing the system, supports advanced functional safety assessment at the system level. This thesis proposes several methodologies to create and inject multi-domain fault models in the analog side of these systems, exploiting the physical analogy between the electrical and mechanical domains to infer a new mechanical fault taxonomy. Thus, standard electrical fault models are injected into the electrical part, while the derived mechanical fault models are injected directly into the mechanical part. The entire flow has been applied to several industrial case studies./proceedings-archive/2024/pdf/20044_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.35 | EXPONENT SHARING FOR MODEL COMPRESSION Speaker and Author: Prachi Kashikar, IIT Goa, IN Abstract AI on the edge has emerged as an important research area in the last decade to deploy different applications in the domains of computer vision and natural language processing on tiny devices. These devices have limited on-chip memory and are battery-powered. On the other hand, neural network models require large memory to store model parameters and intermediate activation values. Thus, it is critical to make the models smaller so that their on-chip memory requirements are reduced. Various existing techniques like quantization and weight sharing reduce model sizes at the expense of some loss in accuracy. We propose a lossless technique of model size reduction by focusing on sharing of exponents in weights, which is different from sharing of weights. We present results based on General Matrix Multiplication (GEMM) in neural network models. Our method achieves at least a 20% reduction in memory when using Bfloat16 and around 10% reduction when using IEEE single-precision floating-point, for models in general, with a very small impact (up to 10% on the processor and less than 1% on FPGA) on the execution time with no loss in accuracy. A small illustrative sketch of the exponent-sharing idea follows this table./proceedings-archive/2024/pdf/20045_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.36 | CONTRIBUTIONS TO IMPROVE THE ENERGY EFFICIENCY OF FPGA-BASED APPLICATIONS Speaker: Christoph Niemann, University of Rostock, DE Authors: Christoph Niemann and Dirk Timmermann, University of Rostock, DE Abstract Over several decades, the semiconductor industry has succeeded in achieving enormous increases in performance by scaling process technology. The energy required per computing operation was also reduced dramatically. However, this trend is currently slowing down significantly. Important future technologies such as artificial intelligence and modern communication systems such as 5G and 6G in turn demand increased computing power. At the same time, the demand for energy cannot be allowed to grow without limit. Therefore, the challenge is to achieve more computing power per unit of energy despite a less dynamic development of process technology. This work is dedicated to this task. I propose an Adaptive Voltage Scaling (AVS) system that is tailored towards the specific properties of FPGAs. Particular attention is paid to sensor calibration and placement. A prototype implementation and experimental evaluation show the considerable potential of the developed concept for saving energy. With our novel AVS approach, the supply voltage could be reduced by an average of 21%. This resulted in an average reduction of power consumption between 51% and 66%, depending on the characteristics of the application design./proceedings-archive/2024/pdf/20046_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.37 | SCALABLE AND EFFICIENT METHODS FOR UNCERTAINTY ESTIMATION AND REDUCTION IN DEEP LEARNING Speaker: Soyed Tuhin Ahmed, Karlsruhe Institute of Technology, Karlsruhe, Germany, DE Authors: Soyed Tuhin Ahmed and Mehdi Tahoori, Karlsruhe Institute of Technology, DE Abstract The deployment of Neural Networks (NNs) in resource-constrained, safety-critical systems presents significant challenges, despite their increasing integration into automated decision-making systems. A notable advancement in this area is the use of neuro-inspired Computation-in-memory (CIM) architectures, particularly those utilizing emerging resistive non-volatile memories like memristors. These architectures offer enhanced on-chip weight storage and efficient parallel computation, reducing power consumption. However, NN predictions often suffer from reliability and uncertainty issues, particularly when faced with out-of-distribution (OOD) inputs and hardware non-idealities like manufacturing defects and in-field faults in CIM-implemented NNs. These uncertainties can lead to unpredictable behaviors and reduced inference accuracy, necessitating effective mitigation strategies, especially in safety-critical applications. This thesis addresses these uncertainties by proposing the implementation of Bayesian Neural Networks (BayNNs) in CIM architectures, alongside innovative testing and fault-tolerance methods. However, this approach introduces new challenges, including the minimization of testing overhead, non-invasive test generation, and addressing the black-box nature of NNs in contexts like Machine Learning as a Service (MLaaS). Additionally, while BayNNs provide inherent prediction uncertainty, they are more resource-intensive than conventional NNs. The thesis aims to mitigate these resource demands through algorithm-hardware co-design-based solutions, targeting improved testability, reliability, performance, manufacturing yield, and efficiency of CIM-implemented NNs. Key contributions of this work include the development of problem-aware training algorithms, novel NN topologies and layers, enhanced hardware architectures, parameter mapping, and test vector generation methods. These innovations not only address reliability and scalability issues but also contribute to the advancement of reliable and efficient NN applications in safety-critical systems, as evidenced by several best paper nominations./proceedings-archive/2024/pdf/20047_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.38 | DESIGNING AND PROGRAMMING MASSIVELY PARALLEL SYSTEMS Speaker: Samuel Riedel, ETH Zurich, CH Authors: Samuel Riedel1 and Luca Benini2 1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT Abstract Shared L1 scratchpad memory (SPM) clusters are a common architectural pattern (e.g., in general-purpose graphics processing units (GPUs)) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly coupled clusters would not scale beyond a few tens of processing elements (PEs). This thesis addresses the challenge of constructing an energy-efficient, general-purpose manycore system by extending a versatile single-cluster architecture to accommodate hundreds of cores. The proposed system, named MemPool, features independent, programmable small cores with low-latency access to a globally shared L1 memory. Leveraging efficient synchronization techniques and fast direct memory access (DMA) engines, the aim is to create a streamlined, open-source architecture capable of efficiently handling a wide range of workloads. Additionally, the thesis explores the integration of systolic array architectures into shared-memory systems, introducing novel atomic read–modify–write (RMW) instructions (LRwait and SCwait) and a Mwait instruction to eliminate polling. The scalable Colibri implementation demonstrates improved synchronization for cache-less manycore systems, showcasing throughput improvements and reduced impact in various scenarios. Finally, the binary-translation-based emulator Banshee enables fast emulation of massively parallel systems for architectural explorations and software development./proceedings-archive/2024/pdf/20049_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.39 | ACCELERATION OF DNA ALIGNMENT TOOLS USING HARDWARE-SOFTWARE CO-DESIGN Speaker: Kisaru Liyanage, UNSW Sydney, AU Authors: Kisaru Liyanage, Hasindu Gamaarachchi and Sri Parameswaran, University of New South Wales, AU Abstract DNA sequencing, the method of reading the information within DNA molecules in fragments (called "reads") through chemical processes, has seen significant progress with enhanced efficiency and affordability. However, DNA analysis, the process of assembling all the DNA data read in fragments to determine the complete DNA sequence using computational algorithms, remains a challenging task, especially for the latest generation of DNA sequencing known as third-generation sequencing. Third-generation sequencing, also known as long-read sequencing, has gained significant popularity in recent years and was recently recognized as the Nature Method of the Year 2022. While third-generation sequencing offers significantly longer read lengths (often exceeding 10 kilo base pairs, in contrast to previous generations that produced reads of only a few hundred base pairs), enabling the resolution of complex genomic regions, the computational challenge intensifies in the DNA analysis process. The widely adopted bioinformatics tool minimap2 serves as the gold standard software for third-generation sequence analysis. minimap2 is utilized in various third-generation sequence analysis pipelines, including those employed by leading third-generation sequencing companies such as ONT and PacBio. However, the computational intensity of mapping long reads with minimap2 results in time-consuming processing, necessitating the acceleration of the tool's runtime. With profiling performed on a HPC processor with two different representative datasets (ONT and PacBio) and two main configurations of the tool (without base-level alignment and with base-level alignment), it was identified that the "mm_chain_dp" function that performs the chaining step is the computational bottleneck of minimap2. This thesis work delves into strategies for accelerating third-generation DNA analysis performed with minimap2. The methods involve offloading computational bottleneck (chaining step) to an FPGA-based hardware accelerator and designing a RISC-V based custom embedded processor using hardware-software co-design techniques, with the ultimate goal of enhancing the efficiency of third-generation DNA analysis./proceedings-archive/2024/pdf/20050_PhD_forum_porster_pdf_upload.pdf |
18:30 CET | PhDF.40 | AN ML-BASED SYSTEM-LEVEL FRAMEWORK FOR BACK-ANNOTATING LOW-LEVEL EFFECTS Presenter: Katayoon Basharkhah, University of Tehran, IR Authors: Katayoon Basharkhah1 and Zain Navabi2 1University of Tehran, IR; 2University of Tehran, IR Abstract The work presented here is on the back-annotation of physical properties into components of an embedded system for system-level simulation, facilitating fast design space exploration. Two important properties, power and noise, are considered in this thesis. A general ML-based framework is proposed for both power and noise back-annotation. This framework removes the expensive low-level simulations and performs a one-time offline effort for low-level property characterization. The data from characterization generate the dataset for training ML-based models. We evaluate several ML techniques on this task, such as multi-layer perceptrons, convolutional neural networks, gradient boosting, and LSTMs. The trained model is implemented in a fast SystemC surrogate model that speeds up simulation for system design space exploration. Here, the methodology is explained for interconnect crosstalk back-annotation: a crosstalk model for the interconnects of a RISC-V-like processor is implemented and evaluated as a case study. The same methodology is used for back-annotating and evaluating processor power at the system level./proceedings-archive/2024/pdf/20051_PhD_forum_porster_pdf_upload.pdf |
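A minimal illustrative sketch of the exponent-sharing idea from PhDF.35 above (an assumed example, not the thesis implementation): with bfloat16 storage (1 sign, 8 exponent, 7 mantissa bits), the 8-bit exponent field is stored once per distinct value in a small table, and each weight keeps only its sign and mantissa bits plus a one-byte index into that table, which is lossless.

```python
# Illustrative sketch, not the authors' code; assumes bfloat16 weight storage.
import numpy as np

def share_exponents(weights_f32: np.ndarray):
    """Split float32 weights (truncated to bfloat16) into shared-exponent form."""
    # bfloat16 is the upper 16 bits of float32: 1 sign, 8 exponent, 7 mantissa bits.
    bf16 = (weights_f32.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)
    exponent = ((bf16 >> 7) & 0xFF).astype(np.uint8)                     # 8-bit exponent field
    sign_mant = (((bf16 >> 15) << 7) | (bf16 & 0x7F)).astype(np.uint8)   # sign + 7 mantissa bits
    table, index = np.unique(exponent, return_inverse=True)
    # The exponent field is 8 bits wide, so the table has at most 256 entries
    # and the per-weight index always fits in one byte.
    return table, index.astype(np.uint8), sign_mant

def reconstruct(table, index, sign_mant):
    """Rebuild the bfloat16 weights (returned as float32) bit-exactly."""
    exponent = table[index].astype(np.uint16)
    sm = sign_mant.astype(np.uint16)
    bf16 = ((sm >> 7) << 15) | (exponent << 7) | (sm & 0x7F)
    return (bf16.astype(np.uint32) << 16).view(np.float32)

w = np.random.randn(1024).astype(np.float32)
table, index, sign_mant = share_exponents(w)
# Lossless at bfloat16 precision: the upper 16 bits of every weight are identical.
assert np.array_equal(reconstruct(table, index, sign_mant).view(np.uint32) >> 16,
                      w.view(np.uint32) >> 16)
```

Because real weight distributions use only a few dozen distinct exponents, the shared table stays tiny, which is where the reported memory savings would come from.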
Date: Monday, 25 March 2024
Time: 18:30 CET - 20:00 CET
Location / Room: Foyer
Date: Tuesday, 26 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Break-Out Room S8
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | ASD04.1 | AN ADAPTIVE UAV SCHEDULING PROCESS TO ADDRESS DYNAMIC MOBILE NETWORK DEMAND EFFICIENTLY Speaker: Ruide Cao, Southern University of Science and Technology, CN Authors: Ruide Cao1, Jiao Ye2, Jin Zhang3, Qian You1, Chao Tang4, Yan Liu1 and Yi Wang1 1SUSTech Institute of Future Networks, CN; 2SZU School of Architecture and Urban Planning, CN; 3SUSTech Department of Computer Science and Engineering, CN; 4SUSTech School of Design, CN Abstract Benefiting from high flexibility and a high probability of line-of-sight, deploying unmanned aerial vehicles (UAVs) as aerial access points has emerged as a promising solution for ensuring reliable wireless connectivity in crowded events. This paper introduces a UAV scheduling process adaptive to dynamic mobile network demand, comprising three phases. In the sensing phase, the user distribution is sensed, and user number thresholds are set to determine whether UAV assistance is needed. The planning phase presents an enhanced mean shift algorithm to find suitable locations to deploy UAVs, with a dynamic bandwidth derived from the user distribution, the UAV's maximum capacity, and the UAV's maximum throughput. The deploying phase dispatches and recalls UAVs based on planning results. Comprehensive simulation experiments are conducted on OMNeT++ using real-world data. Results show that the proposed process offers great adaptivity, with an average efficiency increase of 18.7% and an average fairness increase of 28.9% compared to existing related works./proceedings-archive/2024/DATA/6036_pdf_upload.pdf |
09:00 CET | ASD04.2 | END-TO-END LATENCY OPTIMIZATION OF THREAD CHAINS UNDER THE DDS PUBLISH/SUBSCRIBE MIDDLEWARE Speaker: Gerlando Sciangula, Huawei, Scuola Superiore Sant'Anna, IT Authors: Gerlando Sciangula1, Daniel Casini2, Alessandro Biondi2 and Claudio Scordino3 1Huawei and Scuola Superiore Sant'Anna, IT; 2Scuola Superiore Sant'Anna, IT; 3Huawei Inc, IT Abstract Modern autonomous systems integrate diverse software solutions to manage tightly communicating functionalities. These applications commonly communicate using frameworks implementing the publish/subscribe paradigm, such as the Data Distribution Service (DDS). However, these frameworks are realized with a multi-threaded software architecture and implement internal policies for message dispatching, posing additional challenges for guaranteeing timing constraints. This work addresses the problem of optimizing DDS-based interconnected real-time systems, proposing analysis-driven algorithms to set a vast range of parameters, ranging from classical thread priorities to other DDS-specific configurations. We evaluate our approaches on the Autoware Reference System, a realistic testbed from the Autoware autonomous driving framework./proceedings-archive/2024/DATA/6012_pdf_upload.pdf |
09:30 CET | ASD04.3 | ORCHESTRATION-AWARE OPTIMIZATION OF ROS2 COMMUNICATION PROTOCOLS Speaker: Mirco De Marchi, Università di Verona, IT Authors: Mirco De Marchi and Nicola Bombieri, Università di Verona, IT Abstract The robot operating system (ROS) standard has been extended with different communication mechanisms to address real-time and scalability requirements. On the other hand, containerization and orchestration platforms like Docker and Kubernetes are increasingly being adopted to strengthen platform-independent development and automatic software deployment. In this paper, we quantitatively analyze the impact of topology, containerization, and edge-cloud distribution of ROS nodes on the efficiency of the ROS2 communication protocols. We then present a framework that automatically binds the most efficient ROS protocol for each node-to-node communication by considering the architectural characteristics of both software and edge-cloud computing platforms. The framework is available at https://github.com/PARCO-LAB/ros4k./proceedings-archive/2024/DATA/6010_pdf_upload.pdf |
Date: Tuesday, 26 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Break-Out Room S1+2
Session chair:
Liliana Cucu-Grosjean, Inria, Paris, FR
Session co-chair:
Maria Michael, U Cyprus, CY
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | BPA01.1 | CLASS-AWARE PRUNING FOR EFFICIENT NEURAL NETWORKS Speaker: Mengnan Jiang, TU Darmstadt, DE Authors: Mengnan Jiang1, Jingcun Wang1, Amro Eldebiky2, Xunzhao Yin3, Cheng Zhuo3, Ing-Chao Lin4 and Grace Li Zhang1 1TU Darmstadt, DE; 2TU Munich, DE; 3Zhejiang University, CN; 4National Cheng Kung University, TW Abstract Deep neural networks (DNNs) have demonstrated remarkable success in various fields. However, their huge numbers of floating-point operations (FLOPs) pose challenges for practical deployment in resource-constrained applications, e.g., edge devices. Pruning is a technique to reduce the computational cost of executing DNNs. Existing pruning strategies based on weight values, gradient values, or output activations are largely empirical. Different from previous pruning techniques, in this paper, we propose a class-aware iterative pruning technique to compress DNNs, which provides a novel perspective on reducing the computational cost of DNNs. Specifically, in each iteration, the neural network training is modified to facilitate the class-aware pruning. Afterwards, the importance of filters with respect to the number of classes is evaluated. Those filters that are only important for a small number of classes are removed. The neural network is then retrained to compensate for the accuracy loss. The pruning iterations end when no more filters can be removed, indicating that the remaining filters are important for many classes. The experimental results confirm that this class-aware pruning technique can significantly reduce the number of parameters and FLOPs, while preserving classification accuracy. Furthermore, this pruning technique outperforms existing pruning methods in terms of accuracy, pruning rate and the reduction of FLOPs. A small illustrative sketch of a class-aware importance score follows this table./proceedings-archive/2024/DATA/353_pdf_upload.pdf |
08:50 CET | BPA01.2 | ENHANCING RELIABILITY OF NEURAL NETWORKS AT THE EDGE: INVERTED NORMALIZATION WITH STOCHASTIC AFFINE TRANSFORMATIONS Speaker: Soyed Tuhin Ahmed, Karlsruhe Institute of Technology, Karlsruhe, Germany, DE Authors: Soyed Tuhin Ahmed1, Kamal Danouchi2, Michael Hefenbrock3, Guillaume Prenat4, Lorena Anghel5 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2University Grenoble Alpes, CEA, CNRS, Grenoble INP, IRIG-Spintec Laboratory, FR; 3Revo AI, DE; 4Spintec, FR; 5Grenoble-Alpes University, Grenoble, France, FR Abstract Bayesian Neural Networks (BayNNs) naturally provide uncertainty in their predictions, making them a suitable choice in safety-critical applications. Additionally, their realization using memristor-based in-memory computing (IMC) architectures enables them for resource-constrained edge applications. In addition to predictive uncertainty, however, the ability to be inherently robust to noise in computation is also essential to ensure functional safety. In particular, memristor-based IMCs are susceptible to various sources of non-idealities such as manufacturing and runtime variations, drift, and failure, which can significantly reduce inference accuracy. In this paper, we propose a method to inherently enhance the robustness and inference accuracy of BayNNs deployed in IMC architectures. To achieve this, we introduce a novel normalization layer combined with stochastic affine transformations. Empirical results in various benchmark datasets show a graceful degradation in inference accuracy, with an improvement of up to 58.11%./proceedings-archive/2024/DATA/1037_pdf_upload.pdf |
09:10 CET | BPA01.3 | HW-SW OPTIMIZATION OF DNNS FOR PRIVACY-PRESERVING PEOPLE COUNTING ON LOW-RESOLUTION INFRARED ARRAYS Speaker: Matteo Risso, Politecnico di Torino, IT Authors: Matteo Risso1, Chen Xie1, Francesco Daghero1, Alessio Burrello2, Seyedmorteza Mollaei1, Marco Castellano3, Enrico Macii1, Massimo Poncino1 and Daniele Jahier Pagliari1 1Politecnico di Torino, IT; 2Politecnico di Torino | Università di Bologna, IT; 3STMicroelectronics, IT Abstract Low-resolution infrared (IR) array sensors enable people counting applications such as monitoring the occupancy of spaces and people flows while preserving privacy and minimizing energy consumption. Deep Neural Networks (DNNs) have been shown to be well-suited to process these sensor data in an accurate and efficient manner. Nevertheless, the space of DNNs' architectures is huge and its manual exploration is burdensome and often leads to sub-optimal solutions. To overcome this problem, in this work, we propose a highly-automated full-stack optimization flow for DNNs that goes from neural architecture search, mixed-precision quantization, and post-processing, down to the realization of a new smart sensor prototype, including a Microcontroller with a customized instruction set. Integrating these cross-layer optimizations, we obtain a large set of Pareto-optimal solutions in the 3D-space of energy, memory, and accuracy. Deploying such solutions on our hardware platform, we improve the state-of-the-art achieving up-to 4.2x model size reduction, 23.8x code size reduction, and 15.38x energy reduction at iso-accuracy./proceedings-archive/2024/DATA/142_pdf_upload.pdf |
09:30 CET | BPA01.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
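A rough illustration of the class-aware importance idea in BPA01.1 above, under assumptions: for one convolutional layer, count for how many classes each output filter shows a high mean activation on a labelled validation set, and flag filters that matter to only a few classes as pruning candidates. The hook-based capture, the thresholds, and the activation-based metric are illustrative choices, not the authors' method.

```python
# Illustrative sketch only; thresholds and importance metric are assumptions.
import torch

@torch.no_grad()
def class_aware_prune_candidates(model, conv_layer, val_loader, num_classes,
                                 importance_thresh=0.1, min_classes=5):
    captured = {}
    hook = conv_layer.register_forward_hook(
        lambda module, inputs, output: captured.update(out=output))
    per_class_sum = None
    per_class_count = torch.zeros(num_classes)
    for images, labels in val_loader:
        model(images)
        # Mean absolute activation per filter over the spatial dims: shape (batch, filters).
        act = captured["out"].abs().mean(dim=(2, 3))
        if per_class_sum is None:
            per_class_sum = torch.zeros(num_classes, act.shape[1])
        for c in range(num_classes):
            mask = labels == c
            per_class_sum[c] += act[mask].sum(dim=0)
            per_class_count[c] += mask.sum()
    hook.remove()
    # Importance of each filter for each class, then how many classes it serves.
    importance = per_class_sum / per_class_count.clamp(min=1).unsqueeze(1)
    classes_served = (importance > importance_thresh).sum(dim=0)
    return torch.nonzero(classes_served < min_classes).flatten()  # filters to prune
```

In the paper's scheme the network is then retrained and the procedure iterates until no further filters can be removed; only the candidate-selection step is sketched here.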
Date: Tuesday, 26 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Break-Out Room S3+4
Coarse-grained reconfigurable arrays (CGRAs) are programmable hardware devices within the broader umbrella of reconfigurable architectures. They are promising candidates for the realization of application accelerators. In contrast to FPGAs, CGRAs are configurable at the word level, rather than the bit level. This distinction positions CGRAs to deliver power, performance, and area characteristics more closely aligned with those of custom ASICs. Notably, numerous machine-learning accelerator startups, such as Tenstorrent, Groq, Cerebras, and SambaNova, offer architectures that closely resemble CGRAs.
CGRA-ME is an open-source CGRA modeling and exploration framework actively being developed at the University of Toronto. CGRA-ME is intended to facilitate research on new CGRA architectures and new CAD algorithms. Given the current surge in research interest in CGRAs from both industry and academia, this tutorial aims to provide valuable insights and practical guidance in this dynamic field.
We invite DATE 2024 participants with a keen interest in reconfigurable architectures and computer-aided design (CAD) tools. Please join us!
Date: Tuesday, 26 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Multi-Purpose Room M1B+D
Session chair:
Victor Grimblatt, Synopsys, CL
Organiser:
Giovanni De Micheli, EPFL, CH
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | FS01.1 | DEPTH-OPTIMAL ADDRESSING OF 2D QUBIT ARRAY WITH 1D CONTROLS BASED ON EXACT BINARY MATRIX FACTORIZATION Speaker: Jason Cong, University of California, Los Angeles, US Authors: Daniel Bochen Tan, Shuohao Ping and Jason Cong, University of California, Los Angeles, US Abstract Reducing control complexity is essential for achieving large-scale quantum computing, particularly on platforms operating in cryogenic environments. Wiring each qubit to a room-temperature control poses a challenge, as this approach would surpass the thermal budget in the foreseeable future. An essential tradeoff becomes evident: reducing control knobs compromises the ability to independently address each qubit. Recent progress in neutral atom-based platforms suggests that rectangular addressing may strike a balance between control granularity and flexibility for 2D qubit arrays. This scheme allows addressing qubits on the intersections of a set of rows and columns each time. While quadratically reducing controls, it may necessitate more depth. We formulate the depth-optimal rectangular addressing problem as exact binary matrix factorization, an NP-hard problem also appearing in communication complexity and combinatorial optimization. We introduce a satisfiability modulo theories-based solver for this problem, and a heuristic, row packing, performing close to the optimal solver on various benchmarks. Furthermore, we discuss rectangular addressing in the context of fault-tolerant quantum computing, leveraging a natural two-level structure./proceedings-archive/2024/DATA/1299_pdf_upload.pdf |
08:53 CET | FS01.2 | FROM MASTER EQUATION TO SPICE: A PLATFORM TO MODEL CRYO-CMOS CONTROL FOR QUBITS Speaker: Vladimir Pešić, EPFL, CH Authors: Vladimir Pesic, Andrew Wright and Edoardo Charbon, AQUA Lab, EPFL, CH Abstract Cryogenic classical electronics for the control of qubits can be placed near quantum processors for a more compact system, ultimately enabling a highly scalable one. However, cryogenic operation poses very strict power requirements on electronics, in addition to heavy constraints on precision in both amplitude and phase, as well as noise, so as to achieve the necessary fidelity. To test all the trade-offs that arise from these requirements, detailed simulations based on the physics of qubits and on the effects that circuits may have on them are needed. This paper focuses on the models and the simulation framework needed for quantum processors that can be efficiently executed on classical hardware. These models allow circuit design and specification derivation at all levels of the design. The suitability of the approach is demonstrated with superconducting qubit platforms and their control and characterization. |
09:15 CET | FS01.3 | TECHNOLOGY-AWARE LOGIC SYNTHESIS FOR SUPERCONDUCTING ELECTRONICS Speaker: Rassul Bairamkulov, EPFL, CH Authors: Rassul Bairamkulov, Siang-Yun Lee, Alessandro Tempia Calvino, Dewmini Marakkalage, Mingfei Yu and Giovanni De Micheli, EPFL, CH Abstract Superconducting electronics provide us with cryogenic digital circuits that can rival established technologies in performance and energy consumption. Today, the lack of tools for the design of large-scale integrated superconducting circuits is a major obstacle to their deployment. Few research institutions and companies have contributed to making such tools available. This review focuses on methods, algorithms, and open-source design tools for logic synthesis of superconducting circuits in two major families: single-flux quantum (SFQ) circuits and adiabatic quantum flux parametron (AQFP). |
09:38 CET | FS01.4 | CHALLENGES AND UNEXPLORED FRONTIERS IN ELECTRONIC DESIGN AUTOMATION FOR SUPERCONDUCTING DIGITAL LOGIC Speaker: Massoud Pedram, University of Southern California, US Authors: Sasan Razmkhah1, Robert Aviles2, Mingye li2, Sandeep Gupta1, Peter Beerel2 and Massoud Pedram2 1University of Southern California (USC), US; 2University of Southern California, US Abstract Positioned as a highly promising post-CMOS computing technology, superconductor electronics (SCE) offer the potential for unparalleled performance and energy efficiency gains compared to end-of-roadmap CMOS circuits. However, achieving very large-scale integration poses numerous challenges. These challenges span from the modeling and analysis of superconducting devices and logic gates to the intricate design of complex SCE circuits and systems. Addressing power and clock distribution issues, minimizing adverse effects of flux trappings, and mitigating stray electromagnetic fields in sensitive SCE circuitry are key challenges that need attention. Verification and testing of SCE circuits also remain open problems. Moreover, scaling the minimum feature sizes of SCE circuits, currently set at 150nm, presents critical scaling and physical design challenges that must be overcome. This review aims to delve into these issues, providing detailed insights while exploring existing or potential solutions to overcome them. |
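The exact binary matrix factorization formulation mentioned in FS01.1 above admits a compact illustration (an assumed toy example, not taken from the paper): a 0/1 addressing pattern A is written as a Boolean product of two binary factors, and each of the k columns of B (rows of C) corresponds to one addressing step that activates a set of rows and a set of columns, so the Boolean rank k is the required depth.

```latex
% Toy instance (assumed for illustration): the 1-entries of A are covered exactly
% by k = 2 row/column rectangles, so two addressing steps suffice for this pattern.
\[
A =
\begin{pmatrix} 1 & 1 & 0\\ 1 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}
=
\underbrace{\begin{pmatrix} 1 & 0\\ 1 & 0\\ 0 & 1 \end{pmatrix}}_{B}
\circ
\underbrace{\begin{pmatrix} 1 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}}_{C},
\qquad
A_{ij} \;=\; \bigvee_{t=1}^{k} \left( B_{it} \wedge C_{tj} \right).
\]
```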
Date: Tuesday, 26 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Auditorium 2
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | SD01.1 | WELCOME AND INTRODUCTION Presenter: Ana Lucia Varbanescu, University of Twente, NL Author: Ana Lucia Varbanescu, University of Twente, NL Abstract Welcome and introduction |
08:35 CET | SD01.2 | PHYSICAL COMPUTING ON SUSTAINABILITY: A PERSPECTIVE ON DEVICES, DESIGN AND APPLICATIONS Presenter: Aida Todri-Sanial, Eindhoven University of Technology, NL Author: Aida Todri-Sanial, Eindhoven University of Technology, NL Abstract . |
09:10 CET | SD01.3 | NONE OF MY BUSINESS? ON SUSTAINABILITY ISSUES FOR COMPUTER ARCHITECTS Presenter: Carsten Trinitis, TU Munich, DE Author: Carsten Trinitis, TU Munich, DE Abstract . |
09:25 CET | SD01.4 | CRYPTOGRAPHIC AGILITY FOR SUSTAINABLE SYSTEMS Presenter: Nusa Zidaric, Leiden University, NL Author: Nusa Zidaric, Leiden University, NL Abstract . |
09:32 CET | SD01.5 | TIMELY TURNING-OFF SYSTEMS Presenter: Kuan-Hsun Chen, University of Twente, NL Author: Kuan-Hsun Chen, University of Twente, NL Abstract . |
09:39 CET | SD01.6 | SUSTAINABLE ENERGY HARVESTING IOT SYSTEMS THROUGH MODELING ENERGY AND PERFORMANCE Presenter: Fatemeh Ghasemi, Norwegian University of Science and Technology, NO Author: Fatemeh Ghasemi, Norwegian University of Science and Technology, NO Abstract . |
09:47 CET | SD01.7 | ON-DEVICE CUSTOMIZATION OF TINY DEEP LEARNING MODELS FOR KEYWORD SPOTTING WITH FEW EXAMPLES Presenter: Manuele Rusci, KU Leuven, BE Author: Manuele Rusci, KU Leuven, BE Abstract . |
09:56 CET | SD01.8 | POLL, REMINDER OF SECOND SESSION Presenter: Ana Lucia Varbanescu, University of Twente, NL Author: Ana Lucia Varbanescu, University of Twente, NL Abstract Poll, Reminder of second session |
Date: Tuesday, 26 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Auditorium 3
Session chair:
Daniel Grosse, Johannes Kepler University Linz, AT
Session co-chair:
Rolf Drechsler, University of Bremen, DE
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | TS03.1 | EVILCS: AN EVALUATION OF INFORMATION LEAKAGE THROUGH CONTEXT SWITCHING ON SECURITY ENCLAVES Speaker: Prabhat Mishra, University of Florida, US Authors: Aruna Jayasena, Richard Bachmann and Prabhat Mishra, University of Florida, US Abstract Security enclaves provide isolated execution environments for trusted applications. However, modern processors utilize diverse performance enhancement methods (e.g., branch prediction and parallel execution) that can introduce security vulnerabilities. Specifically, if a processor leaks any information, an adversary can monitor and recover secrets from trusted applications. This paper makes a connection between context switching and information leakage in security enclaves. We present an evaluation framework that analyzes the potential channels through which context switching can expose sensitive information across the security enclave boundaries as a physical side-channel signature. Specifically, we propose a statistical information leakage assessment technique to evaluate the side-channel leakage of a security enclave during the pre-silicon design stage. Experimental evaluation on multiple RISC-V security enclaves reveals that context switching introduces power side channels that an adversary can exploit to infer the execution sequences as well as register values of trusted applications./proceedings-archive/2024/DATA/122_pdf_upload.pdf |
08:35 CET | TS03.2 | HETEROGENEOUS STATIC TIMING ANALYSIS WITH ADVANCED DELAY CALCULATOR Speaker: Zizheng Guo, Peking University, CN Authors: Zizheng Guo1, Tsung-Wei Huang2, Jin Zhou3, Cheng Zhuo4, Yibo Lin1, Runsheng Wang1 and Ru Huang1 1Peking University, CN; 2University of Wisconsin-Madison, US; 3Shanghai Second Polytechnic University, CN; 4Zhejiang University, CN Abstract Static timing analysis (STA) in advanced technology nodes encounters many new challenges in analysis accuracy and speed efficiency. To accurately model complex interconnect networks, existing timers have leveraged reduced-order models with effective capacitance to design advanced delay calculation algorithms. However, the iterative nature of these algorithms makes them extremely time-consuming to use in a timer, significantly limiting their capability in many timing-driven applications. To overcome this challenge, we propose a novel GPU-accelerated delay calculator that targets Arnoldi-based model order reduction with an effective capacitance algorithm. We design efficient numerical kernels for batched nodal analysis model construction, LU decomposition, Krylov subspace calculation, eigenvalue decomposition, and Newton-Raphson iteration. Compared with two industrial standard timers, PrimeTime and OpenSTA, we achieve a strong correlation with up to 7.27× and 14.03× speed-up, respectively./proceedings-archive/2024/DATA/164_pdf_upload.pdf |
08:40 CET | TS03.3 | TSA-TICER: A TWO-STAGE TICER ACCELERATION FRAMEWORK FOR MODEL ORDER REDUCTION Speaker: Pengju Chen, Southeast University, CN Authors: Pengju Chen1, Dan Niu1, Zhou Jin2, Changyin Sun3, Qi Li1 and Hao Yan1 1Southeast University, CN; 2China University of Petroleum-Beijing, CN; 3Anhui University, CN Abstract To enhance the post-simulation efficiency of large-scale integrated circuits, various model order reduction (MOR) methods have been proposed. Among these, TICER (Time-Constant Equilibration Reduction) is a widely-used resistor-capacitor (RC) network reduction algorithm. However, the time constant computation for eliminated-node classification in TICER is quite time-consuming. In this work, a two-stage TICER acceleration framework (TSA-TICER) is proposed. First, an improved graph attention network (named BCTu-GAT) equipped with betweenness centrality metric (BCM) based sample selection strategy and bi-level aggregation-based topology updating scheme (BiTu) is proposed to quickly and accurately determine all the eliminated nodes one time in the TICER. Second, an adaptive merging strategy for the new fill-in capacitors are designed to further accelerate the insertion stage. The proposed TSA-TICER is tested on RC networks with the size from 2k to 2 million nodes. Experimental results show that the proposed TSA-TICER achieves up to 796.21X order reduction speedup and 10.46X fill-in speedup compared to the TICER with 0.574% maximum relative error./proceedings-archive/2024/DATA/278_pdf_upload.pdf |
08:45 CET | TS03.4 | A RISC-V "V" VP: UNLOCKING VECTOR PROCESSING FOR EVALUATION AT THE SYSTEM LEVEL Speaker: Manfred Schlägl, Institute for Complex Systems, Johannes Kepler University Linz, AT Authors: Manfred Schlaegl, Moritz Stockinger and Daniel Grosse, Johannes Kepler University Linz, AT Abstract In this paper we introduce the first free- and open-source SystemC TLM based RISC-V Virtual Prototype (VP) with support for the RISC-V "V" Vector Extension (RVV) Version 1.0. After an introduction to RVV, we present the integration of RVV and its 600+ instructions into an existing VP leveraging code generation for over 20k Lines of Code (LoC). Moreover, we describe the verification of the resulting VP using the Instruction Sequence Generator (ISG) FORCE-RISCV and the Instruction Set Simulator (ISS) riscvOVPsim. Our case studies demonstrate the benefits of the RVV enhanced VP for system-level evaluation. We present non-vectorized and vectorized variants of two common algorithms which are executed on the VP with varying parameters. We show that by comparing the number of simulated execution cycles, we can derive valuable assessments for the design of RVV micro-architectures./proceedings-archive/2024/DATA/392_pdf_upload.pdf |
08:50 CET | TS03.5 | ARTMINE: AUTOMATIC ASSOCIATION RULE MINING WITH TEMPORAL BEHAVIOR FOR HARDWARE VERIFICATION Speaker: Mohammad Reza Heidari Iman, Tallinn University of Technology, EE Authors: Mohammad Reza Heidari Iman, Gert Jervan and Tara Ghasempouri, Tallinn University of Technology, EE Abstract Association rule mining is a promising data mining approach that aims to extract correlations and frequent patterns between items in a dataset. On the other hand, in the realm of assertion-based verification, automatic assertion mining has emerged as a prominent technique. Generally, to automatically mine the assertions to be used in the verification process, we need to find the frequent patterns and correlations between variables in the simulation trace of hardware designs. Existing association rule mining methods cannot capture temporal behaviors such as next[N], until, and eventually that hold significance within the context of assertion-based verification. In this paper, a novel association rule mining algorithm specifically designed for assertion mining is introduced to overcome this limit. This algorithm powers ARTmine, an assertion miner that leverages association rule mining and temporal behavior concepts. ARTmine outperforms other approaches by generating fewer assertions, achieving broader design behavior coverage in less time, and reducing verification costs./proceedings-archive/2024/DATA/554_pdf_upload.pdf |
08:55 CET | TS03.6 | MSH: A MULTI-STAGE HIZ-AWARE HOMOTOPY FRAMEWORK FOR NONLINEAR DC ANALYSIS Speaker: Zhou Jin, Super Scientific Software Laboratory, China University of Petroleum-Beijing, Beijing, China, CN Authors: Zhou Jin1, Tian Feng2, Xiao Wu2, Dan Niu3, Zhenya Zhou2 and Cheng Zhuo4 1China University of Petroleum-Beijing, CN; 2Huada Empyrean Software Co. Ltd, CN; 3Southeast University, CN; 4Zhejiang University, CN Abstract Nonlinear DC analysis is one of the most important tasks in transistor-level circuit simulation. Homotopy methods have achieved great success in eliminating the non-convergence problems that occur in Newton-Raphson (NR) based methods. However, nonlinear circuits with DC-path-available high-impedance (HiZ) nodes may still fail to converge with homotopy methods, because their resistance is large compared to the homotopy insertions, leading to an initial guess that is not sufficiently close. In this paper, we propose a HiZ-aware homotopy framework, MSH, enabling multi-stage continuation for HiZ nodes and other nodes separately to enhance simulation convergence. In addition, a new homotopy function with limited current-gain variation for MOS transistors is utilized to ensure a smoother solution curve and better efficiency. Moreover, we trace the solution curve with arclength by considering homotopy parameters as unknown variables to better ensure convergence. The effectiveness of our proposed homotopy framework is demonstrated on large-scale industrial-level circuits./proceedings-archive/2024/DATA/599_pdf_upload.pdf |
09:00 CET | TS03.7 | SELFIE5: AN AUTONOMOUS, SELF-CONTAINED VERIFICATION APPROACH FOR HIGH-THROUGHPUT RANDOM TESTING OF PROGRAMMABLE PROCESSORS. Speaker: Yehuda Kra, Bar-Ilan University, Ramat-Gan, Israel, IL Authors: Yehuda Kra1, Naama Kra2 and Adam Teman1 1Bar-Ilan University, IL; 2EnICS Labs, IL Abstract Random testing plays a crucial role in processor designs, complementing other verification methodologies. This paper introduces Selfie5, an autonomous, self-contained verification approach that utilizes the device under verification (DUV) itself to generate, execute, and verify random sequences. This approach eliminates the overhead associated with testing environment interfaces, resulting in a substantial increase in throughput, a critical aspect for achieving comprehensive coverage. The utility can be deployed to FPGA prototypes, emulation platforms and fabricated ASICs and run at-speed to execute billions of tested scenarios per hour, while ensuring the reproducibility of captured failures in an observable simulation environment. This paper describes the Selfie5 approach, algorithms, and utility, while also providing detailed insights into successful deployment of the utility for a RISC-V implementation. When deployed on a 16 nm test SoC featuring a RISC-V processor, Selfie5 delivered a testing throughput of 13.8 billion tested instructions per hour, which is 69X higher than other published works./proceedings-archive/2024/DATA/882_pdf_upload.pdf |
09:05 CET | TS03.8 | ISPT-NET: A NOVEL TRANSIENT BACKWARD-STEPPING REDUCTION POLICY BY IRREGULAR SEQUENTIAL PREDICTION TRANSFORMER Speaker: Dan Niu, Southeast University, CN Authors: Yichao Dong1, Dan Niu1, Zhou Jin2, Chuan Zhang1, Changyin Sun1 and Zhenya Zhou3 1Southeast University, CN; 2China University of Petroleum-Beijing, CN; 3Empyrean Technology Co., Ltd., CN Abstract In post-layout simulation of large-scale integrated circuits, transient analysis (TA), which determines the time-domain response over a specified time interval, is essential. However, it tends to be computationally intensive and quite time-consuming without a proper setting of the NR initial solution and an accurate LTE estimation for determining the next transient timestep, which leads to a large number of backward steppings. In this paper, an irregular sequential prediction transformer named ISPT-Net is proposed to accurately predict the transient solution as the NR initial solution and further obtain a precise LTE estimation for setting the next timestep. ISPT-Net is strengthened with a timestep positional encoding module (TPE) and a frequency- and timestep-sensitive multi-head self-attention module (FT-MSA) to enhance irregular-sequence feature extraction and prediction accuracy. We assess ISPT-Net on real large-scale industrial circuits in a commercial SPICE simulator, and achieve a remarkable backward-stepping reduction: up to 14.43X for NR non-convergence cases and 4.46X for LTE over-limit cases, while guaranteeing higher solution accuracy./proceedings-archive/2024/DATA/923_pdf_upload.pdf |
09:10 CET | TS03.9 | VERIBUG: AN ATTENTION-BASED FRAMEWORK FOR BUG LOCALIZATION IN HARDWARE DESIGNS Speaker: Stefano Quer, Politecnico di Torino, IT Authors: Giuseppe Stracquadanio1, Sourav Medya1, Stefano Quer2 and Debjit Pal3 1University of Illinois Chicago, US; 2Politecnico di Torino, IT; 3University of Illinois at Chicago, US Abstract In recent years, there has been an exponential growth in the size and complexity of System-on-Chip (SoC) designs targeting different specialized applications. The cost of an undetected bug in these systems is much higher than in traditional processors, as it may imply loss of property or life. Despite decades of research on simulation and formal methods for debugging and verification, the problem is exacerbated by the ever-shrinking time-to-market and ever-increasing demand to churn out billions of devices. In this work, we propose VeriBug, which leverages recent advances in deep learning (DL) to accelerate debugging at the Register-Transfer level (RTL) and generates explanations of likely root causes. Our experiments show that VeriBug can achieve an average bug localization coverage of 82.5% on open-source designs and a wide variety of injected bugs./proceedings-archive/2024/DATA/673_pdf_upload.pdf |
09:11 CET | TS03.10 | AN ENDEAVOR TO INDUSTRIALIZE HARDWARE FUZZING: AUTOMATING NOC VERIFICATION IN UVM Speaker: Ruiyang Ma, Peking University, CN Authors: Ruiyang Ma1, Huatao Zhao2, Jiayi Huang3, Shijian Zhang2 and Guojie Luo1 1Peking University, CN; 2Alibaba Group, CN; 3Hong Kong University of Science and Technology, HK Abstract We endeavor to make hardware fuzzing compatible with the standard IC development process and apply that to NoC verification in a real-world industrial environment. We systematically employ fuzzing throughout the entire NoC verification process, including router verification, network verification, and stress testing. As a case study, we apply our approach to an open-source NoC component in OpenPiton. Remarkably, our fuzzing methods automatically achieved complete code and functional coverage in the router and mesh network, and effectively detect injected starvation bugs. The evaluation results clearly demonstrate the practicability of our fuzzing approach to considerably reduce the manpower required for test case generation compared with traditional NoC verification./proceedings-archive/2024/DATA/571_pdf_upload.pdf |
09:12 CET | TS03.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
Date: Tuesday, 26 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
Alberto Ancilotto, Fondazione Bruno Kessler, IT
Session co-chair:
Chiara Sandionigi, CEA, FR
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | TS18.1 | LAVA: AN EFFECTIVE LAYER VARIATION AWARE BAD BLOCK MANAGEMENT FOR 3D CT NAND FLASH Speaker: Shuhan Bai, Huazhong University of Science and Technology; City University of Hong Kong, CN Authors: Shuhan BAI1, You ZHOU2, Fei WU2, Changsheng XIE2, Tei-Wei KUO3 and Chun Jason XUE4 1City University of Hong Kong, CN | Huazhong University of Science and Technology, CN; 2Huazhong University of Science & Technology, CN; 3National Taiwan University, TW; 4City University of Hong Kong, HK Abstract 3D NAND flash with charge trap (CT) technology has been developed by stacking multiple layers vertically to boost storage capacity while ensuring reliability and scalability. One of its critical characteristics is the large endurance variation among and inside blocks and layers. With this feature, traditional bad block management (BBM), which determines block lifetime by the page with worst endurance, results in underutilization of solid state drive (SSD) usage. In this paper, a layer variation aware and fault-tolerant bad block management, named LaVA, is proposed to prolong the lifetime of 3D NAND flash storage. The relevant layer, instead of the entire flash block, is discarded at a finer granularity when a page failure is encountered. Experimental results based on real-world workloads show that LaVA can significantly extend the endurance of 3D CT NAND flash (30.6%-62.2%) with a small performance degradation (less than 10% increase of tail I/O response time), compared to the conventional technique./proceedings-archive/2024/DATA/61_pdf_upload.pdf |
08:35 CET | TS18.2 | LEARNING ASSISTED POST-MANUFACTURE TESTING AND TUNING OF RRAM-BASED DNNS FOR YIELD RECOVERY Speaker: Abhijit Chatterjee, Georgia Tech, US Authors: Kwondo Ma, Anurup Saha, Chandramouli Amarnath and Abhijit Chatterjee, Georgia Tech, US Abstract Variability-induced accuracy degradation of RRAM-based DNNs is of great concern due to their significant potential for use in future energy-efficient machine learning architectures. To address this, we propose a two-step process. First, an enhanced testing procedure is used to predict DNN accuracy from a set of compact test stimuli (images). This test response (signature) is simply the concatenated vectors of output neurons of intermediate and final DNN layers over the compact test images applied. DNNs with a predicted accuracy below a threshold are then tuned based on this signature vector. Using a clustering based approach, the signature is mapped to the optimal tuning parameter values of the DNN (determined using off-line training of the DNN via back-propagation) in a single step, eliminating any post-manufacture training of the DNN weights (expensive). The tuning parameters themselves consist of the gains and offsets of the ReLU activation of neurons of the DNN on a per-layer basis and can be tuned digitally. Tuning is achieved in less than a second of tuning time, with yield improvements of over 45% with a modest accuracy reduction of 4% compared to digital DNNs./proceedings-archive/2024/DATA/442_pdf_upload.pdf |
08:40 CET | TS18.3 | A GRAPH-LEARNING-DRIVEN PREDICTION METHOD FOR COMBINED ELECTROMIGRATION AND THERMOMIGRATION STRESS ON MULTI-SEGMENT INTERCONNECTS Speaker: Yunfan Zuo, Southeast University, CN Authors: Yunfan Zuo1, Yuyang Ye1, Hongchao Zhang1, Tinghuan Chen2, Hao Yan1 and Longxing Shi1 1Southeast University, CN; 2The Chinese University of Hong Kong, CN Abstract As technology advances, the temperature gradient in interconnects becomes more significant, which causes serious thermomigration. Simulating the coupling effects of thermomigration (TM) and electromigration (EM) on large-scale circuits is very time-consuming, owing to a substantial increase in computational complexity. Recently, some researchers have utilized graph-learning-based methods to predict EM stress in medium-scale cases. Unfortunately, these works overlooked the effects of TM. To predict the EM-TM stress of large-scale interconnects accurately and efficiently, we propose a framework based on Graph Attention Networks (GATs) with a customized alternating aggregation method that collects information in junctions and branches of interconnects jointly. The experimental results show that our work achieves an average relative error of less than 1% compared to the commercial software COMSOL for interconnects consisting of fewer than 200 segments. Furthermore, our method achieves a 9037× speedup in predicting the OpenROAD test circuit, with a maximum segment number reaching 10807./proceedings-archive/2024/DATA/543_pdf_upload.pdf |
08:45 CET | TS18.4 | POLM: POINT CLOUD AND LARGE PRE-TRAINED MODEL CATCH MIXED-TYPE WAFER DEFECT PATTERN RECOGNITION Speaker: Hongquan He, ShanghaiTech University, CN Authors: Hongquan He1, Guowen Kuang2, Qi Sun3 and Hao Geng1 1ShanghaiTech University, CN; 2Shenzhen Polytechnic University, CN; 3Zhejiang University, CN Abstract As the technology node scales down to 5nm/3nm, the consequent difficulty has been widely lamented. The defects on the surface of wafers are much more prone to emerge during manufacturing than ever. What's worse, various single-type defect patterns may be coupled on a wafer and thus shape a mixed-type pattern. To improve yield during the design cycle, mixed-type wafer defect pattern recognition must be performed to identify the failure mechanisms. Motivated by these issues, we revisit failure dies on wafer maps by treating them as point sets in two-dimensional space and propose a two-stage classification framework, PoLM. The challenge of noise is considerably alleviated by first using an adaptive alpha-shapes algorithm to extract intricate geometric features of mixed-type patterns. Unlike sophisticated frameworks based on CNNs or Transformers, PoLM only completes classification within a point cloud cluster for aggregating and dispatching features. Furthermore, recognizing the remarkable success of large pre-trained foundation models (e.g., OpenAI's GPT-n series) in various visual tasks, this paper also introduces a training paradigm leveraging these pre-trained models and fine-tuning to improve the final recognition. Experiments demonstrate that our proposed framework significantly surpasses the state-of-the-art methodologies in classifying mixed-type wafer defect patterns./proceedings-archive/2024/DATA/638_pdf_upload.pdf |
08:50 CET | TS18.5 | DERAILED: ARBITRARILY CONTROLLING DNN OUTPUTS WITH TARGETED FAULT INJECTION ATTACKS Speaker: Jhon Ordonez Ingali, University of Delaware, BO Authors: Jhon Ordoñez and Chengmo Yang, University of Delaware, US Abstract Hardware accelerators have been widely deployed to improve the efficiency of DNN execution in terms of performance, power, and time predictability. Yet recent studies have shown that DNN accelerators are vulnerable to fault injection attacks, compromising their integrity and reliability. Classic fault injection attacks are capable of causing a high overall accuracy drop. However, one limitation is that they are difficult to control, as faults affect the computation across random classes. In comparison, this paper presents a controlled fault injection attack, capable of derailing arbitrary inputs to a targeted range of classes. Our observation is that the fully connected (FC) layers greatly impact inference results, whereas the computation in the FC layer is typically performed in order. Leveraging this fact, an adversary can perform a controlled fault injection attack even to a black-box DNN model. Specifically, this attack adopts a two-step search process that first identifies the time window during which the FC layer is computed and then pinpoints the targeted classes. This attack is implemented with clock glitching, and the target DNN accelerator is a DPU implemented in the FPGA. The attack is tested on three popular DNN models, namely, ResNet50, InceptionV1, and MobileNetV2. Results show that up to 93% of inputs are derailed to the attacker-specified classes, demonstrating its effectiveness./proceedings-archive/2024/DATA/988_pdf_upload.pdf |
08:55 CET | TS18.6 | IOT-GRAF: IOT GRAPH LEARNING-BASED ANOMALY AND INTRUSION DETECTION THROUGH MULTI-MODAL DATA FUSION Speaker: Rozhin Yasaei, University of Arizona, US Authors: Rozhin Yasaei1, Yasamin Moghaddas2 and Mohammad Al Faruque2 1University of Arizona, US; 2University of California, Irvine, US Abstract IoT devices are vulnerable to attacks and failures, which are challenging to mitigate because of the multi-disciplinary nature of IoT. Although numerous stand-alone approaches are proposed in the literature for network intrusion detection or sensor anomaly detection, a holistic model is missing to integrate the information from both domains and extract the valuable context shared among different system components. To address this gap, we present a novel approach for multi-modal data fusion that fuses sensor and communication data for the first time. We integrate IoT physical and cyber elements into a graph representation that signifies the correlation between elements as a connection and provides embedding for data generated by each component. We construct a Graph Neural Network (GNN) model to learn the context and normal state of the system and detect abnormal activities. We further study the signatures of network and sensor attacks to determine the source of the anomaly to facilitate fast and informed recovery after an incident. Our model is optimized to be executed on a fog-based platform for real-time system supervision. Our experiment on greenhouse monitoring IoT systems demonstrates that our approach, on average, achieves 22% F1-score improvement over the single-modal techniques for anomaly detection./proceedings-archive/2024/DATA/1102_pdf_upload.pdf |
09:00 CET | TS18.7 | ALLEVIATING BARREN PLATEAUS IN PARAMETERIZED QUANTUM MACHINE LEARNING CIRCUITS: INVESTIGATING ADVANCED PARAMETER INITIALIZATION STRATEGIES Speaker: Muhammad Kashif, Center for Quantum and Topological Systems, New York University Abu Dhabi, PK Authors: Muhammad Kashif1, Muhammad Rashid2, Saif Al-Kuwari3 and Muhammad Shafique1 1eBrain Lab, Division of Engineering, New York University (NYU) Abu Dhabi, UAE, AE; 2Computer Engineering Department, Umm Al Qura University, Makkah, Saudi Arabia, SA; 3Qatar Center for Quantum Computing, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar, QA Abstract Parameterized quantum circuits (PQCs) have emerged as a foundational element in the development and applications of quantum algorithms. However, when initialized with random parameter values, PQCs often exhibit barren plateaus (BP). These plateaus, characterized by vanishing gradients with an increasing number of qubits, hinder optimization in quantum algorithms. In this paper, we analyze the impact of state-of-the-art parameter initialization strategies from classical machine learning in random PQCs from the aspect of the BP phenomenon. Our investigation encompasses a spectrum of initialization techniques, including random, Xavier (both normal and uniform variants), He, LeCun, and Orthogonal methods. Empirical assessment reveals a pronounced reduction in variance decay of gradients across all these methodologies compared to the randomly initialized PQCs. Specifically, the Xavier initialization technique outperforms the rest, showing a 62% improvement in variance decay compared to the random initialization. The He, LeCun, and Orthogonal methods also display improvements, with respective enhancements of 32%, 28%, and 26%. This compellingly suggests that the adoption of these existing initialization techniques holds the potential to significantly amplify the training efficacy of Quantum Neural Networks (QNNs), a subclass of PQCs. Demonstrating this effect, we employ the identified techniques to train QNNs for learning the identity function, effectively mitigating the adverse effects of BPs. The training performance, ranked from the best to the worst, aligns with the variance decay enhancement as outlined above. This paper underscores the role of tailored parameter initialization in mitigating the BP problem and eventually enhancing the training dynamics of QNNs. (An illustrative sketch of these classical initializers appears after this session's table.)/proceedings-archive/2024/DATA/653_pdf_upload.pdf |
09:05 CET | TS18.8 | FARE: FAULT-AWARE GNN TRAINING ON RERAM-BASED PIM ACCELERATORS Speaker: Pratyush Dhingra, Washington State University, US Authors: Pratyush Dhingra1, Chukwufumnanya Ogbogu1, Biresh Kumar Joardar2, Jana Doppa1, Ananth Kalyanaraman1 and Partha Pratim Pande1 1Washington State University, US; 2University of Houston, US Abstract Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architecture is an attractive solution for training Graph Neural Networks (GNNs) on edge platforms. However, the immature fabrication process and limited write endurance of ReRAMs make them prone to hardware faults, thereby limiting their widespread adoption for GNN training. Further, the existing fault-tolerant solutions prove inadequate for effectively training GNNs in the presence of faults. In this paper, we propose a fault-aware framework referred to as FARe that mitigates the effect of faults during GNN training. FARe outperforms existing approaches in terms of both accuracy and timing overhead. Experimental results demonstrate that FARe framework can restore GNN test accuracy by 47.6% on faulty ReRAM hardware with a ~1% timing overhead compared to the fault-free counterpart./proceedings-archive/2024/DATA/432_pdf_upload.pdf |
09:10 CET | TS18.9 | TOWARDS SEU FAULT PROPAGATION PREDICTION WITH SPATIO-TEMPORAL GRAPH CONVOLUTIONAL NETWORKS Speaker: Li Lu, IHP – Leibniz Institute for High Performance Microelectronics, DE Authors: Li Lu, Junchao Chen, Markus Ulbricht and Milos Krstic, IHP – Leibniz Institute for High Performance Microelectronics, DE Abstract Assessing Single Event Upset (SEU) sensitivity in complex circuits is increasingly important but challenging. This paper proposes an efficient approach using Spatio-temporal Graph Convolutional Networks (STGCN) to predict the results of SEU simulation-based fault injection. Representing circuit structures as graphs and integrating temporal data from the workload's waveform into these graphs, STGCN achieves a 94–96% prediction accuracy on four test circuits./proceedings-archive/2024/DATA/147_pdf_upload.pdf |
09:11 CET | TS18.10 | FAST ESTIMATION FOR ELECTROMIGRATION NUCLEATION TIME BASED ON RANDOM ACTIVATION ENERGY MODEL Speaker: Jingyu Jia, Beijing University of Posts and Telecommunications, CN Authors: Jingyu Jia, Jianwang Zhai and Kang Zhao, Beijing University of Posts and Telecommunications, CN Abstract Electromigration (EM) has attracted significant interest in recent years, because the current density of on-chip power delivery networks (PDNs) is always increasing. However, the EM phenomenon is affected by the randomness of the annealing process during nanofabrication, which requires more reliable statistical models for EM analysis. In this work, we propose a fast estimation method for EM nucleation time based on the random activation energy model. Experiments demonstrate that our method can accurately and efficiently analyze the nucleation time distribution under random processes, and achieve 39.1% improvement in estimation speed compared with the previous work./proceedings-archive/2024/DATA/496_pdf_upload.pdf |
09:12 CET | TS18.11 | OUT-OF-DISTRIBUTION DETECTION USING POWER-SIDE CHANNELS FOR IMPROVING FUNCTIONAL SAFETY OF NEURAL NETWORK FPGA ACCELERATORS Speaker: Vincent Meyers, Karlsruhe Institute of Technology, DE Authors: Vincent Meyers, Dennis Gnad and Mehdi Tahoori, Karlsruhe Institute of Technology, DE Abstract Accurate out-of-distribution (OOD) detection is crucial for ensuring the safety and reliability of neural network (NN) accelerators in real-world scenarios. This paper proposes a novel OOD detection approach for NN FPGA accelerators using remote power side-channel measurements. We assess different methods for distinguishing power measurements of in-distribution (ID) samples from OOD samples, comparing the effectiveness of simple power analysis and OOD sample identification based on the reconstruction error of an autoencoder (AE). Leveraging on-chip voltage sensors enables non-intrusive and concurrent remote OOD detection, eliminating the need for explicit labels or modifications to the underlying NN./proceedings-archive/2024/DATA/624_pdf_upload.pdf |
09:13 CET | TS18.12 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
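For readers who want to see what the classical initializers compared in TS18.7 look like in practice, the following is a minimal, purely illustrative sketch (not the authors' code): it draws rotation angles for a parameterized circuit using the random, Xavier, He and LeCun rules. The per-layer fan-in/fan-out choice, the array shape and the function name are assumptions made only for this example.

```python
# Illustrative sketch only (not the TS18.7 implementation): classical
# initialization rules applied to the rotation angles of a parameterized circuit.
import numpy as np

def init_pqc_angles(n_layers, n_qubits, scheme="xavier_uniform", seed=0):
    """Return an (n_layers, n_qubits) array of initial rotation angles."""
    rng = np.random.default_rng(seed)
    fan_in = fan_out = n_qubits          # simplifying assumption for this sketch
    shape = (n_layers, n_qubits)
    if scheme == "random":               # baseline: uniform over [0, 2*pi)
        return rng.uniform(0.0, 2.0 * np.pi, shape)
    if scheme == "xavier_uniform":       # U(-b, b) with b = sqrt(6/(fan_in+fan_out))
        b = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-b, b, shape)
    if scheme == "xavier_normal":        # N(0, 2/(fan_in+fan_out))
        return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), shape)
    if scheme == "he":                   # N(0, 2/fan_in)
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), shape)
    if scheme == "lecun":                # N(0, 1/fan_in)
        return rng.normal(0.0, np.sqrt(1.0 / fan_in), shape)
    raise ValueError(f"unknown scheme: {scheme}")

for s in ("random", "xavier_uniform", "he", "lecun"):
    angles = init_pqc_angles(n_layers=4, n_qubits=8, scheme=s)
    print(f"{s:15s} std of initial angles = {angles.std():.3f}")
```

The sketch only illustrates how the initial angle distributions differ between schemes; the barren-plateau analysis itself is in the paper.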
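TS18.11 flags out-of-distribution inputs from the reconstruction error of an autoencoder. As a point of reference only, here is a minimal software sketch of that generic idea using synthetic data and a small scikit-learn regressor as the autoencoder; the data, model size and 95th-percentile threshold are assumptions for illustration and do not reproduce the paper's on-chip power-sensor pipeline.

```python
# Illustrative only: reconstruction-error OOD detection in the spirit of TS18.11.
# Data, model size, and threshold are placeholders, not the paper's setup.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_id = rng.normal(0.0, 1.0, (500, 16))   # stand-in for in-distribution samples
X_ood = rng.normal(3.0, 1.0, (50, 16))   # stand-in for out-of-distribution samples

# A small autoencoder: the network is trained to reproduce its own input.
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=3000, random_state=0)
ae.fit(X_id, X_id)

def recon_error(model, X):
    """Per-sample mean squared reconstruction error."""
    return ((X - model.predict(X)) ** 2).mean(axis=1)

threshold = np.quantile(recon_error(ae, X_id), 0.95)   # calibrated on ID data
flags = recon_error(ae, X_ood) > threshold              # True = flagged as OOD
print(f"flagged {flags.mean():.0%} of the synthetic OOD samples")
```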
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 08:30 CET - 12:30 CET
Location / Room: Break-Out Room S6+7
Organiser
Pascal Vivet, CEA-LIST, France
3D technologies are becoming more and more pervasive in digital architectures as a strong enabler for heterogeneous integration. With current nanometric technologies approaching their limits, 3D integration is paving the way to a wide architectural scope, with reduced cost, reduced form factor and increased energy efficiency, allowing a wide variety of heterogeneous architectures. Due to the large amount of required data and associated memory capacity, ML and AI accelerators could benefit from 3D integration not only for HPC, but also for the edge and embedded HPC. 3D integration and associated architectures are opening a wide spectrum of system solutions, from chiplet-based partitioning for high-performance computing to various sensors such as fully integrated image sensors embedding AI features, as well as next-generation computing architectures: AI accelerators, in-memory computing, quantum, etc.
Subject to final changes
Pascal Vivet, CEA-List & IRT Nanoelec, France
Gianna Paulin, ETH-Z, Switzerland
Peter Ramm, Fraunhofer EMFT, Germany
Mustafa Badaroglu, QUALCOMM, United States
Subhasish Mitra, Stanford University, United States
The 3D Integration workshop took place from 2009 to 2015 and restarted in 2022.
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 11:00 CET - 12:00 CET
Location / Room: Break-Out Room S8
Session chair:
Selma Saidi, TU Dortmund, DE
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | ASD05K.1 | CERTAINTY OR INTELLIGENCE: PICK ONE! Speaker and Author: Edward Lee, University of California, Berkeley, US Abstract Mathematical models can yield certainty, as can probabilistic models where the probabilities degenerate. The field of formal methods emphasizes developing such certainty about engineering designs. In safety critical systems, such certainty is highly valued and, in some cases, even required by regulatory bodies. But achieving reasonable performance for sufficiently complex environments appears to require the use of AI technologies, which resist such certainty. This extended abstract suggests that certainty and intelligence may be fundamentally incompatible./proceedings-archive/2024/DATA/1295_pdf_upload.pdf |
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Break-Out Room S3+4
Session chair:
Catherine Le Lan, Synopsys, FR
Organisers:
Catherine Le Lan, Synopsys, FR
Olivier Sentieys, INRIA, FR
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | FS08.1 | INTRODUCTION Presenter: Catherine Le Lan, SYNOPSYS, FR Abstract Introduction to the Session |
11:05 CET | FS08.2 | CULTIVATING A DIVERSE WORKFORCE TO MEET EU CHIPS ACT AMBITIONS Presenter: Laith Altimime, SEMI EUROPE, DE Abstract Talent remains one of the industry's most pressing global challenges. To meet the EU Chips Act ambition of a 20% global market share by 2030, we will need to more than double the current workforce in Europe, with more diverse skills. We will need to explore new strategies to scale up initiatives to populate the talent pipeline. Strategies include scaled-up funding mechanisms at national and global levels, establishing global and regional industry–academia partnerships with a focus on diversity, upskilling, reskilling and retention of the future workforce, as well as collaborative global initiatives to attract and facilitate talent mobility for a diverse international workforce. |
11:25 CET | FS08.3 | EUROPEAN UNIVERSITY TRAINING FOR CHIP ACT: DIVERSITY, INTERNATIONALIZATION, COLLABORATION Presenter: Mehdi Tahoori, Karlsruhe Institute of Technology, DE Abstract Workforce education, training, and retention in European universities are paramount for the success of the EU Chip Act, facilitating a thriving chip design and manufacturing ecosystem in Europe. Central to this endeavor is the attraction of global talent, achieved through strategic initiatives like internationalization and the promotion of diversity. Establishing an inclusive, discrimination-free environment is pivotal in cultivating an atmosphere conducive to international scholars, both during their academic pursuits and when selecting their career paths within European industries. From a technical perspective, industry-relevant educational programs in chip design and university-industry research collaboration play indispensable roles in attracting, educating, and retaining students at both MS and PhD levels, ensuring they possess the necessary competence in chip design. |
11:45 CET | FS08.4 | MEETING INDUSTRY NEEDS FOR TOMORROW'S CHIP DESIGN WORK FORCE Presenter: Yervant Zorian, Synopsys, US Abstract As chip design requirements continue to change, tomorrow's workforce needs to perform new types of activities and requires corresponding preparation. A well-defined collaboration between academia and the chip design industry is necessary to prepare tomorrow's workforce during its education stage to utilize advanced ML-based EDA tools, embedded firmware, high-performance interface IP subsystems, and chiplets for 2.5D/3D multi-die designs. This presentation will cover a particular educational model developed by the Synopsys Armenia Educational Department to prepare tomorrow's workforce, implemented in partnership with several academic institutions. |
12:05 CET | FS08.5 | HOW TO ENSURE EXISTING ENGINEERS ARE TRAINED ON TOMORROW'S CHALLENGES Presenter: Peter Groos, Cadence Design Systems, US Abstract The increasing demand for engineers in the semiconductor industry in the next 5 to 10 years requires strategies and programs, including national or regional funding, to reduce the workforce and educational gap. The effect of these programs is rather mid to long term and yet to be proven. Under the assumption that the shortage will prevail, measures need to be taken to continuously develop and maintain the knowledge and skill set of today's engineering teams to create future products and solutions in the most efficient ways. Workforce development programs help to develop new skills through re-skilling or up-skilling. Furthermore, prediction of market trends and agility to adapt to new demands are key to having the right skill sets when they are needed. For a successful implementation, these programs need to be centered around the individual, embedded in an inclusive team culture. |
12:25 CET | FS08.6 | CONCLUSION Presenter: Catherine Le Lan, SYNOPSYS, FR Abstract Conclusion of the session |
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Break-Out Room S1+2
Session chair:
Sara Vinco, Politecnico di Torino, IT
Session co-chair:
Franco Fummi, Università di Verona, IT
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | MPP01.1 | EMDRIVE ARCHITECTURE: EMBEDDED COMPUTING AND DIAGNOSTICS FROM SENSOR TO EDGE Speaker: Patrick Schmidt, Karlsruhe Institute of Technology, DE Authors: Patrick Schmidt1, Iuliia Topko1, Matthias Stammler1, Tanja Harbaum1, Juergen Becker1, Rico Berner2, Omar Ahmed2, Jakub Jagielski2, Thomas Seidler2, Markus Abel2, Marius Kreutzer3, Maximilian Kirschner3, Victor Pazmino3, Robin Sehm4, Lukas Groth4, Andrija Neskovic5, Rolf Meyer4, Saleh Mulhem6, Mladen Berekovic4, Matthias Probst7, Manuel Brosch7, Georg Sigl8, Thomas Wild7, Matthias Ernst7, Andreas Herkersdorf7, Florian Aigner9, Stefan Hommes10, Sebastian Lauer10, Maximilian Seidler11, Thomas Raste12, Gasper Bozic13, Ibai Irigoyen Ceberio13, Muhammad Hassan13 and Albrecht Mayer13 1Karlsruhe Institute of Technology, DE; 2Ambrosys GmbH, DE; 3FZI Research Center for Information Technology, DE; 4Universität zu Lübeck, DE; 5University of Luebeck, DE; 6Institute of Computer Engineering, University of Lübeck, DE; 7TU Munich, DE; 8TU Munich/Fraunhofer AISEC, DE; 9Elektrobit Automotive GmbH, DE; 10ZF Friedrichshafen AG, DE; 11Siemens AG Technology, DE; 12Continental Automotive Technologies GmbH, DE; 13Infineon Technologies, DE Abstract Future automotive architectures are expected to transition from a network-centric to a domain-centered architecture featuring central compute units. Powerful domain controllers or smart sensors alleviate the load on these central units and communication systems. These controllers execute tasks with varying criticalities on heterogeneous multicore processors, and are ideally capable of dynamically balancing the computing load between the central unit and sensors. Here, AI capabilities play a crucial role, as they are in high demand in such an automotive architecture. However, AI still requires specialized accelerators to improve its computational performance. Task-oriented distributed computing with criticalities up to ASIL-D necessitates the development and utilization of specialized methodologies, for example ensuring safety through the isolation and abstraction of low-level hardware concepts. Meanwhile, online monitoring and diagnostics become vital features to detect errors during operation. The EMDRIVE architecture includes methods, components, and strategies to enhance the performance, safety, and security of such distributed computing platforms. The nationally funded EMDRIVE project connects its twelve partners from academia and industry and is currently in its intermediate stage./proceedings-archive/2024/DATA/3004_pdf_upload.pdf |
11:05 CET | MPP01.2 | EVALUATING AN OPEN-SOURCE HARDWARE APPROACH FROM HDL TO GDS FOR A SECURITY CHIP DESIGN — A REVIEW OF THE FINAL STAGE OF PROJECT HEP Speaker: Norbert Herfurth, IHP – Leibniz Institute for High Performance Microelectronics, DE Authors: Norbert Herfurth1, Tim Henkes2, Tuba Kiyan3, Fabian Buschkowski4, Christoph Lüth5, Steffen Reith6, Pascal Sasdrich7, Milan Funk8, Rene Rathfelder9, Aygün Walter9, Julian Wälde10, Goran Panic1, Levente Suta11, Detlef Boeck11, Arnd Weber3 and Torsten Grawunder12 1IHP – Leibniz Institute for High Performance Microelectronics, DE; 2HSRM - Hochschule RheinMain, DE; 3TU Berlin, DE; 4Ruhr-University, DE; 5University of Bremen | DFKI, DE; 6RheinMain University of Applied Sciences, DE; 7Ruhr-Universität Bochum, DE; 8Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), DE; 9IAV GmbH, DE; 10Fraunhofer SIT, DE; 11Elektrobit Automotive GmbH, DE; 12Swissbit Germany AG, DE Abstract The project "Hardening the value chain through open-source, trustworthy EDA tools and processors (HEP)'' uses open-source, free components and tools for the production of a prototypical security chip. A design flow using only free and open tools from the abstract description in SpinalHDL via OpenROAD down to the GDS-file for tape-out has been established, and first ASICs produced at IHP. The prototypical hardware security module (HSM) produced in this way provides, among other things, a processor based on VexRiscv, a cryptographic accelerator and masking of cryptographic keys. The open development tools used in the process were integrated into a common environment and expanded to include missing functionality. Subsequently, the whole tool chain and its peripherals are wrapped into a new Nyx container. The easy accessibility of the used process significantly reduces the learning curve for chip design. Additionally, we provide tools for formal verification and masking against side-channel attacks in our design flow. Interest in the results of project HEP has been shown in publications in which industrial partners participated, such as Elektrobit, Hensoldt Cyber, IAV, Secure-IC and Swissbit Germany./proceedings-archive/2024/DATA/3008_pdf_upload.pdf |
11:10 CET | MPP01.3 | THE 3D NEURAL NETWORK COMPUTE CUBE (N2C2) CONCEPT ENABLING EFFICIENT HARDWARE TRANSFORMER ARCHITECTURES TOWARDS SPEECH-TO-SPEECH TRANSLATION Speaker: Cristell Maneux, University of Bordeaux, FR Authors: Ian O'Connor1, Sara Mannaa1, Alberto Bosio1, Bastien Deveautour1, Damien Deleruyelle1, Tetiana Obukhova1, Cedric Marchand2, Jens Trommer3, Cigdem Cakirlar4, Bruno Neckel Wesling4, Thomas Mikolajick5, Oskar Baumgartner6, Mischa Thesberg6, David Pirker6, Christoph Lenz6, Zlatan Stanojevic6, Markus Karner7, Guilhem Larrieu8, Sylvain Pelloquin8, Konstantinous Moustakas8, Jonas Muller8, Giovanni Ansaloni9, Alireza Amirshahi9, David Atienza9, Jean-Luc Rouas10, Leila Ben Letaifa11, Georgeta Bordea12, Charles Brazier10, Yifan Wang13, Chhandak Mukherjee14, Marina Deng14, Marc François14, Houssem Rezgui14, Reveil Lucas14 and Cristell Maneux15 1Lyon Institute of Nanotechnology, FR; 2École Centrale de Lyon, FR; 3Namlab gGmbH, DE; 4Namlab gGmbH, DE; 5NaMLab gGmbH / TU Dresden, DE; 6Global TCAD Solutions GmbH, AT; 7Global TCAD Solutions, AT; 8LAAS-CNRS, FR; 9EPFL, CH; 10LaBRI CNRS UMR 5800 University Bordeaux, FR; 11LaBRI, FR; 12Université de Bordeaux, FR; 13IMS Bordeaux, FR; 14IMS Bordeaux, FR; 15University of Bordeaux, FR Abstract This multi-partner-project contribution introduces the midway results of the H2020 FVLLMONTI project. Our objective is to develop a new and ultra-efficient class of ANN accelerators based on the novel neural network compute cube (N2C2) block, specifically designed to execute complex machine learning tasks in a 3D technology, in order to provide the high computing power and ultra-high efficiency needed for future EdgeAI applications. We showcase its effectiveness by targeting the challenging class of Transformer ANNs, tailored for Automatic Speech Recognition and Machine Translation, both fundamental components of speech-to-speech translation. To gain the full benefit of the accelerator design, we develop disruptive vertical transistor technologies and execute a design-technology co-optimization (DTCO) loop from single device to cell and compute-cube level. Further, hardware-software co-optimization is implemented, e.g., by compressing the executed speech recognition and translation models for energy-efficient execution without substantial loss in precision./proceedings-archive/2024/DATA/3010_pdf_upload.pdf |
11:15 CET | MPP01.4 | A SCALABLE RISC-V HARDWARE PLATFORM FOR INTELLIGENT SENSOR PROCESSING Speaker: Paul Palomero Bernardo, University of Tübingen, DE Authors: Paul Palomero Bernardo1, Patrick Schmid1, Oliver Bringmann2, Mohammed Iftekhar3, Babak Sadiye3, Wolfgang Mueller3, Andreas Koch4, Eyck Jentzsch5, Axel Sauer6, Ingo Feldner6 and Wolfgang Ecker7 1University of Tübingen, DE; 2University of Tübingen | FZI, DE; 3University of Paderborn, DE; 4TU Darmstadt, DE; 5MINRES Technologies GmbH, DE; 6Robert Bosch GmbH, DE; 7Infineon Technologies, DE Abstract This paper presents a demonstrator chip for an industrial audio event detection application developed as part of the Scale4Edge project. The project aims at enabling a comprehensive RISC-V based ecosystem to efficiently assemble well-tailored edge devices. The chip is manufactured in GlobalFoundries' 22FDX technology and contains a RISC-V CPU with custom Instruction-Set-Architecture Extensions (ISAX) for fast AI and DSP processing, a low power neural network accelerator, and a scalable PLL to fulfill real-time processing requirements. By automated integration of these specialized hardware components, we achieve a speedup of 2.15× while reducing the power by 27% compared to the unoptimized solution./proceedings-archive/2024/DATA/3011_pdf_upload.pdf |
11:20 CET | MPP01.5 | NEUSPIN: DESIGN OF A RELIABLE EDGE NEUROMORPHIC SYSTEM BASED ON SPINTRONICS FOR GREEN AI Speaker: Lorena Anghel, University Grenoble Alpes, CEA, CNRS, Grenoble INP, and IRIG-Spintec, FR Authors: Soyed Tuhin Ahmed1, Kamal Danouchi2, Guillaume Prenat3, Lorena Anghel4 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2University Grenoble Alpes, CEA, CNRS, Grenoble INP, IRIG-Spintec Laboratory, FR; 3Spintec, FR; 4Grenoble-Alpes University, Grenoble, France, FR Abstract Artificial intelligence (AI) models and algorithms are considered data-centric and inherently resource-hungry. Existing hardware architectures with separate computing units and memory blocks (Von Neumann architectures) are not suitable for implementing AI models for inference in an energy-efficient manner due to the memory wall problem. Another class of AI model, called the Bayesian Neural Network (BayNN), has the ability to provide uncertainty associated with an inference prediction. However, BayNNs are even more resource-hungry compared to conventional AI models. Thus, an alternative biologically inspired neuromorphic hardware architecture, called computing-in-memory (CIM) with emerging resistive memories, locally combines memory blocks and computing units. CIM offers high computational efficiency and extremely low power consumption. As a result, some of the inherent costs of BayNN can be reduced. However, BayNNs require statistical distributions to be implemented on CIM hardware, which is challenging. The NeuSPIN project pursues the ambitious goal of transferring BayNN algorithms to CIM with full-stack hardware and software co-design efforts. We aim to use novel variants of non-volatile spintronic memory (NVM) technologies with specifically developed neuron and synapse designs to implement the CIM neural network architecture. Although NVM technologies are promising for CIM-based computing systems due to their efficient implementation, their non-ideal properties, in particular variability, stochasticity, and manufacturing defects, make the implementation of modern adapted AI algorithms using them technically challenging. We explore novel algorithmic approaches with spintronic-based CIM hardware adaptations and BayNN implementations. Furthermore, we explore algorithmic and circuit design approaches that improve the robustness and testability of BayNNs./proceedings-archive/2024/DATA/3020_pdf_upload.pdf |
11:25 CET | MPP01.6 | AUTONOMOUS REALIZATION OF SAFETY- AND TIME-CRITICAL EMBEDDED ARTIFICIAL INTELLIGENCE Speaker: Björn Forsberg, RISE Research Institutes of Sweden, SE Authors: Joakim Lindén1, Andreas Ermedahl2, Hans Salomonsson3, Masoud Daneshtalab4, Björn Forsberg5 and Paris Carbone4 1SAAB AB, SE; 2Ericsson AB, SE; 3EMBEDL AB, SE; 4KTH Royal Institute of Technology, SE; 5RISE Research Institutes of Sweden, SE Abstract There is an evident need to complement embedded critical control logic with AI inference, but today's AI-capable hardware, software, and processes are primarily targeted towards the needs of cloud-centric actors. The telecom and defense aerospace industries, which make heavy use of specialized hardware, face the challenge of manually hand-tuning AI workloads and hardware, presenting an unprecedented cost and complexity due to the diversity and sheer number of deployed instances. Furthermore, embedded AI functionality must not adversely affect real-time and safety requirements of the critical business logic. To address this, end-to-end AI pipelines for critical platforms are needed to automate the adaptation of networks to fit into resource-constrained devices under critical and real-time constraints, while remaining interoperable with de-facto standard AI tools and frameworks used in the cloud. We present two industrial applications where such solutions are needed to bring AI to critical and resource-constrained hardware, and a generalized end-to-end AI pipeline that addresses these needs. Crucial steps to realize it are taken in the industry-academia collaborative FASTER-AI project./proceedings-archive/2024/DATA/3001_pdf_upload.pdf |
11:26 CET | MPP01.7 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Auditorium 2
Time | Label | Presentation Title Authors |
---|---|---|
SD02.1 | WELCOME BACK Presenter: Ana Lucia Varbanescu, University of Twente, NL Author: Ana Lucia Varbanescu, University of Twente, NL Abstract Welcome back |
11:00 CET | SD02.2 | MACHINE LEARNING-BASED PERFORMANCE PREDICTION AND OPTIMIZATION FOR SUSTAINABLE COMPUTING ON PUBLIC CLOUDS Presenter: David Atienza, EPFL, CH Author: David Atienza, EPFL, CH Abstract . |
11:35 CET | SD02.3 | SUSTAINABLE SUPERCOMPUTING: CHALLENGES AND OPPORTUNITIES FOR HPC AND AI Presenter: Michele Weiland, EPCC - University of Edinburgh, GB Author: Michele Weiland, EPCC - University of Edinburgh, GB Abstract . |
12:10 CET | SD02.4 | DATA-DRIVEN SYSTEM AVAILABILITY PREDICTION FOR SUSTAINABLE DATACENTER OPERATION Presenter: Martin Molan, Università di Bologna, IT Author: Martin Molan, Università di Bologna, IT Abstract . |
12:17 CET | SD02.5 | RESEARCH FOR SUSTAINABILITY: UNRAVELING PATHS TOWARDS ECO-INNOVATION Presenter: Chiara Sandionigi, CEA-LIST, FR Author: Chiara Sandionigi, CEA-LIST, FR Abstract . |
12:25 CET | SD02.6 | DISCUSSION, POLL, CLOSING Presenter: Ana Lucia Varbanescu, University of Twente, NL Author: Ana Lucia Varbanescu, University of Twente, NL Abstract . |
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
Guillaume Prenat, SPINTEC, FR
Session co-chair:
William Simon, IBM, US
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | TS13.1 | COMET: A CROSS-LAYER OPTIMIZED OPTICAL PHASE CHANGE MAIN MEMORY ARCHITECTURE Speaker: Sudeep Pasricha, Colorado State University, US Authors: Febin Sunny1, Amin Shafiee1, Benoit Charbonnier2, Mahdi Nikdast1 and Sudeep Pasricha1 1Colorado State University, US; 2CEA-Leti, FR Abstract Traditional DRAM-based main memory systems face several challenges with memory refresh overhead, high latency, and low throughput as the industry moves towards smaller DRAM cells. These issues have been exacerbated by the emergence of data-intensive applications in recent years. Memories based on phase change materials (PCMs) offer promising solutions to these challenges. PCMs store data in the material's phase, which can shift between amorphous and crystalline states when external thermal energy is supplied. This is often achieved using electrical pulses. Alternatively, using laser pulses and integration with silicon photonics offers a unique opportunity to realize high-bandwidth and low-latency photonic memories. Such a memory system may in turn open the possibility of realizing fully photonic computing systems. But to realize photonic memories, several challenges that are unique to the photonic domain, such as crosstalk, optical loss management, and laser power overhead, have to be addressed. In this work, we present COMET, the first cross-layer optimized optical main memory architecture that uses PCMs. In architecting COMET, we explore how to use silicon photonics and PCMs together to design a large-scale main memory system while addressing associated challenges. We explore challenges and propose solutions at the PCM cell, photonic memory circuit, and memory architecture levels. Based on our evaluations, COMET offers 75.3× lower latencies, 5.1× better bandwidth, 12.9× lower EPB, and 65.8× better BW/EPB than the best-known prior work on photonic main memory architecture design./proceedings-archive/2024/DATA/245_pdf_upload.pdf |
11:05 CET | TS13.2 | CAFEHD: A CHARGE-DOMAIN FEFET-BASED COMPUTE-IN-MEMORY HYPERDIMENSIONAL ENCODER WITH HYPERVECTOR MERGING Speaker: Taixin Li, Tsinghua University, CN Authors: Taixin Li1, Hongtao Zhong1, Juejian Wu1, Thomas Kämpfe2, Kai Ni3, Vijaykrishnan Narayanan4, Huazhong Yang1 and Xueqing Li1 1Tsinghua University, CN; 2Fraunhofer IPMS, DE; 3Rochester Institute of Technology, US; 4Pennsylvania State University, US Abstract Hyperdimensional computing (HDC) is an emerging paradigm that employs hypervectors (HVs) to emulate cognitive tasks. In HDC, the most time-consuming and power-hungry process is encoding, the first step that maps raw data into HVs. There have been non-volatile memory (NVM) based computing-in-memory (CiM) HDC encoding designs, which exploit the intrinsic HDC characteristics of high parallelism, massive data, and robustness. These NVM-based CiMs have shown great potential in reducing encoding time and power consumption. Among them, the ferroelectric field-effect transistor (FeFET) based designs show ultra-high energy efficiency. However, existing FeFET-based HDC encoding designs face the challenges of energy-consuming current-mode addition, inefficient HV storage, limited endurance, and single encoding method support. These challenges limit the energy efficiency, lifetime, and versatility of the designs. This work proposes an energy-efficient charge-domain FeFET-based in-memory HDC encoder, i.e., CafeHD, with extended lifetime, good versatility, and comparable accuracy. Area-efficient charge-domain computing is proposed in HDC encoding for the first time, which enables CafeHD with ultra-low power and high scalability. An HV merging technique is explored to improve the performance. A low-cost partial MAJ interface is also proposed to reduce writes. Besides, CafeHD also supports two widely used encoding methods. Results show that CafeHD on average achieves 10.9x/12.7x/3.5x speedup and 103.3x/21.9x/6.3x energy efficiency with ~84% write times reduction and similar accuracy compared with the state-of-the-art ReRAM/PCM/FeFET-based CiM design for HDC encoding, respectively./proceedings-archive/2024/DATA/479_pdf_upload.pdf |
11:10 CET | TS13.3 | AFPR-CIM: AN ANALOG-DOMAIN FLOATING-POINT RRAM-BASED COMPUTE-IN-MEMORY ARCHITECTURE WITH DYNAMIC RANGE ADAPTIVE FP-ADC Speaker: Haobo Liu, University of Electronic Science and Technology of China, CN Authors: Haobo Liu1, Zhengyang Qian2, Wei Wu2, Hongwei Ren3, Zhiwei Liu1 and Leibin Ni2 1University of Electronic Science and Technology of China, CN; 2Huawei Technologies Co., Ltd., CN; 3Hong Kong University of Science and Technology, HK Abstract Power consumption has become the major concern in neural network accelerators for edge devices. The novel non-volatile-memory (NVM) based computing-in-memory (CIM) architecture has shown great potential for better energy efficiency. However, most of the recent NVM-CIM solutions mainly focus on fixed-point calculation and are not applicable to floating-point (FP) processing. In this paper, we propose an analog-domain floating-point CIM architecture (AFPR-CIM) based on resistive random-access memory (RRAM). A novel adaptive dynamic-range FP-ADC is designed to convert the analog computation results into FP codes. Output current with high dynamic range is converted to a normalized voltage range for readout, to prevent precision loss at low power consumption. Moreover, a novel FP-DAC is also implemented which reconstructs FP digital codes into analog values to perform analog computation. The proposed AFPR-CIM architecture enables neural network acceleration with FP8 (E2M5) activation for better accuracy and energy efficiency. Evaluation results show that AFPR-CIM can achieve 19.89 TFLOPS/W energy efficiency and 1474.56 GOPS throughput. Compared to traditional FP8 accelerator, digital FP-CIM, and analog INT8-CIM, this work achieves 4.135×, 5.376×, and 2.841× energy efficiency enhancement, respectively./proceedings-archive/2024/DATA/563_pdf_upload.pdf |
11:15 CET | TS13.4 | ARCTIC: AGILE AND ROBUST COMPUTE-IN-MEMORY COMPILER WITH PARAMETERIZED INT/FP PRECISION AND BUILT-IN SELF TEST Speaker: Hongyi Zhang, State Key Laboratory of Integrated Chips and Systems, Fudan University, CN Authors: Hongyi Zhang1, Haozhe Zhu1, Siqi He1, Mengjie Li1, Chengchen Wang2, Xiankui Xiong2, Haidong Tian2, Xiaoyang Zeng1 and Chixiao Chen1 1State Key Laboratory of Integrated Chips & Systems, Frontier Institute of Chip and System, Fudan University, CN; 2State Key Laboratory of Mobile Network and Mobile Multimedia Technology, ZTE Corporation, Shenzhen, China, CN Abstract Digital Compute-in-Memory (DCIM) architectures are playing an increasingly vital role in artificial intelligence (AI) applications due to their significant energy efficiency enhancement. Coupling memory and computing logic in DCIM requires extensive customization of custom cells and layouts, thus increasing design complexity and implementation effort. To adapt to the swiftly evolving AI algorithms, a DCIM compiler for agile customization is required. Previous DCIM compilers accelerate the customization process, but only focus on integer computation. Moreover, with technology node scaling down, design-for-test circuits are critical for robust chip design, while previous built-in-self-test (BIST) schemes for traditional memory fail to offer support for DCIM. This paper presents ARCTIC, an agile and robust DCIM compiler supporting parameterized integer/floating-point formats with corresponding BIST circuits. To support variable precision formats (including integer and floating-point), ARCTIC applies adaptive topology and layout optimization schemes for optimal performance. The compiler is also equipped with DCIM-friendly MarchCIM BIST circuits for efficient post-silicon tests with negligible area overhead. The energy efficiency of the generated DCIM macros remains competitive with the state-of-the-art counterparts./proceedings-archive/2024/DATA/399_pdf_upload.pdf |
11:20 CET | TS13.5 | BIT-TRIMMER: INEFFECTUAL BIT-OPERATION REMOVAL FOR CIM ARCHITECTURE Speaker: Yintao He, Institute of Computing Technology, Chinese Academy of Sciences, CN Authors: Yintao He1, Shixin Zhao1, Songyun Qu1, Huawei Li1, Xiaowei Li2 and Ying Wang1 1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2ICT, Chinese Academy of Sciences, CN Abstract ReRAM-based accelerator of bit-slicing architecture is a promising solution to neural network inference, which allows ineffectual bit-operation removal for greater potential gains. However, existing techniques mostly exploit the removal of weight-associated ineffectual operations, which cannot eliminate the activation-induced ineffectual operations. Alternatively, some approaches adopt an isolated two-stage approach to remove at the weight and activation-level, which leaves a big proportion of ineffectual bit-level operations. Therefore, in contrast to all these coarse-grained operation removal techniques, it is challenging to jointly eliminate ineffectual bit-operation induced by either activation or weight bit-slices for ReRAM-based accelerators. This work presents a novel ineffectual bit-operation removal approach and the accompanied ReRAM-based bit-operation clipping architecture that skips all those bit-level operations that make negligible impacts on neural network outputs. In experiments, the proposed bit-operation clipping ReRAM accelerator, Bit-Trimmer, achieves 5.28× energy efficiency and 2.04× speedup on average. Besides, compared with two SOTA ReRAM accelerator designs with bit-operation removal, it outperforms by 1.56× and 1.88× energy efficiency./proceedings-archive/2024/DATA/662_pdf_upload.pdf |
11:25 CET | TS13.6 | HYQA: HYBRID NEAR-DATA PROCESSING PLATFORM FOR EMBEDDING BASED QUESTION ANSWERING SYSTEM Speaker: Shengwen Liang, SKLP, Institute of Computing Technology, CAS, CN Authors: Shengwen Liang1, Ziming Yuan1, Ying Wang1, Dawen Xu2, Huawei Li3 and Xiaowei Li4 1State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences, CN; 2Seehi Microelectronic Corp. Ltd., CN; 3Institute of Computing Technology, Chinese Academy of Sciences, CN; 4ICT, Chinese Academy of Sciences, CN Abstract LLM-based question-answering (QA) systems have gained attention for their conversational ability. However, domain knowledge limitations, time lag, high training costs, and security concerns suggest building on-premise QA systems with embedding techniques. Yet deploying embedding-based QA systems on existing GPUs or domain-specific accelerators is sub-optimal, as they only address high computation costs and ignore the large memory footprint and data movement costs, which impact response latency and user experience. To address these issues, we propose a hybrid near-data processing platform, HyQA, which collaboratively optimizes response latency, memory footprint, and data movement cost by exploiting the benefit of near-memory and near-storage computing simultaneously. First, HyQA analyzes computational patterns of sub-tasks in embedding-based QA systems, tailors domain-specific hardware accelerators, and assigns suitable computational paradigms. Second, these dedicated accelerators are designed to communicate directly with flash memory, avoiding additional data movement. The experiment shows that HyQA significantly improves performance and reduces energy over CPU, GPU, Cognitive SSD, and DeepStore platforms./proceedings-archive/2024/DATA/274_pdf_upload.pdf |
11:30 CET | TS13.7 | TOWARDS EFFICIENT RECONFIGURATION THROUGH LIGHTWEIGHT INPUT INVERSION FOR MLC NVFPGAS Speaker: Huichuan Zheng, Shandong University, CN Authors: Huichuan Zheng, Mengying Zhao, Hao Zhang, Yuqing Xiong, Xiaojun Cai and Zhiping Jia, Shandong University, CN Abstract Nonvolatile field programmable gate arrays (NVFPGAs) have been proposed to address the challenges raised by artificial intelligence and big data related applications, since nonvolatile memories (NVMs) introduce advantages of high storage density, low leakage power, and high system robustness. In addition, multi-level cell (MLC), which can store multiple bits within one memory cell, further improves the logic density of NVFPGAs. However, the inefficient write operation of MLC NVM significantly increases the reconfiguration cost in aspects of energy, latency, and lifetime. In this paper, we focus on the reconfiguration cost of MLC LUTs in NVFPGA and propose a lightweight input inversion based scheme to reduce the reconfiguration cost. Inversion flexibility is defined and modeled for LUT inputs to guide the proposed scheme. We also discuss how the proposed scheme can be combined with other existing write reduction strategies. Evaluation shows the proposed scheme can reduce reconfiguration cost by 10.01% with negligible overhead./proceedings-archive/2024/DATA/757_pdf_upload.pdf |
11:35 CET | TS13.8 | LOW POWER AND TEMPERATURE-RESILIENT COMPUTE-IN-MEMORY BASED ON SUBTHRESHOLD-FEFET Speaker: Yifei Zhou, Zhejiang University, CN Authors: Yifei Zhou1, Xuchu Huang1, Jianyi Yang1, Kai Ni2, Hussam Amrouch3, Cheng Zhuo1 and Xunzhao Yin1 1Zhejiang University, CN; 2University of Notre Dame, US; 3TU Munich (TUM), DE Abstract Compute-in-memory (CiM) is a promising solution for addressing the challenges of artificial intelligence (AI) and Internet of Things (IoT) hardware, such as the "memory wall" issue. Specifically, CiM employing nonvolatile memory (NVM) devices in a crossbar structure can efficiently accelerate multiply-accumulation (MAC) computation, a crucial operator in neural networks among various AI models. Low power CiM designs are thus highly desired for further energy efficiency optimization on AI models. Ferroelectric FET (FeFET), an emerging device, is attractive for building ultra-low power CiM arrays due to CMOS compatibility, high ION/IOFF ratio, etc. Recent studies have explored FeFET based CiM designs that achieve low power consumption. Nevertheless, subthreshold-operated FeFETs, where the operating voltages are scaled down to the subthreshold region to reduce array power consumption, are particularly vulnerable to temperature drift, leading to accuracy degradation. To address this challenge, we propose a temperature-resilient 2T-1FeFET CiM design that performs MAC operations reliably in the subthreshold region from 0°C to 85°C, while consuming ultra-low power. Benchmarked against the VGG neural network architecture running the CIFAR-10 dataset, the proposed 2T-1FeFET CiM design achieves 89.45% CIFAR-10 test accuracy. Compared to previous FeFET based CiM designs, it exhibits immunity to temperature drift at an 8-bit wordlength scale, and achieves better energy efficiency with 2866 TOPS/W./proceedings-archive/2024/DATA/795_pdf_upload.pdf |
11:40 CET | TS13.9 | PIPELINE DESIGN OF NONVOLATILE-BASED COMPUTING IN MEMORY FOR CONVOLUTIONAL NEURAL NETWORKS INFERENCE ACCELERATORS Speaker: Lixia Han, Peking University, CN Authors: Lixia Han, Peng Huang, Zheng Zhou, Yiyang Chen, Haozhang Yang, Xiaoyan Liu and Jinfeng Kang, Peking University, CN Abstract Nonvolatile-based computing-in-memory inference chips show great potential to accelerate convolutional neural networks. The intrinsic weight stationary characteristic makes pipeline design a crucial solution to further enhance throughput. In this work, we propose a balanced pipeline design and establish performance/area evaluation models for the optimal pipeline solution. The evaluation results indicate that our pipeline design achieves 30× computational efficiency improvement./proceedings-archive/2024/DATA/508_pdf_upload.pdf |
11:41 CET | TS13.10 | RETAP: PROCESSING-IN-RERAM BITAP APPROXIMATE STRING MATCHING ACCELERATOR FOR GENOMIC ANALYSIS Speaker: Chin-Fu Nien, Department of Computer Science and Information Engineering, Chang Gung University, Taiwan (ROC), TW Authors: Tsung-Yu Liu1, Yen An Lu2, James Yu3, Chin-Fu Nien4 and Hsiang-Yun Cheng5 1Research Center for Information Technology Innovation, Academia Sinica, Taiwan, TW; 2Cornell University, Master of Engineering in Electrical and Computer Engineering, US; 3Master of Science in Computer Science, Georgia Institute of Technology, US; 4Department of Computer Science and Information Engineering, Chang Gung University, TW; 5Academia Sinica, TW | National Taiwan University, TW Abstract Read mapping, which involves computationally intensive approximate string matching (ASM) on large datasets, is the primary performance bottleneck in genome sequence analysis. To accelerate read mapping, a processing-in-memory (PIM) architecture that conducts highly parallel computations within the memory to reduce energy-inefficient data movements can be a promising solution. In this paper, we present ReTAP, a processing-in-ReRAM Bitap accelerator for genomic analysis. Instead of using the intricate dynamic programming algorithm, our design incorporates the Bitap algorithm, which uses only simple bitwise operations to perform ASM. Additionally, we explore the opportunity to reduce redundant computations by dynamically adjusting the error tolerance of Bitap and co-design the hardware to enhance computation parallelism. Our evaluation demonstrates that ReTAP outperforms GenASM, the state-of-the-art Bitap accelerator, with a 153.7× higher throughput. (A short software sketch of the Bitap bitwise idea appears after this session's table.)/proceedings-archive/2024/DATA/542_pdf_upload.pdf |
11:42 CET | TS13.11 | A DTCO FRAMEWORK FOR 3D NAND FLASH READOUT Speaker: Mattia Gerardi, imec, BE Authors: Mattia Gerardi, Arvind Sharma, Yang Xiang, Jakub Kaczmarek, Fernando Garcia Redondo, Maarten Rosmeulen and Marie Garcia Bardon, imec, BE Abstract To continue increasing the storage density of 3D NAND flash memories, new technology options need to be evaluated early on. This work presents a unique predictive parametric framework for Multi-Level Cell 3D NAND Flash read operation at the array level. This framework is used to explore the read sensitivity to multiple parameters and technology options. We identify the trade-offs between number of layers, read-current and read time to be the most determinant factors to ensure the array readability while enabling stacks of more than 300 layers and maximizing the memory density./proceedings-archive/2024/DATA/893_pdf_upload.pdf |
11:43 CET | TS13.12 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
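TS13.2 accelerates hyperdimensional-computing (HDC) encoding in memory. For orientation, the snippet below is a small, conventional software sketch of record-based HDC encoding (random bipolar hypervectors, binding by elementwise multiplication, bundling by majority sign); the dimensionality, feature count and quantization are arbitrary choices for this example, and it is not the CafeHD charge-domain circuit.

```python
# Illustrative only: the software form of HDC encoding that CIM encoders such as
# TS13.2 accelerate. All sizes and the record-based scheme are assumptions.
import numpy as np

D = 4096                      # hypervector dimensionality
N_FEATURES, N_LEVELS = 16, 8  # arbitrary example sizes
rng = np.random.default_rng(0)

# Random bipolar (+1/-1) hypervectors for feature positions and quantized levels.
position_hvs = rng.choice([-1, 1], size=(N_FEATURES, D))
level_hvs = rng.choice([-1, 1], size=(N_LEVELS, D))

def encode(sample):
    """Bind each feature's level HV with its position HV, then bundle by majority."""
    levels = np.clip((sample * N_LEVELS).astype(int), 0, N_LEVELS - 1)
    bound = position_hvs * level_hvs[levels]   # elementwise binding, per feature
    bundled = bound.sum(axis=0)                # bundling (superposition)
    return np.where(bundled >= 0, 1, -1)       # majority sign -> bipolar HV

hv = encode(rng.random(N_FEATURES))            # encode one random sample
print(hv.shape, int((hv == 1).sum()))
```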
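The Bitap algorithm behind TS13.10 (see the note in that entry) reduces pattern matching to shifts and ORs on bit vectors, which is what makes it attractive for in-ReRAM execution. As a reference point only, here is the textbook shift-or formulation for exact matching in plain Python; ReTAP's approximate matching with error tolerance and its hardware mapping are not shown.

```python
# Illustrative only: textbook shift-or (Bitap) exact matching, not the ReTAP design.
def bitap_exact(text: str, pattern: str) -> int:
    """Return the index of the first exact match of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # Per-character bitmask: bit i is cleared wherever pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, ~0) & ~(1 << i)
    state = ~0                          # all bits set: no partial match yet
    for j, ch in enumerate(text):
        # Shift in the next character and apply its mask.
        state = (state << 1) | masks.get(ch, ~0)
        if state & (1 << (m - 1)) == 0:
            return j - m + 1            # bit m-1 cleared: full pattern matched
    return -1

print(bitap_exact("ACGTTAGC", "TTA"))   # prints 3
```

Approximate (k-error) Bitap extends this idea with k+1 such bit vectors, one per allowed error count, still using only shifts and bitwise logic.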
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Multi-Purpose Room M1B+D
Session chair:
Giovanni De Micheli, EPFL, CH
Session co-chair:
Cédric Marchand, Ecole centrale of Lyon, FR
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | TS15.1 | EFFICIENT FAST ADDITIVE HOMOMORPHIC ENCRYPTION CRYPTOPROCESSOR FOR PRIVACY-PRESERVING FEDERATED LEARNING AGGREGATION Speaker: Wenye Liu, Independent Researcher, SG Authors: Wenye Liu, Nazim Koca and Chip Hong Chang, Nanyang Technological University, SG Abstract Privacy leakage is a critical concern when collaboratively training a large-scale deep learning model from multiple clients. To protect the local data, homomorphic encryption (e.g., Paillier) could be utilized for data aggregation on the central server. Nevertheless, even with CPU-optimized libraries or FPGA-based accelerators, the computing power and throughput limitations remain a stumbling block for practical deployment of the Paillier scheme. In this paper, we present an efficient and high-throughput cryptoprocessor based on a recently introduced Fast Additive Homomorphic Encryption (FAHE) algorithm. For encryption, we incorporate the asymmetric decomposition, time multiplexing resource reuse and hard-macro based wide-bus logic operations to efficiently map the large (>40 kbits) integer multiplications for low latency FPGA implementation. For decryption, we propose a table lookup method for rapid modular reduction by leveraging the relatively short modulus size of FAHE. The single large precomputed lookup table is carefully partitioned into multiple subtables and deployed in dual-port RAMs to enable resource-efficient parallel computation. The FAHE cryptoprocessor is implemented on a Xilinx ZCU102 FPGA board for performance evaluation and comparison. The results show that the throughput of our design is 354x to 404x higher than the state-of-the-art Paillier accelerators. Compared to the FAHE software implementation, the latency of our proposed design is 14.95x and 11.42x lower for encryption and decryption, respectively./proceedings-archive/2024/DATA/160_pdf_upload.pdf |
11:05 CET | TS15.2 | CALLOC: CURRICULUM ADVERSARIAL LEARNING FOR SECURE AND ROBUST INDOOR LOCALIZATION Speaker: Danish Gufran, Colorado State University, US Authors: Danish Gufran and Sudeep Pasricha, Colorado State University, US Abstract Indoor localization has become increasingly vital for many applications from tracking assets to delivering personalized services. Yet, achieving pinpoint accuracy remains a challenge due to noise and complexity in indoor environments. As indoor localization gains greater importance in various applications, addressing the vulnerability of these systems to adversarial attacks also becomes imperative, as these attacks not only threaten service integrity but also reduce localization accuracy. To combat this challenge, we introduce CALLOC, a novel machine learning (ML) framework designed to resist adversarial attacks that reduce system accuracy and reliability. CALLOC employs a novel adaptive curriculum learning approach with a domain-specific lightweight scaled dot-product attention neural network, tailored for adversarial resilience in practical use cases with resource-constrained mobile devices. Experimental evaluations demonstrate that CALLOC can achieve improvements of up to 6.03x in mean errors and 4.6x in worst-case errors against state-of-the-art localization frameworks, across building floorplans and adversarial attack scenarios, showcasing the framework's promise./proceedings-archive/2024/DATA/260_pdf_upload.pdf |
11:10 CET | TS15.3 | OPTIMIZING CIPHERTEXT MANAGEMENT FOR FASTER FULLY HOMOMORPHIC ENCRYPTION COMPUTATION Speaker: Eduardo Chielle, New York University Abu Dhabi, AE Authors: Eduardo Chielle, Oleg Mazonka and Michail Maniatakos, New York University Abu Dhabi, AE Abstract Fully Homomorphic Encryption (FHE) is the pinnacle of privacy-preserving outsourced computation as it enables meaningful computation to be performed in the encrypted domain without the need for decryption or back-and-forth communication between the client and service provider. Nevertheless, FHE is still orders of magnitude slower than unencrypted computation, which hinders its widespread adoption. In this work, we propose Furbo, a plug-and-play framework that can act as middleware between any FHE compiler and any FHE library. Our proposal employs smart ciphertext memory management and caching techniques to reduce data movement and computation, and can be applied to FHE applications without modifications to the underlying code. Experimental results using Microsoft SEAL as the base FHE library and focusing on privacy-preserving Machine Learning as a Service show up to 2x performance improvement in the fully-connected layers, and up to 24x improvement in the convolutional layers without any code change./proceedings-archive/2024/DATA/331_pdf_upload.pdf |
11:15 CET | TS15.4 | CACHE BANDWIDTH CONTENTION LEAKS SECRETS Speaker: Han Wang, Wuhan University, CN Authors: Han Wang, Ming Tang, Ke Xu and Quancheng Wang, Wuhan University, CN Abstract In modern CPU architectures, enhancements such as the Line Fill Buffer (LFB) and Super Queue (SQ), which are designed to track pending cache requests, have significantly boosted performance. To exploit this structure, we deliberately engineered blockages in the L2 to L1d route by controlling LFB conflict and triggering prefetch prediction failures, while consciously dismissing other plausible influencing factors. This approach was subsequently extended to the L3 to L2 and L2 to L1i pathways, resulting in three potent covert channels, termed L2CC, L3CC, and LiCC, with capacities of 10.02 Mbps, 10.37 Mbps, and 1.83 Mbps, respectively. Strikingly, the capacities of L2CC and L3CC surpass those of earlier non-shared-memory-based covert channels, reaching a level comparable to their shared memory-dependent equivalents. Leveraging this congestion further facilitated the extraction of key bits from RSA and EdDSA designs. Coupled with SpectreV1 and V2, our covert channels effectively evade the majority of traditional Spectre defenses. Their confluence with Branch Prediction (BP) Timing assaults additionally undercuts balanced branch protections, hence broadening their capability to infiltrate a wide range of cryptography libraries./proceedings-archive/2024/DATA/494_pdf_upload.pdf |
11:20 CET | TS15.5 | IOMMU DEFERRED INVALIDATION VULNERABILITY: EXPLOIT AND DEFENSE Speaker: Chathura Rajapaksha, Boston University, US Authors: Chathura Rajapaksha, Leila Delshadtehrani, Richard Muri, Manuel Egele and Ajay Joshi, Boston University, US Abstract Direct Memory Access (DMA) introduces a security vulnerability as peripherals are given direct access to system memory, exposing privileged data to potentially malicious Input/Output (IO) devices. Modern systems are equipped with an IOMMU to mitigate such DMA attacks. The OS uses the IOMMU and IO page tables to map and unmap a designated memory region before and after the DMA operation, constraining each DMA request to the approved region. IOMMU protection comes at the cost of reduced throughput in IO-intensive workloads, mainly due to the high IOTLB invalidation latency. The Linux OS eliminates this bottleneck by deferring the IOTLB invalidation requests to a later time, which opens a vulnerability window where a memory region is unmapped but the relevant IOTLB entry remains. In this paper, we present a proof-of-concept exploit, empirically demonstrating that a malicious DMA-capable device can use this vulnerability window to leak data transferred by other devices. Furthermore, we propose hardware-assisted mitigation for the deferred invalidation vulnerability by making minor changes to the existing IOMMU hardware and OS software. We implemented the proposed mitigation in the Intel IOMMU implementation in QEMU and the Linux kernel. Our security evaluation showed that our proposed mitigation successfully mitigated the deferred invalidation vulnerability and provided 12.7% higher throughput compared to the strict invalidation mode./proceedings-archive/2024/DATA/734_pdf_upload.pdf |
11:25 CET | TS15.6 | EMACLAVE: AN EFFICIENT MEMORY AUTHENTICATION FOR RISCV ENCLAVES Speaker: Omais Pandith, Technology Innovation Institute, AE Authors: Omais Pandith, Rafail Psiakis and Johanna Toivanen, Technology Innovation Institute, Abu Dhabi, AE Abstract Enclave technologies have been a common solution for secure execution in recent times. Most of the prior designs proposed for RISC-V enclaves do not evaluate any performance overheads associated with encryption and integrity of data. The recent competing scheme proposed for RISC-V enclaves and several other competing schemes for other architectures use an integrity tree to protect against replay attacks, ensuring the integrity of the data. However, the downsides of such approaches are the multiple memory accesses required for traversing an integrity tree and the additional storage overhead. In this paper, we propose EMAClave to ensure integrity protection using a modified bloom filter and additional hardware modifications. Moreover, we also prevent cache side channel attacks by proposing an intelligent cache allocation technology when secure and non-secure applications are running together. We show that we are able to outperform the recent competing scheme Penglai by around 16.1%./proceedings-archive/2024/DATA/951_pdf_upload.pdf |
11:30 CET | TS15.7 | ROLDEF: ROBUST LAYERED DEFENSE FOR INTRUSION DETECTION AGAINST ADVERSARIAL ATTACKS Speaker: Tajana Rosing, University of California, San Diego, US Authors: Onat Gungor1, Tajana Rosing1 and Baris Aksanli2 1University of California, San Diego, US; 2San Diego State University, US Abstract The Industrial Internet of Things (IIoT) includes networking equipment and smart devices to collect and analyze data from industrial operations. However, IIoT security is challenging due to its increased inter-connectivity and large attack surface. A machine learning (ML)-based intrusion detection system (IDS) is an IIoT security measure that aims to detect and respond to malicious traffic by using ML models. However, these methods are susceptible to adversarial attacks. In this paper, we propose a RObust Layered DEFense (ROLDEF) against adversarial attacks. Our denoising autoencoder (DAE) based defense approach first detects if a sample comes from an adversarial attack. If an attack is detected, the adversarial component is eliminated using the most effective DAE and the purified data is provided to the ML model. We use a realistic IIoT intrusion data set to validate the effectiveness of our defense across various ML models, where we improve the average prediction performance by 114% with respect to no defense. Our defense also provides a 50% average prediction performance improvement compared to the state-of-the-art defense under various adversarial attacks. Our defense can also be deployed for any underlying ML model and provides effective protection against adversarial attacks./proceedings-archive/2024/DATA/168_pdf_upload.pdf |
11:35 CET | TS15.8 | BEYOND RANDOM INPUTS: A NOVEL ML-BASED HARDWARE FUZZING Speaker: Mohamadreza Rostami, TU Darmstadt, DE Authors: Mohamadreza Rostami1, Marco Chilese1, Shaza Zeitouni1, Rahul Kande2, Jeyavijayan Rajendran2 and Ahmad-Reza Sadeghi1 1TU Darmstadt, DE; 2Texas A&M University, US Abstract Modern computing systems heavily rely on hardware as the root of trust. However, their increasing complexity has given rise to security-critical vulnerabilities that cross-layer attacks can exploit. Traditional hardware vulnerability detection methods, such as random regression and formal verification, have limitations. Random regression, while scalable, is slow in exploring hardware, and formal verification techniques are often hampered by manual effort and state explosion. Hardware fuzzing has emerged as an effective approach to exploring and detecting security vulnerabilities in large-scale designs like modern processors. It outperforms traditional methods regarding coverage, scalability, and efficiency. However, state-of-the-art fuzzers struggle to achieve comprehensive coverage of intricate hardware designs within a practical timeframe, often falling short of a 70% coverage threshold. To address this challenge, we propose a novel ML-based hardware fuzzer, ChatFuzz. Our approach leverages large language models (LLMs) to understand processor language and generate data/control flow entangled yet random machine code sequences. Reinforcement learning (RL) is integrated to guide the input generation process by rewarding the inputs using code coverage metrics. Utilizing the open-source RISC-V-based RocketCore and BOOM cores as our testbed, ChatFuzz achieves 75% condition coverage in RocketCore in just 52 minutes. This contrasts with state-of-the-art fuzzers, which demand a 30-hour timeframe for comparable condition coverage. Notably, our fuzzer can reach a 79.14% condition coverage rate in RocketCore by conducting approximately 199k test cases. In the case of BOOM, ChatFuzz accomplishes a remarkable 97.02% condition coverage in 49 minutes. Our analysis identified all bugs detected by TheHuzz, along with two new bugs in the RocketCore and discrepancies from the RISC-V ISA simulator./proceedings-archive/2024/DATA/1214_pdf_upload.pdf |
11:40 CET | TS15.9 | HDCIRCUIT: BRAIN-INSPIRED HYPERDIMENSIONAL COMPUTING FOR CIRCUIT RECOGNITION Speaker: Paul R. Genssler, University of Stuttgart, DE Authors: Paul Genssler1, Lilas Alrahis2, Ozgur Sinanoglu2 and Hussam Amrouch3 1University of Stuttgart, DE; 2New York University Abu Dhabi, AE; 3TU Munich (TUM), DE Abstract Circuits possess a non-Euclidean representation, necessitating the encoding of their data structure (e.g., gate-level netlists) into fixed formats like vectors. This work is the first to propose brain-inspired hyperdimensional computing (HDC) for optimized circuit encoding. HDC does not require extensive training to encode a gate-level netlist into a hypervector and reduces the similarity check between circuits from a graph-based comparison to a similarity measure between their hypervectors. We introduce a versatile HDC-based method for circuit encoding. We demonstrate its effectiveness on the circuit recognition task using the ITC-99 and ISCAS-85 benchmarks. We maintain a 98.2% accuracy, even when the designs are obfuscated using logic locking./proceedings-archive/2024/DATA/838_pdf_upload.pdf |
11:41 CET | TS15.10 | SECURING ISW MASKING SCHEME AGAINST GLITCHES Speaker: Sylvain Guilley, Secure-IC, FR Authors: Sofiane Takarabt1, Mohammad Ebrahimabadi2, Javad Bahrami2, Sylvain Guilley1 and Naghmeh Karimi2 1Secure-IC, FR; 2University of Maryland Baltimore County, US Abstract The Ishai-Sahai-Wagner (ISW) masking scheme has been proposed in the literature to protect cryptographic circuits against side-channel analysis attacks. This scheme is provably secure from a theoretical standpoint. However, this security proof only holds if gates are evaluated after all of their inputs are available. In practice, however, hardware ISW implementations may violate this requirement, since combinational gates are evaluated as soon as any one of their inputs changes. In this paper, we provide a repair for ISW to address such leakage-related security flaws that result in the recovery of the secret key. Namely, we demonstrate that a constructive method to insert artificial delays and/or "refreshing" on some sensitive paths can fix ISW even when the circuitry is aged. The extra delays ensure that the combinational gates making up the ISW gadget are evaluated in the order expected by the ISW rationale, exactly as the theory assumes. We verify the security of our proposed structure by leakage detection as well as an offensive methodology (attack). The robustness of our solution (called E-ISW, for Enhanced ISW) is supported by simulations under various conditions, ensuring that attacks cannot compromise the repaired scheme over time, even when gate delays change due to aging./proceedings-archive/2024/DATA/416_pdf_upload.pdf |
11:42 CET | TS15.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
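As background for TS15.10 above, the sketch below shows the textbook Ishai-Sahai-Wagner AND gadget that E-ISW hardens against glitches. It is a generic software illustration under our own assumptions (share counts, helper names, and the correctness check are ours), not the authors' hardware design.

```python
import secrets

def isw_and(a_shares, b_shares):
    """Textbook ISW AND gadget over GF(2): given Boolean sharings of bits a and b
    (a = XOR of a_shares, b = XOR of b_shares), return a fresh sharing of a & b.
    The evaluation order of the XORs below matters for the security argument,
    which is the subtlety that glitches in hardware can violate."""
    n = len(a_shares)
    r = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r[i][j] = secrets.randbits(1)  # fresh randomness r_ij
            # Compute (r_ij ^ a_i*b_j) first, then add a_j*b_i.
            r[j][i] = (r[i][j] ^ (a_shares[i] & b_shares[j])) ^ (a_shares[j] & b_shares[i])
    c_shares = []
    for i in range(n):
        c_i = a_shares[i] & b_shares[i]
        for j in range(n):
            if j != i:
                c_i ^= r[i][j]
        c_shares.append(c_i)
    return c_shares

def share(bit, n):
    """Split a bit into n random Boolean shares (hypothetical helper)."""
    shares = [secrets.randbits(1) for _ in range(n - 1)]
    last = bit
    for s in shares:
        last ^= s
    return shares + [last]

# Quick correctness check: the output shares XOR back to a & b.
a, b = 1, 1
c = isw_and(share(a, 3), share(b, 3))
assert c[0] ^ c[1] ^ c[2] == (a & b)
```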
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Auditorium 3
Session chair:
José Cano, University of Glasgow, UK
Session co-chair:
Jinho Lee, Seoul National University, KR
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | TS27.1 | BORE: ENERGY-EFFICIENT BANDED VECTOR SIMILARITY SEARCH WITH OPTIMIZED RANGE ENCODING FOR MEMORY-AUGMENTED NEURAL NETWORK Speaker: Chi-Tse Huang, National Taiwan University, TW Authors: Chi-Tse Huang1, Cheng-Yang Chang1, Hsiang-Yun Cheng2 and An-Yeu Wu3 1Graduate Institute of Electronics Engineering National Taiwan University, TW; 2Academia Sinica, TW | National Taiwan University, TW; 3Graduate Institute of Electronics Engineering, National Taiwan University, TW Abstract Memory-augmented neural networks (MANNs) incorporate external memories to address the significant issue of catastrophic forgetting in few-shot learning applications. MANNs rely on vector similarity search (VSS), which incurs substantial energy and computational overhead due to frequent data transfers and complex cosine similarity calculations. To tackle these challenges, prior research has proposed adopting ternary content addressable memories (TCAMs) for parallel VSS within memory. One promising approach is to use Exact-Match TCAM (EX-TCAM) with range encoding to find the vector with the minimum L∞ distance, avoiding the need for sensing circuit modifications as required by Best-Match TCAM (Best-TCAM). However, this method demands multiple search iterations and longer code words, limiting its practicality. In this paper, we propose an energy-efficient EX-TCAM-based design called BORE. BORE skips redundant search iterations and reduces code word length through performing Banded L∞ distance search with Optimized Range Encoding. Additionally, we consider the characteristics of the similarity metric and develop a distance-based training mechanism aimed at improving classification accuracy. Simulation results demonstrate that BORE enhances energy efficiency by 9.35× to 11.84× and accuracy by 2.95% to 4.69% compared to previous EX-TCAM-based approaches. Furthermore, BORE improves energy efficiency by 1.04× to 1.63× over prior works of Best-TCAM-based VSS./proceedings-archive/2024/DATA/602_pdf_upload.pdf |
11:05 CET | TS27.2 | COMMUNICATION-EFFICIENT MODEL PARALLELISM FOR DISTRIBUTED IN-SITU TRANSFORMER INFERENCE Speaker: Yuanxin Wei, National Sun Yat-Sen University, TW Authors: Yuanxin Wei, Shengyuan Ye, Jiazhi Jiang, Xu Chen, Dan Huang, Jiangsu Du and Yutong Lu, National Sun Yat-Sen University, TW Abstract Transformer models have shown significant success in a wide range of tasks. Meanwhile, the massive resources required by their inference prevent in-situ deployment in scenarios with resource-constrained devices, raising the barrier to integrating their advances. Observing that these scenarios, e.g., smart homes at the edge, usually comprise a rich set of trusted devices with untapped resources, it is promising to distribute Transformer inference onto multiple devices. However, due to the tightly coupled structure of Transformer models, existing model parallelism approaches necessitate frequent communication to resolve data dependencies, making them unacceptable for distributed inference, especially over the weak interconnects of edge scenarios. In this paper, we propose DeTransformer, a communication-efficient distributed in-situ Transformer inference system for edge scenarios. DeTransformer is based on a novel block parallelism approach, with the key idea of restructuring the original single-block Transformer layer into a decoupled layer with multiple sub-blocks and exploiting model parallelism between sub-blocks. Next, DeTransformer contains an adaptive placement approach to automatically select the optimal placement strategy by striking a trade-off among communication capability, computing power and memory budget. Experimental results show that DeTransformer can reduce distributed inference latency by up to 2.81× compared to the SOTA approach on 4 devices, while effectively maintaining task accuracy and a consistent model size./proceedings-archive/2024/DATA/223_pdf_upload.pdf |
11:10 CET | TS27.3 | DIAPASON: DIFFERENTIABLE ALLOCATION, PARTITIONING AND FUSION OF NEURAL NETWORKS FOR DISTRIBUTED INFERENCE Speaker: Federico Nicolás Peccia, FZI Research Center for Information Technology, DE Authors: Federico Peccia1, Alexander Viehl2 and Oliver Bringmann3 1FZI Research Center for Information Technology, DE; 2FZI Forschungszentrum Informatik, DE; 3University of Tübingen | FZI, DE Abstract Concerns in areas such as privacy, energy consumption, greenhouse gas emissions, and cost push the trend of migrating neural network inference from the cloud to embedded edge devices. We present our novel approach DIAPASON to overcome restrictions brought on by the computing and application requirements that impede the execution of such networks on resource-constrained embedded devices. Our approach addresses these challenges by distributing the inference across multiple computing instances, which could be anything ranging from a multi-CPU configuration on the same SoC to geographically distributed devices. In contrast to recent efforts which tend to apply heuristics to reduce the search space of their problem definition to solve it in a timely fashion, our novel problem definition applies the concept of continuous relaxation to the categorical selection of partitioning, layer fusion, and allocation opportunities. This approach overcomes the problem of the poor exploration of the actual search space that arises when removing potential distribution opportunities during problem simplifications. We conduct numerical simulations and ablation experiments on each one of the configuration parameters of our algorithm by distributing several widely used neural networks. Finally, we compare DIAPASON against the commonly used MoDNN baseline and a state-of-the-art approach, CoopAI, achieving 44% and 12% mean speed-ups, respectively./proceedings-archive/2024/DATA/277_pdf_upload.pdf |
11:15 CET | TS27.4 | DIMO-SPARSE: DIFFERENTIABLE MODELING AND OPTIMIZATION OF SPARSE CNN DATAFLOW AND HARDWARE ARCHITECTURE Speaker: Jiang Hu, Texas A&M, US Authors: Jianfeng Song1, Rongjian Liang2, Yu Gong3, Bo Yuan3 and Jiang Hu1 1Texas A&M University, US; 2NVIDIA Corp., US; 3Rutgers University, US Abstract Many real-world CNNs exhibit sparsity, but this characteristic has primarily been utilized in manual design processes and has received little attention in existing automatic optimization techniques. To the best of our knowledge, this paper presents the first systematic investigation of automatic dataflow and hardware optimization for sparse CNN computation. A differentiable PPA (Power Performance Area) model is developed to enable fast nonlinear optimization solving and massively parallel local search-based discretization. Experimental results on public domain testcases demonstrate the efficacy of the proposed approach, achieving an average of 5X and 10X better PPA than previous work for two different sparsity patterns./proceedings-archive/2024/DATA/388_pdf_upload.pdf |
11:20 CET | TS27.5 | GPACE: AN ENERGY-EFFICIENT PQ-BASED GCN ACCELERATOR WITH REDUNDANCY REDUCTION Speaker: Yibo Du, Institute of Computing Technology, University of Chinese Academy of Sciences, CN Authors: Yibo Du1, Shengwen Liang2, Ying Wang3, Huawei Li3, Xiaowei Li4 and Yinhe Han3 1Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, CN; 2State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences, CN; 3Institute of Computing Technology, Chinese Academy of Sciences, CN; 4ICT, Chinese Academy of Sciences, CN Abstract Graph convolutional network (GCN) has been proven powerful in various tasks, as it combines both neural networks and graph processing operators. However, this characteristic makes GCN exhibit hybrid execution patterns, which is unfavorable for CPUs and GPUs. Therefore, designing specialized GCN accelerators is becoming a prevalent paradigm. Unfortunately, as graph scale continues to grow, existing GCN accelerators suffer from significant bandwidth consumption and memory footprint as they neglect the inherent semantic redundancy of vertex features. Although applying Product Quantization to GCN is a promising solution to reduce the sizeable graph data via distilling semantic redundancy, it introduces novel operations with unique patterns that existing GCN accelerators cannot support. In this paper, we propose GPACE, an energy-efficient GCN accelerator that can fully harness the potential of PQ to reduce bandwidth consumption and data movement. GPACE is designed with a lookup-efficient architecture and well-optimized dataflow to support the unique data access and computation pattern of PQ-GCN. In addition to leveraging PQ to distill semantic redundancy, we exploit the operation redundancy and propose a redundancy-aware architecture to detect and reduce types of redundant operations to achieve higher energy efficiency. Evaluations show GPACE achieves high speedup and energy saving compared with CPU, GPU, and specialized GCN accelerators./proceedings-archive/2024/DATA/89_pdf_upload.pdf |
11:25 CET | TS27.6 | AUTOMATED OPTIMIZATION OF DEEP NEURAL NETWORKS: DYNAMIC BIT-WIDTH AND LAYER-WIDTH SELECTION VIA CLUSTER-BASED PARZEN ESTIMATION Speaker: Massoud Pedram, University of Southern California, US Authors: Seyedarmin Azizi, Mahdi Nazemi, Arash Fayyazi and Massoud Pedram, University of Southern California, US Abstract Given the ever-growing complexity and computational requirements of deep learning models, it has become imperative to efficiently optimize neural network architectures. This paper presents an automated, search-based method for optimizing the bit-width and layer-width of individual neural network layers, achieving substantial reductions in the size and processing requirements of these models. Our unique approach employs Hessian-based search space pruning to discard unpromising solutions, greatly reducing the search space. We further refine the optimization process by introducing a novel, adaptive algorithm that combines k-means clustering with tree-structured Parzen estimation. This allows us to dynamically adjust a figure of merit used in tree-structured Parzen estimation, i.e., the desirability of a particular bit-width and layer-width configuration, thereby expediting the identification of optimal configurations. Through extensive experiments on benchmark datasets, we validate the efficacy of our method. More precisely, our method outperforms existing techniques by achieving an average 20% reduction in model size without sacrificing any output accuracy. It also boasts a 12x acceleration in search time compared to the most advanced search-based approaches./proceedings-archive/2024/DATA/1088_pdf_upload.pdf |
11:30 CET | TS27.7 | VIT-TOGO: VISION TRANSFORMER ACCELERATOR WITH GROUPED TOKEN PRUNING Speaker: Seokhyeong Kang, Pohang University of Science and Technology, KR Authors: Seungju Lee, Kyumin Cho, Eunji Kwon, Sejin Park, Seojeong Kim and Seokhyeong Kang, Pohang University of Science and Technology, KR Abstract Vision Transformer (ViT) has gained prominence for its performance in various vision tasks but comes with considerable computational and memory demands, posing a challenge when deploying it on resource-constrained edge devices. To address this limitation, various token pruning methods have been proposed to reduce the computation. However, the majority of token pruning techniques do not account for practical use in actual embedded devices, which demand a significant reduction in computational load. In this paper, we introduce ViT-ToGo, a ViT accelerator with grouped token pruning. This enables the parallel execution of the ViT models and the token pruning process. We implement grouped token pruning with a head-wise importance estimator, which simplifies the processing needed for token pruning, including sorting and reordering. Our proposed method achieves up to 66% reduction in the number of tokens, resulting in up to 36% reduction in GFLOPs, with only a minimal accuracy drop of around 1%. Furthermore, the hardware implementation incurs a marginal resource overhead of 1.13% on average./proceedings-archive/2024/DATA/161_pdf_upload.pdf |
11:35 CET | TS27.8 | SEA: SIGN-SEPARATED ACCUMULATION SCHEME FOR RESOURCE-EFFICIENT DNN ACCELERATORS Speaker: Jing Gong, University of New South Wales, AU Authors: Jing Gong1, Hassaan Saadat1, Haris Javaid2, Hasindu Gamaarachchi1, David Taubman3 and Sri Parameswaran1 1University of New South Wales, AU; 2AMD, SG; 3School of Electrical Engineering and Telecommunications, UNSW Sydney, NSW, Australia, AU Abstract Deep Neural Network (DNN) accelerators targeting training need to support the resource-hungry floating-point (FP) arithmetic. Typically, the additions (accumulations) are performed in higher precision compared to multiplications, resulting in a substantial proportion of resources being consumed by the adders. In this paper, we aim to improve the resource efficiency of FP addition in DNN training accelerators and present a sign-separated accumulation scheme (SEA). The proposed SEA scheme accumulates the same-signed terms separately, followed by a final addition of the two oppositely-signed sub-accumulations. The separate sub-accumulations, when performed using our resource-efficient same-signed FP adders, result in significant improvement in overall resource efficiency. We present a SEA-based systolic array design, involving a novel processing element design, capable of performing FP matrix multiplications for DNN training. Our experimental results show substantial improvements in delay, area-delay product (ADP), and energy consumption, by achieving reductions of up to 18.3% in delay, 36.9% in ADP and 37.0% in energy, across various adder-multiplier combinations compared to the original designs. SEA introduces no approximations in accumulation or modifications in the rounding mechanism. We integrate the SEA-based systolic array into the open-source Gemmini ecosystem for use by the broader community./proceedings-archive/2024/DATA/778_pdf_upload.pdf |
11:40 CET | TS27.9 | TOWARDS FORWARD-ONLY LEARNING FOR HYPERDIMENSIONAL COMPUTING Speaker: Hyunsei Lee, DGIST, KR Authors: Hyunsei Lee1, Hyukjun Kwon1, Jiseung Kim1, Seohyun Kim1, Mohsen Imani2 and Yeseong Kim1 1DGIST, KR; 2University of California, Irvine, US Abstract Hyperdimensional (HD) Computing is a lightweight representation system that symbolizes data as high-dimensional vectors. HD computing has been growing in popularity in recent years as an alternative to deep neural networks mainly due to its simple and efficient operations. In HD-based learning frameworks, the encoding of the high-dimensional representations is widely cited as the procedure that contributes most to accuracy and efficiency. However, throughout HD computing's history, the encoder has largely remained static. In this work, we explore methods for a dynamic encoder that yields better representations as training progresses. Our proposed methods, SEP and IMP, achieve accuracies comparable to state-of-the-art HD-based methods proposed in the literature; more notably, our solutions outperform existing work at lower dimensions, e.g., by 5.49% while maintaining a relatively small dimension of D=3,000, which equates to an average of 3.32× faster inference./proceedings-archive/2024/DATA/716_pdf_upload.pdf |
11:41 CET | TS27.10 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
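Several papers in this programme build on hyperdimensional computing (TS27.9 here, TS15.9 earlier). As background only, the sketch below shows the basic bind/bundle/compare operations of a generic bipolar HDC encoder; it does not reproduce the SEP/IMP methods of TS27.9, and the dimensionality, seed, and record fields are arbitrary illustration choices.

```python
import numpy as np

D = 3000                       # hypervector dimensionality (TS27.9 reports results at D = 3,000)
rng = np.random.default_rng(0)

def random_hv():
    """Random bipolar hypervector in {-1, +1}^D; acts as a quasi-orthogonal symbol."""
    return rng.choice([-1, 1], size=D)

def bind(x, y):
    """Binding (elementwise multiply): associates two hypervectors; the result is
    nearly orthogonal to both inputs."""
    return x * y

def bundle(hvs):
    """Bundling (elementwise majority): superimposes hypervectors; the result stays
    similar to every input."""
    s = np.sum(hvs, axis=0)
    return np.where(s >= 0, 1, -1)

def similarity(x, y):
    """Cosine similarity used to compare encoded items."""
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Encode two records as bundles of bound key-value pairs and compare them.
keys = {"shape": random_hv(), "color": random_hv()}
vals = {"round": random_hv(), "square": random_hv(), "red": random_hv()}
rec_a = bundle([bind(keys["shape"], vals["round"]), bind(keys["color"], vals["red"])])
rec_b = bundle([bind(keys["shape"], vals["square"]), bind(keys["color"], vals["red"])])
print(similarity(rec_a, rec_b))   # noticeably above 0: the records share the color field
```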
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 12:00 CET - 12:30 CET
Location / Room: Break-Out Room S8
Time | Label | Presentation Title Authors |
---|---|---|
12:00 CET | ASD05P.1 | CONSTRAINT-AWARE RESOURCE MANAGEMENT FOR CYBER-PHYSICAL SYSTEMS Speaker: Mehmet Belviranli, Colorado School of Mines, US Authors: Justin McGowen, Ismet Dagli, Neil Dantam and Mehmet Belviranli, Colorado School of Mines, US Abstract Cyber-physical systems (CPS) such as robots and self-driving cars pose strict physical requirements to avoid failure. Scheduling choices impact these requirements. This presents a challenge: how do we find efficient schedules for CPS with heterogeneous processing units, such that the schedules are resource-bounded to meet the physical requirements? We propose the creation of a structured system, the Constrained Autonomous Workload Scheduler, which determines scheduling decisions with direct relations to the environment. By using a representation language (AuWL), Timed Petri nets, and mixed-integer linear programming, our scheme offers novel capabilities to represent and schedule many types of CPS workloads, real world constraints, and optimization criteria./proceedings-archive/2024/DATA/6005_pdf_upload.pdf |
12:00 CET | ASD05P.2 | ROBUSTNESS AND ACCURACY EVALUATIONS OF LOCALIZATION TECHNIQUES FOR AUTONOMOUS RACING Speaker: Tian Yi Lim, ETH Zurich, CH Authors: Tian Yi Lim, Nicolas Baumann, Edoardo Ghignone and Michele Magno, ETH Zurich, CH Abstract This work introduces SynPF, a Monte-Carlo Localization (MCL)-based algorithm tailored for high-speed racing environments. Benchmarked against Cartographer, a State of the Art (SotA) pose-graph Simultaneous Localization and Mapping (SLAM) algorithm, SynPF leverages synergies from previous Particle Filter (PF) methodologies and synthesizes them for the high-performance racing domain. Our extensive in-field evaluations reveal that while Cartographer excels under nominal conditions, it struggles when subjected to wheel-slip—a common phenomenon in the racing scenario due to varying grip levels and aggressive driving behavior. Conversely, SynPF demonstrates robustness in these challenging conditions and a low-latency computation time of 1.25 ms on a Graphics Processing Unit (GPU)-denied on-board computer (OBC). Using the F1TENTH platform, a 1:10 scaled autonomous racing vehicle, this work not only highlights the vulnerabilities of existing algorithms in high-speed scenarios, tested at speeds of up to 7.6 m/s, but also emphasizes the potential of SynPF as a viable alternative, especially in deteriorating odometry conditions./proceedings-archive/2024/DATA/6017_pdf_upload.pdf |
12:00 CET | ASD05P.3 | A STAKEHOLDER ANALYSIS ON OPERATIONAL DESIGN DOMAINS OF AUTOMATED DRIVING SYSTEMS Speaker: Marcel Aguirre Mehlhorn, TU Ilmenau | Volkswagen, DE Authors: Marcel Aguirre Mehlhorn1, Hauke Dierend2, Andreas Richter2 and Yuri Shardt3 1TU Ilmenau; Volkswagen, DE; 2Volkswagen, DE; 3TU Ilmenau, DE Abstract Developing an automated driving system (ADS) involves collaboration between various stakeholders. To support this process, the concept of operational design domain (ODD) has emerged. Nonetheless, stakeholders require variable levels of information from an ODD. A thorough investigation has identified eight main stakeholder categories. Furthermore, a stakeholder analysis is used to assess their expectations, interests, and influence. These findings briefly summarise all necessary ODD engineering requirements and deliverables for all ODD stakeholders./proceedings-archive/2024/DATA/6021_pdf_upload.pdf |
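SynPF (ASD05P.2) builds on Monte-Carlo Localization. For orientation only, the sketch below shows one predict-weight-resample iteration of a generic 1-D particle filter; the motion and measurement models, noise values, and landmark are hypothetical and are not taken from the paper or the F1TENTH setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def mcl_step(particles, control, measurement, landmark=0.0,
             motion_noise=0.05, meas_noise=0.1):
    """One predict-weight-resample iteration of a 1-D Monte-Carlo Localization filter."""
    # Predict: propagate every particle through a noisy motion model.
    particles = particles + control + rng.normal(0.0, motion_noise, particles.shape)
    # Weight: likelihood of the observed range to a known landmark.
    expected = np.abs(particles - landmark)
    weights = np.exp(-0.5 * ((measurement - expected) / meas_noise) ** 2)
    weights += 1e-12                     # avoid a degenerate all-zero weight vector
    weights /= weights.sum()
    # Resample (systematic): keep particles in proportion to their weight.
    cum = np.cumsum(weights)
    cum[-1] = 1.0                        # guard against floating-point drift
    positions = (rng.random() + np.arange(len(particles))) / len(particles)
    return particles[np.searchsorted(cum, positions)]

# Toy run: the true pose drifts right by 0.1 per step; the filter tracks it from noisy ranges.
particles, true_pose = rng.uniform(-1.0, 1.0, 500), 0.5
for _ in range(50):
    true_pose += 0.1
    z = abs(true_pose) + rng.normal(0.0, 0.1)
    particles = mcl_step(particles, 0.1, z)
print(float(particles.mean()), true_pose)   # the particle mean stays close to the true pose
```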
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 13:15 CET - 14:00 CET
Location / Room: Auditorium 2
Session chair:
Ana-Lucia Varbanescu, University of Twente, NL
Session co-chair:
Andrea Bartolini, Università di Bologna, IT
Time | Label | Presentation Title Authors |
---|---|---|
13:15 CET | LK02.1 | DATA CENTER DEMAND RESPONSE FOR SUSTAINABLE COMPUTING: MYTH OR OPPORTUNITY? Speaker and Author: Ayse Coskun, Boston University, US Abstract In our computing-driven era, the escalating power consumption of modern data centers, currently constituting approximately 3% of global energy use, is a burgeoning concern. With the anticipated surge in usage accompanying the widespread adoption of AI technologies, addressing this issue becomes imperative. This paper discusses a potential solution: integrating data centers into grid programs such as "demand response" (DR). This strategy not only optimizes power usage without requiring new fossil-fuel infrastructure but also facilitates more ambitious renewable deployment by adding demand flexibility to the grid. However, the unique scale, operational knobs and constraints, and future projections of data centers present distinct opportunities and urgent challenges for implementing DR. This paper delves into the myths and opportunities inherent in this perspective on improving data center sustainability. While obstacles including creating the requisite software infrastructure, establishing institutional trust, and addressing privacy concerns remain, the landscape is evolving to meet the challenges. Noteworthy achievements have emerged in the development of intelligent solutions that can be swiftly implemented in data centers to accelerate the adoption of DR. These multifaceted solutions encompass dynamic power capping, load scheduling, load forecasting, market bidding, and collaborative optimization. We offer insights into this promising step towards making sustainable computing a reality./proceedings-archive/2024/DATA/1293_pdf_upload.pdf |
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: VIP Room
Session chair:
Ana-Lucia Varbanescu, University of Twente, NL
Session co-chair:
Andrea Bartolini, Università di Bologna, IT
Speaker: Ayse Coskun
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
Alejandro Valero, University of Zaragoza, ES
Session co-chair:
Alessandro Cilardo, Università di Napoli Federico II, IT
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | TS05.1 | NOVA: NOC-BASED VECTOR UNIT FOR MAPPING ATTENTION LAYERS ON A CNN ACCELERATOR Speaker: Mohit Upadhyay, National University of Singapore, SG Authors: Mohit Upadhyay, Rohan Juneja, Weng-Fai Wong and Li-Shiuan Peh, National University of Singapore, SG Abstract Attention mechanisms are becoming increasingly popular, being used in neural network models in multiple domains such as natural language processing (NLP) and vision applications, especially at the edge. However, attention layers are difficult to map onto existing neuro accelerators since they have a much higher density of non-linear operations, which lead to inefficient utilization of today's vector units. This work introduces NOVA, a NoC-based Vector Unit that can perform non-linear operations within the NoC of the accelerators, and can be overlaid onto existing neuro accelerators to map attention layers at the edge. Our results show that the NOVA architecture is up to 37.8x more power-efficient than state-of-the-art hardware approximators when running existing attention-based neural networks./proceedings-archive/2024/DATA/314_pdf_upload.pdf |
14:05 CET | TS05.2 | INDEXMAC: A CUSTOM RISC-V VECTOR INSTRUCTION TO ACCELERATE STRUCTURED-SPARSE MATRIX MULTIPLICATIONS Speaker: Vasileios Titopoulos, Democritus University of Thrace, GR Authors: Vasileios Titopoulos1, Kosmas Alexandridis1, Christodoulos Peltekis1, Chrysostomos Nicopoulos2 and Giorgos Dimitrakopoulos1 1Democritus University of Thrace, GR; 2University of Cyprus, CY Abstract Structured sparsity has been proposed as an efficient way to prune the complexity of modern Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. The acceleration of ML models – for both training and inference – relies primarily on equivalent matrix multiplications that can be executed efficiently on vector processors or custom matrix engines. The goal of this work is to incorporate the simplicity of structured sparsity into vector execution, thereby accelerating the corresponding matrix multiplications. Toward this objective, a new vector index-multiply-accumulate instruction is proposed, which enables the implementation of low-cost indirect reads from the vector register file. This reduces unnecessary memory traffic and increases data locality. The proposed new instruction was integrated into a decoupled RISC-V vector processor with negligible hardware cost. Extensive evaluation demonstrates significant speedups of 1.80x–2.14x compared to state-of-the-art vectorized kernels, when executing layers of varying sparsity from state-of-the-art Convolutional Neural Networks (CNNs)./proceedings-archive/2024/DATA/382_pdf_upload.pdf |
14:10 CET | TS05.3 | LRSCWAIT: ENABLING SCALABLE AND EFFICIENT SYNCHRONIZATION IN MANYCORE SYSTEMS THROUGH POLLING-FREE AND RETRY-FREE OPERATION Speaker: Samuel Riedel, ETH Zurich, CH Authors: Samuel Riedel1, Marc Gantenbein1, Alessandro Ottaviano1, Torsten Hoefler1 and Luca Benini2 1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT Abstract Extensive polling in shared-memory manycore systems can lead to contention, decreased throughput, and poor energy efficiency. Both lock implementations and the general-purpose atomic operation, load-reserved/store-conditional (LRSC), cause polling due to serialization and retries. To alleviate this overhead, we propose LRwait and SCwait, a synchronization pair that eliminates polling by allowing contending cores to sleep while waiting for previous cores to finish their atomic access. As a scalable implementation of LRwait, we present Colibri, a distributed and scalable approach to managing LRwait reservations. Through extensive benchmarking on an open-source RISC-V platform with 256 cores, we demonstrate that Colibri outperforms current synchronization approaches for various concurrent algorithms with high and low contention regarding throughput, fairness, and energy efficiency. With an area overhead of only 6%, Colibri outperforms LRSC-based implementations by a factor of 6.5× in terms of throughput and 7.1× in terms of energy efficiency./proceedings-archive/2024/DATA/669_pdf_upload.pdf |
14:15 CET | TS05.4 | MACO: EXPLORING GEMM ACCELERATION ON A LOOSELY-COUPLED MULTI-CORE PROCESSOR Speaker: Wei Guo, School of Computer, National University of Defense Technology and Key Laboratory of Advanced Microprocessor Chips and Systems, CN Authors: Bingcai Sui1, Junzhong Shen1, Caixia Sun1, Junhui Wang2, Zhong Zheng1 and Wei Guo1 1Key Laboratory of Advanced Microprocessor Chips and Systems, National University of Defense Technology, CN; 2National University of Defense Technology, CN Abstract General-purpose processor vendors have integrated customized accelerators into their products due to the widespread use of General Matrix-Matrix Multiplication (GEMM) kernels. However, it remains a challenge to further improve the flexibility and scalability of these GEMM-enhanced processors to cater to the emerging large-scale GEMM workloads. In this paper, we propose MACO, a novel loosely-coupled multi-core general-purpose architecture optimized for GEMM-related applications. To enhance the programmability and flexibility of MACO, the paper introduces a tile-based instruction set architecture. Additionally, the paper presents techniques such as multi-level data tiling, hardware-assisted data prefetching, and predictive address translation to further enhance the computational efficiency of MACO for GEMM workloads. The experimental results demonstrate that MACO exhibits good scalability, achieving an average computational efficiency of 90% across multiple cores. Furthermore, evaluations on state-of-the-art deep neural networks show that MACO can achieve up to 1.1 TFLOPS with 88% computational efficiency, indicating its adaptivity to deep learning workloads./proceedings-archive/2024/DATA/124_pdf_upload.pdf |
14:20 CET | TS05.5 | DRAM-LOCKER: A GENERAL-PURPOSE DRAM PROTECTION MECHANISM AGAINST ADVERSARIAL DNN WEIGHT ATTACKS Speaker: Shaahin Angizi, New Jersey Institute of Technology, US Authors: Ranyang Zhou1, Sabbir Ahmed2, Arman Roohi3, Adnan Siraj Rakin2 and Shaahin Angizi1 1New Jersey Institute of Technology, US; 2Binghamton University, US; 3University of Nebraska–Lincoln (UNL), US Abstract In this work, we propose DRAM-Locker as a robust general-purpose defense mechanism that can protect DRAM against various adversarial Deep Neural Network (DNN) weight attacks affecting data or page tables. DRAM-Locker harnesses the capabilities of in-DRAM swapping combined with a lock-table to prevent attackers from singling out specific DRAM rows to safeguard DNN's weight parameters. Our results indicate that DRAM-Locker can deliver a high level of protection downgrading the performance of targeted weight attacks to a random attack level. Furthermore, the proposed defense mechanism demonstrates no reduction in accuracy when applied to CIFAR-10 and CIFAR-100. Importantly, DRAM-Locker does not necessitate any software retraining or result in extra hardware burden./proceedings-archive/2024/DATA/57_pdf_upload.pdf |
14:25 CET | TS05.6 | TOWARDS SCALABLE GPU SYSTEM WITH SILICON PHOTONIC CHIPLET Speaker: Chengeng Li, Hong Kong University of Science and Technology, HK Authors: Chengeng Li, Fan Jiang, Shixi Chen, Xianbin Li, Jiaqi Liu, Wei Zhang and Jiang Xu, Hong Kong University of Science and Technology, HK Abstract GPU-based computing has emerged as a predominant solution for high-performance computing and machine learning applications. The continuously escalating computing demand foresees a requirement for larger-scale GPU systems in the future. However, this expansion is constrained by the finite number of transistors per die. Although chiplet technology shows potential for building large-scale systems, current chiplet interconnection technologies suffer from limitations in both bandwidth and energy efficiency. In contrast, optical interconnect has ultra-high bandwidth and energy efficiency, and thereby is promising for constructing chiplet-based GPU systems. Yet, previously proposed optical networks lack scalability and cannot be directly applied to existing chiplet-based GPU systems. In this work, we address the challenges of designing large-scale GPU systems with silicon photonic chiplets. We propose GROOT, a group-based optical network that divides the entire system into groups and facilitates resource sharing among the chiplets within each group. Additionally, we design dedicated channel mapping and allocation policies tailored for the request network and the reply network, respectively. Experimental results show that GROOT achieves a 48% improvement in performance and a 24.5% reduction in system energy consumption over the baseline./proceedings-archive/2024/DATA/475_pdf_upload.pdf |
14:30 CET | TS05.7 | NEAR-MEMORY PARALLEL INDEXING AND COALESCING: ENABLING HIGHLY EFFICIENT INDIRECT ACCESS FOR SPMV Speaker: Chi Zhang, ETH Zurich, CH Authors: Chi Zhang1, Paul Scheffler1, Thomas Benz1, Matteo Perotti1 and Luca Benini2 1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT Abstract Sparse matrix-vector multiplication (SpMV) is central to numerous data-intensive applications, but requires streaming indirect memory accesses that severely degrade both processing and memory throughput in state-of-the-art architectures. Near-memory hardware units, decoupling indirect streams from processing elements, partially alleviate the bottleneck, but rely on low DRAM access granularity, which is highly inefficient for modern DRAM standards like HBM and LPDDR. To fully address the end-to-end challenge, we propose a low-overhead data coalescer combined with a near-memory indirect streaming unit for AXI-Pack, an extension to the widespread AXI4 protocol that packs narrow irregular stream elements onto wide memory buses. Our combined solution leverages the memory-level parallelism and coalescence of streaming indirect accesses in irregular applications like SpMV to maximize the performance and bandwidth efficiency attained on wide memory interfaces. Our solution delivers an average speedup of 8x in effective indirect access, often reaching the full memory bandwidth. As a result, we achieve an average end-to-end SpMV speedup of 3x. Moreover, our approach demonstrates remarkable on-chip efficiency, requiring merely 27kB of on-chip storage and a very compact implementation size of 0.2-0.3 mm^2 in a 12nm node./proceedings-archive/2024/DATA/381_pdf_upload.pdf |
14:35 CET | TS05.8 | STAR: SUM-TOGETHER/APART RECONFIGURABLE MULTIPLIERS FOR PRECISION-SCALABLE ML WORKLOADS Speaker: Edward Manca, Politecnico di Torino, IT Authors: Edward Manca, Luca Urbinati and Mario R. Casu, Politecnico di Torino, IT Abstract To achieve an optimal balance between accuracy and latency in Deep Neural Networks (DNNs), precision-scalability has become a paramount feature for hardware specialized for Machine Learning (ML) workloads. Recently, many precision-scalable (PS) multipliers and multiply-and-accumulate (MAC) units have been proposed. They are mainly divided into two categories, Sum-Apart (SA) and Sum-Together (ST), and have always been presented as alternative implementations. Instead, in this paper, we introduce for the first time a new class of PS Sum-Together/Apart Reconfigurable multipliers, which we call STAR, designed to support both SA and ST modes with a single reconfigurable architecture. STAR multipliers could be useful in MAC units of CPUs or hardware accelerators, for example, enabling them to handle both 2D Convolution (in ST mode) and Depth-wise Convolution (in SA mode) with a unique PS hardware design, thus saving hardware resources. We derive four distinct STAR multiplier architectures, including two derived from the well-known Divide-and-Conquer and Sub-word Parallel SA and ST families, which support 16, 8 and 4-bit precision. We perform an extensive exploration of these architectures in terms of power, performance, and area, across a wide range of clock frequency constraints, from 0.4 to 2.0 GHz, targeting a 28-nm CMOS technology. We identify the Pareto-optimal solutions with the lowest area and power in the low-frequency, mid-frequency, and high-frequency ranges. Our findings allow designers to select the best STAR solution depending on their design target, be it low power and low area, high performance, or a balance of the two./proceedings-archive/2024/DATA/780_pdf_upload.pdf |
14:40 CET | TS05.9 | HARNESSING ML PRIVACY BY DESIGN THROUGH CROSSBAR ARRAY NON-IDEALITIES Speaker: Khaled N Khasawneh, George Mason University, US Authors: Md Shohidul Islam1, Sankha B Dutta2, Andres Marquez2, Ihsen Alouani3 and Khaled Khasawneh1 1George Mason University, US; 2Pacific Northwest National Lab, US; 3Queen's University Belfast, GB Abstract Deep Neural Networks (DNNs), handling compute- and data-intensive tasks, often utilize accelerators like Resistive-switching Random-access Memory (RRAM) crossbar for energy-efficient in-memory computation. Despite RRAM's inherent non-idealities causing deviations in DNN output, this study transforms the weakness into strength. By leveraging RRAM non-idealities, the research enhances privacy protection against Membership Inference Attacks (MIAs), which reveal private information from training data. RRAM non-idealities disrupt MIA features, increasing model robustness and revealing a privacy-accuracy tradeoff. Empirical results with four MIAs and DNNs trained on different datasets demonstrate significant privacy leakage reduction with a minor accuracy drop (e.g., up to 2.8% for ResNet-18 with CIFAR-100)./proceedings-archive/2024/DATA/701_pdf_upload.pdf |
14:41 CET | TS05.10 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
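The bottleneck targeted by TS05.7 is easiest to see in a plain scalar CSR kernel. The sketch below is a reference SpMV, not the paper's near-memory design; the gather on x through col_idx is the streaming indirect access that the proposed coalescer and AXI-Pack streaming unit handle near memory. The example matrix and values are ours.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Reference sparse matrix-vector product in CSR format."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            # x[col_idx[j]] is a data-dependent (indirect) load: its address is only
            # known once col_idx[j] has been read, which defeats caches and prefetchers.
            y[i] += values[j] * x[col_idx[j]]
    return y

# 3x4 example matrix [[5,0,0,1],[0,2,0,0],[0,0,3,4]] in CSR form.
values  = np.array([5.0, 1.0, 2.0, 3.0, 4.0])
col_idx = np.array([0, 3, 1, 2, 3])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 2.0, 3.0, 4.0])
print(spmv_csr(values, col_idx, row_ptr, x))   # [ 9.  4. 25.]
```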
Add this session to my calendar
Date: Tuesday, 26 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Auditorium 3
Session chair:
Anca Molnos, CEA, FR
Session co-chair:
Alessandro Savino, Politecnico di Torino, IT
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | TS08.1 | PACE: A PIECE-WISE APPROXIMATE AND CONFIGURABLE FLOATING-POINT DIVIDER FOR ENERGY-EFFICIENT COMPUTING Speaker: Chenyi Wen, Zhejiang University, CN Authors: Chenyi Wen1, Haonan Du1, Zhengrui Chen1, Li Zhang2, Qi Sun1 and Cheng Zhuo1 1Zhejiang University, CN; 2Hubei University of Technology, CN Abstract Approximate computing is a promising alternative to improve energy efficiency for human perception related applications on the edge. This work proposes a piece-wise approximate floating-point divider, which is resource-efficient and run-time configurable. We provide a piece-wise approximation algorithm for 1/y, utilizing powers of 2. This approach enables the implementation of a reciprocal-based floating-point divider that is independent of multipliers, which not only reduces hardware consumption but also results in shorter latency. Furthermore, a multi-level run-time configurable hardware structure is introduced, enhancing the adaptability to various application scenarios. When compared to the prior state-of-the-art approximate divider, the proposed divider strikes an advantageous balance between accuracy and resource efficiency. The application-level evaluation of the proposed dividers demonstrates manageable and minimal degradation of the output quality when compared to the exact divider./proceedings-archive/2024/DATA/68_pdf_upload.pdf |
14:05 CET | TS08.2 | ATTBIND: MEMORY-EFFICIENT ACCELERATION FOR LONG-RANGE ATTENTION USING VECTOR-DERIVED SYMBOLIC BINDING Speaker: Weihong Xu, University of California, San Diego, US Authors: Weihong Xu, Jaeyoung Kang and Tajana Rosing, University of California, San Diego, US Abstract Transformer models have achieved a number of breakthrough results on a variety of complex tasks. Transformer's promising performance originates from the multi-head attention (MHA), which can model long-range sequence data dependency. It has been demonstrated that better performance can be obtained by increasing the sequence length N. However, scaling up the sequence length is extremely challenging for memory-constrained hardware because the naive Transformer requires quadratic O(N^2) complexity. In this work, we address this challenge by leveraging the binding operation in vector symbolic architecture (VSA). We propose the memory-efficient MHA algorithm to simplify the MHA computation at the cost of linear complexity. Then, we present the ASIC hardware architecture with optimized timing and dataflow to accelerate the proposed algorithm. We extensively evaluate our design across various long-range attention tasks. Our experiments show that the accuracy is competitive to state-of-the-art MHA optimization approaches with lower memory consumption and inference latency. The proposed algorithm achieves 7.8× speedup and 4.5× data movement reduction over the naive Transformer on ASIC. Meanwhile, our design supports 8 to 16× sequence length compared to existing hardware accelerators./proceedings-archive/2024/DATA/183_pdf_upload.pdf |
14:10 CET | TS08.3 | A STOCHASTIC ROUNDING-ENABLED LOW-PRECISION FLOATING-POINT MAC FOR DNN TRAINING Speaker: Sami Ben Ali, Inria Rennes, FR Authors: Sami Ben Ali, Olivier Sentieys and Silviu-Ioan Filip, INRIA, FR Abstract Training Deep Neural Networks (DNNs) can be computationally demanding, particularly when dealing with large models. Recent work has aimed to mitigate this computational challenge by introducing 8-bit floating-point (FP8) formats for multiplication. However, accumulations are still done in either half (16-bit) or single (32-bit) precision arithmetic. In this paper, we investigate lowering accumulator word length while maintaining the same model accuracy. We present a multiply-accumulate (MAC) unit with FP8 multiplier inputs and FP12 accumulations, which leverages an optimized stochastic rounding (SR) implementation to mitigate swamping errors that commonly arise during low precision accumulations. We investigate the hardware implications and accuracy impact associated with varying the number of random bits used for rounding operations. We additionally attempt to reduce MAC area and power by proposing a new scheme to support SR in floating-point MAC and by removing support for subnormal values. Our optimized eager SR unit significantly reduces delay and area when compared to a classic lazy SR design. Moreover, when compared to MACs utilizing single- or half-precision adders, our design showcases notable savings in all metrics. Furthermore, our approach consistently maintains near baseline accuracy across a diverse range of computer vision tasks, making it a promising alternative for low-precision DNN training./proceedings-archive/2024/DATA/384_pdf_upload.pdf |
14:15 CET | TS08.4 | DAISM: DIGITAL APPROXIMATE IN-SRAM MULTIPLIER-BASED ACCELERATOR FOR DNN TRAINING AND INFERENCE Speaker: Lorenzo Sonnino, Keio University, JP Authors: Lorenzo Sonnino, Shaswot Shresthamali, Yuan He and Masaaki Kondo, Keio University, JP Abstract DNNs are widely used but face significant computational costs due to matrix multiplications, especially from data movement between the memory and processing units. One promising approach is therefore Processing-in-Memory as it greatly reduces this overhead. However, most PIM solutions rely either on novel memory technologies that have yet to mature or bit-serial computations that have significant performance overhead and scalability issues. Our work proposes an in-SRAM digital multiplier that uses a conventional memory to perform bit-parallel computations, leveraging multiple-wordline activation. We then introduce DAISM, an architecture leveraging this multiplier, which achieves up to two orders of magnitude higher area efficiency compared to the SOTA counterparts, with competitive energy efficiency./proceedings-archive/2024/DATA/561_pdf_upload.pdf |
14:20 CET | TS08.5 | EMBEDDING HARDWARE APPROXIMATIONS IN DISCRETE GENETIC-BASED TRAINING FOR PRINTED MLPS Speaker: Florentia Afentaki, University of Patras, GR Authors: Florentia Afentaki1, Michael Hefenbrock2, Georgios Zervakis1 and Mehdi Tahoori3 1University of Patras, GR; 2RevoAI GmbH, DE; 3Karlsruhe Institute of Technology, DE Abstract Printed Electronics (PE) stands out as a promising technology for widespread computing due to its distinct attributes, such as low costs and flexible manufacturing. Unlike traditional silicon-based technologies, PE enables stretchable, conformal and non-toxic hardware. However, PE is constrained by larger feature sizes, making it challenging to implement complex circuits such as machine learning (ML) classifiers. Approximate computing has been proven to reduce the hardware cost of ML circuits such as Multilayer Perceptrons (MLPs). In this paper, we maximize the benefits of approximate computing by integrating hardware approximation into the MLP training process. Due to the discrete nature of hardware approximation, we propose and implement a genetic-based, approximate, hardware-aware training approach specifically designed for printed MLPs. For a 5% accuracy loss, our MLPs achieve over 5× area and power reduction compared to the baseline while outperforming state-of-the-art approximate and stochastic printed MLPs./proceedings-archive/2024/DATA/600_pdf_upload.pdf |
14:25 CET | TS08.6 | A CONFIGURABLE APPROXIMATE MULTIPLIER FOR CNNS USING PARTIAL PRODUCT SPECULATION Speaker: Xiaolu Hu, Shanghai Jiao Tong University, CN Authors: Xiaolu Hu1, Ao Liu1, Xinkuang Geng1, Zizhong Wei2, Kai Jiang2 and Honglan Jiang1 1Shanghai Jiao Tong University, CN; 2Inspur Academy of Science and Technology, CN Abstract To improve the performance and energy efficiency of the compute-intensive convolutional neural networks (CNNs), approximate multipliers have widely been investigated, taking advantage of the inherent error tolerance in CNNs. However, owing to their differing degrees of error tolerance, different CNN models and datasets may require different multiplication accuracy. Thus, in this paper, we propose an energy-efficient approximate multiplier with configurable accuracy to satisfy the continuously evolving requirements of CNNs. In this design, the approximation level is configured by changing the processing scheme for inputs according to their significance to accuracy. The correlations between partial products (PPs) are utilized to eliminate the generation and accumulation of some less significant PPs that are speculated from their adjacent, more significant ones. Consequently, four approximate multiplier configurations are devised for 8×8 unsigned multiplication, denoted as AMPPS_S2, AMPPS_S3, AMPPS_S4, and AMPPS_S6. Compared with existing approximate multipliers, the proposed designs show significantly higher accuracy. Compared with an exact carry-save array multiplier, AMPPS_S2 can reduce the power dissipation, delay, and area by 21.7%, 35.1%, and 24.4%, respectively. Moreover, to enhance the efficiency of the proposed approximate multiplier in systolic array-based hardware architectures for CNNs, a novel encoding strategy is proposed for storing the pretrained weights. Obtaining a similar classification accuracy to the accurate implementation (tested in ResNet18 and ResNet50 on ImageNet), the 32×32 systolic array using AMPPS_S2 shows a 13.5% reduction in power consumption and a 19.2% reduction in area. Overall, the experimental results demonstrate that the proposed approximate multiplier results in higher accuracy in CNN-based image classification, with lower hardware overhead, compared with state-of-the-art approximate multipliers./proceedings-archive/2024/DATA/1249_pdf_upload.pdf |
14:30 CET | TS08.7 | ASCEND: ACCURATE YET EFFICIENT END-TO-END STOCHASTIC COMPUTING ACCELERATION OF VISION TRANSFORMER Speaker: Tong Xie, Peking University, CN Authors: Tong Xie, Yixuan Hu, Renjie Wei, Meng Li, Yuan Wang, Runsheng Wang and Ru Huang, Peking University, CN Abstract Stochastic computing (SC) has emerged as a promising computing paradigm for neural acceleration. However, how to accelerate the state-of-the-art Vision Transformer (ViT) with SC remains unclear. Unlike convolutional neural networks, ViTs introduce notable compatibility and efficiency challenges because of their nonlinear functions, e.g., softmax and GELU. In this paper, for the first time, a ViT accelerator based on end-to-end SC, dubbed ASCEND, is proposed. ASCEND co-designs the SC circuits and ViT networks to enable accurate yet efficient acceleration. To overcome the compatibility challenges, ASCEND proposes a novel deterministic SC module for GELU and leverages an SC-friendly iterative approximate algorithm to design an accurate and efficient softmax circuit. To improve inference efficiency, ASCEND develops a two-stage training pipeline to produce accurate low-precision ViTs. With extensive experiments, we show the proposed GELU and softmax modules achieve 56.3% and 22.6% error reduction compared to existing SC designs, respectively, and reduce the area-delay product (ADP) by 5.29× and 12.6×, respectively. Moreover, compared to the baseline low-precision ViTs, ASCEND also achieves significant accuracy improvements on CIFAR10 and CIFAR100./proceedings-archive/2024/DATA/71_pdf_upload.pdf |
14:35 CET | TS08.8 | A MODULAR BRANCH PREDICTOR PERFORMANCE ANALYSIS FRAMEWORK FOR FAST DESIGN SPACE EXPLORATION Speaker: Ya Wang, Hong Kong University of Science and Technology, HK Authors: Ya Wang1, Hanwei Fan1, Sicheng Li2, Tingyuan Liang1 and Wei Zhang1 1Hong Kong University of Science and Technology, HK; 2Alibaba Group, US Abstract As modern processor designs scale up and workloads become more complex, the selection of the branch predictor (BP) and the optimization of its internal parameters are increasingly critical in striking a balance between performance and resource usage. However, current fast performance evaluation models and micro-architectural Design Space Exploration (DSE) frameworks provide limited support for BP components, especially regarding internal parameter adjustments. In this work, we propose a modular BP performance analysis framework that provides fast performance feedback for different BP configurations. Our framework includes a pattern analyzer equipped with more accurate metrics for quantifying an application's predictability of branch behavior, a classification module for selecting the appropriate BP type, and an analytical model set that reflects the impact of internal parameter adjustments of various BPs on both performance and storage resource usage, thereby supporting DSE. Experimental results on three benchmarks confirm the framework's effectiveness, as our proposed model exhibits better correlation while reflecting more parameter changes than previous work. To the best of our knowledge, this is the first analytical model framework that supports comprehensive BP type and parameter adjustments./proceedings-archive/2024/DATA/55_pdf_upload.pdf |
14:40 CET | TS08.9 | CLAST: CROSS-LAYER APPROXIMATE HIGH-LEVEL SYNTHESIS WITH CONFIGURABLE APPROXIMATE THREE-OPERAND ADDERS Speaker: Jian Shi, Shanghai Jiao Tong University, CN Authors: Jian Shi, Wenjing Zhang and Weikang Qian, Shanghai Jiao Tong University, CN Abstract Approximate high-level synthesis (HLS) techniques have been proposed in recent years to produce approximate designs automatically from a high-level description. All existing approximate HLS methods work on two-operand approximate units. In this paper, we propose CLAST, a cross-layer approximate HLS flow using configurable three-operand approximate adders. The experimental results show that CLAST outperforms previous HLS tools by achieving a 57.0% improvement in area-delay product, while maintaining high accuracy./proceedings-archive/2024/DATA/848_pdf_upload.pdf |
14:41 CET | TS08.10 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
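The partial-product speculation idea in TS08.6 above can be made concrete with a small sketch (referenced in that abstract). The Python below is only a generic illustration of trading accuracy for work by omitting the least significant partial-product rows of an 8×8 unsigned multiplication; it is not the AMPPS speculation scheme from the paper, and the dropped_rows parameter is a hypothetical knob.

```python
# Illustrative only: a generic 8x8 unsigned approximate multiplier that omits
# the least significant partial-product rows. This is NOT the AMPPS design
# from TS08.6; 'dropped_rows' is a hypothetical accuracy knob.

def approx_mul_8x8(a: int, b: int, dropped_rows: int = 2) -> int:
    """Multiply two 8-bit unsigned ints, skipping the lowest partial products."""
    assert 0 <= a < 256 and 0 <= b < 256
    result = 0
    for i in range(8):                      # one partial-product row per bit of b
        if i < dropped_rows:
            continue                        # drop the less significant rows
        if (b >> i) & 1:
            result += a << i                # accumulate the shifted row
    return result

if __name__ == "__main__":
    exact = 173 * 201
    approx = approx_mul_8x8(173, 201)
    print(exact, approx, f"relative error {abs(exact - approx) / exact:.4f}")
```

Dropping more rows saves more work at the cost of accuracy, mirroring the configurable accuracy/energy trade-off that the AMPPS_Sx configurations expose in hardware.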
Date: Tuesday, 26 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Multi-Purpose Room M1B+D
Session chair:
Zebo Peng, Linkoping University, SE
Session co-chair:
Tiziana Margaria, University of Limerick, IE
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | TS24.1 | ECM: IMPROVING IOT THROUGHPUT WITH ENERGY-AWARE CONNECTION MANAGEMENT Speaker: Fatemeh Ghasemi, Norwegian University of Science and Technology, NO Authors: Fatemeh Ghasemi, Lukas Liedtke and Magnus Jahre, Norwegian University of Science and Technology, NO Abstract Designing Internet of Things (IoT) devices that solely rely on energy harvesting is the most promising approach towards achieving a scalable and sustainable IoT. The power output of energy harvesters can however vary significantly and maximizing throughput hence requires adapting application behavior to match the harvester's current power output. In this work, we focus on the connection policy of the IoT device and find that the on-demand connect policy — which is used by state-of-the-art IoT runtime systems — and the aggressive maintain connection policy both fall short across a broad range of harvester power outputs. We therefore propose Energy-aware Connection Management (ECM) which tunes the connection policy and sampling frequency to consistently achieve high throughput. ECM accomplishes this by predicting both the average power output of the harvester and the energy consumed by the IoT device with a lightweight analytical model that only requires tracking six energy thresholds. Our evaluation demonstrates that ECM can improve throughput substantially, i.e., by up to 9.5× and 3.0× compared to the on-demand connect and maintain connection policies, respectively./proceedings-archive/2024/DATA/108_pdf_upload.pdf |
14:05 CET | TS24.2 | FULL-STACK OPTIMIZATION FOR CAM-ONLY DNN INFERENCE Speaker: Joao Paulo Cardoso de Lima, TU Dresden, DE Authors: Joao Paulo De Lima1, Asif Ali Khan2, Luigi Carro3 and Jeronimo Castrillon2 1Technische Universität Dresden, DE; 2TU Dresden, DE; 3UFRGS, BR Abstract The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. Additionally, for some CIM designs, the activation movement still requires considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing their arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5× compared to crossbar in-memory accelerators while retaining software accuracy./proceedings-archive/2024/DATA/1194_pdf_upload.pdf |
14:10 CET | TS24.3 | DISCOVERING EFFICIENT FUSED LAYER CONFIGURATIONS FOR EXECUTING MULTI-WORKLOADS ON MULTI-CORE NPUS Speaker: Younghyun Lee, Hanyang University, KR Authors: Younghyun Lee1, Hyejun Kim2, Yongseung Yu1, Myeongjin Cho3, Jiwon Seo1 and Yongjun Park2 1Hanyang University, KR; 2Yonsei University, KR; 3Sapeon Korea, KR Abstract As the AI industry grows rapidly, Neural Processing Units (NPUs) have been developed to deliver AI services more efficiently. One of the most important challenges for NPUs is task scheduling to minimize off-chip memory accesses, which may incur significant performance overhead. To reduce memory accesses, multiple convolution layers can be fused into a fused layer group, which offers numerous optimization opportunities. However, in most Convolutional Neural Networks (CNNs), when multiple layers are fused, the on-chip memory utilization of the fused layers gradually decreases, resulting in non-flat memory usage. In this paper, we propose a scheduling search algorithm to optimize the fusion of multiple convolution layers while reducing the peak on-chip memory usage. The proposed algorithm aims to find a schedule that simultaneously optimizes execution time and peak on-chip memory usage, despite a slight increase in off-chip memory accesses. It organizes the search space into a graph of possible partial schedules and then finds the optimal path. As a result of the improved on-chip memory usage, multiple workloads can be executed on multi-core NPUs with increased throughput. Experimental results show that the fusion schedule explored by the proposed method reduced on-chip memory usage by 39%, while increasing latency by 13%. When the freed on-chip memory was allocated to other workloads and the two workloads were executed concurrently in a multi-core NPU, a 32% performance improvement could be achieved./proceedings-archive/2024/DATA/268_pdf_upload.pdf |
14:15 CET | TS24.4 | CLSA-CIM: A CROSS-LAYER SCHEDULING APPROACH FOR COMPUTING-IN-MEMORY ARCHITECTURES Speaker: Rebecca Pelke, RWTH Aachen University, DE Authors: Rebecca Pelke1, José Cubero-Cascante1, Nils Bosbach1, Felix Staudigl2, Rainer Leupers1 and Jan Moritz Joseph1 1RWTH Aachen University, DE; 2Institute for Communication Technologies and Embedded System, RWTH Aachen University, DE Abstract The demand for efficient machine learning (ML) accelerators is growing rapidly, driving the development of novel computing concepts such as resistive random access memory (RRAM)-based tiled computing-in-memory (CIM) architectures. CIM allows computation within the memory unit, resulting in faster data processing and reduced power consumption. Efficient compiler algorithms are essential to exploit the potential of tiled CIM architectures. While conventional ML compilers focus on code generation for CPUs, GPUs, and other von Neumann architectures, adaptations are needed to cover CIM architectures. Cross-layer scheduling is a promising approach, as it enhances the utilization of CIM cores, thereby accelerating computations. Although similar concepts are implicitly used in previous work, there is a lack of clear and quantifiable algorithmic definitions for cross-layer scheduling for tiled CIM architectures. To close this gap, we present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures. We integrate CLSA-CIM with existing weight-mapping strategies and compare performance against state-of-the-art (SOTA) scheduling algorithms. CLSA-CIM improves utilization by up to 17.9×, resulting in an overall speedup of up to 29.2× compared to SOTA./proceedings-archive/2024/DATA/406_pdf_upload.pdf |
14:20 CET | TS24.5 | HIGH THROUGHPUT HARDWARE ACCELERATED CORESIGHT TRACE DECODING Speaker: Matthew Weingarten, ETH Zurich, CH Authors: Matthew Weingarten, Nora Hossle and Timothy Roscoe, ETH Zurich, CH Abstract A single tracing component embedded into a high-frequency processor may produce up to 1 GB/s of trace data or more. These data are vital in debugging, monitoring, verification, and performance analysis in System-on-chip and heterogeneous system development. Hardware trace decoders and analyzers have emerged to support online processing of trace data for real-time applications. However, the existing hardware trace decoders designed for the Embedded Trace Macrocell version 4 (ETMv4), a standard feature in most modern ARM processors, can only process trace data at a maximum rate of 250 MB/s. This paper proposes an optimized and parallelized trace decoder for the ETMv4 specification implemented on a Xilinx UltraScale+ FPGA, processing up to 1 GB/s of trace data from a single ETM./proceedings-archive/2024/DATA/650_pdf_upload.pdf |
14:25 CET | TS24.6 | COMPUTATIONAL AND STORAGE EFFICIENT QUADRATIC NEURONS FOR DEEP NEURAL NETWORKS Speaker: Chuangtao Chen, TU Munich, DE Authors: Chuangtao Chen1, Grace Li Zhang2, Xunzhao Yin3, Cheng Zhuo3, Ulf Schlichtmann1 and Bing Li1 1TU Munich, DE; 2TU Darmstadt, DE; 3Zhejiang University, CN Abstract Deep neural networks (DNNs) have been widely deployed across diverse domains such as computer vision and natural language processing. However, the impressive accomplishments of DNNs have been realized alongside extensive computational demands, thereby impeding their applicability on resource-constrained devices. To address this challenge, many researchers have been focusing on basic neuron structures, the fundamental building blocks of neural networks, to alleviate the computational and storage cost. In this work, an efficient quadratic neuron architecture distinguished by its enhanced utilization of second-order computational information is introduced. By virtue of their better expressivity, DNNs employing the proposed quadratic neurons can attain similar accuracy with fewer neurons and computational cost. Experimental results have demonstrated that the proposed quadratic neuron structure exhibits superior computational and storage efficiency across various tasks when compared with both linear and non-linear neurons in prior work./proceedings-archive/2024/DATA/37_pdf_upload.pdf |
14:30 CET | TS24.7 | PERFORMANCE ANALYSIS AND OPTIMIZATIONS OF MATRIX MULTIPLICATIONS ON ARMV8 PROCESSORS Speaker: Hucheng Liu, Harbin Institute of Technology, CN Authors: Hucheng Liu, Shaohuai Shi, Xuan Wang, Zoe L. Jiang and Qian Chen, Harbin Institute of Technology, CN Abstract General matrix multiplication (GEMM) as a fundamental subroutine has been widely used in many applications like scientific computing, machine learning, etc. Although many studies are dedicated to optimizing its performance, they mainly focus on matrices with regular shapes or x86 platforms. GEMM with irregularly shaped matrices on modern ARMv8 processors is under-explored. In this paper, we provide a thorough performance analysis of the general block-panel multiplication (GEBP) kernel of GEMM with irregular shapes (a plain-Python sketch of the block-panel partitioning follows this session table). Based on our analysis, we propose a new GEMM algorithm named EPPA with three novel schemes to improve GEMM performance on ARMv8 processors: i) eliminating packing to reduce L1 cache contention, ii) avoiding data eviction and pre-fetching data to reduce the L1 cache miss penalty, and iii) an adaptive selection strategy of the above two and original schemes. We conduct extensive experiments with a large range of irregular matrices on three popular ARMv8 processors compared to seven state-of-the-art GEMM libraries. The experimental results show that our EPPA algorithm outperforms existing ones across workloads and processors and accelerates real-world applications./proceedings-archive/2024/DATA/823_pdf_upload.pdf |
14:35 CET | TS24.8 | A COMPILER PHASE TO OPTIMALLY SPLIT GPU WAVEFRONTS FOR SAFETY-CRITICAL SYSTEMS Speaker: Artem Klashtorny, University of Waterloo, CA Authors: Artem Klashtorny, Mahesh Tripunitara and Hiren Patel, University of Waterloo, CA Abstract We present a compiler phase for GPUs enabled with predictable wavefront splitting (PWSG) that implements an optimal algorithm to split diverging GPU wavefronts into separate schedulable entities. This algorithm selects branches in the GPU kernel that guarantee the lowest worst-case execution time (WCET) for the kernel. We implement our algorithm in a compiler flow for an AMD GPU, and we deploy the resulting binary on a gem5 micro-architectural implementation of the AMD GCN3 GPU. We evaluate our implementation on an extensive set of synthetic benchmarks. Our experiments show that by automatically selecting points in the GPU kernel to split wavefronts, we are able to reduce the WCET by 34% to 52% compared to five alternative approaches./proceedings-archive/2024/DATA/1067_pdf_upload.pdf |
14:40 CET | TS24.9 | LIGHTWEIGHT INSTRUMENTATION FOR ACCURATE PERFORMANCE MONITORING IN RTOSES Speaker: Bruno Endres Forlin, University of Twente, NL Authors: Bruno Endres Forlin1, Kuan-Hsun Chen1, Nikolaos Alachiotis1, Luca Cassano2 and Marco Ottavi3 1University of Twente, NL; 2Politecnico di Milano, IT; 3University of Rome Tor Vergata | University of Twente, NL Abstract Evaluating performance metrics in embedded systems poses challenges, particularly due to the limited set of tools available for monitoring performance counters. In addition, performance evaluation frameworks for Real-Time Operating Systems (RTOSes) often lack the sophistication and capabilities available in general-purpose operating systems like Linux, which benefit from utilities such as perf_event. To bridge this gap, this paper presents an accurate and low-overhead instrumentation utility tailored for RTOSes. Our approach utilizes performance monitoring counters to observe individual user applications within the RTOS environment. Importantly, it enables comprehensive application monitoring by strategically placing probes at points of inherent system interference, thereby minimizing additional overhead. A pre-calibration of these probes allows for fine-grained measurements within user applications. This results in the elimination of 100% of the overheads for most counters in our test configuration, impacting the context switch by only three additional instructions per monitored counter./proceedings-archive/2024/DATA/619_pdf_upload.pdf |
14:41 CET | TS24.10 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
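As background for the GEBP kernel analysed in TS24.7 above (referenced in that abstract), the sketch below shows the textbook block-panel partitioning of GEMM in plain Python. The block sizes MC, KC and NC are hypothetical placeholders for cache-derived parameters; the paper's EPPA packing-elimination and prefetching schemes are not reproduced here.

```python
# Illustrative block-panel GEMM (GEBP-style) partitioning in plain Python.
# MC/KC/NC are hypothetical cache-blocking parameters; real libraries derive
# them from cache sizes and add packing, which TS24.7 analyses and optimizes.
import numpy as np

MC, KC, NC = 64, 128, 256

def gebp_gemm(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for jc in range(0, N, NC):              # panel of B columns
        for pc in range(0, K, KC):          # panel along the K dimension
            for ic in range(0, M, MC):      # block of A rows
                C[ic:ic+MC, jc:jc+NC] += (
                    A[ic:ic+MC, pc:pc+KC] @ B[pc:pc+KC, jc:jc+NC]
                )
    return C

if __name__ == "__main__":
    A = np.random.rand(200, 300)            # deliberately irregular shapes
    B = np.random.rand(300, 170)
    assert np.allclose(gebp_gemm(A, B), A @ B)
```

Irregular matrix shapes leave the edge blocks only partially filled, which is exactly where cache-blocking heuristics tuned for large regular matrices start to lose efficiency.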
Date: Tuesday, 26 March 2024
Time: 14:00 CET - 15:00 CET
Location / Room: Break-Out Room S1+2
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | UFP.1 | OPEN EDA FOR OPEN HARDWARE - THE CORIOLIS FLOW Speaker: Jean-Paul Chaput, LIP6, FR Authors: Cecile Braunstein1, Jean-Paul Chaput1, Roselyne Chotin1, Dimitri Galayko1, Marie-Minerve Louerat2, Franck Wajsburt1, Ali Oudrhiri1 and Mazher Iqbal3 1Sorbonne Université, FR; 2CNRS and Sorbonne Universite, FR; 3LIP6 (Sorbonne Université & CNRS), FR Abstract At DATE 2024, we demonstrate an open source EDA flow, from Register Transfer Level (RTL) down to layout, combining YOSYS (logic synthesis), Coriolis (placer and router), and KLayout (layout editor) with an Open Source Process Design Kit (PDK) and Open Source Standard Cells. This EDA flow is the cornerstone of defining Open Hardware. The focus will be more specifically on the Coriolis software (Open Software award at OSEC 2022 and UK Innovation award in 2023). Several digital benchmarks with an RTL-level description will be compiled down to a physical description (layout), using an Open Source PDK (SkyWater 130nm, GF180nm, IHP 130nm, ...) and open source standard cells (c4m-flexcell, nsxlib, lsxlib). Descriptions of other digital Open Source chips and their ASIC deployment will also be presented. |
14:08 CET | UFP.2 | ISOLATED 150 MHZ LLC DC-DC CONVERTER CMOS CHIPSET USING MICROTRANSFORMER Presenter: Pat Meehan, University of Limerick, IE Authors: Pat Meehan and Karl Rinne, University of Limerick, IE Abstract Interim results are available for a novel galvanically isolated DC-DC converter. It includes a CMOS-integrated microtransformer capable of withstanding over 10 kV across its insulator. The entire device, including the primary controller and secondary rectifier, fits inside one integrated circuit. It is at the pre-commercialisation stage, at TRL 3 and 4. The primary and secondary circuits contain some innovative circuit techniques, including the ability to use on-chip metal-insulator-metal capacitors in the LLC converter. |
14:15 CET | UFP.3 | HARDWARE AND SOFTWARE DESIGNS FOR HIGH PERFORMANCE AND RELIABLE SPACE PROCESSING Presenter: Marc Solé Bonet, BSC, ES Authors: Marc Solé Bonet1, Ivan Rodriguez Ferrandez2, Dimitris Aspetakis1, Jannis Wolf1, Matina Maria Trompouki1 and Leonidas Kosmidis2 1BSC, ES; 2UPC | BSC, ES Abstract Current critical systems and especially space systems require increased performance in order to support advanced functionalities such as autonomous operation. At the same time, they need to operate reliably in the harsh space environment. In this University Fair submission, we will showcase hardware and software demonstrators of our institution, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, in this research line. In particular: -An open source, RISC-V based high performance multicore and GPU processor for space systems, integrated with the SPARROW SIMD AI accelerator [1]. The design is prototyped on a Xilinx FPGA and is capable of running the RTEMS SMP qualifiable space real-time operating system, as well as OpenMP and OpenCL [2]. -Space Shuttle: a silicon prototype of a protected register file for space systems on the SkyWater 130nm process [3]. The design has been taped out in the second run of the Google-sponsored efabless shuttle run for open source designs. The register file is protected with multiple means (dual and triple modular redundancy and ECC protection) and it is combined with a redundant Performance Monitor Unit which logs all corrections, for the assessment of the radiation tolerance of this processing node. -OBPMark: this open source benchmarking suite developed for the European Space Agency (ESA) consists of representative space applications such as image processing of optical payloads, synthetic aperture radar (SAR), CCSDS 121 and 122 compression and CCSDS encryption, as well as Machine Learning applications such as cloud screening and ship detection from satellite images. The software is showcased on the aforementioned RISC-V platform and on embedded GPU platforms. OBPMark supports multiple programming models (sequential C, OpenMP, CUDA, OpenCL) and it is the official benchmarking suite used by ESA for deciding the hardware requirements of new missions as well as a demonstrator for newly funded ESA projects [4]. For this work our team received a HiPEAC Technology Transfer Award in 2021. -Formal Verification of General Purpose Graphics Processing Unit (GPGPU) software. This software prototype is based on AdaCore's Ada SPARK backend for CUDA, which is currently under development. We showcase that our developed methodology can be used to formally prove the absence of runtime errors in the space representative OBPMark software. Defects found in earlier versions of the C version of the software were not present in the Ada SPARK port, thanks to the use of formal methods [5]. |
14:22 CET | UFP.4 | AN EVENT-BASED POWER ANALYSIS TOOL WITH INTEGRATED TIMING CHARACTERIZATION AND LOGIC SIMULATION Presenter: Katayoon Basharkhah, University of Tehran, IR Authors: Katayoon Basharkhah and Zain Navabi, University of Tehran, IR Abstract With the shrinking size of transistors in the most advanced technologies and the demand for battery-operated devices, dynamic power estimation has become a major concern. State-of-the-art power estimation environments strongly depend on gate-level static timing characterization and dynamic logic simulation that are external to power estimation tools. This imposes a large overhead on power analysis speed and makes timing and power analysis two separate passes. This separation prevents the utilization of dynamic timing information that would be useful for power analysis. This work proposes a distributed event-based power simulation environment that is coupled with timing analysis (a textbook sketch of event-based dynamic-energy estimation follows this session table). This environment is based on SystemC/C++ gate models, among which power and timing simulation tasks are distributed. This also removes the overhead of the many file exchanges required in decoupled commercial tools. We show that our integrated simulation runs faster than any of the required passes of conventional simulations. The proposed coupled estimation environment reaches up to 25X speed-up in runtime for large circuits and long simulation runs, while retaining estimation accuracy with an average error of 1.13% in comparison with commercial power estimation tools. |
14:30 CET | UFP.5 | TRAINING PACKAGE FOR DESIGN AND IMPLEMENTATION OF RISC-V BASED EMBEDDED SYSTEMS Presenter: Katayoon Basharkhah, University of Tehran, IR Authors: Katayoon Basharkhah1, Maryam Rajabalipanah2, Zahra Jahanpeima1, Nooshin Nosrati1, Rezgar Sadeghi1, Zahra Mahdavi1, Negin Safari1, Amirmahdi Joudi1 and Zain Navabi3 1University of Tehran, IR; 2University of Tehran, IR; 3University of Tehran, IR Abstract RISC-V processors, known for their adherence to Reduced Instruction Set Computing (RISC) principles, are gaining popularity in various industries due to their open-source nature, flexibility, and scalability. This architecture, favored for its adaptability, is increasingly adopted in embedded systems like IoT devices and mobiles, offering a cost-effective solution. Emerging start-ups in this field face challenges in both hardware and software development, necessitating comprehensive training in RISC-V architecture, power optimization, firmware development, security, and testing. To address these challenges, a training package is proposed, covering modules on RISC-V architecture, hardware design, software development, testing methodologies, and security practices. This training aims to empower start-ups, bridge knowledge gaps, and contribute to building a skilled workforce for the growing demand in RISC-V embedded systems. We present the development of a training package specifically tailored for designers focusing on RISC-V based embedded systems. |
14:38 CET | UFP.6 | PROCESSING-IN-MEMORY ACCELERATOR DESIGN WITH A FULLY SYNTHESIZABLE CELL-BASED DRAM Presenter: Tai-feng Chen, Nagoya University, JP Authors: Tai-feng Chen, Yutaka Masuda and Tohru Ishihara, Nagoya University, JP Abstract Processing-in-Memory (PiM) is one of the most promising design techniques for applications like AI accelerators that require both high performance and energy efficiency. Since AI algorithms and their accelerator models are evolving rapidly, the biggest challenge of AI chip design is to synthesize high-performance and highly energy-efficient accelerators rapidly. In this research project, we develop fully synthesizable cell-based DRAMs to address this challenge, where the gain cell structure is adopted as a bit cell storage with a specifically designed circuit structure to enable near-threshold voltage operation. The cell-based DRAM structure allows commercial APR tools to freely distribute and place memories around the computation logic circuits, thus achieving PiM without I/O bandwidth restrictions. The design results of the cell-based DRAM and an accelerator placed and routed in a mixed manner are shown at the fair, with possible future improvements discussed in the final part. |
14:45 CET | UFP.7 | SECURING NANO-CIRCUITS AGAINST OPTICAL PROBING ATTACK Presenter: Sajjad Parvin, University of Bremen, DE Authors: Sajjad Parvin1, Frank Sill Torres2 and Rolf Drechsler3 1University of Bremen, DE; 2German Aerospace Center, DE; 3University of Bremen | DFKI, DE Abstract Recently, a non-invasive laser-based Side-Channel Analysis (SCA) attack, namely Optical Probing (OP) attack has been shown to pose an immense threat to the security of sensitive information on chips. In the literature, several countermeasures are proposed which require changes in the chip fabrication process. As a result, these countermeasures are costly and not economical. In this work, we focused on proposing countermeasures against OP at the layout level. Moreover, to design robust chips against OP during design time, we developed an OP simulator. The developed OP simulator enables designers to analyze the information leakage of a design pre-silicon. The structures on this chip will be used to evaluate and improve the accuracy of our OP simulator under an experimental OP setup. |
14:53 CET | UFP.8 | LLM-ASSISTED HIGH QUALITY INVARIANTS GENERATION FOR FORMAL VERIFICATION Presenter: Khushboo Qayyum, DFKI, DE Authors: Khushboo Qayyum1, Sallar Ahmadi-Pour2, Muhammad Hassan3, Chandan Kumar Jha2 and Rolf Drechsler3 1DFKI, DE; 2University of Bremen, DE; 3University of Bremen | DFKI, DE Abstract The increasing complexity of hardware makes it essential that the design be verified thoroughly. However, the task of verification is in itself tedious, and simplifying it has garnered a lot of interest. In particular, Large Language Models (LLMs) are being explored to exploit their potential for hardware verification and the simplification of this cumbersome task. Although complete verification through LLMs may still not be achievable, the task of generating formal properties is promising. In this work, we showcase an automated approach to generate high-quality invariants for formal verification using OpenAI's GPT-4 LLM. In particular, we automate the generation of invariants from specifications as well as the formal model from the Verilog behavioral model. We evaluate the quality of the generated invariants by utilizing mutation testing. Our experimental evaluation reveals that LLM assistance can simplify the hardware verification process. |
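The coupled power/timing engine described in UFP.4 above is specific to that project, but the event-based idea it builds on can be illustrated with a textbook approximation: every toggle on a net dissipates roughly 0.5 * C_load * Vdd^2 of dynamic energy. The Python below is such a minimal sketch; the supply voltage, capacitance values and trace format are invented for illustration and are not taken from the presented tool.

```python
# Minimal event-based dynamic-energy estimate: each output toggle of a net is
# charged 0.5 * C_load * Vdd^2. This is a textbook approximation for
# illustration only, not the coupled timing/power engine of UFP.4.
VDD = 0.8          # supply voltage in volts (hypothetical)
CAP = {            # per-net load capacitance in farads (hypothetical values)
    "n1": 2.0e-15,
    "n2": 5.0e-15,
}

def dynamic_energy(events):
    """events: iterable of (time, net, new_value); returns energy in joules."""
    last = {}
    energy = 0.0
    for _, net, value in sorted(events):
        if net in last and last[net] != value:   # a real toggle occurred
            energy += 0.5 * CAP[net] * VDD ** 2
        last[net] = value
    return energy

if __name__ == "__main__":
    trace = [(0, "n1", 0), (1, "n1", 1), (2, "n2", 1), (3, "n1", 0), (4, "n2", 1)]
    print(f"{dynamic_energy(trace):.3e} J")
```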
Date: Tuesday, 26 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Break-Out Room S3+4
Come join us for stimulating brainstorm discussions in small groups about the future of digital engineering. Our focus will be on the digital twinning paradigm where virtual instances are created of a system as it is operated, maintained, and repaired (e.g., one evolving virtual instance for each individual car realized from a single design model, connected by a digital thread). We investigate how to take advantage of this paradigm in engineering systems and what new system engineering approaches and architectures (hardware/software) and design workflows are needed and become possible.
Date: Tuesday, 26 March 2024
Time: 14:00 CET - 18:00 CET
Location / Room: Break-Out Room S6+7
Organisers
Chiara Sandionigi, CEA, FR
David Bol, UC Louvain, BE
Jonas Gustafsson, RISE, SE
Jean-Christophe Crebier, CNRS/G-INP/UGA, FR
Eco-ES, the workshop devoted to eco-design and the circular economy of Electronic Systems, returns to DATE 2024 for its second edition. This half-day event consists of a plenary keynote, invited talks and regular presentations. As a novelty, the second edition also proposes a session dedicated to European projects working on the sustainability of electronics.
The impact of electronics on the environment is becoming an important issue, especially because the number of systems is growing exponentially. Eco-design and the circular economy applied to Electronic Systems are thus becoming major challenges for our society in responding to the dangers to the environment: the exponential increase in electronic waste generation, depletion of resources, contribution to climate change and poor resilience to supply-chain issues. Electronic Systems designers willing to engage in eco-design face several difficulties, related in particular to a limited knowledge of the environmental impact at the design phase and the uncertain extension of the service lifetime of the system or parts of the system, owing to the variability in user behaviour and business models.
The objective of the workshop Eco-ES is to gather experts from both academia and industry, covering a wide scope in the environmental sustainability of Electronic Systems. Regular sessions with talks and a poster session will offer a place for the audience to discuss and share ideas.
Workshop topics include:
Tue, 14:00 - 14:40
14.00: Workshop introduction (C. Sandionigi, CEA)
14.10: Keynote - Challenged with Planetary Boundaries? Addressing Design for Absolute Sustainability in Electronics Research and Development Processes (M. Rio, INP-Gre)
Tue, 14:40 - 16:00
Session chair
Jonas Gustafsson, RISE, SE
14.40: Towards the creation of a European ecosystem for Sustainable Electronics (C. Sandionigi, CEA)
15.00: The Sustronics project (O. Kattan, Philips)
15.20: ESOS - Electronics: Sustainable, Open and Sovereign (T. Marty, INSA-Rennes)
15.40: Components circular re-use of electronic scrap (N. Canis-Moal, Continental; J. Gabriel, CEA)
Tue, 16:15 - 18:00
Session chair
Chiara Sandionigi, CEA, FR
16.15: 4MOD's eco-design method for consumer electronics: Integrating Life Cycle Assessment into the design and R&D process (E. Whitmore, 4MOD)
16.30: Assessing the embodied carbon footprint of a cellular base station with a parametric LCA (R. Dethienne, UCLouvain)
16.45: Data center IT hardware refresh driven by environmental impact assessment for a circular economy (P. Thampi, RISE)
17.00: Chiplets: Opportunities for reusable/replaceable/life-extended sustainable electronic chips (R. Massoud, EPFL)
17.15: On the need for open life cycle analysis datasets (T. Marty, INSA-Rennes)
17.30: Integrating screening Life Cycle Assessment in digital system design flow to enable eco-design (M. Peralta, CEA)
17.45: Making electronic greener? Profiling greener IC and boards (G. Saucier, D&R)
Date: Tuesday, 26 March 2024
Time: 14:00 CET - 18:00 CET
Location / Room: Break-Out Room S8
Organisers
Selma Saidi, Technische Universität Dortmund, Germany
Rolf Ernst, Technical University Braunschweig, Germany
Sebastian Steinhorst, Technical University of Munich, Germany
Abstract: Autonomous functions have been proposed for virtually all transport systems: vehicles, aircraft, and rail systems. Such transport systems and their design are governed by safety standards that are developed by large groups in public bodies or standardization organizations. At the core of such safety standards are certification or homologation processes that govern the regulatory approval of an autonomous system. A major challenge for this approval is the coexistence of traditional transport systems and transport systems with different levels of autonomy. This coexistence is a necessity, as there will be no exclusive space for autonomous transport systems, neither in the air nor on the ground, especially in densely populated areas. There are many related projects and initiatives addressing this challenge with different requirements and rules, in part a result of the specific form of potential system interference, in part based on the tradition of safety guarantees in the respective domain. The European U-space initiative is a very good example of regulating the coexistence of traditional and unmanned aircraft and drones and the related design processes and protection mechanisms.
Workshop Organization: After an introduction, the workshop will start with a first session of introductory talks by experts from industry and public authorities in the different domains. In the second session, a panel of experts will discuss the implications for Autonomous Systems Design and research in the field.
ASD6.1: Introductory Talks (14h00-16h00)
Talk 1: “ISO PAS 8800: An AI safety standard for automotive applications”,
Simon Burton, Independent Consultant and Honorary Visiting Professor, University of York
Talk 2: “Certifying Autonomy in Railway Systems”,
Mario Trapp, Director Fraunhofer Institute Cognitive Systems
Talk 3: “AI Assurance and Challenges”,
Huafeng Yu, Highly Automated Systems Safety Center of Excellence (HASS COE), US Department of Transportation
Talk 4: “Autonomous Systems: Management Oversight of Certification and Homologation”,
William Widen, School of Law, University of Miami
ASD6.2: Panel (16h30-18h00)
Confirmed Panelists:
Date: Tuesday, 26 March 2024
Time: 15:00 CET - 15:45 CET
Location / Room: Break-Out Room S1+2
Session chair:
Anton Klotz, Cadence Design Systems, DE
Organizers:
John Darlington, University of Southampton, UK
Richard Buttrey, Arm, UK
Becky Ellis, Arm, UK
Peter Groos, Cadence, DE
Arm and the University of Southampton have launched SoC Labs, a global meeting place for academics to learn from one another and to identify and collaborate on SoC design projects based on Arm. In order to foster great SoC design work and research collaboration, SoC Labs began a series of design contests. Following their successful 2023 design contest at IEEE SOCC, SoC Labs, supported by Arm, is here at DATE to present its ‘Understanding Our World SoC Design Contest’ - which will be back at DATE ’25 to present its project outcomes.
SoC Labs is pleased to announce an ‘Understanding Our World’ research and teaching design contest at DATE. There are opportunities for the best entries to be sponsored to attend a special session at DATE 2025. Research groups interested in taping out their designs have an opportunity to join a shuttle, where there will be a contribution towards the cost.
Come and join us to hear more about participating in the design contest, SoC Labs and how to gain access to a wide suite of Arm IP to empower your SoC designs and accelerate your research.
There are many challenges to address in the world, not least those encapsulated in the UN Sustainable Development Goals (SDGs). Many real-world problems need improved real time understanding of environments of different kinds and our impact upon them. Sustainable monitoring and management of the external environment demands efficient gathering of data for compute to act on as the first stage of unlocking solutions - from climate change, deforestation, pollution to water use and beyond. Gathering various kinds of real-world data has its own challenges in terms of analogue to digital signal conversion and the related design of an efficient SoC.
The design contest has two tracks: (1) one based around collaboration/education, where we focus on skills development and working with others, including re-using and repurposing designs. For this track we are more interested in hearing about the journey itself than about unique hardware designs. (2) a second based on hardware implementation, which includes FPGA and ASIC, looking for mixed-signal SoC designs. Analogue/digital mixed-signal designs are important in understanding our world in areas such as energy efficiency, calibration, noise and generally bringing data into the SoC. We are looking for contest submissions from both early career and experienced researchers.
Date: Tuesday, 26 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Break-Out Room S1+2
Organiser
Hammond Pearce, University of New South Wales Sydney, Australia
Presenters
Jason Blocklove, NYU, United States
Siddharth Garg, NYU, United States
Jeyavaijayan Rajendran, TAMU, United States
Ramesh Karri, NYU, United States
Tutorial resources: https://github.com/JBlocklove/LLMs-for-EDA-Tutorial
Motivation: | There are increasing demands for integrated circuits but a shortage of designers - Cadence’s blog reports a shortage of 67,000 employees in the US alone. These increasing pressures, alongside shifts to smaller nodes and more complexity, lead to buggy designs and slow time-to-market. State-of-the-art Generative AI tools like GPT-4 and Bard have shown promising capabilities in automatic generation of register transfer level (RTL) code, assertions, and testbenches, and in bug/Trojan detection. Such models can be further specialized for hardware tasks by fine-tuning on open-source datasets. As Generative AI solutions find increasing adoption in the EDA flow, there is a need for training EDA experts on using, training and fine-tuning such models in the hardware context. |
---|---|
Intended audience: | Students, academics, and practitioners in EDA/VLSI/FPGA and Security |
Objectives | In this tutorial we will show the audience how one can use current capabilities in generative AI (e.g. ChatGPT) to accelerate hardware design tasks. We will explore how it can be used with both closed and open-source tooling, and how you can also train your own language models and produce designs in a fully open-source manner. We'll discuss how commercial operators are beginning to make moves in this space (GitHub Copilot, Cadence JedAI) and reflect on the consequences of this in education and industry (will our designs become buggier? Will our graduating VLSI students know less?). We'll cover all of this using a representative suite of examples both simple (basic shift registers) to complex (AXI bus components and microprocessor designs). |
Abstract | There are ever-increasing demands on complexity and production timelines for integrated circuits. This puts pressure on chip designers and design processes, and ultimately results in buggy designs with potentially exploitable mistakes. When computer chips underpin every part of modern life, enabling everything from your cell phone to your car, traffic lights to pacemakers, coffee machines to wireless headphones, then mistakes have significant consequences. This unfortunate combination of demand and increasing difficulty has resulted in shortages of qualified engineers, with some reports indicating that there are 67,000 jobs in the field yet unfilled.
Fortunately, there is a path forward. For decades, the Electronic Design Automation (EDA) field has applied the ever-increasing capabilities from the domains of machine learning and artificial intelligence to steps throughout the chip design flow. Steps from layouts, power and performance analysis and estimation, and physical design are all improved by programs taught rather than programmed. In this tutorial we will explore what's coming next: EDA applications from the newest type of artificial intelligence, generative pre-trained transformers (GPTs), also known as Large Language Models. We will show how models like the popular ChatGPT can be applied to tasks such as writing HDL, searching for and repairing bugs, and even applying itself to the production of complex debugging tasks like producing assertions. Rather than constrain oneself just to commercial and closed-source tooling, we'll also show how you can train your own language models and produce designs in a fully open-source manner. We'll discuss how commercial operators are beginning to make moves in this space (GitHub Copilot, Cadence JedAI) and reflect on the consequences of this in education and industry (will our designs become buggier? Will our graduating VLSI students know less?). We'll cover all of this using a representative suite of examples both simple (basic shift registers) to complex (AXI bus components and microprocessor designs). |
Necessary background | Experience with EDA flows and software tools such as Xilinx Vivado, Yosys, iverilog, etc. will be helpful but is not required, as training will be provided on the day. |
References: (Tutorial presenters in bold) |
S. Thakur, B. Ahmad, Z. Fan, H. Pearce, B. Tan, R. Karri, B. Dolan-Gavitt, S. Garg, "Benchmarking Large Language Models for Automated Verilog RTL Code Generation," 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), Antwerp, Belgium, 2023, pp. 1-6, doi: 10.23919/DATE56975.2023.10137086. J. Blocklove, S. Garg, R. Karri, H. Pearce, “Chip-Chat: Challenges and Opportunities in Conversational Hardware Design,” 2023 Machine Learning in CAD Workshop (MLCAD). Preprint: https://arxiv.org/abs/2305.13243 H. Pearce, B. Tan, B. Ahmad, R. Karri and B. Dolan-Gavitt, "Examining Zero-Shot Vulnerability Repair with Large Language Models," 2023 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 2023, pp. 2339-2356, doi: 10.1109/SP46215.2023.10179324. B. Ahmad, S. Thakur, B. Tan, R. Karri, H. Pearce, “Fixing Hardware Security Bugs with Large Language Models,” under review. Preprint: https://arxiv.org/abs/2302.01215 R. Kande, H. Pearce, B. Tan, B. Dolan-Gavitt, S. Thakur, R. Karri, J. Rajendran, “LLM-assisted Generation of Hardware Assertions,” under review. Preprint: https://arxiv.org/abs/2306.14027 |
Hands on session |
Content: Audience members will use the language models to achieve various tasks within a simple EDA environment focused on simulation. Goals: While we will also demo approaches using more complex software, the hands-on session will focus on the use of iverilog, which is a simple, free, open-source tool for simulating Verilog designs (a minimal illustrative glue script for this workflow follows the tutorial plan below). iverilog is not demanding (it can be run on local machines/laptops) and is compatible with Windows, Linux, and macOS. Pre-requisites: While it is preferable for participants to have installed gcc, build-essential, iverilog, and gtkwave in advance, doing so on the day is not difficult and we can provide guidance at the beginning of the session. |
---|---|
Tutorial material | Reference material on the pre-requisites and the manuscripts from the listed references. |
Tutorial plan |
0-15 mins: Introduction and motivation by Hammond Pearce, Ramesh Karri, Siddharth Garg, and Jason Blocklove (presenter TBD) 15-35 mins: Hands-on Chip-chat - using ChatGPT for writing, simulating, and bug-fixing Verilog by Jason Blocklove and Hammond Pearce (participants will be provided with scripts that they can adapt to interact with ChatGPT for their own tools) 35-60 mins: Hands-on VeriGen: Developing Open-source EDA datasets and models by Shailja Thakur and Jason Blocklove 60-80 mins: AI for Bug Detection: Accelerating hardware fuzzing and flagging bugs and Trojans with Generative AI by Benjamin Tan and JV Rajendran 80-90 mins: Gazing into the Crystal Ball: The future of EDA with Generative AI by Siddharth Garg and Ramesh Karri |
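For orientation ahead of the hands-on session, the sketch below shows the kind of glue script a participant might write (referenced in the hands-on description above): it saves a Verilog module and testbench to disk and runs them through iverilog and vvp, the free tools named in the prerequisites. The shift-register DUT is hard-coded here as a stand-in for LLM-generated RTL and all file names are illustrative; the tutorial's actual scripts are those in the linked GitHub repository.

```python
# Illustrative glue script for the hands-on session: write Verilog to disk and
# simulate it with Icarus Verilog. The DUT below is hard-coded as a stand-in
# for LLM-generated RTL; the tutorial's own scripts live in the repo above.
import pathlib
import subprocess

DUT = """
module shift_reg (input clk, input rst, input d, output reg [3:0] q);
  always @(posedge clk)
    if (rst) q <= 4'b0000;
    else     q <= {q[2:0], d};
endmodule
"""

TB = """
module tb;
  reg clk = 0, rst = 1, d = 0;
  wire [3:0] q;
  shift_reg dut(.clk(clk), .rst(rst), .d(d), .q(q));
  always #5 clk = ~clk;
  initial begin
    #12 rst = 0; d = 1;
    #40 $display("q = %b", q);
    $finish;
  end
endmodule
"""

def run() -> None:
    pathlib.Path("dut.v").write_text(DUT)
    pathlib.Path("tb.v").write_text(TB)
    subprocess.run(["iverilog", "-o", "sim.out", "tb.v", "dut.v"], check=True)
    subprocess.run(["vvp", "sim.out"], check=True)

if __name__ == "__main__":
    run()
```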
Date: Tuesday, 26 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Auditorium 2
Session chair:
Kaushik Roy, Purdue University, US
Organiser:
Amit Ranjan Trivedi, University of Illinois at Chicago, US
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | FS06.1 | HALO-FL: HARDWARE-AWARE LOW-PRECISION FEDERATED LEARNING Speaker: Priyadarshini Panda, Yale University, US Authors: Yeshwanth Venkatesha, Abhiroop Bhattacharjee, Abhishek Moitra and Priyadarshini Panda, Yale University, US Abstract Applications of federated learning involve devices with extremely limited computational resources and often with considerable heterogeneity in terms of energy efficiency, latency tolerance, and hardware area. Although low-precision training methods have demonstrated effectiveness in accommodating the device constraints in a centralized setting, their applicability in distributed learning scenarios featuring heterogeneous client capabilities has not been well explored. In this work, we design a hardware-aware low-precision federated training framework (HaLo-FL) tailored to heterogeneous resource-constrained devices. In particular, we optimize the precision for weights, activations, and errors for each client's hardware constraint using a precision selector (named HaLo-PS). To validate our approach, we propose HaLoSim, a hardware evaluation platform that enables precision reconfigurability and evaluates hardware metrics like energy, latency, and area utilization on a crossbar-based In-memory Computing (IMC) platform. |
16:53 CET | FS06.2 | COGNITIVE SENSING FOR ENERGY-EFFICIENT EDGE INTELLIGENCE Speaker: Saibal Mukhopadhyay, Georgia Tech, US Authors: Minah Lee, Sudarshan Sharma, Wei Chun Wang, Hemant Kumawat, Nael Rahman and Saibal Mukhopadhyay, Georgia Tech, US Abstract Edge platforms in autonomous systems integrate multiple sensors to interpret their environment. The high-resolution and high-bandwidth pixel arrays of these sensors improve sensing quality but also generate a vast, and arguably unnecessary, volume of real-time data. This challenge, often referred to as the analog data deluge, hinders the deployment of high-quality sensors in resource-constrained environments. This paper discusses the concept of cognitive sensing, which learns to extract low-dimensional features directly from high-dimensional analog signals, thereby reducing both digitization power and generated data volume. First, we discuss design methods for analog-to-feature extraction (AFE) using mixed-signal compute-in-memory. Next, we then present examples of cognitive sensing, incorporating signal processing or machine learning, for various sensing modalities including vision, Radar, and Infrared. Subsequently, we discuss the reliability challenges in cognitive sensing, taking into account hardware and algorithmic properties of AFE. The paper concludes with discussions on future research directions in this emerging field of cognitive sensors./proceedings-archive/2024/DATA/1312_pdf_upload.pdf |
17:15 CET | FS06.3 | UNEARTHING THE POTENTIAL OF SPIKING NEURAL NETWORKS Speaker: Kaushik Roy, Purdue University, US Authors: Sayeed Shafayet Chowdhury, Adarsh Kosta, Deepika Sharma, Marco Apolinario and Kaushik Roy, Purdue University, US Abstract Spiking neural networks (SNNs) are a promising avenue for machine learning with superior energy efficiency compared to traditional artificial neural networks (ANNs). Recent advances in training and input encoding have put SNNs on par with state-of-the-art ANNs in image classification. However, such tasks do not fully utilize the internal dynamics of SNNs. Notably, a spiking neuron's membrane potential acts as an internal memory, merging incoming inputs sequentially (a minimal leaky integrate-and-fire sketch follows this session table). This recurrent dynamic enables the networks to learn temporal correlations, making SNNs suitable for sequential learning. Such problems can also be tackled using ANNs. However, to capture the temporal dependencies, either the inputs have to be lumped over time (e.g. Transformers), or explicit recurrence needs to be introduced (e.g. recurrent neural networks (RNNs), long short-term memory (LSTM) networks), which incurs considerable complexity. To that end, we explore the capabilities of SNNs in providing lightweight solutions to four sequential tasks involving text, speech and vision. Our results demonstrate that SNNs, by leveraging their intrinsic memory, can be an efficient alternative to RNNs and LSTMs for sequence processing, especially for certain edge applications. Furthermore, SNNs can be combined with ANNs (hybrid networks) synergistically to obtain the best of both worlds in terms of accuracy and efficiency. |
17:38 CET | FS06.4 | NAVIGATING THE UNKNOWN: UNCERTAINTY-AWARE COMPUTE-IN-MEMORY AUTONOMY OF EDGE ROBOTICS Speaker: Amit Ranjan Trivedi, University of Illinois at Chicago, US Authors: Nastaran Darabi1, Priyesh Shukla2, Dinithi Jayasuriya1, Divake Kumar1, Alex Stutts1 and Amit Trivedi1 1University of Illinois at Chicago, US; 2Samsung, IN Abstract This paper addresses the challenging problem of energy-efficient and uncertainty-aware pose estimation in insect-scale drones, which is crucial for tasks such as surveillance in constricted spaces and for enabling non-intrusive spatial intelligence in smart homes. Since tiny drones operate in highly dynamic environments, where factors like lighting and human movement impact their predictive accuracy, it is crucial to deploy uncertainty-aware prediction algorithms that can account for environmental variations and express not only the prediction but also confidence in the prediction. We address both of these challenges with Compute-in-Memory (CIM) which has become a pivotal technology for deep learning acceleration at the edge. While traditional CIM techniques are promising for energy-efficient deep learning, to bring in the robustness of uncertainty-aware predictions at the edge, we introduce a suite of novel techniques: First, we discuss CIM-based acceleration of Bayesian filtering methods uniquely by leveraging the Gaussian-like switching current of CMOS inverters along with co-design of kernel functions to operate with extreme parallelism and with extreme energy efficiency. Secondly, we discuss the CIM-based acceleration of variational inference of deep learning models through probabilistic processing while unfolding iterative computations of the method with a compute reuse strategy to significantly minimize the workload. Overall, our co-design methodologies demonstrate the potential of CIM to improve the processing efficiency of uncertainty-aware algorithms by orders of magnitude, thereby enabling edge robotics to access the robustness of sophisticated prediction frameworks within their extremely stringent area/power resources. |
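The "membrane potential as internal memory" argument in FS06.3 above (referenced in that abstract) can be made concrete with a minimal leaky integrate-and-fire (LIF) update. The sketch below is a generic textbook formulation; the leak factor, threshold and input values are arbitrary illustrations, not parameters used by the authors.

```python
# Minimal leaky integrate-and-fire (LIF) neuron: the membrane potential u
# integrates weighted inputs over time and leaks with factor BETA, which is
# what lets an SNN carry temporal context. All parameters are illustrative.
BETA = 0.9        # leak factor per time step (hypothetical)
V_TH = 1.0        # firing threshold (hypothetical)

def lif_step(u: float, weighted_input: float):
    """One time step: leak, integrate, fire-and-reset. Returns (u, spike)."""
    u = BETA * u + weighted_input
    spike = u >= V_TH
    if spike:
        u -= V_TH                 # soft reset after emitting a spike
    return u, spike

if __name__ == "__main__":
    inputs = [0.3, 0.4, 0.5, 0.0, 0.6, 0.7]   # pre-weighted input current
    u = 0.0
    for t, x in enumerate(inputs):
        u, s = lif_step(u, x)
        print(f"t={t}  u={u:.2f}  spike={int(s)}")
```

Because the potential at each step depends on the whole input history, the neuron itself carries temporal state, which is the intrinsic memory the abstract refers to for sequential tasks without explicit recurrence.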
Date: Tuesday, 26 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Auditorium 3
Session chair:
Todd Austin, University of Michigan, US
Session co-chair:
Paola Busia, Universty of Cagliari, IT
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | TS04.1 | VITA: A HIGHLY EFFICIENT DATAFLOW AND ARCHITECTURE FOR VISION TRANSFORMERS Speaker: Chunyun Chen, Nanyang Technological University, SG Authors: Chunyun Chen, Lantian Li and Mohamed M. Sabry Aly, Nanyang Technological University, SG Abstract Transformer-based DNNs have dominated several AI fields with remarkable performance. However, the scaling of Transformer models to trillions of parameters and computation operations has made them both computationally and data-intensive. This poses a significant challenge to utilizing Transformer models, e.g., in area- and power-constrained systems. This paper introduces ViTA, an efficient architecture for accelerating the entire workload of Vision Transformer (ViT), targeting enhanced area and power efficiency. ViTA adopts a novel memory-centric dataflow that reduces memory usage and data movement, exploiting computational parallelism and locality. This design results in a 76.71% reduction in memory requirements for Multi-Head Self Attention (MHA) compared to original dataflows with VGA resolution images. A fused configurable module is designed for supporting non-linear functions in ViT workloads, such as GELU, Softmax, and LayerNorm, optimizing hardware resource usage. Our results show that ViTA achieves 16.384 TOPS with area and power efficiencies of 2.13 TOPS/mm^2 and 1.57 TOPS/W at 1 GHz, surpassing current Transformer accelerators by 27.85× and 1.40×, respectively./proceedings-archive/2024/DATA/24_pdf_upload.pdf |
16:35 CET | TS04.2 | SAVA: A SPATIAL- AND VALUE-AWARE ACCELERATOR FOR POINT CLOUD TRANSFORMER Speaker: Xueyuan Liu, Shanghai Jiao Tong University, CN Authors: Xueyuan Liu, Zhuoran Song, Xiang Liao, Xing Li, Tao Yang, Fangxin Liu and Xiaoyao Liang, Shanghai Jiao Tong University, CN Abstract Point Cloud Transformer is undergoing a rising trend in both industry and academia. It aligns traditional point cloud feature extraction methods with the latest transformer technology and achieves remarkable performance. However, accelerators for traditional point cloud neural networks (PCNNs) and those solely for transformers fail to capture the characteristics of point cloud transformers, thus exhibiting poor performance. To address this challenge, we propose Sava, a co-designed accelerator that adopts a spatial- and value-aware hybrid pruning strategy for point cloud transformers. In terms of the spatial domain, we observe that points in regions of various densities exhibit different levels of importance. In the value space, a minor input contributes less to features, indicating lower importance. Considering both perspectives, we hybridize the information inherited from the spatial and value spaces to prune less significant values in attention, which converts data to sparse patterns and makes it readily accelerated. Furthermore, we adopt low-bit quantization to boost computations and apply varying quantization precisions across different network layers based on their sensitivity. In support of our algorithm, we propose an architecture that employs a configurable mixed-precision systolic array for various computing loads under diverse precisions. To address the workload imbalance of the unstructured sparse computations, we introduce a data rearrangement mechanism, which improves resource utilization while hiding latency. We evaluate our Sava on four point cloud transformer models and achieve notable accuracy and performance gains. In comparison with CPU, GPUs, and ASICs, our Sava offers 10.3x, 3.6x, 3.3x, 2.6x, 2.2x speedup, along with 20x, 8.8x, 6.9x, 3.2x, 2.4x energy savings on average./proceedings-archive/2024/DATA/368_pdf_upload.pdf |
16:40 CET | TS04.3 | EFFICIENT DESIGN OF A HYPERDIMENSIONAL PROCESSING UNIT FOR MULTI-LAYER COGNITION Speaker: Mohamed Ibrahim, Georgia Tech, US Authors: Mohamed Ibrahim1, Youbin Kim2 and Jan Rabaey3 1Georgia Tech, US; 2University of California, Berkeley, US; 3University of California at Berkeley, US Abstract The methodology used to design and optimize the very first general-purpose hyperdimensional (HD) processing unit capable of executing a broad spectrum of HD workloads (called "HPU") is presented. HD computing is a brain-inspired computational paradigm that uses the principles of high-dimensional mathematics to perform cognitive tasks. While considerable efforts have been spent toward realizing efficient HD processors, all of these targeted specific application domains, most often pattern classification. In contrast, the HPU design addresses the multiple layers of a cognitive process. A structured methodology identifies the kernel HD computations recurring at each of these layers, and maps them onto a unified and parameterized architectural model. The effectiveness in terms of runtime and energy consumption of the approach is evaluated. The results show that the resulting HPU efficiently processes the full range of HD algorithms, and far outperforms baseline implementations on a GPU./proceedings-archive/2024/DATA/27_pdf_upload.pdf |
16:45 CET | TS04.4 | ECOFLEX-HDP: HIGH-SPEED AND LOW-POWER AND PROGRAMMABLE HYPERDIMENSIONAL-COMPUTING PLATFORM WITH CPU CO-PROCESSING Speaker: Michihiro Shintani, Grad. School of Science and Technology Kyoto Institute of Technology, JP Authors: Yuya Isaka1, Nau Sakaguchi2, Michiko Inoue1 and Michihiro Shintani3 1NAIST, JP; 2San Jose State University, JP; 3Kyoto Institute of Technology, JP Abstract Hyperdimensional computing (HDC) can perform various cognitive tasks efficiently by mapping data to hyperdimensional vectors with thousands to tens of thousands of dimensions. However, the primary operations of HDC (Bind, Permutation, and Bound) need to be executed more efficiently than a standard CPU platform allows. This study introduces a novel computational platform, EcoFlex-HDP, specifically designed for HDC. EcoFlex-HDP exploits the parallelism and high memory access efficiency of HDC operations to achieve low computation time and energy consumption, outperforming the CPU. Furthermore, it can work cooperatively with a CPU, enabling integration with existing software, providing flexibility to apply new algorithms, and contributing to the development of an HDC ecosystem. Through experimental evaluations with a Cortex-A9 processor, HDC operations were shown to be accelerated by a maximum of 169 times. Furthermore, EcoFlex-HDP was confirmed to improve the energy–delay product by up to 13,469 times when training an image recognition task. All source code for our platform and experiments is available at https://github.com/yuya-isaka/EcoFlex-HDP./proceedings-archive/2024/DATA/284_pdf_upload.pdf |
16:50 CET | TS04.5 | ACCELERATING CHAINING IN GENOMIC ANALYSIS USING RISC-V CUSTOM INSTRUCTIONS Speaker: Kisaru Liyanage, UNSW Sydney, AU Authors: Kisaru Liyanage, Hasindu Gamaarachchi, Hassaan Saadat, Tuo Li, Hiruna Samarakoon and Sri Parameswaran, University of New South Wales, AU Abstract This paper presents a method for designing custom instructions tailored to RISC-V processors, focusing on optimizing the chaining step of Minimap2 (a software tool used to analyze DNA data emanating from third-generation sequencing machines). This custom instruction design involves employing an architectural template within the Rocket Custom Coprocessor (RoCC) unit of Rocket Chip, an open-source hardware implementation of RISC-V ISA, aided by a heuristic algorithm that facilitates extracting custom instructions from high-level C code targeting the proposed architectural template. Two types of instructions are created in this work: complex computational instructions; and instructions that load static data apriori so that these data are not repeatedly brought in from the memory. The resulting custom instructions integrated into Rocket Chip demonstrate a speedup of up to 2.4x in the chaining step of Minimap2 with no adverse impact on the final mapping accuracy compared to the original software. The acceleration of Minimap2's chaining stage on a RISC-V processor enhances its portability and energy efficiency, making third-generation DNA sequence analysis more accessible in various settings./proceedings-archive/2024/DATA/742_pdf_upload.pdf |
16:55 CET | TS04.6 | ONE SA: ENABLING NONLINEAR OPERATIONS IN SYSTOLIC ARRAYS FOR EFFICIENT AND FLEXIBLE NEURAL NETWORK INFERENCE Speaker: Ruiqi Sun, Shanghai Jiao Tong University, CN Authors: Ruiqi Sun1, Yinchen Ni1, Jie Zhao2, Xin He3 and An Zou1 1Shanghai Jiao Tong University, CN; 2Microsoft, US; 3University of Michigan, US Abstract Deep neural networks (DNNs) have achieved significant success in a wide range of fields, such as computer vision and natural language processing. However, their computation- and memory-intensive nature limits their use in many mobile and embedded contexts. Application-specific integrated circuit (ASIC) hardware accelerators employ matrix multiplication units (such as systolic arrays) and dedicated nonlinear function units to speed up DNN computations. A close examination of these ASIC accelerators reveals that the designs are often specialized and lack versatility across different networks, especially when the networks have different types of computation. In this paper, we introduce a novel systolic array architecture, which is capable of executing nonlinear functions. By encompassing both inherent linear and newly enabled nonlinear functions within the systolic arrays, the proposed architecture facilitates versatile network inferences, substantially enhancing computational power and energy efficiency. Experimental results show that employing this systolic array enables seamless execution of entire DNNs, incurring only a negligible loss in the network inference accuracy. Furthermore, assessment and evaluation with FPGAs reveal that integrating nonlinear computation capacity into a systolic array does not introduce notable (less than 1.5%) extra block RAMs (BRAMs), look-up tables (LUTs), or digital signal processors (DSPs), but a mere 13.3% - 24.1% more flip-flops (FFs). In comparison to existing methodologies, executing the networks with the proposed systolic array, which enables the flexibility of different network models, yields up to 25.73×, 5.21×, and 1.54× higher computational efficiency when compared to general-purpose CPUs, GPUs, and SoCs, respectively, while achieving comparable (83.4% - 135.8%) performance to conventional accelerators which are designed for specific neural network models./proceedings-archive/2024/DATA/387_pdf_upload.pdf |
17:00 CET | TS04.7 | DEEPFRACK: A COMPREHENSIVE FRAMEWORK FOR LAYER FUSION, FACE TILING, AND EFFICIENT MAPPING IN DNN HARDWARE ACCELERATORS Speaker: Tom Glint, IIT Gandhinagar, IN Authors: Tom Glint, Mithil Pechimuthu and Joycee Mekie, IIT Gandhinagar, IN Abstract DeepFrack is a novel framework developed for enhancing energy efficiency and reducing latency in deep learning workloads executed on hardware accelerators. By optimally fusing layers and implementing an asymmetric tiling strategy, DeepFrack addresses the limitations of traditional layer-by-layer scheduling. The computational efficiency of our method is underscored by significant performance improvements seen across various deep neural network architectures such as AlexNet, VGG, and ResNets when run on Eyeriss and Simba accelerators. The reductions in latency (30% to 40%) and energy consumption (30% to 50%) are further enhanced by the efficient usage of the on-chip buffer and the reduction of the external memory bandwidth bottleneck. This work contributes to the ongoing efforts in designing more efficient hardware accelerators for machine learning workloads./proceedings-archive/2024/DATA/469_pdf_upload.pdf |
17:05 CET | TS04.8 | S-LGCN: SOFTWARE-HARDWARE CO-DESIGN FOR ACCELERATING LIGHTGCN Speaker: Shun Li, Fudan University, CN Authors: Shun Li1, Ruiqi Chen1, Enhao Tang1, Jing Yang2, Yajing Liu2 and Kun Wang1 1Fudan University, CN; 2Fuzhou University, CN Abstract Graph Convolutional Networks (GCNs) have garnered significant attention in recent years, finding applications across various domains, including recommendation systems, knowledge graphs, and biological prediction. One prominent GCN-based recommendation model, LightGCN, optimizes embeddings for final prediction through graph convolution operations, and has achieved outstanding performance in commodity recommendation and molecular property prediction. However, LightGCN suffers from suboptimal layer combination parameters and limited nonlinear modeling capabilities on the software side. On the hardware side, due to the irregularity of the aggregation phase of LightGCN, CPU and GPU executions are not efficient, and designing an accelerator will be constrained by transmission bandwidth and the efficiency of the sparse matrix multiplication kernel. In this paper, we optimize the layer combination parameters of LightGCN by Q-learning and add a hardware-friendly activation function to enhance its nonlinear modeling capability. The optimized LightGCN not only performs well on the original dataset and some molecular prediction tasks, but also does not incur significant hardware overhead. Subsequently, we propose an efficient architecture to accelerate the inference of LightGCN to improve its adaptability to real-time tasks. Comparing S-LGCN to an Intel(R) Xeon(R) Gold 5218R CPU and an NVIDIA RTX3090 GPU, we observe that S-LGCN is 1576.4× and 21.8× faster with energy consumption reductions of 3211.6× and 71.6×, respectively. When compared to an FPGA-based accelerator, S-LGCN demonstrates 1.5-4.5× lower latency and 2.03× higher throughput./proceedings-archive/2024/DATA/515_pdf_upload.pdf |
17:10 CET | TS04.9 | FLINT: EXPLOITING FLOATING POINT ENABLED INTEGER ARITHMETIC FOR EFFICIENT RANDOM FOREST INFERENCE Speaker: Christian Hakert, TU Dortmund University, DE Authors: Christian Hakert1, Kuan-Hsun Chen2 and Jian-Jia Chen1 1TU Dortmund, DE; 2University of Twente, NL Abstract In many machine learning applications, e.g., tree-based ensembles, floating point numbers are extensively utilized due to their expressiveness. Even if floating point hardware is present in general computing systems, using integer operations instead of floating point operations promises to reduce operation overheads and improve the performance. In this paper, we provide FLInt, a full precision floating point comparison for random forests, by only using integer and logic operations. The usage of FLInt basically boils down to a one-by-one replacement of conditions: For instance, a comparison statement in C: if(pX[3]<=(float)10.074347) becomes if((*(((int*)(pX))+3))<=((int)(0x41213087))). Experimental evaluation on X86 and ARMv8 desktop and server class systems shows that the execution time can be reduced by up to ≈ 30% with our novel approach (an illustrative sketch of this bit-reinterpretation trick appears after this session table)./proceedings-archive/2024/DATA/13_pdf_upload.pdf |
17:11 CET | TS04.10 | I²SR: IMMEDIATE INTERRUPT SERVICE ROUTINE ON RISC-V MCU TO CONTROL MMWAVE RF TRANSCEIVERS Speaker: Jimin Lee, Samsung Electronics, KR Authors: Jimin Lee, Sangwoo Park, Junho Huh, Sanghyo Jeong, Inhwan Kim and Jae Min Kim, Samsung Electronics, KR Abstract In 5G technology, millimeter wave (mmWave), known as Frequency Range 2 (FR2), is one of the candidates for next-generation mobile communications. However, the adoption of mmWave technology in transceiver systems faces challenges in terms of strict latency requirements, due to the significantly increased number of control and compensation tasks. In this paper, we propose a novel interrupt architecture, Immediate Interrupt Service Routine (I²SR) on a RISC-V MCU, enabling zero-latency context switching. Applied to the mmWave transceiver, our I²SR MCU efficiently handles the mmWave tasks by removing context switching overhead. In our experiment, the I²SR architecture reduces the overall MCU utilization by 30.6% compared to the baseline MCU, thereby demonstrating that the I²SR architecture satisfies the strict latency requirements essential for mmWave applications./proceedings-archive/2024/DATA/499_pdf_upload.pdf |
17:12 CET | TS04.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
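Two of the entries above (TS04.3 and TS04.4) build on hyperdimensional computing, whose core operations the EcoFlex-HDP abstract names as Bind, Permutation, and Bound (bundling). The following minimal C++ sketch is a generic orientation only, not the implementation of either paper: it shows these operations on dense binary hypervectors together with a Hamming-based similarity measure, with the dimensionality and random seed chosen arbitrarily for illustration.

```cpp
#include <bitset>
#include <cstddef>
#include <iostream>
#include <random>

constexpr std::size_t D = 10000;   // real HDC systems use thousands of bits per vector
using HV = std::bitset<D>;

// Draw a random dense binary hypervector.
HV random_hv(std::mt19937& rng) {
    HV v;
    std::bernoulli_distribution bit(0.5);
    for (std::size_t i = 0; i < D; ++i) v[i] = bit(rng);
    return v;
}

// Bind: element-wise XOR; the result is dissimilar to both operands.
HV bind(const HV& a, const HV& b) { return a ^ b; }

// Permute: cyclic rotation by k positions; encodes sequence/order information.
HV permute(const HV& a, std::size_t k) { return (a << k) | (a >> (D - k)); }

// Bundle: bit-wise majority vote of three vectors; the result stays similar to each operand.
HV bundle(const HV& a, const HV& b, const HV& c) { return (a & b) | (a & c) | (b & c); }

// Similarity: normalized Hamming agreement in [0, 1].
double sim(const HV& a, const HV& b) { return 1.0 - double((a ^ b).count()) / D; }

int main() {
    std::mt19937 rng(42);
    HV a = random_hv(rng), b = random_hv(rng), c = random_hv(rng);
    std::cout << "sim(a, bind(a,b))     ~0.50: " << sim(a, bind(a, b)) << "\n";
    std::cout << "sim(a, bundle(a,b,c)) ~0.75: " << sim(a, bundle(a, b, c)) << "\n";
    std::cout << "sim(a, permute(a,1))  ~0.50: " << sim(a, permute(a, 1)) << "\n";
}
```

Bundling keeps the result similar to each operand (expected agreement around 0.75 for three random vectors), while binding and permutation produce quasi-orthogonal results (agreement around 0.5); these distance relations are what HDC workloads rely on.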
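The FLInt entry (TS04.9) replaces floating-point threshold comparisons in decision trees with integer comparisons on the raw IEEE-754 bit patterns; 0x41213087 is the bit pattern of 10.074347f quoted in the abstract. The sketch below only illustrates the underlying idea for non-negative values, using a hypothetical feature value of 9.5f; it is not the FLInt implementation, which provides a full-precision comparison for the general case.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>

// Reinterpret the bits of a float as a signed 32-bit integer.
// memcpy-based type punning avoids strict-aliasing problems.
int32_t float_bits(float f) {
    int32_t i;
    std::memcpy(&i, &f, sizeof i);
    return i;
}

int main() {
    const float threshold = 10.074347f;   // 0x41213087, as quoted in the abstract
    const float x = 9.5f;                 // hypothetical feature value

    // For non-negative IEEE-754 floats, the integer ordering of the bit patterns
    // matches the floating-point ordering, so a decision-tree threshold test can
    // be answered with integer hardware only.
    bool fp_result  = (x <= threshold);
    bool int_result = (float_bits(x) <= int32_t{0x41213087});

    std::cout << std::boolalpha
              << "float compare: " << fp_result  << "\n"    // true
              << "int   compare: " << int_result << "\n";   // true
}
```

Negative floats need extra care because their bit patterns, read as signed integers, sort in reverse order, which is one reason the complete method is more involved than this sketch.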
Date: Tuesday, 26 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Multi-Purpose Room M1B+D
Session chair:
Giorgio Di Natale, TIMA-CNRS, FR
Session co-chair:
Grace Li Zhang, TU Darmstadt, DE
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | TS23.1 | UNCERTAINTY-AWARE HARDWARE TROJAN DETECTION USING MULTIMODAL DEEP LEARNING Speaker: Amin Rezaei, California State University, Long Beach, US Authors: Rahul Vishwakarma and Amin Rezaei, California State University, Long Beach, US Abstract The risk of hardware Trojans being inserted at various stages of chip production has increased in a zero-trust fabless era. To counter this, various machine learning solutions have been developed for the detection of hardware Trojans. While most of the focus has been on either a statistical or deep learning approach, the limited number of Trojan-infected benchmarks affects the detection accuracy and restricts the possibility of detecting zero-day Trojans. To close the gap, we first employ generative adversarial networks to amplify our data in two alternative representation modalities, graph and tabular, which ensure a representative distribution of the dataset. Further, we propose a multimodal deep learning approach to detect hardware Trojans and evaluate the results from both early fusion and late fusion strategies. We also estimate the uncertainty quantification metrics of each prediction for risk-aware decision-making. The results not only validate the effectiveness of our suggested hardware Trojan detection technique but also pave the way for future studies utilizing multimodality and uncertainty quantification to tackle other hardware security problems./proceedings-archive/2024/DATA/1068_pdf_upload.pdf |
16:35 CET | TS23.2 | PROGRAMMABLE EM SENSOR ARRAY FOR GOLDEN-MODEL FREE RUN-TIME TROJAN DETECTION AND LOCALIZATION Speaker: Shuo Wang, University of Florida, US Authors: Hanqiu Wang, Max Panoff, Zihao Zhan, Shuo Wang, Christophe Bobda and Domenic Forte, University of Florida, US Abstract Side-channel analysis has been proven effective at detecting hardware Trojans in integrated circuits (ICs). However, most detection techniques rely on large external probes and antennas for data collection and require a long measurement time to detect Trojans. Such limitations make these techniques impractical for run-time deployment and ineffective in detecting small Trojans with subtle side-channel signatures. To overcome these challenges, we propose a Programmable Sensor Array (PSA) for run-time hardware Trojan detection, localization, and identification. PSA is a tampering-resilient integrated on-chip magnetic field sensor array that can be re-programmed to change the sensors' shape, size, and location. Using PSA, EM side-channel measurement results collected from sensors at different locations on an IC can be analyzed to localize and identify the Trojan. The PSA has better performance than conventional external magnetic probes and state-of-the-art on-chip single-coil magnetic field sensors. We fabricated an AES-128 test chip with four AES Hardware Trojans. They were successfully detected, located, and identified with the proposed on-chip PSA within 10 milliseconds using our proposed cross-domain analysis./proceedings-archive/2024/DATA/72_pdf_upload.pdf |
16:40 CET | TS23.3 | KRATT: QBF-ASSISTED REMOVAL AND STRUCTURAL ANALYSIS ATTACK AGAINST LOGIC LOCKING Speaker: Levent Aksoy, Tallinn University of Technology, EE Authors: Levent Aksoy1, Muhammad Yasin2 and Samuel Pagliarini1 1Tallinn University of Technology, EE; 2National University of Sciences & Technology, PK Abstract This paper introduces KRATT, a removal and structural analysis attack against state-of-the-art logic locking techniques, such as single and double flip locking techniques (SFLTs and DFLTs). KRATT utilizes powerful quantified Boolean formulas (QBFs), which have not found widespread use in hardware security, to find the secret key of SFLTs for the first time. It can handle locked circuits under both oracle-less (OL) and oracle-guided (OG) threat models. It modifies the locked circuit and uses a prominent OL attack to make a strong guess under the OL threat model. It uses a structural analysis technique to identify promising protected input patterns and explores them using the oracle under the OG model. Experimental results on ISCAS'85, ITC'99, and HeLLO: CTF'22 benchmarks show that KRATT can break SFLTs using a QBF formulation in less than a minute, can decipher a large number of key inputs of SFLTs and DFLTs with high accuracy under the OL threat model, and can easily find the secret key of DFLTs under the OG threat model. It is shown that KRATT outperforms publicly available OL and OG attacks in terms of solution quality and run-time./proceedings-archive/2024/DATA/764_pdf_upload.pdf |
16:45 CET | TS23.4 | TROSCAN: ENHANCING ON-CHIP DELIVERY RESILIENCE TO PHYSICAL ATTACK THROUGH FREQUENCY-TRIGGERED KEY GENERATION Speaker: Jianfeng Wang, Tsinghua University, CN Authors: Jianfeng Wang1, Shuwen Deng1, Huazhong Yang1, Vijaykrishnan Narayanan2 and Xueqing Li1 1Tsinghua University, CN; 2Pennsylvania State University, US Abstract Keys grant access to devices and are the core secrets in logic obfuscation. Typically, keys are stored in tamper-proof memory and are subsequently delivered to logic locking modules through scan chains. However, recent physical attacks have successfully extracted keys directly from registers, challenging the security of the prior scan obfuscation/blocking efforts. This paper mitigates the threat of direct value extraction by proposing TroScan, an architecture that leverages the internal frequency of register chains to activate trigger circuits. We propose three key generation methods for typical defense scenarios and gate-aware obfuscation optimization. To the authors' best knowledge, this work presents the first on-chip key delivery obfuscation architecture against Electro-Optical Frequency Mapping (EOFM) attacks. Evaluation shows ~100% key obfuscation effectiveness under two EOFM attack targets. For overheads, we demonstrate the worst-case fault coverage rate of 97.6%, average area/power overheads of 7.5%/11.8%, and an average key generation success rate of 98% across 80 process voltage temperature (PVT) conditions./proceedings-archive/2024/DATA/532_pdf_upload.pdf |
16:50 CET | TS23.5 | A GOLDEN-FREE FORMAL METHOD FOR TROJAN DETECTION IN NON-INTERFERING ACCELERATORS Speaker: Anna Lena Duque Antón, University of Kaiserslautern-Landau, DE Authors: Anna Lena Duque Antón1, Johannes Müller2, Lucas Deutschmann2, Mohammad Rahmani Fadiheh3, Dominik Stoffel2 and Wolfgang Kunz2 1RPTU Kaiserslautern-Landau, DE; 2University of Kaiserslautern-Landau, DE; 3Stanford University, US Abstract The threat of hardware Trojans (HTs) in security-critical IPs like cryptographic accelerators poses severe security risks. The HT detection methods available today mostly rely on golden models and detailed circuit specifications. Often they are specific to certain HT payload types, making pre-silicon verification difficult and leading to security gaps. We propose a novel formal verification method for HT detection in non-interfering accelerators at the Register Transfer Level (RTL), employing standard formal property checking. Our method guarantees the exhaustive detection of any sequential HT independently of its payload behavior, including physical side channels. It does not require a golden model or a functional specification of the design. The experimental results demonstrate efficient and effective detection of all sequential HTs in accelerators available on Trust-Hub, including those with complex triggers and payloads./proceedings-archive/2024/DATA/329_pdf_upload.pdf |
16:55 CET | TS23.6 | TITANCFI: TOWARD ENFORCING CONTROL-FLOW INTEGRITY IN THE ROOT-OF-TRUST Speaker: Emanuele Parisi, Università di Bologna, IT Authors: Emanuele Parisi, Alberto Musa, Simone Manoni, Maicol Ciani, Davide Rossi, Francesco Barchi, Andrea Bartolini and Andrea Acquaviva, Università di Bologna, IT Abstract Modern RISC-V platforms control and monitor security-critical systems such as industrial controllers and autonomous vehicles. While these platforms feature a Root-of-Trust (RoT) to store authentication secrets and enable secure boot technologies, they often lack Control-Flow Integrity (CFI) enforcement and are vulnerable to cyber-attacks which divert the control flow of an application to trigger malicious behaviours. Recent techniques to enforce CFI in RISC-V systems include ISA modifications or custom hardware IPs, all requiring ad-hoc binary toolchains or design of CFI primitives in hardware. This paper proposes TitanCFI, a novel approach to enforce CFI in the RoT. TitanCFI modifies the commit stage of the protected core to stream control flow instructions to the RoT and it integrates the CFI enforcement policy in the RoT firmware. Our approach enables maximum reuse of the hardware resource present in the SoC, and it avoids the design of custom IPs and the modification of the compilation toolchain, while exploiting the RoT tamper-proof storage and cryptographic accelerators to secure CFI metadata. We implemented the proposed architecture on a modern RISC-V SoC along with a return address protection policy in the RoT, and benchmarked area and runtime overhead. Experimental results show that our approach achieves overhead comparable to SoA hardware CFI solutions for most benchmarks, with lower area overhead, resulting in 1% of additional SoC area occupation./proceedings-archive/2024/DATA/1226_pdf_upload.pdf |
17:00 CET | TS23.7 | RL-TPG: AUTOMATED PRE-SILICON SECURITY VERIFICATION THROUGH REINFORCEMENT LEARNING-BASED TEST PATTERN GENERATION Speaker: Mark Tehranipoor, University of Florida, US Authors: Nurun Mondol, Arash Vafaei, Kimia Zamiri Azar, Farimah Farahmandi and Mark Tehranipoor, University of Florida, US Abstract Verifying the security of System-on-Chip (SoC) designs against hardware vulnerabilities is challenging because of the increasing complexity of SoCs, the diverse sources of vulnerabilities, and the need for comprehensive testing to identify potential security threats. In this paper, we propose RL-TPG, a novel framework that combines traditional verification with hardware security verification using Reinforcement Learning (RL) in Register Transfer Level (RTL) design. Significant research has been done on formal verification, semi-formal verification, automated security asset identification, and gate-level netlist. However, the area of automated simulation using machine learning at the RTL level is still unexplored. RL-TPG employs an RL agent that generates intelligent test patterns targeting security properties, verification coverage, and rare nodes of the design to achieve security property violation, increase verification coverage, and reach rare nodes. Our framework triggers all embedded vulnerabilities, achieving an average of 90% traditional coverage in an average of 192 seconds for the experimental benchmarks. To demonstrate the effectiveness of our approach, the results are compared with JasperGold by Cadence./proceedings-archive/2024/DATA/1048_pdf_upload.pdf |
17:05 CET | TS23.8 | A CONCEALABLE RRAM PHYSICAL UNCLONABLE FUNCTION COMPATIBLE WITH IN-MEMORY COMPUTING Speaker: Jiang Li, Nanjing University of Aeronautics and Astronautics, CN Authors: Jiang Li1, Yijun Cui1, Chenghua Wang1, Weiqiang Liu1 and Shahar Kvatinsky2 1Nanjing University of Aeronautics and Astronautics, CN; 2Technion – Israel Institute of Technology, IL Abstract Resistive random access memory (RRAM) has been widely used in physical unclonable function (PUF) design due to its low power consumption, fast read/write speed, and significant intrinsic randomness. However, existing RRAM PUFs cannot overcome the cycle-to-cycle (C2C) variations of RRAM, leading to poor reproducibility of PUF keys across cycles. Most prior designs directly store PUF keys in RRAMs, increasing vulnerability to attacks. In this paper, we propose a concealable RRAM PUF based on an RRAM crossbar array, utilizing the differential resistive switching characteristics of two RRAMs to generate keys. By enabling the reproducibility of PUF keys across cycles, a concealment scheme is proposed to prevent the exposure of PUF keys, thus enhancing the security of the RRAM PUF. Through post-processing operations, the proposed PUF exhibits high reliability over ±10% VDD and a wide temperature range from 248K to 373K. Furthermore, this RRAM PUF is compatible with in-memory computing (IMC), and they can be implemented using the same RRAM crossbar array./proceedings-archive/2024/DATA/334_pdf_upload.pdf |
17:10 CET | TS23.9 | AUTOMATED HARDWARE SECURITY COUNTERMEASURE INTEGRATION INSIDE HIGH-LEVEL SYNTHESIS Speaker: Athanasios Papadimitriou, University of the Peloponnese, GR Authors: Amalia Koufopoulou1, Athanasios Papadimitriou2, Mihalis Psarakis1 and David Hély3 1University of Piraeus, GR; 2University of the Peloponnese, GR; 3University Grenoble Alpes, Grenoble INP, LCIS, FR Abstract High-level Synthesis (HLS) methodology has revolutionized the development of complex hardware designs. It enables the rapid conversion of algorithmic descriptions of functionalities to highly optimized hardware equivalents. While modern HLS tools excel in addressing classic design constraints, such as area, latency and power requirements, they fall short regarding security considerations. Security's role is significantly emphasized in today's digital environment, given the existence of powerful hardware attacks, such as Fault Injection (FI) and Side-Channel Analysis (SCA) attacks. HLS methodology can theoretically facilitate the integration of security measures from the high level, yet its core mechanisms do not actively address the preservation or the improvement of security levels of any countermeasure described. Instead, it may sacrifice security enhancement entirely in circuits with high optimization goals. In this work, first, we propose the automatic countermeasure insertion in a way that both HLS optimization efforts and the secure addition of the countermeasure are implemented effectively. Secondly, we modify the internal mechanisms of the HLS scheduling algorithm and operation chaining to reduce vulnerable points of the design. We demonstrate our methodology by performing fault injection experiments and comparing the results with a straightforward countermeasure integration technique in terms of hardware security and traditional design metrics./proceedings-archive/2024/DATA/1005_pdf_upload.pdf |
17:11 CET | TS23.10 | SCANCAMOUFLAGE: OBFUSCATING SCAN CHAINS WITH CAMOUFLAGED SEQUENTIAL AND LOGIC GATES Speaker: Tarik Ibrahimpasic, Chair of Electronic Design Automation, TU Munich, DE Authors: Tarik Ibrahimpasic1, Grace Li Zhang2, Michaela Brunner1, Georg Sigl3, Bing Li1 and Ulf Schlichtmann1 1TU Munich, DE; 2TU Darmstadt, DE; 3TU Munich/Fraunhofer AISEC, DE Abstract Scan chain is a commonly used technique in testing integrated circuits as it provides observability and controllability of the internal states of circuits. However, its presence can make circuits vulnerable to attacks and potentially result in confidential internal data leakage. In this paper, we propose a novel technique for obfuscating scan chains using camouflaged flip-flops, which are designed with the same layout as the original flip-flops but have the actual functionality of a buffer. Furthermore, we employ camouflaged logic gates interconnected in special configurations to increase the difficulty of SAT attack. Experimental results demonstrate that circuits with only a small number of flip-flops can already be protected by the proposed technique while incurring only a minimal area overhead./proceedings-archive/2024/DATA/351_pdf_upload.pdf |
17:12 CET | TS23.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
Date: Tuesday, 26 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
Leonidas Kosmidis, Barcelona Supercomputing Center, ES
Session co-chair:
Prabhat Mishra, University of Florida, US
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | TS25.1 | FAULT-TOLERANT CYCLIC QUEUING AND FORWARDING IN TIME-SENSITIVE NETWORKING Speaker: Liwei Zhang, Lanzhou University, CN Authors: Liwei Zhang1, Tong Zhang2, Wenxue Wu1, Xiaoqin Feng1, Guoxi Lin1 and Fengyuan Ren3 1Lanzhou University, CN; 2Nanjing University of Aeronautics and Astronautics, CN; 3Lanzhou University/Tsinghua University, CN Abstract Time-sensitive networking (TSN) provides deterministic time-sensitive transmission services for critical data at the link layer. Cyclic Queuing and Forwarding (CQF) defined by IEEE 802.1Qch is used for critical data transmission. However, unexpected data errors may occur due to transient faults like electromagnetic interference. At present, the solution to such failures defined in the IEEE TSN standards is to transmit multiple data copies on redundant paths, which introduces network resource wastage. Compared to redundant transmission, retransmission can reduce resource waste, but may violate the deterministic transmission guarantee in TSN. To tackle this issue, we propose a time-redundant fault-tolerant mechanism for CQF, called fault-tolerant CQF (FT-CQF). On the basis of standard CQF, FT-CQF occupies an additional queue to cache copies of Time-Triggered (TT) flows and reserves time slots to forward them. According to the returned CRC-related messages, FT-CQF will decide whether to forward these copies. Non-TT flows can also be transmitted during this time when copies are not required to be forwarded. We implement FT-CQF in OMNeT++, and verify the performance of FT-CQF in typical network scenarios. The extensive simulation experiments show that FT-CQF is effective in terms of fault-tolerant effects, consumed resources, delay, and jitter./proceedings-archive/2024/DATA/693_pdf_upload.pdf |
16:35 CET | TS25.2 | HARDWARE-ASSISTED CONTROL-FLOW INTEGRITY ENHANCEMENT FOR IOT DEVICES Speaker: Weiyi Wang, Zhejiang University, CN Authors: Weiyi Wang1, Lang Feng2, Zhiguo Shi1, Cheng Zhuo1 and Jiming Chen1 1Zhejiang University, CN; 2National Sun Yat-Sen University, TW Abstract Internet of Things (IoT) devices face an escalating threat from code reuse attacks (CRAs) as they can reuse existing code for malicious purposes. Thus, a practical, cost-effective Control-Flow Integrity (CFI) mechanism for IoT devices is urgently needed. However, existing CFI solutions suffer from impracticalities, including high performance overhead and a heavy reliance on offline perfect Control-Flow Graph (CFG) generation. To tackle these challenges, we propose a fine-grained dependable CFI scheme for IoT devices that updates the device CFG in real time. We evaluate the implementation on RISC-V architectures and the results show that our CFI scheme provides both backward- and forward-edge protection with almost no performance overhead in the case of a fixed CFG, negligible power overhead, and low hardware overhead. Compared to current hardware-assisted CFI designs, our design eliminates the dependence on offline perfect CFG generation and performs real-time CFG updating for better practicality./proceedings-archive/2024/DATA/477_pdf_upload.pdf |
16:40 CET | TS25.3 | PP-HDC: A PRIVACY-PRESERVING INFERENCE FRAMEWORK FOR HYPERDIMENSIONAL COMPUTING Speaker: Xun Jiao, Villanova University, US Authors: Ruixuan Wang, Wengying Wen, Kyle Juretus and Xun Jiao, Villanova University, US Abstract Recently, brain-inspired hyperdimensional computing (HDC), an emerging neuro-symbolic computing scheme that imitates human brain functions to process information using abstract and high-dimensional patterns, has seen increasing applications in multiple application domains and deployment in edge-cloud collaborative processing. However, sending sensitive data to the cloud for inference may face severe privacy threats. Unfortunately, HDC is particularly vulnerable to privacy threats due to its reversible nature. To address this challenge, we propose PP-HDC, a novel privacy-preserving inference framework for HDC. PP-HDC is designed to protect the privacy of both inference input and output. To preserve the privacy of inference input, we propose a novel hash-encoding approach in high-dimensional space by implementing a sliding-window-based transformation on the input hypervector (HV). By leveraging the unique mathematical properties of HDC, we are able to seamlessly perform training and inference on the hash-encoded HV with negligible overhead. For inference output privacy, we propose a multi-model inference approach to encrypt the inference results by leveraging the unique structure of HDC item memories and ensuring the inference result is only accessible to the owner with a proper key. We evaluate PP-HDC on three datasets and demonstrate that PP-HDC enhances privacy-preserving effects compared with state-of-the-art works while incurring minimal accuracy loss./proceedings-archive/2024/DATA/632_pdf_upload.pdf |
16:45 CET | TS25.4 | SHARED CACHE ANALYSIS UNDER PREEMPTIVE SCHEDULING Speaker: Thilo Fischer, TU Hamburg, DE Authors: Thilo Fischer and Heiko Falk, TU Hamburg, DE Abstract When sharing a cache between multiple cores, the inter-core interference has to be considered in the worst-case execution time (WCET) analysis. Current interference models are overly pessimistic or not applicable to preemptively scheduled systems. We propose a novel technique to model interference in a preemptive system to classify accesses as cache hits or potential misses. We account for inter-core interference by considering the potential execution scenarios on the interfering core and find the worst-case interference pattern. The resulting access classifications are then used to compute the cache-related preemption delay. Our evaluation shows that the proposed analysis significantly increases the cache hit classifications, reduces WCET on average by up to 11.7%, and reduces worst-case response times on average by up to 15.4% compared to the existing classification technique./proceedings-archive/2024/DATA/1191_pdf_upload.pdf |
16:50 CET | TS25.5 | SHARED DATA KILLS REAL-TIME CACHE ANALYSIS. HOW TO RESURRECT IT? Speaker: Mohamed Hassan, McMaster University, CA Authors: Safin Bayes, Mohamed Hossam and Mohamed Hassan, McMaster University, CA Abstract While data sharing is becoming a necessity in modern multi-core real-time systems, it complicates system analyzability and leads to significantly pessimistic latency bounds. This work is a step towards facilitating high-performance and coherent data sharing in real-time systems by tackling two main problems. The first is a well-acknowledged one: shared caches render cache analysis techniques useless, and all cache accesses have to be assumed to be misses. The second is a new one that is an artifact of cache coherence, which we unveil in this work by showing that coherence interference voids classical cache analysis techniques. We contribute a solution that tackles both problems by leveraging time-based cache coherence and a novel methodology to integrate its effect into cache analysis. Thanks to this solution, we enable the usage of a shared memory hierarchy with coherent shared data, while we prove that we are able to restore cache analysis; and hence, provide much tighter memory latency bounds./proceedings-archive/2024/DATA/1238_pdf_upload.pdf |
16:55 CET | TS25.6 | OPTIMAL FIXED PRIORITY SCHEDULING IN MULTI-STAGE MULTI-RESOURCE DISTRIBUTED REAL-TIME SYSTEMS Speaker: Niraj Kumar, Nanyang Technological University, SG Authors: Niraj Kumar, Chuanchao Gao and Arvind Easwaran, Nanyang Technological University, SG Abstract This work studies fixed priority (FP) scheduling of real-time jobs with end-to-end deadlines in a distributed system. Specifically, given a multi-stage pipeline with multiple heterogeneous resources of the same type at each stage, the problem is to assign priorities to a set of real-time jobs with different release times to access a resource at each stage of the pipeline subject to the end-to-end deadline constraints. Note, in such a system, jobs may compete with different sets of jobs at different stages of the pipeline depending on the job-to-resource mapping. To this end, following are the two major contributions of this work. We show that an OPA-compatible schedulability test based on the delay composition algebra can be constructed, which we then use with an optimal priority assignment algorithm to compute a priority ordering. Further, we establish the versatility of pairwise priority assignment in such a multi-stage multi-resource system, compared to a total priority ordering. In particular, we show that a pairwise priority assignment may be feasible even if a priority ordering does not exist. We propose an integer linear programming formulation and a scalable heuristic to compute a pairwise priority assignment. We also show through simulation experiments that the proposed approaches can be used for the holistic scheduling of real-time jobs in edge computing systems./proceedings-archive/2024/DATA/84_pdf_upload.pdf |
17:00 CET | TS25.7 | REAL-TIME MULTI-PERSON IDENTIFICATION AND TRACKING VIA HPE AND IMU DATA FUSION Speaker: Mirco De Marchi, Università di Verona, IT Authors: Nicola Bombieri, Mirco De Marchi, Graziano Pravadelli and Cristian Turetta, Università di Verona, IT Abstract In the context of smart environments, crafting remote monitoring systems that are efficient, cost-effective, user-friendly, and respectful of privacy is crucial for many scenarios. Recognizing and tracing individuals via markerless motion capture systems in multi-person settings poses challenges due to obstructions, varying light conditions, and intricate interactions among subjects. In contrast, methods based on data gathered by Inertial Measurement Units (IMUs) located in wearables grapple with other issues, including the precision of the sensors and their optimal placement on the body. We claim that more accurate results can be achieved by mixing Human Pose Estimation (HPE) techniques with information collected by wearables. To do that, we introduce a real-time platform that fuses HPE and IMU data to track and identify people. It exploits a matching model that consists of two synergistic components: the first employs a geometric approach, correlating orientation, acceleration, and velocity readings from the input sources. The second utilizes a Convolutional Neural Network (CNN) to yield a correlation coefficient for each HPE and IMU data pair. The proposed platform achieves promising results in identification and tracking, with an accuracy rate of 96.9%./proceedings-archive/2024/DATA/395_pdf_upload.pdf |
17:05 CET | TS25.8 | AXI-REALM: A LIGHTWEIGHT AND MODULAR INTERCONNECT EXTENSION FOR TRAFFIC REGULATION AND MONITORING OF HETEROGENEOUS REAL-TIME SOCS Speaker: Thomas Benz, Integrated Systems Laboratory, ETH Zurich, Switzerland, CH Authors: Thomas Benz1, Alessandro Ottaviano1, Robert Balas1, Angelo Garofalo2, Francesco Restuccia3, Alessandro Biondi4 and Luca Benini5 1ETH Zurich, CH; 2Università di Bologna, IT; 3University of California, San Diego, US; 4Scuola Superiore Sant'Anna, IT; 5ETH Zurich, CH | Università di Bologna, IT Abstract The increasing demand for heterogeneous functionality in the automotive industry and the evolution of chip manufacturing processes have led to the transition from federated to integrated critical real-time embedded systems (CRTESs). This leads to higher integration challenges of conventional timing predictability techniques due to access contention on shared resources, which can be resolved by providing system-level observability and controllability in hardware. We focus on the interconnect as a shared resource and propose AXI-REALM, a lightweight, modular, and technology-independent real-time extension to industry-standard AXI4 interconnects, available open-source. AXI-REALM uses a credit-based mechanism to distribute and control the bandwidth in a multi-subordinate system on periodic time windows, proactively prevents denial of service from malicious actors in the system, and tracks each manager's access and interference statistics for optimal budget and period selection. We provide detailed performance and implementation cost assessment in a 12nm node and an end-to-end functional case study implementing AXI-REALM into an open-source Linux-capable RISC-V SoC. In a system with a general-purpose core and a hardware accelerator's DMA engine causing interference on the interconnect, AXI-REALM achieves fair bandwidth distribution among managers, allowing the core to recover 68.2 % of its performance compared to the case without contention. Moreover, near-ideal performance (above 95 %) can be achieved by distributing the available bandwidth in favor of the core, improving the worst-case memory access latency from 264 to below eight cycles. Our approach minimizes buffering compared to other solutions and introduces only 2.45 % area overhead compared to the original SoC./proceedings-archive/2024/DATA/493_pdf_upload.pdf |
17:10 CET | TS25.9 | SGPRS: SEAMLESS GPU PARTITIONING REAL-TIME SCHEDULER FOR PERIODIC DEEP LEARNING WORKLOADS Speaker: Thidapat Chantem, Virginia Tech, US Authors: Amir Fakhim Babaei and Thidapat (Tam) Chantem, Virginia Tech, US Abstract Deep Neural Networks (DNNs) are useful in many applications, including transportation, healthcare, and speech recognition. Despite various efforts to improve accuracy, few works have studied DNNs in the context of real-time requirements. Coarse resource allocation and sequential execution in existing frameworks result in underutilization. In this work, we conduct a GPU speedup gain analysis and propose SGPRS, the first real-time GPU scheduler considering zero-configuration partition switching. The proposed scheduler not only meets more deadlines for parallel tasks but also sustains overall performance beyond the pivot point./proceedings-archive/2024/DATA/435_pdf_upload.pdf |
17:11 CET | TS25.10 | LIGHTWEIGHT AND PREDICTABLE MEMORY VIRTUALIZATION ON MEDIUM-SIZE MICROCONTROLLERS Speaker: Stefano Mercogliano, Università di Napoli Federico II, IT Authors: Stefano Mercogliano, Daniele Ottaviano, Alessandro Cilardo and Marcello Cinque, Università di Napoli Federico II, IT Abstract Nowadays, industrial research is heading towards the consolidation of multiple real-time applications and execution environments on single microcontrollers, with the aim of optimizing area, power, and cost while keeping an eye on protection and flexibility. To this end, virtualization seems an attractive solution, but it must be redesigned according to the specific requirements of microcontroller tasks, which differ from traditional application processor workloads. This paper examines two possible hardware-based models to support virtual machines on medium-size microcontrollers, providing an extensive and reproducible analysis on a RISC-V processor./proceedings-archive/2024/DATA/785_pdf_upload.pdf |
17:12 CET | TS25.11 | MOTIVATING THE USE OF MACHINE-LEARNING FOR IMPROVING TIMING BEHAVIOUR OF EMBEDDED MIXED-CRITICALITY SYSTEMS Speaker: Behnaz Ranjbar, TU Dresden, DE Authors: Vikash Kumar1, Behnaz Ranjbar2 and Akash Kumar2 1Indian Institute of Science, IN; 2Ruhr University Bochum, DE Abstract In Mixed-Criticality (MC) systems, due to encountering multiple Worst-Case Execution Times (WCETs) for each task corresponding to the system operation modes, estimating appropriate WCETs for tasks in lower-criticality (LO) modes is essential to improve the system's timing behavior. While numerous studies focus on determining WCET in the high-criticality mode, determining the appropriate WCET in the LO mode poses significant challenges and has been addressed in a few research works due to its inherent complexity. This article introduces a novel scheme to obtain appropriate WCET for LO modes. We propose an ML-based approach for WCET estimation based on the application's source code analysis and the model training using a comprehensive data set. The experimental results show a significant improvement in utilization by up to 23.3% for the ML-based approach, while mode switching probability is bounded by 7.19% in the worst-case scenario./proceedings-archive/2024/DATA/881_pdf_upload.pdf |
17:13 CET | TS25.12 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
Date: Tuesday, 26 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Break-Out Room S3+4
Come join us for stimulating brainstorm discussions in small groups about the future of digital engineering. Our focus will be on the digital twinning paradigm where virtual instances are created of a system as it is operated, maintained, and repaired (e.g., one evolving virtual instance for each individual car realized from a single design model, connected by a digital thread). We investigate how to take advantage of this paradigm in engineering systems and what new system engineering approaches and architectures (hardware/software) and design workflows are needed and become possible.
Date: Tuesday, 26 March 2024
Time: 19:30 CET - 23:00 CET
Location / Room: L'Hemisfèric at Ciutat de les Arts i les Ciències
Time | Label | Presentation Title Authors |
---|---|---|
19:30 CET | PARTY.1 | ARRIVAL AND WELCOME Presenter: Andy Pimentel, University of Amsterdam, NL Authors: Andy Pimentel1 and Valeria Bertacco2 1University of Amsterdam, NL; 2University of Michigan, US Abstract Arrival and Welcome at DATE Party |
20:00 CET | PARTY.2 | PRESENTATION OF AWARDS Presenter: Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Authors: Jürgen Teich1, Andy Pimentel2, Valeria Bertacco3, David Atienza4 and Ian O'Connor5 1Friedrich-Alexander-Universität Erlangen-Nürnberg, DE; 2University of Amsterdam, NL; 3University of Michigan, US; 4EPFL, CH; 5Lyon Institute of Nanotechnology, FR Abstract Presentation of Awards |
20:15 CET | PARTY.3 | DATE PARTY WITH DRINKS&FOODBARS Presenter: All Participants, DATE, ES Author: All Participants, DATE, ES Abstract Drinks and foodbars at DATE Party |
Date: Wednesday, 27 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Break-Out Room S6+7
GPUs are currently considered by all safety critical industries (automotive, avionics, aerospace, healthcare and others) to accelerate general purpose computations and meet the performance requirements of new advanced functionalities, which are not possible with the legacy, single-core processors used in these domains. However, most of the R&D in companies from these domains is focused on proofs of concept, which demonstrate the capabilities of employing GPUs in these domains while ignoring the certification challenges introduced by GPUs.
In this tutorial, we will teach the attendees how general purpose GPU code can be developed and certified according to safety critical standards used in these industries, using open standards such as Khronos APIs. In particular, the tutorial covers two essential APIs. The first half is focused on OpenGL SC 2.0, a well established graphics API already deployed in several high criticality systems, including DAL A systems in avionics such as the primary flight displays in modern aircraft, as well as in ASIL D systems in automotive dashboards of contemporary high-end vehicles.
The second half of the tutorial focuses on SYCL and its upcoming variant SYCL SC, which abstracts away the low-level complexity of heterogeneous programming in safety critical systems. SYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and, through the SYCLOPS Horizon project, RISC-V accelerators) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to an open, platform-independent standard such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming from completely standard C++. In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language. The main benefit of using SYCL over other heterogeneous programming models is the single programming language approach, which enables one to target multiple devices using the same programming model, and therefore to have cleaner, more portable, and more readable code.
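To give a flavour of the single-source style described above, the following minimal SYCL 2020 vector-addition sketch (an illustrative example, not material from the tutorial itself) runs the same C++ lambda on whichever device the default queue selects, be it a CPU, a GPU, or another available accelerator.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t N = 1024;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

    sycl::queue q;  // default selector: CPU, GPU, or another available accelerator
    {
        sycl::buffer<float> bufA(a.data(), sycl::range<1>(N));
        sycl::buffer<float> bufB(b.data(), sycl::range<1>(N));
        sycl::buffer<float> bufC(c.data(), sycl::range<1>(N));

        q.submit([&](sycl::handler& h) {
            sycl::accessor A(bufA, h, sycl::read_only);
            sycl::accessor B(bufB, h, sycl::read_only);
            sycl::accessor C(bufC, h, sycl::write_only, sycl::no_init);
            // The same C++ lambda is compiled for whichever device the queue targets.
            h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    }  // buffer destruction waits for the kernel and copies results back to the host

    std::cout << "c[0] = " << c[0] << std::endl;  // prints 3
    return 0;
}
```

The buffer/accessor pattern shown here lets the SYCL runtime infer data dependencies and schedule host-device transfers automatically; the explicit scope around the buffers forces the results to be written back before the host reads them.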
The tutorial includes the latest advancements in Khronos SC APIs, through the developments in the on-going Horizon Europe projects, METASAT (https://metasat-project.eu/) and SYCLOPS (https://www.syclops.org/).
This is a hands-on tutorial. The format will be short presentations followed by hands-on exercises. Hence, attendees will need their own laptops to perform the hands-on exercises, connecting to a ready-to-use remote system with a preinstalled environment.
Academic and industrial attendees from the embedded, real-time and safety critical domains who are interested in safety critical GPU programming APIs and safety certification. We expect particular interest from DATE attendees, as well as from those attending the Autonomous Systems Design (ASD) initiative at DATE. This includes practitioners from automotive, avionics, space and other safety critical domains who face high-performance needs that can be satisfied by GPUs.
The tutorial is a mixture of fundamentals and hands-on exercises. First, we will introduce the fundamental theory and concepts of how certification of general purpose GPU code can be achieved, using already certified graphics-based solutions such as OpenGL SC 2.0. Next, we will present the upcoming solution, SYCL, whose SC version is currently under development at Khronos. During the interleaved presentations and hands-on sessions, attendees will have the opportunity to obtain hands-on experience with the presented methods through a series of exercises.
Tutorial attendees must be familiar with C++. Moreover, an understanding of safety critical systems and familiarity with at least one safety standard (ISO 26262, DO-178C, ECSS) and safety critical code development guidelines (e.g., MISRA C/C++) are desirable but not required.
8:30 - 8:40 | Certification and challenges with GPUs (L. Kosmidis) |
---|---|
8:40 - 9:10 | OpenGL SC 2.0 (L. Kosmidis) |
9:10 - 9:50 | SYCL (V. Perez) |
9:50 - 10:00 | What’s new in upcoming SYCL SC (L. Kosmidis) |
Date: Wednesday, 27 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Break-Out Room S8
Session chair:
Anuj Pathania, University of Amsterdam, NL
Session co-chair:
Todor Stefanov, Leiden University, NL
Organisers:
Catherine Le Lan, Synopsys, FR
Olivier Sentieys, INRIA, FR
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | FS09.1 | AI – THE NEXT REVOLUTION IN CHIP DESIGN AUTOMATION Presenter: Tobias Bjerregaard, Synopsys, DK Author: Tobias Bjerregaard, Synopsys, DK Abstract AI is making great strides, and in particular generative AI technologies are set to transform entire industries. AI-based chip design flows are not only yielding better results, but also improving designer productivity by automating many of the tasks traditionally done by human experts. In this talk, I will outline key uses of AI in the chip design process. I will look across the entire EDA stack at where AI-based approaches have made the highest impact. I will also look at how generative AI technologies can help capture human know-how and as such be particularly powerful in mitigating the talent gap that the chip design industry is facing. |
09:00 CET | FS09.2 | AUTOMATION IN DEEP LEARNING AND HARDWARE INTEGRATION Presenter: Dolly Sapra, University of Amsterdam, NL Author: Dolly Sapra, University of Amsterdam, NL Abstract This talk focuses on how EDA tools are changing the way deep learning models are integrated with hardware, especially in edge devices. Tools to automatically identify the most suitable deep learning models for specific hardware configurations are discussed. These methods are crucial for a large number of edge devices, where limitations in power and processing capacity are often encountered. The talk further elaborates on the methods to achieve efficient system-resource management, ensuring that neural network layers are meticulously mapped to the most appropriate hardware components. This approach not only enhances the computational efficiency of edge devices but also contributes significantly to their energy conservation. The aim of this talk is to highlight the developments and opportunities of automating deep learning inference, at a time when AI is increasingly present in every aspect of our lives and becoming integral to the devices we use every day. |
09:30 CET | FS09.3 | DESIGNING THE FUTURE CHIPS: A-DECA SYNERGIZES AI AND OPTIMIZATION TECHNIQUES FOR AUTOMATIC ARCHITECTURE EXPLORATION Presenter: Lilia Zaourar, CEA LIST, FR Author: Lilia Zaourar, CEA LIST, FR Abstract Nowadays, chip designers face various challenges due to the short time to market, increased system complexity, computing heterogeneity, and various SW/HW constraints. Usually, traditional solutions are based on prior engineering experience from computer architects. However, finding a suitable architecture that can accommodate a good balance between various Key Performance Indicators is a notorious problem because of two issues: 1) the design space is extremely large, and its size grows exponentially when adding new components, and 2) the computing cost is large, in particular for metrics evaluation. The A-DECA (Automated Design space Exploration for Computing Architectures) framework leverages various Combinatorial Optimization techniques and Machine Learning strategies to achieve an efficient trade-off on the usual PPA (Performance, Power, Area) metrics but also on others such as security and sustainability. A concrete use case focusing on the design of the next generation of exascale processors for HPC will be presented. |
Date: Wednesday, 27 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Break-Out Room S1+2
Session chair:
Francesco Regazzoni, UvA & ALaRI, CH
Session co-chair:
Aida Todri Sanial, TUE, NL
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | LBR01.1 | LATE BREAKING RESULTS: SINGLE FLUX QUANTUM BASED BROWNIAN CIRCUITS FOR ULTRA-LOW-POWER COMPUTING Presenter: Satoshi Kawakami, Kyushu University, JP Authors: Satoshi Kawakami1, Yusuke Ohtsubo1, Koji Inoue1 and Masamitsu Tanaka2 1Kyushu University, JP; 2Nagoya University, JP Abstract This paper proposes a random walk circuit implementation with single flux quantum devices, essential for Brownian circuits, to reduce processing energy consumption dramatically. SPICE-based simulations demonstrate its functional operation, and the randomness of the walks is verified via the Shapiro-Wilk test. Furthermore, we developed a Monte Carlo simulator for Brownian circuits, enabling functionality verification and computation step distribution analysis. Latency/energy evaluation using a half-adder as a case study revealed that the proposed circuits could reduce energy consumption by a factor of 1,260 and offer an opportunity for low-power computing systems. |
08:33 CET | LBR01.2 | A SCALABLE LOW-LATENCY FPGA ARCHITECTURE FOR SPIN QUBIT CONTROL THROUGH DIRECT DIGITAL SYNTHESIS Presenter: Mathieu Toubeix, University Grenoble Alpes, CEA, List, FR Authors: Mathieu Toubeix1, Eric Guthmuller1, Adrian Evans2 and Tristan Meunier3 1University Grenoble Alpes, CEA, List, FR; 2University Grenoble Alpes, CEA, List, FR; 3University Grenoble Alpes, CNRS, Institut Néel, FR Abstract Scaling qubit control is a key issue for Large Scale Quantum (LSQ) computing, and hardware control systems are increasingly costly in logic and memory resources. We present a newly developed compact Direct Digital Synthesis (DDS) architecture for signal generation for spin qubits that is scalable in terms of waveform accuracy and the number of synchronized channels. Fine control of gate voltages is achieved by on-the-fly generation of very precise ramps. Embedded memory requirements are reduced by orders of magnitude compared to current Arbitrary Waveform Generator (AWG) architectures, removing a major scalability barrier for quantum computing. |
08:36 CET | LBR01.3 | MNT BENCH: BENCHMARKING SOFTWARE AND LAYOUT LIBRARIES FOR FIELD-COUPLED NANOCOMPUTING Presenter: Simon Hofmann, TU Munich, DE Authors: Simon Hofmann, Marcel Walter and Robert Wille, TU Munich, DE Abstract As Field-coupled Nanocomputing (FCN) gains traction as a viable post-CMOS technology, the EDA community lacks public benchmarks to evaluate the performance of academic and commercial design tools. We propose MNT Bench to address this gap by providing a platform for researchers to compare algorithms across a diverse set of benchmarks generated by multiple physical design tools. These benchmarks span various clocking schemes and gate libraries, with MNT Bench being consistently updated to integrate the latest advancements in the field. In fact, using MNT Bench, we were able to provide layouts that are substantially better (in terms of area) than everything the community generated thus far. |
08:39 CET | LBR01.4 | HIDDEN COST OF CIRCUIT DESIGN WITH RFETS Presenter: Sajjad Parvin, University of Bremen, DE Authors: Sajjad Parvin1, Chandan Jha1, Frank Sill Torres2 and Rolf Drechsler3 1University of Bremen, DE; 2German Aerospace Center, DE; 3University of Bremen | DFKI, DE Abstract Reconfigurable Field Effect Transistors (RFETs) can be programmed on the fly to behave either as NMOS or PMOS. Digital circuit designs using RFETs have been shown to benefit both in design and security metrics compared to traditional FETs. In this paper, we highlight the problem associated with the cascading of RFET-based logic cells that have their Source(S)/Drain(D) terminals not connected to the supply Voltage(VDD)/Ground(GND). While these circuits occupy a lesser area, there is a drastic increase in the delay of these logic cells when they are cascaded as a result of the S/D being driven by inputs. We then discuss two methods to mitigate this issue using a) buffer insertion for delay minimization, and b) logic cells that have their S/D terminals driven by VDD/GND. |
08:45 CET | LBR01.5 | OPTIMIZING OFFLOAD PERFORMANCE IN HETEROGENEOUS MPSOCS Presenter: Luca Colagrande, ETH Zurich, CH Authors: Luca Colagrande1 and Luca Benini2 1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT Abstract Heterogeneous multi-core architectures combine a few "host" cores, optimized for single-thread performance, with many small energy-efficient "accelerator" cores for data-parallel processing, on a single chip. Offloading a computation to the many-core acceleration fabric introduces a communication and synchronization cost which reduces the speedup attainable on the accelerator, particularly for small and fine-grained parallel tasks. We demonstrate that by co-designing the hardware and offload routines, we can increase the speedup of an offloaded DAXPY kernel by as much as 47.9%. Furthermore, we show that it is possible to accurately model the runtime of an offloaded application, accounting for the offload overheads, with as low as 1% MAPE error, enabling optimal offload decisions under offload execution time constraints. |
08:45 CET | LBR01.6 | BREAKING THE MEMORY WALL WITH A FLEXIBLE OPEN-SOURCE L1 DATA-CACHE Presenter: Davy Million, University Grenoble Alpes, CEA, List, FR Authors: Davy Million1, Noelia Oliete-Escuín2 and César Fuguet1 1University Grenoble Alpes, CEA, List, FR; 2BSC, ES Abstract The lack of concurrency and pipelining in the memory sub-system of recent open-source RISC-V processors, such as the CVA6, is increasingly becoming the performance bottleneck. Recent updates to the new RISC-V High Performance L1 Data-cache (HPDcache), now fully integrated with the CVA6, bring significant (up to +234%) speedups in key benchmarks with a negligible 5.92% area impact. In this short paper, we detail these improvements, compare performance with existing caches and highlight the benefits of this new, open-source data-cache. |
08:48 CET | LBR01.7 | LLM-GUIDED FORMAL VERIFICATION COUPLED WITH MUTATION TESTING Presenter: Muhammad Hassan, University of Bremen | DFKI, DE Authors: Muhammad Hassan1, Sallar Ahmadi-Pour2, Khushboo Qayyum3, Chandan Jha2 and Rolf Drechsler1 1University of Bremen | DFKI, DE; 2University of Bremen, DE; 3DFKI, DE Abstract The increasing complexity of modern hardware designs poses significant challenges for design verification, particularly defining and validating properties and invariants manually. Recently, Large Language Models (LLMs) such as GPT-4 have been explored to generate these properties. However, assessment of the quality of these LLM-generated properties is still lacking. In this paper, we introduce an LLM-guided formal verification methodology combined with mutation testing for creating and assessing invariants for the Design Under Verification (DUV). Utilizing OpenAI's GPT-4, we automate the generation of invariants and formal models from design specifications and Verilog behavioral models, respectively. We further enhance this approach with mutation testing to validate the quality of the invariants. We use a 27-channel interrupt controller (C432) from the ISCAS-85 benchmarks as a complex case study to showcase the methodology. |
08:51 CET | LBR01.8 | SCAN-CHAIN OPTIMIZATION WITH CONSTRAINED SINGLE LINKAGE CLUSTERING AND GEOMETRY-BASED CLUSTER BALANCING Presenter: Gireesh kumar K M, IBM India, IN Authors: Gireesh kumar K M1, George Antony1, Naiju Karim Abdul2 and Rahul Rao2 1IBM India, IN; 2IBM, IN Abstract Scan chain optimization is a step in the physical design flow where connections between placed scannable elements are re-ordered to reduce the total wirelength, thereby improving routability and power. In this paper, we present a hierarchical clustering technique using single linkage followed by geometry-based balancing to adhere to test requirements while improving scan chain wirelength. The proposed method is compared against the commonly used K-means clustering method and demonstrates a maximum improvement of around 10% for certain design topologies. |
08:54 CET | LBR01.9 | OBFUSGATE: REPRESENTATION LEARNING-BASED GATEKEEPER FOR HARDWARE-LEVEL OBFUSCATED MALWARE DETECTION Presenter: Zhangying He, California State University, Long Beach, US Authors: Zhangying He, Chelsea William Fernandes and Hossein Sayadi, California State University, Long Beach, US Abstract In this paper, we explore the interplay between code obfuscation techniques and performance counter traces to undermine Hardware Malware Detectors (HMDs) that rely on Machine Learning (ML) models. By crafting various obfuscated malware categories and analyzing a wide range of ML models, we demonstrate a notable detection performance reduction, showcasing the evasive impact of obfuscated malware in HMD methods. To counter these threats, we propose ObfusGate, an intelligent and robust defense mechanism based on feature representation learning that significantly enhances machine learning models against both obfuscated and unobfuscated malware attacks. The results indicate the effectiveness of ObfusGate, attaining up to 24% detection rate increase across diverse ML models assessed for hardware-level obfuscated malware detection. |
08:57 CET | LBR01.10 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
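As context for the offload-overhead discussion in LBR01.5 above, the sketch below shows one minimal way to model offloaded runtime as a fixed offload/synchronization overhead plus per-element accelerator compute time, and the speedup this implies over host execution. The cost decomposition, parameter names, and numbers are illustrative assumptions made for this programme, not the authors' actual model or measurements.

```python
# Illustrative sketch only: a toy offload runtime model in the spirit of LBR01.5.
# The decomposition, parameter names, and numbers are assumptions, not the paper's model.

def offloaded_runtime(n_elements, t_offload_overhead, t_per_element_acc):
    """Predicted runtime of an offloaded kernel: fixed offload/synchronization
    overhead plus data-parallel compute time on the accelerator cores."""
    return t_offload_overhead + n_elements * t_per_element_acc

def speedup(n_elements, t_per_element_host, t_offload_overhead, t_per_element_acc):
    """Speedup of offloading a kernel versus running it on the host core."""
    t_host = n_elements * t_per_element_host
    return t_host / offloaded_runtime(n_elements, t_offload_overhead, t_per_element_acc)

if __name__ == "__main__":
    # Toy numbers only: the offload overhead dominates for small n, compute for large n.
    for n in (256, 4096, 65536):
        print(n, round(speedup(n, t_per_element_host=1.0,
                               t_offload_overhead=2000.0,
                               t_per_element_acc=0.05), 2))
```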
Date: Wednesday, 27 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Auditorium 2
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | SD03.1 | OPERATIONAL DESIGN DOMAIN, A KEY ELEMENT THAT CONTRIBUTES TO AI SYSTEM TRUSTWORTHINESS Presenter: Morayo ADEDJOUMA, CEA LIST, FR Author: Morayo ADEDJOUMA, CEA LIST, FR Abstract The Operational Design Domain (ODD) concept, developed in the automotive sector, makes it possible to describe the operational conditions under which an automated system or one of its functions is specifically designed to operate[…] (inspired by SAE J3016). The objective is to define the limits of validity of the system and thus to limit the framework of its safety and validation study, recognizing that no guarantees can be provided outside its ODD. The ODD has recently attracted a lot of interest from manufacturers and it appears to be a relevant asset for the overall AI engineering workflow, including data and machine learning perspectives. However, this concept was defined with safety by design in mind. To define an AI system, the ODD is supposed to be refined throughout the development cycle of an overall system until reaching its final maturity level. Work currently carried out in Confiance.ai seeks to define this articulation, including how the ODD will be relevant to driving data collection and selection, machine learning evaluation, design, and V&V activities. |
09:00 CET | SD03.2 | ETHICAL DESIGN OF ARTIFICIAL INTELLIGENCE FOR THE INTERNET OF THINGS: A SMART HEALTHCARE PERSPECTIVE Presenter: Sudeep Pasricha, Colorado State University, US Author: Sudeep Pasricha, Colorado State University, US Abstract The dawn of the digital medicine era, ushered in by increasingly powerful Internet of Things (IoT) computing devices, is creating new therapies and biomedical solutions that promise to positively transform our quality of life. However, the integration of artificial intelligence (AI) into the digital medicine revolution also creates unforeseen and complex ethical, regulatory, and societal issues. In this talk, I will reflect on the ethical challenges facing AI-driven digital medicine. I will discuss differences between traditional ethics in the medical domain and emerging ethical challenges with AI-driven smart healthcare, particularly as they relate to transparency, bias, privacy, safety, responsibility, justice, and autonomy. Open challenges and recommendations will be outlined to enable the integration of ethical principles into the design, validation, clinical trials, deployment, monitoring, repair, and retirement of AI-based smart healthcare products. |
09:30 CET | SD03.3 | TRUSTWORTHY EDGE AI – WHAT'S MISSING Presenter: Aaron Ding, TU Delft, NL Author: Aaron Ding, TU Delft, NL Abstract Similar to the progression from cloud computing to cloud intelligence, we are witnessing a fast evolution from edge computing to edge intelligence (aka Edge AI). As a rising research branch that merges distributed computing, data analytics, and embedded and distributed ML, Edge AI is envisioned to provide adaptation for data-driven applications and enable the creation, optimization and deployment of distributed AI/ML pipelines. However, despite all the promises ahead, the path to realize Edge AI is far from straightforward. This talk will illustrate a major concern of Edge AI from the trustworthiness perspective, since critical building blocks are still missing. Besides practical lessons from the EU SPATIAL project exploration, the talk will share an envisioned roadmap for Edge AI, which is a stepping stone to establish an appropriate foundation, on which the promise of Edge AI can be built. |
Date: Wednesday, 27 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Multi-Purpose Room M1B+D
Session chair:
Daniel Morris, Meta, US
Session co-chair:
Rainer Doemer, University of California, Irvine, US
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | TS06.1 | ALGORITHM-HARDWARE CO-DESIGN FOR ENERGY-EFFICIENT A/D CONVERSION IN RERAM-BASED ACCELERATORS Speaker: Chenguang Zhang, Peking University, CN Authors: Chenguang Zhang1, Zhihang Yuan2, Xingchen Li2 and Guangyu Sun2 1Center for Energy-efficient Computing and Applications (CECA) School of EECS, Peking University, CN; 2Peking University, CN Abstract Deep neural networks are widely deployed in many fields. Due to the in-situ computation (known as processing in memory) capacity of the Resistive Random Access Memory (ReRAM) crossbar, ReRAM-based accelerators show potential in accelerating DNNs with low power and high performance. However, despite this power advantage, such accelerators suffer from the high power consumption of peripheral circuits, especially the Analog-to-Digital Converter (ADC), which accounts for over 60 percent of total power consumption. This problem hinders ReRAM-based accelerators from achieving higher efficiency. Some redundant analog-to-digital conversion operations make no contribution to maintaining inference accuracy, and such operations can be eliminated by modifying the ADC searching logic. Based on these observations, we propose an algorithm-hardware co-design method and explore the co-design approach in both hardware design and quantization algorithms. Firstly, we focus on the distribution of outputs along the crossbar's bit-lines and identify the fine-grained redundant ADC sampling bits. To further compress ADC bits, we propose a hardware-friendly quantization method and coding scheme, in which different quantization strategies are applied to the partial results in different intervals. To support the two features above, we propose a lightweight architectural design based on SAR-ADC. It is worth mentioning that our method is not only more energy efficient but also retains the flexibility of the algorithm. Experiments demonstrate that our method achieves about 1.6× to 2.3× ADC power reduction./proceedings-archive/2024/DATA/21_pdf_upload.pdf |
08:35 CET | TS06.2 | AN EFFICIENT ASYNCHRONOUS CIRCUITS DESIGN FLOW WITH BACKWARD DELAY PROPAGATION CONSTRAINT Speaker: Lingfeng Zhou, National Sun Yat-Sen University, TW Authors: Lingfeng Zhou, Shanlin Xiao, Huiyao Wang, Jinghai Wang, Zeyang Xu, Bohan Wang and Zhiyi Yu, National Sun Yat-Sen University, TW Abstract Asynchronous circuits have recently become more popular in Internet of Things and neural network chips because of their potential low power consumption. However, due to the lack of Electronic Design Automation (EDA) tools, the design efficiency of asynchronous circuits remains low and faces challenges in large-scale applications. This paper proposes a new asynchronous circuit design flow using traditional EDA tools and applies a new backward delay propagation constraint (BDPC) method. In this method, control paths and data paths are tightly coupled and analyzed together to improve the accuracy of static timing analysis. Compared to previous works, the proposed design flow and constraint method offer significant advantages in terms of accuracy and efficiency. To verify this flow, an asynchronous RISC-V processor was implemented in a TSMC 65nm process. Compared to the synchronous version, the asynchronous processor achieves a power optimization of 17.4% while maintaining the same speed and area./proceedings-archive/2024/DATA/189_pdf_upload.pdf |
08:40 CET | TS06.3 | PELS: A LIGHTWEIGHT AND FLEXIBLE PERIPHERAL EVENT LINKING SYSTEM FOR ULTRA-LOW POWER IOT PROCESSORS Speaker: Alessandro Ottaviano, ETH Zurich, CH Authors: Alessandro Ottaviano1, Robert Balas1, Philippe Sauter1, Manuel Eggimann1 and Luca Benini2 1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT Abstract A key challenge for ultra-low-power (ULP) devices is handling peripheral linking, where the main central processing unit (CPU) periodically mediates the interaction among multiple peripherals following wake-up events. Current solutions address this problem by either integrating event interconnects that route single-wire event lines among peripherals or by general-purpose I/O processors, with a strong trade-off between the latency and efficiency of the former and the flexibility of the latter. In this paper, we present an open-source, peripheral-agnostic, lightweight, and flexible Peripheral Event Linking System (PELS) that combines dedicated event routing with a tiny I/O processor. With the proposed approach, the power consumption of a linking event is reduced by 2.5 times compared to a baseline relying on the main core for the event-linking process, at a low area of just 7 kGE in its minimal configuration, when integrated into a ULP RISC-V IoT processor./proceedings-archive/2024/DATA/401_pdf_upload.pdf |
08:45 CET | TS06.4 | ATTENTION-BASED EDA TOOL PARAMETER EXPLORER: FROM HYBRID PARAMETERS TO MULTI-QOR METRICS Speaker: Donger Luo, ShanghaiTech University, CN Authors: Donger Luo1, QI SUN2, Qi Xu3, Tinghuan Chen4 and Hao Geng1 1ShanghaiTech University, CN; 2Zhejiang University, CN; 3University of Science and Technology of China, CN; 4The Chinese University of Hong Kong, CN Abstract Improving the outcomes of very-large-scale integration design without altering the underlying design enablement, such as process, device, interconnect, and IPs, is critical for integrated circuit (IC) designers. Parameter tuning for electronic design automation (EDA) tools is an emerging technology for improving the final design Quality-of-Result (QoR). However, many complex heuristics have accreted on top of earlier heuristics integrated into the tools, resulting in a vast number of tunable parameters. Even worse, these parameters include both continuous and discrete ones, making the parameter tuning process laborious and challenging. In this paper, we propose an attention-based EDA tool parameter explorer. A self-attention mechanism is developed to navigate the parameter importance. A hybrid space Gaussian process model is leveraged to optimize continuous and discrete parameters jointly, capturing their complex interactions. In addition, considering multiple QoR metrics and the large amount of time required to invoke EDA tools, a customized acquisition function based on expected hypervolume improvement (EHVI) is proposed to enable multi-objective optimization and parallel evaluation. Experimental results on a set of IWLS2005 benchmarks demonstrate the effectiveness and efficiency of our method./proceedings-archive/2024/DATA/451_pdf_upload.pdf |
08:50 CET | TS06.5 | PARALLEL MULTI-OBJECTIVE BAYESIAN OPTIMIZATION FRAMEWORK FOR CGRA MICROARCHITECTURE Speaker: Qi Xu, University of Science and Technology of China, CN Authors: Bing Li, Wendi Sun, Xiaobing Ni, Kaixuan He, Qi Xu, Song Chen and Yi Kang, University of Science and Technology of China, CN Abstract Recently, due to the flexibility and reconfigurability of Coarse-Grained Reconfigurable Architecture (CGRA), CGRA microarchitecture has become an inevitable trend to accelerate the convolution calculation in diverse deep neural networks. However, owing to the vast microarchitecture design space and the complicated VLSI verification flow, it is a huge challenge to find a microarchitecture that balances multiple performance metrics. In this paper, we formulate the CGRA microarchitecture design as a design space exploration problem, and propose a parallel multi-objective Bayesian optimization framework (PAMBOF) to automatically explore the CGRA microarchitecture design space. Meanwhile, high-precision performance and area models are built to enable fast design space exploration. To approximate the black-box objective function in the design space, the PAMBOF framework first builds multiple Gaussian processes (GP) with deep regularization kernel learning functions (DRKL-GP). Then a parallel Bayesian optimization algorithm is developed to sample a batch of candidate design points, which are simulated in parallel by the performance and area models. Experimental results demonstrate that compared to the prior arts, the proposed PAMBOF framework can search for a CGRA microarchitecture design with better area and performance in a shorter runtime./proceedings-archive/2024/DATA/559_pdf_upload.pdf |
08:55 CET | TS06.6 | EFFICIENT SPECTRAL-AWARE POWER SUPPLY NOISY ANALYSIS FOR LOW-POWER DESIGN VERIFICATION Speaker: Zhou Jin, China University of Petroleum-Beijing, CN Authors: Yinuo Bai1, Xiaoyu Yang1, Yicheng Lu1, Dan Niu2, Cheng Zhuo3, Zhou Jin1 and Weifeng Liu1 1China University of Petroleum-Beijing, CN; 2Southeast University, CN; 3Zhejiang University, CN Abstract The relentless pursuit of energy-efficient electronic devices necessitates advanced methodologies for low-power design verification, with a particular focus on mitigating power supply noise. The challenges posed by shrinking voltage margins in low-power designs lead to a significant demand for rapid and accurate power supply noise simulation and verification techniques. Excessive supply noise inevitably forces the supply level to be raised, thereby hurting the low-power design target. Spectral methods have been demonstrated to be a great alternative, producing a sparse sub-matrix with a spectral-similarity property as the preconditioner to efficiently reduce the iteration count when solving the linear system for supply noise verification. However, existing methods either suffer from high computational complexity or rely on approximations to reduce computational time. Therefore, a novel approach is needed to efficiently generate high-quality preconditioners. In this paper, we propose a two-stage spectral-aware algorithm to address these challenges. Our approach has two main highlights. Firstly, by introducing spectral-aware weights, we can better assess the priority of edges and construct high-quality spanning trees with the minimum relative condition number. Secondly, by leveraging eigenvalue transformation strategies, we can quickly and accurately recover off-tree edges that are spectrally critical, avoiding time-consuming iterative computations. Compared to two SOTA methods, GRASS and feGRASS, our approach demonstrates higher accuracy and efficiency in preconditioner generation (37.3x and 2.14x speedup, respectively) as well as significant improvements in accelerating the linear solver for power supply noise analysis in power grid simulation and other Laplacian graphs (5.16x and 1.70x speedup, respectively); standard preconditioning background is sketched after this session's table./proceedings-archive/2024/DATA/604_pdf_upload.pdf |
09:00 CET | TS06.7 | COMPACT POWERS-OF-TWO: AN EFFICIENT NON-UNIFORM QUANTIZATION FOR DEEP NEURAL NETWORKS Speaker: Xinkuang Geng, Shanghai Jiao Tong University, CN Authors: Xinkuang Geng1, Siting Liu2, Jianfei Jiang1, Kai Jiang3 and Honglan Jiang1 1Shanghai Jiao Tong University, CN; 2ShanghaiTech University, CN; 3Inspur Academy of Science and Technology, CN Abstract To reduce the demands for computation and memory of deep neural networks (DNNs), various quantization techniques have been extensively investigated. However, conventional methods cannot effectively capture the intrinsic data characteristics in DNNs, leading to a high accuracy degradation when employing low-bit-width quantization. In order to better align with the bell-shaped distribution, we propose an efficient non-uniform quantization scheme, denoted as compact powers-of-two (CPoT). Aiming to avoid the rigid resolution inherent in powers-of-two (PoT) without introducing new issues, we add a fractional part to its encoding, followed by a biasing operation to eliminate the unrepresentable region around 0. This approach effectively balances the grid resolution in both the vicinity of 0 and the edge region. To facilitate the hardware implementation, we optimize the dot product for CPoT based on the computational characteristics of the quantized DNNs, where the precomputable terms are extracted and incorporated into bias. Consequently, a multiply-accumulate (MAC) unit is designed for CPoT using shifters and look-up tables (LUTs). The experimental results show that, even with a certain level of approximation, our proposed CPoT outperforms state-of-the-art methods in data-free quantization (DFQ), a post-training quantization (PTQ) technique focusing on data privacy and computational efficiency. Furthermore, CPoT demonstrates superior efficiency in area and power compared to other methods in hardware implementation./proceedings-archive/2024/DATA/1217_pdf_upload.pdf |
09:05 CET | TS06.8 | TREERNG: BINARY TREE GAUSSIAN RANDOM NUMBER GENERATOR FOR EFFICIENT PROBABILISTIC AI HARDWARE DESIGN Speaker: Jonas Crols, KU Leuven, BE Authors: Jonas Crols, Guilherme Paim, Shirui Zhao and Marian Verhelst, KU Leuven, BE Abstract Bayesian Neural Networks (BNNs) offer opportunities for greatly enhancing the trustworthiness of conventional neural networks by monitoring the uncertainties in decision-making. A significant drawback for BNN inference at the extreme edge, however, is the imperative need to incorporate Gaussian Random Number Generators (GRNG) within each neuron. State-of-the-art GRNG algorithms heavily depend on multiple arithmetic operations and the use of extensive look-up tables, posing significant implementation challenges for ultra-low power hardware implementations. To overcome this, this paper presents an innovative binary tree random number generator (TreeGRNG) allowing the use of ultra-low-cost constant comparators instead of arithmetic units. We further enhance the TreeGRNG proposal with a set of hardware-aware optimizations exploiting the Gaussian properties. The optimized TreeGRNG surpasses the State-of-the-Art (SoTA) in terms of distribution accuracy while achieving a 3.7x reduction in energy per sample and boosting the throughput per unit area by 5.8x. Moreover, our TreeGRNG proposal possesses a distinct advantage over the current SoTA in terms of flexibility, as it easily enables designers to adjust the shape of the sampled probability distribution, extending beyond the capabilities of traditional GRNGs, and opening the horizon towards future probabilistic AI designs. The TreeGRNG design is available open-source at the link "https://github.com/KULeuven-MICAS/TreeGRNG"./proceedings-archive/2024/DATA/1197_pdf_upload.pdf |
09:10 CET | TS06.9 | DACO: PURSUING ULTRA-LOW POWER CONSUMPTION VIA DNN-ADAPTIVE CPU-GPU CO-OPTIMIZATION ON MOBILE DEVICES Speaker: Yushu Wu, Northeastern University, US Authors: Yushu Wu1, Chao Wu1, Geng Yuan2, Yanyu Li1, Weichao Guo3, Jing Rao4, Xipeng Shen5, Bin Ren6 and Yanzhi Wang1 1Northeastern University, US; 2University of Georgia, US; 3Guangdong Oppo Mobile Telecommunications, CN; 4University of New South Wales, AU; 5North Carolina State University, US; 6William and Mary University, US Abstract As Deep Neural Networks (DNNs) become popular in mobile systems, their high computational and memory demands make them major power consumers, especially in limited-budget scenarios. In this paper, we propose DACO, a DNN-Adaptive CPU-GPU CO-optimization technique, to reduce the power consumption of DNNs. First, a resource-oriented classifier is proposed to quantify the computation/memory intensity of DNN models and classify them accordingly. Second, a set of rule-based policies is deduced for achieving the best-suited CPU-GPU system configuration in a coarse-grained manner. Combined with all the rules, a coarse-to-fine CPU-GPU auto-tuning approach is proposed to reach the Pareto-optimal speed and power consumption in DNN inference. Experimental results demonstrate that, compared with the existing approach, DACO could reduce power consumption by up to 71.9% while keeping an excellent DNN inference speed./proceedings-archive/2024/DATA/902_pdf_upload.pdf |
09:11 CET | TS06.10 | LESS: LOW-POWER ENERGY-EFFICIENT SUBGRAPH ISOMORPHISM ON FPGA Speaker: Roberto Bosio, Politecnico di Torino, IT Authors: Roberto Bosio, Giovanni Brignone, Filippo Minnella, Muhammad Usman Jamal and Luciano Lavagno, Politecnico di Torino, IT Abstract Low-power energy-efficient subgraph isomorphism (LESS) is an open-source field-programmable gate array-only low-memory subgraph matching solver designed for energy efficiency. Depending on the input datagraph, the energy consumption of LESS, averaged on different diverse queries, is up to 38× and 93× lower than CPU and GPU solvers respectively./proceedings-archive/2024/DATA/36_pdf_upload.pdf |
09:12 CET | TS06.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
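As background for the spectral preconditioning referenced in TS06.6 above, the equations below sketch the standard setting, assuming the power-grid system matrix is a graph Laplacian L and the preconditioner M is the Laplacian of a sparse, spectrally similar subgraph (a spanning tree plus selected off-tree edges). This is textbook material on preconditioned conjugate gradients, not the paper's specific algorithm.

```latex
% Standard preconditioned-CG background (not the TS06.6 algorithm itself).
\begin{align}
  L x &= b \quad\text{is solved via the preconditioned system}\quad M^{-1} L x = M^{-1} b,\\
  \kappa(M^{-1}L) &= \frac{\lambda_{\max}(M^{-1}L)}{\lambda_{\min}(M^{-1}L)}
  \quad\text{(generalized eigenvalues, restricted to the range of } L\text{)},\\
  k &= O\!\left(\sqrt{\kappa(M^{-1}L)}\,\log\tfrac{1}{\varepsilon}\right)
  \quad\text{iterations suffice to reach relative accuracy } \varepsilon .
\end{align}
```

A sparser M is cheaper to factor and apply per iteration, while a spectrally closer M lowers the relative condition number and hence the iteration count; the trade-off between these two is what tree-plus-off-tree-edge preconditioners aim to balance.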
Date: Wednesday, 27 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
Franco Fummi, Università di Verona, IT
Session co-chair:
Mohammad Hassan Najafi, University of Louisiana, US
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | TS16.1 | HIGH-PERFORMANCE DATA MAPPING FOR BNNS ON PCM-BASED INTEGRATED PHOTONICS Speaker: Taha Michael Shahroodi, TU Delft, NL Authors: Taha Shahroodi1, Raphael Cardoso2, Alberto Bosio3, Stephan Wong1, Ian O'Connor3 and Said Hamdioui1 1TU Delft, NL; 2Ecole Centrale de Lyon, FR; 3Lyon Institute of Nanotechnology, FR Abstract State-of-the-Art (SotA) hardware implementations of Deep Neural Networks (DNNs) incur high latencies and costs. Binary Neural Networks (BNNs) are potential alternative solutions to realize faster implementations without losing accuracy. In this paper, we first present a new data mapping, called TacitMap, suited for BNNs implemented based on a Computation-In-Memory (CIM) architecture. TacitMap maximizes the use of available parallelism, while CIM architecture eliminates the data movement overhead. We then propose a hardware accelerator based on optical phase change memory (oPCM) called EinsteinBarrier. EinsteinBarrier incorporates TacitMap and adds an extra dimension for parallelism through wavelength division multiplexing, leading to extra latency reduction. The simulation results show that, compared to the SotA CIM baseline, TacitMap and EinsteinBarrier significantly improve execution time by up to ∼154× and ∼3113×, respectively, while also maintaining the energy consumption within 60% of that in the CIM baseline./proceedings-archive/2024/DATA/80_pdf_upload.pdf |
08:35 CET | TS16.2 | UHD: UNARY PROCESSING FOR LIGHTWEIGHT AND DYNAMIC HYPERDIMENSIONAL COMPUTING Speaker: M. Hassan Najafi, University of Louisiana at Lafayette, US Authors: Sercan Aygun, Mehran Shoushtari Moghadam and M. Hassan Najafi, University of Louisiana at Lafayette, US Abstract Hyperdimensional computing (HDC) is a novel computational paradigm that operates on long-dimensional vectors known as hypervectors. The hypervectors are constructed as long bit-streams and form the basic building blocks of HDC systems. In HDC, hypervectors are generated from scalar values without considering bit significance. HDC is efficient and robust for various data processing applications, especially computer vision tasks. To construct HDC models for vision applications, the current state-of-the-art practice utilizes two parameters for data encoding: pixel intensity and pixel position. However, the intensity and position information embedded in high-dimensional vectors are generally not generated dynamically in the HDC models. Consequently, the optimal design of hypervectors with high model accuracy requires powerful computing platforms for training. A more efficient approach is to generate hypervectors dynamically during the training phase. To this aim, this work uses low-discrepancy sequences to generate intensity hypervectors, while avoiding position hypervectors. Doing so eliminates the multiplication step in vector encoding, resulting in a power-efficient HDC system. For the first time in the literature, our proposed approach employs lightweight vector generators utilizing unary bit-streams for efficient encoding of data instead of using conventional comparator-based generators./proceedings-archive/2024/DATA/714_pdf_upload.pdf |
08:40 CET | TS16.3 | TOWARDS RELIABLE AND ENERGY-EFFICIENT RRAM BASED DISCRETE FOURIER TRANSFORM ACCELERATOR Speaker: Jianan Wen, IHP – Leibniz Institute for High Performance Microelectronics, DE Authors: Jianan Wen1, Andrea Baroni1, Eduardo Perez2, Max Uhlmann1, Markus Fritscher2, Karthik KrishneGowda1, Markus Ulbricht1, Christian Wenger2 and Milos Krstic3 1IHP – Leibniz Institute for High Performance Microelectronics, DE; 2IHP – Leibniz Institute for High Performance Microelectronics | Brandenburgische TU Cottbus-Senftenberg, DE; 3IHP – Leibniz Institute for High Performance Microelectronics | Universität Potsdam, DE Abstract The Discrete Fourier Transform (DFT) holds a prominent place in the field of signal processing. The development of DFT accelerators in edge devices requires high energy efficiency due to the limited battery capacity. In this context, emerging devices such as resistive RAM (RRAM) provide a promising solution. They enable the design of high-density crossbar arrays and facilitate massively parallel and in situ computations within memory. However, the reliability and performance of the RRAM-based systems are compromised by the device non-idealities, especially when executing DFT computations that demand high precision. In this paper, we propose a novel adaptive variability-aware crossbar mapping scheme to address the computational errors caused by the device variability. To quantitatively assess the impact of variability in a communication scenario, we implemented an end-to-end simulation framework integrating the modulation and demodulation schemes. When combining the presented mapping scheme with an optimized architecture to compute DFT and inverse DFT(IDFT), compared to the state-of-the-art architecture, our simulation results demonstrate energy and area savings of up to 57% and 18%, respectively. Meanwhile, the DFT matrix mapping error is reduced by 83% compared to conventional mapping. In a case study involving 16-quadrature amplitude modulation (QAM), with the optimized architecture prioritizing energy efficiency, we observed a bit error rate (BER) reduction from 1.6e-2 to 7.3e-5. As for the conventional architecture, the BER is optimized from 2.9e-3 to zero./proceedings-archive/2024/DATA/218_pdf_upload.pdf |
08:45 CET | TS16.4 | DYNAMIC RECONFIGURABLE SECURITY CELLS BASED ON EMERGING DEVICES INTEGRABLE IN FDSOI TECHNOLOGY Speaker: Niladri Bhattacharjee, Namlab gGmbH, DE Authors: Niladri Bhattacharjee1, Viktor Havel1, Nima Kavand2, Jorge Quijada3, Akash Kumar4, Thomas Mikolajick5 and Jens Trommer1 1Namlab gGmbH, DE; 2TU Dresden, DE; 3Chair for Processor Design, TU Dresden, DE; 4Ruhr University Bochum, DE; 5NaMLab Gmbh / TU Dresden, DE Abstract Globalization in integrated circuit (IC) design and manufacturing amplifies the risk of unauthorized users compromising previously trusted IC processes. While measures exist to protect the integrity of IC hardware, there is an inherent need for emerging technologies, such as Reconfigurable FETs (RFETs), to overcome the limitations offered by CMOS. RFETs are a special type of doping-free, Schottky transistors that can work as a PMOS or NMOS as a function of biasing across its gates. Their distinct transport properties enable reconfigurable logic gates such as 2-NAND-NOR and 2-XOR-XNOR, which strengthen hardware security through methods like logic locking, split manufacturing, and camouflaging. In this work, standard cell layouts for such dynamic reconfigurable security cells with three-independent-gated RFETs (TIG-RFETs) on a 22nm FDSOI technology, with minimum pitch adherence and design rule compliance for co-integration with CMOS are developed. Parallelly, a TCAD model of TIG-RFET is developed, illustrating two biasing schemes for reconfigurability: adaptive per-cell body bias and globally fixed body bias. With these layouts, an area comparison shows that the smallest dynamic 2-NAND-NOR gate occupies roughly double the CMOS 2-NAND gate area, while the smallest 2-XOR-XNOR gate is only 20% larger than the CMOS 2-XOR. The area comparison can be extended to find the number of logic locking gates that can be added per overhead area, which has been analysed for the ISCAS-85 c6288 benchmark circuit. Dynamic replacement-based logic locking with TIG-RFETs potentially doubles the number of keys compared to classical CMOS logic locking. This work allows a realistic view of the application of RFETs in hardware security and its co-integrability with CMOS for the first time. Our work intends to lead the way into future EDA developments of TIG-RFETs and consider it as a potential candidate to improve hardware security in cooperation with the CMOS legacy./proceedings-archive/2024/DATA/420_pdf_upload.pdf |
08:50 CET | TS16.5 | ANALOG PRINTED SPIKING NEUROMORPHIC CIRCUIT Speaker: Maha Shatta, Karlsruhe Institute of Technology, DE Authors: Priyanjana Pal1, Haibin Zhao1, Maha Shatta1, Michael Hefenbrock2, Sina Bakhtavari Mamaghani1, Sani Nassif3, Michael Beigl1 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2RevoAI GmbH, DE; 3Radyalis LLC, US Abstract Biologically-inspired Spiking Neural Networks have emerged as a promising avenue for energy-efficient, high-performance neuromorphic computing. With the demand for highly-customized and cost-effective solutions in emerging application domains like soft robotics, wearables, or IoT-devices, Printed Electronics has emerged as an alternative to traditional silicon technologies leveraging soft materials and flexible substrates. In this paper, we propose an energy-efficient analog printed spiking neuromorphic circuit and a corresponding learning algorithm. Simulations on 13 benchmark datasets show an average of 3.86× power improvement with similar classification accuracy compared to previous works./proceedings-archive/2024/DATA/776_pdf_upload.pdf |
08:55 CET | TS16.6 | TT-SNN: TENSOR TRAIN DECOMPOSITION FOR EFFICIENT SPIKING NEURAL NETWORK TRAINING Speaker: Priyadarshini Panda, Yale University, US Authors: Donghyun Lee, Ruokai Yin, Youngeun Kim, Abhishek Moitra, Yuhang Li and Priyadarshini Panda, Yale University, US Abstract Spiking Neural Networks (SNNs) have gained significant attention as a potentially energy-efficient alternative for standard neural networks with their sparse binary activation. However, SNNs suffer from memory and computation overhead due to spatio-temporal dynamics and multiple backpropagation computations across timesteps during training. To address this issue, we introduce Tensor Train Decomposition for Spiking Neural Networks (TT-SNN), a method that reduces model size through trainable weight decomposition, resulting in reduced storage, FLOPs, and latency. In addition, we propose a parallel computation pipeline as an alternative to the typical sequential tensor computation, which can be flexibly integrated into various existing SNN architectures. To the best of our knowledge, this is the first of its kind application of tensor decomposition in SNNs. We validate our method using both static and dynamic datasets, CIFAR10/100 and N-Caltech101, respectively. We also propose a TT-SNN-tailored training accelerator to fully harness the parallelism in TT-SNN. Our results demonstrate substantial reductions in parameter size (7.98X), FLOPs (9.25X), training time (17.7%), and training energy (28.3%) during training for the N-Caltech101 dataset, with negligible accuracy degradation./proceedings-archive/2024/DATA/966_pdf_upload.pdf |
09:00 CET | TS16.7 | SCGEN: A VERSATILE GENERATOR FRAMEWORK FOR AGILE DESIGN OF STOCHASTIC CIRCUITS Speaker: Zexi Li, Shanghai Jiao Tong University, CN Authors: Zexi Li1, Haoran Jin1, Kuncai Zhong2, Guojie Luo3, Runsheng Wang3 and Weikang Qian1 1Shanghai Jiao Tong University, CN; 2Hunan University, CN; 3Peking University, CN Abstract Stochastic computing (SC) is an unconventional computing paradigm with unique features. Designing SC circuits is dramatically different from designing binary computing (BC) circuits. To support the agile design of SC circuits, we propose SCGen, a versatile generator framework, which provides users with a C++ interface to easily specify SC circuits and supports 1) accelerated accuracy simulation, 2) accelerated design space exploration (DSE) for accuracy maximization guided by simulated annealing (SA) and genetic algorithm (GA), 3) circuit optimization by random number source (RNS) sharing, 4) circuit verification via symbolic expression analysis, and 5) automatic Verilog code generation. Furthermore, we extend SCGen to also support agile design of hybrid SC-BC circuits. The experimental results show that our proposed DSE acceleration methods achieve up to 59× speedup, the DSE with SA and GA can get an average reduction of 4.0% and 12.7%, respectively, in accuracy loss compared to random search, and RNS sharing reduces the average area and power by 41% and 47%, respectively./proceedings-archive/2024/DATA/690_pdf_upload.pdf |
09:05 CET | TS16.8 | APPROXIMATION ALGORITHM FOR NOISY QUANTUM CIRCUIT SIMULATION Speaker: Mingyu Huang, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN Authors: Mingyu Huang1, Ji Guan2, Wang Fang1 and Mingsheng Ying3 1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, CN; 3State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; Department of Computer Science and Technology, Tsinghua University, CN Abstract Simulating noisy quantum circuits is vital in designing and verifying quantum algorithms in the current NISQ (Noisy Intermediate-Scale Quantum) era, where quantum noise is unavoidable. However, it is much less efficient than the noiseless classical counterpart because of the quantum state explosion problem (the dimension of the state space is exponential in the number of qubits) and the complex (non-unitary) representation of noise. Consequently, only noisy circuits with up to about 50 qubits can be simulated approximately well. To improve the scalability of the circuits that can be simulated, this paper introduces a novel approximation algorithm for simulating noisy quantum circuits when the effect of the noise is insignificant. The algorithm is based on a new tensor network diagram for the noisy simulation and uses the singular value decomposition to approximate the tensors of quantum noises in the diagram. The contraction of the tensor network diagram is implemented on Google's TensorNetwork. The effectiveness and utility of the algorithm are demonstrated by experimenting on a series of practical quantum circuits with realistic superconducting noise models. As a result, our algorithm can approximately simulate quantum circuits with up to 225 qubits and 20 noises (within about 1.8 hours). In particular, our method offers a speedup over the commonly used approximation (sampling) algorithm, the quantum trajectories method. Furthermore, our approach can significantly reduce the number of samples in the quantum trajectories method when the noise rate is small enough. (A toy SVD-truncation sketch follows this session's table.)/proceedings-archive/2024/DATA/485_pdf_upload.pdf |
09:10 CET | TS16.9 | RTSA: AN RRAM-TCAM BASED IN-MEMORY-SEARCH ACCELERATOR FOR SUB-100 μS COLLISION DETECTION Speaker: Jiahao Sun, Shanghai Jiao Tong University, CN Authors: Jiahao Sun, Fangxin Liu, Yijian Zhang, Li Jiang and Rui Yang, Shanghai Jiao Tong University, CN Abstract Collision detection is a highly time-consuming task in motion planning, accounting for over 90% of the total calculation time. Previous hardware accelerators can hardly maintain fast computation speed in real time while supporting a large roadmap. In this work, we present RTSA, a novel in-memory-search collision detection accelerator, which achieves an impressive sub-100 μs response time for collision detection in 800 MB scale roadmaps. The accelerator leverages an in-situ-search-enabled memory architecture, enabling massively parallel search operations. RTSA is powered by ternary content-addressable memories (TCAMs) based on large-scale non-volatile resistive random-access memory (RRAM) arrays. TCAM eliminates the need for extensive data transfer between memory and computing units, leading to significant energy and delay savings. The accelerator comfortably exceeds the speed requirement for collision detection (<1 ms), making it highly suitable for various applications, including robot motion planning in dynamic environments, manufacturing, and physical simulation./proceedings-archive/2024/DATA/311_pdf_upload.pdf |
09:11 CET | TS16.10 | TOWARDS CYCLE-BASED SHUTTLING FOR TRAPPED-ION QUANTUM COMPUTERS Speaker: Daniel Schoenberger, TU Munich, DE Authors: Daniel Schoenberger1, Stefan Hillmich2, Matthias Brandl3 and Robert Wille1 1TU Munich, DE; 2Software Center Hagenberg (SCCH) GmbH, AT; 3Infineon Technologies, DE Abstract The Quantum Charge Coupled Device (QCCD) architecture offers a modular solution to enable the realization of trapped-ion quantum computers with a large number of qubits. Within these devices, ions can be shuttled (moved) throughout the trap and through different dedicated zones. However, due to decoherence of the ions' quantum states, the qubits lose their quantum information over time. Thus, the time spent on these shuttling operations should be minimized. In this extended abstract, we propose a concept towards a cycle-based heuristic approach to determining an efficient shuttling schedule for a given quantum circuit./proceedings-archive/2024/DATA/955_pdf_upload.pdf |
09:12 CET | TS16.11 | TOWARDS ATOMIC DEFECT-AWARE PHYSICAL DESIGN OF SILICON DANGLING BOND LOGIC ON THE H-SI(100)-2X1 SURFACE Speaker: Marcel Walter, TU Munich, DE Authors: Marcel Walter1, Jeremiah Croshaw2, Samuel Ng3, Konrad Walus3, Robert Wolkow2 and Robert Wille1 1TU Munich, DE; 2University of Alberta, CA; 3University of British Columbia, CA Abstract Recent advancements in Silicon Dangling Bond (SiDB) fabrication have transitioned from manual to automated processes. However, sub-nanometer substrate defects remain a significant challenge, thus preventing the fabrication of functional logic. Current design automation techniques lack defect-aware strategies. This paper introduces an idea for a surface defect model based on experimentally verified defects, which can be applied to enhance the robustness of established gate libraries. Additionally, a prototypical automatic placement and routing algorithm is presented, utilizing STM data from physical experiments to obtain dot-accurate circuitry resilient to atomic surface defects. Initial evaluations on surfaces with varying defect rates demonstrate their critical impact, suggesting that fabrication processes must achieve defect rates of around 0.1% to further advance this circuit technology./proceedings-archive/2024/DATA/315_pdf_upload.pdf |
09:13 CET | TS16.12 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
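To make the SVD-based noise approximation mentioned in TS16.8 above concrete, the sketch below builds the superoperator of a single-qubit depolarizing channel from its Kraus operators and truncates it to its dominant singular triplets. The choice of channel, the probability p, and the kept rank r are illustrative assumptions; the paper applies the idea to noise tensors inside a tensor network diagram rather than to a standalone 4x4 matrix.

```python
# Illustrative sketch only (not the TS16.8 algorithm): approximating a noise
# channel's superoperator by a truncated SVD that keeps the largest singular values.
import numpy as np

# Single-qubit depolarizing noise with probability p, written as Kraus operators.
p = 0.01
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)
kraus = [np.sqrt(1 - p) * I, np.sqrt(p / 3) * X, np.sqrt(p / 3) * Y, np.sqrt(p / 3) * Z]

# Superoperator on vec(rho), column-stacking convention:
# vec(K rho K^dagger) = (conj(K) kron K) vec(rho).
S = sum(np.kron(K.conj(), K) for K in kraus)

# Truncated SVD: keep only the r dominant singular triplets of the noise superoperator.
U, s, Vh = np.linalg.svd(S)
r = 2  # rank kept; for weak noise a small rank already captures most of the channel
S_approx = (U[:, :r] * s[:r]) @ Vh[:r, :]

print("singular values:", np.round(s, 4))
print("approximation error (Frobenius):", np.linalg.norm(S - S_approx))
```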
Date: Wednesday, 27 March 2024
Time: 08:30 CET - 10:00 CET
Location / Room: Auditorium 3
Session chair:
Michael Hutter, PQShield
Session co-chair:
Aritra Hazra, IIT Kharagpur, IN
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | TS21.1 | ALOHA-HE: A LOW-AREA HARDWARE ACCELERATOR FOR CLIENT-SIDE OPERATIONS IN HOMOMORPHIC ENCRYPTION Speaker: Florian Krieger, Institute of Applied Information Processing and Communications, Graz University of Technology, AT Authors: Florian Krieger, Florian Hirner, Ahmet Can Mert and Sujoy Sinha Roy, TU Graz, AT Abstract Homomorphic encryption (HE) has gained broad attention in recent years as it allows computations on encrypted data enabling secure cloud computing. Deploying HE presents a notable challenge since it introduces a performance overhead by orders of magnitude. Hence, most works target accelerating server-side operations on hardware platforms, while little attention has been given to client-side operations. In this paper, we present a novel design methodology to implement and accelerate the client-side HE operations on area-constrained hardware. We show how to design an optimized floating-point unit tailored for the encoding of complex values. In addition, we introduce a novel hardware-friendly algorithm for modulo-reduction of floating-point numbers and propose various concepts for achieving efficient resource sharing between modular ring and floating-point arithmetic. Finally, we use this methodology to implement an end-to-end hardware accelerator, Aloha-HE, for the client-side operations of the CKKS scheme. In contrast to existing work, Aloha-HE supports both encoding and encryption and their counterparts within a unified architecture. Aloha-HE achieves a speedup of up to 59x compared to prior hardware solutions./proceedings-archive/2024/DATA/944_pdf_upload.pdf |
08:35 CET | TS21.2 | ESC-NTT: AN ELASTIC, SEAMLESS AND COMPACT ARCHITECTURE FOR MULTI-PARAMETER NTT ACCELERATION Speaker: Yongqing Zhu, Beihang University, CN Authors: Zhenyu Guan1, Yongqing Zhu1, Yicheng Huang1, Luchang Lei2, Xueyan Wang1, Hongyang Jia3, Yi Chen4, Bo Zhang4, Jin Dong4 and Song Bian1 1Beihang University, CN; 2Tsinghua University, CN; 3Tsinghua University, Dept EE, BNRist, CN; 4Beijing Academy of Blockchain and Edge Computing, CN Abstract Fully homomorphic encryption (FHE) and post-quantum cryptography (PQC) heavily rely on the number theoretic transform (NTT) to accelerate polynomial multiplication. However, most existing NTT accelerators lack flexibility when the underlying modulus and polynomial length change. Current designs often store twiddle factors in on-chip storage, facing a noticeable drawback when frequent parameter changes occur, leading to a potential 50% decrease in computation speed due to input bandwidth limitations. To address this challenge, we propose ESC-NTT, a fully-pipelined and flexible architecture for handling NTTs with varying parameters. ESC-NTT, a fully custom architecture, continuously performs N-point (inverse) NTT, negacyclic NTT (NCN), and inverse NCN (INCN) without introducing bubbles during modulus and NTT length switches. Additionally, we introduce a twiddle factor generator (TFG) module to replace on-chip factor storage, saving 68.7% of the twiddle-factor bandwidth compared to inputting every factor. In the experiment, ESC-NTT is implemented on a Xilinx Alveo U280 FPGA and synthesized in a 28nm CMOS technology. In the case of frequent modulus switching and the same on-chip storage, the calculation speed of ESC-NTT is 1.05 to 241.39 times that of existing FHE accelerators when performing 4096-point NTTs. (A toy software NTT sketch follows this session's table.)/proceedings-archive/2024/DATA/293_pdf_upload.pdf |
08:40 CET | TS21.3 | SPECSCOPE: AUTOMATING DISCOVERY OF EXPLOITABLE SPECTRE GADGETS ON BLACK-BOX MICROARCHITECTURES Speaker: Khaled Khasawneh, George Mason University, US Authors: Najmeh Nazari1, Behnam Omidi2, Chongzhou Fang1, Hosein Mohammadi Makrani1, Setareh Rafatirad1, Avesta Sasan1, Houman Homayoun1 and Khaled N. Khasawneh2 1University of California, Davis, US; 2George Mason University, US Abstract Transient execution attacks pose information leakage risks in current systems. Disabling speculative execution, though mitigating the issue, results in significant performance loss. Accurate identification of vulnerable gadgets is essential for balancing security and performance. However, uncovering all covert channels is challenging due to complex microarchitectural analysis. This paper introduces SpecScope, a framework for automating the detection of Spectre gadgets in code using a black-box microarchitecture approach. SpecScope focuses on contention between transient and non-transient instructions to precisely identify and reduce false-positive Spectre gadgets, minimizing mitigation overhead. Tested on public libraries, SpecScope outperforms existing methods, reducing False-Positive rates by 8.9% and increasing True-Positive rates by 10.4%./proceedings-archive/2024/DATA/615_pdf_upload.pdf |
08:45 CET | TS21.4 | CYCPUF: CYCLIC PHYSICAL UNCLONABLE FUNCTION Speaker: Amin Rezaei, California State University, Long Beach, US Authors: Michael Dominguez and Amin Rezaei, California State University, Long Beach, US Abstract Physical Unclonable Functions (PUFs) leverage manufacturing process imperfections that cause propagation delay discrepancies for the signals traveling along these paths. While PUFs can be used for device authentication and chip-specific key generation, strong PUFs have been shown to be vulnerable to machine learning modeling attacks. Although there is an impression that combinational circuits must be designed without any loops, cyclic combinational circuits have been shown to increase design security against hardware intellectual property theft. In this paper, we introduce feedback signals into traditional delay-based PUF designs such as arbiter PUF, ring oscillator PUF, and butterfly PUF to give them a wider range of possible output behaviors and thus an edge against modeling attacks. Based on our analysis, cyclic PUFs produce responses that can be binary, steady-state, oscillating, or pseudo-random under fixed challenges. The proposed cyclic PUFs are implemented in field programmable gate arrays, and their power and area overhead, in addition to functional metrics, are reported compared with their traditional counterparts. The security gain of the proposed cyclic PUFs is also shown against state-of-the-art attacks./proceedings-archive/2024/DATA/622_pdf_upload.pdf |
08:50 CET | TS21.5 | MODELING ATTACK TESTS AND SECURITY ENHANCEMENT OF THE SUB-THRESHOLD VOLTAGE DIVIDER ARRAY PUF Speaker: Xiaole Cui, Peking University, CN Authors: Shengjie Zhou, Yongliang Chen, Xiaole Cui and Yun Liu, Peking University, CN Abstract Physical unclonable function (PUF) is widely used as the root of trust in IoT systems. The sub-threshold voltage divider array PUF was reported as an anti-modeling-attack PUF. It utilizes the nonlinear I-V relationship in the sub-threshold region of MOS transistors to improve the security. However, the security of this PUF has not been soundly analyzed. This work presents the simulation results of modeling attack tests which reveal the vulnerability of this PUF. In the attack, the sub-threshold voltage divider array PUF is modeled by a dedicated artificial neural network (ANN). The nonlinearity of the PUF is simplified based on its working principle. The simulation results show that the prediction accuracy achieves 97% when the number of training CRPs is 350 for the single-stage PUF, and it achieves 90% when the number of training CRPs is 300 for the multi-stage PUF. Furthermore, this work improves the original structure of the sub-threshold voltage divider array PUF, to enhance its anti-modeling-attack capability. The simulation results of modeling attack tests show that the prediction accuracy of the improved multi-stage PUF is reduced to about 50%, which implies that the improved PUF has a strong anti-modeling-attack capability./proceedings-archive/2024/DATA/732_pdf_upload.pdf |
08:55 CET | TS21.6 | ENHANCING SIDE-CHANNEL ATTACKS THROUGH X-RAY-INDUCED LEAKAGE AMPLIFICATION Speaker: Nasr-eddine OULDEI TEBINA, CNRS/TIMA, FR Authors: Nasr-eddine Ouldei Tebina1, Luc Salvo2, Laurent Maingault3, Nacer-Eddine Zergainoh1, Guillaume Hubert4 and Paolo Maistri1 1TIMA Laboratory, FR; 2SIMaP Laboratory, FR; 3CEA-Leti, FR; 4ONERA, FR Abstract In this paper, we propose a novel approach that utilizes localized X-ray irradiation to amplify data-dependent leakage currents in CMOS-based cryptography circuits. Our proposed technique strategically targets specific regions in a circuit using X-rays, inducing variations in dynamic and static power consumption due to Total Ionizing Dose (TID) effects, which increases or even reveals hidden data leakage. In this work, we present several experimental campaigns highlighting the benefits of our approach to combinational and sequential logic. Our experiments show a significant increase in information leakage in the targeted regions, which improves the signal-to-noise ratio coefficient and thus makes recovering the processed bytes easier. We envision the possibility of using this technique on full cryptographic designs on both FPGA and ASICs./proceedings-archive/2024/DATA/940_pdf_upload.pdf |
09:00 CET | TS21.7 | PHOTONNTT: ENERGY-EFFICIENT PARALLEL PHOTONIC NUMBER THEORETIC TRANSFORM ACCELERATOR Speaker: Xianbin Li, Hong Kong University of Science and Technology, HK Authors: Xianbin LI1, Jiaqi Liu1, Yuying ZHANG1, Yinyi LIU2, Jiaxu ZHANG1, Chengeng Li1, Shixi Chen1, Yuxiang Fu1, Fengshi Tian1, Wei ZHANG1 and Jiang Xu1 1Hong Kong University of Science and Technology, HK; 2Electronic and Computer Engineering Department, The Hong Kong University of Science and Technology, HK Abstract Fully homomorphic encryption (FHE) presents a promising opportunity to remove privacy barriers in various scenarios including cloud computing and secure database search, by enabling computation on encrypted data. However, integrating FHE with real-world applications remains challenging due to its significant computational overhead. In the FHE scheme, Number Theoretic Transform (NTT) consumes the primary computing resources and has great potential for acceleration. For the first time, we present a photonic NTT accelerator, PhotonNTT, with high energy-efficiency and parallelism to address the above challenge. Our approach involves formulating the NTT into matrix-vector multiplication (MVM) operations and mapping the data flow into parallel photonic MVM units. A dedicated data mapping scheme is proposed to introduce free spectral range (FSR) and distributed RAM design into the system, which enables a high bit-wise parallelism level. The system's reliability is validated through the Monte-Carlo bit error rate (BER) analysis. The experimental evaluation shows that the proposed architecture outperforms SOTA CiM-based NTT accelerators with an improvement of 50x in throughput and 63x improvement in energy efficiency./proceedings-archive/2024/DATA/103_pdf_upload.pdf |
09:05 CET | TS21.8 | MABFUZZ: MULTI-ARMED BANDIT ALGORITHMS FOR FUZZING PROCESSORS Speaker: Vasudev Gohil, Texas A&M University, US Authors: Vasudev Gohil1, Rahul Kande1, Chen Chen1, Jeyavijayan Rajendran1 and Ahmad-Reza Sadeghi2 1Texas A&M University, US; 2TU Darmstadt, DE Abstract As the complexities of processors keep increasing, the task of effectively verifying their integrity and security becomes ever more daunting. The intricate web of instructions, microarchitectural features, and interdependencies woven into modern processors pose a formidable challenge for even the most diligent verification and security engineers. To tackle this growing concern, recently, researchers have developed fuzzing techniques explicitly tailored for hardware processors. However, a prevailing issue with these hardware fuzzers is their heavy reliance on static strategies to make decisions in their algorithms. To address this problem, we develop a novel dynamic and adaptive decision-making framework, MABFuzz, which uses multi-armed bandit (MAB) algorithms to fuzz processors. MABFuzz is agnostic to, and hence applicable to, any existing hardware fuzzer. In the process of designing MABFuzz, we encounter challenges related to the compatibility of MAB algorithms with fuzzers and maximizing their efficacy for fuzzing. We overcome these challenges by modifying the fuzzing process and tailoring MAB algorithms to accommodate special requirements for hardware fuzzing. We integrate three widely used MAB algorithms in a state-of-the-art hardware fuzzer and evaluate them on three popular RISC-V-based processors. Experimental results demonstrate the ability of MABFuzz to cover a broader spectrum of processors' intricate landscapes and doing so with remarkable efficiency. In particular, MABFuzz achieves an average speedup of 53.72× in detecting vulnerabilities and an average speedup of 3.11× in achieving coverage compared to a state-of-the-art technique./proceedings-archive/2024/DATA/1157_pdf_upload.pdf |
09:10 CET | TS21.9 | PARASITIC CIRCUS: ON THE FEASIBILITY OF GOLDEN-FREE PCB VERIFICATION Speaker: Shahin Tajik, Worcester Polytechnic Institute, US Authors: Maryam Saadat Safa, Patrick Schaumont and Shahin Tajik, Worcester Polytechnic Institute, US Abstract Printed circuit boards (PCBs) are an integral part of electronic systems. Hence, verifying their physical integrity in the presence of supply chain attacks (e.g., tampering and counterfeiting) is of utmost importance. Recently, tamper detection techniques grounded in impedance characterization of PCB's Power Delivery Network (PDN) have gained prominence due to their global detection coverage, non-invasive, and low-cost nature. Similar to other physical verification methods, these techniques rely on the existence of a physical golden sample for signature comparisons. However, having access to a physical golden sample for golden signature extraction is not feasible in many real-world scenarios. In this work, we assess the feasibility of eliminating a physical golden sample and replacing it with a simulated golden signature obtained by the PCB design files. By performing extensive simulation and measurements on an in-house designed PCB, we demonstrate how the parasitic impedance of the PCB components plays a major role in reaching a successful verification. Based on the obtained results and using statistical metrics, we show that we can mitigate the discrepancy between collected signatures from simulation and measurements./proceedings-archive/2024/DATA/575_pdf_upload.pdf |
09:11 CET | TS21.10 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
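The TS21.8 abstract above frames the choice among fuzzing strategies as a multi-armed bandit problem. The sketch below is a rough, generic illustration of that idea only, not the authors' MABFuzz algorithm or tool: a UCB1 bandit picks among a few hypothetical mutation strategies and is rewarded whenever a fuzzing round yields new coverage. The strategy names and the coverage oracle are invented placeholders.

```python
import math
import random

# Hypothetical mutation strategies (placeholders, not MABFuzz's actual arms).
STRATEGIES = ["bitflip", "opcode_swap", "havoc"]

def run_fuzz_round(strategy):
    """Stand-in for one fuzzing round with the chosen strategy.
    Returns 1.0 if the round discovered new coverage, else 0.0."""
    bias = {"bitflip": 0.2, "opcode_swap": 0.35, "havoc": 0.5}[strategy]
    return 1.0 if random.random() < bias else 0.0

def ucb1_fuzz(rounds=200):
    counts = {s: 0 for s in STRATEGIES}     # times each arm was pulled
    rewards = {s: 0.0 for s in STRATEGIES}  # cumulative reward per arm
    for t in range(1, rounds + 1):
        untried = [s for s in STRATEGIES if counts[s] == 0]
        if untried:
            arm = untried[0]  # play every arm once before using UCB
        else:
            arm = max(STRATEGIES, key=lambda s: rewards[s] / counts[s]
                      + math.sqrt(2 * math.log(t) / counts[s]))
        counts[arm] += 1
        rewards[arm] += run_fuzz_round(arm)
    return counts, rewards

if __name__ == "__main__":
    counts, rewards = ucb1_fuzz()
    for s in STRATEGIES:
        print(s, counts[s], round(rewards[s] / max(counts[s], 1), 3))
```

In a real hardware fuzzer the reward would come from coverage feedback on the design under test rather than from a random stub.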
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 08:30 CET - 12:30 CET
Location / Room: Break-Out Room S3+4
Organisers
David Atienza, EPFL, CH
Davide Schiavone, ESL, EPFL, CH
Jose Miranda, ESL, EPFL, CH
Alfonso Rodriguez, CEI, UPM, ES
Alessio Burrello, Polytechnic of Turin, IT
Daniele Pagliari, Polytechnic of Turin, IT
Maurizio Martina, Polytechnic of Turin, IT
As our world becomes more interconnected and reliant on digital systems, the need for transparency, accessibility, and collaborative innovation in hardware and software design has become paramount. Closed and proprietary solutions often limit user sovereignty and control, hinder interoperability, and stifle innovation by locking users into a specific ecosystem. In contrast, open-source initiatives enable rapid prototyping of novel ideas and systems and free users from lock-in to any specific vendor by empowering academic and industrial users to control, modify, and build upon existing technology. This workshop will bring both academic and industry members into a discussion about open-source hardware (HW) and software (SW) interests and showcase the potential of open-source technology with the eXtendable Heterogeneous Energy-Efficient Platform, X-HEEP. X-HEEP is an EU and RISC-V-based enabler platform for rapid and sustainable VLSI/ASIC system design, automation, and testing. Following the success stories of open-source projects in the SW ecosystem, with the Linux kernel as the best-known example, X-HEEP aims to foster innovative, easy-to-integrate, easy-to-use ASIC-oriented technology. Similarly, other open-source HW initiatives, such as OpenHW Group, OpenROAD, and SkyWater, are already revolutionising the semiconductor industry by democratising chip design via open-source IPs, EDA tools, and PDKs. To this end, X-HEEP enables HW research and freely delivers industrial-grade, silicon-proven open-source CPU IPs from OpenHW. Together with our academic and industrial partners, we will showcase the usage of X-HEEP for 1) SW engineers to deploy end-to-end applications, 2) HW engineers to integrate and evaluate their own IPs, and 3) system-level designers to architect a full system using its configurability and extendibility capabilities.
Open-source HW and SW are becoming essential in the rapidly evolving technological landscape. This workshop fosters a culture of shared knowledge and creativity, enabling a broader range of applications, from affordable healthcare devices to sustainable energy solutions. It stands as a critical pillar for a more inclusive, innovative, and sustainable technological future. Thus, open-source hardware and software are a paramount topic for the DATE conference, as they encourage ethical and sustainable practices by promoting repairability and reducing electronic waste. This topic is fully integrated into DATE’s technical scope across the four conference tracks. Moreover, the presentation and fostering of open, easy-to-integrate, and easy-to-use ASIC-oriented technology is innovative, and we can expect submissions in this area in the future.
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Break-Out Room S6+7
Presenter:
Cristiana Bolchini, Politecnico di Milano, IT
Authors:
Cristiana Bolchini, Politecnico di Milano, IT
Catherine Le Lan, SYNOPSYS, FR
Elena Ioana Vatajelu, TIMA, FR
Ana Lucia Varbanescu, University of Twente, NL
The 4th Advancing Diversity in EDA (DivEDA) forum aims to support women and underrepresented minorities (URM) in advancing their careers in academia and industry, fostering diversity within the EDA community. A diverse community is pivotal in accelerating innovation within the EDA ecosystem, ultimately contributing to societal progress. Through an interactive environment, DivEDA seeks to offer practical advice for women and URM to navigate career obstacles, while also facilitating connections between senior and junior researchers to foster a more inclusive community. This initiative builds upon previous diversity-focused efforts within the EDA field, with prior editions of DivEDA held at DATE’18, DAC’19, and DATE’22. This year's forum, embedded into the DATE program, will feature a panel and discussion addressing topics such as tokenism, barriers to entry, and strategies for addressing bias within the EDA community. DivEDA has been supported by IEEE CEDA and ACM SIGDA.
Forum Organizers: Nusa Zidaric and Marina Zapater Sancho
Advisors: Ayse K. Coskun and Nele Mentens
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Break-Out Room S8
Session chair:
Ian O'Connor, École Centrale de Lyon, FR
Session co-chair:
Pascal Vivet, CEA-List, FR
Organisers:
Michael Niemier, University of Notre Dame, US
Ian O'Connor, École Centrale de Lyon, FR
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | FS02.1 | HETEROGENEOUS SCALING DRIVEN BY SYSTEM-TECHNOLOGY CO-OPTIMIZATION Presenter: Julien Ryckaert, IMEC, BE Author: Julien Ryckaert, IMEC, BE Abstract This talk will provide an overview of challenges and interests with respect to emerging technologies (e.g. new devices, emerging memories, 3D integration, wafer backside processing, etc.) as well as approaches to their integration in a common technology platform. It will highlight how heterogeneous scaling is aimed at addressing the growing diversity of system applications requiring large power/performance improvements that miniaturization alone cannot deliver. It will describe the current interests in industry, research labs/consortia, and academia, and provide important context for subsequent talks. |
11:22 CET | FS02.2 | PROSPECTS AND TRADEOFFS FOR TECHNOLOGY-ENABLED IN-MEMORY COMPUTING ARCHITECTURES VERSUS HIGHLY-SCALED CMOS SOLUTIONS Presenter: Michael Niemier, University of Notre Dame, US Authors: Michael Niemier1, Zephan Enciso1, Mohammad Mehdi Sharifi1, X. Sharon Hu1, Ian O'Connor2, Nashrah Afroze3 and Asif Khan3 1University of Notre Dame, US; 2Lyon Institute of Nanotechnology, FR; 3Georgia Tech, US Abstract This talk examines the prospects for technology-enabled, in-memory computing (IMC) architectures that may afford novel compute functionality within the memory itself, thereby avoiding the data transfer overheads associated with more conventional architectures. While IMC has the potential to provide substantial application-level benefits with respect to figures of merit such as energy and delay, it may also be challenged by issues such as the ability to reliably program and scale multi-bit memory cells and increased device variation. This can in turn impact other figures of merit such as application-level accuracy. It is also important to quantify the potential impact of technology-enabled IMC architectures relative to highly-scaled CMOS solutions so that industrial partners can prioritize and justify future investments in specific devices. The talk will also highlight how collaborations both up and down the stack and within the design automation community can facilitate these pathfinding exercises. |
11:45 CET | FS02.3 | SYSTEM-TECHNOLOGY CO-OPTIMIZATION IN THE ERA OF CHIPLETS Presenter: Alexander Graening, University of California, Los Angeles, US Authors: Alexander Graening1, Ravit Sharma2 and Puneet Gupta1 1University of California, Los Angeles, US; 2University of California at Los Angeles, US Abstract As conventional technology scaling becomes harder, 2.5D and 3D integration provides a viable pathway to building larger systems at lower cost. In this talk, we will describe our efforts toward building materials-to-algorithms cross-stack pathfinding frameworks to predict system-level power, performance, and cost as a function of low-level technology changes. First, we describe a system cost modeling framework which allows for modeling of manufacturing, packaging, and test costs of 2D/2.5D/3D heterogeneously integrated systems. Next, we describe an end-to-end system-technology co-optimization framework, DeepFlow, for large distributed 2.5D/3D systems, especially for the important class of distributed machine learning training applications. DeepFlow leverages a self-consistent modeling pipeline from low-level logic/memory/integration technology parameters to micro-architectures and finally to software parallelization approaches for distributed training of large language models on hundreds to thousands of accelerators. Finally, we discuss cost-performance Pareto trade-offs for an exemplar multi-GPU 2.5D system analyzed using these frameworks. (An illustrative yield/cost sketch follows this session table.) |
12:07 CET | FS02.4 | AUTOMATIC OPTIMIZATION FOR HETEROGENEOUS IN-MEMORY COMPUTING Presenter: Jeronimo Castrillon, TU Dresden, DE Authors: Jeronimo Castrillon1, Joao Paulo De Lima2, Asif Ali Khan1 and Hamid Farzaneh1 1TU Dresden, DE; 2Technische Universität Dresden, DE Abstract Fuelled by exciting advances in materials and devices, in-memory computing architectures now represent a promising avenue to advance computing systems. Plenty of manual designs have already demonstrated orders of magnitude improvement in compute efficiency compared to classical Von Neumann machines in different application domains. In this talk we discuss automation flows for programming and exploring the parameter space of in-memory architectures. We report on current efforts to build an extensible framework around the MLIR compiler infrastructure to abstract from individual technologies to foster re-use. Concretely, we present optimising flows for in-memory accelerators based on crossbars, content addressable memories, and bulk-wise logic operations. We believe this kind of automation is essential to more quickly navigate the heterogeneous landscape of in-memory accelerators, and to bring the benefits of emerging architectures to a broader range of applications. |
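FS02.3 above mentions a cost-modeling framework for 2D/2.5D/3D heterogeneously integrated systems. As a back-of-the-envelope illustration of why disaggregation can reduce cost (using the standard negative-binomial die-yield model, not the DeepFlow framework itself), the sketch below compares one large monolithic die with four smaller chiplets of the same total silicon area. The defect density, clustering parameter, wafer cost, and packaging adder are assumed values.

```python
def die_yield(area_cm2, d0=0.1, alpha=2.0):
    """Negative-binomial yield model: Y = (1 + A*D0/alpha)^(-alpha).
    d0 is defects per cm^2, alpha the clustering parameter (assumed values)."""
    return (1.0 + area_cm2 * d0 / alpha) ** (-alpha)

def cost_per_good_die(area_cm2, wafer_cost=10000.0, wafer_area_cm2=706.9):
    """Cost of one yielded die; ignores edge loss and scribe lines (illustrative)."""
    dies_per_wafer = wafer_area_cm2 / area_cm2
    return wafer_cost / (dies_per_wafer * die_yield(area_cm2))

# One 8 cm^2 monolithic die versus four 2 cm^2 chiplets plus packaging overhead.
monolithic = cost_per_good_die(8.0)
chiplet_based = 4 * cost_per_good_die(2.0) + 20.0  # assumed 2.5D packaging adder
print(f"monolithic: {monolithic:.1f}  chiplets + package: {chiplet_based:.1f}")
```

With these assumed numbers the four smaller dice yield markedly better than the single large die, which is the basic effect that 2.5D/3D cost models quantify more rigorously.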
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Break-Out Room S1+2
Session chair:
Frank Oppenheimer, OFFIS, DE
Session co-chair:
Maksim Jenihhin, Tallinn University of Technology, EE
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | MPP02.1 | AMBEATION: ANALOG MIXED-SIGNAL BACK-END DESIGN AUTOMATION WITH MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE TECHNIQUES Speaker: Daniele Jahier Pagliari, Politecnico di Torino, IT Authors: Giulia Aliffi1, Joao Baixinho2, Dalibor Barri3, Francesco Daghero4, Nicola Di Carolo5, Gabriele Faraone5, Michelangelo Grosso6, Daniele Jahier Pagliari4, Jiri Jakovenko7, Vladimír Janíček7, Dario Licastro5, Vazgen Melikyan8, Matteo Risso4, Vittorio Romano1, Eugenio Serianni5, Martin Štastný7, Patrik Vacula3, Giorgia Vitanza1 and Chen Xie4 1University of Catania, IT; 2Synopsys, PT; 3STMicroelectronics, CZ; 4Politecnico di Torino, IT; 5STMicroelectronics, IT; 6STMicroelectronics s.r.l., IT; 7Czech TU in Prague, CZ; 8Synopsys, AM Abstract For the competitiveness of the European economy, automation techniques in the design of complex electronic systems are a prerequisite for winning the global chip challenge. Specifically, while the physical design of digital circuits can be largely automated, the physical design of Analog-Mixed-Signal (AMS) integrated circuits (IC) built with an analog-on-top flow, where digital subsystems are instantiated as Intellectual Property (IP) modules, is still carried out predominantly by hand, with a time-consuming methodology. The AMBEATion consortium, including global semiconductor and design automation companies as well as leading universities, aims to address this challenge by combining classic Electronic Design Automation (EDA) algorithms with novel Artificial Intelligence and Machine Learning (ML) techniques. Specifically, the scientific and technical result expected at the end of the project will be a new methodology, implemented in a framework of scripts for Analog Mixed Signal placement, internally making use of state-of-the-art AI/ML models, and fully integrated with Industrial design flows. With this methodology, the AMBEATion consortium aims to reduce the design turnaround-time and, consequently, the silicon development costs of complex AMS IC./proceedings-archive/2024/DATA/3007_pdf_upload.pdf |
11:05 CET | MPP02.2 | DESIGN AUTOMATION FOR CYBER-PHYSICAL PRODUCTION SYSTEMS: LESSONS LEARNED FROM THE DEFACTO PROJECT Speaker: Michele Lora, Università di Verona, IT Authors: Michele Lora1, Sebastiano Gaiardelli1, Chanwook Oh2, Stefano Spellini1, Pierluigi Nuzzo2 and Franco Fummi1 1Università di Verona, IT; 2University of Southern California, US Abstract The DeFacto project, supported by the European Commission via a Marie Skłodowska-Curie Global Individual Fellowship, tackles the complexity arising from the transformation of industrial manufacturing systems into intricate cyber-physical systems. This evolution offers unprecedented opportunities but also poses intellectual and engineering challenges. DeFacto aims to advance the design automation of cyber-physical production systems by developing innovative modeling paradigms, scalable algorithms, software architectures, and tools. In the DeFacto approach, production systems are managed through service-oriented manufacturing software architectures. System-level models capture the features and requirements of production systems, representing both production and computational processes as services provided by the infrastructure. Methodologies for system analysis and optimization rely on compositional abstractions of system behaviors grounded in assume-guarantee contracts. This paper outlines key research endeavors, findings, and lessons learned from the DeFacto project./proceedings-archive/2024/DATA/3009_pdf_upload.pdf |
11:10 CET | MPP02.3 | FORMAL METHODS FOR HIGH INTEGRITY GPU SOFTWARE DEVELOPMENT AND VERIFICATION Speaker: Leonidas Kosmidis, BSC, ES Authors: Dimitris Aspetakis1, Leonidas Kosmidis2, Jose Ruiz3 and Gabor Marosy4 1BSC, ES; 2UPC | BSC, ES; 3AdaCore, FR; 4European Space Agency (ESA), NL Abstract Modern safety-critical systems require high levels of performance for advanced functionalities, which are not possible with the simple conventional architectures currently used in them. Embedded General Purpose Graphics Processing Units (GPGPUs) are among the hardware technologies which can provide the high performance required in these domains. However, their massively parallel nature complicates the verification of their software and increases its cost because it usually involves code coverage through extensive human-driven testing. The Ada SPARK language has traditionally been used in highly-critical environments for its formal verification capabilities and powerful type system. The use of such tools, which are backed by theorem provers, has significantly lowered the amount of effort needed to validate the functionality of safety-critical systems. In this European Space Agency (ESA) funded project, we utilize AdaCore's CUDA backend for Ada in conjunction with the SPARK language subset to assess the state of static verification for GPU kernels. We assess the error detection capabilities of the available tools and we formulate a methodology to maximise their effectiveness. Moreover, our project results using ESA's open source GPU4S Benchmarking suite show that common programming mistakes in GPU software development can be prevented./proceedings-archive/2024/DATA/3013_pdf_upload.pdf |
11:15 CET | MPP02.4 | THE METASAT MODEL-BASED ENGINEERING WORKFLOW AND DIGITAL TWIN CONCEPT Speaker: Leonidas Kosmidis, BSC, ES Authors: Alejandro J. Calderon1, Irune Yarza1, Stefano Sinisi2, Lorenzo Lazzara2, Valerio Di Valerio2, Giulia Stazi2, Leonidas Kosmidis3, Matina Maria Trompouki4, Alessandro Ulisse2, Aitor Amonarriz1 and Peio Onaindia1 1Ikerlan Technology Research Centre, ES; 2Collins Aerospace, IT; 3UPC | BSC, ES; 4BSC, ES Abstract Considering the complexity in the design of future satellites and the need for compliance with ECSS standards, the METASAT project proposes a novel design methodology based on model-based engineering and supported by open architecture hardware. The initiative highlights the potential of software virtualisation layers, like hypervisors, to meet standards compliance on high-performance computing platforms. Our focus is on the development of a toolchain tailored for these advanced hardware/software layers. In our view, without such innovation, the satellite industry may face high and unsustainable development costs and timelines, which could affect its competitiveness and reliability. This paper provides an overview of the model-based engineering toolchain, the workflow, and the digital twin concept proposed for the METASAT project. We present the purpose of each component and how they work together in the broader METASAT vision, with the objective of showing how this approach can make the development process more efficient and enhance dependability in the sector./proceedings-archive/2024/DATA/3016_pdf_upload.pdf |
11:20 CET | MPP02.5 | AN AI-ENABLED FRAMEWORK FOR SMART SEMICONDUCTOR MANUFACTURING Speaker: Nicola Dall'Ora, Università di Verona, IT Authors: Khaled Alamin1, Davide Appello2, Alessandro Beghi3, Nicola Dall'Ora4, Fabio Depaoli1, Santa Di Cataldo5, Franco Fummi4, Sebastiano Gaiardelli4, Michele Lora4, Enrico Macii5, Alessio Mascolini1, Daniele Pagano6, Francesco Ponzio5, Gian Antonio Susto7 and Sara Vinco5 1Polytechnic of Turin, IT; 2Technoprobe Spa, IT; 3University of Padua, IT; 4Università di Verona, IT; 5Politecnico di Torino, IT; 6STMicroelectronics, IT; 7University of Padova, IT Abstract With the rise of Machine Learning (ML) and Artificial Intelligence (AI), the semiconductor industry is undergoing a revolution in how it approaches manufacturing. The SMART-IC project (DATE'24 MPP category: initial stage) works in this direction, by proposing an AI-enabled framework to support the smart monitoring and optimization of the semiconductor manufacturing process. An AI-powered engine examines sensor data recording physical parameters during production (like gas flow, temperature, voltage, etc.) as well as test data, with different goals: (1) the identification of anomalies in the production chain, either offline from collected data-traces or online from a continuous stream of sensed data; (2) the forecasting of data for future production; and (3) the automatic generation of synthetic traces, to strengthen the data-based algorithms. All such tasks provide valuable information to an advanced Manufacturing Execution System (MES), which reacts by optimizing the production process and management of the equipment maintenance policies. SMART-IC is a 300k€ academic project funded by the Italian Ministry of University and supported by STMicroelectronics and Technoprobe with industrial expertise and real-world applications. This paper shares the view of SMART-IC on the future of semiconductor manufacturing, the preliminary efforts, and the future results that will be reached by the end of the project, in 2025./proceedings-archive/2024/DATA/3017_pdf_upload.pdf |
11:25 CET | MPP02.6 | DESIGN AUTOMATION FOR QUANTUM COMPUTING: INTERMEDIATE STAGE REPORT OF THE ERC CONSOLIDATOR GRANT "DAQC" Speaker and Author: Robert Wille, TU Munich, DE Abstract We are at the dawn of a new "computing age" in which quantum computers will find their way into practical applications. However, while impressive accomplishments can be observed in the physical realization of quantum computers, the development of automated tools and methods that provide assistance in the design and realization of applications for those devices is at risk of not being able to keep up with this development anymore, leaving a situation where we might have powerful quantum computers but hardly any proper means to actually use them. The ERC Consolidator project "Design Automation for Quantum Computing" aims to provide a solution for this upcoming design gap by developing efficient and practically relevant design methods for this emerging technology. While the current state of the art suffers from the interdisciplinarity of quantum computing (leading to the consideration of inappropriate models, inconsistent interpretations, and "wrong" problem formulations), this project builds a bridge between the design automation community and the quantum computing community. This will make it possible to fully exploit the potential of design automation, which is so far hardly utilized in quantum computing. This intermediate stage report provides an overview of the motivation and approach of the project and showcases selected results and outreach activities./proceedings-archive/2024/DATA/3018_pdf_upload.pdf |
11:30 CET | MPP02.7 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Auditorium 2
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | SD04.1 | HOW TO MAKE FEDERATED LEARNING ROBUST AND SECURE? Presenter: Aurelien Mayoue, CEA, FR Author: Aurelien Mayoue, CEA, FR Abstract Federated learning (FL) is a new machine learning paradigm that allows models to be built in a collaborative and decentralized way. Contrary to traditional server-side approaches, which aggregate data on a central server for training, FL leaves the training data distributed on the end user devices while learning a shared global model by aggregating only ephemeral locally-computed model updates. Such a decentralized optimization procedure helps to ensure data privacy and reduces communication costs as the data remains in its original location. However, FL still faces key challenges on which a growing research effort is currently focused: How to make the FL process fair despite the data distribution heterogeneity across the participants? How to ensure the integrity of the federated model despite the presence of faults or malicious actors? How to secure the training process against attackers who could exploit model updates to retrieve sensitive information about participants? In this presentation, we will start by introducing the FL process and its main key challenges. Next, we will propose solutions to address these challenges and illustrate them with use cases. (A minimal federated-averaging sketch follows this session table.) |
12:00 CET | SD04.2 | TRUSTWORTHY AI IN THALES: BEYOND PURE PERFORMANCE Presenter: Simon Fossier, Thales, FR Author: Simon Fossier, Thales, FR Abstract Recent advances in Artificial Intelligence have opened the way to its integration in safety-critical complex systems. Designing such systems involves defining a structured process to ensure their reliability, which in turn requires bringing safety analysis to AI-based sub-components. Robustness, embeddability, formal and experimental guarantees, ethical and legal aspects – many dimensions of trustworthiness are involved in designing systems, and AI-based systems need to tackle them, despite their specificities. |
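To make the aggregation step described in SD04.1 concrete, the following minimal sketch implements federated averaging for a one-parameter least-squares model: each client performs local updates on its own data, and only the resulting weights, not the data, are averaged on the server side, weighted by client dataset size. This is a didactic toy under simplifying assumptions (no privacy, robustness, or communication mechanisms), not CEA's framework.

```python
def local_update(global_w, client_data, lr=0.01):
    """Stand-in for local training: one gradient step per data point.
    client_data is a list of (x, y) pairs; the model is a single weight w in y ~ w*x."""
    w = global_w
    for x, y in client_data:
        grad = 2 * (w * x - y) * x  # least-squares gradient
        w -= lr * grad
    return w

def fedavg(global_w, clients, rounds=20):
    """Federated averaging: aggregate client updates weighted by dataset size."""
    for _ in range(rounds):
        total = sum(len(data) for data in clients)
        updates = [(local_update(global_w, data), len(data)) for data in clients]
        global_w = sum(w * n for w, n in updates) / total
    return global_w

if __name__ == "__main__":
    # Three clients with heterogeneous amounts of data, all drawn from y = 3x.
    clients = [[(1, 3), (2, 6)], [(3, 9)], [(1, 3), (4, 12), (5, 15)]]
    print(round(fedavg(0.0, clients), 3))  # converges to roughly 3.0
```

Robust and secure variants, as discussed in the talk, would add update validation, outlier-resistant aggregation, or secure aggregation on top of this basic loop.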
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Auditorium 3
Session chair:
Stefano Quer, Politecnico di Torino, IT
Session co-chair:
Farimah Farahmandi, University of Florida, US
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | TS02.1 | DEDUCTIVE FORMAL VERIFICATION OF SYNTHESIZABLE, TRANSACTION-LEVEL HARDWARE DESIGNS USING COQ Speaker and Author: Tobias Strauch, EDAptix e.K., DE Abstract We present the compilation process of synthesizable, transaction-level hardware designs into Gallina code. This allows the Coq theorem prover to execute formal verification scripts on top of the automatically generated code to prove individual theorems. The novelty of this work is the use of PDVL (Programming Design and Verification Language), which adds specific language constructs to SystemVerilog to enable an aspect-oriented, transaction-level design style. In this paper, we show that PDVL code can also be compiled into Gallina code, with the advantage that it is highly usable for theorem proving and perfect for using the Coq proof assistant. We outline individual proof strategies for theorem proving, sequential equivalence checking, property checking, and show applications of symbolic simulation techniques. This is done by using a complex timer peripheral, AES and UART modules as well as RISC-V based designs./proceedings-archive/2024/DATA/95_pdf_upload.pdf |
11:05 CET | TS02.2 | FORMAL VERIFICATION OF BOOTH RADIX-8 AND RADIX-16 MULTIPLIERS Speaker and Author: Mertcan Temel, Intel Corporation, US Abstract In processors where low power consumption is essential, higher Booth encodings such as radix-8 and radix-16 may be preferred over the more common radix-4 encoding. With higher radix multipliers included in commercial hardware, formal verification of such designs poses a real challenge. Verifying multipliers is difficult in general; state-of-the-art verification methods like S-C-Rewriting and computer algebra have primarily addressed the multiplier verification problem for lower Booth radices such as radix-4. However, these methods do not scale well for higher radices. This paper explores the cause of this limitation and proposes a list of solutions for automatic, sound, and fast verification of such designs. These solutions include three improvements for the S-C-Rewriting method, some of which may be applicable to computer algebra methods as well. Experiments have shown that higher Booth radix multipliers can now be verified soundly, fully automatically, and in a matter of seconds for common operand sizes such as 64 and 128 bits./proceedings-archive/2024/DATA/116_pdf_upload.pdf |
11:10 CET | TS02.3 | PURSE: PROPERTY ORDERING USING RUNTIME STATISTICS FOR EFFICIENT MULTI-PROPERTY VERIFICATION Speaker: Aritra Hazra, IIT Kharagpur, IN Authors: Sourav Das1, Aritra Hazra1, Pallab Dasgupta2, Sudipta Kundu2 and Himanshu Jain2 1IIT Kharagpur, IN; 2Synopsys, US Abstract Multi-property verification has emerged as a contemporary challenge in the chip design industry. With designs now encompassing hundreds of properties, conventional sequential verification without information sharing is no longer preferred. Past attempts towards grouping or ordering properties based on cone-of-influence (COI) are typically ineffective for complex designs. This paper introduces PURSE, a novel approach that addresses this challenge by dynamically reordering properties for sequential and incremental solving. By identifying and prioritizing simpler properties, the process accelerates convergence. This article presents two dynamic reordering techniques guided by statistical data gathered from the IC3/Property Directed Reachability (PDR) proof engine. The study compares dynamic ordering strategies against static ordering and a default ordering based on design structure. Empirical results from various industrial designs demonstrate that our proposed methodology performs better in most cases, with up to 25% improvements in convergence./proceedings-archive/2024/DATA/488_pdf_upload.pdf |
11:15 CET | TS02.4 | PARALLEL GROBNER BASIS REWRITING AND MEMORY OPTIMIZATION FOR EFFICIENT MULTIPLIER VERIFICATION Speaker: Hongduo Liu, The Chinese University of Hong Kong, HK Authors: Hongduo Liu1, Peiyu Liao1, Junhua Huang2, Hui-Ling Zhen3, Mingxuan Yuan2, Tsung-Yi Ho1 and Bei Yu1 1The Chinese University of Hong Kong, HK; 2Huawei Noah's Ark Lab, HK; 3Huawei Inc, HK Abstract Formal verification of integer multipliers is a significant but time-consuming problem. This paper introduces a novel approach that emphasizes the acceleration of Symbolic Computer Algebra (SCA)-based verification systems from the perspective of efficient implementation instead of traditional algorithm enhancement. Our first strategy involves leveraging parallel computing to accelerate the rewriting process of the Grobner basis. Confronting the issue of frequent memory operations during the Grobner basis reduction phase, we propose a double buffering scheme coupled with an operator scheduler to minimize memory allocation and deallocation. These unique contributions are integrated into a state-of-the-art verification tool and result in substantial improvements in verification speed, demonstrating more than 15× speedup for a 1024×1024 multiplier./proceedings-archive/2024/DATA/590_pdf_upload.pdf |
11:20 CET | TS02.5 | FORMAL VERIFICATION OF SECURE BOOT PROCESS Speaker: Sriram Vasudevan, Temasek Lab @ NTU, SG Authors: Sriram Vasudevan1, Prasanna Ravi2, Arpan Jati3, Shivam Bhasin4 and Anupam Chattopadhyay3 1Nanyang Technological University, SG; 2Temasek Labs and School of Computer Science and Engineering, Nanyang Technological University, Singapore, SG; 3Nanyang Technological University, SG; 4Temasek Laboratories, Nanyang Technological University, SG Abstract Formal verification is widely used for checking the correctness of a system with respect to succinct properties at an early design phase. In recent times, these methods have also been adopted to validate the security of a system by suitably abstracting the security-oriented properties. In this work, we focus on a recently reported attack on the secure boot process of the Zynq-7000 platform. We study the vulnerability with formal analysis of its boot loader code. The challenge is to model the system and identify the right set of properties so that the vulnerability is exposed quickly. We show that UPPAAL model checking can find the vulnerability almost instantaneously using four properties, with a peak virtual memory usage of about 50 MB per property. To the best of our knowledge, we perform the first analysis with UPPAAL on a concrete software implementation and perform a case study on a real security vulnerability reported in a CVE./proceedings-archive/2024/DATA/756_pdf_upload.pdf |
11:25 CET | TS02.6 | COMPLETE AND EFFICIENT VERIFICATION FOR A RISC-V PROCESSOR USING FORMAL VERIFICATION Speaker: Lennart Weingarten, University of Bremen, DE Authors: Lennart Weingarten1, Kamalika Datta1, Abhoy Kole2 and Rolf Drechsler3 1University of Bremen, DE; 2DFKI, DE; 3University of Bremen | DFKI, DE Abstract Formal verification techniques are computationally complex and the exact time and space complexities are in general not known, which makes the performance of the process unpredictable. Some of the recent works have shown that it is possible to carry out formal verification with polynomial time and space complexities for specific designs like arithmetic circuits. However, the methodology used cannot be directly extended to complex designs like processors. A recent work has shown polynomial verification of a single-cycle RISC-V processor with limited functionality, which considers only the combinational parts of the ALU. In this paper we propose for the first time a complete verification approach that covers all the functional units of the processor, and at the same time considers its sequential behavior. Experimental results show that the verification can be carried out in polynomial time, and also demonstrate significant improvement over previous methods./proceedings-archive/2024/DATA/899_pdf_upload.pdf |
11:30 CET | TS02.7 | A TRANSISTOR LEVEL RELATIONAL SEMANTICS FOR ELECTRICAL RULE CHECKING BY SMT SOLVING Speaker: Oussama Oulkaid, Université Lyon 1 - Université Grenoble Alpes - Aniah, FR Authors: Oussama Oulkaid1, Bruno Ferres2, Matthieu Moy3, Pascal Raymond4, Mehdi Khosravian5, Ludovic Henrio3 and Gabriel Radanne3 1University Lyon, EnsL, UCBL, CNRS, Inria, LIP, F-69342, LYON Cedex 07, France - University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, France - Aniah, 38000 Grenoble, France, FR; 2University Lyon, EnsL, UCBL, CNRS, Inria, LIP, F-69342, LYON Cedex 07, France - University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, France, FR; 3University Lyon, EnsL, UCBL, CNRS, Inria, LIP, F-69342, LYON Cedex 07, France, FR; 4University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, France, FR; 5Aniah, FR Abstract We present a novel technique for Electrical Rule Checking (ERC) based on formal methods. We define a relational semantics of Integrated Circuits (IC) as a means to model circuits' behavior at transistor-level. We use Z3, a Satisfiability Modulo Theory (SMT) solver, to verify electrical properties on circuits — thanks to the defined semantics. We demonstrate the usability of the approach to detect current leakage due to missing level-shifter on large industrial circuits, and we conduct experiments to study the scalability of the approach./proceedings-archive/2024/DATA/243_pdf_upload.pdf (An illustrative SMT sketch follows this session table.) |
11:35 CET | TS02.8 | LLM-BASED PROCESSOR VERIFICATION: A CASE STUDY FOR NEUROMORPHIC PROCESSOR Speaker: Chao Xiao, National University of Defense Technology, CN Authors: Chao Xiao, Yifei Deng, Zhijie Yang, Renzhi Chen, Hong Wang, Jingyue Zhao, Huadong Dai, Lei Wang, Yuhua Tang and Weixia Xu, National University of Defense Technology, CN Abstract With the increasing complexity of hardware designs, conducting verification before tapeout is of utmost importance. Simulation-based verification remains the primary method owing to its scalability and flexibility. A comprehensive verification of modern processors usually requires numerous effective tests to cover all possible conditions and use cases, leading to significant time, resource, and manual effort even with EDA tool support. Moreover, novel domain-specific architectures (DSAs), such as neuromorphic processors, will exacerbate the challenge of verification. Fortunately, emerging large language models (LLMs) have demonstrated a powerful ability to complete specific tasks assigned by human instructions. In this paper, we explore the challenges and opportunities encountered when using LLMs to accelerate DSA verification using the proposed LLM-based workflow consisting of test generation, compilation & simulation, and result collection & processing. By verifying a RISC-V core and a neuromorphic processor, we examine the capabilities and limitations of LLMs when using them for the functional verification of traditional processors and emerging DSAs. In the experiment, 36 C programs and 128 assembly snippets for the RISC-V core and the neuromorphic processor are generated using an advanced LLM to demonstrate our claim. The experimental results show that the code coverage based on the LLM test generation can reach 89% and 91% for the above two architectures respectively, showing a promising research direction for future processor verification in the new golden age for computer architecture./proceedings-archive/2024/DATA/204_pdf_upload.pdf |
11:40 CET | TS02.9 | RANGE-BASED RUN-TIME REQUIREMENT ENFORCEMENT OF NON-FUNCTIONAL PROPERTIES ON MPSOCS Speaker: Khalil Esper, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Authors: Khalil Esper, Stefan Wildermann and Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Abstract Embedded system applications normally come with a set of non-functional requirements defined over properties (e.g., latency), expressed as a corridor of correct values via a lower and an upper bound per requirement. These requirements should be guaranteed during each execution of an application program on a given MPSoC platform. This can be achieved using a reactive control loop, where an enforcer controls a set of properties to be enforced, e.g., by adapting the number of cores allocated to a program or by scaling the voltage/frequency mode of active processors. An enforcement strategy may react to a requirement response differently, depending on (a) satisfying a requirement, or violating (b) a lower bound or (c) an upper bound. A better strategy might be to differentiate the reaction taken according to the amount of violation of a lower or upper bound, i.e., to react at a finer granularity. In this paper, we propose a design space exploration (DSE) method called Co-explore that automatically partitions the requirement corridors into so-called response ranges (i.e., sub-corridors) such that formulated verification goals of simultaneously generated enforcer FSMs, e.g., the number of consecutive violations of a requirement, are optimized. The evaluation shows that the explored enforcement FSMs can achieve higher probabilities of meeting a given set of requirements compared to reacting solely based on the ternary information (a), (b), or (c)./proceedings-archive/2024/DATA/356_pdf_upload.pdf |
11:41 CET | TS02.10 | ASYMSAT: ACCELERATING SAT SOLVING WITH ASYMMETRIC GRAPH-BASED MODEL PREDICTION Speaker: Zhiyuan Yan, The Hong Kong University of Science and Technology(Guangzhou), CN Authors: Zhiyuan Yan1, Min Li2, Zhengyuan Shi2, Wenjie Zhang3, Yingcong Chen4 and Hongce Zhang4 1The Hong Kong University of Science and Technology(Guangzhou), CN; 2The Chinese University of Hong Kong, HK; 3Peking University, CN; 4Hong Kong University of Science and Technology, HK Abstract Though graph neural networks (GNNs) have been used in SAT solution prediction, for a subset of symmetric SAT problems, we unveil that the current GNN-based end-to-end SAT solvers are bound to yield incorrect outcomes as they are unable to break symmetry in variable assignments. In response, we introduce AsymSAT, a new GNN architecture coupled with a recurrent neural network (RNN) to produce asymmetric models. Moreover, we propose a method to integrate machine-learning-based SAT assignment prediction with classic SAT solvers and demonstrate its performance on non-trivial SAT instances including logic equivalence checking and cryptographic analysis problems with as much as 75.45% time saving./proceedings-archive/2024/DATA/283_pdf_upload.pdf |
11:42 CET | TS02.11 | A HYBRID APPROACH TO REVERSE ENGINEERING ON COMBINATIONAL CIRCUITS Speaker: Yi-Ting Li, National Tsing Hua University, TW Authors: Wuqian Tang1, Yi-Ting Li1, Kai-Po Hsu1, Kuan-Ling Chou1, You-Cheng Lin1, Chia-Feng Chien1, Tzu-Li Hsu1, Yung-Chih Chen2, Ting-Chi Wang1, Shih-Chieh Chang1, TingTing Hwang1 and Chun-Yao Wang1 1National Tsing Hua University, TW; 2National Taiwan University of Science and Technology, TW Abstract Reverse engineering is a process that converts low-level design description to high-level design description. In this paper, we propose a hybrid approach consisting of structural analysis and black-box testing to reverse engineering on combinational circuits. Our approach is able to convert combinational circuits from gate-level netlist to Register-Transfer Level (RT-level) design accurately and efficiently. We developed our approach and participated in Problem A of the 2022 CAD Contest @ ICCAD. The revised version of our program successfully converted most cases and achieved higher scores than the 1st place team in the contest./proceedings-archive/2024/DATA/782_pdf_upload.pdf |
11:43 CET | TS02.12 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
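TS02.7 above describes encoding transistor-level electrical rule checks as SMT queries solved with Z3. The toy sketch below, which assumes the z3-solver Python package is available, illustrates the general flavour only, not the paper's relational semantics: it asks whether a signal driven from an assumed 1.2 V domain can leave a PMOS in an assumed 1.8 V domain partially conducting when no level shifter is present. All voltage and threshold values are illustrative.

```python
from z3 import Real, Solver, Or, sat

# Supply and threshold values (illustrative numbers, not taken from the paper).
VDD_LOW, VDD_HIGH, VTH_P = 1.2, 1.8, 0.4

driver_out = Real("driver_out")  # output level of the 1.2 V domain driver
s = Solver()

# The low-domain driver can only produce a strong 0 or a strong VDD_LOW.
s.add(Or(driver_out == 0, driver_out == VDD_LOW))

# Leakage condition: while the driver outputs a logic '1', the PMOS gate in
# the 1.8 V domain still sits more than a threshold below VDD_HIGH, so it conducts.
s.add(driver_out == VDD_LOW)
s.add(VDD_HIGH - driver_out > VTH_P)

if s.check() == sat:
    print("Potential leakage path (missing level shifter):", s.model())
else:
    print("No leakage condition reachable.")
```

A real ERC flow would generate such constraints automatically from the netlist and domain annotations; the point here is only that a rule violation becomes a satisfiability query.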
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
Patrick Groeneveld, Stanford University, US
Session co-chair:
Carles Hernandez, Universidad Politécnica de Valencia, ES
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | TS11.1 | CTRL-B: BACK-END-OF-LINE CONFIGURATION PATHFINDING USING CROSS-TECHNOLOGY TRANSFERABLE REINFORCEMENT LEARNING Speaker: Sung-Yun Lee, Pohang University of Science and Technology, KR Authors: Sung-Yun Lee, Kyungjun Min and Seokhyeong Kang, Pohang University of Science and Technology, KR Abstract In advanced technology nodes, the impact of the back-end-of-line (BEOL) on chip performance and power consumption becomes progressively significant. This paper presents a BEOL configuration pathfinding framework using the proposed cross-technology transferable reinforcement learning (CTRL) model. We optimize the BEOL technology parameters, including the metal stack, geometry, and design rules, to enhance the power efficiency and performance resulting from the physical design process. First, we extract various design and technology features and embed them into metal-type-wise deep neural networks. Our multi-policy model selects a BEOL parameter configuration that is estimated to improve chip power efficiency and performance. We employ a policy gradient algorithm complemented by various training strategies, such as data normalization, action-reward buffer, and network optimization, to expedite training convergence. Furthermore, we transfer the pathfinding knowledge from the trained model in the old node to the new model for efficient BEOL configuration pathfinding in the advanced node, referred to as cross-technology transfer learning. The proposed framework achieved 19% reduced total power consumption, 68% improved worst negative slack, and 87% improved total negative slack with a high reward efficiency, on average, in several designs. Further, we demonstrate that the reward efficiency of our proposed CTRL model outperforms that of the conventional fine-tuning method in transferring knowledge by 24%./proceedings-archive/2024/DATA/62_pdf_upload.pdf |
11:05 CET | TS11.2 | A DEEP-LEARNING-BASED STATISTICAL TIMING PREDICTION METHOD FOR SUB-16NM TECHNOLOGIES Speaker: Jiajie Xu, Southeast University, CN Authors: Jiajie Xu, Leilei Jin, Wenjie Fu, Hao Yan and Longxing Shi, Southeast University, CN Abstract Pre-routing timing estimation is vital but challenging since accurate net information is available only after routing and parasitic extraction. Existing methodologies predict the timing metrics with the help of the placement information of standard cells. However, neglecting the analysis of process variation effects hinders the precision of those methodologies, especially in sub-16nm technologies as delay distributions become asymmetric. Therefore, a deep-learning-based statistical timing prediction method is proposed to model process variation effects in the pre-routing stage. Congestion features and pin-to-pin features are fed into graph neural networks for post-routing interconnect parasitic and arc delay prediction. Moreover, a calibration method is proposed to compensate for the precision loss of the delay propagation. We evaluate our methods using open-source designs and EDA tools, which demonstrate improved accuracy in pre-routing timing prediction methods and a remarkable speed-up compared to traditional routing and timing analysis process./proceedings-archive/2024/DATA/135_pdf_upload.pdf |
11:10 CET | TS11.3 | STANDARD CELL LAYOUT GENERATOR AMENABLE TO DESIGN TECHNOLOGY CO-OPTIMIZATION IN ADVANCED PROCESS NODES Speaker: Handong Cho, Seoul National University, KR Authors: Handong Cho, Hyunbae Seo, Sehyeon Chung, Kyu-Myung Choi and Taewhan Kim, Seoul National University, KR Abstract To generate standard cell (SC) layouts of competitive quality, pin accessibility and in-cell routing congestion should be thoroughly taken into account. In this work, we develop a new tool to address this issue. Precisely, we (1) develop a technology compilation module that can convert diverse cell architectures and design rules into grid based design parameters and layer configuration, (2) generate optimal FET placement using metrics that can accurately and efficiently predict intra-cell pin accessibility and in-cell routing congestion, and (3) introduce the concept of ghost-via and ghost-metal, and formulate in-cell routing using satisfiability modulo theory for pin separation and extension. Experimental results show that our system is able to synthesize SC layouts with a routing completion rate of 95∼98%, which is far better than the previous SC layout generator, and produce layouts comparable to the ARM's hand-crafted layouts. In addition, the design implementations produced by using our 2-layer 1D SC library exhibit on average 76.6% fewer design rule violations (DRVs) with similar or better quality of timing and area, while in comparison with that produced by using the library of hand-crafted ARM SCs, the implementations produced by using our 1-layer 2D SC library exhibit on average 11.7% smaller area with comparable timing and DRV count./proceedings-archive/2024/DATA/518_pdf_upload.pdf |
11:15 CET | TS11.4 | BOXGB: DESIGN PARAMETER OPTIMIZATION WITH SYSTEMATIC INTEGRATION OF BAYESIAN OPTIMIZATION AND XGBOOST Speaker: Chanhee Jeon, Seoul National University, KR Authors: Chanhee Jeon, Doyeon Won, Jaewan Yang, Kyu-Myung Choi and Taewhan Kim, Seoul National University, KR Abstract Finding design flow parameters that ensure a high-quality final chip is a very important task, but it requires an excessive amount of effort and time. In this work, we automate this task by proposing a machine learning (ML)-based design space optimization (DSO) framework. Rather than simply applying one ML model exclusively or multiple ones in a naive manner, we develop a comprehensive chain of ML engines which is able to explore the design parameter space more economically but effectively to make a fast convergence on finding the best parameter set. Specifically, we solve the DSO problem in three steps: (1) random sampling of parameter sets and then performing design evaluation to produce an initial ML training dataset; (2) iteratively, downsizing parameter dimension through Principal Component Analysis (PCA) followed by sampling through an exploration-centric mechanism which is internally driven by Bayesian Optimization (BO) and then evaluating the sample; (3) iteratively, sampling through an exploitation-centric mechanism driven by XGBoost regression and then checking for anomalies using XGBoost classification, followed by evaluating the sample if it is not an anomaly. From our experiments with benchmark designs, it is shown that our approach is able to find design parameter sets which are far better than those found by the prior state-of-the-art ML-based approaches, even with fewer design evaluations (i.e., EDA tool runs). In addition, in comparison with the designs produced by using the default parameter setting, our DSO framework is able to improve the design PPA metrics by 5∼30%./proceedings-archive/2024/DATA/521_pdf_upload.pdf |
11:20 CET | TS11.5 | MIRACLE: MULTI-ACTION REINFORCEMENT LEARNING-BASED CHIP FLOORPLANNING REASONER Speaker: Qi Xu, University of Science and Technology of China, CN Authors: Bo Yang1, Qi Xu1, Hao Geng2, Song Chen1 and Yi Kang1 1University of Science and Technology of China, CN; 2ShanghaiTech University, CN Abstract Floorplanning is one of the most critical but time-consuming tasks in the chip design process. Machine learning techniques, especially reinforcement learning, have provided a promising direction for floorplanning design. In this paper, an end-to-end reinforcement learning (RL) framework is proposed to learn a policy for floorplanning automatically, combining an edge-augmented graph attention network (EGAT), a position-wise multi-layer perceptron, and a gated self-attention mechanism. We formulate floorplanning as a Markov Decision Process (MDP) model, where a multi-action mechanism and a dense reward function are developed to adapt to the floorplanning problem. In addition, in order to make full use of prior knowledge, we further propose a supervised learning approach on the generated synthetic netlist-floorplan dataset. Experimental results demonstrate that, compared with state-of-the-art floorplanners, the proposed end-to-end framework significantly reduces wirelength with a smaller area./proceedings-archive/2024/DATA/565_pdf_upload.pdf |
11:25 CET | TS11.6 | CIRCUITS PHYSICS CONSTRAINED PREDICTOR OF STATIC IR DROP WITH LIMITED DATA Speaker: Yuan Meng, Fudan University, CN Authors: Yuan Meng1, Ruiyu Lyu1, Zhaori Bi1, Changhao Yan1, Fan Yang1, Wenchuang Hu2, Dian Zhou3 and Xuan Zeng1 1Fudan University, CN; 2Sichuan University, CN; 3University of Texas at Dallas, US Abstract We propose a pyramid scene parsing network (PSPN) with skip-connection architecture to effectively utilize physical information that characterizes IR drop distribution, including current source locations, via locations, and asymmetric topological connections, achieving highly accurate IR drop prediction for power delivery networks (PDN) of varying scales, even with a limited dataset. The skip-connection architecture preserves the positional information of current sources, which often correlates with large IR drop, facilitating the identification of hotspots. We incorporate via locations into the model to effectively describe the topological connection distance between voltage sources and different nodes in the multi-layer PDN, while the traditional method only considers the horizontal distance between nodes and voltage sources, which is invalid for prediction. To capture asymmetric connection features within the PDN efficiently, we introduce a shape-adaptive convolutional kernel to solve the problem of inadequate extraction of feature information in a traditional method. Finally, we propose a loss function with Kirchhoff's law constraints to ensure the model's prediction aligns with the electrical characteristics of the circuit, which cannot be guaranteed by traditional machine-learning-based methods that consider only prediction accuracy. Our results, based on training with only 100 synthetic circuits, demonstrate the superiority of our method over the state-of-the-art prediction technique. Across evaluations on 10 real circuits, our approach consistently delivers a 50% improvement in precision./proceedings-archive/2024/DATA/603_pdf_upload.pdf |
11:30 CET | TS11.7 | REINFORCEMENT LEARNING-BASED OPTIMIZATION OF BACK-SIDE POWER DELIVERY NETWORKS IN VLSI DESIGN FOR IR-DROP REDUCTION Speaker: Taigon Song, Kyungpook National University , KR Authors: Seungmin Woo1, Hyunsoo Lee2, Yunjeong Shin2, MinSeok Han2, Yunjeong Go2, Jongbeom Kim2, Hyundong Lee2, Hyunwoo Kim2 and Taigon Song2 1Georgia Tech, US; 2Kyungpook National University , KR Abstract On-chip power planning is a crucial step in chip design. As process nodes advance and the need to supply lower operating voltages without loss becomes vital, the optimal design of the Power Delivery Network (PDN) has become pivotal in VLSI to mitigate IR-drop effectively. To address IR-drop issues in the latest nodes, a back-side power delivery network (BSPDN) has been proposed as an alternative to the conventional front-side PDN. However, BSPDN encounters design issues related to the pitch and resistance of through-silicon vias (TSVs). In addition, BSPDN faces optimization challenges due to the trade-off between rail and grid IR-drop, particularly in the effectiveness of uniform grid design patterns. In this study, we introduce a design framework that utilizes reinforcement learning to identify optimized grid width patterns for individual VLSI designs on the silicon back-side, aiming to reduce IR-drop. We have applied our design approach to various benchmarks and validated its improvement. Our results demonstrate a significant improvement in total IR-drop, with a maximum improvement of up to -19.0% in static analysis and up to -18.8% in dynamic analysis, compared to the conventional uniform BSPDN./proceedings-archive/2024/DATA/736_pdf_upload.pdf |
11:35 CET | TS11.8 | LEARNING CIRCUIT PLACEMENT TECHNIQUES THROUGH REINFORCEMENT LEARNING WITH ADAPTIVE REWARDS Speaker: Luke Vassallo, University of Malta, MT Authors: Luke Vassallo and Josef Bajada, University of Malta, MT Abstract Placement is the initial step of Printed Circuit Board (PCB) physical design and demands considerable time and domain expertise. Placement quality impacts the performance of subsequent tasks, and the generation of an optimal placement is known to be, at the very least, NP-complete. While stochastic optimisation and analytic techniques have had some success, they often lack the intuitive understanding of human engineers. In this study, we propose a novel end-to-end Machine Learning (ML) approach to learn fundamental placement techniques and use experience to optimise PCB layouts efficiently. To achieve this, we formulate the PCB placement problem as a Markov Decision Process (MDP) and use Reinforcement Learning (RL) to learn general placement techniques. The agent-driven data collection process generates highly diverse and consistent data points sufficient for learning general policies without expert knowledge under the guidance of an adaptive reward signal. Compared to state-of-the-art simulated annealing approaches on unseen circuits, the resulting policies trained with TD3 and SAC, on average, yield 17% and 21% reduction in post-routing wirelength. Qualitative analysis shows that the policies learn fundamental placement techniques and demonstrate an understanding of the underlying problem dynamics. Collectively, they demonstrate emergent collaborative or competitive behaviours and faster placement convergence, sometimes exceeding an order of magnitude./proceedings-archive/2024/DATA/1040_pdf_upload.pdf |
11:40 CET | TS11.9 | TRAINING BETTER CNN MODELS FOR 3-D CAPACITANCE EXTRACTION WITH NEURAL ARCHITECTURE SEARCH Speaker: Haoyuan Li, Tsinghua University, CN Authors: Haoyuan Li, Dingcheng Yang and Wenjian Yu, Tsinghua University, CN Abstract Accurate capacitance extraction is becoming more important for designing integrated circuits under advanced process technology. The pattern matching-based full-chip extraction methodology delivers fast computational speed but suffers from large error and tedious efforts on building capacitance models of the increasing structure patterns. Recent work proposes a grid-based data representation and Convolutional Neural Network-based capacitance models (called CNN-Cap) following the architecture of ResNet. In this work, we find better network architectures for capacitance extraction of three-dimensional (3-D) interconnect structures using Neural Architecture Search (NAS). Along with data augmentation and efficient training techniques, the models called NAS-Cap are developed for computing total and coupling capacitances. Experimental results show that the obtained NAS-Cap achieves higher accuracy on capacitance extraction than CNN-Cap, while consuming less runtime for inference and storage space for parameters. The proposed NAS-Cap predicts the coupling capacitance with less than 10% error with 98.8% probability and with an average error of 1.9%./proceedings-archive/2024/DATA/703_pdf_upload.pdf |
11:41 CET | TS11.10 | RLPLANNER: REINFORCEMENT LEARNING BASED FLOORPLANNING FOR CHIPLETS WITH FAST THERMAL ANALYSIS Speaker: Yuanyuan Duan, Zhejiang University, CN Authors: Yuanyuan Duan1, Xingchen Liu2, Zhiping Yu3, Hanming Wu2, Leilai Shao4 and Xiaolei Zhu2 1College of Integrated Circuits, Zhejiang University, Hangzhou, CN; 2Zhejiang University, CN; 3Tsinghua University, CN; 4Shanghai Jiao Tong University, CN Abstract Chiplet-based systems have gained significant attention in recent years due to their low cost and competitive performance. As the complexity and compactness of a chiplet-based system increase, careful consideration must be given to microbump assignments, interconnect delays, and thermal limitations during the floorplanning stage. This paper introduces RLPlanner, an efficient early-stage floorplanning tool for chiplet-based systems with a novel fast thermal evaluation method. RLPlanner employs advanced reinforcement learning to jointly minimize total wirelength and temperature. To alleviate the time-consuming thermal calculations, RLPlanner incorporates the developed fast thermal evaluation method to expedite the iterations and optimizations. Comprehensive experiments demonstrate that our proposed fast thermal evaluation method achieves a mean absolute error (MAE) of ±0.25 K and delivers over 120x speedup compared to the open-source thermal solver HotSpot. When integrated with our fast thermal evaluation method, RLPlanner achieves an average improvement of 20.28% in minimizing the target objective (a combination of wirelength and temperature), within a similar running time, compared to the classic simulated annealing method with HotSpot./proceedings-archive/2024/DATA/501_pdf_upload.pdf (An illustrative reward-function sketch follows this session table.) |
11:42 CET | TS11.11 | LEARNING TO FLOORPLAN LIKE HUMAN EXPERTS VIA REINFORCEMENT LEARNING Speaker: Binjie Yan, Shanghai Jiao Tong University, CN Authors: Binjie Yan, Lin Xu, Zefang Yu, Mingye Xie, Wei Ran, Jingsheng Gao, Yuzhuo Fu and Ting Liu, Shanghai Jiao Tong University, CN Abstract Deep reinforcement learning (RL) has gained popularity for automatically generating placements in modern chip design. However, the visual style of the floorplans generated by these RL models differs significantly from that of manual layouts, because RL placers usually adopt only metrics like wirelength and routing congestion as the reward, ignoring the complex and fine-grained layout experience of human experts. In this paper, we propose a placement scorer to rate the quality of layouts and apply anomaly detection to the floorplanning task. In addition, we add the output of this scorer as part of the reward for reinforcement learning of the placement process. Experimental results on the ISPD 2005 benchmark show that our proposed placement quality scorer can efficiently evaluate layouts according to human craft style, and that adding this scorer to the reinforcement learning reward helps generate placements with shorter wirelength than previous methods for some circuit designs./proceedings-archive/2024/DATA/1139_pdf_upload.pdf |
11:43 CET | TS11.12 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
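The following is a minimal, illustrative sketch (not the authors' implementation) of the kind of MDP formulation described in TS11.8: the state is the set of already placed components, the action is the grid cell chosen for the next component, and a hypothetical adaptive reward penalises the HPWL increase caused by each move. The net list, grid size, and reward shaping are assumptions made purely for illustration.

```python
# Illustrative sketch only, not the TS11.8 implementation.
import random

class ToyPlacementMDP:
    def __init__(self, nets, grid=(10, 10)):
        self.nets = nets                          # each net is a list of component ids
        self.grid = grid
        self.n_comps = max(c for net in nets for c in net) + 1
        self.reset()

    def reset(self):
        self.positions = {}                       # component id -> (x, y)
        self.next_comp = 0
        return dict(self.positions)

    def _hpwl(self):
        total = 0
        for net in self.nets:
            pts = [self.positions[c] for c in net if c in self.positions]
            if len(pts) > 1:
                xs, ys = zip(*pts)
                total += (max(xs) - min(xs)) + (max(ys) - min(ys))
        return total

    def step(self, cell):
        before = self._hpwl()
        self.positions[self.next_comp] = cell
        self.next_comp += 1
        reward = -(self._hpwl() - before)         # move-local, adaptive-style reward
        done = self.next_comp == self.n_comps
        return dict(self.positions), reward, done

env = ToyPlacementMDP(nets=[[0, 1], [1, 2], [0, 2, 3]])
state, done = env.reset(), False
while not done:                                   # random policy as a stand-in for TD3/SAC
    action = (random.randrange(env.grid[0]), random.randrange(env.grid[1]))
    state, reward, done = env.step(action)
```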
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 11:00 CET - 12:30 CET
Location / Room: Multi-Purpose Room M1B+D
Session chair:
Stefano Di Carlo, Politecnico di Torino, IT
Session co-chair:
Ayesha Khalid, Queen's University-Belfast, UK
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | TS29.1 | CPF: A CROSS-LAYER PREFETCHING FRAMEWORK FOR HIGH-DENSITY FLASH-BASED STORAGE Speaker: Longfei Luo, Software/Hardware Co-design Engineering Research Center, Ministry of Education, and School of Computer Science and Technology, East China Normal University, CN Authors: Longfei Luo1, Han Wang2, Yina Lv3, Dingcui Yu2 and Liang Shi1 1Software/Hardware Co-design Engineering Research Center, Ministry of Education, and School of Computer Science and Technology, East China Normal University, CN; 2East China Normal University, CN; 3Department of Computer Science, City University of Hong Kong, CN Abstract The pseudo-single-level-cell (pSLC) technique is widely adopted in high-density flash-based storage to mitigate the performance and endurance problem of high-density flash memory. Furthermore, prefetching schemes can compensate for performance differences among storage tiers. Existing prefetchers are implemented in the operating system (OS) or storage layers. However, OS layer prefetchers are conservative since it is a challenge to achieve both high accuracy and large coverage simultaneously. Storage layer prefetchers are sub-optimal due to the performance differences between pSLC and DRAM. In this paper, a cross-layer prefetching framework (CPF) is proposed to prefetch data selectively. The basic idea is that high-accuracy data will be prefetched to DRAM and large-coverage data will be prefetched to the pSLC flash in storage. To make it practical, an adaptive regulator is further designed to dynamically adjust the cross-layer prefetching to ensure accuracy and coverage. Evaluations show that CPF can improve read performance and reduce data transfer costs significantly./proceedings-archive/2024/DATA/33_pdf_upload.pdf |
11:05 CET | TS29.2 | A READ LATENCY VARIATION AWARE INDEPENDENT READ SCHEME FOR QLC SSDS Speaker: Dong Huang, Huazhong University of Science & Technology, CN Authors: Dong Huang, Dan Feng, Qiankun Liu, Bo Ding, Xueliang Wei, Wei Zhao and Wei Tong, Huazhong University of Science & Technology, CN Abstract QLC flash-based SSDs have attracted growing interest and are expected to fit read-intensive scenarios owing to their higher cost-effectiveness but shorter write endurance compared with Triple-Level Cell (TLC) SSDs. Recently, new commands supporting independent reads, such as Single Operation Multiple Locations (SOML) and Independent multi-plane (IMP) read, have been proposed to improve read performance. Unfortunately, while independent read exhibits a significant performance improvement, we show in this paper that the existing approach fails to fully exploit its potential due to the larger read latency variation and more planes per die in current SSD architectures. Through a set of experiments, we demonstrate that the lack of a read latency variation aware mechanism leads to low independent-read performance across a wide variety of workloads. To alleviate this issue, we propose LITA, whose key idea is to combine read transactions with similar latency into one command. LITA includes (1) LIT, a latency variation aware transaction combination, and (2) TASAP, a latency variation aware transaction completion service. The experimental results show LITA can reduce read latency by 20.4% and 9.6% on average for 4-plane QLC SSDs under IMP and SOML, respectively./proceedings-archive/2024/DATA/793_pdf_upload.pdf |
11:10 CET | TS29.3 | ADAPTIVE DRAM CACHE DIVISION FOR COMPUTATIONAL SOLID-STATE DRIVES Speaker: Shuaiwen Yu, Southwest University, CN Authors: Shuaiwen Yu1, Zhibing Sha2, Chengyong Tang1, Zhigang Cai1, Peng Tang1, Min Huang1, Jun Li1 and Jianwei Liao3 1Southwest University, CN; 2Southwest University, CN; 3Southwest University of China, CN Abstract High computational capabilities enable modern solid-state drives (SSDs) to be computing nodes, not just faster storage devices, and an SSD with such capability is generally called a computational SSD (CompSSD). The DRAM data cache of a CompSSD should therefore hold not only the output data of the tasks running at the host side, but also the input data of the tasks executed at the SSD side. To improve the utilization of the cache inside the CompSSD, this paper proposes an adaptive cache division scheme that dynamically splits the cache space to separately buffer the output data of tasks running at the host and the input data of tasks running at the CompSSD. Specifically, we construct a mathematical model running at the flash translation layer of the CompSSD to periodically determine the cache proportions of the workloads running at the host side and the CompSSD side, by considering the ratios of read/write data amount, the cache hits, and the overhead of data transfer between the storage device and the host. Then, both the output data and the input data can be buffered in their own private cache parts, so that the overall I/O performance can be enhanced. Trace-driven simulation experiments show that our proposal can reduce the overall I/O latency by 27.5% on average, in contrast to existing cache management schemes./proceedings-archive/2024/DATA/64_pdf_upload.pdf |
11:15 CET | TS29.4 | LOADM: LOAD-AWARE DIRECTORY MIGRATION POLICY IN DISTRIBUTED FILE SYSTEMS Speaker: Yuanzhang Wang, WNLO, Huazhong University of Science and Technology, CN Authors: Yuanzhang Wang, Peng Zhang, Fengkui Yang, Ke Zhou and Chunhua Li, Huazhong University of Science & Technology, CN Abstract Distributed file systems often suffer from load imbalance when encountering skewed workloads. A few directories can become hotspots due to frequent access. Failure to migrate these high-load directories promptly will result in node overload, which can seriously degrade the performance of the system. To solve this challenge, in this paper, we propose a novel load-aware directory migration policy named LoADM to alleviate the load imbalance caused by hot directories. LoADM consists of three parts, i.e., a learning-based directory hotness model, urgency analysis, and a multidimensional directory migration model. First, we use the directory hotness model to identify potentially high-load directories in advance. Second, by combining the predicted directory hotness and system node status, the urgency analysis determines when to trigger a migration or tolerate an imbalance. Third, peer directory co-migration is proposed to better exploit data locality. Finally, we migrate high-load directories to appropriate storage nodes through a Particle Swarm Optimization based directory migration model. Extensive experiments show that our approach provides a promising data migration policy and can greatly improve performance compared to the state-of-the-art./proceedings-archive/2024/DATA/553_pdf_upload.pdf |
11:20 CET | TS29.5 | ACCELERATING MACHINE LEARNING-BASED MEMRISTOR COMPACT MODELING USING SPARSE GAUSSIAN PROCESS Speaker: Yuta Shintani, Nara Institute of Science and Technology, JP Authors: Yuta Shintani1, Michiko Inoue1 and Michihiro Shintani2 1Nara Institute of Science and Technology, JP; 2Kyoto Institute of Technology, JP Abstract Research on memristor-based dedicated circuits for multiply-and-accumulate processing, which is vital to machine learning (ML), has attracted considerable attention. However, memristors have unknown operating principles, making it challenging to create compact models with sufficient accuracy. This study proposes a compact modeling method for memristors based on Gaussian processes. Although various ML-based modeling methods have been proposed, only reproduction accuracy has been evaluated with a SPICE circuit simulator, and the long training times have not been sufficiently discussed. The proposed method reduces the training and inference times by using a Gaussian process that exploits sparsity. An evaluation using data measured from real memristor devices demonstrates that the proposed method is over 2,629 times faster than a conventional method based on long short-term memory (LSTM). Moreover, inference on a commercial SPICE simulator can be performed with the same accuracy and computation time. All experimental environments, including the source code, are available at https://github.com/sntnmchr/SGPR-memristor/blob/main/README.md. (A minimal sparse-approximation sketch follows this session's table.)/proceedings-archive/2024/DATA/7_pdf_upload.pdf |
11:25 CET | TS29.6 | DROPHD: TECHNOLOGY/ALGORITHM CO-DESIGN FOR RELIABLE ENERGY-EFFICIENT NVM-BASED HYPERDIMENSIONAL COMPUTING UNDER VOLTAGE SCALING Speaker: Paul R. Genssler, University of Stuttgart, DE Authors: Paul Genssler1, Mahta Mayahinia2, Simon Thomann3, Mehdi Tahoori2 and Hussam Amrouch4 1University of Stuttgart, DE; 2Karlsruhe Institute of Technology, DE; 3Chair of AI Processor Design, TU Munich (TUM), DE; 4TU Munich (TUM), DE Abstract Brain-inspired hyperdimensional computing (HDC) offers much more efficient computing compared to other classical deep learning and related machine learning algorithms. Unlike classical CMOS, emerging non-volatile memories (NVMs) used in the realization of HDC are susceptible to failures under voltage scaling, which is essential for energy saving. Although HDC is inherently robust against errors, this is only possible when hypervectors with a large dimension (e.g., 10,000 bits) are used, resulting in significant energy consumption. This work demonstrates, for the first time, that different NVM technologies exhibit different error characteristics under voltage scaling. In contrast to conventional CMOS-based SRAM, we demonstrate that the error behavior is data-dependent and not captured by simple bit flips in emerging NVMs. We employ our cross-layer framework that starts from the underlying technology all the way up to the algorithm to develop the novel HDC training approach DropHD. DropHD considerably shrinks the size of hypervectors (e.g., from 10,000 bits down to merely 3,000 bits), while maintaining a high inference accuracy. The use of aggressive voltage scaling reduces energy consumption by 1.6x. DropHD further reduces it by up to 9.5x while fully recovering the induced accuracy drop, i.e., without a tradeoff./proceedings-archive/2024/DATA/834_pdf_upload.pdf |
11:30 CET | TS29.7 | AN AUTONOMIC RESOURCE ALLOCATING SSD Speaker: Dongjoon Lee, Hanyang University, KR Authors: Dongjoon Lee1, Jongin Choe1, Chanyoung Park1, Kyungtae Kang1, Mahmut Kandemir2 and Wonil Choi1 1Hanyang University, KR; 2Pennsylvania State University, US Abstract When an SSD is used for executing multiple workloads, its internal resources should be allocated to prevent the competing workloads from interfering with each other. While channel-based allocation strategies turn out to be quite effective in offering performance isolation, questions like "what is the optimal allocation?" and "how can one efficiently search for the optimal allocation?" remain unaddressed. To this end, we explore the channel allocation problem in SSDs and employ a reinforcement learning-based approach to address the problem. Specifically, we present an autonomic channel allocating SSD, called AutoAlloc, which can seek near-optimal channel allocation in a self-learning fashion for a given set of co-running workloads. The salient features of AutoAlloc include the following: (i) the optimal allocation can change depending on the user-defined optimization metrics; (ii) the search process takes place in an online setting without any need of extra workload profiling or performance estimation; and, (iii) the search process is fully-automated without requiring any user intervention. We implement AutoAlloc in LightNVM (the Linux subsystem) as part of the FTL, which operates with an emulated Open-Channel SSD. Our extensive experiments using various user-defined optimization metrics and workload execution scenarios indicate that AutoAlloc can find a near-optimal allocation after examining only a very limited number of candidate allocations./proceedings-archive/2024/DATA/109_pdf_upload.pdf |
11:35 CET | TS29.8 | SPARROW: FLEXIBLE MEMORY DEDUPLICATION IN ANDROID SYSTEMS WITH SIMILAR-PAGE AWARENESS Speaker: Guangyu Wei, East China Normal University, CN Authors: Guangyu Wei, Changlong Li, Rui Xu, Qingfeng Zhuge and Edwin H.-M. Sha, East China Normal University, CN Abstract Mobile devices have become ubiquitous in daily life. In contrast to traditional servers, mobile devices suffer from limited memory resources, leading to a significant degradation in the user experience. This paper demonstrates that the primary cause of memory consumption lies in anonymous pages associated with application heaps. Existing schemes are ineffective in deduplicating these pages due to the limited occurrence of the same anonymous pages. This paper presents Sparrow, a similar-page aware deduplication solution for mobile systems. Sparrow shows that memory pages still have deduplication potential even though identical pages are rare. This is inspired by an interesting observation: a large number of pages have partially identical contents. We have implemented Sparrow on real-life smartphones. Experimental results indicate that 30.45% more space can be saved with Sparrow./proceedings-archive/2024/DATA/376_pdf_upload.pdf |
11:40 CET | TS29.9 | INTELLIGENT HYBRID MEMORY SCHEDULING BASED ON PAGE PATTERN RECOGNITION Speaker: Yanjie Zhen, Tsinghua University, CN Authors: Yanjie Zhen, Weining Chen, Wei Gao, Ju Ren, Kang Chen and Yu Chen, Tsinghua University, CN Abstract Hybrid memory systems exhibit disparities in their heterogeneous memory components' access speeds. Dynamic page scheduling to ensure memory access predominantly occurs in the faster memory components is essential for optimizing the performance of hybrid memory systems. Recent works attempt to optimize page scheduling by predicting their hotness using neural network models. However, they face two crucial challenges: the page explosion problem and the new pages problem. We propose an intelligent hybrid memory scheduler driven by page pattern recognition to address these two challenges. Experimental results demonstrate that our approach outperforms state-of-the-art intelligent schedulers regarding effectiveness and cost./proceedings-archive/2024/DATA/572_pdf_upload.pdf |
11:41 CET | TS29.10 | EXTENDING SSD LIFETIME VIA BALANCING LAYER ENDURANCE IN 3D NAND FLASH MEMORY Speaker: Siyi Huang, Wuhan University of Technology, CN Authors: Siyi Huang1, Yajuan Du1, Yi Fan1 and Cheng Ji2 1Wuhan University of Technology, CN; 2Nanjing University of Science and Technology, CN Abstract By stacking layers vertically, 3D flash memory enables continuous growth in capacity. In this paper, we study the layer variation in 3D flash blocks and find that bottom layer pages exhibit the lowest endurance, whereas middle layer pages demonstrate the highest endurance. The imbalanced endurance across different layers diminishes the overall SSD lifetime. To address this issue, we introduce a novel layer-aware write strategy, named LA-Write, which performs write-skip operations with layer-specific probabilities. Bottom-layer pages are skipped with the highest probability, so their endurance is noticeably improved, which balances endurance across layers. Experimental results show that LA-Write can improve SSD lifetime by 29%. (A toy write-skip sketch follows this session's table.)/proceedings-archive/2024/DATA/1089_pdf_upload.pdf |
11:42 CET | TS29.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
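As a rough illustration of why sparsity cuts Gaussian-process training cost (TS29.5), the sketch below fits a GP on a small random subset of synthetic current-voltage data. This subset-of-data shortcut is only a crude stand-in for the paper's sparse GP regression, and all data and hyperparameters are invented for the example.

```python
# Illustrative sketch only (synthetic data, not the authors' SGPR model).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
V = rng.uniform(-1.0, 1.0, size=(5000, 1))                  # stand-in voltage sweep
I = np.tanh(3 * V[:, 0]) + 0.01 * rng.normal(size=5000)     # stand-in device current

# Exact GP training scales as O(N^3); fitting only M << N points keeps the
# model cheap, at a small cost in accuracy (sparse GPs do this more carefully
# with inducing points).
idx = rng.choice(len(V), size=200, replace=False)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-4)
gp.fit(V[idx], I[idx])

V_test = np.linspace(-1, 1, 5).reshape(-1, 1)
mean, std = gp.predict(V_test, return_std=True)
print(np.round(mean, 3), np.round(std, 3))
```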
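The toy sketch below illustrates the layer-aware write-skip idea of TS29.10. The layer grouping and skip probabilities are hypothetical; in a real SSD a skipped page would be bypassed in favour of another physical page rather than the write being dropped.

```python
# Illustrative sketch only: toy layer-aware write skipping (hypothetical values).
import random

SKIP_PROB = {"bottom": 0.30, "middle": 0.05, "top": 0.15}   # hypothetical probabilities
wear = {"bottom": 0, "middle": 0, "top": 0}                  # program/erase wear counters

def write_page(layer):
    """Skip the physical write with a layer-specific probability (the skipped
    write would be redirected to another page in a real device)."""
    if random.random() < SKIP_PROB[layer]:
        return False          # page skipped this round, no wear added
    wear[layer] += 1
    return True

random.seed(1)
for _ in range(10000):
    write_page(random.choice(list(SKIP_PROB)))
print(wear)                   # the low-endurance bottom layer accumulates the least wear
```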
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 13:15 CET - 14:00 CET
Location / Room: Auditorium 2
Session chair:
Marcello Traiola, Inria Rennes, FR
The talk in this session replaces the one by Francisco Herrara, who, due to unforeseen circumstances, is unable to deliver his keynote speech.
Time | Label | Presentation Title Authors |
---|---|---|
13:15 CET | LK03.1 | VALIDATION AND VERIFICATION OF AI-ENABLED VEHICLES IN THEORY AND PRACTICE Presenter: Marilyn Wolf, University of Nebraska, US Authors: Marilyn Wolf1 and William Widen2 1University of Nebraska, US; 2University of Nebraska – Lincoln, US Abstract This talk will consider engineering methods for the evaluation of the safety of AI-enabled vehicles. We do not yet have theories and models for AI systems equivalent to those used to guide software/hardware verification and manufacturing test. However, engineers can adapt existing methods to help provide some assurance as to the safety and effectiveness of AI-enabled vehicles. We will also consider the role of management in the monitoring of validation for AI-enabled vehicles. |
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Auditorium 2
Session chair:
Carles Hernández, TU Valencia, ES
Organisers:
Carmen G. Almudever, Technical University of Valencia, ES, Eduard Alarcon, TU Catalonia, BarcelonaTech, ES,
Robert Wille, Technical University of Munich, DE, Fabio Sebastiano, Delft University of Technology, NL,
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | FS03.1 | ARCHITECTING LARGE-SCALE QUANTUM COMPUTING SYSTEMS Presenter: Carmen G. Almudever, TU Valencia, ES Authors: Carmen G. Almudever1 and Eduard Alarcon2 1TU Valencia, ES; 2TU Catalonia, BarcelonaTech, ES Abstract Advances in quantum hardware, with functional quantum processors integrating tens and even hundreds of noisy qubits, together with the availability of near-term quantum algorithms, have allowed the development of so-called full stacks that bridge quantum applications with quantum devices. Up to now, intermediate-size quantum computers have been designed in an 'ad hoc' manner, with heterogeneous methods and tools that are becoming more and more sophisticated as the complexity of the systems increases. However, there is not yet a standard, automated design procedure, nor are there comprehensive design flows that fully exploit the benefits of design automation. In the coming years, having a full-fledged automatic design framework will be indispensable to address the design of large-scale quantum computing systems given their increasing complexity. In this talk, we will provide an overview of the different layers of the quantum computing full stack and emphasize the key principles for architecting quantum computers, such as co-design, optimization, and benchmarking. We will then talk about the design of large-scale quantum computing systems and describe some of the open questions and challenges the quantum computing community is currently facing and for which the expertise of the EDA community will be crucial. |
14:22 CET | FS03.2 | SUPERCONDUCTING QUANTUM PROCESSORS FOR SURFACE CODE: DESIGN AND ANALYSIS Presenter: Nadia Haider, TU Delft, NL Authors: Nadia Haider, Marc Beekman and Leonardo Dicarlo, TU Delft, NL Abstract Electronic design automation (EDA) plays a key role in advancing various qubit technologies by streamlining and optimizing the design process. In the realm of quantum applications, the reliance on microwave signals and systems spans a broad spectrum. The importance of proper electromagnetic analysis and microwave design is evident in the pursuit of groundbreaking advancements in quantum computing, particularly for large-scale, fault-tolerant quantum information processing units. This is impacting quantum processing chip design processes, particularly for superconducting-based systems. In this talk, we will present an effective numerical method to analyse bus-mediated qubit-qubit couplings and two-qubit-gate speed limits in superconducting quantum processors designed for surface-code quantum error correction. Our hybrid simulation tool, combining finite-element and circuit simulation, enables the investigation of a complex chip layout with limited computational resources. We apply the simulation method to the design of Surface-17 processors, finding good agreement with the experimental realization of Surface-17. |
14:45 CET | FS03.3 | DESIGNING THE CRYOGENIC ELECTRICAL INTERFACE FOR LARGE-SCALE QUANTUM PROCESSORS Presenter: Fabio Sebastiano, TU Delft, NL Author: Fabio Sebastiano, TU Delft, NL Abstract To enable practical applications, quantum computers must scale up from the few (<1000) quantum bits (qubits) available today to the millions required for useful computations. To follow and support the scaling of the quantum processors, the complexity of the electrical interface required for controlling and reading the qubits must also grow accordingly. In particular, wiring an increasing number of cryogenic qubits to their room-temperature control electronics will soon hit a brick wall due to the sheer size of the required wires, their cost, and their reliability concerns. As an alternative to alleviate such an interconnect bottleneck, cryogenic electronics operating in close proximity to the qubits has been proposed, but this comes with additional challenges, including the limited cooling-power budget in cryogenic refrigerators, the operation of semiconductor devices well below (<4 K) their rated operating temperature range, and the need for high performance. This talk will briefly outline the functionality and the requirements for such cryogenic electrical interface and describe the design of state-of-the-art cryogenic integrated circuits, with a specific focus on cryogenic CMOS (cryo-CMOS) technology. By reviewing the current design approaches, we will highlight hurdles and opportunities, also pointing out the needs to be addressed by the EDA community, such as the availability of quantum/classical co-simulation and co-design flows, and the distinguishing features required for the process design kits (PDK) and the design tools for cryogenic integrated circuits. Those challenges will thus define a roadmap for cryogenic electronics as a crucial enabling technology for future large-scale quantum computers. |
15:07 CET | FS03.4 | SOFTWARE FOR QUANTUM COMPUTING Presenter: Robert Wille, TU Munich, DE Author: Robert Wille, TU Munich, DE Abstract In the classical computing realm, software is omnipresent and not only used to realize applications but is essential in the design of electronic circuits and systems themselves. For the quantum realm, similar developments are currently emerging. But these tools do not (yet) fully utilize experiences gained in the field of design automation over the last decades. This leaves huge potential for further improvement untapped. In fact, many of the design problems considered for quantum computing are of a combinatorial and exponential nature, a complexity with which design automation experts are quite familiar. In this talk, a selection of design tasks is discussed for which software and particularly design automation background can be utilized. The corresponding solutions are available through the Munich Quantum Toolkit (MQT), which offers open-source implementations for these tasks (available at https://www.cda.cit.tum.de/research/quantum/mqt/). By this, an overview of the full spectrum of tasks is provided where dedicated software can be useful and, in fact, is required. |
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Break-Out Room S1+2
Session chair:
Michele Lora, Università di Verona, IT
Session co-chair:
Pascal Vivet, CEA-List, FR
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | LBR02.1 | DESIGNING AN ENERGY-EFFICIENT FULLY-ASYNCHRONOUS DEEP LEARNING CONVOLUTION ENGINE Presenter: Mattia Vezzoli, Yale University, US Authors: Mattia Vezzoli1, Lukas Nel1, Kshitij Bhardwaj2, Rajit Manohar1 and Maya Gokhale2 1Yale University, US; 2Lawrence Livermore National Lab, US Abstract In the face of exponential growth in semiconductor energy usage, there is a significant push towards highly energy-efficient microelectronics design. While traditional circuit designs typically employ clocks to synchronize the computing operations, these circuits incur significant performance and energy overheads due to their data-independent worst-case operation and complex clock tree networks. In this paper, we explore asynchronous or clockless techniques where clocks are replaced by request/acknowledge handshaking signals. To quantify the potential energy and performance gains of asynchronous logic, we design a highly energy-efficient asynchronous deep learning convolution engine, which accounts for 87% of total DL accelerator energy. Our asynchronous design shows 5.06× lower energy and 5.09× lower delay than the synchronous one. |
14:03 CET | LBR02.2 | FINE-TUNING SURROGATE GRADIENT LEARNING FOR OPTIMAL HARDWARE PERFORMANCE IN SPIKING NEURAL NETWORKS Presenter: Ilkin Aliyev, University of Arizona, US Authors: Ilkin Aliyev and Tosiron Adegbija, University of Arizona, US Abstract The highly sparse activations in Spiking Neural Networks (SNNs) can provide tremendous energy efficiency benefits when carefully exploited in hardware. The behavior of sparsity in SNNs is uniquely shaped by the dataset and training hyperparameters. This work reveals novel insights into the impacts of training on hardware performance. Specifically, we explore the trade-offs between model accuracy and hardware efficiency. We focus on three key hyperparameters: surrogate gradient functions, beta, and membrane threshold. Results on an FPGA-based hardware platform show that the fast sigmoid surrogate function yields a lower firing rate with similar accuracy compared to the arctangent surrogate on the SVHN dataset. Furthermore, by cross-sweeping the beta and membrane threshold hyperparameters, we can achieve a 48% reduction in hardware-based inference latency with only a 2.88% trade-off in inference accuracy compared to the default setting. Overall, this study highlights that fine-tuning model hyperparameters is crucial for designing efficient SNN hardware accelerators, evidenced by the fine-tuned model achieving a 1.72× improvement in accelerator efficiency (FPS/W) compared to the most recent work. (A small sketch of the two surrogate-gradient shapes follows this session's table.) |
14:06 CET | LBR02.3 | CLASSONN: CLASSIFICATION WITH OSCILLATORY NEURAL NETWORKS USING THE KURAMOTO MODEL Presenter: Filip Sabo, Eindhoven University of Technology, NL Authors: Filip Sabo and Aida Todri-Sanial, Eindhoven University of Technology, NL Abstract In recent years, networks of coupled oscillators, or oscillatory neural networks (ONNs), have emerged as an alternative computing paradigm with information encoded in phase. Such networks are intrinsically attractive for associative memory applications such as pattern retrieval. Thus far, there are few works focusing on image classification using ONNs, as there is no straightforward way to do it. This paper investigates the performance of a neuromorphic phase-based classification model using a fully connected single-layer ONN. For benchmarking, we deploy the ONN on the full set of 28x28 binary MNIST handwritten digits and achieve around 70% accuracy on both the training and test sets. To the best of our knowledge, this is the first effort classifying such large images utilizing ONNs. (A minimal Kuramoto-model sketch follows this session's table.) |
14:09 CET | LBR02.4 | FLUID DYNAMIC DNNS FOR RELIABLE AND ADAPTIVE DISTRIBUTED INFERENCE ON EDGE DEVICES Presenter: Lei Xun, University of Southampton, GB Authors: Lei Xun1, Mingyu Hu1, Hengrui Zhao1, Amit Kumar Singh2, Jonathon Hare1 and Geoff Merrett1 1University of Southampton, GB; 2University of Essex, GB Abstract Distributed inference is a popular approach for efficient DNN inference at the edge. However, traditional Static and Dynamic DNNs are not distribution-friendly, causing system reliability and adaptability issues. In this paper, we introduce Fluid Dynamic DNNs (Fluid DyDNNs), tailored for distributed inference. Distinct from Static and Dynamic DNNs, Fluid DyDNNs utilize a novel nested incremental training algorithm to enable independent and combined operation of their sub-networks, enhancing system reliability and adaptability. Evaluation on embedded Arm CPUs with a DNN model and the MNIST dataset shows that in scenarios of single device failure, Fluid DyDNNs ensure continued inference, whereas Static and Dynamic DNNs fail. When devices are fully operational, Fluid DyDNNs can operate either in a High-Accuracy mode, achieving accuracy comparable to Static DNNs, or in a High-Throughput mode, achieving 2.5x and 2x the throughput of Static and Dynamic DNNs, respectively./proceedings-archive/2024/DATA/10023_pdf_upload.pdf |
14:12 CET | LBR02.5 | PIMSIM-NN: AN ISA-BASED SIMULATION FRAMEWORK FOR PROCESSING-IN-MEMORY ACCELERATORS Presenter: Xinyu Wang, Institute of Computing Technology, Chinese Academy of Sciences, CN Authors: Xinyu Wang, Xiaotian Sun, Yinhe Han and Xiaoming Chen, Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract Processing-in-memory (PIM) has shown extraordinary potential in accelerating neural networks. To evaluate the performance of PIM accelerators, we present an ISA-based simulation framework including a dedicated ISA targeting neural networks running on PIM architectures, a compiler, and a cycle-accurate configurable simulator. Compared with prior works, this work decouples software algorithms and hardware architectures through the proposed ISA, providing a more convenient way to evaluate the effectiveness of software/hardware optimizations. The simulator adopts an event-driven simulation approach and has better support for hardware parallelism. The framework is open-sourced at https://github.com/wangxy-2000/pimsim-nn. |
14:15 CET | LBR02.6 | CIMCOMP: AN ENERGY EFFICIENT COMPUTE-IN-MEMORY BASED COMPARATOR FOR CONVOLUTIONAL NEURAL NETWORKS Presenter: Kavitha Soundra pandiyan, Research scholar, IN Authors: Kavitha Soundra pandiyan1, Bhupendra Singh Reniwal2 and Binsu J Kailath3 1Research scholar, IN; 2Indian Institute of Technology Jodhpur, IN; 3Indian Institute of Information Technology, Design and Manufacturing, Kancheepuram, IN Abstract The utilization of large datasets in applications results in significant energy expenditure attributed to frequent data movement between memory and processing units. In-Memory Computing (IMC) distinguishes itself by performing logic operations within a memory crossbar, leading to enhanced computational speed and energy efficiency. This study introduces a RASA-based subtractor, strategically optimized for computation and energy consumption. The proposed subtractor is then employed to construct a comparator and facilitate pooling operations. The comparator, built from the proposed subtractor, completes a comparison in n steps for n-bit operands. Additionally, an n-bit min-pooling operation for an n x n (4 x 4) feature map requires 2^n - 1 (15) steps. The RASA design reduces energy consumption by an average of 87.42% and 89.98% compared to the ASA and Muller C-based subtractors, respectively. |
14:18 CET | LBR02.7 | A SOUND AND COMPLETE ALGORITHM TO IDENTIFY INDEPENDENT VARIABLES IN A REACTIVE SYSTEM SPECIFICATION Speaker: Montserrat Hermo, University of the Basque Country, ES Authors: Montserrat Hermo1, Josu Oca1 and Alexander Bolotov2 1University of the Basque Country, ES; 2University of Westminster, GB Abstract We present a sound and complete algorithm for the detection of independent variables in linear temporal logic formulae. These formulae are often used to specify reactive systems. The algorithm is based on the use of model checkers./proceedings-archive/2024/DATA/10024_pdf_upload.pdf |
14:21 CET | LBR02.8 | LATE BREAKING RESULTS: ITERATIVE DESIGN AUTOMATION FOR TRAIN CONTROL WITH HYBRID TRAIN DETECTION Presenter: Stefan Engels, TU Munich, DE Authors: Stefan Engels and Robert Wille, TU Munich, DE Abstract To increase the capacity of existing railway infrastructure, the European Train Control System (ETCS) allows the introduction of virtual subsections. As of today, the planning of such systems is mainly done by hand. Previous design automation methods suffer from long runtimes in certain instances. However, late breaking results show that these methods can highly benefit from an iterative approach. An initial implementation of the resulting method is available in open-source as part of the Munich Train Control Toolkit at https://github.com/cda-tum/mtct. |
14:24 CET | LBR02.9 | TOWARDS AN EMBEDDED SYSTEM FOR FAILURE DIAGNOSIS IN DRONES USING AI AND SAC-DM ON FPGA Presenter: Rafael Batista, Federal University of Paraíba, BR Authors: Rafael Batista1, Matthias Nickel2, Alexander Lehnert3, Sergio Pertuz2, Marc Reichenbach3, Diana Goehringer2 and Alisson Brito1 1Federal University of Paraíba, BR; 2TU Dresden, DE; 3Universität Rostock, DE Abstract We present an approach for real-time failure detection in unmanned aerial vehicles (UAVs) by integrating Chaos Theory and AI techniques on an FPGA board. The Signal Analysis based on Chaos using the Density of Maxima (SAC-DM) validates the input of the Machine Learning (ML) model due to the relation between the density of maxima and the autocorrelation length. The accuracy achieved by SAC-DM alone is not notably high; however, the ML model provides an accuracy of 92.46% when SAC-DM results are used as inputs. An FPGA board is used as a solution for high-speed onboard processing, parallel integrated data synchronization and fusion, and an enhanced low-power architecture. |
14:27 CET | LBR02.10 | ENVIRONMENTAL MICROCHANGES IN WIFI SENSING Speaker: Florenc Demrozi, University of Stavanger, NO Authors: Cristian Turetta1, Philipp H. Kindt2, Alejandro Masrur3, Samarjit Chakraborty4, Graziano Pravadelli1 and Florenc Demrozi5 1Università di Verona, IT; 2Lehrstuhl für Realzeit-Computersysteme (RCS), TU München (TUM), DE; 3TU Chemnitz, DE; 4UNC Chapel Hill, US; 5Department of Electrical Engineering and Computer Science, University of Stavanger, NO Abstract Using WiFi's Channel State Information for human activity recognition—referred to as WiFi sensing—has attracted considerable attention. But despite this interest and many publications over a decade, WiFi sensing has not yet found its way into practice because of a lack of robustness of the inference results. In this paper, we quantitatively show that even "microchanges" in the environment can significantly impact WiFi signals, and potentially alter the ML inference results. We therefore argue that new training and inference techniques might be necessary for mainstream adoption of WiFi sensing. |
14:30 CET | LBR02.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
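For LBR02.2, the sketch below evaluates two commonly used surrogate-gradient shapes (fast sigmoid and arctangent). The exact parameterisation used in the paper may differ, and the slope values below are hypothetical.

```python
# Illustrative sketch only: common surrogate-gradient shapes for SNN training.
import numpy as np

def fast_sigmoid_grad(x, k=25.0):
    """Derivative of x / (1 + k|x|), used as a surrogate for the spike gradient."""
    return 1.0 / (1.0 + k * np.abs(x)) ** 2

def arctan_grad(x, k=2.0):
    """Derivative of (1/pi) * arctan(pi * k * x), another popular surrogate."""
    return k / (1.0 + (np.pi * k * x) ** 2)

v = np.linspace(-1.0, 1.0, 9)            # membrane potential minus threshold
print(np.round(fast_sigmoid_grad(v), 4))
print(np.round(arctan_grad(v), 4))
# A sharper surrogate concentrates gradient flow near the threshold, which in
# turn shapes firing rates and therefore hardware activity and sparsity.
```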
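For LBR02.3, the sketch below integrates the standard Kuramoto phase equations with a toy network size and coupling strength; phase-locking (an order parameter near 1) is the behaviour that phase-encoded ONN classification builds on.

```python
# Illustrative sketch only: Euler integration of the standard Kuramoto model.
import numpy as np

rng = np.random.default_rng(0)
N, K, dt, steps = 8, 2.0, 0.01, 2000
omega = rng.normal(0.0, 0.5, N)          # natural frequencies
theta = rng.uniform(0, 2 * np.pi, N)     # initial phases (inputs would be encoded here)

for _ in range(steps):
    # dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)
    coupling = np.sin(theta[None, :] - theta[:, None]).sum(axis=1)
    theta = theta + dt * (omega + (K / N) * coupling)

# Order parameter r in [0, 1]: r close to 1 means the oscillators phase-lock,
# the regime exploited for phase-encoded association and classification.
r = np.abs(np.exp(1j * theta).mean())
print(round(r, 3))
```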
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: VIP Room
Session chair:
Nima TaheriNejad, Heidelberg University, DE
Speaker: Francisco Herrara
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
Weikang Qian, Shanghai Jiao Tong University, CN
Session co-chair:
Rassul Bairamkulov, EPFL, CH
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | TS10.1 | SCALABLE LOGIC REWRITING USING DON'T CARES Speaker: Alessandro Tempia Calvino, EPFL, CH Authors: Alessandro Tempia Calvino and Giovanni De Micheli, EPFL, CH Abstract Logic rewriting is a powerful optimization technique that replaces small sections of a Boolean network with better implementations. Typically, exact synthesis is used to compute optimum replacements on the fly, with possible support for Boolean don't cares. However, exact synthesis is computationally expensive, rendering it impractical in industrial tools. For this reason, optimum structures are typically pre-computed and stored in a database, commonly limited to 4 inputs. Nevertheless, this method does not support the use of don't cares. In this paper, we propose a technique to enable the usage of don't cares in pre-computed databases. We show how to process the database and perform Boolean matching with Boolean don't cares, with negligible run time overhead. Logic rewriting techniques are typically very effective at optimizing majority-inverter graphs (MIGs). In the experiments, we show that the usage of don't cares in logic rewriting on MIGs offers an average size improvement of 4.31% and up to 14.32% compared to a state-of-the-art synthesis flow./proceedings-archive/2024/DATA/659_pdf_upload.pdf |
14:05 CET | TS10.2 | A SEMI-TENSOR PRODUCT BASED CIRCUIT SIMULATION FOR SAT-SWEEPING Speaker: Hongyang Pan, Fudan University, CN Authors: Hongyang Pan1, Ruibing Zhang2, Lunyao Wang2, Yinshui Xia2, Zhufei Chu2, Fan Yang3 and Xuan Zeng3 1Fudan University, CN; 2Ningbo University, CN; 3Fudan University, CN Abstract This paper introduces a novel circuit simulator for k-input lookup table (k-LUT) networks based on the semi-tensor product (STP). The STP-based simulator computes with logic matrices, the primitives of logic networks, instead of relying on bitwise logic operations to simulate k-LUT networks. Experimental results show that our STP-based simulator reduces the runtime by an average of 7.2x. Furthermore, we integrate the proposed simulator into a SAT sweeper. Through a combination of structural hashing, simulation, and SAT queries, the SAT sweeper simplifies logic networks by systematically merging graph vertices from input to output. To enhance efficiency, we use STP-based exhaustive simulation, which significantly reduces the number of false equivalence class candidates, thereby improving computational efficiency by reducing the number of SAT calls required. Compared to the state-of-the-art SAT sweeper, our method demonstrates an average 35% runtime reduction. (A worked STP example follows this session's table.)/proceedings-archive/2024/DATA/326_pdf_upload.pdf |
14:10 CET | TS10.3 | BESWAC: BOOSTING EXACT SYNTHESIS VIA WISER SAT SOLVER CALL Speaker: Sunan Zou, Peking University, CN Authors: Sunan Zou, Jiaxi Zhang, Bizhao Shi and Guojie Luo, Peking University, CN Abstract SAT-based exact synthesis is a critical technique in logic synthesis to generate optimal circuits for given Boolean functions. The lengthy trial-and-error process limits its application in on-the-fly logic optimization and optimal netlist library construction. Previous research focuses on reducing the execution time of each trial. However, unnecessary SAT solver calls and varying execution times among encoding methods remain issues. This paper presents BESWAC to boost exact synthesis from the flow level. It leverages initial value prediction, encoding method selection, and an optional early exit to call SAT solvers efficiently and wisely. Moreover, BESWAC can seamlessly integrate existing acceleration methods focusing on individual trials. Experimental results show that BESWAC achieves a 1.79x speedup compared to state-of-the-art exact synthesis flows./proceedings-archive/2024/DATA/34_pdf_upload.pdf |
14:15 CET | TS10.4 | CBTUNE: CONTEXTUAL BANDIT TUNING FOR LOGIC SYNTHESIS Speaker: Fangzhou Liu, The Chinese University of Hong Kong, CN Authors: Fangzhou Liu1, Zehua Pei1, Ziyang Yu1, Haisheng Zheng2, Zhuolun He1, Tinghuan Chen3 and Bei Yu1 1The Chinese University of Hong Kong, HK; 2Shanghai AI Laboratory, CN; 3The Chinese University of Hong Kong, CN Abstract Logic synthesis pre-optimization involves applying a sequence of transformations called synthesis flow to reduce the circuit's Boolean logic graph, like AIG. However, the challenge lies in selecting and arranging these transformations due to the exponentially expanding solution space. In this work, we propose a novel online learning framework CBTune that employs a contextual bandit algorithm to explore the solution space and generate synthesis flows efficiently. We develop the Syn-LinUCB algorithm as the agent, which incorporates circuit characteristics and leverages long-term payoffs to guide decision-making and effectively prevent falling into local optima. Experimental results show that our framework achieves the optimal synthesis flow with a lower time cost, substantially reducing the number of AIG nodes and 6-LUTs compared to SOTA approaches./proceedings-archive/2024/DATA/113_pdf_upload.pdf |
14:20 CET | TS10.5 | FAST IR-DROP PREDICTION OF ANALOG CIRCUITS USING RECURRENT SYNCHRONIZED GCN AND Y-NET MODEL Speaker: Seunggyu Lee, KAIST, KR Authors: Seunggyu Lee1, Daijoon Hyun2, Younggwang Jung1, Gangmin Cho1 and Youngsoo Shin1 1KAIST, KR; 2Sejong University, KR Abstract IR-drop analysis of analog circuits is a challenge because (1) the current waveforms of target transistors, with connection to VDD or VSS, are extracted through transistor-level simulation, and (2) the analysis itself, in particular dynamic one, is computationally expensive. We introduce two ML models for high-speed analysis. (1) Recurrent synchronized graph convolutional network (RS-GCN) is used for quick prediction of current waveforms. Each subcircuit is modeled with recurrent-GCN, in which recurrent connection is for the analysis in discrete time series. Recurrent-GCNs are synchronized to take account of common connections including VDD, VSS, and the inputs and outputs of subcircuits. Experiments show that RS-GCN takes only 0.85% of SPICE runtime, while prediction error is 14% on average. (2) Y-Net is applied for actual IR-drop analysis of small layout partition, one by one. Pad location and PDN resistance are provided as one 2D input of Y-Net; they are encoded and go through GCNs to account for neighbor layout partitions. Current map, derived from RS-GCN, becomes the second input. Final IR-drop map is extracted from the decoder. Experiments demonstrate that Y-Net, in conjunction with RS-GCN for current extraction, takes 2.5% of runtime from popular commercial solution with 15% prediction inaccuracy./proceedings-archive/2024/DATA/504_pdf_upload.pdf |
14:25 CET | TS10.6 | STANDARD CELLS DO MATTER: UNCOVERING HIDDEN CONNECTIONS FOR HIGH-QUALITY MACRO PLACEMENT Speaker: Xiaotian Zhao, Shanghai Jiao Tong University, CN Authors: Xiaotian Zhao1, Tianju Wang1, Run Jiao2 and Xinfei Guo1 1Shanghai Jiao Tong University, CN; 2HiSilicon, CN Abstract It becomes increasingly critical for an intelligent macro placer to be able to uncover a good dataflow for a large-scale chip, to reduce churn from manual trial and error. Existing macro placers or mixed-size placement engines equipped with dataflow awareness mostly focused on extracting intrinsic relations among macros only, ignoring the fact that standard cell clusters play an essential role in determining the location of macros. In this paper, we identify the necessity of macro-cell connection awareness for high-quality macro placement and propose a novel methodology to efficiently extract all "hidden" relationships among macros and cell clusters. By integrating the discovered connections as part of the placement constraints, the proposed methodology achieves an average of 2.8% and 5.5% half-perimeter wirelength (HPWL) improvement when considering one-hop macro-cell and two-hop macro-cell-cell dataflow connections, respectively, compared against a recently proposed dataflow-aware macro placer. A maximum of 9.7% HPWL improvement has been achieved, incurring less than 1% runtime penalty. In addition, congestion has been improved significantly by the proposed method, yielding an average of 62.9% and 73.4% overflow reduction for one-hop and two-hop dataflow considerations. The proposed dataflow connection extraction methodology has been demonstrated to be a significant starting point for macro placement and can be integrated into existing design flows while delivering better design quality./proceedings-archive/2024/DATA/101_pdf_upload.pdf |
14:30 CET | TS10.7 | ELECTROSTATICS-BASED ANALYTICAL GLOBAL PLACEMENT FOR TIMING OPTIMIZATION Speaker: Zhifeng Lin, Fuzhou University, CN Authors: Zhifeng Lin1, Min Wei2, Yilu Chen1, Peng Zou3, Jianli Chen2 and Yao-Wen Chang4 1Fuzhou University, CN; 2Fudan University, CN; 3Shanghai LEDA Technology Co., Ltd., CN; 4National Taiwan University, TW Abstract Placement is a critical stage for VLSI timing closure. A global placer without considering timing delay might lead to inferior solutions with timing violations. This paper proposes an electrostatics-based timing optimization method for VLSI global placement. Simulating the optimal buffering behavior, we first present an analytical delay model to calculate each connection delay accurately. Then, a timing-driven block distribution scheme is developed to optimize the critical path delay while considering the path-sharing effect. Finally, we develop a timing-aware precondition technique to speed up placement convergence without degrading timing quality. Experimental results on industrial benchmark suites show that our timing-driven placement algorithm outperforms a leading commercial tool by 6.7% worst negative slack (WNS) and 21.6% total negative slack (TNS)./proceedings-archive/2024/DATA/304_pdf_upload.pdf |
14:35 CET | TS10.8 | IMPROVEMENT OF MIXED TRACK-HEIGHT STANDARD-CELL PLACEMENT Speaker: Minhyuk Kweon, Pohang University of Science and Technology, KR Authors: Andrew Kahng1, Seokhyeong Kang2 and Minhyuk Kweon2 1University of California, San Diego, US; 2Pohang University of Science and Technology, KR Abstract In sub-5nm nodes, the track height of standard cells must be aggressively scaled down while preserving design PPA. This requirement brings the challenge of placing a set of cells that have mixed track heights, subject to the constraint that cells with the same height must be placed together in an "island" of cell rows. We apply integer linear programming (ILP) to solve the row assignment problem and improve the runtime of ILP with clustering, using a cost function that combines half-perimeter wirelength and displacement from a starting unconstrained placement. Considering the row assignment solution, we define fence-regions, which enable an existing place-and-route (P&R) tool to place the cells while considering the row-island constraints. Experimental results show that our proposed method can on average reduce final-routed wirelength by 8.5% and total power by 3.3%, with worst negative slack and total negative slack reductions of 24.0% and 13.0%, compared with the previous state-of-the-art method [10]. (A toy ILP sketch of the row-assignment step follows this session's table.)/proceedings-archive/2024/DATA/840_pdf_upload.pdf |
14:40 CET | TS10.9 | BOOLGEBRA: ATTRIBUTED GRAPH-LEARNING FOR BOOLEAN ALGEBRAIC MANIPULATION Speaker: Yingjie Li, University of Maryland, US Authors: Yingjie Li1, Anthony Agnesina2, Yanqing Zhang2, Haoxing Ren2 and Cunxi Yu1 1University of Maryland, US; 2NVIDIA Corp., US Abstract Boolean algebraic manipulation is at the core of logic synthesis in Electronic Design Automation (EDA) design flow. Existing methods struggle to fully exploit optimization opportunities, and often suffer from an explosive search space and limited scalability efficiency, e.g., existing state-of-the-art logic synthesis approaches are implemented based on local algebraic rewriting, i.e., rewriting, resubstitution, and refactoring. To address these challenges, this work presents BoolGebra, a novel attributed graph-learning approach for Boolean algebraic manipulation that aims to improve fundamental logic synthesis. BoolGebra incorporates Graph Neural Networks (GNNs) and takes initial feature embeddings from both structural and functional information as inputs. A fully connected neural network is employed as the predictor for direct optimization result predictions, significantly reducing the search space and efficiently locating the optimization space. The approach offers a more comprehensive and efficient means of algebraic manipulation compared to traditional stand-alone optimization methods. The experiments involve training the BoolGebra model with respect to different training datasets and performing design-specific and cross-design inferences using the trained model. The proposed model demonstrates generalizability for cross-design inference and its potential to scale from small, simple training datasets to large, complex inference datasets. Finally, BoolGebra is integrated with existing synthesis tool ABC to perform end-to-end logic minimization evaluation w.r.t SOTA baselines./proceedings-archive/2024/DATA/178_pdf_upload.pdf |
14:41 CET | TS10.10 | AN EFFICIENT LOGIC OPERATION SCHEDULER FOR MINIMIZING MEMORY FOOTPRINT OF IN-MEMORY SIMD COMPUTATION Speaker: Xingyue Qian, Shanghai Jiao Tong University, CN Authors: Xingyue Qian, Zhezhi He and Weikang Qian, Shanghai Jiao Tong University, CN Abstract Many in-memory computing (IMC) designs based on single instruction multiple data (SIMD) concept have been proposed in recent years to perform primitive logic operations within memory, for improving energy efficiency. To fully exploit the advantage of SIMD IMC, it is crucial to identify an optimized schedule for the operations with less intermediate memory usage, known as memory footprint (MF). In this work, we implement a recursive partition-based scheduler which consists of our scheduler-friendly partition algorithm and a modified optimal scheduler. Compared to three state-of-the-art heuristic strategies, ours can reduce MF by 56.9%, 46.0%, and 31.9%, respectively./proceedings-archive/2024/DATA/1094_pdf_upload.pdf |
14:42 CET | TS10.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
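For TS10.2, the worked example below shows the semi-tensor product (STP) itself, the matrix operation the simulator is built on, evaluating a 2-input AND gate through its structure matrix.

```python
# Illustrative sketch only: semi-tensor product (STP) evaluation of an AND gate.
import numpy as np
from math import lcm

def stp(A, B):
    """Semi-tensor product: (A kron I_{t/n}) (B kron I_{t/p}) with t = lcm(n, p)."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    n, p = A.shape[1], B.shape[0]
    t = lcm(n, p)
    return np.kron(A, np.eye(t // n)) @ np.kron(B, np.eye(t // p))

T = np.array([[1], [0]])            # logic 1 encoded as a vector
F = np.array([[0], [1]])            # logic 0
M_and = np.array([[1, 0, 0, 0],     # structure (logic) matrix of 2-input AND
                  [0, 1, 1, 1]])

print(stp(stp(M_and, T), T).ravel())   # [1. 0.] -> logic 1
print(stp(stp(M_and, T), F).ravel())   # [0. 1.] -> logic 0
```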
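For TS10.8, the sketch below poses a toy version of the row-assignment step as an ILP with PuLP. The cells, rows, capacities, and the displacement-only cost are assumptions made for illustration; the paper's cost function also includes half-perimeter wirelength.

```python
# Illustrative sketch only (toy data): ILP assignment of mixed-height cells to
# same-height row islands, minimising vertical displacement from a starting point.
import pulp

cells = {"a": (7, 2.0), "b": (7, 5.0), "c": (9, 1.0)}    # name -> (track height, start y)
rows = {"r0": (7, 1.0), "r1": (7, 4.0), "r2": (9, 2.0)}  # name -> (track height, row y)
capacity = {"r0": 1, "r1": 1, "r2": 1}                    # max cells per row (toy)

# Only create variables for height-compatible (cell, row) pairs: the "island" rule.
pairs = [(c, r) for c in cells for r in rows if cells[c][0] == rows[r][0]]
x = pulp.LpVariable.dicts("x", pairs, cat="Binary")

prob = pulp.LpProblem("row_assignment", pulp.LpMinimize)
prob += pulp.lpSum(abs(cells[c][1] - rows[r][1]) * x[(c, r)] for c, r in pairs)
for c in cells:                                           # every cell gets one legal row
    prob += pulp.lpSum(x[(c, r)] for r in rows if (c, r) in x) == 1
for r in rows:                                            # simple row capacity
    prob += pulp.lpSum(x[(c, r)] for c in cells if (c, r) in x) <= capacity[r]

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print({c: r for c, r in pairs if x[(c, r)].value() == 1})
```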
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Multi-Purpose Room M1B+D
Session chair:
Angeliki Kritikakou, IRISA, FR
Session co-chair:
Annachiara Ruospo, Politecnico di Torino, IT
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | TS19.1 | GRADIENT BOOSTING-ACCELERATED EVOLUTION FOR MULTIPLE-FAULT DIAGNOSIS Speaker: Hongfei Wang, Huazhong University of Science & Technology, CN Authors: Hongfei Wang, Chenliang Luo, Deqing Zou, Hai Jin and Wenjie Cai, Huazhong University of Science & Technology, CN Abstract Logic diagnosis is a key step in yield learning. Multiple-fault diagnosis is challenging for several reasons, including error masking, fault reinforcement, and the huge search space of possible fault combinations. This work proposes a two-phase method for multiple-fault diagnosis. The first phase efficiently reduces the potential number of fault candidates through machine learning. The second phase obtains the final diagnosis results by formulating the task as a combinatorial optimization problem that is then iteratively solved using binary evolutionary computation. Experiments show that our method outperforms two existing methods for multiple-fault diagnosis, and achieves better diagnosability (improved by 1.87X) and resolution (improved by 1.42X) compared with a state-of-the-art commercial diagnosis tool./proceedings-archive/2024/DATA/468_pdf_upload.pdf |
14:05 CET | TS19.2 | A NOVEL MARCH TEST ALGORITHM FOR TESTING 8T SRAM-BASED IMC ARCHITECTURES Speaker: Lila AMMOURA, LIRMM/CNRS, FR Authors: Lila Ammoura1, Marie-Lise Flottes2, Patrick Girard1, Jean-Philippe Noel3 and Arnaud Virazel1 1LIRMM, FR; 2LIRMM CNRS, FR; 3CEA, FR Abstract The shift towards data-centric computing paradigms has given rise to new architectural approaches aimed at minimizing data movement and enhancing computational efficiency. In this context, In-Memory Computing (IMC) architectures have gained prominence for their ability to perform processing tasks within the memory array, reducing the need for data transfers. However, the susceptibility of these new paradigms to manufacturing defects poses a critical challenge. This paper presents a novel March-like test algorithm for 8T SRAM-based IMC architectures, addressing the imperative need for comprehensive coverage of read-port-related defects. The proposed algorithm achieves complete coverage of potential read port defects while maintaining a level of complexity equivalent to existing state-of-the-art test solutions./proceedings-archive/2024/DATA/434_pdf_upload.pdf |
14:10 CET | TS19.3 | RELIABLE INTERVAL PREDICTION OF MINIMUM OPERATING VOLTAGE BASED ON ON-CHIP MONITORS VIA CONFORMALIZED QUANTILE REGRESSION Speaker: Chen He, NXP Semiconductors, US Authors: Yuxuan Yin1, Xiaoxiao Wang2, Rebecca Chen2, Chen He2 and Peng Li1 1University of California, Santa Barbara, US; 2NXP Semiconductors, US Abstract Predicting the minimum operating voltage (Vmin) of chips is one of the important techniques for improving the manufacturing testing flow, as well as ensuring the long-term reliability and safety of in-field systems. Current Vmin prediction methods often provide only point estimates, necessitating additional techniques for constructing prediction confidence intervals to cover uncertainties caused by different sources of variations. While some existing techniques offer region predictions, they rely on certain distributional assumptions and/or provide no coverage guarantees. In response to these limitations, we propose a novel distribution-free Vmin interval estimation methodology possessing a theoretical guarantee of coverage. Our approach leverages conformalized quantile regression and on-chip monitors to generate reliable prediction intervals. We demonstrate the effectiveness of the proposed method on an industrial 5nm automotive chip dataset. Moreover, we show that the use of on-chip monitors can reduce the interval length significantly for Vmin prediction./proceedings-archive/2024/DATA/754_pdf_upload.pdf |
14:15 CET | TS19.4 | PA-2SBF: PATTERN-ADAPTIVE TWO-STAGE BLOOM FILTER FOR RUN-TIME MEMORY DIAGNOSTIC DATA COMPRESSION IN AUTOMOTIVE SOCS Speaker: Sunyoung Park, Ewha Womans University, KR Authors: Sunyoung Park, Hyunji Kim, Hana Kim and Ji-Hoon Kim, Ewha Womans University, KR Abstract In the realm of safety-critical Automotive System-on-Chip (SoC) design, memory functionality plays a crucial role in determining overall die yield due to its sizable footprint on the chip. Therefore, efficient methodologies for both post-manufacturing offline testing and real-time monitoring are essential to provide timely diagnostic feedback. This paper proposes a real-time memory diagnostic data compression technique, the Pattern-Adaptive Two-Stage Bloom Filter (PA-2SBF), for automotive SoC applications. PA-2SBF is designed to address the challenge of false positives by incorporating frequently encountered failure patterns. Bloom filters, a probabilistic data structure renowned for their space-efficiency and quick approximate membership queries, are employed to expedite fault memory diagnosis information lookup and compression. Furthermore, failure patterns are considered to mitigate the false positive rate inherent in Bloom filters. The paper also presents a strategy for leveraging the compressed diagnostic information during run-time. Specifically, it exploits the quick lookup feature of the Bloom filter to prohibit CPU access to defective memory regions, enhancing overall system reliability./proceedings-archive/2024/DATA/476_pdf_upload.pdf |
14:20 CET | TS19.5 | DEVICE-AWARE DIAGNOSIS FOR YIELD LEARNING IN RRAMS Speaker: Hanzhi Xun, TU Delft, NL Authors: Hanzhi Xun1, Moritz Fieback1, Sicong Yuan1, Hassan AZIZA2, Mottaqiallah Taouil1 and Said Hamdioui1 1TU Delft, NL; 2Aix-Marseille Université, FR Abstract Resistive Random Access Memories (RRAMs) are now undergoing commercialization, with substantial investment from many semiconductor companies. However, due to the immature manufacturing process, RRAMs are prone to exhibit unique defects, which should be efficiently identified for high-volume production. Hence, obtaining diagnostic solutions for RRAMs is necessary to facilitate yield learning, and improve RRAM quality. Recently, the Device-Aware Test (DAT) approach has been proposed as an effective method to detect unique defects in RRAMs. However, the DAT focuses more on developing defect models to aid production testing but does not focus on the distinctive features of defects to diagnose different defects. This paper proposes a Device-Aware Diagnosis method; it is based on the DAT approach, which is extended for diagnosis. The method aims to efficiently distinguish unique defects and conventional defects based on their features. To achieve this, we first define distinctive features of each defect based on physical analysis and characterizations. Then, we develop efficient diagnosis algorithms to extract electrical features and fault signatures for them. The simulation results show the effectiveness of the developed method to reliably diagnose all targeted defects./proceedings-archive/2024/DATA/640_pdf_upload.pdf |
14:25 CET | TS19.6 | TESTING ALGORITHMS FOR HARD TO DETECT THERMAL CROSSTALK INDUCED WRITE DISTURB FAULTS IN PHASE CHANGE MEMORIES Speaker: Yiorgos Tsiatouhas, University of Ioannina, GR Authors: Spyridon Spyridonos and Yiorgos Tsiatouhas, University of Ioannina, GR Abstract Phase change memory (PCM) is a promising non-volatile memory technology to serve as storage class memory in modern systems. However, during write operations on PCM cells, high temperatures are locally generated for the heating of the used phase change material. The heat diffusion may influence adjacent cells (thermal crosstalk), leading to the generation of write disturb faults and the appearance of errors in the data stored in the memory array. Moreover, as technology scaling increases proximity among cells, the probability of thermal crosstalk influence is also increased. Thermal crosstalk is a key reliability challenge in PCMs. Although various techniques have been proposed to alleviate its influence, it may affect the memory operation under special conditions that maximize the heat in the neighborhood of a cell. Due to these special conditions, the corresponding write disturb faults are hard to detect. Existing testing solutions in the literature are not capable of producing such conditions and may fail to cover those hard-to-detect faults. In this paper, we propose effective testing algorithms, capable of generating proper conditions for the activation and detection of hard-to-detect thermally induced write disturb faults, in a cost-efficient manner. The algorithms' complexity varies from 3MxN to 10MxN (where M and N are the number of memory array rows and columns respectively)./proceedings-archive/2024/DATA/975_pdf_upload.pdf |
14:30 CET | TS19.7 | ON GATE FLIP ERRORS IN COMPUTING-IN-MEMORY Speaker: Ulya Karpuzcu, University of Minnesota, US Authors: Zamshed Iqbal Chowdhury1, Husrev Cilasun2, Salonik Resch2, Masoud Zabihi2, Yang Lv2, Brandon Zink2, Jian-Ping Wang2, Sachin S. Sapatnekar2 and Ulya Karpuzcu1 1University of Minnesota, Twin Cities, US; 2University of Minnesota, US Abstract Computing-in-memory (CIM) architectures that perform logic gate operations directly within memory arrays, in-situ, are particularly effective in addressing memory-induced performance bottlenecks. When paired with nonvolatile memory, energy efficiency in performing bulk bitwise logic operations can reach unprecedented levels. However, unlocking this potential is not possible if functional correctness is compromised. In this paper we present a CIM-specific class of functional errors termed gate flips, where parametric variations make a logic gate behave as another. Through detailed functional and electrical characterization we demonstrate that gate flips stem from a significant subclass of write errors. Accordingly, we introduce an abstract model to enable efficient functional reliability assessment and to guide design decisions in forming universal CIM gate libraries. We also evaluate the impact on the end accuracy of computation using representative benchmarks./proceedings-archive/2024/DATA/255_pdf_upload.pdf |
14:35 CET | TS19.8 | GUIDED FAULT INJECTION STRATEGY FOR RAPID CRITICAL BIT DETECTION IN RADIATION-PRONE SRAM-FPGA Speaker: Trishna Rajkumar, KTH Royal Institute of Technology, SE Authors: Trishna Rajkumar and Johnny Öberg, KTH Royal Institute of Technology, SE Abstract Fault injection testing is vital for assessing the reliability of SRAM-FPGAs used in radiative environments. Considering the scale and complexity of modern FPGAs, exhaustive fault injection is tedious and computationally expensive. A common approach to optimise the injection campaign involves targeting a subset of the configuration memory containing essential and critical bits crucial for the system's functionality. Identifying Essential bits in an FPGA design is often feasible through manufacturer documentation. However, detecting Critical bits requires complex reverse engineering to map the correspondence between the configuration bits and the FPGA modules. This task requires a substantial amount of detail about the logic layout and the bitstream, which are not easily available due to their proprietary nature. In some cases, manual floorplanning becomes necessary, which could impact the performance of the application. Given these limitations, we examine the potential of Monte Carlo Tree Search in guiding the fault injection process to identify critical bits with minimal injections. The key benefit of this approach is its ability to harness the spatial relations among the configuration bits without relying on reverse engineering or offline campaign planning. Evaluation results demonstrate that the proposed approach detects 99% of the critical bits using 18% fewer injections than traditional methods. Notably, 95% of the critical bits were detected with under 50% of the injections, achieving at least 2X higher sensitivity to critical bits with a minimal overhead of 0.04%./proceedings-archive/2024/DATA/641_pdf_upload.pdf |
14:40 CET | TS19.9 | IN-FIELD DETECTION OF SMALL DELAY DEFECTS AND RUNTIME DEGRADATION USING ON-CHIP SENSORS Speaker: Seyedehmaryam Ghasemi, Karlsruhe Institute of Technology, DE Authors: Seyedehmaryam Ghasemi, Sergej Meschkov, Jonas Krautter, Dennis Gnad and Mehdi Tahoori, Karlsruhe Institute of Technology, DE Abstract The increasing safety requirements for modern complex systems mandate Silicon Lifecycle Management (SLM) using various sensors for in-field testing. In this work, we evaluate so-called Path Transient Monitors (PTMs), which are based on delay lines, to detect path delay increase caused by manufacturing defects or runtime degradation. These sensors are integrated into a RISC-V SoC on an FPGA, allowing software-controlled measurements and calibration. Additionally, we introduce means to emulate delay defects and degradations by injecting additional delay elements into a custom add instruction. Furthermore, by using power wasters, we provoke runtime voltage variations. Our evaluation at different temperatures shows the dependencies between different sources of delay variations and how the sensors can help in better detection of delay defects./proceedings-archive/2024/DATA/106_pdf_upload.pdf |
14:41 CET | TS19.10 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
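As a companion to TS19.4 above, which builds on Bloom filters for run-time diagnostic data compression, the following minimal Python sketch illustrates only the underlying data structure: a plain Bloom filter answering approximate membership queries over faulty memory addresses. The filter size, hash construction, and example addresses are illustrative assumptions and do not reflect the PA-2SBF design.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: space-efficient, approximate set membership.
    False positives are possible; false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one Python integer

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests
        # (placeholder hashing scheme, chosen only for simplicity).
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# Hypothetical usage: record faulty memory addresses found during diagnosis,
# then query the filter before granting CPU access to a memory region.
faulty = BloomFilter()
for addr in (0x0040_1000, 0x0040_2040):
    faulty.add(addr)

print(faulty.might_contain(0x0040_1000))  # True: this address was inserted
print(faulty.might_contain(0x0000_0008))  # Usually False; True only on a rare false positive
```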
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 14:00 CET - 15:30 CET
Location / Room: Auditorium 3
Session chair:
Angelo Garofalo, ETH Zurich & University Bologna, CH
Session co-chair:
Manuele Rusci, KU Leuven, BE
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | TS26.1 | MATAR: MULTI-QUANTIZATION-AWARE TRAINING FOR ACCURATE AND FAST HARDWARE RETARGETING Speaker: Pierpaolo Mori, BMW Group, IT Authors: Pierpaolo Mori1, Moritz Thoma2, Lukas Frickenstein2, Shambhavi Balamuthu Sampath2, Nael Fasfous2, Manoj Rohit Vemparala2, Alexander Frickenstein2, Walter Stechele3, Daniel Mueller-Gritschneder4 and Claudio Passerone1 1Politecnico di Torino, IT; 2BMW Group, DE; 3TU Munich, DE; 4TU Wien, AT Abstract Quantization of deep neural networks (DNNs) reduces their memory footprint and simplifies their hardware arithmetic logic, enabling efficient inference on edge devices. Different hardware targets can support different forms of quantization, e.g. full 8-bit, or 8/4/2-bit mixed-precision combinations, or fully-flexible bit-serial solutions. This makes standard quantization-aware training (QAT) of a DNN for different targets challenging, as there needs to be careful consideration of the supported quantization-levels of each target at training time. In this paper, we propose a generalized QAT solution that results in a DNN which can be retargeted to different hardware, without any retraining or prior knowledge of the hardware's supported quantization policy. First, we present the novel training scheme which makes the model aware of multiple quantization strategies. Then we demonstrate the retargeting capabilities of the resulting DNN by using a genetic algorithm to search for layer-wise, mixed-precision solutions that maximize performance and accuracy on several hardware targets, without the need for fine-tuning. By making the DNN agnostic of the final hardware target, our method allows DNNs to be distributed to many users on different hardware platforms, without requiring the DNN developers to share their training loop or dataset, or the end-users of the efficient quantized solution to detail the hardware capabilities ahead of time. Models trained with our approach can generalize on multiple quantization policies with minimal accuracy degradation compared to target-specific quantization counterparts. (A minimal fake-quantization sketch follows this table.)/proceedings-archive/2024/DATA/591_pdf_upload.pdf |
14:05 CET | TS26.2 | RESOURCE-EFFICIENT HETEROGENEOUS FEDERATED CONTINUAL LEARNING ON EDGE Speaker: Zhao Yang, Northwestern Polytechnical University, CN Authors: Zhao Yang1, Shengbing Zhang2, Chuxi Li2, Haoyang Wang2 and Meng Zhang2 1Chang'an University, CN; 2Northwestern Polytechnical University, CN Abstract Federated learning (FL) has been widely deployed on edge devices. In practice, the data collected by edge devices exhibits temporal variations, which leads to the catastrophic forgetting issue. Continual learning methods can be used to address this problem. However, when deploying these methods in FL on edge devices, it is challenging to adapt to the limited resources and heterogeneous data of the deployed devices, which reduces the efficiency and effectiveness of federated continual learning (FCL). Therefore, this article proposes a resource-efficient heterogeneous FCL framework. This framework divides the global model into an adaptation part for new knowledge and a preservation part for old knowledge. The preservation part is used to address the catastrophic forgetting problem. Only the adaptation part is trained when learning new knowledge on a new task, reducing resource consumption. Additionally, the framework mitigates the impact of heterogeneous data through an aggregation method based on feature representation. Experimental results show that our method performs well in mitigating catastrophic forgetting in a resource-efficient manner./proceedings-archive/2024/DATA/184_pdf_upload.pdf |
14:10 CET | TS26.3 | FMTT: FUSED MULTI-HEAD TRANSFORMER WITH TENSOR-COMPRESSION FOR 3D POINT CLOUDS DETECTION ON EDGE DEVICES Speaker: Ziyi Guan, University of Hong Kong, HK Authors: Zikun Wei1, Tingting Wang1, Chenchen Ding2, Bohan Wang1, Hantao Huang1 and Hao Yu1 1Southern University of Science and Technology, CN; 2University of Hong Kong, HK Abstract The real-time detection of 3D objects represents a grand challenge on edge devices. Existing 3D point cloud models are over-parameterized and carry a heavy computation load. This paper proposes a highly compact model for 3D point cloud detection using tensor-compression. Compared to conventional methods, we propose a fused multi-head transformer with tensor-compression (FMTT) to achieve both a compact size and high accuracy. The FMTT leverages different ranks to extract both high- and low-level features and then fuses them together to improve the accuracy. Experiments on the KITTI dataset show that the proposed FMTT is 6.04× smaller than the uncompressed model (from 55.09 MB to 9.12 MB), such that the compressed model can be implemented on edge devices. It also achieves 2.62% improved accuracy in easy mode and 0.28% improved accuracy in hard mode./proceedings-archive/2024/DATA/825_pdf_upload.pdf |
14:15 CET | TS26.4 | XINET-POSE: EXTREMELY LIGHTWEIGHT POSE DETECTION FOR MICROCONTROLLERS Speaker: Alberto Ancilotto, Fondazione Bruno Kessler, IT Authors: Alberto Ancilotto, Francesco Paissan and Elisabetta Farella, Fondazione Bruno Kessler, IT Abstract Accurate human body keypoint detection is crucial in fields like medicine, entertainment, and VR. However, it often demands complex neural networks best suited for high-compute environments. This work instead presents a keypoint detection approach targeting embedded devices with very low computational resources, such as microcontrollers. The proposed end-to-end solution is based on the development and optimization of each component of a neural network specifically designed for highly constrained devices. Our methodology works top-down, from object to keypoint detection, unlike alternative bottom-up approaches relying instead on complex decoding algorithms or additional processing steps. The proposed network is optimized to ensure maximum compatibility with different embedded runtimes by making use of commonly used operators. We demonstrate the viability of our approach using an STM32H7 microcontroller with 2MB of Flash and 1MB of RAM./proceedings-archive/2024/DATA/1014_pdf_upload.pdf |
14:20 CET | TS26.5 | VALUE-DRIVEN MIXED-PRECISION QUANTIZATION FOR PATCH-BASED INFERENCE ON MICROCONTROLLERS Speaker: Wei Tao, Huazhong University of Science & Technology, CN Authors: Wei Tao1, Shenglin He1, Kai Lu1, Xiaoyang Qu2, Guokuan Li3, Jiguang Wan1, Jianzong Wang4 and Jing Xiao4 1Huazhong University of Science & Technology, CN; 2Ping An Technology (Shenzhen) Co., Ltd, CN; 3Huazhong University of Sci and Tech, CN; 4Ping An Technology, CN Abstract Deploying neural networks on microcontroller units (MCUs) presents substantial challenges due to their constrained computation and memory resources. Previous research has explored patch-based inference as a strategy to conserve memory without sacrificing model accuracy. However, this technique suffers from severe redundant computation overhead, leading to a substantial increase in execution latency. A feasible solution to address this issue is mixed-precision quantization, but it faces the challenges of accuracy degradation and a time-consuming search process. In this paper, we propose QuantMCU, a novel patch-based inference method that utilizes value-driven mixed-precision quantization to reduce redundant computation. We first utilize value-driven patch classification (VDPC) to maintain the model accuracy. VDPC classifies patches into two classes based on whether they contain outlier values. For patches containing outlier values, we apply 8-bit quantization to the feature maps on the dataflow branches that follow. In addition, for patches without outlier values, we utilize value-driven quantization search (VDQS) on the feature maps of their following dataflow branches to reduce search time. Specifically, VDQS introduces a novel quantization search metric that takes into account both computation and accuracy, and it employs entropy as an accuracy representation to avoid additional training. VDQS also adopts an iterative approach to determine the bitwidth of each feature map to further accelerate the search process. Experimental results on real-world MCU devices show that QuantMCU can reduce computation by 2.2x on average while maintaining comparable model accuracy compared to the state-of-the-art patch-based inference methods./proceedings-archive/2024/DATA/1158_pdf_upload.pdf |
14:25 CET | TS26.6 | TINY-VBF: RESOURCE-EFFICIENT VISION TRANSFORMER BASED LIGHTWEIGHT BEAMFORMER FOR ULTRASOUND SINGLE-ANGLE PLANE WAVE IMAGING Speaker: Abdul Rahoof, IIT Palakkad, IN Authors: Abdul Rahoof1, Vivek Chaturvedi1, Mahesh Raveendranatha Panicker1 and Muhammad Shafique2 1IIT Palakkad, IN; 2New York University Abu Dhabi, AE Abstract Accelerating compute-intensive non-real-time beamforming algorithms in ultrasound imaging using deep learning architectures has been gaining momentum in the recent past. Nonetheless, the complexity of the state-of-the-art deep learning techniques poses challenges for deployment on resource-constrained edge devices. In this work, we propose a novel vision transformer based tiny beamformer (Tiny-VBF), which works on the raw radio-frequency channel data acquired through single-angle plane wave insonification. The output of our Tiny-VBF provides fast envelope detection at a very low computational cost, i.e. 0.34 GOPs/frame for a frame size of 368 x 128, in comparison to the state-of-the-art deep learning models. It also exhibited an 8% increase in contrast and gains of 5% and 33% in axial and lateral resolution respectively when compared to Tiny-CNN on an in-vitro dataset. Additionally, our model showed a 4.2% increase in contrast and gains of 4% and 20% in axial and lateral resolution respectively when compared against the conventional Delay-and-Sum (DAS) beamformer. We further propose an accelerator architecture and implement our Tiny-VBF model on a Zynq UltraScale+ MPSoC ZCU104 FPGA using a hybrid quantization scheme with 50% less resource consumption compared to the floating-point implementation, while preserving the image quality./proceedings-archive/2024/DATA/959_pdf_upload.pdf |
14:30 CET | TS26.7 | DECENTRALIZED FEDERATED LEARNING IN PARTIALLY CONNECTED NETWORKS WITH NON-IID DATA Speaker: Nanxiang Yu, Shandong University, CN Authors: Xiaojun Cai, Nanxiang Yu, Mengying Zhao, Mei Cao, Tingting Zhang and Jianbo Lu, Shandong University, CN Abstract Federated learning is a promising paradigm to enable joint model training across distributed data while preserving data privacy. The distributed data are usually not identically and independently distributed (Non-IID), which brings great challenges for federated learning. Existing work has proposed guiding model aggregation between similar clients to deal with Non-IID data, but it typically assumes a fully connected network topology, while new design issues need to be considered when it comes to a partially connected topology. In this work, we propose a probability-driven gossip framework for partially connected network topologies with Non-IID data. The main idea is to discover similarity relationships between non-adjacent clients and guide the model exchange to encourage aggregation between similar clients. We explore cross-node similarity assessment and define probabilities to guide the model exchange and aggregation. Both similarity and communication cost are considered in the probability-driven gossip. Evaluation shows that the proposed scheme can achieve a 13.04%-14.24% improvement in model accuracy when compared with related work./proceedings-archive/2024/DATA/345_pdf_upload.pdf |
14:35 CET | TS26.8 | CRISP: HYBRID STRUCTURED SPARSITY FOR CLASS-AWARE MODEL PRUNING Speaker: Shivam Aggarwal, National University of Singapore, SG Authors: Shivam Aggarwal, Kuluhan Binici and Tulika Mitra, National University of Singapore, SG Abstract Machine learning pipelines for classification tasks often train a universal model to achieve accuracy across a broad range of classes. However, a typical user encounters only a limited selection of classes regularly. This disparity provides an opportunity to enhance computational efficiency by tailoring models to focus on user-specific classes. Existing works rely on unstructured pruning, which introduces randomly distributed non-zero values in the model, making it unsuitable for hardware acceleration. Alternatively, some approaches employ structured pruning, such as channel pruning, but these tend to provide only minimal compression and may lead to reduced model accuracy. In this work, we propose CRISP, a novel pruning framework leveraging a hybrid structured sparsity pattern that combines both fine-grained N:M structured sparsity and coarse-grained block sparsity. Our pruning strategy is guided by a gradient-based class-aware saliency score, allowing us to retain weights crucial for user-specific classes. CRISP achieves high accuracy with minimal memory consumption for popular models like ResNet-50, VGG-16, and MobileNetV2 on ImageNet and CIFAR-100 datasets. Moreover, CRISP delivers up to 14x reduction in latency and energy consumption compared to existing pruning methods while maintaining comparable accuracy. Our code is available here./proceedings-archive/2024/DATA/45_pdf_upload.pdf |
14:40 CET | TS26.9 | WORK IN PROGRESS: LINEAR TRANSFORMERS FOR TINYML Speaker: Cristian Cioflan, ETH Zurich, CH Authors: Moritz Scherer, Cristian Cioflan, Michele Magno and Luca Benini, ETH Zurich, CH Abstract We present the WaveFormer, a neural network architecture based on a linear attention transformer to enable long sequence inference for TinyML devices. Waveformer achieves a new state-of-the-art accuracy of 98.8 % and 99.1 % on the Google Speech V2 keyword spotting (KWS) dataset for the 12 and 35 class problems with only 130 kB of weight storage, compatible with MCU class devices. Top-1 accuracy is improved by 0.1 and 0.9 percentage points while reducing the model size and number of operations by 2.5 x and 4.7 x compared to the state of the art. We also propose a hardware-friendly 8-bit integer quantization algorithm for the linear attention operator, enabling efficient deployment on low-cost, ultra-low-power microcontrollers without loss of accuracy./proceedings-archive/2024/DATA/482_pdf_upload.pdf |
14:41 CET | TS26.10 | DEEP QUASI-PERIODIC PRIORS: SIGNAL SEPARATION IN WEARABLE SYSTEMS WITH LIMITED DATA Speaker: Soheil Ghiasi, University of California-Davis, US Authors: Mahya Saffarpour1, Kourosh Vali1, Weitai Qian1, Begum Kasap1, Diana Farmer2, Aijun Wang1 and Soheil Ghiasi1 1University of California, Davis, US; 2University of California Davis Health, US Abstract Quasi-periodic signal separation poses a significant challenge in wearable systems with limited data, particularly when the measured signal, influenced by multiple physiological sources, is under-represented. Addressing this issue, we introduce Deep Quasi-Periodic Priors (DQPP), a signal separation method for non-stationary, single-detector, quasi-periodic signals using an isolated input data. This approach incorporates masking and in-painting of the time-frequency spectrogram, while integrating prior harmonic and temporal patterns within the deep neural network structure. Moreover, a pattern alignment unit transforms the input signal's time-frequency patterns to closely align with the deep harmonic neural structure. The efficacy of DQPP is demonstrated in non-invasive fetal oxygen monitoring, using both synthetic and in vivo data, underscoring its applicability and potential in wearable technology. When applied to the synthesized data, our method exhibits significant improvements in signal-to-distortion ratio (26% on average) and mean squared error (80% on average), compared to the best competing method. When applied to in vivo data captured in pregnant animal studies, our method improves the correlation error between estimated fetal blood oxygen saturation and the ground truth by 80.5% compared to the state of the art./proceedings-archive/2024/DATA/1036_pdf_upload.pdf |
14:42 CET | TS26.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
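As a companion to TS26.1 above, the following minimal Python sketch illustrates the basic building block of quantization-aware training: symmetric uniform fake-quantization of a weight tensor at 8, 4, and 2 bits, the precisions named in the abstract. The function, per-tensor scaling, and random weights are illustrative assumptions and are not the MATAR training scheme; in an actual QAT loop, the fake-quantized weights would feed the forward pass, typically with a straight-through estimator for gradients.

```python
import numpy as np

def fake_quantize(w, num_bits):
    """Symmetric uniform fake-quantization: quantize to num_bits, then dequantize.
    The dequantized values are what a QAT forward pass would see."""
    qmax = 2 ** (num_bits - 1) - 1                  # e.g. 127 for 8 bits, 1 for 2 bits
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0  # per-tensor scale (assumption)
    q = np.clip(np.round(w / scale), -qmax, qmax)   # snap to the integer grid
    return q * scale

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(64, 64))      # stand-in for one layer's weights

# Simulate the same tensor under the precisions a target might support.
for bits in (8, 4, 2):
    w_q = fake_quantize(weights, bits)
    mse = float(np.mean((weights - w_q) ** 2))
    print(f"{bits}-bit fake-quantization, MSE vs. float weights: {mse:.2e}")
```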
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 14:00 CET - 18:00 CET
Location / Room: Break-Out Room S3+4
Organiser
Antonello Rosato, Sapienza University of Rome, IT
Short Description: The Workshop at DATE 2024 invites submissions for an exploration of cutting-edge developments in Hyperdimensional Computing (HDC) and Vector Symbolic Architectures (VSA), regarding both the theory and application of these concepts to study, design and automate new systems that leverage the mathematical properties of high-dimensional spaces. The goal is to provide new insights on how HDC and VSA can be useful for a variety of practical applications in different fields, and also to enhance our understanding of human perception and cognition. The workshop will be organized in two main parts: an invited speakers' keynote presentation and a poster session with open discussion. (A minimal sketch of the core HDC/VSA operations appears below.)
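For attendees new to the field, the following minimal Python sketch illustrates the high-dimensional operations the workshop revolves around: random bipolar hypervectors, binding, bundling, and similarity-based retrieval. The dimensionality, encoding, and example records are arbitrary assumptions chosen purely for illustration.

```python
import numpy as np

D = 10_000                              # hypervector dimensionality (typical HDC choice)
rng = np.random.default_rng(42)

def random_hv():
    """Random bipolar hypervector in {-1, +1}^D."""
    return rng.choice((-1, 1), size=D)

def bind(a, b):
    """Binding (element-wise multiply): the result is dissimilar to both inputs."""
    return a * b

def bundle(*hvs):
    """Bundling (element-wise majority): the result stays similar to every input."""
    s = np.sum(hvs, axis=0)
    return np.where(s >= 0, 1, -1)      # ties broken towards +1

def similarity(a, b):
    """Normalized dot product: ~0 for unrelated hypervectors, 1 for identical ones."""
    return float(a @ b) / D

# Encode a key-value record as a bundle of bound (role, filler) pairs.
key_color, key_shape = random_hv(), random_hv()
red, blue, circle, square = (random_hv() for _ in range(4))
record = bundle(bind(key_color, red), bind(key_shape, circle))

# Unbind the colour role from the record and check which atomic vector it resembles most.
query = bind(record, key_color)         # binding is its own inverse for bipolar vectors
for name, hv in [("red", red), ("blue", blue), ("circle", circle), ("square", square)]:
    print(name, round(similarity(query, hv), 3))   # 'red' scores highest
```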
Format: Hybrid event, with invited speakers and call for poster presentations.
Scope: We seek participants who delve into the practical application of HDC and VSA principles in various domains, with a particular focus on automation, design, and advanced AI functionalities. Poster presentations are encouraged in the following areas (but not limited to):
Wed, 14:00 - 18:00
Speakers
Evgeny Osipov, Luleå University of Technology, SE
Abbas Rahimi, IBM Research Zurich, CH
Alpha Renner, Peter Grünberg Institute, DE
Kenny Schlegel, Chemnitz University of Technology, DE
Wed, 16:00 - 18:00
TBA
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 14:00 CET - 18:00 CET
Location / Room: Break-Out Room S6+7
Organisers
Clive Holmes, Europractice, UK
Tobias Vanderhenst, imec / EUROPRACTICE, BE
Paul Malisse, imec / EUROPRACTICE, BE
Thomas Drischel, Fraunhofer IIS / EUROPRACTICE, DE
URL: Learn about MPW Fabrication with EUROPRACTICE
Multi-Project-Wafer (MPW) prototyping is a cost-effective and fast way to fabricate, test and validate new chip designs. However, it also poses several challenges and pitfalls that can compromise the quality and functionality of the final chip. This leads to eventual design iterations, extending the initial timeline and increasing costs.
In this workshop, EUROPRACTICE experts will discuss the best practices and lessons learned from the MPW prototyping process, sharing practical tips and tricks on avoiding common errors, optimising the design for tape-out, and troubleshooting the issues that may arise.
We will also discuss how to select the appropriate test and packaging options, especially for smaller advanced nodes. In addition, we will showcase some successful examples and case studies of MPW prototyping from different domains and applications. Finally, we will explain how to access the EUROPRACTICE platform, which provides affordable MPW fabrication services in a wide range of technologies, including ASICs, MEMS, Photonics, Microfluidics, Flexible Electronics, and more.
The workshop is intended for anyone who is interested in or involved in MPW prototyping, from beginners to experts.
Wed, 14:00 - 14:45
Speakers
Kwan Cheung, UKRI-STFC / EUROPRACTICE, UK
Domas Druzas, UKRI-STFC / EUROPRACTICE, UK
Wed, 14:45 - 15:30
Speaker
Domas Druzas, UKRI-STFC / EUROPRACTICE, UK
Wed, 16:00 - 16:45
Speakers
Tobias Vanderhenst, imec / EUROPRACTICE, BE
Syed Shahnawaz, Fraunhofer IIS / EUROPRACTICE, DE
Wed, 16:45 - 17:30
Speaker
Tobias Vanderhenst, imec / EUROPRACTICE, BE
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Auditorium 2
Session chair:
Alberto Bosio, Ecole Centrale de Lyon, FR
Organisers:
Cristiana Bolchini, Politecnico di Milano, IT
Alberto Bosio, Ecole Centrale de Lyon, FR
Panellists:
Annachiara Ruospo, Politecnico di Torino, IT
Paolo Rech, Università di Trento, IT
Maksim Jenihhin, Tallinn University of Technology, EE
Masanori Hashimoto, Kyoto University, JP
Luciano Ost, Loughborough University, GB
Muhammad Shafique, New York University Abu Dhabi, AE
Hardware for AI (HW-AI), similar to traditional computing hardware, is subject to hardware faults (HW faults) that can have several sources: variations in fabrication process parameters, fabrication process defects, latent defects, i.e., defects undetectable at time-zero post-fabrication testing that manifest themselves later in the field of application, silicon ageing, e.g., time-dependent dielectric breakdown, or even environmental stress, such as heat, humidity, vibration, and Single Event Upsets (SEUs) stemming from ionization. All these HW faults can cause operational failures, potentially leading to important consequences, especially for safety-critical systems. This panel aims at gathering the various contributors of this scientific community to discuss the milestones reached so far and the future directions, in the form of open challenges to be addressed. The panel format will be used to involve the audience (the DATE community is very active on this topic) and make the conversation as inclusive as possible, allowing for a broader view on the matter.
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Break-Out Room S1+2
Session chair:
Enrico Fraccaroli, University of North Carolina, US
Session co-chair:
Franco Fummi, Università di Verona, IT
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | MPP03.1 | KIHT: KALIGO-BASED INTELLIGENT HANDWRITING TEACHER Speaker: Tanja Harbaum, Karlsruhe Institute of Technology, DE Authors: Tanja Harbaum1, Alexey Serdyuk1, Fabian Kreß1, Tim Hamann2, Jens Barth2, Peter Kämpf2, Florent Imbert3, Yann Soullard4, Romain Tavenard4, Eric Anquetil5 and Jessica Delahaie6 1Karlsruhe Institute of Technology, DE; 2STABILO International GmbH, DE; 3IRISA, Universite de Rennes, INSA Rennes, FR; 4LETG, UMR 6554, IRISA, Université Rennes 2, FR; 5IRISA, Universite de Rennes, FR; 6Learn&Go, FR Abstract Kaligo-based Intelligent Handwriting Teacher (KIHT) is a bi-nationally funded research project. The aim of this joint project is to develop an intelligent learning device for automated handwriting, composed of existing components, which can be made available to as many students as possible. With KIHT, we specifically address the challenging task of using inertial sensors to retrace the trajectory of a pen without relying on external reference systems. The nearly unlimited freedom to let the pen glide over the paper has not yet provided a satisfactory solution to this challenge in the state-of-the-art methods, even with sophisticated algorithms and AI approaches. The final phase of the project is now being launched and together with partners from industry and academia, we are taking a holistic approach by considering the entire chain of components, from the pen to the embedded processing system, the algorithms and the app./proceedings-archive/2024/DATA/3002_pdf_upload.pdf |
16:35 CET | MPP03.2 | UNCOVER: DATA-DRIVEN DESIGN SUPPORT THROUGH CONTINUOUS MONITORING OF SECURITY INCIDENTS Speaker: Matthias Stammler, Karlsruhe Institute of Technology, DE Authors: Matthias Stammler1, Julian Lorenz1, Eric Sax1, Juergen Becker1, Matthias Hamann2, Patrick Bidinger2, Andreas Dewald2, Paraskevi Georgouti3, Alexios Camarinopoulos3, Günter Becker3, Klaus Finsterbusch4, Maximilian Kirschner5, Laurenz Adolph5, Carl Philipp Hohl5, Maria Rill5, Daniel Vonderau5 and Victor Pazmino6 1Karlsruhe Institute of Technology, DE; 2ERNW Research GmbH, DE; 3RISA Sicherheitsanalysen GmbH, DE; 4EnCo Software GmbH, DE; 5Forschungszentrum Informatik, DE; 6FZI Research Center for Information Technology, DE Abstract The seamless and secure integration of subsystems is a pivotal requirement within contemporary automotive development, necessitating the application of design methodologies like the Vee model. While this approach includes dedicated verification steps for the included contexts and provides a high level of assurance that the system will operate correctly under specified conditions, formalizing specifications outside its operational design domain is by definition not included. Additionally, black-box systems such as machine-learning-based functions are difficult to test with these traditional methodologies. In this project, we introduce and demonstrate a design workflow combining the Vee model design paradigm with continuous data-driven software engineering. Our workflow assists the continuous, safe and secure development and improvement of consumer vehicle functionality over the product lifecycle. This is achieved through the continuous monitoring of anomalies, as well as system states that deviate from the established design domain. The UNCOVER methodology continuously reduces the number of messages that must be monitored and applies throughout the entirety of the product lifecycle. We demonstrate our methodology through a simulation and show the automatic generation of monitoring components and an automated preselection of identified safety or security incidents./proceedings-archive/2024/DATA/3003_pdf_upload.pdf |
16:40 CET | MPP03.3 | XANDAR: AN X-BY-CONSTRUCTION FRAMEWORK FOR SAFETY, SECURITY, AND REAL-TIME BEHAVIOR OF EMBEDDED SOFTWARE SYSTEMS Speaker: Tobias Dörr, Karlsruhe Institute of Technology, DE Authors: Tobias Dörr1, Florian Schade1, Juergen Becker1, Georgios Keramidas2, Nikos Petrellis2, Vasilios Kelefouras2, Michail Mavropoulos2, Konstantinos Antonopoulos2, Christos P. Antonopoulos2, Nikolaos Voros2, Alexander Ahlbrecht3, Wanja Zaeske3, Vincent Janson3, Phillip Nöldeke3, Umut Durak3, Christos Panagiotou4, Dimitris Karadimas4, Nico Adler5, Clemens Reichmann5, Andreas Sailer5, Raphael Weber5, Thomas Wilhelm5, Wolfgang Gabler6, Katrin Weiden6, Xavier Anzuela Recasens6, Sakir Sezer7, Fahad Siddiqui7, Rafiullah Khan7, Kieran McLaughlin7, Sena Yengec Tasdemir7, Balmukund Sonigara7, Henry Hui7, Esther Soriano Viguer8, Aridane Alvarez Suarez8, Vicente Nicolau Gallego8, Manuel Muñoz Alcobendas8 and Miguel Masmano Tello8 1Karlsruhe Institute of Technology, DE; 2University of Peloponnese, GR; 3German Aerospace Center (DLR), DE; 4AVN Innovative Technology Solutions Limited, CY; 5Vector Informatik GmbH, DE; 6Bayerische Motoren Werke Aktiengesellschaft, DE; 7Queen's University Belfast, GB; 8Fent Innovative Software Solutions, SL, ES Abstract The safe and secure implementation of increasingly complex features is a major challenge in the development of autonomous and distributed embedded systems. Automated design-time procedures that guarantee the fulfillment of critical system properties are a promising approach to tackle this challenge. In the European project XANDAR, which took place from 2021 to 2023, eight partners developed an X-by-Construction (XbC) design framework to support developers in the creation of embedded software systems with certain safety, security, and real-time properties. The design framework combines a model-based toolchain with a hypervisor-based runtime architecture. It targets modern high-performance hardware, facilitates the integration of machine learning applications, and employs a library of trusted safety and security patterns to reduce the implementation and verification effort. This paper describes the concepts developed during the project, the prototypical implementation of the design framework, and its application in both an automotive and an avionics use case./proceedings-archive/2024/DATA/3006_pdf_upload.pdf |
16:45 CET | MPP03.4 | AUTO-TUNING MULTI-GPU HIGH-FIDELITY NUMERICAL SIMULATIONS FOR URBAN AIR MOBILITY Speaker: Sotirios Xydis, National TU Athens, GR Authors: Konstantina Koliogeorgi1, Georgios Anagnostopoulos1, Gerardo Zampino2, Marcial Agudo2, Ricardo Vinuesa2 and Sotirios Xydis1 1National TU Athens, GR; 2Kungliga Tekniska högskolan Royal Institute of Technology (KTH), SE Abstract The aviation field is rapidly evolving towards an era where both typical aviation and Unmanned Aircraft Systems are essential and co-exist in the same airspace. This new territory raises important concerns regarding environmental impact, safety and societal acceptance. The RefMap European Project is an initiative that addresses these issues and aims at optimizing air traffic in terms of the environmental footprint of aviation and drone flights. One of RefMap's objectives is the development of powerful deep-learning models that predict urban flow based on extensive CFD simulations. The excessive time requirements of CFD simulations demand the computational power of exascale heterogeneous supercomputer clusters. This work presents RefMap's strategy to rely on GPU-enabled high-class solvers and further leverage sophisticated autotuning HPC techniques for performance tuning of the solver on any GPU architecture and parallel system./proceedings-archive/2024/DATA/3019_pdf_upload.pdf |
16:50 CET | MPP03.5 | A SYSTEM DEVELOPMENT KIT FOR BIG DATA APPLICATIONS ON FPGA-BASED CLUSTERS: THE EVEREST APPROACH Speaker: Christian Pilato, Politecnico di Milano, IT Authors: Christian Pilato1, Subhadeep Banik2, Jakub Beranek3, Fabien Brocheton4, Jeronimo Castrillon5, Riccardo Cevasco6, Radim Cmar7, Serena Curzel1, Fabrizio Ferrandi1, Karl Friebel5, Antonella Galizia8, Matteo Grasso6, Paulo Guimaraes da Silva3, Jan Martinovic3, Gianluca Palermo1, Michele Paolino9, Andrea Parodi8, Antonio Parodi8, Fabio Pintus8, Raphael Polig10, David Poulet4, Francesco Regazzoni11, Burkhard Ringlein12, Roberto Rocco1, Katerina Slaninova3, Tom Slooff13, Stephanie Soldavini1, Felix Suchert5, Mattia Tibaldi1, Beat Weiss12 and Christoph Hagleitner14 1Politecnico di Milano, IT; 2Nanyang technological University, SG; 3IT4Innovations, CZ; 4Numtech, FR; 5TU Dresden, DE; 6Duferco Energia, IT; 7Sygic, SK; 8CIMA, IT; 9Virtual Open Systems, FR; 10IBM Research, CH; 11University of Amsterdam and Università della Svizzera italiana, CH; 12IBM Research -- Zurich, CH; 13Università della Svizzera italiana, CH; 14IBM, CH Abstract Modern big data workflows are characterized by computationally intensive kernels. The simulated results are often combined with knowledge extracted from AI models to ultimately support decision-making. These energy-hungry workflows are increasingly executed in data centers with energy-efficient hardware accelerators since FPGAs are well-suited for this task due to their inherent parallelism. We present the H2020 project EVEREST, which has developed a system development kit (SDK) to simplify the creation of FPGA-accelerated kernels and manage the execution at runtime through a virtualization environment. This paper describes the main components of the EVEREST SDK and the benefits that can be achieved in our use cases./proceedings-archive/2024/DATA/3021_pdf_upload.pdf |
16:55 CET | MPP03.6 | SECURITY LAYERS AND RELATED SERVICES WITHIN THE HORIZON EUROPE NEUROPULS PROJECT Speaker: Fabio Pavanello, University Grenoble Alpes, University Savoie Mont Blanc, CNRS, Grenoble INP, CROMA, FR Authors: Fabio Pavanello1, Cédric Marchand2, Xavier Letartre2, Paul Jimenez2, Ricardo Chaves3, Niccolò Marastoni4, Alberto Lovato4, Mariano Ceccato4, George Papadimitriou5, Vasileios Karakostas5, Dimitris Gizopoulos5, Roberta Bardini6, Tzamn Carmona6, Stefano Di Carlo6, Alessandro Savino6, Laurence Lerch7, Ulrich Ruhrmair7, Sergio Vinagrero Gutiérrez8, Giorgio Di Natale8 and Elena Ioana Vatajelu8 1University Grenoble Alpes, University Savoie Mont Blanc, CNRS, Grenoble INP, IMEP-LAHC, FR; 2University Lyon, Ecole Centrale de Lyon, INSA Lyon, Université Claude Bernard Lyon 1, CPE Lyon, CNRS, INL, FR; 3INESC-ID Lisboa, PT; 4Department of Computer Science, University of Verona, IT; 5Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, GR; 6Politecnico di Torino, IT; 7TU Berlin, DE; 8University Grenoble Alpes, CNRS, Grenoble INP, TIMA, FR Abstract In the contemporary security landscape, the incorporation of photonics has emerged as a transformative force, unlocking a spectrum of possibilities to enhance the resilience and effectiveness of security primitives. This integration represents more than a mere technological augmentation; it signifies a paradigm shift towards innovative approaches capable of delivering security primitives with key properties for low-power systems. This not only augments the robustness of security frameworks, but also paves the way for novel strategies that adapt to the evolving challenges of the digital age. This paper discusses the security layers and related services that will be developed, modeled, and evaluated within the Horizon Europe NEUROPULS project. These layers will exploit novel implementations for security primitives based on physical unclonable functions (PUFs) using integrated photonics technology. Their objective is to provide a series of services to support the secure operation of a neuromorphic photonic accelerator for edge computing applications./proceedings-archive/2024/DATA/3022_pdf_upload.pdf |
17:00 CET | MPP03.7 | SECURED FOR HEALTH: SCALING UP PRIVACY TO ENABLE THE INTEGRATION OF THE EUROPEAN HEALTH DATA SPACE Speaker: Francesco Regazzoni, University of Amsterdam and Università della Svizzera italiana, NL Authors: Francesco Regazzoni1, Paolo Palmieri2, George Tasopoulos3, Marco Brohet3, Kyrian Maat3, Zoltan Mann3, Kostas Papagiannopoulos3, Sotirios Ioannidis4, Kalliopi Mastoraki4, Andrés G. Castillo Sanz5, Joppe W. Bos6, Gareth T. Davies7, SeoJeong Moon6, Alice Héliou8, Vincent Thouvenot8, Katarzyna Kapusta8, Pierre-Elisée Flory8, Muhammad Ali Siddiqi9, Christos Strydis9, Stefanos Florescu9, Pieter Kruizinga9, Daniela Spajic10, Maja Nisevic10, Alberto Gutierrez-Torre11, Josep Berral11, Luca Pulina12, Francesca Palumbo13, Albert Zoltan Aszalos14, Peter Pollner14, Vassilis Paliuras15, Alexander El-Kady15, Christos Tselios:15, Konstantina Karagianni15, Gergely Acs16, Balázs Pejó16, Christos Avgerinos17, Nikolaos Bakalos18, Juan Carlos Perez Baun19 and Apostolos Fournaris15 1University of Amsterdam and Università della Svizzera italiana, CH; 2UCC, IE; 3University of Amsterdam, NL; 4Circular Economy Foundation, BE; 5FHUNJ, ES; 6NXP, BE; 7NXP, ; 8Thales, FR; 9EMC, NL; 10KULeuven, BE; 11BSC, ES; 12University of Sassari, IT; 13Università degli Studi di Sassari, IT; 14Semmelweis University, HU; 15Industrial Systems Institute/Research Center ATHENA, GR; 16BME, HU; 17Catalink, CY; 18ICCS, GR; 19ATOS, ES Abstract In this paper, we present the SECURED project, aimed at improving privacy-preserving processing of data in the health domain. The technologies developed in the project will be demonstrated in four health-related use cases and with the involvement of SME's selected through an open funding call./proceedings-archive/2024/DATA/3012_pdf_upload.pdf |
17:01 CET | MPP03.8 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Auditorium 3
Session chair:
Christos Bouganis, Imperial College London, UK
Session co-chair:
Jerzy Tyszer, Poznan University of Technology, PL
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | TS09.1 | AUTOWS: AUTOMATE WEIGHTS STREAMING IN LAYER-WISE PIPELINED DNN ACCELERATORS Speaker: Zhewen Yu, Imperial College London, GB Authors: Zhewen Yu and Christos Bouganis, Imperial College London, GB Abstract With the great success of Deep Neural Networks (DNN), the design of efficient hardware accelerators has triggered wide interest in the research community. Existing research explores two architectural strategies: sequential layer execution and layer-wise pipelining. While the former supports a wider range of models, the latter is favoured for its enhanced customization and efficiency. A challenge for the layer-wise pipelining architecture is its substantial demand for the on-chip memory for weights storage, impeding the deployment of large-scale networks on resource-constrained devices. This paper introduces AutoWS, a pioneering memory management methodology that exploits both on-chip and off-chip memory to optimize weight storage within a layer-wise pipelining architecture, taking advantage of its static schedule. Through a comprehensive investigation on both the hardware design and the Design Space Exploration, our methodology is fully automated and enables the deployment of large-scale DNN models on resource-constrained devices, which was not possible in existing works that target layer-wise pipelining architectures. AutoWS is open-source: https://github.com/Yu-Zhewen/AutoWS/proceedings-archive/2024/DATA/22_pdf_upload.pdf |
16:35 CET | TS09.2 | FLEXFORGE: EFFICIENT RECONFIGURABLE CLOUD ACCELERATION VIA PERIPHERAL RESOURCE DISAGGREGATION Speaker: Se-Min Lim, University of California, Irvine, US Authors: Se-Min Lim and Sang-Woo Jun, University of California, Irvine, US Abstract Reconfigurable hardware acceleration in the cloud using Field-Programmable Gate Arrays (FPGAs) is an increasingly popular solution for scaling performance and cost-effectiveness. For efficient utilization of FPGA resources, cloud platforms typically support elastic FPGA resource allocation. However, FPGAs are usually allocated in a homogeneous unit consisting of logic, memory, and PCIe bandwidth. Because user kernels have a wide and varying combination of resource requirements, this can result in high internal fragmentation and underutilization of each resource. To address this issue, we present FlexForge, a platform facilitating high-performance disaggregation of peripheral resources over a network of potentially untrusted FPGAs, aided by a secondary inter-FPGA network. Evaluated on a mix of representative accelerator applications deployed on a prototype cluster, FlexForge improves the overall performance of the cloud by up to 70% and 20% on average across all possible combinations without significant additional hardware resource requirements./proceedings-archive/2024/DATA/251_pdf_upload.pdf |
16:40 CET | TS09.3 | WIDESA: A HIGH ARRAY UTILIZATION MAPPING SCHEME FOR UNIFORM RECURRENCES ON ACAP Speaker: Tuo Dai, Peking University, CN Authors: Tuo Dai, Bizhao Shi and Guojie Luo, Peking University, CN Abstract The Versal Adaptive Compute Acceleration Platform (ACAP) is a new architecture that combines AI Engines (AIEs) with reconfigurable fabric. This architecture offers significant acceleration potential for uniform recurrences in various domains, such as deep learning, high-performance computation, and signal processing. However, efficiently mapping these computations onto the Versal ACAP architecture while achieving high utilization of AIEs poses a challenge. To address this issue, we propose a mapping scheme called WideSA, which aims to accelerate uniform recurrences on the Versal ACAP architecture by leveraging the features of both the hardware and the computations. Considering the array architecture of AIEs, our approach utilizes space-time transformations based on the polyhedral model to generate legally optimized systolic array mappings. Concurrently, we have developed a routing-aware PLIO assignment algorithm tailored for communication on the AIE array, and the algorithm aims at successful compilation while maximizing array utilization. Furthermore, we introduce an automatic mapping framework. This framework is designed to generate the corresponding executable code for uniform recurrences, which encompasses the AIE kernel program, programmable logic bitstreams, and the host program. The experimental results validate the effectiveness of our mapping scheme. Specifically, when applying our scheme to matrix multiplication computations on the VCK5000 board, we achieve a throughput of 4.15 TOPS on the float data type, which is 1.11x higher compared to the state-of-the-art accelerator on the Versal ACAP architecture./proceedings-archive/2024/DATA/503_pdf_upload.pdf |
16:45 CET | TS09.4 | OPTIMIZING IMPERFECTLY-NESTED LOOP MAPPING ON CGRAS VIA POLYHEDRAL-GUIDED FLATTENING Speaker: Xingyu Mo, Chongqing University, CN Authors: Xingyu Mo and Dajiang Liu, Chongqing University, CN Abstract Coarse-Grained Reconfigurable Arrays (CGRAs) offer a promising balance between high performance and power efficiency. To reduce the invocation overhead when mapping an imperfectly-nested loop, loop flattening is used to transform the nested loop into a single-level loop. However, loop flattening not only leads to a big loop body but also has a narrow application scope. To this end, this work proposes a polyhedral model-based loop flattening approach for imperfectly-nested loop mapping. By exploring loop structures via polyhedral transformation, we can find a flattening-friendly loop structure with more data reuse opportunities and reduced sibling loops, resulting in improved loop pipelining performance. Experimental results demonstrate a remarkable 1.37-1.62x speedup compared to the state-of-the-art approaches while maintaining short compilation times./proceedings-archive/2024/DATA/800_pdf_upload.pdf |
16:50 CET | TS09.5 | AN EFFICIENT HYPERGRAPH PARTITIONER UNDER INTER-BLOCK INTERCONNECTION CONSTRAINTS Speaker: Shunyang Bi, School of Microelectronics, Xidian University, CN Authors: Benzheng Li, Hailong You, Shunyang Bi and Yuming Zhang, Xidian University, CN Abstract Multi-FPGA systems are increasingly employed for very large scale integration circuit emulation and prototyping. Due to limited I/O resources, each FPGA often only has direct physical connections to a few other FPGAs. Therefore, if signals between FPGAs originate from a source FPGA and flow toward a target FPGA not directly connected to the source FPGA, intermediate FPGAs will be used as hops in the signal path. These FPGA hops increase signal delays and the number of physical lines used in signal multiplexing between FPGAs, degrading system performance. To address these issues, researchers have proposed partitioners that guarantee zero hops, but they lead to a considerable cut size. In this paper, building on previous research, we introduce a new candidate block propagation theorem and optimize the initial partition process based on its corollary. Additionally, we also present a method for correcting violations during uncoarsening to improve the solver capability. Results of experiments demonstrate that our proposed method significantly reduces the cut size by 96% while retaining comparable running times./proceedings-archive/2024/DATA/932_pdf_upload.pdf |
16:55 CET | TS09.6 | REDCAP: RECONFIGURABLE RFET-BASED CIRCUITS AGAINST POWER SIDE-CHANNEL ATTACKS Speaker: Nima Kavand, TU Dresden, DE Authors: Nima Kavand1, Armin Darjani1, Giulio Galderisi2, Jens Trommer2, Thomas Mikolajick3 and Akash Kumar4 1TU Dresden, DE; 2Namlab gGmbH, DE; 3NaMLab Gmbh / TU Dresden, DE; 4Ruhr University Bochum, DE Abstract Power attacks are effective side-channel attacks (SCAs) that exploit weaknesses in the physical implementation of a cryptographic circuit to extract its secret information like encryption key. In recent years, emerging technologies have unlocked new possibilities in designing effective SCA countermeasures with less overhead. Reconfigurable Field-Effect Transistors (RFETs) are a type of beyond-CMOS technology that can be configured at run-time to act as an NFET or PFET transistor and provide two or more independent gates. These features make RFETs potent candidates for implementing hardware security techniques like logic locking and SCA countermeasures. In this paper, we propose REDCAP, a method to add randomness to the power traces of a circuit, employing compact reconfigurable RFET-based gates to make the design resilient against power SCAs. First, we explain the construction and control of reconfigurable blocks with isofunctional configurations inside the circuit. Then, we provide an algorithm to efficiently compose the reconfigurable blocks with other circuit parts to minimize the overhead and enable designers to determine the granularity of the reconfiguration. To evaluate our approach, we performed a Correlation Power Attack (CPA) on the S-box of the Piccolo and PRESENT, two lightweight cryptographic circuits, and the results show that REDCAP can highly enhance the resilience of the circuit against power SCAs./proceedings-archive/2024/DATA/984_pdf_upload.pdf |
17:00 CET | TS09.7 | RECONFIGURABLE FREQUENCY MULTIPLIERS BASED ON COMPLEMENTARY FERROELECTRIC TRANSISTORS Speaker: Haotian Xu, Zhejiang University, CN Authors: Haotian Xu1, Jianyi Yang1, Cheng Zhuo1, Thomas Kampfe2, Kai Ni3 and Xunzhao Yin1 1Zhejiang University, CN; 2Fraunhofer IPMS, DE; 3University of Notre Dame, US Abstract Frequency multipliers, a class of essential electronic components, play a pivotal role in contemporary signal processing and communication systems. They serve as crucial building blocks for generating high-frequency signals by multiplying the frequency of an input signal. However, traditional frequency multipliers that rely on nonlinear devices often require energy- and area-consuming filtering and amplification circuits, and emerging designs based on an ambipolar ferroelectric transistor require costly non-trivial characteristic tuning or complex technology process. In this paper, we show that a pair of standard ferroelectric field effect transistors (FeFETs) can be used to build compact frequency multipliers without aforementioned technology issues. By leveraging the tunable parabolic shape of the 2FeFET structures' transfer characteristics, we propose four reconfigurable frequency multipliers, which can switch between signal transmission and frequency doubling. Furthermore, based on the 2FeFET structures, we propose four frequency multipliers that realize triple, quadruple frequency modes, elucidating a scalable methodology to generate more multiplication harmonics of the input frequency. Performance metrics such as maximum operating frequency, power, etc., are evaluated and compared with existing works. We also implement a practical case of frequency modulation scheme based on the proposed reconfigurable multipliers without additional devices. Our work provides a novel path of scalable and reconfigurable frequency multiplier designs based on devices that have characteristics similar to FeFETs, and show that FeFETs are a promising candidate for signal processing and communication systems in terms of maximum frequency and power./proceedings-archive/2024/DATA/804_pdf_upload.pdf |
17:05 CET | TS09.8 | FEREX: A RECONFIGURABLE DESIGN OF MULTI-BIT FERROELECTRIC COMPUTE-IN-MEMORY FOR NEAREST NEIGHBOR SEARCH Speaker: Zhicheng Xu, University of Hong Kong, HK Authors: Zhicheng Xu1, Che-Kai Liu2, Chao Li3, Ruibin Mao1, Jianyi Yang3, Thomas Kämpfe4, Mohsen Imani5, Can Li1, Cheng Zhuo3 and Xunzhao Yin3 1University of Hong Kong, HK; 2Georgia Tech, US; 3Zhejiang University, CN; 4Fraunhofer IPMS, DE; 5University of California, Irvine, US Abstract Rapid advancements in artificial intelligence have given rise to transformative models, profoundly impacting our lives. These models demand massive volumes of data to operate effectively, exacerbating the data-transfer bottleneck inherent in the conventional von Neumann architecture. Compute-in-memory (CIM), a novel computing paradigm, tackles these issues by seamlessly embedding in-memory search functions, thereby obviating the need for data transfers. However, existing non-volatile memory (NVM)-based accelerators are application specific. During the similarity search operation, they support a single, specific distance metric, such as Hamming, Manhattan, or Euclidean distance in measuring the query against the stored data, calling for the development of reconfigurable in-memory solutions adaptable to various applications. To overcome such a limitation, in this paper, we present FeReX, a reconfigurable associative memory (AM) that accommodates Hamming, Manhattan, and Euclidean distances. Leveraging multi-bit ferroelectric field-effect transistors (FeFETs) as the proxy and a hardware-software co-design approach, we introduce a constraint satisfaction problem (CSP)-based method to automate AM search voltage and stored voltage settings for different distance functions. Device-circuit co-simulations first validate the effectiveness of the proposed FeReX methodology for reconfigurable search distance functions. Then, we benchmark FeReX against k-nearest neighbor (KNN) and hyperdimensional computing (HDC), which highlights the robustness of FeReX and demonstrates up to 250× speedup and 10^4 energy savings compared with GPU. (A minimal software reference for the metric-dependent nearest-neighbor search follows this table.)/proceedings-archive/2024/DATA/810_pdf_upload.pdf |
17:10 CET | TS09.9 | A FRAMEWORK FOR DESIGNING GAUSSIAN BELIEF PROPAGATION ACCELERATORS FOR USE IN SLAM PROBLEMS Speaker: Omar Sharif, Imperial College London, GB Authors: Omar Sharif and Christos Bouganis, Imperial College London, GB Abstract Gaussian Belief Propagation (GBP) is an iterative method for factor graph inference that provides an approximate solution to the probability distribution of a system. It has been shown to be a powerful tool in numerous applications including SLAM, where the estimation of the robot's position and the map of the environment is required. State-of-the-art implementations suffer from scalability issues, or exhibit performance degradation when off-chip memory access is required. This paper addresses these challenges using a streaming architecture via a chain of parameterizable Processing Elements (PE) that can be tuned to the problem's characteristics through the use of an optimizer. This work overcomes the limitations of existing GBP implementations achieving 142x-168x performance improvements over an embedded CPU for large graphs./proceedings-archive/2024/DATA/862_pdf_upload.pdf |
17:11 CET | TS09.10 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
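Editor's note: as a plain-software reference for the three search metrics between which TS09.8 (FeReX) reconfigures, the minimal NumPy sketch below performs a nearest-neighbor query under Hamming, Manhattan, or Euclidean distance. The function name, array shapes, and the use of NumPy are illustrative assumptions; the paper itself realizes the search inside a multi-bit FeFET associative memory with CSP-derived voltage settings, not in software.

```python
# Illustrative software reference for the three distance metrics a
# reconfigurable associative memory such as FeReX can switch between.
# This is NOT the paper's in-memory implementation; it only shows the
# functional behaviour of the three search modes.
import numpy as np

def nearest_neighbor(stored: np.ndarray, query: np.ndarray, metric: str = "euclidean") -> int:
    """Return the row index of `stored` that is closest to `query` under `metric`."""
    if metric == "hamming":
        dists = np.count_nonzero(stored != query, axis=1)   # count of differing cells
    elif metric == "manhattan":
        dists = np.abs(stored - query).sum(axis=1)
    elif metric == "euclidean":
        dists = np.sqrt(((stored - query) ** 2).sum(axis=1))
    else:
        raise ValueError(f"unsupported metric: {metric}")
    return int(np.argmin(dists))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stored = rng.integers(0, 4, size=(8, 16))   # 8 entries of 16 two-bit cells
    query = rng.integers(0, 4, size=16)
    for m in ("hamming", "manhattan", "euclidean"):
        print(m, nearest_neighbor(stored, query, m))
```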
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Multi-Purpose Room M1B+D
Session chair:
Pierluigi Nuzzo, University of Southern California, US
Session co-chair:
Seonyeong Heo, Kyung Hee University, KR
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | TS14.1 | MX: ENHANCING RISC-V'S VECTOR ISA FOR ULTRA-LOW OVERHEAD, ENERGY-EFFICIENT MATRIX MULTIPLICATION Speaker: Matteo Perotti, ETH Zurich, CH Authors: Matteo Perotti1, Yichao Zhang1, Matheus Cavalcante1, Enis Mustafa1 and Luca Benini2 1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT Abstract Dense Matrix Multiplication (MatMul) is arguably one of the most ubiquitous compute-intensive kernels, spanning linear algebra, DSP, graphics, and machine learning applications. Thus, MatMul optimization is crucial not only in high-performance processors but also in embedded low-power platforms. Several Instruction Set Architectures (ISAs) have recently included matrix extensions to improve MatMul performance and efficiency at the cost of added matrix register files and units. In this paper, we propose Matrix eXtension (MX), a lightweight approach that builds upon the open-source RISC-V Vector (RVV) ISA to boost MatMul energy efficiency. Instead of adding expensive dedicated hardware, MX uses the pre-existing vector register file and functional units to create a hybrid vector/matrix engine at a negligible area cost (< 3%), which comes from a compact near-FPU tile buffer for higher data reuse, and no clock frequency overhead. We implement MX on a compact and highly energy-optimized RVV processor and evaluate it in both a Dual- and 64-Core cluster in a 12-nm technology node. MX boosts the Dual-Core's energy efficiency by 10% for a double-precision 64 × 64 × 64 matrix multiplication with the same FPU utilization (~97%) and by 25% on the 64-Core cluster for the same benchmark on 32-bit data, with a 56% performance gain./proceedings-archive/2024/DATA/377_pdf_upload.pdf |
16:35 CET | TS14.2 | DIAC: DESIGN EXPLORATION OF INTERMITTENT-AWARE COMPUTING REALIZING BATTERYLESS SYSTEMS Speaker: Arman Roohi, University of Nebraska-Lincoln, US Authors: Sepehr Tabrizchi1, Shaahin Angizi2 and Arman Roohi3 1University of Nebraska–Lincoln, US; 2New Jersey Institute of Technology, US; 3University of Nebraska–Lincoln (UNL), US Abstract Battery-powered IoT devices face challenges like cost, maintenance, and environmental sustainability, prompting the emergence of batteryless energy-harvesting systems that harness ambient sources. However, their intermittent behavior can disrupt program execution and cause data loss, leading to unpredictable outcomes. Whereas prior studies have exhaustively employed conventional checkpoint methods and intricate programming paradigms to address these pitfalls, this paper proposes an innovative systematic methodology, namely DIAC. The DIAC synthesis procedure enhances the performance and efficiency of intermittent computing systems, with a focus on maximizing forward progress and minimizing the energy overhead imposed by distinct memory arrays for backup. Then, a finite-state machine is delineated, encapsulating the core operations of an IoT node: the sense, compute, transmit, and sleep states. First, we validate the robustness and functionalities of a DIAC-based design in the presence of power disruptions. Finally, DIAC is applied to a wide range of benchmarks, including ISCAS-89, MCNC, and ITC-99. The simulation results substantiate the power-delay-product (PDP) benefits. For example, results for complex MCNC benchmarks indicate a PDP improvement of 61%, 56%, and 38% on average compared to three alternative techniques, evaluated at 45 nm./proceedings-archive/2024/DATA/439_pdf_upload.pdf |
16:40 CET | TS14.3 | A MULTI-BIT NEAR-RRAM BASED COMPUTING MACRO WITH HIGHLY COMPUTING PARALLELISM FOR CNN APPLICATION Speaker: Shyh-Jye Jou, National Yang Ming Chiao Tung University, TW Authors: Kuan-Chih Lin, Hao Zuo, Hsiang-Yu Wang, Yuan-Ping Huang, Ci-Hao Wu, Yan-Cheng Guo, Shyh-Jye Jou, Tuo-Hung Hou and Tian-Sheuan Chang, National Yang Ming Chiao Tung University, TW Abstract Resistive random-access memory (RRAM) based compute-in-memory (CIM) is an emerging approach to address the demand for practical implementation of artificial intelligence (AI) on resource-constrained edge devices by reducing the power-hungry data transfer between memory and processing unit. However, the state-of-the-art RRAM CIM designs fail to strike a balance between precision, energy efficiency, throughput, and latency, whereas this work, merging the techniques of CIM and compute-near-memory (CNM), delivers high precision, high energy efficiency, high throughput, and low latency. In this paper, a 256Kb RRAM based CNM macro fabricated in a TSMC 40nm process is presented featuring: 1) opposite weight mapping with a variation-robust SA to mitigate the impact of RRAM device variations on MAC (Multiply-Accumulate) results; 2) a switch-capacitor-based analog multiplication circuit to achieve highly parallel computing of 128 4-bit by 4-bit MAC results with low power consumption and high operation speed; and 3) joint optimization of hardware and software to compensate for the accuracy loss after considering the non-idealities of circuits. The macro achieves a low latency of 17ns and a high energy efficiency of 71 TOPS/W for MAC operations with 4-bit input, 4-bit weight and 4-bit output precision. It is used to accelerate the convolution process in the Light-CSPDenseNet AI model, resulting in a high accuracy of 86.33% on the Visual Wake Words dataset./proceedings-archive/2024/DATA/609_pdf_upload.pdf |
16:45 CET | TS14.4 | FAST PARAMETER OPTIMIZATION OF DELAYED FEEDBACK RESERVOIR WITH BACKPROPAGATION AND GRADIENT DESCENT Speaker: Takashi Sato, Kyoto University, JP Authors: Sosei Ikeda, Hiromitsu Awano and Takashi Sato, Kyoto University, JP Abstract A delayed feedback reservoir (DFR) is a reservoir computing system well-suited for hardware implementations. However, achieving high accuracy in DFRs depends heavily on selecting appropriate hyperparameters. Conventionally, due to the presence of a non-linear circuit block in the DFR, grid search has been the only preferred method; it is computationally intensive and time-consuming and is thus performed offline. This paper presents a fast and accurate parameter optimization method for DFRs. To this end, we leverage the well-known backpropagation and gradient descent framework with the state-of-the-art DFR model for the first time to facilitate parameter optimization. We further propose a truncated backpropagation strategy applicable to the recursive dot-product reservoir representation to achieve the highest accuracy with reduced memory usage. With the proposed lightweight implementation, the computation time is reduced to as little as 1/700 of that of the grid search./proceedings-archive/2024/DATA/852_pdf_upload.pdf |
16:50 CET | TS14.5 | ON-SENSOR PRINTED MACHINE LEARNING CLASSIFICATION VIA BESPOKE ADC AND DECISION TREE CO-DESIGN Speaker: Giorgos Armeniakos, National Technical University of Athens, GR Authors: Giorgos Armeniakos1, Paula Lozano Duarte2, Priyanjana Pal2, Georgios Zervakis3, Mehdi Tahoori2 and Dimitrios Soudris4 1National Technical University of Athens, GR; 2Karlsruhe Institute of Technology, DE; 3University of Patras, GR; 4National Technical University of Athens, GR Abstract Printed electronics (PE) technology provides cost-effective hardware with unmet customization, due to its low non-recurring engineering and fabrication costs. PE exhibits features such as flexibility, stretchability, porosity, and conformality, which make it a prominent candidate for enabling ubiquitous computing. Still, the large feature sizes in PE limit the realization of complex printed circuits, such as machine learning classifiers, especially when processing sensor inputs is necessary, due to the costly analog-to-digital converters (ADCs). To this end, we propose the design of fully customized ADCs and present, for the first time, an ADC-aware co-design framework for generating bespoke Decision Tree classifiers. Our comprehensive evaluation shows that our co-design enables self-powered operation of on-sensor printed classifiers in all benchmark cases./proceedings-archive/2024/DATA/970_pdf_upload.pdf |
16:55 CET | TS14.6 | H3DFACT: HETEROGENEOUS 3D INTEGRATED CIM FOR FACTORIZATION WITH HOLOGRAPHIC PERCEPTUAL REPRESENTATIONS Speaker: Che-Kai Liu, Georgia Tech, US Authors: Zishen Wan, Che-Kai Liu, Mohamed Ibrahim, Hanchen Yang, Samuel Spetalnick, Tushar Krishna and Arijit Raychowdhury, Georgia Tech, US Abstract Disentangling attributes of various sensory signals is central to human-like perception and reasoning and a critical task for higher-order cognitive and neuro-symbolic AI systems. An elegant approach to represent this intricate factorization is via high-dimensional holographic vectors drawing on brain-inspired vector symbolic architectures. However, holographic factorization involves iterative computation with high-dimensional matrix-vector multiplications and suffers from non-convergence problems. In this paper, we present H3DFACT, a heterogeneous 3D integrated in-memory compute engine capable of efficiently factorizing high-dimensional holographic representations. H3DFACT exploits the computation-in-superposition capability of holographic vectors and the intrinsic stochasticity associated with memristive-based 3D compute-in-memory. Evaluated on large-scale factorization and perceptual problems, H3DFACT demonstrates superior capability in factorization accuracy and operational capacity by up to five orders of magnitude, with 5.5× compute density, 1.2× energy efficiency improvements, and 5.9× less silicon footprint compared to iso-capacity 2D designs./proceedings-archive/2024/DATA/1137_pdf_upload.pdf |
17:00 CET | TS14.7 | A HARDWARE ACCELERATED AUTOENCODER FOR RF COMMUNICATION USING SHORT-TIME-FOURIER-TRANSFORM ASSISTED CONVOLUTIONAL NEURAL NETWORK Speaker: Kuchul Jung, Georgia Tech, US Authors: Jongseok Woo and Saibal Mukhopadhyay, Georgia Tech, US Abstract This paper presents a Short-Time-Fourier-Transform Assisted Convolutional Neural Network (STFT-CNN-AE) design for autoencoders in wireless communication. Compared to the traditional Multi-Layer-Perceptron-based AE (MLP-AE), our design on a Zynq UltraScale+ FPGA platform demonstrates superior throughput, power efficiency, and resource utilization even in low-SNR channels. The prototype measurement results show that the STFT-CNN-AE achieves 3.5 times higher throughput at a 2.6 times faster frequency, with 59% lower power and 76% less hardware (LUT and DSP) compared to a prior Multi-Layer-Perceptron-based AE (MLP-AE) at similar performance for a low-SNR (< 7.5 dB) channel. The STFT-CNN-AE's reduced computation needs and optimized resource usage make it an excellent choice for RF communications./proceedings-archive/2024/DATA/1275_pdf_upload.pdf |
17:05 CET | TS14.8 | MODEL-DRIVEN FEATURE ENGINEERING FOR DATA-DRIVEN BATTERY SOH MODEL Speaker: Khaled Alamin, Politecnico di Torino, IT Authors: Khaled Alamin1, Daniele Jahier Pagliari1, Yukai Chen2, Enrico Macii1, Sara Vinco1 and Massimo Poncino1 1Politecnico di Torino, IT; 2Imec, BE Abstract Accurate State of Health (SoH) estimation is indispensable for ensuring battery system safety, reliability, and runtime monitoring. However, as instantaneous runtime measurement of SoH remains impractical, if not infeasible, appropriate models are required for its estimation. Recently, various data-driven models have been proposed, which address various weaknesses of traditional models. However, the accuracy of data-driven models heavily depends on the quality of the training datasets, which usually contain data that are easy to measure but that are only partially or weakly related to the physical/chemical mechanisms that determine battery aging. In this study, we propose a novel feature engineering approach, which involves augmenting the original dataset with purpose-designed features that better represent the aging phenomena. Our contribution does not consist of a new machine-learning model but rather of the addition of selected features to an existing model. This methodology consistently demonstrates enhanced accuracy across various machine-learning models and battery chemistries, yielding an approximate 25% improvement in SoH estimation accuracy. Our work bridges a critical gap in battery research, offering a promising strategy to significantly enhance SoH estimation by optimizing feature selection./proceedings-archive/2024/DATA/418_pdf_upload.pdf |
17:10 CET | TS14.9 | SEARCH-IN-MEMORY (SIM): CONDUCTING DATA-BOUND COMPUTATIONS ON FLASH MEMORY CHIP FOR ENHANCED EFFICIENCY Speaker: Yun-Chih Chen, TU Dortmund, DE Authors: Yun-Chih Chen1, Yuan-Hao Chang2 and Tei-Wei Kuo3 1TU Dortmund, DE; 2Academia Sinica, TW | National Taiwan University, TW; 3National Taiwan University, TW Abstract Large-scale data systems utilize indexes like hash tables and trees for efficient data retrieval. These indexes are stored on disk and loaded into DRAM on demand, where they are post-processed and analyzed by the CPU. This method incurs substantial data I/O, especially when optimizations like prefetching are used. This issue is inherent in the von Neumann architecture, where storage systems are dedicated solely to data storage, while CPUs handle all computations. However, data indexing primarily involves filtering tasks, which require only simple equality tests and not the complex arithmetic capabilities of a CPU. This inefficiency in the von Neumann architecture has led to a growing interest in in-memory computing, initially centered on DRAM. Recently, NAND flash-based in-storage computing has gained attention due to its ability to compute over larger working sets without requiring initial memory loading. In response, we propose the Search-in-Memory (SiM) chip, which minimally modifies an existing flash memory chip to allow it to conduct equality tests internally and send only relevant search results, not the entire data page. Specifically, we implement data filtering by using the existing logic gates in a flash memory chip's peripheral circuits for bit-serial equality tests, which process all bits on a page simultaneously. Additionally, we introduce a versatile SIMD interface with two primary commands: search and gather, making SiM adaptable to different application scenarios. We use "Optimistic Error Correction" to efficiently ensure data accuracy. Our evaluations show that this new architecture could significantly improve throughput over traditional CPU-centric architectures./proceedings-archive/2024/DATA/708_pdf_upload.pdf (an illustrative software sketch of bit-serial equality search follows this session's table) |
17:11 CET | TS14.10 | HIGH-PERFORMANCE FEATURE EXTRACTION FOR GPU-ACCELERATED ORB-SLAMX Speaker: Filippo Muzzini, Department of Physics, Informatics and Mathematics, University of Modena and Reggio Emilia, IT Authors: Filippo Muzzini1, Nicola Capodieci1, Roberto Cavicchioli1 and Benjamin Rouxel2 1Università di Modena e Reggio Emilia, IT; 2Unimore, IT Abstract In the autonomous vehicles field, localization is a crucial aspect. While the ORB-SLAM algorithm is a recognized solution for these tasks, it poses challenges due to its computational intensity. Although accelerated implementations exist, a bottleneck persists in the Point Filtering phase, which relies on the Distribute Octree algorithm, an approach not suited to GPU processing. In this paper, we introduce a novel GPU-suitable algorithm designed to enhance the Point Filtering step, surpassing Distribute Octree. We conducted a comprehensive comparison with state-of-the-art CPU and GPU implementations, considering both computational time and trajectory accuracy. Our experimental results demonstrate significant speed-ups of up to 3x compared to previous contributions./proceedings-archive/2024/DATA/366_pdf_upload.pdf |
17:12 CET | TS14.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
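Editor's note: the bit-serial equality search described in TS14.9 (SiM) can be modelled concisely in software: the key is examined one bit position per step while every word stored on the page is tested in parallel at that position, and a per-word match mask is AND-accumulated. The sketch below is an illustrative functional model under an assumed bit-sliced page layout; it does not reflect the chip's actual command interface, peripheral circuitry, or error-correction scheme.

```python
# Functional model of a bit-serial equality search over a flash page.
# The page is assumed to be stored bit-sliced: column b holds bit b of
# every word, so one comparison step covers all words at that bit.
import numpy as np

def bit_serial_equal_search(page_bits: np.ndarray, key: int, width: int) -> np.ndarray:
    """page_bits: (num_words, width) array of 0/1; returns a boolean match mask."""
    match = np.ones(page_bits.shape[0], dtype=bool)
    for b in range(width):
        key_bit = (key >> b) & 1
        match &= (page_bits[:, b] == key_bit)   # one step: test bit b of every word
        if not match.any():                     # early exit once nothing can match
            break
    return match

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    words = rng.integers(0, 256, size=32)                         # 32 stored 8-bit words
    page = ((words[:, None] >> np.arange(8)) & 1).astype(np.uint8)
    key = int(words[5])                                           # search for a known value
    print(np.flatnonzero(bit_serial_equal_search(page, key, 8)))  # indices of matching words
```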
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 16:30 CET - 18:00 CET
Location / Room: Multi-Purpose Room M1A+C
Session chair:
Haralampos Stratigopoulos, Sorbonne Université - LIP6, FR
Session co-chair:
Graziano Pravadelli, U Verona, IT
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | TS20.1 | CRONUS: CIRCUIT RAPID OPTIMIZATION WITH NEURAL SIMULATOR Speaker: Youngmin Oh, Samsung Advanced Institute of Technology, KR Authors: Youngmin Oh1, Doyun Kim1, Yoon Hyeok Lee1 and Bosun Hwang2 1Samsung Advanced Institute of Technology, KR; 2Samsung Advanced Institute of Technology (SAIT), Samsung Electronics, KR Abstract Automation of analog circuit design is highly desirable, yet challenging. Various approaches such as deep reinforcement learning (DRL), genetic algorithms, and Bayesian optimization have been proposed and found to be effective. However, these techniques require a large number of interactions with a real simulator, leading to high computational costs. Therefore, we present a novel DRL method, CRONuS, for automatic analog circuit design that uses a surrogate for the simulator. With the help of the surrogate, our method is capable of augmenting a data set with a conservative reward design for stable policy training, without having to interact with the simulator. Regardless of the type of analog circuit, our experiment demonstrated a more than 5x improvement in sample efficiency with varying target performance metrics./proceedings-archive/2024/DATA/229_pdf_upload.pdf |
16:35 CET | TS20.2 | SELF-LEARNING AND TRANSFER ACROSS TOPOLOGIES OF CONSTRAINTS FOR ANALOG / MIXED-SIGNAL CIRCUIT LAYOUT SYNTHESIS Speaker: Kaichang Chen, KU Leuven, BE Authors: Kaichang Chen and Georges Gielen, KU Leuven, BE Abstract Truly full automation of analog/mixed-signal (AMS) integrated circuit design and layout has long been a target in electronic design automation. Making good use of human designer heuristics as constraints that steer today's tools is key to balancing efficiency and design space exploration. However, explicitly getting the constraints for every circuit from designers is the weak spot. Learning-based methods on the other hand can learn efficiently from training examples. This paper proposes a flexible framework that can self-learn various layout constraints for a circuit from some expert-generated example layouts. Constraints like alignment, symmetry, and device matching are learned from those expert layouts with the generate-and-aggregate methodology. Secondly, through feature matching, the learned knowledge can then be transferred as constraints for the layout synthesis of different circuit topologies, making the approach flexible and technology-agnostic. Experimental results show that our framework can learn constraints with 100% accuracy. Compared to other state-of-the-art tools, our framework also achieves a high efficiency and a high transfer accuracy over various types of constraints./proceedings-archive/2024/DATA/115_pdf_upload.pdf |
16:40 CET | TS20.3 | X-PIM: FAST MODELING AND VALIDATION FRAMEWORK FOR MIXED-SIGNAL PROCESSING-IN-MEMORY USING COMPRESSED EQUIVALENT MODEL IN SYSTEMVERILOG Speaker: Ingu Jeong, Sungkyunkwan University, KR Authors: Ingu Jeong and Jun-Eun Park, Sungkyunkwan University, KR Abstract Mixed-signal processing-in-memory (PIM) has gained prominence as a promising approach for implementing deep neural networks in energy-constrained systems. However, the co-design and optimization of mixed-signal circuits in PIM demand substantial time and effort for simulation and validation. This work presents X-PIM, a fast modeling and validation framework for mixed-signal PIMs. X-PIM encompasses not only the precise modeling of transistor-level analog computation circuits in SystemVerilog but also the rapid validation of system-level neural networks implemented using the mixed-signal PIM model. To achieve both accuracy and speed in simulation, X-PIM introduces a technique called the compressed equivalent model (CEM) for the mixed-signal PIM circuits. This technique transforms a two-dimensional PIM array into an equivalent single-cell model. Furthermore, X-PIM can account for the impact of non-ideal operations in mixed-signal circuits by incorporating effects such as ADC quantization noise, parasitic components, intrinsic noise, and finite bandwidth. Based on the proposed mixed-signal PIM modeling, X-PIM can perform system-level neural network validation with a significantly reduced simulation time, at least 200 times faster than that of SPICE-based validation. X-PIM demonstrates three mixed-signal PIMs: XNOR PIM, capacitive PIM, and ReRAM-based PIM. For a multi-layer perceptron (MLP) network, X-PIM can complete accuracy evaluation on the MNIST-1000 dataset within only 30 minutes./proceedings-archive/2024/DATA/140_pdf_upload.pdf |
16:45 CET | TS20.4 | TSS-BO: SCALABLE BAYESIAN OPTIMIZATION FOR ANALOG CIRCUIT SIZING VIA TRUNCATED SUBSPACE SAMPLING Speaker: Tianchen Gu, Fudan University, CN Authors: Tianchen Gu1, Jiaqi Wang2, Zhaori Bi1, Changhao Yan1, Fan Yang1, Yajie Qin1, Tao Cui2 and Xuan Zeng1 1Fudan University, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract We propose a novel scalable Bayesian optimization method with truncated subspace sampling (tSS-BO) to tackle high-dimensional optimization challenges for large-scale analog circuit sizing. To address the high-dimensional challenges, we propose subspace sampling subject to a truncated Gaussian distribution. This approach limits the effective sampling dimensionality to a constant upper bound, independent of the original dimensionality, leading to a significant reduction in the complexity associated with the curse of dimensionality. The distribution covariance is iteratively updated using a truncated flow, where approximate gradients and center steps are integrated with decaying prior subspace features. We introduce gradient sketching and local Gaussian process (GP) models to approximate gradients without additional simulations to mitigate systematic errors. To enhance efficiency and ensure compatibility with constraints, we utilize local GP models for the selection of promising candidates, avoiding the cost of acquisition function optimization. The proposed tSS-BO method exhibits clear advantages over state-of-the-art methods in experimental comparisons. On synthetic benchmark functions, the tSS-BO method achieves up to 4.93X evaluation speedups and a remarkable algorithm-complexity reduction of over 30X compared to the Bayesian baseline. In real-world analog circuits, our method achieves up to 2X speedups in simulation number and runtime./proceedings-archive/2024/DATA/31_pdf_upload.pdf (an illustrative subspace-sampling sketch follows this session's table) |
16:50 CET | TS20.5 | ADAPTIVE ODE SOLVERS FOR TIMED DATA FLOW MODELS IN SYSTEMC-AMS Speaker: Christian Haubelt, University of Rostock, DE Authors: Alexandra Kuester1, Rainer Dorsch1 and Christian Haubelt2 1Bosch Sensortec GmbH, DE; 2University of Rostock, DE Abstract The analog/mixed-signal extensions to SystemC effectively tackle the need for heterogeneous system integration using virtual prototyping. However, they introduce an inherent trade-off between accuracy and performance due to the discrete timestep. Besides the discrete-time scheduler, analog solvers are used within SystemC-AMS to solve linear ordinary differential equations (ODEs). In this paper, we derive two methodologies to integrate adaptive ODE solvers into SystemC-AMS that estimate the optimal timestep based on error control. The main advantage of the approaches is the avoidance of time-consuming global backtracking. Instead, they fit well into the execution semantics and scheduling approach of SystemC-AMS. A detailed comparison of both integration schemes is given, and they are evaluated using a MEMS accelerometer as a classical example of a heterogeneous system./proceedings-archive/2024/DATA/365_pdf_upload.pdf |
16:55 CET | TS20.6 | ANALOG TRANSISTOR PLACEMENT OPTIMIZATION CONSIDERING NON-LINEAR SPATIAL VARIATION Speaker: Supriyo Maji, The University of Texas at Austin, US Authors: Supriyo Maji1, Sungyoung Lee2 and David Pan1 1The University of Texas at Austin, US; 2Seoul National University, KR Abstract Analog circuit performance can degrade due to random and spatial variations. While random variations can be mitigated using larger-sized devices, such devices tend to have more spatial variations. To address this, a common technique involves employing a symmetric layout like the common-centroid, which effectively reduces linear variations, or first-order effects. However, achieving high performance in analog systems often necessitates mitigating nonlinear spatial variations, for which the common-centroid layout is unsuitable. In response, this work introduces an efficient approach based on simulated annealing for transistor placement, with a particular focus on mitigating nonlinear spatial variations. Importantly, our proposed method can also handle important layout constraints, including routing complexity, layout-dependent effects, and diffusion sharing, within the optimization. Experimental results show that the proposed method beats the state of the art in all important parameters while minimizing nonlinear spatial variations. Moreover, our approach gives users better control over optimization objectives than existing methods./proceedings-archive/2024/DATA/438_pdf_upload.pdf |
17:00 CET | TS20.7 | A DATA-DRIVEN ANALOG CIRCUIT SYNTHESIZER WITH AUTOMATIC TOPOLOGY SELECTION AND SIZING Speaker: Souradip Poddar, The University of Texas at Austin, US Authors: Souradip Poddar1, Ahmet Budak2, Linran Zhao1, Chen-Hao Hsu1, Supriyo Maji1, Keren Zhu3, Yaoyao Jia1 and David Z. Pan1 1The University of Texas at Austin, US; 2Analog Devices Inc., US; 3The Chinese University of Hong Kong, HK Abstract Despite significant recent advancements in analog design automation, analog front-end design remains a challenge characterized by its heavy reliance on human designer expertise together with extensive trial-and-error simulations. In this paper, we present a novel data-driven analog circuit synthesizer with automatic topology selection and sizing. We propose a modular approach to build a comprehensive, parameterized circuit topology library. Instead of starting from an exhaustive dataset, which is often not available or too expensive to build, we build an adaptive topology dataset, which can later be enhanced with synthetic data generated using variational autoencoders (VAE), a generative machine learning technique. This integration bolsters our methodology's predictive capabilities, minimizing the risk of inadvertent oversight of viable topologies. To ensure accuracy and robustness, the predicted topology is re-sized for verification and further performance optimization. Our experiments, which involve over 360 OPAMP topologies and over 540K data points, demonstrate our framework's capability to identify an optimal topology and its sizing within minutes, achieving design quality comparable to that of experienced designers./proceedings-archive/2024/DATA/444_pdf_upload.pdf |
17:05 CET | TS20.8 | SAGEROUTE2.0: HIERARCHICAL ANALOG AND MIXED SIGNAL ROUTING CONSIDERING VERSATILE ROUTING SCENARIOS Speaker: Haoyi Zhang, Peking University, CN Authors: Haoyi Zhang1, Xiaohan Gao1, Zilong Shen1, Jiahao Song1, Xiaoxu Cheng2, Xiyuan Tang1, Yibo Lin1, Runsheng Wang1 and Ru Huang1 1Peking University, CN; 2Primarius Technologies Co., Ltd., Shanghai, China, CN Abstract Recent advances in analog and mixed-signal (AMS) circuit applications call for a shorter design cycle and time-to-market period. Routing is one of the most time-consuming and tedious steps in the AMS design cycle. A modern AMS routing flow should simultaneously consider versatile routing scenarios (e.g., analog routing, digital routing, inter-analog-digital routing) to shoot for outstanding performance. Most previous studies only focus on one of the routing scenarios and ignore the synergism among different routing scenarios, lacking holistic and systematic investigation. In our work, we propose a hierarchical routing engine to handle the complex routing requirements in AMS circuits. By leveraging the carefully designed routing kernels hierarchically, the framework can generate high-quality routing solutions for real-world AMS circuits./proceedings-archive/2024/DATA/491_pdf_upload.pdf |
17:10 CET | TS20.9 | TRANS-NET: KNOWLEDGE-TRANSFERRING ANALOG CIRCUIT OPTIMIZER WITH A NETLIST-BASED CIRCUIT REPRESENTATION Speaker: Ho-Jin Lee, Pohang University of Science and Technology, KR Authors: Ho-Jin Lee1, Kyeong-Jun Lee2, Youngchang Choi1, Kyongsu Lee1, Seokhyeong Kang1 and Jae-Yoon Sim1 1Pohang University of Science and Technology, KR; 2Samsung Electronics, KR Abstract Finding an optimal point in the design space of analog circuits requires substantial, time-consuming effort even for skillful circuit designers. There have been extensive studies on the automated sizing of transistors in analog circuits based on machine learning (ML) algorithms. However, the previous approaches suffer from a lack of expandability and necessitate retraining the given model to apply it to the optimization of different circuits. The graph-based representation of a circuit with reinforcement learning (RL) achieved knowledge transfer when optimizing the same circuit with different process technologies. However, it can hardly be applied to different circuit topologies because the training of the RL agent fails to generalize. This paper introduces Trans-Net, an analog circuit optimizer that is capable of supporting knowledge transfer across different circuits as well as different process technologies, with a circuit representation that defines the circuit topology by one-to-one mapping from the SPICE netlist. The proposed analog circuit optimizer successfully supports multiple circuits within a single ML model, showcasing its effectiveness on five different circuit topologies across three different process technologies./proceedings-archive/2024/DATA/197_pdf_upload.pdf |
17:11 CET | TS20.10 | A NOVEL MULTI-OBJECTIVE OPTIMIZATION FRAMEWORK FOR ANALOG CIRCUIT CUSTOMIZATION Speaker: Sandeep Gupta, University of Southern California, US Authors: Mutian Zhu, Qiaochu Zhang, Mohsen Hassanpourghadi, Mike Shuo-Wei Chen, Tony Levi and Sandeep Gupta, University of Southern California, US Abstract Prior research has developed an approach called Analog Mixed-signal Parameter Search Engine (AMPSE) to reduce the cost of design of analog/mixed-signal (AMS) circuits. In this paper, we propose an adaptive sampling method (AS) to identify a range of Pareto-optimal versions of a given AMS circuit with different combinations of metric values to enable parameter-search based methods like AMPSE to efficiently serve multiple users with diverse requirements. As AMS circuit simulation has high run-time complexity, our method uses a surrogate model to estimate the values of metrics for the circuit, given the values of its parameters. In each iteration, we use a mix of uniform and adaptive sampling to identify parameter value combinations, use the surrogate model to identify a subset of these samples to simulate, and use the simulation results to retrain the model. Our method is more effective and has lower complexity compared with prior methods because it works with any surrogate model, uses a low-complexity yet effective strategy to identify samples for simulation, and uses an adaptive annealing strategy to balance exploration vs. exploitation. Experimental results demonstrate that, at lower complexity, our method discovers better Pareto-optimal designs compared to prior methods. The benefits of our method, relative to prior methods, increase as we move from AMS circuits with low simulation complexities to those with higher simulation complexities. For an AMS circuit with very high simulation complexity, our method identifies designs that are superior to the version of the circuit optimized by experienced designers./proceedings-archive/2024/DATA/668_pdf_upload.pdf |
17:12 CET | TS20.11 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Presenter: Session Chairs, DATE, ES Author: Session Chairs, DATE, ES Abstract Participants can freely interact with authors during their interactive technical presentations. |
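Editor's note: the core idea behind TS20.4 (tSS-BO), generating candidate sizings inside a random low-dimensional subspace so that the effective sampling dimensionality stays bounded regardless of the number of design parameters, can be sketched in a few lines. The code below is a deliberately simplified illustration with an assumed box-bounded search space and a stand-in objective; the paper's truncated-Gaussian covariance updates, gradient sketching, and GP-based candidate selection are omitted.

```python
# Simplified illustration of subspace sampling for high-dimensional
# sizing: candidates perturb the incumbent only within a random
# k-dimensional subspace of the D-dimensional parameter box.
import numpy as np

def subspace_candidates(x_best, lower, upper, k=8, n=64, sigma=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    D = x_best.size
    basis, _ = np.linalg.qr(rng.standard_normal((D, k)))    # orthonormal k-dim subspace
    steps = rng.standard_normal((n, k)) * sigma              # Gaussian steps in the subspace
    cands = x_best + (steps @ basis.T) * (upper - lower)     # lift back to D dimensions
    return np.clip(cands, lower, upper)                      # truncate to the feasible box

if __name__ == "__main__":
    D = 200                                                  # e.g. 200 sizing parameters
    lower, upper = np.zeros(D), np.ones(D)
    x0 = np.full(D, 0.5)
    f = lambda x: np.sum((x - 0.3) ** 2, axis=-1)            # stand-in for a circuit simulator
    for it in range(20):                                     # crude best-of-batch search loop
        cands = subspace_candidates(x0, lower, upper, rng=np.random.default_rng(it))
        best = cands[np.argmin(f(cands))]
        if f(best) < f(x0):
            x0 = best
    print(round(float(f(x0)), 4))                            # objective after 20 iterations
```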
Add this session to my calendar
Date: Wednesday, 27 March 2024
Time: 18:00 CET - 18:30 CET
Location / Room: Auditorium 2
Session chair:
Andy Pimentel, University of Amsterdam, NL
Session co-chair:
Valeria Bertacco, University of Michigan, US
Time | Label | Presentation Title Authors |
---|---|---|
18:00 CET | CC.1 | CLOSING REMARKS Presenter: Andy Pimentel, University of Amsterdam, NL Authors: Andy Pimentel1 and Valeria Bertacco2 1University of Amsterdam, NL; 2University of Michigan, US Abstract . |
18:15 CET | CC.2 | SAVE THE DATE 2025 Presenter: Aida Todri-Sanial, Eindhoven University of Technology, NL Author: Aida Todri-Sanial, Eindhoven University of Technology, NL Abstract . |