
Showing papers on "Program counter published in 2021"


Proceedings ArticleDOI
18 Oct 2021
TL;DR: Pythia, as discussed by the authors, formulates the prefetcher as a reinforcement learning agent that learns to prefetch using multiple types of program context and system-level feedback information, enabling it to generate highly accurate, timely, and system-aware prefetch requests.
Abstract: Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address, or delta between cacheline addresses) to predict future memory accesses. These techniques either completely neglect a prefetcher’s undesirable effects (e.g., memory bandwidth usage) on the overall system, or incorporate system-level feedback as an afterthought to a system-unaware prefetch algorithm. We show that prior prefetchers often lose their performance benefit over a wide range of workloads and system configurations due to their inherent inability to take multiple different types of program context and system-level feedback information into account while prefetching. In this paper, we make a case for designing a holistic prefetch algorithm that learns to prefetch using multiple different types of program context and system-level feedback information inherent to its design. To this end, we propose Pythia, which formulates the prefetcher as a reinforcement learning agent. For every demand request, Pythia observes multiple different types of program context information to make a prefetch decision. For every prefetch decision, Pythia receives a numerical reward that evaluates prefetch quality under the current memory bandwidth usage. Pythia uses this reward to reinforce the correlation between program context information and prefetch decision to generate highly accurate, timely, and system-aware prefetch requests in the future. Our extensive evaluations using simulation and hardware synthesis show that Pythia outperforms two state-of-the-art prefetchers (MLOP and Bingo) by 3.4% and 3.8% in single-core, 7.7% and 9.6% in twelve-core, and 16.9% and 20.2% in bandwidth-constrained core configurations, while incurring only 1.03% area overhead over a desktop-class processor and no software changes in workloads. The source code of Pythia can be freely downloaded from https://github.com/CMU-SAFARI/Pythia.

28 citations
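The paper's core idea, treating prefetching as a reinforcement learning problem, can be illustrated with a minimal tabular Q-learning sketch. This is not Pythia's actual design (the paper's state features, action space, and reward values differ); all constants and feature choices below are illustrative assumptions.

```python
# Minimal sketch of a reinforcement-learning prefetcher in the spirit of
# Pythia (not the actual implementation): a tabular Q-learning agent maps
# program-context features to prefetch-offset actions and is reinforced by
# a reward that penalizes wasted bandwidth. All constants are illustrative.
from collections import defaultdict
import random

ACTIONS = [0, 1, 2, 4, 8]          # candidate prefetch deltas (0 = no prefetch)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.05

q_table = defaultdict(float)        # (state, action) -> estimated return

def choose_action(state):
    """Epsilon-greedy selection over prefetch offsets; the state could be
    a (PC, last delta) tuple or any other program-context feature vector."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

def reward(useful, bandwidth_high):
    """Illustrative reward: useful prefetches score high, but less so under
    bandwidth pressure; useless prefetches are penalized."""
    if useful:
        return 5.0 if not bandwidth_high else 2.0
    return -2.0 if bandwidth_high else -1.0

def update(state, action, r, next_state):
    """Standard one-step Q-learning update."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    q_table[(state, action)] += ALPHA * (r + GAMMA * best_next
                                         - q_table[(state, action)])
```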


Proceedings ArticleDOI
TL;DR: Pythia, as mentioned in this paper, is a holistic prefetch algorithm that learns to prefetch using multiple different types of program context and system-level feedback information inherent to its design, formulating the prefetcher as a reinforcement learning agent.
Abstract: Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address) to predict future memory accesses. These techniques either completely neglect a prefetcher's undesirable effects (e.g., memory bandwidth usage) on the overall system, or incorporate system-level feedback as an afterthought to a system-unaware prefetch algorithm. We show that prior prefetchers often lose their performance benefit over a wide range of workloads and system configurations due to their inherent inability to take multiple different types of program context and system-level feedback information into account while prefetching. In this paper, we make a case for designing a holistic prefetch algorithm that learns to prefetch using multiple different types of program context and system-level feedback information inherent to its design. To this end, we propose Pythia, which formulates the prefetcher as a reinforcement learning agent. For every demand request, Pythia observes multiple different types of program context information to make a prefetch decision. For every prefetch decision, Pythia receives a numerical reward that evaluates prefetch quality under the current memory bandwidth usage. Pythia uses this reward to reinforce the correlation between program context information and prefetch decision to generate highly accurate, timely, and system-aware prefetch requests in the future. Our extensive evaluations using simulation and hardware synthesis show that Pythia outperforms multiple state-of-the-art prefetchers over a wide range of workloads and system configurations, while incurring only 1.03% area overhead over a desktop-class processor and no software changes in workloads. The source code of Pythia can be freely downloaded from https://github.com/CMU-SAFARI/Pythia.

25 citations


Proceedings ArticleDOI
05 Mar 2021
TL;DR: A 16-bit RISC processor designed in the Verilog Hardware Description Language (HDL) and simulated in the Xilinx ISE 14.7 design suite is presented in this paper.
Abstract: Reduced Instruction Set Computer (RISC) is a design philosophy that offers better performance and higher operating speed by favoring a smaller, simpler instruction set. The 16-bit RISC processor designed in this paper executes a large number of instructions with a simple design; it is written in the Verilog Hardware Description Language (HDL) and simulated in the Xilinx ISE 14.7 design suite. The main achievement of this work is that the multiplier unit in the Arithmetic and Logic Unit (ALU) and the Multiplier and Accumulator (MAC) is implemented using Vedic Sutras. The main principle of Vedic mathematics is to reduce the typical calculations of conventional mathematics to very simple ones, and hence reduce the overall computational complexity. In addition to these blocks, the designed RISC processor consists of a control unit and data path, register bank, program counter, and memory. The proposed RISC processor is very simple and capable of executing 14 instructions. Compared to their conventional counterparts, the Vedic MAC saves 44% power and the Vedic ALU 12%, while delay is reduced by 45% for the MAC and 35% for the ALU. These Vedic MAC and ALU units are then integrated with the other blocks to form the 16-bit Vedic processor, which reduces delay by 34% and saves around 88% power compared to a conventional processor. Hence, improved speed of operation, reduced power utilization, and lower area utilization are the key features of the designed RISC processor.

4 citations
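The Urdhva Tiryagbhyam ("vertically and crosswise") pattern the paper maps into hardware can be modeled in software: each result column is the crosswise sum of digit products, followed by carry propagation, which is the same structure a Vedic multiplier computes with parallel partial-product logic. A minimal Python model of that pattern (not the paper's Verilog design):

```python
# Illustrative software model of the Urdhva Tiryagbhyam multiplication
# pattern: result column k is the crosswise sum of all digit products
# a[i]*b[j] with i+j == k, followed by carry propagation.
def vedic_multiply(x, y, base=2):
    a = to_digits(x, base)
    b = to_digits(y, base)
    cols = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):            # crosswise partial products
        for j, bj in enumerate(b):
            cols[i + j] += ai * bj
    result, carry = 0, 0
    for k, c in enumerate(cols):          # carry propagation
        total = c + carry
        result += (total % base) * base ** k
        carry = total // base
    return result + carry * base ** len(cols)

def to_digits(n, base):
    """Little-endian digit list of n in the given base."""
    digits = []
    while n:
        digits.append(n % base)
        n //= base
    return digits or [0]

assert vedic_multiply(13, 11) == 143      # sanity check
```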


Book ChapterDOI
01 Jan 2021
TL;DR: The Lynsyn and LynsynLite Power Measurement Units (PMUs) as mentioned in this paper use the debug information in the application binary to map energy consumption to source code constructs, such as loops and procedures, to facilitate rapid detection of power and energy hotspots.
Abstract: The end of Dennard scaling has resulted in power or energy consumption becoming first-order design constraints of virtually every computer system. A key challenge is to attribute the power and energy consumption to source code constructs, such as loops and procedures, to facilitate rapid detection of power and energy hotspots. In this paper, we present our Lynsyn and LynsynLite Power Measurement Units (PMUs) which concurrently sample platform power consumption and Program Counter (PC) values over the processor’s out-of-band hardware debug interface. When combined with the debug information in the application binary, this enables the Lynsyn PMUs to map energy consumption to source code constructs. It is commonly stated that PC sampling using hardware debug interfaces is non-intrusive, and a key contribution of this work is a rigorous analysis of this claim. We find that performance, power, and energy overheads are at most 1.2% and hence conclude that hardware-based PC sampling is practically non-intrusive. Further, we include a case study where we analyse selected single- and multi-threaded benchmarks to exemplify practical use of the PMUs.

4 citations
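The PC-to-source attribution step the paper describes can be sketched as follows: given concurrent (PC, power) samples and a symbol table recovered from the binary's debug information, each sample's energy (power times sampling period) is charged to the enclosing function. The symbol table and numbers below are hypothetical; a real tool would parse DWARF debug info.

```python
# Hedged sketch of attributing (power, PC) samples to functions.
from bisect import bisect_right

# (start_address, function_name), sorted by address - assumed for illustration
symbols = [(0x1000, "main"), (0x1200, "compute_loop"), (0x1800, "write_results")]
starts = [s for s, _ in symbols]

def attribute(samples, period_s):
    """samples: iterable of (pc, power_watts) pairs taken every period_s
    seconds. Returns estimated energy in joules per function."""
    energy = {}
    for pc, power in samples:
        idx = bisect_right(starts, pc) - 1
        func = symbols[idx][1] if idx >= 0 else "<unknown>"
        # Each sample stands for one sampling period: E = P * t
        energy[func] = energy.get(func, 0.0) + power * period_s
    return energy

print(attribute([(0x1234, 2.5), (0x1240, 2.7), (0x1810, 1.9)], 1e-4))
```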


Proceedings ArticleDOI
06 Oct 2021
TL;DR: In this article, indirect transfer oriented programming (iTOP) is proposed to automate the construction of control-flow hijacking attacks in the presence of strong protections including control flow integrity, data execution prevention, and stack canaries.
Abstract: Exploiting a program requires a security analyst to manipulate data in program memory with the goal of obtaining control over the program counter and escalating privileges. However, this is a tedious and lengthy process as: (1) the analyst has to massage program data such that a logically reliable data-passing chain can be established, and (2) depending on the attacker's goal, certain fine-grained in-place protection mechanisms need to be bypassed. Previous work has proposed various techniques to facilitate exploit development. Unfortunately, none of them can easily address the given challenges: data in memory is difficult to massage for an analyst who does not know the peculiarities of the program, since the attack specification is most of the time only available as text and not automated at all. In this paper, we present indirect transfer oriented programming (iTOP), a framework to automate the construction of control-flow hijacking attacks in the presence of strong protections including control flow integrity, data execution prevention, and stack canaries. Given a vulnerable program, iTOP automatically builds an exploit payload with a chain of viable gadgets whose memory constraints are solved with SMT. One salient feature of iTOP is that it contains 13 attack primitives powered by a Turing-complete payload specification language, ESL. It also combines virtual and non-virtual gadgets using COOP-like dispatchers. As such, when searching for gadget chains, iTOP can respect, for example, a previously enforced CFI policy by using only legitimate control flow transfers. We have evaluated iTOP with a variety of programs and demonstrated that it can successfully generate exploits with the developed attack primitives.

1 citation
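The chain-search idea behind a framework like iTOP can be illustrated with a toy model: search for a sequence of gadgets whose combined effects reach an attacker goal, following only control transfers a CFI policy would permit. Real tools model memory with SMT constraints; this sketch replaces all of that with set-valued register effects, and the gadget database is entirely hypothetical.

```python
# Toy gadget-chain search: BFS over CFI-legal transfers until the chain's
# combined effects cover the goal registers.
from collections import deque

# Hypothetical gadget database: name -> (registers it sets, gadgets it may
# legally transfer to under the assumed CFI policy).
gadgets = {
    "g_load_rdi": ({"rdi"}, {"g_load_rsi", "g_syscall"}),
    "g_load_rsi": ({"rsi"}, {"g_load_rdx", "g_syscall"}),
    "g_load_rdx": ({"rdx"}, {"g_syscall"}),
    "g_syscall":  (set(),   set()),
}

def find_chain(start, goal_regs):
    """Return a gadget sequence that sets all goal registers, or None."""
    start_state = (start, frozenset(gadgets[start][0]))
    queue = deque([(start_state, [start])])
    seen = {start_state}
    while queue:
        (node, have), path = queue.popleft()
        if goal_regs <= have:
            return path
        for succ in gadgets[node][1]:
            state = (succ, have | gadgets[succ][0])
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [succ]))
    return None

print(find_chain("g_load_rdi", {"rdi", "rsi", "rdx"}))
# -> ['g_load_rdi', 'g_load_rsi', 'g_load_rdx']
```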


Patent
Dongqi Liu, Chang Liu, Lu Yimin, Jiang Tao, Zhao Chaojun
25 Mar 2021
TL;DR: In this article, a processor core, a processor, an apparatus, and an instruction processing method are disclosed, where the processor core includes an instruction fetch unit whose speculative execution predictor compares the program counter of a memory access instruction with the table entries stored in the predictor and marks the instruction accordingly.
Abstract: A processor core, a processor, an apparatus, and an instruction processing method are disclosed. The processor core includes: an instruction fetch unit, where the instruction fetch unit includes a speculative execution predictor that compares the program counter of a memory access instruction with table entries stored in the speculative execution predictor and marks the memory access instruction; a scheduler unit adapted to adjust the send order of marked memory access instructions and send them according to that order; and an execution unit adapted to execute the memory access instructions according to the send order. In the instruction fetch unit, a memory access instruction is marked according to the speculative execution prediction result. In the scheduler unit, a send order of memory access instructions is determined according to the marking, and the instructions are sent. In the execution unit, the memory access instructions are executed according to the send order. This helps avoid re-execution of a memory access instruction due to an address correlation of the memory access instruction. Consequently, it eliminates the need to add an idle cycle to the instruction pipeline and the need to refresh the pipeline to clear an incorrectly speculated memory access instruction.
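A schematic software model of the mechanism the patent describes, a fetch-stage table of program counters that previously caused mis-speculation, used to mark memory access instructions so the scheduler can reorder them, might look like the following. Table size, eviction, and scheduling policy are assumptions, not taken from the patent.

```python
# Sketch of a PC-indexed speculative-execution predictor plus scheduler.
class SpeculationPredictor:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.bad_pcs = set()            # PCs that caused re-execution before

    def mark(self, pc):
        """Fetch stage: mark the instruction if its PC hits in the table."""
        return pc in self.bad_pcs

    def train(self, pc):
        """Update on a detected mis-speculation at this PC."""
        if len(self.bad_pcs) >= self.capacity:
            self.bad_pcs.pop()          # crude eviction for the sketch
        self.bad_pcs.add(pc)

def schedule(access_pcs, predictor):
    """Scheduler stage: send unmarked accesses first, marked ones last;
    the stable sort preserves program order within each group."""
    return sorted(access_pcs, key=lambda pc: predictor.mark(pc))
```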

Patent
11 Mar 2021
TL;DR: In this article, a pre-fetch logic is configured to track one or more statistics with respect to cache prefetch requests, and link the statistics with the program counter.
Abstract: The disclosure relates to technology for pre-fetching data. An apparatus comprises a processor core, pre-fetch logic, and a memory hierarchy. The pre-fetch logic is configured to generate cache pre-fetch requests for a program instruction identified by a program counter. The pre-fetch logic is configured to track one or more statistics with respect to the cache pre-fetch requests. The pre-fetch logic is configured to link the one or more statistics with the program counter. The pre-fetch logic is configured to determine a degree of the cache pre-fetch requests for the program instruction based on the one or more statistics. The memory hierarchy comprises main memory and a hierarchy of caches. The memory hierarchy further comprises a memory controller configured to pre-fetch memory blocks identified in the cache pre-fetch requests from a current level in the memory hierarchy into a higher level of the memory hierarchy.
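The degree-control loop the patent describes can be sketched as: track issued and useful pre-fetches per triggering program counter, then derive the pre-fetch degree from the observed per-PC accuracy. The thresholds below are illustrative assumptions.

```python
# Hedged sketch of per-PC pre-fetch degree control.
from collections import defaultdict

stats = defaultdict(lambda: {"issued": 0, "useful": 0})

def record_issue(pc, n=1):
    """Count pre-fetch requests issued for this program counter."""
    stats[pc]["issued"] += n

def record_hit(pc):
    """Count a pre-fetched block that was later demanded (useful)."""
    stats[pc]["useful"] += 1

def degree(pc, max_degree=8):
    """Scale the number of pre-fetches with the PC's observed accuracy."""
    s = stats[pc]
    if s["issued"] < 16:                # not enough history: be conservative
        return 1
    accuracy = s["useful"] / s["issued"]
    if accuracy > 0.75:
        return max_degree
    if accuracy > 0.40:
        return max_degree // 2
    return 1 if accuracy > 0.10 else 0  # inaccurate PCs are throttled
```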

Patent
09 Feb 2021
TL;DR: In this paper, a program protection method for an embedded processor based on the RISC-V architecture is presented: a program counter judgment module (PC_Area_Judge) and a target area judgment module (Target_Area_Judge) generate output signals that, combined with the mark signal DATA_ACCESS of the CPU's current operation, feed a logic calculation that determines whether an access is valid, thereby realizing program protection in the embedded processor.
Abstract: The invention provides a program protection method for an embedded processor based on the RISC-V architecture, which is executed by the computer's main control unit and comprises the following steps: controlling a program counter judgment module PC_Area_Judge to generate a program counter output signal PC_Judge according to the data relationship among a starting address AddrStart, an ending address AddrEnd, and the PC data value; controlling a target area judgment module Target_Area_Judge to generate a target area output signal Target_Judge according to the data relationship among the starting address AddrStart, the ending address AddrEnd, and the ADDR data value; and controlling a program execution module Protect_Execution to perform a logic calculation on the program counter output signal PC_Judge, the target area output signal Target_Judge, and the mark signal DATA_ACCESS of the CPU's current operation, determining whether the access is valid according to the result of the logic calculation and realizing program protection in the embedded processor according to the validity of the access. With this method, the control circuit is simple, program safety is improved, chip manufacturing cost is reduced, and the approach is very well suited to embedded processors.
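The validity decision reduces to simple range and mode logic; the following is an illustrative software model of how PC_Judge, Target_Judge, and DATA_ACCESS might combine, with the exact combination semantics inferred from the abstract rather than taken from the patent claims.

```python
# Illustrative model of the PC_Judge / Target_Judge / DATA_ACCESS logic.
def pc_judge(pc, addr_start, addr_end):
    """True when the program counter lies inside the protected region."""
    return addr_start <= pc <= addr_end

def target_judge(addr, addr_start, addr_end):
    """True when the accessed address lies inside the protected region."""
    return addr_start <= addr <= addr_end

def access_valid(pc, addr, addr_start, addr_end, data_access):
    """Protect_Execution (assumed semantics): a data access that targets the
    protected region is only valid if the code performing it executes from
    inside that region; everything else is allowed."""
    if data_access and target_judge(addr, addr_start, addr_end):
        return pc_judge(pc, addr_start, addr_end)
    return True
```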

Journal ArticleDOI
01 Sep 2021
TL;DR: In this article, the branch predictor predicts the outcome of the current branch instruction (taken/not-taken), allowing the program counter to fetch the predicted target instruction and continue execution before the jump target address has been computed.
Abstract: As CPUs entered the era of instruction-level parallelism, instruction execution cycles became ever shorter and CPU processing speeds ever faster. However, within the instruction execution process itself, one key factor still limits how quickly a CPU can process instructions: the handling of branch instructions. A branch predictor predicts the outcome of the current branch instruction (taken/not-taken), allowing the program counter to fetch the predicted target instruction and continue execution before the jump target address has been computed. When the prediction is correct, CPU execution efficiency improves greatly. This paper starts from the two basic categories, static and dynamic branch prediction, and analyses their prediction principles, prediction accuracy, and the hardware consumed for prediction. A static branch predictor consumes less hardware than a dynamic one, but its accuracy is also lower; the shortcomings of both are pointed out. Finally, building on these original designs, better branch predictors developed in recent years are introduced. Looking ahead, and keeping hardware consumption in mind, I expect development to continue optimizing predictors that combine global and local history, improving the number of stored entries, the exploitation of correlation between branch instructions, and so on. Given the enormous number of instructions executed, every small improvement in predictor efficiency brings huge benefits.
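A concrete example of the classic dynamic scheme this survey covers is a table of 2-bit saturating counters indexed by the low bits of the branch PC: two consecutive mispredictions are needed to flip a stable prediction, which is why it beats one-bit and static schemes on loop branches. A minimal sketch:

```python
# Minimal 2-bit saturating-counter branch predictor.
class TwoBitPredictor:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)   # init: weakly not-taken

    def predict(self, pc):
        """Predict taken when the counter is in a 'taken' state (2 or 3)."""
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        """Saturating increment on taken, decrement on not-taken."""
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```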

Patent
27 Apr 2021
TL;DR: In this paper, a processor stores a copy of a first subset of the architected state information in on-die storage elements capable of retaining storage after power is turned off, and when a wakeup event is detected, circuitry within the processor is powered up again.
Abstract: Systems, apparatuses, and methods for retaining architected state for relatively frequent switching between sleep and active operating states are described. A processor receives an indication to transition from an active state to a sleep state. The processor stores a copy of a first subset of the architected state information in on-die storage elements capable of retaining storage after power is turned off. The processor supports programmable input/output (PIO) access of particular stored information during the sleep state. When a wakeup event is detected, circuitry within the processor is powered up again. A boot sequence and recovery of architected state from off-chip memory are not performed. Rather than fetch from a memory location pointed to by a reset base address register, the processor instead fetches an instruction from a memory location pointed to by a restored program counter of the retained subset of the architected state information.
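The difference between the patent's warm wake-up and a cold boot can be captured in a small schematic: on wake, fetch resumes at the program counter retained in on-die storage instead of the reset base address. The state subset and names below are placeholders.

```python
# Schematic of warm wake-up from retained state vs. cold boot.
RESET_VECTOR = 0x0000_0000

class Core:
    def __init__(self):
        self.retained = {}              # on-die, power-retained storage

    def enter_sleep(self, arch_state):
        # Save only the subset needed to resume without a boot sequence.
        self.retained = {"pc": arch_state["pc"], "sp": arch_state["sp"]}

    def wake(self, cold_boot=False):
        if cold_boot or not self.retained:
            return RESET_VECTOR         # full boot: fetch from reset base
        return self.retained["pc"]      # warm wake: fetch from restored PC
```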

Patent
24 Mar 2021
TL;DR: In this article, the second processor is configured to refer to the value of the first processor's program counter and fetch an instruction from memory using that value.
Abstract: A multiprocessor device includes a first processor and a second processor, wherein the multiprocessor device is configured to cause, when debugging of the first processor is to be performed by using the second processor, the second processor to refer to the value of the program counter of the first processor and fetch an instruction from a memory by using the value read from the program counter.
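The described arrangement is simple enough to model directly: the second processor reads the first processor's program counter and fetches the instruction word at that address from shared memory. The interfaces below are assumed for illustration.

```python
# Toy model of cross-processor debug fetch via the debuggee's PC.
from types import SimpleNamespace

def debug_fetch(target_core, memory, word_size=4):
    pc = target_core.program_counter          # refer to the debuggee's PC
    return memory[pc : pc + word_size]        # fetch the instruction at PC

core0 = SimpleNamespace(program_counter=8)    # stand-in for the first processor
print(debug_fetch(core0, bytes(range(32))))   # -> b'\x08\t\n\x0b'
```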

Patent
22 Apr 2021
TL;DR: In this article, a trace circuit is integrated into a semiconductor device together with a microprocessor comprising an m-bit program counter, and the trace circuit externally outputs a trace clock and n-bit trace data.
Abstract: This trace circuit is integrated into a semiconductor device together with a microprocessor comprising an m-bit program counter, and it externally outputs a trace clock and n-bit trace data (where 2 ≤ n ≤ m). Synchronized to the trace clock, the trace circuit outputs a first value as the trace data when the program counter does not change, a second value when the program counter increments, and a third value when the program counter is loaded; after pausing the microprocessor state machine, it divides the branch destination address or interrupt address loaded into the program counter into pieces and outputs them as the trace data.
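A behavioral model of the described encoding: each trace-clock cycle, the circuit emits one of three codes depending on whether the PC held, incremented, or was loaded, and on a load it shifts the target address out in n-bit pieces. The code values and PC step size below are placeholders, not the patent's actual encoding.

```python
# Behavioral sketch of the trace encoding for one PC transition.
HOLD, INCREMENT, LOAD = 0, 1, 2      # placeholder output values

def trace_step(prev_pc, pc, step=4, n=8, m=32):
    """Return the trace symbols emitted for one PC transition
    (n-bit trace port, m-bit program counter)."""
    if pc == prev_pc:
        return [HOLD]
    if pc == prev_pc + step:
        return [INCREMENT]
    # Loaded PC (branch/interrupt): emit LOAD, then the m-bit target address
    # divided into n-bit pieces, most significant first.
    pieces = [(pc >> shift) & ((1 << n) - 1)
              for shift in range(m - n, -1, -n)]
    return [LOAD] + pieces

print(trace_step(0x100, 0x104))        # sequential fetch -> [1]
print(trace_step(0x104, 0x2000_0000))  # branch -> [2, 32, 0, 0, 0]
```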

Patent
10 Jun 2021
TL;DR: In this article, a method for providing flexible command pointers to microcodes in a memory device is described, which includes receiving a command to access a memory device; accessing a configuration parameter; identifying a program counter value based on the configuration parameter and the command; and loading and executing a microcode based on the program counter value.
Abstract: Disclosed are apparatuses, methods, and computer-readable media for providing flexible command pointers to microcodes in a memory device. In one embodiment, a method is disclosed comprising receiving a command to access a memory device; accessing a configuration parameter; identifying a program counter value based on the configuration parameter and the command; and loading and executing a microcode based on the program counter.
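The method reduces to a table lookup followed by dispatch: the (configuration parameter, command) pair selects a program counter value, and the microcode at that entry point is loaded and executed. The table contents and ROM below are hypothetical.

```python
# Minimal model of flexible microcode command pointers.
microcode_entry = {
    # (config_parameter, command) -> microcode program counter
    ("fast_path", "READ"):  0x0100,
    ("fast_path", "WRITE"): 0x0180,
    ("safe_path", "READ"):  0x0200,
    ("safe_path", "WRITE"): 0x0280,
}

def handle_command(command, config, microcode_rom):
    pc = microcode_entry[(config, command)]   # identify the PC value
    return microcode_rom[pc]()                # load and execute the microcode

rom = {0x0100: lambda: "read done", 0x0180: lambda: "write done",
       0x0200: lambda: "checked read", 0x0280: lambda: "checked write"}
print(handle_command("READ", "safe_path", rom))   # -> checked read
```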