US9569279B2 - Heterogeneous multiprocessor design for power-efficient and area-efficient computing - Google Patents


Publication number
US9569279B2
Authority
US
United States
Legal status
Active, expires
Application number
US13/723,995
Other versions
US20140181501A1 (en)
Inventor
Gary D. Hicok
Matthew Raymond LONGNECKER
Rahul Gautam Patel
Current Assignee
Nvidia Corp
Original Assignee
Nvidia Corp
Application filed by Nvidia Corp filed Critical Nvidia Corp
Assigned to NVIDIA CORPORATION. Assignors: HICOK, GARY D., LONGNECKER, MATTHEW RAYMOND, PATEL, RAHUL GAUTAM
Priority to DE102013108041.3A (DE102013108041B4)
Priority to TW102127477A (TWI502333B)
Publication of US20140181501A1
Application granted
Publication of US9569279B2

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5094Allocation of resources, e.g. of the central processing unit [CPU] where the allocation takes into account power or heat criteria
    • Y02B60/142
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • a switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120.
  • Other components including universal serial bus (USB) or other port connections, compact disc (CD) drives, digital versatile disc (DVD) drives, film recording devices, and the like, may also be connected to I/O bridge 107 .
  • the various communication paths shown in FIG. 1, including the specifically named communication paths 106 and 113, may be implemented using any suitable protocols, such as PCI Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s), and connections between different devices may use different protocols as is known in the art.
  • the parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU).
  • the parallel processing subsystem 112 incorporates circuitry optimized for general purpose processing, while preserving the underlying computational architecture, described in greater detail herein.
  • the parallel processing subsystem 112 may be integrated with one or more other system elements in a single subsystem, such as joining the memory bridge 105 , CPU 102 , and I/O bridge 107 to form a system on chip (SoC).
  • connection topology including the number and arrangement of bridges, the number of CPUs 102 , and the number of parallel processing subsystems 112 , may be modified as desired.
  • system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102 .
  • parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102 , rather than to memory bridge 105 .
  • I/O bridge 107 and memory bridge 105 might be integrated into a single chip instead of existing as one or more discrete devices.
  • Large embodiments may include two or more CPUs 102 and two or more parallel processing subsystems 112 .
  • the particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices might be supported.
  • switch 116 is eliminated, and network adapter 118 and add-in card 120 are connected directly to I/O bridge 107.
  • computer system 100 comprises a mobile device and network adapter 118 implements a digital wireless communications subsystem.
  • input devices 108 comprise a touch tablet input subsystem and display device 110 implements a mobile screen subsystem, such as a liquid crystal display module.
  • CPU 102 comprises at least two processor cores 140 ( 0 ), 140 (N).
  • a first processor core 140 ( 0 ) is designed for low power operation, while a second processor core 140 (N) is designed for high performance operation.
  • a symmetric number of low power and high performance processor cores are implemented within CPU 102 .
  • An operating system kernel 150 residing in system memory 104 includes a scheduler 152 and device drivers 154 , 156 . Kernel 150 is configured to provide certain conventional kernel services, including services related to process and thread management.
  • Scheduler 152 is configured to manage thread and process allocation to different processor cores 140 within CPU 102 .
  • Device driver 154 is configured to manage which processor cores 140 are enabled for use and which are disabled, such as via powering down.
  • Device driver 156 is configured to manage parallel processing subsystem 112 , including processing and buffering command and input data streams to be processed.
  • FIG. 2 is a block diagram of CPU 102 of computer system 100 of FIG. 1 , according to one embodiment of the present invention.
  • CPU 102 includes at least two cores 140 ( 0 ), 140 (N), a core interconnect 220 , a cache 222 , a memory interface 224 , an interrupt distributor 226 , and a cluster control unit 230 .
  • Each core 140 may operate within a corresponding voltage-frequency (VF) domain, distinct from other VF domains.
  • circuitry associated with core 140 ( 0 ) may operate on a first voltage and first operating frequency associated with VF domain 210 ( 0 ), while circuits associated with core 140 (N) may operate on a second voltage and a second frequency associated with VF domain 210 (N).
  • each voltage and each frequency may be varied independently within technically feasible ranges to achieve certain power and performance goals.
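The benefit of varying voltage and frequency independently per VF domain can be illustrated with the standard first-order CMOS dynamic power model, P ≈ C·V²·f. The model and the operating-point numbers below are illustrative assumptions for discussion, not values taken from this disclosure:

```python
def dynamic_power(c_eff, voltage, frequency):
    """First-order dynamic power estimate: P = C_eff * V^2 * f (watts)."""
    return c_eff * voltage ** 2 * frequency

# Assumed operating points for a low power and a high performance VF domain.
low_power_watts = dynamic_power(c_eff=1.0e-9, voltage=0.8, frequency=500e6)   # 0.32 W
high_perf_watts = dynamic_power(c_eff=2.5e-9, voltage=1.1, frequency=1.5e9)   # ~4.54 W
```

Because power grows with the square of voltage, lowering the supply voltage of a lightly loaded VF domain yields larger savings than an equivalent frequency reduction alone.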
  • core 140 ( 0 ) is designed for low power operation, while core 140 (N) is designed for high performance operation, with both cores preserving mutual instruction set architecture (ISA) compatibility.
  • Core 140 (N) may achieve higher performance via any applicable technique, such as circuit design directed to high clock speeds, logic design directed to simultaneously issuing and processing multiple concurrent instructions, and architectural design directed to improved cache size and performance.
  • Design trade-offs associated with core 140 (N) may tolerate increased marginal power consumption to achieve greater marginal execution performance.
  • Core 140 ( 0 ) may achieve lower power operation via circuit design directed to reducing leakage current, crossbar current, and parasitic loss, and via logic design directed to reducing switching energy associated with processing an instruction.
  • Design trade-offs associated with core 140 ( 0 ) should generally favor reducing power consumption, even at the expense of clock speed and processing performance.
  • Each core 140 includes a programmable virtual identifier (ID) 212 , which identifies the processor core.
  • Each core 140 may be programmed with an arbitrary core identifier via virtual ID 212, which may be associated with a particular thread or process maintained by scheduler 152.
  • Each core 140 may include logic to facilitate replicating internal execution state to another core 140 .
  • core interconnect 220 couples cores 140 to a cache 222 , which is further coupled to a memory interface 224 .
  • Core interconnect 220 may be configured to facilitate state replication between cores 140 .
  • Interrupt distributor 226 is configured to receive an interrupt signal and transmit the interrupt signal to an appropriate core 140 , identified by a value programmed within virtual ID 212 . For example, an interrupt that is targeted for core zero will be directed to whichever core 140 has a virtual ID 212 programmed to zero.
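The virtual ID indirection amounts to a small routing table from logical core IDs to physical cores. The class below is an illustrative sketch; its name and structure are assumptions, not taken from this disclosure:

```python
class InterruptDistributor:
    """Routes an interrupt addressed to a logical core to whichever
    physical core is currently programmed with that virtual ID."""

    def __init__(self, virtual_ids):
        # Maps physical core index -> the virtual ID value programmed into it.
        self.virtual_ids = dict(virtual_ids)

    def route(self, target_virtual_id):
        for physical_core, vid in self.virtual_ids.items():
            if vid == target_virtual_id:
                return physical_core
        raise ValueError(f"no core programmed with virtual ID {target_virtual_id}")

# After a migration, physical core 1 holds virtual ID 0, so an interrupt
# targeted at "core zero" is delivered to physical core 1.
distributor = InterruptDistributor({0: 1, 1: 0})
```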
  • Cluster control unit 230 manages availability state for each core 140 , which may be individually hot plugged in to become available or hot plugged out to no longer be available. Prior to hot plugging a specified core out, cluster control unit 230 may cause execution state for the core to be replicated to another core for continued execution. For example, if execution should transition from a low power core to a high performance core, then execution state for the low power core may be replicated to the high performance core before the high performance core begins executing. Execution state is implementation specific and may include, without limitation, register data, translation buffer data, and cache state.
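The hot plug sequence described above, bringing the destination core online, replicating execution state, then retiring the source core, can be sketched as follows. The data layout is an assumption for illustration only:

```python
def hot_plug_transition(cluster, src, dst):
    """Hot plug core `dst` in, replicate `src`'s execution state to it,
    then hot plug `src` out so it can be powered down."""
    cluster[dst]["online"] = True
    # Execution state is implementation specific: register data,
    # translation buffer data, cache state, virtual ID, and so on.
    cluster[dst]["state"] = dict(cluster[src]["state"])
    cluster[src]["online"] = False

# Transition execution from the low power core to the high performance core.
cluster = {
    "low_power": {"online": True, "state": {"virtual_id": 0, "regs": [7, 42]}},
    "high_perf": {"online": False, "state": {}},
}
hot_plug_transition(cluster, "low_power", "high_perf")
```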
  • cluster control unit 230 is configured to power off one or more voltage supplies to a core that has been hot plugged out and to power on one or more voltage supplies to a core that has been hot plugged in.
  • cluster control unit 230 may power off a voltage supply associated with VF domain 210 ( 0 ) to hot plug out core 140 ( 0 ).
  • Cluster control unit 230 may also implement frequency control circuitry for each core 140 .
  • Cluster control unit 230 receives commands from a cluster switch software module residing within device driver 154 .
  • the cluster switch manages transitions between core configurations. For example, the cluster switch is able to direct each core to save context, including a virtual ID 212, and to load a saved context, including an arbitrary virtual ID 212.
  • the cluster switch may include hardware support for saving and loading context via cluster control unit 230 .
  • Cluster control unit 230 may provide automatic detection of workload changes and indicate to the cluster switch that a new workload requires a new configuration. The cluster switch then directs cluster control unit 230 to transition a workload from one core 140 to another core 140, or to enable additional cores via hot plugging in the additional cores.
  • FIG. 3 illustrates different operating regions of a CPU comprising multiple cores, according to one embodiment of the present invention.
  • the CPU such as CPU 102 of FIG. 1 , includes at least a low power core 140 ( 0 ) and a high performance core 140 (N).
  • a power curve 320 for low power core 140 ( 0 ) is plotted as a function of throughput 310 .
  • a power curve 322 is plotted for high performance core 140 (N)
  • a power curve 324 is plotted for a dual core configuration.
  • Throughput 310 is defined here as instructions executed per second, while power 312 is defined in units of power, such as watts (or a fraction thereof), needed to sustain a corresponding throughput 310 .
  • a core clock frequency may be varied to achieve continuously different levels of throughput along the throughput 310 axis.
  • low power core 140 ( 0 ) has a maximum throughput that is lower than a maximum throughput for high performance core 140 (N).
  • high performance core 140 (N) is able to operate at a higher clock frequency than low power core 140 ( 0 ).
  • low power core 140 ( 0 ) may be driven with one clock frequency that is in an associated upper operating range, while high performance core 140 (N) may be driven with a different clock frequency that is in an associated medium operating range.
  • each core 140 ( 0 ), 140 (N) in dual core mode is driven with an identical clock frequency within range of both cores.
  • each core 140 ( 0 ), 140 (N) in dual core mode is driven with a different clock within an associated range of each core.
  • each clock frequency may be selected to achieve similar forward execution progress for each core.
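One simple way to achieve similar forward progress, assuming each core's instructions-per-cycle (IPC) rate is known, is to scale the high performance core's clock so that both cores retire instructions at roughly the same rate. The IPC figures below are assumptions for illustration:

```python
def matching_frequency(f_low_hz, ipc_low, ipc_high):
    """Clock frequency for the high performance core such that
    f_high * IPC_high ~= f_low * IPC_low (similar forward progress)."""
    return f_low_hz * ipc_low / ipc_high

# A low power core at 1.4 GHz with IPC 1.0 makes similar progress to a
# high performance core with IPC 2.0 clocked at 700 MHz.
f_high = matching_frequency(1.4e9, ipc_low=1.0, ipc_high=2.0)
```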
  • cores 140 are configured to operate from a common voltage supply and may operate at independent clock frequencies.
  • low power core 140 ( 0 ) is able to satisfy throughput requirements using the least power of the three core configurations (low power, high performance, dual core).
  • high performance core 140 (N) is able to satisfy throughput requirements using the least power of the three core configurations, while extending throughput 310 beyond a maximum throughput 314 for low power core 140 ( 0 ).
  • operating both low power core 140 ( 0 ) and high performance core 140 (N) simultaneously may achieve a throughput that is higher than a maximum throughput 316 for high performance core 140 (N), thereby extending overall throughput, but at the expense of additional power consumption.
  • a first state transition is between region 330 and region 332 ; a second state transition is between region 332 and region 330 ; a third state transition is between region 330 and region 334 ; a fourth state transition is between region 334 and region 330 ; a fifth state transition is between region 332 and region 334 ; and a sixth state transition is between region 334 and region 332 .
  • Additional cores may add additional operating regions and additional potential state transitions between core configurations without departing from the scope and spirit of the present invention.
  • cores 140 within CPU 102 are characterized in terms of power consumption and throughput as a function of voltage and frequency.
  • a resulting characterization comprises a family of power curves and different operating regions having different power requirements.
  • the different operating regions may be determined statically for a given CPU 102 design.
  • the different operating regions may be stored in tables within device driver 154, which is then able to configure CPU 102 to hot plug in and hot plug out different cores 140 based on prevailing workload requirements.
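Such a driver table can be reduced to an ordered list of core configurations, each annotated with the maximum throughput its operating region sustains; the driver then picks the first, least-power entry that covers the requested workload. The threshold values below are illustrative assumptions, not characterization data from this disclosure:

```python
# Ordered from least to most power hungry; thresholds would come from static
# characterization of the CPU (illustrative values, instructions per second).
OPERATING_REGIONS = [
    ("low_power", 1.0e9),
    ("high_performance", 2.5e9),
    ("dual_core", 3.5e9),
]

def select_configuration(required_throughput):
    """Return the most power-efficient configuration covering the workload."""
    for config, max_throughput in OPERATING_REGIONS:
        if required_throughput <= max_throughput:
            return config
    return OPERATING_REGIONS[-1][0]  # saturate at the largest configuration
```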
  • device driver 154 reacts to current workload requirements and reconfigures different cores 140 within CPU 102 to best satisfy the requirements.
  • scheduler 152 is configured to schedule workloads according to available cores 140 .
  • Scheduler 152 may direct device driver 154 to hot plug in or hot plug out different cores based on present and future knowledge of workload requirements.
  • FIG. 4 is a flow diagram of method steps for configuring a multi-core CPU to operate within a power-efficient region, according to one embodiment of the present invention.
  • Although the method steps are described in conjunction with the systems of FIGS. 1-2, persons of ordinary skill in the art will understand that any system configured to perform the method steps, in any order, is within the scope of the invention. In one embodiment, the method steps are performed by CPU 102 of FIG. 1.
  • a method 400 begins in step 410 , where cluster control unit 230 of FIG. 2 initializes core configuration for CPU 102 .
  • cluster control unit 230 initializes core configuration for CPU 102 to reflect availability of low power core 140 ( 0 ) of FIG. 1 .
  • core 140 ( 0 ) executes an operating system boot chronology, including loading and initiating execution of kernel 150 .
  • In step 412, device driver 154 receives workload information.
  • the workload information may include, without limitation, CPU load statistics, latency statistics, and the like.
  • the workload information may be received from cluster control unit 230 within CPU 102 or from conventional kernel task and thread services. If, in step 420 , there is a change in workload reflected by the workload information, then the method proceeds to step 422 , otherwise, the method proceeds back to step 412 .
  • In step 422, the device driver determines a matching core configuration to support the new workload information. The driver may use statically pre-computed workload tables that map power curve information to efficient core configurations that support a required workload reflected in the workload information.
  • If, in step 430, the matching core configuration represents a change to the current core configuration, then the method proceeds to step 432; otherwise, the method proceeds back to step 412.
  • In step 432, the device driver causes CPU 102 to transition to the matching core configuration.
  • the transition process may involve hot plugging one or more cores in and may also involve hot plugging one or more cores out, as a function of differences between a current core configuration and the matching core configuration.
  • If, in step 440, the method should terminate, then the method proceeds to step 490; otherwise, the method proceeds back to step 412.
  • the method may need to terminate upon receiving a termination signal, such as during an overall shutdown event.
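The complete flow of method 400 can be sketched as a polling loop. The callback decomposition below is an illustrative assumption, not the patent's implementation:

```python
def run_configuration_loop(get_workload_info, select_configuration,
                           apply_configuration, should_terminate):
    """Sketch of method 400: step 410 initializes the configuration, step 412
    polls workload information, step 420 tests for a workload change, step 422
    matches a core configuration, step 430 compares it to the current one,
    step 432 performs the transition, and steps 440/490 handle termination."""
    current_workload = None
    current_config = "low_power"           # step 410: boot on the low power core
    while not should_terminate():          # step 440
        info = get_workload_info()         # step 412
        if info != current_workload:       # step 420: workload changed?
            current_workload = info
            config = select_configuration(info)   # step 422
            if config != current_config:          # step 430
                apply_configuration(config)       # step 432: hot plug in/out
                current_config = config
    return current_config                  # step 490

# Scripted demonstration with an assumed 1.0e9 instructions/sec threshold.
seq = [0.5e9, 2.0e9, 2.0e9]
cursor = {"i": 0}
transitions = []

def next_workload():
    value = seq[cursor["i"]]
    cursor["i"] += 1
    return value

final = run_configuration_loop(
    next_workload,
    lambda w: "low_power" if w <= 1.0e9 else "high_performance",
    transitions.append,
    lambda: cursor["i"] >= len(seq),
)
```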
  • In sum, a technique is disclosed for managing processor cores within a multi-core CPU.
  • the technique involves hot plugging core resources in and hot plugging core resources out as needed.
  • Each core includes a virtual ID to allow the core execution context to be abstracted away from a particular physical core circuit.
  • As a workload increases, core configurations may be changed to support the increased demand.
  • As a workload decreases, core configurations may be changed to reduce power consumption while supporting the reduced workload.
  • One advantage of the disclosed technique is that it improves power efficiency of a multi-core central processing unit over a wide workload range, while efficiently utilizing processing resources.
  • aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software.
  • One embodiment of the invention may be implemented as a program product for use with a computer system.
  • the program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media.
  • Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored.

Abstract

A technique for managing processor cores within a multi-core central processing unit (CPU) provides efficient power and resource utilization over a wide workload range. The CPU comprises at least one core designed for low power operation and at least one core designed for high performance operation. For low workloads, the low power core executes the workload. For certain higher workloads, the high performance core executes the workload. For certain other workloads, the low power core and the high performance core both share execution of the workload. This technique advantageously enables efficient processing over a wider range of workloads than conventional systems.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims benefit of the United States Provisional Patent Application having Ser. No. 61/678,026, filed on Jul. 31, 2012, which is hereby incorporated herein by reference.
BACKGROUND OF THE INVENTION
Field of the Invention
The present invention generally relates to multiprocessor computer systems and, more specifically, to a heterogeneous multiprocessor design for power-efficient and area-efficient computing.
Description of the Related Art
Battery-powered mobile computing platforms have become increasingly important in recent years, intensifying the need for efficient, low power systems that deliver highly scalable computational capacity with diminishing cost. A typical mobile device may need to operate over a wide performance range, according to workload requirements. Different performance ranges are conventionally mapped to different operating modes, with power consumption proportionally related to performance within a given operating mode. In a low-power sleep mode, the mobile device may provide a small amount of computational capacity, such as to maintain radio contact with a cellular tower. In an active mode, the mobile device may provide low-latency response to user input, for example via a window manager. Many operations associated with typical applications execute with satisfactory performance in an active mode. In a high-performance mode, the mobile device needs to provide peak computational capacity, such as to execute a real-time game or perform transient user-interface operations. Active mode and high-performance mode typically require progressively increasing power consumption.
A number of techniques have been developed to improve both performance and power efficiency for mobile devices. Such techniques include reducing device parasitic loads by reducing device size, reducing operating and threshold voltages, trading off performance for power-efficiency, and adding different circuit configurations tuned to operate well under certain operating modes.
In one example, a mobile device processor complex comprises a low-power, but low-performance processor and a high-performance, but high-power processor. In idle and low activity active modes, the low-power processor is more power efficient at lower performance levels and is therefore selected for execution, while in high-performance modes, the high-performance processor is more power efficient and is therefore selected for execution of larger workloads. In this scenario, the trade-off space includes a cost component since the mobile device carries a cost burden of two processors, where only one processor can be active at a time. While such a processor complex enables both low power operation and high-performance operation, the processor complex makes inefficient use of expensive resources.
As the foregoing illustrates, what is needed in the art is a more efficient technique for accommodating a wide range of different workloads.
SUMMARY OF THE INVENTION
One embodiment of the present invention sets forth a method for configuring one or more cores within a processing unit for executing different workloads, the method comprising receiving information related to a new workload, determining, based on the information, that the new workload is different than a current workload, determining how many of the one or more cores should be configured to execute the new workload based on the information, determining whether a new core configuration is needed based on how many of the one or more cores should be configured to execute the new workload, and if a new core configuration is needed, then transitioning the processing unit to the new core configuration, or if a new core configuration is not needed, then maintaining a current core configuration for executing the new workload.
Other embodiments of the present invention include, without limitation, a computer-readable storage medium including instructions that, when executed by a processing unit, cause the processing unit to perform the techniques described herein as well as a computing device that includes a processing unit configured to perform the techniques described herein.
One advantage of the disclosed technique is that it improves power efficiency of a multi-core central processing unit over a wide workload range, while efficiently utilizing processing resources.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the present invention;
FIG. 2 is a block diagram of a central processing unit (CPU) of the computer system of FIG. 1, according to one embodiment of the present invention;
FIG. 3 illustrates different operating regions of a CPU comprising multiple cores, according to one embodiment of the present invention; and
FIG. 4 is a flow diagram of method steps for configuring a CPU comprising multiple cores to operate within a power-efficient region, according to one embodiment of the present invention.
DETAILED DESCRIPTION
In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details.
System Overview
FIG. 1 is a block diagram illustrating a computer system 100 configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via an interconnection path that may include a memory bridge 105. Memory bridge 105, which may be, e.g., a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O (input/output) bridge 107. I/O bridge 107, which may be, e.g., a Southbridge chip, receives user input from one or more user input device(s) 108 (e.g., keyboard, pointing device, capacitive touch tablet) and forwards the input to CPU 102 via communication path 106 and memory bridge 105. A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or second communication path 113 (e.g., a Peripheral Component Interconnect (PCI) Express, Accelerated Graphics Port, or HyperTransport link). In one embodiment parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110 that may be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, or the like. A system disk 114 is also connected to I/O bridge 107 and may be configured to store content and applications and data for use by CPU 102 and parallel processing subsystem 112. System disk 114 provides non-volatile storage for applications and data and may include fixed or removable hard disk drives, flash memory devices, and CD-ROM (compact disc read-only-memory), DVD-ROM (digital versatile disc-ROM), Blu-ray, HD-DVD (high definition DVD), or other magnetic, optical, or solid state storage devices.
A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120. Other components (not explicitly shown), including universal serial bus (USB) or other port connections, compact disc (CD) drives, digital versatile disc (DVD) drives, film recording devices, and the like, may also be connected to I/O bridge 107. The various communication paths shown in FIG. 1, including the specifically named communication paths 106 and 113, may be implemented using any suitable protocols, such as PCI Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s), and connections between different devices may use different protocols as is known in the art.
In one embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU). In another embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for general purpose processing, while preserving the underlying computational architecture, described in greater detail herein. In yet another embodiment, the parallel processing subsystem 112 may be integrated with one or more other system elements in a single subsystem, such as joining the memory bridge 105, CPU 102, and I/O bridge 107 to form a system on chip (SoC).
It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of CPUs 102, and the number of parallel processing subsystems 112, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, I/O bridge 107 and memory bridge 105 might be integrated into a single chip instead of existing as one or more discrete devices. Large embodiments may include two or more CPUs 102 and two or more parallel processing subsystems 112. The particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices might be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in card 120 are connected directly to I/O bridge 107. In still other embodiments, computer system 100 comprises a mobile device and network adapter 118 implements a digital wireless communications subsystem. In such embodiments, input devices 108 comprise a touch tablet input subsystem and display device 110 implements a mobile screen subsystem, such as a liquid crystal display module.
CPU 102 comprises at least two processor cores 140(0), 140(N). A first processor core 140(0) is designed for low power operation, while a second processor core 140(N) is designed for high performance operation. In one embodiment, a symmetric number of low power and high performance processor cores are implemented within CPU 102. An operating system kernel 150 residing in system memory 104 includes a scheduler 152 and device drivers 154, 156. Kernel 150 is configured to provide certain conventional kernel services, including services related to process and thread management. Scheduler 152 is configured to manage thread and process allocation to different processor cores 140 within CPU 102. Device driver 154 is configured to manage which processor cores 140 are enabled for use and which are disabled, such as via powering down. Device driver 156 is configured to manage parallel processing subsystem 112, including processing and buffering command and input data streams to be processed.
Heterogeneous Multiprocessor
FIG. 2 is a block diagram of CPU 102 of computer system 100 of FIG. 1, according to one embodiment of the present invention. As shown, CPU 102 includes at least two cores 140(0), 140(N), a core interconnect 220, a cache 222, a memory interface 224, an interrupt distributor 226, and a cluster control unit 230.
Each core 140 may operate within a corresponding voltage-frequency (VF) domain, distinct from other VF domains. For example, circuitry associated with core 140(0) may operate on a first voltage and first operating frequency associated with VF domain 210(0), while circuits associated with core 140(N) may operate on a second voltage and a second frequency associated with VF domain 210(N). In this example, each voltage and each frequency may be varied independently within technically feasible ranges to achieve certain power and performance goals.
In this example, core 140(0) is designed for low power operation, while core 140(N) is designed for high performance operation; the two cores preserve mutual instruction set architecture (ISA) compatibility. Core 140(N) may achieve higher performance via any applicable technique, such as circuit design directed to high clock speeds, logic design directed to simultaneously issuing and processing multiple concurrent instructions, and architectural design directed to improved cache size and performance. Design trade-offs associated with core 140(N) may tolerate increased marginal power consumption to achieve greater marginal execution performance. Core 140(0) may achieve lower power operation via circuit design directed to reducing leakage current, crossbar current, and parasitic loss, and via logic design directed to reducing switching energy associated with processing an instruction. Design trade-offs associated with core 140(0) should generally favor reducing power consumption, even at the expense of clock speed and processing performance.
Each core 140 includes a programmable virtual identifier (ID) 212, which identifies the processor core. Each core 140 may be programmed with an arbitrary core identifier via virtual ID 212, which may be associated with a particular thread or process maintained by scheduler 152. Each core 140 may include logic to facilitate replicating internal execution state to another core 140.
In one embodiment, core interconnect 220 couples cores 140 to a cache 222, which is further coupled to a memory interface 224. Core interconnect 220 may be configured to facilitate state replication between cores 140. Interrupt distributor 226 is configured to receive an interrupt signal and transmit the interrupt signal to an appropriate core 140, identified by a value programmed within virtual ID 212. For example, an interrupt that is targeted for core zero will be directed to whichever core 140 has a virtual ID 212 programmed to zero.
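The virtual-ID-based interrupt routing described above can be illustrated with a short sketch. This is illustrative Python only, not part of the disclosure; the class names, fields, and methods are all invented for the example:

```python
# Sketch of interrupt distributor 226 routing by virtual ID 212.
# All names here are hypothetical, chosen only for illustration.
class Core:
    def __init__(self, physical_index, virtual_id):
        self.physical_index = physical_index
        self.virtual_id = virtual_id      # programmable virtual ID 212
        self.pending = []                 # interrupts delivered to this core

class InterruptDistributor:
    def __init__(self, cores):
        self.cores = cores

    def deliver(self, target_virtual_id, interrupt):
        # Route to whichever core currently holds the matching virtual ID,
        # regardless of its physical index.
        for core in self.cores:
            if core.virtual_id == target_virtual_id:
                core.pending.append(interrupt)
                return core.physical_index
        raise ValueError("no core with virtual ID %d" % target_virtual_id)

# Physical core 1 has been programmed with virtual ID zero, so an
# interrupt targeted at "core zero" lands on physical core 1.
cores = [Core(0, virtual_id=1), Core(1, virtual_id=0)]
dist = InterruptDistributor(cores)
assert dist.deliver(0, "timer-irq") == 1
```

The point of the indirection is that a workload (and its interrupts) can migrate between physical cores without the interrupt source needing to know which physical core is currently executing it.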
Cluster control unit 230 manages availability state for each core 140, which may be individually hot plugged in to become available or hot plugged out to no longer be available. Prior to hot plugging a specified core out, cluster control unit 230 may cause execution state for the core to be replicated to another core for continued execution. For example, if execution should transition from a low power core to a high performance core, then execution state for the low power core may be replicated to the high performance core before the high performance core begins executing. Execution state is implementation specific and may include, without limitation, register data, translation buffer data, and cache state.
In one embodiment, cluster control unit 230 is configured to power off one or more voltage supplies to a core that has been hot plugged out and to power on one or more voltage supplies to a core that has been hot plugged in. For example, cluster control unit 230 may power off a voltage supply associated with VF domain 210(0) to hot plug out core 140(0). Cluster control unit 230 may also implement frequency control circuitry for each core 140. Cluster control unit 230 receives commands from a cluster switch software module residing within device driver 154. The cluster switch manages transitions between core configurations. For example, the cluster switch is able to direct each core to save context, including a virtual ID 212, and to load a saved context, including an arbitrary virtual ID 212. The cluster switch may include hardware support for saving and loading context via cluster control unit 230. Cluster control unit 230 may provide automatic detection of workload changes and indicate to the cluster switch that a new workload requires a new configuration. The cluster switch then directs cluster control unit 230 to transition a workload from one core 140 to another core 140, or to enable additional cores via hot plugging in the additional cores.
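The transition sequence described above — save the outgoing core's context, power the incoming core on, replicate the context, then power the outgoing core off — can be sketched as follows. The names and the contents of the saved context are illustrative assumptions, not details from the disclosure:

```python
# Hypothetical sketch of a cluster-switch transition between cores.
class Core:
    def __init__(self, name):
        self.name = name
        self.powered = False
        self.context = None   # registers, translation buffers, virtual ID, ...

    def save_context(self):
        return dict(self.context)

    def load_context(self, ctx):
        self.context = dict(ctx)

class ClusterControl:
    def hot_plug_in(self, core):
        core.powered = True    # power on the core's voltage supplies

    def hot_plug_out(self, core):
        core.powered = False   # power off the core's voltage supplies

def switch(cluster, outgoing, incoming):
    # Replicate execution state before the incoming core begins executing.
    ctx = outgoing.save_context()
    cluster.hot_plug_in(incoming)
    incoming.load_context(ctx)
    cluster.hot_plug_out(outgoing)

lp = Core("low-power"); lp.powered = True
lp.context = {"virtual_id": 0, "pc": 0x1000}
hp = Core("high-performance")
switch(ClusterControl(), lp, hp)
assert hp.powered and not lp.powered
assert hp.context["virtual_id"] == 0   # interrupts still target "core zero"
```

Because the virtual ID travels with the context, interrupt delivery follows the workload to its new physical core automatically.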
FIG. 3 illustrates different operating regions of a CPU comprising multiple cores, according to one embodiment of the present invention. The CPU, such as CPU 102 of FIG. 1, includes at least a low power core 140(0) and a high performance core 140(N). As shown, a power curve 320 for low power core 140(0) is plotted as a function of throughput 310. Similarly, a power curve 322 is plotted for high performance core 140(N), and a power curve 324 is plotted for a dual core configuration. Throughput 310 is defined here as instructions executed per second, while power 312 is defined in units of power, such as watts (or a fraction thereof), needed to sustain a corresponding throughput 310.
A core clock frequency may be varied to achieve continuously different levels of throughput along the throughput 310 axis. As shown, low power core 140(0) has a maximum throughput that is lower than a maximum throughput for high performance core 140(N). In one implementation scenario, high performance core 140(N) is able to operate at a higher clock frequency than low power core 140(0). In a dual core mode associated with power curve 324, low power core 140(0) may be driven with one clock frequency that is in an associated upper operating range, while high performance core 140(N) may be driven with a different clock frequency that is in an associated medium operating range. In one configuration, each core 140(0), 140(N) in dual core mode is driven with an identical clock frequency within the range of both cores. In a different configuration, each core 140(0), 140(N) in dual core mode is driven with a different clock frequency within an associated range of each core. In one embodiment, each clock frequency may be selected to achieve similar forward execution progress for each core. In certain embodiments, cores 140 are configured to operate from a common voltage supply and may operate at independent clock frequencies.
Within a low power core region 330, low power core 140(0) is able to satisfy throughput requirements using the least power of the three core configurations (low power, high performance, dual core). Within a high performance core region 332, high performance core 140(N) is able to satisfy throughput requirements using the least power of the three core configurations, while extending throughput 310 beyond a maximum throughput 314 for low power core 140(0). Within a dual core region 334, operating both low power core 140(0) and high performance core 140(N) simultaneously may achieve a throughput that is higher than a maximum throughput 316 for high performance core 140(N), thereby extending overall throughput, but at the expense of additional power consumption.
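The three regions can be illustrated with invented power curves. The quadratic shapes, constants, and throughput limits below are placeholders for real, device-specific characterization data; only the region structure mirrors the description above:

```python
# Toy power curves (power in watts as a function of throughput).
# All constants are invented for illustration.
def power_low(t):   return 0.5 + 0.002 * t * t     # curve 320, max t = 30
def power_high(t):  return 2.0 + 0.001 * t * t     # curve 322, max t = 60
def power_dual(t):  return 2.5 + 0.0008 * t * t    # curve 324, max t = 80

CONFIGS = [("low-power", power_low, 30),
           ("high-performance", power_high, 60),
           ("dual-core", power_dual, 80)]

def best_config(throughput):
    # Among configurations that can satisfy the required throughput,
    # pick the one drawing the least power. The throughput intervals
    # where each configuration wins correspond to regions 330/332/334.
    feasible = [(fn(throughput), name) for name, fn, t_max in CONFIGS
                if throughput <= t_max]
    return min(feasible)[1]

assert best_config(10) == "low-power"          # region 330
assert best_config(40) == "high-performance"   # region 332
assert best_config(70) == "dual-core"          # region 334
```

With real characterization data the same selection rule applies; only the curve shapes and the region boundaries (throughputs 314 and 316) change.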
Given the three operating regions 330, 332, 334, and one low power core 140(0) and one high-performance core 140(N), six direct state transitions are supported between different core configurations. A first state transition is between region 330 and region 332; a second state transition is between region 332 and region 330; a third state transition is between region 330 and region 334; a fourth state transition is between region 334 and region 330; a fifth state transition is between region 332 and region 334; and a sixth state transition is between region 334 and region 332. Persons skilled in the art will recognize that additional cores may add additional operating regions and additional potential state transitions between core configurations without departing the scope and spirit of the present invention.
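The transition count generalizes: n operating regions admit n × (n − 1) ordered direct transitions. A quick check against the six transitions enumerated above:

```python
# Enumerate ordered pairs of distinct operating regions; each pair is
# one direct state transition between core configurations.
from itertools import permutations

regions = ["low-power (330)", "high-performance (332)", "dual-core (334)"]
transitions = list(permutations(regions, 2))
assert len(transitions) == 6   # matches the six transitions listed above
```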
In one embodiment, cores 140 within CPU 102 are characterized in terms of power consumption and throughput as a function of voltage and frequency. A resulting characterization comprises a family of power curves and different operating regions having different power requirements. The different operating regions may be determined statically for a given CPU 102 design. The different operating regions may be stored in tables within device driver 154, which is then able to configure CPU 102 to hot plug in and hot plug out different cores 140 based on prevailing workload requirements. In one embodiment, device driver 154 reacts to current workload requirements and reconfigures different cores 140 within CPU 102 to best satisfy the requirements. In another embodiment, scheduler 152 is configured to schedule workloads according to available cores 140. Scheduler 152 may direct device driver 154 to hot plug in or hot plug out different cores based on present and future knowledge of workload requirements.
FIG. 4 is a flow diagram of method steps for configuring a multi-core CPU to operate within a power-efficient region, according to one embodiment of the present invention. Although the method steps are described in conjunction with the systems of FIGS. 1-2, persons of ordinary skill in the art will understand that any system configured to perform the method steps, in any order, is within the scope of the invention. In one embodiment, the method steps are performed by CPU 102 of FIG. 1.
As shown, a method 400 begins in step 410, where cluster control unit 230 of FIG. 2 initializes core configuration for CPU 102. In one embodiment, cluster control unit 230 initializes core configuration for CPU 102 to reflect availability of low power core 140(0) of FIG. 1. In this configuration, core 140(0) executes an operating system boot chronology, including loading and initiating execution of kernel 150.
In step 412, device driver 154 receives workload information. The workload information may include, without limitation, CPU load statistics, latency statistics, and the like. The workload information may be received from cluster control unit 230 within CPU 102 or from conventional kernel task and thread services. If, in step 420, there is a change in workload reflected by the workload information, then the method proceeds to step 422; otherwise, the method proceeds back to step 412. In step 422, the device driver determines a matching core configuration to support the new workload information. The driver may use statically pre-computed workload tables that map power curve information to efficient core configurations that support a required workload reflected in the workload information.
If, in step 430, the matching core configuration represents a change to the current core configuration, then the method proceeds to step 432; otherwise, the method proceeds back to step 412. In step 432, the device driver causes CPU 102 to transition to the matching core configuration. The transition process may involve hot plugging one or more cores in and may also involve hot plugging one or more cores out, as a function of differences between the current core configuration and the matching core configuration.
If, in step 440, the method should terminate, then the method proceeds to step 490; otherwise, the method proceeds back to step 412. The method may terminate upon receiving a termination signal, such as during an overall shutdown event.
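The loop of steps 410 through 490 can be sketched compactly. The workload samples and the configuration table below are invented placeholders for the driver's statically pre-computed workload tables:

```python
# Hypothetical sketch of method 400's control loop (steps 410-490).
def method_400(samples, match_config, initial="low-power"):
    current = initial                      # step 410: initialize configuration
    last_info = None
    history = [current]
    for info in samples:                   # step 412: receive workload info
        if info == last_info:              # step 420: no workload change
            continue
        last_info = info
        matching = match_config(info)      # step 422: matching configuration
        if matching != current:            # step 430: configuration change?
            current = matching             # step 432: transition (hot plug)
            history.append(current)
    return history                         # steps 440/490: loop ends with input

# Invented workload-to-configuration table, standing in for the
# statically pre-computed tables held by device driver 154.
table = {"idle": "low-power", "browse": "low-power",
         "video": "high-performance", "game": "dual-core"}
hist = method_400(["idle", "idle", "video", "video", "game", "idle"],
                  table.__getitem__)
assert hist == ["low-power", "high-performance", "dual-core", "low-power"]
```

Note that steps 420 and 430 act as two separate filters: a changed workload does not force a reconfiguration unless the matching configuration actually differs from the current one.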
In sum, a technique is disclosed for managing processor cores within a multi-core CPU. The technique involves hot plugging core resources in and hot plugging core resources out as needed. Each core includes a virtual ID to allow the core execution context to be abstracted away from a particular physical core circuit. As system workload increases, core configurations may be changed to support the increases. Similarly, as system workload decreases, core configurations may be changed to reduce power consumption while supporting the reduced workload.
One advantage of the disclosed technique is that it improves the power efficiency of a multi-core central processing unit over a wide range of workloads, while efficiently utilizing processing resources.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. For example, aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software. One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored.
The invention has been described above with reference to specific embodiments. Persons of ordinary skill in the art, however, will understand that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Therefore, the scope of the present invention is determined by the claims that follow.

Claims (23)

What is claimed is:
1. A method for configuring two or more cores within a processing unit for executing different workloads, the method comprising:
receiving information related to a new workload;
determining, based on the information, that the new workload is different than a current workload;
retrieving characterization data associated with power consumption characterizations for each core included in the two or more cores;
determining how many of the two or more cores should be configured to execute the new workload based on the information and the characterization data;
determining whether a new core configuration is needed based on how many of the two or more cores should be configured to execute the new workload;
if a new core configuration is needed, then transitioning the processing unit to the new core configuration, or
if a new core configuration is not needed, then maintaining a current core configuration for executing the new workload;
receiving a first interrupt associated with a first logical core identifier and related to the new workload; and
transmitting the first interrupt to a first core included in the two or more cores that is executing the new workload and is associated with a programmable identifier matching the first logical core identifier.
2. The method of claim 1, wherein only a low-power core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only a high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the low-power core, and turning on the high-performance core to execute the new workload.
3. The method of claim 1, wherein only a high-performance core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only a low-power core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the high-performance core, and turning on the low-power core to execute the new workload.
4. The method of claim 1, wherein only a low-power core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that both the low-power core and a high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning on the high-performance core to execute the new workload.
5. The method of claim 1, wherein only a high-performance core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that both a low-power core and the high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning on the low-power core to execute the new workload.
6. The method of claim 1, wherein both a low-power core and a high-performance core execute work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only the high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the low-power core to execute the new workload.
7. The method of claim 1, wherein both a low-power core and a high-performance core execute work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only the low-power core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the high-performance core to execute the new workload.
8. The method of claim 1, wherein the processing unit comprises a central processing unit or a graphics processing unit.
9. The method of claim 1, wherein each core included in the two or more cores is identifiable via a programmable identifier, and two or more programmable identifiers are used in transitioning the processing unit to the new core configuration.
10. The method of claim 1, wherein determining how many of the two or more cores should be configured to execute the new workload comprises determining a subset of the two or more cores that is capable of satisfying throughput requirements of the new workload with less power consumption relative to all other potential subsets of the two or more cores based on the characterization data.
11. The method of claim 1, wherein transitioning the processing unit to the new core configuration includes powering on or powering off at least one of the two or more cores based on information associated with a future workload.
12. A non-transitory computer-readable storage medium including instructions that, when executed by a processing unit, cause the processing unit to configure two or more cores within the processing unit for executing different workloads, the method comprising:
receiving information related to a new workload;
determining, based on the information, that the new workload is different than a current workload;
retrieving characterization data associated with power consumption characterizations for each core included in the two or more cores;
determining how many of the two or more cores should be configured to execute the new workload based on the information and the characterization data;
determining whether a new core configuration is needed based on how many of the two or more cores should be configured to execute the new workload;
if a new core configuration is needed, then transitioning the processing unit to the new core configuration, or
if a new core configuration is not needed, then maintaining a current core configuration for executing the new workload;
receiving a first interrupt associated with a first logical core identifier and related to the new workload; and
transmitting the first interrupt to a first core included in the two or more cores that is executing the new workload and is associated with a programmable identifier matching the first logical core identifier.
13. The computer-readable storage medium of claim 12, wherein only a low-power core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only a high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the low-power core, and turning on the high-performance core to execute the new workload.
14. The computer-readable storage medium of claim 12, wherein only a high-performance core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only a low-power core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the high-performance core, and turning on the low-power core to execute the new workload.
15. The computer-readable storage medium of claim 12, wherein only a low-power core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that both the low-power core and a high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning on the high-performance core to execute the new workload.
16. The computer-readable storage medium of claim 12, wherein only a high-performance core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that both a low-power core and the high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning on the low-power core to execute the new workload.
17. The computer-readable storage medium of claim 12, wherein both a low-power core and a high-performance core execute work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only the high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the low-power core to execute the new workload.
18. The computer-readable storage medium of claim 12, wherein both a low-power core and a high-performance core execute work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only the low-power core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the high-performance core to execute the new workload.
19. The computer-readable storage medium of claim 12, wherein the processing unit comprises a central processing unit or a graphics processing unit.
20. The computer-readable storage medium of claim 12, wherein each core included in the two or more cores is identifiable via a programmable identifier, and one or more programmable identifiers are used in transitioning the processing unit to the new core configuration.
21. A computing device, comprising:
a memory including instructions; and
a central processing unit that is coupled to the memory and includes at least one low-power core and at least one high-performance core, the central processing unit programmed via the instructions to configure two or more cores for executing different workloads by:
receiving information related to a new workload;
determining, based on the information, that the new workload is different than a current workload;
retrieving characterization data associated with power consumption characterizations for each core included in the two or more cores;
determining how many of the two or more cores should be configured to execute the new workload based on the information and the characterization data;
determining whether a new core configuration is needed based on how many of the two or more cores should be configured to execute the new workload;
if a new core configuration is needed, then transitioning the processing unit to the new core configuration, or
if a new core configuration is not needed, then maintaining a current core configuration for executing the new workload;
receiving a first interrupt associated with a first logical core identifier and related to the new workload; and
transmitting the first interrupt to a first core included in the two or more cores that is executing the new workload and is associated with a programmable identifier matching the first logical core identifier.
22. The computing device of claim 21, wherein each core included in the two or more cores is identifiable via a programmable identifier, and one or more programmable identifiers are used in transitioning the processing unit to the new core configuration.
23. The computing device of claim 21, wherein transitioning the processing unit to the new core configuration includes powering on or powering off at least one of the two or more cores based on information associated with a future workload.
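Claims 21-23 recite a workload-driven reconfiguration loop: compare an incoming workload against the current one, consult per-core power-consumption characterization data to decide how many cores should run, transition cores on or off only when the required configuration changes, and route interrupts by matching each core's programmable identifier against the interrupt's logical core identifier. A minimal sketch of that loop follows; all class names, fields, and the greedy core-selection policy are illustrative assumptions, not the claimed implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Core:
    kind: str                          # "LP" (low-power) or "HP" (high-performance)
    capacity: float                    # abstract throughput units
    power_mw: float                    # characterized active power draw
    logical_id: Optional[int] = None   # programmable identifier
    powered: bool = False

class CoreManager:
    def __init__(self, cores: List[Core]):
        self.cores = cores
        self.workload: Optional[str] = None

    def submit(self, name: str, demand: float) -> str:
        """Receive information about a new workload; reconfigure if needed."""
        if name == self.workload:
            return "kept"              # same workload: keep current configuration
        self.workload = name
        # Consult the power-consumption characterization data and pick
        # cores greedily by efficiency until the demand is covered.
        ranked = sorted(self.cores, key=lambda c: c.power_mw / c.capacity)
        chosen, total = [], 0.0
        for core in ranked:
            if total >= demand:
                break
            chosen.append(core)
            total += core.capacity
        already = [c for c in self.cores if c.powered]
        if {id(c) for c in already} == {id(c) for c in chosen}:
            return "kept"              # no new core configuration needed
        # Transition: power exactly the chosen cores and (re)assign
        # programmable logical identifiers to the active ones.
        for core in self.cores:
            core.powered = any(core is c for c in chosen)
            core.logical_id = None
        for i, core in enumerate(chosen):
            core.logical_id = i
        return "reconfigured"

    def route_interrupt(self, logical_id: int) -> Core:
        """Deliver an interrupt to the active core whose programmable
        identifier matches the interrupt's logical core identifier."""
        for core in self.cores:
            if core.powered and core.logical_id == logical_id:
                return core
        raise LookupError(f"no active core with logical id {logical_id}")
```

Resubmitting the same workload exercises the "maintaining a current core configuration" branch of claim 21, while a demand change that alters the chosen core set triggers the transition path of claims 17 and 18.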
US13/723,995 2012-07-31 2012-12-21 Heterogeneous multiprocessor design for power-efficient and area-efficient computing Active 2033-10-26 US9569279B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
DE102013108041.3A DE102013108041B4 (en) 2012-07-31 2013-07-26 Heterogeneous multiprocessor arrangement for power-efficient and area-efficient computing
TW102127477A TWI502333B (en) 2012-07-31 2013-07-31 Heterogeneous multiprocessor design for power-efficient and area-efficient computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US201261678026P 2012-07-31 2012-07-31

Publications (2)

Publication Number Publication Date
US20140181501A1 US20140181501A1 (en) 2014-06-26
US9569279B2 true US9569279B2 (en) 2017-02-14

Family

ID=50976117

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/723,995 Active 2033-10-26 US9569279B2 (en) 2012-07-31 2012-12-21 Heterogeneous multiprocessor design for power-efficient and area-efficient computing

Country Status (1)

Country Link
US (1) US9569279B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965279B2 (en) * 2013-11-29 2018-05-08 The Regents Of The University Of Michigan Recording performance metrics to predict future execution of large instruction sequences on either high or low performance execution circuitry

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130117168A1 (en) 2011-11-04 2013-05-09 Mark Henrik Sandstrom Maximizing Throughput of Multi-user Parallel Data Processing Systems
US8789065B2 (en) 2012-06-08 2014-07-22 Throughputer, Inc. System and method for input data load adaptive parallel processing
US8745626B1 (en) * 2012-12-17 2014-06-03 Throughputer, Inc. Scheduling application instances to configurable processing cores based on application requirements and resource specification
US9448847B2 (en) 2011-07-15 2016-09-20 Throughputer, Inc. Concurrent program execution optimization
JPWO2015015756A1 (en) * 2013-08-02 2017-03-02 NEC Corporation Power saving control system, control device, control method and control program for non-volatile memory mounted server
KR20160054850A (en) * 2014-11-07 2016-05-17 삼성전자주식회사 Apparatus and method for operating processors
US9898071B2 (en) * 2014-11-20 2018-02-20 Apple Inc. Processor including multiple dissimilar processor cores
US9958932B2 (en) * 2014-11-20 2018-05-01 Apple Inc. Processor including multiple dissimilar processor cores that implement different portions of instruction set architecture
US20160378551A1 (en) * 2015-06-24 2016-12-29 Intel Corporation Adaptive hardware acceleration based on runtime power efficiency determinations
US9891926B2 (en) 2015-09-30 2018-02-13 International Business Machines Corporation Heterogeneous core microarchitecture
US10310858B2 (en) * 2016-03-08 2019-06-04 The Regents Of The University Of Michigan Controlling transition between using first and second processing circuitry
US10355975B2 (en) 2016-10-19 2019-07-16 Rex Computing, Inc. Latency guaranteed network on chip
US10700968B2 (en) * 2016-10-19 2020-06-30 Rex Computing, Inc. Optimized function assignment in a multi-core processor
CN108334405A (en) * 2017-01-20 2018-07-27 阿里巴巴集团控股有限公司 Frequency isomery CPU, frequency isomery implementation method, device and method for scheduling task
US10540300B2 (en) 2017-02-16 2020-01-21 Qualcomm Incorporated Optimizing network driver performance and power consumption in multi-core processor-based systems
CN113792847B (en) 2017-02-23 2024-03-08 大脑系统公司 Accelerated deep learning apparatus, method and system
US10459517B2 (en) * 2017-03-31 2019-10-29 Qualcomm Incorporated System and methods for scheduling software tasks based on central processing unit power characteristics
WO2018193353A1 (en) 2017-04-17 2018-10-25 Cerebras Systems Inc. Neuron smearing for accelerated deep learning
US10762418B2 (en) 2017-04-17 2020-09-01 Cerebras Systems Inc. Control wavelet for accelerated deep learning
US11488004B2 (en) 2017-04-17 2022-11-01 Cerebras Systems Inc. Neuron smearing for accelerated deep learning
US11010330B2 (en) * 2018-03-07 2021-05-18 Microsoft Technology Licensing, Llc Integrated circuit operation adjustment using redundant elements
CN108717362B (en) * 2018-05-21 2022-05-03 北京晨宇泰安科技有限公司 Network equipment configuration system and method based on inheritable structure
EP3572909A1 (en) * 2018-05-25 2019-11-27 Nokia Solutions and Networks Oy Method and apparatus of reducing energy consumption in a network
WO2020044152A1 (en) 2018-08-28 2020-03-05 Cerebras Systems Inc. Scaled compute fabric for accelerated deep learning
WO2020044238A1 (en) 2018-08-29 2020-03-05 Cerebras Systems Inc. Processor element redundancy for accelerated deep learning
WO2020044208A1 (en) 2018-08-29 2020-03-05 Cerebras Systems Inc. Isa enhancements for accelerated deep learning

Patent Citations (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6314515B1 (en) 1989-11-03 2001-11-06 Compaq Computer Corporation Resetting multiple processors in a computer system
US6732280B1 (en) 1999-07-26 2004-05-04 Hewlett-Packard Development Company, L.P. Computer system performing machine specific tasks before going to a low power state
US6501999B1 (en) * 1999-12-22 2002-12-31 Intel Corporation Multi-processor mobile computer system having one processor integrated with a chipset
US20030101362A1 (en) 2001-11-26 2003-05-29 Xia Dai Method and apparatus for enabling a self suspend mode for a processor
US20050050373A1 (en) 2001-12-06 2005-03-03 Doron Orenstien Distribution of processing activity in a multiple core microprocessor
US6804632B2 (en) 2001-12-06 2004-10-12 Intel Corporation Distribution of processing activity across processing hardware based on power consumption considerations
US20030120910A1 (en) 2001-12-26 2003-06-26 Schmisseur Mark A. System and method of remotely initializing a local processor
US7210139B2 (en) * 2002-02-19 2007-04-24 Hobson Richard F Processor cluster architecture and associated parallel processing methods
US8166324B2 (en) 2002-04-29 2012-04-24 Apple Inc. Conserving power by reducing voltage supplied to an instruction-processing portion of a processor
US6981083B2 (en) * 2002-12-05 2005-12-27 International Business Machines Corporation Processor virtualization mechanism via an enhanced restoration of hard architected states
US7587716B2 (en) * 2003-02-21 2009-09-08 Sharp Kabushiki Kaisha Asymmetrical multiprocessor system, image processing apparatus and image forming apparatus using same, and unit job processing method using asymmetrical multiprocessor
US20040215987A1 (en) 2003-04-25 2004-10-28 Keith Farkas Dynamically selecting processor cores for overall power efficiency
US7093147B2 (en) * 2003-04-25 2006-08-15 Hewlett-Packard Development Company, L.P. Dynamically selecting processor cores for overall power efficiency
US20040215926A1 (en) 2003-04-28 2004-10-28 International Business Machines Corp. Data processing system having novel interconnect for supporting both technical and commercial workloads
US20050013705A1 (en) 2003-07-16 2005-01-20 Keith Farkas Heterogeneous processor core systems for improved throughput
US7421602B2 (en) 2004-02-13 2008-09-02 Marvell World Trade Ltd. Computer with low-power secondary processor and secondary display
US20070083785A1 (en) 2004-06-10 2007-04-12 Sehat Sutardja System with high power and low power processors and thread transfer
US7730335B2 (en) 2004-06-10 2010-06-01 Marvell World Trade Ltd. Low power computer with main and auxiliary processors
US7788514B2 (en) 2004-06-10 2010-08-31 Marvell World Trade Ltd. Low power computer with main and auxiliary processors
WO2006037119A2 (en) 2004-09-28 2006-04-06 Intel Corporation Method and apparatus for varying energy per instruction according to the amount of available parallelism
US20060095807A1 (en) 2004-09-28 2006-05-04 Intel Corporation Method and apparatus for varying energy per instruction according to the amount of available parallelism
TWI340900B (en) 2004-09-30 2011-04-21 Ibm System and method for virtualization of processor resources
US7383423B1 (en) 2004-10-01 2008-06-03 Advanced Micro Devices, Inc. Shared resources in a chip multiprocessor
TWI311729B (en) 2005-08-08 2009-07-01 Via Tech Inc Global spreader and method for a parallel graphics processor
US7412353B2 (en) 2005-09-28 2008-08-12 Intel Corporation Reliable computing with a many-core processor
US20070074011A1 (en) 2005-09-28 2007-03-29 Shekhar Borkar Reliable computing with a many-core processor
US20070136617A1 (en) 2005-11-30 2007-06-14 Renesas Technology Corp. Semiconductor integrated circuit
US7434002B1 (en) 2006-04-24 2008-10-07 Vmware, Inc. Utilizing cache information to manage memory access and cache utilization
US20080263324A1 (en) 2006-08-10 2008-10-23 Sehat Sutardja Dynamic core switching
US20080307244A1 (en) 2007-06-11 2008-12-11 Media Tek, Inc. Method of and Apparatus for Reducing Power Consumption within an Integrated Circuit
US8180997B2 (en) * 2007-07-05 2012-05-15 Board Of Regents, University Of Texas System Dynamically composing processor cores to form logical processors
US20090055826A1 (en) 2007-08-21 2009-02-26 Kerry Bernstein Multicore Processor Having Storage for Core-Specific Operational Data
US8284205B2 (en) 2007-10-24 2012-10-09 Apple Inc. Methods and apparatuses for load balancing between multiple processing units
US20090172423A1 (en) 2007-12-31 2009-07-02 Justin Song Method, system, and apparatus for rerouting interrupts in a multi-core processor
US20090222654A1 (en) 2008-02-29 2009-09-03 Herbert Hum Distribution of tasks among asymmetric processing elements
US20090235260A1 (en) 2008-03-11 2009-09-17 Alexander Branover Enhanced Control of CPU Parking and Thread Rescheduling for Maximizing the Benefits of Low-Power State
US20090259863A1 (en) 2008-04-10 2009-10-15 Nvidia Corporation Responding to interrupts while in a reduced power state
US20090292934A1 (en) 2008-05-22 2009-11-26 Ati Technologies Ulc Integrated circuit with secondary-memory controller for providing a sleep state for reduced power consumption and method therefor
US20090300396A1 (en) 2008-05-30 2009-12-03 Kabushiki Kaisha Toshiba Information processing apparatus
US20100146513A1 (en) 2008-12-09 2010-06-10 Intel Corporation Software-based Thread Remapping for power Savings
US20100153954A1 (en) 2008-12-11 2010-06-17 Qualcomm Incorporated Apparatus and Methods for Adaptive Thread Scheduling on Asymmetric Multiprocessor
US20100162014A1 (en) 2008-12-24 2010-06-24 Mazhar Memon Low power polling techniques
US8140876B2 (en) * 2009-01-16 2012-03-20 International Business Machines Corporation Reducing power consumption of components based on criticality of running tasks independent of scheduling priority in multitask computer
EP2254048A1 (en) 2009-04-21 2010-11-24 LStar Technologies LLC Thread mapping in multi-core processors
US20110022833A1 (en) 2009-07-24 2011-01-27 Sebastien Nussbaum Altering performance of computational units heterogeneously according to performance sensitivity
US20110314314A1 (en) 2010-06-18 2011-12-22 Samsung Electronics Co., Ltd. Power gating of cores by an soc
US20130124890A1 (en) 2010-07-27 2013-05-16 Michael Priel Multi-core processor and method of power management of a multi-core processor
US20120102344A1 (en) 2010-10-21 2012-04-26 Andrej Kocev Function based dynamic power control
US20130238912A1 (en) * 2010-11-25 2013-09-12 Michael Priel Method and apparatus for managing power in a multi-core processor
US20120151225A1 (en) * 2010-12-09 2012-06-14 Lilly Huang Apparatus, method, and system for improved power delivery performance with a dynamic voltage pulse scheme
US20120159496A1 (en) * 2010-12-20 2012-06-21 Saurabh Dighe Performing Variation-Aware Profiling And Dynamic Core Allocation For A Many-Core Processor
US20120266179A1 (en) * 2011-04-14 2012-10-18 Osborn Michael J Dynamic mapping of logical cores
US20120185709A1 (en) * 2011-12-15 2012-07-19 Eliezer Weissmann Method, apparatus, and system for energy efficiency and energy conservation including thread consolidation
US20130346771A1 (en) * 2012-06-20 2013-12-26 Douglas D. Boom Controlling An Asymmetrical Processor

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
International Search Report for Application No. GB1108715.2, dated Sep. 23, 2011.
International Search Report for Application No. GB1108716.0, dated Sep. 28, 2011.
International Search Report for Application No. GB1108717.8, dated Sep. 30, 2011.
Kumar et al. (Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction); MICRO 36: Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture; 12 pages.
Non-Final Office Action for U.S. Appl. No. 12/787,359, dated Aug. 30, 2012.
Non-Final Office Action for U.S. Appl. No. 12/787,361, dated Sep. 13, 2012.
Non-Final Office Action for U.S. Appl. No. 13/360,559, dated Apr. 8, 2014.
Non-Final Office Action for U.S. Appl. No. 13/360,559, dated Oct. 18, 2013.
Non-Final Office Action for U.S. Appl. No. 13/604,390, dated Nov. 13, 2014.
Non-Final Office Action for U.S. Appl. No. 13/604,496, dated Sep. 10, 2015.
NVIDIA (Variable SMP - A Multi-Core CPU Architecture for Low Power and High Performance); Whitepaper; 2011, 16 pages.
Tanenbaum (Structured Computer Organization, Third Edition); Prentice-Hall, Inc., 1990; 5 pages.

Also Published As

Publication number Publication date
US20140181501A1 (en) 2014-06-26

Similar Documents

Publication Publication Date Title
US9569279B2 (en) Heterogeneous multiprocessor design for power-efficient and area-efficient computing
US20110213950A1 (en) System and Method for Power Optimization
US8924758B2 (en) Method for SOC performance and power optimization
US20120331319A1 (en) System and method for power optimization
US20120331275A1 (en) System and method for power optimization
TWI493332B (en) Method and apparatus with power management and a platform and computer readable storage medium thereof
TWI578154B (en) System, method and apparatus for power management
TW201137753A (en) Methods and apparatus to improve turbo performance for events handling
TWI553549B (en) Processor including multiple dissimilar processor cores
US20120102348A1 (en) Fine grained power management in virtualized mobile platforms
US20140025930A1 (en) Multi-core processor sharing L1 cache and method of operating same
US9501299B2 (en) Minimizing performance loss on workloads that exhibit frequent core wake-up activity
US10025370B2 (en) Overriding latency tolerance reporting values in components of computer systems
US8717371B1 (en) Transitioning between operational modes in a hybrid graphics system
TWI502333B (en) Heterogeneous multiprocessor design for power-efficient and area-efficient computing
US10168765B2 (en) Controlling processor consumption using on-off keying having a maximum off time
US8717372B1 (en) Transitioning between operational modes in a hybrid graphics system
CN107209544B (en) System and method for SoC idle power state control based on I/O operating characteristics
US20210089326A1 (en) Dynamic bios policy for hybrid graphics platforms
US20240028222A1 (en) Sleep mode using shared memory between two processors of an information handling system
US8199601B2 (en) System and method of selectively varying supply voltage without level shifting data signals
US20230090567A1 (en) Device and method for two-stage transitioning between reduced power states
CN112486870A (en) Computer system and computer system control method

Legal Events

Date Code Title Description
AS Assignment

Owner name: NVIDIA CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HICOK, GARY D.;LONGNECKER, MATTHEW RAYMOND;PATEL, RAHUL GAUTAM;REEL/FRAME:029576/0507

Effective date: 20121220

STCF Information on status: patent grant

Free format text: PATENTED CASE

CC Certificate of correction
MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4