US9569279B2 - Heterogeneous multiprocessor design for power-efficient and area-efficient computing - Google Patents
- Publication number: US9569279B2
- Authority: US (United States)
- Legal status: Active, expires
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5094—Allocation of resources, e.g. of the central processing unit [CPU] where the allocation takes into account power or heat criteria
- Y02B60/142
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the present invention.
- FIG. 2 is a block diagram of a central processing unit (CPU) of the computer system of FIG. 1, according to one embodiment of the present invention.
- FIG. 3 illustrates different operating regions of a CPU comprising multiple cores, according to one embodiment of the present invention.
- FIG. 4 is a flow diagram of method steps for configuring a CPU comprising multiple cores to operate within a power-efficient region, according to one embodiment of the present invention.
- FIG. 1 is a block diagram illustrating a computer system 100 configured to implement one or more aspects of the present invention.
- Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via an interconnection path that may include a memory bridge 105 .
- Memory bridge 105, which may be, e.g., a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O (input/output) bridge 107.
- I/O bridge 107, which may be, e.g., a Southbridge chip, receives user input from one or more user input device(s) 108 (e.g., keyboard, pointing device, capacitive touch tablet) and forwards the input to CPU 102 via communication path 106 and memory bridge 105.
- A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or second communication path 113 (e.g., a Peripheral Component Interconnect (PCI) Express, Accelerated Graphics Port, or HyperTransport link).
- In one embodiment, parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110, which may be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, or the like.
- A system disk 114 is also connected to I/O bridge 107 and may be configured to store content and applications and data for use by CPU 102 and parallel processing subsystem 112.
- System disk 114 provides non-volatile storage for applications and data and may include fixed or removable hard disk drives, flash memory devices, and CD-ROM (compact disc read-only-memory), DVD-ROM (digital versatile disc-ROM), Blu-ray, HD-DVD (high definition DVD), or other magnetic, optical, or solid state storage devices.
- A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120.
- Other components (not explicitly shown), including universal serial bus (USB) or other port connections, compact disc (CD) drives, digital versatile disc (DVD) drives, film recording devices, and the like, may also be connected to I/O bridge 107.
- The various communication paths shown in FIG. 1, including the specifically named communication paths 106 and 113, may be implemented using any suitable protocols, such as PCI Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s), and connections between different devices may use different protocols as is known in the art.
- In one embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU).
- In another embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for general purpose processing, while preserving the underlying computational architecture, described in greater detail herein.
- In yet another embodiment, the parallel processing subsystem 112 may be integrated with one or more other system elements in a single subsystem, such as joining the memory bridge 105, CPU 102, and I/O bridge 107 to form a system on chip (SoC).
- The connection topology, including the number and arrangement of bridges, the number of CPUs 102, and the number of parallel processing subsystems 112, may be modified as desired.
- For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102.
- In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105.
- In still other embodiments, I/O bridge 107 and memory bridge 105 might be integrated into a single chip instead of existing as one or more discrete devices.
- Large embodiments may include two or more CPUs 102 and two or more parallel processing subsystems 112.
- The particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices might be supported.
- In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in card 120 are connected directly to I/O bridge 107.
- In still other embodiments, computer system 100 comprises a mobile device and network adapter 118 implements a digital wireless communications subsystem.
- In such embodiments, input devices 108 comprise a touch tablet input subsystem and display device 110 implements a mobile screen subsystem, such as a liquid crystal display module.
- CPU 102 comprises at least two processor cores 140(0), 140(N).
- A first processor core 140(0) is designed for low power operation, while a second processor core 140(N) is designed for high performance operation.
- In one embodiment, a symmetric number of low power and high performance processor cores are implemented within CPU 102.
- An operating system kernel 150 residing in system memory 104 includes a scheduler 152 and device drivers 154 , 156 . Kernel 150 is configured to provide certain conventional kernel services, including services related to process and thread management.
- Scheduler 152 is configured to manage thread and process allocation to different processor cores 140 within CPU 102 .
- Device driver 154 is configured to manage which processor cores 140 are enabled for use and which are disabled, such as via powering down.
- Device driver 156 is configured to manage parallel processing subsystem 112 , including processing and buffering command and input data streams to be processed.
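The division of labor described above, where a device driver controls which cores are available and the scheduler places work only on available cores, can be sketched as a small simulation. The class names, the round-robin placement policy, and the core labels below are illustrative assumptions, not details taken from the patent.

```python
# Toy model of the driver/scheduler split: the "driver" tracks core
# availability (hot plug in/out), the "scheduler" assigns threads only
# to cores that are currently enabled.

class CoreDriver:
    """Tracks which cores are enabled (hot plugged in)."""
    def __init__(self, core_ids):
        self.enabled = {cid: False for cid in core_ids}

    def hot_plug_in(self, cid):
        self.enabled[cid] = True

    def hot_plug_out(self, cid):
        self.enabled[cid] = False

class Scheduler:
    """Assigns threads round-robin across currently enabled cores."""
    def __init__(self, driver):
        self.driver = driver

    def assign(self, threads):
        cores = [c for c, on in self.driver.enabled.items() if on]
        if not cores:
            raise RuntimeError("no cores enabled")
        return {t: cores[i % len(cores)] for i, t in enumerate(threads)}

driver = CoreDriver(["core0", "coreN"])
driver.hot_plug_in("core0")          # only the low power core is available
sched = Scheduler(driver)
print(sched.assign(["ui", "net"]))   # both threads land on core0
```

The point of the sketch is only the dependency direction: the scheduler consults availability state that the driver owns, mirroring how scheduler 152 may direct device driver 154 to change that state.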
- FIG. 2 is a block diagram of CPU 102 of computer system 100 of FIG. 1 , according to one embodiment of the present invention.
- CPU 102 includes at least two cores 140(0), 140(N), a core interconnect 220, a cache 222, a memory interface 224, an interrupt distributor 226, and a cluster control unit 230.
- Each core 140 may operate within a corresponding voltage-frequency (VF) domain, distinct from other VF domains.
- For example, circuitry associated with core 140(0) may operate on a first voltage and a first operating frequency associated with VF domain 210(0), while circuits associated with core 140(N) may operate on a second voltage and a second frequency associated with VF domain 210(N).
- In this example, each voltage and each frequency may be varied independently within technically feasible ranges to achieve certain power and performance goals.
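The power consequence of giving each core its own VF domain can be illustrated with the standard CMOS dynamic-power approximation P ≈ C·V²·f. The effective capacitance, voltage, and frequency values below are invented for illustration; they are not characterization data from the patent.

```python
# Dynamic switching power for two independent voltage-frequency domains,
# using the first-order approximation P_dyn ≈ C_eff * V^2 * f.

def dynamic_power(c_eff_farads, volts, hertz):
    """Approximate dynamic switching power in watts."""
    return c_eff_farads * volts ** 2 * hertz

# Hypothetical operating points for the two VF domains.
low_power_core = dynamic_power(0.5e-9, 0.8, 500e6)   # VF domain 210(0)
high_perf_core = dynamic_power(1.2e-9, 1.1, 2.0e9)   # VF domain 210(N)

print(round(low_power_core, 3))   # 0.16 W
print(round(high_perf_core, 3))   # 2.904 W
```

The quadratic dependence on voltage is why lowering voltage and frequency on the low power core's domain, independently of the high performance core's domain, pays off so strongly at low throughput.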
- Core 140(0) is designed for low power operation, while core 140(N) is designed for high performance operation; the two cores preserve mutual instruction set architecture (ISA) compatibility.
- Core 140(N) may achieve higher performance via any applicable technique, such as circuit design directed to high clock speeds, logic design directed to simultaneously issuing and processing multiple concurrent instructions, and architectural design directed to improved cache size and performance.
- Design trade-offs associated with core 140(N) may tolerate increased marginal power consumption to achieve greater marginal execution performance.
- Core 140(0) may achieve lower power operation via circuit design directed to reducing leakage current, crossbar current, and parasitic loss, as well as logic design directed to reducing switching energy associated with processing an instruction.
- Design trade-offs associated with core 140(0) should generally favor reducing power consumption, even at the expense of clock speed and processing performance.
- Each core 140 includes a programmable virtual identifier (ID) 212 , which identifies the processor core.
- Each core 140 may be programmed with an arbitrary core identifier via virtual ID 212, which may be associated with a particular thread or process maintained by scheduler 152.
- Each core 140 may include logic to facilitate replicating internal execution state to another core 140.
- Core interconnect 220 couples cores 140 to a cache 222, which is further coupled to a memory interface 224.
- Core interconnect 220 may be configured to facilitate state replication between cores 140 .
- Interrupt distributor 226 is configured to receive an interrupt signal and transmit the interrupt signal to an appropriate core 140 , identified by a value programmed within virtual ID 212 . For example, an interrupt that is targeted for core zero will be directed to whichever core 140 has a virtual ID 212 programmed to zero.
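The interrupt distributor's lookup can be sketched in a few lines: an interrupt targeted at logical core 0 is delivered to whichever physical core currently has virtual ID 0 programmed. The function and core names are illustrative assumptions, not hardware interfaces from the patent.

```python
# Sketch of interrupt routing by virtual ID 212: deliver to whichever
# physical core is currently programmed with the targeted logical ID.

def route_interrupt(target_virtual_id, virtual_ids):
    """virtual_ids maps physical core name -> programmed virtual ID."""
    for phys, vid in virtual_ids.items():
        if vid == target_virtual_id:
            return phys
    raise ValueError(f"no core programmed with virtual ID {target_virtual_id}")

# Initially the low power core holds virtual ID 0.
print(route_interrupt(0, {"low_power": 0, "high_perf": 1}))   # low_power
# After a cluster switch, the high performance core takes over virtual ID 0.
print(route_interrupt(0, {"low_power": 1, "high_perf": 0}))   # high_perf
```

Because routing keys on the programmable virtual ID rather than a fixed physical ID, software and interrupt sources need not know which physical core is executing a given context.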
- Cluster control unit 230 manages availability state for each core 140 , which may be individually hot plugged in to become available or hot plugged out to no longer be available. Prior to hot plugging a specified core out, cluster control unit 230 may cause execution state for the core to be replicated to another core for continued execution. For example, if execution should transition from a low power core to a high performance core, then execution state for the low power core may be replicated to the high performance core before the high performance core begins executing. Execution state is implementation specific and may include, without limitation, register data, translation buffer data, and cache state.
- In one embodiment, cluster control unit 230 is configured to power off one or more voltage supplies to a core that has been hot plugged out and to power on one or more voltage supplies to a core that has been hot plugged in.
- For example, cluster control unit 230 may power off a voltage supply associated with VF domain 210(0) to hot plug out core 140(0).
- Cluster control unit 230 may also implement frequency control circuitry for each core 140 .
- Cluster control unit 230 receives commands from a cluster switch software module residing within device driver 154 .
- The cluster switch manages transitions between core configurations. For example, the cluster switch is able to direct each core to save context, including a virtual ID 212, and to load a saved context, including an arbitrary virtual ID 212.
- The cluster switch may include hardware support for saving and loading context via cluster control unit 230.
- Cluster control unit 230 may provide automatic detection of workload changes and indicate to the cluster switch that a new workload requires a new configuration. The cluster switch then directs cluster control unit 230 to transition a workload from one core 140 to another core 140, or to enable additional cores via hot plugging in the additional cores.
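The transition sequence above (hot plug the destination in, replicate execution state, carry the virtual ID across, hot plug the source out) can be sketched as follows. The `Core` fields and the ordering of steps are an illustrative reading of the description, not the patent's implementation.

```python
# Illustrative cluster-switch transition: replicate execution state from
# the outgoing core to the incoming core, move the virtual ID with it,
# then hot plug the outgoing core out.

class Core:
    def __init__(self, name):
        self.name = name
        self.powered = False
        self.state = None        # register data, translation buffers, cache state
        self.virtual_id = None   # programmable virtual ID 212

def switch_cores(src, dst):
    dst.powered = True                  # hot plug destination core in
    dst.state = dict(src.state)         # replicate execution state
    dst.virtual_id = src.virtual_id     # interrupts now follow the virtual ID
    src.powered = False                 # hot plug source core out
    src.virtual_id = None
    return dst

low = Core("low_power")
low.powered = True
low.state = {"pc": 0x1000, "r0": 42}
low.virtual_id = 0
high = switch_cores(low, Core("high_perf"))
print(high.virtual_id, high.state["r0"], low.powered)   # 0 42 False
```

Replicating state before powering the source down is what lets execution continue on the new core without software-visible interruption.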
- FIG. 3 illustrates different operating regions of a CPU comprising multiple cores, according to one embodiment of the present invention.
- The CPU, such as CPU 102 of FIG. 1, includes at least a low power core 140(0) and a high performance core 140(N).
- A power curve 320 for low power core 140(0) is plotted as a function of throughput 310.
- A power curve 322 is plotted for high performance core 140(N).
- A power curve 324 is plotted for a dual core configuration.
- Throughput 310 is defined here as instructions executed per second, while power 312 is defined in units of power, such as watts (or a fraction thereof), needed to sustain a corresponding throughput 310.
- A core clock frequency may be varied to achieve continuously different levels of throughput along the throughput 310 axis.
- Low power core 140(0) has a maximum throughput that is lower than a maximum throughput for high performance core 140(N).
- High performance core 140(N) is able to operate at a higher clock frequency than low power core 140(0).
- For example, low power core 140(0) may be driven with one clock frequency that is in an associated upper operating range, while high performance core 140(N) may be driven with a different clock frequency that is in an associated medium operating range.
- In one embodiment, each core 140(0), 140(N) in dual core mode is driven with an identical clock frequency that is within range of both cores.
- In another embodiment, each core 140(0), 140(N) in dual core mode is driven with a different clock within an associated range of each core.
- In this case, each clock frequency may be selected to achieve similar forward execution progress for each core.
- In one embodiment, cores 140 are configured to operate from a common voltage supply and may operate from independent clock frequencies.
- Within operating region 330, low power core 140(0) is able to satisfy throughput requirements using the least power of the three core configurations (low power, high performance, dual core).
- Within operating region 332, high performance core 140(N) is able to satisfy throughput requirements using the least power of the three core configurations, while extending throughput 310 beyond a maximum throughput 314 for low power core 140(0).
- Within operating region 334, operating both low power core 140(0) and high performance core 140(N) simultaneously may achieve a throughput that is higher than a maximum throughput 316 for high performance core 140(N), thereby extending overall throughput, but at the expense of additional power consumption.
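The three operating regions can be sketched as a minimum-power selection over per-configuration power curves. The curve shapes, coefficients, and maximum-throughput values below are invented for illustration, and the mapping of regions 330, 332, and 334 to the low power, high performance, and dual core configurations follows the order in which the regions are described; none of these numbers come from the patent.

```python
# Pick the core configuration that meets a throughput target with the
# least power, given assumed (made-up) power curves for each configuration.

CONFIGS = {
    # name: (max throughput in arbitrary units, power(throughput) in watts)
    "low_power": (4.0,  lambda t: 0.05 + 0.04  * t ** 2),
    "high_perf": (8.0,  lambda t: 0.30 + 0.03  * t ** 2),
    "dual_core": (10.0, lambda t: 0.35 + 0.035 * t ** 2),
}

def best_config(throughput):
    feasible = [(p(throughput), name)
                for name, (t_max, p) in CONFIGS.items() if throughput <= t_max]
    if not feasible:
        raise ValueError("throughput beyond all configurations")
    return min(feasible)[1]   # lowest power among feasible configurations

print(best_config(2.0))   # low power core wins at low throughput (cf. region 330)
print(best_config(6.0))   # beyond throughput 314, high performance core (cf. region 332)
print(best_config(9.0))   # beyond throughput 316, dual core extends throughput (cf. region 334)
```

The state transitions between regions correspond to this selection changing as the throughput requirement crosses the curve intersections or a configuration's maximum throughput.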
- a first state transition is between region 330 and region 332 ; a second state transition is between region 332 and region 330 ; a third state transition is between region 330 and region 334 ; a fourth state transition is between region 334 and region 330 ; a fifth state transition is between region 332 and region 334 ; and a sixth state transition is between region 334 and region 332 .
- Additional cores may add additional operating regions and additional potential state transitions between core configurations without departing from the scope and spirit of the present invention.
- In one embodiment, cores 140 within CPU 102 are characterized in terms of power consumption and throughput as a function of voltage and frequency.
- A resulting characterization comprises a family of power curves and different operating regions having different power requirements.
- The different operating regions may be determined statically for a given CPU 102 design.
- The different operating regions may be stored in tables within device driver 154, which is then able to configure CPU 102 to hot plug in and hot plug out different cores 140 based on prevailing workload requirements.
- In one embodiment, device driver 154 reacts to current workload requirements and reconfigures different cores 140 within CPU 102 to best satisfy those requirements.
- In another embodiment, scheduler 152 is configured to schedule workloads according to available cores 140.
- Scheduler 152 may direct device driver 154 to hot plug in or hot plug out different cores based on present and future knowledge of workload requirements.
- FIG. 4 is a flow diagram of method steps for configuring a multi-core CPU to operate within a power-efficient region, according to one embodiment of the present invention.
- Although the method steps are described in conjunction with the systems of FIGS. 1-2, persons of ordinary skill in the art will understand that any system configured to perform the method steps, in any order, is within the scope of the invention. In one embodiment, the method steps are performed by CPU 102 of FIG. 1.
- A method 400 begins in step 410, where cluster control unit 230 of FIG. 2 initializes the core configuration for CPU 102.
- In one embodiment, cluster control unit 230 initializes the core configuration for CPU 102 to reflect availability of low power core 140(0) of FIG. 1.
- Core 140(0) then executes an operating system boot chronology, including loading and initiating execution of kernel 150.
- In step 412, device driver 154 receives workload information.
- The workload information may include, without limitation, CPU load statistics, latency statistics, and the like.
- The workload information may be received from cluster control unit 230 within CPU 102 or from conventional kernel task and thread services. If, in step 420, there is a change in workload reflected by the workload information, then the method proceeds to step 422; otherwise, the method proceeds back to step 412.
- In step 422, the device driver determines a matching core configuration to support the new workload information. The driver may use statically pre-computed workload tables that map power curve information to efficient core configurations that support a required workload reflected in the workload information.
- If, in step 430, the matching core configuration represents a change to the current core configuration, then the method proceeds to step 432; otherwise, the method proceeds back to step 412.
- In step 432, the device driver causes CPU 102 to transition to the matching core configuration.
- The transition process may involve hot plugging one or more cores in and may also involve hot plugging one or more cores out, as a function of differences between the current core configuration and the matching core configuration.
- If, in step 440, the method should terminate, then the method proceeds to step 490; otherwise, the method proceeds back to step 412.
- For example, the method may need to terminate upon receiving a termination signal, such as during an overall shutdown event.
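The control flow of method 400 can be sketched as a simple event loop. The workload levels, the configuration table, and the step-to-line mapping are invented for illustration; the patent leaves the workload model and matching tables implementation specific.

```python
# Sketch of the method-400 loop (steps 410 through 490) over a stream of
# workload events, using an assumed workload-level -> configuration table.

def method_400(workload_events, table):
    """table maps a workload level to a core configuration name."""
    config = "low_power"             # step 410: initialize core configuration
    history = [config]
    last_level = None
    for level in workload_events:    # step 412: receive workload information
        if level == "terminate":     # step 440: termination signal received
            break                    # step 490: done
        if level == last_level:      # step 420: no change in workload
            continue
        last_level = level
        new_config = table[level]    # step 422: determine matching configuration
        if new_config != config:     # step 430: configuration change needed?
            config = new_config      # step 432: transition (hot plug in/out)
            history.append(config)
    return history

table = {"idle": "low_power", "active": "high_perf", "peak": "dual_core"}
print(method_400(["idle", "active", "active", "peak", "idle", "terminate"], table))
# ['low_power', 'high_perf', 'dual_core', 'low_power']
```

Note that repeated identical workload events fall through step 420 without touching the configuration, and a workload change that maps to the current configuration falls through step 430, matching the two "proceed back to step 412" paths in the flow diagram.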
- In sum, a technique is disclosed for managing processor cores within a multi-core CPU.
- The technique involves hot plugging core resources in and hot plugging core resources out as needed.
- Each core includes a virtual ID to allow the core execution context to be abstracted away from a particular physical core circuit.
- As a workload increases, core configurations may be changed to support the increased workload.
- As a workload decreases, core configurations may be changed to reduce power consumption while supporting the reduced workload.
- One advantage of the disclosed technique is that it improves power efficiency of a multi-core central processing unit over a wide workload range, while efficiently utilizing processing resources.
- Aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software.
- One embodiment of the invention may be implemented as a program product for use with a computer system.
- The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media.
- Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored.
Abstract
A technique for managing processor cores within a multi-core central processing unit (CPU) provides efficient power and resource utilization over a wide workload range. The CPU comprises at least one core designed for low power operation and at least one core designed for high performance operation. For low workloads, the low power core executes the workload. For certain higher workloads, the high performance core executes the workload. For certain other workloads, the low power core and the high performance core both share execution of the workload. This technique advantageously enables efficient processing over a wider range of workloads than conventional systems.
Description
This application claims benefit of the United States Provisional Patent Application having Ser. No. 61/678,026, filed on Jul. 31, 2012, which is hereby incorporated herein by reference.
Field of the Invention
The present invention generally relates to multiprocessor computer systems and, more specifically, to a heterogeneous multiprocessor design for power-efficient and area-efficient computing.
Description of the Related Art
Battery-powered mobile computing platforms have become increasingly important in recent years, intensifying the need for efficient, low power systems that deliver highly scalable computational capacity with diminishing cost. A typical mobile device may need to operate over a wide performance range, according to workload requirements. Different performance ranges are conventionally mapped to a different operating mode, with power consumption proportionally related to performance within a given operating mode. In a low-power sleep mode, the mobile device may provide a small amount of computational capacity, such as to maintain radio contact with a cellular tower. In an active mode, the mobile device may provide low-latency response to user input, for example via a window manager. Many operations associated with typical applications execute with satisfactory performance in an active mode. In a high-performance mode, the mobile device needs to provide peak computational capacity, such as to execute a real-time game or perform transient user-interface operations. Active mode and high-performance mode typically require progressively increasing power consumption.
A number of techniques have been developed to improve both performance and power efficiency for mobile devices. Such techniques include reducing device parasitic loads by reducing device size, reducing operating and threshold voltages, trading off performance for power-efficiency, and adding different circuit configurations tuned to operate well under certain operating modes.
In one example, a mobile device processor complex comprises a low-power, but low-performance processor and a high-performance, but high-power processor. In idle and low activity active modes, the low-power processor is more power efficient at lower performance levels and is therefore selected for execution, while in high-performance modes, the high-performance processor is more power efficient and is therefore selected for execution of larger workloads. In this scenario, the trade-off space includes a cost component since the mobile device carries a cost burden of two processors, where only one processor can be active at a time. While such a processor complex enables both low power operation and high-performance operation, the processor complex makes inefficient use of expensive resources.
As the foregoing illustrates, what is needed in the art is a more efficient technique for accommodating a wide range of different workloads.
One embodiment of the present invention sets forth a method for configuring one or more cores within a processing unit for executing different workloads, the method comprising receiving information related to a new workload, determining, based on the information, that the new workload is different than a current workload, determining how many of the one or more cores should be configured to execute the new workload based on the information, determining whether a new core configuration is needed based on how many of the one or more cores should be configured to execute the new workload, and if a new core configuration is needed, then transitioning the processing unit to the new core configuration, or if a new core configuration is not needed, then maintaining a current core configuration for executing the new workload.
Other embodiments of the present invention include, without limitation, a computer-readable storage medium including instructions that, when executed by a processing unit, cause the processing unit to perform the techniques described herein as well as a computing device that includes a processing unit configured to perform the techniques described herein.
One advantage of the disclosed technique is that it improves power efficiency of a multi-core central processing unit over a wide workload range, while efficiently utilizing processing resources.
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details.
A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in card 120. Other components (not explicitly shown), including universal serial bus (USB) or other port connections, compact disc (CD) drives, digital versatile disc (DVD) drives, film recording devices, and the like, may also be connected to I/O bridge 107. The various communication paths shown in FIG. 1 , including the specifically named communication paths 106 and 113 may be implemented using any suitable protocols, such as PCI Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s), and connections between different devices may use different protocols as is known in the art.
In one embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU). In another embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for general purpose processing, while preserving the underlying computational architecture, described in greater detail herein. In yet another embodiment, the parallel processing subsystem 112 may be integrated with one or more other system elements in a single subsystem, such as joining the memory bridge 105, CPU 102, and I/O bridge 107 to form a system on chip (SoC).
It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of CPUs 102, and the number of parallel processing subsystems 112, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, I/O bridge 107 and memory bridge 105 might be integrated into a single chip instead of existing as one or more discrete devices. Large embodiments may include two or more CPUs 102 and two or more parallel processing subsystems 112. The particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices might be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in card 120 are connected directly to I/O bridge 107. In still other embodiments, computer system 100 comprises a mobile device and network adapter 118 implements a digital wireless communications subsystem. In such embodiments, input devices 108 comprise a touch tablet input subsystem and display device 110 implements a mobile screen subsystem, such as a liquid crystal display module.
Each core 140 may operate within a corresponding voltage-frequency (VF) domain, distinct from other VF domains. For example, circuitry associated with core 140(0) may operate on a first voltage and first operating frequency associated with VF domain 210(0), while circuits associated with core 140(N) may operate on a second voltage and a second frequency associated with VF domain 210(N). In this example, each voltage and each frequency may be varied independently within technically feasible ranges to achieve certain power and performance goals.
In this example, core 140(0) is designed for low power operation, while core 140(N) is designed for high performance operation; the two cores preserve mutual instruction set architecture (ISA) compatibility. Core 140(N) may achieve higher performance via any applicable technique, such as circuit design directed to high clock speeds, logic design directed to simultaneously issuing and processing multiple concurrent instructions, and architectural design directed to improved cache size and performance. Design trade-offs associated with core 140(N) may tolerate increased marginal power consumption to achieve greater marginal execution performance. Core 140(0) may achieve lower power operation via circuit design directed to reducing leakage current, crossbar current, and parasitic loss, and via logic design directed to reducing switching energy associated with processing an instruction. Design trade-offs associated with core 140(0) should generally favor reducing power consumption, even at the expense of clock speed and processing performance.
Each core 140 includes a programmable virtual identifier (ID) 212, which identifies the processor core. Each core 140 may be programmed with an arbitrary core identifier via virtual ID 212, which may be associated with a particular thread or process maintained by scheduler 152. Each core 140 may include logic to facilitate replicating internal execution state to another core 140.
In one embodiment, core interconnect 220 couples cores 140 to a cache 222, which is further coupled to a memory interface 224. Core interconnect 220 may be configured to facilitate state replication between cores 140. Interrupt distributor 226 is configured to receive an interrupt signal and transmit the interrupt signal to an appropriate core 140, identified by a value programmed within virtual ID 212. For example, an interrupt that is targeted for core zero will be directed to whichever core 140 has a virtual ID 212 programmed to zero.
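The virtual-ID-based interrupt delivery described above can be illustrated with a short sketch. This is a hypothetical model (the names `Core` and `route_interrupt` are invented, not from the patent) showing why an interrupt targeted at logical core zero reaches whichever physical core currently holds virtual ID 0:

```python
# Hypothetical sketch of interrupt distributor 226: deliver an interrupt
# by programmable virtual ID 212 rather than by a core's physical index.

class Core:
    def __init__(self, physical_index, virtual_id):
        self.physical_index = physical_index
        self.virtual_id = virtual_id      # models programmable virtual ID 212
        self.pending = []                 # interrupts delivered to this core

    def raise_interrupt(self, irq):
        self.pending.append(irq)

def route_interrupt(cores, target_virtual_id, irq):
    """Deliver irq to whichever core is currently programmed with the
    target virtual ID; return that core's physical index."""
    for core in cores:
        if core.virtual_id == target_virtual_id:
            core.raise_interrupt(irq)
            return core.physical_index
    raise ValueError("no core programmed with virtual ID %r" % target_virtual_id)

# Physical core 1 currently holds virtual ID 0, so an interrupt targeted
# at logical core 0 lands on physical core 1.
cores = [Core(0, virtual_id=1), Core(1, virtual_id=0)]
assert route_interrupt(cores, 0, "timer") == 1
```

Because delivery keys on the programmable ID, an execution context can migrate between physical cores without the interrupt source needing to know which physical core now owns it.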
In one embodiment, cluster control unit 230 is configured to power off one or more voltage supplies to a core that has been hot plugged out and to power on one or more voltage supplies to a core that has been hot plugged in. For example, cluster control unit 230 may power off a voltage supply associated with VF domain 210(0) to hot plug out core 140(0). Cluster control unit 230 may also implement frequency control circuitry for each core 140. Cluster control unit 230 receives commands from a cluster switch software module residing within device driver 154. The cluster switch manages transitions between core configurations. For example, the cluster switch is able to direct each core to save context, including a virtual ID 212, and to load a saved context, including an arbitrary virtual ID 212. The cluster switch may include hardware support for saving and loading context via cluster control unit 230. Cluster control unit 230 may provide automatic detection of workload changes and indicate to the cluster switch that a new workload requires a new configuration. The cluster switch then directs cluster control unit 230 to transition a workload from one core 140 to another core 140, or to enable additional cores by hot plugging them in.
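The save-context / power-off / power-on / load-context sequence the cluster switch performs can be sketched as follows. This is an illustrative model under invented names (`ClusterControl`, `transition`), not the patent's implementation:

```python
# Hypothetical sketch of a cluster-switch transition: the workload's saved
# context, including its virtual ID, moves from the source core to the
# destination core, and the source core is hot plugged out.

class ClusterControl:
    def __init__(self, n_cores):
        self.powered = [False] * n_cores
        self.context = [None] * n_cores   # per-core (virtual_id, register state)

    def power_on(self, core):
        self.powered[core] = True         # hot plug in: restore voltage supplies

    def power_off(self, core):
        self.powered[core] = False        # hot plug out: cut voltage supplies

    def transition(self, src, dst):
        """Move a workload's execution state from core src to core dst."""
        ctx = self.context[src]           # save context, incl. virtual ID 212
        self.power_on(dst)                # hot plug in the destination core
        self.context[dst] = ctx           # destination now answers as that ID
        self.context[src] = None
        self.power_off(src)               # hot plug out the source core

cc = ClusterControl(2)
cc.power_on(0)
cc.context[0] = ("virtual-id-0", {"pc": 0x1000})
cc.transition(0, 1)
assert cc.powered == [False, True]
assert cc.context[1][0] == "virtual-id-0"
```

The key design point carried over from the text: the virtual ID travels with the context, so software and interrupts continue to see "core 0" regardless of which physical core executes it.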
A core clock frequency may be varied to achieve continuously different levels of throughput along the throughput 310 axis. As shown, low power core 140(0) has a maximum throughput that is lower than a maximum throughput for high performance core 140(N). In one implementation scenario, high performance core 140(N) is able to operate at a higher clock frequency than low power core 140(0). In a dual core mode associated with power curve 324, low power core 140(0) may be driven with one clock frequency that is in an associated upper operating range, while high performance core 140(N) may be driven with a different clock frequency that is in an associated medium operating range. In one configuration, each core 140(0), 140(N) in dual core mode is driven with an identical clock frequency within range of both cores. In a different configuration, each core 140(0), 140(N) in dual core mode is driven with a different clock within an associated range of each core. In one embodiment, each clock frequency may be selected to achieve similar forward execution progress for each core. In certain embodiments, cores 140 are configured to operate from a common voltage supply and may operate from independent clock frequencies.
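The last point above, selecting per-core clocks so both cores make similar forward progress, can be expressed as a small calculation. This is a hedged sketch: the function name and the instructions-per-cycle (IPC) figures are invented for illustration:

```python
# Hypothetical dual-core clock selection: pick the low-power core's clock so
# that ipc_low * f_low is close to ipc_high * f_high (similar forward
# progress), clamped to the low-power core's maximum frequency.

def balanced_clocks(ipc_low, ipc_high, f_high, f_low_max):
    """Return (f_low, f_high) giving roughly equal instruction throughput
    per core, given each core's assumed IPC."""
    f_low = min(ipc_high * f_high / ipc_low, f_low_max)
    return f_low, f_high

# With an assumed 2x IPC advantage for the high-performance core at 0.8 GHz,
# the low-power core would need 1.6 GHz to keep pace.
f_low, f_high = balanced_clocks(ipc_low=1.0, ipc_high=2.0,
                                f_high=0.8e9, f_low_max=2.0e9)
assert f_low == 1.6e9
```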
Within a low power core region 330, low power core 140(0) is able to satisfy throughput requirements using the least power of the three core configurations (low power, high performance, dual core). Within a high performance core region 332, high performance core 140(N) is able to satisfy throughput requirements using the least power of the three core configurations, while extending throughput 310 beyond a maximum throughput 314 for low power core 140(0). Within a dual core region 334, operating both low power core 140(0) and high performance core 140(N) simultaneously may achieve a throughput that is higher than a maximum throughput 316 for high performance core 140(N), thereby extending overall throughput, but at the expense of additional power consumption.
Given the three operating regions 330, 332, 334, and one low power core 140(0) and one high-performance core 140(N), six direct state transitions are supported between different core configurations. A first state transition is between region 330 and region 332; a second state transition is between region 332 and region 330; a third state transition is between region 330 and region 334; a fourth state transition is between region 334 and region 330; a fifth state transition is between region 332 and region 334; and a sixth state transition is between region 334 and region 332. Persons skilled in the art will recognize that additional cores may add additional operating regions and additional potential state transitions between core configurations without departing from the scope and spirit of the present invention.
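The six transitions enumerated above are simply the ordered pairs of distinct operating regions, which generalizes directly to more regions:

```python
# The direct transitions among operating regions are the ordered pairs of
# distinct regions: N regions yield N * (N - 1) transitions, so the three
# regions above yield the six transitions listed in the text.
from itertools import permutations

regions = ["low_power", "high_performance", "dual_core"]   # regions 330, 332, 334
transitions = list(permutations(regions, 2))
assert len(transitions) == 6
```

With a third core adding a fourth region, for example, the count would grow to 4 × 3 = 12 direct transitions.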
In one embodiment, cores 140 within CPU 102 are characterized in terms of power consumption and throughput as a function of voltage and frequency. A resulting characterization comprises a family of power curves and different operating regions having different power requirements. The different operating regions may be determined statically for a given CPU 102 design. The different operating regions may be stored in tables within device driver 154, which is then able to configure CPU 102 to hot plug in and hot plug out different cores 140 based on prevailing workload requirements. In one embodiment, device driver 154 reacts to current workload requirements and reconfigures different cores 140 within CPU 102 to best satisfy the requirements. In another embodiment, scheduler 152 is configured to schedule workloads according to available cores 140. Scheduler 152 may direct device driver 154 to hot plug in or hot plug out different cores based on present and future knowledge of workload requirements.
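A statically determined characterization table of the kind the device driver might consult can be sketched as below. The throughput thresholds and names are invented for illustration and are not taken from any actual characterization:

```python
# Hypothetical static characterization table: each entry maps a normalized
# throughput range to the core configuration that serves it with the least
# power (regions 330, 332, 334 in the text). Thresholds are invented.

CONFIG_TABLE = [
    (0.0, 0.4, "low_power"),          # region 330: low-power core alone
    (0.4, 0.8, "high_performance"),   # region 332: high-performance core alone
    (0.8, 1.0, "dual_core"),          # region 334: both cores hot plugged in
]

def match_configuration(required_throughput):
    """Return the lowest-power configuration whose operating region covers
    the normalized throughput requirement."""
    for low, high, config in CONFIG_TABLE:
        if low <= required_throughput <= high:
            return config
    raise ValueError("throughput exceeds maximum dual-core capability")

assert match_configuration(0.3) == "low_power"
assert match_configuration(0.9) == "dual_core"
```

Because the regions are fixed at design time, the lookup is a cheap table scan the driver can perform on every workload-change notification.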
As shown, a method 400 begins in step 410, where cluster control unit 230 of FIG. 2 initializes core configuration for CPU 102. In one embodiment, cluster control unit 230 initializes core configuration for CPU 102 to reflect availability of low power core 140(0) of FIG. 1 . In this configuration, core 140(0) executes an operating system boot chronology, including loading and initiating execution of kernel 150.
In step 412, device driver 154 receives workload information. The workload information may include, without limitation, CPU load statistics, latency statistics, and the like. The workload information may be received from cluster control unit 230 within CPU 102 or from conventional kernel task and thread services. If, in step 420, there is a change in workload reflected by the workload information, then the method proceeds to step 422, otherwise, the method proceeds back to step 412. In step 422, the device driver determines a matching core configuration to support the new workload information. The driver may use statically pre-computed workload tables that map power curve information to efficient core configurations that support a required workload reflected in the workload information.
If, in step 430, the matching core configuration represents a change to the current core configuration, then the method proceeds to step 432; otherwise, the method proceeds back to step 412. In step 432, the device driver causes CPU 102 to transition to the matching core configuration. The transition process may involve hot plugging one or more cores in and may also involve hot plugging one or more cores out, as a function of differences between a current core configuration and the matching core configuration.
If, in step 440, the method should terminate, then the method proceeds to step 490, otherwise the method proceeds back to step 412. The method may need to terminate upon receiving a termination signal, such as during an overall shutdown event.
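The control loop of steps 410 through 490 can be restated compactly. This is an illustrative sketch: the workload feed and the configuration matcher below are stand-ins for cluster control unit 230 and the driver's tables, not part of the patent:

```python
# Hypothetical restatement of method 400: poll workload info, look up the
# matching configuration, and transition only when the configuration changes.

def run_method_400(workload_feed, match_configuration, initial="low_power"):
    """Process workload samples; return the sequence of configurations used."""
    current = initial                          # step 410: initialize configuration
    history = [current]
    last_info = None
    for info in workload_feed:                 # step 412: receive workload info
        if info == last_info:                  # step 420: no change -> keep polling
            continue
        last_info = info
        matching = match_configuration(info)   # step 422: table lookup
        if matching != current:                # step 430: configuration change?
            current = matching                 # step 432: transition (hot plug)
            history.append(current)
    return history                             # steps 440/490: feed exhausted

configs = run_method_400(
    workload_feed=[0.2, 0.2, 0.6, 0.9, 0.3],
    match_configuration=lambda load: ("low_power" if load < 0.4 else
                                      "high_performance" if load < 0.8 else
                                      "dual_core"))
assert configs == ["low_power", "high_performance", "dual_core", "low_power"]
```

Note how the two guards mirror steps 420 and 430: unchanged workload information short-circuits the lookup, and an unchanged matching configuration avoids an unnecessary hot-plug transition.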
In sum, a technique is disclosed for managing processor cores within a multi-core CPU. The technique involves hot plugging core resources in and hot plugging core resources out as needed. Each core includes a virtual ID to allow the core execution context to be abstracted away from a particular physical core circuit. As system workload increases, core configurations may be changed to support the increases. Similarly, as system workload decreases, core configurations may be changed to reduce power consumption while supporting the reduced workload.
One advantage of the disclosed technique is that it improves power efficiency of a multi-core central processing unit over a wide workload range, while efficiently utilizing processing resources.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. For example, aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software. One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored.
The invention has been described above with reference to specific embodiments. Persons of ordinary skill in the art, however, will understand that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Therefore, the scope of the present invention is determined by the claims that follow.
Claims (23)
1. A method for configuring two or more cores within a processing unit for executing different workloads, the method comprising:
receiving information related to a new workload;
determining, based on the information, that the new workload is different than a current workload;
retrieving characterization data associated with power consumption characterizations for each core included in the two or more cores;
determining how many of the two or more cores should be configured to execute the new workload based on the information and the characterization data;
determining whether a new core configuration is needed based on how many of the two or more cores should be configured to execute the new workload;
if a new core configuration is needed, then transitioning the processing unit to the new core configuration, or
if a new core configuration is not needed, then maintaining a current core configuration for executing the new workload;
receiving a first interrupt associated with a first logical core identifier and related to the new workload; and
transmitting the first interrupt to a first core included in the two or more cores that is executing the new workload and is associated with a programmable identifier matching the first logical core identifier.
2. The method of claim 1 , wherein only a low-power core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only a high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the low-power core, and turning on the high-performance core to execute the new workload.
3. The method of claim 1 , wherein only a high-performance core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only a low-power core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the high-performance core, and turning on the low-power core to execute the new workload.
4. The method of claim 1 , wherein only a low-power core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that both the low-power core and a high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning on the high-performance core to execute the new workload.
5. The method of claim 1 , wherein only a high-performance core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that both a low-power core and the high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning on the low-power core to execute the new workload.
6. The method of claim 1 , wherein both a low-power core and a high-performance core execute work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only the high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the low-power core to execute the new workload.
7. The method of claim 1 , wherein both a low-power core and a high-performance core execute work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only the low-power core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the high-performance core to execute the new workload.
8. The method of claim 1 , wherein the processing unit comprises a central processing unit or a graphics processing unit.
9. The method of claim 1 , wherein each core included in the two or more cores is identifiable via a programmable identifier, and two or more programmable identifiers are used in transitioning the processing unit to the new core configuration.
10. The method of claim 1 , wherein determining how many of the two or more cores should be configured to execute the new workload comprises determining a subset of the two or more cores that is capable of satisfying throughput requirements of the new workload with less power consumption relative to all other potential subsets of the two or more cores based on the characterization data.
11. The method of claim 1, wherein transitioning the processing unit to the new core configuration includes powering on or powering off at least one of the two or more cores based on information associated with a future workload.
12. A non-transitory computer-readable storage medium including instructions that, when executed by a processing unit, cause the processing unit to configure two or more cores within the processing unit for executing different workloads, the method comprising:
receiving information related to a new workload;
determining, based on the information, that the new workload is different than a current workload;
retrieving characterization data associated with power consumption characterizations for each core included in the two or more cores;
determining how many of the two or more cores should be configured to execute the new workload based on the information and the characterization data;
determining whether a new core configuration is needed based on how many of the two or more cores should be configured to execute the new workload;
if a new core configuration is needed, then transitioning the processing unit to the new core configuration, or
if a new core configuration is not needed, then maintaining a current core configuration for executing the new workload;
receiving a first interrupt associated with a first logical core identifier and related to the new workload; and
transmitting the first interrupt to a first core included in the two or more cores that is executing the new workload and is associated with a programmable identifier matching the first logical core identifier.
13. The computer-readable storage medium of claim 12 , wherein only a low-power core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only a high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the low-power core, and turning on the high-performance core to execute the new workload.
14. The computer-readable storage medium of claim 12 , wherein only a high-performance core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only a low-power core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the high-performance core, and turning on the low-power core to execute the new workload.
15. The computer-readable storage medium of claim 12 , wherein only a low-power core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that both the low-power core and a high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning on the high-performance core to execute the new workload.
16. The computer-readable storage medium of claim 12 , wherein only a high-performance core executes work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that both a low-power core and the high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning on the low-power core to execute the new workload.
17. The computer-readable storage medium of claim 12 , wherein both a low-power core and a high-performance core execute work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only the high-performance core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the low-power core to execute the new workload.
18. The computer-readable storage medium of claim 12 , wherein both a low-power core and a high-performance core execute work in the current core configuration, and determining how many of the two or more cores should be configured comprises determining that only the low-power core should be configured to execute the new workload, and further comprising determining that a new core configuration is needed, and transitioning the processing unit by turning off the high-performance core to execute the new workload.
19. The computer-readable storage medium of claim 12 , wherein the processing unit comprises a central processing unit or a graphics processing unit.
20. The computer-readable storage medium of claim 12 , wherein each core included in the two or more cores is identifiable via a programmable identifier, and one or more programmable identifiers are used in transitioning the processing unit to the new core configuration.
21. A computing device, comprising:
a memory including instructions; and
a central processing unit that is coupled to the memory and includes at least one low-power core and at least one high-performance core, the central processing unit programmed via the instructions to configure two or more cores for executing different workloads by:
receiving information related to a new workload;
determining, based on the information, that the new workload is different than a current workload;
retrieving characterization data associated with power consumption characterizations for each core included in the two or more cores;
determining how many of the two or more cores should be configured to execute the new workload based on the information and the characterization data;
determining whether a new core configuration is needed based on how many of the two or more cores should be configured to execute the new workload;
if a new core configuration is needed, then transitioning the processing unit to the new core configuration, or
if a new core configuration is not needed, then maintaining a current core configuration for executing the new workload;
receiving a first interrupt associated with a first logical core identifier and related to the new workload; and
transmitting the first interrupt to a first core included in the two or more cores that is executing the new workload and is associated with a programmable identifier matching the first logical core identifier.
22. The computing device of claim 21 , wherein each core included in the two or more cores is identifiable via a programmable identifier, and one or more programmable identifiers are used in transitioning the processing unit to the new core configuration.
23. The computing device of claim 21, wherein transitioning the processing unit to the new core configuration includes powering on or powering off at least one of the two or more cores based on information associated with a future workload.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE102013108041.3A DE102013108041B4 (en) | 2012-07-31 | 2013-07-26 | Heterogeneous multiprocessor arrangement for power-efficient and area-efficient computing |
TW102127477A TWI502333B (en) | 2012-07-31 | 2013-07-31 | Heterogeneous multiprocessor design for power-efficient and area-efficient computing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201261678026P | 2012-07-31 | 2012-07-31 |
Publications (2)
Publication Number | Publication Date |
---|---|
US20140181501A1 (en) | 2014-06-26 |
US9569279B2 (en) | 2017-02-14 |
Family
ID=50976117
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/723,995 Active 2033-10-26 US9569279B2 (en) | 2012-07-31 | 2012-12-21 | Heterogeneous multiprocessor design for power-efficient and area-efficient computing |
Country Status (1)
Country | Link |
---|---|
US (1) | US9569279B2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9965279B2 (en) * | 2013-11-29 | 2018-05-08 | The Regents Of The University Of Michigan | Recording performance metrics to predict future execution of large instruction sequences on either high or low performance execution circuitry |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130117168A1 (en) | 2011-11-04 | 2013-05-09 | Mark Henrik Sandstrom | Maximizing Throughput of Multi-user Parallel Data Processing Systems |
US8789065B2 (en) | 2012-06-08 | 2014-07-22 | Throughputer, Inc. | System and method for input data load adaptive parallel processing |
US8745626B1 (en) * | 2012-12-17 | 2014-06-03 | Throughputer, Inc. | Scheduling application instances to configurable processing cores based on application requirements and resource specification |
US9448847B2 (en) | 2011-07-15 | 2016-09-20 | Throughputer, Inc. | Concurrent program execution optimization |
JPWO2015015756A1 (en) * | 2013-08-02 | 2017-03-02 | 日本電気株式会社 | Power saving control system, control device, control method and control program for non-volatile memory mounted server |
KR20160054850A (en) * | 2014-11-07 | 2016-05-17 | 삼성전자주식회사 | Apparatus and method for operating processors |
US9898071B2 (en) * | 2014-11-20 | 2018-02-20 | Apple Inc. | Processor including multiple dissimilar processor cores |
US9958932B2 (en) * | 2014-11-20 | 2018-05-01 | Apple Inc. | Processor including multiple dissimilar processor cores that implement different portions of instruction set architecture |
US20160378551A1 (en) * | 2015-06-24 | 2016-12-29 | Intel Corporation | Adaptive hardware acceleration based on runtime power efficiency determinations |
US9891926B2 (en) | 2015-09-30 | 2018-02-13 | International Business Machines Corporation | Heterogeneous core microarchitecture |
US10310858B2 (en) * | 2016-03-08 | 2019-06-04 | The Regents Of The University Of Michigan | Controlling transition between using first and second processing circuitry |
US10355975B2 (en) | 2016-10-19 | 2019-07-16 | Rex Computing, Inc. | Latency guaranteed network on chip |
US10700968B2 (en) * | 2016-10-19 | 2020-06-30 | Rex Computing, Inc. | Optimized function assignment in a multi-core processor |
CN108334405A (en) * | 2017-01-20 | 2018-07-27 | 阿里巴巴集团控股有限公司 | Frequency isomery CPU, frequency isomery implementation method, device and method for scheduling task |
US10540300B2 (en) | 2017-02-16 | 2020-01-21 | Qualcomm Incorporated | Optimizing network driver performance and power consumption in multi-core processor-based systems |
CN113792847B (en) | 2017-02-23 | 2024-03-08 | 大脑系统公司 | Accelerated deep learning apparatus, method and system |
US10459517B2 (en) * | 2017-03-31 | 2019-10-29 | Qualcomm Incorporated | System and methods for scheduling software tasks based on central processing unit power characteristics |
WO2018193353A1 (en) | 2017-04-17 | 2018-10-25 | Cerebras Systems Inc. | Neuron smearing for accelerated deep learning |
US10762418B2 (en) | 2017-04-17 | 2020-09-01 | Cerebras Systems Inc. | Control wavelet for accelerated deep learning |
US11488004B2 (en) | 2017-04-17 | 2022-11-01 | Cerebras Systems Inc. | Neuron smearing for accelerated deep learning |
US11010330B2 (en) * | 2018-03-07 | 2021-05-18 | Microsoft Technology Licensing, Llc | Integrated circuit operation adjustment using redundant elements |
CN108717362B (en) * | 2018-05-21 | 2022-05-03 | 北京晨宇泰安科技有限公司 | Network equipment configuration system and method based on inheritable structure |
EP3572909A1 (en) * | 2018-05-25 | 2019-11-27 | Nokia Solutions and Networks Oy | Method and apparatus of reducing energy consumption in a network |
WO2020044152A1 (en) | 2018-08-28 | 2020-03-05 | Cerebras Systems Inc. | Scaled compute fabric for accelerated deep learning |
WO2020044238A1 (en) | 2018-08-29 | 2020-03-05 | Cerebras Systems Inc. | Processor element redundancy for accelerated deep learning |
WO2020044208A1 (en) | 2018-08-29 | 2020-03-05 | Cerebras Systems Inc. | Isa enhancements for accelerated deep learning |
- 2012
  - 2012-12-21 US US13/723,995 patent/US9569279B2/en active Active
Patent Citations (54)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6314515B1 (en) | 1989-11-03 | 2001-11-06 | Compaq Computer Corporation | Resetting multiple processors in a computer system |
US6732280B1 (en) | 1999-07-26 | 2004-05-04 | Hewlett-Packard Development Company, L.P. | Computer system performing machine specific tasks before going to a low power state |
US6501999B1 (en) * | 1999-12-22 | 2002-12-31 | Intel Corporation | Multi-processor mobile computer system having one processor integrated with a chipset |
US20030101362 (en) | 2001-11-26 | 2003-05-29 | Xia Dai | Method and apparatus for enabling a self suspend mode for a processor |
US20050050373A1 (en) | 2001-12-06 | 2005-03-03 | Doron Orenstien | Distribution of processing activity in a multiple core microprocessor |
US6804632B2 (en) | 2001-12-06 | 2004-10-12 | Intel Corporation | Distribution of processing activity across processing hardware based on power consumption considerations |
US20030120910A1 (en) | 2001-12-26 | 2003-06-26 | Schmisseur Mark A. | System and method of remotely initializing a local processor |
US7210139B2 (en) * | 2002-02-19 | 2007-04-24 | Hobson Richard F | Processor cluster architecture and associated parallel processing methods |
US8166324B2 (en) | 2002-04-29 | 2012-04-24 | Apple Inc. | Conserving power by reducing voltage supplied to an instruction-processing portion of a processor |
US6981083B2 (en) * | 2002-12-05 | 2005-12-27 | International Business Machines Corporation | Processor virtualization mechanism via an enhanced restoration of hard architected states |
US7587716B2 (en) * | 2003-02-21 | 2009-09-08 | Sharp Kabushiki Kaisha | Asymmetrical multiprocessor system, image processing apparatus and image forming apparatus using same, and unit job processing method using asymmetrical multiprocessor |
US20040215987A1 (en) | 2003-04-25 | 2004-10-28 | Keith Farkas | Dynamically selecting processor cores for overall power efficiency |
US7093147B2 (en) * | 2003-04-25 | 2006-08-15 | Hewlett-Packard Development Company, L.P. | Dynamically selecting processor cores for overall power efficiency |
US20040215926A1 (en) | 2003-04-28 | 2004-10-28 | International Business Machines Corp. | Data processing system having novel interconnect for supporting both technical and commercial workloads |
US20050013705A1 (en) | 2003-07-16 | 2005-01-20 | Keith Farkas | Heterogeneous processor core systems for improved throughput |
US7421602B2 (en) | 2004-02-13 | 2008-09-02 | Marvell World Trade Ltd. | Computer with low-power secondary processor and secondary display |
US20070083785A1 (en) | 2004-06-10 | 2007-04-12 | Sehat Sutardja | System with high power and low power processors and thread transfer |
US7730335B2 (en) | 2004-06-10 | 2010-06-01 | Marvell World Trade Ltd. | Low power computer with main and auxiliary processors |
US7788514B2 (en) | 2004-06-10 | 2010-08-31 | Marvell World Trade Ltd. | Low power computer with main and auxiliary processors |
WO2006037119A2 (en) | 2004-09-28 | 2006-04-06 | Intel Corporation | Method and apparatus for varying energy per instruction according to the amount of available parallelism |
US20060095807A1 (en) | 2004-09-28 | 2006-05-04 | Intel Corporation | Method and apparatus for varying energy per instruction according to the amount of available parallelism |
TWI340900B (en) | 2004-09-30 | 2011-04-21 | Ibm | System and method for virtualization of processor resources |
US7383423B1 (en) | 2004-10-01 | 2008-06-03 | Advanced Micro Devices, Inc. | Shared resources in a chip multiprocessor |
TWI311729B (en) | 2005-08-08 | 2009-07-01 | Via Tech Inc | Global spreader and method for a parallel graphics processor |
US7412353B2 (en) | 2005-09-28 | 2008-08-12 | Intel Corporation | Reliable computing with a many-core processor |
US20070074011A1 (en) | 2005-09-28 | 2007-03-29 | Shekhar Borkar | Reliable computing with a many-core processor |
US20070136617A1 (en) | 2005-11-30 | 2007-06-14 | Renesas Technology Corp. | Semiconductor integrated circuit |
US7434002B1 (en) | 2006-04-24 | 2008-10-07 | Vmware, Inc. | Utilizing cache information to manage memory access and cache utilization |
US20080263324A1 (en) | 2006-08-10 | 2008-10-23 | Sehat Sutardja | Dynamic core switching |
US20080307244 (en) | 2007-06-11 | 2008-12-11 | MediaTek, Inc. | Method of and Apparatus for Reducing Power Consumption within an Integrated Circuit |
US8180997B2 (en) * | 2007-07-05 | 2012-05-15 | Board Of Regents, University Of Texas System | Dynamically composing processor cores to form logical processors |
US20090055826A1 (en) | 2007-08-21 | 2009-02-26 | Kerry Bernstein | Multicore Processor Having Storage for Core-Specific Operational Data |
US8284205B2 (en) | 2007-10-24 | 2012-10-09 | Apple Inc. | Methods and apparatuses for load balancing between multiple processing units |
US20090172423A1 (en) | 2007-12-31 | 2009-07-02 | Justin Song | Method, system, and apparatus for rerouting interrupts in a multi-core processor |
US20090222654A1 (en) | 2008-02-29 | 2009-09-03 | Herbert Hum | Distribution of tasks among asymmetric processing elements |
US20090235260A1 (en) | 2008-03-11 | 2009-09-17 | Alexander Branover | Enhanced Control of CPU Parking and Thread Rescheduling for Maximizing the Benefits of Low-Power State |
US20090259863A1 (en) | 2008-04-10 | 2009-10-15 | Nvidia Corporation | Responding to interrupts while in a reduced power state |
US20090292934A1 (en) | 2008-05-22 | 2009-11-26 | Ati Technologies Ulc | Integrated circuit with secondary-memory controller for providing a sleep state for reduced power consumption and method therefor |
US20090300396A1 (en) | 2008-05-30 | 2009-12-03 | Kabushiki Kaisha Toshiba | Information processing apparatus |
US20100146513A1 (en) | 2008-12-09 | 2010-06-10 | Intel Corporation | Software-based Thread Remapping for power Savings |
US20100153954A1 (en) | 2008-12-11 | 2010-06-17 | Qualcomm Incorporated | Apparatus and Methods for Adaptive Thread Scheduling on Asymmetric Multiprocessor |
US20100162014A1 (en) | 2008-12-24 | 2010-06-24 | Mazhar Memon | Low power polling techniques |
US8140876B2 (en) * | 2009-01-16 | 2012-03-20 | International Business Machines Corporation | Reducing power consumption of components based on criticality of running tasks independent of scheduling priority in multitask computer |
EP2254048A1 (en) | 2009-04-21 | 2010-11-24 | LStar Technologies LLC | Thread mapping in multi-core processors |
US20110022833A1 (en) | 2009-07-24 | 2011-01-27 | Sebastien Nussbaum | Altering performance of computational units heterogeneously according to performance sensitivity |
US20110314314A1 (en) | 2010-06-18 | 2011-12-22 | Samsung Electronics Co., Ltd. | Power gating of cores by an soc |
US20130124890A1 (en) | 2010-07-27 | 2013-05-16 | Michael Priel | Multi-core processor and method of power management of a multi-core processor |
US20120102344A1 (en) | 2010-10-21 | 2012-04-26 | Andrej Kocev | Function based dynamic power control |
US20130238912A1 (en) * | 2010-11-25 | 2013-09-12 | Michael Priel | Method and apparatus for managing power in a multi-core processor |
US20120151225A1 (en) * | 2010-12-09 | 2012-06-14 | Lilly Huang | Apparatus, method, and system for improved power delivery performance with a dynamic voltage pulse scheme |
US20120159496A1 (en) * | 2010-12-20 | 2012-06-21 | Saurabh Dighe | Performing Variation-Aware Profiling And Dynamic Core Allocation For A Many-Core Processor |
US20120266179A1 (en) * | 2011-04-14 | 2012-10-18 | Osborn Michael J | Dynamic mapping of logical cores |
US20120185709A1 (en) * | 2011-12-15 | 2012-07-19 | Eliezer Weissmann | Method, apparatus, and system for energy efficiency and energy conservation including thread consolidation |
US20130346771A1 (en) * | 2012-06-20 | 2013-12-26 | Douglas D. Boom | Controlling An Asymmetrical Processor |
Non-Patent Citations (12)
Title |
---|
International Search Report for Application No. GB1108715.2, dated Sep. 23, 2011. |
International Search Report for Application No. GB1108716.0, dated Sep. 28, 2011. |
International Search Report for Application No. GB1108717.8, dated Sep. 30, 2011. |
Kumar et al. (Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction); MICRO 36 Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture; 12 pages. |
Non-Final Office Action for U.S. Appl. No. 12/787,359, dated Aug. 30, 2012. |
Non-Final Office Action for U.S. Appl. No. 12/787,361, dated Sep. 13, 2012. |
Non-Final Office Action for U.S. Appl. No. 13/360,559, dated Apr. 8, 2014. |
Non-Final Office Action for U.S. Appl. No. 13/360,559, dated Oct. 18, 2013. |
Non-Final Office Action for U.S. Appl. No. 13/604,390, dated Nov. 13, 2014. |
Non-Final Office Action for U.S. Appl. No. 13/604,496, dated Sep. 10, 2015. |
NVIDIA (Variable SMP-A Multi-Core CPU Architecture for Low Power and High Performance); Whitepaper; 2011; 16 pages. |
Tanenbaum (Structured Computer Organization: Third Edition); Prentice-Hall, Inc., 1990; 5 pages. |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9965279B2 (en) * | 2013-11-29 | 2018-05-08 | The Regents Of The University Of Michigan | Recording performance metrics to predict future execution of large instruction sequences on either high or low performance execution circuitry |
Also Published As
Publication number | Publication date |
---|---|
US20140181501A1 (en) | 2014-06-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9569279B2 (en) | Heterogeneous multiprocessor design for power-efficient and area-efficient computing | |
US20110213950A1 (en) | System and Method for Power Optimization | |
US8924758B2 (en) | Method for SOC performance and power optimization | |
US20120331319A1 (en) | System and method for power optimization | |
US20120331275A1 (en) | System and method for power optimization | |
TWI493332B (en) | Method and apparatus with power management and a platform and computer readable storage medium thereof | |
TWI578154B (en) | System, method and apparatus for power management | |
TW201137753A (en) | Methods and apparatus to improve turbo performance for events handling | |
TWI553549B (en) | Processor including multiple dissimilar processor cores | |
US20120102348A1 (en) | Fine grained power management in virtualized mobile platforms | |
US20140025930A1 (en) | Multi-core processor sharing L1 cache and method of operating same | |
US9501299B2 (en) | Minimizing performance loss on workloads that exhibit frequent core wake-up activity | |
US10025370B2 (en) | Overriding latency tolerance reporting values in components of computer systems | |
US8717371B1 (en) | Transitioning between operational modes in a hybrid graphics system | |
TWI502333B (en) | Heterogeneous multiprocessor design for power-efficient and area-efficient computing | |
US10168765B2 (en) | Controlling processor consumption using on-off keying having a maximum off time | |
US8717372B1 (en) | Transitioning between operational modes in a hybrid graphics system | |
CN107209544B (en) | System and method for SoC idle power state control based on I/O operating characteristics | |
US20210089326A1 (en) | Dynamic bios policy for hybrid graphics platforms | |
US20240028222A1 (en) | Sleep mode using shared memory between two processors of an information handling system | |
US8199601B2 (en) | System and method of selectively varying supply voltage without level shifting data signals | |
US20230090567A1 (en) | Device and method for two-stage transitioning between reduced power states | |
CN112486870A (en) | Computer system and computer system control method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: NVIDIA CORPORATION, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HICOK, GARY D.;LONGNECKER, MATTHEW RAYMOND;PATEL, RAHUL GAUTAM;REEL/FRAME:029576/0507 Effective date: 20121220 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
CC | Certificate of correction |
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |