DATA PROCESSING DEVICE, METHOD FOR OPERATING A DATA PROCESSING DEVICE
Patent abstract:
DATA PROCESSING DEVICE, METHOD FOR OPERATING A DATA PROCESSING DEVICE. A data processing apparatus and a method for switching the performance of a workload between two processing circuits are provided. The data processing apparatus has first processing circuitry that is architecturally compatible with second processing circuitry, but micro-architecturally different from it. At any point in time, a workload consisting of at least one application and at least one operating system for running that application is performed by one of the first processing circuitry and the second processing circuitry. A switching controller is responsive to a transfer stimulus to perform a handover operation to transfer performance of the workload from source processing circuitry to destination processing circuitry, the source processing circuitry being one of the first and second processing circuitry and the destination processing circuitry (...)
Publication number: BR112012021102B1
Application number: R112012021102-1
Filing date: 2011-02-17
Publication date: 2020-11-24
Inventors: Peter Richard Greenhalgh; Richard Roy Grisenthwaite
Applicant: Arm Limited
Patent description:
FIELD OF THE INVENTION
[001] The present invention relates to a data processing apparatus and method for switching a workload between first and second processing circuitry and, in particular, to a technique for performing such switching so as to increase the energy efficiency of the data processing device.

BACKGROUND OF THE INVENTION
[002] In modern data processing systems, the difference in performance demand between high intensity tasks, such as running games, and low intensity tasks, such as MP3 playback, can exceed a ratio of 100:1. For a single processor to be used for all tasks, that processor would need to have high performance, but an axiom of processor micro-architecture is that high performance processors are less energy efficient than low performance processors. It is known to increase energy efficiency at the processor level using techniques such as Dynamic Voltage and Frequency Scaling (DVFS) or power gating, to provide the processor with a range of performance levels and corresponding power consumption characteristics. However, such techniques are in general becoming insufficient to allow a single processor to take on tasks with such diverging performance requirements.

[003] Accordingly, consideration has been given to using multi-core architectures to provide an energy efficient system for performing such diverse tasks. Whilst systems with multiple processor cores have been used for some time to increase performance, by allowing different cores to operate in parallel on different tasks so as to increase throughput, the analysis of how such systems can be used to increase energy efficiency is a relatively recent development.

[004] The article "Towards Better Performance Per Watt in Virtual Environments on Asymmetric Single-ISA Multi-Core Systems" by V Kumar et al, ACM SIGOPS Operating Systems Review, Volume 43, Issue 3 (July 2009), discusses Asymmetric Single-ISA (ASISA) multi-core systems, consisting of multiple cores that expose the same instruction set architecture (ISA) but differ in features, complexity, power consumption and performance. In the article, workloads on virtualized systems are studied to scrutinise how those workloads should be scheduled on ASISA systems in order to improve performance and energy consumption. The article identifies that certain tasks are better suited to high frequency/performance micro-architectures (typically compute-intensive tasks), whilst others are better suited to lower frequency/performance micro-architectures and, as a side effect, will consume less energy (typically input/output-intensive tasks). Whilst such studies show how ASISA systems can be used to perform various tasks in an energy efficient manner, it is still necessary to provide a mechanism for scheduling individual tasks onto the most appropriate processors, and such scheduling management will typically place a significant burden on the operating system.

[005] The article "Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction" by R Kumar et al, Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36'03), discusses a multi-core architecture in which all cores execute the same instruction set but have different capabilities and performance levels. At run time, system software evaluates an application's resource requirements and chooses the core that can best satisfy those requirements whilst minimising energy consumption.
As discussed in section 2 of that article, during an application's execution the operating system software tries to match the application to the different cores, seeking to satisfy a defined objective function, for example a particular performance requirement. Section 2.3 makes clear that there is a cost to switching cores, which requires the switching granularity to be restricted. A particular example is then discussed in which, if the operating system decides that a switch is in order, it powers up the new core, triggers a cache flush to save all dirty cache data to a shared memory structure, and then signals the new core to start at a predefined operating system entry point. The old core can then be powered down, whilst the new core retrieves the required data from memory. Such an approach is described in section 2.3 as allowing an application to be switched between cores by the operating system. The remainder of the article then discusses how such switching can be performed dynamically in a multi-core configuration with the aim of reducing energy consumption.

[006] Whilst the above article discusses the potential of single-ISA heterogeneous multi-core architectures to provide reductions in energy consumption, it still requires the operating system to be provided with sufficient functionality to make the scheduling decisions for individual applications. In this respect, the role of the operating system becomes more complex when switching between processor instances with different architectural features. It should be noted in this regard that the Alpha EV4 through EV8 cores considered in the article are not fully ISA compatible, as discussed, for example, in the fifth paragraph of section 2.2.

[007] Additionally, the article does not address the problem that there is significant overhead involved in switching applications between cores, which can significantly reduce the benefits to be achieved from such switching.
SUMMARY OF THE INVENTION
[008] Viewed from a first aspect, the present invention provides a data processing apparatus comprising: first processing circuitry for performing data processing operations; second processing circuitry for performing data processing operations; the first processing circuitry being architecturally compatible with the second processing circuitry, such that a workload to be performed by the data processing apparatus can be performed on either the first processing circuitry or the second processing circuitry, said workload comprising at least one application and at least one operating system for running said at least one application; the first processing circuitry being micro-architecturally different from the second processing circuitry, such that the performance of the first processing circuitry is different from the performance of the second processing circuitry; the first and second processing circuitry being configured such that the workload is performed by one of the first processing circuitry and the second processing circuitry at any point in time; and a switching controller, responsive to a transfer stimulus, to perform a handover operation to transfer performance of the workload from source processing circuitry to destination processing circuitry, the source processing circuitry being one of the first processing circuitry and the second processing circuitry, and the destination processing circuitry being the other of the first processing circuitry and the second processing circuitry; the switching controller being arranged, during the handover operation: (i) to cause the source processing circuitry to make its current architectural state available to the destination processing circuitry, the current architectural state being that state which is not available from shared memory, shared between the first and second processing circuitry at the time the handover operation is initiated, and which is necessary for the destination processing circuitry successfully to take over performance of the workload from the source processing circuitry; and (ii) to mask predetermined processor-specific configuration information from said at least one operating system, such that the transfer of the workload is transparent to said at least one operating system.

[009] In accordance with the present invention, a data processing apparatus is provided with first and second processing circuitry that are architecturally compatible with each other but micro-architecturally different. Due to the architectural compatibility of the first and second processing circuitry, a workload consisting not only of one or more applications, but also of at least one operating system for running those one or more applications, can be moved between the first and second processing circuitry. Further, because the first and second processing circuitry are micro-architecturally different, the performance characteristics (and hence the energy consumption characteristics) of the first and second processing circuitry differ.

[0010] In accordance with the present invention, at any point in time the workload is performed by one of the first and second processing circuitry, and a switching controller is responsive to a transfer stimulus to perform a handover operation to transfer performance of the workload between the processing circuitry.
Upon receipt of a transfer stimulus, whichever of the two processing circuits is currently performing the workload is considered to be the source processing circuitry, and the other is considered to be the destination processing circuitry. The switching controller responsible for performing the handover operation causes the current architectural state of the source processing circuitry to be made available to the destination processing circuitry, and additionally masks predetermined processor-specific configuration information from the at least one operating system forming part of the workload, so that the transfer of the workload is transparent to that operating system.

[0011] Through use of the present invention, it is possible to migrate the entire workload from one processing circuitry to the other, whilst masking that transfer from the operating system, and whilst ensuring that the necessary architectural state that is not available from shared memory at the time the handover operation is initiated is made available to the destination processing circuitry, so that it can successfully take over performance of the workload.

[0012] By treating the entire workload as a macroscopic entity that is performed only on one of the first and second processing circuitry at any particular point in time, the technique of the present invention enables the workload to be readily switched between the first and second processing circuitry in a manner transparent to the operating system, whilst ensuring that the destination processing circuitry has all the information necessary to enable it successfully to take over performance of the workload. Such an approach addresses the earlier-mentioned problems that result from using the operating system to manage the scheduling of applications onto particular processing circuits, and has been found to enable significant energy consumption savings to be achieved.

[0013] In one embodiment, the data processing apparatus further comprises power control circuitry for independently controlling the power supplied to the first processing circuitry and the second processing circuitry; wherein, prior to occurrence of the transfer stimulus, the destination processing circuitry is in a power-saving condition and, during the handover operation, the power control circuitry causes the destination processing circuitry to exit the power-saving condition before the destination processing circuitry takes over performance of the workload. Through use of such power control circuitry, it is possible to reduce the energy consumed by whichever processing circuitry is not currently performing the workload.

[0014] In one embodiment, following the handover operation, the power control circuitry causes the source processing circuitry to enter the power-saving condition. This may occur immediately after the handover operation or, in alternative embodiments, the source processing circuitry may be arranged to enter the power-saving condition only after some predetermined period of time has elapsed, which can allow data still held by the source processing circuitry to be made available to the destination processing circuitry in a more energy efficient manner and with higher performance.
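By way of illustration only, the power sequencing just described might be sketched in C as follows; the enum values and the power_ctrl_set() helper are hypothetical names invented for the sketch, not part of the claimed apparatus:

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical power states, mirroring the power-saving conditions
 * named later in the text (off, retention, dormant, idle). */
enum power_state { PSTATE_OFF, PSTATE_RETENTION, PSTATE_DORMANT,
                   PSTATE_IDLE, PSTATE_ON };
enum cluster { CLUSTER_FIRST, CLUSTER_SECOND };

/* Stub standing in for the power control circuitry: each cluster is
 * controlled independently of the other. */
static void power_ctrl_set(enum cluster c, enum power_state s)
{
    printf("cluster %d -> power state %d\n", (int)c, (int)s);
}

/* The destination exits the power-saving condition before it takes
 * over the workload; the source enters it only afterwards, either
 * immediately or after a grace period in which its data remains
 * accessible. */
static void handover_power_sequence(enum cluster src, enum cluster dst,
                                    bool delayed_source_shutdown)
{
    power_ctrl_set(dst, PSTATE_ON);   /* wake destination first */
    /* ... architectural state transfer takes place here ... */
    power_ctrl_set(src, delayed_source_shutdown ? PSTATE_RETENTION
                                                : PSTATE_OFF);
}

int main(void)
{
    handover_power_sequence(CLUSTER_FIRST, CLUSTER_SECOND, true);
    return 0;
}
```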
[0015] A further problem that exists in the prior art, irrespective of the manner in which a switch between different processing circuits occurs, is how to transfer the information required for that switch to be successful both quickly and in an energy efficient manner. In particular, the earlier-mentioned current architectural state needs to be made available to the destination processing circuitry. One way this could be achieved is to write all of that current architectural state to shared memory as part of the handover operation, so that it can then be read from the shared memory by the destination processing circuitry. As used herein, the term "shared memory" refers to memory that can be directly accessed by both the first processing circuitry and the second processing circuitry, for example main memory coupled to both the first and second processing circuitry via an interconnect.

[0016] However, a problem with writing the entire current architectural state to shared memory is that such a process not only takes a significant amount of time, but also consumes significant energy, which can dramatically detract from the potential benefits achievable by switching.

[0017] In accordance with one embodiment, during the handover operation the switching controller causes the source processing circuitry to employ an accelerated mechanism to make its current architectural state available to the destination processing circuitry, without the destination processing circuitry referencing shared memory in order to obtain that current architectural state. Hence, in accordance with such embodiments, a mechanism is provided that avoids the requirement for the architectural state to be routed through shared memory in order to be made available to the destination processing circuitry. This results not only in a performance improvement during the handover operation, but also in a reduction in the energy consumption associated with the handover operation.

[0018] In one embodiment, at least the source processing circuitry has an associated cache, the data processing apparatus further comprises snoop control circuitry, and the accelerated mechanism comprises transferring the current architectural state to the destination processing circuitry through use of the source circuitry's associated cache and the snoop control circuitry.

[0019] In accordance with this technique, the local cache of the source processing circuitry is used to store the current architectural state that must be made available to the destination processor. That state is then marked as shareable, which allows the state to be snooped by the destination processing circuitry using the snoop control circuitry. Hence, in such embodiments, the first and second processing circuitry are made hardware cache coherent with each other, reducing the amount of time, energy and hardware complexity involved in switching from the source processing circuitry to the destination processing circuitry.
[0020] In a particular embodiment, the accelerated mechanism is a save and restore mechanism that causes the source processing circuitry to store its current architectural state in its associated cache, and causes the destination processing circuitry to perform a restore operation whereby the snoop control circuitry retrieves the current architectural state from the associated cache of the source processing circuitry and provides that retrieved current architectural state to the destination processing circuitry. The save and restore mechanism provides a particularly efficient technique for saving the architectural state in the source circuitry's local cache, and for the destination processing circuitry then to retrieve that state.

[0021] Such an approach can be used irrespective of whether the destination processing circuitry has its own associated local cache. Whenever a request for an item of architectural state is received by the snoop control circuitry, whether directly from the destination processing circuitry or from a local cache associated with the destination processing circuitry (in the event that such a cache exists), the snoop control circuitry will determine that the required item of architectural state is stored in the local cache associated with the source circuitry, and will retrieve that data from the source circuitry's local cache for return to the destination processing circuitry (either directly, or via the destination processing circuitry's associated cache, if present).

[0022] In a particular embodiment, the destination processing circuitry has an associated cache in which the transferred architectural state obtained by the snoop control circuitry is stored for reference by the destination processing circuitry.
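The following toy C model illustrates the restore path of paragraphs [0020] and [0021]: a read issued on behalf of the destination is looked up, via the snoop control circuitry, in the source's cache, and falls through to shared memory only on a miss. The data structures and the snoop_read() routine are illustrative assumptions, not a description of the actual hardware.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE_BYTES 64
#define NUM_LINES  256

/* Toy model of one local cache: valid, shareable lines tagged by
 * line address. All names here are invented for the sketch. */
struct cache_line {
    uint64_t tag;
    bool     valid, shareable;
    uint8_t  data[LINE_BYTES];
};
struct cache { struct cache_line lines[NUM_LINES]; };

static uint8_t main_memory[1u << 20];   /* stand-in for memory 80 */

/* Snoop-serviced read: the destination's request is first checked
 * against the source's cache; only on a miss is shared memory
 * accessed. addr is assumed line-aligned and in range. */
void snoop_read(struct cache *src_cache, uint64_t addr, uint8_t *out)
{
    struct cache_line *l =
        &src_cache->lines[(addr / LINE_BYTES) % NUM_LINES];
    if (l->valid && l->shareable && l->tag == addr / LINE_BYTES) {
        memcpy(out, l->data, LINE_BYTES);   /* snoop hit in source */
        return;
    }
    memcpy(out, &main_memory[addr % sizeof main_memory],
           LINE_BYTES);                     /* miss: shared memory */
}
```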
[0023] However, the hardware cache coherence approach described above is not the only technique that can be used to provide the aforementioned accelerated mechanism. For example, in an alternative embodiment, the accelerated mechanism comprises a dedicated bus between the source processing circuitry and the destination processing circuitry, over which the source processing circuitry provides its current architectural state to the destination processing circuitry. Whilst such an approach will typically have a higher hardware cost than the cache coherence approach, it provides an even faster way of performing the switch, which can be beneficial in certain implementations.

[0024] The switching controller can take a variety of forms. However, in one embodiment, the switching controller comprises at least virtualisation software that logically separates the at least one operating system from the first processing circuitry and the second processing circuitry. It is known to use virtual machines to allow applications written for a particular set of native instructions to be run on hardware having a different set of native instructions. The applications are run within a virtual machine environment, where the application instructions are native to the virtual machine, but the virtual machine is implemented by software running on hardware with a different set of native instructions. The virtualisation software provided by the switching controller of the above embodiment can be viewed as operating in a manner similar to the hypervisor in a virtual machine environment, in that it provides separation between the workload and the underlying hardware platform. In the context of the present invention, the virtualisation software provides an efficient mechanism for transferring the workload from one processing circuitry to the other, whilst masking processor-specific configuration information from the operating system(s) forming that workload.

[0025] The transfer stimulus can be generated for a variety of reasons. However, in one embodiment, the timing of the transfer stimulus is chosen so as to increase the energy efficiency of the data processing device. This can be achieved in a variety of ways. For example, performance counters of the processing circuitry can be set up to count performance-sensitive events (for example, the number of instructions executed, or the number of load/store operations). Coupled with a cycle counter or system timer, this allows identification that a highly compute-intensive application is running, which may be better served by switching to the higher performance processing circuitry, or identification of a large number of load/store operations, indicating an IO-intensive application that may be better served on the energy efficient processing circuitry, and so on (an illustrative decision routine is sketched below). An alternative approach is for applications to be profiled and marked as 'big', 'little' or 'big/little', whereby the operating system can interface with the switching controller to move the workload accordingly (here, the term "big" refers to higher performance processing circuitry, and the term "little" refers to more energy efficient processing circuitry).
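A minimal sketch of such a decision routine, assuming a hypothetical pmu_sample structure and purely illustrative thresholds:

```c
#include <stdint.h>

/* Hypothetical snapshot of performance counters; a real system
 * would read PMU registers for instructions executed, load/store
 * operations, and a cycle counter or system timer. */
struct pmu_sample { uint64_t instructions, load_stores, cycles; };

/* Illustrative thresholds only (fixed point, scaled by 1000). */
#define IPC_HIGH_X1000     800  /* compute-bound: prefer "big"    */
#define LS_FRAC_X1000      400  /* IO/memory-bound: prefer "little" */

enum target { STAY, GO_BIG, GO_LITTLE };

/* Decide whether to raise a transfer stimulus from two successive
 * samples: high instruction throughput suggests the higher
 * performance circuitry, a high proportion of load/store
 * operations suggests the energy efficient one. */
enum target pick_target(struct pmu_sample a, struct pmu_sample b)
{
    uint64_t insns = b.instructions - a.instructions;
    uint64_t ls    = b.load_stores  - a.load_stores;
    uint64_t cyc   = b.cycles       - a.cycles;

    if (cyc == 0 || insns == 0)
        return STAY;
    if (insns * 1000 / cyc > IPC_HIGH_X1000)
        return GO_BIG;
    if (ls * 1000 / insns > LS_FRAC_X1000)
        return GO_LITTLE;
    return STAY;
}
```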
[0026] The architectural state that is required for the destination processing circuitry successfully to take over performance of the workload from the source processing circuitry can take a variety of forms. However, in one embodiment, the architectural state comprises at least the current value of one or more special purpose registers of the source processing circuitry, including a program counter value. In addition to the program counter value, various other information may be held in special purpose registers. For example, other special purpose registers include the processor status registers (for example, the CPSR and SPSR in the ARM architecture), which hold control bits for processor mode, interrupt masking, execution state and flags. Other special purpose registers include architectural control registers (the CP15 system control register in the ARM architecture), holding bits to change data endianness, to enable or disable the MMU, to enable or disable the data/instruction caches, and so on. Other special purpose registers in CP15 store exception addresses and status information.

[0027] In one embodiment, the architectural state further comprises the current values stored in an architectural register file of the source processing circuitry. As will be understood by those skilled in the art, the architectural register file contains the registers referred to by the instructions executed whilst the applications are running, those registers holding the source operands for computations and providing the locations into which the results of those computations are stored.

[0028] In one embodiment, at least one of the first processing circuitry and the second processing circuitry comprises a single processing unit. Further, in one embodiment, at least one of the first processing circuitry and the second processing circuitry comprises a cluster of processing units having the same micro-architecture. In a particular embodiment, the first processing circuitry may comprise a cluster of processing units having the same micro-architecture, whilst the second processing circuitry comprises a single processing unit (with a micro-architecture different from that of the processing units in the cluster forming the first processing circuitry).

[0029] The power-saving condition in which the power control circuitry can selectively place the first and second processing circuitry can take a variety of forms. In one embodiment, the power-saving condition is one of: a powered-off condition; a partial/full data retention condition; a dormant condition; or an idle condition. Such conditions will be well understood by those skilled in the art and, accordingly, will not be discussed in more detail here.

[0030] There are numerous ways in which the first and second processing circuitry can be arranged to be micro-architecturally different. In one embodiment, the first processing circuitry and the second processing circuitry are micro-architecturally different by virtue of having at least one of: different execution pipeline lengths; or different execution resources. Differences in pipeline length will typically result in differences in operating frequency, which in turn will have an effect on performance. Similarly, differences in execution resources will have an effect on throughput and hence on performance. For example, a processing circuit with wider execution resources will enable more information to be processed at any particular point in time, improving throughput. In addition, or alternatively, one processing circuit may have more execution resources than the other, for example more arithmetic logic units (ALUs), which again will improve throughput. As another example of different execution resources, an energy efficient processing circuit may be provided with a simple in-order pipeline, whilst a higher performance processing circuit may be provided with an out-of-order superscalar pipeline.

[0031] A further problem that can arise when using high performance processing circuits, for example those running at GHz frequencies, is that such processors approach, and sometimes exceed, the thermal limits within which they were designed to operate. Known techniques for seeking to address these problems can involve placing the processing circuit in a low power condition to reduce heat output, which may include clock throttling and/or voltage reduction, and potentially even disabling the processing circuit completely for a period of time. However, when adopting the techniques of embodiments of the present invention, it is possible to implement an alternative approach to preventing the thermal limit from being exceeded. In particular, in one embodiment, the source processing circuitry has higher performance than the destination processing circuitry, and the data processing apparatus further comprises thermal monitoring circuitry for monitoring a thermal output of the source processing circuitry and for triggering said transfer stimulus when said thermal output reaches a predetermined level. In accordance with such techniques, the entire workload can be migrated from the higher performance processing circuitry to the lower performance processing circuitry, following which less heat will be generated and the source processing circuitry will cool down. Hence, the package containing the two processing circuits can be allowed to cool whilst continued program execution takes place, albeit at lower throughput.
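As an illustration of such a thermal trigger, a polled check might look as follows in C; the threshold value and both helper functions are assumptions made for the sketch:

```c
#include <stdio.h>

/* Illustrative limit only; a real value would come from the
 * package's designed thermal envelope. */
#define THERMAL_LIMIT_MILLI_C 85000

/* Stubbed platform hooks standing in for a thermal sensor and for
 * the switching controller's stimulus input. */
static int  thermal_read_milli_c(void)     { return 86000; }
static void raise_transfer_stimulus(void)  { puts("transfer stimulus"); }

/* When the thermal output of the (higher performance) source
 * circuitry reaches the predetermined level, raise the transfer
 * stimulus so the workload migrates to the cooler, lower
 * performance circuitry rather than being throttled. */
static void thermal_poll(void)
{
    if (thermal_read_milli_c() >= THERMAL_LIMIT_MILLI_C)
        raise_transfer_stimulus();
}

int main(void)
{
    thermal_poll();
    return 0;
}
```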
[0032] The data processing apparatus can be arranged in a variety of ways. However, in one embodiment, the first processing circuitry and the second processing circuitry reside on a single integrated circuit.

[0033] Viewed from a second aspect, the present invention provides a data processing apparatus comprising: first processing means for performing data processing operations; second processing means for performing data processing operations; the first processing means being architecturally compatible with the second processing means, such that a workload to be performed by the data processing apparatus can be performed on either the first processing means or the second processing means, said workload comprising at least one application and at least one operating system for running said at least one application; the first processing means being micro-architecturally different from the second processing means, such that the performance of the first processing means is different from the performance of the second processing means; the first and second processing means being configured such that the workload is performed by one of the first processing means and the second processing means at any point in time; and transfer control means, responsive to a transfer stimulus, for performing a handover operation to transfer performance of the workload from source processing means to destination processing means, the source processing means being one of the first processing means and the second processing means, and the destination processing means being the other of the first processing means and the second processing means; the transfer control means being arranged, during the handover operation: (i) to cause the source processing means to make its current architectural state available to the destination processing means, the current architectural state being that state which is not available from shared memory means, shared between the first and second processing means at the time the handover operation is initiated, and which is necessary for the destination processing means successfully to take over performance of the workload from the source processing means; and (ii) to mask predetermined processor-specific configuration information from said at least one operating system, such that the transfer of the workload is transparent to said at least one operating system.

[0034] Viewed from a third aspect, the present invention provides a method of operating a data processing apparatus having first processing circuitry for performing data processing operations and second processing circuitry for performing data processing operations, the first processing circuitry being architecturally compatible with the second processing circuitry, such that a workload to be performed by the data processing apparatus can be performed on either the first processing circuitry
or the second processing circuitry, said workload comprising at least one application and at least one operating system for running said at least one application, and the first processing circuitry being micro-architecturally different from the second processing circuitry, such that the performance of the first processing circuitry is different from the performance of the second processing circuitry, the method comprising the steps of: performing, at any point in time, the workload on one of the first processing circuitry and the second processing circuitry; performing, in response to a transfer stimulus, a handover operation to transfer performance of the workload from source processing circuitry to destination processing circuitry, the source processing circuitry being one of the first processing circuitry and the second processing circuitry, and the destination processing circuitry being the other of the first processing circuitry and the second processing circuitry; and, during the handover operation: (i) causing the source processing circuitry to make its current architectural state available to the destination processing circuitry, the current architectural state being that state which is not available from shared memory, shared between the first and second processing circuitry at the time the handover operation is initiated, and which is necessary for the destination processing circuitry successfully to take over performance of the workload from the source processing circuitry; and (ii) masking predetermined processor-specific configuration information from said at least one operating system, such that the transfer of the workload is transparent to said at least one operating system.

BRIEF DESCRIPTION OF THE DRAWINGS
[0035] The present invention will be described further, by way of example only, with reference to embodiments thereof as illustrated in the accompanying drawings, in which: Figure 1 is a block diagram of a data processing system in accordance with one embodiment; Figure 2 schematically illustrates the provision of a switching controller (also referred to herein as a workload transfer controller), in accordance with one embodiment, to logically separate the workload performed by the data processing apparatus from the particular hardware platform within the data processing apparatus that is used to perform that workload; Figure 3 is a diagram schematically illustrating the steps performed by both a source processor and a destination processor in response to a switching stimulus in order to transfer the workload from the source processor to the destination processor, in accordance
with one embodiment; Figure 4A schematically illustrates the storing of the source processing circuitry's current architectural state into its associated cache during the save operation of Figure 3; Figure 4B schematically illustrates the use of the snoop control unit to control the transfer of the current architectural state from the source processing circuitry to the destination processing circuitry during the restore operation of Figure 3; Figure 5 illustrates an alternative structure for providing an accelerated mechanism for transferring the current architectural state from the source processing circuitry to the destination processing circuitry during the handover operation, in accordance with one embodiment; Figures 6A through 6I schematically illustrate the steps performed to transfer a workload from source processing circuitry to destination processing circuitry in accordance with one embodiment; Figure 7 is a graph showing the variation of energy efficiency with performance, and illustrating how the various processor cores illustrated in Figure 1 are used at various points along that curve in accordance with one embodiment; Figures 8A and 8B schematically illustrate a low performance processor pipeline and a high performance processor pipeline, respectively, as used in one embodiment; and Figure 9 is a graph showing the variation in the energy consumed by the data processing system as the performance of a processing workload is switched between a low performance, high energy efficiency processing circuit and a high performance, low energy efficiency processing circuit.

DESCRIPTION OF EMBODIMENTS
[0036] Figure 1 is a block diagram schematically illustrating a data processing system in accordance with one embodiment. As shown in Figure 1, the system contains two architecturally compatible processing circuit instances (processing circuitry 0 10 and processing circuitry 1 50), but with these different processing circuit instances having different micro-architectures. In particular, processing circuitry 10 is arranged to operate with higher performance than processing circuitry 50, but conversely processing circuitry 10 will be less energy efficient than processing circuitry 50. Examples of the micro-architectural differences will be described in more detail below with reference to Figures 8A and 8B.

[0037] Each processing circuit may include a single processing unit (also referred to herein as a processor core) or, alternatively, at least one of the processing circuit instances may itself comprise a cluster of processing units having the same micro-architecture.

[0038] In the example illustrated in Figure 1, processing circuitry 10 includes two processor cores 15, 20 that are both architecturally and micro-architecturally identical. In contrast, processing circuitry 50 contains only a single processor core 55. In the following description, the processor cores 15, 20 will be referred to as "big" cores, whilst the processor core 55 will be referred to as a "little" core, since the processor cores 15, 20 will typically be more complex than the processor core 55, being designed with performance in mind, whereas the processor core 55 will typically be significantly less complex, being designed with energy efficiency in mind.
[0039] In Figure 1, each of the cores 15, 20, 55 is assumed to have its own associated local level 1 cache 25, 30, 60, respectively, which may be arranged as a unified cache storing both instructions and data for reference by the associated core, or may be arranged in a Harvard architecture, providing separate level 1 data and level 1 instruction caches. Whilst each core is shown as having its own associated level 1 cache, this is not a requirement and, in alternative embodiments, one or more of the cores may have no local cache.

[0040] In the embodiment shown in Figure 1, processing circuitry 10 also includes a level 2 cache 35 shared between the core 15 and the core 20, with a snoop control unit 40 being used to ensure cache coherency between the two level 1 caches 25, 30 and the level 2 cache 35. In one embodiment, the level 2 cache is arranged as an inclusive cache, and hence any data stored in either of the level 1 caches 25, 30 will also reside in the level 2 cache 35. As will be well understood by those skilled in the art, the purpose of the snoop control unit 40 is to ensure cache coherency between the various caches, so that each of the cores 15, 20 is guaranteed always to access the most up-to-date version of any data when it issues an access request. Hence, purely by way of example, if the core 15 issues an access request for data that is not resident in its associated level 1 cache 25, then the snoop control unit 40 intercepts the request propagated from the level 1 cache 25 and determines, with reference to the level 1 cache 30 and/or the level 2 cache 35, whether the access request can be serviced from the contents of one of those other caches. Only if the data is not present in any of the caches is the access request propagated via the interconnect 70 to main memory 80, main memory 80 being memory that is shared between both processing circuitry 10 and processing circuitry 50.

[0041] The snoop control unit 75 provided within the interconnect 70 operates in a similar manner to the snoop control unit 40, but in this instance seeks to maintain coherency between the cache structure provided within processing circuitry 10 and the cache structure provided within processing circuitry 50. In examples where the level 2 cache 35 is an inclusive cache, the snoop control unit 75 maintains hardware cache coherency between the level 2 cache 35 of processing circuitry 10 and the level 1 cache 60 of processing circuitry 50. However, if the level 2 cache 35 is arranged as an exclusive level 2 cache, then the snoop control unit 75 will also snoop the data held in the level 1 caches 25, 30 in order to ensure cache coherency between the caches of processing circuitry 10 and the cache 60 of processing circuitry 50. The lookup ordering this implies is sketched below.
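A compact C rendering of that lookup order; each lookup function is a stand-in stub rather than real hardware behaviour:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stubs standing in for snoop lookups into the peer level 1 cache
 * 30, the shared level 2 cache 35, and (via snoop control unit 75
 * on the interconnect 70) the other cluster's cache 60. */
static bool peer_l1_has(uint64_t addr)       { (void)addr; return false; }
static bool shared_l2_has(uint64_t addr)     { (void)addr; return false; }
static bool other_cluster_has(uint64_t addr) { (void)addr; return false; }

typedef enum { FROM_PEER_L1, FROM_L2, FROM_OTHER_CLUSTER,
               FROM_MAIN_MEMORY } data_source;

/* Resolution order for a miss in the requesting core's own L1:
 * snoop control unit 40 checks within the cluster first; only if
 * the line is absent does the request cross the interconnect, and
 * only on a further miss is main memory 80 accessed. */
data_source resolve_l1_miss(uint64_t addr)
{
    if (peer_l1_has(addr))       return FROM_PEER_L1;
    if (shared_l2_has(addr))     return FROM_L2;
    if (other_cluster_has(addr)) return FROM_OTHER_CLUSTER;
    return FROM_MAIN_MEMORY;
}
```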
[0042] In accordance with one embodiment, only one of processing circuitry 10 and processing circuitry 50 will be actively processing a workload at any point in time. For the purposes of this application, the workload can be considered to comprise at least one application and at least one operating system for running that at least one application, as illustrated schematically by the reference numeral 100 in Figure 2. In this example, two applications 105, 110 are running under control of the operating system 115 and, collectively, the applications 105, 110 and the operating system 115 form the workload 100. The applications can be considered to exist at a user level, whilst the operating system exists at a privileged level, and collectively the workload formed by the applications and the operating system runs on a hardware platform 125 (representing the hardware-level view). At any point in time, that hardware platform will be provided by either processing circuitry 10 or processing circuitry 50.

[0043] As shown in Figure 1, power control circuitry 65 is provided to selectively and independently supply power to processing circuitry 10 and processing circuitry 50. Prior to a transfer of the workload from one processing circuit to the other, typically only one of the processing circuits will be fully powered, namely the processing circuit currently performing the workload (the source processing circuitry), and the other processing circuit (the destination processing circuitry) will be in a power-saving condition. When it is determined that the workload should be transferred from one processing circuit to the other, there will be a period of time during the handover operation in which both processing circuits are in the powered state but, at some point following the handover operation, the source processing circuit from which the workload was transferred will be placed in the power-saving condition.

[0044] The power-saving condition can take a variety of forms, depending on the implementation, and hence may, for example, be one of a powered-off condition, a partial/full data retention condition, a dormant condition, or an idle condition. Such conditions will be well understood by those skilled in the art and, accordingly, will not be discussed in more detail here.

[0045] The aim of the described embodiments is to switch the workload between the processing circuits depending on the required performance/energy level of the workload. Thus, when the workload involves the performance of one or more performance-intensive tasks, such as running games applications, the workload can be performed on the high performance processing circuitry 10, using either one or both of the big cores 15, 20. In contrast, when the workload is performing only low performance-intensity tasks, such as MP3 playback, the entire workload can be transferred to processing circuitry 50, so as to benefit from the energy efficiencies that can be realised from use of processing circuitry 50.

[0046] To make best use of such switching capabilities, it is necessary to provide a mechanism that allows the switching to take place in a simple and efficient manner, such that the action of transferring the workload does not consume energy to an extent that would negate the benefits of switching, and also to ensure that the switching process is fast enough that it does not itself degrade performance to any significant extent.

[0047] In one embodiment, such benefits are at least in part obtained by arranging processing circuitry 10 to be architecturally compatible with processing circuitry 50. This ensures that the workload can be migrated from one processing circuitry to the other whilst still guaranteeing correct operation. As a bare minimum, such architectural compatibility requires both processing circuits 10 and 50 to share the same instruction set architecture.
However, in one embodiment, such architectural compatibility also entails a stronger compatibility requirement, to ensure that the two processing circuit instances appear identical from a programmer's perspective. In one embodiment, this involves the use of the same architectural registers, and of one or more special purpose registers storing data used by the operating system when running the applications. With such a level of architectural compatibility, it is then possible to mask, from the operating system 115, the transfer of the workload between processing circuits, so that the operating system is entirely unaware of whether the workload is being executed on processing circuitry 10 or on processing circuitry 50.

[0048] In one embodiment, the handling of the transfer from one processing circuit to the other is managed by the switching controller 120 shown in Figure 2 (also referred to as a virtualiser and, in other circumstances, as a workload transfer controller). The switching controller may be embodied by a mixture of hardware, firmware and/or software features but, in one embodiment, includes software similar in nature to the hypervisor software found in virtual machines, which enables applications written in one set of native instructions to be executed on a hardware platform adopting a different set of native instructions. Due to the architectural compatibility between the two processing circuits 10, 50, the switching controller 120 can mask the transfer from the operating system 115 merely by masking one or more items of predetermined processor-specific configuration information from the operating system. For example, the processor-specific configuration information may include the contents of a CP15 processor ID register and a CP15 cache type register; a trap-based sketch of such masking is given below.

[0049] In such an embodiment, the switching controller then needs merely to ensure that any current architectural state held by the source processing circuit at the time of the transfer, which is not already available from shared memory 80 at the time the transfer is initiated, is made available to the destination processing circuit, in order to put the destination circuit in a position successfully to take over performance of the workload. Using the earlier example, such architectural state will typically comprise the current values stored in the architectural register file of the source processing circuitry, together with the current values of one or more special purpose registers of the source processing circuitry. Due to the architectural compatibility between the processing circuits 10, 50, if this current architectural state can be transferred from the source processing circuit to the destination processing circuit, then the destination processing circuit will be in a position successfully to take over performance of the workload from the source processing circuit.
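Purely as an illustration of such masking, the hypervisor-mode trap might be sketched in C as below; the register selectors and all numeric values are invented for the sketch and are not the real CP15 encodings:

```c
#include <stdint.h>

/* Illustrative CP15 register selectors (not the real encodings). */
enum cp15_reg { CP15_PROCESSOR_ID, CP15_CACHE_TYPE };

/* Per-cluster hardware values, and the single "virtual" value the
 * virtualiser reports for both; all three numbers are made up. */
#define ID_BIG              0x410FC0F0u
#define ID_LITTLE           0x410FC070u
#define ID_VIRTUAL          0x410FC0A0u
#define CACHE_TYPE_VIRTUAL  0x0u

/* Handler entered when the OS reads a processor-specific CP15
 * register and execution traps to hypervisor mode: returning the
 * same virtual value regardless of which circuitry is running
 * keeps the actual hardware configuration hidden from the
 * operating system 115. */
uint32_t hyp_cp15_read(enum cp15_reg r)
{
    switch (r) {
    case CP15_PROCESSOR_ID: return ID_VIRTUAL;
    case CP15_CACHE_TYPE:   return CACHE_TYPE_VIRTUAL;
    }
    return 0;
}
```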
[0050] Whilst the architectural compatibility between the processing circuits 10, 50 facilitates the transfer of the entire workload between the two processing circuits, in one embodiment the processing circuits 10, 50 are micro-architecturally different from each other, such that there are different performance characteristics, and hence different energy consumption characteristics, associated with the two processing circuits. As previously discussed, in one embodiment, processing circuit 10 is a high performance, high energy consumption processing circuit, whilst processing circuit 50 is a lower performance, lower energy consumption processing circuit. The two processing circuits can be micro-architecturally different from each other in a number of respects, but will typically have at least one of different execution pipeline lengths and/or different execution resources. Differences in pipeline length will typically result in differences in operating frequency, which in turn will have an effect on performance. Similarly, differences in execution resources will have an effect on throughput and hence on performance. Hence, by way of example, processing circuitry 10 may have wider execution resources and/or more execution resources in order to improve throughput. Further, the pipelines within the processor cores 15, 20 can be arranged to perform out-of-order superscalar processing, whilst the simpler core 55 within the energy efficient processing circuit 50 can be arranged as an in-order pipeline. Further discussion of the micro-architectural differences will be provided later with reference to Figures 8A and 8B.

[0051] The generation of a transfer stimulus to cause the switching controller 120 to instigate a handover operation to transfer the workload from one processing circuit to the other can be triggered for a variety of reasons. For example, in one embodiment, applications may be profiled and marked as 'big', 'little' or 'big/little', whereby the operating system can interface with the switching controller to move the workload accordingly. Hence, by such an approach, the generation of the transfer stimulus can be mapped to particular combinations of applications being executed, so as to ensure that when high performance is required, the workload is performed on the high performance processing circuit 10, whereas when that performance is not required, the energy efficient processing circuit 50 is used instead. In other embodiments, algorithms can be executed to dynamically determine when to trigger a transfer of the workload from one processing circuit to the other based on one or more inputs. For example, the performance counters of the processing circuitry can be set up to count performance-sensitive events (for example, the number of instructions executed, or the number of load/store operations). Coupled with a cycle counter or system timer, this allows identification that a highly compute-intensive application is running, which may be better served by switching to the higher performance processing circuitry, or identification of a large number of load/store operations, indicating an IO-intensive application that may be better served on the energy efficient processing circuitry, and so on.

[0052] As a yet further example of when a transfer stimulus might be generated, the data processing system may include one or more thermal sensors 90 for monitoring the temperature of the data processing system during operation. It can be the case that modern high performance processing circuits, for example those running at GHz frequencies, sometimes reach, or exceed, the thermal limits within which they were designed to operate.
By using such thermal sensors 90, it can be detected when those thermal limits are being reached and, under such conditions, a transfer stimulus can be generated to trigger a transfer of the workload to a more energy efficient processing circuit, so as to allow an overall cooling of the data processing system. Hence, considering the example of Figure 1, where processing circuit 10 is a high performance processing circuit and processing circuit 50 is a lower performance processing circuit consuming less energy, migrating the workload from processing circuit 10 to processing circuit 50 when the device's thermal limits are being reached will allow the device subsequently to cool, whilst still allowing continued program execution, albeit at lower throughput.

[0053] Whilst two processing circuits 10, 50 are shown in Figure 1, it will be appreciated that the techniques of the above described embodiments can also be applied to systems incorporating more than two different processing circuits, allowing the data processing system to span a wider range of performance/energy levels. In such embodiments, each of the different processing circuits will be arranged to be architecturally compatible with one another, to allow the ready migration of the entire workload between the processing circuits, but they will also be micro-architecturally different from one another, to allow a choice between the use of those processing circuits depending on the required performance/energy levels.

[0054] Figure 3 is a flow diagram illustrating the sequence of steps performed by both the source processor and the destination processor when the workload is transferred from the source processor to the destination processor upon receipt of a transfer stimulus. Such a transfer stimulus may be generated by the operating system 115 or by the virtualiser 120, via a system firmware interface, resulting in the detection of the switching stimulus at step 200 by the source processor (which will be executing not only the workload but also the virtualiser software forming at least part of the switching controller 120). Receipt of the transfer stimulus (also referred to herein as the switching stimulus) at step 200 will cause the power controller 65 to initiate a power-up and reset operation 205 on the destination processor. Following such power-up and reset, the destination processor will invalidate its local cache at step 210 and will then enable snooping at step 215. At this point, the destination processor will signal to the source processor that it is ready for the transfer of the workload to take place, this signal causing the source processor to perform a save state operation at step 225. This save state operation will be discussed in more detail below with reference to Figure 4A but, in one embodiment, involves the source processing circuitry storing to its local cache any of its current architectural state that is not available from shared memory at the time the handover operation is initiated, and that is necessary for the destination processor successfully to take over performance of the workload. The overall sequence, including the later steps 230 to 255 described below, is summarised in the sketch that follows.
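The Figure 3 sequence can be summarised as a straight-line C sketch in which each step() call merely names the corresponding action and step number; none of this is a real API:

```c
#include <stdio.h>

/* Stub: in hardware/firmware each of these lines would be an
 * action taken by the source or destination processor. */
static void step(const char *s) { printf("%s\n", s); }

int main(void)
{
    step("200: transfer stimulus detected by source");
    step("205: power controller powers up and resets destination");
    step("210: destination invalidates its local cache");
    step("215: destination enables snooping, signals it is ready");
    step("225: source saves architectural state to its local cache");
    step("230: destination restores state by snooping source cache");
    step("235: destination begins normal operation of the workload");
    step("  --  snooping period: source cache serves destination misses");
    step("245: snoop stop event disables snooping");
    step("250: source cache cleaned, dirty data flushed to memory 80");
    step("255: source processor powered off");
    return 0;
}
```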
[0055] Following the save state operation 225, a switch state signal will be sent to the destination processor, indicating to the destination processor that it should now begin snooping the source processor in order to retrieve the required architectural state. This process takes place via a restore state operation 230, which will be discussed in more detail below with reference to Figure 4B, but which, in one embodiment, involves the destination processing circuitry initiating a sequence of accesses that are intercepted by the snoop control unit 75 within the interconnect 70, causing the cached copy of the architectural state in the source processor's local cache to be retrieved and returned to the destination processor.

[0056] Following step 230, the destination processor is then in a position to take over processing of the workload and, accordingly, normal operation begins at step 235.

[0057] In one embodiment, once normal operation begins on the destination processor, the source processor's cache can be cleaned, as indicated at step 250, in order to flush any dirty data to shared memory 80, and the source processor can then be powered off at step 255. However, in one embodiment, to further improve the efficiency of the destination processor, the source processor is arranged to remain powered for a period of time referred to in Figure 3 as the snooping period. During this time, at least one of the caches of the source circuit remains powered, so that its contents can be snooped by the snoop control circuit 75 in response to access requests issued by the destination processor. Following the transfer of the entire workload via the process described in Figure 3, it is expected that, at least for an initial period of time after the destination processor begins operating on the workload, some of the data required during performance of the workload will reside in the source processor's cache. If the source processor had flushed its contents to memory and been powered down, then the destination processor would, during these early stages, operate relatively inefficiently, since there would be many misses in its local cache and much fetching of data from shared memory, resulting in a significant performance impact whilst the destination processor's cache was being "warmed", that is, filled with the data values required by the destination processor to perform the operations specified by the workload. However, by leaving the source processor's cache powered during the snooping period, the snoop control circuit 75 is able to service many of those cache miss requests from the source circuit's cache, yielding significant performance benefits compared with retrieving that data from shared memory 80.

[0058] However, this performance benefit is expected to last only for a certain period of time after the switch, after which the contents of the source processor's cache will become stale. Accordingly, at some point a snoop stop event will be generated to disable snooping at step 245, following which the source processor's cache will be cleaned at step 250, and the source processor will then be powered off at step 255. A discussion of the various scenarios under which the snoop stop event can be generated is provided in more detail below with reference to Figure 6G.

[0059] Figure 4A schematically illustrates the save operation performed at step 225 of Figure 3 in accordance with one embodiment.
In particular, in one embodiment, the architectural state that needs to be stored from the source processing circuitry 300 into the local cache 330 consists of the contents of a register file 310, referenced by an arithmetic logic unit (ALU) 305 during the performance of data processing operations, together with the contents of various special purpose registers 320 that identify assorted pieces of information required by the workload to enable the workload successfully to be taken over by the destination processing circuitry. The contents of the special purpose registers 320 will include, for example, a program counter value identifying the current instruction being executed, along with various other information. For example, other special purpose registers include the processor status registers (for example, the CPSR and SPSR in the ARM architecture), which hold control bits for processor mode, interrupt masking, execution state and flags. Other special purpose registers include architectural control registers (the CP15 system control register in the ARM architecture), holding bits to change data endianness, to enable or disable the MMU, to enable or disable the data/instruction caches, and so on. Other special purpose registers in CP15 store exception addresses and status information.

[0060] As schematically illustrated in Figure 4A, the source processing circuit 300 will typically also hold some processor-specific configuration information 315, but this information does not need to be saved to the cache 330, since it will not be applicable to the destination processing circuitry. The processor-specific configuration information 315 is typically hard-coded in the source processing circuit 300 using logic constants, and may include, for example, the contents of the CP15 processor ID register (which will be different for each processing circuit) or the contents of the CP15 cache type register (which will depend on the configuration of the caches 25, 30, 60, for example indicating that the caches have different line lengths). When the operating system 115 requires a piece of the processor-specific configuration information 315 then, unless the processor is already in hypervisor mode, an execution trap to hypervisor mode occurs. In response, the virtualiser 120 may, in one embodiment, indicate the value of the requested information but, in another embodiment, will return a "virtual" value. In the case of the processor ID value, this virtual value can be chosen to be the same for both the "big" and the "little" processors, thereby causing the actual hardware configuration to be hidden from the operating system 115 by the virtualiser 120.

[0061] As schematically illustrated in Figure 4A, during the save operation the contents of the register file 310 and of the special purpose registers 320 are stored by the source processing circuitry into the cache 330 to form a cached copy 335. This cached copy is then marked as shareable, which allows the destination processor to snoop this state via the snoop control unit 75.
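The saved state just described can be pictured as a plain C structure; the field widths and the cache_write_shareable() helper are assumptions made for the sketch:

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch of the architectural state of Figure 4A. The names follow
 * the registers mentioned in the text; counts and widths are
 * illustrative. */
struct arch_state {
    uint32_t regfile[16];   /* register file 310 (incl. PC as r15) */
    uint32_t cpsr, spsr;    /* processor status registers          */
    uint32_t sctlr;         /* CP15 system control: endianness,    */
                            /* MMU enable, cache enables, ...      */
    uint32_t exc_addr, fsr; /* CP15 exception address / status     */
};

/* Stub standing in for a write into the local cache 330, with the
 * lines marked shareable so that snoop control unit 75 can later
 * hand the cached copy 335 to the destination. */
static void cache_write_shareable(const void *p, unsigned len)
{
    (void)p;
    printf("cached %u shareable bytes of architectural state\n", len);
}

/* Save operation of step 225. */
static void save_state(const struct arch_state *s)
{
    cache_write_shareable(s, (unsigned)sizeof *s);
}

int main(void)
{
    struct arch_state s = {0};
    save_state(&s);
    return 0;
}
```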
[0062] The restore operation subsequently performed by the destination processor is illustrated schematically in figure 4B. In particular, the destination processing circuitry 350 (which may or may not have its own local cache) issues a request for a particular item of architectural state, this request being intercepted by the snoop control unit 75. The snoop control unit then issues a snoop request to the local cache 330 of the source processing circuitry to determine whether that item of architectural state is present in the source cache. Because of the steps taken during the save operation discussed with reference to figure 4A, a hit will be detected in the source cache 330, resulting in the cached architectural state being returned via the snoop control unit 75 to the destination processing circuitry 350. This process can be repeated iteratively until all items of architectural state have been retrieved by snooping the cache of the source processing circuitry. Any processor specific configuration information relevant to the destination processing circuit 350 is typically hard-wired into the destination processing circuit 350, as discussed above. Hence, once the restore operation has completed, the destination processing circuitry has all the information required to enable it to take over handling of the workload successfully.
[0063] Additionally, in one embodiment, regardless of whether the workload 100 is being performed by the "large" processing circuit 10 or the "small" processing circuit 50, the virtualizer 120 provides the operating system 115 with virtual configuration information having the same values, and so the hardware differences between the "large" and "small" processing circuits 10, 50 are masked from the operating system 115 by the virtualizer 120. This means that the operating system 115 is unaware that the performance of the workload 100 has been transferred to a different hardware platform.
[0064] In accordance with the save and restore operations described with reference to figures 4A and 4B, the various processor instances 10, 50 are arranged to be hardware cache coherent with one another, in order to reduce the amount of time, energy and hardware complexity involved in transferring the architectural state from the source processor to the destination processor. The technique uses the source processor's local cache to store all the state that must be transferred from the source processor to the destination processor and that is not available from the shared memory at the time the handover operation occurs. Because the state is marked as shareable in the source processor's cache, the hardware cache coherent destination processor can snoop this state during the handover operation. Using such a technique, it is possible to transfer the state between the processor instances without needing to save that state either to main memory or to a memory mapped storage element. This yields significant performance and energy consumption benefits, increasing the variety of situations in which it is appropriate to switch the workload in order to seek energy consumption benefits.
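The snoop-driven restore of figure 4B can be pictured as the loop below. This is a toy model, not the patent's implementation: the source cache and shared memory are plain arrays, and the names `restore_state`, `src_cache` and `shared_mem` are invented for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STATE_WORDS 24   /* illustrative size of the saved state */

/* Toy model of the source cache 330 after the save operation
 * (every state word valid) and of shared memory 80 as fallback. */
static uint32_t src_cache[STATE_WORDS];
static bool     src_valid[STATE_WORDS];
static uint32_t shared_mem[STATE_WORDS];

/* Restore operation of figure 4B: the destination processor issues one
 * request per item of architectural state; the snoop control unit 75
 * probes the source cache first and only falls back to shared memory
 * on a miss. After the save of figure 4A, every probe should hit. */
void restore_state(uint32_t *dest)
{
    for (size_t i = 0; i < STATE_WORDS; i++) {
        if (src_valid[i])
            dest[i] = src_cache[i];   /* snoop hit: the expected path */
        else
            dest[i] = shared_mem[i];  /* defensive fallback           */
    }
}

int main(void)
{
    uint32_t dest[STATE_WORDS];
    for (size_t i = 0; i < STATE_WORDS; i++) {
        src_cache[i] = (uint32_t)i;   /* pretend the state was saved */
        src_valid[i] = true;
    }
    restore_state(dest);
    return dest[5] == 5 ? 0 : 1;
}
```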
[0065] However, although the above-described use of cache coherence provides an accelerated mechanism for making the current architectural state available to the destination processor without routing the current architectural state via shared memory, it is not the only way in which such an accelerated mechanism could be implemented. For example, figure 5 illustrates an alternative mechanism in which a dedicated bus 380 is provided between the source processing circuitry 300 and the destination processing circuitry 350 to allow the architectural state to be transferred during the handover operation. In such embodiments, the save and restore operations 225, 230 of figure 3 are replaced by an alternative transfer mechanism using the dedicated bus 380. Although such an approach typically has a higher hardware cost than employing the cache coherence approach (which typically makes use of hardware already provided in the data processing system), it provides an even faster way of performing the switch, which can be beneficial in certain implementations.
[0066] Figures 6A to 6I schematically illustrate a series of steps performed in order to transfer the performance of a workload from the source processing circuitry 300 to the destination processing circuitry 350. The source processing circuitry 300 is whichever of the processing circuits 10, 50 is performing the workload prior to the transfer, the destination processing circuitry being the other of the processing circuits 10, 50.
[0067] Figure 6A shows the system in an initial state in which the source processing circuitry 300 is powered by the power controller 65 and is performing the processing workload 100, while the destination processing circuitry 350 is in a power saving condition. In this embodiment, the power saving condition is a powered-off condition, but, as mentioned above, other kinds of power saving condition could also be used. The workload 100, which includes the applications 105, 110 and an operating system 115 for running the applications 105, 110, is abstracted from the hardware platform of the source processing circuitry 300 by the virtualizer 120. While performing the workload 100, the source processing circuitry 300 maintains the architectural state 400, which may comprise, for example, the contents of the register file 310 and the special purpose registers 320 shown in figure 4A.
[0068] In figure 6B, a transfer stimulus 430 is detected by the virtualizer 120. Although the transfer stimulus 430 is shown in figure 6B as an external event (for example, detection of thermal runaway by the thermal sensor 90), the transfer stimulus 430 could also be an event triggered by the virtualizer 120 itself or by the operating system 115 (for example, the operating system 115 could be configured to inform the virtualizer 120 when a particular type of application is to be processed). The virtualizer 120 responds to the transfer stimulus 430 by controlling the power controller 65 to supply power to the destination processing circuitry 350, in order to place the destination processing circuitry 350 in a powered state.
[0069] In figure 6C, the destination processing circuitry 350 starts executing the virtualizer 120. The virtualizer 120 controls the destination processing circuitry 350 to invalidate its cache 420, in order to prevent processing errors caused by erroneous data values that may be present in the cache 420 on powering up the destination processing circuitry 350. While the destination cache 420 is being invalidated, the source processing circuitry 300 continues to perform the workload 100. When invalidation of the destination cache 420 is complete, the virtualizer 120 controls the destination processing circuitry 350 to signal to the source processing circuitry 300 that it is ready for handover of the workload 100. By continuing to process the workload 100 on the source processing circuitry 300 until the destination processing circuitry 350 is ready for the handover operation, the performance impact of the transfer can be reduced.
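The virtualizer-driven preparation of figures 6B and 6C amounts to a short fixed sequence, sketched below. The hooks `power_on_target`, `invalidate_target_cache` and `signal_ready` are invented stand-ins for the power controller 65, the destination cache 420 and inter-processor signalling; only the ordering is taken from the text.

```c
#include <stdio.h>

/* Stub hooks standing in for the power controller 65, the destination
 * cache 420 and inter-processor signalling; the names are invented. */
static void power_on_target(void)         { puts("target powered on");    }
static void invalidate_target_cache(void) { puts("target cache invalid"); }
static void signal_ready(void)            { puts("ready for handover");   }

/* Handover preparation, figures 6B-6C: power up the destination,
 * invalidate its cache so stale lines cannot cause processing errors,
 * then signal the source. The source keeps executing the workload
 * until this signal, which is what bounds the performance impact. */
void prepare_handover(void)
{
    power_on_target();          /* figure 6B: leave the power saving state */
    invalidate_target_cache();  /* figure 6C: purge stale data             */
    signal_ready();             /* source may now suspend (figure 6D)      */
}

int main(void)
{
    prepare_handover();
    return 0;
}
```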
[0070] In the next stage, shown in figure 6D, the source processing circuitry 300 stops performing the workload 100. During this stage, neither the source processing circuitry 300 nor the destination processing circuitry 350 performs the workload 100. A copy of the architectural state 400 is transferred from the source processing circuitry 300 to the destination processing circuitry 350. For example, the architectural state 400 can be saved to the source cache 410 and restored to the destination processing circuitry 350, as shown in figures 4A and 4B, or can be transferred over a dedicated bus, as shown in figure 5. The architectural state 400 contains all the state information required for the destination processing circuitry 350 to perform the workload 100, other than the information already present in the shared memory 80.
[0071] Having transferred the architectural state 400 to the destination processing circuitry 350, the source processing circuitry 300 is placed in the power saving state by the power control circuitry 65 (see figure 6E), with the exception that the source cache 410 remains powered. Meanwhile, the destination processing circuitry 350 begins performing the workload 100 using the transferred architectural state 400.
[0072] When the destination processing circuitry 350 starts processing the workload 100, the snooping period begins (see figure 6F). During the snooping period, the snoop control unit 75 can snoop the data stored in the source cache 410 and retrieve that data on behalf of the destination processing circuitry 350. When the destination processing circuitry 350 requests data that is not present in the destination cache 420, it requests the data from the snoop control unit 75. The snoop control unit 75 then snoops the source cache 410 and, if the snoop results in a cache hit, retrieves the snooped data from the source cache 410 and returns it to the destination processing circuitry 350, where the snooped data can be stored in the destination cache 420. On the other hand, if the snoop results in a cache miss in the source cache 410, the requested data is fetched from the shared memory 80 and returned to the destination processing circuitry 350. Since accesses to data in the source cache 410 are faster and require less energy than accesses to the shared memory 80, snooping the source cache 410 for a period increases processing performance and reduces energy consumption during an initial period following the transfer of the workload 100 to the destination processing circuitry 350.
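During the snooping period of figure 6F, every destination-side miss is routed through the snoop control unit before shared memory is consulted. A self-contained toy model of that data path, using invented direct-mapped caches and an invented `load_during_snoop_period` name, might look like this:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define LINES 8   /* toy direct-mapped cache size */

typedef struct { bool valid; uintptr_t tag; uint32_t data; } line_t;

static line_t target_cache[LINES];  /* destination cache 420           */
static line_t source_cache[LINES];  /* source cache 410, still powered */

static line_t *lookup(line_t *c, uintptr_t a)
{
    line_t *l = &c[a % LINES];
    return (l->valid && l->tag == a) ? l : NULL;
}

/* Stand-in for the shared memory 80. */
static uint32_t shared_memory_read(uintptr_t a) { return (uint32_t)a ^ 0xABu; }

/* Data path during the snooping period (figure 6F): a miss in the
 * destination cache is serviced by snooping the source cache first;
 * only on a snoop miss is shared memory consulted. Either way, the
 * returned data also warms the destination cache. */
static uint32_t load_during_snoop_period(uintptr_t addr)
{
    line_t *t = lookup(target_cache, addr);
    if (t)
        return t->data;                          /* destination hit */

    line_t *s = lookup(source_cache, addr);      /* snoop cache 410 */
    uint32_t data = s ? s->data : shared_memory_read(addr);

    target_cache[addr % LINES] = (line_t){ true, addr, data };
    return data;
}

int main(void)
{
    source_cache[3] = (line_t){ true, 3, 42 };   /* pre-warm one line */
    printf("%u %u\n", (unsigned)load_during_snoop_period(3),
                      (unsigned)load_during_snoop_period(4));
    return 0;
}
```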
[0073] In the step shown in figure 6G, the snoop control unit 75 detects a snoop stop event indicating that it is no longer efficient to keep the source cache 410 in the powered state. The snoop stop event triggers the end of the snooping period. The snoop stop event may be any one of a set of snoop stop events monitored by the snoop control circuitry 75. For example, the set of snoop stop events can include any one or more of the following: a) the percentage or fraction of snoops that result in a hit in the source cache 410 (that is, a quantity proportional to the number of snoop hits divided by the total number of snoops) dropping below a predetermined threshold level after the destination processing circuitry 350 has begun performing the workload 100; b) the number of transactions, or the number of transactions of a predetermined type (for example, cacheable transactions), performed since the destination processing circuitry 350 began performing the workload 100 exceeding a predetermined threshold; c) the number of processing cycles elapsed since the destination processing circuitry 350 began performing the workload 100 exceeding a predetermined threshold; d) a particular region of the shared memory 80 being accessed for the first time since the destination processing circuitry 350 began performing the workload 100; e) a particular region of the shared memory 80, which was accessed for an initial period after the destination processing circuitry 350 began performing the workload 100, not being accessed for a predetermined number of cycles or for a predetermined period of time; f) the destination processing circuitry 350 writing to a predetermined memory location for the first time since it began performing the transferred workload 100.
[0074] These snoop stop events can be detected using programmable counters in the coherent interconnect 70 that includes the snoop control unit 75. Other types of snoop stop event could also be included in the set of snoop stop events.
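Event types (a) to (c) above are naturally expressed as threshold checks over the programmable counters mentioned in paragraph [0074]. The sketch below is illustrative only: the counter structure, field names and threshold values are assumptions for this sketch, not values taken from the patent.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the programmable counters on the coherent interconnect
 * 70; structure, field names and thresholds are invented here. */
typedef struct {
    uint64_t snoops, snoop_hits;  /* event (a): hit ratio         */
    uint64_t transactions;        /* event (b): transaction count */
    uint64_t cycles;              /* event (c): elapsed cycles    */
} snoop_counters_t;

#define HIT_RATIO_PERCENT_MIN 10u
#define TRANSACTION_LIMIT     1000000u
#define CYCLE_LIMIT           5000000u

/* Returns true when any monitored condition indicates that keeping the
 * source cache 410 powered is no longer worthwhile, i.e. when the
 * snoop stop signal 440 should be raised. */
bool snoop_stop_event(const snoop_counters_t *c)
{
    bool low_hit_ratio = c->snoops > 0 &&
        (100 * c->snoop_hits) / c->snoops < HIT_RATIO_PERCENT_MIN;
    return low_hit_ratio
        || c->transactions > TRANSACTION_LIMIT
        || c->cycles > CYCLE_LIMIT;
}

int main(void)
{
    snoop_counters_t c = { .snoops = 1000, .snoop_hits = 50 }; /* 5% hits */
    return snoop_stop_event(&c) ? 0 : 1;  /* 0: stop event raised */
}
```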
[0075] On detection of a snoop stop event, the snoop control unit 75 sends a snoop stop signal 440 to the source processor 300. The snoop control unit 75 stops snooping the source cache 410 and from then on responds to data access requests from the destination processing circuitry 350 by fetching the requested data from the shared memory 80 and returning the fetched data to the destination processing circuitry 350, where it can be cached.
[0076] In figure 6H, the source cache control circuitry responds to the snoop stop signal 440 by cleaning the cache 410, in order to save to the shared memory 80 any data values that are valid and dirty (that is, whose cached value is more up to date than the corresponding value in the shared memory 80).
[0077] In figure 6I, the source cache 410 is then powered off by the power controller 65, so that the source processing circuitry 300 is entirely in the power saving state. The destination processing circuitry 350 continues to perform the workload 100. From the point of view of the operating system 115, the situation is now the same as in figure 6A. The operating system 115 is unaware that execution of the workload has been transferred from one processing circuit to the other. When another transfer stimulus occurs, the same steps of figures 6A to 6I can be used to return performance of the workload to the first processor (in this case, whichever of the processing circuits 10, 50 was the "source processing circuitry" and the "destination processing circuitry" will be reversed).
[0078] In the embodiment of figures 6A to 6I, independent power control of the cache 410 and of the rest of the source processing circuitry 300 is available, so that the source processing circuitry 300, other than the source cache 410, can be powered off once the destination processing circuitry 350 has begun performing the workload (see figure 6E), while the cache 410 of the source processing circuitry 300 remains in the powered state (see figures 6F to 6H). The source cache 410 is then powered off in figure 6I. This approach can be useful for saving energy, especially when the source processing circuitry 300 is the "large" processing circuit 10.
[0079] However, it is also possible to keep the whole of the source processing circuitry 300 powered during the snooping period and then place the source processing circuitry 300 as a whole into the power saving state in figure 6I, following the end of the snooping period and the cleaning of the source cache 410. This may be more practical where the source cache 410 is too deeply embedded within the source processor core to be powered independently of the source processor core. It may also be more practical when the source processor is the "small" processing circuit 50, whose power consumption is insignificant compared with the "large" processing circuit 10: once the "large" processing circuit 10 has started processing the transferred workload 100, switching the "small" processing circuit 50, other than its cache 60, into the power saving state during the snooping period may have little effect on the overall power consumption of the system. This can mean that the extra hardware complexity of providing individual power control for the "small" processing circuit 50 and the "small" core's cache 60 is not justified.
[0080] In some situations, it may be known before the transfer of the workload that the data stored in the source cache 410 will not be required by the destination processing circuitry 350 when it begins performing the workload 100. For example, the source processing circuitry 300 may just have completed an application when the transfer occurs, in which case the data in the source cache 410 at the time of the transfer relates to the completed application and not to the application to be performed by the destination processing circuitry 350 after the transfer. In such a case, a snoop override controller can trigger the virtualizer 120 and the snoop control circuitry 75 to override snooping of the source cache 410, and to control the source processing circuitry 300 to clean and power off the source cache 410 without waiting for a snoop stop event to signal the end of the snooping period. In this case, the technique of figures 6A to 6I jumps from the step of figure 6E straight to the step of figure 6G, without the step of figure 6F in which data is snooped from the source cache 410. Thus, if it is known in advance that the data in the source cache 410 will not be useful to the destination processing circuitry 350, energy can be saved by placing the source cache 410 and the source processing circuitry 300 in the power saving condition without waiting for a snoop stop event. The snoop override controller may be part of the virtualizer 120, or may be implemented as firmware executing on the source processing circuitry 300. It may also be implemented as a combination of elements: for example, the operating system 115 may inform the virtualizer 120 when an application has terminated, and the virtualizer 120 may then override snooping of the source cache 410 if a transfer occurs once the application has terminated.
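The source-side lifecycle after handover (figures 6E to 6I), including the snoop override shortcut just described, can be summarised as a small state machine. The state names and the `step` function are modelling choices made for this sketch, not structures from the patent.

```c
#include <stdbool.h>
#include <stdio.h>

/* Source-side states after handover, following figures 6E-6I. */
typedef enum {
    CORE_OFF_CACHE_ON,  /* 6E: core power saving, cache 410 still powered */
    SNOOPING,           /* 6F: snoop control unit 75 serves from 410      */
    CLEANING,           /* 6H: dirty lines written back to memory 80      */
    ALL_OFF             /* 6I: cache 410 powered down as well             */
} src_state_t;

/* One step of a toy lifecycle model. 'override' is the snoop override
 * case described above (source cache known to be useless), which skips
 * the snooping period; 'stop_event' models the snoop stop signal 440. */
src_state_t step(src_state_t s, bool override, bool stop_event)
{
    switch (s) {
    case CORE_OFF_CACHE_ON: return override   ? CLEANING : SNOOPING;
    case SNOOPING:          return stop_event ? CLEANING : SNOOPING;
    case CLEANING:          return ALL_OFF;
    default:                return ALL_OFF;
    }
}

int main(void)
{
    src_state_t s = CORE_OFF_CACHE_ON;
    s = step(s, false, false);  /* enter the snooping period (6F)   */
    s = step(s, false, true);   /* stop signal 440 -> cleaning (6H) */
    s = step(s, false, false);  /* cache powered off (6I)           */
    printf("final state %d\n", (int)s);  /* prints 3 == ALL_OFF */
    return 0;
}
```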
[0081] Figure 7 is a graph in which line 600 illustrates how energy consumption varies with performance. For different parts of this graph, the data processing system can be arranged to use different combinations of the processor cores 15, 20, 55 illustrated in figure 1, in order to seek the appropriate trade-off between performance and energy consumption. Hence, by way of example, when numerous very high performance tasks need to be performed, both of the large cores 15, 20 of the processing circuit 10 can be run in order to achieve the desired performance. Optionally, supply voltage variation techniques can be used to allow some variation in performance and energy consumption when using these two cores.
[0082] When the performance requirements drop to a level at which the required performance can be achieved using only one of the large cores, the tasks can be migrated to just one of the large cores 15, 20, with the other core being powered off or placed in some other power saving condition. Again, supply voltage variation can be used to allow some variation between performance and energy consumption when using a single large core in this way. It should be noted that the transition from two large cores to one large core does not require the generation of a transfer stimulus, nor the use of the techniques described above for transferring the workload, since in both cases it is the processing circuit 10 that is being used, and the processing circuit 50 remains in a power saving condition. However, as indicated by the dotted line 610 in figure 7, when performance drops to a level at which the small core can achieve the required performance, a transfer stimulus can be generated to trigger the mechanism described above for transferring the whole workload from the processing circuit 10 to the processing circuit 50, so that the entire workload is then performed on the small core 55, with the processing circuit 10 being placed in a power saving condition. Again, supply voltage variation can be used to allow some variation in the performance and energy consumption of the small core 55.
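The core-selection policy of figure 7 can be sketched as a simple threshold function. The thresholds below are invented for illustration; only a move into or out of the lowest band corresponds to a transfer stimulus across the dotted line 610, the other transitions staying within the processing circuit 10.

```c
#include <stdio.h>

/* Hypothetical scheduling policy mirroring figure 7; the numeric
 * thresholds are assumptions made for this sketch. */
typedef enum { SMALL_CORE, ONE_BIG_CORE, TWO_BIG_CORES } config_t;

config_t select_config(unsigned required_perf /* 0..100 */)
{
    if (required_perf > 60)
        return TWO_BIG_CORES;   /* both large cores 15, 20 active     */
    if (required_perf > 25)
        return ONE_BIG_CORE;    /* other large core powered off       */
    return SMALL_CORE;          /* transfer stimulus: move to core 55 */
}

int main(void)
{
    printf("%d %d %d\n", (int)select_config(80),
           (int)select_config(40), (int)select_config(10));
    return 0;
}
```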
[0083] Figures 8A and 8B respectively illustrate micro-architectural differences between a low performance processor pipeline 800 and a high performance processor pipeline 850 according to one embodiment. The low performance processor pipeline 800 of figure 8A would be suitable for the small processing core 55 of figure 1, while the high performance processor pipeline 850 of figure 8B would be suitable for the large cores 15, 20.
[0084] The low performance processor pipeline 800 of figure 8A comprises a fetch stage 810 for fetching instructions from the memory 80, a decode stage 820 for decoding the fetched instructions, an issue stage 830 for issuing instructions for execution, and multiple execution pipelines including an integer pipeline 840 for performing integer operations, a MAC pipeline 842 for performing multiply-accumulate operations, and a SIMD/FPU pipeline 844 for performing SIMD (single instruction, multiple data) operations or floating point operations. In the low performance processor pipeline 800, the issue stage 830 issues a single instruction at a time, and issues the instructions in the order in which they were fetched.
[0085] The high performance processor pipeline 850 of figure 8B comprises a fetch stage 860 for fetching instructions from the memory 80, a decode stage 870 for decoding the fetched instructions, a rename stage 875 for renaming registers specified in the decoded instructions, a dispatch stage 880 for dispatching instructions for execution, and multiple execution pipelines including two integer pipelines 890, 892, a MAC pipeline 894 and two SIMD/FPU pipelines 896, 898. In the high performance processor pipeline 850, the dispatch stage 880 is a parallel issue stage that can issue multiple instructions to different ones of the pipelines 890, 892, 894, 896, 898 at once. The dispatch stage 880 can also issue instructions out of order. Unlike in the low performance processor pipeline 800, the SIMD/FPU pipelines 896, 898 are of variable length, which means that operations proceeding through the SIMD/FPU pipelines 896, 898 can be controlled to bypass certain stages. An advantage of such an approach is that, where multiple execution pipelines each have different resources, there is no need to artificially lengthen the shortest pipeline to make it the same length as the longest pipeline; conversely, however, logic is required to deal with the out-of-order nature of the results produced by the different pipelines (for example, to put everything back in order if a processing exception occurs).
[0086] The rename stage 875 is provided to map register specifiers, which are included in program instructions and identify particular architectural registers as seen from a programmer's model point of view, onto physical registers, which are the actual registers of the hardware platform. The rename stage 875 enables the microprocessor to provide a larger pool of physical registers than is present in the programmer's model view of the microprocessor. This larger pool of physical registers is useful during out-of-order execution because it enables hazards, such as write-after-write (WAW) hazards, to be avoided by mapping the same architectural register specified in two or more different instructions onto two or more different physical registers, so that the different instructions can be executed concurrently. For more details of register renaming techniques, the reader is referred to the commonly owned US patent application 2008/114966 and US patent 7,590,826.
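A minimal model of what the rename stage 875 buys is shown below: giving each write of the same architectural register a fresh physical register removes the WAW hazard, so both writers could proceed concurrently. The table, allocator and function names are hypothetical; a real renamer would also recycle physical registers at instruction retirement.

```c
#include <stdio.h>

#define ARCH_REGS 16
#define PHYS_REGS 64   /* larger physical pool, as the text describes */

/* Toy rename table mapping architectural to physical registers. */
static int rename_map[ARCH_REGS];
static int next_phys = ARCH_REGS;  /* naive allocator, no recycling */

/* Allocate a fresh physical register for each write of an
 * architectural register; this is what removes WAW hazards. */
static int rename_write(int arch_reg)
{
    if (next_phys == PHYS_REGS)
        return -1;                 /* pool exhausted (a stall, in hardware) */
    rename_map[arch_reg] = next_phys++;
    return rename_map[arch_reg];
}

int main(void)
{
    /* Two instructions that both write architectural register r5: */
    int p1 = rename_write(5);  /* first writer  -> one physical register */
    int p2 = rename_write(5);  /* second writer -> a different one       */
    printf("r5 -> p%d then p%d (WAW hazard removed)\n", p1, p2);
    return 0;
}
```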
[0087] The low performance pipeline 800 and the high performance pipeline 850 are micro-architecturally different in numerous ways. The micro-architectural differences can include: a) the pipelines having different stages. For example, the high performance pipeline 850 has a rename stage 875 that is not present in the low performance pipeline 800; b) the pipeline stages having different capabilities. For example, the issue stage 830 of the low performance pipeline 800 is only capable of single issue of instructions, whereas the dispatch stage 880 of the high performance pipeline 850 can issue instructions in parallel. Parallel issue improves the processing throughput of the pipeline and therefore improves performance; c) the pipeline stages being of different lengths. For example, the decode stage 870 of the high performance pipeline 850 may include three sub-stages, whereas the decode stage 820 of the low performance pipeline 800 may include only a single sub-stage. The longer a pipeline stage (the greater the number of sub-stages), the greater the number of instructions that can be in flight at the same time, and hence the greater the operating frequency at which the pipeline can operate, resulting in a higher level of performance; d) a different number of execution pipelines (for example, the high performance pipeline 850 has more execution pipelines than the low performance pipeline 800). By providing more execution pipelines, more instructions can be processed in parallel and so performance improves; e) providing in-order execution (as in the pipeline 800) or out-of-order execution (as in the pipeline 850). When instructions can be executed out of order, performance improves because the execution of instructions can be scheduled dynamically to optimise performance. For example, in the in-order low performance pipeline 800, a series of MAC instructions would need to be executed one by one by the MAC pipeline 842 before a later instruction could be executed by one of the integer pipeline 840 and the SIMD/floating point pipeline 844. In contrast, in the high performance pipeline 850, the MAC instructions can be executed by the MAC pipeline 894, while (subject to any data hazards that cannot be resolved by renaming) a later instruction using a different execution pipeline 890, 892, 896, 898 can be executed in parallel with the MAC instructions. This means that out-of-order execution can improve processing performance.
[0088] These and other examples of micro-architectural differences result in the pipeline 850 providing higher performance processing than the pipeline 800. Conversely, the micro-architectural differences also cause the pipeline 850 to consume more energy than the pipeline 800. Providing the micro-architecturally different pipelines 800, 850 thus enables the processing of the workload to be optimised either for high performance (using the "large" processing circuit 10 with the high performance pipeline 850) or for energy efficiency (using the "small" processing circuit 50 with the low performance pipeline 800).
[0089] Figure 9 shows a graph illustrating how the energy consumption of the data processing system varies as the performance of the workload 100 is switched between the large processing circuit 10 and the small processing circuit 50.
[0090] At point A of figure 9, the workload 100 is being performed on the small processing circuitry 50, and so the energy consumption is low. At point B, a transfer stimulus occurs, indicating that high intensity processing is to be performed, and so performance of the workload is transferred to the large processing circuitry 10. The energy consumption then rises and remains high at point C while the large processing circuitry 10 is performing the workload. At point D, both large cores are assumed to be operating in combination to process the workload. However, if the performance requirements drop to a level at which the workload can be handled by only one of the large cores, the workload is migrated to just one of the large cores and the other is powered off, as indicated by the drop in energy to the level adjacent to point E. At point E, another transfer stimulus occurs (indicating that a return to low intensity processing is desired), triggering a transfer of the performance of the workload back to the small processing circuitry 50.
[0091] When the small processing circuitry 50 starts processing the workload, most of the large processing circuitry is in the power saving state, but the cache of the large processing circuitry 10 remains powered during the snooping period (point F of figure 9) to enable the data in that cache to be retrieved for the small processing circuitry 50. Consequently, the cache of the large processing circuitry 10 causes the energy consumption at point F to be higher than at point A, where only the small processing circuitry 50 was powered. At the end of the snooping period, the cache of the large processing circuitry 10 is powered off and, at point G, the energy consumption returns to the low level at which only the small processing circuitry 50 is active.
[0092] As mentioned above, in figure 9 the energy consumption is higher during the snooping period at point F than at point G, because the cache of the large processing circuitry 10 is powered during the snooping period. Although this increase in energy consumption is indicated only after the large-to-small transition, there may also be a snooping period following the small-to-large transition, during which the data in the cache of the small processing circuitry 50 can be snooped on behalf of the large processing circuitry 10 by the snoop control unit 75. A snooping period for the small-to-large transition is not shown in figure 9 because the energy consumed by leaving the cache of the small processing circuitry 50 in the powered state during the snooping period is insignificant compared with the energy consumed by the large processing circuitry 10 while performing the processing workload, and so the very small increase in energy consumption due to the cache of the small processing circuitry 50 being powered is not visible in the graph of figure 9.
[0093] The embodiments described above provide a system containing two or more architecturally compatible processor instances, with micro-architectures optimised for energy efficiency or for performance. The architectural state required by the operating system and the applications can be switched between the processor instances, depending on the required performance/energy level, to allow the whole workload to be switched between the processor instances. In one embodiment, only one of the processor instances runs the workload at any given time, the other processor instance being in a power saving condition or in the process of entering or leaving the power saving condition.
[0094] In one embodiment, the processor instances can be arranged to be hardware cache coherent with one another, to reduce the amount of time, energy and hardware complexity involved in switching the architectural state from the source processor to the destination processor. This reduces the time taken to perform the switching operation, which increases the range of situations in which the techniques of the embodiments can be used.
[0095] Such systems can be used in a variety of situations in which energy efficiency is important for battery life and/or thermal management, and in which the spread of performance is such that a more energy efficient processor can be used for lower processing workloads, while a higher performance processor can be used for higher processing workloads.
[0096] Because the two or more processor instances are architecturally compatible, from an application's perspective the only difference between the two processors is the available performance.
Through the techniques of one embodiment, all the required architectural state can be moved between the processors without involving the operating system, so that it is transparent to the operating system, and to the applications running on the operating system, which processor the operating system and applications are running on.
[0097] When using the architecturally compatible processor instances described in the above embodiments, the total amount of architectural state that needs to be transferred can easily fit within a data cache and, since modern processing systems often implement cache coherence, by storing the architectural state to be switched in the data cache the destination processor can quickly snoop that state in an energy efficient manner using existing circuit structures.
[0098] In one described embodiment, the switching mechanism is used to ensure that thermal limits for the data processing system are not breached. In particular, when the thermal limits are close to being reached, the whole workload can be switched to a more energy efficient processor instance, allowing the overall system to cool down while program execution continues, albeit at a lower throughput.
[0099] Although a particular embodiment has been described above, it will be appreciated that the invention is not limited to it, and that many modifications and additions may be made within the scope of the invention. For example, various combinations of the features of the following dependent claims could be made with the features of the independent claims without departing from the scope of the present invention.
Claims (19)
[0001] 1. Data processing apparatus, comprising: first processing circuitry (10) for performing data processing operations; second processing circuitry (50) for performing data processing operations; the first processing circuitry being architecturally compatible with the second processing circuitry, so that a workload (100) to be performed by the data processing apparatus can be performed on either the first processing circuitry or the second processing circuitry, said workload comprising at least one application (105, 110) and at least one operating system (115) for running said at least one application; the first and second processing circuitry being configured so that the workload is performed by one of the first processing circuitry and the second processing circuitry at any point in time; a switching controller (120), responsive to a transfer stimulus, to perform a handover operation to transfer performance of the workload from source processing circuitry (300) to destination processing circuitry (350), the source processing circuitry being one of the first processing circuitry and the second processing circuitry, and the destination processing circuitry being the other of the first processing circuitry and the second processing circuitry; characterized by the fact that: the first processing circuitry is micro-architecturally different from the second processing circuitry, so that the performance of the first processing circuitry is different from the performance of the second processing circuitry; the switching controller comprises at least virtualization software that logically separates the at least one operating system from the first processing circuitry and the second processing circuitry; and the switching controller is arranged, during the handover operation: (i) to cause the source processing circuitry to make its current architectural state available to the destination processing circuitry, the current architectural state being that state which is not available from the shared memory, shared between the first and second processing circuitry, at the time the handover operation is initiated, and which is necessary for the destination processing circuitry to successfully take over performance of the workload from the source processing circuitry; and (ii) to apply the virtualization software to mask predetermined processor specific configuration information from said at least one operating system, so that the transfer of the workload is transparent to said at least one operating system.
[0002] 2. Data processing apparatus according to claim 1, characterized by the fact that it further comprises: power control circuitry (65) for independently controlling the power supplied to the first processing circuitry (10) and to the second processing circuitry (50); wherein, before the transfer stimulus occurs, the destination processing circuitry (350) is in a power saving condition, and during the handover operation the power control circuitry causes the destination processing circuitry to exit the power saving condition before the destination processing circuitry takes over performance of the workload.
[0003] 3. Data processing apparatus according to claim 2, characterized by the fact that, following the handover operation, the power control circuitry (65) causes the source processing circuitry (300) to enter the power saving condition.
[0004] 4. Data processing apparatus according to any one of claims 1 to 3, characterized by the fact that, during the handover operation, the switching controller (120) causes the source processing circuitry (300) to employ an accelerated mechanism for making its current architectural state available to the destination processing circuitry (350), without the destination processing circuitry referring to shared memory in order to obtain the current architectural state.
[0005] 5. Data processing apparatus according to claim 4, characterized by the fact that: at least the source circuitry has an associated cache (330); the data processing apparatus further comprises snoop control circuitry (75); and the accelerated mechanism comprises transferring the current architectural state to the destination processing circuitry (350) through use of the source circuitry's associated cache (330) and the snoop control circuitry (75).
[0006] 6. Data processing apparatus according to claim 5, characterized by the fact that the accelerated mechanism is a save and restore mechanism that causes the source processing circuitry (300) to store its current architectural state in its associated cache (330), and causes the destination processing circuitry (350) to perform a restore operation whereby the snoop control circuitry (75) retrieves the current architectural state from the associated cache of the source processing circuitry and provides the retrieved current architectural state to the destination processing circuitry.
[0007] 7. Data processing apparatus according to claim 5 or 6, characterized by the fact that the destination processing circuitry (350) has an associated cache (420) in which the transferred architectural state obtained by the snoop control circuitry (75) is stored for reference by the destination processing circuitry.
[0008] 8. Data processing apparatus according to any one of claims 4 to 7, characterized by the fact that the accelerated mechanism comprises a dedicated bus (380) between the source processing circuitry (300) and the destination processing circuitry (350), over which the source processing circuitry provides its current architectural state to the destination processing circuitry.
[0009] 9. Data processing apparatus according to any one of claims 1 to 8, characterized by the fact that the timing of the transfer stimulus is chosen so as to increase the energy efficiency of the data processing apparatus.
[0010] 10. Data processing apparatus according to any one of claims 1 to 9, characterized by the fact that said architectural state comprises at least the current value of one or more special purpose registers of the source processing circuitry (300), including a program counter value.
[0011] 11. Data processing apparatus according to claim 10, characterized by the fact that said architectural state further comprises the current values stored in an architectural register file of the source processing circuitry (300).
[0012] 12. Data processing apparatus according to any one of claims 1 to 11, characterized by the fact that at least one of the first processing circuitry (10) and the second processing circuitry (50) comprises a single processing unit.
[0013] 13. Data processing apparatus according to any one of claims 1 to 12, characterized by the fact that at least one of the first processing circuitry (10) and the second processing circuitry (50) comprises a cluster of processing units having the same micro-architecture.
[0014] 14. Data processing apparatus according to any one of claims 1 to 13, characterized by the fact that said power saving condition is one of: a powered off condition; a partial/full data retention condition; a dormant condition; or an idle condition.
[0015] 15. Data processing apparatus according to any one of claims 1 to 14, characterized by the fact that the first processing circuitry (10) and the second processing circuitry (50) are micro-architecturally different by virtue of having at least one of: different execution pipeline lengths; or different execution resources.
[0016] 16. Data processing apparatus according to any one of claims 1 to 15, characterized by the fact that the source processing circuitry (300) has higher performance than the destination processing circuitry (350), and the data processing apparatus further comprises: thermal monitoring circuitry (90) for monitoring a thermal output of the source processing circuitry, and for triggering said transfer stimulus when said thermal output reaches a predetermined level.
[0017] 17. Data processing apparatus according to any one of claims 1 to 16, characterized by the fact that the first processing circuitry (10) and the second processing circuitry (50) reside within a single integrated circuit.
[0018] 18. Data processing apparatus, comprising: first processing means (10) for performing data processing operations; second processing means (50) for performing data processing operations; the first processing means being architecturally compatible with the second processing means, so that a workload (100) to be performed by the data processing apparatus can be performed on either the first processing means or the second processing means, said workload comprising at least one application (105, 110) and at least one operating system (115) for running said at least one application; the first and second processing means being configured so that the workload is performed by one of the first processing means and the second processing means at any point in time; transfer control means (120), responsive to a transfer stimulus, for performing a handover operation to transfer performance of the workload from source processing means (300) to destination processing means (350), the source processing means being one of the first processing means and the second processing means, and the destination processing means being the other of the first processing means and the second processing means; characterized by the fact that: the first processing means is micro-architecturally different from the second processing means, so that the performance of the first processing means is different from the performance of the second processing means; the transfer control means comprises at least virtualization software that logically separates the at least one operating system from the first processing means and the second processing means; and the transfer control means is arranged, during the handover operation: (i) to cause the source processing means to make its current architectural state available to the destination processing means, the current architectural state being that state which is not available from the memory means shared between the first and second processing means at the time the handover operation is initiated, and which is necessary for the destination processing means to successfully take over performance of the workload from the source processing means; and (ii) to apply the virtualization software to mask predetermined processor specific configuration information from said at least one operating system, so that the transfer of the workload is transparent to said at least one operating system.
[0019] 19. Method of operating a data processing apparatus, the apparatus comprising first processing circuitry (10) for performing data processing operations and second processing circuitry (50) for performing data processing operations, the first processing circuitry being architecturally compatible with the second processing circuitry, so that a workload (100) to be performed by the data processing apparatus can be performed on either the first processing circuitry or the second processing circuitry, said workload comprising at least one application (105, 110) and at least one operating system (115) for running said at least one application, the method comprising the steps of: performing, at any point in time, the workload on one of the first processing circuitry and the second processing circuitry; performing, in response to a transfer stimulus, a handover operation to transfer performance of the workload from source processing circuitry (300) to destination processing circuitry (350), the source processing circuitry being one of the first processing circuitry and the second processing circuitry, and the destination processing circuitry being the other of the first processing circuitry and the second processing circuitry; the method characterized by the fact that: the first processing circuitry is micro-architecturally different from the second processing circuitry, so that the performance of the first processing circuitry is different from the performance of the second processing circuitry; the handover operation is performed using virtualization software that logically separates the at least one operating system from the first processing circuitry and the second processing circuitry; and during the handover operation: (i) causing the source processing circuitry to make its current architectural state available to the destination processing circuitry, the current architectural state being that state which is not available from the memory shared between the first and second processing circuitry at the time the handover operation is initiated, and which is necessary for the destination processing circuitry to successfully take over performance of the workload from the source processing circuitry; and (ii) masking predetermined processor specific configuration information from said at least one operating system using the virtualization software, so that the transfer of the workload is transparent to said at least one operating system.
Similar technologies:
Publication number | Publication date | Title
BR112012021102B1 | 2020-11-24 | DATA PROCESSING DEVICE, METHOD FOR OPERATING A DATA PROCESSING DEVICE
BR112012021121B1 | 2020-12-01 | Data processing apparatus, and data processing method
US20190332158A1 | 2019-10-31 | Dynamic core selection for heterogeneous multi-core systems
TWI494850B | 2015-08-01 | Providing an asymmetric multicore processor system transparently to an operating system
US20110213935A1 | 2011-09-01 | Data processing apparatus and method for switching a workload between first and second processing circuitry
JP5932044B2 | 2016-06-08 | Application event control (…) based on priority to reduce power consumption
US10423216B2 | 2019-09-24 | Asymmetric multi-core processor with native switching mechanism
US9128857B2 | 2015-09-08 | Flush engine
US10209991B2 | 2019-02-19 | Instruction set and micro-architecture supporting asynchronous memory access
Gutierrez et al., 2014 | Evaluating private vs. shared last-level caches for energy efficiency in asymmetric multi-cores
US20170329626A1 | 2017-11-16 | Apparatus with at least one resource having thread mode and transaction mode, and method
Renau et al. | Speculative Multithreading Does not (…) Waste Energy, draft paper submitted for publication, November 6, 2003
Patent family:
Publication number | Publication date
RU2520411C2 | 2014-06-27
JP2013521557A | 2013-06-10
JP5823987B2 | 2015-11-25
CN102782671B | 2015-04-22
IL221270D0 | 2012-10-31
KR20130044211A | 2013-05-02
DE112011100744T5 | 2013-06-27
BR112012021102A2 | 2017-07-11
KR101802140B1 | 2017-12-28
CN102782671A | 2012-11-14
RU2012141606A | 2014-04-10
IL221270A | 2016-07-31
GB2490823B | 2017-04-12
US8418187B2 | 2013-04-09
GB2490823A | 2012-11-14
WO2011107776A1 | 2011-09-09
US20110213934A1 | 2011-09-01
GB201214368D0 | 2012-09-26
Cited documents:
Publication number | Filing date | Publication date | Applicant | Title
US3309A | 1843-10-18 | Weaver's loom for working any number of heddles
US288748A | 1883-11-20 | John Watson
JPH09138716A | 1995-11-14 | 1997-05-27 | Toshiba Corp | Electronic computer
GB2318194B | 1996-10-08 | 2000-12-27 | Advanced RISC Machines Ltd | Asynchronous data processing apparatus
JP3459056B2 | 1996-11-08 | 2003-10-20 | Hitachi Ltd | Data transfer system
JP3864509B2 | 1997-08-19 | 2007-01-10 | Hitachi Ltd | Multiprocessor system
JPH11203254A | 1998-01-14 | 1999-07-30 | NEC Corp | Shared process control device and machine readable recording medium for storing program
US6501999B1 | 1999-12-22 | 2002-12-31 | Intel Corporation | Multi-processor mobile computer system having one processor integrated with a chipset
US6631474B1 | 1999-12-31 | 2003-10-07 | Intel Corporation | System to coordinate switching between first and second processors and to coordinate cache coherency between first and second processors during switching
JP2002215597A | 2001-01-15 | 2002-08-02 | Mitsubishi Electric Corp | Multiprocessor device
US7100060B2 | 2002-06-26 | 2006-08-29 | Intel Corporation | Techniques for utilization of asymmetric secondary processing resources
US20040225840A1 | 2003-05-09 | 2004-11-11 | O'Connor, Dennis M. | Apparatus and method to provide multithreaded computer processing
US20050132239A1 | 2003-12-16 | 2005-06-16 | Athas, William C. | Almost-symmetric multiprocessor that supports high-performance and energy-efficient execution
US20080263324A1 | 2006-08-10 | 2008-10-23 | Sehat Sutardja | Dynamic core switching
US20060064606A1 | 2004-09-21 | 2006-03-23 | International Business Machines Corporation | A method and apparatus for controlling power consumption in an integrated circuit
US7437581B2 | 2004-09-28 | 2008-10-14 | Intel Corporation | Method and apparatus for varying energy per instruction according to the amount of available parallelism
JP4982971B2 | 2004-09-29 | 2012-07-25 | Sony Corporation | Information processing apparatus, process control method, and computer program
US7275124B2 | 2005-02-24 | 2007-09-25 | International Business Machines Corporation | Method and system for controlling forwarding or terminating of a request at a bus interface based on buffer availability
US7461275B2 | 2005-09-30 | 2008-12-02 | Intel Corporation | Dynamic core swapping
US7624253B2 | 2006-10-25 | 2009-11-24 | Arm Limited | Determining register availability for register renaming
US7590826B2 | 2006-11-06 | 2009-09-15 | Arm Limited | Speculative data value usage
US7996663B2 | 2007-12-27 | 2011-08-09 | Intel Corporation | Saving and restoring architectural state for processor cores
US20110213947A1 | 2008-06-11 | 2011-09-01 | John George Mathieson | System and method for power optimization
JP4951034B2 | 2009-06-25 | 2012-06-13 | Hitachi Ltd | Computer system and its operation information management method
US9367462B2 | 2009-12-29 | 2016-06-14 | Empire Technology Development LLC | Shared memories for energy efficient multi-core processors
US8533505B2 | 2010-03-01 | 2013-09-10 | Arm Limited | Data processing apparatus and method for transferring workload between source and destination processing circuitry
CN103038749B | 2010-07-01 | 2017-09-15 | 纽戴纳公司 | Split processes between clusters by process type to optimize the use of cluster specific configuration
US8782645B2 | 2011-05-11 | 2014-07-15 | Advanced Micro Devices, Inc. | Automatic load balancing for heterogeneous cores
US8683468B2 | 2011-05-16 | 2014-03-25 | Advanced Micro Devices, Inc. | Automatic kernel migration for heterogeneous cores
US20130007376A1 | 2011-07-01 | 2013-01-03 | Sailesh Kottapalli | Opportunistic snoop broadcast in directory enabled home snoopy systems
GB2536824B | 2011-09-06 | 2017-06-14 | Intel Corp | Power efficient processor architecture
KR101624061B1 | 2011-09-06 | 2016-05-24 | Intel Corporation | Power efficient processor architecture
WO2013101032A1 | 2011-12-29 | 2013-07-04 | Intel Corporation | Migrating threads between asymmetric cores in a multiple core processor
KR101975288B1 | 2012-06-15 | 2019-05-07 | Samsung Electronics Co., Ltd. | Multi cluster processing system and method for operating thereof
US9195285B2 | 2012-12-27 | 2015-11-24 | Intel Corporation | Techniques for platform duty cycling
US10162687B2 | 2012-12-28 | 2018-12-25 | Intel Corporation | Selective migration of workloads between heterogeneous compute elements based on evaluation of migration performance benefit and available energy and thermal budgets
US9569223B2 | 2013-02-13 | 2017-02-14 | Red Hat Israel, Ltd. | Mixed shared/non-shared memory transport for virtual machines
US20140269611A1 | 2013-03-14 | 2014-09-18 | T-Mobile USA, Inc. | Communication handovers from networks using unlicensed spectrum to circuit-switched networks
JP6244771B2 | 2013-09-24 | 2017-12-13 | NEC Corporation | Information processing system, processing apparatus, distributed processing method, and program
US20150095614A1 | 2013-09-27 | 2015-04-02 | Bret L. Toll | Apparatus and method for efficient migration of architectural state between processor cores
KR20150050135A | 2013-10-31 | 2015-05-08 | Samsung Electronics Co., Ltd. | Electronic system including a plurality of heterogeneous cores and operating method thereof
US9665372B2 | 2014-05-12 | 2017-05-30 | International Business Machines Corporation | Parallel slice processor with dynamic instruction stream mapping
US20150355946A1 | 2014-06-10 | 2015-12-10 | Dan-Chyi Kang | "Systems of System" and method for virtualization and cloud computing system
US9870226B2 | 2014-07-03 | 2018-01-16 | The Regents of the University of Michigan | Control of switching between execution mechanisms
US9720696B2 | 2014-09-30 | 2017-08-01 | International Business Machines Corporation | Independent mapping of threads
US9582052B2 | 2014-10-30 | 2017-02-28 | Qualcomm Incorporated | Thermal mitigation of multi-core processor
US9898071B2 | 2014-11-20 | 2018-02-20 | Apple Inc. | Processor including multiple dissimilar processor cores
US9958932B2 | 2014-11-20 | 2018-05-01 | Apple Inc. | Processor including multiple dissimilar processor cores that implement different portions of instruction set architecture
US9977678B2 | 2015-01-12 | 2018-05-22 | International Business Machines Corporation | Reconfigurable parallel execution and load-store slice processor
US10133576B2 | 2015-01-13 | 2018-11-20 | International Business Machines Corporation | Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
JP6478762B2 | 2015-03-30 | 2019-03-06 | Renesas Electronics Corporation | Semiconductor device and control method thereof
WO2016195274A1 | 2015-06-01 | 2016-12-08 | Samsung Electronics Co., Ltd. | Method for scheduling entity in multi-core processor system
US9928115B2 | 2015-09-03 | 2018-03-27 | Apple Inc. | Hardware migration between dissimilar cores
US10775859B2 | 2016-09-23 | 2020-09-15 | Hewlett Packard Enterprise Development LP | Assignment of core identifier
JP2018101256A | 2016-12-20 | 2018-06-28 | Renesas Electronics Corporation | Data processing system and data processing method
US10579575B2 | 2017-02-24 | 2020-03-03 | Dell Products L.P. | Systems and methods of management console user interface pluggability
US10628223B2 | 2017-08-22 | 2020-04-21 | Amrita Vishwa Vidyapeetham | Optimized allocation of tasks in heterogeneous computing systems
US10491524B2 | 2017-11-07 | 2019-11-26 | Advanced Micro Devices, Inc. | Load balancing scheme
US11188379B2 | 2018-09-21 | 2021-11-30 | International Business Machines Corporation | Thermal capacity optimization for maximized single core performance
CN110413098B | 2019-07-31 | 2021-11-16 | Lenovo Ltd | Control method and device
Legal status:
2019-01-08 | B06F | Objections, documents and/or translations needed after an examination request according [chapter 6.6 patent gazette]
2019-09-17 | B06U | Preliminary requirement: requests with searches performed by other patent offices: procedure suspended [chapter 6.21 patent gazette]
2020-07-21 | B09A | Decision: intention to grant [chapter 9.1 patent gazette]
2020-11-24 | B16A | Patent or certificate of addition of invention granted. Free format text: TERM OF VALIDITY: 20 (TWENTY) YEARS COUNTED FROM 17/02/2011, SUBJECT TO THE LEGAL CONDITIONS.
Priority:
Application number | Filing date | Publication | Title
US12/659,234 | 2010-03-01 | US8418187B2 | Virtualization software migrating workload between processing circuitries while making architectural states available transparent to operating system
PCT/GB2011/050317 | 2011-02-17 | WO2011107776A1 | A data processing apparatus and method for switching a workload between first and second processing circuitry
|