Patent abstract:
DATA PROCESSING DEVICE AND DATA PROCESSING METHOD. In response to a transfer stimulus, performance of a processing workload is transferred from source processing circuitry to destination processing circuitry, in preparation for the source processing circuitry being placed in a power saving condition following the transfer. To reduce the number of memory fetches required by the destination processing circuitry following the transfer, a cache of the source processing circuitry is maintained in a powered state for a snoop period. During the snoop period, cache snooping circuitry snoops data values in the source cache and retrieves the snooped data values for the destination processing circuitry.
Publication number: BR112012021121B1
Application number: R112012021121-8
Filing date: 2011-02-17
Publication date: 2020-12-01
Inventor: Peter Richard Greenhalgh
Applicant: Arm Limited
IPC main classification:
Patent description:

FIELD OF THE INVENTION
[001] The present invention relates to a data processing apparatus and method for switching performance of a workload between first and second processing circuitry and, in particular, to a technique for improving the processing performance of the workload after the switch.

BACKGROUND OF THE INVENTION
[002] In modern data processing systems, the difference in performance demand between high intensity tasks, such as games, and low intensity tasks, such as MP3 playback, can exceed a ratio of 100:1. For a single processor to be used for all tasks, that processor would need to be high performance, but an axiom of processor microarchitecture is that high performance processors are less energy efficient than low performance processors. It is known to improve energy efficiency at the processor level using techniques such as Dynamic Voltage and Frequency Scaling (DVFS) or power gating to provide the processor with a range of performance levels and corresponding power consumption characteristics. However, such techniques are generally becoming insufficient to allow a single processor to take on tasks with such divergent performance requirements.
[003] Accordingly, consideration has been given to using multi-core architectures to provide an energy efficient system for performing such diverse tasks. Although systems with multiple processor cores have been used for some time to increase performance by allowing different cores to operate in parallel on different tasks in order to increase throughput, analysis of how such systems can be used to improve energy efficiency has been a relatively recent development.
[004] The article "Towards Better Performance Per Watt in Virtual Environments on Asymmetric Single-ISA Multi-Core Systems" by V Kumar et al, ACM SIGOPS Operating Systems Review, Volume 43, Issue 3 (July 2009), discusses Asymmetric Single Instruction Set Architecture (ASISA) multi-core systems, which consist of multiple cores using the same instruction set architecture (ISA) but differing in features, complexity, power consumption and performance. In that document, properties of virtualized workloads are studied to examine how those workloads should be scheduled on ASISA systems in order to improve performance and energy consumption. The document identifies that certain tasks are better suited to high frequency/performance microarchitectures (typically computationally intensive tasks), while others are better suited to lower frequency/performance microarchitectures and, as a side effect, will consume less energy (typically input/output-intensive tasks). Although such studies show how ASISA systems can be used to perform various tasks in an energy efficient manner, it is still necessary to provide a mechanism for scheduling individual tasks on the most appropriate processors. Typically, such scheduling management will place a significant burden on the operating system.
[005] The article "Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction" by R Kumar et al, Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36'03), discusses a multi-core architecture in which all cores execute the same instruction set but have different capabilities and performance levels. At run time, the system software evaluates an application's resource requirements and chooses the core that can best satisfy those requirements while minimizing energy consumption. As discussed in section 2 of that document, during execution of an application the operating system software tries to match the application to the different cores, seeking to satisfy a defined objective function, for example a particular performance requirement. In section 2.3, it is made clear that there is a cost to switching cores, which necessitates restricting the granularity of switching. A particular example is then discussed in which, if the operating system decides that a switch is in order, it powers up the new core, triggers a cache flush to save all dirty cache data to a shared memory structure, and then signals the new core to start at a predefined operating system entry point. The old core can then be powered down while the new core retrieves the required data from memory. Such an approach is described in section 2.3 as allowing an application to be switched between cores by the operating system. The remainder of the document discusses how such switching can be performed dynamically in a multi-core configuration with the aim of reducing energy consumption.
[006] Although the above document discusses the potential of single-ISA heterogeneous multi-core architectures to provide reductions in energy consumption, it still requires the operating system to be provided with sufficient functionality to enable scheduling decisions to be made for individual applications. In this respect, the role of the operating system becomes more complex when switching between processor instances with different architectural features. In this regard, it should be noted that the Alpha EV4 through EV8 cores considered in the document are not fully ISA compatible, as discussed, for example, in the fifth paragraph of section 2.2.
[007] Additionally, the document does not address the problem that there is significant overhead involved in switching applications between cores, which can significantly reduce the benefits to be gained from such switching. The overhead includes not only the time taken to perform the switch, during which neither processor is performing the transferred workload, but also the cache miss penalty after the switch. When the destination core begins to perform the transferred processing, any cache provided in the destination core will initially contain no valid data, and the destination core therefore experiences cold-start cache misses. This means that data has to be fetched from memory, which slows processing performance and uses a significant amount of energy. Performance and energy efficiency are restored only once the destination cache has been "warmed" by caching some of the data values stored in memory. Although the above document by R Kumar et al acknowledges the problem of cold-start cache misses in section 4.4, it does not provide any solution to this problem. The present technique seeks to improve processing performance after the switch to the destination processor.

SUMMARY OF THE INVENTION
[008] Viewed from a first aspect, the present invention provides a data processing apparatus comprising: first processing circuitry and second processing circuitry configured to perform a processing workload such that the processing workload is performed by one of the first processing circuitry and the second processing circuitry at a time; power control circuitry for independently controlling the power supply to the first processing circuitry and the second processing circuitry; a workload transfer controller, responsive to a transfer stimulus, for controlling a transfer of performance of the processing workload from source processing circuitry to destination processing circuitry before the source processing circuitry is placed in a power saving condition by the power control circuitry, the source processing circuitry being one of the first and second processing circuitry and the destination processing circuitry being the other of the first and second processing circuitry; wherein: at least the source processing circuitry has a cache; the power control circuitry is configured to maintain at least the cache of the source processing circuitry in a powered condition during a snoop period following the start of performance of the transferred processing workload by the destination processing circuitry; the data processing apparatus comprises cache snooping circuitry configured, during the snoop period, to snoop data values in the cache of the source processing circuitry and to retrieve the snooped data values for the destination processing circuitry; and the power control circuitry is configured to place at least said cache of the source processing circuitry in the power saving condition following the end of the snoop period.
[009] The data processing apparatus of the present invention has first and second processing circuitry and, at any time during processing, one of the first and second processing circuitry performs a processing workload. When a transfer stimulus occurs, performance of the processing workload is transferred from the source processing circuitry (whichever of the first and second processing circuitry is currently performing the workload when the transfer stimulus is received) to the destination processing circuitry (the other of the first and second processing circuitry), in preparation for the source processing circuitry being placed in a power saving condition. Regardless of how the transfer itself is achieved, the present technique improves the performance level of the destination circuitry after the processing workload has been transferred to it.
[0010] The present technique recognizes that, following the transfer, the destination processing circuitry may require data values that were stored in the cache of the source processing circuitry prior to the transfer. At least the cache of the source processing circuitry is maintained in a powered condition for a finite period (the snoop period) following the start of performance of the transferred processing workload by the destination processing circuitry. The cache snooping circuitry snoops the data in the source cache during the snoop period and retrieves data on behalf of the destination processing circuitry. By maintaining power to the source cache during the snoop period, the destination processing circuitry has access to the data in the source cache for an initial processing period, thus avoiding the need to fetch that data from memory. Since accesses to the source cache are faster and use less energy than accesses to memory, the present technique improves the performance level of the destination processing circuitry, and the energy efficiency of the apparatus as a whole, following the transfer of the processing workload.
[0011] The present technique also recognizes that snooping of the data in the source cache is useful only for a finite period following the start of performance of the processing workload by the destination processing circuitry. Eventually, the data in the source cache is no longer relevant to the processing being performed by the destination processing circuitry. For example, the destination processing circuitry may start processing another application that does not require data from the source cache, or the destination processing circuitry may have processed the data in such a way that an updated value, different from the value stored in the source cache, is now in use. Therefore, at the end of the snoop period, the cache snooping circuitry stops snooping the data values from the source cache, and the power control circuitry is configured to place at least the cache of the source processing circuitry in the power saving condition, to save energy.
[0012] In summary, instead of turning off the source cache immediately after transferring the processing workload to the destination processing circuitry, the source cache is maintained in a powered state for a snoop period during which the cache snooping circuitry can snoop data values in the source cache and retrieve the snooped data values for the destination processing circuitry. By reducing the number of times data is fetched from memory, the level of performance and energy efficiency is improved.
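The handover summarized above can be sketched as a small software model. This is purely illustrative: the patent describes hardware (power control circuitry and a snoop control unit), not software, and every name below (HandoverModel, transfer_stimulus, snoop_stop_event) is a hypothetical label introduced here for clarity.

```python
# Illustrative model of the handover: on the transfer stimulus the
# workload moves to the destination circuitry while the source cache
# stays powered for a snoop period; a snoop stop event ends the period
# and allows the source cache to be powered down.

class HandoverModel:
    def __init__(self):
        self.source_cache_powered = True
        self.snoop_period_active = False
        self.workload_on = "source"

    def transfer_stimulus(self):
        # Transfer workload performance; keep the source cache powered
        # and begin the snoop period.
        self.workload_on = "destination"
        self.snoop_period_active = True

    def snoop_stop_event(self):
        # End of the snoop period: the source cache may now be placed
        # in the power saving condition.
        self.snoop_period_active = False
        self.source_cache_powered = False
```

The key property modeled here is that the source cache outlives the transfer itself and is only powered down once the snoop period ends.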
[0013] Although the present application generally describes the present technique for the case in which there are two processing circuits (the first and second processing circuitry), the data processing apparatus may comprise further processing circuits, and the technique can be applied to transferring a processing workload between any two of the processing circuits. In addition, each processing circuitry may include a single processor core or a plurality of processor cores.
[0014] The processing workload may include at least one processing application and at least one operating system for running the at least one processing application. By treating the entire workload as a macroscopic entity that is performed on only one of the first and second processing circuitry at any particular time, the workload can be readily switched between the first and second processing circuitry in a manner transparent to the operating system. Such an approach addresses the aforementioned problems that result from using the operating system to manage the scheduling of applications onto particular processing circuits.
[0015] The workload transfer controller may be configured, during the transfer, to mask predetermined processor specific configuration information from the at least one operating system, such that the transfer of the workload is transparent to the at least one operating system. This means that the configuration of the operating system is simplified, because the operating system does not need to be aware of differences between the processor specific configuration information associated with the first processing circuitry and that associated with the second processing circuitry. Since the processor specific differences between the first and second processing circuitry are masked from the operating system, then from the perspective of the operating system (and of all applications run by the operating system) the workload is running on a single hardware platform. Whether the workload is running on the first processing circuitry or the second processing circuitry, the operating system's view of the hardware platform is the same. This makes it easier to configure the operating system and applications.
[0016] The workload transfer controller may comprise at least virtualization software that logically separates the at least one operating system from the first processing circuitry and the second processing circuitry. The virtualization software provides a level of abstraction that hides the hardware configuration of the processing circuitry from the respective operating system, so that the operating system is unaware of which processing circuitry is performing the workload. Thus, the configuration of the operating system can be simplified. The virtualization software can control the allocation of the processing workload to either the first processing circuitry or the second processing circuitry.
[0017] The first processing circuitry may be architecturally compatible with the second processing circuitry, such that a processing workload to be performed by the data processing apparatus can be performed on either the first processing circuitry or the second processing circuitry. This means that, from an application's perspective, the only difference between running on the first processing circuitry and running on the second processing circuitry is the level of performance or energy efficiency achieved. There is no need for any instruction set conversion between the first and second processing circuitry. The entire processing workload, including the operating system and the applications run by the operating system, can be transferred back and forth between the first and second processing circuitry in a simple way.
[0018] The first processing circuitry may be microarchitecturally different from the second processing circuitry, such that the performance of the first processing circuitry is different from the performance of the second processing circuitry. In general, whichever of the first and second processing circuitry has the higher performance level will consume more energy than the other. This means that the workload can be switched to the higher performance processing circuitry when high performance processing is required (for example, when a gaming application is being performed). On the other hand, if low performance processing, such as MP3 playback, is being performed, then the processing workload can be switched in its entirety to the lower performance processing circuitry to improve energy efficiency. Thus, by providing microarchitecturally different processing circuits, the performance of the processing workload can be optimized for performance or for energy consumption, depending on the nature of the workload to be performed.
[0019] Microarchitectural differences between the processing circuits may include, for example, different pipeline lengths or different execution resources. Differences in pipeline length will typically result in differences in operating frequency which, in turn, will have an effect on performance. Similarly, differences in execution resources will have an effect on throughput and therefore on performance. For example, a processing circuit with wider execution resources will enable more information to be processed at any particular time, improving performance. Additionally, or alternatively, one processing circuit may have more execution units than the other, for example more arithmetic logic units (ALUs), which again will improve throughput. As another example of different execution resources, an energy efficient processing circuit may be provided with a simple in-order pipeline, while a higher performance processing circuit may be provided with an out-of-order superscalar pipeline. Also, a higher performance processing circuit may have branch prediction capability, which speeds up processing by prefetching instructions from predicted branch targets before the branch has been resolved, while a more energy efficient processing circuit may have no branch predictor. Such microarchitectural differences do not affect the ability of each architecturally compatible processing circuit to perform the same processing workload, but result in different levels of performance and energy consumption when the respective processing circuits are performing it.
[0020] The present technique can be used when only the source processing circuitry has a cache. In this case, some memory accesses can be avoided by the destination processing circuitry by using the cache snooping circuitry to snoop the source cache during the snoop period. Following the end of the snoop period, all data will need to be fetched from memory.
[0021] Typically, however, the destination processing circuitry will also comprise a cache, such that both the first and second processing circuitry comprise a cache. In this case, the data values snooped by the cache snooping circuitry and retrieved for the destination processing circuitry can be stored in the cache of the destination processing circuitry, to speed up future references to the data.
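The lookup order this implies can be sketched as follows. This is an assumption-laden software illustration, not the claimed hardware: the dict-based caches, the function name `load`, and the allocate-on-snoop-hit policy are all introduced here for clarity (the patent does not mandate a particular allocation policy).

```python
# Hypothetical lookup order during the snoop period: destination cache
# first, then a snoop of the still-powered source cache (warming the
# destination cache with the snooped value), and a memory fetch only
# as the slow last resort.

def load(addr, dest_cache, source_cache, memory, snoop_active):
    if addr in dest_cache:                     # destination cache hit
        return dest_cache[addr]
    if snoop_active and addr in source_cache:  # snoop hit in source cache
        value = source_cache[addr]
        dest_cache[addr] = value               # store for future references
        return value
    value = memory[addr]                       # cold-start miss: memory fetch
    dest_cache[addr] = value
    return value
```

A usage example: with `snoop_active=True`, a first access to an address held only in the source cache is served by the snoop and subsequently hits in the destination cache.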
[0022] In one embodiment, the power control circuitry can be configured to place the source processing circuitry, other than the cache, in the power saving condition during the snoop period, and to place the cache of the source processing circuitry in the power saving condition following the end of the snoop period. This reduces the power consumption of the data processing apparatus, since most of the source processing circuitry can be turned off once the workload has been transferred to the destination processor. Only the cache of the source processing circuitry remains powered during the snoop period, to enable the cache snooping circuitry to retrieve the values stored in the source cache for the destination processing circuitry.
[0023] In one embodiment, when the cache is part of a cache hierarchy in the source processing circuitry, the snooped source cache can be kept in the powered condition during the snoop period while at least one other cache in the cache hierarchy is in the power saving condition.
[0024] An example of this is when the source cache to be snooped is an inclusive level two cache that is configured to store all the data stored in any level one cache(s) in the cache hierarchy. In this case, the level two cache can be left in a powered state during the snoop period to enable snooping by the cache snooping circuitry on behalf of the destination processing circuitry, while the level one cache(s) can be turned off along with the rest of the source processing circuitry.
[0025] Alternatively, the power control circuitry can be configured to keep the source processing circuitry in the powered condition during the snoop period, and to place the entire source processing circuitry, including the cache, in the power saving condition following the end of the snoop period. Although leaving the source processing circuitry powered during the snoop period increases energy consumption, this reduces the complexity of the data processing apparatus, since independent power control of the source cache and the rest of the source processing circuitry is not required.
[0026] An example of a situation in which it may be desirable to power the source cache and the source processing circuitry together is when the cache of the source processing circuitry to be snooped by the cache snooping circuitry is a level one cache. A level one cache may be too closely integrated with a processor core of the source processing circuitry for it to be practical to provide separate power control for the cache and the rest of the source processing circuitry. In this case, the entire source processing circuitry, including the cache, can be left powered during the snoop period and turned off following the end of the snoop period.
[0027] The source processing circuitry can be configured to perform a clean operation on the source cache, to write back all dirty data from the cache to a shared memory, following the end of the snoop period and before the power control circuitry places the cache of the source processing circuitry in the power saving condition. By cleaning the source cache before the cache is turned off, it is ensured that no dirty data, whose most recent value has not yet been written back to memory, is lost.
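A minimal sketch of this clean operation follows. The per-line dict format (`"data"`, `"dirty"`) and the function name are illustrative assumptions; real cache cleaning is a hardware maintenance operation, not Python.

```python
# Sketch of the clean operation: every cache line whose dirty flag is
# set is written back to shared memory before the cache is powered down,
# so that no most-recent value is lost.

def clean_cache(cache_lines, shared_memory):
    for addr, line in cache_lines.items():
        if line["dirty"]:
            shared_memory[addr] = line["data"]  # write back latest value
            line["dirty"] = False
    # Only once no dirty lines remain may the cache safely be placed
    # in the power saving condition.
```

Note that clean (write back, keep contents) is distinct from invalidate (discard contents), which is the operation the destination cache undergoes in paragraph [0029].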
[0028] To save energy, it may be useful for the power control circuitry to keep the destination processing circuitry in the power saving condition before the transfer stimulus occurs. In this case, the power control circuitry can power up the destination processing circuitry in response to the transfer stimulus.
[0029] The destination processing circuitry can be configured to invalidate its cache before the destination processing circuitry starts performance of the transferred processing workload. For example, if the destination processing circuitry was in a power saving condition prior to the transfer of processing workload performance, then, on powering up the destination processing circuitry, the destination cache may contain erroneous data. By invalidating the destination cache before the destination processing circuitry starts performing the transferred processing workload, processing errors can be avoided.
[0030] To improve processing performance, the source processing circuitry can be configured to continue performing the processing workload while the cache of the destination processing circuitry is being invalidated, and the workload transfer controller can be configured to transfer performance of the processing workload to the destination processing circuitry after the cache of the destination processing circuitry has been invalidated. By allowing the source processing circuitry to continue performing the processing workload until the destination processing circuitry is ready to start performing it, the period of time during which no processing circuitry is performing the workload is reduced, and the performance level of the processing workload is therefore improved.
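The ordering constraint in paragraphs [0028] to [0030] can be made explicit with a short sketch. The event names and the flat event log are hypothetical simplifications: in the apparatus these steps are hardware actions that overlap in time, with the source continuing to execute throughout the destination invalidate.

```python
# Illustrative ordering of the handover steps: the destination is powered
# up and its cache invalidated while the source is still running; only
# then does the source stop and the destination take over (at which point
# the snoop period begins).

def handover_sequence():
    log = []
    log.append("power_up_destination")          # response to transfer stimulus
    log.append("invalidate_destination_cache")  # source still executing here
    log.append("source_stops")                  # workload transfer happens now
    log.append("destination_starts")            # snoop period begins
    return log
```

The invariant worth noting is that `source_stops` never precedes `invalidate_destination_cache`, which is what minimizes the window in which neither circuitry performs the workload.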
[0031] In one embodiment, the snoop period can begin when the destination processing circuitry starts performance of the processing workload.
[0032] The snoop period can end on the occurrence of any one of a set of snoop stop events comprising at least one snoop stop event. One or more snoop stop events, which indicate that it is no longer important to keep the source cache in the powered state, can trigger the cache snooping circuitry to end the snoop period. Typically, these events indicate that the data in the source cache is no longer needed by the destination processing circuitry.
[0033] The cache snooping circuitry can be configured to monitor whether any of the set of snoop stop events has occurred. For example, the cache snooping circuitry may comprise performance counters for monitoring the processing of the destination processing circuitry and the data accesses it performs. Using the performance counters, the cache snooping circuitry can analyze whether the data in the source cache is still relevant to the processing being performed by the destination processing circuitry. By configuring the cache snooping circuitry, rather than the destination processing circuitry, to monitor whether any of the snoop stop events has occurred, the destination processing circuitry can remain unaware of whether the source cache is still being snooped. This makes the configuration of the destination processing circuitry simpler.
[0034] The at least one snoop stop event may include an event that occurs when the percentage of snoops performed by the cache snooping circuitry that result in a cache hit in the cache of the source processing circuitry falls below a predetermined threshold level. If the cache hit percentage in the source cache becomes low, this indicates that many of the data values sought by the destination processing circuitry are no longer present in the source cache, and hence that the data in the source cache is no longer relevant to the destination processing circuitry. Therefore, energy efficiency can be improved by ending the snoop period, and deactivating the source cache, once the cache hit percentage falls below the predetermined threshold level.
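A counter-based monitor of the kind described could work as sketched below. The class name, the sampling window, and the 25% default threshold are all illustrative assumptions; the patent only requires that the hit percentage be compared against some predetermined threshold.

```python
# Sketch of the hit-rate snoop stop event: snoop hits and misses are
# counted over a fixed window, and the snoop stop event fires when the
# hit percentage over a full window drops below the threshold.

class SnoopHitMonitor:
    def __init__(self, threshold_percent=25, window=100):
        self.threshold = threshold_percent
        self.window = window          # evaluate once per `window` snoops
        self.snoops = 0
        self.hits = 0

    def record_snoop(self, hit):
        """Record one snoop; return True when the stop event fires."""
        self.snoops += 1
        if hit:
            self.hits += 1
        if self.snoops < self.window:
            return False              # not enough samples yet
        stop = (100 * self.hits) // self.snoops < self.threshold
        self.snoops = self.hits = 0   # start a fresh window
        return stop
```

Windowed evaluation is one plausible design choice here: it keeps the counters small and lets the decision track the recent hit rate rather than the all-time average.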
[0035] The at least one snoop stop event may also include an event that occurs when the destination processing circuitry completes a predetermined number of processing transactions of a predetermined type following the transfer of the processing workload. Although the destination processing circuitry can access data stored in the source cache through the cache snooping circuitry, it will typically not be able to update the values in the source cache. It can be expected that, after a predetermined number of transactions have been completed, the destination processing circuitry will have generated new values for some of the data originally stored in the source cache. Since the destination processing circuitry cannot write data to the source cache, the new data values will be stored in memory and/or in a destination cache, which means that the original data values in the source cache are no longer relevant to the destination processing circuitry. Therefore, completion of the predetermined number of processing transactions of the predetermined type may indicate that the source cache is no longer needed, and may therefore trigger the end of the snoop period. The predetermined type of processing transaction may comprise, for example, all transactions performed by the destination processing circuitry, or may comprise only cacheable transactions.
[0036] Another type of snoop stop event can be an event that occurs when a predetermined number of processing cycles has elapsed after the destination processing circuitry starts performance of the transferred processing workload. As above, the destination processing circuitry will typically not be able to update the values in the source cache. Therefore, after the destination processing circuitry has performed processing for a number of processing cycles, it is unlikely that the data being used by the destination processing circuitry (for example, data stored in memory or in a destination cache) is the same data still stored in the source cache. This means that the number of processing cycles that have elapsed since the start of performance of the processing workload by the destination processing circuitry can be an indicator that the source cache is no longer useful to the destination processing circuitry and can be turned off.
[0037] When the apparatus comprises a shared memory, shared between the first and second processing circuitry, the at least one snoop stop event may include an event that occurs when a particular memory region of the shared memory is first accessed by the destination processing circuitry after performance of the transferred processing workload begins. The first access to a particular memory region may indicate, for example, that the destination processing circuitry has started a new application associated with that memory region, different from the application that was previously being processed. This may indicate that the data in the source cache, which is not associated with the new application, is no longer relevant to the destination processing circuitry. Therefore, the first access to the particular memory region may trigger the end of the snoop period.
[0038] For similar reasons, the snoop stop events can also include an event that occurs when a particular memory region of the shared memory, which was accessed by the destination processing circuitry for an initial period after the start of performance of the transferred processing workload, is not accessed by the destination processing circuitry for a predetermined period. When the destination processing circuitry starts processing an application other than the one originally processed by the source processing circuitry, a region of memory associated with the original application may not be accessed for a period of time. This may indicate that the data in the source cache is no longer being used by the destination processing circuitry, and may therefore trigger the end of the snoop period.
[0039] Another type of snoop stop event is an event that occurs when the destination processing circuitry writes to a predetermined memory location of the shared memory for the first time after performance of the transferred processing workload starts. This allows the destination processing circuitry to signal to the cache snooping circuitry, by writing to the predetermined memory location, that it no longer needs the data in the source cache.
[0040] The set of snoop stop events may include any one or a plurality of the aforementioned snoop stop events, as well as other types of snoop stop event.
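Since the snoop period ends on the occurrence of any one of the configured events, the combination can be sketched as a simple disjunction. The parameter names and the numeric limits below are illustrative placeholders, not values taken from the patent.

```python
# Hypothetical combination of the snoop stop events described above:
# the snoop period ends as soon as ANY configured event has fired.

def snoop_period_ended(hit_percent, transactions_done, cycles_elapsed,
                       new_region_accessed, signal_write_seen,
                       hit_threshold=25, txn_limit=10_000,
                       cycle_limit=1_000_000):
    events = [
        hit_percent < hit_threshold,     # snoop hit rate below threshold
        transactions_done >= txn_limit,  # enough transactions completed
        cycles_elapsed >= cycle_limit,   # enough processing cycles elapsed
        new_region_accessed,             # first access to a new memory region
        signal_write_seen,               # write to the predetermined location
    ]
    return any(events)
```

In hardware this disjunction would simply be the OR of the individual event signals feeding the power control circuitry.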
[0041] As used herein, the term "shared memory" refers to memory directly accessible by both the first processing circuitry and the second processing circuitry, for example a main memory coupled to both the first and second processing circuitry via an interconnect.
[0042] The apparatus may comprise a snoop override controller, responsive to a snoop override condition, to override the snooping of the source cache by the cache snooping circuitry and to control the power control circuitry to place the source processing circuitry, including the cache, in the power saving condition after the transfer of performance of the processing workload, without waiting for the end of the snoop period. In some situations, snooping the source cache may not be useful to the destination processing circuitry. In such situations, the snoop override controller can override the cache snooping circuitry to prevent snooping of the source cache, and can control the power control circuitry to place the source processing circuitry, including the cache, in the power saving condition without waiting for the end of the snoop period. The snoop override controller may be provided, for example, as firmware running on the source processing circuitry, or as part of virtualization software which masks hardware-specific information of the processing circuitry from the operating system.
[0043] For example, it may be known, before the performance of the processing workload is transferred, that the data in the source cache will not be required for the processing about to be performed by the destination processing circuitry following the transfer. For example, if the source processing circuitry has just finished performing a gaming application, then the data used by the gaming application may not be useful to the destination processing circuitry once it starts processing a different application. In this case, the snoop override controller can signal to the cache snooping circuitry and the power control circuitry that snooping of the cache is not necessary.
[0044] The cache snooping circuitry may comprise a coherent interconnect coupled to the first and second processing circuitry. The coherent interconnect has a view of both the source cache and any shared memory present in the data processing apparatus. The destination processing circuitry can simply request data from the coherent interconnect, and the coherent interconnect manages whether the data is snooped from the source cache or fetched from memory (depending on whether or not the snoop period has already ended, and on whether or not the data access request hits in the source cache). Since the coherent interconnect manages the data accesses, the destination processing circuitry need not be aware of the exact location of the requested data, and may be entirely unaware that data is being snooped from the source cache. In some embodiments, the coherent interconnect can also provide a convenient mechanism for transferring architectural state from the source processing circuitry to the destination processing circuitry during the transfer of the processing workload.
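The routing decision described above can be sketched as follows. This is an illustrative model only, not the actual interconnect logic; the class and method names are invented:

```python
class CoherentInterconnect:
    """Routes destination-side reads to the source cache or shared memory."""
    def __init__(self, source_cache, shared_memory):
        self.source_cache = source_cache    # dict: address -> value
        self.shared_memory = shared_memory  # dict: address -> value
        self.snoop_period_active = True

    def end_snoop_period(self):
        # After this point the source cache is no longer consulted
        # (it may be cleaned and powered down).
        self.snoop_period_active = False

    def read(self, address):
        # During the snoop period, a hit in the source cache is serviced
        # by snooping; otherwise the data is fetched from shared memory.
        # The destination processor cannot tell which path was taken.
        if self.snoop_period_active and address in self.source_cache:
            return self.source_cache[address]
        return self.shared_memory[address]
```

The destination simply calls `read()`; whether the value came from the source cache or from memory is invisible to it, which is the point made in the paragraph above.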
[0045] Viewed from another aspect, the present invention provides a data processing apparatus comprising: first processing means for performing processing and second processing means for performing processing, the first processing means and the second processing means being configured to perform a processing workload such that the processing workload is performed by one of the first processing means and the second processing means at a time; power control means for independently controlling the power supplied to the first processing means and the second processing means; workload transfer control means, responsive to a transfer stimulus, for controlling a transfer of performance of the processing workload from a source processing means to a destination processing means before the source processing means is placed in a power saving condition by the power control means, the source processing means being one of the first and second processing means and the destination processing means being the other of the first and second processing means; wherein: at least the source processing means has cache means for storing cached data values; the power control means is configured to maintain at least the cache means of the source processing means in a powered condition during a snoop period following the start of performance of the transferred processing workload by the destination processing means; the data processing apparatus comprises cache snooping means for snooping data values in the cache means of the source processing means during the snoop period and retrieving the snooped data values for the destination processing means; and the power control means is configured to place at least said cache means of the source processing means in the power saving condition following the end of the snoop period.
[0046] Viewed from yet another aspect, the present invention provides a data processing method for an apparatus comprising first processing circuitry and second processing circuitry configured to perform a processing workload such that the processing workload is performed by one of the first processing circuitry and the second processing circuitry at a time; the method comprising: performing the processing workload with source processing circuitry, the source processing circuitry being one of the first and second processing circuitry and comprising a cache, the other of the first and second processing circuitry being destination processing circuitry; in response to a transfer stimulus, transferring performance of the processing workload from the source processing circuitry to the destination processing circuitry before the source processing circuitry is placed in a power saving condition; maintaining at least the cache of the source processing circuitry in a powered condition during a snoop period following the start of performance of the transferred processing workload by the destination processing circuitry; during the snoop period, snooping data values in the cache of the source processing circuitry and retrieving the snooped data values for the destination processing circuitry; and placing at least said cache of the source processing circuitry in the power saving condition following the end of the snoop period.
[0047] Viewed from a further aspect, the present invention provides a data processing apparatus comprising: first processing circuitry and second processing circuitry configured to perform a processing workload such that the processing workload is performed by one of the first processing circuitry and the second processing circuitry at a time; and a workload transfer controller, responsive to a transfer stimulus, to control a transfer of performance of the processing workload from source processing circuitry to destination processing circuitry before the source processing circuitry is placed in a power saving condition by power control circuitry, the source processing circuitry being one of the first and second processing circuitry and the destination processing circuitry being the other of the first and second processing circuitry; wherein: at least the destination processing circuitry has a cache; the destination processing circuitry is configured to invalidate the cache of the destination processing circuitry before the destination processing circuitry begins performance of the transferred processing workload; the source processing circuitry is configured to continue performing the processing workload while the cache of the destination processing circuitry is being invalidated; and the workload transfer controller is configured to transfer performance of the processing workload to the destination processing circuitry after the cache of the destination processing circuitry has been invalidated.
[0048] The present technique can improve processing performance by allowing the source processing circuitry to continue performing the processing workload, for a period following receipt of the transfer stimulus, while the cache of the destination processing circuitry is being invalidated. By transferring performance of the processing workload to the destination processing circuitry only after the cache of the destination processing circuitry has been invalidated, the time during which neither processing circuitry is performing the workload can be reduced. The processing workload is therefore performed more quickly and efficiently. BRIEF DESCRIPTION OF THE DRAWINGS
[0049] The present invention will be further described, by way of example only, with reference to embodiments thereof as illustrated in the accompanying drawings, in which: Figure 1 is a block diagram of a data processing system according to one embodiment; Figure 2 schematically illustrates the provision of a switch controller (also referred to herein as a workload transfer controller) according to one embodiment, to logically separate the workload performed by the data processing apparatus from the particular hardware platform within the data processing apparatus used to perform that workload; Figure 3 is a diagram schematically illustrating the steps performed by both a source processor and a destination processor in response to a switch stimulus in order to transfer the workload from the source processor to the destination processor, in accordance with one embodiment; Figure 4A schematically illustrates the storing of the current architectural state of the source processing circuitry in its associated cache during the save operation of Figure 3; Figure 4B schematically illustrates the use of the snoop control unit to control the transfer of the current architectural state from the source processing circuitry to the destination processing circuitry during the restore operation of Figure 3; Figure 5 illustrates an alternative structure for providing an accelerated mechanism for transferring the current architectural state from the source processing circuitry to the destination processing circuitry during the transfer operation, according to one embodiment; Figures 6A through 6I schematically illustrate the steps performed to transfer a workload from source processing circuitry to destination processing circuitry according to one embodiment; Figure 7 is a graph showing how energy efficiency varies with performance, and illustrating how the various processor cores illustrated in Figure 1 are used at various points along this curve according to one embodiment; Figures 8A and 8B schematically illustrate a low performance processor pipeline and a high performance processor pipeline, respectively, as used in one embodiment; and Figure 9 is a graph showing the variation in energy consumed by the data processing system as performance of a processing workload is switched between low performance, high energy efficiency processing circuitry and high performance, low energy efficiency processing circuitry. DESCRIPTION OF EMBODIMENTS
[0050] Figure 1 is a block diagram schematically illustrating a data processing system according to one embodiment. As shown in Figure 1, the system contains two architecturally compatible processing circuit instances (processing circuitry 0 10 and processing circuitry 1 50), with these different processing circuit instances having different microarchitectures. In particular, the processing circuitry 10 is arranged to operate with higher performance than the processing circuitry 50, but as a trade-off the processing circuitry 10 will be less energy efficient than the processing circuitry 50. Examples of microarchitectural differences will be described in more detail below with reference to Figures 8A and 8B.
[0051] Each processing circuit may include a single processing unit (also referred to herein as a processor core), or alternatively at least one of the processing circuit instances may itself comprise a cluster of processing units with the same microarchitecture.
[0052] In the example illustrated in Figure 1, the processing circuitry 10 includes two processor cores 15, 20 which are both architecturally and microarchitecturally identical. In contrast, the processing circuitry 50 contains only a single processor core 55. In the following description, the processor cores 15, 20 will be referred to as "big" cores, while the processor core 55 will be referred to as a "little" core, since the processor cores 15, 20 will typically be more complex than the processor core 55, being designed with performance in mind, whereas the processor core 55 will typically be significantly less complex, being designed with energy efficiency in mind.
[0053] In Figure 1, each of the cores 15, 20, 55 is assumed to have its own associated local level 1 cache 25, 30, 60, respectively, which may be arranged as a unified cache storing both instructions and data for reference by the associated core, or may be arranged in a Harvard architecture, providing separate level 1 data and level 1 instruction caches. Whilst each core is shown having its own associated level 1 cache, this is not a requirement, and alternatively one or more of the cores may have no local cache.
[0054] In the embodiment shown in Figure 1, the processing circuitry 10 also includes a level 2 cache 35 shared between the core 15 and the core 20, with a snoop control unit 40 being used to ensure cache coherency between the two level 1 caches 25, 30 and the level 2 cache 35. In one embodiment, the level 2 cache is arranged as an inclusive cache, and hence any data stored in either of the level 1 caches 25, 30 will also be resident in the level 2 cache 35. As will be well understood by those skilled in the art, the purpose of the snoop control unit 40 is to ensure cache coherency between the various caches, so that each of the cores 15, 20 can be guaranteed to access the most up-to-date version of any data when it issues an access request. Hence, purely by way of example, if the core 15 issues an access request for data that is not resident in its associated level 1 cache 25, the snoop control unit 40 intercepts the request propagated from the level 1 cache 25 and determines, with reference to the level 1 cache 30 and/or the level 2 cache 35, whether the access request can be serviced from the contents of one of those other caches. Only if the data is not present in any of the caches is the access request propagated via the interconnect 70 to the main memory 80, the main memory 80 being memory shared between both the processing circuitry 10 and the processing circuitry 50.
[0055] A snoop control unit 75 provided within the interconnect 70 operates in a similar way to the snoop control unit 40, but in this instance seeks to maintain coherency between the cache structure provided in the processing circuitry 10 and the cache structure provided in the processing circuitry 50. In examples where the level 2 cache 35 is an inclusive cache, the snoop control unit maintains hardware cache coherency between the level 2 cache 35 of the processing circuitry 10 and the level 1 cache 60 of the processing circuitry 50. However, if the level 2 cache 35 is arranged as an exclusive level 2 cache, then the snoop control unit 75 will also snoop the data held in the level 1 caches 25, 30 in order to ensure cache coherency between the caches of the processing circuitry 10 and the cache 60 of the processing circuitry 50.
[0056] According to one embodiment, only one of the processing circuitry 10 and the processing circuitry 50 will be actively processing a workload at any time. For the purposes of this application, the workload can be considered to comprise at least one application and at least one operating system for running that at least one application, as illustrated schematically by the reference numeral 100 in Figure 2. In this example, two applications 105, 110 are running under the control of an operating system 115, and collectively the applications 105, 110 and the operating system 115 form the workload 100. The applications can be considered to exist at a user level, while the operating system exists at a privileged level, and collectively the workload formed by the applications and the operating system runs on a hardware platform 125 (representing the hardware level view). At any time, this hardware platform will be provided by either the processing circuitry 10 or the processing circuitry 50.
[0057] As shown in Figure 1, power control circuitry 65 is provided to selectively and independently supply power to the processing circuitry 10 and the processing circuitry 50. Prior to a transfer of the workload from one processing circuit to the other, typically only one of the processing circuits will be fully powered, namely the processing circuit currently performing the workload (the source processing circuitry), while the other processing circuit (the destination processing circuitry) will typically be in a power saving condition. When it is determined that the workload should be transferred from one processing circuit to the other, there will be a period of time during the transfer operation in which both processing circuits are in the powered state, but at some point following the transfer operation the source processing circuit from which the workload was transferred will be placed in the power saving condition.
[0058] The power saving condition can take a variety of forms, depending on the implementation, and hence may be, for example, one of a powered off condition, a partial/full data retention condition, a dormant condition, or an idle condition. Such conditions will be well understood by those skilled in the art, and accordingly will not be discussed in further detail here.
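As a rough illustration of the two preceding paragraphs, the power control circuitry 65 can be thought of as holding one such condition per processing circuit, with both circuits fully powered only during the transfer operation itself. The state names follow the text; the API is invented for this sketch:

```python
from enum import Enum

class PowerCondition(Enum):
    POWERED = "fully powered"
    POWERED_OFF = "powered off"
    RETENTION = "partial/full data retention"
    DORMANT = "dormant"
    IDLE = "idle"

class PowerController:
    """Independently controls power to the two processing circuits."""
    def __init__(self):
        # Before a transfer, only the circuit performing the workload
        # is fully powered.
        self.condition = {"source": PowerCondition.POWERED,
                          "destination": PowerCondition.POWERED_OFF}

    def set_condition(self, circuit, condition):
        self.condition[circuit] = condition

    def both_powered(self):
        # True only during the transfer operation itself.
        return all(c is PowerCondition.POWERED
                   for c in self.condition.values())
```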
[0059] The aim of the described embodiments is to switch the workload between the processing circuits depending on the required performance/energy level of the workload. Accordingly, when the workload involves performing one or more performance intensive tasks, such as running gaming applications, the workload can be performed on the high performance processing circuitry 10, using either one or both of the big cores 15, 20. However, when the workload is only performing low performance intensity tasks, such as MP3 playback, the entire workload can instead be transferred to the processing circuitry 50, to benefit from the energy efficiencies that can be realized from use of the processing circuitry 50.
[0060] To make best use of such switching capabilities, it is necessary to provide a mechanism that allows the switching to take place simply and efficiently, so that the act of transferring the workload does not consume energy to an extent that would negate the benefits of switching, and also to ensure that the switching process is fast enough that it does not in itself degrade performance to any significant degree.
[0061] In one embodiment, such benefits are at least partly achieved by arranging the processing circuitry 10 to be architecturally compatible with the processing circuitry 50. This ensures that the workload can be migrated from one processing circuitry to the other whilst still guaranteeing correct operation. At a minimum, such architectural compatibility requires both processing circuits 10 and 50 to share the same instruction set architecture. However, in one embodiment, such architectural compatibility also entails a higher compatibility requirement, to ensure that the two processing circuit instances are seen as identical from a programmer's perspective. In one embodiment, this involves the use of the same architectural registers, and one or more special purpose registers storing data used by the operating system when running the applications. With such a level of architectural compatibility, it is then possible to mask the transfer of the workload between the processing circuits from the operating system 115, so that the operating system is entirely unaware whether the workload is being executed on the processing circuitry 10 or on the processing circuitry 50.
[0062] In one embodiment, the handling of the transfer from one processing circuit to the other is managed by the switch controller 120 shown in Figure 2 (also referred to as a virtualizer and, elsewhere herein, as a workload transfer controller). The switch controller can be embodied by a mixture of hardware, firmware and/or software features, but in one embodiment includes software similar in nature to the hypervisor software found in virtual machines, which enables applications written in one native instruction set to be executed on a hardware platform adopting a different native instruction set. Due to the architectural compatibility between the two processing circuits 10, 50, the switch controller 120 can mask the transfer from the operating system 115 merely by masking one or more items of predetermined processor specific configuration information from the operating system. For example, the processor specific configuration information may include the contents of a CP15 processor ID register and a CP15 cache type register.
[0063] In such an embodiment, the switch controller then merely needs to ensure that any current architectural state held by the source processing circuitry at the time of the transfer which is not, at the time the transfer is initiated, already available from the shared memory 80, is made available to the destination processing circuitry, in order to enable the destination circuitry to be in a position to successfully take over performance of the workload. Using the earlier example, such architectural state will typically comprise the current values stored in the architectural register file of the source processing circuitry, along with the current values of one or more special purpose registers of the source processing circuitry. Due to the architectural compatibility between the processing circuits 10, 50, if this current architectural state can be transferred from the source processing circuitry to the destination processing circuitry, the destination processing circuitry will be in a position to successfully take over performance of the workload from the source processing circuitry.
[0064] Whilst the architectural compatibility between the processing circuits 10, 50 facilitates the transfer of the entire workload between the two processing circuits, in one embodiment the processing circuits 10, 50 are microarchitecturally different from each other, such that there are different performance characteristics, and hence different energy consumption characteristics, associated with the two processing circuits. As discussed previously, in one embodiment the processing circuitry 10 is a high performance, high energy consumption processing circuit, while the processing circuitry 50 is a lower performance, lower energy consumption processing circuit. The two processing circuits can be microarchitecturally different from each other in a number of ways, but will typically have at least one of different execution pipeline lengths and/or different execution resources. Differences in pipeline length will typically result in differences in operating frequency, which in turn will have an effect on performance. Similarly, differences in execution resources will have an effect on throughput and hence on performance. Hence, for example, the processing circuitry 10 may have wider execution resources and/or more execution resources in order to improve throughput. Furthermore, the pipelines within the processor cores 15, 20 can be arranged to perform out-of-order superscalar processing, while the simpler core 55 within the energy efficient processing circuitry 50 can be arranged as an in-order pipeline. Further discussion of microarchitectural differences will be provided later with reference to Figures 8A and 8B.
[0065] The generation of a transfer stimulus to cause the switch controller 120 to instigate a transfer operation moving the workload from one processing circuit to another can be triggered for a variety of reasons. For example, in one embodiment, applications may be profiled and marked as "big", "little" or "big/little", whereby the operating system can interface with the switch controller to move the workload accordingly. Hence, using such an approach, the generation of the transfer stimulus can be mapped to particular combinations of applications being executed, ensuring that when high performance is required the workload is performed on the high performance processing circuitry 10, whereas when that performance is not required the energy efficient processing circuitry 50 is used instead. In other embodiments, algorithms can be executed to dynamically determine when to trigger a transfer of the workload from one processing circuit to another based on one or more inputs. For example, performance counters of the processing circuitry can be set up to count performance sensitive events (for example, the number of instructions executed, or the number of load-store operations). Coupled with a cycle counter or a system timer, this allows identification that a highly compute intensive application is executing, which may be better served by switching to the higher performance processing circuitry, or identification of a large number of load-store operations indicating an IO intensive application, which may be better served on the energy efficient processing circuitry, and so on.
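A dynamic heuristic of the kind just described might look like the following sketch. The counters (instructions executed, load-store operations, cycles) follow the text, but the thresholds and function name are invented and would in practice be tuned per system:

```python
def choose_transfer(instructions, load_stores, cycles,
                    ipc_threshold=1.5, ls_threshold=0.4):
    """Suggest a transfer target from raw performance counter values.

    Returns "big", "little", or None when no transfer stimulus is
    warranted. The thresholds are illustrative only.
    """
    ipc = instructions / cycles            # compute intensity
    ls_ratio = load_stores / instructions  # memory/IO intensity
    if ls_ratio > ls_threshold:
        return "little"   # IO intensive: the energy efficient circuit suffices
    if ipc > ipc_threshold:
        return "big"      # compute intensive: use the high performance circuit
    return None           # no transfer stimulus generated
```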
[0066] As a yet further example of when a transfer stimulus might be generated, the data processing system may include one or more thermal sensors 90 for monitoring the temperature of the data processing system during operation. It can be the case that modern high performance processing circuits, for example those running at GHz frequencies, sometimes reach, or exceed, the thermal limits within which they were designed to operate. Through use of such thermal sensors 90, it can be detected when such thermal limits are being reached, and under those conditions a transfer stimulus can be generated to trigger a transfer of the workload to a more energy efficient processing circuit in order to bring about an overall cooling of the data processing system. Hence, considering the example of Figure 1, where the processing circuitry 10 is a high performance processing circuit and the processing circuitry 50 is a lower performance processing circuit consuming less energy, migrating the workload from the processing circuitry 10 to the processing circuitry 50 when the thermal limits of the device are being reached will bring about a subsequent cooling of the device, whilst still allowing continued program execution, albeit at lower throughput.
[0067] Whilst two processing circuits 10, 50 are shown in Figure 1, it will be appreciated that the techniques of the above described embodiments can also be applied to systems incorporating more than two different processing circuits, allowing the processing system to span a wider range of performance/energy levels. In such embodiments, each of the different processing circuits will be arranged to be architecturally compatible with one another, to allow ready migration of the entire workload between the processing circuits, but will also be microarchitecturally different from one another, to allow a choice to be made between those processing circuits depending on the required performance/energy levels.
[0068] Figure 3 is a flow diagram illustrating the sequence of steps performed by both the source processor and the destination processor when the workload is transferred from the source processor to the destination processor upon receipt of a transfer stimulus. Such a transfer stimulus may be generated by the operating system 115 or by the virtualizer 120 via a system firmware interface, the switch stimulus being detected at step 200 by the source processor (which will be executing not only the workload but also the virtualizer software forming at least part of the switch controller 120). Receipt of the transfer stimulus (also referred to herein as the switch stimulus) at step 200 will cause the power controller 65 to initiate a power up and reset operation 205 on the destination processor. Following such power up and reset, the destination processor will invalidate its local cache at step 210 and then enable snooping at step 215. At this point, the destination processor will signal to the source processor that it is ready for the workload transfer to take place, this signal causing the source processor to perform a save state operation at step 225. This save state operation will be discussed in more detail below with reference to Figure 4A, but in one embodiment involves the source processing circuitry storing, in its local cache, any of its current architectural state that is not available from the shared memory at the time the transfer operation is initiated, and that is necessary for the destination processor to successfully take over performance of the workload.
[0069] Following the save state operation 225, a switch state signal will be sent to the destination processor, indicating to the destination processor that it should now begin snooping the source processor in order to retrieve the required architectural state. This process takes place via a restore state operation 230, which will be discussed in more detail below with reference to Figure 4B, but which in one embodiment involves the destination processing circuitry initiating a sequence of accesses that are intercepted by the snoop control unit 75 within the interconnect 70, causing the cached copy of the architectural state in the source processor's local cache to be retrieved and returned to the destination processor.
[0070] Following step 230, the destination processor is in a position to take over handling of the workload, and accordingly normal operation begins at step 235.
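The ordering of steps 205 through 235 can be summarised in a short sketch. This models only the sequence, not the hardware, and the class and method names are invented:

```python
class TraceProcessor:
    """Records the handover steps it is asked to perform, in order."""
    def __init__(self, name, log):
        self.name, self.log = name, log

    def do(self, action):
        self.log.append(f"{self.name}:{action}")

def transfer_workload(source, destination):
    destination.do("power_up_and_reset")   # step 205
    destination.do("invalidate_cache")     # step 210
    destination.do("enable_snooping")      # step 215
    # Destination signals readiness; the source saves to its local cache
    # any architectural state not already available from shared memory.
    source.do("save_state")                # step 225
    # Destination snoops the saved state out of the source cache.
    destination.do("restore_state")        # step 230
    destination.do("normal_operation")     # step 235
    # The source cache stays powered for the snoop period; cleaning and
    # power-off (steps 245-255) happen later, on a snoop stop event.
```

Note that the source keeps executing the workload until step 225, so the window in which neither processor runs it is confined to the save/restore steps.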
[0071] In one embodiment, once normal operation begins on the destination processor, the cache of the source processor could be cleaned, as indicated at step 250, in order to flush all dirty data to the shared memory 80, and the source processor could then be powered off at step 255. However, in one embodiment, to further increase the efficiency of the destination processor, the source processor is arranged to remain powered for a period of time referred to in Figure 3 as the snoop period. During this time, at least one of the caches of the source circuitry remains powered, so that its contents can be snooped by the snoop control circuit 75 in response to access requests issued by the destination processor. Following the transfer of the entire workload using the process described in Figure 3, it is expected that, at least for an initial period after the destination processor begins operating on the workload, some of the data required during performance of the workload will be resident in the source processor's cache. If the source processor had flushed its contents to memory and been powered off, the destination processor would, during these early stages, operate relatively inefficiently, since there would be many cache misses in its local cache and many fetches of data from the shared memory, resulting in a significant performance impact while the destination processor's cache was being "warmed up", that is, filled with the data values required by the destination processor to perform the operations specified by the workload. However, by leaving the source processor's cache powered during the snoop period, the snoop control circuit 75 is able to service many of those cache miss requests from the contents of the source circuitry's cache, yielding significant performance benefits compared with retrieving that data from the shared memory 80.
[0072] However, it is expected that this performance benefit will last only a certain amount of time after the switch, after which the contents of the source processor's cache will become stale. Thus, at some point, a sniffing interruption event will be generated to disable sniffing in step 245, after which the source processor's cache will be cleaned in step 250 and the source processor will then be shut down in step 255. A discussion of the various scenarios under which the sniffing interruption event can be generated is provided in more detail below with reference to figure 6G.
[0073] Figure 4A schematically illustrates the save operation performed in step 225 of figure 3 according to one embodiment. In particular, in one embodiment, the architectural state that needs to be stored from the source processing circuit 300 into the local cache 330 consists of the contents of a register file 310 referenced by an arithmetic logic unit (ALU) 305 during the performance of data processing operations, along with the contents of various special purpose registers 320 that identify assorted pieces of information required by the workload to enable control of the workload to be taken over successfully by the destination processing circuit set. The contents of the special purpose registers 320 will include, for example, a program counter value identifying the current instruction being executed, along with various other information. For example, other special purpose registers include processor status registers (for example, the CPSR and SPSR in the ARM architecture), which hold control bits for processor mode, interrupt masking, execution state and flags. Other special purpose registers include architectural control registers (the CP15 system control register in the ARM architecture), which hold bits for changing data endianness, enabling or disabling the MMU, enabling or disabling data/instruction caches, etc. Other special purpose registers in CP15 store exception addresses and status information.
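As a rough illustration, the architectural state described above can be pictured as a plain data structure that is copied into a cache-resident buffer during the save operation. This is a hypothetical sketch only: the field names, sizes and the `save_state` helper are invented for illustration and are not part of the embodiment (in the embodiment, the save is performed by the processor itself, and marking the cached lines shareable is platform specific):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the architectural state saved in step 225.
 * Field names and widths are illustrative, not taken from the patent. */
typedef struct {
    uint32_t gpr[16];  /* general purpose register file 310 */
    uint32_t pc;       /* program counter: the current instruction */
    uint32_t cpsr;     /* processor status: mode, interrupt mask, flags */
    uint32_t spsr;     /* saved processor status register */
    uint32_t sctlr;    /* CP15 system control register */
} arch_state_t;

/* Save: copy the live state into a cache-resident buffer, which would
 * then be marked shareable so the sniff control unit 75 can see it. */
void save_state(const arch_state_t *live, arch_state_t *cached_copy) {
    memcpy(cached_copy, live, sizeof *cached_copy);
    /* mark_lines_shareable(cached_copy, ...);  -- hardware specific */
}
```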
[0074] As schematically illustrated in figure 4A, the source processing circuit 300 will also typically hold some processor specific configuration information 315, but this information does not need to be saved in the cache 330, since it is not applicable to the destination processing circuit set. Typically, the processor specific configuration information 315 is permanently encoded in the source processing circuit 300 using logic constants, and may include, for example, the contents of the CP15 processor ID register (which will be different for each processing circuit) or the contents of the CP15 cache type register (which will depend on the configuration of the caches 25, 30, 60, for example indicating that the caches have different line lengths). When the operating system 115 requires a piece of processor specific configuration information 315, then, unless the processor is already in hypervisor mode, execution traps to hypervisor mode. In response, the virtualizer 120 may, in one embodiment, return the value of the requested information, but in another embodiment will return a "virtual" value. In the case of the processor ID value, this virtual value can be chosen to be the same for both the "large" and "small" processors, thereby causing the actual hardware configuration to be hidden from the operating system 115 by the virtualizer 120.
[0075] As schematically illustrated in figure 4A, during the save operation the contents of the register file 310 and the special purpose registers 320 are stored by the source processing circuit in the cache 330 to form a cached copy 335. This cached copy is then marked as shareable, which allows the destination processor to sniff this state via the sniff control unit 75.
[0076] The restore operation subsequently performed on the destination processor is illustrated schematically in figure 4B. In particular, the destination processing circuit 350 (which may or may not have its own local cache) issues a request for a particular item of architectural state, with this request being intercepted by the sniff control unit 75. The sniff control unit then issues a sniff request to the local cache 330 of the source processing circuit to determine whether that item of architectural state is present in the source cache. Because of the steps taken during the save operation discussed in figure 4A, a hit will be detected in the source cache 330, resulting in that cached architectural state being returned via the sniff control unit 75 to the destination processing circuit 350. This process can be repeated iteratively until all the items of architectural state have been retrieved by sniffing the cache of the source processing circuit. Any processor specific configuration information relevant to the destination processing circuit 350 is typically permanently encoded in the destination processing circuit 350 in the manner discussed above. Thus, once the restore operation has been completed, the destination processing circuit has all the information required to enable it to take over handling of the workload successfully.
[0077] Additionally, in one embodiment, regardless of whether the workload 100 is being performed by the "large" processing circuit 10 or the "small" processing circuit 50, the virtualizer 120 provides the operating system 115 with virtual configuration information having the same values, so that the hardware differences between the "large" and "small" processing circuits 10, 50 are masked from the operating system 115 by the virtualizer 120. This means that the operating system 115 is unaware that performance of the workload 100 has been transferred to a different hardware platform.
[0078] In accordance with the save and restore operations described in relation to figures 4A and 4B, the various processor instances 10, 50 are arranged to be hardware cache coherent with one another in order to reduce the amount of time, power and hardware complexity involved in transferring the architectural state from the source processor to the destination processor. The technique uses the source processor's local cache to store all the state that must be transferred from the source processor to the destination processor and that is not available from shared memory at the time the transfer operation occurs. Because the state is marked as shareable in the source processor's cache, the hardware cache coherent destination processor is able to sniff this state during the transfer operation. Using such a technique, it is possible to transfer the state between the processor instances without needing to save that state either to main memory or to a memory-mapped storage element. This therefore produces significant performance and energy consumption benefits, increasing the variety of situations in which switching the workload would be appropriate in order to seek energy consumption benefits.
[0079] However, although the above cache coherence technique provides an accelerated mechanism for making the current architectural state available to the destination processor without routing it through shared memory, it is not the only way in which such an accelerated mechanism can be implemented. For example, figure 5 illustrates an alternative mechanism in which a dedicated bus 380 is provided between the source processing circuit 300 and the destination processing circuit 350 in order to allow the architectural state to be transferred during the transfer operation. Hence, in such embodiments, the save and restore operations 225, 230 of figure 3 are replaced by an alternative transfer mechanism that uses the dedicated bus 380. Although such an approach typically has a higher hardware cost than the cache coherence approach (which typically makes use of hardware already provided in the data processing system), it provides an even faster way of performing the switch, which can be beneficial in certain implementations.
[0080] Figures 6A through 6I schematically illustrate a series of steps that are performed in order to transfer the performance of a workload from the source processing circuit set 300 to the destination processing circuit set 350. The source processing circuit set 300 is whichever of the processing circuits 10, 50 was performing the workload prior to the transfer, with the destination processing circuit set being the other of the processing circuits 10, 50.
[0081] Figure 6A shows the system in an initial state in which the source processing circuit set 300 is powered by the power controller 65 and is performing the processing workload 100, while the destination processing circuit set 350 is in the energy saving condition. In this embodiment, the energy saving condition is a powered-down condition, but, as noted above, other types of energy saving condition can also be used. The workload 100, including the applications 105, 110 and an operating system 115 for running the applications 105, 110, is abstracted from the hardware platform of the source processing circuit set 300 by the virtualizer 120. During performance of the workload 100, the source processing circuit set 300 maintains the architectural state 400, which can comprise, for example, the contents of the register file 310 and the special purpose registers 320, as shown in figure 4A.
[0082] In figure 6B, a transfer stimulus 430 is detected by the virtualizer 120. Although the transfer stimulus 430 is shown in figure 6B as an external event (for example, detection of thermal runaway by the thermal sensor 90), the transfer stimulus 430 can also be an event triggered by the virtualizer 120 itself or by the operating system 115 (for example, the operating system 115 can be configured to inform the virtualizer 120 when a particular type of application is to be processed). The virtualizer 120 responds to the transfer stimulus 430 by controlling the power controller 65 to supply power to the destination processing circuit set 350 in order to place the destination processing circuit set 350 in a powered state.
[0083] In figure 6C, the destination processing circuit set 350 starts executing the virtualizer 120. The virtualizer 120 controls the destination processing circuit set 350 to invalidate its cache 420, in order to prevent processing errors caused by erroneous data values that may be present in the cache 420 when the destination processing circuit set 350 is powered up. While the destination cache 420 is being invalidated, the source processing circuit set 300 continues to perform the workload 100. When invalidation of the destination cache 420 is complete, the virtualizer 120 controls the destination processing circuit set 350 to signal to the source processing circuit set 300 that it is ready for transfer of the workload 100. By continuing to process the workload 100 on the source processing circuit set 300 until the destination processing circuit set 350 is ready for the transfer operation, the performance impact of the transfer can be reduced.
[0084] In the next stage, shown in figure 6D, the source processing circuit set 300 stops performing the workload 100. During this stage, neither the source processing circuit set 300 nor the destination processing circuit set 350 performs the workload 100. A copy of the architectural state 400 is transferred from the source processing circuit set 300 to the destination processing circuit set 350. For example, the architectural state 400 can be saved in the source cache 410 and restored to the destination processing circuit set 350, as shown in figures 4A and 4B, or can be transferred over a dedicated bus, as shown in figure 5. The architectural state 400 contains all the state information required for the destination processing circuit set 350 to perform the workload 100, other than the information already present in shared memory 80.
[0085] Having transferred the architectural state 400 to the destination processing circuit set 350, the source processing circuit set 300 is placed in the energy saving state by the power control circuit 65 (see figure 6E), with the exception that the source cache 410 remains powered. Meanwhile, the destination processing circuit set 350 begins to perform the workload 100 using the transferred architectural state 400.
[0086] When the destination processing circuit set 350 starts processing the workload 100, the sniffing period begins (see figure 6F). During the sniffing period, the sniff control unit 75 can sniff the data stored in the source cache 410 and retrieve that data on behalf of the destination processing circuit set 350. When the destination processing circuit set 350 requests data that is not present in the destination cache 420, it requests the data from the sniff control unit 75. The sniff control unit 75 then sniffs the source cache 410 and, if the sniff results in a cache hit, the sniff control unit 75 retrieves the sniffed data from the source cache 410 and returns it to the destination processing circuit set 350, where the sniffed data can be stored in the destination cache 420. On the other hand, if the sniff results in a cache miss in the source cache 410, the requested data is fetched from shared memory 80 and returned to the destination processing circuit set 350. Since accessing data in the source cache 410 is faster and requires less energy than accessing shared memory 80, sniffing the source cache 410 for a period increases processing performance and reduces energy consumption during the initial period following the transfer of the workload 100 to the destination processing circuit set 350.
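The read path during the sniffing period can be sketched with a toy direct-mapped cache model. The structures and function names below are illustrative assumptions, not the patent's hardware; the sketch only shows the decision order: destination cache hit first, then a sniff of the source cache, then a fall-back to shared memory 80:

```c
#include <stdbool.h>
#include <stdint.h>

#define LINES 64

/* Toy direct-mapped cache model; illustrative only. */
typedef struct {
    bool valid[LINES];
    uint32_t tag[LINES];
    uint32_t data[LINES];
} cache_t;

bool cache_lookup(const cache_t *c, uint32_t addr, uint32_t *out) {
    uint32_t idx = addr % LINES;
    if (c->valid[idx] && c->tag[idx] == addr) { *out = c->data[idx]; return true; }
    return false;
}

void cache_fill(cache_t *c, uint32_t addr, uint32_t v) {
    uint32_t idx = addr % LINES;
    c->valid[idx] = true; c->tag[idx] = addr; c->data[idx] = v;
}

/* Sniffing-period read path: a destination miss causes a sniff of the
 * source cache 410; a sniff hit returns the cached value (which may be
 * stored in the destination cache 420), a sniff miss falls back to
 * shared memory 80. */
uint32_t sniff_read(cache_t *dest, cache_t *src, const uint32_t *shared_mem,
                    uint32_t addr, bool sniffing_enabled) {
    uint32_t v;
    if (cache_lookup(dest, addr, &v)) return v;            /* destination hit */
    if (sniffing_enabled && cache_lookup(src, addr, &v)) { /* sniff hit */
        cache_fill(dest, addr, v);
        return v;
    }
    v = shared_mem[addr];                                  /* fetch from memory */
    cache_fill(dest, addr, v);
    return v;
}
```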
[0087] In the step shown in figure 6G, the sniff control unit 75 detects a sniffing interruption event, which indicates that it is no longer efficient to keep the source cache 410 in the powered state. The sniffing interruption event triggers the end of the sniffing period. The sniffing interruption event can be any one of a set of sniffing interruption events monitored by the sniff control circuit 75. For example, the set of sniffing interruption events can include any one or more of the following events: a) when the percentage or fraction of sniffs that result in a cache hit in the source cache 410 (that is, a quantity proportional to the number of sniffing hits divided by the total number of sniffs) falls below a predetermined threshold level after the destination processing circuit set 350 has begun performing the workload 100; b) when the number of transactions, or the number of transactions of a predetermined type (for example, cacheable transactions), performed since the destination processing circuit set 350 began performing the workload 100 exceeds a predetermined limit; c) when the number of processing cycles elapsed since the destination processing circuit set 350 began performing the workload 100 exceeds a predetermined limit; d) when a particular region of shared memory 80 is accessed for the first time since the destination processing circuit set 350 began performing the workload 100; e) when a particular region of shared memory 80, which was accessed for an initial period after the destination processing circuit set 350 began performing the workload 100, is not accessed for a predetermined number of cycles or for a predetermined period of time; f) when the destination processing circuit set 350 writes to a predetermined memory location for the first time since performance of the transferred workload 100 began.
[0088] These sniffing interruption events can be detected using programmable counters in the coherent interconnect 70, which includes the sniff control unit 75. Other types of sniffing interruption events can also be included in the set of sniffing interruption events.
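A minimal sketch of how such programmable counters might feed a sniffing interruption decision is given below, covering events (a) and (b) of the list above. All thresholds are invented example values, not values from the embodiment:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative programmable counters for two of the listed events:
 * (a) sniff hit ratio below a threshold, (b) transaction count above
 * a limit. The thresholds are made-up example values. */
typedef struct {
    uint64_t sniffs;        /* total sniffs of the source cache 410 */
    uint64_t sniff_hits;    /* sniffs that hit in the source cache */
    uint64_t transactions;  /* transactions since the workload transfer */
} sniff_counters_t;

bool sniffing_interruption_event(const sniff_counters_t *c) {
    const uint64_t MIN_SNIFFS = 1024;          /* avoid a noisy early ratio */
    const uint64_t MAX_TRANSACTIONS = 1u << 20;
    const double HIT_RATIO_THRESHOLD = 0.10;

    if (c->transactions > MAX_TRANSACTIONS)    /* event (b) */
        return true;
    if (c->sniffs >= MIN_SNIFFS &&             /* event (a) */
        (double)c->sniff_hits / (double)c->sniffs < HIT_RATIO_THRESHOLD)
        return true;
    return false;
}
```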
[0089] Upon detecting a sniffing interruption event, the sniff control unit 75 sends a sniffing interruption signal 440 to the source processor 300. The sniff control unit 75 stops sniffing the source cache 410 and, from then on, responds to data access requests from the destination processing circuit set 350 by fetching the requested data from shared memory 80 and returning the fetched data to the destination processing circuit set 350, where the fetched data can be cached.
[0090] In figure 6H, the source cache control circuit responds to the sniffing interruption signal 440 by cleaning the cache 410 in order to save to shared memory 80 all valid and dirty data values (that is, values whose cached copy is more up to date than the corresponding value in shared memory 80).
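The clean operation of figure 6H can be sketched as a walk over a toy cache model, writing back every valid and dirty line to shared memory. The data structure is an illustrative assumption, not the actual cache hardware:

```c
#include <stdbool.h>
#include <stdint.h>

#define NLINES 8

/* Toy write-back cache model; illustrative only. */
typedef struct {
    bool valid[NLINES];
    bool dirty[NLINES];
    uint32_t tag[NLINES];   /* used here directly as a memory address */
    uint32_t data[NLINES];
} wb_cache_t;

/* Clean: write every valid, dirty line back to shared memory before
 * the cache is powered off; clean lines need no write-back. */
void clean_cache(wb_cache_t *c, uint32_t *shared_mem) {
    for (int i = 0; i < NLINES; i++) {
        if (c->valid[i] && c->dirty[i]) {
            shared_mem[c->tag[i]] = c->data[i];  /* write back dirty value */
            c->dirty[i] = false;
        }
    }
}
```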
[0091] In figure 6I, the source cache 410 is then turned off by the power controller 65, so that the source processing circuit set 300 is entirely in the energy saving state. The destination processing circuit set 350 continues to perform the workload 100. From the point of view of the operating system 115, the situation is now the same as in figure 6A. The operating system 115 is unaware that execution of the workload has been transferred from one processing circuit to another. When another transfer stimulus occurs, the same steps of figures 6A through 6I can be used to transfer performance of the workload back to the first processor (in this case, the roles of the processing circuits 10, 50 as the "source processing circuit set" and the "destination processing circuit set" are reversed).
[0092] In the embodiment of figures 6A to 6I, independent power control of the cache 410 and of the source processing circuit set 300 is available, so that the source processing circuit set 300, other than the source cache 410, can be turned off once the destination processing circuit set 350 has begun performing the workload (see figure 6E), while only the cache 410 of the source processing circuit set 300 remains in the powered state (see figures 6F to 6H). The source cache 410 is then turned off in figure 6I. This approach can be useful for saving energy, especially when the source processing circuit set 300 is the "large" processing circuit 10.
[0093] However, it is also possible to continue powering the entire source processing circuit set 300 during the sniffing period and then to put the source processing circuit set 300 as a whole into the energy saving state in figure 6I, following the end of the sniffing period and the cleaning of the source cache 410. This can be more useful if the source cache 410 is too deeply embedded within the source processor core to be powered independently of the source processor core. This approach can also be more practical when the source processor is the "small" processing circuit 50, whose power consumption is insignificant compared with that of the "large" processing circuit 10: once the "large" processing circuit 10 has started processing the transferred workload 100, keeping the "small" processing circuit 50, rather than only its cache 60, out of the energy saving state during the sniffing period may have little effect on the overall system power consumption. This may mean that the extra hardware complexity of providing individual power control for the "small" processing circuit 50 and the "small" core's cache 60 may not be justified.
[0094] In some situations, it may be known before the workload is transferred that the data stored in the source cache 410 will not be required by the destination processing circuit set 350 when it starts to perform the workload 100. For example, the source processing circuit set 300 may have just completed an application when the transfer takes place, in which case the data in the source cache 410 at the time of the transfer relates to the completed application and not to the application being performed by the destination processing circuit set 350 after the transfer. In such a case, a sniffing override controller can trigger the virtualizer 120 and the sniff control circuit 75 to override sniffing of the source cache 410 and to control the source processing circuit set 300 to clean and shut down the source cache 410 without waiting for a sniffing interruption event to signal the end of the sniffing period. In this case, the technique of figures 6A through 6I jumps from the stage of figure 6E straight to the stage of figure 6G, without the stage of figure 6F in which data is sniffed from the source cache 410. Thus, if it is known in advance that the data in the source cache 410 will not be useful to the destination processing circuit set 350, energy can be saved by placing the source cache 410 and the source processing circuit set 300 in the energy saving condition without waiting for a sniffing interruption event. The sniffing override controller can be part of the virtualizer 120 or can be implemented as firmware running on the source processing circuit set 300. The sniffing override controller can also be implemented as a combination of elements; for example, the operating system 115 can inform the virtualizer 120 when an application has terminated, and the virtualizer 120 can then override sniffing of the source cache 410 if a transfer occurs once an application has terminated.
[0095] Figure 7 is a graph in which line 600 illustrates how energy consumption varies with performance. For various parts of this graph, the data processing system can be arranged to use different combinations of the processor cores 15, 20, 55 illustrated in figure 1 in order to seek to achieve the appropriate trade-off between performance and energy consumption. Hence, by way of example, when numerous very high performance tasks need to be performed, both large cores 15, 20 of the processing circuit 10 can be run in order to achieve the desired performance. Optionally, supply voltage variation techniques can be used to allow some variation in performance and energy consumption when using these two cores.
[0096] When the performance requirements fall to a level at which the required performance can be achieved using only one of the large cores, the tasks can be migrated to just one of the large cores 15, 20, with the other core being turned off or placed in some other energy saving condition. Again, supply voltage variation can be used to allow some variation between performance and energy consumption when using such a single large core. It should be noted that the transition from two large cores to one large core does not require the generation of a transfer stimulus, nor the use of the above techniques for transferring the workload, since in both cases it is the processing circuit 10 that is being used, and the processing circuit 50 remains in an energy saving condition. However, as indicated by the dotted line 610 in figure 7, when performance drops to a level at which the small core can achieve the required performance, a transfer stimulus can be generated to trigger the above mechanism for transferring the entire workload from the processing circuit 10 to the processing circuit 50, such that the entire workload is then performed on the small core 55, with the processing circuit 10 being placed in an energy saving condition. Again, supply voltage variation can be used to allow some variation in the performance and energy consumption of the small core 55.
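A hypothetical policy illustrating how a required performance level might be mapped onto a core configuration, in the spirit of figure 7, is sketched below. The thresholds and names are invented for illustration; note that, as described above, only transitions across the dotted line 610 (to or from the small core) would generate a transfer stimulus, while big-to-big migration would not:

```c
/* Invented policy sketch: map a required performance level (0..100)
 * onto a core configuration. The 30/70 thresholds are illustrative. */
typedef enum { SMALL_CORE, ONE_BIG_CORE, TWO_BIG_CORES } core_config_t;

core_config_t choose_config(unsigned required_perf) {
    if (required_perf <= 30) return SMALL_CORE;    /* below dotted line 610 */
    if (required_perf <= 70) return ONE_BIG_CORE;  /* one of cores 15, 20 */
    return TWO_BIG_CORES;                          /* both large cores */
}
```

Within each configuration, supply voltage variation (DVFS) would provide finer-grained tuning along line 600.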
[0097] Figures 8A and 8B respectively illustrate microarchitectural differences between a low performance processor pipeline 800 and a high performance processor pipeline 850 according to one embodiment. The low performance processor pipeline 800 of figure 8A would be suitable for the small processing core 55 of figure 1, while the high performance processor pipeline 850 of figure 8B would be suitable for the large cores 15, 20.
[0098] The pipeline of the low performance processor 800 of figure 8A comprises a fetch stage 810 for fetching instructions from memory 80, a decode stage 820 for decoding the fetched instructions, an issue stage 830 for issuing instructions for execution, and multiple execution pipelines, which include an integer pipeline 840 for performing integer operations, a MAC pipeline 842 for performing multiply-accumulate operations, and a SIMD/FPU pipeline 844 for performing SIMD (single instruction, multiple data) or floating point operations. In the pipeline of the low performance processor 800, the issue stage 830 issues a single instruction at a time, and issues the instructions in the order in which they were fetched.
[0099] The pipeline of the high performance processor 850 of figure 8B comprises a fetch stage 860 for fetching instructions from memory 80, a decode stage 870 for decoding the fetched instructions, a rename stage 875 for renaming registers specified in the decoded instructions, a dispatch stage 880 for dispatching instructions for execution, and multiple execution pipelines, which include two integer pipelines 890, 892, a MAC pipeline 894 and two SIMD/FPU pipelines 896, 898. In the high performance processor pipeline 850, the dispatch stage 880 is a parallel issue stage that can issue multiple instructions to different pipelines 890, 892, 894, 896, 898 at once. The dispatch stage 880 can also issue instructions out of order. Unlike in the low performance processor pipeline 800, the SIMD/FPU pipelines 896, 898 have variable lengths, which means that operations proceeding through the SIMD/FPU pipelines 896, 898 can be controlled to bypass certain stages. An advantage of such an approach is that, if each of the multiple execution pipelines has different resources, there is no need to artificially lengthen the shortest pipeline to make it the same length as the longest pipeline; instead, logic is required to deal with the out-of-order nature of the results produced by the different pipelines (for example, to put everything back in order if a processing exception occurs).
[00100] The rename stage 875 is provided to map register specifiers, which are included in program instructions and identify particular architectural registers as seen from a programmer's model point of view, onto physical registers, which are the actual registers of the hardware platform. The rename stage 875 enables the microprocessor to provide a larger pool of physical registers than is present in the programmer's model view of the microprocessor. This larger pool of physical registers is useful during out-of-order execution because it enables hazards, such as write-after-write (WAW) hazards, to be avoided by mapping the same architectural register specified in two or more different instructions onto two or more different physical registers, so that the different instructions can be executed concurrently. For more details on register renaming techniques, the reader is referred to the commonly owned US patent application 2008/114966 and US patent 7,590,826.
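The effect of a rename stage on a WAW hazard can be sketched with a minimal rename table: each instruction that writes an architectural register is allocated a fresh physical register, so two instructions writing the same architectural register land in different physical registers and can execute concurrently. The free-register management below is deliberately naive (a bumped counter rather than a real free list with reclamation), and the structure is an illustrative assumption, not the design of the referenced patents:

```c
#include <stdint.h>

#define ARCH_REGS 16
#define PHYS_REGS 64

/* Minimal rename-table sketch; illustrative only. */
typedef struct {
    int map[ARCH_REGS];  /* architectural -> physical register mapping */
    int next_free;       /* naive allocator: just bump a counter */
} rename_table_t;

void rename_init(rename_table_t *t) {
    for (int i = 0; i < ARCH_REGS; i++) t->map[i] = i;
    t->next_free = ARCH_REGS;
}

/* Rename a destination register: allocate a fresh physical register,
 * so a later write to the same architectural register (a WAW hazard)
 * gets a different physical register. */
int rename_dest(rename_table_t *t, int arch_reg) {
    int phys = t->next_free++ % PHYS_REGS;
    t->map[arch_reg] = phys;
    return phys;
}

/* Rename a source register: read the current mapping. */
int rename_src(const rename_table_t *t, int arch_reg) {
    return t->map[arch_reg];
}
```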
[00101] The low performance pipeline 800 and the high performance pipeline 850 are microarchitecturally different in numerous ways. The microarchitectural differences can include: a) the pipelines having different stages. For example, the high performance pipeline 850 has a rename stage 875 that is not present in the low performance pipeline 800; b) the pipeline stages having different capabilities. For example, the issue stage 830 of the low performance pipeline 800 is capable of single issue only, while the dispatch stage 880 of the high performance pipeline 850 can issue instructions in parallel. Parallel issue improves the throughput of the pipeline and thereby improves performance; c) the pipeline stages having different lengths. For example, the decode stage 870 of the high performance pipeline 850 may include three substages, while the decode stage 820 of the low performance pipeline 800 may include only a single substage. The longer a pipeline stage (the greater the number of substages), the greater the number of instructions that can be in flight at the same time, and therefore the greater the operating frequency at which the pipeline can operate, resulting in a higher level of performance; d) a different number of execution pipelines (for example, the high performance pipeline 850 has more execution pipelines than the low performance pipeline 800). By providing more execution pipelines, more instructions can be processed in parallel and so performance improves; e) provision of in-order execution (as in the pipeline 800) or out-of-order execution (as in the pipeline 850). When instructions can be executed out of order, performance improves, since the execution of instructions can be dynamically scheduled to optimize performance.
For example, in the in-order low performance pipeline 800, a series of MAC instructions would need to be executed one by one by the MAC pipeline 842 before a subsequent instruction could be executed by either the integer pipeline 840 or the SIMD/floating point pipeline 844. In contrast, in the high performance pipeline 850, the MAC instructions can be executed by the MAC pipeline 894 while (subject to any data hazards that cannot be resolved by renaming) a subsequent instruction that uses a different execution pipeline 890, 892, 896, 898 can be executed in parallel with the MAC instructions. This means that out-of-order execution can improve processing performance.
[00102] These and other examples of microarchitectural differences result in the pipeline 850 providing higher performance processing than the pipeline 800. On the other hand, the microarchitectural differences also make the pipeline 850 consume more energy than the pipeline 800. Thus, providing microarchitecturally different pipelines 800, 850 enables workload processing to be optimized either for high performance (using the "large" processing circuit 10 with the high performance pipeline 850) or for energy efficiency (using the "small" processing circuit 50 with the low performance pipeline 800).
[00103] Figure 9 shows a graph illustrating the variation in energy consumption of the data processing system as the performance of workload 100 is switched between the large processing circuit 10 and the small processing circuit 50.
[00104] At point A of figure 9, the workload 100 is being performed on the small processing circuit set 50, and so the energy consumption is low. At point B, a transfer stimulus occurs indicating that high intensity processing is to be performed, and so performance of the workload is transferred to the large processing circuit set 10. Energy consumption then increases and remains high at point C while the large processing circuit set 10 is performing the workload. At point D, both large cores are assumed to be operating in combination to process the workload. However, if the performance requirements fall to a level at which the workload can be handled by only one of the large cores, the workload is migrated to just one of the large cores and the other is turned off, as indicated by the drop in energy to the level adjacent to point E. At point E, however, another transfer stimulus occurs (indicating that a return to low intensity processing is desired), triggering a transfer of performance of the workload back to the small processing circuit set 50.
[00105] When the small processing circuit set 50 starts processing the workload, most of the large processing circuit set is in the energy saving state, but the cache of the large processing circuit set 10 remains powered during the sniffing period (point F of figure 9) to enable the data in the cache to be retrieved for the small processing circuit set 50. Consequently, the cache of the large processing circuit set 10 causes the energy consumption at point F to be higher than at point A, where only the small processing circuit set 50 was powered. At the end of the sniffing period, the cache of the large processing circuit set 10 is turned off and, at point G, the energy consumption returns to the low level at which only the small processing circuit set 50 is active.
[00106] As mentioned above, in figure 9 the energy consumption is higher during the sniffing period at point F than at point G, because the cache of the large processing circuit set 10 is powered during the sniffing period. Although this increase in energy consumption is indicated only after the large-to-small transition, there may also be a sniffing period following the small-to-large transition, during which the data in the cache of the small processing circuit set 50 can be sniffed on behalf of the large processing circuit set 10 by the sniff control unit 75. The sniffing period for the small-to-large transition is not indicated in figure 9 because the energy consumed by leaving the cache of the small processing circuit set 50 in a powered state during the sniffing period is insignificant compared with the energy consumed by the large processing circuit set 10 during performance of the processing workload, and so the very small increase in energy consumption due to the cache of the small processing circuit set 50 being powered is not visible in the graph of figure 9.
[00107] The embodiments described above describe a system that contains two or more architecturally compatible processor instances, with microarchitectures optimized for either energy efficiency or performance. The architectural state required by the operating system and applications can be switched between the processor instances, depending on the level of performance/energy required, to allow the entire workload to be switched between processor instances. In one embodiment, only one of the processor instances is running the workload at any given time, with the other processing instance being in an energy-saving condition or in the process of entering or leaving the energy-saving condition.
[00108] In one embodiment, the processor instances can be arranged to be hardware cache coherent with each other, to reduce the amount of time, energy and hardware complexity involved in switching the architectural state from the source processor to the destination processor. This reduces the time taken to carry out the switching operation, which increases the range of situations in which the techniques of the embodiments can be used.
[00109] Such systems can be used in a variety of situations where energy efficiency is important for battery life and/or thermal management, and where the spread of performance requirements is such that a more energy-efficient processor can be used for lower processing workloads, while a higher-performance processor can be used for higher processing workloads.
[00110] Because the two or more processing instances are architecturally compatible, from an application perspective the only difference between the two processors is the available performance. Through the techniques of one embodiment, all of the required architectural state can be moved between the processors without having to involve the operating system, such that it is transparent to the operating system, and to the applications running on it, which processor that operating system and those applications are running on.
[00111] When using architecturally compatible processor instances as described in the above embodiments, the total amount of architectural state that needs to be transferred can easily fit within a data cache, and since modern processing systems often implement cache coherence, then, by storing the switched architectural state in the data cache, the destination processor can quickly sniff that state in an energy-efficient manner using existing circuit structures.
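The retrieval mechanism of paragraph [00111] can be sketched as a toy coherence model: during the sniffing period, lookups by the destination circuitry are first satisfied from the still-energized source cache, and only on a miss is shared memory accessed. All data structures, addresses and values below are invented for illustration and do not represent the actual hardware.

```python
def make_system():
    # Shared memory, a source cache holding the switched architectural state,
    # an initially empty destination cache, and lookup statistics.
    memory = {addr: f"mem[{addr}]" for addr in range(8)}
    source_cache = {0: "state:r0", 1: "state:sp", 2: "state:pc"}
    dest_cache = {}
    stats = {"sniff_hits": 0, "memory_fetches": 0}
    return memory, source_cache, dest_cache, stats


def dest_read(addr, memory, source_cache, dest_cache, stats, sniffing=True):
    """Read by the destination circuitry, sniffing the source cache first."""
    if addr in dest_cache:
        return dest_cache[addr]
    if sniffing and addr in source_cache:
        stats["sniff_hits"] += 1      # value retrieved via cache sniffing
        value = source_cache[addr]
    else:
        stats["memory_fetches"] += 1  # fall back to shared memory
        value = memory[addr]
    dest_cache[addr] = value          # allocate into the destination cache
    return value


memory, src, dst, stats = make_system()
for addr in (0, 1, 2, 5):
    dest_read(addr, memory, src, dst, stats)
print(stats)  # → {'sniff_hits': 3, 'memory_fetches': 1}
```

The three items of architectural state are recovered without touching memory, which is the energy saving the sniffing period is intended to provide; only the address absent from the source cache costs a memory fetch.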
[00112] In one described embodiment, the switching mechanism is used to ensure that the thermal limit of the data processing system is not violated. In particular, when the thermal limit is close to being reached, the entire workload can be switched to a more energy-efficient processor instance, allowing the overall system to cool down while program execution continues, albeit at a lower throughput.
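The thermal policy of paragraph [00112] amounts to a simple decision rule, sketched below. The temperature threshold, switching margin and function name are hypothetical assumptions chosen for illustration; the disclosed embodiment specifies no particular values.

```python
THERMAL_LIMIT_C = 85.0   # illustrative thermal limit
SWITCH_MARGIN_C = 5.0    # switch before the limit is actually violated


def select_instance(temperature_c, current):
    """Return which processor instance should run the entire workload.

    When the large instance approaches the thermal limit, the whole workload
    moves to the more energy-efficient instance so execution continues,
    albeit at a lower throughput, while the system cools.
    """
    if current == "big" and temperature_c >= THERMAL_LIMIT_C - SWITCH_MARGIN_C:
        return "little"
    return current


print(select_instance(70.0, "big"))     # → big
print(select_instance(81.0, "big"))     # → little
print(select_instance(90.0, "little"))  # → little
```

A fuller policy would also switch back to the large instance after cooling, with hysteresis to avoid oscillating between instances; that refinement is omitted here for brevity.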
[00113] Although a particular embodiment has been described herein, it will be appreciated that the invention is not limited to it, and that many modifications and additions may be made within the scope of the invention. For example, various combinations of the features of the following dependent claims can be made with the features of the independent claims without departing from the scope of the present invention.
Claims (13)
[0001]
1. Data processing apparatus, comprising: first processing circuitry (10) and second processing circuitry (50), both configured to perform a processing workload (100) such that the processing workload is performed by one of the first processing circuitry and the second processing circuitry at a time; power control circuitry (65) for independently controlling the supply of power to the first processing circuitry and the second processing circuitry; a workload transfer controller configured to be responsive to a transfer stimulus to initiate a transfer of performance of the processing workload from source processing circuitry to destination processing circuitry before the source processing circuitry is placed in an energy-saving condition by the power control circuitry, the source processing circuitry being one of the first and second processing circuitry, and the destination processing circuitry being the other of the first and second processing circuitry; characterized by the fact that: at least the source processing circuitry has a cache (25); the power control circuitry is configured, following said transfer, to maintain at least the cache of the source processing circuitry in an energized condition during a sniffing period following the start of performance of the transferred processing workload by the destination processing circuitry; the data processing apparatus comprises cache sniffing circuitry (75) configured, during the sniffing period, to sniff data values in the cache of the source processing circuitry and to retrieve the sniffed data values for the destination processing circuitry; and the power control circuitry is configured to place at least the cache of the source processing circuitry in the energy-saving condition following the end of the sniffing period, wherein said power control circuitry is configured to place the source processing circuitry, other than the cache, in the energy-saving condition during the sniffing period, and to place the cache of the source processing circuitry in the energy-saving condition following the end of the sniffing period, and wherein the cache of the source processing circuitry is part of a cache hierarchy within the source processing circuitry and, during said sniffing period, that cache is maintained in the energized state while at least one other cache in the cache hierarchy is in the energy-saving condition.
[0002]
2. Data processing apparatus according to claim 1, characterized by the fact that the processing workload includes at least one processing application and at least one operating system for executing said at least one processing application.
[0003]
3. Data processing apparatus according to claim 2, characterized by the fact that the workload transfer controller is configured, during the transfer, to mask predetermined processor-specific configuration information from said at least one operating system, such that the transfer of the workload is transparent to said at least one operating system.
[0004]
4. Data processing apparatus according to claim 3, characterized by the fact that the workload transfer controller comprises at least virtualization software that logically separates said at least one operating system from the first processing circuitry and the second processing circuitry.
[0005]
5. Data processing apparatus according to any one of claims 1 to 4, characterized by the fact that the first processing circuitry (10) is architecturally compatible with the second processing circuitry (50), such that a processing workload to be performed by the data processing apparatus can be performed on either the first processing circuitry (10) or the second processing circuitry (50).
[0006]
6. Data processing apparatus according to claim 5, characterized by the fact that the first processing circuitry is microarchitecturally different from the second processing circuitry, such that the performance of the first processing circuitry is different from the performance of the second processing circuitry.
[0007]
7. Data processing apparatus according to any one of claims 1 to 6, characterized by the fact that the destination processing circuitry also comprises a cache.
[0008]
8. Data processing apparatus according to claim 7, characterized by the fact that the sniffed data values retrieved for the destination processing circuitry by the cache sniffing circuitry are stored in the cache of the destination processing circuitry.
[0009]
9. Data processing apparatus according to any one of claims 1 to 8, characterized by the fact that the power control circuitry is configured to keep the source processing circuitry in the energized condition during said sniffing period, and to place the source processing circuitry, including the cache, in the energy-saving condition following the end of the sniffing period.
[0010]
10. Data processing apparatus according to any one of claims 1 to 9, characterized by the fact that the source processing circuitry is configured to perform a clean operation on the cache of the source processing circuitry, to write all dirty data from the cache back to a shared memory, following the end of the sniffing period and before the power control circuitry places the cache of the source processing circuitry in the energy-saving condition.
[0011]
11. Data processing apparatus according to any one of claims 1 to 10, characterized by the fact that the destination processing circuitry is in an energy-saving condition before the transfer stimulus occurs, and the power control circuitry is configured to place the destination processing circuitry in the energized condition in response to the transfer stimulus.
[0012]
12. Data processing apparatus according to any one of claims 1 to 11, characterized by the fact that the sniffing period ends upon the occurrence of any one of a set of sniffing interruption events, which set includes at least one sniffing interruption event.
[0013]
13. Data processing method for an apparatus comprising first processing circuitry (10) and second processing circuitry (50) configured to perform a processing workload (100), such that the processing workload is performed by one of the first processing circuitry and the second processing circuitry at a time, the method being characterized by the fact that it comprises the steps of: performing the processing workload (100) with source processing circuitry, the source processing circuitry being one of the first and second processing circuitry and comprising a cache, the other of the first and second processing circuitry being destination processing circuitry; in response to a transfer stimulus, transferring performance of the processing workload from the source processing circuitry to the destination processing circuitry before the source processing circuitry is placed in an energy-saving condition; following the transfer step, maintaining at least the cache of the source processing circuitry in an energized condition during a sniffing period following the start of performance of the transferred processing workload by the destination processing circuitry; during the sniffing period, sniffing data values in the cache of the source processing circuitry and retrieving the sniffed data values for the destination processing circuitry; and placing at least said cache of the source processing circuitry in the energy-saving condition following the end of the sniffing period, wherein the source processing circuitry, other than the cache, is placed in the energy-saving condition during the sniffing period, and the cache of the source processing circuitry is placed in the energy-saving condition following the end of the sniffing period, the cache of the source processing circuitry being part of a cache hierarchy within the source processing circuitry and, during said sniffing period, that cache being maintained in the energized state while at least one other cache in the cache hierarchy is in the energy-saving condition.
Similar technologies:
Publication number | Publication date | Patent title
BR112012021121B1|2020-12-01|data processing apparatus, and, data processing method
BR112012021102B1|2020-11-24|DATA PROCESSING DEVICE, METHOD FOR OPERATING A DATA PROCESSING DEVICE
US20190332158A1|2019-10-31|Dynamic core selection for heterogeneous multi-core systems
TWI494850B|2015-08-01|Providing an asymmetric multicore processor system transparently to an operating system
US20110213935A1|2011-09-01|Data processing apparatus and method for switching a workload between first and second processing circuitry
US9146844B2|2015-09-29|Apparatus, method, and system for providing a decision mechanism for conditional commits in an atomic region
US8099556B2|2012-01-17|Cache miss detection in a data processing apparatus
JP4376692B2|2009-12-02|Information processing device, processor, processor control method, information processing device control method, cache memory
US20120079245A1|2012-03-29|Dynamic optimization for conditional commit
Gutierrez et al.2014|Evaluating private vs. shared last-level caches for energy efficiency in asymmetric multi-cores
US11249657B2|2022-02-15|Non-volatile storage circuitry accessible as primary storage for processing circuitry
Renau et al.0|Speculative Multithreading Does not | Waste Energy Draft paper submitted for publication. November 6, 2003. Please keep confidential
Hughes1999|Exploiting the Potential of a Network of IRAMs
Family patents:
Publication number | Publication date
DE112011100743B4|2014-07-10|
JP2013521556A|2013-06-10|
WO2011107775A1|2011-09-09|
US20130311725A1|2013-11-21|
KR101740225B1|2017-05-26|
RU2015107993A3|2018-09-27|
RU2550535C2|2015-05-10|
KR20130012120A|2013-02-01|
JP5702407B2|2015-04-15|
US9286222B2|2016-03-15|
US20110213993A1|2011-09-01|
IL221269D0|2012-10-31|
GB2490825B|2016-06-08|
DE112011100743T5|2013-06-06|
RU2015107993A|2015-06-27|
GB2490825A|2012-11-14|
RU2711336C2|2020-01-16|
RU2012141563A|2014-04-10|
GB201214397D0|2012-09-26|
CN102804103A|2012-11-28|
IL221269A|2017-02-28|
US8533505B2|2013-09-10|
BR112012021121A2|2017-07-18|
CN102804103B|2015-08-12|
Legal status:
2019-01-08| B06F| Objections, documents and/or translations needed after an examination request according [chapter 6.6 patent gazette]|
2019-09-17| B06U| Preliminary requirement: requests with searches performed by other patent offices: procedure suspended [chapter 6.21 patent gazette]|
2020-07-21| B09A| Decision: intention to grant [chapter 9.1 patent gazette]|
2020-12-01| B16A| Patent or certificate of addition of invention granted [chapter 16.1 patent gazette]|Free format text: TERM OF VALIDITY: 20 (TWENTY) YEARS COUNTED FROM 17/02/2011, SUBJECT TO THE LEGAL CONDITIONS.|
Priority:
Application number | Filing date | Patent title
US12/659,230|2010-03-01|
US12/659,230|US8533505B2|2010-03-01|2010-03-01|Data processing apparatus and method for transferring workload between source and destination processing circuitry|
PCT/GB2011/050315|WO2011107775A1|2010-03-01|2011-02-17|Data processing apparatus and method for transferring workload between source and destination processing circuitry|