Brazilian patent BR102013015049B1 apparatus and method

Abstract:
LOOP STORE LEARNING. The present invention relates to methods, apparatus, and processors for tracking loop candidates in an instruction stream. A loop store control unit detects a backward taken branch and starts tracking the loop candidate. The control unit tracks the taken branches of the loop candidate and records the distance from the start of the loop to each taken branch. If the distance to each taken branch remains the same over multiple loop iterations, the loop is stored in a loop store. The loop is then dispatched from the loop store, and the front end of the processor is powered down until the loop ends.
Publication number: BR102013015049B1
Application number: R102013015049-5
Filing date: 2013-06-17
Publication date: 2021-03-02
Inventors: Conrado Blasco-Allue; Ian D. Kountanis
Applicant: Apple Inc
Main IPC class:

Description:

Field of the Invention
[0001] The present invention relates to processors, and in particular to methods and mechanisms for identifying and learning the characteristics of a loop within an instruction stream.
Description of the Related Art
[0002] Modern processors are generally structured as multi-stage pipelines. Typical pipelines often include separate units for fetching instructions, decoding instructions, mapping instructions, executing instructions, and then writing results to another unit, such as a register. The instruction fetch unit of a microprocessor is responsible for providing a constant flow of instructions to the next stage of the processor pipeline. Typically, fetch units use an instruction cache to keep the rest of the pipeline continuously supplied with instructions. The fetch unit and instruction cache tend to consume a significant amount of power while performing their functions. It is a goal of modern microprocessors to reduce power consumption as much as possible, especially for microprocessors used in battery-powered devices.
[0003] In many software applications, the same steps can be repeated many times to perform a specific function or task. In these situations, the fetch unit continues to fetch instructions and consume power even though the same instruction loop is being executed continuously. If the loop could be detected and cached in a loop store, the fetch unit could be shut down to reduce power consumption while the loop runs. However, it is difficult to detect and learn an instruction loop within program code when the loop includes multiple branches. It is also challenging to determine accurately whether the loop is invariant before caching it in the loop store.
SUMMARY
[0004] Apparatus, processors and methods for detecting and tracking loops within an instruction stream are disclosed. A processor pipeline includes a loop store and a loop store control unit. The loop store control unit can detect loop termination branches in the instruction stream. In one embodiment, when the loop store control unit detects a loop termination branch, the control unit can latch the instruction address of the loop termination branch, set a loop detection flag, and start a loop iteration counter and a uop (micro-operation) counter.
[0005] The next time the same loop termination branch is detected, the control unit can compare the value of the uop counter with the size of the loop store. If the value of the uop counter is greater than the size of the loop store, the loop candidate cannot be stored in the loop store, and tracking is terminated. If the uop counter is less than the size of the loop store, the loop contents can be tracked over multiple loop iterations. For each loop iteration, if the loop contents remain the same during the iteration, the loop iteration counter can be incremented and loop tracking can continue.
[0006] In one embodiment, the taken branches of the loop can be tracked during each loop iteration. The distance from the start of the loop to each taken branch can be stored in a branch tracking table during the first iteration of the loop. During subsequent loop iterations, the value of the uop counter when a branch is detected can be compared to the corresponding value stored in the branch tracking table. If the distances from the start of the loop to the branches are invariant, loop tracking can continue. When the value of the loop iteration counter exceeds a predetermined threshold, the loop can be cached in the loop store. The loop can then be read from the loop store, and the fetch unit can be shut down until the loop ends.
[0007] These and other aspects and advantages will become apparent to those of ordinary skill in the art in view of the following detailed description of the approaches presented here.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] The above advantages and other advantages of the methods and mechanisms can be better understood by reference to the following description in conjunction with the accompanying drawings, in which:
[0009] Figure 1 illustrates an embodiment of a portion of an integrated circuit.
[00010] Figure 2 is a block diagram that illustrates an embodiment of a processor core.
[00011] Figure 3 is a block diagram that illustrates an embodiment of a front end of a processor pipeline.
[00012] Figure 4 illustrates a block diagram of another embodiment of a loop store within a fetch and decode unit.
[00013] Figure 5 is an embodiment of a sample loop.
[00014] Figure 6 illustrates an embodiment of a loop store control unit.
[00015] Figure 7 is a generalized flow diagram illustrating an embodiment of a method for tracking a loop candidate.
[00016] Figure 8 is a block diagram of an embodiment of a system.
[00017] Figure 9 is a block diagram of an embodiment of a computer-readable medium.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[00018] In the following description, numerous specific details are set forth to provide a thorough understanding of the methods and mechanisms presented here. However, a person of ordinary skill in the art should recognize that the various embodiments can be practiced without these specific details. In some cases, well-known structures, components, signals, computer program instructions and techniques have not been shown in detail to avoid obscuring the approaches described here. It will be appreciated that, for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some elements can be exaggerated relative to other elements.
[00019] This specification includes references to "one embodiment". The appearance of the phrase "in one embodiment" in different contexts does not necessarily refer to the same embodiment. Particular features, structures or characteristics can be combined in any suitable manner consistent with this disclosure. In addition, as used throughout this application, the word "can" is used in a permissive sense (that is, meaning having the potential to) rather than a mandatory sense (that is, meaning must). Likewise, the words "include", "including", and "includes" mean including, but not limited to.
[00020] Terminology. The following paragraphs provide definitions and/or context for terms found in this disclosure (including the appended claims):
[00021] "Comprising". This term is open-ended. As used in the appended claims, this term does not exclude additional structures or steps. Consider a claim that recites: "A processor comprising a loop store control unit ...". Such a claim does not prevent the processor from including additional components (for example, a cache, a fetch unit, an execution unit).
[00022] "Configured to". Various units, circuits, or other components can be described or claimed as "configured to" perform a task or tasks. In such contexts, "configured to" is used to connote structure by indicating that the units/circuits/components include structure (for example, circuitry) that performs the task or tasks during operation. As such, the unit/circuit/component can be said to be configured to perform the task even when it is not currently operational (for example, is not powered on). The units/circuits/components used with the "configured to" language include hardware, for example, circuits, memory storing program instructions executable to implement the operation, etc. Reciting that a unit/circuit/component is "configured to" perform one or more tasks is expressly intended not to invoke 35 U.S.C. § 112, sixth paragraph, for that unit/circuit/component. In addition, "configured to" can include generic structure (for example, generic circuitry) that is manipulated by software and/or firmware (for example, an FPGA or a general-purpose processor executing software) to operate in a manner that is capable of performing the task(s) at issue. "Configured to" can also include adapting a manufacturing process (for example, a semiconductor fabrication facility) to fabricate devices (for example, integrated circuits) that are adapted to implement or perform one or more tasks.
[00023] "Based on". As used here, this term describes one or more factors that affect a determination. It does not exclude additional factors that may affect the determination. That is, a determination can be based solely on those factors or based, at least in part, on those factors. Consider the phrase "determine A based on B". Although B may be a factor that affects the determination of A, this phrase does not exclude the determination of A from also being based on C. In other cases, A can be determined based solely on B.
[00024] Referring now to Figure 1, a block diagram illustrating an embodiment of a portion of an integrated circuit (IC) is shown. In the illustrated embodiment, IC 10 includes a processor complex 12, a memory controller 22, and memory physical interface circuits (PHYs) 24 and 26. It is noted that IC 10 can also include many other components not shown in Figure 1. In various embodiments, IC 10 can also be referred to as a system on chip (SoC), an application-specific integrated circuit (ASIC), or an apparatus.
[00025] Processor complex 12 can include central processing units (CPUs) 14 and 16, level 2 (L2) cache 18, and bus interface unit (BIU) 20. In other embodiments, processor complex 12 can include other numbers of CPUs. CPUs 14 and 16 can also be referred to as processors or cores. It is noted that processor complex 12 can include other components not shown in Figure 1.
[00026] CPUs 14 and 16 can include circuitry to execute instructions defined in an instruction set architecture (ISA). Specifically, one or more programs comprising the instructions can be executed by CPUs 14 and 16. Any instruction set architecture can be implemented in various embodiments. For example, in one embodiment, the ARM™ instruction set architecture can be implemented. The ARM instruction set can include 16-bit (or Thumb) and 32-bit instructions. Other exemplary ISAs include the PowerPC™ instruction set, the MIPS™ instruction set, the SPARC™ instruction set, the x86 instruction set (also referred to as IA-32), the IA-64 instruction set, etc.
[00027] In one embodiment, each instruction executed by CPUs 14 and 16 can be associated with a program counter (PC) value. In addition, one or more architectural registers can be specified within some instructions for reads and writes. These architectural registers can be mapped to actual physical registers by a register rename unit. Furthermore, some instructions (for example, ARM Thumb instructions) can be broken up into a sequence of instruction operations (or micro-ops), and each instruction operation of the sequence can be referred to by a unique micro-op (or uop) number.
[00028] Each of CPUs 14 and 16 can also include a level 1 (L1) cache (not shown), and each L1 cache can be coupled to L2 cache 18. Other embodiments can include additional cache levels (for example, a level 3 (L3) cache). In one embodiment, L2 cache 18 can be configured to cache instructions and data for low-latency access by CPUs 14 and 16. L2 cache 18 can have any capacity and configuration (for example, direct mapped, set associative). L2 cache 18 can be coupled to memory controller 22 via BIU 20. BIU 20 can also include various other logic structures to couple CPUs 14 and 16 and L2 cache 18 to various other devices and blocks.
[00029] Memory controller 22 can include any number of memory ports and can include circuitry configured to interface with memory. For example, memory controller 22 can be configured to interface with dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM), double data rate (DDR) SDRAM, DDR2 SDRAM, Rambus DRAM (RDRAM), etc. Memory controller 22 can also be coupled to memory physical interface circuits (PHYs) 24 and 26. Memory PHYs 24 and 26 are representative of any number of memory PHYs that can be coupled to memory controller 22. Memory PHYs 24 and 26 can be configured to interface with memory devices (not shown).
[00030] It is noted that other embodiments can include other combinations of components, including subsets or supersets of the components shown in Figure 1 and/or other components. Although one instance of a given component may be shown in Figure 1, other embodiments can include two or more instances of the given component. Likewise, throughout this detailed description, two or more instances of a given component can be included even if only one is shown, and/or embodiments that include only one instance can be used even if multiple instances are shown.
[00031] Turning now to Figure 2, an embodiment of a processor core is shown. Core 30 is one example of a processor core, and core 30 can be used within a processor complex, such as processor complex 12 of Figure 1. In one embodiment, each of CPUs 14 and 16 of Figure 1 can include the components and functionality of core 30. Core 30 can include fetch and decode (FED) unit 32, map and dispatch unit 36, memory management unit (MMU) 40, core interface unit (CIF) 42, execution units 44, and load-store unit (LSU) 46. It is noted that core 30 can include other components and interfaces not shown in Figure 2.
[00032] FED unit 32 can include circuitry configured to read instructions from memory and place them in level 1 (L1) instruction cache 34. L1 instruction cache 34 can be a cache memory for storing instructions to be executed by core 30. L1 instruction cache 34 can have any capacity and construction (for example, direct mapped, set associative, fully associative, etc.). In addition, L1 instruction cache 34 can have any cache line size. FED unit 32 can also include branch prediction hardware configured to predict branch instructions and to fetch down the predicted path. FED unit 32 can also be redirected (for example, via misprediction, exception, interrupt, flush, etc.).
[00033] FED unit 32 can be configured to decode instructions into instruction operations. In addition, FED unit 32 can be configured to decode multiple instructions in parallel. Generally, an instruction operation is an operation that the hardware included in execution units 44 and LSU 46 is capable of executing. Each instruction can translate to one or more instruction operations which, when executed, result in the performance of the operations defined for that instruction according to the instruction set architecture. It is noted that the terms "instruction operation" and "uop" can be used interchangeably throughout this disclosure. In other embodiments, the functionality included within FED unit 32 can be split into two or more separate units, such as a fetch unit, a decode unit, and/or other units.
[00034] In various ISAs, some instructions can decode into a single uop. FED unit 32 can be configured to identify the type of instruction, source operands, etc., and each decoded instruction operation can comprise the instruction along with some of the decode information. In other embodiments in which each instruction translates to a single uop, each uop can simply be the corresponding instruction or a portion thereof (for example, the opcode field or fields of the instruction). In some embodiments, FED unit 32 can include any combination of circuitry and/or microcode for generating uops for instructions. For example, relatively simple uop generations (for example, one or two uops per instruction) can be handled in hardware, while more extensive uop generations (for example, more than three uops for an instruction) can be handled in microcode.
[00035] Decoded uops can be provided to map/dispatch unit 36. Map/dispatch unit 36 can be configured to map uops and architectural registers to physical registers of core 30. Map/dispatch unit 36 can implement register renaming to map source register addresses from the uops to the source operand numbers identifying the renamed source registers. Map/dispatch unit 36 can also be configured to dispatch uops to reservation stations (not shown) within execution units 44 and LSU 46.
[00036] In one embodiment, map/dispatch unit 36 can include reorder buffer (ROB) 38. In other embodiments, ROB 38 can be located elsewhere. Prior to being dispatched, uops can be written to ROB 38. ROB 38 can be configured to hold uops until they can be committed in order. Each uop can be assigned a ROB index (RNUM) corresponding to a specific entry in ROB 38. RNUMs can be used to keep track of the operations in flight in core 30. Map/dispatch unit 36 can also include other components (for example, mapper array, dispatch unit, dispatch buffer) not shown in Figure 2. In addition, in other embodiments, the functionality included within map/dispatch unit 36 can be split into two or more separate units, such as a map unit, a dispatch unit, and/or other units.
[00037] Execution units 44 can include any number and type of execution units (for example, integer, floating point, vector). Each of execution units 44 can also include one or more reservation stations (not shown). CIF 42 can be coupled to LSU 46, FED unit 32, MMU 40, and an L2 cache (not shown). CIF 42 can be configured to manage the interface between core 30 and the L2 cache. MMU 40 can be configured to perform address translation and memory management functions.
[00038] LSU 46 can include L1 data cache 48, store queue 50, and load queue 52. Load and store operations can be dispatched from map/dispatch unit 36 to reservation stations within LSU 46. Store queue 50 can store data corresponding to store operations, and load queue 52 can store data associated with load operations. LSU 46 can also be coupled to the L2 cache via CIF 42. It is noted that LSU 46 can also include other components (for example, reservation stations, register file, prefetch unit, translation lookaside buffer) not shown in Figure 2.
[00039] It should be understood that the distribution of functionality illustrated in Figure 2 is not the only possible microarchitecture that can be used for a processor core. Other processor cores can include other components, omit one or more of the components shown, and/or include a different arrangement of functionality among the components.
[00040] Referring now to Figure 3, a block diagram of an embodiment of a front end of a processor pipeline is shown. In one embodiment, the front-end logic shown in Figure 3 can be located within a fetch and decode unit, such as FED unit 32 (of Figure 2). It should be understood that the distribution of functionality illustrated in Figure 3 is only one possible structure for implementing a loop store within a processor pipeline. Other suitable distributions of logic for implementing a loop store are possible and are contemplated.
[00041] Fetch front end 60 can be configured to fetch and pre-decode instructions and then convey pre-decoded uops to loop store 62 and decoders 70A-F (via multiplexer 68). In one embodiment, fetch front end 60 can be configured to output six pre-decoded uops per cycle. In other embodiments, fetch front end 60 can be configured to output other numbers of pre-decoded uops per cycle.
[00042] Loop store 62, multiplexer 68, and decoders 70A-F can have six lanes for processing and/or storing six uops per cycle. Each lane can include a valid bit to indicate whether the lane contains a valid uop. It is noted that the "lanes" of loop store 62, multiplexer 68, and decoders 70A-F can also be referred to as "slots" or "entries". In other embodiments, loop store 62, multiplexer 68, and decoders 70A-F can include more or fewer than six lanes, and fetch front end 60 can be configured to output as many uops per cycle as can be accommodated by the next stage of the pipeline.
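As an illustration of the six-wide organization with a valid bit per entry described for loop store 62, the following minimal sketch groups a uop sequence into per-cycle rows. The function name `pack_into_lanes`, the string payloads, and the policy of filling entries in order and padding the last row with invalid entries are assumptions made for illustration; the text does not specify how partially filled rows are formed.

```python
from typing import List, Optional, Tuple

# One entry: (valid bit, uop payload). The payload type is hypothetical.
Lane = Tuple[bool, Optional[str]]


def pack_into_lanes(uops: List[str], lanes: int = 6) -> List[List[Lane]]:
    """Split a uop sequence into per-cycle rows of `lanes` entries."""
    cycles: List[List[Lane]] = []
    for start in range(0, len(uops), lanes):
        group = uops[start:start + lanes]
        row: List[Lane] = [(True, uop) for uop in group]
        # Assumed policy: unused entries are padded with a clear valid bit.
        row += [(False, None)] * (lanes - len(group))
        cycles.append(row)
    return cycles
```

With eight uops and six entries per row, the second row would carry two valid entries followed by four invalid ones.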
[00043] Fetch front end 60 can expand instructions into uops and feed those uops to loop store 62 and multiplexer 68. In one embodiment, the instructions fetched by fetch front end 60 and decoded into pre-decoded uops can be based on the ARM ISA. Each pre-decoded uop can include instruction opcode bits, instruction pre-decode bits, and a uop number. The instruction opcode bits specify the operation to be performed. The pre-decode bits indicate the number of uops to which the instruction maps. The uop number indicates which uop of a multi-uop instruction sequence should be generated. In other embodiments, other ISAs can be used, and instructions can be decoded and formatted in different ways.
[00044] When the processor is not in loop store mode, the uops output from fetch front end 60 can be conveyed to decoders 70A-F via multiplexer 68. A select signal from loop store control unit 64 can be coupled to multiplexer 68 to determine which path is coupled through multiplexer 68 to the inputs of decoders 70A-F. When the processor is in loop store mode, uops can be read from loop store 62 and conveyed to decoders 70A-F. Uops can be conveyed from the outputs of decoders 70A-F to the next stage of the processor pipeline. In one embodiment, the next stage of the processor pipeline can be a map/dispatch unit, such as map/dispatch unit 36 of Figure 2.
[00045] Loop store control unit 64 can be configured to identify a loop within the fetched and pre-decoded instructions. Once a loop has been identified with some degree of certainty, the loop can be cached in loop store 62, fetch front end 60 can be shut down, and the rest of the processor pipeline can then be fed from loop store 62. In one embodiment, one iteration of the loop can be cached in loop store 62, and this cached iteration can be repeatedly dispatched down the pipeline. In another embodiment, multiple iterations of the loop can be cached in loop store 62.
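The three fields listed for a pre-decoded uop can be pictured as a small record. The sketch below is only a software model: the class name `PredecodedUop`, the helper `expand`, the field names, and the choice to number uops from zero are assumptions, and no bit widths or encodings are implied.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PredecodedUop:
    opcode_bits: int  # specifies the operation to be performed
    uop_count: int    # pre-decode information: uops the instruction maps to
    uop_number: int   # which uop of the sequence this one is (0-based here)


def expand(opcode_bits: int, uop_count: int) -> List[PredecodedUop]:
    """Expand one instruction into its sequence of pre-decoded uops."""
    return [PredecodedUop(opcode_bits, uop_count, n) for n in range(uop_count)]
```

An instruction that maps to three uops would expand into three records sharing the same opcode bits and carrying uop numbers 0, 1 and 2.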
[00046] To identify a loop to be cached, first a backward taken branch can be detected among the fetched instructions. A "backward taken branch" can be defined as a taken branch that targets an earlier instruction in the instruction sequence. The instruction that is the target of the backward taken branch can be considered the start of the loop. In one embodiment, only certain types of loops can be considered candidates for storage. For example, in one embodiment, for a loop candidate to be considered for storage, all iterations of the loop must be invariant. In other words, the loop candidate executes the same instruction sequence on each iteration. In addition, loops with indirect taken branches (for example, BX, branch exchange, or BLX, branch with link exchange) in the instruction sequence of the loop can be excluded from consideration for storage. Furthermore, only one backward taken branch per loop can be permitted. The remaining branches in the loop must be forward branches. In other embodiments, all types of loops can be considered, so that every type of loop is a loop candidate, and the only criterion enforced may be the invariance of the loop. For example, more than one backward taken branch can be allowed in a loop candidate, such as in a nested loop.
[00047] Loop store control unit 64 can monitor the instruction stream for instructions that form loops meeting the criteria for loop storage. Loop store control unit 64 can capture all of the information about what a given loop candidate looks like. For a certain period of time, the loop candidate can be tracked over multiple iterations to make sure it stays the same. For example, the distances from the start of the loop to one or more instructions within the loop can be recorded on a first iteration and monitored on subsequent iterations to determine whether those distances remain the same.
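The candidate criteria of paragraph [00046] can be sketched as two predicates. This is a simplified model under stated assumptions: the `Branch` record and the function names are hypothetical, "backward" is taken to mean a target address strictly lower than the branch address, and the eligibility check covers only the stricter embodiment (exactly one backward taken branch and no indirect taken branches).

```python
from typing import List, NamedTuple, Optional


class Branch(NamedTuple):
    pc: int                 # address of the branch instruction
    target: int             # address the branch goes to
    taken: bool
    indirect: bool = False  # e.g. BX or BLX


def is_backward_taken(branch: Branch) -> bool:
    """A taken branch whose target is an earlier instruction."""
    return branch.taken and branch.target < branch.pc


def loop_start(branch: Branch) -> Optional[int]:
    """The target of a backward taken branch marks the start of the loop."""
    return branch.target if is_backward_taken(branch) else None


def eligible(taken_branches: List[Branch]) -> bool:
    """Stricter embodiment: one backward taken branch, no indirect ones."""
    backward = [b for b in taken_branches if is_backward_taken(b)]
    return len(backward) == 1 and not any(b.indirect for b in taken_branches)
```

Under these assumptions a loop body with one forward branch and a closing backward branch is eligible, while a nested loop (two backward taken branches) or a body containing an indirect taken branch is not.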
[00048] In some embodiments, even if the loop candidate is invariant and meets the other criteria listed above, other characteristics of the loop candidate can disqualify it from being cached in loop store 62. For example, if the loop candidate is too large to fit in loop store 62, the loop candidate can be disqualified. In addition, there can be a maximum allowable number of taken branches within the loop, equal to the size of branch tracking table 66. If the number of taken branches exceeds this number, the loop can be excluded from consideration as a candidate for caching in loop store 62. In one embodiment, branch tracking table 66 can include eight entries for taken branches within a loop. In other embodiments, branch tracking table 66 can have more or fewer than eight entries. Once a loop candidate has been disqualified from being cached in loop store 62, the instruction address of the backward taken branch for that disqualified loop candidate can be recorded. Then, if this backward taken branch is detected again, the loop tracking logic can ignore it and restart only when a new backward taken branch is detected.
[00049] In one embodiment, once the same backward taken branch has been detected more than once, a finite state machine to capture the information for that loop can be started by loop store control unit 64. For example, loop store control unit 64 can use branch tracking table 66 to track the taken branches of a loop candidate. Branch tracking table 66 can keep track of the distance from the start of the loop to each taken branch. In one embodiment, the distance can be measured in uops. In another embodiment, the distance can be measured in instructions. In other embodiments, the distance can be measured using other metrics and/or a combination of two or more metrics. Measuring the distance from the start of the loop to each taken branch is a way to determine that the path through the underlying code has not changed.
[00050] If each iteration of the loop executes such that there are the same number of uops from the start of the loop to each branch, the loop candidate can be considered invariant. The distance to each branch in table 66 can be tracked for a certain number of iterations before determining that the loop candidate is invariant and should be cached. The amount of time allocated to tracking the invariance of the loop candidate can be based on a number of loop iterations and/or on a number of branches encountered.
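The disqualification rules of paragraph [00048] amount to two capacity checks plus a record of rejected candidates. In the sketch below, the class `CandidateFilter`, its method names, and the loop store size of 48 uops used in the usage note are hypothetical. Two further assumptions: a candidate that exactly fills the loop store or the table is treated as fitting, and rejected addresses are kept in a set, whereas the text only says that the address of the disqualified candidate can be recorded.

```python
class CandidateFilter:
    """Capacity checks for a loop candidate, keyed by its backward branch."""

    def __init__(self, loop_store_size: int, table_entries: int = 8) -> None:
        self.loop_store_size = loop_store_size  # capacity in uops
        self.table_entries = table_entries      # size of table 66
        self.disqualified = set()               # assumed: a set of addresses

    def should_track(self, backward_pc: int) -> bool:
        """Ignore a backward branch whose loop was already rejected."""
        return backward_pc not in self.disqualified

    def qualifies(self, backward_pc: int, uop_count: int,
                  taken_branches: int) -> bool:
        """Reject, and remember, loops that are too large or too branchy."""
        fits = (uop_count <= self.loop_store_size
                and taken_branches <= self.table_entries)
        if not fits:
            self.disqualified.add(backward_pc)
        return fits
```

For example, with `CandidateFilter(loop_store_size=48)`, a 30-uop loop with nine taken branches would be rejected, and a later `should_track` call for the same backward branch address would return `False`.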
[00051] In one embodiment, the only taken branches allowed within a loop candidate can be conditional branches that have the same target. In this embodiment, indirect branches may not be supported, since an indirect branch can have a different target on different iterations of the loop. An indirect branch could take two different paths through the code on two separate iterations, yet the loop could still be considered invariant by loop store control unit 64. This can occur because the distances could be the same even though the loop took two different paths on the two separate iterations. This would lead to the false determination that the loop is invariant. To prevent these false positives, indirect branches may not be supported. Therefore, in this embodiment, loop store control unit 64 can allow within a loop candidate only branches that have the same target on each loop iteration.
[00052] In another embodiment, indirect branches can be supported and allowed within loop candidates. In this embodiment, branch tracking table 66 can also include information indicating the target of each taken branch, to ensure that the loop is invariant. During each iteration of the loop candidate, the target of each branch in the loop can be compared to the value stored in table 66 to ensure that the target has not changed. In other embodiments, additional information can be included in branch tracking table 66 to ensure that the loop contents are invariant.
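The false positive described in paragraph [00051], and the way a stored target removes it in paragraph [00052], can be shown with two comparison functions. The entry layout (a distance in uops paired with a target address), the function names, and the example addresses are assumptions for illustration only.

```python
from typing import List, Tuple

# One table entry: (distance in uops from the loop start, branch target).
Entry = Tuple[int, int]


def same_distances(first: List[Entry], later: List[Entry]) -> bool:
    """Distance-only comparison, as in the embodiment without targets."""
    return [d for d, _ in first] == [d for d, _ in later]


def same_distances_and_targets(first: List[Entry], later: List[Entry]) -> bool:
    """Comparison that also checks each recorded branch target."""
    return first == later


# Hypothetical iterations: the first branch is indirect and jumps to a
# different target the second time, but both paths have the same length.
first_iteration = [(4, 0x140), (9, 0x100)]
later_iteration = [(4, 0x180), (9, 0x100)]
```

Here `same_distances` reports the two iterations as identical even though different code ran, while `same_distances_and_targets` detects the change.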
[00053] In one embodiment, decoders 70A-F can detect a branch and signal this to loop store control unit 64. In another embodiment, fetch front end 60 can detect a branch and convey an indication of the detection to unit 64. Alternatively, in another embodiment, unit 64 can monitor the instruction stream for branches and detect branches independently of decoders 70A-F or fetch front end 60. Unit 64 can include a uop counter (not shown) that counts the number of uops from the start of the loop. On the first iteration of the loop, unit 64 can write the value of the uop counter to branch tracking table 66 whenever a branch is detected in the loop. A pointer to table 66 can also be incremented each time a branch is detected, to move to the next entry in table 66. On subsequent iterations of the loop, whenever a branch is detected, the value of the uop counter can be compared to the value in the corresponding entry of table 66. Each entry of table 66 can include a value representing the number of uops from the start of the loop for a respective branch. Each entry can also include a valid bit to indicate that the entry corresponds to a taken branch in the loop. In other embodiments, each entry of table 66 can include other information, such as a branch tag or identifier, a branch target, and/or other information.
[00054] In one embodiment, any time a mispredicted branch is detected, a reset signal can be conveyed to loop store control unit 64. In addition, any time an event signaled from the back end redirects fetch front end 60, loop store control unit 64 can flush and restart the candidate detection logic. These situations will typically cause the program to leave whatever code stream is being tracked by unit 64.
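The bookkeeping described for unit 64 and table 66 in paragraph [00053] can be modeled in software as follows. This is a sketch under assumptions, not the hardware design: the class `LoopTracker`, the helper `run_iteration`, and the use of `None` for an entry whose valid bit is clear are all hypothetical, the uop counter is incremented before the branch is recorded (so a branch that is the third uop is stored as distance 3), and the closing backward branch is recorded like any other taken branch.

```python
from typing import Iterable, List, Optional


class LoopTracker:
    """Software model of a uop counter plus a branch tracking table."""

    def __init__(self, table_entries: int = 8) -> None:
        # None stands in for an entry whose valid bit is clear.
        self.table: List[Optional[int]] = [None] * table_entries
        self.uop_counter = 0   # uops seen since the start of the loop
        self.pointer = 0       # next table entry to write or compare
        self.first_iteration = True
        self.iterations = 0    # iterations completed without a mismatch

    def on_uop(self) -> None:
        self.uop_counter += 1

    def on_taken_branch(self) -> bool:
        """Record (first iteration) or check (later iterations) one branch."""
        if self.pointer >= len(self.table):
            return False  # more taken branches than table entries
        if self.first_iteration:
            self.table[self.pointer] = self.uop_counter
        elif self.table[self.pointer] != self.uop_counter:
            return False  # distance changed, or an extra branch appeared
        self.pointer += 1
        return True

    def on_iteration_end(self) -> bool:
        """Close one iteration; fail if a recorded branch was not seen."""
        valid_entries = sum(entry is not None for entry in self.table)
        matched = self.pointer == valid_entries
        self.first_iteration = False
        self.uop_counter = 0
        self.pointer = 0
        if matched:
            self.iterations += 1
        return matched


def run_iteration(tracker: LoopTracker, uops: Iterable[bool]) -> bool:
    """Feed one iteration; each item is True when that uop is a taken branch."""
    for is_taken_branch in uops:
        tracker.on_uop()
        if is_taken_branch and not tracker.on_taken_branch():
            return False
    return tracker.on_iteration_end()
```

A caller would discard the tracker and start over whenever `run_iteration` returns `False`, since the recorded distances no longer describe the code being executed.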
[00055] After a predetermined period of time, unit 64 can determine that the loop candidate should be cached in loop store 62. The length of the predetermined period of time can be based on one or more of a variety of factors. For example, in one embodiment, the predetermined period of time can be measured by a certain number of loop iterations. If the number of iterations over which the loop has been invariant is above a threshold, the loop can be cached in loop store 62. Alternatively, the period of time can be based on the number of taken branches that have been detected. For example, if the loop candidate includes 8 taken branches, then a count of 40 such branches can be used to indicate that a particular number of iterations (5 in this example) have occurred. In one embodiment, the predetermined period of time can be based on giving the branch predictor sufficient time to predict the end of the loop. Several ways of tracking these iterations are possible and are contemplated.
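The arithmetic in the example of paragraph [00055] (a count of 40 with 8 per iteration indicates 5 iterations) can be written out as below. The function names and the threshold parameter are hypothetical, and the strict comparison reflects the wording that the iteration count must be above the threshold.

```python
def iterations_seen(taken_branch_count: int, branches_per_iteration: int) -> int:
    """Whole loop iterations implied by a running taken-branch count."""
    return taken_branch_count // branches_per_iteration


def ready_to_cache(taken_branch_count: int, branches_per_iteration: int,
                   iteration_threshold: int) -> bool:
    """True once the implied iteration count is above the threshold."""
    seen = iterations_seen(taken_branch_count, branches_per_iteration)
    return seen > iteration_threshold
```

With the numbers from the text, `iterations_seen(40, 8)` gives 5, so a threshold of 4 would allow caching and a threshold of 5 would not yet.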
[00056] Turning now to Figure 4, another embodiment of a loop store is shown inside a search and decoding unit. In one embodiment, loop store 84 can be located downstream of decoders 82A-F in the processor chain, as shown in Figure 4. This contrasts with loop store 62 (in Figure 3), which is located in the processor chain before decoders 70A-F. The front search end 80 can search for instructions and pre-decode the searched instructions into pre-decoded uops. Next, the pre-decoded uops can be conveyed to decoders 82A-F. In one embodiment, the front search end 80 can be configured to generate and convey six pre-decoded uops per cycle to the six lanes of decoders 82A-F.
[00057] Decoders 82A-F can decode the pre-decoded uops into decoded uops. Next, decoders 82A-F can convey the decoded uops to the next stage of the processor chain via multiplexer 90. In addition, decoders 82A-F can convey uops to the loop store 84 when a loop candidate has been identified and has met the criteria to be cached in loop store 84. The outputs of multiplexer 90 may be coupled to the next stage of the processor chain. In one embodiment, the next stage of the processor chain can be a mapping/dispatching unit.
[00058] The loop store 84, the loop store control unit 86, and the deviation tracking table 88 can be configured to perform functions similar to those described in relation to the front end of the processor shown in Figure 3. A key difference in Figure 4 is that the loop store 84 can store decoded uops, as opposed to the loop store 62 in Figure 3, which stores pre-decoded uops. Therefore, the loop store 84 can be larger than the loop store 62 to accommodate the greater amount of data, since decoded uops typically carry more information than pre-decoded uops. Note that the loop store 84 can also be located in other locations within a processor chain, in addition to the two locations shown in Figures 3 and 4. For example, the loop store 84 can be located within a front search end or, alternatively, within a mapping/dispatching unit. Depending on where the loop store is located in the processor chain, the content of the loop that is stored in the loop store can vary based on the amount of instruction processing that has been performed at that point in the chain.
[00059] In one embodiment, in an initial iteration of a loop candidate, the loop store control unit 86 can populate the deviation tracking table 88 with the distance from the start of the loop to each taken deviation of the loop. In subsequent loop iterations, control unit 86 can determine whether each deviation is at the same distance from the start of the loop as the corresponding distance stored in table 88. After a loop candidate has been invariant for a given number of iterations, the loop candidate can be cached in loop store 84 and fed to the remainder of the processor chain from loop store 84. The front search end 80 and decoders 82A-F can be turned off while the loop is being dispatched out of loop store 84 to the remainder of the processor chain.
[00060] With reference now to Figure 5, one embodiment of a sample loop is shown. It is noted that the program code of loop 100 shown in Figure 5 is used for illustrative purposes. Other loops can be structured differently, with other numbers of instructions and deviations.
[00061] Loop 100 can start at instruction address 0001 with instruction 102. Instruction 102 is followed by instruction 104, and these instructions can be any type of non-deviating instructions that are defined in the ISA. Deviation 106 can follow instruction 104, and deviation 106 can be a forward deviation that deviates to instruction address 0025.
[00062] As shown in table 120, instructions 102 and 104 and deviation 106 can each be broken into a single uop. This is purely for illustrative purposes; instructions within a program may correspond to any number of uops. It is noted that table 120, which shows the uops per instruction, is not a table used or stored by the processor chain, but is shown in Figure 5 for the purposes of this discussion.
[00063] Deviation 106 is the first forward deviation found in loop 100, and the number of uops from the beginning of loop 100 can be entered into the deviation tracking table 130. Therefore, based on the two instructions, each with only one uop, the first value stored in the deviation tracking table 130 can be two. Deviation 106 can jump to instruction address 0025, which corresponds to instruction 108. Instruction 108 can be any type of non-deviation instruction. Then, after instruction 108, another forward deviation can be taken, in this case deviation 110. As can be seen in table 120, instruction 108 is broken into three uops. Therefore, the value written to the second entry of the deviation tracking table 130 can be six, the number of uops from the beginning of the loop to deviation 110.
[00064] Deviation 110 can jump to instruction 112 at instruction address 0077. Instruction 112 can be followed by instruction 114 and then by deviation 116. Deviation 116 is a backward deviation, in that it jumps back to a previous address in the instruction sequence. Instruction 112 is broken into two uops and instruction 114 into four uops, as shown in table 120. Therefore, the distance in uops from the beginning of the loop to deviation 116 is 13, and this value can be stored in the third entry of the deviation tracking table 130.
[00065] When deviation 116 is detected for the first time, it can trigger a finite state machine within a loop store control unit to start tracking loop 100 as a loop store candidate. The loop store control unit can determine the number of uops in loop 100 and the number of deviations in loop 100. If both of these values are less than the limits supported by the loop hardware, then the deviation tracking table 130 can be filled in the next iteration of loop 100. Alternatively, the deviation tracking table 130 can be filled in the first iteration of loop 100 after detecting deviation 116. If loop 100 does not meet all the criteria required by the loop hardware for loop candidates, then loop tracking can be abandoned. If loop 100 meets all the criteria, then, in subsequent iterations of loop 100, whenever a deviation is found, the corresponding value in table 130 can be read and compared with the distance in uops from the beginning of the loop.
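The distances 2, 6, and 13 described for Figure 5 can be reproduced with a short sketch. The uop counts for the instructions come from table 120 as described above; treating every deviation as a single uop is an assumption that is stated only for deviation 106 but is consistent with the stored values. The names `LOOP_100` and `deviation_distances` are hypothetical and are not part of the described hardware.

```python
# Hypothetical model of loop 100 from Figure 5.
# Each element is (label, uop_count, is_taken_deviation).
# One uop per deviation is an assumption consistent with the distances 2, 6, 13.
LOOP_100 = [
    ("instruction 102", 1, False),
    ("instruction 104", 1, False),
    ("deviation 106", 1, True),
    ("instruction 108", 3, False),
    ("deviation 110", 1, True),
    ("instruction 112", 2, False),
    ("instruction 114", 4, False),
    ("deviation 116", 1, True),
]


def deviation_distances(loop):
    """Return the number of uops preceding each taken deviation in the loop."""
    distances = []
    uops_seen = 0
    for _label, uop_count, is_deviation in loop:
        if is_deviation:
            distances.append(uops_seen)  # uops from loop start up to the deviation
        uops_seen += uop_count
    return distances


print(deviation_distances(LOOP_100))
```

Under these assumptions the three computed values match the three entries written to the deviation tracking table 130.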
[00066] It is noted that for other loops, table 130 can include other numbers of valid entries depending on the number of deviations in the loop. It is also noted that in other embodiments, the distance stored in the deviation tracking table 130 can be measured in units other than uops. For example, in another embodiment, the distances stored in table 130 can be measured in instructions. In addition, in other embodiments, the deviation tracking table 130 may include other information fields in each entry. For example, there may be a valid bit for each entry to indicate whether the entry corresponds to a deviation in the loop candidate and contains a valid distance. In the example shown in Figure 5 for table 130 and loop 100, only the first three entries would have the valid bit set to '1', and the valid bits of the remaining entries could be set to '0'. In addition, in other embodiments, a target address for the deviation can be stored in each entry.
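As an illustration of the valid-bit variant described above, the following sketch models the deviation tracking table as a fixed number of entries, each with a valid bit and a distance. The table size of eight entries and the names `BttEntry`, `fill_table`, and `distance_matches` are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass
class BttEntry:
    """One hypothetical deviation tracking table entry."""
    valid: bool = False
    distance: int = 0


def fill_table(distances, num_entries=8):
    """Build a table whose first len(distances) entries are valid.

    num_entries=8 is an assumed table size, not a value taken from the text.
    """
    if len(distances) > num_entries:
        raise ValueError("loop has more deviations than the table has entries")
    table = [BttEntry() for _ in range(num_entries)]
    for index, distance in enumerate(distances):
        table[index] = BttEntry(valid=True, distance=distance)
    return table


def distance_matches(table, index, observed_distance):
    """Check an observed distance against the stored entry for that deviation."""
    entry = table[index]
    return entry.valid and entry.distance == observed_distance


table = fill_table([2, 6, 13])
print([int(entry.valid) for entry in table])
```

For loop 100, only the first three valid bits are set, and a later iteration is considered unchanged at a given deviation only if `distance_matches` holds for the corresponding entry.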
[00067] With reference now to Figure 6, a block diagram of one embodiment of a loop store control unit 140 is shown. Unit 140 may include comparator 142, which can compare the instruction address of the current backward taken deviation (BTB, for backward taken branch) with the instruction address held in retainer 144. Retainer 144 can hold the instruction address of the most recently encountered BTB. Retainer 144 and comparator 142 can each receive a signal indicating that a BTB has been detected, along with the instruction address of the detected BTB. The next time a BTB is detected, its instruction address can be compared to the instruction address of the previous BTB stored in retainer 144. In another embodiment, retainer 144 can be a register or another memory unit. Comparator 142 provides an indication that a loop may have been detected in the instruction flow.
[00068] In one embodiment, comparator 142 may have two outputs, a first output indicating equality and a second output indicating inequality. The first output, indicating equality, can be coupled to the detection start indicator 146, OR gate 160, and iteration counter 150. The equality output of comparator 142 can be a pulse of one or more clock cycles indicating that a BTB was detected and that the same BTB has been seen at least twice in a row. The equality output of comparator 142 can increment iteration counter 150, and iteration counter 150 can provide a count of the number of loop iterations that have been detected in the instruction flow. For this embodiment, if the same BTB is found twice in a row, with no other BTB in between, this indicates that a loop candidate has been found. The loop tracking circuitry can then be started to learn more about the loop candidate.
[00069] The second output of comparator 142, indicating inequality, can be coupled to OR gate 162. The output of OR gate 162 may be coupled to reset the detection start indicator 146. The second output of comparator 142 can be high when the currently detected BTB is different from the previously detected BTB. For this embodiment, this indicates that the previous BTB was not part of a loop candidate. Although not shown in Figure 6, the second output of comparator 142 can also be coupled to other locations to indicate that loop detection has been reset.
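The interplay of retainer 144, comparator 142, and detection start indicator 146 can be sketched in software as follows. This is a behavioral approximation under the assumption that every backward taken deviation updates the retained address; the class and method names are hypothetical.

```python
class BackwardDeviationDetector:
    """Behavioral sketch of retainer 144, comparator 142, and indicator 146."""

    def __init__(self):
        self.retained_address = None    # models retainer 144
        self.detection_started = False  # models detection start indicator 146

    def on_backward_deviation(self, address):
        """Process one backward taken deviation; return True on equality."""
        equal = address == self.retained_address  # models comparator 142
        # Equality sets the indicator; inequality resets it.
        self.detection_started = equal
        # Assumption: the retainer always keeps the most recent address.
        self.retained_address = address
        return equal


detector = BackwardDeviationDetector()
for address in (0x0077, 0x0077, 0x0090):
    print(hex(address), detector.on_backward_deviation(address))
```

The first sighting of an address never starts detection; only a second consecutive sighting of the same address does, and a different address resets the indicator.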
[00070] The uops counter 148 can be configured to keep a record of the number of uops that have been detected since the start of the loop candidate. One or more signals indicating the number of uops that have been detected can be coupled to the uops counter 148. These input(s) to the uops counter 148 may indicate a number of uops that have been searched and/or decoded. In one embodiment, the signal(s) may come from a search unit. In one embodiment, if the search unit emits six decoded uops per clock, then a high input coupled to the uops counter 148 can cause the uops counter 148 to increase its count by six. In another embodiment, these signals can be coupled to the uops counter 148 from the decoder units.
[00071] Uops counter 148 can also include other logic for determining the number of uops up to the specific uop that corresponds to a deviation. When a deviation is found, the uops counter 148 can also receive an input indicating the lane in which the deviation uop was located. The uops counter 148 can then determine how many of the uops of the most recent cycle were ahead of the deviation uop. In this way, the uops counter 148 can generate an exact count of the number of uops from the beginning of the loop to the specific uop corresponding to the deviation that was detected. Uops counter 148 can be reset if the BTB is detected (marking the end of the loop), if a misprediction or flush is signaled from the rear end of the processor, or if comparator 152 signals that an inequality has been detected in the distance to a deviation.
[00072] The iteration counter 150 can be configured to keep a record of the number of loop iterations that have been searched and/or decoded. Iteration counter 150 can be reset if a misprediction or flush is signaled from the rear end of the processor, or if the distance to one of the loop deviations is different from the value stored in the deviation tracking table (not shown). This can be indicated by comparator 152, which can generate a signal indicating inequality if the current uops counter value for a detected deviation is not equal to the corresponding value stored in the deviation tracking table (BTT). Comparator 152 can receive a deviation-detected signal and the BTT value for the current deviation of the loop. Comparator 152 can compare the BTT value to the current uops counter value and output the result of that comparison. If the comparison results in an inequality, then the loop detection logic can be reset.
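The lane-based correction described above amounts to simple arithmetic, sketched below. The sketch assumes six uops per cycle (as in the example of paragraph [00070]), zero-based lane numbering, and a loop that begins in lane 0 of a cycle; none of these details is fixed by the text, and `distance_to_deviation` is a hypothetical name.

```python
UOPS_PER_CYCLE = 6  # assumed width, taken from the six-uops-per-clock example


def distance_to_deviation(full_cycles_before, lane):
    """Return the uop count from the loop start to the deviation uop.

    full_cycles_before: complete cycles counted before the deviation's cycle.
    lane: zero-based lane of the deviation uop within its own cycle (assumed
          numbering); the uops in lower-numbered lanes are ahead of it.
    """
    if not 0 <= lane < UOPS_PER_CYCLE:
        raise ValueError("lane out of range")
    return full_cycles_before * UOPS_PER_CYCLE + lane


print(distance_to_deviation(full_cycles_before=2, lane=3))
```

Here two full cycles contribute twelve uops and three further uops sit ahead of the deviation in its own cycle.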
[00073] In one embodiment, comparator 154 can be configured to compare the output of iteration counter 150 with a limit 156. When iteration counter 150 equals or exceeds limit 156, comparator 154 can emit a signal that initiates the loop store mode for the processor. In this embodiment, the loop candidate can be tracked for multiple iterations before the loop store mode is initiated, and the number of iterations required for tracking can be indicated by limit 156. In various embodiments, limit 156 is a programmable value. In one embodiment, the limit can be based on the time or number of cycles required for the processor's deviation prediction mechanism to detect the end of the loop. In some embodiments, the deviation prediction mechanism can be turned off while the processor is in the loop store mode.
[00074] In another embodiment, the number of deviations can be counted, and when the number of deviations reaches a limit, the loop store mode can be started. For example, if a loop has five deviations and the deviation limit is 40, then the loop candidate would require eight iterations to reach the deviation limit. In other embodiments, other ways of determining how long to track a loop candidate before starting the loop store mode can be used. For example, in another embodiment, if either a certain number of deviations or a certain number of iterations is reached, then the processor can enter the loop store mode.
[00075] Although unit 140 is shown as receiving multiple signals, such as BTB detected, number of uops detected, and deviation detected, in another embodiment unit 140 can generate these signals internally by monitoring the uops that are traversing the processor chain. It should also be understood that the distribution of functionality illustrated in Figure 6 is not the only possible distribution of logic for implementing a loop store control unit within a processor chain. Other embodiments may include other components and logic and have any appropriate distribution of those components and logic. In addition, each of the individual components can be replaced by one or more similar components that can be configured differently depending on the embodiment. For example, in the embodiment shown in Figure 6, only one backward deviation is allowed within a loop candidate. However, in other embodiments, a loop candidate can include more than one backward deviation, and the logic of the loop store control unit can be modified accordingly.
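The relationship between a deviation limit and the number of iterations it implies can be checked with a small sketch. The function names are hypothetical; the "equals or exceeds" comparison follows the description of comparator 154, and applying the same comparison to the deviation count is an assumption.

```python
def iterations_to_reach(deviation_limit, deviations_per_iteration):
    """Return how many whole iterations are needed to reach the deviation limit."""
    if deviations_per_iteration <= 0:
        raise ValueError("a loop candidate needs at least one taken deviation")
    # Ceiling division without floating point.
    return -(-deviation_limit // deviations_per_iteration)


def should_enter_loop_store_mode(iteration_count, deviation_count,
                                 iteration_limit, deviation_limit):
    """Variant in which reaching either limit starts the loop store mode."""
    return iteration_count >= iteration_limit or deviation_count >= deviation_limit


print(iterations_to_reach(40, 5))  # five deviations per iteration
print(iterations_to_reach(40, 8))  # eight deviations per iteration
```

With a limit of 40, a loop with five deviations needs eight iterations and a loop with eight deviations needs five, matching the two worked examples in the description.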
[00076] Referring now to Figure 7, one embodiment of a method for tracking a loop candidate is shown. For the purpose of discussion, the steps in this embodiment are shown in sequential order. It should be noted that in various embodiments of the method described below, one or more of the elements described can be performed at the same time, in an order different from that shown, or can be omitted entirely. Additional elements can also be performed as desired.
[00077] In one embodiment, a loop termination deviation can be detected in a processor chain (block 172). In various embodiments, a loop termination deviation can be defined as a backward taken deviation, excluding subroutine calls. In various embodiments, the loop termination deviation can be detected at a search stage, a decoder stage, or another stage of the processor chain. The loop termination deviation uop can be marked so that it can be identified as the end of a possible loop store candidate.
[00078] In response to the detection of the loop termination deviation, the instruction address of the loop termination deviation can be retained in a loop store control unit, a detection start indicator can be set, an iteration counter can be started, and a uops counter can be started (block 174). The iteration counter can be used to keep track of the number of loop iterations. In some embodiments, a deviation counter can also be started to keep track of the number of deviations that have been detected across all iterations of the loop candidate. The iteration counter value and/or the deviation counter value can be used to determine when to start the loop store mode. When the loop store mode is initiated, the loop candidate can be cached in the loop store and the front search end can be turned off. The uops counter can be used to determine the distance (in number of uops) to each deviation that is detected within the loop candidate.
[00079] Note that in one embodiment, the count maintained by the uops counter can include empty slots that are generated as part of the search and decoding stages. For the purposes of this discussion, assume that the search unit is configured to emit six uops per cycle. For some clock cycles, the search unit may not generate a full group of six uops, for a variety of reasons. Therefore, a row of uops sent to the decoder units may not be a full row of valid uops. The uops counter can take this into account and count six for each row even if the row does not contain six valid uops. For example, a loop can include six rows of uops, and the loop termination deviation may be in the last slot of the sixth row. The uops counter can count 36 uops for the six cycles, even if one or more rows contained fewer than six valid uops. For example, an intermediate row can contain only two valid uops, with the remaining four slots of the row empty. The loop would then include 32 valid uops, but the uops counter will count 36 uops. In general, in this embodiment, the uops counter can keep a record of how many slots will be needed in the loop store to store the loop candidate, even if some of these slots do not contain valid uops.
[00080] After configuring the counters and any additional tracking logic, the loop candidate can be executed and tracked (block 176). In one embodiment, tracking the loop candidate can include detecting deviations in the loop candidate and filling in a deviation tracking table with the distance from the start of the loop to each detected deviation (block 178). Then, a loop termination deviation can be detected at the end of the loop candidate (block 180). If the loop termination deviation is the same deviation detected previously (conditional block 182), then the iteration counter can be incremented (block 186).
[00081] If the loop termination deviation is not the same deviation previously detected (conditional block 182), then tracking of the loop candidate can be stopped, and the counters, the retainer, the detection start indicator, and the deviation tracking table can be reset (block 184). In addition, tracking of the loop candidate can be terminated if any excluded instructions are detected in the loop. After block 184, method 170 can be restarted and wait for a loop termination deviation to be detected (block 172).
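The difference between occupied slots and valid uops in the example above can be verified with a short sketch. A width of six slots per row is taken from the example; the function name and the list-of-counts representation are assumptions for illustration.

```python
def slots_and_valid_uops(valid_uops_per_row, width=6):
    """Return (slots counted by the uops counter, number of valid uops).

    valid_uops_per_row: number of valid uops in each row emitted for the loop.
    width: slots per row; six matches the example in the description.
    """
    for count in valid_uops_per_row:
        if not 0 <= count <= width:
            raise ValueError("row cannot hold that many uops")
    slots = len(valid_uops_per_row) * width  # every row counts as a full row
    valid = sum(valid_uops_per_row)
    return slots, valid


# Six rows, one intermediate row holding only two valid uops.
print(slots_and_valid_uops([6, 6, 2, 6, 6, 6]))
```

The first value is what must fit in the loop store under this counting scheme, while the second is the number of uops that actually do work.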
[00082] After block 186, the uops counter can be compared to the size of the loop store (conditional block 188) to determine whether the loop candidate fits in the loop store. Alternatively, in another embodiment, these steps of method 170 can be reordered. For example, if it is determined that the uops counter exceeds the size of the loop store (conditional block 188) before detecting a loop termination deviation (block 180), the loop detection can then be canceled.
[00083] If the uops counter is smaller than the size of the loop store (conditional block 188), then the loop candidate can fit into the loop store, and the next condition can be checked: whether the number of deviations in the loop candidate is less than the size of the deviation tracking table (BTT) (conditional block 190). If the uops counter is larger than the size of the loop store (conditional block 188), then the loop candidate is too large to fit into the loop store and tracking can be terminated. Method 170 can go back to block 184, and the counters, the retainer, the detection start indicator, and the deviation tracking table can be reset.
[00084] If the number of deviations in the loop candidate is less than the BTT size (conditional block 190), the loop candidate is still being considered, and the uops counter can be reset (block 192). Then, another loop iteration can be performed and tracked (block 194). Tracking the loop iteration can include monitoring the deviations taken and the number of uops from the beginning of the loop to each deviation. The distance to each deviation from the beginning of the loop can be compared to the values stored in the deviation tracking table.
[00085] When a loop iteration is completed, a loop termination deviation should be detected, and it can be determined whether it is the same loop termination deviation (conditional block 196). Alternatively, if the loop termination deviation is not detected, loop tracking can be terminated by monitoring the uops counter and the last entry in the deviation tracking table and determining that the loop termination deviation should already have been detected. If the loop termination deviation is detected and it is the same loop termination deviation (conditional block 196), then it can be determined whether the loop content was invariant for this loop iteration (conditional block 198).
[00086] Alternatively, conditional block 198 can be checked before conditional block 196 in some cases. For example, it can be determined that the loop content has changed before detecting the loop termination deviation if one of the loop deviations is not at the same distance from the beginning of the loop as the value stored in the deviation tracking table. In this case, the loop tracking can be terminated before detecting the same loop termination deviation.
[00087] If the loop content was invariant for this loop iteration (conditional block 198), this indicates that the same loop is being executed, and the iteration counter can be incremented (block 200). Next, it can be determined whether the iteration counter is above a limit, that is, whether the loop has been tracked long enough for the loop to be stored (conditional block 202). Alternatively, in another embodiment, a deviation counter can be compared to a limit to determine whether the processor should enter the loop store mode.
[00088] If the iteration counter is below the limit (conditional block 202), method 170 can reset the uops counter (block 192). If the iteration counter is above the limit (conditional block 202), the processor can enter the loop store mode and the loop can be cached in the loop store (block 204). After block 204, method 170 can terminate. At this point, the front end of the processor can be turned off and uops can be dispatched out of the loop store. When the loop ends, the processor can convey a signal to exit the loop store mode and the front end of the processor can be turned on again. At this point, method 170 can be restarted, and the loop store control unit can again monitor the instruction flow for loop termination deviations (block 172).
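The overall flow of method 170 can be approximated by the following sketch, which consumes one record per loop iteration rather than individual uops. It is a simplification under several assumptions: each record is a hypothetical tuple `(end_address, distances, uop_count)`, the size checks use strict "less than" as worded in the description, and the limit is compared with "above". The function name and the sizes used in the example (48 slots, 8 table entries) are not taken from the text.

```python
def track_loop(iterations, loop_store_size, btt_size, iteration_limit):
    """Return the index of the iteration after which loop store mode starts.

    iterations: list of (end_address, distances, uop_count), one per iteration.
    Returns None if loop store mode is never entered.
    """
    retained = None  # address of the retained loop termination deviation
    table = None     # learned distances (deviation tracking table)
    count = 0        # iteration counter

    for index, (end_address, distances, uop_count) in enumerate(iterations):
        if end_address != retained:
            # Blocks 172/174, or block 184 followed by a fresh detection.
            retained, table, count = end_address, None, 0
            continue
        if table is None:
            # Blocks 176 to 190: learn the distances if the candidate fits.
            if uop_count < loop_store_size and len(distances) < btt_size:
                table, count = list(distances), 1
            else:
                retained, count = None, 0  # block 184: candidate too large
            continue
        if list(distances) != table:
            # Conditional block 198 failed: loop content changed (block 184).
            retained, table, count = None, None, 0
            continue
        count += 1                    # block 200
        if count > iteration_limit:   # conditional block 202
            return index              # block 204: enter loop store mode
    return None


stable = (0x0077, [2, 6, 13], 14)
print(track_loop([stable] * 6, loop_store_size=48, btt_size=8, iteration_limit=3))
```

In this run the first record only retains the termination address, the second fills the table, and loop store mode is entered once the counter rises above the limit of three.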
[00089] Referring to Figure 8, a block diagram of one embodiment of a system 210 is shown. As illustrated, system 210 can represent a chip, circuitry, components, etc., of a desktop computer 220, laptop 230, tablet 240, cell phone 250, or other device. In the illustrated embodiment, system 210 includes at least one instance of IC 10 (from Figure 1) coupled to an external memory 212.
[00090] IC 10 is coupled to one or more peripherals 214 and external memory 212. A power source 216 is also provided, which supplies the supply voltages to IC 10 as well as one or more supply voltages to memory 212 and/or peripherals 214. In various embodiments, power source 216 can represent a battery (for example, a rechargeable battery in a smartphone, laptop, or tablet). In some embodiments, more than one instance of IC 10 can be included (and more than one external memory 212 can be included as well).
[00091] Memory 212 can be any type of memory, such as dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM (including mobile versions of SDRAMs such as DDR3, etc., and/or low-power versions of SDRAMs such as LPDDR2, etc.), RAMBUS DRAM (RDRAM), static RAM (SRAM), etc. One or more memory devices can be coupled to a circuit board to form memory modules such as single inline memory modules (SIMMs), dual inline memory modules (DIMMs), etc.
[00092] Peripherals 214 may include any desired circuitry, depending on the type of system 210. For example, in one embodiment, peripherals 214 may include devices for various types of wireless communication, such as Wi-Fi, Bluetooth, cellular, global positioning system, etc. Peripherals 214 may also include additional storage, including RAM storage, solid state storage, or disk storage. Peripherals 214 may include user interface devices such as a display screen, including touch screens or multitouch display screens, a keyboard or other input devices, microphones, speakers, etc.
[00093] Turning now to Figure 9, one embodiment of a block diagram of a computer-readable medium 260 is shown, including one or more data structures representative of the circuitry included in IC 10 (of Figure 1). In general, computer-readable medium 260 may include any non-transitory storage media such as magnetic or optical media (for example, disk, CD-ROM, or DVD-ROM), volatile or non-volatile memory media such as RAM (for example, SDRAM, RDRAM, SRAM, etc.), ROM, etc., as well as media accessible via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed through a communication medium such as a network and/or a wireless link.
[00094] Generally, the data structure(s) of the circuitry on the computer-readable medium 260 can be read by a program and used, directly or indirectly, to fabricate the hardware comprising the circuitry. For example, the data structure(s) may include one or more behavioral-level descriptions or register-transfer level (RTL) descriptions of the hardware functionality in a high-level design language (HDL) such as Verilog or VHDL. The description(s) can be read by a synthesis tool, which can synthesize the description to produce one or more netlists comprising lists of gates from a synthesis library. The netlist(s) comprise a set of gates that also represent the functionality of the hardware comprising the circuitry. The netlist(s) can then be placed and routed to produce one or more data sets describing geometric shapes to be applied to masks. The masks can then be used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the circuitry. Alternatively, the data structure(s) on the computer-readable medium 260 may be the netlist(s) (with or without the synthesis library) or the data set(s), as desired. In yet another alternative, the data structures may comprise the output of a schematic program, or netlist(s) or data set(s) derived therefrom.
[00095] Although the computer-readable medium 260 includes a representation of IC 10, other embodiments may include a representation of any portion or combination of portions of IC 10 (e.g., loop store, loop store control unit).
[00096] It should be emphasized that the embodiments described above are only non-limiting examples of implementations. Numerous variations and modifications will be evident to those skilled in the art once the above description has been fully appreciated. The following claims are intended to be interpreted to encompass all such variations and modifications.

Claims (11):
[0001]
1. Device, characterized by the fact that it comprises: a loop buffer configured to store instruction operations, in which instruction operations are sent from the loop buffer in response to the detection that the device is in a loop buffer mode; and a loop buffer control unit coupled to the loop buffer; where the loop buffer control unit is configured to: detect a direct reverse branch to a previous instruction in a sequence of instructions; consider the previous instruction to be the start of a loop candidate; determine whether an indication is stored that indicates that a loop candidate corresponding to the direct reverse branch has been disqualified from being cached in the loop buffer during the tracking of the loop candidate in a previous encounter of the loop candidate; in response to the determination that said indication is stored, ignore the loop candidate and forgo the tracking of the loop candidate; if the direct reverse branch has not been previously disqualified, track the loop candidate, whereby the loop buffer control unit is configured to: store an identification of the direct reverse branch; track a number of instructions executed from the start of the loop candidate to each branch taken within the loop candidate; in response to detecting that the number of instructions executed from the start of the loop candidate to each of the branches taken is invariant for at least a certain number of iterations of the loop candidate, store the loop candidate in the loop buffer and initiate the loop buffer mode; and in response to detecting that the number of instructions executed from the start of the loop candidate to each of the branches taken is not invariant: end the tracking of the loop candidate; and store an indication that the direct reverse branch is disqualified.
[0002]
2. Device, according to claim 1, characterized by the fact that it still comprises a search unit and an instruction cache, in which the device is configured to turn off at least one of the search unit and the instruction cache in response to the loop buffer mode being started.
[0003]
3. Apparatus according to claim 1, characterized by the fact that instruction operations are sent from the loop buffer to a decoding unit when the apparatus is in the loop buffer mode.
[0004]
4. Apparatus, according to claim 1, characterized by the fact that, when tracking the loop candidate, the loop buffer control unit is further configured to end the tracking of the loop candidate in response to the detection of a second loop termination branch which is not the first loop termination branch.
[0005]
5. Apparatus according to claim 1, characterized by the fact that the determined number of iterations corresponds to a number of iterations greater than a limit, and in which the limit is based on an amount of time selected to give the branch predictor sufficient time to predict the end of the loop candidate.
[0006]
6. Apparatus according to claim 1, characterized by the fact that it still comprises a branch tracking table, in which the branch tracking table comprises an entry for each branch of the loop candidate, and in which each entry includes a value that corresponds to a distance from the beginning of the loop candidate to the respective branch taken.
[0007]
7. Method characterized by the fact that it comprises the steps of: maintaining a loop buffer to store instruction operations, in which instruction operations are sent from the loop buffer in response to the detection of a loop buffer mode; detecting a first loop termination branch, where the first loop termination branch is a direct reverse branch to a previous instruction in a sequence of instructions; considering the previous instruction to be the start of a loop candidate; determining whether an indication is stored that indicates that a loop candidate corresponding to the direct reverse branch has been disqualified from being cached in the loop buffer during the tracking of the loop candidate in a previous encounter of the loop candidate; in response to the determination that said indication is stored, ignoring the loop candidate and forgoing the tracking of the loop candidate; if the direct reverse branch has not been previously disqualified, starting to track the loop candidate, where said tracking comprises: storing an identification of the direct reverse branch; tracking a number of instructions executed from the start of the loop candidate to each branch taken within the loop candidate; in response to detecting that the number of instructions executed from the start of the loop candidate to each of the branches taken is invariant for at least a certain number of iterations of the loop candidate, storing the loop candidate in the loop buffer and initiating the loop buffer mode; and in response to detecting that the number of instructions executed from the start of the loop candidate to each of the branches taken is not invariant: ending the tracking of the loop candidate; and storing an indication that the direct reverse branch is disqualified.
8. Method according to claim 7, characterized by the fact that the tracking further comprises terminating the tracking of the loop candidate in response to detecting a second loop-terminating branch that is not the first loop-terminating branch.
9. Method according to claim 7, characterized by the fact that it further comprises shutting down a fetch unit in response to entering the loop buffer mode.
10. Method according to claim 7, characterized by the fact that it further comprises dispatching the loop candidate from the loop buffer to a next stage of a processor pipeline in response to entering the loop buffer mode.
11. Method, according to claim 10, characterized by the fact that the next stage of the processor pipeline is a decoding unit.
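
The tracking steps recited in claims 7 and 8 can be illustrated with a minimal software sketch. This is not the claimed hardware and not the applicant's implementation: the names `Instr`, `LoopLearner`, `observe`, and `run`, the use of program-counter values to identify branches, the default threshold of three iterations, and the rule that tracking stops when execution leaves the loop body are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    pc: int                             # address of the executed instruction
    taken_target: Optional[int] = None  # target address if a branch was taken


class LoopLearner:
    """Hypothetical model of the loop-candidate tracking of claims 7 and 8."""

    def __init__(self, required_iterations: int = 3):
        self.required = required_iterations  # assumed threshold
        self.disqualified = set()            # branches barred from the loop buffer
        self.loop_buffer_mode = False
        self._reset()

    def _reset(self):
        self.tracked_branch = None  # identification of the loop-terminating branch
        self.loop_start = None
        self.count = 0              # instructions executed since the loop start
        self.current = []           # distances to taken branches, this iteration
        self.reference = None       # distances from the first tracked iteration
        self.stable = 0

    def observe(self, ins: Instr):
        if self.loop_buffer_mode:
            return  # the loop is now dispatched from the loop buffer
        backward = ins.taken_target is not None and ins.taken_target <= ins.pc
        if self.tracked_branch is not None:
            in_body = self.loop_start <= ins.pc <= self.tracked_branch
            other_loop_end = backward and ins.pc != self.tracked_branch
            if in_body and not other_loop_end:
                self._track(ins)
                return
            # Claim 8 (a different loop-terminating branch) or, as an
            # assumption of this sketch, execution left the loop body.
            self._reset()
        if backward and ins.pc not in self.disqualified:
            self.tracked_branch = ins.pc      # store the branch identification
            self.loop_start = ins.taken_target

    def _track(self, ins: Instr):
        self.count += 1
        if ins.taken_target is None:
            return
        self.current.append(self.count)       # distance to this taken branch
        if ins.pc == self.tracked_branch:
            self._end_iteration()

    def _end_iteration(self):
        if self.reference is None:
            self.reference, self.stable = self.current, 1
        elif self.current == self.reference:
            self.stable += 1
        else:
            # Distances varied: stop tracking and disqualify the branch.
            self.disqualified.add(self.tracked_branch)
            self._reset()
            return
        if self.stable >= self.required:
            self.loop_buffer_mode = True      # loop would be stored and replayed
            return
        self.count, self.current = 0, []


def run(trace, required_iterations: int = 3) -> LoopLearner:
    learner = LoopLearner(required_iterations)
    for ins in trace:
        learner.observe(ins)
    return learner
```

In this sketch, a loop whose taken branches fall at the same distances on every iteration reaches the loop buffer mode, whereas a loop containing a forward branch that is taken in only some iterations is disqualified, and its terminating branch is ignored on later encounters.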

Similar technologies:

Publication number | Publication date | Patent title

BR102013015049B1|2021-03-02|apparatus and method

BR102013010877B1|2021-07-06|load-store dependency predictor content management method and processor

US9471322B2|2016-10-18|Early loop buffer mode entry upon number of mispredictions of exit condition exceeding threshold

TWI552069B|2016-10-01|Load-store dependency predictor, processor and method for processing operations in load-store dependency predictor

BR102013015262A2|2015-07-14|Loop Buffer Packaging

TWI494852B|2015-08-01|Processor and method for maintaining an order of instructions in relation to barriers in the processor

ES2655852T3|2018-02-21|Procedures and apparatus for canceling data pre-capture requests for a loop

TWI564707B|2017-01-01|Apparatus,method and system for controlling current

JP2018533135A|2018-11-08|Method and apparatus for cache line deduplication by data matching

BR102012024721A2|2013-11-26|REGULATION OF ISSUANCE OF PROCESSOR INSTRUCTIONS

KR20210043631A|2021-04-21|Control access to branch prediction units for the sequence of fetch groups

Patent family:

Publication number | Publication date

JP5799465B2|2015-10-28|

KR20130141394A|2013-12-26|

US9557999B2|2017-01-31|

TWI520060B|2016-02-01|

EP2674858A2|2013-12-18|

TW201411487A|2014-03-16|

WO2013188122A2|2013-12-19|

KR101497214B1|2015-02-27|

BR102013015049A2|2015-06-23|

WO2013188122A3|2014-02-13|

US20130339700A1|2013-12-19|

CN103593167B|2017-02-22|

JP2014013565A|2014-01-23|

EP2674858A3|2014-04-30|

CN103593167A|2014-02-19|

EP2674858B1|2019-10-30|

Cited references:

Publication number | Filing date | Publication date | Applicant | Patent title

DE69129872T2|1990-03-27|1999-03-04|Philips Electronics Nv|Data processing system with a performance-enhancing instruction cache|

JP3032030B2|1991-04-05|2000-04-10|株式会社東芝|Loop optimization method and apparatus|

JP3032031B2|1991-04-05|2000-04-10|株式会社東芝|Loop optimization method and apparatus|

MX9306994A|1992-12-15|1994-06-30|Ericsson Telefon Ab L M|FLOW CONTROL SYSTEM FOR PACKAGE SWITCHES.|

EP0840208B1|1996-10-31|2003-01-08|Texas Instruments Incorporated|Method and system for single cycle execution of successive iterations of a loop|

US5893142A|1996-11-14|1999-04-06|Motorola Inc.|Data processing system having a cache and method therefor|

US6076159A|1997-09-12|2000-06-13|Siemens Aktiengesellschaft|Execution of a loop instructing in a loop pipeline after detection of a first occurrence of the loop instruction in an integer pipeline|

US6125440A|1998-05-21|2000-09-26|Tellabs Operations, Inc.|Storing executing instruction sequence for re-execution upon backward branch to reduce power consuming memory fetch|

JP2000298587A|1999-03-08|2000-10-24|Texas Instr Inc <Ti>|Processor having device branched to specified party during instruction repetition|

US6269440B1|1999-02-05|2001-07-31|Agere Systems Guardian Corp.|Accelerating vector processing using plural sequencers to process multiple loop iterations simultaneously|

EP1050804A1|1999-05-03|2000-11-08|STMicroelectronics SA|Execution of instruction loops|

US6963965B1|1999-11-30|2005-11-08|Texas Instruments Incorporated|Instruction-programmable processor with instruction loop cache|

EP1107110B1|1999-11-30|2006-04-19|Texas Instruments Incorporated|Instruction loop buffer|

US7302557B1|1999-12-27|2007-11-27|Impact Technologies, Inc.|Method and apparatus for modulo scheduled loop execution in a processor architecture|

US6598155B1|2000-01-31|2003-07-22|Intel Corporation|Method and apparatus for loop buffering digital signal processing instructions|

US6757817B1|2000-05-19|2004-06-29|Intel Corporation|Apparatus having a cache and a loop buffer|

US6671799B1|2000-08-31|2003-12-30|Stmicroelectronics, Inc.|System and method for dynamically sizing hardware loops and executing nested loops in a digital signal processor|

US6898693B1|2000-11-02|2005-05-24|Intel Corporation|Hardware loops|

US6748523B1|2000-11-02|2004-06-08|Intel Corporation|Hardware loops|

US6950929B2|2001-05-24|2005-09-27|Samsung Electronics Co., Ltd.|Loop instruction processing using loop buffer in a data processing device having a coprocessor|

JP2004038601A|2002-07-04|2004-02-05|Matsushita Electric Ind Co Ltd|Cache memory device|

CN1717654A|2002-11-28|2006-01-04|皇家飞利浦电子股份有限公司|A loop control circuit for a data processor|

US20040123075A1|2002-12-19|2004-06-24|Yoav Almog|Extended loop prediction techniques|

US7159103B2|2003-03-24|2007-01-02|Infineon Technologies Ag|Zero-overhead loop operation in microprocessor having instruction buffer|

US7130963B2|2003-07-16|2006-10-31|International Business Machines Corp.|System and method for instruction memory storage and processing based on backwards branch control information|

US7752426B2|2004-08-30|2010-07-06|Texas Instruments Incorporated|Processes, circuits, devices, and systems for branch prediction and other processor improvements|

JP2006309337A|2005-04-26|2006-11-09|Toshiba Corp|Processor, and method for operating command buffer of processor|

US7475231B2|2005-11-14|2009-01-06|Texas Instruments Incorporated|Loop detection and capture in the instruction queue|

US7330964B2|2005-11-14|2008-02-12|Texas Instruments Incorporated|Microprocessor with independent SIMD loop buffer|

US7873820B2|2005-11-15|2011-01-18|Mips Technologies, Inc.|Processor utilizing a loop buffer to reduce power consumption|

TWI295032B|2005-12-01|2008-03-21|Ind Tech Res Inst|

US9052910B2|2007-10-25|2015-06-09|International Business Machines Corporation|Efficiency of short loop instruction fetch|

US20090217017A1|2008-02-26|2009-08-27|International Business Machines Corporation|Method, system and computer program product for minimizing branch prediction latency|

JP2010066892A|2008-09-09|2010-03-25|Renesas Technology Corp|Data processor and data processing system|

US9952869B2|2009-11-04|2018-04-24|Ceva D.S.P. Ltd.|System and method for using a branch mis-prediction buffer|

CN102238179B|2010-04-07|2014-12-10|苹果公司|Real-time or near real-time streaming|

US8446186B2|2010-06-07|2013-05-21|Silicon Laboratories Inc.|Time-shared latency locked loop circuit for driving a buffer circuit|

US20120079303A1|2010-09-24|2012-03-29|Madduri Venkateswara R|Method and apparatus for reducing power consumption in a processor by powering down an instruction fetch unit|

US20120185714A1|2011-12-15|2012-07-19|Jaewoong Chung|Method, apparatus, and system for energy efficiency and energy conservation including code recirculation techniques|

US9753733B2|2012-06-15|2017-09-05|Apple Inc.|Methods, apparatus, and processors for packing multiple iterations of loop in a loop buffer|

US9471322B2|2014-02-12|2016-10-18|Apple Inc.|Early loop buffer mode entry upon number of mispredictions of exit condition exceeding threshold|

US9753733B2|2012-06-15|2017-09-05|Apple Inc.|Methods, apparatus, and processors for packing multiple iterations of loop in a loop buffer|

US9459871B2|2012-12-31|2016-10-04|Intel Corporation|System of improved loop detection and execution|

EP3005078A2|2013-05-24|2016-04-13|Coherent Logix Incorporated|Memory-network processor with programmable optimizations|

US9632791B2|2014-01-21|2017-04-25|Apple Inc.|Cache for patterns of instructions with multiple forward control transfers|

US9471322B2|2014-02-12|2016-10-18|Apple Inc.|Early loop buffer mode entry upon number of mispredictions of exit condition exceeding threshold|

US9524011B2|2014-04-11|2016-12-20|Apple Inc.|Instruction loop buffer with tiered power savings|

CN105511838B|2014-09-29|2018-06-29|上海兆芯集成电路有限公司|Processor and its execution method|

US20160179549A1|2014-12-23|2016-06-23|Intel Corporation|Instruction and Logic for Loop Stream Detection|

US9830152B2|2015-12-22|2017-11-28|Qualcomm Incorporated|Selective storing of previously decoded instructions of frequently-called instruction sequences in an instruction sequence buffer to be executed by a processor|

GB2548602B|2016-03-23|2019-10-23|Advanced Risc Mach Ltd|Program loop control|

GB2548603B|2016-03-23|2018-09-26|Advanced Risc Mach Ltd|Program loop control|

US10223118B2|2016-03-24|2019-03-05|Qualcomm Incorporated|Providing references to previously decoded instructions of recently-provided instructions to be executed by a processor|

JP2018005488A|2016-06-30|2018-01-11|富士通株式会社|Arithmetic processing unit and control method for arithmetic processing unit|

CN108256735B|2017-12-14|2020-12-25|中国平安财产保险股份有限公司|Processing method for surveying and dispatching and terminal equipment|

US10915322B2|2018-09-18|2021-02-09|Advanced Micro Devices, Inc.|Using loop exit prediction to accelerate or suppress loop mode of a processor|

US11269642B2|2019-09-20|2022-03-08|Microsoft Technology Licensing, Llc|Dynamic hammock branch training for branch hammock detection in an instruction stream executing in a processor|

Legal status:
2015-06-23| B03A| Publication of a patent application or of a certificate of addition of invention [chapter 3.1 patent gazette]|

2018-03-27| B15K| Others concerning applications: alteration of classification|Ipc: G06F 9/38 (2006.01), G06F 9/32 (2006.01) |

2018-12-04| B06F| Objections, documents and/or translations needed after an examination request according [chapter 6.6 patent gazette]|

2019-11-19| B06U| Preliminary requirement: requests with searches performed by other patent offices: procedure suspended [chapter 6.21 patent gazette]|

2020-12-29| B09A| Decision: intention to grant [chapter 9.1 patent gazette]|

2021-03-02| B16A| Patent or certificate of addition of invention granted [chapter 16.1 patent gazette]|Free format text: TERM OF VALIDITY: 20 (TWENTY) YEARS COUNTED FROM 17/06/2013, SUBJECT TO THE LEGAL CONDITIONS. |

Priority:

Application number | Filing date | Patent title

US13/524,508|US9557999B2|2012-06-15|2012-06-15|Loop buffer learning|

US13/524,508|2012-06-15|
