method for automated recovery and alert escalation in distributed systems, system for automated reco
Abstract:
METHOD AND SYSTEM FOR AUTOMATED RECOVERY AND ALERT ESCALATION IN DISTRIBUTED SYSTEMS. In the present invention, alerts (113) based on detected hardware and/or software problems in a complex distributed application environment are mapped to recovery actions for automatic problem resolution. Unmapped alerts (113) are escalated to designated individuals or teams (101) through a cyclic escalation method that includes a hand-off confirmation notification from the designated individual or team (101). The information collected for each alert (113), as well as solutions reached through the escalation process, can be recorded to expand the automated resolution knowledge base.
Publication number: BR112012026917B1
Application number: R112012026917-8
Filing date: 2011-03-30
Publication date: 2021-04-20
Inventors: Jon Avner; Shane Brady; Wing Man Yim; Haruya Shida; Selim Yazicioglu; Andrey Lukyanov; Brent Alinger; Colin Nash
Applicant: Microsoft Technology Licensing, Llc
Main IPC number:
Description:
BACKGROUND

[001] In today's networked communication environments, many services that used to be provided by locally executed applications are provided through distributed services. For example, email services, calendar/scheduling services, and comparable services are provided through complex networked systems that involve multiple physical and virtual servers, storage facilities, and other components across geographic boundaries. Even organizational systems, such as corporate networks, can be implemented through physically separate server farms and the like.

[002] Although distributed services make it easier to manage application installation, update, and maintenance (i.e., instead of installing, updating, and maintaining hundreds if not thousands of local applications, a centrally managed service can take care of these tasks), such services still involve multiple applications running on multiple servers. When managing these large-scale distributed applications continuously, a variety of problems is to be expected. Hardware failures, software problems, and other unexpected glitches can occur regularly. Trying to manage and recover from these problems manually can require a cost-prohibitive number of dedicated operations engineers with domain knowledge.

SUMMARY

[003] This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to exclusively identify key features or essential features of the claimed subject matter, nor is it intended as an aid in determining the scope of the claimed subject matter.

[004] The embodiments are directed to mapping detected alerts to recovery actions for automatic resolution of problems in a networked communication environment. Unmapped alerts can be escalated to designated individuals through a cyclic escalation method that includes a hand-off confirmation notification from the designated individual.
According to some embodiments, information collected for each alert, as well as solutions reached through the escalation process, can be recorded to expand the automated resolution knowledge base.

[005] These and other features and advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory and do not restrict aspects as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

[006] Figure 1 is a conceptual diagram that illustrates an example environment in which the detection of an alert can lead to a repair action or alert escalation;

[007] Figure 2 is an action diagram that illustrates actions during the escalation of an alert;

[008] Figure 3 is another conceptual diagram that illustrates alert management in a multiple-region environment;

[009] Figure 4 is a networked environment in which a system according to the embodiments can be implemented;

[0010] Figure 5 is a block diagram of an example computing operating environment in which the embodiments can be implemented; and

[0011] Figure 6 illustrates a logic flow diagram for automated management of alerts in a networked communication environment according to the embodiments.

DETAILED DESCRIPTION

[0012] As briefly described above, alerts in a networked system can be managed through an automated action/escalation process that performs actions mapped to alerts and/or escalates alerts for manual resolution, while expanding a knowledge base for the automated action portion and providing collected information to the designated individuals tasked with addressing problems. In the following detailed description, references are made to the accompanying drawings, which form a part hereof and in which specific embodiments or examples are shown by way of illustration.
These aspects can be combined, other aspects can be used, and structural changes can be made without departing from the spirit or scope of the present disclosure. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and their equivalents.

[0013] Although the embodiments are described in the general context of program modules that execute in conjunction with an application program that runs on an operating system on a personal computer, those skilled in the art will recognize that aspects can also be implemented in combination with other program modules.

[0014] Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Furthermore, those skilled in the art will appreciate that the embodiments can be practiced with other computer system configurations, including handheld devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and comparable computing devices. The embodiments can also be practiced in distributed computing environments, where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules can be located in both local and remote memory storage devices.

[0015] The embodiments may be implemented as a computer-implemented process (method), or as an article of manufacture, such as a computer program product or computer-readable media. The computer program product may be a computer storage medium readable by a computer system and encoding a computer program that comprises instructions for causing a computer or computer system to perform example process(es).
The computer-readable storage medium can be implemented, for example, through one or more of a volatile computer memory, a non-volatile memory, a hard drive, a flash drive, a floppy disk, a compact disk, and comparable media. The computer program product can also be a propagated signal on a carrier (e.g., a frequency or phase modulated signal) or a medium readable by a computer system and encoding a computer program of instructions for executing a computer process.

[0016] Throughout this specification, references are made to services. A service as used here describes any networked/online application(s) that may receive an alert as part of their regular operations and process/store/forward that information. Such application(s) can run on a single computing device, on multiple computing devices in a distributed manner, and so on. Embodiments can also be implemented in a hosted service executed by a plurality of servers or comparable systems. The term "server" generally refers to a computing device executing one or more software programs, typically in a networked environment. However, a server can also be implemented as a virtual server (software programs) executed on one or more computing devices viewed as a server on the network. Further details on these example technologies and operations are provided below.

[0017] Referring to Figure 1, conceptual diagram 100 illustrates an example environment in which the detection of an alert can lead to a repair action or alert escalation. As briefly mentioned above, the embodiments address the complexity of technical support services by automating repair actions and alert escalation. For example, in a distributed technical support services system, monitoring engine 103 can send an alert 113 to an automation engine 102 upon detection of a hardware, software, or combined hardware/software problem in the distributed system. Automation engine 102 may attempt to map alert 113 to a repair action 112.
If automation engine 102 successfully maps alert 113 to repair action 112, then automation engine 102 may perform action 112, which may include a set of instructions to address the detected problem.

[0018] The problem may be associated with one or more devices 104 at a geographically distributed service location 105. The devices may include any computing device, such as a desktop computer, a server, a smartphone, a laptop computer, and comparable devices. Devices 104 may further include additional remotely accessible devices, such as monitors, audio equipment, television sets, video capture devices, and other similar devices.

[0019] Alert 113 may include status information of the device or program associated with the detected problem, such as device memory contents, sensor readings, last executed instructions, and the like. Alert 113 may further include a problem description, such as which instruction failed to execute, which executions produced results beyond predefined thresholds, and the like.

[0020] Automation engine 102 may attempt to map alert 113 to a repair action 112 by searching a troubleshooting database 114. Troubleshooting database 114 may store alert profiles matched to repair actions, further classified by device or software program. An example implementation might be a communication device "no connection" alert matched with a repair action that restarts the communication device's network interface. One or more repair actions can be mapped to each alert. Furthermore, one or more alerts can be mapped to a single repair action.

[0021] If automation engine 102 determines multiple repair actions for an alert, the order of execution may depend on a predefined priority of the repair actions. For example, a primary repair action in the scenario discussed above might be a network interface restart, followed by a secondary repair action of rebooting the communication device.
The predefined priority of repair actions can be entered manually into troubleshooting database 114 or determined automatically, based on a repair action success evaluation scheme, upon successful correction of the problem.

[0022] According to some embodiments, repair action 112 may include the collection of additional diagnostic information from the device and/or software program associated with the problem. According to other embodiments, the additional diagnostic information can be transmitted to the monitoring engine as an alert, restarting the automated cycle. In response to an alert, additional diagnostic information may also be collected and stored in the system. The stored information can be used to capture the problem state and provide context when the alert is escalated to a designated person or team (e.g., 101).

[0023] If automation engine 102 does not find a mapped repair action in troubleshooting database 114, alert 113 can be escalated to a designated person or team 101. Designated person or team 101 can also be notified for informational purposes even if a mapped action is found and executed. The routing of alert 113 to the designated person or team 101 can be determined from a naming convention of alert 113. The alert naming convention can indicate to which support personnel the alert should be escalated, such as a hardware support team, a software support team, and comparable teams. The naming convention scheme can also be used for mapping alerts to recovery actions. For example, alerts can be named in a hierarchical fashion (i.e., system/component/alert name), and recovery actions can be mapped anywhere from all alerts for a system (system/*) to a special recovery action for a specific alert. According to some embodiments, each specific alert can have a designated team associated with it, although that team can default to a specific value for an entire component.
Determining which team member to send the alert to may depend on a predetermined mapping algorithm, residing in the automation engine, that is aware of support staff schedules. The predetermined mapping algorithm can be updated manually or automatically by integrated or external scheduling systems.

[0024] Automation engine 102 can escalate alert 113 to a first designated person or team via an email, an instant message, a text message, a pager message, a voicemail, or similar means. Alerts can be mapped to team names, and a team name can be mapped to a group of individuals who are on call for predefined intervals (e.g., one day, one week, etc.). Part of the mapping can be used to identify which individuals are on call for a given interval. In this way, alert mappings can be abstracted from individual team members, whose assignments can be fluid. Automation engine 102 can then wait for a hand-off notification from the first designated person or team. The hand-off notification may be received by automation engine 102 through the same channel by which the alert was sent or through other means. If automation engine 102 does not receive the hand-off notification within a predetermined amount of time, it can escalate alert 113 to the next designated person or team in the rotation, as determined by a predefined mapping algorithm. The automation engine can keep escalating the alert to the next designated person or team in the rotation until it receives a hand-off notification.

[0025] Monitoring engine 103 may receive a feedback response (e.g., in the form of an action) from the device or software program after repair action 112 is executed, and may pass the response on to automation engine 102. Automation engine 102 can then update troubleshooting database 114. Statistical information, such as the success rate of repair actions, can be used to change the execution priority of repair actions.
Furthermore, a feedback response associated with actions performed by a designated person or team can also be recorded in troubleshooting database 114, so that a machine learning algorithm or similar mechanism can be employed to expand the action list, map new alerts to existing actions, map existing alerts to new actions, and so on. According to some embodiments, automation engine actions and designated person actions can be audited by the system. The system can keep a history of who performed a specific action, when, and against which device or server. The records can then be used for troubleshooting, tracking system changes, and/or developing new automated alert responses.

[0026] According to further embodiments, automation engine 102 can perform a wildcard search of troubleshooting database 114 and determine multiple repair actions in response to a received alert. The execution of single repair actions or groups of them can depend on the predetermined priority of the repair actions. Repair action groups can also be mapped to alert groups. Although an alert can match multiple wildcard mappings, only the most specific mapping may actually be applied. For example, an alert exchange/transport/queue might match the mappings exchange/*, exchange/transport/*, and exchange/transport/queue. However, only the last of these may actually be applied, as it is the most specific.

[0027] Figure 2 illustrates actions during alert escalation in diagram 200. Monitoring engine 202 can provide a detected problem as an alert (211) to automation engine 204. Automation engine 204 can check the available actions (212) from action store 206 (troubleshooting database 114 of Figure 1) and perform the action if one is available (213). If no action is available, automation engine 204 can escalate the alert (214) to process owner 208. The alert can be escalated further (215) to another designee 209.
As previously discussed, escalation can also be performed in parallel with the execution of a determined action.

[0028] Upon receipt of a new action to be performed (216, 217) from process owner 208 or the other designee 209, automation engine 204 can perform the new action (218) and update the records with the new action (219) for future use. The example interactions in diagram 200 illustrate a limited scenario. Other interactions, such as hand-offs with the designated persons, feedback from the device/software reporting the problem, and the like, can also be included in the operation of an automated recovery and escalation system according to the embodiments.

[0029] Figure 3 is a conceptual diagram that illustrates alert management in a multiple-region environment in diagram 300. In a distributed system, the escalation of alerts can depend on a predetermined priority of geographic regions. For example, when escalations are managed by a single support team for two regions, a predetermined priority can escalate an alert from the region where it is daytime and hold back an alert from the region where it is nighttime. Similarly, repair actions from different regions can be prioritized based on a predetermined priority when they compete for the same hardware, software, or communication resources to address detected problems.

[0030] Diagram 300 illustrates how alerts from different regions can be addressed by a system according to the embodiments. According to an example scenario, monitoring engines 303, 313, and 323 can be responsible for monitoring hardware and/or software problems in regions 1, 2, and 3 (304, 314, and 324), respectively. Upon detection of a problem, each of the monitoring engines can transmit alerts to the respective automation engines 302, 312, and 322, which may be responsible for the respective regions. The logic for the automation engines can be distributed to each region in the same way as the monitoring logic.
According to some embodiments, automation can occur across regions, as in the case of a complete site failure and recovery. According to other embodiments, one automation engine can be responsible for several regions. Similarly, the escalation target can also be centralized or distributed. For example, the system can escalate to different teams based on the time of day. Monitoring engines 303, 313, and 323 can have their own separate regional databases for managing monitoring processes. Automation engines 302, 312, and 322 can query the troubleshooting database (central or distributed) to map alerts to repair actions.

[0031] If the action(s) are found, automation engines 302, 312, and 322 can perform the action(s) on devices and/or programs in regions 304, 314, and 324. A global monitoring database 310 can also be implemented for all regions. If automation engines 302, 312, and 322 cannot find matching repair actions, they can escalate alerts to a designated support team 301 based on predefined regional priorities, such as an organizational structure. For example, region 304 can be the corporate enterprise network of a commercial organization, while region 324 can be the documentation support network. In this scenario, a problem detected in region 304 can be prioritized over a problem detected in region 324. Similarly, the time of day, a working day/holiday distinction between different regions, and comparable factors can be taken into account when determining regional priorities.

[0032] According to some embodiments, multiple automation engines can be assigned to different regions, and the escalation and/or repair action execution priorities can be decided through a consensus algorithm among the automation engines, as mentioned above. Alternatively, a process overseeing the regional automation engines can render the priority decisions.
Furthermore, automation engines 302, 312, and 322 can interface with regional troubleshooting databases, which include customized alert-to-repair-action mappings for the different regions.

[0033] Although recovery and escalation processes in distributed systems have been discussed above using example scenarios, specific repair actions, and alert escalations in conjunction with Figures 1, 2, and 3, the embodiments are not limited to those. Mapping alerts to repair actions, prioritizing repair actions, escalating alerts, and other processes can be implemented using other operations, priorities, evaluations, and so on, using the principles discussed here.

[0034] Figure 4 is an example networked environment in which the embodiments can be implemented. Mapping an alert to a repair action can be implemented through software executed on one or more servers 422, such as a hosted service. Server 422 can communicate with client applications on individual computing devices, such as a cell phone 411, a mobile computing device 412, a smartphone 413, a laptop computer 414, and a desktop computer 415 (client devices), via network(s) 410. Client applications on client devices 411 to 415 can facilitate user interactions with the service running on server(s) 422, enabling automated management of software and/or hardware problems associated with the service. The automation and monitoring engine(s) can run on any of the servers 422.

[0035] Data associated with the operations, such as the mapping of alerts to repair actions, may be stored in one or more data stores (for example, data store 425 or 426), which can be managed by any of the servers 422 or by database server 424. Automated recovery and escalation of detected problems according to the embodiments can be triggered when an alert is detected by the monitoring engine, as discussed in the examples above.

[0036] The network(s) 410 can comprise any topology of servers, clients, Internet service providers, and communication media.
A system according to the embodiments can have a static or dynamic topology. Network(s) 410 may include a secure network, such as a corporate network; an insecure network, such as an open wireless network; or the Internet. The network(s) 410 provide communication between the nodes described here. By way of example, and not limitation, the network(s) 410 may include wireless media, such as acoustic, RF, infrared, and other wireless media.

[0037] Many other configurations of computing devices, applications, data sources, and data distribution systems can be employed to implement a system for automating the management of distributed system problems according to the embodiments. Furthermore, the networked environments discussed in Figure 4 are for illustration purposes only. The embodiments are not limited to the example applications, modules, or processes.

[0038] Figure 5 and the associated discussion are intended to provide a brief general description of a suitable computing environment in which the embodiments can be implemented. Referring to Figure 5, a block diagram of an example computing operating environment for a service application according to the embodiments is illustrated, such as a computing device 500. In a basic configuration, computing device 500 may be a server in a hosted service system and include at least one processing unit 502 and system memory 504. Computing device 500 may also include a plurality of processing units that cooperate in executing programs. Depending on the exact configuration and type of computing device, system memory 504 can be volatile (such as RAM), non-volatile (such as ROM, flash memory, etc.), or some combination of the two. System memory 504 typically includes an operating system 505 suitable for controlling the operation of the platform, such as the WINDOWS® operating systems from MICROSOFT CORPORATION of Redmond, Washington. System memory 504 may also include one or more program modules 506, automation engine 522, and monitoring engine 524.
[0039] Automation engine 522 and monitoring engine 524 can be separate applications or integrated modules of a hosted service that handles system alerts, as discussed above. This basic configuration is illustrated in Figure 5 by the components within dashed line 508.

[0040] Computing device 500 may have additional features or functionality. For example, computing device 500 may include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in Figure 5 by removable storage 509 and non-removable storage 510. Computer readable storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. System memory 504, removable storage 509, and non-removable storage 510 are all examples of computer readable storage media. Computer readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computing device 500. Any such computer readable storage media may be part of computing device 500. Computing device 500 may also have input device(s) 512, such as a keyboard, a mouse, a pen, a voice input device, a touch input device, and comparable input devices. Output device(s) 514, such as a display, speakers, a printer, and other types of output devices, may also be included. These devices are well known in the art and need not be discussed at length here.
[0041] Computing device 500 may also contain communication connections 516 that allow the device to communicate with other devices 518, such as over a wireless network in a distributed computing environment, a satellite link, a cellular link, and comparable mechanisms. Other devices 518 may include computer device(s) that execute distributed applications and perform comparable operations. Communication connection(s) 516 is (are) an example of communication media. Communication media can include computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed so as to encode information in the signal. By way of example, and not limitation, communication media include wired media, such as a wired network or a direct wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media.

[0042] The example embodiments also include methods. These methods can be implemented in a variety of ways, including the structures described in this document. One such way is by machine operations of devices of the type described in this document.

[0043] Another optional way is for one or more of the individual operations of the methods to be performed in conjunction with one or more human operators performing some of them. These human operators need not be co-located with each other, but each can be with only a machine that performs a portion of the program.

[0044] Figure 6 illustrates a logic flow diagram 600 for automating the management of problem recovery and escalation in distributed systems according to the embodiments. Process 600 can be implemented on a server as part of a hosted service or in a client application that interacts with a service such as those described previously.
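By way of a non-limiting illustration, the most-specific-match rule of paragraph [0026], on which the alert mapping step of process 600 can rely, might be sketched in Python as follows. The mapping table, the action names, and the function names are hypothetical and are not part of the disclosure; the sketch assumes that a wildcard appears only as the last segment of a hierarchical alert name.

```python
from typing import Dict, List

# Hypothetical contents of a troubleshooting database:
# hierarchical alert pattern -> repair actions in execution priority order.
MAPPINGS: Dict[str, List[str]] = {
    "exchange/*": ["collect_diagnostics"],
    "exchange/transport/*": ["restart_transport_service"],
    "exchange/transport/queue": ["flush_queue", "restart_transport_service"],
}


def pattern_matches(pattern: str, alert_name: str) -> bool:
    """Return True if the pattern covers the hierarchical alert name.

    Assumption: a trailing '*' segment matches one or more remaining segments.
    """
    p_segs = pattern.split("/")
    a_segs = alert_name.split("/")
    if p_segs[-1] == "*":
        prefix = p_segs[:-1]
        return len(a_segs) > len(prefix) and a_segs[:len(prefix)] == prefix
    return p_segs == a_segs


def specificity(pattern: str) -> int:
    """Count literal (non-wildcard) segments; more literals is more specific."""
    return sum(1 for seg in pattern.split("/") if seg != "*")


def find_repair_actions(alert_name: str,
                        mappings: Dict[str, List[str]] = MAPPINGS) -> List[str]:
    """Return the actions of the most specific matching mapping, or []."""
    matches = [p for p in mappings if pattern_matches(p, alert_name)]
    if not matches:
        return []  # no mapping found: the alert would be escalated
    best = max(matches, key=specificity)
    return list(mappings[best])
```

Under these assumptions, the alert exchange/transport/queue matches all three patterns, but only the actions of the exact mapping are returned; an empty result corresponds to the case in which the alert is escalated.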
[0045] Process 600 begins with operation 602, where an automation engine detects an alert sent by a monitoring engine in response to a device and/or software application problem in the system. In operation 604, the automation engine, having received the alert from the monitoring engine, may begin collecting information associated with the alert. This can be followed by an attempt to map the alert to one or more repair actions in operation 606.

[0046] If an explicit action mapped to the alert is found in decision operation 608, the action(s) can be performed in subsequent operation 610. If no explicit action is determined during the mapping process, the alert can be escalated to a designated person or team in operation 614. Operation 614 can be followed by optional operations 616 and 618, in which a new action can be received from the designated person or team and performed. In operation 612, records can be updated with the action taken (mapped or new), so that the mapping database can be expanded or statistical information associated with success rates can be used for future monitoring and automated response tasks.

[0047] The operations included in process 600 are for illustration purposes. Automation of problem recovery and escalation in distributed applications can be implemented by similar processes with fewer or additional steps, as well as in a different order of operations, using the principles described here.

[0048] The above specification, examples, and data provide a complete description of the manufacture and use of the composition of the embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above.
Rather, the specific features and acts described above are disclosed as example forms of implementing the claims and embodiments.
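By way of a non-limiting illustration, the flow of process 600, combined with the cyclic escalation of paragraph [0024], might be sketched in Python as follows. All names are hypothetical: `notify` stands for any channel that sends the alert to a designee and reports whether a hand-off notification arrived within the predetermined time, `execute` stands for whatever performs a repair action, and the `max_rounds` limit is an assumption added so that the sketch terminates (the description itself lets escalation continue until a hand-off notification is received). For brevity the mapping step is an exact lookup rather than a wildcard search.

```python
from typing import Callable, Dict, List, Optional


def escalate_cyclically(
    alert_name: str,
    diagnostics: dict,
    rotation: List[str],
    notify: Callable[[str, str, dict], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Walk the rotation until a designee acknowledges the hand-off."""
    for _ in range(max_rounds):
        for designee in rotation:
            # notify() is assumed to return True only if the hand-off
            # notification was received within the predetermined time.
            if notify(designee, alert_name, diagnostics):
                return designee
    return None  # nobody acknowledged within max_rounds passes


def handle_alert(
    alert_name: str,
    diagnostics: dict,
    mappings: Dict[str, List[str]],
    execute: Callable[[str], None],
    rotation: List[str],
    notify: Callable[[str, str, dict], bool],
    records: List[dict],
) -> dict:
    """Map the alert to actions, else escalate; then record the outcome."""
    actions = mappings.get(alert_name, [])   # operation 606
    owner = None
    if actions:                              # decision operation 608
        for action in actions:               # operation 610, in priority order
            execute(action)
    else:                                    # operation 614
        owner = escalate_cyclically(alert_name, diagnostics, rotation, notify)
    record = {
        "alert": alert_name,
        "diagnostics": diagnostics,
        "actions": list(actions),
        "escalated_to": owner,
    }
    records.append(record)                   # operation 612
    return record
```

The sketch omits optional operations 616 and 618, in which a new action received from the designee would be performed and added to the mappings.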
Claims:
Claims (14)

[0001] 1. Method to be performed, at least in part, on a computing device for automated recovery and alert escalation in distributed systems, comprising the steps of: detecting, by a monitoring engine (103), a problem associated with at least one of a device and a software application within a distributed system; transmitting, by the monitoring engine (103), an alert (113) based on the detected problem to an automation engine (102); receiving, at the automation engine (102), the alert associated with the detected problem from the monitoring engine (103); the method further comprising the following steps performed by the automation engine (102): collecting diagnostic information associated with the detected problem; the method characterized in that it comprises: attempting to map the alert to a recovery action, wherein the automation engine (102) performs a wildcard search of a troubleshooting database to determine multiple repair actions in response to the received alert, the troubleshooting database storing alert profiles matched to repair actions classified by device or software program; if the alert is mapped to a recovery action, performing the recovery action, wherein, if the alert matches multiple wildcard mappings, only the most specific mapping is applied; otherwise, escalating (111) the alert to a designee (101) along with the collected diagnostic information; and updating, using the collected diagnostic information, records associated with the mapping of alerts to repair actions; the method further comprising the monitoring engine (103) receiving a feedback response from a device or software program after the recovery action is performed and passing the response to the automation engine (102); and, in response, the automation engine (102) updating the troubleshooting database.

[0002] 2.
Method according to claim 1, characterized in that the collected diagnostic information includes at least one of a set of: device memory contents, sensor readings, last executed instructions, failed instructions, and results associated with the detected problem.

[0003] 3. Method according to claim 1, characterized in that it further comprises: waiting for a hand-off response from the designee after the alert is escalated; and, if the hand-off response is not received within a predefined period, escalating the alert to another designee.

[0004] 4. Method according to claim 1, characterized in that the designee is determined from a predefined list of designees and a naming convention of the alert, and the designee includes one of a person and a team.

[0005] 5. Method according to claim 1, characterized in that escalating the alert includes: transmitting the alert to the designee by at least one of a set of: an email, an instant message, a text message, a page, and a voicemail.

[0006] 6. Method according to claim 1, characterized in that it further comprises: updating a success rate record associated with the recovery action.

[0007] 7.
System for automated retrieval and alert escalation in distributed systems comprising: a server running a monitoring engine (103) and an automation engine (102), wherein the monitoring engine is configured to: detect a problem associated with at least one of a device and a software application within a distributed system; and transmitting an alert (113) based on the problem detected; and the automation engine (102) is configured to: receive the alert (113); collect diagnostic information associated with the detected problem; the system characterized in that it comprises: attempting to map the alert (113) to a repair action, wherein the automation engine (102) performs a wildcard search of a problem solving database to determine multiple repair actions in response to the incoming alert, the troubleshooting database storing alert profiles corresponding to repair actions further classified by devices or software programs; if the alert is mapped to a repair action, perform the repair action, where if the alert matches multiple wildcard mappings, only the most specific mapping will be applied; otherwise escalate (111) the alert to a designee (101) along with the collected diagnostic information; and update the records in the problem resolution database using the collected diagnostic information; the monitoring engine (103) further configured to receive a feedback response from a device or software program after performing the recovery action and pass the response to the automation engine (102); the automation engine still configured to update the troubleshooting database. [0008] 8. 
System according to claim 7, characterized in that it further comprises a plurality of monitoring engines, each monitoring engine being configured to monitor a distinct geographic region based on the system scale for each geographic region within the system of distributed and broadcast alerts based on problems detected in their respective regions, where the automation engine is further configured to: one of perform a mapped recovery action and escalate to designated alerts from different regions based on a priority regional. [0009] 9. System, according to claim 7, characterized in that the regional priority is further determined based on the availability of at least one of a set of: a designated support team, a hardware resource, a software resource and a communication feature. [0010] 10. System according to claim 7, characterized in that the alert is mapped to a plurality of recovery actions, and the recovery actions are executed according to the predefined execution priority. [0011] 11. System according to claim 7, characterized in that the device includes one of the following: a desktop computer, a laptop, a handheld, a server, a smartphone, a monitor, audio equipment , a television set and a video capture device. [0012] 12. 
A computer readable storage medium having a method comprising: detecting, by a monitoring engine (103), a problem associated with at least one of a device and a software application within a distributed system; transmitting, by a monitoring motor (103), an alert (113) based on the detected problem to an automation motor (102); and receiving the alert (113) associated with the detected problem from the monitoring engine (103) in the automation engine (102); the method further comprises the following steps performed by the automation engine (102); collect diagnostic information associated with the detected problem; characterized by the fact that it further comprises: attempting to map the alert (113) to one of the problem resolution database recovery actions to determine multiple recovery actions in response to the received alert, the problem resolution database storing alert profiles corresponding to recovery actions sorted by device or software programs; if the alert is mapped to a recovery action, perform the recovery action, where if the alert matches multiple wildcard mappings, only the most specific mapping is applied; otherwise escalate (111) the alert to a designee (101) along with the collected diagnostic information; and update, using the collected diagnostic information, the records associated with mapping alert and recovery actions; the method further comprises the monitoring engine (103) receiving a feedback response from a device or software program after performing the recovery action and passing the response to the automation engine (102); in response, the automation engine (102) updates the troubleshooting database. [0013] 13. Computer-readable storage medium according to claim 12, characterized in that the recovery action is mapped to one of the following: a single alert and a group of alerts. [0014] 14. 
Computer-readable storage medium according to claim 12, characterized in that the designated is determined from one of an alert naming convention and a rotational algorithm based on the availability of support personnel.
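The mapping and escalation steps recited in the claims can be illustrated with a minimal Python sketch. It is not the claimed implementation: the alert names, the pattern syntax (shell-style wildcards via the standard `fnmatch` module), the specificity metric (count of literal characters in a pattern) and the `acknowledged` callback that stands in for waiting a predefined period for a transfer response are all assumptions made for illustration. Diagnostic collection and database updates are omitted.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, Optional, Sequence, Tuple


def specificity(pattern: str) -> int:
    """Assumed metric: more literal (non-wildcard) characters means a narrower pattern."""
    return sum(1 for ch in pattern if ch not in "*?")


def map_alert(alert_name: str, mappings: Dict[str, str]) -> Optional[str]:
    """Wildcard search of a hypothetical mapping table; only the most specific match is used."""
    matches = [pattern for pattern in mappings if fnmatchcase(alert_name, pattern)]
    if not matches:
        return None  # unmapped alert: the caller should escalate
    return mappings[max(matches, key=specificity)]


def escalate(
    alert_name: str,
    designees: Sequence[str],
    acknowledged: Callable[[str, str], bool],
    max_rounds: int = 2,
) -> Optional[str]:
    """Cycle through designees until one confirms the transfer.

    `acknowledged(designee, alert_name)` is a hypothetical callback that returns
    True if the designee responded within the predefined period.
    """
    for _ in range(max_rounds):
        for designee in designees:
            if acknowledged(designee, alert_name):
                return designee
    return None  # nobody confirmed within the allowed rounds


def handle_alert(
    alert_name: str,
    mappings: Dict[str, str],
    designees: Sequence[str],
    acknowledged: Callable[[str, str], bool],
) -> Tuple[str, Optional[str]]:
    """Return ("recover", action) for mapped alerts, else ("escalate", owner)."""
    action = map_alert(alert_name, mappings)
    if action is not None:
        return ("recover", action)
    return ("escalate", escalate(alert_name, designees, acknowledged))
```

With a table such as `{"mail/*": "restart_service", "mail/store/*": "failover_store"}`, an alert named `mail/store/latency` matches both patterns and resolves to `failover_store`, because that pattern has more literal characters. An alert that matches no pattern is passed along the designee list until someone confirms the transfer.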
Patent family:
Publication number | Publication date
US20110260879A1 | 2011-10-27
RU2589357C2 | 2016-07-10
EP2561444B1 | 2018-12-19
EP2561444A4 | 2017-08-30
RU2012144650A | 2014-04-27
CN102859510A | 2013-01-02
BR112012026917A2 | 2016-07-12
JP2013527957A | 2013-07-04
WO2011133299A2 | 2011-10-27
KR20130069580A | 2013-06-26
WO2011133299A3 | 2012-03-01
EP2561444A2 | 2013-02-27
KR101824273B1 | 2018-01-31
ES2716029T3 | 2019-06-07
US8823536B2 | 2014-09-02
HK1179724A1 | 2013-10-04
JP5882986B2 | 2016-03-09
CN102859510B | 2015-07-15
Legal status:
2017-07-25 | B25A | Requested transfer of rights approved | Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC (US)
2019-01-08 | B06F | Objections, documents and/or translations needed after an examination request according [chapter 6.6 patent gazette]
2019-09-17 | B06U | Preliminary requirement: requests with searches performed by other patent offices: procedure suspended [chapter 6.21 patent gazette]
2020-10-27 | B06A | Notification to applicant to reply to the report for non-patentability or inadequacy of the application [chapter 6.1 patent gazette]
2021-02-23 | B09A | Decision: intention to grant [chapter 9.1 patent gazette]
2021-04-20 | B16A | Patent or certificate of addition of invention granted | Free format text: Term of validity: 10 (ten) years counted from 20/04/2021, subject to the legal conditions.
Priority:
Application number | Filing date | Patent title
US12/764,263 | US8823536B2 | 2010-04-21 | 2010-04-21 | Automated recovery and escalation in complex distributed applications
US12/764,263 | 2010-04-21
PCT/US2011/030458 | WO2011133299A2 | 2010-04-21 | 2011-03-30 | Automated recovery and escalation in complex distributed applications