Video recommendation based on content (Machine-translation by Google Translate, not legally binding)
Patent abstract:
Methods, systems, and apparatus, including computer programs encoded on a storage medium, for providing video recommendations. For each video, a set of images included in the video is obtained. For each respective image in the set of images of a video, a set of one or more keywords describing its visual content is generated. Based at least on the sets of keywords for at least some of the images, a set of keywords describing the video is generated. Videos are assigned to groups based on the set of keywords generated for each video. A request for a recommendation based on a first video is received. Data identifying a second video is provided as the recommendation, based on the second video being assigned to the same group as the first video.

Publication number: ES2648368A1
Application number: ES201630878
Filing date: 2016-06-29
Publication date: 2018-01-02
Inventors: Cyrille Bataller; Luis Mendez
Applicant: Accenture Global Solutions Ltd
Primary IPC classification:
Patent description:
Technical field

This specification relates to machine learning.

Background

Data communication networks, such as the internet, provide access to various types of information and content. One type of content available on the internet is video. For example, video sharing websites provide access to millions of different videos. Additionally, streaming services provide access to many different movies, television shows, and events. Due to the large number of videos available on the internet, it can be difficult for users to find videos in which they are interested.

Brief description of the invention

This specification describes, among other things, a system that generates data describing the content of videos or other multimedia and uses the data to identify videos or multimedia, for example, in response to a query or as a recommendation to a user. For example, the system can provide recommendations of videos, and links to the recommended videos, while the user watches another video. Recommended videos may be videos considered similar to the video the user is watching. The similarity between videos can be determined based at least on keywords that describe visual content represented by images in the videos. For example, images can be obtained from the videos and analyzed using machine learning techniques to identify keywords that describe the images of the videos. The keywords of one video can be compared with the keywords of another video to determine the similarity between the videos.

In general, an innovative aspect of the subject matter described herein can be embodied in methods that include the actions of, for each video in a set of videos: obtaining a set of images that are included in the video; for each respective image in the set of images, generating a respective first set of one or more keywords that describe the visual content represented by the respective image; and generating, based at least on the respective first sets of one or more keywords for at least some of the images, a second set of keywords that describe the video; assigning the videos in the set of videos to groups based on the second set of keywords generated for each video; receiving a request for a video recommendation based on a first video of the set of videos; and providing, as the video recommendation, data that identifies a second video of the set of videos based on the second video being assigned to the same group as the first video. Other embodiments of this aspect include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on storage devices. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

The foregoing and other embodiments may each optionally include one or more of the following features, alone or in combination. In some aspects, the request for the video recommendation is generated in response to at least one of (i) a presentation of the first video or (ii) a request for the first video. In some aspects, assigning the videos to groups based on the second set of keywords generated for each video can include using a machine learning process to assign videos that have at least a threshold level of similarity to one another to a same group. Several ways of calculating the similarity are discussed in the detailed description.
The similarity between two videos can be based on a similarity between the respective first sets of keywords generated for the two videos. In some aspects, the second set of keywords for each video can be arranged in a sequence based on a sequence in which the images from which the keywords were generated occur in the video. The similarity between two videos can be based on the keyword sequence for a first video and the keyword sequence for a second video. Some aspects may include identifying, for a first video, a number of occurrences of each keyword in the second set of keywords for the first video and identifying, for a second video, a number of occurrences of each keyword in the second set of keywords for the second video. The similarity between the two videos can be based on a comparison of the numbers of occurrences of each keyword.

Some aspects may include generating, for each video, a third set of keywords that describe the audible content of the video. The second set of keywords that describe the video can be generated additionally based on the third set of keywords. In some aspects, generating the respective first set of one or more keywords that describe the visual content represented by a given image may include using a deep learning process to generate at least a portion of the first set of keywords. Generating the first set of one or more keywords that describe the visual content represented by a given image may include detecting an object that is represented by the given image and including, in the first set of keywords that describe the visual content, a keyword that describes the detected object. In some aspects, generating the first set of one or more keywords that describe the visual content represented by a given image may include detecting a person represented by the given image and including, in the first set of keywords, a keyword that identifies the detected person.

In some aspects, generating, based on the first sets of keywords for at least some of the images, a second set of keywords describing the video may include identifying, for each generated keyword, a number of images of the video for which the keyword was generated and identifying, for inclusion in the second set of keywords, a pre-specified number of the keywords. Some aspects may include generating an index of video scenes for the set of videos. The index can include, for each video scene, a set of keywords that describe the video scene. A query that specifies a query term can be received. A video scene in the index can be identified based on the query term matching a keyword that is included in the index for the video scene. Data specifying the identified video scene can be provided in response to the query. The second set of keywords may include keywords selected from the respective first sets of keywords for at least some of the images.

Particular embodiments of the subject matter described herein can be implemented so that one or more of the following advantages are realized. Videos that may be of interest to a user can be identified more accurately using the content that is represented by the video images. The similarity between videos can be determined with more precision by comparing the keywords that describe the content (for example, objects, people, backgrounds, etc.) that is included in the images obtained from the videos. Similar videos can be identified as recommendations to help users find videos that may be of interest to them.
By providing similar video recommendations more accurately, the number of user queries and user requests can be reduced, resulting in lower bandwidth consumption, lower demand on network resources used to transmit queries and videos, and fewer processing cycles for a computer processor. By indexing videos and video scenes using keywords, search results can be ranked more accurately in response to user queries. By providing better ranked search results, the number of requests received by a search engine can be reduced, resulting in lower demand on computing resources and improved speed in responding to queries. Users can also search for scenes that have particular content, for example, the scene of a movie that includes a tiger, which allows users to find particular scenes more quickly. Internet-accessible movies that have an incorrect title can be found by searching based on the content of the movie, allowing the identification of sites that provide fraudulent copies of movies that may, for example, use fake, obfuscated, or alternative movie titles. In some embodiments, the content of video can be searched, for example, for events or objects that appear in videos recorded by a CCTV video surveillance system.

Details of embodiments are described in the accompanying drawings and in the following detailed description. Other features, aspects, and advantages of the subject matter will be apparent from the description, the drawings, and the claims.

Brief description of the drawings

Fig. 1 is a block diagram of an example environment in which a video system provides videos and/or data related to the videos.
Fig. 2 is a diagram of an example keyword generator.
Fig. 3 is a flow chart of an example process for providing a video recommendation.
Fig. 4 is a flow chart of an example process for providing data specifying a video or a video scene.

Like reference numbers and designations in the various drawings indicate like elements.

Detailed description

This specification describes systems and techniques for generating and providing recommendations for videos, for example, movies, television programs, sports videos, events, applications, music videos, etc., or other content. For example, a description of one or more scenes can be used to identify other videos that have a similar or related scene. Similar techniques and systems can be used to search for particular videos and/or particular video scenes within videos. For example, the description of a scene or other portion of a video can be indexed based on an identifier for the video and a time at which the scene occurs in the video. When a query is received, the query can be compared with the descriptions. Scenes that have a description that matches or is similar to the query can be provided.

In some implementations, the system obtains images, for example, screenshots or video frames, of a video and uses the images of the video to generate a set of terms or keywords that describe the video. The system can obtain an image of the video at a particular sampling frequency. The sampling frequency can be based on a number of video frames per unit of time or based on time. The system can generate keywords for each image obtained from the video. In some implementations, the system uses automated processes to generate keywords that describe the content represented by the image. For example, an image obtained from a movie can represent a car driving across a bridge on a rainy day.
In this example, the system can generate the keywords "car", "bridge", and "rain" to describe the image. In another example, an image obtained from a movie can represent Barack Obama standing in front of the White House. In this example, the system can generate the keywords "Barack Obama" and "White House" using object recognition techniques that identify the White House in the image and person recognition techniques that identify Barack Obama in the image. The machine learning techniques used to generate the descriptions can be trained using images labeled, for example, by a user. The labels for an image can describe the content of the image.

Other data can also be used to describe each obtained image. In some implementations, the audio of the video that corresponds to the image and/or the audio that occurs in the video before or after the image is analyzed to generate keywords that describe the image. Continuing with the previous example of the car and the bridge, the audio at the image or within an interval of one or two seconds of the image may include the sound of thunder. In this example, the description of the image can be "car", "bridge", "rain", "thunder", and "storm".

The system can generate a set of keywords for the video based on the keywords generated for each obtained image. The system can include in the set of keywords those keywords that occur most often. For example, the keywords for the video can include the top N keywords, ranked based on the number of times each keyword occurs in the keywords for the video's images. In another example, the keywords for the video may include those keywords that occur at least a threshold number of times. In some embodiments, the keyword set for a video may be in the form of a keyword vector. In addition, the keywords may be arranged in the vector based on the sequence in which the images occur in the video. For example, a first element of the vector may include the keywords for the first obtained image; a second element of the vector may include the keywords for the second obtained image, and so on.

The keywords generated for a video can be used to identify other similar or related videos. For example, the keywords generated for a video can be compared with keywords generated for other videos to identify other videos that include the same or similar keywords. In some implementations, a machine learning process can cluster videos into groups according to their sets of generated keywords. When a user watches a video in a particular group, the system can provide the user with a list of other videos in the group as recommendations.

The system can also index scenes or other short portions of a video with the keywords generated for the video. In this way, the system can provide videos or scenes in response to user queries. For example, if a user sends the query "car driving across a bridge", the system can compare the terms of the query with the index. In this example, the system can identify the previous example scene of a car driving across a bridge on a rainy day. In response, the system can provide the user with data that identifies the scene and/or a link to a video that starts at, or that includes, the scene.

Fig. 1 is a diagram of an example environment 100 in which a video system 130 provides videos and/or data related to videos. The example environment 100 includes a client device 110 that allows users to download, store, and watch videos and other content, for example, other multimedia content.
The client device 110 is an electronic device that is capable of requesting and receiving data over a data communications network 120, for example, a local area network (LAN), a wide area network (WAN), the internet, a mobile network, or a combination of these. Example client devices include personal computers, mobile communication devices, for example, smartphones and/or tablet computing devices, and smart TVs or internet-enabled televisions, for example, a television with network connectivity or that is connected to a set-top box that gives the television network connectivity, and other appropriate devices.

The example client device 110 includes a video player 112 and an internet browser 114. The internet browser 114 facilitates the sending and receiving of data over the network 120. The internet browser 114 can allow the user to interact with text, images, videos, music, and other information typically located on a website. In some implementations, the video player 112 is an application that facilitates downloading, streaming, and viewing videos, for example, from a video streaming service or video sharing service. For example, the video player 112 may be a native application developed for a particular platform or a particular type of device, for example, a smartphone or a smartphone that includes a particular operating system. The video player 112 and/or the internet browser 114 can provide a user interface that allows users to browse or search for videos. For example, in a smart television implementation, the video player 112 may provide a guide that is displayed on the television screen and allows a user to browse or search for movies, programs, or other videos.

The video system 130 may provide videos and/or video-related data to the client devices 110 over the network 120. For example, the video system 130 may be part of a video streaming service or a video sharing service that streams or downloads videos to the client device 110. In another example, the video system 130 may be a third-party service that provides video recommendations, for example, recommendations of movies or television shows, and/or video search results in response to requests for such data. In this example, an internet site or video streaming service may request from the video system 130 search results or recommendations in response to a user who watches a particular video or sends a request for a video.

The video system 130 includes a keyword generator 140 that generates keywords describing videos stored in a video storage system 150, for example, hard drives and/or solid state drives. The keyword generator 140 can generate keywords for a video based on the content of the video. In some implementations, the keyword generator 140 generates keywords for a video based on the visual content that is represented in one or more images obtained from the video. For example, the images can be screenshots taken from the video or video frames of the video. For each image, the keyword generator 140 may generate a set of one or more keywords that describe the visual content of the image. The visual content for which the keywords are generated can include general scene content, for example, exterior, rainy, dark, etc., objects represented in the image, for example, cars, buildings, etc., and/or persons represented in the image. As described in more detail below with reference to Fig. 2,
the keyword generator 140 may include a deep learning engine, an object recognition engine, and/or a person recognition engine to generate keywords that describe the visual content represented in the images.

The keyword generator 140 can also generate keywords for a video based on the audio content of the video. For example, the keyword generator 140 can generate keywords that describe sounds, for example, thunder, cars, birds, etc., music, for example, particular songs, and spoken words, for example, using speech recognition, that are included in the audio of the video. The keyword generator 140 may generate a set of one or more keywords for various periods of time of the audio. For example, the keyword generator 140 can segment the video into a sequence of one-minute video portions and generate a set of one or more keywords that describe the audio of each one-minute video portion. In this example, a ten-minute video can be segmented into ten one-minute segments and one or more keywords can be generated for each segment based on the audio included in the segment.

The keyword generator 140 may generate a set of keywords that describe a video based on the keywords generated for images obtained from the video and/or the keywords generated based on the audio of the video. For example, the keyword generator 140 can generate an aggregate set of keywords for a video based on the keywords generated for the images and the keywords generated for the audio. In some implementations, keyword selection techniques can be used to select only a subset of the keywords generated for the images and/or a subset of the keywords generated for the audio. For example, the keyword generator 140 can identify, for each keyword, the number of occurrences of the keyword in the set of keywords generated for the images and/or the set of keywords generated for the audio. In this example, a keyword can have multiple occurrences by being generated to describe multiple different images of the video and/or by being generated to describe the audio of multiple different video segments. In another example, a keyword can have multiple occurrences by being generated to describe an image of the video and also to describe the audio of a video segment. The keyword generator 140 may include, in the set of keywords that describe the video, each keyword that has at least a threshold number of occurrences, or a particular number of the keywords that have the highest numbers of occurrences, for example, the top 10, 50, or 100 keywords.

The set of keywords that describe a video can also include keywords obtained from sources other than the images and audio. For example, the set of keywords that describe a video may also include keywords that are included in the metadata for the video, keywords that are included in the video's title, keywords that are included in the video's description, words that are included in the video's subtitle data, words that are included in the video's credits, keywords that are obtained from comments or reviews related to the video, and/or keywords that are obtained from other appropriate sources.

The keyword generator 140 may generate a video index 152 that includes video identification data and the generated keywords for each video. For example, the video index 152 may include, for each video, a unique identifier for the video, for example, a unique title or numerical code, and the set of keywords generated by the keyword generator 140 to describe the video as a whole.
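The aggregation just described can be summarized in a short sketch. The following Python fragment is only an illustration, under assumed names and defaults (video_keywords, index_entry, a threshold of three occurrences), of how per-image and per-audio keyword sets might be counted, filtered, and combined with keywords from other sources into an entry of a video index such as the video index 152; it is not the patented implementation.

```python
from collections import Counter

def video_keywords(image_keyword_sets, audio_keyword_sets, other_keywords=(),
                   min_occurrences=3, top_n=None):
    """Aggregate per-image and per-audio keyword sets into a video-level keyword set.

    image_keyword_sets / audio_keyword_sets: iterables of keyword lists, one list
    per sampled image or audio segment. other_keywords: keywords drawn from
    metadata, title, description, subtitles, credits, or reviews.
    """
    counts = Counter()
    for keywords in list(image_keyword_sets) + list(audio_keyword_sets):
        counts.update(keywords)

    if top_n is not None:
        # Keep only the N keywords with the highest numbers of occurrences.
        selected = [kw for kw, _ in counts.most_common(top_n)]
    else:
        # Keep every keyword with at least a threshold number of occurrences.
        selected = [kw for kw, n in counts.items() if n >= min_occurrences]

    # Keywords from other sources are added without an occurrence requirement.
    return dict(counts), selected + [kw for kw in other_keywords if kw not in selected]

def index_entry(video_id, image_keyword_sets, audio_keyword_sets, other_keywords=()):
    """Build one entry of a video index: an identifier plus the video-level keywords."""
    counts, keywords = video_keywords(image_keyword_sets, audio_keyword_sets, other_keywords)
    return {"video_id": video_id, "keywords": keywords, "keyword_counts": counts}
```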
In some implementations, the video index 152 includes an index of the scenes or other types of video segments for at least some of the videos. For example, the video index 152 may include, for each scene of a video, an identifier 161 for the video in which the scene occurs, a unique identifier 162 for the scene, a time 163 at which the scene occurs in the video, and the keyword(s) 164 generated for the scene. As used herein, the term "scene" may refer to a particular scene of a movie or television show, or to another type of video segment that is less than the entire video. For example, a scene may be a portion of a video for which keywords have been generated by the keyword generator 140. In this example, the video index 152 may include a scene and its corresponding keywords for each image for which the keyword generator 140 generates keywords.

Consider, for example, a ten-minute video. The keyword generator 140 can obtain an image from the video every ten seconds, resulting in a set of sixty images for the video. The keyword generator 140 can then generate, for each of the sixty images, a set of one or more keywords describing the image. In this example, the video index 152 may include, for the ten-minute video, one entry for each image, resulting in sixty entries for the video. The entry for each image may include a unique identifier for the scene that corresponds to the image, a time at which the scene occurs in the video, for example, the time within the video at which the image was obtained, and the keyword(s) generated for the image. Additionally, or alternatively, the entry for each image may include keyword(s) generated based on the audio of the video at the time the image is represented in the video and/or the audio that occurs within a specific amount of time before and/or after the image is represented in the video. For example, the entry for a particular image may include the keyword(s) generated for the image and the keyword(s) generated for the audio of a ten-second video segment that begins five seconds before the image is represented in the video and ends five seconds after the image is represented in the video.

The keywords generated for each video and/or for each scene of a video can be used to identify similar videos and/or to surface particular video scenes in response to search queries. For example, the video system 130 may include a search engine 142, a clustering engine 144, and a recommendation engine 146. The search engine 142, the clustering engine 144, and the recommendation engine 146 can each be implemented on one or more servers, for example, located in one or more data centers.

In some implementations, the keyword generator 140 may also generate keywords for image collections and/or image sequences and generate an index for the collections and/or sequences. The index may include, for each collection or sequence, data identifying the collection or sequence and/or one or more keywords for each image in the collection or sequence. In this way, a user can search for particular images in the collection or sequence. For example, a collection of images can be images obtained from a surveillance camera. In this example, a user can search for images that include a particular object, for example, a weapon, or particular clothing, for example, a baseball cap.

The search engine 142 may receive queries from the client devices 110 or other sources and provide search results that identify and/or link to the videos or scenes that are responsive to the queries.
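As a concrete, hypothetical reading of the per-scene entries described above, the sketch below builds one entry per sampled image for the ten-minute example; the field and function names (scene_index_entries, scene_id, time_seconds) are assumptions for illustration only.

```python
def scene_index_entries(video_id, per_image_keywords, sample_period_seconds=10):
    """Build one scene index entry per sampled image of a video.

    per_image_keywords: list of keyword lists, ordered by the time at which
    each image occurs in the video (e.g. 60 lists for a ten-minute video
    sampled every ten seconds).
    """
    entries = []
    for i, keywords in enumerate(per_image_keywords):
        entries.append({
            "video_id": video_id,          # identifier 161 of the video
            "scene_id": f"{video_id}-{i}", # unique identifier 162 of the scene
            "time_seconds": i * sample_period_seconds,  # time 163 of the scene
            "keywords": list(keywords),    # keyword(s) 164 generated for the scene
        })
    return entries
```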
The search engine 142 may use the video index 152 to identify videos or scenes that are responsive to a received query. For example, the search engine 142 may compare the terms included in a query to the keyword(s) generated for each video and/or the keyword(s) generated for each scene. The search engine 142 may identify videos and/or scenes that have at least one corresponding keyword that matches at least one term of the query. The search engine 142 can then provide search results for at least a portion of the identified videos or scenes. For example, the search engine 142 may rank the videos and/or scenes and provide search results for a specific number of the highest ranked videos and/or scenes. The videos and/or scenes can be ranked based on the number of matching terms between the query and the videos and/or scenes, the quality of the videos and/or scenes, and/or the popularity of the videos and/or scenes, for example, in terms of the number of times the videos or scenes have been watched.

A search result may include text that identifies a video or a scene and/or text that describes the video or scene. The search result may also include a link to the video or the scene. A search result for a particular scene of a video may include a link to a video that includes only the particular scene. In another example, a search result for a particular scene may include a link to the beginning of the particular scene within the video that includes the particular scene. For example, user interaction with the link or search result may cause the client device 110 to load the video, for example, in the video player 112 or in the internet browser 114, and start the video at the start time of the particular scene.

The search engine 142 may also allow users to find a video that may be indexed or advertised on the internet under different titles. For example, when movies are searched based on the content of the movies, fraudulent advertisements for movies that use different titles can be found. The search engine 142 may also allow users to search for particular scenes within a particular video. For example, the search engine 142 may provide a user interface within the internet browser 114 that allows users to select a video and enter keywords into a search box. The search engine 142 may then use the video index 152 to identify scenes within the selected video that are responsive to the entered query. For example, a user can search for the "chase scene" in a particular movie. In some implementations, the user can enter keywords into the search box without first selecting a particular movie, and a search based on the keywords can be performed across various movies based on their content. Accordingly, if a user enters the query "chase scene", the search engine 142 may deliver results from one or multiple videos that are determined to have a chase scene. In addition, the search results may indicate the particular locations in the respective videos at which the chase scenes occur so that, for example, the user can simply click on a representation of a given video to be led directly to the playback of the relevant scene that was specified in the search result.

The clustering engine 144 can identify similar or related videos or scenes and generate groups of similar or related videos or scenes. In some implementations, the clustering engine 144 groups the videos based on the similarity between the keywords generated for the videos.
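Before turning to the details of the clustering engine 144, the query matching and ranking just described for the search engine 142 can be sketched as follows; the scoring (number of matching terms, with popularity as a tie-breaker) and the field names are illustrative assumptions, not the claimed ranking.

```python
def rank_videos(query_terms, indexed_videos):
    """Rank indexed videos by how many query terms match their keywords.

    indexed_videos: iterable of dicts such as
      {"video_id": "v1", "keywords": ["car", "bridge", "rain"], "popularity": 120}
    Only videos with at least one matching keyword are returned, ordered by the
    number of matching terms and, as a tie-breaker, by popularity (view count).
    """
    query = {term.lower() for term in query_terms}
    results = []
    for video in indexed_videos:
        matches = query & {kw.lower() for kw in video["keywords"]}
        if matches:
            results.append((len(matches), video.get("popularity", 0), video))
    results.sort(key=lambda r: (r[0], r[1]), reverse=True)
    return [video for _, _, video in results]
```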
For example, the clustering engine 144 may compare the set of keywords generated by the keyword generator 140 for a first video to the set of keywords generated by the keyword generator 140 for a second video, to determine a level of similarity between the first video and the second video. If the first video and the second video have a level of similarity that satisfies a threshold level of similarity, the first video can be included in a group with the second video.

The level of similarity between two videos can be determined using cosine similarity. For example, each video can be represented by a vector that represents the keywords associated with the video, and the cosine similarity can be used to determine the similarity between the vectors for the videos. The level of similarity between two videos can be based on the number of keywords common to the two videos. For example, the level of similarity for the two videos may be proportional to the number of keywords generated for the first video that match keywords generated for the second video. The level of similarity between two videos can also be based on the number of occurrences of the keywords for the first video and the number of occurrences of the keywords for the second video. For example, a keyword can occur several times for a video if the keyword is generated to describe several images of the video. If the two videos have common keywords that occur several times for both videos, or that occur a similar number of times for both videos, the videos may have a higher level of similarity than if the keywords that occur several times for the first video do not occur several times for the second video.

In some implementations, the level of similarity between two videos may be based on the sequence in which the keywords occur for the two videos. For example, the set of keywords for each video may be arranged in the order of the images and/or the audio for which the keywords were generated. In particular, the keyword(s) generated for the first image that is represented by a video, or the first image obtained from the video, may be arranged first in the set of keywords, the keyword(s) generated for the second image that is represented by the video, or the second image obtained from the video, may be arranged after the keywords generated for the first image, and so on. The clustering engine 144 can compare the keyword sequence for the first video to the keyword sequence for the second video to determine the level of similarity between the two videos. The level of similarity between the two videos can be based on the number of keywords that occur in the same or similar sequences and/or the number of equal or similar sequences of at least a specific number of keywords, for example, at least three keywords in sequence. Two similar keyword sequences can be sequences that include the same keywords but also include no more than a specific number of additional keywords. For example, a keyword sequence for the first video may be "dog, jump, fence, street" based on the keywords generated for one or more images of the first video. A keyword sequence for the second video may be "dog, cat, jump, fence, yard, street". These two sequences can be considered similar sequences due to the common "dog, jump, fence, street" sequence in the keywords for both videos, although the keywords for the second video include the additional keywords "cat" and "yard" that are not included in the keywords for the first video.
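The cosine similarity mentioned above can be illustrated with a brief sketch over keyword-count vectors; this is an assumed formulation for illustration, not the patented computation, and the sequence-based comparison is described in the text that follows the sketch.

```python
import math
from collections import Counter

def cosine_similarity(keyword_counts_a, keyword_counts_b):
    """Cosine similarity between two videos represented as keyword-count vectors.

    keyword_counts_a / keyword_counts_b: mappings from keyword to the number of
    occurrences of that keyword across the images (and audio segments) of a video.
    """
    common = set(keyword_counts_a) & set(keyword_counts_b)
    dot = sum(keyword_counts_a[k] * keyword_counts_b[k] for k in common)
    norm_a = math.sqrt(sum(v * v for v in keyword_counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in keyword_counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Illustrative use with the "dog, jump, fence, street" example:
video_1 = Counter(["dog", "jump", "fence", "street"])
video_2 = Counter(["dog", "cat", "jump", "fence", "yard", "street"])
similarity = cosine_similarity(video_1, video_2)  # ~0.82 for these two keyword sets
```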
Two keyword sequences can also be considered similar sequences if fewer than a specific number of keywords are out of sequence. For example, the keyword sequence for the first video may be "dog, jump, fence, street" based on the keywords generated for one or more images of the first video. A keyword sequence for the second video may be "dog, fence, jump, street". These two sequences can be considered similar sequences because the sequences include four common keywords even though two of the keywords are transposed.

In some implementations, the clustering engine 144 uses a machine learning process to assign videos or scenes to groups based on the similarity between the keyword sets and/or the keyword sequences that are generated to describe the videos or scenes. The clustering engine 144 may generate a group index 154 that includes data related to the groups of similar videos or scenes. The group index 154 may include, for each group, a unique identifier 166 for the group and data specifying the videos or scenes 167 assigned to the group.

The recommendation engine 146 may provide video recommendations based on the groups generated by the clustering engine 144. The recommendation engine 146 may recommend videos or scenes that are similar to a video or a scene that is being viewed by a user, or videos or scenes that are similar to a video or scene requested by a user. For example, the video player 112 or the internet browser 114 may present video recommendations in a user interface in which a video is played. When the client device 110 requests a video, for example, from the video system 130, the recommendation engine 146 can access the group index to identify one or more groups of which the requested video is a member. The recommendation engine 146 can then select one or more videos that are included in the identified group(s) to recommend to the user of the client device 110 from which the request was received. An example process for providing a video recommendation is illustrated in Fig. 3 and described below.

The search engine 142 may also allow a user to search for similar videos using the group index 154. For example, the search engine 142 may provide a user interface within the internet browser 114 that allows a user to select a video and request videos that are similar to the selected video. In another example, the video player 112 may include an icon that, when selected, sends a request to the search engine 142 for videos that are similar to a video that is presented on the client device 110. The search engine 142 can access the group index 154 to identify other videos that are included in the same group(s) as the selected video or the video that is presented on the client device 110. The search engine 142 can then provide data specifying the similar videos for presentation on the client device 110.

Fig. 2 is a diagram of the example keyword generator 140 of Fig. 1. The keyword generator 140 includes an image extractor 220 and an audio extractor 250 that simultaneously receive a video 210. The image extractor 220 can obtain a set of images 225 from the video 210 and provide the set of images 225 to one or more image analysis engines 230. The image extractor 220 can obtain the set of images 225 by taking screenshots of the video 210 at a sampling frequency or by extracting video frames from the video 210 based on the sampling frequency. The sampling frequency can be based on the number of video frames per unit of time or based on time.
For example, the frequency can be every frame, every two frames, or every five frames. In another example, the frequency can be every second, every two seconds, or every ten seconds.

The example keyword generator 140 includes a deep learning engine 232, an object recognition engine 234, and a person recognition engine 236. Other implementations may include only one or two of the engines 232-236 or additional engines that are not illustrated in Fig. 2. The deep learning engine 232 may use one or more deep learning techniques, for example, a deep learning stack, to generate or select one or more keywords that describe the visual characteristics of an image. In some implementations, the deep learning engine 232 may generate keywords that describe the visual content of the image in general or at a higher level. For example, the deep learning engine 232 can analyze the visual characteristics of an image to identify environmental characteristics, for example, inside, outside, light, dark, rain, snow, etc., and/or location characteristics, for example, city, beach, mountains, farm, etc.

The object recognition engine 234 may use one or more object recognition techniques to identify objects in an image and generate keywords that describe the objects. For example, the object recognition engine 234 may use edge detection techniques, scale-invariant feature transform (SIFT) techniques, bag-of-words techniques, and other appropriate techniques for detecting objects in the images. For each detected object, the object recognition engine 234 can generate one or more keywords that describe the object.

The person recognition engine 236 can use one or more person recognition techniques to identify people in an image and generate keywords that identify and/or describe the people. For example, the person recognition engine 236 may use facial recognition techniques to detect known persons in an image. The person recognition engine 236 can also analyze the visual characteristics of an image to determine or predict the gender, age, or other characteristics of an unrecognized person and generate keywords that describe these characteristics.

Each of the engines 232-236 can be trained using labeled training data. The labeled training data may include images that have labels that describe the images. For example, a user can label the images based on what the user sees in each image. The engines 232-236 can then be trained to generate keywords that correctly describe what other images represent. The engines 232-236 can be trained until each one generates keywords that correctly describe at least a threshold percentage of test images.

For a video 210, each of the engines 232-236 can analyze each image in the set of images 225 and generate a set of one or more keywords that describe the image based on its respective image analysis. The keywords 240 generated by the engines 232-236 can then be provided to a keyword aggregator 270 that aggregates the keywords for the video, as described below.

The audio extractor 250 can extract audio 255 from the video 210. The extracted audio can be a continuous stream of audio for the entire video or a set of audio segments. For example, the audio extractor 250 can segment the video 210 into a sequence of successive one-minute, two-minute, or three-minute portions of video and extract the audio from each portion of video. In another example, the audio extractor 250 can extract audio from the video for each image.
In this example, the audio for an image may include audio that occurs in the video 210 before the point in the video at which the image occurs and audio that occurs in the video 210 after the point in the video at which the image occurs. For example, if an image is obtained from the video at a point two minutes from the beginning of the video, the audio for the image can include the audio that starts at one minute and fifty seconds from the beginning of the video and ends at two minutes and ten seconds from the beginning of the video.

The audio extractor 250 can provide the extracted audio to an audio analysis engine 260. The audio analysis engine 260 can analyze the audio 255 to identify the sounds that are included in the audio 255. For example, the audio analysis engine 260 can compare the audio with known sounds to detect sounds in the extracted audio. In another example, the audio analysis engine 260 may use speech recognition to detect spoken words in the audio.

The audio analysis engine 260 may generate one or more sets of keywords 265 based on the sounds detected in the extracted audio. For example, if the audio is a single continuous stream, the audio analysis engine 260 can generate one set of keywords based on the sounds detected in the extracted audio. The set of keywords can be arranged in order based on the order in which the sounds occur in the audio and, consequently, the order in which the sounds occur in the video 210. If the audio is segmented, for example, based on the images or on video portions, the set of keywords can include a subset of one or more keywords for each image or video portion. The subsets may also be arranged in the order in which the images or video portions occur in the video 210. The audio analysis engine 260 can provide the keywords 265 to the keyword aggregator 270.

The keyword aggregator 270 may aggregate the keywords 240 received from the image analysis engines 230 and/or the keywords 265 received from the audio analysis engine 260 into a set of keywords that describe the video. In some implementations, the keyword aggregator 270 can generate a combined list of keywords that includes each keyword generated by the image analysis engines 230 for the video 210 and each keyword generated by the audio analysis engine 260 for the video 210. In some implementations, the keyword aggregator 270 includes only a subset of the keywords 240 and the keywords 265 in the set of keywords that describe the video 210. For example, the keyword aggregator 270 may include the most popular keywords for the video 210 in the keyword set that describes the video 210. In this example, the keyword aggregator 270 can identify, for each keyword that is generated by an image analysis engine 230 or the audio analysis engine 260 for the video 210, for example, each keyword that is included in the keywords 240 and the keywords 265 for the video 210, a number of occurrences of the keyword in the keywords 240 and the keywords 265. For example, a keyword can have three occurrences if the keyword was generated by the image analysis engines 230 for two different images and the keyword was generated by the audio analysis engine 260 for one video segment. The keyword aggregator 270 may then select, for inclusion in the keyword set for the video 210, the keywords that have at least a threshold number of occurrences or a particular number of the keywords that have the highest numbers of occurrences. For example, the set of keywords for the video may include the keywords that have at least three occurrences in the keywords 240 and/or the keywords 265.
In another example, the keyword aggregator 270 can rank the keywords based on the number of occurrences of each keyword in the keywords 240 and/or the keywords 265 and select the top ten, twenty, or other appropriate number of keywords in the ranking.

The keyword aggregator 270 may also aggregate the keywords 240 received from the image analysis engines 230 and the keywords 265 received from the audio analysis engine 260 into a set of keywords describing each image obtained from the video 210. For example, as described above, the audio extractor 250 can extract audio from the video for each image and the audio analysis engine 260 can generate one or more keywords describing the audio for each image. The keyword aggregator 270 can generate, for each image, an aggregate set of keywords for the image that includes all or at least a portion of the keywords that were generated for the image by the image analysis engines 230 and the audio analysis engine 260.

As described above, the keyword generator 140 can generate or populate an index, for example, the video index 152 of Fig. 1, with video-related data. For example, the index may include data that identifies the video, for example, a unique identifier, and the set of keywords that describe the video that was generated by the keyword aggregator 270. The index may also include, for each image, data that identifies a scene that corresponds to the image and the keywords generated for the image. The index can then be used to search for videos and video scenes in response to received queries.

Fig. 3 is a flow chart of an example process 300 for providing a video recommendation. The process 300 can be implemented by one or more computer programs installed on one or more computers. The process 300 will be described as being carried out by an appropriately programmed system of one or more computers, for example, the video system 130 of Fig. 1.

The system obtains, for each video of a set of videos, a set of images that are included in the video (302). For example, the system can obtain a set of screenshots or video frames of each video based on a sampling frequency.

For each image of each video, the system generates a respective first set of one or more keywords that describe the visual content that is represented by the image (304). For example, each image can be analyzed using deep learning techniques, object recognition techniques, person recognition techniques, and/or other image analysis techniques to generate keywords that describe the visual content represented by the image. The set of one or more keywords for a given image can include the keywords that were generated based on each of the analyses. In some implementations, the set of one or more keywords for an image may also include keywords that were generated based on the audio of the video that occurs at the same time as the image in the video, or the audio that occurs within a specific time before and after the image occurs.

The system generates, for each video in the set of videos, a second set of keywords that describe the video (306). The second set of keywords for a given video may include at least a portion of the keywords that are included in the first sets of keywords that were generated for at least some of the images obtained from the video. For example, the system may select some keywords of the first sets of keywords generated for at least some of the images obtained from the given video.
As described above, the keyword set for a video can be selected based on the number of occurrences of the keywords in the keyword sets generated for the video's images. In some implementations, the second set of keywords for a given video may also include keywords that are included in the metadata for the given video, keywords that are included in the title of the given video, keywords that are included in the description of the given video, words that are included in the subtitle data for the given video, words that are included in the credits for the given video, words that are obtained from comments or reviews related to the given video, and/or keywords that are obtained from other appropriate sources.

The system assigns the videos in the set of videos to groups based on the second set of keywords for each video (308). The system can assign videos that have at least a threshold level of similarity to one another to a same group. For example, the system can use one or more machine learning techniques to assign videos to groups based on the similarity between the second sets of keywords for the videos. The level of similarity between two videos can be based on a comparison of the second sets of keywords for the two videos regardless of sequence. For example, the level of similarity between two videos can be based on a comparison of the number of occurrences of each of the keywords in the second set of keywords for a first video of the two videos and the number of occurrences of each of the keywords in the second set of keywords for a second video of the two videos. In another example, the level of similarity between two videos can be based on a comparison of the keyword sequences in the second sets of keywords for the two videos. For example, if the second sets of keywords for the two videos have similar keyword sequences, the videos may have similar sequences of scenes, which indicates that the videos are similar.

The system receives a request for a video recommendation based on a first video of the set of videos (310). The request for the video recommendation can be transmitted from a client device to the system in response to the presentation of the first video. For example, video recommendations can be presented adjacent to a window or a viewing area in which the first video is presented. When the first video is presented, a video player or internet browser that presents the video may cause the client device to send a request to the system for video recommendations.
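A minimal sketch of steps (308) through (312) follows, under the assumption of a simple single-pass grouping against a fixed similarity threshold (using, for example, the cosine similarity sketched earlier); the grouping strategy and the names used are illustrative, not the claimed method. The alternative trigger for the recommendation request is described after the sketch.

```python
def assign_to_groups(videos, similarity, threshold=0.5):
    """Single-pass grouping: each video joins the first group whose representative
    it is at least `threshold` similar to, otherwise it starts a new group.

    videos: dict mapping video_id -> keyword-count vector (e.g. a Counter).
    similarity: a function of two keyword-count vectors, such as the
    cosine_similarity sketch shown earlier.
    """
    groups = []  # each group: {"group_id": int, "members": [video_id, ...]}
    for video_id, counts in videos.items():
        for group in groups:
            representative = videos[group["members"][0]]
            if similarity(counts, representative) >= threshold:
                group["members"].append(video_id)
                break
        else:
            groups.append({"group_id": len(groups), "members": [video_id]})
    return groups

def recommend(first_video_id, groups):
    """Return the other members of the group(s) containing the requested video."""
    recommendations = []
    for group in groups:
        if first_video_id in group["members"]:
            recommendations.extend(v for v in group["members"] if v != first_video_id)
    return recommendations
```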
In this way, if the user is interested in the second recommended video, the user can easily access the second video. For example, user interaction with the link can cause the video player or internet browser presenting the first video to navigate from the first video to the second video. In some implementations, the system can provide as recommendation data, identify multiple videos. For example, the first video can be assigned to one or more groups that include multiple different videos. The system can provide data that identifies at least a portion of the multiple different videos as recommendations. Fig. 4 is a flow chart of an example process 400 for providing data specifying a video or a video scene. The process 400 can be implemented by one or more computer programs installed on one or more computers. The process 400 will be described as being performed by an appropriately programmed system of one or more computers, for example, the video system 130 of Fig. 1. The system generates, for each video of a set of videos, a set of one or more keywords for each scene of the video (402). The set of one or more words key to a given scene can be generated based on one or more images obtained from the video and that are represented by the video during the scene. For example, each or more images for a scene can be analyzed using deep learning techniques, object recognition techniques, people recognition techniques, and / or other image analysis techniques to generate a or more keywords that describe the visual content that is represented by the image. The set of one or more keywords that are generated for a scene can also be generated based on the audible content that occurs in the video during the scene. For example, audible content can be compared to known sounds and / or voice recognition can be used to generate one or more keywords that describe the sounds that occur during the scene. The system generates an index of video scenes for videos that use the generated keyword (s) for each scene (404). For example, the index may include, for each scene, an identifier for the scene, an identifier for the video in which the scene occurs, and the set of one or more keywords generated for the scene. The index may also include, for each video, an identifier for the video and the generated keyword (s) for each scene in the video. The system receives a query for a video or a video scene (406). The query can be received from a client device. For example, a user of the client device can send a query for a video. The query may include one or more query terms. The system identifies, in the index, a video or a video scene that has at least one keyword that matches at least one query term of the query (408). For example, the system can compare the query term (s) for querying the keywords included in the index. If multiple videos or video scenes have a keyword that matches a query query term, the system can select a video or a scene from multiple videos or video scenes. The system can make the selection based on the number of concordant terms between the query and the videos and / or the scenes, the quality of the videos / or the scenes, the popularity of the videos and / or the scenes, for example, in terms of the number of times the videos or scenes have been viewed. The system provides, in response to the query, data specifying the identified videos or video scenes (410). 
For example, the system can provide, to the client device from which the query was received, a search result that specifies the identified video or video scene. The search result may also include a link to the video or the scene. If the user interacts with the link, the client device can request the video or scene and present the video or scene to the user.
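As a closing illustration of the process 400, the sketch below matches query terms against scene index entries of the kind sketched earlier and returns, for each match, the data a client could use to construct a link to the start of the scene; the structure is a hypothetical reading of the description above, not the claimed implementation.

```python
def find_scenes(query_terms, scene_index):
    """Return scene-index entries with at least one keyword matching a query term.

    scene_index: iterable of entries such as those produced by
    scene_index_entries(), i.e. dicts with "video_id", "scene_id",
    "time_seconds", and "keywords".
    """
    query = {term.lower() for term in query_terms}
    matches = []
    for entry in scene_index:
        matching = query & {kw.lower() for kw in entry["keywords"]}
        if matching:
            matches.append((len(matching), entry))
    # Order by the number of matching terms; quality or popularity signals
    # could be added as tie-breakers as described above.
    matches.sort(key=lambda m: m[0], reverse=True)
    return [
        {"video_id": e["video_id"], "scene_id": e["scene_id"],
         "start_time_seconds": e["time_seconds"]}
        for _, e in matches
    ]
```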
Claims:
Claims (13)

[1] 1. A computer-implemented method for making video recommendations according to their content, the method comprising: obtaining, for each video of a set of videos, a set of images that are included in the video; for each image in the set of images, generating a respective first set of one or more keywords that describe the visual content represented by the image using object and person recognition techniques; and generating, based at least on the first sets of one or more keywords for at least some of the images, a second set of keywords that describe the video; assigning the videos in the set of videos to groups based on the second set of keywords that were generated for each video, using a machine learning process to assign videos that have a similarity greater than a threshold value to a same group, wherein the similarity between two videos is based on the similarity between the respective first sets of one or more keywords that were generated for the two videos; receiving a request for a video recommendation based on a first video of the set of videos; and providing, as the video recommendation, data that identifies a second video of the set of videos based on the second video being assigned to the same group as the first video.

[2] 2. The method of claim 1, wherein the request for the video recommendation is generated in response to at least one of (i) a presentation of the first video or (ii) a request for the first video.

[3] 3. The method of claim 1, wherein: the second set of keywords for each video is arranged in a sequence based on the sequence in which the images, for which the keywords were generated, occur in the video; and the similarity between two videos is based on a similarity between the keyword sequence for a first video and the keyword sequence for a second video.

[4] 4. The method of claim 1, further comprising: identifying, for a first video of two videos, a number of occurrences of each keyword in the second set of keywords for the first video; and identifying, for a second video of the two videos, a number of occurrences of each keyword in the second set of keywords for the second video, wherein the similarity between the two videos is based on a comparison of the number of occurrences of each keyword in the second set of keywords for the first video and the number of occurrences of each keyword in the second set of keywords for the second video.

[5] 5. The method of claim 1, further comprising generating, for each video, a third set of keywords that describe the audible content of the video, wherein the second set of keywords that describe the video is also generated based on the third set of keywords.

[6] 6. The method of claim 1, wherein generating the respective first set of one or more keywords that describe the visual content that is represented by a given image comprises using a deep learning process to generate at least a portion of the respective first set of one or more keywords.

[7] 7. The method of claim 1, wherein generating the respective first set of one or more keywords that describe the visual content that is represented by a given image comprises: detecting an object that is represented by the given image; and including, in the respective first set of one or more keywords that describe the visual content that is represented by the given image, a keyword that describes the detected object.
[8] 8. The method of claim 1, wherein generating the respective first set of one or more keywords that describe the visual content represented by a given image comprises:
detecting a person represented by the given image; and
including, in the respective first set of one or more keywords that describe the visual content represented by the given image, a keyword that identifies the detected person.
[9] 9. The method of claim 1, wherein generating, based on at least the respective first sets of one or more keywords for at least some of the images, a second set of keywords that describe the video comprises:
identifying, for each keyword generated for at least one image of the video, a number of images of the video for which the keyword was generated; and
identifying, for inclusion in the second set of keywords, a previously specified number of the keywords based on the number of images for which each keyword was generated.
[10] 10. The method of claim 1, further comprising:
generating an index of video scenes for the set of videos, wherein the index includes, for each video scene, a set of keywords that describe the video scene;
receiving a query that specifies at least one query term;
identifying a video scene in the index based on the at least one query term matching at least one keyword that is included in the index for the video scene; and
providing, in response to the query, data specifying the identified video scene.
[11] 11. The method of claim 1, wherein the second set of keywords includes keywords selected from the respective first sets for at least some of the images.
[12] 12. The method of claim 1, further comprising:
receiving a request for a video that is similar to a given video;
identifying at least one group that includes the given video;
selecting, from the at least one group, one or more videos; and
providing data that specifies the one or more videos.
[13] 13. A system for making video recommendations based on their content, comprising:
a data processing apparatus; and
a memory storage apparatus in data communication with the data processing apparatus, the memory storage apparatus storing instructions that are executable by the data processing apparatus and that, upon such execution, cause the data processing apparatus to carry out operations comprising:
for each video in a set of videos:
obtaining a set of images that are included in the video;
for each image in the set of images, generating a respective first set of one or more keywords that describe the visual content represented by the image, using object and person recognition techniques; and
generating, based on at least the respective first sets of one or more keywords for at least some of the images, a second set of keywords that describe the video;
assigning the videos in the set of videos to groups based on the second set of keywords generated for each video, using a machine learning process to assign videos that have a similarity greater than a threshold value to the same group, wherein the similarity between two videos is based on the similarity between the respective first sets of one or more keywords generated for the two videos;
receiving a request for a video recommendation based on a first video of the set of videos; and
providing, as the video recommendation, data that identifies a second video of the set of videos based on the second video being assigned to the same group as the first video.
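Claim 10 adds a keyword index over video scenes that is matched against query terms. As a hedged sketch only (the tuple layout, helper names, and ranking rule are assumptions, not the claimed implementation), such an index could be built and queried like this:

```python
from collections import defaultdict

def build_scene_index(scenes):
    """Map each keyword to the scenes it describes.
    `scenes` is a list of (video_id, scene_id, keywords) tuples (an assumed layout)."""
    index = defaultdict(set)
    for video_id, scene_id, keywords in scenes:
        for kw in keywords:
            index[kw.lower()].add((video_id, scene_id))
    return index

def search_scenes(index, query):
    """Return scenes whose keyword set matches at least one query term,
    ranked by how many query terms they match."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for scene in index.get(term, ()):
            hits[scene] += 1
    return sorted(hits, key=hits.get, reverse=True)

index = build_scene_index([
    ("v1", 0, ["dog", "park"]),
    ("v1", 1, ["dog", "ball"]),
    ("v2", 0, ["goal", "stadium"]),
])
print(search_scenes(index, "dog ball"))   # [('v1', 1), ('v1', 0)]
```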
[14] 14. The system of claim 13, wherein the request for the video recommendation is generated in response to at least one of (i) a presentation of the first video or (ii) a request for the first video.
[15] 15. The system of claim 13, wherein:
the second set of keywords for each video is arranged in a sequence based on the sequence in which the images for which the keywords were generated occur in the video; and
the similarity between two videos is based on a similarity between the keyword sequence for a first video and the keyword sequence for a second video.
[16] 16. The system of claim 13, wherein the operations further comprise:
identifying, for a first video of two videos, a number of occurrences of each keyword in the second set of keywords for the first video; and
identifying, for a second video of the two videos, a number of occurrences of each keyword in the second set of keywords for the second video,
wherein the similarity between the two videos is based on a comparison of the number of occurrences of each keyword in the second set of keywords for the first video and the number of occurrences of each keyword in the second set of keywords for the second video.
[17] 17. The system of claim 13, wherein the operations further comprise generating, for each video, a third set of keywords that describe the audible content of the video, wherein the second set of keywords that describe the video is additionally generated based on the third set of keywords.
[18] 18. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that, when executed by one or more computers, cause the one or more computers to carry out operations comprising:
for each video in a set of videos:
obtaining a set of images that are included in the video;
for each image in the set of images, generating a respective first set of one or more keywords that describe the visual content represented by the image, using object and person recognition techniques; and
generating, based on at least the respective first sets of one or more keywords for at least some of the images, a second set of keywords that describe the video;
assigning the videos in the set of videos to groups based on the second set of keywords generated for each video, using a machine learning process to assign videos that have a similarity greater than a threshold value to the same group, wherein the similarity between two videos is based on the similarity between the respective first sets of one or more keywords generated for the two videos;
receiving a request for a video recommendation based on a first video of the set of videos; and
providing, as the video recommendation, data that identifies a second video of the set of videos based on the second video being assigned to the same group as the first video.
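Claims 3-4 and 15-16 compare videos either through ordered keyword sequences or through per-keyword occurrence counts. The sketch below, using only Python's standard library, shows one plausible reading of each: a cosine similarity over occurrence counts and a ratio over the ordered keyword sequences. Neither measure is mandated by the claims.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

def count_similarity(keywords_a, keywords_b):
    """Compare the number of occurrences of each keyword (cosine over count vectors)."""
    ca, cb = Counter(keywords_a), Counter(keywords_b)
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def sequence_similarity(keywords_a, keywords_b):
    """Compare keyword sequences in the order the underlying images occur in each video."""
    return SequenceMatcher(None, keywords_a, keywords_b).ratio()

a = ["dog", "park", "dog", "ball"]   # keywords in image order for a first video
b = ["dog", "ball", "dog", "park"]   # keywords in image order for a second video
print(round(count_similarity(a, b), 3), round(sequence_similarity(a, b), 3))
```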
Similar technologies:
Publication No. | Publication date | Patent title
ES2648368B1 | 2018-11-14 | Video recommendation based on content
CA2951849C | 2019-03-26 | Selection of thumbnails for video segments
US8804999B2 | 2014-08-12 | Video recommendation system and method thereof
US20210287012A1 | 2021-09-16 | Detection of demarcating segments in video
US9202523B2 | 2015-12-01 | Method and apparatus for providing information related to broadcast programs
US8898714B2 | 2014-11-25 | Methods for identifying video segments and displaying contextually targeted content on a connected television
EP2541963B1 | 2021-03-17 | Method for identifying video segments and displaying contextually targeted content on a connected television
CN102244807B | 2014-04-23 | Adaptive video zoom
US9098807B1 | 2015-08-04 | Video content claiming classifier
US8750681B2 | 2014-06-10 | Electronic apparatus, content recommendation method, and program therefor
US9754166B2 | 2017-09-05 | Method of identifying and replacing an object or area in a digital image with another object or area
CN104769957B | 2019-01-25 | Identification and the method and apparatus that internet accessible content is presented
US20130148898A1 | 2013-06-13 | Clustering objects detected in video
US9740775B2 | 2017-08-22 | Video retrieval based on optimized selected fingerprints
US9100701B2 | 2015-08-04 | Enhanced video systems and methods
Zanetti et al., 2008 | A walk through the web's video clips
US20110047163A1 | 2011-02-24 | Relevance-Based Image Selection
Awad et al., 2016 | Trecvid semantic indexing of video: A 6-year retrospective
Saba et al., 2013 | Analysis of vision based systems to detect real time goal events in soccer videos
CN102542249A | 2012-07-04 | Face recognition in video content
US8706655B1 | 2014-04-22 | Machine learned classifiers for rating the content quality in videos using panels of human viewers
US8990134B1 | 2015-03-24 | Learning to geolocate videos
RU2413990C2 | 2011-03-10 | Method and apparatus for detecting content item boundaries
WO2017149447A1 | 2017-09-08 | A system and method for providing real time media recommendations based on audio-visual analytics
Daneshi et al., 2013 | Eigennews: Generating and delivering personalized news video
Patent family:
Publication No. | Publication date
US10579675B2 | 2020-03-03
ES2648368B1 | 2018-11-14
US20180004760A1 | 2018-01-04
References cited:
Publication No. | Filing date | Publication date | Applicant | Patent title
US20030101104A1 | 2001-11-28 | 2003-05-29 | Koninklijke Philips Electronics N.V. | System and method for retrieving information related to targeted subjects
US20150037009A1 | 2013-07-31 | 2015-02-05 | TCL Research America Inc. | Enhanced video systems and methods
US20150082330A1 | 2013-09-18 | 2015-03-19 | Qualcomm Incorporated | Real-time channel program recommendation on a display device
TWI384413B | 2006-04-24 | 2013-02-01 | Sony Corp | An image processing apparatus, an image processing method, an image processing program, and a program storage medium
US9092520B2 | 2011-06-20 | 2015-07-28 | Microsoft Technology Licensing, Llc | Near-duplicate video retrieval
US9244924B2 | 2012-04-23 | 2016-01-26 | Sri International | Classification, search, and retrieval of complex video events
KR102161230B1 | 2013-05-28 | 2020-09-29 | 삼성전자주식회사 | Method and apparatus for user interface for multimedia content search
US20150012946A1 | 2013-07-03 | 2015-01-08 | United Video Properties, Inc. | Methods and systems for presenting tag lines associated with media assets
US10437901B2 | 2013-10-08 | 2019-10-08 | Flipboard, Inc. | Identifying similar content on a digital magazine server
US10762161B2 | 2017-08-08 | 2020-09-01 | Accenture Global Solutions Limited | Intelligent humanoid interactive content recommender
US20190272071A1 | 2018-03-02 | 2019-09-05 | International Business Machines Corporation | Automatic generation of a hierarchically layered collaboratively edited document view
CN108614856B | 2018-03-21 | 2021-01-05 | 北京奇艺世纪科技有限公司 | Video sequencing calibration method and device
CN109189986B | 2018-08-29 | 2020-07-28 | 百度在线网络技术(北京)有限公司 | Information recommendation method and device, electronic equipment and readable storage medium
US20200177531A1 | 2018-12-03 | 2020-06-04 | International Business Machines Corporation | Photo sharing in a trusted auto-generated network
KR20200101235A | 2019-02-19 | 2020-08-27 | 삼성전자주식회사 | Method of providing augmented reality contents and electronic device therefor
CN110278447B | 2019-06-26 | 2021-07-20 | 北京字节跳动网络技术有限公司 | Video pushing method and device based on continuous features and electronic equipment
CN111061913B | 2019-12-16 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Video pushing method, device, system, computer readable storage medium and equipment
Legal status:
2018-11-14 | FG2A | Definitive protection | Ref document number: 2648368; Country of ref document: ES; Kind code of ref document: B1; Effective date: 2018-11-14
Priority:
Application No. | Filing date | Patent title
ES201630878A | ES2648368B1 | 2016-06-29 | 2016-06-29 | Video recommendation based on content
US15/348,773 | US10579675B2 | 2016-06-29 | 2016-11-10 | Content-based video recommendation