Patent abstract:
A computer-implemented method for semantically describing the content of an image is disclosed, comprising the steps of receiving a signature associated with said image and receiving a plurality of groups of initial visual concepts; the method being characterized by the steps of expressing the image signature as a vector including components referring to the initial visual concept groups, and modifying said signature by applying a filtering rule to the components of said vector. Developments describe rules for thresholds and/or order statistics, intra-group or inter-group, partitioning techniques including the visual similarity of images and/or the semantics of concepts, and the optional addition of manual annotations to the semantic description of the image. The advantages of the method in terms of parsimonious and diversified semantic representation are presented.
Publication number: FR3030846A1
Application number: FR1463237
Filing date: 2014-12-23
Publication date: 2016-06-24
Inventors: Adrian Popescu; Nicolas Ballas; Alexandru Lucian Ginsca; Herve Le Borgne
Applicants: Commissariat à l'Energie Atomique (CEA); Commissariat à l'Energie Atomique et aux Energies Alternatives (CEA)
IPC main classification:
Patent description:

[0001] FIELD OF THE INVENTION

The invention relates generally to the technical field of data mining and in particular to the technical field of automatic annotation of the content of an image.

STATE OF THE ART

A "multimedia" document comprises - by etymology - information of different natures, which are generally associated with distinct sensory or cognitive capacities (for example with vision or hearing). A multimedia document may for example be an image accompanied by "tags" (that is to say, annotations), or a web page combining images and text. A digital document can generally be divided into several "channels" of information, which may include, for example, textual information (from OCR character recognition, for example) and visual information (such as illustrations and/or photos identified in the document). A video can also be separated into several such channels: a visual channel (e.g. the frames of the video), a sound channel (e.g. the soundtrack), and a textual channel (e.g. resulting from the transcription of speech into text, as well as from the metadata of the video, e.g. date, author, title, format, etc.). A multimedia document may therefore contain, in particular, visual information (i.e. pixels) and textual information (i.e. words). During a search in multimedia data, querying (i.e. searching in databases) can involve requests that can themselves take different forms: (a) one or more multimedia documents (combining images and texts), and/or (b) visual information only (search known as "search by image" or "by the content of the image"), or even (c) textual form only (the general case of mainstream search engines). The technical problem of finding information in multimedia databases consists in particular in finding the documents of the database that resemble the request as much as possible. In an annotated database (for example, annotated by tags and/or labels), a technical problem posed by classification consists in predicting this label or these labels for a new non-annotated document.
[0002] The content of an exclusively visual document must be associated with classification models that determine the classes to which the document may belong, for example in the absence of tags, annotations or keyword descriptions of the image (or, indirectly, via the publication context of the image, for example).
[0003] In the case where these metadata are accessible, the content of a multimedia document (combining image and text) must be described in a consistent and effective manner. An initial technical problem therefore consists in determining an efficient way of describing the visual content of an image, that is to say, of constructing a semantic representation of it. If textual annotations exist, this may consist, for example, in combining the representation of the visual content with these annotations. The relevance of the representation thus constructed can be evaluated in many ways, including, in particular, measuring the accuracy of the results. In terms of image search, accuracy is given by the number of images semantically similar to an image query, a text query, or a combination of image and text. In terms of image classification, relevance is evaluated by the precision of the results (e.g. the proportion of correctly predicted labels) and by its generalization capacity (e.g. the classification "works" for several classes to be recognized). The computation time (usually determined by the complexity of the representation) is generally an important factor in both the search and the classification scenarios.
[0004] The availability of large collections of structured images (e.g. organized by concepts, such as ImageNet (Deng et al., 2009)), coupled with the availability of learning methods with sufficient scalability, has led to the proposition of semantic representations of visual content (see Li et al., 2010; Su and Jurie, 2012; Bergamo and Torresani, 2012). These representations are generally implemented starting from one or more visual descriptors, which are then used by learning methods to construct classifiers or descriptors for individual concepts. A descriptor assigns or allocates or associates one or more classes (e.g. name, quality, property, etc.) with an object, here an image. Finally, the final description is obtained from the probability scores given by the classification of the test images against each classifier associated with the concepts making up the representation (Torresani et al., 2010). For their part, Li et al. introduced in 2010 ObjectBank, a semantic representation composed of the responses of approximately 200 classifiers pre-computed from a manually validated image database. In 2012, Su and Jurie manually selected 110 attributes to implement a semantic representation of images. In 2010, Torresani et al. introduced "classemes", which are based on more than 2000 models of individual concepts learned using images from the Web. Following this work, Bergamo and Torresani introduced in 2012 "meta-classes", i.e. representations based on concepts from ImageNet in which similar concepts are grouped and learned together. In 2013, deep neural networks were used to solve large-scale image classification problems (Sermanet et al., Donahue et al.). According to this approach, the classification scores given by the last layer of the network can be used as a semantic representation of the content of an image. However, several hardware limitations make it difficult to efficiently represent a large number of classes and a very large number of images within a single network: the number of classes processed is typically of the order of 1000, and the number of images of the order of one million. In 2012, Bergamo and Torresani published an article titled "Meta-class features for large scale object categorization on a budget" (CVPR, IEEE, 2012). The authors propose a compact representation of images by grouping concepts of the ImageNet hierarchy using their visual affinity. They use a quantization (i.e. the most salient dimensions are set to 1 and the others to 0) which makes the descriptor more compact. However, this approach defining "meta-classes" does not ensure a diversified representation of the content of the images, and the quantization also leads to a drop in performance. The current state of the art rarely addresses the aspects relating to the diversity of image search, i.e. ensuring that the different concepts present in an image appear in the associated representation. The invention proposed in this document makes it possible to respond to these needs or limitations, at least in part.
[0005] SUMMARY OF THE INVENTION

There is disclosed a computer-implemented method for semantically describing the content of an image comprising the steps of receiving a signature associated with said image and receiving a plurality of groups of initial visual concepts; the method being characterized by the steps of expressing the image signature as a vector including components referring to the initial visual concept groups, and modifying said signature by applying a filtering rule to the components of said vector. Developments describe in particular rules for thresholds and/or order statistics, intra-group or inter-group, partitioning techniques including the visual similarity of the images and/or the semantics of the concepts, and the optional addition of manual annotations to the semantic description of the image. The advantages of the method in terms of parsimonious and diversified semantic representation are presented. The method according to the invention will find advantageous application in the context of multimedia information search and/or document classification (for example in a data mining context). According to one aspect of the invention, visual documents are represented by the probabilities obtained by comparing these documents with individual concept classifiers. According to one aspect of the invention, a diversified representation of the content of the images is permitted. According to one aspect of the invention, the compact nature of the representations is ensured without loss of performance.
[0006] Advantageously, an embodiment of the invention proposes a semantic representation that is both compact and diversified. Advantageously, the invention proposes semantic representations of visual content that combine diversity and sparsity, two aspects that are not currently addressed in the domain literature. Diversity is important because it ensures that the different concepts present in the image appear in the representation. Sparsity is important because it speeds up the process of searching for similar images by means of inverted files.
[0007] Advantageously, the invention provides a capacity for generalization of the semantic representation (i.e. the system can operate independently of the content itself). Advantageously, the method according to the invention is generally fast to compute and to use on massive multimedia databases.
[0008] Advantageously, the method according to the invention allows semantic representations that are both diversified and sparse. The invention will advantageously be applicable to any task that requires the description of a multimedia document (combining visual and textual information) for the purpose of searching or classifying this document. For example, the method allows the implementation of multimedia search engines; the exploration of "massive" multimedia collections is generally considerably accelerated because of the sparse character of the semantic representation produced by the method. The invention allows large-scale recognition of objects present in an image or in a video. It will be possible, for example, to provide contextual advertising, to create user profiles from their images and to use these profiles to target or customize advertisements.
[0009] DESCRIPTION OF THE FIGURES

Various aspects and advantages of the invention will appear in support of the description of a preferred, but non-limiting, embodiment of the invention, with reference to the figures below: Figure 1 illustrates the classification or annotation of a document; Figure 2 illustrates an example of supervised classification; Figure 3 illustrates the overall diagram of an exemplary method according to the invention; Figure 4 details some specific steps of the method according to the invention.

DETAILED DESCRIPTION OF THE INVENTION

Figure 1 illustrates the classification or annotation of a document. In the example considered, the document is an image 100. The labels 130 of this document indicate its degree of belonging to each of the classes 110 considered. Considering, for example, four classes (here "wood", "metal", "earth" and "cement"), the tag 120 annotating the document 100 is a four-dimensional vector 140, each component of which is a probability (equal to 0 if the document does not correspond to the class, and equal to 1 if the document corresponds to it with certainty).
[0010] Figure 2 illustrates an example of supervised classification. The method comprises in particular two steps: a first so-called learning step 200 and a second so-called test step 210. The learning step 200 is generally performed "offline" (that is to say beforehand, in advance). The second step 210 is generally performed "online" (i.e. in real time, during the search and/or classification steps proper).
[0011] Each of these steps 200 and 210 includes a feature extraction step (steps 203 and 212) which makes it possible to describe a document by a vector of fixed dimension. This vector is generally extracted from only one of the modalities (i.e. channels) of the document. Visual features include local representations (e.g. bags of visual words, Fisher vectors, etc.) or global representations (color histograms, texture descriptions, etc.) of the visual content, or semantic representations. Semantic representations are generally obtained by using intermediate classifiers that provide probability values for the appearance of an individual concept in the image, and include classemes or meta-classes. Schematically, a visual document will be represented by a vector of the type {"dog" = 0.8, "cat" = 0.03, "car" = 0.03, ..., "sunny" = 0.65}.
[0012] During the learning phase 200, a series of such vectors and the corresponding labels 202 feed a training module ("machine learning" 204) which thus produces a model 213. In the test phase 210, a multimedia document 211 is described by a vector of the same nature as during the training 200. This vector is used as input to the previously learned model 213. At the output, a prediction 214 of the label of the test document 211 is returned. The learning implemented at step 204 can comprise the use of different techniques, considered alone or in combination, in particular wide-margin separators (support vector machines, SVM), the learning process called "boosting", or even the use of neural networks, for example "deep" ones ("deep neural networks"). According to a specific aspect of the invention, there is disclosed an advantageous feature extraction step (steps 203 and 212). In particular, the semantic descriptor under consideration involves a set of classifiers (a "bench" of classifiers). Figure 3 illustrates the overall diagram of an exemplary method according to the invention. The figure illustrates an example of construction of a semantic representation associated with a given image. The figure illustrates "online" (or "active") steps. These steps designate steps that are performed substantially at the time of searching or annotating the images. The figure also illustrates "offline" (or "passive") steps. These steps are generally carried out beforehand, i.e. in advance (at least in part). In "offline" mode, all the images of a supplied database 3201 can be analyzed (the method according to the invention can also proceed by accumulating and constructing the database progressively and/or by grouping by iteration). Steps of extracting visual characteristics 3111 and of normalization 3121 are repeated for each of the images constituting said image database 3201 (the latter being structured into n concepts C). One or more optional (learning) steps 3123 may be performed (positive and/or negative examples, etc.). All of these operations can also be used to determine or optimize the establishment of the visual models 323 (see below) as well as the grouping models 324. In step 323, a bench of visual models is received. This bench of models can be determined in different ways. In particular, the model bench may be received from a third-party module or system, for example following step 3101. A "bench" corresponds to a plurality of visual models V (called "individual visual models"). An "individual visual model" is associated with each of the initial concepts ("sunset", "dog", etc.) of the reference base. The images associated with a given concept represent positive examples for that concept (while the negative examples - which are for example chosen by sampling - are associated with the images which represent the other concepts of the learning base). In step 324, the concepts (initial, i.e. as received) are grouped. Grouping models are received (e.g. from third-party systems). In general, according to the method of the invention, an image to be analyzed 300 is submitted/received and is the subject of various treatments and analyses 310 (which may sometimes be optional), and a semantic description 320 of this image is determined by the method according to the invention. At the output, one or more annotations 340 are determined.
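By way of illustration only, the following sketch shows one way such a bench of individual visual models (step 323) could be trained: one binary classifier per initial concept, with negative examples sampled from the images of the other concepts, as described above. The function names, and the choice of logistic regression as the probabilistic classifier, are assumptions made for the sketch and are not prescribed by the invention:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_model_bank(features, concept_labels, concepts, seed=0):
    """Train one binary visual model per initial concept (a 'bench' of models).

    features: (n_images, d) array of normalized visual features (cf. steps 3111/3121).
    concept_labels: (n_images,) array of concept indices (the structured base 3201).
    concepts: list of the n initial concept names, e.g. ["sunset", "dog", ...].
    """
    rng = np.random.default_rng(seed)
    bank = {}
    for idx, concept in enumerate(concepts):
        positives = features[concept_labels == idx]
        # Negative examples are sampled from the images of the other concepts.
        pool = features[concept_labels != idx]
        take = min(len(positives), len(pool))
        negatives = pool[rng.choice(len(pool), size=take, replace=False)]
        X = np.vstack([positives, negatives])
        y = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
        # A probabilistic classifier yields scores p(v) in [0, 1] for step 325.
        bank[concept] = LogisticRegression(max_iter=1000).fit(X, y)
    return bank
```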
[0013] In the detail of step 310, in a first step 311 (i), the visual characteristics of the image 300 are determined. The base 3201 (which generally comprises thousands of images, or even millions of images) is initially structured into n concepts C (in some embodiments, for certain applications, n may be of the order of 10000). The visual characteristics of the image are determined in step 311 (but they may also be received from a third-party module; for example, they may be provided as metadata). Step 311 is generally the same as step 3111. The content of the image 300 is thus represented by a vector of fixed size (or "signature"). In a second step 312 (ii), the visual characteristics of the image 300 are normalized (if necessary, that is to say that the received visual characteristics may already be normalized).
[0014] In the detail of step 320 (semantic description of the content of the image according to the method), in step 325 (v) according to the invention, a semantic description of each image is determined. In step 326 (vi), according to the invention, this semantic description may be "pruned" (or "simplified" or "reduced"), for one or more images. In an optional step 327 (vii), annotations from various sources (including manual annotations) may be added or exploited. Figure 4 explains in detail certain steps specific to the method according to the invention. Steps v, vi and optionally vii (taken in combination with the other steps described here) correspond to the specific features of the method according to the invention. These steps make it possible to obtain a diversified and parsimonious representation of the images of a database.
[0015] A "diversified" representation is permitted by the use of groups - instead of the original individual concepts as provided by the originally annotated database - which advantageously allows a greater diversity of aspects of the images to be represented. For example, one group may contain different breeds of dogs and different levels of granularity of these concepts ("golden retriever", "labrador retriever", "border collie", "retriever", etc.). Another group may be associated with a natural concept (e.g. related to seaside scenes), and another group may be related to meteorology ("good weather", "cloudy", "stormy", etc.).
[0016] A "sparse" representation of the images corresponds to a representation containing a reduced number of non-zero dimensions in the vectors (or image signatures). This parsimonious (or "sparse") character allows an efficient search in image databases, even at large scale (the signatures of the images are compared, for example to each other, generally in random access memory; these signatures, by means of inverted files for example, make it possible to accelerate the process of searching for images by similarity).
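To illustrate why sparsity accelerates similarity search, here is a minimal sketch of an inverted file over sparse signatures, assuming each signature is stored as a {dimension: non-zero value} mapping; all names are illustrative. Only the images sharing at least one non-zero dimension with the query are ever scored, instead of scanning the whole base:

```python
from collections import defaultdict

def build_inverted_index(signatures):
    """Inverted file: non-zero dimension -> list of (image_id, value).

    signatures: dict image_id -> {dimension_index: non-zero value} (sparse).
    """
    index = defaultdict(list)
    for image_id, signature in signatures.items():
        for dim, value in signature.items():
            index[dim].append((image_id, value))
    return index

def search_by_similarity(index, query_signature):
    """Score (dot product) only the images that share at least one
    non-zero dimension with the query signature."""
    scores = defaultdict(float)
    for dim, q_value in query_signature.items():
        for image_id, value in index.get(dim, []):
            scores[image_id] += q_value * value
    return sorted(scores.items(), key=lambda item: -item[1])
```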
[0017] The two characteristics of "diversified representation" and "parsimony" work in synergy or in concert: the diversified representation according to the invention is compatible with (e.g. permits or facilitates) a parsimonious search; the parsimonious search advantageously benefits from a diversified representation.
[0018] In step 324, the concepts are grouped to obtain k groups Gx, where x = 1, ..., k and k < n:

Gx = {vx1, vx2, ..., vxy}    (1)

Different methods (possibly combined) can be used for group segmentation. This segmentation can be static and/or dynamic and/or configured and/or configurable.
[0019] In some embodiments, the groupings can in particular be based on the visual similarity of the images. In other embodiments, the visual similarity of the images is not necessarily taken into account. In one embodiment, the grouping of concepts may be effected according to the semantic similarity of the images (e.g. depending on the accessible annotations). In one embodiment, the grouping of concepts is supervised, i.e. it benefits from human cognitive expertise. In other embodiments, the grouping is unsupervised. In one embodiment, the grouping of the concepts can be performed using a clustering method such as K-means (or K-medoids) applied to the characteristic vectors of each image, learned on a training base.
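Continuing the K-means embodiment just described, the following sketch groups the initial concepts by visual similarity under one plausible reading of step 324: each concept is summarized by the mean feature vector of its training images, and these prototypes are then clustered. This reading, and the function names, are assumptions of the sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

def group_concepts(features, concept_labels, n_concepts, k, seed=0):
    """Group the n initial concepts into k groups G_1..G_k (step 324).

    Each concept is represented by the mean feature vector of its training
    images; these prototypes are clustered with K-means (only K is chosen).
    Returns, for each group, the indices of the concepts it contains.
    """
    prototypes = np.stack([features[concept_labels == c].mean(axis=0)
                           for c in range(n_concepts)])
    assignment = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(prototypes)
    return [np.flatnonzero(assignment == g) for g in range(k)]
```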
[0020] This results in average feature vectors for the clusters. This embodiment notably allows minimal human intervention upstream (only the parameter K has to be chosen). In other embodiments, the user's involvement in the grouping is excluded (for example by using a clustering method such as the "shared nearest neighbor" method, which makes it possible to avoid any human intervention). In other embodiments, the clustering is performed according to hierarchical clustering methods and/or expectation maximization (EM) algorithms and/or density-based algorithms such as DBSCAN or OPTICS and/or connectionist methods such as adaptive maps (self-organizing maps). Each group corresponds to a possible (conceptual) "aspect" that can represent an image. Different consequences or advantages result from the multiplicity of possible ways of grouping (number of groups and size of each group, i.e. number of images within a group). The size of a group can be variable in order to meet application needs concerning a variable granularity of the representation. The number of groups may correspond to partitions that are finer or coarser than the initial concepts (as inherited from or accessed in the original annotated image database). Segmentation into groups of appropriate size makes it possible, in particular, to characterize different conceptual domains (more or less finely, i.e. according to different granularities). Each group may correspond to a "meta-concept", for example coarser (or wider) than the initial concepts. The step of segmenting or partitioning the conceptual space advantageously leads to the creation (ex nihilo) of "meta-concepts". In other words, all of these groups (or "meta-concepts") form a new partition of the conceptual representation space in which the images are represented. In step 325 according to the invention, for any test image, one or more visual characteristics are calculated or determined, normalized (steps i and ii) and compared to the visual models of the concepts (step iii), in order to obtain a semantic description D of this image based on the probability of occurrence p(vxy) (with 0 ≤ p(vxy) ≤ 1) of the elements of the concept bank. The description of an image is thus structured according to the groups of concepts computed in step iv:

D = {{p(v11), p(v12), ..., p(v1a)}, {p(v21), p(v22), ..., p(v2b)}, ..., {p(vk1), p(vk2), ..., p(vkc)}}    (2)

where the successive sub-vectors correspond to the groups G1, G2, ..., Gk. The number of groups selected can vary, in particular according to the application requirements. For a very parsimonious representation, a small number of groups is used, which increases the diversification but conversely decreases the expressiveness of the representation. Conversely, without groups, one has maximum expressivity but decreases the diversity, since the same concept will be present at several levels of granularity ("golden retriever", "retriever" and "dog" in the example cited above). After the grouping operation, the three preceding concepts will be found within a single group, which will be represented by a single value. A representation by "intermediate groups" is therefore proposed, which makes it possible to integrate diversity and expressiveness simultaneously.
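To make formula (2) concrete, a minimal sketch of step 325 follows, assuming a bank of per-concept probabilistic classifiers such as the one sketched earlier; the names are illustrative:

```python
import numpy as np

def describe(image_features, bank, concepts, groups):
    """Semantic description D of formula (2): one probability sub-vector per group.

    bank: per-concept probabilistic classifiers (see the earlier sketch);
    groups: list of arrays of concept indices, G_1..G_k.
    """
    # p(v) for every initial concept, with 0 <= p(v) <= 1.
    p = np.array([bank[name].predict_proba(image_features.reshape(1, -1))[0, 1]
                  for name in concepts])
    return [p[g] for g in groups]  # D = {{p(v11), ...}, ..., {p(vk1), ...}}
```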
[0021] In the sixth step 326 (vi) according to the invention, the description D obtained is pruned or simplified in order to keep, within each group Gx, only the strongest probability or probabilities p(v) and to eliminate the low probabilities (which can have a negative influence when calculating the similarity of the images). In one embodiment, each group is associated with a (possibly different) threshold, and the probabilities lower than these thresholds are eliminated. In one embodiment, all the groups are associated with one and the same threshold for filtering the probabilities. In one embodiment, one or more groups are associated with one or more predefined thresholds, and the probabilities above and/or below these thresholds (or threshold ranges) can be eliminated. A threshold can be determined in different ways (i.e. according to different types of mathematical means or other types of mathematical operators). A threshold can also be the result of a predefined algorithm. In general, a threshold can be static (i.e. invariant over time) or dynamic (e.g. a function of one or more external factors, such as being controlled by the user and/or by another system). A threshold may be configured (e.g. previously "hard-coded") but may also be configurable (e.g. depending on the type of search, etc.). In one embodiment, a threshold does not relate to a probability value (e.g. a score) but to a number Kp(Gx), associated with the rank (after sorting) of the probabilities to "keep" or "eliminate" in a group Gx. According to this embodiment, the probability values are ordered, i.e. ranked by value; then a determined number Kp(Gx) of probability values is selected (depending on their order or rank), and different filtering rules can be applied. For example, if Kp(Gx) is equal to 3, the method can retain the 3 "largest" values (or the 3 "smallest" values, or else 3 values "distributed around the median", etc.). A rule can be a function (max, min, etc.). For example, considering a group 1 comprising {p(V11) = 0.9; p(V12) = 0.1; p(V13) = 0.8} and a group 2 comprising {p(V21) = 0.9; p(V22) = 0.2; p(V23) = 0.4}, the application of a threshold filtering equal to 0.5 will lead to selecting p(V11) and p(V13) for group 1 and p(V21) for group 2. By applying, with Kp(Gx) = 2, a filtering rule "keep the largest values", we will keep p(V11) and p(V13) for group 1 (same result as the first method) but p(V21) and p(V23) for group 2. The pruned version of the semantic description De can then be written as (in this case Kp(Gx) would be equal to 1):

De = {{p(v11), 0, ..., 0}, {0, p(v22), ..., 0}, ..., {0, 0, ..., p(vkc)}}    (3)

with: p(v11) > p(v12), ..., p(v11) > p(v1a) for G1; p(v22) > p(v21), ..., p(v22) > p(v2b) for G2; and p(vkc) > p(vk1), ..., p(vkc) > p(vk2) for Gk. The representation given in (3) illustrates the use of a so-called "max-pooling" dimension selection method. This representation is illustrative and the use of said method is entirely optional. Other alternative methods can be used instead of "max-pooling", such as, for example, the so-called "average pooling" technique (average of the probabilities of the concepts in each group Gk) or even the so-called "soft max pooling" technique (average of the x highest probabilities within each group). The score of the groups will be noted s(Gk) below.
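The following sketch implements the two intra-group pruning rules just described and reproduces the worked example above; the function names are illustrative:

```python
import numpy as np

def prune_by_threshold(D, thresholds):
    """Zero out, in each group, the probabilities below that group's threshold."""
    return [np.where(g >= t, g, 0.0) for g, t in zip(D, thresholds)]

def prune_top_kp(D, kp_per_group):
    """Keep only the Kp(Gx) largest probabilities in each group ('max' rule)."""
    pruned = []
    for g, kp in zip(D, kp_per_group):
        out = np.zeros_like(g)
        keep = np.argsort(g)[::-1][:kp]  # ranks after sorting by value
        out[keep] = g[keep]
        pruned.append(out)
    return pruned

# Worked example from the text:
D = [np.array([0.9, 0.1, 0.8]), np.array([0.9, 0.2, 0.4])]
print(prune_by_threshold(D, [0.5, 0.5]))  # group 1: p(V11), p(V13); group 2: p(V21)
print(prune_top_kp(D, [2, 2]))            # group 1: p(V11), p(V13); group 2: p(V21), p(V23)
```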
[0022] The pruning described in formula (3) is intra-group. A final inter-group pruning is advantageous in order to arrive at a "sparse" representation of the image. More precisely, starting from De = {s(G1), s(G2), ..., s(Gk)} and after applying the intra-group pruning described in (3), only the groups having the strongest scores are retained. For example, assuming that a description with only two non-zero dimensions is desired, and that s(G1) > s(Gk) > ... > s(G2), then the final representation will be given by:

Df = {s(G1), 0, ..., 0, s(Gk)}    (4)

The selection of one or more concepts in each group provides a "diversified" description of the images, that is, a description including various (conceptual) aspects of the image. As a reminder, an "aspect" or "meta-aspect" of the conceptual space corresponds to a group of concepts chosen from among the initial concepts. The advantage of the method proposed in this invention is that it "forces" the representation of an initial image onto one or more of these aspects (or "meta-concepts"), even if one of these aspects is initially predominant. For example, suppose an image is mostly annotated by the concepts associated with "dog", "golden retriever" and "hunting dog" but also, to a lesser extent, by the concepts "car" and "lamppost", and that step iv of the proposed method results in the formation of three meta-concepts (i.e. groups/aspects) containing {"dog" + "golden retriever" + "hunting dog"} for the first group, {"car" + "bike" + "motorcycle"} for the second group and {"lamppost" + "city" + "street"} for the third group. Then a semantic representation according to the state of the art will carry the bulk of its weighting on the concepts "dog", "golden retriever" and "hunting dog", while the method according to the invention will make it possible to identify that these three concepts describe a similar aspect and will also attribute weight to the aspects of belonging to "car" and "lamppost", thus making it possible to find, more precisely, pictures of dogs taken in town, outside, in the presence of means of transport. Advantageously, in the case of a large initial number of concepts and a "sparse" representation, as proposed by the method according to the invention, the representation allows a better comparability of the dimensions of the description. Thus, without groups, an image represented by "golden retriever" and another represented by "retriever" will have a similarity equal or close to zero despite the presence of these concepts. With the groups according to the invention, the presence of the two concepts will contribute to increasing the (conceptual) similarity of the images because of their common membership in a group. From the point of view of the user experience, search by the content of the image according to the invention advantageously makes it possible to take into account more aspects of the query (and not only the "dominant" concept or concepts, as with search by image content known in the state of the art). The resulting "diversification" of the process is particularly advantageous; it is nonexistent in current image descriptors. By fixing the size of the groups at the limit value equal to 1, a process without diversification of the semantic representation of the images is obtained.
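A minimal sketch of this inter-group pruning follows, assuming the group scores s(Gx) are obtained by max-pooling over the already intra-group-pruned sub-vectors (other poolings can be passed in); the names are illustrative:

```python
import numpy as np

def inter_group_pruning(D_pruned, n_groups_kept, pooling=np.max):
    """Formula (4): retain only the groups with the strongest scores s(Gx).

    s(Gx) is obtained by pooling within each (already intra-group-pruned)
    sub-vector; every component of a non-retained group is set to zero.
    """
    scores = np.array([pooling(g) for g in D_pruned])
    kept = set(np.argsort(scores)[::-1][:n_groups_kept])
    return [g if x in kept else np.zeros_like(g) for x, g in enumerate(D_pruned)]
```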
In a step 322 (vii), if there are textual annotations associated with the image that were entered manually (generally of high semantic quality), the associated concepts are added to the semantic description of the image with probability 1 (or at least with a probability higher than those associated with automatic classification tasks, for example). This step remains entirely optional since it depends on the existence of manual annotations, which may not be available. In one embodiment, the method of the invention groups the images in a single way (in other words, there are N groups of M images). In one embodiment, "collections", i.e. "sets", of groups of different sizes are pre-calculated (in other words, there are A groups of B images, C groups of D images, etc.). Search by the content of the image can be "parameterized", for example according to one or more options presented to the user. In this case, one of the pre-calculated collections is activated (i.e. the search is performed within the specified collection). In some embodiments, the computation of the different collections is done in the background of the searches. In some embodiments, the selection of one or more collections is (at least in part) determined based on user feedback. In general, the methods and systems of the invention relate to the annotation or the classification or the automatic description of the content of the image considered as such (i.e. without necessarily taking into consideration sources of data other than the content of the image or associated metadata). The automatic approach disclosed by the invention may be supplemented or combined with contextual data associated with the images (for example related to the modes of publication or visual reproduction of these images). In an alternative embodiment, the contextual information (for example the keywords from the web page on which the image is published, or the context of restitution if known) can be used. This information may, for example, be used to corroborate, provoke or inhibit, or confirm or invalidate the annotations extracted from the analysis of the content of the image according to the invention. Different adjustment mechanisms can indeed be combined with the invention (filters, weighting, selection, etc.). Contextual annotations can be filtered and/or selected and added to the semantic description (with appropriate probabilities or factors or coefficients or weights or confidence intervals, for example).
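A minimal sketch of this optional step, assuming the description D is held as mutable per-group sub-vectors and that each group lists the indices of its member concepts (all names are illustrative):

```python
def add_manual_annotations(D, groups, concepts, manual_tags):
    """Optional step 322 (vii): inject manually entered annotations.

    Any initial concept named by a manual tag receives probability 1,
    i.e. a trusted annotation outranks the automatic classification scores.
    """
    tagged = {concepts.index(tag) for tag in manual_tags if tag in concepts}
    for sub_vector, members in zip(D, groups):
        for position, concept_index in enumerate(members):
            if concept_index in tagged:
                sub_vector[position] = 1.0
    return D
```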
[0023] Embodiments of the invention are described below. A computer-implemented method for semantically describing the content of an image comprises the steps of: receiving a signature associated with said image; receiving a plurality of groups of initial visual concepts; the method being characterized by the steps of: expressing the image signature as a vector including components referring to the initial visual concept groups; and modifying said signature by applying a filtering rule to the components of said vector. The signature associated with the image, i.e. the initial vector, is generally received (e.g. from another system). This signature is for example obtained after the extraction of the visual characteristics of the content of the image, for example by means of predefined classifiers known in the state of the art, and various other processing operations including normalization. The signature may be received as a vector expressed in a different reference frame; the method then "expresses" or transforms (or converts or translates) the received vector into the appropriate working reference frame. The signature of the image is therefore a vector (comprising components) of a constant size C. An initially annotated base also provides a set of initial concepts, for example in the form of (textual) annotations. These groups of concepts can in particular be received in the form of "benches". The signature is then expressed with references to the groups of "initial visual concepts" (textual objects), i.e. as received. The references to the groups are therefore components of the vector.
[0024] The mapping of the vector components to the concept groups is performed. The method according to the invention manipulates, i.e. partitions, the initial visual concepts according to Gx = {vx1, vx2, ..., vxy}, with x = 1, ..., k and k < n, and creates a new signature of the image.
[0025] The method then determines a semantic description of the content of the image by modifying the initial signature of the image, i.e. retaining or canceling (e.g. zeroing) one or more components (references to groups) of the vector. The modified vector is always of size C. Different filtering rules may apply.
[0026] In one development, the filtering rule comprises maintaining or zeroing one or more components of the vector corresponding to the groups of initial visual concepts by applying one or more thresholds. The semantic description may be changed intra-group-wise by the application of thresholds, said thresholds being selected from among mathematical operators including, for example, mathematical means. The pruning can be intra-group, e.g. selection of dimensions by so-called "max-pooling", by "average pooling" (average of the probabilities of the concepts in each group Gk), or even by the so-called "soft max pooling" technique (average of the x highest probabilities within each group). In one development, the filtering rule includes maintaining or zeroing one or more components of the vector corresponding to the initial visual concept groups by applying an order statistic.
[0027] In statistics, the k-th order statistic of a statistical sample is equal to its k-th smallest value. Combined with rank statistics, order statistics are one of the fundamental tools of non-parametric statistics and statistical inference. Order statistics include the minimum, the maximum and the median of the sample, as well as the different quantiles, etc. Filters (designation then action) based on thresholds and on order-statistic rules can be combined (it is possible to act on the groups of concepts - as components - with only thresholds, or only order statistics, or both). For example, the determined semantic description can be changed intra-group-wise by applying a predefined filtering rule based on a number Kp(Gx) of occurrence probability values of an initial concept within each group.
[0028] In each group, a) the probability values (of occurrence of an initial concept) are ordered; b) a number Kp(Gx) is determined; and c) a predefined filtering rule is applied (this rule is selected from the group including the rules "selection of the Kp(Gx) maximum values", "selection of the Kp(Gx) minimum values", "selection of Kp(Gx) values around the median", etc.). Finally, the semantic description of the image is modified by means of the probability values thus determined. In one development, the method further comprises a step of determining a selection of initial visual concept groups and a step of zeroing the components corresponding to the selected visual concept groups (several or all components). This development corresponds to an inter-group filtering.
[0029] In a development, the grouping into groups of initial visual concepts is based on the visual similarity of the images. The learning can be unsupervised; step 324 provides such groups based on visual similarity.
[0030] In a development, the grouping into groups of initial visual concepts is based on the semantic similarity of the concepts. In a development, the grouping into groups of initial visual concepts is carried out by one or more operations selected from the use of K-means and/or hierarchical groupings and/or expectation maximization (EM) and/or density-based algorithms and/or connectionist algorithms. In a development, at least one threshold is configurable. In one development, the method further comprises a step of receiving and adding to the semantic description of the content of the image one or more textual annotations of manual origin. In one development, the method further comprises a step of receiving at least one parameter associated with a search query by the content of an image, said parameter determining one or more groups of visual concepts, and a step of performing the search within the determined groups of concepts. In a development, the method further comprises a step of forming collections of initial visual concept groups, a step of receiving at least one parameter associated with a search query by the content of an image, said parameter determining one or more collections from among the collections of initial visual concept groups, and a step of searching within the determined collections.
[0031] In this development, "groups of groups" are addressed. In one embodiment, it is possible to choose (e.g. according to the characteristics of the query) from among different pre-calculated partitions (i.e. according to different groupings). In a very particular embodiment, the partition may (although with difficulty) be computed in real time (i.e. at the time of the query). A computer program product is disclosed, said computer program comprising code instructions for performing one or more of the steps of the method.
[0032] There is also disclosed a system for carrying out one or more of the steps of the method. The present invention can be implemented from hardware and/or software elements. It may be available as a computer program product on a computer-readable medium. The medium can be electronic, magnetic, optical or electromagnetic. The device implementing one or more of the steps of the method may use one or more dedicated electronic circuits or a general-purpose circuit. The technique of the invention can be carried out on a reprogrammable computing machine (a processor or a microcontroller, for example) executing a program comprising a sequence of instructions, or on a dedicated computing machine (for example a set of logic gates such as an FPGA or an ASIC, or any other hardware module). A dedicated circuit can in particular accelerate performance in extracting the characteristics of the images (or of collections of video images). As an example of a hardware architecture adapted to implement the invention, a device may comprise a communication bus connected to a central processing unit or microprocessor (CPU), which processor can be "multi-core" or "many-core"; a read-only memory (ROM) which may comprise the programs necessary for the implementation of the invention; a random access memory (RAM) with registers adapted to record variables and parameters created and modified during the execution of the aforementioned programs; and a communication or I/O ("Input/Output") interface adapted to transmit and receive data (e.g. images or videos). In particular, the random access memory can allow the rapid comparison of the images via the associated vectors. In the case where the invention is implemented on a reprogrammable computing machine, the corresponding program (that is to say the sequence of instructions) can be stored in or on a removable storage medium (for example a flash memory, an SD card, a DVD or Blu-ray, a mass storage means such as a hard disk, e.g. an SSD) or a non-removable, volatile or non-volatile storage medium, this storage medium being partially or fully readable by a computer or a processor. The computer-readable medium may be transportable or communicable or mobile or transmissible (i.e. by a 2G, 3G, 4G, Wifi, BLE, optical fiber or other telecommunication network). The reference to a computer program that, when executed, performs any of the functions described above, is not limited to an application program running on a single host computer. On the contrary, the terms computer program and software are used here in a general sense to refer to any type of computer code (e.g. application software, firmware, microcode, or any other form of computer instruction) that can be used to program one or more processors to implement aspects of the techniques described herein. The computing means or resources can notably be distributed ("cloud computing"), possibly with or according to peer-to-peer and/or virtualization technologies. The software code may be executed on any suitable processor (for example, a microprocessor) or processor core, or on a set of processors, whether provided in a single computing device or distributed among a plurality of computing devices (e.g. such as may be accessible in the environment of the device).
Claims (13):
1. A computer-implemented method for semantically describing the content of an image, comprising the steps of:
- receiving a signature associated with said image;
- receiving a plurality of groups of initial visual concepts;
the method being characterized by the steps of:
- expressing the signature of the image as a vector including components referring to the groups of initial visual concepts;
- modifying said signature by applying a filtering rule to the components of said vector.
2. The method of claim 1, the filtering rule comprising maintaining or zeroing one or more components of the vector corresponding to the initial visual concept groups by applying one or more thresholds.
3. The method of claim 1 or 2, the filtering rule comprising maintaining or zeroing one or more components of the vector corresponding to the groups of initial visual concepts by applying an order statistic.
4. The method of any one of the preceding claims, further comprising a step of determining a selection of initial visual concept groups and a step of zeroing the components corresponding to the selected visual concept groups.
5. The method of any one of the preceding claims, the grouping into groups of initial visual concepts being based on the visual similarity of the images.
6. The method of any one of the preceding claims, the grouping into groups of initial visual concepts being based on the semantic similarity of the concepts.
7. The method of any one of the preceding claims, the grouping into groups of initial visual concepts being effected by one or more operations selected from the use of K-means and/or hierarchical groupings and/or expectation maximization (EM) and/or density-based algorithms and/or connectionist algorithms.
8. The method according to claims 2 to 7, at least one threshold being configurable.
9. The method of any one of the preceding claims, further comprising a step of receiving and adding to the semantic description of the content of the image one or more textual annotations of manual origin.
10. The method of any one of the preceding claims, further comprising a step of receiving at least one parameter associated with a search query by the content of an image, said parameter determining one or more groups of visual concepts, and a step of searching within the determined concept groups.
11. The method of any one of the preceding claims, further comprising a step of forming collections of initial visual concept groups, a step of receiving at least one parameter associated with a search query by the content of an image, said parameter determining one or more collections from among the collections of initial visual concept groups, and a step of performing the search within the determined collections.
12. A computer program product, said computer program comprising code instructions for performing the steps of the method of any one of claims 1 to 11, when said program is run on a computer.
13. A system for carrying out the method according to any one of claims 1 to 11.
Similar technologies:
Publication number | Publication date | Patent title
US20210158219A1|2021-05-27|Method and system for an end-to-end artificial intelligence workflow
EP3238137A1|2017-11-01|Semantic representation of the content of an image
CA2804230C|2016-10-18|A computer-implemented method, a computer program product and a computer system for image processing
FR2974434A1|2012-10-26|PREDICTING THE AESTHETIC VALUE OF AN IMAGE
FR2974433A1|2012-10-26|EVALUATION OF IMAGE QUALITY
US10621755B1|2020-04-14|Image file compression using dummy data for non-salient portions of images
FR2994495A1|2014-02-14|METHOD AND SYSTEM FOR DETECTING SOUND EVENTS IN A GIVEN ENVIRONMENT
US20200210647A1|2020-07-02|Automated Summarization of Extracted Insight Data
KR102259207B1|2021-05-31|Automatic taging system and method thereof
US20190180327A1|2019-06-13|Systems and methods of topic modeling for large scale web page classification
US20170286979A1|2017-10-05|Architecture for predicting network access probability of data files accessible over a computer network
Bagnall et al.2020|On the usage and performance of the hierarchical vote collective of transformation-based ensembles version 1.0 |
FR3043817A1|2017-05-19|METHOD FOR SEARCHING INFORMATION IN AN INFORMATION SET
JayaLakshmi et al.2022|Performance evaluation of DNN with other machine learning techniques in a cluster using Apache Spark and MLlib
RU2711125C2|2020-01-15|System and method of forming training set for machine learning algorithm
FR3041794A1|2017-03-31|METHOD AND SYSTEM FOR SEARCHING LIKE-INDEPENDENT SIMILAR IMAGES FROM THE PICTURE COLLECTION SCALE
US10685057B1|2020-06-16|Style modification of images in search results
CN111667022A|2020-09-15|User data processing method and device, computer equipment and storage medium
EP2374073A1|2011-10-12|System for searching visual information
FR3082962A1|2019-12-27|AUTOMATIC AND SELF-OPTIMIZED DETERMINATION OF EXECUTION PARAMETERS OF A SOFTWARE APPLICATION ON AN INFORMATION PROCESSING PLATFORM
FR3062504A1|2018-08-03|AUTOMATIC DETECTION OF FRAUD IN A NEURON NETWORK PAYMENT TRANSACTION STREAM INTEGRATING CONTEXTUAL INFORMATION
US11182433B1|2021-11-23|Neural network-based semantic information retrieval
US20210073671A1|2021-03-11|Generating combined feature embedding for minority class upsampling in training machine learning models with imbalanced samples
FR2981189A1|2013-04-12|NON-SUPERVISED SYSTEM AND METHOD OF ANALYSIS AND THEMATIC STRUCTURING MULTI-RESOLUTION OF AUDIO STREAMS
US20220083603A1|2022-03-17|Neural network-based semantic information retrieval
Patent family:
Publication number | Publication date
JP2018501579A|2018-01-18|
WO2016102153A1|2016-06-30|
US20170344822A1|2017-11-30|
FR3030846B1|2017-12-29|
CN107430604A|2017-12-01|
EP3238137A1|2017-11-01|
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title
US20070078846A1|2005-09-30|2007-04-05|Antonino Gulli|Similarity detection and clustering of images|
US7043474B2|2002-04-15|2006-05-09|International Business Machines Corporation|System and method for measuring image similarity based on semantic meaning|
US7124149B2|2002-12-13|2006-10-17|International Business Machines Corporation|Method and apparatus for content representation and retrieval in concept model space|
US7680341B2|2006-05-05|2010-03-16|Xerox Corporation|Generic visual classification with gradient components-based dimensionality enhancement|
US8391618B1|2008-09-19|2013-03-05|Adobe Systems Incorporated|Semantic image classification and search|
CN102880612B|2011-07-14|2015-05-06|富士通株式会社|Image annotation method and device thereof|
US20130125069A1|2011-09-06|2013-05-16|Lubomir D. Bourdev|System and Method for Interactive Labeling of a Collection of Images|
US8873867B1|2012-07-10|2014-10-28|Google Inc.|Assigning labels to images|
CN104008177B|2014-06-09|2017-06-13|华中师范大学|Rule base structure optimization and generation method and system towards linguistic indexing of pictures|
WO2016048743A1|2014-09-22|2016-03-31|Sikorsky Aircraft Corporation|Context-based autonomous perception|
CN105354307B|2015-11-06|2021-01-15|Tencent Technology (Shenzhen) Co., Ltd.|Image content identification method and device|
US11144587B2|2016-03-08|2021-10-12|Shutterstock, Inc.|User drawing based image search|
EP3540691B1|2018-03-14|2021-05-26|Volvo Car Corporation|Method of segmentation and annotation of images|
US11100366B2|2018-04-26|2021-08-24|Volvo Car Corporation|Methods and systems for semi-automated image segmentation and annotation|
US11080324B2|2018-12-03|2021-08-03|Accenture Global Solutions Limited|Text domain image retrieval|
Legal status:
2015-12-31| PLFP| Fee payment|Year of fee payment: 2 |
2016-06-24| PLSC| Publication of the preliminary search report|Effective date: 20160624 |
2016-12-29| PLFP| Fee payment|Year of fee payment: 3 |
2018-01-02| PLFP| Fee payment|Year of fee payment: 4 |
2019-12-31| PLFP| Fee payment|Year of fee payment: 6 |
2020-12-28| PLFP| Fee payment|Year of fee payment: 7 |
2021-12-31| PLFP| Fee payment|Year of fee payment: 8 |
Priority:
Application number | Publication number | Filing date | Publication date | Patent title
FR1463237A| FR3030846B1|2014-12-23|2017-12-29|SEMANTIC REPRESENTATION OF THE CONTENT OF AN IMAGE|
CN201580070881.7A| CN107430604A|2014-12-23|2015-12-01|The semantic expressiveness of picture material|
US15/534,941| US20170344822A1|2014-12-23|2015-12-01|Semantic representation of the content of an image|
EP15801869.7A| EP3238137A1|2014-12-23|2015-12-01|Semantic representation of the content of an image|
JP2017533946A| JP2018501579A|2014-12-23|2015-12-01|Semantic representation of image content|
PCT/EP2015/078125| WO2016102153A1|2014-12-23|2015-12-01|Semantic representation of the content of an image|