PROCEDURE FOR THE DETECTION AND LOCALIZATION OF HUMANS IN PICTURES ACQUIRED BY OMNIDIRECTIONAL CAMERAS
Patent abstract:
Method for detecting and locating humans in images acquired by omnidirectional cameras. The method consists of two sub-procedures. Sub-procedure A, for the detection and localization of humans, comprises the steps of capturing images with an omnidirectional camera, describing each image by a single feature vector, distributing the feature vector to a set of foveal classifiers, obtaining a binary prediction from each foveal classifier, and computing an approximate and subsequently precise localization of the humans in the image, resulting in human detections in the image plane. The detection is robust to changes in pose, scale and lighting, to partial occlusions, and to the geometric distortions of omnidirectional images. Sub-procedure B, for parameter adjustment, uses a database of images with point annotations to adaptively generate positive and negative samples for each foveal classifier, which are used to train the parameters of the set of classifiers.

Publication number: ES2657378A1
Application number: ES201730478
Filing date: 2017-03-30
Publication date: 2018-03-05
Inventors: Carlos ROBERTO DEL BLANCO ADÁN; Pablo CARBALLEIRA LÓPEZ; Fernando JAUREGUIZAR NUÑEZ; Narciso GARCÍA SANTOS; Lorena GARCIA DE LUCAS
Applicant: Universidad Politecnica de Madrid
Patent description:
PROCEDURE FOR THE DETECTION AND LOCATION OF HUMANS IN PICTURES ACQUIRED BY OMNIDIRECTIONAL CAMERAS

TECHNICAL SECTOR

The present invention belongs to the information and communications technologies sector and has applications in industrial security (video surveillance, defense), transportation (people-flow management), healthcare (patient monitoring), and sport and leisure (behavior analysis, human-machine interaction). More specifically, it concerns the detection and location of humans in complex situations in uncontrolled environments, using omnidirectional cameras to cover large regions of a real three-dimensional scene.

BACKGROUND OF THE INVENTION

The detection of humans in a scene acquired by a conventional (or perspective) camera is a complex task due to the great variability of human appearance, which changes dramatically according to clothing, pose, lighting conditions and the perspective point of the camera. It is nevertheless a very desirable technology because of its non-intrusive and non-collaborative nature with respect to the person to be detected, unlike other biometric techniques such as iris or fingerprint recognition. The use of omnidirectional cameras in the detection system, instead of conventional or perspective ones, entails additional difficulties due to the severe geometric distortions introduced by the omnidirectional optics of the camera. However, it has the remarkable advantage of acquiring a greater extent of the real scene, reducing the number of cameras needed (and the underlying cost, maintenance, installation and configuration) to monitor a scene of interest.

In general, within the field of artificial vision there are two families of methods for the detection of humans in images acquired by cameras: those based on detection by parts and those based on the concept of a sliding window. Part-based detection methods try to find parts of a human in an image, generating a final detection if some or all of the parts form a geometrically plausible arrangement. Detection methods based on the sliding-window concept directly detect a human in a region of the image delimited by a rectangular window. To detect humans in arbitrary positions of an image, this window slides over the entire image, and the process is repeated for windows of different scales and aspect ratios in order to adapt to the multiple poses and sizes a human can have in the scene.

US patent 20140169664 describes an apparatus and a procedure for the detection of people in images that adopt the sliding-window scheme, using Local Binary Patterns as region descriptors and a combination of Support Vector Machines and AdaBoost for the classification task. Similarly, US patent 9008365 proposes several systems and methods for detecting people that use variations of Local Binary Pattern descriptors, processed by Support Vector Machines for the classification task.

DESCRIPTION OF THE INVENTION

The invention referred to in this document describes a new method for detecting people that does not belong to either of the above families of methods and that, unlike them, can also operate on images acquired with omnidirectional cameras.
This procedure has a greater detection capacity than other methods, especially in complicated situations (occlusions, people partially outside the image frame, appearances distorted by omnidirectional optics), with a lower computational cost and less complexity in the installation and configuration of the detection system.

The method for the detection and localization of humans in image sequences acquired by omnidirectional cameras of the present invention comprises the following sub-procedures:

a. Sub-procedure A for the detection and location of humans in images acquired by a static omnidirectional camera, which is configured for the detection and location of humans in a specific environment/scene; and,
b. Sub-procedure B for configuring the parameters for the detection and localization of humans in images of a specific environment/scene.

Sub-procedure A comprises the following stages:

a. Stage A.1: image acquisition with an omnidirectional camera in a static position;
b. Stage A.2: extraction of a single feature super-vector for each of the acquired images;
c. Stage A.3: distribution of the feature super-vector associated with each of the images to a set of M foveal classifiers;
d. Stage A.4: binary prediction for each of the M foveal classifiers into the categories "human present in the fovea of the classifier" and its complementary "human not present in the fovea of the classifier";
e. Stage A.5: approximate location in the image of the humans detected; and,
f. Stage A.6: precise location in the image of the humans detected.

Stage A.2 comprises three sub-stages:

a. Sub-stage A.2.1: division of the image into blocks of NxN pixels, which may be overlapping or not;
b. Sub-stage A.2.2: extraction of a feature vector for each block of NxN pixels referred to in the previous sub-stage; where the feature extraction algorithm is selected from: Histograms of Oriented Gradients, Local Binary Patterns, the Scale-Invariant Feature Transform, and Haar features;
c. Sub-stage A.2.3: concatenation of the feature vectors of each block of the image into a single feature vector that represents the entire image.

Histograms of Oriented Gradients represent the object structure by means of gradient histograms, where each histogram is constructed from a different region of the object, so that the gradient phase of a pixel in a region determines which bin of the histogram it contributes to, and the gradient magnitude specifies by how much. Local Binary Patterns encode the structure of an object through a histogram of local patterns; each local pattern is calculated from the intensity differences of each pixel with its neighborhood, which are thresholded by the sign function, resulting in a binary code word that is converted into a decimal number determining the contribution to the histogram. The Scale-Invariant Feature Transform is based on Histograms of Oriented Gradients but adds an initial multi-scale processing stage and computes a single gradient histogram that represents the entire object without considering sub-parts. Finally, Haar features are a computationally very efficient type of wavelet transform that builds a sparse representation of the object.
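As an illustration of stage A.2, the following is a minimal sketch in Python, assuming grayscale images, non-overlapping 16x16-pixel blocks and per-block HOG descriptors computed with scikit-image; the helper name image_super_vector is an assumption of this sketch, not prescribed by the invention:

```python
import numpy as np
from skimage.feature import hog  # assumed HOG implementation

def image_super_vector(image, block=16):
    """Stage A.2 sketch: split a grayscale image into non-overlapping
    NxN blocks (sub-stage A.2.1), describe each block with HOG
    (sub-stage A.2.2), and concatenate everything into a single
    super-vector representing the whole image (sub-stage A.2.3)."""
    h, w = image.shape[:2]
    features = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = image[y:y + block, x:x + block]
            features.append(hog(patch,
                                orientations=9,          # nine orientation bins
                                pixels_per_cell=(8, 8),  # 8x8-pixel cells
                                cells_per_block=(2, 2),  # 1/2-cell area overlap
                                block_norm="L2-Hys"))
    return np.concatenate(features)
```

Note that every foveal classifier later consumes this same super-vector, so the feature extraction runs exactly once per image.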
Stage A.4 comprises a set of M foveal classifiers, each of which has an associated fovea and a point reference on the image plane. The fovea is the region of the image on which a given classifier focuses its detection of humans; its area and morphology are inferred by Sub-procedure B, and it is therefore automatically adapted to the environment/scene. The point reference is a pair of coordinates on the image plane that represents the fovea of a classifier. The set of point references of the M foveal classifiers forms a two-dimensional spatial grid/mesh that covers the area of the image acquired by the camera. Said spatial grid/mesh is configurable with different patterns (hexagonal, rectangular, polar, etc.) and different numbers of classifiers (for example, M = 825). Each foveal classifier is trained/configured to detect humans in its fovea using as input the image feature super-vector common to all foveal classifiers. The foveas of the classifiers may overlap. The classification algorithm is not restricted, and the following classifiers, among others, can be used: Support Vector Machines, Neural Networks, and Logistic Regression.

Stage A.5 uses the point references of the active foveal classifiers, that is, those that have detected a human in their foveas, to roughly determine the location of the human within an area of the image.

Stage A.6 merges the detection results of each active foveal classifier to refine the location of humans and produce the final, unique detection. It consists of three sub-stages:

a. Sub-stage A.6.1: extraction of clusters/neighborhoods of detections according to the grid/mesh formed by the point references of the foveal classifiers; clusters/neighborhoods of detections must exceed a minimum threshold to be considered, since the area of the human in the image overlaps with multiple foveas of neighboring foveal detectors;
b. Sub-stage A.6.2: suppression of non-maxima among the extracted clusters/neighborhoods, obtaining a single cluster/neighborhood per human;
c. Sub-stage A.6.3: precise location of the human in the image by interpolating the coordinates of the point references of the foveal classifiers that are part of each cluster/neighborhood obtained in the previous sub-stage, in such a way that a single point location is generated per human, which in turn represents its final detection.
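As an illustration of the configurable grid of point references described above, the following sketch generates a hexagonal/quincunx mesh; the function name hex_grid and the uniform spacing parameter step are assumptions of this sketch:

```python
import numpy as np

def hex_grid(width, height, step):
    """Sketch of a hexagonal/quincunx mesh of point references covering
    the image plane; alternate rows are offset by half a step, and the
    row spacing is step*sqrt(3)/2, as in a regular hexagonal lattice."""
    points = []
    row_height = step * np.sqrt(3) / 2
    for row, y in enumerate(np.arange(step / 2, height, row_height)):
        x_start = step / 2 + (row % 2) * step / 2  # quincunx offset
        for x in np.arange(x_start, width, step):
            points.append((x, y))
    return np.array(points)
```

For an 800x600 image, a spacing of roughly 26 pixels yields on the order of the M = 825 classifiers mentioned above; rectangular or polar meshes are equally valid choices.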
Sub-procedure B comprises the following stages:

a. Stage B.1: creation of an image database of a specific scene/environment that contains human instances; the images are acquired with an omnidirectional camera, which must be located in the same position as the camera used during detection;
b. Stage B.2: point annotation of the humans in the image database; the point annotation consists in specifying the coordinates of a representative point of the human in the image, and the representative point chosen must be consistent across all instances of the annotated object (for example, the head);
c. Stage B.3: adaptive generation of positive and negative samples for each foveal classifier from the point annotations; a point annotation generates a positive sample for a given foveal classifier if the distance between the coordinates of the point annotation and the coordinates of the reference point of the foveal classifier is below a threshold, and a negative sample for that classifier if the threshold is exceeded; as a result, the same point annotation generates a set of positive samples for one subset of foveal classifiers and a set of negative samples for a subset of foveal classifiers disjoint from the first (see the sketch after this list);
d. Stage B.4: adjustment of the parameters ("training", in the jargon of classifiers) of each foveal classifier for the optimal detection of humans, using the positive and negative samples generated.
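To make stage B.3 concrete, here is a minimal sketch using the nearest-neighbor variant adopted in the preferred embodiment (each annotation marks its Np nearest classifiers as positive); the helper name per_classifier_labels and its defaults are assumptions of this sketch:

```python
import numpy as np

def per_classifier_labels(annotations, ref_points, np_pos=7):
    """Stage B.3 sketch: for one image, each point annotation marks the
    np_pos foveal classifiers whose reference points are nearest to it
    as positive (label 1); every other classifier receives the image's
    super-vector as a negative sample (label 0)."""
    labels = np.zeros(len(ref_points), dtype=int)
    for ann in np.atleast_2d(annotations):
        dists = np.linalg.norm(ref_points - ann, axis=1)
        labels[np.argsort(dists)[:np_pos]] = 1
    return labels
```

The distance-threshold rule of the general description is obtained by replacing the argsort selection with a comparison dists < threshold.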
The process of the present invention requires neither scene information, beyond the point annotations in the database, nor camera calibration. It can also be extended to the location and detection of generic objects contained in images acquired by omnidirectional or perspective cameras.

The invention described above has the following advantages and differences with respect to the state of the art. The first fundamental difference is that each image is described by a single global feature vector, instead of extracting numerous vectors, one for each displacement of a detection window (in the case of sliding-window methods) or for each of the image regions that make up sub-parts of the human (in the case of part-based methods). This difference has a great impact on the reduction of computational cost, allowing real-time operation of the system on low-cost hardware architectures.

The second radical difference is that, to detect humans in the scenes captured by the camera, the proposed method uses a set of independent foveal classifiers that analyze a single feature vector corresponding to the entire image. Each foveal classifier is associated with a fovea (from which it receives its name), or attention area on the image plane, that allows it to detect humans with different appearances, poses, sizes and locations, and with arbitrary occlusions. This characteristic is essential for systems based on omnidirectional cameras, in which the appearance of a person changes radically depending on their position on the image plane due to the great distortion introduced by the optics. Moreover, since each foveal classifier processes the same feature vector corresponding to the entire image, it also has access to the contextual information of the scene. In this way, each classifier uses not only its fovea for the detection of humans but also the rest of the image areas, which makes it robust to partial occlusions and to people partially outside the image area. As for the process of determining the fovea of each classifier, it follows an automatic procedure based on machine learning (supervised training), so that each foveal classifier automatically learns the size and shape of its fovea from a database of point annotations of humans. As a result, each foveal classifier has to deal with a limited subset of the variations in human appearance, which simplifies the classification task and improves the performance of the human detection system.

In a totally different way, other methods and systems use the same set of classifiers to analyze different feature vectors extracted from each region of the image (determined by a sliding window or by partially overlapping image sub-parts) in order to be robust to the great variability of human appearance. This classic strategy has two fundamental disadvantages. The first is that it incurs a large computational cost by having to calculate a high number of feature vectors per image. The second derives from the fact that the same set of classifiers is used for every region of the image, so it has to deal with all the geometric distortions that an omnidirectional camera introduces, which cause the same human with the same pose to have a very different appearance depending on their position on the image plane.

The third fundamental difference is the training process of the classifiers. The other methods need to specify/label regions of the image (usually rectangular) that contain humans (positive samples) and others that do not contain humans (negative samples), which entails an enormous labeling effort. The training method for the foveal classifiers is not only different but also less laborious and therefore more efficient. To begin with, the labeling is reduced to a point annotation that identifies the human in the image. Then, positive and negative samples are generated automatically and independently for each classifier, so that the same feature vector of an image can be a positive sample for one foveal classifier but a negative sample for other classifiers, depending on the proximity of the human's point annotation to the fovea of each classifier.

In addition to the advantages described above, an additional competitive advantage of the method of detecting persons referred to in this invention is that it requires neither scene information (beyond the point annotations in the database) nor omnidirectional camera calibration. Therefore, not only are camera calibration and geometric correction stages for the captured images unnecessary, but the errors and difficulties derived from them, which reduce detection performance, are also avoided.

BRIEF DESCRIPTION OF THE FIGURES

Fig. 1 shows a block diagram of the human detection and location procedure of the invention referred to herein, which is composed of sub-procedure A for the detection and location of humans in images acquired by a static omnidirectional camera, and sub-procedure B of configuration (or training) of the parameters of sub-procedure A, which enables the correct detection and location of humans in images of a specific environment/scene.

Fig. 2 shows the six stages of sub-procedure A, from the acquisition of images to the final detection and location of humans in those images.

Fig. 3 shows the four stages of sub-procedure B, from the creation of an annotated image database to the configuration of the parameters of sub-procedure A.

Fig. 4 illustrates step A.2, which generates a single feature super-vector for each acquired image by dividing the image into blocks, calculating a feature vector for each block, and concatenating the feature vectors of all the blocks to form a super-vector that represents the image.

Fig. 5 shows a hexagonal/quincunx grid pattern of M foveal classifiers used in an embodiment of the invention, and a cluster/neighborhood of Np classifiers.
Fig. 6 illustrates step A.4, in which the M foveal classifiers make a binary prediction to determine whether a human is present in their foveas, generating an activation represented by the active reference point of the foveal classifier, and step A.6, which accurately locates in the image the humans detected by selecting and filtering clusters/neighborhoods of detections on the mesh/grid of reference points of the foveal classifiers and, finally, interpolating the coordinates of the reference points of each cluster/neighborhood.

Fig. 7 illustrates step B.2 of point annotation of humans in the images that make up the database.

Fig. 8 illustrates step B.3 of adaptive generation of positive and negative samples for each foveal classifier from the human point annotations in the database.

DESCRIPTION OF THE PREFERRED EMBODIMENT

A method according to an embodiment of the invention integrates two differentiated sub-procedures (Fig. 1): sub-procedure A for the detection and location of humans (1) in images (3) acquired by a static omnidirectional camera (2), which is explicitly configured for the detection and location of humans in a specific environment/scene, and sub-procedure B of configuration (or training) of the parameters of sub-procedure A, which enables the correct detection and location of humans (1) in images (3) of a specific environment/scene. Human detection operates in real time on a general-purpose computer and is robust to the partial visibility of humans (1) in the scene (whether by occlusion or by being partially outside the margins of the image (3)), to lighting changes, to variations in human appearance (different morphologies, poses and scales), and to the geometric distortions of the optics of the omnidirectional camera (2).

Sub-procedure A comprises six stages, from the acquisition of images (3) to the final detection and location (4) of humans in those images (Fig. 2). The first stage A.1 acquires images (3) from an omnidirectional camera (2) in a static position. As a practical example, the omnidirectional camera (2) can be placed on the ceiling of an office room so that the acquired images (3) cover the entire interior of the room. The same camera configuration must be respected for sub-procedure B. The acquired images have a resolution of 800x600 pixels.

Step A.2 generates a single feature super-vector (8) for each acquired image (3) (Fig. 4). This extraction of the feature vector (8) is broken down into three sub-stages. The first sub-stage A.2.1 divides the image (3) into blocks of NxN pixels (6), which may or may not be overlapping. Sub-stage A.2.2 generates a feature vector (7) for each block of NxN pixels (6) from the previous sub-stage. The procedure for generating the feature vector (7) for each block (6) can be one of the following: Histograms of Oriented Gradients (HOG), Local Binary Patterns, the Scale-Invariant Feature Transform, or Haar features; however, there is no a priori limitation on the use of other feature extraction techniques.
As a preferable example, the image (3) is divided into non-overlapping blocks (6) of 16x16 pixels. The Histograms of Oriented Gradients (HOG) feature descriptor (N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Conference on Computer Vision and Pattern Recognition (CVPR), 2005) is calculated for each block (6), using the following parameters: 8x8-pixel cell size, an overlap of 1/2 of the cell area, and nine histogram orientation bins; finally, all the HOG vectors (7) are concatenated into a single super-vector (8). The main difference with respect to the standard HOG implementation is the block size (6): the original formulation uses 64x128-pixel blocks, whereas this embodiment uses smaller blocks (16x16 pixels) to reduce the final size of the concatenated super-vector (8). Sub-stage A.2.3 concatenates the feature vectors (7) of each block (6) of the image (3) into a single feature super-vector (8) that represents the entire image.

Step A.3 distributes the feature super-vector (8) of the image to the set of M foveal classifiers (9).

In step A.4, the M foveal classifiers (9) make a binary prediction into the categories "human present in the fovea of the classifier" and its complementary "human not present in the fovea of the classifier". The fovea of a classifier (10) is the region of the image (3) on which its detection of humans (1) is centered, and its area and morphology are automatically inferred by Sub-procedure B of parameter configuration. Each foveal classifier also has an associated reference point (10) on the image plane, which is a pair of coordinates of the image plane that represents the fovea of the classifier. The set of point references of the M foveal classifiers (9) forms a two-dimensional spatial grid/mesh that covers the area of the image (3) acquired by the camera (Fig. 6). Said spatial grid/mesh is configurable with different patterns (hexagonal, rectangular, polar, etc.) and different numbers of classifiers; Fig. 5 shows an example of a hexagonal/quincunx pattern. Each foveal classifier (10) is trained/configured to detect humans (1) in its fovea using as input the feature super-vector (8) of the image (3), which is common to all foveal classifiers. This entails a great competitive advantage from the point of view of computational cost, since the number of operations is significantly smaller than in techniques based on the sliding-window concept or on part detection: the classifiers (9) only have to process a single feature vector (8) per image (3). The classification algorithm used by the foveal classifiers (9) is not restricted a priori, and the following classifiers, among others, can be used: Support Vector Machines, Neural Networks, and Logistic Regression. As a particular example, 825 Support Vector Machine (SVM) classifiers (9) with a linear kernel have been used, whose point references (10) are arranged in a mesh/grid with a hexagonal/quincunx pattern.

Stage A.5 approximately locates in the image (3) the humans detected, using the point references (10) of the active foveal classifiers (12), that is, those that have detected a human in their fovea.
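As an illustrative sketch of steps A.3 and A.4 under the choices of this embodiment (M linear-kernel SVMs that all read the same image super-vector), the class below uses scikit-learn's LinearSVC; the class name FovealBank and the guard for foveas that never receive a positive training sample are assumptions of this sketch, not part of the invention:

```python
import numpy as np
from sklearn.svm import LinearSVC  # linear-kernel SVM, as in the embodiment

class FovealBank:
    """Bank of M independent foveal classifiers; every classifier reads
    the same image super-vector but has its own fovea-specific labels."""

    def __init__(self, M):
        self.clfs = [LinearSVC(C=1.0) for _ in range(M)]

    def fit(self, X, Y):
        """X: (n_images, d) super-vectors; Y: (n_images, M) 0/1 labels
        produced by Sub-procedure B. Classifiers whose training labels
        contain a single class are left untrained (always inactive)."""
        for m in range(Y.shape[1]):
            if len(np.unique(Y[:, m])) < 2:
                self.clfs[m] = None
            else:
                self.clfs[m].fit(X, Y[:, m])

    def predict(self, x):
        """Binary activation and signed confidence of each fovea for one
        image super-vector (step A.4)."""
        x = np.asarray(x).reshape(1, -1)
        scores = np.array([c.decision_function(x)[0] if c is not None
                           else -1.0 for c in self.clfs])
        return scores > 0, scores
```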
Note that several foveal classifiers (12) are activated by the same human if their foveas overlap.

The last stage A.6 accurately locates in the image (3) the humans detected, merging the detection results of each foveal classifier (10), which makes it possible to refine the location and produce a final, unique detection (4) per human (1) (Fig. 6). Stage A.6 consists of three sub-stages. The first sub-stage A.6.1 selects clusters/neighborhoods (11) of detections according to the spatial structure of the grid/mesh (9) formed by the point references (10) of the foveal classifiers (Np classifiers form a neighborhood in Fig. 5). Clusters/neighborhoods (11) of detections must exceed a minimum threshold to be considered, since the area that a human (1) occupies in the image (3) corresponds to a region in which multiple foveas of neighboring foveal classifiers (11) overlap; this restriction avoids false positives due to isolated activations of the classifiers (9). Sub-stage A.6.2 suppresses non-maxima among the activated clusters/neighborhoods (11), obtaining a single cluster/neighborhood per human (1). The last sub-stage A.6.3 accurately locates the human (1) in the image (3) by interpolating the coordinates of the point references (10) of the foveal classifiers belonging to the cluster/neighborhood (11) selected in the previous sub-stage. This interpolation is weighted by the confidence measure of the detection of each foveal classifier (10). Therefore, a single point location (4) per human (1) is generated, which in turn represents its final detection. Typically, clusters/neighborhoods (11) containing at least 5 detections according to the spatial structure of the grid/mesh (9) are selected. The precise location (4) of a detected human (1) in the image (3) is obtained by interpolating the coordinates of all the point references (10) of the foveal classifiers that are part of a cluster/neighborhood (11) that has passed the criteria of minimum number of detections per cluster/neighborhood and suppression of non-maxima.
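The following sketch condenses stages A.5 and A.6: it groups the activated reference points into neighborhoods, applies a greedy form of non-maximum suppression, enforces the minimum of 5 detections per cluster, and interpolates one confidence-weighted location per human. The function name localize, the greedy scheme and the radius parameter are assumptions of this sketch:

```python
import numpy as np

def localize(ref_points, active, scores, radius, min_cluster=5):
    """Stages A.5/A.6 sketch: group the reference points of activated
    foveal classifiers into neighborhoods, keep one neighborhood per
    human (greedy non-maximum suppression), and interpolate a single
    point location weighted by each classifier's confidence."""
    pts, conf = ref_points[active], scores[active]
    detections, used = [], np.zeros(len(pts), dtype=bool)
    for i in np.argsort(-conf):              # strongest activations first
        if used[i]:
            continue
        nbr = (np.linalg.norm(pts - pts[i], axis=1) < radius) & ~used
        if nbr.sum() < min_cluster:          # reject isolated activations
            continue
        used |= nbr                          # suppress this neighborhood
        w = conf[nbr]
        detections.append((pts[nbr] * w[:, None]).sum(axis=0) / w.sum())
    return detections
```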
Sub-procedure B comprises four stages, from the creation of an annotated image database (5) to the configuration of the parameters of sub-procedure A (Fig. 3). The first stage B.1 creates an image database (5) of a specific scene/environment that contains human instances (1). The images (3) are acquired with an omnidirectional camera (2), which must be located in the same position as the camera used during detection. For example, the omnidirectional camera (2) is placed on the ceiling of the office room so that the acquired images (3) cover the entire interior of the room; the same camera configuration must be respected for sub-procedure A.

Step B.2 performs the point annotation of the humans (1) in the image database (5) (Fig. 7). The point annotation consists in specifying the coordinates of a representative point of the human (1) in the image (3). The representative point chosen must be consistent across all instances of the annotated object. For example, the representative point of the human (1) in the image (3) is the head, since, for the camera location considered, this is a reference minimally affected by occlusions with objects in the room and with other humans. Point annotation is a great competitive advantage in the cost of generating an annotated database compared with other image-detection techniques that require more complex annotations (rectangles or other polygons, image areas, etc.).

Step B.3 adaptively generates a set of positive and negative samples for each foveal classifier (10) from the point annotations (Fig. 8). A point annotation generates a positive sample for a given foveal classifier (10) if the distance between the coordinates of the point annotation and the coordinates of the reference point of the foveal classifier is below a threshold; if this threshold is exceeded, a negative sample is generated for the classifier involved (10). As a result, the same point annotation generates a set of positive samples for a subset of foveal classifiers (11) and a set of negative samples for the complementary subset of foveal classifiers. For example, it can be established that a point annotation generates a positive sample for the seven foveal classifiers (Np = 7) with the smallest distance between the coordinates of the point annotation and the coordinates of the reference point of the foveal classifier; the point annotation then generates a negative sample for the rest of the foveal classifiers.

The last stage B.4 adjusts the parameters ("training", in the classifier jargon) of each foveal classifier (10) for the optimal detection of humans, using the positive and negative samples generated in the previous stage. As an example, the standard training algorithm of Support Vector Machines with regularization can be used.

Note that the procedure referred to in this invention is characterized by requiring neither scene information (beyond the point annotations in the database (5)) nor calibration of the camera (2). This implies competitive advantages in the deployment, configuration and practical application of the human detection and location procedure. Furthermore, said procedure can be extended to the location and detection of generic objects contained in images (3) acquired by omnidirectional (2) and perspective cameras, since, on the one hand, only the point annotation of an object in a database (5) is necessary (there is no particularization to the characteristics of a human), and, on the other, there is no treatment specific to the optics of the camera (proof of this is the absence of a calibration stage).
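As a closing illustration, the following usage sketch wires together the hypothetical helpers from the previous sketches (image_super_vector, hex_grid, per_classifier_labels, FovealBank and localize) on synthetic data; a real deployment would instead train on the annotated database of Fig. 7, and with random inputs the sketch will typically print an empty detection list:

```python
import numpy as np

rng = np.random.default_rng(0)
W, H, STEP = 320, 240, 24                 # reduced resolution for the sketch

# Reference-point mesh shared by Sub-procedures A and B (Fig. 5)
grid = hex_grid(W, H, STEP)

# Sub-procedure B: annotated database -> adaptive samples -> training
images = [rng.random((H, W)) for _ in range(4)]              # stand-in frames
annotations = [rng.random((2, 2)) * [W, H] for _ in images]  # fake head points
X = np.stack([image_super_vector(img) for img in images])
Y = np.stack([per_classifier_labels(a, grid) for a in annotations])
bank = FovealBank(M=len(grid))
bank.fit(X, Y)                            # stage B.4: one SVM per fovea

# Sub-procedure A: detection and localization on a new frame
active, scores = bank.predict(image_super_vector(images[0]))
print(localize(grid, active, scores, radius=2 * STEP))
```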
Claims:
1. Procedure for the detection and location of humans (1) in image sequences (3) acquired by omnidirectional cameras (2), characterized in that it comprises the following sub-procedures:

a. Sub-procedure A for the detection and location of humans (1) in images (3) acquired by a static omnidirectional camera (2), which is configured for the detection and location of humans in a specific environment/scene through the following stages:
i. Stage A.1: acquisition of images (3) with an omnidirectional camera (2) in a static position;
ii. Stage A.2: extraction of a single feature super-vector (8) for each of the acquired images (3);
iii. Stage A.3: distribution of the feature super-vector (8) associated with each of the images (3) to a set of M foveal classifiers (9);
iv. Stage A.4: binary prediction for each of the M foveal classifiers (10) into the categories "human present in the fovea of the classifier" and its complementary "human not present in the fovea of the classifier";
v. Stage A.5: approximate location (12) in the image (3) of the humans detected; and,
vi. Stage A.6: precise location (4) in the image (3) of the humans detected;

b. Sub-procedure B for configuring the parameters for the detection and location in images (3) of humans (1) in a specific environment/scene through the following stages:
i. Stage B.1: creation of a database (5) of images (3) of a specific scene/environment containing instances of humans (1), where the images are acquired with an omnidirectional camera (2) that must be located in the same position as the camera used during detection;
ii. Stage B.2: point annotation of the humans (1) in the image database (3);
iii. Stage B.3: adaptive generation of positive and negative samples for each foveal classifier (10) from the point annotations;
iv. Stage B.4: adjustment of the parameters of each foveal classifier (10) for optimal human detection using the positive and negative samples generated;

where stage A.2 comprises three sub-stages:
a. Sub-stage A.2.1: division of the image (3) into blocks of NxN pixels (6), which may be overlapping or not;
b. Sub-stage A.2.2: extraction of a feature vector (7) for each block of NxN pixels (6) referred to in the previous sub-stage, where the feature extraction algorithm is selected from: Histograms of Oriented Gradients, Local Binary Patterns, the Scale-Invariant Feature Transform, and Haar features;
c. Sub-stage A.2.3: concatenation of the feature vectors (7) of each block of the image into a single feature super-vector (8) representing the entire image (3);

where stage A.4 comprises a set of M foveal classifiers (9), each of which (10) has an associated fovea and a point reference on the image plane;

where stage A.5 uses the point references of the active foveal classifiers (12);

and where stage A.6 comprises three sub-stages:
a. Sub-stage A.6.1: extraction of clusters/neighborhoods (11) of detections according to the grid/mesh (9) formed by the point references (10) of the foveal classifiers;
b. Sub-stage A.6.2: suppression of non-maxima among the extracted clusters/neighborhoods (11), obtaining a single cluster/neighborhood per human (1);
c. Sub-stage A.6.3: precise location (4) of the human (1) in the image (3) by interpolating the coordinates of the point references (10) of the foveal classifiers that are part of each cluster/neighborhood (11) obtained in the previous sub-stage, in such a way that a single point location is generated per human (1), which in turn represents the final detection (4) thereof.
Family patents:
ES2657378B2 | 2019-02-07
Legal status:
2019-02-07 | FG2A | Definitive protection | Ref. document number: ES 2657378 | Kind code: B2 | Effective date: 2019-02-07