IMAGE RECOGNITION USING DESCRIPTOR PRUNING

20170309004 · 2017-10-26

Abstract

The present disclosure relates to image recognition and image searching. More precisely, the present disclosure relates to pruning local descriptors extracted from an input image. The present disclosure proposes a system, method and device directed to the pruning of local descriptors extracted from image patches of an input image, in which local descriptors assigned to a codebook cell are pruned based on a relationship between the local descriptor and the assigned codebook cell. The present disclosure includes assigning, based on that relationship, a weight value for use in pruning; this weight value is then used during the encoding of the local descriptors for image searching or image recognition.

Claims

1. An image processing system for processing an image for image searching comprising: a memory; a processor; a local descriptor pruner configured to prune at least a local descriptor based on a relationship of the local descriptor and a codeword to which the local descriptor is assigned; wherein the local descriptor pruner determines a weight value for the local descriptor based on a distance between the local descriptor and the codeword or based on a probability value produced by a Gaussian Mixture Model (GMM) used for an encoder and evaluated at the local descriptor, and removes said local descriptor if said distance exceeds a threshold or said probability value falls below a threshold, and wherein the weight value for the local descriptor is utilized by an image encoder during encoding when said local descriptor is not removed.

2. The image processing system of claim 1, wherein the local descriptor pruner determines a hard weight value that is either 1 or 0.

3. The image processing system of claim 1, wherein the local descriptor pruner determines a soft weight value that is between 0 and 1.

4. The image processing system of claim 3, wherein the local descriptor pruner determines the soft weight value based on either exponential weighting or inverse weighting.

5. (canceled)

6. The image processing system of claim 1, wherein the local descriptor pruner determines the weight based on the equation $w_k(\underline{x}) = [\![(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k) \leq \gamma\sigma_k^2]\!]$, wherein k is an index value, $\underline{x}$ is the local descriptor, $\underline{c}_k$ is the assigned codeword, $\gamma$, $\sigma_k$, and $\underline{M}_k$ are parameters computed prior to initialization, and $[\![\cdot]\!]$ evaluates to 1 if the condition is true and 0 otherwise.

7. (canceled)

8. The image processing system of claim 1, wherein the local descriptor pruner determines the weight based on a parameter that is computed from a training set of images.

9. The image processing system of claim 1, wherein the image encoder is at least one selected from the group consisting of a Bag of Words encoder, a Fisher encoder and a VLAD encoder.

10. The image processing system of claim 1, further comprising an image searcher configured to retrieve at least an image result based on the results of the image encoder.

11. The image processing system of claim 1, further comprising a local descriptor extractor configured to compute at least an image patch and configured to extract a local descriptor for the image patch.

12. A method for image processing for processing an image for image searching comprising pruning a local descriptor based on a relationship of the local descriptor and a codeword to which the local descriptor is assigned; wherein the pruning of the local descriptor includes determining a weight value for the local descriptor based on a distance between the local descriptor and the codeword or based on a probability value produced by a Gaussian Mixture Model (GMM) used for an encoder and evaluated at the local descriptor, and removing said local descriptor if said distance exceeds a threshold or said probability value falls below a threshold, and wherein the weight value for the local descriptor is utilized during encoding of the pruned local descriptor when said local descriptor is not removed.

13. The method of claim 12, wherein pruning of the local descriptor includes determining a hard weight value that is either 1 or 0.

14. The method of claim 12, wherein pruning of the local descriptor includes determining a soft weight value that is between 0 and 1.

15. The method of claim 14, wherein the soft weighting value is determined based on either exponential weighting or inverse weighting.

16. (canceled)

17. The method of claim 12, wherein the weight is determined based on the equation $w_k(\underline{x}) = [\![(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k) \leq \gamma\sigma_k^2]\!]$, wherein k is an index value, $\underline{x}$ is the local descriptor, $\underline{c}_k$ is the assigned codeword, $\gamma$, $\sigma_k$, and $\underline{M}_k$ are parameters computed prior to initialization, and $[\![\cdot]\!]$ evaluates to 1 if the condition is true and 0 otherwise.

18. (canceled)

19. The method of claim 12, wherein the weight is determined based on a parameter that is computed from a training set of images.

20. The method of claim 12, wherein the encoding of the pruned local descriptor is performed using at least one selected from the group consisting of a Bag of Words encoder, a Fisher encoder and a VLAD encoder.

21. The method of claim 12, further comprising searching at least an image result based on results of the encoding.

22. The method of claim 12, further comprising receiving an image and computing at least an image patch for the image.

23. Electronic device incorporating the image processing system of claim 1.

24. Electronic device of claim 23 selected from the group consisting of a computer, a laptop, a smartphone, a handheld computing system, and a remote server.

25. Non-transitory storage medium carrying instructions of program code for executing steps of the method of claim 12 when said program is executed on a computer.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] The features and advantages of the present invention will become apparent from the detailed description below when taken in conjunction with the following Figures:

[0020] FIG. 1A illustrates a flow diagram for a standard image processing pipeline for image searching.

[0021] FIG. 1B illustrates a flow diagram for performing local descriptor detection using the patches in FIG. 1C or FIG. 1D.

[0022] FIG. 1C illustrates a diagram showing an example of image patches identified using a dense detector.

[0023] FIG. 1D illustrates a diagram showing an example of image patches identified using a sparse detector.

[0024] FIG. 1E illustrates a flow diagram showing an example of an image retrieval search.

[0025] FIG. 1F illustrates a flow diagram showing an example of a semantic search.

[0026] FIG. 2 illustrates a flow diagram for an exemplary method for performing image processing with local descriptor pruning in accordance with an example of the present invention.

[0027] FIG. 2A illustrates a flow diagram for an exemplary method for performing local descriptor pruning in accordance with an example of the present invention.

[0028] FIG. 2B illustrates an example of a visual representation of a result of an exemplary pruning process.

[0029] FIG. 3 illustrates a block diagram of an exemplary image processing device.

[0030] FIG. 4 illustrates a block diagram of an exemplary distributed image processing system.

[0031] FIG. 5 illustrates an exemplary plot of percentage of Hessian-Affine SIFT descriptors that are pruned versus threshold values in accordance with an example of the present invention.

[0032] FIG. 6 illustrates an exemplary plot for selection of a pruning parameter when performing hard pruning of local descriptors in accordance with an example of the present invention.

[0033] FIG. 7 illustrates an exemplary plot for selection of a pruning parameter when performing soft pruning in accordance with an example of the present invention.

DETAILED DESCRIPTION

[0034] Examples of the present invention relate to an image processing system that includes a local descriptor pruner for pruning local descriptors based on a relationship between the local descriptor and a codeword to which the local descriptor is assigned. The local descriptor pruner assigns a weight value to the local descriptor based on this relationship, and the weight value is then utilized during encoding.

[0035] Examples of the present invention also relate to a method for pruning local descriptors based on a relationship between the local descriptor and a codeword. The method assigns a weight value for the local descriptor based on the relationship of the local descriptor and the codeword and the weight value is utilized by an image encoder during encoding.

[0036] In one example, the local descriptor pruner or the pruning method can assign a hard weight value that is either 1 or 0. In one example, the local descriptor pruner or the pruning method can assign a soft weight value that is between 0 and 1. In one example, the local descriptor pruner or the pruning method can determine the soft weight value based on either exponential weighting or inverse weighting. In one example, the local descriptor pruner or the pruning method can determine the weight based on a distance between the local descriptor and the codebook cell. In one example, the local descriptor pruner or the pruning method can determine the weight based on the equation $w_k(\underline{x}) = [\![(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k) \leq \gamma\sigma_k^2]\!]$, where k is an index value, $\underline{x}$ is the local descriptor, $\underline{c}_k$ is the assigned codeword, $\gamma$, $\sigma_k$, and $\underline{M}_k$ are parameters computed prior to initialization, and $[\![\cdot]\!]$ evaluates to 1 if the condition is true and 0 otherwise. In one example, the local descriptor pruner or the pruning method can determine the weight based on a probability value produced by a GMM model evaluated at the local descriptor. In one example, the local descriptor pruner or the pruning method can determine the weight based on a parameter that is computed from a training set of images.

[0037] In one example, the image encoder is at least one of a Bag of Words encoder, a Fisher encoder or a VLAD encoder. The encoding of the method can likewise be based on at least one of a Bag of Words encoder, a Fisher encoder or a VLAD encoder.

[0038] In one example, the system or method further comprises an image searcher or an image searching method for retrieving an image based on the results of the image encoder or of the image encoding, respectively.

[0039] In one example, the system or method further comprises a local descriptor extractor or a local descriptor extracting method for computing at least an image patch and for extracting a local descriptor for the image patch.

[0040] Scalars, vectors and matrices are denoted by standard, underlined, and underlined uppercase typeface, respectively (e.g., scalar $a$, vector $\underline{a}$ and matrix $\underline{A}$). The variable $\underline{v}_k$ denotes a vector from a sequence $\underline{v}_1, \underline{v}_2, \ldots, \underline{v}_N$, and $v_k$ denotes the k-th coefficient of vector $\underline{v}$. The notation $[\underline{a}_k]_k$ (respectively, $[a_k]_k$) denotes concatenation of the vectors $\underline{a}_k$ (scalars $a_k$) to form a single column vector. The notation $[\![\cdot]\!]$ denotes evaluation to 1 if the condition is true and 0 otherwise.

[0041] The present invention may be implemented on any electronic device or combination of electronic devices. For example, the present invention may be implemented on any of a variety of devices including a computer, a laptop, a smartphone, a handheld computing system, a remote server, or any other type of dedicated hardware. Various examples of the present invention are described below with reference to the figures.

[0042] Exemplary Process of Image Searching with Local Descriptor Pruning

[0043] FIG. 2 is a flow diagram illustrating an exemplary method 200 for performing image processing with local descriptor pruning in accordance with an example of the present disclosure. The image processing method 200 includes Input Image block 210. Input Image block 210 receives an inputted image. In one example, the inputted image may be received after a selection by a user. Alternatively, the inputted image may be captured or uploaded by the user. Alternatively, the inputted image may be received or generated by a device such as a camera and/or an image processing computing device.

[0044] The Extract Local Descriptors block 220 may receive the input image from Input Image block 210. The Extract Local Descriptors block 220 may extract local descriptors in accordance with the processes described in connection with FIGS. 1A-1D.

[0045] The Extract Local Descriptors block 220 may compute one or more patches for the input image. In one example, the image patches may be computed using a dense detector, an example of which is shown in FIG. 1C. In another example, the image patches may be computed using a sparse detector, an example of which is shown in FIG. 1D.

[0046] For each image patch, the Extract Local Descriptors block 220 extracts a local descriptor using a local descriptor extraction algorithm. For example, the Extract Local Descriptors block 220 may extract local descriptors in accordance with the processes described in connection with FIGS. 1A-1B.

[0047] In one example, the Extract Local Descriptors block 220 extracts the local descriptors for each image patch by using a Scale Invariant Feature Transform (SIFT) algorithm on each image patch, resulting in a corresponding SIFT vector for each patch. The SIFT vector may have any number of entries. In one example, the SIFT vector may have 128 entries. In one example, the Extract Local Descriptors block 220 may compute N image patches for an image (image patch i, where i=1, 2, 3 . . . N). For each image patch i, a SIFT vector of size 128 is computed. At the end of processing, the Extract Local Descriptors block 220 outputs N SIFT local descriptor vectors, each of size 128. In another example, the Extract Local Descriptors block 220 may use an algorithm other than the SIFT algorithm, such as, for example: Speeded Up Robust Features (SURF), Gradient Location and Orientation Histogram (GLOH), Local Energy based Shape Histogram (LESH), Compressed Histogram of Gradients (CHoG), Binary Robust Independent Elementary Features (BRIEF), Discriminative Binary Robust Independent Elementary Features (D-BRIEF), or the Daisy descriptor.
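As a rough illustration of this extraction step, the following Python sketch (an assumption of this write-up, using OpenCV's SIFT implementation and a hypothetical image path, not code from the disclosure) computes one 128-entry descriptor per detected patch:

```python
import cv2  # OpenCV; its SIFT implementation is assumed available
import numpy as np

# Hypothetical input path; any image readable as grayscale works.
image = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints (patches) and extract one 128-entry SIFT vector each.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

# descriptors is an (N, 128) float32 array: N local descriptor vectors.
print(descriptors.shape)
```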

[0048] The output of the Extract Local Descriptors block 220 may be a set of local descriptor vectors. In one example, the output of Extract Local Descriptors block 220 may be a set $I = \{\underline{x}_i \in \mathbb{R}^d\}_i$ of local SIFT descriptor vectors, where each $\underline{x}_i$ represents a local descriptor vector computed for a patch of the inputted image.

[0049] The Prune Local Descriptors block 230 receives the local descriptors from the Extract Local Descriptors block 220 and prunes them to remove local descriptors that are too far away from either the codewords or the GMM mixture components of an encoder. Pruning such far-away local descriptors prevents degradation in the quality of image searching; the present invention thus returns more reliable image search results by pruning local descriptors that are too far away to be visually informative. This is particularly beneficial in high-dimensional local descriptor spaces, because in those spaces cells are almost always unbounded, meaning that they have infinite volume, yet only a part of this volume is visually informative. The present invention isolates the visually informative part by pruning the non-informative local descriptors.

[0050] The Prune Local Descriptors block 230 may employ a local-descriptor pruning method applicable to any subsequently used encoding method (BOW, VLAD or Fisher). In one example, the Prune Local Descriptors block 230 may receive a signal indicating the encoder that is utilized. Alternatively, the Prune Local Descriptors block 230 may prune the local descriptor vectors independently of the subsequent encoding method. Generally, the Prune Local Descriptors block 230 may prune local descriptor vectors for any feature encoding method based on local descriptors where each local descriptor is related to a cell $C_k$ or mixture component/soft cell $(\beta_k, \underline{c}_k, \underline{\Sigma}_k)$, where k denotes the index. In one example, cell $C_k$ denotes the Voronoi cell $\{\underline{x} \mid \underline{x} \in \mathbb{R}^d,\ k = \arg\min_j \|\underline{x}-\underline{c}_j\|\}$ associated to codeword $\underline{c}_k$. In another example, $\beta_i$, $\underline{c}_i$, $\underline{\Sigma}_i$ denote the soft cell of the i-th GMM component, where $\beta_i$ is the prior weight, $\underline{c}_i$ the mean vector, and $\underline{\Sigma}_i$ the covariance matrix (assumed diagonal).

[0051] In one example, a cell $C_k$ may denote the Voronoi cell $\{\underline{x} \mid \underline{x} \in \mathbb{R}^d,\ k = \arg\min_j \|\underline{x}-\underline{c}_j\|\}$ associated with a codeword $\underline{c}_k$. The codewords $\underline{c}_k$ may be learned using a set of training SIFT (or any other type of local descriptor) vectors from a set of training images and are kept fixed during encoding. The learning of the codewords $\underline{c}_k$ may be performed at an initialization stage using K-means, where, for example, each codeword may be computed as the average of all the SIFT vectors assigned to cell number k. For example, for a codeword $\underline{c}_1$, each SIFT vector $\underline{x}_i$ (i=1, 2, 3, 4, . . . ) that is closer to $\underline{c}_1$ than to any other $\underline{c}_k$, where k is a number other than 1, is assigned to cell number 1. Once all the $\underline{c}_k$ are computed, the process is repeated until convergence, since changing the $\underline{c}_k$ changes which SIFT vectors are closest to which $\underline{c}_k$.
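A minimal sketch of this codebook-learning step, assuming scikit-learn's K-means and synthetic stand-in training descriptors (both assumptions, not part of the disclosure):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for SIFT vectors pooled from a set of training images.
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(10_000, 128)).astype(np.float32)

# Learn L codewords; the cluster centers are the codewords c_k, which
# implicitly define the Voronoi cells C_k and stay fixed during encoding.
L = 64
kmeans = KMeans(n_clusters=L, n_init=1, random_state=0).fit(train_descriptors)
codebook = kmeans.cluster_centers_   # shape (L, 128)
```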

[0052] In one example, each soft cell $C_i$ is defined by the parameters $\beta_i$ (prior weight), $\underline{c}_i$ (mean vector) and $\underline{\Sigma}_i$ (covariance matrix). These parameters for all the cells i=1, 2, 3, . . . , L are the output of a GMM learning algorithm implemented, for example, using standard approaches like the Expectation Maximization algorithm. When pruning descriptors based on GMM models, the same approach used for hard cells can be used: soft and hard weights $w_i(\underline{x})$ can be computed based on the distance between $\underline{x}$ and $\underline{c}_i$. An alternate hard pruning approach tailored to GMM models is to apply a threshold (learned experimentally so as to maximize the mAP on a training set) to the probability value $p(\underline{x})$ produced by the GMM model at the point $\underline{x}$. A soft-pruning approach might instead use the probability itself or a mapping of this probability. A possible mapping is $p(\underline{x})^a$ for some value of a between 0 and 1.
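A sketch of the GMM-based hard-pruning variant, assuming scikit-learn's GaussianMixture and a placeholder threshold (the disclosure learns the threshold experimentally to maximize mAP):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5_000, 64))   # synthetic stand-ins
descriptors = rng.normal(size=(200, 64))

# GMM with diagonal covariances, matching the soft-cell assumption.
gmm = GaussianMixture(n_components=16, covariance_type="diag",
                      random_state=0).fit(train_descriptors)

# score_samples returns log p(x); thresholding it is equivalent to
# thresholding p(x). The 5% quantile is a placeholder for the threshold
# the disclosure learns on a training set to maximize mAP.
log_p = gmm.score_samples(descriptors)
log_thresh = np.quantile(log_p, 0.05)
kept = descriptors[log_p >= log_thresh]   # hard pruning by probability
```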

[0053] In one example, the Prune Local Descriptors block 230 prunes the local descriptors of the inputted image based on a determination of whether the local descriptors are too far away from their assigned cells or soft cells. For example, the block 230 determines whether the local descriptors are too far from the codeword of cell $C_k$ or from a mixture component $(\beta_k, \underline{c}_k, \underline{\Sigma}_k)$ (soft cell). In one example, the Prune Local Descriptors block 230 may prune the local descriptors by removing those local descriptors whose distance to the codeword $\underline{c}_k$ at the center of the containing cell $C_k$ exceeds a threshold.

[0054] FIG. 2A illustrates a process for pruning local descriptors in accordance with an example of the present invention. The process shown in FIG. 2A may be implemented by the Prune Local Descriptors block 230 shown in FIG. 2. In one example, the pruning process receives the unpruned local descriptors at block 231.

[0055] In one example, the pruning process may receive, at block 232, a codebook including codewords relating to cells or soft cells. The codebook may either be received from local storage or through a communication link with a remote location. The codebook may be initialized at or before the initialization of the pruning process. In one example, a codebook $\{\underline{c}_k\}_k$ defines Voronoi cells $\{C_k\}_k$, where k denotes the index of the cell. In another example, a codebook may include soft cells $C_i$ defined by the parameters $\beta_i$ (prior weight), $\underline{c}_i$ (mean vector) and $\underline{\Sigma}_i$ (covariance matrix).

[0056] In one example, the pruning process assigns, at block 233, each local descriptor to a cell or a soft cell received at block 232. In one example, the pruning process may assign each local descriptor to a cell by locating the cell whose codeword has the smallest Euclidean distance to the local descriptor.
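A sketch of this assignment step, assuming NumPy arrays `descriptors` (N × d) and `codebook` (L × d) as in the sketches above:

```python
import numpy as np

def assign_to_cells(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """For each descriptor, return the index k of the nearest codeword c_k
    (i.e., the Voronoi cell C_k the descriptor falls in)."""
    # Pairwise squared Euclidean distances, shape (N, L).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Usage: assignments = assign_to_cells(descriptors, codebook)
```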

[0057] In one example, the assigned local descriptors are pruned at block 234. In one example, the pruning process at block 234 evaluates each local descriptor to determine whether that local descriptor is too far away from its assigned cell or soft cell. In one example, the pruning process determines whether the local descriptor is too far away by determining whether the distance between that local descriptor and the center or codeword of its assigned cell or soft cell exceeds a calculated or predetermined threshold. In an illustrative example, if a local descriptor is assigned to cell no. 5, the pruning process at block 234 may test whether the Euclidean distance between that local descriptor vector and codeword vector no. 5 exceeds a threshold. In another example, the pruning process may determine a probability value for the local descriptor relative to a cell or a soft cell, determine whether the probability value is below or above a certain threshold, and prune local descriptors based on this determination. In an illustrative example, a GMM model may yield a probability value between 0 and 1 for a local descriptor $\underline{x}$. In one example, the pruning process may prune the local descriptor $\underline{x}$ if the probability value is lower than a certain threshold (e.g., less than thresh=0.01). The value of this threshold can be determined experimentally using a training set.

[0058] In one example, each local descriptor may be pruned by assigning a hard weight value (1 or 0) based on whether the distance between the local descriptor and its assigned cell or soft cell exceeds a threshold. Alternatively, the local descriptors may be pruned by assigning a soft weight value (between 0 and 1) to each local descriptor based on the distance between the local descriptor and its assigned cell or soft cell.

[0059] In one example, each local descriptor $\underline{x}$ may be pruned based on whether the distance between the local descriptor $\underline{x}$ and its assigned codeword $\underline{c}_k$ exceeds the threshold determined by the following distance-to-$\underline{c}_k$ condition:

$(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k) \leq \gamma\sigma_k^2$.  (Equation 1)

[0060] The parameters $\gamma$, $\sigma_k$, and $\underline{M}_k$ may be computed prior to initialization and may be either stored locally or received via a communication link.

[0061] In one example, the value of $\gamma$ is determined experimentally by cross-validation and the parameter $\sigma_k$ is computed from a training set of local descriptors $\mathcal{T}$ as follows:

$\sigma_k = \operatorname{mean}_{\underline{x} \in \mathcal{T} \cap C_k}\left((\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k)\right)$  (Equation 2)

[0062] In one example, the matrix $\underline{M}_k$ can be any of the following:

[0063] Anisotropic $\underline{M}_k$: the empirical covariance matrix computed from $\mathcal{T} \cap C_k$;

[0064] Axis-aligned $\underline{M}_k$: the same as the anisotropic $\underline{M}_k$, but with all elements outside the diagonal set to zero;

[0065] Isotropic $\underline{M}_k$: a diagonal matrix $\sigma_k^2 \underline{I}$ with $\sigma_k^2$ equal to the mean diagonal value of the axis-aligned $\underline{M}_k$.

[0066] While the anisotropic variant may offer the most geometrical modelling flexibility, it may also increase computational cost. The isotropic variant, on the other hand, enjoys practically null computational overhead, but may have the least modelling flexibility. The axis-aligned variant offers a compromise between the two approaches.
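A sketch of how the three $\underline{M}_k$ variants, and a per-cell scale $\sigma_k$ in the spirit of Equation 2, might be computed from the training descriptors of one cell (NumPy only; the arrays are synthetic stand-ins for $\mathcal{T} \cap C_k$ and the codeword):

```python
import numpy as np

rng = np.random.default_rng(0)
cell_descriptors = rng.normal(size=(500, 128))   # stand-in for T ∩ C_k
c_k = cell_descriptors.mean(axis=0)              # stand-in for codeword c_k
d = cell_descriptors.shape[1]

# Anisotropic: full empirical covariance of the cell's training descriptors.
M_aniso = np.cov(cell_descriptors, rowvar=False)

# Axis-aligned: keep only the diagonal of the anisotropic matrix.
M_axis = np.diag(np.diag(M_aniso))

# Isotropic: mean diagonal value of the axis-aligned matrix, times identity.
M_iso = np.diag(M_aniso).mean() * np.eye(d)

# Per-cell scale: mean Mahalanobis-type distance over the cell's training
# descriptors (one plausible reading of Equation 2).
Minv = np.linalg.inv(M_axis + 1e-9 * np.eye(d))  # regularized inverse
diffs = cell_descriptors - c_k
sigma_k = np.einsum("nd,de,ne->n", diffs, Minv, diffs).mean()
```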

Hard Weights

[0067] In one example, the pruning of local descriptors according to Equation 1 can be implemented by means of 1/0 weights as follows, where $[\![\cdot]\!]$ is the indicator function that evaluates to one if the condition is true and zero otherwise:

$w_k(\underline{x}) = [\![(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k) \leq \gamma\sigma_k^2]\!]$  (Equation 3)

Soft Weights

[0068] In another example, the pruning of local descriptors according to Equation 1 can be implemented using soft weights.

[0069] In another example, the soft weights may be computed using exponential weighting, where

$w_k(\underline{x}) = \exp\left(-\dfrac{(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k)}{\sigma_k\,\omega^2}\right)$.  (Equation 4)

[0070] In another example, the soft weights may be computed using inverse weighting, where

$w_k(\underline{x}) = \dfrac{\sigma_k^2}{(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k)}$.  (Equation 5)
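A sketch implementing the hard, exponential and inverse weights of Equations 3-5; `gamma` and `omega` stand for the cross-validated parameters $\gamma$ and $\omega$, and their default values here are placeholders, not values from the disclosure:

```python
import numpy as np

def mahalanobis_sq(x: np.ndarray, c_k: np.ndarray, Minv: np.ndarray) -> float:
    """Quadratic form (x - c_k)^T M_k^{-1} (x - c_k)."""
    diff = x - c_k
    return float(diff @ Minv @ diff)

def hard_weight(x, c_k, Minv, sigma_k, gamma=2.0):    # Equation 3
    return 1.0 if mahalanobis_sq(x, c_k, Minv) <= gamma * sigma_k**2 else 0.0

def exp_weight(x, c_k, Minv, sigma_k, omega=1.0):     # Equation 4
    return float(np.exp(-mahalanobis_sq(x, c_k, Minv) / (sigma_k * omega**2)))

def inv_weight(x, c_k, Minv, sigma_k):                # Equation 5
    return sigma_k**2 / mahalanobis_sq(x, c_k, Minv)
```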

[0071] In one example, the pruned local descriptors are outputted at block 235.

[0072] FIG. 2B illustrates an example of a visual representation of a result of an exemplary pruning process. FIG. 2B illustrates five cells, 260-264 (cells $C_1$-$C_5$). Each cell has a corresponding codeword. Cell $C_1$ 260 corresponds with codeword $\underline{c}_1$ 265. Cell $C_2$ 261 corresponds with codeword $\underline{c}_2$ 266. Cell $C_3$ 262 corresponds with codeword $\underline{c}_3$ 267. Cell $C_4$ 263 corresponds with codeword $\underline{c}_4$ 268. Cell $C_5$ 264 corresponds with codeword $\underline{c}_5$ 269. Local descriptors $\underline{x}$ have been assigned to each cell; the local descriptors are shown as dots in FIG. 2B. For example, local descriptors 270 have been assigned to cell $C_1$ 260. Local descriptors 271 and 272 have been assigned to cell $C_2$ 261. Local descriptors 273 have been assigned to cell $C_3$ 262. Local descriptors 274 have been assigned to cell $C_4$ 263. Local descriptors 275 have been assigned to cell $C_5$ 264.

[0073] FIG. 2B also illustrates the outcome of pruning the local descriptors in cell $C_2$ 261. The local descriptors 272 within the ellipse have been found to be within the threshold distance and thus are not pruned. The local descriptors 271 outside the ellipse are outside the threshold distance and thus are pruned. In one example, the local descriptors 272 have been assigned weight $w(\underline{x})=1$ and the local descriptors 271 have been assigned weight $w(\underline{x})=0$.

[0074] The Encode Pruned Descriptors block 240 may receive the pruned local descriptors from the Prune Local Descriptors block 230. The Encode Pruned Descriptors block 240 may compute image feature vectors by encoding the pruned local descriptors received from the Prune Local Descriptors block 230. The Encode Pruned Descriptors block 240 may use an algorithm such as a Bag-of-Words (BOW), Fisher or VLAD algorithm, or any other algorithm based on a codebook obtained from any clustering algorithm such as K-means or from a GMM model. The Encode Pruned Descriptors block 240 may encode the pruned local descriptors in accordance with the process described in FIG. 1A.

[0075] In one example, the Encode Pruned Descriptors block 240 may utilize a bag-of-words (BOW) encoder. The BOW encoder may be based on a codebook $\{\underline{c}_k \in \mathbb{R}^d\}_{k=1}^L$ obtained by applying K-means to all the local descriptors $\mathcal{T} = \cup_t I_t$ of a set of training images. Letting $C_k$ denote the Voronoi cell $\{\underline{x} \mid \underline{x} \in \mathbb{R}^d,\ k = \arg\min_j \|\underline{x}-\underline{c}_j\|\}$ associated to codeword $\underline{c}_k$, the resulting feature vector for image I may be

$\underline{r}^B = [r_k^B]_k$, where  (Equation 6)

$r_k^B = \dfrac{\sum_{\underline{x} \in I \cap C_k} w_k(\underline{x})\,[\![\underline{x} \in C_k]\!]}{\sum_{k,\,\underline{x} \in I} w_k(\underline{x})}$,  (Equation 7)

where $[\![\cdot]\!]$ is the indicator function that evaluates to 1 if the condition is true and 0 otherwise, and where $[\underline{a}_k]_k$ (respectively, $[a_k]_k$) denotes concatenation of the vectors $\underline{a}_k$ (scalars $a_k$) to form a single column vector.
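A sketch of this weighted BOW aggregation (Equations 6-7), assuming hard assignments from the earlier assignment sketch and one pruning weight per descriptor:

```python
import numpy as np

def encode_bow(assignments: np.ndarray, weights: np.ndarray, L: int) -> np.ndarray:
    """Weighted BOW (Equations 6-7): per-cell sums of pruning weights,
    normalized by the total weight over the whole image."""
    r = np.zeros(L)
    np.add.at(r, assignments, weights)   # numerator: sum of w_k(x) per cell
    total = weights.sum()                # denominator: sum over all k and x
    return r / total if total > 0 else r

# Usage (hypothetical arrays):
# assignments = assign_to_cells(descriptors, codebook)
# weights = np.array([...])            # one w_k(x) per descriptor
# r_B = encode_bow(assignments, weights, L=codebook.shape[0])
```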

[0076] In another example, the Encode Pruned Descriptors block 240 may utilize a Fisher encoder that relies on a GMM model also trained on $\mathcal{T}$. Letting $\beta_i$, $\underline{c}_i$, $\underline{\Sigma}_i$ denote, respectively, the i-th GMM component's prior weight, mean vector, and covariance matrix (assumed diagonal), the first-order Fisher feature vector may be

$\underline{r}^F = [\underline{r}_k^F]_k$, where  (Equation 8)

$\underline{r}_k^F = \sum_{\underline{x} \in I} \dfrac{p(k \mid \underline{x})}{\beta_k}\, \underline{\Sigma}_k^{-1}\, w_k(\underline{x})\, (\underline{x}-\underline{c}_k)$.  (Equation 9)
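A sketch of this weighted first-order Fisher aggregation (Equations 8-9), assuming a scikit-learn GaussianMixture fit with diagonal covariances as in the earlier sketch; using a single pruning weight per descriptor is a simplification of the per-cell $w_k(\underline{x})$:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def encode_fisher(descriptors: np.ndarray, weights: np.ndarray,
                  gmm: GaussianMixture) -> np.ndarray:
    """Weighted first-order Fisher vector (Equations 8-9): one sub-vector
    per GMM component, stacked into a single column vector."""
    posteriors = gmm.predict_proba(descriptors)   # p(k | x), shape (N, K)
    K = gmm.means_.shape[0]
    sub_vectors = []
    for k in range(K):
        coeff = posteriors[:, k] / gmm.weights_[k]           # p(k|x) / beta_k
        # Diagonal Sigma_k^{-1} applied to the residuals (x - c_k).
        resid = (descriptors - gmm.means_[k]) / gmm.covariances_[k]
        sub_vectors.append((coeff * weights) @ resid)        # sum over x in I
    return np.concatenate(sub_vectors)
```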

[0077] In another example, the Encode Pruned Descriptors block 240 may use a hybrid combination of the BOW and Fisher techniques called VLAD. The VLAD encoder may offer a compromise between the Fisher encoder's performance and the BOW encoder's processing complexity. The VLAD encoder may, similarly to the state-of-the-art Fisher aggregator, encode residuals $\underline{x}-\underline{c}_k$, but may also hard-assign each local descriptor to a single cell $C_k$ instead of using a costly soft-max assignment as in Equation (9). The resulting VLAD encoding may be

$\underline{r}^V = [\underline{r}_k^V]_k$, where  (Equation 10)

$\underline{r}_k^V = \underline{\Phi}_k^T \sum_{\underline{x} \in I \cap C_k} w_k(\underline{x})\, \dfrac{\underline{x}-\underline{c}_k}{\|\underline{x}-\underline{c}_k\|} \in \mathbb{R}^d$,  (Equation 11)

where $\underline{\Phi}_k$ are orthogonal PCA rotation matrices obtained from the training descriptors $C_k \cap \mathcal{T}$ in the Voronoi cell. After computing the sub-vectors $r_k^B$, $\underline{r}_k^F$, or $\underline{r}_k^V$, these are stacked as in Equations (6), (8) and (10) to obtain a single large vector $\underline{r}^B$, $\underline{r}^F$ or $\underline{r}^V$ (we use $\underline{r}$ to denote any of these variants). Two normalization steps are applied as per the standard approach in the literature: a power-normalization step, where each entry $r_i$ of $\underline{r}$ is substituted by $\operatorname{sign}(r_i)\,|r_i|^a$ (common values of a are 0.2 or 0.5), and an $\ell_2$ normalization step, where every entry of the power-normalized vector is divided by the Euclidean norm of the power-normalized vector.
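A sketch of this weighted VLAD aggregation together with the two normalization steps (Equations 10-11 and the power/$\ell_2$ normalization); replacing the per-cell PCA rotations $\underline{\Phi}_k$ with the identity is a simplification of the disclosed method, made here for brevity:

```python
import numpy as np

def encode_vlad(descriptors, assignments, weights, codebook, a=0.5):
    """Weighted VLAD (Equations 10-11) with power and l2 normalization.
    Phi_k is taken as the identity here; in the full method it would be a
    per-cell PCA rotation learned from training descriptors."""
    L, d = codebook.shape
    r = np.zeros((L, d))
    for n, k in enumerate(assignments):
        resid = descriptors[n] - codebook[k]
        norm = np.linalg.norm(resid)
        if norm > 0:
            r[k] += weights[n] * resid / norm   # weighted unit residual
    r = r.ravel()                               # stack the sub-vectors r_k^V
    r = np.sign(r) * np.abs(r) ** a             # power normalization
    n2 = np.linalg.norm(r)
    return r / n2 if n2 > 0 else r              # l2 normalization
```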

[0078] The Search Encoded Images block 250 receives the feature vector(s) computed by the Encode Pruned Descriptors block 240. The Search Encoded Images block 250 may perform a search for one or more images by comparing the feature vector(s) received from the Encode Pruned Descriptors block 240 with the feature vectors of a search-images database. The Search Encoded Images block 250 may perform an image search in accordance with the processes described in FIGS. 1A, 1E, and 1F.

[0079] Exemplary Image Processing System

[0080] FIG. 3 is a block diagram illustrating an exemplary image processing system 300. The image processing system includes an image processing device 310 and a display 320. In one example, the device 310 and the display 320 may be connected by a physical link. In another example, the device 310 and the display 320 may communicate via a communication link, such as, for example, a wireless network, a wired network, a short-range communication network, or a combination of different communication networks.

[0081] The display 320 may allow the user to interact with image processing device 310, including, for example, inputting criteria for performing an image search. The display 320 may also display the output of an image search.

[0082] The image processing device 310 includes memory 330 and processor 340 that allow the performance of local descriptor pruning 350. The image processing device 310 further includes any other software or hardware necessary to perform local descriptor pruning 350.

[0083] The image processing device 310 executes the local descriptor pruning 350 processing. In one example, the image processing device 310 performs the local descriptor pruning 350 based on an initialization of an image search process by a user either locally or remotely. The local descriptor pruning 350 executes the pruning of local descriptors in accordance with the processes described in FIGS. 2, 2A and 2B.

[0084] In one example, the image processing device 310 may store all the information necessary to perform the local descriptor pruning 350. For example, the image processing device 310 may store and execute the algorithms and database information necessary to execute the local descriptor pruning 350 processing. Alternatively, the image processing device 310 may receive via a communication link one or more of the algorithms and database information to execute the local descriptor pruning 350 processing.

[0085] Each of the processing of extract local descriptors 360, encode pruned local descriptors 370, and perform image search 380 may be executed in whole or in part on image processing device 310. Alternatively, each of the extract local descriptors 360, encode pruned local descriptors 370, and perform image search 380 may be executed remotely and their respective results may be communicated to image processing device 310 via a communication link. In one example, the image processing device may receive an input image and execute extract local descriptors 360 and prune local descriptors 350. The results of prune local descriptors 350 may be transmitted via a communication link. The encode pruned local descriptors 370 and perform image search 380 may be executed remotely, and the results of perform image search 380 may be transmitted to image processing device 310 for display on display 320. The dashed boxes of extract local descriptors 360, encode pruned local descriptors 370, and perform image search 380 thus indicate that these processes may be executed on image processing device 310 or may be executed remotely. The extract local descriptors 360, encode pruned local descriptors 370, and perform image search 380 processes may be executed in accordance with the processes described in relation to FIGS. 1A-1F and FIG. 2.

[0086] FIG. 4 illustrates an example of various image processing devices 401-404 and a server 405. The image processing devices may be smartphones (e.g., device 401), tablets (e.g., device 402), laptops (e.g., device 403), or any other image processing device that includes software and hardware to execute the features of the present invention. The image processing devices 401-404 may be similar to the image processing device 310 and the image processing system 300 described in connection with FIG. 3. The local descriptor pruning processes described in accordance with FIGS. 2, 2A and 2B may be executed on any of the devices 401-404, on server 405, or in a combination of any of the devices 401-404 and server 405.

Exemplary Local Descriptor Pruning

[0087] In one example, image encoders operate on the local descriptors $\underline{x} \in \mathbb{R}^d$ extracted from each image. Images may be represented as a set $I = \{\underline{x}_i \in \mathbb{R}^d\}_i$ of local SIFT descriptors extracted densely or with a Hessian-Affine region detector.

[0088] In one example, local descriptors may be encoded using a BOW encoder. The BOW encoder may be based on a codebook $\{\underline{c}_k \in \mathbb{R}^d\}_{k=1}^L$ obtained by applying K-means to all the local descriptors $\mathcal{T} = \cup_t I_t$ of a set of training images. Letting $C_k$ denote the Voronoi cell $\{\underline{x} \mid \underline{x} \in \mathbb{R}^d,\ k = \arg\min_j \|\underline{x}-\underline{c}_j\|\}$ associated to codeword $\underline{c}_k$, the resulting feature vector for image I may be

$\underline{r}^B = [r_k^B]_k$, where  (Equation 12)

$r_k^B = \dfrac{\sum_{\underline{x} \in I \cap C_k} [\![\underline{x} \in C_k]\!]}{\# I}$,  (Equation 13)

where $[\![\cdot]\!]$ is the indicator function that evaluates to 1 if the condition is true and 0 otherwise.

[0089] In another example, local descriptors may be encoded using a Fisher encoder. The Fisher encoder relies on a GMM model also trained on $\mathcal{T}$. Letting $\beta_i$, $\underline{c}_i$, $\underline{\Sigma}_i$ denote, respectively, the i-th GMM component's prior weight, mean vector, and covariance matrix (assumed diagonal), the first-order Fisher feature vector may be

$\underline{r}^F = [\underline{r}_k^F]_k$, where  (Equation 14)

$\underline{r}_k^F = \sum_{\underline{x} \in I} \dfrac{p(k \mid \underline{x})}{\beta_k}\, \underline{\Sigma}_k^{-1}\, (\underline{x}-\underline{c}_k)$.  (Equation 15)

[0090] In another example, local descriptors may be encoded using a hybrid combination of the BOW and Fisher techniques called VLAD, which may offer a compromise between the Fisher encoder's performance and the BOW encoder's processing complexity. This hybrid encoder, similarly to the state-of-the-art Fisher aggregator, may encode residuals $\underline{x}-\underline{c}_k$, but may also hard-assign each local descriptor to a single cell $C_k$ instead of using a costly soft-max assignment as in Equation 15. The resulting VLAD encoding may be

$\underline{r}^V = [\underline{r}_k^V]_k$, where  (Equation 16)

$\underline{r}_k^V = \underline{\Phi}_k^T \sum_{\underline{x} \in I \cap C_k} \dfrac{\underline{x}-\underline{c}_k}{\|\underline{x}-\underline{c}_k\|} \in \mathbb{R}^d$,  (Equation 17)

where $\underline{\Phi}_k$ are orthogonal PCA rotation matrices obtained from the training descriptors $C_k \cap \mathcal{T}$ in the Voronoi cell.

[0091] In one example, the following power-normalization and $\ell_2$ normalization post-processing stages may be applied to any of the feature vectors $\underline{r}$ in Equations (12), (14) and (16):

$\underline{p} = [h_a(r_j)]_j$,  (Equation 18)

$\underline{n} = \underline{g}(\underline{p})$.  (Equation 19)

Here the scalar function $h_a(x)$ and the vector function $\underline{g}(\underline{v})$ carry out power normalization and $\ell_2$ normalization, respectively:

$h_a(x) = \operatorname{sign}(x)\,|x|^a$  (Equation 20)

$\underline{g}(\underline{x}) = \dfrac{\underline{x}}{\|\underline{x}\|_2}$  (Equation 21)

Exemplary Local-Descriptor Pruning Method

[0092] In one example, the present invention employs a local-descriptor pruning method applicable to all three feature encoding methods described above (BOW, VLAD and Fisher), and in general to feature encoding methods based on stacking sub-vectors $\underline{r}_k$, where each sub-vector is related to a cell $C_k$ or mixture component $(\beta_k, \underline{c}_k, \underline{\Sigma}_k)$ (these can be thought of as soft cells).

[0093] Unlike the case for low-dimensional sub-spaces, the cells $C_k$ in high-dimensional local-descriptor spaces are almost always unbounded, meaning that they have infinite volume. Yet only a part of this volume is informative visually.

[0094] In one example, the visually informative content is isolated by removing the local descriptors that are too far away from the cell center $\underline{c}_k$ when constructing the sub-vectors $\underline{r}_k$ in Equations (13), (15) and (17). In one example, the pruning is performed by restricting the summations in Equations (13), (15) and (17) only to those vectors $\underline{x}$ that are in the cell $C_k$ and satisfy the following distance-to-$\underline{c}_k$ condition:

$(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k) \leq \gamma\sigma_k^2$.  (Equation 22)

The value of $\gamma$ is determined experimentally by cross-validation and the parameter $\sigma_k$ is computed from a training set of local descriptors $\mathcal{T}$ as follows:

$\sigma_k = \operatorname{mean}_{\underline{x} \in \mathcal{T} \cap C_k}\left((\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k)\right)$  (Equation 23)

[0095] The matrix $\underline{M}_k$ can be either

[0096] Anisotropic $\underline{M}_k$: the empirical covariance matrix computed from $\mathcal{T} \cap C_k$;

[0097] Axis-aligned $\underline{M}_k$: the same as the anisotropic $\underline{M}_k$, but with all elements outside the diagonal set to zero; or

[0098] Isotropic $\underline{M}_k$: a diagonal matrix $\sigma_k^2 \underline{I}$ with $\sigma_k^2$ equal to the mean diagonal value of the axis-aligned $\underline{M}_k$.

[0099] While the anisotropic variant offers the most geometrical modelling flexibility, it also drastically increases the computational cost. The isotropic variant, on the other hand, enjoys practically null computational overhead, but also has the least modelling flexibility. The axis-aligned variant offers a compromise between the two approaches.

Hard Weights

[0100] In one example, the pruning carried out by Equation 22 can be implemented by means of 1/0 weights:

$w_k(\underline{x}) = [\![(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k) \leq \gamma\sigma_k^2]\!]$  (Equation 24)

[0101] The weights $w_k(\underline{x})$ can be applied to the summation terms in Equations (13), (15) and (17). For example, for Equation 13 the weights would be used as follows:

$r_k^B = \dfrac{\sum_{\underline{x} \in I \cap C_k} w_k(\underline{x})\,[\![\underline{x} \in C_k]\!]}{\sum_{k,\,\underline{x} \in I} w_k(\underline{x})}$.  (Equation 25)

Soft Weights

[0102] In another example, the pruning carried out by Equation 22 can be implemented using soft weights.

[0103] In another example, the soft weights may be computed using exponential weighting, where

$w_k(\underline{x}) = \exp\left(-\dfrac{(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k)}{\sigma_k\,\omega^2}\right)$.  (Equation 26)

[0104] In another example, the soft weights may be computed using inverse weighting, where

$w_k(\underline{x}) = \dfrac{\sigma_k^2}{(\underline{x}-\underline{c}_k)^T \underline{M}_k^{-1} (\underline{x}-\underline{c}_k)}$.  (Equation 27)

Experiments

[0105] FIG. 5 illustrates an exemplary plot of the percentage of Hessian-Affine SIFT descriptors that are pruned versus threshold values. The X axis pertains to the square root of the threshold $\gamma\sigma_k^2$ in Equation 24. The Y axis pertains to the percentage of local descriptors from the training set of local descriptors for which the condition in Equation 24 is true.

[0106] FIG. 6 illustrates an example of a plot that may be used to select the threshold parameter $\gamma$ in Equation 24 when doing hard pruning. The X axis pertains to the square root of the threshold $\gamma\sigma_k^2$ on the right-hand side of Equation 24. The Y axis pertains to image retrieval performance measured as mean Average Precision (mAP) and computed over the Holidays image dataset.

[0107] FIG. 7 illustrates an example of a plot that could be used to select the parameter $\omega$ required for soft pruning in Equation 26. The X axis pertains to the value of $\omega$ and the Y axis pertains to the image retrieval performance measure mAP computed over the Holidays image dataset.

[0108] The experiments underlying FIGS. 5-7 are carried out using the VLAD feature encoder with soft or hard pruning. The experiments underlying FIGS. 5-7 utilize SIFT descriptors extracted from local regions computed with the Hessian-affine detector or from a dense-grid detector. The RootSIFT variant of SIFT is utilized when using the Hessian affine detector.

[0109] The experiments underlying FIGS. 5-7 utilize as training set the Flickr60K dataset, which is composed of 60,000 images extracted randomly from Flickr. This dataset is used to learn the codebook, rotation matrices, per-cluster pruning thresholds and covariance matrices for the computation of the Mahalanobis metrics.

[0110] The experiments underlying FIGS. 5-7 utilize for testing the INRIA Holidays dataset, which contains 1,491 high-resolution personal photos of 500 locations or objects, where common locations/objects define matching images. The search quality in all the experiments is measured using mAP (mean Average Precision). All the experiments underlying FIGS. 5-7 have been carried out using a codebook of size 64.

[0111] Table 1 provides a summary of results for all variants, where each variant is specified by a choice of weight type (hard, exponential or inverse), metric type (isotropic, anisotropic or axis-aligned), and local detector (dense or Hessian-Affine).

TABLE 1

                                          mAP (%)
Descriptors      baseline   Weights        Iso     Aniso   Ax-align
Hessian Affine   65.60      Hard           66.29   66.29   66.40
                            Inverse        66.40   66.39   66.55
                            Exponential    66.45   66.40   67.02
Dense            72.71      Hard           73.34   73.37   73.56
                            Inverse        73.45   73.45   73.60
                            Exponential    73.69   73.61   74.28

[0112] The best result overall is obtained using axis-aligned exponential weighting (74.28% and 67.02% for dense and Hessian-Affine detections, respectively). Nonetheless, hard pruning also yields improvements relative to the baseline, and it is less computationally demanding than soft pruning. The best mAP for hard pruning is obtained using the axis-aligned approach for both the Hessian-Affine and dense detectors (66.40% and 73.56%, respectively). As illustrated in FIG. 7, keeping the parameter $\omega$ equal to 1.0 provides good results.

[0113] Numerous specific details have been set forth herein to provide a thorough understanding of the present invention. It will be understood by those skilled in the art, however, that the examples above may be practiced without these specific details. In other instances, well-known operations, components and circuits have not been described in detail so as not to obscure the present invention. It can be appreciated that the specific structural and functional details disclosed herein may be representative and do not necessarily limit the scope of the present invention.

[0114] Various examples of the present invention may be implemented using hardware elements, software elements, or a combination of both. Some examples may be implemented, for example, using a computer-readable medium or article which may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the embodiments. Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, or the like, and may be implemented using any suitable combination of hardware and/or software. The computer-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, and the like, implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.