Parallel memories for multidimensional data access
09760770 · 2017-09-12
Assignee
Inventors
- Kenneth Hiroshi Eguro (Seattle, WA)
- Ray A. Bittner, Jr. (Bothell, WA, US)
- George E. Smith (Bellevue, WA, US)
- Shawn Michael Swilley (Kent, WA, US)
- Rehan Ahmed (Kelowna, CA)
Cpc classification
G06F3/0659
PHYSICS
G06F12/00
PHYSICS
B29C64/386
PERFORMING OPERATIONS; TRANSPORTING
B29C64/00
PERFORMING OPERATIONS; TRANSPORTING
A63F13/213
HUMAN NECESSITIES
G02B27/4233
PHYSICS
G06F11/3024
PHYSICS
G01B11/2545
PHYSICS
H04N13/239
ELECTRICITY
G01B11/2513
PHYSICS
G01B11/25
PHYSICS
H04N13/25
ELECTRICITY
G02B27/4205
PHYSICS
H04N23/11
ELECTRICITY
H04N13/254
ELECTRICITY
H04N2013/0081
ELECTRICITY
International classification
G01B11/25
PHYSICS
H04N17/00
ELECTRICITY
B29C67/00
PERFORMING OPERATIONS; TRANSPORTING
G02B27/42
PHYSICS
H04N13/00
ELECTRICITY
G06F12/00
PHYSICS
G06F9/30
PHYSICS
Abstract
The subject disclosure is directed towards loading parallel memories (e.g., in one or more FPGAs) with multidimensional data in an interleaved manner such that a multidimensional patch/window may be filled with corresponding data in a single parallel read of the memories. Depending on the position of the patch, the data may be rotated horizontally and/or vertically, for example, so that the data in each patch is consistently arranged in the patch regardless of from which memory each piece of data was read. Also described is leveraging dual ported memory for multiple line reads and/or loading one part of a buffer while reading from another.
Claims
1. A method comprising: receiving a request to process multidimensional data of an image; based on the request, identifying a window to apply to the image, the window comprising a first dimension having a first length and a second dimension having a second length; determining a subset of a plurality of memories to distribute the multidimensional data of the image among, the determining comprising multiplying the first length of the first dimension of the window by the second length of the second dimension of the window, wherein a number of memories in the subset of the plurality of memories is equal to a product of multiplying the first length by the second length; loading, in an interleaved manner, the multidimensional data into the subset of the plurality of memories; upon loading the multidimensional data, executing a parallel read of the subset of the plurality of memories; filling a data window with the multidimensional data based on the parallel read of the subset of the plurality of memories; and processing the multidimensional data in the data window.
2. The method of claim 1 wherein loading the multidimensional data among the subset of the plurality of memories comprises loading a part of the multidimensional data into a buffer comprising the subset of the plurality of memories.
3. The method of claim 1 wherein loading the multidimensional data comprises writing data corresponding to a first dimension of the multidimensional data to alternate memories in the subset of the plurality of memories corresponding to the first dimension of the window.
4. The method of claim 1 wherein loading the multidimensional data comprises writing data corresponding to a first dimension of the multidimensional data to alternate memories in the subset of the plurality of memories corresponding to the second dimension of the window.
5. The method of claim 1 further comprising accessing a memory of the subset of the plurality of memories based upon a section in the subset of the plurality of memories corresponding to a position of the multidimensional data in the window.
6. The method of claim 1 wherein the subset of the plurality of memories comprise dual ported memories, and wherein reading the subset of the plurality of memories comprises reading data from two addresses in a single cycle.
7. The method of claim 1 wherein the subset of the plurality of memories comprise dual ported memories, and wherein the method further comprises writing to a memory address of a memory of the subset of the plurality of memories while reading from a different or same memory address of the memory.
8. The method of claim 1 further comprising: to ensure that outputs of the subset of memories are returned in a consistent manner, upon determining that a particular pixel in the data window is not in a predefined location, rotating pixels in the data window one or more times in a horizontal rotation until the particular pixel in the data window is in the predefined location; and returning the data window to an original orientation after any rotating of the pixels.
9. The method of claim 1 further comprising: to ensure that outputs of the subset of memories are returned in a consistent manner, upon determining that a particular pixel in the data window is not in a predefined location, rotating pixels in the data window one or more times in a vertical rotation until the particular pixel in the data window is in the predefined location; and returning the data window to an original orientation after any rotating of the pixels.
10. The method of claim 1 further comprising: to ensure that outputs of the subset of memories are returned in a consistent manner, upon determining that a particular pixel in the data window is not in a predefined location, rotating pixels in the data window one or more times in a horizontal rotation and at least once in a vertical rotation until the particular pixel in the data window is in the predefined location; and returning the data window to an original orientation after any rotating of the pixels.
11. A system comprising: a plurality of memories; and a processor programmed to: receive a request to process multidimensional data of an image; based on the request, identify a window to apply to the image, the window comprising a first dimension having a first length and a second dimension having a second length; determine a number of the plurality of memories to distribute the multidimensional data among, the determining comprising multiplying a first length of the first dimension of the window by the second length of the second dimension of the window, wherein the number of the memories is equal to a product of multiplying the first length by the second length; load the number of memories with the multidimensional data of the image data from a multidimensional array in an interleaved manner; executing a parallel read of the loaded number of memories; filling a data window with the multidimensional data based on the parallel read of the number of memories; and processing the multidimensional data in the data window.
12. The system of claim 11 wherein the number of memories are one of the following: contained in a single field programmable gate array, or distributed among a plurality of field programmable gate arrays.
13. The system of claim 11 further comprising a fetching process configured to fill a particular window at a given position relative to the multidimensional array with image data read from the number of memories in parallel.
14. The system of claim 13 further comprising an array processing component configured to receive the particular window filled by the fetching process.
15. The system of claim 13 wherein the fetching process is configured to ensure that outputs of the number of memories are returned in a consistent manner by: determining that a particular pixel in the particular window is not in a predefined location; and rotating pixels read from the number of memories along the dimensions of the particular window until the particular pixel is in the predefined location when providing the outputs.
16. The system of claim 15 further comprising an array processing component configured to receive the data after rotation by the fetching process.
17. The system of claim 11 wherein the processor is further programmed to load a part of the multidimensional array data into a buffer comprising the number of memories.
18. The system of claim 11 wherein the multidimensional array data comprises at least one set of image data.
19. One or more computer-readable storage media having executable instructions, which perform operations comprising: receiving a request to process multidimensional data of an image; based on the request, identifying a window to apply to the image, the window comprising a first dimension having a first length and a second dimension having a second length; determining addresses in a plurality of memories to store the multidimensional data based upon a position of the window from which the multidimensional data is obtained, the window comprising image data corresponding to pixels in the image; loading, in an interleaved manner, the multidimensional data into the plurality of memories; filling a data window with the multidimensional data corresponding to the data window by executing a single parallel read of the plurality of memories; and processing the multidimensional data corresponding to the filled data window.
20. The one or more computer-readable storage media of claim 19 having further executable instructions comprising: determining that a particular pixel in the data window is not in a predefined location; and rotating the multidimensional data in the data window one or more times in a horizontal rotation and at least once in a vertical rotation until the particular pixel in the data window is in the predefined location when processing the multidimensional data; and returning the data window to an original orientation after rotating of the data.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
DETAILED DESCRIPTION
(13) Various aspects of the technology described herein are generally directed towards dividing data to be processed among separate memories (comprising a patch cache), each memory holding a different, but interleaved, portion of the data. The interleaving is based upon the data to be processed (such as image data or other real-world sampled data) being physically or time adjacent, e.g., pixels in an image are adjacent other ones.
(14) The division and interleaving (round-robin) are based upon the dimensions of the patch (e.g., window size in two-dimensional data processing). When a patch is needed for processing, the data are arranged such that each access into the patch cache needs to get one and only one value from each memory within the cache. This provides fast single-cycle access and a large degree of aggregate parallel bandwidth.
(15) In general, the technology described herein provides a memory architecture that capitalizes on the natural physical spatial locality of the image or other real-world data to maintain high performance without duplication. This allows extremely high performance with little resource overhead.
(16) It should be understood that any of the examples herein are non-limiting. For instance, benefits are readily apparent in hardware/FPGA/ASIC scenarios; however, the technology may be used in other scenarios. Further, two-dimensional image data are used in some of the examples to help convey the concepts in a way that is relatively easy to understand; however, image data is only one type of data, and other types of data, including data in more than two dimensions, may benefit from the technology described herein. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in data processing and/or connected components in general.
(17)
(18) To process the data, the array processing component 104 is coupled to a data fetching process 112 that reads a patch of data in parallel from the patch cache 110/independent memories 108(1)-108(4). The array processing component 104 processes each patch, and uses the processing on one or more patches to ultimately provide results 114. Note that the array processing component also may be in hardware, e.g., the patch cache.
(19) To produce arbitrarily located P-sized patches from the buffered data, the data is divided among separate memories, each holding a different, but interleaved, portion of the data. The total memory is thus divided among the separate memories; e.g., if a single, serially accessed memory held the data in D space, each of the divided, parallel memories holds D/P of the data.
(20) The dimensions of the data may be enumerated as N_1, N_2, N_3 and so forth, up to the number of data dimensions. Each of the dimensions of the patch may be enumerated as P_1, P_2, P_3 and so forth. Internally, the patch cache 110 is organized as an array of independent memories. The number of independent memories (M) is the product of the lengths of the dimensions in the patch.
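By way of non-limiting illustration, the memory count described above may be sketched in software as follows. The function names `memory_count` and `words_per_memory` are hypothetical and do not appear in the disclosure, and the sketch assumes that each data dimension is an exact multiple of the corresponding patch dimension so that the data divides evenly among the memories.

```python
from math import prod


def memory_count(patch_dims):
    """Number of independent memories M: the product of the patch dimension lengths."""
    return prod(patch_dims)


def words_per_memory(data_dims, patch_dims):
    """Values held by each memory, assuming the data divides evenly (an assumption)."""
    for n, p in zip(data_dims, patch_dims):
        if n % p != 0:
            raise ValueError("this sketch assumes each data dimension is a multiple "
                             "of the corresponding patch dimension")
    return prod(data_dims) // memory_count(patch_dims)


# A 2x2 patch needs four memories; a 4x4 data array then places four values in each.
print(memory_count((2, 2)), words_per_memory((4, 4), (2, 2)))
```

For a 3x3 patch the same computation gives nine memories, which is consistent with the statement that the memory count is the product of the patch dimension lengths.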
(21) Turning to an example,
(22) In
(23) Data may be written into the cache in raster order, organized by some dimension from N_1 to N_2 to N_3 and so forth. These writes are generally low width (e.g., in
(24) The cache 110 accepts the data and writes it in a round-robin style by dimension. For example, if data arrives in raster-order, first along dimension N_1, then by dimension N_2, and so forth, the data for each dimension is written into each of the dimensions in the patch array in turn. This proceeds in a round-robin manner among the first dimension of the patch array across the entire length of the first data dimension, wrapping on the first dimension of the patch array. Subsequent data along progressively higher dimensions of the data are distributed round-robin across progressively higher dimensions of the patch array, again wrapping each dimension of the patch array. If the dimension order of the data is higher than the dimension order of the patch, the round-robin ordering restarts at the first dimension of the patch. This distribution (e.g., for three dimensions) may be represented as:
column = column address mod P_1
row = row address mod P_2
depth = depth address mod P_3
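The modulo distribution described above may be sketched as follows, as a minimal illustration only. The names `memory_coords` and `memory_index` are hypothetical, and the numbering of memories (first patch dimension varying fastest) is an assumption made for the sketch rather than something the description specifies. Data dimensions beyond those of the patch are ignored when selecting a memory, consistent with the round-robin ordering restarting at the first patch dimension.

```python
def memory_coords(address, patch_dims):
    """Position, within the patch array, of the memory that stores a data point.

    address:    data coordinates, lowest dimension first (column, row, depth, ...)
    patch_dims: patch dimension lengths, lowest dimension first
    """
    # zip() stops at the shorter sequence, so extra data dimensions are ignored.
    return tuple(a % p for a, p in zip(address, patch_dims))


def memory_index(address, patch_dims):
    """Flatten memory_coords() to a single index in 0..M-1 (assumed numbering)."""
    index, stride = 0, 1
    for a, p in zip(address, patch_dims):
        index += (a % p) * stride
        stride *= p
    return index


# Column 3, row 1 with a 2x2 patch maps to patch-array position (1, 1), index 3.
print(memory_coords((3, 1), (2, 2)), memory_index((3, 1), (2, 2)))
```

Because each coordinate is reduced independently, any window of the patch size touches every memory exactly once, which is what permits a single parallel read.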
(25) For example, in
(26) Because in this example the length of the N_1 dimension of the data is four, and the P_1 dimension of the patch is two, the first two data points A and B enter the cache and are placed into the MemW and MemX memories, respectively. When the third data point C enters the cache, the length of the P_1 dimension has been exhausted, but the N_1 dimension has not. Thus, the P_1 dimension will wrap around and the third and fourth data points C and D are placed in the MemW and MemX memories, respectively.
(27) At this point, the N_1 dimension has been exhausted, so the fifth, sixth, seventh, and eighth data points (E-H) enter the MemY, MemZ, MemY, MemZ memories in a similar fashion.
(28) At this time, both the N_1 dimension and P_2 dimension have been exhausted, but the N_2 dimension has not, whereby the ninth, tenth, eleventh and twelfth data points (I-L) wrap around along the P_2 dimension back to the MemW and MemX memories. This continues to the end of the N_1×N_2 data, e.g., M-P are written to the MemY and MemZ memories in the example of
(29) Note that if a 2×2 patch is maintained but the data has a third dimension (e.g., the input data is a 2D image over time), and the desire is to get any 2×2 patch of the image from any time slice across the N_3 dimension (e.g., the dimension of the patch array is smaller than the dimension of the data), the second, third and so forth time slices across the N_3 dimension wrap back to the first P_1 dimension in the same MemW, MemX, MemW, MemX, MemY, MemZ, MemY, MemZ, MemW, MemX . . . arrangement.
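The placement sequence of the sixteen data points A through P in the foregoing example may be reproduced with the following sketch. The memory names MemW through MemZ are taken from the example; the function name `placement` and the use of letters as data values are illustrative assumptions.

```python
from string import ascii_uppercase

MEMORY_NAMES = ("MemW", "MemX", "MemY", "MemZ")  # names used in the example


def placement(width, height, patch_w, patch_h):
    """Return (label, memory name) pairs for data arriving in raster order."""
    result = []
    for row in range(height):
        for col in range(width):
            label = ascii_uppercase[row * width + col]
            # Columns alternate within a pair of memories; rows alternate between pairs.
            mem = (row % patch_h) * patch_w + (col % patch_w)
            result.append((label, MEMORY_NAMES[mem]))
    return result


for label, mem in placement(4, 4, 2, 2):
    print(label, "->", mem)
```

Since a time-slice index would not enter the memory selection in this sketch, each further slice would repeat the same MemW, MemX, MemW, MemX, MemY, MemZ, MemY, MemZ sequence.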
(30) When the data is distributed and needs to be read back from the memory as a P_1×P_2×P_3 . . . patch, the output from each memory can be re-arranged along each dimension to a consistent orientation. For example, as shown in
(31) For some usage scenarios, the order is irrelevant, e.g., if the array processing component 104 is simply summing the returned values. However, other applications expect the data to be returned in a consistent manner, e.g., top left, top right, and lower left, lower right.
(32) Assuming that the outputs from the memories remain static, these values may need to be rotated along each axis (dimension P_1 and then subsequently by dimension P_2) to ensure that the top left pixel of each resulting patch remains in a consistent location, and so on.
(33) As can be seen, the data 330 from the patch 332 in
(34) The data 550 in patch 552 in
(35) Rotation may be efficiently accomplished by a series of shift registers. The rotations (e.g., in two dimensions) for any patch (window) dimensions are determined according to:
X_Rot = X % A_wX
Y_Rot = Y % A_wY
where % indicates modulo, A_wX and A_wY define the access window (that is, the patch dimensions), and X and Y are the starting coordinates of the patch.
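As a hedged illustration of the rotation computation, the following sketch models the shift registers as list rotations. The names `rotation_amounts` and `rotate_window` are hypothetical, and the direction of rotation (left for X, up for Y) is an assumption that follows from arranging the raw memory outputs as a grid indexed by patch-array row and column.

```python
def rotation_amounts(x, y, window_w, window_h):
    """Rotation counts from the window's starting coordinates and its dimensions."""
    return x % window_w, y % window_h


def rotate_window(raw, x_rot, y_rot):
    """Rotate raw memory outputs so the window's top-left value comes first.

    raw[r][c] is the value read from the memory at patch-array row r, column c.
    """
    rows = raw[y_rot:] + raw[:y_rot]                    # vertical rotation
    return [row[x_rot:] + row[:x_rot] for row in rows]  # horizontal rotation


# For a 4x4 array lettered A-P in raster order and a 2x2 window starting at
# X=1, Y=1, the four memories return K, J, G, F in memory order.
x_rot, y_rot = rotation_amounts(1, 1, 2, 2)
print(rotate_window([["K", "J"], ["G", "F"]], x_rot, y_rot))
```

When both starting coordinates are multiples of the window dimensions, both rotation counts are zero and the memory outputs are already in the expected order.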
(36) An entire array (e.g., a full set of image data) need not be put into the cache at the same time. For example, as generally shown in
(37) In some implementations having dual ported memory, one part of the memories may be written while reading from another part. Thus, as a line is freed, it may be written while the next line is being processed. In a non-dual ported memory scenario, the reading needs to pause when new writes are needed. Also, as described below, with dual ported memory there may be times when both ports are being used for reads; if this is not the case, reads and writes can occur on the same cycle. However, this opportunity may not occur, or the writes may fall behind the reads, whereby some pausing of the reads needs to occur.
(38)
(39)
(40)
(41) However, as represented by the dashed lines in
(42)
(43) Step 906 selects the first dimension of data, e.g., the X-dimension starting at coordinate zero. Step 908 selects the memories based upon the X-dimension, such as the first two of four memories for a 2×2 patch, the first three for a 3×3 patch, and so on.
(44) Step 910 represents the interleaving of the data along the X-axis among the selected memories, e.g., alternating between them. Note that the data wraps in the selected memories as needed, as described above. This continues until the first dimension is exhausted, that is, the entire line is placed in the selected memories.
(45) When the first dimension is exhausted, step 914 evaluates whether the second dimension is exhausted, e.g., the last row has been placed into the memories. If not, at step 916 the first dimension is reset (e.g., the X-coordinate returns to zero) and the next dimension incremented, e.g., the Y-coordinate is moved to the next line.
(46) Step 908 selects the next memories, e.g., not the ones used previously. For example, with a 2×2 patch, every other row is placed into a different pair of the memories; for a 3×3 patch, every third row into a different set of three memories, and so on. In this way, every value in a window is in a different memory.
(47) The process continues alternating among memories along the columns until the first dimension (row) is exhausted, and alternating among memories along the rows until all rows are exhausted. At this time, the memory is ready for reading. Note that as described above, if a sliding window scenario is in use, reading may begin as soon as enough lines to fill a patch with data have been written. If the window is allowed to be positioned anywhere in the buffer at any time, then the buffer needs to be filled.
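The loading steps described above may be sketched in software as follows, purely as an illustration. The function name `load_interleaved` is hypothetical, and the address used within each memory (a raster order over the reduced coordinates) is an assumption made for the sketch. In hardware the writes would arrive as a stream; here the image is simply iterated in raster order.

```python
def load_interleaved(image, patch_w, patch_h):
    """Distribute a 2D image (list of rows) among patch_h x patch_w memories.

    Returns memories[r][c], each a flat list. The in-memory address layout
    is an assumption of this sketch.
    """
    height, width = len(image), len(image[0])
    cols_per_mem = -(-width // patch_w)    # ceiling division
    rows_per_mem = -(-height // patch_h)
    memories = [[[None] * (cols_per_mem * rows_per_mem) for _ in range(patch_w)]
                for _ in range(patch_h)]
    for y, line in enumerate(image):       # advance the second dimension (step 916)
        selected = memories[y % patch_h]   # select the memories for this line (step 908)
        for x, value in enumerate(line):   # interleave along the X-axis (step 910)
            address = (y // patch_h) * cols_per_mem + (x // patch_w)
            selected[x % patch_w][address] = value
    return memories


image = [list("ABCD"), list("EFGH"), list("IJKL"), list("MNOP")]
for bank_row in load_interleaved(image, 2, 2):
    print(bank_row)
```

With a 2x2 patch, even rows go to the first pair of memories and odd rows to the second pair, so that every value in any 2x2 window resides in a different memory.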
(48)
(49) Step 1004 represents computing the address in each memory for the data points in the access window, e.g., using the address computations described above. Note that rather than the full computation, in a sliding window scenario the previous computation may be used to determine the next location in each memory because the window position and underlying memory change regularly.
(50) Step 1006 reads the memories at their respective addresses, in parallel, into a set of shift registers or the like. As described above, step 1008 performs any needed X rotation, and step 1010 any needed Y rotation. At this time, the window is output, filled with the correct data in the correct order.
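The fetch steps described above may be sketched end to end as follows. This is a software model under stated assumptions, not the disclosed hardware: the names `load` and `fetch_window` are hypothetical, the in-memory address layout is assumed, and the loop over memories stands in for what would be a single parallel read.

```python
def load(image, pw, ph):
    """Interleave a 2D image among ph x pw memories (assumed address layout)."""
    height, width = len(image), len(image[0])
    cols = -(-width // pw)   # ceiling division
    rows = -(-height // ph)
    mems = [[[None] * (cols * rows) for _ in range(pw)] for _ in range(ph)]
    for y in range(height):
        for x in range(width):
            mems[y % ph][x % pw][(y // ph) * cols + x // pw] = image[y][x]
    return mems, cols


def fetch_window(mems, cols, x0, y0, pw, ph):
    """Fill a pw x ph window whose top-left corner is at (x0, y0)."""
    raw = []
    for r in range(ph):
        y = y0 + (r - y0) % ph            # the one window row held by memory row r
        out_row = []
        for c in range(pw):
            x = x0 + (c - x0) % pw        # the one window column held by memory column c
            address = (y // ph) * cols + x // pw      # step 1004
            out_row.append(mems[r][c][address])       # step 1006 (parallel in hardware)
        raw.append(out_row)
    x_rot, y_rot = x0 % pw, y0 % ph
    raw = [row[x_rot:] + row[:x_rot] for row in raw]  # step 1008: X rotation
    return raw[y_rot:] + raw[:y_rot]                  # step 1010: Y rotation


image = [list("ABCD"), list("EFGH"), list("IJKL"), list("MNOP")]
mems, cols = load(image, 2, 2)
print(fetch_window(mems, cols, 1, 1, 2, 2))
```

Each memory contributes exactly one value per window, and the two rotations restore the top-left, top-right, lower-left, lower-right ordering regardless of where the window is positioned.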
(51) Example Operating Environment
(52)
(53) It can be readily appreciated that the above-described implementation and its alternatives may be implemented on any suitable computing device, including a gaming system, personal computer, tablet, DVR, set-top box, smartphone and/or the like. Combinations of such devices are also feasible when multiple such devices are linked together. For purposes of description, a gaming (including media) system is described as one exemplary operating environment hereinafter.
(54)
(55) The CPU 1102, the memory controller 1103, and various memory devices are interconnected via one or more buses (not shown). The details of the bus that is used in this implementation are not particularly relevant to understanding the subject matter of interest being discussed herein. However, it will be understood that such a bus may include one or more of serial and parallel buses, a memory bus, a peripheral bus, and a processor or local bus, using any of a variety of bus architectures. By way of example, such architectures can include an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus also known as a Mezzanine bus.
(56) In one implementation, the CPU 1102, the memory controller 1103, the ROM 1104, and the RAM 1106 are integrated onto a common module 1114. In this implementation, the ROM 1104 is configured as a flash ROM that is connected to the memory controller 1103 via a Peripheral Component Interconnect (PCI) bus or the like and a ROM bus or the like (neither of which are shown). The RAM 1106 may be configured as multiple Double Data Rate Synchronous Dynamic RAM (DDR SDRAM) modules that are independently controlled by the memory controller 1103 via separate buses (not shown). The hard disk drive 1108 and the portable media drive 1109 are shown connected to the memory controller 1103 via the PCI bus and an AT Attachment (ATA) bus 1116. However, in other implementations, dedicated data bus structures of different types can also be applied in the alternative.
(57) A three-dimensional graphics processing unit 1120 and a video encoder 1122 form a video processing pipeline for high speed and high resolution (e.g., High Definition) graphics processing. Data are carried from the graphics processing unit 1120 to the video encoder 1122 via a digital video bus (not shown). An audio processing unit 1124 and an audio codec (coder/decoder) 1126 form a corresponding audio processing pipeline for multi-channel audio processing of various digital audio formats. Audio data are carried between the audio processing unit 1124 and the audio codec 1126 via a communication link (not shown). The video and audio processing pipelines output data to an A/V (audio/video) port 1128 for transmission to a television or other display/speakers. In the illustrated implementation, the video and audio processing components 1120, 1122, 1124, 1126 and 1128 are mounted on the module 1114.
(58)
(59) In the example implementation depicted in
(60) Memory units (MUs) 1150(1) and 1150(2) are illustrated as being connectable to MU ports A 1152(1) and B 1152(2), respectively. Each MU 1150 offers additional storage on which games, game parameters, and other data may be stored. In some implementations, the other data can include one or more of a digital game component, an executable gaming application, an instruction set for expanding a gaming application, and a media file. When inserted into the console 1101, each MU 1150 can be accessed by the memory controller 1103.
(61) A system power supply module 1154 provides power to the components of the gaming system 1100. A fan 1156 cools the circuitry within the console 1101.
(62) An application 1160 comprising machine instructions is typically stored on the hard disk drive 1108. When the console 1101 is powered on, various portions of the application 1160 are loaded into the RAM 1106, and/or the caches 1110 and 1112, for execution on the CPU 1102. In general, the application 1160 can include one or more program modules for performing various display functions, such as controlling dialog screens for presentation on a display (e.g., high definition monitor), controlling transactions based on user inputs and controlling data transmission and reception between the console 1101 and externally connected devices.
(63) The gaming system 1100 may be operated as a standalone system by connecting the system to a high definition monitor, a television, a video projector, or other display device. In this standalone mode, the gaming system 1100 enables one or more players to play games, or enjoy digital media, e.g., by watching movies, or listening to music. However, with the integration of broadband connectivity made available through the network interface 1132, the gaming system 1100 may further be operated as a participating component in a larger network gaming community or system.
CONCLUSION
(64) While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.