Thursday, May 6, 2010

Processing Huge Data In Memory While within The Loop

Never process a huge data input (i.e., input larger than about 30% of the maximum memory limit) all at once. Whenever possible, chunk the processing into more manageable pieces.
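As a rough illustration of the 30% rule of thumb, the check could be sketched like this. The helper names iniToBytes and isTooBigToBuffer are hypothetical, not part of PHP, and the size on disk is only a proxy: an array built by file() generally needs more memory than the raw bytes of the file.

```php
<?php
// Hypothetical helper: convert a php.ini shorthand value such as "128M" to bytes.
function iniToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '') {
        return 0;
    }
    $number = (int) $value;                  // "128M" -> 128, "-1" -> -1
    $unit   = strtoupper(substr($value, -1)); // last character: K, M, G or a digit
    switch ($unit) {
        case 'G': return $number * 1024 * 1024 * 1024;
        case 'M': return $number * 1024 * 1024;
        case 'K': return $number * 1024;
        default:  return $number;            // plain byte count
    }
}

// Hypothetical helper: true when $fileBytes exceeds $ratio of the memory limit.
// The 0.3 default mirrors the 30% rule of thumb from the text above.
function isTooBigToBuffer(int $fileBytes, string $memoryLimit, float $ratio = 0.3): bool
{
    $limit = iniToBytes($memoryLimit);
    if ($limit <= 0) {
        return false; // a limit of "-1" means no limit is configured
    }
    return $fileBytes > $limit * $ratio;
}
```

A call such as isTooBigToBuffer(filesize("somehugefile.txt"), ini_get("memory_limit")) would then tell you whether to chunk the work instead of buffering the whole file.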

Take the following example.

$arr = file("somehugefile.txt"); // reads the entire file into an array in memory
foreach ($arr as $line) {
    // do something
}

Chunk the processing by splitting the file into more manageable pieces.
$arr1 = file("chunk1.txt"); // reads only the first chunk into memory
foreach ($arr1 as $line) {
    // do something
}
$arr2 = file("chunk2.txt"); // reads only the second chunk into memory
foreach ($arr2 as $line) {
    // do something
}
$arr3 = file("chunk3.txt"); // reads only the third chunk into memory
foreach ($arr3 as $line) {
    // do something
}
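
The splitting does not have to be done by hand. One possible way to chunk programmatically is to read the file line by line with fgets and process the lines in small batches, so that only one batch sits in memory at a time. The following is a minimal sketch under that assumption; processInChunks, its parameters, and the batch size are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical helper: read $path line by line and pass the lines to $handler
// in batches of at most $batchSize. Returns the total number of lines read.
function processInChunks(string $path, int $batchSize, callable $handler): int
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $batch = [];
    $total = 0;
    while (($line = fgets($handle)) !== false) {
        $batch[] = $line;
        $total++;
        if (count($batch) >= $batchSize) {
            $handler($batch); // do something with this chunk
            $batch = [];      // drop the chunk before reading the next one
        }
    }
    if ($batch !== []) {
        $handler($batch);     // process the final, possibly smaller chunk
    }

    fclose($handle);
    return $total;
}
```

For example, processInChunks("somehugefile.txt", 10000, function (array $lines) { /* do something */ }); would walk the file 10,000 lines at a time. The right batch size depends on line length and the memory limit, so treat that number as a placeholder.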


eduardo.aconia said...

Processing large amounts of data is not theoretical, and it's better to buffer the file reading than to force the user to split the file into multiple smaller files.

Rum Verse said...

What do you mean by the statement "processing large amounts of data being theoretical"? The article is in no way theoretical. It comes from actual experience processing data within a loop, anywhere between 1 and 100 million lines of strings stored in a file. Loading that into memory while looping is no joke; you'd end up with various memory problems. I would agree with you if there were no limit to what you can put in memory, but the beefiest servers I have encountered so far only range from 16GB to 32GB each, and a supercomputer isn't easy to come by. This article is inspired by the cloud: always shard. There are ways to split files programmatically, such as line-by-line reads with "fgets" in PHP or "split" from the command line. The example was kept simple for brevity. Anyway, if you have some more concrete examples, let's see them.

Also, this article doesn't deny that there will be instances where you will want to buffer all data in memory, especially if you're dealing with complex data structures and binary trees.