Thursday, May 6, 2010

Processing Huge Data In Memory Within The Loop

Never process a huge data input (i.e., input larger than about 30% of the maximum memory limit) in one go. Whenever possible, chunk the processing into more acceptable sizes.

Take the following example.

$arr =  file("somehugefile.txt"); //buffers the array into memory
foreach($arr as $line) {
//do something
}
unset($arr);
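
As a rough way to apply the 30% rule of thumb above, the file size can be compared against PHP's configured memory_limit before loading the whole file. The snippet below is only a sketch; the helper name memoryLimitBytes and the check itself are illustrative and not part of the original example.

// Minimal sketch (illustrative): is the file bigger than ~30% of memory_limit?
function memoryLimitBytes() {
    $limit = ini_get('memory_limit'); // e.g. "128M", "1G", or "-1" for unlimited
    if ((int) $limit === -1) {
        return PHP_INT_MAX;
    }
    $value = (int) $limit;
    switch (strtoupper(substr($limit, -1))) {
        case 'G': return $value * 1024 * 1024 * 1024;
        case 'M': return $value * 1024 * 1024;
        case 'K': return $value * 1024;
        default:  return $value; // limit given in plain bytes
    }
}

$path = "somehugefile.txt";
if (filesize($path) > 0.3 * memoryLimitBytes()) {
    // too big to load at once: chunk the processing instead
}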

Chunk the processing by splitting the file into more manageable pieces.
$arr1 =  file("chunk1.txt"); //buffers the array into memory
foreach($arr1 as $line) {
//do something
}
unset($arr1);
$arr2 = file("chunk2.txt"); //buffers the array into memory
foreach($arr2 as $line) {
//do something
}
unset($arr2);
$arr3 = file("chunk3.txt"); //buffers the array into memory
foreach($arr3 as $line) {
//do something
}
unset($arr3);
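
The same chunked pattern can also be written once as a loop over the chunk file names, so that only one chunk sits in memory at a time. This is just a compact restatement of the example above, using the same chunk1.txt, chunk2.txt, and chunk3.txt.

// One chunk in memory at a time; same chunk files as above.
foreach (array("chunk1.txt", "chunk2.txt", "chunk3.txt") as $chunk) {
    $arr = file($chunk); // buffers only this chunk into memory
    foreach ($arr as $line) {
        // do something
    }
    unset($arr); // free the chunk before loading the next one
}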

2 comments:

eduardo.aconia said...

Processing large amounts of data is not theoretical, and it's better to buffer the file reading rather than force the user to split the file into multiple smaller files.

Unknown said...

What do you mean by the statement "processing large amounts of data being theoretical"? The article is in no way theoretical, though. It comes from actual experience processing data within a loop, anywhere between 1 and 100 million lines of text stored in a file. It's no joke to load that into memory while looping; you'd end up with various memory problems. I would have to agree with you if there were no limit to what you can put in memory, but the beefiest servers I have encountered so far only range between 16GB and 32GB each, and a supercomputer isn't something that's readily available. You see, this article is inspired by the cloud: always shard. There are ways to split files programmatically, such as reading line by line with "fgets" in PHP or using "split" from the command line (see the sketch after this comment); the example was kept short for brevity. Anyway, if you have some more concrete examples, let's see them.

Also, this article doesn't deny that there will be instances where you will want to buffer all of the data in memory, especially if you're dealing with complex data structures and binary trees.
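
For reference, here is a minimal sketch of the "fgets" approach mentioned above. It reads somehugefile.txt (the file from the article) one line at a time, so memory use stays flat regardless of file size; the file could also be pre-split beforehand with the command-line "split" utility as the comment notes.

// Minimal sketch of the line-by-line alternative: fgets() reads one
// line at a time, so memory use stays flat no matter how big the file is.
$fh = fopen("somehugefile.txt", "r");
if ($fh !== false) {
    while (($line = fgets($fh)) !== false) {
        // do something with $line
    }
    fclose($fh);
}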