Memory usage of PCY vs. Apriori Algorithms

The Apriori Algorithm uses a triangular matrix to keep track of how many times items co-occur in a basket. I’ve drawn out an example matrix and entered N/A in the spaces that we would not initialize.

[Figure: example triangular matrix. Memory is not allocated for the bottom part of the matrix.]

Note that each row and column of the matrix corresponds to an item that is sold in the store. Let’s assume that we’re using a hashmap to keep track of which indices correspond to which items. All of our approaches will use this hashmap, so we’ll leave it out of our calculations.
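To make the layout concrete, here is a minimal sketch of one common way to store a triangular matrix as a flat array. The index formula, the item names, and the helpers `build_index` and `pair_slot` are illustrative assumptions, not something taken from the original post.

```python
def build_index(items):
    """Map each item name to a dense index 0..n-1 (the hashmap from the text)."""
    return {item: i for i, item in enumerate(sorted(items))}


def pair_slot(i, j, n):
    """Position of the unordered pair {i, j} in a flat upper-triangle array.

    Assumes 0-based indices with i != j. Row i holds the pairs (i, i+1) .. (i, n-1),
    and rows are stored one after another.
    """
    if i > j:
        i, j = j, i
    row_start = i * (2 * n - i - 1) // 2  # entries used by rows 0 .. i-1
    return row_start + (j - i - 1)


items = ["bread", "milk", "eggs", "beer"]  # hypothetical store inventory
index = build_index(items)
n = len(index)
counts = [0] * (n * (n - 1) // 2)  # one slot per pair, nothing for the lower half

# Count every pair that co-occurs in one hypothetical basket.
basket = ["milk", "bread", "eggs"]
ids = sorted(index[item] for item in basket)
for a in range(len(ids)):
    for b in range(a + 1, len(ids)):
        counts[pair_slot(ids[a], ids[b], n)] += 1
```

With four items the array has six slots, one for each possible pair, which is exactly the part of the matrix that gets allocated.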

The memory footprint of the triangular matrix will be:

(number of entries) x (bytes per entry)

Each entry in the matrix is an integer. Since we’re dealing with a large dataset, we should assume that counts can get very large — so let’s use a 4-byte int to keep track of these counts.

The number of entries in the matrix is equal to the number of pairs that we can make out of all the frequent items. Note that (n choose 2) gives us that number, where n is the number of frequent items.

Thus, the total memory footprint of the triangular matrix is:

(n choose 2) x 4 bytes
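In numbers, that footprint is (n choose 2) entries at 4 bytes each. The sketch below just evaluates it for a given n; the function name and the example value n = 1000 are assumptions chosen for illustration.

```python
from math import comb

BYTES_PER_COUNT = 4  # 4-byte int per matrix entry, as assumed in the text


def triangular_matrix_bytes(n):
    """Bytes needed to count every pair of n frequent items."""
    return comb(n, 2) * BYTES_PER_COUNT


print(triangular_matrix_bytes(1000))  # 499500 pairs x 4 bytes = 1998000
```

Because (n choose 2) = n(n - 1)/2, the footprint grows roughly with the square of the number of frequent items.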

Let’s compare this to the memory footprint of the PCY Algorithm.

In the PCY Algorithm, infrequent pairs are eliminated using a hashing technique. It would be super convenient if all of these pairs happened to be at one end of the triangular matrix that we just described. However, that’s not the case. Before running the first pass of the algorithm, there’s no way for us to know which pairs will be eliminated as infrequent.

Moreover, if we drew out our triangular matrix and removed the infrequent pairs, we’d see what looks like a triangular slice of Swiss cheese: the missing entries would be scattered throughout the matrix.

Unfortunately, there’s no way to avoid allocating memory for these infrequent pairs in the triangular matrix. Obviously, creating a full triangular matrix doesn’t provide a smaller memory footprint than the Apriori Algorithm, so we have to use a different approach.

We can keep track of the counts of pairs using tuples that look like this:

(item 1, item 2, count)

where each item is a 4-byte int that corresponds to one of the items sold in the store, as tracked in our previously mentioned hashmap. Here, the counts are 4-byte ints, just as they were in the triangular matrix. Thus, each of the pairs we count will take up a total of 12 bytes of memory.
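As a quick check of the 12-byte figure, here is a small sketch that packs one such tuple as three 4-byte ints using Python's `struct` module. The item indices and count are made-up values, and this only illustrates the packed layout; an ordinary Python tuple or dict entry would take considerably more memory.

```python
import struct

# Two item indices plus a count, each a 4-byte int with no padding.
TRIPLE = struct.Struct("<iii")

packed = TRIPLE.pack(17, 42, 3)  # hypothetical pair (17, 42) seen 3 times

print(TRIPLE.size)            # 12
print(TRIPLE.unpack(packed))  # (17, 42, 3)
```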

So the total memory usage for counting pairs in the PCY Algorithm will be:

m x 12 bytes

Remember, we’re only going to be counting pairs that are made up of frequent items and hash to frequent buckets. Let m represent the number of pairs we count in the second pass of PCY.

Let’s set the two memory usages equal to each other and solve for the break-even point.

Remember:

n = the number of frequent items

m = the number of pairs we count in the second pass of PCY

Using the memory usages outlined above, we can set up the equation:

(n choose 2) x 4 bytes = m x 12 bytes

Dividing both sides by 12 bytes yields:

(n choose 2) x (1/3) = m

Thus, we can see that the break-even point for memory usage is reached when the number of pairs we count in the PCY Algorithm is equal to one third of the number of pairs counted in the Apriori Algorithm.

This is equivalent to saying that we need to eliminate at least 2/3 of the pairs counted in the second pass of Apriori, which is the claim made by the book, so we’ve verified the claim.
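The break-even point can also be checked numerically. This is a minimal sketch under the same assumptions as the derivation (4 bytes per matrix entry, 12 bytes per tuple); the function names and the choice n = 1000 are hypothetical.

```python
from math import comb


def triangular_matrix_bytes(n):
    """Apriori: one 4-byte count for every pair of n frequent items."""
    return comb(n, 2) * 4


def triples_bytes(m):
    """PCY: one 12-byte (item 1, item 2, count) tuple per counted pair."""
    return m * 12


def pcy_saves_memory(n, m):
    """True when counting m pairs as tuples is strictly cheaper than the full matrix."""
    return triples_bytes(m) < triangular_matrix_bytes(n)


n = 1000
all_pairs = comb(n, 2)       # 499500 candidate pairs
break_even = all_pairs // 3  # 166500 pairs, exactly one third here

print(triples_bytes(break_even) == triangular_matrix_bytes(n))  # True
print(pcy_saves_memory(n, break_even - 1))                      # True
print(pcy_saves_memory(n, break_even + 1))                      # False
```

At exactly one third of the pairs the two footprints are equal, so the tuple approach only comes out ahead once more than 2/3 of the pairs have been eliminated.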

Nice! 🎉
