Day 2 — CS Fundamentals December — A tale of caching and cache policies
Caching is one of the most important topics which comes under distributed computing. Although it is a very easy topic if given the right attention, people tend to get intimidated because people at times approach this concept the wrong way and feel stuck amongst huge amounts of information.
I am writing this article with the intention that after reading it, the next time you hear the word “cache”, you will think, “ Dude! I can deal with it, it’s chill! ”
Also, it is difficult to explain how important this topic is considering software engineering interviews. For example, system design is one of the most crucial areas which is hit by such interviews and caching is one of the core fundamentals of system design.
Let’s dive in!
So, basically, the data which programs access resides in main memory/RAM.
So, data moves to and fro between programs (which are inside core/processor) and memory. But well, at a hardware level, main memory is kinda “far” (metaphorically) because after all, main memory is separate from processor.
So, one of the strategies to quicken the access to data is store data on the processor chip itself.
“Wow, Yash! If that is true, then why the heck do we still have RAMs in our laptops :/”
Hold on, the problem with cache is the capacity. Normal PCs now a days possess GBs of RAM but if you look at the specs carefully, their cache capacity would be just 4 or 6MB, or 8MB if it is pretty expensive, because considering the processor chip’s physical size, you can’t store GBs of data on it.
So, yes, you still need RAM.
So, to utilise the available cache, obviously you don’t store all the data in the cache. Instead you store some specific data on the cache and you decide which data to keep on the cache by some algorithms.
One of the most frequently used algorithms is “Most Frequently Used” (Pun intended :P) where the cache is made to possess only data which is very frequently used.
(Actually, the algorithm is “Least Frequently Used” where the least used variables/data are removed in the cache whenever a cache is about to add a new variable and it has no space left and it has to remove some of it’s existing variables to create space for the new one)
For example, if there is an environment variable which is highly used and accessed by a program, then you can simply store that variable on the cache and hence, that program’s access to that variable will be much quicker than accessing from RAM.
So, what is a cache?
Cache is basically a memory store where data stored is much more quickly accessible and “closer” than the data stored on main memory/RAM.
Why should we use cache?
Because well, data on it becomes much much more quickly accessible than when the same data is on main memory.
Time to scare people — Enter distributed computing
Previously, I explained the “access to cache” considering that a single core was accessing and updating the cache and main memory accordingly and living its life in a chill manner without “moh maya”.
But the problem now is that, now a days, PCs have multiple processors which means that there is a single main memory but there are multiple processors with their own caches trying to access and update the shared data on the main memory.
For example, consider this situation.
There are two processors, Processor 1(P1) and Processor 2 (P2). And every processor has their own cache. And there is a shared variable x = 0 on the main memory.
Now, for quicker access to this variable, let’s consider that the processors store this variable x on their respective caches.
Now, both of the processor can simultaneously read the cache, that is totally cool but things start to mess up when updating the cache becomes involved.
So, say, Processor 1 decides to write the variable x to x = 1.
— x — x — x — x — x — x — x — x —
Now, there are three ways to write/update variable:
Cache write policies
- Write-through policy: Here, the processor updates the variable both in its cache and in the main memory thereby keeping its cache and main memory consistent with each other. The disadvantage, although, with this approach is the increased latency due to writing the data at two places.
- Write-around policy: Here, the processor updates the variable only in the main memory without touching its cache. So, this reduces write-latency and it is good for cases where the data is very infrequently read. Although, the disadvantage is that when a read is performed to recently written data, a cache miss will happen which lead to increased read-latency.
- Write-back policy: Here, the processor updates the variable only to its cache and NOT to the main memory. And finally, the main memory is updated when cache eviction occurs. This lazy approach is good for write-intensive data scenarios which require low-latency write I/O completion but the disadvantage is the risk of losing data if a cache failure occurs before writing the data to the main memory.
Cache eviction basically means the updates to the data blocks in the cache are flushed to the main memory (main memory is updated with cache’s current data) after a certain quota is exceeded. That certain quota is basically, fileset usage exceeds the fileset soft quota. (Read more about this here)
— x — x — x — x — x — x — x — x —
Back to the example,
So, Processor 1 updates that variable in its cache and afterwards, on the main memory as well but what if Processor 2 tries to access this variable x. It will simply look for “x” in its cache and find it and say, “well, I found x in my cache and its value is 0 so yeah, I’ll consider x to be 0” but in reality, x is updated to 1 and hence, Processor 2 will be reading wrong value which would lead to inconsistency issue.
So, basically, someone or something should tell cache 2 to update the variable x in it by re-reading it from the main memory.
And this is basically, what is called cache invalidation
So, what is cache invalidation?
Basically, a process where cache is told to update data in it.
And well, as they say.
There can be numerous approaches to cache invalidation. For example, one of them is Setting a TTL i.e. time-to-live for every variable in the cache so, that after that TTL, the cache checks in the main memory whether that respective variable has a new value or not and updates it accordingly.
There can be other approaches as well but they are taken up according to the kind of data and use case associated.
Thanks a lot for reaching till here :)
I hope you now understand a fair amount of the theory and concept around caches and you won’t get intimidated or you won’t go “What’s that?” the next time you hear the word “caching”.
Stay tuned for another article which is going to come tomorrow associated with some other interesting CS fundamental.
Find me on
Email — firstname.lastname@example.org