A quest for speed

Caching, data grids, and more

2013 Netflix Open Connect cache server
Intel Raptor Lake Core i9 Die Shot

🍪

Session Cookies

Tracks state through the life of a user session

💳

Account Balance

Increasing demand for up-to-date information

⏳ We don't like waiting

Time is money

Site speed directly impacts the rate at which users complete a desired outcome.

Performance is regulatory


An average Time to Last Byte (TTLB) of 750 milliseconds per endpoint response.

Source: Open Banking API performance

Performance

Why so hard?

Life was simple


We wanted more


And then...

1,500 services, 9,300 unique connections, Monzo engineering

Microservices!

A quick aside

Front-end design

Caching

provides stateless processes with fast access to temporary data

"when given the same arguments, if a service call yields the same results every time, then it is a good candidate for caching"
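A minimal sketch of that rule, assuming a deterministic lookup: the standard library's `functools.lru_cache` memoizes a function so that repeated calls with the same argument skip the slow path. The function `lookup_bank_name` and its return value are hypothetical stand-ins for a real service call.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the slow path actually runs


@lru_cache(maxsize=1024)
def lookup_bank_name(sort_code: str) -> str:
    """Hypothetical deterministic service call: same argument, same result."""
    calls["count"] += 1
    return f"bank-for-{sort_code}"


first = lookup_bank_name("12-34-56")   # runs the function body
second = lookup_bank_name("12-34-56")  # answered from the cache
```

A call whose result changes between invocations (an account balance, for example) breaks this assumption and needs an explicit expiry or invalidation strategy.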

The art of the possible

How fast?

Task                                  Duration
L1 cache reference                    0.5 ns
L2 cache reference                    4 ns
Main memory reference                 100 ns
Round trip within DC                  500,000 ns
Round trip from DC to cloud           30,000,000 ns
Round trip CA -> Netherlands -> CA    150,000,000 ns
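To put those figures in proportion, the sketch below divides each latency by the cost of a main memory reference. The numbers are copied from the table above; the helper `times_slower` is a hypothetical name used only for illustration.

```python
# Latencies in nanoseconds, copied from the table above.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 4,
    "Main memory reference": 100,
    "Round trip within DC": 500_000,
    "Round trip from DC to cloud": 30_000_000,
    "Round trip CA -> Netherlands -> CA": 150_000_000,
}


def times_slower(task: str, baseline: str = "Main memory reference") -> float:
    """Return how many times slower `task` is than `baseline`."""
    return LATENCY_NS[task] / LATENCY_NS[baseline]


in_dc = times_slower("Round trip within DC")
intercontinental = times_slower("Round trip CA -> Netherlands -> CA")
```

On these figures a round trip inside the data centre costs 5,000 memory references, and the intercontinental round trip costs 1.5 million.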




Task                      Duration
Read 1MB from memory      3,000 ns
Read 1MB from SSD         49,000 ns
Read 1MB over network     80,000 ns
Read 1MB from Disk        825,000 ns

Just put it in memory

Chart showing the cost of memory over time, from $3.79 quadrillion per TB to $1,088 per TB.

The user session

🍪 State

But 12 Factor says...

Just put it in memory

Patterns

Long lived data

💳 Balance

Cache-Aside: Read
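A minimal sketch of a cache-aside read, assuming plain dictionaries stand in for the cache and the system of record. The names `database`, `cache`, and `get_balance` are hypothetical; a real deployment would call a cache client and a database driver instead.

```python
from typing import Dict, Optional

database: Dict[str, int] = {"12345678": 250}  # hypothetical system of record
cache: Dict[str, int] = {}                    # hypothetical cache
stats = {"db_reads": 0}                       # counts trips to the database


def get_balance(account_id: str) -> Optional[int]:
    """Check the cache first; on a miss, read the database and populate the cache."""
    if account_id in cache:
        return cache[account_id]
    stats["db_reads"] += 1
    balance = database.get(account_id)
    if balance is not None:
        cache[account_id] = balance
    return balance


first = get_balance("12345678")   # miss: goes to the database
second = get_balance("12345678")  # hit: served from the cache
```

The application owns the logic here: the cache never talks to the database itself.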

Cache-Aside: Write
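A minimal sketch of a cache-aside write, again assuming dictionaries as stand-ins and hypothetical names. This variant writes to the system of record first and then updates the cached entry; deleting the cached entry instead, so the next read repopulates it, is another common choice.

```python
from typing import Dict

database: Dict[str, int] = {"12345678": 250}  # hypothetical system of record
cache: Dict[str, int] = {"12345678": 250}     # hypothetical cache, already warm


def set_balance(account_id: str, new_balance: int) -> None:
    """Write to the source of truth first, then bring the cache in line."""
    database[account_id] = new_balance
    cache[account_id] = new_balance


set_balance("12345678", 300)
```

The two writes are not atomic in this sketch, so a failure between them leaves the cache stale until the entry is refreshed or removed.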

Update on read and write

Problems?

Refresh Ahead
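A minimal sketch of refresh ahead, assuming a time-to-live per entry and a reload once an entry has used up a set fraction of that lifetime. The class `RefreshAheadCache`, its parameters, and the `loader` function are hypothetical. The reload runs inline to keep the example deterministic; a real implementation would typically reload in the background.

```python
import time
from typing import Callable, Dict, Tuple


class RefreshAheadCache:
    def __init__(self, loader: Callable[[str], str], ttl: float,
                 refresh_factor: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl
        self._refresh_after = ttl * refresh_factor
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, loaded_at)

    def _load(self, key: str, now: float) -> str:
        value = self._loader(key)
        self._entries[key] = (value, now)
        return value

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now - entry[1] >= self._ttl:
            return self._load(key, now)  # miss or expired: caller waits for the load
        value, loaded_at = entry
        if now - loaded_at >= self._refresh_after:
            self._load(key, now)  # still valid but ageing: reload ahead of expiry
        return value              # serve the value that was cached on arrival


loads = []
now = [0.0]  # fake clock so the example does not sleep


def loader(key: str) -> str:
    loads.append(key)
    return f"{key}-v{len(loads)}"


cache = RefreshAheadCache(loader, ttl=10, clock=lambda: now[0])
at_0 = cache.get("a")  # cold miss, loads v1
now[0] = 6.0
at_6 = cache.get("a")  # past half the TTL: serves v1, reloads v2
now[0] = 7.0
at_7 = cache.get("a")  # fresh again: serves v2 without loading
```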

Challenges

Invalidation

If data changes at source, what value should be returned from the cache?
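A minimal sketch of the two usual answers to that question, assuming a hypothetical `TTLCache` class: let entries expire after a time-to-live and accept stale reads in the meantime, or invalidate the entry explicitly when the source changes.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[int, float]] = {}  # key -> (value, stored_at)

    def put(self, key: str, value: int) -> None:
        self._entries[key] = (value, self._clock())

    def get(self, key: str) -> Optional[int]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)


now = [0.0]  # fake clock so the example does not sleep
source = {"balance": 100}
cache = TTLCache(ttl_seconds=30, clock=lambda: now[0])

cache.put("balance", source["balance"])
source["balance"] = 80              # the source changes behind the cache
stale = cache.get("balance")        # still the old value
now[0] = 31.0
after_ttl = cache.get("balance")    # expired, so the caller must reload

cache.put("balance", source["balance"])
cache.invalidate("balance")         # explicit invalidation on change
after_invalidate = cache.get("balance")
```

A shorter TTL narrows the window for stale reads at the cost of more trips to the source.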

Resilience

Multi-Layer Caching

Between a user and the primary data store there are often multiple different caches.
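A minimal sketch of a two-layer read path, assuming an in-process cache in front of a shared cache in front of the primary store. All three are dictionaries here, and the names `local_cache`, `shared_cache`, `database`, and `get_balance` are hypothetical.

```python
from typing import Dict, List

local_cache: Dict[str, int] = {}              # per-process, fastest
shared_cache: Dict[str, int] = {}             # shared between processes
database: Dict[str, int] = {"12345678": 250}  # primary data store
served_by: List[str] = []                     # records which layer answered


def get_balance(account_id: str) -> int:
    """Try each layer in order and fill the faster layers on the way back."""
    if account_id in local_cache:
        served_by.append("local")
        return local_cache[account_id]
    if account_id in shared_cache:
        served_by.append("shared")
        local_cache[account_id] = shared_cache[account_id]
        return local_cache[account_id]
    served_by.append("database")
    value = database[account_id]
    shared_cache[account_id] = value
    local_cache[account_id] = value
    return value


get_balance("12345678")  # cold: answered by the database
get_balance("12345678")  # answered by the local cache
local_cache.clear()      # simulate a process restart
get_balance("12345678")  # answered by the shared cache
```

Each extra layer is another place where a stale copy can survive a change at the source.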

Mental Model

"a collection of information on severe (SEV-0 or SEV-1) incidents at Twitter that were at least partially attributed to cache"


See: danluu.com/cache-incidents/

🧘 Breathe

data in memory

Can we do more?

SQL queries

SELECT balance FROM /brand WHERE account.id = 12345678 AND account.brand = $1;

run code on the cache

Compute

Event driven

Keeping an eye on things

  • Hit Rate
  • Miss Rate
  • PUT Latency
  • GET Latency
  • Sync Queue Depth
  • Leader Election Events
  • Replication Failures

Where next?