- Scenario: case and interface
- Necessary: constraint and hypothesis
- Application: service and algorithm
- Kilobit: data storage design
- Evolve: performance, scalability and robustness
Back of the envelope estimation
- 1 day = 86400s
- disk seek: 10ms
- 1MB disk sequential read: 20ms
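A minimal sketch of how these figures are typically combined. It assumes the sequential-read figure means 20ms per 1MB read from disk; the class and method names are hypothetical.

```java
final class Estimate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average queries per second for a given number of requests per day. */
    static double averageQps(long requestsPerDay) {
        return (double) requestsPerDay / SECONDS_PER_DAY;
    }

    /** Milliseconds for one seek followed by a sequential read of `megabytes`. */
    static double diskReadMs(double megabytes) {
        double seekMs = 10.0;            // disk seek: 10ms
        double sequentialMsPerMb = 20.0; // assumed: 20ms per 1MB sequential read
        return seekMs + megabytes * sequentialMsPerMb;
    }
}
```

For example, a hypothetical 100 million requests per day gives `Estimate.averageQps(100_000_000L)`, roughly 1,157 QPS on average; peak traffic is usually estimated as a multiple of that.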
DNS
- Category
- Recursive DNS = Resolver: delegate user DNS request
- Authoritative DNS:
- Root DNS: redirect request to TLD
- TLD (top-level domain): .com, .cn, ...
- Route53: example.com
- reference: How Does DNS Route Traffic To Your Web Application?
CDN
- advantages
- Users receive content at data centers close to them
- Your servers do not have to serve requests that the CDN fulfills
Load balance:
- distributes incoming client requests among a group of servers, in each case returning the response from the selected server to the appropriate client
- Architecture
- Master/LB: do IO
- Slave: computation resource
- Layer 4 Load balancing: transport layer
- source ip, destination ip, port
- Layer 7 Load balancing: application layer
- all info: source ip, destination ip, port, content...
Reverse Proxy
- accepts a request from a client, forwards it to a server that can fulfill it, and returns the server’s response to the client.
- Increased security, Increased scalability and flexibility
- reference: reverse-proxy-vs-load-balancer
Cache
- Type
- cache aside pattern: use both cache and DB explicitly
- read: when cache miss, load data from db and update cache
- write: write to db first and invalidate cache
- write through: cache as data store, cache is responsible for reading/writing the DB
- cache returns data only after it has been persisted
- write behind:
- Add/update entry in cache
- Asynchronously write entry to the data store, improving write performance
- refresh ahead
- How to update cache correctly?
- factors
- concurrency: multiple r/w
- failure: cache operation failure, db operation failure
- solutions
- invalidate the cache instead of setting it: with concurrent writes, a set can leave an older value in the cache
- 2 incomplete solutions
- invalidate cache + update db
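- pros: consistent over failure (cache failure or db failure)
- cons: inconsistent over concurrency (i1, r2, w1: writer 1 invalidates, reader 2 misses and reloads the old value into the cache, then writer 1 updates the db)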
- update db + invalidate cache (cache aside pattern)
- pros: consistent over concurrency (different order of r/w operations)
- cons: inconsistent over failure (cache invalidation failure)
- Reference:
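The cache aside read and write paths above can be sketched as follows. This is a single-process illustration only: both the cache and the db are stand-in in-memory maps, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

final class CacheAsideStore {
    // Stand-ins for a real cache and a real database (assumptions for this sketch).
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> db = new ConcurrentHashMap<>();

    /** Read: on a cache miss, load from the db and populate the cache. */
    Optional<String> read(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        String stored = db.get(key);
        if (stored != null) {
            cache.put(key, stored);
        }
        return Optional.ofNullable(stored);
    }

    /** Write: update the db first, then invalidate (not set) the cache entry. */
    void write(String key, String value) {
        db.put(key, value);
        cache.remove(key);
    }

    boolean isCached(String key) {
        return cache.containsKey(key);
    }
}
```

The sketch does not address the failure case listed above: if the invalidation step fails after the db update, the cache keeps serving the old value until it expires or is invalidated again.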
Database
- index
- implementation:
- B-tree for range queries: O(log_b n)
- B+Tree: disk friendly with continuous reading
- Clustered vs. non-clustered
- clustered index sorts rows in key order physically
- How does inserting a row work with a clustered index? Rows are stored in pages, and an insert modifies/splits only 1 page
- log for recovery
- redo log: redo committed transactions
- log: changed/modified values in a transaction
- log: commit
- flush data to disk (async operation for performance)
Consistent Hashing for Sharding
- Problem to solve:
- Elastic scaling: dynamic add/remove nodes
- Remap keys when adjust nodes: as few as possible
- Avoid hot spot/imbalance on nodes
- Solution
- Design
- Circular Hash key space: int32
- Map node to key space
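- How to place a key? Find the lower/upper bound of the hashed key among the sorted node positions, i.e. the next node clockwise, wrapping around at the end of the ring
- When add/remove node
- Only about k/n keys need to be remapped (k keys, n nodes)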
- Use multiple virtual node/replica of a node to balance keys
- Reference: https://www.acodersjourney.com/system-design-interview-consistent-hashing/
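A minimal sketch of the ring described above, assuming CRC32 as the hash function (truncated to int32) and a `node#i` naming scheme for virtual nodes. Both choices, and all names, are illustrative assumptions rather than part of the original design.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

final class ConsistentHashRing {
    // Sorted positions on the int32 key space -> physical node name.
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Assumed hash: CRC32 truncated to 32 bits. */
    static int hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return (int) crc.getValue();
    }

    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            // putIfAbsent: on a rare position collision, keep the existing owner.
            ring.putIfAbsent(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            // Two-argument remove: only delete positions still owned by this node.
            ring.remove(hash(node + "#" + i), node);
        }
    }

    /** First node at or after the key's hash, wrapping to the start of the ring. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

With this layout, adding a node only moves keys onto the new node, and removing it moves exactly those keys back; all other keys keep their placement.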
Squid, ATS, Varnish, HAProxy, Nginx
- Squid, ATS, Varnish: caching / CDN
- HAProxy, Nginx: reverse proxy
- Nginx: web server
URL shortener
- Services: shortening + redirect
- read:write = 10:1
- shortening
- short_url = base62(increment counter++)
- base62: [a-zA-Z0-9]
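A sketch of the counter-based shortening step, assuming the alphabet is ordered exactly as written above (a-z, then A-Z, then 0-9) and that an in-memory map stands in for the real storage. Class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class Base62 {
    private static final String ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return String.valueOf(ALPHABET.charAt(0));
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62))); // least significant digit first
            id /= 62;
        }
        return sb.reverse().toString();
    }

    static long decode(String code) {
        long id = 0;
        for (char c : code.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("not a base62 character: " + c);
            }
            id = id * 62 + digit;
        }
        return id;
    }
}

final class UrlShortener {
    private final AtomicLong counter = new AtomicLong();
    private final Map<String, String> longByShort = new ConcurrentHashMap<>();

    /** short_url = base62(counter++) */
    String shorten(String longUrl) {
        String code = Base62.encode(counter.getAndIncrement());
        longByShort.put(code, longUrl);
        return code;
    }

    /** Redirect lookup; returns null for an unknown code. */
    String resolve(String code) {
        return longByShort.get(code);
    }
}
```

In a multi-server deployment the counter would need to be a shared sequence (for example a database sequence or pre-allocated id ranges), which this single-process sketch does not cover.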
Twitter timeline implementation
- Timeline
- User timeline: self tweets
- Home timeline: aggregated following tweets
- Push
- Fanout with redis: compute and store home timeline for each new tweet
- Pull
- Compute based on tweet list
- Mixed
- hot users (lots of followers): pull + merge with precomputed home
- Reference
- system-design-primer
- highscalability.com
- ninechapter 6
KV store with external storage/ Facebook photo storage
- index (in mem + file) + data file
- index: key -> data offset in data file
- C(K, V): append V to data file + add index K -> offset of V in data file
- R(K): offset = index(K) + get V at offset in data file
- U(K, V): append V to data file + update index K -> offset of new V in data file
- D(K): delete K in index
- Compact/GC
- copy all indexed value to new file: O(n)
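The index + append-only data file layout can be sketched in memory as below. A growable byte array stands in for the data file, and the index stores a length next to each offset so values can be read back; both are assumptions of this sketch, as are all names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class AppendOnlyKvStore {
    // key -> {offset, length} of the latest value in the data "file"
    private final Map<String, int[]> index = new HashMap<>();
    private byte[] data = new byte[16]; // stand-in for the data file
    private int size = 0;               // bytes appended so far

    /** C(K, V) and U(K, V): append V, then point the index at the new offset. */
    void put(String key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (size + bytes.length > data.length) {
            data = Arrays.copyOf(data, Math.max(data.length * 2, size + bytes.length));
        }
        System.arraycopy(bytes, 0, data, size, bytes.length);
        index.put(key, new int[] {size, bytes.length});
        size += bytes.length;
    }

    /** R(K): look up the offset in the index, then read V at that offset. */
    String get(String key) {
        int[] slot = index.get(key);
        return slot == null ? null : new String(data, slot[0], slot[1], StandardCharsets.UTF_8);
    }

    /** D(K): only the index entry is removed; the bytes stay until compaction. */
    void delete(String key) {
        index.remove(key);
    }

    /** Compact/GC: copy every indexed value into a fresh buffer, O(n). */
    void compact() {
        int live = 0;
        for (int[] slot : index.values()) {
            live += slot[1];
        }
        byte[] fresh = new byte[Math.max(16, live)];
        int pos = 0;
        for (int[] slot : index.values()) {
            System.arraycopy(data, slot[0], fresh, pos, slot[1]);
            slot[0] = pos; // update the offset in place
            pos += slot[1];
        }
        data = fresh;
        size = pos;
    }

    int dataSize() {
        return size;
    }
}
```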
User system
- Scenario
- account: register/update profile/remove
- session: login/logout
- wallet: balance/membership
- User State management:
- User (uid, name, hashed_pw, state)
- register, approve, disapprove, deactivate, ban
- User session management:
- Session (uid, session_id, state)
- login, logout
- Membership management
- membership (uid, balance, end_time)
- charge, extend_end_time
- transaction for consistency
- Availability
- transaction log for failure recovery
- Redundancy against data loss
Payment system
- Scenario
- charge(deposit)/withdraw
- transfer/payment
- Necessary
- QPS of charge
- Application (charge)
- Order: bank account -> app account
- Data
- User: userId
- Order: orderId, bankId, userId, amount, opType, state, time
- Payment Method: bankId, userId
Crawler system
- crawl list and page
- Master-slave
- scheduler: control each crawler machine
- task table:
- taskId, priority, type (list/detail), url, time
- page table:
- page content
- slave: worker
- pull:
- slaves monitor task queue/database
- take a task and crawl
- update task status
- concurrency: producer-consumer pattern
- push
- master pushes task to slave
- slave sends crawled page to master
Search Engine
- Core components
- Crawler for data retrieval
- Indexing service
- inverted index service + document service
- Search API
Distributed file system
- Large file: metadata + chunks
- metadata: index (chunk_id -> offset)
- chunks: 64MB per chunk
- +: reduce metadata size, reduce traffic
- -: waste space for small files
- Extra large file: chunk server
- master metadata: index (chunk_id -> chunk server id)
- chunk server:
- metadata: index (chunk_id -> offset)
- chunks
- Chunk robustness:
- checksum: detect a corrupted chunk
- replica: chunk copies on multiple servers
- performance + availability: 2 on same rack + 1 on remote rack
- Chunk server liveness
- keepalive with master
- Chunk server recovery
- Once a chunk server goes down, it is treated as 'dead' and never comes back online
- master re-replicates its chunks to other chunk servers
- Hot spot: hot file
- Master rebalances by replicating hot chunks to more chunk servers
- Read a file:
- client queries the name node for the chunk server id of the block
- client reads from the chunk server
- Write a file (pipeline replication)
- client queries the name node for the chunk server id list (primary + replicas) of the block
- client sends data to primary
- chunk server chain: primary->replica1->replica2
- 2-phase commit to persist
- primary chunk server acks client
Big table
- Physical view
- big table = a list of tablets
- tablet = a list of sorted <key, values> pairs
- a list of sstable
- SStable (Sorted String Table): sstable-and-log-structured-storage-leveldb/
- mem+file
- memory:
- index(sstable -> file address, bloomfilter)
- sstable(sorted by key)
- file: sstables
- how to read: find in all sstables and merge the result
- how to write:
- write to sstable in memory
- write operation transaction log to disk
- dump the sstable to file system when full
- How to store data on GFS
- a server can be both tablet server (memory) + chunk server (persistent)
- log and sstables are replicated on chunk servers
- Logical view
- row key: (column:col_value)* -> timestamp
- Architecture (see evernote 'hbase')
- client
- chubby: metadata for tablet server addresses
- master: manage metadata, balance load, health check
- client read/write data from tablet server
MapReduce
- process
- input-split-map-shuffle-reduce-finalize
- split size = block size = 64M
Big data problem solution
- segment hash
- bitmap
- bloomfilter
- external sort
- map-reduce
Rate limiter
- limit qps to k
time bucket (tumbling window)
- 1 bucket per second, limit k request for 1 bucket
    // single time bucket implementation
    long cur_ts = 0;
    int remaining = k;

    boolean allow(long ts) {
        if (ts != cur_ts) {
            remaining = k;
            cur_ts = ts;
        }
        if (remaining <= 0) {
            return false;
        } else {
            remaining--;
            return true;
        }
    }
sliding window
- record enqueue time of each request
- for each request qn, it can be added to queue only when
- queue has less than k items, or
- now - t(n-k) > 1s, where t(n-k) is the enqueue time of the k-th most recent accepted request
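A sketch of the sliding window variant, assuming timestamps in milliseconds are passed in by the caller and that rejected requests are not recorded. The class name and constructor parameters are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class SlidingWindowLimiter {
    private final int limit;          // k
    private final long windowMillis;  // 1000 for a per-second limit
    private final Deque<Long> accepted = new ArrayDeque<>(); // enqueue times, oldest first

    SlidingWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean allow(long nowMillis) {
        // Drop entries that are strictly older than the window (now - t > window).
        while (!accepted.isEmpty() && nowMillis - accepted.peekFirst() > windowMillis) {
            accepted.pollFirst();
        }
        if (accepted.size() >= limit) {
            return false; // k requests already accepted within the window
        }
        accepted.addLast(nowMillis);
        return true;
    }
}
```

Compared with the time bucket version, this avoids the burst of up to 2k requests around a bucket boundary, at the cost of storing up to k timestamps.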
Flash sale (seckill): 10 iPhones
- Cache: browser, CDN, reverse proxy, server cache
- Queue: 2 level queue
- level 1 (de-peak traffic): n web servers, each has a queue, each allows 10 requests
- level 2: 1 queue for first 10 of 10*n requests
- Key point
- reduce traffic
- reduce page size
- CDN for static page
- limit request
Typeahead
- Scenario
- return all suggestions(articles, friends, ...) according to the typed words
- Necessary
- average latency < 100ms
- Application
- trie-tree
- Architecture
- aggregation of inverted index (global/personalized)
- cache result
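A minimal sketch of the trie lookup, assuming suggestions are plain strings returned in alphabetical order. A real system would rank them, for example by popularity, which is not modeled here; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SuggestionTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>(); // sorted for stable output
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Up to `limit` stored words that start with `prefix`, in alphabetical order. */
    List<String> suggest(String prefix, int limit) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return out; // no stored word has this prefix
            }
        }
        collect(node, new StringBuilder(prefix), limit, out);
        return out;
    }

    private static void collect(Node node, StringBuilder path, int limit, List<String> out) {
        if (out.size() >= limit) {
            return;
        }
        if (node.isWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, limit, out);
            path.setLength(path.length() - 1); // backtrack
        }
    }
}
```

To stay under the latency target, the results for popular prefixes are often precomputed or cached rather than collected by traversal on every keystroke.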
Bloomfilter vs. hashset:
- +: space saving
- -: hard to handle element deletion (we can leave deleted elements in place, at the cost of a higher false positive rate)
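To make the trade-off concrete, here is a small Bloom filter sketch. It derives its k bit positions from `String.hashCode()` by double hashing, which is an assumption chosen for brevity rather than a recommended hash; all names are hypothetical.

```java
import java.util.BitSet;

final class BloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits
    private final int hashCount;  // number of bit positions per element

    BloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** i-th bit position for an item: h1 + i * h2 (double hashing). */
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // odd second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(item, i));
        }
    }

    /** false means definitely absent; true means possibly present. */
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Clearing bits on delete is unsafe because several elements may share a bit, which is why deletion is hard without a counting variant.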
- SQL or NoSQL
- Difference?
- relational(tabular model with schema) vs. non-relational(schemaless)
- structured data vs. semi-structured data
- KV, doc, graph
- SQL vs. proprietary query language
- strong consistency vs. eventual consistency
- scenario for nosql
- Temporal data: log data
- Java GC: Oracle
- Mark and compact, Mark and copy
- generational
- young, old, permanent
- young:
- eden: where objects allocation happens
- s0, s1: survivor space
- minor GC on young
- copy live objects of eden and s(i) to s(1-i), sweep eden and s(i) space
- promote long-lived objects (i.e. survived over 8 minor GCs) to old gen
- major GC on old