Dataflow-Based Query Caching with Readyset



Summary


Alana Marzoev discusses the fundamentals of streaming dataflow and the architecture of ReadySet, a streaming dataflow system designed for operational workloads.

Bio

Alana Marzoev is one of the founders of ReadySet, the company commercializing Noria, a transparent SQL caching solution for relational databases. Before starting ReadySet, Alana was a PhD student at MIT CSAIL where she worked on databases and distributed systems. Prior to MIT, she did research in the Systems & Networking group at Microsoft Research and worked on the Ray project at UC Berkeley.

About the conference

InfoQ Dev Summit Boston software development conference focuses on the critical software challenges senior dev teams face today. Gain valuable real-world technical insights from 20+ senior software developers, connect with speakers and peers, and enjoy social events.

Transcript

Marzoev: I'm Alana. I'm one of the founders of Readyset. I'm going to be telling you about a computational model called partially-stateful streaming dataflow. Specifically, I'll share more about how the team at Readyset has been leveraging this model to build a high-performance database cache. Partially-stateful streaming dataflow is actually a relatively new concept. It originated in the computer systems research community a couple of years back, over at MIT. The main idea is that you can represent computations like SQL queries, as these long-running computational graphs, and essentially stream data through those graphs to incrementally update the results, and do so in a very memory efficient way.

The team at Readyset has been using this to build a database cache that has the same performance profile as a cache that you've spent a lot of time hand optimizing based on an in-memory key-value store, but that alleviates a lot of the pains that typically come with caching. Specifically, having to rewrite your business logic to now suddenly reference the cache, and then, more importantly, keep the cache state in sync with the state of your database as the data in your database changes.

Scope

I'm going to start by describing the problem domain in a little bit more depth. I'll give you a sense for the types of workloads and applications that this approach to caching is best for. Then I will do a live demo of Readyset. You can get a sense for what dataflow-based caches, and Readyset in particular, look like in action. We'll cache a query and see some cool stuff there. Then I'm going to do a deep dive into dataflow-based caching. The way I'll approach this is essentially layer of abstraction by layer of abstraction, like peeling the layers of an onion. I'll start with the highest level. We'll talk through the interface. I'll make a lot of the design decisions that were implicit in the demo more explicit.

You'll get a sense for what it's like to actually program against and operate a dataflow-based cache. Then we'll discuss the system architecture, and I'll do this mostly by going through the life of a query. We'll both set up a cache for a query and then talk through what serving traffic from that cache actually looks like. Then I'll do a deep dive into partial materialization, which is the star of the show. Specifically, I'll share how you're able to use partial materialization to solve cache staleness issues in an automated way. Then, finally, I'll wrap up by comparing dataflow-based caching to other common alternatives for database scaling, so traditional caching based on in-memory key-value stores, query optimization, read replicas, all this.

Problem Domain

Problem domain. Caching is useful for a large number of types of applications, but in this talk, I'm focused on specifically caching queries from relational databases. I'm focused on database backed applications. I'm also going to be really narrowing in on web apps in particular, because web apps tend to have a set of shared characteristics that make them particularly amenable to caching. Concretely, they're highly latency sensitive. Not super surprising, there's people on the other end of the screen waiting for their pages to load. If that's slow, they're unhappy, they churn. Obviously, this problem's a lot harder nowadays, because most companies leverage ORMs, and ORMs generate both large numbers of redundant queries as well as particularly inefficient ones, unintentionally.

Even though they're a boon for productivity, they make scaling a lot harder because there tends to be a lot more contention at the data layer, even without necessarily having that many users. Web apps also have a lot of variance in terms of the traffic volumes that they serve. I think the canonical example of this is Black Friday. Companies will spend the other 11 months of the year just capacity planning for Black Friday. Of course, you see this dynamic at smaller scales and in other contexts as well. If you're building a B2B SaaS app, then most of your usage is going to be during business hours, not on evenings or weekends. If you're building an e-commerce site, then you might have a flash sale and see more users or more customers that day than on many of the days prior.

Maybe there are some geographic patterns as well, where in certain regions, like on a booking site, you might get an uptick in hotel bookings or Airbnb bookings because it's summer, it's peak season for that location. Having your database in a state where it's able to deal with the traffic that you see at this high watermark, even if the high watermark is orders of magnitude higher than the low watermark, is incredibly important for user experience, because obviously these high watermarks correspond with really important business events. Websites are also really read-heavy: there are just more reads than writes. A 90-to-10 or an 80-to-20 read-write ratio would both count as read-heavy, in my opinion.

To give you a real-world example, if you've been on the website, Hacker News, the tech forum, most people on Hacker News, they're just looking at the front page, maybe clicking the top link. Most people aren't actually posting new things to Hacker News. Great example of, this is a very read-heavy workload. Finally, websites tend to have skewed data access distributions. When users request data, they're not just pulling rows from your table at random. There's really some sense of, some data is more popular. Some data is more recent. That's going to be what's shown to the users. In the wild, this is often governed by Zipf's Law, or the Zipfian distribution, which is this idea that the nth most common term is 1 over n times as popular as the most common term.
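To make that drop-off concrete, here is a tiny illustrative sketch (the catalog size is made up) that computes the share of requests covered by caching only the top few items under a Zipfian distribution:

```python
# Illustrative sketch of a Zipfian access distribution (made-up catalog size).
# The nth most popular item gets a weight of 1/n; normalizing those weights
# shows how quickly request share concentrates in the most popular items.

def zipf_share(num_items: int, top_k: int) -> float:
    weights = [1.0 / n for n in range(1, num_items + 1)]
    return sum(weights[:top_k]) / sum(weights)

if __name__ == "__main__":
    # With 10,000 items, the top 100 (1% of the catalog) already absorb
    # roughly half of all requests under this distribution.
    for k in (10, 100, 1000):
        print(f"top {k:>4} items -> {zipf_share(10_000, k):.0%} of requests")
```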

To bring this back to our Hacker News example, this is like, most people are seeing the front page, and then maybe a small fraction of them are making it to the second page, and then a small fraction of that group is making it to the third page. There's really a steep, almost exponential drop-off in terms of popularity of items. That makes this really great for caching. To summarize, applications that are highly cacheable are latency sensitive, have high variance in load, are read-heavy, and have skewed data access distributions, or some combination of these factors.

Again, if that is your application, then caching is going to be a great bet. I think caching is actually a choose your own adventure type of solution. When I'm talking about caching, I'm going to specifically be referencing read-through caching. The idea is that you have an in-memory key-value store off to the side somewhere from your database. Whenever there's a database read, you're going to first check the cache to see if that item is in the cache. If it is, great, you'll just return it from the cache. That'll be a really fast read, it'll be a cache hit. If not, we're going to just trigger a request to the database to run that query and then store the results in the cache, and then return it. It's very simple. It's like read-through caching. Let's take a look at what this actually might look like in code.

We'll do a little before and after type of moment. I just found this code snippet on the internet, it was on some Python SQLAlchemy and Redis tutorial, but it illustrates my point. Here, there's this Python function, fetch_user_from_database, takes user ID as input. All it's doing is opening up a DB session using SQLAlchemy to effectively generate a SQL query that's pulling some user data and then returning it. It's super simple, simple as can be. When we start to modify our code to incorporate caching, read-through caching in particular, suddenly there's 2x the code. Now we have this helper function, get_user_from_cache, that's implementing this read-through logic that I mentioned previously.

It's checking if the user info is in Redis. If it is, great, returns that. If not, it's running that other function that I just showed you to fetch from the database. You're like, I have to write a helper function. Of course, this is just one place in your code where you might want to use caching. If you had dozens of things that you wanted to cache, it might be a little bit annoying to have to write all these helper functions.
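The snippet being described looks roughly like the following reconstruction. It is a sketch, not the exact tutorial code: the connection details, the session handling, and the User model are assumptions.

```python
import json

import redis
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine("postgresql://app:secret@db.internal:5432/appdb")  # assumed database
cache = redis.Redis(host="localhost", port=6379)                          # assumed Redis instance
CACHE_TTL_SECONDS = 300

def fetch_user_from_database(user_id):
    # Plain database read: open a session and pull the user row.
    # `User` is an assumed SQLAlchemy model with id, name, and email columns.
    with Session(engine) as session:
        user = session.query(User).filter(User.id == user_id).one()
        return {"id": user.id, "name": user.name, "email": user.email}

def get_user_from_cache(user_id):
    # Read-through logic: check Redis first, fall back to the database on a miss.
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                             # cache hit
    user = fetch_user_from_database(user_id)                   # cache miss
    cache.set(key, json.dumps(user), ex=CACHE_TTL_SECONDS)     # populate for next time
    return user
```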

Is that it? Is your problem solved? Will your database stay up now? Can we just go home? The answer to that question is, unfortunately, no, because as annoying as having to mess up your beautiful, clean business logic with caching code is, the real hard part of caching is actually keeping the state in the cache in sync with your database in a way that essentially ensures that you're not bothering your users or causing a database outage. I'll elaborate on that. For every query that you want to cache, you're going to have to first sit down and really think through what the user is expecting from you around data freshness. Then, from there, you're going to have to design an invalidation policy that you believe will guarantee that for the user.

Then, finally, you're going to have to implement it without causing any distributed systems bugs. Harder than it sounds. Again, we'll elaborate. The way that I'm going to explain this in greater depth is by walking through two examples. On the left here we have Facebook, and specifically, the Facebook News Feed. On the right we have the checkout window of an e-commerce store. In Facebook, let's say that you have a friend who went out to brunch and they posted photos from their brunch on Facebook at time point t, and then you don't see the photos on your Facebook News Feed until time point t plus 10. Ten minutes have gone by.

In most cases, you'll be none the wiser. You have no idea that your friend posted photos from brunch. The news feed's not even chronological anymore. It hasn't been for a decade. It's like random algorithmic, whatever. In this case, it really doesn't matter if your caching logic is really simple, naive. There's just a TTL. It's going to be ok. Contrast that to the case on the right, where you have an e-commerce app, and let's say your goal in particular is to just buy a t-shirt on the internet, and you already know which t-shirt you want. You put it in the shopping cart, and it takes 10 minutes for it to show up. If you're anything like me, you're not going to stick around for 10 minutes refreshing the page, hoping that the cache will be cleared, or whatever.

You're going to go on Amazon or buy it from some other competitor, because nobody has the time to sit around. That's an example of how user expectations around data freshness can really vary across apps and also between different functions within a given app. In Facebook, there are loads of situations like the checkout example that we could talk about as well, but I thought this illustrates the point better.

Let's say you've done this. You've sat down, thought about it. You're like, I have a good sense for what users need here. Now you have to actually come up with an invalidation policy. The tricky thing here is that you really are on your own. There's no automation around this. You could set a TTL, but TTLs tend to be overly simplistic, because if you frequently have writes, then the data will really quickly become stale. If you set the TTL to be too short, then you're continuously just running queries against your database. At that point, it's like, why even bother using a cache? You're just rerunning your queries against a database. Your cache hit rate will be really low. You could set a TTL, but if you want your cache to actually do something for you, you might have to do something a little bit more sophisticated.

The more sophisticated you get, the more likely it is that you'll mess something up. Caching is really prone to certain types of distributed systems bugs. I think thundering herds are one great example of this that are really easy to introduce and are really catastrophic when you do. A thundering herd is when you essentially have a popular key in your cache, and there's a long queue of people waiting to read that key, and that key gets evicted. Then all of those requests get sent to the database simultaneously, and your database was not right-sized for this sudden traffic spike, and it just takes it down. Then you're in a situation where you invested all this time and effort into using caching, and you just shot yourself in the foot effectively. This dynamic is why we have this cheeky quote, that there are only two hard things in computer science, cache invalidation and naming things. It's because caching seems like a great idea in theory. Like, "It's so simple. We pre-compute some results. How hard could it be?" Then, in practice, it's actually quite challenging to get right, unfortunately.
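To make the failure mode concrete, here is a small, purely illustrative simulation of that dynamic (the cache, the database call, and the numbers are all stand-ins): a hot key is gone from the cache, and every concurrent reader misses at the same moment and goes to the database.

```python
import threading
import time

cache = {}          # stand-in for an external cache (e.g. Redis)
db_queries = 0      # how many requests fell through to the database
counter_lock = threading.Lock()

def query_database(key):
    global db_queries
    time.sleep(0.2)                 # simulate a slow, expensive query
    with counter_lock:
        db_queries += 1
    return f"result-for-{key}"

def read_through(key):
    if key in cache:                # cache hit: cheap
        return cache[key]
    value = query_database(key)     # cache miss: goes to the database
    cache[key] = value
    return value

# The hot key has just been evicted, and 500 concurrent readers all miss at once.
threads = [threading.Thread(target=read_through, args=("front-page",)) for _ in range(500)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"the database was hit {db_queries} times for a single key")   # far more than once
```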

If we were to do a cost-benefit analysis here just for fun, then I think that the pros would be like, with caching, you get, essentially, best case tail latencies, because you have an O of 1 lookup for a cache hit, into an in-memory key-value store. It's going to be a lot faster than actually doing work at the time that you run the query. You're doing all of the work beforehand. That's nice. There's also a pretty straightforward path to scaling, at least if you contrast it to sharding your database. This is going to be a lot simpler.

On the other hand, there's lots of things you're going to have to watch out for. You're going to have to rewrite your application, potentially in a pretty substantial way. Caching generally is error prone. I talked about all of the different levels at which you can get caching wrong. Then when you do, the bugs are fairly obvious to your end users. On top of that, conventionally, caches are not isolated from the database; they're tightly intertwined. If something goes wrong with the cache and you introduce a distributed systems bug by accident, which, again, is really easy to do because the difference between code that triggers that bug and code that doesn't is really minor, then you could actually take down your database, which defeats the point of the whole endeavor.

Dataflow-Based Caching: Demo

I'm now going to transition into talking about dataflow-based caching. As I mentioned previously, I'll start off by just showing you a demo. Here I have an EC2 instance and two Docker containers running on the EC2 instance. One of them has just some Postgres database in it, and the other has Readyset, which is the dataflow-based cache that I'm going to be telling you about all throughout. Here I'm using psql as a Postgres client to connect to this connection string. This is the Readyset connection string. Readyset has a Postgres-esque connection string that I can use psql to connect with. Dataset-wise, we're going to be working with the IMDb dataset. It's what it sounds like. It's like IMDb data about movie ratings, that sort of thing.

The query that we're going to be working with, the natural language interpretation of it is how many movies filmed after the year 2000 had over a 5-star rating on IMDb. This is how that query looks in actual SQL. You can see there's a count, there's a join, but in the grand scheme of things, it's fairly simple. I ran the query. I'm actually going to turn on timing so we can see how long various operations take. I ran the query with timing on, and we can see that it took 42 milliseconds to run. We're like, it's a little slow. Maybe we should cache it. The way that I'm going to do this is by prepending, CREATE CACHE FROM, and the query string. Remember, we're connected to Readyset. This is telling Readyset we're creating a cache from this query. The cache was created.

Now I'm going to run that same query a bunch of times to essentially warm up the cache and then see what the new latencies are. Ran it a few times. Cache was warm. Now it takes 1 millisecond. Before it took 42, now it took 1. The result is just living in memory. That's pretty neat. Let's now see what happens when the underlying data changes, because I think that's the interesting part. Let's say there is a movie that I really like but the rest of the world hates. Historical Revisionism, I'm going to say this movie deserves a 6.0 rating. Now I'm going to run the same query again, and it's still being cached.

We can see that before, the answer was 2418. Because I actually think this movie deserves a higher rating, since it was quite good, I changed that. I updated that row, and then I reran the same query, and you can see that we have the new value, the refreshed value that accounts for that. We can also see that it's still cached. It took 0.89 milliseconds, so it's even faster now. I'm not actually making a point about it being faster, that's noise. It wasn't a cache miss. It's not like we reran this query from scratch to get the update. It was, in fact, incrementally updated. We took the old value, 2418, and we repaired it to reflect the new data. We added 1. That's why it's still in the cache, and it was fast.
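The demo roughly corresponds to the following sequence, shown here from application code rather than psql. This is a sketch: the connection string, the table and column names (title_basics, title_ratings, and so on), and the row being updated are all assumptions about the IMDb-style schema, and the exact SQL on the slide may differ.

```python
import psycopg2

# Readyset speaks the Postgres wire protocol, so a normal Postgres driver works.
# Host, port, and credentials are assumptions for this sketch.
conn = psycopg2.connect("postgresql://postgres:readyset@127.0.0.1:5433/imdb")
conn.autocommit = True
cur = conn.cursor()

# Assumed reconstruction of the query on the slide: movies after 2000 rated above 5.
QUERY = """
    SELECT count(*)
    FROM title_basics
    JOIN title_ratings ON title_basics.tconst = title_ratings.tconst
    WHERE title_basics.startyear > 2000
      AND title_ratings.averagerating > 5
"""

cur.execute(QUERY)                            # first run is proxied to Postgres (~42 ms in the talk)
print(cur.fetchone())

cur.execute("CREATE CACHE FROM " + QUERY)     # tell Readyset to cache this query

cur.execute(QUERY)                            # warm cache: now served from memory (~1 ms in the talk)
print(cur.fetchone())

# A write to a base table is proxied to Postgres; Readyset sees it on the
# replication stream and incrementally repairs the cached count.
cur.execute("UPDATE title_ratings SET averagerating = 6.0 WHERE tconst = %s", ("tt0000001",))

cur.execute(QUERY)                            # still a cache hit, and the count reflects the update
print(cur.fetchone())
```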

Dataflow-Based Caching: Interface

Now I'm going to essentially spend the rest of this presentation telling you about how that worked, because it might be a little mysterious at the moment. I'll start by talking about the interface. The first thing to know is that Readyset is set up as a read replica. Concretely, there is a local copy on Readyset of any base tables that are referenced by cached queries. If you're caching just one query, and that query references tables a and b, then we need to have a local copy of a and b, specifically in RocksDB on the Readyset instance, for this whole thing to work. Notably, we don't need tables c through whatever that you might also have in your database. We only need the ones that are actually used for caching. We're going to snapshot those tables, and then we're going to open up a replication slot on your primary database to get updates to those tables.

As writes are happening, we're getting a sense for how the data is changing over time. We support Postgres and MySQL. If you're using Readyset as a Postgres cache, then this would be done via logical replication. If you're using it as a MySQL cache, row-based replication. Readyset is also wire compatible with Postgres and MySQL. That's how we were able to get away with using psql to connect to it. If you're already using some database driver, client, or ORM, then you should be able to use your Readyset connection string interchangeably with a connection that goes directly to that database, without having to change code in any way. Obviously, there's not one-to-one feature parity. It's not like we support every single Postgres feature or anything like that, but you can still connect to it, and you can cache SQL-92-esque operations using Readyset.

The next thing to note is that caches are specified explicitly, so you have to actually tell the cache that you want certain queries to be cached. It's not going to automatically do it for you. The way that you do this is via these DDL equivalents for cache management. We saw this in the demo. I was like, CREATE CACHE FROM query string, and that's how we did it. That's what I'm referring to here. Notably, when we're caching in Readyset we're caching prepared statements. If you have a bunch of queries, and the only difference between those queries is that certain literals are different, then those queries could be parameterized, and they will all be resolved by the same query cache in Readyset. We're not creating new caches for every possible instantiation of structurally similar queries. Rather, we're dealing with prepared statements.
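To make that concrete, here is a toy sketch of the idea. Readyset does this with a real SQL parser, not a regex, and the table, columns, and placeholder syntax here are assumptions: structurally identical queries that differ only in literals resolve to the same parameterized cache.

```python
import re

# Illustrative only: queries that differ only in their literals collapse into
# one prepared-statement cache, keyed by the parameter value.

CACHED = {"SELECT id, title, url FROM stories WHERE id = ?"}   # created via CREATE CACHE FROM ...

def normalize(query: str) -> str:
    # Toy normalizer: replace integer literals with a placeholder.
    return re.sub(r"\b\d+\b", "?", query)

for q in ("SELECT id, title, url FROM stories WHERE id = 1",
          "SELECT id, title, url FROM stories WHERE id = 2"):
    target = "served from the cache" if normalize(q) in CACHED else "proxied to the database"
    print(f"{q!r} -> {target}")
```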

Finally, any queries that you have not cached explicitly, as well as anything that can't be cached, like writes, for example, will just get proxied to your primary database. If you remember, in the demo, we had the Postgres instance and the Readyset instance, and Readyset was connected to Postgres. At first, we didn't cache anything, and so, effectively, we proxied the query that we were working with to Postgres, and it continued to be proxied until we specifically cached it. This is nice, because this gives you fine-grain control over what's cached or not while still using the same connection string. If you have already invested in the infrastructure required to deal with multiple read replicas and separate read-write connections, then you can totally just use your Readyset connection string as another read replica. If you haven't, and you don't want to do that, then you can get away with just using a single connection string, but still have an interface to your database. It's not like you have to cache everything or anything like that.

Life of a Query

Now I will tell you about the system architecture via the life of a query. The way I'll do this is I will first walk you through a bird's eye view of the architecture, and I will fairly quickly step through the various components and what they do, and how they relate to each other. Then, after that, I will slow down and walk you through, more slowly, the process for creating a cache and what's actually happening behind the scenes, as well as the process for actually serving traffic from your cache. Bird's eye view. Up top here, we have the application server, which has the business logic, which is presumably using the ORM. The ORM, in turn, is connecting to the database via some DB client, like a driver of sorts. Although the actual logic in your application is unchanged, you have changed the connection string in your application that previously pointed to your database, so that it now points to Readyset instead.

Specifically, you'll be connecting to the Readyset adapter component here in purple. The first thing that's going to happen once that query or whatever reaches the adapter, is that it will hit the SQL shim component that is responsible for decoding from the binary protocol of the underlying database that your application is assuming that you're talking to, and converting that query string into an internal representation of that query. Essentially, like a relational algebra type of representation that Readyset can work with internally. This is database agnostic. It's just like the raw SQL-esque type operations. This is like our intermediate representation. Then the SQL shim, after doing that, will pass it along to the Readyset client, and the Readyset client will be essentially responsible for deciding whether that query or whatever should be sent to the database or to Readyset.

From there, it will either use the database connector to proxy to the database and return the result back up the chain, or otherwise, it'll use the Readyset connector to send it to the Readyset server. The Readyset server has a specific component called the reader. The reader is responsible for actually storing all of the cache query results and returning them back to users who asked for them. Behind the scenes, the data in your database is likely constantly changing, and so those updates are being sent to Readyset, specifically to a component in the server called the replicator. The replicator is responsible for updating the local copy of these tables on disk, and then emitting those row updates or whatever, through the dataflow graph component, which is responsible for incrementally updating the old cached values to reflect this new change. Then storing that new value in the reader. It's doing this eagerly.

First, I'll talk about setting up a cache. The first thing that you're going to want to do is swap out the database URL in your application to point to Readyset instead of your database. Here I'm assuming that you're not going to use Readyset as a read replica. Mind you, if you wanted to do this in production, you might want to use Readyset as a read replica instead. If you're just playing around, or you don't have read replicas set up that way, you can just change your connection to point to Readyset. The specific way that you'll do this will, of course, vary depending on the programming language, the framework, and the ORM that you're using. Generally speaking, we give you this wire compatible connection string that you can use as you wish. If Readyset is a cache for a Postgres instance, it'll be a Postgres connection. If it's a cache for MySQL, it'll be a MySQL connection.
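Concretely, the swap is just a configuration change. A sketch for a SQLAlchemy app (hostnames, ports, and credentials here are assumptions, and the same idea applies to any driver or ORM):

```python
from sqlalchemy import create_engine

# Before: the app talks straight to Postgres.
# engine = create_engine("postgresql://app:secret@db.internal:5432/appdb")

# After: point the same code at Readyset instead. Readyset is wire compatible,
# so nothing else in the application changes; uncached queries and all writes
# are proxied through to the primary database.
engine = create_engine("postgresql://app:secret@readyset.internal:5433/appdb")
```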

From there, you have to decide which queries you want to cache. There are many ways you could do this. To name a few, you could check the database slow query logs. You could maybe check your APM, if you're using Datadog, or New Relic, or something like that, to see which queries could be good candidates. Readyset will show you a bunch of stats about any queries that you run through it. It will show you things like the number of times that query was run, the tail latencies that it saw, the total CPU times. You could also use Readyset's built-in monitoring interface to get some sense for what could be cacheable or a good candidate for caching. You don't have to use that. You can just do whatever you want.

You have to just, at some level, figure out what queries you want to cache. Once you've decided, you have to actually create the caches. You do that via running those CREATE CACHE FROM statements using the SQL extensions that I talked about. That's it. Once you run CREATE CACHE FROM, and then the query string, it's not going to be literally immediate. We do have to actually construct a dataflow graph. We have to sometimes index data in RocksDB. If you have really big tables, that could take a few minutes. Typically, it takes a couple of seconds.

Let's say you've done that, you've waited a couple of minutes, and your queries are ready, now I'm going to talk about how we actually serve traffic. User comes along, requests some data. You know that your app is connected to the Readyset adapter now, so that query is getting sent there. Those queries will be decoded from the binary protocol to this internal representation that I was telling you about. Then that representation will then be passed to the Readyset client. The Readyset client is essentially just pattern matching against the caches that it knows exist on the Readyset server, and it's seeing, is there a match? Remember, we're doing this at the prepared statement level. If the query matches a prepared statement cache, then that means it should indeed go to Readyset.

If not, we assume that it should be proxied, because maybe we've never seen it before. In the case where it's going to Readyset, it'll get sent to the reader node. The reader node will just do a key-value lookup. It'll return the results. If it gets sent to the database, the database will run the query, return the results. Get sent to the user. In the background, the cache is continuously being updated to reflect the latest state of the world, as indicated by your database. We have this replication slot open on your database, and we're receiving these data changes in the replicator component of the Readyset server. That component, again, is updating the local tables that we have in RocksDB, and then it's propagating those changes through the dataflow graph component. The dataflow graph component is responsible for doing this incremental update operation that I keep referencing. Then, once it has this new value, it'll replace the old value in the reader with that new value.

Partially-Stateful Streaming Dataflow

The question of the hour is, how does dataflow work? Because we have this fancy incremental update thing, what are we actually doing? That brings me to partially-stateful streaming dataflow. It's the thing that I was telling you about right at the beginning of my presentation. The way that I'll go about explaining this is first by explaining what streaming dataflow is, and then I will explain the partially-stateful component, just for the sake of completeness. Dataflow computing is actually a fairly overloaded term in computer science, but the version that I'm referring to, you can think of as being stream processing. Concretely, you are representing computations as a directed graph. The nodes of the graph are operators that perform the computation on incoming data. Data is flowing through the edges. In our use case of database caching, it's coming from the replication stream from your database. Those are the changes that are flowing through the graph.

To give you a toy example, to just illustrate the concept, let's forget about database caching for 30 seconds, and let's say that our goal is to compute this basic arithmetic operation. We want to multiply x times y and we want to add z to the result of that. This is what the dataflow graph would look like. We have two nodes in the graph; they're the gray ones. We have the multiplication node and we have the addition node. The multiplication node takes x and y as inputs, so this is the data flowing through the edges, and emits the result, in this case, x times y. It emits the result out of its outgoing edge, which, we can see, goes into the addition node. The addition node will then add this result, x times y, to z, and emit it. Then that's just the final result. This is what a dataflow graph would look like. It would be long running. It's not just this ephemeral computation or whatever. This is a system that's responsible for doing this.
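As a sketch of that toy graph in code (not Readyset's implementation, just the concept): each node is a long-lived operator that recomputes and re-emits its output whenever a new value arrives on one of its incoming edges.

```python
# Minimal illustration of the (x * y) + z dataflow graph described above.

class Node:
    """A long-running operator: recomputes when any input edge delivers a value."""
    def __init__(self, name, fn, arity, downstream=None, port=None):
        self.name, self.fn, self.arity = name, fn, arity
        self.downstream, self.port = downstream, port
        self.inputs = {}

    def receive(self, port, value):
        self.inputs[port] = value
        if len(self.inputs) == self.arity:          # all inputs seen at least once
            out = self.fn(*(self.inputs[p] for p in sorted(self.inputs)))
            print(f"{self.name} emits {out}")
            if self.downstream is not None:
                self.downstream.receive(self.port, out)

# Wire up:  (x * y) --> (+ z)
add_node = Node("add", lambda a, b: a + b, arity=2)
mul_node = Node("multiply", lambda a, b: a * b, arity=2, downstream=add_node, port="left")

# Stream values in along the edges.
mul_node.receive("x", 3)
mul_node.receive("y", 4)        # multiply emits 12, which flows into the add node
add_node.receive("z", 5)        # add emits 17

# If x later changes, the new product flows through and the sum updates too.
mul_node.receive("x", 10)       # multiply emits 40, add emits 45
```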

When I think about dataflow computing/stream processing, I like to explain it in contrast to batch processing. Batch processing, the way I think of it, is like normal computing, in a way: you're ingesting data for some time, and at some point you have a dataset and you're going to run some compute over it. Of course, there will be a delay between maybe when some of these data entries were collected, and when you run that computation, which means that the computation, if you're running this periodically, will be stale by some amount of time. Contrast that to stream processing, where, essentially, you're ingesting data continuously, and you're running the computation continuously, like all the time, eagerly.

Every time you get a new data point, you're running some computation to output a new result, and so the response is immediate. Let's bring this back to our domain of database caching, and let's talk about what it would look like to represent queries as a dataflow graph. It looks similar. The only difference now is that, rather than having arithmetic operations, now we have relational algebra operations. We can do SELECTs, PROJECTs, JOINs, like any aggregate, whatever, any of the constituent operations of a SQL query. As input, we're taking data from your database, and data changes to those tables over time. As those writes occur, those row updates will be propagated through the dataflow graph and will be passed into these relational algebra operators.

I'll make this more concrete, and we're going to come back to Hacker News as an example. Imagine that we have two tables. There's the stories table and the votes table. The stories table just has the story ID, the author, title, URL. The votes table is just a mapping of the username to the ID of the story that they voted for. Let's say that we had this query here, which is essentially just returning all the story information and metadata, and joining it with the vote count for that story, and it's a prepared statement. Notice the question mark in the query. We're taking a story ID as input, and when we pass in that input, we'll return all of the story info plus the vote count for that story. This is how it might look in dataflow. The dataflow part is the purple part. Over here, we have a local copy of the votes table and the stories table, and that's on disk. In memory, we have this dataflow graph, and it has a count node and a join node. The count node is taking input from the votes base table; the join node is taking input from the stories base table, as well as the count node. We have this reader node, which, if you remember from the architecture part of this, is responsible for returning the cached query results to the user.

To enliven this a bit. We'll look at how writes and reads work and how things flow through the system. This is the same graph as before, but now we actually have tables instead of this abstract blob representation. Let's say that Justin votes for story with ID 2, that's just an insert into the votes base table. That row insert is going to be propagated first to the count node, which will be responsible for updating the old vote count to reflect the fact that now Justin voted for story 2. Before, the vote count for story 2 was 580, now it's 581. The count node will propagate that new result to the join node. The join node will be like, I will update my old vote count to reflect this new vote count.

Then I will store that in the reader so that when users come around asking for the info for story 2, it will be there and it will be updated. This is happening eagerly in the background. Now let's talk about reads. Here's the query again. Now we are going to pass in a parameter, specifically parameter 2. We're going to compute the query result for story ID 2. All we have to do for that is go to the reader node and do a key-value lookup for ID 2. This is in memory, so it's really fast. We're just doing a little key-value lookup, just like with normal caches.
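A sketch of that write path and read path in code. This is not Readyset's actual implementation, just the shape of the idea: the count operator keeps per-story counts, the join operator combines them with story metadata, and the reader is the in-memory key-value map that reads hit. The table contents and story IDs are made up.

```python
# Toy model of the Hacker News dataflow: votes -> count -> join(stories) -> reader.

stories = {2: {"id": 2, "author": "alice", "title": "Some story", "url": "http://example.com"}}

reader = {}              # what cache hits read from: story_id -> row with vote_count
vote_counts = {2: 580}   # state held by the count operator

def on_vote_inserted(story_id):
    """A row lands in the votes base table and flows through the graph."""
    # Count node: incrementally bump the old count instead of re-counting all votes.
    vote_counts[story_id] = vote_counts.get(story_id, 0) + 1
    # Join node: combine the new count with the story metadata.
    row = dict(stories[story_id], vote_count=vote_counts[story_id])
    # Reader node: overwrite the cached result so future reads see it.
    reader[story_id] = row

def read(story_id):
    """A cache hit is just a key-value lookup into the reader."""
    return reader.get(story_id)

on_vote_inserted(2)     # Justin votes for story 2: 580 -> 581, eagerly pushed to the reader
print(read(2))          # {'id': 2, ..., 'vote_count': 581}
```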

Next, I'll talk about how efficient this is, as currently posed. Let's talk about cache hits first. In the case of a cache hit, both this approach and a "normal cache" are pretty much doing the same thing. They're doing a key-value lookup inside an in-memory data structure, so it's going to be pretty fast. That's good. No big change between the two. Let's think about cache updates. The data in your database has changed, and you want that to be reflected in the cache, so you have to update it in some way. If you're doing this traditional read-through caching approach that I talked about before, the way that this is probably going to work is you are going to evict the old value in the cache and then recompute it by rerunning that query against your database.

Presumably, if you thought to introduce caching, then either the query was slow, it was complicated, or you didn't want to upsize your database to deal with running that many queries. This could potentially be a slow operation, and it could potentially increase the load on your DB in a way that you don't want. Compare that to dataflow-based caching, where, essentially, we are incrementally updating the old cache result, which means we don't have to do an expensive recompute. We don't have to compute an aggregate starting from scratch. That's nice. It's less compute intensive. Importantly, with dataflow-based caching, we have our own query engine. It's this whole dataflow graph mechanism. We don't actually have to use your database's query engine. We can just figure out the result ourselves, even if it's not currently there. That's pretty cool. You don't run the risk of thundering herds. That's a nice plus. You're not going to accidentally take down your database using this approach.

Now let's talk about memory efficiency. This one's the kicker. With traditional caches, it is common knowledge that you should allocate at least enough space for the working set of your application. If you don't, you're going to experience something called thrashing, where essentially you are continuously swapping in and out frequently accessed cache keys because you haven't allocated enough space. Your cache hit rate is going to be artificially low because of that, because there's something that should be cached, but it can't be because there's not enough space for it. With dataflow-based caches, not only do you have to account for that working set, but you also have a ton of intermediate state in your graph. If we look at this graph again, the reader node, that's the working set, essentially. If you look in the purple, all of that's in memory, and you're like, that's actually a lot to put in memory. I have this join and I have this count. I'm really just worried about the join in this case.

Generally speaking, the intermediate state size will depend a lot, of course, on how big the tables that you're dealing with are to begin with. It will depend on query complexity, because if you have a 10-way join, then suddenly you're going to have 10 join nodes that all have a pretty heavy memory footprint. That wouldn't be good. It's also going to depend on the data distribution, because in some cases, you might be filtering 99% of your data out right as you're pulling it into memory, and that would be ok. If you're not, then suddenly maybe you have an expensive 10-way join or something. Then this could be a pretty pathological memory blowup.

Really, if I come back to the motivation behind this talk, there's this idea of partial state or partial materialization that will specifically enable us to use streaming dataflow in this context for database caching. Without it, it's very easy to run out of memory and have just a ridiculous memory footprint for caching fairly simple queries. It's not worth it. I'm going to explain how this works. The main insight is that the full result set for the queries doesn't need to be cached, because, again, data access isn't random; users aren't requesting rows at random from our table, but rather there's really some notion of popularity and recency that's going to influence what they ask for. To bring this back to Hacker News, I don't need to preemptively compute and store the results for any given story ID. We know that there's going to be an exponential drop-off in what stories people are looking at. It would really only make sense to cache the most popularly requested items. That's what we're going to do, and we're going to do it lazily.

Essentially, the main idea of partial materialization is that the cache will start off empty, and we're going to fill it on demand, lazily. We're going to do it via a mechanism called an upquery. I will show you a pictorial diagram of what an upquery looks like. The main idea is that the cache is starting off empty, it's starting off cold, users will request specific results via specific input parameters, and then we will lazily compute the missing results. Here's what it would look like. We're going to do the same read as before, so we're going to be getting the result for this query for story ID 2. Here's what the graph would look like at startup. You can see it's just empty; those things in gray over there are on disk. The things in purple are in memory. It's empty. A user comes around and requests the story with ID 2, or the result for story ID 2, and that's getting sent to the reader as it was before. The reader is going to be like, I don't have it. It's going to recursively traverse up the graph to find the closest parent that has the missing value.

Then, once it finds that parent, it will ask the parent to replay all of the missing data that it needs to be able to compute the result. In this case, the whole graph is empty, so it's going to ask the join node. The join node will be like, I don't know, let me ask my parents. Then, same with the count node. By the time you get to the base tables, which are stored locally on disk, the base tables will necessarily have the data that you're looking for, because they have all of the data. They're like a slightly out of date copy of your primary database that's being replicated continuously. It's like a read replica. The votes base table, in this case, would be like, I have the data for story 2. I'm going to replay all of the rows that had story ID 2 in them. Then the count node will recompute the vote count, given all of those inputs. It will simply count all of the votes, in this case. Then it will send the result to the join node. The join node would have, in parallel, sent off a request to the stories base table node to get the other story data. Then it will finally place that result in the reader and return it to the user that requested it. This is like a cache miss.

Although this took a little bit more time because we had to do all of that work, we're saving so much on memory. Before, we were storing the query results for any possible story ID. Now we're just storing them for this one. Coming back to Hacker News, maybe we only want to cache the first three or four pages, and that will get us 98%, 99% of the way there. We're able to really not have as hefty of a memory footprint, but we're still able to get this nice mechanism where the cache is automatically being maintained for us. It's automatically syncing up with the database without us having to do anything, and without having to evict and recompute and artificially lower the cache hit rate, any of that.
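A sketch of the upquery path, again illustrative rather than Readyset's real code: the reader starts empty, a miss walks up to the base tables on disk, the operators compute the missing result on the way back down, and the reader keeps it for subsequent reads.

```python
# Toy partial materialization: the cache starts cold and fills itself on demand.
# The base tables stand in for the local RocksDB copies; everything else is in memory.

stories_table = {2: {"id": 2, "author": "alice", "title": "Some story", "url": "http://example.com"}}
votes_table = [{"user": u, "story_id": 2} for u in range(581)]   # 581 votes for story 2

reader = {}   # starts empty: nothing has been requested yet

def upquery(story_id):
    """Recursively fill in the missing result from the base tables."""
    # The count node asks the votes base table to replay the relevant rows...
    count = sum(1 for v in votes_table if v["story_id"] == story_id)
    # ...and the join node asks the stories base table for the metadata.
    row = dict(stories_table[story_id], vote_count=count)
    reader[story_id] = row          # materialize just this key
    return row

def read(story_id):
    if story_id in reader:
        return reader[story_id]     # warm or repaired: plain key-value lookup
    return upquery(story_id)        # cold or evicted: compute lazily, then keep it

print(read(2))   # first read misses, triggers the upquery, and caches the result
print(read(2))   # second read is a hit
```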

What happens, even with partial materialization and all the space savings we're getting from that, when we start to deal with memory pressure? It will happen eventually, in most cases. Here we actually just handle things the normal way. We'll just start evicting. You can evict via LRU, LFU, and so forth. Notably, if, later on, a user requests a value that you evicted, we'll just use the same upquery mechanism to deal with it behind the scenes for you. It'll be a little bit slower that first time you ask for it again, but then it'll be stored in the cache, so for the next 100,000 people that request that value, it'll be there for them.

Comparison to Alternatives

That's dataflow and dataflow-based caching. Now I'll compare it to some alternatives. I'll compare it to traditional caching. I've been doing this implicitly all throughout. The way that I think about dataflow-based caching compared to traditional caching is that you get faster updates, but you get a heavier memory footprint. You have to be prepared to spend a little bit more on memory, but you're going to get a really high cache hit rate, and you're going to get fresh data, because we're able to do this incremental update rather than a full computation. At the same time, we're able to avoid a lot of the headaches that typically come with caching. We don't have to rewrite the application because it's wire compatible with the database. We can just swap connection string. We don't have to worry about picking an invalidation policy, because the cache is eagerly kept up to date. The cache will never make things worse, because we have our own query engine, and it's isolated from the database.

Moreover, because of the fact that we're able to incrementally update state, we're able to tolerate comparatively write-heavy workloads. Let's say that you have a lot of writes, your data is constantly changing. You're going to have a really tough time caching that workload, because the data will really quickly go out of date, and the only way to refresh it is by running that same query over again against the database. At that point, you might as well not be using the cache. With this approach, you're able to, essentially, still cache without having to worry about lowering the cache hit rate, because it's always there in memory and it's just being updated behind the scenes. Every once in a while, we're just doing a pointer swap to reflect the latest changes. It also makes it easier to cache complex queries, because typically, the more complicated a query is, the more sophisticated your invalidation policy will end up being. There's a large set of queries that you might have felt a little bit nervous to try to cache before, that's now a little bit friendlier to do with dataflow-based caching.

As compared to query optimization, obviously, with caching, you don't have to know how to optimize queries, so the barrier of entry is a little bit lower. In some cases, query optimization will only get you so far, because the query is just doing a lot of work. Caching can be really helpful if you want to just click a button, and if you're willing to spend the memory on it, then you don't have to make sure your whole team knows how to optimize queries or anything like that. If you're on a platform team, and you have a lot of application teams that you support, then this could help you deal with maybe ORM generated queries and so forth, without having to spend a lot of human time on it, which is nice. As compared to read replicas, this has a similar interface.

In fact, you can set Readyset up as your read replica and interact with it in that way. Assuming that we're just dealing with cacheable queries and queries that we're caching, if we compare the two, then you will get better tail latencies, and therefore be able to support higher load if you're doing caching, because, again, you're just doing this O of 1 lookup each time you're doing a read. You will have to allocate more memory.

Conclusion

Dataflow-based caching has a very similar performance profile on cache hits to conventional caches. It's easier to use. It expands the set of use cases that you could imagine caching to incorporate ones that have comparatively more writes. It's source available on GitHub. If you want to take a look, you can Google it. As I mentioned, it's based on research, and so you can read the research paper. It's called Noria. You can Google that. There's an open-source repo that you can take a look at as well.

Questions and Answers

Participant 1: For your cache invalidation scheme, do you just use LRU, or did you find any value in caring about that? If you need to deploy multiple of your cache proxies, either for query volume or geo distribution reasons, do you have any model for coordination between those or do you just treat them all as entirely independent caches?

Marzoev: It's important to note that cache invalidation and cache eviction are actually two different things for us. Normally, they're conflated. For invalidation, we're just trying to keep the cache state fresh. With eviction, it's really going to depend, like we just use LRU most of the time, and that's fine. That's a parameter that you can toggle, but we haven't. There's no new insight that we give you there. It's just evict based on what is appropriate. We give you a nice mechanism for bringing things back into memory after you've evicted. That's our main contribution there.

Let's say you want to run multiple of these. We do support sharding, but it's not in the open source, and we haven't productionized it in our cloud product yet. It is theoretically possible. In fact, we've built it out. Indeed, we do think about, could you build a CDN for your database? It'd be cool to put this in different regions and so forth. It's not production ready yet, but you can do it.

Participant 2: It sounds like the partial materialization strategy is helpful for transactional queries where you're just really working on one row, but for analytics type queries that you might want to cache, is it still helpful, or is it going to still have a really high memory footprint?

Marzoev: No, I actually think it's pretty useful for analytics ones as well. It really just depends on what operations you're doing. Actually, counts and aggregates tend to have a lesser memory footprint. I think joins tend to be the really heavyweight one, but we've done a lot of work on optimizing that as well. Although this is typically used for transactional workloads, the more complicated the query is, and the more stuff that you're doing in it, the better comparative latency improvement you'll get from caching. There's no fundamental limitation on that. We see a lot of complex analytical queries as well.

 
