怎样理解weighted quantile sketch怎样切图

实时数据流分析服务:&实时提交分析数据,实时获取分析结果
& & 之前一直在一家公司做埋点的后台数据处理分析程序,在碰到很多问题后就产生一个想法,能不能直接弄一个数据分析服务,
用户可以随时调用分析服务的接口提交分析数据和获取分析结果()
技术背景:
对常见的统计需求,如count(distinct), quantiles, topN等 在精确程度,存储成本,计算成本之间进行了折衷,对每一个计算元素只touch一次的情况下,得到精度相当高的统计结果。
是一个实现了actor pattern的异步编程 library。业务功能都被封闭成actor, 相互之间的调用或状态改变,通过message传递完成,无状态共享.
目前封装了三种计算模型:
theta sketch:基数计算,即计算count(distinct)quantile sketch:分位数计算,即计算一系列数值的分布情况frequency sketch:频率计算,即topn计算
数据安全性:
提交数据使用wss,获取数据使用https,数据不会被中途截取后台服务不存储任何原始数据,算后即扔,服务端只有分析后的数据相关计算参数,比如userid,business,可以hash或混淆后再传值由于目前只做authenticate,开发者应注意appid的保密性,建议后台调用
目前整个服务后台算是完成了0.5左右吧,由于是试验性质,目前只用了一台阿里云的ES,下一步要做的工作有两方面,
& 1. 增加统计模型
& 2. 增加对akka cluster 的支持,使整个后台服务能力可扩展
&&相关文章推荐
* 以上用户言论只代表其个人观点,不代表CSDN网站的观点或立场
访问:274次
排名:千里之外Sketch of the Day – Frugal Streaming - 推酷
Sketch of the Day – Frugal Streaming
We are always worried about the size of our
. At AK we like to count stuff, and if you can count stuff with smaller sketches then you can count more stuff! We constantly have conversations about how we might be able to make our sketches, and subsequently our datastores, smaller. During our
pointed us at some of the new work in
. The concept of Frugal Streaming is to process your data as it comes, O(N) , but severely limit the amount of memory you are using for your sketch, and by “severely” they mean using perhaps one or two pieces of information. Needless to say, we were interested!
The concept of Frugal Streaming reminds me of an algorithm for finding the dominant item in a stream,
written by Boyer and Moore in 1980. (This paper has a very interesting
). The MJRTY algorithm sets out to solve the problem of finding the majority element in a stream (an element comprising more than 50% of the stream). Moore proposed to solve this by using only 2 pieces of information and a single scan of the data.
Imagine you have a stream of names (“matt”, “timon”, “matt”, “matt”, “rob”, “ben”, …) and you wanted to know if any name appeared in more than half the stream. Boyer and Moore proposed the following:
majority = &&
for val in stream:
if count == 0:
majority = val
elif val == majority:
count += 1
count -= 1
print majority if count & 0 else &no majority!&
If you’re anything like me you will go through a few phases: “That can’t work!”, “Wait, that works?!, “Nah, this data will break that”, “Holy sh**t that works!!”. If you think about it for a minute you realize that it HAS to work. If any element in the stream comprises more than half the stream values there is no way to get to the end and have a counter of zero. To convince yourself suppose the majority element only comprises the first half + 1 of your N-length stream. The counter would count up to
and then start decrementing until all N values have been seen, which would leave the counter at
*. This will hold regardless of the ordering of the elements in the stream. A simple
is provided by the authors. Philippe Flajolet apparently “liked this algorithm very much and called it the ‘gang war’, because in his mind, every element is a gang member, and members of opposite gangs are paired in a standoff, and shoot each other. The gang members remaining are the more numerous”**.
The astute among you will have noticed that this algorithm only works if there is, in fact, a majority element. If there is not then it will fail. A stream such as {“matt”,”matt”,”timon”,”timon”,”rob”} would result in the algorithm returning “rob”. In practice you need ways of ensuring that your stream does indeed have a majority element or have a guarantee ahead of time.
* Note, that formula is for an even length stream. For a stream of odd length the counter will indeed be at 1. Proof is left to the reader.
** Thanks to Jeremie Lumbroso for his insightful feedback on the history of MJRTY and his memory of Philippe’s explanation.
One “bit” – Frugal-1U
, Qiang and Muthu decided to see if they could find a frugal solution to the streaming quantile problem. The streaming quantiles problem is one I enjoy quite a bit and I have used it as an interview question for potential candidates for some time. Simply stated it is: “How would you find the quantiles of a stream of numbers in O(N) with limited memory?” There are a few different approaches to solve this problem, with the most widely known probably being
. However, with Count-Min you need to keep your keys around in order to query the sketch. There are
to this question as well.
Instead of focusing on quantiles generally, Qiang and Muthu’s first solution is a frugal way to find the median of a stream. As with MJRTY above, I’ll just write it down and see how you react:
median_est = 0
for val in stream:
if val & median_est:
median_est += 1
elif val & median_est:
median_est -= 1
Granted, the above is just for the median, where the stream is much longer than the value of the median, but it is so simple that I don’t think I would have ever considered this approach to the problem. The extension to quantiles is equally as shocking. If you are trying to find the 75th percentile of the data stream you do the same as above but increment up randomly 75% of the time and decrement down randomly 25% of the time:
quantile_75 = 0
for val in stream:
r = random()
if val & quantile_75 and r & 1 - 0.75:
quantile_75 += 1
elif val & quantile_75 and r & 0.75:
quantile_75 -= 1
As the paper states:
The trick to generalize median estimation to any
-quantile estimation is that not every stream item seen will cause an update. If the current stream item is larger than estimation, an increment update will be triggered only with probability
. The rationale behind it is that if we are estimating
-quantile, and if the current estimate is at stream’s true
-quantile, we will expect to see stream items larger than the current estimate with probability
Finding Quantiles With Two “bits”- Frugal-2U
There are a few obvious drawbacks to the above algorithm. Since we are only incrementing by 1 each time, the time to converge is linear and our initial guess of zero could be very bad. Secondly, and by design, the algorithm has no memory, can fluctuate wildly and, as they show in the paper, the estimates can drift very far away. (Note: while it is extremely unlikely that the estimates will drift far from the correct values the paper has some very loose bounds on how bad it can drift. See Lemma 3 in the paper.) They suggest a few improvements over Frugal-1U where you essentially include a varying step (instead of just incrementing by 1 every time) and 1 “bit” so you know which way you incremented in the last update. The intuition here is that if you have been consistently incrementing up or down for the last few elements of a stream then you are probably “far” away from the quantile in question. If this is the case we can speed up convergence time by incrementing a larger amount. The Frugal-2U algorithm:
def frugal_2u(stream, m = 0, q = 0.5, f = constantly_one):
step, sign = 1, 1
for item in stream:
if item & m and random() & 1 - q:
# Increment the step size if and only if the estimate keeps moving in
# the same direction. Step size is incremented by the result of applying
# the specified step function to the previous step size.
step += f(step) if sign & 0 else -1 * f(step)
# Increment the estimate by step size if step is positive. Otherwise,
# increment the step size by one.
m += step if step & 0 else 1
# Mark that the estimate increased this step
# If the estimate overshot the item in the stream, pull the estimate back
# and re-adjust the step size.
if m & item:
step += (item - m)
# If the item is less than the stream, follow all of the same steps as
# above, with signs reversed.
elif item & m and random() & q:
step += f(step) if sign & 0 else -1 * f(step)
m -= step if step & 0 else 1
if m & item:
step += (m - item)
# Damp down the step size to avoid oscillation.
if (m - item) * sign & 0 and step & 1:
You can play around with the 1U and 2U algorithms in the simulation below.
Click above to run the Frugal Streaming simulation
As usual, there are a few interesting tidbits as well. If you read the paper you will see that they define the updates to step as a function. This means they are allowing many different types of increments to step . For example, instead of increasing the size of step by 1 each time we could increase it by 10 or even have it increase multiplicatively. They do talk about some uses of different updates to step but there is no analysis around this (yet) and they restrict all of the work in the paper to step increments of 1. We offer a few different step update functions in the simulation and they indeed do interesting things. Further exploration is definitely needed to get some insights here.
A non-obvious thing about the step variable is how it behaves under decrements. My initial thought was that step would get large if your current estimate was far below the actual value (thus allowing you to approach it faster from below) and that step would get to be a large negative number if your current estimate was very far above the actual value. This turns out to just be flat wrong. The negative updates to step have the effect of stabilizing the estimate (notice when step is negative that the updates to your estimates are always & 1 ). If you read the algorithm closely you will see that step decrements when you consistently alternate up and down updates. This behavior occurs when the estimate is close to the actual value which causes step to become a very large negative number. And I mean VERY large. In practice we have seen this number as small as
for some simulations.
Monitoring
One of the first things I thought of when I saw this sketch was to use it as a monitoring tool. Specifically, perhaps it could be used to replace the monitoring we use on our application server response times. It turns out that using this sketch for monitoring introduces some very interesting issues. So far, we have mostly talked about what I will call “static streams”. These are streams that have data in them which is pulled consistently from some static underlying distribution (as in the examples above). However, what if the underlying distribution changes? For example, what if one of our servers all of the sudden starts responding with slower response times? Does this frugal sketch enable you to quickly figure out that something has broken and fire off an alarm with high confidence ? Well, in some ways this is an identical problem to cold start: how long does it take for an initial guess to reach a stable quantile? Unfortunately, there is no easy way to know when you have reached “equilibrium” and this makes identifying when an underlying distribution change has occurred difficult as well. The paper ends with an open challenge:
… as our experiments and insights indicate, frugal streaming algorithms work with so little memory of the past that they are adaptable to changes in the stream characteristics. It will be of great interest to understand this phenomenon better.
The paper shows some interesting experiments with changing data, but I agree with Qiang: we need to understand these algorithms better before I swap out all of our server monitoring tools (and our ops team would kill me). However, since these are so simple to implement and so small, we can easily deploy a few tests and see how the results compare “in the field” (you can play around with this by changing the underlying stream distribution in our simulation above.)
Conclusion
The frugal quantile algorithms proposed in the paper are fascinating. It is a very interesting turn in the sketching literature and Qiang and Muthu’s creativity really comes across. I am very interested in getting some real world uses out of this sketch and am excited to see what other applications we (and Qiang!) can think of. &Many thanks to
&and Jeremie Lumbroso for all their help!
Variability : While the bounds on the accuracy of these algorithms seem wide to me, clearly in real world experiments we see much better performance than the guaranteed bounds in the paper. In fact, the error bounds in the paper are dependent on the size of the stream and not fixed.
Size of step: A potential gotcha is the size of the step variable. Depending on your update function it indeed seems possible for this value to get below -MAXINT. Clearly a real implementation would need some error checking.
Cold Start : One more gotcha is that you have no real way of knowing when you are near the quantile in question. For instance, starting your estimate at zero and measuring a true median which is 100,000,000,000 will take a long time to “catch up” to this value. There are a few ways to limit this, some of which are discussed in the paper. One way is to try and make a sane guess up front (especially if you know something about the distribution) and another is to start your guess with the value from the last counter. For instance, suppose you are in a monitoring situation and you are computing means over the course of a day. It would make sense to start tomorrow’s estimate with yesterdays final value.
Accuracy : &And, lastly, there is some interesting dependence on “atomicity”. Meaning, the estimates in some sense depend on how “large” the actual values are. In both, your minimum step size is 1. If my median in the stream is, say, 6 then this “atomic” update of 1 causes high relative fluctuation. It seems in real world examples you would like to balance the size of the thing you are estimating with the speed of approach of the algorithm. This could lead you to estimate latencies in microseconds rather than milliseconds for instance. All of this points to the fact that there are a bunch of real world engineering questions that need to be answered and thought about before you actually go and implement a million frugal sketches throughout your organization.
已发表评论数()
请填写推刊名
描述不能大于100个字符!
权限设置: 公开
仅自己可见
正文不准确
标题不准确
排版有问题
主题不准确
没有分页内容
图片无法显示
视频无法显示
与原文不一致Sketch-based multi-query processing over data streams_百度文库
两大类热门资源免费畅读
续费一年阅读会员,立省24元!
Sketch-based multi-query processing over data streams
&&Abstract. Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries
阅读已结束,如果下载本文需要使用0下载券
想免费下载更多文档?
定制HR最喜欢的简历
下载文档到电脑,查找使用更方便
还剩15页未读,继续阅读
定制HR最喜欢的简历
你可能喜欢

我要回帖

更多关于 sketch怎样画虚线 的文章

 

随机推荐