God clickhouse is such great software, if only it was as ergonomic as duckdb, and management wasn't doing some questionable things (deleting references to competitors in GH issues, weird legal letters, etc.)
The CH contributors are really stellar, from multiple companies (Altinity, Tinybird, Cloudflare, ClickHouse)
tmoertel 5 hours ago [-]
This optimization should provide dramatic speed-ups when taking random samples from massive data sets, especially when the wanted columns can contain large values. That's because the basic SQL recipe relies on a LIMIT clause to determine which rows are in the sample (see query below), and this new optimization promises to defer reading the big columns until the LIMIT clause has filtered the data set down to a tiny number of lucky rows.
SELECT *
FROM Population
WHERE weight > 0
ORDER BY -LN(1.0 - RANDOM()) / weight
LIMIT 100 -- Sample size.
Can anyone from ClickHouse verify that the lazy-materialization optimization speeds up queries like this one? (I want to make sure the randomization in the ORDER BY clause doesn't prevent the optimization.)
Thanks! That's a nice 5x improvement. Pretty good for a query that offers only modest opportunity, given that the few columns it asks for are fairly small (`title` being the largest, which isn't that large).
tschreiber 4 hours ago [-]
Verified:
EXPLAIN plan actions = 1
SELECT *
FROM amazon.amazon_reviews
WHERE helpful_votes > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
LIMIT 3
Note that there is a setting query_plan_max_limit_for_lazy_materialization (default value 10) that controls the maximum n for which lazy materialization kicks in for LIMIT n.
tmoertel 4 hours ago [-]
Awesome! Thanks for checking :-)
jurgenkesker 4 hours ago [-]
I really like Clickhouse. Discovered it recently, and man, it's such a breath of fresh air compared to suboptimal solutions I used for analytics. It's so fast and the CLI is also a joy to work with.
theLiminator 49 minutes ago [-]
How does it compare to duckdb and/or polars?
EvanAnderson 3 hours ago [-]
Same here. I come from a strong Postgres and Microsoft SQL Server background and I was able to get up to speed with it, ingesting real data from text files, in an afternoon. I was really impressed with the docs as well as the performance of the software.
simonw 5 hours ago [-]
Unrelated to the new materialization option, this caught my eye:
"this query sorts all 150 million values in the helpful_votes column (which isn’t part of the table’s sort key) and returns the top 3, in just 70 milliseconds cold (with the OS filesystem cache cleared beforehand) and a processing throughput of 2.15 billion rows/s"
I clearly need to update my mental model of what might be a slow query against modern hardware and software. Looks like that's so fast because in a columnar database it only has to load that 150 million value column. I guess sorting 150 million integers in 70ms shouldn't be surprising.
(Also "Peak memory usage: 3.59 MiB" for that? Nice.)
This is a really great article - very clearly explained, good diagrams, I learned a bunch from it.
amluto 5 hours ago [-]
> I guess sorting 150 million integers in 70ms shouldn't be surprising.
I find sorting 150M integers at all to be surprising. The query asks for finding the top 3 elements and returning those elements, sorted. This can be done trivially by keeping the best three found so far and scanning the list. This should operate at nearly the speed of memory and use effectively zero additional storage. I don’t know whether Clickhouse does this optimization, but I didn’t see it mentioned.
Generically, one can find the kth best of n elements in time O(n):
https://en.m.wikipedia.org/wiki/Selection_algorithm
And one can scan again to find the top k, plus some extra if the kth best wasn’t unique, but that issue is manageable and, I think, adds at most a factor of 2 overhead if one is careful (collect up to k elements that compare equal to the kth best and collect up to k that are better than it). Total complexity is O(n) if you don’t need the result sorted or O(n + k log k) if you do.
If you’re not allowed to mutate the input (which probably applies to Clickhouse-style massive streaming reads), you can collect the top k in a separate data structure, and straightforward implementations are O(n log k). I wouldn’t be surprised if using a fancy heap or taking advantage of the data being integers with smallish numbers of bits does better, but I haven’t tried to find a solution or disprove the existence of one.
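A rough Python sketch of that select-then-scan recipe, for illustration only (the random-pivot quickselect below copies sublists rather than partitioning in place, so it is nowhere near as memory-frugal as a real implementation):
import random

def kth_largest(xs, k):
    # Expected O(n) selection of the k-th largest value (1-based), random pivot.
    pivot = random.choice(xs)
    greater = [x for x in xs if x > pivot]
    equal = [x for x in xs if x == pivot]
    if k <= len(greater):
        return kth_largest(greater, k)
    if k <= len(greater) + len(equal):
        return pivot
    return kth_largest([x for x in xs if x < pivot], k - len(greater) - len(equal))

def top_k(xs, k):
    # Second pass: everything strictly better than the threshold, plus enough ties to reach k.
    threshold = kth_largest(xs, k)
    better = [x for x in xs if x > threshold]
    ties = [x for x in xs if x == threshold]
    return sorted(better + ties[:k - len(better)], reverse=True)

print(top_k([5, 1, 9, 3, 9, 7, 2], 3))  # [9, 9, 7]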
danlark1 2 hours ago [-]
I am the author of the optimization of partial sorting and selection in ClickHouse. It uses the Floyd-Rivest algorithm; we tried a lot of different things back at the time, see [1].
Overall, ClickHouse reads blocks of fixed size (64k), finds the top elements within each block, and then takes the top of the tops until it converges.
[1] https://danlark.org/2020/11/11/miniselect-practical-and-gene...
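A toy Python version of that block-at-a-time scheme, with heapq.nlargest standing in for the Floyd-Rivest selection that miniselect/ClickHouse actually use:
import heapq

def blockwise_top_k(stream, k, block_size=65536):
    # Keep only the current best k, fold in each block's own top k, select again.
    best = []
    block = []
    for value in stream:
        block.append(value)
        if len(block) == block_size:
            best = heapq.nlargest(k, best + heapq.nlargest(k, block))
            block.clear()
    return heapq.nlargest(k, best + block)  # flush the final partial block

print(blockwise_top_k(range(1_000_000), 3))  # [999999, 999998, 999997]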
> This can be done trivially by keeping the best three found so far and scanning the list.
That doesn't seem to guarantee correctness. If you don't track all of the unique values, at least, you could be throwing away one of the most common values.
The wiki entry seems to be specifically about the smallest, rather than the largest, values.
senderista 4 hours ago [-]
The max-heap algorithm alluded to above is correct. You fill it with the first k values scanned, then peek at the max element for each subsequent value. If the current value is smaller than the max element, you evict the max element and insert the new element. This streaming top-k algorithm is ubiquitous in both leetcode interviews and applications. (The standard quickselect top-k algorithm is not useful in the streaming context because it requires random access and in-place mutation.)
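A minimal sketch of that streaming top-k, flipped to keep the k largest values (Python's heapq is a min-heap, so the worst of the kept values sits at the root and is the one evicted):
import heapq

def streaming_top_k(stream, k):
    heap = []  # size-k min-heap; heap[0] is the worst value still being kept
    for value in stream:
        if len(heap) < k:
            heapq.heappush(heap, value)     # fill with the first k values
        elif value > heap[0]:
            heapq.heapreplace(heap, value)  # evict the worst, insert the newcomer
    return sorted(heap, reverse=True)       # O(k log k) only if you want them ordered

print(streaming_top_k([4, 1, 7, 3, 9, 9, 2], 3))  # [9, 9, 7]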
amluto 4 hours ago [-]
To be fair to quickselect, I can imagine a lazy data processing framework having a concept of a lazily sorted data column where the actual data has been materialized but it’s not in sorted order yet. Then someone does “LIMIT k” to it, and the framework can go to town with quickselect.
As noted a couple times in this thread, there are all kinds of tradeoffs here, and I can’t imagine quickselect being even close to competitive for k that is small enough to fit in cache. Quickselect will, in general, scan a large input approximately twice. For k = 3, the answer fits in general-purpose registers or even in a single SIMD register, and a single scan with brute force accumulation of the answer will beat quickselect handily and will also beat any sort of log-time heap.
(In general, more advanced and asymptotically better algorithms often lose to simpler brute force algorithms when the parameters in question are smallish.)
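For concreteness, the k = 3 brute force is just three running variables and a short comparison chain per element (a Python sketch; the register/SIMD version alluded to above is the same idea):
def top3(stream):
    # a >= b >= c hold the three largest values seen so far.
    a = b = c = float("-inf")
    for x in stream:
        if x > a:
            a, b, c = x, a, b
        elif x > b:
            b, c = x, b
        elif x > c:
            c = x
    return a, b, c

print(top3([4, 1, 7, 3, 9, 9, 2]))  # (9, 9, 7)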
senderista 4 hours ago [-]
Yeah, obviously I wouldn't bother with a heap for k=3. A heap has good compactness but poor locality, so I guess it wouldn't perform well out of (some level of) cache.
eru 3 hours ago [-]
So quickselect needs multiple passes, and the heap needs O(n log k) time to find the top k elements of n elements total.
However, you can find the top k elements in O(n) time and O(k) space in a single pass.
One simple way: you keep a buffer of up to 2*k elements. You scan your stream of n items one by one. Whenever your buffer gets full, you pare it back down to k elements with your favourite selection algorithm (like quickselect).
As a minor optimisation, you only add items to your buffer if they improve on the worst element already in it (or while you haven't collected k elements yet).
As an empirical question, you can also experiment with the size of the buffer. Theoretically any constant multiple of k will do (even 1.1*k or so), but in practice different choices give you different constant factors for space and time.
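A sketch of that buffer scheme, with heapq.nlargest standing in for "your favourite selection algorithm" and the worst-kept-element threshold from the optimisation above folded in:
import heapq

def buffered_top_k(stream, k, slack=2):
    buf = []
    threshold = float("-inf")  # worst value still worth keeping, updated on each pare-down
    for x in stream:
        if len(buf) < k or x > threshold:
            buf.append(x)
            if len(buf) >= slack * k:
                buf = heapq.nlargest(k, buf)  # pare the buffer back down to k
                threshold = buf[-1]           # nlargest returns a descending list
    return heapq.nlargest(k, buf)

print(buffered_top_k(range(1_000_000), 3))  # [999999, 999998, 999997]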
senderista 2 hours ago [-]
How do you efficiently track the "worst element" without something like a max-heap? But yeah, this is a fun algorithm. I think I've seen it before but can't place it, do you remember where you came across it?
Akronymus 4 hours ago [-]
My failure was misreading it as most common k rather than max k.
senderista 4 hours ago [-]
Most common k is super-interesting because it can't be solved in one pass in constant space!
https://en.wikipedia.org/wiki/Streaming_algorithm#Frequent_e...
Why is that interesting? Intuitively a worst-case could be a stream of n-1 unique elements out of n with the duplicate at the end, so there is no way around O(n) space. Any element could be the most common so you must keep them all.
senderista 2 hours ago [-]
Sure, a similar trivial argument applies to the linear-space lower bound for set membership. But these linear lower bounds motivate the search for approximate techniques with sublinear lower bounds (although bloom filters or fingerprint tables are not actually sublinear).
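One such approximate technique, sketched: the classic frequent-elements (Misra-Gries) summary finds every value occurring more than n/k times using at most k-1 counters; the reported counts are approximate and an exact tally needs a second pass:
def misra_gries(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for key in list(counters):  # no free slot: decrement everything
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters  # a superset of all items with frequency > n/k

print(misra_gries([1, 1, 2, 3, 1, 2, 1, 4, 1], 3))  # {1: 3}; 1 is the only heavy hitter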
datadrivenangel 4 hours ago [-]
With a comparison that returns true/false, this guarantees correctness. Even if several values tie for best/biggest/smallest, the technique works.
recursive 4 hours ago [-]
What? The algorithm is completely symmetrical with respect to smallest or largest, and fully correct and general. I don't understand the problem with unique values. Could you provide a minimal input demonstrating the issue?
Akronymus 4 hours ago [-]
I can't, because I completely misread the wiki article before commenting and have now read it more carefully and realized I was wrong. Specifically, I went in thinking about the top 3 most common values.
simonw 3 hours ago [-]
Maybe they do have that optimization and that explains the 3.59 MiB peak memory usage for ~600MB of integers.
baq 4 hours ago [-]
Slow VMs on overprovisioned cloud hosts which cost as much per month as a dedicated box per year have broken a generation of engineers.
You could host so much from your macbook. The average HN startup could be hosted on a $200 minipc from a closet for the first couple of years, if not more - and I'm talking expensive here, for the extra RAM you want so you don't have to restart every hour when you have a memory leak.
ramraj07 24 minutes ago [-]
I don't see how that's the root cause. ClickHouse and snowflake run on your so-called slow vms on overprovisioned cloud hosts and they're efficient as hell. It's all about your optimizations.
The real problem is the lack of understanding by most engineers of the degree of overprovisioning they do for code that's simple and doing stupid things, using an inefficient 4th order language on top of 5 different useless (imo) abstractions.
federiconafria 1 hours ago [-]
Not only that, you have a pile of layers that could be advantageous in some situations but are overkill in most.
I've seen Spark clusters being replaced by a single container using less than 1 CPU core and a few hundred MB of RAM.
rfoo 2 hours ago [-]
> so much from your macbook
At least on cloud I can actually have hundreds of GiBs of RAM. If I want this on my Macbook it's even more expensive than my cloud bill.
baq 2 hours ago [-]
You can, but if you need it you’re not searching for a product market fit anymore.
sofixa 3 hours ago [-]
Raw-compute-wise, you're almost right (almost, because real cloud hosts aren't overprovisioned; you get the full CPU/memory/disk reserved for you).
But you actually need more than compute. You might need a database, cache, message broker, scheduler, a way to send emails, and a million other things you can always DIY with FOSS software, but that takes time. If you have more money than time, get off-the-shelf services that provide those with guarantees and maintenance; if not, the DIY route is also great for learning.
baq 2 hours ago [-]
My point is that all of this can be hosted on a single bare metal box, and a small one at that! We used to do just that back in the mid naughts, and computers have only gotten faster. Half of those cloud services are preconfigured FOSS derivatives behind the scenes anyway (probably…)
mmsimanga 20 minutes ago [-]
IMHO if ClickHouse had a native Windows release that did not need WSL or a Linux virtual machine, it would be more popular than DuckDB. I remember for years MySQL being way more popular than PostgreSQL, one of the reasons being that MySQL had a Windows installer.
vjerancrnjak 2 hours ago [-]
It's quite amazing how a db like this shows that all of those row-based dbs are doing something wrong; they can't even approach these speeds with btree index structures. I know they like transactions more than ClickHouse does, but it's just amazing to see how fast modern machines are, billions of rows per second.
I'm pretty sure they did not even bother to properly compress the dataset; with some tweaking it could probably have been much smaller than 30 GB. The speed shows that reading the data is slower than decompressing it.
Reminds me of that Cloudflare article where they had a similar idea about encryption being free (slower to read than to decrypt) and found a bug that, when fixed, materialized this behavior.
The compute engine (chdb) is a wonder to use.
apavlo 43 minutes ago [-]
> It's quite amazing how a db like this shows that all of those row-based dbs are doing something wrong
They're not "doing something wrong". They are designed differently for different target workloads.
Row-based -> OLTP -> "Fetch the entire records from order table where user_id = XYZ"
Column-based -> OLAP -> "Compute the total amount of orders from the order table grouped by month/year"
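A toy illustration of the difference in plain Python (lists/dicts only, not how either kind of engine actually stores data):
from collections import defaultdict

orders_rows = [  # row layout: one record per order
    {"order_id": 1, "user_id": "XYZ", "amount": 40, "month": "2025-03"},
    {"order_id": 2, "user_id": "ABC", "amount": 15, "month": "2025-03"},
    {"order_id": 3, "user_id": "XYZ", "amount": 99, "month": "2025-04"},
]
orders_cols = {  # columnar layout: one array per column
    "order_id": [1, 2, 3],
    "user_id": ["XYZ", "ABC", "XYZ"],
    "amount": [40, 15, 99],
    "month": ["2025-03", "2025-03", "2025-04"],
}

# OLTP-style point lookup wants whole records: natural in the row layout.
print([r for r in orders_rows if r["user_id"] == "XYZ"])

# OLAP-style aggregate touches only two columns: natural in the columnar layout.
totals = defaultdict(int)
for month, amount in zip(orders_cols["month"], orders_cols["amount"]):
    totals[month] += amount
print(dict(totals))  # {'2025-03': 55, '2025-04': 99}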
ohnoesjmr 3 hours ago [-]
Wonder how well this propagates down to subqueries/CTEs.
simianwords 4 hours ago [-]
Maybe I'm too inexperienced in this field but reading the mechanism I think this would be an obvious optimisation. Is it not?
But credit where it is due, obviously clickhouse is an industry leader.
ahofmann 3 hours ago [-]
Obvious solutions are often hard to do right. I bet the code that was needed to pull this off is either very complex or took a long time to write (and test). Or both.
ryanworl 2 hours ago [-]
This is a well-known class of optimization and the literature term is “late materialization”. It is a large set of strategies including this one. Late materialization is about as old as column stores themselves.
meta_ai_x 2 hours ago [-]
can we take the "packing your luggage" analogy and only pack the things we actually use in the trip and apply that to clickhouse?
Onavo 2 hours ago [-]
Reminder clickhouse can be optionally embedded, you don't need to reach for Duck just because of hype (it's buggy as hell every time I tried it).
https://clickhouse.com/blog/chdb-embedded-clickhouse-rocket-...