database - Cannot SAMPLE BY when reading from an external parquet file - Stack Overflow

I am trying to use the built-in parquet demo file inside questdb (trades.parquet) to see how it works,

I am trying to use the built-in parquet demo file inside questdb (trades.parquet) to see how it works, but I am stuck as I cannot use SAMPLE BY.

I can do this

select * from read_parquet('trades.parquet');

And it works as expected. I can see the output below with the columns symbol, side, price, amount, and timestamp. The dataset is already sorted by increasing timestamp on the file.

I now want to do the simplest SAMPLE BY query

select timestamp, avg(price) from read_parquet('trades.parquet') SAMPLE by 1m;

But I am getting an error base query does not provide ASC order over dedicated TIMESTAMP column.

I tried creating a table and then doing INSERT INTO tb SELECT * FROM read_parquet('trades.parquet') and I can then use SAMPLE BY. But in this case I am duplicating data and I was hoping to do time-series queries in-place over my file.

I am trying to use the built-in parquet demo file inside questdb (trades.parquet) to see how it works, but I am stuck as I cannot use SAMPLE BY.

I can do this

select * from read_parquet('trades.parquet');

And it works as expected. I can see the output below with the columns symbol, side, price, amount, and timestamp. The dataset is already sorted by increasing timestamp on the file.

I now want to do the simplest SAMPLE BY query

select timestamp, avg(price) from read_parquet('trades.parquet') SAMPLE by 1m;

But I am getting an error base query does not provide ASC order over dedicated TIMESTAMP column.

I tried creating a table and then doing INSERT INTO tb SELECT * FROM read_parquet('trades.parquet') and I can then use SAMPLE BY. But in this case I am duplicating data and I was hoping to do time-series queries in-place over my file.

Share Improve this question asked Jan 29 at 17:52 Javier RamirezJavier Ramirez 4,0951 gold badge27 silver badges36 bronze badges
Add a comment  | 

1 Answer 1

Reset to default 1

The problem with the SAMPLE BY is that it works over the designated timestamp. When you create a table without a designated timestamp, you get the same error.

Since the parquet file is an external file, QuestDB doesn't know which column, if any, is the designated timestamp. It cannot just use the timestamp column, as for SAMPLE BY to work, the data must be sorted by increasing timestamp. When tables are created inside QuestDB, the increasing timestamp order is always implicit, as QuestDB forces that when persisting data on disk, but when using an external file, it cannot assume anything.

In QuestDB we can always hint the designated timestamp by doing this:

select timestamp, avg(price) from (read_parquet('trades.parquet') timestamp(timestamp)) sample by 1m;

This should work fine in this case, as we know the file is already sorted by time. However, if we were not sure about the original order, we could always add an ORDER BY, as in

select timestamp, avg(price) from ((read_parquet('trades.parquet') order by timestamp  ) timestamp(timestamp)) SAMPLE BY 1m

Which is shorthand for

select timestamp, avg(price) from (SELECT * FROM (read_parquet('trades.parquet') order by timestamp  ) timestamp(timestamp)) SAMPLE BY 1m

Or if we prefer a CTE for clarity

WITH timestamped AS (
  (
   SELECT * FROM read_parquet('trades.parquet') 
   order by timestamp  
  )    
  timestamp(timestamp ) 
  )
 SELECT  timestamp, avg(price) from timestamped SAMPLE BY 1m;

发布者:admin,转转请注明出处:http://www.yc00.com/questions/1745286004a4620565.html

相关推荐

发表回复

评论列表(0条)

  • 暂无评论

联系我们

400-800-8888

在线咨询: QQ交谈

邮件:admin@example.com

工作时间:周一至周五,9:30-18:30,节假日休息

关注微信