python - How can I verify that polars is using file statistics for applying filters?

I have a table written out as 71 parquet files, and I am trying to find out if polars is using the file statistics correctly to prune the reads. Is there a way to show exactly which files were opened as part of a query plan?

Specifically, I am reading the parquet with:

import polars as pl

parquet_path = 'gs://path/to/table/partitions.parquet/'
df = pl.scan_parquet(parquet_path).filter((pl.col("ts_index") >= 1000) & (pl.col("ts_index") < 10000))

and when I explain the query plan I get:

Parquet SCAN [gs://path/to/table/partitions.parquet/p0.parquet, ... 71 other files]
PROJECT */218 COLUMNS
SELECTION: [([(col("ts_index")) < (10000)]) & ([(col("ts_index")) >= (1000)])]

The files are sorted by ts_index, so I expect Polars to use the column statistics when filtering on this column and open only 1 or 2 of the 71 parquet files. The explain output says it will scan all 71 parquet files, but I am not sure whether that means it actually has to open and read every one of them.
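
One way to sanity-check this expectation independently of Polars is to read the min/max statistics from each file's Parquet footer and work out which files could possibly contain rows matching the filter. The sketch below is only an illustration: the helper names (`files_that_may_match`, `read_column_stats`) are hypothetical and not part of Polars, the second helper assumes pyarrow is installed and is given paths pyarrow can open directly (reading from gs:// would need a filesystem object, which is not shown), and the numbers in the example are made up.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical helpers for illustration; nothing here is a Polars API.
Stats = Optional[Tuple[int, int]]  # (min, max) of the column in one file, or None if unknown


def files_that_may_match(stats: Dict[str, Stats], lo: int, hi: int) -> List[str]:
    """Return files whose [min, max] range overlaps the half-open filter [lo, hi).

    Files with unknown statistics (None) cannot be ruled out, so they are kept.
    """
    keep = []
    for path, file_stats in stats.items():
        if file_stats is None:
            keep.append(path)
            continue
        col_min, col_max = file_stats
        if col_max >= lo and col_min < hi:
            keep.append(path)
    return keep


def read_column_stats(paths, column="ts_index") -> Dict[str, Stats]:
    """Collect (min, max) for `column` from each file's Parquet footer.

    Assumes pyarrow is available and that the writer stored min/max statistics.
    """
    import pyarrow.parquet as pq  # imported lazily; only this helper needs pyarrow

    out: Dict[str, Stats] = {}
    for path in paths:
        meta = pq.ParquetFile(path).metadata
        mins, maxs, complete = [], [], True
        for rg in range(meta.num_row_groups):
            row_group = meta.row_group(rg)
            for ci in range(row_group.num_columns):
                chunk = row_group.column(ci)
                if chunk.path_in_schema != column:
                    continue
                st = chunk.statistics
                if st is None or not st.has_min_max:
                    complete = False  # a row group without stats cannot be pruned
                else:
                    mins.append(st.min)
                    maxs.append(st.max)
        out[path] = (min(mins), max(maxs)) if complete and mins else None
    return out


if __name__ == "__main__":
    # Made-up statistics standing in for the real footers.
    example = {
        "p0.parquet": (0, 999),
        "p1.parquet": (1000, 4999),
        "p2.parquet": (5000, 9999),
        "p3.parquet": (10000, 14999),
    }
    print(files_that_may_match(example, 1000, 10000))
```

If only 1 or 2 files come back for the real statistics, the data layout at least allows the pruning I expect. Whether Polars actually skips the other files is a separate question. Verbose logging (`pl.Config.set_verbose(True)` or the `POLARS_VERBOSE=1` environment variable) may report skipped files or row groups, but the exact messages vary by version, so I would treat that as a hint and not as proof.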
