I have a table written out as 71 parquet files, and I am trying to find out if polars is using the file statistics correctly to prune the reads. Is there a way to show exactly which files were opened as part of a query plan?
Specifically, I am reading the parquet with:
import polars as pl
parquet_path = 'gs://path/to/table/partitions.parquet/'
df = pl.scan_parquet(parquet_path).filter((pl.col("ts_index") >= 1000) & (pl.col("ts_index") < 10000))
and when I explain the query plan I get:
Parquet SCAN [gs://path/to/table/partitions.parquet/p0.parquet, ... 71 other files]
PROJECT */218 COLUMNS
SELECTION: [([(col("ts_index")) < (10000)]) & ([(col("ts_index")) >= (1000)])]
The files are sorted by ts_index, and I expect that Polars should be able to leverage column statistics when filtering by this index to only open 1 or 2 of the 71 parquet files. When I use explain it says that it will scan all 71 parquet files, but I am not sure if this means it actually has to open and read all of the files.
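To make the expectation concrete, here is a minimal sketch of the min/max overlap check I have in mind. It does not touch Polars or the real files: the `FileStats` type, the `files_overlapping` helper, and the layout of 5000 consecutive ts_index values per file are all made up for illustration. In practice the min/max values would come from the statistics stored in each parquet footer.

```python
from typing import NamedTuple


class FileStats(NamedTuple):
    """Hypothetical per-file min/max of ts_index, as a parquet footer would record it."""
    path: str
    min_ts: int
    max_ts: int


def files_overlapping(stats, lower, upper):
    """Return paths whose [min_ts, max_ts] range can hold rows with lower <= ts_index < upper."""
    return [s.path for s in stats if s.max_ts >= lower and s.min_ts < upper]


# Assumed layout (not the real table): 71 files, each holding 5000 consecutive ts_index values.
ROWS_PER_FILE = 5000
stats = [
    FileStats(f"p{i}.parquet", i * ROWS_PER_FILE, (i + 1) * ROWS_PER_FILE - 1)
    for i in range(71)
]

# Same predicate as the query: ts_index >= 1000 and ts_index < 10000.
candidates = files_overlapping(stats, lower=1000, upper=10000)
print(candidates)
```

Under this assumed layout only the first two files can contain matching rows, which is the kind of pruning I would expect from a reader that consults the statistics before reading row data.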