I am looking to partition raw Parquet data in AWS S3 by YYYYMMDD and to enable Bloom filters on a high-cardinality column (say, QID).
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# Initialize Spark session with Bloom filter support
spark = SparkSession.builder \
    .appName("Enable Bloom Filters in Parquet") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("parquet.bloom.filter.enabled", "true") \
    .config("parquet.bloom.filter.column.names", "QID") \
    .config("parquet.bloom.filter.expected.ndv", "300000") \
    .config("parquet.writer.version", "v2") \
    .config("spark.sql.parquet.filterPushdown", "true") \
    .config("parquet.enable.dictionary", "false") \
    .getOrCreate()

# Load all Parquet files
df = spark.read.parquet("s3://dev/source/")

# Derive a YYYYMMDD partition column from the TIM_LONG timestamp
transformed_df = df.withColumn("TIM_YYYYMMDD", expr("FLOOR(TIM_LONG / 10000)").cast("long"))

transformed_df.coalesce(1).write \
    .mode("overwrite") \
    .partitionBy("TIM_YYYYMMDD") \
    .parquet("s3://dev/target")
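As a side note, the `FLOOR(TIM_LONG / 10000)` transform only yields a YYYYMMDD value if TIM_LONG packs timestamps as twelve-digit YYYYMMDDhhmm longs; that layout is an assumption here, not something the data guarantees. A quick pure-Python sanity check of the arithmetic:

```python
# Assumption: TIM_LONG encodes a timestamp as YYYYMMDDhhmm digits,
# e.g. 2025-03-01 12:30 -> 202503011230. Integer division by 10_000
# drops the trailing hhmm digits and leaves YYYYMMDD.
def tim_long_to_yyyymmdd(tim_long: int) -> int:
    return tim_long // 10_000

print(tim_long_to_yyyymmdd(202503011230))  # -> 20250301
print(tim_long_to_yyyymmdd(199912312359))  # -> 19991231
```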
I have tested this with PySpark 3.5.5 (locally and also with AWS Glue ETL). Both runs generate the Parquet files, but the Parquet metadata does not show any Bloom filter applied to column QID:
parquet-tools meta part-00000-009a5011-e87b-4896-b828-d716f2d25e2a.c000.snappy.parquet
I was expecting to see "BF" or "bloom filter" keywords in the Parquet metadata to confirm the presence of Bloom filters.
Can someone please advise what I am missing here?
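For background on what `parquet.bloom.filter.expected.ndv` controls: a Bloom filter sized for n distinct values at a target false-positive rate p needs roughly m = -n·ln(p) / (ln 2)² bits (Parquet's split-block filter rounds this up to its block layout). A small pure-Python sketch of that textbook formula, assuming a 1% false-positive rate (the parquet-mr `parquet.bloom.filter.fpp` default):

```python
import math

def bloom_filter_size_bytes(ndv: int, fpp: float) -> int:
    """Textbook Bloom-filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

# For the 300,000 expected distinct QID values above, at a 1% false-positive
# rate, the filter works out to roughly 350 KiB per column chunk:
size = bloom_filter_size_bytes(300_000, 0.01)
print(f"{size} bytes (~{size / 1024:.0f} KiB)")
```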