I am looking to partition raw Parquet data in AWS S3 by YYYYMMDD and to enable Bloom filters on a high-cardinality column (say, QID).
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# Initialize Spark session with Bloom filter support
spark = SparkSession.builder \
    .appName("Enable Bloom Filters in Parquet") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("parquet.bloom.filter.enabled", "true") \
    .config("parquet.bloom.filter.column.names", "QID") \
    .config("parquet.bloom.filter.expected.ndv", "300000") \
    .config("parquet.writer.version", "v2") \
    .config("spark.sql.parquet.filterPushdown", "true") \
    .config("parquet.enable.dictionary", "false") \
    .getOrCreate()

# Load all Parquet files
df = spark.read.parquet("s3://dev/source/")

# Derive a YYYYMMDD partition column from the TIM_LONG timestamp
transformed_df = df.withColumn("TIM_YYYYMMDD", expr("FLOOR(TIM_LONG / 10000)").cast("long"))

transformed_df.coalesce(1).write \
    .mode("overwrite") \
    .partitionBy("TIM_YYYYMMDD") \
    .parquet("s3://dev/target")
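As a side note, the `FLOOR(TIM_LONG / 10000)` transform only yields a YYYYMMDD value if TIM_LONG packs timestamps as twelve-digit YYYYMMDDhhmm longs; that layout is an assumption here, not something the data guarantees. A quick pure-Python sanity check of the arithmetic:

```python
# Assumption: TIM_LONG encodes a timestamp as YYYYMMDDhhmm digits,
# e.g. 2025-03-01 12:30 -> 202503011230. Integer division by 10_000
# drops the trailing hhmm digits and leaves YYYYMMDD.
def tim_long_to_yyyymmdd(tim_long: int) -> int:
    return tim_long // 10_000

print(tim_long_to_yyyymmdd(202503011230))  # -> 20250301
print(tim_long_to_yyyymmdd(199912312359))  # -> 19991231
```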
I have tested this with PySpark 3.5.5 (locally and also with AWS Glue ETL). Both runs generate the Parquet files, but the Parquet metadata does not show any Bloom filter applied to column QID:
parquet-tools meta part-00000-009a5011-e87b-4896-b828-d716f2d25e2a.c000.snappy.parquet
I was expecting to see "BF" or "bloom filter" keywords in the Parquet metadata to confirm the presence of Bloom filters.
Can someone please advise what I am missing here?
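For background on what `parquet.bloom.filter.expected.ndv` controls: a Bloom filter sized for n distinct values at a target false-positive rate p needs roughly m = -n·ln(p) / (ln 2)² bits (Parquet's split-block filter rounds this up to its block layout). A small pure-Python sketch of that textbook formula, assuming a 1% false-positive rate (the parquet-mr `parquet.bloom.filter.fpp` default):

```python
import math

def bloom_filter_size_bytes(ndv: int, fpp: float) -> int:
    """Textbook Bloom-filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

# For the 300,000 expected distinct QID values above, at a 1% false-positive
# rate, the filter works out to roughly 350 KiB per column chunk:
size = bloom_filter_size_bytes(300_000, 0.01)
print(f"{size} bytes (~{size / 1024:.0f} KiB)")
```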