pyspark - Generating parquet file with bloom filter - Stack Overflow

I am looking to partition raw Parquet data in AWS S3 by YYYYMMDD and to enable bloom filters on a high-cardinality column (let's say QID).

import boto3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_unixtime, substring, concat, expr

# Initialize Spark Session with Bloom Filter Support

spark = SparkSession.builder \
    .appName("Enable Bloom Filters in Parquet") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("parquet.bloom.filter.enabled", "true") \
    .config("parquet.bloom.filter.column.names", "QID") \
    .config("parquet.bloom.filter.expected.ndv", "300000") \
    .config("parquet.writer.version", "v2") \
    .config("parquet.pushdown", "true") \
    .config("parquet.enable.dictionary", "false") \
    .getOrCreate()

# Load all Parquet files
df = spark.read.parquet("s3://dev/source/")

transformed_df = df.withColumn("TIM_YYYYMMDD", expr("FLOOR(TIM_LONG / 10000)").cast("long"))

transformed_df.coalesce(1).write \
    .mode("overwrite") \
    .partitionBy("TIM_YYYYMMDD") \
    .parquet("s3://dev/target")

I have tested this with PySpark 3.5.5, both locally and with AWS Glue ETL. Both setups generate the Parquet files, but the Parquet metadata shows no sign of a bloom filter on column QID.

parquet-tools meta part-00000-009a5011-e87b-4896-b828-d716f2d25e2a.c000.snappy.parquet

I was expecting to see "BF" or "bloom filter" keywords in the Parquet metadata, indicating the presence of bloom filters.

Can someone please advise what I am missing here?
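
For comparison, the Parquet data source examples in the Spark documentation enable bloom filters per column through writer options of the form `parquet.bloom.filter.enabled#<column>` and `parquet.bloom.filter.expected.ndv#<column>`, rather than through session-level settings. Below is a minimal sketch of building those options for QID. The helper name `bloom_filter_options` is hypothetical, and whether this changes the result on Glue or locally is an assumption that has not been verified here.

```python
def bloom_filter_options(columns, expected_ndv):
    """Build per-column Parquet writer options for bloom filters.

    Hypothetical helper: it only formats option keys in the
    'parquet.bloom.filter.<setting>#<column>' style used in the
    Spark Parquet data source examples.
    """
    opts = {}
    for name in columns:
        opts[f"parquet.bloom.filter.enabled#{name}"] = "true"
        opts[f"parquet.bloom.filter.expected.ndv#{name}"] = str(expected_ndv)
    return opts


# Same column and expected distinct-value count as in the question
opts = bloom_filter_options(["QID"], 300000)

# Assumed usage with the DataFrame from the question (not executed here):
# transformed_df.coalesce(1).write \
#     .options(**opts) \
#     .mode("overwrite") \
#     .partitionBy("TIM_YYYYMMDD") \
#     .parquet("s3://dev/target")
```

Passing the settings on the writer keeps them attached to this one write, which makes it easier to tell whether they reached the Parquet writer at all.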
