python - repartition structured Dask dataframe - Stack Overflow


Assume a Dask dataframe partitioned on the region column. As more data arrives, more parquet files are created within each region partition. I want to concatenate all the parquet files within a region into a single file, for every region that has more than one. Assume further that the data is large, so I do not want to touch regions that already contain only a single parquet file. I tried various approaches but could not get it working.

P.S. I hope my usage of the word "partition" is not confusing - I am not sure how best to differentiate between regions and individual parquet files.

Old data:

my_df
├── region=A
│   └── part.0.parquet
├── region=B
│   ├── part.0.parquet
│   └── part.3.parquet
├── region=C
│   ├── part.0.parquet
│   └── part.3.parquet
└── region=D
    └── part.3.parquet

Desired data:

my_df
├── region=A
│   └── part.0.parquet
├── region=B
│   └── part.4.parquet
├── region=C
│   └── part.4.parquet
└── region=D
    └── part.3.parquet

Code to create data:

import dask.dataframe as dd
import pandas as pd
from pyarrow import fs

# create local filesystem and path to data
filesys = fs.LocalFileSystem()
PATH = "my_df"

# create dataframes
df1 = pd.DataFrame({
    "region": ["A", "B", "C"],
    "date": ["2025-03-21", "2025-03-22", "2025-03-23"],
    "value": [100, 200, 300],
})

df2 = pd.DataFrame({
    "region": ["B", "C", "D"],
    "date": ["2025-03-22", "2025-03-23", "2025-03-24"],
    "value": [150, 250, 350],
})

# create dask dataframes and write to parquet
ddf1 = dd.from_pandas(df1)
# first write creates the dataset, so no append is needed here
ddf1.to_parquet(PATH, filesystem=filesys, partition_on=["region"])

ddf2 = dd.from_pandas(df2)
# second write appends new part files inside the existing region directories
ddf2.to_parquet(PATH, filesystem=filesys, partition_on=["region"], append=True, ignore_divisions=True)
