I'll create a demo DataFrame to reproduce the error that I see in Databricks.
from pyspark.sql.types import StructType, StructField, TimestampType, StringType
from datetime import datetime
# Define the schema
schema = StructType([
    StructField("session_ts", TimestampType(), True),
    StructField("analysis_ts", TimestampType(), True)
])
# Define the data with datetime objects
data = [
    (datetime(2023, 9, 15, 17, 30, 41), datetime(2023, 9, 15, 17, 47, 3)),
    (datetime(2023, 10, 24, 18, 23, 37), datetime(2023, 10, 24, 18, 25, 16)),
    (datetime(2024, 1, 15, 6, 38, 52), datetime(2024, 1, 15, 6, 48, 15)),
    (datetime(2024, 2, 21, 13, 16, 37), datetime(2024, 2, 21, 13, 22, 35)),
    (datetime(2023, 10, 18, 17, 52, 28), datetime(2023, 10, 19, 17, 11, 3))
]
# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)
When I try to convert the PySpark DataFrame to pandas, I get the error: TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.
df.toPandas().head()
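For context, the TypeError itself comes from pandas, not Spark: pandas 2.0 stopped allowing casts to the unit-less 'datetime64' dtype and now requires an explicit unit such as 'datetime64[ns]'. A minimal pandas-only sketch that reproduces the same message (assuming pandas >= 2.0):

import pandas as pd
from datetime import datetime

s = pd.Series([datetime(2023, 9, 15, 17, 30, 41)], dtype="object")
# s.astype("datetime64")  # raises the same TypeError on pandas >= 2.0
s.astype("datetime64[ns]")  # works: the unit is explicit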
Casting the fields to TimestampType did not resolve the error.
df = df.withColumn("session_ts", df["session_ts"].cast(TimestampType()))
df = df.withColumn("analysis_ts", df["analysis_ts"].cast(TimestampType()))
df.toPandas()
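That cast is effectively a no-op: the schema already declares both columns as TimestampType, which df.printSchema() confirms, so it's no surprise the error is unchanged.

df.printSchema()
# root
#  |-- session_ts: timestamp (nullable = true)
#  |-- analysis_ts: timestamp (nullable = true)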
I was only able to proceed by casting to string, which seems an unnecessary workaround.
df = df.withColumn("session_ts", df["session_ts"].cast(StringType()))
df = df.withColumn("analysis_ts", df["analysis_ts"].cast(StringType()))
df.toPandas()
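If you do fall back on the string cast, the columns can at least be parsed back into real timestamps on the pandas side; a sketch of that round trip using the standard pd.to_datetime:

import pandas as pd

pdf = df.toPandas()
# Parse the stringified columns back into datetime64[ns]
pdf["session_ts"] = pd.to_datetime(pdf["session_ts"])
pdf["analysis_ts"] = pd.to_datetime(pdf["analysis_ts"])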
1 Answer
1) Ensure datetime64[ns] During Conversion
import pyspark.sql.functions as F
# Explicitly cast timestamps to ensure compatibility
df = df.withColumn("session_ts", F.col("session_ts").cast("timestamp"))
df = df.withColumn("analysis_ts", F.col("analysis_ts").cast("timestamp"))

# Convert to pandas
pdf = df.toPandas()
print(pdf.head())
2) Disable PyArrow for Conversion (Fallback to Legacy Conversion)
# Disable PyArrow during the conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")

# Convert to pandas
pdf = df.toPandas()
print(pdf.head())
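Whichever option you use, it's worth confirming that the columns actually arrive as datetime64[ns] on the pandas side rather than as objects; a quick check:

print(pdf.dtypes)
# Expect something like:
# session_ts     datetime64[ns]
# analysis_ts    datetime64[ns]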