header - Issue with renaming/selecting columns in pyspark - Stack Overflow

I have an Excel file that I'm reading into Databricks using PySpark. The data has extra columns at the end that I do not want included. I use the following code to drop them:

data_object = spark.read.format("com.crealytics.spark.excel") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .option("dataAddress", "'1. ITEM'!A1") \
  .load("path/to/file")

data_object = data_object.select(data_object.columns[0:21])

It then errors on the last line with the following:

AnalysisException: Column '`ITEM NUMBER

The entirety of the first column header is as follows:

'ITEM NUMBER\nMandatory Field\nFor Formula Calc. Only'

So, it appears that the line break is causing the issue, but if I attempt to replace all of the \n characters in the header row, I get the same error as above.

The ultimate goal is to rename the column headers to match the database using withColumnRenamed, which does work. I also tried removing the extra columns after renaming (as opposed to right after reading the file, like in the code above), but because one of the extra columns has the same name as another column in the dataframe, that produces an ambiguity error instead.


asked Mar 12 at 19:14 by PracticingPython

  • I can share a step-by-step approach on this; see if it helps your case. – Debayan, Apr 8 at 6:05

1 Answer
  1. normalize_column_names function: This removes line breaks (\n) and leading/trailing spaces from the column headers, so the cleaned names no longer trip up Spark's column-name parsing in later steps.

  2. select: You explicitly select the first 21 columns by slicing the list of column names.

  3. withColumnRenamed: This renames columns to match the desired names. If there are more columns to rename, you can extend the column_mapping dictionary.

  4. Handling duplicates: If you have duplicate column names after cleaning, consider appending a suffix (e.g., _duplicate) to differentiate them.
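Point 4 above can be sketched as a small pure-Python helper (the function and suffix names here are hypothetical, not part of the original answer):

```python
def dedupe_column_names(columns):
    """Append a numeric suffix to repeated names so every column is unique."""
    seen = {}
    result = []
    for name in columns:
        if name in seen:
            seen[name] += 1
            result.append(f"{name}_duplicate{seen[name]}")
        else:
            seen[name] = 0
            result.append(name)
    return result

print(dedupe_column_names(["ITEM", "DESC", "ITEM"]))
# ['ITEM', 'DESC', 'ITEM_duplicate1']
```

You would apply this to `data_object.columns` (via `toDF(*...)`) before selecting, so that a later `select` by name is unambiguous.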

# Step 1: Read the Excel file
data_object = spark.read.format("com.crealytics.spark.excel") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .option("dataAddress", "'1. ITEM'!A1") \
  .load("path/to/file")

# Step 2: Normalize column names
def normalize_column_names(columns):
    # Strip line breaks and surrounding whitespace from each header
    return [name.replace("\n", "").strip() for name in columns]

data_object = data_object.toDF(*normalize_column_names(data_object.columns))

# Step 3: Select only the first 21 columns
data_object = data_object.select(data_object.columns[:21])

# Step 4: Rename columns
# Mapping of original column names to desired names
column_mapping = {
    "ITEM NUMBERMandatory FieldFor Formula Calc. Only": "Item_Number",
    # Add other mappings for remaining columns here
}

for old_name, new_name in column_mapping.items():
    data_object = data_object.withColumnRenamed(old_name, new_name)

# Final DataFrame is clean and ready for database usage
data_object.show()
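If the headers may also contain tabs or doubled spaces, a slightly more defensive normalizer (a pure-Python sketch, not part of the original answer) can collapse every whitespace run into a single space. Note that this changes the cleaned names, so the `column_mapping` keys above would need spaces, e.g. `"ITEM NUMBER Mandatory Field For Formula Calc. Only"`:

```python
import re

def normalize_column_name(name):
    # Collapse any run of whitespace (including \n and \t) into one space, then trim.
    return re.sub(r"\s+", " ", name).strip()

print(normalize_column_name("ITEM NUMBER\nMandatory Field\nFor Formula Calc. Only"))
# ITEM NUMBER Mandatory Field For Formula Calc. Only
```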
