header - Issue with renaming/selecting columns in pyspark - Stack Overflow

I have an Excel file that I'm reading into Databricks using PySpark. The data has extra columns at the end that I do not want included. I use the following code to drop them:

data_object = spark.read.format("com.crealytics.spark.excel") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .option("dataAddress", "'1. ITEM'!A1") \
  .load("path/to/file")

data_object = data_object.select(data_object.columns[0:21])

It then errors on the last line with the following:

AnalysisException: Column '`ITEM NUMBER

The entirety of the first column header is as follows:

'ITEM NUMBER\nMandatory Field\nFor Formula Calc. Only'

So, it appears that the line break is causing the issue, but if I attempt to replace all of the \n characters in the header row, I get the same error as above.

The ultimate goal is to rename the column headers to match the database using withColumnRenamed, which does work. I also tried removing the extra columns after renaming (as opposed to right after reading the file, like in the code above), but because one of the extra columns has the same name as another column in the dataframe, that produces an ambiguity error instead.


asked Mar 12 at 19:14 by PracticingPython

  • I can share a step-by-step approach on this; see if it helps your case. – Debayan, Apr 8 at 6:05

1 Answer
  1. normalize_column_names function: This removes line breaks (\n) and leading/trailing spaces from the column headers, so the cleaned names no longer trip up Spark's column-name parsing in later steps.

  2. select: You explicitly select the first 21 columns by slicing the list of column names.

  3. withColumnRenamed: This renames columns to match the desired names. If there are more columns to rename, you can extend the column_mapping dictionary.

  4. Handling duplicates: If you have duplicate column names after cleaning, consider appending a suffix (e.g., _duplicate) to differentiate them.
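Point 4 above can be sketched as a small pure-Python helper (the function and suffix names here are hypothetical, not part of the original answer):

```python
def dedupe_column_names(columns):
    """Append a numeric suffix to repeated names so every column is unique."""
    seen = {}
    result = []
    for name in columns:
        if name in seen:
            seen[name] += 1
            result.append(f"{name}_duplicate{seen[name]}")
        else:
            seen[name] = 0
            result.append(name)
    return result

print(dedupe_column_names(["ITEM", "DESC", "ITEM"]))
# ['ITEM', 'DESC', 'ITEM_duplicate1']
```

You would apply this to `data_object.columns` (via `toDF(*...)`) before selecting, so that a later `select` by name is unambiguous.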

# Step 1: Read the Excel file
data_object = spark.read.format("com.crealytics.spark.excel") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .option("dataAddress", "'1. ITEM'!A1") \
  .load("path/to/file")

# Step 2: Normalize column names
def normalize_column_names(columns):
    # Strip line breaks and surrounding whitespace from each header
    return [name.replace("\n", "").strip() for name in columns]

data_object = data_object.toDF(*normalize_column_names(data_object.columns))

# Step 3: Select only the first 21 columns
data_object = data_object.select(data_object.columns[:21])

# Step 4: Rename columns
# Mapping of original column names to desired names
column_mapping = {
    "ITEM NUMBERMandatory FieldFor Formula Calc. Only": "Item_Number",
    # Add other mappings for remaining columns here
}

for old_name, new_name in column_mapping.items():
    data_object = data_object.withColumnRenamed(old_name, new_name)

# Final DataFrame is clean and ready for database usage
data_object.show()
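If the headers may also contain tabs or doubled spaces, a slightly more defensive normalizer (a pure-Python sketch, not part of the original answer) can collapse every whitespace run into a single space. Note that this changes the cleaned names, so the `column_mapping` keys above would need spaces, e.g. `"ITEM NUMBER Mandatory Field For Formula Calc. Only"`:

```python
import re

def normalize_column_name(name):
    # Collapse any run of whitespace (including \n and \t) into one space, then trim.
    return re.sub(r"\s+", " ", name).strip()

print(normalize_column_name("ITEM NUMBER\nMandatory Field\nFor Formula Calc. Only"))
# ITEM NUMBER Mandatory Field For Formula Calc. Only
```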
