powerpoint - How to use python-pptx to extract infrequent tables? - Stack Overflow

I have a pipeline that needs to ingest PowerPoint (pptx) files using Python. These files will mostly contain text, occasionally contain tables, and won't always share the same format or design. I need to extract this data, including the (mostly text) cell values of tables when present, and eventually load it into a table with presentation name, presentation date, and a free-text field holding all of the ppt content.

I've been exploring the python-pptx module, and extracting most of the data is easy enough with the code below, but it skips a table in a slide:

for slide_number, slide in enumerate(presentation.slides):
    print(f"Slide {slide_number + 1}:")
    for shape in slide.shapes:
        if hasattr(shape, "text"):
            print(shape.text)

Question is, what's the best way to grab tables with this module (or another lightweight tool)? I've been perusing documentation for the module but an obvious solution hasn't presented itself given the tables can appear anywhere.

asked Mar 11 at 16:01 by drymolasses

1 Answer

Try this:

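from pptx import Presentation

# Placeholder path; point this at the file you want to read
presentation = Presentation("example.pptx")

for slide_number, slide in enumerate(presentation.slides):
    print(f"Slide {slide_number + 1}:")
    for shape in slide.shapes:
        # Text boxes, titles and similar shapes expose their text directly
        if hasattr(shape, "text"):
            print(shape.text)
        # Check if the shape has a table
        if shape.has_table:
            # Iterate through the cells, left to right and top to bottom
            for cell in shape.table.iter_cells():
                print(cell.text)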

Using this pptx file to test:

The output is:

Slide 1:
File title
Table read testing
Slide 2:
Column A
Column B
Column C
Column D
Cell A1
Cell B1
Cell C1
Cell D1
Cell A2
Cell B2
Cell C2
Cell D2
Cell A3
Cell B3
Cell C3
Cell D3
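
Since the end goal is one row per presentation (name, date, free text), here is a rough sketch of how the loop above could be wrapped up. Treat it as a starting point rather than the definitive way to do it. The names `shape_lines` and `presentation_record` are made up for this example, the " | " cell separator is an arbitrary choice, and using the file's `created` core property as the presentation date is an assumption (it can be empty or wrong for some files). It also descends into grouped shapes, in case a table or text box sits inside a group.

```python
from pathlib import Path


def shape_lines(shape):
    """Yield text lines from one shape, descending into groups and tables."""
    # Group shapes expose their children through a .shapes collection.
    if hasattr(shape, "shapes"):
        for child in shape.shapes:
            yield from shape_lines(child)
    # Graphic frames that hold a table report has_table as True.
    elif getattr(shape, "has_table", False):
        for row in shape.table.rows:
            # One line per table row, cells joined with " | " (arbitrary separator).
            yield " | ".join(cell.text for cell in row.cells)
    # Text boxes, titles and other text-bearing shapes expose .text directly.
    elif hasattr(shape, "text"):
        if shape.text.strip():
            yield shape.text


def presentation_record(path):
    """Return a dict with name, date and free text for one .pptx file."""
    # Imported here so shape_lines itself has no hard dependency on python-pptx.
    from pptx import Presentation

    prs = Presentation(path)
    lines = [
        line
        for slide in prs.slides
        for shape in slide.shapes
        for line in shape_lines(shape)
    ]
    return {
        "presentation_name": Path(path).name,
        # Assumption: the "created" core property stands in for the presentation date.
        "presentation_date": prs.core_properties.created,
        "content": "\n".join(lines),
    }
```

Calling `presentation_record("example.pptx")` (placeholder path) should give a dict that can be appended to a list and loaded into whatever table the pipeline writes to. Keeping each table row on its own line means the free-text field still shows which cells belonged together.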
