Situation:
I read 25M+ rows (~300 columns) from BigQuery and write them to SQL Server with JDBC, and it takes too much time. When I try to figure out which step is the bottleneck, I end up at the SDFBoundedSourceReader step. It emits elements one by one, which increases the pipeline's elapsed time, requires a lot of vCPUs, and produces many errors like:
Error message from worker: Error encountered with the status channel: SDK harness sdk-0-0 disconnected.
Operation ongoing in bundle process_bundle-3484514982132990920-35 for at least 12m39s without outputting or completing:
Completed work item 7528337809130607698 UNSUCCESSFULLY: CANCELLED: [type.googleapis/util.MessageSetPayload='[dist_proc.dax.internal.StackTraceProto] { stack_top_loc { filepath: .... [dist_proc.dax.workflow.workflow_utils_message_ext]: WORK_PROGRESS_UPDATE_LEASE_ALREADY_CANCELLED }']
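For context, the pipeline is essentially BigQuery → JDBC, roughly like the sketch below. This is only a sketch: the project/dataset/table names, the JDBC connection details, and the use of apache_beam.io.jdbc.WriteToJdbc are placeholders/assumptions, not my exact code.

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery
from apache_beam.io.jdbc import WriteToJdbc
from apache_beam.options.pipeline_options import PipelineOptions

# Runner/worker options omitted here; they are sketched further down.
with beam.Pipeline(options=PipelineOptions()) as p:
    _ = (
        p
        # ~25M rows, ~300 columns; this read is where SDFBoundedSourceReader shows up
        | "ReadFromBQ" >> ReadFromBigQuery(
            table="my-project:my_dataset.my_table",      # placeholder
            method=ReadFromBigQuery.Method.DIRECT_READ,
            output_type="BEAM_ROW",  # schema-aware rows for the JDBC write
        )
        # cross-language JdbcIO write; connection details are placeholders
        | "WriteToSQLServer" >> WriteToJdbc(
            table_name="dbo.my_table",
            driver_class_name="com.microsoft.sqlserver.jdbc.SQLServerDriver",
            jdbc_url="jdbc:sqlserver://my-host:1433;databaseName=my_db",
            username="my_user",
            password="my_password",
        )
    )
```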
I tried:
- adding clustering and/or partitioning to the BigQuery table
- adding the experiment pre_optimize=all
- num workers: 5, 10, 15, 20 (though of course I would prefer to use as few workers as possible)
- machine type: n2-standard-4, n2-standard-8, n2-standard-16
- autoscaling algorithm: THROUGHPUT_BASED (always)
- write batch size: 10000, 25000, 50000, 100000, 250000
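The worker/autoscaling settings above are passed as standard Dataflow pipeline options, roughly like this (project, region, and bucket values are placeholders); the write batch size is set on the JDBC write transform itself, not as a pipeline option:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/tmp",   # placeholder
    num_workers=5,
    max_num_workers=20,
    machine_type="n2-standard-8",         # also tried n2-standard-4 / -16
    autoscaling_algorithm="THROUGHPUT_BASED",
    experiments=["pre_optimize=all"],
)
```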
ReadFromBigQuery configurations:
- method: EXPORT
- method: DIRECT_READ
- reading from a query
- reading from a table
- output type: BEAM_ROW
- output type: PYTHON_DICT, providing a schema I created (a NamedTuple)
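The read variants listed above look roughly like this (table/query names are placeholders, and the NamedTuple shows only two of the ~300 columns):

```python
from typing import NamedTuple, Optional

from apache_beam import coders
from apache_beam.io.gcp.bigquery import ReadFromBigQuery

# Export-based read from a table (default PYTHON_DICT output)
read_export = ReadFromBigQuery(
    table="my-project:my_dataset.my_table",       # placeholder
    method=ReadFromBigQuery.Method.EXPORT,
    gcs_location="gs://my-bucket/bq_export_tmp",  # placeholder, needed for EXPORT
)

# Storage Read API read from a query, returning schema-aware Beam Rows
read_direct = ReadFromBigQuery(
    query="SELECT * FROM `my-project.my_dataset.my_table`",  # placeholder
    use_standard_sql=True,
    method=ReadFromBigQuery.Method.DIRECT_READ,
    output_type="BEAM_ROW",
)

# PYTHON_DICT output plus a NamedTuple schema I register myself
# (tedious with ~300 columns; only two fields shown here)
class MyRow(NamedTuple):
    id: int
    name: Optional[str]

coders.registry.register_coder(MyRow, coders.RowCoder)
# ...then something like: dicts | beam.Map(lambda d: MyRow(**d)).with_output_types(MyRow)
```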
I'd prefer getting the output as BEAM_ROW, because providing a schema also takes a lot of time with ~300 columns, but any idea to get better performance is welcome. Please check the Dataflow screenshots below.
[screenshot: example dataflow]
[screenshot: example dataflow2]