I need to run inference with vLLM over a large dataset using Ray Data; the code structure is as below:
import ray

ds = ray.data.read_parquet(my_input_path)
ds = ds.map_batches(
    VLLMPredictor,  # callable class that wraps the vLLM engine
    concurrency=ray_concurrency,
    ...
    **resources_kwarg,
)
ds.write_parquet(my_output_path)
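For context, VLLMPredictor is roughly the usual callable-class pattern for map_batches; a minimal sketch, assuming the prompts live in a "prompt" column and with placeholder model and sampling settings:

from vllm import LLM, SamplingParams

class VLLMPredictor:
    def __init__(self):
        # Placeholder model and sampling params; the real values come from my config.
        self.llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
        self.sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch):
        # batch is a dict of column -> values; "prompt" is an assumed column name.
        outputs = self.llm.generate(list(batch["prompt"]), self.sampling_params)
        batch["generated_text"] = [o.outputs[0].text for o in outputs]
        return batch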
What I observe is that on each node, the write process only starts after all inference jobs have finished. Is there a way to achieve streaming writes, e.g. write out every n batches?
The reasons are:
- During inference only the GPUs are busy and the CPUs sit idle; I don't want to waste those CPU resources in the meantime.
- The dataset is large (~100 GB), so I don't want to hold the whole result in memory, which may cause OOM, and I'd like to see inference results as soon as they are generated (a rough sketch of the kind of chunked write I have in mind follows this list).
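Something like the following is the behavior I'm after, written by hand as a rough sketch (assuming I pull results with iter_batches in pandas format; the chunk size, file naming, and "generated_text" column are made up for illustration):

import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CHUNK_BATCHES = 8  # flush to disk every 8 inference batches (arbitrary)

def flush(frames, file_idx, out_dir):
    # Write the buffered batches as one parquet part file.
    table = pa.Table.from_pandas(pd.concat(frames, ignore_index=True))
    pq.write_table(table, os.path.join(out_dir, f"part-{file_idx:05d}.parquet"))

buffer, file_idx = [], 0
for batch in ds.iter_batches(batch_size=1024, batch_format="pandas"):
    buffer.append(batch)
    if len(buffer) >= CHUNK_BATCHES:
        flush(buffer, file_idx, my_output_path)
        buffer, file_idx = [], file_idx + 1
if buffer:  # flush the remainder
    flush(buffer, file_idx, my_output_path)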
Some of the logs during inference:
run_local/0
run_local/0 Processed prompts: 37%|███▋ | 381/1024 [01:46<03:10, 3.37it/s, est. speed input: 2210.75 toks/s, output: 341.30 toks/s]
run_local/0 Running Dataset. Active & requested resources: 0/8 CPU, 1/1 GPU, 483.4MB/40.0MB object store: : 0.00 row [11:25, ? row/s]
run_local/0 - ReadParquet->Map(transform_row): Tasks: 0 [backpressured]; Queued blocks: 195; Resources: 0.0 CPU, 227.4MB object store: 3%|▎ | 94.2k/3.7
The "Processed prompts" bar keeps changing, while the object store bar does not move for a long time, no matter what --object-store-memory I set.
Does Ray support this? How can I achieve it?