Currently the compilation of the Python wheel for the FlashAttention 2 (Dao-AILab/flash-attention) Python package takes several hours, as reported by multiple users on GitHub (see e.g. this issue). What are the possible ways of speeding it up?
1 Answer
I have found that setting these two environment variables will speed up the compilation:
- increasing the number of CPU cores used (e.g. to half of all available physical cores) via MAX_JOBS
- restricting the list of CUDA architectures for which CUDA kernels are compiled to a bare minimum (e.g. Ampere, see this table) via TORCH_CUDA_ARCH_LIST
For example:
export MAX_JOBS=$(($(nproc)/2)) && \
export TORCH_CUDA_ARCH_LIST="8.0" && \
pip install flash_attn --no-build-isolation
Notes:
- The --no-build-isolation switch prevents pip from re-installing dependencies that are already installed.
- TORCH_CUDA_ARCH_LIST takes Bash-style space-separated "lists" (which are technically strings), e.g. "8.0 8.6 8.7".
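If you are unsure which architecture value to use, one option (not part of the original answer, and assuming PyTorch with CUDA support is already installed) is to query the compute capability of the GPU you are actually building for and pass that single value, for example:

# Hypothetical sketch: query GPU 0's compute capability (e.g. "8.0" on an A100)
# and restrict the build to just that architecture.
export TORCH_CUDA_ARCH_LIST=$(python -c "import torch; print('.'.join(map(str, torch.cuda.get_device_capability(0))))") && \
export MAX_JOBS=$(($(nproc)/2)) && \
pip install flash_attn --no-build-isolation

If you build on a machine with several different GPU models, you would instead list each of their compute capabilities in the space-separated string, as in the "8.0 8.6 8.7" example above.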