Currently the compilation of the Python wheel for the FlashAttention 2 (Dao-AILab/flash-attention) Python package takes several hours, as reported by multiple users on GitHub (see e.g. this issue). What are the possible ways of speeding it up?
1 Answer
I have found that setting these two environment variables will speed up the compilation:
- increasing the number of CPU cores used (e.g. to half of all available physical cores) via MAX_JOBS
- restricting the list of CUDA architectures for which CUDA kernels are compiled to a bare minimum (e.g. Ampere, see this table) via TORCH_CUDA_ARCH_LIST
For example:
export MAX_JOBS=$(($(nproc)/2)) && \
export TORCH_CUDA_ARCH_LIST="8.0" && \
pip install flash_attn --no-build-isolation
Notes:
- The --no-build-isolation switch prevents pip from re-installing dependencies that are already installed.
- TORCH_CUDA_ARCH_LIST takes Bash-style space-separated "lists" (which are technically strings), e.g. "8.0 8.6 8.7".
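If you are unsure which architecture value to use, one option (not part of the original answer, and assuming PyTorch with CUDA support is already installed) is to query the compute capability of the GPU you are actually building for and pass that single value, for example:

# Hypothetical sketch: query GPU 0's compute capability (e.g. "8.0" on an A100)
# and restrict the build to just that architecture.
export TORCH_CUDA_ARCH_LIST=$(python -c "import torch; print('.'.join(map(str, torch.cuda.get_device_capability(0))))") && \
export MAX_JOBS=$(($(nproc)/2)) && \
pip install flash_attn --no-build-isolation

If you build on a machine with several different GPU models, you would instead list each of their compute capabilities in the space-separated string, as in the "8.0 8.6 8.7" example above.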