I am trying to load google/gemma-2-2b using vLLM. The script works fine when I run it manually in a regular terminal, but fails inside VS Code with a 401 Unauthorized error:
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url:
https://huggingface.co/google/gemma-2-2b/resolve/main/config.json
...
huggingface_hub.errors.GatedRepoError: 401 Client Error.
Cannot access gated repo for url https://huggingface.co/google/gemma-2-2b/resolve/main/config.json.
Access to model google/gemma-2-2b is restricted. You must have access to it and be authenticated to access it. Please log in.
What I’ve Tried
- Running the script in a regular terminal - works fine.
- Running inside VS Code’s integrated terminal (Ctrl + ~) - fails with 401 Unauthorized.
- Logging in inside VS Code’s terminal with huggingface-cli login - still fails.
- Passing the token manually inside the script - still fails (a cleaned-up version of this attempt is sketched right after this list):
import os
from huggingface_hub import login
token = "hf_xxxxxx"
login(token=token)
os.environ["HUGGINGFACE_TOKEN"] = token
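For what it's worth, here is a cleaned-up sketch of that manual-token attempt. My understanding (which may be wrong) is that huggingface_hub reads HF_TOKEN or the older HUGGING_FACE_HUB_TOKEN rather than HUGGINGFACE_TOKEN, and that the variable has to be set before the first hub request; the token string is obviously a placeholder:
import os

# Set the variable huggingface_hub actually checks (HF_TOKEN) before any
# hub request is made; HUGGINGFACE_TOKEN is not consulted by the library.
os.environ["HF_TOKEN"] = "hf_xxxxxx"  # placeholder token

from huggingface_hub import login, whoami
login(token=os.environ["HF_TOKEN"])
print("logged in as:", whoami()["name"])  # sanity check before loading the model

from vllm import LLM
llm = LLM(model="google/gemma-2-2b", trust_remote_code=True)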
What Actually Works
Running the script manually in a regular terminal outside of VS Code:
python my_script.py
This works fine, which suggests VS Code’s environment is isolated and not inheriting the correct authentication.
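To pin down what differs between the two environments, a quick diagnostic like the one below can be run from both the regular terminal and the VS Code terminal. This is just a sketch: HF_TOKEN / HUGGING_FACE_HUB_TOKEN are the variables I believe huggingface_hub consults, and the token file path is the default one used by huggingface-cli login, so adjust for your setup.
import os
from huggingface_hub import whoami

# Print the auth-related state this Python process actually sees, so the
# VS Code integrated terminal can be compared against a regular one.
for var in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN", "HF_HOME", "TRANSFORMERS_CACHE"):
    print(var, "=", os.environ.get(var))

# Default location where `huggingface-cli login` stores the token
# ($HF_HOME/token if HF_HOME is set, else ~/.cache/huggingface/token).
token_path = os.path.join(
    os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface")), "token"
)
print("token file:", token_path, "exists:", os.path.exists(token_path))

try:
    print("logged in as:", whoami()["name"])  # resolves the token the hub library would use
except Exception as e:
    print("whoami failed:", e)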
Full error output (inside VS Code's terminal):
(zip_fit) brando9@skampere1~ $ python /lfs/skampere1/0/brando9/ZIP-FIT/experiments/load_gemma2_2b_vllm.py
/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
INFO 01-28 22:27:47 __init__.py:183] Automatically detected platform cuda.
Currently logged in as: brando
Traceback (most recent call last):
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
response.raise_for_status()
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/google/gemma-2-2b/resolve/main/config.json
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/lfs/skampere1/0/brando9/ZIP-FIT/experiments/load_gemma2_2b_vllm.py", line 20, in <module>
llm = LLM(model="google/gemma-2-2b", trust_remote_code=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/utils.py", line 1039, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 239, in __init__
self.llm_engine = self.engine_class.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 479, in from_engine_args
engine_config = engine_args.create_engine_config(usage_context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 1047, in create_engine_config
model_config = self.create_model_config()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 972, in create_model_config
return ModelConfig(
^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/config.py", line 282, in __init__
hf_config = get_config(self.model, trust_remote_code, revision,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/transformers_utils/config.py", line 182, in get_config
if is_gguf or file_or_path_exists(
^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/transformers_utils/config.py", line 101, in file_or_path_exists
return file_exists(model,
^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 2855, in file_exists
get_hf_file_metadata(url, token=token)
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1294, in get_hf_file_metadata
r = _request_wrapper(
^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 278, in _request_wrapper
response = _request_wrapper(
^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 302, in _request_wrapper
hf_raise_for_status(response)
File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 423, in hf_raise_for_status
raise _format(GatedRepoError, message, response) from e
huggingface_hub.errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-6799ca63-2a30842b12ef9bcd33f2caab;66cf0983-c8e9-42d1-8c10-420cb37c805d)
Cannot access gated repo for url https://huggingface.co/google/gemma-2-2b/resolve/main/config.json.
Access to model google/gemma-2-2b is restricted. You must have access to it and be authenticated to access it. Please log in.
Note: my Hugging Face account has already been granted access to this gated repo, so this is not a missing-access issue.
Output from the working run (regular terminal):
(uutils) brando9@skampere1~ $ python /lfs/skampere1/0/brando9/ZIP-FIT/experiments/load_gemma2_2b_vllm.py
/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
WARNING 01-28 22:25:15 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Currently logged in as: brando
INFO 01-28 22:25:19 config.py:1826] For Gemma 2, we downcast float32 to bfloat16 instead of float16 by default. Please specify `dtype` if you want to use float16.
INFO 01-28 22:25:19 config.py:1861] Downcasting torch.float32 to torch.bfloat16.
WARNING 01-28 22:25:19 config.py:235] gemma2 has interleaved attention, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 01-28 22:25:23 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='google/gemma-2-2b', speculative_config=None, tokenizer='google/gemma-2-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=google/gemma-2-2b, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 01-28 22:25:24 selector.py:135] Using Flash Attention backend.
INFO 01-28 22:25:25 model_runner.py:1072] Starting to load model google/gemma-2-2b...
INFO 01-28 22:25:26 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:04<00:08, 4.08s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:11<00:05, 5.92s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:12<00:00, 3.55s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:12<00:00, 4.01s/it]
INFO 01-28 22:25:38 model_runner.py:1077] Loading model weights took 4.8999 GB
INFO 01-28 22:25:40 worker.py:232] Memory profiling results: total_gpu_memory=79.15GiB initial_memory_usage=5.47GiB peak_torch_memory=7.25GiB memory_usage_post_profile=5.49GiB non_torch_memory=0.58GiB kv_cache_size=63.41GiB gpu_memory_utilization=0.90
INFO 01-28 22:25:41 gpu_executor.py:113] # GPU blocks: 39957, # CPU blocks: 2520
INFO 01-28 22:25:41 gpu_executor.py:117] Maximum concurrency for 4096 tokens per request: 156.08x
INFO 01-28 22:25:42 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-28 22:25:42 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-28 22:25:56 model_runner.py:1518] Graph capturing finished in 14 secs, took 0.86 GiB
Processed prompts: 100%|██████████████████████████| 1/1 [00:00<00:00, 7.32it/s, est. speed input: 43.97 toks/s, output: 117.24 toks/s]
Model output: [RequestOutput(request_id=0, prompt='Hello, my name is', prompt_token_ids=[2, 4521, 235269, 970, 1503, 603], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' Brandon. I have this and my name is Brandon. So why should I do', token_ids=(45225, 235265, 590, 791, 736, 578, 970, 1503, 603, 45225, 235265, 1704, 3165, 1412, 590, 749), cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1738131957.2502818, last_token_time=1738131957.2502818, first_scheduled_time=1738131957.2579057, first_token_time=1738131957.2903554, time_in_queue=0.007623910903930664, finished_time=1738131957.3883996, scheduler_time=0.0012442255392670631, model_forward_time=None, model_execute_time=None), lora_request=None, num_cached_tokens=0)]
[rank0]:[W128 22:25:57.001296513 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
(uutils) brando9@skampere1~ $
1 Answer
FYI, the way I solved it:
Install vLLM (installing it via lm-harness seems to work well and pulls in a working flash-attn):
# Install lm-harness (https://github.com/EleutherAI/lm-evaluation-harness)
pip install lm_eval[vllm]
# 0.7.0 seems to give issues with gemma2
pip install vllm==0.6.4.post1
# pip install -e ".[vllm]"
pip install antlr4-python3-runtime==4.11
# to check installs worked do (versions and paths should appear)
pip list | grep lm_eval
pip list | grep vllm
pip list | grep antlr4
Use vLLM 0.6.4.post1, not 0.7.0, or Gemma 2 won't load.
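For completeness, the load script itself is nothing special; a minimal sketch along the lines of the one in the question (model name and prompt taken from the question, everything else left at defaults):
# Minimal sketch: load google/gemma-2-2b with the pinned vLLM and run one prompt.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-2b", trust_remote_code=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print("Model output:", outputs)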