vLLM: 401 Unauthorized When Loading google/gemma-2-2b Inside VS Code but Works in Terminal

I am trying to load google/gemma-2-2b using vLLM. The script works fine when I run it manually in a regular terminal, but fails inside VS Code with a 401 Unauthorized error:

requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/google/gemma-2-2b/resolve/main/config.json
...
huggingface_hub.errors.GatedRepoError: 401 Client Error.
Cannot access gated repo for url https://huggingface.co/google/gemma-2-2b/resolve/main/config.json.
Access to model google/gemma-2-2b is restricted. You must have access to it and be authenticated to access it. Please log in.

What I’ve Tried

  1. Running the script in a regular terminal
    • Works fine.
  2. Running inside VS Code’s integrated terminal (Ctrl + ~)
    • Fails with 401 Unauthorized.
  3. Logging in inside VS Code’s terminal
    huggingface-cli login
    
    • Still fails.
  4. Passing the token manually inside the script
    import os
    from huggingface_hub import login
    
    token = "hf_xxxxxx"
    login(token=token)
    os.environ["HUGGINGFACE_TOKEN"] = token
    
    • Still fails (see the auth sanity-check sketch after this list).
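
Since both the failing and the working logs print "Currently logged in as: brando" but come from different conda envs (zip_fit vs uutils), it helps to dump exactly what the failing interpreter is picking up. A minimal sanity-check sketch, assuming a recent huggingface_hub that exposes get_token and whoami:

import os
import sys

from huggingface_hub import get_token, whoami  # get_token exists in recent huggingface_hub releases

print("python:", sys.executable)                        # which interpreter/env actually ran
print("HF_HOME:", os.environ.get("HF_HOME"))            # where cached credentials/models live
print("HF_TOKEN set:", "HF_TOKEN" in os.environ)
print("HUGGING_FACE_HUB_TOKEN set:", "HUGGING_FACE_HUB_TOKEN" in os.environ)

token = get_token()  # resolves env vars first, then the token saved by `huggingface-cli login`
print("token found:", token is not None)
if token:
    print("whoami:", whoami(token=token)["name"])

Running this once in the regular terminal and once in VS Code's integrated terminal should reveal whether the two shells resolve different interpreters, caches, or tokens.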

What Actually Works

Running the script manually in a regular terminal outside of VS Code:

python my_script.py

This works fine, which suggests VS Code’s environment is isolated and not inheriting the correct authentication.
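
One detail worth double-checking from step 4 above: as far as I know, huggingface_hub does not read a HUGGINGFACE_TOKEN variable; it looks for HF_TOKEN (or the older HUGGING_FACE_HUB_TOKEN). A hedged workaround sketch that puts the token under those names before vLLM touches the Hub (the token value is a placeholder):

import os

token = "hf_xxxxxx"                            # placeholder; use your real token
os.environ["HF_TOKEN"] = token                 # name read by recent huggingface_hub versions
os.environ["HUGGING_FACE_HUB_TOKEN"] = token   # older name, still honored

from huggingface_hub import login
login(token=token)  # also persists the token to the local HF credentials store

from vllm import LLM  # import vLLM only after the token is in place

llm = LLM(model="google/gemma-2-2b", trust_remote_code=True)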


Full error output:

(zip_fit) brando9@skampere1~ $ python /lfs/skampere1/0/brando9/ZIP-FIT/experiments/load_gemma2_2b_vllm.py

/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
INFO 01-28 22:27:47 __init__.py:183] Automatically detected platform cuda.
Currently logged in as: brando

Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/google/gemma-2-2b/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/ZIP-FIT/experiments/load_gemma2_2b_vllm.py", line 20, in <module>
    llm = LLM(model="google/gemma-2-2b", trust_remote_code=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/utils.py", line 1039, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 239, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 479, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 1047, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 972, in create_model_config
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/config.py", line 282, in __init__
    hf_config = get_config(self.model, trust_remote_code, revision,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/transformers_utils/config.py", line 182, in get_config
    if is_gguf or file_or_path_exists(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/transformers_utils/config.py", line 101, in file_or_path_exists
    return file_exists(model,
           ^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 2855, in file_exists
    get_hf_file_metadata(url, token=token)
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1294, in get_hf_file_metadata
    r = _request_wrapper(
        ^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 278, in _request_wrapper
    response = _request_wrapper(
               ^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 302, in _request_wrapper
    hf_raise_for_status(response)
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 423, in hf_raise_for_status
    raise _format(GatedRepoError, message, response) from e
huggingface_hub.errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-6799ca63-2a30842b12ef9bcd33f2caab;66cf0983-c8e9-42d1-8c10-420cb37c805d)

Cannot access gated repo for url https://huggingface.co/google/gemma-2-2b/resolve/main/config.json.
Access to model google/gemma-2-2b is restricted. You must have access to it and be authenticated to access it. Please log in.

Note: I do have access to this model (the working run below downloads it without issue).
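
To confirm that from the failing environment itself, a small programmatic check can help. This is a sketch using huggingface_hub's model_info, which raises GatedRepoError when the resolved token has no access to the gated repo:

from huggingface_hub import get_token, model_info

# If this prints the repo id, the token resolved in *this* environment can see the
# gated repo; if it raises GatedRepoError, the problem is the token/env, not the account.
info = model_info("google/gemma-2-2b", token=get_token())
print("access OK:", info.id)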

Output from the run that worked:

(uutils) brando9@skampere1~ $ python /lfs/skampere1/0/brando9/ZIP-FIT/experiments/load_gemma2_2b_vllm.py
/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
WARNING 01-28 22:25:15 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Currently logged in as: brando

INFO 01-28 22:25:19 config.py:1826] For Gemma 2, we downcast float32 to bfloat16 instead of float16 by default. Please specify `dtype` if you want to use float16.
INFO 01-28 22:25:19 config.py:1861] Downcasting torch.float32 to torch.bfloat16.
WARNING 01-28 22:25:19 config.py:235] gemma2 has interleaved attention, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 01-28 22:25:23 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='google/gemma-2-2b', speculative_config=None, tokenizer='google/gemma-2-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=google/gemma-2-2b, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 01-28 22:25:24 selector.py:135] Using Flash Attention backend.
INFO 01-28 22:25:25 model_runner.py:1072] Starting to load model google/gemma-2-2b...
INFO 01-28 22:25:26 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:04<00:08,  4.08s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:11<00:05,  5.92s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:12<00:00,  3.55s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:12<00:00,  4.01s/it]

INFO 01-28 22:25:38 model_runner.py:1077] Loading model weights took 4.8999 GB
INFO 01-28 22:25:40 worker.py:232] Memory profiling results: total_gpu_memory=79.15GiB initial_memory_usage=5.47GiB peak_torch_memory=7.25GiB memory_usage_post_profile=5.49GiB non_torch_memory=0.58GiB kv_cache_size=63.41GiB gpu_memory_utilization=0.90
INFO 01-28 22:25:41 gpu_executor.py:113] # GPU blocks: 39957, # CPU blocks: 2520
INFO 01-28 22:25:41 gpu_executor.py:117] Maximum concurrency for 4096 tokens per request: 156.08x
INFO 01-28 22:25:42 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-28 22:25:42 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-28 22:25:56 model_runner.py:1518] Graph capturing finished in 14 secs, took 0.86 GiB
Processed prompts: 100%|██████████████████████████| 1/1 [00:00<00:00,  7.32it/s, est. speed input: 43.97 toks/s, output: 117.24 toks/s]
Model output: [RequestOutput(request_id=0, prompt='Hello, my name is', prompt_token_ids=[2, 4521, 235269, 970, 1503, 603], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' Brandon. I have this and my name is Brandon. So why should I do', token_ids=(45225, 235265, 590, 791, 736, 578, 970, 1503, 603, 45225, 235265, 1704, 3165, 1412, 590, 749), cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1738131957.2502818, last_token_time=1738131957.2502818, first_scheduled_time=1738131957.2579057, first_token_time=1738131957.2903554, time_in_queue=0.007623910903930664, finished_time=1738131957.3883996, scheduler_time=0.0012442255392670631, model_forward_time=None, model_execute_time=None), lora_request=None, num_cached_tokens=0)]
[rank0]:[W128 22:25:57.001296513 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
(uutils) brando9@skampere1~ $
asked Jan 29 at 6:42 by Charlie Parker

1 Answer


FYI, here is how I solved it:

Install vLLM (installing it via lm-harness seems to work well and brings in a working Flash Attention):

# Install lm-harness (https://github.com/EleutherAI/lm-evaluation-harness)
pip install lm_eval[vllm]
# 0.7.0 seems to give issues with gemma2 
pip install vllm==0.6.4.post1
# pip install -e ".[vllm]"
pip install antlr4-python3-runtime==4.11
# to check the installs worked (versions and paths should appear):
pip list | grep lm_eval
pip list | grep vllm
pip list | grep antlr4

Use vLLM 0.6.4.post1, not 0.7.0, or Gemma 2 won't load.
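
After pinning vllm==0.6.4.post1, a minimal smoke test (a sketch; it still assumes the Gemma license has been accepted and that a valid HF token is visible to the environment you run it in):

from vllm import LLM, SamplingParams

# Load the gated Gemma 2 model and generate a short completion, mirroring the working run above.
llm = LLM(model="google/gemma-2-2b", trust_remote_code=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)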
