vLLM: 401 Unauthorized When Loading google/gemma-2-2b Inside VS Code but Works in Terminal

I am trying to load google/gemma-2-2b using vLLM. The script works fine when I run it manually in a regular terminal, but fails inside VS Code with a 401 Unauthorized error:

requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/google/gemma-2-2b/resolve/main/config.json
...
huggingface_hub.errors.GatedRepoError: 401 Client Error.
Cannot access gated repo for url https://huggingface.co/google/gemma-2-2b/resolve/main/config.json.
Access to model google/gemma-2-2b is restricted. You must have access to it and be authenticated to access it. Please log in.

What I’ve Tried

  1. Running the script in a regular terminal
    • Works fine.
  2. Running inside VS Code’s integrated terminal (Ctrl + ~)
    • Fails with 401 Unauthorized.
  3. Logging in inside VS Code’s terminal
    huggingface-cli login
    
    • Still fails.
  4. Passing the token manually inside the script
    import os
    from huggingface_hub import login
    
    token = "hf_xxxxxx"
    login(token=token)
    os.environ["HUGGINGFACE_TOKEN"] = token
    
    • Still fails (see the auth sanity-check sketch after this list).
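
Since both the failing and the working logs print "Currently logged in as: brando" but come from different conda envs (zip_fit vs uutils), it helps to dump exactly what the failing interpreter is picking up. A minimal sanity-check sketch, assuming a recent huggingface_hub that exposes get_token and whoami:

import os
import sys

from huggingface_hub import get_token, whoami  # get_token exists in recent huggingface_hub releases

print("python:", sys.executable)                        # which interpreter/env actually ran
print("HF_HOME:", os.environ.get("HF_HOME"))            # where cached credentials/models live
print("HF_TOKEN set:", "HF_TOKEN" in os.environ)
print("HUGGING_FACE_HUB_TOKEN set:", "HUGGING_FACE_HUB_TOKEN" in os.environ)

token = get_token()  # resolves env vars first, then the token saved by `huggingface-cli login`
print("token found:", token is not None)
if token:
    print("whoami:", whoami(token=token)["name"])

Running this once in the regular terminal and once in VS Code's integrated terminal should reveal whether the two shells resolve different interpreters, caches, or tokens.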

What Actually Works

Running the script manually in a regular terminal outside of VS Code:

python my_script.py

This works fine, which suggests VS Code’s environment is isolated and not inheriting the correct authentication.
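
One detail worth double-checking from step 4 above: as far as I know, huggingface_hub does not read a HUGGINGFACE_TOKEN variable; it looks for HF_TOKEN (or the older HUGGING_FACE_HUB_TOKEN). A hedged workaround sketch that puts the token under those names before vLLM touches the Hub (the token value is a placeholder):

import os

token = "hf_xxxxxx"                            # placeholder; use your real token
os.environ["HF_TOKEN"] = token                 # name read by recent huggingface_hub versions
os.environ["HUGGING_FACE_HUB_TOKEN"] = token   # older name, still honored

from huggingface_hub import login
login(token=token)  # also persists the token to the local HF credentials store

from vllm import LLM  # import vLLM only after the token is in place

llm = LLM(model="google/gemma-2-2b", trust_remote_code=True)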


Full error output:

(zip_fit) brando9@skampere1~ $ python /lfs/skampere1/0/brando9/ZIP-FIT/experiments/load_gemma2_2b_vllm.py

/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
INFO 01-28 22:27:47 __init__.py:183] Automatically detected platform cuda.
Currently logged in as: brando

Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/google/gemma-2-2b/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/ZIP-FIT/experiments/load_gemma2_2b_vllm.py", line 20, in <module>
    llm = LLM(model="google/gemma-2-2b", trust_remote_code=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/utils.py", line 1039, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 239, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 479, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 1047, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 972, in create_model_config
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/config.py", line 282, in __init__
    hf_config = get_config(self.model, trust_remote_code, revision,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/transformers_utils/config.py", line 182, in get_config
    if is_gguf or file_or_path_exists(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/vllm/transformers_utils/config.py", line 101, in file_or_path_exists
    return file_exists(model,
           ^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 2855, in file_exists
    get_hf_file_metadata(url, token=token)
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1294, in get_hf_file_metadata
    r = _request_wrapper(
        ^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 278, in _request_wrapper
    response = _request_wrapper(
               ^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 302, in _request_wrapper
    hf_raise_for_status(response)
  File "/lfs/skampere1/0/brando9/miniconda/envs/zip_fit/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 423, in hf_raise_for_status
    raise _format(GatedRepoError, message, response) from e
huggingface_hub.errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-6799ca63-2a30842b12ef9bcd33f2caab;66cf0983-c8e9-42d1-8c10-420cb37c805d)

Cannot access gated repo for url https://huggingface.co/google/gemma-2-2b/resolve/main/config.json.
Access to model google/gemma-2-2b is restricted. You must have access to it and be authenticated to access it. Please log in.

Note: I do have access to this model (the working run below downloads it without issue).
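
To confirm that from the failing environment itself, a small programmatic check can help. This is a sketch using huggingface_hub's model_info, which raises GatedRepoError when the resolved token has no access to the gated repo:

from huggingface_hub import get_token, model_info

# If this prints the repo id, the token resolved in *this* environment can see the
# gated repo; if it raises GatedRepoError, the problem is the token/env, not the account.
info = model_info("google/gemma-2-2b", token=get_token())
print("access OK:", info.id)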

Output from the run that worked:

(uutils) brando9@skampere1~ $ python /lfs/skampere1/0/brando9/ZIP-FIT/experiments/load_gemma2_2b_vllm.py
/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
WARNING 01-28 22:25:15 cuda.py:23] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Currently logged in as: brando

INFO 01-28 22:25:19 config.py:1826] For Gemma 2, we downcast float32 to bfloat16 instead of float16 by default. Please specify `dtype` if you want to use float16.
INFO 01-28 22:25:19 config.py:1861] Downcasting torch.float32 to torch.bfloat16.
WARNING 01-28 22:25:19 config.py:235] gemma2 has interleaved attention, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 01-28 22:25:23 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='google/gemma-2-2b', speculative_config=None, tokenizer='google/gemma-2-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=google/gemma-2-2b, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 01-28 22:25:24 selector.py:135] Using Flash Attention backend.
INFO 01-28 22:25:25 model_runner.py:1072] Starting to load model google/gemma-2-2b...
INFO 01-28 22:25:26 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:04<00:08,  4.08s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:11<00:05,  5.92s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:12<00:00,  3.55s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:12<00:00,  4.01s/it]

INFO 01-28 22:25:38 model_runner.py:1077] Loading model weights took 4.8999 GB
INFO 01-28 22:25:40 worker.py:232] Memory profiling results: total_gpu_memory=79.15GiB initial_memory_usage=5.47GiB peak_torch_memory=7.25GiB memory_usage_post_profile=5.49GiB non_torch_memory=0.58GiB kv_cache_size=63.41GiB gpu_memory_utilization=0.90
INFO 01-28 22:25:41 gpu_executor.py:113] # GPU blocks: 39957, # CPU blocks: 2520
INFO 01-28 22:25:41 gpu_executor.py:117] Maximum concurrency for 4096 tokens per request: 156.08x
INFO 01-28 22:25:42 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-28 22:25:42 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-28 22:25:56 model_runner.py:1518] Graph capturing finished in 14 secs, took 0.86 GiB
Processed prompts: 100%|██████████████████████████| 1/1 [00:00<00:00,  7.32it/s, est. speed input: 43.97 toks/s, output: 117.24 toks/s]
Model output: [RequestOutput(request_id=0, prompt='Hello, my name is', prompt_token_ids=[2, 4521, 235269, 970, 1503, 603], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' Brandon. I have this and my name is Brandon. So why should I do', token_ids=(45225, 235265, 590, 791, 736, 578, 970, 1503, 603, 45225, 235265, 1704, 3165, 1412, 590, 749), cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1738131957.2502818, last_token_time=1738131957.2502818, first_scheduled_time=1738131957.2579057, first_token_time=1738131957.2903554, time_in_queue=0.007623910903930664, finished_time=1738131957.3883996, scheduler_time=0.0012442255392670631, model_forward_time=None, model_execute_time=None), lora_request=None, num_cached_tokens=0)]
[rank0]:[W128 22:25:57.001296513 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
(uutils) brando9@skampere1~ $
asked Jan 29 at 6:42 by Charlie Parker

1 Answer


FYI, here is how I solved it:

Install vLLM (installing it via lm-harness seems to work well and brings in a working Flash Attention):

# Install lm-harness (https://github.com/EleutherAI/lm-evaluation-harness)
pip install lm_eval[vllm]
# 0.7.0 seems to give issues with gemma2 
pip install vllm==0.6.4.post1
# pip install -e ".[vllm]"
pip install antlr4-python3-runtime==4.11
# to check the installs worked (versions and paths should appear):
pip list | grep lm_eval
pip list | grep vllm
pip list | grep antlr4

Use vLLM 0.6.4.post1, not 0.7.0, or Gemma 2 won't load.
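
After pinning vllm==0.6.4.post1, a minimal smoke test (a sketch; it still assumes the Gemma license has been accepted and that a valid HF token is visible to the environment you run it in):

from vllm import LLM, SamplingParams

# Load the gated Gemma 2 model and generate a short completion, mirroring the working run above.
llm = LLM(model="google/gemma-2-2b", trust_remote_code=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)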
