python - model.eval() returns None when using DeepSpeed - Stack Overflow

I am using DeepSpeed to accelerate model training, and a problem occurs when I evaluate the model on the validation dataset. Here is the problematic code snippet:

def evaluate(self, epoch_num=None, keep_all=True):
        print("self.model:", self.model)

        self.model = self.model.eval()
        print("self.model after eval:", self.model)

This is the output log:

self.model: DeepSpeedEngine(
  (module): TSTransformerEncoder(
    (project_inp): Linear(in_features=6, out_features=128, bias=True)
    (pos_enc): LearnablePositionalEncoding(
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer_encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-2): 3 x TransformerBatchNormEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
          )
          (linear1): Linear(in_features=128, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=256, out_features=128, bias=True)
          (norm1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (norm2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (output_layer): Linear(in_features=128, out_features=6, bias=True)
    (dropout1): Dropout(p=0.1, inplace=False)
  )
)
self.model after eval: None

Without DeepSpeed, the model trains and evaluates normally. The problem appears only after I wrap the model with DeepSpeed.

This is how I initialize DeepSpeed:

    model, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        config_params=ds_config
    )

The ds_config file:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
 
    "optimizer": {
        "params": {
            "lr": 0.001,
            "weight_decay": 0,
            "optimizer_class": "optimizers.RAdam"
        }
    },
 
    "zero_optimization": {
        "stage": 1,
        "overlap_comm": true,
        "contiguous_gradients": true
    },


    "zero_allow_untested_optimizer": true,
    "train_batch_size": 256,
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

Problem Analysis

I originally expected that self.model.eval() would only set the model to evaluation mode, and the model itself would not become None. However, the actual output shows that self.model becomes None after calling the eval() method. I suspect that this might be related to the encapsulation or configuration of DeepSpeed, but I'm not sure about the specific cause.

Relevant Environment Information

  • Python Version: 3.8.20

  • PyTorch Version: 2.4.1

  • DeepSpeed Version: 0.16.4

Asked Mar 15 at 17:28 by external

1 Answer

From the DeepSpeed source code:

class DeepSpeedEngine(Module):
    r"""DeepSpeed engine for training."""
    ...

    def eval(self):
        r""""""

        self.warn_unscaled_loss = True
        self.module.train(False)

DeepSpeedEngine.eval() switches the wrapped module to evaluation mode but has no return statement, so it returns None. This differs from the standard PyTorch nn.Module.eval(), which returns the module itself.

As a result, self.model.eval() does put the model in eval mode, but the assignment self.model = self.model.eval() stores that None return value, so it is effectively self.model = None.
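The mechanism can be reproduced without DeepSpeed or PyTorch installed. The sketch below uses two hypothetical stand-in classes, `PlainModule` and `EngineLike` (neither is a real library class), which mimic only the return-value behavior described above: one `eval()` returns the object, the other changes state and returns nothing.

```python
class PlainModule:
    """Hypothetical stand-in for a plain PyTorch module: eval() returns self."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


class EngineLike:
    """Hypothetical stand-in for a wrapper whose eval() has no return statement."""

    def __init__(self, module):
        self.module = module

    def eval(self):
        # Switches the wrapped module to eval mode, then implicitly returns None.
        self.module.train(False)


plain = PlainModule()
plain_result = plain.eval()      # the same object comes back

engine = EngineLike(PlainModule())
engine_result = engine.eval()    # None, because nothing is returned

print(plain_result is plain)     # True
print(engine_result)             # None
print(engine.module.training)    # False: eval mode was still applied
```

The last line is the important one: the mode switch succeeds either way, and only the reassignment loses the reference to the model.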

You can change your code to:

def evaluate(self, epoch_num=None, keep_all=True):
        print("self.model:", self.model)

        self.model.eval()  # call eval() for its side effect; no assignment needed
        print("self.model after eval:", self.model)

Note that this also works for standard PyTorch models. eval() changes the internal state of the model object in place, so reassigning the result to the same variable is unnecessary for both a DeepSpeedEngine and a plain PyTorch model.
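If existing code relies on the chained style (`model = model.eval()`), one option is a small wrapper that ignores whatever `eval()` returns. This is only a sketch: `set_eval_mode` is a hypothetical helper name, not part of PyTorch or DeepSpeed, and `NoReturnModel` is a stand-in used here so the example runs without either library.

```python
def set_eval_mode(model):
    """Put `model` in evaluation mode and always return `model` itself.

    Works for any object with an eval() method, whether or not that
    method returns the object (assumption: eval() mutates in place).
    """
    model.eval()  # return value deliberately ignored
    return model


class NoReturnModel:
    """Hypothetical stand-in whose eval() returns None."""

    def __init__(self):
        self.training = True

    def eval(self):
        self.training = False


model = NoReturnModel()
model = set_eval_mode(model)  # safe to reassign: never becomes None

print(model is not None)      # True
print(model.training)         # False
```

In the question's code this would read `self.model = set_eval_mode(self.model)`, although simply calling `self.model.eval()` without assignment remains the shorter fix.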
