basecls.models.vit

Vision Transformer (ViT)

ViT: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”

DeiT: “Training data-efficient image transformers & distillation through attention”

References

rwightman/pytorch-image-models

class basecls.models.vit.PatchEmbed(img_size=224, patch_size=16, in_chans=3, embed_dim=768, flatten=True, norm_name=None, **kwargs)[source]

Bases: Module

Image to Patch Embedding

Parameters
  • img_size (int) – Image size. Default: 224

  • patch_size (int) – Patch token size. Default: 16

  • in_chans (int) – Number of input image channels. Default: 3

  • embed_dim (int) – Number of linear projection output channels. Default: 768

  • flatten (bool) – Flatten embedding. Default: True

  • norm_name (Optional[str]) – Normalization layer. Default: None

forward(x)[source]
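
A minimal usage sketch (illustrative, not from the library docs), assuming the module follows the usual MegEngine Module call convention; with the documented defaults a 224x224 image yields 14x14 = 196 patch tokens of dimension 768:

import numpy as np
import megengine as mge

from basecls.models.vit import PatchEmbed

embed = PatchEmbed(img_size=224, patch_size=16, in_chans=3, embed_dim=768)
x = mge.Tensor(np.random.rand(1, 3, 224, 224).astype("float32"))  # NCHW image batch
out = embed(x)
print(out.shape)  # expected (1, 196, 768) since flatten=True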
class basecls.models.vit.Attention(dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.0, proj_drop=0.0)[source]

Bases: Module

Self-Attention block.

Parameters
  • dim (int) – Number of input channels.

  • num_heads (int) – Number of attention heads. Default: 8

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: False

  • qk_scale (Optional[float]) – Override default qk scale of head_dim ** -0.5 if set.

  • attn_drop (float) – Dropout ratio of attention weight. Default: 0.0

  • proj_drop (float) – Dropout ratio of output. Default: 0.0

forward(x)[source]
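
A usage sketch (illustrative, not from the library docs), assuming the block consumes token sequences of shape (batch, tokens, dim) as in standard ViT attention; 197 tokens = 196 patches + 1 class token:

import numpy as np
import megengine as mge

from basecls.models.vit import Attention

attn = Attention(dim=768, num_heads=12, qkv_bias=True)
tokens = mge.Tensor(np.random.rand(1, 197, 768).astype("float32"))
out = attn(tokens)
print(out.shape)  # expected (1, 197, 768): self-attention preserves the sequence shape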
class basecls.models.vit.FFN(in_features, hidden_features=None, out_features=None, drop=0.0, act_name='gelu')[source]

Bases: Module

FFN for ViT

Parameters
  • in_features (int) – Number of input features.

  • hidden_features (Optional[int]) – Number of hidden features. Default: None

  • out_features (Optional[int]) – Number of output features. Default: None

  • drop (float) – Dropout ratio. Default: 0.0

  • act_name (str) – Activation function. Default: "gelu"

forward(x)[source]
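
A usage sketch (illustrative); hidden_features=3072 mirrors the 4x expansion implied by EncoderBlock's ffn_ratio default, and out_features=None is assumed to fall back to in_features:

import numpy as np
import megengine as mge

from basecls.models.vit import FFN

ffn = FFN(in_features=768, hidden_features=3072, act_name="gelu")
tokens = mge.Tensor(np.random.rand(1, 197, 768).astype("float32"))
out = ffn(tokens)
print(out.shape)  # expected (1, 197, 768)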
class basecls.models.vit.EncoderBlock(dim, num_heads, ffn_ratio=4.0, qkv_bias=False, qk_scale=None, attn_drop=0.0, drop=0.0, drop_path=0.0, norm_name='LN', act_name='gelu', **kwargs)[source]

Bases: Module

Transformer Encoder block.

Parameters
  • dim (int) – Number of input channels.

  • num_heads (int) – Number of attention heads.

  • ffn_ratio (float) – Ratio of ffn hidden dim to embedding dim. Default: 4.0

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: False

  • qk_scale (Optional[float]) – Override default qk scale of head_dim ** -0.5 if set.

  • drop (float) – Dropout ratio of non-attention weight. Default: 0.0

  • attn_drop (float) – Dropout ratio of attention weight. Default: 0.0

  • drop_path (float) – Stochastic depth rate. Default: 0.0

  • norm_name (str) – Normalization layer. Default: "LN"

  • act_name (str) – Activation layer. Default: "gelu"

forward(x)[source]
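
A usage sketch (illustrative), assuming the block is the usual residual attention + FFN pair from the ViT/DeiT papers and therefore keeps the (batch, tokens, dim) shape:

import numpy as np
import megengine as mge

from basecls.models.vit import EncoderBlock

block = EncoderBlock(dim=768, num_heads=12, ffn_ratio=4.0, qkv_bias=True, drop_path=0.1)
tokens = mge.Tensor(np.random.rand(1, 197, 768).astype("float32"))
out = block(tokens)
print(out.shape)  # expected (1, 197, 768)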
class basecls.models.vit.ViT(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, num_heads=12, ffn_ratio=4.0, qkv_bias=True, qk_scale=None, representation_size=None, distilled=False, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, embed_layer=PatchEmbed, norm_name='LN', act_name='gelu', num_classes=1000, **kwargs)[source]

Bases: Module

ViT model.

Parameters
  • img_size (int) – Input image size. Default: 224

  • patch_size (int) – Patch token size. Default: 16

  • in_chans (int) – Number of input image channels. Default: 3

  • embed_dim (int) – Number of linear projection output channels. Default: 768

  • depth (int) – Depth of Transformer Encoder layer. Default: 12

  • num_heads (int) – Number of attention heads. Default: 12

  • ffn_ratio (float) – Ratio of ffn hidden dim to embedding dim. Default: 4.0

  • qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: True

  • qk_scale (Optional[float]) – Override default qk scale of head_dim ** -0.5 if set. Default: None

  • representation_size (Optional[int]) – Size of representation layer (pre-logits). Default: None

  • distilled (bool) – Includes a distillation token and head. Default: False

  • drop_rate (float) – Dropout rate. Default: 0.0

  • attn_drop_rate (float) – Attention dropout rate. Default: 0.0

  • drop_path_rate (float) – Stochastic depth rate. Default: 0.0

  • embed_layer (Module) – Patch embedding layer. Default: PatchEmbed

  • norm_name (str) – Normalization layer. Default: "LN"

  • act_name (str) – Activation function. Default: "gelu"

  • num_classes (int) – Number of classes. Default: 1000
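
A construction sketch (illustrative, not from the library docs); the documented defaults correspond to a ViT-Base/16 configuration, and the output shape assumes a single classification head when distilled=False:

import numpy as np
import megengine as mge

from basecls.models.vit import ViT

# ViT-B/16: patch 16, embed_dim 768, 12 layers, 12 heads
model = ViT(img_size=224, patch_size=16, embed_dim=768, depth=12,
            num_heads=12, ffn_ratio=4.0, qkv_bias=True, num_classes=1000)
x = mge.Tensor(np.random.rand(1, 3, 224, 224).astype("float32"))
logits = model(x)
print(logits.shape)  # expected (1, 1000)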

init_weights()[source]
forward(x)[source]
load_state_dict(state_dict, strict=True)[source]

Loads a given dictionary created by state_dict() into this module. If strict is True, the keys of state_dict must exactly match the keys returned by this module's state_dict().

Users can also pass a closure: Function[key: str, var: Tensor] -> Optional[np.ndarray] as a state_dict, in order to handle complex situations. For example, load everything except for the final linear classifier:

state_dict = {...}  #  Dict[str, np.ndarray]
model.load_state_dict({
    k: None if k.startswith('fc') else v
    for k, v in state_dict.items()
}, strict=False)

Here returning None means skipping parameter k.

To prevent shape mismatch (e.g. when loading PyTorch weights), we can reshape before loading:

state_dict = {...}
def reshape_accordingly(k, v):
    return state_dict[k].reshape(v.shape)
model.load_state_dict(reshape_accordingly)

We can also perform inplace re-initialization or pruning:

# assumes: import numpy as np; import megengine.module as M
def reinit_and_pruning(k, v):
    if 'bias' in k:
        M.init.zero_(v)
    if 'conv' in k:
        # prune: zero out small conv weights while keeping the array shape
        return v.numpy() * (np.abs(v.numpy()) > 1e-3)
model.load_state_dict(reinit_and_pruning, strict=False)