basecls.models.vit
Vision Transformer (ViT)
ViT: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
DeiT: “Training data-efficient image transformers & distillation through attention”
References
rwightman/pytorch-image-models
- class basecls.models.vit.PatchEmbed(img_size=224, patch_size=16, in_chans=3, embed_dim=768, flatten=True, norm_name=None, **kwargs)[source]
Bases: Module
Image to Patch Embedding. A minimal sketch of the idea follows the parameter list.
- Parameters
img_size (int) – Image size. Default: 224
patch_size (int) – Patch token size. Default: 16
in_chans (int) – Number of input image channels. Default: 3
embed_dim (int) – Number of linear projection output channels. Default: 768
flatten (bool) – Flatten embedding. Default: True
norm_name (Optional[str]) – Normalization layer. Default: None
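The sketch below illustrates the idea only and is not basecls' exact implementation (the name SimplePatchEmbed is hypothetical; normalization and the flatten flag are omitted). A strided convolution with kernel_size == stride == patch_size projects each non-overlapping patch to embed_dim channels, and the spatial grid is flattened into a token sequence.

import numpy as np
import megengine as mge
import megengine.functional as F
import megengine.module as M

class SimplePatchEmbed(M.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # kernel_size == stride == patch_size: one output position per patch
        self.proj = M.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                  # (N, embed_dim, H/P, W/P)
        x = F.flatten(x, 2)               # (N, embed_dim, num_patches)
        return F.transpose(x, (0, 2, 1))  # (N, num_patches, embed_dim)

x = mge.tensor(np.random.rand(2, 3, 224, 224).astype("float32"))
print(SimplePatchEmbed()(x).shape)  # (2, 196, 768): 14 x 14 patches of dim 768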
- class basecls.models.vit.Attention(dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.0, proj_drop=0.0)[source]
Bases: Module
Self-Attention block. A minimal sketch follows the parameter list.
- Parameters
dim (int) – Number of input channels.
num_heads (int) – Number of attention heads. Default: 8
qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: False
qk_scale (Optional[float]) – Override default qk scale of head_dim ** -0.5 if set. Default: None
attn_drop (float) – Dropout ratio of attention weight. Default: 0.0
proj_drop (float) – Dropout ratio of output. Default: 0.0
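A minimal sketch of the computation (the name SimpleAttention is hypothetical; the attn_drop / proj_drop dropouts are omitted for brevity): one linear layer produces queries, keys and values for all heads, attention weights are softmax(q @ k^T * scale), and a final linear layer merges the heads.

import megengine.functional as F
import megengine.module as M

class SimpleAttention(M.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5  # default scale unless overridden
        self.qkv = M.Linear(dim, dim * 3, bias=qkv_bias)
        self.proj = M.Linear(dim, dim)

    def forward(self, x):  # x: (N, L, C)
        N, L, C = x.shape
        qkv = self.qkv(x).reshape(N, L, 3, self.num_heads, C // self.num_heads)
        qkv = F.transpose(qkv, (2, 0, 3, 1, 4))  # (3, N, heads, L, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = F.matmul(q, F.transpose(k, (0, 1, 3, 2))) * self.scale
        attn = F.softmax(attn, axis=-1)          # (N, heads, L, L)
        out = F.transpose(F.matmul(attn, v), (0, 2, 1, 3)).reshape(N, L, C)
        return self.proj(out)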
- class basecls.models.vit.FFN(in_features, hidden_features=None, out_features=None, drop=0.0, act_name='gelu')[source]
Bases: Module
FFN for ViT. A minimal sketch follows the parameter list.
- Parameters
in_features (int) – Number of input features.
hidden_features (Optional[int]) – Number of hidden features. Default: None
out_features (Optional[int]) – Number of output features. Default: None
drop (float) – Dropout ratio. Default: 0.0
act_name (str) – Activation layer. Default: "gelu"
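A minimal sketch (the name SimpleFFN is hypothetical; dropout and the act_name lookup are omitted, and it assumes the hidden/output widths fall back to in_features when None, as is conventional for this block): two linear layers with a GELU in between.

import megengine.functional as F
import megengine.module as M

class SimpleFFN(M.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None):
        super().__init__()
        hidden_features = hidden_features or in_features
        out_features = out_features or in_features
        self.fc1 = M.Linear(in_features, hidden_features)
        self.fc2 = M.Linear(hidden_features, out_features)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))  # expand -> activate -> project back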
- class basecls.models.vit.EncoderBlock(dim, num_heads, ffn_ratio=4.0, qkv_bias=False, qk_scale=None, attn_drop=0.0, drop=0.0, drop_path=0.0, norm_name='LN', act_name='gelu', **kwargs)[source]
Bases: Module
Transformer Encoder block. A minimal sketch follows the parameter list.
- Parameters
dim (int) – Number of input channels.
num_heads (int) – Number of attention heads.
ffn_ratio (float) – Ratio of ffn hidden dim to embedding dim. Default: 4.0
qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: False
qk_scale (Optional[float]) – Override default qk scale of head_dim ** -0.5 if set. Default: None
drop (float) – Dropout ratio of non-attention weight. Default: 0.0
attn_drop (float) – Dropout ratio of attention weight. Default: 0.0
drop_path (float) – Stochastic depth rate. Default: 0.0
norm_name (str) – Normalization layer. Default: "LN"
act_name (str) – Activation layer. Default: "gelu"
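Combining the pieces above, a minimal pre-norm residual sketch (the name SimpleEncoderBlock is hypothetical and reuses the SimpleAttention and SimpleFFN sketches; drop, drop_path and the name-based norm/act lookup are omitted, with M.LayerNorm assumed to stand in for norm_name='LN'):

import megengine.module as M

class SimpleEncoderBlock(M.Module):
    def __init__(self, dim, num_heads, ffn_ratio=4.0):
        super().__init__()
        self.norm1 = M.LayerNorm(dim)
        self.attn = SimpleAttention(dim, num_heads)
        self.norm2 = M.LayerNorm(dim)
        self.ffn = SimpleFFN(dim, hidden_features=int(dim * ffn_ratio))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))    # residual self-attention (drop_path omitted)
        return x + self.ffn(self.norm2(x))  # residual FFN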
- class basecls.models.vit.ViT(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, num_heads=12, ffn_ratio=4.0, qkv_bias=True, qk_scale=None, representation_size=None, distilled=False, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, embed_layer=PatchEmbed, norm_name='LN', act_name='gelu', num_classes=1000, **kwargs)[source]
Bases: Module
ViT model. A usage sketch follows the parameter list.
- Parameters
img_size (int) – Input image size. Default: 224
patch_size (int) – Patch token size. Default: 16
in_chans (int) – Number of input image channels. Default: 3
embed_dim (int) – Number of linear projection output channels. Default: 768
depth (int) – Depth of Transformer Encoder layer. Default: 12
num_heads (int) – Number of attention heads. Default: 12
ffn_ratio (float) – Ratio of ffn hidden dim to embedding dim. Default: 4.0
qkv_bias (bool) – If True, add a learnable bias to query, key, value. Default: True
qk_scale (Optional[float]) – Override default qk scale of head_dim ** -0.5 if set. Default: None
representation_size (Optional[int]) – Size of representation layer (pre-logits). Default: None
distilled (bool) – Includes a distillation token and head. Default: False
drop_rate (float) – Dropout rate. Default: 0.0
attn_drop_rate (float) – Attention dropout rate. Default: 0.0
drop_path_rate (float) – Stochastic depth rate. Default: 0.0
embed_layer (Module) – Patch embedding layer. Default: PatchEmbed
norm_name (str) – Normalization layer. Default: "LN"
act_name (str) – Activation function. Default: "gelu"
num_classes (int) – Number of classes. Default: 1000
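A usage sketch based only on the signature documented above (assumes basecls and MegEngine are installed; the defaults correspond to a ViT-Base/16 with a 1000-class head):

import numpy as np
import megengine as mge
from basecls.models.vit import ViT

model = ViT(img_size=224, patch_size=16, embed_dim=768, depth=12, num_heads=12)
x = mge.tensor(np.random.rand(2, 3, 224, 224).astype("float32"))
logits = model(x)
print(logits.shape)  # expected (2, 1000) with the default num_classes=1000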
- load_state_dict(state_dict, strict=True)[source]
Loads a given dictionary created by state_dict() into this module. If strict is True, the keys of state_dict must exactly match the keys returned by state_dict().
Users can also pass a closure Function[key: str, var: Tensor] -> Optional[np.ndarray] as a state_dict, in order to handle complex situations. For example, load everything except for the final linear classifier:
state_dict = {...}  # Dict[str, np.ndarray]
model.load_state_dict({
    k: None if k.startswith('fc') else v
    for k, v in state_dict.items()
}, strict=False)
Here returning None means skipping parameter k.
To prevent shape mismatch (e.g. when loading PyTorch weights), we can reshape before loading:
state_dict = {...}
def reshape_accordingly(k, v):
    return state_dict[k].reshape(v.shape)
model.load_state_dict(reshape_accordingly)
We can also perform in-place re-initialization or pruning:
def reinit_and_pruning(k, v):
    if 'bias' in k:
        M.init.zero_(v)  # re-initialize biases in place
    if 'conv' in k:
        # illustrative magnitude-pruning rule: zero out near-zero conv weights
        return v.numpy() * (np.abs(v.numpy()) > 1e-3)
model.load_state_dict(reinit_and_pruning, strict=False)