CS231n: Convolutional Neural Networks for Visual Recognition

Stanford course notes: http://cs231n.github.io/convolutional-networks/

Architecture Overview

ConvNet Layers

ConvNet Architectures

Architecture Overview

A simple ConvNet: [INPUT - CONV - RELU - POOL - FC]

  • INPUT [32x32x3]:
    • Holds the raw pixel values of the image, with width, height, and depth (e.g. 32x32x3); the depth of 3 corresponds to the R, G, B color channels.
  • CONV layer

    • [32x32x12] if we decided to use 12 filters.
    • Contains multiple filters, each detecting one kind of feature. A filter is a small matrix whose size is a hyperparameter; it is convolved with the corresponding patch of pixels in the image (see the NumPy sketch after this list).
    • Hyperparameters that control the output size: depth, stride, and zero-padding.

      • depth - the number of filters

      • stride - the step with which the filter slides over the input; the larger the stride, the smaller the output size, and vice versa

      • zero-padding - the number of zeros padded around the border of the input

      • filter size F - the spatial extent (receptive field) of each filter

      • dilation - the spacing between the elements of the filter

    • Parameter sharing - greatly reduces the number of parameters (weights).

  • RELU layer

    • Applies an elementwise activation function such as max(0, x): negative values are set to 0, positive values pass through unchanged.
    • The size of the volume is unchanged ([32x32x12]).
  • POOL layer

    • Performs a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x12].
  • FC (i.e. fully-connected) layer

    • Computes the class scores, resulting in a volume of size [1x1x10], i.e. one score for each of the 10 categories.
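
To make these layers concrete, here is a minimal NumPy sketch (not from the original notes; the input, filter values, and helper names are made up for illustration) of a single 3x3 filter convolved over a [32x32x3] input, followed by ReLU and 2x2 max pooling:

    import numpy as np

    def conv2d_single_filter(x, w, b, stride=1, pad=1):
        """Naive convolution of one FxFxC filter w over an HxWxC input x."""
        H, W, C = x.shape
        F = w.shape[0]
        x_pad = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))   # zero-padding
        H_out = (H - F + 2 * pad) // stride + 1
        W_out = (W - F + 2 * pad) // stride + 1
        out = np.zeros((H_out, W_out))
        for i in range(H_out):
            for j in range(W_out):
                patch = x_pad[i*stride:i*stride+F, j*stride:j*stride+F, :]
                out[i, j] = np.sum(patch * w) + b              # dot product + bias
        return out

    def relu(x):
        return np.maximum(0, x)                                # elementwise max(0, x)

    def max_pool_2x2(x):
        """2x2 max pooling with stride 2 on a 2-D activation map."""
        H, W = x.shape
        return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    x = np.random.randn(32, 32, 3)                             # a toy [32x32x3] input image
    w = np.random.randn(3, 3, 3)                               # one 3x3x3 filter
    a = relu(conv2d_single_filter(x, w, b=0.0, stride=1, pad=1))  # -> 32x32
    p = max_pool_2x2(a)                                        # -> 16x16
    print(a.shape, p.shape)                                    # (32, 32) (16, 16)

A real CONV layer would apply many such filters (the depth hyperparameter) and stack the resulting activation maps into a 3-D output volume.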

In this way, the ConvNet transforms an image (a 2-D grid of pixels) layer by layer into a 1-D vector of class scores. Each layer produces activations, and some layers also have weights and biases; the parameters of the CONV/FC layers will be trained with gradient descent.

In Summary,

A ConvNet architecture is in the simplest case a list of Layers that transform the image volume into an output volume (e.g. holding the class scores)

  • There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL are by far the most popular)
  • Each Layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function
  • Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don’t)
  • Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn’t)
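
To illustrate the last two points, here is a minimal toy sketch (not from the notes) of a layer as a function on 3-D volumes, one without parameters and one with:

    import numpy as np

    class ReLU:
        """A layer with no parameters: elementwise max(0, x) on a 3-D volume."""
        def forward(self, volume):
            return np.maximum(0, volume)          # same shape in, same shape out

    class Bias:
        """A toy layer *with* parameters: adds a learnable per-channel bias."""
        def __init__(self, depth):
            self.b = np.zeros(depth)              # trainable parameters
        def forward(self, volume):
            return volume + self.b                # broadcast over width/height

    x = np.random.randn(32, 32, 12)               # an input 3-D volume
    y = Bias(12).forward(ReLU().forward(x))
    print(y.shape)                                # (32, 32, 12)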

Layer Patterns

The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

where the * indicates repetition, and the POOL? indicates an optional pooling layer.

Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3). For example, here are some common ConvNet architectures you may see that follow this pattern:

    INPUT -> FC, implements a linear classifier. Here N = M = K = 0.
    INPUT -> CONV -> RELU -> FC
    INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. 
                    Here we see that there is a single CONV layer between every POOL layer.
    INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC
                    Here we see two CONV layers stacked before every POOL layer. 
                    This is generally a good idea for larger and deeper networks, 
                    because multiple stacked CONV layers can develop more complex features of 
                    the input volume before the destructive pooling operation.
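
As an illustration (not part of the notes), a tiny helper that expands this pattern into an explicit list of layer names for given N, M, K:

    def expand_pattern(N, M, K, use_pool=True):
        """Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC."""
        layers = ["INPUT"]
        for _ in range(M):
            layers += ["CONV", "RELU"] * N
            if use_pool:
                layers.append("POOL")
        layers += ["FC", "RELU"] * K
        layers.append("FC")
        return layers

    # N=2, M=3, K=2 reproduces the last example architecture above
    print(" -> ".join(expand_pattern(N=2, M=3, K=2)))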

In practice: use whatever works best on ImageNet.

In general, you would not build a new architecture from scratch or train a model from scratch. Instead, look at whichever architecture currently works best on ImageNet, download a pretrained model, and finetune it on your own (small) dataset.
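
As a concrete sketch of this workflow, here is a minimal example using torchvision (an assumption; the notes do not prescribe a framework), with a placeholder number of classes:

    import torch
    import torch.nn as nn
    from torchvision import models

    # Load a model pretrained on ImageNet (ResNet-18 chosen here as an example).
    model = models.resnet18(pretrained=True)

    # Freeze the pretrained feature extractor so only the new head is trained.
    for p in model.parameters():
        p.requires_grad = False

    # Replace the final fully-connected layer with one for our own classes
    # (num_classes = 10 is a placeholder for your small dataset).
    num_classes = 10
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    # Only the new layer's parameters are passed to the optimizer.
    optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
    criterion = nn.CrossEntropyLoss()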

Layer Sizing Patterns - hyperparameter rules of thumb

  • The input layer: the image size should be divisible by 2 many times; common sizes are 32, 64, 96, 224, and 512.

  • The conv layers

should use small filters (e.g. 3x3 or at most 5x5), use a stride of S=1, and, crucially, pad the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input. That is, when F=3, using P=1 retains the original size of the input; when F=5, P=2. For a general F, it can be seen that P = (F-1)/2 preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that looks at the raw input image.
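
A quick check of the P = (F-1)/2 rule, using the standard output-size formula (W - F + 2P)/S + 1 (a small sketch):

    def conv_output_size(W, F, S, P):
        """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
        return (W - F + 2 * P) // S + 1

    for F in (3, 5, 7):
        P = (F - 1) // 2                                  # the "same"-size padding rule
        print(F, P, conv_output_size(W=32, F=F, S=1, P=P))  # output stays 32 in every case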

  • The pool layers

are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with 2x2 receptive fields (i.e. F=2) and a stride of 2 (i.e. S=2). Note that this discards exactly 75% of the activations in an input volume (due to downsampling by 2 in both width and height).

Another, slightly less common setting is to use 3x3 receptive fields with a stride of 2. It is very uncommon to see receptive field sizes for max pooling larger than 3, because the pooling then becomes too lossy and aggressive; this usually leads to worse performance.
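
As a small illustration of the 75% figure (a toy sketch, not from the notes):

    import numpy as np

    a = np.arange(16, dtype=float).reshape(4, 4)       # a toy 4x4 activation map
    pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))    # 2x2 max pool, stride 2

    print(pooled.shape)    # (2, 2): only 4 of the 16 activations survive (25%)
    print(pooled)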

  • Reducing sizing headaches. The scheme presented above is pleasing because all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of down-sampling the volumes spatially. In an alternative scheme where we use strides greater than 1 or don’t zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters “work out”, and that the ConvNet architecture is nicely and symmetrically wired.

  • Why use stride of 1 in CONV? Smaller strides work better in practice. Additionally, as already mentioned stride 1 allows us to leave all spatial down-sampling to the POOL layers, with the CONV layers only transforming the input volume depth-wise.

  • Why use padding? In addition to the aforementioned benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be “washed away” too quickly.
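
To see why the borders would get "washed away", a quick toy calculation of how the spatial size shrinks when stacking 3x3 convolutions without zero-padding:

    W = 32
    for layer in range(1, 11):
        W = W - 3 + 1                        # valid 3x3 conv, stride 1, no padding
        print(f"after CONV {layer}: {W}x{W}")
    # 32 -> 30 -> 28 -> ... -> 12 after 10 layers: the volume keeps shrinking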

Other CNN variants

Computational Considerations: memory is the biggest bottleneck when constructing ConvNet architectures

The best GPUs have about 12GB of memory; more common GPUs have 3, 4, or 6GB.

  • From the intermediate volume sizes: every layer of the ConvNet produces a volume of activations, and their gradients take up an equal amount of memory. Usually most of these activations come from the earlier conv layers, and they are kept around because they are needed for backpropagation. A clever implementation that only runs the ConvNet at test time can reduce memory significantly by storing only the current layer's activations and discarding those of the layers below.
  • From the parameter sizes: these are the network parameters, their gradients during backpropagation, and commonly also a step cache if the optimizer uses momentum, Adagrad, or RMSProp. Therefore, the memory needed to store the parameter vector alone must usually be multiplied by a factor of at least 3.
  • Every ConvNet implementation also has to maintain miscellaneous memory, such as the image data batches and perhaps their augmented versions.
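
A rough back-of-the-envelope sketch of these costs (the sizes below are illustrative assumptions, e.g. a 138M-parameter VGG-sized network; float32 = 4 bytes):

    BYTES_PER_FLOAT = 4

    # Activations of a single [224x224x64] conv volume (a typical early-layer size)
    acts = 224 * 224 * 64 * BYTES_PER_FLOAT
    print(f"one activation volume: {acts / 1024**2:.1f} MB")       # about 12 MB

    # Parameters: weights + gradients + optimizer cache (momentum/Adagrad/RMSProp)
    # means roughly 3x the raw parameter memory.
    num_params = 138_000_000
    params = num_params * BYTES_PER_FLOAT * 3
    print(f"parameter-related memory: {params / 1024**3:.1f} GB")  # about 1.5 GB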
