The Depthwise Separable Convolution

The cost of standard convolution

A standard convolution mixes spatial positions and channels in a single operation, so its cost at each output position scales with kernel area times input channels times output channels. On mobile compute budgets this is often too expensive.
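To make the scaling concrete, here is a minimal sketch that counts multiplies for one standard convolution layer. The function name, its parameters, and the example layer shape are illustrative assumptions, not taken from any particular network.

```python
def standard_conv_multiplies(kernel_size, in_channels, out_channels, out_height, out_width):
    """Count multiplies for one standard convolution layer.

    Stride and padding are assumed to be already reflected in the output size.
    """
    # Every output value at a position needs kernel_area * in_channels multiplies,
    # and there are out_channels such values per position.
    per_position = kernel_size * kernel_size * in_channels * out_channels
    return per_position * out_height * out_width


# Hypothetical layer: three by three kernel, 128 -> 128 channels, 56 x 56 output map.
print(standard_conv_multiplies(3, 128, 128, 56, 56))
```

Doubling either channel count doubles the total, which is why wide layers dominate the budget.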

Two cheaper steps

Depthwise separable convolution factors the operation:

A depthwise step applies one spatial filter per input channel, mixing space but not channels.
A pointwise step applies a one by one convolution at every position, mixing channels but not space.

Applied in sequence, the two steps replace the full convolution with a factorized version at far lower cost.
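The two steps can be sketched in plain Python on nested lists. This is a teaching sketch under stated assumptions, not how a framework implements it: stride one, no padding, no bias, and the cross-correlation convention (no kernel flip) common in deep learning libraries. All function and variable names are hypothetical.

```python
def depthwise_conv(x, dw_filters):
    """x: [C][H][W] input, dw_filters: [C][K][K], one spatial filter per channel.

    Mixes space within each channel; channels are never combined.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(dw_filters[0])
    out_h, out_w = height - k + 1, width - k + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(channels)]
    for c in range(channels):
        for i in range(out_h):
            for j in range(out_w):
                out[c][i][j] = sum(
                    x[c][i + di][j + dj] * dw_filters[c][di][dj]
                    for di in range(k)
                    for dj in range(k)
                )
    return out


def pointwise_conv(x, pw_weights):
    """x: [C][H][W] input, pw_weights: [N][C].

    A one by one convolution: mixes channels at each position, never space.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [sum(w[c] * x[c][i][j] for c in range(channels)) for j in range(width)]
            for i in range(height)
        ]
        for w in pw_weights
    ]


def depthwise_separable_conv(x, dw_filters, pw_weights):
    """Depthwise step followed by pointwise step."""
    return pointwise_conv(depthwise_conv(x, dw_filters), pw_weights)


# Tiny example: 2 input channels, 3 x 3 map, 2 x 2 depthwise filters, 3 output channels.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
dw_filters = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
pw_weights = [[1, 1], [1, -1], [2, 0]]
y = depthwise_separable_conv(x, dw_filters, pw_weights)
print(len(y), len(y[0]), len(y[0][0]))  # output shape: channels, height, width
```

Note that the depthwise filters hold one kernel per input channel, while the pointwise weights are the only place where the number of output channels appears.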

The savings

The ratio of separable cost to standard cost works out to one over the number of output channels plus one over the kernel area. For a three by three kernel the second term is one ninth, and with many output channels the first term is small, so the multiplies drop by about eight to nine times. That is why MobileNet and similar designs lean on it.
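The ratio can be checked numerically. The sketch below counts multiplies per output position for both forms and compares the result with one over the output channels plus one over the kernel area. The function names and the channel counts tried are assumptions chosen for illustration.

```python
def standard_cost(k, m, n):
    """Multiplies per output position: kernel area * input channels * output channels."""
    return k * k * m * n


def separable_cost(k, m, n):
    """Depthwise (kernel area * input channels) plus pointwise (input * output channels)."""
    return k * k * m + m * n


def cost_ratio(k, m, n):
    """Separable cost divided by standard cost; algebraically equal to 1/n + 1/k**2."""
    return separable_cost(k, m, n) / standard_cost(k, m, n)


# Three by three kernel with equal input and output channel counts.
for channels in (64, 256, 1024):
    ratio = cost_ratio(3, channels, channels)
    print(channels, f"reduction factor: {1 / ratio:.2f}")
```

As the channel count grows, the reduction factor approaches the kernel area, nine for a three by three kernel, without ever reaching it.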

The trade

You lose some expressive power because spatial and channel mixing no longer happen jointly. In practice the accuracy drop is small and the efficiency gain is large, which is a favorable bargain on constrained hardware.

Key idea