Learning a generalizable policy is vital in visual reinforcement learning (RL). Many algorithms have been proposed to address this problem, yet none of them theoretically show what affects the generalization gap or why their methods work. In this paper, we bridge this gap by theoretically identifying the key factors that contribute to the generalization gap when the testing environment contains distractors. Our theory indicates that minimizing the representation distance between the training and testing environments is the most critical factor. Our theoretical results are supported by empirical evidence on the DMControl Generalization Benchmark.
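To make the central quantity concrete, the sketch below illustrates one generic way a representation-distance term could be computed and minimized as an auxiliary loss between clean training observations and distractor-corrupted counterparts. The encoder architecture, the synthetic distractors, and all helper names here are hypothetical illustrations, not the method analyzed in the paper.

```python
# Illustrative sketch only: penalize the distance between representations of a
# training observation and a distractor-corrupted version of the same observation.
# Encoder and representation_distance_loss are hypothetical names for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Small convolutional encoder mapping 84x84 RGB frames to a latent vector."""
    def __init__(self, latent_dim: int = 50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(latent_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(obs))

def representation_distance_loss(encoder: Encoder,
                                 train_obs: torch.Tensor,
                                 distracted_obs: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between latents of clean and distracted observations."""
    z_train = encoder(train_obs)
    z_test = encoder(distracted_obs)
    return F.mse_loss(z_train, z_test)

if __name__ == "__main__":
    enc = Encoder()
    clean = torch.rand(8, 3, 84, 84)        # stand-in for training-environment frames
    noise = torch.rand_like(clean)          # stand-in for visual distractors
    distracted = 0.5 * clean + 0.5 * noise  # synthetic "testing-environment" frames
    loss = representation_distance_loss(enc, clean, distracted)
    loss.backward()                          # gradients shape the encoder representation
    print(f"representation distance: {loss.item():.4f}")
```

In this toy setup, driving the auxiliary loss toward zero encourages the encoder to map training and distracted observations to nearby representations, which is the quantity the abstract identifies as governing the generalization gap.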