Appendix A: “No-Cloning” in Tensor Networks
The required operation of duplicating a vector and sending it to be part of two different calculations, which is simply achieved in any practical setting, is actually impossible to represent in the framework of TNs. We formulate this notion in the following claim:
Claim 1
Let $v\in\mathbb{R}^d$, $d>1$, be a vector. $v$ is represented by a node with one leg in the TN notation. The operation of duplicating this node, i.e. forming two separate nodes of degree $1$, each equal to $v$, cannot be achieved by any TN.
Proof. We assume by contradiction that there exists a TN $\phi$ which operates on any vector $v\in\mathbb{R}^d$ and clones it to two separate nodes of degree $1$, each equal to $v$, to form an overall TN representing $v\otimes v$. Component-wise, this implies that $\phi$ upholds $\sum_{j=1}^{d}\phi_{ijk}v_j = v_i v_k$. By our assumption, $\phi$ duplicates the standard basis elements of $\mathbb{R}^d$, denoted $\{\hat{e}^{(\alpha)}\}_{\alpha=1}^{d}$, meaning that $\forall \alpha\in[d]$:
$\sum_{j=1}^{d}\phi_{ijk}\,\hat{e}^{(\alpha)}_j = \hat{e}^{(\alpha)}_i\hat{e}^{(\alpha)}_k$.  (5)
By definition of the standard basis elements, the left-hand side of Eq. (5) takes the form $\phi_{i\alpha k}$, while the right-hand side equals $1$ only if $i=k=\alpha$, and $0$ otherwise. In other words, in order to successfully clone the standard basis elements, Eq. (5) implies that $\phi$ must uphold $\phi_{i\alpha k}=\delta_{i\alpha k}$, where $\delta_{i\alpha k}$ equals $1$ if $i=\alpha=k$ and $0$ otherwise. However, for $v=\sum_{\alpha=1}^{d}\hat{e}^{(\alpha)}$, i.e. $v_j=1~\forall j\in[d]$, a cloning operation does not take place when using this value of $\phi$, since $\sum_j\delta_{ijk}v_j=\delta_{ik}\neq 1=v_iv_k$ for $i\neq k$, in contradiction to $\phi$ duplicating any vector in $\mathbb{R}^d$.
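The contradiction above is easy to verify numerically: the order-3 delta tensor is the only candidate forced by cloning the standard basis, yet it fails to clone their sum. A minimal NumPy sketch (the tensor and function names are ours):

```python
import numpy as np

d = 3
# Order-3 "delta" tensor forced by Eq. (5): delta[i, j, k] = 1 iff i == j == k.
delta = np.zeros((d, d, d))
for i in range(d):
    delta[i, i, i] = 1.0

def apply_clone(phi, v):
    # Contract the middle leg of phi with v: result[i, k] = sum_j phi[i, j, k] v[j]
    return np.einsum('ijk,j->ik', phi, v)

# Every standard basis vector is cloned exactly ...
for a in range(d):
    e = np.zeros(d)
    e[a] = 1.0
    assert np.allclose(apply_clone(delta, e), np.outer(e, e))

# ... but the sum of all basis vectors (the all-ones vector) is not:
v = np.ones(d)
print(apply_clone(delta, v))  # the identity matrix: diagonal entries only
print(np.outer(v, v))         # the all-ones matrix, i.e. the true v (x) v
```

The off-diagonal entries of $v\otimes v$ are exactly what the delta tensor cannot produce, which is the contradiction used in the proof.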
Appendix B: Entanglement Scaling in Overlapping Convolutional Networks
In this section we give a detailed description of overlapping convolutional networks that include spatial decimation, and then prove Theorem 1 of the main text. Additionally, we analyze the effect of the added decimating pooling layers on the entanglement entropy.
We begin by presenting a broad definition of what is called a Generalized Convolutional (GC) layer (Sharir and Shashua, 2018) as a fusion of a linear operation with a pooling (spatial decimation) function – this view of convolutional layers is motivated by the all-convolutional architecture (Springenberg et al., 2015), which replaces all pooling layers with convolutions with stride greater than 1. The input to a GC layer is an order-3 tensor, having width and height equal to $H$ and depth $C$, also referred to as channels, e.g. the input could be a 2D image with RGB color channels. Similarly, the output of the layer has width and height equal to $H/S$ and $C'$ channels, where $S\in\mathbb{N}$ is referred to as the stride, and has the role of a subsampling operation. Each spatial location $(i,j)$ at the output of the layer corresponds to a 2D window slice of the input tensor of size $R\times R\times C$, extended through all the input channels, whose top-left corner is located exactly at $(i\cdot S, j\cdot S)$, where $R$ is referred to as its local receptive field, or filter size. For simplicity, the parts of window slices extending beyond the boundaries have zero value. Let $y\in\mathbb{R}^{C'}$ be a vector representing the channels at some location of the output, and similarly, let $x^{(1)},\ldots,x^{(R^2)}\in\mathbb{R}^{C}$ be the set of vectors representing the slice, where each vector represents the channels at its respective location inside the window; then the operation of a GC layer is defined as follows:
$y = g\left(W^{(1)}x^{(1)},\,\ldots,\,W^{(R^2)}x^{(R^2)}\right),$
where $W^{(1)},\ldots,W^{(R^2)}\in\mathbb{R}^{C'\times C}$ are referred to as the weights of the layer, and $g:(\mathbb{R}^{C'})^{R^2}\to\mathbb{R}^{C'}$ is some point-wise pooling function. Additionally, we call a GC layer that is limited to unit-stride ($S=1$) and has receptive field $R\geq 1$ a Conv layer, and similarly, a Pooling layer is a GC layer with both stride and receptive field equal to $p$ ($R=S=p$). With the above definitions, a convolutional network is simply a sequence of blocks of Conv and Pooling layers that follows the representation layer, and ends with a global pooling layer, i.e. a pooling layer with $R$ equal to the entire spatial extent of its input. The entire network is illustrated in Fig. 4.
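As an illustration, such a GC layer can be sketched directly in NumPy. This is a naive reference implementation, not the authors' code; the array shapes and the convention that the pooling function receives the linearly transformed window vectors stacked along the first axis are our assumptions based on the definition above:

```python
import numpy as np

def gc_layer(x, W, g, stride):
    """One Generalized Convolutional (GC) layer (naive sketch).

    x      : input of shape (H, H, C)
    W      : weights of shape (R, R, C_out, C), one C_out x C matrix
             per offset inside the R x R window
    g      : pooling function mapping an (R*R, C_out) array of linearly
             transformed window vectors to a single (C_out,) vector
    stride : subsampling step S; the output has spatial size H // S
    Window parts extending beyond the boundary are treated as zero,
    as in the definition above.
    """
    H, _, C = x.shape
    R, _, C_out, _ = W.shape
    H_out = H // stride
    y = np.zeros((H_out, H_out, C_out))
    for i in range(H_out):
        for j in range(H_out):
            u, v = i * stride, j * stride  # top-left corner of the window
            lin = np.zeros((R * R, C_out))
            for a in range(R):
                for b in range(R):
                    if u + a < H and v + b < H:
                        lin[a * R + b] = W[a, b] @ x[u + a, v + b]
            y[i, j] = g(lin)
    return y
```

With a summing pooling function, `g = lambda lin: lin.sum(axis=0)`, this reduces to an ordinary linear convolution with stride $S$.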
Given a non-linear point-wise activation function $\sigma(\cdot)$ (e.g. ReLU), setting all pooling functions to average pooling followed by the activation, i.e. $g\left(x^{(1)},\ldots,x^{(S^2)}\right)=\sigma\!\left(\frac{1}{S^2}\sum_{i=1}^{S^2}x^{(i)}\right)$, gives rise to the common all-convolutional network with activations, which served as the initial motivation for this formulation. Alternatively, choosing instead a product pooling function, i.e. $g\left(x^{(1)},\ldots,x^{(S^2)}\right)=\prod_{i=1}^{S^2}x^{(i)}$ (taken entry-wise), results in an Arithmetic Circuit, i.e. a circuit containing just product and sum operations, hence it is referred to as an Overlapping Convolutional Arithmetic Circuit, or Overlapping ConvAC in short, where ‘overlapping’ refers to having receptive fields which overlap when $R>S$. The non-overlapping case, where $R=S$, is equivalent to ConvACs as originally introduced by Cohen et al. (2016a).
In the body of the paper we have discussed the entanglement entropy of overlapping convolutional networks with no spatial decimation, which essentially amounts to having pooling layers with $p=1$, as summarized in Theorem 1 of the main text. The following theorem quantifies the effect of pooling layers with $p>1$ in overlapping convolutional networks:
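The two pooling choices above can be written as drop-in pooling functions acting on the stacked window vectors; a sketch, with the activation taken to be ReLU (function names are ours):

```python
import numpy as np

def relu_avg_pool(lin):
    # Average pooling followed by a ReLU activation:
    # g(x_1, ..., x_{S^2}) = max(mean_i x_i, 0), taken entry-wise
    return np.maximum(lin.mean(axis=0), 0.0)

def product_pool(lin):
    # Product pooling of an overlapping ConvAC:
    # g(x_1, ..., x_{S^2}) = prod_i x_i, taken entry-wise
    return lin.prod(axis=0)
```

Each receives the window vectors stacked along the first axis and pools them entry-wise, so either can serve as the pooling function of a GC layer; the first yields the all-convolutional network with activations, the second an Overlapping ConvAC.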
Theorem 3
For an overlapping ConvAC with $p\times p$ pooling operations in between convolution layers (Fig. 4 with $p>1$), the maximal entanglement entropy w.r.t. $A$ modeled by the network obeys:
$\Omega\!\left(H^{d-1}\right),$
where $H=N^{1/d}$ is the linear dimension of the $d$-dimensional system, for $d=1,2$.
Thus, the introduction of such pooling layers results in a diminished ability of the overlapping convolutional network to represent volume-law entanglement scaling, since the $H^{d}$ factor from Theorem 1 of the main text is diminished to a factor of $H^{d-1}$. In the following, we prove the results in Theorem 1 of the main text and in Theorem 3 of this appendix regarding the entanglement scaling supported by overlapping ConvACs:
Proof (of Theorem 1 of the main text and Theorem 3 above). We begin by providing a succinct summary of the theoretical analysis of overlapping ConvACs by Sharir and Shashua (2018), including the necessary technical background on ConvACs required to understand their results. Sharir and Shashua (2018) show lower bounds on the rank of the matricized dup-tensor for various architectures when $A$ is the left half of the input and $B$ the right half, in $d=2$, when the convolutional kernel is anchored at its corner instead of at its center as presented in this letter.
For any layer in a convolutional network, the local receptive field (or kernel size) $R$ is defined as the linear size of the window on which each convolutional kernel acts, and the stride $S$ is defined as the step size in each dimension between two neighboring windows (assumed to be equal in both dimensions in this letter). The main result of Sharir and Shashua (2018) relies on two architecture-dependent attributes that they referred to as the total receptive field $T^{(l)}$ and the total stride $s^{(l)}$ of the $l$'th layer, defined as the projections on the input layer of the local receptive fields and strides from the perspective of the $l$'th layer, as illustrated in Fig. 5. In their main result they show that the first layer $l_0$ that has a total receptive field of at least half the linear dimension of the input, $T^{(l_0)}\geq H/2$, gives rise to a lower bound on the rank of the matricized tensor that is exponential in $\left(H/s^{(l_0)}\right)^{d}$, where $s^{(l_0)}$ is the total stride of the $l_0$'th layer. To prove this result the authors rely on the ability of a sufficiently large total receptive field to represent identity matrices between pairs of input indices, each pair comprising one index from $A$ and one from $B$, where the total stride limits the maximal number of pairs, as it denotes the minimal distance between any two given pairs of indices.
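The total receptive field and total stride can be accumulated layer by layer from the per-layer kernel sizes and strides. The following sketch assumes the standard composition rule for projecting windows back onto the input (the function name and the (R, S) encoding are ours):

```python
def total_receptive_fields(layers):
    """Accumulate (total receptive field, total stride) per layer.

    layers: list of (R, S) pairs, ordered from input to output, where R is
    the local receptive field and S the stride of each layer. Uses the
    composition rule T_l = T_{l-1} + (R_l - 1) * s_{l-1}, s_l = s_{l-1} * S_l.
    """
    out = []
    T, s = 0, 1
    for R, S in layers:
        T = T + (R - 1) * s if out else R
        s *= S
        out.append((T, s))
    return out

# L unit-stride conv layers with kernel R give T_L = L * (R - 1) + 1, so the
# total receptive field reaches H / 2 after roughly H / (2 * (R - 1)) layers:
print(total_receptive_fields([(3, 1)] * 4))   # [(3, 1), (5, 1), (7, 1), (9, 1)]
# Interleaving p x p pooling layers (R = S = p) multiplies the total stride
# by p per block, limiting how many index pairs fit between windows:
print(total_receptive_fields([(3, 1), (2, 2)] * 3))
```

The first call illustrates the unit-stride regime used in the proof below; the second shows how decimating pooling layers inflate the total stride alongside the total receptive field.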
To prove the lower bounds on the architectures described in Theorem 1 of the main text and Theorem 3 above, it is sufficient to consider just the first convolutional layers having unit total stride, specifically the first $L\simeq H/\left(2(R-1)\right)$ conv layers for the case of a sequence of conv layers followed by global pooling (Fig. 4 with $p=1$), and the conv layers preceding the first pooling layer for the case of alternating conv and pooling layers (Fig. 4 with $p>1$). Under the above, the total receptive field of the $l$'th such layer is simply $T^{(l)}=l(R-1)+1$, so these layers can be thought of as a single large convolutional layer. Now, following the same proof sketch described above, we can use the combined convolutional layer to pair indices of $A$ and $B$ along the boundary between the two sets, where the size of the total receptive field determines the maximal number of pairs we can capture around each point on the boundary. In the special case of $p=1$, where the total receptive field reaches $H/2$, nearly any index of $A$ can be paired with a unique index from $B$, resulting in a lower bound of $\Omega(H^{d})$ on the entanglement entropy; for $p>1$, only the $\Theta(H^{d-1})$ pairs adjacent to the boundary are captured, yielding the bound of Theorem 3.
Appendix C: Entanglement Scaling in Deep Recurrent Networks
In the following, we prove the result in Theorem 2 of the main text, regarding the entanglement scaling supported by deep RACs:
Proof (of Theorem 2 of the main text). In Levine et al. (2017), a lower bound of $\binom{N/2}{R}$ is shown for a set $A$ that is placed to the right of $B$ and upholds $|A|=N/2$, for which the size of $A$ is the largest possible under the conditions of Theorem 2 of the main text. There, $R$ is the dimension of the RAC's hidden state. Essentially, the combinatorial dependence of the lower bound follows from the indistinguishability of duplicated indices. Given a smaller set $A$, we designate the final $2|A|$ indices of the sequence to form a set $\tilde{N}$, which upholds by definition $A\subset\tilde{N}$ and $|\tilde{N}|=2|A|$. The lower bound in Theorem 2 of the main text is obtained by replacing $N$ with $2|A|$ and continuing with the same exact proof procedure as in Levine et al. (2017), applied to $A$ and $\tilde{B}\equiv\tilde{N}\setminus A$, when all the residual initial indices, corresponding to the set $B\setminus\tilde{B}$, are kept fixed. Finally, for fixed $R$ the binomial term $\binom{|A|}{R}$ is polynomial in $|A|$, therefore its logarithm obeys $\Omega(\log|A|)$.
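The final step, namely that the logarithm of the binomial term grows logarithmically in the sequence length for a fixed hidden-state dimension, can be checked directly (a sketch; the symbols follow the proof above):

```python
from math import comb, log

R = 4  # fixed hidden-state dimension of the RAC
for N in (64, 256, 1024, 4096):
    ent = log(comb(N // 2, R))  # log of the binomial lower bound
    # For fixed R, comb(N/2, R) grows like (N/2)^R / R!, so the ratio
    # ent / log(N) approaches R from below: logarithmic entanglement scaling.
    print(N, round(ent / log(N), 2))
```

The printed ratios increase toward $R$, illustrating that the entanglement lower bound grows as a constant multiple of $\log N$ rather than linearly in $N$.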
References
 Fannes et al. (1992) Mark Fannes, Bruno Nachtergaele, and Reinhard F Werner, “Finitely correlated states on quantum spin chains,” Communications in mathematical physics 144, 443–490 (1992).
 Perez-García et al. (2007) David Perez-García, Frank Verstraete, Michael M Wolf, and J Ignacio Cirac, “Matrix product state representations,” Quantum Information and Computation 7, 401–430 (2007).
 Verstraete and Cirac (2004) Frank Verstraete and J Ignacio Cirac, “Renormalization algorithms for quantum many-body systems in two and higher dimensions,” arXiv preprint cond-mat/0407066 (2004).
 Vidal (2008) Guifré Vidal, “Class of quantum many-body states that can be efficiently simulated,” Physical review letters 101, 110501 (2008).
 Verstraete et al. (2008) Frank Verstraete, Valentin Murg, and J Ignacio Cirac, “Matrix product states, projected entangled pair states, and variational renormalization group methods for quantum spin systems,” Advances in Physics 57, 143–224 (2008).
 Gu and Wen (2009) Zheng-Cheng Gu and Xiao-Gang Wen, “Tensor-entanglement-filtering renormalization approach and symmetry-protected topological order,” Physical Review B 80, 155131 (2009).
 Evenbly and Vidal (2011) Glen Evenbly and Guifré Vidal, “Tensor network states and geometry,” Journal of Statistical Physics 145, 891–918 (2011).
 Evenbly and Vidal (2014) Glen Evenbly and Guifre Vidal, “Scaling of entanglement entropy in the (branching) multiscale entanglement renormalization ansatz,” Physical Review B 89, 235113 (2014).
 Eisert et al. (2010) Jens Eisert, Marcus Cramer, and Martin B Plenio, “Colloquium: Area laws for the entanglement entropy,” Reviews of Modern Physics 82, 277 (2010).
 Orús (2014) Román Orús, “A practical introduction to tensor networks: Matrix product states and projected entangled pair states,” Annals of Physics 349, 117–158 (2014).

 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Curran Associates, Inc., 2012) pp. 1097–1105.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556 (2014).
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going Deeper with Convolutions,” CVPR (2015).

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) pp. 770–778.
 Sutskever et al. (2011) Ilya Sutskever, James Martens, and Geoffrey E Hinton, “Generating text with recurrent neural networks,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011) pp. 1017–1024.
 Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2013) pp. 6645–6649.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473 (2014).
 Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning (2016) pp. 173–182.
 Carleo and Troyer (2017) Giuseppe Carleo and Matthias Troyer, “Solving the quantum manybody problem with artificial neural networks,” Science 355, 602–606 (2017).
 Saito (2017) Hiroki Saito, “Solving the Bose–Hubbard model with machine learning,” Journal of the Physical Society of Japan 86, 093001 (2017).
 Deng et al. (2017a) Dong-Ling Deng, Xiaopeng Li, and S. Das Sarma, “Machine learning topological states,” Phys. Rev. B 96, 195145 (2017a).
 Gao and Duan (2017) Xun Gao and Lu-Ming Duan, “Efficient representation of quantum many-body states with deep neural networks,” Nature Communications 8, 662 (2017).
 Deng et al. (2017b) Dong-Ling Deng, Xiaopeng Li, and S. Das Sarma, “Quantum entanglement in neural network states,” Phys. Rev. X 7, 021021 (2017b).
 Carleo et al. (2018) Giuseppe Carleo, Yusuke Nomura, and Masatoshi Imada, “Constructing exact representations of quantum many-body systems with deep neural networks,” arXiv preprint arXiv:1802.09558 (2018).
 Cai and Liu (2018) Zi Cai and Jinguo Liu, “Approximating quantum many-body wave functions using artificial neural networks,” Physical Review B 97, 035116 (2018).

 Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011) pp. 315–323.
 (27) Specialized in the sense that the architecture is the same and all non-linearities boil down to polynomials Cohen et al. (2016a); Levine et al. (2017).
 Levine et al. (2018) Yoav Levine, David Yakira, Nadav Cohen, and Amnon Shashua, “Deep learning and quantum entanglement: Fundamental connections with implications to network design,” in 6th International Conference on Learning Representations (ICLR) (2018).
 Levine et al. (2017) Yoav Levine, Or Sharir, and Amnon Shashua, “Benefits of depth for long-term memory of recurrent networks,” arXiv preprint arXiv:1710.09431 (2017).
 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) pp. 2818–2826.
 Cohen et al. (2016a) Nadav Cohen, Or Sharir, and Amnon Shashua, “On the expressive power of deep learning: A tensor analysis,” Conference On Learning Theory (COLT) (2016a).
 Cohen and Shashua (2016) Nadav Cohen and Amnon Shashua, “Convolutional rectifier networks as generalized tensor decompositions,” International Conference on Machine Learning (ICML) (2016).
 Cohen and Shashua (2017) Nadav Cohen and Amnon Shashua, “Inductive bias of deep convolutional networks through pooling geometry,” in 5th International Conference on Learning Representations (ICLR) (2017).
 Sharir et al. (2016) Or Sharir, Ronen Tamari, Nadav Cohen, and Amnon Shashua, “Tractable generative convolutional arithmetic circuits,” arXiv preprint arXiv:1610.04167 (2016).
 Cohen et al. (2016b) Nadav Cohen, Or Sharir, and Amnon Shashua, “Deep simnets,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016b).
 Khrulkov et al. (2018) Valentin Khrulkov, Alexander Novikov, and Ivan Oseledets, “Expressive power of recurrent neural networks,” in 6th International Conference on Learning Representations (ICLR) (2018).
 (37) These upper bounds on entanglement can be attained by employing minimal cut considerations in TNs Cui et al. (2016).
 Vidal (2007) Guifre Vidal, “Entanglement renormalization,” Physical review letters 99, 220405 (2007).
 (39) Direct sampling of spin configurations from MERA was shown to be possible Ferris and Vidal (2012a, b).
 Sharir and Shashua (2018) Or Sharir and Amnon Shashua, “On the expressive power of overlapping architectures of deep learning,” in 6th International Conference on Learning Representations (ICLR) (2018).
 Biamonte et al. (2011) Jacob D Biamonte, Stephen R Clark, and Dieter Jaksch, “Categorical tensor network states,” AIP Advances 1, 042172 (2011).
 Gull et al. (2013) Emanuel Gull, Olivier Parcollet, and Andrew J Millis, “Superconductivity and the pseudogap in the two-dimensional Hubbard model,” Physical review letters 110, 216405 (2013).
 Chen et al. (2013) KS Chen, Zi Yang Meng, SX Yang, Thomas Pruschke, Juana Moreno, and Mark Jarrell, “Evolution of the superconductivity dome in the two-dimensional Hubbard model,” Physical Review B 88, 245110 (2013).
 Lubasch et al. (2014) Michael Lubasch, J Ignacio Cirac, and Mari-Carmen Banuls, “Algorithms for finite projected entangled pair states,” Physical Review B 90, 064425 (2014).
 Zheng and Chan (2016) Bo-Xiao Zheng and Garnet Kin-Lic Chan, “Ground-state phase diagram of the square lattice Hubbard model from density matrix embedding theory,” Physical Review B 93, 035126 (2016).
 Liu et al. (2017) Wen-Yuan Liu, Shao-Jun Dong, Yong-Jian Han, Guang-Can Guo, and Lixin He, “Gradient optimization of finite projected entangled pair states,” Physical Review B 95, 195154 (2017).
 LeBlanc et al. (2015) JPF LeBlanc, Andrey E Antipov, Federico Becca, Ireneusz W Bulik, Garnet Kin-Lic Chan, Chia-Min Chung, Youjin Deng, Michel Ferrero, Thomas M Henderson, Carlos A Jiménez-Hoyos, et al., “Solutions of the two-dimensional Hubbard model: Benchmarks and results from a wide range of numerical algorithms,” Physical Review X 5, 041041 (2015).
 Hermans and Schrauwen (2013) Michiel Hermans and Benjamin Schrauwen, “Training and analysing deep recurrent neural networks,” in Advances in Neural Information Processing Systems (2013) pp. 190–198.
 (49) We focus on the case where $A$ is located to the right of $B$ for proof simplicity; simulations of the network in Fig. 2(d) with randomized weight matrices indicate that the lower bound in Theorem 2 holds for all other locations of $A$.
 Springenberg et al. (2015) J Springenberg, Alexey Dosovitskiy, Thomas Brox, and M Riedmiller, “Striving for simplicity: The all convolutional net,” in ICLR (workshop track) (2015).
 Cui et al. (2016) Shawn X Cui, Michael H Freedman, Or Sattath, Richard Stong, and Greg Minton, “Quantum max-flow/min-cut,” Journal of Mathematical Physics 57, 062206 (2016).
 Ferris and Vidal (2012a) Andrew J Ferris and Guifre Vidal, “Perfect sampling with unitary tensor networks,” Physical Review B 85, 165146 (2012a).
 Ferris and Vidal (2012b) Andrew J Ferris and Guifre Vidal, “Variational monte carlo with the multiscale entanglement renormalization ansatz,” Physical Review B 85, 165147 (2012b).