Efficient processing and representations for 3D tasks

Doctoral Thesis Theo W. Costain

The challenge this thesis aims to address is that of improving the efficiency of deep learning on 3D tasks. We consider this problem from two angles: first, how we can improve the efficiency of operations on dense voxel grids; and second, how recent improvements in representational efficiency can be leveraged in computer vision pipelines. The cubic growth in the memory requirements of dense voxel grids means that computational complexity is of much greater importance in 3D tasks than in 2D tasks. Whilst a variety of approaches to addressing this issue are available, in our work we consider two related approaches.

First, we consider quantisation, which presents a powerful approach to reducing the energy, computation, and memory requirements of deep networks. Our work investigates how varying the level of quantisation across the layers of a deep network can provide greater memory savings than using the same level throughout, whilst maintaining the same accuracy.

Second, where quantisation-capable hardware is not available, particularly in power-constrained environments, simply reducing the overall size of a model can yield greater improvements in overall energy consumption than an equivalent reduction in the amount of computation performed. We approach the problem of reducing model size by compressing the weights of a network rather than altering its structure. We examine this task through the lens of approximation, using functional approximations such as cosine and Chebyshev series to learn a compact representation of the weights of a deep network.

Recent developments in the form of neural implicit representations (NIRs) present an exciting new approach to improving the representational efficiency of 3D pipelines, considering what tasks are possible using a given representation, rather than just the compute required. Our work examines how early NIR works focused on the reconstruction task alone, hampering their use in downstream tasks.
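To make the mixed-precision idea concrete, the following is a minimal sketch (not the thesis's actual scheme) of uniform symmetric quantisation with a different bit-width per layer; the layer shapes, bit-width assignment, and the `quantise` helper are all illustrative assumptions:

```python
import numpy as np

def quantise(w, bits):
    """Uniform symmetric quantisation of a weight tensor to `bits` bits."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 representable magnitudes at 8 bits
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale    # simulated quantisation (values snap to the grid)

rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) for _ in range(3)]

# Hypothetical mixed-precision assignment: a sensitive early layer keeps
# 8 bits, while later layers drop to 4 and 2 bits.
bit_widths = [8, 4, 2]
quantised = [quantise(w, b) for w, b in zip(layers, bit_widths)]

# Storage cost of the weights alone, relative to FP32.
fp32_bits = sum(w.size * 32 for w in layers)
mixed_bits = sum(w.size * b for w, b in zip(layers, bit_widths))
print(f"memory: {mixed_bits / fp32_bits:.2%} of FP32")  # -> memory: 14.58% of FP32
```

The point of varying `bit_widths` per layer is that layers differ in sensitivity, so a network can often tolerate very low precision in some layers while a uniform bit-width would have to match the most sensitive one.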
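The functional-approximation idea can likewise be sketched with NumPy's Chebyshev utilities. This is an illustrative toy, not the thesis's method: it treats a flattened weight tensor as samples of a function on [-1, 1] and stores only a truncated Chebyshev series; the smooth synthetic "weights" and the coefficient budget are assumptions chosen so the example is compact:

```python
import numpy as np

def compress(w, n_coeffs):
    """Fit a truncated Chebyshev series to the flattened weights."""
    flat = w.ravel()
    x = np.linspace(-1.0, 1.0, flat.size)
    return np.polynomial.chebyshev.chebfit(x, flat, n_coeffs - 1)

def decompress(coeffs, shape):
    """Evaluate the stored series back into a weight tensor."""
    x = np.linspace(-1.0, 1.0, int(np.prod(shape)))
    return np.polynomial.chebyshev.chebval(x, coeffs).reshape(shape)

# Illustrative smooth weight pattern; real network weights are far less
# smooth and would need more coefficients (or a learned fit).
w = np.sin(np.linspace(0.0, 4.0 * np.pi, 256)).reshape(16, 16)

coeffs = compress(w, n_coeffs=32)
w_hat = decompress(coeffs, w.shape)
print(f"stored {coeffs.size} coefficients for {w.size} weights")  # 8x fewer values
```

Only the coefficient vector needs to be stored; the full weight tensor is regenerated on demand, trading a little computation at load time for an 8x reduction in stored values in this toy setting.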
We show that, when trained for reconstruction alone, NIRs learn representations that are unsuitable for semantic tasks. To address this, we demonstrate that simple joint training on semantic tasks such as classification is sufficient to generate representations that are meaningful even for unseen semantic tasks. Expanding on this work, we propose an approach to contextualise NIRs that are trained for reconstruction alone, avoiding any need to retrain the encoders. This contextualising approach then permits training on larger unlabelled datasets to provide improved reconstruction performance, before fine-tuning on a labelled dataset to achieve the required semantic capability.
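The shape of such a joint objective can be sketched as follows. Everything here is a hypothetical stand-in for the actual architecture: the latent code, the linear decoder and classification head, the occupancy targets, and the weighting `lam` between the two losses are all assumptions used purely to show how a single shared latent feeds both a reconstruction loss and a semantic loss:

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.standard_normal(64)             # latent code from a (hypothetical) encoder
W_dec = rng.standard_normal((128, 64))  # decoder: latent -> occupancy logits
W_cls = rng.standard_normal((10, 64))   # semantic head: latent -> class logits

target_occ = (rng.random(128) > 0.5).astype(float)  # dummy occupancy labels
target_cls = 3                                      # dummy class label

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Reconstruction term: binary cross-entropy on occupancy predictions.
p_occ = np.clip(sigmoid(W_dec @ z), 1e-7, 1.0 - 1e-7)
l_recon = -np.mean(target_occ * np.log(p_occ)
                   + (1.0 - target_occ) * np.log(1.0 - p_occ))

# Semantic term: cross-entropy on class logits from the SAME latent.
p_cls = softmax(W_cls @ z)
l_sem = -np.log(p_cls[target_cls])

lam = 0.1                       # assumed weighting between the two tasks
l_total = l_recon + lam * l_sem
```

Because both losses backpropagate through the same latent `z`, minimising `l_total` pushes the encoder towards codes that support reconstruction and carry semantic structure, rather than reconstruction alone.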