While hardware can accelerate machine learning (ML), using ML to accelerate hardware is much harder. The majority of ML algorithms cannot run (let alone train) at the speed of hardware. I explore how we can reduce deep neural network (DNN) model sizes through sparsity [4, 17] and quantization ahead of training. Small, efficient models that can be baked on-chip may allow us to reason about microarchitecture at runtime, so that we can increase performance, reduce power, and adapt to changing tasks, environments, and goals.
This is also the topic of my PhD thesis!
DNN models can require tons of data, compute power, and researcher-hours to train. Once we deploy them in the field, whether on self-driving cars, voice assistants, or suicidal robots, it is easy for attackers to steal these models. We propose the Trusted Inference Engine (TIE), a root-of-trust-based secure DNN accelerator that can run these models without risking exposure of the underlying model.
I also explore how we can modify DNN models to better fit the hardware. I propose ClosNets, which replace dense linear layers with three sparse layers arranged in a Clos topology. The insight here is that as long as you maintain full connectivity and provide enough paths between inputs and outputs, networks can still train well, but with 5-10x fewer connections. ClosNets also map to hardware very well: since we know the topology ahead of time, they don't suffer from the problems of ordinary sparse networks (having to store weight indices, non-uniform memory access patterns, unbalanced computation, etc.).
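To make the idea concrete, here is a minimal PyTorch sketch (my illustration, not the exact construction from the paper): each stage is a block-diagonal "crossbar" layer, and a fixed perfect-shuffle permutation between stages guarantees a path from every input to every output. The class names and the `groups` parameter are mine.

```python
import torch
import torch.nn as nn

class BlockDiagonalLinear(nn.Module):
    """One sparse stage: `groups` independent small crossbars (block-diagonal weight)."""
    def __init__(self, features, groups):
        super().__init__()
        assert features % groups == 0
        self.groups = groups
        block = features // groups
        # One block x block crossbar per group: ~1/groups the weights of a dense layer.
        self.weight = nn.Parameter(torch.randn(groups, block, block) / block ** 0.5)

    def forward(self, x):
        b = x.shape[0]
        x = x.view(b, self.groups, -1)                    # split units into groups
        x = torch.einsum("bgi,gio->bgo", x, self.weight)  # per-group crossbar
        return x.reshape(b, -1)

class ClosLayer(nn.Module):
    """Drop-in for a square nn.Linear: three sparse stages with fixed shuffles
    between them. With block width >= groups (e.g. features=64, groups=8),
    every input still has a path to every output."""
    def __init__(self, features, groups=8):
        super().__init__()
        self.groups = groups
        self.stages = nn.ModuleList(
            [BlockDiagonalLinear(features, groups) for _ in range(3)])

    def shuffle(self, x):
        # Perfect shuffle: spread each group's outputs across the next stage's groups.
        b = x.shape[0]
        return x.view(b, self.groups, -1).transpose(1, 2).reshape(b, -1)

    def forward(self, x):
        x = self.shuffle(self.stages[0](x))
        x = self.shuffle(self.stages[1](x))
        return self.stages[2](x)
```

With `features=64` and `groups=8`, this uses 3·64²/8 = 1,536 weights instead of 4,096, and because the topology is fixed at design time, no weight indices need to be stored.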
Currently, I'm trying to provide theoretical answers to which topology gives the best accuracy-to-memory ratio on average for any task. In NeuroFabric [ADD REF], we claim that parallel butterfly topologies give the best bang for the buck, as long as you initialize them well.
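A quick way to experiment with this is a dense layer masked to a fixed topology. The sketch below is again my own illustration, not NeuroFabric's code; it composes log₂(n) butterfly stages with only two nonzeros per row each, and a parallel butterfly would simply run several of these side by side and sum their outputs.

```python
import math
import torch
import torch.nn as nn

def butterfly_stage_mask(n, stage):
    """Connectivity mask for one butterfly stage over n = 2^k units:
    unit i connects to itself and to unit i XOR 2^stage (two nonzeros per row)."""
    mask = torch.zeros(n, n)
    for i in range(n):
        mask[i, i] = 1.0
        mask[i, i ^ (1 << stage)] = 1.0
    return mask

class MaskedLinear(nn.Module):
    """A dense layer clamped to a fixed sparse topology by a 0/1 mask."""
    def __init__(self, mask):
        super().__init__()
        self.register_buffer("mask", mask)
        n = mask.shape[0]
        # 1/sqrt(fan-in) init with fan-in = 2 nonzeros per row; the init is
        # exactly the part the NeuroFabric claim says you have to get right.
        self.weight = nn.Parameter(torch.randn(n, n) / 2 ** 0.5)

    def forward(self, x):
        return x @ (self.weight * self.mask).t()

def butterfly(n):
    """Compose log2(n) stages: full input-to-output connectivity
    with 2n weights per stage instead of n^2."""
    return nn.Sequential(*[MaskedLinear(butterfly_stage_mask(n, s))
                           for s in range(int(math.log2(n)))])
```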
I am part of the BRISC-V team here at the ASCS lab. We are building an open-source manycore system-on-chip: CPUs, caches, NoCs and all. At the moment, I am building an out-of-order, multiple-issue RISC-V processor with branch speculation and register renaming, which will hopefully become the flagship of the BRISC-V project. All of our RTL is Verilog-2001, so feel free to use it!
For our computer architecture and computer organization classes, we teach the RISC-V architecture. For use in class, we have built an in-browser, step-by-step RISC-V compiler and simulator, so that students can get used to writing and debugging assembly. Feel free to try it!
Training DNNs on clusters of machines is difficult, partly because the network is far slower than the compute units, and partly because of the straggler problem. When synchronizing updates between worker nodes, we can either wait until everyone gets everyone's updates, which takes time, or use stale updates, which can hurt accuracy. Meanwhile, we have all this DRAM distributed across nodes, yet it is used only to clone the DNN model. In NoSync, we ask whether we can instead not sync updates at all, but (1) have each worker train a separate model, and (2) periodically have the best-performing models replace the worst-performing ones.
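In rough pseudocode, a NoSync-style loop might look like the sketch below. The helpers `make_model`, `shards`, and `val_fn` are hypothetical, and the replace-worst-with-best step is my shorthand for the selection policy; see the paper for the actual scheme.

```python
import torch
import torch.nn.functional as F

def nosync_train(make_model, shards, val_fn, workers=8,
                 rounds=100, local_steps=50, swap_every=10):
    """Each worker trains its own model on its own data shard,
    with no gradient or weight synchronization between steps."""
    models = [make_model() for _ in range(workers)]
    opts = [torch.optim.SGD(m.parameters(), lr=0.01) for m in models]
    for r in range(rounds):
        for model, opt, shard in zip(models, opts, shards):
            for x, y in shard.batches(local_steps):   # local-only training
                opt.zero_grad()
                F.cross_entropy(model(x), y).backward()
                opt.step()
        # Occasional cheap selection step instead of per-step gradient sync:
        # clone the best-scoring model over the worst-scoring one.
        if (r + 1) % swap_every == 0:
            scores = [val_fn(m) for m in models]
            best = max(range(workers), key=scores.__getitem__)
            worst = min(range(workers), key=scores.__getitem__)
            models[worst].load_state_dict(models[best].state_dict())
    return models
```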