I am a Master's student at the Robotics Institute, Carnegie Mellon University (CMU), where I'm advised by Dr. Kris Kitani. While at CMU, I have also been a recipient of the 2019 Siebel Scholarship.
My current research interests lie in Computer Vision, Reinforcement Learning, and Machine Learning. I want to build robust visual representation systems that learn with limited human supervision, using interaction or unlabeled experience, and generalize across different tasks.
Previously, I completed my Bachelor's with Honors in Electrical Engineering at the Indian Institute of Technology, Hyderabad (IIT-H), where I was advised by Dr. Vineeth N. Balasubramanian.
In the past, I have also worked as a Research Intern at Bosch AI.
We present a methodology to infer a textured and articulated 3D model of a person from a single image. We propose a two-stream approach that decouples geometry and texture inference, and combines the outputs of the two streams using a differentiable renderer, which enables end-to-end self-supervised learning.
We propose a novel approach to variable-length semantic video generation from short text in the form of captions. We adopt a new perspective on video generation and combine both the long-term and short-term dependencies between video frames, which allows us to generate a video incrementally.
We introduce Synchronized Deep Recurrent Attentive WRiter (Sync-Draw), a novel generative model for video generation. Our model combines a Variational Autoencoder (VAE) with a Recurrent Attention Mechanism in a novel way to create a temporally dependent sequence of frames that is gradually generated over time.
Detection of Diabetic Retinopathy (DR) has been studied for a long time, but no commercially viable solutions that work across different populations exist yet. In this work, we investigate the performance of Very Deep Networks for the binary classification of fundus images provided by EyePACS as part of Kaggle's DR detection challenge.
We compare the sentiment of social media news posts of television, radio and print media, to show the differences in the ways these channels cover the news. We also analyze users’ reactions and opinion sentiment on news posts with different sentiments. We perform our experiments on a dataset extracted from Facebook Pages of five popular news channels.
Using Imitation Learning for Learning Culinary Skills
04 Dec 2018
In this project we explored learning culinary skills such as cutting, pouring, and drizzling using Learning from Demonstrations (LfD). For our low-level control policy we use Dynamic Motor Primitives (DMPs), which allow us to learn versatile skills from a few demonstrations. We tested our approach on a 7-DOF Franka Arm. Video
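As a rough sketch of how such a low-level policy works: a discrete DMP is a damped spring system pulled toward a goal, modulated by a learned forcing term. Below is a minimal one-dimensional rollout; the gains, basis parameters, and function names are illustrative assumptions, not the project's exact implementation, and the weights `w` would normally be fit from a demonstration.

```python
import numpy as np

def dmp_rollout(y0, g, w, centers, widths, tau=1.0, dt=0.01,
                alpha_z=25.0, beta_z=6.25, alpha_x=1.0):
    """Integrate a discrete DMP from start y0 toward goal g.

    w       -- basis-function weights (learned from a demonstration)
    centers -- Gaussian basis centers in phase space
    widths  -- Gaussian basis widths
    """
    y, z, x = y0, 0.0, 1.0
    traj = [y]
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)           # basis activations
        f = (psi @ w) / (psi.sum() + 1e-10) * x * (g - y0)   # forcing term
        z += dt / tau * (alpha_z * (beta_z * (g - y) - z) + f)
        y += dt / tau * z
        x += dt / tau * (-alpha_x * x)                       # canonical phase decay
        traj.append(y)
    return np.array(traj)

# With zero weights the forcing term vanishes and the DMP converges to g.
traj = dmp_rollout(y0=0.0, g=1.0, w=np.zeros(10),
                   centers=np.linspace(0, 1, 10), widths=np.full(10, 50.0))
```

The spring-damper part guarantees convergence to the goal, while the forcing term shapes the path taken, which is what makes the skill reusable across start and goal positions.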
Incremental Image Generation using Scene Graphs
01 Dec 2018
We propose a method that enables the underlying model to generate an image incrementally from a sequence of scene descriptions in graph form (scene graphs). We propose a recurrent network architecture such that the cumulative image generated at any point in the sequence is consistent with the previously generated images. Our model utilizes Graph Convolutional Networks (GCNs) to handle variable-size scene graphs, along with GAN-based image translation networks, to generate realistic multi-object images with high variability. We demonstrate our model's ability to generate context-preserving, scene-graph-based image sequences on multi-modal datasets such as COCO-Stuff, which contain multi-object images with annotations describing the visual scene.
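The reason a GCN handles variable-size scene graphs is that each layer only aggregates over a node's neighbors, so the same weights apply to any graph. A minimal mean-aggregation graph-convolution layer, purely illustrative and not our exact architecture, looks like this:

```python
import numpy as np

def gcn_layer(node_feats, adj, W):
    """One graph-convolution layer over a scene graph of any size.

    node_feats -- (N, d_in) object/relationship embeddings
    adj        -- (N, N) 0/1 adjacency of the scene graph
    W          -- (d_in, d_out) shared weight matrix
    """
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)
    h = (a_hat / deg) @ node_feats          # average each node with its neighbors
    return np.maximum(h @ W, 0.0)           # linear map + ReLU

# A 5-node scene graph with 4-dim node features mapped to 8 dims.
rng = np.random.default_rng(0)
adj = (rng.random((5, 5)) > 0.5).astype(float)
adj = np.maximum(adj, adj.T)                # make it symmetric
out = gcn_layer(rng.normal(size=(5, 4)), adj, rng.normal(size=(4, 8)))
```

Because `W` is shared across nodes, the same layer processes a 3-object and a 10-object scene graph without any architectural change.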
RL Based Multi-Object Image Compositing
15 May 2018
We propose a generative adversarial approach to multi-object compositing in which we introduce multiple discriminators to handle the different distributions. We use reinforcement learning to train an agent that treats the discriminators as bandit arms and learns to choose the right discriminator to train the generator. The generator learns to iteratively predict the parameters of a spatial transformer to warp the foreground object onto the background, which allows it to generalize over different stages of multi-object compositing. We demonstrate our approach on the CLEVR dataset and show that it gives promising results and has the potential to generalize to more complicated visual scenes.
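The bandit view can be sketched with a simple epsilon-greedy agent over K discriminators. This is an illustrative sketch only: the reward signal (here, higher is better for the generator) and hyperparameters are assumptions, not the exact training setup.

```python
import numpy as np

class DiscriminatorBandit:
    """Epsilon-greedy bandit that picks which discriminator trains the generator."""

    def __init__(self, k, eps=0.2, seed=0):
        self.values = np.zeros(k)   # running mean reward per discriminator
        self.counts = np.zeros(k)
        self.eps = eps
        self.rng = np.random.default_rng(seed)

    def select(self):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.values)))   # explore
        return int(np.argmax(self.values))                    # exploit

    def update(self, arm, reward):
        # reward could be, e.g., the improvement in the generator's loss
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Simulated check: if arm 1 consistently yields reward, its estimate wins out.
bandit = DiscriminatorBandit(k=3)
for _ in range(300):
    arm = bandit.select()
    bandit.update(arm, 1.0 if arm == 1 else 0.0)
```

The epsilon term keeps every discriminator occasionally sampled, so the agent can adapt when a different discriminator becomes the most useful teaching signal at a later compositing stage.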
Improving Task-Oriented Language Grounding using Disentangled Representations
01 May 2018
In this work, we aim to learn a disentangled representation of the visual scene using β-VAE, and then combine it with instruction-based text representation using a soft-attention mechanism. This generates a representation of the “instruction-conditioned” visual scene which is robust to variations with respect to objects, actions and attributes like shape and color. This representation is then used to learn a policy using standard reinforcement learning methods to execute the instruction in the given scene.
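The soft-attention combination can be sketched as a text-derived gating over the disentangled latent factors: the instruction decides how much each factor matters. In this minimal NumPy illustration, the projection `W` and all dimensions are hypothetical stand-ins for learned components.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def instruction_conditioned(z, txt, W):
    """Gate disentangled latent factors z with attention computed from text.

    z   -- (batch, z_dim) beta-VAE latent factors of the visual scene
    txt -- (batch, txt_dim) instruction embedding (e.g., from a recurrent encoder)
    W   -- (txt_dim, z_dim) learned projection from text to factor scores
    """
    attn = softmax(txt @ W)   # attention weights over the latent factors
    return attn * z           # instruction-conditioned scene representation

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))       # latent factors
txt = rng.normal(size=(4, 16))    # instruction embeddings
W = rng.normal(size=(16, 8))      # hypothetical learned projection
cond = instruction_conditioned(z, txt, W)
```

Because the gating is multiplicative over individual factors, an instruction like "go to the red object" can up-weight the color factor while leaving irrelevant factors suppressed, which is what gives the downstream policy its robustness to nuisance variation.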