Applying Transflower to a goal-conditioned robotics task

As I posted in the Mattermost, one of the applications I've been working on for Transflower is the robotics task that Tianwei designed and gathered some data for using VR.

After a lot of bugs and problems, both with the code and the data, and plenty of fixing and improving, the model seems to be starting to work, and even generalizes somewhat!

What was done


  • Simplified the observation space to only include the object to interact with, and only included tasks that involve one object. This is a significant simplification, but it was worth trying! I plan to explore relaxing it.
  • As we only give it the object it will interact with, I've omitted the object color information from the observation, as it's not necessary to accomplish the simplified task.
  • Applied smoothing to the data and trimmed a bit of the beginning (where the user is reading the instruction and doing nothing)
  • Played with adding input dropout, as in MoGlow, to force the model to pay more attention to the instructions
  • Filtered the data to include only my and Tianwei's demonstrations. It turns out most other people didn't have enough practice with the VR system to reliably produce good demonstrations. I could do more fine-grained filtering, because some of the other demonstrations are actually good, but I've done it this way for simplicity for now.
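To illustrate the smoothing and trimming steps above, here is a rough sketch; the helper names and the moving-average filter are hypothetical (the actual preprocessing may use a different filter and window):

```python
def smooth(traj, window=5):
    """Centered moving average over a 1-D sequence of values.

    Illustrative only: low-pass filtering noisy VR demonstrations
    before training. Near the edges the window shrinks.
    """
    half = window // 2
    out = []
    for i in range(len(traj)):
        lo = max(0, i - half)
        hi = min(len(traj), i + half + 1)
        out.append(sum(traj[lo:hi]) / (hi - lo))
    return out

def trim_start(traj, n_frames):
    """Drop the first n_frames (user reading the instruction, idle)."""
    return traj[n_frames:]
```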

Bugs found and fixed along the way:
  • I have some evidence from another application suggesting that TorchScript sometimes subtly breaks the model. I have therefore stopped using it, but want to investigate further
  • Found a serious bug in the code for updating the context during inference, which would break any model with context length > 1
  • Probably other bugs I can't remember
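For what it's worth, the context-update logic at inference time can be sketched like this; a `deque` with `maxlen` handles the eviction of the oldest frame, which is the kind of bookkeeping the bug lived in (names are hypothetical, not the actual Transflower code):

```python
from collections import deque

class RollingContext:
    """Keep the last `context_length` frames for autoregressive inference.

    Hypothetical sketch: with context_length > 1, each new predicted
    frame must evict the oldest one, otherwise the model is fed a
    stale context window.
    """
    def __init__(self, context_length):
        self.buf = deque(maxlen=context_length)

    def push(self, frame):
        # deque with maxlen evicts the oldest frame automatically
        self.buf.append(frame)

    def as_input(self):
        return list(self.buf)
```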

More comments: “plateau” learning rate schedule (auto-decrease when loss stops improving) seems to be a nice trick.
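For reference, PyTorch ships this schedule as `torch.optim.lr_scheduler.ReduceLROnPlateau`; a minimal pure-Python sketch of the rule (a hypothetical class, not the actual training code) looks like:

```python
class PlateauScheduler:
    """Decrease the learning rate by `factor` when the loss has not
    improved for more than `patience` steps. Minimal illustrative
    version of the 'plateau' rule."""

    def __init__(self, lr, factor=0.5, patience=10):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps > self.patience:
                self.lr *= self.factor
                self.bad_steps = 0
        return self.lr
```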

So the main steps to make it work were: filtering out bad data, dropping TorchScript (which may have been breaking things), and smoothing, which also helps a lot. It shouldn't have taken me so long to figure these out.

Preliminary qualitative look

Here it is accomplishing a task outside the training set (the instruction appears in the training set, but not with this particular position of the object)
Paint black bird yellow:

Put blue bird on the shelf:

Those are a bit cherry-picked. Here are some comments on the degree of generalization I currently observe in the model from manual tests:

  • If the configuration is quite different from the training set (i.e. a new (object type, object position) pair), it seems to have a tendency to paint the object some color, regardless of the given instruction. This may be because painting is the most common instruction in the training set (approximately half of the instructions are Paint instructions)
  • If the configuration is similar to one in the training set, it tends to do the task that was in the training set, especially if it is initialized in the same arm pose. Not surprising
  • Nevertheless, it will still generalize sometimes, achieving new combinations of (object type, object position, goal) that weren't in the training set. As the model is probabilistic, it may only achieve this a certain fraction of the time, and only given a long enough episode (where it may fumble around until it succeeds), so measuring these rates for different tasks/configurations will be necessary to get a better grasp of the generalization.
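A sketch of the kind of measurement I mean: repeated stochastic rollouts per configuration, counting successes. Here `run_episode` is a hypothetical stand-in for a full simulator rollout with the trained model:

```python
def success_rate(run_episode, n_trials=20):
    """Estimate how often a stochastic policy achieves the goal.

    run_episode: hypothetical callable that rolls out one episode
    (driving the simulator with the model) and returns True on success.
    """
    successes = sum(1 for _ in range(n_trials) if run_episode())
    return successes / n_trials
```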

Next steps

  • Try the model with a bit more training and longer context
  • Gather more data
  • Set up automatic evaluation. This would require creating a reward function, either hard-coded or learned, to test how often the model succeeds at achieving tasks
  • Try different methods to improve the model (beyond gathering more demo data):
    – PPO with hard-coded reward function
    – GAIL (basically same as above but with learned reward function)
    – Supervised learning on self-generated data with relabelling (as in Laetitia's project), which is basically Levine's GCSL algorithm, which in turn combines HER with the self-imitation learning approach to RL
    – Some offline RL algorithm. Actually, a question: is there a policy-based (not value-based) offline RL algorithm? I'm sure there are, but I can't remember one off the top of my head!
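For the GCSL-style option, the core hindsight relabelling step is simple. A sketch, assuming a trajectory stored as a list of (obs, action) pairs (this data layout is hypothetical):

```python
def relabel(trajectory):
    """GCSL/HER-style hindsight relabelling sketch.

    Whatever state the rollout actually reached is treated as the goal
    that every earlier step was 'trying' to achieve, turning failed
    rollouts into supervised (obs, goal) -> action training data.
    """
    achieved_goal = trajectory[-1][0]  # final observation becomes the goal
    return [((obs, achieved_goal), action) for obs, action in trajectory]
```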

Thanks a lot for the updates! It's cool to see these results, quite promising!

Do you mean you use an oracle which, given the instruction, identifies the object to interact with, so that the agent only sees this object and not the others? How are the objects represented/perceived? What about the switches, are these objects?

So I guess this means in this case it is ignoring the instruction, right? To not ignore it, could an approach where the system tries to predict the language instruction from the trajectory help (maybe at training time)?

Yes to the first question. The objects are represented by their position and orientation (as Euler angles). One can also include the type, color, and size, but I'm not currently including these because the oracle already picks the object (and size is not used in the tasks at the moment). The switches, drawer and door are included in the observation too (so they are not considered "objects"). There's one variable for the drawer position, one for the door position, and two for each of the three buttons (one continuous and one discrete).

It does sometimes seem to ignore the instruction, yeah. I think what you are proposing could help, maybe combined with heavier input dropout, hmm. I wonder if gathering data where the initial configuration is kept the same but the instruction changes could help too