PhD Position F/M LLM-based code generation for controlling artificial agents in simulated environments
Contract type: Fixed-term contract (CDD)
Required level of qualification: Master's degree (Bac+5) or equivalent
Position: PhD student
Context and assets of the position
Program synthesis (Chaudhuri et al., 2021) has traditionally been applied to a variety of programming tasks, but has rarely been studied for synthesizing controllers, i.e. programs controlling artificial agents in simulated environments. Such controllers are usually studied under the Reinforcement Learning (RL) paradigm in AI (Sutton & Barto, 2018), where an agent learns an action policy from experience in order to maximize cumulative reward in a simulated environment. We believe that the recent rise of Large Language Models (LLMs) opens important perspectives and novel directions for controller synthesis. The successes of Copilot and ChatGPT show that LLMs can provide help and assistance in many programming tasks, both saving time and reducing errors, for instance through bug finding and code suggestions (Chen et al., 2021). Moreover, LLMs are increasingly used in RL, in particular in LLM-augmented RL, where LLMs can interpret users' intents and translate them into concrete rewards to be integrated into an RL algorithm (Fan et al., 2022). A few recent papers have proposed using LLMs to generate AI controllers, reward functions and tasks (Faldor et al., 2024; Lehman et al., 2022; Wang et al., 2023), opening many exciting perspectives where the code generation abilities of LLMs are leveraged to synthesize AI agents and their environments.
In this project, we propose to explore novel research directions for using LLMs to synthesize controllers in unknown and complex environments. How can a controller program be generated from a natural language description of the environment dynamics and task properties? Are LLMs able to generalize to the generation of controllers that combine skills from previously discovered controllers? How can we generate adaptive controllers, e.g. in the form of code specifying neural architectures using standard deep learning libraries such as PyTorch? Can LLMs be used to generate the morphology of embodied agents? Can LLMs help to disentangle semantic vs. procedural knowledge in the context of controller synthesis? In particular, we believe that the question of combining skills is key to scaling these techniques to large environments and complex tasks.
Assignment
Several methodologies have been identified to address the above research questions. Among existing approaches for adapting/evolving LLMs, fine-tuning using Reinforcement Learning from Human Feedback (RLHF, Ouyang et al., 2022) was made famous by ChatGPT, but other promising approaches such as mutation operators (Lehman et al., 2022; Wang et al., 2023) and prompt evolution (Fernando et al., 2023; Guo et al., 2024; Wan et al., 2024) have also been proposed. This will enable us to explore complementary strategies: (i) fixing the prompt and adapting the LLM, or (ii) evolving the prompt with a fixed LLM; as well as interactive vs. autonomous adaptation. We will also consider combining such approaches with non-LLM methods such as automatic program repair. Moreover, sorting knowledge into different components, e.g. semantic vs. procedural knowledge, could make evolutionary methods more efficient and compositional, and decouple the evolution of these representations. Indeed, our brains represent "semantic" and "procedural" knowledge in different ways: knowing that mugs, lamps or books are inanimate objects (generally described as "facts") differs in nature from procedural knowledge (e.g. knowing how to bike or swim, or how to sequence action or instruction primitives). Enabling LLMs to store these two kinds of knowledge separately should yield more disentangled, compositional and interpretable representations.
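The prompt-evolution strategy mentioned above can be sketched as a simple evolutionary loop: a population of prompts is mutated and selected against a task-level fitness. The sketch below is purely illustrative (the function names, the word-level mutation operator and the toy fitness are assumptions); a real system would ask an LLM to rewrite the prompt and would score prompts by the quality of the controllers they produce.

```python
import random

def mutate_prompt(prompt, rng):
    """Apply a toy word-level mutation (a real system would ask an LLM to rewrite)."""
    variations = [" step by step", " as Python code", " with comments"]
    return prompt + rng.choice(variations)

def evolve_prompts(seed_prompt, fitness, generations=5, pop_size=4, seed=0):
    """Evolve prompts against a fixed LLM: mutate, then keep the fittest."""
    rng = random.Random(seed)
    population = [seed_prompt] * pop_size
    for _ in range(generations):
        offspring = [mutate_prompt(p, rng) for p in population]
        # Keep the best pop_size prompts across parents and offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return population[0]

# Toy fitness: prefer prompts that explicitly request code.
best = evolve_prompts("Write a controller for the agent.",
                      fitness=lambda p: p.count("code"))
```

The same loop structure applies when the fitness is replaced by an evaluation of the generated controllers in the simulated environment.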
To initiate the models, we will bootstrap them with supervised learning, for instance by providing a first database of problems and solutions as prompts, or by fine-tuning an existing LLM with it. Found solutions will be evaluated through a Python interpreter or through various feedback signals (e.g. success or failure of code execution), with a special emphasis on the ability to generalize the acquired knowledge to novel environments and tasks that are compositions of the already solved ones. Datasets and environments such as those provided by stable-baselines and evogym will be relevant baselines. We will also consider data augmentation, for instance by mutating existing problems and solutions and evaluating them.
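The evaluation loop through a Python interpreter can be sketched as follows: execute the LLM-generated source, run it against a test case, and return a success/failure signal as feedback. This is a minimal sketch (the `controller` entry-point name and the test-case format are assumptions); a real system would sandbox the execution.

```python
def evaluate_candidate(source, test_case):
    """Execute LLM-generated source and return (success, feedback).

    A real system would sandbox execution; this sketch only illustrates
    the success/failure feedback loop (names are illustrative).
    """
    namespace = {}
    try:
        exec(source, namespace)           # expected to define a `controller` function
        controller = namespace["controller"]
        observation, expected = test_case
        action = controller(observation)
        return action == expected, "ok" if action == expected else "wrong action"
    except Exception as err:              # syntax or runtime error in generated code
        return False, f"execution failed: {err}"

# A toy generated candidate and one test case.
candidate = "def controller(obs):\n    return 'left' if obs < 0 else 'right'"
success, feedback = evaluate_candidate(candidate, test_case=(-1.0, "left"))
```

The returned feedback string can be fed back to the LLM as an error message, which is also where non-LLM methods such as automatic program repair could plug in.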
Main activities
The project will be structured along several milestones of increasing complexity, although this proposed plan can be adapted according to the student’s interests. In each of them, we plan to first bootstrap the model with a database of problem descriptions and known solutions to solve them, using it either to prompt the LLM (if the database is sufficiently small) or to fine-tune it with either supervised learning or RLHF (e.g. if the database is too large to fit in a prompt). Then we will evaluate how the LLM can generalize to novel problems, in particular environments and tasks that are compositions of the already solved ones.
In the first stage, we will study the synthesis of reactive controllers on simple navigation tasks, for example in the form of Braitenberg vehicles (Braitenberg, 1984). The database will consist of a few examples of simple problems and solutions from existing Braitenberg vehicle implementations, which will be sufficiently short to input as a prompt.
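To give an idea of how short such prompt examples can be, here is a classic Braitenberg vehicle controller: vehicle 2b wires each light sensor to the opposite wheel with excitatory connections, so the vehicle turns toward the stronger stimulus. The function signature and sensor scaling below are illustrative choices, not a fixed interface.

```python
def vehicle_2b(left_sensor, right_sensor):
    """Braitenberg vehicle 2b: crossed excitatory connections.

    Each sensor drives the opposite wheel, so the vehicle steers toward
    the stronger stimulus ("aggression" in Braitenberg's taxonomy).
    """
    left_motor = right_sensor   # a stronger right stimulus speeds up the left wheel
    right_motor = left_sensor
    return left_motor, right_motor

# Stimulus on the right: the left wheel spins faster, turning toward it.
left, right = vehicle_2b(left_sensor=0.2, right_sensor=0.9)
```

A handful of such (problem description, controller) pairs fits comfortably in a prompt.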
In the second stage, we will study the synthesis of adaptive controllers, i.e. controllers that can learn from experience (e.g. in the form of learnable decision trees or neural networks). To this end, we will bootstrap the model with a database of descriptions of RL environments and known RL algorithms that solve them, e.g. from the library and documentation of stable-baselines. Since this database will be too large to input as a prompt, we will instead use supervised learning or RLHF to fine-tune the LLM on these examples.
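A sketch of what one entry of this bootstrapping database might look like, paired in a prompt/completion format suitable for fine-tuning. The entry content and field names are assumptions for illustration, not an actual stable-baselines export.

```python
import json

# Hypothetical fine-tuning examples: environment description -> solver code.
finetune_examples = [
    {
        "prompt": "Environment: CartPole-v1. Discrete actions, dense reward. "
                  "Write code that trains an agent to balance the pole.",
        "completion": (
            "from stable_baselines3 import PPO\n"
            "model = PPO('MlpPolicy', 'CartPole-v1')\n"
            "model.learn(total_timesteps=100_000)\n"
        ),
    },
]

def to_jsonl(examples):
    """Serialize examples as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(ex) for ex in examples)

dataset = to_jsonl(finetune_examples)
```

Data augmentation, as mentioned above, would amount to mutating the prompt/completion pairs and re-evaluating the mutated solutions before adding them to this database.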
Finally, in the third stage, we will study the synthesis of meta-learners, i.e. adaptive controllers that can perform well on a wide distribution of environments, as usually studied in the field of meta reinforcement learning (Duan et al., 2016). A possible direction here is to explore how to train a Transformer model that takes as input the sequence of the agent's observations, actions and rewards in a given environment, infers properties of the task at hand, and outputs them in natural language form. This natural language description will then be used as input to the LLM in order to generate a controller or to adapt an existing one.
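The input format for such a task-inference Transformer can be sketched as an interleaved flat sequence of (observation, action, reward) triples, in the spirit of the RL² setup. The tagged-tuple encoding below is an illustrative assumption; an actual model would tokenize or embed each element.

```python
def trajectory_to_tokens(observations, actions, rewards):
    """Interleave (o, a, r) triples into one flat sequence, the kind of
    input a task-inference Transformer would consume (illustrative encoding)."""
    assert len(observations) == len(actions) == len(rewards)
    sequence = []
    for o, a, r in zip(observations, actions, rewards):
        sequence += [("obs", o), ("act", a), ("rew", r)]
    return sequence

# A two-step toy trajectory.
tokens = trajectory_to_tokens(
    observations=[0.1, 0.3], actions=["left", "right"], rewards=[0.0, 1.0])
```

The Transformer's natural-language summary of such a sequence would then serve as the prompt for controller generation.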
In addition, in each of the three proposed stages, we will in parallel study the ability to synthesize agent morphologies as JSON text files that can be interpreted in a 3D simulator such as MuJoCo. This will allow the study of body-controller co-evolution, which is an important problem in both AI and Artificial Life (Bhatia et al., 2021). We believe that LLM-assisted program synthesis could lead to important breakthroughs in this domain.
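As a concrete illustration, an LLM-generated morphology could be a JSON document that is parsed and validated before being handed to the simulator. The schema below (field names, body-part types) is a made-up example, not the actual format of MuJoCo or any particular simulator.

```python
import json

# A hypothetical morphology specification as JSON text (illustrative schema).
morphology_json = """
{
  "body": {"type": "torso", "size": [0.3, 0.2, 0.1],
           "children": [
             {"type": "leg", "length": 0.4, "joint": "hinge"},
             {"type": "leg", "length": 0.4, "joint": "hinge"}
           ]}
}
"""

# Parsing gives an immediate well-formedness check on generated morphologies,
# and the resulting tree can be inspected before simulation.
spec = json.loads(morphology_json)
num_legs = sum(1 for child in spec["body"]["children"] if child["type"] == "leg")
```

Because JSON parsing either succeeds or fails cleanly, it provides the same kind of success/failure feedback signal for morphology synthesis as code execution does for controller synthesis.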
Please take this work plan mostly as a suggestion at this point. If you have your own ideas on alternative methodologies or objectives, we will be glad to discuss them.
References
Bhatia, J., Jackson, H., Tian, Y., Xu, J., & Matusik, W. (2021). Evolution Gym: A Large-Scale Benchmark for Evolving Soft Robots. Advances in Neural Information Processing Systems, 34, 2201–2214. https://papers.nips.cc/paper/2021/hash/118921efba23fc329e6560b27861f0c2-Abstract.html
Braitenberg, V. (1984). Vehicles: Experiments in Synthetic Psychology. MIT Press.
Chaudhuri, S., Ellis, K., Polozov, O., Singh, R., Solar-Lezama, A., & Yue, Y. (2021). Neurosymbolic Programming. Foundations and Trends® in Programming Languages, 7(3), 158–243. https://doi.org/10.1561/2500000049
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., … Zaremba, W. (2021). Evaluating Large Language Models Trained on Code (arXiv:2107.03374). arXiv. https://doi.org/10.48550/arXiv.2107.03374
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., & Abbeel, P. (2016). RL²: Fast Reinforcement Learning via Slow Reinforcement Learning (arXiv:1611.02779). arXiv. http://arxiv.org/abs/1611.02779
Faldor, M., Zhang, J., Cully, A., & Clune, J. (2024). OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code (arXiv:2405.15568). arXiv. https://doi.org/10.48550/arXiv.2405.15568
Fan, L., Wang, G., Jiang, Y., Mandlekar, A., Yang, Y., Zhu, H., Tang, A., Huang, D.-A., Zhu, Y., & Anandkumar, A. (2022). MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge (arXiv:2206.08853). arXiv. https://doi.org/10.48550/arXiv.2206.08853
Fernando, C., Banarse, D., Michalewski, H., Osindero, S., & Rocktäschel, T. (2023). Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (arXiv:2309.16797). arXiv. https://doi.org/10.48550/arXiv.2309.16797
Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., & Yang, Y. (2024). Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (arXiv:2309.08532). arXiv. https://doi.org/10.48550/arXiv.2309.08532
Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., & Stanley, K. O. (2022). Evolution through Large Models (arXiv:2206.08896; Version 1). arXiv. https://doi.org/10.48550/arXiv.2206.08896
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning, second edition: An Introduction (second edition). Bradford Books.
Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., Chowdhury, M., & Zhang, M. (2024). Efficient Large Language Models: A Survey (arXiv:2312.03863). arXiv. https://doi.org/10.48550/arXiv.2312.03863
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models (arXiv:2305.16291). arXiv. https://doi.org/10.48550/arXiv.2305.16291
Skills
- Excellent programming skills, preferably in Python. Experience with PyTorch or JAX.
- Prior experience with foundation models, deep reinforcement learning and data analysis.
- Strong interest in implementing artificial agents able to acquire an open-ended repertoire of skills.
- Prior experience with running large-scale experiments on CPU/GPU clusters is a plus.
- Fluent English
Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Possibility of teleworking and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
Remuneration
2100€ / month (before tax) during the first 2 years,
2190€ / month (before tax) during the third year.
General information
- Theme/Domain: Robotics and intelligent environments
- City: Talence
- Inria Centre: Centre Inria de l'université de Bordeaux
- Desired start date: 2024-10-01
- Contract duration: 3 years
- Application deadline: 2024-08-04
Please note: applications must be submitted online via the Inria website. Processing of applications sent through other channels is not guaranteed.
Instructions for applying
Please send:
- CV
- Cover letter
- Master's grades and ranking
- Reference letter(s)
Defence and security:
This position may be assigned to a restricted-access zone (ZRR), as defined in decree no. 2011-1425 on the protection of the nation's scientific and technical potential (PPST). Authorization to access such a zone is granted by the head of the institution, following a favourable ministerial opinion, as defined in the order of 3 July 2012 on the PPST. An unfavourable ministerial opinion for a position located in a ZRR would result in the cancellation of the recruitment.
Recruitment policy:
As part of its diversity policy, all Inria positions are accessible to people with disabilities.
Contacts
- Inria team: FLOWERS
PhD supervisor:
Clément Moulin-Frier / clement.moulin-frier@inria.fr
The keys to success
Please don't wait until the application deadline to contact us. To begin with, you can simply send us your CV by email, explaining why you are particularly interested in this position.
Later on you will need to send a more formal cover letter, as well as reference letters. Please also send documents or reports describing previous projects you have worked on (even if they are not directly related to the topic), as well as your Master's grades and links to some code repositories.
For information, we will proceed in two phases:
- A first phase where we will make a selection of candidates based on the CV, the cover letter and the provided documents;
- A second phase of interviews of the selected candidates.
If you are selected for an interview, it will be the occasion to have a scientific discussion on the topics of the PhD. We highly recommend that you take the time to look at some of the mentioned references. We don't expect you to have read all of them, nor to have fully understood all the related concepts: during the interview you will be able to explain what you have understood and to ask questions about what was less clear.
About Inria
Inria is the French national research institute for digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in meeting the challenges of digital technology, often at the interface with other disciplines. The institute draws on a wide range of talents across more than forty professions. 900 research support and innovation staff help scientific and entrepreneurial projects with worldwide impact to emerge and grow. Inria works with numerous companies and has supported the creation of more than 200 start-ups. The institute thus strives to meet the challenges of the digital transformation of science, society and the economy.