PhD Position F/M LLM4Code and SopraSteria: Software migration and modernization with LLMs
Type de contrat : CDD
Niveau de diplôme exigé : Bac + 5 ou équivalent
Fonction : Doctorant
A propos du centre ou de la direction fonctionnelle
The Inria Centre at Rennes University is one of Inria's eight centres and has more than thirty research teams. The Inria centre is a major and recognized player in the field of digital sciences. It is at the heart of a rich R&D and innovation ecosystem: highly innovative PMEs, large industrial groups competitiveness clusters, research and higher education players, laboratories of excellence, technological research institute, etc
Contexte et atouts du poste
The PhD subject (thèse CIFRE) is a collaboration between SopraSteria (in Nantes) and DiverSE Inria research team (in Rennes).
The candidate will be employee of SopraSteria and spend part of the time in SopraSteria and in DiverSE.
The work is also part of a Défi Inria LLM4Code "Reliable and productive Code Assistants based on large language models" with more than 10 research teams working on several aspects of LLMs and code. Hence the candidate will have the opportunity to collaborate with numerous researchers and experts, as well as to leverage computational infrastructure and the SoftwareHeritage project.
More details here: https://project.inria.fr/llm4code/
Generative AI, in particular the recent Large Language Models (LLMs), show great promise for software developments. Specialized models are now able to perform an impressive variety of programming tasks: solving programming exercises, assisting software developers, or even generating mechanized proofs. Yet, many challenges still need to be addressed to build reliable and productive LLM-based coding assistants: improving the quality of the generated code, increasing the developers’ confidence in the generated code, enabling interaction with other software development tools (verification, test), and providing new capabilities (automated migration and evolution of software).
The goal of the Défi Inria LLM4Code is to leverage LLM capabilities to build code assistants that can enhance both reliability and productivity. The défi is organized along three work packages: Self-improving code generation, Evolution of existing software (WP2), Interactive tools with AI-in-the-loop.
The specific subject lies in the WP2 "migration and modernization of existing software"
Mission confiée
The candidate will be employee of SopraSteria and spend part of the time in SopraSteria and in DiverSE.
The work is also part of a Défi Inria LLM4Code "Reliable and productive Code Assistants based on large language models" with more than 10 research teams working on several aspects of LLMs and code. Hence the candidate will have the opportunity to collaborate with numerous researchers and experts, as well as to leverage computational infrastructure and the SoftwareHeritage project.
Principales activités
A vast portion of the software used nowadays in the critical sectors of the industry is
written in legacy languages (e.g. Fortran, COBOL, Ada, etc.) that are prone to be outdated.
These languages do not profit from modern software engineering tools, do not adhere to the
latest standards of quality or security, and are famous for blocking developers in their
everyday work. However, there is no standard solution to migrate an existing code base to
newer technologies that would be stable, secure, and affordable from the time/value
perspective.
We propose to leverage LLMs’ capabilities for software migration. While LLMs excel at
translation tasks for natural language, programming language migration is still challeng- ing
[Zhu et al., 2022, Pan et al., 2023, Yan et al., 2023]. Incorporating fine-grained examples into
the training of LLMs is essential to capture the nuances of different programming paradigms
and semantics. These examples provide detailed, context-rich scenarios that help LLMs
understand and adapt to various programming structures and logic. Furthermore, leveraging
compilers or transpilers to generate synthetic data can be effective in creating
a diverse
training dataset. On the other hand, LLMs can enhance existing compilers or migration
tools by broadening their scope to cover more diverse and complex corner cases. This results
in tools that are not only more robust, but also capable of addressing a wider range of
migration scenarios. Besides, LLMs are not solely beneficial for translating programs; they
also play a crucial role in comprehending existing codebases, documenting system
architectures, or synthesizing test cases to validate the migration. Both activities are essential
for software migration, and our strategy includes utilizing LLMs to efficiently address these
specific tasks.
We will first experiment on the open challenge of converting Fortran-77 to C. From
many perspectives, the gap between Fortran-77 (as the most spread version of Fortran)
and C is significant. Furthermore, the lack of a reference dataset matching Fortran-77 to
C code, and the validation of the results generated by the LLM raise multiple challenges.
In addition to the challenging case of converting Fortran-77 to C, we aim to explore the
problem of migrating old codebases written in programming languages such as 4GL or old
Java [Fleurey et al., 2007, Verhaeghe et al., 2019]. Software migration involves resolving
many tasks and related problems: reverse engineering (e.g., understanding the existing
codebase and functionality, documenting the system’s architecture), translating code to
a modern platform or programming language, testing (from unit to user acceptance) to
ensure the new migrated system fits original requirements, etc. For each task, LLMs can
be of interest [Xie et al., 2023, Fan et al., 2023, Hou et al., 2023]
Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo,
and Jie M Zhang. Large language models for software engineering: Survey and open
problems. arXiv preprint arXiv:2310.03533, 2023.
Franck Fleurey, Erwan Breton, Benoit Baudry, Alain Nicolas, and Jean-Marc Jézéquel.
Model-driven engineering for software migration in a large industrial context. In Model
Driven Engineering Languages and Systems: 10th International Conference, MoDELS
2007, Nashville, USA, September 30-October 5, 2007. Proceedings 10, pages 482–497.
Springer, 2007.
Benoıt Verhaeghe, Anne Etien, Nicolas Anquetil, Abderrahmane Seriai, Laurent Deruelle,
Stéphane Ducasse, and Mustapha Derras. Gui migration using mde from gwt to angular 6:
An industrial case. In 2019 IEEE 26th International Conference on Software Analysis,
Evolution and Reengineering (SANER), pages 579–583, 2019. doi: 10.1109/SANER.2019.
8667989.
Ming Zhu, Karthik Suresh, and Chandan K Reddy. Multilingual code snippets training for
program translation. Proceedings of the AAAI Conference on Artificial Intelligence, 36
(10):11783–11790, Jun. 2022. doi: 10.1609/aaai.v36i10.21434. URL https://ojs.aaai.
org/index.php/AAAI/article/view/21434.
Rui Xie, Tianxiang Hu, Wei Ye, and Shikun Zhang. Low-resources project-specific code
summarization. In Proceedings of the 37th IEEE/ACM International Conference on
Automated Software Engineering, ASE ’22, New York, NY, USA, 2023. Association for
Computing Machinery. ISBN 9781450394758. doi: 10.1145/3551349.3556909. URL
https://doi.org/10.1145/3551349.3556909.
Compétences
You need to:
- have (or soon receive) a Masters degree in computer science/engineering, informatics, or related fields
- be ok investing 3+ years as a "research apprentice" (aka PhD student)
The subject requires strong expertise in software engineering, including automated software engineering, program transformations (compilers, interpreters, etc.) and analysis, the mastering of numerous languages (ie being polyglot), the development of languages. The candidate should also be highly knowledgeable in LLMs, from foundations to cutting-edge tools recently developed, and excited by the use of LLMs with software (it does not exclude, of course, to be critics about LLMs and their current limits)
Avantages
- Subsidized meals
- Partial reimbursement of public transport costs
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
Rémunération
Monthly gross salary: 2100€
Informations générales
- Thème/Domaine :
Programmation distribuée et génie logiciel
Ingénierie logicielle (BAP E) - Ville : Rennes
- Centre Inria : Centre Inria de l'Université de Rennes
- Date de prise de fonction souhaitée : 2024-02-01
- Durée de contrat : 3 ans
- Date limite pour postuler : 2025-01-15
Attention: Les candidatures doivent être déposées en ligne sur le site Inria. Le traitement des candidatures adressées par d'autres canaux n'est pas garanti.
Consignes pour postuler
Please submit online : your resume, cover letter and letters of recommendation eventually
Sécurité défense :
Ce poste est susceptible d’être affecté dans une zone à régime restrictif (ZRR), telle que définie dans le décret n°2011-1425 relatif à la protection du potentiel scientifique et technique de la nation (PPST). L’autorisation d’accès à une zone est délivrée par le chef d’établissement, après avis ministériel favorable, tel que défini dans l’arrêté du 03 juillet 2012, relatif à la PPST. Un avis ministériel défavorable pour un poste affecté dans une ZRR aurait pour conséquence l’annulation du recrutement.
Politique de recrutement :
Dans le cadre de sa politique diversité, tous les postes Inria sont accessibles aux personnes en situation de handicap.
Contacts
- Équipe Inria : DIVERSE
-
Directeur de thèse :
Acher Mathieu / Mathieu.Acher@irisa.fr
L'essentiel pour réussir
You need to:
- be really excited about our project
- be persistent (get back up and continue when things don't work out as planned -- true research rarely works out as planned)
- be fearless (e.g., be ok hacking a virtual machine, a compiler, a kernel, or implementing a complex algorithm)
- have a small child's attitude (to want to understand and learn about everything they encounter)
- have an engineer's attitude (not to take the first solution that comes to mind, but to look at the key alternatives)
- have a researcher's attitude (to want to truly understand something, and to not be satisfied with the first best explanation)
- want to look at the simple and obvious before exploring the complicated
- be able to focus (to ignore the many other cool things one could also do)
- derive pleasure from coming up with a logical and clear argument or explanation
- like to read (books, papers, papers, papers)
- like to write (prospectus, proposal, dissertation, and papers)
- like to present (at conferences, or in class)
- like to convince others using sound arguments
- be ok working hard
- under-promise and over-deliver
- be happy staying in Brittany for quite some time
- be ok traveling long distance from time to time (e.g., for conferences)
A propos d'Inria
Inria est l’institut national de recherche dédié aux sciences et technologies du numérique. Il emploie 2600 personnes. Ses 215 équipes-projets agiles, en général communes avec des partenaires académiques, impliquent plus de 3900 scientifiques pour relever les défis du numérique, souvent à l’interface d’autres disciplines. L’institut fait appel à de nombreux talents dans plus d’une quarantaine de métiers différents. 900 personnels d’appui à la recherche et à l’innovation contribuent à faire émerger et grandir des projets scientifiques ou entrepreneuriaux qui impactent le monde. Inria travaille avec de nombreuses entreprises et a accompagné la création de plus de 200 start-up. L'institut s'efforce ainsi de répondre aux enjeux de la transformation numérique de la science, de la société et de l'économie.