
Using Large Language Models for Generating and Securing Code
There is a great deal of hype about, and high expectations for, how Large Language Models (LLMs) will impact software development. This session focuses specifically on how LLMs will affect the development of secure software in specialized and high-assurance systems such as industrial systems/operational technology.
The discussion *briefly* gives examples of some of the hype to document the bar of expectations set outside the development community. It then reviews some of the more frequently cited academic studies on using LLMs to develop software or to correct vulnerabilities in software. Gaps in those studies (mainly their reliance on few, artificial, or otherwise non-representative examples) are identified as the motivation for the author's research. The methods used in the research are then reviewed, specifically our corpus of secure and insecure code examples drawn from code evaluation engagements.
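To make the kind of test material concrete, the following is a minimal, hypothetical illustration (not drawn from the author's corpus) of the sort of insecure/secure C pair such evaluations can present to an LLM, here a classic stack-based buffer overflow and its bounded counterpart:

```c
#include <stdio.h>
#include <string.h>

/* Insecure variant: copies caller-controlled input into a fixed-size
 * stack buffer with no bounds check (classic stack-based buffer
 * overflow, CWE-121). */
void greet_insecure(const char *name) {
    char buf[16];
    strcpy(buf, name);          /* overflows buf if name has 16+ chars */
    printf("Hello, %s\n", buf);
}

/* Secure variant: bounds the copy and guarantees null termination. */
void greet_secure(const char *name) {
    char buf[16];
    snprintf(buf, sizeof buf, "%s", name);  /* truncates safely */
    printf("Hello, %s\n", buf);
}

int main(void) {
    greet_secure("a deliberately over-long input string");
    return 0;
}
```

An evaluation of this kind might ask the model to flag the flaw in the insecure variant, or to rewrite it and compare the result against the secure one; the actual engagement-derived examples in the study are, per the abstract, more representative than a textbook case like this.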
The results are presented, illustrating LLM behaviors on this task, along with a summary analysis across the entire study: overall capability and the specific areas where performance is good or poor. The study has been repeated several times, and the changes in results as LLMs have matured and evolved are presented (mainly, performance is improving but remains well short of the expectations noted above, and, surprisingly, larger models with general training do better than smaller models with domain-specific training).
Based on the data, limitations on the use of LLMs are discussed, along with several approaches being explored by the field and by the author to manage those limitations. The result is guidance on how to use LLMs today to improve software security, plus some predictions for the future. Suggestions are also offered on what to look for when evaluating claims made in the popular trade press.
Our studies focused on C and C++ using ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, and GitHub Copilot. This is an introductory-level session geared toward developers, contract requirements professionals, CTOs/chief engineers, and those who manage them. A very brief background on how generative AI systems work is covered to help explain some of the example behaviors; no deep understanding of generative AI systems or of programming language details is expected or needed.