LLM Code generation notes
Notes from exploring LLMs and code generation.
March 19, 2024
last update: 2024-11-19
This is going to be a binary-analysis-with-LLMs type of thing. The idea is to see if LLMs are able to understand the “language” of binaries and retrieve the underlying vulnerability.
High Level Topics
- Decompilation
- Code creation for tasks.
Why does ChatGPT do better in Julia than in Python?
- Julia code and its stdlib are more consistent than Python's.
- GPT trips up in the same ways new learners of Python trip up.
- There is far more Python training data, written by people at every experience level, so it is inconsistent. Julia is mostly written by highly specialized people, so generated Julia code sits at a more uniform level of quality.
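As an illustration of the "new learner" failure mode: mutable default arguments are a classic Python pitfall of the kind the notes allude to (which specific pitfall the notes mean is my assumption):

```python
# Classic Python pitfall: a mutable default argument is created once,
# at function-definition time, and shared across all calls.
def append_buggy(item, items=[]):  # the [] persists between calls
    items.append(item)
    return items

# Idiomatic fix: use None as the sentinel and allocate per call.
def append_fixed(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items

print(append_buggy(1))  # [1]
print(append_buggy(2))  # [1, 2]  <- surprising to new learners
print(append_fixed(1))  # [1]
print(append_fixed(2))  # [2]
```

An LLM trained on a mix of beginner and expert Python sees both patterns, which is one way inconsistent training data can surface in generated code.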
Codegen tools
- StarCoder
- LLM finetuned
- StableLM
- By Stability AI: https://github.com/Stability-AI/StableLM
- Refact Code LLM
Security in generated code
Decompilation projects
- https://github.com/kukas/deepcompyle
- https://github.com/albertan017/LLM4Decompile
- Harnessing the Power of LLMs to Support Binary Taint Analysis
- Straightforward prompting over the binary; does better than traditional approaches, which require deep understanding of very complex taint-analysis tools.
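A minimal sketch of what "straightforward prompting" over a binary could look like: format a disassembly listing into a taint question for an LLM. The prompt template, helper name, and source/sink lists are my assumptions for illustration, not the paper's actual method:

```python
# Sketch: turn disassembled instructions into a taint-analysis prompt.
# The source/sink lists and template below are illustrative assumptions.
TAINT_SOURCES = ["recv", "read", "fgets", "getenv"]
TAINT_SINKS = ["strcpy", "system", "sprintf", "memcpy"]

def build_taint_prompt(disassembly: list) -> str:
    """Assemble an LLM prompt asking whether tainted input reaches a sink."""
    listing = "\n".join(disassembly)
    return (
        "You are a binary taint-analysis assistant.\n"
        f"Taint sources: {', '.join(TAINT_SOURCES)}\n"
        f"Taint sinks: {', '.join(TAINT_SINKS)}\n"
        "Does data from a source reach a sink in this listing? Explain.\n\n"
        f"{listing}"
    )

# Toy listing: attacker-controlled recv() output flows into strcpy().
demo = ["call recv", "mov rsi, rax", "call strcpy"]
prompt = build_taint_prompt(demo)
print(prompt)
```

The appeal is that the LLM carries the flow reasoning, so the harness stays simple compared to configuring a traditional taint engine.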
Datasets
- Juliet dataset: NIST test suite of C/C++ and Java programs with seeded CWE vulnerabilities.
- HumanEval: an OpenAI dataset of programming prompts paired with unit tests, scored by executing the generated code.
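HumanEval-style scoring works by executing a model's completion against the task's unit tests. A minimal sketch of that execution loop (the toy task and completion are made up for illustration, not from the real dataset, and a real harness would sandbox the `exec` with timeouts):

```python
# Sketch of HumanEval-style evaluation: concatenate prompt + completion
# + test code, execute it, and count a clean run as a pass.
def run_candidate(prompt: str, completion: str, test: str) -> bool:
    program = prompt + completion + "\n" + test
    try:
        exec(program, {})  # real harnesses sandbox this and add timeouts
        return True
    except Exception:
        return False

# Toy task in the HumanEval shape: function stub, body, check() tests.
prompt = "def add(a, b):\n"
completion = "    return a + b\n"
test = "def check():\n    assert add(2, 3) == 5\ncheck()\n"

print(run_candidate(prompt, completion, test))  # True
```

Running many sampled completions through this loop is what produces the usual pass@k numbers.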