Is GenAI for risk assessments reliable?

Organisations across Australia are looking at generative AI (GenAI) as a way to streamline governance, risk and compliance (GRC) functions. The technology is attractive because large language models can generate human‑like responses and summarise large volumes of data rapidly. But can GenAI be trusted to identify, analyse and prioritise risks on its own?

This post summarises findings from a recent research paper by Flame Tree and Queensland University of Technology (QUT) researchers Dr Gowri Ramachandran and Atticus D’mello, which investigated this question using ChatGPT and related models.

Why use GenAI for risk assessments

In the context of cybersecurity GRC, GenAI offers potential efficiencies. It can analyse large text datasets, draft policies or reports and identify emerging risk themes. Used responsibly, it may automate manual processes such as updating risk registers, summarising incident reports and drafting compliance documentation. These capabilities make GenAI an attractive option for security teams looking to increase their capacity.

Surface‑level strengths: where GenAI shows promise

The research observed that GenAI can produce context‑aware outputs that are useful in preliminary or surface‑level analysis. Models like ChatGPT are good at synthesising broad patterns in text and providing coherent summaries when given appropriate prompts. When integrated with retrieval‑augmented generation (RAG) pipelines or domain‑specific data, the models can offer more precise and relevant contextual responses. Early experiments highlighted several benefits:

Drafting and summarisation: GenAI can quickly summarise incident reports and draft risk assessment documents, accelerating reporting cycles.
Pattern recognition: By analysing large datasets, the models can highlight emerging themes and areas that warrant further investigation.
Assistance for non‑experts: In organisations without deep cybersecurity expertise, GenAI can provide baseline explanations of common vulnerabilities and threats, acting as an educational tool.

While these strengths make GenAI a useful assistant, they should not be mistaken for the ability to produce reliable, self‑contained risk assessments.

Where GenAI falls short: key limitations identified

The research team used the OpenAI GPT API to test GenAI’s performance on real‑world risk scenarios. Over 180 common vulnerabilities and exposures (CVEs) from 2020–2025 were used for factual recall and misclassification tests, alongside five system architecture diagrams and three bias scenarios. Each architectural image was assessed 50 times to evaluate consistency, and a tailored LLM benchmarked responses against known risks. We found five significant limitations:

Hallucination. When asked to explain CVE risks from 2020 to 2023, GPT misdescribed them over 95 per cent of the time. This tendency to fabricate information shows that a language model may confidently present false details if it lacks sufficient grounding.
Outdated data. For CVEs from 2024 and 2025, GPT failed 98 per cent of the time because its knowledge was last updated in October 2023. The result highlights a critical weakness for dynamic security contexts – without access to current information, GenAI produces inaccurate assessments.
Inconsistency. Repeated evaluations of identical network diagrams led to dramatic variation. The average variance in identified risks was almost 90 per cent, emphasising the non‑deterministic nature of the model.
Misclassification of diagrams. In neutral scenarios requiring accurate identification of secure, insecure or non‑existent network elements, GPT misclassified elements 34 per cent of the time. The model tended to overestimate security, which could mislead practitioners into under‑prioritising serious issues.
Prompt bias. Introducing a positive bias into prompts reduced GPT’s correct classification rate from 60 per cent to just 35 per cent, a drop of 25 percentage points. Even slight shifts in wording caused the model to produce overly optimistic assessments, illustrating how sensitive its output is to the way a prompt is phrased.

These findings indicate that although you can use GenAI for risk assessments, it struggles with accuracy, consistency and objectivity when used independently. High‑stakes decisions demand dependable outputs and traceability, qualities that current LLMs cannot guarantee.
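To make the inconsistency finding concrete, here is a minimal sketch of one way to score how much repeated runs disagree. It is not the metric used in the study, which this post does not define. It assumes each run has already been reduced to a set of normalised risk labels, and it uses the mean pairwise Jaccard distance as a stand-in. The function names and the sample labels are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_disagreement(runs: list[set[str]]) -> float:
    """Average Jaccard distance over every pair of runs.

    0.0 means every run named the same risks; 1.0 means no two runs overlapped.
    This is an illustrative measure, not the one used in the research paper.
    """
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical labels from three runs against the same diagram.
runs = [
    {"open ssh port", "unencrypted database", "flat network"},
    {"open ssh port", "no mfa"},
    {"unpatched web server", "flat network"},
]

print(f"Mean pairwise disagreement: {mean_pairwise_disagreement(runs):.2f}")
```

A score like this only works if labels are normalised first, because two runs that describe the same risk in different words would otherwise count as disagreeing.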

How the study evaluated performance

The methodology provides context for understanding the results. Testing was done using the OpenAI GPT API to simulate integration into GRC software. GPT‑4 handled general queries while GPT‑4‑Turbo processed image‑based reasoning, and a local model (phi‑3 via Ollama) was used for comparison. Prompts were submitted programmatically using Python so that responses were not enhanced by web search.

Data sources included over 180 CVEs spanning five years, five architectural diagrams of varying complexity and three bias examples. For CVE tests, 30 random samples from each year were selected. Each architecture was tested 50 times to measure variance, and bias scenarios were evaluated using three different prompts per image. A custom benchmarking model classified the sentiment of prompts, flagged positive bias and extracted key indicators. Finally, results were manually reviewed to ensure evaluation metrics aligned with expectations.
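The per-year sampling step described above could be sketched as follows. This is an illustration under stated assumptions, not the study's code: it assumes one bucket per calendar year from 2020 to 2025 and 30 samples per bucket, which gives 180 items, whereas the paper describes the set as "over 180", so the real selection may have differed. The function name, the fixed seed and the placeholder identifiers are all hypothetical.

```python
import random


def sample_per_year(
    cves_by_year: dict[int, list[str]], per_year: int, seed: int = 0
) -> dict[int, list[str]]:
    """Draw `per_year` identifiers without replacement from each year's pool."""
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    sample: dict[int, list[str]] = {}
    for year, ids in sorted(cves_by_year.items()):
        if len(ids) < per_year:
            raise ValueError(f"{year} has only {len(ids)} entries")
        sample[year] = rng.sample(ids, per_year)
    return sample


# Placeholder identifiers only; these are not real CVE records.
catalogue = {
    year: [f"EXAMPLE-{year}-{n:04d}" for n in range(1, 201)]
    for year in range(2020, 2026)
}

selected = sample_per_year(catalogue, per_year=30)
total = sum(len(ids) for ids in selected.values())
print(f"{len(selected)} yearly buckets, {total} items in total")
```

Fixing the seed matters for a study like this one, because it lets a reviewer rerun the same prompts against the same items and compare results across model versions.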

How to responsibly use GenAI for risk assessment

Given the limitations, the research proposes several mitigations:

Retrieval‑augmented generation (RAG). Integrating trusted internal or external data sources with a language model helps provide current and contextual answers. Access to up‑to‑date threat intelligence is essential to avoid outdated assessments.
Multi‑agent analysis. Deploying multiple agents and averaging their outputs can reduce inconsistency and help identify outlier responses. Ensemble approaches may provide a more stable foundation for decision making.
Multi‑LLM systems with personalised agents. Building systems that use different models for specialised roles can mimic real‑world GRC assessments and mitigate hallucination.
Structured problem decomposition. Breaking down risk assessment tasks into smaller, verifiable steps can improve transparency and allow human experts to review intermediate outputs.
Human oversight and guardrails. Clear system prompts, access controls and audit mechanisms are essential to ensure outputs are explainable, defensible and auditable. GenAI should assist human experts rather than replace them.
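As a rough illustration of the multi‑agent idea combined with human oversight, the sketch below merges the findings from several agents or repeated runs. Because findings are categorical, it counts votes rather than averaging: anything reported by at least a chosen share of agents is kept, and everything else is routed to a human reviewer instead of being discarded. The function name, the 0.5 threshold and the sample labels are assumptions for illustration, not part of the research.

```python
from collections import Counter


def consensus_findings(
    agent_outputs: list[set[str]], min_share: float = 0.5
) -> tuple[list[str], list[str]]:
    """Split findings into (agreed, needs_review) by share of agents reporting them."""
    if not agent_outputs:
        raise ValueError("at least one agent output is required")
    if not 0.0 < min_share <= 1.0:
        raise ValueError("min_share must be in (0, 1]")

    # Each agent contributes at most one vote per finding because outputs are sets.
    counts = Counter(f for output in agent_outputs for f in output)
    n = len(agent_outputs)
    agreed = sorted(f for f, c in counts.items() if c / n >= min_share)
    needs_review = sorted(f for f, c in counts.items() if c / n < min_share)
    return agreed, needs_review


# Hypothetical outputs from three agents assessing the same architecture.
outputs = [
    {"open ssh port", "flat network"},
    {"open ssh port", "no mfa"},
    {"open ssh port", "flat network", "unpatched web server"},
]

agreed, needs_review = consensus_findings(outputs)
print("Agreed:", agreed)
print("Needs human review:", needs_review)
```

Keeping the minority findings in a review queue is a deliberate choice here: a risk that only one agent spots may be a hallucination, but it may also be the one issue the others missed.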

We plan to undertake further testing, including a multi‑agent approach that draws on a risk library based on NIST 800‑30 or ISO 27001 controls, different input sources such as Cloud Security Alliance questionnaires, and prompt optimisation for better accuracy.

Final thoughts: proceed with caution

GenAI has the potential to streamline surface‑level aspects of risk management, but it currently lacks the reliability required for cybersecurity decisions. Models can hallucinate, misclassify and return outdated information despite providing coherent, confident responses. Effective use of GenAI therefore requires thoughtful system design, regular validation and a clear understanding of where AI ends and human judgement begins.

For now, the most prudent approach is to treat GenAI as an assistant that augments existing processes, accelerating analysis and surfacing potential issues. Ultimate responsibility for risk decisions should remain with experienced professionals who can interpret AI outputs within the broader organisational context.

As our research progresses, the cybersecurity community can expect clearer guidance on how to leverage GenAI safely. In the meantime, organisations should proceed with caution.