Change-Aware Round-Trip Benchmarking of LLMs

Publication Type:

Journal article

Venue:

Journal of Systems and Software


Abstract

Large language models are increasingly embedded in software development, yet most evaluations still treat them as one-shot generators for isolated tasks such as code completion or refactoring. In real workflows, however, artifacts such as application programming interfaces, data models, and database schemas co-evolve, and changes must propagate across representations without breaking consistency. When propagation fails, developers incur extra validation, retries, and manual repair, which increases latency and infrastructure cost and undermines sustainable operation. In this study, we ask whether large language models can preserve cross-artifact consistency under change in a round-trip workflow. We apply a controlled edit to one artifact, translate it to its coupled counterpart, and translate it back, then check whether the intended edit persists without drift (i.e., unintended semantic changes or syntactic invalidity). We instantiate this question by synchronizing class-oriented data models with relational database schemas. Using a curated dataset of paired models and schemas and a suite of controlled edit operations, we evaluate four large language models (GPT-5, Qwen3-Next-80B-A3B, DeepSeek V3, and Gemini 2.5) under a unified, reproducible protocol that measures (i) edit persistence, (ii) structural validity (parsability/loadability), and (iii) run-to-run consistency over repeated executions. Our results show that the models handle small, routine edits reliably, but they struggle when edits require structural reorganization. Gemini 2.5 is the most consistent across runs; DeepSeek V3 often preserves the intended semantics but occasionally produces unparsable outputs; Qwen3-Next-80B-A3B exhibits high variance; and GPT-5 often recognizes the change but fails to propagate it coherently through the coupled representation.
We contribute a reproducible benchmark and evaluation framework for assessing LLM reliability under artifact co-evolution, together with empirical evidence of current limitations. Overall, the findings reveal a gap between detecting a change and propagating it coherently, underscoring the need for structural validation and human oversight to achieve dependable and cost-efficient LLM-assisted software evolution.
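
The round-trip check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' benchmark code: the dictionary-based model format, the helper names (`add_attribute`, `model_to_schema`, `schema_to_model`, `check_round_trip`, `run_consistency`), and the deterministic stand-in translators are all assumptions made for illustration. In the study, the two translation steps are performed by a large language model.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical toy representation: class name -> {attribute name: type}.
Model = Dict[str, Dict[str, str]]


def add_attribute(model: Model, cls: str, attr: str, typ: str) -> Model:
    """Controlled edit: return a copy of the model with one attribute added."""
    edited = {name: dict(attrs) for name, attrs in model.items()}
    edited[cls][attr] = typ
    return edited


def model_to_schema(model: Model) -> str:
    """Stand-in for the forward translation (an LLM call in the study)."""
    stmts = []
    for cls, attrs in model.items():
        cols = ", ".join(f"{a} {t}" for a, t in attrs.items())
        stmts.append(f"CREATE TABLE {cls} ({cols});")
    return "\n".join(stmts)


def schema_to_model(schema: str) -> Model:
    """Stand-in for the backward translation; raises ValueError if unparsable."""
    model: Model = {}
    for stmt in filter(None, (s.strip() for s in schema.split(";"))):
        if not stmt.startswith("CREATE TABLE ") or not stmt.endswith(")"):
            raise ValueError(f"unparsable statement: {stmt!r}")
        head, body = stmt[len("CREATE TABLE "):-1].split(" (", 1)
        model[head] = dict(col.split(" ", 1) for col in body.split(", "))
    return model


def check_round_trip(
    edited: Model,
    edit_holds: Callable[[Model], bool],
    forward: Callable[[Model], str] = model_to_schema,
    backward: Callable[[str], Model] = schema_to_model,
) -> dict:
    """Translate forward and back, then report validity, persistence, drift."""
    try:
        recovered = backward(forward(edited))
    except ValueError:
        # Structural validity failed: the output could not be parsed.
        return {"valid": False, "persisted": False, "drift": None}
    return {
        "valid": True,
        "persisted": edit_holds(recovered),  # intended edit still present
        "drift": recovered != edited,        # anything else changed
    }


def run_consistency(outputs: List[str]) -> float:
    """Share of repeated runs that agree with the most common output."""
    return Counter(outputs).most_common(1)[0][1] / len(outputs)


original = {"Customer": {"id": "INTEGER", "name": "TEXT"}}
edited = add_attribute(original, "Customer", "email", "TEXT")


def has_email(m: Model) -> bool:
    return m.get("Customer", {}).get("email") == "TEXT"


result = check_round_trip(edited, has_email)
```

With the lossless stand-in translators, `result` reports a valid, persistent, drift-free round trip. Swapping `forward` or `backward` for a function that wraps a model call would expose the failure modes the abstract describes: an unparsable schema yields `valid: False`, and a dropped column yields `persisted: False`.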

Bibtex

@article{Dao7418,
author = {Duy Dao and Alessio Bucaioni and Antonio Cicchetti},
title = {Change-Aware Round-Trip Benchmarking of LLMs},
pages = {1--21},
month = {August},
year = {2026},
journal = {Journal of Systems and Software},
url = {http://www.es.mdu.se/publications/7418-}
}