Application research of chain-of-thought reasoning-based DeepSeek in liver space-occupying lesions

FANG Guo-xu, DING Zong-ren, FAN Jian-hui, GUO Peng-fei, HUANG Xin, ZHOU Yang, LIN Zhao-wang, LIU Jian-xin, ZHANG Li-na, CHEN Li-hong, WENG Xia-di, ZHANG Li-yan, HUANG Shao-yan, HUANG Rui, LUO Shun-feng, LI Hai-tao, ZENG Yong-yi

Chinese Journal of Practical Surgery ›› 2026, Vol. 46 ›› Issue (1): 108-116. DOI: 10.19538/j.cjps.issn1005-2208.2026.01.16

Abstract

Objective To investigate the potential value of DeepSeek enhanced by chain-of-thought (CoT) reasoning in the differential diagnosis of liver space-occupying lesions and in liver cancer staging, offering insights into the clinical application of large language models in medicine. Methods A retrospective analysis was conducted on the clinical records of 400 patients with liver space-occupying lesions admitted to Mengchao Hepatobiliary Hospital of Fujian Medical University from January 2022 to June 2024: 169 cases of hepatocellular carcinoma, 37 of intrahepatic cholangiocarcinoma, 90 of liver cysts, 64 of hepatic hemangioma, and 40 of focal nodular hyperplasia (FNH). The performance of DeepSeek combined with CoT reasoning was evaluated on four core tasks — differential diagnosis of liver space-occupying lesions, structured CT reporting, Child-Pugh classification of liver function, and Barcelona Clinic Liver Cancer (BCLC) staging — using precision, recall, and F1 score as metrics. Results With CoT reasoning integrated, DeepSeek improved to varying degrees on all tasks. Classification-based CoT reasoning significantly enhanced imaging-report diagnosis, most notably in the differential diagnosis of FNH, where the F1 score rose from 0.182 to 0.897. Structured-CT-based CoT reasoning significantly improved the structured processing of imaging reports: the F1 score for tumor-number features increased from 0.798 to 0.932, for vascular-invasion features from 0.795 to 0.982, and for extrahepatic-metastasis features from 0.779 to 0.959. In the Child-Pugh liver-function classification task, introducing scoring-based and classification-based CoT reasoning raised the F1 score from 0.936 to 1.000. For BCLC staging, staging-based CoT reasoning increased the F1 score from 0.622 to 0.858. Conclusion The deep integration of CoT reasoning with DeepSeek not only significantly improved performance on core tasks such as differential diagnosis of liver space-occupying lesions, structured CT report processing, Child-Pugh classification of liver function, and BCLC staging, but also made the supporting data and reasoning logic behind each decision visible. This approach mitigates the hallucination problem of large models, substantially improves their interpretability, and increases the trust and acceptance of medical professionals in the DeepSeek-assisted decision-making system, providing a useful reference for the application of intelligent clinical decision support systems in diagnosis and treatment.
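The per-class F1 scores reported above (e.g., 0.182 → 0.897 for FNH) follow the standard definition: the harmonic mean of precision and recall for one diagnostic label treated one-vs-rest. A minimal sketch of that computation (the function name and example labels are illustrative, not from the study's code):

```python
def per_class_f1(y_true, y_pred, label):
    """One-vs-rest F1 for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: two FNH cases, one misdiagnosed as HCC
truth = ["FNH", "HCC", "FNH", "cyst"]
preds = ["FNH", "HCC", "HCC", "cyst"]
f1_fnh = per_class_f1(truth, preds, "FNH")  # precision 1.0, recall 0.5 → F1 ≈ 0.667
```

In practice such per-class scores would typically be computed with a library such as scikit-learn's `f1_score(average=None)`; the hand-rolled version here just makes the arithmetic behind the reported numbers explicit.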

Key words

DeepSeek / large language model / chain-of-thought reasoning / explainable artificial intelligence / hepatocellular carcinoma



Funding

Fujian Provincial Natural Science Foundation(2025J01268)
Fujian Provincial Natural Science Foundation(2024J011236)
Fujian Provincial Natural Science Foundation(2021J011283)
Fuzhou Municipal Science and Technology Program(2024-S-071)