Optimizing Large Language Model Deployment with Scalable Inference and Ensemble Techniques
Gurupriya Adurthy
Gurupriya Adurthy, Student, Department of Artificial Intelligence & Machine Learning, Institute of Aeronautical Engineering (IARE), Hyderabad (Telangana), India.
Manuscript received on 26 August 2025 | First Revised Manuscript received on 03 September 2025 | Second Revised Manuscript received on 17 November 2025 | Manuscript Accepted on 15 December 2025 | Manuscript published on 30 December 2025 | PP: 9-14 | Volume-15 Issue-2, December 2025 | Retrieval Number: 100.1/ijeat.A469215011025 | DOI: 10.35940/ijeat.A4692.15021225
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The rapid expansion of complex system logs in modern infrastructures has heightened the need for accurate, interpretable, and low-latency risk analysis. These logs contain high-dimensional, context-rich data essential for operational reliability, cybersecurity, and compliance. While conventional machine learning models are efficient, they often overlook the nuanced semantic relationships in sequential log data, limiting predictive reliability. Conversely, large language models (LLMs) offer deeper contextual understanding but are computationally intensive, making them unsuitable for real-time, large-scale deployment. This study presents a deployment-optimised pipeline that balances semantic depth with computational efficiency for log-based risk prediction. The architecture integrates lightweight MiniLM embeddings with an XGBoost classifier to produce interpretable, high-quality predictions at reduced computational cost. Key optimisations include class balancing to address dataset skew, model quantisation to lower memory usage, and batched inference to increase throughput, enabling cost-effective CPU-only execution. A structured evaluation examined accuracy, latency, and memory trade-offs across production scenarios. Testing on representative log datasets showed notable gains over a TF-IDF baseline: classification accuracy improved from 21.4% to 57.1%, weighted F1-scores rose accordingly, and inference latency decreased with negligible loss in predictive performance. By combining transformer-based dense embeddings with gradient-boosted decision trees, this approach delivers a practical balance of semantic expressiveness, interpretability, and deployment efficiency. The framework supports scalable, real-time risk prediction for cybersecurity monitoring, compliance auditing, and IT operations, bridging the gap between advanced language modelling and real-world infrastructure constraints.
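The abstract describes the pipeline only at a high level. The sketch below is a minimal illustration, not the authors' exact implementation: it shows one plausible way to combine batched MiniLM sentence embeddings with a class-balanced XGBoost classifier for CPU-only inference. The checkpoint name (all-MiniLM-L6-v2), the toy log lines, and all hyperparameters are assumptions introduced for illustration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# Lightweight MiniLM encoder (384-dimensional embeddings), forced onto CPU.
encoder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

def embed_logs(log_lines, batch_size=64):
    """Batched inference: encode raw log lines into dense embeddings."""
    return encoder.encode(log_lines, batch_size=batch_size,
                          convert_to_numpy=True, show_progress_bar=False)

# Hypothetical toy data; a real deployment would stream logs from a collector.
logs = [
    "Failed password for admin from 203.0.113.7",   # high risk
    "Disk usage at 95% on /var/log",                # medium risk
    "Service nginx restarted successfully",         # low risk
] * 20
labels = [2, 1, 0] * 20

X = embed_logs(logs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)

# Class balancing: weight samples inversely to class frequency to counter
# the skewed label distributions typical of production logs.
weights = compute_sample_weight("balanced", y_train)

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                    objective="multi:softprob", tree_method="hist")
clf.fit(X_train, y_train, sample_weight=weights)
print("held-out accuracy:", clf.score(X_test, y_test))
```

The remaining optimisation named in the abstract, model quantisation, could be approximated by applying PyTorch dynamic int8 quantisation to the encoder's linear layers or exporting it to ONNX; either reduces memory footprint with little accuracy loss, in line with the CPU-only deployment goal stated above.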
Keywords: Large Language Models, MiniLM Embeddings, XGBoost, Log Risk Prediction, Scalable Inference, Deployment Optimisation
Scope of the Article: Artificial Intelligence & Methods
