Alibaba Cloud has shared more information on a technology it uses to enhance fault prediction and detection for its servers, claiming a 10% improvement compared with existing models.
The Chinese company’s latest tool, Time-Aware Attention-Based Transformer (TAAT), addresses a limitation of existing machine learning tools, which overlook the importance of log timestamps.
Detailed in a new research paper co-written by Alibaba Cloud workers and a researcher from Huazhong University of Science and Technology in Wuhan, TAAT uses timestamps to make failure predictions more accurate.
Alibaba Cloud boosts server failure predictions by 10%
The paper’s authors highlight growing concern over server reliability and stability, which affect the availability of virtual machines, in light of the “wide-spread applications of cloud computing.”
Noting that past failures can help predict future ones, the company has opted to use log timestamps to improve accuracy.
TAAT integrates semantic and temporal data by using the Google-developed Bidirectional Encoder Representations from Transformers (BERT) language model, which Alibaba says is good for analyzing log data. An enhancement to BERT’s capabilities adds a time-aware attention mechanism.
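To give a rough sense of what "time-aware attention" can mean, here is a minimal Python sketch. It is not the formulation from the paper, which this article does not reproduce. It assumes one simple approach: standard scaled dot-product attention with an added penalty that grows with the time gap between log entries. The function name `time_aware_attention`, the `decay` parameter, and the sample vectors and timestamps are all hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_aware_attention(query, keys, values, timestamps, query_time, decay=0.001):
    """Scaled dot-product attention with a time-gap penalty.

    Assumption for illustration only: each score is the usual semantic
    similarity minus `decay` times the gap in seconds between the log
    entry and the moment being scored.
    """
    dim = len(query)
    scores = []
    for key, ts in zip(keys, timestamps):
        semantic = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        temporal = -decay * abs(query_time - ts)
        scores.append(semantic + temporal)
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, context


# Two log entries with identical (made-up) embeddings, logged an hour
# ago and 100 seconds ago respectively.
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
timestamps = [0.0, 3500.0]
weights, context = time_aware_attention(
    [1.0, 0.0], keys, values, timestamps, query_time=3600.0
)
```

Because the two entries are semantically identical, a purely content-based model would weight them equally. In this sketch the more recent entry receives the larger weight, which is the kind of signal a timestamp-blind model cannot use.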
Alibaba Cloud is now using TAAT in daily operations to improve predictions. The company has also released the real-world cloud computing failure prediction dataset used in its study to support further work by the community. The dataset contains approximately 2.7 billion logs from around 300,000 servers, collected over a four-month period, and is believed to be the largest dataset of its kind.
With TAAT, Alibaba hopes to deliver more reliable cloud infrastructure as more workloads move to the cloud, although the tool is not yet available for public download.