Cybersecurity firms Sophos and ReversingLabs on Monday jointly released the first-ever production-scale malware research dataset to be made publicly available, with the aim of helping build effective defenses and drive industry-wide improvements in security detection and response.
“SoReL-20M” (short for Sophos-ReversingLabs – 20 Million), as it’s called, is a dataset containing metadata, labels, and features for 20 million Windows Portable Executable (PE) files, including 10 million disarmed malware samples. Its goal is to support the development of machine-learning approaches for better malware detection.
“Open knowledge and understanding about cyber threats also leads to more predictive cybersecurity,” the Sophos AI group said. “Defenders will be able to anticipate what attackers are doing and be better prepared for their next move.”
Accompanying the release is a set of PyTorch- and LightGBM-based machine learning models pre-trained on this data as baselines.
Unlike other fields such as natural language and image processing, which have benefited from vast publicly available datasets such as MNIST, ImageNet, CIFAR-10, IMDB Reviews, Sentiment140, and WordNet, getting hold of…