On March 17, 2013, the authors of an anonymous email to the "Full Disclosure" mailing list announced that in 2012 they had probed the entire IPv4 Internet. They claimed to have used a botnet (which they named the "carna" botnet) created by infecting machines that were vulnerable because they used default login/password pairs (e.g., admin/admin). The botnet instructed each of these machines to execute a portion of the scan and then transfer the results to a central server. The authors also published a detailed description of how they operated, along with 9 TB of raw logs of the scanning activity.
Online magazines (e.g., ArsTechnica) and newspapers (e.g., Spiegel Online) reported the news, which triggered some debate in the research community about the ethical implications of using such data for research purposes. A more fundamental question received less attention: since the authors went out of their way to remain anonymous, and the only data available about this event is the data they provide, how do we know this scan actually happened? If it did, how do we know that the resulting data is correct?
Since we could not find any third-party validation of this event, we looked for evidence in the traffic captured at the UCSD Network Telescope (a large darknet). From this traffic we selected probing packets consistent with the default nmap host probe (composed of four different types of packets) that the carna botnet used. The visualization below shows the total number of probes we observed at the telescope on each day of 2012 (blue line). While these probes may have been generated by any host on the Internet, the large increase visible between April and September 2012 matches the logs distributed by the authors of the botnet (red line), which is evidence that this scanning activity took place.
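As a rough illustration of the daily binning described above, the following Python sketch counts probe packets per UTC day. It assumes the packets consistent with the nmap host probe have already been selected and that each one is represented only by its Unix timestamp; the function name `daily_probe_counts` is hypothetical and this is not the tooling used for the actual analysis.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable


def daily_probe_counts(timestamps: Iterable[float]) -> Counter:
    """Count probe packets per UTC calendar day (bins of 1 day).

    `timestamps` are Unix timestamps (seconds since the epoch) of packets
    that were already selected as matching the nmap host probe.
    """
    counts: Counter = Counter()
    for ts in timestamps:
        # Convert to a UTC date so that bins do not depend on the local time zone.
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        counts[day] += 1
    return counts
```

The resulting per-day counts could then be plotted next to a per-day count derived from the published carna logs to compare the two time series.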
We also found that the raw logs of the carna botnet erroneously reported that a large number of IP addresses in our darknet were active, and specifically that they were accepting connections on TCP port 80 (darknet IP addresses are inactive by definition and therefore do not accept connections). A preliminary analysis suggests that this measurement error is likely due to the presence of HTTP proxies in some of the networks that hosted scanning bots. The default nmap host probe sends four different packets to try to solicit a response from the target: (i) an ICMP echo request, (ii) an ICMP timestamp request, (iii) a TCP ACK to port 80, and (iv) a TCP SYN to port 443. For darknet addresses that the carna logs report as inactive, we observed all four of these packets, but for the addresses misreported as active, packets of type (iii) did not reach the telescope. We suspect that these packets were intercepted by HTTP proxies whose replies caused the bots to falsely report the target IP address as listening on TCP port 80.
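The per-address check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not our actual analysis code: the `Packet` record, its field names, and the functions `probe_type` and `missing_probe_types` are hypothetical, and matching on exactly one TCP flag is a simplification. The ICMP type numbers (8 for echo request, 13 for timestamp request) and the TCP flag bits are standard values.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional

ICMP, TCP = 1, 6        # IP protocol numbers
SYN, ACK = 0x02, 0x10   # TCP flag bits

ALL_PROBE_TYPES = frozenset(
    {"icmp_echo", "icmp_timestamp", "tcp_ack_80", "tcp_syn_443"}
)


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of one packet seen at the darknet."""
    dst_ip: str
    proto: int
    icmp_type: Optional[int] = None
    dst_port: Optional[int] = None
    tcp_flags: Optional[int] = None


def probe_type(pkt: Packet) -> Optional[str]:
    """Map a packet to one of the four host-probe types, or None."""
    if pkt.proto == ICMP:
        if pkt.icmp_type == 8:
            return "icmp_echo"
        if pkt.icmp_type == 13:
            return "icmp_timestamp"
    elif pkt.proto == TCP:
        # Assumption: a probe carries exactly one flag (ACK or SYN).
        if pkt.dst_port == 80 and pkt.tcp_flags == ACK:
            return "tcp_ack_80"
        if pkt.dst_port == 443 and pkt.tcp_flags == SYN:
            return "tcp_syn_443"
    return None


def missing_probe_types(packets: Iterable[Packet]) -> Dict[str, FrozenSet[str]]:
    """For each probed destination, return the probe types never observed.

    Destinations that received no matching probe at all are not listed.
    """
    seen = defaultdict(set)
    for pkt in packets:
        kind = probe_type(pkt)
        if kind is not None:
            seen[pkt.dst_ip].add(kind)
    return {ip: ALL_PROBE_TYPES - kinds for ip, kinds in seen.items()}
```

Under these assumptions, a darknet address that the carna logs misreport as active would be expected to show `"tcp_ack_80"` as its only missing probe type, while an address reported as inactive would have no missing types.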
Assuming these bots probed the rest of the IPv4 Internet in the same proportion as they probed the darknet we can observe, about 3% of the host probe logs and port scan logs of the carna botnet could be affected by this particular problem. The maps and animations that the botnet authors published seem unaffected by this issue because they were based on ICMP pings and actual (application-layer) responses from the target hosts.
We have only briefly investigated the carna botnet scan, but there are clearly epistemological issues related to any potential scientific use of the data published by the botnet authors. There are even more complex ethical issues related to using this data set, as well as to its original collection. We have previously mentioned efforts to provide ethical guidance to Internet researchers; the debate continues and this data set will likely become an interesting part of it.