OARC Report on DITL Data gathering Jan 9-10th 2007 - V1.1
Participants
- c.root-servers.net, operated by Paul Vixie on behalf of Cogent.
- e.root-servers.net, operated by NASA Ames.
- f.root-servers.net, operated by ISC.
- k.root-servers.net, operated by RIPE NCC.
- m.root-servers.net, operated by WIDE.
- as112.namex.it, operated by NaMeX.
- b.orsn-servers.net, operated by FunkFeur.
- m.orsn-servers.net, operated by Brave GmbH.
Operational History
- 10th Jan
It became clear towards the end of the data collection that the storage array (sa1) connected to the in1 collection server would fill up before all data was submitted. sshd on in1 was therefore shut down, and the new storage volume was mounted via NFS. Synchronization of data between the old and new storage array volumes was then attempted; incoming data submission via sshd had to remain shut down while this proceeded. This impacted DSC submission as well, as it was difficult to separate the DITL/PCAP and DSC upload services.
 A decision was taken to migrate uploads to the in2 server, which had more up-to-date and stable OS software.
- 11th/12th Jan
Some spare space found, uploads were briefly re-enabled. 
 Data synchronization failed, needed to be restarted.
 Downtime for oarc.isc.org website while migrating.
- 12th/13th Jan
in1 server crashed with data sync incomplete, caused NFS mounts to an1 server to hang. 
 Disabled uploads and re-started data sync again.
- 13th Jan
Re-enabled PCAP uploads to in2 server, sync continuing. 
 DSC uploads still disabled at this point.
 Loss of public.oarci.net website from in1.
- 14th Jan
Fixed NFS problems to an1 analysis server, re-enabled DSC uploads. 
 public.oarci.net drupal website restored on in1.
- 14th Jan through 19th Jan
Various crashes of in2 server. DSC cron jobs suspected as possible cause - temporarily disabled these and started migrating DSC to use postgres for data storage. 
- 16th Jan
JFS problem on new fd1 file server, blocked NFS mounts to an1. 
- 19th Jan
Crash of fd1 file server, caused in2 to hang. Power cycled fd1. 
- 21st Jan
Further crash of fd1 file server, blocked NFS mounts to an1 and in2. Installed latest standard non-SMP 2.6.16.27-0.6-default kernel on fd1, re-mounted filesystem on an1. 
- 23rd Jan
Postgres problems on in2 - needed reboot. 
- 25th Jan
Further in2 crash. Restored, then shut down in2 and fd1 to upgrade BIOS firmware. Loss of an1 NFS mount again, fixed on 26th. 
- 28th Jan
in2 server OS software upgraded to FreeBSD 6.2. 
- 2nd Feb
Brief downtime of sa1 RAID controller for performance tuning. 
- 6th Feb
DDoS attack against various root servers - requirement for more space to store PCAP attack data as well as DITL data. 
 DSC service re-started after migration from in1 to in2 server.
- 20th Feb
Further loss of an1 NFS mount. 
- 28th Feb
Crash of in2 due to lock violation. 
- 2nd March
fd1 server and NFS access to it down for 70 minutes for successful scheduled upgrade of storage array. 
- 3rd, 4th, 7th through  March
Various crashes of in2 server. 
Resolved Issues
- Upgraded firmware on OARC inter-server Ethernet switches to ensure support for flow-control and jumbo frames.
- Various submitters experienced problems with stateful firewalls at the outbound edge of their organization, causing file uploads to be truncated, corrupted or subject to maximum size limitations. It is currently unknown exactly what these issues or their (apparently mostly successful) work-arounds were.
- There was at least one JFS-related crash of the new Linux-based storage array server, fd1. To resolve this, the kernel was upgraded to a newer, non-SMP 2.6.16 version. This reduced the frequency of crashes but has not eliminated them. The underlying Fibre Channel/JFS architecture of this new storage server otherwise appears to be performing soundly.
- Non-alpha characters in filenames were initially not accepted by the upload scripts; this was fixed.
- Upload accounts need to be correctly set up, including creation of a "pcap" sub-directory and permissions (750) on ssh config files.
- SSH timeouts could cause partial files to be created and prevent them from being overwritten by subsequent attempts; the upload.sh script was fixed to correct this.
- Some duplication of data from C-root due to two tcpdumps being run in parallel in error.
- Having DSC collection and viewing, and PCAP uploading, all going to the same collect.oarc.isc.org destination hostname made it difficult to prevent problems with one of these from impacting the others. Distinct destination hostnames for each DSC function (upload.dsc.oarci.net and view.dsc.oarci.net), as well as the existing one for PCAP collection, have now been defined for future scaling and migration.
- Slow uploads from M root due to 64Kbyte TCP window limitations - enabling parallel SSH transfers may have helped this?
- Due to gzclose() not being called properly in upload scripts for M-root, a few seconds of data at the end of each sample were lost. This bug has been fixed for the next collection exercise.
- Bug in upload verification script for zero timestamps, fixed.
- Some performance issues with NFS file access to the fd1 server were traced to spurious firewall configuration on that server which was rate-limiting NFS requests; this was removed on Feb 14th and performance has improved since.
- Constrained bandwidth from the Peking F-root node (due to the Taiwan fiber cut) caused large file transfers to get dropped; a staged upload was eventually completed by 16th February.
- Some hardware problems with the Munich F-root node delayed upload; these were resolved and the data uploaded by 16th February. Some data for 10th January appears to be missing, possibly due to confusion over finish times.
- Lack of disk space on storage array. This was resolved in the short term to create sufficient space for all DITL data by copying it from old low-capacity disks in the storage array to new higher-capacity disks that were on-line but unused in another volume of the array. A longer term resolution was achieved by replacing the disks in the old array volume with further higher-capacity new disks, and then on 2nd March re-partitioning and merging this into a single new volume with a total of 4.5TB capacity. The OARC filesystem is now at 36% of available capacity, including both DITL data to date and additional data collected during the DDoS attack on 6/7th February.
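The gzclose() and partial-file issues above share one pattern: a compressed sample should only become visible under its final name once its gzip stream has been closed. The following is a minimal Python sketch of that pattern, not the actual fix made to upload.sh or to the M-root scripts. The function name write_gzip_atomically and the ".part" suffix are hypothetical.

```python
import gzip
import os
import tempfile


def write_gzip_atomically(chunks, dest_path):
    """Write byte chunks to dest_path as gzip, exposing only a complete file."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Temporary file in the same directory, so the final rename stays on one
    # filesystem. The ".part" suffix is an illustrative choice.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    os.close(fd)
    try:
        # Leaving the "with" block closes the gzip stream, which flushes
        # buffered data and writes the CRC/length trailer. Skipping this
        # close is how the tail of a sample gets lost.
        with gzip.open(tmp_path, "wb") as gz:
            for chunk in chunks:
                gz.write(chunk)
        # Rename only after a clean close; a failed or interrupted attempt
        # never leaves a partial file under the final name.
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A retry after a timeout can then simply call the function again, since any earlier incomplete attempt has been removed rather than left in place of the real file.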
Open Issues
- Periodic crashes of in2 server - the cause is unknown, but may be linked to cron jobs running in support of DSC. Migration of DSC to SQL-based data storage is being looked into as a way around this. Various steps have been taken to attempt to resolve the in2 server crashes:
  - Crash-dump diagnostics have been enabled
  - Upgraded BIOS firmware to latest version
  - Upgrade of OS software to FreeBSD 6.2.
- Various maintenance reboots on the fd1 RAID server without first unmounting the filesystems exported by it caused filesystem access from the an1 server to hang on a number of occasions. This appears to be intrinsic to the SuSE Linux 10.1 NFS server implementation; it is unclear what can be done about this apart from ensuring all client filesystems are unmounted before and remounted after maintenance with due warning given, or trying another operating system.
- Loss of some data from K-root:
  - Insufficient disk space on (busiest) London node led to loss of data between 05:00 and 09:30 UTC on Jan 9th.
  - Human error/confusion (aggravated by pressure on disk space) when uploading data from the Helsinki, Frankfurt, Milan, Athens, Budapest and Poznan nodes led to them being unrecoverably overwritten. Brisbane data later overwritten by Poznan data due to human error during analysis.
  - Disk space shortages on some other nodes led to sample files being split on other than uniform hourly boundaries.
  - Downtime caused by adding additional filesystem space to the OARC collection server meant that further K root collection servers ran out of disk space. This caused auto-rotation scripts to overwrite the first few hours of data from the 9th January for the busiest servers.
  - Possible corruption of some files showing de-compression errors; this may have been introduced during conversion from the lzop format used by K root to the gzip format used for PCAP submission.
- Loss of all data from E-root. This was also caused by space exhaustion on the OARC server combined with local auto-rotate scripts, which deleted data collected on the server after one week. Unfortunately, by the time upload space became available again, the data for the DITL period had been auto-deleted.
- Persistent problems uploading K-root data from the RIPE NCC. These are not fully understood, but do not arise when the same data is uploaded from systems outside the RIPE NCC's network, suggesting some kind of firewall problem.
- Updating the Linux kernel to a newer, non-SMP version on the fd1 RAID server has reduced but not completely eliminated intermittent (every few weeks) crashes of this box.
- Is there a case for using lzop rather than (or as an additional option to) gzip compression?
- Attempts to upload uncompressed files led to problems - need to clarify documentation?
- There was a complete lack of interest from TLD operators in participating; we need to understand why this was.
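Several of the issues above (de-compression errors after format conversion, uploads of uncompressed files) could be caught on the submitter's side before a file is uploaded or the local copy is deleted. Below is a minimal Python sketch of such a pre-upload check, assuming gzip is the submission format. The function name gzip_is_intact is hypothetical and this is not part of the existing upload scripts.

```python
import gzip
import zlib


def gzip_is_intact(path, block_size=1 << 20):
    """Return True if path decompresses cleanly as gzip from start to end."""
    try:
        with gzip.open(path, "rb") as gz:
            # Read through the whole stream in blocks; the CRC and length in
            # the gzip trailer are checked once the end is reached.
            while gz.read(block_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers "not a gzip file" and CRC mismatches, EOFError a
        # truncated stream, zlib.error damaged compressed data.
        return False
```

A file that fails this check locally would also fail the decompression verification on the collection server, so it is cheaper to find out before the transfer.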
Lessons Learned
- Doing the exercise so soon after the seasonal holidays proved problematic, with various people being difficult to reach and co-ordinate with in the run-up to the data collection period. Future exercises need to avoid major holiday periods by around a month, and ideally be planned several months in advance.
- While on the one hand it was possible to recruit additional participants right up until the last minute, other potential new participants were willing in principle, but unable in practice, to participate due to lack of notice. This is a further good reason for extending the notice period.
- Having a simple one-page "Data Supply Agreement" to ensure confidentiality of data collected at OARC worked very well for signing up new contributors.
- There was not any kind of dimensioning exercise to estimate the volume of data likely to be submitted versus available disk space. In future a size estimate should be gathered from all contributors of their data set in advance, and this checked against available disk space on the upload servers. Ideally submitters should do a similar exercise on their local data collection servers, particularly those who do not gather this data on an ongoing basis, and where the upload bandwidth from the collection servers is limited.
- Continuous monitoring (via e.g. Nagios) might have helped give more warning that disk space was running out, but most of the disk-space related problems arose from attempts to add more storage on the fly once it was realized space was running out, rather than from not realizing it was running out in the first place.
- The dry-run for data collection was an excellent idea, but in practice most of the issues were with data submission rather than collection. In future there should also be a dry-run of data uploading, ideally for at least 1-2 hours' worth of data. The volume of data submitted during the dry-run could also feed back into the space required estimates above.
- There were a number of issues where due to software and/or human error, data files from one instance of a root server were confused with, and overwrote, previously submitted files for another instance. One way to minimize this risk would be to auto write-protect files immediately their upload is completed. Note that there should be a mechanism for the submitter to manually override this write-protection in case the data needs to be re-submitted.
- Participants need to be reminded before any future exercises to either disable any auto-rotate scripts which delete data after an expiry period, or to take copies of measurement period data to another local location.
- Verification of the presence and integrity of the data uploaded by the submitter is not straightforward using the current scp-based access methods. A number of files uploaded were incomplete, which only became clear after subsequent local decompression verification. Creating a "file upload status" web page during the exercise proved invaluable, and should be repeated for future exercises. This could be complemented by giving contributors sftp and/or shell access to verify and amend their uploads.
- Publishing some agreed standards for upload filename formats in advance of the exercise could head off various minor issues and confusion in future.
- Even after deleting files from the local collector, it would be potentially useful to keep a log of the MD5 checksum of files submitted for post-verification.
- It may be useful to set up some intermediate "staging" servers which can perform store-and-forward of data between the local collectors and central server; this could potentially mitigate some of the disk space and bandwidth issues experienced.
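Three of the lessons above (checking contributors' size estimates against free disk space, write-protecting completed uploads, and keeping an MD5 log) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an existing OARC tool: the function names space_shortfall and seal_upload, the 20% headroom figure and the checksum log format are all hypothetical.

```python
import hashlib
import os
import shutil
import stat


def space_shortfall(estimated_bytes, upload_dir, headroom=0.2):
    """Return how many bytes are missing, or 0 if the estimates fit.

    estimated_bytes maps contributor name to an estimated size in bytes.
    headroom is the fraction of free space kept in reserve (assumed policy).
    """
    free = shutil.disk_usage(upload_dir).free
    usable = int(free * (1.0 - headroom))
    needed = sum(estimated_bytes.values())
    return max(0, needed - usable)


def seal_upload(path, checksum_log):
    """Log the MD5 of a completed upload, then make the file read-only."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            md5.update(block)
    digest = md5.hexdigest()
    with open(checksum_log, "a") as log:
        log.write(f"{digest}  {os.path.basename(path)}\n")
    # Clear every write bit. The owner can restore write permission with
    # chmod if the data genuinely needs to be re-submitted.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    return digest
```

Running space_shortfall before the exercise, and again after an upload dry-run with measured rather than estimated sizes, would give an early indication of whether additional storage is needed. Clearing the write bits guards against accidental overwrites by a later upload from the same non-root account, while still leaving the manual override described above.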

