RCIPS data storage failures mount

The computer file server used to store various data for the Royal Cayman Islands Police Service has experienced no fewer than five major hardware failures since February 2012, according to information obtained under the Cayman Islands Freedom of Information Law.

The latest hardware failures, which occurred sometime between October 2013 and March 2014, corrupted a significant number of police records, some of which could not be recovered, Computer Services Department officials confirmed this week.

According to information obtained from a Cayman Compass FOI request, computer services records show that a controller card in the police data server at the Citrus Grove building in downtown George Town failed in February 2012. A hard drive in the server failed in August 2012, and another two hard drives failed in April 2013.

In October 2013, three more hard drives failed at the same time, causing storage of RCIPS files to “crash,” according to computer services officials. At that time, the server operating system was rebuilt and police data restored from a tape backup.

In March of this year, three more hard drives failed on the same server at the same time, causing server storage to go “offline,” computer services officials reported.

“Either of these events [referring to the October 2013 and March 2014 hard drive failures] could have caused the data on the server to become corrupted, but it is difficult to pinpoint the exact date of when the corruption started,” said Rex Whittaker, information manager at the Computer Services Department. “A large portion of the data is static and could have gone unnoticed for an extended period.”

The Computer Services Department’s “best estimate” is that the police data became corrupted between October 2013 and March of this year.

When the Compass first reported this incident earlier this year, it was told that server crashes had affected five hard drives in the Citrus Grove server.

“A series of hard drive failures occurred on the server, the server itself did not fail,” Wesley Howell, deputy chief officer of the Ministry of Home Affairs, said in May. “The server held data for the RCIPS, specific types and exact number of files that are corrupted are unknown. The file share that has the corrupted files holds 1.2 terabytes of data.”

A terabyte is equal to 1,000 gigabytes, or 1 trillion bytes, of data. A one-terabyte hard drive, for example, could hold more than 71 million copies of this story if each were saved as a standard Microsoft Word document.
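The arithmetic behind that comparison can be checked with a short sketch. It assumes decimal storage units, as the article does, and takes the 71 million figure and the 1.2-terabyte file share size from the reporting above. The per-copy size it prints is derived from those figures, not reported by any source, and the function names are illustrative.

```python
# Decimal storage units, as used in the article:
# 1 terabyte = 1,000 gigabytes = 1 trillion bytes.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_TERABYTE = 1_000 * BYTES_PER_GIGABYTE

# Figure quoted in the article for copies of the story per terabyte.
COPIES_PER_TERABYTE = 71_000_000


def implied_bytes_per_copy(copies: int = COPIES_PER_TERABYTE) -> float:
    """Size of one saved copy implied by the copies-per-terabyte figure."""
    return BYTES_PER_TERABYTE / copies


def terabytes_to_bytes(terabytes: float) -> int:
    """Convert a size in terabytes to whole bytes."""
    return round(terabytes * BYTES_PER_TERABYTE)


if __name__ == "__main__":
    per_copy = implied_bytes_per_copy()
    print(f"Implied size per copy: about {per_copy / 1_000:.1f} kilobytes")
    print(f"1.2 TB file share: {terabytes_to_bytes(1.2):,} bytes")
```

On these assumptions, each saved copy of the story works out to roughly 14 kilobytes.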

The Citrus Grove server held records from the RCIPS Joint Intelligence Unit, the Marine Unit and the police commissioner’s office, but the specific records that were corrupted were unknown.

The Computer Services Department directed all questions about what records might have been affected by the hard drive crash to the RCIPS.

Computer services is still working with the RCIPS forensic team to recover corrupted data, some 10.3 gigabytes of which has been recovered to this point, Mr. Whittaker said. In addition, various data recovery specialists were contacted in April 2014 for cost estimates on recovering the police data. RCIPS staff has been performing tests to access the recovered data since March.

“In addition to the normal working hours spent recovering corrupted data, it is estimated that [computer services] also spent 104 hours at $4,500 overtime working on the RCIPS corrupt data recovery project,” Mr. Whittaker noted. The Computer Services Department also found that a portion of the RCIPS server at Citrus Grove “did not get backed up.”

“The origin of this problem was traced back to the rebuilding of the server after the October 2013 crash, when the storage was divided into four logical data drives to effectively manage the RCIPS’s large volume of storage,” computer services officials reported. “After reviewing the backup logs, it was determined that the backup job for the RCIPS server was only manually updated with three of the four logical drives, so a portion of the server did not get backed up.”

To correct this problem in the future, computer services has implemented “more checks and balances” in its server change management and backup procedures to ensure multiple checks are done on backup systems by different individuals.
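One kind of automated check that could flag a gap like the one described is a comparison between the volumes present on a server and the volumes listed in its backup job. The sketch below is illustrative only: the drive letters and the function name are hypothetical, and nothing here reflects the actual tools or procedures used by computer services.

```python
def find_unprotected_volumes(server_volumes, backup_job_volumes):
    """Return volumes that exist on the server but are missing from the backup job."""
    return sorted(set(server_volumes) - set(backup_job_volumes))


if __name__ == "__main__":
    # Hypothetical drive letters standing in for the four logical data drives.
    server_volumes = ["E:", "F:", "G:", "H:"]
    # Hypothetical backup job that was updated with only three of the four.
    backup_job_volumes = ["E:", "F:", "G:"]

    missing = find_unprotected_volumes(server_volumes, backup_job_volumes)
    if missing:
        print("Not covered by the backup job:", ", ".join(missing))
    else:
        print("All volumes are covered by the backup job.")
```

A check like this, run after every server rebuild or storage change, reports any volume the backup job does not cover instead of relying on a manual update.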

Police records

The initial problem with the police computerized records was revealed when police admitted they could not recover responsive records sought by the Compass in an open records request earlier this year.

The RCIPS information manager, Chief Inspector Raymond Christian, reported numerous times that officers were searching for the relevant records sought by the request for the period from Jan. 1, 2011, to Feb. 19, 2014: “All of the watercraft used as part of the Joint Marine Unit’s operations by name of the boat. How many times each of those watercraft have a) broken down, have been damaged or were otherwise found to be deficient and have required repairs or replacement, b) the period of time they were out of service, c) the cost of making the repairs, d) when they were returned to service e) if they were not returned to service, what happened to the watercraft.”

Some of the repair cost information had been provided as part of the request, but Mr. Christian said data related to the time the vessels were out of service was on the government hard drives in the Citrus Grove building that had crashed, apparently sometime in May. The Computer Services Department has since confirmed that troubles with data storage on the RCIPS servers had occurred well before that time.

Acting Information Commissioner Jan Liebaers expressed some concerns regarding police data not being restored if it had been lost, and indicated his office was following up on the situation.

To date, the Compass’s request for information about the marine unit craft has not been answered.

3 COMMENTS

  1. The explanation provided by the Computer Services Department is simply not acceptable and clearly represents a failure to put in place proper checks and balances and a failure to provide the necessary oversight of critical data protection processes.

    The statement that ‘So a portion of the server did not get backed up’ is simply not acceptable and reeks of incompetence. Also, while it is not unusual for hard drives to fail, I find it difficult to believe that a comprehensive backup solution was not in place to ensure that all critical data could be recovered in the event of a major server or storage failure. I also question the statements regarding the multiple drive failures, as the design and implementation of the storage solution for such critical data would have included multiple mirrored storage enclosures and some form of tape or remote site backup solution.

  2. I have been in IT for over 20 years, and if I ever gave a client this explanation of why backed-up data couldn’t be recovered, I would be sued or fired. This was an obvious case of human error. They obviously had no regular health checks in place, nor do they seem to have ever verified that their backups were viable.

  3. There is more to this.

    Hard drives are quoted with an MTBF – Mean Time Between Failures.

    A modern hard disk intended for home use has an MTBF of about 114 YEARS, meaning some will fail after a few years and some will last 200 years plus!

    Statistically, three drives failing at the same time from ‘normal wear and tear’ alone is impossible.

    This leaves four possible causes:

    Environmental – did the server room experience adverse conditions? For example, an AC failure could allow the temperature to climb rapidly beyond the drives’ maximum permitted operating temperature, at which point simultaneous failure becomes not only possible but inevitable. The same can occur inside the server if one of the internal fans fails and the cool air is not being circulated.

    Electrical – a poor-quality electrical supply, with spikes in the voltage or periods of higher or lower than normal voltage. Servers and storage should always be protected by a UPS, which would prevent such damage, but without one a lightning strike can damage multiple devices at the same time.

    Physical – drives spinning at full speed can be damaged by a minor impact, such as something dropped onto the storage unit or the unit being moved and knocked into something. Again, this COULD cause simultaneous failures.

    But the worrying fourth possibility is deliberate sabotage – someone trying to destroy the data for nefarious purposes.

    Given the nature of the data, access to server rooms should be restricted to a very small number of qualified people.
