A student was asked to analyse data for a term paper. She had received the data from her lecturer. It included study areas, plant species, the names of animal species and the frequency with which these animals were found on the plants. The student then added information from the literature on the degree of specialization and other species characteristics for each of these animal species. The analysis revealed none of the expected correlations: the specialists were not very specialized, and some species that are usually common turned out to be very rare. Although she tried different methods of analysis, the results remained the same: neither the plant species nor the study areas differed in the composition of their animal species.
Shortly before the work had to be handed in, it turned out that the species names in the table had shifted by two rows. Consequently, the connection between the species and their locations and frequencies had been broken. It was no longer possible to reconstruct exactly how and when the error had occurred, but the student had received the data in an already corrupted state and had worked with the incorrect data set from the outset. No further analysis could be conducted before the assignment was due, so her work was accepted despite being based on flawed data. After all, she had carried out all the required analyses and bravely discussed the (very unsatisfactory) results.
Such errors are unfortunately quite common, especially when data is saved and edited exclusively in spreadsheet programs such as Excel or LibreOffice. It is therefore very important to work only on copies of the raw data and never in the raw data itself. If data is to be re-sorted, selected or aggregated, tools such as OpenRefine, queries in relational databases (e.g. MariaDB, SQLite), or R or Python scripts are recommended.
Source: Personal communication.
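The recommendation to work on a copy and to re-sort data with a script rather than by hand can be illustrated with a minimal Python sketch. It is not taken from the case described here: the file names, column names and the three data rows are invented placeholders, and the helper functions `file_checksum` and `sort_copy_by_column` are hypothetical. The sketch copies the raw file, sorts the copy row by row so that each species name stays attached to its own count, and uses a checksum to confirm that the raw file has not been altered.

```python
import csv
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sort_copy_by_column(raw_path, work_path, column):
    """Copy the raw file, then sort the copy row by row on one column."""
    shutil.copyfile(raw_path, work_path)  # the raw file is only ever read
    with work_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)
    # Each row is one dict, so a species name cannot drift away from its count.
    rows.sort(key=lambda row: row[column])
    with work_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Invented demo data: areas, plants, species and counts are placeholders.
demo_dir = Path(tempfile.mkdtemp())
raw_file = demo_dir / "raw_observations.csv"
raw_file.write_text(
    "area,plant,animal_species,count\n"
    "A,oak,species_c,3\n"
    "A,birch,species_a,12\n"
    "B,oak,species_b,1\n",
    encoding="utf-8",
)

checksum_before = file_checksum(raw_file)
sorted_rows = sort_copy_by_column(
    raw_file, demo_dir / "working_copy.csv", "animal_species"
)
# True as long as the raw file is byte-for-byte identical to its state before.
raw_unchanged = file_checksum(raw_file) == checksum_before

print([row["animal_species"] for row in sorted_rows])
print(raw_unchanged)
```

Because whole rows are sorted as single records, a single column cannot be displaced relative to the others, which is the kind of error that can occur when a column is sorted or pasted on its own in a spreadsheet. Storing the checksum of the raw file alongside the data would also allow a later check of whether a received file still matches the original.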