
Introduction
Data is increasingly collected and used on dairy farms to monitor performance and support decision-making. But for data to provide valuable insights, it must be of high quality. As the saying goes, “Garbage In, Garbage Out,” or, perhaps more fittingly, “Quality In, Quality Out”!
This article explains the difference between dirty and clean data, outlines the pillars of quality data, and offers practical tips for keeping dirty data from being entered.
What is Dirty Data?
Dirty data doesn’t refer to record sheets covered in manure or cow saliva. Instead, it refers to data that contains mistakes or quality issues and requires “cleaning”. Cleaning data is the process of reviewing and correcting data problems such as incorrect entries, missing values, inconsistent units, or duplicated records (1). Errors like these can lead to inaccurate insights or even make analysis impossible (2). While cleaning can help address dirty data, it is time-consuming and cannot fix every issue, such as missing records.
Think of your data like your house. You can live in a very messy house, but it’s uncomfortable and harder to function in. Cleaning helps, but some damage from the mess (like stained floors) requires extra effort or may be irreversible. Instead, you can avoid headaches and issues by preventing the mess in the first place.
This same idea applies to your data. You can work with dirty data, but it is more difficult, less reliable, and sometimes unusable. All data will contain a little dirt, which is unavoidable, but minimizing dirty data from the start makes everything easier.
Pillars of Quality Data
Farmers can minimize the creation of dirty data by following these pillars for data quality: Accuracy, Completeness, Consistency, and Uniqueness.

Accuracy
Accurate data reflects the true values: measurements are correct and entries are free of typos. Accuracy is influenced by all the other pillars and is essential for reliable insights.
Real Dairy Farm Scenarios That Cause Inaccurate Data:
- A calf handler forgets to zero the scales when weighing a group of weaned calves. This results in all the weights for that group being recorded as 13 lbs. heavier than actual.
- The calf manager enters the data but accidentally skips a few lines, mismatching weights to calves. This results in an inaccurate report of average daily gains.
How to Avoid Inaccurate Data:
- Train staff thoroughly and consistently.
- Ensure that all sensors and systems are properly calibrated and electronically recording accurate data.
- Use the same people for data collection when possible.
- Spot-check entries during data input.
- Use software tools to flag outliers and catch mistakes early.
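As one way to picture the last tip, the sketch below flags weights that sit far from the group median. It is a minimal illustration in Python, not a feature of any particular herd software; the function name `flag_outliers`, the 25% cutoff, and the calf IDs and weights are all assumptions made up for the example.

```python
from statistics import median

def flag_outliers(weights, max_deviation=0.25):
    """Return (calf_id, weight) pairs that differ from the group median
    by more than max_deviation, expressed as a fraction of the median."""
    mid = median(weights.values())
    return [
        (calf_id, weight)
        for calf_id, weight in weights.items()
        if abs(weight - mid) / mid > max_deviation
    ]

# Hypothetical weaning weights in lbs; calf 1043 looks like a typo
# (1850 entered instead of 185).
weights = {"1041": 182, "1042": 190, "1043": 1850, "1044": 176, "1045": 188}

print(flag_outliers(weights))  # [('1043', 1850)]
```

A check like this catches a one-off typo, but it would not catch the unzeroed scale described above, because a 13 lb. offset shifts every weight in the group by the same amount. Calibration still has to happen at the scale.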
Completeness
Complete data includes values for all required fields. Missing values reduce the usefulness and reliability of a dataset.
Real Dairy Farm Scenarios That Cause Incomplete Data:
- During a busy night, an employee forgets to record which calf ID corresponds to which dam for four births. Because the data is incomplete, the farm is unable to reliably match those calves with their dams.
- Some calving record sheets are thrown away by an employee cleaning the office before they have been entered. This leaves gaps in the farm’s calving data.
How to Avoid Incomplete Data:
- Emphasize the importance of complete records.
- Use organization systems such as folders or trays to safely store unentered data.
- Keep original hardcopy sheets or verified scans after entry for reference.
- Use spreadsheet filters or herd software tools to check for missing data points.
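For farms that export records to a file, a check for missing data points could look something like the sketch below. It assumes each calving record is a row with `calf_id`, `dam_id`, and `birth_date` fields; those field names, the function name `find_incomplete`, and the sample records are hypothetical and would need to match your own sheet.

```python
REQUIRED_FIELDS = ("calf_id", "dam_id", "birth_date")  # assumed field names

def find_incomplete(records, required=REQUIRED_FIELDS):
    """Return (row_number, missing_fields) for every record that has a
    blank or absent required field. Row numbers start at 1."""
    problems = []
    for row_number, record in enumerate(records, start=1):
        missing = []
        for field in required:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                missing.append(field)
        if missing:
            problems.append((row_number, missing))
    return problems

# Hypothetical calving records; rows 2 and 3 are missing the dam ID.
calvings = [
    {"calf_id": "2101", "dam_id": "877", "birth_date": "2025-03-02"},
    {"calf_id": "2102", "dam_id": "", "birth_date": "2025-03-02"},
    {"calf_id": "2103", "birth_date": "2025-03-03"},
]

print(find_incomplete(calvings))  # [(2, ['dam_id']), (3, ['dam_id'])]
```

Running a check like this soon after entry matters, because a missing dam ID is much easier to recover the next morning than weeks later.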
Consistency
Consistent data uses standardized formats, units, terminology, and definitions. Inconsistent data results in limited and potentially unreliable analysis.
Real Dairy Farm Scenarios That Cause Inconsistent Data:
- When checking fresh cows, some employees on a farm consider a cow to have ketosis if her urine test result is 15 mg/dL while others on the farm use a threshold of 80 mg/dL. This inconsistency in diagnosis results in misleading and inconsistent health reports.
- When entering the fresh cow health information into the herd management software, the manager uses various terms for ketosis, including “ketosis”, “ket.”, “k”, and “keto”. This inconsistency in terminology makes generating and analyzing health reports difficult.
How to Avoid Inconsistent Data:
- Develop clear, consistent protocols for data collection and entry, and train staff on them.
- Regularly review protocols and data standards to minimize protocol drift.
- Develop and use consistent terms, formats, and naming conventions across all data systems.
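When mixed terms are already in the records, a small lookup table can bring them back to one standard term. The sketch below uses the four ketosis spellings from the scenario above; the name `TERM_MAP`, the function `standardize`, and the choice to return `None` for unrecognized terms are assumptions for illustration only.

```python
# Variants taken from the ketosis scenario; extend for other health events.
TERM_MAP = {
    "ketosis": "ketosis",
    "ket.": "ketosis",
    "k": "ketosis",
    "keto": "ketosis",
}

def standardize(term):
    """Return the standard term, or None when the entry is not recognized
    so that a person can review it instead of the code guessing."""
    return TERM_MAP.get(term.strip().lower())

entries = ["Ketosis", "ket.", "K", "keto ", "milk fever"]
for entry in entries:
    print(repr(entry), "->", standardize(entry))
```

A lookup like this only fixes terminology. It cannot repair the first scenario, where employees use different diagnostic thresholds, because the recorded term is the same even though its meaning differs.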
Uniqueness
Unique data avoids duplication. Redundant records can skew analysis and cause confusion.
Real Dairy Farm Scenario That Causes Duplicated Data:
- A farm weighs groups of heifers as they return from a custom-raiser. One returning group weighs less than normal. The farm manager accidentally enters the data for this group twice. Later, when the farm evaluates heifer weights, the duplicated entries for that group pull the overall average down, making it look like heifers are returning from the custom-raiser lighter on average than they actually are.
How to Avoid Duplicated Data:
- Train those entering data to follow consistent input practices.
- Use spreadsheet tools or herd management software to identify duplicate records.
- Establish protocols to double-check data entry (e.g., have a second person verify all data entries).
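To see how a duplicate shifts an average, and how software might catch it, consider the sketch below. It assumes a record is a duplicate when the same heifer ID appears twice with the same weigh date; that rule, the function name `split_duplicates`, and all IDs, dates, and weights are hypothetical.

```python
from statistics import mean

def split_duplicates(records):
    """Keep the first record for each (heifer_id, weigh_date) pair.
    Returns (unique_records, duplicate_records)."""
    seen = set()
    unique, duplicates = [], []
    for record in records:
        key = (record["heifer_id"], record["weigh_date"])
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
            unique.append(record)
    return unique, duplicates

# Hypothetical return weights in lbs; the lighter May group was entered twice.
records = [
    {"heifer_id": "501", "weigh_date": "2025-04-01", "weight_lbs": 1300},
    {"heifer_id": "502", "weigh_date": "2025-04-01", "weight_lbs": 1320},
    {"heifer_id": "610", "weigh_date": "2025-05-01", "weight_lbs": 1200},
    {"heifer_id": "611", "weigh_date": "2025-05-01", "weight_lbs": 1180},
    {"heifer_id": "610", "weigh_date": "2025-05-01", "weight_lbs": 1200},
    {"heifer_id": "611", "weigh_date": "2025-05-01", "weight_lbs": 1180},
]

unique, duplicates = split_duplicates(records)
avg_with_duplicates = mean(r["weight_lbs"] for r in records)
avg_without_duplicates = mean(r["weight_lbs"] for r in unique)

print(len(duplicates))         # 2
print(avg_with_duplicates)     # 1230
print(avg_without_duplicates)  # 1250
```

In this made-up example, the double-counted light group lowers the average by 20 lbs. Flagged duplicates are best reviewed by a person before deletion, since two legitimate records can occasionally look alike.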
Tips and Tricks to Increase Data Quality
High-quality data does not happen magically. It takes focused effort and time. By focusing on the pillars of quality data, farms can take steps like the following to improve their data.
- Provide consistent training and review protocols regularly with anyone collecting or entering data.
- Design and organize recording sheets or systems so that they are easy to understand and fill out.
- Reduce variability by having the same people enter data regularly.
- Create and use standardized terms or definitions for health events and labels. Consider using drop-down menus in spreadsheets to only allow selection of pre-defined terms.
- Use standardized formats (e.g., for dates and animal IDs) across all systems.
- Avoid formatting issues in spreadsheets such as merged cells, empty rows, or colored cells used as a form of data.
- Use consistent file naming and folder organization to make data easy to find.
- Check data for errors during entry by using methods like double entry or read-aloud. Double entry, where data is entered twice and the two versions are compared, is the most accurate. The read-aloud method, where one person reads the original hardcopy sheet aloud while the entered data is checked against it, is faster but slightly less accurate (3).
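The comparison step of double entry can be sketched in a few lines. The example below assumes each entry pass is stored as a mapping from animal ID to recorded value; the function name `compare_entries` and the sample IDs and weights are hypothetical, and this is only one possible way to do the comparison.

```python
def compare_entries(first, second):
    """Return {animal_id: (first_value, second_value)} for every animal
    where the two entry passes disagree or one pass has no value."""
    mismatches = {}
    for animal_id in sorted(first.keys() | second.keys()):
        a = first.get(animal_id)
        b = second.get(animal_id)
        if a != b:
            mismatches[animal_id] = (a, b)
    return mismatches

# Hypothetical weights in lbs entered twice from the same sheet.
first_pass = {"1041": 182, "1042": 190, "1043": 185}
second_pass = {"1041": 182, "1042": 109, "1043": 185, "1044": 176}

print(compare_entries(first_pass, second_pass))
# {'1042': (190, 109), '1044': (None, 176)}
```

The comparison only shows where the two passes differ. Someone still needs to go back to the original sheet to decide which value is correct.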
Summary
Low-quality, “dirty” data has reduced value for performance tracking and decision-making on dairy farms. By focusing on the pillars of data quality (accuracy, completeness, consistency, and uniqueness), farms can improve the reliability and usefulness of their data. High-quality data helps farms make more informed decisions to support their long-term success.
Author

Katelyn Goldsmith
Dairy Outreach Specialist
In her role as a statewide Dairy Outreach Specialist, Katelyn connects research with practical farm management practices to create educational programming addressing the needs of Wisconsin dairy producers.
Reviewers
Victor Cabrera
Dairy Systems Management Extension Specialist, Professor, Honorary Associate/Fellow
University of Wisconsin-Madison, Division of Extension
John Shutske
Agricultural Safety and Health Extension Specialist, Professor, Honorary Associate/Fellow
University of Wisconsin-Madison, Division of Extension
References
1. Guo, M., Wang, Y., Yang, Q., Li, R., Zhao, Y., Li, C., Zhu, M., Cui, Y., Jiang, X., Sheng, S., Li, Q., & Gao, R. (2023). Normal workflow and key strategies for data cleaning toward real-world data: Viewpoint. Interactive Journal of Medical Research, 12:e44310. https://doi.org/10.2196/44310
2. Barchard, K., & Pace, L. (2011). Preventing human error: The impact of data entry methods on data accuracy and statistical results. Computers in Human Behavior, 27:1834-1839. https://doi.org/10.1016/j.chb.2011.04.004
3. Barchard, K.A., Freeman, A.J., Ochoa, E., & Stephens, A.K. (2019). Comparing the accuracy and speed of four data-checking methods. Behavior Research Methods, 52:97-115. https://doi.org/10.3758/s13428-019-01207-3


