Dataset Documentation & Data Cards
Dataset documentation (sometimes called data cards or datasheets for datasets) is the practice of publishing structured information about a dataset's contents, collection methods, intended uses, and known limitations. Think of it as a nutrition label for data. A good data card tells you what is in the dataset, where it came from, who created it, which populations or scenarios it covers (and which it doesn't), and any known quality issues. It may also describe the annotation process, inter-annotator agreement, and any ethical review steps taken during creation.

The practice was popularised by researchers at Google and Microsoft, who argued that the AI community needed standardised ways to communicate about datasets, much as other fields have standardised reporting for clinical trials or environmental impact assessments. Adoption is growing but still patchy. Many widely used datasets have minimal documentation, which makes it difficult to assess their suitability for a given purpose.

If you're building AI systems, creating thorough documentation for your training data is one of the simplest and most valuable governance practices you can adopt. If you're procuring AI systems, ask for dataset documentation and be wary if it doesn't exist.
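To make the idea concrete, the items a data card typically covers can be sketched as a small structured record. The following is only an illustration in Python: the `DataCard` class, its field names, the `missing_fields` helper, and the example dataset are all hypothetical and do not follow any published data card or datasheet template.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class DataCard:
    """Hypothetical minimal data card; field names are illustrative, not a standard."""
    name: str
    contents: str = ""                      # what is in the dataset
    provenance: str = ""                    # where it came from, how it was collected
    creators: List[str] = field(default_factory=list)
    intended_uses: List[str] = field(default_factory=list)
    coverage: List[str] = field(default_factory=list)       # populations/scenarios covered
    coverage_gaps: List[str] = field(default_factory=list)  # populations/scenarios not covered
    known_quality_issues: List[str] = field(default_factory=list)
    annotation_process: str = ""
    inter_annotator_agreement: Optional[float] = None       # e.g. an agreement score, if measured
    ethical_review: str = ""


def missing_fields(card: DataCard) -> List[str]:
    """Return the names of fields left empty, so documentation gaps are visible."""
    return [f.name for f in fields(card) if getattr(card, f.name) in ("", [], None)]


# A made-up dataset, partially documented.
card = DataCard(
    name="example-reviews",
    contents="Short product reviews with sentiment labels.",
    provenance="Collected from a public web forum.",
    creators=["Example Lab"],
    intended_uses=["Sentiment classification research"],
    coverage=["English-language reviews"],
    coverage_gaps=["Non-English reviews"],
)

print(missing_fields(card))
```

For this partially filled card, the helper reports that the quality issues, annotation process, inter-annotator agreement, and ethical review sections are still undocumented. A check of this kind could serve either as a reminder for dataset authors or as a quick screen when reviewing documentation supplied by a vendor.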