Verification vs Validation of Data
Do we verify that the data is correct? Or do we validate it? These two similar terms have different meanings although the two are often conflated.
For data, we can make the following distinction:
Verification: “Did we collect the data right”
Validation: “Did we collect the right data”
For verification we are asking if the data that we collected is consistent with the rules for the form the data should take. For example, if we insist that all the dates must be in the format YYYY-MM-DD, then we can verify the dataset and check that all the dates are indeed in that format. We might also insist on checking that all p-values are between 0 and 1, or that we are using consistent names for all the farms in the dataset.
With the Semantic Engine, you can write the rules needed to verify the dataset right into the schema using overlays such as the format overlay.
Other rule containing overlays of the schema also contribute verifying a dataset. For example, if a value is required for a specific attribute, then the verification check will make sure there are no blank entries in the dataset for that attribute. If there are entry codes documented in the schema, then data verification will check that only valid entries codes appear in the data for the specific attribute.
Validation is a different beast compared to verification. A researcher collects data to answer a specific research question, such as “Do plants need water to grow”. If you collect data about soil type and plant height and date of death etc. but fail to collect any information about if the plants received water or not, then you are unable to answer the research question. This is of course a trivial example, but it is very easy to get into analysis and realize that you are missing a key variable because you are unable to reach a conclusion. Deciding if your dataset is valid for the research question you are asking is validation and is a key feature of being a researcher.
Typically many people conflate verification and validation, but it becomes very significant especially in a regulatory environment. For example, when the FDA comes and asks for evidence of verification and validation they are asking for very specific things!
The Semantic Engine can help researchers with data verification – did we collect the data right.
Written by Carly Huitema