Validity in assessment

Ross Woods, updated Feb. 2015

Valid means that the assessment assesses what it is supposed to assess. This has two distinct meanings:

1. You are assessing the right thing. For example, an assessment of theory is not an assessment of practical competence, and vice versa. You can't assess skill in riding a bicycle through an essay. If you tried, the student might not be able to ride a bicycle, even if he/she showed competence in expressing and handling concepts, and drawing and justifying conclusions.

2. You are complying with the standard. The assessment addresses all requirements of the competency standard (e.g. Australian Training Package units). In this meaning, assessments are invalid if they don't address all requirements. You'd ask, "Does the assessment adequately cover the range of skills? Does it integrate theory and practice? Are there multiple ways to assess the learning?"

Validity: the problem of "stretching"

It is possible to "interpret" an element statement by stretching into something quite different, so that the assessment is no longer valid.

In Kirsty’s case, the unit requirements have been stretched into very different contexts, so different that it would be unfair on her:

Kirsty, a female youth worker has all her training and experience with young homeless girls in the inner city. She is not only the team leader, but is assessed as highly competent.

Is it fair to say that Kirsty is not yet competent because she doesn’t work with boys of the same age? Or older boys? Or older upper class boys in an exclusive private school? What about older immigrant young people? Somewhere there is a line over which it becomes unfair to stretch.

The student must be able to perform a skill in a specified context, but then we say that the student must also be able to adapt the skill to other contexts. However, some contexts are so different that it would be unfair to require the student to perform the skill in that way.

You can ask too much. In your quest for excellence, you might make demands of students that are so high that the extra demands are unfair. In Justin’s case, the requirements were stretched upward.

The coordinator of Justin’s course was aiming for excellence. He determined that only the best would be able to get through.

He added substantially to the requirements and defined very high performance standards. The course was excellent and many students did the extra work to rise to the challenge, although nearly half the class dropped out. Those who finished were all assessed as competent and went on to do very well in other units.

Stretching upward doesn’t seem to happen as often, but it can be unfair to students like Justin:

The course is too difficult for many people with the appropriate entry requirements, giving them less chance of passing.
Students who pass have earned credit for a more advanced unit but are considered to have done a lower unit.
The students cannot see in the elements what will actually be assessed, and might feel that they have not been told what extras will be assessed.
Students must invest excessive time.
The instructor might teach very poorly and simply demand high performance levels.

On the other hand, you can stretch upwards as good practice. The units are simply minimum standards, and it is good practice to encourage students to achieve the highest standard that they can. It’s just that you can’t require students to do more than the actual unit to pass it.

In Aaron’s case, the requirements were stretched downward ("dumbed down"):

Aaron joins a VET course and puts in an honest effort. He finds it quite difficult, but is given extra help by a teacher who nurtures the students along so that they can at least fall over the line and pass the course. With all good intentions, the teacher stretched the elements lower than they should be.

You can ask too little. At this end of the scale, you may interpret the standards so low that you are basically cheating.

Whether the tendency is downward or upward, it seems to happen quite often.

Stretching can be handled by industry consultation to identify more precisely the standards for employability. Moderation is also important. In either case, the point is to specify fair interpretation.

So how do you know? Two ways. Check the AQF, which will help you decide the level. After that, it’s a consultation issue. Ask for opinions and make decisions based on the advice you get.

Adding to requirements

You are free to add requirements to the units, but if you add too much then the standard is substantially changed and the assessment is no longer valid. In other words, you can't fail a student if they get all actual unit requirements right and fail on the parts you added. A few colleges even treat the endorsed units as minimum standards, and add tougher expectations.

Then the question comes up, "Can you add other relevant standards in your assessment such as licensing standards and enterprise standards?"

The answer: If you're assessing for a recognized qualification or statement of attainment, you can only require compliance with other standards if the package allows it or the national standards require it.

That means that you can include other relevant standards in the assessment in these ways.

You can require compliance with legislation and regulatory requirements. The national standards require it.
You must comply with the package's assessment guidelines and the range of variables. Most of these require students to comply with organizational standards, and some even require you to explore standards.
Assessments need to meet all standards of workplace performance. In other words, students need to do whatever is necessary to do the job. This will sometimes be identified when you consult industry but some students find ways of being incompetent that you hadn't anticipated.
You can include other standards only if it doesn't affect your assessment for a giving an RTO credential. That is, you may be assessing for another purpose as well, such as a license. Licensing is always a separate procedure, even if the qualification forms a major part of the requirements. In other words, those other standards don't affect the assessment you do for the RTO.

Validity in the AQTF 2007

The AQTF 2007 User’s Guide to the Essential Standards for Registration gives the following formal definition:

"One of the principles of assessment and also one of the rules of evidence. Assessment is valid when the process is sound and assesses what it claims to assess. Validity requires that:

assessment against the units of competency must cover the broad range of skills and knowledge that are essential to competent performance
assessment of knowledge and skills must be integrated with their practical application
judgement of competence must be based on sufficient evidence (that is, evidence gathered on a number of occasions and in a range of contexts using different assessment methods). The specific evidence requirements of each unit of competency provide advice on sufficiency.

Comments

It is a pity that validity is seen here to "cover the broad range of skills and knowledge that are essential to competent performance," which is actually a different principle of evidence. Not a wrong thing to say, just not as part of the definition of validity. It is also odd that it is linked to sufficiency.

Educationally speaking

Valid traditionally means that the assessment assesses what it is supposed to assess. For example, an assessment of theory is not an assessment of practical competence, and vice versa. You cannot assess bicycle riding through an essay. This is the educator's meaning.

The bureaucrat's meaning

Some versions of the national standards appear to abandon this sense and increasingly view validity as a compliance issue rather than an educational one. This means that the assessment addresses all package requirements. In this meaning, an assessment that does not address all requirements is considered invalid.

Validity increasingly means that the student meets all performance criteria for a particular element. This is difficult when one or more criteria are irrelevant, as is sometimes the case with poorly written criteria. Some auditors suggest that if not all criteria can be met, then the assessment task is inappropriate to the unit or the unit is inappropriate to the context.

Some auditors suggest that if other standards are incorporated into the assessment, then the assessment is invalid.

Validity and essays

Here's a common scenario:

You are assessing a unit in a higher qualification that is primarily about thinking skills. You decide to assess students by essay, as it requires students to think through the relevant issues and come up with defensible conclusions. Obviously, you want to incorporate a separate list of criteria for assessing essays.

But you are told that the assessments will be considered invalid because you have added other criteria to the assessment (that is, the essay criteria).

The ramifications would be as follows:

You would have to assess the student as competent if the essay was very poor but somehow met the elements. This to you is quite unacceptable.
You find it difficult to justify essay writing skills. First, they are not a workplace requirement; it would also be forced and artificial to dress up an essay as a report or board proposal. Second, you couldn't justify essay skills on the basis of other relevant skills requirements, as professions don't require people to write essays so they don't have standards for them.

However, better justification might be available through two other sources:

Cognitive descriptors in the AQF require certain levels of thinking skills that can be quite consistent with essay writing.
Key competencies require students to be able to handle and present information.
Essay skills are very consistent with the advanced literacy skills required at Certificate IV and above.

Standards for National Recognition (2011)

The 2011 SNR included a long technical statement on validity:

"Validity: There are five major types of validity: face, content, criterion (i.e. predictive and concurrent), construct and consequential. In general, validity is concerned with the appropriateness of the inferences, use and consequences that result from the assessment. In simple terms, it is concerned with the extent to which an assessment decision about a candidate (e.g. competent/not yet competent, a grade and/or a mark), based on the evidence of performance by the candidate, is justified. It requires determining conditions that weaken the truthfulness of the decision, exploring alternative explanations for good or poor performance, and feeding them back into the assessment process to reduce errors when making inferences about competence.
"Unlike reliability, validity is not simply a property of the assessment tool. As such, an assessment tool designed for a particular purpose and target group may not necessarily lead to valid interpretations of performance and assessment decisions if the tool was used for a different purpose and/or target group.

It is technically very good from the viewpoints of educators and researchers, but quite inconsistent with all previous definitions of validity in assessment. Here's an explanation:

Based on: "Work-based Education Research Centre of Victoria University in conjunction with Bateman and Giles Pty Ltd. 2009. Guide for developing assessment tools. N.p.: National Quality Council."

Validity means that an assessment decision is based on the evidence of performance. What conditions weaken the truthfulness of the decision? Could conclusions about good or poor performance be the result of something else? These can then be fed back into the assessment process to reduce errors when making assessment decisions.

Unlike reliability, validity is not simply a property of the assessment tool. An assessment tool is designed for a particular purpose and target group and might not necessarily lead to valid interpretations of performance and assessment decisions if the tool was used for a different purpose and/or target group.

There are five major types of validity: face, content, criterion (i.e. predictive and concurrent), construct and consequential:

Concurrent validity
Does performance in one task consistently result in performance in other related tasks? Can students transfer leaning from what they are assessed on to other tasks requiring the same skills?

Consequential validity
Does the specific assessment task make value-laden assumptions that have social and moral implications in a specific, local context?

Construct validity
Do the explanatory concepts or constructs account for the performance on a task? In other words, can the assessor infer competence from the evidence collected, without being influenced by other non-related factors (e.g. literacy levels).

Content validity
Does assessment tool actually collect evidence that matches the required knowledge and skills specified in the competency standards?

Face validity
Do the assessment tasks reflect real work-based activities?

Predictive validity
Do assessment outcomes accurately predict the future performance of the candidate?

Standards for National Recognition (2015)

In the Rules of Evidence, validity is defined as: "The assessor is assured that the learner has the skills, knowledge and attributes as described in the module or unit of competency and associated assessment requirements."

As a principle of assessment, it is defined as follows:

Any assessment decision of the RTO is justified, based on the evidence of performance of the individual learner.

Validity requires:

assessment against the unit/s of competency and the associated assessment requirements covers the broad range of skills and knowledge that are essential to competent performance;

assessment of knowledge and skills is integrated with their practical application;

assessment to be based on evidence that demonstrates that a learner could demonstrate these skills and knowledge in other similar situations; and

judgement of competence is based on evidence of learner performance that is aligned to the unit/s of competency and associated assessment requirements.

The more technical meanings (concurrent, consequential, construct, content, etc.) were discontinued, and some of the ideas of the AQTF 2007 were re-adopted.

Item	Comment
The learner has the skills, knowledge and attributes as described in the module or unit of competency and associated assessment requirements.	Does not refer to validity at all, but to compliance with assessment requirements.
Assessment against the unit/s of competency and the associated assessment requirements covers the broad range of skills and knowledge that are essential to competent performance.	Does not refer to validity at all, but to compliance with assessment requirements.
Assessment of knowledge and skills is integrated with their practical application	Not directly related to the concept of validity.
Assessment to be based on evidence that demonstrates that a learner could demonstrate these skills and knowledge in other similar situations	Not directly related to the concept of validity but to application in other contexts.
Judgement of competence is based on evidence of learner performance that is aligned to the unit/s of competency and associated assessment requirements.	Does not refer to valitity at all, but to compliance with assessment requirements.

Appendix

The NQC also covered definitions of reliability and assessment tools, which are worth including. Based on: "Work-based Education Research Centre of Victoria University in conjunction with Bateman and Giles Pty Ltd. 2009. Guide for developing assessment tools. N.p.: National Quality Council."

Reliability
One of the principles of assessment. In general, reliability is an estimate of how accurate the task is as a measurement instrument. It is concerned with how much error is included in the evidence. There are five types of reliability: internal consistency, parallel forms, split-half, inter-rater and intra rater.

Internal consistency How well do the items of tasks act together to elicit a consistent type of response? (This usually relates only to tests.)

Inter-rater reliability Do different assessors make consistent judgements using the same assessment task and procedure?

Intra-rater reliability Does an individual assessor make consistent assessment judgements using the same assessment task? Are assessment judgements consistent across time, location, and different kinds of students?

Parallel forms of reliability Are two alternative forms of a task actually equivalent?

Split half reliability If the candidate sits a test that is subsequently split into two tests during the scoring process, will both scoring processes give the same result?

Assessment tool
An assessment tool includes the following components:

the context and conditions for the assessment
the tasks to be administered to the candidate
an outline of the evidence to be gathered from the candidate
the criteria used for judging the quality of performance (i.e. the assessment decision making rules), and
the administration, recording and reporting requirements.