We have all heard of the Colbert Report's idea of truthiness. Well, I am borrowing from Colbert to create the term poofiness--a term that describes something that has been so blown up and "poofy" that it seems out of reach; yet, if you prick it a bit, you will find that it explodes, and that there really isn't much of substance behind the poofery.
How does poofiness relate to PISA--the OECD's Programme for International Student Assessment? Well, an April 2014 study in the journal Psychometrika reveals serious flaws in the statistical methodologies used to produce the PISA rankings. Svend Kreiner and Karl Bang Christensen of the University of Copenhagen were curious about whether true rankings of countries could be produced using the statistical techniques upon which PISA is supposed to be based. In particular, they used Item Response Theory (IRT) modeling, which is premised on the idea that performance on individual items reflects a latent trait (e.g., reading ability). So, armed with the technical manuals and the 2006 PISA data sets, along with their knowledge of IRT, these researchers set out to conduct their review of PISA. The results should give us all some pause with respect to PISA.
First of all, Kreiner and Christensen report that “roughly half of the students participating …did not respond to any reading items. In spite of this, all students were assigned reading scores (so-called plausible values). Exactly how these scores were calculated is one of the unanswered questions” (p. 212). Now, it seems to me that regardless of how generous one might be about drawing inferences from test data, it is downright hard to fathom that half the students participating in PISA in 2006 were given reading scores even though they hadn’t answered any of the reading questions on PISA! Sort of reminds you of the tale of the Emperor and his clothes.
Next, Kreiner and Christensen used a variety of statistical procedures and came to the conclusion that the 2006 PISA test items simply do not satisfy the assumption that they all fit what is called the same “IRT model.” This essentially means that the items don’t behave consistently enough to say that they are all measuring the same thing.
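For readers who want to see what "fitting the same IRT model" buys you, here is a minimal sketch. It assumes the simplest IRT model, the Rasch model, and every number in it is made up for illustration; none of it comes from the PISA data or from Kreiner and Christensen's analysis. The point is that when all items fit one model, the gap between two students (in log-odds) is identical on every item, so it doesn't matter which items you happen to ask.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the Rasch (one-parameter IRT) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(p: float) -> float:
    """Convert a probability to log-odds (the logit)."""
    return math.log(p / (1.0 - p))

# Hypothetical values, chosen only for illustration.
ITEM_DIFFICULTIES = [-1.0, 0.0, 1.5]
WEAKER_READER = -0.5
STRONGER_READER = 1.0

# Log-odds gap between the two readers, computed item by item.
gaps = [
    log_odds(rasch_probability(STRONGER_READER, b))
    - log_odds(rasch_probability(WEAKER_READER, b))
    for b in ITEM_DIFFICULTIES
]

# If the items share one Rasch model, every gap equals the ability difference.
for difficulty, gap in zip(ITEM_DIFFICULTIES, gaps):
    print(f"item difficulty {difficulty:+.1f}: log-odds gap {gap:.3f}")
```

When real items break this pattern, for example when an item is unusually hard for students in one country but not another, the single-model assumption fails, and that is the kind of misfit the study reports.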
What is the upshot of all of this? Take the case of Canada. Kreiner and Christensen say that: “The most extreme cases are those found for Canada, where the probability of a rank equal to 25 or worse on items measuring information retrieval is equal to 0.00022, whereas the probability that the rank is equal to 1 or 2 on items relating to reflection is equal to 0.00008” (p. 221). Basically that means that because of the fluctuation in questions (which aren’t measuring the same thing), Canada’s ranking could be anywhere from 2nd to 25th. Japan’s and Denmark’s rankings were also found to vary widely. Poof...the rankings, which have become a tool in the political educational arsenal, seem to be more poofiness than reality! This is something I wrote about a couple of years ago in The InterAmerican Journal of Education for Democracy, although I confess the term poofiness had not entered my vocabulary at that time.
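To make the ranking problem concrete, here is a toy sketch of how a country's rank can flip depending on which subset of items is scored. The three countries, their abilities, and the item difficulties are entirely hypothetical, and the code again assumes a Rasch-style model; it is not a reconstruction of the authors' method or of any PISA result. All three countries get the same ability, but the items are not equally hard everywhere.

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch-style probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def expected_score(ability: float, difficulties: list[float]) -> float:
    """Expected number of correct answers on a set of items."""
    return sum(p_correct(ability, b) for b in difficulties)

def rank_by(subset: str, ability: float, item_difficulty: dict) -> dict[str, int]:
    """Rank countries (1 = best) by expected score on one item subset."""
    scores = {
        country: expected_score(ability, by_subset[subset])
        for country, by_subset in item_difficulty.items()
    }
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {country: pos for pos, country in enumerate(ordered, start=1)}

# Hypothetical: identical ability everywhere, but item difficulty differs
# by country, which is exactly what a single shared IRT model forbids.
ABILITY = 0.5
ITEM_DIFFICULTY = {
    "A": {"retrieval": [0.0, 0.5], "reflection": [1.0, 1.5]},
    "B": {"retrieval": [0.5, 1.0], "reflection": [0.0, 0.5]},
    "C": {"retrieval": [0.3, 0.8], "reflection": [0.6, 1.1]},
}

retrieval_ranks = rank_by("retrieval", ABILITY, ITEM_DIFFICULTY)
reflection_ranks = rank_by("reflection", ABILITY, ITEM_DIFFICULTY)
print("ranked on retrieval items: ", retrieval_ranks)
print("ranked on reflection items:", reflection_ranks)
```

In this made-up setup, country A comes first on the retrieval items and last on the reflection items, even though no country is actually "better" than another. The ranking is driven by the choice of items rather than by the students.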
In an interview with the UK's Times Educational Supplement, Kreiner said that “the best we can say about PISA rankings is that they are useless” and, in an interview with The New Zealand Listener, Kreiner told the Minister of Education in New Zealand to “Forget about these rankings. Disregard them.” So, poofiness lives...in PISA.