Item response theory

  • Jul 26, 2021
Item response theory - Applications and Test

Within the field of psychometric test theory, several approaches have appeared that are now grouped under the name "Item Response Theory" (F. M. Lord, 1980). This framework differs from the classical model in two respects: 1.- the relationship between a subject's expected score and the trait level (the characteristic responsible for the scores) is not usually linear; 2.- it aims to make individual predictions without needing to refer to the characteristics of a normative group.


Index

  1. Item response theory or latent trait models in test theory
  2. Item response theory models (IRT)
  3. Parameter estimation
  4. Test construction
  5. Item response theory applications
  6. Interpretation of scores

Item response theory or latent trait models in test theory.

We see, then, that Item Response Theory makes it possible to describe items and individuals separately; it also considers that the response a subject gives depends on his or her level of the ability under consideration. The origin of these models goes back to Lazarsfeld (1950), who introduced the term "latent trait".

From here it is considered that each individual has an individual parameter responsible for his or her characteristics, also called the "trait". This trait is not directly measurable, so the individual parameter is called a latent variable. When the tests are applied, two different things can be obtained, the true score and the aptitude scale; this is achieved if we administer two tests of the same aptitude to the same group.

In Latent Trait Theory or Item Response Theory, the true score is the expected value of the observed score. According to Lord, true score and aptitude are the same thing expressed on different scales of measurement.

Item response theory models (IRT)

Binomial error models: introduced by Lord (1965), these assume that the observed score corresponds to the number of correct answers obtained on the test, whose items all have the same difficulty and are locally independent; that is, the probability of correctly answering one item is not affected by the answers given to other items.
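As a minimal sketch (the function name `binomial_score_prob` is our own), the binomial error model says that the observed score on an n-item test follows a binomial distribution governed by the subject's true proportion-correct p:

```python
from math import comb

def binomial_score_prob(x, n, p):
    """P(observed score = x) on an n-item test for a subject whose
    true proportion-correct is p (binomial error model)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Score distribution for a subject with true proportion-correct 0.7
# on a 10-item test; the probabilities sum to 1.
probs = [binomial_score_prob(x, 10, 0.7) for x in range(11)]
```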

Poisson models: these models are appropriate for tests that have a large number of items and in which the probability of a correct (or incorrect) answer is small. Within this group, in turn, we have different models:

  1. Rasch's Poisson model, whose hypotheses are: each test consists of a large number of binary, locally independent items; the probability of error on each item is small; the probability that a subject makes an error depends on two things, the difficulty of the test and the ability of the subject; and the additivity of difficulties, understood to mean that merging two equivalent tests yields a single test whose difficulty is the sum of the difficulties of the two initial tests.
  2. Poisson model to evaluate speed: also proposed by Rasch, this model is characterized by taking into account the speed of execution of the test. It can be approached in two ways: counting the number of mistakes made and the words read in a unit of time, or counting the number of mistakes made and the time taken to complete the reading of the text. The model specifies the probability that a subject (j) completes a certain number of words of a test (i) during a time (t).
  3. Normal ogive model: a model proposed by Lord (1968), used in tests with dichotomous items and a single common variable. The basic assumptions that characterize this model are:
  • the space of the latent variable is one-dimensional (k = 1).
  • local independence between items.
  • the metric for the latent variable can be chosen so that the characteristic curve of each item is a normal ogive.

Logistic models: very similar to the previous one, but with additional advantages in its mathematical treatment. The logistic function takes the form f(x) = 1 / (1 + e^(−x)). There are different logistic models depending on the number of parameters they have:

  • 2-parameter logistic model (Birnbaum, 1968): among its characteristics, it is one-dimensional, assumes local independence, and uses dichotomous items.
  • 3-parameter logistic model (Lord): characterized by treating the probability of a correct answer by guessing as a factor that influences test performance.
  • 4-parameter logistic model: proposed by McDonald (1967) and Barton and Lord (1981); its purpose is to explain cases in which subjects with a high level of aptitude do not answer an item correctly.
  • Rasch logistic model: the model that has generated the greatest volume of work, despite the drawback that its fit to real data is more difficult to achieve. In return, the advantage that makes it so widely used is that it does not require large sample sizes for estimation.
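To make the family concrete, here is a minimal sketch of the three-parameter logistic item characteristic curve (the function name `icc_3pl` is our own): with c = 0 it reduces to the 2PL, and additionally fixing a gives the Rasch/1PL form. The 1.7 scaling constant is the usual convention for approximating the normal ogive:

```python
import math

def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter logistic item characteristic curve:
    P(correct | theta) = c + (1 - c) / (1 + exp(-1.7 * a * (theta - b))),
    where a is discrimination, b difficulty, c the guessing asymptote.
    With c = 0 this is the 2PL; with a also fixed, the Rasch/1PL."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))
```

At theta = b (and c = 0) the curve passes through 0.5, and a controls how steeply it rises around that point.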

Parameter estimation.

The most widely used method is Maximum Likelihood, together with numerical approximation procedures such as Newton-Raphson and Scoring (Rao). The Maximum Likelihood method is based on the principle of obtaining estimators of the unknown parameters that maximize the probability of the observed sample. In addition to Maximum Likelihood, Bayesian estimation is also used; based on Bayes' theorem, it consists of incorporating all known a priori information that is relevant to the inference process. More in-depth studies of the Bayesian estimation of ability parameters were carried out by Birnbaum (1969) and Owen (1975).
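As an illustration of maximum likelihood estimation, the following sketch (assuming a 2PL model; all names are ours) estimates a subject's ability from dichotomous responses. For the 2PL the observed Hessian equals minus the Fisher information, so the Newton-Raphson and scoring iterations coincide:

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1 / (1 + math.exp(-a * (theta - b)))

def ml_theta(responses, items, theta=0.0, iters=20):
    """Newton-Raphson / scoring maximum-likelihood estimate of ability.
    responses: 0/1 answers; items: (a, b) discrimination/difficulty pairs."""
    for _ in range(iters):
        # Gradient of the log-likelihood: sum of a * (u - p)
        grad = sum(a * (u - p2pl(theta, a, b))
                   for u, (a, b) in zip(responses, items))
        # Fisher information: sum of a^2 * p * (1 - p)
        info = sum(a * a * p2pl(theta, a, b) * (1 - p2pl(theta, a, b))
                   for a, b in items)
        theta += grad / info  # Newton step (Hessian = -information)
    return theta

# One right and one wrong answer on symmetric items: estimate near 0.
theta_hat = ml_theta([1, 0], [(1.0, -1.0), (1.0, 1.0)], theta=0.5)
```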

Information functions.

The best test that can be constructed is the one that provides the greatest amount of information about the latent trait. This information is quantified through "information functions". Birnbaum's (1968) information function is I(θ) = Σi [P′i(θ)]² / [Pi(θ)(1 − Pi(θ))], where Pi(θ) is the probability of answering item i correctly at trait level θ. Note that the information provided by a test is the sum of the information of each item; moreover, the contribution of each item does not depend on the other items that make up the test. In general terms, in all models the information:

  • varies with the level of aptitude.
  • the greater the slope of the curve, the more information.
  • is inversely related to the variance of the scores: the higher the variance, the less information.

Test construction.

The first and one of the most important tasks when constructing a test is the choice of items, after prior agreement on the theoretical assumptions that define the trait the test intends to measure. The concept of "item analysis" refers to the set of formal procedures carried out to select the items that will finally make up the test. The information considered most relevant regarding the items is:

  1. Item difficulty: the percentage of individuals who answer the item correctly.
  2. Discrimination: the correlation of each item with the total score on the test.
  3. Distractor (error) analysis: the influence of distractors is relevant; they affect the difficulty of the item and can cause discrimination values to be underestimated.
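A minimal sketch of the first two item-analysis statistics (difficulty as proportion correct, discrimination as the item-total Pearson correlation; function names are ours):

```python
def item_difficulty(item_scores):
    """Proportion of examinees answering the item correctly (p-value)."""
    return sum(item_scores) / len(item_scores)

def item_discrimination(item_scores, total_scores):
    """Pearson correlation between the item (0/1) and the total test score."""
    n = len(item_scores)
    mx = sum(item_scores) / n
    my = sum(total_scores) / n
    cov = sum((x - mx) * (y - my)
              for x, y in zip(item_scores, total_scores)) / n
    sx = (sum((x - mx) ** 2 for x in item_scores) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in total_scores) / n) ** 0.5
    return cov / (sx * sy)
```

An item answered correctly mostly by high scorers yields a discrimination near +1; one unrelated to the total score yields a value near 0.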

To quantify these properties, some statistics or indices are usually used, the most common being the difficulty index, the discrimination index, the reliability index, and the validity index.

Knowing the indices that must be taken into account for the selection of items, let us see the steps necessary for the construction of a test:

  1. Specification of the problem.
  2. List a broad set of items and debug them.
  3. Choice of model.
  4. Test the preselected items.
  5. Select the ideal items.
  6. Study the qualities of the test.
  7. Establish the rules of interpretation of the final test obtained.

From the previous points, it should be noted that the choice of model (point 3) will depend on the objectives the test pursues, on the characteristics and quality of the data, and on the resources available. When a model is chosen, the theoretical conditions under which it can be applied are already given; nevertheless, its virtues must be analyzed in each specific case and circumstance. The properties attributable to the models that make up Item Response Theory (IRT) can be affected by:

  • the dimensionality of the test;
  • the scarce availability of sample;
  • the lack of computing resources.

There are a number of preferences when choosing among the models:

  • Normal ogive models: not usually used in applications; their value is theoretical.
  • Rasch: suitable for horizontal equating (comparing tests of similar difficulty with similar aptitude distributions) and for producing different forms of the same test.
  • 2- and 3-parameter models: the ones that best adjust to a variety of problems; useful for detecting erroneous response patterns and for vertical equating of tests (comparing tests with different difficulty levels and different aptitude distributions).

1 and 2 parameters:

  • suitable for constructing a single scale so that skills can be compared at different levels.

The choice of model, in addition to the goal pursued, can be affected by the size of the sample. If the sample is large and representative, there will be no problem with either the classical or the latent trait model; but in IRT (item response theory) a small sample makes it necessary to choose models with a small number of parameters, even the one-parameter model.

Applications of the item response theory.

Let us look at the most common applications. (a) Test equating: sometimes it is necessary to relate the scores obtained on different tests, with two possible purposes:

  • Horizontal equating: seeks to obtain different forms of the same test.
  • Vertical equating: seeks to build a single aptitude scale covering different levels of difficulty. Regarding test equating, Lord (1980) introduces the concept of "equity", which implies that two tests are interchangeable for each subject, since applying one or the other will not change the level of aptitude estimated for that subject.
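One common way to carry out such an equalization is linear (mean-sigma) equating, which maps scores on one form so that their mean and standard deviation match the other form's. This is only a sketch of one method, with hypothetical names:

```python
def linear_equate(x_scores, y_scores):
    """Mean-sigma linear equating: returns a function mapping a score on
    form X onto the scale of form Y so that means and SDs match."""
    def mean(v):
        return sum(v) / len(v)

    def sd(v):
        m = mean(v)
        return (sum((s - m) ** 2 for s in v) / len(v)) ** 0.5

    mx, sx = mean(x_scores), sd(x_scores)
    my, sy = mean(y_scores), sd(y_scores)
    # Standardize on form X, then rescale to form Y's metric.
    return lambda score: my + (sy / sx) * (score - mx)
```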

Study of item bias: an item is biased when, on average, it gives significantly different scores in specific groups that are assumed to belong to the same population.

Adaptive or tailored tests: by means of IRT, individualized tests can be constructed that allow the true value of the trait in question to be inferred more precisely. The items are administered sequentially, and the presentation of one item or another depends on the answers given previously. Among the different types of adaptive tests, we point out the following:

  • Two-stage procedure (Lord, 1971; Betz and Weiss, 1973-1974): the same first test is administered to everyone and, depending on the results, a second test is administered.
  • Multistage procedure: the same as the previous one, except that the process includes more stages.
  • Fixed branching model (Lord, 1970, 1971, 1974; Mussio, 1973): all subjects answer the same first item and, depending on the answer, a given set of items follows.
  • Variable branching model: based on the independence between items and on the properties of maximum likelihood estimators.
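The sequential presentation described above is often driven by maximum-information item selection: at each step, administer the unused item that is most informative at the current ability estimate. A sketch under a 2PL model (function names are ours):

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1 / (1 + math.exp(-a * (theta - b)))

def next_item(theta, remaining):
    """Pick the unused item with the greatest Fisher information at the
    current ability estimate (maximum-information selection).
    remaining: list of (a, b) discrimination/difficulty pairs."""
    def info(item):
        a, b = item
        p = p2pl(theta, a, b)
        return a * a * p * (1 - p)
    return max(remaining, key=info)
```

With equal discriminations, this simply chooses the item whose difficulty is closest to the current ability estimate.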

Item bank: having a large set of items improves the quality of the test, but first the items must go through a debugging process. In order to classify the items, it is necessary to take into account the trait that the test of which the item will form part is intended to measure.

Interpretation of scores.

Scales: their purpose is to offer a continuum on which to order, classify, or determine the relative magnitude of the evaluated trait; this allows us to establish differences and similarities among people with respect to that trait. The scales used in psychology are nominal, ordinal, interval, and ratio; these scales are constructed from the test results, called "direct scores".

Standardization: to standardize a test is to transform the direct scores into others that are easily interpretable, since the standardized score reveals the position of the subject with respect to the group and allows us to make intra- and inter-subject comparisons. There are two forms of standardization:

  1. Linear: preserves the shape of the distribution and does not modify the size of the correlations.
  2. Nonlinear: preserves neither the shape of the distribution nor the size of the correlations.
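Linear standardization can be sketched as computing z-scores (mean 0, SD 1) and, optionally, rescaling them to a conventional metric such as T-scores (mean 50, SD 10); function names are ours:

```python
def z_scores(scores):
    """Linear standardization to mean 0, SD 1. Being a linear transform,
    it preserves the shape of the distribution and all correlations."""
    n = len(scores)
    m = sum(scores) / n
    s = (sum((x - m) ** 2 for x in scores) / n) ** 0.5
    return [(x - m) / s for x in scores]

def t_scores(scores):
    """T-scores: z-scores rescaled to mean 50, SD 10."""
    return [50 + 10 * z for z in z_scores(scores)]
```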

Aptitude scale: in IRT, the scale constructed corresponds to the levels of aptitude; it is characterized in that estimates and references are made directly with respect to aptitude and its scale. Furthermore, the estimated aptitude depends only on the shape of the characteristic curves of the items. Among the possible scales, we indicate two:

  1. The scale proposed by Woodcock (1978).
  2. The WITS scale, proposed by Wright (1977), a modification of the previous one.

This article is merely informative, in Psychology-Online we do not have the power to make a diagnosis or recommend a treatment. We invite you to go to a psychologist to treat your particular case.

