Opus iti — my little work

Just another WordPress.com weblog

Archive for the ‘experimental methods’ Category

Credibility

Reading the chapter on credibility from Witten and Frank’s book:

“So the question is, is the error rate on old data likely to be a good indicator of the error rate on new data? And the answer is a resounding no—not if the old data was used during the learning process to train the classifier … Why? Since the classifier has been learned from the very same training data, any estimate of performance based on that data will be optimistic, and may be hopelessly optimistic.” (pg. 121)

“People … often talk about three datasets: the training data, the validation data, and the test data. The training data is used by one or more learning schemes to come up with classifiers. The validation data is used to optimize parameters of those classifiers, or to select a particular one. Then the test data is used to calculate the error rate of the final, optimized scheme. Each of the three sets must be chosen independently” (pg. 122)

“There’s a dilemma here: to get a good classifier, we want to use as much of the data as possible for training; to get a good error estimate, we want to use as much of it as possible for testing.”

“The holdout method reserves a certain amount for testing and uses the remainder for training (and sets part of that aside for validation, if required). In practical terms, it is common to hold one-third of the data out for testing and use the remaining two-thirds for training … you should ensure that the random sampling is done in such a way as to guarantee that each class is properly represented in both training and test sets. This procedure is called stratification and we might speak of stratified holdout … A more general way … is to repeat the whole process, training and testing, several times with different random samples. In each iteration a certain proportion – say two-thirds – of the data is randomly selected for training, possibly with stratification, and the remainder used for testing. The error rates on the different iterations are averaged to yield an overall error rate.”
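The stratified holdout they describe is easy to sketch in code. The following is a minimal illustration of my own (not any library's API): indices are grouped by class label, shuffled, and split so that each class keeps the same train/test proportion.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, train_frac=2/3, seed=0):
    """Return (train_idx, test_idx) so that each class appears in
    roughly the same proportion in both the training and test sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # randomize within each class
        cut = round(len(idxs) * train_frac)     # two-thirds by default
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return sorted(train_idx), sorted(test_idx)

# A 60/30 class split stays 2:1 on both sides: 40+20 train, 20+10 test.
train, test = stratified_holdout(["a"] * 60 + ["b"] * 30)
```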

They continue in their discussion of measuring the error rate of a learning scheme on a particular dataset and recommend ten times tenfold cross-validation. Here “the data is divided randomly into ten parts, in each of which the class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme trained on the remaining nine-tenths; then its error rate is calculated on the holdout set. Thus the learning procedure is executed a total of ten times … the ten error estimates are averaged to yield an overall error estimate”. Ten is an empirically arrived-at number. Stratification will improve the results slightly. The error estimate will vary each time you run the tenfold cross-validation, so it is best to run it ten times and average the results.
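The ten-times-tenfold procedure can be sketched as the loop below. This is only an illustration of the structure (it omits the stratification step the book also recommends), and `train_and_score` is a hypothetical callback assumed to train the classifier on one set of indices and return its error rate on the other.

```python
import random

def ten_by_tenfold(data, train_and_score, seed=0):
    """Run tenfold cross-validation ten times, each time with a fresh
    random partition, and average the ten per-run error estimates."""
    rng = random.Random(seed)
    run_errors = []
    for _ in range(10):                            # ten independent runs
        idxs = list(range(len(data)))
        rng.shuffle(idxs)
        folds = [idxs[k::10] for k in range(10)]   # ten roughly equal parts
        fold_errors = []
        for fold in folds:                         # hold each part out in turn
            held_out = set(fold)
            train_idx = [i for i in idxs if i not in held_out]
            fold_errors.append(train_and_score(train_idx, fold))
        run_errors.append(sum(fold_errors) / len(fold_errors))
    return sum(run_errors) / len(run_errors)
```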

Written by opusiti

January 2, 2009 at 6:07 pm

Most experiments are local but have general aspirations

Cronbach noted that each experiment consists of units that receive the experiences being contrasted, of the treatments themselves, of observations made on the units, and of the settings in which the study is conducted. Taking the first letter from each of these four words, he defined the acronym utos to refer to the “instances on which data are collected”: the actual people, treatments, and measures that were sampled in the experiment. He then defined two problems of generalization: (1) generalizing to the “domain about which [the] question is asked”, which he called UTOS; and (2) generalizing to “units, treatments, variables, and settings not directly observed”, which he called *UTOS. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Shadish, Cook and Campbell. 2002.

There are two types of generalization that may be an issue in experimental design: (1) Construct validity: do the labels applied to the elements of the experiment (UTOS) really represent the concepts that appear in the underlying theory we wish to test? (2) External validity: generalizing either from one unit to another (for example, from one city to another) or more broadly (for example, where we randomly sample, we may be able to generalize from our sample to the population being sampled).

Written by opusiti

September 18, 2008 at 5:31 pm

Preliminary Hazard Analysis

… is used in the early life-cycle stages to identify critical system functions and broad system hazards. The identified hazards are often assessed and prioritized, and safety design criteria and requirements may be identified … Because PHA starts at the concept formation stage of a project, little detail is available, and assessments of hazard and risk levels are necessarily qualitative and limited. (Nancy Leveson, Safeware)

This applies to our project. It has been unclear how to rank the likelihoods of the different risks to the experiment.

Written by opusiti

September 16, 2008 at 5:08 pm

Validity and Reliability (more about validity)

Most of the definitions here are from:

T08. W. M. K. Trochim, “Research Methods Knowledge Base,” last accessed August 2008. [Online]. Available: http://www.socialresearchmethods.net/kb/index.php

There are other definitions; a longer-term task is to review a number of other textbooks on experimental design and cross-reference!

Reliability and Validity

Reliability and validity apply to measurements. A hypothesis is not a measurement (surely). So to say that a hypothesis is reliable and valid is not meaningful. These refer to the measurements.

There are four main types of validity: internal validity, construct validity, external validity and conclusion validity.

The first type of validity only applies where there is a causal relationship and the other types of validity apply to all types of experiment.

Internal Validity

Internal validity means that “you have evidence that what you did in the study (i.e., the program) caused what you observed (i.e., the outcome) to happen. It doesn’t tell you whether what you did for the program was what you wanted to do or whether what you observed was what you wanted to observe — those are construct validity concerns. It is possible to have internal validity in a study and not have construct validity.”

The single group threats are:

  1. History Threat
  2. Maturation Threat
  3. Testing Threat
  4. Instrumentation Threat
  5. Mortality Threat
  6. Regression Threat

External validity

“External validity refers to the approximate truth of conclusions that involve generalizations.”

There are three major threats to external validity because there are three ways you could be wrong: people, places, or times. Your critics could come along, for example, and argue that the results of your study are due to the unusual type of people who were in the study. Or they could argue that it might only work because of the unusual place you did the study in (perhaps you did your educational study in a college town with lots of high-achieving, educationally oriented kids). Or they might suggest that you did your study at a peculiar time. For instance, if you did your smoking cessation study the week after the Surgeon General issued the well-publicized results of the latest smoking and cancer studies, you might get different results than if you had done it the week before.

Construct validity

“Construct validity refers to the degree to which inferences can legitimately be made from the operationalizations in your study to the theoretical constructs on which those operationalizations were based.”

Are we measuring what we really think we are measuring?

Threats include:

  1. Inadequate Preoperational Explication of Constructs
  2. Mono-Operation Bias (did you only use one version of the treatment?)
  3. Mono-Method Bias (did you cross-validate your measurements?)
  4. Interaction of Different Treatments
  5. Interaction of Testing and Treatment
  6. Restricted Generalizability Across Constructs
  7. Confounding Constructs and Levels of Constructs

There are also “social threats”: hypothesis guessing, evaluator apprehension, and experimenter expectancies.

Conclusion Validity

“Conclusion validity is the degree to which conclusions we reach about relationships in our data are reasonable.” [T08]

Ok, there is a bit of difference here.

Distinguish this from internal validity, which concerns whether the cause was what we expect it to be. So we could have an experiment with conclusion validity (there is some measured relationship between our variables) that is not internally valid, because the relationship is explained by some external, uncontrolled factor.

In this case, we have the goal of “drawing a valid conclusion about the relationship between typing styles and identity where the user is typing a strong password.”

Threats to conclusion validity include:

  1. Noise
    1. low reliability of measures
    2. poor reliability of treatment implementation
    3. noise caused by random irrelevancies in the setting
    4. random heterogeneity of respondents. If you have a very diverse group of respondents, they are likely to vary more widely on your measures or observations.
  2. Statistical power
    1. fishing and the error rate problem
    2. violated assumptions of statistical tests.
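The “fishing and the error rate problem” threat is just arithmetic: the more comparisons you run, the more likely at least one looks significant purely by chance. A quick illustration (the function name is my own):

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one spurious 'significant' result when
    n_tests independent comparisons are each run at level alpha."""
    return 1 - (1 - alpha) ** n_tests

# Twenty comparisons at the conventional 0.05 level: the odds of at
# least one false positive are already close to two in three.
print(familywise_error(0.05, 20))
```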

Written by opusiti

September 3, 2008 at 12:57 am

Operational definitions

A specific statement about how an event or behavior will be measured to represent the concept under study. See also indicator and research methods. http://www.utexas.edu/academic/diia/assessment/iar/glossary.php

Notes from  Research Methods class (Roy Maxion):

1. Identify the characteristic of interest.
2. Get a dictionary definition.
3. Select the measuring instrument.
4. Describe the test method.
5. State the decision criteria.
6. Document the operational definition.
7. Test the operational definition.

Operational definition – example (attribute)
Characteristic of interest:
– Number of black spots per radiator grill.
Measuring instrument:
– The observation will be performed with the naked eye (or with corrective lenses if normally worn), under the light available in the work station (in 100% working order, i.e., no burned-out bulbs).
Method of test:
– The number of black spots per radiator grill will be counted by taking samples at the work station. The sample should be studied at a distance of 18 inches (roughly half an arm’s length) from the eye. Only the top surface of the grill is to be examined.
Decision criteria:
– Wipe the top surface of the grill with the palm of your hand and look for any black specks embedded in the plastic. Any observed black speck of any size counts as a black spot.
Test the definition:
– Try it out with N observers; see if there is agreement among them.

Operational definition – example (variable)
Characteristic of interest:
-Diameter of 48-inch rod – acceptability for delivery.
Measuring instrument:
-Micrometer
Method of test:
– The sample size is 3. Measure 3 rods every hour. When the grinder releases the rod, take one measurement each at (a) 8″ down, (b) 24″ down, and (c) 40″ down from the notched end. Tighten the micrometer as much as possible. Record to 4 decimal points. If the fifth number to the right of the decimal point is 5 or higher, round up the fourth number by one.
Decision criteria:
– (A) If the three measures do not differ by more than 0.0001, then the rod passes the invariance test. (B) if the three measures also are within 0.0005 of ½ inch, then the rod passes the diameter test.
Test the definition:
– Test the procedure on calibrated rods; use several people; validate with inter-rater reliability test.
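As a sketch, the decision criteria in the rod example translate directly into code. The function name and the integer scaling are my own choices (scaling to units of 0.0001 inch makes the tolerance comparisons exact):

```python
def rod_passes(measures, nominal=0.5):
    """Apply the decision criteria above: (A) the invariance test,
    where the measures differ by at most 0.0001, and (B) the diameter
    test, where each measure is within 0.0005 of the nominal 1/2 inch.
    Measures are recorded to four decimal places, so we compare in
    units of 0.0001 inch to avoid floating-point surprises."""
    ticks = [round(m * 10000) for m in measures]
    nominal_ticks = round(nominal * 10000)
    invariance = max(ticks) - min(ticks) <= 1
    diameter = all(abs(t - nominal_ticks) <= 5 for t in ticks)
    return invariance and diameter

rod_passes([0.5001, 0.5000, 0.5001])  # passes both tests
rod_passes([0.5001, 0.4998, 0.5001])  # fails the invariance test
```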

Question: what is the difference between an attribute and a variable?  Is one “absolute” and the other a sample?

Written by opusiti

August 28, 2008 at 7:52 pm

Some definitions

From the research course:

External validity: Do the results hold across different settings? Can you generalise to a larger population?

Internal validity: Is there a causal relation between two variables? (no confounds)

** What does confounds mean in this context?

** What do we mean by power of a statistic?

The NIST speaker recognition evaluation: Overview methodology, systems, results, perspective.
Doddington, G.R., Przybocki, M.A., Martin, A.F., and Reynolds, D.A.
Speech Communication, 2000, 31(2-3), 225-254.

Identification: Deciding amongst a set of candidates whom the biometric belongs to.

Verification: Whether the biometric belongs to a particular candidate.

Closed-set identification task (aka closed-world assumption): When the actual subject is always one of the candidates.

Open-set identification task: When the actual subject may not be one of the candidates.

Sheep: well-behaved subjects who dominate the target population.

Goats: a minority group who tend to determine the performance of the system through their disproportionate contribution of errors.

Wolves: imposters who have unusually good success at impersonating many different target speakers.

Lambs: target speakers who seem unusually susceptible to many different imposters.

Weighted detection cost function:

C_det = C_miss × E_miss × P_target + C_fa × E_fa × (1 − P_target)

C_miss and C_fa are the relative costs of the two detection errors (misses and false alarms), E_miss and E_fa are the corresponding error rates, and P_target is the a priori probability of the target.
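As a sketch, the cost function is a one-liner. The default costs and prior here (C_miss = 10, C_fa = 1, P_target = 0.01) are, to my understanding, the values used in the NIST evaluations, but treat them as illustrative:

```python
def detection_cost(e_miss, e_fa, p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Weighted detection cost:
    C_det = C_miss * E_miss * P_target + C_fa * E_fa * (1 - P_target)."""
    return c_miss * e_miss * p_target + c_fa * e_fa * (1 - p_target)

# A system missing 10% of targets with a 2% false-alarm rate:
# 10 * 0.10 * 0.01 + 1 * 0.02 * 0.99 = 0.0298
cost = detection_cost(e_miss=0.10, e_fa=0.02)
```

Note how the low target prior makes false alarms dominate the cost even with a ten-times-higher miss cost.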

Written by opusiti

August 27, 2008 at 6:34 pm

Some further reading on experimental methods

For the reader eager to learn about the role of experimentation in general, I suggest the following literature:

  1. Chalmers, A.F., What Is This Thing Called Science? The Open University Press, Buckingham, England, 1988. Addresses the philosophical underpinnings of the scientific process, including inductivism, Popper’s falsificationism, Kuhn’s paradigms, objectivism, and the theory dependence of observation.
  2. Latour, B., Science in Action: How to Follow Scientists and Engineers through Society, Harvard University Press, Cambridge, Mass., 1987. Describes the social processes of science-in-the-making as opposed to ready-made science. Latour illustrates the fact-building and convincing power of laboratories with fascinating examples.
  3. Basili, V.R., “The Role of Experimentation in Software Engineering: Past, Current, and Future.” Proc. 18th Int. Conf. Software Eng., IEEE Computer Soc. Press, Los Alamitos, Calif., March 1996.
  4. Frankl, P.G., and S.N. Weiss, “An Experimental Comparison of the Effectiveness of Branch Testing and Data Flow Testing,” IEEE Trans. Software Eng., Aug. 1993, pp. 774-787.
  5. Brett, B., “Comments on The Cost of Selective Recompilation and Environment Processing,” ACM Trans. Software Eng. and Methodology, 1995, pp. 214-216. A good example of a repeated experiment in compiling.
  6. Denning, P.J., “Performance Evaluation: Experimental Computer Science at Its Best,” ACM Performance Evaluation
    Review, ACM Press, New York, 1981, pp. 106-109. Argues that performance evaluation is an excellent form of experimentation in computer science.
  7. Hennessy, J.L., and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, Calif., 1990. A landmark in making computer architecture research quantitative.
  8. Cohen, P.R., Empirical Methods for Artificial Intelligence, MIT Press, Cambridge, Mass., 1995. Covers empirical methods in AI, but a large part applies to all of computer science.
  9. Fenton, N.E., and S.L. Pfleeger. Software Metrics: A Rigorous and Practical Approach (2nd edition), Thomson Computer Press, New York, 1997. Excellent discussion of experimental designs as well as a wealth
    of material on experimentation with software.
  10. Christensen, L.B., Experimental Methodology, Allyn and Bacon, New York, 1994.
  11. Judd, C.M., E.R. Smith, and L.H. Kidder, Research Methods in Social Relations, Holt, Rinehart, and Winston,
    1991. General experimental methods.
  12. Moore, D.S., and G.P. McCabe, Introduction to the Practice of Statistics, W.H. Freeman and Co., New York, 1993. Excellent introductory text on statistics.
  13. Venables, W.N. and B.D. Ripley, Modern Applied Statistics with S-PLUS, Springer Verlag, New York, 1997. One of the best statistical packages available today is SPlus. Venables and Ripley’s book is both a guide to using S-Plus and a course in modern statistical methods. Keep in mind, however, that sophisticated statistical analysis is no substitute for good experimental design.

From Tichy, Walter (1998). Should Computer Scientists Experiment More?

Written by opusiti

August 26, 2008 at 7:40 pm

Zobel on research methods

I was recommended a paper by Justin Zobel on research methods that makes some interesting points about record keeping. In particular, he makes practical suggestions for record keeping around experiments that are in line with good practice and look quite workable for computer science.

He observes that current practice does not meet the minimum standards followed in other disciplines. In particular, it is often very hard (probably impossible) to replicate an experiment from the description found in the average computer science paper.

His key suggestions in this area were (note that the following is only partially paraphrased and is mostly direct quotes!):

1. Notebooks. These should be a guide to the experiment. They can be electronic, although they should have some mechanism for checking their integrity to defeat attempts to falsify results (timestamps and regular dumping to backup files might be a good way to achieve this). The notebooks should record: dates; daily notes; names and locations of code, scripts, input, and other files; important references and web addresses; minutes of discussions; bug reports; locations and identifying marks of paper records; experimental parameters; and intent, outcomes, and interpretation of experiments (note that this blog covers some of this). He says that such notes can provide a “guidebook” to the experiment and should contain descriptions of ideas and show the progress of the research. Notes should be on the order of a few lines so as to make maintenance less onerous.

2. Code. At an absolute minimum, researchers should preserve the exact code used to yield any published results and, if possible, the exact input. We may not need to keep every variation of the code, because some changes are very small, almost trivial; it is enough that the changes are small enough to be quickly re-made by a competent programmer and are documented in the notebooks. The notebook should discuss the kinds of changes that were made and why.

3. Logs. These should be complete transcripts of the output of the experiment, not just the data after it has been reduced by some process for human consumption. Note that we should keep either all of the logs, or a subset whose selection criteria are chosen ahead of time to avoid investigative bias.

Written by opusiti

August 26, 2008 at 6:48 pm

Posted in experimental methods
