Correlation Analysis

Description:

A biologist would use a correlation analysis to investigate whether two chosen variables are associated. The test has two key elements. First, the correlation coefficient between the two data sets is calculated. Next, a significance test is carried out, producing a significance value (commonly known as the p-value). The p-value is used to assess the null hypothesis, which states that there is no correlation between the two variables. The correlation coefficient describes the direction and strength of the relationship: it ranges from -1 to 1, so the two variables may show a positive correlation, a negative correlation or no correlation at all.

In RStudio, the correlation analysis is carried out with the ‘cor.test’ function. The experimental values in task 1 were stored in the vectors ‘Bangladesh’ and ‘Pakistan’, which are the inputs for the test. The analysis can therefore be run with the following script: ‘model=cor.test(Bangladesh, Pakistan)’. One requirement is that the ‘Bangladesh’ and ‘Pakistan’ vectors must be equal in length. The ‘cor.test’ function has two primary outputs: a p-value and a correlation coefficient. The correlation coefficient alone can be retrieved with ‘cor.test(Bangladesh, Pakistan)$estimate’, and the p-value alone with ‘cor.test(Bangladesh, Pakistan)$p.value’. If the correlation coefficient is positive, the ‘Bangladesh’ and ‘Pakistan’ values have a positive correlation; if it is negative, they have a negative correlation. If the p-value is smaller than the critical value (0.05), the null hypothesis is rejected and the two variables are considered significantly correlated. If the p-value is larger than the critical value, the null hypothesis cannot be rejected and there is no significant evidence that the two data sets are correlated.
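The steps above can be sketched as follows. The numbers are made-up placeholder values, not the task 1 measurements; only the vector names and the ‘cor.test’ calls come from the text.

```r
# Placeholder values for illustration only (NOT the task 1 data).
Bangladesh <- c(27.4, 28.1, 29.6, 30.2, 31.9, 33.0, 34.5, 35.1)
Pakistan   <- c(4.7, 5.9, 5.2, 6.4, 6.0, 6.8, 7.1, 6.6)

# cor.test requires both vectors to have the same length.
stopifnot(length(Bangladesh) == length(Pakistan))

# Pearson correlation test (the default method of cor.test).
model <- cor.test(Bangladesh, Pakistan)

r <- model$estimate   # correlation coefficient
p <- model$p.value    # p-value for H0: no correlation

if (p < 0.05) {
  message("Reject H0: significant correlation, r = ", round(r, 2))
} else {
  message("Cannot reject H0: no significant evidence of correlation")
}
```

Saving the result once in ‘model’ avoids running the test twice just to read its two outputs.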

Hypothesis of the correlation analysis:

The general null hypothesis for a correlation analysis can be written as H0: ρ = 0, where ρ is the correlation coefficient between Bangladesh’s and Pakistan’s rice product. The null hypothesis states that the rice product in Bangladesh and the rice product in Pakistan are not correlated; that is, the correlation coefficient (ρ) is presumed to be zero.

Test Design:

Since task 1 asks about the association between the rice product in Bangladesh and Pakistan between 1992 and 2011, a correlation analysis was used to examine whether the two data sets are related. First, the variables were placed into two separate vectors using the ‘c’ function. The ‘Bangladesh’ and ‘Pakistan’ vectors were then used as the two inputs of the ‘cor.test’ function. The outputs of ‘cor.test’ were saved in a variable, ‘model’, for easy access. Finally, a scatter plot was produced using the ‘plot’ function to visualize the data, and the ‘legend’ function was used to print the p-value and correlation coefficient onto the scatter plot.
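A minimal sketch of this design is shown here, assuming placeholder values in place of the task 1 data. The axis labels, title and legend position are illustrative choices, not taken from the original script.

```r
# Placeholder values for illustration only (NOT the task 1 data).
Bangladesh <- c(27.4, 28.1, 29.6, 30.2, 31.9, 33.0, 34.5, 35.1)
Pakistan   <- c(4.7, 5.9, 5.2, 6.4, 6.0, 6.8, 7.1, 6.6)

model <- cor.test(Bangladesh, Pakistan)

# Text for the legend: p-value and correlation coefficient.
label_text <- c(
  paste("p-value =", signif(model$p.value, 3)),
  paste("r =", round(model$estimate, 2))
)

# Scatter plot of one variable against the other.
plot(Bangladesh, Pakistan,
     xlab = "Bangladesh", ylab = "Pakistan",
     main = "Rice product, Bangladesh vs Pakistan")

# Print the statistics in the top-left corner, without a legend box.
legend("topleft", legend = label_text, bty = "n")
```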

Discussion:

The p-value calculated was 0.00158. Because this is smaller than 0.05, the rice product in Bangladesh and Pakistan is significantly correlated and the null hypothesis is rejected. As the correlation coefficient is 0.66, the two variables have a clear positive correlation; that is, when the rice product in Bangladesh increases, so does the product in Pakistan. Figure 1 below shows the two variables in a scatter plot. The scatter plot is consistent with the correlation analysis results, as the points follow a trend from the bottom-left to the top-right corner.
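As a rough consistency check on these figures, the p-value can be recomputed from the correlation coefficient alone. This sketch assumes a Pearson correlation and one observation per year from 1992 to 2011 (n = 20), neither of which is stated explicitly in the report. Because r is rounded to 0.66, the result should only approximately match the reported p-value.

```r
r <- 0.66        # correlation coefficient reported above (rounded)
n <- 20          # ASSUMPTION: one data point per year, 1992 to 2011

# Test statistic for H0: rho = 0, t-distributed with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value.
p <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, p = p))
```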

My R script for Analysing this data set:

χ² Test

Description:

A biologist would use a χ² test to determine whether the observed data have the same distribution as the expected data. The test is usually applied to frequency or proportion data, which generally do not follow the normal distribution. Two variables are required to carry out the test, the observed data and the expected data, and the two must be of equal length. During the test the two data sets are compared one-to-one (pairwise distance) to measure the goodness of fit between them. If the expected data are not available, one can assume that each expected value is the average of the observed data.

In RStudio, the ‘chisq.test’ function is used to carry out a χ² test. In task 2 the two variables (containing Transposon Hit frequency data) were first split into the observed (‘O’) and expected (‘E’) vectors. The test can then be executed using: ‘model=chisq.test(O,p=E/sum(E))’. The ‘E/sum(E)’ part converts the expected data into a probability vector. The key outputs of the test are a χ² statistic and a p-value. Once ‘chisq.test’ has been run, the χ² value alone can be accessed with ‘model$statistic’, and the p-value alone with ‘model$p.value’. If the p-value is lower than the critical value (0.05), the two variables (O and E) are significantly different and the null hypothesis is rejected. In contrast, if the p-value is larger than the critical value, there is no significant evidence that the two variables differ and the null hypothesis cannot be rejected.
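The calls described above can be sketched as follows. The counts are made-up placeholder values, not the task 2 data; only the vector names and the ‘chisq.test’ call come from the text.

```r
# Placeholder counts for illustration only (NOT the task 2 data).
O <- c(25, 30, 20, 25)   # observed frequencies
E <- c(25, 25, 25, 25)   # expected frequencies

# chisq.test takes the expected data as probabilities that sum to 1.
model <- chisq.test(O, p = E / sum(E))

chi2 <- model$statistic  # chi-squared statistic
p    <- model$p.value    # p-value for H0: O follows the expected distribution

if (p < 0.05) {
  message("Reject H0: O and E differ significantly")
} else {
  message("Cannot reject H0: no significant difference between O and E")
}
```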

Hypothesis of the χ² test:

The general null hypothesis for a χ² test is H0: O = E, where O is the vector of observed values and E is the vector of expected values. The hypothesis states that the observed Transposon Hit frequencies are equal to the expected Transposon Hit frequencies.
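For reference, the distance between the observed and expected values is measured by the standard Pearson χ² statistic. The symbols O_i and E_i (the i-th observed and expected frequencies) and k (the number of categories) are introduced here for illustration and do not appear in the original script.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```

Under H0 this statistic approximately follows a χ² distribution with k - 1 degrees of freedom, which is where the p-value comes from.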

Test Design:

In task 2, we were given the observed and expected Transposon Hit frequencies. These two sets of data were first placed into separate vectors using the ‘c’ function. The two vectors were then used as the input data for the ‘chisq.test’ function: the observed (‘O’) vector is given first, and the probability vector derived from the expected (‘E’) data is given second. The output of ‘chisq.test’ was saved in a variable, ‘model’, so that the p-value and χ² statistic could each be extracted easily. The data were visualized using the ‘boxplot’ function, and the p-value and χ² statistic were printed onto the boxplot using the ‘legend’ function.
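One possible sketch of this design is given here, again with placeholder counts rather than the task 2 data. The box names, axis label and legend position are assumptions made for illustration.

```r
# Placeholder counts for illustration only (NOT the task 2 data).
O <- c(25, 30, 20, 25)   # observed frequencies
E <- c(24, 28, 22, 26)   # expected frequencies

# Observed vector first, expected probabilities second.
model <- chisq.test(O, p = E / sum(E))

# One box per vector, labelled on the x axis.
boxplot(O, E, names = c("Observed", "Expected"),
        ylab = "Transposon hit frequency")

# Print both statistics onto the plot, without a legend box.
legend("topright", bty = "n",
       legend = c(paste("p-value =", signif(model$p.value, 3)),
                  paste("chi-squared =", round(model$statistic, 2))))
```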

Discussion:

The test gave a p-value of 0.100364. Because this value is greater than 0.05, the observed and expected Transposon Hit frequencies are not significantly different and the null hypothesis cannot be rejected. Figure 2 below shows the boxplot of the observed and expected transposon hit frequencies. The χ² test result is consistent with the boxplot: the distributions of the observed and expected transposon hit frequencies are similar, and both have a similar median value.

My R script for Analysing the data set:

Description of the T test:

A biologist would use the t test to determine whether the means of two sets of data differ significantly. Before running the test, its parameters need to be decided: first, whether the data come from one or two populations; second, whether the data suit a one-tailed or two-tailed t test; and finally, whether the values are paired or unpaired.

In RStudio, the ‘t.test’ function is used for a t test. In task 3 the two reading speed variables were placed into the ‘basic’ and ‘colour’ vectors (these are the two inputs). In its simplest form the t test is performed with the script ‘model=t.test(colour,basic)’. Unlike the correlation analysis, an unpaired t test does not require the two vectors to be equal in length, although a paired t test does. There are two key outputs from the t test: the p-value and the t statistic. The t statistic alone can be extracted using ‘model$statistic’, and the p-value alone using ‘model$p.value’. As a general rule, if the p-value is smaller than the critical value (0.05), the means of the data in the two vectors differ significantly and the null hypothesis is rejected. In contrast, if the p-value is greater than the critical value, there is no significant evidence that the means of the two populations differ, and the null hypothesis cannot be rejected.
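The basic call can be sketched as follows, using made-up placeholder reading speeds rather than the task 3 data. The vectors are deliberately given different lengths to show that the unpaired form accepts this.

```r
# Placeholder reading speeds for illustration only (NOT the task 3 data).
basic  <- c(52, 48, 55, 60, 47, 51)
colour <- c(58, 50, 59, 66, 49, 57, 61)   # lengths may differ when unpaired

# Two-sample, two-tailed, unpaired t test (Welch's test is R's default).
model <- t.test(colour, basic)

t_stat <- model$statistic   # t statistic
p      <- model$p.value     # p-value for H0: the two means are equal

if (p < 0.05) {
  message("Reject H0: the means differ significantly")
} else {
  message("Cannot reject H0: no significant difference in means")
}
```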

Hypothesis of the T test:

In task 3 the null hypothesis proposed is H0: μcolour ≤ μbasic. μbasic is the mean reading speed of 20 children reading the basic book, whereas μcolour is the mean reading speed of 20 children reading the colour book. The hypothesis states that the mean reading speed of children reading the basic book is greater than or equal to the mean reading speed of children reading the colour book.

Test Design:

Because task 3 asks whether the mean reading speed with the colour book is faster than with the basic book, a two-population, one-tailed t test was used. The test is also paired, because the two sets of data are not independent: reading speed is measured twice (once for the colour book and once for the basic book). As a result, I used the ‘TRUE’ setting for ‘paired’. Because the null hypothesis is H0: μcolour ≤ μbasic, the alternative is that the colour mean is greater, so I used the ‘g’ (greater) alternative. The code therefore becomes: ‘t.test(colour,basic,alternative="g",paired=TRUE)’. The vectors for the two populations were created using the ‘c’ function and passed to the t test as its two inputs. The p-value and t statistic outputs were saved in a variable, ‘model’. To visualize the data, the ‘boxplot’ function was used, with both vectors shown on the same plot. Finally, the t statistic and p-value were printed onto the boxplot using the ‘legend’ function.
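The paired, one-tailed test can be sketched as follows, with placeholder values standing in for the task 3 data. The second call is an addition for illustration: it shows that a paired t test gives the same result as a one-sample t test on the within-pair differences, which is why the pairing matters.

```r
# Placeholder reading speeds for illustration only (NOT the task 3 data).
basic  <- c(52, 48, 55, 60, 47, 51, 58, 49)
colour <- c(58, 50, 59, 66, 49, 57, 61, 50)

# Paired data must line up element by element.
stopifnot(length(basic) == length(colour))

# H0: mean(colour) <= mean(basic); alternative "greater" tests colour > basic.
model <- t.test(colour, basic, alternative = "greater", paired = TRUE)

# Equivalent one-sample test on the within-pair differences.
check <- t.test(colour - basic, mu = 0, alternative = "greater")

print(c(paired = model$p.value, differences = check$p.value))
```

The order of the arguments matters for a one-tailed test: with ‘colour’ first, "greater" means colour is greater than basic.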

Discussion:

The p-value produced was 0.0013. Because this is smaller than 0.05, the null hypothesis is rejected, and the statement “the mean reading speed of children reading the basic book is greater than or equal to the mean reading speed of children reading the colour book” can be rejected. Figure 3 shows the boxplots of the two data sets. The t test result is consistent with the boxplot, as the two sets of data differ moderately: the median reading speed with the colour book is greater than the median with the basic book.

My R script for analysing the data set:

Brief Description of the ANOVA test:

A biologist could use the ANOVA test when the data come from more than two populations. The test determines whether the means of these populations differ significantly. It works by comparing the variance within each population with the variance between the populations. Task 4 requires a one-way ANOVA test; therefore, the population data need to be split into an observation variable and a label variable.

In RStudio, an ANOVA test is carried out using the ‘aov’ function. This function needs two inputs: a numeric observation vector and a labels vector. In task 4 these vectors were named ‘observation’ and ‘label’ respectively. To run the ANOVA test, the ‘observation’ and ‘label’ vectors are linked in a formula using the ‘~’ operator: ‘model=aov(observation~factor(label))’. The formula ‘observation~factor(label)’ describes the observation values grouped by their label. One rule when conducting an ANOVA test is that the ‘observation’ and ‘label’ vectors must have the same length. To create a summary ANOVA table we use the code ‘stats=summary(model)[[1]]’, where the input is the ANOVA model itself. The summary table contains two key statistical values: the p-value and the F ratio (Fisher ratio). To obtain just the p-value use the code ‘stats$Pr[1]’ (R partially matches ‘Pr’ to the ‘Pr(>F)’ column); similarly, to obtain just the F ratio use ‘stats$F[1]’. After conducting the test, if the p-value is smaller than the critical value (0.05), at least one of the three population means differs significantly from the others and the null hypothesis is rejected. However, if the p-value is larger than the critical value, there is no significant evidence that the means of the three populations differ and the null hypothesis cannot be rejected.
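The calls described above can be sketched as follows. The flowering times are made-up placeholder values, not the task 4 data, and the split into four values per fertilizer is an assumption. The sketch uses the full column names ‘Pr(>F)’ and ‘F value’ rather than relying on partial matching.

```r
# Placeholder flowering times for illustration only (NOT the task 4 data).
observation <- c(31, 34, 29, 33,   # fertilizer A
                 35, 32, 36, 34,   # fertilizer B
                 30, 33, 31, 35)   # fertilizer C
label <- rep(c("A", "B", "C"), each = 4)

# Both vectors must have the same length.
stopifnot(length(observation) == length(label))

# One-way ANOVA: observations grouped by their label.
model <- aov(observation ~ factor(label))

# summary() returns a list; its first element is the ANOVA table.
stats <- summary(model)[[1]]

p_value <- stats[["Pr(>F)"]][1]   # p-value for the label effect
f_ratio <- stats[["F value"]][1]  # F ratio for the label effect

print(c(F = f_ratio, p = p_value))
```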

The Hypothesis for the ANOVA test:

Task 4’s null hypothesis for the ANOVA is H0: μA = μB = μC, where μA, μB and μC are the mean rose flowering times under the three different fertilizers (A, B and C). The hypothesis states that the mean rose flowering times under the three fertilizer treatments are equal.
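For reference, the F ratio used to test this hypothesis in a one-way ANOVA compares the between-group and within-group variability. The symbols k (number of groups) and N (total number of observations) are introduced here for illustration; with three fertilizers and 12 data values they would be k = 3 and N = 12.

```latex
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
```

Under H0 this ratio follows an F distribution with k - 1 and N - k degrees of freedom.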

Test Design:

Task 4 asks whether three populations have the same mean, so an ANOVA test was used to investigate the data. I used the ‘c’ function to create three vectors for the three fertilizer types (A, B and C). The numeric ‘observation’ vector was then created by combining the three fertilizer vectors using ‘c’. A label vector was also created, using the ‘c’ and ‘rep’ functions, to identify the fertilizer for each of the 12 data values. The ANOVA summary table was created using the ANOVA model as the input, which allowed the p-value and F ratio to be accessed (using ‘stats$’). The data were visualized using the ‘boxplot’ function, and the p-value was printed onto the plot using the ‘legend’ function.
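A minimal sketch of this design follows. The values are placeholders rather than the task 4 data, the four-per-fertilizer split is an assumption based on the 12 data values mentioned above, and the axis labels and legend position are illustrative choices.

```r
# Placeholder flowering times for illustration only (NOT the task 4 data).
A <- c(31, 34, 29, 33)
B <- c(35, 32, 36, 34)
C <- c(30, 33, 31, 35)

# Combine the three vectors and build a matching label for each value.
observation <- c(A, B, C)
label <- c(rep("A", length(A)), rep("B", length(B)), rep("C", length(C)))

model <- aov(observation ~ factor(label))
stats <- summary(model)[[1]]

# One box per fertilizer.
boxplot(observation ~ factor(label),
        xlab = "Fertilizer", ylab = "Flowering time")

# Print the p-value onto the plot, without a legend box.
legend("topright", bty = "n",
       legend = paste("p-value =", signif(stats[["Pr(>F)"]][1], 3)))
```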

Discussion:

The p-value generated was 0.297715. Because this is greater than 0.05, the mean values of the three fertilizer treatments do not differ significantly and the null hypothesis is not rejected. Figure 4 shows the boxplot of the data. The boxplot is consistent with the ANOVA test results: the rose flowering times under the three treatments do not display a large difference across A, B and C, although the median values vary slightly.

My R script for analysing this data set: