Correlation analysis Description:
A biologist would use a correlation analysis to investigate whether two chosen
variables are associated. There are two key elements to the test. First, the
correlation coefficient between the two data sets is calculated. Next, a
significance analysis is conducted, producing a significance value (commonly
known as a p-value). The p-value is used to assess the null hypothesis that
there is no correlation between the two variables. The correlation coefficient
indicates the direction and strength of the relationship between the two
variables: a positive correlation, a negative correlation or no correlation at
all.
In RStudio, a correlation analysis is carried out with the ‘cor.test’
function. The experimental values in task 1 were recorded in the vectors
‘Bangladesh’ and ‘Pakistan’, which are the inputs for the test. The analysis
can therefore be run with the following script:
model=cor.test(Bangladesh, Pakistan).
One requirement is that the ‘Bangladesh’ and ‘Pakistan’ vectors must be equal
in length. There are two primary outputs of the cor.test function: a p-value
and a correlation coefficient. The correlation coefficient alone can be
retrieved using ‘cor.test(Bangladesh, Pakistan)$estimate’. Similarly, the
p-value alone can be displayed using ‘cor.test(Bangladesh, Pakistan)$p.value’.
If the correlation coefficient is positive, the ‘Bangladesh’ and ‘Pakistan’
values have a positive correlation. In contrast, if the correlation
coefficient is negative, the values have a negative correlation. If the
p-value is smaller than the critical value (0.05), the null hypothesis is
rejected and the two variables are concluded to be significantly correlated.
If the p-value is larger than the critical value, the null hypothesis cannot
be rejected and there is insufficient evidence that the two data sets are
correlated.
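The cor.test steps described above can be sketched as follows. This is only a minimal illustration: the numbers are made-up placeholder values, not the task 1 data, and the names ‘r’, ‘p’, ‘direction’ and ‘significant’ are introduced here for convenience.

```r
# Hypothetical values for illustration only; NOT the task 1 data.
Bangladesh <- c(27, 28, 25, 27, 28, 28, 30, 34, 38, 36)
Pakistan   <- c(4.7, 6.0, 5.2, 5.9, 6.5, 6.5, 7.0, 7.7, 7.2, 5.8)

# cor.test() needs two numeric vectors of equal length.
stopifnot(length(Bangladesh) == length(Pakistan))

model <- cor.test(Bangladesh, Pakistan)  # Pearson correlation by default

r <- unname(model$estimate)  # correlation coefficient
p <- model$p.value           # significance value

# The sign of r gives the direction of the correlation.
direction <- if (r > 0) "positive" else if (r < 0) "negative" else "none"

# TRUE means the null hypothesis (no correlation) is rejected at 0.05.
significant <- p < 0.05

cat("r =", round(r, 2), "| p =", signif(p, 3),
    "| direction:", direction, "| reject H0:", significant, "\n")
```

Saving the result in ‘model’ means the test is run once and both values are read from the same object, rather than calling cor.test twice.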
Hypothesis of the correlation analysis:
The general null hypothesis for a correlation analysis can be written as
H0: ρ = 0. In this equation ρ is the correlation coefficient between
Bangladesh’s and Pakistan’s rice production. The null hypothesis states that
rice production in Bangladesh and Pakistan is not correlated; that is, the
correlation coefficient (ρ) is presumed to be zero.
Since task 1 asks about the association between rice production in Bangladesh
and Pakistan between 1992 and 2011, a correlation analysis can be used to
examine whether the two data sets are related. First, I placed the variables
into two separate vectors using the ‘c’ function. The ‘Bangladesh’ and
‘Pakistan’ vectors were then used as the two inputs of the ‘cor.test’
function. The outputs of ‘cor.test’ were saved in a variable, ‘model’, for
easy access. Finally, a scatter plot was produced using the ‘plot’ function to
visualize the data, and the ‘legend’ function was used to print the p-value
and correlation coefficient onto the plot.
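A rough sketch of the plotting step, again with placeholder numbers rather than the task 1 data. The axis labels, the title, the legend position and the name ‘label_text’ are assumptions made for this illustration.

```r
# Hypothetical values for illustration only; NOT the task 1 data.
Bangladesh <- c(27, 28, 25, 27, 28, 28, 30, 34, 38, 36)
Pakistan   <- c(4.7, 6.0, 5.2, 5.9, 6.5, 6.5, 7.0, 7.7, 7.2, 5.8)

model <- cor.test(Bangladesh, Pakistan)

# Scatter plot of one variable against the other.
plot(Bangladesh, Pakistan,
     xlab = "Rice production, Bangladesh",
     ylab = "Rice production, Pakistan",
     main = "Hypothetical example data")

# Build the annotation text from the saved test result.
label_text <- c(paste("r =", round(model$estimate, 2)),
                paste("p =", signif(model$p.value, 3)))

# Print the two values in the top-left corner, without a box.
legend("topleft", legend = label_text, bty = "n")
```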
The p-value calculated was 0.00158. As this is below 0.05, rice production in
Bangladesh and Pakistan is significantly correlated and the null hypothesis is
rejected. As the correlation coefficient is 0.66, the two variables have a
clear positive correlation, i.e. when rice production in Bangladesh increases,
so too does production in Pakistan. Figure 1 below shows the two variables in
a scatter plot. The plot is consistent with the correlation analysis results,
as the points trend from the bottom left to the top right.
My R script for Analysing this data set:
Description of the χ² test:
A biologist would use a χ² test to determine whether the observed data follow
the same distribution as the expected data. The test is usually used to
examine count or proportion data, which rarely follow a normal distribution.
Two variables are required to carry out the test: the observed data and the
expected data, and the two must be of equal length. During the test the two
data sets are compared one-to-one (pairwise) to assess the goodness of fit
between them. If the expected data are not available, one can assume that each
expected value is the mean of the observed data.
In RStudio, the ‘chisq.test’ function is used to carry out a χ² test. In task
2 the two variables (containing transposon hit frequency data) were first
placed into the observed (‘O’) and expected (‘E’) vectors. The test can then
be executed using ‘model=chisq.test(O,p=E/sum(E))’. The ‘E/sum(E)’ term
converts the expected data into a probability vector. The key outputs of the
test are a χ² statistic and a p-value. Once ‘chisq.test’ has been run, the χ²
value alone can be accessed using ‘model$statistic’. Similarly, the p-value
alone can be accessed using ‘model$p.value’. If the p-value is lower than the
critical value (0.05), the observed and expected data (O and E) are
significantly different and the null hypothesis is rejected. In contrast, if
the p-value is larger than the critical value, there is no evidence that the
two differ significantly and the null hypothesis cannot be rejected.
Hypothesis of the χ² test:
The general null hypothesis used when carrying out a χ² test is H0: O = E,
where O is the vector of observed values and E is the vector of expected
values. The hypothesis states that the observed transposon hit frequencies are
equal to the expected transposon hit frequencies.
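As a hedged illustration of the ‘chisq.test(O,p=E/sum(E))’ call and of what the statistic measures, the sketch below uses made-up counts (not the task 2 transposon data). The names ‘chi_sq’, ‘by_hand’ and ‘E_fallback’ are introduced here only for the example. The hand calculation matches the function output because the observed and expected counts were chosen to have the same total.

```r
# Hypothetical counts for illustration only; NOT the task 2 data.
O <- c(18, 22, 25, 35)  # observed counts
E <- c(25, 25, 25, 25)  # expected counts (same total as O)

# E / sum(E) turns the expected counts into probabilities that sum to 1.
model <- chisq.test(O, p = E / sum(E))

chi_sq <- unname(model$statistic)  # the chi-squared statistic
p      <- model$p.value            # the p-value

# The same statistic computed by hand: sum of (O - E)^2 / E.
by_hand <- sum((O - E)^2 / E)

cat("chi-squared =", chi_sq, "| by hand =", by_hand,
    "| p =", signif(p, 3), "\n")

# If no expected data are available, the mean of O can stand in for E.
E_fallback <- rep(mean(O), length(O))
```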
In task 2, we were given the observed and expected transposon hit frequencies.
These two sets of data were first placed into separate vectors using the ‘c’
function. The two vectors were then used as the input data for ‘chisq.test’:
the observed (‘O’) vector is placed first, and the expected (‘E’) vector,
converted to probabilities, is supplied second. The outputs of ‘chisq.test’
were saved in a variable, ‘model’, so that the p-value and χ² statistic could
be extracted easily. The data were visualized using the ‘boxplot’ function,
and the p-value and χ² statistic were printed onto the boxplot using the
‘legend’ function.
The test gave a p-value of 0.100364. Because this value is greater than 0.05,
the observed and expected transposon hit frequencies are not significantly
different and the null hypothesis cannot be rejected. Figure 2 below shows the
boxplot of the observed and expected transposon hit frequencies. The χ² test
result is consistent with the boxplot: the distributions of the observed and
expected transposon hit frequencies are similar, with both having a similar
median value.
My R script for Analysing the data set:
Description of the T test:
A biologist would use the t test to determine whether the means of two
separate sets of data differ. However, the parameters of the t test need to be
decided first: whether the data come from one or two populations, whether a
one-tailed or two-tailed t test is appropriate, and whether the values are
paired or unpaired.
In RStudio, the ‘t.test’ function is used for a t test. In task 3 the two
reading speed variables are placed into the ‘basic’ and ‘colour’ vectors
(these are the two inputs). The t test is performed using the script
‘model=t.test(colour,basic)’. Unlike the previous tests, the two vectors do
not have to be equal in length unless the test is paired. There are two key
outputs from the t test: the p-value and the t statistic. The t statistic
alone can be extracted using ‘model$statistic’, and the p-value alone using
‘model$p.value’. As a general rule, if the p-value is smaller than the
critical value (0.05), the means of the two vectors differ significantly and
the null hypothesis is rejected. In contrast, if the p-value is greater than
the critical value, the means of the two populations are not significantly
different and the null hypothesis cannot be rejected.
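A minimal sketch of the basic ‘t.test(colour,basic)’ call, using invented reading speeds in arbitrary units (not the task 3 data). The two vectors are deliberately of different lengths to show that an unpaired test accepts this. The names ‘t_stat’ and ‘p’ are introduced for the example.

```r
# Hypothetical reading speeds (arbitrary units); NOT the task 3 data.
basic  <- c(52, 48, 55, 50, 47, 53, 49)
colour <- c(58, 54, 60, 57, 52, 59)  # unpaired samples may differ in length

# With no further arguments R runs a two-sided Welch two-sample t test.
model <- t.test(colour, basic)

t_stat <- unname(model$statistic)  # the t statistic
p      <- model$p.value            # the p-value

cat("t =", round(t_stat, 2), "| p =", signif(p, 3),
    "| reject H0:", p < 0.05, "\n")
```

Because ‘colour’ is given first, a positive t statistic means the ‘colour’ mean is the larger of the two.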
Hypothesis of the T test:
In task 3 the null hypothesis proposed is H0: μ_colour ≤ μ_basic. Here μ_basic
is the mean reading speed of 20 children reading the basic book, whereas
μ_colour is the mean reading speed of 20 children reading the colour book. The
hypothesis states that the mean reading speed of children reading the basic
book is greater than or equal to the mean reading speed of children reading
the colour book.
Because task 3 asks whether the mean reading speed for the colour book is
faster than for the basic book, a two-sample, one-tailed t test is used. The
test is also paired because the two sets of data are not independent: reading
speed is measured twice (once for colour and once for basic). As a result, I
used the ‘TRUE’ setting for ‘paired’. Therefore, the code becomes:
‘t.test(colour,basic,alternative="g",paired=TRUE)’.
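The paired, one-tailed call can be sketched as below with invented values for five readers (not the task 3 data). The second call, stored in the hypothetical variable ‘check’, is included only to show that a paired t test gives the same result as a one-sample t test on the pairwise differences.

```r
# Hypothetical paired reading speeds; NOT the task 3 data.
basic  <- c(50, 47, 55, 52, 49)
colour <- c(54, 50, 56, 57, 51)

# Paired data must line up one-to-one, so the lengths must match.
stopifnot(length(basic) == length(colour))

# Alternative hypothesis: mean(colour) > mean(basic).
# "g" is matched by R to the full option name "greater".
model <- t.test(colour, basic, alternative = "g", paired = TRUE)

# Equivalent one-sample test on the within-pair differences.
check <- t.test(colour - basic, alternative = "greater")

cat("t =", round(model$statistic, 2),
    "| p =", signif(model$p.value, 3), "\n")
```

Writing ‘alternative="greater"’ in full is equivalent and may be easier to read.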
The null hypothesis is H0: μ_colour ≤ μ_basic. The vectors for the two
populations were generated using the ‘c’ function, and the two variables
(‘basic’ and ‘colour’) were inputted into the t test. As the t test is
one-tailed, I used the ‘g’ (greater) alternative. I also saved the p-value and
t statistic outputs in a variable, ‘model’. To visualize the data, the
‘boxplot’ function was used, with both vectors shown in one boxplot. Finally,
the t statistic and p-value were printed onto the boxplot using the ‘legend’
function.
The p-value produced was 0.0013. Hence the null hypothesis, that “the mean
reading speed of children reading the basic book is greater than or equal to
the mean reading speed of children reading the colour book”, is rejected.
Figure 3 shows the boxplots of the two data sets. The t test result is
consistent with the boxplot, as the two sets of data show a moderate
difference in their means. The median colour reading speed is greater than the
median for the basic book.
My R script for analysing the data set:
Brief Description of the ANOVA test:
A biologist could use the ANOVA test when the data come from more than two
populations, in order to test whether the means of these populations differ.
The test works by comparing the variance within each population with the
variance between the different populations. Task 4 requires a one-way ANOVA
test; therefore, the population data need to be split into an observation
variable and a label variable.
In RStudio, an ANOVA test is produced using the ‘aov’ function. This function
needs two inputs: a numeric observation vector and a labels vector. In task 4
these vectors were named ‘observation’ and ‘label’ respectively. The two
vectors are linked using the ‘~’ operator:
‘model=aov(observation~factor(label))’. The formula
‘observation~factor(label)’ describes the observation values grouped by their
label. One rule when conducting an ANOVA test is that the ‘observation’ and
‘label’ vectors must have the same length. To create a summary ANOVA table we
use the code ‘stats=summary(model)[[1]]’, where the input is the ANOVA model
itself. The summary table contains two key statistical values: the p-value and
the F ratio (Fisher ratio). To obtain the p-value alone, use ‘stats$Pr[1]’.
Similarly, to obtain the F ratio alone, use ‘stats$F[1]’. If the p-value is
smaller than the critical value (0.05), at least one of the three population
means differs significantly from the others and the null hypothesis is
rejected. However, if the p-value is larger than the critical value, the means
of the three populations are not significantly different and the null
hypothesis cannot be rejected.
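A minimal sketch of the aov workflow with invented flowering times (not the task 4 data). It reads the two values from the summary table by their full column names, ‘F value’ and ‘Pr(>F)’, which is a more explicit alternative to the shortened ‘stats$F’ and ‘stats$Pr’ forms. The names ‘f_ratio’ and ‘p’ are introduced for the example.

```r
# Hypothetical flowering times (days); NOT the task 4 data.
observation <- c(30, 32, 29, 31,  33, 35, 31, 34,  30, 33, 32, 31)
label       <- rep(c("A", "B", "C"), each = 4)

# Every observation needs exactly one label.
stopifnot(length(observation) == length(label))

model <- aov(observation ~ factor(label))
stats <- summary(model)[[1]]  # the ANOVA table as a data frame

# Row 1 is the between-group term; row 2 is the residuals.
f_ratio <- stats[["F value"]][1]
p       <- stats[["Pr(>F)"]][1]

cat("F =", round(f_ratio, 2), "| p =", signif(p, 3),
    "| reject H0:", p < 0.05, "\n")
```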
The Hypothesis for the ANOVA test:
The null hypothesis for the ANOVA would be H0: μ_A = μ_B = μ_C. The μ_A, μ_B
and μ_C values in the hypothesis represent the mean rose flowering times using
three different fertilizers (A, B and C). The hypothesis states that the mean
rose flowering times in the three fertilizer treatments are equal.
Task 4 asks whether three groups have the same mean. Consequently, an ANOVA
test can be used to investigate the data. I used the ‘c’ function to create
three vectors for the three fertilizer types (A, B and C). The numeric
‘observation’ vector was created by combining the three fertilizer vectors
using ‘c’, and a factor ‘label’ vector was created using the ‘c’ and ‘rep’
functions to label the 12 data values. The ANOVA summary table was created
using the ANOVA model as input, which allowed the p-value and F ratio to be
accessed (using ‘stats$’). The data were visualized using the ‘boxplot’
function and the p-value was printed onto the plot using the ‘legend’
function.
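The construction of the ‘observation’ and ‘label’ vectors and the annotated boxplot might look roughly like this. The values are invented (not the task 4 data), and four observations per fertilizer is an assumption made so that the total is 12. Axis labels and legend position are also illustrative choices.

```r
# Hypothetical flowering times; NOT the task 4 data.
# Assumption: four roses per fertilizer, giving 12 values in total.
A <- c(30, 32, 29, 31)
B <- c(33, 35, 31, 34)
C <- c(30, 33, 32, 31)

observation <- c(A, B, C)                         # 12 numeric values
label <- factor(rep(c("A", "B", "C"), each = 4))  # one label per value

model <- aov(observation ~ label)
p <- summary(model)[[1]][["Pr(>F)"]][1]

# One box per fertilizer, with the p-value printed on the plot.
boxplot(observation ~ label,
        xlab = "Fertilizer", ylab = "Flowering time")
legend("topright", legend = paste("p =", signif(p, 3)), bty = "n")
```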
The p-value generated was 0.297715. Hence the mean values of the three
fertilizer treatments did not differ significantly and the null hypothesis is
not rejected. Figure 4 shows the boxplot of the data. The boxplot is
consistent with the ANOVA test results: the rose flowering times under the
three treatments do not display a large difference across A, B and C, although
the median values do vary slightly.
My R script for analysing this data set: