Correlation analysis

Description: A biologist would use a correlation analysis to investigate whether two chosen variables are associated. There are two key elements to the test. First, the correlation coefficient between the two data sets is calculated. Next, a significance analysis is conducted, producing a significance value (commonly known as a p-value). The p-value is used to assess the null hypothesis, which states that there is no correlation between the two variables, while the correlation coefficient indicates the direction and strength of the correlation. The correlation can be positive, negative or absent altogether. In RStudio, the ‘cor.test’ function is used for correlation analysis.
The experimental values in task 1 were recorded in the vectors ‘Bangladesh’ and ‘Pakistan’; these are the inputs for the test. A correlation analysis can therefore be run with the following script: ‘model = cor.test(Bangladesh, Pakistan)’. One requirement is that the ‘Bangladesh’ and ‘Pakistan’ vectors must be equal in length. The ‘cor.test’ function has two primary outputs: a p-value and a correlation coefficient.
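As a minimal sketch of the call described above, the following runs ‘cor.test’ and stores its two primary outputs. The numbers are made up purely for illustration and are not the task 1 data; the names ‘r_value’ and ‘p_value’ are likewise assumptions introduced here.

```r
# Hypothetical yearly values for illustration only; these are NOT the task 1 data.
Bangladesh <- c(27.4, 27.1, 25.2, 26.4, 28.2, 28.3, 29.7, 34.4, 37.6, 36.3)
Pakistan   <- c(4.7, 6.0, 5.2, 5.9, 6.5, 6.5, 7.0, 7.7, 7.2, 5.8)

# cor.test() requires both vectors to have the same length.
stopifnot(length(Bangladesh) == length(Pakistan))

# Pearson correlation test (the default method of cor.test).
model <- cor.test(Bangladesh, Pakistan)

r_value <- unname(model$estimate)  # correlation coefficient
p_value <- model$p.value           # significance value

print(r_value)
print(p_value)
```

Saving the result in ‘model’ means the test only has to be run once, however many of its outputs are inspected afterwards.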
A quick way to retrieve just the correlation coefficient is ‘cor.test(Bangladesh, Pakistan)$estimate’. Similarly, ‘cor.test(Bangladesh, Pakistan)$p.value’ displays just the p-value. If the correlation coefficient is positive, the ‘Bangladesh’ and ‘Pakistan’ values have a positive correlation. In contrast, if the correlation coefficient is negative, they have a negative correlation. If the p-value is smaller than the critical value (0.05), one can reject the null hypothesis and conclude that the two variables are significantly correlated. If the p-value is larger than the critical value, the null hypothesis cannot be rejected and there is no significant evidence that the two data sets are correlated.

The hypothesis of correlation analysis: The general null hypothesis for a correlation analysis can be written as H0: ρ = 0. In this equation ρ is the correlation coefficient between Bangladesh's and Pakistan's rice production. The null hypothesis states that rice production in Bangladesh and rice production in Pakistan are not correlated; that is, the correlation coefficient (ρ) is presumed to be zero.

Test Design: Since task 1 asks about the association between rice production in Bangladesh and Pakistan from 1992 to 2011, a correlation analysis can be used to examine whether the two data sets are related.
First, I placed the variables into two separate vectors using the ‘c’ function. The ‘Bangladesh’ and ‘Pakistan’ vectors could then be used as the two inputs of the ‘cor.test’ function. The outputs given by ‘cor.test’ were saved into a variable, ‘model’, for easy access. Finally, a scatter plot was produced using the ‘plot’ function to visualize the data. The ‘legend’ function was used to print the p-value and correlation coefficient onto the scatter plot.
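The plotting step described above might look roughly like the sketch below. The data are invented for illustration (not the task 1 values), and the name ‘legend_text’, the axis labels and the legend position are assumptions rather than the settings used in the original script.

```r
# Made-up example values; NOT the task 1 data.
Bangladesh <- c(27.4, 27.1, 25.2, 26.4, 28.2, 28.3, 29.7, 34.4, 37.6, 36.3)
Pakistan   <- c(4.7, 6.0, 5.2, 5.9, 6.5, 6.5, 7.0, 7.7, 7.2, 5.8)

model <- cor.test(Bangladesh, Pakistan)

# Build the legend text from the saved test outputs (3 significant figures).
legend_text <- c(
  paste("r =", signif(model$estimate, 3)),
  paste("p =", signif(model$p.value, 3))
)

# Scatter plot of one variable against the other.
plot(Bangladesh, Pakistan,
     xlab = "Bangladesh", ylab = "Pakistan",
     main = "Correlation analysis")

# bty = "n" draws the legend without a surrounding box.
legend("topleft", legend = legend_text, bty = "n")
```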
Discussion: The p-value calculated was 0.00158. Consequently, rice production in Bangladesh and Pakistan is significantly correlated and we reject the null hypothesis. As the correlation coefficient is 0.66, the two variables have a clear positive correlation; that is, when rice production in Bangladesh increases, production in Pakistan tends to increase too. Figure 1 below shows the two variables in a scatter plot. The scatter plot is consistent with the correlation analysis results, as the points follow a pattern running from the bottom-left to the top-right corner.

My R script for analysing this data set:

χ² test

Description: A biologist would use a χ² test to determine whether the observed data have the same distribution as the expected data. The test is usually used to examine proportion data, which rarely follow the normal distribution. Two variables are required to carry out the test: the observed data and the expected data (the two variables must have equal length). During the test the two data sets are compared one-to-one to assess the degree of fit between them (pairwise distance).
Moreover, if the expected data are not available, one can assume that each expected value is the average of the observed data. In RStudio, the ‘chisq.test’ function carries out a χ² test. In task 2 the two variables (containing transposon hit frequency data) were first split into the observed (‘O’) and expected (‘E’) vectors. The test can then be executed using ‘model = chisq.test(O, p = E/sum(E))’. The ‘E/sum(E)’ part generates a probability vector from the expected data. The key outputs of the test are a χ² statistic and a p-value. Once ‘chisq.test’ has been run, the χ² value alone can be accessed with ‘model$statistic’. Similarly, the p-value alone can be accessed with ‘model$p.value’. If the p-value is lower than the critical value (0.05), the two variables (O and E) are significantly different and one can reject the null hypothesis. In contrast, if the p-value is larger than the critical value, the two variables are not significantly different and the null hypothesis cannot be rejected.

Hypothesis of the χ² test: The general null hypothesis for a χ² test is H0: O = E, where O is the vector of observed values and E is the vector of expected values. The hypothesis states that the observed transposon hit frequencies are equal to the expected transposon hit frequencies.

Test Design: In task 2, we were given the observed and expected transposon hit frequencies. We first place these two sets of data into separate vectors using the ‘c’ function. Next, these two vectors are used as the input data for ‘chisq.test’. The observed (‘O’) vector is placed first, and the probability vector derived from the expected (‘E’) vector is supplied second.
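The ‘chisq.test’ call described above can be sketched as follows. The counts are hypothetical and chosen only so the example runs; they are not the task 2 transposon data, and ‘chi_sq’ and ‘p_value’ are names introduced here for illustration.

```r
# Hypothetical counts for illustration only; NOT the task 2 data.
O <- c(25, 32, 18, 45, 30)  # observed hit counts
E <- c(30, 30, 20, 40, 30)  # expected hit counts

# Observed and expected vectors must have the same length.
stopifnot(length(O) == length(E))

# The p argument must sum to 1, so the expected counts are rescaled to probabilities.
model <- chisq.test(O, p = E / sum(E))

chi_sq  <- unname(model$statistic)  # chi-squared statistic
p_value <- model$p.value            # significance value

print(chi_sq)
print(p_value)
```

Note that ‘chisq.test’ computes its expected counts as ‘sum(O) * p’, so the statistic matches a hand calculation against ‘E’ only when the observed and expected totals are equal, as they are in this made-up example.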
It is important to save the output of ‘chisq.test’ into a variable, ‘model’, so that the p-value or χ² statistic can be extracted easily. The results of the test can be visualized using the ‘boxplot’ function, and the p-value and χ² statistic can be printed onto the boxplot using the ‘legend’ function.

Discussion: The test gave a p-value of 0.100364. Because this value is greater than 0.05, the observed and expected transposon hit frequencies are not significantly different and the null hypothesis cannot be rejected.
Figure 2 below shows the boxplot of the observed and expected transposon hit frequencies. The χ² test result is consistent with the boxplot: the distributions of the observed and expected transposon hit frequencies are similar, with both having a similar median value on the boxplot.

My R script for analysing this data set:

Description of the t test: A biologist would use the t test to determine whether the means of two sets of data differ. However, the parameters of the t test need to be decided first: whether the data come from one or two populations, whether the data suit a one-tailed or two-tailed test, and whether the values are paired or unpaired. In RStudio the ‘t.test’ function is used for a t test.
In task 3 the two reading speed variables are placed into the ‘basic’ and ‘colour’ vectors (these are the two inputs). To perform the t test, the script ‘model = t.test(colour, basic)’ is used. Unlike the other methods described here, an unpaired t test does not require the two variables to be equal in length. There are two key outputs from the t test: the p-value and the t statistic. To extract just the t statistic, use ‘model$statistic’. Similarly, the p-value can be extracted using ‘model$p.value’. As a general rule, if the p-value is smaller than the critical value (0.05), the data in the two vectors have means which differ significantly and one can reject the null hypothesis. In contrast, if the p-value is greater than the critical value, the means of the two populations are not significantly different and the null hypothesis cannot be rejected.
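To illustrate the basic call and the two extractions described above, here is a small sketch using invented reading speeds (not the task 3 data). The vectors are deliberately of different lengths to show that the default, unpaired test accepts this; ‘t_stat’ and ‘p_value’ are names assumed for the example.

```r
# Made-up reading speeds for illustration; NOT the task 3 data.
colour <- c(112, 98, 105, 120, 101, 109, 115)
basic  <- c(95, 102, 88, 99, 91)  # deliberately shorter than 'colour'

# Default call: two-sided, unpaired Welch t test.
model <- t.test(colour, basic)

t_stat  <- unname(model$statistic)  # t statistic
p_value <- model$p.value            # significance value

print(t_stat)
print(p_value)
```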
Hypothesis of the t test: In task 3 the null hypothesis is H0: μcolour ≤ μbasic. Here μbasic is the mean reading speed of 20 children reading the basic book, whereas μcolour is the mean reading speed of 20 children reading the colour book. The hypothesis states that the mean reading speed of children reading the basic book is greater than or equal to the mean reading speed of children reading the colour book.

Test Design: Because task 3 asks whether the mean reading speed for the colour book is faster than for the basic book, a two-population, one-tailed t test is used. The test is also paired because the two sets of data are not independent: reading speed is measured twice (for colour and for basic). As a result, I used the ‘TRUE’ setting for ‘paired’. The code therefore becomes ‘t.test(colour, basic, alternative = "g", paired = TRUE)’, which tests H0: μcolour ≤ μbasic. The vectors for the two populations were generated using the ‘c’ function, and the two variables (‘basic’ and ‘colour’) were passed to the t test. As the t test is one-tailed, I used the ‘g’ (greater) alternative.
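A paired, one-tailed call of the kind described above can be sketched like this. The reading speeds are invented for illustration and are not the task 3 data; the sketch assumes that position i in both vectors belongs to the same child, and ‘check’ is a name introduced here to show what the paired test is equivalent to.

```r
# Made-up paired reading speeds; NOT the task 3 data.
# Position i in both vectors is assumed to belong to the same child.
basic  <- c(95, 102, 88, 99, 91, 104, 97, 93)
colour <- c(101, 108, 90, 107, 95, 109, 99, 100)

# Unlike the unpaired test, a paired test needs equal-length vectors.
stopifnot(length(colour) == length(basic))

# One-tailed paired test; the alternative is that mean(colour - basic) > 0.
# "g" in the report is a partial match for "greater".
model <- t.test(colour, basic, alternative = "greater", paired = TRUE)

# The same test written as a one-sample test on the differences.
check <- t.test(colour - basic, alternative = "greater")

print(model$statistic)
print(model$p.value)
```

Writing the test on the differences makes the pairing explicit: only the within-pair change enters the statistic, which is why the pairing choice matters for the result.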
I also saved the p-value and t statistic outputs in a variable, ‘model’. To visualize the data, the ‘boxplot’ function was used; this boxplot contained the data for both vectors. Finally, the t statistic and p-value were printed onto the boxplot using the ‘legend’ function.

Discussion: The p-value produced was 0.0013. Hence, the null hypothesis is rejected, and the statement that “the mean reading speed of children reading the basic book is greater than or equal to the mean reading speed of children reading the colour book” can be rejected. Figure 3 shows the boxplots of the two data sets. The t test result is consistent with the boxplot, as the two sets of data show a moderate difference in their means. The median reading speed with the colour book is greater than the median with the basic book.
My R script for analysing this data set:

Brief Description of the ANOVA test: A biologist could use the ANOVA test when the data have more than two populations, to test whether the means of these populations differ. The test works by comparing the variance within populations to the variance between populations. Task 4 requires a one-way ANOVA test; therefore, the population data need to be split into an observations variable and a labels variable. In RStudio an ANOVA test is run using the ‘aov’ function. This function needs two inputs: a numeric observation vector and a labels vector. In task 4 these vectors were named ‘observation’ and ‘label’ respectively. The two vectors are related to each other using the ‘~’ operator: ‘model = aov(observation ~ factor(label))’. The formula ‘observation ~ factor(label)’ describes the observation values grouped by their label. One rule when conducting an ANOVA test is that the ‘observation’ and ‘label’ vectors must have the same length. To create a summary ANOVA table, we use ‘stats = summary(model)[[1]]’; here the input is the ANOVA model itself. The summary table contains two key statistical values: the p-value and the F ratio (Fisher ratio). To obtain just the p-value, use ‘stats$Pr[1]’. Similarly, to obtain just the F ratio, use ‘stats$F[1]’.
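The model fit and the two extractions described above can be sketched as follows. The flowering times are made up for illustration (not the task 4 data), and ‘f_ratio’ and ‘p_value’ are names assumed here. The sketch uses the full column names of the summary table, since the shorter ‘$F’ and ‘$Pr’ forms depend on R's partial matching of column names.

```r
# Made-up flowering times (days); NOT the task 4 data.
observation <- c(41, 44, 39, 42,   # fertilizer A
                 45, 47, 43, 46,   # fertilizer B
                 40, 43, 44, 41)   # fertilizer C
label <- rep(c("A", "B", "C"), each = 4)

# Both vectors must have the same length.
stopifnot(length(observation) == length(label))

# One-way ANOVA: observations grouped by label.
model <- aov(observation ~ factor(label))

# The first element of the summary is the ANOVA table.
stats <- summary(model)[[1]]

# Row 1 is the treatment row; the residual row has no F or p entry.
f_ratio <- stats[["F value"]][1]
p_value <- stats[["Pr(>F)"]][1]

print(f_ratio)
print(p_value)
```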
After conducting the test, if the p-value is smaller than the critical value (0.05), at least one of the three populations has a significantly different mean and one can reject the null hypothesis. However, if the p-value is larger than the critical value, the means of the three populations are not significantly different and the null hypothesis cannot be rejected.

The Hypothesis for the ANOVA test: Task 4's null hypothesis for the ANOVA is H0: μA = μB = μC. The μA, μB and μC values represent the mean rose flowering times using three different fertilizers (A, B and C). The hypothesis states that the mean rose flowering times in the three fertilizer treatments are equal.

Test Design: Task 4 asks whether three groups have the same mean. Consequently, an ANOVA test can be used to investigate the data. I used the ‘c’ function to create three vectors for the three fertilizer types (A, B and C). The numeric ‘observation’ vector was then created by combining the three fertilizer vectors using ‘c’. A factor label vector was also created using the ‘c’ and ‘rep’ functions to label the 12 data values. The ANOVA summary table was created using the ANOVA model as the input. This allowed the p-value and F ratio to be accessed (using the ‘stats$’ expressions).
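The data preparation described above, building the combined observation vector and its matching labels, might look like the sketch below. The values are invented (not the task 4 data), the group size of four simply follows from 12 values across three fertilizers, and ‘group_means’ is a name introduced here for a quick sanity check.

```r
# Made-up flowering times (days); NOT the task 4 data.
A <- c(41, 44, 39, 42)
B <- c(45, 47, 43, 46)
C <- c(40, 43, 44, 41)

# Combine the three groups into one numeric vector.
observation <- c(A, B, C)

# One label per value, repeated to match each group's size.
label <- factor(rep(c("A", "B", "C"),
                    times = c(length(A), length(B), length(C))))

# Each label should appear once per value in its group.
print(table(label))

# Group means recovered from the combined vectors, as a check on the labelling.
group_means <- tapply(observation, label, mean)
print(group_means)
```

Deriving the repeat counts from the group lengths keeps the labels aligned with the observations even if the groups were of unequal size.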
The data were visualized using the ‘boxplot’ function and the p-value was printed onto the plot using the ‘legend’ function.

Discussion: The p-value generated was 0.297715. Hence the mean values of the three fertilizer treatments did not differ significantly and the null hypothesis is not rejected. Figure 4 shows the boxplot of the data. The boxplot is consistent with the ANOVA test results: it shows that the rose flowering times under the three treatments do not differ greatly across A, B and C, although the median values do vary slightly.

My R script for analysing this data set: