Comparative Transcriptomics (5 pts)

Once I have approved your dataset, you will upload and process your data in Galaxy. Remember that these jobs can involve a lot of trial and error and long run times. Put your accession numbers in two text files (one accession number per line): one for the first species and one for the second. This makes it easy to upload all files in batch mode. It is easiest to run your analyses from start to finish with one species and then reapply the exact same workflow to the second species.

Make your first Galaxy history for the species of your choice and name the history appropriately. Upload the accession text file, the compressed genome file (.fna.gz), and compressed annotation file (.gff.gz) for the corresponding species. Answer the questions below for the first species, then when you are done, redo all the steps and answer all the questions for the second species. Make sure to tell me which species you start with.
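Before uploading, it can help to check that each accession text file really has one accession per line. The following is a minimal sketch, not part of the required workflow. It assumes the entries are run accessions of the form SRR, ERR, or DRR followed by digits; the function name `check_accession_lines` and the example accessions are hypothetical.

```python
import re

# Assumption: entries are run accessions (SRR/ERR/DRR followed by digits).
# Adjust the pattern if your dataset uses a different accession type.
ACCESSION_RE = re.compile(r"^[SED]RR\d+$")


def check_accession_lines(lines):
    """Return (valid, problems) for an iterable of text lines.

    valid    -- accessions that match the pattern, in order, without duplicates
    problems -- (line_number, text, reason) tuples for lines that need fixing
    """
    valid = []
    problems = []
    seen = set()
    for number, raw in enumerate(lines, start=1):
        entry = raw.strip()
        if not entry:
            # Blank lines are ignored rather than reported.
            continue
        if not ACCESSION_RE.match(entry):
            problems.append((number, entry, "not a run accession"))
        elif entry in seen:
            problems.append((number, entry, "duplicate"))
        else:
            seen.add(entry)
            valid.append(entry)
    return valid, problems


# Hypothetical example content; in practice pass an open file handle instead.
example = ["SRR1234567\n", "SRR1234568\n", "SRR1234567\n", "SRR123 4569\n", "\n"]
valid, problems = check_accession_lines(example)
print(valid)
print(problems)
```

To check a real file, pass an open handle, for example `check_accession_lines(open("species1_accessions.txt"))`, where the file name is only a placeholder.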


  • Upload data. Use “Faster Download and Extract Reads” with your text file. Notice this will make a “Collection” of files, so you can run operations on the entire ‘batch’ all at the same time.
  • Run FastQC on the collection of FASTQ files.
  • Run Trim Galore on the collection of FASTQ files – make sure to select from the advanced settings “Yes” to “Generate a report file”. This report output will be used in the MultiQC step below.
  • Run FastQC on the collection of Trim Galore FASTQ files.
  • Run RNA STAR on the Trim Galore collection. Select the genome you uploaded together with its gene model (the GFF file), and choose “Per gene read counts (GeneCounts)”.
  • Run Samtools stats to gather statistics on your STAR output BAM files.
  • Run featureCounts using the BAM files. Our libraries should be unstranded. featureCounts expects the annotation in GTF format (not GFF), so running it directly on the annotation file you uploaded (.gff.gz) will give an error. First run gffread to convert the uploaded annotation from GFF to GTF, then use the converted file as the gene annotation file and select Yes to create a gene-length file.
  • Run MultiQC to generate results from 4 of the steps above: steps 2, 3, 5, and 7 (FastQC, Trim Galore, RNA STAR, and featureCounts).
  • Download the featureCounts counts and gene-length outputs. These will be imported into R in the next assignment.
  • Share your finalized Galaxy histories with me.
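The featureCounts step above fails on a GFF file because GFF3 and GTF write the ninth (attributes) column differently: GFF3 uses `key=value` pairs, while GTF uses `key "value";` pairs. The sketch below is one way to check which style a file uses before and after running gffread. It is an illustration only; the function name `guess_annotation_format` and the example records are hypothetical, and the check looks only at the first feature line it finds.

```python
import re

# GTF attributes look like:  gene_id "gene1"; transcript_id "tx1";
GTF_ATTR = re.compile(r'^\s*[^\s=;]+\s+"[^"]*"\s*;')
# GFF3 attributes look like: ID=gene1;Name=abc
GFF3_ATTR = re.compile(r'^\s*[^=;\s]+=[^;]*')


def guess_annotation_format(lines):
    """Guess 'GTF', 'GFF3', or 'unknown' from the first feature line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            # Skip header/comment lines and blank lines.
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            # Not a standard 9-column feature line; keep looking.
            continue
        attributes = fields[8]
        if GTF_ATTR.match(attributes):
            return "GTF"
        if GFF3_ATTR.match(attributes):
            return "GFF3"
    return "unknown"


# Hypothetical records for illustration only.
gff3_lines = [
    "##gff-version 3\n",
    "chr1\texample\tgene\t1\t100\t.\t+\t.\tID=gene1;Name=abc\n",
]
gtf_lines = [
    'chr1\texample\texon\t1\t100\t.\t+\t.\tgene_id "gene1"; transcript_id "tx1";\n',
]
print(guess_annotation_format(gff3_lines))
print(guess_annotation_format(gtf_lines))
```

For a compressed annotation such as the uploaded .gff.gz, the lines can be read with `gzip.open(path, "rt")` from the Python standard library and passed to the same function.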


For each species:

Provide a screenshot of your entire desktop screen showing the general statistics output from MultiQC in your Galaxy browser. Also share your Galaxy histories with me. Without both the screenshot and the shared histories, your work will not be graded.

In a sentence or two, answer these questions for each species:


  1. What developmental stages are you analyzing?
  2. Describe the mean quality of the data in lay terms. 5 pts
  3. What is the approximate length of most reads after trimming? 5 pts
  4. After looking at the FastQC metrics, are there any samples in particular that you are worried about (e.g. that you would consider excluding from analysis)? Why or why not? 1 pt
  5. What is the approximate average % read duplication, and is this surprising given your dataset (why/why not)? 1 pt
  6. Describe the mapping results and the variance among samples, and state whether you are satisfied with this result (and why). Also provide a screenshot of the STAR Alignment Scores from MultiQC. 1 pt
  7. Interpret the Assigned Reads vs % Assigned Reads: explain what these numbers represent and how the % was calculated. 1 pt

Bonus: What do you think is more important in determining the success of your sequencing (and why), Assigned or Aligned reads? 0.5 pts


