Category Archives: Scripts

awk saves the day…

Sometimes you just need to process your text data in some manner.

I was looking for a simple script to count raw reads within my fastq files and came across this helpful script in Biostars by Jorge Amigo (thank you Jorge).

awk ‘{s++}END{print s/4}’ file.fastq

This counts the number of lines and divides by 4 to output the number of raw reads.

 

Running biopython scripts

As I am learning a bit about scripts and how to run them I have made recent errors which probably appear very simple to someone from a computing background.

The word I have learnt is ‘parse‘ which means:

  • to analyze (a sentence) into its parts and describe their syntactic roles.
  • analyze (a string or text) into logical syntactic components, typically in order to test conformability to a logical grammar.
  • examine or analyze minutely.

Another word is to ‘pass‘. You pass some instruction to the software to parse (analyze).

To run software you need:

  • the software and dependencies (other software) to be installed
  • script to tell the program what you want it to do
  • input files

I managed to find this very useful biopython script to make bed files from my newly made transcriptomes. Bed files are tab  delimited files that include the sequence ID, and the start and end location. For new transcriptomes the start is always 0 and the end is the length of the contig. These files can be used to gather the count data from your alignment files per sequence ID (gene).

http://ged.msu.edu/angus/tutorials-2013/rnaseq_bwa_counting.html

In case the website above is unavailable, here is the biopython script, slightly altered for the later version of Biopython which requires brackets around the print function:

import sys
from Bio import SeqIO

fasta_handle = open(sys.argv[1], "rU") #Open the fasta file specified by the user on the command line

#Go through all the records in the fasta file
for record in SeqIO.parse(fasta_handle, "fasta"):
    cur_id = record.id #Name of the current feature
    cur_length = len(record.seq) #Size of the current feature
    print (("%s\t%d\t%d") % (cur_id, 0, cur_length)) #Output the name, start, and end coordinates to the screen

This script will take any fasta file you give it and output the bed file. The important thing is to tell python.exe what to do with the input file. You need to pass the script to python.exe so it can parse and act on it. Therefore the order of your command must be correct Рthe .py script comes directly after the python.exe.

I placed the input file first and could not understand why I kept getting the error that my fasta file had syntax errors.

So in the command line window – I just run this in Windows cmd.exe by holding shift and right-clicking on the folder I want as my working directory;

Open the command window in the directory where your fasta file is, then input the directory path to Python34\python.exe then input the directory path to the script and output to your working directory.

So it will look something like this:

D:\Syzygium\>C:Python34\python.exe D:\scripts\make_bed_from_fasta.py Syzygium.fasta > Syzygium.bed

The output will look like this showing the length of each ‘gene’.

TRINITY_DN4638_c0_g1_i1 0 486
TRINITY_DN4638_c0_g2_i1 0 486
TRINITY_DN4608_c0_g1_i1 0 442
TRINITY_DN4687_c0_g1_i1 0 207
TRINITY_DN4630_c0_g1_i1 0 2449
TRINITY_DN4630_c0_g2_i1 0 457
TRINITY_DN4679_c0_g1_i1 0 410
TRINITY_DN4674_c0_g1_i1 0 227
TRINITY_DN4609_c0_g1_i1 0 265
TRINITY_DN4651_c0_g1_i1 0 1030
TRINITY_DN4626_c0_g1_i1 0 468
TRINITY_DN4626_c1_g1_i1 0 286
TRINITY_DN4653_c0_g2_i1 0 555
TRINITY_DN4653_c1_g1_i1 0 793
TRINITY_DN4648_c0_g1_i1 0 605
TRINITY_DN4632_c0_g1_i1 0 2026
TRINITY_DN4672_c0_g1_i1 0 239