The Beautiful science

How to convert text files to all upper or lower case [reblogged]

As usual in Linux, there is more than one way to accomplish a task.

To convert a file (input.txt) to all lower case (output.txt), choose any ONE of the following:

$ dd if=input.txt of=output.txt conv=lcase
$ tr '[:upper:]' '[:lower:]' < input.txt > output.txt
$ awk '{ print tolower($0) }' input.txt > output.txt
$ perl -pe '$_= lc($_)' input.txt > output.txt
$ sed -e 's/\(.*\)/\L\1/' input.txt > output.txt

In the sed version, the backreference \1 refers to the entire line and \L converts it to lower case (\L and \U are GNU sed extensions, so this may not work with other sed implementations).

To convert a file (input.txt) to all upper case (output.txt):

$ dd if=input.txt of=output.txt conv=ucase
$ tr '[:lower:]' '[:upper:]' < input.txt > output.txt
$ awk '{ print toupper($0) }' input.txt > output.txt
$ perl -pe '$_= uc($_)' input.txt > output.txt
$ sed -e 's/\(.*\)/\U\1/' input.txt > output.txt

These one-liners can be used, for example, to convert the lowercase characters in a FASTA file to uppercase and vice versa. Note that they change every line, header lines included. Cheers
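If you want to leave the FASTA header lines alone and only touch the sequence, a small awk variation is one way to do it. This is just a sketch: the toy file and the variable names (fasta, upper) are made up for the example.

```shell
# Hypothetical toy FASTA file, created only for this example
fasta=$(mktemp)
printf '>seq1 mouse\nacgtNNac\n>seq2 human\nttga\n' > "$fasta"

# Upper-case the sequence lines only; header lines (starting with ">") pass through unchanged
upper=$(awk '/^>/ { print; next } { print toupper($0) }' "$fasta")
printf '%s\n' "$upper"
```

Swap toupper for tolower to go the other way.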


Download encrypted file in background using Wget

Fire up the terminal, replace the username and password with your login details, and you can download a heavy file or a list of files in the background.

where files is a text file containing the links of the files to be downloaded.

[LINUX] removing files with special characters using rm

Many of you might have run into problems removing files that have special characters in their names, for instance $1.txt or -bg.

# removing $1.txt

rm \$1.txt

# if it's the only .txt file in the directory

rm *.txt

# removing -bg is a little tricky, as a leading - tells rm that an option follows. So, from the manual, we use '--': "A -- signals the end of options and disables further option processing. Any arguments after the -- are treated as filenames and arguments." (That passage goes on to say an argument of - is equivalent to --, but rm treats a lone - as an ordinary file name, so use -- here.)

rm -- -bg

For some other complex characters, you can always match the name with grep. Consider a file named r.34$@

# command

rm `ls | grep '@$'`
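Quoting the name is often the simplest route, and it also survives names with spaces, which the backtick version does not. Here is a small sketch in a throwaway directory; the scratch directory and the file keep.me are made up for the example.

```shell
# Work in a scratch directory so nothing real is deleted
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch -- '$1.txt' '-bg' 'r.34$@' 'keep.me'

rm '$1.txt'      # single quotes stop the shell from expanding $1
rm ./-bg         # with a leading ./ the name no longer starts with -
rm 'r.34$@'      # single quotes also protect $ and @

remaining=$(ls)
printf '%s\n' "$remaining"
```

Only keep.me should be left at the end.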

I hope that helps.


super useful ‘apt-file’ search utility

Yo guys, suppose you are running someone else's Perl code and get an error during execution like:

Can’t locate Parallel/ in @INC (@INC contains: blah blah)

It means you are missing a module (of course), but how do you install it, and which package provides it?

To find the right package, use the ‘apt-file’ utility, but install it first:

sudo apt-get update
sudo apt-get install apt-file
sudo apt-file update

Now search for the module

sudo apt-file search

which gives

libproc-processtable-perl: /usr/lib/perl5/Proc/

So just install the package:

sudo apt-get install libproc-processtable-perl

That's it, go and have fun. Cheers


wget multiple files in one command


Wanna fetch a lot of files from different links in one line?


 wget -b -i links.txt

* put all the links in a text file called links.txt

** -b puts the wget process in the background, which is cool if you don't want your terminal flooded with text (the output goes to a file named wget-log instead).



adding character to each line of a text file (adding ‘>’ to each seqline to make it fasta)


Open the file, add '>' and a newline character "\n" before each line "$0", and save the output.

awk '{ print ">\n" $0 }' file.seq > file.fasta
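A bare '>' header gives every record an empty name, which some tools may not like. One possible tweak is to number the records. The "seq" prefix below is just a made-up naming scheme, and the toy input file is only there for the example.

```shell
# Hypothetical toy input: one sequence per line
seqfile=$(mktemp)
printf 'ACGT\nTTGA\n' > "$seqfile"

# Header becomes ">seq" plus the line number (NR), followed by the sequence on its own line
named=$(awk '{ print ">seq" NR "\n" $0 }' "$seqfile")
printf '%s\n' "$named"
```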


Linux: Remove files when disk quota exceeded error


When you exceed the disk quota, you sometimes get an error while removing files.

Trick :

copy /dev/null to files and then you can remove them.

# removing all files at once in a script (the quotes keep file names with spaces intact)

for i in *; do cp /dev/null "$i"; rm "$i"; done


ZFS is a copy-on-write filesystem, so a file deletion transiently takes slightly more space on disk before a file is actually deleted. It has to write the metadata involved with the file deletion before it removes the allocation for the file being deleted. This is how ZFS is able to always be consistent on disk, even in the event of a crash.
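The same trick can be written a bit more defensively. The sketch below runs in a throwaway directory with made-up files; it uses the shell's `: > file` redirection to empty each file, which I assume has the same effect as copying /dev/null over it, and it skips anything that is not a regular file.

```shell
# Scratch directory with two made-up files, one with a space in its name
scratch=$(mktemp -d)
printf 'data\n' > "$scratch/a file.txt"
printf 'data\n' > "$scratch/b.txt"

for f in "$scratch"/*; do
    [ -f "$f" ] || continue   # skip directories and other non-regular files
    : > "$f"                  # truncate the file to zero bytes
    rm -- "$f"                # then remove the now-empty file
done

left=$(ls -A "$scratch" | wc -l)
echo "files left: $left"
```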



NGS:chip-seq - bedClip to clip out off end chromosome reads (mainly after extension)

Hola people!

After extending the reads directionally (+ & -) in a bed (or bedGraph) file, the most likely problem is that the extension pushes some reads to positions that lie outside the real genomic coordinates.

So, to fix this, there's a utility called bedClip.


bedClip input.bed mm9 output.bed

mm9 - a text file with the total length of each chromosome (one chromosome per line).

Or use the following awk script; it replaces negative coordinates in the 2nd column with 0, but doesn't remove those lines. Setting FS and OFS to a tab keeps the output tab-separated.

awk 'BEGIN { FS = OFS = "\t" } $2 < 0 { $2 = 0 } { print }' file.bed > out.bed
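The awk one-liner only looks at the start coordinate. If you also want to trim ends that run past the chromosome, something like the following sketch could work. It assumes a two-column, tab-separated chromosome length file (name, length); the file names and toy data are made up, and it trims intervals rather than doing whatever bedClip does internally.

```shell
# Hypothetical toy inputs
scratch=$(mktemp -d)
printf 'chr1\t100\nchr2\t50\n' > "$scratch/chrom.sizes"
printf 'chr1\t-5\t20\nchr1\t90\t120\nchr2\t60\t80\nchr2\t10\t30\n' > "$scratch/input.bed"

clipped=$(awk 'BEGIN { FS = OFS = "\t" }
    NR == FNR { size[$1] = $2 + 0; next }          # first file: chromosome lengths
    !($1 in size) { next }                         # drop chromosomes with no known length
    {
        if ($2 + 0 < 0) $2 = 0                     # clamp negative starts to 0
        if ($3 + 0 > size[$1]) $3 = size[$1]       # clamp ends to the chromosome length
        if ($2 + 0 < $3 + 0) print                 # drop intervals that end up empty
    }' "$scratch/chrom.sizes" "$scratch/input.bed")
printf '%s\n' "$clipped"
```

In the toy data, the chr2 60-80 interval lies entirely past the end of chr2, so it is dropped; the others are trimmed or passed through.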

Have Fun

Sukhdeep Singh

NGS:chip-seq - bam to bedGraph with strand information


A little scientific post: converting a newly mapped chip-seq BAM file to bedGraph with strand information. It is a slightly better-looking modification of the original post by Aaron Quinlan.

# getting '+' strand coverage

bamToBed -i file.bam | awk '$6=="+"' | genomeCoverageBed -i stdin -bg -g mm9 >

# getting '-' strand coverage

bamToBed -i file.bam | awk '$6=="-"' | genomeCoverageBed -i stdin -bg -g mm9 >

# adding strand information to the respective bedGraph files obtained from the previous step, as we already know which strand each file came from

cat | awk '{print $0"\t+"}' >

cat | awk '{print $0"\t-"}' >

# concatenating these two files into a single file

cat >

# sorting the file by chromosome and coordinate

sort -k1,1 -k2,2n >

# adding a fourth column with some id, in this case the row number, to make it a 6 column bed file

awk '{OFS="\t"; print $1,$2,$3,NR,$4,$5}' >


chr1 3001029 3001080 1 1 -

chr1 3001447 3001498 2 1 +

chr1 3002155 3002206 3 1 -

chr1 3002351 3002402 4 1 -

chr1 3004372 3004423 5 1 +

chr1 3004950 3005001 6 1 -

chr1 3005014 3005065 7 1 -

chr1 3006174 3006225 8 1 -

chr1 3006337 3006388 9 1 -

chr1 3006445 3006496 10 1 +
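To see what the tagging, sorting and numbering steps do without running bedtools, here is a toy sketch on two tiny hand-made bedGraph files. The file names (plus.bedGraph, minus.bedGraph) and the scratch directory are made up for the example; the coordinates are borrowed from the output above.

```shell
# Hypothetical toy inputs standing in for the per-strand bedGraph files
workdir=$(mktemp -d)
printf 'chr1\t3001447\t3001498\t1\nchr1\t3004372\t3004423\t1\n' > "$workdir/plus.bedGraph"
printf 'chr1\t3001029\t3001080\t1\nchr1\t3002155\t3002206\t1\n' > "$workdir/minus.bedGraph"

# Tag each file with its strand, merge, sort by chromosome and start, then number the rows
stranded=$(
  {
    awk 'BEGIN { OFS = "\t" } { print $0, "+" }' "$workdir/plus.bedGraph"
    awk 'BEGIN { OFS = "\t" } { print $0, "-" }' "$workdir/minus.bedGraph"
  } | sort -k1,1 -k2,2n |
  awk 'BEGIN { FS = OFS = "\t" } { print $1, $2, $3, NR, $4, $5 }'
)
printf '%s\n' "$stranded"
```

Doing it in one pipeline avoids the intermediate files, but splitting it into separate steps as above works just as well.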

Edit: When copying, please pay attention to the quotes; they may not be the ones your shell expects, so edit and replace them to get rid of execution errors.

Cheers to Life
Sukhdeep Singh


R : counting elements in a list of data.frames (without loop, single line-sapply)


Suppose you have a list of 20 data frames; each data frame has 10 columns but a different number of rows. You want to know the number of rows in each data frame in a single line, without using a loop.


where x is a list

If you are after a specific column, just replace the number [[1]] with it.