Learning objectives:
These introductory notes are drawn from The Carpentries Introduction to the Command Line for Genomics lessons.
A shell is a computer program that presents a command line interface, which lets you control your computer with commands typed at a keyboard instead of through graphical user interfaces (GUIs) operated with a mouse/keyboard combination.
Many bioinformatics tools can only be used through a command line interface, or have extra capabilities in the command line version that are not available in the GUI. This is true, for example, of BLAST, which offers many advanced functions only accessible to users who know how to use a shell.
In this lesson you will learn how to use the command line interface to move around in your file system.
Open up a terminal
[matt.bixley@wbg004 nesi02659]$
The dollar sign is a prompt, which shows us that the shell is waiting for input; your shell may use a different character as a prompt and may add information before the prompt. When typing commands, either from these lessons or from other sources, do not type the prompt, only the commands that follow it.
Let’s find out where we are by running a command called pwd (which stands for “print working directory”). At any moment, our current working directory is our current default directory, i.e., the directory that the computer assumes we want to run commands in, unless we explicitly specify something else. Here, the computer’s response is /scale_wlg_persistent/filesets/project/nesi02659, which is the project directory we start in on this system:
$ pwd
/scale_wlg_persistent/filesets/project/nesi02659
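The idea of a default directory can be tried out safely in a throwaway location. The following is only a minimal sketch: the scratch directory created by mktemp and the names project and notes.txt are made up for illustration and are not part of the lesson data.

```shell
# Work in a scratch directory so nothing real is touched (hypothetical names).
scratch=$(mktemp -d)
mkdir "$scratch/project"
touch "$scratch/project/notes.txt"

cd "$scratch/project"
pwd                           # prints the absolute path of the working directory

in_cwd=$(ls)                  # no path given: ls uses the working directory
elsewhere=$(ls "$scratch")    # explicit path: ls looks there instead
echo "working directory contains: $in_cwd"
echo "scratch directory contains: $elsewhere"
```

The same ls command gives different answers depending on whether a path is supplied, which is what "unless we explicitly specify something else" means in practice.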
Let’s look at how our file system is organized. We can see what files and subdirectories are in this directory by running ls, which stands for “listing”:
$ ls
14417834 fastqcmultiqc_html MGSS_2020_MikeH_test safe Untitled2.ipynb Untitled4.ipynb Untitled.ipynb
checkv.err ga4gh MGSS_resources_2020 SEPT28 untitled2.md untitled4.md untitled.md
checkv.out ga4gh-example-1 MGSS_U source_data untitled2.txt untitled4.txt untitled.txt
checkv.sl gateach newcounts.txt Untitled1.ipynb Untitled3.ipynb Untitled5.ipynb viz_thang.sh
dc_workshop Kegg_dat.ipynb obss_2020 untitled1.md untitled3.md untitled5.txt workflow_workshop
Ex16_afar.ipynb MGSS_2020 RNA-seq.ipynb untitled1.txt untitled3.txt untitled6.txt yeast_counts_all_chr.txt
ls prints the names of the files and directories in the current directory in alphabetical order, arranged neatly into columns.
We’ll be working within the shell_data subdirectory, and creating new subdirectories, throughout this workshop.
To copy the shell data into the directory we are working in, run:
$ cp -r /nesi/nobackup/nesi02659/shell_data/ ./
The command to change locations in our file system is cd, followed by a directory name to change our working directory. cd stands for “change directory”.
Let’s say we want to navigate to the shell_data directory we just copied. We can use the following command to get there:
$ cd shell_data
Let’s look at what is in this directory. We can make the ls output more comprehensible by using the flag -F, which tells ls to add a trailing / to the names of directories:
$ ls -F
sra_metadata/ untrimmed_fastq/
Anything with a “/” after it is a directory. Things with a “*” after them are programs. If there are no decorations, it’s a file.
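The three cases can be reproduced in a scratch directory. This is a small sketch under assumed names (reads, notes.txt and run.sh are invented for the example); it relies only on mkdir, touch, chmod and ls -F.

```shell
scratch=$(mktemp -d)          # throwaway directory, safe to delete afterwards
cd "$scratch"

mkdir reads                                # a directory
touch notes.txt                            # a plain file
printf '#!/bin/sh\necho hi\n' > run.sh
chmod +x run.sh                            # an executable file (a program)

listing=$(ls -F)              # one name per line when output is captured
echo "$listing"
```

In the output, reads carries a trailing /, run.sh carries a trailing *, and notes.txt has no decoration.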
We have a special command to tell the computer to move us back or up one directory level.
$ cd ..
The cd command takes an argument which is a directory name. Directories can be specified using either a relative path or a full absolute path. The directories on the computer are arranged into a hierarchy. The full path tells you where a directory is in that hierarchy. Navigate to the home directory, then enter the pwd command.
$ cd
$ pwd
You will see:
/home/yourname
This is the full name of your home directory. This tells you that you are in a directory called yourname, which sits inside a directory called home, which sits inside the very top directory in the hierarchy. The very top of the hierarchy is a directory called /, which is usually referred to as the root directory. So, to summarize: yourname is a directory in home, which is a directory in /. More on root and home in the next section.
Now enter the following command:
$ cd /home/yourname/shell_data/.hidden
This jumps forward multiple levels to the .hidden directory.
Now go back to the home directory with cd or cd ~:
$ cd
You can also navigate to the .hidden directory using:
$ cd shell_data/.hidden
These two commands have the same effect: they both take us to the .hidden directory. The first uses the absolute path, giving the full address from the root directory. The second uses a relative path, giving only the address from the working directory. A full path always starts with a /. A relative path does not.
A relative path is like getting directions from someone on the street. They tell you to “go right at the stop sign, and then turn left on Main Street”. That works great if you’re standing there together, but not so well if you’re trying to tell someone how to get there from another country. A full path is like GPS coordinates. It tells you exactly where something is no matter where you are right now.
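To see that both kinds of path lead to the same place, here is a minimal sketch that rebuilds a similar layout in a scratch directory. The variable base stands in for the home directory and is an assumption of the example, not something the lesson defines.

```shell
base=$(mktemp -d)                    # plays the role of the home directory
mkdir -p "$base/shell_data/.hidden"

cd "$base/shell_data/.hidden"        # absolute path: works from anywhere
via_absolute=$(pwd)

cd "$base"                           # go to the starting point first
cd shell_data/.hidden                # relative path: resolved from where we are
via_relative=$(pwd)

[ "$via_absolute" = "$via_relative" ] && echo "both paths lead to $via_relative"
```

The relative form only works because of the cd "$base" before it; run from any other directory, it would fail or land somewhere else.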
We now know how to switch directories, run programs, and look at the contents of directories, but how do we look at the contents of files?
One way to examine a file is to print out all of the contents using the program cat.
Enter the following command from within the untrimmed_fastq directory:
$ cat SRR098026.fastq
This will print out all of the contents of SRR098026.fastq to the screen.
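cat is inconvenient for large files because the whole file scrolls past at once. The program less opens the file read-only and lets you page through it. Enter the following command: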
$ less SRR097977.fastq
Some navigation commands in less:
| key | action |
|---|---|
| Space | to go forward |
| b | to go backward |
| g | to go to the beginning |
| G | to go to the end |
| q | to quit |
less also gives you a way of searching through files. Use the “/” key to begin a search. Enter the word you would like to search for and press Enter. The screen will jump to the next location where that word is found.
There’s another way that we can look at files, and in this case, just look at part of them. This can be particularly useful if we just want to see the beginning or end of the file, or see how it’s formatted.
The commands are head and tail, and they let you look at the beginning and end of a file, respectively.
$ head SRR098026.fastq
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
+SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
$ tail SRR098026.fastq
+SRR098026.247 HWUSI-EAS1599_1:2:1:2:1311 length=35
#!##!#################!!!!!!!######
@SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
GNTGNGGTCATCATACGCGCCCNNNNNNNGGCATG
+SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
B!;?!A=5922:##########!!!!!!!######
@SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
CNCTNTATGCGTACGGCAGTGANNNNNNNGGAGAT
+SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
A!@B!BBB@ABAB#########!!!!!!!######
The -n option to either of these commands can be used to print the first or last n lines of a file.
$ head -n 1 SRR098026.fastq
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
$ tail -n 1 SRR098026.fastq
A!@B!BBB@ABAB#########!!!!!!!######
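Because each FASTQ record occupies four lines, -n 4 is a handy value: it selects exactly one record. The sketch below assumes a tiny made-up file, toy.fastq, with two invented reads; it is not the lesson data.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Two invented four-line records standing in for a real FASTQ file.
printf '%s\n' \
  '@read1' 'ACGT' '+' 'IIII' \
  '@read2' 'NNNN' '+' '!!!!' > toy.fastq

first_record=$(head -n 4 toy.fastq)   # first four lines: the first record
last_record=$(tail -n 4 toy.fastq)    # last four lines: the last record

echo "$first_record"
echo "$last_record"
```

Using a multiple of four keeps records whole; any other value of n may cut a record in the middle.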
When working with computational data, it’s important to keep a safe copy of that data that can’t be accidentally overwritten or deleted. For this lesson, our raw data is our FASTQ files. We don’t want to accidentally change the original files, so we’ll make a copy of them and change the file permissions so that we can read from, but not write to, the files.
First, let’s make a copy of one of our FASTQ files using the cp command.
Navigate to the shell_data/untrimmed_fastq directory and enter:
$ cp SRR098026.fastq SRR098026-copy.fastq
$ ls -F
SRR097977.fastq SRR098026-copy.fastq SRR098026.fastq
The mkdir command is used to make a directory. Enter mkdir followed by a space, then the directory name you want to create:
$ mkdir backup
We can now move our backup file to this directory. We can move files around using the command mv:
$ mv SRR098026-copy.fastq backup
$ ls backup
SRR098026-copy.fastq
The mv command is also how you rename files. Let’s rename this file to make it clear that this is a backup:
$ cd backup
$ mv SRR098026-copy.fastq SRR098026-backup.fastq
$ ls
SRR098026-backup.fastq
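The copy, move and rename steps can be rehearsed on a dummy file before touching real data. This is a sketch with assumed names (sample.fastq is a placeholder created on the spot); it also shows that the move and the rename can be done without changing directory.

```shell
scratch=$(mktemp -d)
cd "$scratch"
echo "raw data" > sample.fastq            # placeholder for a real FASTQ file

cp sample.fastq sample-copy.fastq         # copy; the original is left untouched
mkdir -p backup                           # -p: no error if backup already exists
mv sample-copy.fastq backup/              # move the copy into the directory
mv backup/sample-copy.fastq backup/sample-backup.fastq   # rename it in place

ls backup
```

Giving mv a path on both sides avoids the cd into backup, at the cost of a slightly longer command.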
We discussed in a previous section how to look at a file using less and head. We can also search within files without even opening them, using grep. We can then send what we find to somewhere else.
$ cd ~/shell_data/untrimmed_fastq
Suppose we want to see how many reads in our file have really bad segments containing 10 consecutive unknown nucleotides (Ns). Let’s search for the string NNNNNNNNNN in the SRR098026 file:
$ grep NNNNNNNNNN SRR098026.fastq
One of the sets of lines returned by this command is:
@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
We can use the -B argument for grep to return a specific number of lines before each match. The -A argument returns a specific number of lines after each matching line. Here we want the line before and the two lines after each matching line, so we add -B1 -A2 to our grep command:
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
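The effect of -B1 -A2 is easier to see on a file small enough to read in full. The sketch below assumes a made-up two-record file, toy.fastq, in which only the second read contains ten or more consecutive Ns.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Invented data: read1 is clean, read2 contains a run of twelve Ns.
printf '%s\n' \
  '@read1' 'ACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIII' \
  '@read2' 'ACNNNNNNNNNNNNGT' '+' '!!!!!!!!!!!!!!!!' > toy.fastq

# The match is on the sequence line; -B1 adds the header line before it,
# -A2 adds the separator and quality lines after it.
bad_record=$(grep -B1 -A2 NNNNNNNNNN toy.fastq)
echo "$bad_record"
```

The result is the complete four-line record for read2 and nothing from read1.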
The operator for redirecting output to a file is >.
Let’s try it out and copy all the records (including all four lines of each record) in our FASTQ file that contain ‘NNNNNNNNNN’ to another file called bad_reads.txt.
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
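We can then count the lines in bad_reads.txt. Because each record spans four lines and grep prints a -- separator line between non-adjacent groups of matches, this is a line count rather than the number of bad reads: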
$ wc -l bad_reads.txt
537 bad_reads.txt
We created a file to store the reads and then counted the lines in the file to see how many reads matched our criteria. There’s a way to do this, however, that doesn’t require us to create this intermediate file: the pipe command (|).
This is probably not a key on your keyboard you use very much, so let’s all take a minute to find that key. For the standard US QWERTY keyboard layout, the | character can be typed with Shift + \.
What | does is take the output that is scrolling by on the terminal and use that output as input to another command.
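When our output was scrolling by, we might have wished we could slow it down and look at it, like we can with less. Well it turns out that we can! We can send the output from our grep call through the less command:
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | less
In the same way, we can pipe the output into wc -l to count the lines without writing bad_reads.txt first: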
$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l
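The two approaches, with and without an intermediate file, can be compared side by side. This is a sketch on invented data (toy.fastq and matches.txt are names assumed for the example); the extra tr step only strips the padding that some versions of wc add.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Invented data: two of the three reads contain ten or more consecutive Ns.
printf '%s\n' \
  '@read1' 'ACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIII' \
  '@read2' 'ACNNNNNNNNNNNNGT' '+' '!!!!!!!!!!!!!!!!' \
  '@read3' 'NNNNNNNNNNNNNNNN' '+' '!!!!!!!!!!!!!!!!' > toy.fastq

# Two steps, leaving matches.txt behind...
grep NNNNNNNNNN toy.fastq > matches.txt
from_file=$(wc -l < matches.txt | tr -d ' ')

# ...or a single pipeline with no file left behind.
from_pipe=$(grep NNNNNNNNNN toy.fastq | wc -l | tr -d ' ')

echo "via file: $from_file, via pipe: $from_pipe"
```

Here grep is used without -B1 -A2, so each matching line corresponds to one read and the count is the number of bad reads.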
A variable is a way to store information, e.g. a list, and use it again (or several times) without having to write it out each time.
$ foo=abc
$ echo foo is $foo
foo is abc
Curly brackets {} mark where the variable name ends, so we can add text directly after its value:
$ echo foo is ${foo}EFG
foo is abcEFG
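The braces matter most when the text that follows could itself be part of a variable name. A minimal sketch, with variable names chosen purely for illustration:

```shell
sample=SRR098026
sample_copy=unrelated            # a different variable whose name starts the same way

without_braces="$sample_copy"    # the shell reads the whole name sample_copy
with_braces="${sample}_copy"     # braces end the name after sample

echo "$without_braces"           # prints: unrelated
echo "$with_braces"              # prints: SRR098026_copy
```

Without the braces the shell looks up sample_copy rather than appending _copy to the value of sample.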
Loops are key to productivity improvements through automation as they allow us to execute commands repeatedly. Similar to wildcards and tab completion, using loops also reduces the amount of typing (and typing mistakes). Loops are helpful when performing operations on groups of sequencing files, such as unzipping or trimming multiple files. We will use loops for these purposes in subsequent analyses, but will cover the basics of them for now.
When the shell sees the keyword for, it knows to repeat a command (or group of commands) once for each item in a list. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable, and the commands inside the loop are executed, before moving on to the next item in the list. Inside the loop, we call for the variable’s value by putting $ in front of it. The $ tells the shell interpreter to treat the variable as a variable name and substitute its value in its place, rather than treat it as text or an external command.
Let’s write a for loop to show us the first two lines of the fastq files we copied earlier. You will notice the shell prompt changes from $ to > and back again as we type in our loop. The second prompt, >, is different to remind us that we haven’t finished typing a complete command yet. A semicolon, ;, can be used to separate two commands written on a single line.
$ cd ../untrimmed_fastq/
$ for filename in *.fastq
> do
> head -n 2 ${filename}
> done
The for loop begins with the formula for <variable> in <group to iterate over>. In this case, the word filename is designated as the variable to be used over each iteration. In our case SRR097977.fastq and SRR098026.fastq will be substituted for filename because they fit the pattern of ending with .fastq in the directory we’ve specified. The next line of the for loop is do. The next line is the code that we want to execute. We are telling the loop to print the first two lines of each file we iterate over. Finally, the word done ends the loop.
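The same loop can be written on one line, with semicolons taking the place of the line breaks. The sketch below runs on two invented files (a.fastq and b.fastq are assumptions of the example) and also shows that the list need not come from a wildcard.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' > a.fastq   # invented record
printf '%s\n' '@read2' 'TTTT' '+' 'IIII' > b.fastq   # invented record

# One-line form: for ...; do ...; done
out=$(for filename in *.fastq; do head -n 2 "${filename}"; done)
echo "$out"

# The list can be any words, not only file names matched by a wildcard.
for base in A C G T; do echo "base: ${base}"; done
```

Quoting "${filename}" is a defensive habit: it keeps a file name containing spaces from being split into several arguments.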
After executing the loop, you should see the first two lines of both fastq files printed to the terminal. Let’s create a loop that will save this information to a file.
$ for filename in *.fastq
> do
> head -n 2 ${filename} >> seq_info.txt
> done
When writing a loop, you will not be able to return to previous lines once you have pressed Enter. Remember that we can cancel the current command using Ctrl+C if we notice a mistake that is going to prevent the loop from executing correctly.
Note that we are using >> to append the text to our seq_info.txt file. If we used >, the seq_info.txt file would be rewritten every time the loop iterates, so it would only have text from the last file processed. Instead, >> adds to the end of the file.
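The difference between the two operators shows up clearly when both are used in the same loop. A minimal sketch, with file names (overwritten.txt, appended.txt) invented for the example:

```shell
scratch=$(mktemp -d)
cd "$scratch"

for word in first second third
do
    echo "$word" > overwritten.txt     # > empties the file on every iteration
    echo "$word" >> appended.txt       # >> keeps what is already there
done

cat overwritten.txt      # prints: third
cat appended.txt         # prints first, second and third on separate lines
```

One thing to keep in mind with >> is that running the loop a second time appends the same lines again. An alternative is to redirect the loop as a whole by writing done > seq_info.txt, which creates the file once per run.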
basename is a UNIX command that is helpful for removing a uniform part of a name from a list of files. In this case, we will use basename to remove the .fastq extension from the files that we’ve been working with.
$ basename SRR097977.fastq .fastq
We see that this returns just the SRR accession, and no longer has the .fastq file extension on it.
SRR097977
If we try the same thing but use .fasta as the file extension instead, the name is returned unchanged. This is because basename only removes the suffix when it exactly matches the end of the file name.
$ basename SRR097977.fastq .fasta
SRR097977.fastq
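basename also strips any leading directories, and the shell has a built-in expansion that removes a suffix without calling an external command. The sketch below illustrates both; the path /some/dir is hypothetical and does not need to exist, because basename only manipulates the text of the name.

```shell
stripped=$(basename /some/dir/SRR097977.fastq .fastq)   # directories and suffix removed
unchanged=$(basename SRR097977.fastq .fasta)            # suffix does not match: name kept

echo "$stripped"          # prints: SRR097977
echo "$unchanged"         # prints: SRR097977.fastq

# Shell-only alternative: ${variable%suffix} removes a matching suffix.
filename=SRR097977.fastq
prefix=${filename%.fastq}
echo "$prefix"            # prints: SRR097977
```

Unlike basename, the ${filename%.fastq} form leaves any directory part of the name in place, so the two are not interchangeable when paths are involved.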
basename is really powerful when used in a for loop. It lets you access just the file prefix, which you can use to name things. Let’s try this.
Inside our for loop, we create a new name variable. We call the basename command inside the parentheses of $( ), then give our variable name from the for loop, in this case ${filename}, and finally state that .fastq should be removed from the file name. It’s important to note that we’re not changing the actual files, we’re creating a new variable called name. The line echo ${name} will print the value of name to the terminal each time the for loop runs. Because we are iterating over two files, we expect to see two lines of output.
$ for filename in *.fastq
> do
> name=$(basename ${filename} .fastq)
> echo ${name}
> done
One way this is really useful is to move files. Let’s rename all of our .txt files using mv so that they have the year on them, which will document when we created them.
$ for filename in *.txt
> do
> name=$(basename ${filename} .txt)
> mv ${filename} ${name}_2019.txt
> done
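A bulk rename is worth rehearsing on dummy files first. The sketch below repeats the loop in a scratch directory with two placeholder files, one of which deliberately has a space in its name to show why quoting the variables is a sensible precaution; the file names are assumptions of the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch bad_reads.txt "seq info.txt"        # placeholder files; one name has a space

for filename in *.txt
do
    name=$(basename "${filename}" .txt)
    mv "${filename}" "${name}_2019.txt"   # quotes keep each name as one argument
done

ls
```

The wildcard is expanded once, before the loop starts, so the renamed files are not picked up again during the same run. Running the loop a second time, however, would add _2019 to the names again.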