Introduction to Unix commandline (BASH)

Learning objectives:

Use the shell to navigate directories
Perform operations on files in directories outside your working directory
Interconvert between relative and absolute paths
View, search within, copy, move, and rename files. Create new directories.
Construct command pipelines with two or more stages.
Use for loops to run the same command for several input files.

These introductory notes are drawn from The Carpentries Introduction to the Command Line for Genomics lessons

Introducing the Shell

A shell is a computer program that presents a command line interface which allows you to control your computer using commands entered with a keyboard instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard combination.

Many bioinformatics tools can only be used through a command line interface, or have extra capabilities in the command line version that are not available in the GUI. This is true, for example, of BLAST, which offers many advanced functions only accessible to users who know how to use a shell.

The shell makes your work less boring. In bioinformatics you often need to do the same set of tasks with a large number of files. Learning the shell will allow you to automate those repetitive tasks and leave you free to do more exciting things.
The shell makes your work less error-prone. When humans do the same thing a hundred different times (or even ten times), they’re likely to make a mistake. Your computer can do the same thing a thousand times with no mistakes.
The shell makes your work more reproducible. When you carry out your work in the command-line (rather than a GUI), your computer keeps a record of every step that you’ve carried out, which you can use to re-do your work when you need to. It also gives you a way to communicate unambiguously what you’ve done, so that others can check your work or apply your process to new data.
Many bioinformatic tasks require large amounts of computing power and can’t realistically be run on your own machine. These tasks are best performed using remote computers or cloud computing, which can only be accessed through a shell.

In this lesson you will learn how to use the command line interface to move around in your file system.

Open up a terminal

[matt.bixley@wbg004 nesi02659]$

The dollar sign is a prompt, which shows us that the shell is waiting for input; your shell may use a different character as a prompt and may add information before the prompt. When typing commands, either from these lessons or from other sources, do not type the prompt, only the commands that follow it.

Let’s find out where we are by running a command called pwd (which stands for “print working directory”). At any moment, our current working directory is our current default directory, i.e., the directory that the computer assumes we want to run commands in, unless we explicitly specify something else. Here, the computer’s response is /home/yourname, which is the top level directory within our cloud system:

$ pwd
/scale_wlg_persistent/filesets/project/nesi02659

Let’s look at how our file system is organized. We can see what files and subdirectories are in this directory by running ls, which stands for “listing”:

$ ls

14417834         fastqcmultiqc_html  MGSS_2020_MikeH_test  safe             Untitled2.ipynb  Untitled4.ipynb  Untitled.ipynb
checkv.err       ga4gh               MGSS_resources_2020   SEPT28           untitled2.md     untitled4.md     untitled.md
checkv.out       ga4gh-example-1     MGSS_U                source_data      untitled2.txt    untitled4.txt    untitled.txt
checkv.sl        gateach             newcounts.txt         Untitled1.ipynb  Untitled3.ipynb  Untitled5.ipynb  viz_thang.sh
dc_workshop      Kegg_dat.ipynb      obss_2020             untitled1.md     untitled3.md     untitled5.txt    workflow_workshop
Ex16_afar.ipynb  MGSS_2020           RNA-seq.ipynb         untitled1.txt    untitled3.txt    untitled6.txt    yeast_counts_all_chr.txt

ls prints the names of the files and directories in the current directory in alphabetical order, arranged neatly into columns. We’ll be working within the shell_data subdirectory, and creating new subdirectories, throughout this workshop.

To get our shell data into our home directory

cp -r  /nesi/nobackup/nesi02659/shell_data/ ./

Navigating Files and Directories

The command to change locations in our file system is cd, followed by a directory name to change our working directory. cd stands for “change directory”.

Let’s say we want to navigate to the shell_data directory we saw above. We can use the following command to get there:

$ cd shell_data

Let’s look at what is in this directory:

We can make the ls output from above, more comprehensible by using the flag -F, which tells ls to add a trailing / to the names of directories:

$ ls -F
sra_metadata/  untrimmed_fastq/

Anything with a “/” after it is a directory. Things with a “*” after them are programs. If there are no decorations, it’s a file.

We have a special command to tell the computer to move us back or up one directory level.

$ cd ..

Full vs. Relative Paths

The cd command takes an argument which is a directory name. Directories can be specified using either a relative path or a full absolute path. The directories on the computer are arranged into a hierarchy. The full path tells you where a directory is in that hierarchy. Navigate to the home directory, then enter the pwd command.

$ cd  
$ pwd  

You will see:

/home/yourname

This is the full name of your home directory. This tells you that you are in a directory called dcuser, which sits inside a directory called home which sits inside the very top directory in the hierarchy. The very top of the hierarchy is a directory called / which is usually referred to as the root directory. So, to summarize: dcuser is a directory in home which is a directory in /. More on root and home in the next section.

Now enter the following command:

$ cd /home/yourname/shell_data/.hidden

This jumps forward multiple levels to the .hidden directory. Now go back to the home directory with cd or cd ~

$ cd

You can also navigate to the .hidden directory using:

$ cd shell_data/.hidden

These two commands have the same effect, they both take us to the .hidden directory. The first uses the absolute path, giving the full address from the home directory. The second uses a relative path, giving only the address from the working directory. A full path always starts with a /. A relative path does not.

A relative path is like getting directions from someone on the street. They tell you to “go right at the stop sign, and then turn left on Main Street”. That works great if you’re standing there together, but not so well if you’re trying to tell someone how to get there from another country. A full path is like GPS coordinates. It tells you exactly where something is no matter where you are right now.

Examining Files

We now know how to switch directories, run programs, and look at the contents of directories, but how do we look at the contents of files?

One way to examine a file is to print out all of the contents using the program cat.

Enter the following command from within the untrimmed_fastq directory:

$ cat SRR098026.fastq

This will print out all of the contents of the SRR098026.fastq to the screen.

Enter the following command:

$ less SRR097977.fastq

Some navigation commands in less:

key	action
`Space`	to go forward
`b`	to go backward
`g`	to go to the beginning
`G`	to go to the end
`q`	to quit

less also gives you a way of searching through files. Use the “/” key to begin a search. Enter the word you would like to search for and press enter. The screen will jump to the next location where that word is found.

There’s another way that we can look at files, and in this case, just look at part of them. This can be particularly useful if we just want to see the beginning or end of the file, or see how it’s formatted.

The commands are head and tail and they let you look at the beginning and end of a file, respectively.

$ head SRR098026.fastq

@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
+SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN

$ tail SRR098026.fastq

+SRR098026.247 HWUSI-EAS1599_1:2:1:2:1311 length=35
#!##!#################!!!!!!!######
@SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
GNTGNGGTCATCATACGCGCCCNNNNNNNGGCATG
+SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
B!;?!A=5922:##########!!!!!!!######
@SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
CNCTNTATGCGTACGGCAGTGANNNNNNNGGAGAT
+SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
A!@B!BBB@ABAB#########!!!!!!!######

The -n option to either of these commands can be used to print the first or last n lines of a file.

$ head -n 1 SRR098026.fastq

@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35

$ tail -n 1 SRR098026.fastq

A!@B!BBB@ABAB#########!!!!!!!######

Copying Files

When working with computational data, it’s important to keep a safe copy of that data that can’t be accidentally overwritten or deleted. For this lesson, our raw data is our FASTQ files. We don’t want to accidentally change the original files, so we’ll make a copy of them and change the file permissions so that we can read from, but not write to, the files.

First, let’s make a copy of one of our FASTQ files using the cp command.

Navigate to the shell_data/untrimmed_fastq directory and enter:

$ cp SRR098026.fastq SRR098026-copy.fastq
$ ls -F

SRR097977.fastq  SRR098026-copy.fastq  SRR098026.fastq

Creating Directories

The mkdir command is used to make a directory. Enter mkdir followed by a space, then the directory name you want to create:

$ mkdir backup

Moving

We can now move our backup file to this directory. We can move files around using the command mv:

$ mv SRR098026-copy.fastq backup
$ ls backup

SRR098026-copy.fastq

The mv command is also how you rename files. Let’s rename this file to make it clear that this is a backup:

$ cd backup
$ mv SRR098026-copy.fastq SRR098026-backup.fastq
$ ls

SRR098026-backup.fastq

Redirection/Pipes

We discussed in a previous section how to look at a file using less and head. We can also search within files without even opening them, using grep. We can then send what we find to somewhere else.

$ cd ~/shell_data/untrimmed_fastq

Suppose we want to see how many reads in our file have really bad segments containing 10 consecutive unknown nucleotides (Ns). Let’s search for the string NNNNNNNNNN in the SRR098026 file:

$ grep NNNNNNNNNN SRR098026.fastq

One of the sets of lines returned by this command is:

@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

We can use the -B argument for grep to return a specific number of lines before each match. The -A argument returns a specific number of lines after each matching line. Here we want the line before and the two lines after each matching line, so we add -B1 -A2 to our grep command:

$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq

The command for redirecting output to a file is >.

Let’s try out this command and copy all the records (including all four lines of each record) in our FASTQ files that contain ‘NNNNNNNNNN’ to another file called bad_reads.txt.

$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt

We can then see how many bad read we have

$ wc -l bad_reads.txt

537 bad_reads.txt

We created the files to store the reads and then counted the lines in the file to see how many reads matched our criteria. There’s a way to do this, however, that doesn’t require us to create these intermediate files - the pipe command (|).

This is probably not a key on your keyboard you use very much, so let’s all take a minute to find that key. For the standard QWERTY keyboard layout, the | character can be found using the key combination

Shift+\

What | does is take the output that is scrolling by on the terminal and uses that output as input to another command. When our output was scrolling by, we might have wished we could slow it down and look at it, like we can with less. Well it turns out that we can! We can redirect our output from our grep call through the less command.

$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l

Variables

A variable is a method to store information eg a list, and use it again (or several times) without having to write the list out.

$ foo=abc
$ echo foo is $foo
foo is abc

We can add to a variable with curly brackets {}

$ echo foo is ${foo}EFG
foo is abcEFG

Writing for loops

Loops are key to productivity improvements through automation as they allow us to execute commands repeatedly. Similar to wildcards and tab completion, using loops also reduces the amount of typing (and typing mistakes). Loops are helpful when performing operations on groups of sequencing files, such as unzipping or trimming multiple files. We will use loops for these purposes in subsequent analyses, but will cover the basics of them for now.

When the shell sees the keyword for, it knows to repeat a command (or group of commands) once for each item in a list. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable, and the commands inside the loop are executed, before moving on to the next item in the list. Inside the loop, we call for the variable’s value by putting $ in front of it. The $ tells the shell interpreter to treat the variable as a variable name and substitute its value in its place, rather than treat it as text or an external command.

Let’s write a for loop to show us the first two lines of the fastq files we downloaded earlier. You will notice the shell prompt changes from $ to > and back again as we were typing in our loop. The second prompt, >, is different to remind us that we haven’t finished typing a complete command yet. A semicolon, ;, can be used to separate two commands written on a single line.

$ cd ../untrimmed_fastq/

$ for filename in *.fastq
> do
> head -n 2 ${filename}
> done

The for loop begins with the formula for <variable> in <group to iterate over>. In this case, the word filename is designated as the variable to be used over each iteration. In our case SRR097977.fastq and SRR098026.fastq will be substituted for filename because they fit the pattern of ending with .fastq in the directory we’ve specified. The next line of the for loop is do. The next line is the code that we want to execute. We are telling the loop to print the first two lines of each variable we iterate over. Finally, the word done ends the loop.

After executing the loop, you should see the first two lines of both fastq files printed to the terminal. Let’s create a loop that will save this information to a file.

$ for filename in *.fastq
> do
> head -n 2 ${filename} >> seq_info.txt
> done

When writing a loop, you will not be able to return to previous lines once you have pressed Enter. Remember that we can cancel the current command using

Ctrl+C

If you notice a mistake that is going to prevent your loop for executing correctly.

Note that we are using >> to append the text to our seq_info.txt file. If we used >, the seq_info.txt file would be rewritten every time the loop iterates, so it would only have text from the last variable used. Instead, >> adds to the end of the file.

Extended for loops

basename is a function in UNIX that is helpful for removing a uniform part of a name from a list of files. In this case, we will use basename to remove the .fastq extension from the files that we’ve been working with.

$ basename SRR097977.fastq .fastq

We see that this returns just the SRR accession, and no longer has the .fastq file extension on it.

SRR097977

If we try the same thing but use .fasta as the file extension instead, nothing happens. This is because basename only works when it exactly matches a string in the file.

$ basename SRR097977.fastq .fasta

SRR097977.fastq

Basename is really powerful when used in a for loop. It allows to access just the file prefix, which you can use to name things. Let’s try this.

Inside our for loop, we create a new name variable. We call the basename function inside the parenthesis, then give our variable name from the for loop, in this case ${filename}, and finally state that .fastq should be removed from the file name. It’s important to note that we’re not changing the actual files, we’re creating a new variable called name. The line > echo $name will print to the terminal the variable name each time the for loop runs. Because we are iterating over two files, we expect to see two lines of output.

$ for filename in *.fastq
> do
> name=$(basename ${filename} .fastq)
> echo ${name}
> done

One way this is really useful is to move files. Let’s rename all of our .txt files using mv so that they have the years on them, which will document when we created them.

$ for filename in *.txt
> do
> name=$(basename ${filename} .txt)
> mv ${filename}  ${name}_2019.txt
> done