Intro to Linux CLI data manipulation for biologists

This is an intro to data manipulation and intermediate command line scripting for biologists. We assume basic familiarity with the command line and focus here on the most useful tools for manipulating data.

Each grey cell in this tutorial indicates a command line interaction. Lines starting with $ indicate a command that should be executed in a terminal connected to the cluster, for example by copying and pasting the text (without the leading $) into your terminal. Elements in code cells surrounded by angle brackets are variables that need to be replaced by the user. All lines in code cells beginning with ## are comments and should not be copied and executed. All other lines should be interpreted as output from the issued commands.

## Example Code Cell.
## Create an empty file in my home directory called `watdo.txt`
$ touch ~/watdo.txt

## Print "wat" to the screen
$ echo "wat"
wat

Getting set up

Start by [making an ssh connection to the cluster](../UiO_Cluster_info.html). Once you have logged in, use git to fetch some data:

$ git clone https://github.com/speciationgenomics/unix_exercises.git

Now change into the unix_exercises directory:

$ cd unix_exercises

Things we want to be able to do:

Look at data: head, tail, cat, less
Find stuff in data: cut, grep
Summarize data: wc, sort, uniq, |
Organize data: pwd, mkdir, mv, cp
Modify data: nano, sed

See the help for any of these commands with: man [cmd]

Look at data

head displays the first lines of a file.

## indicate the number of lines to include with `-n`
## the default is 10
$ head -n 5 iris_data.tsv
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa

tail displays the last lines of a file.

## show the last 10 lines in the file
$ tail iris_data.tsv
## `tail -n +K` starts printing at line K, which skips the lines before it
## here, output starts at line 50, so the first 49 lines are skipped
$ tail -n +50 iris_data.tsv
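
Combining head and tail lets us pull out an arbitrary range of lines, or drop a header line. The sketch below is written without `$` prompts so it can be pasted whole. It builds a tiny stand-in table in a temporary directory (the file name `demo.tsv` and its contents are made up for illustration, and `mktemp -d` is assumed to be available, as it is on most Linux systems).

```shell
## Work in a throwaway directory so nothing in unix_exercises is touched
cd "$(mktemp -d)"

## A made-up table: one header line plus six data lines
printf 'id\tvalue\n1\ta\n2\tb\n3\tc\n4\td\n5\te\n6\tf\n' > demo.tsv

## Drop the header by starting the output at line 2
tail -n +2 demo.tsv

## Print lines 3 through 5: keep the first 5 lines, then start at line 3 of those
head -n 5 demo.tsv | tail -n +3
```

The same pattern should work on iris_data.tsv, for example to look at the data without its header line.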

cat concatenates and prints files

## cat will show all contents of the file
$ cat mobydick.txt

Sometimes you will issue a command and then realize it was a bad idea
(like cat'ing 15,000 lines of text to your screen). In that case you can
kill the running command with CTRL+c.

## give `cat` several files to combine their contents; `>` redirects the result into a new file
$ cat mobydick.txt udhr.txt > combinedExample.txt
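
One thing to watch with `>` is that it overwrites the target file if it already exists, while `>>` appends to it. A minimal sketch, using made-up files (`a.txt`, `b.txt`, `combined.txt`) in a temporary directory and written without `$` prompts so it can be pasted whole:

```shell
## Work in a throwaway directory (assumes `mktemp -d` is available)
cd "$(mktemp -d)"
printf 'first file\n' > a.txt
printf 'second file\n' > b.txt

## `>` creates combined.txt, or overwrites it if it already exists
cat a.txt b.txt > combined.txt

## `>>` adds to the end of combined.txt instead of overwriting it
cat a.txt >> combined.txt

## combined.txt now holds three lines: first file, second file, first file
cat combined.txt
```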

less is used for viewing files and allows backward movement in the file, as well as forward movement.

## use `q` to exit
$ less mobydick.txt

Find stuff in data

cut cuts out selected portions of each line of a file.

## use `-f` to specify a column, or range of columns
$ cut -f 3-5 iris_data.tsv
## use `>` to push the cut data to a new file
$ cut -f 3-5 iris_data.tsv > petalData.txt
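
cut splits on tabs by default, which is why it works directly on iris_data.tsv. For other delimiters, `-d` sets the separator, and `-f` also accepts a comma-separated list of columns. A small sketch on a made-up comma-separated file (`samples.csv` and its contents are purely illustrative):

```shell
## Work in a throwaway directory (assumes `mktemp -d` is available)
cd "$(mktemp -d)"
printf 'sample,site,length,weight\nA1,north,12.5,3.1\nB2,south,10.0,2.7\n' > samples.csv

## -d ',' tells cut to split on commas instead of tabs
## -f 1,3 selects columns 1 and 3, which do not need to be adjacent
cut -d ',' -f 1,3 samples.csv
## prints:
## sample,length
## A1,12.5
## B2,10.0
```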

grep searches input files for lines that match a specified search term or pattern

## search for the word whale in moby dick
$ grep "whale" mobydick.txt
## use the `--color` flag to highlight "whale" in the text.
$ grep --color "whale" mobydick.txt
## `-v` inverts the search and finds lines without "whale"
$ grep -v "whale" mobydick.txt
## `-c` counts the number of lines that contain "whale"
$ grep -c "whale" mobydick.txt
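
A few more grep flags tend to be useful in practice. The sketch below uses a made-up four-line file (`sample.txt`) so the counts are easy to check by eye. Note that `-o` is available in GNU and BSD grep but is not part of the POSIX standard, so treat it as an assumption about your system.

```shell
## Work in a throwaway directory (assumes `mktemp -d` is available)
cd "$(mktemp -d)"
printf 'The whale surfaced.\nA Whale is not a fish.\nwhale after whale passed the ship.\nNo cetaceans here.\n' > sample.txt

## -i ignores case, so "Whale" matches too (3 lines)
grep -i "whale" sample.txt

## -n prefixes each matching line with its line number
grep -n "whale" sample.txt

## -c counts matching lines: 2 here, even though "whale" appears 3 times
grep -c "whale" sample.txt

## -o prints every match on its own line, so `wc -l` counts occurrences: 3
grep -o "whale" sample.txt | wc -l
```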

Summarize data

wc counts different elements of a file.

## See the number of lines, words, and characters in the file
$ wc udhr.txt
## specify only lines `-l`, words `-w`, or characters `-m`
$ wc -l udhr.txt

sort sorts lines of a file

## sorts lines alphabetically by default
$ sort udhr.txt
## sort by column using `-k`; add `-n` to sort numerically
$ sort -k 2 -n iris_data.tsv
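
The difference between alphabetical and numerical sorting is easy to miss, so here is a small sketch on made-up files (`numbers.txt` and `table.tsv` are illustrative only). Without `-n`, sort compares values character by character, so 100 comes before 2.

```shell
## Work in a throwaway directory (assumes `mktemp -d` is available)
cd "$(mktemp -d)"
printf '10\n9\n100\n2\n' > numbers.txt

## default (alphabetical) order: 10, 100, 2, 9
sort numbers.txt

## -n compares values as numbers: 2, 9, 10, 100
sort -n numbers.txt

## -r reverses the order: 100, 10, 9, 2
sort -n -r numbers.txt

## sort a table on its second column only (`-k 2,2`), numerically
printf 'b\t10\na\t9\nc\t100\n' > table.tsv
sort -k 2,2 -n table.tsv
```

Writing the key as `-k 2,2` restricts the comparison to column 2; a bare `-k 2` uses everything from column 2 to the end of the line.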

| indicates a “pipe”: it sends the output of the command on its left to the command on its right as input

## lists all text files
$ ls *.txt
## counts the lines in that listing, i.e. the number of text files
$ ls *.txt | wc -l
## counts the characters in the listing of file names (not in the files themselves)
$ ls *.txt | wc -m
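
Pipes can be chained, so several of the tools above can be combined into one command. A minimal sketch, using a made-up five-line table (`mini_iris.tsv`) that only mimics the layout of the iris data:

```shell
## Work in a throwaway directory (assumes `mktemp -d` is available)
cd "$(mktemp -d)"
printf 'Sepal.Length\tSpecies\n5.1\tsetosa\n7.0\tversicolor\n4.9\tsetosa\n6.3\tvirginica\n' > mini_iris.tsv

## 1. tail drops the header line
## 2. cut keeps only the species column
## 3. grep keeps the lines that contain "setosa"
## 4. wc -l counts what is left (2 in this made-up table)
tail -n +2 mini_iris.tsv | cut -f 2 | grep "setosa" | wc -l
```

Each step only sees the output of the step before it, which makes it easy to build a pipeline up one command at a time and check the intermediate output as you go.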

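uniq reports or filters out repeated lines in a file. It only compares adjacent lines, so it works here because the iris data is already grouped by species.

## use `cut` and `|` to ask which unique species are in the iris data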
$ cut -f 5 iris_data.tsv | uniq
Species
setosa
versicolor
virginica

## Count how many times each value occurs with `-c`
$ cut -f 5 iris_data.tsv | uniq -c
      1 Species
     50 setosa
     50 versicolor
     50 virginica
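
Because uniq only collapses repeated lines that sit next to each other, unsorted input usually needs to go through sort first. A small sketch on a made-up, deliberately shuffled list (`species.txt` is illustrative only):

```shell
## Work in a throwaway directory (assumes `mktemp -d` is available)
cd "$(mktemp -d)"
printf 'setosa\nvirginica\nsetosa\nversicolor\nvirginica\nsetosa\n' > species.txt

## no two identical lines are adjacent, so uniq alone removes nothing (6 lines)
uniq species.txt | wc -l

## sorting first puts identical lines next to each other (3 unique values)
sort species.txt | uniq | wc -l

## count each value, then sort the counts so the most frequent comes first
sort species.txt | uniq -c | sort -n -r
```

The last line is a common idiom for a quick frequency table; here it lists setosa (3) first, then virginica (2), then versicolor (1).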

Organize data

pwd returns the full path of the working directory you are in

$ pwd

mkdir makes a new directory

$ mkdir NewDirectory

cp makes a copy of a file/directory

$ cp mobydick.txt NewDirectory/mobydick_copy.txt
## use `cd` to change into NewDirectory
$ cd NewDirectory/
## and `ls` to see the copied file
$ ls
$ cd ..

mv moves a file somewhere else and does NOT make a copy

$ mv mobydick.txt NewDirectory/
## if we `ls`, we will no longer see the file
## we can move it back up one directory using `.`, which stands for the current directory
$ mv NewDirectory/mobydick.txt .
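
Two related patterns come up often when organizing files: creating nested directories in one step, and renaming a file (which is just mv with a new name). A minimal sketch with made-up names (`notes.txt`, `project/raw`), run in a temporary directory:

```shell
## Work in a throwaway directory (assumes `mktemp -d` is available)
cd "$(mktemp -d)"
printf 'some notes\n' > notes.txt

## -p creates any missing parent directories and does not complain if they exist
mkdir -p project/raw

## copy the file into the new directory under a different name
cp notes.txt project/raw/notes_backup.txt

## moving a file to a new name in the same directory renames it
mv notes.txt notes_v2.txt

## notes.txt is gone; notes_v2.txt and project/ remain
ls
ls project/raw
```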

Modify data

nano is a text editor that allows you to modify files within the CLI

## explore nano using the iris data file
$ nano iris_data.tsv
## to exit use `ctrl` + `x`

sed is commonly used for find and replace text editing

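## replace the first "whale" on each line with "robot-kitten"
## (add a trailing `g`, as in 's/whale/robot-kitten/g', to replace every occurrence)
$ sed 's/whale/robot-kitten/' mobydick.txt > robykitten.txt
## see where the replacements occurred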
$ grep --color "robot-kitten" robykitten.txt
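
By default, sed's `s` command replaces only the first match on each line. The sketch below shows the effect of the `g` (global) flag on a made-up one-line file (`line.txt`):

```shell
## Work in a throwaway directory (assumes `mktemp -d` is available)
cd "$(mktemp -d)"
printf 'whale after whale\n' > line.txt

## without `g`: only the first match on the line is replaced
sed 's/whale/robot-kitten/' line.txt
## prints: robot-kitten after whale

## with `g`: every match on the line is replaced
sed 's/whale/robot-kitten/g' line.txt
## prints: robot-kitten after robot-kitten
```

In both cases sed writes the result to the screen and leaves the original file unchanged, which is why the example above redirects the output into a new file.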

Further resources

The Python Data Science Handbook

A couple of very nice introductions to the Linux command line for biologists: