Intro to Linux CLI data manipulation for biologists
This is an intro to data manipulation and intermediate command line scripting for biologists. We assume a basic familiarity with command line interactions, and instead focus here on the most useful tools for manipulating data.
Each grey cell in this tutorial indicates a command line interaction. Lines starting with $ indicate a command that should be executed in a terminal connected to the cluster, for example by copying and pasting the text into your terminal. Elements in code cells surrounded by angle brackets (e.g.
## Example Code Cell.
## Create an empty file in my home directory called `watdo.txt`
$ touch ~/watdo.txt
## Print "wat" to the screen
$ echo "wat"
wat
Getting set up
Start by making an ssh connection to [the cluster](../UiO_Cluster_info.html).
Once you are logged in, use git to fetch some data:
$ git clone https://github.com/speciationgenomics/unix_exercises.git
Now change directory into unix_exercises:
$ cd unix_exercises
Things we want to be able to do:
- Look at data: head, tail, cat, less
- Find stuff in data: cut, grep
- Summarize data: wc, sort, uniq, |
- Organize data: pwd, mkdir, mv, cp
- Modify data: nano, sed
See the help for any of these commands with: man [cmd]
Look at data
head displays the first lines of a file.
## indicate the number of lines to include with `-n`
## the default is 10
$ head -n 5 iris_data.tsv
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
tail displays the last lines of a file.
## show the last 10 lines in the file
$ tail iris_data.tsv
## `tail -n +N` starts printing at line N, skipping the lines before it
## here, we start at line 50, so the first 49 lines of the file are skipped
$ tail -n+50 iris_data.tsv
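head and tail can be combined to pull out a block of lines from the middle of a file. The sketch below assumes a made-up 10-line file (`numbers.txt`, created only for illustration) and uses the `|` symbol, covered under "Summarize data", to pass the output of one command to the next.

```shell
# Build a small 10-line demo file (hypothetical, not part of the exercise data)
demo_dir=$(mktemp -d)
seq 1 10 > "$demo_dir/numbers.txt"

# Keep the first 5 lines, then the last 3 of those: lines 3, 4 and 5
middle=$(head -n 5 "$demo_dir/numbers.txt" | tail -n 3)
echo "$middle"

# Start printing at line 2 (i.e. drop one header line), then show one line
no_header=$(tail -n +2 "$demo_dir/numbers.txt" | head -n 1)
echo "$no_header"
```

The same pattern should work for dropping the header row of a table such as iris_data.tsv before further processing.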
cat concatenates and prints files.
## cat will show all contents of the file
$ cat mobydick.txt
Sometimes you may issue commands and then realize this is a bad idea
(like cat'ing 15,000 lines of text to your screen), in which case you can
kill commands using CTRL+c.
## `cat` joins the contents of multiple files; `>` redirects the result into a new file
$ cat mobydick.txt udhr.txt > combinedExample.txt
less is used for viewing files and allows backward movement in the file, as well as forward movement.
## use `q` to exit
$ less mobydick.txt
Find stuff in data
cut cuts out selected portions of each line of a file.
## use `-f` to specify a column, or range of columns (columns are tab-separated by default)
$ cut -f 3-5 iris_data.tsv
## use `>` to push the cut data to a new file
$ cut -f 3-5 iris_data.tsv > petalData.txt
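cut splits on tabs unless told otherwise. For comma-separated files, the `-d` option sets the delimiter. The sketch below uses a small made-up CSV file (`demo.csv`, an assumption for illustration only).

```shell
# Create a tiny comma-separated demo file (hypothetical data)
demo_dir=$(mktemp -d)
printf 'id,species,length\n1,setosa,5.1\n2,virginica,6.3\n' > "$demo_dir/demo.csv"

# -d ',' tells cut that columns are separated by commas; -f 2 picks column 2
species=$(cut -d ',' -f 2 "$demo_dir/demo.csv")
echo "$species"

# Non-adjacent columns are given as a comma-separated list
id_and_length=$(cut -d ',' -f 1,3 "$demo_dir/demo.csv")
echo "$id_and_length"
```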
grep searches input files for lines that match a specified search term or pattern.
## search for the word whale in moby dick
$ grep "whale" mobydick.txt
## use the `--color` flag to highlight "whale" in the text.
$ grep --color "whale" mobydick.txt
## `-v` inverts the search and finds lines without "whale"
$ grep -v "whale" mobydick.txt
## `-c` counts the number of lines that contain a match
$ grep -c "whale" mobydick.txt
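Note that `grep -c` counts matching lines, not individual matches, and that grep is case-sensitive by default. A minimal sketch on a made-up three-line file (`demo.txt`, hypothetical) shows the difference. The `-o` option used here is available in GNU and BSD grep, which is assumed to be what the cluster provides.

```shell
# Three-line demo text (hypothetical, not taken from mobydick.txt)
demo_dir=$(mktemp -d)
printf 'The whale saw a whale.\nNo cetaceans here.\nWhale ahoy!\n' > "$demo_dir/demo.txt"

# -c counts LINES with a match: only line 1 contains lowercase "whale"
line_count=$(grep -c "whale" "$demo_dir/demo.txt")

# -o prints every match on its own line, so wc -l counts individual matches
match_count=$(grep -o "whale" "$demo_dir/demo.txt" | wc -l)

# -i ignores case, so "Whale" on line 3 now matches too
case_insensitive_count=$(grep -c -i "whale" "$demo_dir/demo.txt")

echo "$line_count" "$match_count" "$case_insensitive_count"
```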
Summarize data
wc counts different elements of a file.
## See the number of lines, words, and characters in the file
$ wc udhr.txt
## specify only lines `-l`, words `-w`, or characters `-m`
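As a short sketch of the individual counting options, the example below runs each one on a made-up two-line file (`demo.txt`, hypothetical). The text is fed in through `cat` and `|` (covered further down) so that wc prints only the number, without the file name.

```shell
# Two-line demo file (hypothetical)
demo_dir=$(mktemp -d)
printf 'all human beings\nare born free\n' > "$demo_dir/demo.txt"

# Count only lines, only words, or only characters
line_total=$(cat "$demo_dir/demo.txt" | wc -l)
word_total=$(cat "$demo_dir/demo.txt" | wc -w)
char_total=$(cat "$demo_dir/demo.txt" | wc -m)

# Characters include the spaces and the newline at the end of each line
echo "$line_total" "$word_total" "$char_total"
```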
sort sorts the lines of a file.
## sorts lines alphabetically by default
$ sort udhr.txt
## sort by column using `-k`; add `-n` to sort numerically rather than alphabetically
$ sort -n -k 2 iris_data.tsv
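By default sort compares lines as text, so "100" comes before "9". The `-n` option switches to numeric comparison. The sketch below illustrates this on made-up values (`values.txt` and `table.tsv` are hypothetical files created for the demo).

```shell
# Demo values (hypothetical)
demo_dir=$(mktemp -d)
printf '10\n9\n100\n2\n' > "$demo_dir/values.txt"

# Text order compares character by character: 10, 100, 2, 9
text_order=$(sort "$demo_dir/values.txt")

# Numeric order with -n: 2, 9, 10, 100
numeric_order=$(sort -n "$demo_dir/values.txt")

# Sort a two-column table numerically on column 2 only (-k 2,2)
printf 'a\t10\nb\t9\nc\t100\n' > "$demo_dir/table.tsv"
by_column=$(sort -n -k 2,2 "$demo_dir/table.tsv")

echo "$text_order"
echo "$numeric_order"
echo "$by_column"
```

Writing `-k 2,2` rather than `-k 2` limits the sort key to column 2 instead of everything from column 2 to the end of the line.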
| indicates a “pipe” and is used to pipe the output of one command into the input of the next
## lists all text files
$ ls *.txt
## takes the list of text files and counts its lines, i.e. the number of files
$ ls *.txt | wc -l
## shows the number of characters in the listed file names (not in the files themselves)
$ ls *.txt | wc -m
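Pipes can chain more than two commands. As a rough sketch, the example below combines the tools introduced so far on a made-up tab-separated table (`demo.tsv`, with hypothetical column names) to answer two small questions.

```shell
# Small tab-separated demo table (hypothetical data)
demo_dir=$(mktemp -d)
printf 'length\tspecies\n5.1\tsetosa\n6.3\tvirginica\n4.9\tsetosa\n' > "$demo_dir/demo.tsv"

# How many rows are setosa? Take column 2, then count the matching lines
setosa_rows=$(cut -f 2 "$demo_dir/demo.tsv" | grep -c "setosa")

# What is the largest length? Drop the header, take column 1,
# sort numerically, and keep the last (largest) line
largest=$(tail -n +2 "$demo_dir/demo.tsv" | cut -f 1 | sort -n | tail -n 1)

echo "$setosa_rows" "$largest"
```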
uniq reports or filters out repeated adjacent lines in a file
## use `cut` and `|` to list the unique species in the iris data
## (this works here because the rows are already grouped by species)
$ cut -f 5 iris_data.tsv | uniq
Species
setosa
versicolor
virginica
## Count how many times each value occurs with `-c`
$ cut -f 5 iris_data.tsv | uniq -c
1 Species
50 setosa
50 versicolor
50 virginica
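uniq only collapses repeated lines that sit next to each other, so data that is not already grouped usually needs to go through sort first. The sketch below assumes a made-up file (`species.txt`, hypothetical) in which the species names alternate.

```shell
# Species names in alternating order (hypothetical data)
demo_dir=$(mktemp -d)
printf 'setosa\nvirginica\nsetosa\nvirginica\nsetosa\n' > "$demo_dir/species.txt"

# Without sorting, no two neighbouring lines are equal, so nothing is collapsed
unsorted_total=$(uniq "$demo_dir/species.txt" | wc -l)

# Sorting first puts identical lines next to each other
distinct_total=$(sort "$demo_dir/species.txt" | uniq | wc -l)

# Count each value, then sort the counts from largest to smallest
most_common=$(sort "$demo_dir/species.txt" | uniq -c | sort -n -r | head -n 1)

echo "$unsorted_total" "$distinct_total"
echo "$most_common"
```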
Organize data
pwd returns the full path of the working directory you are in
$ pwd
mkdir makes a new directory
$ mkdir NewDirectory
cp makes a copy of a file or directory (copying a directory needs `-r`)
$ cp mobydick.txt NewDirectory/mobydick_copy.txt
## use `cd` to change into NewDirectory
$ cd NewDirectory/
## and `ls` to see the copied file
$ ls
$ cd ..
mv moves a file somewhere else and does NOT make a copy
$ mv mobydick.txt NewDirectory/
## if we `ls`, we will no longer see the file
## we can move it back into the current directory using `.`
$ mv NewDirectory/mobydick.txt .
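A few options make these commands more flexible. The sketch below works in a temporary directory with made-up names (`results`, `run1` and `notes.txt` are hypothetical) and shows nested directory creation, copying a whole directory, and renaming a file with mv.

```shell
# Work in a throwaway directory so nothing real is touched
demo_dir=$(mktemp -d)

# -p creates any missing parent directories in one go
mkdir -p "$demo_dir/results/run1"
echo "wat" > "$demo_dir/results/run1/notes.txt"

# -r copies a directory together with everything inside it
cp -r "$demo_dir/results" "$demo_dir/results_backup"

# mv with a new file name in the same directory renames the file
mv "$demo_dir/results/run1/notes.txt" "$demo_dir/results/run1/renamed_notes.txt"

ls "$demo_dir/results/run1" "$demo_dir/results_backup/run1"
```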
Modify data
nano is a text editor that allows you to modify files within the CLI
## explore nano using the iris data file
$ nano iris_data.tsv
## to exit use `ctrl` + `x`
sed is commonly used for find and replace text editing
## replace the first "whale" on each line with "robot-kitten"; add `g` after the final `/` to replace every occurrence
$ sed 's/whale/robot-kitten/' mobydick.txt > robykitten.txt
## see where the replacements occurred
$ grep --color "robot-kitten" robykitten.txt
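The substitution `s/old/new/` changes only the first match on each line. Appending the `g` (global) flag changes every match. A minimal sketch on a single made-up sentence (hypothetical, not from mobydick.txt):

```shell
# One demo line containing the search term twice (hypothetical text)
line='the whale chased the other whale'

# Without g, only the first "whale" on the line is replaced
first_only=$(echo "$line" | sed 's/whale/robot-kitten/')

# With g, every "whale" on the line is replaced
every_match=$(echo "$line" | sed 's/whale/robot-kitten/g')

echo "$first_only"
echo "$every_match"
```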
Further resources
A couple of very nice introductions to the Linux command line for biologists: