Sol Genomics Network
Introduction to UNIX command-line
Boyce Thompson Institute
2014
Noe Fernandez
Class Content
• Terminal file system navigation
• Wildcards, shortcuts and special characters
• File permissions
• Compression UNIX commands
• Networking UNIX commands
• Basic NGS file formats
• Text files manipulation commands
• Command-line pipelines
• Introduction to bash scripts
What is a virtual machine?
What is a terminal?
Why is the command line needed?
• Most software for biological data analysis can be used in a UNIX command-line terminal
• Most servers for biological data analysis use Linux as their operating system
• Data analysis on computing servers is much faster, since they can use more CPUs and RAM than a PC (e.g., Boyce servers have 64 cores and 1 TB of RAM)
• Large NGS data files cannot be opened or loaded in most software with a graphical interface or in web sites
• Compression commands are useful, since large NGS data files are usually stored and shared as compressed files
Text handling commands
command > file              saves STDOUT in a file
command >> file             appends STDOUT to a file
cat file                    concatenates and prints files
cat file1 file2 > file3     merges files 1 and 2 into file3
cat *fasta > all.fasta      concatenates all fasta files in the current directory
head file                   prints first lines from a file
head -n 5 file              prints first five lines from a file
tail file                   prints last lines from a file
tail -n 5 file              prints last five lines from a file
less file                   views a file
less -N file                includes line numbers
less -S file                chops long lines instead of wrapping them
grep 'pattern' file         prints lines matching a pattern
grep -c 'pattern' file      counts lines matching a pattern
cut -f 1,3 file             retrieves data from selected columns in a tab-delimited file
sort file                   sorts lines from a file
sort -u file                sorts and returns unique lines
uniq -c file                filters adjacent repeated lines and counts them
wc file                     counts lines, words and bytes
paste file1 file2           concatenates the lines of input files
paste -d ","                concatenates the lines of input files by commas
sed                         transforms text
File system commands
ls                  lists directories and files
ls -a               lists all files including hidden files
ls -lh              long-format list with human-readable sizes
ls -t               lists sorted by date
pwd                 returns path to working directory
cd dir              changes directory
cd ..               goes to parent directory
cd /                goes to root directory
cd                  goes to home directory
touch file_name     creates an empty file
cp file file_copy   copies a file
cp -r               copies directories and the files they contain
rm file             deletes a file
rm -r dir           deletes a directory and its files
mv file1 file2      moves or renames a file
mkdir dir_name      creates a directory
rmdir dir_name      deletes an empty directory
locate file_name    searches for a file
man command         shows the manual of a command
top                 shows process activity
df -h               shows disk space info
Networking commands
wget URL            downloads a file from a URL
ssh user@server     connects to a server
scp                 copies files between computers
apt-get install     installs applications in Linux
Compression commands
gzip/zip            compresses a file
gunzip/unzip        decompresses a file
tar -cvf            groups files
tar -xvf            ungroups files
tar -zcvf           groups and gzips files
tar -zxvf           gunzips and ungroups files
UNIX Command-Line Cheat Sheet
BTI-SGN Bioinformatics Course 2014
•File system commands
File system navigation
Download the cheat sheet from:
http://www.slideshare.net/NoFernndezPozo/unix-command-sheet2014
https://btiplantbioinfocourse.files.wordpress.com/2014/02/unix_command_sheet_2014.pdf
File system navigation
File Browser = Terminal
Home and Root directories
Root directory (/)
/bin, /lib, /usr    code and code libraries
/var                logs and other data
/home               user directories
/tmp                temporary files
/etc                configuration information
/proc               special file system in Linux
Home directory
/home/bioinfo
/home/noe
/home/noe/Desktop
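Anatomy of a UNIX command
grep -c -A 3 --ignore-case 'pattern' file.txt
grep                command
-c                  simple option flag (short form)
-A 3                option with argument
--ignore-case       option (long form)
'pattern'           argument (the pattern to search for)
file.txt            argument (the file to search in)
man grep
prints the grep manual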
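As a minimal sketch of the parts of a command, the lines below run grep with a short flag, a long option and an option that takes an argument. The sample file and its contents are hypothetical and are created only for this illustration.

```shell
# Hypothetical sample file, created only for this illustration.
workdir=$(mktemp -d)
printf 'Gene1\nother\ngene2\nGENE3\n' > "$workdir/file.txt"

# Short flag (-c) plus long option (--ignore-case): count the matching lines.
count=$(grep -c --ignore-case 'gene' "$workdir/file.txt")
echo "$count"

# Option with an argument (-A 1): print each match plus one line after it.
grep -A 1 'Gene1' "$workdir/file.txt"
```

The pattern and the file name are the arguments; everything starting with a dash is an option.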
ls, cd and pwd to navigate the file system
• where am I? pwd
• how do I change the current directory? cd
• what files and directories are in my current directory? ls
pwd
returns the current working directory
ls lists directories and files
ls
lists directories and files in current directory
ls -a
lists all directories and files, including hidden files
ls -l -h
lists in long format, human readable
ls -l -h -t
time sorted
ls -lhS
size sorted
ls lists directories and files
Columns of a long listing (ls -l), from left to right:
permissions, links #, owner user, owner group, size, date, file name
Permissions example: d rwx r-x r-x
first character: d directory, - regular file
then one rwx group each for user, group and other
r readable
w writable
x executable or searchable
- not rwx
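The permission string can be checked on files you create yourself. This is a small sketch, assuming a scratch directory and the hypothetical names notes.txt and data; it sets known permissions with chmod and then reads back the first ten characters of the long listing.

```shell
# Scratch directory with one file and one directory (hypothetical names).
workdir=$(mktemp -d)
touch "$workdir/notes.txt"
mkdir "$workdir/data"

# Set known permissions: 644 = rw-r--r--, 755 = rwxr-xr-x.
chmod 644 "$workdir/notes.txt"
chmod 755 "$workdir/data"

# ls -ld describes the entry itself; cut keeps the type character plus the
# three rwx groups (user, group, other).
file_perms=$(ls -ld "$workdir/notes.txt" | cut -c 1-10)
dir_perms=$(ls -ld "$workdir/data" | cut -c 1-10)
echo "$file_perms"
echo "$dir_perms"
```

The leading character distinguishes the regular file (-) from the directory (d).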
Wildcards, history and some shortcuts
ls *txt
lists all txt files in current directory
ls P*s
lists files starting with P and ending with s, e.g.: Pictures, Photos, Programs ...
Use up and down arrows to navigate the command history
ctrl-c stop process
ctrl-a go to beginning of line
ctrl-e go to end of line
ctrl-r search in command history
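To see what a wildcard expands to, it can help to try it in a scratch directory. The sketch below assumes a handful of hypothetical file names and counts how many of them each pattern matches.

```shell
# Scratch directory with hypothetical file names.
workdir=$(mktemp -d)
touch "$workdir/Pictures" "$workdir/Photos" "$workdir/Programs" "$workdir/Public"
touch "$workdir/notes.txt" "$workdir/readme.txt"

# *txt matches every name ending in txt.
txt_files=$(ls "$workdir"/*txt | wc -l)

# P*s matches names that start with P and end with s (Public is excluded).
p_s_files=$(ls "$workdir"/P*s | wc -l)

echo "$txt_files $p_s_files"
```

The shell expands the pattern before ls runs, so ls only ever sees the matching names.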
Escaping special characters
! @ $ ^ & * ~ ? . | / [ ] < > \ ` " ; # ( )
ls my\ folder
lists a folder containing a space
ls my_folder
lists a folder
Use tab key to autocomplete names
Tip: file names in lower case and with underscores instead of spaces
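A short sketch of the two usual ways to handle a space in a name, assuming a hypothetical folder called "my folder" in a scratch directory: escaping the space with a backslash, or quoting the whole path.

```shell
# Scratch directory containing a folder whose name has a space (hypothetical).
workdir=$(mktemp -d)
mkdir "$workdir/my folder"
touch "$workdir/my folder/a.txt"

# Backslash escapes the space so the shell keeps the name as one argument.
escaped=$(ls "$workdir"/my\ folder)

# Quoting the path has the same effect.
quoted=$(ls "$workdir/my folder")

echo "$escaped $quoted"
```

Without the backslash or the quotes, ls would receive two separate arguments, "my" and "folder".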
cd changes directory
cd Desktop
changes directory to Desktop
cd ..
goes to parent directory
cd goes to home directory
cd / goes to root directory
cd - goes to previous directory
Absolute and relative paths
ls /home/user/Desktop
lists files in Desktop using an absolute path
ls Desktop/
lists files in Desktop using a relative path (from your home: /home/bioinfo)
ls ~/Desktop
lists files in Desktop using your home as a reference
Absolute and relative paths
ls /home/bioinfo/Desktop
ls ~/Desktop
Absolute paths do not depend on where you are
~/ is equivalent to /home/bioinfo/
Absolute and relative paths
cd Desktop/
goes to Desktop when you are in your home (/home/bioinfo)
ls ../Documents
lists files from Documents when you are in Desktop
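The difference between the two kinds of path can be sketched with a scratch directory standing in for the home directory. The directory layout and the file report.txt are assumptions made for this example.

```shell
# home_dir stands in for /home/bioinfo (hypothetical layout).
home_dir=$(mktemp -d)
mkdir "$home_dir/Desktop" "$home_dir/Documents"
touch "$home_dir/Documents/report.txt"

# Move into Desktop, then reach Documents through the parent directory (..).
cd "$home_dir/Desktop" || exit 1
relative=$(ls ../Documents)

# The absolute path gives the same listing from any location.
absolute=$(ls "$home_dir/Documents")

# Go back to the previous directory.
cd - > /dev/null
echo "$relative $absolute"
```

The relative path only works from inside Desktop (or a sibling directory); the absolute one works everywhere.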
Create, copy, move and delete files
touch tmp_file.txt
creates an empty file called tmp_file.txt
cp tmp_file.txt file_copy.txt
copies tmp_file.txt to file_copy.txt
rm file.txt
deletes file.txt
mv file1.txt file2.txt
moves or renames a file
Locate a file
locate unix_class_file_samples.zip
Locate the path for the file unix_class_file_samples.zip
locate unix_class
Locate the path for all the files containing unix_class
Create, copy and delete directories
mkdir dir_name
creates an empty directory called dir_name
rmdir dir_name
deletes dir_name directory if it is empty
cp -r dir_name dir_copy
copies dir_name and its files to a new folder
rm -r dir_name
deletes dir_name and its files
Compression commands
tar -zcvf file.tar.gz f1 f2
groups and compresses files (f1 and f2 can be files, directories or wildcards)
tar -zxvf file.tar.gz
decompresses and ungroups a tar.gz file
Compression commands
gzip f1.txt
compresses file f1.txt in f1.txt.gz
gunzip file.gz
decompresses file.gz
zip file.zip f1 f2
compresses files f1 and f2 in file.zip
unzip file.zip
decompresses file.zip
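A round trip makes the behaviour of these commands easier to see. The sketch below assumes two small hypothetical files; it drops the -v (verbose) flag to keep the output quiet and uses tar's -C option, which is not shown on the slides, to work inside a scratch directory.

```shell
# Two small hypothetical files in a scratch directory.
workdir=$(mktemp -d)
printf 'ACGT\n' > "$workdir/f1.txt"
printf 'TTGA\n' > "$workdir/f2.txt"

# gzip replaces f1.txt with f1.txt.gz; gunzip restores the original file.
gzip "$workdir/f1.txt"
gunzip "$workdir/f1.txt.gz"

# Group and compress both files; -C switches to workdir first, so the
# archive stores the plain names f1.txt and f2.txt.
tar -zcf "$workdir/both.tar.gz" -C "$workdir" f1.txt f2.txt

# Decompress and ungroup into a separate directory.
mkdir "$workdir/restored"
tar -zxf "$workdir/both.tar.gz" -C "$workdir/restored"
ls "$workdir/restored"
```

Note that gzip works on a single file, while tar is what groups several files into one archive.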
Introduction to UNIX command-line II
Boyce Thompson Institute
October
Noe Fernandez
FASTA format
A sequence in FASTA format begins with a single-line description,
followed by lines of sequence data. The description line is
distinguished from the sequence data by a greater-than (">") symbol
at the beginning.
http://www.ncbi.nlm.nih.gov/
>sequence_ID1 description
ATGCGCGCGCGCGCGCGCGGGTAGCAGATGACGACACAGAGCGAGGATGCGCTGAGAGTA
GTGTGACGACGATGACGGAAAATCAGATGGACCCGATGACAGCATGACGATGGGACGGGA
AAGATTGGACCAGGACAGGACCAGGACCAGGACCAGGGATTAGA
>sequence_ID2 description
ATGGGGGGGACGACGATGGACACAGAGACAGAGACGACGACAGCAGACAGATTTACCTTA
GACGAGATAGGAGAGACGACAGATATATATATATAGCAGACAGACAGACATTTAGACGAG
ACGACGATAGACGATaaaaataa
Parts of each record: description line, sequence data
FASTQ format
A FASTQ file normally uses four lines per sequence.
Line 1: begins with a '@' character, followed by a sequence identifier and an optional description.
Line 2: the raw sequence letters.
Line 3: begins with a '+' character, optionally followed by the same sequence identifier.
Line 4: encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
(Source: Wikipedia)
@D3B4KKQ1:291:D17NUACXX:8:1101:3630:2109 1:N:0:
GACTTGCAGGCATGCAAGCTTGGCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACACTGGCGT
+
?@<+ADDDDFDFFI<FGE=EHGIGFFGEFIIFFBGFIDEI>D?FFFFA4;C;DC=;=ABDD;
@D3B4KKQ1:291:D17NUACXX:8:1101:3971:2092 1:N:0:
ATTGCAGAAGCGGCCCCGCATCTGCGAAGGGTTAACCGCAGGTGCAGAAGCTGGCTTTAAGTGAGAAGT
+
=BAADBA?D?FGI<@FHDB6?ADFEGGIE8@FGGII3ABBBB(;;6@CC?C3;C<99?CCCCC;:::?
Parts of each record: description line, sequence data, sequence quality
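Because each record uses four lines, one way to count reads is to divide the number of lines by four. This is a sketch under the assumption that the file is a strict four-line FASTQ; the two reads below are hypothetical. It also shows why counting lines that start with '@' can overcount: '@' is a valid quality symbol.

```shell
# Two hypothetical reads; the second quality line starts with '@'.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n@III\n' > "$workdir/reads.fastq"

# Four lines per record, so reads = lines / 4.
n_lines=$(wc -l < "$workdir/reads.fastq")
n_reads=$(( n_lines / 4 ))

# Counting '@' at line start also picks up the quality line.
at_lines=$(grep -c '^@' "$workdir/reads.fastq")

echo "reads: $n_reads, lines starting with @: $at_lines"
```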
Tab-delimited text files
Tab-delimited files are a very common format in scientific data. They consist of columns of text separated by tabs. Other file formats may use different delimiters.
Examples: Blast, SAM (mapping), BED, VCF (SNPs), GTF, GFF ...
Tabular blast output example
Query        Subject         id %   length  mismatch  gaps  qstart  qend  sstart  send  evalue  score
ATCG00890.1  PACid:16418828  90.60  117     11        0     18      134   1       117   1e-71   220
ATCG00890.1  PACid:16412855  90.48  147     14        2     41      387   27      173   1e-68   214
ATCG00500.1  PACid:23047568  64.88  299     64        2     220     477   112     410   5e-131  388
ATCG00500.1  PACid:23052247  58.88  321     69        3     220     477   381     701   3e-117  361
ATCG00280.1  PACid:24129717  95.99  474     19        0     1       474   1       474   0.0     847
ATCG00280.1  PACid:24095593  95.36  474     22        0     1       474   1       474   0.0     840
ATCG00280.1  PACid:20871697  94.94  474     24        0     1       474   1       474   0.0     837
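Since the columns are separated by tabs, a single column can be pulled out by its position. The sketch below rebuilds two rows of the blast example above with real tab characters (the file name blast.txt is hypothetical) and extracts the subject and score columns.

```shell
# Two rows from the blast example, written with real tabs (\t).
workdir=$(mktemp -d)
printf 'ATCG00890.1\tPACid:16418828\t90.60\t117\t11\t0\t18\t134\t1\t117\t1e-71\t220\n' > "$workdir/blast.txt"
printf 'ATCG00500.1\tPACid:23047568\t64.88\t299\t64\t2\t220\t477\t112\t410\t5e-131\t388\n' >> "$workdir/blast.txt"

# Column 2 is the subject, column 12 the score; cut keeps them tab-separated.
subject_score=$(cut -f 2,12 "$workdir/blast.txt")
echo "$subject_score"
```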
Text Handling Commands
less to view large files
less blast_sample.txt
views file blast_sample.txt
less -S blast_sample.txt
views file blast_sample.txt without wrapping long lines
less -N blast_sample.txt
views file blast_sample.txt showing line numbers
Keys to scroll through the file:
space bar   page down
b           page up
< or g      go to file beginning
> or G      go to file end
/pattern    search pattern
n           find next
N           find previous
q           quit less
cat concatenates and prints files
cat sample1.fasta
prints file sample1.fasta on the screen
cat /home/bioinfo/Desktop/unix_data/sample1.fasta
prints file sample1.fasta on the screen
cat sample1.fasta sample2.fasta > new_file.fasta
concatenates files sample1.fasta and sample2.fasta and saves them in the file new_file.fasta
(> redirects output to a file)
cat concatenates and prints files
cat sample3.fasta >> new_file.fasta
appends sample3.fasta file to new_file.fasta
cat *fasta > all_samples.fasta
concatenates all FASTA files in the current directory and saves them in the file all_samples.fasta
(> and >> redirect output to a file)
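The difference between the two redirections can be checked on small files. This sketch assumes three hypothetical one-sequence FASTA files in a scratch directory.

```shell
# Three hypothetical one-sequence FASTA files.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$workdir/sample1.fasta"
printf '>seq2\nTTGA\n' > "$workdir/sample2.fasta"
printf '>seq3\nGGCC\n' > "$workdir/sample3.fasta"

# ">" creates the output file, or overwrites it if it already exists.
cat "$workdir/sample1.fasta" "$workdir/sample2.fasta" > "$workdir/new_file.fasta"

# ">>" adds to the end of the existing file instead of replacing it.
cat "$workdir/sample3.fasta" >> "$workdir/new_file.fasta"

n_lines=$(wc -l < "$workdir/new_file.fasta")
last_line=$(tail -n 1 "$workdir/new_file.fasta")
echo "$n_lines $last_line"
```

Using ">" for the second command would have replaced the first two sequences with the third.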
head displays first lines of a file
head blast_sample.txt > blast10.txt
prints first lines from blast_sample.txt file (10 by default) and saves them in blast10.txt
head -n 5 blast_sample.txt
prints first five lines from blast_sample.txt file
tail displays the last part of a file
tail blast_sample.txt
prints last 10 lines from blast_sample.txt file
tail -n 5 blast_sample.txt
prints last five lines from blast_sample.txt file
grep searches patterns in files
grep '^>' sample1.fasta
prints lines starting with a ">", i.e., prints description lines from FASTA files
grep -c '^>' sample1.fasta
counts lines starting with a ">", i.e., it counts the number of sequences from a FASTA file
grep -c '^+$' *fastq
counts lines formed only by "+", i.e., it counts the number of sequences in each FASTQ file in the current directory
^ search pattern at line start
$ search pattern at line end
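The effect of the two anchors can be seen by counting the same character with and without them. The FASTQ records below are hypothetical and were chosen so that "+" also appears inside the quality lines.

```shell
# Hypothetical reads whose quality lines also contain '+'.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nII+I\n@r2\nTTGA\n+\n+III\n' > "$workdir/reads.fastq"

# No anchor: every line containing '+' anywhere.
any=$(grep -c '+' "$workdir/reads.fastq")

# '^' anchor: only lines that start with '+'.
at_start=$(grep -c '^+' "$workdir/reads.fastq")

# '^' and '$' together: lines made of '+' and nothing else.
exact=$(grep -c '^+$' "$workdir/reads.fastq")

echo "$any $at_start $exact"
```

Only the fully anchored pattern matches the separator lines alone, which is why it gives the number of reads here.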
grep searches patterns in files
grep -v 'Vvin' blast10.txt
prints all lines but the ones containing 'Vvin'
grep -i 'Vvin' blast10.txt
prints lines containing 'Vvin' and all their case combinations
BIOINFORMATICIAN
Which option of the command grep makes the search case insensitive?
A: -c B: -v
C: -i D: -e
cut gets columns from a tab-delimited file
cut -f 1,2 blast10.txt
prints columns 1 and 2 from blast10.txt
cut -c 1-4,17-21 blast_sample.txt > tmp.txt
prints characters from 1 to 4 and from 17 to 21 for each line in blast_sample.txt and saves them in tmp.txt
sort sorts lines from a file
sort tmp.txt > tmp2.txt
sorts lines from file tmp.txt and saves them in tmp2.txt
sort -u tmp.txt
sorts lines from file tmp.txt and removes the repeated ones
uniq -c tmp2.txt
removes repeated lines from tmp2.txt and counts how many times they were repeated.
Lines have to be sorted since only adjacent lines are compared
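Why the input must be sorted first can be shown with a file whose repeated lines are never adjacent. The contents of tmp.txt below are hypothetical; the extra output file names are only for this sketch.

```shell
# Hypothetical file where repeated lines are never next to each other.
workdir=$(mktemp -d)
printf 'Atha\nVvin\nAtha\nVvin\nAtha\n' > "$workdir/tmp.txt"

# uniq on unsorted input: nothing is adjacent, so nothing is merged.
uniq -c "$workdir/tmp.txt" > "$workdir/unsorted_counts.txt"

# Sort first, then uniq: repeated lines are now adjacent and get counted.
sort "$workdir/tmp.txt" > "$workdir/tmp2.txt"
uniq -c "$workdir/tmp2.txt" > "$workdir/sorted_counts.txt"

unsorted_groups=$(wc -l < "$workdir/unsorted_counts.txt")
sorted_groups=$(wc -l < "$workdir/sorted_counts.txt")
echo "$unsorted_groups $sorted_groups"
cat "$workdir/sorted_counts.txt"
```

On the unsorted file uniq reports five groups of one line each; after sorting it reports the two distinct values with their counts.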
wc counts lines, words and bytes
wc blast10.txt
counts lines, words and bytes in blast10.txt
wc -l blast10.txt
counts lines in blast10.txt
wc -w blast10.txt
counts words in blast10.txt
wc -c blast10.txt
counts bytes in blast10.txt (including the line returns)
paste concatenates files as columns
cut -f 1 blast10.txt > col1.txt
cut -f 2 blast10.txt > col2.txt
cut -f 3 blast10.txt > col3.txt
creates one file each for columns 1, 2 and 3 from blast10.txt
paste col2.txt col3.txt col1.txt
joins the files side by side, line by line, separated by tabs
paste -d ',' col2.txt col3.txt col1.txt
pastes columns with commas as delimiters
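Splitting a table with cut and rejoining it with paste is a simple way to reorder columns. The sketch below assumes a hypothetical three-column, two-row table in place of blast10.txt.

```shell
# Hypothetical three-column tab-delimited table.
workdir=$(mktemp -d)
printf 'q1\ts1\t90.6\nq2\ts2\t64.9\n' > "$workdir/blast.txt"

# One file per column.
cut -f 1 "$workdir/blast.txt" > "$workdir/col1.txt"
cut -f 2 "$workdir/blast.txt" > "$workdir/col2.txt"
cut -f 3 "$workdir/blast.txt" > "$workdir/col3.txt"

# Paste them back in a new order (2, 3, 1) with commas between columns.
reordered=$(paste -d ',' "$workdir/col2.txt" "$workdir/col3.txt" "$workdir/col1.txt")
echo "$reordered"
```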
sed replaces a pattern
sed 's/A/a/g' col1.txt
replaces all "A" characters by "a" in col1.txt file
sed 's/Atha/SGN/' col1.txt
replaces the first Atha on each line by SGN in col1.txt file
sed -r 's/^([A-Za-z]+)\|(.+)/gene \2 from \1/' col2.txt
gets species and gene name from col2.txt and prints each line in a different format
([A-Za-z]+) saves species name in \1
(.+) saves gene name in \2
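The capture groups can be tried on a couple of lines. This sketch assumes subject IDs of the form species|gene; the two IDs below are hypothetical. It uses -E, which GNU and BSD sed both accept for extended regular expressions (GNU sed also accepts -r).

```shell
# Hypothetical subject IDs in "species|gene" form.
workdir=$(mktemp -d)
printf 'Vvin|GSVIVT001\nAtha|AT1G01010\n' > "$workdir/col2.txt"

# Group 1 captures the letters before the "|", group 2 everything after it.
# The "|" is escaped so it is matched literally instead of meaning "or".
reformatted=$(sed -E 's/^([A-Za-z]+)\|(.+)/gene \2 from \1/' "$workdir/col2.txt")
echo "$reformatted"
```

sed prints the transformed text to the screen and leaves col2.txt unchanged.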
Pipelines
A pipeline chains several commands, using the output of one command as the input of the next one.
Two commands are connected by placing the sign "|" between them.
ls | wc -l
counts files in current directory
Pipelines
cat *fasta | grep "^>" | sed 's/>//'
prints sequence description line for all fasta files from current directory
cat *fasta | grep -c "^>"
counts sequences in all fasta files from current directory
cut -f 1 blast_sample.txt | sort -u | wc -l
counts different query ids in a blast tabular file
cut -f 1 blast_sample.txt | sort | uniq -c
counts the appearance of each query id in a blast tabular file
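The two blast pipelines can be checked on a tiny table where the answer is easy to verify by eye. The three-row file below is hypothetical; only its first column matters here.

```shell
# Hypothetical blast-like table: two hits for one query, one for another.
workdir=$(mktemp -d)
printf 'ATCG00890.1\thit1\nATCG00890.1\thit2\nATCG00500.1\thit3\n' > "$workdir/blast.txt"

# Column 1 -> unique sorted values -> number of lines = distinct queries.
n_queries=$(cut -f 1 "$workdir/blast.txt" | sort -u | wc -l)

# Column 1 -> sorted -> count of each adjacent repeated value.
counts=$(cut -f 1 "$workdir/blast.txt" | sort | uniq -c)

echo "$n_queries"
echo "$counts"
```

Each command only sees the output of the one before it, so no intermediate files are needed.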
shell script (bash) example
• All commands and programs we run in the terminal can be included in a text file with extension .sh
• This file will execute the commands in the order they were written, from top to bottom.
Parts of the example script:
header of the bash script
comment line
command or program line execution
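The script shown on the slide is not preserved in this text, so the following is a stand-in sketch rather than the original example. It has the three parts named above: the header line, comment lines and command lines. The input file is hypothetical and is created by the script itself.

```shell
#!/bin/bash
# The line above is the header: it tells the system to run this file with bash.

# Comment line: everything after "#" is ignored by bash.
# Hypothetical input, created here so the sketch is self-contained.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nTTGA\n' > "$workdir/sample1.fasta"

# Command lines run in order, from top to bottom.
n_seqs=$(grep -c '^>' "$workdir/sample1.fasta")
echo "sample1.fasta contains $n_seqs sequences"
```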
Run a bash script on a server
touch file.sh
creates an empty file
emacs file.sh
opens file.sh in emacs
emacs: text editor
save = ctrl-x ctrl-s
exit = ctrl-x ctrl-c
Run a bash script on a server
chmod 755 ./file.sh
makes file.sh executable (chmod manual: man chmod)
screen -L ./file.sh
runs file.sh script in screen mode and logs its output
ctrl-a d
detaches screen
screen -r process_id
returns to process screen
less screenlog.0
views the log from the screen execution
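The chmod step can be tried without screen. This is a minimal sketch, assuming a one-line hypothetical script written to a scratch directory; it shows the permissions after chmod 755 and then runs the script directly by its path.

```shell
# Write a tiny hypothetical script to a scratch directory.
workdir=$(mktemp -d)
printf '#!/bin/bash\necho "hello from file.sh"\n' > "$workdir/file.sh"

# 755 = rwx for the user, r-x for group and other.
chmod 755 "$workdir/file.sh"
perms=$(ls -l "$workdir/file.sh" | cut -c 1-10)

# With the x permission set, the file can be run by its path.
output=$("$workdir/file.sh")

echo "$perms"
echo "$output"
```

Before chmod, running the file by its path would normally fail with a "Permission denied" error, because new files are created without the x permission.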
Exercises
1. Merge all fasta files, in the order sample3.fasta, sample1.fasta and sample2.fasta, and save them in a new file called all_samples.fasta
2. Merge all fastq files (sample1.fastq, sample2.fastq and sample3.fastq) using wildcards, and save them in a new file called all_samples.fastq
3. Save in a file called blast100.txt the first 100 lines from blast_sample.txt
4. Save in a file called blast200.txt the last 200 lines from blast_sample.txt
5. How many sequences are in all_samples.fasta?
6. How many sequences are in all_samples.fastq?
7. Create a file with the subject ids and their scores for the first 15 lines from blast_sample.txt
8. How many different query ids are in blast_sample.txt?
9. How many different subject ids are in blast_sample.txt?
10. Replace all '|' in blast_sample.txt with '_' and save the new file in Desktop as tmp.txt.
11. Count how many genes are in each Arabidopsis thaliana chromosome, chloroplast and mitochondria based on the following file:
ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_pep_20110103_representative_gene_model_updated

More Related Content

PPT
Operating Systems Process Scheduling Algorithms
PDF
Course 102: Lecture 3: Basic Concepts And Commands
PPTX
Linux file system
PPTX
Deadlock avoidance (Safe State, Resource Allocation Graph Algorithm)
PDF
Course 102: Lecture 2: Unwrapping Linux
PPSX
Management file and directory in linux
PPTX
Os security issues
PPT
Linux file system
Operating Systems Process Scheduling Algorithms
Course 102: Lecture 3: Basic Concepts And Commands
Linux file system
Deadlock avoidance (Safe State, Resource Allocation Graph Algorithm)
Course 102: Lecture 2: Unwrapping Linux
Management file and directory in linux
Os security issues
Linux file system

What's hot (20)

PPT
Basic 50 linus command
PDF
Linux Directory Structure
PPTX
Operating system memory management
PPT
Firewall(linux)
PPTX
Unix ppt
PDF
spinlock.pdf
PPTX
First-Come-First-Serve (FCFS)
PPTX
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
PPTX
Administering security
PPT
Chapter 10 - File System Interface
PPT
Basic Unix
PDF
Chapter 1 Introduction of Cryptography and Network security
PDF
ITFT - DOS - Disk Operating System
PPTX
Linux basic commands
PPTX
Process management in linux
PPT
Internal representation of files ppt
PDF
Unix Cheat Sheet
PPT
Protection and Security in Operating Systems
PDF
Linux basic commands with examples
PPTX
process control block
Basic 50 linus command
Linux Directory Structure
Operating system memory management
Firewall(linux)
Unix ppt
spinlock.pdf
First-Come-First-Serve (FCFS)
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
Administering security
Chapter 10 - File System Interface
Basic Unix
Chapter 1 Introduction of Cryptography and Network security
ITFT - DOS - Disk Operating System
Linux basic commands
Process management in linux
Internal representation of files ppt
Unix Cheat Sheet
Protection and Security in Operating Systems
Linux basic commands with examples
process control block
Ad

Viewers also liked (20)

PPT
Цветочные легенды
PPT
Римский корсаков снегурочка
PPTX
High Performance Distributed Systems with CQRS
PPTX
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
PPTX
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
PPTX
правописание приставок урок№4
PPTX
бсп (обоб. урок)
PDF
Troubleshooting mysql-tutorial
PDF
Windowing in Apache Apex
PDF
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
PDF
The 5 People in your Organization that grow Legacy Code
PDF
Hadoop File System Shell Commands,
DOCX
Hadoop basic commands
PPTX
Introduction to Apache Apex and writing a big data streaming application
PPT
Learning sed and awk
PDF
Build your shiny new pc, with Pangoly
PPTX
HDFS Internals
PDF
Hadoop Internals (2.3.0 or later)
PPTX
Hadoop Interacting with HDFS
PPTX
Introduction to Real-Time Data Processing
Цветочные легенды
Римский корсаков снегурочка
High Performance Distributed Systems with CQRS
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
правописание приставок урок№4
бсп (обоб. урок)
Troubleshooting mysql-tutorial
Windowing in Apache Apex
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
The 5 People in your Organization that grow Legacy Code
Hadoop File System Shell Commands,
Hadoop basic commands
Introduction to Apache Apex and writing a big data streaming application
Learning sed and awk
Build your shiny new pc, with Pangoly
HDFS Internals
Hadoop Internals (2.3.0 or later)
Hadoop Interacting with HDFS
Introduction to Real-Time Data Processing
Ad

Similar to Introduction to UNIX Command-Lines with examples (20)

PDF
SGN Introduction to UNIX Command-line 2015 part 1
PDF
Unix Command-Line Cheat Sheet BTI2014
PDF
SGN Introduction to UNIX Command-line 2015 part 2
PPT
BITS: Introduction to Linux - Text manipulation tools for bioinformatics
PDF
3.1.a linux commands reference
PPTX
Linux Fundamentals
PDF
Workshop on command line tools - day 1
PDF
Linux Command Line - By Ranjan Raja
PDF
Linux cheat sheet
PPTX
Linux Commands.pptx
PDF
Quick guide of the most common linux commands
PPTX
Basic unix
PDF
2023comp90024_linux.pdf
PDF
Unix / Linux Command Reference
PPTX
various shell commands in unix operating system.pptx
PPTX
Linux Commands all presentation file .pptx
ODP
Love Your Command Line
PPTX
Introduction to linux day1
PPT
A Quick Introduction to Linux
PDF
Introduction to the linux command line.pdf
SGN Introduction to UNIX Command-line 2015 part 1
Unix Command-Line Cheat Sheet BTI2014
SGN Introduction to UNIX Command-line 2015 part 2
BITS: Introduction to Linux - Text manipulation tools for bioinformatics
3.1.a linux commands reference
Linux Fundamentals
Workshop on command line tools - day 1
Linux Command Line - By Ranjan Raja
Linux cheat sheet
Linux Commands.pptx
Quick guide of the most common linux commands
Basic unix
2023comp90024_linux.pdf
Unix / Linux Command Reference
various shell commands in unix operating system.pptx
Linux Commands all presentation file .pptx
Love Your Command Line
Introduction to linux day1
A Quick Introduction to Linux
Introduction to the linux command line.pdf

Recently uploaded (20)

PDF
Is Earendel a Star Cluster?: Metal-poor Globular Cluster Progenitors at z ∼ 6
PPT
Mutation in dna of bacteria and repairss
PDF
CuO Nps photocatalysts 15156456551564161
PPT
Cell Structure Description and Functions
PPTX
SCIENCE 4 Q2W5 PPT.pptx Lesson About Plnts and animals and their habitat
PPT
LEC Synthetic Biology and its application.ppt
PPTX
Substance Disorders- part different drugs change body
PDF
Communicating Health Policies to Diverse Populations (www.kiu.ac.ug)
PDF
From Molecular Interactions to Solubility in Deep Eutectic Solvents: Explorin...
PDF
Cosmology using numerical relativity - what hapenned before big bang?
PPTX
congenital heart diseases of burao university.pptx
PDF
7.Physics_8_WBS_Electricity.pdfXFGXFDHFHG
PPTX
LIPID & AMINO ACID METABOLISM UNIT-III, B PHARM II SEMESTER
PDF
Science Form five needed shit SCIENEce so
PPTX
ELISA(Enzyme linked immunosorbent assay)
PPT
THE CELL THEORY AND ITS FUNDAMENTALS AND USE
PPT
Biochemestry- PPT ON Protein,Nitrogenous constituents of Urine, Blood, their ...
PDF
Social preventive and pharmacy. Pdf
PDF
5.Physics 8-WBS_Light.pdfFHDGJDJHFGHJHFTY
PPTX
Introduction to Immunology (Unit-1).pptx
Is Earendel a Star Cluster?: Metal-poor Globular Cluster Progenitors at z ∼ 6
Mutation in dna of bacteria and repairss
CuO Nps photocatalysts 15156456551564161
Cell Structure Description and Functions
SCIENCE 4 Q2W5 PPT.pptx Lesson About Plnts and animals and their habitat
LEC Synthetic Biology and its application.ppt
Substance Disorders- part different drugs change body
Communicating Health Policies to Diverse Populations (www.kiu.ac.ug)
From Molecular Interactions to Solubility in Deep Eutectic Solvents: Explorin...
Cosmology using numerical relativity - what hapenned before big bang?
congenital heart diseases of burao university.pptx
7.Physics_8_WBS_Electricity.pdfXFGXFDHFHG
LIPID & AMINO ACID METABOLISM UNIT-III, B PHARM II SEMESTER
Science Form five needed shit SCIENEce so
ELISA(Enzyme linked immunosorbent assay)
THE CELL THEORY AND ITS FUNDAMENTALS AND USE
Biochemestry- PPT ON Protein,Nitrogenous constituents of Urine, Blood, their ...
Social preventive and pharmacy. Pdf
5.Physics 8-WBS_Light.pdfFHDGJDJHFGHJHFTY
Introduction to Immunology (Unit-1).pptx

Introduction to UNIX Command-Lines with examples

  • 1. Sol Genomics Network Introduction to UNIX command-line Boyce Thompson Institute 2014 Noe Fernandez
  • 2. Sol Genomics Network • Terminal file system navigation • Wildcards, shortcuts and special characters • File permissions • Compression UNIX commands • Networking UNIX commands • Basic NGS file formats • Text files manipulation commands • Command-line pipelines • Introduction to bash scripts Class Content
  • 3. Sol Genomics Network What is a virtual machine?
  • 4. Sol Genomics Network What is a terminal?
  • 5. Sol Genomics Network Why are command-line needed? • Most of the software for biological data analysis can be used in a UNIX command-line terminal • Most of the servers for biological data analysis use Linux as operative system • Data analysis on calculation servers are much faster since they can use more CPUs and RAM than in a PC (e.g.: Boyce servers has 64 cores and 1TB RAM) • Large NGS data files can not be opened or loaded in most of the software with interface and web sites • Compression commands are useful, since NGS large data files usually are stored and shared as compressed files
  • 6. Sol Genomics Network UNIX Command-Line Cheat Sheet, BTI-SGN Bioinformatics Course 2014
      Download the cheat sheet from:
      http://www.slideshare.net/NoFernndezPozo/unix-command-sheet2014
      https://btiplantbioinfocourse.files.wordpress.com/2014/02/unix_command_sheet_2014.pdf
      Text handling commands
      command > file             saves STDOUT in a file
      command >> file            appends STDOUT to a file
      cat file                   concatenates and prints files
      cat file1 file2 > file3    merges files 1 and 2 into file3
      cat *fasta > all.fasta     concatenates all fasta files in the current directory
      head file                  prints first lines from a file
      head -n 5 file             prints first five lines from a file
      tail file                  prints last lines from a file
      tail -n 5 file             prints last five lines from a file
      less file                  views a file
      less -N file               includes line numbers
      less -S file               does not wrap long lines
      grep 'pattern' file        prints lines matching a pattern
      grep -c 'pattern' file     counts lines matching a pattern
      cut -f 1,3 file            retrieves data from selected columns in a tab-delimited file
      sort file                  sorts lines from a file
      sort -u file               sorts and returns unique lines
      uniq -c file               filters adjacent repeated lines and counts them
      wc file                    counts lines, words and bytes
      paste file1 file2          concatenates the lines of input files
      paste -d ","               concatenates the lines of input files by commas
      sed                        transforms text
      File system commands
      ls                         lists directories and files
      ls -a                      lists all files including hidden files
      ls -lh                     formatted list including more data
      ls -t                      lists sorted by date
      pwd                        returns path to working directory
      cd dir                     changes directory
      cd ..                      goes to parent directory
      cd /                       goes to root directory
      cd                         goes to home directory
      touch file_name            creates an empty file
      cp file file_copy          copies a file
      cp -r                      copies directories and the files they contain
      rm file                    deletes a file
      rm -r dir                  deletes a directory and its files
      mv file1 file2             moves or renames a file
      mkdir dir_name             creates a directory
      rmdir dir_name             deletes a directory
      locate file_name           searches for a file
      man command                shows the manual of a command
      top                        shows process activity
      df -h                      shows disk space info
      Networking commands
      wget URL                   downloads a file from a URL
      ssh user@server            connects to a server
      scp                        copies files between computers
      apt-get install            installs applications in Linux
      Compression commands
      gzip/zip                   compresses a file
      gunzip/unzip               decompresses a file
      tar -cvf                   groups files
      tar -xvf                   ungroups files
      tar -zcvf                  groups and gzips files
      tar -zxvf                  gunzips and ungroups files
      Section: File system commands (File system navigation)
  • 7. Sol Genomics Network File system navigation: File Browser = Terminal
  • 8. Sol Genomics Network Home and Root directories
      /                       root directory
      /bin, /lib, /usr        code and code libraries
      /var                    logs and other data
      /home                   user directories
      /tmp                    temporary files
      /etc                    configuration information
      /proc                   special file system in Linux
      /home/bioinfo, /home/noe    home directories (e.g., /home/noe/Desktop is inside the home directory of user noe)
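  • 9. Sol Genomics Network Anatomy of a UNIX command
      grep -c -A 3 --ignore-case 'pattern' file.txt
      grep                   command
      -c                     simple option flag (short form)
      -A 3                   option with argument
      --ignore-case          option (long form)
      'pattern' file.txt     arguments (grep needs the pattern to search for before the file name)
      man grep               prints the grep manual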
  • 10. Sol Genomics Network ls, cd and pwd to navigate the file system • Where am I? pwd (returns the current working directory) • How do I change the current directory? cd • What files and directories are in my current directory? ls
  • 11. Sol Genomics Network ls lists directories and files in the current directory
      ls             lists directories and files
      ls -a          lists all directories and files, including hidden files
      ls -l -h       lists in long, human-readable format
      ls -l -h -t    long format, sorted by time
      ls -lhS        long format, sorted by size
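  • 12. Sol Genomics Network ls lists directories and files
      Columns of the long listing: permissions, number of links, owner user, owner group, size, date, file name
      First character: d = directory, - = regular file
      Permission characters: r = readable, w = writable, x = executable (or searchable, for a directory), - = permission not granted
      Example: d rwx r-x r-x is a directory with rwx for the user (owner), r-x for the group and r-x for others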
  • 13. Sol Genomics Network Wildcards, history and some shortcuts
      ls *txt            lists all txt files in the current directory
      ls P*s             lists files starting with P and ending with s, e.g.: Pictures, Photos, Programs ...
      up/down arrows     navigate the command history
      ctrl-r             searches in the command history
      ctrl-c             stops the running process
      ctrl-a             goes to the beginning of the line
      ctrl-e             goes to the end of the line
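A small sketch of how the shell expands these wildcards. It assumes a scratch directory created with mktemp, and the file names are made up for illustration. echo is used instead of ls to show that the shell itself builds the list of matching names before the command runs.

```shell
# Work in a scratch directory so no real files are touched (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch notes.txt results.txt Photos Programs Pets.csv

# *txt matches every name that ends in "txt"
txt_files=$(echo *txt)
echo "$txt_files"

# P*s matches names that start with P and end with s (Pets.csv does not end with s)
p_files=$(echo P*s)
echo "$p_files"
```

Running `ls *txt` in the same directory would list the same two files, because ls receives the already expanded names.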
  • 14. Sol Genomics Network Escaping special characters
      Special characters: ! @ $ ^ & * ~ ? . | / [ ] < > ` " ; # ( )
      ls my\ folder      lists a folder whose name contains a space (the backslash escapes the space)
      ls my_folder       lists a folder
      Use the tab key to autocomplete names
      Tip: write file names in lower case and with underscores instead of spaces
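As a hedged illustration of escaping, the sketch below creates two throwaway folders (the names and files are invented for the demo) and lists the one containing a space in two equivalent ways: with a backslash and with double quotes.

```shell
# Scratch directory with one folder name containing a space (made-up names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir "my folder" my_folder
touch "my folder/a.txt" my_folder/b.txt

# A backslash escapes the space; double quotes do the same job
escaped=$(ls my\ folder)
quoted=$(ls "my folder")
# No escaping is needed when the name uses an underscore
plain=$(ls my_folder)
echo "$escaped $quoted $plain"
```

Without the backslash or the quotes, ls would look for two separate entries called my and folder.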
  • 15. Sol Genomics Network cd changes directory
      cd Desktop     changes directory to Desktop
      cd ..          goes to the parent directory
      cd             goes to the home directory
      cd /           goes to the root directory
      cd -           goes to the previous directory
      Use the tab key to autocomplete names
  • 16. Sol Genomics Network Absolute and relative paths
      ls /home/user/Desktop     lists files in Desktop using an absolute path
      ls Desktop/               lists files in Desktop using a relative path (from your home: /home/bioinfo)
      ls ~/Desktop              lists files in Desktop using your home as a reference
  • 17. Sol Genomics Network Absolute and relative paths
      ls /home/bioinfo/Desktop
      ls ~/Desktop
      Absolute paths do not depend on where you are. ~/ is equivalent to /home/bioinfo/
  • 18. Sol Genomics Network Absolute and relative paths
      cd Desktop/         goes to Desktop when you are in your home (/home/bioinfo)
      ls ../Documents     lists files from Documents when you are in Desktop
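The difference between the two kinds of path can be tried safely with the sketch below. It assumes a scratch directory standing in for the home directory, with invented Desktop and Documents folders, so the real home is not touched.

```shell
# A scratch directory plays the role of the home directory (assumption for the demo).
fake_home=$(mktemp -d)
mkdir "$fake_home/Desktop" "$fake_home/Documents"
touch "$fake_home/Documents/report.txt"

# Relative path: resolved from the current directory (Desktop)
cd "$fake_home/Desktop" || exit 1
relative_listing=$(ls ../Documents)

# Absolute path: gives the same result from any directory
cd / || exit 1
absolute_listing=$(ls "$fake_home/Documents")
echo "$relative_listing $absolute_listing"
```

After `cd /`, the relative path ../Documents would no longer point to the same folder, while the absolute path still does.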
  • 19. Sol Genomics Network Create, copy, move and delete files
      touch tmp_file.txt                creates an empty file called tmp_file.txt
      cp tmp_file.txt file_copy.txt     copies tmp_file.txt to file_copy.txt
      rm file.txt                       deletes file.txt
      mv file1.txt file2.txt            moves or renames a file
      Tip: write file names in lower case and with underscores instead of spaces
  • 20. Sol Genomics Network Locate a file
      locate unix_class_file_samples.zip     locates the path of the file unix_class_file_samples.zip
      locate unix_class                      locates the paths of all files whose name contains unix_class
  • 21. Sol Genomics Network Create, copy and delete directories
      mkdir dir_name               creates an empty directory called dir_name
      rmdir dir_name               deletes the dir_name directory if it is empty
      cp -r dir_name dir_copy      copies dir_name and its files to a new folder
      rm -r dir_name               deletes dir_name and its files
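The file and directory commands from the last slides can be chained into one short session. This is only a sketch in a scratch directory; the names renamed.txt and empty_dir are hypothetical.

```shell
# Scratch directory: rm -r here cannot damage real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir dir_name                                   # create a directory
touch dir_name/tmp_file.txt                      # create an empty file inside it
cp -r dir_name dir_copy                          # copy the directory and its files
mv dir_copy/tmp_file.txt dir_copy/renamed.txt    # rename the copied file
rm -r dir_name                                   # delete the original directory and its files
mkdir empty_dir
rmdir empty_dir                                  # rmdir works because empty_dir is empty

remaining=$(ls)
copy_contents=$(ls dir_copy)
echo "$remaining: $copy_contents"
```

Note that rm and rm -r delete immediately, without a trash folder, so it is worth checking the path before pressing enter.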
  • 22. Sol Genomics Network Compression commands
      tar -zcvf file.tar.gz f1 f2     groups and compresses files (f1 and f2 can be files, directories or wildcards)
      tar -zxvf file.tar.gz           decompresses and ungroups a tar.gz file
  • 23. Sol Genomics Network Compression commands
      gzip f1.txt              compresses file f1.txt into f1.txt.gz
      gunzip file.gz           decompresses file.gz
      zip file.zip f1 f2       compresses files f1 and f2 into file.zip
      unzip file.zip           decompresses file.zip
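A possible round trip with tar and gzip, written as a sketch on two invented text files. The -v flag from the slides is left out to keep the output quiet, and two options that are not on the slides are used: -t, which lists an archive without extracting it, and -C, which extracts into another directory.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'first file\n' > f1.txt
printf 'second file\n' > f2.txt

# Group and compress, then list the archive contents without extracting (-t)
tar -zcf file.tar.gz f1.txt f2.txt
archive_listing=$(tar -ztf file.tar.gz)
echo "$archive_listing"

# Extract into a separate directory (-C) and compare with the original file
mkdir extracted
tar -zxf file.tar.gz -C extracted
cmp f1.txt extracted/f1.txt && roundtrip=ok

# gzip replaces f1.txt with f1.txt.gz; gunzip restores f1.txt
gzip f1.txt
gunzip f1.txt.gz
```

Unlike tar, gzip does not keep the original file next to the compressed one, which is why f1.txt only reappears after gunzip.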
  • 24. Sol Genomics Network Introduction to UNIX command-line II Boyce Thompson Institute October Noe Fernandez
  • 25. Sol Genomics Network Class Content • Terminal file system navigation • Wildcards, shortcuts and special characters • File permissions • Compression UNIX commands • Networking UNIX commands • Basic NGS file formats • Text files manipulation commands • Command-line pipelines • Introduction to bash scripts
  • 26. Sol Genomics Network FASTA format
      A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol at the beginning. (Source: http://www.ncbi.nlm.nih.gov/)
      Example (each description line is followed by its sequence data):
      >sequence_ID1 description
      ATGCGCGCGCGCGCGCGCGGGTAGCAGATGACGACACAGAGCGAGGATGCGCTGAGAGTA
      GTGTGACGACGATGACGGAAAATCAGATGGACCCGATGACAGCATGACGATGGGACGGGA
      AAGATTGGACCAGGACAGGACCAGGACCAGGACCAGGGATTAGA
      >sequence_ID2 description
      ATGGGGGGGACGACGATGGACACAGAGACAGAGACGACGACAGCAGACAGATTTACCTTA
      GACGAGATAGGAGAGACGACAGATATATATATATAGCAGACAGACAGACATTTAGACGAG
      ACGACGATAGACGATaaaaataa
  • 27. Sol Genomics Network FASTQ format
      A FASTQ file normally uses four lines per sequence (source: Wikipedia):
      Line 1 begins with a '@' character, followed by a sequence identifier and an optional description.
      Line 2 is the raw sequence letters.
      Line 3 begins with a '+' character, optionally followed by the same sequence identifier.
      Line 4 encodes the quality values for the sequence in line 2, and must contain the same number of symbols as letters in the sequence.
      Example (description line, sequence data, '+' line, sequence quality):
      @D3B4KKQ1:291:D17NUACXX:8:1101:3630:2109 1:N:0:
      GACTTGCAGGCATGCAAGCTTGGCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACACTGGCGT
      +
      ?@<+ADDDDFDFFI<FGE=EHGIGFFGEFIIFFBGFIDEI>D?FFFFA4;C;DC=;=ABDD;
      @D3B4KKQ1:291:D17NUACXX:8:1101:3971:2092 1:N:0:
      ATTGCAGAAGCGGCCCCGCATCTGCGAAGGGTTAACCGCAGGTGCAGAAGCTGGCTTTAAGTGAGAAGT
      +
      =BAADBA?D?FGI<@FHDB6?ADFEGGIE8@FGGII3ABBBB(;;6@CC?C3;C<99?CCCCC;:::?
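Because each record takes four lines, one way to count reads is to divide the number of lines by four. The sketch below uses a made-up two-read file (toy.fastq is a hypothetical name) in which the second quality string starts with '@' on purpose, to show why counting lines that start with '@' can overestimate the number of reads.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Two invented reads; the quality line of read2 begins with '@'
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'GGCA' '+' '@III' > toy.fastq

# Four lines per record, so records = lines / 4
total_lines=$(wc -l < toy.fastq)
records=$((total_lines / 4))

# Counting '@' at line start also counts the quality line of read2
at_lines=$(grep -c '^@' toy.fastq)
echo "records: $records, lines starting with @: $at_lines"
```

The division assumes the usual four-line layout; a file with wrapped sequence lines would need a different approach.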
  • 28. Sol Genomics Network Tab-delimited text files
      Tab-delimited files are a very common format in scientific data. They consist of columns of text separated by tabs. Other file formats may use different delimiters. Examples: Blast, SAM (mapping), BED, VCF (SNPs), GTF, GFF ...
      Tabular blast output example:
      Query        Subject         id %   length  mismatch  gaps  qstart  qend  sstart  send  evalue  score
      ATCG00890.1  PACid:16418828  90.60  117     11        0     18      134   1       117   1e-71   220
      ATCG00890.1  PACid:16412855  90.48  147     14        2     41      387   27      173   1e-68   214
      ATCG00500.1  PACid:23047568  64.88  299     64        2     220     477   112     410   5e-131  388
      ATCG00500.1  PACid:23052247  58.88  321     69        3     220     477   381     701   3e-117  361
      ATCG00280.1  PACid:24129717  95.99  474     19        0     1       474   1       474   0.0     847
      ATCG00280.1  PACid:24095593  95.36  474     22        0     1       474   1       474   0.0     840
      ATCG00280.1  PACid:20871697  94.94  474     24        0     1       474   1       474   0.0     837
  • 29. Sol Genomics Network Text handling commands
  • 30. Sol Genomics Network less to view large files
      less blast_sample.txt        views file blast_sample.txt
      less -S blast_sample.txt     views file blast_sample.txt without wrapping long lines
      less -N blast_sample.txt     views file blast_sample.txt showing line numbers
      Keys inside less:
      arrow keys     scroll through the file
      space bar      page down
      b              page up
      < or g         go to file beginning
      > or G         go to file end
      /pattern       search pattern
      n              find next
      N              find previous
      q              quit less
  • 31. Sol Genomics Network cat concatenates and prints files
      cat sample1.fasta                                      prints file sample1.fasta on the screen
      cat /home/bioinfo/Desktop/unix_data/sample1.fasta      prints file sample1.fasta on the screen, using its absolute path
      cat sample1.fasta sample2.fasta > new_file.fasta       concatenates files sample1.fasta and sample2.fasta and saves them in the file new_file.fasta (> redirects output to a file)
  • 32. Sol Genomics Network cat concatenates and prints files
      cat sample3.fasta >> new_file.fasta     appends the sample3.fasta file to new_file.fasta (>> appends output to a file)
      cat *fasta > all_samples.fasta          concatenates all FASTA files in the current directory and saves them in the file all_samples.fasta (> redirects output to a file)
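The difference between > and >> is easiest to see on tiny files. This sketch writes three invented one-sequence FASTA files into a scratch directory and then merges them as on the slides.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Three made-up FASTA files with one sequence each
printf '>seq1\nACGT\n' > sample1.fasta
printf '>seq2\nGGCC\n' > sample2.fasta
printf '>seq3\nTTAA\n' > sample3.fasta

# > creates new_file.fasta, or overwrites it if it already exists
cat sample1.fasta sample2.fasta > new_file.fasta
# >> adds to the end of the existing file
cat sample3.fasta >> new_file.fasta

merged_lines=$(wc -l < new_file.fasta)
last_line=$(tail -n 1 new_file.fasta)
echo "$merged_lines lines, last line: $last_line"
```

Using > instead of >> in the second cat would have replaced the first two sequences with the third one.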
  • 33. Sol Genomics Network head displays the first lines of a file
      head blast_sample.txt > blast10.txt     prints the first lines from blast_sample.txt (10 by default) and saves them in blast10.txt
      head -n 5 blast_sample.txt              prints the first five lines from blast_sample.txt
  • 34. Sol Genomics Network tail displays the last part of a file
      tail blast_sample.txt          prints the last 10 lines from blast_sample.txt
      tail -n 5 blast_sample.txt     prints the last five lines from blast_sample.txt
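To check the default of 10 lines and the effect of -n, the sketch below builds a 20-line file of numbered lines as a stand-in for a BLAST table. The file names toy.txt and first10.txt are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Build a file with 20 numbered lines (made-up content)
i=1
while [ "$i" -le 20 ]; do
  echo "line $i"
  i=$((i + 1))
done > toy.txt

# head without options keeps the first 10 lines
head toy.txt > first10.txt
n_default=$(wc -l < first10.txt)

# -n sets the number of lines for both head and tail
first_three=$(head -n 3 toy.txt)
last_two=$(tail -n 2 toy.txt)
echo "$last_two"
```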
  • 35. Sol Genomics Network grep searches patterns in files
      grep '^>' sample1.fasta        prints lines starting with a ">", i.e., prints the description lines of a FASTA file
      grep -c '^>' sample1.fasta     counts lines starting with a ">", i.e., counts the number of sequences in a FASTA file
      grep -c '^+$' *fastq           counts lines formed only by "+", i.e., counts the number of sequences in each FASTQ file in the current directory (one count per file)
      ^ searches the pattern at line start; $ searches the pattern at line end
      Use straight quotes around the pattern: an unquoted > would be read by the shell as a redirection
  • 36. Sol Genomics Network grep searches patterns in files
      grep -v 'Vvin' blast10.txt     prints all lines but the ones containing 'Vvin'
      grep -i 'Vvin' blast10.txt     prints lines containing 'Vvin' in any combination of upper and lower case
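The grep options from the last two slides can be compared on one small file. The sketch assumes an invented FASTA file (toy.fasta) whose description lines mention Vvin with different capitalisation.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Made-up FASTA file: three sequences, two of them labelled Vvin/vvin
printf '%s\n' '>seq1 Vvin' 'ACGT' '>seq2 vvin' 'GGCC' '>seq3 Atha' 'TTAA' > toy.fasta

n_sequences=$(grep -c '^>' toy.fasta)         # description lines
n_vvin_exact=$(grep -c 'Vvin' toy.fasta)      # case-sensitive match
n_vvin_any_case=$(grep -ci 'Vvin' toy.fasta)  # -i ignores case
n_without_vvin=$(grep -vc 'Vvin' toy.fasta)   # -v inverts the match
echo "$n_sequences $n_vvin_exact $n_vvin_any_case $n_without_vvin"
```

Short options can be combined, so -ci is the same as -c -i.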
  • 37. Sol Genomics Network Quiz (BIOINFORMATICIAN): Which option of the command grep makes the search case insensitive? A: -c B: -v C: -i D: -e
  • 38. Sol Genomics Network Quiz answer: C: -i
  • 39. Sol Genomics Network cut gets columns from a tab-delimited file
      cut -f 1,2 blast10.txt                          prints columns 1 and 2 from blast10.txt
      cut -c 1-4,17-21 blast_sample.txt > tmp.txt     prints characters 1 to 4 and 17 to 21 of each line in blast_sample.txt and saves them in tmp.txt
  • 40. Sol Genomics Network sort sorts lines from a file
      sort tmp.txt > tmp2.txt     sorts lines from file tmp.txt and saves them in tmp2.txt
      sort -u tmp.txt             sorts lines from file tmp.txt and removes the repeated ones
      uniq -c tmp2.txt            removes repeated lines from tmp2.txt and counts how many times each one was repeated. Lines have to be sorted first, since only adjacent lines are compared
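Why uniq needs sorted input can be seen by running it before and after sort. This is a sketch on a made-up list of five query ids; tmp.txt and tmp2.txt follow the slide, the other file names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# q1 appears three times, but never on adjacent lines
printf '%s\n' q1 q2 q1 q3 q1 > tmp.txt

# On unsorted input uniq removes nothing, because no repeats are adjacent
uniq tmp.txt > uniq_unsorted.txt

# After sorting, the repeated lines are adjacent and collapse into one
sort tmp.txt > tmp2.txt
uniq tmp2.txt > uniq_sorted.txt
uniq -c tmp2.txt    # shows each line preceded by its count

n_unsorted=$(wc -l < uniq_unsorted.txt)
n_sorted=$(wc -l < uniq_sorted.txt)
echo "unsorted: $n_unsorted lines, sorted: $n_sorted lines"
```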
  • 41. Sol Genomics Network wc counts lines, words and bytes
      wc blast10.txt        counts lines, words and bytes in blast10.txt
      wc -l blast10.txt     counts lines in blast10.txt
      wc -w blast10.txt     counts words in blast10.txt
      wc -c blast10.txt     counts bytes in blast10.txt (including the line returns)
  • 42. Sol Genomics Network paste concatenates files as columns
      cut -f 1 blast10.txt > col1.txt
      cut -f 2 blast10.txt > col2.txt
      cut -f 3 blast10.txt > col3.txt             create one file each for columns 1, 2 and 3 of blast10.txt
      paste col2.txt col3.txt col1.txt            joins the files side by side, line by line, as tab-separated columns
      paste -d ',' col2.txt col3.txt col1.txt     pastes the columns with commas as delimiters
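The cut and paste steps above can be combined to reorder the columns of a table. The sketch uses a two-row, three-column file with invented values (toy_blast.txt and reordered.csv are hypothetical names).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Made-up tab-delimited table: query, subject, identity
printf 'q1\ts1\t90.6\nq2\ts2\t80.1\n' > toy_blast.txt

# One file per column
cut -f 1 toy_blast.txt > col1.txt
cut -f 2 toy_blast.txt > col2.txt
cut -f 3 toy_blast.txt > col3.txt

# Paste them back in a new order, separated by commas
paste -d ',' col2.txt col3.txt col1.txt > reordered.csv
first_row=$(head -n 1 reordered.csv)
echo "$first_row"
```

paste pairs lines by position, so the column files must have their rows in the same order.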
  • 43. Sol Genomics Network sed replaces a pattern
      sed 's/A/a/g' col1.txt          replaces all "A" characters by "a" in every line of col1.txt
      sed 's/Atha/SGN/' col1.txt      replaces the first Atha of each line by SGN
      sed -r 's/^([A-Za-z]+)\|(.+)/gene \2 from \1/' col2.txt     gets the species and gene name from col2.txt and prints each line in a different format
      ([A-Za-z]+) saves the species name in \1; (.+) saves the gene name in \2
      sed prints the result on the screen; the input file itself is not modified
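A hedged sketch of the capture-group substitution, run on two invented identifiers of the form species|gene. It uses -E, which GNU sed accepts as an equivalent of -r and which BSD/macOS sed also understands; this assumes a reasonably recent sed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Made-up species|gene identifiers
printf 'Atha|AT1G01010\nVvin|GSVIVT01\n' > col2.txt

# g replaces every match in the line, not only the first one
all_lower_a=$(sed 's/A/a/g' col2.txt)

# \| matches a literal pipe; \1 and \2 reuse the two captured groups
reformatted=$(sed -E 's/^([A-Za-z]+)\|(.+)/gene \2 from \1/' col2.txt)
echo "$reformatted"
```

Without the backslash, | would mean "or" in an extended regular expression, and the two groups would no longer describe one line.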
  • 44. Sol Genomics Network Pipelines
      A pipeline chains several commands, using the output of each command as the input of the next one. Two commands are connected by placing the sign "|" between them.
      ls | wc -l     counts files in the current directory
  • 45. Sol Genomics Network Pipelines
      cat *fasta | grep "^>" | sed 's/>//'            prints the sequence description lines of all fasta files in the current directory
      cat *fasta | grep -c "^>"                       counts sequences in all fasta files in the current directory
      cut -f 1 blast_sample.txt | sort -u | wc -l     counts the different query ids in a blast tabular file
      cut -f 1 blast_sample.txt | sort | uniq -c      counts the appearances of each query id in a blast tabular file
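The two cut pipelines can be tested on a small table before running them on real BLAST output. In this sketch the table is made up, and the last pipeline adds sort -rn (not shown on the slides) to rank the ids by count.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Made-up two-column BLAST-like table: q1 has three hits, q2 and q3 one each
printf 'q1\ts1\nq2\ts2\nq1\ts3\nq3\ts1\nq1\ts4\n' > toy_blast.txt

# Number of different query ids
n_queries=$(cut -f 1 toy_blast.txt | sort -u | wc -l)

# Number of hits of one particular query id
hits_q1=$(cut -f 1 toy_blast.txt | grep -c '^q1$')

# Hits per query id, most frequent first
cut -f 1 toy_blast.txt | sort | uniq -c | sort -rn
echo "queries: $n_queries, hits for q1: $hits_q1"
```

No intermediate files such as tmp.txt are needed, because each command reads directly from the previous one.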
  • 46. Sol Genomics Network shell script (bash) example • All commands and programs we run in the terminal can be included in a text file with the extension .sh • The commands in this file are executed in the order they were written, from top to bottom • The example script has three parts: the header line of bash scripts, comment lines, and lines that execute a command or program
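As an illustration of those three parts, the sketch below writes a tiny script and runs it. It is not the script shown on the original slide; count_seqs.sh and the FASTA content are invented, and #!/bin/bash is assumed to be the header line the slide refers to.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1\nACGT\n>seq2\nGGCC\n' > sample1.fasta

# Write a script with a header line, a comment line and one command line
cat > count_seqs.sh <<'EOF'
#!/bin/bash
# Count the sequences in a FASTA file (hypothetical example script)
grep -c '^>' sample1.fasta
EOF

# Run it by passing the file to bash; no execute permission is needed this way
script_output=$(bash count_seqs.sh)
echo "$script_output"
```

Running the script as ./count_seqs.sh instead would require making it executable first with chmod.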
  • 47. Sol Genomics Network Run a bash script on a server
      touch file.sh      creates an empty file
      emacs file.sh      opens file.sh in emacs (a text editor)
      ctrl-x ctrl-s      saves in emacs
      ctrl-x ctrl-c      exits emacs
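  • 48. Sol Genomics Network Reviewing the permissions
      Columns of the long listing: permissions, number of links, owner user, owner group, size, date, file name
      First character: d = directory, - = regular file
      Permission characters: r = readable, w = writable, x = executable (or searchable, for a directory), - = permission not granted
      Example: d rwx r-x r-x is a directory with rwx for the user (owner), r-x for the group and r-x for others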
  • 49. Sol Genomics Network Run a bash script on a server
      chmod 755 ./file.sh       makes file.sh executable (see the chmod manual)
      screen -L ./file.sh       runs the file.sh script in screen mode
      ctrl-a d                  detaches the screen (press ctrl-a, then d)
      screen -r process_id      returns to the process screen
      less screenlog.0          views the log from the screen execution
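The effect of chmod 755 can be read directly from the permission string of ls -l. In this sketch the file starts with mode 644 so that the change is visible; the script content is made up. Each digit is a sum of r=4, w=2 and x=1, so 7 is rwx and 5 is r-x.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '#!/bin/bash\necho "script ran"\n' > file.sh

# Start without execute permission: 644 = rw- r-- r--
chmod 644 file.sh
perms_before=$(ls -l file.sh | cut -c 1-10)

# 755 = rwx r-x r-x: the owner can write, everyone can read and execute
chmod 755 file.sh
perms_after=$(ls -l file.sh | cut -c 1-10)
echo "$perms_before -> $perms_after"
```

The first of the ten characters is the file type (- for a regular file), followed by the user, group and other permissions.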
  • 50. Sol Genomics Network Exercises
      1. Merge all fasta files, in the order sample3.fasta, sample1.fasta and sample2.fasta, and save them in a new file called all_samples.fasta
      2. Merge all fastq files (sample1.fastq, sample2.fastq and sample3.fastq) using wildcards, and save them in a new file called all_samples.fastq
      3. Save the first 100 lines from blast_sample.txt in a file called blast100.txt
      4. Save the last 200 lines from blast_sample.txt in a file called blast200.txt
      5. How many sequences are in all_samples.fasta?
      6. How many sequences are in all_samples.fastq?
      7. Create a file with the subject ids and their scores for the first 15 lines of blast_sample.txt
      8. How many different query ids are in blast_sample.txt?
      9. How many different subject ids are in blast_sample.txt?
      10. Replace every '|' in blast_sample.txt by '_' and save the new file in Desktop as tmp.txt
      11. Count how many genes are in each Arabidopsis thaliana chromosome, chloroplast and mitochondria, based on this file: ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_pep_20110103_representative_gene_model_updated