print("BIOE591")
[1] "BIOE591"
For much of the semester we will be executing programs from a command-line interface, or CLI. In scientific computing, a CLI known as the Unix or Linux Shell (sometimes abbreviated *NIX Shell; see if you can figure out why) is most commonly used and a requirement for many scripts and applications. On Mac and Linux operating systems, the CLI is accessed via an application called Terminal, which by default runs a version of the Unix shell called zsh or bash.
Windows users, though likely already equipped with a CLI called PowerShell or CMD, will need to install a Unix shell emulator, as commands and syntax differ between these interpreters and a Unix shell. Unless you have prior knowledge and a strong preference otherwise, I recommend Git for Windows, which comes with a Bash emulator that will work for all lab assignments.
To write code snippets and submit assignments, you will need to have a plain text editor to work with. Plain text files have numerous advantages over writing in (e.g.) a word processing application. First, they are free and open source, something that may become important to you once you lose an academic software license. Second, the files will always remain usable and readable by humans, whereas if Microsoft Word sunsets its proprietary software, .docx files may become useless. Third, they are more or less universally interpretable by other programs (i.e., they are highly portable); this allows you to input data and function arguments to software in automated fashion.
Mac, Linux, and Windows computers all typically come with a preloaded application for this purpose (e.g., TextEdit on macOS). You will likely enjoy a slightly more sophisticated editor aimed at writing code; these typically come with features like syntax highlighting and tab autofill / spellcheck (as well as a suite of AI features and easy version control integration that I will discourage you from using until you know the basics). RStudio provides these features, though as a full-on integrated development environment (or IDE) it is somewhat distracting. In rough order of declining preference, here are my suggestions:
Writing in plain text necessarily means forgoing formatting. Formatting is instead handled by a lightweight plain-text-to-formatted-text language called Markdown. Markdown is portable, simple, and ubiquitous: it is responsible for formatting this assignment, this course website, my lab website, my lecture slides, GitHub READMEs, Reddit posts, and much more besides. The basic idea is that by surrounding words with a handful of characters, you indicate to an interpreter how text should be formatted. For example, in this sentence I have surrounded the last six words with a pair of two asterisks to indicate they should be bold:
Italics work with a single asterisk on each side of a word or phrase, like this:
Headers can be rendered with hash symbols (# Title, ## Section, ### Subsection). Code can be indicated by a pair of backticks (`bash.sh`); three in a row on one line open a block of formatted code, which must then be closed by three more further down. Depending on the flavor of Markdown, you can indicate syntax highlighting by putting the name of the programming language in brackets after the opening line of backticks:
Tables are wonderfully simple. For example, the following text…
…renders as:
| Student | Fun Fact |
|---|---|
| Jason | Doesn’t like to walk |
| Lizzy | Has a chubby cat |
Block quotes can be indicated by greater-than signs (>), e.g.
becomes:
Nonetheless, Jason is still better at walking than Lizzy’s cat.
Numbered lists are equally simple, with the following chunk…
…rendering as:
(Bullet points are handled as you might imagine.)
A cheat sheet to basic and extended syntax can be found here. The web application JotBird is nice for quickly drafting Markdown documents; you may also be interested in downloading a program that can render Markdown as .pdf or .html files locally, such as Pandoc (recommended). (RMarkdown has this ability as well, though you’ll have to download LaTeX via TinyTeX, MacTeX, or another source.)
After opening Terminal, it’s time to get oriented and learn how to navigate a computer via the CLI. To start, we will figure out where in your file structure you actually are, using the equivalent of the R function getwd():
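A minimal sketch of the command in question; the path it prints depends on your username and operating system, so no output is shown here.

```shell
# Print the absolute path of the directory you are currently in
pwd
```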
Easily memorized as print working directory, this should indicate you are in your home directory (e.g., /Users/ethanlinck/), a location you can return to from wherever you are by typing cd ~ and hitting enter. cd is a fundamental tool of any CLI, allowing you to change directories:
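As an illustration, assuming only that you have a home directory, these are three common forms of cd. The last one, cd -, is a standard shortcut not otherwise covered in this handout.

```shell
cd ~     # jump to your home directory
pwd
cd ..    # move up one level, to the parent directory
pwd
cd -     # go back to the directory you were in before (home again)
```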
It can also use explicit paths. Here is an example with a relative path (from my current home directory, /Users/ethanlinck/):
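A runnable sketch of a relative path. The teaching/genomics folders are created inside a throwaway directory (made with mktemp -d, an assumption introduced here so that no real files are touched) that stands in for a home directory.

```shell
# Throwaway stand-in for a home directory, with a hypothetical layout inside it
home_standin="$(mktemp -d)"
mkdir -p "$home_standin/teaching/genomics"

cd "$home_standin"
cd teaching/genomics   # relative path: interpreted from the current directory
pwd
```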
This would navigate to the genomics subdirectory of my teaching directory from my current location. If I am accidentally not in my home directory and there are no such folders where I am, the command will fail. An absolute path helps avoid this risk, though at the cost of flexibility:
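To make the contrast concrete, here is a sketch under the same assumption of a throwaway directory (the variable name base is hypothetical). The relative path is expected to fail from the root directory, while the absolute path works from anywhere.

```shell
# Throwaway directory containing a hypothetical teaching/genomics layout
base="$(mktemp -d)"
mkdir -p "$base/teaching/genomics"

cd /   # start somewhere unrelated
# Relative path: only works if teaching/genomics exists where you are now
cd teaching/genomics 2>/dev/null || echo "relative path not found from /"
# Absolute path: starts at the root (/), so it does not depend on where you are
cd "$base/teaching/genomics"
pwd
```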
Use cd and pwd in combination to navigate around your computer. Note that if you are correctly typing the start of a path, pressing “tab” should autocomplete it, or present the options that share the same prefix.
You can make a directory with the command mkdir and a path (including the name of the new directory):
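A short sketch with hypothetical directory names, created relative to wherever you currently are. The -p argument is an addition not mentioned above: it creates any missing parent directories along the path.

```shell
mkdir practice                   # a new directory in the current location
mkdir -p practice/week_0/notes   # -p creates week_0 and notes in one step
ls practice
```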
Make a new directory entitled test/. We can then create a set of new files to put in test/ for future manipulation using echo, the redirection operator >, and a filename:
(Each line below can be entered individually; the semicolons allow you to copy and paste the chunk below without manually entering line breaks. If you misplace a semicolon or otherwise have a typo, press Ctrl+C to cancel a command. Typing exit closes the current terminal session.)
Navigate to ~/test/. From a given location, or paired with a path (written in help documents as <path>), you can use the command ls to list the contents of a directory:
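One possible version of such a chunk is sketched here. The file names and their contents are hypothetical, and the chunk assumes you run it from the directory that contains (or should contain) test/.

```shell
mkdir -p test;
echo "first file" > test/file_1.txt;
echo "second file" > test/file_2.txt;
echo "some notes" > test/notes.md;
```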
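A brief sketch of both forms. The first two lines only set up a hypothetical file so that there is something to list.

```shell
mkdir -p test
echo "first file" > test/file_1.txt

ls          # no path: list the contents of the current directory
ls test     # with a path: list the contents of that directory instead
```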
The command ls -a will reveal “dotfiles”, typically hidden text files that begin with a period and contain information that helps software run. In this class, the dotfile .gitignore will be useful down the line; we may also manipulate .bashrc, which determines settings for the Unix shell of a particular user. What are the differences between ls and ls -a when run in test/?
At this point you may be wondering where the argument -a came from. The following commands produce documentation (though typing an erroneous argument will also provide a brief example of proper usage):
Commands can be chained together using a pipe (something you may be familiar with from R). Here I am counting the number of files in a directory using the commands ls and wc, with appropriate arguments:
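A sketch of this kind of pipeline, run on a throwaway directory holding three hypothetical files. The -1 argument asks ls for one entry per line, and wc -l counts lines, so the result should be 3 (some systems pad the number with spaces).

```shell
# Throwaway directory with three hypothetical files
dir="$(mktemp -d)"
echo a > "$dir/one.txt"
echo b > "$dir/two.txt"
echo c > "$dir/three.txt"

# ls writes one name per line; the pipe hands those lines to wc, which counts them
ls -1 "$dir" | wc -l
```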
The command cat can be used to print (or concatenate) the contents of a file:
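For instance, with a hypothetical file created just for the purpose:

```shell
mkdir -p test
echo "hello from the shell" > test/greeting.txt
cat test/greeting.txt   # prints: hello from the shell
```

Giving cat several file names prints them one after another, which is where the name comes from.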
Similarly, head can be used to print the first n lines of a file, with n set by the -n argument. We will demonstrate this with a new file and the append operator >> (a doubled >, which adds text on new lines at the end of a file rather than overwriting it):
The command mv can be used to move a file to a new location, e.g.:
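A minimal sketch, assuming a hypothetical file called lines.txt. A single > creates or overwrites the file, while >> adds to the end of it.

```shell
mkdir -p test
echo "line 1" > test/lines.txt     # > creates (or overwrites) the file
echo "line 2" >> test/lines.txt    # >> appends a new line to it
echo "line 3" >> test/lines.txt
head -n 2 test/lines.txt           # prints only "line 1" and "line 2"
```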
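Here is one way this might look, using hypothetical file and directory names. The first argument is the file to move and the second is its destination.

```shell
mkdir -p test/sub
echo "move me" > test/moved.txt
mv test/moved.txt test/sub/moved.txt   # the file now lives in test/sub/
ls test/sub
```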
This can also be used to rename files, even within the same directory:
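In sketch form, again with made-up names: when the source and destination are in the same directory, the only thing that changes is the file name.

```shell
mkdir -p test
echo "draft" > test/draft.txt
mv test/draft.txt test/final.txt   # same directory, new name
```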
An analog is cp, which copies files:
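A small illustration with hypothetical names. Unlike mv, the original file stays where it was.

```shell
mkdir -p test
echo "original" > test/original.txt
cp test/original.txt test/backup.txt   # both files now exist
```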
Using . (shorthand for the current directory) as the destination, in combination with a relative or absolute path to the source, will preserve the name of the original file:
Pasting multiple commands into a text document with the suffix .sh creates a shell script, analogous to one written in R or Python. To do so, start your script with a hashed line called a shebang, which tells your computer which interpreter (here, a particular shell) to use. You can then add multiple commands, separated by semicolons or line breaks:
Save this as script.sh in the test/ directory. To do so you may use your new plain text editor, or type nano to access a CLI-based solution available on Mac and Linux machines. (This takes a second to get used to; ask me if you need help.) Type the command bash script.sh. Does it do what you expect?
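Sketched with a hypothetical file, data.txt, that starts out in a subdirectory and is copied into the current directory under the same name:

```shell
mkdir -p test/sub
echo "keep my name" > test/sub/data.txt
cd test
cp sub/data.txt .   # . means "here", so the copy is also called data.txt
ls
```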
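A minimal example of what such a script could contain. The commands and the output file name are hypothetical, and #!/bin/bash assumes bash is installed at that location, which is typical but not guaranteed.

```shell
#!/bin/bash
# Three commands on one line, separated by semicolons
echo "made by script.sh" > script_output.txt; cat script_output.txt; ls
```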
Wildcards are characters that match multiple patterns. Most useful for our purposes is *, which will match all files in a given directory, or all
Another useful wildcard is ?, which matches a single character:
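Both wildcards can be tried safely in a throwaway directory. The file names here are hypothetical, chosen so that the two patterns give different results.

```shell
# Throwaway directory with four hypothetical files
dir="$(mktemp -d)"
cd "$dir"
echo x > sample_1.txt
echo x > sample_2.txt
echo x > sample_10.txt
echo x > notes.md

ls *.txt          # * matches any run of characters: all three .txt files
ls sample_?.txt   # ? matches exactly one character: sample_10.txt is left out
```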
Lastly, the command rm removes (deletes) files and directories. For example, rm test/sub/file.md deletes the file test/sub/file.md. Paired with wildcards, you can quickly remove many files in one fell swoop.
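A sketch of wildcard deletion inside a throwaway directory, with hypothetical file names. Running ls with the same pattern first is a habit suggested here, not a requirement of rm: it previews exactly which files the pattern will match.

```shell
# Throwaway directory so that nothing of value can be deleted
dir="$(mktemp -d)"
cd "$dir"
echo x > a.tmp
echo x > b.tmp
echo x > keep.txt

ls *.tmp    # preview: which files does the pattern match?
rm *.tmp    # delete them; there is no undo and no trash folder
ls          # only keep.txt remains
```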
rm with caution
One pitfall of becoming a CLI ninja is that you will not get warning messages when you inevitably deploy a powerful command somewhere you don’t intend to. rm paired with a wildcard should be used especially sparingly. What do you think would happen if you navigated to your Desktop and typed the following command?
We will ease into our coding exercises with a brief shell scripting activity. I have created a set of dummy data files that are typical of population genomics—.fastq.gz raw sequencing read files, .fasta assembled sequence data, text files (*.txt), and metadata (*.csv) files. First, download these data using the command curl:
Using the information above and any other resource available to you*, write a single bash script called 01_homework_<your_name>.sh that does the following:
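1. Creates a week_1 directory;
2. Creates the subdirectories fastq/, fasta/, and metadata/;
3. Moves *.fastq.gz files to fastq/;
4. Moves *.fasta files to fasta/;
5. Moves *.csv files to metadata/.

For now, you may keep this in your local directory; we’ll work on submitting it as part of next week’s activities.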
* This will generally include AI tools, and as these are an indispensable part of modern programming, I am hesitant to outright ban them for this course. However, it will be in your best interest to attempt to put this script together yourself, from its component parts, as these commands need to become second nature as you navigate the cluster.