Python¶

Setting up¶

Install Python using miniforge. This is preferable to Anaconda or Miniconda, because those set the default channel to the paid one at anaconda.org. Miniforge instead sets the default channel to the free, community-maintained conda-forge channel. This helps keep your work reproducible, because others can recreate your environment without needing a license for the anaconda.org channel.

You may have heard about Jupyter notebooks. While they have their place in data science-type work, I do not think they are a good way to learn Python. Here are some reasons why; they boil down to needing to mentally keep track of which cells have been run, and when, to avoid strange and hard-to-track-down bugs.

At first, stay away from IDEs like PyCharm or Spyder. These distance you too much from what is really happening and add an extra layer of complexity. Once you have the basics down, then by all means see what IDEs or notebook solutions work best for you.

To start, use just a text editor and a terminal running Python (or IPython, see the Use IPython section). My own setup is Vim in one terminal next to another terminal running IPython.

Learning Python¶

When learning Python, it’s very important to start with the basics. Sometimes people start learning Python because they’re interested in data science, and those sorts of tutorials jump right into teaching pandas. While pandas is amazing and it’s important to be fluent in pandas, learning pandas is not learning Python.

What do I mean by that?

It’s possible to get most of the way through a pandas tutorial without ever encountering a dictionary (a fundamental data structure that is used all over Python code). While someone might be good at pandas specifically, they don’t learn the rest of Python. So they end up having big gaps in their knowledge. Once they start hitting those gaps (like the first time they need to use a dictionary), ideally they should go all the way back to learning the fundamentals. But in practice, they want to solve the problem they’re working on in pandas and don’t want to take the time to learn everything. As a result they’ll learn just enough of the basics to solve that one problem, and leave the rest of the gaps in their knowledge until they hit the next one, when the process repeats.

It’s much better in the long run to learn the fundamentals. Then you will have a shared vocabulary with everyone else – including Stack Overflow commenters! – that will help you understand more complex topics as you progress.
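As a small illustration of the kind of fundamentals meant here, this is a minimal sketch of working with a dictionary in plain Python, with no pandas involved. The gene names and counts are made up for the example.

```python
# A dictionary maps keys to values. The gene names and counts are made up.
counts = {"geneA": 12, "geneB": 0, "geneC": 7}

# Look up a value by key
print(counts["geneA"])

# Add (or update) an entry
counts["geneD"] = 3

# .get() returns a default instead of raising KeyError for a missing key
missing = counts.get("geneZ", 0)

# Loop over key/value pairs
expressed = []
for gene, n in counts.items():
    if n > 0:
        expressed.append(gene)
print(expressed)
```

Patterns like these show up throughout Python code, including in the arguments you pass to pandas functions.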

If you like learning from first principles, The Official Python tutorial is a complete tutorial. It’s a bit dry though.
If you like jumping into the deep end and working out first principles on your own, Dive Into Python is another good tutorial.
How to Think Like a Computer Scientist takes a different approach and teaches principles by controlling graphics on the screen. A bonus is that you can run code directly in the browser. If you are a visual thinker or visual learner, give this one a shot.
Automate the Boring Stuff with Python brings you through the basics of Python, and comes highly recommended by some BSPC post-bacs.
Learn Python the Hard Way is highly regarded but costs $30.
If you plan to learn bash, Python, and R, A Primer for Computational Biology is a single resource that teaches all of these.

Use IPython¶

Python comes with an interpreter (what you enter when you type python at the command line). But that interpreter is limited in a lot of ways.

As early as possible, start using IPython. Its tab-completion, integrated help, and debugging are amazing and will make your life so much easier. It also makes working with matplotlib plots (and therefore seaborn and pandas plots) much nicer, spawning the figures as a separate window and returning you to the prompt. You can also directly interact with the Bash shell from inside IPython (see this page for more).

Learn more about IPython in the IPython tutorial.

Debugging in Python¶

Something that is often skipped over in tutorials is the utility of interactive debugging. In IPython, when you get an error you can call %debug from the IPython command line and you are dropped into a live version of your code at the exact point where the error was caused. You can then inspect the values of various variables to troubleshoot what went wrong. This is much more powerful than sprinkling print() statements throughout your code!

There is a good intro to the debugging workflow at SciPy lecture notes: debugging, along with what to do when you can’t use IPython.
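The workflow can be sketched with a deliberately buggy function. The names `mean_length` and `run` are hypothetical and exist only for this example. In IPython, calling `mean_length([])` raises an error, and typing `%debug` afterwards lets you inspect `total` and `sequences` at the failing line. Outside IPython, the standard library's `pdb.post_mortem()` offers a similar session, as in the `run` helper below.

```python
import pdb
import sys


def mean_length(sequences):
    """Return the mean length of the strings in `sequences`."""
    total = sum(len(s) for s in sequences)
    # Bug for illustration: an empty list makes this divide by zero
    return total / len(sequences)


def run(sequences, debug=False):
    """Call mean_length(); optionally open a post-mortem debugger on error."""
    try:
        return mean_length(sequences)
    except ZeroDivisionError:
        if debug:
            # Opens an interactive pdb session at the frame that raised
            pdb.post_mortem(sys.exc_info()[2])
        raise
```

Inside the debugger, `p total` prints a variable, `u` and `d` move up and down the call stack, and `q` quits.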

Learning pandas¶

The pandas package is the standard for working with tables of data (like from a spreadsheet).

Note

It’s important to learn Python (see above resources) before jumping into pandas. Pandas is almost its own mini-language, so learning pandas does not mean you’re learning Python!

Visual intro to pandas is very basic but visually helps you bridge the conceptual gap between Excel and pandas.
The official pandas tutorial list has several options you can try to see what fits your learning style best.
DataCamp Pandas is a somewhat more gradual introduction to pandas.
If you’ve been using pandas already, advanced pandas tricks has some useful tricks.

After learning pandas, you should be able to do the following (in very rough order of beginner to advanced):

read csv or tsv or url into dataframe
select rows and columns
save to file
discuss the difference between .loc and .iloc
apply a function to a column
create a DataFrame from lists or dictionaries
find row with largest value in column
chain pandas.DataFrame methods together to build a “pipeline”
inspect for duplicates
work with Excel files
remove duplicates
get rows where column value is one of a set
discuss ways of handling missing data
join dataframes together (aligning by index)
group-by and summarize (e.g., find group means)
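A few of the skills in the list above can be sketched in one short session. This is only an illustration on a made-up table, not a substitute for the tutorials; the column names and values are assumptions for the example.

```python
import pandas as pd

# Create a DataFrame from a dictionary of lists (made-up data)
df = pd.DataFrame(
    {
        "sample": ["s1", "s2", "s3", "s4", "s4"],
        "group": ["ctrl", "ctrl", "treat", "treat", "treat"],
        "value": [1.0, 3.0, 10.0, 6.0, 6.0],
    }
)

# Inspect for duplicate rows, then remove them
n_dup = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)

# .loc selects by label, .iloc by integer position.
# They agree here only because the index is the default 0, 1, 2, ...
first_value_loc = df.loc[0, "value"]
first_value_iloc = df.iloc[0, 2]

# Row with the largest value in a column
top_row = df.loc[df["value"].idxmax()]

# Rows where a column value is one of a set
subset = df[df["sample"].isin(["s1", "s3"])]

# Apply a function to a column
df["value_doubled"] = df["value"].apply(lambda x: x * 2)

# Group-by and summarize
group_means = df.groupby("group")["value"].mean()
```

Running this line by line in IPython and inspecting each result is a good way to check your understanding of each step.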

Visualization in Python¶

There are a lot of visualization options in Python. I think it’s best to learn matplotlib and then use seaborn, but this dramatic comparison of Python visualization libraries is entertaining and shows the different options.

After learning matplotlib and/or seaborn, you should be able to do the following (in very rough order of beginner to advanced):

plot line plots, scatter plots, histograms, bar plots, heatmaps
change the axes labels and title
save to file
adjust x- and y-ticks and tick labels
use different colors
choose appropriate colormaps for heatmaps
make subplots
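Several of the items above fit into one small matplotlib script. This is a minimal sketch with made-up data. The `Agg` backend is selected only so the script runs without a display; in an interactive IPython session you would likely leave that line out.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Made-up data
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

# Two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot with axes labels, a title, and a chosen color
ax1.plot(x, y, color="tab:blue")
ax1.set_xlabel("x")
ax1.set_ylabel("x squared")
ax1.set_title("Line plot")

# Scatter plot with custom x-ticks and tick labels
ax2.scatter(x, y, color="tab:orange")
ax2.set_xticks([0, 2, 4])
ax2.set_xticklabels(["zero", "two", "four"])
ax2.set_title("Scatter plot")

fig.tight_layout()

# Save to file (a temporary directory is used here to avoid clutter)
outfile = os.path.join(tempfile.mkdtemp(), "example.png")
fig.savefig(outfile)
```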

Matplotlib¶

Matplotlib is extremely powerful, as it gives you access to every aspect of a plot. It is well worth the time to learn the basics of matplotlib, and then move on to seaborn, which wraps matplotlib into easier-to-use functions and classes.

matplotlib quick start is the best place to start if you’re new to matplotlib.
The matplotlib tutorials page has beginner, intermediate, and advanced tutorials.
The matplotlib gallery shows the kinds of things you can do with matplotlib.

Seaborn¶

The seaborn tutorial page lays out everything you need to know about seaborn.

Useful built-in Python modules¶

There are many built-in Python modules; here is a list of those that I keep coming back to. There’s no need to jump in and start learning these one-by-one. But it is important to be aware of what’s available. For example, it’s useful to know that if you are going to be building command-line tools, you should look more into the argparse module.

The Python Module of the Week (PyMOTW) is a great resource for learning about these as well. Here I’m just listing the ones I most commonly use:

argparse: build a command-line interface to your code, with auto-generated help.
collections: has the very useful defaultdict, Counter, and OrderedDict classes
datetime: work with dates, times, and timedeltas
glob: use wildcards when searching for files
itertools: fast, memory-efficient functions especially useful for working with very large datasets
json: read in JSON-formatted text
os: useful tools for interacting with the operating system (env vars, usernames, file permissions, etc)
pathlib: manipulate filenames and directories (new in Python 3.4)
pprint: pretty-print. Useful for printing out big objects
re: regular expressions
shutil: shell-related utilities (copy/move files and directories)
sqlite3: create and interact with SQLite3 file-based databases
subprocess: call out to the shell, for when you need to call other programs from within Python
sys: various system-related functions. Often used for sys.argv which contains the arguments a Python script was called with
tempfile: create and manipulate temporary files
textwrap: nicely indent or dedent text, or line-wrap to a fixed line length
zipfile: interact with zip files
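To give a flavor of how a few of these modules combine, here is a minimal sketch of a command-line tool that counts files by extension using argparse, pathlib, and collections. The function names and the `--top` option are made up for this example.

```python
import argparse
from collections import Counter
from pathlib import Path


def count_extensions(directory):
    """Count the files in `directory` by file extension."""
    return Counter(p.suffix for p in Path(directory).iterdir() if p.is_file())


def build_parser():
    """Build the command-line interface; argparse generates --help for us."""
    parser = argparse.ArgumentParser(description="Count files by extension.")
    parser.add_argument("directory", help="Directory to scan")
    parser.add_argument(
        "--top", type=int, default=5, help="Number of extensions to show"
    )
    return parser


def main(argv=None):
    """Parse arguments (sys.argv[1:] when argv is None) and print the counts."""
    args = build_parser().parse_args(argv)
    for ext, n in count_extensions(args.directory).most_common(args.top):
        print(f"{ext or '(none)'}\t{n}")
```

From IPython, `main(["."])` scans the current directory, and `main(["--help"])` prints the auto-generated help and then exits.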

Useful Python libraries¶

Below are some useful and commonly-used Python libraries to give you a flavor of what else is possible with Python. Like the modules above, this section is more for being aware of what’s out there, and you can look for more details on particular ones that seem like they would be helpful for your work.

argh is great for building more complex command-line tools
biopython is the way to parse FASTA, FASTQ, and do various sequence manipulation (for other file formats like SAM/BAM or GTF/BED/VCF, see below)
cyvcf2 for working with VCF files
flask is a website development framework
matplotlib, for plotting
numpy is actually the basis for matplotlib, scipy, and pandas, but is useful on its own
pandas for tabular data manipulation
pybedtools wraps and greatly extends bedtools for manipulating BED/VCF/GTF/GFF/BAM/SAM files. Written and maintained by BSPC!
pysam for working with BAM/SAM files. Also VCF.
requests for working with anything from the internet (downloading pages etc)
scikit-learn for machine learning
scipy general scientific computing (e.g., signal processing, stats, linear algebra)
seaborn wraps and extends matplotlib for plotting
sphinx for documentation. This very site is built using Sphinx!
trackhub for building UCSC track hubs. Written and maintained by BSPC!
yaml for working with YAML config files

Python skills¶

Some people have asked about what skills they would be expected to have when learning Python. That’s a very difficult question, as it depends on exactly what you’re using Python for.

Below, I’ve attempted to categorize various parts of base Python into different levels. This is by no means exhaustive, and the items and organization likely reflect the biases of my own path when learning and using Python. And by no means do you have to learn everything here! You can do a lot of really interesting things just with the “level 1” skills.

Note that many of the more advanced topics will not be found in the tutorials linked above, so you’ll need to find your own resources for learning them, or get in touch (ryan.dale@nih.gov) if you would like some pointers.

There are lots of commonly-used Python modules (see sections above), each of which has its own list of skills. This section is just about base Python.

Level 1¶

creating lists, dicts, tuples
difference between list and tuple
methods of string
methods of list
methods of dict
importing
functions
for loops
while loops
using IPython
run in IPython
open a file
write to a file
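As a rough picture of what "level 1" looks like in practice, here is a minimal sketch touching several of the items above. All names and values are made up for the example.

```python
import os
import tempfile

# Lists are mutable; tuples are not
samples = ["s1", "s2"]
samples.append("s3")
point = (1.5, 2.5)  # point[0] = 0 would raise a TypeError

# Some string and dict methods
header = " Name,Value \n".strip().lower().split(",")
lookup = {"s1": 10}
lookup.update({"s2": 20})


def total(values):
    """Return the sum of `values` using a for loop."""
    result = 0
    for v in values:
        result += v
    return result


# Write to a file (in a temporary directory)...
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as fout:
    for name in samples:
        fout.write(name + "\n")

# ...then open it and read it back line by line
lines = []
with open(path) as fin:
    for line in fin:
        lines.append(line.strip())
```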

Level 2¶

debugging in IPython (with pdb)
list comprehensions
dict comprehensions
sets
discuss dictionary order
4 < 3 and 5 > 4 is False, why?
manually parse a file line-by-line
difference between *args and **kwargs in a function definition
common standard modules (os, sys, argparse, pathlib, glob)
f-strings
docstrings
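Several of the "level 2" items are compact enough to show together. This is a minimal sketch; the function `describe` and the example values are made up.

```python
def describe(*args, **kwargs):
    """Return a summary of the arguments received.

    *args collects extra positional arguments into a tuple;
    **kwargs collects extra keyword arguments into a dict.
    """
    return f"{len(args)} positional, keywords: {sorted(kwargs)}"


# List and dict comprehensions
squares = [n ** 2 for n in range(5)]
lengths = {word: len(word) for word in ["a", "bb", "ccc"]}

# Sets hold unique items and support operations like intersection
shared = {"a", "b", "c"} & {"b", "c", "d"}

# Dictionaries preserve insertion order (guaranteed since Python 3.7)
order = list({"z": 1, "a": 2})

# Comparisons bind more tightly than `and`, so this is (4 < 3) and (5 > 4)
result = 4 < 3 and 5 > 4

summary = describe(1, 2, sep=",", end="")
```

Here `result` is `False` because `4 < 3` is already `False`, so `and` never needs to evaluate `5 > 4`.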

Level 3¶

dealing with unicode
building a command-line interface with argparse
object-oriented design
string formatting mini-language
discuss when you would use *args and **kwargs in a function definition
write a generator function
discuss when you would use a generator function
making a class an iterator
“dunder” methods
why import * is not a great idea (discussion of namespaces)
Zen of Python
lambda expressions
pep8
using decorators
raising errors
catching errors
using if __name__ == "__main__"
understanding what if __name__ == "__main__" means
organizing code into modules
what those __pycache__ directories are
doctests
using a context manager
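A few "level 3" items can be sketched in one small module: a generator function, raising and catching errors, a lambda expression, and the `if __name__ == "__main__"` idiom. The function names and the tab-separated record format are assumptions made for this example.

```python
def read_records(lines):
    """Generator function: yield one parsed (name, value) record at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, value = line.split("\t")
        yield name, int(value)


def safe_total(lines):
    """Sum the values, raising a clearer error for malformed input."""
    total = 0
    try:
        for _, value in read_records(lines):
            total += value
    except ValueError as err:
        # Catch the low-level error and raise one with more context
        raise ValueError(f"malformed record: {err}") from err
    return total


# A lambda expression used as a sort key
by_value = sorted(read_records(["a\t2", "b\t1"]), key=lambda rec: rec[1])


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported
    print(safe_total(["a\t1", "", "b\t2"]))
```

Because `read_records` is a generator, it never holds the whole input in memory, which is the usual reason to reach for one when parsing large files.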

Level 4¶

shallow vs deep copy
function annotations
type hints
writing decorators
writing a package
unit tests
writing a context manager and discussing why it’s useful
create and use sqlite3 databases
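For a taste of "level 4", here is a minimal sketch of shallow versus deep copies, writing a decorator, type hints, and writing a context manager. The names `count_calls`, `add`, and `recorded` are made up for this example.

```python
import copy
import functools
from contextlib import contextmanager

# Shallow vs deep copy
nested = {"values": [1, 2]}
shallow = copy.copy(nested)       # new dict, but shares the inner list
deep = copy.deepcopy(nested)      # new dict with its own copy of the list
nested["values"].append(3)        # visible through `shallow`, not `deep`


def count_calls(func):
    """Decorator that records how many times `func` is called."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def add(a: int, b: int) -> int:
    """Add two integers (the annotations are type hints, not enforced)."""
    return a + b


@contextmanager
def recorded(log: list, label: str):
    """Context manager that logs entry and exit, even if the body raises."""
    log.append(f"enter {label}")
    try:
        yield
    finally:
        log.append(f"exit {label}")
```

The `finally` clause is what makes a context manager useful: the cleanup step runs whether or not the body of the `with` block raises an error.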

Level 5¶

cython extensions
asyncio
multiprocessing