Python

See also

See Choosing R or Python if you’re trying to decide whether to learn R or Python.

Different people have different learning styles. Below are various well-regarded resources for learning Python, but they take different strategies. You’ll be spending a lot of time with them, so it might be a good idea to briefly look at them all first to see what would be a good fit for you.

Setting up

Install Python using Miniforge. This is preferable to Anaconda or Miniconda, because those set the default channel to Anaconda’s own defaults channel, which can require a paid license. Miniforge instead sets the default channel to the free, community-maintained conda-forge channel. This helps keep your work reproducible, since nobody needs an Anaconda license to recreate your environment.

You may have heard about Jupyter notebooks. While they have their place in data science-type work, I do not think they are a good way to learn Python. Here are some reasons why; they boil down to needing to mentally keep track of which cells have been run, and when, to avoid strange and hard-to-track-down bugs.

At first, stay away from IDEs like PyCharm or Spyder. These remove you too much from what is really happening and add an extra layer of complexity. Once you have the basics down then by all means see what IDEs or notebook solutions work best for you.

To start, use just a text editor and a terminal running Python (or IPython, see the Use IPython section). My own setup is Vim in one terminal next to another terminal running IPython.

Learning Python

When learning Python, it’s very important to start with the basics. Sometimes people will start learning Python because they’re interested in data science, and those sorts of tutorials jump right in to teaching pandas. While pandas is amazing and it’s important to be fluent in pandas, learning pandas is not learning Python.

What do I mean by that?

It’s possible to get most of the way through a pandas tutorial without ever encountering a dictionary (a fundamental data structure that is used all over Python code). While someone might be good at pandas specifically, they don’t learn the rest of Python. So they end up having big gaps in their knowledge. Once they start hitting those gaps (like the first time they need to use a dictionary), ideally they should go all the way back to learning the fundamentals. But in practice, they want to solve the problem they’re working on in pandas and don’t want to take the time to learn everything. As a result they’ll learn just enough of the basics to solve that one problem, and leave the rest of the gaps in their knowledge until they hit the next one, when the process repeats.

It’s much better in the long run to learn the fundamentals. Then you will have a shared vocabulary with everyone else – including Stack Overflow commenters! – that will help you understand more complex topics as you progress.

Use IPython

Python comes with an interpreter (what you enter when you type python at the command line). But that interpreter is limited in a lot of ways.

As early as possible, start using IPython. Its tab-completion, integrated help, and debugging are amazing and will make your life so much easier. It also makes working with matplotlib plots (and therefore seaborn and pandas plots) much nicer, spawning the figures as a separate window and returning you to the prompt. You can also directly interact with the Bash shell from inside IPython (see this page for more).

Learn more about IPython in the IPython tutorial.

Debugging in Python

Something that is often skipped over in tutorials is the utility of interactive debugging. In IPython, when you get an error you can call %debug from the IPython command line and you are dropped into a live version of your code at the exact point where the error was caused. You can then inspect the values of various variables to troubleshoot what went wrong. This is much more powerful than sprinkling print() statements throughout your code!

There is a good intro to the debugging workflow at SciPy lecture notes: debugging, along with what to do when you can’t use IPython.
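
As a minimal sketch of the post-mortem idea described above (the mean_length function and its input are hypothetical, made up for illustration): the script below triggers an error and reports where it happened. In an interactive session you would let the error propagate and type %debug in IPython, or use the pdb call shown in the comment in plain Python.

```python
import traceback


def mean_length(words):
    """Return the average length of the strings in `words` (hypothetical example)."""
    total = sum(len(w) for w in words)
    return total / len(words)  # raises ZeroDivisionError if `words` is empty


try:
    mean_length([])
except ZeroDivisionError as err:
    # In IPython you would let the error propagate and then type %debug.
    # In plain Python, an interactive equivalent is:
    #     import pdb; pdb.post_mortem(err.__traceback__)
    # Here we only report where the error happened, so the script runs unattended.
    failing_frame = traceback.extract_tb(err.__traceback__)[-1]
    print(f"{type(err).__name__} raised in {failing_frame.name}()")
```

Once inside the debugger, printing `words` would immediately show the empty list that caused the problem.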

Learning pandas

The pandas package is the standard for working with tables of data (like from a spreadsheet).

Note

It’s important to learn Python (see above resources) before jumping into pandas. pandas is almost its own mini-language, so learning pandas does not mean you’re learning Python!

After learning pandas, you should be able to do the following (in very rough order of beginner to advanced):

  • read csv or tsv or url into dataframe

  • select rows and columns

  • save to file

  • discuss the difference between .loc and .iloc

  • apply a function to a column

  • create a DataFrame from lists or dictionaries

  • find row with largest value in column

  • chain pandas.DataFrame methods together to build a “pipeline”

  • inspect for duplicates

  • work with Excel files

  • remove duplicates

  • get rows where column value is one of a set

  • discuss ways of handling missing data

  • join dataframes together (aligning by index)

  • group-by and summarize (e.g., find group means)
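
To make a few of the items above concrete, here is a minimal sketch. The table is made-up data read from an in-memory string (in practice you would pass a file path or URL to read_csv), and the column names are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical data; normally this would be a path or URL.
csv_text = io.StringIO(
    "sample,group,score\n"
    "a,ctrl,1.0\n"
    "b,ctrl,3.0\n"
    "c,treat,5.0\n"
    "c,treat,5.0\n"
)
df = pd.read_csv(csv_text)

deduped = df.drop_duplicates()                            # remove duplicate rows
first_by_position = deduped.iloc[0]                       # .iloc selects by integer position
by_label = deduped.set_index("sample").loc["b", "score"]  # .loc selects by label
top_row = deduped.loc[deduped["score"].idxmax()]          # row with the largest score
subset = deduped[deduped["sample"].isin(["a", "c"])]      # column value is one of a set
group_means = deduped.groupby("group")["score"].mean()    # group-by and summarize

# Chaining methods together into a small "pipeline"
pipeline = (
    df.drop_duplicates()
    .assign(double=lambda d: d["score"] * 2)
    .sort_values("score", ascending=False)
)
```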

Visualization in Python

There are a lot of visualization options in Python. I think it’s best to learn matplotlib and then use seaborn, but this dramatic comparison of Python visualization libraries is entertaining and shows the different options.

After learning matplotlib and/or seaborn, you should be able to do the following (in very rough order of beginner to advanced):

  • plot line plots, scatter plots, histograms, bar plots, heatmaps

  • change the axes labels and title

  • save to file

  • adjust x- and y-ticks and tick labels

  • use different colors

  • choose appropriate colormaps for heatmaps

  • make subplots
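
As a rough sketch of several of these skills in matplotlib (the data, labels, and output filename are made up for illustration): the Agg backend is selected only so that the script runs without a display. In an interactive IPython session you would normally skip that line.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

# Two subplots side by side
fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

ax_line.plot(x, y, color="tab:blue")
ax_line.set_xlabel("x")
ax_line.set_ylabel("x squared")
ax_line.set_title("Line plot")

ax_scatter.scatter(x, y, color="tab:orange")
ax_scatter.set_xticks([0, 2, 4])                     # adjust tick positions
ax_scatter.set_xticklabels(["zero", "two", "four"])  # and their labels
ax_scatter.set_title("Scatter plot")

fig.tight_layout()

# Save to file (a temporary directory is used here to avoid clutter)
out_path = os.path.join(tempfile.mkdtemp(), "example.png")
fig.savefig(out_path)
plt.close(fig)
```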

Matplotlib

Matplotlib is extremely powerful, as it gives you access to every aspect of a plot. It is well worth the time to learn the basics of matplotlib, and then move on to seaborn, which wraps matplotlib into easier-to-use functions and classes.

Seaborn

The seaborn tutorial page lays out everything you need to know about seaborn.

Useful built-in Python modules

There are many built-in Python modules; here is a list of those that I keep coming back to. There’s no need to jump in and start learning these one by one, but it is important to be aware of what’s available. For example, it’s useful to know that if you are going to be building command-line tools, you should look more into the argparse module.

The Python Module of the Week (PyMOTW) is a great resource for learning about these as well. Here I’m just listing the ones I most commonly use:

  • argparse: build a command-line interface to your code, with auto-generated help.

  • collections: has the very useful defaultdict, Counter, and OrderedDict classes

  • datetime: work with dates, times, and timedeltas

  • glob: use wildcards when searching for files

  • itertools: fast, memory-efficient functions especially useful for working with very large datasets

  • json: read in JSON-formatted text

  • os: useful tools for interacting with the operating system (env vars, usernames, file permissions, etc)

  • pathlib: manipulate filenames and directories (new in Python 3.4)

  • pprint: pretty-print. Useful for printing out big objects

  • re: regular expressions

  • shutil: shell-related utilities (copy/move files and directories)

  • sqlite3: create and interact with SQLite3 file-based databases

  • subprocess: call out to the shell, for when you need to call other programs from within Python

  • sys: various system-related functions. Often used for sys.argv which contains the arguments a Python script was called with

  • tempfile: create and manipulate temporary files

  • textwrap: nicely indent or dedent text, or line-wrap to a fixed line length

  • zipfile: interact with zip files
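
To give a flavor of a few of these modules, here is a minimal sketch. All names and values (the gene and chromosome names, the file path, the --threads option) are made up for illustration.

```python
import argparse
import json
import textwrap
from collections import Counter, defaultdict
from pathlib import Path

# collections.Counter: tally items
base_counts = Counter("GATTACA")

# collections.defaultdict: group values without first checking whether a key exists
genes_by_chrom = defaultdict(list)
for chrom, gene in [("chr1", "geneA"), ("chr2", "geneB"), ("chr1", "geneC")]:
    genes_by_chrom[chrom].append(gene)

# pathlib: build and take apart paths (nothing is created on disk here)
path = Path("results") / "sample1.fastq.gz"

# json: parse JSON-formatted text into Python objects
config = json.loads('{"threads": 4, "samples": ["a", "b"]}')

# textwrap: wrap long text to a fixed width
wrapped = textwrap.fill("word " * 20, width=30)

# argparse: a tiny command-line interface; the arguments are passed explicitly
# here, whereas a real script would call parser.parse_args() with no arguments
parser = argparse.ArgumentParser(description="Hypothetical example tool")
parser.add_argument("--threads", type=int, default=1, help="number of threads")
args = parser.parse_args(["--threads", "4"])
```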

Useful Python libraries

Below are some useful and commonly-used Python libraries to give you a flavor of what else is possible with Python. Like the modules above, this section is more for being aware of what’s out there, and you can look for more details on particular ones that seem like they would be helpful for your work.

  • argh is great for building more complex command-line tools

  • biopython is the way to parse FASTA, FASTQ, and do various sequence manipulation (for other file formats like SAM/BAM or GTF/BED/VCF, see below)

  • cyvcf2 for working with VCF files

  • flask is a website development framework

  • matplotlib, for plotting

  • numpy is actually the basis for matplotlib, scipy, and pandas, but is useful on its own

  • pandas for tabular data manipulation

  • pybedtools wraps and greatly extends bedtools for manipulating BED/VCF/GTF/GFF/BAM/SAM files. Written and maintained by BSPC!

  • pysam for working with BAM/SAM files. Also VCF.

  • requests for working with anything from the internet (downloading pages etc)

  • scikit-learn for machine learning

  • scipy for general scientific computing (e.g., signal processing, stats, linear algebra)

  • seaborn wraps and extends matplotlib for plotting

  • sphinx for documentation. This very site is built using Sphinx!

  • trackhub for building UCSC track hubs. Written and maintained by BSPC!

  • yaml (installed as the PyYAML package) for working with YAML config files

Python skills

Some people have asked about what skills they would be expected to have when learning Python. That’s a very difficult question, as it depends on exactly what you’re using Python for.

Below, I’ve attempted to categorize various parts of base Python into different levels. This is by no means exhaustive, and the items and organization likely reflect the biases of my own path when learning and using Python. And by no means do you have to learn everything here! You can do a lot of really interesting things just with the “level 1” skills.

Note that many of the more advanced topics will not be found in the tutorials linked above, so you’ll need to find your own resources for learning them, or get in touch at ryan.dale@nih.gov if you would like some pointers.

There are lots of commonly-used Python modules (see sections above), each of which has its own list of skills. This section is just about base Python.

Level 1

  • creating lists, dicts, tuples

  • difference between list and tuple

  • methods of string

  • methods of list

  • methods of dict

  • importing

  • functions

  • for loops

  • while loops

  • using IPython

  • using %run in IPython

  • open a file

  • write to a file
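
As a rough idea of what the “level 1” skills look like together, here is a minimal sketch. The sample names, sizes, and filename are made up, and the file is written to a temporary directory.

```python
import os
import tempfile

samples = ["a", "b", "c"]      # list: mutable, can grow and shrink
point = (1.5, 2.5)             # tuple: immutable, cannot be changed in place
sizes = {"a": 10, "b": 20}     # dict: maps keys to values

samples.append("d")            # a list method
sizes["c"] = 30                # add a new key to the dict


def describe(name, lookup):
    """Return a one-line description of `name` using the dict `lookup`."""
    return name + ": " + str(lookup.get(name, "unknown"))


# Write one line per sample to a file
path = os.path.join(tempfile.mkdtemp(), "sizes.txt")
with open(path, "w") as handle:
    for name in samples:
        handle.write(describe(name, sizes) + "\n")

# Open the file again and read it back line by line
lines = []
with open(path) as handle:
    for line in handle:
        lines.append(line.strip())

# A while loop: keep halving until the value drops below 1
value = 10
steps = 0
while value >= 1:
    value = value / 2
    steps += 1
```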

Level 2

  • debugging in IPython (with pdb)

  • list comprehensions

  • dict comprehensions

  • sets

  • discuss dictionary order

  • 4 < 3 and 5 > 4 is False, why?

  • manually parse a file line-by-line

  • difference between *args and **kwargs in a function definition

  • common standard modules (os, sys, argparse, pathlib, glob)

  • f-strings

  • docstrings
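
Here is a minimal sketch touching several of the “level 2” items. The function name report and the example values are made up for illustration.

```python
squares = [n ** 2 for n in range(5)]                 # list comprehension
lengths = {word: len(word) for word in ["a", "bb"]}  # dict comprehension
unique = set("GATTACA")                              # set: unique items only

# Dicts preserve insertion order (guaranteed since Python 3.7)
key_order = list(lengths)

# Comparisons bind more tightly than `and`, so this is evaluated as
# (4 < 3) and (5 > 4), that is, False and True
result = 4 < 3 and 5 > 4


def report(*args, **kwargs):
    """Summarize the call: *args arrives as a tuple, **kwargs as a dict."""
    return f"{len(args)} positional, {sorted(kwargs)} keyword"


summary = report(1, 2, sep=",", end="")
```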

Level 3

  • dealing with unicode

  • building a command-line interface with argparse

  • object-oriented design

  • string formatting mini-language

  • discuss when you would use *args and **kwargs in a function definition

  • write a generator function

  • discuss when you would use a generator function

  • making a class an iterator

  • “dunder” methods

  • why import * is not a great idea (discussion of namespaces)

  • Zen of Python

  • lambda expressions

  • pep8

  • using decorators

  • raising errors

  • catching errors

  • using if __name__ == "__main__"

  • understanding what if __name__ == "__main__" means

  • organizing code into modules

  • what those __pycache__ directories are

  • doctests

  • using a context manager
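
A few of the “level 3” items can be sketched as follows. All function and class names here are hypothetical and exist only for this example.

```python
def read_fasta_names(lines):
    """Generator function: lazily yield record names from FASTA-style lines."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


class Countdown:
    """An iterator class, built from the __iter__ and __next__ dunder methods."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


def safe_divide(a, b):
    """Divide a by b, raising a descriptive error for a zero denominator."""
    if b == 0:
        raise ValueError("b must be non-zero")  # raising an error
    return a / b


def main():
    names = list(read_fasta_names([">seq1", "ACGT", ">seq2", "GGCC"]))
    try:
        safe_divide(1, 0)
    except ValueError as err:                   # catching an error
        message = str(err)
    return names, list(Countdown(3)), message


# True only when this file is run as a script, not when it is imported
if __name__ == "__main__":
    print(main())
```

Because the work lives in main() rather than at the top level, the file can be imported by other code (or by tests) without side effects.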

Level 4

  • shallow vs deep copy

  • function annotations

  • type hints

  • writing decorators

  • writing a package

  • unit tests

  • writing a context manager and discussing why it’s useful

  • create and use sqlite3 databases
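
Here is a minimal sketch of several “level 4” items. The decorator count_calls, the context manager in_memory_db, and the table layout are all made up for illustration.

```python
import copy
import functools
import sqlite3
from contextlib import contextmanager

# Shallow vs deep copy: a shallow copy shares the nested list, a deep copy does not
nested = {"samples": ["a", "b"]}
shallow = copy.copy(nested)
deep = copy.deepcopy(nested)
nested["samples"].append("c")  # visible through `shallow`, not through `deep`


def count_calls(func):
    """Decorator that records how many times `func` has been called."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def add(a: int, b: int) -> int:  # type hints on the arguments and return value
    return a + b


@contextmanager
def in_memory_db():
    """Context manager that guarantees the connection is closed afterwards."""
    conn = sqlite3.connect(":memory:")
    try:
        yield conn
    finally:
        conn.close()


with in_memory_db() as conn:
    conn.execute("CREATE TABLE scores (name TEXT, value REAL)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", [("a", 1.0), ("b", 3.0)])
    total = conn.execute("SELECT SUM(value) FROM scores").fetchone()[0]
```

The context manager is useful because the cleanup in the finally block runs even if the code inside the with block raises an error.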

Level 5

  • cython extensions

  • asyncio

  • multiprocessing