Getting started with Python for data analysis.

Dataframes are 2-dimensional labeled data structures with columns of potentially different types. You can think of it like a spreadsheet. Numpy and Pandas are two very powerful and commonly used libraries used for datasets in bioinformatics. If you don’t have these installed, you can get them as part of the SciPy bundle. You can also use conda to install them individually.

Numpy

Numpy is a library for arrays.

Helpful, lengthy explanation of indexes and slices in arrays: https://stackoverflow.com/a/24713353
Numpy basics: https://docs.scipy.org/doc/numpy-1.15.0/user/quickstart.html
More explanation of data types (numpy): https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html
Why do we want to use numpy vs regular python lists? https://stackoverflow.com/questions/993984/what-are-the-advantages-of-numpy-over-regular-python-lists

Pandas

Pandas is a python library for data frames. Understanding the basics of numpy will be helpful before getting into pandas.

Pandas introduction: http://pandas.pydata.org/pandas-docs/stable/10min.html
And another: https://www.learnpython.org/en/Pandas_Basics
Selecting data in a dataframe (iloc): https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/

Tutorial challenge from an introduction to biocomputing class - Prompt and corresponding code

To follow along you can download the dataset by pasting this code into your command line: curl -L https://osf.io/kges5/download -o wages.csv

Intro to python

Helpful links for data analysis in python