Python Cheatsheet 1

Collections


Python Lists

A list is an ordered and mutable container; it is arguably the most common data structure in Python. Understanding how lists work becomes even more relevant as you work on data cleaning and create for-loops.

list = list[from_inclusive : to_exclusive : ±step_size]

list.append(el)               # Or: list += [el]
list.extend(collection)       # Or: list += collection

list.sort()
list.reverse()
list = sorted(collection)
iter = reversed(list)         # Returns a reverse iterator, not a list.

sum_of_elements  = sum(collection)
elementwise_sum  = [sum(pair) for pair in zip(list_a, list_b)]
sorted_by_second = sorted(collection, key=lambda el: el[1])
sorted_by_both   = sorted(collection, key=lambda el: (el[1], el[0]))
flatter_list     = list(itertools.chain.from_iterable(list))
product_of_elems = functools.reduce(lambda out, el: out * el, collection)
list_of_chars    = list(str)

# Module operator provides functions itemgetter() and mul() that offer the same functionality as lambda expressions above.
int = list.count(el)         # Returns number of occurrences. Also works on strings.
index = list.index(el)       # Returns index of first occurrence or raises ValueError.
list.insert(index, el)       # Inserts item at index and moves the rest to the right.
el = list.pop([index])       # Removes and returns item at index or from the end.
list.remove(el)              # Removes first occurrence of item or raises ValueError.
list.clear()                 # Removes all items. Also works on dictionary and set.
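
As a minimal sketch of the list operations above (the sample data and variable names are made up for illustration), the following shows slicing, sorting with operator.itemgetter() in place of a lambda, flattening, and a product with operator.mul():

```python
import functools
import itertools
import operator

numbers = [3, 1, 4, 1, 5, 9]          # hypothetical sample data

first_three = numbers[:3]             # [3, 1, 4]
every_second = numbers[::2]           # [3, 4, 5]
backwards = numbers[::-1]             # [9, 5, 1, 4, 1, 3]

pairs = [("b", 2), ("a", 2), ("c", 1)]
# itemgetter(1) is equivalent to: lambda el: el[1]
sorted_by_second = sorted(pairs, key=operator.itemgetter(1))
# itemgetter(1, 0) is equivalent to: lambda el: (el[1], el[0])
sorted_by_both = sorted(pairs, key=operator.itemgetter(1, 0))

nested = [[1, 2], [3], [4, 5]]
flat = list(itertools.chain.from_iterable(nested))   # [1, 2, 3, 4, 5]
product = functools.reduce(operator.mul, numbers)    # 3*1*4*1*5*9 = 540
```

Because sorted() is stable, items with equal keys keep their original order, which is why sorting by the second element alone leaves ("b", 2) ahead of ("a", 2).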

Dictionary

A dictionary is a data structure that maps keys to values and uses the keys for lookup. Dictionaries preserve insertion order (guaranteed since Python 3.7), cannot contain duplicate keys, and are vital for Data Scientists, especially those interested in web scraping, for example when extracting data from YouTube channels.

dict = {'x': 1, 'y': 2}                       # Dictionary example

dict.keys()                                   # Coll. of keys that reflects changes.
dict.values()                                 # Coll. of values that reflects changes.
dict.items()                                  # Coll. of key-value tuples that reflects changes.

value  = dict.get(key, default=None)          # Returns default if key is missing.
value  = dict.setdefault(key, default=None)   # Returns and writes default if key is missing.
dict = collections.defaultdict(type)          # Creates a dict with default value of type.
dict = collections.defaultdict(lambda: 1)     # Creates a dict with default value 1.

dict = dict(collection)                       # Creates a dict from coll. of key-value pairs.
dict = dict(zip(keys, values))                # Creates a dict from two collections.
dict = dict.fromkeys(keys [, value])          # Creates a dict from collection of keys.

dict.update(dict)                             # Adds items. Replaces ones with matching keys.
value = dict.pop(key)                         # Removes item or raises KeyError.
{k for k, v in dict.items() if v == value}    # Returns set of keys that point to the value.
{k: v for k, v in dict.items() if k in keys}  # Returns a dictionary, filtered by keys.
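
The sketch below, built on made-up words, illustrates how defaultdict and setdefault() handle missing keys and how the comprehensions above filter a dictionary. All variable names are illustrative assumptions.

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]  # made-up data

# Group words by first letter; a missing key starts as an empty list.
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

# The same grouping with a plain dict and setdefault().
by_letter_plain = {}
for word in words:
    by_letter_plain.setdefault(word[0], []).append(word)

lengths = dict(zip(words, map(len, words)))
missing = lengths.get("durian", 0)                      # 0, and no key is added
six_letter = {k for k, v in lengths.items() if v == 6}  # keys pointing to value 6
filtered = {k: v for k, v in lengths.items() if k in {"apple", "cherry"}}
```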

Range & Enumerate

The range function returns a sequence of numbers that starts at 0 and increments by 1 by default, and stops before a specified number. The enumerate function takes a collection (for example a tuple) and pairs each item with a counter, returning an enumerate object. Both functions are useful in for-loops.

range_name = range(to_exclusive)
range_name = range(from_inclusive, to_exclusive)
range_name = range(from_inclusive, to_exclusive, ± step_size)


for i, el in enumerate(collection [, i_start]):
    ...
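
A short sketch of both functions in use, with hypothetical column names as the collection:

```python
evens = list(range(0, 10, 2))          # [0, 2, 4, 6, 8]
countdown = list(range(5, 0, -1))      # [5, 4, 3, 2, 1]

columns = ("name", "age", "city")      # hypothetical column names
numbered = []
for i, column in enumerate(columns, 1):   # counter starts at 1 instead of 0
    numbered.append(f"{i}: {column}")
```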

Python Types


String

Strings are objects containing a sequence of characters. String methods always return new values and will not change the original string.

str  = str.strip()                         # Strips all whitespace characters from both ends.
str  = str.strip('chars')                  # Strips all passed characters from both ends.

list = str.split()                         # Splits on one or more whitespace characters.
list = str.split(sep=None, maxsplit=-1)    # Splits on 'sep' str at most 'maxsplit' times.
list = str.splitlines(keepends=False)      # Splits on line breaks (\n, \r, \r\n and others). Keeps them if 'keepends'.
str  = str.join(coll_of_strings)           # Joins elements using string as separator.

bool = sub_str in str                      # Checks if string contains a substring.
bool = str.startswith(sub_str)             # Pass tuple of strings for multiple options.
bool = str.endswith(sub_str)               # Pass tuple of strings for multiple options.
int  = str.find(sub_str)                   # Returns start index of first match or -1.
int  = str.index(sub_str)                  # Same but raises ValueError if missing.

str  = str.replace(old, new [, count])     # Replaces 'old' with 'new' at most 'count' times.
str  = str.translate(table)                # Use `str.maketrans(dict)` to generate table.

str  = chr(int)                            # Converts int to Unicode char.
int  = ord(str)                            # Converts Unicode char to int.

Don’t forget to use some other basic methods such as lower(), upper(), capitalize() and title().
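
To see several of these methods work together, here is a minimal sketch that cleans a made-up input string. Note that the original string is never modified, because every method returns a new value.

```python
raw = "  Data, Science;  Python \n"     # made-up input

clean = raw.strip()                               # 'Data, Science;  Python'
table = str.maketrans({",": " ", ";": " "})       # map punctuation to spaces
words = clean.translate(table).split()            # ['Data', 'Science', 'Python']
slug = "-".join(word.lower() for word in words)   # 'data-science-python'

has_python = "Python" in raw                      # True
position = raw.find("R")                          # -1, because 'R' is absent
is_script = "report.py".endswith((".py", ".ipynb"))   # tuple means "any of these"
```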

Regular Expressions (Regex)

A regular expression is a sequence of characters that describes a search pattern. In Pandas, regular expressions are integrated with vectorized string methods, making it easier to find and extract patterns of characters. Learning how to use Regex makes data cleaning less time-consuming for Data Scientists.

import re
str   = re.sub(regex, new, text, count=0)  # Substitutes all occurrences with 'new'.
list  = re.findall(regex, text)            # Returns all occurrences as strings.
list  = re.split(regex, text, maxsplit=0)  # Use a capturing group (parentheses) in regex to keep the matches.
match = re.search(regex, text)             # Searches for first occurrence of the pattern.
match = re.match(regex, text)              # Searches only at the beginning of the text.
iter  = re.finditer(regex, text)           # Returns all occurrences as match objects.

By default, '\s', '\d' and '\w' match whitespace, digits and alphanumeric characters from any alphabet. Pass flags=re.ASCII to restrict them to the ASCII sets shown below. Use a capital letter for negation: for example, '\D' matches any non-digit.

'\d' == '[0-9]'                                # Matches any digit.
'\w' == '[a-zA-Z0-9_]'                         # Matches any alphanumeric.
'\s' == '[ \t\n\r\f\v]'                        # Matches any whitespace.
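
The following sketch runs the functions above on a made-up sentence and shows the effect of re.ASCII on '\d'. The sample text and names are assumptions for illustration only.

```python
import re

text = "Order 66 shipped on 2021-03-04, order 7 on 2021-03-09."  # made-up text

numbers = re.findall(r"\d+", text)                 # every run of digits
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)     # ISO-style dates only
masked = re.sub(r"\d", "#", text, count=2)         # replaces the first two digits

# A capturing group makes re.split() keep the separators.
parts = re.split(r"(,)", "a,b")                    # ['a', ',', 'b']

match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
year = match.group(1)                              # '2021'

# Without re.ASCII, \d also matches digits from other scripts.
arabic_indic = "\u0663"                            # ARABIC-INDIC DIGIT THREE
unicode_hit = re.fullmatch(r"\d", arabic_indic)
ascii_hit = re.fullmatch(r"\d", arabic_indic, flags=re.ASCII)
```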

Numbers

int      = int(float/str/bool)         # Truncates toward zero. math.floor(float) rounds down instead.
float    = float(int/str/bool)         
complex  = complex(real=0, imag=0)     
Fraction = fractions.Fraction(0, 1)    # Or: Fraction(numerator=0, denominator=1)
Decimal  = decimal.Decimal(str/int)    # Or: Decimal((sign, digits, exponent))
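
A brief sketch of how these numeric types behave. It highlights the difference between int() and math.floor() for negative numbers, and why Decimal and Fraction are handy when exact results matter.

```python
import math
from decimal import Decimal
from fractions import Fraction

truncated = int(-2.7)            # -2, int() drops the fractional part
floored = math.floor(-2.7)       # -3, floor() rounds toward negative infinity
from_text = int("42")            # 42
from_bool = float(True)          # 1.0

float_sum_exact = (0.1 + 0.2 == 0.3)                                     # False
decimal_sum_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
half = Fraction(1, 3) + Fraction(1, 6)                                   # Fraction(1, 2)
```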

Math & Basic Statistics

Python's built-in math and statistics modules cover mathematical tasks and basic statistics. For a DataFrame, the pandas describe() method computes summary statistics for each column.

from math import e, pi, inf, nan, isinf, isnan
from math import cos, acos, sin, asin, tan, atan, degrees, radians
from math import log, log10, log2
from statistics import mean, median, variance, stdev, pvariance, pstdev
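
As an illustration, the sketch below applies a few of the imported functions to hypothetical page-view durations. The data is invented for the example.

```python
from math import isclose, log2, radians, sin
from statistics import mean, median, pstdev, stdev

page_times = [12.0, 15.0, 11.0, 30.0, 12.0]   # hypothetical seconds per page

average = mean(page_times)          # 16.0
middle = median(page_times)         # 12.0, less sensitive to the 30.0 outlier
sample_sd = stdev(page_times)       # sample standard deviation (divides by n - 1)
population_sd = pstdev(page_times)  # population standard deviation (divides by n)

half = sin(radians(30))             # approximately 0.5, trig functions use radians
bits = log2(1024)                   # 10.0
```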

Datetime

The datetime module provides the date (d), time (t), datetime (dt) and timedelta (td) classes. These classes are immutable and hashable: an object's value cannot change after creation, so Python can compute a stable hash for it and the object can be used as a dictionary key. The datetime module is crucial to Data Analysts, who frequently encounter data sets showing "time of purchase" or "how long users spent on a specific page."

from datetime import date, time, datetime, timedelta
from dateutil.tz import UTC, tzlocal, gettz, resolve_imaginary   # Third-party: pip3 install python-dateutil

# Constructors
d  = date(year, month, day)
t  = time(hour=0, minute=0, second=0, microsecond=0, tzinfo=None, fold=0)
dt = datetime(year, month, day, hour=0, minute=0, second=0, ...)
td = timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=0, weeks=0)
                 
# Now
d/dtn  = d/dt.today()                     # Current local date or naive datetime.
dtn    = dt.utcnow()                      # Naive datetime from current UTC time. Deprecated since Python 3.12.
dta    = dt.now(tzinfo)                   # Aware datetime from current tz time.

# To extract time
dtn.time()
dta.time()
dta.timetz()

# Encode
d/t/dt = d/t/dt.fromisoformat('iso')     # Object from ISO string. Raises ValueError.
dt     = dt.strptime(str, 'format')      # Datetime from str, according to format.
d/dtn  = d/dt.fromordinal(int)           # d/dtn from days since 0001-01-01 (day 1), at midnight.
dtn    = dt.fromtimestamp(real)          # Local time dtn from seconds since the Epoch.
dta    = dt.fromtimestamp(real, tzinfo)  # Aware datetime from seconds since the Epoch.

# Decode
str    = d/t/dt.isoformat(sep='T')       # Also timespec='auto/hours/minutes/seconds'.
str    = d/t/dt.strftime('format')       # Custom string representation.
int    = d/dt.toordinal()                # Days since 0001-01-01 (day 1), ignoring time and tz.
float  = dtn.timestamp()                 # Seconds since the Epoch, from dtn in local tz.
float  = dta.timestamp()                 # Seconds since the Epoch, from dta.

# Format
from datetime import datetime
dt = datetime.strptime('2015-05-14 23:39:00.00 +0200', '%Y-%m-%d %H:%M:%S.%f %z')
dt.strftime("%A, %dth of %B '%y, %I:%M%p %Z")
"Thursday, 14th of May '15, 11:39PM UTC+02:00"

# Arithmetic
d/dt   = d/dt   ± td                     # Returned datetime can fall into missing hour.
td     = d/dtn  - d/dtn                  # Returns the difference, ignoring time jumps.
td     = dta    - dta                    # Ignores time jumps if they share tzinfo object.
td     = dt_UTC - dt_UTC                 # Convert dts to UTC to get the actual delta.
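
The sketch below walks through a few of these operations with made-up timestamps. To stay within the standard library it assumes a fixed UTC offset built with datetime.timezone rather than a dateutil time zone.

```python
from datetime import date, datetime, timedelta, timezone

purchase = datetime(2021, 3, 4, 18, 30)              # naive, made-up timestamp
delivery = purchase + timedelta(days=2, hours=3)     # 2021-03-06 21:30
waited = delivery - purchase                         # timedelta of 2 days, 3 hours

# Aware datetimes carry a UTC offset; a fixed offset stands in for a real zone here.
plus_two = timezone(timedelta(hours=2))
local = datetime(2021, 3, 4, 20, 30, tzinfo=plus_two)
as_utc = local.astimezone(timezone.utc)              # same instant, 18:30 UTC

iso = local.isoformat()                              # '2021-03-04T20:30:00+02:00'
round_trip = datetime.fromisoformat(iso)             # parses back to an equal object
label = local.strftime("%Y-%m-%d %H:%M %z")          # '2021-03-04 20:30 +0200'

# Immutable and hashable, so dates can serve as dictionary keys.
sales = {date(2021, 3, 4): 3}
```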

Python Syntax


Lambda

A Python lambda function is an anonymous function that takes arguments like a normal function but whose body is a single expression. Lambdas are handy when Data Scientists need a function only once and don’t want to write an entire Python function.

# Lambda
function = lambda: return_value
function = lambda argument_1, argument_2: return_value
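
A few typical uses, sketched with made-up records. Lambdas appear most often as the key or callback argument of another function.

```python
people = [("Ada", 36), ("Grace", 45), ("Alan", 41)]   # made-up records

add = lambda a, b: a + b                               # add(2, 3) returns 5
by_age = sorted(people, key=lambda person: person[1])
names = list(map(lambda person: person[0].upper(), people))
over_40 = list(filter(lambda person: person[1] > 40, people))
```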

Python Comprehension

Comprehensions are one-line expressions that build lists, sets, dictionaries or generators from iterable sources, such as lists. They are perfect for simplifying for-loops and map() while respecting one of the principles of the Zen of Python: "readability counts." So, try not to overcomplicate your list/dictionary comprehensions.

list = [i+1 for i in range(10)]                   # [1, 2, ..., 10]
set  = {i for i in range(10) if i > 5}            # {6, 7, 8, 9}
iter = (i+5 for i in range(10))                   # Generator yielding 5, 6, ..., 14
dict = {i: i*2 for i in range(10)}                # {0: 0, 1: 2, ..., 9: 18}

out = [i+j for i in range(10) for j in range(10)]

# Is the same as
out = []
for i in range(10):
    for j in range(10):
        out.append(i+j)

Python Comprehension If statement

new_list = [expression for member in iterable (if conditional)]
sentence = 'the rocket came back from mars'
vowels = [i for i in sentence if i in 'aeiou']
print(vowels)
['e', 'o', 'e', 'a', 'e', 'a', 'o', 'a']

Python Comprehension If-Else

The conditional expression (the one-line form of if-else) picks one of two values depending on a condition. Inside a comprehension it transforms each item instead of filtering items out.

obj = expression_if_true if condition else expression_if_false

obj = [a if a else 'zero' for a in (0, 1, 2, 3)]
['zero', 1, 2, 3]
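
The difference between a trailing if (a filter) and an if-else expression (a mapping) is easy to mix up. The sketch below contrasts them on made-up readings.

```python
readings = [12, -3, 0, 7, -8]   # made-up sensor values

# Trailing "if" filters: negative values are dropped.
non_negative = [r for r in readings if r >= 0]        # [12, 0, 7]

# Conditional expression maps: every value is kept, some are replaced.
clipped = [r if r >= 0 else 0 for r in readings]      # [12, 0, 0, 7, 0]

# Both can be combined, and the same syntax builds dictionaries.
labels = {r: ("high" if r > 5 else "low") for r in readings if r >= 0}
```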

Python Libraries


Libraries are lifesavers for Data Scientists. Some of these libraries are massive and have been created specifically to address data professionals’ needs. Because both NumPy and pandas offer many essential features, below you will find some of the basics useful to beginners and early-career Data Scientists.

NumPy

Python is a high-level language, as there is no need to manually allocate memory or specify how the CPU performs certain operations. A low-level language, such as C, gives us this control and improves specific code performance (vital when working with Big Data). One of the reasons why NumPy makes Python efficient is vectorisation, which takes advantage of Single Instruction Multiple Data (SIMD) to process data more quickly. NumPy will become extremely useful as you apply linear algebra to your machine learning projects.

Bear in mind that NumPy's counterpart to a list is a 1D ndarray, whereas a 2D ndarray corresponds to a list of lists. A 2D ndarray is indexed along both rows and columns, and that is how you will select and slice values.

$ pip3 install numpy
import numpy as np

array = np.array(list)
array = np.arange(from_inclusive, to_exclusive, ±step_size)
array = np.ones(shape)
array = np.random.randint(from_inclusive, to_exclusive, shape)

array.shape = shape
view  = array.reshape(shape)
view  = np.broadcast_to(array, shape)

array = array.sum(axis)
indexes = array.argmin(axis)

# Shape is a tuple of dimension sizes.
# Axis is the index of a dimension that gets collapsed. The leftmost dimension has index 0.
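
To make shape, axis and row/column indexing concrete, here is a minimal sketch on a small 2D ndarray. It assumes NumPy is installed, and the array contents are arbitrary.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)     # [[0, 1, 2], [3, 4, 5]], shape (2, 3)

column_sums = matrix.sum(axis=0)        # collapses axis 0 (rows)    -> [3, 5, 7]
row_sums = matrix.sum(axis=1)           # collapses axis 1 (columns) -> [3, 12]
first_row = matrix[0, :]                # [0, 1, 2]
last_column = matrix[:, -1]             # [2, 5]

# Vectorised arithmetic replaces an explicit Python loop.
doubled = matrix * 2
# Broadcasting stretches a (3,) array across both rows of the (2, 3) array.
shifted = matrix + np.array([10, 20, 30])
smallest_per_row = matrix.argmin(axis=1)   # index of the minimum in each row
```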

Pandas

Although NumPy provides crucial structures and tools which make working with data a lot easier, there are some limitations:

  • Because there is no support for column names, NumPy forces you to frame every question as a multi-dimensional array operation.

  • Because there is support for only one data type per ndarray, it’s more challenging to work with data containing both numeric and string data.

  • There are many low-level methods, yet many common analysis patterns don’t have pre-built methods.

Luckily, the pandas library addresses each issue mentioned above. Pandas is not a replacement for NumPy but a massive extension built on top of it. For the sake of completeness: the main objects in pandas are Series and DataFrames. The former is a labelled counterpart of a 1D ndarray, while the latter is a labelled counterpart of a 2D ndarray whose columns can hold different data types. Pandas is vital to data cleaning and analysis.

$ pip3 install pandas
import pandas as pd
from pandas import Series, DataFrame

Pandas Series

A Pandas Series is a one-dimensional array that holds any type of data (integer, float, string, Python objects). The axis labels are collectively called the index.

>>> Series([1, 2], index=['x', 'y'], name='a')
x    1
y    2
Name: a, dtype: int64

sr = Series(list)                           # Assigns RangeIndex starting at 0.
sr = Series(dict)                           # Takes dictionary's keys for index.
sr = Series(dict/Series, index=list)        # Only keeps items with keys specified in index.

el = sr.loc[key]                            # Or: sr.iloc[index]
sr = sr.loc[keys]                           # Or: sr.iloc[indexes]
sr = sr.loc[from_key : to_key_inclusive]    # Or: sr.iloc[from_i : to_i_exclusive]

el = sr[key/index]                          # Or: sr.key
sr = sr[keys/indexes]                       # Or: sr[key_range/range]
sr = sr[bools]                              # Or: sr.i/loc[bools]

sr = sr ><== el/sr                          # Returns a Series of bools.
sr = sr +-*/ el/sr                          # Items with non-matching keys get value NaN.

sr = pd.concat(coll_of_sr)                  # Series.append() was removed in pandas 2.0.
sr = sr.combine_first(sr)                   # Adds items that are not yet present.
sr.update(sr)                               # Updates items that are already present.
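
A short sketch of label-based and position-based selection, index alignment and combining, using made-up prices. It assumes pandas is installed.

```python
import pandas as pd

prices = pd.Series([1.5, 2.0], index=["apple", "pear"], name="price")
extra = pd.Series([3.0, 8.0], index=["kiwi", "pear"])    # made-up second series

by_label = prices.loc["pear"]             # 2.0
by_position = prices.iloc[0]              # 1.5
expensive = prices[prices > 1.8]          # boolean mask keeps 'pear' only

total = prices + extra                    # aligned on keys; NaN for 'apple' and 'kiwi'
stacked = pd.concat([prices, extra])      # four items, 'pear' appears twice
merged = prices.combine_first(extra)      # keeps prices, adds the missing 'kiwi'
```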

DataFrame

A Pandas DataFrame is a 2-dimensional labelled data structure whereby each column can have a unique name. You can think of Pandas DataFrames as a SQL table or a simple spreadsheet.

>>> DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['x', 'y'])
   x  y
a  1  2
b  3  4

df    = DataFrame(list_of_rows)           # Rows can be either lists, dicts or series.
df    = DataFrame(dict_of_columns)        # Columns can be either lists, dicts or series.

el    = df.loc[row_key, column_key]       # Or: df.iloc[row_index, column_index]
sr/df = df.loc[row_key/s]                 # Or: df.iloc[row_index/es]
sr/df = df.loc[:, column_key/s]           # Or: df.iloc[:, column_index/es]
df    = df.loc[row_bools, column_bools]   # Or: df.iloc[row_bools, column_bools]

sr/df = df[column_key/s]                  # Or: df.column_key
df    = df[row_bools]                     # Keeps rows as specified by bools.
df    = df[df_of_bools]                   # Assigns NaN to False values.

df    = df ><== el/sr/df                  # Returns DataFrame of bools.
df    = df +-*/ el/sr/df                  # Items with non-matching keys get value NaN.

df    = df.set_index(column_key)          # Replaces row keys with values from a column.
df    = df.reset_index()                  # Moves row keys to column named index.
df    = df.filter(regex='regex', axis=1)  # Only keeps columns whose key matches the regex.
df    = df.melt(id_vars=column_key/s)     # Converts DF from wide to long format.
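
The selection and reshaping calls above can be sketched on a small made-up DataFrame as follows. It assumes pandas is installed, and the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp_min": [-3, 15, 20], "temp_max": [4, 24, 33]},
    index=["a", "b", "c"],
)

cell = df.loc["b", "temp_max"]                 # 24, selected by labels
same_cell = df.iloc[1, 2]                      # 24, selected by positions
warm = df[df["temp_min"] > 10]                 # boolean mask keeps rows 'b' and 'c'
temps = df.filter(regex=r"^temp_", axis=1)     # columns temp_min and temp_max

by_city = df.set_index("city")                 # 'city' becomes the row index
restored = by_city.reset_index()               # 'city' is a regular column again
long_format = df.melt(id_vars="city")          # one row per (city, variable) pair
```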

Merge, Join & Concat

These methods enable Data Scientists to expand their analysis by combining multiple datasets into a single DataFrame.

>>> l = DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['x', 'y'])
   x  y 
a  1  2 
b  3  4 
>>> r = DataFrame([[4, 5], [6, 7]], index=['b', 'c'], columns=['y', 'z'])
   y  z
b  4  5
c  6  7

+------------------------+---------------+------------+------------+--------------------------+
|                        |    'outer'    |   'inner'  |   'left'   |       Description        |
+------------------------+---------------+------------+------------+--------------------------+
| l.merge(r, on='y',     |    x   y   z  | x   y   z  | x   y   z  | Joins/merges on column.  |
|            how=…)      | 0  1   2   .  | 3   4   5  | 1   2   .  | Also accepts left_on and |
|                        | 1  3   4   5  |            | 3   4   5  | right_on parameters.     |
|                        | 2  .   6   7  |            |            | Uses 'inner' by default. |
+------------------------+---------------+------------+------------+--------------------------+
| l.join(r, lsuffix='l', |    x yl yr  z |            | x yl yr  z | Joins/merges on row keys.|
|           rsuffix='r', | a  1  2  .  . | x yl yr  z | 1  2  .  . | Uses 'left' by default.  |
|           how=…)       | b  3  4  4  5 | 3  4  4  5 | 3  4  4  5 | If r is a series, it is  |
|                        | c  .  .  6  7 |            |            | first converted to DF.   |
+------------------------+---------------+------------+------------+--------------------------+
| pd.concat([l, r],      |    x   y   z  |     y      |            | Adds rows at the bottom. |
|           axis=0,      | a  1   2   .  |     2      |            | Uses 'outer' by default. |
|           join=…)      | b  3   4   .  |     4      |            | Replaces `l.append(r)`,  |
|                        | b  .   4   5  |     4      |            | removed in pandas 2.0.   |
|                        | c  .   6   7  |     6      |            |                          |
+------------------------+---------------+------------+------------+--------------------------+
| pd.concat([l, r],      |    x  y  y  z |            |            | Adds columns at the      |
|           axis=1,      | a  1  2  .  . | x  y  y  z |            | right end.               |
|           join=…)      | b  3  4  4  5 | 3  4  4  5 |            | Uses 'outer' by default. |
|                        | c  .  .  6  7 |            |            |                          |
+------------------------+---------------+------------+------------+--------------------------+
| l.combine_first(r)     |    x   y   z  |            |            | Adds missing rows and    |
|                        | a  1   2   .  |            |            | columns.                 |
|                        | b  3   4   5  |            |            |                          |
|                        | c  .   6   7  |            |            |                          |
+------------------------+---------------+------------+------------+--------------------------+
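
The table can be reproduced with the same l and r frames. The sketch below, which assumes pandas is installed, runs a representative call from each row so the results can be inspected interactively.

```python
import pandas as pd

l = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["x", "y"])
r = pd.DataFrame([[4, 5], [6, 7]], index=["b", "c"], columns=["y", "z"])

inner = l.merge(r, on="y")                       # only y == 4 exists in both frames
outer = l.merge(r, on="y", how="outer")          # y values 2, 4 and 6
joined = l.join(r, lsuffix="l", rsuffix="r")     # left join on row keys
rows = pd.concat([l, r], axis=0)                 # four rows, union of the columns
cols = pd.concat([l, r], axis=1, join="inner")   # only row key 'b' is shared
```

Note that merge() matches on column values and discards the original row keys, while join() and concat() align on the row keys themselves.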

GroupBy

The groupby() method splits a dataset into groups based on a given criterion, so that each group can be investigated separately.

>>> df = DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 6]], index=list('abc'), columns=list('xyz'))
>>> df.groupby('z').get_group(3)
   x  y  z
a  1  2  3
>>> df.groupby('z').get_group(6)
   x  y  z
b  4  5  6
c  7  8  6

gb = df.groupby(column_key/s)             # df is split into groups based on passed column.
df = gb.get_group(group_key/s)            # Selects a group by value of grouping column.

Aggregate, Transform, Map:

df = gb.sum/max/mean/idxmax/all()         # Or: gb.apply/agg(agg_func)
df = gb.rank/diff/cumsum/ffill()          # Or: gb.transform(trans_func)
df = gb.fillna(el)                        # Or: gb.transform(map_func)

>>> gb = df.groupby('z')
      x  y  z
3: a  1  2  3
6: b  4  5  6
   c  7  8  6

+-----------------+-------------+-------------+-------------+---------------+
|                 |    'sum'    |    'rank'   |   ['rank']  | {'x': 'rank'} |
+-----------------+-------------+-------------+-------------+---------------+
| gb.agg(…)       |      x   y  |      x  y   |      x    y |        x      |
|                 |  z          |   a  1  1   |   rank rank |     a  1      |
|                 |  3   1   2  |   b  1  1   | a    1    1 |     b  1      |
|                 |  6  11  13  |   c  2  2   | b    1    1 |     c  2      |
|                 |             |             | c    2    2 |               |
+-----------------+-------------+-------------+-------------+---------------+
| gb.transform(…) |      x   y  |      x  y   |             |               |
|                 |  a   1   2  |   a  1  1   |             |               |
|                 |  b  11  13  |   b  1  1   |             |               |
|                 |  c  11  13  |   c  1  1   |             |               |
+-----------------+-------------+-------------+-------------+---------------+
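
The difference between aggregating (one row per group) and transforming (one row per original row) can be sketched with the same df. This assumes pandas is installed. The share calculation is an illustrative addition, not part of the table above.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 6]],
                  index=list("abc"), columns=list("xyz"))
gb = df.groupby("z")

totals = gb.sum()                      # one row per group: z=3 and z=6
group_sum = gb.transform("sum")        # keeps rows a, b, c; each holds its group total
share = df[["x", "y"]] / group_sum     # each value as a fraction of its group total
x_stats = gb["x"].agg(["min", "max"])  # several aggregates for a single column
```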