Data Science
Raw
n_array.dtype.name
n_array.ravel() # returns a flattened view where possible
- flatten() # always returns a flattened copy
pd.Series(np.random.rand(5))
d['some'].isnull().value_counts()
then dropna()
df2.fillna(df2.mean())
!!! String Regex in Pandas !!!
df['text'][0:5].str.extract(r'(\w+)\s(\w+)')
(other ops via the .str accessor: .upper(), .len(), .replace('DISTRICT$', 'DIST'), ...)
d['NO. OBESE'].groupby(d['GRADE LEVEL']).aggregate([np.sum, np.mean, np.std])
Inference
>>> from scipy.stats import binom
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(1, 1)
>>> x = [0, 1, 2, 3, 4, 5, 6]
>>> n, p = 6, 0.5
>>> rv = binom(n, p)
>>> ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1,
label='Probability')
>>> ax.legend(loc='best', frameon=False)
>>> plt.show()
>>> from scipy.stats import poisson
>>> rv = poisson(20)
>>> rv.pmf(23)
0.066881473662401172
# With the Poisson function, we define the mean value, which is 20 cars. The rv.pmf function gives the probability, which is around 6%, that 23 cars will pass the bridge.
>>> from scipy import stats
>>> stats.bernoulli.rvs(0.7, size=100)
>>> classscore = np.random.normal(50, 10, 60).round()
>>> plt.hist(classscore, 30, normed=True) #Number of breaks is 30
>>> stats.zscore(classscore)
>>> prob = 1 - stats.norm.cdf(1.334)
>>> prob
0.091101928265359899
>>> stats.norm.ppf(0.80) # get the z-score at which the top 20% score marks
# The preceding output gives a z-score of about 0.84 for the top-20% cutoff; convert it back to a mark:
>>> (0.84 * classscore.std()) + classscore.mean()
55.942594176524267
# We multiply the z-score by the standard deviation and add the mean of the distribution, converting the z-score back to a value in the distribution. Students scoring above roughly 56 marks are therefore in the top 20% of the distribution.
# The z-score is an essential concept in statistics, which is widely used. Now you can understand that it is basically used in standardizing any distribution so that it can be compared or inferences can be derived from it.
# Let's understand this concept with an example where the null hypothesis is that it is common for students to score 68 marks in mathematics.
# Let's define the significance level at 5%. If the p-value is less than 5%, then the null hypothesis is rejected and it is not common to score 68 marks in mathematics.
# Let's get the z-score of 68 marks:
>>> zscore = ( 68 - classscore.mean() ) / classscore.std()
>>> zscore
2.283
>>> prob = 1 - stats.norm.cdf(zscore)
>>> prob
0.032835182628040638
# So, you can see that the p-value is at 3.2%, which is lower than the significance level. This means that the null hypothesis can be rejected, and it can be said that it's not common to get 68 marks in mathematics.
# Type-1 error == false positive (incorrectly rejecting a true null hypothesis)
# Type-2 error == false negative (failing to reject a false null hypothesis)
>>> stats.pearsonr(mpg,hp)
>>> stats.spearmanr(mpg,hp)
# We can clearly see that the Pearson correlation is drastically affected by the outliers, dropping from 0.89 to 0.47.
# The Spearman correlation did not get affected much as it is based on the order rather than the actual value in the data.
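A quick sketch of that claim on synthetic data (mpg/hp here are made-up stand-ins, not the original car columns):
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
hp = np.linspace(50, 250, 30)
mpg = 40 - 0.1 * hp + rng.normal(0, 1, 30)
print(stats.pearsonr(mpg, hp)[0], stats.spearmanr(mpg, hp)[0])  # both strongly negative

mpg_out = mpg.copy()
mpg_out[-1] = 120  # a single extreme outlier
print(stats.pearsonr(mpg_out, hp)[0], stats.spearmanr(mpg_out, hp)[0])
# Pearson weakens sharply; Spearman, based on ranks, changes far less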
>>> stats.ttest_ind(class1_score,class2_score)
Chi^2
$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$
If we repeatedly sample and compute the chi-square statistic, its PDF is:
$Y = Y_0 \, (\chi^2)^{(v/2)-1} \, e^{-\chi^2/2}$
where $Y_0$ is a constant depending on the number of degrees of freedom, $\chi^2$ is the chi-square statistic, and $v = n - 1$ is the number of degrees of freedom.
- goodness of fit
# e.g. dice rolling 36 times each expected frequency == 6
expected = np.array([6,6,6,6,6,6])
# observed frequency as
observed = np.array([7,5,3,9,6,6])
# H0: observed value ~= expected value
stats.chisquare(observed, expected) # (3.333333333333, 0.64874323423)
# first value is the test statistic, second the p-value; unable to reject H0
- test of independence (pair of categorical variables independence)
# e.g. male-female frequency on 3 categorical book genres
men_women = np.array([
[100, 120, 60],
[350, 200, 90],
])
stats.chi2_contingency(men_women) # (28.23912391, 6.9912312323e-07, 2, array...)
# reject H0: there is an association between gender and genre read; third value is the degrees of freedom, fourth the expected frequencies
ANOVA (F-test)
Test differences between means.
$H_o: \mu_1 = \mu_2 = … \mu_k$
stats.f_oneway(class1, class2, class3)
# returns (F-statistic, p-value)
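A minimal sketch with simulated class scores (class1/class2/class3 here are synthetic, not from the notes above):
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
class1 = rng.normal(50, 10, 60)
class2 = rng.normal(52, 10, 60)
class3 = rng.normal(60, 10, 60)

f_stat, p_value = stats.f_oneway(class1, class2, class3)
# a small p-value -> reject H0 that all class means are equal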
Fluent Python
DATA MODEL
import collections

Card = collections.namedtuple('Card', ['rank', 'suit'])

class FrenchDeck:
    ranks = [str(n) for n in range(2, 11)] + list('JQKA')
    suits = 'spades diamonds clubs hearts'.split()

    def __init__(self):
        self._cards = [Card(rank, suit) for suit in self.suits
                                        for rank in self.ranks]

    def __len__(self):
        return len(self._cards)

    def __getitem__(self, position):
        return self._cards[position]

beer_card = Card('7', 'diamonds')  # Card(rank='7', suit='diamonds')
- The point is: FrenchDeck responds to len() and to [] indexing because of the dunder methods.
- Pros of using the special dunders: (1) users of the class get the standard Pythonic interface rather than an ad-hoc .method(), and (2) it is easier to benefit from the rich standard library, e.g. random.choice(deck) to pick a random card.
- __getitem__ delegates [] to self._cards, so the deck auto-slices according to its list nature and is also iterable: for card in deck: print(card)
- If a collection has no __contains__, the in operator falls back to a sequential scan (which works here thanks to __getitem__).
- With just these two dunders the class also gets sorted() for free:
suit_values = dict(spades=3, hearts=2, diamonds=1, clubs=0)

def spades_high(card):
    rank_value = FrenchDeck.ranks.index(card.rank)
    return rank_value * len(suit_values) + suit_values[card.suit]

for card in sorted(deck, key=spades_high):  # doctest: +ELLIPSIS
    print(card)  # sorting works thanks to the dunder delegation
- Dunders (special methods) are meant to be called by the interpreter, not by the coder: write len(my_object), not my_object.__len__(); for an instance of a user-defined class, Python then calls the __len__ method you coded.
- For built-in types (list etc.) the interpreter takes a shortcut: CPython reads the length field of the PyVarObject C struct in memory, which is faster than calling a method.
- Most special-method calls are implicit: for i in x invokes iter(x), which in turn may call x.__iter__() if it exists.
- Unless you are metaprogramming, you mostly use the built-in calls len, iter, str, etc. rather than invoking the dunders directly.
Example: v = Vector(3, 4); abs(v) returns 5.0, the magnitude, for consistency with the built-ins:
from math import hypot

class Vector:
    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y

    def __repr__(self):
        return 'Vector...'

    def __abs__(self):
        return hypot(self.x, self.y)

    def __bool__(self):
        return bool(abs(self))

    def __add__(self, other):
        x = self.x + other.x
        y = self.y + other.y
        return Vector(x, y)

    def __mul__(self, scalar):
        return Vector(self.x * scalar, self.y * scalar)
TYPES
Container sequences: list, tuple, collections.deque hold items of varied types; flat sequences str, bytes, bytearray, memoryview, array.array hold a single primitive type (chars, bytes, numbers).
Mutability: list, bytearray, array.array, collections.deque, memoryview are mutable; tuple, str, bytes are immutable.
Generator expressions are vital for building non-list sequences without materialising an intermediate list in memory:
array.array('I', (ord(symbol) for symbol in symbols))  # array built straight from a genexp
for i in ((c, s) for c in colors for s in sizes): ...  # cartesian product, one tuple at a time
Tuples work as nameless records: the position of each item gives it meaning.
>>> lax_coordinates = (33.9425, -118.408056)
>>> city, year, pop, chg, area = ('Tokyo', 2003, 32450, 0.66, 8014)
>>> traveler_ids = [('USA', '31195855'), ('BRA', 'CE342567'),
...                 ('ESP', 'XDA205856')]
>>> for passport in sorted(traveler_ids):
...     print('%s/%s' % passport)
...
BRA/CE342567
ESP/XDA205856
USA/31195855
>>> for country, _ in traveler_ids:
...     print(country)
...
USA
BRA
ESP
# namedTuple
>>> from collections import namedtuple
>>> City = namedtuple('City', 'name country population coordinates')
>>> tokyo = City('Tokyo', 'JP', 36.933, (35.689722, 139.691667))
>>> tokyo
City(name='Tokyo', country='JP', population=36.933, coordinates=(35.689722, 139.691667))
>>> tokyo.population
36.933
>>> tokyo.coordinates
(35.689722, 139.691667)
>>> City._fields
('name', 'country', 'population', 'coordinates')
>>> LatLong = namedtuple('LatLong', 'lat long')
>>> delhi_data = ('Delhi NCR', 'IN', 21.935, LatLong(28.613889, 77.208889))
>>> delhi = City._make(delhi_data)
>>> delhi._asdict()
OrderedDict([('name', 'Delhi NCR'), ('country', 'IN'), ('population',
21.935), ('coordinates', LatLong(lat=28.613889, long=77.208889))])
>>> for key, value in delhi._asdict().items():
...     print(key + ':', value)
_fields is a tuple with the field names of the class; _make() inits a named tuple from an iterable (City(*delhi_data) would do the same); _asdict() returns a collections.OrderedDict built from the named tuple instance.
bisect and insort
import bisect
import sys
HAYSTACK = [1, 4, 5, 6, 8, 12, 15, 20, 21, 23, 23, 26, 29, 30]
NEEDLES = [0, 1, 2, 5, 8, 10, 22, 23, 29, 30, 31]
ROW_FMT = '{0:2d} @ {1:2d}    {2}{0:<2d}'
def demo(bisect_fn):
    for needle in reversed(NEEDLES):
        position = bisect_fn(HAYSTACK, needle)
        offset = position * '  |'
        print(ROW_FMT.format(needle, position, offset))

if __name__ == '__main__':
    if sys.argv[-1] == 'left':
        bisect_fn = bisect.bisect_left
    else:
        bisect_fn = bisect.bisect
    print('DEMO:', bisect_fn.__name__)
    print('haystack ->', ' '.join('%2d' % n for n in HAYSTACK))
    demo(bisect_fn)
# An interesting application of bisect is to perform table lookups by numeric values, for example to convert test scores to letter grades, as in Example 2-18. Example 2-18. Given a test score, grade returns the corresponding letter grade.
>>> def grade(score, breakpoints=[60, 70, 80, 90], grades='FDCBA'):
...     i = bisect.bisect(breakpoints, score)
...     return grades[i]
...
>>> [grade(score) for score in [33, 99, 77, 70, 89, 90, 100]]
['F', 'A', 'C', 'C', 'B', 'A', 'A']
# keeping a list sorted with insort
import bisect
import random

SIZE = 7  # not defined in the original note; any small value works
random.seed(1729)
my_list = []
for i in range(SIZE):
    new_item = random.randrange(SIZE*2)
    bisect.insort(my_list, new_item)
    print('%2d ->' % new_item, my_list)
memoryview - generalised NumPy without the maths
>>> numbers = array.array('h', [-2, -1, 0, 1, 2])
>>> memv = memoryview(numbers)
>>> len(memv)
5
>>> memv[0]
-2
>>> memv_oct = memv.cast('B')
>>> memv_oct.tolist()
[254, 255, 255, 255, 0, 0, 1, 0, 2, 0]
>>> memv_oct[5] = 4
>>> numbers
array('h', [-2, -1, 1024, 1, 2])
NumPy detour
>>> import numpy
>>> floats = numpy.loadtxt('floats-10M-lines.txt')
>>> floats[-3:]
array([ 3016362.69195522,
535281.10514262, 4566560.44373946])
>>> floats *= .5
>>> floats[-3:]
array([ 1508181.34597761,
267640.55257131, 2283280.22186973])
>>> from time import perf_counter as pc
>>> t0 = pc(); floats /= 3; pc() - t0
0.03690556302899495
>>> numpy.save('floats-10M', floats)
>>> floats2 = numpy.load('floats-10M.npy', 'r+')
>>> floats2 *= 6
>>> floats2[-3:]
memmap([ 3016362.69195522,
535281.10514262, 4566560.44373946])
deque
>>> from collections import deque
>>> dq = deque(range(10), maxlen=10)
>>> dq
deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], maxlen=10)
>>> dq.rotate(3)
>>> dq
deque([7, 8, 9, 0, 1, 2, 3, 4, 5, 6], maxlen=10)
>>> dq.rotate(-4)
>>> dq
deque([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], maxlen=10)
>>> dq.appendleft(-1)
>>> dq
deque([-1, 1, 2, 3, 4, 5, 6, 7, 8, 9], maxlen=10)
>>> dq.extend([11, 22, 33])
>>> dq
deque([3, 4, 5, 6, 7, 8, 9, 11, 22, 33], maxlen=10)
>>> dq.extendleft([10, 20, 30, 40])
>>> dq
deque([40, 30, 20, 10, 3, 4, 5, 6, 7, 8], maxlen=10)
Besides deque, other Python Standard Library packages implement queues:
- queue: provides the synchronized (i.e. thread-safe) classes Queue, LifoQueue and PriorityQueue, used for safe communication between threads. All three can be bounded by passing a maxsize argument greater than 0 to the constructor; however, they don't discard items to make room as deque does - when the queue is full, inserting a new item blocks, i.e. it waits until some other thread makes room by taking an item out, which is useful to throttle the number of live threads.
- multiprocessing: implements its own bounded Queue, very similar to queue.Queue but designed for inter-process communication; there is also a specialized multiprocessing.JoinableQueue for easier task management.
- asyncio: new in Python 3.4, provides Queue, LifoQueue, PriorityQueue and JoinableQueue with APIs inspired by the classes in queue and multiprocessing, but adapted for managing tasks in asynchronous programming.
- heapq: in contrast to the previous three modules, heapq does not implement a queue class, but provides functions like heappush and heappop that let you use a mutable sequence as a heap queue or priority queue.
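A minimal heapq sketch of the priority-queue usage described above (the (priority, item) tuples are a common convention, not from the note):
import heapq

tasks = []                                  # a plain list used as a heap
heapq.heappush(tasks, (2, 'write report'))
heapq.heappush(tasks, (1, 'fix bug'))
heapq.heappush(tasks, (3, 'refactor'))
while tasks:
    priority, task = heapq.heappop(tasks)   # always pops the smallest priority first
    print(priority, task)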
Jake DS
- wildcard search: object.*keyword*?
- Memory usage profiler: pip install memory_profiler, then %load_ext memory_profiler; %memit works like %timeit but reports memory
- BUT line-wise profiling needs the code in a separate module: %%file module_name.py followed by the code, then import the function and run %mprun -f func func(args...)
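A rough sketch of that workflow in an IPython session (each magic runs in its own cell; sum_of_lists is a made-up example function):
%load_ext memory_profiler

%%file mprun_demo.py
def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total

from mprun_demo import sum_of_lists
%mprun -f sum_of_lists sum_of_lists(1000000)  # line-by-line memory report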
- A Python int is more than an int: x is a pointer to a C struct with 4 pieces:
  - ob_refcnt: reference count, silently handling memory allocation/deallocation
  - ob_type: encodes the type of the variable
  - ob_size: specifies the size of the data members
  - ob_digit: contains the actual integer value
- A Python integer is therefore a structure carrying plenty of metadata.
- A C integer is essentially a label for a position in memory whose bytes encode an integer value.
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
- By the same logic, a heterogeneous list stores, for each element, a full Python object carrying type info, reference count and other metadata (so the list is a pointer to many pointers), whereas an array is a single pointer to a contiguous block: start position plus size.
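A rough way to see the overhead (sketch, not a benchmark; exact numbers vary by platform):
import array
import sys

L = list(range(1000))      # list of pointers to boxed int objects
A = array.array('l', L)    # one contiguous buffer of C longs
print(sys.getsizeof(A))                                     # header + raw buffer
print(sys.getsizeof(L) + sum(sys.getsizeof(x) for x in L))  # pointer table + every int object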
NUMPY
# basic arrays
np.array()
np.arange()
np.linspace()
np.random.random((3, 3)) # default [0,1]
np.random.normal(0, 1, (3,3))
np.random.randint(0, 10, (3, 3)) # 3x3 in [0, 10)
- NumPy is the key to Python data manipulation
- a NumPy array is fixed-type, hence holds only a single type
- reshaping 1D to 2D: X[np.newaxis, :]
- for multi-dimensional concatenation prefer np.vstack / np.hstack for clarity; split with np.split(arr, list_of_split_indices) and np.hsplit / np.vsplit (see the sketch below)
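A minimal stacking/splitting sketch (toy arrays):
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
np.vstack([x, grid])                       # stack as rows -> shape (3, 3)
np.hstack([grid, grid])                    # stack as columns -> shape (2, 6)
a, b, c = np.split(np.arange(10), [3, 7])  # split at indices 3 and 7
left, right = np.hsplit(grid, [1])         # split columns at index 1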
VECTORISED operation UFUNC
- a ufunc is NumPy's interface to a statically typed, compiled routine applied to each element, pushing the loop down into the compiled layer
- vectorised operations in NumPy are effected via ufuncs, whose main purpose is to perform repeated operations on array values quickly; they are extremely flexible
- scipy.special contains many more specialised mathematical ufuncs beyond the ordinary ones:
from scipy import special
# Gamma functions (generalised factorials)
special.gamma()
special.gammaln()
special.beta()
# error functions (integral of Gaussian)
special.erf()
special.erfc()
special.erfinv()
- Advanced UFUNC features (see the sketch below)
  - specifying output: np.multiply(x, 10, out=y) (with y = np.empty(5)), and you can even write results to alternate elements: np.power(2, x, out=y[::2])
    - preferable to y[::2] = 2 ** x, which creates a temporary array holding the results and then copies the values into y; an important speed-up and memory saving for large arrays
  - binary ufuncs: reduce repeatedly applies the op until a single result remains, e.g. np.add.reduce(x) returns the sum of all elements; np.add.accumulate(x) keeps the intermediate results; shortcuts: np.sum, np.prod, np.cumsum, np.cumprod
  - outer product: np.multiply.outer(x, x) builds a multiplication table in one line
  - ufunc.at and ufunc.reduceat will be explored later with fancy indexing
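A minimal sketch of the ufunc features above:
import numpy as np

x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)      # write results directly into y
z = np.zeros(10)
np.power(2, x, out=z[::2])     # write into every other element, no temp array
np.add.reduce(x)               # 10, same as np.sum(x)
np.add.accumulate(x)           # [0, 1, 3, 6, 10], same as np.cumsum(x)
np.multiply.outer(np.arange(1, 4), np.arange(1, 4))  # 3x3 multiplication table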
AGGREGATION
- np.sum is faster than the built-in sum because it executes in compiled code; prefer the array methods (x.sum(), x.min(), ...) for clarity
- note the NaN-safe versions of most aggregates: np.nansum, np.nanpercentile, np.nanmedian, etc. (sketch below)
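A quick sketch of the NaN-safe aggregates:
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0])
np.sum(x)        # nan -- a single NaN poisons the ordinary aggregate
np.nansum(x)     # 7.0 -- the NaN-safe version ignores missing values
np.nanmedian(x)  # 2.0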
BROADCASTING (mix-dim binary ufuncs)
- scalar ops behave as if the scalar were stretched/duplicated into an array of matching size before the op; broadcasting shines because no such duplication actually happens
- mismatched arrays are BOTH (conceptually) stretched to a COMMON shape before the op
- RULES:
  - If the arrays differ in ndim, the shape of the one with fewer dims is padded with ones on its leading (left) side
  - If the shapes mismatch in any dimension, the array with size 1 in that dimension is stretched/broadcast to match the other shape
  - If the sizes disagree in any dimension and neither is 1, an error is raised
- Example:
M.shape = (2, 3); a.shape = (3,)   # a is a vector
# rule 1: a has fewer dims, so its shape is padded with ones on the left
#   M.shape -> (2, 3)   a.shape -> (1, 3)
# rule 2: axis 0 disagrees, so the size-1 dim is stretched to match
#   M.shape -> (2, 3)   a.shape -> (2, 3)

# example 2
a.shape = (3, 1); b.shape = (3,)
# rule 1: pad the shape of b with ones -> b.shape -> (1, 3)
# rule 2: upgrade each size-1 dim to match the other array
#   a.shape -> (3, 3)   b.shape -> (3, 3)

# example 3
M.shape = (3, 2); a.shape = (3,)
# rule 1: pad a with ones -> a.shape -> (1, 3)
# rule 2: stretch axis 0 of a to match M -> a.shape -> (3, 3)
# rule 3: the final shapes (3, 2) and (3, 3) still disagree and neither dim is 1, so an error is raised

# Potential confusion: a and M would be compatible if a's shape were padded
# with ones on the RIGHT -- but that is not how broadcasting works!
# Instead, reshape the right-hand side explicitly via np.newaxis:
a[:, np.newaxis].shape             # (3, 1), now compatible
np.logaddexp(M, a[:, np.newaxis])  # log(exp(M) + exp(a))
- Practice: normalise a matrix along an axis, and visualise a 2D function built by broadcasting:
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
plt.imshow(z, origin='lower', extent=[0, 5, 0, 5], cmap='viridis')
plt.colorbar()
BOOLEAN MASKING
rng = np.random.RandomState(42)
x = rng.randint(10, size=(3, 4))
- counting: np.count_nonzero(x < 6) or np.sum(x < 6), which is better because it takes an axis argument
- checking: np.any(x > 8) and np.all(x < 10)
- boolean ops & | ^ ~: NumPy overloads these, like the other operators, as element-wise ufuncs on arrays
- MASKING objects:
rainy = (inches > 0)
days = np.arange(365) # mask of all summer days
summer = (days > 172) & (days < 262)
np.median(inches[rainy])
np.median(inches[summer])
- and / or gauge the truth of an entire object, while & | operate bitwise within each object; for boolean arrays always use & | (the truth value of a whole array is ambiguous/ill-defined)
bin(42) # 0b101010
bin(59) # 0b111011
bin(42 & 59) # 0b101010 or bitwise 1 0 boolean
FANCY INDEXING
- the shape of the result reflects the shape of the INDEX arrays rather than the shape of the array being INDEXED
- e.g. with ind = np.array([[3, 7], [4, 5]]), x[ind] returns a 2x2 array of the indexed values of x
Usage: selecting random points
rand = np.random.RandomState(42)
mean = [0, 0]
cov = [[1, 2],
       [2, 5]]
X = rand.multivariate_normal(mean, cov, 100)
X.shape  # (100, 2)
# subsetting
indices = np.random.choice(X.shape[0], 20, replace=False)
selection = X[indices]
# plotting the circling effect over the selected points
plt.scatter(selection[:, 0], selection[:, 1],
            facecolor='none', s=200);
- np.add.at(x, i, 1) does unbuffered in-place addition at the given indices (reduceat() also exists); use it when indices repeat, rather than x[i] += 1, which applies the update only once per repeated index
- case: binning data
x = np.random.randn(100)
bins = np.linspace(-5, 5, 20)
counts = np.zeros_like(bins)
i = np.searchsorted(bins, x)   # find the right bin for each x
np.add.at(counts, i, 1)        # add 1 to each of these bins
# counts now reflects the number of points within each bin
SORTING
# selection sort: find the minimum and swap, until sorted
def selection_sort(x):
    for i in range(len(x)):
        swap = i + np.argmin(x[i:])
        (x[i], x[swap]) = (x[swap], x[i])
    return x
- by default np.sort uses quicksort, though mergesort and heapsort are also available
- Example: KNN
X = rand.rand(10, 2)
dist_sq = np.sum((X[:, np.newaxis, :]
- X[np.newaxis, :, :]) ** 2, axis=-1)
# broadcasting in detail
# for each pair of points, compute the difference in coordinates
# shape of diff = (10, 10, 2)
# argsort to sort along each row, leftmost give indices of NN
nearest = np.argsort(dist_sq, axis=1) # full sort
# partial k sort
K = 2
nearest_partition = np.argpartition(dist_sq, K+1, axis=1)
# plotting
plt.scatter(X[:, 0], X[:, 1], s=100)
# draw lines from each point to its 2NN
for i in range(X.shape[0]):
for j in nearest_partition[i, :K+1]:
plt.plot(*zip(X[j], X[i]), color='black')
- scaling KNN by tree-based and approx algo such as KD-Tree
STRUCTURED NUMPY
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'), 'formats': ('U10', 'i4', 'f8')})
('U10' = 10-character Unicode string, 'i4' = 4-byte integer i.e. int32, 'f8' = 8-byte float i.e. float64)
- then store heterogeneous data: data['name'] = ['Ocean', ...]; indexing a row returns a tuple-like record
- mask ops: data[data['age'] < 30]['name']
- shortcut for types: np.dtype([('name', 'S10'), ('age', 'i4'), ...]) or the bare type codes b, i4, u1, f8, c16, S5, U, V
- for writing a Python API against a legacy C library that manipulates structured data, the construct below is handy, since the numpy dtype maps directly onto the corresponding C struct layout
tp = np.dtype([
('id', 'i8'),
('mat', 'f8', (3, 3))
])
X = np.zeros(1, dtype=tp)
X[0] # (0, [[0.0, 0.0, 0.0], ...])
X['mat'][0] # the matrix form
- np.recarray enables dot-access: data_rec = data.view(np.recarray) and then data_rec.age works
  - but prefer NOT to, because of the extra attribute-lookup overhead
PANDAS
data.ix[:3, :'pop']  # ix is a hybrid of loc and iloc (deprecated in later pandas versions)
data.loc[data.density > 100, ['pop', 'density']]
- Convention! while indexing refers to columns, slicing refers to rows!!
- most numpy ufuncs work on pandas objects
- A.add(B, fill_value=0) takes an optional fill for missing entries, e.g. fill = A.stack().mean()
- pandas method equivalents of the Python operators: add, sub, mul, truediv, floordiv, mod, pow
MISSING
- Pandas handles missing data using the sentinel values None and NaN; None CANNOT be used in an arbitrary numpy/pandas array but ONLY in arrays with dtype 'object', i.e. arrays of Python objects
  - any array containing None becomes dtype=object, meaning the best common type inferred is "Python object"; while useful, any ops on such data run at the PYTHON level, with much more overhead than native types
- NaN is different: a special floating-point value (np.nan) recognised by all systems using the standard IEEE representation, so it can be pushed into compiled code; it is virus-like, turning the result of any op into NaN - no error raised, but useless results
  - one solution is the numpy np.nansum-style versions; note NaN is specifically a floating-point value, with NO int, string, etc. equivalent
# examples of handling
data.isnull()
data[data.notnull()]
# fine control over dropping:
# thresh sets the minimum number of non-null values for keeping a row
data.dropna(thresh=3)
MULTI-INDEXING
- many constructors
df.unstack()  # from a multi-index Series to a single-indexed DataFrame
df.stack()    # forming the multi-index again
# from dataframe index
index=[
['2010', '2010', '2011', '2011'],
['US', 'Canada', 'US', 'Canada']
]
# Explicit
pd.MultiIndex.from_arrays([
['a', 'a', 'b', 'b'],
[1, 2, 1, 2]
])
pd.MultiIndex.from_tuples([
('a', 1), ('a', 2), ...
])
pd.MultiIndex.from_product([
['a', 'b'], [1, 2]
])
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
# naming
df.index.names = ['state', 'year']
# column-wise
columns=pd.MultiIndex.from_product(..., names=[...])
# most accessing and slicing work intuitively
# get around slicing WITHIN index tuples
idx = pd.IndexSlice
df.loc[idx[:, 1], idx[:, 'HR']]
df.unstack(level=0)
df.unstack(level=1)
# retrieve original Series
df.unstack().stack()
# reset multi-index
df.reset_index(name='population')
# returning multi-index
df.set_index(['state', 'year'])
# aggregation (GroupBy effective)
df.mean(level='year', axis=1)
CONCAT
- options
  - join: 'outer' (default, union of columns) or 'inner' (intersection)
  - join_axes: specify the index/columns to keep, e.g. join_axes=[df1.columns]
  - ignore_index=True: discard the existing indices and build a fresh integer index
  - verify_integrity=True: raise an error if the result would contain duplicate index entries
  - keys: list of labels for the data sources, giving a hierarchically indexed result, e.g. pd.concat([x, y], keys=['x', 'y']) (sketch below)
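A minimal pd.concat sketch (toy frames, illustrative column names):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})
pd.concat([df1, df2])                     # outer join on columns, NaN where missing
pd.concat([df1, df2], join='inner')       # keep only the shared column B
pd.concat([df1, df2], ignore_index=True)  # fresh 0..n-1 index
pd.concat([df1, df2], keys=['x', 'y'])    # hierarchical index marking the source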
MERGE (relational algebra)
- One-to-one joins
  - similar to column-wise concatenation: pd.merge(df1, df2) detects the common column and joins on it as the key
  - merge generally discards the index, except in the special case of merging by index (left_index / right_index keywords)
- Many-to-one joins
  - duplicated entries in one key column; the resulting df preserves the duplicates as appropriate
- Many-to-many joins
  - if the key column contains duplicates in both inputs, the result duplicates the other columns to cover every key combination
- Specification of the merge key (sketch below)
  - on: a column name or list of names to use as the key
  - left_on / right_on: when the key columns have different names but common values (often followed by .drop('one of the names', axis=1) to remove the duplicate column)
  - left_index=True / right_index=True: merge on the index, equivalent to df1.join(df2)
- Specifying set arithmetic
  - how='inner' for the intersection of keys ('outer', 'left', 'right' also available)
- Overlapping column names
  - suffixes keyword for renaming
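A minimal pd.merge sketch (toy frames, illustrative names):
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'HR']})
df2 = pd.DataFrame({'name': ['Bob', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 90000]})
pd.merge(df1, df2, left_on='employee', right_on='name').drop('name', axis=1)

d1 = df1.set_index('employee')
d2 = df2.set_index('name')
pd.merge(d1, d2, left_index=True, right_index=True)  # same as d1.join(d2)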
# example
merged.isnull().any() # see which column has NaN
merged[merged['population'].isnull()].head() # scout them out
merged.loc[merged['state'].isnull(), 'state/region'].unique()
# see which 'state/region' also missing
final.query("year == 2010 & ages == 'total'")
GROUPBY: Split, Apply, Combine
- a GroupBy object can be operated on by virtually any numpy/pandas aggregation function
- column indexing: select a column from the GroupBy with its string name, e.g. groupby('method')['year']
- iteration over groups:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))
- planets.groupby('method')['year'].describe().unstack() produces a nice summary matrix
- aggregate() takes a string, a function, or a list/dict thereof: ['min', np.median, max] or {'col1': 'min', 'col2': 'max'}
- filter() drops whole groups based on group properties:
def filter_sd(x):
    return x['col2'].std() > 4

df.groupby('key').filter(filter_sd)
- transform() returns a transformed version of the full data to recombine, with the same shape as the input, e.g. per-group centring: .transform(lambda x: x - x.mean())
- apply() applies an arbitrary function to each group, with the criterion that it takes a DataFrame and outputs a pandas object or a scalar (see the sketch after this list)
- Split key
  - any of: a list/array aligned with the index, a dictionary or Series mapping index to group, or the index itself
  - even a Python function applied to the index values, e.g. str.lower
  - or a list of valid keys, e.g. [str.lower, mapping]
  - e.g.
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)
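A minimal transform()/apply() sketch (toy frame, illustrative names):
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'data1': [1.0, 2.0, 3.0, 4.0],
                   'data2': [10, 20, 30, 40]})

# transform: same shape as the input, values centred within each group
df.groupby('key')['data1'].transform(lambda x: x - x.mean())

# apply: arbitrary function on the group DataFrame, returning a DataFrame or scalar
def norm_by_data2(g):
    g = g.copy()
    g['data1'] /= g['data2'].sum()
    return g

df.groupby('key').apply(norm_by_data2)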
PANEL
- a pivot table is essentially a multi-dimensional GroupBy, presenting a 2D groupby more readably
# pure groupby
df.groupby(['sex', 'class'])['survived'].agg('mean').unstack()
# pivot table
df.pivot_table('survived', index='sex', columns='class')
- multi-level pivot table
  - bin age with age = pd.cut(df['age'], [0, 18, 80]), then df.pivot_table('survived', ['sex', age], 'class')
  - add fare info using automatic quantiles: fare = pd.qcut(df['fare'], 2) and pass [fare, 'class'] as the columns, giving a 4D aggregation with hierarchical indices
  - full args: fill_value and dropna for missing-data handling; aggfunc controls the type of aggregation (same options as agg()); margins=True adds totals
# example of birth in decades comparison
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
EXTRA EXPLORATION
- robust outlier cutting via sigma-clipping
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
# set 'day' to int from string
# then combine timestamps to create Date index
births.index = pd.to_datetime(10000 * births.year +
100 * births.month +
births.day, format='%Y%m%d')
births['dayofweek'] = births.index.dayofweek
births.pivot_table('births', index='dayofweek',
columns='decade', aggfunc='mean').plot()
# another view: by day of year
births_by_date = births.pivot_table('births',
                                    [births.index.month, births.index.day])
births_by_date.index = [pd.datetime(2012, month, day)
                        for (month, day) in births_by_date.index]
sig is a robust estimate of the sample standard deviation; the 0.74 comes from the interquartile range of a Gaussian distribution
VECTORISED STRING
- names.str.capitalize(), where names is a Series of strings
- list of string methods mirroring plain Python:
-
- regex methods: match() calls re.match() on each element, returning a boolean; also extract(), findall(), replace(), contains(), count()
  - example extracting the first name: monte.str.extract('([A-Za-z]+)', expand=False)
  - finding all names that start and end with a consonant: monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
- MISC ops: get() indexes into each element, slice(), cat concatenates strings, repeat, normalize handles unicode, pad adds whitespace, wrap splits long strings into lines shorter than a given width, join, and get_dummies
  - examples: .str.slice(0, 3) slices the first 3 chars; .str.split().str.get(-1) gets the last name; .str.get_dummies('|') when '|' separates the flags (sketch below)
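A minimal sketch of a few .str ops (toy Series, illustrative values):
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])
monte.str.slice(0, 3)          # first 3 characters, same as monte.str[0:3]
monte.str.split().str.get(-1)  # last names

info = pd.Series(['B|C|D', 'B|D', 'A|C'])
info.str.get_dummies('|')      # indicator columns A, B, C, D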
MESSY DATA
# recipe database to parse !curl -O URL then gunzip
# ValueError: Trailing data -> the file is line-delimited JSON, so read a single line:
with open('some.json') as f:
    line = f.readline()
pd.read_json(line).shape  # (2, 12)
# concatenate all lines into one JSON array string
with open('some.json') as f:
    data = (line.strip() for line in f)
    data_json = "[{0}]".format(','.join(data))
recipes = pd.read_json(data_json) # shape 123994324, 18
# textcol.str.len().describe() reveals summary
# how many breakfast recipes
recipes.description.str.contains('[Bb]reakfast').sum()
# suggest based on ingredient
import re
spice_df = pd.DataFrame(dict((spice,
                              recipes.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))
selection = spice_df.query('parsley & paprika & tarragon')
# finding them
recipes.name[selection.index] # wow
Unfortunately, the wide variety of formats in use makes this time-consuming, pointing to a truism in data science: cleaning and munging the data often comprises the majority of the work.
TIME SERIES
- native Python datetime and dateutil win on flexibility but lose on performance
# hand made date
from datetime import datetime
datetime(year=2015, month=7, day=4)
from dateutil import parser
date = parser.parse("4th of July, 2015")
# same object
# ops on them
date.strftime('%A') # standard string format codes
# COOL LIB pytz helps with time-zoning
-
Typed arrays of times: np.datetime64 encodes dates as 64-bit integers, so arrays of dates are represented compactly
date = np.array('2015-07-04', dtype=np.datetime64)
- now vector ops work: date + np.arange(12) gives an array of the following days
-
datetime64 / timedelta64 are built on a fundamental time unit; the encodable range is 2^64 times this unit, imposing a trade-off between time resolution and maximum time span
- e.g. nanosecond resolution covers only about 600 years; the unit is inferred from the input, np.datetime64('2015-07-04 12:00'), or given explicitly, np.datetime64('2015-07-04 12:59:59.50', 'ns')
- the codes run from Y (year, spanning about +/- 9.2e18 years) through M, W, D, h, m, s, ms, us, ns, ps, fs down to as (attosecond, spanning about +/- 9.2 seconds, i.e. only 1969-1970 AD)
- e.g. nanosecond only with 600 years, inferred by input
-
Pandas gives the best of both worlds
- date = pd.to_datetime('4th of July, 2015'), then date.strftime('%A') works, as do numpy-style vector ops: date + pd.to_timedelta(np.arange(12), 'D')
- it shines on timestamp indexing: index = pd.DatetimeIndex(['2015-07-04', ...]) supports all the usual pandas indexing ops and more, such as data['2015']
-
time stamps use the Timestamp type - essentially a replacement for the native datetime, but built on the more efficient numpy.datetime64; the associated index structure is DatetimeIndex
- time periods use the Period type - encoding a fixed-frequency interval, again based on numpy.datetime64; the index structure is PeriodIndex
- time deltas or durations use the Timedelta type - a speedier numpy-based replacement; the index structure is TimedeltaIndex
- it is most common to use to_datetime(), which parses a wide variety of formats, yielding a Timestamp for a single date and a DatetimeIndex for a series of dates:
pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015', '2015-Jul-6', '07-07-2015', '20150708'])
-
convertible to a PeriodIndex with dates.to_period('D') as the frequency
- a TimedeltaIndex is created by subtraction, e.g. dates - dates[0] gives durations
- regular sequences: pd.date_range() for timestamps, period_range() for periods, timedelta_range() for deltas; e.g. pd.date_range('2015-07-03', '2015-07-10') defaults to daily frequency, or specify periods=8 and freq='H'; the other two work similarly
settings and similar in the other two objects -
frequencies and offsets: D, W, B, M, BM, Q, BQ, A, BA, H, BH, T, S, L, U, N
- change the month used to mark a quarter/year with a suffix, e.g. Q-JAN, BQ-FEB; likewise W-SUN etc. for the weekday
- combine with a number for other frequencies, e.g. 2 hours 30 minutes: timedelta_range(0, periods=9, freq='2H30T')
- all codes are backed by objects in pandas.tseries.offsets, e.g. import BDay and use freq=BDay()
-
Resampling, shifting and windowing
resample() is at core data aggregation while asfreq() is data selection: goog.resample('BA').mean() vs goog.asfreq('BA'); at each point resample gives the average of the previous year while asfreq gives the value at the end of the year
- for up-sampling the two are similar, though resample has more options, e.g. data.asfreq('D', method='bfill')
- tshift() shifts the index rather than the data: goog.tshift(900), i.e. an offset of pd.Timedelta(900, 'D')
- common usage: computing differences over time, such as the one-year return on investment
ROI = 100 * (goog.tshift(-365) / goog - 1)
rolling() is similar to groupby and is empowered by mixing with apply and agg:
# one-year centred rolling mean and std
rolling = goog.rolling(365, center=True)
data = pd.DataFrame({'input': goog,
                     'one-year mean': rolling.mean(),
                     'one-year sd': rolling.std()})

# example: merging the two counts
data.columns = ['west', 'east']
data['total'] = data.eval('west + east')

# plotting on a coarser grid: weekly sums
weekly = data.resample('W').sum()
weekly.plot(style=[':', '--', '-'])

# or smooth with a 30-day centred rolling window
daily.rolling(30, center=True).sum().plot(...)

# smoothing by a Gaussian window (50 days wide, std 10)
daily.rolling(50, center=True, win_type='gaussian').sum(std=10).plot(...)

# digging in: average traffic as a function of time of day
by_time = data.groupby(data.index.time).mean()
hourly_ticks = 4 * 60 * 60 * np.arange(6)
by_time.plot(xticks=hourly_ticks, style=...)
# shows a strongly bimodal distribution peaking around 8am and 5pm

# as a function of day of week
by_weekday = data.groupby(data.index.dayofweek).mean()

# compound groupby on the hourly trend: weekday vs weekend
weekend = np.where(data.index.weekday < 5, 'Weekday', 'Weekend')
by_time = data.groupby([weekend, data.index.time]).mean()
HPC on PANDAS
- eval() and query() give C-speed ops without allocating intermediate arrays, based on Numexpr
  - compound expressions: in plain pandas every intermediate step is explicitly allocated in memory, which is costly for large arrays; Numexpr computes such compound expressions element by element without allocating the full intermediate arrays
# eval test
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))
%timeit df1 + df2 + df3 + df4
%timeit pd.eval('df1 + df2 + df3 + df4')
# 50% faster and much less memory
# supported ops
#arithmetic
('-df1 * df2 / (df3 + df4) - df5')
#boolean
('df1 < df2 <= df3 != df4')
# bitwise & | as well as the literal 'and' / 'or' inside the expression string
# attributes and indices
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2) # True
# further complex func NOT in eval but in Numexpr library !!!
# supports dataframe ops
# accessing directly the column names
# assignment possible
df.eval('D = (A + B) / C', inplace=True)
# local python variables
df.eval('A + @column_mean')
# query (DataFrame.eval cannot express a filter like df[(df.A < 0.5) & (df.B < 0.5)]; query can)
df.query('A < 0.5 and B < 0.5')
# faster masking
df.query('A < @Cmean and B < @Cmean')
- Performance: when to use? Check the data size with df.values.nbytes. eval() can be faster even when you are not maxing out system memory, depending on how the temporary DataFrames compare to the size of the L1/L2 CPU cache (a few MB); if they are much bigger, eval() avoids some potentially slow movement of values between the different memory caches - but in practice this matters less than the memory saving
MATPLOT
- script usage needs plt.show(), which starts an event loop, looks for all currently active figure objects, and opens one or more interactive windows to display them - doing lots of background work with the system's graphical backend; call it ONLY ONCE per Python session, usually at the end of the script
- verify a saved figure with from IPython.display import Image and a call to Image('figure.png')
- list the supported save formats with fig.canvas.get_supported_filetypes()
- Two APIs
  - MATLAB-style: first create a figure with plt.figure(), then a subplot with plt.subplot(2, 1, 1), plot, and move on to the next - a STATEFUL interface that tracks the "current" figure and axes, to which all plt commands apply; get references to these with the plt.gcf() and plt.gca() routines
  - object-oriented: create a grid of plots with fig, ax = plt.subplots(2), where ax is an array of Axes objects, then operate on the ax objects
- a figure (an instance of plt.Figure, via fig = plt.figure()) is the single container for all the axes, graphics, text and labels; an axes is the bounding box with ticks and labels that contains the plot elements - a figure is created implicitly if not set, and subsequent plot calls overlay onto the current figure by default
- color formats: 'blue', 'g', '0.75' (grayscale), hex strings, tuples of 3 RGB values, and all HTML color names
- mixing line style and color: '-g', '--c', '-.k', ':r'
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
plt.plot(rng.rand(5), rng.rand(5), marker,
label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);
plt.plot(x, y, '-p', color='gray',
markersize=15, linewidth=4,
markerfacecolor='white',
markeredgecolor='gray',
markeredgewidth=2)
# ERRORs
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
ecolor='lightgray', elinewidth=3, capsize=0);
# visual GP
from sklearn.gaussian_process import GaussianProcess
# define the model and draw some data
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)
# Compute the Gaussian process fit
gp = GaussianProcess(corr='cubic', theta0=1e-2, thetaL=1e-4, thetaU=1E-1,
random_start=100)
gp.fit(xdata[:, np.newaxis], ydata)
xfit = np.linspace(0, 10, 1000)
yfit, MSE = gp.predict(xfit[:, np.newaxis], eval_MSE=True)
dyfit = 2 * np.sqrt(MSE) # 2*sigma ~ 95% confidence region
# Visualize the result
plt.plot(xdata, ydata, 'or')
plt.plot(xfit, yfit, '-', color='gray')
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
color='gray', alpha=0.2)
plt.xlim(0, 10);
SKL
- a learning curve shows, for a given model, the effect that increasing N has on the training and validation scores
from sklearn.learning_curve import learning_curve
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
for i, degree in enumerate([2, 9]):
    N, train_lc, val_lc = learning_curve(PolynomialRegression(degree),
                                         X, y, cv=7,
                                         train_sizes=np.linspace(0.3, 1, 25))
    ax[i].plot(N, np.mean(train_lc, 1), color='blue', label='training score')
    ax[i].plot(N, np.mean(val_lc, 1), color='red', label='validation score')
    ax[i].hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1],
                 color='gray', linestyle='dashed')
    ax[i].set_ylim(0, 1)
    ax[i].set_xlim(N[0], N[-1])
    ax[i].set_xlabel('training size')
    ax[i].set_ylabel('score')
    ax[i].set_title('degree = {0}'.format(degree), size=14)
    ax[i].legend(loc='best')
A visual depiction of how the model responds to N: once the learning curve has converged, adding more data will not improve the fit; the only way to increase the converged score is to use a more complex model.
Naive Bayes
- group of extremely fast and simple classification algorithms suitable for high-dimensional dataset
- $P(L | features)$ (probability of label given some observed features)
- $P(L|features) = \frac{P(features|L)P(L)}{P(features)}$
- one way to make decision on binary label is ratio of posterior probabilities $\frac{P(L_1|features)}{P(L_2|features)} = \frac{P(features|L_1)P(L_1)}{P(features|L_2)P(L_2)}$
- a generative model is one in which the likelihood P(features | L) for each label is computed, because it specifies the hypothetical random process that generates the data; specifying this generative model for each label is the main piece of training such a Bayesian classifier
- the general version of this training step is hard, but NB simplifies it with the premise of conditional independence of the features given the label - the per-feature likelihoods multiply (or add in log form)
- various types of NB rest on different assumptions about data
- Gaussian NB - assuming data from each label drawn from simple Gaussian distribution sans covariance between dimensions/features, fit by simply finding mean and standard deviation of points within each label, or all params needed to define such distribution
- with resulting Gaussian generative model for each label, computing likelihood for any data and thus quickly computing posterior ratio and determine which label most probable for a given point
- natural probabilistic classification with probability of prediction conveying uncertainty of prediction
- Multinomial NB - features assumed to be generated from simple multinomial distribution describing counts among categories - hence suitable for features of counts or rates
- e.g. text classification in which features are word counts within documents using sparse word count features (need convert content of string into numerical vectors - TF-IDF)
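A minimal sketch of both NB variants in scikit-learn (toy data, illustrative values):
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.RandomState(0)

# Gaussian NB on continuous features
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
gnb = GaussianNB().fit(X, y)
gnb.predict_proba([[1.5, 1.5]])   # posterior probabilities convey the uncertainty

# Multinomial NB on count-like features (e.g. word counts)
counts = rng.randint(0, 5, (100, 6))
labels = rng.randint(0, 2, 100)
MultinomialNB().fit(counts, labels).predict(counts[:3])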
Linear Regression
- basic fit and prediction:
model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])
-
multidimensional linear models akin geometrically to fitting a plane to points in 3D or hyperplane to points in higher D
-
basis function higher D transformation
- polynomial, Gaussian (customised)
from sklearn.base import BaseEstimator, TransformerMixin

class GaussianFeatures(BaseEstimator, TransformerMixin):
    """Uniformly spaced Gaussian features for one-dimensional input"""

    def __init__(self, N, width_factor=2.0):
        self.N = N
        self.width_factor = width_factor

    @staticmethod
    def _gauss_basis(x, y, width, axis=None):
        arg = (x - y) / width
        return np.exp(-0.5 * np.sum(arg ** 2, axis))

    def fit(self, X, y=None):
        # create N centers spread along the data range
        self.centers_ = np.linspace(X.min(), X.max(), self.N)
        self.width_ = self.width_factor * (self.centers_[1] - self.centers_[0])
        return self

    def transform(self, X):
        return self._gauss_basis(X[:, :, np.newaxis], self.centers_,
                                 self.width_, axis=1)

gauss_model = make_pipeline(GaussianFeatures(20), LinearRegression())
gauss_model.fit(x[:, np.newaxis], y)
yfit = gauss_model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit)
plt.xlim(0, 10);
-
EXAMPLE: Predicting Traffic
# !curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD
import pandas as pd
counts = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
weather = pd.read_csv('data/BicycleWeather.csv', index_col='DATE', parse_dates=True)
daily = counts.resample('d').sum()
daily['Total'] = daily.sum(axis=1)
daily = daily[['Total']] # remove other columns
# account for binary col indicating day of week
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for i in range(7):
daily[days[i]] = (daily.index.dayofweek == i).astype(float)
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2016')
daily = daily.join(pd.Series(1, index=holidays, name='holiday'))
daily['holiday'].fillna(0, inplace=True)
def hours_of_daylight(date, axis=23.44, latitude=47.61):
"""Compute the hours of daylight for the given date"""
days = (date - pd.datetime(2000, 12, 21)).days
m = (1. - np.tan(np.radians(latitude))
* np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))
return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.
daily['daylight_hrs'] = list(map(hours_of_daylight, daily.index))
daily[['daylight_hrs']].plot()
plt.ylim(8, 17)
# temperatures are in 1/10 deg C; convert to C
weather['TMIN'] /= 10
weather['TMAX'] /= 10
weather['Temp (C)'] = 0.5 * (weather['TMIN'] + weather['TMAX'])
# precip is in 1/10 mm; convert to inches
weather['PRCP'] /= 254
weather['dry day'] = (weather['PRCP'] == 0).astype(int)
daily = daily.join(weather[['PRCP', 'Temp (C)', 'dry day']])
daily['annual'] = (daily.index - daily.index[0]).days / 365.
# fit without an intercept, because the daily flags essentially operate as their own day-specific intercepts:
# Drop any rows with null values
daily.dropna(axis=0, how='any', inplace=True)
column_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday',
'daylight_hrs', 'PRCP', 'dry day', 'Temp (C)', 'annual']
X = daily[column_names]
y = daily['Total']
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
daily['predicted'] = model.predict(X)
daily[['Total', 'predicted']].plot(alpha=0.5);
# evidently missing some key features, especially during summer
# either the features are incomplete or some nonlinearity is unaccounted for
# check the coefficients of the linear model to estimate how much each feature contributes to the daily bicycle count:
params = pd.Series(model.coef_, index=X.columns)
params
# these numbers are hard to interpret without uncertainties - compute them using bootstrap resampling
from sklearn.utils import resample
np.random.seed(1)
err = np.std([model.fit(*resample(X, y)).coef_
for i in range(1000)], 0)
# check again with errors estimated
print(pd.DataFrame({'effect': params.round(0),
'error': err.round(0)}))
- the model misses nonlinearity in many variables, discards finer-grained information such as the difference between a rainy morning and a rainy afternoon, and ignores correlations between days - all potential effects to explore
SVM
-
discriminative classification via line or curve (2D) or manifold (high dimensional) dividing classes
-
the best cutter maximises the margin, and only the points on the margin (the support vectors) matter
-
kernels same idea as basis function in LR for transforming data into higher dimension space - radial basis function
r = np.exp(-(X ** 2).sum(1))
-
finding perfect kernel is a problem - strategy could be compute basis function centered at every point and let SVM sift through results - known as kernel transformation as it is based on similarity relationship (or kernel) between each pair of points
-
potential problem - projecting N points into N dimensions might be costly as N grows - HOWEVER due to neat little procedure known as kernel trick a fit on kernel-transformed data can be done implicitly - without ever building full N-D representation of kernel projection!
-
the soft margin is controlled by C: the smaller C, the softer the margin, allowing it to encompass more overlapping points (sketch below)
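A minimal SVC sketch (synthetic blobs, illustrative parameters):
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=1.2)
clf = SVC(kernel='rbf', C=1.0, gamma='auto')  # rbf kernel; smaller C = softer margin
clf.fit(X, y)
clf.support_vectors_                          # only these points define the boundary
clf.predict(X[:5])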