DODS

Data Science

Raw

n_array.dtype.name

n_array.ravel() flattens the array

pd.Series(np.random.rand(5))

d['some'].isnull().value_counts() then dropna()

df2.fillna(df2.mean())

!!! String Regex in Pandas !!!

df['text'][0:5].str.extract(r'(\w+)\s(\w+)') (other ops with .str. like .upper(), .len(), .replace('DISTRICT$', 'DIST'))

df['NO. OBESE'].groupby(df['GRADE LEVEL']).aggregate([np.sum, np.mean, np.std])

Inference

>>> from scipy.stats import binom
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(1, 1)
>>> x = [0, 1, 2, 3, 4, 5, 6]
>>> n, p = 6, 0.5
>>> rv = binom(n, p)
>>> ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1,
...           label='Probability')
>>> ax.legend(loc='best', frameon=False)
>>> plt.show()


>>> from scipy.stats import poisson
>>> rv = poisson(20)
>>> rv.pmf(23)
0.066881473662401172
# With the Poisson function, we define the mean value, which is 20 cars. The rv.pmf function gives the probability, which is around 6%, that 23 cars will pass the bridge.

>>> from scipy import stats
>>> stats.bernoulli.rvs(0.7, size=100)

>>> classscore = np.random.normal(50, 10, 60).round()
>>> plt.hist(classscore, 30, normed=True) # 30 bins; newer Matplotlib uses density=True
>>> stats.zscore(classscore)
>>> prob = 1 - stats.norm.cdf(1.334)
>>> prob
0.091101928265359899
>>> stats.norm.ppf(0.80) # get the z-score at which the top 20% score marks

# The z-score at which the top 20% of marks begins is 0.84; converting that z-score back into a mark:
>>> (0.84 * classscore.std()) + classscore.mean()
55.942594176524267

# We multiply the z-score by the standard deviation and add the mean of the distribution, which converts the z-score back into a value in the distribution. A score of roughly 55.9 marks means that students scoring above it are in the top 20% of the distribution.
# The z-score is an essential, widely used concept in statistics: it standardizes any distribution so that distributions can be compared and inferences drawn from them.


# Let's understand this concept with an example where the null hypothesis is that it is common for students to score 68 marks in mathematics.
# Let's define the significance level at 5%. If the p-value is less than 5%, then the null hypothesis is rejected and it is not common to score 68 marks in mathematics.
# Let's get the z-score of 68 marks:
>>> zscore = ( 68 - classscore.mean() ) / classscore.std()
>>> zscore
  2.283

>>> prob = 1 - stats.norm.cdf(zscore)
>>> prob
0.032835182628040638
# So, you can see that the p-value is at 3.2%, which is lower than the significance level. This means that the null hypothesis can be rejected, and it can be said that it's not common to get 68 marks in mathematics.

# Type-1 Error == False Positive (falsely rejecting Null Hypothesis)
# Type-2 Error == False Negative (falsely accepting NULL)

>>> stats.pearsonr(mpg,hp)
>>> stats.spearmanr(mpg,hp)
# We can clearly see that the Pearson correlation is drastically affected by the outliers, dropping from 0.89 to 0.47.
# The Spearman correlation is barely affected, since it is based on rank order rather than the actual values in the data.

>>> stats.ttest_ind(class1_score,class2_score)

Chi^2

$\chi^2 = \frac{(n-1)\,s^2}{\sigma^2}$

Repeatedly sampling and computing this chi-square statistic yields the probability density:

$Y = Y_0 \cdot (\chi^2)^{v/2 - 1} \cdot e^{-\chi^2/2}$

where $Y_0$ is a constant depending on the number of degrees of freedom, $\chi^2$ is the chi-square statistic, and $v = n - 1$ is the number of degrees of freedom.
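
A quick sketch (my own illustration, not from the source notes) of how this density looks for a few degrees of freedom, using scipy.stats.chi2 rather than the closed form above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

x = np.linspace(0, 20, 200)
for v in (2, 5, 10):                       # v = n - 1 degrees of freedom
    plt.plot(x, chi2.pdf(x, df=v), label='v = {}'.format(v))
plt.legend()
plt.show()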

# e.g. rolling a die 36 times: each face has expected frequency 6
expected = np.array([6, 6, 6, 6, 6, 6])
# observed frequencies
observed = np.array([7, 5, 3, 9, 6, 6])
# H0: observed frequencies match the expected frequencies
stats.chisquare(observed, expected) # (3.333333333333, 0.64874323423)
# first value is the chi-square statistic, second the p-value: unable to reject H0
# e.g. male/female frequency across 3 categorical book genres
men_women = np.array([
  [100, 120, 60],
  [350, 200, 90],
])
stats.chi2_contingency(men_women) # (28.23912391, 6.9912312323e-07, 2, array(...))
# the p-value rejects H0, showing an association between gender and genre read; the third value is the DF, the fourth the expected frequencies

ANOVA (F-test)

Test differences between means.

$H_0: \mu_1 = \mu_2 = \dots = \mu_k$

stats.f_oneway(class1, class2, class3) -> (F-statistic, p-value)
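
A minimal runnable sketch with made-up class scores (class1/class2/class3 are assumptions for illustration): a p-value below the significance level rejects H0 that all group means are equal.

import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
class1 = rng.normal(50, 10, 30)   # three invented groups of scores
class2 = rng.normal(52, 10, 30)
class3 = rng.normal(60, 10, 30)

f_stat, p_value = stats.f_oneway(class1, class2, class3)
print(f_stat, p_value)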

Fluent Python

DATA MODEL

import collections

Card = collections.namedtuple('Card', ['rank', 'suit'])

class FrenchDeck:
    ranks = [str(n) for n in range(2, 11)] + list('JQKA')
    suits = 'spades diamonds clubs hearts'.split()

    def __init__(self):
        self._cards = [Card(rank, suit) for suit in self.suits
                                        for rank in self.ranks]

    def __len__(self):
        return len(self._cards)

    def __getitem__(self, position):
        return self._cards[position]

beer_card = Card('7', 'diamonds')  # Card(rank='7', suit='diamonds')

TYPES

Container sequences

Mutability

Generator expressions are the way to build non-list sequences without materialising an intermediate list, as in the sketch below.
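
A minimal sketch of that point, in the spirit of Fluent Python: a genexp feeds a constructor item by item, so no throwaway list is ever built.

import array

symbols = '$¢£¥€¤'
codes = array.array('I', (ord(s) for s in symbols))  # genexp, no intermediate list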

Tuples work as nameless records: each item holds a field whose meaning is given by its position.

>>> lax_coordinates = (33.9425, -118.408056)
>>> city, year, pop, chg, area = ('Tokyo', 2003, 32450, 0.66, 8014)
>>> traveler_ids = [('USA', '31195855'), ('BRA', 'CE342567'),
...                 ('ESP', 'XDA205856')]
>>> for passport in sorted(traveler_ids):
...     print('%s/%s' % passport)
...
BRA/CE342567
ESP/XDA205856
USA/31195855
>>> for country, _ in traveler_ids:
...     print(country)
...
USA
BRA
ESP

# namedTuple
>>> from collections import namedtuple
>>> City = namedtuple('City', 'name country population coordinates')
>>> tokyo = City('Tokyo', 'JP', 36.933, (35.689722, 139.691667))
>>> tokyo
City(name='Tokyo', country='JP', population=36.933, coordinates=(35.689722, 139.691667))
>>> tokyo.population
36.933
>>> tokyo.coordinates
(35.689722, 139.691667)

>>> City._fields
('name', 'country', 'population', 'coordinates')
>>> LatLong = namedtuple('LatLong', 'lat long')
>>> delhi_data = ('Delhi NCR', 'IN', 21.935, LatLong(28.613889, 77.208889))
>>> delhi = City._make(delhi_data)
>>> delhi._asdict()
OrderedDict([('name', 'Delhi NCR'), ('country', 'IN'), ('population',
21.935), ('coordinates', LatLong(lat=28.613889, long=77.208889))])
>>> for key, value in delhi._asdict().items():
...     print(key + ':', value)
  1. _fields is a tuple with the field names of the class
  2. _make() builds a named tuple from an iterable; City(*delhi_data) would do the same
  3. _asdict() returns a collections.OrderedDict built from the named tuple instance

bisect and insort

import bisect
import sys

HAYSTACK = [1, 4, 5, 6, 8, 12, 15, 20, 21, 23, 23, 26, 29, 30]
NEEDLES = [0, 1, 2, 5, 8, 10, 22, 23, 29, 30, 31]

ROW_FMT = '{0:2d} @ {1:2d}    {2}{0:<2d}'

def demo(bisect_fn):
    for needle in reversed(NEEDLES):
        position = bisect_fn(HAYSTACK, needle)
        offset = position * ' |'
        print(ROW_FMT.format(needle, position, offset))

if __name__ == '__main__':
    if sys.argv[-1] == 'left':
        bisect_fn = bisect.bisect_left
    else:
        bisect_fn = bisect.bisect

    print('DEMO:', bisect_fn.__name__)
    print('haystack ->', ' '.join('%2d' % n for n in HAYSTACK))
    demo(bisect_fn)


# An interesting application of bisect is to perform table lookups by numeric values,
# for example converting test scores to letter grades (Fluent Python, Example 2-18):
>>> def grade(score, breakpoints=[60, 70, 80, 90], grades='FDCBA'):
...     i = bisect.bisect(breakpoints, score)
...     return grades[i]
...
>>> [grade(score) for score in [33, 99, 77, 70, 89, 90, 100]]
['F', 'A', 'C', 'C', 'B', 'A', 'A']

# keeping a list sorted with insort
import bisect
import random

SIZE = 7  # not defined in the original note; number of items to insert
random.seed(1729)

my_list = []
for i in range(SIZE):
    new_item = random.randrange(SIZE * 2)
    bisect.insort(my_list, new_item)
    print('%2d ->' % new_item, my_list)

memoryview - essentially a generalised NumPy array structure in Python itself (without the maths): shared memory between data structures without copying

>>> import array
>>> numbers = array.array('h', [-2, -1, 0, 1, 2])
>>> memv = memoryview(numbers)
>>> len(memv)
5
>>> memv[0]
-2
>>> memv_oct = memv.cast('B')
>>> memv_oct.tolist()
[254, 255, 255, 255, 0, 0, 1, 0, 2, 0]
>>> memv_oct[5] = 4
>>> numbers
array('h', [-2, -1, 1024, 1, 2])

NumPy detour

>>> import numpy
>>> floats = numpy.loadtxt('floats-10M-lines.txt')
>>> floats[-3:]
array([ 3016362.69195522,   535281.10514262,  4566560.44373946])
>>> floats *= .5
>>> floats[-3:]
array([ 1508181.34597761,   267640.55257131,  2283280.22186973])
>>> from time import perf_counter as pc
>>> t0 = pc(); floats /= 3; pc() - t0
0.03690556302899495
>>> numpy.save('floats-10M', floats)
>>> floats2 = numpy.load('floats-10M.npy', 'r+')
>>> floats2 *= 6
>>> floats2[-3:]
memmap([ 3016362.69195522,   535281.10514262,  4566560.44373946])

deque

>>> from collections import deque
>>> dq = deque(range(10), maxlen=10)
>>> dq
deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], maxlen=10)
>>> dq.rotate(3)
>>> dq
deque([7, 8, 9, 0, 1, 2, 3, 4, 5, 6], maxlen=10)
>>> dq.rotate(-4)
>>> dq
deque([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], maxlen=10)
>>> dq.appendleft(-1)
>>> dq
deque([-1, 1, 2, 3, 4, 5, 6, 7, 8, 9], maxlen=10)
>>> dq.extend([11, 22, 33])
>>> dq
deque([3, 4, 5, 6, 7, 8, 9, 11, 22, 33], maxlen=10)
>>> dq.extendleft([10, 20, 30, 40])
>>> dq
deque([40, 30, 20, 10, 3, 4, 5, 6, 7, 8], maxlen=10)

Besides deque, other Python standard library packages implement queues.

queue: provides the synchronized (i.e. thread-safe) classes Queue, LifoQueue and PriorityQueue, used for safe communication between threads. All three can be bounded by passing a maxsize argument greater than 0 to the constructor. However, they don't discard items to make room as deque does: when the queue is full, inserting a new item blocks, i.e. it waits until some other thread makes room by taking an item from the queue, which is useful to throttle the number of live threads.

multiprocessing: implements its own bounded Queue, very similar to queue.Queue but designed for inter-process communication. There is also a specialized multiprocessing.JoinableQueue for easier task management.

asyncio: newly added in Python 3.4, provides Queue, LifoQueue, PriorityQueue and JoinableQueue with APIs inspired by the classes in queue and multiprocessing, but adapted for managing tasks in asynchronous programming.

heapq: in contrast to the previous three modules, does not implement a queue class, but provides functions like heappush and heappop that let you use a mutable sequence as a heap queue or priority queue.
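
A minimal sketch (mine, not from the book) of that blocking behaviour on a bounded queue.Queue: unlike deque(maxlen=...), a full Queue refuses to discard items.

import queue

q = queue.Queue(maxsize=2)
q.put('a')
q.put('b')
try:
    q.put('c', timeout=0.1)   # queue is full: blocks, then raises queue.Full
except queue.Full:
    print('queue full, item not inserted')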

Jake DS

struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
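
That header is why a single Python integer is so much heavier than its raw value; a rough illustration (exact numbers vary by build):

import sys
import numpy as np

print(sys.getsizeof(1))                                # ~28 bytes on 64-bit CPython
print(np.arange(1000, dtype=np.int64).nbytes / 1000)   # 8 bytes per element in a NumPy array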

NUMPY

# basic arrays
np.array()
np.arange()
np.linspace()
np.random.random((3, 3)) # default [0,1]
np.random.normal(0, 1, (3,3)) 
np.random.randint(0, 10, (3, 3)) # 3x3 in [0, 10)

VECTORISED operation UFUNC

from scipy import special
# Gamma functions (generalised factorials)
special.gamma()
special.gammaln()
special.beta()
# error functions (integral of Gaussian)
special.erf()
special.erfc()
special.erfinv()
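
A small sketch of why ufuncs matter: the same element-wise reciprocal computed with a Python loop and as a single vectorised expression (the loop version is orders of magnitude slower).

import numpy as np

values = np.random.randint(1, 10, size=100000)

def compute_reciprocals(arr):
    out = np.empty(len(arr))
    for i in range(len(arr)):
        out[i] = 1.0 / arr[i]     # one Python-level operation per element
    return out

slow = compute_reciprocals(values)   # Python loop
fast = 1.0 / values                  # ufunc: a single vectorised call
assert np.allclose(slow, fast)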

AGGREGATION
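
A minimal aggregation sketch: whole-array aggregates plus the axis keyword on a 2-D array.

import numpy as np

M = np.random.random((3, 4))
M.sum()           # aggregate over the whole array
M.min(axis=0)     # minimum of each column
M.max(axis=1)     # maximum of each row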

BROADCASTING (mix-dim binary ufuncs)
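
A minimal broadcasting sketch: operands of different shapes are stretched to a common shape before the ufunc is applied; centring columns is the typical use.

import numpy as np

a = np.arange(3)                 # shape (3,)
b = np.arange(3)[:, np.newaxis]  # shape (3, 1)
a + 5                            # scalar broadcast to (3,)
a + b                            # (3,) and (3, 1) broadcast to (3, 3)

X = np.random.random((10, 3))
X_centered = X - X.mean(axis=0)  # (10, 3) minus (3,): column-wise centring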

BOOLEAN MASKING

rainy = (inches > 0)
days = np.arange(365)
summer = (days > 172) & (days < 262) # mask of all summer days
np.median(inches[rainy])
np.median(inches[summer])
bin(42) # 0b101010
bin(59) # 0b111011
bin(42 & 59) # 0b101010 - bitwise AND applied bit by bit, just like & on boolean masks

FANCY INDEXING

rand = np.random.RandomState(42)  # rand was not defined in the note; RandomState(42) assumed
mean = [0, 0]
cov = [[1, 2],
       [2, 5]]
X = rand.multivariate_normal(mean, cov, 100)
X.shape # (100, 2)

# subsetting
indices = np.random.choice(X.shape[0], 20, replace=False)
selection = X[indices]
# plotting: circle the selected points
plt.scatter(selection[:, 0], selection[:, 1],
            facecolor='none', s=200);

SORTING

# selection sort: find the minimum and swap until sorted
def selection_sort(x):
    for i in range(len(x)):
        swap = i + np.argmin(x[i:])
        (x[i], x[swap]) = (x[swap], x[i])
    return x

X = rand.rand(10, 2)
dist_sq = np.sum((X[:, np.newaxis, :]
                  - X[np.newaxis, :, :]) ** 2, axis=-1)
# broadcasting in detail:
# for each pair of points, compute the difference in coordinates
# shape of the differences = (10, 10, 2)

# argsort to sort along each row, leftmost give indices of NN
nearest = np.argsort(dist_sq, axis=1) # full sort
# partial k sort
K = 2
nearest_partition = np.argpartition(dist_sq, K+1, axis=1)

# plotting
plt.scatter(X[:, 0], X[:, 1], s=100)

# draw lines from each point to its 2NN
for i in range(X.shape[0]):
    for j in nearest_partition[i, :K+1]:
        plt.plot(*zip(X[j], X[i]), color='black')

STRUCTURED NUMPY ARRAYS

tp = np.dtype([
    ('id', 'i8'),
    ('mat', 'f8', (3, 3))
])
X = np.zeros(1, dtype=tp)
X[0] # (0, [[0.0, 0.0, 0.0], ...])
X['mat'][0] # the matrix form

PANDAS

MISSING

# examples of handling missing values
data.isnull()
data[data.notnull()]
# finer control over dropping:
# thresh sets the minimum number of non-null values needed to keep a row/column
data.dropna(thresh=3)

MULTI-INDEXING

df.unstack() # multi-index -> single index (pivots an index level into columns)
df.stack()   # the reverse: pivot columns back into a multi-index level

# from dataframe index
index=[
    ['2010', '2010', '2011', '2011'], 
    ['US', 'Canada', 'US', 'Canada']
]

# Explicit
pd.MultiIndex.from_arrays([
    ['a', 'a', 'b', 'b'],
    [1, 2, 1, 2]
])
pd.MultiIndex.from_tuples([
    ('a', 1), ('a', 2), ...
])
pd.MultiIndex.from_product([
    ['a', 'b'], [1, 2]
])
pd.MultiIndex(levels=[['a', 'b'], [1, 2]], 
             labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

# naming
df.index.names = ['state', 'year']
# column-wise
columns=pd.MultiIndex.from_product(..., names=[...])

# most accessing and slicing work intuitively
# get around slicing WITHIN index tuples
idx = pd.IndexSlice
df.loc[idx[:, 1], idx[:, 'HR']]

df.unstack(level=0)
df.unstack(level=1)
# retrieve original Series
df.unstack().stack()

# reset multi-index 
df.reset_index(name='population')
# returning multi-index
df.set_index(['state', 'year'])

# aggregation (GroupBy effective)
df.mean(level='year', axis=1)

CONCAT
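
A minimal pd.concat sketch (df1/df2 invented for illustration): row-wise stacking with the index options that matter most.

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

pd.concat([df1, df2])                     # keeps the original (duplicated) index
pd.concat([df1, df2], ignore_index=True)  # renumber the rows
pd.concat([df1, df2], keys=['x', 'y'])    # hierarchical index marking the source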

MERGE (relational algebra)
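
A minimal pd.merge sketch (toy frames, separate from the census example below): merge infers the join key from the common column and performs the relational join.

import pandas as pd

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                          'group': ['Accounting', 'Engineering', 'HR']})
hire_dates = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                           'hire_date': [2004, 2008, 2012]})

pd.merge(employees, hire_dates, on='employee')   # one-to-one join on 'employee'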

# example
merged.isnull().any() # see which column has NaN
merged[merged['population'].isnull()].head() # scout them out
merged.loc[merged['state'].isnull(), 'state/region'].unique()
# see which 'state/region' values are also missing a state name
final.query("year == 2010 & ages == 'total'")

GROUPBY: Split, Apply, Combine

def filter_sd(x):
    return x['col2'].std() > 4
df.groupby('key').filter(filter_sd)
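
The other GroupBy verbs sit next to filter(); a small sketch on an invented frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': list('ABCABC'),
                   'col1': range(6),
                   'col2': np.random.randint(1, 10, 6)})

df.groupby('key').aggregate(['min', 'median', 'max'])            # one row per group
df.groupby('key').transform(lambda x: x - x.mean())              # group-wise centring, same shape
df.groupby('key').apply(lambda g: g['col1'] / g['col2'].sum())   # arbitrary per-group logic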

PANEL

# pure groupby
df.groupby(['sex', 'class'])['survived'].agg('mean').unstack()
# pivot table
df.pivot_table('survived', index='sex', columns='class')
# example of birth in decades comparison
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')

EXTRA EXPLORATION

quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])

births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')

# set 'day' to int from string
# then combine timestamps to create Date index
births.index = pd.to_datetime(10000 * births.year +
                             100 * births.month + 
                             births.day, format='%Y%m%d')
births['dayofweek'] = births.index.dayofweek
births.pivot_table('births', index='dayofweek',
                  columns='decade', aggfunc='mean').plot()
# another view by day of year
births.pivot_table('births', 
                  [births.index.month, births.index.day])
new_births.index = [pd.datetime(2012, month, day)
                   for (month, day) in births_by_date.index]

mu and sig are robust estimates of the sample mean and standard deviation; the 0.74 comes from the interquartile range of a Gaussian distribution.

VECTORISED STRING
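
A small sketch of the .str accessor on an invented Series: vectorised string methods that mirror Python's, applied element-wise.

import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

monte.str.lower()              # element-wise lower()
monte.str.len()                # length of each string
monte.str.startswith('T')      # boolean mask
monte.str.split().str.get(-1)  # last name: split then take the last piece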

MESSY DATA

# recipe database: !curl -O URL then gunzip to fetch it
# reading the whole file raises ValueError: Trailing data, so try one line first:
with open('some.json') as f:
    line = f.readline()
pd.read_json(line).shape # (2, 12)
# instead, strip each line, join them into one JSON array string,
# and read that into a DataFrame
with open('some.json') as f:
    data = (line.strip() for line in f)
    data_json = "[{0}]".format(','.join(data))
recipes = pd.read_json(data_json) # shape 123994324, 18

# textcol.str.len().describe() reveals summary
# how many breakfast recipes
recipes.description.str.contains('[Bb]reakfast').sum()
# suggest recipes based on ingredients
import re
# spice_list not defined in the note; an example list of spices to search for
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
spice_df = pd.DataFrame(dict((spice,
                              recipes.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))
selection = spice_df.query('parsley & paprika & tarragon')
# finding them
recipes.name[selection.index] # wow

Unfortunately, the wide variety of formats makes this time-consuming, which points to a truism of data science: cleaning and munging data often makes up the majority of the work.

TIME SERIES

# hand made date
from datetime import datetime
datetime(year=2015, month=7, day=4)
from dateutil import parser
date = parser.parse("4th of July, 2015")
# both produce the same datetime object
# ops on them 
date.strftime('%A') # standard string format codes 
# COOL LIB: pytz helps with time zones
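
And the pandas side of dates, as a small sketch (invented series): to_datetime parsing, a DatetimeIndex, and resampling.

import numpy as np
import pandas as pd

date = pd.to_datetime('4th of July, 2015')                 # dateutil-style parsing
index = pd.date_range('2015-07-04', periods=8, freq='D')
series = pd.Series(np.arange(8), index=index)

series.resample('2D').sum()   # downsample into 2-day bins
series.shift(1)               # lag the series by one period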

HPC on PANDAS

# eval test
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))
%timeit df1 + df2 + df3 + df4
%timeit pd.eval('df1 + df2 + df3 + df4')
# 50% faster and much less memory

# supported ops
#arithmetic
('-df1 * df2 / (df3 + df4) - df5')
#boolean
('df1 < df2 <= df3 != df4')
# bitwise & and | stand in for the logical 'and' / 'or'

# object attributes and indices
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2) # True

# more complex functions are NOT supported in eval(); for those, use the Numexpr library directly !!!

# supports dataframe ops
# accessing directly the column names
# assignment possible
df.eval('D = (A + B) / C', inplace=True)
# local Python variables via the @ prefix
column_mean = df.mean(1)
df.eval('A + @column_mean')


# query(): df.eval() cannot express a filter like df[(df.A < 0.5) & (df.B < 0.5)]
df.query('A < 0.5 and B < 0.5')
# faster masking
df.query('A < @Cmean and B < @Cmean')

MATPLOT

rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marker,
             label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);

plt.plot(x, y, '-p', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='white',
         markeredgecolor='gray',
         markeredgewidth=2)

# ERRORs
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
             ecolor='lightgray', elinewidth=3, capsize=0);

# visual GP
from sklearn.gaussian_process import GaussianProcess

# define the model and draw some data
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Compute the Gaussian process fit
gp = GaussianProcess(corr='cubic', theta0=1e-2, thetaL=1e-4, thetaU=1E-1,
                     random_start=100)
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
yfit, MSE = gp.predict(xfit[:, np.newaxis], eval_MSE=True)
dyfit = 2 * np.sqrt(MSE)  # 2*sigma ~ 95% confidence region

# Visualize the result
plt.plot(xdata, ydata, 'or')
plt.plot(xfit, yfit, '-', color='gray')

plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
                 color='gray', alpha=0.2)
plt.xlim(0, 10);

SKL

from sklearn.learning_curve import learning_curve

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

for i, degree in enumerate([2, 9]):
    N, train_lc, val_lc = learning_curve(PolynomialRegression(degree),
                                         X, y, cv=7,
                                         train_sizes=np.linspace(0.3, 1, 25))

    ax[i].plot(N, np.mean(train_lc, 1), color='blue', label='training score')
    ax[i].plot(N, np.mean(val_lc, 1), color='red', label='validation score')
    ax[i].hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1],
                 color='gray', linestyle='dashed')

    ax[i].set_ylim(0, 1)
    ax[i].set_xlim(N[0], N[-1])
    ax[i].set_xlabel('training size')
    ax[i].set_ylabel('score')
    ax[i].set_title('degree = {0}'.format(degree), size=14)
    ax[i].legend(loc='best')

A visual depiction of how the model responds to the training-set size N: once the learning curve has converged, adding more data will not improve the fit; the only remedy is a more complex model.

Naive Bayes

Linear Regression

# !curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD

import pandas as pd
counts = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
weather = pd.read_csv('data/BicycleWeather.csv', index_col='DATE', parse_dates=True)

daily = counts.resample('d').sum()
daily['Total'] = daily.sum(axis=1)
daily = daily[['Total']] # remove other columns

# account for binary col indicating day of week
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for i in range(7):
    daily[days[i]] = (daily.index.dayofweek == i).astype(float)
    


from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2016')
daily = daily.join(pd.Series(1, index=holidays, name='holiday'))
daily['holiday'].fillna(0, inplace=True)



def hours_of_daylight(date, axis=23.44, latitude=47.61):
    """Compute the hours of daylight for the given date"""
    days = (date - pd.datetime(2000, 12, 21)).days
    m = (1. - np.tan(np.radians(latitude))
         * np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))
    return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.

daily['daylight_hrs'] = list(map(hours_of_daylight, daily.index))
daily[['daylight_hrs']].plot()
plt.ylim(8, 17)



# temperatures are in 1/10 deg C; convert to C
weather['TMIN'] /= 10
weather['TMAX'] /= 10
weather['Temp (C)'] = 0.5 * (weather['TMIN'] + weather['TMAX'])

# precip is in 1/10 mm; convert to inches
weather['PRCP'] /= 254
weather['dry day'] = (weather['PRCP'] == 0).astype(int)

daily = daily.join(weather[['PRCP', 'Temp (C)', 'dry day']])

daily['annual'] = (daily.index - daily.index[0]).days / 365.

# fit without an intercept, because the daily flags essentially operate as their own day-specific intercepts:
# Drop any rows with null values
daily.dropna(axis=0, how='any', inplace=True)

column_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday',
                'daylight_hrs', 'PRCP', 'dry day', 'Temp (C)', 'annual']
X = daily[column_names]
y = daily['Total']

from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=False)
model.fit(X, y)
daily['predicted'] = model.predict(X)

daily[['Total', 'predicted']].plot(alpha=0.5);

# evidently the model is missing some key features, especially during the summer
# either the features are incomplete or there is nonlinearity we have not accounted for

# check the coefficients of the linear model to estimate how much each feature contributes to the daily bicycle count:
params = pd.Series(model.coef_, index=X.columns)
params

# these numbers are hard to interpret without uncertainties; estimate them with bootstrap resampling
from sklearn.utils import resample
np.random.seed(1)
err = np.std([model.fit(*resample(X, y)).coef_
              for i in range(1000)], 0)

# check again with errors estimated
print(pd.DataFrame({'effect': params.round(0),
                    'error': err.round(0)}))

SVM

