Python Data Analysis with the Third-Party Library pandas

Published 2019-11-21, updated 2020-09-04. Categories: rd, python

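pandas is a data-analysis library built on top of NumPy. It provides standard data structures together with a large set of functions and methods for manipulating large datasets quickly and conveniently. It is one of the main reasons Python is a powerful and productive environment for data analysis.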

`Series`

A Series is a one-dimensional, array-like object made up of two parts:

values: the data (an ndarray)
index: the labels associated with the data

Creating a `Series`

From a list

from pandas import Series

s = Series(data=[3, 4, 5, 2, 5])
print(s)

# Output
0    3
1    4
2    5
3    2
4    5
dtype: int64

From a `numpy` array

import numpy as np
from pandas import Series
s = Series(data=np.random.randint(1, 50, size=6), index=['a', 'b', 'c', 'd', 'e', 'f'], name='test')
print(s)

# Output (the values are random)
a    24
b    39
c    46
d    13
e    41
f    15
Name: test, dtype: int32

From a dictionary

from pandas import Series
dic = {
    'name' : '张学友',
    'gender' : '男',
    'age' : 18,
    'city' : '鹤岗'
}
s = Series(data=dic, name='test')
print(s)

# Output
name      张学友
gender      男
age        18
city       鹤岗
Name: test, dtype: object

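`Series` indexing

Square brackets with a single key return one element (the result has the element's own type). Square brackets with a list of keys return several elements (the result is a Series).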
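The difference between a single key and a list of keys can be sketched as follows. This is a minimal illustration that reuses the sample dictionary from this article; the variable names `one` and `several` are made up for the example.

```python
import pandas as pd

s = pd.Series({'name': '张学友', 'gender': '男', 'age': 18, 'city': '鹤岗'},
              name='test')

# A single label returns the stored element itself.
one = s['city']

# A list of labels returns a new Series, in the order the labels were given.
several = s[['name', 'city']]

print(type(one).__name__)      # a plain Python str
print(type(several).__name__)  # Series
```

Passing a one-element list such as `s[['city']]` still returns a Series, which is useful when later code expects Series methods.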

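Explicit (label-based) indexing

Use an element of the index as the key
Use s.loc[] (recommended): the key inside loc must be an explicit label
Use attribute access, s.label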

from pandas import Series
dic = {
    'name' : '张学友',
    'gender' : '男',
    'age' : 18,
    'city' : '鹤岗'
}
s = Series(data=dic, name='test')
print(s['city'])
print(s.loc['city'])
print(s.city)

# Output
鹤岗
鹤岗
鹤岗

Implicit (position-based) indexing

Use an integer position as the key
Use .iloc[] (recommended): the key inside iloc must be an integer position

from pandas import Series
dic = {
    'name' : '张学友',
    'gender' : '男',
    'age' : 18,
    'city' : '鹤岗'
}
s = Series(data=dic, name='test')
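# s[3] on a label-indexed Series relies on a positional fallback that is deprecated in pandas 2.x; prefer iloc
print(s[3])
print(s.iloc[3])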

# Output
鹤岗
鹤岗

`Series` slicing

Slicing with explicit labels

Use a label range directly, e.g. s['a':'b']
Use loc, e.g. s.loc['a':'b']
Label slices include both endpoints.

from pandas import Series
dic = {
    'name' : '张学友',
    'gender' : '男',
    'age' : 18,
    'city' : '鹤岗'
}
s = Series(data=dic, name='test')
print(s['gender':'city'])
print(s.loc['gender':'city'])

# Output
gender     男
age       18
city      鹤岗
Name: test, dtype: object
gender     男
age       18
city      鹤岗
Name: test, dtype: object

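Slicing with implicit positions

Use an integer range directly, e.g. s[1:3]
Use iloc, e.g. s.iloc[1:3]
Positional slices exclude the end position.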

from pandas import Series
dic = {
    'name' : '张学友',
    'gender' : '男',
    'age' : 18,
    'city' : '鹤岗'
}
s = Series(data=dic, name='test')
print(s[0:2])
print(s.iloc[0:2])

# Output
name      张学友
gender      男
Name: test, dtype: object
name      张学友
gender      男
Name: test, dtype: object

`Series` attributes and basic usage

Index

from pandas import Series
dic = {
    'name' : '张学友',
    'gender' : '男',
    'age' : 18,
    'city' : '鹤岗'
}
s = Series(data=dic, name='test')
print(s.index)

# Output
Index(['name', 'gender', 'age', 'city'], dtype='object')

Values

from pandas import Series
dic = {
    'name' : '张学友',
    'gender' : '男',
    'age' : 18,
    'city' : '鹤岗'
}
s = Series(data=dic, name='test')
print(s.values)

# Output
['张学友' '男' 18 '鹤岗']

`head(n)`

Return the first n values

print(s.head(2))
# Output
name      张学友
gender      男
Name: test, dtype: object

`tail(n)`

Return the last n values

print(s.tail(2))
# Output
age     18
city    鹤岗
Name: test, dtype: object

`unique()`

Return the distinct values (duplicates removed)

s = Series(data=[1, 2, 3, 4, 5, 4, 3, 2, 3, 2])
print(s.unique())

# Output
[1 2 3 4 5]

`Series` arithmetic

Scalar arithmetic

Use the operators +, -, *, / directly

s = Series(data=[1, 2, 3, 4, 5])
print(s+10)

# Output
0    11
1    12
2    13
3    14
4    15
dtype: int64

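Arithmetic methods

add(), sub(), mul(), and div() perform addition, subtraction, multiplication, and division. Each takes a number or another Series as its first argument. The fill_value argument supplies a default that replaces missing elements before the operation is applied.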

s = Series(data=[20, 30, None, 40, None], index=['a', 'b', 'c', 'd', 'e'])
print(s.add(10, fill_value=999))

# Output
a      30.0
b      40.0
c    1009.0
d      50.0
e    1009.0
dtype: float64

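Arithmetic between multiple `Series`

When two Series are added, their elements are aligned by index label. Labels present in both Series are combined arithmetically; labels present in only one produce NaN (not a number), which marks missing data.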

s1 = Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = Series(data=[10, 20, 30, 40], index=['f', 'c', 'e', 'a'])
print(s1+s2)
# or, equivalently
s1.add(s2)

# Output
a    41.0
b     NaN
c    23.0
d     NaN
e     NaN
f     NaN
dtype: float64
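If NaN for unmatched labels is not what you want, the fill_value argument described earlier also works between two Series. The sketch below reuses s1 and s2 from the example above; treating a missing side as 0 is an assumption chosen for illustration, not something the original example does.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['f', 'c', 'e', 'a'])

# Plain addition: labels found in only one Series become NaN.
plain = s1 + s2

# With fill_value=0, a label missing on one side is treated as 0 on that side.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

With fill_value, a label produces NaN only when it is missing from both operands.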

`isnull()`

Test whether each element is missing

s = Series(data=[1, 2, None, 4, None], index=['a', 'b', 'c', 'd', 'e'])
print(s.isnull())

# Output
a    False
b    False
c     True
d    False
e     True
dtype: bool

`notnull()`

Test whether each element is not missing

s = Series(data=[1, 2, None, 4, None], index=['a', 'b', 'c', 'd', 'e'])
print(s.notnull())

# Output
a     True
b     True
c    False
d     True
e    False
dtype: bool
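The boolean Series returned by isnull() and notnull() are typically used as masks. A minimal sketch, assuming the same sample Series as above (the names `present` and `missing_count` are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, None, 4, None], index=['a', 'b', 'c', 'd', 'e'])

# Keep only the elements that are not missing.
present = s[s.notnull()]

# True counts as 1, so summing the mask counts the missing elements.
missing_count = int(s.isnull().sum())

print(present)
print(missing_count)
```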

`DataFrame`

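A DataFrame is a tabular data structure made up of several columns arranged in a fixed order. It was designed to extend the use of Series from one dimension to two. A DataFrame has both a row index and a column index.

Row index: index
Column index: columns
Values: values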

Creating a `DataFrame`

From an `ndarray`

import numpy as np
from pandas import DataFrame
DataFrame(data=np.random.randint(0,100,size=(5,6)))

Output:

	0	1	2	3	4	5
0	32	93	0	23	21	40
1	27	35	9	76	41	68
2	63	96	63	30	96	51
3	2	50	28	26	26	41
4	32	74	97	84	56	7

From a dictionary

dic = {
    '张三':[77,88,99,90],
    '李四':[67,88,99,78]
}
df = DataFrame(data=dic,index=['语文','数学','英语','理综'])

Output:

	张三	李四
语文	77	67
数学	88	88
英语	99	99
理综	90	78

`DataFrame` attributes

`values`

The element values

dic = {
    '张三':[77,88,99,90],
    '李四':[67,88,99,78]
}
df = DataFrame(data=dic,index=['语文','数学','英语','理综'])
df.values

Output:

array([[77, 67],
       [88, 88],
       [99, 99],
       [90, 78]], dtype=int64)

`columns`

Column index

df.columns

Output:

Index(['张三', '李四'], dtype='object')

`index`

Row index

df.index

Output:

Index(['语文', '数学', '英语', '理综'], dtype='object')

`shape`

Shape, as (rows, columns)

df.shape

Output:

(4, 2)

`DataFrame` indexing

Column indexing

Dictionary style

Select a column the way you would look up a dictionary key, e.g. df['a']

df['张三']

Output:

语文    77
数学    88
英语    99
理综    90
Name: 张三, dtype: int64

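Attribute style

Select a column the way you would access an object attribute, e.g. df.a. This works only when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.

df.张三

Output: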

语文    77
数学    88
英语    99
理综    90
Name: 张三, dtype: int64

Selecting multiple columns

df[['李四', '张三']]

Output:

	李四	张三
语文	67	77
数学	88	88
英语	99	99
理综	78	90

Renaming the columns

# rename the columns
df.columns = ['zhangsan','lisi']
df

Output:

	zhangsan	lisi
语文	77	67
数学	88	88
英语	99	99
理综	90	78

The examples that follow use the original column names 张三 and 李四 again.

Row indexing

`loc`

Use .loc[] with an index label to select a row

df.loc['语文']

Output:

张三    77
李四    67
Name: 语文, dtype: int64

Selecting multiple rows

df.loc[['语文', '英语']]

Output:

	张三	李四
语文	77	67
英语	99	99

`iloc`

Use .iloc[] with an integer position to select a row (the default integer positions)

df.iloc[1]

Output:

张三    88
李四    88
Name: 数学, dtype: int64

Selecting multiple rows by position

df.iloc[[1,3]]

Output:

	张三	李四
数学	88	88
理综	90	78

Element indexing

The row key comes first, the column key second

`loc`

Use labels, e.g. df.loc['理综', '李四']

df.loc['理综', '李四']

Output:

78

`iloc`

Use integer positions, e.g. df.iloc[3, 1]

df.iloc[3, 1]

Output:

78
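The two element lookups above address the same cell. The sketch below rebuilds the scores table from this article and checks that label-based and position-based access agree. The `.at` accessor is an extra that the original text does not mention; it is pandas' label-based accessor for a single scalar.

```python
import pandas as pd

df = pd.DataFrame(
    {'张三': [77, 88, 99, 90], '李四': [67, 88, 99, 78]},
    index=['语文', '数学', '英语', '理综'],
)

by_label = df.loc['理综', '李四']     # row label, column label
by_position = df.iloc[3, 1]           # row position, column position
by_at = df.at['理综', '李四']         # scalar-only, label-based accessor

print(by_label, by_position, by_at)
```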

`DataFrame` slicing

Slicing with square brackets

A slice inside plain square brackets selects rows, and only rows, e.g. df[0:2]

df[0:2]

Output:

	张三	李四
语文	77	67
数学	88	88

`iloc`

The first argument is the row range and the second is the column range. Both use integer positions, e.g. df.iloc[1:2, 1:3]

Rows only

df.iloc[0:2]

Output:

	张三	李四
语文	77	67
数学	88	88

Columns only

The first argument always refers to rows, so keep a full row slice (:) when slicing only columns:

df.iloc[:, 1:2]

Output:

	李四
语文	67
数学	88
英语	99
理综	78

Rows and columns

df.iloc[1:3, 1:2]

Output:

	李四
数学	88
英语	99

`loc`

The first argument is the row range and the second is the column range. Both use labels, and both endpoints are included, e.g. df.loc['语文':'英语', '张三':'李四']

Rows only

df.loc['语文':'数学']

Output:

	张三	李四
语文	77	67
数学	88	88

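Columns only

The first argument always refers to rows, so keep a full row slice (:) when slicing only columns:

A single column

df.loc[:, '李四']

Output (a single column label returns a Series):

语文    67
数学    88
英语    99
理综    78
Name: 李四, dtype: int64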

Multiple columns

dic = {
    '张三':[77,88,99,90],
    '李四':[67,92,97,68],
    '王五':[83,79,91,78]
}
df = DataFrame(data=dic,index=['语文','数学','英语','理综'])

df.loc[:, '张三':'李四']

Output:

	张三	李四
语文	77	67
数学	88	92
英语	99	97
理综	90	68

Rows and columns

dic = {
    '张三':[77,88,99,90],
    '李四':[67,92,97,68],
    '王五':[83,79,91,78]
}
df = DataFrame(data=dic,index=['语文','数学','英语','理综'])

df.loc['数学':'英语', '张三':'李四']

Output:

	张三	李四
数学	88	92
英语	99	97

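`DataFrame` arithmetic

Arithmetic between `DataFrame`s

As with Series:

Data with different indexes is aligned automatically during the operation
Where labels do not match, the result is NaN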

df1 = DataFrame(data=np.random.randint(1, 100, size=[4, 3]), index=['a', 'b', 'c', 'd'], columns=['aa', 'bb', 'cc'])
df2 = DataFrame(data=np.random.randint(1, 100, size=[4, 3]), index=['a', 'b', 'c', 'd'], columns=['bb', 'dd', 'aa'])

df1:

	aa	bb	cc
a	48	48	44
b	31	53	24
c	92	71	10
d	43	62	40

df2:

	bb	dd	aa
a	34	22	60
b	25	30	21
c	41	36	13
d	45	48	50

df1+df2

Output:

	aa	bb	cc	dd
a	108	82	NaN	NaN
b	52	78	NaN	NaN
c	105	112	NaN	NaN
d	93	107	NaN	NaN
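As with Series, the add() method accepts fill_value when the NaN columns are unwanted. The sketch below hard-codes the df1 and df2 values printed above, so it does not depend on the random generator; using 0 for the missing side is an illustrative assumption.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[48, 48, 44], [31, 53, 24], [92, 71, 10], [43, 62, 40]],
    index=['a', 'b', 'c', 'd'], columns=['aa', 'bb', 'cc'],
)
df2 = pd.DataFrame(
    [[34, 22, 60], [25, 30, 21], [41, 36, 13], [45, 48, 50]],
    index=['a', 'b', 'c', 'd'], columns=['bb', 'dd', 'aa'],
)

# Columns cc and dd exist in only one frame, so plain addition yields NaN there.
plain = df1 + df2

# fill_value=0 treats a cell missing from one frame as 0 before adding.
filled = df1.add(df2, fill_value=0)

print(filled)
```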

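The following exercise illustrates DataFrame arithmetic:

Suppose df1 holds the midterm exam scores and df2 the final exam scores. Create df2 yourself, add it to df1, and compute the average of the midterm and final scores.
Suppose 张三 is caught cheating on the midterm math exam and his score must be set to 0. How is that done?
李四 reported the cheating and is rewarded with 100 extra points in every midterm subject. How is that done?
The teacher later finds an error in one question and gives every student 10 extra points in every subject. How is that done?

# fix the seed so the data does not change between runs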
np.random.seed(2)
df1 = DataFrame(data=np.random.randint(80, 100, size=[4, 3]), index=['语文', '数学', '英语', '开车'], columns=['张三', '李四', '王五'])
np.random.seed(2)
df2 = DataFrame(data=np.random.randint(75, 100, size=[4, 3]), index=['语文', '数学', '英语', '开车'], columns=['张三', '李四', '王五'])

First look at the two sets of results. The midterm scores, df1:

	张三	李四	王五
语文	88	95	93
数学	88	91	98
英语	91	88	87
开车	82	97	91

The final scores, df2:

	张三	李四	王五
语文	83	90	88
数学	83	97	86
英语	93	86	83
开车	82	77	92

Computing the average

# (midterm + final) / 2
(df1+df2)/2

Output:

	张三	李四	王五
语文	85.5	92.5	90.5
数学	85.5	94.0	92.0
英语	92.0	87.0	85.0
开车	82.0	87.0	91.5

Setting 张三's midterm math score to 0

df1.loc['数学', '张三'] = 0
# or use integer positions
# df1.iloc[1,0] = 0
df1

Output:

	张三	李四	王五
语文	88	95	93
数学	0	91	98
英语	91	88	87
开车	82	97	91

Adding 100 to all of 李四's midterm scores

df1.loc[:, '李四'] += 100
# or use integer positions
# df1.iloc[:, 1] += 100
df1

Output:

	张三	李四	王五
语文	88	195	93
数学	0	191	98
英语	91	188	87
开车	82	197	91

Adding 10 to every midterm score

df1 += 10
df1

Output:

	张三	李四	王五
语文	98	205	103
数学	10	201	108
英语	101	198	97
开车	92	207	101

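Removing duplicate rows

The duplicated() method detects duplicate rows. It returns a boolean Series with one element per row. By default an element is True when that row has already appeared earlier.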

import numpy as np
from pandas import DataFrame, Series
# create a DataFrame
np.random.seed(2)
df = DataFrame(data=np.random.randint(1, 100, size=[7, 3]), index=['a', 'b', 'c', 'd', 'e', 'f', 'g'], columns=['aa', 'bb', 'cc'])
# manually set several rows to the same content
df.loc[['b', 'd', 'f']] = 888
print(df)

Output:

	aa	bb	cc
a	41	16	73
b	888	888	888
c	76	8	35
d	888	888	888
e	86	48	64
f	888	888	888
g	38	40	68

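Using duplicated to find all duplicate rows

# keep controls which occurrence is left unflagged: 'first' leaves the first occurrence as False,
# 'last' leaves the last occurrence as False, and False flags every duplicated row as True
duplicate_rows = df.duplicated(keep=False)
print(duplicate_rows)

Output: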

a    False
b     True
c    False
d     True
e    False
f     True
g    False
dtype: bool

Getting the index labels of the duplicate rows

duplicate_rows_index = df.loc[duplicate_rows].index
print(duplicate_rows_index)

Output:

Index(['b', 'd', 'f'], dtype='object')

Dropping rows by the duplicate index labels

# drop rows by label; axis=0 means rows
df.drop(labels=duplicate_rows_index, axis=0)

Output:

	aa	bb	cc
a	41	16	73
c	76	8	35
e	86	48	64
g	38	40	68

Using `drop_duplicates`

The original df:

	aa	bb	cc
a	41	16	73
b	888	888	888
c	76	8	35
d	888	888	888
e	86	48	64
f	888	888	888
g	38	40	68

# keep none of the duplicated rows
df.drop_duplicates(keep=False)

Output:

	aa	bb	cc
a	41	16	73
c	76	8	35
e	86	48	64
g	38	40	68
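The keep argument of drop_duplicates takes the same three values as in duplicated. The sketch below hard-codes the table shown above, so it does not rely on the random seed, and compares the three settings; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'aa': [41, 888, 76, 888, 86, 888, 38],
        'bb': [16, 888, 8, 888, 48, 888, 40],
        'cc': [73, 888, 35, 888, 64, 888, 68],
    },
    index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
)

keep_first = df.drop_duplicates()             # default keep='first': row b survives
keep_last = df.drop_duplicates(keep='last')   # row f survives
keep_none = df.drop_duplicates(keep=False)    # b, d and f are all dropped

print(list(keep_first.index))
print(list(keep_last.index))
print(list(keep_none.index))
```

drop_duplicates returns a new DataFrame, so the original df is left unchanged unless the result is assigned back.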

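Mapping

The `replace()` function

replace() substitutes elements: it maps existing values to new values.

Syntax:

df.replace(
    to_replace=None,  # the value(s) to be replaced
    value=None,       # the replacement value(s)
    inplace=False     # True modifies the object in place; False returns a new object
)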

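`Series` replacement

Single-value replacement
- plain replacement
- dictionary replacement (recommended)
Multi-value replacement
- list replacement
- dictionary replacement (recommended)
Parameters
- to_replace: the element(s) to be replaced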

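`DataFrame` replacement

Single-value replacement
- plain replacement, applied to every matching element: to_replace=15, value='e'
- per-column single-value replacement: to_replace={column label: value to replace}, value='value'
Multi-value replacement
- list replacement: to_replace=[], value=[]
- dictionary replacement (recommended): to_replace={to_replace: value, to_replace: value}

Examples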

>>> import pandas as pd
>>> s = pd.Series([0, 1, 2, 3, 4])

Replace 0 with 5
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64

>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
Replace 0 with 5
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

**List-like `to_replace`**

Replace each of 0, 1, 2, 3 with 4
>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e

Replace 0, 1, 2, 3 with 4, 3, 2, 1 respectively
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

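Replace 1 and 2 by backward fill: each replaced element takes the next value below it that is not itself being replaced (bfill fills from the following rows, ffill from the preceding rows). Note that the method argument of replace is deprecated since pandas 2.1.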
>>> s.replace([1, 2], method='bfill')
0    0
1    3
2    3
3    3
4    4
dtype: int64

**dict-like `to_replace`**
Replace 0 with 10 and 1 with 100
>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
Replace 0 in column A and 5 in column B with 100
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
In column A, replace 0 with 100 and 4 with 400
>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

**Regular expression `to_replace`**

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
Replace every three-character value that starts with ba with new
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
In column A only, replace every three-character value that starts with ba with new
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
The same replacement written with the regex argument
>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
Replace three-character values starting with ba with new, and foo with xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
Replace both three-character values starting with ba and the value foo with new
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

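The `map()` function

map() is used here to build a new column. It is a Series method, not a DataFrame method. (pandas 2.1 and later also provide an element-wise DataFrame.map, which is a different operation.)

map() can produce a new column of mapped data
map() accepts a lambda expression
map() accepts a function, including a user-defined one
e.g. map({to_replace: value})
Note: map() cannot be given aggregating functions such as sum, or a for loop. Not every function is a valid argument: only a function that takes one argument and returns a value can be passed to map().

Example

Original data

Employee salary table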

np.random.seed(2)
dic={
    'name':['JAY', 'TONY', 'DAVID'],
    'salary': np.random.randint(9000, 500000, size=3)
}

df = DataFrame(data=dic)
df

Output:

	name	salary
0	JAY	360400
1	TONY	109879
2	DAVID	212245

Adding the corresponding Chinese names

# add the Chinese name
cname = {
    'JAY' : '周杰伦',
    'TONY': '托小尼',
    'DAVID': '大卫'
}
df['cname'] = df['name'].map(cname)
df

Output:

	name	salary	cname
0	JAY	360400	周杰伦
1	TONY	109879	托小尼
2	DAVID	212245	大卫

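Adding the after-tax salary

Salaries above 200,000 are taxed at 20%; add an after-tax salary column

# salaries above 200000 pay 20% tax; return the after-tax salary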
def calc_salary(salary):
    if salary > 200000:
        salary = salary - salary * 0.2
    return salary

df['after_salary'] = df['salary'].map(calc_salary)
df

Output:

	name	salary	cname	after_salary
0	JAY	360400	周杰伦	288320.0
1	TONY	109879	托小尼	109879.0
2	DAVID	212245	大卫	169796.0

Computing the salary with an anonymous function

df['after_salary1'] = df['salary'].map(lambda x: x - x * 0.2 if x>200000 else x)
df

Output:

	name	salary	cname	after_salary	after_salary1
0	JAY	360400	周杰伦	288320.0	288320.0
1	TONY	109879	托小尼	109879.0	109879.0
2	DAVID	212245	大卫	169796.0	169796.0

Series map examples

# create a Series
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0      cat
1      dog
2      NaN
3   rabbit
dtype: object


# map values through a dictionary; values with no entry in the dictionary become NaN
>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0   kitten
1    puppy
2      NaN
3      NaN
dtype: object

# apply a function to every value of the original Series, here formatting each one into a string
>>> s.map('I am a {}'.format)
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object


# with na_action='ignore', NaN values are not passed to the function and stay NaN
>>> s.map('I am a {}'.format, na_action='ignore')
0     I am a cat
1     I am a dog
2            NaN
3  I am a rabbit

Filtering outliers

Filter out the rows of the DataFrame below whose cc value is less than the mean of the cc column

Creating the `DataFrame`

np.random.seed(3)
df = DataFrame(data=np.random.randint(1, 10, size=[5, 3]), columns=['aa', 'bb', 'cc'], index=['a', 'b', 'c', 'd', 'e'])
df

Output:

	aa	bb	cc
a	9	4	9
b	9	1	6
c	4	6	8
d	7	1	5
e	8	9	2

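Comparing against the mean

# filter out the rows whose cc value is below the column mean
# compute the mean of column cc; the mean is 6.0
cc_avg = df['cc'].mean()
# flag the elements of cc that are not below the mean
ret = df['cc']>=cc_avg
ret

Output: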

a     True
b     True
c     True
d    False
e    False
Name: cc, dtype: bool

Filtered result

df.loc[ret]

Output:

	aa	bb	cc
a	9	4	9
b	9	1	6
c	4	6	8
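Boolean masks like ret can also be combined. The sketch below hard-codes the table shown above and adds a second condition on column bb; that extra condition is purely illustrative and is not part of the original exercise.

```python
import pandas as pd

df = pd.DataFrame(
    {'aa': [9, 9, 4, 7, 8], 'bb': [4, 1, 6, 1, 9], 'cc': [9, 6, 8, 5, 2]},
    index=['a', 'b', 'c', 'd', 'e'],
)

# The filter from the text, written as a single expression.
kept = df[df['cc'] >= df['cc'].mean()]

# Combine masks with & (and) or | (or); each condition needs its own parentheses.
both = df[(df['cc'] >= df['cc'].mean()) & (df['bb'] > 1)]

print(kept)
print(both)
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why the bitwise operators are used here.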

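Random sampling

Used to draw a random subset of the samples

take

take() accepts a list of integer positions and returns the rows (or columns) of df in the order given by that list
e.g. df.take([1,3,4,2,5])

Original data

np.random.seed(2)
df = DataFrame(data=np.random.randint(1, 100, size=[20, 5]), columns=['a', 'b', 'c', 'd', 'e'])
df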

	a	b	c	d	e
0	41	16	73	23	44
1	83	76	8	35	50
2	96	76	86	48	64
3	32	91	21	38	40
4	68	5	43	52	39
5	34	59	68	70	89
6	69	47	71	96	84
7	32	67	81	53	77
8	51	5	91	64	80
9	50	40	47	9	51
10	16	9	18	23	74
11	58	91	63	84	97
12	44	33	27	9	77
13	11	41	35	61	10
14	71	87	71	20	57
15	83	2	69	41	82
16	62	71	98	19	85
17	91	88	23	44	53
18	75	73	91	92	97
19	17	56	22	44	94

Taking the specified rows

# axis=0 means rows; take the following 4 rows
df.take([19, 3, 9, 7], axis=0)

Output:

	a	b	c	d	e
19	17	56	22	44	94
3	32	91	21	38	40
9	50	40	47	9	51
7	32	67	81	53	77

Taking the specified columns

# take the columns at integer positions 4, 2, 0
df.take([4, 2, 0], axis=1)

Output:

	e	c	a
0	44	73	41
1	50	8	83
2	64	86	96
3	40	21	32
4	39	43	68
5	89	68	34
6	84	71	69
7	77	81	32
8	80	91	51
9	51	47	50
10	74	18	16
11	97	63	58
12	77	27	44
13	10	35	11
14	57	71	71
15	82	69	83
16	85	98	62
17	53	23	91
18	97	91	75
19	94	22	17

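Random shuffling

When the DataFrame is large, np.random.permutation(x) can be combined with take() to implement random sampling. np.random.permutation(x) returns the integers 0 to x-1 in random order as an array.

Generating a random permutation

rand_arr = np.random.permutation(20)

Output:

array([ 6, 7, 12, 17, 9, 8, 4, 18, 3, 1, 13, 15, 2, 10, 11, 5, 16, 0, 19, 14])

Taking all the data in shuffled order

# shuffle both rows and columns
df.take(rand_arr, axis=0).take(np.random.permutation(5), axis=1)

Output: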

	c	d	a	e	b
10	18	23	16	74	9
9	47	9	50	51	40
18	91	92	75	97	73
12	27	9	44	77	33
7	81	53	32	77	67
4	43	52	68	39	5
8	91	64	51	80	5
0	73	23	41	44	16
11	63	84	58	97	91
1	8	35	83	50	76
19	22	44	17	94	56
13	35	61	11	10	41
16	98	19	62	85	71
14	71	20	71	57	87
5	68	70	34	89	59
6	71	96	69	84	47
2	86	48	96	64	76
15	69	41	83	82	2
3	21	38	32	40	91
17	23	44	91	53	88
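Shuffling everything and keeping only the first few positions gives a random sample without replacement. The sketch below shows that idea, along with pandas' built-in DataFrame.sample as an alternative. The sample size of 5 and the seed are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

seed = 2  # arbitrary, only to make the sketch repeatable
np.random.seed(seed)
df = pd.DataFrame(np.random.randint(1, 100, size=(20, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])

# Shuffle all row positions, then keep the first 5: a sample without replacement.
positions = np.random.permutation(len(df))[:5]
sample_take = df.take(positions, axis=0)

# Built-in alternative: sample() draws n rows; random_state makes it repeatable.
sample_builtin = df.sample(n=5, random_state=seed)

print(sample_take)
print(sample_builtin)
```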

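Grouping data

Data aggregation is the final step of data processing. It usually reduces each array to a single value.

Grouped processing consists of:

Split: divide the data into groups
Apply: apply a function to the data of each group to transform it
Combine: merge the results obtained from the groups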
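The three steps can be sketched on a tiny table. The column names item and price and the data are made up for this illustration; agg() with a list of function names is one way to run several aggregations at once.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['apple', 'pear', 'apple', 'pear'],
    'price': [4.0, 3.0, 5.5, 3.5],
})

# Split: one group per distinct value of 'item'.
grouped = df.groupby('item')

# Apply and combine: the mean of each group, collected into one Series.
means = grouped['price'].mean()

# Several aggregations at once produce one column per function.
summary = grouped['price'].agg(['min', 'max', 'mean'])

print(means)
print(summary)
```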

The core tools for grouped processing:

- the groupby() function
- the groups attribute, which shows how the rows were grouped
- e.g. df.groupby(by='item').groups

Grouping

Original data

dic = {
    '水果' : ['苹果', '梨','苹果', '草莓', '香蕉', '梨', '香蕉', '草莓'],
    '价格' : [4, 3, 5.5, 12, 3.2, 3.5, 2.5, 10],
    '颜色' : ['绿色', '绿色', '红色', '红色', '绿色', '黄色', '绿色', '黄色'],
    '重量' : [12, 10, 9, 15, 19, 20, 17, 13]
}
df = DataFrame(data=dic)
df

Output:

	水果	价格	颜色	重量
0	苹果	4.0	绿色	12
1	梨	3.0	绿色	10
2	苹果	5.5	红色	9
3	草莓	12.0	红色	15
4	香蕉	3.2	绿色	19
5	梨	3.5	黄色	20
6	香蕉	2.5	绿色	17
7	草莓	10.0	黄色	13

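Grouping by fruit

# groupby splits along the rows by default (axis=0); the axis argument is deprecated in recent pandas, so it is omitted
df.groupby(by='水果').groups

Output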

{'梨': Int64Index([1, 5], dtype='int64'),
 '苹果': Int64Index([0, 2], dtype='int64'),
 '草莓': Int64Index([3, 7], dtype='int64'),
 '香蕉': Int64Index([4, 6], dtype='int64')}

Grouping by color

# again split along the rows, the default
df.groupby(by='颜色').groups

Output

{'红色': Int64Index([2, 3], dtype='int64'),
 '绿色': Int64Index([0, 1, 4, 6], dtype='int64'),
 '黄色': Int64Index([5, 7], dtype='int64')}

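Aggregation

After grouping, the next step is to aggregate statistics

Mean per fruit

# only the 价格 and 重量 columns are numeric; numeric_only=True skips the others,
# which recent pandas versions would otherwise reject with an error
df.groupby(by='水果').mean(numeric_only=True)

Output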

	价格	重量
水果
梨	3.25	15.0
苹果	4.75	10.5
草莓	11.00	14.0
香蕉	2.85	18.0

Mean price per fruit

df.groupby(by='水果')['价格'].mean()
# df.groupby(by='水果').mean(numeric_only=True)['价格']

Output

水果
梨      3.25
苹果     4.75
草莓    11.00
香蕉     2.85
Name: 价格, dtype: float64

Mean price of apples

# first compute the mean price of each fruit
# then select by label
df.groupby(by='水果')['价格'].mean()['苹果']
# or select the apple mean by integer position
# df.groupby(by='水果')['价格'].mean().iloc[1]

Adding a mean-price column for each fruit

Computing the mean price per fruit

# the new column will hold the mean price of each row's fruit
prices = df.groupby('水果')['价格'].mean()
prices

Output

水果
梨      3.25
苹果     4.75
草莓    11.00
香蕉     2.85
Name: 价格, dtype: float64

Converting the mean prices to a dictionary

dic_prices = prices.to_dict()
dic_prices

Output:

{'梨': 3.25, '苹果': 4.75, '草莓': 11.0, '香蕉': 2.85}

Mapping the dictionary into a new column with map

df['平均价'] = df['水果'].map(dic_prices)
df

Output:

	水果	价格	颜色	重量	平均价
0	苹果	4.0	绿色	12	4.75
1	梨	3.0	绿色	10	3.25
2	苹果	5.5	红色	9	4.75
3	草莓	12.0	红色	15	11.00
4	香蕉	3.2	绿色	19	2.85
5	梨	3.5	黄色	20	3.25
6	香蕉	2.5	绿色	17	2.85
7	草莓	10.0	黄色	13	11.00

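Advanced aggregation

After grouping with groupby, transform and apply let you supply a custom function for further computation

df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
Both transform and apply run a computation; pass the function to transform or apply
Both transform and apply also accept a lambda expression

Computing the mean price per fruit with apply

# compute the mean price
def calc_avg(s):
    # running total for the current fruit (named total to avoid shadowing the built-in sum)
    total = 0
    # add up the price of every item of the current fruit
    for i in s:
        total += i
    # total price divided by the number of items is the mean price of this fruit
    return total/s.size
avg_price = df.groupby(by='水果')['价格'].apply(calc_avg)
print(avg_price)

Output: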

水果
梨      3.25
苹果     4.75
草莓    11.00
香蕉     2.85
Name: 价格, dtype: float64
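The section mentions transform alongside apply but only demonstrates apply. The sketch below shows the difference using the fruit data from this article. apply with a reducing function returns one value per group, while transform returns one value per original row. That makes transform a shortcut for the groupby, to_dict, and map sequence used earlier to build the 平均价 column. Passing the string 'mean' to transform is standard pandas usage.

```python
import pandas as pd

df = pd.DataFrame({
    '水果': ['苹果', '梨', '苹果', '草莓', '香蕉', '梨', '香蕉', '草莓'],
    '价格': [4, 3, 5.5, 12, 3.2, 3.5, 2.5, 10],
})

# apply with a reducing function: one result per group (4 fruits, 4 values).
per_group = df.groupby('水果')['价格'].apply(lambda s: s.sum() / s.size)

# transform: the group result is broadcast back to every row (8 rows, 8 values).
df['平均价'] = df.groupby('水果')['价格'].transform('mean')

print(per_group)
print(df)
```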
