Series Basics: Introductory Notes

Posted on 2023-01-05, updated on 2024-01-08, filed under rd, python

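In Python's pandas library, a Series is a one-dimensional labeled array. It can hold any data type (numbers, strings, booleans, and so on), and every element has an associated label, or index.

Series offers many useful methods and attributes that make it easy to process and analyze data. For example, you can use the index to access or modify data, filter data with conditions, compute descriptive statistics, or combine a Series with other pandas data structures (such as DataFrame) through merging and concatenation.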

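A one-dimensional array object

A Series is an object similar to a one-dimensional array. It consists of a set of data and a set of data labels (the index) associated with that data.

Creating a Series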

In [3]: import numpy as np

In [4]: import pandas as pd

In [5]: pd.Series([2, 3, -4, 5])
Out[5]:
0    2
1    3
2   -4
3    5
dtype: int64

In [6]: pd.Series(np.arange(10))
Out[6]:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

In [7]: pd.Series([2, 3, -4, 5], index=['a', 'b', 'c', 'd'])
Out[7]:
a    2
b    3
c   -4
d    5
dtype: int64

In [8]: pd.Series({'a':1, 'b':2})
Out[8]:
a    1
b    2
dtype: int64

In [9]: pd.Series(0, index=['a', 'b', 'c', 'd'])
Out[9]:
a    0
b    0
c    0
d    0
dtype: int64

Getting the values array and the index array

The values attribute and the index attribute

In [27]: s
Out[27]:
a    5
b    3
c    6
d    4
e    1
dtype: int32

# Get the index
In [28]: s.keys()
Out[28]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [37]: s.index
Out[37]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [38]: s.index[2]
Out[38]: 'c'

# Get the values
In [41]: s.values
Out[41]: array([5, 3, 6, 4, 1])

In [42]: s.values[3]
Out[42]: 4
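
The session above uses a Series named s without showing how it was built. A minimal sketch, assuming s holds the five values shown (the construction is an assumption, not the original author's code), might read the labels and the data like this. It uses to_numpy(), which the pandas documentation recommends over values for getting the underlying array.

```python
import pandas as pd

# Hypothetical data matching the session above; the post does not show how `s` was created.
s = pd.Series([5, 3, 6, 4, 1], index=["a", "b", "c", "d", "e"])

labels = list(s.index)   # index labels as a plain Python list
values = s.to_numpy()    # underlying data as a NumPy array

print(labels)
print(values)
print(s.index[2], values[3])
```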

A Series behaves like a cross between a list (array) and a dict

In [30]: s
Out[30]:
a    5
b    3
c    6
d    4
e    1
dtype: int32

# Check whether a key exists
In [31]: 'b' in s
Out[31]: True

# Get an element by its label key
In [32]: s['b']
Out[32]: 3

# Get an element by attribute-style access on the label
In [33]: s.b
Out[33]: 3

# Get an element by integer position (newer pandas releases deprecate this form; prefer s.iloc[1])
In [34]: s[1]
Out[34]: 3

# Iterate over the values
In [35]: for i in s:
    ...:     print(i)
    ...:
5
3
6
4
1
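
To make the dict-like side a little more concrete, here is a small sketch. The data is the same hypothetical s as in the session, and get() and items() are regular Series methods. One point worth noticing is that the in operator tests the labels, not the values.

```python
import pandas as pd

s = pd.Series([5, 3, 6, 4, 1], index=["a", "b", "c", "d", "e"])

has_b = "b" in s             # membership tests the index labels, like dict keys
has_3 = 3 in s.to_numpy()    # to test the values, look in the data instead
fallback = s.get("z", -1)    # dict-style lookup with a default for a missing label
pairs = list(s.items())      # (label, value) pairs, like dict.items()

print(has_b, has_3, fallback)
print(pairs[:2])
```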

Features

Series supports array features (positional indexing)

Creating a `Series` from an `ndarray`: `Series(arr)`

In [45]: pd.Series(np.random.randint(1, 10, 5), index=['a', 'b', 'c', 'd', 'e'])
Out[45]:
a    4
b    1
c    9
d    8
e    4
dtype: int32

Arithmetic with a scalar

In [46]: s
Out[46]:
a    3
b    8
c    4
d    2
e    7
dtype: int32

In [47]: s*2
Out[47]:
a     6
b    16
c     8
d     4
e    14
dtype: int32

Arithmetic between two Series

In [50]: s1
Out[50]:
a    4
b    9
c    3
d    6
e    6
dtype: int32

In [51]: s2
Out[51]:
a    7
b    2
c    2
d    8
e    3
dtype: int32

In [52]: s1+s2
Out[52]:
a    11
b    11
c     5
d    14
e     9
dtype: int32
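
The two sessions above can be reproduced as a script. This is a sketch that hard-codes the values printed for s1 and s2 (the post generated its data elsewhere, so the construction here is an assumption).

```python
import pandas as pd

labels = ["a", "b", "c", "d", "e"]
s1 = pd.Series([4, 9, 3, 6, 6], index=labels)
s2 = pd.Series([7, 2, 2, 8, 3], index=labels)

doubled = s1 * 2   # the scalar is applied to every element
total = s1 + s2    # values that share a label are added together

print(doubled.tolist())
print(total.tolist())
```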

Indexing

In [53]: s
Out[53]:
a    3
b    8
c    4
d    2
e    7
dtype: int32


# Get an element by integer position (newer pandas releases deprecate this form; prefer s.iloc[2])
In [54]: s[2]
Out[54]: 4

# Get an element by attribute-style access
In [55]: s.c
Out[55]: 4

# Get an element by label
In [56]: s['c']
Out[56]: 4

# Positional slice (start included, end excluded)
In [57]: s[0:2]
Out[57]:
a    3
b    8
dtype: int32

# Label slice (both ends included)
In [58]: s['a':'c']
Out[58]:
a    3
b    8
c    4
dtype: int32

# Select specific positions (newer pandas releases deprecate this form; prefer s.iloc[[0, 2, 3]])
In [59]: s[[0, 2, 3]]
Out[59]:
a    3
c    4
d    2
dtype: int32

# Select specific labels
In [60]: s[['a', 'c', 'd']]
Out[60]:
a    3
c    4
d    2
dtype: int32
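
The same lookups can be written with explicit accessors, which avoids any ambiguity between positions and labels. The following is a sketch using the values of s shown above; loc and iloc are the standard pandas accessors, and the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([3, 8, 4, 2, 7], index=["a", "b", "c", "d", "e"])

by_position = s.iloc[2]        # third element
by_label = s.loc["c"]          # element labelled "c"
pos_slice = s.iloc[0:2]        # end excluded: a, b
label_slice = s.loc["a":"c"]   # end included: a, b, c
picked = s.iloc[[0, 2, 3]]     # a list of positions

print(by_position, by_label)
print(list(pos_slice.index), list(label_slice.index))
print(picked.tolist())
```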

Universal functions

In [64]: np.mean(s)
Out[64]: 4.8
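
As a small sketch of how NumPy functions behave on a Series (using the values of s from the indexing section), a reduction such as np.mean returns a single number, while an element-wise function such as np.square returns a new Series that keeps the index.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 8, 4, 2, 7], index=["a", "b", "c", "d", "e"])

mean_value = np.mean(s)   # reduction: a single number
squares = np.square(s)    # element-wise: still a Series with the same labels

print(mean_value)
print(squares.tolist())
```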

Boolean filtering

In [67]: s[s>5]
Out[67]:
b    8
e    7
dtype: int32
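
The expression s > 5 is itself a boolean Series, and that mask is what selects the rows. A sketch with the same values follows; combining conditions with & is shown as an extra illustration that is not in the original session.

```python
import pandas as pd

s = pd.Series([3, 8, 4, 2, 7], index=["a", "b", "c", "d", "e"])

mask = s > 5                       # boolean Series with the same index as s
large = s[mask]                    # keeps b and e
mid_range = s[(s > 2) & (s < 8)]   # wrap each condition in parentheses when combining

print(mask.tolist())
print(list(large.index), list(mid_range.index))
```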

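Integer indexes

pandas objects with an integer index often drive beginners crazy.
If the index is of integer type, looking up a value with an integer is always label-oriented.
Solution: the loc attribute (interprets the key as a label) and the iloc attribute (interprets the key as a position)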

In [85]: s
Out[85]:
0     22
1     64
2     39
3     66
4      7
5     75
6     74
7     75
8     19
9     72
10    38
11    94
12    34
13    71
14    58
15    30
16    89
17    37
18    39
19    21
dtype: int32

# Slice out a new Series
In [86]: s1 = s[10:].copy()

In [87]: s1
Out[87]:
10    38
11    94
12    34
13    71
14    58
15    30
16    89
17    37
18    39
19    21
dtype: int32

# Try to get the first element by position (raises an error)
In [88]: s1[0]
KeyError: 0
# With an integer index, pandas interprets an integer key as a label, so 0 or 10 is looked up as a label
In [89]: s1[10]
Out[89]: 38

The best solution here is to tell pandas explicitly whether we want a label or a position: loc means the key is a label, and iloc means the key is an integer position.

In [91]: s1
Out[91]:
10    38
11    94
12    34
13    71
14    58
15    30
16    89
17    37
18    39
19    21
dtype: int32

# Get the element whose label is 10
In [92]: s1.loc[10]
Out[92]: 38

# Get the element whose label is 19
In [93]: s1.loc[19]
Out[93]: 21

# Get the element at the first position
In [94]: s1.iloc[0]
Out[94]: 38

# Get the element at the last position
In [95]: s1.iloc[-1]
Out[95]: 21

# Positional slice
In [96]: s1.iloc[3:6]
Out[96]:
13    71
14    58
15    30
dtype: int32

# Label slice
In [97]: s1.loc[13:16]
Out[97]:
13    71
14    58
15    30
16    89
dtype: int32
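
The whole pitfall fits in a few lines. This sketch uses made-up values (100 to 119) rather than the random numbers from the session, and looks up the missing label through loc so that the behaviour does not depend on the pandas version.

```python
import pandas as pd

s = pd.Series(range(100, 120))   # default integer index 0..19; values are made up
s1 = s.iloc[10:].copy()          # the labels 10..19 are kept, not renumbered

try:
    s1.loc[0]                    # there is no label 0 in s1
    missing_label = False
except KeyError:
    missing_label = True

first = s1.iloc[0]               # position 0 is the row labelled 10
same = s1.loc[10]                # label 10
by_pos = s1.iloc[3:6]            # positions 3, 4, 5 (end excluded)
by_label = s1.loc[13:16]         # labels 13 through 16 (end included)

print(missing_label, first, same)
print(len(by_pos), len(by_label))
```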

Data alignment

When pandas performs an operation on two Series objects, it aligns them by index first and then computes.

In [101]: s1 = pd.Series([2, 6, 3], index = ['a', 'b', 'c'])

In [102]: s2 = pd.Series([8, 5, 6], index = ['c', 'a', 'b'])

In [103]: s1
Out[103]:
a    2
b    6
c    3
dtype: int64

In [104]: s2
Out[104]:
c    8
a    5
b    6
dtype: int64

# The indexes are aligned automatically: label a is added to label a, label b to label b, and so on
In [105]: s1+s2
Out[105]:
a     7
b    12
c    11
dtype: int64

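If the indexes of the two Series objects are not identical, the index of the result is the union of the two operands' indexes.

If only one of the objects has a value at a given label, the result at that label is NaN (a missing value).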

In [106]: s1 = pd.Series([2, 6, 3, 5], index = ['a', 'b', 'c', 'd'])

In [107]: s2 = pd.Series([8, 5, 6], index = ['c', 'a', 'b'])

In [108]: s1
Out[108]:
a    2
b    6
c    3
d    5
dtype: int64

In [109]: s2
Out[109]:
c    8
a    5
b    6
dtype: int64
# Adding two Series whose indexes differ: a label present in only one of them yields NaN
In [110]: s1+s2
Out[110]:
a     7.0
b    12.0
c    11.0
d     NaN
dtype: float64


In [111]: s1 = pd.Series([2, 6, 3], index = ['a', 'b', 'c'])

In [112]: s2 = pd.Series([8, 5, 6], index = ['c', 'd', 'b'])

In [113]: s1
Out[113]:
a    2
b    6
c    3
dtype: int64

In [114]: s2
Out[114]:
c    8
d    5
b    6
dtype: int64
# Adding two Series whose indexes differ: a label present in only one of them yields NaN
In [115]: s1+s2
Out[115]:
a     NaN
b    12.0
c    11.0
d     NaN
dtype: float64
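
The last example can be checked in script form. The sketch below rebuilds s1 and s2 from the session and compares the result's index with Index.union, which is used here only to illustrate the "union of the indexes" rule.

```python
import pandas as pd

s1 = pd.Series([2, 6, 3], index=["a", "b", "c"])
s2 = pd.Series([8, 5, 6], index=["c", "d", "b"])

result = s1 + s2
union = s1.index.union(s2.index)   # sorted union of both indexes

print(list(result.index))
print(result.isna().tolist())      # True where only one side had a value
```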

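What if we fill the missing labels with 0 so that the addition can still go through?

That works, but not with the s1 + s2 syntax; use the built-in add method instead.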

In [134]: a = pd.Series([1, 11, 111], index=['c', 'a', 'd'])

In [135]: b = pd.Series([2, 22, 222], index=['b', 'c', 'a'])

In [136]: a+b
Out[136]:
a    233.0
b      NaN
c     23.0
d      NaN
dtype: float64
# Where a Series lacks the label, its side of the operation is filled with 0
In [137]: a.add(b, fill_value=0)
Out[137]:
a    233.0
b      2.0
c     23.0
d    111.0
dtype: float64

That is the normal case. There is also a special case: what is the result when one Series lacks a label and the other holds NaN at that label?

In [138]: a = pd.Series([1, 11, 111], index=['c', 'a', 'd'])

In [139]: b = pd.Series([np.nan, 22, 222], index=['b', 'c', 'a'])

In [140]: a
Out[140]:
c      1
a     11
d    111
dtype: int64

In [141]: b
Out[141]:
b      NaN
c     22.0
a    222.0
dtype: float64

In [142]: a.add(b, fill_value=0)
Out[142]:
a    233.0
b      NaN
c     23.0
d    111.0
dtype: float64

As you can see, this amounts to neither Series having a value at label b, so there is nothing to compute and the result is still NaN.

The same approach is not limited to add; the built-in sub, div, and mul methods work the same way.
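
A short sketch with the same a and b as above shows sub and mul taking fill_value. Using 1 as the fill value for multiplication is a choice made for this illustration, so that a label missing on one side leaves the other side's value unchanged.

```python
import pandas as pd

a = pd.Series([1, 11, 111], index=["c", "a", "d"])
b = pd.Series([2, 22, 222], index=["b", "c", "a"])

diff = a.sub(b, fill_value=0)   # a missing side counts as 0
prod = a.mul(b, fill_value=1)   # a missing side counts as 1

print(diff.tolist())
print(prod.tolist())
```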

Missing data

pandas uses NaN (Not a Number) to represent missing data; its value is np.nan. The built-in None is also treated as NaN.
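
A quick sketch of that behaviour, with made-up values: None placed among numbers is stored as NaN, and because NaN never compares equal to itself, missing values are detected with isna() (an alias of isnull()) rather than with ==.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, None, 3])    # None among numbers is stored as NaN
flags = s.isna().tolist()      # detect missing values
nan_equal = np.nan == np.nan   # NaN is not equal to itself

print(s.dtype)
print(flags, nan_equal)
```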

Methods for handling missing data

dropna() drops the rows whose value is NaN

In [145]: s
Out[145]:
a    233.0
b      NaN
c     23.0
d      NaN
dtype: float64
# Drop the NaN rows
In [146]: s.dropna()
Out[146]:
a    233.0
c     23.0
dtype: float64

fillna() fills in missing data

In [147]: s
Out[147]:
a    233.0
b      NaN
c     23.0
d      NaN
dtype: float64

# Fill the NaN rows with 999
In [148]: s.fillna(999)
Out[148]:
a    233.0
b    999.0
c     23.0
d    999.0
dtype: float64

isnull() returns a boolean Series in which missing values are True

In [149]: s
Out[149]:
a    233.0
b      NaN
c     23.0
d      NaN
dtype: float64

# Check whether each row is NaN
In [150]: s.isnull()
Out[150]:
a    False
b     True
c    False
d     True
dtype: bool

notnull() returns a boolean Series in which missing values are False

In [151]: s
Out[151]:
a    233.0
b      NaN
c     23.0
d      NaN
dtype: float64

# Check whether each row is not NaN
In [152]: s.notnull()
Out[152]:
a     True
b    False
c     True
d    False
dtype: bool

Filtering out missing data

In [153]: s
Out[153]:
a    233.0
b      NaN
c     23.0
d      NaN
dtype: float64

# Filter out missing data with dropna
In [154]: s.dropna()
Out[154]:
a    233.0
c     23.0
dtype: float64

# Filter out missing data with boolean indexing
In [155]: s[s.notnull()]
Out[155]:
a    233.0
c     23.0
dtype: float64

Filling in missing data

In [156]: s
Out[156]:
a    233.0
b      NaN
c     23.0
d      NaN
dtype: float64

# Fill in missing data with fillna
In [157]: s.fillna(999)
Out[157]:
a    233.0
b    999.0
c     23.0
d    999.0
dtype: float64

# Fill missing values with the mean of s (the mean of the non-NaN values, here (233+23)/2)
In [159]: s.fillna(s.mean())
Out[159]:
a    233.0
b    128.0
c     23.0
d    128.0
dtype: float64
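
The mean-fill works because mean() skips NaN by default. The sketch below rebuilds s from the session and contrasts that default with skipna=False, an existing parameter of Series.mean that is shown here only to make the skipping visible.

```python
import numpy as np
import pandas as pd

s = pd.Series([233.0, np.nan, 23.0, np.nan], index=["a", "b", "c", "d"])

mean_value = s.mean()               # NaN is skipped: (233 + 23) / 2
strict_mean = s.mean(skipna=False)  # NaN when any value is missing
filled = s.fillna(mean_value)

print(mean_value, strict_mean)
print(filled.tolist())
```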