[Notes] A First Look at Pandas

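A Series is a fixed-length, ordered, dict-like sequence.

Its data is stored as what amounts to two ndarrays, one holding the index and one holding the values. That is its biggest difference from a dict, where the number of elements is not fixed.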

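A Series has two basic attributes: index and values. By default the index is the increasing integer sequence 0, 1, 2, ...; you can also specify your own, for example index = ['a', 'b', 'c', 'd'].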

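import pandas as pd
from pandas import Series, DataFrame

# Default integer index 0, 1, 2, 3
x1 = Series([1, 2, 3, 4])
# Explicit index labels
x2 = Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Create a Series from a dict: the keys become the index
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
x3 = Series(d)

print(x1)
print(x2)
print(x3)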

Output:

0    1
1    2
2    3
3    4
dtype: int64
a    1
b    2
c    3
d    4
dtype: int64
a    1
b    2
c    3
d    4
dtype: int64
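
The two attributes can also be inspected directly. The following is a minimal sketch; the variable names s_default and s_labeled are illustrative only and do not come from the notes.

```python
import numpy as np
import pandas as pd

# Default index: the integers 0, 1, 2, 3
s_default = pd.Series([1, 2, 3, 4])

# Custom index supplied by the caller
s_labeled = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

print(list(s_default.index))   # integer positions
print(list(s_labeled.index))   # the labels passed in
print(s_labeled.values)        # the underlying NumPy array
print(s_labeled['b'])          # label-based lookup, much like a dict
```

Looking a value up by label is what makes a Series feel like a dict, while values exposes the array side.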

A DataFrame is a data structure similar to a database table

import pandas as pd
from pandas import Series, DataFrame

data = {'Chinese': [66, 95, 93, 90, 80],
        'English': [65, 85, 92, 88, 90],
        'Math': [30, 98, 96, 77, 90]}
df1 = DataFrame(data)
df2 = DataFrame(data,
                index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei'],
                columns=['English', 'Math', 'Chinese'])
print(df1)
print(df2)

Output:

   Chinese  English  Math
0       66       65    30
1       95       85    98
2       93       92    96
3       90       88    77
4       80       90    90
            English  Math  Chinese
ZhangFei         65    30       66
GuanYu           85    98       95
ZhaoYun          92    96       93
HuangZhong       88    77       90
DianWei          90    90       80
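
To make the table analogy more concrete, here is a small sketch of table-style access: columns behave like fields, index labels like row keys, and a boolean filter plays roughly the role of a SQL WHERE clause. The variable names are illustrative, and the SQL comparison is only an analogy.

```python
import pandas as pd

data = {'Chinese': [66, 95, 93, 90, 80],
        'English': [65, 85, 92, 88, 90],
        'Math': [30, 98, 96, 77, 90]}
names = ['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei']
df = pd.DataFrame(data, index=names)

print(df.shape)                   # (number of rows, number of columns)
print(list(df.columns))           # column names, like table fields
math_scores = df['Math']          # one column, returned as a Series
guanyu = df.loc['GuanYu']         # one row, looked up by index label
high_math = df[df['Math'] > 90]   # keep only rows whose Math score exceeds 90
print(high_math)
```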

Importing and exporting data

import pandas as pd
from pandas import Series, DataFrame

data = {'Chinese': [66, 95, 93, 90, 80],
        'English': [65, 85, 92, 88, 90],
        'Math': [30, 98, 96, 77, 90]}
score = DataFrame(data,
                  index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei'],
                  columns=['English', 'Math', 'Chinese'])

# Export to Excel (writing .xlsx needs an engine such as openpyxl installed)
score.to_excel('data.xlsx')

# Import the Excel sheet; index_col=0 restores the names as the row index
inputScores = pd.read_excel('data.xlsx', index_col=0)
print(inputScores)

Output:

            English  Math  Chinese
ZhangFei         65    30       66
GuanYu           85    98       95
ZhaoYun          92    96       93
HuangZhong       88    77       90
DianWei          90    90       80
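
The same round trip works for CSV files, which need no extra Excel engine. This is a minimal sketch rather than part of the original notes: the file name data.csv is hypothetical, the table is shortened to two rows, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile
import pandas as pd

score = pd.DataFrame(
    {'English': [65, 85], 'Math': [30, 98], 'Chinese': [66, 95]},
    index=['ZhangFei', 'GuanYu'],
)

# Write into a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'data.csv')  # hypothetical file name
    score.to_csv(csv_path)
    # index_col=0 restores the names as the row index
    restored = pd.read_csv(csv_path, index_col=0)

print(restored)
```

Without index_col=0 the names would come back as an ordinary column and the rows would get a fresh integer index.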

Data cleaning

Dropping rows or columns from a DataFrame

import pandas as pd
from pandas import Series, DataFrame

data = {'Chinese': [66, 95, 93, 90, 80],
        'English': [65, 85, 92, 88, 90],
        'Math': [30, 98, 96, 77, 90]}
df2 = DataFrame(data,
                index=['ZhangFei', 'GuanYu', 'ZhaoYun', 'HuangZhong', 'DianWei'],
                columns=['English', 'Math', 'Chinese'])

# Drop a column
df2 = df2.drop(columns=['Chinese'])
print(df2)

# Drop a row
df2 = df2.drop(index=['ZhangFei'])
print(df2)

Output:

            English  Math
ZhangFei         65    30
GuanYu           85    98
ZhaoYun          92    96
HuangZhong       88    77
DianWei          90    90

            English  Math
GuanYu           85    98
ZhaoYun          92    96
HuangZhong       88    77
DianWei          90    90
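
The drop method also accepts a label together with an axis argument, and it returns a new DataFrame instead of modifying the original. A short sketch of both points, using a reduced three-row table and illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame(
    {'English': [65, 85, 92], 'Math': [30, 98, 96], 'Chinese': [66, 95, 93]},
    index=['ZhangFei', 'GuanYu', 'ZhaoYun'],
)

# drop returns a new DataFrame; df itself is left unchanged
no_chinese = df.drop('Chinese', axis=1)    # same effect as drop(columns=['Chinese'])
no_zhangfei = df.drop('ZhangFei', axis=0)  # same effect as drop(index=['ZhangFei'])

# Rows and columns can also be dropped in a single call
trimmed = df.drop(index=['ZhangFei'], columns=['Chinese'])
print(trimmed)
```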

Renaming columns

# Continues from the code above
# Rename the columns in place
df2.rename(columns={'Math': 'Shuxue', 'English': 'Yingyu'}, inplace=True)
print(df2)

Output:

            Yingyu  Shuxue
GuanYu          85      98
ZhaoYun         92      96
HuangZhong      88      77
DianWei         90      90
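
If the original column names should be kept around, rename can be called without inplace=True, in which case it returns a renamed copy. A minimal sketch with a reduced two-row table; the 'Physics' key is a made-up name used only to show that unmatched keys are ignored by default:

```python
import pandas as pd

df = pd.DataFrame({'English': [85, 92], 'Math': [98, 96]},
                  index=['GuanYu', 'ZhaoYun'])

# Without inplace=True, rename returns a renamed copy and leaves df alone
renamed = df.rename(columns={'Math': 'Shuxue', 'English': 'Yingyu'})

# Keys that match no existing column are silently ignored by default
unchanged = df.rename(columns={'Physics': 'Wuli'})

print(list(renamed.columns))
print(list(df.columns))
```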

Removing duplicate values

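# Continues from the code above
duplicateDf = DataFrame({'Shuxue': [90], 'Yingyu': [90]}, index=['Weiyan'])
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df2 = pd.concat([df2, duplicateDf], sort=True)
print(df2)
df2 = df2.drop_duplicates()  # drop duplicate rows (index labels are not compared)
print(df2)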

Output:

            Shuxue  Yingyu
GuanYu          98      85
ZhaoYun         96      92
HuangZhong      77      88
DianWei         90      90
Weiyan          90      90
            Shuxue  Yingyu
GuanYu          98      85
ZhaoYun         96      92
HuangZhong      77      88
DianWei         90      90
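
The Weiyan row disappears because its values repeat the DianWei row; index labels play no part in the comparison. The sketch below, on a reduced three-row table with illustrative variable names, shows how duplicated() exposes that decision and how the keep and subset parameters change it:

```python
import pandas as pd

df = pd.DataFrame(
    {'Shuxue': [98, 90, 90], 'Yingyu': [85, 90, 90]},
    index=['GuanYu', 'DianWei', 'Weiyan'],
)

# duplicated() marks rows whose values repeat an earlier row
mask = df.duplicated()
print(mask)

keep_first = df.drop_duplicates()                  # default: keep the first occurrence
keep_last = df.drop_duplicates(keep='last')        # keep the last occurrence instead
by_shuxue = df.drop_duplicates(subset=['Shuxue'])  # compare only the Shuxue column
print(keep_last)
```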

Format issues

Changing the data format
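
One common case is converting a column to a different dtype with astype. The following is only a sketch under an assumption: the scores are imagined to have arrived as strings, which is hypothetical data and not something shown in the notes above.

```python
import pandas as pd

# Hypothetical data: scores that arrived as strings
df = pd.DataFrame({'Chinese': ['66', '95', '93']},
                  index=['ZhangFei', 'GuanYu', 'ZhaoYun'])

# Convert the column from strings to 64-bit integers
df['Chinese'] = df['Chinese'].astype('int64')
print(df.dtypes)
print(df['Chinese'].sum())
```

Once the column is numeric, arithmetic such as sum() adds the scores instead of concatenating strings.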
