Pandas Basic Commands Cheat Sheet
Published: 2019-06-19


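This article is translated and adapted from an existing cheat sheet. Using K-Lab as the working tool, it adds concrete examples and runs every snippet in the cheat sheet.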

Cheat Sheet Contents


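  • [Abbreviations & Importing Libraries]
  • [Importing Data]
  • [Exporting Data]
  • [Creating Test Objects]
  • [Viewing and Inspecting Data]
  • [Selecting Data]
  • [Cleaning Data]
  • [Filtering (filter), Sorting (sort), and Grouping (groupby)]
  • [Joining (join) and Combining (combine)]
  • [Statistics]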
 
Abbreviations & Importing Libraries
 

df --- any pandas DataFrame object

s --- any pandas Series object
pandas and numpy are the most fundamental and most central libraries for data analysis in Python.

In [1]:
import pandas as pd # import pandas under the alias pd
import numpy as np # import numpy under the alias np
 
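Importing Data
 
pd.read_csv(filename) # import data from a CSV file
pd.read_table(filename) # import data from a delimited text file (such as TSV)
pd.read_excel(filename) # import data from an Excel file
pd.read_sql(query, connection_object) # import data from a SQL table/database
pd.read_json(json_string) # import data from a JSON string, URL, or file
pd.read_html(url) # parse a URL and import the tables it contains as a list of DataFrames
pd.read_clipboard() # import data from the system clipboard
pd.DataFrame(dict) # import data from a Python dict: the keys become the column headers, the values the column contents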
In [4]:
pd.read_csv(filename)
pd.read_table(filename)
pd.read_excel(filename)
pd.read_sql(query, connection_object)
pd.read_json(json_string)
pd.read_html(url)
pd.read_clipboard()
pd.DataFrame(dict)
 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
----> 1 pd.read_csv(filename)
      2 pd.read_table(filename)
      3 pd.read_excel(filename)
      4 pd.read_sql(query, connection_object)
      5 pd.read_json(json_string)

NameError: name 'filename' is not defined
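The cell above fails only because filename and the other names are placeholders. As a minimal runnable sketch, assuming a small in-memory CSV in place of a real file (csv_text, df_csv and df_dict are illustrative names, not from the original), the first and last entries of the list can be exercised like this:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object, so StringIO works here.
df_csv = pd.read_csv(io.StringIO(csv_text))

# A dict maps column headers (keys) to column contents (values).
df_dict = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})

print(df_csv)
print(df_dict)
```

Both frames hold the same two rows; only the way the data enters pandas differs.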
 
Exporting Data
 
df.to_csv(filename) # write the data in the DataFrame to a CSV file
df.to_excel(filename) # write the data in the DataFrame to an Excel file
df.to_sql(table_name, connection_object) # write the data in the DataFrame to a SQL table/database
df.to_json(filename) # write the data in the DataFrame to a JSON file
In [5]:
df.to_csv(filename)
df.to_excel(filename)
df.to_sql(table_name, connection_object)
df.to_json(filename)
 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
----> 1 df.to_csv(filename)
      2 df.to_excel(filename)
      3 df.to_sql(table_name, connection_object)
      4 df.to_json(filename)

NameError: name 'df' is not defined
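The export calls need an existing DataFrame, which is why the cell above raises NameError. A small sketch, assuming a made-up two-row frame (df_out and its values are illustrative) and exporting to strings rather than files so nothing is written to disk:

```python
import io

import pandas as pd

# Hypothetical data used only for this illustration.
df_out = pd.DataFrame({"item": ["a", "b"], "price": [1.5, 2.0]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df_out.to_csv(index=False)

# to_json likewise returns a string when no path is given.
json_text = df_out.to_json(orient="records")

# Reading the CSV text back recovers the same rows and columns.
df_back = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json_text)
```

Passing a file name as the first argument writes to disk instead, as in the list above.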
 
Creating Test Objects
 
pd.DataFrame(np.random.rand(10,5)) # create a DataFrame of random floats with 10 rows and 5 columns
In [6]:
pd.DataFrame(np.random.rand(10,5))
Out[6]:
  0 1 2 3 4
0 0.178801 0.846355 0.705159 0.196188 0.874350
1 0.362044 0.390863 0.760347 0.555912 0.689457
2 0.201675 0.673297 0.180532 0.648759 0.483332
3 0.645076 0.932788 0.182940 0.722370 0.542127
4 0.578884 0.839314 0.734570 0.691949 0.538795
5 0.999395 0.383014 0.192030 0.315428 0.940216
6 0.980939 0.475735 0.674909 0.112695 0.961567
7 0.389256 0.855763 0.026823 0.876811 0.274633
8 0.108523 0.267471 0.988235 0.991163 0.271738
9 0.403084 0.935190 0.628058 0.296839 0.386862
In [2]:
pd.DataFrame(np.random.rand(10,5))
Out[2]:
  0 1 2 3 4
0 0.647736 0.372628 0.255864 0.853542 0.613267
1 0.064364 0.156340 0.575021 0.561911 0.479901
2 0.036473 0.876819 0.255325 0.393240 0.543039
3 0.357489 0.006578 0.093966 0.531294 0.029009
4 0.550582 0.504600 0.273546 0.011693 0.052523
5 0.721563 0.170689 0.702163 0.447883 0.905983
6 0.839726 0.935997 0.343133 0.356957 0.377116
7 0.931894 0.026684 0.719148 0.911425 0.676187
8 0.115619 0.114894 0.130696 0.321598 0.170082
9 0.194649 0.526141 0.965442 0.275433 0.880765
 
pd.Series(my_list) # create a Series from an iterable my_list
In [7]:
my_list = ['huang', 100, 'xiaolei',4,56]
pd.Series(my_list)
Out[7]:
0      huang
1        100
2    xiaolei
3          4
4         56
dtype: object
In [3]:
my_list = ['Kesci',100,'欢迎来到科赛网']
pd.Series(my_list)
Out[3]:
0        Kesci
1          100
2    欢迎来到科赛网
dtype: object
 
df.index = pd.date_range('2017/1/1', periods=df.shape[0]) # add a date index
In [4]:
df = pd.DataFrame(np.random.rand(10,5))
df.index = pd.date_range('2017/1/1', periods=df.shape[0])
df
Out[4]:
  0 1 2 3 4
2017-01-01 0.248515 0.647889 0.111346 0.540434 0.159914
2017-01-02 0.445073 0.329843 0.823678 0.737438 0.707598
2017-01-03 0.526543 0.876826 0.717986 0.271920 0.719657
2017-01-04 0.471256 0.657647 0.973484 0.598997 0.249301
2017-01-05 0.958465 0.474331 0.004078 0.842343 0.819295
2017-01-06 0.271308 0.271988 0.434776 0.449652 0.369188
2017-01-07 0.989573 0.928428 0.452436 0.058590 0.732283
2017-01-08 0.435328 0.730214 0.909400 0.683413 0.186820
2017-01-09 0.897414 0.687525 0.122937 0.018102 0.440427
2017-01-10 0.743821 0.134602 0.210326 0.877157 0.815462
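Once the index holds dates, rows can be selected with date strings. A minimal sketch, assuming deterministic data built with np.arange instead of random numbers (df_d and window are illustrative names):

```python
import numpy as np
import pandas as pd

# Ten rows, two columns; row i holds x = 2*i and y = 2*i + 1.
df_d = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["x", "y"])
df_d.index = pd.date_range("2017/1/1", periods=df_d.shape[0])

# Label-based slicing on a date index includes both end points.
window = df_d.loc["2017-01-02":"2017-01-04"]
print(window)
```

The slice returns the three rows for January 2 through January 4.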
 
Viewing and Inspecting Data
 
df.head(n) # view the first n rows of the DataFrame
In [9]:
df = pd.DataFrame(np.random.rand(10, 5))
df.head(5)
Out[9]:
  0 1 2 3 4
0 0.857171 0.900692 0.500228 0.636632 0.395819
1 0.332900 0.856592 0.645121 0.311064 0.836480
2 0.815698 0.667021 0.328536 0.924848 0.400043
3 0.693114 0.551914 0.696962 0.703079 0.645103
4 0.842381 0.466469 0.279249 0.740606 0.941279
In [5]:
df = pd.DataFrame(np.random.rand(10,5))
df.head(3)
Out[5]:
  0 1 2 3 4
0 0.705884 0.845813 0.770585 0.481049 0.381055
1 0.733309 0.542363 0.264334 0.254283 0.859442
2 0.497977 0.474898 0.806073 0.384412 0.242989
 
df.tail(n) # view the last n rows of the DataFrame
In [10]:
df = pd.DataFrame(np.random.rand(15,8))
df.tail(4)
Out[10]:
  0 1 2 3 4 5 6 7
11 0.785491 0.243000 0.991953 0.367337 0.512946 0.740280 0.897460 0.799860
12 0.602312 0.440157 0.985066 0.992641 0.550723 0.387046 0.047515 0.566604
13 0.726211 0.132540 0.302954 0.542220 0.029554 0.963806 0.436351 0.462788
14 0.516992 0.624268 0.423005 0.476461 0.627335 0.635427 0.173666 0.034728
In [6]:
df = pd.DataFrame(np.random.rand(10,5))
df.tail(3)
Out[6]:
  0 1 2 3 4
7 0.617289 0.009801 0.220155 0.992743 0.944472
8 0.261141 0.940925 0.063394 0.052104 0.517853
9 0.634541 0.897483 0.748453 0.805861 0.344938
 
df.shape # view the number of rows and columns of the DataFrame
In [11]:
df = pd.DataFrame(np.random.rand(14, 5))
df.shape
Out[11]:
(14, 5)
In [7]:
df = pd.DataFrame(np.random.rand(10,5))
df.shape
Out[7]:
(10, 5)
 
df.info() # view the index, data types, and memory usage of the DataFrame
In [13]:
df = pd.DataFrame(np.random.rand(10, 4))
df.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
0    10 non-null float64
1    10 non-null float64
2    10 non-null float64
3    10 non-null float64
dtypes: float64(4)
memory usage: 400.0 bytes
In [8]:
df = pd.DataFrame(np.random.rand(10,5))
df.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
0    10 non-null float64
1    10 non-null float64
2    10 non-null float64
3    10 non-null float64
4    10 non-null float64
dtypes: float64(5)
memory usage: 480.0 bytes
 
df.describe() # show descriptive statistics for the numeric columns
In [14]:
df.describe()
Out[14]:
  0 1 2 3
count 10.000000 10.000000 10.000000 10.000000
mean 0.459510 0.467315 0.616311 0.546682
std 0.401191 0.319752 0.304275 0.205285
min 0.017633 0.150638 0.068416 0.160698
25% 0.108201 0.183076 0.535336 0.419520
50% 0.409686 0.381424 0.729697 0.610982
75% 0.846220 0.751856 0.831845 0.688182
max 0.970186 0.959066 0.905394 0.779920
In [9]:
df.describe()
Out[9]:
  0 1 2 3 4
count 10.000000 10.000000 10.000000 10.000000 10.000000
mean 0.410631 0.497585 0.506200 0.322960 0.603119
std 0.280330 0.322573 0.254780 0.260299 0.256370
min 0.043731 0.031742 0.070668 0.044822 0.143786
25% 0.240661 0.211625 0.416827 0.145298 0.422969
50% 0.346297 0.544697 0.479648 0.217359 0.635974
75% 0.493105 0.669044 0.557353 0.468119 0.782573
max 0.937583 0.945573 0.987328 0.883157 0.992891
 
s.value_counts(dropna=False) # count how many times each unique value occurs (NaN included)
In [16]:
s = pd.Series([1,2,5,6,6,6,6,5,5,'huang'])
s.value_counts(dropna=False)
Out[16]:
6        4
5        3
huang    1
2        1
1        1
dtype: int64
In [10]:
s = pd.Series([1,2,3,3,4,np.nan,5,5,5,6,7])
s.value_counts(dropna=False)
Out[10]:
5.0    3
3.0    2
7.0    1
6.0    1
NaN    1
4.0    1
2.0    1
1.0    1
dtype: int64
 
df.apply(pd.Series.value_counts) # count how many times each unique value occurs in every column of the DataFrame
In [19]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
print(df)
df.apply(pd.Series.value_counts)
 
          a         b         c         d         e
0  0.743688  0.081938  0.693243  0.647515  0.835997
1  0.162604  0.421371  0.422371  0.930136  0.732234
2  0.842065  0.139927  0.675018  0.543914  0.017094
3  0.535794  0.078217  0.964779  0.607462  0.432429
4  0.560279  0.544811  0.304371  0.797165  0.505008
5  0.695691  0.696121  0.741812  0.502741  0.484697
6  0.775342  0.410536  0.275251  0.810911  0.081818
7  0.584267  0.917728  0.379231  0.097702  0.622885
8  0.754810  0.809628  0.102337  0.283509  0.615719
9  0.003056  0.536268  0.187236  0.181844  0.255499
Out[19]:
  a b c d e
0.003056 1.0 NaN NaN NaN NaN
0.017094 NaN NaN NaN NaN 1.0
0.078217 NaN 1.0 NaN NaN NaN
0.081818 NaN NaN NaN NaN 1.0
0.081938 NaN 1.0 NaN NaN NaN
0.097702 NaN NaN NaN 1.0 NaN
0.102337 NaN NaN 1.0 NaN NaN
0.139927 NaN 1.0 NaN NaN NaN
0.162604 1.0 NaN NaN NaN NaN
0.181844 NaN NaN NaN 1.0 NaN
0.187236 NaN NaN 1.0 NaN NaN
0.255499 NaN NaN NaN NaN 1.0
0.275251 NaN NaN 1.0 NaN NaN
0.283509 NaN NaN NaN 1.0 NaN
0.304371 NaN NaN 1.0 NaN NaN
0.379231 NaN NaN 1.0 NaN NaN
0.410536 NaN 1.0 NaN NaN NaN
0.421371 NaN 1.0 NaN NaN NaN
0.422371 NaN NaN 1.0 NaN NaN
0.432429 NaN NaN NaN NaN 1.0
0.484697 NaN NaN NaN NaN 1.0
0.502741 NaN NaN NaN 1.0 NaN
0.505008 NaN NaN NaN NaN 1.0
0.535794 1.0 NaN NaN NaN NaN
0.536268 NaN 1.0 NaN NaN NaN
0.543914 NaN NaN NaN 1.0 NaN
0.544811 NaN 1.0 NaN NaN NaN
0.560279 1.0 NaN NaN NaN NaN
0.584267 1.0 NaN NaN NaN NaN
0.607462 NaN NaN NaN 1.0 NaN
0.615719 NaN NaN NaN NaN 1.0
0.622885 NaN NaN NaN NaN 1.0
0.647515 NaN NaN NaN 1.0 NaN
0.675018 NaN NaN 1.0 NaN NaN
0.693243 NaN NaN 1.0 NaN NaN
0.695691 1.0 NaN NaN NaN NaN
0.696121 NaN 1.0 NaN NaN NaN
0.732234 NaN NaN NaN NaN 1.0
0.741812 NaN NaN 1.0 NaN NaN
0.743688 1.0 NaN NaN NaN NaN
0.754810 1.0 NaN NaN NaN NaN
0.775342 1.0 NaN NaN NaN NaN
0.797165 NaN NaN NaN 1.0 NaN
0.809628 NaN 1.0 NaN NaN NaN
0.810911 NaN NaN NaN 1.0 NaN
0.835997 NaN NaN NaN NaN 1.0
0.842065 1.0 NaN NaN NaN NaN
0.917728 NaN 1.0 NaN NaN NaN
0.930136 NaN NaN NaN 1.0 NaN
0.964779 NaN NaN 1.0 NaN NaN
 
Selecting Data
 
df[col] # return the selected column as a Series
In [23]:
df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
df['c']
Out[23]:
0    0.238355
1    0.641129
2    0.716013
3    0.549903
4    0.997134
Name: c, dtype: float64
In [11]:
df = pd.DataFrame(np.random.rand(5,5),columns=list('ABCDE'))
df['C']
Out[11]:
0    0.720965
1    0.360155
2    0.474067
3    0.116206
4    0.774503
Name: C, dtype: float64
 
df[[col1, col2]] # return the selected columns as a new DataFrame
In [25]:
df = pd.DataFrame(np.random.rand(5, 4), columns=list('abcd'))
df[['a','d']]
Out[25]:
  a d
0 0.689811 0.446470
1 0.022796 0.101198
2 0.724498 0.555124
3 0.923610 0.952664
4 0.990061 0.891120
In [12]:
df = pd.DataFrame(np.random.rand(5,5),columns=list('ABCDE'))
df[['B','E']]
Out[12]:
  B E
0 0.205912 0.333909
1 0.475620 0.540206
2 0.144041 0.065117
3 0.636970 0.406317
4 0.451541 0.944245
 
s.iloc[0] # select by position
In [11]:
s = pd.Series(np.array(['huang','xiao','lei']))
print(s)
s.iloc[1]
 
0    huang
1     xiao
2      lei
dtype: object
Out[11]:
'xiao'
In [13]:
s = pd.Series(np.array(['I','Love','Data']))
s.iloc[0]
Out[13]:
'I'
 
s.loc['index_one'] # select by index label
In [10]:
s = pd.Series(np.array(['df','s','df']))
print(s)
s.loc[1]
 
0    df
1     s
2    df
dtype: object
Out[10]:
's'
In [14]:
s = pd.Series(np.array(['I','Love','Data']))
s.loc[1]
Out[14]:
'Love'
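With the default integer index used in the cells above, iloc and loc return the same element, so the difference is easy to miss. A minimal sketch, assuming custom indexes chosen only for illustration (s_lab and s_int are not from the original):

```python
import pandas as pd

# String labels: iloc counts positions, loc looks up labels.
s_lab = pd.Series(["I", "Love", "Data"], index=["x", "y", "z"])
by_position = s_lab.iloc[1]   # second element
by_label = s_lab.loc["z"]     # element labelled 'z'

# Integer labels in reverse order make the two accessors disagree.
s_int = pd.Series(["I", "Love", "Data"], index=[2, 1, 0])
first_by_position = s_int.iloc[0]  # first element in storage order
first_by_label = s_int.loc[0]      # element whose label is 0

print(by_position, by_label, first_by_position, first_by_label)
```

Here s_int.iloc[0] and s_int.loc[0] pick different elements because the label 0 sits in the last position.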
 
df.iloc[0,:] # select the first row
In [24]:
df = pd.DataFrame(np.random.rand(5, 5),columns= list('abcde'))
print(df)
# df.iloc[1, :]
df.loc[1:3]
 
          a         b         c         d         e
0  0.293829  0.636855  0.383047  0.182288  0.991080
1  0.098706  0.984684  0.362848  0.865179  0.191418
2  0.238197  0.027557  0.847372  0.478444  0.286712
3  0.816694  0.886405  0.637459  0.917760  0.218578
4  0.962678  0.322024  0.489059  0.675897  0.024523
Out[24]:
  a b c d e
1 0.098706 0.984684 0.362848 0.865179 0.191418
2 0.238197 0.027557 0.847372 0.478444 0.286712
3 0.816694 0.886405 0.637459 0.917760 0.218578
In [15]:
df = pd.DataFrame(np.random.rand(5,5),columns=list('ABCDE'))
df.iloc[0,:]
Out[15]:
A    0.234156
B    0.513754
C    0.593067
D    0.856575
E    0.291528
Name: 0, dtype: float64
 
df.iloc[0,0] # select the first element of the first row
In [26]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfg'))
print(df)
df.iloc[1,3]
 
          a         s         d         f         g
0  0.819962  0.011747  0.969565  0.467551  0.281303
1  0.741277  0.645715  0.113062  0.495135  0.169768
2  0.862192  0.433940  0.726602  0.692266  0.796443
3  0.701999  0.222973  0.553875  0.253598  0.090833
4  0.354669  0.779308  0.282878  0.729156  0.972402
5  0.310698  0.253160  0.435239  0.465066  0.393626
6  0.449286  0.079748  0.778311  0.651505  0.659701
7  0.621606  0.883868  0.059535  0.015870  0.056286
8  0.762552  0.159625  0.716243  0.179370  0.161484
9  0.695830  0.388746  0.759827  0.325159  0.379626
Out[26]:
0.49513455869985046
In [16]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.iloc[0,0]
Out[16]:
0.91525996455410763
 
Cleaning Data
 
df.columns = ['a','b'] # rename the columns of the DataFrame
In [36]:
df = pd.DataFrame({ 'a':np.array([1,2,5,8,4,3]), 'b':np.array([9,3,7,5,3,4]), 'c':'htl'})
df.columns = ['q','e','r']
df
Out[36]:
  q e r
0 1 9 htl
1 2 3 htl
2 5 7 htl
3 8 5 htl
4 4 3 htl
5 3 4 htl
In [30]:
df = pd.DataFrame({ 'A':np.array([1,np.nan,2,3,6,np.nan]), 'B':np.array([np.nan,4,np.nan,5,9,np.nan]), 'C':'foo'})
df.columns = ['a','b','c']
df
Out[30]:
  a b c
0 1.0 NaN foo
1 NaN 4.0 foo
2 2.0 NaN foo
3 3.0 5.0 foo
4 6.0 9.0 foo
5 NaN NaN foo
 
pd.isnull() # check the data for null values and return an object of booleans (True, False) of the same shape
In [37]:
df = pd.DataFrame({ 'a':np.array([1,np.nan,2,3,6,np.nan]), 'b':np.array([np.nan,4,np.nan,5,9,np.nan]), 'c':'sdf'})
pd.isnull(df)
Out[37]:
  a b c
0 False True False
1 True False False
2 False True False
3 False False False
4 False False False
5 True True False
In [18]:
df = pd.DataFrame({ 'A':np.array([1,np.nan,2,3,6,np.nan]), 'B':np.array([np.nan,4,np.nan,5,9,np.nan]), 'C':'foo'})
pd.isnull(df)
Out[18]:
  A B C
0 False True False
1 True False False
2 False True False
3 False False False
4 False False False
5 True True False
 
pd.notnull() # check the data for non-null values and return an object of booleans (True, False) of the same shape
In [39]:
df = pd.DataFrame({ 'a':np.array([1,np.nan,2,3,4,np.nan]), 'b':np.array([np.nan,4,np.nan,5,9,np.nan]), 'c':'foo' })
pd.notnull(df)
Out[39]:
  a b c
0 True False True
1 False True True
2 True False True
3 True True True
4 True True True
5 False False True
In [40]:
df = pd.DataFrame({ 'A':np.array([1,np.nan,2,3,6,np.nan]), 'B':np.array([np.nan,4,np.nan,5,9,np.nan]), 'C':'foo'})
pd.notnull(df)
df.dropna()
Out[40]:
  A B C
3 3.0 5.0 foo
4 6.0 9.0 foo
 
df.dropna() # drop the rows of the DataFrame that contain null values
In [20]:
df = pd.DataFrame({ 'A':np.array([1,np.nan,2,3,6,np.nan]), 'B':np.array([np.nan,4,np.nan,5,9,np.nan]), 'C':'foo'})
df.dropna()
Out[20]:
  A B C
3 3.0 5.0 foo
4 6.0 9.0 foo
 
df.dropna(axis=1) # drop the columns of the DataFrame that contain null values
In [45]:
df = pd.DataFrame({ 'a':np.array([1,np.nan,2,3,4,np.nan]), 'b':np.array([np.nan,4,np.nan,5,9,np.nan]), 'c':'foo' })
print(df)
df.dropna(axis=1)
 
     a    b    c
0  1.0  NaN  foo
1  NaN  4.0  foo
2  2.0  NaN  foo
3  3.0  5.0  foo
4  4.0  9.0  foo
5  NaN  NaN  foo
Out[45]:
  c
0 foo
1 foo
2 foo
3 foo
4 foo
5 foo
In [21]:
df = pd.DataFrame({ 'A':np.array([1,np.nan,2,3,6,np.nan]), 'B':np.array([np.nan,4,np.nan,5,9,np.nan]), 'C':'foo'})
df.dropna(axis=1)
Out[21]:
  C
0 foo
1 foo
2 foo
3 foo
4 foo
5 foo
 
df.dropna(axis=1,thresh=n) # drop the columns of df that have fewer than n non-null values
In [73]:
df = pd.DataFrame({ 'A':np.array([1,np.nan,2,3,6,np.nan]), 'B':np.array([np.nan,4,np.nan,5,9,np.nan]), 'C':'foo'})
print(df)
df.dropna(axis=1,thresh=3)
 
     A    B    C
0  1.0  NaN  foo
1  NaN  4.0  foo
2  2.0  NaN  foo
3  3.0  5.0  foo
4  6.0  9.0  foo
5  NaN  NaN  foo
Out[73]:
  A B C
0 1.0 NaN foo
1 NaN 4.0 foo
2 2.0 NaN foo
3 3.0 5.0 foo
4 6.0 9.0 foo
5 NaN NaN foo
In [22]:
df = pd.DataFrame({ 'A':np.array([1,np.nan,2,3,6,np.nan]), 'B':np.array([np.nan,4,np.nan,5,9,np.nan]), 'C':'foo'})
test = df.dropna(axis=1,thresh=1)
test
Out[22]:
  A B C
0 1.0 NaN foo
1 NaN 4.0 foo
2 2.0 NaN foo
3 3.0 5.0 foo
4 6.0 9.0 foo
5 NaN NaN foo
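In both cells above every column has at least three non-null values, so nothing is dropped and the effect of thresh stays invisible. A minimal sketch, assuming a variant of the same frame in which column B keeps only one value (df_t, kept_cols and kept_rows are illustrative names):

```python
import numpy as np
import pandas as pd

df_t = pd.DataFrame({
    "A": [1, np.nan, 2, 3, 6, np.nan],                   # 4 non-null values
    "B": [np.nan, 4, np.nan, np.nan, np.nan, np.nan],    # 1 non-null value
    "C": "foo",                                          # 6 non-null values
})

# axis=1: keep only the columns with at least 3 non-null values.
kept_cols = df_t.dropna(axis=1, thresh=3)

# Default axis=0: keep only the rows with at least 2 non-null values.
kept_rows = df_t.dropna(thresh=2)

print(kept_cols.columns.tolist())
print(kept_rows.index.tolist())
```

Column B falls below the threshold and is removed, and the last row, which holds only 'foo', is removed in the row-wise variant.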
 
df.fillna(x) # replace every null value in the DataFrame with x
In [76]:
df = pd.DataFrame({ 'A':np.array([1,np.nan,2,3,6,np.nan]), 'B':np.array([np.nan,4,np.nan,5,9,np.nan]), 'C':'foo'})
print(df)
df.fillna('huang')
 
     A    B    C
0  1.0  NaN  foo
1  NaN  4.0  foo
2  2.0  NaN  foo
3  3.0  5.0  foo
4  6.0  9.0  foo
5  NaN  NaN  foo
Out[76]:
  A B C
0 1 huang foo
1 huang 4 foo
2 2 huang foo
3 3 5 foo
4 6 9 foo
5 huang huang foo
In [23]:
df = pd.DataFrame({ 'A':np.array([1,np.nan,2,3,6,np.nan]), 'B':np.array([np.nan,4,np.nan,5,9,np.nan]), 'C':'foo'})
df.fillna('Test')
Out[23]:
  A B C
0 1 Test foo
1 Test 4 foo
2 2 Test foo
3 3 5 foo
4 6 9 foo
5 Test Test foo
 

s.fillna(s.mean()) # replace every null value with the mean

In [82]:
s = pd.Series([1,3,4,np.nan,7,8,9])
a = s.fillna(s.mean())
print(a)
 
0    1.000000
1    3.000000
2    4.000000
3    5.333333
4    7.000000
5    8.000000
6    9.000000
dtype: float64
In [24]:
s = pd.Series([1,3,5,np.nan,7,9,9])
s.fillna(s.mean())
Out[24]:
0    1.000000
1    3.000000
2    5.000000
3    5.666667
4    7.000000
5    9.000000
6    9.000000
dtype: float64
 
s.astype(float) # convert the data type of the Series to float
In [85]:
s = pd.Series([1,2,4,np.nan,5,6,6])
a = s.fillna(s.mean())
a.astype(int)
Out[85]:
0    1
1    2
2    4
3    4
4    5
5    6
6    6
dtype: int64
In [25]:
s = pd.Series([1,3,5,np.nan,7,9,9])
s.astype(float)
Out[25]:
0    1.0
1    3.0
2    5.0
3    NaN
4    7.0
5    9.0
6    9.0
dtype: float64
 
s.replace(1,'one') # replace every 1 in the Series with 'one'
In [86]:
s = pd.Series([1,2,4,np.nan,5,6,7])
s.replace(1,'yi')
Out[86]:
0     yi
1      2
2      4
3    NaN
4      5
5      6
6      7
dtype: object
In [26]:
s = pd.Series([1,3,5,np.nan,7,9,9])
s.replace(1,'one')
Out[26]:
0    one
1      3
2      5
3    NaN
4      7
5      9
6      9
dtype: object
 
s.replace([1,3],['one','three']) # replace every 1 in the Series with 'one' and every 3 with 'three'
In [87]:
s = pd.Series([1,3,4,np.nan,7,3,5])
s.replace([1,4],['sd', 'dsf'])
Out[87]:
0     sd
1      3
2    dsf
3    NaN
4      7
5      3
6      5
dtype: object
In [27]:
s = pd.Series([1,3,5,np.nan,7,9,9])
s.replace([1,3],['one','three'])
Out[27]:
0      one
1    three
2        5
3      NaN
4        7
5        9
6        9
dtype: object
 
df.rename(columns=lambda x: x + 2) # rename all columns by applying a function to each name
In [20]:
df = pd.DataFrame(np.random.rand(4, 4))
df.rename(columns=lambda x:x+2 )
Out[20]:
  2 3 4 5
0 0.081634 0.064494 0.171152 0.568444
1 0.355771 0.934762 0.634321 0.505097
2 0.544467 0.824562 0.742992 0.937263
3 0.524025 0.620101 0.764900 0.211475
In [28]:
df = pd.DataFrame(np.random.rand(4,4))
df.rename(columns=lambda x: x+ 2)
Out[28]:
  2 3 4 5
0 0.753588 0.137984 0.022013 0.900072
1 0.947073 0.815182 0.769708 0.729688
2 0.334815 0.204315 0.707794 0.437704
3 0.467212 0.738360 0.853463 0.529946
 
df.rename(columns={ 'old_name': 'new_name'}) # rename the selected columns
In [24]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfp'))
df.rename(columns={ 'a':'huang', 'd':'xiao'})
Out[24]:
  huang s xiao f p
0 0.883222 0.073876 0.740827 0.035460 0.929947
1 0.161005 0.276637 0.095228 0.490336 0.433798
2 0.245889 0.763647 0.472240 0.718072 0.260942
3 0.933051 0.400177 0.494481 0.173994 0.800894
4 0.762221 0.170352 0.507960 0.383658 0.533412
5 0.665419 0.515597 0.538217 0.305045 0.072796
6 0.723260 0.661109 0.793995 0.391161 0.724623
7 0.829130 0.896624 0.732372 0.317762 0.745941
8 0.302628 0.320006 0.420980 0.400016 0.556747
9 0.574811 0.952172 0.573045 0.343735 0.930765
In [29]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.rename(columns={ 'A':'newA','C':'newC'})
Out[29]:
  newA B newC D E
0 0.169072 0.694563 0.069313 0.637560 0.475181
1 0.910271 0.800067 0.676448 0.934767 0.025608
2 0.825186 0.451545 0.135421 0.635303 0.419758
3 0.401979 0.510304 0.014901 0.209211 0.121889
4 0.579282 0.001947 0.036519 0.750415 0.453078
5 0.896213 0.557514 0.028147 0.527471 0.575772
6 0.443222 0.095459 0.319582 0.912069 0.781455
7 0.067923 0.590470 0.602999 0.507358 0.703022
8 0.301491 0.682629 0.283103 0.565754 0.089268
9 0.399671 0.925416 0.020578 0.278000 0.591522
 
df.set_index('column_one') # use the given column as the index
In [27]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfg'))
print(df)
df.set_index('a')
 
          a         s         d         f         g
0  0.483397  0.944772  0.678662  0.439009  0.588450
1  0.984601  0.110966  0.331303  0.578410  0.467633
2  0.001784  0.431582  0.593597  0.238572  0.429771
3  0.644358  0.102394  0.935862  0.863739  0.118716
4  0.514392  0.928633  0.750763  0.026851  0.049935
5  0.749309  0.961028  0.383087  0.052621  0.598980
6  0.963810  0.087193  0.569974  0.440941  0.384748
7  0.000576  0.538573  0.171773  0.802815  0.556191
8  0.731837  0.934994  0.998125  0.485058  0.745950
9  0.599032  0.462614  0.234398  0.833158  0.521382
Out[27]:
  s d f g
a        
0.483397 0.944772 0.678662 0.439009 0.588450
0.984601 0.110966 0.331303 0.578410 0.467633
0.001784 0.431582 0.593597 0.238572 0.429771
0.644358 0.102394 0.935862 0.863739 0.118716
0.514392 0.928633 0.750763 0.026851 0.049935
0.749309 0.961028 0.383087 0.052621 0.598980
0.963810 0.087193 0.569974 0.440941 0.384748
0.000576 0.538573 0.171773 0.802815 0.556191
0.731837 0.934994 0.998125 0.485058 0.745950
0.599032 0.462614 0.234398 0.833158 0.521382
In [30]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.set_index('B')
Out[30]:
  A C D E
B        
0.311742 0.972069 0.557977 0.114267 0.795128
0.931644 0.725425 0.082130 0.993764 0.136923
0.206382 0.980647 0.947041 0.038841 0.879139
0.157801 0.402233 0.249151 0.724130 0.108238
0.314238 0.341221 0.512180 0.218882 0.046379
0.029040 0.470619 0.666784 0.036655 0.823498
0.843928 0.779437 0.926912 0.189213 0.624111
0.282773 0.993681 0.048483 0.135934 0.576662
0.759600 0.235513 0.359139 0.488255 0.669043
0.088552 0.893269 0.277296 0.889523 0.398392
 
df.rename(index = lambda x: x+ 1) # relabel the whole index by applying a function to each label
In [29]:
df = pd.DataFrame(np.random.rand(10, 5))
df.rename(index = lambda x: x+1)
Out[29]:
  0 1 2 3 4
1 0.932421 0.478929 0.051820 0.721526 0.016739
2 0.359403 0.327488 0.503009 0.352523 0.169186
3 0.894238 0.268052 0.906756 0.726393 0.973686
4 0.188892 0.056018 0.156585 0.643488 0.321641
5 0.661594 0.043409 0.392303 0.469758 0.157635
6 0.582072 0.992046 0.060181 0.202060 0.119541
7 0.073971 0.157798 0.616039 0.516502 0.472920
8 0.885208 0.158675 0.211644 0.763249 0.762270
9 0.907770 0.455217 0.430548 0.473017 0.240695
10 0.043648 0.259251 0.365041 0.518889 0.765609
In [31]:
df = pd.DataFrame(np.random.rand(10,5))
df.rename(index = lambda x: x+ 1)
Out[31]:
  0 1 2 3 4
1 0.386542 0.031932 0.963200 0.790339 0.602533
2 0.053492 0.652174 0.889465 0.465296 0.843528
3 0.411836 0.460788 0.110352 0.083247 0.389855
4 0.336156 0.830522 0.560991 0.667896 0.233841
5 0.307933 0.995207 0.506680 0.957895 0.636461
6 0.724975 0.842118 0.123139 0.244357 0.803936
7 0.059176 0.117784 0.330192 0.418764 0.464144
8 0.104323 0.222367 0.930414 0.659232 0.562155
9 0.484089 0.024045 0.879834 0.492231 0.949636
10 0.201583 0.280658 0.356804 0.890706 0.236174
 
Filtering (filter), Sorting (sort), and Grouping (groupby)
 
df[df[col] > 0.5] # select the rows of df where the value in column col is greater than 0.5; all columns are returned
In [33]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfg'))
print(df)
df[df['a']>0.5]
 
          a         s         d         f         g
0  0.191880  0.437651  0.780847  0.836473  0.086490
1  0.997351  0.671057  0.212071  0.946415  0.768535
2  0.506504  0.800164  0.968510  0.513060  0.258659
3  0.791777  0.632927  0.624002  0.799357  0.270455
4  0.207246  0.152955  0.007859  0.257787  0.208638
5  0.620649  0.557626  0.393774  0.331476  0.855253
6  0.220170  0.358326  0.811410  0.667446  0.085703
7  0.554684  0.994837  0.054684  0.854683  0.749515
8  0.759856  0.771095  0.571663  0.189677  0.177212
9  0.887868  0.617078  0.487259  0.462189  0.673066
Out[33]:
  a s d f g
1 0.997351 0.671057 0.212071 0.946415 0.768535
2 0.506504 0.800164 0.968510 0.513060 0.258659
3 0.791777 0.632927 0.624002 0.799357 0.270455
5 0.620649 0.557626 0.393774 0.331476 0.855253
7 0.554684 0.994837 0.054684 0.854683 0.749515
8 0.759856 0.771095 0.571663 0.189677 0.177212
9 0.887868 0.617078 0.487259 0.462189 0.673066
In [32]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df[df['A'] > 0.5]
Out[32]:
  A B C D E
0 0.534886 0.863546 0.236718 0.326766 0.415460
2 0.953931 0.070198 0.483749 0.922528 0.295505
8 0.880175 0.056811 0.520499 0.533152 0.548145
 
df[(df[col] > 0.5) & (df[col] < 0.7)] # select the rows of df where the value in column col is greater than 0.5 and less than 0.7; all columns are returned
In [34]:
df = pd.DataFrame(np.random.rand(10,6),columns= list('qwerty'))
df[(df['e'] > 0.5) &(df['t'] < 0.7) ]
Out[34]:
  q w e r t y
2 0.176275 0.358433 0.895002 0.739299 0.050452 0.114546
3 0.726330 0.591592 0.909450 0.120671 0.677124 0.837148
4 0.318870 0.805787 0.600435 0.629595 0.045091 0.891886
5 0.270306 0.143335 0.519607 0.118409 0.079835 0.071877
In [33]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df[(df['C'] > 0.5) & (df['D'] < 0.7)]
Out[33]:
  A B C D E
2 0.953112 0.174517 0.645300 0.308216 0.171177
6 0.853087 0.863079 0.701823 0.354019 0.311754
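Each condition in a combined filter has to be wrapped in parentheses, because & and | bind more tightly than the comparison operators. A small sketch, assuming fixed values instead of random ones so the result is reproducible (df_f and the result names are illustrative):

```python
import pandas as pd

df_f = pd.DataFrame({
    "C": [0.2, 0.6, 0.9, 0.55],
    "D": [0.1, 0.8, 0.3, 0.65],
})

# Both conditions must hold (logical AND).
both = df_f[(df_f["C"] > 0.5) & (df_f["D"] < 0.7)]

# At least one condition must hold (logical OR).
either = df_f[(df_f["C"] > 0.5) | (df_f["D"] < 0.7)]

# DataFrame.query expresses the AND filter as a string.
same_via_query = df_f.query("C > 0.5 and D < 0.7")

print(both)
print(either)
```

The query form is a matter of taste; it returns the same rows as the boolean mask.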
 
df.sort_values(col1) # sort df by column col1 in ascending order
In [35]:
df = pd.DataFrame(np.random.rand(10,6),columns=list('adsfgh'))
df.sort_values('a')
Out[35]:
  a d s f g h
8 0.012038 0.240554 0.900154 0.630489 0.971382 0.889947
3 0.174606 0.704540 0.284934 0.412725 0.261158 0.807697
9 0.324203 0.834741 0.624353 0.676012 0.580034 0.436738
1 0.386444 0.256227 0.924961 0.000652 0.589956 0.476489
5 0.479683 0.080173 0.333917 0.741830 0.219858 0.550681
6 0.546706 0.358566 0.875383 0.921672 0.004955 0.631361
4 0.581234 0.001990 0.737987 0.203702 0.231551 0.235576
7 0.762742 0.800615 0.945827 0.434820 0.755877 0.312649
2 0.888132 0.019374 0.555217 0.618628 0.396756 0.924784
0 0.904388 0.758854 0.450406 0.487383 0.666163 0.430539
In [34]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.sort_values('E')
Out[34]:
  A B C D E
3 0.024096 0.623842 0.775949 0.828343 0.317729
6 0.220055 0.381614 0.463676 0.762644 0.391758
4 0.589411 0.727439 0.064528 0.319521 0.413518
1 0.878490 0.229301 0.699506 0.726879 0.464106
8 0.438101 0.970649 0.050256 0.697440 0.499057
9 0.566100 0.558798 0.723253 0.254244 0.524486
7 0.613603 0.933109 0.677036 0.808160 0.544953
5 0.079326 0.711673 0.266434 0.910628 0.816783
2 0.132114 0.145395 0.908436 0.521271 0.889645
0 0.432677 0.216837 0.203532 0.093214 0.977671
 
df.sort_values(col2,ascending=False) # sort df by column col2 in descending order
In [36]:
df = pd.DataFrame(np.random.rand(10, 8),columns=list('qwertyui'))
df.sort_values('e', ascending=False)
Out[36]:
  q w e r t y u i
8 0.541191 0.443107 0.804432 0.475763 0.332738 0.169072 0.350597 0.234079
9 0.278131 0.672111 0.766488 0.555026 0.271935 0.453826 0.491817 0.986139
1 0.758781 0.041056 0.732308 0.974348 0.219851 0.211953 0.524819 0.300156
2 0.065457 0.556341 0.655507 0.205678 0.606155 0.945356 0.915438 0.642333
4 0.916662 0.179418 0.620904 0.689385 0.477483 0.262302 0.868513 0.002603
6 0.934955 0.970812 0.331655 0.507056 0.012076 0.643469 0.579360 0.416791
3 0.372486 0.775326 0.250734 0.021345 0.267355 0.059874 0.253597 0.244643
7 0.598279 0.031159 0.205364 0.715331 0.340993 0.918638 0.918882 0.971622
5 0.062437 0.923440 0.119125 0.755429 0.744593 0.421468 0.366993 0.103529
0 0.965093 0.630529 0.034310 0.500022 0.736686 0.484777 0.595759 0.281686
In [35]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.sort_values('A',ascending=False)
Out[35]:
  A B C D E
9 0.977172 0.930607 0.889285 0.475032 0.031715
0 0.864511 0.229990 0.678612 0.042491 0.148123
2 0.694747 0.580891 0.817524 0.392417 0.055003
6 0.684327 0.802028 0.862043 0.241838 0.800401
7 0.612324 0.099445 0.714120 0.215054 0.280343
8 0.441434 0.315553 0.564762 0.800143 0.330030
1 0.438734 0.161109 0.610750 0.647330 0.792404
4 0.365880 0.710768 0.344320 0.998757 0.979497
3 0.202511 0.769728 0.575057 0.511384 0.696753
5 0.029527 0.560114 0.224787 0.086291 0.318322
 
df.sort_values([col1,col2],ascending=[True,False]) # sort df by col1 in ascending order, then by col2 in descending order
In [37]:
df = pd.DataFrame(np.random.rand(5,6),columns=list('qwerty'))
df.sort_values(['q', 'w'],ascending=[True, False])
Out[37]:
  q w e r t y
3 0.039156 0.902539 0.544040 0.715766 0.476489 0.968014
4 0.369672 0.760559 0.339207 0.773287 0.112713 0.465799
2 0.446962 0.675626 0.805690 0.869418 0.553809 0.310547
0 0.898922 0.210659 0.024452 0.310047 0.492718 0.530260
1 0.981514 0.476470 0.435834 0.613164 0.071609 0.771960
In [36]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.sort_values(['A','E'],ascending=[True,False])
Out[36]:
  A B C D E
6 0.075863 0.696980 0.648945 0.336977 0.113122
2 0.199316 0.632063 0.787358 0.133175 0.060568
5 0.242081 0.818550 0.618439 0.215761 0.924459
7 0.261237 0.400725 0.659224 0.555746 0.132572
0 0.390540 0.358432 0.754028 0.194403 0.889624
8 0.410481 0.463811 0.343021 0.736340 0.291121
4 0.578705 0.544711 0.881707 0.396593 0.414465
3 0.600541 0.459247 0.591303 0.027464 0.496864
9 0.720029 0.419921 0.740225 0.904391 0.226958
1 0.777955 0.992290 0.144495 0.600207 0.647018
 
df.groupby(col) # group df by one column
In [3]:
df = pd.DataFrame({ 'a':np.array(['huang','huang','huang','xiao','xiao','xiao']), 'b':np.array(['lei','lei','lei','xiao','xiao','lei']), 'c':np.array(['small','medium','large','small','large','medium']), 'd':np.array([1,2,3,4,5,6]) })
df.groupby('a').count()
Out[3]:
  b c d
a      
huang 3 3 3
xiao 3 3 3
In [38]:
df = pd.DataFrame({ 'A':np.array(['foo','foo','foo','foo','bar','bar']), 'B':np.array(['one','one','two','two','three','three']), 'C':np.array(['small','medium','large','large','small','small']), 'D':np.array([1,2,2,3,3,5])})
print(df)
df.groupby('A').count()
 
     A      B       C  D
0  foo    one   small  1
1  foo    one  medium  2
2  foo    two   large  2
3  foo    two   large  3
4  bar  three   small  3
5  bar  three   small  5
Out[38]:
  B C D
A      
bar 2 2 2
foo 4 4 4
 
df.groupby([col1,col2]) # group df by columns col1 and col2
In [4]:
df = pd.DataFrame({ 'a':np.array(['s','s','s','e','e','e']), 'b':np.array(['q','w','e','e','e','w']), 'c':np.array(['t','t','t','hu','hi','jk']) })
print(df)
df.groupby(['a','b']).count()
 
   a  b   c
0  s  q   t
1  s  w   t
2  s  e   t
3  e  e  hu
4  e  e  hi
5  e  w  jk
Out[4]:
     c
a b
e e  2
  w  1
s e  1
  q  1
  w  1
In [39]:
df = pd.DataFrame({ 'A':np.array(['foo','foo','foo','foo','bar','bar']), 'B':np.array(['one','one','two','two','three','three']), 'C':np.array(['small','medium','large','large','small','small']), 'D':np.array([1,2,2,3,3,5])})
print(df)
df.groupby(['B','C']).sum()
 
     A      B       C  D
0  foo    one   small  1
1  foo    one  medium  2
2  foo    two   large  2
3  foo    two   large  3
4  bar  three   small  3
5  bar  three   small  5
Out[39]:
              D
B     C
one   medium  2
      small   1
three small   8
two   large   5
 
df.groupby(col1)[col2].mean() # group df by column col1, then return the mean of col2 for each group
In [10]:
df = pd.DataFrame({ 'a':np.array(['ho','ho','ho','e','e','e']), 'b':np.array(['huang','huang','lei','lei','xiao','xiao']), 'c':np.array([1,2,3,4,5,6]) })
df.groupby('a')['c'].mean()
Out[10]:
a
e     5
ho    2
Name: c, dtype: int64
In [39]:
df = pd.DataFrame({ 'A':np.array(['foo','foo','foo','foo','bar','bar']), 'B':np.array(['one','one','two','two','three','three']), 'C':np.array(['small','medium','large','large','small','small']), 'D':np.array([1,2,2,3,3,5])})
df.groupby('B')['D'].mean()
Out[39]:
B
one      1.5
three    4.0
two      2.5
Name: D, dtype: float64
 
df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean') # build a pivot table indexed by col1, aggregating the value columns col2 and col3 with the mean
In [11]:
df = pd.DataFrame({ 'A':np.array(['foo','foo','foo','foo','bar','bar']), 'B':np.array(['one','one','two','two','three','three']), 'C':np.array(['small','medium','large','large','small','small']), 'D':np.array([1,2,2,3,3,5])})
print(df)
pd.pivot_table(df, index=['A','B'], columns=['C'], aggfunc=np.sum)
 
     A      B       C  D
0  foo    one   small  1
1  foo    one  medium  2
2  foo    two   large  2
3  foo    two   large  3
4  bar  three   small  3
5  bar  three   small  5
Out[11]:
               D
C          large medium small
A   B
bar three    NaN    NaN   8.0
foo one      NaN    2.0   1.0
    two      5.0    NaN   NaN
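The cell above pivots with a columns argument and a sum, whereas the one-line summary describes the index/values/mean form. A minimal sketch of that form, assuming a made-up table (sales, region, units and price are hypothetical names and values):

```python
import pandas as pd

# Hypothetical data used only for this illustration.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "units": [10, 20, 30, 50],
    "price": [1.0, 3.0, 2.0, 4.0],
})

# One row per region, one column per value column, each cell a mean.
pivot = sales.pivot_table(index="region", values=["units", "price"], aggfunc="mean")

# The same numbers via groupby, for comparison.
by_group = sales.groupby("region")[["units", "price"]].mean()

print(pivot)
print(by_group)
```

For a single index column and a single aggregation, the pivot table and the groupby result contain the same values; pivot_table becomes more useful once a columns argument spreads a second key across the header.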
 
df.groupby(col1).agg(np.mean)
In [12]:
df = pd.DataFrame({ 'A':np.array(['foo','foo','foo','foo','bar','bar']), 'B':np.array(['one','one','two','two','three','three']), 'C':np.array(['small','medium','large','large','small','small']), 'D':np.array([1,2,2,3,3,5])}) print(df) df.groupby('A').agg(np.mean)
 
     A      B       C  D
0  foo    one   small  1
1  foo    one  medium  2
2  foo    two   large  2
3  foo    two   large  3
4  bar  three   small  3
5  bar  three   small  5
Out[12]:
     D
A
bar  4
foo  2
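agg() is not limited to a single function. As a small sketch on the same toy data, the example below passes a list of function names and then uses named aggregation; the output column names d_mean and d_total are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar'],
    'D': [1, 2, 2, 3, 3, 5],
})

# A list of function names yields one output column per function.
summary = df.groupby('A')['D'].agg(['mean', 'min', 'max'])

# Named aggregation: output_name=(input_column, function).
named = df.groupby('A').agg(d_mean=('D', 'mean'), d_total=('D', 'sum'))

print(summary)
print(named)
```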
 
df.apply(np.mean) # Compute the mean of each column of df
In [13]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('adsfg'))
df.apply(np.mean)
Out[13]:
a    0.539334
d    0.500330
s    0.508882
f    0.580603
g    0.523317
dtype: float64
In [42]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.apply(np.mean)
Out[42]:
A    0.388075
B    0.539564
C    0.607983
D    0.518634
E    0.482960
dtype: float64
 
df.apply(np.max,axis=1) # Compute the maximum of each row of df
In [14]:
df = pd.DataFrame(np.random.rand(10, 6), columns=list('asdfrg'))
df.apply(np.max, axis=1)
Out[14]:
0    0.845378
1    0.998686
2    0.968602
3    0.843231
4    0.940353
5    0.908892
6    0.949700
7    0.663064
8    0.876051
9    0.975562
dtype: float64
In [43]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.apply(np.max, axis=1)
Out[43]:
0    0.904163
1    0.804519
2    0.924102
3    0.761781
4    0.952084
5    0.923679
6    0.796320
7    0.582907
8    0.761310
9    0.893564
dtype: float64
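For built-in reductions, df.mean() and df.max(axis=1) give the same results as the apply calls above. apply becomes more useful with a custom function. The sketch below uses a small fixed frame (made up for illustration) and a hypothetical "range" function to show what the axis argument changes.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# axis=0 (the default): the function receives one column at a time.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time.
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)  # one value per column
print(row_range)  # one value per row
```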
 
Joining (join) and Combining (combine) Data
 
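df1.append(df2) # Append the rows of df2 to the end of df1; df1 and df2 should have the same columns. DataFrame.append was removed in pandas 2.0, where pd.concat([df1, df2]) does the same job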
In [44]:
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']},
    index=[0, 1, 2, 3])
df2 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7'],
    'C': ['C4', 'C5', 'C6', 'C7'],
    'D': ['D4', 'D5', 'D6', 'D7']},
    index=[4, 5, 6, 7])
# Same result as df1.append(df2), which is no longer available in pandas 2.x
pd.concat([df1, df2])
Out[44]:
  A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
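In the example above the two frames have non-overlapping index labels, so the stacked result is numbered 0 to 7. When both frames use the default index, the labels repeat unless ignore_index=True is passed. A minimal sketch with two made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Plain row-wise concat keeps each frame's own labels: 0, 1, 0, 1.
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked)
print(renumbered)
```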
 
pd.concat([df1, df2],axis=1) # Place the columns of df2 after the columns of df1; rows are matched by index label, so df1 and df2 should have the same number of rows and the same index
In [45]:
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']},
    index=[0, 1, 2, 3])
df2 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7'],
    'C': ['C4', 'C5', 'C6', 'C7'],
    'D': ['D4', 'D5', 'D6', 'D7']},
    index=[4, 5, 6, 7])
pd.concat([df1, df2], axis=1)
Out[45]:
  A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4
5 NaN NaN NaN NaN A5 B5 C5 D5
6 NaN NaN NaN NaN A6 B6 C6 D6
7 NaN NaN NaN NaN A7 B7 C7 D7
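The NaN blocks in the output above appear because the two frames share no index labels: concat with axis=1 aligns rows by label, not by position. The following sketch, using two small made-up frames, shows one way to line the rows up by position instead, by resetting the index of the second frame first.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'C': ['C4', 'C5'], 'D': ['D4', 'D5']}, index=[4, 5])

# Label alignment: no labels in common, so the result has 4 rows padded with NaN.
by_label = pd.concat([df1, df2], axis=1)

# Positional alignment: give df2 the labels 0, 1 before concatenating.
by_position = pd.concat([df1, df2.reset_index(drop=True)], axis=1)

print(by_label)
print(by_position)
```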
 
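df1.join(df2,on=col1,how='inner') # Inner-join df1 and df2: the values in column col1 of df1 are matched against the index of df2
In [46]:
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'key': ['K0', 'K1', 'K0', 'K1']})
df2 = pd.DataFrame({
    'C': ['C0', 'C1'],
    'D': ['D0', 'D1']},
    index=['K0', 'K1'])
# how defaults to 'left'; every key in df1 has a match in df2 here,
# so the result is the same as with how='inner'
df1.join(df2, on='key')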
Out[46]:
  A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K0 C0 D0
3 A3 B3 K1 C1 D1
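The difference between the default left join and an inner join only shows when some key has no match. Here is a small sketch with made-up frames in which the key K2 is missing from df2; it also shows the equivalent merge call.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'key': ['K0', 'K1', 'K2']})
df2 = pd.DataFrame({'C': ['C0', 'C1']}, index=['K0', 'K1'])

# Left join (the default): all rows of df1 are kept, K2 gets NaN in column C.
left = df1.join(df2, on='key')

# Inner join: only rows whose key exists in df2's index are kept.
inner = df1.join(df2, on='key', how='inner')

# The same inner join written with merge.
merged = df1.merge(df2, left_on='key', right_index=True, how='inner')

print(left)
print(inner)
```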
 

<div id = 'p10'>Data Statistics</div>

 
df.describe() # Get descriptive statistics for each column of df
In [4]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
df.describe()
Out[4]:
  a b c d e
count 10.000000 10.000000 10.000000 10.000000 10.000000
mean 0.401144 0.359406 0.603465 0.627617 0.408927
std 0.314415 0.276410 0.225576 0.338007 0.277260
min 0.052844 0.015361 0.255718 0.121600 0.082777
25% 0.148306 0.141934 0.498205 0.320862 0.198211
50% 0.328256 0.301379 0.575852 0.661513 0.332168
75% 0.603549 0.584706 0.665217 0.922541 0.581780
max 0.899552 0.838164 0.973688 0.986095 0.933372
In [47]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.describe()
Out[47]:
  A B C D E
count 10.000000 10.000000 10.000000 10.000000 10.000000
mean 0.398648 0.451699 0.443472 0.739478 0.412954
std 0.330605 0.221586 0.303084 0.308798 0.262148
min 0.004457 0.188689 0.079697 0.113562 0.052935
25% 0.088177 0.270355 0.205663 0.715005 0.205685
50% 0.315533 0.457229 0.332148 0.885872 0.400232
75% 0.749716 0.497208 0.737900 0.948651 0.634670
max 0.782956 0.825671 0.851065 0.962922 0.815447
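Each row of the describe() output corresponds to a method that can also be called on its own. The sketch below uses a tiny fixed column (chosen for illustration so the numbers are easy to check by hand) and also shows the percentiles parameter for requesting other quantiles.

```python
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 3, 4, 5]})

desc = df.describe()

# The '50%' row is the median, and the '25%' row is the 0.25 quantile.
median = df['v'].median()
q25 = df['v'].quantile(0.25)

# Other quantiles can be requested; the median row is still included.
custom = df.describe(percentiles=[0.1, 0.9])

print(desc)
print(custom)
```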
 
df.mean() # Get the mean of each column of df
In [6]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
df.mean()
Out[6]:
a    0.501247
b    0.596623
c    0.525627
d    0.503693
e    0.420740
dtype: float64
In [5]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.mean()
Out[5]:
A    0.554337
B    0.574231
C    0.438493
D    0.514337
E    0.532763
dtype: float64
 
df.corr() # Get the correlation coefficient between every pair of columns of df
In [7]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
df.corr()
Out[7]:
  a b c d e
a 1.000000 -0.314863 0.145670 0.569909 -0.089665
b -0.314863 1.000000 0.241693 -0.105917 0.510971
c 0.145670 0.241693 1.000000 0.073844 -0.070198
d 0.569909 -0.105917 0.073844 1.000000 -0.425560
e -0.089665 0.510971 -0.070198 -0.425560 1.000000
In [49]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.corr()
Out[49]:
  A B C D E
A 1.000000 -0.634931 -0.354824 -0.354131 0.170957
B -0.634931 1.000000 0.225222 -0.338124 -0.043300
C -0.354824 0.225222 1.000000 0.098285 0.297133
D -0.354131 -0.338124 0.098285 1.000000 -0.324209
E 0.170957 -0.043300 0.297133 -0.324209 1.000000
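By default corr() returns the Pearson correlation coefficient. As a rough check, the sketch below recomputes one entry of the matrix from the textbook formula on a small made-up frame; r_pandas and r_manual are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 1, 4, 3, 5]})

# Entry (x, y) of the correlation matrix.
r_pandas = df.corr().loc['x', 'y']

# Pearson r by hand: sum of products of deviations, divided by the
# square root of the product of the sums of squared deviations.
dx = df['x'] - df['x'].mean()
dy = df['y'] - df['y'].mean()
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

print(r_pandas, r_manual)
```

The diagonal of the matrix is always 1, and the matrix is symmetric, which matches the outputs shown above.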
 
df.count() # Get the number of non-null values in each column of df
In [8]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
df.count()
Out[8]:
a    10
b    10
c    10
d    10
e    10
dtype: int64
In [50]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.count()
Out[50]:
A    10
B    10
C    10
D    10
E    10
dtype: int64
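With random data every column is complete, so count() simply returns the number of rows. The difference from len(df) only shows once values are missing. A minimal sketch with a made-up frame containing NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})

# Non-null values per column.
per_column = df.count()

# Non-null values per row.
per_row = df.count(axis=1)

# The complement: missing values per column.
missing = df.isna().sum()

print(per_column)
print(per_row)
print(missing)
```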
 
df.max() # Get the maximum of each column of df
In [12]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
print(df)
print(df.max())
df.count()
 
          a         b         c         d         e
0  0.743688  0.081938  0.693243  0.647515  0.835997
1  0.162604  0.421371  0.422371  0.930136  0.732234
2  0.842065  0.139927  0.675018  0.543914  0.017094
3  0.535794  0.078217  0.964779  0.607462  0.432429
4  0.560279  0.544811  0.304371  0.797165  0.505008
5  0.695691  0.696121  0.741812  0.502741  0.484697
6  0.775342  0.410536  0.275251  0.810911  0.081818
7  0.584267  0.917728  0.379231  0.097702  0.622885
8  0.754810  0.809628  0.102337  0.283509  0.615719
9  0.003056  0.536268  0.187236  0.181844  0.255499
a    0.842065
b    0.917728
c    0.964779
d    0.930136
e    0.835997
dtype: float64
Out[12]:
a    10
b    10
c    10
d    10
e    10
dtype: int64
In [51]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.max()
Out[51]:
A    0.933848
B    0.730197
C    0.921751
D    0.715280
E    0.940010
dtype: float64
 
df.min() # Get the minimum of each column of df
In [52]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.min()
Out[52]:
A    0.107516
B    0.001635
C    0.024502
D    0.092810
E    0.019898
dtype: float64
 
df.median() # Get the median of each column of df
In [53]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.median()
Out[53]:
A    0.497591
B    0.359854
C    0.661607
D    0.342418
E    0.588468
dtype: float64
 
df.std() # Get the standard deviation of each column of df
In [54]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.std()
Out[54]:
A    0.231075
B    0.286691
C    0.276511
D    0.304167
E    0.272570
dtype: float64
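One detail worth knowing: df.std() computes the sample standard deviation (ddof=1), while np.std defaults to the population version (ddof=0), so the two can disagree on the same numbers. A short sketch on a made-up column whose population standard deviation is exactly 2:

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]
df = pd.DataFrame({'v': values})

# pandas default: divide by n - 1 (sample standard deviation).
sample_std = df['v'].std()

# Divide by n instead (population standard deviation).
population_std = df['v'].std(ddof=0)

# NumPy's default is the population version.
numpy_std = np.std(values)

print(sample_std, population_std, numpy_std)
```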

Reposted from: https://www.cnblogs.com/heitaoq/p/7965964.html
