Pandas Basics

Reading and Writing Files

Reading Files

import numpy as np
import pandas as pd

pandas can read many file formats. This section covers reading CSV, Excel, and TXT files.

df_csv = pd.read_csv('data/my_csv.csv')
df_csv

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

df_txt = pd.read_table('data/my_table.txt')
df_txt

	col1	col2	col3	col4
0	2	a	1.4	apple 2020/1/1
1	3	b	3.4	banana 2020/1/2
2	6	c	2.5	orange 2020/1/5
3	5	d	3.2	lemon 2020/1/7

df_excel = pd.read_excel('data/my_excel.xlsx')
df_excel

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

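Several parameters are shared by these functions. header=None means the first row is not used as the column names; index_col sets one or more columns as the index; usecols is the set of columns to read (all columns are read by default); parse_dates lists the columns to convert to datetime; nrows is the number of data rows to read. All of these parameters can be used in the three functions above.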

pd.read_table('data/my_table.txt',header=None)

	0	1	2	3
0	col1	col2	col3	col4
1	2	a	1.4	apple 2020/1/1
2	3	b	3.4	banana 2020/1/2
3	6	c	2.5	orange 2020/1/5
4	5	d	3.2	lemon 2020/1/7

pd.read_csv('data/my_csv.csv',index_col=['col1','col2'])

		col3	col4	col5
col1	col2
2	a	1.4	apple	2020/1/1
3	b	3.4	banana	2020/1/2
6	c	2.5	orange	2020/1/5
5	d	3.2	lemon	2020/1/7

pd.read_table('data/my_table.txt',usecols=['col1','col2'])

	col1	col2
0	2	a
1	3	b
2	6	c
3	5	d

pd.read_csv('data/my_csv.csv',parse_dates=['col5'])   

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020-01-01
1	3	b	3.4	banana	2020-01-02
2	6	c	2.5	orange	2020-01-05
3	5	d	3.2	lemon	2020-01-07

pd.read_excel('data/my_excel.xlsx',nrows=2)

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
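The shared parameters can also be combined in a single call. The following is a minimal sketch that reads an in-memory copy of the CSV rows shown earlier through io.StringIO, so it runs without the data files. The names csv_text and df_combined are made up for this illustration.

```python
import io

import pandas as pd

# In-memory stand-in for data/my_csv.csv (same rows as shown earlier).
csv_text = """col1,col2,col3,col4,col5
2,a,1.4,apple,2020/1/1
3,b,3.4,banana,2020/1/2
6,c,2.5,orange,2020/1/5
5,d,3.2,lemon,2020/1/7
"""

df_combined = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['col1', 'col3', 'col5'],  # read only these columns
    index_col='col1',                  # use col1 as the row index
    parse_dates=['col5'],              # convert col5 to datetime
    nrows=3,                           # read only the first 3 data rows
)
print(df_combined)
print(df_combined.dtypes)
```

Only three rows and two data columns remain, with col1 moved into the index and col5 stored as a datetime column.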

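When reading TXT files, the delimiter is often something other than the tab that read_table expects by default. read_table has a sep parameter that lets the user specify a custom delimiter. For example, the table read below is delimited by ||||: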

pd.read_table('data/my_table_special_sep.txt')

	col1 |||| col2
0	TS |||| This is an apple.
1	GQ |||| My name is Bob.
2	WT |||| Well done!
3	PT |||| May I help you?

This result is clearly not what we want. In this case, pass sep and also specify the engine as python:

pd.read_table('data/my_table_special_sep.txt',sep=r' \|\|\|\| ',engine='python')

	col1	col2
0	TS	This is an apple.
1	GQ	My name is Bob.
2	WT	Well done!
3	PT	May I help you?

Note that when sep is longer than one character, read_table interprets it as a regular expression, so each | must be escaped.
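The escaping can be tried without the data file. This sketch assumes an in-memory copy of the text shown above (raw_text and df_sep are illustrative names) and writes the pattern as a raw string so that Python passes the backslashes through to the regex engine unchanged.

```python
import io

import pandas as pd

# In-memory stand-in for data/my_table_special_sep.txt.
raw_text = """col1 |||| col2
TS |||| This is an apple.
GQ |||| My name is Bob.
WT |||| Well done!
PT |||| May I help you?
"""

# A raw string keeps the backslashes intact, so the regex engine sees
# each | as a literal character rather than as alternation.
df_sep = pd.read_table(io.StringIO(raw_text), sep=r' \|\|\|\| ', engine='python')
print(df_sep)
```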

Writing Data

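When writing data, the most common step is to set index to False. This drops the index from the saved file, which is especially useful when the index carries no particular meaning.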

df_csv.to_csv('data/my_csv_saved.csv',index=False)
df_excel.to_excel('data/my_excel_saved.xlsx',index=False)
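To see what index=False changes, the sketch below writes a small made-up DataFrame (df_demo is an illustrative name) to CSV text with and without the index, then reads both versions back. Calling to_csv without a path returns the CSV as a string, which keeps the example free of files.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({'col1': [2, 3], 'col2': ['a', 'b']})

# Default: the row index is written as an extra, unnamed first column.
with_index = df_demo.to_csv()
# index=False: only the data columns are written.
without_index = df_demo.to_csv(index=False)

print(with_index)
print(without_index)

# Reading the first version back yields a spurious 'Unnamed: 0' column.
cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_with)
print(cols_without)
```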

pandas does not define a to_table function, but to_csv can save a TXT file and accepts a custom separator. The tab character \t is a common choice:

df_txt.to_csv('data/my_txt_saved.txt',sep='\t',index=False)
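A quick round trip shows that tab-separated output from to_csv can be read back by read_table with its default settings. This is a sketch under the assumption that an in-memory buffer stands in for the TXT file; df_demo, buffer, and df_back are illustrative names.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({'col1': [2, 3], 'col2': ['a', 'b'], 'col3': [1.4, 3.4]})

# Write tab-separated text into an in-memory buffer instead of a file.
buffer = io.StringIO()
df_demo.to_csv(buffer, sep='\t', index=False)
print(repr(buffer.getvalue()))

# read_table uses the tab as its default separator, so no sep is needed.
buffer.seek(0)
df_back = pd.read_table(buffer)
print(df_back)
```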

To quickly convert a table to Markdown or LaTeX, use the to_markdown and to_latex functions. to_markdown requires the tabulate package to be installed.

print(df_csv.to_markdown())

|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |

print(df_csv.to_latex())

\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}

Introduction to Data Structures

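pandas has two very important data structures: the Series and the DataFrame. A Series resembles a one-dimensional NumPy array. The functions and methods that work on one-dimensional arrays can generally be applied to it, and in addition its data can be retrieved by index label and its index is aligned automatically in operations. A DataFrame resembles a two-dimensional NumPy array. It can likewise use NumPy array functions and methods, and it offers further flexible features that are introduced later.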
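Automatic index alignment is easiest to see with two Series whose labels only partly overlap. The following minimal sketch uses made-up values and labels.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are matched by label, not by position; labels present in only
# one Series produce NaN in the result.
total = a + b
print(total)
```

Here y and z are summed by label, while x and w appear in only one operand and therefore come out as NaN.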

Creating a Series

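A Series generally consists of four parts: the values (data), the index (index), the storage type (dtype), and the name of the Series (name). The index can also be given a name, which is empty by default.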

s = pd.Series(data = [100,'a',{'dic1':5}],
             index = pd.Index(['id1',20,'third'],name='my_idx'),
             dtype = 'object',
             name = 'my_name')
s 

my_idx
id1              100
20                 a
third    {'dic1': 5}
Name: my_name, dtype: object

These attributes can be accessed with dot notation:

s.values

array([100, 'a', {'dic1': 5}], dtype=object)
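The other parts of the Series can be read the same way. This sketch rebuilds the Series from above so that it runs on its own and prints the remaining attributes.

```python
import pandas as pd

s = pd.Series(data=[100, 'a', {'dic1': 5}],
              index=pd.Index(['id1', 20, 'third'], name='my_idx'),
              dtype='object',
              name='my_name')

print(s.index)       # the index object, including its name my_idx
print(s.dtype)       # storage type of the values
print(s.name)        # name of the Series
print(s.index.name)  # name of the index
```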

.shape returns the shape of the Series, which gives its length:

s.shape

(3,)

To retrieve the value for a single index label, use [index_item].

s['third']

{'dic1': 5}

There are three main ways to create a Series:

1) Creating a Series from a one-dimensional array

import numpy as np
import pandas as pd

arr1 = np.arange(10) 
print(arr1)

[0 1 2 3 4 5 6 7 8 9]

print(type(arr1))

<class 'numpy.ndarray'>

s1 = pd.Series(arr1)
print(s1)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

print(type(s1))

<class 'pandas.core.series.Series'>

2) Creating a Series from a dictionary

dic1 = {'a':10,'b':20,'c':30,'d':40,'e':50}
print(dic1)
print(type(dic1))

{'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
<class 'dict'>

s2 = pd.Series(dic1)
print(s2)

a    10
b    20
c    30
d    40
e    50
dtype: int64

print(type(s2))

<class 'pandas.core.series.Series'>

3) Creating a Series from a row or a column of a DataFrame
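No example is given for this third way at this point, so here is a minimal sketch with a small made-up DataFrame (df_demo, col_series, and row_series are illustrative names). A column is selected with square brackets, and a row is selected by label with .loc.

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]},
                       index=['a', 'b', 'c'])

col_series = df_demo['one']    # one column, selected by column label
row_series = df_demo.loc['b']  # one row, selected by row label

print(col_series)
print(row_series)
```

In both cases the result is a Series. The column keeps the row labels as its index, while the row uses the column names as its index.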

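Creating a DataFrame

A DataFrame adds a column index on top of a Series.

There are three main ways to create a DataFrame:

1) Creating a DataFrame from a two-dimensional array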

arr2 = np.arange(12).reshape(4,3)
print(arr2)
print(type(arr2))

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
<class 'numpy.ndarray'>

df1 = pd.DataFrame(arr2)
print(df1)
print(type(df1))

   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
<class 'pandas.core.frame.DataFrame'>
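When a DataFrame is built from an array, rows and columns are numbered from 0 by default. As a small sketch, labels can be supplied through the index and columns parameters of the constructor. The labels used here are hypothetical.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(4, 3)

# Without labels, pandas numbers rows and columns from 0.
# Hypothetical labels are supplied here through index= and columns=.
df_labeled = pd.DataFrame(arr,
                          index=['r0', 'r1', 'r2', 'r3'],
                          columns=['x', 'y', 'z'])
print(df_labeled)
```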

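2) Creating a DataFrame from a dictionary

Two kinds of dictionary are used below to create a DataFrame: a dictionary of lists and a nested dictionary.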

dic2 = {'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12],'d':[13,14,15,16]}
print(dic2)
print(type(dic2))

{'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12], 'd': [13, 14, 15, 16]}
<class 'dict'>

df2 = pd.DataFrame(dic2)
print(df2)
print(type(df2))

   a  b   c   d
0  1  5   9  13
1  2  6  10  14
2  3  7  11  15
3  4  8  12  16
<class 'pandas.core.frame.DataFrame'>

dic3 = {'one':{'a':1,'b':2,'c':3,'d':4},'two':{'a':5,'b':6,'c':7,'d':8},'three':{'a':9,'b':10,'c':11,'d':12}}
print(dic3)
print(type(dic3))

{'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4}, 'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8}, 'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12}}
<class 'dict'>

df3 = pd.DataFrame(dic3)
print(df3)
print(type(df3))

   one  two  three
a    1    5      9
b    2    6     10
c    3    7     11
d    4    8     12
<class 'pandas.core.frame.DataFrame'>
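With a nested dictionary, the outer keys become the columns and the inner keys become the row index. The sketch below, using made-up data, shows what happens when the inner dictionaries do not share the same keys.

```python
import pandas as pd

nested = {'one': {'a': 1, 'b': 2},
          'two': {'b': 3, 'c': 4}}

# Outer keys become columns; inner keys are merged into the row index.
# Cells with no corresponding inner key are filled with NaN.
df_nested = pd.DataFrame(nested)
print(df_nested)
```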

3) Creating a DataFrame from another DataFrame

df4 = df3[['one','three']]
print(df4)
print(type(df4))

   one  three
a    1      9
b    2     10
c    3     11
d    4     12
<class 'pandas.core.frame.DataFrame'>

s3 = df3['one']
print(s3)
print(type(s3))

a    1
b    2
c    3
d    4
Name: one, dtype: int64
<class 'pandas.core.series.Series'>

In general, though, a DataFrame is more often constructed from a mapping of column names to data, together with a row index:

df = pd.DataFrame(data = {'col_0':[1,2,3],'col_1':list('abc'),'col2':[1.2,2.2,3.2]},
                 index = ['row_%d'%i for i in range(3)])
df

	col_0	col_1	col2
row_0	1	a	1.2
row_1	2	b	2.2
row_2	3	c	3.2

Document Information

Author: weownthenight
Link: https://weownthenight.github.io/2021/06/09/Pandas%E5%9F%BA%E7%A1%80/
License: free to share, non-commercial, no derivatives, attribution required (Creative Commons 3.0 license)

