티스토리 뷰

Kafka & Elasticsearch

Pandas

이주성 2020. 3. 5. 21:40
Practice1_Tutorial_Pandas_KK

Opening

CSV (쉼표로 구분 된 값) 파일은 일반적인 데이터 파일 형식입니다. Python을 사용하여 CSV 파일에서 날짜를 읽고, 조작하고, 날짜를 쓰는 기능은 데이터 과학자 또는 비즈니스 분석을 마스터하는 핵심 기술입니다. 1) what CSV files are,
2) how to read CSV files into "Pandas DataFrames",
3) how to write DataFrames back to CSV files.

What is "Pandas DataFrame"? : Pandas는 Python에서 가장 널리 사용되는 데이터 조작 패키지이며 DataFrames는 테이블 형식 2D 데이터를 저장하기위한 Pandas 데이터 타입입니다. : Pandas 개발은 2008 년에 주요 개발자 인 Wes McKinney와 함께 시작되었으며 라이브러리는 Python을 사용한 데이터 분석 및 관리의 표준이되었습니다. : Pandas 유창성은 모든 Python 기반 데이터 전문가, Kaggle 과제에 관심이 있거나 데이터 프로세스를 자동화하려는 모든 사람에게 필수적입니다. : Pandas 라이브러리 설명서는 DataFrame을“축과 행이 레이블이 지정된 2 차원 크기 변경 가능 이종 테이블 형식 데이터 구조”로 정의합니다.

  • There can be multiple rows and columns in the data.
  • Each row represents a sample of data,
  • Each column contains a different variable that describes the samples (rows).
  • The data in every column is usually the same type of data – e.g. numbers, strings, dates.
  • Usually, unlike an excel data set, DataFrames avoid having missing values, and there are no gaps and empty values between rows or columns.
In [1]:
# Manually generate data
import pandas as pd
pd.options.display.max_columns = 20
pd.options.display.max_rows = 10

data = {'column1':[1,2,3,4,5],
        'anatoeh_column':['this', 'column', 'has', 'strings', 'indise!'],
        'float_column':[0.1, 0.5, 33, 48, 42.5555],
        'binary_column':[True, False, True, True, False]}
print(data)
print(data['column1'])
display(pd.DataFrame(data))
display(pd.DataFrame(data['column1']))
{'column1': [1, 2, 3, 4, 5], 'anatoeh_column': ['this', 'column', 'has', 'strings', 'indise!'], 'float_column': [0.1, 0.5, 33, 48, 42.5555], 'binary_column': [True, False, True, True, False]}
[1, 2, 3, 4, 5]
column1 anatoeh_column float_column binary_column
0 1 this 0.1000 True
1 2 column 0.5000 False
2 3 has 33.0000 True
3 4 strings 48.0000 True
4 5 indise! 42.5555 False
0
0 1
1 2
2 3
3 4
4 5

Data loading manually and from CSV files to Pandas DataFrame

(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)
There are 3 fundamantal conceps to grasp and debug the operation of the data loading procedure.
1) Understanding file extensions and file types – what do the letters CSV actually mean? what’s the difference between a .csv file and a .txt file?
2) Understanding how data is represented inside CSV files – if you open a CSV file, what does the data actually look like?
3) Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
4) CSV file loading errors

  • FileNotFoundError: File b'filename.csv' does not exist
    => A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
  • UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
    => A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding ‘UTF-8’.
  • pandas.parser.CParserError: Error tokenizing data.
    => Parse Errors can be caused in unusual circumstances to do with your data format – try to add the parameter “engine=’python'” to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
In [2]:
# Finding your Python path
# The "OS module" is for operating system dependent functionality into Python
import os
print(os.getcwd())
print(os.listdir())
# os.chdir("path")
/Users/ijuseong/opt/anaconda3/TimeSeriesAnalysis/Shared
['[FastCampus] 2주차_강의자료_김경원박사.pptx', 'Practice3_Setting_Analysis_KK.ipynb', 'Lecture2_Learning_TimeSeries_KK.ipynb', 'Untitled.ipynb', '[FastCampus] 3주차_강의자료_김경원박사.pptx', 'Practice0_Installation_Program_KK.ipynb', '__pycache__', 'week_contents.txt', '[FastCampus] 1주차_강의자료_김경원박사.pptx', 'Image', 'Practice4_FE_Analysis_KK.ipynb', '[FastCampus] 7주차_강의자료_김경원박사.pptx', 'Lecture1_DataAnalysisCycle_DataStatistics_KK.ipynb', 'Practice5_Agile_Analysis_KK.ipynb', 'Lecture3_Algorithms_ML_TS_Linear_KK.ipynb', 'Practice6_TimeSeries_Analysis_KK.ipynb', '[FastCampus] 8주차_강의자료_김경원박사.pptx', '[FastCampus] 5주차_강의자료_김경원박사.pptx', 'Practice1_Tutorial_Pandas_KK.ipynb', 'Practice2_Tutorial_Numpy.ipynb', '.ipynb_checkpoints', 'module.py', '[FastCampus] 6주차_강의자료_김경원박사.pptx', 'Lecture4_Algorithms_TS_NonLinear_Multivariate_KK.ipynb', '[FastCampus] 4주차_강의자료_김경원박사.pptx']
In [3]:
# File Loading from "Absolute" and "Relative" paths
# Relative paths are directions to the file starting at your current working directory, where absolute paths always start at the base of your file system.
# direct_path : 'https://s3-eu-west-1.amazonaws.com/shanebucket/downloads/FAO+database.csv' from 'https://www.kaggle.com/dorbicycle/world-foodfeed-production'
absolute_path = '../Data/FoodAgricultureOrganization/Food_Agriculture_Organization_UN_Full.csv'
pd.read_csv(absolute_path, sep=',')
Out[3]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 63 columns

In [5]:
relative_path = '../Data/FoodAgricultureOrganization/Food_Agriculture_Organization_UN_Full.csv'
pd.read_csv(relative_path, sep=',')
Out[5]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 63 columns

In [6]:
pd.options.display.max_columns = 20
relative_path = '../Data/FoodAgricultureOrganization/Food_Agriculture_Organization_UN_Full.csv'
raw_data = pd.read_csv(relative_path, sep=',')
raw_data
Out[6]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 63 columns

Loading, editing, and viewing data from Pandas DataFrame

팬더는 넓은 데이터 데이터 프레임의 경우 기본적으로 20 개의 열만 표시하고 중간 섹션은 잘리는 60 개 정도의 행만 표시합니다. 이 한도를 변경하려면 Pandas 디스플레이의 일부 내부 옵션을 사용하여 기본값을 편집 할 수 있습니다 (simple use pd.display.options.XX = value to set these) (https://pandas.pydata.org/pandas-docs/stable/options.html)

  • pd.options.display.width – the width of the display in characters – use this if your display is wrapping rows over more than one line.
  • pd.options.display.max_rows – maximum number of rows displayed.
  • pd.options.display.max_columns – maximum number of columns displayed.

마지막으로 특정 열에 대한 핵심 통계를 보려면 'describe'기능을 사용하십시오.

  • For numeric columns, describe() returns basic statistics: the value count, mean, standard deviation, minimum, maximum, and 25th, 50th, and 75th quantiles for the data in a column.
  • For string columns, describe() returns the value count, the number of unique entries, the most frequently occurring value (‘top’), and the number of times the top value occurs (‘freq’)

Pandas에서 선택 및 색인 생성 활동을 수행하기위한 두 가지 주요 옵션이 있습니다. .loc 또는 .iloc을 사용하는 경우, 당신은 목록 또는 단일 값을 selector에 전달하여 출력 형식을 제어 할 수 있습니다. (http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label)

  1. iloc
    • .iloc은 한 row을 선택하면 Pandas Series를, 여러 row을 선택하거나 전체 column을 선택하면 Pandas DataFrame을 반환합니다. 이를 방지하려면 DataFrame 출력이 필요한 경우 single-valued list을 전달하십시오.
    • 이러한 방식으로 여러 column 또는 여러 row을 선택할 때 선택 (예 : [1:5])에서 선택한 행 / 열은 첫 번째 숫자에서 두 번째 숫자에서 1을 뺀 숫자로 실행됩니다. 예 : [1:5]는 1,2,3,4.가되고, [x,y]는 x에서 y-1이됩니다.
  2. loc
    • Label-based / Index-based indexing
    • Boolean / Logical indexing
      • array 또는 True / False 값 Series를 .loc 인덱서에 전달하여 Series에 True 값이있는 행을 선택합니다.

Selecting rows and columns

  • using a dot notation, e.g. data.column_name,
  • using square braces and the name of the column as a string, e.g. data['column_name']
  • using numeric indexing and the iloc selector data.iloc[:, ]

When a column is selected using any of these methodologies, a pandas.Series is the resulting datatype. A pandas series is a one-dimensional set of data.

  • square-brace selection with a list of column names, e.g. data[['column_name_1', 'column_name_2']]
  • using numeric indexing with the iloc selector and a list of column numbers, e.g. data.iloc[:, [0,1,20,22]]

Rows in a DataFrame are selected, typically, using the iloc/loc selection methods, or using logical selectors

  • numeric row selection using the iloc selector, e.g. data.iloc[0:10, :] – select the first 10 rows.
  • label-based row selection using the loc selector (this is only applicably if you have set an “index” on your dataframe. e.g. data.loc[44, :]
  • logical-based row selection using evaluated statements, e.g. data[data["Area"] == "Ireland"] – select the rows where Area value is ‘Ireland’.

To delete rows and columns from DataFrames, Pandas uses the “drop” function.

  • column 또는 여러 column을 삭제하려면 column 이름을 사용하고 “axis”을 1로 지정하십시오.
  • 또는 아래 예와 같이 'columns'매개 변수가 pandas에 추가되어 'axis'이 필요하지 않습니다.
  • drop 함수는 열이 제거 된 새 DataFrame을 반환합니다. 원래 DataFrame을 실제로 편집하기 위해 "inplace"매개 변수를 True로 설정할 수 있으며 반환 된 값이 없습니다.
  • axis = 0을 지정하여 “drop”기능을 사용하여 행을 제거 할 수도 있습니다. Drop ()은 숫자 색인 대신 "labels"을 기준으로 행을 제거합니다. 숫자 position / index를 기준으로 행을 삭제하려면 iloc을 사용하여 데이터 프레임 값을 다시 지정하십시오.
In [7]:
# Examine data in a Pandas DataFrame
raw_data.shape
Out[7]:
(21477, 63)
In [8]:
raw_data.ndim
Out[8]:
2
In [9]:
raw_data.head(5)
Out[9]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200

5 rows × 63 columns

In [10]:
raw_data.tail(5)
Out[10]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

5 rows × 63 columns

In [13]:
raw_data.dtypes
Out[13]:
Area Abbreviation     object
Area Code              int64
Area                  object
Item Code              int64
Item                  object
                      ...   
Y2009                float64
Y2010                float64
Y2011                float64
Y2012                  int64
Y2013                  int64
Length: 63, dtype: object
In [11]:
raw_data['Item Code'] = raw_data['Item Code'].astype(str)
raw_data.dtypes
Out[11]:
Area Abbreviation     object
Area Code              int64
Area                  object
Item Code             object
Item                  object
                      ...   
Y2009                float64
Y2010                float64
Y2011                float64
Y2012                  int64
Y2013                  int64
Length: 63, dtype: object
In [12]:
raw_data['Y2013'].describe()
Out[12]:
count     21477.000000
mean        575.557480
std        6218.379479
min        -246.000000
25%           0.000000
50%           8.000000
75%          90.000000
max      489299.000000
Name: Y2013, dtype: float64
In [13]:
raw_data['Area'].describe()
Out[13]:
count     21477
unique      174
top       Spain
freq        150
Name: Area, dtype: object
In [14]:
raw_data.describe()
Out[14]:
Area Code Element Code latitude longitude Y1961 Y1962 Y1963 Y1964 Y1965 Y1966 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
count 21477.000000 21477.000000 21477.000000 21477.000000 17938.000000 17938.000000 17938.000000 17938.000000 17938.000000 17938.000000 ... 21128.000000 21128.000000 21373.000000 21373.000000 21373.000000 21373.000000 21373.000000 21373.000000 21477.000000 21477.000000
mean 125.449411 5211.687154 20.450613 15.794445 195.262069 200.782250 205.464600 209.925577 217.556751 225.988962 ... 486.690742 493.153256 496.319328 508.482104 522.844898 524.581996 535.492069 553.399242 560.569214 575.557480
std 72.868149 146.820079 24.628336 66.012104 1864.124336 1884.265591 1861.174739 1862.000116 2014.934333 2100.228354 ... 5001.782008 5100.057036 5134.819373 5298.939807 5496.697513 5545.939303 5721.089425 5883.071604 6047.950804 6218.379479
min 1.000000 5142.000000 -40.900000 -172.100000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -169.000000 -246.000000
25% 63.000000 5142.000000 6.430000 -11.780000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 120.000000 5142.000000 20.590000 19.150000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 6.000000 6.000000 7.000000 7.000000 7.000000 7.000000 7.000000 8.000000 8.000000 8.000000
75% 188.000000 5142.000000 41.150000 46.870000 21.000000 22.000000 23.000000 24.000000 25.000000 26.000000 ... 75.000000 77.000000 78.000000 80.000000 82.000000 83.000000 83.000000 86.000000 88.000000 90.000000
max 276.000000 5521.000000 64.960000 179.410000 112227.000000 109130.000000 106356.000000 104234.000000 119378.000000 118495.000000 ... 360767.000000 373694.000000 388100.000000 402975.000000 425537.000000 434724.000000 451838.000000 462696.000000 479028.000000 489299.000000

8 rows × 57 columns

In [15]:
# Selecting and manipulating data
raw_data.iloc[0]
Out[15]:
Area Abbreviation                    AF
Area Code                             2
Area                        Afghanistan
Item Code                          2511
Item                 Wheat and products
                            ...        
Y2009                              4538
Y2010                              4605
Y2011                              4711
Y2012                              4810
Y2013                              4895
Name: 0, Length: 63, dtype: object
In [16]:
raw_data.iloc[[1]]
Out[16]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422

1 rows × 63 columns

In [17]:
raw_data.iloc[[-1]]
Out[17]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

1 rows × 63 columns

In [18]:
raw_data.iloc[:,0]
Out[18]:
0        AF
1        AF
2        AF
3        AF
4        AF
         ..
21472    ZW
21473    ZW
21474    ZW
21475    ZW
21476    ZW
Name: Area Abbreviation, Length: 21477, dtype: object
In [19]:
raw_data.iloc[:,[1]]
Out[19]:
Area Code
0 2
1 2
2 2
3 2
4 2
... ...
21472 181
21473 181
21474 181
21475 181
21476 181

21477 rows × 1 columns

In [20]:
raw_data.iloc[:,[-1]]
Out[20]:
Y2013
0 4895
1 422
2 360
3 89
4 200
... ...
21472 451
21473 15
21474 40
21475 0
21476 0

21477 rows × 1 columns

In [21]:
raw_data.iloc[0:5]
Out[21]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200

5 rows × 63 columns

In [22]:
raw_data.iloc[:,0:2]
Out[22]:
Area Abbreviation Area Code
0 AF 2
1 AF 2
2 AF 2
3 AF 2
4 AF 2
... ... ...
21472 ZW 181
21473 ZW 181
21474 ZW 181
21475 ZW 181
21476 ZW 181

21477 rows × 2 columns

In [23]:
raw_data.iloc[[0,3,6,24],[0,5,6]]
Out[23]:
Area Abbreviation Element Code Element
0 AF 5142 Food
3 AF 5142 Food
6 AF 5142 Food
24 AF 5142 Food
In [24]:
raw_data.iloc[0:5, 5:8]
Out[24]:
Element Code Element Unit
0 5142 Food 1000 tonnes
1 5142 Food 1000 tonnes
2 5521 Feed 1000 tonnes
3 5142 Food 1000 tonnes
4 5521 Feed 1000 tonnes
In [25]:
raw_data.loc[0]
Out[25]:
Area Abbreviation                    AF
Area Code                             2
Area                        Afghanistan
Item Code                          2511
Item                 Wheat and products
                            ...        
Y2009                              4538
Y2010                              4605
Y2011                              4711
Y2012                              4810
Y2013                              4895
Name: 0, Length: 63, dtype: object
In [26]:
raw_data.loc[[1]]
Out[26]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422

1 rows × 63 columns

In [27]:
raw_data.loc[[1,3]]
Out[27]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89

2 rows × 63 columns

In [28]:
raw_data.loc[[1,3],['Item','Y2013']]
Out[28]:
Item Y2013
1 Rice (Milled Equivalent) 422
3 Barley and products 89
In [29]:
raw_data.loc[[1,3],'Item':'Y2013']
Out[29]:
Item Element Code Element Unit latitude longitude Y1961 Y1962 Y1963 Y1964 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 183.0 183.0 182.0 220.0 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
3 Barley and products 5142 Food 1000 tonnes 33.94 67.71 237.0 237.0 237.0 238.0 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89

2 rows × 59 columns

In [30]:
raw_data.loc[1:3,'Item':'Y2013']
Out[30]:
Item Element Code Element Unit latitude longitude Y1961 Y1962 Y1963 Y1964 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 183.0 183.0 182.0 220.0 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 76.0 76.0 76.0 76.0 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 Barley and products 5142 Food 1000 tonnes 33.94 67.71 237.0 237.0 237.0 238.0 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89

3 rows × 59 columns

In [31]:
raw_data_test = raw_data.loc[10:,'Item':'Y2013']
raw_data_test.iloc[[0]]
Out[31]:
Item Element Code Element Unit latitude longitude Y1961 Y1962 Y1963 Y1964 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
10 Sugar beet 5521 Feed 1000 tonnes 33.94 67.71 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

1 rows × 59 columns

In [32]:
raw_data_test.loc[[10]]
Out[32]:
Item Element Code Element Unit latitude longitude Y1961 Y1962 Y1963 Y1964 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
10 Sugar beet 5521 Feed 1000 tonnes 33.94 67.71 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

1 rows × 59 columns

In [33]:
raw_data.loc[raw_data['Item'] == 'Sugar beet']
Out[33]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
10 AF 2 Afghanistan 2537 Sugar beet 5521 Feed 1000 tonnes 33.94 67.71 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
103 AL 3 Albania 2537 Sugar beet 5521 Feed 1000 tonnes 41.15 20.17 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1 1
699 AM 1 Armenia 2537 Sugar beet 5521 Feed 1000 tonnes 40.07 45.04 ... 1.0 2.0 1.0 2.0 1.0 3.0 1.0 1.0 0 1
832 AU 10 Australia 2537 Sugar beet 5521 Feed 1000 tonnes -25.27 133.78 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
1099 AZ 52 Azerbaijan 2537 Sugar beet 5521 Feed 1000 tonnes 40.14 47.58 ... 1.0 1.0 4.0 3.0 4.0 4.0 6.0 6.0 4 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20020 AE 225 United Arab Emirates 2537 Sugar beet 5521 Feed 1000 tonnes 23.42 53.85 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0 0
20676 UZ 235 Uzbekistan 2537 Sugar beet 5521 Feed 1000 tonnes 41.38 64.59 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
20900 VE 236 Venezuela (Bolivarian Republic of) 2537 Sugar beet 5142 Food 1000 tonnes 6.42 -66.59 ... 17.0 20.0 23.0 21.0 21.0 21.0 30.0 35.0 20 22
21135 YE 249 Yemen 2537 Sugar beet 5521 Feed 1000 tonnes 15.55 48.52 ... 0.0 0.0 1.0 13.0 0.0 0.0 0.0 0.0 0 0
21374 ZW 181 Zimbabwe 2537 Sugar beet 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

66 rows × 63 columns

In [34]:
# is same as
raw_data[raw_data['Item'] == 'Sugar beet']
Out[34]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
10 AF 2 Afghanistan 2537 Sugar beet 5521 Feed 1000 tonnes 33.94 67.71 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
103 AL 3 Albania 2537 Sugar beet 5521 Feed 1000 tonnes 41.15 20.17 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1 1
699 AM 1 Armenia 2537 Sugar beet 5521 Feed 1000 tonnes 40.07 45.04 ... 1.0 2.0 1.0 2.0 1.0 3.0 1.0 1.0 0 1
832 AU 10 Australia 2537 Sugar beet 5521 Feed 1000 tonnes -25.27 133.78 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
1099 AZ 52 Azerbaijan 2537 Sugar beet 5521 Feed 1000 tonnes 40.14 47.58 ... 1.0 1.0 4.0 3.0 4.0 4.0 6.0 6.0 4 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20020 AE 225 United Arab Emirates 2537 Sugar beet 5521 Feed 1000 tonnes 23.42 53.85 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0 0
20676 UZ 235 Uzbekistan 2537 Sugar beet 5521 Feed 1000 tonnes 41.38 64.59 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
20900 VE 236 Venezuela (Bolivarian Republic of) 2537 Sugar beet 5142 Food 1000 tonnes 6.42 -66.59 ... 17.0 20.0 23.0 21.0 21.0 21.0 30.0 35.0 20 22
21135 YE 249 Yemen 2537 Sugar beet 5521 Feed 1000 tonnes 15.55 48.52 ... 0.0 0.0 1.0 13.0 0.0 0.0 0.0 0.0 0 0
21374 ZW 181 Zimbabwe 2537 Sugar beet 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

66 rows × 63 columns

In [35]:
raw_data.loc[raw_data['Item'] == 'Sugar beet', 'Area']
Out[35]:
10                              Afghanistan
103                                 Albania
699                                 Armenia
832                               Australia
1099                             Azerbaijan
                        ...                
20020                  United Arab Emirates
20676                            Uzbekistan
20900    Venezuela (Bolivarian Republic of)
21135                                 Yemen
21374                              Zimbabwe
Name: Area, Length: 66, dtype: object
In [36]:
raw_data.loc[raw_data['Item'] == 'Sugar beet', ['Area']]
Out[36]:
Area
10 Afghanistan
103 Albania
699 Armenia
832 Australia
1099 Azerbaijan
... ...
20020 United Arab Emirates
20676 Uzbekistan
20900 Venezuela (Bolivarian Republic of)
21135 Yemen
21374 Zimbabwe

66 rows × 1 columns

In [37]:
# is not same as
raw_data[raw_data['Item'] == 'Sugar beet', ['Area']]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-d794b59d3bc7> in <module>
      1 # is not same as
----> 2 raw_data[raw_data['Item'] == 'Sugar beet', ['Area']]

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2978             if self.columns.nlevels > 1:
   2979                 return self._getitem_multilevel(key)
-> 2980             indexer = self.columns.get_loc(key)
   2981             if is_integer(indexer):
   2982                 indexer = [indexer]

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2895                 )
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:
   2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(0        False
1        False
2        False
3        False
4        False
         ...  
21472    False
21473    False
21474    False
21475    False
21476    False
Name: Item, Length: 21477, dtype: bool, ['Area'])' is an invalid key
In [38]:
raw_data.loc[raw_data['Item'] == 'Sugar beet', ['Area', 'Item', 'latitude']]
Out[38]:
Area Item latitude
10 Afghanistan Sugar beet 33.94
103 Albania Sugar beet 41.15
699 Armenia Sugar beet 40.07
832 Australia Sugar beet -25.27
1099 Azerbaijan Sugar beet 40.14
... ... ... ...
20020 United Arab Emirates Sugar beet 23.42
20676 Uzbekistan Sugar beet 41.38
20900 Venezuela (Bolivarian Republic of) Sugar beet 6.42
21135 Yemen Sugar beet 15.55
21374 Zimbabwe Sugar beet -19.02

66 rows × 3 columns

In [42]:
raw_data.loc[raw_data['Item'] == 'Sugar beet', 'Area':'latitude']
Out[42]:
Area Item Code Item Element Code Element Unit latitude
10 Afghanistan 2537 Sugar beet 5521 Feed 1000 tonnes 33.94
103 Albania 2537 Sugar beet 5521 Feed 1000 tonnes 41.15
699 Armenia 2537 Sugar beet 5521 Feed 1000 tonnes 40.07
832 Australia 2537 Sugar beet 5521 Feed 1000 tonnes -25.27
1099 Azerbaijan 2537 Sugar beet 5521 Feed 1000 tonnes 40.14
... ... ... ... ... ... ... ...
20020 United Arab Emirates 2537 Sugar beet 5521 Feed 1000 tonnes 23.42
20676 Uzbekistan 2537 Sugar beet 5521 Feed 1000 tonnes 41.38
20900 Venezuela (Bolivarian Republic of) 2537 Sugar beet 5142 Food 1000 tonnes 6.42
21135 Yemen 2537 Sugar beet 5521 Feed 1000 tonnes 15.55
21374 Zimbabwe 2537 Sugar beet 5142 Food 1000 tonnes -19.02

66 rows × 7 columns

In [43]:
raw_data.loc[raw_data['Area'].str.endswith('many')]
Out[43]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
7532 DE 79 Germany 2511 Wheat and products 5521 Feed 1000 tonnes 51.17 10.45 ... 8006.0 11084.0 10644.0 8993.0 10669.0 10608.0 8189.0 9242.0 7868 7494
7533 DE 79 Germany 2511 Wheat and products 5142 Food 1000 tonnes 51.17 10.45 ... 6524.0 6931.0 6796.0 6886.0 6868.0 7137.0 7235.0 7204.0 6712 6900
7534 DE 79 Germany 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 51.17 10.45 ... 7.0 9.0 12.0 17.0 8.0 18.0 13.0 12.0 7 7
7535 DE 79 Germany 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 51.17 10.45 ... 206.0 212.0 226.0 244.0 251.0 257.0 243.0 271.0 275 277
7536 DE 79 Germany 2513 Barley and products 5521 Feed 1000 tonnes 51.17 10.45 ... 7571.0 6878.0 7845.0 6940.0 7255.0 7592.0 7543.0 6062.0 6316 6598
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7674 DE 79 Germany 2948 Milk - Excluding Butter 5142 Food 1000 tonnes 51.17 10.45 ... 20252.0 20878.0 21323.0 21661.0 21091.0 22077.0 21431.0 21171.0 21169 21401
7675 DE 79 Germany 2960 Fish, Seafood 5521 Feed 1000 tonnes 51.17 10.45 ... 11.0 11.0 32.0 51.0 28.0 25.0 29.0 25.0 6 5
7676 DE 79 Germany 2960 Fish, Seafood 5142 Food 1000 tonnes 51.17 10.45 ... 1122.0 1177.0 1238.0 1287.0 1226.0 1187.0 1175.0 1188.0 1128 1039
7677 DE 79 Germany 2961 Aquatic Products, Other 5142 Food 1000 tonnes 51.17 10.45 ... 3.0 3.0 4.0 4.0 4.0 4.0 4.0 5.0 1 0
7678 DE 79 Germany 2928 Miscellaneous 5142 Food 1000 tonnes 51.17 10.45 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

147 rows × 63 columns

In [44]:
# is same as
raw_data.loc[raw_data['Area'].isin(['Germany'])]
Out[44]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
7532 DE 79 Germany 2511 Wheat and products 5521 Feed 1000 tonnes 51.17 10.45 ... 8006.0 11084.0 10644.0 8993.0 10669.0 10608.0 8189.0 9242.0 7868 7494
7533 DE 79 Germany 2511 Wheat and products 5142 Food 1000 tonnes 51.17 10.45 ... 6524.0 6931.0 6796.0 6886.0 6868.0 7137.0 7235.0 7204.0 6712 6900
7534 DE 79 Germany 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 51.17 10.45 ... 7.0 9.0 12.0 17.0 8.0 18.0 13.0 12.0 7 7
7535 DE 79 Germany 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 51.17 10.45 ... 206.0 212.0 226.0 244.0 251.0 257.0 243.0 271.0 275 277
7536 DE 79 Germany 2513 Barley and products 5521 Feed 1000 tonnes 51.17 10.45 ... 7571.0 6878.0 7845.0 6940.0 7255.0 7592.0 7543.0 6062.0 6316 6598
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7674 DE 79 Germany 2948 Milk - Excluding Butter 5142 Food 1000 tonnes 51.17 10.45 ... 20252.0 20878.0 21323.0 21661.0 21091.0 22077.0 21431.0 21171.0 21169 21401
7675 DE 79 Germany 2960 Fish, Seafood 5521 Feed 1000 tonnes 51.17 10.45 ... 11.0 11.0 32.0 51.0 28.0 25.0 29.0 25.0 6 5
7676 DE 79 Germany 2960 Fish, Seafood 5142 Food 1000 tonnes 51.17 10.45 ... 1122.0 1177.0 1238.0 1287.0 1226.0 1187.0 1175.0 1188.0 1128 1039
7677 DE 79 Germany 2961 Aquatic Products, Other 5142 Food 1000 tonnes 51.17 10.45 ... 3.0 3.0 4.0 4.0 4.0 4.0 4.0 5.0 1 0
7678 DE 79 Germany 2928 Miscellaneous 5142 Food 1000 tonnes 51.17 10.45 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

147 rows × 63 columns

In [45]:
raw_data.loc[raw_data['Area'].isin(['Germany', 'France'])]
Out[45]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
6910 FR 68 France 2511 Wheat and products 5521 Feed 1000 tonnes 46.23 2.21 ... 9867.0 11311.0 10091.0 8483.0 10283.0 8351.0 6262.0 7727.0 7179 7822
6911 FR 68 France 2511 Wheat and products 5142 Food 1000 tonnes 46.23 2.21 ... 6057.0 6140.0 6402.0 6102.0 6677.0 6331.0 6986.0 6765.0 6984 6971
6912 FR 68 France 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 46.23 2.21 ... 80.0 75.0 84.0 88.0 92.0 82.0 81.0 91.0 99 101
6913 FR 68 France 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 46.23 2.21 ... 296.0 325.0 317.0 367.0 350.0 347.0 339.0 361.0 339 314
6914 FR 68 France 2513 Barley and products 5521 Feed 1000 tonnes 46.23 2.21 ... 3537.0 3560.0 4014.0 3446.0 4275.0 4428.0 4332.0 3137.0 3163 2865
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7674 DE 79 Germany 2948 Milk - Excluding Butter 5142 Food 1000 tonnes 51.17 10.45 ... 20252.0 20878.0 21323.0 21661.0 21091.0 22077.0 21431.0 21171.0 21169 21401
7675 DE 79 Germany 2960 Fish, Seafood 5521 Feed 1000 tonnes 51.17 10.45 ... 11.0 11.0 32.0 51.0 28.0 25.0 29.0 25.0 6 5
7676 DE 79 Germany 2960 Fish, Seafood 5142 Food 1000 tonnes 51.17 10.45 ... 1122.0 1177.0 1238.0 1287.0 1226.0 1187.0 1175.0 1188.0 1128 1039
7677 DE 79 Germany 2961 Aquatic Products, Other 5142 Food 1000 tonnes 51.17 10.45 ... 3.0 3.0 4.0 4.0 4.0 4.0 4.0 5.0 1 0
7678 DE 79 Germany 2928 Miscellaneous 5142 Food 1000 tonnes 51.17 10.45 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

287 rows × 63 columns

In [46]:
raw_data.loc[(raw_data['Area'].str.endswith('many')) & (raw_data['Element'] == 'Feed')]
Out[46]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
7532 DE 79 Germany 2511 Wheat and products 5521 Feed 1000 tonnes 51.17 10.45 ... 8006.0 11084.0 10644.0 8993.0 10669.0 10608.0 8189.0 9242.0 7868 7494
7534 DE 79 Germany 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 51.17 10.45 ... 7.0 9.0 12.0 17.0 8.0 18.0 13.0 12.0 7 7
7536 DE 79 Germany 2513 Barley and products 5521 Feed 1000 tonnes 51.17 10.45 ... 7571.0 6878.0 7845.0 6940.0 7255.0 7592.0 7543.0 6062.0 6316 6598
7538 DE 79 Germany 2514 Maize and products 5521 Feed 1000 tonnes 51.17 10.45 ... 3356.0 3481.0 2947.0 3752.0 4199.0 3895.0 3572.0 4251.0 5434 6136
7540 DE 79 Germany 2515 Rye and products 5521 Feed 1000 tonnes 51.17 10.45 ... 1400.0 1300.0 1287.0 1391.0 1858.0 2416.0 1625.0 1360.0 2150 3318
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7661 DE 79 Germany 2918 Vegetables 5521 Feed 1000 tonnes 51.17 10.45 ... 89.0 83.0 84.0 84.0 89.0 94.0 89.0 89.0 91 90
7668 DE 79 Germany 2945 Offals 5521 Feed 1000 tonnes 51.17 10.45 ... 2.0 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1 1
7670 DE 79 Germany 2946 Animal fats 5521 Feed 1000 tonnes 51.17 10.45 ... 16.0 18.0 20.0 15.0 16.0 14.0 18.0 18.0 20 17
7673 DE 79 Germany 2948 Milk - Excluding Butter 5521 Feed 1000 tonnes 51.17 10.45 ... 1813.0 1701.0 1595.0 1633.0 2224.0 1503.0 1468.0 1771.0 2088 2064
7675 DE 79 Germany 2960 Fish, Seafood 5521 Feed 1000 tonnes 51.17 10.45 ... 11.0 11.0 32.0 51.0 28.0 25.0 29.0 25.0 6 5

43 rows × 63 columns

In [47]:
raw_data.loc[(raw_data['Y2004'] < 1000) & (raw_data['Y2004'] > 990)]
Out[47]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
3754 CL 40 Chile 2531 Potatoes and products 5142 Food 1000 tonnes -35.68 -71.54 ... 998.0 980.0 1018.0 875.0 884.0 850.0 1003.0 1132.0 1096 1089
7630 DE 79 Germany 2744 Eggs 5142 Food 1000 tonnes 51.17 10.45 ... 999.0 974.0 1011.0 997.0 1011.0 1016.0 1032.0 1052.0 990 1010
7672 DE 79 Germany 2949 Eggs 5142 Food 1000 tonnes 51.17 10.45 ... 999.0 974.0 1011.0 997.0 1011.0 1016.0 1032.0 1052.0 990 1010
9038 IN 100 India 2513 Barley and products 5142 Food 1000 tonnes 20.59 78.96 ... 995.0 954.0 944.0 691.0 740.0 1276.0 984.0 1217.0 1152 835
9104 IN 100 India 2641 Pimento 5142 Food 1000 tonnes 20.59 78.96 ... 998.0 844.0 1012.0 1010.0 1019.0 941.0 893.0 954.0 871 1018
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13436 MM 28 Myanmar 2948 Milk - Excluding Butter 5142 Food 1000 tonnes 21.91 95.96 ... 992.0 1069.0 1117.0 1253.0 1298.0 1472.0 1581.0 1660.0 1600 1677
13669 NP 149 Nepal 2919 Fruits - Excluding Wine 5142 Food 1000 tonnes 28.39 84.12 ... 993.0 1050.0 1044.0 1098.0 1195.0 1178.0 1301.0 1428.0 1859 1700
15525 PT 174 Portugal 2511 Wheat and products 5142 Food 1000 tonnes 39.40 -8.22 ... 991.0 982.0 995.0 1002.0 976.0 970.0 978.0 980.0 994 1000
19714 TM 213 Turkmenistan 2905 Cereals - Excluding Beer 5142 Food 1000 tonnes 38.97 59.56 ... 995.0 1009.0 1031.0 1019.0 1017.0 1023.0 1019.0 1039.0 1036 1059
20768 UZ 235 Uzbekistan 2919 Fruits - Excluding Wine 5142 Food 1000 tonnes 41.38 64.59 ... 993.0 1079.0 1448.0 1538.0 1693.0 1875.0 2108.0 2159.0 2422 2638

14 rows × 63 columns

In [48]:
raw_data.loc[(raw_data['Y2004'] < 1000) & (raw_data['Y2004'] > 990), ['Area', 'Item', 'latitude']]
Out[48]:
Area Item latitude
3754 Chile Potatoes and products -35.68
7630 Germany Eggs 51.17
7672 Germany Eggs 51.17
9038 India Barley and products 20.59
9104 India Pimento 20.59
... ... ... ...
13436 Myanmar Milk - Excluding Butter 21.91
13669 Nepal Fruits - Excluding Wine 28.39
15525 Portugal Wheat and products 39.40
19714 Turkmenistan Cereals - Excluding Beer 38.97
20768 Uzbekistan Fruits - Excluding Wine 41.38

14 rows × 3 columns

In [49]:
raw_data.loc[raw_data['Item'].apply(lambda x: len(x.split(' ')) == 5)]
Out[49]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
37 AF 2 Afghanistan 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 33.94 67.71 ... 205.0 230.0 194.0 232.0 198.0 212.0 257.0 334.0 472 483
141 AL 3 Albania 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 41.15 20.17 ... 90.0 104.0 112.0 126.0 133.0 139.0 145.0 148.0 148 153
267 DZ 4 Algeria 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 28.03 1.66 ... 196.0 240.0 275.0 212.0 333.0 460.0 505.0 342.0 492 512
375 AO 7 Angola 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes -11.20 17.87 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2 2
493 AG 8 Antigua and Barbuda 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 17.06 -61.80 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20941 VE 236 Venezuela (Bolivarian Republic of) 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 6.42 -66.59 ... 31.0 33.0 39.0 48.0 55.0 41.0 47.0 53.0 55 58
21053 VN 237 Viet Nam 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 14.06 108.28 ... 33.0 40.0 41.0 50.0 54.0 77.0 64.0 67.0 53 55
21174 YE 249 Yemen 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 15.55 48.52 ... 96.0 101.0 111.0 121.0 123.0 123.0 155.0 138.0 143 144
21294 ZM 251 Zambia 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes -13.13 27.85 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 1
21415 ZW 181 Zimbabwe 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes -19.02 29.15 ... 1.0 0.0 1.0 1.0 0.0 1.0 1.0 2.0 3 3

173 rows × 63 columns

In [50]:
# is same as
TF_indexing = raw_data['Item'].apply(lambda x: len(x.split(' ')) == 5)
raw_data.loc[TF_indexing]
Out[50]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
37 AF 2 Afghanistan 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 33.94 67.71 ... 205.0 230.0 194.0 232.0 198.0 212.0 257.0 334.0 472 483
141 AL 3 Albania 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 41.15 20.17 ... 90.0 104.0 112.0 126.0 133.0 139.0 145.0 148.0 148 153
267 DZ 4 Algeria 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 28.03 1.66 ... 196.0 240.0 275.0 212.0 333.0 460.0 505.0 342.0 492 512
375 AO 7 Angola 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes -11.20 17.87 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2 2
493 AG 8 Antigua and Barbuda 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 17.06 -61.80 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20941 VE 236 Venezuela (Bolivarian Republic of) 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 6.42 -66.59 ... 31.0 33.0 39.0 48.0 55.0 41.0 47.0 53.0 55 58
21053 VN 237 Viet Nam 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 14.06 108.28 ... 33.0 40.0 41.0 50.0 54.0 77.0 64.0 67.0 53 55
21174 YE 249 Yemen 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes 15.55 48.52 ... 96.0 101.0 111.0 121.0 123.0 123.0 155.0 138.0 143 144
21294 ZM 251 Zambia 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes -13.13 27.85 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 1
21415 ZW 181 Zimbabwe 2620 Grapes and products (excl wine) 5142 Food 1000 tonnes -19.02 29.15 ... 1.0 0.0 1.0 1.0 0.0 1.0 1.0 2.0 3 3

173 rows × 63 columns

In [51]:
raw_data.loc[TF_indexing, ['Area', 'Item', 'latitude']]
Out[51]:
Area Item latitude
37 Afghanistan Grapes and products (excl wine) 33.94
141 Albania Grapes and products (excl wine) 41.15
267 Algeria Grapes and products (excl wine) 28.03
375 Angola Grapes and products (excl wine) -11.20
493 Antigua and Barbuda Grapes and products (excl wine) 17.06
... ... ... ...
20941 Venezuela (Bolivarian Republic of) Grapes and products (excl wine) 6.42
21053 Viet Nam Grapes and products (excl wine) 14.06
21174 Yemen Grapes and products (excl wine) 15.55
21294 Zambia Grapes and products (excl wine) -13.13
21415 Zimbabwe Grapes and products (excl wine) -19.02

173 rows × 3 columns

In [52]:
raw_data_test = raw_data.copy()
raw_data_test.loc[(raw_data_test['Y2004'] < 1000) & (raw_data_test['Y2004'] > 990), ['Area']]
Out[52]:
Area
3754 Chile
7630 Germany
7672 Germany
9038 India
9104 India
... ...
13436 Myanmar
13669 Nepal
15525 Portugal
19714 Turkmenistan
20768 Uzbekistan

14 rows × 1 columns

In [53]:
raw_data_test.loc[(raw_data_test['Y2004'] < 1000) & (raw_data_test['Y2004'] > 990), ['Area']] = 'Company'
raw_data_test.loc[(raw_data_test['Y2004'] < 1000) & (raw_data_test['Y2004'] > 980), ['Area']]
Out[53]:
Area
3754 Company
4636 Congo
5143 Cuba
7542 Germany
7630 Company
... ...
14561 Norway
15525 Company
15922 Romania
19714 Company
20768 Company

21 rows × 1 columns

In [54]:
raw_data['Y2007'].sum(), raw_data['Y2007'].mean(), raw_data['Y2007'].median(), raw_data['Y2007'].nunique(), raw_data['Y2007'].count(), raw_data['Y2007'].max(), raw_data['Y2007'].min()
Out[54]:
(10867788.0, 508.48210358863986, 7.0, 1994, 21373, 402975.0, 0.0)
In [55]:
[raw_data['Y2007'].sum(),
 raw_data['Y2007'].mean(),
 raw_data['Y2007'].median(),
 raw_data['Y2007'].nunique(),
 raw_data['Y2007'].count(),
 raw_data['Y2007'].max(),
 raw_data['Y2007'].min(),
 raw_data['Y2007'].isna().sum(),
 raw_data['Y2007'].fillna(0)]
Out[55]:
[10867788.0,
 508.48210358863986,
 7.0,
 1994,
 21373,
 402975.0,
 0.0,
 104,
 0        4164.0
 1         455.0
 2         263.0
 3          48.0
 4         249.0
           ...  
 21472     356.0
 21473       6.0
 21474      14.0
 21475       0.0
 21476       0.0
 Name: Y2007, Length: 21477, dtype: float64]
In [56]:
# Delete the "Area" column from the dataframe
raw_data.drop("Area", axis=1)
Out[56]:
Area Abbreviation Area Code Item Code Item Element Code Element Unit latitude longitude Y1961 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 1928.0 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 183.0 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 76.0 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 237.0 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 210.0 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 230.0 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 27.0 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 6.0 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 62 columns

In [57]:
# alternatively, delete columns using the columns parameter of drop
raw_data.drop(columns="Area")
Out[57]:
Area Abbreviation Area Code Item Code Item Element Code Element Unit latitude longitude Y1961 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 1928.0 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 183.0 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 76.0 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 237.0 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 210.0 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 230.0 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 27.0 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 6.0 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 62 columns

In [58]:
# Delete the Area column from the dataframe and the original 'data' object is changed when inplace=True
raw_data.drop("Area", axis=1, inplace=False)
Out[58]:
Area Abbreviation Area Code Item Code Item Element Code Element Unit latitude longitude Y1961 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 1928.0 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 183.0 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 76.0 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 237.0 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 210.0 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 230.0 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 27.0 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 6.0 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 62 columns

In [59]:
# Delete multiple columns from the dataframe
raw_data.drop(["Y2011", "Y2012", "Y2013"], axis=1)
Out[59]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2001 Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 2668.0 2776.0 3095.0 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 411.0 448.0 460.0 419.0 445.0 546.0 455.0 490.0 415.0 442.0
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 29.0 70.0 48.0 58.0 236.0 262.0 263.0 230.0 379.0 315.0
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 83.0 122.0 144.0 185.0 43.0 44.0 48.0 62.0 55.0 60.0
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 48.0 89.0 63.0 120.0 208.0 233.0 249.0 247.0 195.0 178.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 439.0 360.0 386.0 373.0 357.0 359.0 356.0 341.0 385.0 418.0
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 1.0 0.0 5.0 4.0 9.0 6.0 9.0 5.0 15.0
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 16.0 14.0 18.0 14.0 17.0 14.0 15.0 18.0 29.0
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

21477 rows × 60 columns

In [60]:
# Delete the rows with labels 0,1,5
raw_data.drop([0,1,5], axis=0)
Out[60]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
6 AF 2 Afghanistan 2517 Millet and products 5142 Food 1000 tonnes 33.94 67.71 ... 15.0 21.0 11.0 19.0 21.0 18.0 14.0 14.0 14 12
7 AF 2 Afghanistan 2520 Cereals, Other 5142 Food 1000 tonnes 33.94 67.71 ... 2.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21474 rows × 63 columns

In [61]:
# Delete the rows with label "Afghanistan". For label-based deletion, set the index first on the dataframe
raw_data.set_index("Area")
raw_data.set_index("Area").drop("Afghanistan", axis=0)
Out[61]:
Area Abbreviation Area Code Item Code Item Element Code Element Unit latitude longitude Y1961 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
Area
Albania AL 3 2511 Wheat and products 5521 Feed 1000 tonnes 41.15 20.17 10.0 ... 28.0 28.0 30.0 28.0 28.0 30.0 26.0 25.0 20 18
Albania AL 3 2511 Wheat and products 5142 Food 1000 tonnes 41.15 20.17 166.0 ... 449.0 468.0 422.0 425.0 435.0 415.0 432.0 439.0 440 440
Albania AL 3 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 41.15 20.17 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
Albania AL 3 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 41.15 20.17 2.0 ... 23.0 24.0 30.0 27.0 20.0 23.0 24.0 21.0 22 25
Albania AL 3 2513 Barley and products 5521 Feed 1000 tonnes 41.15 20.17 2.0 ... 9.0 4.0 9.0 2.0 3.0 4.0 7.0 8.0 7 7
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Zimbabwe ZW 181 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 230.0 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
Zimbabwe ZW 181 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 27.0 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
Zimbabwe ZW 181 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 6.0 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
Zimbabwe ZW 181 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
Zimbabwe ZW 181 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21394 rows × 62 columns

In [62]:
# Delete the first five rows using iloc selector
raw_data.iloc[5:,]
Out[62]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
5 AF 2 Afghanistan 2514 Maize and products 5142 Food 1000 tonnes 33.94 67.71 ... 231.0 67.0 82.0 67.0 69.0 71.0 82.0 73.0 77 76
6 AF 2 Afghanistan 2517 Millet and products 5142 Food 1000 tonnes 33.94 67.71 ... 15.0 21.0 11.0 19.0 21.0 18.0 14.0 14.0 14 12
7 AF 2 Afghanistan 2520 Cereals, Other 5142 Food 1000 tonnes 33.94 67.71 ... 2.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0 0
8 AF 2 Afghanistan 2531 Potatoes and products 5142 Food 1000 tonnes 33.94 67.71 ... 276.0 294.0 294.0 260.0 242.0 250.0 192.0 169.0 196 230
9 AF 2 Afghanistan 2536 Sugar cane 5521 Feed 1000 tonnes 33.94 67.71 ... 50.0 29.0 61.0 65.0 54.0 114.0 83.0 83.0 69 81
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21472 rows × 63 columns

Renaming colmnns, exporting and saving Pandas DataFrames

DataFrame 이름 바꾸기 기능을 사용하면 Pandas에서 열 이름을 쉽게 바꿀 수 있습니다. 이름 바꾸기 기능은 사용하기 쉽고 매우 유연합니다.

  • Rename by mapping old names to new names using a dictionary, with form {“old_column_name”: “new_column_name”, …}
  • Rename by providing a function to change the column names with. Functions are applied to every column name.

조작 또는 계산 후 데이터를 CSV로 다시 저장하는 것이 다음 단계입니다.

  • to_csv to write a DataFrame to a CSV file,
  • to_excel to write DataFrame information to a Microsoft Excel file.
In [63]:
# Renaming of columns
raw_data.rename(columns={'Area':'New_Area'})
Out[63]:
Area Abbreviation Area Code New_Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 63 columns

In [64]:
display(raw_data)
raw_data.rename(columns={'Area':'New_Area'}, inplace=False)
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 63 columns

Out[64]:
Area Abbreviation Area Code New_Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 63 columns

In [65]:
raw_data.rename(columns={'Area':'New_Area',
                         'Y2013':'Year_2013'}, inplace=False)
Out[65]:
Area Abbreviation Area Code New_Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Year_2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 63 columns

In [66]:
raw_data.rename(columns=lambda x: x.upper().replace(' ', '_'), inplace=False)
Out[66]:
AREA_ABBREVIATION AREA_CODE AREA ITEM_CODE ITEM ELEMENT_CODE ELEMENT UNIT LATITUDE LONGITUDE ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 63 columns

In [67]:
# Exporting and saving
# Output data to a CSV file
# If you don't want row numbers in my output file, hence index=False, and to avoid character issues, you typically use utf8 encoding for input/output.
raw_data.to_csv("Tutorial_Pandas_Output_Filename.csv", index=False, encoding='utf8')
# Output data to an Excel file.
# For the excel output to work, you may need to install the "xlsxwriter" package.
raw_data.to_excel("Tutorial_Pandas_Output_Filename.xlsx", sheet_name="Sheet 1", index=False)

Summarising, grouping, and aggregating data in Pandas

.describe () 함수는 적용되는 모든 변수 또는 그룹에 대한 통계를 빠르게 표시하는 유용한 요약 도구입니다. describe () 출력은 숫자 또는 문자 열에 적용하는지에 따라 달라집니다.

Function Description
count Number of non-null observations
sum Sum of values
mean Mean of values
mad Mean absolute deviation
median Arithmetic median of values
min Minimum
max Maximum
mode Mode
abs Absolute Value
prod Product of values
std Unbiased standard deviation
var Unbiased variance
sem Unbiased standard error of the mean
skew Unbiased skewness (3rd moment)
kurt Unbiased kurtosis (4th moment)
quantile Sample quantile (value at %)
cumsum Cumulative sum
cumprod Cumulative product
cummax Cumulative maximum
cummin Cumulative minimum

우리는 다양한 변수로 큰 데이터 프레임을 그룹화하고 각 그룹에 summary functions를 적용 할 것입니다. 이는 Pandas DataFrame 객체의 "groupby()"및 "agg()"함수를 사용하여 Pandas에서 수행됩니다. (http://pandas.pydata.org/pandas-docs/stable/groupby.html)

  • groupby() 는 기본적으로 선택한 변수에 따라 데이터를 여러 그룹으로 나눕니다.
  • groupby () 함수는 GroupBy 오브젝트를 리턴하지만 원래 데이터 세트의 행이 분할 된 방식을 본질적으로 설명합니다.
  • GroupBy object.groups 변수는 키가 계산 된 unique 그룹이고 해당 값이 각 그룹에 속하는 axis label 인 dictionary입니다.
  • max(), min(), mean(), first(), last()와 같은 함수를 GroupBy 객체에 빠르게 적용하여 각 그룹에 대한 요약 통계를 얻을 수 있습니다.
  • 결과 열을 두 개 이상 계산하면 결과가 DataFrame이됩니다. 단일 결과 열의 경우 agg function는 기본적으로 Series를 생성합니다. 작업 열을 다르게 선택하여 이를 변경할 수 있습니다 (예 : [[]])
  • groupby 출력에는 선택한 그룹화 변수에 해당하는 행에 대한 색인 또는 다중 색인이 있습니다. 이 인덱스를 설정하지 않으려면 "as_index = False"를 groupby 작업에 전달하십시오.

agg () 함수가 제공하는 집계 기능을 사용하면 그룹당 여러 통계를 한 번의 계산으로 계산할 수 있습니다.

  • When multiple statistics are calculated on columns, the resulting dataframe will have a multi-index set on the column axis. This can be difficult to work with, and be better to rename columns after a groupby operation.
  • 여러 통계가 열에 대해 계산되면 결과 dataframe에 column axis에 multi-index set이 설정됩니다. 이는 작업하기 어려울 수 있으며 그룹 별 작업 후에 열 이름을 바꾸는 것이 좋습니다.
  • 깔끔한 접근 방식은 그룹화 된 열에서 ravel() 메서드를 사용하고 있습니다. Ravel ()은 Pandas multi-index을 더 간단한 배열로 바꾸어 현명한 열 이름으로 결합 할 수 있습니다.
In [68]:
# Summarising
url_path = 'https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2015/06/phone_data.csv'
raw_phone = pd.read_csv(url_path)
raw_phone
Out[68]:
index date duration item month network network_type
0 0 15/10/14 06:58 34.429 data 2014-11 data data
1 1 15/10/14 06:58 13.000 call 2014-11 Vodafone mobile
2 2 15/10/14 14:46 23.000 call 2014-11 Meteor mobile
3 3 15/10/14 14:48 4.000 call 2014-11 Tesco mobile
4 4 15/10/14 17:27 4.000 call 2014-11 Tesco mobile
... ... ... ... ... ... ... ...
825 825 13/03/15 00:38 1.000 sms 2015-03 world world
826 826 13/03/15 00:39 1.000 sms 2015-03 Vodafone mobile
827 827 13/03/15 06:58 34.429 data 2015-03 data data
828 828 14/03/15 00:13 1.000 sms 2015-03 world world
829 829 14/03/15 00:16 1.000 sms 2015-03 world world

830 rows × 7 columns

In [69]:
if 'date' in raw_phone.columns:
    raw_phone['date'] = pd.to_datetime(raw_phone['date'])
raw_phone
Out[69]:
index date duration item month network network_type
0 0 2014-10-15 06:58:00 34.429 data 2014-11 data data
1 1 2014-10-15 06:58:00 13.000 call 2014-11 Vodafone mobile
2 2 2014-10-15 14:46:00 23.000 call 2014-11 Meteor mobile
3 3 2014-10-15 14:48:00 4.000 call 2014-11 Tesco mobile
4 4 2014-10-15 17:27:00 4.000 call 2014-11 Tesco mobile
... ... ... ... ... ... ... ...
825 825 2015-03-13 00:38:00 1.000 sms 2015-03 world world
826 826 2015-03-13 00:39:00 1.000 sms 2015-03 Vodafone mobile
827 827 2015-03-13 06:58:00 34.429 data 2015-03 data data
828 828 2015-03-14 00:13:00 1.000 sms 2015-03 world world
829 829 2015-03-14 00:16:00 1.000 sms 2015-03 world world

830 rows × 7 columns

In [70]:
raw_phone['duration'].max()
Out[70]:
10528.0
In [71]:
raw_phone['item'].unique()
Out[71]:
array(['data', 'call', 'sms'], dtype=object)
In [72]:
raw_phone['duration'][raw_phone['item'] == 'data'].max()
Out[72]:
34.429
In [73]:
raw_phone['network'].unique()
Out[73]:
array(['data', 'Vodafone', 'Meteor', 'Tesco', 'Three', 'voicemail',
       'landline', 'special', 'world'], dtype=object)
In [74]:
raw_phone['month'].value_counts()
Out[74]:
2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
Name: month, dtype: int64
In [75]:
# Grouping
raw_phone.groupby(['month']).groups
Out[75]:
{'2014-11': Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
             ...
             220, 221, 222, 223, 224, 225, 226, 227, 229, 230],
            dtype='int64', length=230),
 '2014-12': Int64Index([228, 231, 232, 233, 234, 235, 236, 237, 238, 239,
             ...
             377, 378, 379, 380, 382, 383, 384, 385, 387, 388],
            dtype='int64', length=157),
 '2015-01': Int64Index([381, 386, 389, 390, 391, 392, 393, 394, 395, 396,
             ...
             583, 584, 585, 587, 588, 589, 590, 591, 592, 593],
            dtype='int64', length=205),
 '2015-02': Int64Index([577, 586, 594, 595, 596, 597, 598, 599, 600, 601,
             ...
             719, 720, 721, 722, 723, 724, 725, 726, 727, 728],
            dtype='int64', length=137),
 '2015-03': Int64Index([729, 730, 731, 732, 733, 734, 735, 736, 737, 738,
             ...
             820, 821, 822, 823, 824, 825, 826, 827, 828, 829],
            dtype='int64', length=101)}
In [76]:
raw_phone.groupby(['month']).groups.keys()
Out[76]:
dict_keys(['2014-11', '2014-12', '2015-01', '2015-02', '2015-03'])
In [77]:
raw_phone.groupby(['month']).first()
Out[77]:
index date duration item network network_type
month
2014-11 0 2014-10-15 06:58:00 34.429 data data data
2014-12 228 2014-11-13 06:58:00 34.429 data data data
2015-01 381 2014-12-13 06:58:00 34.429 data data data
2015-02 577 2015-01-13 06:58:00 34.429 data data data
2015-03 729 2015-12-02 20:15:00 69.000 call landline landline
In [78]:
raw_phone.groupby(['month'])['duration'].sum()
Out[78]:
month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64
In [79]:
raw_phone.groupby(['month'], as_index=False)[['duration']].sum()
Out[79]:
month duration
0 2014-11 26639.441
1 2014-12 14641.870
2 2015-01 18223.299
3 2015-02 15522.299
4 2015-03 22750.441
In [80]:
raw_phone.groupby(['month'])['date'].count()
Out[80]:
month
2014-11    230
2014-12    157
2015-01    205
2015-02    137
2015-03    101
Name: date, dtype: int64
In [81]:
raw_phone[raw_phone['item'] == 'call'].groupby('network')['duration'].sum()
Out[81]:
network
Meteor        7200.0
Tesco        13828.0
Three        36464.0
Vodafone     14621.0
landline     18433.0
voicemail     1775.0
Name: duration, dtype: float64
In [82]:
raw_phone.groupby(['month', 'item']).groups
Out[82]:
{('2014-11',
  'call'): Int64Index([  1,   2,   3,   4,   5,   7,   8,   9,  10,  19,
             ...
             194, 195, 196, 200, 201, 203, 216, 222, 223, 224],
            dtype='int64', length=107),
 ('2014-11',
  'data'): Int64Index([  0,   6,  13,  26,  39,  45,  54,  56,  58,  66,  80,  81,  87,
              92,  95,  97, 101, 111, 114, 120, 131, 151, 159, 170, 182, 189,
             192, 199, 208],
            dtype='int64'),
 ('2014-11',
  'sms'): Int64Index([ 11,  12,  14,  15,  16,  17,  18,  22,  23,  24,  25,  33,  36,
              37,  38,  52,  53,  61,  62,  63,  67,  68,  69,  70,  71,  72,
              73,  74,  75,  76,  77,  79, 102, 103, 107, 108, 121, 125, 132,
             133, 134, 135, 138, 142, 143, 144, 145, 148, 149, 153, 154, 155,
             157, 158, 160, 161, 167, 173, 174, 175, 176, 177, 178, 179, 180,
             181, 185, 186, 187, 188, 197, 198, 202, 204, 205, 206, 207, 209,
             210, 211, 212, 213, 214, 215, 217, 218, 219, 220, 221, 225, 226,
             227, 229, 230],
            dtype='int64'),
 ('2014-12',
  'call'): Int64Index([232, 236, 250, 251, 252, 255, 256, 258, 259, 260, 261, 267, 268,
             269, 270, 271, 272, 273, 274, 276, 277, 278, 279, 280, 282, 283,
             284, 285, 286, 287, 290, 292, 295, 297, 298, 299, 300, 301, 302,
             303, 306, 309, 311, 312, 313, 314, 320, 322, 327, 329, 337, 342,
             344, 345, 347, 348, 349, 350, 352, 353, 354, 356, 362, 364, 365,
             366, 367, 368, 369, 373, 375, 379, 380, 382, 383, 384, 385, 387,
             388],
            dtype='int64'),
 ('2014-12',
  'data'): Int64Index([228, 231, 234, 235, 237, 238, 249, 254, 263, 275, 281, 288, 291,
             305, 321, 324, 328, 330, 338, 341, 343, 346, 351, 355, 363, 372,
             374, 376, 377, 378],
            dtype='int64'),
 ('2014-12',
  'sms'): Int64Index([233, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 253, 257,
             262, 264, 265, 266, 289, 293, 294, 296, 304, 307, 308, 310, 315,
             316, 317, 318, 319, 323, 325, 326, 331, 332, 333, 334, 335, 336,
             339, 340, 357, 358, 359, 360, 361, 370, 371],
            dtype='int64'),
 ('2015-01',
  'call'): Int64Index([392, 398, 401, 402, 403, 404, 405, 406, 407, 408, 411, 412, 413,
             414, 415, 416, 417, 418, 423, 425, 428, 431, 433, 438, 441, 442,
             444, 445, 446, 448, 450, 451, 454, 455, 456, 457, 459, 460, 461,
             466, 475, 497, 498, 499, 506, 510, 511, 513, 514, 515, 517, 518,
             519, 520, 523, 524, 525, 526, 527, 528, 532, 533, 535, 536, 542,
             543, 544, 545, 547, 548, 557, 558, 559, 561, 562, 563, 564, 565,
             570, 572, 573, 574, 578, 584, 585, 587, 588, 589],
            dtype='int64'),
 ('2015-01',
  'data'): Int64Index([381, 386, 389, 396, 397, 400, 409, 420, 426, 427, 443, 453, 463,
             465, 468, 473, 474, 476, 496, 504, 505, 509, 512, 516, 529, 537,
             541, 555, 560, 568, 571],
            dtype='int64'),
 ('2015-01',
  'sms'): Int64Index([390, 391, 393, 394, 395, 399, 410, 419, 421, 422, 424, 429, 430,
             432, 434, 435, 436, 437, 439, 440, 447, 449, 452, 458, 462, 464,
             467, 469, 470, 471, 472, 477, 478, 479, 480, 481, 482, 483, 484,
             485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 500, 501,
             502, 503, 507, 508, 521, 522, 530, 531, 534, 538, 539, 540, 546,
             549, 550, 551, 552, 553, 554, 556, 566, 567, 569, 575, 576, 579,
             580, 581, 582, 583, 590, 591, 592, 593],
            dtype='int64'),
 ('2015-02',
  'call'): Int64Index([595, 597, 599, 600, 601, 602, 603, 611, 612, 614, 615, 618, 619,
             620, 626, 627, 629, 630, 631, 632, 633, 635, 636, 638, 639, 640,
             644, 647, 648, 650, 651, 653, 654, 656, 657, 658, 662, 664, 665,
             666, 668, 669, 671, 672, 674, 676, 677, 682, 684, 686, 687, 690,
             691, 693, 694, 695, 696, 700, 702, 708, 709, 710, 711, 716, 718,
             719, 720],
            dtype='int64'),
 ('2015-02',
  'data'): Int64Index([577, 586, 594, 598, 610, 613, 616, 621, 625, 634, 637, 645, 646,
             649, 652, 655, 659, 667, 670, 673, 675, 683, 685, 689, 692, 697,
             715, 717, 725, 727, 728],
            dtype='int64'),
 ('2015-02',
  'sms'): Int64Index([596, 604, 605, 606, 607, 608, 609, 617, 622, 623, 624, 628, 641,
             642, 643, 660, 661, 663, 678, 679, 680, 681, 688, 698, 699, 701,
             703, 704, 705, 706, 707, 712, 713, 714, 721, 722, 723, 724, 726],
            dtype='int64'),
 ('2015-03',
  'call'): Int64Index([729, 730, 732, 733, 735, 736, 738, 741, 742, 744, 745, 752, 756,
             758, 762, 763, 764, 769, 770, 771, 773, 774, 776, 777, 778, 780,
             781, 782, 783, 784, 785, 786, 787, 788, 792, 799, 800, 801, 802,
             803, 805, 806, 807, 808, 809, 810, 816],
            dtype='int64'),
 ('2015-03',
  'data'): Int64Index([731, 734, 737, 739, 740, 743, 746, 751, 753, 754, 755, 757, 761,
             772, 775, 779, 791, 793, 804, 811, 817, 818, 819, 820, 821, 822,
             823, 824, 827],
            dtype='int64'),
 ('2015-03',
  'sms'): Int64Index([747, 748, 749, 750, 759, 760, 765, 766, 767, 768, 789, 790, 794,
             795, 796, 797, 798, 812, 813, 814, 815, 825, 826, 828, 829],
            dtype='int64')}
In [83]:
raw_phone.groupby(['month', 'item']).groups.keys()
Out[83]:
dict_keys([('2014-11', 'call'), ('2014-11', 'data'), ('2014-11', 'sms'), ('2014-12', 'call'), ('2014-12', 'data'), ('2014-12', 'sms'), ('2015-01', 'call'), ('2015-01', 'data'), ('2015-01', 'sms'), ('2015-02', 'call'), ('2015-02', 'data'), ('2015-02', 'sms'), ('2015-03', 'call'), ('2015-03', 'data'), ('2015-03', 'sms')])
In [84]:
raw_phone.groupby(['month', 'item']).first()
Out[84]:
index date duration network network_type
month item
2014-11 call 1 2014-10-15 06:58:00 13.000 Vodafone mobile
data 0 2014-10-15 06:58:00 34.429 data data
sms 11 2014-10-16 22:18:00 1.000 Meteor mobile
2014-12 call 232 2014-11-14 17:24:00 124.000 voicemail voicemail
data 228 2014-11-13 06:58:00 34.429 data data
... ... ... ... ... ... ...
2015-02 data 577 2015-01-13 06:58:00 34.429 data data
sms 596 2015-01-15 12:23:00 1.000 special special
2015-03 call 729 2015-12-02 20:15:00 69.000 landline landline
data 731 2015-02-13 06:58:00 34.429 data data
sms 747 2015-02-19 18:46:00 1.000 Vodafone mobile

15 rows × 5 columns

In [85]:
raw_phone.groupby(['month', 'item'])['duration'].sum()
Out[85]:
month    item
2014-11  call    25547.000
         data      998.441
         sms        94.000
2014-12  call    13561.000
         data     1032.870
                   ...    
2015-02  data     1067.299
         sms        39.000
2015-03  call    21727.000
         data      998.441
         sms        25.000
Name: duration, Length: 15, dtype: float64
In [86]:
raw_phone.groupby(['month', 'item'])['date'].count()
Out[86]:
month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
                ... 
2015-02  data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
Name: date, Length: 15, dtype: int64
In [87]:
raw_phone.groupby(['month', 'network_type'])['date'].count()
Out[87]:
month    network_type
2014-11  data             29
         landline          5
         mobile          189
         special           1
         voicemail         6
                        ... 
2015-03  data             29
         landline         11
         mobile           54
         voicemail         4
         world             3
Name: date, Length: 24, dtype: int64
In [88]:
raw_phone.groupby(['month', 'network_type'])[['date']].count()
Out[88]:
date
month network_type
2014-11 data 29
landline 5
mobile 189
special 1
voicemail 6
... ... ...
2015-03 data 29
landline 11
mobile 54
voicemail 4
world 3

24 rows × 1 columns

In [89]:
raw_phone.groupby(['month', 'network_type'], as_index=False)[['date']].count()
Out[89]:
month network_type date
0 2014-11 data 29
1 2014-11 landline 5
2 2014-11 mobile 189
3 2014-11 special 1
4 2014-11 voicemail 6
... ... ... ...
19 2015-03 data 29
20 2015-03 landline 11
21 2015-03 mobile 54
22 2015-03 voicemail 4
23 2015-03 world 3

24 rows × 3 columns

In [90]:
raw_phone.groupby(['month', 'network_type'])[['date']].count().shape
Out[90]:
(24, 1)
In [91]:
raw_phone.groupby(['month', 'network_type'], as_index=False)[['date']].count().shape
Out[91]:
(24, 3)
In [92]:
# Aggregating
raw_phone.groupby(['month'], as_index=False)[['duration']].sum()
Out[92]:
month duration
0 2014-11 26639.441
1 2014-12 14641.870
2 2015-01 18223.299
3 2015-02 15522.299
4 2015-03 22750.441
In [93]:
# is same as
raw_phone.groupby(['month'], as_index=False).agg({'duration':'sum'})
Out[93]:
month duration
0 2014-11 26639.441
1 2014-12 14641.870
2 2015-01 18223.299
3 2015-02 15522.299
4 2015-03 22750.441
In [94]:
raw_phone.groupby(['month', 'item']).agg({'duration':'sum',
                                          'network_type':'count',
                                          'date':'first'})
Out[94]:
duration network_type date
month item
2014-11 call 25547.000 107 2014-10-15 06:58:00
data 998.441 29 2014-10-15 06:58:00
sms 94.000 94 2014-10-16 22:18:00
2014-12 call 13561.000 79 2014-11-14 17:24:00
data 1032.870 30 2014-11-13 06:58:00
... ... ... ... ...
2015-02 data 1067.299 31 2015-01-13 06:58:00
sms 39.000 39 2015-01-15 12:23:00
2015-03 call 21727.000 47 2015-12-02 20:15:00
data 998.441 29 2015-02-13 06:58:00
sms 25.000 25 2015-02-19 18:46:00

15 rows × 3 columns

In [95]:
# is same as
aggregation_logic = {'duration':'sum',
                     'network_type':'count',
                     'date':'first'}
raw_phone.groupby(['month', 'item']).agg(aggregation_logic)
Out[95]:
duration network_type date
month item
2014-11 call 25547.000 107 2014-10-15 06:58:00
data 998.441 29 2014-10-15 06:58:00
sms 94.000 94 2014-10-16 22:18:00
2014-12 call 13561.000 79 2014-11-14 17:24:00
data 1032.870 30 2014-11-13 06:58:00
... ... ... ... ...
2015-02 data 1067.299 31 2015-01-13 06:58:00
sms 39.000 39 2015-01-15 12:23:00
2015-03 call 21727.000 47 2015-12-02 20:15:00
data 998.441 29 2015-02-13 06:58:00
sms 25.000 25 2015-02-19 18:46:00

15 rows × 3 columns

In [96]:
aggregation_logic = {'duration':[min, max, sum],
                     'network_type':'count',
                     'date':[min, 'first', 'nunique']}
raw_phone.groupby(['month', 'item']).agg(aggregation_logic)
Out[96]:
duration network_type date
min max sum count min first nunique
month item
2014-11 call 1.000 1940.000 25547.000 107 2014-01-11 15:13:00 2014-10-15 06:58:00 104
data 34.429 34.429 998.441 29 2014-01-11 06:58:00 2014-10-15 06:58:00 29
sms 1.000 1.000 94.000 94 2014-03-11 08:40:00 2014-10-16 22:18:00 79
2014-12 call 2.000 2120.000 13561.000 79 2014-02-12 11:40:00 2014-11-14 17:24:00 76
data 34.429 34.429 1032.870 30 2014-01-12 06:58:00 2014-11-13 06:58:00 30
... ... ... ... ... ... ... ... ...
2015-02 data 34.429 34.429 1067.299 31 2015-01-02 06:58:00 2015-01-13 06:58:00 31
sms 1.000 1.000 39.000 39 2015-01-15 12:23:00 2015-01-15 12:23:00 27
2015-03 call 2.000 10528.000 21727.000 47 2015-01-03 12:19:00 2015-12-02 20:15:00 47
data 34.429 34.429 998.441 29 2015-01-03 06:58:00 2015-02-13 06:58:00 29
sms 1.000 1.000 25.000 25 2015-02-03 09:19:00 2015-02-19 18:46:00 17

15 rows × 7 columns

In [97]:
aggregation_logic = {'duration':[min, max, sum],
                     'network_type':'count',
                     'date':['first', lambda x: max(x)-min(x)]}
raw_phone.groupby(['month', 'item']).agg(aggregation_logic)
Out[97]:
duration network_type date
min max sum count first <lambda_0>
month item
2014-11 call 1.000 1940.000 25547.000 107 2014-10-15 06:58:00 334 days 03:48:00
data 34.429 34.429 998.441 29 2014-10-15 06:58:00 334 days 00:00:00
sms 1.000 1.000 94.000 94 2014-10-16 22:18:00 275 days 10:40:00
2014-12 call 2.000 2120.000 13561.000 79 2014-11-14 17:24:00 305 days 08:14:00
data 34.429 34.429 1032.870 30 2014-11-13 06:58:00 334 days 00:00:00
... ... ... ... ... ... ... ...
2015-02 data 34.429 34.429 1067.299 31 2015-01-13 06:58:00 334 days 00:00:00
sms 1.000 1.000 39.000 39 2015-01-15 12:23:00 260 days 09:17:00
2015-03 call 2.000 10528.000 21727.000 47 2015-12-02 20:15:00 333 days 08:32:00
data 34.429 34.429 998.441 29 2015-02-13 06:58:00 334 days 00:00:00
sms 1.000 1.000 25.000 25 2015-02-19 18:46:00 59 days 01:11:00

15 rows × 6 columns

In [98]:
raw_phone_test = raw_phone.groupby(['month', 'item']).agg(aggregation_logic)
raw_phone_test
Out[98]:
duration network_type date
min max sum count first <lambda_0>
month item
2014-11 call 1.000 1940.000 25547.000 107 2014-10-15 06:58:00 334 days 03:48:00
data 34.429 34.429 998.441 29 2014-10-15 06:58:00 334 days 00:00:00
sms 1.000 1.000 94.000 94 2014-10-16 22:18:00 275 days 10:40:00
2014-12 call 2.000 2120.000 13561.000 79 2014-11-14 17:24:00 305 days 08:14:00
data 34.429 34.429 1032.870 30 2014-11-13 06:58:00 334 days 00:00:00
... ... ... ... ... ... ... ...
2015-02 data 34.429 34.429 1067.299 31 2015-01-13 06:58:00 334 days 00:00:00
sms 1.000 1.000 39.000 39 2015-01-15 12:23:00 260 days 09:17:00
2015-03 call 2.000 10528.000 21727.000 47 2015-12-02 20:15:00 333 days 08:32:00
data 34.429 34.429 998.441 29 2015-02-13 06:58:00 334 days 00:00:00
sms 1.000 1.000 25.000 25 2015-02-19 18:46:00 59 days 01:11:00

15 rows × 6 columns

In [99]:
raw_phone_test.columns = raw_phone_test.columns.droplevel(level=0)
raw_phone_test
Out[99]:
min max sum count first <lambda_0>
month item
2014-11 call 1.000 1940.000 25547.000 107 2014-10-15 06:58:00 334 days 03:48:00
data 34.429 34.429 998.441 29 2014-10-15 06:58:00 334 days 00:00:00
sms 1.000 1.000 94.000 94 2014-10-16 22:18:00 275 days 10:40:00
2014-12 call 2.000 2120.000 13561.000 79 2014-11-14 17:24:00 305 days 08:14:00
data 34.429 34.429 1032.870 30 2014-11-13 06:58:00 334 days 00:00:00
... ... ... ... ... ... ... ...
2015-02 data 34.429 34.429 1067.299 31 2015-01-13 06:58:00 334 days 00:00:00
sms 1.000 1.000 39.000 39 2015-01-15 12:23:00 260 days 09:17:00
2015-03 call 2.000 10528.000 21727.000 47 2015-12-02 20:15:00 333 days 08:32:00
data 34.429 34.429 998.441 29 2015-02-13 06:58:00 334 days 00:00:00
sms 1.000 1.000 25.000 25 2015-02-19 18:46:00 59 days 01:11:00

15 rows × 6 columns

In [100]:
raw_phone_test.rename(columns={'min':'min_duration',
                               'max':'max_duration',
                               'sum':'sum_duration',
                               '<lambda>':'date_difference'})
raw_phone_test = raw_phone.groupby(['month', 'item']).agg(aggregation_logic)
raw_phone_test
Out[100]:
duration network_type date
min max sum count first <lambda_0>
month item
2014-11 call 1.000 1940.000 25547.000 107 2014-10-15 06:58:00 334 days 03:48:00
data 34.429 34.429 998.441 29 2014-10-15 06:58:00 334 days 00:00:00
sms 1.000 1.000 94.000 94 2014-10-16 22:18:00 275 days 10:40:00
2014-12 call 2.000 2120.000 13561.000 79 2014-11-14 17:24:00 305 days 08:14:00
data 34.429 34.429 1032.870 30 2014-11-13 06:58:00 334 days 00:00:00
... ... ... ... ... ... ... ...
2015-02 data 34.429 34.429 1067.299 31 2015-01-13 06:58:00 334 days 00:00:00
sms 1.000 1.000 39.000 39 2015-01-15 12:23:00 260 days 09:17:00
2015-03 call 2.000 10528.000 21727.000 47 2015-12-02 20:15:00 333 days 08:32:00
data 34.429 34.429 998.441 29 2015-02-13 06:58:00 334 days 00:00:00
sms 1.000 1.000 25.000 25 2015-02-19 18:46:00 59 days 01:11:00

15 rows × 6 columns

In [101]:
raw_phone_test.columns = ['_'.join(x) for x in raw_phone_test.columns.ravel()]
raw_phone_test
Out[101]:
duration_min duration_max duration_sum network_type_count date_first date_<lambda_0>
month item
2014-11 call 1.000 1940.000 25547.000 107 2014-10-15 06:58:00 334 days 03:48:00
data 34.429 34.429 998.441 29 2014-10-15 06:58:00 334 days 00:00:00
sms 1.000 1.000 94.000 94 2014-10-16 22:18:00 275 days 10:40:00
2014-12 call 2.000 2120.000 13561.000 79 2014-11-14 17:24:00 305 days 08:14:00
data 34.429 34.429 1032.870 30 2014-11-13 06:58:00 334 days 00:00:00
... ... ... ... ... ... ... ...
2015-02 data 34.429 34.429 1067.299 31 2015-01-13 06:58:00 334 days 00:00:00
sms 1.000 1.000 39.000 39 2015-01-15 12:23:00 260 days 09:17:00
2015-03 call 2.000 10528.000 21727.000 47 2015-12-02 20:15:00 333 days 08:32:00
data 34.429 34.429 998.441 29 2015-02-13 06:58:00 334 days 00:00:00
sms 1.000 1.000 25.000 25 2015-02-19 18:46:00 59 days 01:11:00

15 rows × 6 columns

Merge and join DataFrames with Pandas

(http://pandas.pydata.org/pandas-docs/stable/merging.html)
Python을 사용하는 실제 데이터 과학 상황의 경우 분석 데이터 세트를 구성하기 위해 Pandas Dataframes를 merge하거나 join해야하는 시간은 약 10 분입니다. 데이터 프레임 병합 및 결합은 모든 주목받는 데이터 분석가가 마스터해야하는 핵심 프로세스입니다.

  • 두 데이터 세트를 "merge"하는 것은 두 데이터 세트를 하나로 모으고 공통 속성 또는 열을 기준으로 각 행을 정렬하는 프로세스입니다.
  • 가장 간단한 병합 작업은 왼쪽 dataframe (첫 번째 인수), 오른쪽 dataframe (두 번째 인수), merge column name 또는 "on"을 병합 할 열을 사용합니다.
  • 출력/결과에서 "on"으로 지정된 병합 열의 공통 값이있는 경우 왼쪽 및 오른쪽 dataframe의 행이 일치됩니다.
  • 기본적으로 Pandas 병합 작업은 "inner" merge으로 작동합니다.

Pandas에는 세 가지 유형의 병합이 있습니다. 이러한 병합 유형은 대부분의 데이터베이스 및 데이터 지향 언어 (SQL, R, SAS)에서 일반적이며 "join"이라고합니다

  • inner merge / inner jon – 기본 팬더 동작은 병합 "on"값이 왼쪽 및 오른쪽 dataframe 모두에 존재하는 행만 유지합니다.
  • left merge / left outer join – (일명 왼쪽 병합 또는 왼쪽 조인) 왼쪽 dataframe의 모든 행을 유지합니다. 오른쪽 dataframe에 "on" 변수의 결 측값이있는 경우 결과에 empty/NaN 값을 추가하십시오.
  • Right merge / Right outer join – (일명 오른쪽 병합 또는 오른쪽 조인) 모든 행을 올바른 dataframe에 유지합니다. 왼쪽 열에 "on" 변수의 결 측값이 있으면 결과에 empty / NaN 값을 추가하십시오.
  • outer merge / full outer join – full outer join은 왼쪽 dateframe의 모든 행과 오른쪽 dataframe의 모든 행을 반환하고 가능한 경우 NaN과 다른 행을 일치시킵니다.

In [102]:
# Merge and join od dataframes
user_usage = pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/user_usage.csv')
user_device = pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/user_device.csv')
device_info = pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/android_devices.csv')
display(user_usage.head())
display(user_device.head())
display(device_info.head())
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id
0 21.97 4.82 1557.33 22787
1 1710.08 136.88 7267.55 22788
2 1710.08 136.88 7267.55 22789
3 94.46 35.17 519.12 22790
4 71.59 79.26 1557.33 22792
use_id user_id platform platform_version device use_type_id
0 22782 26980 ios 10.2 iPhone7,2 2
1 22783 29628 android 6.0 Nexus 5 3
2 22784 28473 android 5.1 SM-G903F 1
3 22785 15200 ios 10.2 iPhone7,2 3
4 22786 28239 android 6.0 ONE E1003 1
Retail Branding Marketing Name Device Model
0 NaN NaN AD681H Smartfren Andromax AD681H
1 NaN NaN FJL21 FJL21
2 NaN NaN T31 Panasonic T31
3 NaN NaN hws7721g MediaPad 7 Youth 2
4 3Q OC1020A OC1020A OC1020A
In [103]:
# Q: if the usage patterns for users differ between different devices
result = pd.merge(left=user_usage, right=user_device, on='use_id')
result.head()
Out[103]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform platform_version device use_type_id
0 21.97 4.82 1557.33 22787 12921 android 4.3 GT-I9505 1
1 1710.08 136.88 7267.55 22788 28714 android 6.0 SM-G930F 1
2 1710.08 136.88 7267.55 22789 28714 android 6.0 SM-G930F 1
3 94.46 35.17 519.12 22790 29592 android 5.1 D2303 1
4 71.59 79.26 1557.33 22792 28217 android 5.1 SM-G361F 1
In [104]:
print(user_usage.shape, user_device.shape, device_info.shape, result.shape)
(240, 4) (272, 6) (14546, 4) (159, 9)
In [105]:
user_usage['use_id'].isin(user_device['use_id']).value_counts()
Out[105]:
True     159
False     81
Name: use_id, dtype: int64
In [106]:
result = pd.merge(left=user_usage, right=user_device, on='use_id', how='left')
print(user_usage.shape, result.shape, result['device'].isnull().sum())
(240, 4) (240, 9) 81
In [107]:
display(result.head(), result.tail())
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform platform_version device use_type_id
0 21.97 4.82 1557.33 22787 12921.0 android 4.3 GT-I9505 1.0
1 1710.08 136.88 7267.55 22788 28714.0 android 6.0 SM-G930F 1.0
2 1710.08 136.88 7267.55 22789 28714.0 android 6.0 SM-G930F 1.0
3 94.46 35.17 519.12 22790 29592.0 android 5.1 D2303 1.0
4 71.59 79.26 1557.33 22792 28217.0 android 5.1 SM-G361F 1.0
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform platform_version device use_type_id
235 260.66 68.44 896.96 25008 NaN NaN NaN NaN NaN
236 97.12 36.50 2815.00 25040 NaN NaN NaN NaN NaN
237 355.93 12.37 6828.09 25046 NaN NaN NaN NaN NaN
238 632.06 120.46 1453.16 25058 NaN NaN NaN NaN NaN
239 488.70 906.92 3089.85 25220 NaN NaN NaN NaN NaN
In [108]:
result = pd.merge(left=user_usage, right=user_device, on='use_id', how='right')
print(user_device.shape, result.shape, result['device'].isnull().sum(), result['monthly_mb'].isnull().sum())
(272, 6) (272, 9) 0 113
In [109]:
display(result.head(), result.tail())
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform platform_version device use_type_id
0 21.97 4.82 1557.33 22787 12921 android 4.3 GT-I9505 1
1 1710.08 136.88 7267.55 22788 28714 android 6.0 SM-G930F 1
2 1710.08 136.88 7267.55 22789 28714 android 6.0 SM-G930F 1
3 94.46 35.17 519.12 22790 29592 android 5.1 D2303 1
4 71.59 79.26 1557.33 22792 28217 android 5.1 SM-G361F 1
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform platform_version device use_type_id
267 NaN NaN NaN 23047 29720 ios 10.2 iPhone7,1 2
268 NaN NaN NaN 23048 29724 android 6.0 ONEPLUS A3003 3
269 NaN NaN NaN 23050 29726 ios 10.2 iPhone7,2 3
270 NaN NaN NaN 23051 29726 ios 10.2 iPhone7,2 3
271 NaN NaN NaN 23052 29727 ios 10.1 iPhone8,4 3
In [110]:
print(user_usage['use_id'].unique().shape[0], user_device['use_id'].unique().shape[0], pd.concat([user_usage['use_id'], user_device['use_id']]).unique().shape[0])
240 272 353
In [111]:
result = pd.merge(left=user_usage, right=user_device, on='use_id', how='outer')
print(result.shape)
(353, 9)
In [112]:
print((result.apply(lambda x: x.isnull().sum(), axis=1) == 0).sum())
159
In [113]:
# Note that all rows from left and right merge dataframes are included, but NaNs will be in different columns depending if the data originated in the left or right dataframe.
result = pd.merge(left=user_usage, right=user_device, on='use_id', how='outer', indicator=True)
result.iloc[[0, 1, 200, 201, 350, 351]]
Out[113]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform platform_version device use_type_id _merge
0 21.97 4.82 1557.33 22787 12921.0 android 4.3 GT-I9505 1.0 both
1 1710.08 136.88 7267.55 22788 28714.0 android 6.0 SM-G930F 1.0 both
200 28.79 29.42 3114.67 23988 NaN NaN NaN NaN NaN left_only
201 616.56 99.85 5414.14 24006 NaN NaN NaN NaN NaN left_only
350 NaN NaN NaN 23050 29726.0 ios 10.2 iPhone7,2 3.0 right_only
351 NaN NaN NaN 23051 29726.0 ios 10.2 iPhone7,2 3.0 right_only
In [114]:
# For the question,
result1 = pd.merge(left=user_usage, right=user_device, on='use_id', how='left')
result1.head()
Out[114]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform platform_version device use_type_id
0 21.97 4.82 1557.33 22787 12921.0 android 4.3 GT-I9505 1.0
1 1710.08 136.88 7267.55 22788 28714.0 android 6.0 SM-G930F 1.0
2 1710.08 136.88 7267.55 22789 28714.0 android 6.0 SM-G930F 1.0
3 94.46 35.17 519.12 22790 29592.0 android 5.1 D2303 1.0
4 71.59 79.26 1557.33 22792 28217.0 android 5.1 SM-G361F 1.0
In [115]:
device_info.head()
Out[115]:
Retail Branding Marketing Name Device Model
0 NaN NaN AD681H Smartfren Andromax AD681H
1 NaN NaN FJL21 FJL21
2 NaN NaN T31 Panasonic T31
3 NaN NaN hws7721g MediaPad 7 Youth 2
4 3Q OC1020A OC1020A OC1020A
In [116]:
result_final = pd.merge(left=result1, right=device_info[['Retail Branding', 'Marketing Name', 'Model']],
                        left_on='device', right_on='Model', how='left')
result_final[result_final['Retail Branding'] == 'Samsung'].head()
Out[116]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform platform_version device use_type_id Retail Branding Marketing Name Model
0 21.97 4.82 1557.33 22787 12921.0 android 4.3 GT-I9505 1.0 Samsung Galaxy S4 GT-I9505
1 1710.08 136.88 7267.55 22788 28714.0 android 6.0 SM-G930F 1.0 Samsung Galaxy S7 SM-G930F
2 1710.08 136.88 7267.55 22789 28714.0 android 6.0 SM-G930F 1.0 Samsung Galaxy S7 SM-G930F
4 71.59 79.26 1557.33 22792 28217.0 android 5.1 SM-G361F 1.0 Samsung Galaxy Core Prime SM-G361F
5 71.59 79.26 1557.33 22793 28217.0 android 5.1 SM-G361F 1.0 Samsung Galaxy Core Prime SM-G361F
In [117]:
result_final[result_final['Retail Branding'] == 'LGE'].head()
Out[117]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform platform_version device use_type_id Retail Branding Marketing Name Model
78 67.35 15.38 1557.33 22892 29671.0 android 7.1 Nexus 5X 1.0 LGE Nexus 5X Nexus 5X
95 155.71 10.14 1557.33 22922 28845.0 android 6.0 LG-H815 1.0 LGE LG G4 LG-H815
In [118]:
group1 = result_final[result_final['Retail Branding'] == 'Samsung']
group2 = result_final[result_final['Retail Branding'] == 'LGE']
display(group1.describe())
display(group2.describe())
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform_version use_type_id
count 108.000000 108.000000 108.000000 108.000000 108.000000 108.000000 108.0
mean 191.010093 92.390463 4017.318889 22922.944444 25208.361111 5.453704 1.0
std 237.723857 82.747038 5637.419003 72.177848 7359.871323 0.664230 0.0
min 8.140000 0.790000 0.000000 22787.000000 2873.000000 4.100000 1.0
25% 69.137500 27.630000 1557.330000 22871.000000 26505.750000 5.000000 1.0
50% 145.550000 79.260000 2076.450000 22932.500000 28953.000000 6.000000 1.0
75% 239.125000 127.970000 3114.670000 22980.500000 29676.500000 6.000000 1.0
max 1710.080000 435.290000 31146.670000 23049.000000 29725.000000 6.000000 1.0
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id user_id platform_version use_type_id
count 2.000000 2.00000 2.00 2.000000 2.000000 2.000000 2.0
mean 111.530000 12.76000 1557.33 22907.000000 29258.000000 6.550000 1.0
std 62.479955 3.70524 0.00 21.213203 584.070201 0.777817 0.0
min 67.350000 10.14000 1557.33 22892.000000 28845.000000 6.000000 1.0
25% 89.440000 11.45000 1557.33 22899.500000 29051.500000 6.275000 1.0
50% 111.530000 12.76000 1557.33 22907.000000 29258.000000 6.550000 1.0
75% 133.620000 14.07000 1557.33 22914.500000 29464.500000 6.825000 1.0
max 155.710000 15.38000 1557.33 22922.000000 29671.000000 7.100000 1.0
In [119]:
result_final.groupby('Retail Branding').agg({'outgoing_mins_per_month':'mean',
                                             'outgoing_sms_per_month':'mean',
                                             'monthly_mb':'mean',
                                             'user_id':'count'})
Out[119]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb user_id
Retail Branding
HTC 299.842955 93.059318 5144.077955 44
Huawei 81.526667 9.500000 1561.226667 3
LGE 111.530000 12.760000 1557.330000 2
Lava 60.650000 261.900000 12458.670000 2
Lenovo 215.920000 12.930000 1557.330000 2
... ... ... ... ...
OnePlus 354.855000 48.330000 6575.410000 6
Samsung 191.010093 92.390463 4017.318889 108
Sony 177.315625 40.176250 3212.000625 16
Vodafone 42.750000 46.830000 5191.120000 1
ZTE 42.750000 46.830000 5191.120000 1

11 rows × 4 columns

Basic Plotting Pandas DataFrames

(https://pandas.pydata.org/pandas-docs/stable/visualization.html)
그래픽을 생성하려면 matplotlib 플로팅 패키지가 설치되어 있어야하며 인라인 플롯에 대해 "% matplotlib 인라인"노트북 'magic'이 활성화되어 있어야합니다. 다이어그램에 그림 레이블과 축 레이블을 추가하려면 "import matplotlib.pyplot as plt"도 필요합니다. Pandas가 기본적으로 .plot () 명령으로 많은 기능을 제공합니다.

In [120]:
# Plotting DataFrames
import matplotlib.pyplot as plt
raw_data['latitude'].plot(kind='hist', bins=100)
plt.xlabel('Latitude Value')
plt.show()
<Figure size 640x480 with 1 Axes>
In [121]:
raw_data.loc[raw_data['Element'] == 'Food']
Out[121]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
5 AF 2 Afghanistan 2514 Maize and products 5142 Food 1000 tonnes 33.94 67.71 ... 231.0 67.0 82.0 67.0 69.0 71.0 82.0 73.0 77 76
6 AF 2 Afghanistan 2517 Millet and products 5142 Food 1000 tonnes 33.94 67.71 ... 15.0 21.0 11.0 19.0 21.0 18.0 14.0 14.0 14 12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21470 ZW 181 Zimbabwe 2949 Eggs 5142 Food 1000 tonnes -19.02 29.15 ... 15.0 18.0 18.0 21.0 22.0 27.0 27.0 24.0 24 25
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

17528 rows × 63 columns

In [122]:
raw_data_test = raw_data.loc[raw_data['Element'] == 'Food']
pd.DataFrame(raw_data_test.groupby('Area')['Y2013'].sum())
Out[122]:
Y2013
Area
Afghanistan 21471
Albania 6952
Algeria 63455
Angola 30121
Antigua and Barbuda 119
... ...
Venezuela (Bolivarian Republic of) 39706
Viet Nam 105399
Yemen 18325
Zambia 10180
Zimbabwe 9524

174 rows × 1 columns

In [123]:
pd.DataFrame(raw_data_test.groupby('Area')['Y2013'].sum().sort_values(ascending=False))
Out[123]:
Y2013
Area
China, mainland 2499252
India 1238335
United States of America 641776
Brazil 312488
Russian Federation 253892
... ...
Kiribati 135
Grenada 122
Antigua and Barbuda 119
Bermuda 103
Saint Kitts and Nevis 56

174 rows × 1 columns

In [124]:
pd.DataFrame(raw_data_test.groupby('Area')['Y2013'].sum().sort_values(ascending=False)[:10])
Out[124]:
Y2013
Area
China, mainland 2499252
India 1238335
United States of America 641776
Brazil 312488
Russian Federation 253892
Indonesia 237826
Nigeria 228877
Pakistan 180994
Mexico 166591
Germany 158473
In [125]:
raw_data_test.groupby('Area')['Y2013'].sum().sort_values(ascending=False)[:10].plot(kind='bar')
plt.title('Top Ten Food Producers')
plt.ylabel('Food Produced (tonnes)')
Out[125]:
Text(0, 0.5, 'Food Produced (tonnes)')
In [ ]:
 

'Kafka & Elasticsearch' 카테고리의 다른 글

Text and Keyword 타입  (0) 2020.04.29
Elasticsearch scaling down  (0) 2019.08.04
Kafka Streams (Stateful, Aggregating)  (0) 2019.03.17
Kafka Fail-over (cluster)  (0) 2019.01.07
토픽 삭제 (Topic delete)  (0) 2019.01.06
댓글
공지사항
최근에 올라온 글
최근에 달린 댓글
Total
Today
Yesterday
링크
«   2025/01   »
1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31
글 보관함