Table of Contents

Opening
Data loading manually and from CSV files to Pandas DataFrame
Loading, editing, and viewing data from Pandas DataFrame
Renaming columns, exporting and saving Pandas DataFrames
Summarising, grouping, and aggregating data in Pandas
Merge and join DataFrames with Pandas
Basic Plotting Pandas DataFrames

Opening

CSV (comma-separated values) files are a common data file format. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. This section covers:
1) what CSV files are,
2) how to read CSV files into "Pandas DataFrames",
3) how to write DataFrames back to CSV files.

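What is a "Pandas DataFrame"?
- Pandas is the most widely used data manipulation package in Python, and the DataFrame is the Pandas data type for storing tabular 2D data.
- Pandas development started in 2008 with main developer Wes McKinney, and the library has become a standard for data analysis and management using Python.
- Pandas fluency is essential for any Python-based data professional, anyone interested in Kaggle challenges, or anyone seeking to automate a data process.
- The Pandas library documentation defines a DataFrame as a "two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)".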

There can be multiple rows and columns in the data.
Each row represents a sample of data,
Each column contains a different variable that describes the samples (rows).
The data in every column is usually the same type of data – e.g. numbers, strings, dates.
Usually, unlike an Excel data set, DataFrames avoid having missing values, and there are no gaps or empty values between rows or columns.

# Manually generate data
import pandas as pd
pd.options.display.max_columns = 20
pd.options.display.max_rows = 10

# Each dictionary key becomes a column name and each list becomes that column's values
data = {'column1':[1,2,3,4,5],
        'another_column':['this', 'column', 'has', 'strings', 'inside!'],
        'float_column':[0.1, 0.5, 33, 48, 42.5555],
        'binary_column':[True, False, True, True, False]}
print(data)
print(data['column1'])
display(pd.DataFrame(data))
display(pd.DataFrame(data['column1']))

{'column1': [1, 2, 3, 4, 5], 'another_column': ['this', 'column', 'has', 'strings', 'inside!'], 'float_column': [0.1, 0.5, 33, 48, 42.5555], 'binary_column': [True, False, True, True, False]}
[1, 2, 3, 4, 5]

Data loading manually and from CSV files to Pandas DataFrame

(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)
There are three fundamental concepts to grasp in order to understand and debug the data loading procedure, followed by the most common loading errors.
1) Understanding file extensions and file types – what do the letters CSV actually mean? What's the difference between a .csv file and a .txt file?
2) Understanding how data is represented inside CSV files – if you open a CSV file, what does the data actually look like?
3) Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
4) Recognising common CSV file loading errors:

FileNotFoundError: File b'filename.csv' does not exist
=> A File Not Found error is typically an issue with path setup, the current directory, or file name confusion (the file extension can play a part here!).
UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
=> A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when the file contains non-standard characters. Either pass the file's encoding to read_csv through the "encoding" parameter or, for a quick fix, open the file in a text editor such as Sublime Text and re-save it with encoding 'UTF-8'.
pandas.parser.CParserError: Error tokenizing data.
=> Parse Errors (raised as pandas.errors.ParserError in recent pandas versions) can be caused in unusual circumstances to do with your data format. Try adding the parameter engine='python' to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
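The encoding and path errors described above can be handled defensively. The following is a minimal sketch, not part of the original notebook: the helper name read_csv_with_fallback, the file name latin1_example.csv and the list of encodings are all assumptions chosen for illustration.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in turn until one decodes the file."""
    if not os.path.exists(path):
        # Report the working directory, since path confusion is the usual cause.
        raise FileNotFoundError(f"{path!r} not found (cwd: {os.getcwd()})")
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError(f"none of {encodings} could decode {path!r}")


# Build a small Latin-1 encoded file that is not valid UTF-8 (illustrative data).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "latin1_example.csv")
with open(csv_path, "wb") as handle:
    handle.write("name,city\nJosé,Málaga\n".encode("latin-1"))

df = read_csv_with_fallback(csv_path)
print(df)
```

Because Latin-1 can decode any byte sequence, it only makes sense as the last entry in the list; if the real encoding is known, passing it directly to read_csv is the safer choice.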

# Finding your Python path
# The "os" module provides operating-system-dependent functionality in Python
import os
print(os.getcwd())
print(os.listdir())
# os.chdir("path")

/Users/ijuseong/opt/anaconda3/TimeSeriesAnalysis/Shared
['[FastCampus] 2주차_강의자료_김경원박사.pptx', 'Practice3_Setting_Analysis_KK.ipynb', 'Lecture2_Learning_TimeSeries_KK.ipynb', 'Untitled.ipynb', '[FastCampus] 3주차_강의자료_김경원박사.pptx', 'Practice0_Installation_Program_KK.ipynb', '__pycache__', 'week_contents.txt', '[FastCampus] 1주차_강의자료_김경원박사.pptx', 'Image', 'Practice4_FE_Analysis_KK.ipynb', '[FastCampus] 7주차_강의자료_김경원박사.pptx', 'Lecture1_DataAnalysisCycle_DataStatistics_KK.ipynb', 'Practice5_Agile_Analysis_KK.ipynb', 'Lecture3_Algorithms_ML_TS_Linear_KK.ipynb', 'Practice6_TimeSeries_Analysis_KK.ipynb', '[FastCampus] 8주차_강의자료_김경원박사.pptx', '[FastCampus] 5주차_강의자료_김경원박사.pptx', 'Practice1_Tutorial_Pandas_KK.ipynb', 'Practice2_Tutorial_Numpy.ipynb', '.ipynb_checkpoints', 'module.py', '[FastCampus] 6주차_강의자료_김경원박사.pptx', 'Lecture4_Algorithms_TS_NonLinear_Multivariate_KK.ipynb', '[FastCampus] 4주차_강의자료_김경원박사.pptx']

# File loading from "absolute" and "relative" paths
# Relative paths are directions to the file starting at your current working directory, whereas absolute paths always start at the base of your file system.
# direct_path : 'https://s3-eu-west-1.amazonaws.com/shanebucket/downloads/FAO+database.csv' from 'https://www.kaggle.com/dorbicycle/world-foodfeed-production'
relative_path = '../Data/FoodAgricultureOrganization/Food_Agriculture_Organization_UN_Full.csv'
pd.read_csv(relative_path, sep=',')

# os.path.abspath resolves the relative path against the current working directory
absolute_path = os.path.abspath(relative_path)
pd.read_csv(absolute_path, sep=',')

pd.options.display.max_columns = 20
relative_path = '../Data/FoodAgricultureOrganization/Food_Agriculture_Organization_UN_Full.csv'
raw_data = pd.read_csv(relative_path, sep=',')
raw_data

Loading, editing, and viewing data from Pandas DataFrame

For wide DataFrames, Pandas displays only 20 columns by default, and only about 60 rows, truncating the middle section. To change these limits, you can edit the defaults using some internal options of the Pandas display (simply use pd.options.display.XX = value to set these) (https://pandas.pydata.org/pandas-docs/stable/options.html)

pd.options.display.width – the width of the display in characters – use this if your display is wrapping rows over more than one line.
pd.options.display.max_rows – maximum number of rows displayed.
pd.options.display.max_columns – maximum number of columns displayed.
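When the display limits should change only for a single output, pd.option_context can set them temporarily. This is a small sketch on a made-up DataFrame (the name wide and its contents are assumptions, not data from this tutorial).

```python
import pandas as pd

# Illustrative wide frame: 100 rows, 30 columns.
wide = pd.DataFrame({f"col{i}": range(100) for i in range(30)})

before = pd.get_option("display.max_rows")

# Options set here apply only inside the with-block and are restored on exit.
with pd.option_context("display.max_rows", 6, "display.max_columns", 8):
    inside = pd.get_option("display.max_rows")
    print(wide)

after = pd.get_option("display.max_rows")
print(before, inside, after)
```

Assigning to pd.options.display.XX, as described above, changes the setting for the rest of the session instead.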

Finally, to see some of the core statistics about a particular column, use the 'describe' function.

For numeric columns, describe() returns basic statistics: the value count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles of the data in a column.
For string columns, describe() returns the value count, the number of unique entries, the most frequently occurring value ('top'), and the number of times the top value occurs ('freq').

There are two main options for selecting and indexing data in Pandas: .loc and .iloc. With either one, you can control the output format by passing lists or single values to the selectors. (http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label)

iloc
- .iloc returns a Pandas Series when a single row or a single column is selected, and a Pandas DataFrame when multiple rows or columns are selected. To get a DataFrame even for a single row or column, pass a single-valued list.
- When selecting multiple columns or rows with a slice such as [1:5], the selection runs from the first number to one minus the second number. e.g. [1:5] gives 1,2,3,4, and [x:y] runs from x to y-1.
loc
- Label-based / Index-based indexing
- Boolean / Logical indexing
  - Pass an array or Series of True / False values to the .loc indexer to select the rows where the Series has True values.
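The differences between the two selectors can be seen on a tiny DataFrame. This sketch uses invented data (the frame df, its index labels 100 to 103 and its column names are assumptions for illustration).

```python
import pandas as pd

df = pd.DataFrame(
    {"area": ["Ireland", "Spain", "Chile", "Japan"], "y2013": [10, 20, 30, 40]},
    index=[100, 101, 102, 103],
)

row_as_series = df.iloc[0]      # single position -> Series
row_as_frame = df.iloc[[0]]     # single-valued list -> DataFrame
by_position = df.iloc[1:3]      # positions 1 and 2; the end position is excluded
by_label = df.loc[101:103]      # labels 101, 102 and 103; the end label is included
by_mask = df.loc[df["y2013"] > 15, "area"]  # boolean Series picks the rows

print(type(row_as_series).__name__, type(row_as_frame).__name__)
print(len(by_position), len(by_label))
print(by_mask.tolist())
```

Note that slices behave differently: .iloc slices exclude the end position, while .loc slices include the end label.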

Selecting rows and columns

A single column can be selected in three main ways:
using dot notation, e.g. data.column_name,
using square brackets and the name of the column as a string, e.g. data['column_name'],
using numeric indexing and the iloc selector, e.g. data.iloc[:, column_number]

When a column is selected using any of these methods, the resulting datatype is a pandas.Series. A pandas Series is a one-dimensional set of data.

To select multiple columns, which returns a DataFrame, use either:
square-bracket selection with a list of column names, e.g. data[['column_name_1', 'column_name_2']]
numeric indexing with the iloc selector and a list of column numbers, e.g. data.iloc[:, [0,1,20,22]]

Rows in a DataFrame are typically selected using the iloc/loc selection methods, or using logical selectors:

numeric row selection using the iloc selector, e.g. data.iloc[0:10, :] – select the first 10 rows.
label-based row selection using the loc selector (this is mainly useful once you have set an "index" on your dataframe), e.g. data.loc[44, :]
logical-based row selection using evaluated statements, e.g. data[data["Area"] == "Ireland"] – select the rows where the Area value is 'Ireland'.
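The selection methods listed above can be tried side by side. The sketch below uses a small invented DataFrame (the name data and its three rows are assumptions; the column names merely echo the FAO data).

```python
import pandas as pd

data = pd.DataFrame(
    {
        "Area": ["Ireland", "Ireland", "Spain"],
        "Item": ["Wheat", "Oats", "Wheat"],
        "Y2013": [5, 7, 9],
    }
)

one_column = data["Area"]              # square brackets -> Series
same_column = data.Area                # dot notation -> the same Series
two_columns = data[["Area", "Y2013"]]  # list of names -> DataFrame

first_two_rows = data.iloc[0:2, :]             # numeric row selection
ireland_rows = data[data["Area"] == "Ireland"]  # logical row selection

# Label-based selection after setting an index on the frame.
indexed = data.set_index("Item")
wheat_rows = indexed.loc["Wheat", :]   # the label occurs twice, so two rows come back

print(two_columns.shape, len(ireland_rows), wheat_rows.shape)
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the square-bracket form is the more general one.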

To delete rows and columns from DataFrames, Pandas uses the “drop” function.

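To delete a column, or multiple columns, use the name of the column(s) and specify the "axis" as 1.
Alternatively, as in the examples below, the 'columns' parameter has been added to Pandas, which removes the need for 'axis'.
The drop function returns a new DataFrame with the columns removed. To actually edit the original DataFrame, the "inplace" parameter can be set to True, in which case no value is returned.
Rows can also be removed with the "drop" function by specifying axis=0. drop() removes rows based on "labels" rather than numeric indexing. To delete rows based on their numeric position / index, use iloc to reassign the dataframe values.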
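The drop behaviours described above, sketched on a small invented DataFrame (df, scratch and the index labels 10 to 13 are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"Area": ["A", "B", "C", "D"], "Y2012": [1, 2, 3, 4], "Y2013": [5, 6, 7, 8]},
    index=[10, 11, 12, 13],
)

without_area = df.drop(columns="Area")    # returns a new DataFrame; df is unchanged
without_rows = df.drop([10, 12], axis=0)  # rows are removed by label, not position
after_first_two = df.iloc[2:]             # rows are removed by position with iloc

# inplace=True edits the object itself and returns None.
scratch = df.copy()
returned = scratch.drop(columns="Y2012", inplace=True)

print(list(without_area.columns), list(without_rows.index), returned)
```

Reassigning the result (for example df = df.drop(columns="Area")) is an alternative to inplace=True that keeps the operation explicit.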

# Examine data in a Pandas DataFrame
raw_data.shape

(21477, 63)

raw_data.ndim

2

raw_data.head(5)

raw_data.tail(5)

raw_data.dtypes

Area Abbreviation     object
Area Code              int64
Area                  object
Item Code              int64
Item                  object
                      ...   
Y2009                float64
Y2010                float64
Y2011                float64
Y2012                  int64
Y2013                  int64
Length: 63, dtype: object

raw_data['Item Code'] = raw_data['Item Code'].astype(str)
raw_data.dtypes

Area Abbreviation     object
Area Code              int64
Area                  object
Item Code             object
Item                  object
                      ...   
Y2009                float64
Y2010                float64
Y2011                float64
Y2012                  int64
Y2013                  int64
Length: 63, dtype: object

raw_data['Y2013'].describe()

count     21477.000000
mean        575.557480
std        6218.379479
min        -246.000000
25%           0.000000
50%           8.000000
75%          90.000000
max      489299.000000
Name: Y2013, dtype: float64

raw_data['Area'].describe()

count     21477
unique      174
top       Spain
freq        150
Name: Area, dtype: object

raw_data.describe()

# Selecting and manipulating data
raw_data.iloc[0]

Area Abbreviation                    AF
Area Code                             2
Area                        Afghanistan
Item Code                          2511
Item                 Wheat and products
                            ...        
Y2009                              4538
Y2010                              4605
Y2011                              4711
Y2012                              4810
Y2013                              4895
Name: 0, Length: 63, dtype: object

raw_data.iloc[[1]]

raw_data.iloc[[-1]]

raw_data.iloc[:,0]

0        AF
1        AF
2        AF
3        AF
4        AF
         ..
21472    ZW
21473    ZW
21474    ZW
21475    ZW
21476    ZW
Name: Area Abbreviation, Length: 21477, dtype: object

raw_data.iloc[:,[1]]

raw_data.iloc[:,[-1]]

raw_data.iloc[0:5]

raw_data.iloc[:,0:2]

raw_data.iloc[[0,3,6,24],[0,5,6]]

raw_data.iloc[0:5, 5:8]

raw_data.loc[0]

Area Abbreviation                    AF
Area Code                             2
Area                        Afghanistan
Item Code                          2511
Item                 Wheat and products
                            ...        
Y2009                              4538
Y2010                              4605
Y2011                              4711
Y2012                              4810
Y2013                              4895
Name: 0, Length: 63, dtype: object

raw_data.loc[[1]]

raw_data.loc[[1,3]]

raw_data.loc[[1,3],['Item','Y2013']]

raw_data.loc[[1,3],'Item':'Y2013']

raw_data.loc[1:3,'Item':'Y2013']

raw_data_test = raw_data.loc[10:,'Item':'Y2013']
raw_data_test.iloc[[0]]

raw_data_test.loc[[10]]

raw_data.loc[raw_data['Item'] == 'Sugar beet']

# is same as
raw_data[raw_data['Item'] == 'Sugar beet']

raw_data.loc[raw_data['Item'] == 'Sugar beet', 'Area']

10                              Afghanistan
103                                 Albania
699                                 Armenia
832                               Australia
1099                             Azerbaijan
                        ...                
20020                  United Arab Emirates
20676                            Uzbekistan
20900    Venezuela (Bolivarian Republic of)
21135                                 Yemen
21374                              Zimbabwe
Name: Area, Length: 66, dtype: object

raw_data.loc[raw_data['Item'] == 'Sugar beet', ['Area']]

# is not same as
raw_data[raw_data['Item'] == 'Sugar beet', ['Area']]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-d794b59d3bc7> in <module>
      1 # is not same as
----> 2 raw_data[raw_data['Item'] == 'Sugar beet', ['Area']]

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2978             if self.columns.nlevels > 1:
   2979                 return self._getitem_multilevel(key)
-> 2980             indexer = self.columns.get_loc(key)
   2981             if is_integer(indexer):
   2982                 indexer = [indexer]

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2895                 )
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:
   2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(0        False
1        False
2        False
3        False
4        False
         ...  
21472    False
21473    False
21474    False
21475    False
21476    False
Name: Item, Length: 21477, dtype: bool, ['Area'])' is an invalid key

raw_data.loc[raw_data['Item'] == 'Sugar beet', ['Area', 'Item', 'latitude']]

raw_data.loc[raw_data['Item'] == 'Sugar beet', 'Area':'latitude']

raw_data.loc[raw_data['Area'].str.endswith('many')]

# returns the same rows here, assuming Germany is the only Area ending in 'many'
raw_data.loc[raw_data['Area'].isin(['Germany'])]

raw_data.loc[raw_data['Area'].isin(['Germany', 'France'])]

raw_data.loc[(raw_data['Area'].str.endswith('many')) & (raw_data['Element'] == 'Feed')]

raw_data.loc[(raw_data['Y2004'] < 1000) & (raw_data['Y2004'] > 990)]

raw_data.loc[(raw_data['Y2004'] < 1000) & (raw_data['Y2004'] > 990), ['Area', 'Item', 'latitude']]

raw_data.loc[raw_data['Item'].apply(lambda x: len(x.split(' ')) == 5)]

# is same as
TF_indexing = raw_data['Item'].apply(lambda x: len(x.split(' ')) == 5)
raw_data.loc[TF_indexing]

raw_data.loc[TF_indexing, ['Area', 'Item', 'latitude']]

raw_data_test = raw_data.copy()
raw_data_test.loc[(raw_data_test['Y2004'] < 1000) & (raw_data_test['Y2004'] > 990), ['Area']]

raw_data_test.loc[(raw_data_test['Y2004'] < 1000) & (raw_data_test['Y2004'] > 990), ['Area']] = 'Company'
raw_data_test.loc[(raw_data_test['Y2004'] < 1000) & (raw_data_test['Y2004'] > 980), ['Area']]

raw_data['Y2007'].sum(), raw_data['Y2007'].mean(), raw_data['Y2007'].median(), raw_data['Y2007'].nunique(), raw_data['Y2007'].count(), raw_data['Y2007'].max(), raw_data['Y2007'].min()

(10867788.0, 508.48210358863986, 7.0, 1994, 21373, 402975.0, 0.0)

[raw_data['Y2007'].sum(),
 raw_data['Y2007'].mean(),
 raw_data['Y2007'].median(),
 raw_data['Y2007'].nunique(),
 raw_data['Y2007'].count(),
 raw_data['Y2007'].max(),
 raw_data['Y2007'].min(),
 raw_data['Y2007'].isna().sum(),
 raw_data['Y2007'].fillna(0)]

[10867788.0,
 508.48210358863986,
 7.0,
 1994,
 21373,
 402975.0,
 0.0,
 104,
 0        4164.0
 1         455.0
 2         263.0
 3          48.0
 4         249.0
           ...  
 21472     356.0
 21473       6.0
 21474      14.0
 21475       0.0
 21476       0.0
 Name: Y2007, Length: 21477, dtype: float64]

# Delete the "Area" column from the dataframe
raw_data.drop("Area", axis=1)

# alternatively, delete columns using the columns parameter of drop
raw_data.drop(columns="Area")

# With inplace=True the original DataFrame would be modified and nothing returned; inplace=False (the default) returns a new DataFrame and leaves raw_data unchanged
raw_data.drop("Area", axis=1, inplace=False)

# Delete multiple columns from the dataframe
raw_data.drop(["Y2011", "Y2012", "Y2013"], axis=1)

# Delete the rows with labels 0,1,5
raw_data.drop([0,1,5], axis=0)

# Delete the rows with label "Afghanistan". For label-based deletion, set the index first on the dataframe
raw_data.set_index("Area")
raw_data.set_index("Area").drop("Afghanistan", axis=0)

# Delete the first five rows using iloc selector
raw_data.iloc[5:,]

Renaming columns, exporting and saving Pandas DataFrames

Column renames are achieved easily in Pandas using the DataFrame rename function. The rename function is easy to use and quite flexible.

Rename by mapping old names to new names using a dictionary, with form {“old_column_name”: “new_column_name”, …}
Rename by providing a function to change the column names with. Functions are applied to every column name.

After manipulation or calculations, saving your data back to CSV is the next step.

to_csv to write a DataFrame to a CSV file,
to_excel to write DataFrame information to a Microsoft Excel file.
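Renaming and exporting can be combined into a short round trip. The sketch below uses invented data and writes to a temporary directory; the frame df and the file name roundtrip_example.csv are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Area Name": ["Ireland", "Spain"], "Y2013": [1.5, 2.5]})

# Rename with a function: it is applied to every column name.
renamed = df.rename(columns=lambda name: name.upper().replace(" ", "_"))

# Write without the row index, then read the file back.
out_path = os.path.join(tempfile.mkdtemp(), "roundtrip_example.csv")
renamed.to_csv(out_path, index=False, encoding="utf-8")
reloaded = pd.read_csv(out_path, encoding="utf-8")

print(list(reloaded.columns))
print(reloaded)
```

Leaving out index=False would write the row labels as an extra first column, which then reappears as an unnamed column when the file is read back.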

# Renaming of columns
raw_data.rename(columns={'Area':'New_Area'})

display(raw_data)
raw_data.rename(columns={'Area':'New_Area'}, inplace=False)

raw_data.rename(columns={'Area':'New_Area',
                         'Y2013':'Year_2013'}, inplace=False)

raw_data.rename(columns=lambda x: x.upper().replace(' ', '_'), inplace=False)

# Exporting and saving
# Output data to a CSV file
# Row numbers are not wanted in the output file, hence index=False; to avoid character issues, utf8 encoding is typically used for input/output.
raw_data.to_csv("Tutorial_Pandas_Output_Filename.csv", index=False, encoding='utf8')
# Output data to an Excel file.
# For the excel output to work, you may need to install the "xlsxwriter" package.
raw_data.to_excel("Tutorial_Pandas_Output_Filename.xlsx", sheet_name="Sheet 1", index=False)

Summarising, grouping, and aggregating data in Pandas

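The describe() function is a useful summarisation tool that quickly displays statistics for any variable or group it is applied to. The describe() output varies depending on whether you apply it to a numeric or a character column.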

Function	Description
count	Number of non-null observations
sum	Sum of values
mean	Mean of values
mad	Mean absolute deviation
median	Arithmetic median of values
min	Minimum
max	Maximum
mode	Mode
abs	Absolute Value
prod	Product of values
std	Unbiased standard deviation
var	Unbiased variance
sem	Unbiased standard error of the mean
skew	Unbiased skewness (3rd moment)
kurt	Unbiased kurtosis (4th moment)
quantile	Sample quantile (value at %)
cumsum	Cumulative sum
cumprod	Cumulative product
cummax	Cumulative maximum
cummin	Cumulative minimum
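A few of the functions in the table, applied to an invented Series (values and its five numbers are assumptions). One caveat: mad() is no longer available in pandas 2.0 and later, so the sketch computes the mean absolute deviation by hand.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

summary = {
    "count": values.count(),
    "median": values.median(),
    "std": values.std(),                     # unbiased (divides by n - 1)
    "quantile_75": values.quantile(0.75),
    "cumsum_last": values.cumsum().iloc[-1],  # last cumulative sum equals the total
}

# Mean absolute deviation around the mean, written out explicitly.
mean_abs_dev = (values - values.mean()).abs().mean()

print(summary)
print(mean_abs_dev)
```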

We will group large dataframes by different variables and apply summary functions to each group. This is done in Pandas using the "groupby()" and "agg()" functions of the Pandas DataFrame object. (http://pandas.pydata.org/pandas-docs/stable/groupby.html)

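groupby() essentially splits the data into different groups depending on a variable of your choice.
The groupby() function returns a GroupBy object, which essentially describes how the rows of the original data set have been split.
The GroupBy object's .groups variable is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group.
Functions like max(), min(), mean(), first(), last() can be quickly applied to the GroupBy object to obtain summary statistics for each group.
If you calculate more than one column of results, the result is a DataFrame. For a single column of results, the agg function produces a Series by default. You can change this by selecting the operation column differently (e.g. with [[ ]]).
The groupby output has an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass "as_index=False" to the groupby operation.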
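The points above can be checked on a tiny invented phone log (the frame phone and its five rows are assumptions that only mimic the columns of the tutorial data).

```python
import pandas as pd

phone = pd.DataFrame(
    {
        "month": ["2014-11", "2014-11", "2014-12", "2014-12", "2014-12"],
        "item": ["call", "sms", "call", "call", "data"],
        "duration": [10.0, 1.0, 20.0, 30.0, 5.0],
    }
)

grouped = phone.groupby("month")

# .groups maps each group key to the row labels that belong to it.
group_labels = {key: list(labels) for key, labels in grouped.groups.items()}

as_series = grouped["duration"].sum()    # single column -> Series
as_frame = grouped[["duration"]].sum()   # double brackets -> DataFrame
flat = phone.groupby("month", as_index=False)[["duration"]].sum()  # month stays a column

print(group_labels)
print(as_series)
print(flat)
```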

The aggregation functionality provided by the agg() function allows multiple statistics to be calculated per group in one calculation.

When multiple statistics are calculated on columns, the resulting dataframe has a multi-index set on the column axis. This can be difficult to work with, so it is often better to rename the columns after a groupby operation.
A neat approach is to use the ravel() method on the grouped columns. ravel() turns the Pandas multi-index into a simpler array, whose entries can be combined into sensible column names.
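The column flattening can be sketched on a small invented frame. This example swaps in MultiIndex.to_flat_index() as an alternative to ravel(); both yield (column, statistic) tuples that can be joined into names. The frame phone and its values are assumptions for illustration.

```python
import pandas as pd

phone = pd.DataFrame(
    {
        "month": ["2014-11", "2014-11", "2014-12", "2014-12", "2014-12"],
        "item": ["call", "sms", "call", "call", "data"],
        "duration": [10.0, 1.0, 20.0, 30.0, 5.0],
    }
)

# Several statistics per column produce a two-level column index.
stats = phone.groupby("month").agg({"duration": ["min", "max", "sum"], "item": "count"})
print(stats.columns.nlevels)

# Join each (column, statistic) pair into a single flat name.
stats.columns = ["_".join(pair) for pair in stats.columns.to_flat_index()]
print(stats)
```

Unlike dropping the top level, joining both levels keeps the names unique even when the same statistic is computed for several columns.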

# Summarising
url_path = 'https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2015/06/phone_data.csv'
raw_phone = pd.read_csv(url_path)
raw_phone

if 'date' in raw_phone.columns:
    raw_phone['date'] = pd.to_datetime(raw_phone['date'])
raw_phone

raw_phone['duration'].max()

10528.0

raw_phone['item'].unique()

array(['data', 'call', 'sms'], dtype=object)

raw_phone['duration'][raw_phone['item'] == 'data'].max()

34.429

raw_phone['network'].unique()

array(['data', 'Vodafone', 'Meteor', 'Tesco', 'Three', 'voicemail',
       'landline', 'special', 'world'], dtype=object)

raw_phone['month'].value_counts()

2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
Name: month, dtype: int64

# Grouping
raw_phone.groupby(['month']).groups

{'2014-11': Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
             ...
             220, 221, 222, 223, 224, 225, 226, 227, 229, 230],
            dtype='int64', length=230),
 '2014-12': Int64Index([228, 231, 232, 233, 234, 235, 236, 237, 238, 239,
             ...
             377, 378, 379, 380, 382, 383, 384, 385, 387, 388],
            dtype='int64', length=157),
 '2015-01': Int64Index([381, 386, 389, 390, 391, 392, 393, 394, 395, 396,
             ...
             583, 584, 585, 587, 588, 589, 590, 591, 592, 593],
            dtype='int64', length=205),
 '2015-02': Int64Index([577, 586, 594, 595, 596, 597, 598, 599, 600, 601,
             ...
             719, 720, 721, 722, 723, 724, 725, 726, 727, 728],
            dtype='int64', length=137),
 '2015-03': Int64Index([729, 730, 731, 732, 733, 734, 735, 736, 737, 738,
             ...
             820, 821, 822, 823, 824, 825, 826, 827, 828, 829],
            dtype='int64', length=101)}

raw_phone.groupby(['month']).groups.keys()

dict_keys(['2014-11', '2014-12', '2015-01', '2015-02', '2015-03'])

raw_phone.groupby(['month']).first()

raw_phone.groupby(['month'])['duration'].sum()

month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

raw_phone.groupby(['month'], as_index=False)[['duration']].sum()

raw_phone.groupby(['month'])['date'].count()

month
2014-11    230
2014-12    157
2015-01    205
2015-02    137
2015-03    101
Name: date, dtype: int64

raw_phone[raw_phone['item'] == 'call'].groupby('network')['duration'].sum()

network
Meteor        7200.0
Tesco        13828.0
Three        36464.0
Vodafone     14621.0
landline     18433.0
voicemail     1775.0
Name: duration, dtype: float64

raw_phone.groupby(['month', 'item']).groups

{('2014-11',
  'call'): Int64Index([  1,   2,   3,   4,   5,   7,   8,   9,  10,  19,
             ...
             194, 195, 196, 200, 201, 203, 216, 222, 223, 224],
            dtype='int64', length=107),
 ('2014-11',
  'data'): Int64Index([  0,   6,  13,  26,  39,  45,  54,  56,  58,  66,  80,  81,  87,
              92,  95,  97, 101, 111, 114, 120, 131, 151, 159, 170, 182, 189,
             192, 199, 208],
            dtype='int64'),
 ('2014-11',
  'sms'): Int64Index([ 11,  12,  14,  15,  16,  17,  18,  22,  23,  24,  25,  33,  36,
              37,  38,  52,  53,  61,  62,  63,  67,  68,  69,  70,  71,  72,
              73,  74,  75,  76,  77,  79, 102, 103, 107, 108, 121, 125, 132,
             133, 134, 135, 138, 142, 143, 144, 145, 148, 149, 153, 154, 155,
             157, 158, 160, 161, 167, 173, 174, 175, 176, 177, 178, 179, 180,
             181, 185, 186, 187, 188, 197, 198, 202, 204, 205, 206, 207, 209,
             210, 211, 212, 213, 214, 215, 217, 218, 219, 220, 221, 225, 226,
             227, 229, 230],
            dtype='int64'),
 ('2014-12',
  'call'): Int64Index([232, 236, 250, 251, 252, 255, 256, 258, 259, 260, 261, 267, 268,
             269, 270, 271, 272, 273, 274, 276, 277, 278, 279, 280, 282, 283,
             284, 285, 286, 287, 290, 292, 295, 297, 298, 299, 300, 301, 302,
             303, 306, 309, 311, 312, 313, 314, 320, 322, 327, 329, 337, 342,
             344, 345, 347, 348, 349, 350, 352, 353, 354, 356, 362, 364, 365,
             366, 367, 368, 369, 373, 375, 379, 380, 382, 383, 384, 385, 387,
             388],
            dtype='int64'),
 ('2014-12',
  'data'): Int64Index([228, 231, 234, 235, 237, 238, 249, 254, 263, 275, 281, 288, 291,
             305, 321, 324, 328, 330, 338, 341, 343, 346, 351, 355, 363, 372,
             374, 376, 377, 378],
            dtype='int64'),
 ('2014-12',
  'sms'): Int64Index([233, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 253, 257,
             262, 264, 265, 266, 289, 293, 294, 296, 304, 307, 308, 310, 315,
             316, 317, 318, 319, 323, 325, 326, 331, 332, 333, 334, 335, 336,
             339, 340, 357, 358, 359, 360, 361, 370, 371],
            dtype='int64'),
 ('2015-01',
  'call'): Int64Index([392, 398, 401, 402, 403, 404, 405, 406, 407, 408, 411, 412, 413,
             414, 415, 416, 417, 418, 423, 425, 428, 431, 433, 438, 441, 442,
             444, 445, 446, 448, 450, 451, 454, 455, 456, 457, 459, 460, 461,
             466, 475, 497, 498, 499, 506, 510, 511, 513, 514, 515, 517, 518,
             519, 520, 523, 524, 525, 526, 527, 528, 532, 533, 535, 536, 542,
             543, 544, 545, 547, 548, 557, 558, 559, 561, 562, 563, 564, 565,
             570, 572, 573, 574, 578, 584, 585, 587, 588, 589],
            dtype='int64'),
 ('2015-01',
  'data'): Int64Index([381, 386, 389, 396, 397, 400, 409, 420, 426, 427, 443, 453, 463,
             465, 468, 473, 474, 476, 496, 504, 505, 509, 512, 516, 529, 537,
             541, 555, 560, 568, 571],
            dtype='int64'),
 ('2015-01',
  'sms'): Int64Index([390, 391, 393, 394, 395, 399, 410, 419, 421, 422, 424, 429, 430,
             432, 434, 435, 436, 437, 439, 440, 447, 449, 452, 458, 462, 464,
             467, 469, 470, 471, 472, 477, 478, 479, 480, 481, 482, 483, 484,
             485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 500, 501,
             502, 503, 507, 508, 521, 522, 530, 531, 534, 538, 539, 540, 546,
             549, 550, 551, 552, 553, 554, 556, 566, 567, 569, 575, 576, 579,
             580, 581, 582, 583, 590, 591, 592, 593],
            dtype='int64'),
 ('2015-02',
  'call'): Int64Index([595, 597, 599, 600, 601, 602, 603, 611, 612, 614, 615, 618, 619,
             620, 626, 627, 629, 630, 631, 632, 633, 635, 636, 638, 639, 640,
             644, 647, 648, 650, 651, 653, 654, 656, 657, 658, 662, 664, 665,
             666, 668, 669, 671, 672, 674, 676, 677, 682, 684, 686, 687, 690,
             691, 693, 694, 695, 696, 700, 702, 708, 709, 710, 711, 716, 718,
             719, 720],
            dtype='int64'),
 ('2015-02',
  'data'): Int64Index([577, 586, 594, 598, 610, 613, 616, 621, 625, 634, 637, 645, 646,
             649, 652, 655, 659, 667, 670, 673, 675, 683, 685, 689, 692, 697,
             715, 717, 725, 727, 728],
            dtype='int64'),
 ('2015-02',
  'sms'): Int64Index([596, 604, 605, 606, 607, 608, 609, 617, 622, 623, 624, 628, 641,
             642, 643, 660, 661, 663, 678, 679, 680, 681, 688, 698, 699, 701,
             703, 704, 705, 706, 707, 712, 713, 714, 721, 722, 723, 724, 726],
            dtype='int64'),
 ('2015-03',
  'call'): Int64Index([729, 730, 732, 733, 735, 736, 738, 741, 742, 744, 745, 752, 756,
             758, 762, 763, 764, 769, 770, 771, 773, 774, 776, 777, 778, 780,
             781, 782, 783, 784, 785, 786, 787, 788, 792, 799, 800, 801, 802,
             803, 805, 806, 807, 808, 809, 810, 816],
            dtype='int64'),
 ('2015-03',
  'data'): Int64Index([731, 734, 737, 739, 740, 743, 746, 751, 753, 754, 755, 757, 761,
             772, 775, 779, 791, 793, 804, 811, 817, 818, 819, 820, 821, 822,
             823, 824, 827],
            dtype='int64'),
 ('2015-03',
  'sms'): Int64Index([747, 748, 749, 750, 759, 760, 765, 766, 767, 768, 789, 790, 794,
             795, 796, 797, 798, 812, 813, 814, 815, 825, 826, 828, 829],
            dtype='int64')}

raw_phone.groupby(['month', 'item']).groups.keys()

dict_keys([('2014-11', 'call'), ('2014-11', 'data'), ('2014-11', 'sms'), ('2014-12', 'call'), ('2014-12', 'data'), ('2014-12', 'sms'), ('2015-01', 'call'), ('2015-01', 'data'), ('2015-01', 'sms'), ('2015-02', 'call'), ('2015-02', 'data'), ('2015-02', 'sms'), ('2015-03', 'call'), ('2015-03', 'data'), ('2015-03', 'sms')])

raw_phone.groupby(['month', 'item']).first()

raw_phone.groupby(['month', 'item'])['duration'].sum()

month    item
2014-11  call    25547.000
         data      998.441
         sms        94.000
2014-12  call    13561.000
         data     1032.870
                   ...    
2015-02  data     1067.299
         sms        39.000
2015-03  call    21727.000
         data      998.441
         sms        25.000
Name: duration, Length: 15, dtype: float64

raw_phone.groupby(['month', 'item'])['date'].count()

month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
                ... 
2015-02  data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
Name: date, Length: 15, dtype: int64

raw_phone.groupby(['month', 'network_type'])['date'].count()

month    network_type
2014-11  data             29
         landline          5
         mobile          189
         special           1
         voicemail         6
                        ... 
2015-03  data             29
         landline         11
         mobile           54
         voicemail         4
         world             3
Name: date, Length: 24, dtype: int64

raw_phone.groupby(['month', 'network_type'])[['date']].count()

raw_phone.groupby(['month', 'network_type'], as_index=False)[['date']].count()

raw_phone.groupby(['month', 'network_type'])[['date']].count().shape

(24, 1)

raw_phone.groupby(['month', 'network_type'], as_index=False)[['date']].count().shape

(24, 3)

# Aggregating
raw_phone.groupby(['month'], as_index=False)[['duration']].sum()

# is same as
raw_phone.groupby(['month'], as_index=False).agg({'duration':'sum'})

raw_phone.groupby(['month', 'item']).agg({'duration':'sum',
                                          'network_type':'count',
                                          'date':'first'})

# is same as
aggregation_logic = {'duration':'sum',
                     'network_type':'count',
                     'date':'first'}
raw_phone.groupby(['month', 'item']).agg(aggregation_logic)

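# Statistics are named as strings, which pandas maps to its own implementations
aggregation_logic = {'duration':['min', 'max', 'sum'],
                     'network_type':'count',
                     'date':['min', 'first', 'nunique']}
raw_phone.groupby(['month', 'item']).agg(aggregation_logic)

# A lambda can supply a custom statistic, here the time span covered by each group
aggregation_logic = {'duration':['min', 'max', 'sum'],
                     'network_type':'count',
                     'date':['first', lambda x: x.max() - x.min()]}
raw_phone.groupby(['month', 'item']).agg(aggregation_logic)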

raw_phone_test = raw_phone.groupby(['month', 'item']).agg(aggregation_logic)
raw_phone_test

raw_phone_test.columns = raw_phone_test.columns.droplevel(level=0)
raw_phone_test

raw_phone_test.rename(columns={'min':'min_duration',
                               'max':'max_duration',
                               'sum':'sum_duration',
                               '<lambda>':'date_difference'})
raw_phone_test = raw_phone.groupby(['month', 'item']).agg(aggregation_logic)
raw_phone_test

raw_phone_test.columns = ['_'.join(x) for x in raw_phone_test.columns.ravel()]
raw_phone_test

Merge and join DataFrames with Pandas

(http://pandas.pydata.org/pandas-docs/stable/merging.html)
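In any real-world data science situation with Python, you will be about 10 minutes in when you need to merge or join Pandas DataFrames to form your analysis dataset. Merging and joining dataframes is a core process that any aspiring data analyst needs to master.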

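"Merging" two datasets is the process of bringing two datasets together into one, aligning the rows from each based on common attributes or columns.
The simplest merge operation takes a left dataframe (the first argument), a right dataframe (the second argument), and a merge column name, i.e. the column to merge "on".
In the output/result, rows from the left and right dataframes are matched up where they share common values of the merge column specified by "on".
By default, the Pandas merge operation acts as an "inner" merge.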

Pandas에는 세 가지 유형의 병합이 있습니다. 이러한 병합 유형은 대부분의 데이터베이스 및 데이터 지향 언어 (SQL, R, SAS)에서 일반적이며 "join"이라고합니다

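inner merge / inner join – the default Pandas behaviour; keeps only the rows where the merge "on" value exists in both the left and right DataFrames.
left merge / left outer join – (aka left merge or left join) keeps every row in the left DataFrame. Where the right DataFrame has no matching "on" value, empty/NaN values are added to the result.
right merge / right outer join – (aka right merge or right join) keeps every row in the right DataFrame. Where the left DataFrame has no matching "on" value, empty/NaN values are added to the result.
outer merge / full outer join – returns all rows from the left DataFrame and all rows from the right DataFrame, matching rows where possible and filling with NaN elsewhere.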
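
The four behaviours listed above can be sketched on two tiny frames. The frames `left` and `right` and their values are made up for illustration; only the column names echo the datasets used later.

```python
import pandas as pd

# Hypothetical toy data: ids 1-3 on the left, ids 2-4 on the right.
left = pd.DataFrame({'use_id': [1, 2, 3], 'monthly_mb': [100, 200, 300]})
right = pd.DataFrame({'use_id': [2, 3, 4], 'device': ['A', 'B', 'C']})

# Number of rows produced by each merge type.
shapes = {how: pd.merge(left, right, on='use_id', how=how).shape[0]
          for how in ['inner', 'left', 'right', 'outer']}
print(shapes)
# {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# In a left merge, ids missing from the right frame get NaN in its columns.
left_merged = pd.merge(left, right, on='use_id', how='left')
missing_devices = left_merged['device'].isnull().sum()
print(missing_devices)
# 1
```

Only ids 2 and 3 appear in both frames, so the inner merge keeps two rows, while the outer merge keeps all four distinct ids.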

# Merge and join of DataFrames
user_usage = pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/user_usage.csv')
user_device = pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/user_device.csv')
device_info = pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/android_devices.csv')
display(user_usage.head())
display(user_device.head())
display(device_info.head())

# Question: do usage patterns differ between users of different devices?
result = pd.merge(left=user_usage, right=user_device, on='use_id')
result.head()

print(user_usage.shape, user_device.shape, device_info.shape, result.shape)

(240, 4) (272, 6) (14546, 4) (159, 9)

user_usage['use_id'].isin(user_device['use_id']).value_counts()

True     159
False     81
Name: use_id, dtype: int64

result = pd.merge(left=user_usage, right=user_device, on='use_id', how='left')
print(user_usage.shape, result.shape, result['device'].isnull().sum())

(240, 4) (240, 9) 81

display(result.head(), result.tail())

result = pd.merge(left=user_usage, right=user_device, on='use_id', how='right')
print(user_device.shape, result.shape, result['device'].isnull().sum(), result['monthly_mb'].isnull().sum())

(272, 6) (272, 9) 0 113

display(result.head(), result.tail())

print(user_usage['use_id'].unique().shape[0], user_device['use_id'].unique().shape[0], pd.concat([user_usage['use_id'], user_device['use_id']]).unique().shape[0])

240 272 353

result = pd.merge(left=user_usage, right=user_device, on='use_id', how='outer')
print(result.shape)

(353, 9)

print((result.apply(lambda x: x.isnull().sum(), axis=1) == 0).sum())

159

# Note that all rows from the left and right DataFrames are included, but the NaNs fall in different columns depending on whether the row originated in the left or the right DataFrame.
result = pd.merge(left=user_usage, right=user_device, on='use_id', how='outer', indicator=True)
result.iloc[[0, 1, 200, 201, 350, 351]]
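
What `indicator=True` adds can be seen more easily on toy data. This is a minimal sketch with made-up frames (`left`, `right` and their values are hypothetical); the `_merge` column name is the default that pandas assigns.

```python
import pandas as pd

# Hypothetical toy data: ids 1-3 on the left, ids 2-4 on the right.
left = pd.DataFrame({'use_id': [1, 2, 3], 'monthly_mb': [100, 200, 300]})
right = pd.DataFrame({'use_id': [2, 3, 4], 'device': ['A', 'B', 'C']})

merged = pd.merge(left, right, on='use_id', how='outer', indicator=True)

# '_merge' records where each row came from: 'left_only', 'right_only' or 'both'.
origin_counts = merged['_merge'].value_counts().to_dict()
print(origin_counts)

# The indicator also makes it easy to pull out the unmatched rows.
left_only_ids = merged.loc[merged['_merge'] == 'left_only', 'use_id'].tolist()
print(left_only_ids)
# [1]
```

Counting the `_merge` values is a quick way to check how many rows matched before choosing between an inner, left, or outer merge.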

# Back to the question: build the analysis dataset, starting with a left merge of usage and device
result1 = pd.merge(left=user_usage, right=user_device, on='use_id', how='left')
result1.head()

device_info.head()

result_final = pd.merge(left=result1, right=device_info[['Retail Branding', 'Marketing Name', 'Model']],
                        left_on='device', right_on='Model', how='left')
result_final[result_final['Retail Branding'] == 'Samsung'].head()
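
When the key columns have different names in the two frames, `left_on` and `right_on` replace `on`. The sketch below uses made-up frames (`usage`, `lookup`, and the device codes are hypothetical) to show that both key columns are kept in the result and that unmatched rows get NaN in the lookup columns.

```python
import pandas as pd

# Hypothetical toy data; the device codes and brand names are illustrative only.
usage = pd.DataFrame({'use_id': [1, 2], 'device': ['dev_a', 'dev_b']})
lookup = pd.DataFrame({'Model': ['dev_a', 'dev_c'],
                       'Retail Branding': ['BrandA', 'BrandC']})

combined = pd.merge(usage, lookup, left_on='device', right_on='Model', how='left')

# Both key columns survive because their names differ.
print(list(combined.columns))
# ['use_id', 'device', 'Model', 'Retail Branding']

# 'dev_b' has no entry in the lookup table, so its branding is NaN.
unmatched = combined['Retail Branding'].isnull().sum()
print(unmatched)
# 1
```

If the duplicate key column is not needed, it can be dropped afterwards with `combined.drop(columns='Model')`.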

result_final[result_final['Retail Branding'] == 'LGE'].head()

group1 = result_final[result_final['Retail Branding'] == 'Samsung']
group2 = result_final[result_final['Retail Branding'] == 'LGE']
display(group1.describe())
display(group2.describe())

result_final.groupby('Retail Branding').agg({'outgoing_mins_per_month':'mean',
                                             'outgoing_sms_per_month':'mean',
                                             'monthly_mb':'mean',
                                             'user_id':'count'})

Basic Plotting Pandas DataFrames¶

(https://pandas.pydata.org/pandas-docs/stable/visualization.html)
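To create graphics, the matplotlib plotting package must be installed, and the "%matplotlib inline" notebook 'magic' must be activated for inline plots. You also need "import matplotlib.pyplot as plt" to add figure titles and axis labels to your diagrams. Pandas provides a lot of plotting functionality natively through the .plot() command.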

# Plotting DataFrames
import matplotlib.pyplot as plt
raw_data['latitude'].plot(kind='hist', bins=100)
plt.xlabel('Latitude Value')
plt.show()

<Figure size 640x480 with 1 Axes>

raw_data.loc[raw_data['Element'] == 'Food']

raw_data_test = raw_data.loc[raw_data['Element'] == 'Food']
pd.DataFrame(raw_data_test.groupby('Area')['Y2013'].sum())

pd.DataFrame(raw_data_test.groupby('Area')['Y2013'].sum().sort_values(ascending=False))

pd.DataFrame(raw_data_test.groupby('Area')['Y2013'].sum().sort_values(ascending=False)[:10])

raw_data_test.groupby('Area')['Y2013'].sum().sort_values(ascending=False)[:10].plot(kind='bar')
plt.title('Top Ten Food Producers')
plt.ylabel('Food Produced (1000 tonnes)')

Text(0, 0.5, 'Food Produced (1000 tonnes)')
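
The group, sort, slice, and plot chain above can be tried on toy data. This is a minimal sketch under stated assumptions: the `food` frame and its values are made up, and the non-interactive 'Agg' backend is selected only so the snippet also runs outside a notebook. It labels the chart through the Axes object that `.plot()` returns rather than through `plt`.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, not needed inside a notebook
import pandas as pd

# Hypothetical toy data, for illustration only.
food = pd.DataFrame({'Area': ['A', 'A', 'B', 'C'],
                     'Y2013': [10, 5, 40, 20]})

# Sum per area, sort descending, keep the two largest.
top = food.groupby('Area')['Y2013'].sum().sort_values(ascending=False).head(2)

# Series.plot returns a matplotlib Axes, which can be labelled directly.
ax = top.plot(kind='bar')
ax.set_title('Top Two Areas')
ax.set_ylabel('Y2013 total')
```

Here `.head(2)` plays the same role as the `[:10]` slice used above.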

	Area Abbreviation	Area Code	Area	Item Code	Item	Element Code	Element	Unit	latitude	longitude	...	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
0	AF	2	Afghanistan	2511	Wheat and products	5142	Food	1000 tonnes	33.94	67.71	...	3249.0	3486.0	3704.0	4164.0	4252.0	4538.0	4605.0	4711.0	4810	4895
1	AF	2	Afghanistan	2805	Rice (Milled Equivalent)	5142	Food	1000 tonnes	33.94	67.71	...	419.0	445.0	546.0	455.0	490.0	415.0	442.0	476.0	425	422
2	AF	2	Afghanistan	2513	Barley and products	5521	Feed	1000 tonnes	33.94	67.71	...	58.0	236.0	262.0	263.0	230.0	379.0	315.0	203.0	367	360
3	AF	2	Afghanistan	2513	Barley and products	5142	Food	1000 tonnes	33.94	67.71	...	185.0	43.0	44.0	48.0	62.0	55.0	60.0	72.0	78	89
4	AF	2	Afghanistan	2514	Maize and products	5521	Feed	1000 tonnes	33.94	67.71	...	120.0	208.0	233.0	249.0	247.0	195.0	178.0	191.0	200	200

	Area Code	Element Code	latitude	longitude	Y1961	Y1962	Y1963	Y1964	Y1965	Y1966	...	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
count	21477.000000	21477.000000	21477.000000	21477.000000	17938.000000	17938.000000	17938.000000	17938.000000	17938.000000	17938.000000	...	21128.000000	21128.000000	21373.000000	21373.000000	21373.000000	21373.000000	21373.000000	21373.000000	21477.000000	21477.000000
mean	125.449411	5211.687154	20.450613	15.794445	195.262069	200.782250	205.464600	209.925577	217.556751	225.988962	...	486.690742	493.153256	496.319328	508.482104	522.844898	524.581996	535.492069	553.399242	560.569214	575.557480
std	72.868149	146.820079	24.628336	66.012104	1864.124336	1884.265591	1861.174739	1862.000116	2014.934333	2100.228354	...	5001.782008	5100.057036	5134.819373	5298.939807	5496.697513	5545.939303	5721.089425	5883.071604	6047.950804	6218.379479
min	1.000000	5142.000000	-40.900000	-172.100000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-169.000000	-246.000000
25%	63.000000	5142.000000	6.430000	-11.780000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	120.000000	5142.000000	20.590000	19.150000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	...	6.000000	6.000000	7.000000	7.000000	7.000000	7.000000	7.000000	8.000000	8.000000	8.000000
75%	188.000000	5142.000000	41.150000	46.870000	21.000000	22.000000	23.000000	24.000000	25.000000	26.000000	...	75.000000	77.000000	78.000000	80.000000	82.000000	83.000000	83.000000	86.000000	88.000000	90.000000
max	276.000000	5521.000000	64.960000	179.410000	112227.000000	109130.000000	106356.000000	104234.000000	119378.000000	118495.000000	...	360767.000000	373694.000000	388100.000000	402975.000000	425537.000000	434724.000000	451838.000000	462696.000000	479028.000000	489299.000000

	Element Code	Element	Unit
0	5142	Food	1000 tonnes
1	5142	Food	1000 tonnes
2	5521	Feed	1000 tonnes
3	5142	Food	1000 tonnes
4	5521	Feed	1000 tonnes

	Item	Element Code	Element	Unit	latitude	longitude	Y1961	Y1962	Y1963	Y1964	...	Y2004	Y2005	Y2006	Y2007	Y2008	Y2009	Y2010	Y2011	Y2012	Y2013
1	Rice (Milled Equivalent)	5142	Food	1000 tonnes	33.94	67.71	183.0	183.0	182.0	220.0	...	419.0	445.0	546.0	455.0	490.0	415.0	442.0	476.0	425	422
3	Barley and products	5142	Food	1000 tonnes	33.94	67.71	237.0	237.0	237.0	238.0	...	185.0	43.0	44.0	48.0	62.0	55.0	60.0	72.0	78	89
