1. pandas Basics

CodeJC 2024. 7. 20. 08:00

1. pandas library

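: A library for working with dataframes in order to process and transform data. It can load and process csv and excel files.

2. Downloading a dataset from kaggle

: This exercise uses the "Top 50 Indian Companies" dataset from kaggle. Follow the link, sign up, and download it for free. kaggle hosts a wide variety of datasets published for analysis.

3. Importing the pandas library

: This imports pandas and declares that it will be referred to by the alias pd. If nothing is printed when you run it, the import succeeded. If you instead see a ModuleNotFoundError like the one shown below, pandas is not installed yet; enter "pip install pandas" in a terminal to install the library.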

import pandas as pd


# Result #
  File "path~", line 1, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
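
: Before running pip, it can help to check whether pandas is visible to the interpreter you are actually using. The following is a minimal sketch that relies only on the standard library. The helper name is_installed is made up for this illustration and is not part of pandas.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate the module."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if is_installed("pandas"):
    print("pandas is available")
else:
    print('pandas is missing; run "pip install pandas" in a terminal')
```

If the message says pandas is missing even after installing it, pip may have installed into a different Python environment than the one running the script.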

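4. Loading the dataset

: This assumes the file downloaded from kaggle in step 2 has been unzipped into the folder that contains the code file. Unzipping creates a "Top Company.csv" file in the archive folder. Running the code below displays the dataframe.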

import pandas as pd

df_archive = pd.read_csv('archive/Top Company.csv')

print(df_archive)          # show the whole dataframe.
print(df_archive.head())   # show the first 5 rows. head(3) shows the first 3 rows.
print(df_archive.tail())   # the opposite of head: show the last 5 rows.


# Result # → output of head().
   Rank                             Name     Industry  Revenue (in ₹ Crore) Revenue growth Profits (in ₹ Crore) Headquarters State Controlled
0     1           Indian Oil Corporation  Oil and gas                424321          13.2%                22189    New Delhi              Yes
1     2      Reliance Industries Limited  Oil and gas                410295          28.2%                36075       Mumbai              NaN
2     3  Oil and Natural Gas Corporation  Oil and gas                333143          11.0%                22106    New Delhi              Yes
3     4              State Bank of India      Banking                306528           2.6%               −4,556       Mumbai              Yes
4     5                      Tata Motors   Automotive                301175           7.9%                 8989       Mumbai              NaN
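
: To try head() and tail() without the downloaded file, read_csv can also read from an in-memory text buffer. This is only a sketch: csv_text holds the first three columns of the five rows shown above, and the name df_sample is chosen for illustration.

```python
import io

import pandas as pd

# First three columns of the five rows shown in the head() output above.
csv_text = """Rank,Name,Industry
1,Indian Oil Corporation,Oil and gas
2,Reliance Industries Limited,Oil and gas
3,Oil and Natural Gas Corporation,Oil and gas
4,State Bank of India,Banking
5,Tata Motors,Automotive
"""

# read_csv accepts a file-like object as well as a file path.
df_sample = pd.read_csv(io.StringIO(csv_text))

print(df_sample.shape)    # (number of rows, number of columns)
print(df_sample.head(3))  # first 3 rows
print(df_sample.tail(2))  # last 2 rows
```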

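4. Inspecting the dataset

4.01. info() : checking the data type of each column

: How data is processed depends on the data type of each column, so checking the data types matters. The output below shows 50 non-null values for every column except "State Controlled", which has only 20 out of 50 (= 30 missing values). The trailing None appears because info() prints its report itself and returns None, which print() then displays.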

print(df_archive.info())


# Result #
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Rank                  50 non-null     int64
 1   Name                  50 non-null     object
 2   Industry              50 non-null     object
 3   Revenue (in ₹ Crore)  50 non-null     int64
 4   Revenue growth        50 non-null     object
 5   Profits (in ₹ Crore)  50 non-null     object
 6   Headquarters          50 non-null     object
 7   State Controlled      20 non-null     object
dtypes: int64(2), object(6)
memory usage: 3.2+ KB
None
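
: info() reports non-null counts, so the number of missing values has to be worked out by subtraction. As a sketch of a more direct way, isna().sum() counts the missing values per column. The small df_sample below holds only the first five rows shown earlier, so it reports 2 missing values rather than the 30 in the full file.

```python
import pandas as pd

# "State Controlled" values of the first five rows from the head() output.
df_sample = pd.DataFrame({
    "Name": [
        "Indian Oil Corporation",
        "Reliance Industries Limited",
        "Oil and Natural Gas Corporation",
        "State Bank of India",
        "Tata Motors",
    ],
    "State Controlled": ["Yes", None, "Yes", "Yes", None],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = df_sample.isna().sum()
print(missing_per_column)

# notna() is the opposite: it counts the values that are present.
present_count = df_sample["State Controlled"].notna().sum()
print(present_count)
```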

4.02. Checking column names

: Lists the column names of the dataframe.

print(df_archive.columns)


# Result #
Index(['Rank', 'Name', 'Industry', 'Revenue (in ₹ Crore)', 'Revenue growth',
       'Profits (in ₹ Crore)', 'Headquarters', 'State Controlled'],
      dtype='object')

4.03. Viewing the data of specific columns

: How to select a single column and how to select several columns. Run each in turn and check the results.

# select 1 column
print(df_archive['Name'].head())

# Result #
0             Indian Oil Corporation
1        Reliance Industries Limited
2    Oil and Natural Gas Corporation
3                State Bank of India
4                        Tata Motors



# select 2 columns
print(df_archive[['Name','Industry']].head())

# Result #
                              Name     Industry
0           Indian Oil Corporation  Oil and gas
1      Reliance Industries Limited  Oil and gas
2  Oil and Natural Gas Corporation  Oil and gas
3              State Bank of India      Banking
4                      Tata Motors   Automotive
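
: The two selections above return different types, which explains why their outputs look different. A minimal sketch on a two-row sample (df_sample is an illustrative name): a single column name gives a Series, while a list of names gives a DataFrame, even when the list has only one entry.

```python
import pandas as pd

df_sample = pd.DataFrame({
    "Name": ["Indian Oil Corporation", "Reliance Industries Limited"],
    "Industry": ["Oil and gas", "Oil and gas"],
})

one_column = df_sample["Name"]                  # single name -> Series
as_frame = df_sample[["Name"]]                  # list with one name -> DataFrame
two_columns = df_sample[["Name", "Industry"]]   # list of names -> DataFrame

print(type(one_column).__name__)   # Series
print(type(as_frame).__name__)     # DataFrame
print(two_columns.shape)           # (2, 2)
```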

4.04. Extracting only the rows that match a condition

: How to extract rows using a single condition. Here we extract only the rows whose Industry column is "Oil and gas".

print(df_archive[df_archive['Industry']=="Oil and gas"])


# Result #
    Rank                             Name     Industry  Revenue (in ₹ Crore) Revenue growth Profits (in ₹ Crore) Headquarters State Controlled
0      1           Indian Oil Corporation  Oil and gas                424321          13.2%                22189    New Delhi              Yes
1      2      Reliance Industries Limited  Oil and gas                410295          28.2%                36075       Mumbai              NaN
2      3  Oil and Natural Gas Corporation  Oil and gas                333143          11.0%                22106    New Delhi              Yes
5      6                 Bharat Petroleum  Oil and gas                238638          13.7%                 9009       Mumbai              Yes
6      7              Hindustan Petroleum  Oil and gas                221693          13.4%                 7218       Mumbai              Yes
21    22                    Nayara Energy  Oil and gas                 73015          10.7%                  576       Mumbai              NaN
30    31                             GAIL  Oil and gas                 55503          11.6%                 4799    New Delhi              Yes
33    34                             MRPL  Oil and gas                 50209           8.9%                 1993    Mangalore              Yes
43    44    Chennai Petroleum Corporation  Oil and gas                 33187          20.0%                  927      Chennai              Yes
47    48                     Petronet LNG  Oil and gas                 30949          23.9%                 2110    New Delhi              Yes

: For two or more conditions, combine them in the form below.

and : (조건1) & (조건2)
 or : (조건1) | (조건2)
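
: A minimal sketch of the or form, using the first five rows shown earlier (df_sample and the variable names are illustrative). Python's own and / or keywords raise a ValueError on a pandas Series, so & and | are used instead. Each comparison needs its own parentheses when written inline, because & and | bind more tightly than ==. The ~ operator, not covered in the list above, negates a condition.

```python
import pandas as pd

df_sample = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Industry": ["Oil and gas", "Oil and gas", "Oil and gas", "Banking", "Automotive"],
    "Headquarters": ["New Delhi", "Mumbai", "New Delhi", "Mumbai", "Mumbai"],
})

# Storing each condition in a variable keeps the filter readable.
is_banking = df_sample["Industry"] == "Banking"
in_new_delhi = df_sample["Headquarters"] == "New Delhi"

either = df_sample[is_banking | in_new_delhi]   # or: at least one condition holds
not_delhi = df_sample[~in_new_delhi]            # not: condition does not hold

print(either)
print(not_delhi)
```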

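: See the example below. As info() showed, the only numeric columns are "Rank" and "Revenue (in ₹ Crore)". Check this before writing a condition. The remaining columns have data type object, so they need further processing before they can be used in numeric comparisons.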

print(df_archive[(df_archive['Industry']=="Oil and gas") & (df_archive['Revenue (in ₹ Crore)']<=100000)])


# Result #
    Rank                           Name     Industry  Revenue (in ₹ Crore) Revenue growth Profits (in ₹ Crore) Headquarters State Controlled
21    22                  Nayara Energy  Oil and gas                 73015          10.7%                  576       Mumbai              NaN
30    31                           GAIL  Oil and gas                 55503          11.6%                 4799    New Delhi              Yes
33    34                           MRPL  Oil and gas                 50209           8.9%                 1993    Mangalore              Yes
43    44  Chennai Petroleum Corporation  Oil and gas                 33187          20.0%                  927      Chennai              Yes
47    48                   Petronet LNG  Oil and gas                 30949          23.9%                 2110    New Delhi              Yes
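
: As a sketch of the extra processing that object columns need, the code below turns "Revenue growth" and "Profits (in ₹ Crore)" into numbers for the first five rows shown earlier. It assumes the minus sign in the data is the Unicode character U+2212, as it appears in the output above, and that commas are thousands separators. The names df_sample, growth_pct, and profits are illustrative.

```python
import pandas as pd

MINUS_SIGN = "\u2212"  # Unicode minus, assumed to be what the source data uses

df_sample = pd.DataFrame({
    "Name": [
        "Indian Oil Corporation",
        "Reliance Industries Limited",
        "Oil and Natural Gas Corporation",
        "State Bank of India",
        "Tata Motors",
    ],
    "Revenue growth": ["13.2%", "28.2%", "11.0%", "2.6%", "7.9%"],
    "Profits (in ₹ Crore)": ["22189", "36075", "22106", MINUS_SIGN + "4,556", "8989"],
})

# Strip the trailing "%" and convert the rest to floats.
growth_pct = pd.to_numeric(df_sample["Revenue growth"].str.rstrip("%"))

# Normalize the minus sign, drop thousands separators, then convert.
profits = pd.to_numeric(
    df_sample["Profits (in ₹ Crore)"]
    .str.replace(MINUS_SIGN, "-", regex=False)
    .str.replace(",", "", regex=False)
)

# The converted column can now be used in numeric conditions.
loss_makers = df_sample[profits < 0]["Name"].tolist()
print(loss_makers)
```

Keeping the converted values in separate variables leaves the original columns untouched, so the raw text stays available for comparison.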

That wraps up the basics of the pandas library. I will add more as the need arises while processing and using data.
