[Data Job School Study Notes] Google Maps API / Seaborn / Seoul Crime Arrest Status 2


Zerobase DS School

[Data Job School Study Notes] Google Maps API / Seaborn / Seoul Crime Arrest Status 2 - 3

Sehe_e 2024. 8. 1. 20:25

Google Maps API

Install googlemaps with conda

conda install -c conda-forge googlemaps

Get a Google Maps API key from Google Cloud.

Create a project and enable the Geocoding API. Set API restrictions on the key before using it.

import googlemaps

gmaps_key = 'YOUR_API_KEY'    # paste the key you were issued
gmaps = googlemaps.Client(key=gmaps_key)

Check the output data

tmp = gmaps.geocode('서울영등포경찰서', language='ko')
tmp

Extract information

# Get the location info
tmp[0].get('geometry')['location']

# Get the latitude and longitude
print(tmp[0].get('geometry')['location']['lat'])
print(tmp[0].get('geometry')['location']['lng'])

# Get the district (gu) info
tmp[0].get('formatted_address').split()    # ['대한민국', '서울특별시', '영등포구', '국회대로', '608']
tmp[0].get('formatted_address').split()[2]
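
The lookups above can be wrapped in a small helper. The sketch below runs on a hand-written dictionary that only mimics the shape of a geocoding result (`formatted_address` and `geometry` / `location`). The coordinates are placeholders and the helper name `parse_station` is an assumption made for illustration, not part of the googlemaps library.

```python
# Hypothetical sample shaped like the list returned by gmaps.geocode().
# The coordinates are placeholders, not real API output.
sample_result = [{
    'formatted_address': '대한민국 서울특별시 영등포구 국회대로 608',
    'geometry': {'location': {'lat': 37.5, 'lng': 126.9}},
}]

def parse_station(result):
    """Return (district, lat, lng) from a geocode-style result list."""
    first = result[0]
    location = first['geometry']['location']
    # The address splits into [country, city, district, road, number].
    district = first['formatted_address'].split()[2]
    return district, location['lat'], location['lng']

district, lat, lng = parse_station(sample_result)
print(district, lat, lng)
```

Note that taking index 2 relies on the address starting with the country and the city; an address in a different format would need different handling.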

Add district (구별), latitude, and longitude columns

crime_station['구별'] = np.nan
crime_station['lat'] = np.nan
crime_station['lng'] = np.nan

crime_station.head()

+ In the lecture, the 강간 (rape) and 강간,추행 (rape/indecent assault) columns both hold values. In my data, the rows where 강간 is null have 강간,추행 data, and the rows where 강간,추행 is null have 강간 data. So I decided to merge the two columns.

Create the pivot table with the fill_value option

crime_station = pd.pivot_table(
    crime_raw_data,
    index = '구분', 
    columns = ['죄종', '발생검거'], 
    aggfunc = [np.sum],
    fill_value = 0
)

crime_station.columns = crime_station.columns.droplevel([0, 1])

crime_station['구별'] = np.nan
crime_station['lat'] = np.nan
crime_station['lng'] = np.nan

crime_station.head()

Check the columns

crime_station.columns

Merge the columns and store the result in the same column the lecture uses

rape_data = crime_station[('강간', '발생')] + crime_station[('강간,추행', '발생')]
rape_catch_data = crime_station[('강간', '검거')] + crime_station[('강간,추행', '검거')]

crime_station[('강간', '발생')] = rape_data
crime_station[('강간', '검거')] = rape_catch_data

Drop the original 강간,추행 columns

crime_station.drop(columns=[('강간,추행', '발생'), ('강간,추행', '검거')], inplace=True)

crime_station.head()
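
Why `fill_value = 0` matters for this merge: in pandas, adding a number to NaN gives NaN, so summing the two columns without filling first would leave every row empty. A minimal sketch on made-up counts (the station labels and numbers are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the pivoted table: each row has a value in only one column.
toy = pd.DataFrame(
    {'강간': [10.0, np.nan], '강간,추행': [np.nan, 7.0]},
    index=['station_a', 'station_b'],
)

# NaN + number -> NaN, so every row is lost.
without_fill = toy['강간'] + toy['강간,추행']

# Filling with 0 first keeps whichever value is present.
with_fill = toy['강간'].fillna(0) + toy['강간,추행'].fillna(0)

print(without_fill.isna().all())
print(with_fill.tolist())
```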

+ Addendum: code that merges the data before building the pivot table

import numpy as np
import pandas as pd

# Read the data
crime_raw_data = pd.read_csv('../data/02. crime_in_Seoul.csv', thousands=',', encoding='euc-kr')
# Replace the '강간,추행' value in the 죄종 column with '강간'
crime_raw_data['죄종'] = crime_raw_data['죄종'].replace('강간,추행', '강간')

# Create the pivot table
crime_station = pd.pivot_table(
    crime_raw_data,
    index = '구분', 
    columns = ['죄종', '발생검거'], 
    aggfunc = [np.sum],
)
# Drop the top 'sum' and '건수' column levels
crime_station.columns = crime_station.columns.droplevel([0, 1])

crime_station['구별'] = np.nan
crime_station['lat'] = np.nan
crime_station['lng'] = np.nan

crime_station.head()

Fill the null values with a loop

for idx, rows in crime_station.iterrows():
    station_name = '서울' + str(idx) + '경찰서'
    tmp = gmaps.geocode(station_name, language='ko')
    gu = tmp[0].get('formatted_address')

    lat = tmp[0].get('geometry')['location']['lat']
    lng = tmp[0].get('geometry')['location']['lng']

    crime_station.loc[idx, 'lat'] = lat
    crime_station.loc[idx, 'lng'] = lng
    crime_station.loc[idx, '구별'] = gu.split()[2]

crime_station.head()
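
The loop above assumes every lookup returns at least one result, because `tmp[0]` raises an `IndexError` on an empty list. Below is a hedged sketch of a guard. It uses a stand-in function `fake_geocode` (hypothetical, with placeholder coordinates) instead of the real client so that it runs without an API key. The helper name `lookup` is also an assumption.

```python
def fake_geocode(name, language='ko'):
    """Hypothetical stand-in for gmaps.geocode(); returns [] for unknown names."""
    known = {
        '서울영등포경찰서': [{
            'formatted_address': '대한민국 서울특별시 영등포구 국회대로 608',
            'geometry': {'location': {'lat': 37.5, 'lng': 126.9}},  # placeholders
        }],
    }
    return known.get(name, [])

def lookup(name, geocode=fake_geocode):
    """Return (district, lat, lng), or None when the lookup finds nothing."""
    result = geocode(name, language='ko')
    if not result:
        return None
    first = result[0]
    loc = first['geometry']['location']
    return first['formatted_address'].split()[2], loc['lat'], loc['lng']

print(lookup('서울영등포경찰서'))
print(lookup('서울없는경찰서'))
```

In the real loop, a `None` return could be used to skip the row and leave its values as NaN for manual review.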

# Read the column levels in order so the column names can be tidied up
crime_station.columns.get_level_values(0)[2]    # 강도
crime_station.columns.get_level_values(1)[2]    # 검거
crime_station.columns.get_level_values(0)[2] + crime_station.columns.get_level_values(1)[2]    # 강도검거

# The same thing for every column, written as a for loop (list comprehension)
tmp = [
    crime_station.columns.get_level_values(0)[n] + crime_station.columns.get_level_values(1)[n]
    for n in range(len(crime_station.columns.get_level_values(0)))
]
tmp

Replace the columns

crime_station.columns = tmp
crime_station.head()
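
Each label of a two-level column index is a tuple, so the same flattening can also be written by joining each tuple directly. A small sketch on a toy frame whose columns mirror the (죄종, 발생검거) levels; the values are made up:

```python
import pandas as pd

# Toy two-level columns mirroring (죄종, 발생검거).
cols = pd.MultiIndex.from_tuples(
    [('강도', '검거'), ('강도', '발생'), ('살인', '검거'), ('살인', '발생')]
)
toy = pd.DataFrame([[1, 2, 3, 4]], columns=cols)

# Iterating over a MultiIndex yields tuples; joining each gives the flat name.
flat = [''.join(col) for col in toy.columns]
toy.columns = flat
print(flat)
```

A column whose second level is an empty string, as is presumably the case for 구별, lat, and lng here, would keep its first-level name unchanged under this join.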

Save the data

# Save the data
crime_station.to_csv('../data/02. crime_in_Seoul_raw.csv', sep=',', encoding='utf-8')

Aggregate the data by district (구별)

# index_col: the column to use as the index
crime_anal_station = pd.read_csv('../data/02. crime_in_Seoul_raw.csv', index_col = 0, encoding='utf-8')
crime_anal_station.head()

# Create a pivot table that aggregates the data by district
crime_anal_gu = pd.pivot_table(crime_anal_station, index = '구별', aggfunc=np.sum)

# Drop the location columns lng and lat, which crime_anal_gu does not need
crime_anal_gu.drop('lng', axis = 1, inplace = True)
crime_anal_gu.drop('lat', axis = 1, inplace = True)
crime_anal_gu.head()

Compute the arrest rates (검거율)

target = ['강간검거율', '강도검거율', '살인검거율', '절도검거율', '폭력검거율']

num = ['강간검거', '강도검거', '살인검거', '절도검거', '폭력검거']
den = ['강간발생', '강도발생', '살인발생', '절도발생', '폭력발생']

crime_anal_gu[target] = crime_anal_gu[num].div(crime_anal_gu[den].values) * 100

# Drop the columns that are no longer needed
crime_anal_gu.drop(['강간검거', '강도검거', '살인검거', '절도검거', '폭력검거'], axis=1, inplace=True)

crime_anal_gu.head()
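
The `.values` in the division above is what makes it work: dividing one DataFrame by another aligns on column names, and the arrest and occurrence columns have different names. A minimal sketch on made-up numbers:

```python
import pandas as pd

toy = pd.DataFrame({'강도검거': [8, 3], '강도발생': [10, 4]}, index=['a', 'b'])

# DataFrame / DataFrame aligns on column names; they differ, so all cells are NaN.
aligned = toy[['강도검거']].div(toy[['강도발생']])

# .values drops the labels, so the division is done by position.
rate = toy[['강도검거']].div(toy[['강도발생']].values) * 100

print(aligned.isna().all().all())
print(rate['강도검거'].tolist())
```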

Cap values that exceed 100%

# Replace values greater than 100
crime_anal_gu[crime_anal_gu[target] > 100] = 100

crime_anal_gu.head()
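
An alternative way to cap the rates, shown here only as a sketch on invented numbers, is `DataFrame.clip`, which limits just the selected columns and leaves the count columns untouched:

```python
import pandas as pd

toy = pd.DataFrame({'살인검거율': [120.0, 95.0], '살인': [5, 20]}, index=['a', 'b'])
target = ['살인검거율']

# clip(upper=100) caps only the rate columns selected by target.
toy[target] = toy[target].clip(upper=100)
print(toy)
```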

Tidy up the column names

# Rename the columns
crime_anal_gu.rename(
    columns={'강간발생':'강간', '살인발생':'살인', '강도발생':'강도', '절도발생':'절도', '폭력발생':'폭력'}, 
    inplace = True
)
crime_anal_gu.head()

Prepare the data for sorting the crime data

Normalize each column to the 0 to 1 range by dividing its values by the column maximum

col = ['강간', '강도', '살인', '절도', '폭력']
crime_anal_norm = crime_anal_gu[col] / crime_anal_gu[col].max()
crime_anal_norm.head()
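
A small sketch of this max normalization on made-up counts. Dividing a DataFrame by `max()` broadcasts each column's maximum across its rows, so the largest value in every column becomes 1.0. The smallest value becomes 0 only if the original count is 0.

```python
import pandas as pd

toy = pd.DataFrame({'살인': [2, 4, 8], '절도': [100, 50, 25]}, index=['a', 'b', 'c'])

# toy.max() is a Series of per-column maxima; division aligns it with the columns.
norm = toy / toy.max()
print(norm)
```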

Add the arrest rates

# Add the arrest rates
col2 = ['강간검거율', '강도검거율', '살인검거율', '절도검거율', '폭력검거율']
crime_anal_norm[col2] = crime_anal_gu[col2]
crime_anal_norm.head()

Add the population and CCTV counts from the per-district CCTV data

result_CCTV = pd.read_csv('../data/01. CCTV_result.csv', encoding='cp949', index_col='구별')
crime_anal_norm[['인구수', 'CCTV']] = result_CCTV[['인구수', '소계']]

crime_anal_norm.head()

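Add the 범죄 (crime) column

# Use the mean of the normalized counts across all crime types as the representative crime value
col = ['강간', '강도', '살인', '절도', '폭력']

# axis=1: compute across the columns of each row; axis=0: compute down the rows of each column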
crime_anal_norm['범죄'] = np.mean(crime_anal_norm[col], axis=1)
crime_anal_norm.head()
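
The `axis` argument is easy to mix up, so here is a tiny sketch on a made-up 2 x 3 array: `axis=1` gives one mean per row, and `axis=0` gives one mean per column.

```python
import numpy as np

values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

row_means = np.mean(values, axis=1)   # one mean per row
col_means = np.mean(values, axis=0)   # one mean per column
print(row_means, col_means)
```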

Add the 검거 (arrest) column

# Use the mean of the arrest rates as the representative arrest value

col = ['강간검거율', '강도검거율', '살인검거율', '절도검거율', '폭력검거율']
crime_anal_norm['검거'] = np.mean(crime_anal_norm[col], axis=1)
crime_anal_norm.head()

Seaborn

Basic setup

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rc

plt.rcParams['axes.unicode_minus'] = False
rc('font', family='Arial Unicode MS')
get_ipython().run_line_magic('matplotlib', 'inline')

set_style(): sets the plot style to one of white, whitegrid, dark, darkgrid, or ticks

x = np.linspace(0, 14, 100)
y1 = np.sin(x)
y2 = 2 * np.sin(x + 0.5)
y3 = 3 * np.sin(x + 1.0)
y4 = 4 * np.sin(x + 1.5)

sns.set_style('whitegrid')    # background style: white, whitegrid, dark, darkgrid, ticks; set it before creating the figure so it takes effect
plt.figure(figsize=(10, 6))
plt.plot(x, y1, x, y2, x, y3, x, y4)
# plt.grid()
plt.show()

Seaborn tips data

The built-in tips dataset

tips = sns.load_dataset('tips')
tips

boxplot(): draws the median, quartiles, minimum, maximum, and outliers.

# boxplot
plt.figure(figsize=(8, 6))
sns.boxplot(x = tips['total_bill'])
plt.show()
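
The quantities a box plot draws can be computed by hand. The sketch below uses made-up values and the 1.5 x IQR rule, which is seaborn's default whisker setting (`whis=1.5`); points beyond those fences are the ones drawn as outliers.

```python
import numpy as np

# Made-up values; 30 is a deliberate outlier.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                      # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr       # points below this are outliers
upper_fence = q3 + 1.5 * iqr       # points above this are outliers
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, outliers)
```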

boxplot(x, y, data): draws the y column grouped by the values of the x column.

# boxplot
plt.figure(figsize=(8, 6))
sns.boxplot(x ='day', y='total_bill', data=tips)
plt.show()

hue: takes categorical or discrete data and splits the plot by group.

palette: sets the color palette.

# boxplot hue, palette option
# hue: shows categorical data / palette: changes the colors

plt.figure(figsize=(8, 6))
sns.boxplot(x='day', y='total_bill', data=tips, hue='smoker', palette='Set2')
plt.show()

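swarmplot: draws a categorical scatter plot in which the points are adjusted so they do not overlap.

color: given here as a grayscale string from 0 to 1; values closer to 1 are whiter and values closer to 0 are blacker.

# swarmplot: draws the data as a scatter plot
# color: closer to 1 is white, closer to 0 is black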

plt.figure(figsize=(8, 6))
sns.swarmplot(x='day', y='total_bill', data=tips, color='0.2')
plt.show()

boxplot with swarmplot

# boxplot with swarmplot

plt.figure(figsize=(8, 6))
sns.boxplot(x='day', y='total_bill', data=tips)
sns.swarmplot(x='day', y='total_bill', data=tips, color='0.25')
plt.show()

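lmplot: visualizes a linear regression model; the scatter plot and the regression line show the linear relationship.

# lmplot: examine the relationship between total_bill and tip

sns.set_style('darkgrid')
sns.lmplot(x='total_bill', y='tip', data=tips, height=7)    # the size parameter was renamed to height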
plt.show()

hue: takes categorical or discrete data and splits the plot by group.

# lmplot: examine the relationship between total_bill and tip, split by smoker

sns.set_style('darkgrid')
sns.lmplot(x='total_bill', y='tip', data=tips, height=7, hue='smoker')
plt.show()

flights data

The flights dataset

flights = sns.load_dataset('flights')
flights.head()

Create a pivot table

# pivot
# index, columns, values
flights = flights.pivot(index='month', columns='year', values='passengers')
flights.head()
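
What `pivot` does here is reshape long rows into a month-by-year grid, which is the 2D layout a heatmap needs. A sketch on a few made-up rows in the same shape as the flights data:

```python
import pandas as pd

# Made-up long-format rows shaped like the flights dataset.
long_df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'passengers': [10, 20, 30, 40],
})

# One row per month, one column per year, passengers as the cell values.
wide = long_df.pivot(index='month', columns='year', values='passengers')
print(wide)
```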

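heatmap: visualizes 2D data by mapping the values to colors.

It is useful for showing correlations, frequencies, distributions, and so on.

# heatmap
# annot: whether to print the data values / fmt: number format, 'd': integer, 'f': float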

plt.figure(figsize=(10, 8))
sns.heatmap(data=flights, annot=True, fmt='d')
plt.show()

colormap: sets the colors

# colormap / color settings; see the official website for the available options
plt.figure(figsize=(10, 8))
sns.heatmap(data=flights, annot=True, fmt='d', cmap='YlGnBu')
plt.show()

iris data

The iris dataset

iris = sns.load_dataset('iris')
iris.tail()

pairplot: shows the pairwise relationships between all variables as scatter plots, with a histogram of each column on the diagonal.

# pairplot / applies the 'ticks' style from set_style

sns.set_style('ticks')
sns.pairplot(iris)
plt.show()

Apply hue

# hue option

sns.pairplot(iris, hue='species')
plt.show()

Select columns

# pairplot with only the chosen columns

sns.pairplot(iris, x_vars = ['sepal_width', 'sepal_length'], y_vars = ['petal_width', 'petal_length'])
plt.show()

anscombe data

The anscombe dataset

anscombe = sns.load_dataset('anscombe')
anscombe.tail()

lmplot

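ci: sets the confidence interval. When estimating a population parameter, it indicates how much the estimate can be trusted. 95% is the usual choice; ci=None draws no interval.

data: only the rows whose dataset column equals I are used

# ci: confidence interval (None turns it off)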
sns.set_style('darkgrid')
sns.lmplot(x='x', y='y', data=anscombe.query('dataset == "I"'), ci=None, height=7)
plt.show()

scatter_kws: sets the point size

# scatter_kws: keyword arguments for the scatter points; 's' sets the marker size
sns.set_style('darkgrid')
sns.lmplot(x='x', y='y', data=anscombe.query('dataset == "I"'), ci=None, height=7, scatter_kws={'s':100})
plt.show()

order: sets the degree of the regression curve. 1: linear, 2: quadratic polynomial regression, 3: cubic polynomial regression, n: degree-n polynomial regression

# order option : 1

sns.set_style('darkgrid')
sns.lmplot(
    x='x', 
    y='y', 
    data=anscombe.query('dataset == "II"'), 
    order = 1,
    ci=None, 
    height=7, 
    scatter_kws={'s':100}
)
plt.show()

# order option : 2

sns.set_style('darkgrid')
sns.lmplot(
    x='x', 
    y='y', 
    data=anscombe.query('dataset == "II"'), 
    order = 2,
    ci=None, 
    height=7, 
    scatter_kws={'s':100}
)
plt.show()
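
The effect of `order` can be checked numerically with `numpy.polyfit`, which the seaborn documentation names as the estimator used when `order` is greater than 1. The sketch below uses synthetic, exactly quadratic data (not the anscombe values) and compares the leftover error of a degree-1 and a degree-2 fit; the helper name `sse` is an assumption.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 0.5 * x**2 - 3.0 * x + 2.0       # synthetic, exactly quadratic

def sse(order):
    """Sum of squared residuals of a polynomial fit of the given order."""
    coeffs = np.polyfit(x, y, order)
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

# The straight line leaves a large error; the quadratic fits almost exactly.
print(sse(1), sse(2))
```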

outlier: an anomalous value. Outliers can distort the overall trend.

# dataset III contains an outlier

sns.set_style('darkgrid')
sns.lmplot(
    x='x', 
    y='y', 
    data=anscombe.query('dataset == "III"'), 
    ci=None, 
    height=7, 
    scatter_kws={'s':100}
)
plt.show()
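
How much a single outlier pulls an ordinary least-squares line can be seen with a quick sketch. The data below is synthetic (points on an exact line, with one point pushed away), not the anscombe values:

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0          # synthetic points on an exact line with slope 2
y_out = y.copy()
y_out[-1] += 20.0          # push the last point far off the line

slope_clean = np.polyfit(x, y, 1)[0]
slope_outlier = np.polyfit(x, y_out, 1)[0]

# One shifted point is enough to change the fitted slope noticeably.
print(slope_clean, slope_outlier)
```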

robust: minimizes the influence of outliers and noise on the fitted result.

# robust option (requires statsmodels to be installed)

sns.set_style('darkgrid')
sns.lmplot(
    x='x', 
    y='y', 
    data=anscombe.query('dataset == "III"'), 
    robust=True,
    ci=None, 
    height=7, 
    scatter_kws={'s':100}
)
plt.show()