Link Search Menu Expand Document

Kiva Crowdfunding 2


Table of contents
  1. 데이터 정리
  2. Funds for Philippines
    1. 분기별 추이
    2. borrower 성별
    3. borrower 지역
    4. repayment interval
    5. 주요 sector, activity, use
    6. ‘use’ 단어 빈도 분석
  3. 데이터 결합해 분석
    1. 데이터 정리
    2. kiva만을 위한 loan 비중 확인
    3. country, sector별 forkiva 비중
    4. Field Partner와의 협업 관계 확인
    5. 국가별 funded amount와 MPI

*분석 대상 데이터셋: Kiva Crowdfunding Data

  • 데이터셋 출처
  • kiva.org: 세계 각국의 경제적으로 어려운 사람들에게 돈을 빌려주는 online crowdfunding platform
  • kiva_loans.csv: 2014.01.01~2017.07.26 사이에 kiva에 올라온 loan 정보 데이터. (671,205개의 row)
  • kiva_mpi_region_locations.csv: 2,772개의 row. 각 지역의 geolocation & poverty level 정보
  • loan_theme_ids.csv: 779,092개의 row. 각 loan의 theme에 대한 정보.
  • loan_theme_by_region.csv: 15,736개의 row. 각 loan theme과 연결된 regional 정보들

데이터 정리

# 필요한 라이브러리 import
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

import scipy.stats as stats
kiva_df = pd.read_csv('data/kiva_loans.csv')
kiva_df.head(3)
  id funded_amount loan_amount activity sector use country_code country region currency partner_id posted_time disbursed_time funded_time term_in_months lender_count tags borrower_genders repayment_interval date
0 653051 300 300 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 247 2014-01-01 06:12:39+00:00 2013-12-17 08:00:00+00:00 2014-01-02 10:06:32+00:00 12 12 nan female irregular 2014-01-01
1 653053 575 575 Rickshaw Transportation to repair and maintain the auto rickshaw used in their business. PK Pakistan Lahore PKR 247 2014-01-01 06:51:08+00:00 2013-12-17 08:00:00+00:00 2014-01-02 09:17:23+00:00 11 14 nan female, female irregular 2014-01-01
2 653068 150 150 Transportation Transportation To repair their old cycle-van and buy another one to rent out as a source of income IN India Maynaguri INR 334 2014-01-01 09:58:07+00:00 2013-12-17 08:00:00+00:00 2014-01-01 16:01:36+00:00 43 6 user_favorite, user_favorite female bullet 2014-01-01
3 653063 200 200 Embroidery Arts to purchase an embroidery machine and a variety of new embroidery materials. PK Pakistan Lahore PKR 247 2014-01-01 08:03:11+00:00 2013-12-24 08:00:00+00:00 2014-01-01 13:00:00+00:00 11 8 nan female irregular 2014-01-01
4 653084 400 400 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 245 2014-01-01 11:53:19+00:00 2013-12-17 08:00:00+00:00 2014-01-01 19:18:51+00:00 14 16 nan female monthly 2014-01-01
  • funded_amount: amount disbursed by Kiva to the field agent (USD)
  • loan_amount: amount disbursed by the field agent to the borrower
    • funded_amount < loan_amount인 경우: 48328건
    • funded_amount > loan_amount인 경우: 2건
  • country_code: ISO country code
  • posted_time: the time at which the loan is posted on Kiva by the field agent
  • disbursed_time: the time at which the loan is disbursed by the field agent to the borrower
  • funded_time: the time at which the loan posted to Kiva gets funded by lenders completely
    • null값: 48331개 (funded_amount >= loan_amount인 경우 중 3건도 funded_time이 null로 나옴…)
  • term_in_months: the duration for which the loan was disbursed in months
  • date: posted_time에서 date만 추출한 것. (Kiva에 올라온 날짜)
  1. datetime 데이터 정리

     # datetime 형식 변환
     kiva_df['posted_time'] = pd.to_datetime(kiva_df['posted_time'])
     kiva_df['disbursed_time'] = pd.to_datetime(kiva_df['disbursed_time'])
     kiva_df['funded_time'] = pd.to_datetime(kiva_df['funded_time'])
    
     # posted_time -> Month, Quarter 단위로 저장해둠
     kiva_df['posted_month'] = kiva_df['posted_time'].dt.strftime('%Y%m')
     kiva_df['posted_quarter'] = kiva_df['posted_time'].dt.to_period("Q").astype('str')
    
  2. borrower type 정리

     ## borrower를 성별 / 단체 유무로 정리
     def borrower_type(x):
         if type(x) != str:
             borrower = 'N/A'
         elif x == 'female':
             borrower = 'female'
         elif x == 'male':
             borrower = 'male'
         else:
             borr_set = set(x.split(', '))
             if 'female' in borr_set:
                 if 'male' in borr_set:
                     borrower = 'mixed_group'
                 else:
                     borrower = 'female_group'
             else:
                 borrower = 'male_group'
         return borrower
    
     kiva_df['borrower_type'] = kiva_df['borrower_genders'].apply(lambda x: borrower_type(x))
    

Funds for Philippines

: kiva를 통해 받은 funding 금액이 가장 많은 국가인 Philippines에 대해 집중 탐구

loan_per_country

# philippines 데이터만 따로 추출
ph_df = kiva_df[kiva_df['country'] == 'Philippines']

ph_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 160441 entries, 51 to 670445
Data columns (total 23 columns):
 #   Column              Non-Null Count   Dtype              
---  ------              --------------   -----              
 0   id                  160441 non-null  int64              
 1   funded_amount       160441 non-null  float64            
 2   loan_amount         160441 non-null  float64            
 3   activity            160441 non-null  object             
 4   sector              160441 non-null  object             
 5   use                 160361 non-null  object             
 6   country_code        160441 non-null  object             
 7   country             160441 non-null  object             
 8   region              160360 non-null  object             
 9   currency            160441 non-null  object             
 10  partner_id          160441 non-null  float64            
 11  posted_time         160441 non-null  datetime64[ns, UTC]
 12  disbursed_time      160441 non-null  datetime64[ns, UTC]
 13  funded_time         157607 non-null  datetime64[ns, UTC]
 14  term_in_months      160441 non-null  float64            
 15  lender_count        160441 non-null  int64              
 16  tags                94094 non-null   object             
 17  borrower_genders    160361 non-null  object             
 18  repayment_interval  160441 non-null  object             
 19  date                160441 non-null  object             
 20  posted_quarter      160441 non-null  object             
 21  borrower_type       160441 non-null  object             
 22  posted_month        160441 non-null  object             
dtypes: datetime64[ns, UTC](3), float64(4), int64(2), object(14)
memory usage: 29.4+ MB

분기별 추이

  1. 분기별 loan count

     plt.figure(figsize=(16, 6))
     # 2017Q3은 2017.07.26까지의 기록밖에 없으므로 제외하고 시각화
     sns.lineplot(data=ph_df[ph_df['posted_quarter'] != '2017Q3'], x='posted_quarter', y='funded_amount', 
                 estimator='count', ci=None, color='#C88686')
     plt.ylim(5000, 18000);
    

    loan_count_ph

  2. 분기별 평균 funded_amount

     plt.figure(figsize=(16, 6))
     sns.lineplot(data=ph_df[ph_df['posted_quarter'] != '2017Q3'], x='posted_quarter', y='funded_amount', color='#C88686')
     plt.ylim(250, 450);
    

    avg_funded_amount_ph

  3. 분기별 총 funded_amount

     plt.figure(figsize=(16, 6))
     sns.lineplot(data=ph_df[ph_df['posted_quarter'] != '2017Q3'], x='posted_quarter', y='funded_amount', 
                 estimator='sum', ci=None, color='#C88686')
     plt.ylim(1500000, 5500000);
    

    total_funded_amount_ph

    • 2014년 이후 post당 평균 funded_amount는 줄고 있지만, 올라온 loan post 수 자체가 계속 증가세를 보임에 따라 총 funded_amount도 증가 추세.

borrower 성별

  1. borrower type별 loan count

     sns.countplot(data=ph_df, x='borrower_type', palette='pink');
    

    loan_count_per_borrower_type_ph

    • 여러 borrower가 그룹으로 빌리는 경우는 없음 (female_group, male_group, mixed_group)
    • female 혼자 빌리는 경우가 압도적 (전체 loan의 약 95%)
     ph_df.groupby('borrower_type')['funded_amount'].count()
    
     borrower_type
     N/A           80
     female    151984
     male        8377
     Name: funded_amount, dtype: int64
    
  2. borrower type별 평균 funded amount

     sns.barplot(data=ph_df, x='borrower_type', y='funded_amount', palette='pink');
    

    funded_amount_per_borrower_type_ph

    • 평균 funded amount는 오히려 female이 많은 편은 아님
  3. 분기별 총 funded amount 비교

     plt.figure(figsize=(16, 6))
     sns.lineplot(data=ph_df[ph_df['posted_quarter'] != '2017Q3'], x='posted_quarter', y='funded_amount', 
                 hue='borrower_type', estimator='sum', ci=None, palette='pink');
    

    quarterly_funded_amount_per_borrower_type_ph

    • 총 funded amount 자체도 female이 모든 기간에 가장 압도적.
    • female borrower를 중심으로 funded_amount가 크게 증가세

borrower 지역

  1. funded_amount Top 10 지역

     ph_region = ph_df.groupby('region')[['funded_amount']].sum().reset_index()
     ph_region.sort_values(by='funded_amount', ascending=False, inplace=True)
    
     # Top 10까지만 시각화
     plt.figure(figsize=(7, 5))
     sns.barplot(data=ph_region.head(10), x='funded_amount', y='region', palette='pink');
    

    region_funded_amount_ph

    • Palawan과 Negros Occidental의 지역들이 상위권에 많이 포진되어 있음

     

  2. region의 뒷부분을 따로 ‘province’로 저장
    • region에 적힌 구조가 대체로 ‘Narra, Palawan’처럼 ‘municipality, province’의 구조로 되어 있다고 생각해, 쉼표(,) 뒷부분을 분리해서 저장
     # 쉼표가 없는 region명이 몇 개인지 확인
     print(len(ph_region))
     print(len(ph_region[ph_region['region'].str.contains(',')]))  
    
     3638
     3249
    

    → 쉼표로 분리된 region명이 대부분이므로, 쉼표를 기준으로 나누기로 함.

     # 쉼표(,)를 기준으로 뒷부분 반을 province로 저장하고, 쉼표 없이 하나가 통째로 region명인 경우 그대로 province에 저장.
    
     ph_region_temp = ph_region['region'].str.rsplit(',', 1, expand=True)
    
     for i in ph_region_temp.index:
         if ph_region_temp.loc[i, 1]:   ## 1이 None이 아니면 실행됨 (None이면 else로)
             ph_region.loc[i, 'province'] = ph_region_temp.loc[i, 1]
         else:
             ph_region.loc[i, 'province'] = ph_region_temp.loc[i, 0]
    
     ph_region.head()
    
      region funded_amount province
    2000 Narra, Palawan 1234625.0 Palawan
    2556 Quezon, Palawan 1210875.0 Palawan
    583 Brookes Point, Palawan 1180100.0 Palawan
    1381 Kabankalan, Negros Occidental 1040700.0 Negros Occidental
    1282 Hinigaran, Negros Occidental 956950.0 Negros Occidental
  3. province별 funded_amount 정리, 비교

     ph_province = ph_region.groupby('province')[['funded_amount']].sum().reset_index()
     ph_province.sort_values(by='funded_amount', ascending=False, inplace=True)
    
     # Top 10까지만 시각화
     plt.figure(figsize=(7, 5))
     sns.barplot(data=ph_province.head(10), x='funded_amount', y='province', palette='pink');
    

    province_funded_amount_ph

    • Negros Occidental이 가장 압도적
  4. 각 province별로, 어떤 sector의 funded_amount가 많은지 확인

     # ph_df에 'province' 칼럼를 더해줌
     ph_df = pd.merge(ph_df, ph_region[['region', 'province']], how='left', on='region')
    
     # funded amount Top 10 province만 따로 저장
     top_countries = ph_df.groupby('province')[['funded_amount']].sum().sort_values(by='funded_amount', ascending=False).head(10).index
    
     # funded amount 내림차순으로 sector 순서 정렬
     sectors_order = ph_df.groupby('sector')[['funded_amount']].sum().sort_values(by='funded_amount', ascending=False).index
    
     loans_province_sector = pd.pivot_table(ph_df, index='province', columns='sector', values='funded_amount', fill_value=0, aggfunc='sum')
     loans_province_sector = loans_province_sector[loans_province_sector.index.isin(top_countries)]  # top 10 province만 대상으로 자름
     loans_province_sector = loans_province_sector[sectors_order]   # 전체 loan amount가 가장 높은 sector부터 내림차순 정렬
    
     loans_province_sector = loans_province_sector.T
     loans_province_sector = loans_province_sector[top_countries]   # 전체 loan amount가 가장 높은 province부터 내림차순 정렬
    
    
     # heatmap으로 시각화
     plt.figure(figsize=(14, 9))
     sns.heatmap(loans_province_sector, annot=True, cmap='pink_r', fmt='.0f')
     plt.xticks(rotation=40);
    

    province_sector_funded_amount_ph

    • Negros Occidental은 Retail sector로 빌린 금액이 가장 많음
    • Palawan은 Retail과 Food sector가 비슷하게 많음
    • Negros Occidental과 Negros Oriental은 Agriculture sector로 빌린 금액도 많은 편

repayment interval

  1. repayment interval별 총 funded_amount

     sns.barplot(data=ph_df, x='repayment_interval', y='funded_amount', 
                 estimator=np.sum, ci=None, palette='pink');
    

    repayment_interval_funded_amount_ph

  2. repayment interval별 평균 funded_amount

     sns.barplot(data=ph_df, x='repayment_interval', y='funded_amount', palette='pink');
    

    repayment_interval_avg_funded_amount_ph

  3. repayment interval별 loan 수

     sns.countplot(data=ph_df, x='repayment_interval', palette='pink');
    

    repayment_interval_loans_ph

    • irregular 상환 방식의 loan이 가장 많음
    • irregular 상환 방식의 loan이 평균 funded amount가 많은 편은 아니지만, 수 자체가 압도적이기 때문에 전체 funded amount 중 irregular 상환 방식인 금액이 압도적
    • 전체 kiva loan 중에서는 monthly로 상환되는 금액이 가장 많은 것을 생각하면, irregular 상환 금액이 많다는 것은 Philippines의 특징인 듯
  4. 분기별 repayment interval별 funded amount

     quarter_interval = ph_df.groupby(['posted_quarter', 'repayment_interval'])[['funded_amount']].sum().reset_index()
     quarter_interval = quarter_interval[quarter_interval['posted_quarter'] != '2017Q3']
    
     plt.figure(figsize=(16, 6))
     sns.lineplot(data=quarter_interval, x='posted_quarter', y='funded_amount', hue='repayment_interval', palette='pink',
                 hue_order=['irregular', 'monthly', 'bullet'])
    
     # legend를 box 밖으로 빼 줌
     plt.legend(bbox_to_anchor=(1.01, 1), borderaxespad=0);
    

    quarterly_repayment_interval_funded_amount_ph

    • irregular 상환 방식의 금액이 가장 많이 증가하는 추세
    • irregular 상환보다는 monthly 상환이 더 안정적인 자금 계획에 유리하므로, irregular 상환 방식보다는 monthly 상환 방식을 늘리는 것이 좋다고 생각됨

주요 sector, activity, use

  1. sector별 funded amount

     # funded amount 순으로 내림차순 정렬해 시각화
     ph_sectors = ph_df.groupby('sector')[['funded_amount']].sum().sort_values(by='funded_amount', ascending=False).reset_index()
    
     plt.figure(figsize=(12, 5))
     sns.barplot(data=ph_sectors, x='sector', y='funded_amount', palette='pink')
     plt.xticks(rotation=40);
    

    funded_amount_sector_ph

    • Retail, Food, Agriculture 관련 funded amount가 가장 많음
  2. funded amount Top 10 activities

     ph_activities = ph_df.groupby(['activity', 'sector'])[['funded_amount']].sum().sort_values(by='funded_amount', ascending=False).reset_index()
     ph_activities.head(10)
    
      activity sector funded_amount
    0 General Store Retail 15025550.0
    1 Pigs Agriculture 6039975.0
    2 Farming Agriculture 4063150.0
    3 Fishing Food 3129675.0
    4 Fish Selling Food 2969850.0
    5 Food Production/Sales Food 2659425.0
    6 Personal Housing Expenses Housing 1710425.0
    7 Fruits & Vegetables Food 1563675.0
    8 Retail Retail 1366075.0
    9 Motorcycle Transport Transportation 1067450.0

    → Top 10 activities 시각화

     plt.figure(figsize=(7, 5))
     sns.barplot(data=ph_activities.head(10), x='funded_amount', y='activity', palette='pink');
    

    funded_amount_activity_ph

    • ‘Retail’ sector의 ‘General Store’ 목적의 funded amount가 압도적으로 많고, 그 다음은 오히려 ‘Agriculture’ sector의 ‘Pigs’와 ‘Farming’ 목적의 funded amount가 많음
    • 여러 섬으로 구성된 국가인만큼, ‘Food’ sector 중에서는 ‘Fishing’과 ‘Fish Selling’ 목적의 funded amount가 많음
  3. funded amount Top 10 use

     ph_use = ph_df.groupby(['use'])[['funded_amount']].sum().sort_values(by='funded_amount', ascending=False).reset_index()
    
     plt.figure(figsize=(7, 5))
     sns.barplot(data=ph_use.head(10), x='funded_amount', y='use', palette='pink');
    

    funded_amount_use_ph

    • 같은 내용이라도 사용한 단어에 따라 다르게 집계되므로 정확히 파악하기 어렵지만, ‘grocery’, ‘pig’, ‘food production business’ 등의 단어들이 눈에 띔
  4. ‘General Store’ 관련 Top 10 use

     # 우선, activity, sector, use별 funded amount를 정리
     ph_use_activity = ph_df.groupby(['use', 'activity', 'sector'])[['funded_amount']].sum().sort_values(by='funded_amount', ascending=False).reset_index()
     ph_use_activity.head()
    
      use activity sector funded_amount
    0 to purchase more groceries to sell. General Store Retail 501400.0
    1 to buy feed and other supplies to raise her pigs. Pigs Agriculture 431850.0
    2 to buy fertilizers and other farm supplies. Farming Agriculture 427725.0
    3 to buy feed and vitamins for her pigs. Pigs Agriculture 374800.0
    4 to buy ingredients for her food production business Food Production/Sales Food 344125.0

    → ‘General Store’ Top 10 use 시각화

     plt.figure(figsize=(7, 5))
     sns.barplot(data=ph_use_activity[ph_use_activity['activity'] == 'General Store'].head(10), x='funded_amount', y='use', palette='pink');
    

    funded_amount_use_general_store_ph

    • 같은 내용이라도 사용한 단어에 따라 다르게 집계되므로 정확히 파악하기 어렵지만, General Store를 위한 fund로는 ‘grocery 구매’ 관련 목적이 많다고 생각됨

‘use’ 단어 빈도 분석

  • ‘use’는 같은 내용이라도 단어 사용 등에 따라 다르게 집계되므로, 단순 집계 후 시각화보다 단어별로 쪼개서 파악하는 것이 도움이 될 수 있다고 생각
  • 사유를 단어별로 나눠서 분석해, 어떤 단어가 빈번하게 등장하는지 확인하고, 이로부터 필리핀의 전반적인 loan needs를 파악
## 단어 빈도 계산 &  wordcloud 생성 & 빈번하게 사용된 단어 bar graph로 시각화하는 코드를 미리 함수로 다 만들어 둠
## 각 loan의 funded amount는 고려하지 않고, count만 고려

import nltk
from wordcloud import WordCloud
from collections import Counter


# wordcloud mask 
x, y = np.ogrid[:1200, :1200]
mask = (x - 600) ** 2 + (y - 600) ** 2 > 520 ** 2
mask = 255 * mask.astype(int)


def count_words(sector_name):
    # Text Cleaning & Case Conversion & Tokenization
    words = []
    for content in ph_df[ph_df['sector'] == sector_name]['use']:
        content_cleaned = str(content).replace('.', '').replace(',','').strip().lower()  # ., 정도만 제거, strip, lower 
        words.extend(content_cleaned.split())  # 간단하게 공백 기준으로만 쪼개줌

    # POS Tagging
    tokens_pos = nltk.pos_tag(words)  

    # Noun, Verb만 추출 & Lemmatization
    wlem = nltk.WordNetLemmatizer()
    lemm_words = []

    for word, pos in tokens_pos:
        if ('NN' in pos) | ('VB' in pos):
            lemm_words.append(wlem.lemmatize(word))

    # Counter로 빈도 계산
    c = Counter(lemm_words)
    return c


def generate_wordcloud(c):
    # generate wordcloud
    wordcloud = WordCloud(relative_scaling=.2,
                         background_color='white',
                         colormap='pink',
                         mask=mask).generate_from_frequencies(c)

    plt.figure(figsize=(7, 7))
    plt.imshow(wordcloud)
    plt.axis('off')


def visualize_top10_words(c):
    word_count = pd.DataFrame.from_dict(c, orient='index', columns=['count']).reset_index().rename(columns={'index':'word'})
    word_count.sort_values(by='count', ascending=False, inplace=True)
    plt.figure(figsize=(6, 4))
    sns.barplot(data=word_count.head(10), x='count', y='word', palette='pink')
  1. Retail sector의 use

     retail_c = count_words('Retail')
     generate_wordcloud(retail_c)
    

    retail_use_wordcloud_ph

    • canned goods, grocery 등의 item들을 사기 위한 목적이 많다고 파악됨
     visualize_top10_words(retail_c)
    

    retail_use_bargraph_ph

  2. Food sector의 use

     food_c = count_words('Food')
     generate_wordcloud(food_c)
    

    food_use_wordcloud_ph

    • ‘food’ sector도 단순히 자신의 식량을 사기 위한 것보다, ‘food production business’의 재료 구매 & 판매할 식료품 구매 목적이 많다고 판단됨
     visualize_top10_words(food_c)
    

    food_use_bargraph_ph

  3. Agriculture sector의 use

     agri_c = count_words('Agriculture')
     generate_wordcloud(agri_c)
    

    agri_use_wordcloud_ph

    • pig를 위한 식량 등을 구매하는 목적이 많고, 그 다음으로는 농업을 위한 fertilizer 등의 구매 목적도 많다고 판단됨
    • 돼지 농장이 필리핀의 주요 business 중 하나라고 추정됨
     visualize_top10_words(agri_c)
    

    agri_use_bargraph_ph

데이터 결합해 분석

데이터 정리

  1. kiva_mpi_region_locations.csv

     location_df = pd.read_csv('data/kiva_mpi_region_locations.csv')
     location_df.head(3)
    
      LocationName ISO country region world_region MPI geo lat lon
    0 Badakhshan, Afghanistan AFG Afghanistan Badakhshan South Asia 0.387 (36.7347725, 70.81199529999999) 36.7348 70.812
    1 Badghis, Afghanistan AFG Afghanistan Badghis South Asia 0.466 (35.1671339, 63.7695384) 35.1671 63.7695
    2 Baghlan, Afghanistan AFG Afghanistan Baghlan South Asia 0.3 (35.8042947, 69.2877535) 35.8043 69.2878
    • ISO: unique ID for country
    • MPI: multidimensional poverty index – 수치가 높을수록 poverty level이 높은 것

    → 중복값 제거

     print('중복 제거 이전: ', len(location_df))
    
     location_df.drop_duplicates(inplace=True, ignore_index=True)
     print('중복 제거 이후: ', len(location_df))
    
     중복 제거 이전:  2772
     중복 제거 이후:  1009
    

    → LocationName이 null값인 행 제외 (LocationName이 null이면 region, MPI도 다 null임)

     print('결측치 제거 이전: ', len(location_df))
    
     location_df = location_df[~location_df['LocationName'].isna()]
     location_df.reset_index(inplace=True, drop=True)
    
     print('결측치 제거 이후: ', len(location_df))
    
     결측치 제거 이전:  1009
     결측치 제거 이후:  984
    
  2. loan_theme_ids.csv

     loan_themes_df = pd.read_csv('data/loan_theme_ids.csv')
     loan_themes_df.head(3)
    
      id Loan Theme ID Loan Theme Type Partner ID
    0 638631 a1050000000skGl General 151
    1 640322 a1050000000skGl General 151
    2 641006 a1050000002X1ij Higher Education 160

    → id 제외하고 다 null값인 경우 삭제 (Loan Theme ID가 null이면 Loan Theme Type, Partner ID도 다 null)

     print('결측치 제거 이전: ', len(loan_themes_df))
    
     loan_themes_df = loan_themes_df[~loan_themes_df['Loan Theme ID'].isna()]
     loan_themes_df.reset_index(inplace=True, drop=True)
    
     print('결측치 제거 이후: ', len(loan_themes_df))
    
  3. loan_themes_by_region.csv

     themes_region_df = pd.read_csv('data/loan_themes_by_region.csv')
     themes_region_df.head(3)
    
      Partner ID Field Partner Name sector Loan Theme ID Loan Theme Type country forkiva region geocode_old ISO number amount LocationName geocode names geo lat lon mpi_region mpi_geo rural_pct
    0 9 KREDIT Microfinance Institution General Financial Inclusion a1050000000slfi Higher Education Cambodia No Banteay Meanchey (13.75, 103.0) KHM 1 450 Banteay Meanchey, Cambodia [(13.6672596, 102.8975098)] Banteay Meanchey Province; Cambodia (13.6672596, 102.8975098) 13.6673 102.898 Banteay Mean Chey, Cambodia (13.6672596, 102.8975098) 90
    1 9 KREDIT Microfinance Institution General Financial Inclusion a10500000068jPe Vulnerable Populations Cambodia No Battambang Province nan KHM 58 20275 Battambang Province, Cambodia [(13.0286971, 102.989615)] Battambang Province; Cambodia (13.0286971, 102.989615) 13.0287 102.99 Banteay Mean Chey, Cambodia (13.6672596, 102.8975098) 90
    2 9 KREDIT Microfinance Institution General Financial Inclusion a1050000000slfi Higher Education Cambodia No Battambang Province nan KHM 7 9150 Battambang Province, Cambodia [(13.0286971, 102.989615)] Battambang Province; Cambodia (13.0286971, 102.989615) 13.0287 102.99 Banteay Mean Chey, Cambodia (13.6672596, 102.8975098) 90
    • forkiva: Was this loan theme created specifically for Kiva?
    • geocode_old: 옛날 geocoding system
    • ISO: unique ID for country
    • number: Number of loans funded in this LocationName and this loan theme
    • amount: Dollar value of loans funded in this LocationName and this loan theme
    • LocationName: “{region}, {country}”
    • names: All placenames that the Gmaps API associates with LocationName
    • mpi_region: MPI Region where we think this loan theme is located
    • mpi_geo: Lat-Lon pair where we think this MPI region is located
    • rural_pct: The percentage of this field partners’ borrowers that are in rural areas

kiva만을 위한 loan 비중 확인

  1. kiva_df와 loan_themes_df를 merge

     kiva_loan_themes = kiva_df.copy()
     kiva_loan_themes.drop(['loan_amount', 'country_code', 'disbursed_time', 'tags', 'date', 'borrower_genders'], axis='columns', inplace=True)
    
     kiva_loan_themes = pd.merge(kiva_loan_themes, loan_themes_df, how='left', on='id')
     kiva_loan_themes.head(3)
    
      id funded_amount activity sector use country region currency partner_id posted_time funded_time term_in_months lender_count repayment_interval posted_month posted_quarter borrower_type Loan Theme ID Loan Theme Type Partner ID
    0 653051 300 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. Pakistan Lahore PKR 247 2014-01-01 06:12:39+00:00 2014-01-02 10:06:32+00:00 12 12 irregular 201401 2014Q1 female nan nan nan
    1 653053 575 Rickshaw Transportation to repair and maintain the auto rickshaw used in their business. Pakistan Lahore PKR 247 2014-01-01 06:51:08+00:00 2014-01-02 09:17:23+00:00 11 14 irregular 201401 2014Q1 female_group a1050000000sjEC Underserved 247
    2 653068 150 Transportation Transportation To repair their old cycle-van and buy another one to rent out as a source of income India Maynaguri INR 334 2014-01-01 09:58:07+00:00 2014-01-01 16:01:36+00:00 43 6 bullet 201401 2014Q1 female a1050000002VkWz Underserved 334

    → 결합 후의 null값 파악

     kiva_loan_themes.info()
    
     <class 'pandas.core.frame.DataFrame'>
     Int64Index: 671205 entries, 0 to 671204
     Data columns (total 17 columns):
     #   Column              Non-Null Count   Dtype              
     ---  ------              --------------   -----              
     0   id                  671205 non-null  int64              
     1   funded_amount       671205 non-null  float64            
     2   activity            671205 non-null  object             
     3   sector              671205 non-null  object             
     4   country             671205 non-null  object             
     5   region              614405 non-null  object             
     6   currency            671205 non-null  object             
     7   partner_id          657698 non-null  float64            
     8   posted_time         671205 non-null  datetime64[ns, UTC]
     9   funded_time         622874 non-null  datetime64[ns, UTC]
     10  term_in_months      671205 non-null  float64            
     11  lender_count        671205 non-null  int64              
     12  borrower_genders    666984 non-null  object             
     13  repayment_interval  671205 non-null  object             
     14  Loan Theme ID       657692 non-null  object             
     15  Loan Theme Type     657692 non-null  object             
     16  Partner ID          657692 non-null  float64            
     dtypes: datetime64[ns, UTC](2), float64(4), int64(2), object(9)
     memory usage: 92.2+ MB
    
  2. 위에서 만든 kiva_loan_themes에 themes_region_df를 merge

     temp1 = kiva_loan_themes[['funded_amount', 'posted_quarter', 'Loan Theme ID', 'borrower_type', 'activity', 'sector', 'use', 'country']]
    
     temp2 = themes_region_df[['Loan Theme ID', 'Loan Theme Type', 'forkiva', 'Partner ID', 'Field Partner Name']]
     temp2.drop_duplicates(inplace=True)
    
     kiva_themes_df = pd.merge(temp1, temp2, how='left', on='Loan Theme ID')
     kiva_themes_df.head(3)
    
      funded_amount posted_quarter Loan Theme ID borrower_type activity sector use country Loan Theme Type forkiva Partner ID Field Partner Name
    0 300 2014Q1 nan female Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. Pakistan nan nan nan nan
    1 575 2014Q1 a1050000000sjEC female_group Rickshaw Transportation to repair and maintain the auto rickshaw used in their business. Pakistan Underserved No 247 BRAC Pakistan
    2 150 2014Q1 a1050000002VkWz female Transportation Transportation To repair their old cycle-van and buy another one to rent out as a source of income India Underserved Yes 334 Belghoria Janakalyan Samity

    → 결합 후의 null값 파악

     kiva_themes_df.info()
    
     <class 'pandas.core.frame.DataFrame'>
     Int64Index: 671205 entries, 0 to 671204
     Data columns (total 11 columns):
     #   Column              Non-Null Count   Dtype  
     ---  ------              --------------   -----  
     0   funded_amount       671205 non-null  float64
     1   posted_quarter      671205 non-null  object 
     2   Loan Theme ID       657692 non-null  object 
     3   borrower_type       671205 non-null  object 
     4   activity            671205 non-null  object 
     5   sector              671205 non-null  object 
     6   country             671205 non-null  object 
     7   Loan Theme Type     629021 non-null  object 
     8   forkiva             629021 non-null  object 
     9   Partner ID          629021 non-null  float64
     10  Field Partner Name  629021 non-null  object 
     dtypes: float64(2), object(9)
     memory usage: 61.5+ MB
    
  3. kiva loan 중 kiva만을 위한 loan이 어느 정도인지 파악
    • forkiva=Yes면 kiva만을 위해 구축한 loan이라는 의미
     sns.countplot(data=kiva_themes_df, x='forkiva', palette='pink');
    

    forkiva_count

  4. kiva만을 위한 loan과 아닌 loan의 평균 funded amount 비교

     sns.barplot(data=kiva_themes_df, x='forkiva', y='funded_amount', palette='pink');
    

    forkiva_avg_amount

    • forkiva=No인 loan이 압도적으로 많지만, 평균 funded amount 자체는 별 차이가 없음
  5. kiva만을 위한 loan과 아닌 loan의 분기별 총 funded amount 추이

     temp = kiva_themes_df.groupby(['posted_quarter', 'forkiva'])[['funded_amount']].sum().reset_index()
    
     plt.figure(figsize=(16, 6))
     sns.lineplot(data = temp[temp['posted_quarter'] != '2017Q3'], 
                 x='posted_quarter', y='funded_amount', hue='forkiva', palette='pink');
    

    quarterly_forkiva_amount

    • kiva만을 위한 loan과 그렇지 않은 loan 모두 funded amount가 증가 추세에 있다

country, sector별 forkiva 비중

  1. kiva만을 위한 loan 비중이 많은 country 확인

     # country별 forkiva=yes인 비중: 총 funded amount를 기준으로 계산
     country_forkiva = pd.pivot_table(df, index='country', columns='forkiva', values='funded_amount', fill_value=0, aggfunc='sum')
     country_forkiva['forkiva_ratio(%)'] = (country_forkiva['Yes'] / (country_forkiva['No'] + country_forkiva['Yes'])) * 100
     country_forkiva.sort_values(by='forkiva_ratio(%)', ascending=False, inplace=True)
     country_forkiva.head(10)
    
    country No Yes forkiva_ratio(%)
    Belize 0 63150 100.00
    Chile 0 38525 100.00
    Georgia 105875 3263625 96.86
    Lao People’s Democratic Republic 59000 1089050 94.86
    Moldova 58825 628025 91.44
    Myanmar (Burma) 292925 2742925 90.35
    Egypt 428750 656175 60.48
    Mongolia 532425 676950 55.98
    Zimbabwe 1690075 1365500 44.69
    India 2932575 2312850 44.09

    → forkiva=yes인 비중이 0보다 큰 국가만 시각화

     plt.figure(figsize=(9, 9))
     sns.barplot(data=country_forkiva[country_forkiva['forkiva_ratio(%)'] > 0].reset_index(), 
                 x='forkiva_ratio(%)', y='country', palette='pink');
    

    forkiva_amount_per_country

    • Belize, Chile는 지원받은 금액의 100%가 kiva만을 위해 구축된 loan
    • Georgia, Lao People’s Democratic Republic, Moldova, Myanmar는 90% 이상의 금액이 kiva만을 위한 loan으로 지원받은 것
  2. kiva만을 위한 loan 비중이 많은 sector 확인

     sector_forkiva = pd.pivot_table(kiva_themes_df, index='sector', columns='forkiva', values='funded_amount', fill_value=0, aggfunc='sum')
     sector_forkiva['forkiva_ratio(%)'] = (sector_forkiva['Yes'] / (sector_forkiva['No'] + sector_forkiva['Yes'])) * 100
     sector_forkiva.sort_values(by='forkiva_ratio(%)', ascending=False, inplace=True)
    
     # Sector별 forkiva 비중을 시각화
     plt.figure(figsize=(9, 6))
     sns.barplot(data=sector_forkiva.reset_index(), x='forkiva_ratio(%)', y='sector', palette='pink');
    

    forkiva_amount_per_sector

    • ‘Personal Use’ sector는 funded amount의 약 43%가 kiva만을 위한 단독 loan으로 지원받은 것
  3. kiva만을 위한 loan 중, ‘Personal Use’ sector의 구체적인 use를 파악

     temp = kiva_themes_df[(kiva_themes_df['forkiva'] == 'Yes') & (kiva_themes_df['sector'] == 'Personal Use')]
     temp.groupby('use')[['funded_amount']].sum().sort_values(by='funded_amount', ascending=False).head()
    
    use funded_amount
    to buy a water filter to provide safe drinking water for their family. 805125
    to buy a water filter to provide safe drinking water for her family. 628150
    to purchase TerraClear water filters so they can have access to safe drinking water. 566900
    To buy a water filter to provide safe drinking water for their family. 253300
    to buy a water filter to provide safe drinking water for his family. 200375

    → use에 등장하는 단어를 wordcloud로 시각화

     # Text Cleaning & Case Conversion & Tokenization
     words = []
     for content in kiva_themes_df[(kiva_themes_df['forkiva'] == 'Yes') & (kiva_themes_df['sector'] == 'Personal Use')]['use']:
         content_cleaned = str(content).replace('.', '').replace(',','').strip().lower()  # ., 정도만 제거, strip, lower 
         words.extend(content_cleaned.split())  # 간단하게 공백 기준으로만 쪼개줌
    
     # POS Tagging
     tokens_pos = nltk.pos_tag(words)  
    
     # Noun, Verb만 추출 & Lemmatization
     wlem = nltk.WordNetLemmatizer()
     lemm_words = []
    
     for word, pos in tokens_pos:
         if ('NN' in pos) | ('VB' in pos):
             lemm_words.append(wlem.lemmatize(word))
    
     # Counter로 빈도 계산
     c = Counter(lemm_words)
    
     generate_wordcloud(c)  # 위에서 만들어둔 워드클라우드 만드는 함수
    

    forkiva_personal_use_wordcloud

    +) Top 10 단어 bar graph로 시각화

     visualize_top10_words(c)  # 위에서 만들어둔 top 10 단어 bargraph 만드는 함수
    

    forkiva_personal_use_bargraph

    • drinking water 관련 loan이 많다고 판단됨
    • 식수도 관련 타 loan / 사업을 끌어오면 kiva 100% loan을 구축하지 않고도 다른 방법으로 금전적 지원이 가능할 수 있다고 생각됨

Field Partner와의 협업 관계 확인

# funded amount순으로 Field Partner Name 확인
loans_partner = kiva_themes_df.groupby('Field Partner Name')[['funded_amount']].sum().reset_index()
loans_partner.sort_values(by='funded_amount', ascending=False, inplace=True)

plt.figure(figsize=(9, 6))
sns.barplot(data=loans_partner.head(10), x='funded_amount', y='Field Partner Name', palette='pink')
plt.xticks(rotation=60);

fied_partner_funded_amount

  • Negros Women for Tomorrow Foundation (NWTF)가 가장 많은 funded amount의 자금 융통을 담당하는 field partner

+) NWTF가 주로 담당하는 지역 확인:

temp = kiva_themes_df[kiva_themes_df['Field Partner Name'] == 'Negros Women for Tomorrow Foundation (NWTF)']
temp.groupby('country')[['funded_amount']].sum().sort_values(by='funded_amount', ascending=False).head(3)
country funded_amount
Philippines 33831500.0
Kenya 25.0
Bolivia 0.0
  • NWTF는 Philippines에서 주로 활동하는 field partner인 것으로 확인됨

국가별 funded amount와 MPI

  • 국가별 총 funded amount와 평균 MPI 지수를 비교
  1. region 기준으로 merge가 가능할지 탐색

     kiva_df_region_set = set(kiva_df['region'])
     location_df_region_set = set(location_df['region'])
    
     print('kiva_df: ', len(kiva_df_region_set))
     print('location_df: ', len(location_df_region_set))
     print('겹치는 region: ', len(kiva_df_region_set & location_df_region_set))
    
     kiva_df:  12696
     location_df:  838
     겹치는 region:  129
    
    • 겹치는 region이 너무 적어서 단순 merge는 불가…
  2. country 기준으로 총 funded amount, 평균 MPI 지수를 각각 groupby 후 merge

     kiva_df_groupby = kiva_df.groupby('country')[['funded_amount', 'loan_amount']].sum().reset_index()
     location_df_groupby = location_df.groupby(['country', 'world_region'])[['MPI']].mean().reset_index()
    
     groupby_merged = pd.merge(kiva_df_groupby, location_df_groupby, how='inner', on='country')
     print(len(groupby_merged))
     groupby_merged.head()
    
      country funded_amount loan_amount world_region MPI
    0 Afghanistan 14000 14000 South Asia 0.309853
    1 Belize 114025 114025 Latin America and Caribbean 0.0201429
    2 Benin 516825 518950 Sub-Saharan Africa 0.320333
    3 Bhutan 15625 20000 South Asia 0.123474
    4 Brazil 661025 662200 Latin America and Caribbean 0.0272593

    → 국가별 총 funded_amount와 평균 MPI 간의 관계 시각화

     plt.figure(figsize=(12, 8))
     sns.scatterplot(data=groupby_merged, x='funded_amount', y='MPI', 
                     hue='world_region', style='world_region', palette='Set2', s=100);
    

    country_amount_mpi_scatter

    • 평균 MPI가 높은 국가일수록 funded amount가 많을 거라고 예상했지만, 그런 관계는 발견되지 않음
    • 오히려 평균 MPI가 높은 (= poverty level이 높은) Sub-Shararan Africa의 국가들 중 상위 8개 국가는 funded amount가 많지 않은 편

Copyright © 2021 Chaeyun.
This site uses Just the Docs, a Jekyll theme.