Beautiful Data

ng-Flickr

Jonathan Kroening, Eric Liao, & Daniel Young

22 March 2014

Collection

Collecting metadata through the Flickr API for all photos from 2013 that have tags

Collection Code


# Requires `import json` at module level.
for item in search_results.findall('photo'):
    photo_id = item.get('id')
    if photo_id is None:
        continue
    # Fetch the full metadata record for this photo as a JSONP string.
    photo = self.flickr_api.photos_getInfo(photo_id=photo_id,
                                           format='json')
    # Strip the 'jsonFlickrApi(' prefix (14 characters) and the trailing ')'.
    if 'jsonFlickrApi(' not in photo:
        continue
    photo = json.loads(photo[14:-1])
    if 'photo' not in photo:
        continue
    p = Photo.fromSearchResult(photo['photo'])
    # Skip photos that have no tags.
    if not p.tags:
        continue
    photo = {'photo_id': p.id, 'tags': p.tags,
             'geolocation': p.geolocation, 'date_taken': p.date_taken,
             'date_posted': p.date_posted, 'views': p.views,
             'locale': p.locale, 'county': p.county,
             'region': p.region, 'url': p.url}
    self.writeOut(photo)
            
Figure 1: Downloading Metadata through API
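
The collection code strips a `jsonFlickrApi(` wrapper from each response before decoding it. A minimal sketch of that step in isolation is shown below. The helper name `unwrap_jsonp` and the shortened sample response are hypothetical, and the sketch is written for Python 3 rather than the Python 2 used on the slides.

```python
import json

JSONP_PREFIX = "jsonFlickrApi("

def unwrap_jsonp(raw):
    """Decode a response, removing the 'jsonFlickrApi(...)' wrapper if present."""
    if raw.startswith(JSONP_PREFIX) and raw.endswith(")"):
        # Slice by the prefix length instead of a hard-coded 14.
        raw = raw[len(JSONP_PREFIX):-1]
    return json.loads(raw)

# Hypothetical, heavily shortened response used only for illustration.
sample = 'jsonFlickrApi({"photo": {"id": "12992167163"}, "stat": "ok"})'
payload = unwrap_jsonp(sample)
print(payload["photo"]["id"])
```

Checking both the prefix and the closing parenthesis means plain JSON passes through unchanged.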

The Raw Data


{
    'county': [u'District of Columbia', u'r7KcVlZQUL8J2lfjCQ'],
    'geolocation': [38.924578, -77.04198],
    'photo_id': u'12992167163',
    'url': u'http://farm8.staticflickr.com/7452/12992167163_bc33d8234c_o.jpg',
    'date_taken': u'2013-10-05 15:40:35',
    'tags': [u'Washington', u'DC', u'Washington DC', u'guitar', u'DCist'],
    'locale': [u'Washington', u'aKGrC25TV7vTJcir'],
    'region': [u'District of Columbia', u'.9.rXhhTUb5eYUuK'],
    'views': u'65',
    'date_posted': u'1394210276'
}
            
Figure 2: Data Sample Being Collected into Dictionary
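
In the sample record, `views` and `date_posted` arrive as strings. The sketch below converts them to typed values before analysis. It assumes `date_posted` is a Unix timestamp in seconds and that `date_taken` always follows the format seen in the sample. The helper name `normalize` is hypothetical.

```python
from datetime import datetime, timezone

def normalize(record):
    """Return a copy of a photo record with typed views and dates."""
    out = dict(record)
    out["views"] = int(record["views"])
    # Assumed format, taken from the sample record: 'YYYY-MM-DD HH:MM:SS'.
    out["date_taken"] = datetime.strptime(record["date_taken"],
                                          "%Y-%m-%d %H:%M:%S")
    # Assumption: date_posted is seconds since the Unix epoch (UTC).
    out["date_posted"] = datetime.fromtimestamp(int(record["date_posted"]),
                                                tz=timezone.utc)
    return out

record = {
    "photo_id": "12992167163",
    "date_taken": "2013-10-05 15:40:35",
    "views": "65",
    "date_posted": "1394210276",
}
clean = normalize(record)
print(clean["views"], clean["date_taken"].month, clean["date_posted"].year)
```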

Storage

Save the streamed data to CSV files on a local drive for later analysis

Storage Code


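def writeOut(self, photo):
    # Column order for the CSV file (requires `import csv` at module level).
    keys = ['photo_id', 'tags', 'geolocation', 'date_taken', 'date_posted',
            'views', 'locale', 'county', 'region', 'url']
    # Append one row per photo; 'ab' is the append mode used with Python 2's csv module.
    with open('flickrdump.csv', 'ab') as f:
        dict_writer = csv.DictWriter(f, delimiter=',', fieldnames=keys)
        dict_writer.writerow(photo)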
            

Figure 3: Storing Data in .csv File
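
Because `csv.DictWriter` writes list values such as `tags` as their string representation, reading the file back needs a parsing step. The sketch below shows one possible round trip in Python 3. It uses an in-memory buffer in place of `flickrdump.csv` and a reduced, hypothetical set of columns.

```python
import ast
import csv
import io

FIELDNAMES = ["photo_id", "tags", "views"]  # reduced column set for illustration

buffer = io.StringIO()  # stands in for the CSV file on disk
writer = csv.DictWriter(buffer, delimiter=",", fieldnames=FIELDNAMES)
writer.writerow({"photo_id": "12992167163",
                 "tags": ["Washington", "DC", "guitar"],
                 "views": "65"})

buffer.seek(0)
rows = []
for row in csv.DictReader(buffer, fieldnames=FIELDNAMES):
    # The list was stored as text, so parse it back into a Python list.
    row["tags"] = ast.literal_eval(row["tags"])
    row["views"] = int(row["views"])
    rows.append(row)

print(rows[0]["tags"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of stored text.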

Analysis

Classifying, Understanding, and Anticipating Data

Level One: Classification

The Flickr API provided location, tags, date taken, date uploaded, and views

Images are classified by month taken
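
Classification by month can be sketched as a simple grouping on the `date_taken` field. This is an illustration in Python 3 and not necessarily how the project implemented it. The function name `group_by_month` and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def group_by_month(photos):
    """Group photo records under a 'YYYY-MM' key derived from date_taken."""
    groups = defaultdict(list)
    for photo in photos:
        taken = datetime.strptime(photo["date_taken"], "%Y-%m-%d %H:%M:%S")
        groups[taken.strftime("%Y-%m")].append(photo)
    return dict(groups)

photos = [
    {"photo_id": "1", "date_taken": "2013-10-05 15:40:35"},
    {"photo_id": "2", "date_taken": "2013-10-19 09:12:00"},
    {"photo_id": "3", "date_taken": "2013-07-04 21:00:00"},
]
by_month = group_by_month(photos)
print(sorted(by_month))
```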

Level Two: Comprehension

What locations are popular each month?

What tags are popular each month?

What days are popular each month?
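
One way to answer the tag question is to count tags per month with `collections.Counter`. The sketch below is an assumption about the approach, not the project's code. It lowercases tags before counting, and the sample records are hypothetical.

```python
from collections import Counter, defaultdict

def top_tags_by_month(photos, n=3):
    """Return the n most common lowercase tags for each 'YYYY-MM' month."""
    counts = defaultdict(Counter)
    for photo in photos:
        month = photo["date_taken"][:7]  # 'YYYY-MM' prefix of the timestamp
        counts[month].update(tag.lower() for tag in photo["tags"])
    return {month: counter.most_common(n) for month, counter in counts.items()}

photos = [
    {"date_taken": "2013-10-05 15:40:35", "tags": ["Washington", "DC", "guitar"]},
    {"date_taken": "2013-10-19 09:12:00", "tags": ["dc", "autumn"]},
    {"date_taken": "2013-07-04 21:00:00", "tags": ["fireworks"]},
]
top = top_tags_by_month(photos, n=1)
print(top["2013-10"])
```

The same pattern works for locations or days by swapping the field that feeds the counter.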


{
    'county': [u'District of Columbia', u'r7KcVlZQUL8J2lfjCQ'],
    'geolocation': [38.924578, -77.04198],
    'photo_id': u'12992167163',
    'url': u'http://farm8.staticflickr.com/7452/12992167163_bc33d8234c_o.jpg',
    'date_taken': u'2013-10-05 15:40:35',
    'tags': [u'Washington', u'DC', u'Washington DC', u'guitar', u'DCist'],
    'locale': [u'Washington', u'aKGrC25TV7vTJcir'],
    'region': [u'District of Columbia', u'.9.rXhhTUb5eYUuK'],
    'views': u'65',
    'date_posted': u'1394210276'
}
            

Figure 4: Raw Data

Understanding

Clustering photos by tags

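Levenshtein Distance: the minimum number of single-character edits needed to turn one word into another

Jaro–Winkler Distance: a similarity score between two words that favors matching prefixes

Spelling correction using n-grams: produced poor results and false positives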
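
To make the tag clustering idea concrete, the sketch below implements Levenshtein distance with dynamic programming and uses it for a simple greedy grouping of tags. The grouping strategy, the `max_distance` threshold, and the function names are assumptions made for illustration, not the project's actual method.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def cluster_tags(tags, max_distance=2):
    """Greedily place each tag in the first cluster whose first tag is close."""
    clusters = []
    for tag in tags:
        for cluster in clusters:
            if levenshtein(tag.lower(), cluster[0].lower()) <= max_distance:
                cluster.append(tag)
                break
        else:
            clusters.append([tag])
    return clusters

print(levenshtein("kitten", "sitting"))
print(cluster_tags(["Washington", "washingtn", "guitar", "guitars", "DC"]))
```

A fixed threshold also merges unrelated short tags (for example, "cat" and "car" are one edit apart), which illustrates how false positives can arise with this kind of matching.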

Level Three: Prediction

Not achieved, because the free-form nature of the tags made it impossible to cluster them properly

Visualization

  • Basic Pie Chart
  • Tag Cloud
  • Time Series
  • Map

Live Project

https://lazy-flickr-map.appspot.com/

Architecture

Pie Chart

Figure 5: Pie chart of photo locations

Tag Cloud

Figure 6: Tag cloud

Time Series

Figure 7: Time series showing when pictures were taken

Mapping

Figure 8: Mapping photos onto a Google map