Jonathan Kroening, Eric Liao, & Daniel Young
22 March 2014
Scraping the Flickr website via its API
for all photos from 2013 that have tags
for item in search_results.findall('photo'):
    photo_id = item.get('id')
    if photo_id is not None:
        # Request the full metadata record for this photo as JSON.
        photo = self.flickr_api.photos_getInfo(photo_id=photo_id,
                                               format='json')
        if 'jsonFlickrApi(' in photo:
            # Drop the 14-character 'jsonFlickrApi(' wrapper and the trailing ')'.
            photo = json.loads(photo[14:-1])
            if 'photo' in photo:
                p = Photo.fromSearchResult(photo['photo'])
                # Skip photos that have no tags.
                if p.tags == []:
                    continue
                photo = {'photo_id': p.id, 'tags': p.tags,
                         'geolocation': p.geolocation,
                         'date_taken': p.date_taken,
                         'date_posted': p.date_posted, 'views': p.views,
                         'locale': p.locale, 'county': p.county,
                         'region': p.region, 'url': p.url}
                self.writeOut(photo)
Figure 1: Downloading Metadata through API
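The unwrapping step in Figure 1 relies on a hard-coded slice. A minimal sketch of the same idea as a standalone helper follows; the function name `parse_flickr_jsonp` is hypothetical, and the sketch assumes the response arrives as a text string in the `jsonFlickrApi(...)` form that Figure 1 checks for.

```python
import json

# Wrapper prefix checked for in Figure 1.
JSONP_PREFIX = "jsonFlickrApi("


def parse_flickr_jsonp(raw):
    """Decode a jsonFlickrApi(...) response; return None if it is not wrapped."""
    text = raw.strip()
    if text.startswith(JSONP_PREFIX) and text.endswith(")"):
        # Slice by the prefix length instead of a hard-coded 14.
        return json.loads(text[len(JSONP_PREFIX):-1])
    return None


# Hypothetical response body, shortened for illustration.
payload = parse_flickr_jsonp('jsonFlickrApi({"photo": {"id": "12992167163"}})')
```

Returning None for an unwrapped response lets the caller skip it, much as the loop in Figure 1 ignores responses without the wrapper.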
{
'county': [u'District of Columbia', u'r7KcVlZQUL8J2lfjCQ'],
'geolocation': [38.924578, -77.04198],
'photo_id': u'12992167163',
'url': u'http://farm8.staticflickr.com/7452/12992167163_bc33d8234c_o.jpg',
'date_taken': u'2013-10-05 15:40:35',
'tags': [u'Washington', u'DC', u'Washington DC', u'guitar', u'DCist'],
'locale': [u'Washington', u'aKGrC25TV7vTJcir'],
'region': [u'District of Columbia', u'.9.rXhhTUb5eYUuK'],
'views': u'65',
'date_posted': u'1394210276'
}
Figure 2: Data Sample Being Collected into Dictionary
Save the streamed data to CSV files on a local drive for later analysis
def writeOut(self, photo):
    # `keys` (the CSV column names) is assumed to be defined elsewhere.
    with open('flickrdump.csv', 'ab') as f:
        dict_writer = csv.DictWriter(f, delimiter=',', fieldnames=keys)
        dict_writer.writerow(photo)
Figure 3: Storing Data in .csv File
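Figure 3 uses Python 2 conventions (binary append mode) and a `keys` list that is not shown. A possible Python 3 sketch of the same append-a-row idea is given below. The `FIELDNAMES` list is an assumption taken from the dictionary keys in Figure 2, the name `write_out` is hypothetical, and the header row is an addition that the original code does not write.

```python
import csv
import os

# Assumed column order, taken from the dictionary keys shown in Figure 2.
FIELDNAMES = ["photo_id", "tags", "geolocation", "date_taken", "date_posted",
              "views", "locale", "county", "region", "url"]


def write_out(photo, path="flickrdump.csv"):
    """Append one photo record to the CSV file, writing a header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # Python 3's csv module expects text mode with newline="".
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, delimiter=",", fieldnames=FIELDNAMES)
        if is_new:
            writer.writeheader()
        writer.writerow(photo)
```

List-valued fields such as `tags` are written as their string representation, so they need to be parsed again when the file is read back for analysis.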
Classifying, Understanding, and Anticipating Data
The Flickr API provided Location, Tags, Date Taken, Date Uploaded, and Views
Images are classified by month taken
What locations are popular each month?
What tags are popular each month?
What days are popular each month?
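The three questions above all reduce to counting values per month taken. One way this could be sketched is shown below, assuming records shaped like the sample dictionary; the helper `monthly_counts` and the two in-memory records are hypothetical and are not part of the collected data.

```python
from collections import Counter, defaultdict
from datetime import datetime


def monthly_counts(photos, values_of):
    """Map month number (1-12) to a Counter of the values extracted from each photo."""
    counts = defaultdict(Counter)
    for photo in photos:
        taken = datetime.strptime(photo["date_taken"], "%Y-%m-%d %H:%M:%S")
        counts[taken.month].update(values_of(photo, taken))
    return counts


# Hypothetical records; only the fields used below are included.
photos = [
    {"date_taken": "2013-10-05 15:40:35",
     "tags": ["Washington", "DC", "guitar"],
     "locale": ["Washington", "aKGrC25TV7vTJcir"]},
    {"date_taken": "2013-10-12 09:00:00",
     "tags": ["DC"],
     "locale": ["Washington", "aKGrC25TV7vTJcir"]},
]

tags = monthly_counts(photos, lambda p, t: p["tags"])            # popular tags
locales = monthly_counts(photos, lambda p, t: [p["locale"][0]])  # popular locations
days = monthly_counts(photos, lambda p, t: [t.day])              # popular days

top_october_tags = tags[10].most_common(3)
```

Passing the extraction step as a function keeps the month grouping in one place, so the same loop serves locations, tags, and days.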
{
'county': [u'District of Columbia', u'r7KcVlZQUL8J2lfjCQ'],
'geolocation': [38.924578, -77.04198],
'photo_id': u'12992167163',
'url': u'http://farm8.staticflickr.com/7452/12992167163_bc33d8234c_o.jpg',
'date_taken': u'2013-10-05 15:40:35',
'tags': [u'Washington', u'DC', u'Washington DC', u'guitar', u'DCist'],
'locale': [u'Washington', u'aKGrC25TV7vTJcir'],
'region': [u'District of Columbia', u'.9.rXhhTUb5eYUuK'],
'views': u'65',
'date_posted': u'1394210276'
}
Figure 4: Raw Data
Clustering photos by tags
Levenshtein Distance: Edit distance between two words
Jaro–Winkler Distance: Similarity between two words
Spelling correction using n-grams: poor results and false positives
Unobtainable due to the nature of the tags and the inability to cluster properly
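For reference, the two string measures named above can be sketched in plain Python as follows. These are textbook formulations, not necessarily the implementations used in this project; the prefix scaling factor of 0.1 and the four-character prefix cap are the conventional Jaro–Winkler defaults and are assumed here.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unused equal character within the match window.
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)
```

Levenshtein returns a count where lower means closer, while Jaro–Winkler returns a similarity where higher means closer, so the two need different thresholds when comparing tags.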