Animal Wildlife EStimator using SOcial MEdia (A. W. E. S. O. M. E)

AWESOME is a research project to build a wildlife population estimator from pictures shared on social media. Tracking wildlife populations with conventional methods incurs both a financial and an operational burden. AWESOME tries to solve this problem by turning to a promising and opportunistic form of citizen science: mining images of animals from social media. Estimating population from social media images is a non-trivial problem because of the biases inherent in social media data. We attempt to quantify and understand the biases that influence how humans share wildlife images. The first step towards building an estimator that accounts for such self-reporting bias is to build a classifier that can learn whether or not a picture of a certain wildlife species will be shared.

Author: Sreejith Menon
Email: smenon8@uic.edu
Home-page: http://compbio.cs.uic.edu/~sreejith/


  1. GetPropertiesAPI.py

  2. GenerateMTurkFileAPI.py

  3. CreateTurkFilesBulk.py

  4. BuildConsolidatedFeaturesFile.py

  5. JobsMapResultsFilesToContainerObjs.py

  6. DeriveFinalResultSet.py

  7. BaseCapsuleClass.py

  8. RegressionCapsuleClass.py

  9. ClassifierCapsuleClass.py

  10. ClassifierHelperAPI.py

  11. FeatureSelectionAPI.py

  12. MarkRecapHelper.py

  13. PopulationEstimatorAPI.py

  14. SocialMediaImageExtracts.py

  15. AppendMicrosoftAIData.ipynb

  16. VisualizeResults.ipynb

  17. ImageShareabilityClassifiers.ipynb

  18. InterAnnotatorAgreement.ipynb

  19. LocationBiasCheck.ipynb


Description: Contains methods to extract specific information from IBEIS database through REST-ful API calls.

  • getAnnotID(gid)
  • a. Argument: GID of a single image.
    b. Argument data-type: int/str
    c. PRE-condition:
        i. gid should be a valid image ID
    d. Returns: Corresponding Annotation IDs in the image.
    e. Return data-type: Python list object
    f. Comments: Used to extract annotation IDs for a particular image GID

  • getImageFeature(aidList, feature)
  • a. Arguments: List of Annotation IDs, Required Feature
    b. Argument data-type: Python list, str
    c. PRE-condition:
        i. aid should be a valid annotation ID
        ii. feature should be from the set: {"age/months", "yaw/text", "exemplar", "quality/text", "sex/text", "species/text", "name/rowid", "name/text", "image_contributor_tag"}
    d. Returns: Corresponding feature of the specified annotation ID
    e. Return data-type: Python list object
    f. Comments: Used to extract features for a given list of annotation IDs.

  • getExifData(gidList, exifFtr)
  • a. Arguments: List of GIDs, Required EXIF Feature
    b. Argument data-type: Python list, str
    c. PRE-condition:
        i. GID should be a valid image in IBEIS database
        ii. feature should be from the set: {unixtime, lat, long}
    d. Returns: Corresponding EXIF feature for the specified image
    e. Return data-type: Python list object
    f. Comments: Used to extract EXIF features for a given list of image GIDs.

  • getContributorGID(cid)
  • a. Arguments: Contributor ID
    b. Argument data-type: int/str
    c. PRE-condition:
        i. cid should be a valid contributor ID
    d. Returns: List of images clicked by the contributor
    e. Return data-type: Python list object
    f. Comments: Used to extract all images clicked by a particular photographer.

  • getAgeFeatureReadableFmt(ageList)
  • a. Arguments: Age as extracted from IBEIS API call.
    b. Arguments data-type: Python list.
    c. PRE-condition:
        i. Age should be in one of the following formats: [[-1,-1]] or [[None,2]] or [[3, 5]] or [[6, 11]] or [[12,23]] or [[24,35]] or [[36,None]]
    d. Returns: A human readable string of the age.
    e. Comments: Age in IBEIS is stored as a list (an estimated range in months). IBEIS identifies the age of the animal up to 3 years. The animals are classified as infants, one-year-old juveniles, two-year-old juveniles or fully grown adults.
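
The age ranges listed in the PRE-condition can be mapped to readable labels roughly as follows. This is a minimal sketch and not the project's code: the function name age_readable, the label strings and the exact grouping of month ranges into categories are assumptions made for illustration.

```python
def age_readable(age_range):
    """Map an IBEIS-style [min_months, max_months] pair to a label.

    The grouping below (under 12 months = infant, and so on) is an
    assumption based on the categories named in the text.
    """
    low, high = age_range
    if low == -1 and high == -1:
        return "unknown"                 # [-1, -1] means no age estimate
    if high is not None and high < 12:
        return "infant"                  # [None, 2], [3, 5], [6, 11]
    if high is not None and high < 24:
        return "one-year-old juvenile"   # [12, 23]
    if high is not None and high < 36:
        return "two-year-old juvenile"   # [24, 35]
    return "adult"                       # [36, None]
```

Since the documented input is a nested list such as [[3, 5]], a caller would pass the inner pair, for example age_readable(ageList[0]).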

  • getUnixTimeReadableFmt(unixtm)
  • a. Arguments: Unixtime as extracted from IBEIS API call.
    b. Arguments data-type: int
    c. PRE-condition:
        i. The unix time provided in the input should be a valid unix time in integer form.
    d. Returns: A human readable string of the unixtime in YYYY-MM-DD HH:mm:ss.
    e. Comments: The time each picture was taken is stored in unix time format. This is a shorthand method to convert the unix time into a human-readable date and time. The method makes the necessary adjustments for UTC times, in other words it adjusts for daylight saving changes.
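
A conversion of this kind can be sketched with the standard library. The function name unix_time_readable is hypothetical, and rendering the result in UTC is an assumption; the project's own method may render local time instead.

```python
from datetime import datetime, timezone


def unix_time_readable(unixtm):
    """Format a unix timestamp as YYYY-MM-DD HH:mm:ss (rendered in UTC here)."""
    moment = datetime.fromtimestamp(int(unixtm), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")
```

Using a timezone-aware datetime avoids depending on the time zone of the machine that runs the script.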


Description: Contains methods to generate a single Mechanical Turk photo album. There are multiple methods in the API, but generateMTurkFile() is the method to call to generate one image album job file.
Hardcoded parameter: excepFL – points to the CSV file that enumerates all identified human images. This is provided to avoid occurrences of human images in the Mechanical Turk photo album jobs.

  • genRandomImg(begin, end, oldList)
  • a. Arguments: Start range, End Range and list in the same range
    b. Argument data-type: int, int, Python list
    c. PRE-condition:
        i. All elements in the oldList should be in the range [begin, end]
        ii. begin, end should be valid image GIDs.
    d. Returns: A randomly selected image GID in the range [begin, end] and not in oldList.
    e. Return data-type: int
    f. Comments: This method is used when a particular image in the image album has to be replaced with another randomly selected image.
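
The replacement pick described above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the name gen_random_img and the ValueError raised when no candidate is left are hypothetical.

```python
import random


def gen_random_img(begin, end, old_list):
    """Return a random GID in [begin, end] that is not already in old_list."""
    used = set(old_list)
    candidates = [gid for gid in range(begin, end + 1) if gid not in used]
    if not candidates:
        # Assumption: fail loudly instead of looping forever.
        raise ValueError("no unused GID left in the given range")
    return random.choice(candidates)
```

Building the candidate list first guarantees termination even when most of the range is already used.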

  • genImageList(begin, stop, maxImgs=20)
  • a. Arguments: Start range, end range and number of images in photo album
    b. Argument data-type: int, int, int
    c. PRE-condition:
        i. begin and stop should be both valid image GID’s.
        ii. begin < stop
        iii. stop – begin > maxImgs
    d. Returns: A python list with randomly selected image GID’s that are in the range [begin, stop] and len(list) = maxImgs
    e. Comments: This method is used to generate a list of random images that will be further used as an input to the method that generates the photo album mechanical turk job. The list has no duplicates and has exactly maxImgs number of images.
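
A duplicate-free list of exactly maxImgs GIDs can be drawn with random.sample. This is a sketch only; the name gen_image_list is hypothetical and the filtering of human images mentioned earlier is left out.

```python
import random


def gen_image_list(begin, stop, max_imgs=20):
    """Pick max_imgs distinct GIDs from the inclusive range [begin, stop]."""
    # random.sample draws without replacement, so the list has no duplicates.
    return random.sample(range(begin, stop + 1), max_imgs)
```

random.sample raises ValueError when the range holds fewer than max_imgs values, which mirrors the PRE-condition stop – begin > maxImgs.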

  • generateMTurkFile(startGID, endGID, outFile, maxImgs=20, prodFileWrite=False)
  • a. Arguments: Start GID, end GID, name of output file, number of images in photo album, a boolean to generate an m-turk job for the production environment.
    b. Argument data-type: int, int, str, int, boolean
    c. PRE-condition:
        i. startGID and endGID should be both valid image GID’s.
        ii. startGID < endGID
        iii. endGID - startGID > maxImgs
    d. Returns: A python list with randomly selected image GID’s that are in the photo album. The GID’s are in the range [startGID, endGID] and len(list) = maxImgs
    e. Comments: Most important method of this script. This generates a .question file that is fed to createHIT command. If prodFileWrite parameter is set to True, then a job with the same image GID’s is generated with the word ‘prod’ appended to the outFile name. prodFileWrite is defaulted to False, so it will only generate a file for mechanical turk sandbox if this parameter is unspecified.

Description: Contains a single method that generates mechanical turk jobs for generating photo-albums in bulk. This script contains a __main__() method that accepts command line arguments and can be directly executed through terminal.

To run the script, provide 4 parameters in the order fileName, jobMapName, numberOfFiles, numberOfImgs.
python CreateTurkFilesBulk.py /tmp/sample /tmp/sample.CSV 2 10
Successfully written : /tmp/sample1
Successfully written : /tmp/sample2

  • createTurkFilesBulk(flNm, jobMapName, noOfJobs, noOfImgsPerJob = 20)
  • a. Arguments: File Name prefix of the job files, File name of the map file of albums and the images in those albums, number of jobs needed, number of images needed per job.
    b. Argument data-type: str, str, int, int
    c. PRE-condition:
        i. Entire path is provided for both flNm and jobMapName.
        ii. Path exists in the host system.
    d. Returns: None
    e. Comments: Selects noOfJobs random contributors (there might be duplicates in the selected contributor list). Contributors who did not click any picture are removed by hard-coded logic. For each job, noOfImgsPerJob images are selected from the range of images a particular contributor has clicked. This ensures that all the images in a given album were clicked by the same contributor. The script assumes the value of prodFileWrite in line 63 as True. It generates 1 input, 1 question file per noOfJobs and 1 map file that contains a map between albums and the images in them.


Description: Contains multiple methods to calculate final result statistics. The methods in this script rely heavily on objects created by JobsMapResultsFilesToContainerObjs.py. Three global variables are declared in the scope of this project; they currently point to CSV files that have information from the second phase of the Mechanical Turk deployment.

gidAidMapFl, aidFeatureMapFl, imgJobMap
These parameters are used by multiple methods within this script.

  • getPosShrProptn(imgJobMap, resStart, resEnd)
  • a. Arguments: Image Job Map in CSV format, start of the results file, end of the results files.
    b. Argument data-type: str, int, int
    c. PRE-condition:
        i. Entire path is provided for imgJobMap and the file should exist.
        ii. All files starting from photo_album_<resStart>.results until photo_album_<resEnd>.results should exist under results folder.
    d. Returns: A Python dictionary object that has the position of images in the album and the number of shares, the number of not_shares, the total and the share proportion as a nested dictionary.
    e. Comments: The number of shares/not shares for every position is enumerated inside a list. Use of OrderedDict() from the Python collections module ensures that the records are picked in the exact order in which they appear in the albums. The returned dictionary can then be embedded inside a data-frame and visualized.

  • genTotCnts(ovrCnts)
  • a. Arguments: Overall counts dictionary, an enumeration of share/no share data per image.
    b. Arguments data-type: Python dictionary object.
    c. PRE-condition:
        i. The values in the argument dictionary should be a Python list object.
        ii. All the values in the list are integers.
    d. Returns: A python dictionary which has the same keys as the argument data-type, but the values are simply the sum total of all values in the list corresponding to the argument dictionary key.
    e. Comments: This method simply adds up the share/no share counts of a particular image. For example, for a particular image which appeared in 5 different albums, the share counts were 9,5,9,10,10. This method will return gid:sum([9,5,9,10,10]).
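
The summation described above amounts to a dictionary comprehension. This is a sketch; the name gen_tot_cnts and the example key 101 are illustrative.

```python
def gen_tot_cnts(ovr_cnts):
    """Collapse each list of per-album counts into its sum, keeping the keys."""
    return {key: sum(counts) for key, counts in ovr_cnts.items()}


totals = gen_tot_cnts({101: [9, 5, 9, 10, 10]})
print(totals)
```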

  • getShrProp(ovrAggCnts)
  • a. Arguments: Overall Aggregate counts, typically outputted by genTotCnts method. A python dictionary with key and sum total of shares, no_shares and totals.
    b. Arguments data-type: Python dictionary object.
    c. PRE-condition:
        i. Each value in the argument dictionary should be a single integer.
        ii. The key should be an iterable, preferably a Python tuple object.
        iii. Keys should contain the strings: ‘share’, ‘total’ or ‘not_share’
    d. Returns: A python dictionary which is simply the share proportion of that particular key.
    e. Comments: This method will be used to calculate the share proportion of a particular feature or an image. This dictionary answers the question, what percentage of this feature/image were shared. The share proportion is calculated as total_shares/(total_shares+total_not_shares).
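
The formula total_shares/(total_shares+total_not_shares) can be sketched as below. The key layout (item, 'share') / (item, 'not_share') / (item, 'total') is an assumption derived from the PRE-conditions, and returning 0.0 for items without responses is a choice made for this example only.

```python
def share_proportion(agg_counts):
    """Return {item: shares / (shares + not_shares)} from tuple-keyed counts."""
    items = {key[0] for key in agg_counts}
    proportions = {}
    for item in items:
        shares = agg_counts.get((item, "share"), 0)
        not_shares = agg_counts.get((item, "not_share"), 0)
        denominator = shares + not_shares
        # Assumption: report 0.0 when an item has no responses at all.
        proportions[item] = shares / denominator if denominator else 0.0
    return proportions
```

For instance, 30 shares and 10 not_shares for 'zebra' give a share proportion of 0.75.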

  • getCountingLogic(gidAidMapFl, aidFeatureMapFl, feature, withNumInds=True)
  • a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), feature, number of individuals in the output required (defaulted to True)
    b. Arguments data-type: str, str, str, boolean
    c. PRE-condition:
        i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
        ii. Should be from an acceptable set of feature strings.
    d. Returns: A Python dictionary object.
    e. Comments: This method defines the counting logic for a particular image for a given feature. The result dictionary also includes the number of individuals if the parameter withNumInds is set to True. For instance, for the feature ‘SPECIES’, there might be images that contain both a zebra and a giraffe. In that case, the share counts have to be added to both zebra and giraffe.
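
The zebra-and-giraffe case can be illustrated with a small sketch. The input shapes ({GID: [feature values]} and {GID: share count}) and the function name are assumptions; counting each feature value once per image is also an assumption about the intended logic.

```python
from collections import defaultdict


def shares_by_feature(image_features, image_shares):
    """Credit an image's share count to every feature value it contains."""
    totals = defaultdict(int)
    for gid, values in image_features.items():
        # set() so that two zebras in one image credit 'zebra' only once.
        for value in set(values):
            totals[value] += image_shares.get(gid, 0)
    return dict(totals)
```

With image 1 containing a zebra and a giraffe and shared 4 times, both species receive 4 shares from that image.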

  • genAlbmFtrs(gidAidMapFl, aidFeatureMapFl, imgJobMap, reqdFtrList)
  • a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), Image GID-Album map file(CSV), required features list – a list of features for which the properties need to be determined.
    b. Arguments data-type: str, str, str, Python List
    c. PRE-condition:
        i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
        ii. Entire path is provided for imgJobMap and the file should exist in CSV format.
        iii. The feature list should contain elements from the acceptable set of features.
    d. Returns: A Python dictionary object.
    e. Comments: This method is used to generate the number of shares and not_shares per feature in a particular album. For instance, if the required features list contains ‘SPECIES’, then it tells the share and not_share count per album for each available/identifiable species.

  • getShrPropImgsAcrossAlbms(imgJobMap, resSetStrt, resSetEnd, flNm)
  • a. Arguments: Image Job Map in CSV format, start of the results file, end of the results files, output file name in json format.
    b. Arguments data-type: str, int, int, str
    c. PRE-condition:
        i. Entire path is provided for imgJobMap and the file should exist.
        ii. All files starting from photo_album_<resStart>.results until photo_album_<resEnd>.results should exist under results folder.
    d. Returns: Two Python Dictionary objects.
    e. Comments: This method derives the share proportions of images across different albums as opposed to generating an overall proportion of images. For example, the output of this method will contain how many times a particular image was shared in a particular album. This is particularly insightful for images that appear across multiple albums. It can tell whether a certain image was shared in the same way or differently across albums, and hence whether the share rate of an image is orthogonal to the context of the album.

  • getFltrCondn(ftr)
  • a. Arguments: Feature name
    b. Arguments data-type: str
    c. PRE-condition:
        i. Provided argument should be a valid feature.
    d. Returns: A lambda function
    e. Comments: This method returns a lambda function that filters out the UNIDENTIFIED entries and keeps all valid rows. The filter logic needs a slight adjustment depending on which feature is being filtered.

  • genObjsForConsistency(gidAidMapFl, aidFeatureMapFl, ftr, imgJobMap, resSetStrt, resSetEnd)
  • a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), Feature name, Image GID-Album map file(CSV), start of the results file, end of the results files.
    b. Arguments data-type: str, str, str, str, int, int
    c. PRE-condition:
        i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
        ii. Entire path is provided for imgJobMap and the file should exist in CSV format.
        iii. The feature list should contain elements from the acceptable set of features.
    d. Returns: Python List Object, Two Python Dictionary objects
    e. Comments: This is a helper method for getConsistencyDict(). It returns three objects.
        i. A list with only valid features i.e. without the UNIDENTIFIED entries.
        ii. A dictionary object that contains the albums in which a particular feature is seen. Ex. GIRAFFE is seen in albums 1, 3,4,5 etc.
        iii. A dictionary object that contains the share proportion of feature per album. Ex. giraffes in album 1 were shared 70% of times etc. Format of the key: (feature, album)

  • getConsistencyDict(filteredKeyArr, ftrAlbmDict, ftrAlbmShrPropDict, flNm)
  • a. Arguments: The filtered key array, Feature Album dictionary and feature album share proportion dictionary are direct outputs of genObjsForConsistency()
    b. Arguments data-type: Python List Object, Python Dictionary Object, Python Dictionary object, str
    c. PRE-condition: Same conditions as genObjsForConsistency()
    d. Returns: A Python Dictionary object.
    e. Comments: This method should be thought of as the main counting logic for calculating share-rate statistics for any feature that is extracted from IBEIS. The output dictionary contains all the valid features and their share rates across different albums. The output dictionary is a key - dictionary pair. Ex. {giraffe : {album_1 : 70, album_2 : 89}, ...}

  • genVarStddevShrPropAcrsAlbms(consistency)
  • a. Arguments: Consistency dictionary, the direct output of getConsistencyDict()
    b. Arguments data-type: Python Dictionary object
    c. PRE-condition: Same conditions as genObjsForConsistency()
    d. Returns: A Python Dictionary object.
    e. Comments: As the method name suggests, this method generates the mean, variance and standard deviation of the share proportion of a particular feature element across different albums.
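
Given a consistency dictionary of the form {giraffe : {album_1 : 70, album_2 : 89}}, these statistics can be sketched with the statistics module. The function name and the output keys are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption.

```python
from statistics import mean, pstdev, pvariance


def share_stats_across_albums(consistency):
    """Return mean, variance and standard deviation per feature element."""
    stats = {}
    for element, per_album in consistency.items():
        rates = list(per_album.values())
        stats[element] = {
            "mean": mean(rates),
            "variance": pvariance(rates),  # population variance (assumption)
            "stddev": pstdev(rates),
        }
    return stats
```

A large standard deviation for an element suggests that its share rate depends on the album it appears in.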

  • ovrallShrCntsByFtr(gidAidMapFl,aidFeatureMapFl,feature,imgJobMap,resSetStrt,resSetEnd)
  • a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), Feature name, Image GID-Album map file(CSV), start of the results file, end of the results files.
    b. Arguments data-type: str, str, str, str, int, int
    c. PRE-condition:
        i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
        ii. Entire path is provided for imgJobMap and the file should exist in CSV format.
        iii. The feature list should contain elements from the acceptable set of features.
    d. Returns: A Python Dictionary object.
    e. Comments: This method generates the overall share statistic for every element of a particular feature. The logic involves simple counting for each element of the feature based on the counting logic (refer getCountingLogic()). This explicitly handles the images where there are multiple features and the counts are made in each of the feature. This method is ideal for getting share rate of a particular feature across the entire experiment, not limited by album. In other words, the result dict has share proportions calculated across albums.

  • shrCntsByFtrPrAlbm(gidAidMapFl, aidFeatureMapFl, feature, imgJobMap, resSetStrt, resSetEnd)
  • a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), Feature name, Image GID-Album map file(CSV), start of the results file, end of the results files.
    b. Arguments data-type: str, str, str, str, int, int
    c. PRE-condition:
        i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
        ii. Entire path is provided for imgJobMap and the file should exist in CSV format.
        iii. The feature list should contain elements from the acceptable set of features.
    d. Returns: A Python Dictionary object.
    e. Comments: This method generates the share statistic for every element of a particular feature for each album that contains some instances of that feature. The logic involves simple counting for each element of the feature based on the counting logic (refer getCountingLogic()). Images that contain multiple feature elements are handled explicitly: the counts are added to each of those elements. This method is ideal for getting the share rate of a particular feature across the entire experiment by album. In other words, it should be used to calculate the share rates of a particular feature and compare how they change across different albums. As a side note, if the same feature is shared differently across albums, it means that some contextual information from the album is dominating this effect.

  • ovrallShrCntsByTwoFtrs(gidAidMapFl, aidFeatureMapFl, ftr1, ftr2, imgJobMap, resSetStrt, resSetEnd)
  • a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), Feature name 1, Feature name 2, Image GID-Album map file(CSV), start of the results file, end of the results files.
    b. Arguments data-type: str, str, str, str, str, int, int
    c. PRE-condition:
        i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
        ii. Entire path is provided for imgJobMap and the file should exist in CSV format.
        iii. The feature list should contain elements from the acceptable set of features.
    d. Returns: A Python Dictionary object.
    e. Comments: This method is used to generate cross statistics for two features. For example, it answers questions such as: How many giraffes that were looking left were shared? Or: How many infants that were males were shared?
    The logic involves simple counting for each element of the feature based on the counting logic (refer getCountingLogic()). Since there are two features to be dealt with here, the instances are divided into even and uneven features. Even features are those where the number of instances of feature 1 and feature 2 are identical. Uneven features are those where the number of instances of feature 1 and feature 2 are not identical. Uneven features are handled differently for 1-many, many-1 and many-many.

  • genNumIndsRankList()
  • a. Arguments:
    b. Arguments data-type:
    c. PRE-condition:
    d. Returns: A Python Dictionary object.
    e. Comments: As the method name suggests, this method generates the rank list of number of individuals in an image by share proportion. An active issue to parametrize this method for future use is being tracked at GitHub Issues.


Description: Contains methods that generate Python objects from different map files. This script provides an interface for building the data structures required to compute various statistics. There are multiple methods that output CSV or JSON or both. Use methods in this script to generate easy-to-use Python objects. There are methods that parse the .results file that is extracted from Amazon Mechanical Turk jobs.
Note that the method names are very similar and take the same arguments in most cases.

  • genUniqueImageListFromMap(mapFlName)
  • a. Arguments: Image Job Map in CSV format
    b. Argument data-type: str
    c. PRE-condition:
        i. Entire path is provided for imgJobMap and the file should exist.
    d. Returns: A Python List Object
    e. Comments: This method is used to generate a list of all images that are used in the current experiment as specified by the map file.

  • genAlbumGIDDictFromMap(mapFlName)
  • a. Arguments: Image Job Map in CSV format
    b. Argument data-type: str
    c. PRE-condition:
        i. Entire path is provided for imgJobMap and the file should exist.
    d. Returns: A Python Dictionary Object
    e. Comments: This method is used to generate a dictionary in the form { Album : [GIDS] }. This object will give us the capability to query which images exist in a particular album.

  • genImgAlbumDictFromMap(mapFlName)
  • a. Arguments: Image Job Map in CSV format
    b. Argument data-type: str
    c. PRE-condition:
        i. Entire path is provided for imgJobMap and the file should exist.
    d. Returns: A Python Dictionary Object
    e. Comments: This method is used to generate a dictionary in the form { GID: [Albums] }. This object will give us the capability to query which albums contain a particular GID.

  • getImageFreqFromMap(inFL)
  • a. Arguments: Image Job Map in CSV format
    b. Argument data-type: str
    c. PRE-condition:
        i. Entire path is provided for imgJobMap and the file should exist.
    d. Returns: A Python Dictionary Object
    e. Comments: This method is used to generate a dictionary in the form { GID : No. of albums it appears }.
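
The three map-derived objects described above ({ Album : [GIDS] }, { GID : [Albums] } and { GID : No. of albums }) can be sketched as follows. The layout of the map file is not documented here, so the sketch assumes one CSV row per album, with the album name first and the GIDs after it; the function names are hypothetical.

```python
import csv
from collections import defaultdict


def album_gid_dict(lines):
    """Build {album: [gids]} from CSV rows shaped like 'album,gid1,gid2,...'."""
    albums = {}
    for row in csv.reader(lines):
        if row:
            albums[row[0]] = [int(gid) for gid in row[1:] if gid]
    return albums


def gid_album_dict(albums):
    """Invert {album: [gids]} into {gid: [albums]}."""
    inverse = defaultdict(list)
    for album, gids in albums.items():
        for gid in gids:
            inverse[gid].append(album)
    return dict(inverse)


def image_freq(albums):
    """Count the number of albums in which each GID appears."""
    return {gid: len(names) for gid, names in gid_album_dict(albums).items()}
```

csv.reader accepts any iterable of lines, so the same functions work with an open file object or with a list of strings.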

  • genAidGidDictFromMap(mapFL)
  • a. Arguments: GID-AID map file(JSON)
    b. Argument data-type: str
    c. PRE-condition:
        i. Entire path is provided for mapFL and the file should exist.
    d. Returns: A Python Dictionary Object
    e. Comments: This method is used to generate a dictionary in the form { AID : GID }.

  • genGidAidDictFromMap(mapFL)
  • a. Arguments: GID-AID map file(JSON)
    b. Argument data-type: str
    c. PRE-condition:
        i. Entire path is provided for mapFL and the file should exist.
    d. Returns: A Python Dictionary Object
    e. Comments: This method is used to generate a dictionary in the form { GID : List of AIDs in that image }.

  • genAidGidTupListFromMap(mapFL)
  • a. Arguments: GID-AID map file(JSON)
    b. Argument data-type: str
    c. PRE-condition:
        i. Entire path is provided for mapFL and the file should exist.
    d. Returns: A Python List Object (A list of tuples)
    e. Comments: This method is used to generate a list of tuples in the form ( AID , GID ).

  • genAidFeatureDictList(mapFL)
  • a. Arguments: AID-Feature map file(JSON)
    b. Argument data-type: str
    c. PRE-condition:
        i. Entire path is provided for mapFL and the file should exist.
    d. Returns: A Python List Object (A list of dictionaries)
    e. Comments: This method is used to generate a list of dictionaries in the form [{'AID': xx,'NID' : xx ,.. }]. This object will give us the capability to iterate through all annotations and their respective features.

  • genAidFeatureDictDict(mapFL)
  • a. Arguments: AID-Feature map file(JSON)
    b. Argument data-type: str
    c. PRE-condition:
        i. Entire path is provided for mapFL and the file should exist.
    d. Returns: A Python Dictionary
    e. Comments: This method is used to generate a dictionary in the form { AID : {'NID' : xxx , 'SPECIES' : xxx, .. }}. This object will give us the capability to query one/multiple features given an annotation ID.

  • extractImageFeaturesFromMap(gidAidMapFl,aidFtrMapFl,feature)
  • a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), feature
    b. Argument data-type: str, str, str
    c. PRE-condition:
        i. Entire path is provided for gidAidMapFl and aidFtrMapFl and the files should exist.
        ii. The feature should be from the acceptable set of features.
    d. Returns: A Python Dictionary object.
    e. Comments: This method is used to generate a dictionary in the form { GID : [list of feature instances in that image]}. This object will give us the capability to check what feature instances are present in a given image.

  • createResultDict(jobRangeStart, jobRangeEnd, workerData)
  • a. Arguments: Start of the results file, end of the results files, a flag to include worker data.
    b. Argument data-type: int, int, boolean
    c. PRE-condition:
    d. Returns: A Python Dictionary object.
    e. Comments: Every album corresponds to a .results file which is extracted from the Amazon Mechanical Turk interface. This method parses the results file and generates a Python object consisting of each response key with the actual response from the users. The dictionary is of the form: { photo_album_i : { Answer.GID : [ GID|'share' , 'GID'|'noShare'] }} All the results files from jobRangeStart to jobRangeEnd will be parsed and included in the output object. When the workerData parameter is set to True, the IDs of the turkers who worked on a particular album are included in the result object.

  • imgShareCountsPerAlbum(imgAlbumDict, results)
  • a. Arguments: Image Job Map in CSV format, results is the direct output of the method createResultDict()
    b. Argument data-type: str, Python Dictionary object
    c. PRE-condition:
        i. Entire path is provided for imgJobMap and the file should exist.
    d. Returns: Two Python List objects.
    e. Comments: This method returns a Python list which gives us the capability to iterate through all the images, the number of times an image was shared or not shared in a particular album. This object will form the basis of all statistic computations in the project. The format of a tuple inside the list is of the form (GID, Album, Share count, Not Share count, Proportion). The other return object is the list of all (GID, albums) for which there was no valid response. (Form fields in certain albums in experiment 2 were not mandatory in the beginning, the bug was identified and corrected in a later stage.)

  • genMSAIDataHighConfidenceTags(tagInpDataFl, threshold)
  • a. Arguments: Tag input file in JSON format, confidence threshold (between 0 and 1)
    b. Argument data-type: str, float
    c. PRE-condition:
        i. Entire path is provided for tagInpDataFl and the file should exist.
    d. Returns: Python Dictionary object.
    e. Comments: This method is used to generate the list of tags produced by the Microsoft image tagging API, thresholded by confidence. Each tag has an associated confidence that quantifies how confident the tag-predicting algorithm is in it. For the purpose of the experiments, the threshold is defaulted to 0.5: any tag predicted with confidence greater than 0.5 is accepted and the rest are rejected.
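
The thresholding step can be sketched on an in-memory dictionary. The structure of the tag input file is not documented here, so the { GID : { tag : confidence } } shape and the function name are assumptions.

```python
def high_confidence_tags(tag_data, threshold=0.5):
    """Keep, for every image, only the tags whose confidence exceeds threshold."""
    return {
        gid: [tag for tag, confidence in tags.items() if confidence > threshold]
        for gid, tags in tag_data.items()
    }
```

The comparison is strict, matching the description that only tags with confidence greater than the threshold are accepted.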

  • auditResMap(imgAlbumDict, resultList)
  • a. Arguments: Image Job Map in CSV format, results is the direct output of the method imgShareCountsPerAlbum()
    b. Argument data-type: str, Python list object
    c. PRE-condition:
        i. Entire path is provided for imgJobMap and the file should exist.
    d. Returns: 3 boolean variables.
    e. Comments: This method is an audit method that ensures there are no leaks or incorrect data in the result and feature objects. The 3 boolean variables indicate 3 different types of errors.
        i. If error1 is returned as True it means there are images in the result objects that were not originally a part of the map file
        ii. If error2 is returned as True it means there are images in the map file but not in results file.
        iii. If error3 is returned as True it means there are images in the results file that are not in the map file.

Description: Contains methods to run synthetic population estimation experiments.

  • trainTestClf(train_data_fl, test_data_fl, clf, attribType, infoGainFl, methArgs)
  • a. Argument: Training data in CSV format, Testing data in CSV format, Classifier method name, attribute selection method name, Information gain file in CSV format, keyword arguments for classifiers.
    b. Argument data-type: str, str, str, str, str, dict
    c. PRE-condition:
        i. Files should exist
    d. Returns: Classifier object, shared(1) not-shared(0) predictions
    e. Return data-type: ClassifierCapsuleClass object, Python dictionary object
    f. Comments: This method is a wrapper method to build and run classifiers using the training and the testing data respectively. There are no checks to determine if there is any overlap between training and testing data files.

  • biasedCoinFlipper(p)
  • a. Arguments: Probability
    b. Argument data-type: float
    c. PRE-condition:
        i. p should be between 0 and 1
    d. Returns: 0/1
    e. Return data-type: int
    f. Comments: As the name suggests, this is a biased coin flipper. It simulates flipping a biased coin with a bias of p. Heads means 1 and tails means 0.
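
A biased coin of this kind can be sketched in a few lines. The name biased_coin_flip and the range check are illustrative and not taken from the project.

```python
import random


def biased_coin_flip(p):
    """Return 1 (heads) with probability p and 0 (tails) otherwise."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    # random.random() is uniform on [0, 1), so the test succeeds with probability p.
    return 1 if random.random() < p else 0
```

With p = 0 the result is always 0 and with p = 1 it is always 1, which gives a simple sanity check.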

  • trainTestRgrs(train_data_fl, test_data_fl, method name , attribType, infoGainFl, methArgs)
  • a. Argument: Training data in CSV format, Testing data in CSV format, Regression method name, attribute selection method name, Information gain file in CSV format, keyword arguments for classifiers.
    b. Argument data-type: str, str, str, str, str, dict
    c. PRE-condition:
        i. Files should exist
    d. Returns: Regression object, Likelihoods of an image being shared.
    e. Return data-type: RegressionCapsuleClass object, Python dictionary object
    f. Comments: This method is a wrapper method to build and run regressors using the training and the testing data respectively. There are no checks to determine if there is any overlap between training and testing data files.

  • estimatePopulation(prediction_results, inExifFl, inGidAidMapFl, inAidFtrFl)
  • a. Arguments: Prediction results, EXIF information of images in JSON, GID-AID map file(JSON), AID-Feature map file(JSON)
    b. Argument data-type: Python dict, str, str, str
    c. PRE-condition:
        i. Files should exist.
    d. Returns: Population of zebras, giraffes and all animals
    e. Return data-type: Python dict object
    f. Comments: This method is used to actually estimate the population. This method utilizes methods from MarkRecapHelper to calculate the population of all animals, zebras and giraffes on the basis of the prediction results that are generated by the classifiers or regressor predictions.
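
The text above does not say which estimator MarkRecapHelper implements. As an illustration only, the sketch below uses the textbook Lincoln-Petersen mark-recapture estimate, N ≈ n1 * n2 / m, where n1 and n2 are the numbers of individuals seen on two occasions and m is the number seen on both. The function name and the inputs (collections of individual IDs) are hypothetical.

```python
def lincoln_petersen(first_occasion, second_occasion):
    """Estimate population size from individuals seen on two occasions."""
    marked = set(first_occasion)
    caught = set(second_occasion)
    recaptured = marked & caught
    if not recaptured:
        # Assumption: no estimate is possible without any recapture.
        return None
    return len(marked) * len(caught) / len(recaptured)
```

For example, 50 individuals on the first occasion, 40 on the second and 10 seen on both give an estimate of 200 animals.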

  • kSharesPerContribAfterCoinFlip(prediction_results, inExifFl, inGidAidMapFl, inAidFtrFl,genk)
  • a. Arguments: Prediction results, EXIF information of images in JSON, GID-AID map file(JSON), AID-Feature map file(JSON), function to generate the value of k
    b. Argument data-type: Python dict, str, str, str, Python function object
    c. PRE-condition:
        i. Files should exist.
    d. Returns: Population of zebras, giraffes and all animals when each contributor shares k images
    e. Return data-type: Python dict object
    f. Comments: This method is used to estimate the population when each contributor shares their top k images. The images are ranked on the basis of the likelihood estimates obtained from the regression models. Since the regression outputs are likelihoods, the choice of whether or not each photo is shared is made by flipping a coin whose bias equals the likelihood of the picture being shared.
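
The selection step described above can be sketched as follows. The order of operations (rank by likelihood, flip one coin per image, keep the first k survivors), the input shapes and the fixed k are assumptions; the actual method takes a function genk that generates k.

```python
import random


def k_shares_after_coin_flip(likelihoods, contributor_images, k):
    """Pick up to k images per contributor after a likelihood-biased coin flip.

    likelihoods: {gid: likelihood of being shared}
    contributor_images: {contributor: [gids]}
    """
    shared = {}
    for contributor, gids in contributor_images.items():
        # Highest likelihood first; ties keep their original order.
        ranked = sorted(gids, key=lambda gid: likelihoods[gid], reverse=True)
        # One coin flip per image, biased by that image's likelihood.
        survivors = [gid for gid in ranked if random.random() < likelihoods[gid]]
        shared[contributor] = survivors[:k]
    return shared
```

The resulting set of shared images would then feed the population estimate in place of the raw classifier predictions.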

  • kSharesPerContributor(prediction_probabs, inExifFl, inGidAidMapFl, inAidFtrFl, genk)
  • a. Arguments: Prediction results, EXIF information of images in JSON, GID-AID map file(JSON), AID-Feature map file(JSON), function to generate the value of k
    b. Argument data-type: Python dict, str, str, str, Python function object
    c. PRE-condition:
        i. Files should exist.
    d. Returns: Population of zebras, giraffes and all animals when each contributor shares k images
    e. Return data-type: Python dict object
    f. Comments: This method is used to estimate the population when each contributor shares their top k images. The images are ranked on the basis of the likelihood estimates obtained from the regression models. Since the regression outputs are likelihoods, the choice of whether or not each photo is shared is made by flipping a coin whose bias equals the likelihood of the picture being shared.