- Python-3
- Jupyter-Notebook
- pandas
- matplotlib
- Plot.ly
- Folium
- sklearn
- sklearn-metrics
- sk-image
- Amazon Mechanical Turk
- Bash
- Windows Batch
- Methods available for extracting features of images and annotations from IBEIS.
- Completely automated the process for selection of images and creation of Amazon Mechanical Turk jobs.
- Completely automated deployment, approval and download for all the mechanical turk jobs.
- Completely automated parsing of .results file from the mechanical turk engine and return a python object/csv/json ready for processing.
- Methods available for both single and multiple feature extraction for an image or a list of images (can be specified as a CSV).
- Methods available to join features with the results and return python data-frames/csv/json for statistical calculation.
- Methods available for generating rank list of most shared pictures.
- Methods available for generating rank list by share proportion based on ecological features like species, sex, age, view_point of the animal.
- Methods available for generating rank lists for a specific feature across all albums or individual albums (number of shares for zebra in a particular album versus number of shares for giraffes etc.)
- Methods available to append the results from Amazon Mechanical Turk Methods with tags from Microsoft Image Tagging API and generate a rank list of most shared tags.
- Added functionality to generate reports of all statistics in HTML format with bar charts wherever necessary.
- Methods available for data preparation for applying classifiers using the Bag-of-words methodology.
- Methods available for performing feature selection by calculating the information gain of attributes with respect to the target variable.
- Methods available for building learning models like Logistic Regression, Support Vector Machines, Decision Trees and Random Forests and returning the predictions as well as performance metrics for the classifier.
- Methods available for visualizing the shared/not shared pictures on a map and creating clusters of homogeneous share/no-share regions.
- Methods available for generating heat maps of the region of shared photos and not shared photos
- Methods available for applying mark-recapture and calculate the Petersen-Lincoln Index on the GZC data-set.
- Methods available for applying mark-recapture and calculate the Petersen-Lincoln Index on the shared GZC data-set obtained from the Mechanical Turk experiments.
- Methods available for extracting images from Flickr when the tags and text are specified.
- Methods available for parallel download of images from Flickr and concurrent execution of image detection tasks on IBEIS.
- Methods available for extracting beauty features from images. This is an image processing module currently being developed using the scikit-image library.
- Methods available for directly predicting the share proportion using different regression techniques. Regression methods used include Linear, Ridge, Lasso, Elastic Nets, SVR etc.
- Methods to estimate population using synthetic albums which are formed using the predicted shared data. Mark-Recapture models are then applied to this predicted data, to study effects of individual photographers in the population estimation steps.
- The synthetic experiments use both the probability scores generated by the classifiers and the likelihood of a picture being shared, as generated by the regression algorithms, to rank images.
- Synthetic experiments simulate population estimates when every contributor shares their top k images, bottom k images, random k images, or the images above a certain likelihood threshold.
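
The mark-recapture items above rely on the Petersen-Lincoln index. The following is a minimal sketch of that estimator, not the project's own implementation; the function name and the sample identifiers are hypothetical. It estimates the population as (number seen on the first occasion × number seen on the second occasion) / number seen on both.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Estimate population size as (n1 * n2) / m."""
    if recaptured <= 0:
        raise ValueError("at least one recapture is needed for the estimate")
    return marked_first * caught_second / recaptured


# Hypothetical individual IDs sighted on two separate occasions.
day1 = {"z1", "z2", "z3", "z4"}
day2 = {"z3", "z4", "z5"}

# Recaptures are the individuals seen on both occasions.
estimate = lincoln_petersen(len(day1), len(day2), len(day1 & day2))
print(estimate)  # 4 * 3 / 2 = 6.0
```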
JobsMapResultsFilesToContainerObjs.py
- genUniqueImageListFromMap(mapFlName)
- genAlbumGIDDictFromMap(mapFlName)
- genImgAlbumDictFromMap(mapFLName)
- getImageFreqFromMap(inFL)
- genAidGidDictFromMap(mapFL)
- genGidAidDictFromMap(mapFL)
- genAidGidTupListFromMap(mapFL)
- genAidFeatureDictList(mapFL)
- genAidFeatureDictDict(mapFL)
- extractImageFeaturesFromMap(gidAidMapFl, aidFtrMapFl, feature)
- createResultDict(jobRangeStart, jobRangeEnd, workerData)
- imgShareCountsPerAlbum(imgAlbumDict, results)
- genMSAIDataHighConfidenceTags(tagInpDataFl, threshold)
- auditResMap(imgAlbumDict, resultList)
-
- getPosShrProptn(imgJobMap, resStart, resEnd)
- genTotCnts(ovrCnts)
- getShrProp(ovrAggCnts)
- getCountingLogic(gidAidMapFl, aidFeatureMapFl, feature, withNumInds)
- genAlbmFtrs(gidAidMapFl, aidFeatureMapFl, imgJobMap, reqdFtrList)
- getShrPropImgsAcrossAlbms(imgJobMap, resSetStrt, resSetEnd, flNm)
- getFltrCondn(ftr)
- genObjsForConsistency(gidAidMapFl, aidFeatureMapFl, ftr, imgJobMap, resSetStrt, resSetEnd)
- getConsistencyDict(filteredKeyArr, ftrAlbmDict, ftrAlbmShrPropDict, flNm)
- genVarStddevShrPropAcrsAlbms(consistency)
- ovrallShrCntsByFtr(gidAidMapFl, aidFeatureMapFl, feature, imgJobMap, resSetStrt, resSetEnd)
- shrCntsByFtrPrAlbm(gidAidMapFl, aidFeatureMapFl, feature, imgJobMap, resSetStrt, resSetEnd)
- ovrallShrCntsByTwoFtrs(gidAidMapFl, aidFeatureMapFl, ftr1, ftr2, imgJobMap, resSetStrt, resSetEnd)
- genNumIndsRankList()
-
- genHead(dataDict, ftr)
- getMasterData(flNm)
- genAttribsHead(data, ftrList)
- createDataFlDict(data, allAttribs, threshold, dataMode, writeTempFiles)
- getLearningAlgo(methodName, kwargs)
- trainTestSplitter(gidAttribDict, allAttribs, trainTestSplit)
- buildBinClassifier(data, allAttribs, trainTestSplit, threshold, methodName, extremeClf)
- genAllAttribs(masterDataFl, constraint, infoGainFlNm)
- buildRegrMod(train_data_fl, allAttribs, trainTestSplit, methodName, kwargs)
-
- trainLearningObj(train_data_fl, test_data_fl, clf, attribType, infoGainFl, methArgs, isClf)
- trainTestClf(train_data_fl, test_data_fl, clf, attribType, infoGainFl, clfArgs)
- estimatePopulation(prediction_results, inExifFl, inGidAidMapFl, inAidFtrFl)
- biasedCoinFlipper(p)
- trainTestRgrs(train_data_fl, test_data_fl, methodName, attribType, infoGainFl, methArgs)
- kSharesPerContributor(prediction_probabs, inExifFl, inGidAidMapFl, inAidFtrFl, genk)
- kSharesPerContribAfterCoinFlip(prediction_results, inExifFl, inGidAidMapFl, inAidFtrFl, genk)
- shareAbvThreshold(prediction_probabs, inExifFl, inGidAidMapFl, inAidFtrFl, threshold, randomShare)
- runSyntheticExpts(inExifFl, inGidAidMapFl, inAidFtrFl, krange)
- runSyntheticExptsRgr(inExifFl, inGidAidMapFl, inAidFtrFl, krange)
- runSyntheticExptsClf(inExifFl, inGidAidMapFl, inAidFtrFl, krange)
Description: Contains methods to extract specific information from the IBEIS database through RESTful API calls.
- getAnnotID(gid)
- getImageFeature(aidList, feature)
- getExifData(gidList, exifFtr)
- getContributorGID(cid)
- getAgeFeatureReadableFmt(ageList)
- getUnixTimeReadableFmt(unixtm)
a. Argument: GID of a single image.
b. Argument data-type: int/str
c. PRE-condition:
    i. gid should be a valid image ID
d. Returns: Corresponding Annotation IDs in the image.
e. Return data-type: Python list object
f. Comments: Used to extract annotation IDs for a particular image GID
a. Arguments: List of Annotation IDs, Required Feature
b. Argument data-type: Python list, str
c. PRE-condition:
    i. aid should be a valid annotation ID
    ii. feature should be from the set: {"age/months", "yaw/text", "exemplar", "quality/text", "sex/text", "species/text", "name/rowid", "name/text", "image_contributor_tag"}
d. Returns: Corresponding feature of the specified annotation ID
e. Return data-type: Python list object
f. Comments: Used to extract features for a given list of annotation IDs.
a. Arguments: List of GIDs, Required EXIF Feature
b. Argument data-type: Python list, str
c. PRE-condition:
    i. GID should be a valid image in IBEIS database
    ii. feature should be from the set: {unixtime, lat, long}
d. Returns: Corresponding EXIF feature for the specified image
e. Return data-type: Python list object
f. Comments: Used to extract EXIF features for a given list of image GIDs.
a. Arguments: Contributor ID
b. Argument data-type: int/str
c. PRE-condition:
    i. cid should be a valid contributor ID
d. Returns: List of images clicked by the contributor
e. Return data-type: Python list object
f. Comments: Used to extract all images clicked by a particular photographer.
a. Arguments: Age as extracted from IBEIS API call.
b. Arguments data-type: Python list.
c. PRE-condition:
    i. Age should be in one of the following formats: [[-1,-1]] or [[None,2]] or [[3, 5]] or [[6, 11]] or [[12,23]] or [[24,35]] or [[36,None]]
d. Returns: A human readable string of the age.
e. Comments: Age in IBEIS is stored as a list (an estimated range in months). IBEIS identifies the age of the animal up to 3 years. Animals are classified as infants, one-year-old juveniles, two-year-old juveniles or fully grown adults.
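
As an illustration of the conversion described above, here is a minimal sketch. The function name, the exact label strings and the cut-offs (under 12 months as infant, 12 to 23 as one-year-old juvenile, 24 to 35 as two-year-old juvenile, 36 and above as adult) are assumptions inferred from the listed ranges; the strings returned by getAgeFeatureReadableFmt() may differ.

```python
def age_to_readable(age):
    """Map an IBEIS-style age range such as [[12, 23]] to a label.

    The labels and thresholds are assumptions for illustration only.
    """
    low, high = age[0]
    if low == -1 and high == -1:
        return "unknown"
    if low is None or low < 12:
        return "infant"
    if low < 24:
        return "one-year-old juvenile"
    if low < 36:
        return "two-year-old juvenile"
    return "adult"


for sample in ([[-1, -1]], [[None, 2]], [[12, 23]], [[36, None]]):
    print(sample, "->", age_to_readable(sample))
```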
a. Arguments: Unixtime as extracted from IBEIS API call.
b. Arguments data-type: int
c. PRE-condition:
    i. The unix time provided in the input should be a valid unix time in integer form.
d. Returns: A human readable string of the unixtime in YYYY-MM-DD HH:mm:ss.
e. Comments: The time each picture was taken is stored in Unix time format. This is a shorthand method that converts the Unix time into a human-readable date and time. It makes the adjustments needed to account for UTC offsets, in other words for daylight saving changes.
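
A minimal sketch of such a conversion, using only the standard library. The function name is hypothetical, and rendering the result in UTC is an assumption; the original method may apply a different time zone adjustment.

```python
from datetime import datetime, timezone


def unix_to_readable(unixtime):
    """Format a Unix timestamp as YYYY-MM-DD HH:mm:ss (UTC assumed)."""
    moment = datetime.fromtimestamp(int(unixtime), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(unix_to_readable(0))  # 1970-01-01 00:00:00
```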
Description: Contains methods to generate a single mechanical turk photo album. There are multiple methods in the API but the method generateMTurkFile() needs to be called to generate one image album job file.
Hardcoded parameter: excepFL – points to the CSV file which is an enumeration of all identified human images. This is provided to avoid occurrences of human images in the mechanical turk photo album jobs.
- genRandomImg(begin, end, oldList)
- genImageList(begin, stop, maxImgs=20)
- generateMTurkFile(startGID, endGID, outFile, maxImgs=20, prodFileWrite=False)
a. Arguments: Start range, End Range and list in the same range
b. Argument data-type: int, int, Python list
c. PRE-condition:
    i. All elements in the oldList should be in the range [begin, end]
    ii. begin, end should be valid image GIDs.
d. Returns: A randomly selected image GID in the range [begin, end] and not in oldList.
e. Return data-type: int
f. Comments: This method is used when a particular image in the image album has to be replaced with another randomly selected image.
a. Arguments: Start range, end range and number of images in photo album
b. Argument data-type: int, int, int
c. PRE-condition:
    i. begin and stop should be both valid image GID’s.
    ii. begin < stop
    iii. stop – begin > maxImgs
d. Returns: A python list with randomly selected image GID’s that are in the range [begin, stop] and len(list) = maxImgs
e. Comments: This method is used to generate a list of random images that will be further used as an input to the method that generates the photo album mechanical turk job. The list has no duplicates and has exactly maxImgs number of images.
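
The two selection steps described above (drawing a duplicate-free list of maxImgs GIDs and picking a replacement GID that is not already used) can be sketched as follows. This is an illustration under assumptions, not the script's code: the function names, the `excluded` parameter (standing in for the human-image exclusion list) and the `rng` parameter are hypothetical.

```python
import random


def pick_image_list(begin, stop, max_imgs=20, excluded=frozenset(), rng=random):
    """Return max_imgs distinct GIDs from [begin, stop], skipping excluded ones."""
    candidates = [gid for gid in range(begin, stop + 1) if gid not in excluded]
    if len(candidates) < max_imgs:
        raise ValueError("not enough candidate images in the range")
    # random.sample draws without replacement, so the list has no duplicates.
    return rng.sample(candidates, max_imgs)


def pick_replacement(begin, end, old_list, rng=random):
    """Return one GID from [begin, end] that is not in old_list."""
    used = set(old_list)
    candidates = [gid for gid in range(begin, end + 1) if gid not in used]
    if not candidates:
        raise ValueError("every image in the range is already used")
    return rng.choice(candidates)


rng = random.Random(0)  # seeded only to make the demo repeatable
album = pick_image_list(1, 100, max_imgs=10, excluded={5, 6}, rng=rng)
replacement = pick_replacement(1, 100, album, rng=rng)
```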
a. Arguments: Start GID, end GID, name of output file, number of images in photo album, a boolean to generate a m-turk job for production environment.
b. Argument data-type: int, int, str, int, boolean
c. PRE-condition:
    i. startGID and endGID should be both valid image GID’s.
    ii. startGID < endGID
    iii. endGID - startGID > maxImgs
d. Returns: A python list with randomly selected image GID’s that are in the photo album. The GID’s are in the range [startGID, endGID] and len(list) = maxImgs
e. Comments: Most important method of this script. This generates a .question file that is fed to createHIT command. If prodFileWrite parameter is set to True, then a job with the same image GID’s is generated with the word ‘prod’ appended to the outFile name. prodFileWrite is defaulted to False, so it will only generate a file for mechanical turk sandbox if this parameter is unspecified.
Description: Contains a single method that generates mechanical turk jobs for generating photo-albums in bulk. This script contains a __main__() method that accepts command line arguments and can be directly executed through terminal.
To run the script, provide 4 parameters in the order fileName, jobMapName, numberOfFiles, numberOfImgs.
python CreateTurkFilesBulk.py /tmp/sample /tmp/sample.CSV 2 10
Successfully written : /tmp/sample1
Successfully written : /tmp/sample2
- createTurkFilesBulk(flNm, jobMapName, noOfJobs, noOfImgsPerJob = 20)
a. Arguments: File Name prefix of the job files, File name of the map file of albums and the images in those albums, number of jobs needed, number of images needed per job.
b. Argument data-type: str, str, int, int
c. PRE-condition:
    i. Entire path is provided for both flNm and jobMapName.
    ii. Path exists in the host system.
d. Returns: None
e. Comments: Selects noOfJobs number of random contributors (there might be duplicates in the selected contributor list). There is hard-coded code for removing contributors who did not click any picture. For each job, noOfImgsPerJob number of images are selected from the range of images a particular contributor has clicked. This is done to ensure that given a particular album, all the images are clicked by the same contributor. The script assumes the value of prodFileWrite in line 63 as True. It generates 1 input, 1 question file per noOfJobs and 1 map file that contains a map between albums and the images in them.
Description: Contains multiple methods to calculate final result statistics. The methods in this script rely heavily on objects created by JobsMapResultsFilesToContainerObjs.py. Three global variables are declared in the scope of this project; they currently point to CSV files that have information from the second phase of the mechanical turk deployment.
gidAidMapFl, aidFeatureMapFl, imgJobMap
These parameters are used by multiple methods within this script.
- getPosShrProptn(imgJobMap, resStart, resEnd)
- genTotCnts(ovrCnts)
- getShrProp(ovrAggCnts)
- getCountingLogic(gidAidMapFl, aidFeatureMapFl, feature, withNumInds=True)
- genAlbmFtrs(gidAidMapFl, aidFeatureMapFl, imgJobMap, reqdFtrList)
- getShrPropImgsAcrossAlbms(imgJobMap, resSetStrt, resSetEnd, flNm)
- getFltrCondn(ftr)
- genObjsForConsistency(gidAidMapFl, aidFeatureMapFl, ftr, imgJobMap, resSetStrt, resSetEnd)
- getConsistencyDict(filteredKeyArr, ftrAlbmDict, ftrAlbmShrPropDict, flNm)
- genVarStddevShrPropAcrsAlbms(consistency)
- ovrallShrCntsByFtr(gidAidMapFl,aidFeatureMapFl,feature,imgJobMap,resSetStrt,resSetEnd)
- shrCntsByFtrPrAlbm(gidAidMapFl, aidFeatureMapFl, feature, imgJobMap, resSetStrt, resSetEnd)
- ovrallShrCntsByTwoFtrs(gidAidMapFl, aidFeatureMapFl, ftr1, ftr2, imgJobMap, resSetStrt, resSetEnd)
- genNumIndsRankList()
a. Arguments: Image Job Map in CSV format, start of the results file, end of the results files.
b. Argument data-type: str, int, int
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
    ii. All files starting from photo_album_
d. Returns: A Python dictionary object that has the position of images in the album and the number of shares, number of not_shares, the total and the share proportion as a nested dictionary.
e. Comments: Number of shares/not shares for each and every position are enumerated inside a list. Use of OrderedDict() from the Python collections framework ensures that the records are picked in the exact same order they appear in the albums. This returned dictionary can then be embedded inside a data-frame and can be visualized.
a. Arguments: Overall counts dictionary, an enumeration of share/no share data per image.
b. Arguments data-type: Python dictionary object.
c. PRE-condition:
    i. The values in the argument dictionary should be a Python list object.
    ii. All the values in the list are integers.
d. Returns: A python dictionary which has the same keys as the argument data-type, but the values are simply the sum total of all values in the list corresponding to the argument dictionary key.
e. Comments: This method simply adds up the share/no share counts of a particular image. For example, for a particular image which appeared in 5 different albums, the share counts were 9, 5, 9, 10, 10. This method will return gid: sum([9, 5, 9, 10, 10]).
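
The aggregation described above amounts to summing each list of counts. A minimal sketch, with a hypothetical function name and sample key:

```python
def total_counts(counts):
    """Collapse {key: [count, count, ...]} into {key: sum of counts}."""
    return {key: sum(values) for key, values in counts.items()}


per_album_shares = {"gid_1": [9, 5, 9, 10, 10]}
print(total_counts(per_album_shares))  # {'gid_1': 43}
```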
a. Arguments: Overall Aggregate counts, typically outputted by genTotCnts method. A python dictionary with key and sum total of shares, no_shares and totals.
b. Arguments data-type: Python dictionary object.
c. PRE-condition:
    i. The value in the argument dictionary should be a single integer.
    ii. The key should be an iterable, preferably a Python tuple object.
    iii. Keys should contain the strings: ‘share’, ‘total’ or ‘not_share’
d. Returns: A python dictionary which is simply the share proportion of that particular key.
e. Comments: This method will be used to calculate the share proportion of a particular feature or an image. This dictionary answers the question, what percentage of this feature/image were shared. The share proportion is calculated as total_shares/(total_shares+total_not_shares).
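
A minimal sketch of the share-proportion formula, total_shares / (total_shares + total_not_shares). The key layout is an assumption made for illustration: each key is a tuple whose last element is 'share', 'not_share' or 'total'. The function name is hypothetical.

```python
def share_proportion(agg_counts):
    """Return {identifier: shares / (shares + not_shares)}."""
    proportions = {}
    for key, shares in agg_counts.items():
        *ident, kind = key
        if kind != "share":
            continue
        ident = tuple(ident)
        not_shares = agg_counts.get(ident + ("not_share",), 0)
        total = shares + not_shares
        # Unwrap single-element identifiers for readability.
        name = ident[0] if len(ident) == 1 else ident
        proportions[name] = shares / total if total else 0.0
    return proportions


counts = {
    ("zebra", "share"): 30,
    ("zebra", "not_share"): 10,
    ("zebra", "total"): 40,
}
print(share_proportion(counts))  # {'zebra': 0.75}
```

Multiply the result by 100 if a percentage is preferred.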
a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), feature, number of individuals in the output required (defaulted to True)
b. Arguments data-type: str, str, str, boolean
c. PRE-condition:
    i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
    ii. Should be from an acceptable set of feature strings.
d. Returns: A Python dictionary object.
e. Comments: This method defines the counting logic for a particular image for a given feature. The result dictionary optionally includes the number of individuals if the parameter withNumInds is set to True. For instance, for the feature ‘SPECIES’, there might be images that contain both a zebra and a giraffe. In that case, the share counts have to be added to both zebra and giraffe.
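
The zebra-and-giraffe case can be sketched as follows. This is only an illustration of the idea: the function name and input dictionaries are hypothetical, and counting each feature value once per image (even if several individuals share it) is an assumption.

```python
from collections import Counter


def shares_by_feature(gid_features, gid_shares):
    """Credit each image's share count to every distinct feature value in it."""
    totals = Counter()
    for gid, shares in gid_shares.items():
        # set() ensures a value is credited once per image (assumption).
        for value in set(gid_features.get(gid, [])):
            totals[value] += shares
    return dict(totals)


gid_features = {1: ["zebra", "giraffe"], 2: ["zebra", "zebra"]}
gid_shares = {1: 5, 2: 3}
result = shares_by_feature(gid_features, gid_shares)
# Image 1 contributes 5 shares to both species; image 2 contributes 3 to zebra.
```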
a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), Image GID-Album map file(CSV), required features list – a list of features for which the properties needs to be determined.
b. Arguments data-type: str, str, str, Python List
c. PRE-condition:
    i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
    ii. Entire path is provided for imgJobMap and the file should exist in CSV format.
    iii. The feature list should contain elements from the acceptable set of features.
d. Returns: A Python dictionary object.
e. Comments: This method is used to generate the number of shares and not_shares per feature in a particular album. For instance, if the required features list contains ‘SPECIES’, then it tells the share and not_share count per album for each available/identifiable species.
a. Arguments: Image Job Map in CSV format, start of the results file, end of the results files, output file name in json format.
b. Arguments data-type: str, int, int, str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
    ii. All files starting from photo_album_<resStart>.results until photo_album_<resEnd>.results should exist under results folder.
d. Returns: Two Python Dictionary objects.
e. Comments: This method derives the share proportions of images across different albums as opposed to generating an overall proportion of images. For example, the output of this method will contain how many times a particular image was shared in a particular album. This is particularly insightful for images that appear across multiple albums. It could essentially tell if a certain image was shared in the same way or differently and if the share rate of an image is orthogonal to the context of album.
a. Arguments: Feature name
b. Arguments data-type: str
c. PRE-condition:
    i. Provided argument should be a valid feature.
d. Returns: A lambda function
e. Comments: This method returns a lambda function or a logic in filtering all valid rows without the UNIDENTIFIED. A slight adjustment is needed depending on what feature is being filtered.
a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), Feature name, Image GID-Album map file(CSV), start of the results file, end of the results files.
b. Arguments data-type: str, str, str, str, int, int
c. PRE-condition:
    i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
    ii. Entire path is provided for imgJobMap and the file should exist in CSV format.
    iii. The feature list should contain elements from the acceptable set of features.
d. Returns: Python List Object, Two Python Dictionary objects
e. Comments: This is a helper method for getConsistencyDict(). It returns three objects.
    i. A list with only valid features i.e. without the UNIDENTIFIED entries.
    ii. A dictionary object that contains the albums in which a particular feature is seen. Ex. GIRAFFE is seen in albums 1, 3,4,5 etc.
    iii. A dictionary object that contains the share proportion of feature per album. Ex. giraffes in album 1 were shared 70% of times etc. Format of the key: (feature, album)
a. Arguments: The filtered key array, Feature Album dictionary and feature album share proportion dictionary are direct outputs of genObjsForConsistency()
b. Arguments data-type: Python List Object, Python Dictionary Object, Python Dictionary object, str
c. PRE-condition: Same conditions as genObjsForConsistency()
d. Returns: A Python Dictionary object.
e. Comments: This method should be thought of as the main counting logic for calculating the share-rate statistics of any feature that is extracted from IBEIS. The output dictionary contains all the valid features and their share rates across different albums. The output dictionary is a key - dictionary pair. Ex. {giraffe : {album_1 : 70, album_2 : 89}, ...}
a. Arguments: Consistency dictionary is the direct output of getConsistencyDict()
b. Arguments data-type: Python Dictionary object
c. PRE-condition: Same conditions as genObjsForConsistency()
d. Returns: A Python Dictionary object.
e. Comments: As the method name suggests, this method generates the mean, variance and standard deviation for a particular feature element across different albums.
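
A minimal sketch of these summary statistics over a consistency dictionary of the form {feature : {album : share rate}}. The function name and output keys are hypothetical, and the use of population (rather than sample) variance is an assumption.

```python
from statistics import mean, pstdev, pvariance


def spread_across_albums(consistency):
    """Return mean, variance and standard deviation of share rates per feature."""
    summary = {}
    for feature, by_album in consistency.items():
        rates = list(by_album.values())
        summary[feature] = {
            "mean": mean(rates),
            "variance": pvariance(rates),  # population variance (assumption)
            "std": pstdev(rates),
        }
    return summary


consistency = {"giraffe": {"album_1": 70, "album_2": 90}}
stats = spread_across_albums(consistency)
# giraffe: mean 80, variance 100, standard deviation 10
```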
a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), Feature name, Image GID-Album map file(CSV), start of the results file, end of the results files.
b. Arguments data-type: str, str, str, str, int, int
c. PRE-condition:
    i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
    ii. Entire path is provided for imgJobMap and the file should exist in CSV format.
    iii. The feature list should contain elements from the acceptable set of features.
d. Returns: A Python Dictionary object.
e. Comments: This method generates the overall share statistic for every element of a particular feature. The logic involves simple counting for each element of the feature based on the counting logic (refer getCountingLogic()). This explicitly handles the images where there are multiple features and the counts are made in each of the feature. This method is ideal for getting share rate of a particular feature across the entire experiment, not limited by album. In other words, the result dict has share proportions calculated across albums.
a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), Feature name, Image GID-Album map file(CSV), start of the results file, end of the results files.
b. Arguments data-type: str, str, str, str, int, int
c. PRE-condition:
    i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
    ii. Entire path is provided for imgJobMap and the file should exist in CSV format.
    iii. The feature list should contain elements from the acceptable set of features.
d. Returns: A Python Dictionary object.
e. Comments: This method generates the share statistic for every element of a particular feature for each album in which there are some instances with the said feature. The logic involves simple counting for each element of the feature based on the counting logic (refer getCountingLogic()). This explicitly handles the images where there are multiple features, and the counts are made for each of the features. This method is ideal for getting the share rate of a particular feature across the entire experiment by album. In other words, this should be used to calculate and compare the share rates of a particular feature and how they change across different albums. As a side note, if the same feature is being shared differently across albums, it means there is some contextual information from the album that is dominating this effect.
a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), Feature name 1, Feature name 2, Image GID-Album map file(CSV), start of the results file, end of the results files.
b. Arguments data-type: str, str, str, str, str, int, int
c. PRE-condition:
    i. Entire path is provided for gidAidMapFl and aidFeatureMapFl and the file should exist in JSON file format.
    ii. Entire path is provided for imgJobMap and the file should exist in CSV format.
    iii. The feature list should contain elements from the acceptable set of features.
d. Returns: A Python Dictionary object.
e. Comments: This method is used to generate cross statistics for two features. For example, it answers questions like: How many giraffes that were looking left were shared? Or: How many infants that were males were shared?
The logic involves simple counting for each element of the feature based on the counting logic (refer getCountingLogic()). Since there are two features to be dealt here, the instances are divided into even and uneven features. Even features are defined as when the number of instances of feature 1 and feature 2 are identical. Uneven features on the other hand are when the number of instances of feature 1 and feature 2 are not identical. Uneven features are handled differently for 1-many, many-1 and many-many.
a. Arguments:
b. Arguments data-type:
c. PRE-condition:
d. Returns: A Python Dictionary object.
e. Comments: As the method name suggests, this method generates the rank list of number of individuals in an image by share proportion. There is an active issue being tracked to parametrize this method for future use at GitHub Issues
Description: Contains methods that generate Python objects from different map files. This script gives an interface for building the data structures required to compute various statistics. There are multiple methods that output CSV or JSON or both. Use methods in this script to generate easy-to-use Python objects. There are also methods that parse the .results file that is extracted from Amazon Mechanical Turk jobs.
Note that the method names are very similar and take same argument in most cases.
- genUniqueImageListFromMap(mapFlName)
- genAlbumGIDDictFromMap(mapFlName)
- genImgAlbumDictFromMap(mapFlName)
- getImageFreqFromMap(inFL)
- genAidGidDictFromMap(mapFL)
- genGidAidDictFromMap(mapFL)
- genAidGidTupListFromMap(mapFL)
- genAidFeatureDictList(mapFL)
- genAidFeatureDictDict(mapFL)
- extractImageFeaturesFromMap(gidAidMapFl,aidFtrMapFl,feature)
- createResultDict(jobRangeStart, jobRangeEnd, workerData)
- imgShareCountsPerAlbum(imgAlbumDict, results)
- genMSAIDataHighConfidenceTags(tagInpDataFl, threshold)
- auditResMap(imgAlbumDict, resultList)
a. Arguments: Image Job Map in CSV format
b. Argument data-type: str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: A Python List Object
e. Comments: This method is used to generate a list of all images that are used in the current experiment as specified by the map file.
a. Arguments: Image Job Map in CSV format
b. Argument data-type: str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: A Python Dictionary Object
e. Comments: This method is used to generate a dictionary in the form { Album : [GIDS] }. This object will give us the capability to query which images exist in a particular album.
a. Arguments: Image Job Map in CSV format
b. Argument data-type: str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: A Python Dictionary Object
e. Comments: This method is used to generate a dictionary in the form { GID: [Albums] }. This object will give us the capability to query which albums contain a particular GID.
a. Arguments: Image Job Map in CSV format
b. Argument data-type: str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: A Python Dictionary Object
e. Comments: This method is used to generate a dictionary in the form { GID : No. of albums it appears }.
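
The map-file objects described above ({ Album : [GIDs] }, { GID : [Albums] }, { GID : number of albums } and the unique image list) can be sketched as below. The CSV layout is an assumption made for illustration: each row is taken to be an album name followed by its GIDs. The function names and sample data are hypothetical, and the real map file may be laid out differently.

```python
import csv
import io
from collections import defaultdict


def album_gid_dict(rows):
    """Build {album: [GIDs]} from rows of the form [album, gid, gid, ...]."""
    return {row[0]: row[1:] for row in rows if row}


def gid_album_dict(album_to_gids):
    """Invert {album: [GIDs]} into {GID: [albums]}."""
    gid_to_albums = defaultdict(list)
    for album, gids in album_to_gids.items():
        for gid in gids:
            gid_to_albums[gid].append(album)
    return dict(gid_to_albums)


def gid_frequency(gid_to_albums):
    """Count the number of albums each GID appears in."""
    return {gid: len(albums) for gid, albums in gid_to_albums.items()}


# In-memory stand-in for a map file (hypothetical contents).
sample = "photo_album_1,101,102\nphoto_album_2,102,103\n"
rows = list(csv.reader(io.StringIO(sample)))

albums = album_gid_dict(rows)
gids = gid_album_dict(albums)
freq = gid_frequency(gids)
unique_images = sorted(gids)
```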
a. Arguments: GID-AID map file(JSON)
b. Argument data-type: str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: A Python Dictionary Object
e. Comments: This method is used to generate a dictionary in the form { AID : GID }.
a. Arguments: GID-AID map file(JSON)
b. Argument data-type: str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: A Python Dictionary Object
e. Comments: This method is used to generate a dictionary in the form { GID : List of AIDs in that image }.
a. Arguments: GID-AID map file(JSON)
b. Argument data-type: str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: A Python List Object (A list of tuples)
e. Comments: This method is used to generate a list of tuples in the form ( AID , GID ).
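
The three GID-AID views above are inversions of one another. A minimal sketch, assuming (for illustration only) that the JSON map file has the shape { GID : [AIDs] }; the function names and sample data are hypothetical.

```python
import json


def aid_gid_pairs(gid_to_aids):
    """Flatten {GID: [AIDs]} into a list of (AID, GID) tuples."""
    return [(aid, gid) for gid, aids in gid_to_aids.items() for aid in aids]


def aid_gid_dict(gid_to_aids):
    """Invert {GID: [AIDs]} into {AID: GID}."""
    return dict(aid_gid_pairs(gid_to_aids))


# In-memory stand-in for the JSON map file (hypothetical contents).
gid_to_aids = json.loads('{"101": [1, 2], "102": [3]}')

pairs = aid_gid_pairs(gid_to_aids)
lookup = aid_gid_dict(gid_to_aids)
```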
a. Arguments: AID-Feature map file(JSON)
b. Argument data-type: str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: A Python List Object (A list of dictionaries)
e. Comments: This method is used to generate a list of dictionaries in the form [{'AID': xx,'NID' : xx ,.. }]. This object will give us the capability to iterate through all annotations and their respective features.
a. Arguments: AID-Feature map file(JSON)
b. Argument data-type: str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: A Python Dictionary
e. Comments: This method is used to generate a dictionary in the form { AID : {'NID' : xxx , 'SPECIES' : xxx, .. }}. This object will give us the capability to query one or multiple features given an annotation ID.
a. Arguments: GID-AID map file(JSON), AID-Feature map file(JSON), feature
b. Argument data-type: str, str, str
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
    ii. The feature list should contain elements from the acceptable set of features.
d. Returns: A Python Dictionary object.
e. Comments: This method is used to generate a dictionary in the form { GID : [list of features instances in that image]}. This object will give us the capability to check what feature instances are present in a given image.
a. Arguments: Start of the results file, end of the results files, flag to include worker data (workerData).
b. Argument data-type: int, int, boolean
c. PRE-condition:
d. Returns: A Python Dictionary object.
e. Comments: Every album corresponds to a .results file which is extracted from the Amazon Mechanical Turk interface. This method parses the results files and generates a Python object consisting of each response key with the actual response from the users. The dictionary is of the form: { photo_album_i : { Answer.GID : [ GID|'share' , 'GID'|'noShare'] }}. All the results files from jobRangeStart to jobRangeEnd are parsed and included in the output object. When the workerData parameter is set to True, the IDs of the turkers who worked on a particular album are included in the result object.
a. Arguments: Image Job Map in CSV format, results is the direct output of the method createResultDict()
b. Argument data-type: str, Python Dictionary object
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: Two Python list objects.
e. Comments: The first list lets us iterate through all images together with the number of times each image was shared or not shared in a particular album. This object forms the basis of all statistical computations in the project. Each tuple in the list has the form (GID, Album, Share count, Not Share count, Proportion). The second list contains every (GID, album) pair for which there was no valid response. (Form fields in certain albums in experiment 2 were not mandatory at first; the bug was identified and corrected at a later stage.)
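The tallying step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `share_counts_per_album` is hypothetical, the results object is assumed to be already reduced to `{album: {GID: ["share", "noShare", ...]}}`, and "Proportion" is assumed to mean shares divided by all valid responses.

```python
def share_counts_per_album(results):
    """Hypothetical sketch: tally share / noShare responses per image.

    results: assumed layout {album: {gid: ["share", "noShare", ...]}}
    Returns (counts, no_response), where counts holds tuples of
    (gid, album, share_count, not_share_count, proportion).
    """
    counts, no_response = [], []
    for album, answers in results.items():
        for gid, responses in answers.items():
            shares = sum(1 for r in responses if r == "share")
            not_shares = sum(1 for r in responses if r == "noShare")
            total = shares + not_shares
            if total == 0:
                # No valid response for this image in this album.
                no_response.append((gid, album))
                continue
            # Assumption: proportion = shares / valid responses.
            counts.append((gid, album, shares, not_shares, shares / total))
    return counts, no_response
```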
a. Arguments: Tag input file in JSON format, confidence threshold (between 0 and 1)
b. Argument data-type: str, float
c. PRE-condition:
    i. Entire path is provided for tagInpDataFl and the file should exist.
d. Returns: Python Dictionary object.
e. Comments: This method generates the list of tags produced by the Microsoft Image Tagging API, thresholded by confidence. Each tag has an associated confidence that quantifies how certain the tag-prediction algorithm is. For the experiments, the threshold defaults to 0.5: any tag predicted with confidence greater than 0.5 is accepted and the rest are rejected.
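The confidence thresholding described above can be sketched as follows. The function name `filter_tags` and the input layout (`{GID: [{"name": ..., "confidence": ...}, ...]}`) are assumptions made for illustration; the actual structure of the tag input file may differ.

```python
def filter_tags(tags_by_gid, threshold=0.5):
    """Hypothetical sketch: keep only tags whose confidence exceeds the threshold.

    tags_by_gid: assumed layout {gid: [{"name": str, "confidence": float}, ...]}
    """
    return {
        gid: [t["name"] for t in tags if t["confidence"] > threshold]
        for gid, tags in tags_by_gid.items()
    }
```

The comparison is strict, so a tag with confidence exactly equal to the threshold is rejected, matching the "greater than 0.5" rule stated above.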
a. Arguments: Image Job Map in CSV format, results is the direct output of the method imgShareCountsPerAlbum()
b. Argument data-type: str, Python list object
c. PRE-condition:
    i. Entire path is provided for imgJobMap and the file should exist.
d. Returns: Three boolean variables.
e. Comments: This is an audit method that ensures there are no leaks or incorrect data in the result and feature objects. The three boolean variables indicate three different types of errors.
    i. If error1 is returned as True, there are images in the result objects that were not originally part of the map file.
    ii. If error2 is returned as True, there are images in the map file that are not in the results file.
    iii. If error3 is returned as True, there are images in the results file that are not in the map file.
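Consistency checks of this kind reduce to set differences between the image IDs in the map file and those in the results. The sketch below is illustrative only: the function name `audit_gids` is hypothetical, and it covers just the two directions of mismatch rather than reproducing the exact semantics of error1, error2 and error3.

```python
def audit_gids(map_gids, result_gids):
    """Hypothetical sketch: flag mismatches between two collections of GIDs.

    Returns (extra_in_results, missing_from_results) as booleans.
    """
    map_gids, result_gids = set(map_gids), set(result_gids)
    # Images that appear in the results but were never in the map file.
    extra_in_results = bool(result_gids - map_gids)
    # Images that are in the map file but have no entry in the results.
    missing_from_results = bool(map_gids - result_gids)
    return extra_in_results, missing_from_results
```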
Description: Contains methods to run synthetic population estimation experiments.
- trainTestClf(train_data_fl, test_data_fl, clf, attribType, infoGainFl, methArgs)
- biasedCoinFlipper(p)
- trainTestRgrs(train_data_fl, test_data_fl, method name , attribType, infoGainFl, methArgs)
- estimatePopulation(prediction_results, inExifFl, inGidAidMapFl, inAidFtrFl)
- kSharesPerContribAfterCoinFlip(prediction_results, inExifFl, inGidAidMapFl, inAidFtrFl,genk)
- kSharesPerContributor(prediction_probabs, inExifFl, inGidAidMapFl, inAidFtrFl, genk)
a. Arguments: Training data in CSV format, testing data in CSV format, classifier method name, attribute selection method name, information gain file in CSV format, keyword arguments for classifiers.
b. Argument data-type: str, str, str, str, str, dict
c. PRE-condition:
    i. Files should exist
d. Returns: Classifier object, shared (1) / not-shared (0) predictions
e. Return data-type: ClassifierCapsuleClass object, Python dictionary object
f. Comments: This is a wrapper method that builds a classifier on the training data and runs it on the testing data. It does not check whether the training and testing data files overlap.
a. Arguments: Probability
b. Argument data-type: float
c. PRE-condition:
    i. p should be between 0 and 1
d. Returns: 0/1
e. Return data-type: int
f. Comments: As the name suggests, this is a biased coin flipper. It simulates flipping a coin that lands heads with probability p. Heads means 1 and tails means 0.
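A biased coin flip of this kind takes only a few lines. The sketch below is one possible implementation, not necessarily the project's; the snake_case name `biased_coin_flip` and the optional `rng` parameter (added so a seeded generator can be passed in) are assumptions.

```python
import random


def biased_coin_flip(p, rng=random):
    """Return 1 (heads) with probability p, otherwise 0 (tails)."""
    if not 0 <= p <= 1:
        raise ValueError("p should be between 0 and 1")
    # rng.random() is uniform on [0, 1), so the test below succeeds
    # with probability p; p = 0 never yields 1 and p = 1 always does.
    return 1 if rng.random() < p else 0
```

Passing `random.Random(seed)` as `rng` makes a simulation run repeatable.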
a. Arguments: Training data in CSV format, testing data in CSV format, regression method name, attribute selection method name, information gain file in CSV format, keyword arguments for the regressor.
b. Argument data-type: str, str, str, str, str, dict
c. PRE-condition:
    i. Files should exist
d. Returns: Regression object, Likelihoods of an image being shared.
e. Return data-type: RegressionCapsuleclass object, Python dictionary object
f. Comments: This is a wrapper method that builds a regressor on the training data and runs it on the testing data. It does not check whether the training and testing data files overlap.
a. Arguments: Prediction results, EXIF information of images in JSON, GID-AID map file(JSON), AID-Feature map file(JSON)
b. Argument data-type: Python dict, str, str, str
c. PRE-condition:
    i. Files should exist.
d. Returns: Population of zebras, giraffes and all animals
e. Return data-type: Python dict object
f. Comments: This method performs the actual population estimation. It uses methods from MarkRecapHelper to calculate the population of all animals, zebras and giraffes from the predictions generated by the classifiers or regressors.
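For readers unfamiliar with mark-recapture estimation, the sketch below shows the classic Lincoln-Petersen estimator. Whether MarkRecapHelper uses this particular estimator is an assumption; the function name `lincoln_petersen` and its inputs (collections of individual IDs seen on two sampling occasions) are illustrative only.

```python
def lincoln_petersen(first_sample_ids, second_sample_ids):
    """Hypothetical sketch of the Lincoln-Petersen estimate N = n1 * n2 / m.

    n1: individuals seen on the first occasion ("marked")
    n2: individuals seen on the second occasion
    m:  individuals seen on both occasions ("recaptured")
    Returns None when there are no recaptures, since N is then undefined.
    """
    first, second = set(first_sample_ids), set(second_sample_ids)
    recaptured = first & second
    if not recaptured:
        return None
    return len(first) * len(second) / len(recaptured)
```

In a photo-sharing setting, the two samples would be the individuals identified in shared images from two different time windows.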
a. Arguments: Prediction results, EXIF information of images in JSON, GID-AID map file(JSON), AID-Feature map file(JSON), function to generate the value of k
b. Argument data-type: Python dict, str, str, str, Python function object
c. PRE-condition:
    i. Files should exist.
d. Returns: Population of zebras, giraffes and all animals when each contributor shares k images
e. Return data-type: Python dict object
f. Comments: This method estimates the population when each contributor shares their top k images. The images are ranked by the likelihood estimates obtained from the regression models. Because the regression outputs are likelihoods, whether each photo is shared is decided by flipping a coin whose bias equals that photo's likelihood of being shared.
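The selection step described above might look like the sketch below. The order of operations (flip a coin per image first, then keep each contributor's k highest-ranked survivors) is an assumption inferred from the method name, and the function name `k_shares_after_coin_flip`, the input layouts and the `rng` parameter are all hypothetical.

```python
import random


def k_shares_after_coin_flip(likelihoods, contributor_of, gen_k, rng=random):
    """Hypothetical sketch: pick up to k shared images per contributor.

    likelihoods:    assumed layout {gid: probability of being shared}
    contributor_of: assumed layout {gid: contributor id}
    gen_k:          callable returning the value of k for a contributor
    """
    candidates = {c: [] for c in set(contributor_of.values())}
    for gid, p in likelihoods.items():
        # Biased coin: the image survives with probability p.
        if rng.random() < p:
            candidates[contributor_of[gid]].append((p, gid))
    shared = {}
    for contributor, images in candidates.items():
        k = gen_k()
        # Rank surviving images by likelihood, highest first.
        images.sort(reverse=True)
        shared[contributor] = [gid for _, gid in images[:k]]
    return shared
```

The resulting per-contributor lists of GIDs would then feed the population estimate.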
a. Arguments: Prediction results, EXIF information of images in JSON, GID-AID map file(JSON), AID-Feature map file(JSON), function to generate the value of k
b. Argument data-type: Python dict, str, str, str, Python function object
c. PRE-condition:
    i. Files should exist.
d. Returns: Population of zebras, giraffes and all animals when each contributor shares k images
e. Return data-type: Python dict object
f. Comments: This method estimates the population when each contributor shares their top k images. The images are ranked by the likelihood estimates obtained from the regression models. Because the regression outputs are likelihoods, whether each photo is shared is decided by flipping a coin whose bias equals that photo's likelihood of being shared.