The idea here is that you have a huge set of web pages (URLs) and their contents, and you want to create a huge table, indexed by word, that shows you which URLs each word appears in. Fill in the bodies of the map function and the binary reducer function for the Google sprite and test it. Hint: this problem is very similar to WordCount.
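To see the whole pipeline at once (combining the examples from the table below, and assuming the second page's text is "to wit"): an input of ((hamlet "to be or not to be") (webster "to wit")) should ultimately produce the reverse-lookup table ((to hamlet webster) (be hamlet) (or hamlet) (not hamlet) (wit webster)).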
Problem | Input | Map Domain | Map Range | Map Function | Binary reducer function | Output |
---|---|---|---|---|---|---|
Google simulation! Given web pages (URLs) and their contents, create a massive reverse-lookup table that allows us to quickly query, given any single word, which web pages it was on. | A list of lists; the first element of each inner list is the web page address (URL), and the second element is the content of that web page. | A two-element list: the web page address and the text of the web page. | A list of lists, where each inner list has a word as its first element and all the URLs that contain that word as the following elements. E.g., if the input were ("hamlet" "to be or not to be"), the output would be ((to hamlet) (be hamlet) (or hamlet) (not hamlet)). | For every unique word in the web page, make a list of the word and the URL; return a list of all these pairs (see the sketch after the table). | Take two of these word/URL tables and merge them. E.g., given ((to hamlet) (be hamlet) (or hamlet) (not hamlet)) and ((to webster) (wit webster)), it would return ((to hamlet webster) (be hamlet) (or hamlet) (not hamlet) (wit webster)). | A single list of lists, with each inner list having a unique word as its first element and the URLs that contain that word as the following elements. |
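Since the Google sprite is built out of blocks rather than text, the following is only a rough Python sketch of the two bodies the table describes; the names map_page and merge are made up for illustration, and the list-of-lists representation mirrors the examples above.

```python
# Hypothetical sketch of the Google-sprite map and reducer bodies.
from functools import reduce


def map_page(page):
    """Map function: for every unique word on a web page, emit [word, url].

    `page` is a two-element list: [url, text-of-page].
    """
    url, text = page
    pairs = []
    for word in text.split():
        if [word, url] not in pairs:          # keep only unique words
            pairs.append([word, url])
    return pairs


def merge(left, right):
    """Binary reducer: merge two word/URL tables into one.

    Each table is a list of lists whose first element is a word and whose
    remaining elements are the URLs containing that word.
    """
    result = [entry[:] for entry in left]     # copy so the inputs stay untouched
    for word, *urls in right:
        for entry in result:
            if entry[0] == word:              # word already present: append its URLs
                entry.extend(u for u in urls if u not in entry)
                break
        else:                                 # word not seen yet: add a new row
            result.append([word, *urls])
    return result


if __name__ == "__main__":
    pages = [["hamlet", "to be or not to be"],
             ["webster", "to wit"]]
    index = reduce(merge, [map_page(p) for p in pages])
    print(index)
    # [['to', 'hamlet', 'webster'], ['be', 'hamlet'], ['or', 'hamlet'],
    #  ['not', 'hamlet'], ['wit', 'webster']]
```

As in WordCount, the reducer only ever combines two partial tables, so the per-page results can be merged pairwise in any grouping to build the final index.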