Mapreduce Python API + Django helper

The Mapreduce API is great. We now have a tool that can process tasks taking more than 30 seconds. Yeaaaahhh! This is a huge improvement; I wish we had had this tool months ago. All the examples in the documentation use the webapp framework, and there aren't many examples using the Django helper on the internet. This post is about that.

mapreduce.yaml:

    mapreduce:

    - name: Delete SearchableTowns
      mapper:
        input_reader: mapreduce.input_readers.DatastoreInputReader
        handler: main_map_reduce.delete_searchable_towns
        params:
        - name: entity_kind
          default: mapreduce_models.SearchableTown

    - name: Create SearchableTown from Town
      mapper:
        input_reader: mapreduce.input_readers.DatastoreInputReader
        handler: main_map_reduce.town_to_searchable
        params:
        - name: entity_kind
          default: mapreduce_models.Town

    - name: Create Town and SearchableTown from csv for USA
      mapper:
        input_reader: mapreduce.input_readers.BlobstoreLineInputReader
        handler: main_map_reduce.csv_to_towns
        params:
        - name: blob_keys
          default: AMIfv97g-x4G9-KM24YXQi6dSyBddAb97p0n98NgJlCL68jJA9jcvwETojEcF7MGGlZsDLEFVcJeeLHGgwxo9Nlay9GR33LniA06Obw3C781Te9yAn9Dk1EkwxjrFqHEBo4-WbZ7GUS9nKa3NOpDGdbxBBkD2sTYUg

The file contains 3 tasks. Two of them create or modify datastore entities. The third reads a big CSV file from the blobstore and creates a datastore entity for every line in the file. This is the Python version of this blog post (which uses Java).

Now, main_map_reduce is a Python file that I keep in the same location as mapreduce.yaml. Just a regular Python file. The imports in that file might cause exceptions, especially if they try to load Django stuff. To avoid problems we had to copy our models.py into mapreduce_models.py, removing almost all the imports. As mapreduce_models.py is placed at the same level as mapreduce.yaml, its module name contains no dots, so we also had to hack the file appengine_django/models.py, replacing this line:

    self.app_label = model_module.__name__.split('.')[-2]

With this block:

    self.app_label = 'my_app_name'
    try:
        self.app_label = model_module.__name__.split('.')[-2]
    except IndexError:
        pass
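To see why the fallback is needed, here is a minimal standalone sketch of the same logic. The function name app_label_for and its fallback argument are made up for illustration and are not part of appengine_django; the sketch only assumes that the label is taken from the second-to-last component of the dotted module name, as in the line above.

```python
def app_label_for(module_name, fallback="my_app_name"):
    """Return the second-to-last dotted component, or the fallback.

    Hypothetical helper: it mimics the patched lines in
    appengine_django/models.py but is not part of that package.
    """
    try:
        # "towns.models" -> ["towns", "models"] -> "towns"
        return module_name.split(".")[-2]
    except IndexError:
        # A top-level module such as "mapreduce_models" splits into a
        # single-element list, so index -2 does not exist.
        return fallback


print(app_label_for("towns.models"))
print(app_label_for("mapreduce_models"))
```

A module inside a package gets its package name as the label, while a top-level module such as mapreduce_models falls back to the hard-coded name.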

main_map_reduce.py:

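    # Imports assumed from the rest of the post: the operation module of the
    # mapreduce library and the trimmed-down copy of models.py.
    from mapreduce import operation as op
    import mapreduce_models as models

    def delete_searchable_towns(town_entity):
        # Queue the entity for deletion.
        yield op.db.Delete(town_entity)

    def town_to_searchable(town_entity):
        # Build a SearchableTown from each Town and queue it for saving.
        searchable = models.SearchableTown()
        searchable.code = town_entity.code
        searchable.lower_name = town_entity.name.lower()
        yield op.db.Put(searchable)

    def csv_to_towns(input_tuple):
        # The tuple holds the offset first and the line second.
        offset, line = input_tuple
        # process the line ...
        yield op.db.Put(town_entity)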

In the first two functions, the mapreduce framework passes in an entity. In the last one, it passes a tuple whose second item is the line read from the blob, which is a big CSV file.
This way, we can now upload a huge CSV file and then create entities from it. This task was really painful before, as we had to make a ton of dirty hacks to get around the 30-second restriction.
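As an illustration of the "process the line" step, here is a minimal sketch that runs without App Engine. The column layout (code first, name second) and the names parse_town_line and csv_to_town_fields are assumptions made for the example, not the original code; a real mapper would copy the resulting fields onto a Town entity and yield op.db.Put for it.

```python
import csv


def parse_town_line(line):
    """Turn one CSV line into a dict of town fields.

    The column order (code, name) is hypothetical.
    """
    # csv.reader handles quoted fields that contain commas.
    row = next(csv.reader([line]))
    code, name = row[0].strip(), row[1].strip()
    return {"code": code, "name": name, "lower_name": name.lower()}


def csv_to_town_fields(input_tuple):
    """Mirror the mapper's input: a tuple of (offset, line)."""
    offset, line = input_tuple
    if not line.strip():
        return None  # skip blank lines
    return parse_town_line(line)


print(csv_to_town_fields((0, "35001,Las Palmas")))
```

Keeping the parsing in a plain function like this makes it possible to test the CSV handling locally before wiring it into the mapper.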

  • Tate

    Hi Carlos,
    Could you share the code of this Python example with me?
    I am trying to build exactly this kind of application: one that first uploads a CSV file to the datastore and afterwards is able to map the content of the CSV into the database. I cannot find any tutorial that does that in Python, or any simple example.
    If you could share the code with me somehow, I would appreciate it.
    My address is duartefernandez@yahoo.com

    Best regards

  • Gaspar Muñoz

    Hi Carlos, can you send me the code? I'm starting a final-year degree project in which I need map reduce and I'll be using Django, so it would be a great help.

    My email is munozs.88@gmail.com

    Thanks!

  • http://www.carlosble.com Carlos Ble

    Hi!
    Sorry to you both, but I don't know where the code is. I only remember that it was practically the same as what I wrote in this post. With a few minutes and the info in this post, you should be able to work it out.

    Good luck with them.

  • http://seizethedave.com/ David Grant

    Thanks for the tips. That was WAY more work than I wanted to do on this beautiful Saturday.