 #summary Tracks progress for the data seeder project.
 #labels data-seeder,gsoc2010

 <wiki:toc max_depth="3" />

 = Introduction =

 This page will track progress of the GSoC 2010 project involving a new data seeder for Melange.

 = Design documents =

 == Data seeder configuration sheet and fixtures ==

 === Configuration sheet ===

 One of the requirements of the data seeder is to support some sort of predefined configuration so that seeding tasks are easily repeatable and customizable. A JSON file generated by the web interface seems suitable for this task. I believe JSON is better than XML here because JavaScript supports it natively, which helps when generating the file from the web interface, and because it is more readable.

 The configuration sheet will include the names of the models to seed. Parameters for the seeding will be provided via "data providers" (might think of a better name) that return different kinds of values (numbers, strings, dates, references, etc.). These data providers might follow different statistical distributions, depending on context.
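To make the data provider idea more concrete, here is a minimal Python sketch. It assumes each provider is a class built from the `parameters` object of the configuration sheet and exposing a `get_value()` method; that interface, the `PROVIDERS` registry and `build_provider()` are illustrative names, not existing Melange code. The provider names `FixedInteger` and `DateNormalDistribution` are the ones used in the draft configuration on this page. Reading `"20m"` as a standard deviation of 20 thirty-day months, and clamping values to `max`, are simplifying assumptions.

```python
import random
from datetime import datetime, timedelta


class FixedInteger:
    """Always returns the configured integer."""

    def __init__(self, parameters):
        self.value = int(parameters["value"])

    def get_value(self):
        return self.value


class DateNormalDistribution:
    """Draws dates around a mean date, optionally capped at a maximum."""

    DAYS_PER_MONTH = 30  # assumption: a "month" is 30 days

    def __init__(self, parameters):
        self.mean = self._parse(parameters["mean"])
        # Assumption: "20m" is a spread of 20 months, used as the
        # standard deviation of the normal distribution.
        months = int(parameters["variance"].rstrip("m"))
        self.spread_days = months * self.DAYS_PER_MONTH
        maximum = parameters.get("max")
        self.maximum = self._parse(maximum) if maximum else None

    @staticmethod
    def _parse(text):
        # Dates in the draft configuration use the MM/DD/YYYY format.
        return datetime.strptime(text, "%m/%d/%Y").date()

    def get_value(self):
        offset = random.gauss(0, self.spread_days)
        value = self.mean + timedelta(days=round(offset))
        # Clamp instead of re-drawing; this slightly distorts the
        # distribution but keeps the sketch short.
        if self.maximum is not None and value > self.maximum:
            value = self.maximum
        return value


# Hypothetical registry mapping "provider_name" values to classes.
PROVIDERS = {cls.__name__: cls for cls in (FixedInteger, DateNormalDistribution)}


def build_provider(spec):
    """Builds a provider from a {"provider_name", "parameters"} object."""
    provider_class = PROVIDERS[spec["provider_name"]]
    return provider_class(spec.get("parameters", {}))
```

With a registry like this, adding a new kind of data would only mean adding a new provider class; the configuration format itself would stay the same.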

 A draft of the configuration sheet format might look like this:

 {{{
   {
     "GSoCStudent" : // model name
     {
       "number": // number of models to seed
       {
         "provider_name" : "FixedInteger", // points to a data provider which simply returns an integer
         "parameters" : // different data providers might have different parameters
         {
           "value" : 100 // configures the data provider to always return 100
         }
       },
       "properties" : // this section will specify how to populate the properties or fields of a model
       {
         "given_name" : // seed the given_name property. Note that this is actually inherited from the Role model
         {
           "provider_name" : "NameGenerator", // points to a data provider that returns a name
           "parameters" : {} // use default/no parameters for this data provider
         },
         "birth_date" :
         {
           "provider_name" : "DateNormalDistribution", // this data provider generates random dates using the specified parameters
           "parameters" :
           {
             "mean" : "05/13/1988",
             "variance" : "20m", // a variance of 20 months
             "max" : "01/01/1992" // force a maximum value, for example to enforce that students are at least 18 years old
           }
         }
         // ... other properties should go here
       }
     }
     // ... other models should go here
   }
 }}}

 This example shows how to configure the seeder to populate the datastore with some random data. However, it's missing one vital element: relationships between models!

 There are two ways to define relations for models:

   # Seed all models with a reference to a specific model (either existing or new, either fixed or random etc.), as configured by a data provider. For example, we might want to seed a number of organization administrators, all referencing the same program.
   # Specify how to seed related models for a specific model (indicated either by many-to-many relations via list properties or by auto-generated back-references). For example, for an organization, we might configure the seeder to add a number of student proposals for that specific organization.

 *Examples:*
   * Example 1
 {{{
   {
     "StudentProject" : // model name
     {
       "number": { ... }, // number of models to seed
       "properties" :
       {
         "student" :
         {
           "provider_name" : "RandomModelReference", // data provider that returns a random model
           "parameters" :
           {
             "model_name" : "Student" // configure the data provider to return a random student
           }
         },
         "mentor" :
         {
           "provider_name" : "FixedModelReference", // data provider that returns a specific model
           "parameters" :
           {
             "key" : "23502935" // the key of the model to reference; it must already exist in the datastore
           }
         }
       }
     }
   }
 }}}

   * Example 2
 {{{
   {
     "Student" :
     {
       "number": { ... }, // number of models to seed
       "properties" :
       {
         "student_projects" : // this is the name of the back-reference property
         {
           "provider_name" : "RelatedModelSeeder", // data provider that seeds related models
           "parameters" :
           {
             "model_name" : "StudentProject", // the name of the related model (this could be unnecessary as it might be auto-detected)
             "configuration" : // the configuration for this data provider should follow the same rules as any model
             {
               "number" : { ... },
               "properties" :
               {
                 "title" : { ... },
                 "abstract" : { ... },
                 ...
               }
             }
           }
         }
       }
     }
   }
 }}}

 This might prove flexible enough to accomplish most data seeding needs, but it could be a bit hard to configure. That said, configuring is a one-time job, as configuration sheets will be reusable. Also, auto-detecting relations might fail if there are some "tricky" relations; I'm not sure whether this is the case. If auto-detection works, changes in models will not require any changes in the data seeder at all.
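As a rough illustration of how a seeder could walk such a configuration, including the nested `RelatedModelSeeder` case from Example 2, here is a small Python sketch. It works on plain dictionaries instead of real datastore models. The `Fixed` class, the `FixedString` provider name, the `_parent` field and the `seed()` / `seed_model()` functions are all hypothetical names introduced for this example. Note also that the `//` comments in the drafts above would have to be stripped before parsing, since Python's `json` module rejects them.

```python
RELATED = "RelatedModelSeeder"


class Fixed:
    """Stand-in provider that always returns its configured value."""

    def __init__(self, parameters):
        self.value = parameters["value"]

    def get_value(self):
        return self.value


# Hypothetical registry; "FixedString" is not part of the drafts above.
PROVIDERS = {"FixedInteger": Fixed, "FixedString": Fixed}


def make_provider(spec):
    return PROVIDERS[spec["provider_name"]](spec.get("parameters", {}))


def seed_model(model_name, model_config, store, parent_key=None):
    """Seeds one model and, recursively, any related models."""
    count = make_provider(model_config["number"]).get_value()
    specs = model_config.get("properties", {})
    # Ordinary properties get a value provider each.
    providers = {
        name: make_provider(spec)
        for name, spec in specs.items()
        if spec["provider_name"] != RELATED
    }
    # Related-model properties carry a nested configuration instead.
    related = [
        spec["parameters"]
        for spec in specs.values()
        if spec["provider_name"] == RELATED
    ]
    entities = store.setdefault(model_name, [])
    for _ in range(count):
        key = (model_name, len(entities))  # simple stand-in for a datastore key
        entity = {"_parent": parent_key}
        entity.update((name, p.get_value()) for name, p in providers.items())
        entities.append(entity)
        for parameters in related:
            seed_model(
                parameters["model_name"],
                parameters["configuration"],
                store,
                parent_key=key,
            )


def seed(config):
    """Returns a dict mapping model names to lists of seeded entities."""
    store = {}
    for model_name, model_config in config.items():
        seed_model(model_name, model_config, store)
    return store
```

Because the nested configuration follows the same rules as a top-level model, the same function handles both levels, which is the main appeal of the format.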

 === Python fixtures ===

 Python fixtures are used to easily set up and tear down a database for test purposes. The fixture module supports easy addition of static predefined data to the datastore, and it is very well integrated with unit tests. The data seeder's main purpose, however, is to load arbitrary random data.

 A local script could be used to generate a fixture dataset file from a configuration sheet, which would then be used for unit testing. This will make it easy to provide lots of generated data to tests, and at the same time ensure that the data used for tests will not change because of randomness.
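One way to get generated data that does not change between runs is to drive the generator from a fixed random seed and write the result out as Python source. The sketch below assumes the output follows the `DataSet` class layout of the fixture module (an outer class with one inner class per row); the field names, the `StudentData` class name and all function names are made up for illustration.

```python
import random


def generate_rows(count, rng):
    """Builds `count` rows of pseudo-random student data (hypothetical fields)."""
    return [
        {"given_name": "Student%d" % index, "birth_year": rng.randint(1980, 1992)}
        for index in range(count)
    ]


def render_dataset(class_name, rows):
    """Renders the rows as Python source in a DataSet-style layout."""
    lines = ["from fixture import DataSet", "", "", "class %s(DataSet):" % class_name]
    if not rows:
        lines.append("    pass")
    for index, row in enumerate(rows):
        lines.append("    class row_%d:" % index)
        for name in sorted(row):
            lines.append("        %s = %r" % (name, row[name]))
    return "\n".join(lines) + "\n"


def generate_fixture(count, seed):
    """Same seed and count always produce the same source text."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    return render_dataset("StudentData", generate_rows(count, rng))
```

Checking the generated file into the repository, rather than regenerating it on every test run, would keep the tests stable even if the generator itself changes later.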

 == Remote vs local ==

 While it is necessary to use the GAE Task Queue API for seeding a live instance, it adds too much overhead for seeding locally, especially for automated tests.

 Therefore, I see the need for two scripts that handle local data generation and seeding, using a configuration sheet provided by the web interface. The scripts should be minimal, reusing most of the implementation in the SoC data seeder module.

   # *Python fixture generator*
 {{{
 This script generates Python fixture datasets for use with unit tests, based on a data seeder JSON configuration file.

 Usage: ./gen_fixture.py [-o output.py] <configuration.json>
 }}}
   # *Local data seeder*
 {{{
 This script seeds the local datastore based on a data seeder JSON configuration file.

 Usage: ./data_seeder.py <configuration.json>
 }}}
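The command-line handling for the fixture generator could be sketched with `argparse` as follows. Only the `-o` option and the positional configuration file come from the usage line above; the `--output` long form, the default of writing to standard output and the placeholder rendering step are assumptions made for this example.

```python
import argparse
import json
import sys


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Generate Python fixture datasets from a data seeder "
                    "JSON configuration file."
    )
    parser.add_argument("configuration", help="path to the JSON configuration sheet")
    parser.add_argument(
        "-o", "--output", default=None,
        help="file to write the fixture to (assumed default: standard output)",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    with open(args.configuration) as handle:
        config = json.load(handle)  # plain JSON only; comments are not accepted
    # Placeholder for the real rendering step, which would reuse the
    # data seeder module to turn the configuration into datasets.
    source = "# Fixture datasets for: %s\n" % ", ".join(sorted(config))
    if args.output is None:
        sys.stdout.write(source)
    else:
        with open(args.output, "w") as handle:
            handle.write(source)
    return 0
```

A real script would end with the usual `if __name__ == "__main__": sys.exit(main())` guard; the local data seeder could share `parse_args()` minus the output option.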

 = Meetings and Agendas =

 == Wednesday May 12, 2010 8:00 in UTC ==
 ===Meeting Logs===
     Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-12.log.html#t2010-05-12T08:20:19

     Meeting End: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-12.log.html#t2010-05-12T09:29:37

 ===Meeting Notes===

   Location: #melange on freenode

   Time: 8:00 UTC, Wednesday, 12th of May, 2010

   Chaired by: Mario Ferraro

   Notes taken by: Felix Kerekes


     * TODOs:
        * Edit blog post, give credits to open source projects used for generating graphs
        * Integrate the diagram generator script in the code base
        * Generate .dot files with date so we can keep a history of the models
        * In the future, making some kind of animation from the first commit might be really cool
        * Post a detailed project schedule in the wiki, with weekly milestones
        * Focus on getting the design documents done by the end of the community bonding period
        * Homework for next Monday: remote/local thingy and XML/YAML/Fixtures analysis
     * Conference call
        * Problems regarding Melange state and timeline
        * How does Melange track the current phase of a program? Is it solely by date, or are there any switches in the DB to reflect that?
           * Madhusudan told us that checks to the timeline are always made and showed us some code, huge thanks :)


 == Tuesday May 4, 2010 13:30:00 in UTC ==
 ===Meeting Logs===
     Meeting start: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-04.log.html#t2010-05-04T15:00:09

     Meeting End: http://www.thousandparsec.net/~irc/logm/%23melange.2010-05-04.log.html#t2010-05-04T16:18:24

 ===Meeting Notes===

   Location: #melange on freenode

   Time: 15:00 UTC, Tuesday, 4th of May, 2010

   Chaired by: Mario Ferraro

   Notes taken by: Felix Kerekes


     * Meeting schedule
        * IRC meetings will be held on a weekly basis, after the conference call.
     * Availability during summer
        * Felix will attend a part-time internship sometime around July.
     * Tasks during community bonding period
        * Draw a UML schema of the current data model.
        * Design documents for web interface workflow.
        * Further explore the possibilities of seeding locally/remotely.
           *  XML/YAML/Fixtures - advantages and drawbacks
        * Build some prototypes to really get the feel of each solution.