
Tuesday, October 28, 2008

Class Variables

My dreaded multiple inheritance article was published in Python Magazine yesterday. It was tricky enough giving a presentation on the subject, so I look at the article as something of an achievement. Someone contacted me shortly afterward about my use of a class variable in one of the examples:




class Publishable(object):
    published = False

    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def publish(self):
        self.published = True


The inquirer's question:


I've always been under the impression having class & instance
variables with the same name is just confusing and not a good way to
write Python. Why did you do this?



Anyhow, on the subject of "class variables," here are my thoughts:



[Begin email response]




I guess it depends on who you ask. The pattern of using class-level
variables as default values for instances is one I like and don't
find confusing. But these kinds of things are always subjective. It's
not my own pattern either, but one I've noticed in other projects.



http://twistedmatrix.com/trac/browser/tags/releases/twisted-8.1.0/twisted/internet/defer.py#L137
http://trac.pythonpaste.org/pythonpaste/browser/Paste/WebOb/trunk/webob/__init__.py#L474


I make a distinction, though, between a "class variable" and a
variable defined at the class level. In terms of the VM, they are of
course the same thing, but they are differentiated in human terms by
their use cases. The former, a "class variable", is useful if you are
tracking state on a class - for example keeping a count of objects
created through a class method:




class Foo:
    _created = 0

    @classmethod
    def create(cls):
        foo = Foo()
        cls._created += 1
        return foo



The latter "class level variable" can be useful for defining default
values for attributes (especially ones which cannot be specified in
the constructor):




import time


class Task:
    completed = False

    def __init__(self):
        self.start_time = time.time()



(This is the same pattern I'm adopting in my article.) Of course, when
the "completed" attribute is looked up on an instance of Task, it has
to take an extra step - first looking in the instance's __dict__ and
then in the class. However, this also means a decent space optimization
if your ratio of incomplete tasks to complete tasks is very high,
because incomplete tasks store no "completed" entry of their own - and
once a task is marked as completed it is likely to be discarded and
hence garbage collected.




I think it's only confusing (again, this is very subjective) if you
try to mix the two use cases in one class:




class Confusing:
    completed = 0

    def __init__(self):
        self.completed = False

    @classmethod
    def complete(cls, confusingInstance):
        cls.completed += 1
        confusingInstance.completed = True


Thanks for the feedback.



[End email response]
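The lookup order described in the email is easy to check directly. Below is a minimal sketch that reuses the Task class from the email; the variable name `task` and the print calls are mine, added only for illustration.

```python
import time


class Task:
    completed = False  # class-level default shared by every instance

    def __init__(self):
        self.start_time = time.time()


task = Task()

# No per-instance entry yet, so the lookup falls through to the class.
print('completed' in task.__dict__)  # False
print(task.completed)                # False

# Assignment creates an instance attribute that shadows the class default.
task.completed = True
print('completed' in task.__dict__)  # True
print(Task.completed)                # False: the class default is untouched
```

Only tasks that have actually been completed carry their own "completed" entry, which is where the space saving mentioned in the email comes from.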

Thursday, October 9, 2008

Lexical Scope

So, briefly on the topic of lexical scope, what does the following print?


def f():
    funcs = []
    for i in range(5):
        def g():
            print(i)
        funcs.append(g)
    return funcs

funcs = f()
for func in funcs:
    func()
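For anyone who wants to check their answer: each g looks up i when it is called, not when it is defined, so every call sees the last value the loop assigned. The sketch below returns values instead of printing them so the result is easy to inspect. The f_bound variant and its default-argument trick are my addition, one common way to capture the value on each iteration.

```python
def f():
    funcs = []
    for i in range(5):
        def g():
            return i  # i is looked up in f's scope when g is called
        funcs.append(g)
    return funcs


def f_bound():
    funcs = []
    for i in range(5):
        def g(i=i):  # the default is evaluated once, when g is defined
            return i
        funcs.append(g)
    return funcs


print([g() for g in f()])        # [4, 4, 4, 4, 4]
print([g() for g in f_bound()])  # [0, 1, 2, 3, 4]
```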

Wednesday, October 8, 2008

Deferred Task Queue

Problem: You have a queue of tasks to manage but only n should run at a time. An example of this is some CPU-intensive task where running more than a fixed number of tasks decreases your overall throughput due to context switching. Sometimes the precise number is simplified to the number of CPUs you have to work with. (Insert actual numbers here ;)



A simple queuing mechanism is perhaps the most obvious solution to the above problem. I wrote one Task Queue implementation which was pretty terrible and required you to "pump" initial events - but only a certain number - to "start" the queue. I won't post the code for that here. However, I revisited the problem yesterday and came up with some simple code which seems to do the job nicely. (Note, this is for an application using Twisted, hence "Deferred Task Queue". In a producer/consumer thread-based model where the consumers are threads in a fixed-size thread pool, the problem is already solved by just using Python's Queue and letting the consumers pull jobs off the queue - i.e. the size of the thread pool dictates how many jobs can run concurrently.)




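# Assumption: stand-in for the cpuCount() helper, which the post does not show.
from multiprocessing import cpu_count as cpuCount

from twisted.internet import defer
from twisted.python import failure


class TaskQueue:

    def __init__(self, concurrentMax=cpuCount()):
        self.concurrentMax = concurrentMax
        self._running = 0   # jobs currently in flight
        self._queued = []   # (f, args, kwargs, d) tuples waiting for a slot

    def push(self, f, *args, **kwargs):
        if self._running < self.concurrentMax:
            # A slot is free: call f now and hook in the queue check.
            self._running += 1
            return f(*args, **kwargs).addBoth(self._try_queued)
        # No free slot: hand back a placeholder Deferred that fires later.
        d = defer.Deferred()
        self._queued.append((f, args, kwargs, d))
        return d

    def _try_queued(self, r):
        self._running -= 1
        if self._running < self.concurrentMax and self._queued:
            f, args, kwargs, d = self._queued.pop(0)
            self._running += 1
            actuald = f(*args, **kwargs).addBoth(self._try_queued)
            actuald.chainDeferred(d)
        if isinstance(r, failure.Failure):
            # trap() with no arguments matches nothing, so it re-raises
            # the failure and the caller's errbacks still run.
            r.trap()
        return r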



Note that the above implementation is missing a notion of "capacity" - which might be important for a more general solution. My application actually handles capacity external to the queue, but there might be some benefit in internalizing the concept and raising exceptions on push() when capacity is exceeded. I'm still undecided on this.
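One way the capacity idea might be internalized, sketched under assumptions: TaskQueue only ever calls append() and pop(0) on its _queued list and tests it for truth, so a small list-like object that refuses new items past a limit could be assigned to self._queued, which would make push() raise once the backlog is full. QueueFull and BoundedBacklog are hypothetical names, not part of the original code, and "capacity" here counts waiting jobs only, not running ones.

```python
class QueueFull(Exception):
    """Hypothetical error raised when the backlog is at capacity."""


class BoundedBacklog(object):
    """List-like backlog offering only what TaskQueue uses:
    append(), pop(0) and truth testing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = []

    def append(self, item):
        if len(self._items) >= self.capacity:
            raise QueueFull('backlog already holds %d jobs' % self.capacity)
        self._items.append(item)

    def pop(self, index=0):
        return self._items.pop(index)

    def __len__(self):
        # An empty backlog is falsy, so "and self._queued" keeps working.
        return len(self._items)


backlog = BoundedBacklog(capacity=2)
backlog.append('job-1')
backlog.append('job-2')
try:
    backlog.append('job-3')
except QueueFull as exc:
    print('rejected: %s' % exc)
print(backlog.pop(0))  # job-1
```

Whether raising is the right behavior, as opposed to returning a Deferred that fails, is exactly the open design question; this sketch only shows that the bookkeeping can live inside the queue.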




The interface is pretty straightforward. You have a function f (and its arguments) that returns a Deferred and that you want to call (eventually). For example, doSomeStuff() below simply returns a Deferred object that will fire after 2 seconds have elapsed:


import time

from twisted.internet import defer, reactor


def doSomeStuff(a, b=None):
    print('doSomeStuff(%s, %s) called: %f' % (a, b, time.time()))

    def finishUp():
        print('doSomeStuff(%s, %s) finished: %f' % (a, b, time.time()))
        d.callback('done %d %d' % (a, b))

    d = defer.Deferred()
    reactor.callLater(2.0, finishUp)  # fire d two seconds from now
    return d


Let's queue up some calls to doSomeStuff():


taskq = TaskQueue(2)  # at most two jobs at a time, as in the output below
taskq.push(doSomeStuff, 1, b=2)
taskq.push(doSomeStuff, 2, b=3)
taskq.push(doSomeStuff, 3, b=4)
taskq.push(doSomeStuff, 4, b=5)
taskq.push(doSomeStuff, 5, b=6)


The output of the above would be something like this:


doSomeStuff(1, 2) called: 1223472790.943929
doSomeStuff(2, 3) called: 1223472790.944112
doSomeStuff(1, 2) finished: 1223472792.947769
doSomeStuff(3, 4) called: 1223472792.947887
doSomeStuff(2, 3) finished: 1223472792.948004
doSomeStuff(4, 5) called: 1223472792.948162
doSomeStuff(3, 4) finished: 1223472794.951818
doSomeStuff(5, 6) called: 1223472794.951937
doSomeStuff(4, 5) finished: 1223472794.952080
doSomeStuff(5, 6) finished: 1223472796.955836



As you can see, as soon as one of the called functions completes its job, the next one queued is called. There is no need to tell the queue to start doing its work - just push jobs onto the queue.




The technique that makes this work is simply exploiting the elegance of Deferreds by "sneaking" in a check for pushed jobs that have been queued via Deferred's addBoth() method. For those of you unfamiliar with how a Deferred works in Twisted, you should read the Deferred section of Twisted's asynchronous programming guide and then this document on Deferreds ... and maybe this one too.

What makes this useful is that I can treat a call to push() as if it were simply a call to the function being queued - no other messy API to tell the queue which callbacks or errbacks need to be invoked when the function is finally called. The internal function _try_queued() acts as a transparent pass-through gateway, so the caller of push() doesn't need to worry about adding a funky callback to translate some wrapped value or otherwise - again I'm eschewing unnecessary API details. So for example:




# this
taskq.push(doSomeStuff, 1, b=2).addCallback(cb).addErrback(eb)

# is the same as
doSomeStuff(1, b=2).addCallback(cb).addErrback(eb)

# ... just with queuing behavior under the hood
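For readers who are not using Twisted, the same "call it as if it were the function, just n at a time" idea can be sketched with the standard library's asyncio.Semaphore. This is a different technique from the Deferred-based queue above, not a port of it; limited(), do_some_stuff() and the stats dictionary are hypothetical names used only for this illustration.

```python
import asyncio


async def limited(sem, func, *args, **kwargs):
    """Await func(*args, **kwargs), letting at most N calls run at once."""
    async with sem:
        return await func(*args, **kwargs)


async def do_some_stuff(a, b, stats):
    stats['running'] += 1
    stats['peak'] = max(stats['peak'], stats['running'])
    await asyncio.sleep(0.01)  # stand-in for real work
    stats['running'] -= 1
    return 'done %d %d' % (a, b)


async def main():
    sem = asyncio.Semaphore(2)  # plays the role of concurrentMax
    stats = {'running': 0, 'peak': 0}
    results = await asyncio.gather(
        *[limited(sem, do_some_stuff, n, n + 1, stats) for n in range(1, 6)]
    )
    return results, stats['peak']


results, peak = asyncio.run(main())
print(results)  # ['done 1 2', 'done 2 3', 'done 3 4', 'done 4 5', 'done 5 6']
print(peak)     # 2
```

As with push(), the caller gets back exactly what the wrapped function would have produced, and the limit stays out of the calling code.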


In closing, in the context of asynchronous programming or otherwise, I've begun to strongly believe "the best API is no API".