[Issues] some CPUs are not used
Ewout M. Helmich
helmich at astro.rug.nl
Fri Sep 28 15:24:25 CEST 2007
I've updated DBRecipes/mods/Pipeline.py and astro/net/dpu.py to spread the
tasks more evenly over the reserved nodes.
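
A rough sketch of the idea (illustrative only, not the actual change; the
helper name and arguments are made up here):

# Split the filenames into nearly equal groups, one per reserved CPU,
# instead of cutting fixed-size groups of GROUP_SIZE.
def split_evenly(filenames, num_cpus):
    base, extra = divmod(len(filenames), num_cpus)
    groups, start = [], 0
    for i in range(num_cpus):
        size = base + (1 if i < extra else 0)
        groups.append(filenames[start:start + size])
        start += size
    return groups

# e.g. 20 frames on 2 nodes with 2 CPUs each -> groups of 5, 5, 5, 5
# instead of 8, 8, 4.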
Regards,
Ewout
Ewout M. Helmich wrote:
> I checked and confirm that the node with the 4 jobs is not released
> until the entire job is finished. With 40 frames the overhead is smaller
> percentage-wise compared to 20, not bigger. I think it is correct that
> two different users cannot run jobs on the same node at the same time,
> even if one of the CPUs is idling, which is what you seem to suggest.
> It's easy to make the groups so that an even number is always created,
> but it really depends on the number of processes that you let run
> simultaneously on each node, which could be 1 or 3 as well as 2. I can
> probably use that information, however. I'd have to see how MDia is
> normally used before I can say anything about that; if you want to use
> the cluster more effectively with MDia, then what's needed is a
> dpu.run(..) command that includes information for more than one
> independent MDia task, which can then be run on different nodes/CPUs.
>
> Ewout
>
>
> Johannes Koppenhoefer wrote:
>
>> Hi Ewout,
>>
>> In the situation you mentioned, the 4-frame CPU will indeed be blocked
>> until the other 8-frame CPUs are finished. As you said, this causes only
>> minor overhead.
>> But if you submit a job with 40 frames, the job will be split over 5 CPUs
>> and 3 nodes, out of which one node only uses one CPU. The other CPU on
>> this node will not be used by other jobs until the job finishes (at least
>> on our cluster). This causes bigger overheads, up to 50%, if you submit
>> single-file jobs (or single-CPU jobs as in MDia). The workaround, as John
>> pointed out, is to optimize the lists, i.e. to choose a clever GROUP_SIZE.
>> I am doing this now, but it might be useful to integrate a piece of code
>> in Pipeline.py which chooses the GROUP_SIZE adequately for all Tasks so
>> that an even number of CPUs is always used (a sketch of this follows below).
>>
>> Bye,
>> Johannes
>>
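
A minimal sketch of the GROUP_SIZE choice described above (the constants,
defaults and helper name are assumptions, not actual Pipeline.py code):

def choose_group_size(num_files, max_group_size=8, procs_per_node=2):
    """Pick a group size whose resulting groups fill whole nodes."""
    # smallest number of groups that respects the maximum group size
    groups = (num_files + max_group_size - 1) // max_group_size
    # round up to the next multiple of procs_per_node so no CPU sits idle
    groups = ((groups + procs_per_node - 1) // procs_per_node) * procs_per_node
    # spread num_files over that many groups
    return (num_files + groups - 1) // groups

# e.g. choose_group_size(40) returns 7: six groups (7,7,7,7,7,5) on three
# fully used nodes, instead of five groups of 8 leaving one CPU idle.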
>>
>> "Ewout M. Helmich" <helmich at astro.rug.nl> schrieb am 27.09.2007 16:22:13:
>>
>>
>>> Hi Johannes,
>>>
>>> I'm not sure I completely understand your problem, but I can explain a
>>> few things. In DBRecipes/mods/Pipeline.py a variable GROUP_SIZE is used,
>>> which in the case of the image pipeline is 8. This results in your 20
>>> filenames being split up into groups of 8, 8 and 4 (made concrete in the
>>> short snippet after this message). The GROUP_SIZE was
>>> chosen so as to work best for the HPC cluster in Groningen, in
>>> particular because of the ~30min job limitation in the "short queue"
>>> here. If the number of processes per node (CPUs/cores) is 2, as in
>>> Groningen, that means two nodes are reserved in the call to the PBS
>>> queuing system, where one is handling 16 frames and the other 4. That is
>>> not very balanced and we could try to optimize by dividing the load
>>> evenly. On the other hand I doubt that this alone would be a serious
>>> problem (the main question here being whether the node that handles 4
>>> files is occupied for the entire time the node that handles 16 is busy).
>>> You mention losing 50% of the CPUs; how is that, exactly? Are you
>>> submitting many jobs in which you specify a single filename?
>>>
>>> Regards,
>>> Ewout
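
To make the GROUP_SIZE arithmetic above concrete (an illustrative snippet,
not the actual Pipeline.py code; the filenames are made up):

GROUP_SIZE = 8
filenames = ['frame%02d.fits' % i for i in range(20)]
# cut the list into fixed-size chunks of GROUP_SIZE
groups = [filenames[i:i + GROUP_SIZE]
          for i in range(0, len(filenames), GROUP_SIZE)]
print([len(g) for g in groups])   # [8, 8, 4]
# With 2 processes per node, PBS reserves 2 nodes: one node works through
# 8 + 8 = 16 frames while the other handles only 4.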
>>>
>>> John P. McFarland wrote:
>>>
>>>
>>>> Hi Johannes,
>>>>
>>>> The DPU/CPU behavior might have something to do with the cluster
>>>> queueing system, which is not controlled by the DPU, but that is only a
>>>> guess. For now, you could simply try to optimize the lists so that you
>>>> use both CPUs on one node and no others, if possible.
>>>>
>>>> I'm CCing this to the Issues list so that anybody else with some ideas
>>>> (especially our DPU experts) can chime in.
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> -=John
>>>>
>>>>
>>>> On Mon, 24 Sep 2007, Johannes Koppenhoefer wrote:
>>>>
>>>>
>>>>
>>>>
>>>>> Hello John,
>>>>>
>>>>> I have realized that if you submit a job on the dpu with the
>>>>> red_filenames option and the number of files is e.g. 20, it results in 3
>>>>> CPU jobs, two on one node and one on the next node. Now, for some reason I
>>>>> do not understand, the second CPU on the second node is not going to be
>>>>> used by other processes. This is particularly painful in my situation,
>>>>> where I have to submit jobs that run on a single CPU, because I can use
>>>>> only half of the CPUs on our cluster and the rest is blocked. Is there any
>>>>> reason for this dpu behavior? Do you know of any quick workaround for me?
>>>>>
>>>>> Cheers,
>>>>> Johannes
>>>>>
>>>>>
>>>>>
>>>>>
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Drs. Ewout Helmich <><>
Kapteyn Astronomical Institute <><> Astro-WISE/OmegaCEN
Landleven 12 <><>
P.O.Box 800 <><> email: helmich at astro.rug.nl
9700 AV Groningen <><> tel : +31(0)503634548
The Netherlands <><>
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=