Skip to content

Weeks_1and2

Michal J. Gajda edited this page Oct 28, 2020 · 3 revisions
alternate text

(source: http://learnyouahaskell.com,
cc Miran Lipovača)

19th of May: official start of Google Summer of Code!

Hey ho, lets go!

#First Week

In my original timeline, this is the milestone for the first two weeks:
Milestone 1: Have a complete analysis of the program’s time and space profiling statistics.

##Work To Do

Analysis of time and space consumption of the program (Profiling) to see which parts of the code need the most time and memory. When those parts will be found, think about a possibility to change the code in a way that the time and space consumption will be improved.

##Work Done

I did some profiling runs to see the productivity of the program, the time spend for garbage collection and allocation rate. Additionally, a detailed profiling run shows these information for each of the methods separately. Here, you can see a summary of the results.

Overview:

  INIT    time    0.00s  (  0.00s elapsed)  
  MUT     time  699.10s  (729.01s elapsed)  
  GC      time  1169.81s  (1329.56s elapsed)  
  RP      time    0.00s  (  0.00s elapsed)  
  PROF    time    0.00s  (  0.00s elapsed)  
  EXIT    time    0.08s  (  1.59s elapsed)  
  Total   time  1868.99s  (2060.17s elapsed)  

  %GC     time      62.6%  (64.5% elapsed)  

  Alloc rate    1,694,144,469 bytes per MUT second  

  Productivity  37.4% of total user, 33.9% of total elapsed  

It can be seen that the program uses most of the time for garbage collection which lowers its productivity a lot. Thus, the amount of memory which is allocated during runtime needs to be reduced.

Detailed statistic:

   transalign_prof +RTS -p -RTS 452.xml U50_vs_SP.xml  

total time  =      753.48 secs   (753482 ticks @ 1000 us, 1 processor)  
total alloc = 782,868,820,008 bytes  (excludes profiling overheads)  

COST CENTRE                      MODULE         %time     %alloc  

group_al'''.merge            Align       39.0   43.7  
trans_align                  Align       38.4    9.7  
group_al'''.toMap            Align        8.3   26.7  
group_al'''                  Align        6.3    6.0  
readAlignmentCache           BlastCache   1.6    5.0  
group_al'''.toList           Align        1.1    1.3  
merge1                       Align        1.1    1.6  
readAlignments               Blast        0.9    1.9  
readAlignmentCache.unconvert BlastCache   0.4    1.3  

In the detailed analysis, a list is given which shows the methods having a large time and space consumption. In this way, I see which methods have to be changed. There are some more important methods which should be changed, but they are not listed in this summary.

#Second Week

In my original timeline, this is the milestone for the first two weeks:
Milestone 1: Have a complete analysis of the program’s time and space profiling statistics.

##Work To Do

I will look at the methods which need the largest amounts of time and space. I will think about how to change those methods or if it is necessary to rewrite them.

##Work Done I did some space profiling statistics and plots to see which methods have the highest space consumption. In the space profiling statistics, it could be seen that a large part of the allocation is due to lists. That's why I also did a tracing to see the size of the lists.
In the folling subsections you can see plots for the standard profiling showing different methods and a profiling plot showing which constructors allocate the highest amount of space. Afterwards, you see the summary of the tracing.
At the top of the plots, you can see which compiler flags I used to generate those statistics:

  • +RTS: linkage with a non-trivial runtime system
  • -p: profiling
  • -K100M: limitation of the stack space
  • -hc: standard heap profiling
  • -hd: profiling showing allocation by constructors

The plots were done using the Haskell package hp2pretty. As an input, hp2pretty takes the output file of the profiling run, the .hp file. See also the references for some more information.

###Standard heap profiling

If you look at the keys on the right, it can be seen that the first, second and fourth entry is due to reading the input files. The third entry and all others from the fifth entry on are from the 'real' program, mostly from the methods trans_align and group_al'''.

alternate text

###Profiling showing the allocating constructors The keys on the right shows as a first entry the list constructor since lists are used very often. The second and third entry are due to reading the input. The following two entry occur in the transalign program as positions of the alignment.

alternate text

###Tracing The current version of transalign uses sparse matrices for the alignment algorithm. As the time and space profiling showed, lists are one of the causes for high time and space consumption. Tracing shows the sizes of the lists while the program is running.
One of the traces is done in the group_al''' method. This method uses lists to group together positions in the alignment. The output of the tracing is a table showing for each call of the group_al''' method the sizes of the lists. The list has the type [Pos,[(Pos,Score)]]. In the first column, the length of the outer list is given. The second column show the sum of the lengths of the inner lists. In this way, a small number would show a sparsa matrix, a high number shows that there are many lists which are not empty.
The table shows a summary of the output table done with R. The average values are 1829 and 164982 which indicates that the matrix is not a sparse matrix.

  length y        sum $ map (length snd) y  
 Min.   :   5       Min.   :       5  
 1st Qu.: 448       1st Qu.:    1351  
 Median :1881       Median :    5336  
 Mean   :1829       Mean   :  164982  
 3rd Qu.:2967       3rd Qu.:   31296  
 Max.   :3954       Max.   :11056220  

The second table gives an analysis of the output of the transalign program started with a test set. This test set was also used for the profiling. Again, the summary is done with R.
The output of the program is a table showing the scores, lengths and positions of the aligned sequences of query and target (these columns are not shown here). p0, pn, q0 and qn are the start and end positions on both sequences. For this test set alignments with an average length of 412 were found.

     score             length         score/length  
Min.   :   0.79    Min.   :   5.0    Min.   :0.0800  
1st Qu.:  48.23    1st Qu.: 169.5    1st Qu.:0.1900  
Median :  89.06    Median : 343.0    Median :0.2800  
Mean   : 123.25    Mean   : 412.4    Mean   :0.2987  
3rd Qu.: 131.25    3rd Qu.: 503.5    3rd Qu.:0.3900  
Max.   :1369.80    Max.   :2634.0    Max.   :1.2200  


       p0              pn               q0                qn  
Min.   :    1    Min.   : 2458    Min.   :   1.0    Min.   :   20.0  
1st Qu.: 4546    1st Qu.: 8524    1st Qu.:  47.0    1st Qu.:  490.5  
Median : 7825    Median :10792    Median : 121.0    Median :  673.0  
Mean   : 7567    Mean   :10353    Mean   : 281.1    Mean   : 1292.6  
3rd Qu.:10231    3rd Qu.:12646    3rd Qu.: 292.5    3rd Qu.: 1255.5  
Max.   :12856    Max.   :12913    Max.   :2732.0    Max.   :35213.0  

#Next Weeks

Here, you can find the entry for the next two weeks.

#References

Profiling
Compiler flags for profiling
hp2pretty

Back to main page.

Clone this wiki locally