Corrupt lines in pair file #43

@malinkallen

I have run the SourcererCC clone detector on slightly more than 35,000,000 files. The resulting clone pair file has more than 18,000,000,000 lines. The expected format is 4 comma-separated numbers per line, but 5 of these lines contain more than 4:

263694,263710,455981,41668,70616
591916,1015368,508215,591934,1015376,192522,333749
14702,100025479,527866,914862,100025719,706877,1213095
502505,200858502537,200858458,1527027,102616237
1454158,2021454205,202495178,785203,101352033
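
For anyone who wants to repeat the format check on their own pair file, here is a minimal sketch. It assumes that a well-formed line is exactly four comma-separated non-negative integers; the name `find_malformed` and the first sample line are made up for illustration and are not part of SourcererCC.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: a well-formed clone pair line consists of exactly
# four comma-separated non-negative integers and nothing else.
WELL_FORMED = re.compile(r"[0-9]+,[0-9]+,[0-9]+,[0-9]+")


def find_malformed(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (1-based line number, line text) for every malformed line."""
    for line_number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if WELL_FORMED.fullmatch(text) is None:
            yield line_number, text


# The first line is a made-up well-formed example; the other two
# are corrupt lines quoted in this issue.
sample = [
    "1,2,3,4\n",
    "263694,263710,455981,41668,70616\n",
    "502505,200858502537,200858458,1527027,102616237\n",
]
for line_number, text in find_malformed(sample):
    print(line_number, text)
```

Because the function only iterates, it can be fed an open file handle and runs in constant memory, although a single-threaded Python pass over 18,000,000,000 lines would probably be slow.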

The first one is on line 1604224 of query_3clones_index_WITH_FILTER.txt, which is attached in compressed form (split into 3 parts, since I cannot upload files larger than 10 MB): query_3clones_index_WITH_FILTER_1.txt.gz, query_3clones_index_WITH_FILTER_2.txt.gz, query_3clones_index_WITH_FILTER_3.txt.gz

The server I ran on went down a couple of times, so one could imagine that 263694,<parts of an ID> was written before a crash and that the next clone pair was then written on the same line. However, I don't think that is what happened. Since SourcererCC resumes from the last line logged in recovery.txt, I see two possibilities:

  1. The last line logged in recovery.txt is the one immediately before the line that was being processed when the server went down. Then that same line is processed again after recovery, so the second number of the corrupt line should end with its first number, which is not the case.
  2. The last line processed (and giving rise to an output line) before the crash is not the last one logged in recovery.txt. Then the first line processed after recovery should already have been processed before the crash, so we should find another line ending with 455981,41668,70616, which I can't.
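
The search in the second case can be sketched as follows. This is only an illustration of the check, not necessarily how it was actually done; `find_lines_ending_with` is a hypothetical helper, and the sample lines other than the corrupt one are made up. The match is aligned on field boundaries, so a line ending in 1455981,41668,70616 does not count.

```python
from typing import Iterable, Iterator, Tuple


def find_lines_ending_with(
    lines: Iterable[str], tail: str
) -> Iterator[Tuple[int, str]]:
    """Yield (1-based line number, text) for lines whose last fields equal `tail`."""
    needle = "," + tail  # require a field boundary in front of the tail
    for line_number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text == tail or text.endswith(needle):
            yield line_number, text


# Only the second line is taken from this issue; the others are made up.
sample = [
    "1,2,3,4\n",
    "263694,263710,455981,41668,70616\n",
    "9,1455981,41668,70616\n",
]
hits = list(find_lines_ending_with(sample, "455981,41668,70616"))
print(hits)
```

If the second case held, a scan like this over the whole pair file should report at least one line besides the corrupt one itself.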

My blocks file is 7.9 GB, so I have not attached it, but let me know if you need more information!
