
SNAPHU Killed Error Causes Missing scenes in output timeseries (Only 15/34 scenes showing) #104

Open
pbrotoisworo opened this issue Jan 25, 2025 · 4 comments



pbrotoisworo commented Jan 25, 2025

EDIT: Updated the title to reflect the underlying SNAPHU issue (out of memory).

Hi,

I've managed to run an analysis using 34 Sentinel-1 images. However, around 50% of the data is missing when I use a single-reference network.
I put in 34 SLC images, but the output timeseries and network contain only 15 data points. Is that normal? The network is still nicely distributed, but the temporal resolution is lower than expected due to the missing data.

[image attachment]

I've checked the output of stackSentinel.py and the missing dates are there in the output folders such as merged/interferograms, baselines, coreg_secondarys, etc.

I've inspected the slcStack.h5 file and the "slc" key has shape (34, 1029, 5864), which I assume means all 34 Sentinel-1 scenes were ingested.

Input stackSentinel code

stackSentinel.py -s /mnt/e/data/insar-highways/demak \
--workflow interferogram \
--working_directory /mnt/e/data/insar-highways/demak_v5 \
-n 1 --bbox "-6.980585 -6.896600 110.435772 110.636444" \
-o /mnt/e/data/insar-highways/demak_v5/orbits \
-a /mnt/e/data/insar-highways/demak_v5/auxfiles \
-d /mnt/e/data/insar-highways/demak_v5/dem/dem.geo \
-V False \
-z 4 \
-r 20

Then I ran MiaplPy with miaplpyApp.py demak.cfg --dir /mnt/e/data/insar-highways/demak_v5/miaplpy, using the cfg below.

################
miaplpy.load.processor      = isce  #[isce,snap,gamma,roipac], auto for isceTops
miaplpy.load.updateMode     = no  #[yes / no], auto for yes, skip re-loading if HDF5 files are complete
miaplpy.load.compression    = auto  #[gzip / lzf / no], auto for no.
miaplpy.load.autoPath       = no    # [yes, no] auto for no
        
		
miaplpy.load.slcFile        = /mnt/e/data/insar-highways/demak_v5/merged/SLC/*/*.slc.full  #[path2slc_file]
##---------for ISCE only:
miaplpy.load.metaFile       = /mnt/e/data/insar-highways/demak_v5/reference/IW*.xml
miaplpy.load.baselineDir    = /mnt/e/data/insar-highways/demak_v5/baselines
##---------geometry datasets:
miaplpy.load.demFile          = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/hgt.rdr.full
miaplpy.load.lookupYFile      = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/lat.rdr.full
miaplpy.load.lookupXFile      = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/lon.rdr.full
miaplpy.load.incAngleFile     = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/los.rdr.full
miaplpy.load.azAngleFile      = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/los.rdr.full
miaplpy.load.shadowMaskFile   = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/shadowMask.rdr.full
##---------miaplpy.load.waterMaskFile    = /mnt/e/data/insar-highways/demak_v4/water_mask/swbdLat_S08_S06_Lon_E110_E111.wbd
##---------interferogram datasets:
miaplpy.load.unwFile        = /mnt/e/data/insar-highways/demak_v5/miaplpy/inverted/interferograms_single_reference/*/*fine*.unw
miaplpy.load.corFile        = /mnt/e/data/insar-highways/demak_v5/miaplpy/inverted/interferograms_single_reference/*/*fine*.cor
miaplpy.load.connCompFile   = /mnt/e/data/insar-highways/demak_v5/miaplpy/inverted/interferograms_single_reference/*/*.unw.conncomp
        
##---------subset (optional):
## if both yx and lalo are specified, use lalo option unless a) no lookup file AND b) dataset is in radar coord
miaplpy.subset.lalo         = -6.980585:-6.896600,110.435772:110.636444

# MiaplPy options 
miaplpy.multiprocessing.numProcessor   = 10
miaplpy.interferograms.type = single_reference

## Mintpy options
mintpy.compute.cluster     = local  # if dask is not available, set this option to no 
mintpy.compute.numWorker   = 4

mintpy.reference.lalo     = -6.9062397501293855, 110.62864532047873
mintpy.troposphericDelay.method = no


pbrotoisworo commented Jan 25, 2025

Just an update. I fixed the error.

I saw Killed printed multiple times in the step 5 unwrap_ifgram output, which I think means my computer ran out of memory. Because of the many SNAPHU failures, the downstream processing then assumed there were only 15 datasets.

Checking my run_05_miaplpy_unwrap_ifgram file, I see it launches a lot of commands at the same time: 20 run commands followed by a wait, then another 13 run commands before the last wait. I rewrote the file so there is a wait after every 4 SNAPHU commands. I'm not sure which parameter controlled this in the cfg, maybe miaplpy.compute.numCores; I set it to 20 because I have 20 CPU cores.
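
A minimal sketch of that regrouping, assuming the run file is a flat list of unwrap commands separated by wait lines (the helper below is illustrative, not part of MiaplPy):

    # Hypothetical helper: rewrite a run file so at most `batch_size` SNAPHU
    # commands execute between consecutive `wait` barriers.
    def regroup_run_file(path, batch_size=4):
        with open(path) as f:
            commands = [line for line in f if line.strip() and line.strip() != 'wait']
        with open(path, 'w') as f:
            for i, cmd in enumerate(commands, start=1):
                f.write(cmd)
                if i % batch_size == 0:
                    f.write('wait\n')
            if len(commands) % batch_size != 0:
                f.write('wait\n')

    # Example usage (path assumed from this run):
    regroup_run_file('run_05_miaplpy_unwrap_ifgram', batch_size=4)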

I ran it again with no problem. I then had to delete numInvIfgram.h5, timeseries.h5, and temporalCoherence.h5 due to a dataset-size mismatch in later steps, but the resulting output is good.

My thoughts on this for the project team:

  1. Could there be a dedicated parameter for the number of parallel SNAPHU jobs? I was fine-tuning the cfg file for the phase-linking step, since it takes so long, and wanted to maximize CPU usage. But if I'm understanding correctly, the same parameter also set the number of parallel SNAPHU jobs, which led to the unsafe process terminations.
  2. There should be a way to safely catch the SNAPHU out-of-memory error in unwrap_ifgram. Currently it doesn't raise an exception, so the rest of MiaplPy keeps running and treats the result as valid even though 50% of the dataset is missing (see the sketch below).
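
A minimal sketch of what such a check could look like, using subprocess to detect a killed SNAPHU run (the command string and helper name are placeholders, not the exact call unwrap_ifgram.py builds):

    import signal
    import subprocess

    def run_snaphu_checked(cmd):
        """Run a SNAPHU command and fail loudly if it is killed or errors out."""
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode == -signal.SIGKILL:
            # A negative return code means the process died from a signal;
            # SIGKILL is what the Linux OOM killer sends.
            raise MemoryError(f"SNAPHU was killed (likely out of memory): {cmd}")
        if result.returncode != 0:
            raise RuntimeError(f"SNAPHU failed ({result.returncode}): {result.stderr}")
        return result.stdout

    # Placeholder command; the real one is assembled by unwrap_ifgram.py
    run_snaphu_checked("snaphu -f snaphu.conf")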

[image attachment]


mirzaees commented Feb 6, 2025

Yes, you are right. I am planning to use a Python version of SNAPHU and fix this issue in the near future.

pbrotoisworo changed the title from "Missing scenes in output timeseries (Only 15/34 scenes showing)" to "SNAPHU Killed Error Causes Missing scenes in output timeseries (Only 15/34 scenes showing)" on Feb 9, 2025

codeautopilot bot commented Feb 9, 2025

Potential Solution

The plan to solve the bug involves addressing the memory-management issues during the SNAPHU unwrapping process, which are likely causing the "Killed" error through excessive memory usage. By optimizing the configuration settings for SNAPHU, implementing memory checks, and enhancing error handling, we can prevent the process from being terminated unexpectedly and ensure all scenes are processed correctly.

What is Causing This Bug?

The bug is primarily caused by the SNAPHU unwrapping process consuming more memory than is available, leading to the process being killed by the operating system. This is likely due to the size of the interferograms being processed and the configuration settings not being optimized for the available system resources. Additionally, the lack of memory checks and detailed error handling in the scripts contributes to the issue.

Code

  1. Optimize SNAPHU Configuration: Adjust the SNAPHU configuration parameters to better match the available system resources. This may involve reducing the number of tiles or adjusting other parameters to reduce memory usage.

    # Example of adjusting the SNAPHU-related configuration (illustrative key
    # names; in MiaplPy the tile count and maximum discontinuity are exposed
    # as the --num_tiles and --max_discontinuity options of unwrap_ifgram.py)
    snaphu_config = {
        'NLOOKSRANGE': 1,
        'NLOOKSAZ': 1,
        'TILEDIR': '/path/to/tiledir',
        'NUM_TILES': 4,           # reduce the number of tiles if memory is limited
        'MAX_DISCONTINUITY': 10,  # adjust based on dataset characteristics
    }
  2. Implement Memory Checks: Add checks to ensure that sufficient memory is available before starting the SNAPHU process.

    import psutil
    
    def check_memory_availability(required_memory_gb):
        available_memory_gb = psutil.virtual_memory().available / (1024 ** 3)
        if available_memory_gb < required_memory_gb:
            raise MemoryError(f"Insufficient memory: {available_memory_gb} GB available, {required_memory_gb} GB required.")
    
    # Example usage
    check_memory_availability(8)  # Check if at least 8 GB of memory is available
  3. Enhanced Error Handling: Improve error handling to catch and report memory-related errors more gracefully.

    try:
        # Run the SNAPHU process (run_snaphu_process is a placeholder for the
        # actual unwrapping call)
        run_snaphu_process()
    except MemoryError as e:
        print(f"Memory error encountered: {e}")
        # Additional logging or cleanup actions
    except RuntimeError as e:
        print(f"Runtime error encountered: {e}")
        # Additional logging or cleanup actions
  4. Logging and Monitoring: Enhance logging to provide more detailed information about the execution of each step, including memory usage and any errors encountered.

    import logging

    import psutil  # required for virtual_memory() below

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    def log_memory_usage():
        memory_info = psutil.virtual_memory()
        logger.info(f"Memory usage: {memory_info.percent}% used, {memory_info.available / (1024 ** 3):.2f} GB available")

    # Example usage
    log_memory_usage()

How to Replicate the Bug

  1. Prepare a dataset of 34 Sentinel-1 SLC images and configure the processing pipeline as described in the user input.
  2. Run the stackSentinel.py script with the specified parameters to generate the necessary input files.
  3. Execute the miaplpyApp.py script using the provided configuration file.
  4. Observe the output timeseries and network for missing scenes, and check for any "Killed" error messages in the logs.

By following these steps, the bug should be replicated, allowing for further testing and validation of the proposed solution.


Files used for this task:

Changes on src/miaplpy/unwrap_ifgram.py

Analysis of src/miaplpy/unwrap_ifgram.py

Overview

The file unwrap_ifgram.py is responsible for unwrapping interferograms using the SNAPHU software. The script includes several functions and classes that manage the configuration and execution of the SNAPHU unwrapping process. The main class, Snaphu, handles the setup and execution of the unwrapping, including determining whether to split the process into tiles and managing the configuration files.

Potential Causes of the Bug

  1. Memory Management:

    • The error message "Killed" typically indicates that the process was terminated by the operating system, often due to excessive memory usage. The script does not appear to have explicit memory management or optimization strategies, which could lead to high memory consumption, especially when processing large datasets or multiple tiles.
  2. Tile Management:

    • The script includes logic to determine whether the unwrapping should be split into tiles (unwrap_tile method). If the number of tiles is not optimally configured, it could lead to inefficient memory usage. The calculation of y_tile and x_tile might not be optimal for the given dataset size.
  3. Configuration File Handling:

    • The configuration for SNAPHU is dynamically generated and written to a file. If the configuration parameters (e.g., NLOOKSRANGE, NLOOKSAZ, TILEDIR) are not set correctly, it could lead to inefficient processing and increased memory usage.
  4. Error Handling:

    • The script raises a RuntimeError if SNAPHU returns an error. However, it does not provide detailed logging or handling for memory-specific errors, which could help diagnose the issue.

Recommendations

  1. Optimize Memory Usage:

    • Implement memory profiling to identify bottlenecks and optimize memory usage. Consider using memory-efficient data structures or processing techniques.
  2. Tile Configuration:

    • Review and optimize the logic for determining the number of tiles (get_nproc_tile). Ensure that the tile size and number are appropriate for the dataset size and available system memory (see the sketch after this list).
  3. Configuration Parameters:

    • Verify that the configuration parameters for SNAPHU are set optimally for the dataset and system. Consider allowing user input for critical parameters that affect memory usage.
  4. Enhanced Logging and Error Handling:

    • Implement detailed logging to capture memory usage statistics and SNAPHU output. Enhance error handling to provide more informative messages, particularly for memory-related issues.
  5. Testing with Smaller Datasets:

    • Test the unwrapping process with smaller datasets to ensure that the logic and configuration are correct before scaling up to larger datasets.
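
As an illustration of the tile-configuration recommendation above, a minimal sketch of deriving a tile layout from the interferogram dimensions (the helper name and per-tile pixel budget are assumptions; NTILEROW, NTILECOL, and NPROC are standard SNAPHU tile-mode config keys):

    import math

    def suggest_tile_layout(length, width, max_pixels_per_tile=2_000_000, max_procs=4):
        """Hypothetical helper: suggest NTILEROW/NTILECOL/NPROC for SNAPHU tile mode.

        Caps the pixel count per tile so each SNAPHU sub-process keeps a
        modest memory footprint; tune max_pixels_per_tile for your machine.
        """
        n_tiles = max(1, math.ceil(length * width / max_pixels_per_tile))
        # Split tiles roughly evenly between rows and columns.
        ntilerow = max(1, int(math.sqrt(n_tiles)))
        ntilecol = max(1, math.ceil(n_tiles / ntilerow))
        nproc = min(max_procs, ntilerow * ntilecol)
        return {'NTILEROW': ntilerow, 'NTILECOL': ntilecol, 'NPROC': nproc}

    # Example: full-resolution dimensions reported in this issue (1029 x 5864)
    print(suggest_tile_layout(1029, 5864))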

By addressing these areas, the script can be improved to handle large datasets more efficiently and reduce the likelihood of memory-related errors.

Changes on src/miaplpy/miaplpyApp.py

Analysis of miaplpyApp.py

The miaplpyApp.py script is a part of a larger workflow for processing InSAR data using MiaplPy. The script orchestrates various steps in the processing pipeline, including loading data, generating interferograms, unwrapping them, and performing time series analysis. The script is designed to be flexible, allowing for different configurations and processing options through a configuration file.

Key Observations

  1. Workflow Steps: The script defines a series of steps (STEP_LIST) that are executed in sequence. These steps include loading data, phase linking, generating interferograms, unwrapping interferograms, and more.

  2. Configuration Management: The script uses a configuration file to manage various parameters and settings. This includes paths to input data, processing options, and output directories.

  3. Integration with SNAPHU: The unwrapping step (run_unwrap) is particularly relevant to the issue at hand. This step involves calling an external script (unwrap_ifgram.py) to perform phase unwrapping using SNAPHU. The script constructs command-line arguments for SNAPHU based on the configuration settings.

  4. Resource Management: The script attempts to manage computational resources by determining the number of CPU cores available and adjusting the number of tasks accordingly. However, there is no explicit memory management or checks for available memory before executing memory-intensive tasks like unwrapping.

  5. Error Handling: There is limited error handling in the script. If SNAPHU runs out of memory, it may simply terminate with a "Killed" message, which is consistent with the user's reported issue.

Potential Causes of the Bug

  • Memory Usage: The SNAPHU process may be consuming more memory than is available, leading to the process being killed by the operating system. This could be due to the size of the interferograms or the number of tiles being processed simultaneously.

  • Configuration Settings: The configuration settings for SNAPHU, such as the number of tiles (--num_tiles) and the maximum discontinuity (--max_discontinuity), may not be optimal for the available system resources.

  • Lack of Memory Checks: The script does not perform any checks on available system memory before starting the SNAPHU process. Implementing such checks could prevent the process from being killed unexpectedly.

Recommendations

  1. Optimize SNAPHU Configuration: Review and adjust the SNAPHU configuration settings to better match the available system resources. This may involve reducing the number of tiles or adjusting other parameters to reduce memory usage.

  2. Implement Memory Checks: Add checks to ensure that sufficient memory is available before starting the SNAPHU process. This could involve querying the system for available memory and adjusting the processing parameters accordingly (see the sketch after this list).

  3. Error Handling: Improve error handling to catch and report memory-related errors more gracefully. This could involve capturing the output of the SNAPHU process and checking for specific error messages.

  4. Logging and Monitoring: Enhance logging to provide more detailed information about the execution of each step, including memory usage and any errors encountered. This would aid in diagnosing issues more effectively.
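
Building on the memory-check recommendation above, a minimal sketch of sizing the number of concurrent unwrapping jobs from the currently available memory (the per-job memory estimate and helper name are assumptions, not existing MiaplPy options):

    import os

    import psutil

    def max_concurrent_unwrap_jobs(est_mem_per_job_gb=4.0, reserve_gb=2.0):
        """Hypothetical helper: cap parallel SNAPHU jobs by available memory.

        est_mem_per_job_gb is an assumed per-interferogram footprint; tune it
        for the actual interferogram size and tiling settings.
        """
        available_gb = psutil.virtual_memory().available / (1024 ** 3)
        by_memory = int((available_gb - reserve_gb) // est_mem_per_job_gb)
        by_cpu = os.cpu_count() or 1
        return max(1, min(by_memory, by_cpu))

    # Example: choose a job count instead of reusing the phase-linking core count
    num_unwrap_jobs = max_concurrent_unwrap_jobs()
    print(f"Launching at most {num_unwrap_jobs} SNAPHU jobs in parallel")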

By addressing these areas, the likelihood of encountering out-of-memory errors during the SNAPHU process can be reduced, potentially resolving the issue of missing scenes in the output timeseries.

Changes on src/miaplpy/objects/slcStack.py

Analysis of src/miaplpy/objects/slcStack.py

The slcStack.py file is responsible for handling the loading and processing of Sentinel-1 SLC (Single Look Complex) data into a format suitable for time-series analysis. The main class, slcStackDict, manages a collection of SLCs, while the slcStack class handles the reading and writing of these datasets to and from HDF5 files.

Key Functions and Potential Issues

  1. Data Loading and Processing:

    • The slcStackDict class initializes with a dictionary of SLC pairs, which are then processed to extract metadata, size, and date information.
    • The write2hdf5 method writes the processed SLC data into an HDF5 file, ensuring that the data is correctly structured and metadata is included.
  2. Date and Scene Management:

    • The get_date_list method retrieves a sorted list of dates from the SLC pairs, which is crucial for ensuring all scenes are accounted for in the time-series.
    • The write2hdf5 method includes a section for creating a dataset of dates, which should match the number of SLCs processed.
  3. Potential Causes for Missing Scenes:

    • Incomplete Data Handling: If the pairsDict does not contain all expected SLC pairs, some scenes might be missing from the output. This could be due to an error in how the pairs are generated or filtered.
    • HDF5 Writing Issues: The write2hdf5 method might not correctly write all datasets if there are issues with the input data or if the method encounters an error during execution.
    • Metadata Misalignment: If the metadata does not correctly reflect the number of scenes or their dates, this could lead to discrepancies in the output.
  4. Memory Management:

    • The file does not explicitly handle memory management, which could be a concern given the large size of SLC datasets. This might indirectly affect the processing if the system runs out of memory, leading to incomplete data processing.

Recommendations

  • Verify Input Data: Ensure that the pairsDict contains all expected SLC pairs and that they are correctly processed. This might involve checking the input data source and any filtering logic applied before this point.
  • Error Handling: Implement additional error handling in the write2hdf5 method to catch and log any issues that occur during the writing process.
  • Memory Usage: Consider optimizing memory usage, especially if the system is prone to running out of memory during processing. This might involve processing data in smaller chunks or using more efficient data structures (see the sketch below).
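
As one way to realize the chunked-processing suggestion above, a minimal sketch of streaming SLCs into an HDF5 stack date by date with h5py (the dataset layout and dtype are assumptions, not the actual slcStack schema):

    import h5py
    import numpy as np

    def write_slc_stack_chunked(out_file, slc_readers, length, width):
        """Hypothetical helper: stream SLCs into an HDF5 stack one date at a time.

        slc_readers is a list of callables, each returning one (length, width)
        complex64 array when invoked, so only one acquisition is held in memory.
        """
        num_dates = len(slc_readers)
        with h5py.File(out_file, 'w') as f:
            dset = f.create_dataset(
                'slc',
                shape=(num_dates, length, width),
                dtype=np.complex64,
                chunks=(1, min(512, length), min(512, width)),  # small per-date chunks
            )
            for i, read_slc in enumerate(slc_readers):
                dset[i] = read_slc()  # load, write, then release this date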

Overall, the file appears to be correctly structured for its intended purpose, but careful attention should be paid to the input data and the handling of potential errors during processing.

This comment was generated by AI. Information provided may be incorrect.



pbrotoisworo (Author) commented:

Thanks @mirzaees. I've updated the Issue title to reflect the actual issue.
