fix: collect files in PV when another container restart or stop #2010

Open
wants to merge 10 commits into
base: main

@Abingcbc Abingcbc commented Jan 6, 2025

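Problem

In LoongCollector's previous design, each directory corresponds to exactly one container. This breaks down when collecting from volumes (PV, hostPath, and so on):

When multiple containers collect from the same volume and one of them stops, or the same container restarts, every reader collecting that directory is marked as container stopped. In reality these readers are still collecting files that belong to other containers, so the "reading a stopped container" alarm keeps firing afterwards and logs are truncated.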

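Fix

  1. Add the container id to the reader; it is empty when collecting on the host.
  2. Carry the container id in the container stop event and pass it on to subsequent events, while also updating the container state in the config.
  3. When reading from a stopped container, the reader tries to update its container id. If the container id has changed and the new container is not stopped, the reader's state is restored.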
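
The three steps above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this PR: `Reader`, `ContainerState`, `ContainerTable`, `OnContainerStopped` and `TryRecover` are hypothetical names, and the table is only a stand-in for the per-config lookup of the container currently mapped to a host log directory.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified model; none of these names are LoongCollector's real API.
struct ContainerState {
    std::string id;
    bool stopped = false;
};

// Stand-in for the config lookup: host log directory -> container currently mapped there.
using ContainerTable = std::map<std::string, ContainerState>;

struct Reader {
    std::string hostLogDir;
    std::string containerID;       // step 1: stays empty when collecting on the host
    bool containerStopped = false;
};

// Step 2: the stop event carries a container id, and the reader only
// accepts it when that id matches the container it is bound to.
void OnContainerStopped(Reader& reader, const std::string& stoppedID) {
    if (reader.containerID == stoppedID) {
        reader.containerStopped = true;
    }
}

// Step 3: before giving up on a "stopped" reader, look up the container again.
// If a different, running container now owns the directory, rebind and recover.
bool TryRecover(Reader& reader, const ContainerTable& table) {
    if (!reader.containerStopped) {
        return false; // nothing to recover
    }
    auto it = table.find(reader.hostLogDir);
    if (it == table.end()) {
        return false;
    }
    const ContainerState& current = it->second;
    if (current.id != reader.containerID && !current.stopped) {
        reader.containerID = current.id;
        reader.containerStopped = false;
        return true;
    }
    return false;
}
```

In this model a reader with an empty `containerID` (the host case) never matches a stop event for a real container, so it is never marked stopped by one.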

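Test cases

Three events are involved: the new container starts (start), the old container stops (stop), and the file is written (modify). Their relative order matters.

  1. stop -> start -> modify: the modify event updates the container information held by the reader.
  2. stop -> modify -> start: logs written between stop and start are truncated; this cannot be avoided.
  3. start -> modify -> stop: same as 1
  4. start -> stop -> modify: same as 1
  5. modify -> stop -> start: same as 1
  6. modify -> start -> stop: same as 1
  7. stop -> modify -> stop: the reader stays in the container stopped state
  8. stop -> stop -> modify: the reader rejects stop events whose container id does not match its own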
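
To reason about these orderings, a small replay helper can be useful. The sketch below is a toy state machine written for illustration only; `Event`, `Outcome` and `Replay` are hypothetical names, and the model assumes a single directory whose "current" container is replaced on start. It is not the unit test added in this PR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy model, not LoongCollector code.
enum class EventType { Start, Stop, Modify };

struct Event {
    EventType type;
    std::string containerID; // ignored for Modify
};

struct Outcome {
    std::string readerContainerID;
    bool readerStopped;
    int droppedModifies; // modify events seen while the reader was stopped and could not recover
};

// Replays events against one reader that starts out bound to initialID.
Outcome Replay(const std::string& initialID, const std::vector<Event>& events) {
    std::string readerID = initialID;
    bool readerStopped = false;
    std::string currentID = initialID; // container the config currently maps to the directory
    bool currentStopped = false;
    int dropped = 0;
    for (const Event& e : events) {
        switch (e.type) {
        case EventType::Start:
            currentID = e.containerID;
            currentStopped = false;
            break;
        case EventType::Stop:
            if (e.containerID == currentID) currentStopped = true;
            // The reader rejects stop events for a different container id.
            if (e.containerID == readerID) readerStopped = true;
            break;
        case EventType::Modify:
            // A stopped reader rebinds if a different, running container took over.
            if (readerStopped && currentID != readerID && !currentStopped) {
                readerID = currentID;
                readerStopped = false;
            }
            if (readerStopped) ++dropped;
            break;
        }
    }
    return Outcome{readerID, readerStopped, dropped};
}
```

Under this model, replaying case 1 (stop of the old container, start of a new one, then modify) leaves the reader bound to the new container and not stopped, while case 2 (stop, modify, start) counts one dropped modify, which corresponds to the unavoidable truncation described above.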

@Abingcbc Abingcbc marked this pull request as draft January 6, 2025 03:59
@Abingcbc Abingcbc force-pushed the abing/fix-container-pv branch 4 times, most recently from dcd68aa to 3e5407b Compare January 7, 2025 09:14
@Abingcbc Abingcbc changed the title [WIP] fix: collect files in PV when another container restart or stop fix: collect files in PV when another container restart or stop Jan 7, 2025
@Abingcbc Abingcbc marked this pull request as ready for review January 7, 2025 09:15
@Abingcbc Abingcbc force-pushed the abing/fix-container-pv branch from 3e5407b to 6af14e0 Compare January 7, 2025 09:41
@Abingcbc Abingcbc force-pushed the abing/fix-container-pv branch from 6af14e0 to bfd097c Compare January 7, 2025 09:42
"file inode", reader->GetDevInode().inode)("file size", reader->GetFileSize()));
ForceReadLogAndPush(reader);
reader->CloseFilePtr();
// update container info one more time, ensure file is hold by same cotnainer

cotnainer -> container

@@ -206,6 +206,7 @@ void CheckPointManager::LoadFileCheckPoint(const Json::Value& root) {
string realFilePath;
int32_t fileOpenFlag = 0; // default, we close file ptr
int32_t containerStopped = 0;
string containerID;

Can this scenario be reproduced in E2E, and is there a corresponding test case?

@@ -206,6 +206,7 @@ void CheckPointManager::LoadFileCheckPoint(const Json::Value& root) {
string realFilePath;
int32_t fileOpenFlag = 0; // default, we close file ptr
int32_t containerStopped = 0;
string containerID;
int32_t lastForceRead = 0;
int32_t idxInReaderArray = LogFileReader::CHECKPOINT_IDX_OF_NEW_READER_IN_ARRAY;

Please ask 迅飞 (Xunfei) to focus on reviewing the UT and E2E cases.

if (discoveryConfig.first == nullptr) {
return false;
}
ContainerInfo* containerInfo = discoveryConfig.first->GetContainerPathByLogPath(mHostLogPathDir);

@yyuuttaaoo yyuuttaaoo Jan 13, 2025

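This may be a problem: if one collection reads the PV from the host and another reads it from inside a container, the host one never updates its container information. In that case, can the stopped container still be reset to empty?
So the config information probably still matters; we cannot rely on the container id alone.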

@yyuuttaaoo yyuuttaaoo self-requested a review January 13, 2025 09:15
3 participants