[Discussion] Optimize search performance #911

Open
Yxa2111 opened this issue Feb 2, 2025 · 1 comment
Yxa2111 commented Feb 2, 2025

Project discussion

Current state

The current search logic appears to first fetch all torrents and then iterate over them, calling torrent_to_data for each one. torrent_to_data first parses the raw info out of the title via raw_parser, then uses official_title_parser to fetch the remaining details (poster_link, official_title, and so on) from tmdb or mikan. Because ab returns the same show from different subtitle groups to the frontend as distinct Bangumi entries, a special_url is generated for deduplication, and the function returns once the limit is satisfied.

    def analyse_keyword(
        self, keywords: list[str], site: str = "mikan", limit: int = 5
    ) -> BangumiJSON:
        rss_item = search_url(site, keywords)
        torrents = self.search_torrents(rss_item)
        # yield for EventSourceResponse (Server Send)
        exist_list = []
        for torrent in torrents:
            if len(exist_list) >= limit:
                break
            bangumi = self.torrent_to_data(torrent=torrent, rss=rss_item)
            if bangumi:
                special_link = self.special_url(bangumi, site).url
                if special_link not in exist_list:
                    bangumi.rss_link = special_link
                    exist_list.append(special_link)
                    yield json.dumps(bangumi.dict(), separators=(",", ":"))

    def torrents_to_data(
        self, torrents: list[Torrent], rss: RSSItem, full_parse: bool = True
    ) -> list:
        new_data = []
        for torrent in torrents:
            bangumi = self.raw_parser(raw=torrent.name)
            if bangumi and bangumi.title_raw not in [i.title_raw for i in new_data]:
                self.official_title_parser(bangumi=bangumi, rss=rss, torrent=torrent)
                if not full_parse:
                    return [bangumi]
                new_data.append(bangumi)
                logger.info(f"[RSS] New bangumi founded: {bangumi.official_title}")
        return new_data
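
For context, the JSON strings yielded above are presumably streamed to the frontend one by one. A minimal sketch of that wiring, assuming FastAPI with sse-starlette; the route path and the stand-in generator here are illustrative, not ab's actual code:

    import json

    from fastapi import FastAPI
    from sse_starlette.sse import EventSourceResponse

    app = FastAPI()

    def fake_analyse_keyword(keywords: list[str]):
        # Stand-in for analyse_keyword: yields compact JSON strings one by one.
        for i, kw in enumerate(keywords):
            yield json.dumps({"official_title": kw, "rank": i}, separators=(",", ":"))

    @app.get("/api/v1/search")
    async def search(keywords: str):
        # Each yielded string becomes one SSE event, so the frontend can render
        # results incrementally while slower metadata lookups are still running.
        return EventSourceResponse(fake_analyse_keyword(keywords.split()))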

Problem

official_title_parser performs a lookup for every single torrent, which seems wasteful: different episodes from the same subtitle group each trigger a search, yet in that case the returned metadata is identical. And since both tmdb and mikan require a proxy to reach from mainland China, these downstream requests are very slow.

After some analysis, it turns out that generating the special_url does not use any metadata from official_title_parser at all, so we can generate the special_url first and deduplicate on it. If it already exists, there is no need to fetch the metadata again:

    def analyse_keyword(
        self, keywords: list[str], site: str = "mikan", limit: int = 5
    ) -> BangumiJSON:
        rss_item = search_url(site, keywords)
        torrents = self.search_torrents(rss_item)
        # yield for EventSourceResponse (Server Send)
        exist_list = set()
        for torrent in torrents:
            if len(exist_list) >= limit:
                break
            # new helper function I added
            bangumi, special_link = self.torrent_to_bangumi(
                torrent, site, rss_item, exist_list
            )
            if bangumi:
                exist_list.add(special_link)
                yield json.dumps(bangumi.dict(), separators=(",", ":"))

    def torrent_to_bangumi(
        self, torrent: Torrent, site: str, rss: RSSItem, exist_list: set[str]
    ) -> tuple[Bangumi | None, str]:
        bangumi = self.raw_parser(raw=torrent.name)
        if bangumi is None:
            return None, ""
        special_link = self.special_url(bangumi, site).url
        # skip if the special_link already exists (different episodes
        # from the same subtitle group hit this case)
        if special_link in exist_list:
            return None, ""
        self.official_title_parser(bangumi=bangumi, rss=rss, torrent=torrent)
        bangumi.rss_link = special_link
        return bangumi, special_link

I gave it a try: searching 辉夜大小姐 (Kaguya-sama) with limit=5, the total time drops by about half.

[Screenshot: search timing before vs. after the change]
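
For reference, a minimal way to reproduce such a timing, assuming a searcher object that exposes analyse_keyword (the object name is illustrative):

    import time

    def time_search(searcher, keywords: list[str], limit: int = 5) -> None:
        # Consume the whole generator and report the wall-clock time.
        start = time.perf_counter()
        results = list(searcher.analyse_keyword(keywords, limit=limit))
        elapsed = time.perf_counter() - start
        print(f"{len(results)} results in {elapsed:.2f}s")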

Other

ab's search currently queries torrents directly through the RSS endpoint (https://mikanani.me/RSS/Search?searchstr=xxx), so the subtitle group and show title have to be parsed back out of each torrent name before grouping. But mikan actually has its own per-show bangumi page (https://mikanani.me/Home/Bangumi/xxx), which already lists releases grouped by subtitle group.

It seems entirely feasible to parse that page's HTML directly, so each show would only need a single metadata fetch (see the sketch below). The trade-off is that if mikan changes its HTML frequently, ab has to keep up with it. That said, the project next door seems to do it this way too…
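
A minimal sketch of that approach, assuming requests and BeautifulSoup; the CSS selectors below are guesses at the page structure, not mikan's verified markup, and would need updating whenever the page changes:

    import requests
    from bs4 import BeautifulSoup

    def fetch_bangumi_groups(bangumi_id: int) -> dict[str, list[str]]:
        # Parse one bangumi page and return torrent titles keyed by
        # subtitle group. The selectors are hypothetical placeholders.
        url = f"https://mikanani.me/Home/Bangumi/{bangumi_id}"
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        groups: dict[str, list[str]] = {}
        for header in soup.select(".subgroup-text"):  # one header per group (assumed)
            group_name = header.get_text(strip=True)
            table = header.find_next("table")  # the group's release table (assumed)
            if table is None:
                continue
            titles = [a.get_text(strip=True) for a in table.select("a")]
            groups[group_name] = titles
        return groups

With something like this, the metadata lookup could run once per show instead of once per torrent.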

@shininome (Contributor)

Speaking of which, which project is "next door"?
