2021-04-28

How to improve performance scraping 20,000 URLs in C# [closed]

I have a working process that scrapes 20,000 URLs, but it currently takes around 20 minutes to complete. I am looking for a way to reduce this processing time.

I have experimented with Thread.Start(), but have settled on ThreadPool.QueueUserWorkItem.

StartScraping() loops over a list of SKUs; each SKU resolves to the URL of a product page whose stock status we want to check. We only care whether the stock is still available.

GetUrlHtmlText() is the method that performs the actual HTML scrape using HttpWebRequest. We determine whether the product is available by checking for the presence of a button with the text "Tükendi", which indicates that the stock has been depleted (sold out).
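The same availability check can be expressed more compactly with HttpClient, which is designed to be shared across requests. This is only a sketch, not the original code; the class and method names are illustrative:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public class StockChecker
{
    // A single shared HttpClient avoids socket exhaustion under heavy load.
    private static readonly HttpClient client = new HttpClient();

    public static async Task<bool> IsInStockAsync(string url)
    {
        string html = await client.GetStringAsync(url);
        // "Tükendi" is Turkish for "sold out"; its presence in the
        // buy button means the product is no longer available.
        return !html.Contains(">Tükendi</button>");
    }
}
```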

Main Loop

public void StartScraping()
    {
        try
        {
            int counter = 0;
            Random rnd = new Random();
            Debug.WriteLine("Starting => " + BrandName);
            foreach (var item in competitionItemList.competitionItems)
            {
                //Thread thread = new Thread(() => StartPerSku(BrandID, item, validProxyList[rnd.Next(0, validProxyList.Count - 1)], StockCheckStatus));
                //thread.Start();
                
                ThreadPool.QueueUserWorkItem(state => StartPerSku(BrandID, item, validProxyList[rnd.Next(0, validProxyList.Count - 1)], StockCheckStatus));
                request++;
                Debug.WriteLine(request);
                counter++;
                if (counter == 10)
                {
                    validProxy = false;
                    proxyIndex++;
                    counter = 0;
                    Thread.Sleep(100);
                    if (validProxyList.Count - 1 <= proxyIndex)
                    {
                        proxyIndex = 0;
                    }
                }
                // tasks.Add(thread); 
            }
        }
        catch (System.Exception ex)
        {
            db.InsertRekabetWinServError(90002, "StartScraping() ==> " + ex.ToString());
            Debug.WriteLine("LetStartAllTask => " + ex.ToString());
        }
    }
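As an aside on the rotation logic above: the Thread.Sleep(100) on every tenth item alone adds roughly 20,000 / 10 × 100 ms ≈ 200 seconds to the run. A thread-safe round-robin selector (a sketch, names illustrative) avoids both the sleep and the shared mutable proxyIndex:

```csharp
using System;
using System.Threading;

public class ProxyRotator
{
    private readonly string[] proxies;
    private int counter = -1;

    public ProxyRotator(string[] proxies) => this.proxies = proxies;

    // Thread-safe round-robin: each call returns the next proxy in order,
    // even when called concurrently from many worker threads.
    public string Next()
    {
        int i = Interlocked.Increment(ref counter);
        return proxies[i % proxies.Length];
    }
}
```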

Scrape logic

private async Task<bool> GetUrlHtmlText()
    {
        if (errorStatus)
        {
            htmlInnerText = "";
            try
            {
                ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12 | SecurityProtocolType.Ssl3;
                ServicePointManager.ServerCertificateValidationCallback += (s, cert, ch, sec) => { return true; };
                ServicePointManager.DefaultConnectionLimit = 10000;

                HttpWebRequest httpRequest = WebRequest.CreateHttp(uri);
                //byte[] bytes = System.Text.Encoding.ASCII.GetBytes(requestXml);

                httpRequest.CookieContainer = new CookieContainer();
                httpRequest.Timeout = 30000;
                httpRequest.AllowAutoRedirect = true;
                httpRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
                httpRequest.ServicePoint.Expect100Continue = false;
                httpRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36";
                httpRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
                httpRequest.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate;q=0.8");
                httpRequest.Headers.Add(HttpRequestHeader.CacheControl, "no-cache");
                if (proxy != "0")
                {
                    httpRequest.Proxy = new WebProxy(proxy);
                }
                try
                {
                    using (HttpWebResponse httpResponse = (HttpWebResponse)httpRequest.GetResponse())
                    {
                        if (httpResponse != null)
                        {
                            using (var reader = new System.IO.StreamReader(httpResponse.GetResponseStream(), Encoding.UTF8))
                            {
                                // Read the body first; htmlInnerText was reset to ""
                                // above, so checking it before ReadToEnd() would
                                // always fail and skip the sold-out check.
                                htmlInnerText = reader.ReadToEnd();
                                if (htmlInnerText.Contains("barcode"))
                                {
                                    if (htmlInnerText.Contains(">Tükendi</button>"))
                                    {
                                        tApiResult = "Tükendi";
                                        return await Task.FromResult(false);
                                    }
                                    // The using blocks dispose the response and
                                    // reader; no explicit Close() calls are needed.
                                    return await Task.FromResult(true);
                                }
                                return await Task.FromResult(false);
                            }
                        }
                        else
                        {
                            tApiResult = "404 Page"; return await Task.FromResult(false);
                        }
                    }
                }
                catch (WebException ex)
                {
                    tApiResult = "404 Page";
                    Debug.WriteLine("GetUrlHtmlText => (" + barcode + "| Link : " + uri + " ) " + ex.ToString());
                    if (/*!ex.ToString().Contains("402") || !ex.ToString().Contains("502") ||*/ !ex.ToString().Contains("404"))
                    {
                        tApiResult = proxy;
                    }
                    return await Task.FromResult(false);
                }
            }
            catch (Exception ex)
            {
                tApiResult = "404 Page";
                Debug.WriteLine(ex.ToString());
                db.InsertRekabetWinServError(OBJ_CompetitionItem.ID, "GetUrlHtmlText ==> " + ex.ToString());
                //MultiFuncComp.Manager.RekabetBrandManager.mobileNotification.SendNotification("<=A little Exception=>" + barcode, "ex str :>>  " + ex.ToString());
                tApiResult = proxy;
                return await Task.FromResult(false);
            }
        }
        else
        {
            return await Task.FromResult(false);
        }
    }

I have almost 400 Mbps of downstream and upstream bandwidth.

Computer features

EDIT:

This question was originally titled "How can I scrape 20.000 url same time in c#" and was closed for the reason: "Opinion-based - discussions focused on diverse opinions are great, but they just don't fit our format well."

I'm changing my question because I already know how to scrape the URLs; I am looking for advice specifically on how to improve the speed of this process.
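A common pattern for this kind of I/O-bound workload (a sketch, not taken from the question; names are illustrative) is to issue the requests asynchronously and cap concurrency with a SemaphoreSlim, so throughput is limited by the network rather than by thread scheduling or Thread.Sleep calls:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class Scraper
{
    private static readonly HttpClient client = new HttpClient();

    // Cap the number of in-flight requests; tune the limit experimentally.
    private static readonly SemaphoreSlim throttle = new SemaphoreSlim(100);

    public static async Task ScrapeAllAsync(IEnumerable<string> urls)
    {
        var tasks = urls.Select(async url =>
        {
            await throttle.WaitAsync();
            try
            {
                string html = await client.GetStringAsync(url);
                // ... check html for the sold-out marker here ...
            }
            catch (HttpRequestException)
            {
                // Log and continue; one failing URL should not stop the batch.
            }
            finally
            {
                throttle.Release();
            }
        });
        await Task.WhenAll(tasks);
    }
}
```

With 20,000 URLs and 100 concurrent requests, each request only needs to average ~300 ms for the whole batch to finish in about a minute.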



from Recent Questions - Stack Overflow https://ift.tt/3xxXCFU
https://ift.tt/3dXfrGD
