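I. Background
1. The story
The dump from this production incident came from a friend in the training camp. He could not work out the cause himself and asked me to take a look; I have somewhat more "repair shop" experience, so I act as a kind of backstop for the group. Without further ado, let WinDbg do the talking.
II. WinDbg analysis
1. Why did it crash
WinDbg switches to the crashing thread by default, and with a crash dump the important thing is the context just before the crash. So we use .ecxr to restore the exception context and k to print the stack. The output is as follows: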
0:000> .ecxr
eax=00000000 ebx=4d6d8360 ecx=00000003 edx=00000000 esi=4d6f0ca0 edi=4d6f0c74
eip=71a567c7 esp=026fd834 ebp=026fd83c iopl=0 nv up di pl nz na po nc
cs=0000 ss=0000 ds=0000 es=0000 fs=0000 gs=0000 efl=00000000
System_Windows_Forms_ni!System.Windows.Forms.ImageList.ImageCollection.SetKeyName+0x1b:
71a567c7 cc int 3
0:000> k
*** Stack trace for last set context - .thread/.cxr resets it
 # ChildEBP RetAddr
00 026fd83c 0c2c4e7e System_Windows_Forms_ni!System.Windows.Forms.ImageList.ImageCollection.SetKeyName+0x1b
01 026fe474 0c2c063b xxx!xxx.MainForm.InitializeComponent+0x198e
02 026fe488 095cb9de xxx!xxx.MainForm..ctor+0x5fb
03 026fe4e4 0da5bc7a xxx!xxx.LoginForm.button_OK_Click+0x52e
04 026fe4f8 71a38bdf xxx!xxx.LoginForm.LoginForm_Load+0x9a
05 026fe528 710b325a System_Windows_Forms_ni!System.Windows.Forms.Form.OnLoad+0x2f
...
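The output shows the crash inside System.Windows.Forms.ImageList.ImageCollection.SetKeyName. That method clearly belongs to Microsoft's framlibrary code rather than to the application, but either way this is a managed exception, so !t (short for !threads) will show which exception the thread is holding, and !pe will print its details.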
0:000> !t
ThreadCount: 26
UnstartedThread: 0
BackgroundThread: 12
PendingThread: 0
DeadThread: 13
Hosted Runtime: no
                                                                     Lock
 ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
 0 1 534 0299e700 a6028 Preemptive 4D6F0EFC:00000000 02997bd0 0 STA System.IndexOutOfRangeException 4d6f0ca0
 2 2 5b4 029af278 2b228 Preemptive 00000000:00000000 02997bd0 0 MTA (Finalizer)
 3 6 ff0 02a60eb0 102a228 Preemptive 00000000:00000000 02997bd0 0 MTA (Threadpool Worker)
...
0:000> !pe
Exception object: 4d6f0ca0
Exception type: System.IndexOutOfRangeException
Message: Index was outside the bounds of the array.
InnerException: <none>
StackTrace (generated):
    SP IP Function
    026FD834 71A567C7 System_Windows_Forms_ni!System.Windows.Forms.ImageList+ImageCollection.SetKeyName(Int32, System.String)+0x33e157
StackTraceString: <none>
HResult: 80131508
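This is very strange: why would framework code throw an IndexOutOfRangeException? Could it be a bug in the framework itself? As a rule this code is battle-tested and rock solid, so such an elementary bug is very unlikely.
2. Is it really a framework bug?
To find out, take the methods on the thread stack and read their source code. The trimmed code is as follows: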
// xxx.MainForm (application code, trimmed)
private void InitializeComponent()
{
    this.imageList_btnbg.ImageStream = (System.Windows.Forms.ImageListStreamer)resources.GetObject("imageList_btnbg.ImageStream");
    this.imageList_btnbg.TransparentColor = System.Drawing.Color.Transparent;
    this.imageList_btnbg.Images.SetKeyName(0, "normal-main.bmp");
    this.imageList_btnbg.Images.SetKeyName(1, "focus-main.bmp");
    this.imageList_btnbg.Images.SetKeyName(2, "select-main.bmp");
    this.imageList_btnbg.Images.SetKeyName(3, "gray-main.bmp");
    this.imageList_btnbg.Images.SetKeyName(4, "down_1.png");
    this.imageList_btnbg.Images.SetKeyName(5, "down_2.png");
    this.imageList_btnbg.Images.SetKeyName(6, "down_3.png");
    this.imageList_btnbg.Images.SetKeyName(7, "up_1.png");
    this.imageList_btnbg.Images.SetKeyName(8, "up_2.png");
    this.imageList_btnbg.Images.SetKeyName(9, "up_3.png");
}

// System.Windows.Forms.ImageList.ImageCollection
public void SetKeyName(int index, string name)
{
    // The exception in the dump is thrown here, by hand, not by an array access.
    if (!IsValidIndex(index))
    {
        throw new IndexOutOfRangeException();
    }
    if (imageInfoCollection[index] == null)
    {
        imageInfoCollection[index] = new ImageInfo();
    }
    ((ImageInfo)imageInfoCollection[index]).Name = name;
}

private bool IsValidIndex(int index)
{
    if (index >= 0)
    {
        return index < Count;
    }
    return false;
}

public int Count
{
    get
    {
        // Once the native image list exists, the count comes from comctl32.
        if (owner.HandleCreated)
        {
            return SafeNativeMethods.ImageList_GetImageCount(new HandleRef(owner, owner.Handle));
        }
        // Otherwise it is summed from the not-yet-realized originals.
        int num = 0;
        foreach (Original original in owner.originals)
        {
            if (original != null)
            {
                num += original.nImages;
            }
        }
        return num;
    }
}

// System.Windows.Forms.ImageList
[Browsable(false)]
[EditorBrowsable(EditorBrowsableState.Advanced)]
[DesignerSerializationVisibility(DesignerSerializationVisibility.Hidden)]
[SRDescription("ImageListHandleCreatedDescr")]
public bool HandleCreated => nativeImageList != null;
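Reading the code carefully, the IndexOutOfRangeException is thrown by hand because IsValidIndex() returned false. For a non-negative index, IsValidIndex() returns false only when the condition index < Count does not hold, that is, when index >= Count. Count in turn comes either from ImageList_GetImageCount (when the native handle has been created) or from summing owner.originals.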
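The check can be reproduced outside WinForms with a small stand-in. This is only a sketch: FakeImageCollection, its injected nativeCount delegate, and Demo are hypothetical names invented for illustration, and the delegate merely stands in for whatever ImageList_GetImageCount reports. It shows that when the reported count is 0, even index 0 is rejected.

```csharp
using System;
using System.Collections;

// Hypothetical stand-in for ImageList.ImageCollection. Only the index check
// is modeled; the native image count is injected as a delegate.
public sealed class FakeImageCollection
{
    private readonly Func<int> nativeCount;   // stands in for ImageList_GetImageCount
    private readonly ArrayList names = new ArrayList();

    public FakeImageCollection(Func<int> nativeCount)
    {
        this.nativeCount = nativeCount;
    }

    public int Count => nativeCount();

    // Same shape as the framework check: valid only if 0 <= index < Count.
    private bool IsValidIndex(int index) => index >= 0 && index < Count;

    public void SetKeyName(int index, string name)
    {
        if (!IsValidIndex(index))
        {
            throw new IndexOutOfRangeException();
        }
        // Grow the backing list so the slot exists, then store the name.
        while (names.Count <= index)
        {
            names.Add(null);
        }
        names[index] = name;
    }
}

public static class Demo
{
    // Returns true when SetKeyName(0, ...) throws for the given reported count.
    public static bool FirstCallThrows(int reportedCount)
    {
        var images = new FakeImageCollection(() => reportedCount);
        try
        {
            images.SetKeyName(0, "normal-main.bmp");
            return false;
        }
        catch (IndexOutOfRangeException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(FirstCallThrows(0));   // prints True: count 0 rejects index 0
        Console.WriteLine(FirstCallThrows(10));  // prints False: index 0 is in range
    }
}
```

So an exception on a valid-looking index does not require a framework bug; it only requires the count source to report fewer images than the designer code expects.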
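With the code logic understood, the next step is to check the actual state in the dump at the assembly level. The starting point is the index value, which means disassembling InitializeComponent() around the return address taken from the stack.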
0:000> !clrstack
OS Thread Id: 0x534 (0)
Child SP IP Call Site
026fd784 771316bc [HelperMethodFrame: 026fd784]
026fd834 71a567c7 System.Windows.Forms.ImageList+ImageCollection.SetKeyName(Int32, System.String)
026fd848 0c2c4e7e xxx.MainForm.InitializeComponent()
026fe47c 0c2c063b xxx.MainForm..ctor()
026fe490 095cb9de xxx.LoginForm.button_OK_Click(System.Object, System.EventArgs)
...
0:000> !U /d 0c2c4e7e
Normal JIT generated code
xxx.MainForm.InitializeComponent()
Begin 0c2c34f0, size 5ded
...
0c2c4e62 e8d9b5d864 call System_Windows_Forms_ni+0x160440 (71050440) (System.Windows.Forms.ImageList.get_Images(), mdToken: 06002599)
0c2c4e67 898514f5ffff mov dword ptr [ebp-0AECh],eax
0c2c4e6d ff35f0a07f05 push dword ptr ds:[57FA0F0h] ("normal-main.bmp")
0c2c4e73 8bc8 mov ecx,eax
0c2c4e75 33d2 xor edx,edx
0c2c4e77 3909 cmp dword ptr [ecx],ecx
0c2c4e79 e8f2374565 call System_Windows_Forms_ni!System.Windows.Forms.ImageList.ImageCollection.SetKeyName (71718670)
>>> 0c2c4e7e 8b8e74020000 mov ecx,dword ptr [esi+274h]
...
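Well, damn. The call just before the return address pushes "normal-main.bmp" and zeroes edx (the index argument), so the exception was thrown by the very first call, SetKeyName(0, "normal-main.bmp"). Index 0 can only fail the check if Count = 0. But why would Count be 0? The next step is to find where Count gets its data, the ImageCollection object, which can be located on the thread stack with !dso.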
0:000> !dso
OS Thread Id: 0x534 (0)
ESP/REG Object Name
026FD774 4d6f0c74 System.Windows.Forms.ImageList+ImageCollection
026FD778 4d6f0ca0 System.IndexOutOfRangeException
...
0:000> !do 4d6f0c74
Name: System.Windows.Forms.ImageList+ImageCollection
MethodTable: 71120ff0
EEClass: 70f230ec
Size: 20(0x14) bytes
File: C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System.Windows.Forms\v4.0_4.0.0.0__b77a5c561934e089\System.Windows.Forms.dll
Fields:
      MT Field Offset Type VT Attr Value Name
7111ecc0 4003916 4 ...s.Forms.ImageList 0 instance 4d6d97b0 owner
72d909dc 4003917 8 ...ections.ArrayList 0 instance 4d6f0c88 imageInfoCollection
72d8df5c 4003918 c System.Int32 1 instance -1 lastAccessedIndex
0:000> !DumpObj /d 4d6d97b0
Name: System.Windows.Forms.ImageList
...
Fields:
      MT Field Offset Type VT Attr Value Name
71121b0c 4001013 10 ...t+NativeImageList 0 instance 4d6f0c40 nativeImageList
728e15a0 4001019 1c ...Collections.IList 0 instance 00000000 originals
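In the output, nativeImageList is non-null (4d6f0c40) while originals is null (00000000). Matched against the source, HandleCreated is therefore true and Count took the native branch, so the culprit should be SafeNativeMethods.ImageList_GetImageCount returning 0. First, look at its signature.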
[DllImport("comctl32.dll")]
public static extern int ImageList_GetImageCount(HandleRef himl);
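From the signature this is an external native method exported by comctl32.dll, which is awkward: I am not about to trace its internals in IDA. At this point it looked like I was about to hit a wall, and I was getting a little nervous.
3. Is this a dead end?
After a brief daze, it suddenly clicked: these are 32-bit addresses, so could the 2 GB user address space simply be running out? ImageList_GetImageCount belongs to an image-related UI control that relies on native resources underneath, so could the failure really be caused by a lack of address space? With that idea in mind, I ran !address -summary to look at the committed memory.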
0:000> !address -summary
...
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_COMMIT 1933 6e768000 ( 1.726 GB) 93.52% 86.30%
MEM_FREE 631 9e01000 ( 158.004 MB) 7.72%
MEM_RESERVE 607 7a87000 ( 122.527 MB) 6.48% 5.98%
...
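Well, damn. The output shows MEM_COMMIT = 1.726 GB with %ofBusy = 93.52%, and only about 158 MB of the address space is still free. That is far beyond the roughly 1.2 GB level that I treat as the danger threshold for a 32-bit process, so the truth is finally out.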
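The figures in the State Summary can be re-derived from the hexadecimal sizes. The sketch below is illustrative only: the AddressSummary class and its helper names are made up, and it assumes that "busy" means committed plus reserved and "total" means busy plus free, which is consistent with the percentages WinDbg printed.

```csharp
using System;

public static class AddressSummary
{
    // Region sizes in bytes, copied from the !address -summary output.
    public const long Commit = 0x6e768000;
    public const long Free = 0x9e01000;
    public const long Reserve = 0x7a87000;

    public static double ToGiB(long bytes) => bytes / (1024.0 * 1024.0 * 1024.0);

    public static double ToMiB(long bytes) => bytes / (1024.0 * 1024.0);

    // Assumption: busy = committed + reserved.
    public static double PercentOfBusy(long bytes) => 100.0 * bytes / (Commit + Reserve);

    // Assumption: total = committed + reserved + free.
    public static double PercentOfTotal(long bytes) => 100.0 * bytes / (Commit + Reserve + Free);

    public static void Main()
    {
        Console.WriteLine("MEM_COMMIT GiB      : " + ToGiB(Commit));
        Console.WriteLine("MEM_COMMIT % busy   : " + PercentOfBusy(Commit));
        Console.WriteLine("MEM_COMMIT % total  : " + PercentOfTotal(Commit));
        Console.WriteLine("MEM_FREE MiB        : " + ToMiB(Free));
        Console.WriteLine("Address space GiB   : " + ToGiB(Commit + Reserve + Free));
    }
}
```

The three regions add up to just under 2 GB, which matches the default user-mode address space of a 32-bit process that is not large-address-aware.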
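The fix is fairly simple: mark the executable as large-address-aware so that the 32-bit process can use up to 4 GB of address space on 64-bit Windows. My friend later reported that the problem has not come back.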
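The flag itself lives in the PE header, and one common way to set it is the MSVC tool editbin with its /LARGEADDRESSAWARE option. Whether a given executable already carries the flag can be checked by reading the Characteristics field of the COFF file header and testing bit 0x0020 (IMAGE_FILE_LARGE_ADDRESS_AWARE). The following is a minimal sketch, not the author's procedure; the class and method names are invented for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class LargeAddressAwareCheck
{
    private const ushort ImageFileLargeAddressAware = 0x0020;

    // Reads the COFF Characteristics field of a PE image and tests the flag.
    public static bool IsLargeAddressAware(Stream pe)
    {
        using (var reader = new BinaryReader(pe, Encoding.ASCII, true))
        {
            pe.Seek(0, SeekOrigin.Begin);
            if (reader.ReadUInt16() != 0x5A4D)           // "MZ" DOS signature
            {
                throw new InvalidDataException("Not an MZ executable.");
            }

            pe.Seek(0x3C, SeekOrigin.Begin);
            int peHeaderOffset = reader.ReadInt32();     // e_lfanew

            pe.Seek(peHeaderOffset, SeekOrigin.Begin);
            if (reader.ReadUInt32() != 0x00004550)       // "PE\0\0" signature
            {
                throw new InvalidDataException("Not a PE image.");
            }

            // Characteristics is the last 2 bytes of the 20-byte COFF header
            // that follows the 4-byte signature: offset 4 + 18.
            pe.Seek(peHeaderOffset + 4 + 18, SeekOrigin.Begin);
            ushort characteristics = reader.ReadUInt16();
            return (characteristics & ImageFileLargeAddressAware) != 0;
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: <program> <path-to-exe>");
            return;
        }
        using (var file = File.OpenRead(args[0]))
        {
            Console.WriteLine(IsLargeAddressAware(file));
        }
    }
}
```

Running it against the application's .exe before and after the change is a quick way to confirm that the flag actually made it into the shipped binary.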
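III. Summary
Finishing this dump left me rather reflective. Life is much like this dump: we keep weaving back and forth between truth and illusion, and some people find their way out while others stay inside forever.
Reposted from https://www.cnblogs.com/huangxincheng/p/18643600
This article was edited on 2025/1/2 9:10:38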